Country Prediction in Detail: Bayes, Confidence, and Calibration

February 6, 2026

This is the companion to the high-level overview. That post explains the intuition. This one shows the math.

Naive Bayes in log-space

Given a forename and surname, we compute the posterior probability for each of the 200+ countries in the model:

P(country | forename, surname) ∝ P(forename | country) × P(surname | country) × P(country)

We treat forename and surname as conditionally independent given the country. Not literally true (there are naming conventions that link them), but it works well enough and keeps the model compact. The alternative is modeling joint (forename, surname) pairs, which would need orders of magnitude more training data to avoid sparsity.

All the multiplications happen in log-space. With 200+ countries and some very small probabilities, multiplying directly risks underflow and loss of precision. Adding logs avoids that. We normalize at the end using the max-log trick (the same shift used in log-sum-exp): subtract the maximum log-probability from every score before exponentiating, so the largest term becomes exp(0) = 1 and the rest stay in a representable range.
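A minimal sketch of that combination step, assuming each term is already available as a dict of log-probabilities keyed by country code and that all three dicts share the same keys. The function name and data layout are illustrative, not the production code.

```python
import math

def posterior(log_lik_forename, log_lik_surname, log_prior):
    """Combine per-country log terms and normalize them into probabilities."""
    # Adding logs is equivalent to multiplying the underlying probabilities.
    scores = {
        c: log_lik_forename[c] + log_lik_surname[c] + log_prior[c]
        for c in log_prior
    }
    # Shift by the maximum so the largest term becomes exp(0) = 1.
    m = max(scores.values())
    exp_scores = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exp_scores.values())
    return {c: v / z for c, v in exp_scores.items()}
```

Without the shift, a score like -1500 would exponentiate to exactly 0.0 in double precision and the normalization would divide by zero. With it, only the differences between scores matter.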

Population priors

The P(country) term comes from 2023 World Bank/UN population estimates. China gets a prior of roughly 17.8%, India 17.7%, the USA about 4.2%. Luxembourg gets 0.009%.

This matters more than people expect. A name equally common in China and Luxembourg would get equal weight without priors, but the base rates differ by a factor of roughly 2,000. The prior is a statement about the world, not about the name: before looking at any name evidence, a random person drawn from the world population is far more likely to be from a large country. The name evidence then updates that prior up or down.

For countries missing from the population data, we fall back to a uniform prior (1 / num_countries). This is conservative: it doesn't favor or penalize the unknown country.
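Here is roughly what that looks like, as a sketch. I'm assuming the population data arrives as a dict of headcounts keyed by country code; `build_priors` and its arguments are hypothetical names.

```python
def build_priors(countries, population):
    """Return P(country) for every modeled country.

    `population` maps country -> headcount. Countries missing from it
    get the uniform fallback 1 / num_countries.
    """
    total = sum(population.values())
    uniform = 1.0 / len(countries)
    priors = {}
    for c in countries:
        if c in population:
            priors[c] = population[c] / total
        else:
            priors[c] = uniform
    return priors
```

With fallbacks in play, the priors in this sketch no longer sum to exactly 1. That is harmless here because the posterior is normalized at the end anyway.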

N-gram fallback for unknown names

When a name isn't in the training data, we fall back to character n-gram similarity. The idea is simple: similar-looking names probably have similar country distributions.

We decompose both the unknown name and every known name into bigrams and trigrams. For "Giovanni": {gi, io, ov, va, an, nn, ni, gio, iov, ova, van, ann, nni}. Then we compute the Jaccard similarity between the two sets:

Jaccard(A, B) = |A ∩ B| / |A ∪ B|

We keep the 5 most similar known names, provided they meet a minimum similarity threshold of 0.2. Their country distributions are blended, weighted by similarity score. Confidence is set to the highest similarity found, which means n-gram predictions always have lower confidence than exact matches. That's intentional.
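The whole fallback fits in a few lines. This is a sketch under a couple of assumptions: names are lowercased before decomposition, and `known` is a dict mapping each known name to its country distribution. The function names are illustrative.

```python
def char_ngrams(name, sizes=(2, 3)):
    """Set of character bigrams and trigrams of a lowercased name."""
    s = name.lower()
    return {s[i:i + n] for n in sizes for i in range(len(s) - n + 1)}

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def ngram_fallback(unknown, known, top_k=5, min_sim=0.2):
    """Blend the distributions of the most similar known names.

    Returns (blended distribution, confidence). Confidence is the highest
    similarity found, or 0.0 if nothing clears the threshold.
    """
    target = char_ngrams(unknown)
    scored = sorted(
        ((jaccard(target, char_ngrams(name)), name) for name in known),
        reverse=True,
    )
    best = [(sim, name) for sim, name in scored if sim >= min_sim][:top_k]
    if not best:
        return {}, 0.0
    total = sum(sim for sim, _ in best)
    blended = {}
    for sim, name in best:
        # Each neighbor contributes in proportion to its similarity score.
        for country, p in known[name].items():
            blended[country] = blended.get(country, 0.0) + (sim / total) * p
    return blended, best[0][0]
```

Scanning every known name is linear in the vocabulary size. A real implementation would probably use an inverted index from n-gram to names, but the scoring logic stays the same.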

"Giovnni" (a typo) shares almost all of Giovanni's n-grams and gets a Jaccard score above 0.8. The result is nearly identical to the exact match. "Mikhail" and "Mikail" share enough n-grams that the Russian/Turkish signal bleeds through correctly. But "Jianwei" vs. "Chien-wei" (same Chinese name, different romanization) won't match because they share almost no character sequences. N-grams are a patch for close variants, not a general transliteration solution.

Agreement-aware confidence

I spent more time on confidence than on the Bayes formula itself. Getting the prediction is the easy part. Knowing when to trust it is harder.

The raw confidence is the product of three factors. First, coverage: did we find the forename in the data, the surname, or both? Two exact matches is best. One exact match and one n-gram fallback is worse. Two n-gram fallbacks is worst. Coverage captures how much evidence we actually have.

Second, distribution peak: the maximum probability in the combined country distribution. "Giovanni" puts about 70% of its mass on Italy. Sharp. "Maria" puts 15% on Brazil, 14% on Mexico, 13% on the USA, and tails off slowly. A flat distribution means the name doesn't discriminate well between countries, and confidence should reflect that.

Third, agreement between forename and surname. This is the one I find most interesting. We measure it as the square root of the overlap between the two distributions:

overlap = Σ_c min(P_forename[c], P_surname[c]), summed over every country c

agreement = √(overlap)

If both names point at the same countries, the overlap is close to 1.0 and agreement is high. If they point at different countries, the overlap is tiny. The square root softens the penalty so that partial overlap (say, both distributions include the US but at different ranks) still gets some credit.

When only one name is provided, agreement defaults to 1.0. No penalty, but no boost either. The coverage factor already accounts for the missing information.
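As a sketch, with each distribution represented as a dict from country code to probability and `None` standing in for a name that wasn't provided (that representation is my assumption for the example):

```python
import math

def agreement(p_forename, p_surname):
    """Square root of the overlap between two country distributions.

    Returns 1.0 when either distribution is None (only one name given).
    """
    if p_forename is None or p_surname is None:
        return 1.0
    countries = set(p_forename) | set(p_surname)
    overlap = sum(
        min(p_forename.get(c, 0.0), p_surname.get(c, 0.0)) for c in countries
    )
    return math.sqrt(overlap)
```

To see the softening at work: if the two distributions share only 4% of their mass, the overlap is 0.04 but the agreement is 0.2. Without the square root, that small shared mass would cut confidence by a factor of 25 instead of 5.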

raw_confidence = coverage × peak × agreement

final_confidence = calibrate(raw_confidence)
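Putting the three factors together might look like the sketch below. The coverage weights are placeholders I made up for the example: this post only fixes their ordering (two exact matches, then one exact plus one n-gram, then two n-grams), not their values. The single-name case is left out for brevity.

```python
# Placeholder coverage weights. Only the ordering comes from the post;
# the numbers themselves are hypothetical.
COVERAGE = {
    ("exact", "exact"): 1.0,
    ("exact", "ngram"): 0.7,
    ("ngram", "ngram"): 0.4,
}

def raw_confidence(match_types, combined, agreement):
    """coverage × peak × agreement, before calibration.

    match_types: pair of "exact"/"ngram" for (forename, surname).
    combined: the combined country distribution (country -> probability).
    agreement: the forename/surname agreement factor in [0, 1].
    """
    # Sort so that ("ngram", "exact") and ("exact", "ngram") share a key.
    coverage = COVERAGE[tuple(sorted(match_types))]
    peak = max(combined.values())
    return coverage * peak * agreement
```

Because the factors multiply, any one weak factor pulls the whole score down. That is the behavior I want: strong evidence on two axes shouldn't paper over a problem on the third.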

Isotonic calibration

Raw confidence is useful for ranking (higher is better) but the actual number doesn't mean much yet. A raw 0.8 might correspond to 73% real accuracy or 88%. To fix this, we fit isotonic regression on a held-out validation set using the Pool Adjacent Violators algorithm.

The procedure: sort all validation examples by raw confidence, then merge adjacent groups whenever a higher-confidence group has lower observed accuracy than the group before it. Repeat until observed accuracy is monotonically non-decreasing across groups. Each group's calibrated confidence is its observed accuracy. The result is stored as a small list of knot points. At inference time, we interpolate linearly between them, which gives a piecewise linear mapping.

After calibration, a reported 0.8 really does mean roughly 80% accuracy in the validation data. We flag anything below 0.6 as low_confidence in the API response.
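For the curious, here is a dependency-free sketch of Pool Adjacent Violators plus the interpolation step. Placing each knot at the mean raw score of its pooled block is one reasonable convention, not necessarily what the production code or any particular library does, and the function names are illustrative.

```python
from bisect import bisect_right

def fit_isotonic(raw_scores, correct):
    """Pool Adjacent Violators. Returns knots [(raw, calibrated), ...]."""
    # Pool examples with identical raw scores so knot x-values are distinct.
    grouped = {}
    for x, y in zip(raw_scores, correct):
        sy, n = grouped.get(x, (0.0, 0))
        grouped[x] = (sy + y, n + 1)
    # Each block is [sum of raw scores, number correct, example count].
    blocks = []
    for x in sorted(grouped):
        sy, n = grouped[x]
        blocks.append([x * n, sy, n])
        # Merge backwards while the previous block is more accurate.
        while (len(blocks) > 1
               and blocks[-2][1] / blocks[-2][2] > blocks[-1][1] / blocks[-1][2]):
            sx, sy2, n2 = blocks.pop()
            blocks[-1][0] += sx
            blocks[-1][1] += sy2
            blocks[-1][2] += n2
    return [(sx / n, sy / n) for sx, sy, n in blocks]

def calibrate(raw, knots):
    """Linear interpolation between knots, clamped at both ends."""
    xs = [x for x, _ in knots]
    if raw <= xs[0]:
        return knots[0][1]
    if raw >= xs[-1]:
        return knots[-1][1]
    i = bisect_right(xs, raw)
    (x0, y0), (x1, y1) = knots[i - 1], knots[i]
    return y0 + (y1 - y0) * (raw - x0) / (x1 - x0)
```

Here `correct` holds 1 for validation examples where the top prediction was right and 0 otherwise. The fit runs once offline; only the knot list needs to ship with the model.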

Two examples, side by side

"Giovanni Rossi." Both names found in the data (full coverage). Both distributions peak at Italy. The overlap between them is close to 1.0, so agreement is high. All three factors are strong. After calibration, final confidence lands around 0.9.

"Ashley Chen." Both names found (full coverage again), but "Ashley" distributes across the USA, UK, and Australia while "Chen" distributes across China, Taiwan, and Singapore. The posterior still picks a winner, probably China because the population prior is large. But the agreement overlap is tiny. Confidence drops to somewhere around 0.4. The score says "I picked one, but I'm not confident."

That drop is the system working as designed. When the forename and surname tell different stories, there's genuine ambiguity, and the confidence should reflect it.

Known weaknesses

Naive Bayes with population priors handles the average case well. It struggles in a few predictable situations. Diaspora names are the big one: someone of Indian origin whose family has lived in the UK for three generations may well have a British-sounding forename and an Indian surname. The model will pick up on both signals, but the confidence will be middling and the top country could go either way.

Pan-regional names like "Mohamed" appear in dozens of countries. The distribution is inherently flat, and the prior ends up doing most of the work. You'll get a result, but it won't be very informative.

The Nordics are a particular headache. Swedish and Norwegian naming traditions overlap heavily, and neither country is large enough for the population prior to break the tie. I don't have a good answer for this yet. If you're working with Scandinavian data, I'd suggest treating the Nordic countries as a cluster rather than trusting fine-grained distinctions between them.