How Nomacore Predicts Country of Origin

January 14, 2026

You give us a name. We give you a list of countries with probabilities. "Giovanni Rossi" comes back as 89% Italy, 4% Switzerland, and so on. This post explains the idea at a high level. If you want the actual math, I wrote a separate technical post that gets into formulas, n-gram similarity, and calibration.

The basic idea

Different names are popular in different countries. "Giovanni" is overwhelmingly Italian. "Yuki" is overwhelmingly Japanese. "Maria" shows up everywhere. If you know which names are common where, you can work backwards: given a name, which country is it most likely from?

That's what we do. We built lookup tables from public name registries, census data, and population records. For each name, we know the distribution across countries. The API combines the forename signal and the surname signal, weighted by how many people live in each country, and returns a probability distribution.
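Here's a minimal sketch of that combination step, assuming a naive Bayes setup where the forename and surname are treated as independent given the country. Every number, table name, and function name below is made up for illustration; this is not our production code or data.

```python
# Hypothetical lookup tables: share of people in each country carrying the name.
# These values are invented for the example, not taken from real registries.
FORENAME_LIKELIHOOD = {
    "giovanni": {"IT": 0.012, "CH": 0.002, "US": 0.0002},
}
SURNAME_LIKELIHOOD = {
    "rossi": {"IT": 0.004, "CH": 0.001, "US": 0.0001},
}
# Rough populations in millions, used as the prior (also illustrative).
POPULATION_MILLIONS = {"IT": 59.0, "CH": 9.0, "US": 335.0}


def predict_country(forename, surname):
    """Combine both name signals with a population prior, then normalize."""
    fore = FORENAME_LIKELIHOOD[forename.lower()]
    sur = SURNAME_LIKELIHOOD[surname.lower()]
    # Unnormalized score: prior * P(forename | country) * P(surname | country)
    scores = {
        country: POPULATION_MILLIONS[country] * fore[country] * sur[country]
        for country in POPULATION_MILLIONS
    }
    total = sum(scores.values())
    return {country: score / total for country, score in scores.items()}


result = predict_country("Giovanni", "Rossi")
for country, probability in sorted(result.items(), key=lambda kv: -kv[1]):
    print(f"{country}: {probability:.3f}")
```

With these toy numbers Italy dominates even though the US has a much larger population, because both name signals point the same way.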

Why both names are better than one

"Giovanni" points strongly at Italy on its own. "Rossi" also points at Italy. Put them together and you get very high confidence. The two signals agree.

Now try "Ashley Chen." "Ashley" is common in the US, UK, and Australia. "Chen" is common in China, Taiwan, and Singapore. The signals disagree, and the confidence drops. That's the right behavior. The API isn't just guessing a country; it's telling you how sure it is.
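To see why disagreement lowers the top probability, here's a toy comparison. It assumes a uniform prior over three countries so the effect of the two name signals is isolated, and the signal strengths are invented. The real confidence score is computed differently; this only shows how the distribution flattens.

```python
def combine(forename_signal, surname_signal):
    """Multiply two per-country signals and renormalize (uniform prior assumed)."""
    scores = {c: forename_signal[c] * surname_signal[c] for c in forename_signal}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Made-up signal strengths over three countries, for illustration only.
giovanni = {"IT": 0.80, "US": 0.10, "CN": 0.10}
rossi = {"IT": 0.70, "US": 0.20, "CN": 0.10}
ashley = {"IT": 0.05, "US": 0.90, "CN": 0.05}
chen = {"IT": 0.02, "US": 0.08, "CN": 0.90}

agree = combine(giovanni, rossi)      # both signals point at IT
disagree = combine(ashley, chen)      # signals point at US and CN

print(f"top share when signals agree:    {max(agree.values()):.2f}")
print(f"top share when signals disagree: {max(disagree.values()):.2f}")
```

With these numbers the top share falls from about 0.95 to about 0.61, and the runner-up country keeps a large share instead of being squeezed out.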

A single name still works, just with less information. "Smith" alone could be American, British, Australian, or Canadian. You'll get a wider distribution and lower confidence, which accurately reflects the ambiguity.

Population matters

Suppose a name is equally common in China and Luxembourg, meaning the same share of people in each country carries it. Without any other information, the person is far more likely to be from China, because China has about 1.4 billion people and Luxembourg has about 650,000. We factor population into every prediction. Otherwise the model would treat a tiny country the same as a huge one, which doesn't match reality.
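The arithmetic behind that example is short enough to write out. The populations are the round figures from the paragraph above; the 0.1% name share is an arbitrary assumption, and any value gives the same answer as long as it's equal in both countries.

```python
# Round population figures from the text.
population = {"China": 1_400_000_000, "Luxembourg": 650_000}
# Assumed: the name is carried by the same share of people in both countries.
name_share = {"China": 0.001, "Luxembourg": 0.001}

# Expected number of people with the name in each country.
bearers = {c: population[c] * name_share[c] for c in population}
total = sum(bearers.values())

# Probability that a random bearer of the name comes from each country.
posterior = {c: bearers[c] / total for c in bearers}

for country, probability in posterior.items():
    print(f"{country}: {probability:.4%}")
```

Because the name share cancels out, the odds are just the population ratio: roughly 2,150 to 1 in favor of China.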

Typos and unfamiliar names

Not every name is in our data. Typos happen. Transliterations vary. When we don't recognize a name exactly, we look for the closest known names by character similarity. "Giovnni" (missing an 'a') looks a lot like "Giovanni," so it inherits most of Giovanni's country distribution. You still get Italy.

This works well for minor misspellings and close transliteration variants. It doesn't work for names that are completely absent from the training data with no similar neighbors. In those cases, the API still returns a result, but the confidence will be low. When we don't know, we say so.
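As a rough illustration of matching by character similarity, here's a sketch that compares names by their character bigrams using Jaccard similarity. The choice of bigrams, the Jaccard measure, the padding characters, and the tiny list of known names are all assumptions for the example, and it stops at finding the neighbor rather than showing how the inherited distribution gets discounted.

```python
def char_ngrams(name, n=2):
    """Return the set of character n-grams, with start/end markers added."""
    padded = f"^{name.lower()}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a, b):
    """Jaccard similarity between the n-gram sets of two names (0 to 1)."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def closest_known_name(query, known_names):
    """Return the most similar known name and its similarity score."""
    best = max(known_names, key=lambda name: similarity(query, name))
    return best, similarity(query, best)


# Hypothetical stand-in for the names in the lookup tables.
KNOWN_NAMES = ["giovanni", "giovanna", "yuki", "maria"]

match, score = closest_known_name("Giovnni", KNOWN_NAMES)
print(match, round(score, 2))
```

A name with no close neighbor would score near zero against everything in the list, which is the situation where a low-confidence result is the honest answer.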

What the confidence score means

Every prediction comes with a confidence score between 0 and 1. We calibrate it so that the number actually means something: if we report 0.8, roughly 80% of predictions at that level were correct in our validation data. It's not just "how sure the model feels." It tracks real accuracy.
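One way to check a claim like that on your own labeled data is a reliability table: group predictions by reported confidence and compare each group's average confidence with how often it was right. The sketch below is a generic check, not our calibration procedure, and the ten data points are invented.

```python
from collections import defaultdict


def reliability_table(confidences, correct, n_bins=5):
    """Group predictions into equal-width confidence bins and report accuracy."""
    bins = defaultdict(list)
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # keep 1.0 in the last bin
        bins[index].append((conf, ok))

    table = []
    for index in sorted(bins):
        rows = bins[index]
        mean_confidence = sum(conf for conf, _ in rows) / len(rows)
        accuracy = sum(1 for _, ok in rows if ok) / len(rows)
        table.append((index / n_bins, (index + 1) / n_bins,
                      mean_confidence, accuracy, len(rows)))
    return table


# Invented validation data: reported confidence and whether the top guess was right.
confidences = [0.82, 0.85, 0.81, 0.88, 0.84, 0.35, 0.30, 0.38, 0.32, 0.36]
correct = [True, True, True, True, False, False, True, False, False, True]

table = reliability_table(confidences, correct)
for low, high, mean_confidence, accuracy, count in table:
    print(f"[{low:.1f}, {high:.1f}): confidence {mean_confidence:.2f}, "
          f"accuracy {accuracy:.2f}, n={count}")
```

In a well-calibrated system the confidence and accuracy columns stay close to each other in every bin, given enough samples per bin.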

We flag anything below 0.6 as low confidence. If you're building on top of the API, that flag is worth filtering on.
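If you're doing that filtering client-side, it can look something like the sketch below. The 0.6 threshold comes from the paragraph above, but the record shape and the field names ("name", "country", "confidence") are hypothetical placeholders, so check the API reference for the actual response format.

```python
LOW_CONFIDENCE_THRESHOLD = 0.6  # anything below this is flagged as low confidence


def split_by_confidence(predictions, threshold=LOW_CONFIDENCE_THRESHOLD):
    """Separate predictions into usable ones and ones that need a second look."""
    usable, needs_review = [], []
    for prediction in predictions:
        if prediction["confidence"] < threshold:
            needs_review.append(prediction)
        else:
            usable.append(prediction)
    return usable, needs_review


# Hypothetical records; field names and values are assumptions for illustration.
predictions = [
    {"name": "Giovanni Rossi", "country": "IT", "confidence": 0.93},
    {"name": "Ashley Chen", "country": "US", "confidence": 0.41},
    {"name": "Smith", "country": "US", "confidence": 0.60},
]

usable, needs_review = split_by_confidence(predictions)
print([p["name"] for p in usable])
print([p["name"] for p in needs_review])
```

What you do with the low-confidence pile depends on the application: drop it, route it to manual review, or fall back to another signal.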

Where it struggles

The model is good at names with a clear geographic signal. It's weaker on diaspora names (say, a family of Indian origin that has lived in the UK for generations), on names shared across many countries ("Mohamed" appears in dozens), and on countries with overlapping naming traditions, like the Nordics. I wrote more about these failure modes in What Names Can't Tell You.

If you want to understand the internals (Bayes' theorem, the confidence formula, how calibration works), the technical deep-dive covers all of it.