Why Andrea Is Male in Italy

April 11, 2026

Andrea means "manly." It comes from the Greek andreios, and in Italy it's a man's name: about 95% of Italian Andreas are male. In the United States, though, Andrea is around 90% female, and Germany tells the same story. A global model that averages across all countries would report something like 40% male, 60% female, which is wrong everywhere.

A single worldwide number for "Andrea" is useless. You need to know where the person is from.

Andrea isn't alone

"Jean" is male in France (Jean-Paul Sartre) and female in English (Jean Harlow). "Nicola" is a man's name in Italy but a woman's name in the UK. "Simone" follows the same pattern: Simone de Beauvoir was French and female, but Simone in Italy is almost always male.

Then there are names that are genuinely ambiguous even within a single country. "Yuki" is used for both genders in Japan, with the ratio depending on which kanji are used to write it. In romanized form, that distinction disappears. "Alex" and "Jordan" are given to both men and women in most English-speaking countries.

These aren't edge cases. If your data includes Italian customers, a global-only gender model will misclassify nearly every Italian Andrea, Nicola, and Simone.

How the API handles it

The /predict/gender endpoint accepts an optional country parameter. Pass it and you get the country-specific distribution:

# Andrea in Italy
POST /predict/gender
{"forename": "Andrea", "country": "italy"}

→ {"male": 0.95, "female": 0.05, "confidence": 0.92}

# Andrea in the United States
POST /predict/gender
{"forename": "Andrea", "country": "united-states"}

→ {"male": 0.09, "female": 0.91, "confidence": 0.88}

If you omit the country, you get the global aggregate. For unambiguous names like "Mohammed" or "Elizabeth," the global number is fine. For names like Andrea, it's misleading.
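As a rough illustration, the two request shapes above could be built from Python like this. It is a minimal sketch, not an official client: the base URL is a placeholder, the helper name build_gender_request is made up for this example, and any authentication headers the API needs are left out.

```python
import json
import urllib.request

# Assumption: the real base URL is not given in this post; this is a placeholder.
BASE_URL = "https://api.example.com"


def build_gender_request(forename, country=None):
    """Build (but do not send) a POST request for /predict/gender."""
    payload = {"forename": forename}
    if country is not None:
        # Country values follow the style shown above: "italy", "united-states".
        payload["country"] = country
    return urllib.request.Request(
        BASE_URL + "/predict/gender",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# With a country: asks for the country-specific distribution.
req = build_gender_request("Andrea", country="italy")

# Without a country: asks for the global aggregate.
req_global = build_gender_request("Mohammed")

# To actually send one: urllib.request.urlopen(req), then json.load() the body.
```

Leaving the country key out entirely, rather than sending an empty value, mirrors the "omit the country" behavior described above.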

Fallback with halved confidence

The training data stores gender distributions per (forename, country) pair. When you request a specific country and we have data for that combination, you get the distribution for exactly that pair. But what happens when we know the name, just not for the country you asked about?

Say you ask for "Andrea" with country="south-korea". We have data on Andrea globally, but not specifically for South Korea. In this case, the API falls back to the global distribution and halves the confidence (raw_confidence × 0.5). The prediction still comes back, but the reduced confidence signals: "we recognized this name, we just don't have country-specific data for it here."
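The fallback rule can be sketched in a few lines. This is an illustration of the logic described above, not the API's actual code: the dictionaries stand in for the real data store, the global numbers for Andrea are invented for the example, and returning None for an unknown name is an assumption.

```python
# Hypothetical in-memory tables: (male share, female share, raw confidence).
BY_COUNTRY = {
    ("andrea", "italy"): (0.95, 0.05, 0.92),
    ("andrea", "united-states"): (0.09, 0.91, 0.88),
}
GLOBAL = {
    "andrea": (0.40, 0.60, 0.40),  # illustrative values only
}


def predict_gender(forename, country=None):
    key = forename.lower()

    # Case 1: we have data for this exact (forename, country) pair.
    if country is not None and (key, country) in BY_COUNTRY:
        male, female, confidence = BY_COUNTRY[(key, country)]
        return {"male": male, "female": female, "confidence": confidence}

    # Unknown name: behavior assumed for this sketch.
    if key not in GLOBAL:
        return None

    male, female, confidence = GLOBAL[key]
    if country is not None:
        # Case 2: a country was requested but we only know the name globally,
        # so fall back to the global distribution and halve the confidence.
        confidence = confidence * 0.5
    # Case 3 (no country requested): plain global aggregate, confidence untouched.
    return {"male": male, "female": female, "confidence": confidence}


italy = predict_gender("Andrea", "italy")
korea = predict_gender("Andrea", "south-korea")
no_country = predict_gender("Andrea")
```

Here korea and no_country carry the same male/female split, but korea's confidence is half of no_country's, which is the only way a caller can tell the two situations apart.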

The confidence signal

Country context doesn't just change the prediction; it changes the confidence too. "Andrea" with country="italy" returns a confidence above 0.9 because the distribution is extremely peaked: 95% on one side. Without a country, the same name returns a split closer to 40% male and 60% female, and the confidence drops because the distribution is flat. The confidence score reflects how peaked the distribution is, so ambiguous results naturally get flagged as uncertain.

This is useful for filtering. If you're enriching a CRM and you want to act only on high-confidence predictions, the ambiguous names remove themselves. You don't need a separate list of "tricky names" to watch out for.
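A confidence filter for that kind of enrichment job might look like the sketch below. The 0.8 threshold, the record layout, and the keep_confident helper are all assumptions made for illustration; the second record uses made-up global numbers.

```python
def keep_confident(records, threshold=0.8):
    """Keep records whose prediction meets the threshold, adding a gender label."""
    kept = []
    for record in records:
        prediction = record["prediction"]
        if prediction["confidence"] >= threshold:
            label = "male" if prediction["male"] >= prediction["female"] else "female"
            kept.append({**record, "gender": label})
    return kept


records = [
    # Andrea with country="italy": peaked distribution, high confidence.
    {"name": "Andrea", "prediction": {"male": 0.95, "female": 0.05, "confidence": 0.92}},
    # Andrea with no country: flat distribution, low confidence (illustrative numbers).
    {"name": "Andrea", "prediction": {"male": 0.40, "female": 0.60, "confidence": 0.30}},
]

enriched = keep_confident(records)
```

Only the first record survives the filter, without Andrea ever appearing on a hand-maintained exception list.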

Let the API figure out the country

You might not always know the person's country. That's what /predict/all is for. When you call it with both forename and surname, the country prediction runs first. Its top result is then used as context for the gender prediction.

POST /predict/all
{"forename": "Andrea", "surname": "Rossi"}

# Internally: predicts Italy from "Rossi" → uses Italy for gender
→ gender: {"male": 0.95, "female": 0.05}

"Andrea Rossi" gets the right answer because "Rossi" is a strong signal for Italy, and once Italy is the context, Andrea resolves cleanly to male. "Andrea Smith" would predict the US or UK, and resolve to female.
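The chaining can be pictured as two function calls. This is only a sketch of the order of operations described above: the toy lookup tables replace the real models, and the assumption that the country predictor returns a ranked list with the best guess first is ours.

```python
def predict_all(forename, surname, predict_country, predict_gender):
    """Run the country prediction first, then use its top result for gender."""
    ranked_countries = predict_country(surname)  # assumed: best guess first
    top_country = ranked_countries[0] if ranked_countries else None
    return {
        "country": top_country,
        "gender": predict_gender(forename, top_country),
    }


# Toy stand-ins so the sketch runs; the real predictors are models, not dicts.
def toy_country(surname):
    table = {
        "rossi": ["italy"],
        "smith": ["united-states", "united-kingdom"],
    }
    return table.get(surname.lower(), [])


def toy_gender(forename, country):
    table = {
        ("andrea", "italy"): {"male": 0.95, "female": 0.05},
        ("andrea", "united-states"): {"male": 0.09, "female": 0.91},
    }
    return table.get((forename.lower(), country))


rossi = predict_all("Andrea", "Rossi", toy_country, toy_gender)
smith = predict_all("Andrea", "Smith", toy_country, toy_gender)
```

The same forename resolves to opposite answers purely because the surname moved the country context.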

If you already know the country from your own data, like a billing address or phone prefix, passing it explicitly is better. The surname-based country guess is good but not perfect, and your own data doesn't have to guess.
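In client code, that preference can be written as a small decision function. This is a hedged sketch: choose_request is a hypothetical helper, and it assumes /predict/all takes only forename and surname, as in the example above.

```python
def choose_request(forename, surname=None, country=None):
    """Return (endpoint, payload), preferring a country you already know."""
    if country is not None:
        # Known country (billing address, phone prefix): no guessing needed.
        return "/predict/gender", {"forename": forename, "country": country}
    if surname is not None:
        # No country on file: let the surname-based country guess supply context.
        return "/predict/all", {"forename": forename, "surname": surname}
    # Neither available: fall back to the global aggregate.
    return "/predict/gender", {"forename": forename}


known = choose_request("Andrea", surname="Rossi", country="italy")
guessed = choose_request("Andrea", surname="Rossi")
bare = choose_request("Andrea")
```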