Who are you?

That simple question can have many answers depending on how you interpret it. Who are you? Spiritually? Professionally? Psychologically? As a human? Emotionally? As a parent? Metaphysically? But there’s an even simpler answer. Almost all of us would answer that question the same way. Why? Because all humans share at least one thing: we have a name!

I recently started working on a gender inference package. At first glance, it’s an easy task: determine the gender of a person based on their first name. Not too hard if you only consider the Western world but, pretty quickly, it turns out not to be as easy as it looks…

But who would need that? Why would you want to determine the gender of someone from their first name? For lots and lots of reasons! If you’re doing research or profiling in sociology, politics, human resources, demographics, marketing or any other domain, there’s a lot of data out there but, often, only bits and pieces of it are available. And often, gender is not something that is directly available.

But let’s go back a bit. The study of proper names is actually a science: it’s called onomastics. To be more precise, in our case (the study of the names of human beings), that science is a branch of onomastics called anthroponomastics (or anthroponymy).

And as always, whenever I start working on something, I like to think it through by myself, from scratch. After that, I like to read up on the subject and compare my ideas with what I read. So that’s what I did.

At first, I was struck by the simplicity of what is out there!  Most gender prediction services/applications were way too naive and simple.  And in almost all cases, useful information was simply stripped away in the sanitizing process.

But first, here’s a list of gender prediction programs/packages/services:

Gender API
Gender Guesser
Gender Detector (formerly known as SexMachine)
Gender Predictor
Name Gender Guesser
Gender Guesser API
GendRE Gender APP
Gender Checker
Name Genderization
Kantrowitz Gender Program
Gender package on CRAN
PD Nickname

Now, here’s a brief list of what I found problematic with the current gender prediction services/applications…

North American Bias

Most programs/services I studied use, at least partially, data that comes from the US Census or the SSA. That’s fine as long as you only have to deal with North Americans, but those programs fail miserably when used against names from outside North America. Even worse, in some cases that data was also used to increase the count of occurrences of some names (thus making the gender prediction appear more precise). That makes it almost impossible for some European names to come out with the proper gender in their respective country when it differs from the one that prevails in the United States. For example, a first name like Michele comes out as being 99% female while, in Italy, it’s mostly a male first name. Besides, those two data sources have a more important problem…
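To make the bias concrete, here is a minimal Python sketch comparing a pooled lookup with a country-aware one. The rows, the counts and the `female_share` helper are all hypothetical; the numbers are invented for illustration and come from no real data set.

```python
from collections import defaultdict

# Hypothetical (name, country, gender, count) rows. The counts are made up
# purely to illustrate the effect; they do not come from any real data set.
ROWS = [
    ("Michele", "US", "F", 990),
    ("Michele", "US", "M", 10),
    ("Michele", "IT", "F", 5),
    ("Michele", "IT", "M", 95),
]


def female_share(rows, name, country=None):
    """Return the female share for a name, optionally restricted to one country.

    Returns None when the name does not appear in the selected rows.
    """
    counts = defaultdict(int)
    for row_name, row_country, gender, count in rows:
        if row_name == name and (country is None or row_country == country):
            counts[gender] += count
    total = counts["F"] + counts["M"]
    return counts["F"] / total if total else None


# Pooling every country lets the large US counts drown out the Italian ones.
pooled = female_share(ROWS, "Michele")
# Restricting the lookup to Italy gives the opposite answer.
italy = female_share(ROWS, "Michele", "IT")

print(f"pooled: {pooled:.2f}, Italy only: {italy:.2f}")
```

With these invented counts the pooled lookup says "mostly female" while the Italian lookup says "mostly male", which is exactly the kind of disagreement a single merged table hides.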

Normalized Data

The US Census and the SSA data sets have one major problem: the data has been sanitized so heavily that it is almost useless outside North America. Accented characters have been stripped and name particles have been eliminated. For instance, in a lot of countries, Andréa is a female first name while Andrea is unisex. Same thing for Michèle: it’s 100% a female first name. Michele (without the accented character), on the other hand, can be either a male name (in Italy for instance) or a female name (in the US for instance). Unfortunately, since the "é" (or "è") has been normalized to an "e" in both data sets, that distinction can no longer be made. Crucial information has been lost in the sanitizing process.
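The loss of information described above is easy to reproduce. The sketch below uses Python's standard `unicodedata` module to strip accents the way a typical cleanup step might; it is an illustration of the effect, not a description of how the US Census or the SSA actually processed their data. The `strip_accents` helper is a hypothetical name.

```python
import unicodedata


def strip_accents(name):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


names = ["Michèle", "Michele", "Andréa", "Andrea"]
normalized = {name: strip_accents(name) for name in names}

# Four distinct spellings go in, only two come out.
distinct_after = set(normalized.values())

for original, cleaned in normalized.items():
    print(f"{original} -> {cleaned}")
print(f"{len(names)} spellings collapsed into {len(distinct_after)}")
```

Once the accented and unaccented spellings share a single key, their counts are merged and there is no way to recover which gender went with which spelling.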

The same thing applies to the removal of some particles that are essential to identify the gender from a name. In many languages, parts of the name include linguistic hints that indicate the gender of the person or the relationship of the person with their parents: "son of", "daughter of", etc. All of those have also been removed from the SSA and US Census data sets.

Black or white results

Some of the services/applications do not return a probability with the gender prediction. All you get is male, female, unisex or unknown, without any other detail. That is not very helpful if you want to filter the gender predictions by a certain level of confidence. There are situations where you need to know whether the prediction comes with 95% confidence rather than 52%!
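As a rough illustration of why the probability matters, here is a small Python sketch that turns raw counts into a (gender, probability) pair and then filters on a threshold. The `predict` and `confident` functions and the 0.9 threshold are assumptions made for this example, not the API of any of the services listed above.

```python
def predict(female_count, male_count):
    """Return (gender, probability) from raw counts, or (None, 0.0) if no data."""
    total = female_count + male_count
    if total == 0:
        return None, 0.0
    p_female = female_count / total
    if p_female >= 0.5:
        return "female", p_female
    return "male", 1.0 - p_female


def confident(prediction, threshold=0.9):
    """Keep a prediction only if its probability reaches the threshold."""
    gender, probability = prediction
    return gender is not None and probability >= threshold


strong = predict(95, 5)   # 95 female occurrences out of 100
weak = predict(52, 48)    # 52 female occurrences out of 100

# Both are labelled "female", but only the first survives the filter.
print(strong, confident(strong))
print(weak, confident(weak))
```

A service that only answers "female" for both cases gives the caller no way to tell them apart.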

Extra information is not used

In many cases, extra info that could be essential to identify gender, or at least useful to determine it in some countries, is simply not used by the programs/services. Even if most programs allow you to specify a country, in many cases that information could very well be supplemented by the last name. For instance, Michele is mostly female. But even if you asked for the gender of Michele in Canada, it would be essential to know that if the last name is Forgione (an Italian last name), you’re most likely dealing with a male first name! The same critical information is even more obvious with Russian names: once you know how female last names are formed in Russia, you don’t even have to know the first name to determine the gender if you are dealing with someone named Kournikova! The same kind of detail can be inferred from the year of birth of the person whose gender you are trying to determine: it is well known that some of today’s unisex first names have, at some point in time, gone from one side to the other and then eventually become unisex. Year of birth information in those cases can sharpen the probability of the gender a lot!
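The Russian last name case can be sketched in a few lines of Python. This is a deliberately simplified heuristic: the suffix lists are a small, incomplete sample of common endings, the function name is hypothetical, and the hint is only meaningful when the last name is already known to be Russian (transliterations vary, and names from other languages can share the same endings).

```python
# Common endings of Russian last names in Latin transliteration.
# These tuples are an illustrative sample, not an exhaustive list.
FEMALE_SUFFIXES = ("ova", "eva", "ina", "aya")
MALE_SUFFIXES = ("ov", "ev", "in", "sky")


def gender_hint_from_russian_surname(surname):
    """Return 'female', 'male' or None based on the ending of the surname.

    Only call this for surnames already known to be Russian.
    """
    lowered = surname.lower()
    # Check the female endings first: they extend the male ones
    # (for example "ova" extends "ov").
    if lowered.endswith(FEMALE_SUFFIXES):
        return "female"
    if lowered.endswith(MALE_SUFFIXES):
        return "male"
    return None


for surname in ("Kournikova", "Kournikov", "Forgione"):
    print(surname, gender_hint_from_russian_surname(surname))
```

A real system would treat this as one more piece of evidence to combine with the first name, the country and the year of birth rather than as a final answer.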

Lots of data is inferred

Lots of these programs crawl social media to gather more data for their database. The main problem is that the collected data is "validated" against the same 2 data sets (US Census and SSA), thus polluting the data as it is being collected! That is just wrong! You’re collecting data for a gender prediction program and the data you collect is itself "predicted" or inferred! Nothing will replace official lists, based on official sources that specify the gender. And the more local (one per country ideally) those lists are, the better.

In other cases, some name collecting methods have to rely on a multitude of imprecise methods to estimate the gender of the names they collect, thus making the precision of the inferred gender even lower. Going through a face recognition algorithm, then deducing the first name by parsing a Twitter nickname, and finally running all that data through an SVM classifier means one and only one thing: the more steps you have, the more error-prone the result.
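A back-of-the-envelope calculation shows how quickly chained steps erode precision. The Python sketch below assumes that every step must be right for the final answer to be right and that the steps fail independently, which is a simplification; the per-step accuracies are hypothetical values chosen for the example, not measurements of any real system.

```python
from math import prod


def pipeline_accuracy(step_accuracies):
    """Estimate end-to-end accuracy as the product of per-step accuracies.

    Assumes independent errors and that every step must be correct.
    """
    return prod(step_accuracies)


# Hypothetical per-step accuracies, for illustration only.
steps = {
    "face recognition": 0.90,
    "nickname parsing": 0.85,
    "SVM classifier": 0.90,
}

overall = pipeline_accuracy(steps.values())
print(f"end-to-end accuracy: {overall:.4f}")
```

Under these assumptions, three steps that each look respectable on their own end up below 70% once chained together.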

Now what?

Well, in just 2 days I was able to collect 2.5 million (not unique, of course) names from 220+ countries and 60+ data sets. I haven’t even used the US Census or the SSA data sets! What am I going to do differently to deal with the issues I’m describing in this post?

Well, I’ll keep that for another article! I’m not ready to reveal my secrets… yet.

P.S. If you know of other gender prediction programs, leave a comment! I’ll update this post if necessary. Note that I tried to list genuine gender prediction programs/services, not wrappers around an existing web service!



2 Responses to Onomastics

  1. emgc says:

    Hello! I have been checking many of the tools you mentioned, and I have found problems with many of them. I found sexmachine especially accurate for British names, but it recognized very few names. Namsor recognizes a lot of names, but it can give crazy results with a very high accuracy probability. I think you haven’t posted how you addressed this problem; I would appreciate it if you shared your experience.


    • endormitoire says:

      I came to the conclusion that to deal efficiently with region-specific variations, you have to collect the geographical region (if possible, otherwise the country) in addition to the gender information and the names. Also, the data must be kept in its original format since "cleaning it up" often destroys relevant hints (such as accented characters). I also collect last names since I can often infer the ethnic origin from them (see my example with Andrea in Italy in my post). The more information you collect, the better. For instance, in some cases the middle name identifies the gender, while in some other languages (such as Russian) a suffix will give you a hint about the gender. Just collecting a bunch of first names with the gender can work with most Western first names, but once you go beyond English first names, things quickly get complicated!

