To demonstrate how this type of profiling works, we recently set out to build a system that could extract Twitter users’ gender and main interests from the data that is publicly available through the Twitter API (all of the data described above). The lesson in this post comes from the attempt to identify users’ genders. We first attempted to build a text mining solution: we collected a large set of Twitter usernames that we knew belonged to men and a similarly sized set that we knew belonged to women, then trained a naive Bayes classifier to tell them apart based on the presence or absence of certain words in the Tweets they posted and those posted by the people they followed. This didn’t really work. So, next we tried a similar approach based on the accounts present in the network of people that a user followed. This didn’t really work either.
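To make the first attempt concrete, here is a minimal sketch of a naive Bayes classifier that looks only at whether each word is present or absent (the Bernoulli variant). It is not the code we actually ran: the function names, the Laplace smoothing, and the toy tokens are all assumptions chosen for illustration, and the tokens are deliberately meaningless placeholders.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(docs, labels):
    """Fit a Bernoulli naive Bayes model on tokenised documents.

    docs: list of token lists (e.g. one per user, all of their Tweets joined)
    labels: list of class labels, same length as docs
    """
    vocab = {word for doc in docs for word in doc}
    class_counts = defaultdict(int)
    word_counts = defaultdict(lambda: defaultdict(int))
    for doc, label in zip(docs, labels):
        class_counts[label] += 1
        for word in set(doc):  # presence only, not frequency
            word_counts[label][word] += 1

    total = len(docs)
    priors = {c: n / total for c, n in class_counts.items()}
    # Laplace smoothing (an assumption here) keeps every probability
    # strictly between 0 and 1, so the logarithms below are always defined.
    likelihoods = {
        c: {w: (word_counts[c][w] + 1) / (class_counts[c] + 2) for w in vocab}
        for c in class_counts
    }
    return vocab, priors, likelihoods


def predict(model, doc):
    """Return the most probable class label for one tokenised document."""
    vocab, priors, likelihoods = model
    present = set(doc)
    scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for w in vocab:
            p = likelihoods[c][w]
            # Absent words count as evidence too: that is the Bernoulli part.
            score += math.log(p if w in present else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)


# Invented toy data with placeholder tokens, purely for illustration.
train_docs = [
    ["alpha", "beta", "shared"],
    ["alpha", "shared", "gamma"],
    ["delta", "epsilon", "shared"],
    ["delta", "epsilon", "zeta"],
]
train_labels = ["male", "male", "female", "female"]

model = train_bernoulli_nb(train_docs, train_labels)
prediction = predict(model, ["alpha", "gamma"])
print(prediction)
```

On real Tweets the vocabulary runs to many thousands of words, most of which say very little about the author, which may be part of why this approach struggled for us.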
At this point someone made a very good suggestion. Most Twitter users supply a full name, and a person’s first name provides a very good indication of their gender. In fact, the central statistics office in most countries publishes statistics on the most popular names for baby boys and girls each year. Using this data, it is very easy to calculate the probability that someone with a particular first name is male or female. There are even nice APIs that provide this classification as a service based on a name and a country, for example Gender API. This performs much better than the much more complicated machine-learning-based approaches.
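The name-based calculation really is only a few lines. The sketch below assumes the baby-name statistics have already been loaded into two dictionaries mapping a lower-cased first name to a count; the function names and the counts are hypothetical and invented for illustration, not real statistics.

```python
def first_name_of(full_name):
    """Take the first whitespace-separated token of a profile's name field."""
    parts = full_name.split()
    return parts[0] if parts else ""


def male_probability(first_name, boy_counts, girl_counts):
    """Estimate P(male | first name) from baby-name counts.

    Returns None when the name appears in neither table, so the caller
    can decide how to handle unknown names.
    """
    name = first_name.strip().lower()
    boys = boy_counts.get(name, 0)
    girls = girl_counts.get(name, 0)
    if boys + girls == 0:
        return None
    return boys / (boys + girls)


# Invented counts, purely for illustration.
boy_counts = {"jack": 980, "alex": 600, "emily": 2}
girl_counts = {"emily": 998, "alex": 400}

for full_name in ["Jack Murphy", "Alex Byrne", "Emily Walsh", "Zorblax"]:
    p = male_probability(first_name_of(full_name), boy_counts, girl_counts)
    print(full_name, p)
```

Names that are common for both boys and girls come out near 0.5, and names missing from the tables return `None`, so in practice some users would have to be left unclassified.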