12 Days of Data Analytics: Day 1 – Remember Ockham’s Razor

The process of signing up for a Twitter account is interesting for its brevity. All that users need to provide is an email address, a password, and an optional full name. So, on the day that you join Twitter, they know very little about you.

This is in stark contrast with the amount of detail that Twitter claim to know about their users when advertisers are choosing the target audience for new campaigns. Advertisers can choose to put their ads in front of Twitter users located in specific regions, of specific genders, and with very particular interests (see below). So how do they go from knowing just a user’s email address and full name to knowing that they are a woman located in Ireland and interested in off-road vehicles? Twitter, like most other similar companies, make extensive use of profiling to learn these nuggets of demographic and preference information about their users. Earlier this year, for example, Facebook released a list of 98 personal data points that they use for this type of profiling.

For a demonstration of how this type of profiling works, we recently set out to build a system that could extract Twitter users’ gender and main interests from the data that is publicly available through the Twitter API (all of the data described above). The lesson in this post comes from the attempt to identify users’ genders. We first attempted to build a text mining solution: we collected a large set of Twitter usernames that we knew belonged to men and a similarly sized set that we knew belonged to women, then trained a naive Bayes classifier to recognize the difference between them based on the presence or absence of certain words in the Tweets they posted and in those posted by the people they followed. This didn’t really work. So, next we tried a similar approach based on the accounts present in the network of people that a user followed. This didn’t really work either.
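To make the first attempt a little more concrete, here is a minimal sketch of a naive Bayes classifier that works on the presence or absence of words. It is not the code we actually ran: the function names (train_bernoulli_nb, predict), the class labels, and the tiny training set are all made up for illustration, and a real system would be trained on thousands of accounts, most likely with an existing machine learning library.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(docs, labels):
    """Train a Bernoulli naive Bayes model.

    docs is a list of sets of words (one set per user) and labels is the
    list of class labels in the same order. Both names are hypothetical.
    """
    vocab = set().union(*docs)
    class_docs = defaultdict(int)                       # users per class
    word_docs = defaultdict(lambda: defaultdict(int))   # users per class containing each word
    for words, label in zip(docs, labels):
        class_docs[label] += 1
        for w in words:
            word_docs[label][w] += 1

    model = {"vocab": vocab, "prior": {}, "p_word": {}}
    for label, count in class_docs.items():
        model["prior"][label] = count / len(docs)
        # Laplace smoothing so that no probability is exactly 0 or 1.
        model["p_word"][label] = {
            w: (word_docs[label][w] + 1) / (count + 2) for w in vocab
        }
    return model


def predict(model, words):
    """Return the most probable class label for a set of words."""
    best_label, best_score = None, -math.inf
    for label, prior in model["prior"].items():
        score = math.log(prior)
        for w in model["vocab"]:
            p = model["p_word"][label][w]
            # Both presence and absence of a word count as evidence.
            score += math.log(p if w in words else 1 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy data: abstract tokens and labels, not real Tweets.
toy_docs = [{"alpha", "beta"}, {"alpha"}, {"gamma"}, {"gamma", "delta"}]
toy_labels = ["group_a", "group_a", "group_b", "group_b"]
toy_model = train_bernoulli_nb(toy_docs, toy_labels)
```

Calling predict(toy_model, {"alpha"}) then returns whichever label the toy data makes more likely for a user whose Tweets contain only that word. Even this small sketch needs labelled training data, a vocabulary, and smoothing before it can say anything at all.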

At this point someone made a very good suggestion. Most Twitter users supply a full name, and a person’s first name provides a very good indication of their gender. In fact, the national statistics office in most countries (the Central Statistics Office in Ireland, for example) publishes statistics on the most popular names for baby boys and girls each year. Using this data it is very easy to calculate the probability that someone with a particular first name is male or female. There are even nice APIs that provide this classification as a service based on a name and a country, for example Gender API. This simple approach performs much better than the much more complicated machine learning based approaches.
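The name-based calculation really is only a few lines. The sketch below assumes we have already loaded per-name birth counts into a dictionary; the counts shown are invented purely for illustration (they are not real statistics), and the names TOY_NAME_COUNTS and gender_probability are hypothetical.

```python
# Invented example counts, NOT real statistics: the number of baby boys
# and girls registered with each first name.
TOY_NAME_COUNTS = {
    "mary": {"male": 10, "female": 990},
    "john": {"male": 995, "female": 5},
    "alex": {"male": 600, "female": 400},
}


def gender_probability(full_name, counts=TOY_NAME_COUNTS):
    """Estimate P(male) and P(female) from the first name alone.

    Returns None when the name is empty or not in the table, so the
    caller can decide how to handle users we cannot classify.
    """
    parts = full_name.strip().split()
    if not parts:
        return None
    entry = counts.get(parts[0].lower())
    if entry is None:
        return None
    total = entry["male"] + entry["female"]
    if total == 0:
        return None
    return {"male": entry["male"] / total, "female": entry["female"] / total}
```

With these toy counts, gender_probability("Mary Murphy") puts almost all of the probability on female, while a name like "Alex" stays close to a coin toss. Returning None for unknown names keeps the sketch honest about the users for whom a first name tells us nothing.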

I think this is a great example of why it is important to always look for simple solutions before diving straight into more complicated ones. Or, as William of Ockham more eloquently put it in what came to be known as Ockham’s Razor:

Frustra fit per plura quod potest fieri per pauciora
(It is futile to do with more things that which can be done with fewer)