12 Days of Data Analytics: Day 5 – Prediction Means a Lot of Things

When writing our book, Fundamentals of Machine Learning for Predictive Data Analytics (www.machinelearningbook.com), we worked really hard to make the point that prediction can mean a lot of different things.  We defined a prediction as “the assignment of a value to any unknown variable“. This is important because it is this broad definition of prediction that allows predictive analytics to be used for so many different applications in so many different industries.

We can illustrate this broad definition of prediction by looking at Kaggle, an online  platform for analytics competitions, dataset sharing and discussion about approaches to data analytics problems (and a great resource for honing your predictive data analytics skills). The screenshot below shows a set of competitions from the Kaggle platform that nicely show that different kinds of things we can think about when we think about prediction.

The first competition, Ultrasound Nerve Segmentation, involves building a prediction model that can determine whether or not an ultrasound image contains nerves. The input to this prediction model is an image and the output is a binary value indicating whether or not the image contains nerves. This is a nice example of a labelling problem. The predictions being made in this case assign a value to an unknown variable representing the category of an image. The point of this model is to reduce the number of ultrasound images that a doctor has to actually look at by filtering out thaose that do not contain nerves and so are not of interest to the doctor.

The Job Salary Prediction competition, although very different in many ways is really an example of the same type of labelling prediction. In this competition the task is to build a prediction model to predict the salary that should be associated with a job given just the description of that job. the unknown variable in this case is job salary and it is being assigned a value predicted by the model. The point of this model is to assign salaries to job ads that don;t contain salary information to help candidates decide whether or not they are likely to be of interest and so filter out jobs that they should not waste their time thinking about.

The Bike Sharing Demand contest is slightly different. In this case the predictions being made have a temporal element – we want to predict what the demand will be on a city’s free bicycle sharing scheme on some day in the future. We do this by building a model based on past data that will be able to make this kind of prediction. This type of prediction can be very easily described as a forecasting problem.

The last competition shown, the Springleaf Marketing Response competition, asks participants to build a model that will predict the likelihood a customer will respond to a marketing campaign. The goal here is to determine who to send marketing messages to and so this can be thought of as an example of a ranking problem. Springleaf would use the model built for this competition to rank their customers from most likely to respond to least likely to respond and only assign resources to contacting those customers with a high propensity of responding.

For each of these three types of prediction problems – labelling, forecasting, and ranking – there are different techniques that are most suitable. The reason we have described these competitions in terms of the problems they are built to solve (rather than using terms like classification, regression etc) is one of the other things that we focused on in our book. Predictive analytics projects should always be focused on generating insight from data to help people make better decisions (as illustrated below). Thinking about the type if decision that you are trying to make helps you think about the type of model that you should build. so if you are trying to decide which customers, products, or organisations to assign resources to you are probably looking at a ranking problem. On the other hand if you are trying to plan for the future based on expected demand, then you are probably thinking about a forecasting problem.