12 Days of Data Analytics: Day 3 – Unlock The Power In Your Data
Let’s start today’s post in space. The image below shows an astronaut on the surface of the moon (this image, and many of the others in this post, come from the very excellent book Digital Image Processing by Gonzalez & Woods). The image, however, has been corrupted with cosmic noise and so it is very hard to make much sense of it. Luckily, noise removal is one of the core jobs in image processing and should help us to clean up this image.
If we zoom right in on this image we can see that the basic data representation for a grey-scale image like this is a grid of pixels each containing a shade of grey.
Using this simple representation we can perform a simple convolution operation on the image to attempt to remove the noise. This is a standard technique in image processing and the main tool in the image processing toolbox for working with pixel representations like this. Unfortunately, as shown below, although the image is cleaned up it is not quite as sharp as we might like.
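To make the convolution step concrete, here is a minimal sketch of a mean (averaging) filter in Python with NumPy. The post does not say which kernel was used on the astronaut image, so the 3 x 3 averaging kernel, the `mean_filter` name, and the synthetic noisy image are all assumptions made for illustration.

```python
import numpy as np

def mean_filter(image, size=3):
    """Smooth a grey-scale image with a size x size averaging kernel (size must be odd)."""
    pad = size // 2
    # Replicate the edge pixels so the output keeps the input's shape.
    padded = np.pad(image.astype(float), pad, mode="edge")
    out = np.zeros(image.shape, dtype=float)
    # Add up one shifted copy of the image per kernel cell, then average.
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    return out / (size * size)

# Synthetic stand-in for the noisy photo: a flat grey image plus random noise.
rng = np.random.default_rng(0)
clean = np.full((64, 64), 128.0)
noisy = clean + rng.normal(0, 20, clean.shape)
smoothed = mean_filter(noisy, size=3)
```

Each output pixel is the average of its neighbourhood, which is why the noise is reduced but fine detail is blurred along with it.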
We could work more on trying out different types of convolution kernels and this would probably lead to some improvement, but we will hit a wall in terms of what can be achieved with this type of representation. Rather than trying to push against this wall, we can instead change the data representation. The fast Fourier transform is another standard tool in the digital image processing toolbox. It transforms the representation of an image from a set of pixel values into a set of frequencies; in other words, it converts the image into the frequency domain. Below we show our image of the astronaut on the surface of the moon represented in the frequency domain. The details of exactly what is going on here are not terribly important (although Chapter 4 of Digital Image Processing by Gonzalez & Woods goes into this in great detail and is fascinating stuff). What is important is that when we look at the frequency domain image below, a set of bright points arranged in a ring around the image jumps out. The noise that is corrupting this image is sinusoidal in nature (most likely caused by the camera being near a generator or other equipment like that) and so really pops out once the representation of the image is changed to the frequency domain.
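The way sinusoidal noise "pops out" in the frequency domain can be sketched with NumPy's FFT routines. The image here is a synthetic stand-in (a flat grey image plus one sinusoid whose frequencies we chose ourselves), not the astronaut photo, so treat it as an illustration of the idea only.

```python
import numpy as np

# Synthetic stand-in: a flat grey image corrupted by a single sinusoid.
# The frequencies (8 cycles across, 4 cycles down) are illustrative assumptions.
n = 64
y, x = np.mgrid[0:n, 0:n]
image = 128.0 + 40.0 * np.sin(2 * np.pi * (8 * x + 4 * y) / n)

# 2-D FFT, shifted so that zero frequency sits at the centre of the array.
spectrum = np.fft.fftshift(np.fft.fft2(image))
magnitude = np.abs(spectrum)

# Ignore the centre (the average brightness) and look for the bright points.
center = n // 2
magnitude[center, center] = 0.0
peak_rows, peak_cols = np.nonzero(magnitude > 0.5 * magnitude.max())
peaks = sorted((int(r) - center, int(c) - center)
               for r, c in zip(peak_rows, peak_cols))
```

A single sinusoid shows up as a pair of bright points placed symmetrically about the centre. Several sinusoids at a similar frequency but different orientations would give a ring of such points, like the one described above.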
Even better, removing the noise is now very easy. We can apply what is called a band reject filter (the complement of a band pass filter), which removes particular frequencies from the image, in this case the frequencies represented by the bright spots in the ring. After applying the band reject filter we get the lovely, sharp, clean image below.
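A rough sketch of this filtering step is shown below. It uses an ideal (hard-edged) ring that zeroes every frequency in a chosen band. That choice is an assumption made for simplicity and is not necessarily the filter that produced the image in this post. The `ideal_band_reject` name, the band limits, and the synthetic test image are likewise illustrative.

```python
import numpy as np

def ideal_band_reject(image, r_low, r_high):
    """Zero every frequency whose distance from the centre lies in [r_low, r_high]."""
    rows, cols = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    v, u = np.mgrid[0:rows, 0:cols]
    distance = np.hypot(v - rows // 2, u - cols // 2)
    # Keep everything outside the ring, drop everything inside it.
    keep = (distance < r_low) | (distance > r_high)
    filtered = spectrum * keep
    # Back to pixels; the imaginary part is only rounding error.
    return np.real(np.fft.ifft2(np.fft.ifftshift(filtered)))

# Synthetic stand-in: a smooth blob corrupted by sinusoidal noise.
n = 64
y, x = np.mgrid[0:n, 0:n]
clean = 128.0 + 30.0 * np.exp(-((x - 32) ** 2 + (y - 32) ** 2) / 200.0)
noise = 40.0 * np.sin(2 * np.pi * (8 * x + 4 * y) / n)
restored = ideal_band_reject(clean + noise, r_low=8, r_high=10)
```

The noise sits at a distance of about 8.9 from the centre of the spectrum, so a ring from 8 to 10 removes it while leaving the low frequencies that make up the smooth blob almost untouched. In practice a filter with softer edges is often preferred, because a hard cut-off can introduce ringing in real photographs.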
Now, that was a lot of image processing for a post about data analytics, but the lesson carries directly over to analytics projects. Often, changing our data representation is the most powerful thing we can do to generate better insights from our data. Here is a really simple example from the world of sabermetrics. The small dataset below shows the number of passes attempted and the number of passes completed in a season by 30 American football quarterbacks. Also shown is the value that the press corps placed on each quarterback at the end of the season.
| Player | Team | Passes Attempted | Passes Completed | Player Value |
It would be useful to understand how the statistics we can measure about a player influence the value the press corps places on that player. The images below show scatter plots illustrating the relationships between player value and passes completed, and between player value and passes attempted. The correlation coefficients between each of these player statistics and player value are 0.15 and 0.453 respectively, which suggests pretty weak associations.
A simple change to the data representation, however, can add significantly more value to this dataset. Simply dividing a player’s passes completed statistic by their passes attempted statistic yields the player’s percentage of passes completed.
| Player | Team | Passes Attempted | Passes Completed | Pass Completion % | Player Value |
Percentage of passes completed is a much more useful measure in trying to determine the player value. This is evident in the scatter plot below and in the correlation coefficient of 0.87 between player value and percentage of passes completed.
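The same change of representation takes only a couple of lines of code. The numbers below are invented for six hypothetical quarterbacks and are not the dataset from this post, so the correlations they produce will not match the figures quoted above. The point is only the shape of the calculation.

```python
import numpy as np

# Invented figures for six hypothetical quarterbacks (NOT the post's dataset).
attempted = np.array([400, 600, 400, 600, 500, 500], dtype=float)
completed = np.array([220, 330, 260, 390, 300, 350], dtype=float)
value = np.array([50, 52, 70, 68, 60, 80], dtype=float)

# The new representation: percentage of passes completed.
completion_pct = 100.0 * completed / attempted

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length arrays."""
    return float(np.corrcoef(a, b)[0, 1])

r_attempted = pearson(value, attempted)
r_completed = pearson(value, completed)
r_pct = pearson(value, completion_pct)
```

In this made-up data the derived percentage tracks player value far more closely than either raw count does, which mirrors the pattern in the real dataset.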
This is a pretty simple example and it is unlikely that any American football teams will be beating down our door based on this insight. Determining good metrics in sports, however, is serious business. Moneyball is one of our favourite movies (and books) here at The Analytics Store and one of our favourite scenes is when Billy Beane repeatedly turns to Peter Brand in the scouting meeting for the refrain “because he gets on base”. The number of times a player got on base was the key metric for capturing the value of a player.
Although he probably never actually said it (see here), Albert Einstein is often credited with the quote: “If I had an hour to solve a problem I’d spend 55 minutes thinking about the problem and 5 minutes thinking about solutions.” In data analytics we say:
“If I had an hour to analyse a dataset I’d spend 55 minutes working on the data representation and 5 minutes running the analysis.”