Website Visitors Part 2: Dealing with Corrupt Data

This was going to be a How to draw lovely jubbly pie charts, but while writing the code, I found a serious problem (one that can take hours to fix, if you have never seen it before).

So instead, this video has become how to deal with bad data. Obviously, it won’t teach you how to deal with every type of corrupt data, but serves as a good intro to data cleaning.

Repo is the same as before.

The problem I found here is quite subtle, and easy to fix once you know how. But get hit with it on 5O’Clock on a Friday evening, and you will be screaming Why won’t you work!. The code won’t make sense, as everything should work, but like magic, it won’t (okay, not like magic then).

The file is operating-systems.csv, which contains the operating systems of all visitors to my website. Read the file and have a look at the data. You will see a Windows 98 and Windows 2000 in there. I don’t know if someone was playing a practical joke on me, or if people are running the latest Python 3.x on Windows 98.

We read the file:

Item Value Value Percent Conversions Conversion Percent URL Stats URL
0 Windows 7 3290 26.2 NaN NaN NaN http://clicky.com/stats/visitors?site_id=10075…
1 Mac OS X 2362 18.8 NaN NaN NaN http://clicky.com/stats/visitors?site_id=10075…
2 Android 1922 15.3 NaN NaN NaN http://clicky.com/stats/visitors?site_id=10075…
3 Windows 8.1 1510 12.0 NaN NaN NaN http://clicky.com/stats/visitors?site_id=10075…
4 iPhone 1431 11.4 NaN NaN

Let’s look at all the columns:

We can see the data of any single column:

Now, try to print the Item column:

We get an error: Keyerror: Item, which means Python can’t find the Item column. (In the Ipython notebook, I have commented this line out, so that the rest of the code can run).

Now, we can see the Item column exists, because we can see it. We can even open the csv file in a text editor, and confirm it is there.

website_os3

And yet Pandas can’t find it!

We can do data[‘Value’] or data[‘Value Percent’], but not data[‘Item’].

data[‘Item’] is like a ghost item: We can see it, but we can’t touch it in any way. It’s an invisible field in our data structure.

This is one of those problems that can take forever to fix. You will try to check the spelling, caps etc, but nothing will work (not even screaming at the screen).

Before we go ahead, remember where we printed the columns? Copy paste the Item field from there. This won’t work in WordPress, so I’ll paste a screen capture (but you don’t need it, since you have the notebook open in another tab, right?)

website_os4

Python/Pandas expects some sort of a special character. Open up the csv file in a text editor. You won’t see any special characters. So where is it coming from?

I want you to do one more thing. Open the csv file in a hex editor (and if you don’t have one, I suggest you get HxD.)

website_os5This time we can see some extra characters. Look at the parts I’ve highlighted in the text above.

It’s the hex code 0xEF BB BF. I’ll save you some Googling and tell you what it is. It’s the Byte Order Mark, which you’ve (most likely) never heard of. File it under Trivia knowledge you will never need again.

Right, to the important part. How do we get rid of it?

Well, let’s just rename the field, as Item doesn’t make any sense.

Again, look at the notebook, as the special characters will be stripped above.

Operating System Value Value Percent Conversions Conversion Percent URL Stats URL
0 Windows 7 3290 26.2 NaN NaN NaN http://clicky.com/stats/visitors?site_id=10075…
1 Mac OS X 2362 18.8 NaN NaN NaN http://clicky.com/stats/visitors?site_id=10075…
2 Android 1922 15.3 NaN NaN NaN http://clicky.com/stats/visitors?site_id=10075…

That is easier to understand. Now, let’s get rid of all unneeded columns. We only need the operating system and the percentage of our pie chart.

We set the operating system as the index:

In the video, I wonder why this is needed, as we are not plotting a bar / line chart (which requires an x and y axis). The pie chart doesn’t contain an x axis, so we shouldn’t need to set the index. But that’s what the Gods of Pandas want, so that’s what we give them, to get this structure:

Value Percent
Operating System
Windows 7 26.2
Mac OS X 18.8
Android 15.3
Windows 8.1 12.0
iPhone 11.4

All that is left to do is to draw our pie chart. There are a lot of options, but I have literally taken this from the official docs.

I’ll go over the main options.  autopct tells Matplotlib to give the values as floating point values with two numbers before the decimal. shadow sets a shadow, explode makes one part of the pie chart stick out. I’ve switched off legend as it was making the code messy, and I’ve also changed the colours, are the default colours are Fugly.

The only thing I didn’t understand was this: *

Since this is a pie chart, we shouldn’t need a y  axis, but like I said before, this causes problems, so I put it in just to keep Pandas happy.

website_osAnd there you go.

The actual code was quite simple, except for the byte mark order thing that stumped us. But in Real Life(TM), this is quite common. Data is messy and unorganised.  We will look more at data cleaning in future examples.