Analyse visitors to a website

This is still an introductory video (i.e., the content is fairly simple).

All we will do is analyse and graph the visitors to a website. The key skill here is merging. The repo is here, and the file we are working with is Website visitors.ipynb.


This is another simple example; the goal here is to learn merging in Pandas.

The repo has two files: visitors.csv and visitors-new.csv. The first contains all visitors to a crappy unknown website (mine). The second contains the new visitors only (those visiting the website for the first time). Both files have the same format.

The first five lines are comments; we will need to skip them when we read the file.

The actual data starts below, and is in the format: date, visitors. So on 9 Feb, 2015, we got 59 visitors.

We start with importing the required libraries:

The data doesn’t contain any headers, so we have to define our own.

The only thing new above is: skiprows = 4. That’s because we need to skip the first 4 lines in the file, as they contain comments.

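Why 4? If you count the lines, you will see there are actually 5 unusable lines. This confused me for some time. We skip 4 rows because Pandas then reads the 5th line as the header row, which our own column names replace. (Strictly, Pandas only reads a header row here when header=0 is set; with names alone, it reads no header row at all.)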
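Here is a minimal sketch of such a read, using an in-memory stand-in for visitors.csv. The wording of the five unusable lines is made up for illustration (the real ones are not reproduced in this post), and header=0 is passed explicitly as an assumption, so that the 5th line is consumed as the header row and replaced by names.

```python
import io

import pandas as pd

# Hypothetical stand-in for visitors.csv: 5 unusable lines, then the data.
# The comment lines are invented; the data rows are the ones shown in this post.
raw = io.StringIO(
    "# line 1\n"
    "# line 2\n"
    "# line 3\n"
    "# line 4\n"
    "# day,visits\n"
    "2015-02-09,59\n"
    "2015-02-08,79\n"
    "2015-02-07,73\n"
)

# skiprows=4 drops the first 4 lines. header=0 (an assumption here) makes
# Pandas read the next line as the header row, and names replaces it.
visitors = pd.read_csv(raw, skiprows=4, header=0, names=["date", "visitors"])

print(visitors.head())
```

Dropping header=0 from this call would leave the "# day,visits" line in the data as the first row.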

date visitors
0 2015-02-09 59
1 2015-02-08 79
2 2015-02-07 73
3 2015-02-06 89
4 2015-02-05 80

The data looks correct. Let’s read the other csv file as well:

date visitors_new
0 2015-02-09 55
1 2015-02-08 64
2 2015-02-07 61
3 2015-02-06 79
4 2015-02-05 60

We have read both the csv files. You will note they contain a common field: date.

The dates are actually the same in both. If you look at the two dataframes, the first date in each is 2015-02-09.
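Rather than comparing the dates by eye, you can ask Pandas to check. This is only a sketch: the dataframe names visitors and visitors_new are assumptions, and the frames are rebuilt by hand from the five rows shown above instead of being read from the files.

```python
import pandas as pd

# First five rows of each file, as shown in the outputs above.
dates = ["2015-02-09", "2015-02-08", "2015-02-07", "2015-02-06", "2015-02-05"]
visitors = pd.DataFrame({"date": dates, "visitors": [59, 79, 73, 89, 80]})
visitors_new = pd.DataFrame({"date": dates, "visitors_new": [55, 64, 61, 79, 60]})

# True only if both date columns hold the same values in the same order.
same_dates = visitors["date"].equals(visitors_new["date"])
print(same_dates)
```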

We will now merge these two dataframes, so we can look at the combined data.

The pd.merge() function merges the two dataframes. We just pass in the dataframes, and because we don’t say which field to merge on, Pandas uses the column names the two have in common. Here that is only date, so it merges on date. You can also specify the field yourself with the on parameter.
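As a rough sketch of the merge, with and without naming the field, here it is on the five rows shown earlier. The variable names are assumptions, and the frames are built by hand so the snippet runs on its own.

```python
import pandas as pd

# First five rows of each file, as shown in the outputs above.
dates = ["2015-02-09", "2015-02-08", "2015-02-07", "2015-02-06", "2015-02-05"]
visitors = pd.DataFrame({"date": dates, "visitors": [59, 79, 73, 89, 80]})
visitors_new = pd.DataFrame({"date": dates, "visitors_new": [55, 64, 61, 79, 60]})

# With no `on` argument, merge joins on the columns the two frames share
# (here, only "date").
merged = pd.merge(visitors, visitors_new)

# The same merge, with the field named explicitly.
merged_explicit = pd.merge(visitors, visitors_new, on="date")

print(merged)
```

Naming the field is the safer habit once the dataframes share more than one column name, since Pandas would otherwise merge on all of them.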

date visitors visitors_new
0 2015-02-09 59 55
1 2015-02-08 79 64
2 2015-02-07 73 61
3 2015-02-06 89 79
4 2015-02-05 80 60

We see that the two separate structures have been merged, so we can now see visitors and new visitors on the same page.

Let’s sort by date now, so that the rows are in ascending order (they are in descending order at the moment, with the newest date first).

We now want to plot the data, and we want the date to be the x axis. As before, we will set the date as the index.
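The sorting and indexing steps might look like the sketch below, again on the five merged rows shown earlier. The name merged is an assumption, and so is converting the dates with pd.to_datetime. sort_values is the method in current Pandas; the original notebook may have used something else.

```python
import pandas as pd

# The five merged rows shown earlier, newest date first.
merged = pd.DataFrame({
    "date": ["2015-02-09", "2015-02-08", "2015-02-07", "2015-02-06", "2015-02-05"],
    "visitors": [59, 79, 73, 89, 80],
    "visitors_new": [55, 64, 61, 79, 60],
})

# Assumption: parse the strings into real dates, so sorting and the
# plot axis are chronological rather than alphabetical.
merged["date"] = pd.to_datetime(merged["date"])

# Oldest date first, then use the date as the index.
merged = merged.sort_values("date").set_index("date")

print(merged)
```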

visitors visitors_new
2014-07-14 5 4
2014-07-15 58 55
2014-07-16 18 15
2014-07-17 14 10
2014-07-18 11 9

This is our final sorted structure. We can now just plot it.
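A minimal plotting sketch, assuming matplotlib is installed (Pandas uses it for plotting). The name data and the y-axis label are assumptions; the rows are the five shown above.

```python
import pandas as pd

# The first five rows of the sorted structure shown above.
data = pd.DataFrame(
    {"visitors": [5, 58, 18, 14, 11], "visitors_new": [4, 55, 15, 10, 9]},
    index=pd.to_datetime(
        ["2014-07-14", "2014-07-15", "2014-07-16", "2014-07-17", "2014-07-18"]
    ),
)

# One line per column, with the index (the date) on the x-axis.
ax = data.plot()
ax.set_ylabel("visitors per day")
```

In a notebook the figure shows up inline; in a plain script you would need to show or save it yourself.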

The number of visitors closely matches new visitors, mainly because this was a fresh website, and most visitors were new.

Next: We look at pie chart plotting, and a very tricky problem that took me hours to fix.