**Machine Learning with an Amazon like Recommendation Engine**


If you have been to Amazon, you must have seen the “also bought” section, which recommends other movies or books you might like, based on the ones you have already bought or rated.

Amazon is a special case of this, as it hires hundreds of engineers whose job is to tweak this algorithm. And it works great for things like books. I have bought many great books based on Amazon’s recommendation. It works for movies as well. Where it fails is for items that are not usually rated/reviewed, at least as much as books/movies. Things like household products. I recently bought a drain cleaner, and was immediately bombarded with toilet cleaning products, even though the cleaner had been a one time buy.

That said, their algorithm works well in the general case, and is the main reason Amazon has become such a powerhouse. Amazon always recommends the items it thinks you will like. If you think that is no big deal, try going to Amazon’s rivals for books, like B&N, Apple or Kobo. They keep recommending the same 4-5 books, a list that is updated once a month and usually represents books by big publishers who have paid top dollar to be featured.

Like I said, Amazon’s algorithm is highly tweaked (and secret), and is based on years of experience. While we can’t replicate what they did, we can understand the theory of how the algorithm works.

At its heart, the algorithm looks at items you have rated, and tries to find people who rated the same items in the same way as you. It then checks which other books/movies they liked that you haven’t bought yet, and recommends them to you.

So if you constantly rate action movies as high, the algorithm will show you other highly recommended action movies. But what if you like both action and scifi? Based on my experience, you will get recommendations for both.

So how does the algorithm know which movies to recommend? In technical terms, it looks for the movies that are most correlated with the ones you rated.

**Pearson Correlation**

The Pearson correlation coefficient measures how strongly related two items are, i.e., will increasing one increase the other as well? The easiest way to understand this is via a diagram.

The first image shows positive correlation: the values move up in step. The second is an example of negative correlation, i.e., increasing one will decrease the other. That means they are inversely related. The third is random data: it shows no correlation. This is actually the most common case, as not everything is correlated with everything else (unless you believe in weird psychic phenomena).

How does this work in practice? Simple. Again, numpy already has functions for Pearson correlation. All you need to do is call them with different inputs.

In fact, there is more than one function for the Pearson coefficient in the scientific Python stack (SciPy, for example, has *scipy.stats.pearsonr()*). I will only use numpy’s *np.corrcoef()*. Let me give you a few quick examples of how it works.

```python
a = [1, 2, 3, 4, 5]
b = [1, 3, 9, 20, 22]

np.corrcoef(a,b)[1][0]
# Out[8]: 0.969954025101608
```

*np.corrcoef()* is the function we are using. It returns a value from -1 to 1. -1 is perfect negative correlation, while +1 is perfect positive correlation. 0 means no correlation. You may also note that it returns an array, and I’m reading the *[1][0]*th value. That is because this function can be used to compare multiple arrays, and so it returns a matrix of correlation values. For our case with only two arrays, we just need one of the off-diagonal values, and we read this one.

I declare two arrays, *a* and *b*. Notice that both contain increasing numbers. When I call the *np.corrcoef()* function on them, I get a value of about 0.97, a very high correlation.
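To make the matrix concrete, here is a small sketch that prints the whole result of *np.corrcoef()* for the two arrays above. The variable name `matrix` is mine, chosen only for illustration.

```python
import numpy as np

a = [1, 2, 3, 4, 5]
b = [1, 3, 9, 20, 22]

# For two inputs, corrcoef returns a 2x2 symmetric matrix:
#   [[corr(a, a), corr(a, b)],
#    [corr(b, a), corr(b, b)]]
matrix = np.corrcoef(a, b)

print(matrix.shape)
# The diagonal is each array compared with itself (always 1.0);
# the two off-diagonal entries hold the same a-versus-b value.
print(matrix[0][1], matrix[1][0])
```

Because the matrix is symmetric, reading *[0][1]* would work just as well as *[1][0]*.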

Now let’s try a more random input:

```python
a = [1, 2, 3, 4, 5]
c = [2 , 7, 9, 1, 0]

np.corrcoef(a,c)[1][0]
# Out[9]: -0.39904344223381105
```

This time I get about -0.4, a weak negative correlation. That makes sense, as random data has no real relationship, and with only five points the result rarely lands on exactly 0.
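If you are curious what the function computes, the sketch below works out the Pearson coefficient straight from its textbook definition and compares it with numpy. The helper name `pearson` is my own; it is only here for illustration and is not part of the code used later.

```python
import math
import numpy as np

def pearson(x, y):
    """Pearson correlation computed directly from its definition."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # How the two lists move together around their means
    co_movement = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # How much each list varies around its own mean
    spread_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x))
    spread_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y))
    return co_movement / (spread_x * spread_y)

a = [1, 2, 3, 4, 5]
c = [2, 7, 9, 1, 0]

by_hand = pearson(a, c)
by_numpy = np.corrcoef(a, c)[1][0]
print(by_hand, by_numpy)
```

Both numbers should agree to many decimal places, which is a handy sanity check that we are reading the right entry of the matrix.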

This will become more clear as we look at the example.

## Dive into the code

Okay, now that we know the theory, let’s dive into the code.

I have generated some random data for a few movie ratings. Let’s have a look at it (ml_data1.py):

```python
{'Terminator': {'Tom': 4.0, 'Jack': 1.5, 'Lisa': 3.0, 'Sally': 2.0},
 'Terminator 2': {'Tom': 5.0, 'Jack' : 1.0, 'Lisa': 3.5, 'Sally': 2.0},
 'It happened one night': {'Tom': 3.5, 'Jack': 3.5, 'Tiger': 4.0, 'Lisa': 5.0,
                           'Michele': 3.0, 'Sally': 4.0,},
 '27 Dresses': {'Tom': 3.0, 'Jack': 3.5, 'Tiger': 3.0, 'Lisa': 5.0,
                'Michele': 4.0, 'Sally': 4.0},
 'Poirot': {'Tom': 4.0, 'Jack': 3.0, 'Tiger': 5.0, 'Lisa': 4.0,
            'Michele': 3.5, 'Sally': 3.0, },
 'Sherlock Holmes': {'Tom': 4.0, 'Jack': 3.0, 'Tiger': 3.5, 'Lisa': 3.5,
                     'Sally': 2.0, }}
```

The movie data is stored as a dictionary. Each movie has its own sub-dictionary of reviewers and their ratings. Let’s have a look at the first movie:

```python
'Terminator': {'Tom': 4.0, 'Jack': 1.5, 'Lisa': 3.0, 'Sally': 2.0},
```

The movie Terminator has been rated by four people: Tom has given it a score of 4.0, while Jack has given it 1.5, and so on. These numbers are random (ie, I just made them up).

You will notice that not all people have rated all the movies. This is something we will need to take into account when we are calculating the correlation.

Let’s open up the file *calc_correlation.py*. Ignore the function for now; let’s see the main code:

```python
if len(sys.argv) < 2:
    print("Usage: python calc_correlation.py <data file.py>")
    exit(1)
```

We want to give the script a data file to calculate the correlation on, in this case, ml_data1.py, the file we looked at before. If the file is not given, we print the usage and exit.

```python
with open(sys.argv[1], 'r') as f:
    temp = f.read()
    movies_list = ast.literal_eval(temp)
    print(movies_list)
```

If you have never used the *with* statement in Python, it’s a cool feature. Normally, when you open a file, you have to remember to close it, even if something goes wrong halfway through. *with* does that for you: it opens the file and closes it at the end of the block, even if an error occurs inside the block. The error itself is still raised, so you can handle it as usual.
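To see what *with* saves you, here is a minimal sketch that reads the same file twice, once with manual cleanup and once with *with*. The temporary file and the names `manual` and `automatic` are made up for this example.

```python
import os
import tempfile

# Create a small throwaway file so the example can run anywhere.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as f:
    f.write("hello")

# Without "with": we must remember to close the file ourselves.
f = open(path, "r")
try:
    manual = f.read()
finally:
    f.close()

# With "with": the file is closed automatically when the block ends,
# even if an exception is raised inside it.
with open(path, "r") as f:
    automatic = f.read()

print(manual == automatic)
print(f.closed)
```

Both versions read the same text; the second one is simply shorter and harder to get wrong.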

Let’s look at the code line by line.

```python
with open(sys.argv[1], 'r') as f:
```

We open the file passed as first argument as read only.

```python
    temp = f.read()
    movies_list = ast.literal_eval(temp)
```

The first thing we do is read the file into a *temp* variable. The next line may require some explanation.

A nifty feature in Python is the *eval()* function. It allows you to generate Python code in real time and run it. Obviously, this is very dangerous, as you don’t know what you are reading. Someone may put in code to delete all your files. For this reason, *eval* is almost never used in practice.

The *literal_eval()* function in the *ast* library gets around the risks of *eval*. *literal_eval* will only read Python literals: strings, numbers, tuples, lists, dictionaries, sets, booleans and *None*. If you try to read anything else, such as a function call, it will raise an error.

But why do I need it anyway? The file I am reading, *ml_data1.py* contains a Python dictionary. However, when we read the file, it is read as a text file. I need to convert it to a Python dictionary. That is what *literal_eval* will do. It takes the data it read and converts it to a Python dictionary I can use. If I try to pass in something dangerous, the function will throw an error.
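Here is a small sketch of that behaviour. The strings `text` and `dangerous` are made up for the example; the second one stands in for the kind of code you would not want to run.

```python
import ast

# A dictionary stored as text, like the contents of our data file
text = "{'Terminator': {'Tom': 4.0, 'Jack': 1.5}}"
movies = ast.literal_eval(text)
print(type(movies), movies['Terminator']['Tom'])

# Anything that is not a plain literal is rejected instead of being executed.
dangerous = "__import__('os').getcwd()"
try:
    ast.literal_eval(dangerous)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

The function call is never run; *literal_eval* raises a `ValueError` as soon as it sees something other than a literal.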

There are other ways to do this, like using the *json* module, which I will show you later.

Anyway, now we have read the data from the file into a dictionary.

```python
correlated_dict = {}
for movie in movies_list:
    correlated_dict[movie] = find_correlation(movies_list, movie)
```

*movies_list* is what we read in from the file. We loop over that and find the correlation for each movie. Let’s look at how the function *find_correlation()* does that.

```python
def find_correlation(movie_list, movie_for_correlation):
    '''
    Input: movie_list - List of movies
           movie_for_correlation: The movie to calculate the correlation for

    Return: Dictionary of correlation for movie_for_correlation
    '''
```

The function takes two parameters: A list of movies, and the movie to find the correlation for (within that list). It returns a dictionary of correlated values.

```python
    correlate_dict = {}

    for movie in movie_list:
```

We declare *correlate_dict*, the final dictionary we will return. We then loop over the movie list.

```python
        # Don't include current movie in correlation, as you can't compare a movie to itself!
        if movie != movie_for_correlation:
            movie_for_correlation_list = []
            movie_to_compare_list = []
```

When doing the calculations, we don’t want to compare a movie to itself (as that would always show perfect correlation). So we check for that, and then declare a few empty arrays.

Remember our data file?

```python
{'Terminator': {'Tom': 4.0, 'Jack': 1.5, 'Lisa': 3.0, 'Sally': 2.0},
```

See that we have the movie, and then the people who reviewed it? Now, we loop over the reviewers.

```python
            for reviewer_name in movie_list[movie_for_correlation]:

                # Check that the reviewer has reviewed the current movie.
                # If so, calculate the correlation coefficient.
                # If they haven't reviewed the movie, then it makes no sense doing a correlation.
                if reviewer_name in movie_list[movie]:
```

One check we do (in the last line above) is to check that *this* reviewer (the one we are looping over), reviewed the current movie. If s/he didn’t, we move to the next one.

```python
                if reviewer_name in movie_list[movie]:
                    movie_for_correlation_list.append(movie_list[movie_for_correlation][reviewer_name])
                    movie_to_compare_list.append(movie_list[movie][reviewer_name])
```

We are still in the for loop. If the reviewer reviewed the movie, we save their score for the current movie, as well as their score for the movie under consideration.

To make it clear, let’s look at the whole for loop:

```python
            # Loop through the people who reviewed the movie
            for reviewer_name in movie_list[movie_for_correlation]:

                # Check that the reviewer has reviewed the current movie.
                # If so, calculate the correlation coefficient.
                # If they haven't reviewed the movie, then it makes no sense doing a correlation.
                if reviewer_name in movie_list[movie]:
                    movie_for_correlation_list.append(movie_list[movie_for_correlation][reviewer_name])
                    movie_to_compare_list.append(movie_list[movie][reviewer_name])
```

This for loop will loop over the reviewers, and if they reviewed both the movies, will store their scores in two arrays. That’s all we are doing. The complicated code is to compensate for the fact that not all people reviewed all movies.

```python
            correlate_dict[movie] = np.corrcoef(movie_for_correlation_list,movie_to_compare_list)[1][0]

    return correlate_dict
```

Finally, we calculate the correlation using the *np.corrcoef()* function we saw earlier. The function creates a dictionary with correlation scores for the current movie. For example, if the movie is *Terminator*, the dictionary would store what correlation score *Terminator* has to *Poirot*, to *Sherlock Holmes* etc.

Each movie will have its own dictionary. So *Poirot* will have a dictionary which stores its correlation to *Terminator* and the other movies.

We return the dictionary.
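Putting the pieces together, the function can be sketched as one runnable unit. The indentation is my reconstruction from the fragments above, and the two-movie dictionary `movies` is just a cut-down copy of the data file so the example stays short.

```python
import numpy as np

def find_correlation(movie_list, movie_for_correlation):
    """Return {other_movie: correlation} for movie_for_correlation."""
    correlate_dict = {}

    for movie in movie_list:
        # Don't compare a movie to itself
        if movie != movie_for_correlation:
            movie_for_correlation_list = []
            movie_to_compare_list = []

            # Keep only the scores of reviewers who rated both movies
            for reviewer_name in movie_list[movie_for_correlation]:
                if reviewer_name in movie_list[movie]:
                    movie_for_correlation_list.append(
                        movie_list[movie_for_correlation][reviewer_name])
                    movie_to_compare_list.append(movie_list[movie][reviewer_name])

            correlate_dict[movie] = np.corrcoef(
                movie_for_correlation_list, movie_to_compare_list)[1][0]

    return correlate_dict

# Two movies copied from ml_data1.py; Tiger and Michele only rated one of them
movies = {
    'Terminator': {'Tom': 4.0, 'Jack': 1.5, 'Lisa': 3.0, 'Sally': 2.0},
    '27 Dresses': {'Tom': 3.0, 'Jack': 3.5, 'Tiger': 3.0, 'Lisa': 5.0,
                   'Michele': 4.0, 'Sally': 4.0},
}

result = find_correlation(movies, 'Terminator')
print(result)
```

Only Tom, Jack, Lisa and Sally rated both movies, so only their four scores go into the calculation. One thing to keep in mind: if two movies share fewer than two reviewers, there is nothing to correlate and numpy will return `nan`, so a real system would want to check for that.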

Back in the main loop:

```python
for movie in movies_list:
    correlated_dict[movie] = find_correlation(movies_list, movie)

print(correlated_dict)
```

As I said, we are creating a correlation dictionary for each movie. The goal is to create a correlation index we can use later to make recommendations.

```python
json.dump(correlated_dict, open("corr_dict.py",'w'))
```

Remember I said I’d show you another way to write Python dictionaries to file? This is the easier way, using the json module. It opens the file and dumps our dictionary in one go. Let’s open our file *corr_dict.py* and look at it. I’ve cleaned it a bit.

```python
"27 Dresses": {"Poirot": -0.30434782608695654,
               "Terminator 2": -0.12547286652195427,
               "It happened one night": 0.5791405068790082,
               "Sherlock Holmes": -0.27583864218368526,
               "Terminator": -0.15404159684748153},
```

So *27 Dresses* has a -0.3 correlation to *Poirot*, but a +0.58 to *It happened one night*, which makes sense as both are romantic movies (or not, as I deliberately made up these examples to have this correlation).
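The round trip through the *json* module can be checked with a short sketch. The file name `corr_demo.json` and the one-entry dictionary are made up for the example, so it does not touch the real *corr_dict.py*.

```python
import json
import os
import tempfile

# A tiny stand-in for the correlation dictionary
correlations = {"27 Dresses": {"Poirot": -0.30434782608695654}}

# Write it to a throwaway file...
path = os.path.join(tempfile.mkdtemp(), "corr_demo.json")
with open(path, "w") as f:
    json.dump(correlations, f)

# ...and read it straight back
with open(path) as f:
    loaded = json.load(f)

print(loaded == correlations)
```

Note that JSON always uses double quotes for strings, which is why the saved file looks slightly different from the Python data file we started with.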

## Building the recommendation engine

To understand what I’m doing, let’s work through a few examples.

**Using correlation to rate movies**

Say you have two movies, A and B. You have given a rating of 4.0 to A.

The correlation between A and B is 1.0. Since 1.0 is the highest value, that means the movies are perfectly correlated, and the person who likes one will like the other.

So your estimated score for B will be:

*4.0 x 1.0 = 4.0*

So you would give a rating of 4.0 to movie B as well (at least according to the algorithm).

What if the correlation was 0.5? Your calculated score for B would be:

*4.0 x 0.5 = 2.0*

And what if it was -0.5? Your score would be *-2.0* , which means you would hate the movie.

Remember, what I’m doing is looking at the movies I’ve rated, checking their correlation to movies I haven’t rated, and then estimating my anticipated score. And I do this for all three movies I’ve rated.

Before we start on that, I’ve cleaned up corr_dict.py for humans (as it’s originally only used by machines). The cleaned copy is in *corr_dict_cleaned.py*:

```python
{
"27 Dresses": {"Poirot": -0.30434782608695654, "Terminator 2": -0.13998740998253317,
               "It happened one night": 0.5791405068790082,
               "Sherlock Holmes": -0.27583864218368526, "Terminator": -0.16796775328675631},

"It happened one night": {"Terminator 2": 0.25275763912268084, "27 Dresses": 0.5791405068790082,
                          "Terminator": 0.42437890059987993, "Sherlock Holmes": 0.0,
                          "Poirot": 0.2895702534395041},

"Terminator 2": {"Poirot": 0.87845527687491742, "27 Dresses": -0.13998740998253317,
                 "It happened one night": 0.25275763912268084,
                 "Sherlock Holmes": 0.73889576951817515, "Terminator": 0.92184471452179317},

"Terminator": {"Terminator 2": 0.92184471452179317, "27 Dresses": -0.16796775328675631,
               "It happened one night": 0.42437890059987993,
               "Sherlock Holmes": 0.77020798423740755, "Poirot": 0.72664794872022465},

"Poirot": {"Terminator 2": 0.87845527687491742, "27 Dresses": -0.30434782608695654,
           "It happened one night": 0.2895702534395041,
           "Sherlock Holmes": 0.6698938453032357, "Terminator": 0.72664794872022465},

"Sherlock Holmes": {"Poirot": 0.6698938453032357, "Terminator 2": 0.73889576951817515,
                    "27 Dresses": -0.27583864218368526, "It happened one night": 0.0,
                    "Terminator": 0.77020798423740755}
}
```

Let’s start with 27 dresses, and try to calculate its anticipated score:

My Movie (A) | My rating for movie (B) | Correlation of (A) with 27 Dresses (C) | My calculated score: (B) * (C) |
---|---|---|---|
Terminator | 5.0 | -0.16 | -0.8 |
Sherlock Holmes | 4.0 | -0.27 | -1.08 |
Poirot | 4.5 | -0.3 | -1.35 |
Total | | | -3.23 |

Column A lists the movies I rated, and column B is the score I gave them. Column C is the correlation of the current movie (27 Dresses) with the movie I rated (for example, Terminator). I multiply B by C to get the score I would be expected to give to the current movie (27 Dresses).

So for the very first movie, Terminator, we get a score of *5.0 x -0.16 = -0.8*.

I then do this for all three movies I have rated, and total them up to get one score.

The total score is -3.23. There are many ways to turn this into a single rating, but I will use the simplest: the average. I divide the total by the number of movies I rated (three) to get roughly -1.08. This is my expected rating for *27 Dresses*. Clearly, this movie is not recommended.
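The same arithmetic can be sketched in a few lines of Python. The dictionary names are mine, and the correlations are the shortened values from the table, not the full-precision ones from the file.

```python
# My ratings (column B) and the shortened correlations with 27 Dresses (column C)
my_ratings = {'Terminator': 5.0, 'Sherlock Holmes': 4.0, 'Poirot': 4.5}
corr_with_27_dresses = {'Terminator': -0.16, 'Sherlock Holmes': -0.27, 'Poirot': -0.3}

# Multiply each rating by its correlation and add everything up
total = sum(my_ratings[m] * corr_with_27_dresses[m] for m in my_ratings)

# Average over the number of movies I rated
average = total / len(my_ratings)

print(total, average)  # roughly -3.23 and -1.077
```

A negative expected rating is outside the usual 0 to 5 scale, but that does not matter here: we only use the number to rank movies against each other.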

Let’s do the same for Terminator 2:

My Movie (A) | My rating for movie (B) | Correlation of (A) with Terminator 2 (C) | My calculated score: (B) * (C) |
---|---|---|---|
Terminator | 5.0 | 0.92 | 4.6 |
Sherlock Holmes | 4.0 | 0.73 | 2.92 |
Poirot | 4.5 | 0.87 | 3.9 |
Total | | | 11.43 |

Dividing by 3, the average score is 3.8 for *Terminator 2*. This movie would be strongly recommended.

Let’s now look at the code in *ml_main.py*:

```python
import pdb
import json

# My ratings for movies.
my_movies = {'Terminator': 5.0,
             'Sherlock Holmes' : 4.0,
             'Poirot' : 4.5
             }
```

After importing the modules we need, we declare a dictionary called *my_movies*, which contains a few movies that I have rated. This will be used to drive our recommendation engine.

```python
# Read the data from the correlation dictionary we calculated earlier
correlated_dict = json.load(open("corr_dict.py"))

# A dictionary to store the total of my calculated votes
total_my_votes = {}

# A running total to store intermediate results.
running_total = 0
```

We open the dictionary we created, *corr_dict.py*, and declare some variables. We are using the *json* library we used last time.

```python
# Loop over rated movies
for movie_key in my_movies.keys():
```

We loop over the movies we rated.

```python
    # Loop over the dictionary of correlation coefficients
    for movie_to_compare in correlated_dict[movie_key]:
        running_total = 0
```

For the current movie we are looping over, we look at the correlation coefficients. To remind you what that means, for our first movie *Terminator*, these are the coefficients:

```python
"Terminator":
{
    "Terminator 2": 0.92184471452179317,
    "27 Dresses": -0.16796775328675631,
    "It happened one night": 0.42437890059987993,
    "Sherlock Holmes": 0.77020798423740755,
    "Poirot": 0.72664794872022465
}
```

The above shows that Terminator has a 0.92 correlation with Terminator 2. Since 1.0 is the max value, this shows a very strong correlation.

In the next line, we loop over all the movies we have a correlation for. Obviously, this will include movies we have already rated. So our next step is to get rid of them, as we only want the correlation for movies we have not seen or rated.

```python
        if movie_to_compare not in my_movies.keys():
```

Next line:

```python
            # If this is the first time we are running the code, we won't have anything stored.
            # In that case, create a new dictionary element.
            if movie_to_compare not in total_my_votes:
                # Line below creates a new dictionary element for total_my_votes and gives it a value.
                total_my_votes.setdefault(movie_to_compare, (correlated_dict[movie_key][movie_to_compare] * my_movies[movie_key]) )
            else:
                # If this is not the first time, merely update the values we have created before
                total_my_votes[movie_to_compare] += correlated_dict[movie_key][movie_to_compare] * my_movies[movie_key]
```

What I am doing is adding up the weighted scores for each movie I have not rated. The first time we see a movie, this code is called:

```python
            if movie_to_compare not in total_my_votes:
                # Line below creates a new dictionary element for total_my_votes and gives it a value.
                total_my_votes.setdefault(movie_to_compare, (correlated_dict[movie_key][movie_to_compare] * my_movies[movie_key]) )
```

It merely says that if this is the first time, create a new dictionary element for *total_my_votes[movie_to_compare]* , and give it the value of *correlated_dict[movie_key][movie_to_compare] x my_movies[movie_key]*, which is the same as the calculations I showed you earlier. That’s all the function *setdefault* does: It creates a dictionary element if none exists.
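If *setdefault* is new to you, this small sketch shows how it behaves. The dictionary `totals` and the numbers in it are made up for the example; the last two lines show an alternative way to keep a running total with *get*, which is my variation and not what *ml_main.py* does.

```python
totals = {}

# First call: the key is missing, so it is created with the given value.
totals.setdefault('Terminator 2', 4.6)

# Second call: the key already exists, so the stored value is left untouched.
totals.setdefault('Terminator 2', 99.0)
after_setdefault = totals['Terminator 2']
print(after_setdefault)

# Alternative: get() returns 0.0 for a missing key, so one line
# covers both the "first time" and the "update" case.
totals['Terminator 2'] = totals.get('Terminator 2', 0.0) + 2.92
print(totals)
```

Either style works; the explicit if/else in the original code just makes the two cases easier to see.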

The second time we enter the loop, we update the dictionary value:

```python
            else:
                # If this is not the first time, merely update the values we have created before
                total_my_votes[movie_to_compare] += correlated_dict[movie_key][movie_to_compare] * my_movies[movie_key]
```

As you can see, I am keeping the total. As I explained in the example above, I keep the total, and later average it. We do it like this:

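```python
recommended_movies = {}
for movie_key in total_my_votes.keys():
    # Average over the number of movies I rated, as in the hand calculation.
    # (The original divided by len(total_my_votes), which only gives the same
    # answer here because both counts happen to be 3.)
    recommended_movies[movie_key] = total_my_votes[movie_key] / len(my_movies)
```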

Finally, we run our code to see which movies would be recommended. I’m using > 3.0 as strong recommendation, >0.0 as normal recommendation, 0 or less as not recommended.

```python
for movie_key in recommended_movies:
    if recommended_movies[movie_key] > 3.0:
        print("Strongly recommended for you: ", movie_key)
    elif recommended_movies[movie_key] > 0.0:
        print("Recommended for you: ", movie_key)
    else:
        print("Not recommended: ", movie_key)
```

Running the code:

```
$ python ml_main.py

total_my_votes = {u'Terminator 2': 11.517855396618794, u'27 Dresses': -3.312758552559827, u'It happened one night': 3.424960643477168}

{u'Terminator 2': 3.8392851322062644, u'27 Dresses': -1.1042528508532756, u'It happened one night': 1.1416535478257226}

Strongly recommended for you: Terminator 2
Not recommended: 27 Dresses
Recommended for you: It happened one night
```

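We are getting roughly the same results for Terminator 2 and 27 Dresses as we calculated by hand. The small differences are there because the hand calculation used shortened correlation values.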
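For reference, the whole scoring step can be sketched as one small function. The function name `recommend` is my own, and instead of reading *corr_dict.py* the sketch carries a copy of the three relevant entries from *corr_dict_cleaned.py* so it can run on its own.

```python
def recommend(my_movies, correlated_dict):
    """Return {unrated_movie: expected_rating} using rating x correlation."""
    total_my_votes = {}
    for movie_key, my_rating in my_movies.items():
        for movie_to_compare, corr in correlated_dict[movie_key].items():
            # Skip movies I have already rated
            if movie_to_compare not in my_movies:
                total_my_votes[movie_to_compare] = (
                    total_my_votes.get(movie_to_compare, 0.0) + corr * my_rating)
    # Average over the number of movies I rated
    return {movie: total / len(my_movies) for movie, total in total_my_votes.items()}

my_movies = {'Terminator': 5.0, 'Sherlock Holmes': 4.0, 'Poirot': 4.5}

# Correlations for the three movies I rated, copied from corr_dict_cleaned.py
correlated_dict = {
    "Terminator": {"Terminator 2": 0.92184471452179317, "27 Dresses": -0.16796775328675631,
                   "It happened one night": 0.42437890059987993,
                   "Sherlock Holmes": 0.77020798423740755, "Poirot": 0.72664794872022465},
    "Sherlock Holmes": {"Poirot": 0.6698938453032357, "Terminator 2": 0.73889576951817515,
                        "27 Dresses": -0.27583864218368526, "It happened one night": 0.0,
                        "Terminator": 0.77020798423740755},
    "Poirot": {"Terminator 2": 0.87845527687491742, "27 Dresses": -0.30434782608695654,
               "It happened one night": 0.2895702534395041,
               "Sherlock Holmes": 0.6698938453032357, "Terminator": 0.72664794872022465},
}

scores = recommend(my_movies, correlated_dict)
for movie, score in scores.items():
    print(movie, round(score, 2))
```

Wrapping the logic in a function like this makes it easy to try different sets of ratings and see how the recommendations change.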

## In real life

You will have seen that I kept the code to calculate the correlation coefficients separate from the main code. That’s because these would change every day and every hour, based on what users were buying.

You would then have to run *calc_correlation.py* regularly with the updated data. This would typically be done at night, when the servers were not loaded. If you are someone like Amazon, you’d have millions and millions of entries in your dataset, and this could take a fair bit of time. Of course, then you’d be using optimised database technologies to handle this large amount of data.

And that’s it. Hopefully, you know a bit about machine learning now. If you want a challenge, or want to improve your machine learning skills, the next step is to take some real life data, like the MovieLens dataset, and build a recommendation engine based on that.