Build a Sentiment Analysis app with Movie Reviews

0. Introduction to NLP and Sentiment Analysis

1. Natural Language Processing with NLTK

2. Intro to NLTK, Part 2

3. Build a sentiment analysis program: We finally use all we learnt above to make a program that analyses the sentiment of movie reviews.

4. Sentiment Analysis with Twitter: A practice session for you, with a bit of learning.


So now we use everything we have learnt to build a Sentiment Analysis app.

Sentiment Analysis means finding the mood of the public about things like movies, politicians, stocks, or even current events. We will analyse the sentiment of the movie reviews corpus we saw earlier.

Sentiment Analysis.ipynb is the file we are working with.

Only interested in videos? Go here for the next part: Sentiment Analysis with Twitter

Let’s import our libraries:

We will be using the Naive Bayes classifier for this example. Naive Bayes is a fairly simple machine learning algorithm that works mainly with probabilities. Stack Overflow has a great (if slightly long) explanation of how it works. The top 2 answers are worth reading.
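To give a rough feel for what "works mainly with probabilities" means, here is a minimal pure-Python sketch of the idea. It is not NLTK's implementation: the function names, the tiny training set, and the add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(samples):
    """samples is a list of (features, label) pairs; features is {word: True}."""
    label_counts = Counter()            # how many samples carry each label
    word_counts = defaultdict(Counter)  # label -> word -> samples containing it
    vocabulary = set()
    for features, label in samples:
        label_counts[label] += 1
        for word in features:
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def classify(model, features):
    """Return the label with the highest log-probability for these words."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)  # log P(label)
        for word in features:
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one smoothed estimate of P(word present | label)
            p = (word_counts[label][word] + 1) / (count + 2)
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data, only to show the mechanics.
samples = [
    ({"great": True, "fun": True}, "positive"),
    ({"great": True, "moving": True}, "positive"),
    ({"awful": True, "boring": True}, "negative"),
    ({"awful": True, "dull": True}, "negative"),
]
model = train_naive_bayes(samples)
print(classify(model, {"great": True}))                # positive
print(classify(model, {"awful": True, "dull": True}))  # negative
```

The "naive" part is the loop over words: each word's probability is multiplied in (added, in log space) as if the words were independent of each other.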

Before we start, there is something that had me stumped for a long time. I saw it in all the examples, but it didn’t make sense: the Naive Bayes classifier in the NLTK library expects its input in a particular format, where every word is followed by True. So for example, if you have these words:

you need to pass it in as:

It’s just a quirk of the NLTK library. What bothers me is that none of the dozens of tutorials/videos I looked at makes this clear. They just write this weird code to do so, and expect you to figure it out for yourself. (Hurray for us).
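As a small illustration of that format (the two words here are made up, not taken from the original example), a list of words becomes a dictionary that maps each word to True:

```python
# Hypothetical input words, chosen only for illustration.
words = ["hello", "world"]

# Pair every word with True, which is the shape described above.
features = {word: True for word in words}
print(features)  # {'hello': True, 'world': True}
```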

I’ll show you the function I wrote, and hopefully, you will understand why we need to do it this way. Here is the function:

Let’s go over it line by line.

The first thing we do is remove all stopwords. This is what we did in the last lesson. This step is optional.

We then build a dictionary that maps each remaining word to True. Why a dictionary? So that words are not repeated. A key can appear only once, so if a word already exists, it won’t be added to the dictionary a second time.

Let’s see how this works:

We call our function with the string “the quick brown quick a fox”.

You can see that a) the stop words are removed, b) repeated words are removed, and c) there is a True with each word.
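A minimal sketch of such a function might look like the following. The name create_word_features and the small hard-coded stop word set are assumptions that keep the example self-contained; the tutorial itself presumably uses NLTK's English stop word list from the previous lesson.

```python
# Small stand-in stop word set (an assumption, to avoid a corpus download).
# With NLTK you could use set(nltk.corpus.stopwords.words("english")) instead.
STOP_WORDS = {"a", "an", "the", "is", "of", "and"}

def create_word_features(words, stop_words=STOP_WORDS):
    # Step 1 (optional): drop the stop words.
    useful_words = [word for word in words if word not in stop_words]
    # Step 2: map each remaining word to True; duplicate words collapse
    # into a single dictionary key.
    return {word: True for word in useful_words}

print(create_word_features(["the", "quick", "brown", "quick", "a", "fox"]))
# {'quick': True, 'brown': True, 'fox': True}
```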

Again, this is just the format the Naive Bayes classifier in nltk expects.

Okay, let’s start with the code. Remember, the sentiment analysis code is just a machine learning algorithm that has been trained to identify positive/negative reviews.

We create an empty list called neg_reviews. Next, we loop over all the files in the neg folder.

We get all the words in that file.

Then we use the function we wrote earlier to create word features in the format nltk expects. Here is a sample of the output:

So there are 1,000 negative reviews.
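The loop described above could be sketched roughly like this. The helper name load_reviews and the label string "negative" are assumptions, and the corpus is passed in as a parameter so the sketch does not depend on a corpus download; the feature function here skips the optional stop word step.

```python
def create_word_features(words):
    # Simplified version: skips the optional stop word removal.
    return {word: True for word in words}

def load_reviews(corpus, category, label):
    """Build one (features, label) pair for every file in a corpus category."""
    reviews = []
    for fileid in corpus.fileids(category):
        words = corpus.words(fileid)  # all the words in that file
        reviews.append((create_word_features(words), label))
    return reviews

# Usage, assuming NLTK is installed and the corpus has been fetched with
# nltk.download("movie_reviews"):
#
#   from nltk.corpus import movie_reviews
#   neg_reviews = load_reviews(movie_reviews, "neg", "negative")
#   print(len(neg_reviews))
```

The same call with "pos" and a positive label would build the second list.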

Let’s do the same for the positive reviews. The code is exactly the same:

So we have 1,000 negative and 1,000 positive reviews, for a total of 2,000. We will now create our test and train samples, this time manually:

We end up with 1,500 training samples and 500 test samples.
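One way to get those numbers is to take the first 750 reviews of each kind for training and the remaining 250 of each for testing. The exact slice points are an assumption, and the lists below are placeholders standing in for the real review features:

```python
# Placeholder data standing in for the 1,000 negative and 1,000 positive reviews.
neg_reviews = [({"awful": True}, "negative")] * 1000
pos_reviews = [({"great": True}, "positive")] * 1000

# First 750 of each class for training, the remaining 250 of each for testing.
train_set = neg_reviews[:750] + pos_reviews[:750]
test_set = neg_reviews[750:] + pos_reviews[750:]

print(len(train_set), len(test_set))  # 1500 500
```

Slicing each class separately keeps both sets balanced between negative and positive reviews.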

Let’s create our Naive Bayes Classifier, and train it with our training set.

And let’s use our test set to find the accuracy:

An accuracy of 72%. Could you improve it? How?
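Accuracy here simply means the fraction of test reviews whose predicted label matches the real one. The sketch below computes it by hand for any object with a classify() method; the AlwaysPositive class and the four test samples are made up so that the example runs without NLTK.

```python
def accuracy(classifier, test_set):
    """Fraction of (features, label) pairs the classifier labels correctly."""
    correct = sum(1 for features, label in test_set
                  if classifier.classify(features) == label)
    return correct / len(test_set)

class AlwaysPositive:
    """Hypothetical stand-in classifier, used only to make the sketch runnable."""
    def classify(self, features):
        return "positive"

test_set = [
    ({"great": True}, "positive"),
    ({"fun": True}, "positive"),
    ({"moving": True}, "positive"),
    ({"awful": True}, "negative"),
]
print(accuracy(AlwaysPositive(), test_set))  # 0.75

# With NLTK, the equivalent steps would be roughly:
#   import nltk
#   classifier = nltk.NaiveBayesClassifier.train(train_set)
#   print(nltk.classify.util.accuracy(classifier, test_set))
```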

For now, I want to show you how to classify a review as negative or positive. But before that, a warning.

The problem with sentiment analysis, as with any machine learning approach, is that your algorithm is only as good as your data. If your data is crap, your algorithm will be crap.

Not only that, the algorithm depends on the type of input you train it with. So if you train your algorithm on long movie reviews, it will not work well with Twitter data, which is much shorter.

This particular dataset is, imo, a bit small. Also, the reviews are very informal and use a lot of swear words, which is why I found the classifier not very accurate on IMDb reviews, where swearing is discouraged and reviews are (slightly) more formal.

Anyway, I was looking for negative and positive reviews. Our algorithm is more accurate when the review contains stronger words (horrible instead of bad). For the bad reviews, I found this gem of a movie. A real masterpiece:

We need to word_tokenize the text, call our function on it, and then use the classify() function to let our algorithm decide if this is a positive or negative review.

That was correct, but only because the review was really scathing.
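Those three steps (tokenize, build features, classify) can be sketched as one small helper. Everything named here is an assumption for illustration: classify_review, the KeywordClassifier stand-in, and the short review text are not from the original notebook, and str.split stands in for NLTK's word_tokenize.

```python
def classify_review(text, classifier, tokenize, make_features):
    words = tokenize(text)            # 1. split the review into words
    features = make_features(words)   # 2. convert to the {word: True} format
    return classifier.classify(features)  # 3. let the classifier decide

def to_features(words):
    return {word: True for word in words}

class KeywordClassifier:
    """Hypothetical stand-in for a trained classifier."""
    def classify(self, features):
        return "negative" if "horrible" in features else "positive"

print(classify_review("a horrible film", KeywordClassifier(), str.split, to_features))
# negative

# With NLTK (and its tokenizer data downloaded), the call would look roughly like:
#   classify_review(review_text, classifier, nltk.word_tokenize, create_word_features)
```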

For the positive review, I chose one of my favourite movies, Spirited Away, a very beautiful movie:

Repeat the steps:

Correct again, but I’d like to repeat: the classifier isn’t very accurate overall, I suspect because the original sample is very small and not very representative of IMDb reviews. But it’s good enough for learning.

Okay, the next video is not just a practice session, but also contains some learning exercises, so I strongly recommend you do it. We will build a Sentiment analysis engine with Twitter data.

Sentiment Analysis with Twitter