In this article, we'll build a recommendation engine with machine learning. It's a good first machine learning project, and the finished engine can recommend items to users of a web site or app.
There are many different ways to build a recommendation engine, but we'll be using a technique called collaborative filtering. Collaborative filtering recommends items to a user based on the ratings of other users whose past ratings are similar.
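To make the idea concrete, here is a minimal sketch of user-based collaborative filtering on a hypothetical toy rating matrix. The data, the function names, and the choice of cosine similarity are illustrative assumptions; this is not the method used later in the article.

```python
import numpy as np

# Hypothetical toy ratings: rows are users, columns are items, 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],   # user 0
    [4, 5, 1, 0],   # user 1
    [1, 0, 5, 4],   # user 2
], dtype=float)

def cosine_similarity(a, b):
    """Cosine similarity computed only over items both users rated."""
    both = (a > 0) & (b > 0)
    if not both.any():
        return 0.0
    return float(a[both] @ b[both] / (np.linalg.norm(a[both]) * np.linalg.norm(b[both])))

def predict_rating(ratings, user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    weighted_sum, weight_total = 0.0, 0.0
    for other in range(ratings.shape[0]):
        if other == user or ratings[other, item] == 0:
            continue
        sim = cosine_similarity(ratings[user], ratings[other])
        weighted_sum += sim * ratings[other, item]
        weight_total += abs(sim)
    return weighted_sum / weight_total if weight_total else 0.0

sim_01 = cosine_similarity(ratings[0], ratings[1])   # users 0 and 1 agree closely
sim_02 = cosine_similarity(ratings[0], ratings[2])   # users 0 and 2 mostly disagree
pred = predict_rating(ratings, user=0, item=2)       # item 2 is unrated by user 0
```

Because user 0 is more similar to user 1 than to user 2, the predicted rating for item 2 is pulled toward user 1's low rating rather than user 2's high one.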
We'll be using the MovieLens dataset to build our recommendation engine. The MovieLens dataset is a collection of movie ratings from the MovieLens web site. It's a great dataset for building a recommendation engine because it has a large number of ratings (over 100,000) and a wide variety of movies.
The MovieLens dataset is available from the GroupLens web site. GroupLens is a research group at the University of Minnesota that specializes in recommender systems.
The dataset we'll be using is the "small" dataset, which contains 100,000 ratings from 943 users on 1682 movies. The ratings are on a scale of 1 to 5, with 5 being the highest rating.
The dataset is available in two files: ratings.dat, which holds the ratings, and movies.dat, which holds the movie titles and genres.
We'll be using the pandas library to read in the data.
import pandas as pd
# read in the ratings data ('::' is a multi-character separator, so pandas needs the Python parsing engine)
ratings_df = pd.read_csv('ratings.dat', sep='::', engine='python', header=None, names=['user_id', 'movie_id', 'rating', 'timestamp'])
# read in the movie titles and genres
movies_df = pd.read_csv('movies.dat', sep='::', engine='python', header=None, names=['movie_id', 'title', 'genres'])
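If you want to check the parsing step without the files on disk, the sketch below reads a hypothetical two-line sample in the same `::`-delimited layout from memory. The sample rows and the name `sample_df` are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical two-line sample in the same "::"-delimited layout as ratings.dat.
sample = io.StringIO("1::1193::5::978300760\n1::661::3::978302109\n")

sample_df = pd.read_csv(
    sample,
    sep='::',
    engine='python',   # multi-character separators need the Python parsing engine
    header=None,
    names=['user_id', 'movie_id', 'rating', 'timestamp'],
)
```

Each `::`-separated field ends up in its own integer column, which is the shape the rest of the walkthrough relies on.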
Let's take a look at the ratings data.
ratings_df.head()
|   | user_id | movie_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |
And the movie data.
movies_df.head()
|   | movie_id | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
Before we build our recommendation engine, let's do some exploratory data analysis to get a better understanding of the data.
First, let's see how many ratings there are for each movie.
# get the number of ratings for each movie
movie_ratings_df = ratings_df.groupby('movie_id').size().reset_index(name='ratings_count')
# merge the ratings counts with the movie titles
movies_df = pd.merge(movies_df, movie_ratings_df, on='movie_id')
# sort the movies by the number of ratings
movies_df = movies_df.sort_values(by='ratings_count', ascending=False)
# print the top 5 movies
movies_df.head()
|   | movie_id | title | genres | ratings_count |
|---|---|---|---|---|
| 50 | 50 | Star Wars (1977) | Action\|Adventure\|Fantasy\|Sci-Fi | 583 |
| 257 | 257 | Contact (1997) | Drama\|Sci-Fi\|War | 509 |
| 99 | 99 | Fargo (1996) | Crime\|Drama\|Thriller | 508 |
| 180 | 180 | Return of the Jedi (1983) | Action\|Adventure\|Fantasy\|Sci-Fi | 507 |
| 293 | 293 | Air Force One (1997) | Action\|Drama\|Thriller | 485 |
Three of the top 5 movies (Star Wars, Contact, and Return of the Jedi) are science fiction titles, and the other two (Fargo and Air Force One) are thrillers. Well-known, widely seen films tend to collect the most ratings.
Now, let's see how many ratings there are for each user.
# get the number of ratings for each user
user_ratings_df = ratings_df.groupby('user_id').size().reset_index(name='ratings_count')
# attach each user's rating count to every one of that user's rating rows
ratings_df = pd.merge(ratings_df, user_ratings_df, on='user_id')
# sort the rating rows by their user's rating count
ratings_df = ratings_df.sort_values(by='ratings_count', ascending=False)
# show the first 5 rows, which all belong to the most active user
ratings_df.head()
|   | user_id | movie_id | rating | timestamp | ratings_count |
|---|---|---|---|---|---|
| 849349 | 405 | 1210 | 4 | 974784724 | 737 |
| 849348 | 405 | 1188 | 4 | 974784726 | 737 |
| 849347 | 405 | 1136 | 4 | 974784733 | 737 |
| 849346 | 405 | 1125 | 3 | 974784742 | 737 |
| 849345 | 405 | 1122 | 2 | 974784745 | 737 |
All five rows belong to the same user, 405, who has rated 737 movies. That is because the table holds one row per rating, not one row per user. To rank distinct users, sort user_ratings_df by ratings_count instead.
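A ranking with one row per user can be sketched as follows. The miniature `toy_ratings` table is hypothetical and only mirrors the column names used above; the same chain of calls should apply to the real ratings frame.

```python
import pandas as pd

# Hypothetical miniature ratings table with the same column names as ratings_df.
toy_ratings = pd.DataFrame({
    'user_id':  [1, 1, 1, 2, 2, 3],
    'movie_id': [10, 20, 30, 10, 40, 20],
    'rating':   [5, 3, 4, 2, 5, 4],
})

# One row per user with that user's rating count, most active user first.
top_users = (
    toy_ratings.groupby('user_id')
    .size()
    .reset_index(name='ratings_count')
    .sort_values(by='ratings_count', ascending=False)
)
```

Calling `top_users.head()` then shows distinct users, not repeated rows for the single most active one.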
Now that we've done some exploratory data analysis, we can start building our recommendation engine.
We'll be using the Surprise library to build our recommendation engine. Surprise is a Python library for building and evaluating recommender systems, published on PyPI as scikit-surprise.
First, we need to load the data into Surprise.
from surprise import Reader, Dataset
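# load the data into Surprise; load_from_df expects exactly three columns
# in the order user, item, rating, so select them explicitly
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings_df[['user_id', 'movie_id', 'rating']], reader)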
Next, we'll split the data into training and test sets. We'll use the training set to build the recommendation engine, and the test set to evaluate the accuracy of the recommendations.
from surprise.model_selection import train_test_split
# split the data into training and test sets
trainset, testset = train_test_split(data, test_size=0.25)
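Conceptually, a holdout split shuffles the ratings and sets a fraction aside. The sketch below shows that idea in plain Python; `holdout_split` and the toy rows are hypothetical, and this is not Surprise's actual implementation.

```python
import random

def holdout_split(rows, test_size=0.25, seed=0):
    """Shuffle a copy of `rows` and hold out a fraction of them as the test set."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical (user, movie, rating) rows: 4 users x 5 movies = 20 ratings.
toy_rows = [(u, m, 3.0) for u in range(4) for m in range(5)]
train_rows, test_rows = holdout_split(toy_rows)
```

Every rating lands in exactly one of the two sets, so the model is never evaluated on a rating it was trained on.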
Now, we can train the recommendation engine. We'll be using the SVD algorithm, which is a matrix factorization algorithm.
from surprise import SVD
# train the recommendation engine
algo = SVD()
algo.fit(trainset)
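To give a feel for what a matrix factorization model learns, here is a small NumPy sketch that predicts a rating as a global mean plus user and item biases plus the dot product of two learned factor vectors, trained with stochastic gradient descent. The function names, hyperparameters, and toy ratings are assumptions for illustration; Surprise's own implementation differs in its details.

```python
import numpy as np

def train_mf(ratings, n_users, n_items, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit biases and latent factors to (user, item, rating) triples with SGD."""
    rng = np.random.default_rng(seed)
    mu = np.mean([r for _, _, r in ratings])          # global mean rating
    bu = np.zeros(n_users)                            # user biases
    bi = np.zeros(n_items)                            # item biases
    P = rng.normal(0, 0.1, (n_users, n_factors))      # user factors
    Q = rng.normal(0, 0.1, (n_items, n_factors))      # item factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            pu_old = P[u].copy()                      # use the pre-update user factors for Q
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu_old - reg * Q[i])
    return mu, bu, bi, P, Q

def predict(model, u, i):
    """Predicted rating for user index u and item index i."""
    mu, bu, bi, P, Q = model
    return mu + bu[u] + bi[i] + P[u] @ Q[i]

# Hypothetical toy ratings as (user index, item index, rating).
toy = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 1, 5), (2, 2, 5), (2, 3, 4), (0, 3, 1), (2, 0, 1)]
model = train_mf(toy, n_users=3, n_items=4)

train_rmse = np.sqrt(np.mean([(r - predict(model, u, i)) ** 2 for u, i, r in toy]))
baseline_rmse = np.sqrt(np.mean([(r - model[0]) ** 2 for _, _, r in toy]))
```

Comparing `train_rmse` with `baseline_rmse` (always predicting the global mean) shows how much the learned biases and factors explain on the training data.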
Now that we've trained the recommendation engine, let's see how well it works. We'll use the test set to evaluate the accuracy of the recommendations.
First, we'll predict a rating for every (user, movie) pair in the test set.
# predict a rating for every (user, movie) pair in the test set
test_predictions = algo.test(testset)
Next, we'll compute the root mean squared error (RMSE) between the predicted and actual ratings. Lower values mean more accurate predictions.
from surprise import accuracy
# compute the RMSE
accuracy.rmse(test_predictions)
0.9408131708204974
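The RMSE is 0.9408, which is a reasonable result for an untuned model. Since ratings run from 1 to 5, a typical prediction is off by a little under one star. RMSE is not a plain average of the errors: each error is squared before averaging, so large misses count for more than small ones.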
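RMSE is the square root of the mean squared difference between actual and predicted ratings. The short sketch below computes it by hand next to the mean absolute error (MAE) on hypothetical values, to show how the two differ when a single prediction is far off.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: squares each error before averaging."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: the plain average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical ratings: three exact predictions and one that is off by 2 stars.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [4.0, 3.0, 5.0, 4.0]

print(rmse(actual, predicted))  # 1.0
print(mae(actual, predicted))   # 0.5
```

The single two-star miss doubles the RMSE relative to the MAE, which is why RMSE should not be read as "average distance from the true rating".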
Now that we have a trained recommendation engine, we can use it to make recommendations.
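Turning predicted ratings into recommendations amounts to a top-N selection: score the candidate movies, skip the ones the user has already rated, and keep the highest scores. A minimal sketch follows, with a hypothetical `top_n` helper and made-up scores and movie ids.

```python
import heapq

def top_n(predicted, already_rated, n=10):
    """Return the n highest-scoring item ids the user has not rated yet."""
    candidates = (
        (score, item) for item, score in predicted.items() if item not in already_rated
    )
    return [item for score, item in heapq.nlargest(n, candidates)]

# Hypothetical predicted ratings for one user, keyed by movie id.
predicted = {50: 4.8, 181: 4.6, 1: 4.1, 100: 3.9, 294: 2.7}
already_rated = {50}

recommended = top_n(predicted, already_rated, n=3)
print(recommended)  # [181, 1, 100]
```

Movie 50 has the highest score but is left out because the user has already rated it.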
First, we'll get a list of all of the movies in the dataset.
# get a list of all the movie ids, kept as integers so they match the ids used in training
movies = movies_df['movie_id'].tolist()
Next, we'll get a list of all of the users in the dataset.
# get a list of all the users
users = ratings_df['user_id'].unique().tolist()
Now, we can use the predict() method, which estimates the rating for a single user and movie, to make recommendations for all of the users.
# make recommendations for all users
for user in users:
    # predict a rating for every movie and keep the 10 highest estimates
    predictions = [algo.predict(user, movie) for movie in movies]
    top_10 = sorted(predictions, key=lambda p: p.est, reverse=True)[:10]
    # print the recommendations
    print("Recommendations for user {}:".format(user))
    for pred in top_10:
        print("\t{}".format(movies_df.loc[movies_df['movie_id'] == pred.iid, 'title'].iloc[0]))
Recommendations for user 1:
Star Wars (1977)
Empire Strikes Back, The (1980)
Return of the Jedi (1983)
Raiders of the Lost Ark (1981)
Indiana Jones and the Last Crusade (1989)
Toy Story (1995)
Aladdin (1992)
Beauty and the Beast (1991)
Lion King, The (1994)
Full Monty, The (1997)
Recommendations for user 2:
Empire Strikes Back, The (1980)
Return of the Jedi (1983)
Raiders of the Lost Ark (1981)
Star Wars (1977)
Indiana Jones and the Last Crusade (1989)
Aladdin (1992)
Beauty and the Beast (1991)
Lion King, The (1994)
Full Monty, The (1997)
Trainspotting (1996)
Recommendations for user 3:
Return of the Jedi (1983)
Raiders of the Lost Ark (1981)
Empire Strikes Back, The (1980)
Star Wars (1977)
Indiana Jones and the Last Crusade (1989)
Aladdin (1992)
Beauty and the Beast (