Recommendation Systems Practice Project: Book Recommendation¶
Context¶
Over 3.5 billion people use the internet for a variety of reasons, and online retail sales are expected to grow steadily in the coming years. For e-commerce portals that sell books, helping readers find titles they are likely to enjoy is an important requirement. A book recommendation system is a type of recommendation system that suggests books to a reader based on his or her interests.
Book recommendation systems are used by many e-commerce businesses and reading platforms, such as Amazon, Barnes and Noble, Flipkart, and Goodreads, to recommend books that customers may want to buy based on their preferences. This feature can increase shopping value while reducing shopping time. Relevant recommendations not only help customers make purchases but also increase total sales value.
Objective¶
In this case study, we will build three types of recommendation systems:
- Knowledge/Rank Based recommendation system
- Similarity-Based Collaborative filtering
- Matrix Factorization Based Collaborative Filtering
Dataset¶
The ratings dataset contains the following attributes:
- User-ID: Unique ID for each user
- ISBN: International Standard Book Number. Books are identified by their respective ISBN
- Book-Rating: Rating for each book, expressed on a scale from 0 to 10
We will also use the books dataset to obtain book titles and other information. It contains the following attributes:
- ISBN: International Standard Book Number
- Book-Title: Title of the book
- Book-Author: Name of the author
- Year-Of-Publication: Publication year
- Publisher: Name of the publisher of the book
- Image-URL-S: Small image of the book (Amazon link)
- Image-URL-M: Medium-size image of the book (Amazon link)
- Image-URL-L: Large-size image of the book (Amazon link)
Sometimes, the installation of the surprise library, which is used to build recommendation systems, faces issues in Jupyter. To avoid any issues, it is advised to use Google Colab for this project.
Let's start by mounting the Google drive on Colab.
# Mount your drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# Installing surprise library
!pip install surprise
Collecting surprise Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB) Collecting scikit-surprise (from surprise) Downloading scikit-surprise-1.1.3.tar.gz (771 kB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 772.0/772.0 kB 7.9 MB/s eta 0:00:00 Preparing metadata (setup.py) ... done Requirement already satisfied: joblib>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-surprise->surprise) (1.4.0) Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.10/dist-packages (from scikit-surprise->surprise) (1.25.2) Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.10/dist-packages (from scikit-surprise->surprise) (1.11.4) Building wheels for collected packages: scikit-surprise Building wheel for scikit-surprise (setup.py) ... done Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp310-cp310-linux_x86_64.whl size=3162984 sha256=1bae9b47cd681ceab80feedaafbe6d74d8612a9ea6a1c0dd9e92c3b0ff724495 Stored in directory: /root/.cache/pip/wheels/a5/ca/a8/4e28def53797fdc4363ca4af740db15a9c2f1595ebc51fb445 Successfully built scikit-surprise Installing collected packages: scikit-surprise, surprise Successfully installed scikit-surprise-1.1.3 surprise-0.1
# Basic python libraries
import numpy as np
import pandas as pd
# Python libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# For implementing matrix factorization based recommendation system
from surprise.prediction_algorithms.matrix_factorization import SVD
from collections import defaultdict
# For implementing cross validation
from surprise.model_selection import KFold
import warnings
warnings.filterwarnings('ignore')
Loading the data¶
# Reading the datasets
book = pd.read_csv("filepath/Books.csv")
rating = pd.read_csv("filepath/Ratings.csv")
user = pd.read_csv("filepath/Users.csv")
Exploring the ratings data¶
rating.head()
| | User-ID | ISBN | Book-Rating |
---|---|---|---|
0 | 276725 | 034545104X | 0 |
1 | 276726 | 0155061224 | 5 |
2 | 276727 | 0446520802 | 0 |
3 | 276729 | 052165615X | 3 |
4 | 276729 | 0521795028 | 6 |
book.head()
| | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher | Image-URL-S | Image-URL-M | Image-URL-L |
---|---|---|---|---|---|---|---|---|
0 | 0195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... |
1 | 0002005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... |
2 | 0060973129 | Decision in Normandy | Carlo D'Este | 1991 | HarperPerennial | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... |
3 | 0374157065 | Flu: The Story of the Great Influenza Pandemic... | Gina Bari Kolata | 1999 | Farrar Straus Giroux | http://images.amazon.com/images/P/0374157065.0... | http://images.amazon.com/images/P/0374157065.0... | http://images.amazon.com/images/P/0374157065.0... |
4 | 0393045218 | The Mummies of Urumchi | E. J. W. Barber | 1999 | W. W. Norton & Company | http://images.amazon.com/images/P/0393045218.0... | http://images.amazon.com/images/P/0393045218.0... | http://images.amazon.com/images/P/0393045218.0... |
Let's merge the 'rating' and 'book' datasets and then we can choose only the columns relevant to our task.
df = pd.merge(rating, book.drop_duplicates(['ISBN']), on="ISBN", how="left")
df.drop(['Image-URL-S','Image-URL-M','Image-URL-L'], axis =1, inplace = True)
# Rename the column names of the dataframe
df.rename(columns = {'User-ID':'user_id', 'ISBN':'book_id', "Book-Rating":"rating"}, inplace = True)
df.head()
| | user_id | book_id | rating | Book-Title | Book-Author | Year-Of-Publication | Publisher |
---|---|---|---|---|---|---|---|
0 | 276725 | 034545104X | 0 | Flesh Tones: A Novel | M. J. Rose | 2002 | Ballantine Books |
1 | 276726 | 0155061224 | 5 | Rites of Passage | Judith Rae | 2001 | Heinle |
2 | 276727 | 0446520802 | 0 | The Notebook | Nicholas Sparks | 1996 | Warner Books |
3 | 276729 | 052165615X | 3 | Help!: Level 1 | Philip Prowse | 1999 | Cambridge University Press |
4 | 276729 | 0521795028 | 6 | The Amsterdam Connection : Level 4 (Cambridge ... | Sue Leather | 2001 | Cambridge University Press |
# Checking the info of the data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 7 columns):
 #   Column               Non-Null Count    Dtype
---  ------               --------------    -----
 0   user_id              1149780 non-null  int64
 1   book_id              1149780 non-null  object
 2   rating               1149780 non-null  int64
 3   Book-Title           1031136 non-null  object
 4   Book-Author          1031134 non-null  object
 5   Year-Of-Publication  1031136 non-null  object
 6   Publisher            1031134 non-null  object
dtypes: int64(2), object(5)
memory usage: 61.4+ MB
Observations:
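- The merged data contains 1,149,780 observations and 7 columns.
- The 'user_id' and 'rating' columns are both of numeric data type.
- The book_id column has the data type object. We'll convert it to string format, as we will be using this column in subsequent steps.
- The book metadata columns (Book-Title, Book-Author, Year-Of-Publication, Publisher) contain missing values, because some rated ISBNs have no match in the books dataset after the left merge.
# Many book_id values contain a combination of letters and digits, so we convert the column to type 'string'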
df['book_id']= df['book_id'].astype(str)
Checking the distribution of ratings¶
# Distribution of ratings
plt.figure(figsize = (12, 4))
sns.countplot(x="rating", data=df)
plt.tick_params(labelsize = 10)
plt.title("Distribution of Ratings ", fontsize = 10)
plt.xlabel("Ratings", fontsize = 10)
plt.ylabel("Number of Ratings", fontsize = 10)
plt.show()
Observations:
- As per the count plot, the rating '0' has by far the highest count (~700K) and accounts for the majority of the ratings. Since explicit ratings run from 1 to 10, 0 can be treated as a missing value.
- Among the remaining ratings, '8' and '10' are the most frequent, with ~100K and ~80K observations, respectively.
- The best option here is to drop all the ratings of class '0', which removes the bias towards a single class.
Dropping rows with rating equal to 0¶
df.drop(df.index[df['rating'] == 0], inplace = True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 433671 entries, 1 to 1149779
Data columns (total 7 columns):
 #   Column               Non-Null Count   Dtype
---  ------               --------------   -----
 0   user_id              433671 non-null  int64
 1   book_id              433671 non-null  object
 2   rating               433671 non-null  int64
 3   Book-Title           383842 non-null  object
 4   Book-Author          383840 non-null  object
 5   Year-Of-Publication  383842 non-null  object
 6   Publisher            383840 non-null  object
dtypes: int64(2), object(5)
memory usage: 26.5+ MB
- After removing the entries with a rating of 0, the dataset contains a total of 433,671 rows.
Let's check the distribution of ratings again.
Checking updated distribution of ratings¶
# Distribution of ratings
plt.figure(figsize = (12, 4))
sns.countplot(x="rating", data=df)
plt.tick_params(labelsize = 10)
plt.title("Distribution of Ratings ", fontsize = 10)
plt.xlabel("Ratings", fontsize = 10)
plt.ylabel("Number of Ratings", fontsize = 10)
plt.show()
Observations:
- As per the count plot, rating 8 has the highest count (~100K), followed by ratings 10 and 7, which have around 80K observations each.
- There are very few ratings for classes '1', '2', '3', and '4'.
# Finding the number of unique users
df['user_id'].nunique()
77805
# Finding the number of unique books
df['book_id'].nunique()
185973
Observations:
- There are 185973 books in the dataset.
- As per the number of unique users and books, there is a possibility of 77805 * 185973 = 14,469,629,265 ratings in the dataset. But we only have 433,671 ratings, i.e., not every user has rated every book in the dataset. So, we can build a recommendation system to recommend books to users which they have not interacted with.
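To make the sparsity concrete, the share of filled cells in the user-book matrix can be computed directly from the three counts reported above. This is a minimal sketch; the variable names are illustrative and not part of the notebook.

```python
# Counts reported earlier in this notebook
n_users = 77805
n_books = 185973
n_ratings = 433671

# Every user could in principle rate every book
possible_ratings = n_users * n_books

# Share of the user-book matrix that actually holds a rating
density = n_ratings / possible_ratings

print(f"Possible ratings: {possible_ratings:,}")
print(f"Density: {density:.6%}")
```

The density comes out at roughly 0.003%, so almost every user-book cell is empty.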
df.groupby(['user_id', 'book_id']).count()
| user_id | book_id | rating | Book-Title | Book-Author | Year-Of-Publication | Publisher |
|---|---|---|---|---|---|---|
| 8 | 0002005018 | 1 | 1 | 1 | 1 | 1 |
| | 074322678X | 1 | 1 | 1 | 1 | 1 |
| | 0887841740 | 1 | 1 | 1 | 1 | 1 |
| | 1552041778 | 1 | 1 | 1 | 1 | 1 |
| | 1567407781 | 1 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 278854 | 0375703063 | 1 | 1 | 1 | 1 | 1 |
| | 042516098X | 1 | 1 | 1 | 1 | 1 |
| | 0425163393 | 1 | 1 | 1 | 1 | 1 |
| | 0553275739 | 1 | 1 | 1 | 1 | 1 |
| | 0553579606 | 1 | 1 | 1 | 1 | 1 |
433671 rows × 5 columns
df.groupby(['user_id', 'book_id']).count()['rating'].sum()
433671
Observation:
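- The grouped result has 433,671 rows, one per (user_id, book_id) pair, which equals the total number of observations. This implies that each user has rated any given book at most once. (The sum of the counts equals the number of observations by construction, so it is the number of groups that carries the information.)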
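A direct way to confirm that no user rated the same book twice is to look for repeated (user, book) pairs. The sketch below does this in plain Python on a few invented rows; `repeated_pairs` is a hypothetical helper. On the real dataframe, `df.duplicated(['user_id', 'book_id']).any()` should give the equivalent answer.

```python
from collections import Counter

def repeated_pairs(rows):
    """Return the (user_id, book_id) pairs that occur more than once."""
    pair_counts = Counter((user_id, book_id) for user_id, book_id, _ in rows)
    return [pair for pair, count in pair_counts.items() if count > 1]

# Toy (user_id, book_id, rating) rows, invented for illustration
toy_rows = [
    (8, "0002005018", 5),
    (8, "074322678X", 5),
    (9, "0002005018", 7),
]

# No pair is repeated in the toy rows
print(repeated_pairs(toy_rows))
# Adding a second rating by user 8 for the same book creates a repeated pair
print(repeated_pairs(toy_rows + [(8, "0002005018", 6)]))
```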
Which book has the highest number of reviews / ratings in the dataset?¶
# Finding the most rated books in the dataset
df['book_id'].value_counts()
book_id 0316666343 707 0971880107 581 0385504209 487 0312195516 383 0679781587 333 ... 0140441905 1 0886777267 1 0671697951 1 0553560956 1 05162443314 1 Name: count, Length: 185973, dtype: int64
Observations:
- The book with book_id 0316666343 has been rated by the largest number of users (707).
- Still, there is a possibility of 77805-707 = 77098 more interactions, as we have 77805 unique users in our dataset. For those 77098 remaining users, we can build a recommendation system to predict who is most likely to interact with this book.
- Also, for these 707 interactions, we need to look at the distribution of ratings to check whether this book is mostly liked or mostly disliked.
# Plotting distributions of ratings for the most interacted book
plt.figure(figsize=(7,7))
df[df['book_id'] == '0316666343']['rating'].value_counts().plot(kind='bar')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()
Observations:
- We can see that the most frequent rating for this book is 8, followed by 9, 10, and then 7.
- Because the count of ratings 1, 2, 3, and 4 is much lower than ratings 8, 9, and 10, this implies that the book is liked by the majority of users.
df['user_id'].value_counts()
user_id 11676 8524 98391 5802 153662 1969 189835 1906 23902 1395 ... 114079 1 114081 1 114096 1 114115 1 276723 1 Name: count, Length: 77805, dtype: int64
Observations:
- The user with user_id 11676 has interacted with the largest number of books, 8524.
- Still, there is a possibility of 185973-8524 = 177449 more interactions for this user, as we have 185973 unique books in the dataset.
Data Preparation¶
The dataset is still quite large, with 433,671 observations, so building a model on all of it would not be computationally efficient. Moreover, many users have rated only a few books, and many books have been rated by very few users. Hence, we can reduce the dataset by making some reasonable assumptions.
Here, we keep only users who have given at least 50 ratings and books that have received at least 10 ratings. Just as when we shop online, we prefer a product to have a reasonable number of ratings before relying on them.
# Get the column containing the users
users = df.user_id
# Create a dictionary from users to find their number of books
ratings_count = dict()
for user in users:
    # If we already have the user, just add 1 to their rating count
    if user in ratings_count:
        ratings_count[user] += 1
    # Otherwise, set their rating count to 1
    else:
        ratings_count[user] = 1
# We want our users to have at least 50 ratings to be considered
RATINGS_CUTOFF = 50
remove_users = []
for user, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_users.append(user)
df = df.loc[~df.user_id.isin(remove_users)]
df.shape
(175023, 7)
# Get the column containing the books
books = df.book_id
# Create a dictionary from books to find their number of users
ratings_count = dict()
for book in books:
    # If we already have the book, just add 1 to its rating count
    if book in ratings_count:
        ratings_count[book] += 1
    # Otherwise, set its rating count to 1
    else:
        ratings_count[book] = 1
# We want our book to be interacted by at least 10 users to be considered
RATINGS_CUTOFF = 10
remove_books = []
for book, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_books.append(book)
df= df.loc[~df.book_id.isin(remove_books)]
df.shape
(26698, 7)
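The two filters above are applied one after the other, so the book filter can push some users back below the 50-rating cutoff (the per-user counts printed in the next section include values as low as 3). The plain-Python sketch below reproduces this effect on invented data with much smaller cutoffs; `filter_by_min_count` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def filter_by_min_count(rows, key_index, cutoff):
    """Keep rows whose key (user or book) appears at least `cutoff` times."""
    counts = Counter(row[key_index] for row in rows)
    return [row for row in rows if counts[row[key_index]] >= cutoff]

# Toy (user_id, book_id) interactions, invented for illustration
rows = [
    ("u1", "a"), ("u1", "b"), ("u1", "c"),
    ("u2", "a"), ("u2", "b"), ("u2", "d"),
    ("u3", "a"),
]

# Step 1: keep users with at least 3 ratings (drops u3)
rows = filter_by_min_count(rows, key_index=0, cutoff=3)
# Step 2: keep books with at least 2 ratings (drops books c and d)
rows = filter_by_min_count(rows, key_index=1, cutoff=2)

# After step 2, each remaining user has only 2 ratings, below the user cutoff
ratings_per_user = Counter(user for user, _ in rows)
print(ratings_per_user)
```

If both cutoffs need to hold at the same time, the two steps would have to be repeated until the row count stops changing. The notebook does not do this, and the single pass is kept here as well.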
Distribution of the user-books interactions in the dataset¶
df.nunique()
user_id 1257 book_id 1497 rating 10 Book-Title 1367 Book-Author 587 Year-Of-Publication 43 Publisher 204 dtype: int64
# Finding user-books interactions distribution
count_interactions = df.groupby('user_id').count()['book_id']
count_interactions
user_id 254 18 638 20 643 3 1025 7 1211 3 .. 277427 36 278026 11 278137 8 278188 9 278418 9 Name: book_id, Length: 1257, dtype: int64
# Plotting user-item interactions distribution
plt.figure(figsize=(15,7))
sns.histplot(count_interactions)
plt.xlabel('Number of Interactions by Users')
plt.show()
Observations:
- The distribution is highly skewed to the right.
- It clearly shows that very few users have rated many books; most users have only a small number of ratings left in the filtered dataset.
As we have now explored the data, let's start building Recommendation Systems
Model 1: Create Rank-Based Recommendation System¶
Rank-based recommendation systems provide recommendations based on the most popular items. This kind of recommendation system is useful when we face the cold start problem. Cold start refers to the situation where a new user enters the system and the model cannot recommend books to them, because the user has no historical interactions in the dataset. In those cases, we can use a rank-based recommendation system to recommend books to the new user.
To build the rank-based recommendation system, we take the average of all the ratings given to each book and then rank the books by their average rating.
# Calculating average ratings
average_rating = df.groupby('book_id')['rating'].mean()
# Calculating the count of ratings
count_rating = df.groupby('book_id')['rating'].count()
# Making a dataframe with the count and average of ratings
final_rating = pd.DataFrame({'avg_rating':average_rating, 'rating_count':count_rating})
final_rating.head()
avg_rating | rating_count | |
---|---|---|
book_id | ||
0020442203 | 8.727273 | 11 |
002542730X | 7.428571 | 28 |
0028604199 | 8.000000 | 10 |
0060002050 | 7.800000 | 10 |
006000438X | 7.666667 | 15 |
final_rating['rating_count'].value_counts()
rating_count 10 237 11 196 12 161 13 114 14 104 ... 70 1 47 1 77 1 145 1 68 1 Name: count, Length: 65, dtype: int64
Now, let's create a function to find the top n books for a recommendation based on the average ratings of books. We can also add a threshold for a minimum number of interactions for a book to be considered for recommendation.
def top_n_books(data, n, min_interaction=100):
    # Keeping books with strictly more than min_interaction ratings
    recommendations = data[data['rating_count'] > min_interaction]
    # Sorting values w.r.t. average rating
    recommendations = recommendations.sort_values(by='avg_rating', ascending=False)
    return recommendations.index[:n]
We can use this function with different n's and minimum interactions to get books to recommend.
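The interaction threshold matters because a book with a single perfect rating would otherwise outrank widely read books. The plain-Python sketch below illustrates this on invented ratings; `top_n_by_average` is a hypothetical stand-in that uses the same strict `>` comparison as `top_n_books`.

```python
from collections import defaultdict

def top_n_by_average(ratings, n, min_interactions):
    """Rank items by mean rating, keeping only items with more than
    `min_interactions` ratings (strict inequality)."""
    per_item = defaultdict(list)
    for item_id, rating in ratings:
        per_item[item_id].append(rating)
    eligible = {
        item_id: sum(values) / len(values)
        for item_id, values in per_item.items()
        if len(values) > min_interactions
    }
    return sorted(eligible, key=eligible.get, reverse=True)[:n]

# Toy (book_id, rating) pairs, invented for illustration
toy = [("a", 10), ("a", 9), ("a", 8), ("b", 10), ("c", 7), ("c", 8), ("c", 6)]

# Without a threshold, book "b" wins on a single rating
print(top_n_by_average(toy, n=1, min_interactions=0))
# Requiring more than 2 ratings leaves only "a" and "c"
print(top_n_by_average(toy, n=2, min_interactions=2))
```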
Recommending top 5 books with more than 10 interactions based on popularity¶
res = list(top_n_books(final_rating, 5, 10))
# Name of the books
list_of_books = []
for i in res:
    list_of_books.append(df[df['book_id'] == str(i)]['Book-Title'].unique()[0])
list_of_books
['The Two Towers (The Lord of the Rings, Part 2)', 'Harry Potter and the Chamber of Secrets Postcard Book', "My Sister's Keeper : A Novel (Picoult, Jodi)", 'The Giving Tree', 'A Tree Grows in Brooklyn']
Recommending top 5 books with more than 100 interactions based on popularity¶
res2 = list(top_n_books(final_rating, 5, 100))
# Name of the books
list_of_book = []
for i in res2:
    list_of_book.append(df[df['book_id'] == str(i)]['Book-Title'].unique()[0])
list_of_book
['The Da Vinci Code', 'The Lovely Bones: A Novel']
Model 2: Collaborative Filtering Based Recommendation System¶
In this type of recommendation system, we do not need any information about the users or items. We only need user-item interaction data to build a collaborative recommendation system. For example:
- Ratings provided by users. For example - ratings of books on Goodreads, movie ratings on IMDB, etc.
- Likes of users on different Facebook posts, likes on youtube videos
- Use/buying of a product by users. For example - buying different items on e-commerce sites
- Reading of articles by readers on various blogs
Types of Collaborative Filtering¶
- Similarity/Neighborhood based
- Model based
Below we build a similarity-based recommendation system that uses cosine similarity and KNN to find the users who are the nearest neighbors of a given user.
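As a rough illustration of the neighborhood idea, the sketch below predicts a rating as the similarity-weighted average of the ratings given by the k most similar users who rated the item. It follows the general formula documented for KNNBasic, but it is a simplified stand-in with invented numbers, not Surprise's implementation; `knn_predict` is a hypothetical name.

```python
import heapq

def knn_predict(neighbors, k):
    """neighbors: list of (similarity, rating) for users who rated the item.
    Returns the similarity-weighted mean over the k most similar users."""
    top_k = heapq.nlargest(k, neighbors, key=lambda pair: pair[0])
    weight_sum = sum(sim for sim, _ in top_k if sim > 0)
    if weight_sum == 0:
        return None  # no usable neighbors; a real system needs a fallback
    return sum(sim * rating for sim, rating in top_k if sim > 0) / weight_sum

# Invented (similarity, rating) pairs for three neighbors
neighbors = [(0.9, 8), (0.5, 6), (0.1, 2)]

# With k=2, only the two most similar neighbors contribute
print(knn_predict(neighbors, k=2))
```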
# To compute the accuracy of models
from surprise import accuracy
# Class is used to parse a file containing ratings, data should be in the structure - user ; item ; rating
from surprise.reader import Reader
# Class for loading datasets
from surprise.dataset import Dataset
# For tuning model hyperparameters
from surprise.model_selection import GridSearchCV
# For splitting the rating data in train and test dataset
from surprise.model_selection import train_test_split
# For implementing similarity-based recommendation system
from surprise.prediction_algorithms.knns import KNNBasic
Before building the recommendation systems, let's go over some basic terminology we are going to use:
Relevant item - An item (a book in this case) whose actual rating is at or above the threshold rating (here 7) is relevant; if the actual rating is below the threshold, it is a non-relevant item.
Recommended item - An item whose predicted rating is at or above the threshold (here 7) is a recommended item; if the predicted rating is below the threshold, that book will not be recommended to the user.
False Negative (FN) - It is the number of relevant items that are not recommended to the user. If relevant items are not recommended, the user might not buy the product/item. This is a lost opportunity for the service provider, which they would like to minimize.
False Positive (FP) - It is the number of recommended items that are actually not relevant. In this case, the recommendation system is not doing a good job of finding and recommending relevant items to the user. This wastes resources for the service provider, which they would also like to minimize.
Recall - It is the fraction of actually relevant items that are recommended to the user, i.e., if 6 out of 10 relevant books are recommended to the user, then recall is 0.60. The higher the recall, the better the model. It is one of the metrics used to assess the performance of classification models.
Precision - It is the fraction of recommended items that are actually relevant, i.e., if 6 out of 10 recommended items are found relevant by the user, then precision is 0.60. The higher the precision, the better the model. It is one of the metrics used to assess the performance of classification models.
When building a recommendation system, it is customary to evaluate the model in terms of how many recommendations are relevant and how many relevant items are recommended. Below are the two most commonly used performance metrics for recommendation systems.
Precision@k and Recall@ k¶
Precision@k - It is the fraction of recommended items that are relevant in the top k predictions. The value of k is the number of recommendations to be provided to the user. One can choose a different number of recommendations for each user.
Recall@k - It is the fraction of relevant items that are recommended to the user in the top k predictions.
F1-Score@k - It is the harmonic mean of Precision@k and Recall@k. When precision@k and recall@k both seem to be important then it is useful to use this metric because it is representative of both of them.
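The three definitions can be checked by hand for a single user. The sketch below uses five invented (estimated, true) rating pairs with k=3 and a threshold of 7; the function name is hypothetical, and setting an undefined precision or recall to 0 is a convention assumed here.

```python
def precision_recall_f1_at_k(user_ratings, k, threshold):
    """user_ratings: list of (estimated, true) ratings for one user."""
    ranked = sorted(user_ratings, key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]
    # Relevant: true rating at or above the threshold (over all items)
    n_relevant = sum(true >= threshold for _, true in ranked)
    # Recommended: estimated rating at or above the threshold (top k only)
    n_recommended = sum(est >= threshold for est, _ in top_k)
    # Hits: both recommended and relevant within the top k
    n_hits = sum(est >= threshold and true >= threshold for est, true in top_k)
    precision = n_hits / n_recommended if n_recommended else 0.0
    recall = n_hits / n_relevant if n_relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Invented (estimated, true) ratings for one user
toy = [(9.1, 9), (8.4, 6), (7.6, 8), (6.9, 10), (5.0, 4)]
print(precision_recall_f1_at_k(toy, k=3, threshold=7))
```

In this toy case, three items are recommended in the top 3 and two of them are relevant, while three items are relevant overall, so precision, recall, and F1 all equal 2/3.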
Some useful functions¶
- The following function takes the recommendation model as input and gives the precision@k and recall@k for that model.
- To compute precision and recall, top k predictions are taken under consideration for each user.
def precision_recall_at_k(model, k=10, threshold=7):
    """Print RMSE, precision@k, recall@k, and F_1 score for model, evaluated on the global testset."""
    # First map the predictions to each user
    user_est_true = defaultdict(list)
    # Making predictions on the test data
    predictions = model.test(testset)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))
    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():
        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)
        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)
        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])
        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])
        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set Precision to 0 when n_rec_k is 0.
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0
        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set Recall to 0 when n_rel is 0.
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0
    # Mean of the per-user precisions
    precision = round(sum(precisions.values()) / len(precisions), 3)
    # Mean of the per-user recalls
    recall = round(sum(recalls.values()) / len(recalls), 3)
    accuracy.rmse(predictions)
    print('Precision: ', precision)  # Overall precision
    print('Recall: ', recall)  # Overall recall
    print('F_1 score: ', round((2 * precision * recall) / (precision + recall), 3))  # Harmonic mean of precision and recall
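The function above also prints the RMSE through accuracy.rmse. For reference, RMSE is the square root of the mean squared difference between true and estimated ratings. A minimal sketch on invented numbers follows; the local `rmse` helper is illustrative and is not the Surprise function.

```python
import math

def rmse(pairs):
    """pairs: iterable of (true_rating, estimated_rating)."""
    pairs = list(pairs)
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(pairs))

# Squared errors are 1, 4, and 0, so the mean squared error is 5/3
print(rmse([(8, 7), (6, 8), (10, 10)]))
```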
Let's encode user_id and book_id as consecutive integers for simplicity. The encoding only relabels the ids, so it does not change the predictions.
from sklearn.preprocessing import LabelEncoder
data=df[['user_id','book_id']].apply(LabelEncoder().fit_transform)
data['rating']=df['rating']
data.head()
| | user_id | book_id | rating |
---|---|---|---|
1211 | 1251 | 521 | 9 |
1213 | 1251 | 524 | 9 |
1214 | 1251 | 525 | 8 |
1456 | 1252 | 1 | 10 |
1474 | 1252 | 52 | 9 |
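LabelEncoder assigns each distinct value an integer from 0 to n-1, following the sorted order of the values. With the 1,257 users and 1,497 books reported by df.nunique() above, the encoded ids therefore run from 0 to 1256 and from 0 to 1496. The plain-Python sketch below mimics that behavior on a few ISBNs; `label_encode` is a hypothetical helper, not scikit-learn's implementation.

```python
def label_encode(values):
    """Map each distinct value to its rank in sorted order (0 .. n-1)."""
    mapping = {value: code for code, value in enumerate(sorted(set(values)))}
    return [mapping[value] for value in values], mapping

book_ids = ["034545104X", "0155061224", "0446520802", "0155061224"]
codes, mapping = label_encode(book_ids)

print(codes)    # one integer code per input value
print(mapping)  # original id -> encoded id
```

Keeping the mapping around makes it possible to translate encoded ids back to ISBNs when presenting recommendations.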
# Creating a copy of the above dataset for further use
df_rating = data.copy()
# Calculating average ratings
average_rating = data.groupby('book_id')['rating'].mean()
# Calculating the count of ratings
count_rating = data.groupby('book_id')['rating'].count()
# Updating the final_rating dataframe with the new encoded book_id count and average of ratings based on the new dataframe
final_rating = pd.DataFrame({'avg_rating':average_rating, 'rating_count':count_rating})
final_rating.head()
| book_id | avg_rating | rating_count |
|---|---|---|
0 | 8.727273 | 11 |
1 | 7.428571 | 28 |
2 | 8.000000 | 10 |
3 | 7.800000 | 10 |
4 | 7.666667 | 15 |
Below we load the data dataframe into a different format, surprise.dataset.DatasetAutoFolds, which is required by this library. To do this, we will be using the classes Reader and Dataset.
Making the dataset into surprise dataset and splitting it into train and test set
# Instantiating Reader scale with expected rating scale
reader = Reader(rating_scale=(1, 10))
# Loading the rating dataset
data = Dataset.load_from_df(data[['user_id', 'book_id', 'rating']], reader)
# Splitting the data into train and test dataset
trainset, testset = train_test_split(data, test_size=0.3, random_state=42)
Now, we are ready to build the first baseline similarity based recommendation system using cosine similarity and KNN.
User-Based Collaborative Filtering Recommendation System¶
sim_options = {'name': 'cosine',
'user_based': True}
algo_knn_user = KNNBasic(sim_options=sim_options,verbose=False)
# Train the algorithm on the train set, and predict ratings for the test set
algo_knn_user.fit(trainset)
# Let us compute precision@k, recall@k, and f_1 score with k =10.
precision_recall_at_k(algo_knn_user)
RMSE: 1.8455 Precision: 0.816 Recall: 0.812 F_1 score: 0.814
Observations:
- We can observe that the baseline model has RMSE ≈ 1.85 on the test set.
- Intuition of Recall - We are getting a recall of ~0.81, which means that out of all the relevant books, 81% are recommended.
- Intuition of Precision - We are getting a precision of ~0.82, which means that out of all the recommended books, 82% are relevant.
- The F_1 score of the baseline model is ~0.81. It indicates that most recommended books were relevant and most relevant books were recommended. We can try to improve the performance by using GridSearchCV to tune different hyperparameters of the algorithm.
What is the predicted rating for the user with userId=1326 and for book_id=12126?
algo_knn_user.predict(1326, 12126, r_ui=8, verbose=True)
user: 1326 item: 12126 r_ui = 8.00 est = 7.99 {'was_impossible': True, 'reason': 'User and/or item is unknown.'}
Prediction(uid=1326, iid=12126, r_ui=8, est=7.9887628424657535, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})
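- The details field reports 'was_impossible': True with the reason 'User and/or item is unknown', so 7.99 is not a neighborhood-based estimate. After label encoding, user ids run from 0 to 1256 and book ids from 0 to 1496, so user 1326 and book 12126 do not exist in the data. In this situation Surprise falls back to its default prediction, the global mean of the training ratings. The closeness to the supplied r_ui=8 is therefore a coincidence and not evidence of model quality.
Let's predict the rating for the same userId=1326 but for a book which this user has not rated before, i.e., book_id=2150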
algo_knn_user.predict(1326, 2150, verbose=True)
user: 1326 item: 2150 r_ui = None est = 7.99 {'was_impossible': True, 'reason': 'User and/or item is unknown.'}
Prediction(uid=1326, iid=2150, r_ui=None, est=7.9887628424657535, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})
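Both outputs above carry 'was_impossible': True in details, so it can be worth checking that flag before interpreting est. The sketch below uses a namedtuple stand-in with the same fields as the printed Prediction objects, so that it runs without Surprise installed; `is_fallback` is a hypothetical helper.

```python
from collections import namedtuple

# Stand-in with the same fields as the Prediction objects printed above
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])

def is_fallback(prediction):
    """True when the model could not compute an estimate for this pair."""
    return bool(prediction.details.get("was_impossible", False))

# Values copied from the output above
pred = Prediction(
    uid=1326, iid=2150, r_ui=None, est=7.9887628424657535,
    details={"was_impossible": True, "reason": "User and/or item is unknown."},
)

print(is_fallback(pred))
```

With a real Surprise prediction, the same check could be applied to the object returned by predict, assuming its details dictionary has the structure shown in the output above.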
Improving similarity-based recommendation system by tuning its hyperparameters¶
Below we will be tuning hyperparameters for the KNNBasic algorithm. Let's first go over its main hyperparameters:
- k (int) – The (max) number of neighbors to take into account for aggregation. Default is 40.
- min_k (int) – The minimum number of neighbors to take into account for aggregation. If there are not enough neighbors, the prediction is set to the global mean of all ratings. Default is 1.
- sim_options (dict) – A dictionary of options for the similarity measure. Four similarity measures are available in surprise:
  - cosine
  - msd (default)
  - pearson
  - pearson_baseline
For more details, please refer to the official documentation: https://surprise.readthedocs.io/en/stable/knn_inspired.html
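To give a feel for how the first two similarity measures differ, the sketch below computes a cosine similarity and an MSD-based similarity (taken here as 1 / (msd + 1)) over the books two users have both rated. These are simplified stand-ins written from the documented definitions, with invented users and hypothetical function names; they are not Surprise's code.

```python
import math

def common_ratings(user_a, user_b):
    """Pairs of ratings for the items both users rated."""
    shared = user_a.keys() & user_b.keys()
    return [(user_a[item], user_b[item]) for item in shared]

def cosine_similarity(user_a, user_b):
    pairs = common_ratings(user_a, user_b)
    if not pairs:
        return 0.0
    dot = sum(a * b for a, b in pairs)
    norm_a = math.sqrt(sum(a * a for a, _ in pairs))
    norm_b = math.sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_a * norm_b)

def msd_similarity(user_a, user_b):
    pairs = common_ratings(user_a, user_b)
    if not pairs:
        return 0.0
    # Mean squared difference over the co-rated items
    msd = sum((a - b) ** 2 for a, b in pairs) / len(pairs)
    return 1.0 / (msd + 1.0)

# Invented users: bob rates the shared books exactly half as high as alice
alice = {"b1": 8, "b2": 10, "b3": 6}
bob = {"b1": 4, "b2": 5, "b4": 9}

print(cosine_similarity(alice, bob))
print(msd_similarity(alice, bob))
```

Cosine similarity treats the two users as near-identical because their ratings are proportional, while the MSD-based measure penalizes the gap in rating levels. Which behavior works better is an empirical question, which is one reason to include the measure in the grid search.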
Perform hyperparameter tuning for the baseline user based collaborative filtering recommendation system and find the RMSE for tuned user based collaborative filtering recommendation system
# Setting up parameter grid to tune the hyperparameters
param_grid = {'k': [20, 30, 40], 'min_k': [3, 6, 9],
'sim_options': {'name': ['msd', 'cosine'],
'user_based': [True]}
}
# Performing 3-fold cross validation to tune the hyperparameters
gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse', 'mae'], cv=3, n_jobs=-1)
# Fitting the data
gs.fit(data)
# Best RMSE score
print(gs.best_score['rmse'])
# Combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])
1.7009224993802097 {'k': 20, 'min_k': 6, 'sim_options': {'name': 'msd', 'user_based': True}}
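The size of this search is easy to underestimate. The sketch below enumerates the same candidate values with itertools.product to count the settings and model fits involved. The sim_name key is a flattened stand-in for sim_options['name'], used only to keep the example short.

```python
from itertools import product

param_grid = {
    "k": [20, 30, 40],
    "min_k": [3, 6, 9],
    "sim_name": ["msd", "cosine"],  # flattened stand-in for sim_options['name']
}

# One dictionary per combination of candidate values
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
n_folds = 3

print(len(combinations))            # number of candidate settings
print(len(combinations) * n_folds)  # number of model fits across folds
```

Each of the 18 settings is fitted once per fold, giving 54 fits in total, which is why the grid is kept small.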
Now, let's build the final model by using tuned values of the hyperparameters, which we received by using grid search cross-validation.
# Using the optimal similarity measure for user-user based collaborative filtering
sim_options = {'name': 'msd',
'user_based': True}
# Creating an instance of KNNBasic with optimal hyperparameter values
similarity_algo_optimized = KNNBasic(sim_options=sim_options, k=20, min_k=6, verbose=False)
# Training the algorithm on the train set
similarity_algo_optimized.fit(trainset)
# Let us compute precision@k and recall@k with k=10.
precision_recall_at_k(similarity_algo_optimized)
RMSE: 1.6866 Precision: 0.834 Recall: 0.891 F_1 score: 0.862
Observations:
- After tuning hyperparameters, RMSE for the test set has reduced from 1.85 to 1.69.
- The tuned model's F_1 score increased from 0.81 to 0.86 compared with the baseline model, with recall improving from 0.81 to 0.89. As a result, we can say that the model's performance has improved after hyperparameter tuning.
What is the predicted rating for the user with user_id=1326 and for book_id=12126 using the tuned user-based collaborative filtering?
similarity_algo_optimized.predict(1326, 12126, r_ui=8, verbose=True)
user: 1326 item: 12126 r_ui = 8.00 est = 7.99 {'was_impossible': True, 'reason': 'User and/or item is unknown.'}
Prediction(uid=1326, iid=12126, r_ui=8, est=7.9887628424657535, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})
Observation:
- The baseline and tuned models return the same value, 7.99, because for this pair the prediction was impossible ('was_impossible': True): both fall back to the global mean of the training ratings. This pair therefore says nothing about the effect of tuning; a user and a book that are both present in the training set would be needed for a meaningful comparison.
Below we are predicting the rating for the same user_id=1326 but for a book which this user has not rated before, i.e., book_id=2150.
similarity_algo_optimized.predict(1326, 2150, verbose=True)
user: 1326 item: 2150 r_ui = None est = 7.99 {'was_impossible': True, 'reason': 'User and/or item is unknown.'}
Prediction(uid=1326, iid=2150, r_ui=None, est=7.9887628424657535, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})
Identifying users similar to a given user (nearest neighbors)
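We can find the users most similar to a given user, i.e., its nearest neighbors, with this KNNBasic model. Below we find the 5 users most similar to the user with id 1. Note that get_neighbors expects and returns Surprise's inner ids, which need not coincide with the encoded user_id values.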
similarity_algo_optimized.get_neighbors(1, k=5)
[7, 23, 95, 107, 109]
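To make the idea of "nearest neighbors" concrete, here is a minimal sketch that ranks users by cosine similarity computed over the books they have both rated. It is an illustration under simplifying assumptions, not surprise's implementation; the toy `ratings` dictionary and the function names are hypothetical.

```python
import math

# Toy ratings per user as {user_id: {book_id: rating}}; the values are made up.
ratings = {
    1: {10: 8, 11: 6, 12: 9},
    2: {10: 8, 11: 6, 12: 9},
    3: {10: 2, 11: 9},
    4: {13: 7},
}


def cosine_on_common_items(a, b):
    """Cosine similarity restricted to the books rated by both users."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)


def nearest_users(user_id, k):
    """Return the ids of the k users most similar to user_id."""
    sims = [
        (other, cosine_on_common_items(ratings[user_id], ratings[other]))
        for other in ratings
        if other != user_id
    ]
    sims.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in sims[:k]]


print(nearest_users(1, k=2))
```

User 2 has identical ratings to user 1 and therefore ranks first, while user 4 shares no books with user 1 and gets a similarity of zero.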
Implementing the recommendation algorithm based on optimized KNNBasic model¶
Below, we implement a function with the following input parameters:
- data: a rating dataset
- user_id: the user for whom we want recommendations
- top_n: the number of items to recommend
- algo: the algorithm used to predict the ratings

The output of the function is a list of the top_n (book_id, predicted rating) pairs recommended for the given user_id, based on the given algorithm.
def get_recommendations(data, user_id, top_n, algo):
    # Creating an empty list to store the recommended book ids
    recommendations = []

    # Creating a user-item interactions matrix
    user_item_interactions_matrix = data.pivot(index='user_id', columns='book_id', values='rating')

    # Extracting those book ids which the user_id has not interacted with yet
    user_row = user_item_interactions_matrix.loc[user_id]
    non_interacted_items = user_row[user_row.isnull()].index.tolist()

    # Looping through each book id which user_id has not interacted with yet
    for book_id in non_interacted_items:
        # Predicting the rating for this non-interacted book id by this user
        est = algo.predict(user_id, book_id).est

        # Appending the predicted rating
        recommendations.append((book_id, est))

    # Sorting the predicted ratings in descending order
    recommendations.sort(key=lambda x: x[1], reverse=True)

    # Returning top n predicted rating items for this user
    return recommendations[:top_n]
df_rating=df_rating.drop_duplicates()
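The same candidate-selection logic can be sketched without pandas, which avoids building the full pivot matrix. This is a hypothetical variant for illustration only: `Prediction`, `LookupAlgo`, and `top_n_unseen` are stand-ins invented here, and `LookupAlgo` merely mimics the `.predict(...).est` interface used by `get_recommendations`.

```python
from collections import namedtuple

# Minimal stand-in for the object returned by algo.predict(...)
Prediction = namedtuple("Prediction", ["est"])


class LookupAlgo:
    """Hypothetical fitted model that looks scores up in a dictionary."""

    def __init__(self, scores, default):
        self.scores = scores
        self.default = default

    def predict(self, user_id, book_id):
        return Prediction(self.scores.get((user_id, book_id), self.default))


def top_n_unseen(rating_rows, user_id, top_n, algo):
    """Rank the books the user has not rated by their predicted rating."""
    all_books = {book for _, book, _ in rating_rows}
    seen = {book for user, book, _ in rating_rows if user == user_id}
    scored = [
        (book, algo.predict(user_id, book).est)
        for book in sorted(all_books - seen)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy data as (user_id, book_id, rating); the values are made up.
rows = [(1, 10, 8), (2, 11, 9), (2, 12, 7), (3, 13, 5)]
algo = LookupAlgo({(1, 11): 9.5, (1, 12): 6.0}, default=7.0)
print(top_n_unseen(rows, 1, 2, algo))
```

Book 10 is excluded because user 1 has already rated it; the remaining books are ordered by predicted rating.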
Predicting the top 5 items for userId=1 using the similarity-based recommendation system
recommendations = get_recommendations(df_rating, 1, 5, similarity_algo_optimized)
# Building the dataframe for above recommendations with columns "book_id" and "predicted_ratings"
pd.DataFrame(recommendations, columns=['book_Id', 'predicted_ratings'])
| | book_Id | predicted_ratings |
|---|---|---|
| 0 | 259 | 10.000000 |
| 1 | 1297 | 9.884446 |
| 2 | 658 | 9.870802 |
| 3 | 639 | 9.764398 |
| 4 | 451 | 9.702661 |
Correcting the Ratings and Ranking the above books¶
When comparing two books, the predicted rating alone does not tell us how likely the user is to enjoy a book. The number of users who have rated the book also matters: in general, the higher a book's "rating_count", the more reliable its rating. For example, a rating of 8 based on 5 ratings is weaker evidence than a rating of 7 based on 50 ratings. To account for this, we calculate "corrected_ratings" for each book by subtracting a penalty that is inversely proportional to the square root of the book's rating_count, so that rarely rated books are penalized the most.
def ranking_books(recommendations, final_rating):
    # Sort the books based on ratings count
    ranked_books = final_rating.loc[[items[0] for items in recommendations]].sort_values('rating_count', ascending=False)[['rating_count']].reset_index()

    # Merge with the recommended books to get predicted ratings
    ranked_books = ranked_books.merge(pd.DataFrame(recommendations, columns=['book_id', 'predicted_ratings']), on='book_id', how='inner')

    # Compute the corrected ratings
    ranked_books['corrected_ratings'] = ranked_books['predicted_ratings'] - 1 / np.sqrt(ranked_books['rating_count'])

    # Sort the books based on corrected ratings
    ranked_books = ranked_books.sort_values('corrected_ratings', ascending=False)

    return ranked_books
Note: In the corrected rating formula above, we could add the quantity 1/np.sqrt(n) instead of subtracting it to get more optimistic predictions. Here we subtract it because some books already have a predicted rating of 10, and a rating cannot exceed 10.
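As a worked example of the correction, the sketch below applies the same formula to individual values. The helper name `corrected_rating` is introduced here for illustration; the inputs for the first two calls are the predicted ratings and rating counts reported for book_id 259 and book_id 658 in this notebook.

```python
import math


def corrected_rating(predicted_rating, rating_count):
    """Subtract a penalty of 1 / sqrt(rating_count) from the predicted rating."""
    return predicted_rating - 1 / math.sqrt(rating_count)


# Values taken from the recommendation tables in this notebook
print(round(corrected_rating(10.0, 31), 6))      # book_id 259
print(round(corrected_rating(9.870802, 53), 6))  # book_id 658

# The penalty shrinks quickly as the rating count grows
for count in (1, 5, 50, 500):
    print(count, round(1 / math.sqrt(count), 3))
```

Because the penalty is at most 1 (for a book with a single rating), the correction mainly reorders books whose predicted ratings are already close to each other; it does not override large differences in predicted rating.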
# Applying the ranking_books function and sorting it based on corrected ratings
ranking_books(recommendations, final_rating)
| | book_id | rating_count | predicted_ratings | corrected_ratings |
|---|---|---|---|---|
| 3 | 259 | 31 | 10.000000 | 9.820395 |
| 0 | 658 | 53 | 9.870802 | 9.733441 |
| 2 | 1297 | 35 | 9.884446 | 9.715415 |
| 1 | 639 | 43 | 9.764398 | 9.611899 |
| 4 | 451 | 18 | 9.702661 | 9.466959 |
Model 3: Item based Collaborative Filtering Recommendation System¶
- We have seen user-user similarity-based collaborative filtering. Now, let us look into similarity-based collaborative filtering where similarity is calculated between items.
# Defining similarity measure
sim_options = {'name': 'cosine',
'user_based': False}
# Defining nearest neighbour algorithm
algo_knn_item = KNNBasic(sim_options=sim_options,verbose=False)
# Train the algorithm on the train set
algo_knn_item.fit(trainset)
# Let us compute precision@k, recall@k, and f_1 score with k=10
precision_recall_at_k(algo_knn_item)
RMSE: 1.6210 Precision: 0.802 Recall: 0.8 F_1 score: 0.801
Observations:
- We can observe that the baseline model has RMSE=1.62 and F_1 score=0.80 on the test set.
- We can try to improve these numbers by using GridSearchCV to tune different hyperparameters of this algorithm.
What is the predicted rating for a user with user_id=1326 and for book_id=12126?
algo_knn_item.predict(1326, 12126, r_ui=8, verbose=True)
user: 1326 item: 12126 r_ui = 8.00 est = 7.99 {'was_impossible': True, 'reason': 'User and/or item is unknown.'}
Prediction(uid=1326, iid=12126, r_ui=8, est=7.9887628424657535, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})
algo_knn_item.predict(1326, 2150, verbose=True)
user: 1326 item: 2150 r_ui = None est = 7.99 {'was_impossible': True, 'reason': 'User and/or item is unknown.'}
Prediction(uid=1326, iid=2150, r_ui=None, est=7.9887628424657535, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})
Tuning the baseline item-based collaborative filtering recommendation system's hyperparameters and determining the RMSE for the tuned item-based collaborative filtering recommendation system
# Setting up parameter grid to tune the hyperparameters
param_grid = {'k': [10,20,30], 'min_k': [3,6,9],
'sim_options': {'name': ['msd', 'cosine'],
'user_based': [False]}
}
# Performing 3-fold cross validation to tune the hyperparameters
grid_obj = GridSearchCV(KNNBasic, param_grid, measures=['rmse', 'mae'], cv=3)
# Fitting the data
grid_obj.fit(data)
# Best RMSE score
print(grid_obj.best_score['rmse'])
# Combination of parameters that gave the best RMSE score
print(grid_obj.best_params['rmse'])
Computing the msd similarity matrix... Done computing similarity matrix.
Computing the cosine similarity matrix... Done computing similarity matrix.
...
1.5992918772292677
{'k': 30, 'min_k': 3, 'sim_options': {'name': 'cosine', 'user_based': False}}
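The volume of log output follows from the size of the search. The sketch below enumerates the parameter grid in plain Python to count how many models a 3-fold grid search has to fit. It is an illustration of the combinatorics only, not surprise's internal code, and the helper name `expand` is hypothetical.

```python
from itertools import product

# Same grid as used for the item-based model
param_grid = {
    'k': [10, 20, 30],
    'min_k': [3, 6, 9],
    'sim_options': {'name': ['msd', 'cosine'], 'user_based': [False]},
}


def expand(grid):
    """Yield every parameter combination, expanding nested dicts such as sim_options."""
    keys = list(grid)
    value_lists = []
    for key in keys:
        values = grid[key]
        if isinstance(values, dict):
            value_lists.append(list(expand(values)))
        else:
            value_lists.append(values)
    for combo in product(*value_lists):
        yield dict(zip(keys, combo))


combinations = list(expand(param_grid))
n_folds = 3
print(len(combinations), "combinations,", len(combinations) * n_folds, "fits")
```

With 3 values of k, 3 values of min_k, and 2 similarity measures, there are 18 combinations and therefore 54 fits across 3 folds, each of which computes its own similarity matrix.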
Now, let's build the final model by using optimal values of the hyperparameters which we received by using grid search cross-validation.
# Creating an instance of KNNBasic with optimal hyperparameter values
similarity_algo_optimized_item = KNNBasic(sim_options={'name': 'cosine', 'user_based': False}, k=30, min_k=3,verbose=False)
# Training the algorithm on the train set
similarity_algo_optimized_item.fit(trainset)
# Let us compute precision@k and recall@k with k=10
precision_recall_at_k(similarity_algo_optimized_item)
RMSE: 1.5882 Precision: 0.818 Recall: 0.836 F_1 score: 0.827
Observations:
- After tuning the hyperparameters, the RMSE on the test set dropped from 1.62 to 1.59, and the F_1 score rose from 0.801 to 0.827. So, the model performance has improved slightly after hyperparameter tuning.
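The F_1 scores quoted in these observations are consistent with the usual harmonic mean of precision and recall. The sketch below recomputes them from the printed precision and recall values; it assumes that `precision_recall_at_k` (defined earlier in the notebook) uses this standard formula, and the helper name `f1_score` is introduced here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as printed for the item-based models
reported = {
    'item-item baseline': (0.802, 0.800),
    'item-item tuned': (0.818, 0.836),
}

for name, (precision, recall) in reported.items():
    print(name, round(f1_score(precision, recall), 3))
```

The recomputed values round to 0.801 and 0.827, matching the printed F_1 scores.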
Let's predict the rating for a user with user_id=1326 and for book_id=12126.
similarity_algo_optimized_item.predict(1326, 12126, r_ui=8, verbose=True)
user: 1326 item: 12126 r_ui = 8.00 est = 7.99 {'was_impossible': True, 'reason': 'User and/or item is unknown.'}
Prediction(uid=1326, iid=12126, r_ui=8, est=7.9887628424657535, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})
similarity_algo_optimized_item.predict(1326, 2150, verbose=True)
user: 1326 item: 2150 r_ui = None est = 7.99 {'was_impossible': True, 'reason': 'User and/or item is unknown.'}
Prediction(uid=1326, iid=2150, r_ui=None, est=7.9887628424657535, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})
Identifying similar items to a given item (nearest neighbors)¶
We can also find the items most similar to a given item (its nearest neighbors) with this KNNBasic model. Below, we find the 5 items most similar to book_id=1.
similarity_algo_optimized_item.get_neighbors(1, k=5)
[11, 12, 17, 21, 22]
Predicted top 5 books for user_id=1 with similarity based recommendation system¶
recommendations = get_recommendations(df_rating, 1, 5, similarity_algo_optimized_item)
# Building the dataframe for above recommendations with columns "book_id" and "predicted_ratings"
pd.DataFrame(recommendations, columns=['book_Id', 'predicted_ratings'])
| | book_Id | predicted_ratings |
|---|---|---|
| 0 | 1 | 10 |
| 1 | 15 | 10 |
| 2 | 16 | 10 |
| 3 | 17 | 10 |
| 4 | 30 | 10 |
# Applying the ranking_books function and sorting it based on corrected ratings
ranking_books(recommendations, final_rating)
| | book_id | rating_count | predicted_ratings | corrected_ratings |
|---|---|---|---|---|
| 0 | 1 | 28 | 10 | 9.811018 |
| 1 | 15 | 13 | 10 | 9.722650 |
| 2 | 16 | 13 | 10 | 9.722650 |
| 3 | 17 | 12 | 10 | 9.711325 |
| 4 | 30 | 10 | 10 | 9.683772 |
Model 4: Matrix Factorization¶
Model-based Collaborative Filtering is a personalized recommendation system. The recommendations are based on the past behavior of the user and it is not dependent on any additional information. We use latent features to find recommendations for each user.
Latent Features: The features that are not present in the empirical data but can be inferred from the data.
Singular Value Decomposition (SVD)¶
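SVD is used to compute latent features from the user-item matrix introduced earlier. Classical SVD, however, is not defined for a matrix with missing values, and the user-item matrix is mostly empty. The SVD algorithm in surprise therefore learns the latent factors by minimizing the prediction error over the observed ratings only, as described in the tuning section below.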
Building a baseline matrix factorization recommendation system¶
# Using SVD matrix factorization
svd = SVD(random_state=1)
# Training the algorithm on the train set
svd.fit(trainset)
# Let us compute precision@k and recall@k with k=10
precision_recall_at_k(svd)
RMSE: 1.5106 Precision: 0.827 Recall: 0.86 F_1 score: 0.843
Observations:
- We observe that the baseline F_1 score for the matrix factorization model on the test set is higher in comparison to the F_1 score for the user-user similarity-based recommendation system and lower in comparison to the optimized user-user similarity-based recommendation system.
- The result for SVD is better than both baseline and optimized item-item similarity-based recommendation systems.
What is the predicted rating for a user with user_id=1326 and for book_id=12126?
# Making prediction for user_id 1326 and book_id 12126
svd.predict(1326, 12126, r_ui=8, verbose=True)
user: 1326 item: 12126 r_ui = 8.00 est = 7.99 {'was_impossible': False}
Prediction(uid=1326, iid=12126, r_ui=8, est=7.9887628424657535, details={'was_impossible': False})
# Making prediction for user_id 1326 and book_id 2150
svd.predict(1326, 2150, verbose=True)
user: 1326 item: 2150 r_ui = None est = 7.99 {'was_impossible': False}
Prediction(uid=1326, iid=2150, r_ui=None, est=7.9887628424657535, details={'was_impossible': False})
Improving matrix factorization based recommendation system by tuning its hyperparameters¶
In SVD, rating is predicted as:
$$\hat{r}_{u i}=\mu+b_{u}+b_{i}+q_{i}^{T} p_{u}$$
If user $u$ is unknown, then the bias $b_{u}$ and the factors $p_{u}$ are assumed to be zero. The same applies for item $i$ with $b_{i}$ and $q_{i}$.
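The prediction rule, including the treatment of unknown users and items, can be written as a short function. This is a sketch of the formula only, not surprise's code; the name `svd_estimate` is hypothetical, and passing `None` is the convention chosen here to mark an unknown user or item. In surprise, the estimate is additionally clipped to the rating scale by default, which this sketch omits.

```python
def svd_estimate(mu, b_u=None, b_i=None, p_u=None, q_i=None):
    """Compute mu + b_u + b_i + q_i . p_u, treating missing terms as zero."""
    est = mu
    if b_u is not None:
        est += b_u
    if b_i is not None:
        est += b_i
    if p_u is not None and q_i is not None:
        est += sum(q * p for q, p in zip(q_i, p_u))
    return est


# Known user and item (made-up parameter values)
print(svd_estimate(8.0, b_u=0.5, b_i=-0.25, p_u=[1.0, 2.0], q_i=[0.5, 0.25]))

# Unknown user and unknown item: the estimate reduces to the global mean
print(svd_estimate(7.9887628424657535))
```

When both the user and the item are unknown, the estimate equals the global mean $\mu$, which would explain an SVD prediction that coincides with the KNN fallback value while still reporting 'was_impossible': False.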
To estimate all the unknowns, we minimize the following regularized squared error:
$$\sum_{r_{u i} \in R_{\text {train }}}\left(r_{u i}-\hat{r}_{u i}\right)^{2}+\lambda\left(b_{i}^{2}+b_{u}^{2}+\left\|q_{i}\right\|^{2}+\left\|p_{u}\right\|^{2}\right)$$
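The minimization is performed by straightforward stochastic gradient descent, where $e_{u i}=r_{u i}-\hat{r}_{u i}$ is the prediction error, $\gamma$ is the learning rate, and $\lambda$ is the regularization strength: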
$$\begin{aligned} b_{u} & \leftarrow b_{u}+\gamma\left(e_{u i}-\lambda b_{u}\right) \\ b_{i} & \leftarrow b_{i}+\gamma\left(e_{u i}-\lambda b_{i}\right) \\ p_{u} & \leftarrow p_{u}+\gamma\left(e_{u i} \cdot q_{i}-\lambda p_{u}\right) \\ q_{i} & \leftarrow q_{i}+\gamma\left(e_{u i} \cdot p_{u}-\lambda q_{i}\right) \end{aligned}$$
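A single pass of these update rules for one observed rating can be sketched in plain Python. This is a minimal illustration of the equations, not surprise's implementation; the function name `sgd_step` and the starting values are hypothetical, with `lr` playing the role of $\gamma$ and `reg` the role of $\lambda$.

```python
def sgd_step(mu, b_u, b_i, p_u, q_i, r_ui, lr, reg):
    """Apply one SGD update for a single observed rating r_ui."""
    # Prediction and error with the current parameters
    pred = mu + b_u + b_i + sum(q * p for q, p in zip(q_i, p_u))
    err = r_ui - pred

    # Bias updates
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)

    # Factor updates; both use the old values of p_u and q_i
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]

    return new_b_u, new_b_i, new_p_u, new_q_i, err


# Made-up starting point: global mean 7.0, one latent factor, true rating 8.0
mu, b_u, b_i, p_u, q_i = 7.0, 0.0, 0.0, [0.1], [0.2]
errors = []
for _ in range(50):
    b_u, b_i, p_u, q_i, err = sgd_step(mu, b_u, b_i, p_u, q_i, r_ui=8.0, lr=0.01, reg=0.2)
    errors.append(err)

print(round(errors[0], 4), round(errors[-1], 4))
```

Repeating the step on the same rating shrinks the error, while the `reg` term keeps the parameters from growing without bound. In a real run, one epoch applies this step once to every rating in the training set.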
There are many hyperparameters to tune in this algorithm; the full list is given in the surprise documentation for SVD.
Below we will be tuning only three hyperparameters:
- n_epochs: The number of iterations of the SGD algorithm
- lr_all: The learning rate for all parameters
- reg_all: The regularization term for all parameters
# Set the parameter space to tune
param_grid = {'n_epochs': [10, 20, 30], 'lr_all': [0.001, 0.005, 0.01],
'reg_all': [0.2, 0.4, 0.6]}
# Performing 3-fold gridsearch cross validation
gs_ = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3, n_jobs=-1)
# Fitting data
gs_.fit(data)
# Best RMSE score
print(gs_.best_score['rmse'])
# Combination of parameters that gave the best RMSE score
print(gs_.best_params['rmse'])
1.5040520223692642 {'n_epochs': 30, 'lr_all': 0.01, 'reg_all': 0.2}
Now, let's build the final model by using optimal values of the hyperparameters which we received by using grid search cross-validation.
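# Building the optimized SVD model
# Note: the grid search above reported lr_all=0.01 as the best value, while lr_all=0.005 is used here
svd_optimized = SVD(n_epochs=30, lr_all=0.005, reg_all=0.2, random_state=1)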
# Training the algorithm on the train set
svd_optimized=svd_optimized.fit(trainset)
# Let us compute precision@k and recall@k with k=10
precision_recall_at_k(svd_optimized)
RMSE: 1.5024 Precision: 0.829 Recall: 0.856 F_1 score: 0.842
Observation:
- We observe that after tuning hyperparameters, the model performance has not improved by much. We can try other values for hyperparameters and see if we can get a better performance. However, here we will proceed with the existing model.
Let's predict the rating for a user with user_id=1326 and for book_id=12126.
# Making prediction for user_id 1326 and book_id 12126
svd_optimized.predict(1326, 12126, r_ui=8, verbose=True)
user: 1326 item: 12126 r_ui = 8.00 est = 7.99 {'was_impossible': False}
Prediction(uid=1326, iid=12126, r_ui=8, est=7.9887628424657535, details={'was_impossible': False})
# Making prediction for user_id 1326 and book_id 2150
svd_optimized.predict(1326, 2150, verbose=True)
user: 1326 item: 2150 r_ui = None est = 7.99 {'was_impossible': False}
Prediction(uid=1326, iid=2150, r_ui=None, est=7.9887628424657535, details={'was_impossible': False})
Now, let's recommend the books using the optimized svd model
# Getting top 5 recommendations for user_id 1 using "svd_optimized" algorithm
svd_recommendations = get_recommendations(df_rating, 1, 5, svd_optimized)
# Ranking book based on above recommendations
ranking_books(svd_recommendations, final_rating)
| | book_id | rating_count | predicted_ratings | corrected_ratings |
|---|---|---|---|---|
| 0 | 70 | 32 | 10 | 9.823223 |
| 1 | 65 | 15 | 10 | 9.741801 |
| 2 | 73 | 14 | 10 | 9.732739 |
| 3 | 16 | 13 | 10 | 9.722650 |
| 4 | 34 | 12 | 10 | 9.711325 |
Conclusion¶
In this case study, we built recommendation systems using four different algorithms. They are as follows:
- Rank-based using averages
- User-user similarity-based collaborative filtering
- Item-item similarity-based collaborative filtering
- Model-based (matrix factorization) collaborative filtering
To demonstrate "user-user similarity-based collaborative filtering", "item-item similarity-based collaborative filtering", and "model-based (matrix factorization) collaborative filtering", the surprise library was used. For these algorithms, grid search cross-validation was used to find the optimal hyperparameters for the data and improve the performance of the model.
For performance evaluation of these models, precision@k and recall@k are used. Using these two metrics, the F_1 score is calculated for each working model.
Overall, the optimized user-user similarity-based recommendation system gave the best performance in terms of F_1 score (~0.86).
Similarity-based collaborative filtering searches for neighbors with similar book preferences and recommends books that those neighbors have read, whereas matrix factorization decomposes the user-item matrix into the product of two lower-dimensional rectangular matrices.
Matrix factorization achieves the lowest RMSE (1.50), plausibly because it represents both books and users in a shared low-dimensional latent space and recommends a book based on its proximity to the user in that space. In other words, it accounts for latent factors as well.
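The comparison can be reproduced from the test-set metrics printed in this section. The sketch below simply collects those numbers in a dictionary and picks the best model under each criterion; the dictionary keys are labels chosen here for readability.

```python
# Test-set metrics as printed by precision_recall_at_k in this notebook
results = {
    'user-user (tuned)': {'rmse': 1.6866, 'f1': 0.862},
    'item-item (baseline)': {'rmse': 1.6210, 'f1': 0.801},
    'item-item (tuned)': {'rmse': 1.5882, 'f1': 0.827},
    'SVD (baseline)': {'rmse': 1.5106, 'f1': 0.843},
    'SVD (tuned)': {'rmse': 1.5024, 'f1': 0.842},
}

# Higher F_1 is better; lower RMSE is better
best_f1 = max(results, key=lambda name: results[name]['f1'])
best_rmse = min(results, key=lambda name: results[name]['rmse'])

print("Best F_1:", best_f1)
print("Best RMSE:", best_rmse)
```

The two criteria favor different models, so the choice depends on whether ranking quality (F_1 at k) or rating accuracy (RMSE) matters more for the application.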
We can try to further improve the performance of these models using hyperparameter tuning.
We can also try to combine different recommendation techniques to build a more complex model like hybrid recommendation systems.