from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

!pip install surprise

Collecting surprise
  Downloading surprise-0.1-py2.py3-none-any.whl.metadata (327 bytes)
Collecting scikit-surprise (from surprise)
  Downloading scikit_surprise-1.1.4.tar.gz (154 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 154.4/154.4 kB 3.0 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from scikit-surprise->surprise) (1.4.2)
Requirement already satisfied: numpy>=1.19.5 in /usr/local/lib/python3.10/dist-packages (from scikit-surprise->surprise) (1.26.4)
Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from scikit-surprise->surprise) (1.13.1)
Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.4-cp310-cp310-linux_x86_64.whl size=2357301 sha256=96d1916ebffbfd506f19107ea8e35c89075dbcf437f027a3c9cce22df807fba4
  Stored in directory: /root/.cache/pip/wheels/4b/3f/df/6acbf0a40397d9bf3ff97f582cc22fb9ce66adde75bc71fd54
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.4 surprise-0.1

# Data Handling
import pandas as pd

# Mathematical calculation
import numpy as np

# Date Time Analysis
from time import time
from datetime import datetime

# Data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.patches as mpatches

# Statistical analysis
from scipy import stats

# Warnings
import warnings;
import numpy as np
warnings.filterwarnings('ignore')

# Comment formatting
from IPython.display import display, HTML

# scikit-surprise recommender package
from surprise import SVD, KNNWithMeans
from surprise import Dataset, Reader, accuracy
from surprise.model_selection import train_test_split, GridSearchCV
from surprise.prediction_algorithms.baseline_only import BaselineOnly

# Read the csv
data = pd.read_csv('/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Seven_-_Recommendation_Systems/Project_Amazon_Product_Recommendation_System/ratings_Electronics.csv')

# Add column names
data.columns = ['user_id', 'prod_id', 'rating', 'timestamp']

data.shape

(7824481, 4)

data.head()

data.tail()

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7824481 entries, 0 to 7824480
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   user_id    object 
 1   prod_id    object 
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 238.8+ MB

# Copy the data to another DataFrame called df
df = data.copy()

# Get the column containing the users
users = df.user_id

# Create a dictionary from users to their number of ratings
ratings_count = dict()

for user in users:

    # If we already have the user, just add 1 to their rating count
    if user in ratings_count:
        ratings_count[user] += 1

    # Otherwise, set their rating count to 1
    else:
        ratings_count[user] = 1

# We want our users to have at least 50 ratings to be considered
RATINGS_CUTOFF = 50

remove_users = []

for user, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_users.append(user)

df = df.loc[ ~ df.user_id.isin(remove_users)]

# Get the column containing the products
prods = df.prod_id

# Create a dictionary from products to their number of ratings
ratings_count = dict()

for prod in prods:

    # If we already have the product, just add 1 to its rating count
    if prod in ratings_count:
        ratings_count[prod] += 1

    # Otherwise, set their rating count to 1
    else:
        ratings_count[prod] = 1

# We want our item to have at least 5 ratings to be considered
RATINGS_CUTOFF = 5

remove_users = []

for user, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_users.append(user)

df_final = df.loc[~ df.prod_id.isin(remove_users)]

# Print a few rows of the imported dataset
df_final.head()

# Check the number of rows and columns and provide observations
df.shape

(125871, 4)

# Check Data types and provide observations
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 125871 entries, 93 to 7824443
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   user_id    125871 non-null  object 
 1   prod_id    125871 non-null  object 
 2   rating     125871 non-null  float64
 3   timestamp  125871 non-null  int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 4.8+ MB

# Check for missing values present and provide observations
df.isnull().sum()

# Check for duplicate user_id, product_id combinations
df.duplicated(subset = ['user_id', 'prod_id']).sum()

0

ratings_count = df.groupby('rating')['rating'].agg(['count'])
ratings_count

# Summary statistics of 'rating' variable and provide observations
df.describe(include = "all").T

# Define ratings variables for later use
min_rating = min(df['rating'])
max_rating = max(df['rating'])

# Convert the timestamp column to a readable date time format
def convert_timestamp(ts):
  try:
    # Check if it's already a datetime object, if so return it
    if isinstance(ts, datetime):
      return ts.strftime('%Y-%m-%d %H:%M:%S')

    # Check if it's a string and if it can be converted to a timestamp
    if isinstance(ts, str) and ts.isdigit():
      return datetime.utcfromtimestamp(int(ts)).strftime('%Y-%m-%d %H:%M:%S')

    # If it's already a number, assume it's a Unix timestamp
    if isinstance(ts, (int, float)):
      return datetime.utcfromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')

    # If none of the above, return the original value (and it will likely result in a NaT)
    return ts

  # Handle potential errors during conversion
  except (ValueError, TypeError):
    return ts

# Apply the function to the timestamp column
data['timestamp'] = data.timestamp.apply(convert_timestamp)

# Convert the datatype to datetime
data['timestamp'] = pd.to_datetime(data['timestamp'])

# Get the timespan of data

print("Minimum recorded ts:",data.timestamp.min())
print("Maximum recorded ts:",data.timestamp.max())

Minimum recorded ts: 1998-12-04 00:00:00
Maximum recorded ts: 2014-07-23 00:00:00

# Visualize the year wise ratings distributions

# First, extract the year from the timestamp
data['Year'] = data['timestamp'].dt.year

# Calculate the count of ratings per year
yearly_counts = data['Year'].value_counts().sort_index()

# Set up the plotting area
fig, ax = plt.subplots(figsize=(12, 6))

# Plot a bar chart
yearly_counts.plot(kind='bar', ax=ax, color='skyblue')

# Add titles and labels
ax.set_title('Distribution of Ratings by Year')
ax.set_xlabel('Year')
ax.set_ylabel('Number of Ratings')

# Improve layout and display the plot
plt.tight_layout()
plt.show()

# Drop the column timestamp
data.drop('timestamp', axis = 1, inplace = True)

# Create a frequency table for the ratings
rating_counts = df['rating'].value_counts().reset_index()
rating_counts.columns = ['Rating', 'Frequency']

# Calculate percentage for each rating
rating_counts['Percentage'] = (rating_counts['Frequency'] / rating_counts['Frequency'].sum()) * 100

# Create the horizontal bar plot with the Tab10 palette
fig, ax = plt.subplots(figsize=(10, 6))
colors = plt.get_cmap('tab10').colors
bars = ax.barh(rating_counts['Rating'], rating_counts['Frequency'], color=colors[:len(rating_counts)])

# Remove scientific notation on the x-axis
ax.ticklabel_format(style='plain', axis='x')

# Annotate the relative percentage inside each bar
for bar, percentage in zip(bars, rating_counts['Percentage']):
    width = bar.get_width()
    ax.text(width - (width * 0.05), bar.get_y() + bar.get_height() / 2,
            f'{percentage:.2f}%', va='center', ha='right', color='white')

# Add labels and title
ax.set_xlabel('Frequency')
ax.set_ylabel('Rating')
ax.set_title('Rating Distribution')

# Display the plot
plt.show()

# Display distribution by count and percentage
ratings_counts = df.groupby('rating')['rating'].agg(['count'])
ratings_percentages = ratings_counts / ratings_counts.sum() * 100

combined_table = pd.concat([ratings_counts, ratings_percentages], axis=1)

combined_table.columns = ['Count', 'Percentage (%)']

combined_table['Percentage (%)'] = combined_table['Percentage (%)'].round(2)

combined_table

# Number of total rows in the data and number of unique user id and product id in the data

# Total number of rows
total_rows = df.shape[0]

# Number of unique users
unique_users = df['user_id'].nunique()

# Print in table format
pd.DataFrame({'Total Rows': [total_rows], 'Unique Users': [unique_users]})

# Top 10 users based on the number of ratings
user_ratings_counts = df.groupby('user_id').size().reset_index(name='num_ratings')
user_ratings_counts = user_ratings_counts.sort_values(by='num_ratings', ascending=False)

# Display the top 10 users
user_ratings_counts.head(10)

# Users with 250 or more reviews

users_with_250_or_more_reviews = user_ratings_counts[user_ratings_counts['num_ratings'] >= 250].shape[0]
users_with_250_or_more_reviews_perc = round((users_with_250_or_more_reviews / user_ratings_counts.shape[0]) * 100, 2)

print(f"Number of unique users with 250 or more reviews: {users_with_250_or_more_reviews} or {users_with_250_or_more_reviews_perc}%")

# Users with 100 or more reviews

users_with_100_or_more_reviews = user_ratings_counts[user_ratings_counts['num_ratings'] >= 100].shape[0]
users_with_100_or_more_reviews_perc = round((users_with_100_or_more_reviews / user_ratings_counts.shape[0]) * 100, 2)

print(f"Number of unique users with 100 or more reviews: {users_with_100_or_more_reviews} or {users_with_100_or_more_reviews_perc}%")

# Users with 50 reviews

users_with_50_reviews = user_ratings_counts[user_ratings_counts['num_ratings'] == 50].shape[0]
users_with_50_reviews_perc = round((users_with_50_reviews / user_ratings_counts.shape[0]) * 100, 2)

print(f"Number of unique users with 50 reviews: {users_with_50_reviews} or {users_with_50_reviews_perc}%")

Number of unique users with 250 or more reviews: 23 or 1.49%
Number of unique users with 100 or more reviews: 289 or 18.77%
Number of unique users with 50 reviews: 74 or 4.81%

# Chart the distribution of users performing ratings, using bins

plt.figure(figsize=(10, 6))
plt.hist(user_ratings_counts['num_ratings'], bins=50)
plt.xlabel('Number of Ratings')
plt.ylabel('Number of Users')
plt.title('Distribution of Users Performing Ratings')
plt.show()

# Perform a calculation to find the average rating for each product

# Group by 'prod_id' and calculate the average rating for each product
average_ratings = df.groupby('prod_id')['rating'].mean().reset_index()

# Rename the columns for better readability
average_ratings.columns = ['prod_id', 'average_rating']

# Calculate the count of ratings for each product

# Group by 'prod_id' and calculate the count of ratings for each product
ratings_count = df.groupby('prod_id')['rating'].count().reset_index()

# Rename the columns for better clarity
ratings_count.columns = ['prod_id', 'ratings_count']

# Create a dataframe with calculated average and count of ratings

# Group by 'prod_id' and calculate both the average and count of ratings
product_stats = df.groupby('prod_id').agg(
    average_rating=('rating', 'mean'),   # Calculate the average rating
    ratings_count=('rating', 'count')    # Calculate the count of ratings
).reset_index()

# Round the 'average_rating' column to 2 decimal places
product_stats['average_rating'] = product_stats['average_rating'].round(2)

# Sort the dataframe by average of ratings in the descending order

# Sort the DataFrame by 'average_rating' in descending order
final_rating = product_stats.sort_values(by='average_rating', ascending=False)

# See the first five records of the "final_rating" dataset
final_rating.head(5)

# Products with more than one rating

# Sort the DataFrame by 'average_rating' in descending order
final_rating = product_stats.sort_values(by='ratings_count', ascending=False)
final_rating

# Create scatter plot

x_column    = 'average_rating'
y_column    = 'ratings_count'
data        = final_rating
title       = 'Average Rating vs Ratings Count'
x_label     = 'Average Rating'
y_label     = 'Ratings Count'
palette     = 'tab10'

plt.figure(figsize=(10, 6))
sns.scatterplot(x=x_column, y=y_column, data=data, palette='palette')
plt.title(title if title else f'Scatter Plot: {x_column} vs {y_column}')
plt.xlabel(x_label if x_label else x_column)
plt.ylabel(y_label if y_label else y_column)
plt.grid(True)
plt.show()

# Defining a function to get the top n products based on the highest average rating and minimum interactions by finding products with a minimum number of interactions and sorting by average rating

def get_top_n_products(final_rating, n, min_interactions):
    """
    Get the top n products based on the highest average rating and a specified minimum number of interactions.

    Parameters:
    final_rating (DataFrame): DataFrame containing 'prod_id', 'average_rating', and 'ratings_count'.
    n (int): The number of top products to return.
    min_interactions (int): The minimum number of interactions required for a product to be considered.

    Returns:
    DataFrame: A DataFrame with the top n products sorted by average rating.
    """
    # Filter products based on the minimum number of interactions passed as an argument
    filtered_products = final_rating[final_rating['ratings_count'] >= min_interactions]

    # Sort values with respect to average rating in descending order
    sorted_products = filtered_products.sort_values(by='average_rating', ascending=False)

    # Return the top n products
    return sorted_products.head(n)

# Example usage:
# Get the top 5 products with at least 10 ratings
#top_products = get_top_n_products(final_rating, n=5, min_interactions=10)

# Top 5 products with 50 minimum interactions based on popularity
top_products_50 = get_top_n_products(final_rating, n=5, min_interactions=50)
top_products_50

# Top 5 products with 100 minimum interactions based on popularity
top_products_100 = get_top_n_products(final_rating, n=5, min_interactions=100)
top_products_100

# To compute the accuracy of models
# from surprise import accuracy                                                 - instantiated above

# Class is used to parse a file containing ratings,
# data should be in structure - user ; item ; rating
# from surprise.reader import Reader                                            - instantiated above

# Class for loading datasets
# from surprise.dataset import Dataset                                          - instantiated above

# For tuning model hyperparameters
# from surprise.model_selection import GridSearchCV                             - instantiated above

# For splitting the rating data in train and test datasets
# from surprise.model_selection import train_test_split                         - instantiated above


# For implementing similarity-based recommendation system
from surprise.prediction_algorithms.knns import KNNBasic

# For implementing matrix factorization based recommendation system
from surprise.prediction_algorithms.matrix_factorization import SVD

# for implementing K-Fold cross-validation
from surprise.model_selection import KFold

# For implementing clustering-based recommendation system
from surprise import CoClustering

# The provided function requires the following

from collections import defaultdict

def precision_recall_at_k(model, k=10, threshold = 3.5):
    """Return precision and recall at k metrics for each user"""

    # First map the predictions to each user
    user_est_true = defaultdict(list)

    # Making predictions on the test data
    predictions = model.test(testset)

    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key = lambda x: x[0], reverse = True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. Therefore, we are setting Precision to 0 when n_rec_k is 0

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. Therefore, we are setting Recall to 0 when n_rel is 0

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0

    # Mean of all the predicted precisions are calculated.
    precision = round((sum(prec for prec in precisions.values()) / len(precisions)), 3)

    # Mean of all the predicted recalls are calculated.
    recall = round((sum(rec for rec in recalls.values()) / len(recalls)), 3)

    accuracy.rmse(predictions)

    print('Precision: ', precision)                                             # Command to print the overall precision

    print('Recall: ', recall)                                                   # Command to print the overall recall

    print('F_1 score: ', round((2*precision*recall)/(precision+recall), 3))     # Formula to compute the F-1 score

# Instatiate the Reader object (use min_rating and max_rating as previously defined )

reader = Reader(rating_scale=(min_rating, max_rating))

# Loading the rating dataset

ratings_suprise = Dataset.load_from_df(df_final[['user_id', 'prod_id', 'rating']], reader)

# Splitting the data  (70:30 ratio)

trainset, testset = train_test_split(                                           # The dataset size is sufficiently large enough to split the data a bit more generously with respect to testing
                                    ratings_suprise,                            # Note: a test if an 80:20 split lead to essentially the same results
                                    test_size=0.3,
                                    random_state=1
                                    )

# Declaring the similarity options for a user-user collaborative filtering

sim_options = {
              'name': 'cosine',                                                 # Use 'cosine' or 'pearson' or 'pearson_baseline'
              'user_based': True,                                               # Set to True for user-user similarity
              }

# Initialize the KNNBasic model using sim_options, verbose=False, and setting random_state

sim_user = KNNBasic(sim_options=sim_options, verbose=False, random_state=1)

# Fit the model on the training data (assuming `trainset` is already prepared)

sim_user.fit(trainset)

<surprise.prediction_algorithms.knns.KNNBasic at 0x783cafb16b90>

# Compute precision@k, recall@k, and F1 Score using the provided function

precision_recall_at_k(sim_user)

RMSE: 1.0390
Precision:  0.852
Recall:  0.785
F_1 score:  0.817

# Predicting rating for a sample user with an interacted product

sim_user.predict('A3LDPF5FMB782Z', '1400501466', r_ui = 5, verbose = True)

user: A3LDPF5FMB782Z item: 1400501466 r_ui = 5.00   est = 3.80   {'actual_k': 5, 'was_impossible': False}

Prediction(uid='A3LDPF5FMB782Z', iid='1400501466', r_ui=5, est=3.8, details={'actual_k': 5, 'was_impossible': False})

# Find unique user_id where prod_id is not equal to "1400501466"

not_seen = df_final[df_final.prod_id != "1400501466"]
not_seen

'A34BZM6S9L7QI4' in not_seen['user_id'].unique()

True

# Predicting rating for a sample user with a non interacted product

sim_user.predict('A34BZM6S9L7QI4', '1400501466', verbose=True)

user: A34BZM6S9L7QI4 item: 1400501466 r_ui = None   est = 2.00   {'actual_k': 2, 'was_impossible': False}

Prediction(uid='A34BZM6S9L7QI4', iid='1400501466', r_ui=None, est=1.9969324864734994, details={'actual_k': 2, 'was_impossible': False})

# Parameter grid to tune the hyperparameters

param_grid = {
            'k': [10, 20, 30],
            'min_k': [3, 6, 9],
            'sim_options': {
                            'name': ['msd', 'cosine'],
                            'user_based': [True]                                # Compute similarities between users
                            }
            }

# Perform 3-fold cross-validation to tune hyperparameters

grid_search = GridSearchCV(KNNBasic, param_grid, measures=['rmse'], cv=3)

# Fit the data

grid_search.fit(ratings_suprise)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.

# Best RMSE score

print(grid_search.best_score['rmse'])

0.9697570001107455

# Combination of parameters that gave the best RMSE score

print(grid_search.best_params['rmse'])

{'k': 30, 'min_k': 6, 'sim_options': {'name': 'cosine', 'user_based': True}}

# Using the optimal similarity measure for user-user based collaborative filtering

sim_options = {
              'name': 'cosine',
              'user_based': True,
              }

# Creating an instance of KNNBasic with optimal hyperparameter values

sim_user_optimized = KNNBasic(sim_options=sim_options, verbose=False, random_state=1, k=30, min_k=6)

# Training the algorithm on the trainset

sim_user_optimized.fit(trainset)

<surprise.prediction_algorithms.knns.KNNBasic at 0x783cafb16f80>

# Let us compute precision@k and recall@k also with k = 10

precision_recall_at_k(sim_user_optimized)

RMSE: 0.9791
Precision:  0.842
Recall:  0.807
F_1 score:  0.824

# Use sim_user_optimized model to recommend for userId "A3LDPF5FMB782Z" and productId "1400501466"

sim_user_optimized.predict('A3LDPF5FMB782Z', '1400501466', r_ui = 5,verbose=True)

user: A3LDPF5FMB782Z item: 1400501466 r_ui = 5.00   est = 4.30   {'was_impossible': True, 'reason': 'Not enough neighbors.'}

Prediction(uid='A3LDPF5FMB782Z', iid='1400501466', r_ui=5, est=4.29674200818327, details={'was_impossible': True, 'reason': 'Not enough neighbors.'})

# Use sim_user_optimized model to recommend for userId "A34BZM6S9L7QI4" and productId "1400501466"

sim_user_optimized.predict('A34BZM6S9L7QI4', '1400501466', verbose=True)

user: A34BZM6S9L7QI4 item: 1400501466 r_ui = None   est = 4.30   {'was_impossible': True, 'reason': 'Not enough neighbors.'}

Prediction(uid='A34BZM6S9L7QI4', iid='1400501466', r_ui=None, est=4.29674200818327, details={'was_impossible': True, 'reason': 'Not enough neighbors.'})

# 0 is the inner id of the above user

sim_user_optimized.get_neighbors(0, k=5)

[1, 10, 17, 18, 28]

def get_recommendations(data, user_id, top_n, algo):

    # Creating an empty list to store the recommended product ids
    recommendations = []

    # Creating an user item interactions matrix
    user_item_interactions_matrix = data.pivot(index = 'user_id', columns = 'prod_id', values = 'rating')

    # Extracting those product ids which the user_id has not interacted yet
    non_interacted_products = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].isnull()].index.tolist()

    # Looping through each of the product ids which user_id has not interacted yet
    for item_id in non_interacted_products:

        # Predicting the ratings for those non interacted product ids by this user
        est = algo.predict(user_id, item_id).est

        # Appending the predicted ratings
        recommendations.append((item_id, est))

    # Sorting the predicted ratings in descending order
    recommendations.sort(key = lambda x: x[1], reverse = True)

    return recommendations[:top_n] # Returing top n highest predicted rating products for this user

# Making top 5 recommendations for user_id "A3LDPF5FMB782Z" with a similarity-based recommendation engine

recommendations = get_recommendations(df_final, 'A3LDPF5FMB782Z', 5, sim_user_optimized)

# Building the dataframe for above recommendations with columns "prod_id" and "predicted_ratings"

pd.DataFrame(recommendations, columns = ['prod_id', 'predicted_ratings'])

# Declaring the similarity options.

sim_options = {
              'name': 'cosine',
              'user_based': False
              }

# KNN algorithm is used to find desired similar items.

sim_item = KNNBasic(sim_options=sim_options,  verbose=False, random_state=1)

# Train the algorithm on the trainset

sim_item.fit(trainset)

<surprise.prediction_algorithms.knns.KNNBasic at 0x783cadf794e0>

# Predict ratings for the testset

testset_predictions = sim_item.test(testset)
testset_predictions[:5]

[Prediction(uid='A20DZX38KRBIT8', iid='B001EQ0HAW', r_ui=5.0, est=1.0, details={'actual_k': 1, 'was_impossible': False}),
 Prediction(uid='A1N5FSCYN4796F', iid='B006MBP7T0', r_ui=5.0, est=3.7256848753629352, details={'actual_k': 14, 'was_impossible': False}),
 Prediction(uid='A1D9V11QUHXENQ', iid='B0000645RH', r_ui=4.0, est=4.0, details={'actual_k': 1, 'was_impossible': False}),
 Prediction(uid='A1L1N3J6XNABO2', iid='B004DBD4TG', r_ui=5.0, est=5.0, details={'actual_k': 2, 'was_impossible': False}),
 Prediction(uid='A30XZK10EZN9V4', iid='B003XM1WE0', r_ui=5.0, est=3.998468592044059, details={'actual_k': 4, 'was_impossible': False})]

# Let us compute precision@k, recall@k, and f_1 score with k = 10

precision_recall_at_k(sim_item)

RMSE: 1.0345
Precision:  0.833
Recall:  0.768
F_1 score:  0.799

# Predicting rating for a sample user with an interacted product

sim_item.predict('A3LDPF5FMB782Z', '1400501466', r_ui = 5, verbose = True)

user: A3LDPF5FMB782Z item: 1400501466 r_ui = 5.00   est = 4.19   {'actual_k': 16, 'was_impossible': False}

Prediction(uid='A3LDPF5FMB782Z', iid='1400501466', r_ui=5, est=4.1875, details={'actual_k': 16, 'was_impossible': False})

# Predicting rating for a sample user with a non interacted product

sim_item.predict('A34BZM6S9L7QI4', '1400501466', verbose=True)

user: A34BZM6S9L7QI4 item: 1400501466 r_ui = None   est = 4.00   {'actual_k': 3, 'was_impossible': False}

Prediction(uid='A34BZM6S9L7QI4', iid='1400501466', r_ui=None, est=4.0, details={'actual_k': 3, 'was_impossible': False})

# Setting up parameter grid to tune the hyperparameters

param_grid_ = {
              'k': [10, 20, 30],
              'min_k': [3, 6, 9],
              'sim_options': {
                              'name': ['msd', 'cosine'],
                              'user_based': [True]
                              }
              }

# Perform 3-fold cross-validation to tune hyperparameters

grid_search_ = GridSearchCV(KNNBasic, param_grid_, measures=['rmse'], cv=3)

# Fit the data

grid_search_.fit(ratings_suprise)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.

# Find the best RMSE score

print(grid_search_.best_score['rmse'])

0.9714976467628085

# Find the combination of parameters that gave the best RMSE score

print(grid_search_.best_params['rmse'])

{'k': 30, 'min_k': 6, 'sim_options': {'name': 'cosine', 'user_based': True}}

# Using the optimal similarity measure for item-item based collaborative filtering

sim_options_ = {
              'name': 'msd',
              'user_based': False
              }

# Creating an instance of KNNBasic with optimal hyperparameter values

sim_item_optimized = KNNBasic(sim_options=sim_options_, verbose=False, random_state=1, k=30, min_k=6)

# Training the algorithm on the trainset

sim_item_optimized.fit(trainset)

<surprise.prediction_algorithms.knns.KNNBasic at 0x783caf97d780>

# Let us compute precision@k and recall@k, f1_score and RMSE

precision_recall_at_k(sim_item_optimized)

RMSE: 0.9804
Precision:  0.833
Recall:  0.8
F_1 score:  0.816

# Use sim_item_optimized model to recommend for userId "A3LDPF5FMB782Z" and productId "1400501466"

sim_item_optimized.predict('A3LDPF5FMB782Z', '1400501466', r_ui = 5,verbose=True)

user: A3LDPF5FMB782Z item: 1400501466 r_ui = 5.00   est = 4.53   {'actual_k': 16, 'was_impossible': False}

Prediction(uid='A3LDPF5FMB782Z', iid='1400501466', r_ui=5, est=4.534653465346536, details={'actual_k': 16, 'was_impossible': False})

# Use sim_item_optimized model to recommend for userId "A34BZM6S9L7QI4" and productId "1400501466"

sim_item_optimized.predict('A34BZM6S9L7QI4', '1400501466', verbose=True)

user: A34BZM6S9L7QI4 item: 1400501466 r_ui = None   est = 4.30   {'was_impossible': True, 'reason': 'Not enough neighbors.'}

Prediction(uid='A34BZM6S9L7QI4', iid='1400501466', r_ui=None, est=4.29674200818327, details={'was_impossible': True, 'reason': 'Not enough neighbors.'})

# 0 is the inner id of the above user

sim_item_optimized.get_neighbors(0, k=5)

[9, 12, 13, 22, 28]

# Making top 5 recommendations for user_id A1A5KUIIIHFF4U with similarity-based recommendation engine.

recommendations_ = get_recommendations(df_final, 'A1A5KUIIIHFF4U', 5, sim_item_optimized)

# Building the dataframe for above recommendations with columns "prod_id" and "predicted_ratings"

pd.DataFrame(recommendations_, columns = ['prod_id', 'predicted_ratings'])

# Using SVD matrix factorization. Use random_state = 1

svd = SVD(random_state=1)

# Training the algorithm on the trainset

svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f9057fb9cf0>

# Use the function precision_recall_at_k to compute precision@k, recall@k, F1-Score, and RMSE

precision_recall_at_k(svd)

RMSE: 0.9114
Precision:  0.854
Recall:  0.802
F_1 score:  0.827

# Making prediction

svd.predict('A3LDPF5FMB782Z', '1400501466', r_ui = 5, verbose = True)

user: A3LDPF5FMB782Z item: 1400501466 r_ui = 5.00   est = 4.26   {'was_impossible': False}

Prediction(uid='A3LDPF5FMB782Z', iid='1400501466', r_ui=5, est=4.262585198727372, details={'was_impossible': False})

# Making prediction

svd.predict('A34BZM6S9L7QI4', '1400501466', verbose=True)

user: A34BZM6S9L7QI4 item: 1400501466 r_ui = None   est = 4.43   {'was_impossible': False}

Prediction(uid='A34BZM6S9L7QI4', iid='1400501466', r_ui=None, est=4.430784168423419, details={'was_impossible': False})

# Set the parameter space to tune

param_grid_ = {
              'n_epochs': [5, 10],
              'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]
              }

# Performing 3-fold gridsearch cross-validation

grid_search_ = GridSearchCV(SVD, param_grid_, measures=['rmse'], cv=3)

# Fit the data

grid_search_.fit(ratings_suprise)

# Best RMSE score

print(grid_search_.best_score['rmse'])

0.913529048522442

# Combination of parameters that gave the best RMSE score

print(grid_search_.best_params['rmse'])

{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}

# Build the optimized SVD model using optimal hyperparameter search. Use random_state=1

svd_optimized = SVD(random_state=1)

# Train the algorithm on the trainset

svd_optimized.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f90580dad70>

# Use the function precision_recall_at_k to compute precision@k, recall@k, F1-Score, and RMSE

precision_recall_at_k(svd_optimized)

RMSE: 0.9114
Precision:  0.854
Recall:  0.802
F_1 score:  0.827

# Use svd_optimized model to recommend for userId "A3LDPF5FMB782Z" and productId "1400501466"

svd_optimized.predict('A3LDPF5FMB782Z', '1400501466', r_ui = 5, verbose=True)

user: A3LDPF5FMB782Z item: 1400501466 r_ui = 5.00   est = 4.26   {'was_impossible': False}

Prediction(uid='A3LDPF5FMB782Z', iid='1400501466', r_ui=5, est=4.262585198727372, details={'was_impossible': False})

# Use svd_optimized model to recommend for userId "A34BZM6S9L7QI4" and productId "1400501466"

svd_optimized.predict('A34BZM6S9L7QI4', '1400501466', verbose=True)

user: A34BZM6S9L7QI4 item: 1400501466 r_ui = None   est = 4.43   {'was_impossible': False}

Prediction(uid='A34BZM6S9L7QI4', iid='1400501466', r_ui=None, est=4.430784168423419, details={'was_impossible': False})

import os
os.getcwd()

# Convert notebook to html
!jupyter nbconvert --to html "/content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Seven_-_Recommendation_Systems/Project_Amazon_Product_Recommendation_System/Recommendation_Systems_Learner_Notebook_Full_Code_Mine.ipynb"

[NbConvertApp] Converting notebook /content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Seven_-_Recommendation_Systems/Project_Amazon_Product_Recommendation_System/Recommendation_Systems_Learner_Notebook_Full_Code_Mine.ipynb to html
[NbConvertApp] Writing 1197624 bytes to /content/drive/MyDrive/MIT - Data Sciences/Colab Notebooks/Week_Seven_-_Recommendation_Systems/Project_Amazon_Product_Recommendation_System/Recommendation_Systems_Learner_Notebook_Full_Code_Mine.html

	user_id	prod_id	rating	timestamp
0	A2CX7LUOHB2NDG	0321732944	5.0	1341100800
1	A2NWSAGRHCP8N5	0439886341	1.0	1367193600
2	A2WNBOD3WNDNKT	0439886341	3.0	1374451200
3	A1GI0U4ZRJA8WN	0439886341	1.0	1334707200
4	A1QGNMC6O1VW39	0511189877	5.0	1397433600

	user_id	prod_id	rating	timestamp
7824476	A2YZI3C9MOHC0L	BT008UKTMW	5.0	1396569600
7824477	A322MDK0M89RHN	BT008UKTMW	5.0	1313366400
7824478	A1MH90R0ADMIK0	BT008UKTMW	4.0	1404172800
7824479	A10M2KEFPEQDHN	BT008UKTMW	4.0	1297555200
7824480	A2G81TMIOIDEQQ	BT008V9J9U	5.0	1312675200

	count	unique	top	freq	mean	std	min	25%	50%	75%	max
user_id	125871	1540	A5JLAU2ARJ0BO	520	NaN	NaN	NaN	NaN	NaN	NaN	NaN
prod_id	125871	48190	B0088CJT4U	206	NaN	NaN	NaN	NaN	NaN	NaN	NaN
rating	125871.0	NaN	NaN	NaN	4.261339	1.062144	1.0	4.0	5.0	5.0	5.0
timestamp	125871.0	NaN	NaN	NaN	1321979017.4512	75835985.7041	939600000.0	1286928000.0	1346716800.0	1377129600.0	1406073600.0

	user_id	num_ratings
1203	A5JLAU2ARJ0BO	520
1287	ADLVFFE4VBT8	501
1086	A3OXHLG6DIBRW8	498
1210	A6FIAB28IS79	431
1209	A680RUE1FDO8B	406
264	A1ODOGXEYECQQ8	380
903	A36K2N527TXXJN	314
521	A2AY4YUOX2N1BQ	311
1508	AWPODHOB4GFWL	308
1439	ARBKYIVNYWK3C	296

	prod_id	average_rating	ratings_count
0	0594451647	5.0	1
26017	B003RRY9RS	5.0	1
26013	B003RR95Q8	5.0	1
26009	B003RIPMZU	5.0	1
26006	B003RFRNYQ	5.0	2

	user_id	prod_id	rating	timestamp
1309	A3LDPF5FMB782Z	1400501466	5.0	1336003200
1321	A1A5KUIIIHFF4U	1400501466	1.0	1332547200
1334	A2XIOXRRYX0KZY	1400501466	3.0	1371686400
1450	AW3LX47IHPFRL	1400501466	5.0	1339804800
1455	A1E3OB6QMBKRYZ	1400501466	1.0	1350086400

	Count	Percentage (%)
rating
1.0	5115	4.06
2.0	5367	4.26
3.0	12060	9.58
4.0	32295	25.66
5.0	71034	56.43

	prod_id	average_rating	ratings_count
39003	B0088CJT4U	4.22	206
24827	B003ES5ZUU	4.86	184
11078	B000N99BBC	4.77	167
38250	B007WTAJTO	4.70	164
38615	B00829TIEK	4.44	149
...	...	...	...
19185	B001UVCYN4	5.00	1
19187	B001UW7YDS	5.00	1
19188	B001UWKP0M	2.00	1
19190	B001V2PGX2	4.00	1
48189	B00LKG1MC8	5.00	1

	prod_id	average_rating	ratings_count
18891	B001TH7GUU	4.87	78
15538	B0019EHU8G	4.86	90
24827	B003ES5ZUU	4.86	184
36352	B006W8U2MU	4.82	57
11943	B000QUUFRW	4.81	84

	user_id	prod_id	rating	timestamp
2081	A2ZR3YTMEEIIZ4	1400532655	5.0	1300233600
2149	A3CLWR1UUZT6TG	1400532655	5.0	1291420800
2161	A5JLAU2ARJ0BO	1400532655	1.0	1291334400
2227	A1P4XD7IORSEFN	1400532655	4.0	1324166400
2362	A341HCMGNZCBIT	1400532655	5.0	1302048000
...	...	...	...	...
7824422	A34BZM6S9L7QI4	B00LGQ6HL8	5.0	1405555200
7824423	A1G650TTTHEAL5	B00LGQ6HL8	5.0	1405382400
7824424	A25C2M3QF9G7OQ	B00LGQ6HL8	5.0	1405555200
7824425	A1E1LEVQ9VQNK	B00LGQ6HL8	5.0	1405641600
7824426	A2NYK9KWFMJV4Y	B00LGQ6HL8	5.0	1405209600

	prod_id	predicted_ratings
0	B002WE4HE2	5.000000
1	B002WE6D44	5.000000
2	B0052SCU8U	5.000000
3	B003ES5ZUU	4.951627
4	B000Q8UAWY	4.857143

	prod_id	predicted_ratings
0	1400532655	4.296742
1	1400599997	4.296742
2	9983891212	4.296742
3	B00000DM9W	4.296742
4	B00000J1V5	4.296742

Model	RMSE	Precision@K	Recall@K	F1 Score
Baseline User-User CF	1.0345	0.833	0.768	0.799
Optimized User-User CF	0.9791	0.842	0.807	0.824
Baseline Item-Item CF	1.0345	0.833	0.768	0.799
Optimized Item-Item CF	0.9804	0.833	0.800	0.816
Baseline SVD (Matrix Factorization)	1.0390	0.852	0.785	0.817
Optimized SVD (Matrix Factorization)	0.9114	0.854	0.802	0.827

Project: Amazon Product Recommendation System¶

Marks: 40¶

Context:¶

Objective:¶

Dataset:¶

Importing the necessary libraries and overview of the dataset¶

Loading the data¶

Observations:¶

Observations:¶

Exploratory Data Analysis¶

Shape of the data¶

Check the number of rows and columns and provide observations.¶

Observations:¶

Data types¶

Observations:¶

Checking for missing values¶

Observations:¶

Checking for duplicate values¶

Observations:¶

Summary Statistics¶

Observations:¶

Timestamp Distribution¶

Observations:¶

Ratings Distribution¶

Observations:¶

Unique Users and Products¶

Observations:¶

Users with the Most Number of Ratings¶

Observations:¶

Model 1: Rank Based Recommendation System¶

Products with the most/fewest ratings¶

Observations:¶

Recommending top 5 products with 50 minimum interactions based on popularity¶

Recommending top 5 products with 100 minimum interactions based on popularity¶

Model 2: Collaborative Filtering Recommendation System¶

Building a baseline user-user similarity based recommendation system¶

Precision@k, Recall@ k, and F1-score@k¶

Some useful functions¶

Building the user-user Similarity-based Recommendation System¶

Observations:¶

Observations:¶

Observations:¶

Improving similarity-based recommendation system by tuning its hyperparameters¶

Observations:¶

Steps:¶

Observations:¶

Identifying similar users to a given user (nearest neighbors)¶

Implementing the recommendation algorithm based on optimized KNNBasic model¶

Item-Item Similarity-based Collaborative Filtering Recommendation System¶

Observations:¶

Observations:¶

Observations:¶

Hyperparameter tuning the item-item similarity-based model¶

Use the best parameters from GridSearchCV to build the optimized item-item similarity-based model. Compare the performance of the optimized model with the baseline model.¶

Observations:¶

Steps:¶

Observations:¶

Identifying similar items to a given item (nearest neighbors)¶

Model 3: Model-Based Collaborative Filtering - Matrix Factorization¶

Singular Value Decomposition (SVD)¶

Observations:¶

Observations:¶

Observations:¶

Improving Matrix Factorization based recommendation system by tuning its hyperparameters¶

Observations:¶

Steps:¶

Observations:¶

Conclusion and Recommendations¶

Methodologies:¶

Model Performance Metrics:¶

Analysis of Results:¶

Final Recommendation:¶

Future Considerations:¶