Content-based Filtering
Introduction
Content-based filtering is a technique used in recommendation systems that relies on the content of the items, represented by features.
Let’s start with the utility matrix: the columns are the users and the rows are the items. Since each user rates only a few items, the matrix is sparse, i.e. there are many missing values. If we can fill in those missing values, we roughly know which items each user likes and can suggest those items to them. And if we can classify the items into groups, then when a new item comes in, the classification algorithm predicts which class it belongs to and we can recommend it to the right users accordingly.
The rating can be explicit, where the user rates items according to their preference, or implicit, where the user’s preference is inferred from behavior: how many times they rewatch a video, how often they view a product, or whether they actually buy the item.
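To make this concrete, below is a minimal sketch of such a utility matrix; the users, items, and ratings are made up for illustration, and NaN marks a rating the recommender would try to fill in.
import numpy as np
import pandas as pd
# Toy utility matrix: rows are items, columns are users, NaN means "not rated yet"
utility = pd.DataFrame(
    {
        'user_1': [5.0, np.nan, 1.0],
        'user_2': [np.nan, 4.0, np.nan],
        'user_3': [4.0, np.nan, np.nan],
    },
    index=['Toy Story (1995)', 'Heat (1995)', 'Casino (1995)'],
)
print(utility)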
Content-based recommendation
To build a content-based recommendation system, we need to build a profile for each item. A profile is represented by a feature vector; for example, the features of a movie can be its actors, director, year, and genres. The gist of the algorithm is to look for similarity among items, hence the results may not be very novel or diverse. And each time a new item comes in, we need to profile it. Scoring those attributes can be automated or done by humans, and it can be costly.
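As a minimal sketch (with made-up movies and features), a profile can be as simple as a vector of one-hot genre indicators:
import pandas as pd
# Each item profile is a feature vector; here, one-hot genre indicators
profiles = pd.DataFrame(
    [
        {'title': 'Toy Story (1995)', 'animation': 1, 'comedy': 1, 'crime': 0},
        {'title': 'Heat (1995)', 'animation': 0, 'comedy': 0, 'crime': 1},
    ]
).set_index('title')
print(profiles)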
Code examples
Using the MovieLens dataset, we are going to do three examples: recommend movies based on genres, tags, and ratings. When we recommend based on genres, it is the content of the movies that we care about; so if we want a movie for children, it is safer to use this method since it only recommends movies close to the children genre. When we recommend based on tags, it is the users’ comments that we care about. And it is not just one user’s comment: we aggregate the comments from different users on the same movie and generate recommendations based on that, which makes use of the preferences of the user base, since some users do care about what other people say about a product and base their decisions on it. Recommending based on ratings is similar: it is for users who pick their next movie by the opinion of the crowd. At the end, we can aggregate those three models into one long list of recommendations minus the duplicates (a sketch of this aggregation appears after the third example).
The techniques used are called TF-IDF and CountVectorizer. TF-IDF was introduced in the LSI/LSA articles. CountVectorizer in Python simply counts the number of appearances of each word in the documents; those counts serve as indicators of similarity.
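To see the difference between the two vectorizers, here is a quick illustration on a made-up two-document corpus:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
docs = ['adventure animation children', 'adventure children fantasy']
# CountVectorizer: raw word counts per document
print(CountVectorizer().fit_transform(docs).toarray())
# TfidfVectorizer: counts reweighted so that words appearing in every
# document ('adventure', 'children') carry less weight than distinctive ones
print(TfidfVectorizer().fit_transform(docs).toarray().round(2))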
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
# Load Movies and genre
movies = pd.read_csv('ml-latest-small/movies.csv', low_memory=False)
len(movies)
9742
movies.head()
| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
# Define a TF-IDF Vectorizer Object.
tfidf = TfidfVectorizer(stop_words='english')
# Replace NaN with an empty string
movies['genres'] = movies['genres'].fillna('')
# Construct the required TF-IDF matrix by applying the fit_transform method on the genres feature
tfidf_matrix = tfidf.fit_transform(movies['genres'])
# Compute the cosine similarity matrix; linear_kernel is just the dot product,
# which equals cosine similarity here because TF-IDF rows are L2-normalized by default
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
# Construct a reverse map of indices and movie titles
indices = pd.Series(movies.index, index=movies['title']).drop_duplicates()
# Function that takes in a movie title as input and outputs the most similar movies
def get_recommendations(title, movies, cosine_sim):
    # Get the row position of the movie that matches the title.
    # `indices` (a global rebuilt before each call) maps titles to index labels;
    # get_loc converts the label to a position, keeping the lookup correct
    # even when the DataFrame index is not a plain 0..n-1 range
    idx = movies.index.get_loc(indices[title])
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Keep the 10 most similar movies, skipping the movie itself at position 0
    sim_scores = sim_scores[1:11]
    # Get the movie row positions
    movie_indices = [i[0] for i in sim_scores]
    # Return the titles of the 10 most similar movies
    return movies['title'].iloc[movie_indices]
print(get_recommendations('Toy Story (1995)', movies, cosine_sim))
1706 Antz (1998)
2355 Toy Story 2 (1999)
2809 Adventures of Rocky and Bullwinkle, The (2000)
3000 Emperor's New Groove, The (2000)
3568 Monsters, Inc. (2001)
6194 Wild, The (2006)
6486 Shrek the Third (2007)
6948 Tale of Despereaux, The (2008)
7760 Asterix and the Vikings (Astérix et les Viking...
8219 Turbo (2013)
Name: title, dtype: object
Apart from the genres, we also have information about the tags users give each movie. This kind of information reveals a bit more about users’ preferences for a movie.
# Load the tags data
tags = pd.read_csv('ml-latest-small/tags.csv')
tags.head()
| | userId | movieId | tag | timestamp |
|---|---|---|---|---|
| 0 | 2 | 60756 | funny | 1445714994 |
| 1 | 2 | 60756 | Highly quotable | 1445714996 |
| 2 | 2 | 60756 | will ferrell | 1445714992 |
| 3 | 2 | 89774 | Boxing story | 1445715207 |
| 4 | 2 | 89774 | MMA | 1445715200 |
# Merge movies and tags into a single DataFrame
movies_with_tags = pd.merge(movies, tags, on='movieId', how='left')
movies_with_tags['tag'] = movies_with_tags['tag'].fillna('')
# Concatenate all tags for each movie into a single string
movies_with_tags['tags'] = movies_with_tags.groupby('movieId')['tag'].transform(lambda x: ' '.join(x))
# Replace NaN with an empty string
movies_with_tags['tags'] = movies_with_tags['tags'].fillna('')
# Remove duplicate movies
movies_with_tags = movies_with_tags.drop_duplicates(subset=["movieId"])
# Define a TF-IDF Vectorizer Object
tfidf = TfidfVectorizer(stop_words='english')
# Construct the required TF-IDF matrix by applying the fit_transform method on the tags feature
tfidf_matrix = tfidf.fit_transform(movies_with_tags['tags'])
from sklearn.metrics.pairwise import cosine_similarity
# Compute the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
movies_with_tags.head()
| | movieId | title | genres | userId | tag | timestamp | tags |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 336.0 | pixar | 1.139046e+09 | pixar pixar fun |
| 3 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy | 62.0 | fantasy | 1.528844e+09 | fantasy magic board game Robin Williams game |
| 7 | 3 | Grumpier Old Men (1995) | Comedy\|Romance | 289.0 | moldy | 1.143425e+09 | moldy old |
| 9 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance | NaN | NaN | NaN | |
| 10 | 5 | Father of the Bride Part II (1995) | Comedy | 474.0 | pregnancy | 1.137374e+09 | pregnancy remake |
cosine_sim
array([[1., 0., 0., ..., 0., 0., 0.],
[0., 1., 0., ..., 0., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
# Construct a reverse map of indices and movie titles
indices = pd.Series(movies_with_tags.index, index=movies_with_tags['title']).drop_duplicates()
print(get_recommendations('Toy Story (1995)', movies_with_tags, cosine_sim))
2484 Bug's Life, A (1998)
3210 Toy Story 2 (1999)
10675 Guardians of the Galaxy 2 (2017)
8664 Up (2009)
10485 Big Hero 6 (2014)
10240 The Lego Movie (2014)
9459 Avengers, The (2012)
395 Pulp Fiction (1994)
3 Jumanji (1995)
7 Grumpier Old Men (1995)
Name: title, dtype: object
from sklearn.feature_extraction.text import CountVectorizer
# Load ratings data
ratings = pd.read_csv('ml-latest-small/ratings.csv')
# Calculate the average rating for each movie
average_ratings = ratings.groupby('movieId')['rating'].mean().reset_index()
average_ratings.rating = average_ratings.rating.round()
# Merge the average rating to the movie data
movies_with_ratings = movies.merge(average_ratings, on='movieId')
# Define a CountVectorizer object; each row of the resulting matrix represents a movie and
# each column counts a token drawn from the movie's title combined with its rounded average rating
count = CountVectorizer()
# Construct the required matrix by applying the fit_transform method on the title feature and average rating feature
count_matrix = count.fit_transform(movies_with_ratings['title'] + ' ' + movies_with_ratings['rating'].astype(str))
# Compute the cosine similarity matrix
cosine_sim = cosine_similarity(count_matrix, count_matrix)
movies_with_ratings.head()
| | movieId | title | genres | rating |
|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 3.920930 |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy | 3.431818 |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance | 3.259615 |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance | 2.357143 |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy | 3.071429 |
cosine_sim
array([[1. , 0.28867513, 0.2236068 , ..., 0. , 0. ,
0. ],
[0.28867513, 1. , 0.25819889, ..., 0. , 0. ,
0. ],
[0.2236068 , 0.25819889, 1. , ..., 0. , 0. ,
0. ],
...,
[0. , 0. , 0. , ..., 1. , 0. ,
0. ],
[0. , 0. , 0. , ..., 0. , 1. ,
0. ],
[0. , 0. , 0. , ..., 0. , 0. ,
1. ]])
indices = pd.Series(movies_with_ratings.index, index=movies_with_ratings['title']).drop_duplicates()
print(get_recommendations('Toy Story (1995)', movies_with_ratings, cosine_sim))
2353 Toy Story 2 (1999)
7338 Toy Story 3 (2010)
256 Pyromaniac's Love Story, A (1995)
1 Jumanji (1995)
5 Heat (1995)
6 Sabrina (1995)
9 GoldenEye (1995)
12 Balto (1995)
13 Nixon (1995)
15 Casino (1995)
Name: title, dtype: object
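As promised earlier, we can aggregate the three models into one list. A minimal sketch, using made-up stand-ins for the three result Series (in the examples above, each would be saved from its get_recommendations call before cosine_sim and indices are overwritten by the next example):
import pandas as pd
# Hypothetical stand-ins for the outputs of the genre-, tag- and rating-based models
rec_genres = pd.Series(['Antz (1998)', 'Toy Story 2 (1999)'])
rec_tags = pd.Series(["Bug's Life, A (1998)", 'Toy Story 2 (1999)'])
rec_ratings = pd.Series(['Toy Story 2 (1999)', 'Toy Story 3 (2010)'])
# One long list of recommendations, minus the duplicates
combined = pd.concat([rec_genres, rec_tags, rec_ratings], ignore_index=True).drop_duplicates()
print(combined)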
In conclusion, content-based filtering is a very popular method for building recommendation systems, and we got reasonably good results. There are some flaws, though. When we recommend by tags, the algorithm suggests “Pulp Fiction” for “Toy Story”. This is a bit much if we are talking about children. For the algorithm that recommends by ratings, a wide range of genres is normal and might be just what the user needs: they might simply want a 5-star movie, regardless of topic. (Note that this model’s similarity comes largely from shared title tokens such as “Story” and “1995”, which is why so many 1995 titles appear.) But for the algorithm based on users’ tags, Toy Story next to Pulp Fiction is still widely inappropriate in the case of children, even though it is not in the top 5 recommendations. The algorithm might just reveal some bias in the preferences of the user base, and it is not pretty.
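One possible mitigation, sketched under the assumption that we post-filter with the genres column of the movies DataFrame loaded earlier (the helper name is ours, not part of the dataset or any library):
# Hypothetical post-filter: keep only recommendations whose genres include 'Children'
def filter_for_children(recommended_titles, movies):
    safe_titles = movies.loc[movies['genres'].str.contains('Children'), 'title']
    return recommended_titles[recommended_titles.isin(safe_titles)]
# e.g. filter_for_children(get_recommendations('Toy Story (1995)', movies, cosine_sim), movies)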
In the next example, we get access to the genome scores of the tags for the movie list. The genome score measures how relevant a tag is to a movie, over a tag list of about a thousand entries. These tag scores have been calculated in advance by a machine learning algorithm, based on tags, ratings, and textual reviews. This genome score table gives the machine a better overview of each movie’s content. We can then calculate the cosine similarities of those tag score vectors and use them for recommendation. The result is much better than before, since asking for recommendations on Toy Story returns very similar movies in children’s genres: Shrek, Ice Age, Monsters, Inc., Finding Nemo, Ratatouille, Up.
genome_tags = pd.read_csv('ml-latest-small/genome_tags.csv')
genome_tags.head(15)
| | tagId | tag |
|---|---|---|
| 0 | 1 | 007 |
| 1 | 2 | 007 (series) |
| 2 | 3 | 18th century |
| 3 | 4 | 1920s |
| 4 | 5 | 1930s |
| 5 | 6 | 1950s |
| 6 | 7 | 1960s |
| 7 | 8 | 1970s |
| 8 | 9 | 1980s |
| 9 | 10 | 19th century |
| 10 | 11 | 3d |
| 11 | 12 | 70mm |
| 12 | 13 | 80s |
| 13 | 14 | 9/11 |
| 14 | 15 | aardman |
# Load the genome scores
genome_scores = pd.read_csv('ml-latest-small/genome_scores.csv')
# Pivot the genome scores DataFrame to create a movie-tag matrix
movie_tag_matrix = genome_scores.pivot(index='movieId', columns='tagId', values='relevance')
# Compute the cosine similarity matrix
cosine_sim = cosine_similarity(movie_tag_matrix)
# Function that takes in a movieId as input and outputs the most similar movies
def get_recommended_movieIds(movieId, cosine_sim=cosine_sim):
    # Get the row position of the movie that matches the movieId
    idx = movie_tag_matrix.index.get_loc(movieId)
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Keep the 10 most similar movies, skipping the movie itself
    sim_scores = sim_scores[1:11]
    # Get the movie row positions
    movie_indices = [i[0] for i in sim_scores]
    # Return the movieIds of the 10 most similar movies
    return movie_tag_matrix.index[movie_indices]
recommended = get_recommended_movieIds(1)
movies[movies['movieId'].isin(recommended)]
| | movieId | title | genres |
|---|---|---|---|
| 1706 | 2294 | Antz (1998) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1757 | 2355 | Bug's Life, A (1998) | Adventure\|Animation\|Children\|Comedy |
| 2355 | 3114 | Toy Story 2 (1999) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 3194 | 4306 | Shrek (2001) | Adventure\|Animation\|Children\|Comedy\|Fantasy\|Ro... |
| 3568 | 4886 | Monsters, Inc. (2001) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 3745 | 5218 | Ice Age (2002) | Adventure\|Animation\|Children\|Comedy |
| 4360 | 6377 | Finding Nemo (2003) | Adventure\|Animation\|Children\|Comedy |
| 6405 | 50872 | Ratatouille (2007) | Animation\|Children\|Drama |
| 7039 | 68954 | Up (2009) | Adventure\|Animation\|Children\|Drama |
| 7355 | 78499 | Toy Story 3 (2010) | Adventure\|Animation\|Children\|Comedy\|Fantasy\|IMAX |
If we have access to lengthy overviews or critic reviews of the movies, or if the items themselves are documents, we can leverage NLP techniques to process and analyze the documents and then calculate the cosine similarity. This approach can be used to suggest related titles for blog posts, or to suggest new movies to watch.
Let’s try it on three movie plots. The NLP preprocessing involves lowercasing, removing punctuation, removing stopwords, lemmatizing, etc.; it is implemented in preprocess_text below. We then train a Word2Vec model (which builds its own latent word space) on this new corpus of three plots. Each word is represented by a 100-dimensional vector, the context window is 5 words, and with min_count=1 no word is ignored. Each plot has its own number of words; every word is mapped to its 100-dimensional Word2Vec vector, and the vectors within a plot are averaged, so each plot is reduced to a single 100-dimensional vector. Then we calculate the cosine similarity among those three plot vectors.
From the result, we can see that these three movies are not similar, with similarity indices from about 0.1 to 0.2. All three movies are from 2020: Demon Slayer (a Japanese anime), The Eight Hundred (a Chinese war drama), and Bad Boys for Life (an American action comedy). These movies come from very different cultures.
from gensim.models import Word2Vec
with open('movies/movie1.txt', 'r') as f:
    m1 = f.read()
with open('movies/movie2.txt', 'r') as f:
    m2 = f.read()
with open('movies/movie3.txt', 'r') as f:
    m3 = f.read()
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string
# Download necessary NLTK datasets
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize text (convert text into a list of words)
    words = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]
    # Lemmatize words (convert words into their root form)
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return words
# Example usage:
preprocessed_text1 = preprocess_text(m1)
preprocessed_text2 = preprocess_text(m2)
preprocessed_text3 = preprocess_text(m3)
print(preprocessed_text2)
['early', 'day', 'second', 'sinojapanese', 'war', 'greater', 'scale', 'world', 'war', 'ii', 'imperial', 'japanese', 'army', 'invaded', 'shanghai', 'became', 'known', 'battle', 'shanghai', 'holding', 'back', 'japanese', '3', 'month', 'suffering', 'heavy', 'loss', 'chinese', 'army', 'forced', 'retreat', 'due', 'danger', 'encircled', 'lieutenant', 'colonel', 'xie', 'jinyuan', '524th', 'regiment', 'underequipped', '88th', 'division', 'national', 'revolutionary', 'army', 'led', '452', 'young', 'officer', 'soldier', 'defend', 'sihang', 'warehouse', '3rd', 'imperial', 'japanese', 'division', 'consisting', 'around', '20000', 'troop', 'heroic', 'suicidal', 'last', 'stand', 'japanese', 'order', 'generalissimo', 'nationalist', 'china', 'chiang', 'kaishek', 'decision', 'made', 'provide', 'morale', 'boost', 'chinese', 'people', 'loss', 'beijing', 'shanghai', 'helped', 'spur', 'support', 'western', 'power', 'full', 'view', 'battle', 'international', 'settlement', 'shanghai', 'across', 'suzhou', 'creek', '6']
[nltk_data] Downloading package punkt to
[nltk_data] /Users/nguyenlinhchi/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data] /Users/nguyenlinhchi/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data] /Users/nguyenlinhchi/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data] /Users/nguyenlinhchi/nltk_data...
[nltk_data] Package omw-1.4 is already up-to-date!
import numpy as np
documents = [preprocessed_text1, preprocessed_text2, preprocessed_text3]
# `documents` is a list of token lists, one per movie plot (already split into words by preprocess_text)
# Train a Word2Vec model (in gensim >= 4.0 the dimensionality argument is `vector_size`, formerly `size`)
model = Word2Vec(documents, vector_size=100, window=5, min_count=1, workers=4)
# Vectorize the movies: average each plot's word vectors into a single 100-dimensional vector
movie_vectors = [np.mean([model.wv[word] for word in doc], axis=0) for doc in documents]
# Compute the similarity matrix
similarity_matrix = cosine_similarity(movie_vectors)
# Now you can use this similarity matrix to recommend similar movies
similarity_matrix
array([[0.99999994, 0.11791225, 0.17876801],
[0.11791225, 1.0000001 , 0.12657738],
[0.17876801, 0.12657738, 0.99999994]], dtype=float32)