Content-based Filtering
Introduction
Content-based filtering is a technique used in recommendation systems that relies on the content of the items, represented by features.
Let’s start with the utility matrix: the columns are the users and the rows are the items. Since each user rates only a few items, the matrix is sparse, i.e. there are many missing values. If we can fill in those missing values, we roughly know which items each user likes and can suggest those items to them. And if we can classify the items into groups, then when a new item comes in, the classification algorithm predicts which class it belongs to and we can recommend it to the right users accordingly.
The rating can be explicit, where the user rates items according to their preference, or implicit, where the user’s preference is inferred from behavior: how many times they rewatch a video, how often they view a product, or whether they actually buy the item.
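To make this concrete, below is a minimal sketch of such a utility matrix; the users, items, and ratings are made up for illustration, and NaN marks a rating the recommender would try to fill in.
import numpy as np
import pandas as pd
# Toy utility matrix: rows are items, columns are users, NaN means "not rated yet"
utility = pd.DataFrame(
    {
        'user_1': [5.0, np.nan, 1.0],
        'user_2': [np.nan, 4.0, np.nan],
        'user_3': [4.0, np.nan, np.nan],
    },
    index=['Toy Story (1995)', 'Heat (1995)', 'Casino (1995)'],
)
print(utility)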
Content-based recommendation
To build a content-based recommendation system, we need to build a profile for each item. A profile is represented by a feature vector; for example, the features of a movie can be its actors, director, year, and genres. The gist of the algorithm is to look for similarity among items, hence the results may not be very novel or diverse. And each time a new item comes in, we need to profile it. Scoring those attributes can be automated or done by humans, and it can be costly.
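As a minimal sketch (with made-up movies and features), a profile can be as simple as a vector of one-hot genre indicators:
import pandas as pd
# Each item profile is a feature vector; here, one-hot genre indicators
profiles = pd.DataFrame(
    [
        {'title': 'Toy Story (1995)', 'animation': 1, 'comedy': 1, 'crime': 0},
        {'title': 'Heat (1995)', 'animation': 0, 'comedy': 0, 'crime': 1},
    ]
).set_index('title')
print(profiles)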
Code examples
Using the MovieLens dataset, we are going to do three examples: recommend movies based on genres, tags, and ratings. When we recommend based on genres, it is the content of the movies that we care about; so if we want a movie for children, it is safer to use this method since it only recommends movies close to the children genre. When we recommend based on tags, it is the users’ comments that we care about. And it is not just one user’s comment: we aggregate the comments from different users on the same movie and generate recommendations based on that, which makes use of the preferences of the user base, since some users do care about what other people say about a product and base their decisions on it. Recommending based on ratings is similar: it is for users who pick their next movie by the opinion of the crowd. At the end, we can aggregate those three models into one long list of recommendations minus the duplicates (a sketch of this aggregation appears after the third example).
The techniques used are called TF-IDF and CountVectorizer. TF-IDF was introduced in the LSI/LSA articles. CountVectorizer in Python simply counts the number of appearances of each word in the documents; those counts serve as indicators of similarity.
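To see the difference between the two vectorizers, here is a quick illustration on a made-up two-document corpus:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
docs = ['adventure animation children', 'adventure children fantasy']
# CountVectorizer: raw word counts per document
print(CountVectorizer().fit_transform(docs).toarray())
# TfidfVectorizer: counts reweighted so that words appearing in every
# document ('adventure', 'children') carry less weight than distinctive ones
print(TfidfVectorizer().fit_transform(docs).toarray().round(2))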
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
# Load Movies and genre
movies = pd.read_csv('ml-latest-small/movies.csv', low_memory=False)
len(movies)
9742
movies.head()
| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
# Define a TF-IDF Vectorizer Object.
tfidf = TfidfVectorizer(stop_words='english')
# Replace NaN with an empty string
movies['genres'] = movies['genres'].fillna('')
# Construct the required TF-IDF matrix by applying the fit_transform method on the genres feature
tfidf_matrix = tfidf.fit_transform(movies['genres'])
# Compute the cosine similarity matrix; linear_kernel is just the dot product,
# which equals cosine similarity here because TF-IDF rows are L2-normalized by default
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
# Construct a reverse map of indices and movie titles
indices = pd.Series(movies.index, index=movies['title']).drop_duplicates()
# Function that takes in a movie title as input and outputs the most similar movies
def get_recommendations(title, movies, cosine_sim):
    # Get the row position of the movie that matches the title.
    # `indices` (a global rebuilt before each call) maps titles to index labels;
    # get_loc converts the label to a position, keeping the lookup correct
    # even when the DataFrame index is not a plain 0..n-1 range
    idx = movies.index.get_loc(indices[title])
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Keep the 10 most similar movies, skipping the movie itself at position 0
    sim_scores = sim_scores[1:11]
    # Get the movie row positions
    movie_indices = [i[0] for i in sim_scores]
    # Return the titles of the 10 most similar movies
    return movies['title'].iloc[movie_indices]
print(get_recommendations('Toy Story (1995)', movies, cosine_sim))
1706 Antz (1998)
2355 Toy Story 2 (1999)
2809 Adventures of Rocky and Bullwinkle, The (2000)
3000 Emperor's New Groove, The (2000)
3568 Monsters, Inc. (2001)
6194 Wild, The (2006)
6486 Shrek the Third (2007)
6948 Tale of Despereaux, The (2008)
7760 Asterix and the Vikings (Astérix et les Viking...
8219 Turbo (2013)
Name: title, dtype: object
Apart from the genres, we also have information about the tags users give each movie. This kind of information reveals a bit more about users’ preferences for a movie.
# Load the tags data
tags = pd.read_csv('ml-latest-small/tags.csv')
tags.head()
| | userId | movieId | tag | timestamp |
|---|---|---|---|---|
| 0 | 2 | 60756 | funny | 1445714994 |
| 1 | 2 | 60756 | Highly quotable | 1445714996 |
| 2 | 2 | 60756 | will ferrell | 1445714992 |
| 3 | 2 | 89774 | Boxing story | 1445715207 |
| 4 | 2 | 89774 | MMA | 1445715200 |
# Merge movies and tags into a single DataFrame
movies_with_tags = pd.merge(movies, tags, on='movieId', how='left')
movies_with_tags['tag'] = movies_with_tags['tag'].fillna('')
# Concatenate all tags for each movie into a single string
movies_with_tags['tags'] = movies_with_tags.groupby('movieId')['tag'].transform(lambda x: ' '.join(x))
# Replace NaN with an empty string
movies_with_tags['tags'] = movies_with_tags['tags'].fillna('')
# Remove duplicate movies
movies_with_tags = movies_with_tags.drop_duplicates(subset=["movieId"])
# Define a TF-IDF Vectorizer Object
tfidf = TfidfVectorizer(stop_words='english')
# Construct the required TF-IDF matrix by applying the fit_transform method on the tags feature
tfidf_matrix = tfidf.fit_transform(movies_with_tags['tags'])
from sklearn.metrics.pairwise import cosine_similarity
# Compute the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
movies_with_tags.head()
| | movieId | title | genres | userId | tag | timestamp | tags |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 336.0 | pixar | 1.139046e+09 | pixar pixar fun |
| 3 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy | 62.0 | fantasy | 1.528844e+09 | fantasy magic board game Robin Williams game |
| 7 | 3 | Grumpier Old Men (1995) | Comedy\|Romance | 289.0 | moldy | 1.143425e+09 | moldy old |
| 9 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance | NaN | NaN | NaN | |
| 10 | 5 | Father of the Bride Part II (1995) | Comedy | 474.0 | pregnancy | 1.137374e+09 | pregnancy remake |
cosine_sim
array([[1., 0., 0., ..., 0., 0., 0.],
[0., 1., 0., ..., 0., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
# Construct a reverse map of indices and movie titles
indices = pd.Series(movies_with_tags.index, index=movies_with_tags['title']).drop_duplicates()
print(get_recommendations('Toy Story (1995)', movies_with_tags, cosine_sim))
2484 Bug's Life, A (1998)
3210 Toy Story 2 (1999)
10675 Guardians of the Galaxy 2 (2017)
8664 Up (2009)
10485 Big Hero 6 (2014)
10240 The Lego Movie (2014)
9459 Avengers, The (2012)
395 Pulp Fiction (1994)
3 Jumanji (1995)
7 Grumpier Old Men (1995)
Name: title, dtype: object
from sklearn.feature_extraction.text import CountVectorizer
# Load ratings data
ratings = pd.read_csv('ml-latest-small/ratings.csv')
# Calculate the average rating for each movie
average_ratings = ratings.groupby('movieId')['rating'].mean().reset_index()
average_ratings.rating = average_ratings.rating.round()
# Merge the average rating to the movie data
movies_with_ratings = movies.merge(average_ratings, on='movieId')
# Define a CountVectorizer object; each row of the resulting matrix represents a movie and
# each column counts a token drawn from the movie's title combined with its rounded average rating
count = CountVectorizer()
# Construct the required matrix by applying the fit_transform method on the title feature and average rating feature
count_matrix = count.fit_transform(movies_with_ratings['title'] + ' ' + movies_with_ratings['rating'].astype(str))
# Compute the cosine similarity matrix
cosine_sim = cosine_similarity(count_matrix, count_matrix)
movies_with_ratings.head()
| | movieId | title | genres | rating |
|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 3.920930 |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy | 3.431818 |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance | 3.259615 |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance | 2.357143 |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy | 3.071429 |
cosine_sim
array([[1. , 0.28867513, 0.2236068 , ..., 0. , 0. ,
0. ],
[0.28867513, 1. , 0.25819889, ..., 0. , 0. ,
0. ],
[0.2236068 , 0.25819889, 1. , ..., 0. , 0. ,
0. ],
...,
[0. , 0. , 0. , ..., 1. , 0. ,
0. ],
[0. , 0. , 0. , ..., 0. , 1. ,
0. ],
[0. , 0. , 0. , ..., 0. , 0. ,
1. ]])
indices = pd.Series(movies_with_ratings.index, index=movies_with_ratings['title']).drop_duplicates()
print(get_recommendations('Toy Story (1995)', movies_with_ratings, cosine_sim))
2353 Toy Story 2 (1999)
7338 Toy Story 3 (2010)
256 Pyromaniac's Love Story, A (1995)
1 Jumanji (1995)
5 Heat (1995)
6 Sabrina (1995)
9 GoldenEye (1995)
12 Balto (1995)
13 Nixon (1995)
15 Casino (1995)
Name: title, dtype: object
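As promised earlier, we can aggregate the three models into one list. A minimal sketch, using made-up stand-ins for the three result Series (in the examples above, each would be saved from its get_recommendations call before cosine_sim and indices are overwritten by the next example):
import pandas as pd
# Hypothetical stand-ins for the outputs of the genre-, tag- and rating-based models
rec_genres = pd.Series(['Antz (1998)', 'Toy Story 2 (1999)'])
rec_tags = pd.Series(["Bug's Life, A (1998)", 'Toy Story 2 (1999)'])
rec_ratings = pd.Series(['Toy Story 2 (1999)', 'Toy Story 3 (2010)'])
# One long list of recommendations, minus the duplicates
combined = pd.concat([rec_genres, rec_tags, rec_ratings], ignore_index=True).drop_duplicates()
print(combined)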
In conclusion, content-based filtering is a very popular method for building recommendation systems, and we got reasonably good results. There are some flaws, though. When we recommend by tags, the algorithm suggests “Pulp Fiction” for “Toy Story”. This is a bit much if we are talking about children. For the algorithm that recommends by ratings, a wide range of genres is normal and might be just what the user needs: they might simply want a 5-star movie, regardless of topic. (Note that this model’s similarity comes largely from shared title tokens such as “Story” and “1995”, which is why so many 1995 titles appear.) But for the algorithm based on users’ tags, Toy Story next to Pulp Fiction is still widely inappropriate in the case of children, even though it is not in the top 5 recommendations. The algorithm might just reveal some bias in the preferences of the user base, and it is not pretty.
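One possible mitigation, sketched under the assumption that we post-filter with the genres column of the movies DataFrame loaded earlier (the helper name is ours, not part of the dataset or any library):
# Hypothetical post-filter: keep only recommendations whose genres include 'Children'
def filter_for_children(recommended_titles, movies):
    safe_titles = movies.loc[movies['genres'].str.contains('Children'), 'title']
    return recommended_titles[recommended_titles.isin(safe_titles)]
# e.g. filter_for_children(get_recommendations('Toy Story (1995)', movies, cosine_sim), movies)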
In the next example, we get access to the genome scores of the tags for the movie list. The genome score measures how relevant a tag is to a movie, over a tag list of about a thousand entries. These tag scores have been calculated in advance by a machine learning algorithm, based on tags, ratings, and textual reviews. This genome score table gives the machine a better overview of each movie’s content. We can then calculate the cosine similarities of those tag score vectors and use them for recommendation. The result is much better than before, since asking for recommendations on Toy Story returns very similar movies in children’s genres: Shrek, Ice Age, Monsters, Inc., Finding Nemo, Ratatouille, Up.
genome_tags = pd.read_csv('ml-latest-small/genome_tags.csv')
genome_tags.head(15)
| | tagId | tag |
|---|---|---|
| 0 | 1 | 007 |
| 1 | 2 | 007 (series) |
| 2 | 3 | 18th century |
| 3 | 4 | 1920s |
| 4 | 5 | 1930s |
| 5 | 6 | 1950s |
| 6 | 7 | 1960s |
| 7 | 8 | 1970s |
| 8 | 9 | 1980s |
| 9 | 10 | 19th century |
| 10 | 11 | 3d |
| 11 | 12 | 70mm |
| 12 | 13 | 80s |
| 13 | 14 | 9/11 |
| 14 | 15 | aardman |
# Load the genome scores
genome_scores = pd.read_csv('ml-latest-small/genome_scores.csv')
# Pivot the genome scores DataFrame to create a movie-tag matrix
movie_tag_matrix = genome_scores.pivot(index='movieId', columns='tagId', values='relevance')
# Compute the cosine similarity matrix
cosine_sim = cosine_similarity(movie_tag_matrix)
# Function that takes in a movieId as input and outputs the most similar movies
def get_recommended_movieIds(movieId, cosine_sim=cosine_sim):
    # Get the row position of the movie that matches the movieId
    idx = movie_tag_matrix.index.get_loc(movieId)
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Keep the 10 most similar movies, skipping the movie itself
    sim_scores = sim_scores[1:11]
    # Get the movie row positions
    movie_indices = [i[0] for i in sim_scores]
    # Return the movieIds of the 10 most similar movies
    return movie_tag_matrix.index[movie_indices]
recommended = get_recommended_movieIds(1)
movies[movies['movieId'].isin(recommended)]
| | movieId | title | genres |
|---|---|---|---|
| 1706 | 2294 | Antz (1998) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1757 | 2355 | Bug's Life, A (1998) | Adventure\|Animation\|Children\|Comedy |
| 2355 | 3114 | Toy Story 2 (1999) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 3194 | 4306 | Shrek (2001) | Adventure\|Animation\|Children\|Comedy\|Fantasy\|Ro... |
| 3568 | 4886 | Monsters, Inc. (2001) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 3745 | 5218 | Ice Age (2002) | Adventure\|Animation\|Children\|Comedy |
| 4360 | 6377 | Finding Nemo (2003) | Adventure\|Animation\|Children\|Comedy |
| 6405 | 50872 | Ratatouille (2007) | Animation\|Children\|Drama |
| 7039 | 68954 | Up (2009) | Adventure\|Animation\|Children\|Drama |
| 7355 | 78499 | Toy Story 3 (2010) | Adventure\|Animation\|Children\|Comedy\|Fantasy\|IMAX |
If we have access to lengthy overviews or critic reviews of the movies, or if the items themselves are documents, we can leverage NLP techniques to process and analyze the documents and then calculate the cosine similarity. This approach can be used to suggest related titles for blog posts, or to suggest new movies to watch.
Let’s try it on three movie plots. The NLP preprocessing involves lowercasing, removing punctuation, removing stopwords, lemmatizing, etc.; it is implemented in preprocess_text below. We then train a Word2Vec model (which builds its own latent word space) on this new corpus of three plots. Each word is represented by a 100-dimensional vector, the context window is 5 words, and with min_count=1 no word is ignored. Each plot has its own number of words; every word is mapped to its 100-dimensional Word2Vec vector, and the vectors within a plot are averaged, so each plot is reduced to a single 100-dimensional vector. Then we calculate the cosine similarity among those three plot vectors.
From the result, we can see that these three movies are not similar, with similarity indices from about 0.1 to 0.2. All three movies are from 2020: Demon Slayer (a Japanese anime), The Eight Hundred (a Chinese war drama), and Bad Boys for Life (an American action comedy). These movies come from very different cultures.
from gensim.models import Word2Vec
with open('movies/movie1.txt', 'r') as f:
    m1 = f.read()
with open('movies/movie2.txt', 'r') as f:
    m2 = f.read()
with open('movies/movie3.txt', 'r') as f:
    m3 = f.read()
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string
# Download necessary NLTK datasets
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize text (convert text into a list of words)
    words = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]
    # Lemmatize words (convert words into their root form)
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    return words
# Example usage:
preprocessed_text1 = preprocess_text(m1)
preprocessed_text2 = preprocess_text(m2)
preprocessed_text3 = preprocess_text(m3)
print(preprocessed_text2)
['early', 'day', 'second', 'sinojapanese', 'war', 'greater', 'scale', 'world', 'war', 'ii', 'imperial', 'japanese', 'army', 'invaded', 'shanghai', 'became', 'known', 'battle', 'shanghai', 'holding', 'back', 'japanese', '3', 'month', 'suffering', 'heavy', 'loss', 'chinese', 'army', 'forced', 'retreat', 'due', 'danger', 'encircled', 'lieutenant', 'colonel', 'xie', 'jinyuan', '524th', 'regiment', 'underequipped', '88th', 'division', 'national', 'revolutionary', 'army', 'led', '452', 'young', 'officer', 'soldier', 'defend', 'sihang', 'warehouse', '3rd', 'imperial', 'japanese', 'division', 'consisting', 'around', '20000', 'troop', 'heroic', 'suicidal', 'last', 'stand', 'japanese', 'order', 'generalissimo', 'nationalist', 'china', 'chiang', 'kaishek', 'decision', 'made', 'provide', 'morale', 'boost', 'chinese', 'people', 'loss', 'beijing', 'shanghai', 'helped', 'spur', 'support', 'western', 'power', 'full', 'view', 'battle', 'international', 'settlement', 'shanghai', 'across', 'suzhou', 'creek', '6']
[nltk_data] Downloading package punkt to
[nltk_data] /Users/nguyenlinhchi/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data] /Users/nguyenlinhchi/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data] /Users/nguyenlinhchi/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data] /Users/nguyenlinhchi/nltk_data...
[nltk_data] Package omw-1.4 is already up-to-date!
import numpy as np
documents = [preprocessed_text1, preprocessed_text2, preprocessed_text3]
# `documents` is a list of token lists, one per movie plot (already split into words by preprocess_text)
# Train a Word2Vec model (in gensim >= 4.0 the dimensionality argument is `vector_size`, formerly `size`)
model = Word2Vec(documents, vector_size=100, window=5, min_count=1, workers=4)
# Vectorize the movies: average each plot's word vectors into a single 100-dimensional vector
movie_vectors = [np.mean([model.wv[word] for word in doc], axis=0) for doc in documents]
# Compute the similarity matrix
similarity_matrix = cosine_similarity(movie_vectors)
# Now you can use this similarity matrix to recommend similar movies
similarity_matrix
array([[0.99999994, 0.11791225, 0.17876801],
[0.11791225, 1.0000001 , 0.12657738],
[0.17876801, 0.12657738, 0.99999994]], dtype=float32)