Learn how to create a movie recommendation system in Python with this detailed step-by-step guide. Discover data preprocessing, feature extraction, and similarity computation techniques to build your own recommendation engine.
Creating a movie recommendation system involves several steps, including data preprocessing, feature extraction, and similarity computation. In this blog, we'll walk through how to build a simple movie recommendation system using Python. Our system will leverage movie metadata such as genres, keywords, cast, and crew to recommend movies similar to a given title.
Setting Up Your Environment
Installing Python
First, ensure you have Python installed on your system. You can download it from python.org. We recommend using Python 3.7 or later.
Necessary Libraries
You'll need several libraries to build your recommendation system:
- Pandas: For data manipulation and analysis.
- NumPy: For numerical operations.
- Scikit-learn: For machine learning algorithms.
- NLTK: For natural language processing.
Install these libraries using pip:
pip install numpy pandas nltk scikit-learn
Step by Step Movie Recommendation System in Python
1. Loading the Data
We will use two datasets: credits.csv and movies.csv. The credits.csv file contains information about the cast and crew of the movies, while the movies.csv file contains movie details such as title, overview, genres, and keywords.
import ast
import pickle

import numpy as np
import pandas as pd
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load the datasets
credits = pd.read_csv('credits.csv')
movies = pd.read_csv('movies.csv')

# Merge datasets on movie title
movies = movies.merge(credits, on='title')

# Keep only the columns used to build the recommendations
movies = movies[['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']]

# Drop rows with missing values, then reset the index so that row labels
# match row positions (the similarity matrix is addressed by position)
movies = movies.dropna().reset_index(drop=True)
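Dropping rows leaves gaps in the DataFrame's index labels, which matters later because the similarity matrix is addressed by row position rather than by label. The sketch below uses a small made-up frame (the titles and values are hypothetical, not from the datasets) to show how resetting the index keeps labels and positions aligned.

```python
import pandas as pd

# Toy data with hypothetical titles, standing in for the merged movies frame.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "overview": ["x", None, "y", "z"],
})

# dropna removes the row for "B" but keeps the old labels on the other rows.
dropped = toy.dropna()
labels_before = list(dropped.index)

# reset_index(drop=True) renumbers the rows 0..n-1 and discards the old labels.
aligned = dropped.reset_index(drop=True)
labels_after = list(aligned.index)

# The label of "D" before the reset points past the end of a 3-row matrix;
# after the reset it equals the row position.
label_of_d = dropped[dropped["title"] == "D"].index[0]
position_of_d = aligned[aligned["title"] == "D"].index[0]

print(labels_before, labels_after, label_of_d, position_of_d)
```

With only three rows left, a label of 3 would be out of range for a 3x3 similarity matrix, while the reset label of 2 is the correct position.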
2. Data Preprocessing
We need to preprocess the data to make it suitable for our recommendation system:
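- Convert the genres, keywords, and cast columns from JSON-style strings into Python lists of names.
- Keep only the first three cast members and, from the crew, only the director.
- Remove the spaces inside each name (so "Science Fiction" becomes "ScienceFiction"), so that a multi-word name is treated as a single token.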
def convert(text):
    # Parse the stringified list of dicts and keep only the 'name' values
    return [i['name'] for i in ast.literal_eval(text)]

movies['genres'] = movies['genres'].apply(convert)
movies['keywords'] = movies['keywords'].apply(convert)

# Keep the first three cast members
movies['cast'] = movies['cast'].apply(convert).apply(lambda x: x[0:3])

# Keep only crew members whose job is 'Director'
movies['crew'] = movies['crew'].apply(
    lambda x: [i['name'] for i in ast.literal_eval(x) if i['job'] == 'Director']
)

def collapse(L):
    # Remove spaces inside each name so it becomes a single token
    return [i.replace(" ", "") for i in L]

movies['cast'] = movies['cast'].apply(collapse)
movies['crew'] = movies['crew'].apply(collapse)
movies['genres'] = movies['genres'].apply(collapse)
movies['keywords'] = movies['keywords'].apply(collapse)
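To make the conversion concrete, here is a small standalone sketch that runs the same two steps on a single genres-style string. The helper name `names_from` is introduced only for this illustration.

```python
import ast

# A value in the same shape as an entry of the 'genres' column.
raw = ('[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, '
       '{"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]')

def names_from(text):
    """Parse a stringified list of dicts and return the 'name' values."""
    return [item["name"] for item in ast.literal_eval(text)]

def collapse(names):
    """Remove spaces inside each name so it becomes a single token."""
    return [name.replace(" ", "") for name in names]

genres = names_from(raw)
collapsed = collapse(genres)

print(genres)     # ['Action', 'Adventure', 'Fantasy', 'Science Fiction']
print(collapsed)  # ['Action', 'Adventure', 'Fantasy', 'ScienceFiction']
```

`ast.literal_eval` works here because the string is also a valid Python literal. For strings containing JSON-only values such as `true` or `null`, `json.loads` would be the safer choice.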
We also need to process the overview text by splitting it into words and combining it with other features into a single "tags" column:
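# Split the overview into a list of words
movies['overview'] = movies['overview'].apply(lambda x: x.split())

# Concatenate all feature lists into a single list per movie
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']

# Keep only the identifying columns plus the tags, joined into one string
new = movies.drop(columns=['overview', 'genres', 'keywords', 'cast', 'crew'])
new['tags'] = new['tags'].apply(lambda x: " ".join(x))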
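Adding two pandas columns that hold lists concatenates the lists row by row, which is what makes the "tags" construction work. A minimal sketch with two made-up movies (all values are hypothetical):

```python
import pandas as pd

# Two hypothetical movies whose fields are already lists of tokens.
toy = pd.DataFrame({
    "overview": [["a", "hero", "rises"], ["two", "friends", "travel"]],
    "genres": [["Action"], ["Comedy", "Adventure"]],
    "cast": [["JaneDoe"], ["JohnRoe"]],
})

# '+' on columns of lists joins the lists element-wise (row by row);
# " ".join then turns each combined list into one space-separated string.
toy["tags"] = (toy["overview"] + toy["genres"] + toy["cast"]).apply(" ".join)
tags = toy["tags"].tolist()

print(tags)
```

The mixed capitalization in the tags is harmless later on, since `CountVectorizer` lowercases text by default.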
3. Feature Extraction and Similarity Computation
We first stem the tags with PorterStemmer so that variants of the same word share one token, then use CountVectorizer to convert the text into numerical vectors, and finally compute the cosine similarity between those vectors:
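ps = PorterStemmer()

def stem(text):
    # Reduce each word to its stem so that variants of a word share one token
    return " ".join(ps.stem(word) for word in text.split())

# Stem the tags before vectorizing; stemming afterwards would not change the vectors
new['tags'] = new['tags'].apply(stem)

# Bag-of-words counts over the 5,000 most frequent terms, ignoring English stop words
cv = CountVectorizer(max_features=5000, stop_words='english')
vector = cv.fit_transform(new['tags']).toarray()

# Pairwise cosine similarity between all movies
similarity = cosine_similarity(vector)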
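To see what the similarity score measures, the sketch below computes cosine similarity by hand on word counts, using only the standard library. It is an illustration of the idea, not a replacement for scikit-learn's `cosine_similarity`, and the three short "tags" strings are made up.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between the word-count vectors of two strings."""
    counts_a, counts_b = Counter(a.split()), Counter(b.split())
    # Dot product over the words of a (words missing from b count as 0).
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc1 = "action hero city crime"
doc2 = "action hero space alien"
doc3 = "romance paris love story"

print(cosine(doc1, doc2))  # 0.5: two of four words are shared
print(cosine(doc1, doc3))  # 0.0: no words in common
```

Two movies whose tags overlap heavily score close to 1, while movies with no shared terms score 0.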
4. Building the Recommendation Function
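Finally, we create a function that recommends movies for a given title. It finds the movie's row, looks up that row of the precomputed similarity matrix, sorts the scores in descending order, and returns the five most similar titles, skipping the first entry, which is the movie itself: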
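def recommend(movie):
    # Row position of the movie, which is how the similarity matrix is indexed
    index = np.flatnonzero((new['title'] == movie).to_numpy())[0]

    # Pair each movie's position with its score and sort by score, highest first
    movie_list = sorted(enumerate(similarity[index]), reverse=True, key=lambda x: x[1])

    # Skip the first entry (the movie itself) and return the next five titles
    recommendations = [new.iloc[i]['title'] for i, _ in movie_list[1:6]]
    return recommendations

print(recommend('Batman Begins'))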
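The function above raises an error when the title is not in the dataset. One possible variant, sketched here on a tiny hand-written similarity matrix, returns an empty list instead. The name `recommend_safe`, the `top_n` parameter, the titles, and the scores are all hypothetical.

```python
# Hypothetical titles and a hand-written, symmetric similarity matrix.
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
similarity = [
    [1.0, 0.8, 0.1, 0.4],
    [0.8, 1.0, 0.3, 0.2],
    [0.1, 0.3, 1.0, 0.6],
    [0.4, 0.2, 0.6, 1.0],
]

def recommend_safe(movie, top_n=2):
    """Return up to top_n most similar titles, or [] if the title is unknown."""
    if movie not in titles:
        return []
    index = titles.index(movie)
    ranked = sorted(enumerate(similarity[index]), key=lambda pair: pair[1], reverse=True)
    # Exclude the query movie by position rather than assuming it sorts first.
    return [titles[i] for i, _ in ranked if i != index][:top_n]

print(recommend_safe("Movie A"))  # ['Movie B', 'Movie D']
print(recommend_safe("Unknown"))  # []
```

Excluding the query by position also covers the case where another movie has an identical tag string and ties with it at a score of 1.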
5. Saving the Model
To reuse the model without recomputing everything, we save the processed data and similarity matrix to disk:
# Use 'with' so that each file is closed once it has been written
with open('movie_list.pkl', 'wb') as f:
    pickle.dump(new, f)

with open('similarity.pkl', 'wb') as f:
    pickle.dump(similarity, f)
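Loading the saved objects back is the mirror image of saving them. The sketch below round-trips a small stand-in matrix through a temporary file. The real `movie_list.pkl` and `similarity.pkl` files would be loaded the same way with `pickle.load`.

```python
import os
import pickle
import tempfile

# Stand-in for the real similarity matrix; any picklable object works the same way.
similarity = [[1.0, 0.5], [0.5, 1.0]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "similarity.pkl")

    # Save the object to disk.
    with open(path, "wb") as f:
        pickle.dump(similarity, f)

    # Load it back, for example in a separate script or app.
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(loaded == similarity)  # True
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.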
Full Movie Recommendation System Project's Source Code
import ast
import pickle

import numpy as np
import pandas as pd
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load the datasets and merge them on movie title
credits = pd.read_csv('credits.csv')
movies = pd.read_csv('movies.csv')
movies = movies.merge(credits, on='title')

# Keep the relevant columns, drop incomplete rows, and realign the index
movies = movies[['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']]
movies = movies.dropna().reset_index(drop=True)

def convert(text):
    # Parse the stringified list of dicts and keep only the 'name' values
    return [i['name'] for i in ast.literal_eval(text)]

def fetch_director(text):
    # Keep only crew members whose job is 'Director'
    return [i['name'] for i in ast.literal_eval(text) if i['job'] == 'Director']

def collapse(L):
    # Remove spaces inside each name so it becomes a single token
    return [i.replace(" ", "") for i in L]

movies['genres'] = movies['genres'].apply(convert).apply(collapse)
movies['keywords'] = movies['keywords'].apply(convert).apply(collapse)
movies['cast'] = movies['cast'].apply(convert).apply(lambda x: x[0:3]).apply(collapse)
movies['crew'] = movies['crew'].apply(fetch_director).apply(collapse)

# Combine all features into a single "tags" string per movie
movies['overview'] = movies['overview'].apply(lambda x: x.split())
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']
new = movies.drop(columns=['overview', 'genres', 'keywords', 'cast', 'crew'])
new['tags'] = new['tags'].apply(lambda x: " ".join(x))

# Stem the tags before vectorizing so that the vectors reflect the stems
ps = PorterStemmer()

def stem(text):
    return " ".join(ps.stem(word) for word in text.split())

new['tags'] = new['tags'].apply(stem)

# Vectorize the tags and compute pairwise cosine similarity
cv = CountVectorizer(max_features=5000, stop_words='english')
vector = cv.fit_transform(new['tags']).toarray()
similarity = cosine_similarity(vector)

def recommend(movie):
    # Row position of the movie, which is how the similarity matrix is indexed
    index = np.flatnonzero((new['title'] == movie).to_numpy())[0]
    movie_list = sorted(enumerate(similarity[index]), reverse=True, key=lambda x: x[1])
    # Skip the first entry (the movie itself) and return the next five titles
    return [new.iloc[i]['title'] for i, _ in movie_list[1:6]]

print(recommend('Batman Begins'))

# Save the processed data and the similarity matrix for reuse
with open('movie_list.pkl', 'wb') as f:
    pickle.dump(new, f)

with open('similarity.pkl', 'wb') as f:
    pickle.dump(similarity, f)
Conclusion
You've now built a basic movie recommendation system using Python. This system processes movie metadata, calculates similarities, and provides recommendations based on content. You can further enhance this system by incorporating user ratings, more advanced machine learning techniques, or additional features for better accuracy.
Feel free to experiment with the code and customize it according to your needs.
Code by: Kanishka Sah
Demo Here: https://movie-recommender-system-ml-ca6n1lthfcd-kanishka.streamlit.app/
That’s a wrap!
I hope you enjoyed this article
Thanks!
Faraz 😊