Collaborative Filtering Based Book Recommendation Engine

September 01, 2019


Introduction

Recommendation engines underpin nearly every major retail, video-on-demand, and music streaming service, and they have redefined the way we shop, reconnect with old friends, and discover new music or places to visit. From finding the best product on the market to searching for an old friend online or listening to songs while driving, recommender systems are everywhere. While movie and song datasets have been studied extensively to understand how recommendation engines work for those applications and where they can improve, book recommendation engines have received comparatively little attention.

The primary goal of this project is to develop a collaborative filtering based book recommendation model, using the Goodreads dataset, that can suggest to readers which books to read next. Additionally, data wrangling and exploratory data analysis are used to draw insights about users’ reading preferences (e.g. how they tag books and what ratings they typically give) and current trends in the book market (e.g. book categories in demand and successful authors).

Key Development Goals

The recommendation system should have the following capabilities:

  1. For new or anonymous users, the recommendation engine can make base-case recommendations based on past ratings and/or search keywords.

  2. Given a user ID and the user’s search preferences, the collaborative filtering model can make personalized recommendations to an active user based on his/her activity history.

  3. The search engine has smart filtering capability and can provide built-in tag recommendations/suggestions to further refine the search.

Dataset

The Goodreads dataset for this project is available on Kaggle. A link to the original dataset is given at the end of this section.

ratings.csv contains user_ids, book_ids and ratings. It has 6,000,000 observations.

books.csv has metadata for each book (goodreads IDs, authors, title, average rating, etc.). The raw dataset has 23 columns and 10,000 entries.

book_tags.csv contains tags/shelves/genres assigned by users to books. Tags in this file are represented by their IDs. The file has 999,912 observations.

tags.csv translates tag IDs to names. The file contains 34,252 observations.

to_read.csv provides IDs of the books marked “to read” by each user, as user_id,book_id pairs.

Link to the preliminary dataset: https://github.com/zygmuntz/goodbooks-10k.
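As a quick sketch of how these files fit together, they can be loaded and joined with pandas. The column names below follow the goodbooks-10k repository and the file paths are assumed to be local, so treat this as illustrative:

```python
import pandas as pd

# Load the five goodbooks-10k files (assumed to sit in the working directory)
ratings = pd.read_csv("ratings.csv")      # user_id, book_id, rating
books = pd.read_csv("books.csv")          # book_id, goodreads_book_id, authors, title, ...
book_tags = pd.read_csv("book_tags.csv")  # goodreads_book_id, tag_id, count
tags = pd.read_csv("tags.csv")            # tag_id, tag_name
to_read = pd.read_csv("to_read.csv")      # user_id, book_id

# book_tags references books by goodreads_book_id and tags by tag_id,
# so tag names can be attached to book metadata via two merges
book_tag_names = (book_tags
                  .merge(tags, on="tag_id")
                  .merge(books[["goodreads_book_id", "title"]], on="goodreads_book_id"))
print(book_tag_names.head())
```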

Solution Approach

The overall project is organized in the following framework. Please note that a detailed explanation of each step can be found in the separate notebook corresponding to that step.

1. Data Wrangling: (Link to Notebook)

The first step in this process was to quickly inspect all the datasets, identify how they are connected to each other, and then perform data wrangling (i.e. identify duplicates, missing values, and non-English titles or tag_names, and merge the different datasets to extract meaningful information) so that tidy datasets are available for exploratory data analysis and modeling. A key step in the data wrangling process was to explore the 34,252 different tag_names users have used to tag the books they are interested in, and use them to group books into 348 generalized tag names by identifying common patterns in the user-provided tags. For example, Science Fiction and Fantasy was chosen as the generalized tag name for books tagged as ‘dark-fantasy’, ‘epic-fantasy’, ‘fantasy-sci-fi’, ‘scifi’, ‘scifi-fantasy’, ‘sf-fantasy’ and many other alternatives.
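A minimal sketch of this tag-consolidation idea is given below, assuming the tags.csv file described above. The regular-expression patterns and the generalize helper are illustrative and cover only a few of the 348 generalized tags:

```python
import re
import pandas as pd

tags = pd.read_csv("tags.csv")  # tag_id, tag_name

# Illustrative patterns only; the actual mapping covered ~348 generalized tags
patterns = {
    r"sci[\-_ ]?fi|science[\-_ ]?fiction|fantasy": "Science Fiction and Fantasy",
    r"mystery|thriller|crime": "Mystery and Thriller",
    r"memoir|biograph": "Biography and Memoir",
}

def generalize(tag_name: str) -> str:
    """Map a raw user-provided tag_name onto a generalized tag name."""
    for pattern, general_name in patterns.items():
        if re.search(pattern, str(tag_name), flags=re.IGNORECASE):
            return general_name
    return "other"

tags["general_tag"] = tags["tag_name"].map(generalize)
print(tags[["tag_name", "general_tag"]].head(10))
```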

This step also made it possible to build a database of words or string patterns that readers may use while searching for a new book. An additional performance metric for each book was established by ranking the books based on how often readers tagged them (their total tag counts). Once data wrangling was complete, the clean datasets were exported for exploratory data analysis.
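One way such a tag-count ranking could be computed is sketched below; the column names (goodreads_book_id, count, title) are assumptions based on the goodbooks-10k files:

```python
import pandas as pd

books = pd.read_csv("books.csv")
book_tags = pd.read_csv("book_tags.csv")  # goodreads_book_id, tag_id, count

# Total number of times each book was tagged, used as an engagement/popularity proxy
tag_popularity = (book_tags.groupby("goodreads_book_id")["count"]
                  .sum()
                  .rename("total_tag_count")
                  .reset_index())

books_ranked = (books.merge(tag_popularity, on="goodreads_book_id", how="left")
                .sort_values("total_tag_count", ascending=False))
print(books_ranked[["title", "total_tag_count"]].head(10))
```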

2. Exploratory Data Analysis: (Link to Notebook)

In the exploratory data analysis, the goal was to explore the clean datasets and understand how factors such as category, author, and year of publication affect the rating of a book. It also helped to examine other factors that can serve as performance metrics for different books or authors (e.g. how many reviews a book received, how many books an author published, and how their ratings compare). Users’ preferences in tagging and rating books were also explored to better understand what built-in features (such as built-in tags) to offer in a book recommendation system to improve the overall user experience.
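As an example of this kind of analysis, author-level metrics can be summarized with a single groupby on the book metadata. This is a sketch; the column names follow books.csv and the chosen aggregations are illustrative, not the exact metrics used in the notebook:

```python
import pandas as pd

books = pd.read_csv("books.csv")

# Simple author-level performance metrics: number of books published,
# mean average rating, and total number of ratings received
author_stats = (books.groupby("authors")
                .agg(n_books=("book_id", "count"),
                     mean_rating=("average_rating", "mean"),
                     total_ratings=("ratings_count", "sum"))
                .sort_values("total_ratings", ascending=False))
print(author_stats.head(10))
```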

3. Dataset Size Selection For Modeling: (Link to Notebook)

A quick inspection of the ratings.csv dataset showed that it has 6,000,000 observations. As modeling with such a large number of observations is computationally challenging, it was important to restrict modeling to active users (i.e. users who read and rate frequently) and books receiving a significant number of reviews. This significantly reduced the computational time and complexity required for modeling while retaining the necessary information and patterns in the dataset. The cutoff points for users and books were identified by analyzing CDF plots of the number of reviews per user and the number of reviews per book, as these plots helped identify where the information retained and the size of the dataset are balanced.
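A sketch of this cutoff analysis is shown below; the threshold values are illustrative placeholders, not the cutoffs chosen in the project:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

ratings = pd.read_csv("ratings.csv")  # user_id, book_id, rating

reviews_per_user = ratings.groupby("user_id").size()
reviews_per_book = ratings.groupby("book_id").size()

def plot_cdf(counts, label):
    """Empirical CDF of the number of ratings per user or per book."""
    x = np.sort(counts.values)
    y = np.arange(1, len(x) + 1) / len(x)
    plt.plot(x, y, label=label)

plot_cdf(reviews_per_user, "ratings per user")
plot_cdf(reviews_per_book, "ratings per book")
plt.xscale("log")
plt.xlabel("number of ratings")
plt.ylabel("cumulative fraction")
plt.legend()
plt.show()

# Keep only active users and frequently rated books (illustrative cutoffs)
MIN_PER_USER, MIN_PER_BOOK = 50, 100
active_users = reviews_per_user[reviews_per_user >= MIN_PER_USER].index
popular_books = reviews_per_book[reviews_per_book >= MIN_PER_BOOK].index
ratings_small = ratings[ratings["user_id"].isin(active_users)
                        & ratings["book_id"].isin(popular_books)]
```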

Next, modeling was performed on different subsets of the truncated dataset, increasing the size of the subset at each step. The modeling accuracy (RMSE) was then calculated for each size to understand whether including more or fewer observations improves accuracy. This step helped identify the size of the final dataset to be used for modeling.
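This size-selection loop could look roughly like the following, assuming the scikit-surprise library; the file name ratings_filtered.csv and the subset sizes are illustrative:

```python
import pandas as pd
from surprise import Dataset, Reader, SVD
from surprise.model_selection import cross_validate

# Filtered ratings produced by the cutoff step above (hypothetical file name)
ratings_small = pd.read_csv("ratings_filtered.csv")  # user_id, book_id, rating

reader = Reader(rating_scale=(1, 5))

# Cross-validated RMSE on progressively larger random subsets of the ratings
for n_rows in [100_000, 250_000, 500_000, 1_000_000]:
    subset = ratings_small.sample(n=n_rows, random_state=42)
    data = Dataset.load_from_df(subset[["user_id", "book_id", "rating"]], reader)
    scores = cross_validate(SVD(), data, measures=["rmse"], cv=3, verbose=False)
    print(f"{n_rows:>9} ratings -> RMSE {scores['test_rmse'].mean():.4f}")
```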

4. Machine Learning (ML) Modeling and Optimization: (Link to Notebook)

In this step, each user’s rating history was divided into train and test datasets. The idea is to create a scenario where the train data represents the books the users have already read and rated, while the test data contains books that could be recommended to the user in the future. Different collaborative filtering algorithms (e.g. KNN, SVD, or matrix factorization) were then fit to the train dataset. To evaluate how each model would perform on unseen data, an RMSE score was calculated for each model using cross validation. The performance was compared with a baseline model to estimate whether ML modeling improves prediction accuracy. Hyperparameter optimization was then performed for each model to further improve its performance. The best performing model was finally used to predict users’ ratings for the books in the test dataset.
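A condensed sketch of this workflow is given below, again assuming the scikit-surprise library; the specific algorithms, grid values, and file name are illustrative rather than the exact choices made in the notebook:

```python
import pandas as pd
from surprise import Dataset, Reader, SVD, KNNBasic, NormalPredictor, accuracy
from surprise.model_selection import cross_validate, train_test_split, GridSearchCV

ratings_small = pd.read_csv("ratings_filtered.csv")  # user_id, book_id, rating
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings_small[["user_id", "book_id", "rating"]], reader)

# Compare a random baseline with collaborative filtering algorithms via 3-fold CV
for algo in [NormalPredictor(), KNNBasic(), SVD()]:
    scores = cross_validate(algo, data, measures=["rmse"], cv=3, verbose=False)
    print(f"{type(algo).__name__:>15}: RMSE {scores['test_rmse'].mean():.4f}")

# Hyperparameter search for SVD (grid values are illustrative)
param_grid = {"n_factors": [50, 100], "lr_all": [0.002, 0.005], "reg_all": [0.02, 0.1]}
grid = GridSearchCV(SVD, param_grid, measures=["rmse"], cv=3)
grid.fit(data)
print("best CV RMSE:", grid.best_score["rmse"], grid.best_params["rmse"])

# Refit the best model on a train split and score the held-out test split
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)
best_model = grid.best_estimator["rmse"]
best_model.fit(trainset)
accuracy.rmse(best_model.test(testset))
```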

5. Non Personalized and Personalized Recommendation System: (Link to Notebook)

Data wrangling and EDA made it possible to organize the books under different popular categories based on their tag_counts and the tag_names assigned by different users. They also helped develop a database of .csv files to aid keyword-based search, recommend tags for refining a search or shelving books, and compute and combine all the relevant information about each book.

In this final step of the project, the non-personalized recommendation system is implemented by using this book database to recommend books to a new user once the user specifies any of his or her preferences (categories, authors, number of books to search, etc.). The books in the search results are sorted by the average rating received in the past, and the top n results are shown. In contrast, the personalized recommendation system shows customized search results for a particular user, where the books are sorted by the ratings the user is predicted to give according to the ML model.
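A sketch of the personalized ranking step is shown below; the recommend_for_user helper and its column names are hypothetical, and it assumes a fitted scikit-surprise model such as the best_model from the previous step:

```python
import pandas as pd

def recommend_for_user(user_id, model, books, ratings, n=10):
    """Rank books the user has not yet rated by the model's predicted rating."""
    already_read = set(ratings.loc[ratings["user_id"] == user_id, "book_id"])
    candidates = books.loc[~books["book_id"].isin(already_read), "book_id"]
    # model.predict returns a Prediction whose .est field is the predicted rating
    preds = [(bid, model.predict(user_id, bid).est) for bid in candidates]
    top = pd.DataFrame(sorted(preds, key=lambda p: p[1], reverse=True)[:n],
                       columns=["book_id", "predicted_rating"])
    return top.merge(books[["book_id", "title", "authors"]], on="book_id")

# Example usage, assuming books/ratings DataFrames and a fitted model:
# print(recommend_for_user(314, best_model, books, ratings_small, n=10))
```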

Summary of Key Findings

Understanding User Behavior

Factors to Consider for a Book’s Rating

Book Categories

Authors in Demand

Rating Counts per Book and Per User

ML Models and Recommendation Engine: Overview

ML Modeling Results

Features of the Recommendation Engine