A Content-Based Approach to Collaborative Filtering
Brandon Douthit-Wood
CS 470 – Final Presentation

Collaborative Filtering
Method of automating word-of-mouth
Large groups of users collaborate by rating products, services, news articles, etc.
Analyze ratings data of the group to produce recommendations for individual users
  – Find users with similar tastes

Problems with Collaborative Filtering Methods
Performance
  – Prohibitively large dataset
Scalability
  – Will the solution scale to millions of users on the Internet?
Sparsity of data
  – User who has rated few items
  – Item with few ratings

Problems with Collaborative Filtering Methods
Cannot compare users that have no common ratings

                 User 1   User 2
Billy Madison      4        -
Happy Gilmore      -        5
Mr. Deeds          -        4
50 First Dates     5        -
Big Daddy          -        4

(Ratings on a scale of 1-5; "-" marks a movie the user has not rated)

A Content-Based Approach
Build a feature list for each user based on the content of the items they have rated
Compare users' feature lists to make recommendations
Now we can find similarity between users with no common ratings

Data Source
EachMovie Project
  – Compaq Systems Research Center
  – Over 18 months, collected 2,811,983 ratings for 1,628 movies from 72,916 users
  – Ratings given on a 1-5 scale
  – Dataset split into 75% training, 25% testing
Internet Movie Database (IMDb)
  – Huge database of movie information: actors, director, genre, plot description, etc.

Creating the Feature List
Retrieve content information for each movie from the IMDb dataset and create a “bag of words”
Throw out common words (e.g., the, and, but)
Calculate the frequency of the remaining words to create the movie's feature list
  – Frequencies weighted based on total number of terms

Goldeneye
satellite  2    destroy  2
xenia      3    london   2
thriller   2    villain  2
simon      4    revenge  2
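
The steps on this slide can be sketched in a few lines of Python. This is only an illustration, not the presenter's code: the stop-word list, the tokenizing regex, the sample sentence, and the choice to divide each count by the number of terms left after stop-word removal are all assumptions, since the slide does not say exactly how frequencies were weighted.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the slide only names "the", "and", "but".
STOP_WORDS = {"the", "and", "but", "a", "of", "to", "in", "is"}

def build_feature_list(text, stop_words=STOP_WORDS):
    """Return {word: weighted frequency} for one movie's text."""
    # Lowercase the text and split it into a bag of words.
    words = [w for w in re.findall(r"[a-z0-9']+", text.lower())
             if w not in stop_words]
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Assumption: weight each count by the number of remaining terms.
    return {word: n / total for word, n in counts.items()}

# Made-up sample text, not taken from IMDb.
features = build_feature_list(
    "The villain and the satellite: a thriller in London, the villain returns"
)
print(sorted(features.items(), key=lambda kv: -kv[1]))
```

With this weighting the values in a feature list sum to 1, so long and short plot descriptions end up on the same scale.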

Comparing Users
Each user has a positive and a negative feature list
  – Combine feature lists of movies they have rated
Compare users' feature lists using the Pearson Correlation Coefficient
Users can be compared with no common ratings
Able to recommend items with few ratings
Users only need to rate a few items to receive recommendations
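
As a rough sketch of the comparison, the code below combines per-movie feature lists by adding their weights and computes the Pearson correlation over the union of both users' words. Treating a word that is missing from one list as weight 0, the function names `combine` and `pearson`, and the toy feature values are all assumptions made for illustration.

```python
import math

def combine(feature_lists):
    """Add up several {word: weight} dicts into one feature list."""
    total = {}
    for features in feature_lists:
        for word, weight in features.items():
            total[word] = total.get(word, 0) + weight
    return total

def pearson(a, b):
    """Pearson correlation of two feature lists over their combined vocabulary."""
    vocab = sorted(set(a) | set(b))
    n = len(vocab)
    if n < 2:
        return 0.0
    # Assumption: a word absent from a list has weight 0.
    xs = [a.get(w, 0.0) for w in vocab]
    ys = [b.get(w, 0.0) for w in vocab]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant list has no defined correlation
    return cov / math.sqrt(var_x * var_y)

# Toy positive feature lists for two users who rated different movies.
user_a = combine([{"comedy": 2, "golf": 1}, {"comedy": 1, "sandler": 2}])
user_b = combine([{"comedy": 2, "sandler": 3}, {"hockey": 1}])
print(pearson(user_a, user_b))
```

Because the correlation is taken over words rather than over movies, the two users get a similarity score even though they share no rated movie.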

Methods
Three methods attempted to improve performance:
  – Clustering of users
  – Random groups of users
  – Compare users directly to items

User Clustering
Simple algorithm, starting with first user:
  – Compare to existing clusters first
      If similarity is high, merge user into cluster
  – Compare to each remaining user
  – Stop if correlation is above threshold
  – Once a similar user is found, create a new cluster from the two users
Cluster has combined feature list of all its users
Not as efficient as possible: O(n^2)
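
One possible reading of this algorithm is sketched below. The threshold value, the data layout, the helper names, and the decision to leave a user unclustered when nothing matches are assumptions; the slide does not specify them.

```python
import math

def pearson(a, b):
    """Pearson correlation of two {word: weight} dicts (missing word = 0)."""
    vocab = sorted(set(a) | set(b))
    n = len(vocab)
    if n < 2:
        return 0.0
    xs = [a.get(w, 0.0) for w in vocab]
    ys = [b.get(w, 0.0) for w in vocab]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0

def merge(a, b):
    """Combined feature list of two feature lists."""
    return {w: a.get(w, 0.0) + b.get(w, 0.0) for w in set(a) | set(b)}

def cluster_users(users, threshold=0.5):
    """users: {user_id: feature list}. Returns a list of cluster dicts."""
    clusters, assigned = [], set()
    ids = list(users)
    for i, uid in enumerate(ids):
        if uid in assigned:
            continue
        # Step 1: try to merge the user into an existing cluster.
        for cluster in clusters:
            if pearson(users[uid], cluster["features"]) >= threshold:
                cluster["members"].append(uid)
                cluster["features"] = merge(cluster["features"], users[uid])
                assigned.add(uid)
                break
        if uid in assigned:
            continue
        # Step 2: scan the remaining users and stop at the first similar one.
        for other in ids[i + 1:]:
            if other not in assigned and pearson(users[uid], users[other]) >= threshold:
                clusters.append({"members": [uid, other],
                                 "features": merge(users[uid], users[other])})
                assigned.update((uid, other))
                break
        # Assumption: a user with no match stays unclustered.
    return clusters

# Toy feature lists, made up for illustration.
users = {
    "u1": {"comedy": 3, "golf": 1, "war": 0},
    "u2": {"comedy": 2, "golf": 1, "war": 0},
    "u3": {"comedy": 0, "golf": 0, "war": 3},
    "u4": {"comedy": 4, "golf": 1, "war": 0},
}
clusters = cluster_users(users, threshold=0.5)
print([c["members"] for c in clusters])
```

In the worst case every user is compared with every other user, which is where the O(n^2) cost comes from.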

User Clustering
Once clusters are formed, we can predict ratings for each item
  – For each user, find their 10 nearest neighbors
  – Predicted rating is the average rating of item from these neighbors
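
The prediction step might look like the sketch below. It assumes that similarity scores between the target user and the candidate neighbors (for example, members of the same cluster) have already been computed, and that neighbors who have not rated the item are simply left out of the average. The function name and data layout are hypothetical.

```python
def predict_rating(similarities, ratings, item, k=10):
    """Average the item's rating over the k most similar neighbors.

    similarities: {neighbor_id: correlation with the target user}
    ratings:      {neighbor_id: {item: rating on the 1-5 scale}}
    Returns None when none of the k neighbors has rated the item.
    """
    # Pick the k neighbors with the highest similarity.
    nearest = sorted(similarities, key=similarities.get, reverse=True)[:k]
    # Assumption: neighbors without a rating for this item are skipped.
    votes = [ratings[n][item] for n in nearest if item in ratings.get(n, {})]
    if not votes:
        return None
    return sum(votes) / len(votes)

# Toy data, made up for illustration.
similarities = {"a": 0.9, "b": 0.8, "c": 0.1}
ratings = {
    "a": {"Big Daddy": 4},
    "b": {"Big Daddy": 5},
    "c": {"Big Daddy": 1},
}
print(predict_rating(similarities, ratings, "Big Daddy", k=2))
```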

Selecting a Random Group
Randomly select 5000 users as a (hopefully) representative sample
As before, find a user's 10 nearest neighbors from the random group
  – Predicted rating is the average rating of item from these neighbors
Much less work than clustering
  – How much accuracy (if any) will be lost?
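
A minimal sketch of the sampling step follows. The function name, the `similarity` callback, and the optional `seed` are assumptions, and the toy similarity used in the example (negative absolute difference of a single number) only stands in for the feature-list correlation used in the project.

```python
import random

def nearest_in_random_group(target, users, similarity,
                            group_size=5000, k=10, seed=None):
    """Sample a random group of users and return the k most similar to target.

    users:      {user_id: profile}
    similarity: function(profile_a, profile_b) -> higher means more similar
    """
    rng = random.Random(seed)
    pool = [u for u in users if u != target]
    # Draw the random group without replacement, capped at the pool size.
    group = rng.sample(pool, min(group_size, len(pool)))
    # Rank only the sampled users, not the whole population.
    group.sort(key=lambda u: similarity(users[target], users[u]), reverse=True)
    return group[:k]

# Toy profiles and a toy similarity, made up for illustration.
users = {"t": 5.0, "a": 4.9, "b": 1.0, "c": 5.2}

def toy_similarity(x, y):
    return -abs(x - y)

print(nearest_in_random_group("t", users, toy_similarity, k=2))
```

Each target user is compared with at most `group_size` other users, so the cost per prediction stays fixed no matter how large the full user base grows.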

Comparing Users to Items
No collaborative filtering involved
Compare the positive and negative feature lists of the user to the feature list of the item
  – Make prediction based on which feature list has higher correlation with item
Pretty quick and easy to do
  – How accurate will this be?
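
The purely content-based prediction can be sketched as follows. The function names and toy feature lists are hypothetical, a word missing from a list is treated as weight 0, and counting a tie as a negative prediction is an assumption the slide does not address.

```python
import math

def pearson(a, b):
    """Pearson correlation of two {word: weight} dicts (missing word = 0)."""
    vocab = sorted(set(a) | set(b))
    n = len(vocab)
    if n < 2:
        return 0.0
    xs = [a.get(w, 0.0) for w in vocab]
    ys = [b.get(w, 0.0) for w in vocab]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0

def predict_positive(positive, negative, item):
    """True if the item looks more like what the user liked than disliked."""
    # Assumption: a tie counts as a negative prediction.
    return pearson(positive, item) > pearson(negative, item)

# Toy feature lists, made up for illustration.
liked = {"comedy": 3, "golf": 1, "war": 0}
disliked = {"comedy": 0, "golf": 0, "war": 3}
comedy_movie = {"comedy": 2, "golf": 1, "war": 0}
war_movie = {"comedy": 0, "golf": 0, "war": 2}

print(predict_positive(liked, disliked, comedy_movie))
print(predict_positive(liked, disliked, war_movie))
```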

Analyzing Predictions
Collected 3 metrics to evaluate predictions
  – Accuracy: fraction of all items predicted correctly
  – Precision: fraction of items predicted positive that are actually positive
  – Recall: fraction of unseen positive items that are predicted positive
Precision and recall tend to have an inverse relationship
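
The three metrics can be computed as in the sketch below, which assumes the standard definitions based on true/false positives and negatives and represents each prediction as a boolean (True = positive). The function name and sample data are made up for illustration.

```python
def evaluate(predicted, actual):
    """Return (accuracy, precision, recall) for paired boolean lists."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # true positives
    fp = sum(1 for p, a in pairs if p and not a)      # false positives
    fn = sum(1 for p, a in pairs if not p and a)      # false negatives
    tn = sum(1 for p, a in pairs if not p and not a)  # true negatives
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Made-up predictions for five test items.
predicted = [True, True, False, False, True]
actual = [True, False, False, True, True]
accuracy, precision, recall = evaluate(predicted, actual)
print(accuracy, precision, recall)
```

Predicting "positive" more readily usually raises recall and lowers precision, which is the trade-off the slide points to.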
Results

Conclusions
Large gain from clustering users
  – Is the extra work worth it?
  – Depends on the application
Purely content-based predictions worked pretty well
  – Simple, fast solution
Random group prediction also performed reasonably well
Problems solved by content-based analysis:
  – Sparsity of data
  – Performance
  – Scalability