Recommender Systems Session I Robin Burke DePaul University Chicago, IL
Roadmap
Session A: Basic Techniques I – Introduction; Knowledge Sources; Recommendation Types; Collaborative Recommendation
Session B: Basic Techniques II – Content-based Recommendation; Knowledge-based Recommendation
Session C: Domains and Implementation I – Recommendation domains; Example Implementation; Lab I
Session D: Evaluation I – Evaluation
Session E: Applications – User Interaction; Web Personalization
Session F: Implementation II – Lab II
Session G: Hybrid Recommendation
Session H: Robustness
Session I: Advanced Topics – Dynamics; Beyond accuracy
Current research
Question 1: do we lose something when we think of a ratings database as static? (my work)
Question 2: does a summary statistic like MAE hide valuable information? (Mike O'Mahony, UCD colleague)
Collaborative Dynamics: Remember our evaluation methodology: get all the ratings, divide them up into test / training data sets, run prediction tests.
Problem: That isn't how real recommender systems operate. They get a stream of ratings over time and have to respond dynamically to user requests: predictions and recommendation lists.
Questions Are early ratings more predictive than later ratings? Is there a pattern to how users build their profiles? How long does it take to get past the cold-start?
Some ideas Temporal leave-one-out Profile MAE Profile Hit Ratio
Temporal leave-one-out (TL1O): for a rating r(u,i) at time t, predict that r(u,i) using the ratings database immediately prior to t, i.e. the information that would have been available right before we learned u's real rating. Average the error over time intervals: we see how error evolves as data is added, cold-start in action.
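A minimal sketch of TL1O in Python, assuming ratings arrive as (user, item, rating, timestamp) tuples sorted by time and a pluggable predict function stands in for the actual recommender (both names are illustrative, not from the slides):

```python
from collections import defaultdict

def temporal_leave_one_out(ratings, predict, interval=30 * 86400):
    """Temporal leave-one-out (TL1O) sketch.

    ratings  : list of (user, item, rating, timestamp) tuples, sorted by timestamp
    predict  : function(train, user, item) -> predicted rating, or None if no
               prediction can be made (e.g. during the cold start)
    interval : bucket width in seconds for averaging error over time (~30 days here)
    """
    if not ratings:
        return {}
    errors_by_interval = defaultdict(list)
    start = ratings[0][3]
    for idx, (user, item, rating, t) in enumerate(ratings):
        train = ratings[:idx]              # only ratings known strictly before t
        if not train:
            continue
        pred = predict(train, user, item)  # "what would we have predicted then?"
        if pred is None:
            continue
        bucket = (t - start) // interval
        errors_by_interval[bucket].append(abs(pred - rating))
    # temporal MAE: mean absolute error per time interval, in chronological order
    return {b: sum(e) / len(e) for b, e in sorted(errors_by_interval.items())}

# Toy stand-in for a real recommender (the slides use a kNN collaborative filter):
def global_mean_predictor(train, user, item):
    values = [r for (_, _, r, _) in train]
    return sum(values) / len(values) if values else None
```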
Profile MAE: for each profile, do the TL1O predictions for its ratings; average over all profiles of that length; see the aggregate evolution of profiles.
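A small companion sketch, assuming each TL1O prediction has been tagged with the length the user's profile had reached at that moment (this pairing is hypothetical bookkeeping, not shown in the slides):

```python
from collections import defaultdict

def profile_mae(errors_by_profile_length):
    """Aggregate TL1O errors by profile length (a sketch).

    errors_by_profile_length : iterable of (profile_length_at_prediction, abs_error)
    Returns {profile_length: mean absolute error}, i.e. the average error over all
    profiles that had reached that length when the prediction was made.
    """
    buckets = defaultdict(list)
    for length, err in errors_by_profile_length:
        buckets[length].append(err)
    return {n: sum(e) / len(e) for n, e in sorted(buckets.items())}

# e.g. profile_mae([(1, 1.0), (1, 0.5), (2, 0.5)]) -> {1: 0.75, 2: 0.5}
```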
Profile Hit Ratio: do a similar thing for hit ratio. For each liked item (r(u,i) > 3) at time t: create a recommendation list at time t, measure the rank of item i on that list, and compute the hit ratio of such items on lists of length k.
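A sketch of the hit-ratio measurement under the same assumptions, with a hypothetical recommend(train, user, k) returning a ranked top-k list built from the ratings known before time t; the rank bookkeeping is omitted for brevity:

```python
def profile_hit_ratio(ratings, recommend, k=50, like_threshold=3):
    """Hit ratio for liked items, measured over time (a sketch).

    ratings   : list of (user, item, rating, timestamp) tuples, sorted by timestamp
    recommend : function(train, user, k) -> ranked list of item ids
                (placeholder for the real recommender)
    A liked item is one rated above like_threshold; it is a hit if it appears on
    the top-k recommendation list produced just before we learned the rating.
    """
    hits = trials = 0
    for idx, (user, item, rating, t) in enumerate(ratings):
        if rating <= like_threshold:
            continue                  # only liked items (r(u,i) > 3) are tested
        train = ratings[:idx]         # ratings available strictly before time t
        if not train:
            continue
        top_k = recommend(train, user, k)
        trials += 1
        if item in top_k:
            hits += 1
    return hits / trials if trials else 0.0
```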
Temporal MAE (ML1M)
Cold Start: seems to take about 150 days to get past the initial cold start (about 15% of the data). Temporal MAE improves after that, but not as steeply.
Profile MAE: decrease in MAE as profiles get longer, with the strongest decrease earlier in the curve. Seems to be a kNN property: the same thing happens if the first 150 days (the cold-start period) are excluded.
Diminishing returns: appears to be diminishing returns at longer profile sizes, which is paradoxical given what we know about sparsity: more data should be better.
A clue: ML100K data (10% of the data size). Sparser data compresses the curve. Diminishing returns may be a function of the average profile length.
Average rating Users seem to add positive ratings first and negative ratings later
Application-dependence: could be because ratings are added in response to recommendations. Easy (popular) recommendations are given first and are likely to be right; later recommendations have more errors, so users rate them lower.
Profile Hit Ratio: cumulative hit ratio, n=50. Dashed line is random performance.
Interestingly: harder to see. There appear to be diminishing returns, as with MAE, but then a jump at the end. Need to examine this data more (ML100K data; experiments are very slow to run).
MAE for different ratings: odd result. MAE for each rating value is correlated with the # of ratings of that value in the profile (subtract out the contribution of the total # of ratings of that value). May tell us the average value of adding a rating of a particular type. Look at R=5? Saturation; more about this later.
Break
What Have The Neighbours Ever Done for Us? A Collaborative Filtering Perspective. Michael O’Mahony 5th March, 2009
Presentation based on paper submitted to UMAP ’09 Authors: R. Rafter, M.P. O’Mahony, N. J. Hurley and B. Smyth
Collaborative Filtering: Collaborative filtering (CF) – key techniques used in recommender systems. Harnesses past ratings to make predictions & recommendations for new items: recommend items with high predicted ratings and suppress those with low predicted ratings. Assumption: CF techniques provide a considerable advantage over simpler average-rating approaches.
Valid Assumption? We analyse the following: What do CF techniques actually contribute? How is accuracy performance measured? What datasets are used to evaluate CF techniques? Consider two standard CF techniques: User-based and item-based CF
CF Algorithms Two components to user-based and item-based CF: Initial estimate: based on average rating of target user or item Neighbour estimate: based on ratings of similar users or items Must perturb the initial estimate: By the correct magnitude In the correct direction General formula:
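The "General formula" on the slide is an equation image that is not reproduced here; as a sketch, the two-component decomposition just described can be written as (symbols illustrative):

```latex
\[
\hat{r}_{u,i} \;=\; \underbrace{b_{u,i}}_{\text{initial estimate}}
            \;+\; \underbrace{n_{u,i}}_{\text{neighbour estimate}}
\]
```

where b_{u,i} is the mean-based initial estimate and n_{u,i} is the neighbour-based perturbation, whose magnitude and direction should move the prediction toward the true rating.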
CF Algorithms: the user-based and item-based prediction formulas, with the initial-estimate and neighbour-estimate components marked.
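The two formulas appear as images on the slide; below is a sketch of the standard forms (a Resnick-style user-based prediction and its item-based analogue), which may differ in notation from the originals:

```latex
% User-based CF: initial estimate = target user's mean rating
\[
\hat{r}_{u,i} = \bar{r}_u +
  \frac{\sum_{v \in N(u)} \mathrm{sim}(u,v)\,(r_{v,i} - \bar{r}_v)}
       {\sum_{v \in N(u)} |\mathrm{sim}(u,v)|}
\]

% Item-based CF: initial estimate = target item's mean rating
\[
\hat{r}_{u,i} = \bar{r}_i +
  \frac{\sum_{j \in N(i)} \mathrm{sim}(i,j)\,(r_{u,j} - \bar{r}_j)}
       {\sum_{j \in N(i)} |\mathrm{sim}(i,j)|}
\]
```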
Evaluating Accuracy: Predictive accuracy: Mean Absolute Error (MAE), calculated over all test set ratings (problem?). Other metrics – RMSE, ROC curves … – give similar trends.
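For reference, the standard MAE definition over a test set T of withheld ratings:

```latex
\[
\mathrm{MAE} = \frac{1}{|T|} \sum_{(u,i) \in T} \left| \hat{r}_{u,i} - r_{u,i} \right|
\]
```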
Evaluation
Datasets:
  Dataset         # Users    # Items    # Ratings    Sparsity    Rating Scale
  MovieLens       943        1,682      100,000      93.695%     1 – 5
  Netflix         24,010     17,471     5,581,775    98.690%     1 – 5
  Book-crossing   77,805     185,973    433,671      99.997%     1 – 10
Procedure: create a test set by randomly removing 10% of ratings; make predictions for the test set ratings using the remaining data; repeat x10 and compute the average MAE.
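A minimal sketch of this procedure in Python (10% random holdout, repeated ten times, MAE averaged across runs), with predict again a placeholder for whichever CF variant is under test:

```python
import random

def holdout_mae(ratings, predict, holdout=0.10, repeats=10, seed=0):
    """Average MAE over repeated random train/test splits (a sketch).

    ratings : list of (user, item, rating) tuples
    predict : function(train, user, item) -> predicted rating, or None
    """
    rng = random.Random(seed)
    fold_maes = []
    for _ in range(repeats):
        shuffled = ratings[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * holdout)
        test, train = shuffled[:cut], shuffled[cut:]
        errors = []
        for user, item, rating in test:
            pred = predict(train, user, item)
            if pred is not None:
                errors.append(abs(pred - rating))
        if errors:
            fold_maes.append(sum(errors) / len(errors))
    return sum(fold_maes) / len(fold_maes) if fold_maes else None
```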
Results
  Dataset         User-based                    Item-based
                  Mag.    Cor. Dir.    MAE      Mag.    Cor. Dir.    MAE
  MovieLens       0.43    66%          0.73     0.34    64%          0.73
  Netflix         0.41    –            0.70     0.35    67%          0.69
  Book-crossing   0.99    53%          1.53     0.94    63%          1.34
Average performance, computed over all test set ratings. Neighbour estimate magnitudes are small, between 8.5% and 11% of the rating range. Item-based CF is comparable to or outperforms user-based CF wrt MAE (smaller magnitudes observed for item-based CF). Book-crossing dataset – user-based CF shifts the initial estimate in the correct direction in only 53% of cases (just slightly better than chance!).
Neighbour Magnitude
Datasets: Frequency of occurrence of ratings: bias (natural?) toward ratings on the higher end of the scale. Consider MovieLens: most ratings are 3 and 4, and the mean user rating ≈ 3.6 – a small neighbour estimate magnitude is required in most cases. Consequences of such dataset characteristics for CF research: computing average MAE across all test set ratings can hide performance issues in light of such characteristics [Shardanand and Maes 1995]. For example, can CF achieve large magnitudes when needed?
MAE vs Actual Ratings Recall: average overall MAE = 0.73 for both UB and IB …
Error PDFs
Neighbour Contribution Effect of neighbour estimate versus initial (mean-based) estimate:
Neighbour Contribution
Conclusions: Examined the contribution of standard CF techniques: neighbours have a small influence (magnitude) which is not always reliable (direction). Evaluating accuracy performance: need for more fine-grained error analysis [Shardanand and Maes 1995]; focus on developing CF algorithms which offer improved accuracy for extreme ratings. Test datasets: standard datasets have particular characteristics – e.g. bias in ratings toward the higher end of the rating scale – need for new datasets. Such characteristics, combined with using overall MAE to evaluate accuracy, have "hidden" performance issues – and hindered CF development (?)
That’s all folks! Questions?