Performance of Recommender Algorithms on Top-N Recommendation Tasks RecSys 2010 Intelligent Database Systems Lab. School of Computer Science & Engineering.

1 Performance of Recommender Algorithms on Top-N Recommendation Tasks
RecSys 2010
Paolo Cremonesi (Politecnico di Milano), Yehuda Koren (Yahoo! Research, Haifa, Israel), Roberto Turrin (Neptuny, Milan, Italy)
Presented by Sangkeun Lee, 1/14/2011
Intelligent Database Systems Lab., Center for E-Business Technology, School of Computer Science & Engineering, Seoul National University, Seoul, Korea

2 Introduction
- Recommender systems typically compete on error metrics such as RMSE (root mean squared error)
  - RMSE summarizes the error between estimated ratings and actual ratings (the square root of the mean squared difference)
- Why does the majority of the literature focus on error metrics?
  - They are logical and convenient
- However, many commercial systems perform top-N recommendation tasks
  - The system suggests a few specific items that are likely to be very appealing to the user
Copyright 2010 by CEBT
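As a small illustration of the error metric mentioned on this slide, here is a minimal sketch of RMSE in Python. The function name `rmse` and the sample ratings are hypothetical and are not taken from the paper.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical predictions vs. true ratings on a 1-5 star scale.
print(rmse([4.0, 3.0, 5.0], [5.0, 3.0, 4.0]))
```

Note that the value depends only on how close each estimate is to the true rating, not on how the items would be ordered in a recommendation list.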

3 Introduction: Top-N Performance
- Classical error measures (e.g., RMSE, MAE) do not really measure top-N performance
- Measures for top-N performance
  - Accuracy metrics: recall and precision
- In this paper
  - The authors present an extensive evaluation of several state-of-the-art recommender algorithms and naïve non-personalized algorithms on the Netflix and Movielens datasets
  - They draw some insights from the experimental results

4 Testing Methodology: Dataset
- For each dataset, the known ratings are split into two subsets: a training set M and a test set T
  - The test set T contains only 5-star ratings
  - So we can reasonably state that T contains items relevant to the respective users
- For the Netflix dataset
  - Training set = the 100M-rating training dataset of the Netflix Prize
  - Test set = the 5-star ratings from the Netflix Prize probe dataset (|T| = 384,573)
- For the Movielens dataset
  - 1.4% of the ratings were randomly sub-sampled from the dataset to create the test set
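The Movielens-style split can be sketched as follows. This is only an illustration under stated assumptions: ratings are `(user, item, stars)` tuples, the function name `split_ratings` and the `seed` parameter are hypothetical, and held-out ratings below 5 stars are simply dropped rather than returned to M.

```python
import random

def split_ratings(ratings, probe_fraction=0.014, seed=0):
    """Split (user, item, stars) tuples into a training set M and a test set T."""
    rng = random.Random(seed)
    probe_size = round(len(ratings) * probe_fraction)
    probe_idx = set(rng.sample(range(len(ratings)), probe_size))
    # M: everything that was not held out.
    M = [r for k, r in enumerate(ratings) if k not in probe_idx]
    # T: only the 5-star ratings among the held-out ones.
    T = [r for k, r in enumerate(ratings) if k in probe_idx and r[2] == 5]
    return M, T

# Hypothetical toy data: 100 users x 10 items with synthetic star values.
ratings = [(u, i, (u + i) % 5 + 1) for u in range(100) for i in range(10)]
M, T = split_ratings(ratings)
print(len(M), len(T))
```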

5 Testing Methodology: measuring precision and recall
- 1) Train the model over the ratings in M
- 2) For each item i rated 5 stars by user u in T
  - Randomly select 1000 additional items unrated by user u
  - Predict the ratings for the test item i and for the additional 1000 items
  - Form a ranked list by ordering all 1001 items according to their predicted ratings. Let p denote the rank of the test item i within this list (the best result is p = 1)
  - Form a top-N recommendation list by picking the N top-ranked items from the list. If p <= N we have a hit; otherwise we have a miss
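The ranking step of this protocol can be sketched in a few lines. This is a minimal sketch, not the authors' code: `score` stands for any trained model's scoring function, all names are hypothetical, and ties are resolved in favor of the test item, which is an assumption the slides do not specify.

```python
import random

def rank_of_test_item(score, user, test_item, unrated_items, n_extra=1000, rng=random):
    """Rank p of the test item among itself plus n_extra random unrated items.

    unrated_items must be a list that does not contain test_item.
    """
    candidates = rng.sample(unrated_items, n_extra)
    test_score = score(user, test_item)
    # p = 1 + number of sampled items scored strictly higher than the test item.
    return 1 + sum(1 for j in candidates if score(user, j) > test_score)

def is_hit(p, N):
    """A test case is a hit when the test item lands in the top-N list."""
    return p <= N

# Hypothetical scorer: items with smaller ids get higher scores.
toy_score = lambda user, item: -item
p = rank_of_test_item(toy_score, user=0, test_item=0, unrated_items=list(range(1, 2001)))
print(p, is_hit(p, 10))
```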

6 Testing Methodology: measuring precision and recall
- For any single test case
  - Recall can be either 0 (miss) or 1 (hit)
  - Precision can be either 0 (miss) or 1/N (hit)
- The overall recall and precision are obtained by averaging over all test cases
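Averaging these per-case values gives recall(N) = #hits / |T| and precision(N) = recall(N) / N. A minimal sketch, assuming `ranks` holds the rank p of each test case (the function names are hypothetical):

```python
def recall_at(ranks, N):
    """Fraction of test cases whose test item is ranked within the top N."""
    hits = sum(1 for p in ranks if p <= N)
    return hits / len(ranks)

def precision_at(ranks, N):
    """Each hit contributes 1/N, so precision is recall divided by N."""
    return recall_at(ranks, N) / N

# Hypothetical ranks for four test cases; two of them fall within the top 10.
ranks = [1, 3, 12, 250]
print(recall_at(ranks, 10), precision_at(ranks, 10))  # 0.5 0.05
```

Because precision is just recall scaled by 1/N under this protocol, the two metrics always rank algorithms in the same order for a fixed N.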

7 Rating distribution: Popular items vs. Long-tail
- About 33% of the ratings collected by Netflix involve only the 1.7% most popular items
- To evaluate the accuracy of recommender algorithms in suggesting non-trivial items, T has been partitioned into T_head (test cases on the most popular items) and T_long (test cases on the remaining long-tail items)
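One way to build such a partition is sketched below. It assumes the head is the smallest set of most popular items that covers a given share of the training ratings (33% here, matching the figure on the slide). The exact cutoff rule, the tuple layout, and the name `split_head_long` are assumptions for illustration.

```python
from collections import Counter

def split_head_long(train, test, head_share=0.33):
    """Partition test cases by whether their item belongs to the popular head."""
    counts = Counter(item for _, item, _ in train)
    total = sum(counts.values())
    head_items, covered = set(), 0
    # Add items from most to least popular until the head covers head_share.
    for item, c in counts.most_common():
        if covered >= head_share * total:
            break
        head_items.add(item)
        covered += c
    T_head = [t for t in test if t[1] in head_items]
    T_long = [t for t in test if t[1] not in head_items]
    return T_head, T_long

# Hypothetical toy data: item "a" alone accounts for half of the ratings.
train = [(u, "a", 4) for u in range(50)] + [(u, "b", 3) for u in range(30)] + [(u, "c", 5) for u in range(20)]
test = [(1, "a", 5), (2, "b", 5), (3, "c", 5)]
print(split_head_long(train, test))
```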

8 Algorithms
- Non-personalized models
  - Movie Rating Average (MovieAvg): ranks items by their average rating
  - Top Popular (TopPop): ranks items by their number of ratings; error metrics are not applicable because it does not estimate ratings
- Collaborative filtering models
  - Neighborhood models
    - The most common approach
    - Based on similarity among either users or items
  - Latent factor models
    - Find hidden factors
    - Model users and items in the same latent factor space
    - Predict ratings using proximity (e.g., inner product)
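The two non-personalized baselines are simple enough to sketch directly. The code below is an illustration only; the function names and the `(user, item, stars)` tuple layout are assumptions.

```python
from collections import defaultdict

def non_personalized_scores(train):
    """Return (TopPop scores, MovieAvg scores) as dicts keyed by item."""
    count = defaultdict(int)
    total = defaultdict(float)
    for _, item, stars in train:
        count[item] += 1
        total[item] += stars
    top_pop = dict(count)                                  # number of ratings
    movie_avg = {i: total[i] / count[i] for i in count}    # mean rating
    return top_pop, movie_avg

def top_n(scores, N, exclude=()):
    """The N highest-scoring items, skipping those in exclude (e.g. already rated)."""
    ranked = sorted((i for i in scores if i not in exclude),
                    key=lambda i: scores[i], reverse=True)
    return ranked[:N]

# Hypothetical toy ratings.
train = [(1, "a", 5), (2, "a", 3), (3, "b", 5), (1, "c", 2), (2, "c", 2), (3, "c", 2)]
top_pop, movie_avg = non_personalized_scores(train)
print(top_n(top_pop, 2), top_n(movie_avg, 2))  # ['c', 'a'] ['b', 'a']
```

Both baselines give every user the same ranking, apart from the items excluded for that user.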

9 Neighborhood Models
- The score of the non-normalized variant (NNCosNgbr) is no longer an estimated rating, but it can still be used to rank items for top-N recommendation tasks

10 Latent Factor Models

11 Latent Factor Models: PureSVD
- The PureSVD score is no longer an estimated rating, but it can still be used to rank items for top-N recommendation tasks
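To make the idea concrete, here is a minimal sketch of a PureSVD-style scorer. It assumes that PureSVD means a plain truncated SVD of the user-item rating matrix with missing ratings filled with zeros, and that a user's scores are obtained by projecting the user's rating row onto the top f item factors. It uses a small dense NumPy array for clarity; a real dataset would need a sparse solver. The function names are hypothetical.

```python
import numpy as np

def pure_svd_scores(R, f):
    """R: dense users-by-items array with 0 for missing ratings.

    Returns a users-by-items score matrix. Scores are only meaningful
    for ranking; they are not estimated ratings.
    """
    # Thin SVD; rows of Vt are item-factor directions, largest singular value first.
    _, _, Vt = np.linalg.svd(R, full_matrices=False)
    Q = Vt[:f].T            # items x f
    return R @ Q @ Q.T      # project each rating row onto the top-f factors

def recommend(R, scores, user, N):
    """Top-N unrated items for one user, ordered by descending score."""
    unrated = np.flatnonzero(R[user] == 0)
    order = unrated[np.argsort(-scores[user, unrated])]
    return order[:N].tolist()

# Hypothetical toy matrix: users 0-1 like items 0-1, users 2-3 like items 2-3.
R = np.array([[5, 4, 0, 0],
              [4, 5, 0, 0],
              [0, 0, 5, 4],
              [0, 0, 4, 0]], dtype=float)
scores = pure_svd_scores(R, f=2)
print(recommend(R, scores, user=3, N=1))  # [3]
```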

12 RMSE Ranking
- SVD++: 0.8911
- AsySVD: 0.9000
- CorNgbr: 0.9406
- MovieAvg: 1.053
- Note that TopPop, NNCosNgbr, and PureSVD do not estimate ratings, so error metrics are not applicable to them

13 Result: Movielens dataset
[Figure: results on the Movielens dataset, with the slide annotations "Similar!?" and "Best!?"]

14 Result: Netflix dataset
- All items
  - TopPop outperforms CorNgbr
  - AsySVD and SVD++ perform slightly better than TopPop (note that these algorithms are possibly better tuned for the Netflix data)
  - NNCosNgbr works well
  - PureSVD is still the best
- Long-tail
  - CorNgbr significantly underperforms on the head
  - But it performs well on long-tail data (which probably explains why CorNgbr has been widely used)

15 PureSVD??
- Poor design in terms of rating estimation
  - The authors did not expect this result
- PureSVD
  - Easy to code, with good computational performance both offline and online
  - When moving to longer-tail items, accuracy improves as the dimensionality of the PureSVD model is raised (50 -> 150)
    - This could mean that the first latent factors capture properties of popular items, while additional factors capture properties of long-tail items

16 Conclusions
- Error metrics have been more popular
  - Mathematical convenience
  - Formal optimization
  - However, it is well recognized that accuracy measures may be more natural
- In summary
  - (1) There is no monotonic (trivial) relation between error metrics and accuracy metrics
  - (2) Test cases should be selected carefully, as the experimental results show (long-tail vs. head). Watch out for the possible pitfalls!
  - (3) New variants of existing algorithms improve top-N performance

