1
Performance of Recommender Algorithms on Top-N Recommendation Tasks
Paolo Cremonesi, Yehuda Koren, Roberto Turrin
Politecnico di Milano; Yahoo! Research, Haifa, Israel; Neptuny, Milan, Italy
RecSys 2010
Presented by Sangkeun Lee, 1/14/2011
Intelligent Database Systems Lab., School of Computer Science & Engineering / Center for E-Business Technology, Seoul National University, Seoul, Korea
2
Introduction
Recommender systems are commonly compared by evaluating error metrics such as RMSE (root mean squared error), the average error between estimated ratings and actual ratings.
Why is the majority of the literature focused on error metrics? They are logical and convenient.
However, many commercial systems perform top-N recommendation tasks: the system suggests a few specific items that are likely to be very appealing to the user.
3
Introduction: Top-N Performance
Classical error measures (e.g., RMSE, MAE) do not really measure top-N performance.
Measures for top-N performance: accuracy metrics – recall and precision.
In this paper, the authors present an extensive evaluation of several state-of-the-art recommender algorithms and naive non-personalized algorithms, and draw insights from the experimental results on the Netflix and MovieLens datasets.
4
Testing Methodology: Datasets
For each dataset, the known ratings are split into two subsets: a training set M and a test set T.
The test set T contains only 5-star ratings – so we can reasonably state that T contains items relevant to the respective users.
For the Netflix dataset: the training set is the ~100M-rating training data of the Netflix Prize, and the test set is the 5-star ratings from the Netflix Prize probe set (|T| = 384,573).
For the MovieLens dataset: 1.4% of the ratings were randomly sub-sampled to create the test set.
5
Testing Methodology: Measuring Precision and Recall
1) Train the model over the ratings in M.
2) For each item i rated 5 stars by user u in T:
– Randomly select 1,000 additional items unrated by user u.
– Predict ratings for the test item i and for the 1,000 additional items.
– Form a ranked list by ordering the 1,001 items according to their predicted ratings; let p denote the rank of the test item i within this list (the best result is p = 1).
– Form a top-N recommendation list by picking the N top-ranked items. If p <= N we have a hit; otherwise we have a miss.
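A minimal sketch of this protocol in Python, assuming a predict(user, item) scoring function, the full item catalogue, and each user's set of rated items are available (all names are illustrative, not from the paper):

    import random

    def evaluate_top_n(test_cases, all_items, rated_by, predict, N=10, n_extra=1000, seed=42):
        """test_cases: (user, item) pairs rated 5 stars in T; rated_by[u]: items rated by user u.
        Returns overall recall and precision at N, averaged over all test cases."""
        rng = random.Random(seed)
        hits = 0
        for user, test_item in test_cases:
            # 1,000 random items the user has not rated, plus the test item itself
            unrated = [i for i in all_items if i not in rated_by[user]]
            candidates = rng.sample(unrated, n_extra) + [test_item]
            # rank the 1,001 items by predicted score; p is the rank of the test item
            ranked = sorted(candidates, key=lambda i: predict(user, i), reverse=True)
            p = ranked.index(test_item) + 1
            if p <= N:
                hits += 1  # hit: the test item made it into the top-N list
        recall = hits / len(test_cases)
        precision = recall / N
        return recall, precision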
6
Testing Methodology: Measuring Precision and Recall (cont.)
For any single test case, recall can assume either the value 0 (miss) or 1 (hit).
Precision for a single test case can assume either the value 0 (miss) or 1/N (hit).
The overall recall and precision are defined by averaging over all test cases.
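Written out (a reconstruction consistent with the description above), the overall measures at list length N are:

    \mathrm{recall}(N) = \frac{\#\text{hits}}{|T|}, \qquad
    \mathrm{precision}(N) = \frac{\#\text{hits}}{N \cdot |T|} = \frac{\mathrm{recall}(N)}{N}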
7
Rating Distribution: Popular Items vs. Long Tail
About 33% of the ratings collected by Netflix involve only the 1.7% most popular items.
To evaluate the accuracy of recommender algorithms in suggesting non-trivial items, T has been partitioned into T_head and T_long.
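A hedged sketch of that partition, assuming train_counts maps each item to its number of training ratings and approximating the head as the most-rated 1.7% of items (names and cut-off handling are illustrative):

    def split_test_by_popularity(test_cases, train_counts, head_fraction=0.017):
        """Split (user, item) test cases into head and long-tail subsets
        according to item popularity in the training set."""
        ranked_items = sorted(train_counts, key=train_counts.get, reverse=True)
        head_items = set(ranked_items[:max(1, int(len(ranked_items) * head_fraction))])
        t_head = [(u, i) for (u, i) in test_cases if i in head_items]
        t_long = [(u, i) for (u, i) in test_cases if i not in head_items]
        return t_head, t_long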
8
Algorithms
Non-personalized models:
– Movie Rating Average (MovieAvg): ranks items by their average rating.
– Top Popular (TopPop): ranks items by their number of ratings; not applicable to error metrics.
Collaborative filtering models:
– Neighborhood models: the most common approaches, based on similarity among either users or items.
– Latent factor models: find hidden factors, model users and items in the same latent factor space, and predict ratings using proximity (e.g., inner product).
9
Neighborhood Models
The score computed this way is no longer an estimated rating, but it can still be used to rank items for top-N recommendation.
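The slide's formula did not survive extraction; below is a hedged sketch of a non-normalized item-based neighborhood score of the kind the paper evaluates (dropping the usual normalizing denominator is what turns the value into a ranking score rather than a rating estimate); item_sim and k are illustrative names:

    def neighborhood_score(user_ratings, target_item, item_sim, k=50):
        """user_ratings: {item: rating} for one user; item_sim(i, j): item-item similarity.
        Returns the sum of similarity * rating over the k rated items most similar
        to target_item -- usable for ranking, not as a rating estimate."""
        neighbors = sorted(user_ratings, key=lambda j: item_sim(target_item, j), reverse=True)[:k]
        return sum(item_sim(target_item, j) * user_ratings[j] for j in neighbors)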
10
Latent Factor Models
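The formulas on this slide were images and are not in the transcript; as a generic reminder (not necessarily the exact variant shown), latent factor models of this family predict with a form such as

    \hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u

where \mu is the global rating average, b_u and b_i are user and item biases, and p_u, q_i are the latent factor vectors whose inner product measures user-item proximity.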
11
Latent Factor Models: PureSVD
The PureSVD score is no longer an estimated rating, but it can still be used for top-N recommendation tasks.
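A minimal sketch of PureSVD scoring, assuming a sparse user-by-item rating matrix with unrated entries left as zero (function and variable names are illustrative):

    from scipy.sparse.linalg import svds

    def puresvd_scorer(R, n_factors=50):
        """R: scipy.sparse user x item matrix, with unrated entries kept as 0.
        Returns a function that scores every item for a given user."""
        # Truncated SVD of the zero-filled rating matrix: R ~= U * diag(s) * Vt
        U, s, Vt = svds(R.asfptype(), k=n_factors)
        Q = Vt.T                                        # item factors, shape (n_items, n_factors)

        def scores(user_id):
            r_u = R.getrow(user_id).toarray().ravel()   # the user's rating row
            return (r_u @ Q) @ Q.T                      # score_ui = r_u . Q . q_i^T for all items i
        return scores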
12
RMSE Ranking
SVD++: 0.8911
AsySVD: 0.9000
CorNgbr: 0.9406
MovieAvg: 1.053
Note that TopPop, NNCosNgbr, and PureSVD are not applicable for measuring error metrics.
13
Result: MovieLens dataset
(Figure: recall/precision plots, with slide callouts "Similar!?" and "Best!?" marking groups of curves.)
14
Result: Netflix dataset
All items:
– TopPop outperforms CorNgbr.
– AsySVD and SVD++ perform slightly better than TopPop (note that these algorithms are possibly better tuned for Netflix data).
– NNCosNgbr works well.
– PureSVD is still the best.
Long tail:
– CorNgbr significantly underperforms on the head, but it performs well on long-tail data (which probably explains why CorNgbr has been widely used).
15
PureSVD??
PureSVD is a poor design in terms of rating estimation, and the authors did not expect this result.
PureSVD is easy to code and has good computational performance both offline and online.
When moving to longer-tail items, accuracy improves as the dimensionality of the PureSVD model is raised (50 -> 150).
– This could mean that the first latent factors capture properties of popular items, while additional factors capture properties of long-tail items.
16
Conclusions
Error metrics have been more popular because of mathematical convenience and formal optimization.
However, it is well recognized that accuracy measures may be more natural for the top-N task.
In summary:
(1) There is no monotonic (trivial) relation between error metrics and accuracy metrics.
(2) Test cases should be carefully selected, as the experimental results show (long tail vs. head) – watch out for the possible pitfalls!
(3) New variants of existing algorithms improve top-N performance.
17
Q&A
Thank you