1
Analysis of massive data sets
Prof. dr. sc. Siniša Srbljić
Doc. dr. sc. Dejan Škvorc
Doc. dr. sc. Ante Đerek
Faculty of Electrical Engineering and Computing
Consumer Computing Laboratory
http://www.fer.hr/predmet/avsp
2
Recommender Systems
Marin Šilić, PhD
3
Outline
□ Motivation
□ Formal Definition
□ Content-based Recommendation
□ Collaborative Filtering
□ Remarks, Practice & Advice
4
Motivation
□ Example
  o A web shop selling music CDs
  o User A
    buys a Metallica CD
    buys an Iron Maiden CD
  o User B
    searches for Metallica
    the recommender suggests Iron Maiden, based on the data collected about User A
5
Motivation
□ Recommender Systems
  [Diagram: users consume items (products, web sites, songs, videos, news, …) and receive recommendations]
6
Motivation
□ From Scarcity to Abundance
  o Shelf space is a limited resource for traditional retailers
    similar constraints: TV advertising, movie theaters, etc.
  o The Web enables almost zero-cost dissemination of information about products
    no issue with limited space
    moreover, product information can be personalized
  o The more choice there is, the better the filters need to be
    recommendation engines
  o How Into Thin Air made Touching the Void a bestseller: http://archive.wired.com/wired/archive/12.10/tail.html
7
Motivation
[Figure: the long-tail distribution of product popularity. Source: Chris Anderson (2004)]
8
Motivation
□ Read & Learn More
9
Motivation
□ Types of Recommendations
  o Editorial and hand-selected
    lists of favorites
    lists of “must see” items
  o Simple aggregates
    Top 10, Most Popular, Most Recent
  o Personalized recommendations
    Amazon, eBay, YouTube, etc.
10
Formal Definition
11
Formal Definition
□ Utility Matrix
  o Then, we define a utility matrix, which maps user-item pairs to ratings

             Avatar   LOTR   Matrix   Pirates
    Alice
    Bob
    Carol
    David
12
Formal Definition
□ Utility Matrix
  o Then, we define a utility matrix, which maps user-item pairs to ratings

             Avatar   LOTR   Matrix   Pirates
    Alice      1.0              0.2
    Bob        0.5                        0.3
    Carol               0.2      1.0
    David                                 0.4
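For concreteness, a minimal Python sketch of such a utility matrix, assuming pandas/numpy and using NaN for the unknown ratings (the placement of the example ratings follows the table above):

```python
import numpy as np
import pandas as pd

# Utility matrix: rows = users, columns = items, NaN = unknown rating.
movies = ["Avatar", "LOTR", "Matrix", "Pirates"]
ratings = pd.DataFrame(
    [[1.0, np.nan, 0.2, np.nan],      # Alice
     [0.5, np.nan, np.nan, 0.3],      # Bob
     [np.nan, 0.2, 1.0, np.nan],      # Carol
     [np.nan, np.nan, np.nan, 0.4]],  # David
    index=["Alice", "Bob", "Carol", "David"],
    columns=movies,
)

print(ratings)
print("known ratings:", int(ratings.notna().values.sum()), "of", ratings.size)
```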
13
Formal Definition
□ Main Challenges
  o Challenge #1: collecting the “known” ratings in the matrix
    how do we collect the initial data?
  o Challenge #2: using the known ratings to estimate/predict the unknown ratings
    focus on the high unknown ratings: the goal is to learn what a particular user likes rather than what they dislike
  o Challenge #3: evaluating prediction methods
    how do we measure the accuracy and performance of recommendation methods?
14
Formal Definition
□ Challenge #1: Collecting Initial Ratings
  o Explicit
    ask people to rate items
    not effective in practice: users don’t like to provide explicit feedback
  o Implicit
    learn ratings from users’ actions
    e.g., a purchase implies a high rating
    how can one implicitly detect low ratings?
15
Formal Definition
16
Content-based Recommender Systems
□ Main Idea
  o Recommend to user A items similar to the items user A has previously rated highly
□ Example
  o Movie recommendations
    recommend movies with the same actor(s), director, genre, etc.
  o Websites, blogs, news
    recommend other sites with similar content
17
Content-based Recommender Systems
□ High-level Overview
  [Diagram: from the items a user likes, build item profiles (features such as red, circles, triangles); combine them into a user profile; match the user profile against item profiles to recommend new items]
18
Content-based Recommender Systems
□ Item Profiles
  o For each item, create an item profile
  o A profile is a set (vector) of features
    movies: author, title, director, genre, …
    text: the set of “important” words in the document
  o How to pick important features?
    the usual heuristic from text mining is TF-IDF (term frequency × inverse document frequency)
    term ↔ feature, document ↔ item
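As a rough sketch of the TF-IDF heuristic (the toy documents, the whitespace tokenizer and the log base are assumptions made for illustration): term frequency is normalized by the most frequent term in the document, and IDF = log(N / n_t).

```python
import math
from collections import Counter

# Toy corpus: each "document" stands for the text of one item.
docs = {
    "doc1": "space shuttle launch space mission",
    "doc2": "movie review space opera movie",
    "doc3": "election debate policy review",
}

def tfidf_profiles(docs):
    n_docs = len(docs)
    tokenized = {name: text.split() for name, text in docs.items()}
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    profiles = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        max_tf = max(tf.values())
        # TF normalized by the document's most frequent term, times IDF = log(N / df).
        profiles[name] = {t: (c / max_tf) * math.log(n_docs / df[t]) for t, c in tf.items()}
    return profiles

for name, profile in tfidf_profiles(docs).items():
    print(name, {t: round(w, 2) for t, w in profile.items()})
```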
19
Content-based Recommender Systems
20
Content-based Recommender Systems
21
Content-based Recommender Systems
□ User Profiles: Example
  o A user has rated 5 movies
    2 starring actor A, with ratings 3 and 5
    3 starring actor B, with ratings 1, 2 and 4
  o Let’s build the user profile
    first, note that ratings are relative: for one user a rating of 4 may be high, for another it may be average
    so first normalize the ratings by subtracting the user’s mean rating, which is 3
    actor A: 0 and 2; actor B: -2, -1 and 1
    the weight of feature A becomes (0 + 2) / 2 = 1
    the weight of feature B becomes (-2 - 1 + 1) / 3 = -2/3
    the user profile is A + (-2/3)·B
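The same arithmetic written out as a short Python sketch; the data layout (a list of rating/feature pairs) is just one possible encoding of this example:

```python
# Each rated movie: (rating, set of actor features in its item profile).
rated_movies = [
    (3, {"A"}), (5, {"A"}),              # two movies starring actor A
    (1, {"B"}), (2, {"B"}), (4, {"B"}),  # three movies starring actor B
]

# Normalize ratings by subtracting the user's mean rating: (3+5+1+2+4)/5 = 3.
mean_rating = sum(r for r, _ in rated_movies) / len(rated_movies)

sums, counts = {}, {}
for rating, features in rated_movies:
    for f in features:
        sums[f] = sums.get(f, 0.0) + (rating - mean_rating)
        counts[f] = counts.get(f, 0) + 1

# User profile: average normalized rating per feature -> {'A': 1.0, 'B': -0.666...}
user_profile = {f: sums[f] / counts[f] for f in sums}
print(user_profile)
```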
22
Content-based Recommender Systems
□ Pros: Content-based RS
  o No need for data on other users
    no cold-start or sparsity issues
  o Can recommend to users with unique tastes
  o Can recommend new and unpopular items
    no first-rater issue
  o Transparency
    can give a transparent explanation of the content features that caused an item to be recommended
23
Content-based Recommender Systems
□ Cons: Content-based RS
  o Finding appropriate features is difficult and requires domain-specific knowledge
    e.g., movies, images, books, etc.
  o Cannot recommend to new users
    how do we build a user profile for a new user?
  o Overspecialization
    recommends only items that match the user’s content profile
    people may have multiple interests
    cannot exploit ratings from other users
24
Collaborative Filtering
[Diagram: from rating data, find users with a similar taste to the target user; items those similar users liked become the recommendations]
25
Collaborative Filtering
26
Collaborative Filtering

           I1   I2   I3   I4   I5   I6   I7
    A       4              5    1
    B       5    5    4
    C                      2    4    5
27
Collaborative Filtering

           I1   I2   I3   I4   I5   I6   I7
    A       4    0    0    5    1    0    0
    B       5    5    4    0    0    0    0
    C       0    0    0    2    4    5    0

    (unknown ratings treated as 0)
28
Collaborative Filtering

           I1     I2     I3     I4     I5     I6     I7
    A      2/3     0      0     5/3   -7/3     0      0
    B      1/3    1/3   -2/3     0      0      0      0
    C       0      0      0    -5/3    1/3    4/3     0

    (each user’s ratings centered by subtracting that user’s mean rating; unknown entries remain 0)
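A small sketch, assuming numpy, that reproduces the mean-centering above and then compares the users with centered-cosine (Pearson) similarity; the printed similarities are what the code computes, not values quoted from the slides:

```python
import numpy as np

# Ratings of users A, B, C over items I1..I7; 0 marks an unknown rating.
R = np.array([
    [4, 0, 0, 5, 1, 0, 0],   # A
    [5, 5, 4, 0, 0, 0, 0],   # B
    [0, 0, 0, 2, 4, 5, 0],   # C
], dtype=float)

def center(row):
    """Subtract the user's mean over rated items; leave unrated entries at 0."""
    rated = row > 0
    out = np.zeros_like(row)
    out[rated] = row[rated] - row[rated].mean()
    return out

Rc = np.array([center(row) for row in R])   # the mean-centered table above

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(cosine(Rc[0], Rc[1]), 2))   # sim(A, B) ≈  0.09  (weak agreement)
print(round(cosine(Rc[0], Rc[2]), 2))   # sim(A, C) ≈ -0.56  (disagreement)
```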
29
Collaborative Filtering
30
Collaborative Filtering
31
Collaborative Filtering
□ Item-Item CF (k = 2)

    users →    1   2   3   4   5   6   7   8   9  10  11  12
    movie 1    1       3           5           5       4
    movie 2            5   4           4           2   1   3
    movie 3    2   4       1   2       3       4   3   5
    movie 4        2   4       5           4           2
    movie 5            4   3   4   2                   2   5
    movie 6    1       3       3           2           4

    Missing ratings are blank; ratings are in [1, 5].
32
Collaborative Filtering
□ Item-Item CF (k = 2)

    users →    1   2   3   4   5   6   7   8   9  10  11  12
    movie 1    1       3       ?   5           5       4
    movie 2            5   4           4           2   1   3
    movie 3    2   4       1   2       3       4   3   5
    movie 4        2   4       5           4           2
    movie 5            4   3   4   2                   2   5
    movie 6    1       3       3           2           4

    Estimate the missing rating r(1,5) of movie 1 by user 5; ratings are in [1, 5].
33
Collaborative Filtering
□ Item-Item CF (k = 2)
  o Same utility matrix as above: to estimate r(1,5), find the movies most similar to movie 1 among those user 5 has rated
34
Collaborative Filtering
□ Item-Item CF (k = 2)
  o Similarity of movie 1 to each movie m:

    m:          1      2      3      4      5      6
    Sim(1, m):  1.00  -0.18   0.41  -0.10  -0.31   0.59
35
Collaborative Filtering
□ Item-Item CF (k = 2)
  o The two most similar movies that user 5 has rated are movie 3 (sim 0.41, rating 2) and movie 6 (sim 0.59, rating 3)
  o Predict the missing value with a weighted average:
    r(1,5) = (0.41 · 2 + 0.59 · 3) / (0.41 + 0.59) = 2.6
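A compact sketch of the whole item-item computation, assuming numpy: centered-cosine similarities between movie rows, then the k = 2 weighted average for user 5; it reproduces the similarities and the 2.6 prediction shown above.

```python
import numpy as np

# Utility matrix from the slides: rows = movies 1..6, columns = users 1..12, 0 = unknown.
R = np.array([
    [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
    [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
    [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
    [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
    [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
    [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],
], dtype=float)

def centered(row):
    """Subtract the movie's mean over rated entries; keep unrated entries at 0."""
    rated = row > 0
    out = np.zeros_like(row)
    out[rated] = row[rated] - row[rated].mean()
    return out

def centered_cosine(a, b):
    ca, cb = centered(a), centered(b)
    return ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb))

movie, user = 0, 4   # movie 1, user 5 (0-based indices)
sims = np.array([centered_cosine(R[movie], R[m]) for m in range(R.shape[0])])
print(np.round(sims, 2))   # [1.0, -0.18, 0.41, -0.1, -0.31, 0.59]

# k = 2 most similar movies among those user 5 has actually rated (movies 6 and 3).
candidates = [m for m in range(R.shape[0]) if m != movie and R[m, user] > 0]
neighbours = sorted(candidates, key=lambda m: sims[m], reverse=True)[:2]

# Weighted average, as on the slide: (0.41*2 + 0.59*3) / (0.41 + 0.59) = 2.6.
prediction = sum(sims[m] * R[m, user] for m in neighbours) / sum(sims[m] for m in neighbours)
print(round(prediction, 1))   # 2.6
```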
36
Collaborative Filtering
37
Collaborative Filtering
□ Item-Item vs. User-User
  o In practice, item-item works better!
    items are simpler; users have very different preferences

             Avatar   LOTR   Matrix   Pirates
    Alice      1                0.2
    Bob        0.5                        0.3
    Carol               0.2      1
    David                                 0.4
38
Collaborative Filtering
39
Collaborative Filtering
□ Pros
  o Works for various types of items
    a universal approach; no domain-specific knowledge needed
□ Cons
  o Cold start
    needs enough users already in the system to make predictions
  o Sparsity
    the utility matrix is usually very sparse
    hard to find users who have rated the same items
  o First rater
    cannot recommend an item that has not been rated
    new items, unpopular items
  o Popularity bias
    black sheep problem (users with unique taste)
    tends to favor popular items
40
Remarks, Practice & Advice
□ Evaluation
  [Utility matrix of known ratings (rows = movies, columns = users)]
41
Remarks, Practice & Advice
□ Evaluation
  [The same utility matrix with a subset of the known ratings withheld as the testing data set]
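A tiny sketch of this protocol, with invented numbers: hide a few known ratings, predict them with whatever recommender is being evaluated, and score the result with RMSE (root-mean-square error, discussed on the following slides):

```python
import numpy as np

# Held-out test ratings and the recommender's estimates for them (invented numbers).
actual    = np.array([4.0, 2.0, 5.0, 3.0, 1.0])
predicted = np.array([3.5, 2.5, 4.0, 3.0, 2.0])

# Root-mean-square error over the withheld test set; lower is better.
rmse = np.sqrt(np.mean((predicted - actual) ** 2))
print(round(rmse, 3))   # 0.707
```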
42
Remarks, Practice & Advice
43
Remarks, Practice & Advice
□ Evaluating Predictions
  o 0/1 model
    coverage: the number of items/users for which the system can make predictions
    precision: the accuracy of the predictions
    receiver operating characteristic (ROC): the tradeoff curve between true positives and false positives
44
Remarks, Practice & Advice
□ Issues with Error Measures
  o Focusing only on accuracy may miss the point
    prediction diversity
    prediction context
    order of predictions
  o In recommenders, good performance on high ratings is what matters
    RMSE weights all errors equally, so a method that does well on high ratings but badly on the rest may be penalized
45
Remarks, Practice & Advice
46
Remarks, Practice & Advice
□ Tip: Add More Data
  o Use all of the available data
    no point in reducing the data just to make fancy algorithms work
    simple methods on large data sets do best
  o Add even more data
    e.g., for movies, add IMDB data about genres
  o More data beats fancy algorithms: http://anand.typepad.com/datawocky/2008/03/more-data-usual.html