Presentation is loading. Please wait.

Presentation is loading. Please wait.

Analysis of massive data sets Prof. dr. sc. Siniša Srbljić Doc. dr. sc. Dejan Škvorc Doc. dr. sc. Ante Đerek Faculty of Electrical Engineering and Computing.

Similar presentations


Presentation on theme: "Analysis of massive data sets Prof. dr. sc. Siniša Srbljić Doc. dr. sc. Dejan Škvorc Doc. dr. sc. Ante Đerek Faculty of Electrical Engineering and Computing."— Presentation transcript:

1 Analysis of massive data sets Prof. dr. sc. Siniša Srbljić Doc. dr. sc. Dejan Škvorc Doc. dr. sc. Ante Đerek Faculty of Electrical Engineering and Computing Consumer Computing Laboratory http://www.fer.hr/predmet/avsp

2 Recommender Systems Marin Šilić, PhD

3 Outline □Motivation □Formal Definition □Content-based Recommendation □Collaborative Filtering □Remarks, Practice & Advices Copyright © 2016 CCL: Analysis of massive data sets3 od 46

4 Motivation □Example o Music CDs web shop o User A  Buys Metallica CD  Buys Iron Maiden CD o User B  Does search on Metallica  Recommender suggests Iron Maiden from the data collected about User A Copyright © 2016 CCL: Analysis of massive data sets4 od 46

5 □Recommender Systems Motivation Items … ConsumeRecommendations Products, Web sites, Songs, Videos, News… Examples Copyright © 2016 CCL: Analysis of massive data sets5 od 46

6 Motivation □From Scarcity to Abundance o Shelf space is a limited resource to traditional retailers  Similar: TV advertisement, movie theaters etc. o Web enables almost zero-cost dissemination of information about products  No issue with limited space  Moreover, product information can be personalized o The more choice there is, better filter are required  Recommendation engines  How Into Thin Air made Touching the Void a bestseller http://archive.wired.com/wired/archive/12.10/tail.html Copyright © 2016 CCL: Analysis of massive data sets6 od 46

7 Motivation Source: Chris Anderson (2004) Copyright © 2016 CCL: Analysis of massive data sets7 od 46

8 Motivation Read & Learn More Copyright © 2016 CCL: Analysis of massive data sets8 od 46

9 Motivation □Types of Recommendations o Editorial and hand selected  List of favorites  List of “must see” items o Simple aggregates  Top 10  Most Popular  Most Recent o Personalized recommendations  Amazon, EBay, YouTube etc… Copyright © 2016 CCL: Analysis of massive data sets9 od 46

10 Formal Definition Copyright © 2016 CCL: Analysis of massive data sets10 od 46

11 Formal Definition □Utility Matrix o Then, we define a utility matrix which maps users and ratings AvatarLOTRMatrixPirates Alice Bob Carol David Copyright © 2016 CCL: Analysis of massive data sets11 od 46

12 Formal Definition □Utility Matrix o Then, we define a utility matrix which maps users and ratings AvatarLOTRMatrixPirates Alice1.00.2 Bob0.50.3 Carol0.21.0 David0.4 Copyright © 2016 CCL: Analysis of massive data sets12 od 46

13 Formal Definition □Main Challenges o Challenge #1: Collecting “known” ratings in matrix  How to collect the initial data? o Challenge #2: Utilize known ratings in order to estimate/predict the unknown ratings  Focus on high unknown ratings  The idea is to rather extract the information what a particular user likes than dislikes o Challenge #3: Evaluate prediction methods  How to measure accuracy & performance of recommendation methods Copyright © 2016 CCL: Analysis of massive data sets13 od 46

14 Formal Definition □Challenge #1: Collecting Initial Ratings o Explicit  Ask people to rate items  Not effective in practice  Users don’t like to provide feedback explicitly o Implicit  Learn ratings from users’ actions E.g. purchase implies high rating  How can one implicitly detect low ratings? Copyright © 2016 CCL: Analysis of massive data sets14 od 46

15 Formal Definition Copyright © 2016 CCL: Analysis of massive data sets15 od 46

16 Content-based Recommender Systems □Main idea o Recommend user A items that are similar to previous items highly rated by user A □Example o Movie recommendations  Recommends movies with same actor(s) director, genre, etc. o Websites, blogs, news  Recommend other sites with similar content Copyright © 2016 CCL: Analysis of massive data sets16 od 46

17 □High-level Overiview Content-based Recommender Systems Item profiles Red Circles Triangles User profile likes build match recommend Copyright © 2016 CCL: Analysis of massive data sets17 od 46

18 Content-based Recommender Systems □Item Profiles o For each item, create an item profile o Profile is a set (vector) of features  Movies: author, title, director, genre…  Text: Set of “important” words in document o How to pick important features?  Usual heuristic from text mining is TF-IDF (Term frequency * Inverse Document Frequency)  Term … Feature  Document … Item Copyright © 2016 CCL: Analysis of massive data sets18 od 46

19 Content-based Recommender Systems Copyright © 2016 CCL: Analysis of massive data sets19 od 46

20 Content-based Recommender Systems Copyright © 2016 CCL: Analysis of massive data sets20 od 46

21 Content-based Recommender Systems □User Profiles Example o User has rated 5 movies  2 starring actor A, ratings 3 and 5  3 starring actor B, ratings 1, 2 and 4 o Let’s build user profile  First, notice that ratings are relative, For one user rating 4 might be a high rating For some other user rating 4 might be an average rating  So, first normalize ratings by subtracting the mean which is 3 Actor A, 0 and 2, Actor B, -2, -1 and 1  Weight of feature A becomes (2 + 0) / 2 = 1  Weight of feature B becomes (-2 -1 + 1) / 3 = -2 / 3  User profile is A + (-2/3) * B Copyright © 2016 CCL: Analysis of massive data sets21 od 46

22 Content-based Recommender Systems □Pros: Content-based RS o No need for data on other users  No cold-start and sparsity issues o It can recommend to users with unique tastes o It can recommend new & unpopular items  No first-rater issue o Transparency  It can provide transparent explanation about content features that caused an item to be recommended Copyright © 2016 CCL: Analysis of massive data sets22 od 46

23 Content-based Recommender Systems □Cons: Content-based RS o Finding appropriate features is difficult and requires domain-specific knowledge  E.g., movies, images, books etc. o It can not recommend for new users  How to build user profile for a new user? o Overspecialization  Recommends only items that match user’s content profile  People might have multiple interests  It can not utilize ratings from other users Copyright © 2016 CCL: Analysis of massive data sets23 od 46

24 Collaborative Filtering Data taste search similar recommended items like recommendation Copyright © 2016 CCL: Analysis of massive data sets24 od 46

25 Collaborative Filtering Copyright © 2016 CCL: Analysis of massive data sets25 od 46

26 Collaborative Filtering Copyright © 2016 CCL: Analysis of massive data sets I1I2I3I4I5I6I7 A451 B554 C245 26 od 46

27 Collaborative Filtering Copyright © 2016 CCL: Analysis of massive data sets I1I2I3I4I5I6I7 A4005100 B5540000 C0002450 27 od 46

28 I1I2I3I4I5I6I7 A2/3005/3-7/300 B1/3 -2/30000 C000-5/31/34/30 Collaborative Filtering Copyright © 2016 CCL: Analysis of massive data sets28 od 46

29 Collaborative Filtering Copyright © 2016 CCL: Analysis of massive data sets29 od 46

30 Collaborative Filtering Copyright © 2016 CCL: Analysis of massive data sets30 od 46

31 Collaborative Filtering □Item-Item CF (k = 2) 121110987654321 455311 3124452 534321423 245424 5224345 423316 Missing ratings Ratings [1, 5] users movies Copyright © 2016 CCL: Analysis of massive data sets31 od 46

32 Collaborative Filtering □Item-Item CF (k = 2) 121110987654321 455 ? 311 3124452 534321423 245424 5224345 423316 Estimate Rating [1, 5] users movies Copyright © 2016 CCL: Analysis of massive data sets32 od 46

33 Collaborative Filtering □Item-Item CF (k = 2) 121110987654321 455 ? 311 3124452 534321423 245424 5224345 423316 users movies Copyright © 2016 CCL: Analysis of massive data sets33 od 46

34 Collaborative Filtering □Item-Item CF (k = 2) 121110987654321 455 ? 311 3124452 534321423 245424 5224345 423316 users movies Sim(1, m) 1.00 -0.18 0.41 -0.10 -0.31 0.59 Copyright © 2016 CCL: Analysis of massive data sets34 od 46

35 Collaborative Filtering □Item-Item CF (k = 2) 121110987654321 455 ? 311 3124452 534321423 245424 5224345 423316 users movies Sim(1, m) 1.00 -0.18 0.41 -0.10 -0.31 0.59 Predict missing value by using weighted average: r1,5 = (0.41 * 2 + 0.59 * 3) / (0.41 + 0.59) = 2.6 Copyright © 2016 CCL: Analysis of massive data sets35 od 46

36 Collaborative Filtering Copyright © 2016 CCL: Analysis of massive data sets36 od 46

37 Collaborative Filtering □Item-Item vs. User-User  In practice: Item-Item works better!  Items are simpler, users have very different preferences AvatarLOTRMatrixPirates Alice10.2 Bob0.50.3 Carol0.21 David0.4 Copyright © 2016 CCL: Analysis of massive data sets37 od 46

38 Collaborative Filtering Copyright © 2016 CCL: Analysis of massive data sets38 od 46

39 Collaborative Filtering □Pros o Works for various types of items  Universal approach, domain-specific knowledge not needed □Cons o Cold Start  It needs users already in the system to provide predictions o Sparsity  Utility matrix is usually very sparse  Hard to find users that have rated same items o First Rater  It cannot recommend unrated item  New items, unpopular items o Popularity Bias  Black sheep problem (users with unique taste)  Favors popular items Copyright © 2016 CCL: Analysis of massive data sets39 od 46

40 Remarks, Practice & Advices □Evaluation 1233 33 21 3 2452 424 41 users movies Copyright © 2016 CCL: Analysis of massive data sets40 od 46

41 Remarks, Practice & Advices □Evaluation 1233 33 21 3 24?? ??? 4? users movies Testing Data Set Copyright © 2016 CCL: Analysis of massive data sets41 od 46

42 Remarks, Practice & Advices Copyright © 2016 CCL: Analysis of massive data sets42 od 46

43 Remarks, Practice & Advices □Evaluating Predictions o 0/1 model  Coverage Number of items/users for which the system can make predictions  Precision Accuracy of predictions  Receiver Operating Characteristic (ROC) Tradeoff curve between true positives and false positives Copyright © 2016 CCL: Analysis of massive data sets43 od 46

44 Remarks, Practice & Advices □Issues with Error Measures o Focusing on accuracy may miss the point  Prediction Diversity  Prediction Context  Order of predictions o In recommenders good performance for high ratings is important  RMSE emphases errors, method that performs well for high ratings and bad for other might get penalized Copyright © 2016 CCL: Analysis of massive data sets44 od 46

45 Remarks, Practice & Advices Copyright © 2016 CCL: Analysis of massive data sets45 od 46

46 Remarks, Practice & Advices □Tip: Add more data o All the data should be utilized  No use in data reduction to make fancy algorithms work  Simple methods on large data sets do best o Add even more data  E.g. for movies add IMDB data about genres o More data outperforms fancy algorithms  http://anand.typepad.com/datawocky/2008/03/more-data- usual.html http://anand.typepad.com/datawocky/2008/03/more-data- usual.html Copyright © 2016 CCL: Analysis of massive data sets46 od 46


Download ppt "Analysis of massive data sets Prof. dr. sc. Siniša Srbljić Doc. dr. sc. Dejan Škvorc Doc. dr. sc. Ante Đerek Faculty of Electrical Engineering and Computing."

Similar presentations


Ads by Google