A more efficient Collaborative Filtering method Tam Ming Wai Dr. Nikos Mamoulis.


1 A more efficient Collaborative Filtering method Tam Ming Wai Dr. Nikos Mamoulis

2 Outline
- Introduction to Collaborative Filtering
- Special nature of CF
- Inverted File Search Algorithm
- Item-based
- Slope-one
- Hybrid method
- No random access
- Experiment

3 Collaborative Filtering
- Look for opinions from friends with similar taste
- The active user collaborates with other users
- Trust those whose taste is more similar more

4 Example

       i_1  i_2  i_3  i_4  i_5  i_a
  u_a   1    2    3    4    5    ?
  u_1   1    2    3    4    5    5
  u_2   5    4    3    2    1    1

u_a trusts u_1 more than u_2

5 Special nature of CF
Trust your intuition in the following few slides

6 Searching for similar users
Which user is the best one to trust in order to predict "?"?
- Everyone
- Only i_2 is relevant

       i_1  i_2  i_3  i_4  i_a
  u_a   -    2    -    -    ?
  u_1   -    2    -    -    3
  u_2   1    2    -    -    1
  u_3   -    2    2    -    4
  u_4   2    2    -    3    2
  u_5   1    2    2    1    4

7 Similarity
- The similarity is not based on all attributes (the items)
- Only the items which the active user rated are relevant
- Although some (Breese et al.) have suggested that more items could be considered (through default voting), this is not popular

8 Searching for similar users
Which user is the best one to trust in order to predict "?"?
- Everyone except u_5

       i_1  i_2  i_3  i_4  i_a
  u_a   1    2    3    5    ?
  u_1   1    -    -    -    3
  u_2   -    2    -    -    1
  u_3   -    -    3    -    4
  u_4   -    -    -    5    2
  u_5   -    -    -    -    4

9 Similarity
- The similarity is not based on all attributes (the items)
- Only the items which both the active user and the user under consideration rated are relevant

10 A Notice
- u_a is similar to u_1, u_2, u_3 and u_4
- BUT u_1, u_2, u_3 and u_4 are not relevant to each other at all

11 Searching for similar users
Which user is the best one to trust in order to predict "?"?
- u_3 is the one
- Only u_3 is relevant

       i_1  i_2  i_3  i_4  i_a
  u_a   1    2    3    4    ?
  u_1   1    2    3    5    -
  u_2   2    3    1    4    -
  u_3   4    3    2    1    4
  u_4   2    1    1    3    -
  u_5   1    4    2    1    -

12 Top-k most similar users
- It is not the top-k among all users
- It is the top-k among the users who rated i_a

13 Summary on the nature
- The matrix is incomplete
- Similarity
  - The set of items could be different for every pair of users (the intersection)
  - The set of users (the candidates) could be different for each query (those who rated i_a)
  - No triangle inequality (in the extreme, u_a is similar to u_1 and u_2, but u_1 and u_2 can be irrelevant to each other)

14 Popular Similarity measure
Very often, Pearson correlation is used. Its terms are:
- j iterates over the items that are rated by both user i and user a
- the vote (rating) on item j by user a
- the average vote (rating) of user a
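The formula this slide annotates is not part of the transcript. The labels match the Pearson correlation weight in the form given by Breese et al., so the following is assumed to be the intended expression. The symbols v_{a,j} (vote of user a on item j) and \bar{v}_a (average vote of user a) are notation chosen here, and the sums run over the items j rated by both users:

```latex
w(a,i) =
  \frac{\sum_{j} (v_{a,j} - \bar{v}_a)\,(v_{i,j} - \bar{v}_i)}
       {\sqrt{\sum_{j} (v_{a,j} - \bar{v}_a)^2 \;\sum_{j} (v_{i,j} - \bar{v}_i)^2}}
```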

15 Output - Prediction
C is the set of users who rated the queried item
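The prediction formula itself is missing from the transcript. A minimal sketch, assuming the usual form from Breese et al.: the active user's mean plus a normalised, weighted sum of each advisor's deviation from their own mean. The function name `predict`, the tuple layout, and the choice of normaliser (sum of absolute weights) are assumptions made for illustration.

```python
def predict(active_mean, advisors):
    """Predict the active user's rating on the queried item.

    advisors: list of (weight, rating_on_item, advisor_mean) tuples, one per
    user in C (the users who rated the queried item).
    Assumed form: mean_a + sum(w * (r - mean_i)) / sum(|w|).
    """
    norm = sum(abs(w) for w, _, _ in advisors)
    if norm == 0:
        # No usable advisor: fall back to the active user's own mean.
        return active_mean
    return active_mean + sum(w * (r - m) for w, r, m in advisors) / norm
```

With a single advisor of weight 1 who rated the item one point above their own mean, the prediction is one point above the active user's mean.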

16 Brute Force Searching
- Given an active user and an active movie:
  - Relevant movies are known from the active user profile
  - Candidates are known from the active movie profile
- Find sim(u_a, u_i) for all u_i in the candidate set
- The top-k are used as advisors
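The brute-force search can be sketched as follows, using the ratings from the slide 4 example. This is an illustration only: profiles are assumed to be plain dicts from item to rating, each user's mean is assumed to be taken over all of that user's ratings, and the names `pearson` and `top_k_advisors` are hypothetical.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation over the items rated by both profiles."""
    common = [j for j in a if j in b]
    if not common:
        return 0.0
    mean_a = sum(a.values()) / len(a)
    mean_b = sum(b.values()) / len(b)
    sai = sum((a[j] - mean_a) * (b[j] - mean_b) for j in common)
    saa = sum((a[j] - mean_a) ** 2 for j in common)
    sii = sum((b[j] - mean_b) ** 2 for j in common)
    denom = sqrt(saa * sii)
    return sai / denom if denom else 0.0

def top_k_advisors(active, profiles, item, k):
    """Score every candidate (a user who rated `item`) and keep the best k."""
    candidates = [u for u, p in profiles.items() if item in p]
    scored = sorted(((pearson(active, profiles[u]), u) for u in candidates),
                    reverse=True)
    return [(u, s) for s, u in scored[:k]]

# Ratings from the slide 4 example.
u_a = {"i1": 1, "i2": 2, "i3": 3, "i4": 4, "i5": 5}
profiles = {
    "u1": {"i1": 1, "i2": 2, "i3": 3, "i4": 4, "i5": 5, "ia": 5},
    "u2": {"i1": 5, "i2": 4, "i3": 3, "i4": 2, "i5": 1, "ia": 1},
}
advisors = top_k_advisors(u_a, profiles, "ia", k=1)  # u1 ranks first
```

Every candidate is scored in full, which is why the cost grows with the number of users who rated the active movie.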

17 Useful Information
What is the useful information?

       i_1  i_2  i_3  i_4  i_a
  u_a   1    2    -    4    ?
  u_1   -    2    3    -    4
  u_2   2    3    1    -    -
  u_3   4    -    2    1    4
  u_4   2    1    -    3    3
  u_5   1    -    2    1    -

18 Useful Information
What is the useful information? (same table as slide 17)

19 Useful Information
What is the useful information? The green entries are useful (same table as slide 17)

20 Useful Information
Either all user profiles or all movie profiles contain the useful information (same table as slide 17)

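21 Inverted file (Coster & Svensson 2002)

User-item matrix:

          Item  1  2  3  4  5  6
  User 1        -  1  -  3  4  5
  User 2        1  3  4  5  -  5
  User 3        -  3  -  4  1  -

Inverted file, one list of (user, rating) entries per item:

  Item 1: (2,1)
  Item 2: (1,1) (2,3) (3,3)
  Item 3: (2,4)
  Item 4: (1,3) (2,5) (3,4)
  Item 5: (1,4) (3,1)
  Item 6: (1,5) (2,5)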
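As an illustration, the inverted file can be built from the user profiles in one pass. This is a sketch with the slide's three users; the function name `build_inverted_file` and the in-memory dict layout are assumptions, not the on-disk structure used by Coster & Svensson.

```python
from collections import defaultdict

def build_inverted_file(user_profiles):
    """Turn {user: {item: rating}} into {item: [(user, rating), ...]}."""
    inverted = defaultdict(list)
    for user in sorted(user_profiles):
        for item, rating in user_profiles[user].items():
            inverted[item].append((user, rating))
    return dict(inverted)

# The three user profiles shown on the slide.
profiles = {
    1: {2: 1, 4: 3, 5: 4, 6: 5},
    2: {1: 1, 2: 3, 3: 4, 4: 5, 6: 5},
    3: {2: 3, 4: 4, 5: 1},
}
inverted = build_inverted_file(profiles)
```

Looking up `inverted[2]` then yields every user who rated item 2 together with the rating, without touching the other profiles.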

22 Pearson Correlation
- The active user is fixed within a single query
- For each user i, there are 3 summations: SAA[i], SAI[i] and SII[i]
- Instead of calculating w(a,i) one user at a time, accumulate SAI[i], SAA[i] and SII[i] for all users at once (with the help of the inverted lists)
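A minimal sketch of this accumulation, assuming SAI, SAA and SII are the cross term and the two squared-deviation terms of the Pearson correlation, and that each user's mean rating is already known. The function name and argument layout are hypothetical.

```python
from collections import defaultdict
from math import sqrt

def pearson_via_inverted_lists(active, active_mean, inverted, user_means):
    """Accumulate the three Pearson sums for all users in one pass.

    active:     {item: rating} of the active user
    inverted:   {item: [(user, rating), ...]}
    user_means: {user: mean rating}
    """
    sai = defaultdict(float)  # sum of (v_aj - mean_a) * (v_ij - mean_i)
    saa = defaultdict(float)  # sum of (v_aj - mean_a)^2 over co-rated items
    sii = defaultdict(float)  # sum of (v_ij - mean_i)^2 over co-rated items
    for item, a_rating in active.items():
        da = a_rating - active_mean
        for user, rating in inverted.get(item, []):
            di = rating - user_means[user]
            sai[user] += da * di
            saa[user] += da * da
            sii[user] += di * di
    weights = {}
    for user in sai:
        denom = sqrt(saa[user] * sii[user])
        weights[user] = sai[user] / denom if denom else 0.0
    return weights
```

SAA is indexed by i even though it only involves the active user's ratings, because the sum covers only the items that user i has also rated. Only the inverted lists of the items the active user rated are read.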

23 Early Termination
- Self-Indexing Inverted Files for Fast Text Retrieval, Alistair Moffat and Justin Zobel, 1994
- Quit: stop when the number of users reaches a threshold
- Continue: stop considering new users when the number of users reaches a threshold
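The two strategies can be sketched as below. This is a simplified illustration rather than Moffat and Zobel's implementation: each inverted list is assumed to be a list of (user, contribution) pairs, the threshold is checked at list granularity for "quit", and the names are hypothetical.

```python
def accumulate(lists, threshold, strategy):
    """Sum per-user contributions with a cap on the number of users tracked.

    strategy "quit":     stop reading lists once `threshold` users are tracked.
    strategy "continue": keep reading, but only update users already tracked.
    """
    if strategy not in ("quit", "continue"):
        raise ValueError("strategy must be 'quit' or 'continue'")
    acc = {}
    for postings in lists:
        for user, value in postings:
            if user in acc:
                acc[user] += value
            elif len(acc) < threshold:
                acc[user] = value
        if strategy == "quit" and len(acc) >= threshold:
            break
    return acc
```

"Quit" saves the most work but leaves the tracked users with partial sums, while "continue" still reads every list and gives the tracked users complete sums.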

24 Item-based
- The matrix is symmetric
- Exchange the roles of row (user profile) and column (movie profile)
- Look for movies which are similar to the active movie
- If users act similarly on both movies, the active user may act similarly too

25 Item-based example
- The users act exactly the same on i_2 and i_a
- Perhaps i_2 and i_a are very similar
- "?" may be 1, as u_a gave i_2 a rating of 1

       i_1  i_2  i_3  i_4  i_a
  u_a   1    1    3    4    ?
  u_1   1    1    3    5    1
  u_2   2    2    1    4    2
  u_3   4    4    2    1    4
  u_4   2    4    1    3    4
  u_5   1    5    2    1    5
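The example can be checked by comparing item columns instead of user rows. The sketch below uses Pearson correlation over the users who rated both items. That measure is an assumption made here for illustration (item-based methods also use cosine variants), and `item_similarity` is a hypothetical name.

```python
from math import sqrt

def item_similarity(col_x, col_y):
    """Pearson correlation between two item columns ({user: rating})."""
    users = [u for u in col_x if u in col_y]
    if len(users) < 2:
        return 0.0
    mean_x = sum(col_x[u] for u in users) / len(users)
    mean_y = sum(col_y[u] for u in users) / len(users)
    num = sum((col_x[u] - mean_x) * (col_y[u] - mean_y) for u in users)
    den = sqrt(sum((col_x[u] - mean_x) ** 2 for u in users)
               * sum((col_y[u] - mean_y) ** 2 for u in users))
    return num / den if den else 0.0

# Columns i_2, i_3 and i_a from the slide (the active user is left out).
i2 = {"u1": 1, "u2": 2, "u3": 4, "u4": 4, "u5": 5}
i3 = {"u1": 3, "u2": 1, "u3": 2, "u4": 1, "u5": 2}
ia = {"u1": 1, "u2": 2, "u3": 4, "u4": 4, "u5": 5}
```

Because the i_2 and i_a columns are identical, their similarity is the maximum, which is what motivates predicting u_a's rating on i_a from u_a's rating on i_2.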

26 Item-based in practice
- Sarwar et al. 2001: precompute the top-k similar items
- Amazon.com: personalized promotion based on the top-k similar items

27 Slope-one
- Not only find similar items
- Measure the pattern between items
- Lemire & Maclachlan 2005

28 Slope-one
- For an item pair j and i
- For all users who rated both items
- Find the average difference in rating

29 Slope-one
A prediction is made based on dev_{j,i}

30 Slope-one example
- All users rated i_a higher than i_3 by 1
- Considering i_a and i_3, u_a may rate "?" as 4

       i_1  i_2  i_3  i_4  i_a
  u_a   1    4    3    4    ?
  u_1   1    2    1    5    2
  u_2   2    2    1    4    2
  u_3   4    4    3    1    4
  u_4   2    4    3    3    4
  u_5   1    5    4    1    5
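A short sketch of slope-one on this table, assuming the basic unweighted scheme from Lemire & Maclachlan: dev_{j,i} is the average of (rating on j minus rating on i) over users who rated both, and the prediction averages dev_{j,i} plus the active user's rating on i over the items i the active user rated. The function names are hypothetical.

```python
def deviation(ratings, j, i):
    """Average of (rating on j) - (rating on i) over users who rated both."""
    diffs = [r[j] - r[i] for r in ratings.values() if j in r and i in r]
    return sum(diffs) / len(diffs) if diffs else None

def slope_one_predict(ratings, active, j):
    """Average of dev(j, i) + active[i] over the items i rated by the active user."""
    estimates = []
    for i, rating in active.items():
        dev = deviation(ratings, j, i)
        if dev is not None:
            estimates.append(dev + rating)
    return sum(estimates) / len(estimates) if estimates else None

# Ratings from the slide.
ratings = {
    "u1": {"i1": 1, "i2": 2, "i3": 1, "i4": 5, "ia": 2},
    "u2": {"i1": 2, "i2": 2, "i3": 1, "i4": 4, "ia": 2},
    "u3": {"i1": 4, "i2": 4, "i3": 3, "i4": 1, "ia": 4},
    "u4": {"i1": 2, "i2": 4, "i3": 3, "i4": 3, "ia": 4},
    "u5": {"i1": 1, "i2": 5, "i3": 4, "i4": 1, "ia": 5},
}
u_a = {"i1": 1, "i2": 4, "i3": 3, "i4": 4}

dev_a3 = deviation(ratings, "ia", "i3")         # 1.0, so i_3 alone suggests 3 + 1 = 4
prediction = slope_one_predict(ratings, u_a, "ia")
```

Using i_3 alone reproduces the slide's estimate of 4, whereas averaging over all four rated items gives a slightly lower value because the other pairs show smaller deviations.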

31 Summary
- A common argument: there are fewer items than users
- Pre-computation
  - Similarity in item-based
  - dev_{j,i} in slope-one

32 Hybrid method
- Finding the top-k similar users
- Brute force: inefficient when the number of candidates is large
- Inverted file: inefficient when the number of relevant items is large
- Mixing the two

33 Hybrid method
- Inverted file again
- The files are segmented according to ratings

34 Segmented inverted file example (I_1)
- All users here gave I_1 rating 5
- All users here gave I_1 rating 4
- All users here gave I_1 rating 3
- All users here gave I_1 rating 2
- All users here gave I_1 rating 1

35 Accessing Segmented inverted file
Access first the segments which are closer to the active user's rating

36 Access example (I_1)
- Access order 1, d=0 (u_a here)
- Access order 2, d=1
- Access order 3, d=1
- Access order 4, d=2
- Access order 5, d=3
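The access order can be sketched by sorting the rating segments on their distance to the active user's rating. The distance sequence 0, 1, 1, 2, 3 on this slide is what a 1-to-5 scale gives when u_a's rating is 4 (or 2). Breaking ties in favour of the higher rating is an assumption, and `segment_access_order` is a hypothetical name.

```python
def segment_access_order(active_rating, scale=(1, 2, 3, 4, 5)):
    """Return (segment rating, d) pairs, closest segments first.

    d is the distance between the segment's rating and the active user's
    rating. Ties are broken by reading the higher rating first (an assumption).
    """
    order = sorted(scale, key=lambda r: (abs(r - active_rating), -r))
    return [(r, abs(r - active_rating)) for r in order]

order_for_4 = segment_access_order(4)
```

Because d never decreases along this order, the distance of the next unread segment is a lower bound on the distance any still-unseen user can have on this item.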

37 Accessing Segmented inverted file
- The inverted file is a list ranked on d (distance to u_a's rating)
- The best bound on similarity can be found

38 Algorithm phase 1
- Access all inverted lists, so that all d=0 segments are loaded
- Starting from the most frequently seen candidates, find the actual similarity (k candidates are needed in total)
- The similarity of the k-th candidate whose actual similarity is known will be the initial filter
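The phase 1 steps above can be sketched as follows. The d=0 segments are assumed to be given as lists of user ids, `actual_similarity` is a hypothetical callable standing in for the full similarity computation, and `phase1_filter` is a name chosen here.

```python
from collections import Counter

def phase1_filter(d0_segments, actual_similarity, k):
    """Pick the k most frequently seen candidates and derive the initial filter.

    Returns (scored, threshold): `scored` is a list of (similarity, user)
    sorted best first, and `threshold` is the k-th similarity, or None if
    fewer than k candidates were seen.
    """
    seen = Counter(user for segment in d0_segments for user in segment)
    chosen = [user for user, _ in seen.most_common(k)]
    scored = sorted(((actual_similarity(u), u) for u in chosen), reverse=True)
    threshold = scored[-1][0] if len(scored) == k else None
    return scored, threshold
```

Users that appear in many d=0 segments agree exactly with u_a on many items, so they are likely to give a high initial filter, which in turn lets phase 2 prune more candidates.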

39-41 [Figures: inverted lists I_1, I_2, I_3, I_4, I_5]

42 Algorithm phase 1 example

  candidate  actual similarity
  u3         0.89
  u8         0.88
  ...        ...
  u1         0.77
  u9         0.70   <- filter (k-th candidate)

43 Algorithm phase 2: keep loading from the inverted lists
- The best bound of the similarity decreases
- Similarity bound is worse than the filter => pruned
- The partial information becomes more complete
- Update the filter after some number of segments are loaded
- Stop when the number of remaining candidates is small

44 Algorithm phase 2
- In the implementation, the items that u_a rated at the extremes (close to 1 or 5) are loaded first
- The candidates' best bounds then drop faster

45-50 [Figures: inverted lists I_1, I_2, I_3, I_4, I_5]

51 Similarity measure
- Additive
- L1
- Segmental Manhattan Distance (SMD) = Manhattan distance / number of relevant items
- Sim = 1 - SMD / (maximum distance)

52 Horting
- To ensure the intersection of items is large enough
- Aggarwal et al.

53 Horting

       i_1  i_2  i_3  i_4  i_5  i_a
  u_a   1    2    3    4    5    ?
  u_1   1    2    3    4    5    5
  u_2   1    -    -    -    -    1

- Sim(u_a, u_1) = Sim(u_a, u_2)
- u_2 is less reliable
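A small sketch of the similarity measure on this example, assuming ratings on a 1-to-5 scale so that the maximum distance per item is 4. The function names `smd_similarity` and `overlap` are hypothetical.

```python
def overlap(a, b):
    """Number of items rated by both profiles."""
    return sum(1 for j in a if j in b)

def smd_similarity(a, b, max_distance=4):
    """1 - (Manhattan distance / number of co-rated items) / max_distance."""
    common = [j for j in a if j in b]
    if not common:
        return None
    smd = sum(abs(a[j] - b[j]) for j in common) / len(common)
    return 1 - smd / max_distance

# Profiles from the slide.
u_a = {"i1": 1, "i2": 2, "i3": 3, "i4": 4, "i5": 5}
u_1 = {"i1": 1, "i2": 2, "i3": 3, "i4": 4, "i5": 5, "ia": 5}
u_2 = {"i1": 1, "ia": 1}
```

Both users reach the maximum similarity with u_a, but u_1 does so over five common items and u_2 over one. Requiring a minimum overlap, such as the h = 10 setting in the experiment, filters out matches like u_2.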

54 Best bound
We have 'user num of appearance'
'max num of more appearance' = min(u_a_profile.len, u_i_profile.len) - 'user num of appearance'

  if the user has never been seen in any segment
      best distance = 1
  else if (partial distance > 1)
      the user appears in unseen items, and d = 1
  else if ('max num of more appearance' < horting_factor)
      the user appears only enough times
  else
      the user does not appear anymore; the partial distance is the best

55 No random access
- The inverted file is a list ranked on d (distance to u_a's rating)
- Nikos Mamoulis, Kit Hung Cheng, Man Lung Yiu, and David W. Cheung 2006
- Phase 1: do not find any actual similarity until the best bound of an unseen user is worse than the k-th best worst bound

56-60 [Figures: inverted lists I_1, I_2, I_3, I_4, I_5]

61 Worst Bound
While a user's partial distance is smaller than the maximum possible distance: include the distance

62 No random access
- Phase 2: find the actual similarity and prune candidates

63 Experiment
- Netflix dataset
- 480189 users
- 17770 movies
- 100 million ratings (1.17%)
- k = 50
- h = 10

64 Efficiency
- Brute force: 185.24s per query
- Hybrid: 25.85s per query
- NRA: 59.34s per query

65 Disk IO statistics (hybrid)
- % of actual similarity: 7.60%
- % of entries loaded from the inverted file: 68.52%
- % of entries which are loaded and relevant: 49.77%

66 Reference
- Breese et al.: Empirical Analysis of Predictive Algorithms for Collaborative Filtering
- Coster & Svensson 2002: Inverted File Search Algorithms for Collaborative Filtering
- Lemire & Maclachlan 2005: Slope One Predictors for Online Rating-Based Collaborative Filtering
- Sarwar et al. 2001: Item-Based Collaborative Filtering Recommendation Algorithms
- Aggarwal et al.: Horting Hatches an Egg: A New Graph-Theoretic Approach to Collaborative Filtering
- Nikos Mamoulis, Kit Hung Cheng, Man Lung Yiu, and David W. Cheung 2006: Efficient Aggregation of Ranked Inputs

