Published by Albert Bell. Modified over 9 years ago.
1
A more efficient Collaborative Filtering method
Tam Ming Wai
Dr. Nikos Mamoulis
2
Outline
  Introduction to Collaborative Filtering
  Special nature of CF
  Inverted File Search Algorithm
  Item-based
  Slope-one
  Hybrid method
  No random access
  Experiment
3
Collaborative Filtering
  Look for opinions from friends with similar taste
  The active user collaborates with other users
  Trust those whose taste is more similar
4
Example
      i1  i2  i3  i4  i5  ia
  ua   1   2   3   4   5   ?
  u1   1   2   3   4   5   5
  u2   5   4   3   2   1   1
ua trusts u1 more than u2
5
Special nature of CF
  Trust your intuition in the next few slides
6
Searching for similar users
Which user is the best one to trust in order to predict “?”?
      i1  i2  i3  i4  ia
  ua   -   2   -   -   ?
  u1   -   2   -   -   3
  u2   1   2   -   -   1
  u3   -   2   2   -   4
  u4   2   2   -   3   2
  u5   1   2   2   1   4
Everyone. Only i2 is relevant.
7
Similarity
  The similarity is not based on all attributes (the items)
  Only the items which the active user rated are relevant
  Breese et al. suggested that more items could be considered (by default voting), but this is not popular
8
Searching for similar users
Which user is the best one to trust in order to predict “?”?
      i1  i2  i3  i4  ia
  ua   1   2   3   5   ?
  u1   1   -   -   -   3
  u2   -   2   -   -   1
  u3   -   -   3   -   4
  u4   -   -   -   5   2
  u5   -   -   -   -   4
Everyone except u5.
9
Similarity
  The similarity is not based on all attributes (the items)
  Only the items which both the active user and the user under consideration rated are relevant
10
Note
  ua is similar to u1, u2, u3 and u4
  BUT u1, u2, u3 and u4 are not relevant to each other at all (they share no rated items other than ia)
11
Searching for similar users
Which user is the best one to trust in order to predict “?”?
      i1  i2  i3  i4  ia
  ua   1   2   3   4   ?
  u1   1   2   3   5   -
  u2   2   3   1   4   -
  u3   4   3   2   1   4
  u4   2   1   1   3   -
  u5   1   4   2   1   -
u3 is the one. Only u3 is relevant, because only u3 rated ia.
12
Top-k most similar users
  It is not the top-k among all users
  It is the top-k among the users who rated ia
13
Summary on the nature
  The matrix is incomplete
  Similarity
    The set of items could be different for every pair of users (the intersection of their rated items)
    The set of users (the candidates) could be different for each query (those who rated ia)
    No triangle inequality (in the extreme, ua is similar to u1 and u2, but u1 and u2 can be irrelevant to each other)
14
Popular Similarity measure
Very often, Pearson Correlation is used (formula image not preserved). In the formula:
  j iterates through the items that were rated by both user i and user a
  one term is the vote (rating) on item j by user a
  one term is the average vote (rating) of user a
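The formula itself was an image and did not survive extraction. A minimal sketch of the Pearson weight, assuming the form given by Breese et al. (sums over co-rated items, each user's mean taken over that user's whole profile), could look as follows. The function name `pearson` and the dictionary layout are illustrative choices, not taken from the slides; the ratings are those of the example on slide 4.

```python
from math import sqrt

def pearson(profile_a, profile_i):
    """Pearson weight w(a, i) over the items rated by both users.

    Assumption: means are taken over each user's full profile,
    following Breese et al.; the slide's own formula is not available.
    """
    common = profile_a.keys() & profile_i.keys()
    if not common:
        return 0.0
    mean_a = sum(profile_a.values()) / len(profile_a)
    mean_i = sum(profile_i.values()) / len(profile_i)
    sai = sum((profile_a[j] - mean_a) * (profile_i[j] - mean_i) for j in common)
    saa = sum((profile_a[j] - mean_a) ** 2 for j in common)
    sii = sum((profile_i[j] - mean_i) ** 2 for j in common)
    if saa == 0 or sii == 0:
        return 0.0  # a user with constant ratings carries no correlation signal
    return sai / sqrt(saa * sii)

# Ratings from the example on slide 4 (the unknown "?" is simply absent from ua).
ua = {"i1": 1, "i2": 2, "i3": 3, "i4": 4, "i5": 5}
u1 = {"i1": 1, "i2": 2, "i3": 3, "i4": 4, "i5": 5, "ia": 5}
u2 = {"i1": 5, "i2": 4, "i3": 3, "i4": 2, "i5": 1, "ia": 1}

print(pearson(ua, u1) > pearson(ua, u2))  # ua trusts u1 more than u2
```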
15
Output - Prediction (formula image not preserved)
  C is the set of users who rated the queried item
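The prediction formula was also an image. A hedged sketch, assuming the usual weighted-deviation form from Breese et al. (the active user's mean plus a normalised, similarity-weighted sum of each advisor's deviation from their own mean), is shown below. The function name `predict` and the tuple layout are hypothetical.

```python
def predict(mean_a, advisors):
    """Predict the active user's rating on the queried item.

    advisors: list of (weight, advisor_rating_on_item, advisor_mean) for the
    users in C. The normaliser is the sum of absolute weights (an assumption
    following Breese et al.).
    """
    norm = sum(abs(w) for w, _, _ in advisors)
    if norm == 0:
        return mean_a  # no usable advisor: fall back to the user's own mean
    return mean_a + sum(w * (r - m) for w, r, m in advisors) / norm

# Two advisors: one agrees with ua and rated the item above their mean,
# one disagrees with ua and rated the item below their mean.
p = predict(3.0, [(1.0, 5, 4.0), (-1.0, 1, 2.0)])
print(p)
```

Both advisors push the prediction upward here: the negatively correlated user's low rating counts as evidence for a high one.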
16
Brute Force Searching
Given an active user and an active movie:
  Relevant movies are known from the active user's profile
  Candidates are known from the active movie's profile
  Find sim(ua, ui) for every ui in the candidate set
  The top-k candidates are used as advisors
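The brute-force search can be sketched as below. The similarity function is passed in as a parameter; `toy_sim` is a stand-in invented for this example (negative mean absolute difference on co-rated items), not the measure used in the slides. The ratings are those of the matrix on slide 17.

```python
import heapq

def brute_force_top_k(active_profile, active_item, profiles, sim, k):
    """Score every candidate (user who rated active_item) and keep the k best."""
    candidates = [u for u, p in profiles.items() if active_item in p]
    scored = [(sim(active_profile, profiles[u]), u) for u in candidates]
    return heapq.nlargest(k, scored)

def toy_sim(a, b):
    """Illustrative similarity only: negative mean absolute rating difference."""
    common = a.keys() & b.keys()
    if not common:
        return float("-inf")
    return -sum(abs(a[j] - b[j]) for j in common) / len(common)

ua = {"i1": 1, "i2": 2, "i4": 4}
profiles = {
    "u1": {"i2": 2, "i3": 3, "ia": 4},
    "u2": {"i1": 2, "i2": 3, "i3": 1},
    "u3": {"i1": 4, "i3": 2, "i4": 1, "ia": 4},
    "u4": {"i1": 2, "i2": 1, "i4": 3, "ia": 3},
    "u5": {"i1": 1, "i3": 2, "i4": 1},
}
top = brute_force_top_k(ua, "ia", profiles, toy_sim, k=2)
print([u for _, u in top])
```

Every candidate's full profile is touched, which is what the later slides try to avoid.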
17
Useful Information
What is the useful information?
      i1  i2  i3  i4  ia
  ua   1   2   -   4   ?
  u1   -   2   3   -   4
  u2   2   3   1   -   -
  u3   4   -   2   1   4
  u4   2   1   -   3   3
  u5   1   -   2   1   -
19
Useful Information
What is the useful information?
  The green entries are useful (same rating matrix as on slide 17; the colour highlighting was not preserved)
20
Useful Information
  Either all user profiles or all movie profiles contain the useful information (same rating matrix as on slide 17)
21
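Inverted file (Coster & Svensson 2002)
Rating matrix (rows are users, columns are items):
  Item     1  2  3  4  5  6
  User 1   -  1  -  3  4  5
  User 2   1  3  4  5  -  5
  User 3   -  3  -  4  1  -
Inverted file (one list of (user, rating) pairs per item):
  Item 1: (2,1)
  Item 2: (1,1) (2,3) (3,3)
  Item 3: (2,4)
  Item 4: (1,3) (2,5) (3,4)
  Item 5: (1,4) (3,1)
  Item 6: (1,5) (2,5)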
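As an illustration, the inverted file on this slide can be built from the user profiles with a few lines of code. This is a sketch with a hypothetical helper name (`build_inverted_file`); the data is the three-user matrix shown on the slide.

```python
from collections import defaultdict

def build_inverted_file(profiles):
    """Turn {user: {item: rating}} into {item: [(user, rating), ...]}."""
    inverted = defaultdict(list)
    for user in sorted(profiles):
        for item, rating in profiles[user].items():
            inverted[item].append((user, rating))
    return dict(inverted)

# The three user profiles from the slide (missing ratings are simply absent).
profiles = {
    1: {2: 1, 4: 3, 5: 4, 6: 5},
    2: {1: 1, 2: 3, 3: 4, 4: 5, 6: 5},
    3: {2: 3, 4: 4, 5: 1},
}
inverted = build_inverted_file(profiles)
print(inverted[2])
```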
22
Pearson Correlation
  The active user is fixed in a single query
  For each user i, there are 3 summations (labelled SAI[i], SAA[i] and SII[i] in the formula)
  Instead of calculating w(a,i) for each user i separately, accumulate SAI[i], SAA[i] and SII[i] for all users at once (with the help of the inverted lists)
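A sketch of this accumulation is given below. It assumes SAI[i] is the cross term, SAA[i] the active user's squared deviations and SII[i] user i's squared deviations, each summed over co-rated items; the slide's formula image is not available, so this mapping is an assumption. The function name and the tiny data set are made up for illustration.

```python
from collections import defaultdict
from math import sqrt

def pearson_via_inverted_lists(active_profile, mean_a, inverted, means):
    """Compute w(a, i) for all users i by scanning only the relevant inverted lists."""
    sai = defaultdict(float)  # sum of (v_aj - mean_a) * (v_ij - mean_i)
    saa = defaultdict(float)  # sum of (v_aj - mean_a)^2, per user i (co-rated items only)
    sii = defaultdict(float)  # sum of (v_ij - mean_i)^2
    for item, r_a in active_profile.items():
        da = r_a - mean_a
        for user, r_i in inverted.get(item, []):
            di = r_i - means[user]
            sai[user] += da * di
            saa[user] += da * da
            sii[user] += di * di
    return {u: sai[u] / sqrt(saa[u] * sii[u])
            for u in sai if saa[u] > 0 and sii[u] > 0}

# Hypothetical data: item -> [(user, rating)], plus each user's mean rating.
inverted = {
    "i1": [("u1", 1), ("u2", 5)],
    "i2": [("u1", 3), ("u2", 3)],
    "i3": [("u1", 5), ("u2", 1)],
}
means = {"u1": 3.0, "u2": 3.0}
weights = pearson_via_inverted_lists({"i1": 2, "i3": 4}, 3.0, inverted, means)
print(weights)
```

Only the lists of items the active user rated are read; the list for i2 is never touched.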
23
Early Termination
Self-Indexing Inverted Files for Fast Text Retrieval, Alistair Moffat and Justin Zobel, 1994
  Quit: stop when the number of users reaches a threshold
  Continue: stop considering new users when the number of users reaches a threshold
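The two strategies can be sketched on a generic accumulator table. The posting lists and their numeric contributions below are hypothetical; in the CF setting they would be the partial sums gathered from the inverted lists.

```python
def accumulate(posting_lists, threshold, strategy):
    """Accumulate per-user partial scores with a cap on the number of users.

    strategy "quit":     stop all processing once `threshold` users are tracked.
    strategy "continue": stop admitting new users at the cap, but keep
                         updating the users already tracked.
    """
    acc = {}
    for postings in posting_lists:
        for user, contribution in postings:
            if user in acc:
                acc[user] += contribution
            elif len(acc) < threshold:
                acc[user] = contribution
                if strategy == "quit" and len(acc) == threshold:
                    return acc
    return acc

lists = [[("u1", 1.0), ("u2", 1.0)], [("u1", 2.0), ("u3", 5.0)]]
quit_result = accumulate(lists, threshold=2, strategy="quit")
continue_result = accumulate(lists, threshold=2, strategy="continue")
print(quit_result, continue_result)
```

With "quit" the second list is never read; with "continue" u1 still receives its second contribution while u3 is ignored.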
24
Item-based
  The matrix is symmetric
  Exchange the roles of rows (user profiles) and columns (movie profiles)
  Look for movies which are similar to the active movie
  If the users act similarly on both movies, the active user may act similarly too
25
Item-based example
      i1  i2  i3  i4  ia
  ua   1   1   3   4   ?
  u1   1   1   3   5   1
  u2   2   2   1   4   2
  u3   4   4   2   1   4
  u4   2   4   1   3   4
  u5   1   5   2   1   5
The users act exactly the same on i2 and ia
Perhaps i2 and ia are very similar
“?” may be 1, as ua gave i2 a rating of 1
26
Sarwar et al. 2001
  Pre-compute the top-k similar items
Amazon.com
  Personalised promotion based on the top-k similar items
27
Slope-one (Lemire & Maclachlan 2005)
  Not only find similar items
  Measure the pattern between items
28
Slope-one
  For an item pair j and i
  Over all users who rated both items
  Find the average difference in rating, dev j,i
29
Slope-one
  A prediction is made based on dev j,i
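A minimal sketch of the unweighted Slope One scheme follows: dev j,i is the average of (rating on j minus rating on i) over users who rated both, and the prediction averages dev j,i plus the active user's rating on i over the items i the active user rated. The function names are illustrative; the ratings are those of the example on slide 30.

```python
def dev(j, i, profiles):
    """Average rating difference (item j minus item i) over users who rated both."""
    diffs = [p[j] - p[i] for p in profiles.values() if j in p and i in p]
    return sum(diffs) / len(diffs) if diffs else None

def slope_one_predict(active_profile, j, profiles):
    """Unweighted Slope One: average of dev(j, i) + active rating on i."""
    estimates = []
    for i, rating in active_profile.items():
        d = dev(j, i, profiles)
        if d is not None:
            estimates.append(d + rating)
    return sum(estimates) / len(estimates) if estimates else None

profiles = {
    "u1": {"i1": 1, "i2": 2, "i3": 1, "i4": 5, "ia": 2},
    "u2": {"i1": 2, "i2": 2, "i3": 1, "i4": 4, "ia": 2},
    "u3": {"i1": 4, "i2": 4, "i3": 3, "i4": 1, "ia": 4},
    "u4": {"i1": 2, "i2": 4, "i3": 3, "i4": 3, "ia": 4},
    "u5": {"i1": 1, "i2": 5, "i3": 4, "i4": 1, "ia": 5},
}
ua = {"i1": 1, "i2": 4, "i3": 3, "i4": 4}

print(dev("ia", "i3", profiles))             # every user rated ia one point above i3
print(dev("ia", "i3", profiles) + ua["i3"])  # estimate from the pair (ia, i3) alone
print(slope_one_predict(ua, "ia", profiles)) # estimate averaged over all four items
```

Because dev j,i depends only on item pairs, it can be pre-computed independently of the active user.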
30
Slope-one example
      i1  i2  i3  i4  ia
  ua   1   4   3   4   ?
  u1   1   2   1   5   2
  u2   2   2   1   4   2
  u3   4   4   3   1   4
  u4   2   4   3   3   4
  u5   1   5   4   1   5
All users gave ia a rating higher than i3 by 1
By considering ia and i3, ua may rate “?” as 4
31
Summary
  A common argument
    There are fewer items than users
  Pre-computation
    Similarity in item-based
    dev j,i in slope-one
32
Hybrid method
Finding the top-k similar users
  Brute force
    Inefficient when the number of candidates is large
  Inverted file
    Inefficient when the number of relevant items is large
  Mix the two
33
Hybrid method
  Inverted file again
  The files are segmented according to ratings
34
Segmented inverted file example (inverted list of item I1)
  Segment 5: all users here gave I1 rating 5
  Segment 4: all users here gave I1 rating 4
  Segment 3: all users here gave I1 rating 3
  Segment 2: all users here gave I1 rating 2
  Segment 1: all users here gave I1 rating 1
35
Accessing Segmented inverted file
  First access the segments which are closest to the active user’s rating
36
Access example (inverted list of item I1)
  Access order 1, d=0 (ua is here)
  Access order 2, d=1
  Access order 3, d=1
  Access order 4, d=2
  Access order 5, d=3
37
Accessing Segmented inverted file
  The inverted file is a list ranked on d (the distance to ua’s rating)
  So the best bound on a candidate's similarity can be found
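The ranking on d can be illustrated with a short sketch, assuming ratings on a 1 to 5 scale and d equal to the absolute difference between a segment's rating and the active user's rating. The function name `access_order` is hypothetical. An active rating of 4 reproduces the distance sequence of the access example on slide 36.

```python
def access_order(active_rating, ratings=(1, 2, 3, 4, 5)):
    """Return (d, segment) pairs sorted by distance d to the active user's rating."""
    return sorted((abs(s - active_rating), s) for s in ratings)

order = access_order(4)
print(order)
```

Ties (two segments at the same distance) are broken here by segment rating, which is an arbitrary choice for the sketch.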
38
Algorithm phase 1
  Access all inverted lists so that all d=0 segments are loaded
  Starting from the most frequently seen candidates, find the actual similarity (k candidates are needed in total)
  The similarity of the k-th candidate whose actual similarity is known becomes the initial filter
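One possible reading of phase 1 is sketched below. The layout of the segmented file (item, then rating, then list of users), the function names and the similarity values are all assumptions made for the example; `actual_sim` stands for whatever random-access similarity computation is used.

```python
from collections import Counter

def phase_one(active_profile, segmented, actual_sim, k):
    """Load the d=0 segments, score the k most frequently seen users, set the filter."""
    seen = Counter()
    for item, rating in active_profile.items():
        # d = 0: the segment holding users who gave this item the same rating as ua
        for user in segmented.get(item, {}).get(rating, []):
            seen[user] += 1
    first_k = [u for u, _ in seen.most_common(k)]
    scored = sorted(((actual_sim(u), u) for u in first_k), reverse=True)
    # The k-th known similarity is the initial pruning filter.
    initial_filter = scored[-1][0] if len(scored) == k else float("-inf")
    return scored, initial_filter

# Hypothetical segmented inverted file and similarity values.
segmented = {
    "i1": {5: ["u1", "u2"], 4: ["u3"]},
    "i2": {3: ["u1", "u3"]},
}
sims = {"u1": 0.9, "u2": 0.7, "u3": 0.8}
scored, initial_filter = phase_one({"i1": 5, "i2": 3}, segmented, sims.get, k=2)
print(scored, initial_filter)
```

Note that u3 has a higher similarity than the filter but was not scored yet; phase 2 is what would pick it up.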
39
[Figure, slides 39 to 41: segmented inverted lists of items I1 to I5 (animation frames; images not preserved)]
42
Algorithm phase 1 example
  candidate   actual similarity
  u3          0.89
  u8          0.88
  …           …
  u1          0.77
  u9          0.70   (filter, the k-th candidate)
43
Algorithm phase 2: keep loading from the inverted lists
  The best bound of the similarity decreases
    A candidate whose similarity bound is worse than the filter is pruned
  The partial information becomes more complete
    Update the filter after some number of segments are loaded
  Stop when the number of remaining candidates is small
44
Algorithm phase 2
  In the implementation, the items that ua rated extremely (close to 1 or 5) are loaded first
  The candidates’ best bounds then drop faster
45
[Figure, slides 45 to 50: segmented inverted lists of items I1 to I5 (animation frames; images not preserved)]
51
Similarity measure
  Additive
  L1
  Segmental Manhattan Distance (SMD) = Manhattan Distance / number of relevant items
  Sim = 1 - SMD / (maximum distance)
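A small sketch of this measure, assuming "relevant items" means the items rated by both users and that the maximum distance is 4 on a 1 to 5 rating scale (neither is stated explicitly on the slide). The function name is hypothetical; the first two profiles come from the Horting example on slide 53, the last from the example on slide 4.

```python
def smd_similarity(profile_a, profile_i, max_distance=4):
    """Sim = 1 - (Manhattan distance / number of co-rated items) / max_distance."""
    common = profile_a.keys() & profile_i.keys()
    if not common:
        return None  # undefined without any co-rated item
    manhattan = sum(abs(profile_a[j] - profile_i[j]) for j in common)
    smd = manhattan / len(common)
    return 1 - smd / max_distance

ua = {"i1": 1, "i2": 2, "i3": 3, "i4": 4, "i5": 5}
u1 = {"i1": 1, "i2": 2, "i3": 3, "i4": 4, "i5": 5, "ia": 5}
u2 = {"i1": 1, "ia": 1}
opposite = {"i1": 5, "i2": 4, "i3": 3, "i4": 2, "i5": 1, "ia": 1}

print(smd_similarity(ua, u1), smd_similarity(ua, u2), smd_similarity(ua, opposite))
```

The first two values coincide even though u2 shares a single item with ua, which is the reliability problem the Horting slides address.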
52
Horting (Aggarwal et al.)
  Ensures that the intersection of rated items is large enough
53
Horting
      i1  i2  i3  i4  i5  ia
  ua   1   2   3   4   5   ?
  u1   1   2   3   4   5   5
  u2   1   -   -   -   -   1
Sim(ua, u1) = Sim(ua, u2)
u2 is less reliable
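The reliability check can be sketched as a simple overlap threshold. The function name and the threshold value of 2 are illustrative only; the slides call the threshold the horting factor without giving a value on this slide.

```python
def enough_overlap(profile_a, profile_i, horting_factor):
    """True when the two users co-rated at least `horting_factor` items."""
    return len(profile_a.keys() & profile_i.keys()) >= horting_factor

ua = {"i1": 1, "i2": 2, "i3": 3, "i4": 4, "i5": 5}
u1 = {"i1": 1, "i2": 2, "i3": 3, "i4": 4, "i5": 5, "ia": 5}
u2 = {"i1": 1, "ia": 1}

print(enough_overlap(ua, u1, 2), enough_overlap(ua, u2, 2))
```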
54
Best bound
We have ‘user num of appearance’
‘max num of more appearance’ = min(ua_profile.len, ui_profile.len) - ‘user num of appearance’
  if the user has never been seen in any segment
      best distance = 1
  else if ( partial distance > 1 )
      the user appears in unseen items, and d=1
  else if ( ‘max num of more appearance’ < horting_factor )
      the user appears only enough times
  else
      the user does not appear any more; the partial distance is the best
55
No random access
  The inverted file is a list ranked on d (the distance to ua’s rating)
  Nikos Mamoulis, Kit Hung Cheng, Man Lung Yiu, and David W. Cheung 2006
  Phase 1
    Do not find any actual similarity until the best bound of an unseen user is worse than the k-th best worst bound
56
[Figure, slides 56 to 60: segmented inverted lists of items I1 to I5 (animation frames; images not preserved)]
61
Worst Bound
  While a user’s partial distance is smaller than the maximum possible distance, include the distance
62
No random access
  Phase 2
    Find the actual similarity and prune candidates
63
Experiment
Netflix dataset
  480189 users
  17770 movies
  100 million ratings (1.17% of the user-movie matrix)
  k = 50
  h = 10
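The 1.17% figure is consistent with the fraction of filled cells in the user-movie matrix, taking "100 million" as exactly 10^8 (an approximation):

```latex
\[
\text{density}
  = \frac{10^{8}}{480189 \times 17770}
  = \frac{10^{8}}{8\,532\,958\,530}
  \approx 0.0117 = 1.17\%
\]
```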
64
Efficiency
  Method        Time per query
  Brute force   185.24s
  Hybrid        25.85s
  NRA           59.34s
65
Disk IO statistics (hybrid)
  % of actual similarity                     7.60%
  % of entries loaded from inverted file     68.52%
  % of entries which are loaded and relevant 49.77%
66
Reference
  Breese et al. Empirical Analysis of Predictive Algorithms for Collaborative Filtering
  Coster & Svensson 2002. Inverted File Search Algorithms for Collaborative Filtering
  Lemire & Maclachlan 2005. Slope One Predictors for Online Rating-Based Collaborative Filtering
  Sarwar et al. 2001. Item-Based Collaborative Filtering Recommendation Algorithms
  Aggarwal et al. Horting Hatches an Egg: A New Graph-Theoretic Approach to Collaborative Filtering
  Nikos Mamoulis, Kit Hung Cheng, Man Lung Yiu, and David W. Cheung 2006. Efficient Aggregation of Ranked Inputs