A More Efficient Collaborative Filtering Method. Tam Ming Wai, Dr. Nikos Mamoulis
Outline: Introduction to Collaborative Filtering; Special nature of CF; Inverted File Search Algorithm; Item-based; Slope-one; Hybrid method; No random access; Experiment
Collaborative Filtering: Looking for opinions from friends with similar taste. The active user collaborates with other users, and trusts those whose taste is more similar.
Example (rating table): the active user u_a rated items i_1..i_5 as 1, 2, 3, 4, 5 and the active item i_a as ?; users u_1 and u_2 rated some of the same items. u_a trusts u_1 more than u_2.
Special nature of CF: Trust your intuition through the next few slides.
Searching for similar users: Which user is the best one to trust in order to predict "?" ? Everyone is a candidate, but only i_2 is relevant. (The active user u_a rated only i_2, with rating 2; i_1, i_3, i_4 are unrated and i_a is ?.)
Similarity: The similarity is not based on all attributes (the items). Only the items which the active user rated are relevant. Although some have suggested (Breese et al.) that more items could be considered (by default voting), this is not popular.
Searching for similar users: Which user is the best one to trust in order to predict "?" ? Everyone except u_5. (The active user u_a rated i_1..i_4 as 1, 2, 3, 5 and i_a as ?.)
Similarity The similarity is not based on all attributes (the items) Only the items which both the active user and the user under consideration rated are relevant
A note: u_a is similar to u_1, u_2, u_3 and u_4, BUT u_1, u_2, u_3 and u_4 are not relevant to each other at all.
Searching for similar users: Which user is the best one to trust in order to predict "?" ? u_3 is the one; only u_3 is relevant. (The active user u_a rated i_1..i_4 as 1, 2, 3, 4 and i_a as ?.)
Top-k most similar users: It is not the top-k among all users; it is the top-k among the users who rated i_a.
Summary of the nature of CF: The matrix is incomplete. For similarity, the set of items could be different for every pair of users (the intersection), and the set of users (the candidates) could be different for each query (those who rated i_a). There is no triangle inequality (in the extreme, u_a is similar to u_1 and u_2, but u_1 and u_2 can be irrelevant to each other).
Popular similarity measure: Very often, Pearson correlation is used:
w(a,i) = \frac{\sum_j (v_{a,j} - \bar{v}_a)(v_{i,j} - \bar{v}_i)}{\sqrt{\sum_j (v_{a,j} - \bar{v}_a)^2 \sum_j (v_{i,j} - \bar{v}_i)^2}}
where j iterates through the items rated by both user i and user a, v_{a,j} is the vote (rating) of user a on item j, and \bar{v}_a is the average vote (rating) of user a.
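A minimal sketch of this computation, assuming each user profile is a dict mapping item id to rating (the function name and the choice of taking the averages over the full profiles are my assumptions, not from the slides):

```python
from math import sqrt

def pearson(profile_a, profile_i):
    """Pearson correlation between two users over their co-rated items."""
    common = set(profile_a) & set(profile_i)            # items rated by both users
    if len(common) < 2:
        return 0.0                                      # not enough overlap to correlate
    mean_a = sum(profile_a.values()) / len(profile_a)   # average rating of user a
    mean_i = sum(profile_i.values()) / len(profile_i)   # average rating of user i
    num = sum((profile_a[j] - mean_a) * (profile_i[j] - mean_i) for j in common)
    den_a = sum((profile_a[j] - mean_a) ** 2 for j in common)
    den_i = sum((profile_i[j] - mean_i) ** 2 for j in common)
    if den_a == 0 or den_i == 0:
        return 0.0
    return num / sqrt(den_a * den_i)
```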
Output - Prediction: C is the set of users who rated the queried item.
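For reference, the weighted-sum prediction commonly used with Pearson weights (Breese et al.), where C is the set of users who rated the queried item j and \kappa normalises the weights; the slide's own formula is assumed to take this form:

p_{a,j} = \bar{v}_a + \kappa \sum_{i \in C} w(a,i)\,(v_{i,j} - \bar{v}_i), \qquad \kappa = \frac{1}{\sum_{i \in C} |w(a,i)|}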
Brute force searching: Given an active user and an active movie, the relevant movies are known from the active user's profile and the candidates are known from the active movie's profile. Find sim(u_a, u_i) for all u_i in the candidate set; the top-k are used as advisors.
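A minimal sketch of the brute-force search, reusing the pearson() function sketched above; the profile dictionaries and names are illustrative:

```python
import heapq

def brute_force_top_k(user_profiles, item_profiles, active_user, active_item, k):
    """Score every candidate who rated the active item; keep the k most similar."""
    candidates = item_profiles[active_item]        # users who rated the active movie
    scored = []
    for cand in candidates:
        if cand == active_user:
            continue
        w = pearson(user_profiles[active_user], user_profiles[cand])
        scored.append((w, cand))
    return heapq.nlargest(k, scored)               # the top-k advisors
```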
Useful information: What is the useful information? (Example table: u_a rated i_1, i_2, i_4 as 1, 2, 4 and i_a as ?.)
Useful information: The green (highlighted) entries are useful: the active user's ratings and the candidates' ratings on those items.
Useful information: Either all user profiles or all movie profiles contain the useful information.
Inverted file: for each item, a list of the users who rated it (Coster & Svensson 2002).
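A minimal sketch of the structure, assuming ratings arrive as (user, item, rating) triples (the layout is illustrative):

```python
from collections import defaultdict

def build_inverted_file(ratings):
    """Map each item to the list of (user, rating) pairs that mention it."""
    inverted = defaultdict(list)
    for user, item, rating in ratings:
        inverted[item].append((user, rating))
    return inverted

# Usage: inverted = build_inverted_file([("u1", "i1", 4), ("u2", "i1", 5)])
```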
Pearson correlation with the inverted file: The active user is fixed within a single query. For each user i there are three summations. Instead of calculating w(a,i) separately for each user i, calculate SAI[i], SAA[i] and SII[i] for all users in one pass (with the help of the inverted lists).
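A sketch of this accumulation, assuming the inverted file above and pre-computed user means; reading SAI/SAA/SII as the numerator and the two denominator sums of the Pearson formula is my interpretation of the slide:

```python
from collections import defaultdict
from math import sqrt

def pearson_via_inverted_file(inverted, user_means, active_profile, mean_a):
    """Accumulate the three Pearson summations for every candidate in one pass."""
    SAI = defaultdict(float)   # sum of (v_aj - mean_a) * (v_ij - mean_i)
    SAA = defaultdict(float)   # sum of (v_aj - mean_a)^2 over co-rated items
    SII = defaultdict(float)   # sum of (v_ij - mean_i)^2 over co-rated items
    for item, rating_a in active_profile.items():       # only items u_a rated matter
        da = rating_a - mean_a
        for user, rating_i in inverted.get(item, ()):   # walk the item's inverted list
            di = rating_i - user_means[user]
            SAI[user] += da * di
            SAA[user] += da * da
            SII[user] += di * di
    return {u: SAI[u] / sqrt(SAA[u] * SII[u])
            for u in SAI if SAA[u] > 0 and SII[u] > 0}
```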
Early termination (Self-Indexing Inverted Files for Fast Text Retrieval, Alistair Moffat and Justin Zobel, 1994). Quit: stop entirely when the number of users reaches a threshold. Continue: stop considering new users when the number of users reaches a threshold, but keep processing the ones already seen.
Item-based: The roles in the matrix are symmetric, so exchange the role of rows (user profiles) and columns (movie profiles). Look for movies that are similar to the active movie: if other users act similarly on both movies, the active user may act similarly too.
Item-based example: The users rate i_2 and i_a exactly the same, so perhaps i_2 and i_a are very similar? The prediction may be 1, since u_a gave i_2 rating 1. (Active user's ratings: i_1=1, i_2=1, i_3=3, i_4=4, i_a=?.)
Sarwar et al. 2001: pre-compute the top-k similar items. Amazon.com: personal promotion based on the top-k similar items.
Slope-one: not only find similar items, but measure the rating pattern between items (Lemire & Maclachlan 2005).
Slope-one: For an item pair j and i, over all users who rated both items, find the average difference in rating:
dev_{j,i} = \sum_{u \in S_{j,i}} \frac{u_j - u_i}{|S_{j,i}|}
where S_{j,i} is the set of users who rated both j and i, and u_j is user u's rating on item j.
Slope-one: A prediction is made based on dev_{j,i}: shift u_a's rating on a previously rated item i by dev_{j,i} to estimate the rating on item j, and average over the applicable items i.
Slope-one example: All users gave i_a a rating higher than i_3 by 1. By considering i_a and i_3, u_a may rate '?' as 4. (Active user's ratings: i_1=1, i_2=4, i_3=3, i_4=4, i_a=?.)
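A minimal sketch of the slope-one deviations and the basic predictor, assuming user profiles as item-to-rating dicts:

```python
from collections import defaultdict

def slope_one_deviations(user_profiles):
    """dev[j][i] = average of (rating on j - rating on i) over users who rated both."""
    diff_sum = defaultdict(lambda: defaultdict(float))
    count = defaultdict(lambda: defaultdict(int))
    for profile in user_profiles.values():
        for j, r_j in profile.items():
            for i, r_i in profile.items():
                if i != j:
                    diff_sum[j][i] += r_j - r_i
                    count[j][i] += 1
    return {j: {i: diff_sum[j][i] / count[j][i] for i in diff_sum[j]} for j in diff_sum}

def slope_one_predict(profile_a, target_item, dev):
    """Basic slope-one: average dev[target][i] + r_i over the items i that u_a rated."""
    shifted = [dev[target_item][i] + r_i
               for i, r_i in profile_a.items() if i in dev.get(target_item, {})]
    return sum(shifted) / len(shifted) if shifted else None
```

On the example above, dev_{i_a,i_3} = 1 and u_a rated i_3 as 3, so the contribution of i_3 to the prediction is 4.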
Summary: A common argument is that there are fewer items than users, so pre-computation is feasible: item-item similarity in the item-based method, dev_{j,i} in slope-one.
Hybrid method: Finding the top-k similar users. Brute force is inefficient when the number of candidates is large; the inverted file is inefficient when the number of relevant items is large. Mix the two.
Hybrid method: the inverted file again, but the lists are segmented according to rating value.
Segmented inverted file example for I_1: one segment per rating value, each holding all the users who gave I_1 that rating (5, 4, 3, 2, 1).
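A minimal sketch of the segmented structure, assuming the same (user, item, rating) triples as before (layout illustrative):

```python
from collections import defaultdict

def build_segmented_inverted_file(ratings):
    """Map item -> rating value -> list of users who gave the item that rating."""
    segmented = defaultdict(lambda: defaultdict(list))
    for user, item, rating in ratings:
        segmented[item][rating].append(user)
    return segmented
```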
Accessing the segmented inverted file: first access the segments whose rating is closest to the active user's rating.
Access example for I_1: the segment matching u_a's rating on I_1 is accessed first (order 1, d=0), then the two segments at d=1 (orders 2 and 3), then d=2 (order 4), then d=3 (order 5).
Accessing the segmented inverted file: each inverted list is ranked on d (the distance to u_a's rating), so the best possible bound on similarity can be derived as segments are read.
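A small sketch of this access order, assuming the segmented structure above and numeric ratings:

```python
def segment_access_order(segments_for_item, rating_a):
    """Yield (d, rating, users) in increasing distance from the active user's rating."""
    order = sorted(segments_for_item.items(), key=lambda kv: abs(kv[0] - rating_a))
    for rating, users in order:
        yield abs(rating - rating_a), rating, users

# Usage: for d, r, users in segment_access_order(segmented["i1"], rating_a=4): ...
```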
Algorithm phase 1: Access all inverted lists so that all d=0 segments are loaded. Starting from the most frequently seen candidates, compute the actual similarity (k candidates in total are needed). The similarity of the k-th candidate whose actual similarity is known becomes the initial filter.
(Figure: phase 1 access over the inverted lists I_1 to I_5; the d=0 segments are loaded.)
Algorithm phase 1 example: candidates with known actual similarity: u_3 0.89, u_8 0.88, …, u_1 0.77, u_9 0.70; the k-th value becomes the filter.
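A sketch of phase 1 under the structures above; similarity stands for whichever measure is used (Pearson here, or the segmental Manhattan similarity described later), and the helper names are mine:

```python
from collections import Counter

def phase1(segmented, active_profile, user_profiles, similarity, k):
    """Load all d=0 segments, then seed the filter from the most frequent candidates."""
    seen = Counter()
    for item, rating_a in active_profile.items():        # one inverted list per relevant item
        for user in segmented[item].get(rating_a, ()):   # only the d=0 segment
            seen[user] += 1
    actual = {}
    for user, _count in seen.most_common():              # most frequently seen candidates first
        actual[user] = similarity(active_profile, user_profiles[user])
        if len(actual) == k:
            break
    filter_value = min(actual.values()) if actual else 0.0   # similarity of the k-th candidate
    return seen, actual, filter_value
```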
Algorithm phase 2: keep loading from the inverted lists. The best bound on similarity decreases; when a candidate's similarity bound is worse than the filter, the candidate is pruned. The partial information becomes more complete; update the filter after some number of segments are loaded. Stop when the number of remaining candidates is small.
Algorithm phase 2: In the implementation, the items that u_a rated extremely (close to 1 or 5) are loaded first, so the candidates' best bounds drop faster.
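A rough sketch of the pruning loop; best_bound() is a hypothetical helper standing in for the bound described on the "Best bound" slide, the thresholds are illustrative, and segment_access_order() is the helper sketched earlier:

```python
def phase2(segmented, active_profile, candidates, best_bound, filter_value,
           update_every=10, stop_at=50):
    """Keep loading segments; prune candidates whose best bound falls below the filter."""
    # Heuristic from the slide: load items u_a rated extremely (near 1 or 5) first.
    items = sorted(active_profile, key=lambda i: -abs(active_profile[i] - 3))
    partial = {c: {} for c in candidates}                 # ratings seen so far per candidate
    loaded = 0
    for item in items:
        rating_a = active_profile[item]
        for _d, rating_i, users in segment_access_order(segmented[item], rating_a):
            for user in users:
                if user in partial:
                    partial[user][item] = rating_i
            loaded += 1
            if loaded % update_every == 0:
                # Prune: a candidate whose best possible similarity is below the filter is out.
                partial = {c: seen for c, seen in partial.items()
                           if best_bound(active_profile, seen) >= filter_value}
                if len(partial) <= stop_at:
                    return partial
    return partial
```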
(Figure: phase 2 access over the inverted lists I_1 to I_5; further segments are loaded and candidates are pruned.)
Similarity measure: additive L1. Segmental Manhattan Distance (SMD) = Manhattan distance / number of relevant items. Sim = 1 - SMD / (maximum distance).
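A small sketch of this measure, assuming ratings on a 1-5 scale so that the maximum per-item distance is 4 (that reading of "maximum distance" is my assumption):

```python
def segmental_manhattan_similarity(profile_a, profile_i, max_distance=4):
    """Sim = 1 - (Manhattan distance / number of relevant items) / max_distance."""
    common = set(profile_a) & set(profile_i)        # the relevant (co-rated) items
    if not common:
        return 0.0
    smd = sum(abs(profile_a[j] - profile_i[j]) for j in common) / len(common)
    return 1.0 - smd / max_distance
```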
Horting: to ensure that the intersection of items is large enough (Aggarwal et al.).
Horting example: u_a rated i_1..i_5 as 1, 2, 3, 4, 5 and i_a as ?. Sim(u_a, u_1) = Sim(u_a, u_2), but u_2 overlaps with u_a on fewer items, so u_2 is less reliable.
Best bound: We have 'user num of appearances' and 'max num of more appearances' = min(u_a_profile.len, u_i_profile.len) - 'user num of appearances'.
if the user has never been seen in any segment: best distance = 1
else if partial distance > 1: the user appears in unseen items, and d = 1
else if 'max num of more appearances' < horting_factor: the user can appear only a limited number of further times
else: the user does not appear any more; the partial distance is the best
No random access: The inverted file is a list ranked on d (distance to u_a's rating). Nikos Mamoulis, Kit Hung Cheng, Man Lung Yiu, and David W. Cheung 2006. Phase 1: do not compute any actual similarity until the best bound of an unseen user is worse than the k-th best worst bound.
(Figure: no-random-access phase 1 over the inverted lists I_1 to I_5.)
Worst bound: while a user's partial distance is smaller than the maximum possible distance, include the maximum possible distance for the items not yet seen.
No random access phase 2 Find actual similarity and prune candidates
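For illustration, a generic no-random-access top-k sketch (in the spirit of classic NRA over ranked inputs) for descending-sorted score lists with sum aggregation; the paper's distance-based bounds differ, so treat this only as the general worst-bound/best-bound pattern:

```python
import heapq

def nra_top_k(sorted_lists, k):
    """Generic NRA: each list holds (object, score) pairs sorted by score descending;
    an object's aggregate score is the sum of its scores across all lists."""
    n = len(sorted_lists)
    seen = {}                                       # object -> {list index: score}
    thresholds = [float("inf")] * n                 # last score read from each list
    max_depth = max((len(lst) for lst in sorted_lists), default=0)
    for depth in range(max_depth):
        # Sorted (sequential) access only: read one entry from every list per round.
        for li, lst in enumerate(sorted_lists):
            if depth < len(lst):
                obj, score = lst[depth]
                seen.setdefault(obj, {})[li] = score
                thresholds[li] = score
        # Worst bound: unseen lists contribute 0; best bound: they contribute the threshold.
        worst = {o: sum(s.values()) for o, s in seen.items()}
        best = {o: worst[o] + sum(thresholds[li] for li in range(n) if li not in s)
                for o, s in seen.items()}
        if len(worst) < k:
            continue
        top_k = heapq.nlargest(k, worst, key=worst.get)
        kth_worst = worst[top_k[-1]]
        # Stop when no other seen object, and no completely unseen object, can still beat it.
        best_elsewhere = [best[o] for o in seen if o not in top_k] + [sum(thresholds)]
        if max(best_elsewhere) <= kth_worst:
            return top_k
    return heapq.nlargest(k, seen, key=lambda o: sum(seen[o].values()))
```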
Experiment: Netflix dataset (users x movies, 100 million ratings, 1.17% of the matrix filled); k = 50, h = 10.
Efficiency: brute force ? s per query; hybrid 25.85 s per query; NRA 59.34 s per query.
Disk I/O statistics (hybrid): % of actual similarity computations: 7.60%; % of entries loaded from the inverted file: 68.52%; % of entries which were loaded and relevant: 49.77%.
References:
Breese et al. Empirical Analysis of Predictive Algorithms for Collaborative Filtering.
Coster & Svensson (2002). Inverted File Search Algorithms for Collaborative Filtering.
Lemire & Maclachlan (2005). Slope One Predictors for Online Rating-Based Collaborative Filtering.
Sarwar et al. (2001). Item-Based Collaborative Filtering Recommendation Algorithms.
Aggarwal et al. Horting Hatches an Egg: A New Graph-Theoretic Approach to Collaborative Filtering.
Nikos Mamoulis, Kit Hung Cheng, Man Lung Yiu, and David W. Cheung (2006). Efficient Aggregation of Ranked Inputs.