A more efficient Collaborative Filtering method Tam Ming Wai Dr. Nikos Mamoulis.

Outline
- Introduction to Collaborative Filtering
- Special nature of CF
- Inverted File Search Algorithm
- Item-based
- Slope-one
- Hybrid method
- No random access
- Experiment

Collaborative Filtering
- Look for opinions from friends with similar taste
- The active user collaborates with other users
- Trust those whose taste is more similar more

Example

       i_1  i_2  i_3  i_4  i_5  i_a
u_a     1    2    3    4    5    ?
u_1     1    2    3    4    5    5
u_2     5    4    3    2    1    1

u_a trusts u_1 more than u_2.

Special nature of CF
Trust your intuition in the next few slides.

Searching for similar users
Which user is the best one to trust in order to predict “?”? Everyone: only i_2 is relevant.

       i_1  i_2  i_3  i_4  i_a
u_a     -    2    -    -    ?
u_1     -    2    -    -    3
u_2     1    2    -    -    1
u_3     -    2    2    -    4
u_4     2    2    -    3    2
u_5     1    2    2    1    4

Similarity
- The similarity is not based on all attributes (the items)
- Only the items which the active user rated are relevant
- Although Breese et al. suggested that more items could be considered (by default voting), this is not popular

Searching for similar users
Which user is the best one to trust in order to predict “?”? Everyone except u_5.

       i_1  i_2  i_3  i_4  i_a
u_a     1    2    3    5    ?
u_1     1    -    -    -    3
u_2     -    2    -    -    1
u_3     -    -    3    -    4
u_4     -    -    -    5    2
u_5     -    -    -    -    4

Similarity
- The similarity is not based on all attributes (the items)
- Only the items which both the active user and the user under consideration rated are relevant

Note
u_a is similar to u_1, u_2, u_3 and u_4, but u_1, u_2, u_3 and u_4 are not relevant to each other at all.

Searching for similar users
Which user is the best one to trust in order to predict “?”? u_3 is the one: only u_3 is relevant.

       i_1  i_2  i_3  i_4  i_a
u_a     1    2    3    4    ?
u_1     1    2    3    5    -
u_2     2    3    1    4    -
u_3     4    3    2    1    4
u_4     2    1    1    3    -
u_5     1    4    2    1    -

Top-k most similar users
- It is not the top-k among all users
- It is the top-k among the users who rated i_a

Summary on the nature
- The matrix is incomplete
- Similarity
  - The set of items could be different for every pair of users (the intersection)
  - The set of users (the candidates) could be different for each query (those who rated i_a)
  - No triangle inequality (in the extreme, u_a is similar to u_1 and u_2, but u_1 and u_2 can be irrelevant to each other)

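Popular similarity measure
Very often, Pearson correlation is used:

$$w(a,i) = \frac{\sum_j (v_{a,j} - \bar{v}_a)(v_{i,j} - \bar{v}_i)}{\sqrt{\sum_j (v_{a,j} - \bar{v}_a)^2 \, \sum_j (v_{i,j} - \bar{v}_i)^2}}$$

- j iterates through the items that are rated by both user i and user a
- $v_{a,j}$ is the vote (rating) on item j by user a
- $\bar{v}_a$ is the average vote (rating) of user a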
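A minimal Python sketch of the Pearson weight may help make the summation concrete. It assumes each profile is a dict from item id to rating and that each user's average is taken over that user's whole profile (some variants average over co-rated items only). The function name and the sample profiles, taken from the earlier example slide, are illustrative.

```python
from math import sqrt


def pearson(ratings_a, ratings_i):
    """Pearson correlation w(a, i) over the items both users rated.

    Assumption: the mean of each user is computed over the full profile.
    Returns 0.0 when there is no overlap or no variance on the overlap.
    """
    common = ratings_a.keys() & ratings_i.keys()
    if not common:
        return 0.0
    mean_a = sum(ratings_a.values()) / len(ratings_a)
    mean_i = sum(ratings_i.values()) / len(ratings_i)
    sai = sum((ratings_a[j] - mean_a) * (ratings_i[j] - mean_i) for j in common)
    saa = sum((ratings_a[j] - mean_a) ** 2 for j in common)
    sii = sum((ratings_i[j] - mean_i) ** 2 for j in common)
    if saa == 0 or sii == 0:
        return 0.0
    return sai / sqrt(saa * sii)


# Profiles from the first example: u_1 agrees with u_a, u_2 is the opposite.
u_a = {"i1": 1, "i2": 2, "i3": 3, "i4": 4, "i5": 5}
u_1 = {"i1": 1, "i2": 2, "i3": 3, "i4": 4, "i5": 5, "ia": 5}
u_2 = {"i1": 5, "i2": 4, "i3": 3, "i4": 2, "i5": 1, "ia": 1}

w_1 = pearson(u_a, u_1)  # strongly positive
w_2 = pearson(u_a, u_2)  # strongly negative
```

With these profiles u_1 receives a large positive weight and u_2 a large negative one, which matches the intuition that u_a trusts u_1 more than u_2.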

Output - Prediction
C is the set of users who rated the queried item.
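The prediction formula itself does not survive in this transcript. One common form, following Breese et al., adds the weighted rating deviations of the users in C to the active user's average vote. The sketch below assumes that form and normalises by the sum of absolute weights; the function name and the tuple layout are hypothetical.

```python
def predict(mean_a, advisors):
    """Predict the active user's vote on the queried item.

    advisors: list of (weight, rating_on_item, advisor_mean) for users in C.
    Assumed form: mean_a + sum(w * (r - mean_i)) / sum(|w|).
    Falls back to mean_a when no advisor carries any weight.
    """
    norm = sum(abs(w) for w, _, _ in advisors)
    if norm == 0:
        return mean_a
    return mean_a + sum(w * (r - m) for w, r, m in advisors) / norm


# Two advisors: one rated the item above their mean, one below it.
prediction = predict(3.0, [(1.0, 5, 4.0), (0.5, 2, 3.0)])
```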

Brute force searching
Given an active user and an active movie:
- Relevant movies are known from the active user's profile
- Candidates are known from the active movie's profile
Find sim(u_a, u_i) for all u_i in the candidate set.
The top-k are used as advisors.
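The brute force search can be sketched in a few lines of Python. The similarity used here (one minus the mean absolute rating difference over co-rated items, on an assumed 1 to 5 scale) is only a stand-in so the example is self-contained; any measure could be plugged in. All names are illustrative, and the data is the matrix in which only u_3 rated i_a.

```python
import heapq

MAX_DISTANCE = 4  # assumption: ratings range from 1 to 5


def similarity(profile_a, profile_i):
    """Stand-in similarity: 1 - mean absolute difference / MAX_DISTANCE."""
    common = profile_a.keys() & profile_i.keys()
    if not common:
        return 0.0
    mean_diff = sum(abs(profile_a[j] - profile_i[j]) for j in common) / len(common)
    return 1.0 - mean_diff / MAX_DISTANCE


def brute_force_top_k(profiles, active_user, active_item, k):
    """Score every candidate (users who rated active_item) and keep the top k."""
    active = profiles[active_user]
    candidates = [u for u, p in profiles.items()
                  if u != active_user and active_item in p]
    scored = [(similarity(active, profiles[u]), u) for u in candidates]
    return heapq.nlargest(k, scored)


profiles = {
    "ua": {"i1": 1, "i2": 2, "i3": 3, "i4": 4},
    "u1": {"i1": 1, "i2": 2, "i3": 3, "i4": 5},
    "u2": {"i1": 2, "i2": 3, "i3": 1, "i4": 4},
    "u3": {"i1": 4, "i2": 3, "i3": 2, "i4": 1, "ia": 4},
    "u4": {"i1": 2, "i2": 1, "i3": 1, "i4": 3},
    "u5": {"i1": 1, "i2": 4, "i3": 2, "i4": 1},
}
advisors = brute_force_top_k(profiles, "ua", "ia", k=2)
```

Only u_3 is returned, even though other users look more similar to u_a, because the candidate set is restricted to users who rated i_a.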

Useful information
What is the useful information?

       i_1  i_2  i_3  i_4  i_a
u_a     1    2    -    4    ?
u_1     -    2    3    -    4
u_2     2    3    1    -    -
u_3     4    -    2    1    4
u_4     2    1    -    3    3
u_5     1    -    2    1    -

- The entries highlighted in green on the slide are the useful ones
- All user profiles, or all movie profiles, contain the useful information

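Inverted file (Coster & Svensson 2002)

Rating matrix:

         Item 1  Item 2  Item 3  Item 4  Item 5  Item 6
User 1     -       1       -       3       4       5
User 2     1       3       4       5       -       5
User 3     -       3       -       4       1       -

Inverted file, one list of (user, rating) entries per item:

Item 1: (2, 1)
Item 2: (1, 1) (2, 3) (3, 3)
Item 3: (2, 4)
Item 4: (1, 3) (2, 5) (3, 4)
Item 5: (1, 4) (3, 1)
Item 6: (1, 5) (2, 5)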
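As a rough illustration, the inverted file can be built from the user profiles with a single pass. The sketch assumes profiles are dicts from item id to rating and uses the three-user matrix of this slide as it is read from the transcript; the function name is hypothetical.

```python
from collections import defaultdict


def build_inverted_file(profiles):
    """Map each item to the list of (user, rating) pairs that rated it."""
    inverted = defaultdict(list)
    for user in sorted(profiles):  # keep each list ordered by user id
        for item, rating in profiles[user].items():
            inverted[item].append((user, rating))
    return dict(inverted)


# user id -> {item id: rating}
profiles = {
    1: {2: 1, 4: 3, 5: 4, 6: 5},
    2: {1: 1, 2: 3, 3: 4, 4: 5, 6: 5},
    3: {2: 3, 4: 4, 5: 1},
}
inverted = build_inverted_file(profiles)
```

Looking up `inverted[4]` then yields every user who rated item 4 together with the rating, without touching any user profile.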

Pearson correlation
- The active user is fixed in a single query
- For each user i, there are 3 summations
- Instead of calculating w(a,i) separately for each user i, calculate SAI[i], SAA[i] and SII[i] for all users at once (with the help of the inverted lists)
- SAI[i] is the cross term in the numerator; SAA[i] and SII[i] are the two sums of squares in the denominator
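The accumulation can be sketched as follows: for every item the active user rated, scan that item's inverted list once and update the three running sums of each user found there. This is an illustrative reading of the idea, not the implementation of Coster & Svensson; it assumes per-user mean ratings are precomputed, and the names `pearson_all`, `means` and `active` are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def pearson_all(active, mean_a, inverted, means):
    """Compute w(a, i) for every user i that shares an item with the active user."""
    sai = defaultdict(float)  # sum of (v_a - mean_a) * (v_i - mean_i)
    saa = defaultdict(float)  # sum of (v_a - mean_a)^2 over co-rated items
    sii = defaultdict(float)  # sum of (v_i - mean_i)^2 over co-rated items
    for item, v_a in active.items():
        d_a = v_a - mean_a
        for user, v_i in inverted.get(item, []):
            d_i = v_i - means[user]
            sai[user] += d_a * d_i
            saa[user] += d_a * d_a
            sii[user] += d_i * d_i
    return {u: sai[u] / sqrt(saa[u] * sii[u])
            for u in sai if saa[u] > 0 and sii[u] > 0}


# item id -> [(user id, rating)]
inverted = {
    1: [(2, 1)],
    2: [(1, 1), (2, 3), (3, 3)],
    3: [(2, 4)],
    4: [(1, 3), (2, 5), (3, 4)],
    5: [(1, 4), (3, 1)],
    6: [(1, 5), (2, 5)],
}
means = {1: 13 / 4, 2: 18 / 5, 3: 8 / 3}  # mean rating of each user
active = {2: 2, 4: 4, 6: 5}               # hypothetical active user profile
weights = pearson_all(active, sum(active.values()) / len(active), inverted, means)
```

Only the lists of the items in the active profile are read, so users with no co-rated item never get an accumulator.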

Early termination
Self-Indexing Inverted Files for Fast Text Retrieval, Alistair Moffat and Justin Zobel, 1994
- Quit: stop when the number of users reaches a threshold
- Continue: stop considering new users when the number of users reaches a threshold
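The difference between the two strategies can be shown with a simplified accumulator loop. In this sketch each accumulator merely counts co-rated items, which is a simplification; the threshold check after each list and all names are assumptions made for illustration.

```python
def accumulate(active_items, inverted, limit, strategy):
    """Scan inverted lists with a cap on the number of accumulators.

    strategy "quit":     stop scanning once `limit` users have accumulators.
    strategy "continue": keep scanning, but only update existing accumulators.
    """
    acc = {}
    for item in active_items:
        for user, _rating in inverted.get(item, []):
            if user in acc:
                acc[user] += 1
            elif len(acc) < limit:
                acc[user] = 1  # new users are admitted only below the limit
        if strategy == "quit" and len(acc) >= limit:
            break
    return acc


inverted = {
    "i1": [("u1", 1), ("u2", 2)],
    "i2": [("u1", 2), ("u3", 3)],
    "i3": [("u2", 1), ("u1", 5)],
}
quit_acc = accumulate(["i1", "i2", "i3"], inverted, limit=2, strategy="quit")
cont_acc = accumulate(["i1", "i2", "i3"], inverted, limit=2, strategy="continue")
```

Quit returns after the first list with partial information for u1 and u2, whereas Continue reads all three lists but never admits u3.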

Item-based
- The roles in the matrix are symmetric: exchange the role of row (user profile) and column (movie profile)
- Look for movies which are similar to the active movie
- If the users act similarly on both movies, the active user may act similarly too

Item-based example
The users act exactly the same on i_2 and i_a, so perhaps i_2 and i_a are very similar. “?” may be 1, as u_a gave i_2 a rating of 1.

       i_1  i_2  i_3  i_4  i_a
u_a     1    1    3    4    ?
u_1     1    1    3    5    1
u_2     2    2    1    4    2
u_3     4    4    2    1    4
u_4     2    4    1    3    4
u_5     1    5    2    1    5

Sarwar et al. 2001
- Pre-find the top-k similar items
Amazon.com
- Personal promotion on the top-k similar items

Slope-one (Lemire & Maclachlan 2005)
- Does not only find similar items
- Measures the pattern between items

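Slope-one
For an item pair j and i, over all users who rated both items, find the average difference in rating:

$$\mathrm{dev}_{j,i} = \frac{1}{|S_{j,i}|} \sum_{u \in S_{j,i}} (u_j - u_i)$$

where $S_{j,i}$ is the set of users who rated both items and $u_j$ is the rating of user u on item j.

Slope-one
A prediction is made based on $\mathrm{dev}_{j,i}$.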

Slope-one example
All users gave i_a a rating higher than i_3 by 1. Considering i_a and i_3, u_a may rate “?” as 4.

       i_1  i_2  i_3  i_4  i_a
u_a     1    4    3    4    ?
u_1     1    2    1    5    2
u_2     2    2    1    4    2
u_3     4    4    3    1    4
u_4     2    4    3    3    4
u_5     1    5    4    1    5
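The example above can be reproduced with a small sketch of the basic Slope One scheme from Lemire & Maclachlan: the deviation between two items is the average rating difference over users who rated both, and the prediction averages (deviation + the active user's rating) over the items the active user rated. The unweighted average and the function names are choices made for illustration.

```python
def deviation(profiles, j, i):
    """Average of (rating on j - rating on i) over users who rated both."""
    diffs = [p[j] - p[i] for p in profiles.values() if j in p and i in p]
    return sum(diffs) / len(diffs) if diffs else None


def slope_one_predict(profiles, active, j):
    """Unweighted Slope One estimate of the active user's rating on item j."""
    estimates = []
    for i, rating in active.items():
        dev = deviation(profiles, j, i)
        if dev is not None:
            estimates.append(dev + rating)
    return sum(estimates) / len(estimates) if estimates else None


profiles = {
    "u1": {"i1": 1, "i2": 2, "i3": 1, "i4": 5, "ia": 2},
    "u2": {"i1": 2, "i2": 2, "i3": 1, "i4": 4, "ia": 2},
    "u3": {"i1": 4, "i2": 4, "i3": 3, "i4": 1, "ia": 4},
    "u4": {"i1": 2, "i2": 4, "i3": 3, "i4": 3, "ia": 4},
    "u5": {"i1": 1, "i2": 5, "i3": 4, "i4": 1, "ia": 5},
}
u_a = {"i1": 1, "i2": 4, "i3": 3, "i4": 4}

dev_a3 = deviation(profiles, "ia", "i3")          # every user: i_a = i_3 + 1
from_i3_only = dev_a3 + u_a["i3"]                 # the estimate on the slide
overall = slope_one_predict(profiles, u_a, "ia")  # averaged over all four items
```

Using i_3 alone gives the slide's estimate of 4; averaging over all four rated items gives a somewhat lower value because i_1 pulls the estimate down.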

Summary
A common argument
- There are fewer items than users
Pre-computation
- Similarity in item-based
- dev_{j,i} in slope-one

Hybrid method
Finding the top-k similar users
- Brute force: inefficient when the number of candidates is large
- Inverted file: inefficient when the number of relevant items is large
- Mixing the two

Hybrid method
- Inverted file again
- The files are segmented according to ratings

Segmented inverted file example (item I_1)
- All users here gave I_1 rating 5
- All users here gave I_1 rating 4
- All users here gave I_1 rating 3
- All users here gave I_1 rating 2
- All users here gave I_1 rating 1

Accessing the segmented inverted file
First access the segments which are closer to the active user's rating.

Access example (item I_1)
- Access order 1, d=0 (u_a is here)
- Access order 2, d=1
- Access order 3, d=1
- Access order 4, d=2
- Access order 5, d=3
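The access order can be derived by sorting the rating segments by their distance d to the active user's rating. In the sketch below the active rating of 4 and the tie-breaking rule (higher rating first) are assumptions; the slide only fixes the sequence of distances.

```python
def access_order(active_rating, ratings=(1, 2, 3, 4, 5)):
    """Order the rating segments by distance to the active user's rating.

    Ties are broken in favour of the higher rating (an arbitrary choice).
    """
    return sorted(ratings, key=lambda s: (abs(s - active_rating), -s))


order = access_order(4)                     # segments to read, nearest first
distances = [abs(s - 4) for s in order]     # the d value of each access
```

For an active rating of 4 this reproduces the distance sequence 0, 1, 1, 2, 3 from the access example.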

Accessing the segmented inverted file
- The inverted file is a list ranked on d (distance to u_a's rating)
- The best bound on similarity can be found

Algorithm phase 1
- Access all inverted lists so that all d=0 segments are loaded
- Starting from the most frequently seen candidates, find the actual similarity (k candidates are needed in total)
- The similarity of the k-th candidate whose actual similarity is known becomes the initial filter

Figure: the inverted lists I_1, I_2, I_3, I_4 and I_5.

Algorithm phase 1 example

candidate   actual similarity
u_3         0.89
u_8         0.88
…           …
u_1         0.77
u_9         0.70

The similarity of the k-th candidate is the filter.

Algorithm phase 2: keep loading from the inverted lists
- The best bound of the similarity decreases
- A candidate whose similarity bound is worse than the filter is pruned
- The partial information becomes more complete
- Update the filter after some number of segments are loaded
- Stop when the number of remaining candidates is small

Algorithm phase 2
- In the implementation, the items which u_a rated extremely (close to 1 or 5) are loaded first
- The candidates' best bounds drop faster

Similarity measure
- Additive
- L1
- Segmental Manhattan Distance (SMD) = Manhattan distance / number of relevant items
- Sim = 1 - SMD / (maximum distance)

Horting (Aggarwal et al.)
Ensures that the intersection of items is large enough.

Horting

       i_1  i_2  i_3  i_4  i_5  i_a
u_a     1    2    3    4    5    ?
u_1     1    2    3    4    5    5
u_2     1    -    -    -    -    1

Sim(u_a, u_1) = Sim(u_a, u_2), but u_2 is less reliable.
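A short sketch shows why the overlap size matters. It computes Sim = 1 - SMD / (maximum distance) on an assumed 1 to 5 scale and returns the number of co-rated items alongside it. Treating the horting factor as a minimum overlap is an interpretation for illustration, and the threshold of 3 is arbitrary.

```python
MAX_DISTANCE = 4  # assumption: ratings range from 1 to 5


def smd_similarity(profile_a, profile_i):
    """Return (similarity, overlap) using the Segmental Manhattan Distance."""
    common = profile_a.keys() & profile_i.keys()
    if not common:
        return None, 0
    manhattan = sum(abs(profile_a[j] - profile_i[j]) for j in common)
    smd = manhattan / len(common)
    return 1 - smd / MAX_DISTANCE, len(common)


def is_reliable(overlap, horting_factor):
    """Assumed rule: a candidate needs at least `horting_factor` co-rated items."""
    return overlap >= horting_factor


u_a = {"i1": 1, "i2": 2, "i3": 3, "i4": 4, "i5": 5}
u_1 = {"i1": 1, "i2": 2, "i3": 3, "i4": 4, "i5": 5, "ia": 5}
u_2 = {"i1": 1, "ia": 1}

sim_1, overlap_1 = smd_similarity(u_a, u_1)
sim_2, overlap_2 = smd_similarity(u_a, u_2)
```

Both candidates reach the maximum similarity, yet u_2 does so on a single co-rated item, so a minimum-overlap rule keeps u_1 and rejects u_2.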

Best bound
We have 'user num of appearance', and
'max num of more appearance' = min(u_a_profile.len, u_i_profile.len) - 'user num of appearance'

if the user has never been seen in any segment
    best distance = 1
else if (partial distance > 1)
    the user appears in unseen items, with d=1
else if ('max num of more appearance' < horting_factor)
    the user appears only just enough times
else
    the user does not appear anymore; the partial distance is the best

No random access (Nikos Mamoulis, Kit Hung Cheng, Man Lung Yiu, and David W. Cheung 2006)
- The inverted file is a list ranked on d (distance to u_a's rating)
Phase 1
- Do not find any actual similarity until the best bound of an unseen user is worse than the k-th best worst bound

Worst bound
While a user's partial distance is smaller than the maximum possible distance, include the distance.

No random access, phase 2
- Find the actual similarities and prune candidates

Experiment
- Netflix dataset
- 480189 users
- 17770 movies
- 100 million ratings (1.17% of the matrix)
- k = 50
- h = 10
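The 1.17% figure is the fill rate of the user-by-movie matrix, which can be checked directly. The round figure of 100 million ratings is taken as stated on the slide.

```python
users = 480_189
movies = 17_770
ratings = 100_000_000  # "100 million" as stated; the exact count is not given here

# Fraction of the user x movie matrix that actually holds a rating.
density = ratings / (users * movies)
density_percent = round(100 * density, 2)
```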

Efficiency
- Brute force: 185.24 s per query
- Hybrid: 25.85 s per query
- NRA: 59.34 s per query
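Expressed as speedups over brute force, the timings above work out as follows. This is a small arithmetic check using only the three reported numbers.

```python
seconds_per_query = {"brute force": 185.24, "hybrid": 25.85, "NRA": 59.34}

baseline = seconds_per_query["brute force"]
# Speedup factor of each method relative to brute force, to one decimal place.
speedup = {name: round(baseline / t, 1) for name, t in seconds_per_query.items()}
```

The hybrid method is roughly 7 times faster than brute force per query, and NRA roughly 3 times faster.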

Disk I/O statistics (hybrid)
- % of actual similarities computed: 7.60%
- % of entries loaded from the inverted file: 68.52%
- % of entries which were loaded and relevant: 49.77%

References
- Breese et al., Empirical Analysis of Predictive Algorithms for Collaborative Filtering
- Coster & Svensson 2002, Inverted File Search Algorithms for Collaborative Filtering
- Lemire & Maclachlan 2005, Slope One Predictors for Online Rating-Based Collaborative Filtering
- Sarwar et al. 2001, Item-Based Collaborative Filtering Recommendation Algorithms
- Aggarwal et al., Horting Hatches an Egg: A New Graph-Theoretic Approach to Collaborative Filtering
- Nikos Mamoulis, Kit Hung Cheng, Man Lung Yiu, and David W. Cheung 2006, Efficient Aggregation of Ranked Inputs
