A more efficient Collaborative Filtering method Tam Ming Wai Dr. Nikos Mamoulis.

Outline
- Introduction to Collaborative Filtering
- Special nature of CF
- Inverted File Search Algorithm
- Item-based
- Slope-one
- Hybrid method
- No random access
- Experiment

Collaborative Filtering
- Look for opinions from friends with similar taste
- The active user collaborates with other users
- Trust those whose taste is more similar more

Example

        i_1  i_2  i_3  i_4  i_5  i_a
u_a      1    2    3    4    5    ?
(rows u_1 and u_2: values not preserved)

- u_a trusts u_1 more than u_2

Special nature of CF
- Trust your intuition in the following few slides

Searching for similar users
- Which user is the best one to trust in order to predict “?”? Everyone.
- Only i_2 is relevant

        i_1  i_2  i_3  i_4  i_a
u_a      -    2    -    -    ?
(rows u_1 to u_5: values not preserved)

Similarity
- The similarity is not based on all attributes (the items)
- Only the items which the active user rated are relevant
- Although some authors (Breese et al.) suggested that more items could be considered (by default voting), this is not popular

Searching for similar users
- Which user is the best one to trust in order to predict “?”? Everyone except u_5.

        i_1  i_2  i_3  i_4  i_a
u_a      1    2    3    5    ?
(rows u_1 to u_5: values not preserved)

Similarity
- The similarity is not based on all attributes (the items)
- Only the items which both the active user and the user under consideration rated are relevant

A note
- u_a is similar to u_1, u_2, u_3 and u_4
- BUT u_1, u_2, u_3 and u_4 are entirely unrelated to each other

Searching for similar users
- Which user is the best one to trust in order to predict “?”? u_3 is the one.
- Only u_3 is relevant

        i_1  i_2  i_3  i_4  i_a
u_a      1    2    3    4    ?
(rows u_1 to u_5: values not preserved)

Top-k most similar users
- It is not the top-k among all users
- It is the top-k among the users who rated i_a

Summary on the nature
- The matrix is incomplete
- Similarity
  - The set of items can be different for every pair of users (the intersection of their rated items)
  - The set of users (the candidates) can be different for each query (those who rated i_a)
  - No triangle inequality (in the extreme case, u_a is similar to both u_1 and u_2, but u_1 and u_2 can be unrelated)

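Popular similarity measure
- Very often, Pearson correlation is used:

\[ w(a,i) = \frac{\sum_j (v_{a,j} - \bar{v}_a)(v_{i,j} - \bar{v}_i)}{\sqrt{\sum_j (v_{a,j} - \bar{v}_a)^2 \, \sum_j (v_{i,j} - \bar{v}_i)^2}} \]

- j iterates through the items that are rated by both user i and user a
- v_{a,j}: vote (rating) on item j by user a
- \bar{v}_a: average vote (rating) of user a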
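
The Pearson weight described on this slide can be sketched in a few lines. This is only an illustration: the dict-based profile layout and the function name `pearson` are assumptions, and each user's average is taken over that user's full profile (the convention in Breese et al.), which the slide does not state explicitly.

```python
from math import sqrt


def pearson(profile_a, profile_i):
    """Pearson weight w(a, i) over the items rated by both users.

    Profiles are dicts mapping item -> rating (an assumed layout).
    """
    common = profile_a.keys() & profile_i.keys()
    if not common:
        return 0.0  # no co-rated item: treat the pair as unrelated
    # Assumption: averages are taken over each user's whole profile.
    mean_a = sum(profile_a.values()) / len(profile_a)
    mean_i = sum(profile_i.values()) / len(profile_i)
    sai = sum((profile_a[j] - mean_a) * (profile_i[j] - mean_i) for j in common)
    saa = sum((profile_a[j] - mean_a) ** 2 for j in common)
    sii = sum((profile_i[j] - mean_i) ** 2 for j in common)
    if saa == 0 or sii == 0:
        return 0.0  # a flat profile has no defined correlation
    return sai / sqrt(saa * sii)
```

Two users with identical ratings get weight 1, and two users with exactly opposite ratings get weight -1.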

Output - Prediction
- C is the set of users who rated the queried item
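
The prediction formula itself did not survive on this slide. The sketch below assumes the weighted-deviation form used by Breese et al.: the active user's average plus the normalised, similarity-weighted deviations of the users in C. The function name `predict` and the tuple layout are illustrative, not taken from the slides.

```python
def predict(mean_a, advisors):
    """Predict the active user's rating on the queried item.

    mean_a   -- average rating of the active user
    advisors -- list of (weight, rating_on_item, advisor_mean) tuples,
                one per user in C (an assumed layout)
    """
    norm = sum(abs(w) for w, _, _ in advisors)
    if norm == 0:
        return mean_a  # nobody to trust: fall back to the user's average
    deviation = sum(w * (r - m) for w, r, m in advisors)
    return mean_a + deviation / norm
```

With no usable advisor the sketch falls back to the active user's own average, which is a design choice of this example rather than something the slides specify.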

Brute force searching
- Given an active user and an active movie:
  - Relevant movies are known from the active user profile
  - Candidates are known from the active movie profile
- Find sim(u_a, u_i) for all u_i in the candidate set
- The top-k are used as advisors
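
The brute-force search above can be sketched as follows. The names `brute_force_top_k`, `profiles` and `sim` are hypothetical; `sim` stands for whichever similarity function is in use and is passed in as a callable.

```python
import heapq


def brute_force_top_k(active, item, profiles, sim, k):
    """Return the k candidates most similar to the active user.

    profiles -- dict user -> {item: rating} (assumed layout)
    sim      -- callable taking two profiles and returning a similarity
    """
    # Candidates: every other user who rated the queried item.
    candidates = [u for u, p in profiles.items() if u != active and item in p]
    # One similarity computation per candidate, then keep the k best.
    return heapq.nlargest(
        k, candidates, key=lambda u: sim(profiles[active], profiles[u])
    )
```

The cost is one similarity computation per candidate, which is why this approach becomes slow when the queried movie has many raters.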

Useful information
- What is the useful information?
- The green entries are useful (the highlighting was not preserved)

        i_1  i_2  i_3  i_4  i_a
u_a      1    2    -    4    ?
(rows u_1 to u_5: values not preserved)

Useful information
- All user profiles, or all movie profiles, contain the useful information

        i_1  i_2  i_3  i_4  i_a
u_a      1    2    -    4    ?
(rows u_1 to u_5: values not preserved)

Inverted file (Coster & Svensson 2002)
[Figure: inverted file; labels: Item, User, Item]

Pearson correlation
- The active user is fixed in a single query
- For each user i, there are 3 summations: the numerator sum SAI[i] and the two denominator sums SAA[i] and SII[i]
- Instead of calculating w(a,i) for each user i separately, calculate SAI[i], SAA[i] and SII[i] for all users at once (with the help of the inverted lists)
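
A minimal sketch of this accumulation, assuming an inverted file stored as a dict from item to a list of (user, rating) pairs and precomputed per-user averages. The function and parameter names are illustrative, and the reading of SAI, SAA and SII as the three Pearson sums is an assumption based on their names.

```python
from collections import defaultdict
from math import sqrt


def pearson_via_inverted_file(active_profile, mean_a, inverted, user_means):
    """Compute w(a, i) for every user who shares an item with the active user.

    active_profile -- dict item -> rating of the active user
    inverted       -- dict item -> list of (user, rating)  (assumed layout)
    user_means     -- dict user -> average rating
    """
    sai = defaultdict(float)  # sum of (v_aj - mean_a) * (v_ij - mean_i)
    saa = defaultdict(float)  # sum of (v_aj - mean_a) ** 2
    sii = defaultdict(float)  # sum of (v_ij - mean_i) ** 2
    # Only the lists of items the active user rated are ever read.
    for item, r_a in active_profile.items():
        da = r_a - mean_a
        for user, r_i in inverted.get(item, []):
            di = r_i - user_means[user]
            sai[user] += da * di
            saa[user] += da * da
            sii[user] += di * di
    weights = {}
    for user in sai:
        denom = sqrt(saa[user] * sii[user])
        weights[user] = sai[user] / denom if denom > 0 else 0.0
    return weights
```

Each posting is touched once, so the work is proportional to the total length of the inverted lists of the relevant items rather than to the number of candidates times the profile length.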

Early termination
- Self-Indexing Inverted Files for Fast Text Retrieval, Alistair Moffat and Justin Zobel, 1994
- Quit: stop when the number of users reaches a threshold
- Continue: stop considering new users when the number of users reaches a threshold
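
The two strategies can be contrasted with a small sketch. It is a simplification under stated assumptions: the accumulator only counts co-rated items (a stand-in for the real partial sums), the quit check is made once per inverted list, and all names are hypothetical.

```python
def accumulate(active_items, inverted, threshold, strategy="continue"):
    """Build per-user accumulators with early termination.

    active_items -- items rated by the active user, in processing order
    inverted     -- dict item -> list of (user, rating)  (assumed layout)
    strategy     -- "quit" or "continue"
    """
    acc = {}
    for item in active_items:
        for user, _rating in inverted.get(item, []):
            if user in acc:
                acc[user] += 1          # known user: always updated
            elif len(acc) < threshold:
                acc[user] = 1           # new user: only while below the threshold
        if strategy == "quit" and len(acc) >= threshold:
            break                       # quit: stop reading further lists
    return acc
```

Under "continue" the remaining lists are still read, so the users already admitted get complete sums; under "quit" the remaining lists are skipped entirely.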

Item-based
- The problem is symmetric: exchange the roles of rows (user profiles) and columns (movie profiles)
- Look for movies which are similar to the active movie
- If users act similarly on both movies, the active user may act similarly too

Item-based example
- The users act exactly the same on i_2 and i_a
- Perhaps i_2 and i_a are very similar
- “?” may be 1, as u_a gave i_2 a rating of 1

        i_1  i_2  i_3  i_4  i_a
u_a      1    1    3    4    ?
(rows u_1 to u_5: values not preserved)

Sarwar et al. 2001
- Precompute the top-k similar items
Amazon.com
- Personalized promotion based on the top-k similar items

Slope-one
- Does not only find similar items
- Measures the pattern between items
- Lemire & Maclachlan 2005

Slope-one
- For an item pair j and i
- Over all users who rated both items
- Find the average difference in rating, dev_{j,i}

Slope-one
- A prediction is made based on dev_{j,i}
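
The formulas on the two Slope-one slides were not preserved. The sketch below assumes the basic (unweighted) scheme from Lemire & Maclachlan: dev_{j,i} is the average of (rating on j minus rating on i) over users who rated both items, and the prediction for item j averages dev_{j,i} + u_i over the items i that the active user rated. Function names and the dict layout are illustrative.

```python
def deviation(profiles, j, i):
    """Average of (rating on j - rating on i) over users who rated both.

    Returns None when no user rated both items.
    """
    diffs = [p[j] - p[i] for p in profiles.values() if j in p and i in p]
    return sum(diffs) / len(diffs) if diffs else None


def slope_one_predict(profiles, active_profile, j):
    """Predict the active user's rating on item j (basic Slope One)."""
    estimates = []
    for i, rating in active_profile.items():
        if i == j:
            continue
        dev = deviation(profiles, j, i)
        if dev is not None:
            estimates.append(dev + rating)  # shift the known rating by dev_{j,i}
    return sum(estimates) / len(estimates) if estimates else None
```

In practice the deviations would be precomputed once for all item pairs instead of being recomputed per query as in this sketch.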

Slope-one example
- All users rated i_a higher than i_3 by 1
- Considering i_a and i_3, u_a may rate “?” as 4

        i_1  i_2  i_3  i_4  i_a
u_a      1    4    3    4    ?
(rows u_1 to u_5: values not preserved)

Summary
- A common argument
  - There are fewer items than users
- Pre-computation
  - Similarity in item-based
  - dev_{j,i} in slope-one

Hybrid method
- Finding the top-k similar users
- Brute force
  - Inefficient when the number of candidates is large
- Inverted file
  - Inefficient when the number of relevant items is large
- Mix the two

Hybrid method
- Inverted file again
- The files are segmented according to ratings

Segmented inverted file example (inverted list of I_1)
- All users here gave I_1 a rating of 5
- All users here gave I_1 a rating of 4
- All users here gave I_1 a rating of 3
- All users here gave I_1 a rating of 2
- All users here gave I_1 a rating of 1

Accessing the segmented inverted file
- First access the segments which are closer to the active user’s rating

Access example (inverted list of I_1)
- Access order 1, d = 0 (u_a is here)
- Access order 2, d = 1
- Access order 3, d = 1
- Access order 4, d = 2
- Access order 5, d = 3
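
The access order in this example can be reproduced with a short sketch. It assumes a 1 to 5 rating scale and breaks ties between equally distant segments by taking the higher rating first; the slide does not say how ties are resolved, and the function name is hypothetical.

```python
def segment_access_order(active_rating, ratings=(1, 2, 3, 4, 5)):
    """Return (segment_rating, d) pairs in the order the segments are read.

    d is the distance between the segment's rating and the active user's
    rating on the same item. Tie-break (higher rating first) is an assumption.
    """
    pairs = ((r, abs(r - active_rating)) for r in ratings)
    return sorted(pairs, key=lambda rd: (rd[1], -rd[0]))
```

For an active rating of 4 this gives the distances 0, 1, 1, 2, 3, matching the sequence listed on the slide.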

Accessing the segmented inverted file
- The inverted file is a list ranked on d (distance to u_a’s rating)
- The best bound on similarity can be found

Algorithm phase 1
- Access all inverted lists, such that all d = 0 segments are loaded
- Starting from the most frequently seen candidates, find the actual similarity (k candidates are needed in total)
- The similarity of the k-th candidate whose actual similarity is known becomes the initial filter
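
Phase 1 can be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: the segmented inverted file is modelled as a nested dict, `actual_sim` is a hypothetical callable that performs the random access and returns a candidate's exact similarity, and `candidates` is the set of users who rated the queried item.

```python
from collections import Counter


def phase_one(active_profile, segments, candidates, actual_sim, k):
    """Load the d = 0 segments and derive the initial filter.

    active_profile -- dict item -> rating of the active user
    segments       -- dict item -> {rating: [users]}  (assumed layout)
    candidates     -- set of users who rated the queried item
    actual_sim     -- callable user -> exact similarity (hypothetical)
    Returns (evaluated, initial_filter); the filter is None when fewer
    than k candidates were seen in the d = 0 segments.
    """
    seen = Counter()
    for item, rating in active_profile.items():
        # d = 0 segment: users who gave this item the same rating as u_a.
        same_rating = segments.get(item, {}).get(rating, [])
        seen.update(u for u in same_rating if u in candidates)
    # Evaluate the k most frequently seen candidates exactly.
    evaluated = {u: actual_sim(u) for u, _count in seen.most_common(k)}
    if len(evaluated) < k:
        return evaluated, None
    # The k-th best exact similarity becomes the pruning threshold.
    initial_filter = sorted(evaluated.values(), reverse=True)[k - 1]
    return evaluated, initial_filter
```

Candidates that agree exactly with the active user on many items are likely to be similar, so evaluating them first tends to give a high initial filter.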

[Figure: inverted lists of I_1 to I_5 (animation frames; contents not preserved)]

Algorithm phase 1 example

candidate   actual similarity
u3          0.89
u8          0.88
…           …
u1          0.77
u9          0.70

(figure labels: filter, K)

Algorithm phase 2: keep loading from the inverted lists
- The best bound of the similarity decreases
- A candidate whose similarity bound is worse than the filter is pruned
- The partial information becomes more complete
- Update the filter after some number of segments are loaded
- Stop when the number of remaining candidates is small

Algorithm phase 2
- In the implementation, the items that u_a rated extremely (close to 1 or 5) are loaded first
- The candidates’ best bounds then drop faster

[Figure: inverted lists of I_1 to I_5 (animation frames; contents not preserved)]

Similarity measure
- Additive, L1
- Segmental Manhattan Distance (SMD) = Manhattan distance / number of relevant items
- Sim = 1 - SMD / (maximum distance)
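
A minimal sketch of this measure. It assumes that the relevant items are the items rated by both users and that the maximum distance is 4 (a 1 to 5 rating scale); neither is stated on the slide, and the function name is hypothetical.

```python
def smd_similarity(profile_a, profile_i, max_distance=4):
    """Similarity from the Segmental Manhattan Distance (SMD).

    Profiles are dicts item -> rating. Returns None when the two users
    share no item, since the distance is then undefined.
    """
    common = profile_a.keys() & profile_i.keys()
    if not common:
        return None
    manhattan = sum(abs(profile_a[j] - profile_i[j]) for j in common)
    smd = manhattan / len(common)      # average per-item distance
    return 1 - smd / max_distance      # 1 = identical, 0 = maximally distant
```

Because the distance is a plain sum over items, a partial sum over the segments loaded so far can be turned into a bound on the final value, which is what makes the measure convenient for the inverted-file algorithm.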

Horting
- Ensures that the intersection of items is large enough
- Aggarwal et al.

Horting

        i_1  i_2  i_3  i_4  i_5  i_a
u_a      1    2    3    4    5    ?
(rows u_1 and u_2: values not preserved)

- Sim(u_a, u_1) = Sim(u_a, u_2)
- u_2 is less reliable

Best bound
- We have ‘user num of appearance’
- ‘max num of more appearance’ = min(u_a_profile.len, u_i_profile.len) - ‘user num of appearance’

if the user has never been seen in any segment
    best distance = 1
else if (partial distance > 1)
    the user appears in unseen items, with d = 1
else if (‘max num of more appearance’ < horting_factor)
    the user appears only enough times
else
    the user does not appear any more; the partial distance is the best

No random access
- The inverted file is a list ranked on d (distance to u_a’s rating)
- Nikos Mamoulis, Kit Hung Cheng, Man Lung Yiu, and David W. Cheung 2006
- Phase 1: do not find any actual similarity until the best bound of an unseen user is worse than the k-th best worst bound

[Figure: inverted lists of I_1 to I_5 (animation frames; contents not preserved)]

Worst bound
- While a user’s partial distance is smaller than the maximum possible distance, include the distance

No random access
- Phase 2: find the actual similarity and prune candidates

Experiment
- Netflix dataset: users and movies (counts not preserved), 100 million ratings (1.17%)
- k = 50
- h = 10

Efficiency
- Brute force: (value not preserved) s per query
- Hybrid: 25.85 s per query
- NRA: 59.34 s per query

Disk IO statistics (hybrid)
- % of actual similarity computations: 7.60%
- % of entries loaded from the inverted file: 68.52%
- % of entries which were loaded and relevant: 49.77%

References
- Breese et al.: Empirical Analysis of Predictive Algorithms for Collaborative Filtering
- Coster & Svensson 2002: Inverted File Search Algorithms for Collaborative Filtering
- Lemire & Maclachlan 2005: Slope One Predictors for Online Rating-Based Collaborative Filtering
- Sarwar et al. 2001: Item-Based Collaborative Filtering Recommendation Algorithms
- Aggarwal et al.: Horting Hatches an Egg: A New Graph-Theoretic Approach to Collaborative Filtering
- Nikos Mamoulis, Kit Hung Cheng, Man Lung Yiu, and David W. Cheung 2006: Efficient Aggregation of Ranked Inputs