Clustering Using Pairwise Comparisons

Similar presentations
1+eps-Approximate Sparse Recovery Eric Price MIT David Woodruff IBM Almaden.

Online Multi-camera Tracking with a Switching State-Space Model, Wojciech Zajdel, A. Taylan Cemgil, and Ben Kröse, ICPR 2004.
The Power of Convex Relaxation: Near-Optimal Matrix Completion, Emmanuel J. Candes and Terence Tao, March 2009. Presenter: Shujie Hou, February,
Chap 8: Estimation of parameters & Fitting of Probability Distributions Section 6.1: INTRODUCTION Unknown parameter(s) values must be estimated before.
Visual Recognition Tutorial
The Goldreich-Levin Theorem: List-decoding the Hadamard code
Michael Bender - SUNY Stony Brook Dana Ron - Tel Aviv University Testing Acyclicity of Directed Graphs in Sublinear Time.
© John M. Abowd 2005, all rights reserved Statistical Tools for Data Integration John M. Abowd April 2005.
Common Voting Rules as Maximum Likelihood Estimators Vincent Conitzer and Tuomas Sandholm Carnegie Mellon University, Computer Science Department.
Clustering In Large Graphs And Matrices Petros Drineas, Alan Frieze, Ravi Kannan, Santosh Vempala, V. Vinay Presented by Eric Anderson.
Preference Analysis Joachim Giesen and Eva Schuberth May 24, 2006.
Anomaly Detection. Anomaly/Outlier Detection  What are anomalies/outliers? The set of data points that are considerably different than the remainder.
Visual Recognition Tutorial
Data Mining Practical Machine Learning Tools and Techniques Chapter 4: Algorithms: The Basic Methods Section 4.6: Linear Models Rodney Nielsen Many of.
CS 782 – Machine Learning Lecture 4 Linear Models for Classification  Probabilistic generative models  Probabilistic discriminative models.
Learning Theory Reza Shadmehr LMS with Newton-Raphson, weighted least squares, choice of loss function.
Parameter estimation. 2D homography Given a set of (x i,x i ’), compute H (x i ’=Hx i ) 3D to 2D camera projection Given a set of (X i,x i ), compute.
CURE: EFFICIENT CLUSTERING ALGORITHM FOR LARGE DATASETS VULAVALA VAMSHI PRIYA.
Pairwise Preference Regression for Cold-start Recommendation Speaker: Yuanshuai Sun
1 Estimating Structured Vector Autoregressive Models Igor Melnyk and Arindam Banerjee.
Common Voting Rules as Maximum Likelihood Estimators - Matthew Kay. Common Voting Rules as Maximum Likelihood Estimators, Vincent Conitzer,
Statistical Inference for the Mean Objectives: (Chapter 8&9, DeCoursey) -To understand the terms variance and standard error of a sample mean, Null Hypothesis,
Ranking: Compare, Don’t Score Ammar Ammar, Devavrat Shah (LIDS – MIT) Poster ( No preprint), WIDS 2011.
Charles University STAKAN III
Data Science Credibility: Evaluating What’s Been Learned
Machine Learning: Ensemble Methods
Inferential Statistics
Visual Recognition Tutorial
12. Principles of Parameter Estimation
New Characterizations in Turnstile Streams with Applications
LECTURE 11: Advanced Discriminant Analysis
MIRA, SVM, k-NN Lirong Xia.
Inference for the mean vector
APPROACHES TO QUANTITATIVE DATA ANALYSIS
Spectral Clustering.
Matrix Completion from a few entries
CS 4/527: Artificial Intelligence
Fitting Curve Models to Edges
Distinct Distances in the Plane
Outlier Discovery/Anomaly Detection
RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis Yizhou Sun, Jiawei Han, Peixiang Zhao, Zhijun Yin, Hong Cheng,
Selection in heaps and row-sorted matrices
Latent Dirichlet Analysis
Matrix Martingales in Randomized Numerical Linear Algebra
Estimating Networks With Jumps
Multidimensional Integration Part I
Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.
10701 / Machine Learning Today: - Cross validation,
Pairwise Sequence Alignment (cont.)
General Strong Polarization
The coalescent with recombination (Chapter 5, Part 1)
What can we know from RREF?
The Byzantine Secretary Problem
Ensemble learning.
Generally Discriminant Analysis
Mathematical Foundations of BME
Rank-Sparsity Incoherence for Matrix Decomposition
Learning Theory Reza Shadmehr
Probabilistic Latent Preference Analysis
The loss function, the normal equation,
Mathematical Foundations of BME Reza Shadmehr
Maths for Signals and Systems Linear Algebra in Engineering Lecture 6, Friday 21st October 2016 DR TANIA STATHAKI READER (ASSOCIATE PROFESSOR) IN SIGNAL.
Parametric Methods Berlin Chen, 2005 References:
Mathematical Foundations of BME
Data Mining Anomaly Detection
12. Principles of Parameter Estimation
CS200: Algorithm Analysis
MIRA, SVM, k-NN Lirong Xia.
Data Mining Anomaly Detection
Presentation transcript:

Clustering Using Pairwise Comparisons. R. Srikant, ECE/CSL, University of Illinois at Urbana-Champaign

Coauthors: Barbara Dembin, Siddhartha Satpathi. Builds on the work in R. Wu, J. Xu, R. Srikant, L. Massoulie, M. Lelarge, and B. Hajek, Clustering and Inference from Pairwise Comparisons (arXiv:1502.04631v2)

Outline: (1) Traditional Noisy Pairwise Comparisons; (2) Our Problem: Clustering Users; (3) Algorithm in Prior Work; (4) New Algorithm; (5) Conclusions.

Noisy pairwise comparisons. Example (Amazon DSLR cameras): the user buys item 2, which reveals item 1 < item 2 and item 3 < item 2. Goal: infer information about user preferences from such pairwise rankings.

Bradley-Terry model. Item $i$ is associated with a score $\theta_i$, and
$$P(\text{item } i \text{ is preferred over item } j) = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}}.$$
Goal: estimate the vector $\theta$ from the pairwise comparisons. Assumption: all users belong to one cluster, i.e., have the same $\theta$ vector, so we can aggregate the results from all users to estimate $\theta$.
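
As a concrete illustration of the model above, here is a minimal Python sketch (not from the talk) of the Bradley-Terry preference probability; the score vector `theta` is an assumed toy example.

```python
import numpy as np

def bt_preference_prob(theta, i, j):
    """P(item i is preferred over item j) under the Bradley-Terry model."""
    # Equivalent to a logistic function of the score difference theta[i] - theta[j].
    return np.exp(theta[i]) / (np.exp(theta[i]) + np.exp(theta[j]))

# Toy example: three items with assumed scores.
theta = np.array([0.0, 1.0, -0.5])
print(bt_preference_prob(theta, 1, 0))  # ~0.73: item 1 is usually preferred over item 0
```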

The data about the $m$ items: a matrix whose columns are indexed by the item pairs (1, 2), (1, 3), ..., (1, m), (2, 3), ..., (m-1, m), and whose entries are +1 or -1 recording which item of each compared pair was preferred.

Maximum likelihood estimation. Let $R_{ij}$ be the number of times item $i$ is preferred over item $j$. The maximum likelihood estimate is
$$\hat{\theta} = \arg\max_{\gamma} L(\gamma), \qquad L(\gamma) = \sum_{i,j} R_{ij} \log \frac{e^{\gamma_i}}{e^{\gamma_i} + e^{\gamma_j}}.$$
Well studied: (Hunter 2004), (Negahban, Oh, Shah 2014). Non-parametric: Shah and Wainwright (2016).
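
A hedged sketch of the maximum likelihood estimation step, using a generic optimizer on the negative log-likelihood; the count matrix `R` is an assumed toy input, not data from the talk.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(gamma, R):
    """-L(gamma) = -sum_{i,j} R[i,j] * log( e^{gamma_i} / (e^{gamma_i} + e^{gamma_j}) )."""
    m = len(gamma)
    nll = 0.0
    for i in range(m):
        for j in range(m):
            if i != j and R[i, j] > 0:
                # log( e^{g_i} / (e^{g_i} + e^{g_j}) ) = g_i - logaddexp(g_i, g_j)
                nll -= R[i, j] * (gamma[i] - np.logaddexp(gamma[i], gamma[j]))
    return nll

# R[i, j] = number of times item i was preferred over item j (toy counts).
R = np.array([[0, 3, 1],
              [7, 0, 2],
              [9, 8, 0]])
res = minimize(neg_log_likelihood, x0=np.zeros(R.shape[0]), args=(R,))
# The scores are identifiable only up to an additive constant, so center the estimate.
theta_hat = res.x - res.x.mean()
print(theta_hat)
```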

Outline: (1) Traditional Noisy Pairwise Comparisons; (2) Our Problem: Clustering Users; (3) Algorithm in Prior Work; (4) New Algorithm; (5) Conclusions.

Clustering Users & Ranking Items. Different types of users use different score vectors (e.g., different users prefer different Amazon cameras). Cluster users of the same type together, and then estimate the Bradley-Terry parameters for each cluster.

Generalized Bradley-Terry model. There are $n$ users and $m$ items ($n, m \to \infty$). Users are in $r$ clusters ($r$ is a constant); users in cluster $k$ have the same score vector $\theta_k$:
$$P(\text{item } i \text{ is preferred over item } j) = \frac{e^{\theta_{k,i}}}{e^{\theta_{k,i}} + e^{\theta_{k,j}}}.$$
Each user compares a pair of items with probability $1-\epsilon$; we want $\epsilon$ to be close to 1 (i.e., each user makes only a few comparisons).
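
For concreteness, a sketch of the generative model just described (n users in r clusters, each user comparing each pair of items with probability 1 − ε); all function and variable names are illustrative, and the cluster score vectors are drawn at random rather than taken from any real data.

```python
import numpy as np
from itertools import combinations

def simulate_comparisons(n_users, m_items, r_clusters, eps, rng):
    """Generate the user-by-pair observation matrix under the generalized Bradley-Terry model.

    Entry = +1 if the user preferred the first item of the pair, -1 if the second,
    and 0 if that user did not compare the pair (each pair is compared w.p. 1 - eps).
    """
    pairs = list(combinations(range(m_items), 2))
    labels = rng.integers(r_clusters, size=n_users)      # true cluster of each user
    theta = rng.normal(size=(r_clusters, m_items))       # one score vector per cluster
    obs = np.zeros((n_users, len(pairs)))
    for u in range(n_users):
        for c, (i, j) in enumerate(pairs):
            if rng.random() < 1 - eps:                    # this pair is compared by user u
                p = np.exp(theta[labels[u], i])
                p = p / (p + np.exp(theta[labels[u], j]))
                obs[u, c] = 1 if rng.random() < p else -1
    return obs, labels, pairs

rng = np.random.default_rng(0)
obs, true_labels, pairs = simulate_comparisons(n_users=200, m_items=30, r_clusters=3,
                                               eps=0.9, rng=rng)
```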

Observation Matrix. Rows are users and columns are the item pairs (1, 2), (1, 3), ..., (1, m), (2, 3), ..., (m-1, m); each entry is +1 or -1 according to which item of the pair the user preferred.

Observation Matrix with missing data. Each user compares only a small fraction of the pairs, so most entries are unobserved ('?'); the observed entries are +1 or -1.

Questions. We focus on the clustering problem: once users are clustered, parameter estimation can be performed using other techniques, and the results here don't explicitly depend on the Bradley-Terry model. What is the minimum number of samples (pairwise comparisons) needed to cluster the users from pairwise comparison data? What algorithm should we use to achieve this limit? We will answer these questions in reverse order.

Outline: (1) Traditional Noisy Pairwise Comparisons; (2) Our Problem: Clustering Users; (3) Algorithm in Prior Work; (4) New Algorithm; (5) Conclusions.

Net Wins Matrix. Example with 4 items: a user's row of comparison outcomes, indexed by the pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4), is converted into a row indexed by the items 1, 2, 3, 4, where the entry for item $i$ is the number of comparisons item $i$ won minus the number it lost for that user. [Table in the slide: the pairwise comparison row and the corresponding net wins row.]
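
A short sketch of the conversion just described, assuming the user-by-pair observation matrix `obs` and the pair list `pairs` from the earlier simulation sketch (0 marks an unobserved comparison, so unobserved pairs contribute nothing).

```python
import numpy as np

def net_wins(obs, pairs, m_items):
    """User-by-item Net Wins matrix: entry (u, i) = (# comparisons item i won) - (# it lost) for user u."""
    W = np.zeros((obs.shape[0], m_items))
    for c, (i, j) in enumerate(pairs):
        W[:, i] += obs[:, c]   # +1 when i beat j, -1 when i lost to j, 0 when unobserved
        W[:, j] -= obs[:, c]
    return W

# W = net_wins(obs, pairs, m_items=30)   # using the simulated data from the earlier sketch
```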

Why the Net Wins matrix? The original pairwise comparison data is very noisy unless the same pair of items is shown to the same user many times (which is not the case in our model). The Net Wins matrix reduces the $\sim m^2$ comparisons for each user to information about the $m$ items, which makes the data less noisy.

Spectral Clustering: clustering the rows of the Net Wins matrix. Step 1: the expected Net Wins matrix has only $r$ independent rows, so the observed Net Wins matrix has a singular value distribution with $r$ dominant singular values. [Figure: singular value distribution for an example with $r = 10$.]

Spectral Clustering, Step 2: perform a singular value decomposition ($\sigma_1 > \dots > \sigma_n$), retain only the top $r$ singular values, and set the rest equal to zero.

Spectral Clustering, Step 3: cluster the rows of the rank-$r$ projection, using the K-means algorithm, for example.
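
Steps 1-3 in code: a minimal sketch of spectral clustering of the rows of the Net Wins matrix, using NumPy's SVD and scikit-learn's k-means (function names are mine, not the authors').

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster_rows(W, r, random_state=0):
    """Rank-r projection of the Net Wins matrix W, then k-means on its rows."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Keep only the top r singular values and zero out the rest (rank-r projection).
    W_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
    # Cluster the rows of the projected matrix into r clusters.
    return KMeans(n_clusters=r, n_init=10, random_state=random_state).fit_predict(W_r)

# labels = spectral_cluster_rows(W, r=3)
```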

Result from Prior Work (assume $m = n$). With $r^2 \log^3 n$ pairwise comparisons per user, at most $K \log n$ users are misclustered with high probability. While the fraction of misclustered users goes to zero, the rate at which it goes to zero is not satisfactory. Moreover, to prove that perfect clustering (all users correctly clustered with high probability) is achieved, we need $n r^2 \log^5 n$ pairwise comparisons per user. Can we prove that perfect clustering is achieved with high probability with far fewer comparisons? Yes, by tweaking the previous algorithm (spectral clustering on the Net Wins matrix).

Outline: (1) Traditional Noisy Pairwise Comparisons; (2) Our Problem: Clustering Users; (3) Algorithm in Prior Work; (4) New Algorithm; (5) Conclusions.

Outline of the Algorithm. Split the items into different partitions, and only consider the pairwise comparison data within each partition (inspired by (Vu, 2014) for community detection). Apply the previous algorithm to each data partition, and cluster the users based on the information in each partition. This can result in inconsistent clusters: users 1 and 2 may be in the same cluster in one partition but not in another; which of these clusterings is correct? Use simple majority voting to correct errors, i.e., assign each user to the cluster to which it belongs most often.

Data Partitioning. Split the items into $L$ sets. Example with $L = 2$ and six items: the set {1, 2, 3} keeps only the columns (1,2), (1,3), (2,3) of the comparison matrix, and the set {4, 5, 6} keeps only the columns (4,5), (4,6), (5,6); comparisons between items in different sets, e.g. (1,4), are discarded (note: some data is lost). This yields $L$ pairwise comparison matrices and hence $L$ Net Wins matrices.
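
A sketch of the partitioning step under the same assumed data layout as the earlier sketches (items are assigned to sets uniformly at random; comparisons whose two items fall in different sets are dropped).

```python
import numpy as np

def partition_pairs(pairs, m_items, L, rng):
    """Split items uniformly into L sets; for each set, keep only the columns whose
    pair lies entirely inside that set (cross-partition comparisons are discarded)."""
    item_part = rng.integers(L, size=m_items)        # which set each item falls into
    col_idx = [[] for _ in range(L)]
    for c, (i, j) in enumerate(pairs):
        if item_part[i] == item_part[j]:
            col_idx[item_part[i]].append(c)
    return item_part, col_idx

# item_part, col_idx = partition_pairs(pairs, m_items=30, L=4, rng=np.random.default_rng(1))
# obs_l = obs[:, col_idx[l]]   # pairwise comparison matrix restricted to partition l
```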

Cluster Users Based on Each Partition. Apply spectral clustering to each of the $L$ Net Wins matrices (one per item partition), producing $L$ different clusterings of the users, each into clusters numbered 1, ..., $r$. [Diagram: $L$ Net Wins matrices → spectral clustering → $L$ clusterings.]

Numbering the Clusters. Number the clusters 1, 2, ..., $r$ arbitrarily in the first data partition. For the second partition, the cluster that overlaps the most with cluster 1 of Partition 1 is called cluster 1, the cluster that overlaps the most with cluster 2 of Partition 1 is called cluster 2, and so on. [Table: example with 4 partitions and 3 clusters; Partition 1 is labeled 1, 2, 3, and the labels of Partitions 2-4 are not yet determined.]

Numbering the Clusters (continued). Number the clusters 1, 2, ..., $r$ in the results from the first data partition; for each subsequent partition, the cluster that overlaps the most with cluster 1 of Partition 1 is called cluster 1, the cluster that overlaps the most with cluster 2 is called cluster 2, and so on. [Table: the same example with the matching completed for Partitions 2-4.]

Clustering the Users. A user may belong to cluster 1 in one partition but to some other cluster in another partition; majority voting determines the final cluster for each user. [Example: $L = 4$ data partitions, $r = 3$ clusters; user $u$ is assigned to cluster 2, the label it receives most often across the partitions.]
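
A sketch of the relabeling-plus-voting step described on the last few slides: clusterings from partitions 2, ..., L are greedily matched to Partition 1 by overlap, and each user then gets the label it receives most often (a simple interpretation of the slides, not the authors' exact code).

```python
import numpy as np

def align_and_vote(label_lists, r):
    """Relabel the clusterings from partitions 2..L to best match partition 1,
    then assign each user to the cluster it belongs to most often."""
    aligned = [np.asarray(label_lists[0])]
    for labels in (np.asarray(lbl) for lbl in label_lists[1:]):
        mapping, used = {}, set()
        for k in range(r):                       # clusters of partition 1, in order
            overlaps = [np.sum((aligned[0] == k) & (labels == c)) if c not in used else -1
                        for c in range(r)]
            best = int(np.argmax(overlaps))      # cluster of this partition overlapping k the most
            mapping[best] = k
            used.add(best)
        aligned.append(np.array([mapping[c] for c in labels]))
    votes = np.stack(aligned)                    # shape (L, n_users)
    # Majority vote: the label each user receives most often across the L partitions.
    return np.array([np.bincount(votes[:, u], minlength=r).argmax()
                     for u in range(votes.shape[1])])
```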

Summary of the algorithm. Partition the items uniformly into $L$ sets; form a Net Wins matrix for each partition; run spectral clustering on each to obtain $L$ clusterings of the users into $r$ clusters; then combine them by majority voting to get the final clustering. [Diagram: $L$ Net Wins matrices → spectral clustering → majority voting → final clustering of users.]
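
An end-to-end sketch tying together the hypothetical helpers from the earlier code fragments (simulate_comparisons, partition_pairs, net_wins, spectral_cluster_rows, align_and_vote); it assumes those functions are in scope.

```python
import numpy as np

def cluster_users(obs, pairs, m_items, r, L, rng):
    """Partition items, spectral-cluster each partition's Net Wins matrix,
    then combine the L clusterings by label alignment and majority voting."""
    item_part, col_idx = partition_pairs(pairs, m_items, L, rng)
    label_lists = []
    for l in range(L):
        sub_pairs = [pairs[c] for c in col_idx[l]]
        # Items outside partition l give all-zero columns, which do not affect row clustering.
        W_l = net_wins(obs[:, col_idx[l]], sub_pairs, m_items)
        label_lists.append(spectral_cluster_rows(W_l, r))
    return align_and_vote(label_lists, r)

# final_labels = cluster_users(obs, pairs, m_items=30, r=3, L=4, rng=np.random.default_rng(2))
```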

Main Result. Previous result: if more than $n r^2 \log^5 n$ pairwise comparisons per user are available, all users are clustered w.p. at least $1 - 1/n$. New result: if more than $r \log^5 n$ pairwise comparisons per user are available, all users are correctly clustered w.p. at least $1 - 1/n$. Key idea: spectral clustering alone results in many incorrectly clustered users, so split the items into many groups, perform spectral clustering on each, and combine the results using majority voting. This works despite the loss of data in the partitioning process, and the idea applies to more general models than the Bradley-Terry model.

Outline of the Proof: Part I. Two rows of the expected Net Wins matrix belonging to different clusters are well separated: $\|S_u - S_v\|_2 > C_1 (1-\epsilon)\, n$ (by assumption). Let $P_r(\cdot)$ be the rank-$r$ projection. Using concentration inequalities, $\|P_r(S_u) - S_u\|_2 \le C_2 \log^{3/2} n \,(1-\epsilon)$.

Outline of Proof: Part II. All the clusters are well separated with high probability if we have a lot of measurements (as in the previous paper). But with fewer measurements, the probability of misclustering is some $\delta$ that does not go to zero as $n \to \infty$. [Figure: the projected rows $S_u$ and $S_v$ in the two regimes.]

Outline of the Proof: Part III. Partition the items into $L$ sets. In each set, user $u$ is misclustered w.p. $\delta$. By the Chernoff bound,
$$P(u \text{ is misclustered in more than } L/2 \text{ sets}) < \exp\!\big(-(\delta - 0.5)^2 L/2\big).$$
For $L = C \log n$, majority voting clusters all users correctly.
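
The union-bound step behind the last claim, written out as a sketch (the condition on the constant $C$ is illustrative and assumes $\delta < 1/2$ and independence across partitions, which holds because the partitions use disjoint comparisons):

```latex
\begin{align*}
P(\text{some user is misclustered by the majority vote})
  &\le \sum_{u=1}^{n} P\big(u \text{ is misclustered in more than } L/2 \text{ sets}\big) \\
  &\le n \exp\!\big(-(\delta - 0.5)^2 L/2\big)
   = n^{\,1 - C(\delta - 0.5)^2/2} \le \frac{1}{n}
   \quad \text{for } L = C\log n,\; C \ge \tfrac{4}{(\delta - 0.5)^2},
\end{align*}
```

which matches the $1 - 1/n$ success probability in the main result.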

Lower Bound on Sample Complexity. Event A: two users from different clusters have no pairwise comparisons. If A occurs, not all users can be clustered correctly. $P(A) \to 1$ as $n \to \infty$ when $1 - \epsilon < O\!\left(\frac{\log n}{n^2}\right)$.

Main Result. If more than $r \log^5 n$ pairwise comparisons per user are available, all users are correctly clustered w.p. at least $1 - 1/n$. The number of comparisons required is within a polylog factor of the lower bound. Assumption required for the main result: the rows of the expected Net Wins matrix belonging to different clusters are well separated.

Related Work. Vu (2014): exact cluster recovery in community detection through spectral methods; partitions the data into two sets, using one for clustering and the other to correct errors in the recovered clusters. Lu-Negahban (2014): the Bradley-Terry parameters are different for each user but form a low-rank matrix. Park, Neeman, Zhang, Sanghavi (2015): related to the model above, but with a different algorithm. Oh, Thekumparampil, Xu (2015): generalization to multi-item rankings.

Conclusions. An algorithm that achieves perfect clustering with high probability: majority voting over spectral clusterings of different data partitions. The number of samples required is within a poly($\log n$) factor of a lower bound.