Clustering Using Pairwise Comparisons R. Srikant ECE/CSL University of Illinois at Urbana-Champaign
Coauthors: Barbara Dembin, Siddhartha Satpathi. Builds on the work in R. Wu, J. Xu, R. Srikant, L. Massoulie, M. Lelarge, and B. Hajek, Clustering and Inference from Pairwise Comparisons (arXiv:1502.04631v2).
Outline Traditional Noisy Pairwise Comparisons Our Problem: Clustering users Algorithm in Prior Work New Algorithm Conclusions
Noisy pairwise comparisons (example: Amazon DSLR search results). The user buys item 2 after seeing items 1, 2, and 3, revealing item 1 < item 2 and item 3 < item 2. Goal: infer information about user preferences from such pairwise rankings.
Bradley-Terry model. Item $i$ is associated with a score $\theta_i$, and
$$P(\text{item } i \text{ is preferred over item } j) = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}}.$$
Goal: estimate the vector $\theta$ from the pairwise comparisons. Assumption: all users belong to one cluster, i.e., have the same $\theta$ vector, so we can aggregate the results from all users to estimate $\theta$.
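To make the model concrete, here is a minimal Python sketch (mine, not from the talk) of the comparison probability and one simulated comparison; the score vector and items below are hypothetical:

```python
import numpy as np

def bt_prob(theta, i, j):
    """P(item i is preferred over item j) under the Bradley-Terry model."""
    # Equivalent to exp(theta[i]) / (exp(theta[i]) + exp(theta[j])),
    # written in logistic form for numerical stability.
    return 1.0 / (1.0 + np.exp(theta[j] - theta[i]))

rng = np.random.default_rng(0)
theta = np.array([0.5, 1.2, -0.3])   # hypothetical score vector
p = bt_prob(theta, 0, 1)             # P(item 0 beats item 1)
i_wins = rng.random() < p            # one simulated noisy comparison
```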
The data about the $m$ items: each user contributes a row indexed by the item pairs $(1,2), (1,3), \dots, (1,m), (2,3), \dots, (m-1,m)$, with entry $+1$ if the first item of the pair was preferred and $-1$ if the second was.
Maximum likelihood estimation. Let $R_{ij}$ be the number of times item $i$ is preferred over item $j$. The maximum likelihood estimate is
$$\hat{\theta} = \arg\max_{\gamma} L(\gamma), \qquad L(\gamma) = \sum_{i,j} R_{ij} \log \frac{e^{\gamma_i}}{e^{\gamma_i} + e^{\gamma_j}}.$$
Well studied: (Hunter 2004), (Negahban, Oh, D. Shah 2014). Non-parametric: N. B. Shah and Wainwright (2016).
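As an illustration of how $\hat{\theta}$ can be computed, here is a hedged sketch that maximizes $L(\gamma)$ by plain gradient ascent (Hunter (2004) uses a more robust MM algorithm; gradient ascent is chosen here only for brevity):

```python
import numpy as np

def bt_mle(R, steps=2000, lr=0.05):
    """Gradient ascent on the Bradley-Terry log-likelihood.
    R[i, j] = number of times item i was preferred over item j."""
    m = R.shape[0]
    gamma = np.zeros(m)
    for _ in range(steps):
        P = 1.0 / (1.0 + np.exp(gamma[None, :] - gamma[:, None]))  # P[i,j] = P(i beats j)
        # dL/dgamma_i = sum_j R_ij (1 - P_ij) - sum_j R_ji P_ij
        grad = (R * (1.0 - P)).sum(axis=1) - (R.T * P).sum(axis=1)
        gamma += lr * grad
        gamma -= gamma.mean()  # scores are identifiable only up to a constant shift
    return gamma
```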
Outline Traditional Noisy Pairwise Comparisons Our Problem: Clustering users Algorithm in Prior Work New Algorithm Conclusions
Clustering Users & Ranking Items (example: Amazon camera search). Different types of users use different score vectors. Cluster users of the same type together, and then estimate the Bradley-Terry parameters for each cluster.
Generalized Bradley-Terry model. There are $n$ users and $m$ items ($n, m \to \infty$). Users are in $r$ clusters ($r$ is a constant); users in cluster $k$ share the score vector $\theta_k$, and for such a user
$$P(\text{item } i \text{ is preferred over item } j) = \frac{e^{\theta_{k,i}}}{e^{\theta_{k,i}} + e^{\theta_{k,j}}}.$$
Each user compares a pair of items with probability $1 - \epsilon$; we want to handle $\epsilon$ close to 1, i.e., few comparisons per user.
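A hedged simulation sketch of this generative model (function and variable names are mine; `labels[u]` gives user $u$'s cluster):

```python
import numpy as np

def simulate(theta, labels, eps, rng):
    """Generalized Bradley-Terry model: theta[k, i] is the score of item i
    in cluster k. Each user compares each pair (i, j) with prob. 1 - eps.
    Returns {user: [(i, j, winner), ...]}."""
    m = theta.shape[1]
    data = {}
    for u, k in enumerate(labels):
        comps = []
        for i in range(m):
            for j in range(i + 1, m):
                if rng.random() < 1.0 - eps:
                    p = 1.0 / (1.0 + np.exp(theta[k, j] - theta[k, i]))
                    comps.append((i, j, i if rng.random() < p else j))
        data[u] = comps
    return data
```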
Observation Matrix. Rows are users and columns are the item pairs $(1,2), (1,3), \dots, (1,m), (2,3), \dots, (m-1,m)$; an entry is $+1$ or $-1$ for an observed comparison, and missing ('?') for the pairs the user did not compare.
Questions. We focus on the clustering problem: once users are clustered, parameter estimation can be performed using other techniques, and the results here don't explicitly depend on the Bradley-Terry model. What is the minimum number of samples (pairwise comparisons) needed to cluster the users from pairwise comparison data? What algorithm should we use to achieve this limit? We will answer these questions in reverse order.
Outline Traditional Noisy Pairwise Comparisons Our Problem: Clustering users Algorithm in Prior Work New Algorithm Conclusions
Net Wins Matrix. For each user, the $\pm 1$ vector over the pairs $(1,2), (1,3), (1,4), (2,3), (2,4), (3,4)$ is converted into one entry per item: entry $i$ is the number of comparisons item $i$ won minus the number it lost. (Figure: the conversion for a 4-item example.)
Why a Net Wins Matrix? The original pairwise comparison data is very noisy, unless the same pair of items is shown to the same user many times (which is not the case in our model). The net wins matrix reduces the up-to-$\binom{m}{2}$ comparisons for each user to information about the $m$ items, making the data less noisy.
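A minimal sketch of the reduction, assuming the per-user `(i, j, winner)` representation from the simulation sketch above:

```python
import numpy as np

def net_wins(data, n, m):
    """Net wins matrix S: S[u, i] = (# comparisons item i won) minus
    (# comparisons item i lost) among user u's comparisons."""
    S = np.zeros((n, m))
    for u, comps in data.items():
        for i, j, winner in comps:
            loser = j if winner == i else i
            S[u, winner] += 1
            S[u, loser] -= 1
    return S
```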
Spectral Clustering: clustering the rows of the Net Wins Matrix. Step 1: the expected net wins matrix has only $r$ independent rows, so the observed net wins matrix has a singular value distribution with $r$ dominant values. (Figure: an example with $r = 10$.)
Step 2: perform a singular value decomposition with $\sigma_1 > \dots > \sigma_n$, retain only the top $r$ singular values, and set the rest equal to zero.
Step 3: cluster the rows of the rank-$r$ projection using, for example, the K-means algorithm.
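Steps 1-3 in one hedged sketch, using NumPy's SVD and scikit-learn's K-means (the library choices are mine, not the talk's):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(S, r):
    """Project the net wins matrix onto its top r singular values,
    then run K-means on the rows of the projection."""
    U, sig, Vt = np.linalg.svd(S, full_matrices=False)
    S_r = (U[:, :r] * sig[:r]) @ Vt[:r]  # rank-r projection of S
    return KMeans(n_clusters=r, n_init=10, random_state=0).fit_predict(S_r)
```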
Result from Prior Work (assume $m = n$). With $r^2 \log^3 n$ pairwise comparisons per user, at most $K \log n$ users are misclustered with high probability. While the fraction of misclustered users goes to zero, the rate at which it goes to zero is not satisfactory. Moreover, to prove that perfect clustering (all users correctly clustered with high probability) is achieved, we need $n r^2 \log^5 n$ pairwise comparisons per user. Can we prove that perfect clustering is achieved with high probability with far fewer comparisons? Yes, by tweaking the previous algorithm (spectral clustering on the net wins matrix).
Outline Traditional Noisy Pairwise Comparisons Our Problem: Clustering users Algorithm in Prior Work New Algorithm Conclusions
Outline of the Algorithm. Split the items into different partitions, and only consider the pairwise comparison data within each partition (inspired by (Vu, 2014) for community detection). Apply the previous algorithm to each data partition, clustering the users based on the information in each partition. This can produce inconsistent clusterings: users 1 and 2 may be in the same cluster in one partition, but not in another. Which of these clusterings is correct? Use simple majority voting to correct errors, i.e., assign each user to the cluster to which it belongs most often.
Data Partitioning. Split the items into $L$ sets. Example with 6 items and $L = 2$: partition 1 keeps the comparisons among items $\{1, 2, 3\}$, i.e., pairs $(1,2), (1,3), (2,3)$, and partition 2 keeps those among items $\{4, 5, 6\}$, i.e., pairs $(4,5), (4,6), (5,6)$; cross-partition pairs such as $(1,4)$ are discarded. This yields $L$ pairwise comparison matrices and $L$ net wins matrices. (Note: some data is lost.)
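A sketch of the partitioning step, again assuming the per-user `(i, j, winner)` representation; note that comparisons straddling two item sets are simply dropped:

```python
import numpy as np

def partition_items(m, L, rng):
    """Split the m items uniformly at random into L sets."""
    return np.array_split(rng.permutation(m), L)

def restrict(data, item_set):
    """Keep only the comparisons with both items inside item_set;
    cross-partition comparisons are lost."""
    s = set(int(i) for i in item_set)
    return {u: [(i, j, w) for (i, j, w) in comps if i in s and j in s]
            for u, comps in data.items()}
```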
Cluster Users Based on Each Partition. Apply spectral clustering to each of the $L$ net wins matrices (each over its own partition's items), producing $L$ different clusterings of the users into $r$ clusters.
Numbering the Clusters. The labels $1, 2, \dots, r$ produced in different partitions are arbitrary, so they must be matched across partitions. Number the clusters $1, 2, \dots, r$ arbitrarily in the results from the first data partition. For each subsequent partition, the cluster that overlaps the most with cluster 1 of partition 1 is called cluster 1, the cluster that overlaps the most with cluster 2 of partition 1 is called cluster 2, and so on. (Figure: with 4 partitions and $r = 3$, the labels $(1,2,3)$ of partition 1 induce the renumberings $(3,2,1)$, $(1,3,2)$, and $(2,1,3)$ in partitions 2-4.)
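A sketch of the renumbering. The slide matches clusters greedily by overlap; the version below uses the Hungarian method (`scipy.optimize.linear_sum_assignment`) to choose the permutation with maximum total overlap, a standard implementation of the same idea:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_labels(ref, other, r):
    """Renumber the clusters in `other` (labels in 0..r-1) to best match
    the reference clustering `ref` by overlap counts."""
    overlap = np.zeros((r, r), dtype=int)
    for a, b in zip(ref, other):
        overlap[a, b] += 1            # users with ref label a and other label b
    rows, cols = linear_sum_assignment(-overlap)  # maximize total overlap
    relabel = np.empty(r, dtype=int)
    relabel[cols] = rows              # old label -> matched reference label
    return relabel[np.asarray(other)]
```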
Clustering the Users. A user may belong to cluster 1 in one partition, but to some other cluster in another partition; majority voting determines the final cluster for each user. Example: with $L = 4$ data partitions and $r = 3$ clusters, a user $u$ assigned to cluster 2 in most of the partitions receives final label 2.
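Once the labels are aligned across partitions, the vote itself is simple; a minimal sketch:

```python
import numpy as np

def majority_vote(labelings):
    """labelings: (L, n) array of aligned cluster labels, one row per
    partition. Assign each user to the label it receives most often."""
    L, n = labelings.shape
    return np.array([np.bincount(labelings[:, u]).argmax() for u in range(n)])
```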
Summary of the algorithm. Partition the items uniformly into $L$ sets; form the net wins matrix for each partition; run spectral clustering on each of the $L$ net wins matrices to obtain $L$ clusterings into $r$ clusters; renumber the clusters consistently and combine the $L$ clusterings by majority voting to produce the final clustering of the users.
Main Result. Previous result: if more than $n r^2 \log^5 n$ pairwise comparisons per user are available, all users are correctly clustered w.p. at least $1 - 1/n$. New result: if more than $r \log^5 n$ pairwise comparisons per user are available, all users are correctly clustered w.p. at least $1 - 1/n$. Key idea: spectral clustering alone results in many incorrectly clustered users, so split the items into many groups, perform spectral clustering on each, and combine the results using majority voting. This works despite the loss of data in the partitioning process, and the idea extends to more general models than the Bradley-Terry model.
Outline of the Proof: Part I. Two rows of the expected net wins matrix belonging to different clusters are well separated: $\|\bar{S}_u - \bar{S}_v\|_2 > C_1 (1-\epsilon) n$ (by assumption). Let $P_r(\cdot)$ be the rank-$r$ projection. Using concentration inequalities, $\|P_r(S_u) - \bar{S}_u\|_2 \le C_2 \log^{3/2}(n) \sqrt{(1-\epsilon) n}$.
Outline of the Proof: Part II. All the clusters are well separated with high probability if we have a lot of measurements (as in the previous paper). But with fewer measurements, the probability of misclustering is some $\delta$ that does not go to zero as $n \to \infty$. (Figure: projected rows cluster around $\bar{S}_u$ and $\bar{S}_v$; with few measurements the two clouds overlap.)
Outline of the Proof: Part III. Partition the items into $L$ sets. In each set, user $u$ is misclustered w.p. $\delta < 1/2$. By the Chernoff bound, $P(u \text{ is misclustered in more than } L/2 \text{ sets}) < \exp(-(\delta - 0.5)^2 L / 2)$. For $L = C \log n$, majority voting clusters all users correctly with high probability.
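A quick numeric check of the bound (the values of $\delta$, $C$, and $n$ below are hypothetical, chosen only for illustration): taking $L = C \log n$ with $C = 8/(\delta - 1/2)^2$ makes the per-user failure probability exactly $n^{-4}$, so a union bound over all $n$ users still vanishes.

```python
import numpy as np

# Hypothetical values for illustration only.
delta = 0.4                        # per-partition misclustering probability
n = 10**6                          # number of users
C = 8 / (delta - 0.5) ** 2         # chosen so the bound equals n^{-4}
L = C * np.log(n)                  # number of partitions, L = C log n

bound = np.exp(-(delta - 0.5) ** 2 * L / 2)
print(bound, n**-4.0)              # both ~1e-24; union bound over n users ~ n^{-3}
```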
Lower Bound on Sample Complexity. Event A: two users from different clusters each have no pairwise comparisons. If A occurs, not all users can be clustered correctly. $P(A) \to 1$ as $n \to \infty$ when $1 - \epsilon < O(\log n / n^2)$.
Main Result. If more than $r \log^5 n$ pairwise comparisons per user are available, all users are correctly clustered w.p. at least $1 - 1/n$. The number of comparisons required is within a polylog factor of the lower bound. Assumption required for the main result: the rows of the expected net wins matrix belonging to different clusters are well separated.
Related Work. Vu (2014): exact cluster recovery in community detection through spectral methods; partitions the data into two sets, using one for clustering and the other to correct errors in the recovered clusters. Lu-Negahban (2014): Bradley-Terry parameters are different for each user, but form a low-rank matrix. Park, Neeman, Zhang, Sanghavi (2015): related to the model above, but with a different algorithm. Oh, Thekumparampil, Xu (2015): generalization to multi-item rankings.
Conclusions. An algorithm that achieves perfect clustering with high probability: majority voting over spectral clusterings of different data partitions. The number of samples required is within a poly($\log n$) factor of a lower bound.