Correlation Clustering
Shuchi Chawla, Carnegie Mellon University
Joint work with Nikhil Bansal and Avrim Blum
Document Clustering
Given a bunch of documents, classify them into salient topics.
Typical characteristics:
- No well-defined "similarity metric"
- Number of clusters is unknown
- No predefined topics; desirable to figure them out as part of the algorithm
Research Communities
Given data on research papers, divide researchers into communities by co-authorship.
Typical characteristics:
- How to divide really depends on the given set of researchers
- Fuzzy boundaries
Traditional Approaches to Clustering
- Approximation algorithms: k-means, k-median, k-min-sum
- Matrix methods: spectral clustering
- AI techniques: EM, classification algorithms
Problems with Traditional Approaches
Dependence on the underlying metric:
- Objective functions are meaningless without a metric (e.g. k-means)
- Some algorithms work only on specific metrics such as Euclidean (e.g. spectral methods)
Problems with Traditional Approaches
Fixed number of clusters:
- Meaningless without a prespecified number of clusters
- E.g. for k-means or k-median, if k is unspecified, it is best to put every point in its own cluster
Problems with Traditional Approaches
No clean notion of "quality" of a clustering:
- Objective functions do not directly translate to how many items have been grouped wrongly
- Heuristic approaches; objective functions derived from generative models
Cohen, McCallum & Richman's Idea
- "Learn" a similarity measure on documents (it may not be a metric!): f(x,y) = amount of similarity between x and y
- Use labeled data to train this function
- Classify all pairs with the learned function
- Find the "most consistent" clustering — our task
An Example
[Diagram: nodes Harry B., Harry Bovik, H. Bovik, Tom X.; '+' edge = same, '-' edge = different]
Consistent clustering: + edges inside clusters, - edges between clusters

An Example
[Diagram: the same nodes, with one edge now marked as a disagreement]

An Example
[Diagram: the same nodes, disagreement edge highlighted]
Task: Find the most consistent clustering, i.e. the fewest possible disagreements, or equivalently, the maximum possible agreements
Correlation Clustering
- Given a complete graph, each edge labeled '+' or '-'
- Our measure of a clustering: how many labels does it agree with?
- Number of clusters depends on the edge labels
- NP-complete; we consider approximations
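As a concrete sketch of this objective, the disagreement count of a candidate clustering can be computed directly from the edge labels. The dict-of-frozensets representation and the function name below are illustrative assumptions, not from the talk:

```python
# Count how many edge labels a clustering disagrees with.
# Assumed representation: labels maps frozenset({u, v}) to '+' or '-';
# clustering maps each node to a cluster id.
def disagreements(labels, clustering):
    count = 0
    for pair, sign in labels.items():
        u, v = tuple(pair)
        same = clustering[u] == clustering[v]
        # a '+' edge split across clusters, or a '-' edge kept
        # inside one cluster, is a disagreement
        if (sign == '+' and not same) or (sign == '-' and same):
            count += 1
    return count
```

On a bad triangle (two '+' edges and one '-'), every clustering scores at least one disagreement, which is exactly the lower-bounding idea used later in the talk.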
Compared to Traditional Approaches…
- Do not have to specify k
- No condition on the weights; they can be arbitrary
- Clean notion of quality of a clustering: the number of examples where the clustering differs from f
- If a good (perfect) clustering exists, it is easy to find
Some Machine Learning Justification
Noise removal:
- There is some true classification function f, but there are a few errors in the data
- We want to find the true function
Agnostic learning:
- There is no inherent clustering
- Try to find the best representation using a hypothesis with limited expressivity
Our Results
- Constant-factor approximation for minimizing disagreements
- PTAS for maximizing agreements
- Results for the random noise case
Minimizing Disagreements
- Goal: a constant-factor approximation
- Problem: even if we find a cluster as good as one in OPT, we are headed toward a log n approximation (a set-cover-like bound)
- Idea: lower bound D_OPT, the number of disagreements of the optimal clustering
Lower Bounding Idea: Bad Triangles
Consider a "bad triangle": two '+' edges and one '-' edge.
Any clustering has to disagree with at least one of these three edges.
Lower Bounding Idea: Bad Triangles
If there are several edge-disjoint bad triangles, then any clustering makes a mistake on each one.
[Diagram: nodes 1-5 with edge-disjoint bad triangles (1,2,3) and (1,4,5)]
D_OPT ≥ #{edge-disjoint bad triangles}
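This lower bound can be certified greedily: collect bad triangles that share no edges; their count is a valid lower bound on D_OPT even though greedy may not find the largest such set. A minimal sketch, where the dict-of-frozensets edge-label representation and the function name are illustrative assumptions:

```python
from itertools import combinations

# Greedy certificate: OPT >= number of edge-disjoint bad triangles found.
# A bad triangle has exactly two '+' edges and one '-' edge.
# Assumed representation: labels maps frozenset({u, v}) to '+' or '-'.
def bad_triangle_lower_bound(nodes, labels):
    used = set()   # edges already claimed by a chosen triangle
    count = 0
    for tri in combinations(nodes, 3):
        edges = [frozenset(p) for p in combinations(tri, 2)]
        if any(e in used for e in edges):
            continue   # keep triangles edge-disjoint
        if sorted(labels[e] for e in edges) == ['+', '+', '-']:
            used.update(edges)
            count += 1
    return count
```

Greedy suffices here because any edge-disjoint family, maximum or not, certifies the bound.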
Using the Lower Bound
- δ-clean cluster: a cluster C in which each node has fewer than δ|C| "bad" edges
- δ-clean clusters have few bad triangles => few mistakes
- Possible solution: find a δ-clean clustering
- Caveat: it may not exist
Using the Lower Bound
- Caveat: a δ-clean clustering may not exist
- We show: there is a clustering whose clusters are each δ-clean or singletons
- Further, it has few mistakes
- Its nice structure helps us find it easily
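A hedged sketch of the δ-clean test: the slides do not spell out what a "bad" edge is, so the code below assumes the natural reading — for a node of C, a '-' edge to another node of C or a '+' edge leaving C. The representation and function name are illustrative:

```python
# Check whether a cluster is delta-clean: each node in the cluster has
# fewer than delta*|C| "bad" edges, where a bad edge (assumed reading)
# is a '-' edge inside the cluster or a '+' edge leaving it.
# Assumed representation: labels maps frozenset({u, v}) to '+' or '-'.
def is_delta_clean(cluster, all_nodes, labels, delta):
    cluster = set(cluster)
    for u in cluster:
        bad = 0
        for v in all_nodes:
            if v == u:
                continue
            sign = labels[frozenset({u, v})]
            inside = v in cluster
            if (inside and sign == '-') or (not inside and sign == '+'):
                bad += 1
        if bad >= delta * len(cluster):
            return False
    return True
```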
Maximizing Agreements
Easy to obtain a 2-approximation:
- If #(positive edges) > #(negative edges), put everything in one cluster
- Otherwise, use n singleton clusters
- Either way we get at least half the edges correct
- Max score possible = total number of edges, so this is a 2-approximation!
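The two-case rule on this slide can be sketched directly; the function name and the dict-of-frozensets edge-label representation are illustrative assumptions:

```python
# Trivial 2-approximation for maximizing agreements: one big cluster
# if '+' labels are in the majority, otherwise all singletons.
# Assumed representation: labels maps frozenset({u, v}) to '+' or '-'.
def two_approx(nodes, labels):
    plus = sum(1 for s in labels.values() if s == '+')
    minus = len(labels) - plus
    if plus > minus:
        return {u: 0 for u in nodes}             # everything together
    return {u: i for i, u in enumerate(nodes)}   # all singletons
```

Whichever case applies, the returned clustering agrees with the majority label class, hence with at least half of all edges.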
Maximizing Agreements
- Max possible score = ½n²
- Goal: obtain an additive approximation of εn²
Standard approach:
- Draw a small sample
- Guess a partition of the sample
- Compute a partition of the remainder
- Running time doubly exponential in 1/ε, or singly exponential with a bad exponent
Extensions & Open Problems
- Weighted edges or incomplete graphs: recent work by Bartal et al. gives a log-approximation based on multiway cut
- Better constant for the unweighted case: can we use bad triangles (or polygons) more directly for a tighter bound?
- Experimental performance
Other Problems I Have Worked On
- Game theory and mechanism design
- Approximation for Orienteering and related problems
- Online search algorithms based on machine learning approaches
- Theoretical properties of power-law graphs
- Currently working on privacy with Cynthia
Thanks!
Using the Lower Bound: δ-Clean Clusters
Proof that a δ-clean clustering makes ≤ 4·OPT mistakes
Proof Outline
- A δ-clean clustering makes ≤ 4·OPT mistakes
- But there may not be a δ-clean clustering!
- Show that there is a clustering that is close to δ-clean: clusters are either δ-clean or singletons
- There exists such a clustering close to OPT
- We will try to find this clustering
Existence of OPT(δ): Proof
The Algorithm, Pictorially
Brief outline of how the algorithm finds this clustering
Bounding the Cost
- Clusters containing OPT(δ)'s clusters are ¼-clean
- The rest have at most as many mistakes as OPT(δ)
Random Noise?