Correlation Clustering


1 Correlation Clustering
Shuchi Chawla, Carnegie Mellon University
Joint work with Nikhil Bansal and Avrim Blum

2 Document Clustering
Given a bunch of documents, classify them into salient topics.
Typical characteristics:
No well-defined “similarity metric”
Number of clusters is unknown
No predefined topics – desirable to figure them out as part of the algorithm

3 Research Communities
Given data on research papers, divide researchers into communities by co-authorship.
Typical characteristics:
How to divide really depends on the given set of researchers
Fuzzy boundaries

4 Traditional Approaches to Clustering
Approximation algorithms: k-means, k-median, k-min sum
Matrix methods: spectral clustering
AI techniques: EM, classification algorithms

5 Problems with traditional approaches
Dependence on an underlying metric:
Objective functions are meaningless without a metric, e.g., k-means
Some algorithms work only on specific metrics (such as Euclidean), e.g., spectral methods

6 Problems with traditional approaches
Fixed number of clusters:
Meaningless without a prespecified number of clusters, e.g., for k-means or k-median, if k is unspecified, it is best to put each item in its own cluster

7 Problems with traditional approaches
No clean notion of “quality” of clustering:
Objective functions do not directly translate to how many items have been grouped wrongly
Heuristic approaches
Objective functions derived from generative models

8 Cohen, McCallum & Richman’s idea
“Learn” a similarity measure on documents – it may not be a metric!
f(x,y) = amount of similarity between x and y
Use labeled data to train up this function
Classify all pairs with the learned function
Find the “most consistent” clustering (our task)
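Not part of the deck, but to make the pipeline concrete: a minimal Python sketch of the “learn f, then label all pairs” step, assuming items come as numeric feature vectors; the feature map and the logistic-regression pair classifier are illustrative choices, not Cohen, McCallum & Richman's actual system.

```python
# Hedged sketch of the pipeline: learn a pairwise same/different
# classifier from labeled pairs, then label every pair of items.
# Feature map and model are illustrative, not the original system's.
from itertools import combinations
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(x, y):
    # Toy features: coordinate-wise distance between item vectors
    return np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))

def learn_f(train_pairs):
    """train_pairs: list of ((x, y), label) with label 1 = same, 0 = different."""
    X = np.array([pair_features(x, y) for (x, y), _ in train_pairs])
    z = np.array([lab for _, lab in train_pairs])
    return LogisticRegression().fit(X, z)

def label_all_pairs(model, items):
    """Return '+' or '-' for every pair, keyed by frozenset of item indices."""
    labels = {}
    for (i, x), (j, y) in combinations(enumerate(items), 2):
        pred = model.predict(pair_features(x, y).reshape(1, -1))[0]
        labels[frozenset((i, j))] = '+' if pred == 1 else '-'
    return labels
```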

9 An example
[Figure: a graph on the names Harry B., Harry Bovik, H. Bovik, and Tom X.; ‘+’ edges mean “same”, ‘-’ edges mean “different”]
Consistent clustering: + edges inside clusters, - edges between clusters

10 An example
[Figure: same graph; a “Disagreement” marker points at the H. Bovik – Tom X. edge]

11 An example
[Figure: same graph, with the disagreement marked]
Task: find the most consistent clustering, i.e., the fewest possible disagreements – or, equivalently, the maximum possible agreements

12 Correlation clustering
Given a complete graph with each edge labeled ‘+’ or ‘-’
Our measure of a clustering: how many labels does it agree with?
Number of clusters depends on the edge labels
NP-complete; we consider approximations
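To pin down the objective (this is not from the original slides), here is a small Python sketch that counts a clustering's disagreements on a +/- labeled complete graph; the edge signs for the Bovik example are assumed, not read off the figure.

```python
# Minimal sketch (not from the talk): the correlation-clustering
# objective, i.e. counting the edge labels a clustering violates.
from itertools import combinations

def disagreements(labels, clusters):
    """labels: {frozenset({u, v}): '+' or '-'} over a complete graph.
    clusters: list of disjoint sets covering all nodes."""
    cluster_of = {v: i for i, c in enumerate(clusters) for v in c}
    bad = 0
    for u, v in combinations(sorted(cluster_of), 2):
        same = cluster_of[u] == cluster_of[v]
        # '+' edges belong inside clusters, '-' edges between clusters
        if (labels[frozenset((u, v))] == '+') != same:
            bad += 1
    return bad

# The slides' 4-name example; edge signs here are assumed.
names = ["Harry B.", "Harry Bovik", "H. Bovik", "Tom X."]
labels = {frozenset(e): s for e, s in [
    ((names[0], names[1]), '+'), ((names[0], names[2]), '+'),
    ((names[1], names[2]), '+'), ((names[0], names[3]), '-'),
    ((names[1], names[3]), '-'), ((names[2], names[3]), '-')]}
print(disagreements(labels, [set(names[:3]), {names[3]}]))  # -> 0
```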

13 Compared to traditional approaches…
Do not have to specify k
No condition on weights – they can be arbitrary
Clean notion of quality of clustering – the number of examples where the clustering differs from f
If a good (perfect) clustering exists, it is easy to find

14 Some machine learning justification
Noise removal: there is some true classification function f, but there are a few errors in the data; we want to find the true function
Agnostic learning: there is no inherent clustering; try to find the best representation using a hypothesis with limited expressivity

15 Our results
Constant-factor approximation for minimizing disagreements
PTAS for maximizing agreements
Results for the random-noise case

16 Minimizing Disagreements
Goal: a constant-factor approximation
Problem: even if we find a cluster as good as one in OPT, we are headed towards a log n approximation (a set-cover-like bound)
Idea: lower-bound D_OPT, the number of disagreements of the optimal clustering

17 Lower Bounding Idea: Bad Triangles
Consider a “bad triangle”: two ‘+’ edges and one ‘-’ edge.
Any clustering has to disagree with at least one of these three edges.

18 Lower Bounding Idea: Bad Triangles
If there are several edge-disjoint bad triangles, then any clustering makes a mistake on each one.
[Figure: a 5-node graph containing the edge-disjoint bad triangles (1,2,3) and (1,4,5)]
D_OPT ≥ #{edge-disjoint bad triangles}
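As an illustration of the counting argument (not the paper's algorithm), a greedy packing of edge-disjoint bad triangles certifies a lower bound on D_OPT; the function name and greedy rule are mine.

```python
# Illustration of the lower-bound argument (not the paper's algorithm):
# greedily pack edge-disjoint bad triangles; any clustering errs on at
# least one edge of each, so the count is a valid lower bound on D_OPT.
from itertools import combinations

def bad_triangle_lower_bound(labels, nodes):
    used = set()   # edges already consumed by a chosen triangle
    count = 0
    for tri in combinations(sorted(nodes), 3):
        edges = [frozenset(p) for p in combinations(tri, 2)]
        if any(e in used for e in edges):
            continue  # keep the triangles edge-disjoint
        if [labels[e] for e in edges].count('-') == 1:
            used.update(edges)   # two '+' edges, one '-': a bad triangle
            count += 1
    return count  # greedy, so not necessarily the maximum packing
```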

19 Using the lower bound
δ-clean cluster: a cluster C in which each node has fewer than δ|C| “bad” edges
δ-clean clusters have few bad triangles => few mistakes
Possible solution: find a δ-clean clustering
Caveat: it may not exist
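A hedged sketch of the δ-clean test; it takes a node's “bad” edges in C to be its ‘-’ edges inside C plus its ‘+’ edges leaving C (the paper's definition; the slide only says “bad”).

```python
# Hedged sketch of the delta-clean test. "Bad" edges for a node u in C
# are taken to be '-' edges to other nodes of C plus '+' edges leaving C.
def is_delta_clean(labels, cluster, all_nodes, delta):
    for u in cluster:
        bad = 0
        for v in all_nodes:
            if v == u:
                continue
            inside = v in cluster
            sign = labels[frozenset((u, v))]
            if (inside and sign == '-') or (not inside and sign == '+'):
                bad += 1
        if bad >= delta * len(cluster):   # needs strictly fewer bad edges
            return False
    return True
```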

20 Using the lower bound
Caveat: a δ-clean clustering may not exist
We show: there exists a clustering whose clusters are each δ-clean or singletons
Further, it has few mistakes
This nice structure helps us find it easily.

21 Maximizing Agreements
Easy to obtain a 2-approximation:
If #(positive edges) > #(negative edges), put everything in one cluster
Otherwise, use n singleton clusters
Either way, we get at least half the edges correct
Max score possible = total number of edges
=> a 2-approximation!
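The slide's argument translates directly into code; a minimal sketch (names are mine, not the talk's):

```python
# The slide's trivial 2-approximation for maximizing agreements:
# whichever of "one big cluster" / "all singletons" agrees more.
def two_approx_max_agreements(labels, nodes):
    pos = sum(1 for s in labels.values() if s == '+')
    neg = len(labels) - pos
    if pos > neg:
        return [set(nodes)]        # agrees with every '+' edge
    return [{v} for v in nodes]    # agrees with every '-' edge
```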

22 Maximizing Agreements
Max possible score = ½n²
Goal: obtain an additive approximation of εn²
Standard approach:
Draw a small sample
Guess the partition of the sample
Compute the partition of the remainder
Running time is doubly exponential in 1/ε, or singly exponential with a bad exponent.
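A toy version of the sampling template the slide names, not the paper's actual PTAS: enumerate all k^k assignments of a k-node sample to cluster slots, extend each by majority agreement with the sample, and keep the best; the k^k enumeration is exactly the exponential dependence the slide warns about. All names and parameters here are illustrative.

```python
# Toy version of the sampling template (NOT the paper's actual PTAS):
# enumerate every assignment of a k-node sample to k cluster slots,
# extend each by majority agreement with the sample, keep the best.
import random
from itertools import combinations, product

def agreements(labels, clusters):
    cluster_of = {v: i for i, c in enumerate(clusters) for v in c}
    return sum((labels[frozenset((u, v))] == '+') ==
               (cluster_of[u] == cluster_of[v])
               for u, v in combinations(sorted(cluster_of), 2))

def sample_and_extend(labels, nodes, k):
    nodes = sorted(nodes)
    sample = random.sample(nodes, k)
    rest = [v for v in nodes if v not in sample]
    best, best_score = None, -1
    for assign in product(range(k), repeat=k):   # k^k guesses: tiny k only!
        parts = [set() for _ in range(k)]        # guessed sample partition
        for v, c in zip(sample, assign):
            parts[c].add(v)
        clusters = [set(p) for p in parts]
        for v in rest:
            def gain(i):   # agreements with the sample if v joins part i
                plus = sum(labels[frozenset((v, u))] == '+' for u in parts[i])
                minus = sum(labels[frozenset((v, u))] == '-'
                            for u in sample if u not in parts[i])
                return plus + minus
            clusters[max(range(k), key=gain)].add(v)
        score = agreements(labels, clusters)
        if score > best_score:
            best, best_score = clusters, score
    return best, best_score
```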

23 Extensions & Open Problems
Weighted edges or incomplete graphs
Recent work by Bartal et al.: a log-approximation based on multiway cut
Better constant for the unweighted case
Can we use bad triangles (or polygons) more directly for a tighter bound?
Experimental performance

24 Other problems I have worked on
Game theory and mechanism design
Approximations for Orienteering and related problems
Online search algorithms based on machine-learning approaches
Theoretical properties of power-law graphs
Currently working on privacy with Cynthia

25 Thanks!

26 Using the lower bound: δ-clean clusters
Proof that the cost of the δ-clean clustering is ≤ 4·OPT

27 Proof outline: δ-clean ≤ 4·OPT
But a δ-clean clustering may not exist!
Show that there is a clustering that is close to δ-clean – its clusters are either δ-clean or singletons
Such a clustering exists close to OPT
We will try to find this clustering

28 Existence of OPT(δ)
Proof

29 The algorithm, pictorially
Brief outline of how the algorithm finds this clustering

30 Bounding the cost
Clusters containing OPT(δ)’s clusters are ¼-clean
The rest have at most as many mistakes as OPT(δ)

31 Random noise?

