
Correlation Clustering Nikhil Bansal Joint Work with Avrim Blum and Shuchi Chawla.


1 Correlation Clustering Nikhil Bansal Joint Work with Avrim Blum and Shuchi Chawla

2 Clustering Say we want to cluster n objects of some kind (documents, images, text strings), but we don't have a meaningful way to project them into Euclidean space. Idea of Cohen, McCallum, Richman: use past data to train up f(x,y) = same/different. Then run f on all pairs and try to find the most consistent clustering.

3 The problem (Figure: four records to cluster: Harry Bovik, H. Bovik, Harry B., Tom X.)

4 The problem (+: Same, -: Different). Train up f(x,y) = same/different; run f on all pairs. (Figure: + edges among Harry Bovik, H. Bovik, Harry B.; Tom X. apart.)

5 The problem (+: Same, -: Different). Totally consistent clustering: 1. + edges inside clusters, 2. - edges between clusters. (Figure: Harry Bovik, H. Bovik, Harry B. grouped; Tom X. alone.)

6 The problem (+: Same, -: Different). Train up f(x,y) = same/different; run f on all pairs. (Figure: same four records, now with an inconsistent labeling.)

7 The problem (+: Same, -: Different). Train up f(x,y) = same/different; run f on all pairs; find the most consistent clustering. A disagreement is an edge whose label conflicts with the clustering.

8 The problem (+: Same, -: Different). Train up f(x,y) = same/different; run f on all pairs; find the most consistent clustering. (Figure: one disagreement highlighted.)

9 The problem. Given a complete graph on n vertices, each edge labeled + or -, the goal is to find a partition of the vertices as consistent as possible with the edge labels: maximize #(agreements) or minimize #(disagreements). There is no k: the number of clusters could be anything.
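To make the objective concrete, here is a small sketch in Python; the instance is the toy name-matching example from the earlier slides, and the representation (edge labels keyed by frozensets) is just one convenient choice.

```python
from itertools import combinations

def disagreements(labels, clustering):
    """Count edges whose +/- label conflicts with the clustering.

    labels: dict mapping frozenset({u, v}) -> '+' or '-'
    clustering: dict mapping each vertex -> cluster id
    """
    bad = 0
    for edge, sign in labels.items():
        u, v = tuple(edge)
        same_cluster = clustering[u] == clustering[v]
        if (sign == '+') != same_cluster:   # + between clusters, or - inside one
            bad += 1
    return bad

# Toy instance: three name variants that should match, one that should not.
V = ['Harry Bovik', 'H. Bovik', 'Harry B.', 'Tom X.']
labels = {frozenset(e): '-' for e in combinations(V, 2)}
for e in combinations(V[:3], 2):            # the Bovik variants look pairwise similar
    labels[frozenset(e)] = '+'

good = {'Harry Bovik': 0, 'H. Bovik': 0, 'Harry B.': 0, 'Tom X.': 1}
print(disagreements(labels, good))                  # 0: fully consistent
print(disagreements(labels, dict.fromkeys(V, 0)))   # 3: the - edges to Tom X. are inside
```

Note that the number of clusters is not fixed in advance: the objective alone decides whether merging or splitting helps.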

10 The problem, two views. Noise removal: there is some true clustering, but some edges are incorrect; we still want to do well. Agnostic learning: there is no inherent clustering; try to find the best representation using a hypothesis class with limited representation power. E.g., research communities via a collaboration graph.

11 Our results. Constant-factor approximation for minimizing disagreements. PTAS for maximizing agreements. Results for the random-noise case.

12 PTAS for maximizing agreements. Easy to get agreement on half the edges. Goal: additive approximation of εn². Standard approach: draw a small sample, guess the partition of the sample, compute the partition of the remainder. Can do this directly, or plug into the General Property Tester of [GGR]. Running time is doubly exponential in 1/ε, or singly exponential with a bad exponent.

13 Minimizing disagreements. Goal: get a constant-factor approximation. Problem: even if we can find a cluster that is as good as the best cluster of OPT, we are headed toward O(log n) [Set-Cover-like analysis]. We need a way of lower-bounding D_opt, the number of disagreements of the optimal clustering.

14 Lower-bounding idea: bad triangles. Consider a triangle with two + edges and one - edge (a "bad triangle"). Any clustering has to disagree with at least one of these three edges.

15 Lower-bounding idea: bad triangles. If there are several edge-disjoint bad triangles, any clustering makes a mistake on each one, so D_opt >= #{edge-disjoint bad triangles}. (Figure: a five-vertex example with edge-disjoint bad triangles (1,2,3) and (3,4,5). The bound is not tight.)
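The lower bound can be computed greedily; here is an illustrative sketch, where the vertex numbering and edge signs are my reconstruction of the slide's five-vertex figure (+ edges forming the path 1-2-3-4-5), not taken verbatim from it.

```python
from itertools import combinations

def greedy_bad_triangles(vertices, sign):
    """Greedily collect edge-disjoint bad triangles (two + edges, one - edge).

    sign(u, v) -> '+' or '-'. The number of triangles returned lower-bounds
    D_opt: any clustering errs on at least one edge of every bad triangle,
    and the collected triangles share no edges.
    """
    used = set()
    found = []
    for tri in combinations(vertices, 3):
        edges = list(combinations(tri, 2))
        if any(frozenset(e) in used for e in edges):
            continue                                   # keep triangles edge-disjoint
        if sum(sign(*e) == '+' for e in edges) == 2:   # exactly two +, one -
            used.update(frozenset(e) for e in edges)
            found.append(tri)
    return found

# Reconstructed example: + edges form the path 1-2-3-4-5, everything else is -.
plus = {frozenset(e) for e in [(1, 2), (2, 3), (3, 4), (4, 5)]}
sign = lambda u, v: '+' if frozenset((u, v)) in plus else '-'
print(greedy_bad_triangles(range(1, 6), sign))   # [(1, 2, 3), (3, 4, 5)]
```

The greedy count is only a lower bound on D_opt; the slides note it is not tight, and the open-problems slide asks how far off it can be.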

16 Lower-bounding idea: bad triangles. If there are several edge-disjoint bad triangles, there is a mistake on each one: D_opt >= #{edge-disjoint bad triangles}. How can we use this? (Figure: the same five-vertex example with bad triangles (1,2,3) and (3,4,5).)

17 δ-clean clusters (+: similar, -: dissimilar). Given a clustering, a vertex v is δ-good with respect to its cluster C if it has few disagreements: fewer than δ|C| of its --neighbors N-(v) lie inside C, and fewer than δ|C| of its +-neighbors N+(v) lie outside C. A cluster C is δ-clean if all v in C are δ-good. Essentially, N+(v) is approximately C(v).
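The definition translates directly into code. A minimal sketch, where representing the graph by each vertex's set of +-neighbors is my choice:

```python
def is_delta_good(v, C, plus, delta):
    """v is delta-good w.r.t. cluster C: fewer than delta*|C| of its
    minus-neighbors lie inside C, and fewer than delta*|C| of its
    plus-neighbors lie outside C. plus[v] is the set of +-neighbors of v."""
    minus_inside = sum(1 for u in C if u != v and u not in plus[v])
    plus_outside = sum(1 for u in plus[v] if u not in C)
    return minus_inside < delta * len(C) and plus_outside < delta * len(C)

def is_delta_clean(C, plus, delta):
    """C is delta-clean if every vertex in it is delta-good."""
    return all(is_delta_good(v, C, plus, delta) for v in C)

# The Bovik-style cluster from earlier slides is 1/4-clean; adding the
# unrelated vertex T ruins cleanness.
plus = {'A': {'B', 'C'}, 'B': {'A', 'C'}, 'C': {'A', 'B'}, 'T': set()}
print(is_delta_clean({'A', 'B', 'C'}, plus, 0.25))        # True
print(is_delta_clean({'A', 'B', 'C', 'T'}, plus, 0.25))   # False
```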

18 Observation: any δ-clean clustering is an 8-approximation for δ < 1/4. Idea: charge mistakes to bad triangles. For a wrong - edge (u,v) inside a cluster, there are intuitively enough choices of a third vertex w with + edges to both u and v to form a bad triangle.

19 Observation: any δ-clean clustering is an 8-approximation for δ < 1/4. Since there are enough choices of w for each wrong edge (u,v), we can find edge-disjoint bad triangles. A similar argument handles + edges between two clusters.

20 General structure of the argument. Any δ-clean clustering is an 8-approximation for δ < 1/4. Consequence: we just need to produce a δ-clean clustering for some δ < 1/4.

21 General structure of the argument. Any δ-clean clustering is an 8-approximation for δ < 1/4. Consequence: we just need to produce a δ-clean clustering for some δ < 1/4. Bad news: this may not be possible!

22 General structure of the argument. Any δ-clean clustering is an 8-approximation for δ < 1/4. Consequence: we just need to produce a δ-clean clustering for some δ < 1/4. Bad news: this may not be possible! Approach: 1) show there is a clustering OPT(δ) of a special form, in which every cluster is either δ-clean or a singleton, with not many more mistakes than OPT; 2) produce something "close" to OPT(δ).

23 Existence of OPT(δ). (Figure: OPT with clusters C1 and C2.)

24 Existence of OPT(δ). Identify the δ/3-bad vertices in each cluster of OPT. (Figure: clusters C1 and C2 with their δ/3-bad vertices marked.)

25 Existence of OPT(δ). 1) Move the δ/3-bad vertices out as singletons. 2) If "many" (at least a δ/3 fraction) of a cluster's vertices are δ/3-bad, "split" the whole cluster into singletons. The result OPT(δ) consists of singletons and δ-clean clusters, and D_OPT(δ) = O(1) · D_OPT.

26 Main result. (Figure: OPT(δ) with its δ-clean clusters.)

27 Main result. The algorithm produces 11δ-clean clusters. Guarantee: the non-singleton clusters are 11δ-clean, and the singletons are a subset of the singletons of OPT(δ).

28 Main result. Choose δ = 1/44, so the non-singleton clusters are 1/4-clean. Bound the mistakes among non-singletons by bad triangles, and the mistakes involving singletons by those of OPT(δ). Guarantee: non-singletons are 11δ-clean; singletons are a subset of those of OPT(δ).

29 Main result. With δ = 1/44 the non-singletons are 1/4-clean; bounding the mistakes among non-singletons by bad triangles, and those involving singletons by OPT(δ), gives approximation ratio 9/δ² + 8. Guarantee: non-singletons are 11δ-clean; singletons are a subset of those of OPT(δ).

30 Open problems. How about a small constant? Is D_opt <= 2 · #{edge-disjoint bad triangles}? Is D_opt <= 2 · #{fractional edge-disjoint bad triangles}? Extend to {-1, 0, +1} weights: approximation to a constant or even a log factor? Extend to the weighted case? Related: the Clique Partitioning Problem and cutting-plane algorithms [Grötschel et al.].


34 Nice features of the formulation. There is no k (OPT can have anywhere from 1 to n clusters). If a perfect solution exists, it is easy to find: C(v) = N+(v). Easy to get agreement on half of the edges.
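The "perfect solution is easy" claim can be sketched as a direct check, again using the +-neighbor-set representation (a choice of mine, not the slides'):

```python
def perfect_clustering(vertices, plus):
    """If a fully consistent clustering exists, each cluster must equal
    {v} | N+(v) for every one of its members; build those sets and verify.
    Returns the list of clusters, or None if no perfect clustering exists.
    plus[v] is the set of +-neighbors of v."""
    assigned = set()
    clusters = []
    for v in vertices:
        if v in assigned:
            continue
        C = {v} | plus[v]
        if C & assigned:          # overlaps an earlier cluster: inconsistent
            return None
        clusters.append(C)
        assigned |= C
    # every member of a cluster must see exactly the same cluster
    for C in clusters:
        if any({u} | plus[u] != C for u in C):
            return None
    return clusters

plus = {'A': {'B', 'C'}, 'B': {'A', 'C'}, 'C': {'A', 'B'}, 'T': set()}
print(perfect_clustering(['A', 'B', 'C', 'T'], plus))   # two clusters: the A/B/C group and {T}
bad = {'A': {'B'}, 'B': {'A', 'C'}, 'C': {'B'}}         # + edges A-B, B-C but not A-C
print(perfect_clustering(['A', 'B', 'C'], bad))         # None
```

The half-agreement guarantee is similarly trivial: of the two extreme clusterings, all-singletons satisfies every - edge and one-big-cluster satisfies every + edge, so the better of the two agrees with at least half the edges.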

35 Algorithm. 1. Pick a vertex v; let C(v) = {v} ∪ N+(v). 2. Modify C(v): (a) remove 3δ-bad vertices from C(v); (b) add 7δ-good vertices into C(v). 3. Delete C(v) and repeat, until done or until the above only produces empty clusters. 4. Output the nodes left over as singletons.
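A loose sketch of steps 1 to 4 in Python; the δ-goodness test mirrors the earlier δ-clean definition, while treating "bad" as "not good", the vertex-picking order, and the stopping rule are my simplifications, and the function name is hypothetical.

```python
def delta_good(x, C, plus, delta):
    """x is delta-good w.r.t. C, as in the delta-clean slide."""
    if not C:
        return False
    minus_inside = sum(1 for u in C if u != x and u not in plus[x])
    plus_outside = sum(1 for u in plus[x] if u not in C)
    return minus_inside < delta * len(C) and plus_outside < delta * len(C)

def cautious_clustering(vertices, plus, delta=1/44):
    remaining = set(vertices)
    clusters = []
    while remaining:
        v = min(remaining)                         # 1. pick a vertex
        C0 = ({v} | plus[v]) & remaining           #    C(v) = {v} | N+(v)
        # 2a. remove vertices that are 3*delta-bad w.r.t. the initial C(v)
        C = {x for x in C0 if delta_good(x, C0, plus, 3 * delta)}
        # 2b. add back vertices that are 7*delta-good w.r.t. the pruned C
        C |= {x for x in remaining - C if delta_good(x, C, plus, 7 * delta)}
        if len(C) <= 1:                            # 3. only trivial clusters left: stop
            break
        clusters.append(C)                         #    delete C(v) and repeat
        remaining -= C
    clusters.extend({x} for x in remaining)        # 4. leftovers become singletons
    return clusters

plus = {'A': {'B', 'C'}, 'B': {'A', 'C'}, 'C': {'A', 'B'}, 'T': set()}
print(cautious_clustering(['A', 'B', 'C', 'T'], plus))
```

On the toy instance this recovers the A/B/C cluster and leaves T as a singleton, matching the guarantee that non-singleton output clusters are clean.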

36 Step 1. Choose v; let C = {v} ∪ (+-neighbors of v). (Figure: clusters C1 and C2 of OPT(δ), with v and its candidate cluster C.)

37 Step 2, vertex removal phase: if x is 3δ-bad, set C = C - {x}. Claims: 1) no vertex in C1 is removed; 2) all vertices in C2 are removed.

38 Step 3, vertex addition phase: add 7δ-good vertices to C. Claims: 1) all remaining vertices in C1 will be added; 2) none in C2 is added; 3) the resulting cluster C is 11δ-clean.

39 Case 2: v is a singleton in OPT(δ). Choose v, C = +-neighbors of v; the same idea works.

40 The problem. (Figure: a four-vertex example.)

