Correlation Clustering Nikhil Bansal Joint Work with Avrim Blum and Shuchi Chawla
2 Clustering Say we want to cluster n objects of some kind (documents, images, text strings), but we don't have a meaningful way to project them into Euclidean space. Idea of Cohen, McCallum, Richman: use past data to train up f(x,y) = same/different. Then run f on all pairs and try to find the most consistent clustering.
3 The problem Harry Bovik H. Bovik Harry B. Tom X.
4 The problem (+: Same, -: Different) Harry Bovik, H. Bovik, Harry B., Tom X. Train up f(x,y) = same/different; run f on all pairs.
5 The problem (+: Same, -: Different) Harry Bovik, H. Bovik, Harry B., Tom X. Totally consistent: 1. + edges inside clusters, 2. - edges outside clusters.
6 The problem (+: Same, -: Different) Harry Bovik, H. Bovik, Harry B., Tom X. Train up f(x,y) = same/different; run f on all pairs.
7 The problem (+: Same, -: Different; one edge marked as a disagreement) Harry Bovik, H. Bovik, Harry B., Tom X. Train up f(x,y) = same/different; run f on all pairs; find the most consistent clustering.
8 The problem (+: Same, -: Different; one edge marked as a disagreement) H. Bovik, Harry Bovik, Harry B., Tom X. Train up f(x,y) = same/different; run f on all pairs; find the most consistent clustering.
9 The problem Given a complete graph on n vertices, each edge labeled + or -. Goal: find a partition of the vertices as consistent as possible with the edge labels, i.e. max #(agreements) or min #(disagreements). There is no k: the number of clusters could be anything. (+: Same, -: Different)
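The objective on this slide can be stated directly in code. A minimal sketch (the graph representation and all names here are my own, not from the talk): edges of the complete graph carry ±1 labels, and we count the disagreements of a candidate partition.

```python
from itertools import combinations

def disagreements(n, sign, cluster):
    """Count disagreements: a '+' edge between two clusters, or a '-'
    edge inside one. sign[(u, v)] is +1 or -1 for u < v; cluster[u]
    is the cluster id of vertex u."""
    bad = 0
    for u, v in combinations(range(n), 2):
        same = cluster[u] == cluster[v]
        if (sign[(u, v)] == +1) != same:
            bad += 1
    return bad

# Toy instance from the slides: 0, 1, 2 are the "Bovik" variants, 3 is Tom X.
n = 4
sign = {(u, v): +1 if v < 3 else -1 for u in range(n) for v in range(u + 1, n)}
print(disagreements(n, sign, {0: 0, 1: 0, 2: 0, 3: 1}))  # fully consistent: 0
print(disagreements(n, sign, {0: 0, 1: 0, 2: 0, 3: 0}))  # one big cluster: 3
```

Note that the clustering is unconstrained: any map from vertices to cluster ids is allowed, which is the "there is no k" point above.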
10 The problem Noise removal: there is some true clustering, but some edges are incorrect; we still want to do well. Agnostic learning: no inherent clustering; try to find the best representation using a hypothesis with limited representation power. E.g., research communities via a collaboration graph.
11 Our results Constant-factor approximation for minimizing disagreements. PTAS for maximizing agreements. Results for the random-noise case.
12 PTAS for maximizing agreements Easy to get ½ of the edges. Goal: additive approximation of εn². Standard approach: draw a small sample, guess a partition of the sample, compute the partition of the remainder. Can do this directly, or plug into the General Property Tester of [GGR]. Running time doubly exponential in 1/ε, or singly exponential with a bad exponent.
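The "easy to get ½ of the edges" claim has a one-line proof that is worth making concrete (a sketch; representation and names are mine): one big cluster agrees with every + edge, all singletons agree with every - edge, and the two edge sets partition all n(n-1)/2 edges, so the better option agrees on at least half.

```python
def half_agreement_baseline(n, sign):
    """Trivial baseline: one big cluster agrees with every '+' edge,
    all-singletons agrees with every '-' edge; return whichever of the
    two agrees with at least half of all n(n-1)/2 edges."""
    plus = sum(1 for s in sign.values() if s == +1)
    minus = len(sign) - plus
    if plus >= minus:
        return {v: 0 for v in range(n)}   # one big cluster
    return {v: v for v in range(n)}       # all singletons

# with mostly '+' labels the baseline puts everything in one cluster
sign = {(0, 1): +1, (0, 2): +1, (1, 2): -1}
print(half_agreement_baseline(3, sign))  # → {0: 0, 1: 0, 2: 0}
```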
13 Minimizing disagreements Goal: get a constant-factor approximation. Problem: even if we can find a cluster that's as good as the best cluster of OPT, we're headed toward O(log n) [Set-Cover-like analysis]. Need a way of lower-bounding D_opt.
14 Lower bounding idea: bad triangles Consider a "bad" triangle: two + edges and one - edge. Any clustering has to disagree with at least one of these edges: keeping all three vertices together violates the - edge, and any split cuts a + edge.
15 Lower bounding idea: bad triangles If several bad triangles are edge-disjoint, a clustering makes a mistake on each one, so D_opt ≥ #{edge-disjoint bad triangles}. Example on vertices 1-5: 2 edge-disjoint bad triangles, (1,2,3) and (3,4,5). (The bound is not tight.)
16 Lower bounding idea: bad triangles If several bad triangles are edge-disjoint, a clustering makes a mistake on each one: D_opt ≥ #{edge-disjoint bad triangles}. How can we use this? Example: edge-disjoint bad triangles (1,2,3) and (3,4,5).
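The lower bound can be computed greedily (a sketch, not the talk's method: the scan order is my own choice, so it finds *a* set of edge-disjoint bad triangles, not necessarily the maximum one — any such set still lower-bounds D_opt).

```python
from itertools import combinations

def edge_disjoint_bad_triangles(n, sign):
    """Greedily collect edge-disjoint 'bad' triangles (two '+' edges,
    one '-' edge). Any clustering errs on at least one edge of each,
    so the count lower-bounds D_opt."""
    used, count = set(), 0
    for tri in combinations(range(n), 3):
        edges = list(combinations(tri, 2))
        if any(e in used for e in edges):
            continue
        if sorted(sign[e] for e in edges) == [-1, +1, +1]:
            used.update(edges)
            count += 1
    return count

# the slides' 5-vertex example, 0-indexed: a '+' path 0-1-2-3-4, all other
# edges '-', giving edge-disjoint bad triangles (0,1,2) and (2,3,4)
sign = {e: -1 for e in combinations(range(5), 2)}
for e in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    sign[e] = +1
print(edge_disjoint_bad_triangles(5, sign))  # → 2
```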
17 δ-clean clusters (+: Similar, -: Dissimilar) Given a clustering, a vertex v is δ-good if it has few disagreements: |N-(v) within C| < δ|C| and |N+(v) outside C| < δ|C|, where C = C(v) is v's cluster. A cluster C is δ-clean if all v ∈ C are δ-good. Essentially, N+(v) ≈ C(v).
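The definition above translates directly into a check (a sketch; the representation and function names are mine, and I spell δ as `delta`):

```python
def is_delta_good(v, C, n, sign, delta):
    """v is delta-good w.r.t. its cluster C if it has fewer than
    delta*|C| '-' neighbors inside C and fewer than delta*|C| '+'
    neighbors outside C."""
    s = lambda a, b: sign[(a, b)] if a < b else sign[(b, a)]
    minus_inside = sum(1 for u in C if u != v and s(v, u) == -1)
    plus_outside = sum(1 for u in range(n) if u not in C and s(v, u) == +1)
    return minus_inside < delta * len(C) and plus_outside < delta * len(C)

def is_delta_clean(C, n, sign, delta):
    """A cluster is delta-clean if every vertex in it is delta-good."""
    return all(is_delta_good(v, C, n, sign, delta) for v in C)

# two consistent groups {0,1,2} and {3}: {0,1,2} is clean even for small delta
sign = {(u, v): +1 if v < 3 else -1 for u in range(4) for v in range(u + 1, 4)}
print(is_delta_clean({0, 1, 2}, 4, sign, 0.25))  # → True
```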
18 Observation Any δ-clean clustering is an 8-approximation, for δ < 1/4. Idea: charge mistakes to bad triangles. For a wrong - edge (u,v) inside a cluster, there are intuitively enough choices of a third vertex w joined to both u and v by + edges.
19 Observation Any δ-clean clustering is an 8-approximation, for δ < 1/4. Idea: charging mistakes to bad triangles. Intuitively there are enough choices of w for each wrong edge (u,v), so we can find edge-disjoint bad triangles. A similar argument (using a vertex u' from the other cluster) handles + edges between 2 clusters.
20 General structure of argument Any δ-clean clustering is an 8-approximation for δ < 1/4. Consequence: we just need to produce a δ-clean clustering for some δ < 1/4.
21 General structure of argument Any δ-clean clustering is an 8-approximation for δ < 1/4. Consequence: we just need to produce a δ-clean clustering for some δ < 1/4. Bad news: may not be possible!
22 General structure of argument Any δ-clean clustering is an 8-approximation for δ < 1/4. Consequence: we just need to produce a δ-clean clustering for some δ < 1/4. Bad news: may not be possible! Approach: 1) Show there is a clustering Opt(δ) of a special form — clusters either clean or singletons — with not many more mistakes than Opt. 2) Produce something "close" to Opt(δ).
23 Existence of Opt(δ) [Figure: Opt, with clusters C1 and C2]
24 Existence of Opt(δ) Identify the δ/3-bad vertices in Opt's clusters C1 and C2.
25 Existence of Opt(δ) 1) Move the δ/3-bad vertices out. 2) If "many" (a ≥ δ/3 fraction) are δ/3-bad, "split" the cluster. Opt(δ) consists of singletons and δ-clean clusters, with D_Opt(δ) = O(1)·D_Opt.
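The two steps above can be sketched as code (a sketch only: the representation, the names, and my reading of the split threshold and of "split" as dissolving the cluster into singletons are assumptions, not details stated in the talk):

```python
def make_opt_delta(clusters, n, sign, delta):
    """Transform a clustering toward the Opt(delta) form: move each
    delta/3-bad vertex out as a singleton, and if at least a delta/3
    fraction of a cluster is bad, split the whole cluster into
    singletons."""
    s = lambda a, b: sign[(a, b)] if a < b else sign[(b, a)]

    def good(v, C, thresh):
        minus_in = sum(1 for u in C if u != v and s(v, u) == -1)
        plus_out = sum(1 for u in range(n) if u not in C and s(v, u) == +1)
        return minus_in < thresh * len(C) and plus_out < thresh * len(C)

    out = []
    for C in clusters:
        bad = {v for v in C if not good(v, C, delta / 3)}
        if len(bad) >= (delta / 3) * len(C):   # "many" bad vertices: split
            out.extend({v} for v in C)
        else:                                  # just move the bad ones out
            out.append(set(C) - bad)
            out.extend({v} for v in bad)
    return out

# a fully consistent input has no bad vertices and is left unchanged
sign = {(u, v): +1 if v < 3 else -1 for u in range(4) for v in range(u + 1, 4)}
print(make_opt_delta([{0, 1, 2}, {3}], 4, sign, 0.3))  # → [{0, 1, 2}, {3}]
```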
26 Main result [Figure: Opt(δ), made of δ-clean clusters and singletons]
27 Main result The algorithm produces 11δ-clean clusters. Guarantee: non-singletons are 11δ-clean; singletons are a subset of Opt(δ)'s singletons.
28 Main result Choose δ = 1/44, so the non-singletons are ¼-clean. Bound the mistakes among non-singletons by bad triangles; bound those involving singletons by the mistakes of Opt(δ). Guarantee: non-singletons are 11δ-clean; singletons are a subset of Opt(δ)'s singletons.
29 Main result Choose δ = 1/44, so the non-singletons are ¼-clean. Bound the mistakes among non-singletons by bad triangles; bound those involving singletons by the mistakes of Opt(δ). Approximation ratio: 9/δ² + 8. Guarantee: non-singletons are 11δ-clean; singletons are a subset of Opt(δ)'s singletons.
30 Open problems How about a small constant? Is D_opt ≤ 2·#{edge-disjoint bad triangles}? Is D_opt ≤ 2·#{fractional edge-disjoint bad triangles}? Extend to {-1, 0, +1} weights: approximation to a constant or even a log factor? Extend to the weighted case? Related: the Clique Partitioning Problem and cutting-plane algorithms [Grötschel et al.].
34 Nice features of the formulation There's no k (OPT can have anywhere from 1 to n clusters). If a perfect solution exists, then it's easy to find: C(v) = N+(v). Easy to get agreement on ½ of the edges.
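The second feature can be sketched directly (a sketch; names and representation are mine): read off each cluster as a vertex's +-neighborhood, and verify the result is fully consistent.

```python
from itertools import combinations

def perfect_clustering(n, sign):
    """If a fully consistent clustering exists, each cluster is exactly
    C(v) = {v} + N+(v), so it can be read off directly; return None
    when no perfect clustering exists."""
    s = lambda a, b: sign[(a, b)] if a < b else sign[(b, a)]
    unassigned, clusters = set(range(n)), []
    while unassigned:
        v = min(unassigned)
        C = {v} | {u for u in range(n) if u != v and s(v, u) == +1}
        if not C <= unassigned:        # overlaps an earlier cluster
            return None
        clusters.append(C)
        unassigned -= C
    for u, w in combinations(range(n), 2):   # verify zero disagreements
        same = any(u in C and w in C for C in clusters)
        if (s(u, w) == +1) != same:
            return None
    return clusters

sign = {(u, v): +1 if v < 3 else -1 for u in range(4) for v in range(u + 1, 4)}
print(perfect_clustering(4, sign))  # → [{0, 1, 2}, {3}]
```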
35 Algorithm 1. Pick a vertex v. Let C(v) = N+(v). 2. Modify C(v): (a) remove 3δ-bad vertices from C(v); (b) add 7δ-good vertices to C(v). 3. Output and delete C(v). Repeat until done, or until the above always makes empty clusters. 4. Output the nodes left as singletons.
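The four steps above can be sketched as follows (a sketch under assumptions: the order in which v is tried, the treatment of the residual graph, and δ = 1/44 as chosen later in the talk are my own reading, and all names are hypothetical):

```python
def cautious_clustering(n, sign, delta=1/44):
    """Sketch of the slide's algorithm: grow C = {v} + N+(v), remove
    3*delta-bad vertices, add 7*delta-good ones, output C, and leave
    whatever remains at the end as singletons."""
    s = lambda a, b: sign[(a, b)] if a < b else sign[(b, a)]

    def good(v, C, rest, thresh):
        """thresh-good w.r.t. C, counting only the remaining vertices."""
        minus_in = sum(1 for u in C if u != v and s(v, u) == -1)
        plus_out = sum(1 for u in rest
                       if u not in C and u != v and s(v, u) == +1)
        return minus_in < thresh * len(C) and plus_out < thresh * len(C)

    rest, clusters = set(range(n)), []
    while rest:
        chosen = None
        for v in sorted(rest):
            C = {v} | {u for u in rest if u != v and s(v, u) == +1}
            C = {x for x in C if good(x, C, rest, 3 * delta)}   # step 2(a)
            if C:
                C |= {y for y in rest - C
                      if good(y, C, rest, 7 * delta)}           # step 2(b)
                chosen = C
                break
        if chosen is None:      # every choice emptied out: stop (step 3)
            break
        clusters.append(chosen)
        rest -= chosen
    clusters.extend({v} for v in rest)   # step 4: leftovers as singletons
    return clusters

# two cleanly separated groups are recovered exactly
sign = {(u, v): +1 if (u < 4) == (v < 4) else -1
        for u in range(7) for v in range(u + 1, 7)}
print(cautious_clustering(7, sign))  # → [{0, 1, 2, 3}, {4, 5, 6}]
```

Slides 36-39 below walk through the removal and addition phases on a figure.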
36 Step 1 Choose v; let C = the + neighbors of v. [Figure: clusters C1 and C2]
37 Step 2 Vertex removal phase: if x is 3δ-bad, C = C - {x}. In the figure: 1) no vertex in C1 is removed; 2) all vertices in C2 are removed.
38 Step 3 Vertex addition phase: add 7δ-good vertices to C. 1) All remaining vertices in C1 will be added. 2) None in C2 are added. 3) Cluster C is 11δ-clean.
39 Case 2: v a singleton in Opt(δ) Choose v; C = the + neighbors of v. The same idea works.
40 The problem [Figure: vertices 1-4]