Finding Low Error Clusterings. Maria-Florina Balcan.

Approximate clustering without the approximation. Maria-Florina Balcan.

Clustering comes up everywhere: cluster news articles or web pages by topic, cluster protein sequences by function, cluster images by who is in them.

Formal Clustering Setup. S: a set of n objects [documents, web pages]. ∃ a "ground truth" clustering C_1, C_2, ..., C_k [the true clustering by topic, e.g. sports, fashion]. Goal: produce a clustering C'_1, ..., C'_k of low error, where error(C') = fraction of points misclassified, up to re-indexing of the clusters.

Formal Clustering Setup. S: a set of n objects [documents, web pages]. ∃ a "ground truth" clustering C_1, ..., C_k [the true clustering by topic, e.g. sports, fashion]. Goal: produce a clustering C'_1, ..., C'_k of low error. We also have a distance/dissimilarity measure, e.g., number of keywords in common, edit distance, wavelet coefficients, etc.
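
To make the error measure from the setup concrete (fraction of points misclassified, minimized over re-indexings of the clusters), here is a minimal Python sketch; it is not from the slides, the function name is ours, and the brute-force loop over permutations is only meant for small k.

```python
from itertools import permutations

def clustering_error(predicted, target, k):
    """Fraction of points misclassified, minimized over re-indexings of cluster labels.

    predicted, target: lists of cluster labels in {0, ..., k-1}, one per point.
    Brute-force over all k! label permutations; illustrative, fine only for small k.
    """
    n = len(target)
    best = n
    for perm in permutations(range(k)):
        mistakes = sum(1 for p, t in zip(predicted, target) if perm[p] != t)
        best = min(best, mistakes)
    return best / n
```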

Standard theoretical approach. View the data points as nodes in a weighted graph based on the distances, and pick some objective to optimize, e.g.:
– k-median: find center points c'_1, c'_2, ..., c'_k to minimize Σ_x min_i d(x, c'_i)
– k-means: find center points c'_1, c'_2, ..., c'_k to minimize Σ_x min_i d²(x, c'_i)
– min-sum: find a partition C'_1, ..., C'_k to minimize Σ_i Σ_{x,y ∈ C'_i} d(x, y)
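
For reference, a small Python sketch (not from the slides; the function names are ours) that evaluates each of these objectives for a given set of centers or a given partition:

```python
def k_median_cost(points, centers, d):
    """k-median objective: sum over points of the distance to the nearest center."""
    return sum(min(d(x, c) for c in centers) for x in points)

def k_means_cost(points, centers, d):
    """k-means objective: sum over points of the squared distance to the nearest center."""
    return sum(min(d(x, c) ** 2 for c in centers) for x in points)

def min_sum_cost(clusters, d):
    """Min-sum objective: sum over clusters of all ordered pairwise distances inside the cluster."""
    total = 0.0
    for C in clusters:
        for i in range(len(C)):
            for j in range(len(C)):
                if i != j:
                    total += d(C[i], C[j])
    return total
```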

Standard theoretical approach, continued. For example, the best known approximation for k-median is 3 + ε, and beating 1 + 2/e ≈ 1.7 is NP-hard. But our real goal is to get the points right!

Formal Clustering Setup, Our Perspective. Goal: a clustering C' of low error [e.g. by topic: sports, fashion]. So if we use a c-approximation to an objective Φ (e.g., k-median) in order to minimize the error rate, the implicit assumption is that every c-approximation to Φ is ε-close to the target (this is the (c, ε)-property, defined on the next slide).

Formal Clustering Setup, Our Perspective. The (c, ε)-property: every c-approximation to the objective Φ (e.g., k-median) is ε-close to the target clustering. Under this (c, ε)-property we are able to cluster well without approximating the objective at all, even though under the (c, ε)-property the problem of finding a c-approximation remains as hard as in the general case.

Formal Clustering Setup, Our Perspective. For k-median, for any c > 1, under the (c, ε)-property we get O(ε)-close to the target clustering, and even exactly ε-close if all target clusters are "sufficiently large". This holds even for values of c where getting a c-approximation is NP-hard.

Note on the standard approximation-algorithms approach. The motivation for improving the ratio from c_1 to c_2 is that the data might satisfy the (c_2, ε)-property but not the (c_1, ε)-property. This is legitimate: for any c_2 < c_1 one can construct a dataset and target satisfying the (c_2, ε)-property but not even the (c_1, 0.49)-property. However, we do even better!

Positive Results [Balcan-Blum-Gupta, SODA 09]. If the data satisfies the (c, ε)-property, then we get O(ε/(c-1))-close to the target. If the data satisfies the (c, ε)-property and the target clusters are large, then we get ε-close to the target.

[Balcan-Braverman, COLT 09] If the data satisfies the (c, ε)-property and the target clusters are large, then we get O(ε/(c-1))-close to the target. If the data satisfies the (c, ε)-property with arbitrarily small target clusters: for k < log n / log log n we still get O(ε/(c-1))-close to the target; for k > log n / log log n we output a list of size log log n such that the target clustering is close to one of the clusterings in the list.

How can we use the (c, ε) k-median property to cluster, without solving k-median?

Clustering from the (c, ε) k-median property. Suppose any c-approximate k-median solution must be ε-close to the target (for simplicity, say the target is the k-median optimum and all cluster sizes are > 2εn). For any point x, let w(x) = distance to its own center and w_2(x) = distance to the second-closest center, and let w_avg = avg_x w(x), so OPT = n · w_avg. Then:
1. At most εn points can have w_2(x) < (c-1) w_avg / ε.
2. At most 5εn/(c-1) points can have w(x) ≥ (c-1) w_avg / 5ε (by Markov's inequality, since Σ_x w(x) = n · w_avg).
All the rest (the good points) have a big gap: w(x) < (c-1) w_avg / 5ε and w_2(x) ≥ (c-1) w_avg / ε.

Clustering from the (c, ε) k-median property. Define the critical distance d_crit = (c-1) w_avg / 5ε. By the two bounds above, a 1 - O(ε) fraction of points (the good points) are within distance d_crit of their own center, hence within 2 d_crit of every good point in the same cluster, while any good point of a different cluster is at distance > 4 d_crit.
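
A minimal sketch of this split (not from the slides; w, w2, and the function name are our own, and it assumes the target centers, hence w and w_2, are known, which is only the case in the analysis):

```python
def split_good_bad(w, w2, c, eps):
    """Split points into good and bad ones by the critical-distance criterion.

    w[i]: distance of point i to its own (target) center.
    w2[i]: distance of point i to the second-closest center.
    Good points satisfy w[i] < d_crit and w2[i] >= 5 * d_crit,
    where d_crit = (c - 1) * w_avg / (5 * eps).
    """
    n = len(w)
    w_avg = sum(w) / n
    d_crit = (c - 1) * w_avg / (5 * eps)
    good = {i for i in range(n) if w[i] < d_crit and w2[i] >= 5 * d_crit}
    bad = set(range(n)) - good
    return good, bad, d_crit
```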

Clustering from the (c, ε) k-median property. So if we define a graph G connecting any two points x, y whenever d(x, y) ≤ 2 d_crit, then: good points within a cluster form a clique, and good points in different clusters have no common neighbors.

Clustering from the (c, ε) k-median property. So the world now looks like k large cliques of good points, one per target cluster, with no common neighbors between cliques, plus a small number of bad points.

Clustering from the (c, ε) k-median property. Algorithm (large-cluster case): create a graph H connecting x and y if they share ≥ b neighbors in G, and output the k largest components of H. If, furthermore, all target clusters have size > 2b + 1, where b = number of bad points = O(εn/(c-1)), then each output component agrees with a target cluster on all of its good points, so we get error O(ε/(c-1)).
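
A self-contained Python sketch of this large-cluster algorithm (illustrative only; the O(n²) graph construction, the function name, and the assumption that d_crit and b are given are ours):

```python
from collections import deque

def cluster_large_clusters(points, d, k, d_crit, b):
    """Build the threshold graph G (edges at distance <= 2*d_crit), then the graph H
    (edges between points sharing >= b neighbors in G), and return the k largest
    connected components of H. Points outside those components are left unclustered."""
    n = len(points)
    # Graph G: connect x, y if d(x, y) <= 2 * d_crit.
    G = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if d(points[i], points[j]) <= 2 * d_crit:
                G[i].add(j)
                G[j].add(i)
    # Graph H: connect x, y if they share at least b neighbors in G.
    H = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if len(G[i] & G[j]) >= b:
                H[i].add(j)
                H[j].add(i)
    # Connected components of H; the k largest are the output clusters.
    seen, components = [False] * n, []
    for s in range(n):
        if seen[s]:
            continue
        comp, queue = [], deque([s])
        seen[s] = True
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in H[u]:
                if not seen[v]:
                    seen[v] = True
                    queue.append(v)
        components.append(comp)
    components.sort(key=len, reverse=True)
    return components[:k]
```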

Clustering from the (c, ε) k-median property. If the clusters are not so large, we need to be a bit more careful, but we can still get error O(ε/(c-1)).

Clustering from the (c, ε) k-median property. With small clusters, some clusters could even be dominated by bad points. However, we show that a greedy algorithm will work!

Clustering from the (c, ε) k-median property. Greedy algorithm (small-cluster case):
For j = 1 to k do:
  – Pick v_j of highest degree in G.
  – Remove v_j and its neighborhood from G; call this C(v_j).
Output the k clusters C(v_1), ..., C(v_{k-1}), S − ∪_i C(v_i).
(A simple algorithm, but one needs to be careful with the analysis.)
Overall: the number of mistakes is O(b) = O(εn/(c-1)).
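
A short Python sketch of this greedy procedure (our naming; G is an adjacency-list threshold graph as in the previous sketch, and, following the slides, the k-th output cluster is simply everything left over):

```python
def greedy_cluster(G, k):
    """Repeatedly take the highest-degree remaining vertex together with its
    remaining neighborhood as a cluster; the last cluster is whatever is left."""
    n = len(G)
    remaining = set(range(n))
    clusters = []
    for _ in range(k - 1):
        if not remaining:
            break
        # Degree counted only among vertices not yet removed.
        v = max(remaining, key=lambda u: len(G[u] & remaining))
        C = (G[v] & remaining) | {v}
        clusters.append(C)
        remaining -= C
    clusters.append(remaining)
    return clusters
```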

From O(ε)-close to ε-close. Back to the large-cluster case: we can actually get ε-close (for any c > 1, though how "large" the clusters must be depends on c). Idea: there are two kinds of bad points:
– at most εn "confused" points, with w_2(x) - w(x) < (c-1) w_avg / ε;
– the rest are not confused, just far: w(x) ≥ (c-1) w_avg / 5ε, yet w_2(x) - w(x) ≥ 5 d_crit.

From O(ε)-close to ε-close. Given the output C' from the algorithm so far, reclassify each point x into the cluster of lowest median distance. The median is controlled by the good points, which pull the non-confused points in the right direction, so we get error ε.
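
A minimal sketch of this reclassification step (not from the slides; the function name is ours, clusters are given as lists of point indices and are assumed non-empty):

```python
import statistics

def reclassify_by_median(points, clusters, d):
    """Reassign every point to the cluster whose members are closest in median distance."""
    new_clusters = [[] for _ in clusters]
    for i, x in enumerate(points):
        medians = [statistics.median(d(x, points[j]) for j in C) for C in clusters]
        best = min(range(len(clusters)), key=lambda idx: medians[idx])
        new_clusters[best].append(i)
    return new_clusters
```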

Technical issues. The definition of the graph G depends on d_crit = (c-1) w_avg / 5ε, which depends on knowing w_avg = OPT/n. One solution (large-cluster case): keep increasing a guess of w_avg until, for the first time, all output components are large and ≤ b points are left unclustered. The guess might be too small, but it is sufficient for the second stage to work as desired, getting ε-close to the target.
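
A rough sketch of such a guessing loop (purely illustrative: the starting guess, the growth factor, the iteration cap, and the reuse of the cluster_large_clusters sketch above are all our own choices, not from the slides):

```python
def cluster_with_guessed_opt(points, d, k, c, eps, b, growth=1.1):
    """Increase the guess of w_avg (hence d_crit) until all k output components
    are large (> 2*b + 1) and at most b points remain unclustered."""
    n = len(points)
    dists = [d(points[i], points[j]) for i in range(n) for j in range(i + 1, n)]
    guess = min(x for x in dists if x > 0)  # crude lower bound on w_avg
    for _ in range(1000):  # bounded for safety in this sketch
        d_crit = (c - 1) * guess / (5 * eps)
        comps = cluster_large_clusters(points, d, k, d_crit, b)
        clustered = sum(len(C) for C in comps)
        if (len(comps) == k
                and all(len(C) > 2 * b + 1 for C in comps)
                and n - clustered <= b):
            return comps
        guess *= growth
    raise RuntimeError("no suitable guess found within this sketch's range")
```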

Technical issues. Another solution (works for both large and small clusters): use an approximation algorithm to get a constant-factor approximation to OPT; the approximation factor just goes into the O(ε) error term.

Extensions: the inductive setting. Draw a sample S and cluster S; then insert new points as they arrive, based on their median distance to the clusters produced. Key property used (large-clusters case): in each cluster we have more good points than bad points. If the target clusters are large enough and S is large enough, this holds over the sample as well. [Balcan-Roeglin-Teng, 2009]

(c, ε) k-means and min-sum properties. A similar algorithm and argument work for k-means, though the extension to exact ε-closeness breaks. For min-sum a more involved argument is needed:
– there is no longer a uniform d_crit (we could have big clusters with small distances and small clusters with big distances);
– still possible to solve for c > 2 and large clusters [BBG, SODA 2009], by connecting to balanced k-median: find center points c_1, ..., c_k and a partition C'_1, ..., C'_k minimizing Σ_i |C'_i| Σ_{x ∈ C'_i} d(x, c_i) (note that the best known approximation for this is O(log^{1+δ} n));
– even better bounds follow from a more involved argument [Balcan-Braverman, COLT 2009].
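
For concreteness, a one-function Python sketch (ours, not from the slides) of the balanced k-median objective just mentioned:

```python
def balanced_k_median_cost(clusters, centers, d):
    """Balanced k-median: sum over clusters of |C_i| times the total distance
    of the points of C_i to their center c_i."""
    return sum(len(C) * sum(d(x, c) for x in C)
               for C, c in zip(clusters, centers))
```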

(c, ε) property: summary. One can view the usual approach as saying that approximating the objective is the goal in itself. Not really: it makes an implicit assumption about how distances and closeness to the target relate. We make that assumption explicit, and we get around inapproximability results by using the structure implied by assumptions we were making anyway!

Extensions: the (ν, c, ε) k-median property [Balcan-Roeglin-Teng, 2009]. Definition: the data satisfies the (ν, c, ε) property if it satisfies the (c, ε) property after a small ν fraction of ill-behaved points have been removed. Here a list-clustering output is necessary, even if most of the points come from large target clusters: two different sets of outliers could result in two different clusterings satisfying the (ν, c, ε) property.

Example (why a list is necessary): take three groups A_1, A_2, A_3 plus potential outliers x_1, x_2, x_3.
– The optimal k-median solution on A_1 ∪ A_2 ∪ A_3 is C_1 = A_1, A_2, A_3, and the instance (A_1 ∪ A_2 ∪ A_3, d) satisfies the (c, ε) property with respect to C_1.
– The optimal k-median solution on A_1 ∪ A_2 ∪ A_3 ∪ {x_1} is C_2 = A_1 ∪ A_2, A_3, {x_1}, and the instance (A_1 ∪ A_2 ∪ A_3 ∪ {x_1}, d) satisfies the (c, ε) property with respect to C_2.
When most points come from large clusters, we can efficiently output a list of size at most k.

(c, ε) property: open problems. Other clustering objectives, e.g., sparsest cut? Other problems where the standard objective is just a proxy and implicit assumptions could be exploited?

Clustering: More General Picture. S: a set of n objects [documents]. ∃ a ground-truth clustering, given by a labeling l with l(x) ∈ {1, ..., t} [topic, e.g. sports, fashion]. Goal: a hypothesis h of low error, err(h) = min_σ Pr_{x~S}[σ(h(x)) ≠ l(x)], where σ ranges over re-indexings of the labels. We have similarity/dissimilarity information, and it has to be related to the ground truth!

Clustering: More General Picture. Fundamental, natural question: what properties of a similarity function would be sufficient to allow one to cluster well? The (c, ε) property for an objective Φ is just one example.

Clustering: More General Picture. Fundamental, natural question: what properties of a similarity function would be sufficient to allow one to cluster well? See Avrim Blum's talk on Monday for a general theory! [Balcan-Blum-Vempala, STOC 08]

Contrast with Standard Approaches (clustering theoretical frameworks):
– Approximation algorithms: input is a graph or an embedding into R^d; analyze algorithms that optimize various criteria over edges; score algorithms by approximation ratio.
– Mixture models: input is an embedding into R^d; score algorithms by error rate; strong probabilistic assumptions.
– Our approach: input is a graph or similarity information; score algorithms by error rate; no strong probabilistic assumptions; discriminative, not generative. Much better when the input graph/similarity is based on heuristics, e.g., clustering documents by topic or web search results by category.