
1 Clustering with Spectral Norm and the k-means algorithm Ravi Kannan Microsoft Research Bangalore joint work with Amit Kumar (Indian Institute of Technology, Delhi)

2 Mixture of Distributions. Given k distributions F_1, …, F_k with relative weights w_1, …, w_k, a point is sampled by first choosing F_i with probability w_i and then sampling from it.

3 Mixture of Distributions. The mixture density is p(x) = Σ_i w_i f_i(x).
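As a concrete illustration (not part of the talk), a minimal Python sketch of this sampling process, assuming spherical Gaussian components with made-up weights and means:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical mixture in d dimensions: weights w_i and means mu_i (illustrative values).
    d = 10
    weights = np.array([0.3, 0.7])                    # w_1, ..., w_k
    means = np.stack([np.zeros(d), 5 * np.ones(d)])   # mu_1, ..., mu_k

    def sample_mixture(n):
        """Draw n points: pick component i with probability w_i, then sample from F_i."""
        comps = rng.choice(len(weights), size=n, p=weights)
        return means[comps] + rng.standard_normal((n, d)), comps

    X, labels = sample_mixture(1000)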

4 Learning a mixture of distributions. Goal: given samples in d-dimensional space from a mixture of k components (d >> k), learn the mixture and classify the samples. Distributions: Gaussians, heavy-tailed, … Many applications: data mining, vision, …

5 Notation. μ_i : mean of F_i. σ_i : maximum variance of the projection of F_i along any line. σ = max_i σ_i.

6 Gaussian Components. How many standard deviations apart do the means need to be? Early results: a function of d and k, but d >> k. Later: a function of k alone for spherical Gaussians, and some non-spherical cases too [Vempala, Wang, …].

7 Other Distributions. Product distributions: |μ_i − μ_j| ≥ σ · poly(k/ε) [Dasgupta et al. ’05, Chaudhuri, Rao ’08]. Planted partition model [McSherry ’01]. These algorithms are quite different from each other and often quite non-trivial. Gaussians and the planted partition model have exponentially falling tails. What about heavier tails, say with only the mean and variance bounded?

8 Our Approach. A deterministic separation condition on the samples which guarantees correct clustering. A : set of n points, clustered into k clusters C_1, …, C_k. μ_i : mean of cluster C_i.

9 Our Approach. How do we model σ? A = the n × d matrix whose rows are the points x_1, x_2, …, x_n. C = the n × d matrix whose i-th row is the mean of the cluster containing x_i. Recall: ||T|| = max_{|x|=1} |Tx| (the spectral norm). Then ||A − C||^2 / n = max_u (mean squared distance to the centroid in direction u).
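A quick numerical check of this interpretation (a sketch with random data and an arbitrary clustering; nothing here is taken from the talk): the spectral norm of A − C is its largest singular value, and ||A − C||^2 / n equals the maximum over unit directions u of the mean squared distance to the cluster centroid along u.

    import numpy as np

    rng = np.random.default_rng(1)
    n, d, k = 300, 20, 3
    X = rng.standard_normal((n, d))          # A: one point per row
    labels = rng.integers(0, k, size=n)      # an arbitrary clustering

    # C: each row is the centroid of the cluster containing that point.
    centroids = np.stack([X[labels == r].mean(axis=0) for r in range(k)])
    C = centroids[labels]

    spec = np.linalg.norm(X - C, ord=2)      # spectral norm ||A - C||
    print("||A - C||^2 / n              :", spec**2 / n)

    # The top right singular vector of A - C attains the maximum.
    u = np.linalg.svd(X - C)[2][0]
    print("mean sq. distance along top u:", np.mean(((X - C) @ u) ** 2))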

10 Proximity Condition. Project x onto the line joining μ_r and μ_s, and let D_r, D_s be the distances of the projected point to μ_r and μ_s. The condition asks that the projection of x be closer to μ_r than to μ_s by a margin (the margin contains factors of k and 1/w). A point x ∈ C_r satisfies the proximity condition if this holds for all s ≠ r.

11 Proximity Condition. Requiring proximity only in the projection onto the line joining μ_r and μ_s is much weaker than requiring proximity in the whole space.
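What checking such a condition might look like in code (a sketch only: the exact margin is given on the slide image, not in this transcript, so the form c · k · (1/√|C_r| + 1/√|C_s|) · ||A − C|| below is an assumption made purely for illustration):

    import numpy as np

    def satisfies_proximity(x, r, means, sizes, spec_norm, c=100.0):
        """Does point x of cluster r land closer to mu_r than to every mu_s
        by the assumed margin, after projecting onto the line joining them?"""
        k = len(means)
        for s in range(k):
            if s == r:
                continue
            u = means[s] - means[r]
            u = u / np.linalg.norm(u)
            d_r = (x - means[r]) @ u           # signed projected distance to mu_r
            d_s = (means[s] - x) @ u           # signed projected distance to mu_s
            margin = c * k * (1 / np.sqrt(sizes[r]) + 1 / np.sqrt(sizes[s])) * spec_norm
            if d_s - d_r < margin:
                return False
        return True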

12 Our Result. Thm: If all points satisfy the proximity condition, then we can correctly classify all the points. This answers an open question posed in [Kannan, Vempala ’09], under a much weaker condition.

13 Our Result (Approximate version). Thm: If all but an ε fraction of the points satisfy the proximity condition, then we can correctly classify all but an O(ε) fraction of the points.

14 Our Result. Applies to many settings of learning mixtures. Algorithm: spectral clustering + Lloyd's k-means. Shows that Lloyd's algorithm converges if the initial seeds are chosen carefully.

15 Applications: Gaussians [Dasgupta et al. ’07]. D_r ≤ O(σ) with high probability, so an inter-center separation of Ω(σ) implies proximity.

16 Applications: Planted Models. V = V_1 ∪ … ∪ V_k, together with a k × k matrix P of probabilities (the figure shows an example with entries such as 0.1, 0.9, 0.7, 0.4, 0.2). An edge is placed between x ∈ V_i and y ∈ V_j with probability P_ij. Goal: given an instance G, recover the partition V_1, …, V_k.

17 Applications: Planted Models. A : the n × n adjacency matrix; the rows of A are points in n dimensions. Each entry of A is an independent Bernoulli r.v. with standard deviation σ. ||A − C|| = O(σ n^{1/2}) [Wigner], so an inter-row separation (in C) of Ω(σ) implies proximity. Our algorithm matches the best-known results [McSherry ’01].
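A sketch of generating an instance of this model and comparing ||A − C|| with σ√n (all sizes and probabilities below are illustrative, not from the talk):

    import numpy as np

    rng = np.random.default_rng(2)

    sizes = [60, 60, 80]                       # |V_1|, ..., |V_k|
    P = np.array([[0.6, 0.1, 0.1],             # symmetric k x k probability matrix
                  [0.1, 0.5, 0.1],
                  [0.1, 0.1, 0.7]])

    labels = np.repeat(np.arange(len(sizes)), sizes)
    probs = P[labels][:, labels]               # entry (x, y) is P_ij for x in V_i, y in V_j
    A = np.triu(rng.random(probs.shape) < probs, 1).astype(float)
    A = A + A.T                                # symmetric 0/1 adjacency matrix, no self-loops

    C = probs                                  # expected adjacency matrix (up to the diagonal)
    sigma = np.sqrt((P * (1 - P)).max())       # largest entrywise standard deviation
    print("||A - C|| =", np.linalg.norm(A - C, ord=2),
          " sigma * sqrt(n) =", sigma * np.sqrt(len(labels)))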

18 Applications: distributions with bounded variance. Assuming |μ_s − μ_r| ≥ σ/ε, all but an O(ε) fraction of the points satisfy the proximity condition. We do not need product distributions.

19 Algorithm. Two key steps. 1. Compute the best rank-k approximation to the points, then run any constant-factor approximation algorithm for the k-means problem on these projected points. This yields a set of initial candidate means (starting points). Message to practitioners in ML, statistics, and elsewhere who knew long before theory that k-means works: you are right (this time), but do start with the "natural" step 1.
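A minimal numpy sketch of step 1: the rank-k projection via SVD, followed by a k-means++-style seeding standing in for "any constant-factor approximation algorithm" (the function names and the choice of seeding are assumptions for illustration, not the talk's):

    import numpy as np

    def rank_k_projection(A, k):
        """Project the rows of A onto the span of the top-k right singular
        vectors, i.e. replace A by its best rank-k approximation."""
        _, _, Vt = np.linalg.svd(A, full_matrices=False)
        return A @ Vt[:k].T @ Vt[:k]

    def kmeans_pp_seeds(X, k, rng):
        """k-means++-style seeding on the projected points, used here as a
        stand-in for a constant-factor k-means approximation."""
        centers = [X[rng.integers(len(X))]]
        for _ in range(k - 1):
            d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
            centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        return np.array(centers)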

20 Algorithm. Theorem: For each original center μ_r, step 1 (SVD) gives an estimated center ν_r close to it.

21 Algorithm. Step 2: 1. Let ν_1, …, ν_k be the current centers. 2. Assign each point to its closest center; this partitions the points into S_1, …, S_k. 3. Update the centers to the means of S_1, …, S_k. Repeat.
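And a sketch of step 2 itself, Lloyd's iterations (again a minimal illustration of mine, not the talk's code):

    import numpy as np

    def lloyd(X, centers, iterations=20):
        """Repeatedly assign each point to its closest center and replace
        each center by the mean of the points assigned to it."""
        for _ in range(iterations):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            assign = d2.argmin(axis=1)                 # induces S_1, ..., S_k
            centers = np.array([X[assign == r].mean(axis=0)
                                if np.any(assign == r) else centers[r]
                                for r in range(len(centers))])
        return centers, assign

Putting the sketches from steps 1 and 2 together on a data matrix A: seeds = kmeans_pp_seeds(rank_k_projection(A, k), k, np.random.default_rng(0)), then centers, labels = lloyd(A, seeds).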

22 Step 2: Example.

23 Step 2: Example.

24 Step 2: Example.

25 Key Technical Lemma. 1. Let ν_1, …, ν_k be the current centers. 2. Assign each point to its closest center; this partitions the points into S_1, …, S_k. 3. Update the centers to the means of S_1, …, S_k; let η_1, …, η_k be the new centers.

26 Misclassified Point. (Figure: the true means and the current centers along a direction y, with distances of order 4σ and tσ marked; a point of C_1 misclassified to the other center must project at least u away from its own mean along y.) How many such points can there be? If there were too many, (A − C)·y would have at least that many coordinates of value at least u; but then |(A − C)·y| would be more than σ n^{1/2}.
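The counting step, written out (a reconstruction of the argument sketched on the slide; the σ n^{1/2} bound on ||A − C|| from the planted-model slide then bounds the number of misclassified points):

    If each of $m$ misclassified points contributes a coordinate of size at
    least $u$ to $(A-C)\,y$ for some unit vector $y$, then
    \[
        \|(A-C)\,y\|^2 \;\ge\; m\,u^2
        \qquad\Longrightarrow\qquad
        m \;\le\; \frac{\|(A-C)\,y\|^2}{u^2} \;\le\; \frac{\|A-C\|^2}{u^2}.
    \]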

27 Misclassified Point. Number of misclassified points (bound given on the slide).

28 Misclassified Point. Given this bound on the number of misclassified points, removing them from C_1 shifts its mean by at most δσ/t. Similarly for the addition of misclassified points from other clusters.

29 Misclassified Points. The mean of the misclassified points is at a bounded distance from the overall mean of the cluster, so removing them shifts the mean of the remaining points by only a correspondingly small amount (the explicit bounds are on the slide).

30 Open Problems. Weaker proximity conditions that yield a separation between the means of C_r and C_s depending only on σ_r and σ_s? Better dependence on w_min? Distributions with unbounded variance? How to capture other separation conditions (e.g., [Brubaker, Vempala ’08])?

