The Stability of a Good Clustering Marina Meila University of Washington

Presentation transcript:

The Stability of a Good Clustering Marina Meila University of Washington

The setting: Data (similarities) → Objective (NCut, K-means distortion) → Algorithm (spectral clustering, K-means). Optimizing these criteria is NP-hard, but that is the worst case; the interesting case is that "spectral clustering and K-means work well when a good clustering exists." This talk: if a "good" clustering exists, it is "unique"; if a "good" clustering is found, it is provably good.

Results summary. Given an objective (NCut or the K-means distortion), the data, and a clustering Y with K clusters, there is a spectral lower bound on the distortion. If the distortion of Y exceeds this lower bound by only a small amount, then d(Y, Y_opt) is small, where Y_opt is the best clustering with K clusters.

A graphical view. [Figure: the distortion of each clustering plotted over the space of clusterings, together with the spectral lower bound.]

Overview
- Introduction
- Matrix representations for clusterings
- Quadratic representation for the clustering cost
- The misclassification error distance
- Results for NCut (easier)
- Results for the K-means distortion (harder)
- Discussion

Clusterings as matrices. A clustering of {1, 2, ..., n} with K clusters (C_1, C_2, ..., C_K) is represented by an n x K indicator matrix X, either unnormalized (X_ik = 1 if point i is in cluster C_k, 0 otherwise) or normalized (each column rescaled to unit norm). All such matrices have orthogonal columns.
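
A minimal numpy sketch of these two representations (not from the slides); the normalization shown, rescaling each column to unit Euclidean norm, is the standard choice and is assumed here since the slide's formulas were not preserved:

```python
import numpy as np

def indicator_matrix(labels, K):
    """Unnormalized n x K matrix: X[i, k] = 1 iff point i belongs to cluster C_k."""
    n = len(labels)
    X = np.zeros((n, K))
    X[np.arange(n), labels] = 1.0
    return X

def normalized_indicator(labels, K):
    """Normalized version: each column rescaled to unit norm, so the columns are orthonormal."""
    X = indicator_matrix(labels, K)
    return X / np.sqrt(X.sum(axis=0, keepdims=True))

labels = np.array([0, 0, 1, 2, 2, 2])      # a clustering of {1,...,6} with K = 3
Xn = normalized_indicator(labels, 3)
print(np.allclose(Xn.T @ Xn, np.eye(3)))   # orthonormal columns -> True
```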

Distortion is quadratic in X. Both costs can be written as a constant minus a quadratic form in the (appropriately normalized) indicator matrix: for NCut, K − trace(X^T A X) with A the normalized similarity matrix; for the K-means distortion, trace(G) − trace(X^T G X) with G the Gram matrix of the data.
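
The formulas on this slide were images and are not preserved; the sketch below uses the standard identities consistent with them, namely NCut(C) = K − Σ_k assoc(C_k)/vol(C_k) and K-means distortion = trace(G) − trace(X^T G X). Treat the exact normalizations as assumptions:

```python
import numpy as np

def ncut_cost(S, labels, K):
    """NCut(C) = K - sum_k assoc(C_k)/vol(C_k) for a symmetric similarity matrix S."""
    d = S.sum(axis=1)                        # node degrees / volumes
    cost = float(K)
    for k in range(K):
        mask = labels == k
        cost -= S[np.ix_(mask, mask)].sum() / d[mask].sum()
    return cost

def kmeans_distortion(D, labels, K):
    """Sum of squared distances to cluster means, written as the quadratic form
    trace(G) - trace(X^T G X), with G = D D^T and X the normalized indicator matrix."""
    n = len(labels)
    X = np.zeros((n, K))
    X[np.arange(n), labels] = 1.0
    X /= np.sqrt(X.sum(axis=0, keepdims=True))
    G = D @ D.T
    return np.trace(G) - np.trace(X.T @ G @ X)
```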

The confusion matrix. Two clusterings: (C_1, C_2, ..., C_K) with K clusters and (C'_1, C'_2, ..., C'_{K'}) with K' clusters. Their confusion matrix is the K x K' matrix with entries m_{kk'} = |C_k ∩ C'_{k'}|, the number of points that fall in cluster k of the first clustering and cluster k' of the second.

The misclassification error distance. From the confusion matrix, d_CE(C, C') = 1 − (1/n) max_π Σ_k m_{k, π(k)}, where the maximum is over one-to-one matchings π between the clusters of the two clusterings; the optimal matching is computed by the maximal bipartite matching algorithm.
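
A sketch of both constructions; using scipy's linear_sum_assignment (the Hungarian algorithm) to solve the maximal bipartite matching is my tooling choice, not something the slide specifies:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def confusion_matrix(labels, labels_prime, K, K_prime):
    """m[k, k'] = number of points put in cluster k by C and in cluster k' by C'."""
    M = np.zeros((K, K_prime), dtype=int)
    for a, b in zip(labels, labels_prime):
        M[a, b] += 1
    return M

def misclassification_error(labels, labels_prime, K, K_prime):
    """d_CE(C, C') = 1 - (1/n) * (maximum total mass of a one-to-one cluster matching)."""
    M = confusion_matrix(labels, labels_prime, K, K_prime)
    rows, cols = linear_sum_assignment(-M)   # negate to maximize the matched mass
    return 1.0 - M[rows, cols].sum() / len(labels)

C  = np.array([0, 0, 1, 1, 2, 2])
Cp = np.array([1, 1, 0, 0, 2, 2])            # same partition, clusters relabeled
print(misclassification_error(C, Cp, 3, 3))  # 0.0
```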

Results for NCut. Given the data, a similarity matrix A (n x n), and a clustering X (n x K):
- Lower bound for NCut (M02, YS03, BJ03): NCut(X) ≥ K − (λ_1 + ... + λ_K), where λ_1 ≥ λ_2 ≥ ... are the largest eigenvalues of A.
- Upper bound for d(X, X_opt) (MSX'05): holds whenever NCut(X) is close enough to this lower bound relative to the eigengap λ_K − λ_{K+1}.
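
A minimal sketch of the lower bound, assuming A is the symmetric (normalized) similarity matrix and that the bound has the standard form K minus the sum of the K largest eigenvalues, as in the cited references:

```python
import numpy as np

def spectral_lower_bound(A, K):
    """Returns (K - sum of the K largest eigenvalues of A, eigengap lambda_K - lambda_{K+1})."""
    evals = np.linalg.eigvalsh(A)[::-1]      # eigenvalues of symmetric A, largest first
    return K - evals[:K].sum(), evals[K - 1] - evals[K]
```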

Proof outline. Relaxed minimization: optimize the quadratic cost over all n x K matrices X with orthonormal columns; the solution is X*, the K principal eigenvectors of A. If the gap between NCut(X) and the lower bound is small w.r.t. the eigengap λ_K − λ_{K+1}, then X is close to X*. If two clusterings X, X' are both close to X*, then trace(X^T X') is large, and therefore d(X, X') is small (by a convexity argument).
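
A small sketch of the relaxed solution and of a closeness measure between subspaces; using ||X^T X*||_F (equivalently, the trace quantity above) as the measure is an assumption, since the slide's exact expressions were not preserved:

```python
import numpy as np

def relaxed_solution(A, K):
    """X*: the n x K matrix of the K principal eigenvectors of the symmetric matrix A,
    i.e. the solution of the relaxed problem over matrices with orthonormal columns."""
    _, evecs = np.linalg.eigh(A)             # eigenvectors in ascending eigenvalue order
    return evecs[:, ::-1][:, :K]

def alignment(X, X_star):
    """||X^T X*||_F; close to sqrt(K) when the two column spaces nearly coincide."""
    return np.linalg.norm(X.T @ X_star, ord="fro")
```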

Distances between clusterings: the "χ²" distance. Pearson's χ² functional satisfies 1 ≤ χ²(C, C') ≤ K, with χ²(C, C') = K iff C = C' and the minimum attained at independence. It is used to define a "distance" between clusterings (not a metric); a variant was used by Bach & Jordan 03 and Hubert & Arabie 85.
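
The slide's formula was an image; the sketch below computes the standard Pearson χ² functional from the joint cluster probabilities, which has exactly the range 1 ≤ χ² ≤ K stated above (the centered statistic on the next slide is this quantity minus one). All clusters are assumed non-empty:

```python
import numpy as np

def chi2_functional(labels, labels_prime, K):
    """Pearson's chi^2 functional: sum_{k,k'} p_{kk'}^2 / (p_k p'_{k'}).
    Equals 1 when the clusterings are independent and K iff they coincide."""
    n = len(labels)
    P = np.zeros((K, K))
    for a, b in zip(labels, labels_prime):
        P[a, b] += 1.0 / n                   # joint cluster probabilities p_{kk'}
    p, p_prime = P.sum(axis=1), P.sum(axis=0)
    return float((P**2 / np.outer(p, p_prime)).sum())
```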

Equivalently, the centered χ² is Pearson's statistic: 0 ≤ χ² ≤ K − 1, with χ²(π, π') = K − 1 iff π = π'. It measures how "close" two clusterings are and is used to define a "distance". "Stability" of the best clustering, Theorem (M & Xu, 03): for any S and any clusterings π, π' with K clusters, their distance is bounded in terms of how far their NCut values lie above the spectral lower bound [the exact bound is a formula not preserved in the transcript].

Stability Theorem 2. Let π, π' be two clusterings whose NCut values are close to the lower bound. Then d_CE(π, π') is small, via a bound that relates d_CE to d_χ². Proof: linear algebra plus the convexity of χ². Tighter bounds are possible (next slide compares the d_CE and d_χ² bounds).

Tighter bounds. [Figure: comparison of the d_CE and d_χ² bounds as functions of d(π, C), for C uniform and for C non-uniform.]

Why the eigengap matters. Example: A has 3 diagonal blocks but K = 2, so the clusterings C and C' merge different pairs of blocks. Then gap(C) = gap(C') = 0, yet C and C' are not close. Here the top eigenvalue of A has multiplicity 3, so the eigengap λ_2 − λ_3 is zero and the stability theorem does not apply; without the eigengap condition no such bound can hold.
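
A numerical illustration of this example (the block sizes are arbitrary): with three identical diagonal blocks the top eigenvalue has multiplicity 3, so for K = 2 the eigengap λ_2 − λ_3 is exactly zero and the bound is vacuous.

```python
import numpy as np
from scipy.linalg import block_diag

m = 5
A = block_diag(*[np.ones((m, m))] * 3)       # similarity with 3 identical diagonal blocks

evals = np.linalg.eigvalsh(A)[::-1]
print(evals[:4])                             # approx [5, 5, 5, 0]: lambda_2 == lambda_3
print("eigengap for K = 2:", evals[1] - evals[2])   # 0.0
```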

Remarks on stability results
- No explicit conditions on S. This is a different flavor from other stability results, e.g. Kannan et al. 00 and Ng et al. 01, which assume S is "almost" block diagonal.
- But... the results apply only if a good clustering is found, and there are S matrices for which no clustering satisfies the theorem.
- The bound depends on aggregate quantities only: K and the cluster sizes (= probabilities).
- Points are weighted by their volumes (degrees): good in some applications, and bounds for unweighted distances can be obtained.

Is the bound ever informative? An experiment: S = a "perfect" (block diagonal) similarity matrix plus additive noise.
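
A sketch in the spirit of this experiment; the block sizes, the noise model, and the crude way nonnegativity is restored are all my guesses, not the talk's actual setup:

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(0)
K, m, sigma = 3, 20, 0.1
S = block_diag(*[np.ones((m, m))] * K)       # "perfect" similarity: K disconnected blocks
E = sigma * rng.standard_normal(S.shape)
S_noisy = np.abs(S + (E + E.T) / 2)          # symmetric additive noise, kept nonnegative

labels = np.repeat(np.arange(K), m)          # the planted clustering
d = S_noisy.sum(axis=1)
ncut = K - sum(S_noisy[np.ix_(labels == k, labels == k)].sum() / d[labels == k].sum()
               for k in range(K))
A = S_noisy / np.sqrt(np.outer(d, d))        # normalized similarity D^{-1/2} S D^{-1/2}
evals = np.linalg.eigvalsh(A)[::-1]
print("NCut:", ncut, "lower bound:", K - evals[:K].sum(), "eigengap:", evals[K-1] - evals[K])
```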

The K-means distortion: we can do the same... but the K-th principal subspace is typically not stable. [Figure: experiment with K = 4, dim = 30.]

New approach: use K − 1 vectors. A non-redundant representation Y, a new expression for the distortion, and a new (relaxed) optimization problem.

Solution of the new problem. For the relaxed optimization problem, the solution is given by U, the K − 1 principal eigenvectors of A, together with W, a K x K orthogonal matrix with a fixed first row.

As before: solve the relaxed minimization; if the gap to the lower bound is small, Y is close to Y*; if two clusterings Y, Y' are both close to Y*, then ||Y^T Y'||_F is large, and hence d(Y, Y') is small.

Theorem. For any two clusterings Y, Y' whose clusters all have positive weight, d(Y, Y') is bounded whenever both distortions are close enough to the lower bound. Corollary: a bound for d(Y, Y_opt).

Experiments: 20 replicates, K = 4, dim = 30. [Figure: the true error and the bound plotted against p_min.]


Conclusions
- First (?) distribution-independent bounds on the clustering error: data dependent, and they hold when the data is well clustered (which is the case of interest). Tight? Not yet...
- In addition: an improved variational bound for the K-means cost, and a local equivalence between the "misclassification error" distance and the "Frobenius norm distance" (also known as the χ² distance).
- Related work: bounds for mixtures of Gaussians (Dasgupta, Vempala); nearest K-flat to n points (Tseng); variational bounds for sparse PCA (Moghaddam).