Statistical Perturbation Theory for Spectral Clustering. Harrachov, 2007. A. Spence and Z. Stoyanov
Plan of the Talk: A. Clustering (brief overview). B. Deterministic Perturbation Theory. C. Statistical Perturbation Theory.
Graph Clustering
Graph Clustering + Perturbation ?
Gene Expression Data Clustering: An Application. There are over genes expressed in any one tissue; DNA arrays typically produce very noisy data. Issues: 1. Do genes in the same cluster behave similarly? 2. Do genes in different clusters behave differently?
Bi-partite Graphs
Matrix Form
A Real Data Matrix (Leukemia)
Spectral Clustering: General Idea. Discrete Optimisation Problem (NP-hard): exact, impractical. Approximation by a Real Optimisation Problem (tractable): heuristic, practical.
Discrete Optimisation → SVD. Solution: Singular Value Decomposition of scaled W. (Figure labels: Active, Inactive.)
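The slide names the SVD of a scaled W as the relaxed solution but the scaling itself did not survive extraction. The sketch below assumes the common normalisation D1^{-1/2} W D2^{-1/2} (row and column degree matrices D1, D2) and splits rows and columns by the sign of the second singular vectors. The function name `bipartite_bipartition` and the toy matrix are hypothetical, chosen only for illustration.

```python
import numpy as np

def bipartite_bipartition(W):
    """Split rows and columns of a nonnegative matrix W into two clusters each.

    Assumption: the scaling is D1^{-1/2} W D2^{-1/2}; the slides do not
    state which scaling was used.
    """
    W = np.asarray(W, dtype=float)
    d1 = W.sum(axis=1)                      # row degrees
    d2 = W.sum(axis=0)                      # column degrees
    W_scaled = W / np.sqrt(np.outer(d1, d2))
    U, s, Vt = np.linalg.svd(W_scaled, full_matrices=False)
    # The first singular pair is trivial (singular value 1);
    # the second pair carries the two-way partition.
    row_score = U[:, 1] / np.sqrt(d1)
    col_score = Vt[1, :] / np.sqrt(d2)
    return row_score > 0, col_score > 0, s

# Toy data: rows 0-1 go with columns 0-2, rows 2-3 with columns 3-4.
W = np.array([[5.0, 4.0, 3.0, 0.1, 0.1],
              [4.0, 5.0, 4.0, 0.1, 0.1],
              [0.1, 0.1, 0.1, 6.0, 5.0],
              [0.1, 0.1, 0.1, 5.0, 6.0]])
rows, cols, s = bipartite_bipartition(W)
print("row clusters:", rows.astype(int))
print("column clusters:", cols.astype(int))
```

Which cluster is labelled 0 or 1 depends on the arbitrary sign of the singular vectors; only the grouping is meaningful.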
Clustering Algorithm: Summary ACTIVE INACTIVE
Literature
Types of Graph Matrices
How we Cluster
Leukemia Data
Clustered Leukemia Data
Inaccuracies in the Data (Perturbation Theory)
Perturbation Theory (Deterministic Noise)
Deterministic Perturbation (Symmetric Matrix)
Linear Solve
Taylor Expansions
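The three slide titles above (symmetric perturbation, linear solve, Taylor expansion) fit the standard first-order theory for a simple eigenpair of A + eps*E. Differentiating (A + eps*E) v = lambda v at eps = 0 with the normalisation v^T v' = 0 gives one bordered linear system for both derivatives. The sketch below assumes that formulation, which the surviving slide text does not spell out; `eigpair_derivative` and the test matrices are hypothetical.

```python
import numpy as np

def eigpair_derivative(A, E, k):
    """Derivatives at eps = 0 of the k-th eigenpair of A + eps*E (A, E symmetric).

    Solves the bordered system
        (A - lam I) dv - dlam v = -E v,   v^T dv = 0,
    which is nonsingular when lam is a simple eigenvalue.
    """
    lam, V = np.linalg.eigh(A)
    n = A.shape[0]
    v = V[:, k]
    K = np.zeros((n + 1, n + 1))
    K[:n, :n] = A - lam[k] * np.eye(n)
    K[:n, n] = -v
    K[n, :n] = v
    rhs = np.concatenate([-E @ v, [0.0]])
    sol = np.linalg.solve(K, rhs)
    return lam[k], v, sol[n], sol[:n]

rng = np.random.default_rng(0)
n = 6
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.arange(1.0, n + 1)) @ Q.T   # eigenvalues 1..6, all simple
A = (A + A.T) / 2
B = rng.standard_normal((n, n))
E = (B + B.T) / 2                               # symmetric perturbation direction

lam, v, dlam, dv = eigpair_derivative(A, E, n - 1)

# Compare the first-order Taylor expansion with a direct eigensolve.
eps = 1e-5
lam_eps, V_eps = np.linalg.eigh(A + eps * E)
v_eps = V_eps[:, -1]
if v_eps @ v < 0:                               # fix the arbitrary sign
    v_eps = -v_eps
err_lam = abs(lam_eps[-1] - (lam + eps * dlam))
err_vec = np.linalg.norm(v_eps - (v + eps * dv))
print(err_lam, err_vec)
```

Both errors should be of order eps squared, which is what a first-order expansion predicts.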
Rectangular Case → Symmetric
Random Perturbations (plan): 1. The Model. 2. Issues with the Theory. 3. A Possible Solution via Simulations? 4. Experiments.
The Model
Difficulties with Random Matrix Theory (RMT)
Deterministic Perturbation → Stochastic Perturbation (simple eigenvector)
Deterministic Perturbation → Stochastic Perturbation (simple eigenvalues)
PP Plot: Test for Normality (Largest eigenvalue of a Symmetric Matrix)
Simulated Random Perturbation (Largest eigenvalue of a Symmetric Matrix)
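A simulation of this kind can be sketched as follows. The noise model is an assumption, since the content of "The Model" slide is lost: E is symmetric with independent N(0, 1) entries on and above the diagonal. To first order the largest eigenvalue of A + eps*E is lambda + eps * v^T E v, which under this model is normal with variance eps^2 * (2 - sum_i v_i^4). The code compares that prediction with simulated values and builds the points of a PP plot.

```python
import math
import numpy as np

rng = np.random.default_rng(2)
n, eps, trials = 8, 1e-3, 4000

Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.arange(1.0, n + 1)) @ Q.T      # simple eigenvalues 1..8
A = (A + A.T) / 2
lam, V = np.linalg.eigh(A)
v = V[:, -1]                                      # eigenvector of the largest eigenvalue

def sym_noise(rng, n):
    """Symmetric matrix with N(0, 1) entries (assumed noise model)."""
    G = rng.standard_normal((n, n))
    return np.triu(G) + np.triu(G, 1).T

samples = np.array([np.linalg.eigvalsh(A + eps * sym_noise(rng, n))[-1]
                    for _ in range(trials)])

# First-order prediction for the standard deviation of the largest eigenvalue.
predicted_std = eps * np.sqrt(2.0 - np.sum(v ** 4))
observed_std = samples.std(ddof=1)

# PP plot points: empirical probabilities against normal probabilities.
z = np.sort((samples - samples.mean()) / observed_std)
normal_cdf = np.array([0.5 * (1.0 + math.erf(x / math.sqrt(2.0))) for x in z])
empirical = (np.arange(1, trials + 1) - 0.5) / trials
pp_deviation = np.abs(normal_cdf - empirical).max()

print(observed_std / predicted_std, pp_deviation)
```

Plotting `normal_cdf` against `empirical` gives the PP plot; points close to the diagonal indicate a distribution close to normal.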
Deterministic Perturbation → Stochastic Perturbation (simple eigenvectors)
Results for Laplacian Matrices
Functional of the Eigenvector
Results for $h^T v_2$
PP Plot of $h^T v'(0)$: Test for Normality ($h = e_j$)
Histogram of $h^T v'(0)$: Simulations ($h = e_j$)
PP Plot of Simulated v[j]( ) (Distribution close to Normal)
Histogram of Simulated v[j]( ) (Distribution close to Normal)
Extension to the Rectangular Case
Probability of “Wrong Clustering”
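One way to make the probability of "wrong clustering" concrete is a Monte Carlo estimate: perturb the graph weights, recompute the second eigenvector of the Laplacian, and count how often any node changes sign relative to the unperturbed vector. Everything specific here is an assumption made for illustration: the toy graph, Gaussian noise on every edge weight, clipping negative weights to zero, and the definition of "wrong" as any sign change.

```python
import numpy as np

def fiedler_vector(W):
    """Eigenvector of the second smallest eigenvalue of the Laplacian D - W."""
    L = np.diag(W.sum(axis=1)) - W
    _, V = np.linalg.eigh(L)
    return V[:, 1]

def wrong_clustering_probability(W, sigma, trials, rng):
    """Fraction of noisy graphs whose sign partition differs from the clean one."""
    v0 = fiedler_vector(W)
    n = W.shape[0]
    iu = np.triu_indices(n, 1)
    wrong = 0
    for _ in range(trials):
        noisy = W[iu] + sigma * rng.standard_normal(len(iu[0]))
        Wp = np.zeros_like(W)
        Wp[iu] = np.maximum(noisy, 0.0)       # keep weights nonnegative
        Wp = Wp + Wp.T
        v = fiedler_vector(Wp)
        if v @ v0 < 0:                        # eigenvector sign is arbitrary
            v = -v
        wrong += int(np.any(np.sign(v) != np.sign(v0)))
    return wrong / trials

# Two triangles joined by one weak edge (hypothetical example graph).
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 1), (0, 2, 1), (1, 2, 1),
                (3, 4, 1), (3, 5, 1), (4, 5, 1), (2, 3, 0.1)]:
    W[i, j] = W[j, i] = w

rng = np.random.default_rng(3)
p_zero = wrong_clustering_probability(W, 0.0, 50, rng)
p_large = wrong_clustering_probability(W, 2.0, 200, rng)
print(p_zero, p_large)
```

Each trial costs a full eigendecomposition, which is the motivation for cheaper first-order approximations when many samples are needed.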
Issues with Numerics
Efficient Simulations
Solution via Simulations?
Solution via Simulations? (Algorithm)
Comparing: Direct Calculation Vs. Repeated Linear Solve
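The comparison on this slide can be illustrated under one reading of its title, which is an assumption: "direct calculation" means a fresh eigendecomposition of A + eps*E for every sample, while "repeated linear solve" means reusing a single bordered matrix for the first-order eigenvector correction, with one right-hand side per sample. The matrices and sample counts below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n, eps, trials = 6, 1e-5, 50

Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.arange(1.0, n + 1)) @ Q.T
A = (A + A.T) / 2
lam, V = np.linalg.eigh(A)
lam_max, v = lam[-1], V[:, -1]

# A batch of symmetric perturbations, shape (trials, n, n).
Es = rng.standard_normal((trials, n, n))
Es = (Es + Es.transpose(0, 2, 1)) / 2

# Linear-solve route: one bordered matrix, many right-hand sides.
K = np.zeros((n + 1, n + 1))
K[:n, :n] = A - lam_max * np.eye(n)
K[:n, n] = -v
K[n, :n] = v
RHS = np.zeros((n + 1, trials))
RHS[:n, :] = -(Es @ v).T                 # column t holds -E_t v
sol = np.linalg.solve(K, RHS)
v_lin = v[:, None] + eps * sol[:n, :]    # first-order eigenvectors, one per column

# Direct route: one eigendecomposition per sample.
v_dir = np.empty((n, trials))
for t in range(trials):
    _, Vt = np.linalg.eigh(A + eps * Es[t])
    u = Vt[:, -1]
    v_dir[:, t] = u if u @ v >= 0 else -u

max_diff = np.abs(v_lin - v_dir).max()
print(max_diff)
```

The two routes agree up to terms of order eps squared; the linear-solve route needs only one matrix factorisation regardless of the number of samples.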