
Affine-invariant Principal Components. Charlie Brubaker and Santosh Vempala, Georgia Tech School of Computer Science, Algorithms and Randomness Center.

What is PCA? "PCA is a mathematical tool for finding directions in which a distribution is stretched out." Widely used in practice. Gives the best known results for some problems.

History. First discussed by Euler in a work on the inertia of rigid bodies (1730). Principal axes were identified as eigenvectors by Lagrange. The power method for finding eigenvectors was published in 1929, before computers were available. Ubiquitous in practice today: bioinformatics, econometrics, data mining, computer vision, ...

Principal Components Analysis. For points a_1, ..., a_m in R^n, the principal components are orthogonal vectors v_1, ..., v_n such that V_k = span{v_1, ..., v_k} minimizes the sum of squared distances sum_i d(a_i, V_k)^2 among all k-dimensional subspaces. Like regression. Computed via the SVD.
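As a concrete illustration (not from the talk), here is a minimal NumPy sketch of this definition, assuming synthetic data and subspaces through the origin: the top-k right singular vectors of the data matrix span the subspace V_k that minimizes the sum of squared distances, here compared against a random k-dimensional subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 500, 10, 2

# Synthetic points a_1..a_m in R^n (rows of A), stretched along two directions.
A = rng.standard_normal((m, n)) @ np.diag([5.0, 3.0] + [0.5] * (n - 2))

# Principal components = right singular vectors of A.
_, _, Vt = np.linalg.svd(A, full_matrices=False)
Vk = Vt[:k]                       # orthonormal basis of the top-k subspace

def sum_sq_dist(A, basis):
    """Sum of squared distances from the rows of A to span(basis)."""
    proj = A @ basis.T @ basis    # projection of each point onto the subspace
    return np.sum((A - proj) ** 2)

# The PCA subspace should beat (or tie) any other k-dimensional subspace,
# e.g. a random one.
Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
print("top-k subspace:  ", sum_sq_dist(A, Vk))
print("random subspace: ", sum_sq_dist(A, Q.T))
```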

Singular Value Decomposition (SVD). Any real m x n matrix A can be decomposed as A = U Sigma V^T, where U (m x m) and V (n x n) are orthogonal and Sigma is an m x n diagonal matrix of nonnegative singular values.
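A quick NumPy check of the decomposition and of the rank-k truncation A_k used below (illustrative sketch, random matrix of my choosing):

```python
import numpy as np

A = np.random.default_rng(1).standard_normal((6, 4))   # real m x n matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)       # A = U diag(s) V^T

# Exact reconstruction, and the rank-k approximation A_k.
assert np.allclose(A, U @ np.diag(s) @ Vt)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]               # keep only the top k singular values
print("rank-2 approximation error:", np.linalg.norm(A - A_k))
```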

PCA (continued). Example: for a Gaussian, the principal components are the axes of the ellipsoidal level sets. The "top" principal components are the directions in which the data is "stretched out."

Why Use PCA? 1. Reduces computation or space: space goes from O(mn) to O(mk+nk). (Random projection and random sampling also reduce the space requirement.) 2. Reveals interesting structure that is hidden in high dimension.

Problem. Learn a mixture of Gaussians: classify unlabeled samples. More generally, each component is a logconcave distribution (e.g., a Gaussian). Means, variances, and mixing weights are unknown.

Distance-based Classification. "Points from the same component should be closer to each other than those from different components."

Mixture models. Easy to unravel if the components are far enough apart; impossible if the components are too close.

Distance-based classification. How far apart do the components need to be? A separation of roughly n^{1/2} standard deviations suffices [Dasgupta '99]; this was improved to roughly n^{1/4} [Dasgupta, Schulman '00] and generalized further [Arora, Kannan '01].

PCA. Project to the span of the top k principal components of the data, i.e., replace A with its rank-k approximation A_k. Apply distance-based classification in this subspace.
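A hedged sketch of this step on synthetic data of my choosing (two spherical Gaussians): after projecting onto the top-k principal components, within-component distances are much smaller than between-component distances, so distance-based classification applies.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, k = 50, 400, 2

# Toy mixture of two spherical Gaussians in R^50 (parameters are illustrative).
mu2 = np.zeros(n); mu2[0] = 10.0
X = np.vstack([rng.standard_normal((m, n)),
               rng.standard_normal((m, n)) + mu2])
labels = np.array([0] * m + [1] * m)

# Project onto the span of the top-k principal components (rank-k SVD).
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Y = X @ Vt[:k].T                    # k-dimensional representation of each point

# Distance-based classification is now easy: within-component distances are
# much smaller than between-component distances in the projected space.
D = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
same = labels[:, None] == labels[None, :]
print("avg within-component distance: ", D[same].mean())
print("avg between-component distance:", D[~same].mean())
```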

Main idea Subspace of top k principal components spans the means of all k Gaussians

SVD in geometric terms. The rank-1 approximation is the projection to the line through the origin that minimizes the sum of squared distances. The rank-k approximation is the projection to the k-dimensional subspace that minimizes the sum of squared distances.

Why? Best line for 1 Gaussian? The line through the mean. Best k-subspace for 1 Gaussian? Any k-subspace through the mean. Best k-subspace for k Gaussians? The k-subspace through all k means!

How general is this? Theorem [V-Wang '02]. For any mixture of weakly isotropic distributions, the best k-subspace is the span of the means of the k components. ("Weakly isotropic": the covariance matrix is a multiple of the identity.)

PCA. Projection to the span of the means reduces the problem to k dimensions; for spherical Gaussians, span(means) = the PCA subspace of dimension k.

Sample SVD. The sample SVD subspace is "close" to the mixture's SVD subspace: it doesn't span the means, but it is close to them.

2 Gaussians in 20 Dimensions

4 Gaussians in 49 Dimensions

Mixtures of Logconcave Distributions. Theorem [Kannan-Salmasian-V '04]. For any mixture of k logconcave distributions with SVD subspace V, the means of the components lie close to V: their (weighted) squared distances to V are bounded in terms of k and the maximum directional variances of the components.

Mixtures of Nonisotropic, Logconcave Distributions. Theorem [Kannan, Salmasian, V '04]. The PCA subspace V is "close" to the span of the means, provided that the means are well separated. Required separation: the intermean distances must exceed a polynomial in k times sigma_max, where sigma_max^2 is the maximum directional variance of a component. The polynomial was improved by Achlioptas-McSherry.

However,… PCA collapses separable “pancakes”

Limits of PCA. The algorithm is not affine-invariant: any instance can be made bad by an affine transformation. Spherical Gaussians become parallel "pancakes" but remain separable.

Parallel Pancakes. Still separable, but the previous algorithms don't work.

Separability

Hyperplane Separability. PCA is not affine-invariant. Is hyperplane separability sufficient to learn a mixture?

Affine-invariant principal components? What is an affine-invariant property that distinguishes 1 Gaussian from 2 pancakes? Or a ball from a cylinder?

Isotropic PCA. 1. Make the point set isotropic via an affine transformation. 2. Reweight the points according to a spherically symmetric function f(|x|). 3. Return the 1st and 2nd moments of the reweighted points.

Isotropic PCA [BV'08]. Goal: go beyond 1st and 2nd moments to find "interesting" directions. Why? What if all 2nd moments are equal? This isotropy can always be achieved by an affine transformation.
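To make the "equal second moments" point concrete, here is a toy NumPy sketch of my own (parameters and the Gaussian bandwidth are arbitrary illustrative choices): for a balanced, already-isotropic mixture of two "pancakes", every direction has variance about 1, so plain PCA is uninformative, but after reweighting by a spherically symmetric Gaussian the second moment along the hidden intermean direction stands out.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 20, 20000

# Balanced, already-isotropic mixture of two "pancakes" separated along e1:
# means at +/- 0.995 e1, tiny variance along e1, unit variance elsewhere,
# so the mixture has variance ~1 in every direction.
a = 0.995
s1 = np.sqrt(1.0 - a ** 2)
signs = rng.choice([-1.0, 1.0], size=m)
X = rng.standard_normal((m, n))
X[:, 0] = signs * a + s1 * X[:, 0]

# All second moments are (nearly) equal, so plain PCA is uninformative.
print("unweighted variances of first 3 coordinates:", np.round(X.var(axis=0)[:3], 3))

# Isotropic PCA: reweight by a spherically symmetric Gaussian f(|x|) and look
# at the 1st and 2nd moments of the reweighted points.
w = np.exp(-np.sum(X ** 2, axis=1) / 4.0)       # bandwidth chosen only for illustration
mu_w = (w[:, None] * X).sum(axis=0) / w.sum()   # reweighted mean
M = (w[:, None] * X).T @ X / w.sum()            # reweighted second-moment matrix
evals, evecs = np.linalg.eigh(M)

print("reweighted mean shift:", np.round(np.linalg.norm(mu_w), 3))               # small: balanced case
print("|e1-component of extreme eigenvector|:", np.round(abs(evecs[0, -1]), 3))  # close to 1
```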

Ball vs Cylinder

Algorithm. 1. Make the distribution isotropic. 2. Reweight the points. 3. If the mean shifts, partition along this direction and recurse. 4. Otherwise, partition along the top principal component and recurse.

Step 1: Enforcing Isotropy. Isotropy: (a) mean = 0 and (b) variance = 1 in every direction. Step 1a: move the origin to the mean (translation). Step 1b: apply the linear transformation x -> Sigma^{-1/2} x, where Sigma is the sample covariance matrix.
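A minimal NumPy sketch of this step (function and variable names are mine; it assumes the sample covariance has full rank):

```python
import numpy as np

def make_isotropic(X):
    """Affine transformation putting the point set in isotropic position:
    mean zero and identity covariance (assumes full-rank covariance)."""
    mu = X.mean(axis=0)
    Xc = X - mu                                    # Step 1a: translate the mean to the origin
    Sigma = Xc.T @ Xc / len(X)                     # sample covariance
    evals, evecs = np.linalg.eigh(Sigma)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T   # Sigma^(-1/2)
    return Xc @ W                                  # Step 1b: linear transformation

# Example: a stretched, correlated Gaussian becomes isotropic.
rng = np.random.default_rng(4)
X = rng.standard_normal((5000, 3)) @ np.array([[3.0, 0.0, 0.0],
                                               [1.0, 0.2, 0.0],
                                               [0.0, 0.0, 1.0]])
Y = make_isotropic(X)
print(np.round(Y.mean(axis=0), 3))    # ~ zero vector
print(np.round(Y.T @ Y / len(Y), 3))  # ~ identity matrix
```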

Step 1: Enforcing Isotropy. This turns every well-separated mixture into (almost) parallel pancakes, separable along the intermean direction. PCA no longer helps us!

Algorithm. 1. Make the distribution isotropic. 2. Reweight the points (using a Gaussian). 3. If the mean shifts, partition along this direction and recurse. 4. Otherwise, partition along the top principal component and recurse.
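Below is a hedged toy sketch of one level of this recursion (my own illustration, not the authors' code; the Gaussian bandwidth, the mean-shift threshold, and the split at the hyperplane through 0 are arbitrary choices): isotropize, reweight with a Gaussian, test whether the reweighted mean moves away from the origin, and otherwise split along the top eigenvector of the reweighted second-moment matrix. On an imbalanced two-Gaussian mixture hidden by a random affine map, the split is mostly consistent with the true components.

```python
import numpy as np

def make_isotropic(X):
    """Step 1: translate the mean to the origin and apply Sigma^(-1/2)."""
    Xc = X - X.mean(axis=0)
    evals, evecs = np.linalg.eigh(Xc.T @ Xc / len(X))
    return Xc @ evecs @ np.diag(evals ** -0.5) @ evecs.T

def partition_once(X, shift_threshold=0.1):
    """One level of the recursion: returns a boolean mask splitting X in two."""
    Y = make_isotropic(X)                                  # Step 1
    w = np.exp(-np.sum(Y ** 2, axis=1) / 4.0)              # Step 2: Gaussian reweighting
    mu_w = (w[:, None] * Y).sum(axis=0) / w.sum()
    if np.linalg.norm(mu_w) > shift_threshold:             # Step 3: the mean shifted
        direction = mu_w / np.linalg.norm(mu_w)
    else:                                                  # Step 4: top principal component
        M = (w[:, None] * Y).T @ Y / w.sum()               #   of the reweighted points
        direction = np.linalg.eigh(M)[1][:, -1]
    return Y @ direction > 0                               # split by a hyperplane through 0

# Toy example: an imbalanced mixture of two Gaussians, hidden by a random
# affine transformation (all parameters are illustrative).
rng = np.random.default_rng(5)
n, m1, m2 = 10, 3000, 1000
X = np.vstack([rng.standard_normal((m1, n)),
               rng.standard_normal((m2, n)) + 6 * np.eye(n)[0]])
X = X @ rng.standard_normal((n, n))                        # affine map hides the structure
labels = np.array([0] * m1 + [1] * m2)

side = partition_once(X)
agreement = max(np.mean(side == labels), np.mean(side != labels))
print(f"fraction split consistently with the true components: {agreement:.3f}")
```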

Two parallel pancakes. Isotropy pulls apart the components. If one is heavier, the overall mean shifts along the separating direction; if not, the principal component is along the separating direction.

Steps 3 & 4: Illustrative Examples. Imbalanced pancakes vs. balanced pancakes.

Step 3: Imbalanced Case. The mean shifts toward the heavier component.

Step 4: Balanced Case. The mean doesn't move, by symmetry. The top principal component is the intermean direction.

Unraveling Gaussian Mixtures Theorem [Brubaker-V. ’08] The algorithm correctly classifies samples from two arbitrary Gaussians “separable by a hyperplane” with high probability.

Original Data 40 dimensions, 8000 samples (subsampled for visualization) Means of (0,0) and (1,1).

Random Projection

PCA

Isotropic PCA

Results: k=2. Theorem: For k=2, the algorithm succeeds if there is some direction v along which the component means are separated by a sufficiently large multiple of the components' standard deviations along v (i.e., hyperplane separability).

Fisher Criterion. For a direction p, J(p) = (intra-component variance along p) / (total variance along p). Overlap: min of J(p) over all directions p (small overlap => well separated). Theorem: For k=2, the algorithm succeeds if the overlap is sufficiently small.
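Given labeled samples, this overlap can be computed as the smallest generalized eigenvalue of the (within-component, total) covariance pair; a small NumPy sketch of my own (synthetic data, function names mine):

```python
import numpy as np

def overlap(X, labels):
    """Fisher overlap: min over directions p of
    (intra-component variance along p) / (total variance along p)."""
    Xc = X - X.mean(axis=0)
    T = Xc.T @ Xc / len(X)                              # total covariance
    W = np.zeros_like(T)                                # within-component covariance
    for c in np.unique(labels):
        Xg = X[labels == c]
        W += len(Xg) / len(X) * np.cov(Xg.T, bias=True)
    # Smallest generalized eigenvalue of (W, T): whiten by T^(-1/2), then eigh.
    evals, evecs = np.linalg.eigh(T)
    Tinv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return np.linalg.eigvalsh(Tinv_half @ W @ Tinv_half).min()

# Two well-separated Gaussians give small overlap; affine maps don't change it.
rng = np.random.default_rng(6)
X = np.vstack([rng.standard_normal((1000, 5)),
               rng.standard_normal((1000, 5)) + 8 * np.eye(5)[0]])
labels = np.repeat([0, 1], 1000)
print("overlap:                 ", overlap(X, labels))
print("overlap after affine map:", overlap(X @ rng.standard_normal((5, 5)) + 3.0, labels))
```

For k > 2 (next slides), the analogous minimum over (k-1)-dimensional subspaces reduces, in isotropic position, to the (k-1)-st smallest eigenvalue of the within-component covariance, by the Courant-Fischer characterization.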

Results: k>2. For k > 2, we need k-1 orthogonal directions with small overlap.

Fisher Criterion (k > 2). Make F isotropic. For a subspace S, J(S) = the maximum intra-component variance within S (after isotropy, the total variance in every direction is 1). Overlap = min of J(S) over all (k-1)-dimensional subspaces S. The overlap is affine-invariant. Theorem [BV '08]: For k > 2, the algorithm succeeds if the overlap is sufficiently small.

Original Data (k=3). 40 dimensions, samples (subsampled for visualization).

Random Projection

PCA

Isotropic PCA

Conclusion. Most of this appears in a new book, "Spectral Algorithms," with Ravi Kannan. IsoPCA gives an affine-invariant clustering (independent of a model). What do IsoPCA directions mean? Robust PCA (Brubaker '08; robust to small changes in the point set), applied to noisy/best-fit mixtures. PCA for tensors?