1 LING 696B: PCA and other linear projection methods

2 Curse of dimensionality. The higher the dimension, the more data is needed to draw any conclusion. Probability density estimation: histograms (continuous), k-factorial designs (discrete). Decision rules: nearest-neighbor and k-nearest-neighbor.

3 How to reduce dimension? Assume we know something about the distribution. Parametric approach: assume the data follow distributions within a family H. Example: counting histograms for 10-D data needs lots of bins, but knowing it's a pancake allows us to fit a Gaussian: (number of bins)^10 cells vs. 10 + 10*11/2 = 65 parameters.
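
Not from the slides, but the comparison above can be made concrete with a two-line calculation; the 20 bins per dimension below is a hypothetical choice.

```python
# Back-of-the-envelope comparison for d = 10, assuming 20 bins per dimension.
d = 10
bins_per_dim = 20
histogram_cells = bins_per_dim ** d          # cells a 10-D histogram would need
gaussian_params = d + d * (d + 1) // 2       # mean (10) + covariance (10*11/2 = 55)
print(f"{histogram_cells:,} histogram cells vs. {gaussian_params} Gaussian parameters")
```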

4 Linear dimension reduction. The pancake/Gaussian assumption is crucial for linear methods. Examples: Principal Component Analysis, Multidimensional Scaling, Factor Analysis.

5 Covariance structure of multivariate Gaussian. 2-dimensional example: no correlations --> diagonal covariance matrix, where the diagonal entries give the variance in each dimension and the off-diagonal entries the correlation between dimensions. Special case: Σ = I, where the negative log likelihood is (up to a constant) proportional to the squared Euclidean distance to the center.

6 Covariance structure of multivariate Gaussian. Non-zero correlations --> full covariance matrix, Cov(X1, X2) ≠ 0, i.e. Σ has non-zero off-diagonal entries. Nice property of Gaussians: closed under linear transformation. This means we can remove correlation by rotation.

7 Covariance structure of multivariate Gaussian. Rotation matrix: R = (w1, w2), where w1, w2 are two unit vectors perpendicular to each other. (Figure: the basis vectors w1, w2 under a rotation by 90 degrees and a rotation by 45 degrees.)

8 Covariance structure of multivariate Gaussian. Matrix diagonalization: any 2x2 covariance matrix A can be written as A = R Λ R^T, with R a rotation matrix and Λ diagonal. Interpretation: we can always find a rotation to make the covariance look "nice" -- no correlation between dimensions. This IS PCA when applied to N dimensions.
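
As a numerical sanity check (my addition, using NumPy; the matrix A is an arbitrary example), the diagonalization can be computed with a symmetric eigensolver:

```python
# Diagonalize a 2x2 covariance matrix: A = R Λ R^T with R orthogonal, Λ diagonal.
import numpy as np

A = np.array([[2.0, 1.2],
              [1.2, 1.0]])               # an arbitrary full covariance matrix
lam, R = np.linalg.eigh(A)               # eigenvalues (ascending) and orthonormal eigenvectors
Lambda = np.diag(lam)

print(np.allclose(R @ Lambda @ R.T, A))  # True: A = R Λ R^T
print(np.allclose(R.T @ R, np.eye(2)))   # True: R is a rotation (possibly with a reflection)
```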

9 Computation of PCA. The new coordinates uniquely identify the rotation; in computation, it's easier to identify one coordinate at a time. Step 1: center the data, X <-- X - mean(X); we want to rotate around the center. (Figure: in 3-D there are 3 coordinates, w1, w2, w3.)

10 Computation of PCA. Step 2: find a direction of projection that has the maximal "stretch". Linear projection of X onto a vector w: Proj_w(X) = X w, where X is N x d (centered) and w is d x 1. Now measure the stretch: this is the sample variance, Var(X*w).

11 Computation of PCA. Step 3: formulate this as a constrained optimization problem. Objective of optimization: Var(X*w). Need a constraint on w (otherwise it can explode); only the direction matters. So formally: find w achieving max_{||w||=1} Var(X*w).

12 Computation of PCA. Some algebra (homework): Var(x) = E[(x - E[x])^2] = E[x^2] - (E[x])^2. Apply to matrices (homework): Var(X*w) = w^T X^T X w = w^T Cov(X) w (why?). Cov(X) is a d x d matrix (homework): symmetric (easy), and for any y, y^T Cov(X) y >= 0 (tricky).

13 Computation of PCA. Going back to the optimization problem: max_{||w||=1} Var(X*w) = max_{||w||=1} w^T Cov(X) w. The maximum is the largest eigenvalue of Cov(X), attained when w is the corresponding eigenvector w1 -- the first principal component! (see demo)
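
A minimal NumPy sketch of Steps 1-3 (my own synthetic 2-D data, not the course demo; the mean and covariance are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([3.0, -1.0], [[4.0, 1.2], [1.2, 0.5]], size=500)

X = X - X.mean(axis=0)            # Step 1: center the data
C = X.T @ X                       # Cov(X) in the slides' convention (X already centered)
lam, W = np.linalg.eigh(C)        # eigenvalues in ascending order, eigenvectors in columns
w1 = W[:, -1]                     # eigenvector of the largest eigenvalue: the first PC

proj = X @ w1                     # Step 2: project onto w1
print(np.isclose(proj.var() * len(X), lam[-1]))   # True: maximal stretch = top eigenvalue
```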

14 More principal components. We keep looking among all the projections perpendicular to w1. Formally: max_{||w2||=1, w2 ⊥ w1} w2^T Cov(X) w2. The maximizer turns out to be the eigenvector corresponding to the 2nd largest eigenvalue, giving the new coordinate w2 (see demo).

15 Rotation. We can keep going until we find all projections/coordinates w1, w2, ..., wd. Putting them together, we have a big matrix W = (w1, w2, ..., wd). W is called an orthogonal matrix; it corresponds to a rotation (sometimes plus reflection) of the pancake. The rotated pancake has no correlation between dimensions (see demo).
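
Continuing the sketch in 3-D (again with made-up synthetic data): stacking the eigenvectors into W and rotating the centered data gives coordinates whose covariance is diagonal.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0, 0],
                            [[4.0, 1.5, 0.5],
                             [1.5, 1.0, 0.2],
                             [0.5, 0.2, 0.3]], size=2000)
X = X - X.mean(axis=0)

_, W = np.linalg.eigh(X.T @ X)       # columns of W = (w1, ..., wd), orthonormal
Y = X @ W                            # rotated coordinates

print(np.round(np.cov(Y, rowvar=False), 3))   # off-diagonal entries are 0: no correlation left
```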

16 When does dimension reduction occur? Decomposition of the covariance matrix: Cov(X) = λ1 w1 w1^T + λ2 w2 w2^T + ... + λd wd wd^T. If only the first few eigenvalues are significant, we can ignore the rest, e.g. keep only 2-D coordinates of X.

17 Measuring the "degree" of reduction. (Figure: pancake data in 3-D, with axes labeled a1, a2.)
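
One standard way to put a number on the degree of reduction (my addition; not necessarily the exact measure pictured on this slide) is the fraction of the total eigenvalue mass carried by each component:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0, 0, 0], np.diag([5.0, 1.0, 0.05]), size=1000)  # pancake-like data
X = X - X.mean(axis=0)

lam = np.linalg.eigvalsh(X.T @ X)[::-1]   # eigenvalues, largest first
print(np.round(lam / lam.sum(), 3))       # the first two components carry ~99% of the variance
```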

18 Reconstruction from principal components. Perfect reconstruction (x centered): x = (w1^T x) w1 + (w2^T x) w2 + ... + (wd^T x) wd -- each piece has a length (w_i^T x) and a direction w_i, and there are many pieces. Keeping only the bigger pieces, the first k, gives the reconstruction error ||x - Σ_{i<=k} (w_i^T x) w_i||^2. Minimizing this error is another formulation of PCA.
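
A sketch of the reconstruction formulation above (synthetic 5-D data of my own; the diagonal covariance is made up). Keeping the k bigger pieces and dropping the rest leaves a total error equal to the sum of the dropped eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal(np.zeros(5), np.diag([6.0, 3.0, 0.5, 0.1, 0.05]), size=400)
X = X - X.mean(axis=0)

lam, W = np.linalg.eigh(X.T @ X)
lam, W = lam[::-1], W[:, ::-1]            # order components by decreasing eigenvalue

k = 2
X_hat = (X @ W[:, :k]) @ W[:, :k].T       # x ≈ sum of the first k pieces, for every row
err = np.sum((X - X_hat) ** 2)
print(np.isclose(err, lam[k:].sum()))     # True: the error is the sum of the dropped eigenvalues
```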

19 A creative interpretation/implementation of PCA. Any x can be reconstructed from principal components (the PCs form a basis for the whole space). Picture a linear network: input X, a hidden layer whose connection weights are W (the hidden activity is the "encoding", or "neural firing"), and output X reconstructed through W. When (# of hidden units) < (# of input units), the network does dimension reduction. This can be used to implement PCA.
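
The network picture can be tried directly. Below is a sketch of my own (not the course demo): a tied-weight linear network with k = 2 hidden units, trained by plain gradient descent on made-up 4-D data with a hand-picked learning rate; at convergence its hidden weights span the same subspace as the top two principal components.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal(np.zeros(4), np.diag([8.0, 3.0, 0.2, 0.1]), size=600)
X = X - X.mean(axis=0)

d, k, lr = X.shape[1], 2, 5e-6
W = rng.normal(scale=0.1, size=(d, k))        # connection weights; hidden activity = X @ W

for _ in range(4000):
    E = X - X @ W @ W.T                       # output minus input (reconstruction error)
    W += lr * (X.T @ E + E.T @ X) @ W         # gradient step on ||X - X W W^T||^2

lam, V = np.linalg.eigh(X.T @ X)
V = V[:, ::-1][:, :k]                         # top-k principal components
Q, _ = np.linalg.qr(W)                        # orthonormal basis of the learned hidden subspace
print(np.abs(Q @ Q.T - V @ V.T).max())        # near 0: the network has found the PCA subspace
```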

20 An intuitive application of PCA: Story and Titze (and others). Vocal tract measurements are high dimensional (different articulators), and measurements from different positions are correlated. Underlying geometry: a few articulatory parameters, possibly pancake-like after collapsing a number of different sounds. Big question: relate low-dimensional articulatory parameters (tongue shape) to low-dimensional acoustic parameters (F1/F2).

21 Story and Titze's application of PCA. Source data: area function data obtained from MRI (d = 44). Step 1: calculate the mean. Interestingly, the mean produces a schwa-like frequency response.

22 Story and Titze's application of PCA. Step 2: subtract the mean from the area function (center the data). Step 3: form the covariance matrix R = X^T X (a d x d matrix).

23 Story and Titze's application of PCA. Step 4: eigen-decomposition of the covariance matrix to get the PCs; Story calls them "empirical modes". Length of projection onto mode w_i: c_i = w_i^T x. Reconstruction: x ≈ mean + Σ_i c_i w_i.
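
A hypothetical sketch of Steps 2-4 (the MRI area-function data are not reproduced here, so `area_fns` below is random stand-in data of the right shape; with real data, unlike this stand-in, the first two modes capture most of the variation):

```python
import numpy as np

rng = np.random.default_rng(5)
area_fns = np.abs(rng.normal(1.0, 0.3, size=(200, 44)))   # stand-in: 200 frames x 44 sections (d = 44)

mean_shape = area_fns.mean(axis=0)     # Step 1: the mean (schwa-like) area function
X = area_fns - mean_shape              # Step 2: subtract the mean
R = X.T @ X                            # Step 3: covariance matrix R = X^T X (44 x 44)
lam, modes = np.linalg.eigh(R)         # Step 4: eigen-decomposition -> "empirical modes"
modes = modes[:, ::-1]                 # largest-variance modes first

coef = X[0] @ modes[:, :2]                     # projections of one frame onto the first 2 modes
approx = mean_shape + modes[:, :2] @ coef      # reconstruction from the mean plus 2 modes
```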

24 Story and Titze's application of PCA. Story's principal components: the first 2 PCs can do most of the reconstruction, and they can be seen as perturbations of the overall tongue shape (from the mean): negative values correspond to constriction, positive values to expansion.

25 Story and Titze's application of PCA. The principal components are interpretable as control parameters; the acoustic-to-articulatory mapping is almost one-to-one after dimension reduction.

26 Applying PCA to ultrasound data? Another imaging technique that generates a tongue profile similar to X-ray and MRI. The data are high-dimensional and correlated, so we need dimension reduction to interpret articulatory parameters. See demo.

27 An unintuitive application of PCA: Latent Semantic Indexing in document retrieval. Documents are represented as vectors of word counts (e.g. counts of "market", "stock", "bonds"), and we try to extract some "features" by linear combination of word counts. The underlying geometry is unclear (mean? distance?), and the meaning of the principal components is unclear (rotation?).
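
A toy LSI-style sketch (my own example, not from the slides; real systems add tf-idf weighting and work with far larger matrices): represent documents as word-count vectors and reduce them with a truncated SVD.

```python
import numpy as np

vocab = ["stock", "market", "bonds", "tongue", "vowel"]
counts = np.array([[3, 2, 1, 0, 0],      # doc 1: finance
                   [2, 3, 2, 0, 0],      # doc 2: finance
                   [0, 0, 0, 4, 2],      # doc 3: phonetics
                   [0, 1, 0, 3, 3]])     # doc 4: phonetics

U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
doc_features = U[:, :k] * s[:k]          # each document in a 2-D "latent semantic" space
print(np.round(doc_features, 2))         # the two topics separate along the new axes
```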

28 Summary of PCA. PCA looks for a sequence of linear, orthogonal projections that reveal interesting structure in the data (a rotation). Defining "interesting": maximal variance under each projection, and uncorrelated structure after projection.

29 Departure from PCA: 3 directions of divergence. Other definitions of "interesting"? Linear Discriminant Analysis, Independent Component Analysis. Other methods of projection? Linear but not orthogonal (sparse coding), or implicit, non-linear mappings. Turning PCA into a generative model: Factor Analysis.

30 Re-thinking "interestingness". It all depends on what you want. Linear Discriminant Analysis (LDA) is supervised learning; example: separating 2 classes. (Figure: the maximal-variance direction vs. the maximal-separation direction.)
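
For the two-class example, the textbook Fisher discriminant direction is w ∝ S_w^{-1}(m1 - m2); the sketch below (my own, with made-up class means and a shared covariance) computes it:

```python
import numpy as np

rng = np.random.default_rng(6)
cov = [[2.0, 1.5], [1.5, 2.0]]
X1 = rng.multivariate_normal([0, 0], cov, size=300)       # class 1
X2 = rng.multivariate_normal([2, 1], cov, size=300)       # class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)    # within-class scatter
w = np.linalg.solve(Sw, m1 - m2)                          # maximal-separation direction
w /= np.linalg.norm(w)
print(w)
```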

31 Re-thinking "interestingness". Most high-dimensional data look Gaussian under linear projections, so maybe non-Gaussian is more interesting: Independent Component Analysis, projection pursuit. Example: an ICA-style projection of 2-class data picks the direction most unlike a Gaussian (e.g. by maximizing kurtosis).
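
A toy projection-pursuit sketch (my own simplification, not an actual ICA algorithm): scan 1-D projections of 2-class data and keep the one whose projection looks least Gaussian, here scored by the absolute excess kurtosis.

```python
import numpy as np

rng = np.random.default_rng(7)
X = np.vstack([rng.normal([-2.0, 0.0], 1.0, size=(300, 2)),   # class 1
               rng.normal([+2.0, 0.0], 1.0, size=(300, 2))])  # class 2
X = X - X.mean(axis=0)

def excess_kurtosis(y):
    y = (y - y.mean()) / y.std()
    return np.mean(y ** 4) - 3.0            # 0 for a Gaussian

angles = np.linspace(0, np.pi, 180, endpoint=False)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
scores = [abs(excess_kurtosis(X @ w)) for w in dirs]
best = dirs[int(np.argmax(scores))]
print(best)    # ~ (±1, 0): the bimodal direction along which the two classes separate
```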

32 The "efficient coding" perspective. Sparse coding: projections do not have to be orthogonal, and there can be more basis vectors than the dimension of the space (basis expansion). Neural interpretation (Dana Ballard's talk last week). With p << d basis vectors we get compact coding (PCA); with p > d, sparse coding.

33 "Interesting" can be expensive. These methods often face difficult optimization problems: many constraints, lots of parameter sharing, expensive to compute, and no longer an eigenvalue problem.

34 PCA's relatives: Factor Analysis. PCA is not a generative model: the reconstruction error is not a likelihood, it is sensitive to outliers, and it is hard to build into bigger models. Factor Analysis adds measurement noise to account for variability: observation x = Λ z + ε, with factors z ~ N(0, I) (spherical Gaussian), loading matrix Λ (scaled PCs), and measurement noise ε ~ N(0, R), R diagonal.

35 PCA's relatives: Factor Analysis. Generative view: sphere --> stretch and rotate --> add noise. Learning: a version of the EM algorithm (see demo and synthesis).
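
The generative view is easy to simulate (a sketch with hypothetical parameters Λ and R of my choosing): sample spherical factors, stretch and rotate them with the loading matrix, then add diagonal measurement noise.

```python
import numpy as np

rng = np.random.default_rng(8)
n, d, k = 1000, 3, 2
Lambda = np.array([[2.0, 0.0],
                   [1.0, 1.0],
                   [0.0, 0.5]])                    # loading matrix (d x k), "stretch and rotate"
R = np.diag([0.1, 0.2, 0.1])                       # diagonal measurement-noise covariance

z = rng.normal(size=(n, k))                        # factors ~ N(0, I), the "sphere"
eps = rng.multivariate_normal(np.zeros(d), R, size=n)
x = z @ Lambda.T + eps                             # observations

print(np.round(np.cov(x, rowvar=False), 2))        # ≈ Λ Λ^T + R, the model's implied covariance
print(np.round(Lambda @ Lambda.T + R, 2))
```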

36 Mixture of Factor Analyzers Same intuition as other mixture models: there may be several pancakes out there, each with its own center/rotation

37 PCA's relatives: Metric multidimensional scaling. Approaches the problem in a different way: no measurements from the stimuli, only pairwise "distances" between stimuli; the goal is to recover some psychological space for the stimuli. See Jeff's talk.
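
For concreteness, a classical (Torgerson) metric-MDS sketch of my own, not necessarily what Jeff's talk covers: given only a matrix of pairwise distances, double centering plus an eigendecomposition recovers a configuration of points with those distances (up to rotation and translation).

```python
import numpy as np

rng = np.random.default_rng(9)
P = rng.normal(size=(10, 2))                                   # "true" stimulus positions (hidden from MDS)
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)     # pairwise distances: all MDS gets to see

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n                            # centering matrix
B = -0.5 * J @ (D ** 2) @ J                                    # double-centered squared distances
lam, V = np.linalg.eigh(B)
lam, V = lam[::-1], V[:, ::-1]
coords = V[:, :2] * np.sqrt(lam[:2])                           # recovered 2-D configuration

D_hat = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print(np.allclose(D, D_hat))                                   # True: distances are reproduced
```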