Published by Marion Horton. Modified over 8 years ago.
1
Dimension reduction (1) Overview, PCA, Factor Analysis, Projection pursuit, ICA
2
Overview The purpose of dimension reduction: data simplification; data visualization; noise reduction (if we can assume only the dominating dimensions are signals); variable selection for prediction.
3
Overview
Outcome variable y exists (learning the association rule). Data separation: classification, regression. Dimension reduction: SIR, class-preserving projection, partial least squares.
No outcome variable (learning intrinsic structure). Data separation: clustering. Dimension reduction: PCA, MDS, Factor Analysis, ICA, NCA…
4
PCA Explain the variance-covariance structure among a set of random variables by a few linear combinations of the variables; Does not require normality!
5
PCA
7
Reminder of some results for random vectors
8
Proof of the first (and second) point of the previous slide.
9
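PCA The eigenvalues are the variance components: $\mathrm{Var}(Y_k) = \lambda_k$, where $Y_k$ is the kth PC and $\lambda_k$ the kth largest eigenvalue of the covariance matrix. Proportion of total variance explained by the kth PC: $\lambda_k / (\lambda_1 + \cdots + \lambda_p)$, where p is the number of variables.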
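The relation between eigenvalues and explained variance can be checked numerically. The following is a minimal sketch, assuming NumPy and simulated data; the function name `pca_eig` and the mixing matrix are illustrative choices, not part of the slides.

```python
import numpy as np

def pca_eig(X):
    """PCA via eigendecomposition of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)                    # center each variable
    S = np.cov(Xc, rowvar=False)               # p x p sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)       # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]          # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                      # principal component scores
    explained = eigvals / eigvals.sum()        # proportion of total variance
    return eigvals, eigvecs, scores, explained

# Simulated data (hypothetical): three correlated variables
rng = np.random.default_rng(0)
mix = np.array([[3.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.5, 0.2]])
X = rng.normal(size=(500, 3)) @ mix
eigvals, eigvecs, scores, explained = pca_eig(X)
print("eigenvalues:", eigvals)
print("proportion explained:", explained)
```

The sample variance of each score column equals the corresponding eigenvalue, and the eigenvalues sum to the trace of the covariance matrix (the total variance).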
10
PCA
11
The geometrical interpretation of PCA:
12
PCA Using the correlation matrix instead of the covariance matrix? This is equivalent to first standardizing all X vectors.
13
PCA Using the correlation matrix prevents one X variable from dominating because of its scale (unit changes), for example using inches instead of feet. Example:
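The effect of a unit change can be illustrated with simulated data. This is a sketch under stated assumptions, not the example from the original slide: the two variables (a length and a second variable called `weight`) and their distributions are hypothetical.

```python
import numpy as np

def first_pc(M):
    """Leading eigenvector of a symmetric matrix, with a fixed sign."""
    vals, vecs = np.linalg.eigh(M)
    v = vecs[:, -1]                       # eigh sorts ascending; take the last
    return v if v[np.argmax(np.abs(v))] > 0 else -v

rng = np.random.default_rng(1)
n = 1000
length_ft = rng.normal(5.5, 0.3, n)                 # hypothetical length in feet
weight = 20 * length_ft + rng.normal(0, 10, n)      # hypothetical second variable
X_ft = np.column_stack([length_ft, weight])
X_in = np.column_stack([12 * length_ft, weight])    # same data, length in inches

# Covariance-based PCA changes with the unit
pc_cov_ft = first_pc(np.cov(X_ft, rowvar=False))
pc_cov_in = first_pc(np.cov(X_in, rowvar=False))

# Correlation-based PCA does not
pc_cor_ft = first_pc(np.corrcoef(X_ft, rowvar=False))
pc_cor_in = first_pc(np.corrcoef(X_in, rowvar=False))

# The correlation matrix is the covariance matrix of the standardized data
Z = (X_ft - X_ft.mean(axis=0)) / X_ft.std(axis=0, ddof=1)
print(pc_cov_ft, pc_cov_in, pc_cor_ft, pc_cor_in)
```

The first covariance-based PC shifts when feet become inches, while the correlation-based PC is unchanged.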
14
Selecting the number of components? Based on eigenvalues (% variation explained). Assumption: the small amount of variation explained by the low-ranked PCs is noise.
15
Sparse PCA In high-dimensional data, loadings of a single PC on 10,000 genes don’t make much sense. The goal is to obtain sparse loadings, which makes the interpretation easier and the model more robust. SCoTLASS
16
Sparse PCA Zou, Hastie, and Tibshirani’s SPCA by regression:
17
Factor Analysis If we take the first several PCs that explain most of the variation in the data, we have one form of factor model: $X - \mu = LF + \varepsilon$. L: loading matrix. F: unobserved random vector (latent variables). ε: unobserved random vector (noise).
18
Factor Analysis The orthogonal factor model assumes no correlation between the factor RVs. Cov(ε) is a diagonal matrix.
19
Factor Analysis
20
Rotations in the m-dimensional subspace defined by the factors make the solution non-unique. PCA is one unique solution, as the vectors are sequentially selected. The maximum likelihood estimator is another solution:
21
Factor Analysis As we said, rotations within the m-dimensional subspace do not change the overall amount of variation explained. Rotate to make the results more interpretable:
22
Factor Analysis Varimax criterion: Find the rotation T such that V is maximized. V is proportional to the sum of the variances of the squared loadings. Maximizing V spreads the squared loadings out as much as possible: some are very small, and some are very large.
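For two factors, the idea can be sketched with a brute-force search over the rotation angle. This is an illustration only, assuming NumPy: the loading matrix is hypothetical, V is taken as the unnormalized sum of column variances of the squared loadings, and practical varimax implementations use an iterative algorithm rather than a grid.

```python
import numpy as np

def varimax_criterion(L):
    """Sum over factors of the variance (across variables) of squared loadings."""
    return np.var(L ** 2, axis=0).sum()

def rotation(theta):
    """2 x 2 orthogonal rotation matrix for angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def best_rotation_2d(L, n_grid=901):
    """Grid search over [0, pi/2] for the angle that maximizes the criterion."""
    thetas = np.linspace(0.0, np.pi / 2, n_grid)
    values = [varimax_criterion(L @ rotation(t)) for t in thetas]
    best = int(np.argmax(values))
    return rotation(thetas[best]), values[best]

# Hypothetical loadings: a simple structure, hidden by a 30-degree rotation
L_simple = 0.8 * np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
L = L_simple @ rotation(np.deg2rad(30))

T, V_best = best_rotation_2d(L)
L_rot = L @ T
print("V before:", varimax_criterion(L), "V after:", V_best)
print(np.round(L_rot, 3))
```

The rotation leaves each variable's communality (row sum of squared loadings) unchanged, which is why the total variation explained is the same before and after rotation.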
23
Factor Analysis Orthogonal simple factor rotation: Rotate the orthogonal factors around the origin until the system is maximally aligned with the separate clusters of variables. Oblique simple structure rotation: Allow the factors to become correlated. Each factor is rotated individually to fit a cluster.
24
MDS Multidimensional scaling is a dimension reduction procedure that maps the distances between observations to a lower-dimensional space. It minimizes an objective function of D, the distance in the original space, and d, the distance in the reduced-dimension space. A numerical method is used for the minimization.
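A numerical minimization of this kind can be sketched as follows. The slide's objective function did not survive extraction, so this sketch assumes the raw stress, the sum over pairs of (D_ij - d_ij)^2, which is one common choice and may differ from the one on the slide. Function names, step size, and data are hypothetical.

```python
import numpy as np

def pairwise_dist(Y):
    """Euclidean distance matrix between the rows of Y."""
    diff = Y[:, None, :] - Y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(D, Y):
    """Assumed objective: sum over pairs i<j of (D_ij - d_ij)^2."""
    d = pairwise_dist(Y)
    iu = np.triu_indices_from(D, k=1)
    return ((D[iu] - d[iu]) ** 2).sum()

def mds_gd(D, dim=2, lr=0.005, n_iter=1000, seed=0):
    """Plain gradient descent on the raw stress."""
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=0.1, size=(D.shape[0], dim))   # random start
    history = [stress(D, Y)]
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]
        d = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(d, 1.0)          # avoid 0/0; diff is zero on the diagonal
        coef = (d - D) / d
        grad = 2 * (coef[:, :, None] * diff).sum(axis=1)
        Y = Y - lr * grad
        history.append(stress(D, Y))
    return Y, history

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 5))              # hypothetical 5-dimensional data
D = pairwise_dist(X)
Y, history = mds_gd(D)
print("stress: start", history[0], "end", history[-1])
```

A fixed step size is used here for brevity; it is not guaranteed to decrease the stress at every iteration on other data.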
25
Projection pursuit A very broad term: finding the most “interesting” direction of projection. How the projection is done depends on the definition of “interesting”. If it is maximal variation, then PP leads to PCA. In a narrower sense: finding non-Gaussian projections. For most high-dimensional clouds, most low-dimensional projections are close to Gaussian, so the important information in the data is in the directions for which the projected data is far from Gaussian.
26
Projection pursuit It boils down to objective functions – each kind of “interesting” has its own objective function to maximize.
27
Projection pursuit Figure panels: PCA; projection pursuit with multi-modality as objective.
28
One objective function to measure multi-modality: It uses the first three moments of the distribution. It can help find clusters through visualization. To find w, the function is maximized over w by gradient ascent:
29
Projection pursuit PCA can be viewed as a case of PP whose objective function is the variance of the projection w’x, maximized over unit-length w. For the other PC directions, find the projection onto the space orthogonal to the previously found PCs.
30
Projection pursuit Some other objective functions (y is the RV generated by the projection w’x). The kurtosis as defined here has value 0 for the normal distribution. Higher kurtosis: peaked and fat-tailed. http://www.riskglossary.com/link/kurtosis.htm
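A kurtosis that is zero for the normal distribution is the excess kurtosis, the standardized fourth moment minus 3. The following sketch, assuming NumPy and simulated samples, compares three distributions; the exact formula on the slide is not recoverable, so this definition is an assumption consistent with the "value 0 for normal" remark.

```python
import numpy as np

def excess_kurtosis(y):
    """Standardized fourth moment minus 3 (zero for a normal distribution)."""
    y = np.asarray(y, dtype=float)
    z = (y - y.mean()) / y.std()
    return np.mean(z ** 4) - 3.0

rng = np.random.default_rng(0)
n = 200_000
k_normal = excess_kurtosis(rng.normal(size=n))            # close to 0
k_laplace = excess_kurtosis(rng.laplace(size=n))          # positive: peaked, fat-tailed
k_uniform = excess_kurtosis(rng.uniform(-1, 1, size=n))   # negative: flat, thin-tailed
print(k_normal, k_laplace, k_uniform)
```

As a projection pursuit objective, one would evaluate this function on y = Xw for candidate unit vectors w and look for directions where its absolute value is large.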
31
Independent component analysis Again, another view of dimension reduction is factorization into latent variables. ICA finds a unique solution by requiring the factors to be statistically independent, rather than just uncorrelated. Lack of correlation only determines the second-degree cross-moment, while statistical independence means that for any functions g1() and g2(), $E[g_1(y_1)\,g_2(y_2)] = E[g_1(y_1)]\,E[g_2(y_2)]$. For a multivariate Gaussian, uncorrelatedness = independence.
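The gap between uncorrelated and independent can be seen in a small exact example. This is an illustrative sketch, assuming a hypothetical discrete variable x that is uniform on {-1, 0, 1} and y = x^2; the choice of g1 and g2 is also an assumption made for the illustration.

```python
import numpy as np

# x takes the values -1, 0, 1 with equal probability; y is a function of x
x = np.array([-1.0, 0.0, 1.0])
p = np.full(3, 1.0 / 3.0)
y = x ** 2

def E(values):
    """Expectation under the discrete distribution p."""
    return float((p * values).sum())

# Second-degree cross-moment: x and y are uncorrelated
cov_xy = E(x * y) - E(x) * E(y)

# Independence would require E[g1(x) g2(y)] = E[g1(x)] E[g2(y)] for all g1, g2.
# With g1(x) = x^2 and g2(y) = y the two sides differ.
lhs = E(x ** 2 * y)          # 2/3
rhs = E(x ** 2) * E(y)       # 4/9
print(cov_xy, lhs, rhs)
```

Here y is fully determined by x, yet their covariance is zero, so zero correlation alone does not identify independent factors.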
32
ICA A multivariate Gaussian is determined by its second moments alone. Thus if the true hidden factors are Gaussian, they can still be determined only up to a rotation. In ICA, the latent variables are assumed to be independent and non-Gaussian. The mixing matrix A must have full column rank.
33
Independent component analysis ICA is a special case of PP. The key is again that y is non-Gaussian. Several ways to measure non-Gaussianity: (1) Kurtosis (zero for a Gaussian RV, sensitive to outliers) (2) Entropy (a Gaussian RV has the largest entropy given the first and second moments) (3) Negentropy: $J(y) = H(y_{gauss}) - H(y)$, where $y_{gauss}$ is a Gaussian RV with the same covariance matrix as y.
34
ICA To measure statistical independence, use mutual information: the sum of the marginal entropies minus the overall entropy, $I(y_1, \ldots, y_m) = \sum_i H(y_i) - H(y)$. It is non-negative, and zero if and only if the components are independent.
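The definition can be tried on small discrete distributions, where the entropies are exact sums. This is a simplified sketch assuming NumPy: the slides concern continuous variables, and the two joint probability tables below are hypothetical.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]                          # 0 * log 0 is treated as 0
    return float(-(p * np.log(p)).sum())

def mutual_information(joint):
    """Sum of the marginal entropies minus the joint entropy."""
    joint = np.asarray(joint, dtype=float)
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint)

# Independent case: the joint table is the outer product of the marginals
independent = np.outer([0.3, 0.7], [0.4, 0.6])
# Dependent case: mass concentrated on the diagonal
dependent = np.array([[0.4, 0.1], [0.1, 0.4]])

mi_ind = mutual_information(independent)
mi_dep = mutual_information(dependent)
print(mi_ind, mi_dep)
```

The independent table gives zero up to rounding error, and the dependent one gives a strictly positive value.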
35
ICA The computation: There is no closed-form solution, hence gradient descent is used. Approximation to negentropy (for less intensive computation and better resistance to outliers): $J(y) \propto [E\{G(y)\} - E\{G(v)\}]^2$, where v is standard Gaussian and G() is some nonquadratic function. Two commonly used G(): $G_1(u) = \frac{1}{a}\log\cosh(au)$ and $G_2(u) = -\exp(-u^2/2)$. When $G(x) = x^4$ this is kurtosis.
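The negentropy approximation can be evaluated on samples. This sketch assumes G(u) = log cosh(u), a choice that is common in the FastICA literature and not confirmed by the extracted slide text; it also assumes NumPy, with the Gaussian expectation computed by Gauss-Hermite quadrature. Function names are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def G(u, a=1.0):
    """Assumed nonquadratic contrast function: (1/a) log cosh(a u)."""
    return np.log(np.cosh(a * u)) / a

def expected_G_gaussian(n_nodes=60):
    """E[G(v)] for v ~ N(0, 1), by Gauss-Hermite quadrature."""
    x, w = hermegauss(n_nodes)            # weight function exp(-x^2 / 2)
    return float((w * G(x)).sum() / np.sqrt(2 * np.pi))

def negentropy_approx(y):
    """(E[G(y)] - E[G(v)])^2 after standardizing y to mean 0, variance 1."""
    y = np.asarray(y, dtype=float)
    y = (y - y.mean()) / y.std()
    return float((np.mean(G(y)) - expected_G_gaussian()) ** 2)

rng = np.random.default_rng(0)
n = 200_000
j_gauss = negentropy_approx(rng.normal(size=n))          # near zero
j_unif = negentropy_approx(rng.uniform(-1, 1, size=n))   # clearly larger
print(j_gauss, j_unif)
```

The value is defined only up to a positive constant, so it is meaningful for ranking projections rather than as an absolute quantity.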
36
Normal density Whitening transform:
37
ICA FastICA:
1. Center the X vectors to mean zero.
2. Whiten the X vectors such that E(xx’) = I. This is done through eigenvalue decomposition.
3. Initialize the weight vector w.
4. Iterate until convergence: w+ = E{x g(w^T x)} - E{g’(w^T x)} w, then w = w+ / ||w+||.
g() is the derivative of the non-quadratic function G().
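The steps above can be sketched for a single component. This is a minimal illustration, not a full FastICA implementation: it assumes NumPy, takes g = tanh (the derivative of log cosh), extracts one component only, and uses simulated uniform sources with a hypothetical mixing matrix `A`.

```python
import numpy as np

def whiten(X):
    """Center X and transform it so the sample covariance is the identity."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T     # symmetric whitening matrix
    return Xc @ W

def fastica_one_unit(Z, max_iter=200, tol=1e-8, seed=0):
    """One-unit fixed-point iteration on whitened data Z (rows are samples)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=Z.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        wx = Z @ w
        g = np.tanh(wx)                            # g(w'x)
        g_prime = 1.0 - g ** 2                     # g'(w'x)
        w_new = (Z * g[:, None]).mean(axis=0) - g_prime.mean() * w
        w_new /= np.linalg.norm(w_new)
        # w and -w define the same direction, so compare up to sign
        converged = abs(abs(w_new @ w) - 1.0) < tol
        w = w_new
        if converged:
            break
    return w

rng = np.random.default_rng(42)
S = rng.uniform(-1, 1, size=(5000, 2))             # independent uniform sources
A = np.array([[1.0, 0.5], [0.3, 1.0]])             # hypothetical mixing matrix
X = S @ A.T                                        # observed mixtures
Z = whiten(X)
w = fastica_one_unit(Z)
y = Z @ w                                          # estimated component
corr = [abs(np.corrcoef(y, S[:, j])[0, 1]) for j in range(2)]
print("abs. correlation with each source:", corr)
```

The recovered component should match one of the two sources up to sign and scale; further components would be found by repeating the iteration while keeping each new w orthogonal to those already found.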
38
ICA Figure 14.28: Mixtures of independent uniform random variables. The upper-left panel shows 500 realizations from the two independent uniform sources, the upper-right panel their mixed versions. The lower two panels show the PCA and ICA solutions, respectively.