Lecture 7: Principal component analysis (PCA)
- Rationale and use of PCA
- The underlying model (what is a principal component anyway?)
- Eigenvectors and eigenvalues of the sample covariance matrix revisited!
- PCA scores and loadings
- The use of and rationale for rotations
- Orthogonal and oblique rotations
- Component retention, significance, and reliability
Bio 8100s Applied Multivariate Biostatistics 2001

What is PCA?
From a set of p variables X1, X2, …, Xp, we try to find (“extract”) a set of ordered indices Z1, Z2, …, Zp that are uncorrelated and ordered in terms of their variability: Var(Z1) > Var(Z2) > … > Var(Zp).
Because the Zi (principal components) are uncorrelated, they measure different “dimensions” in the data.
The hope (sometimes faint) is that most of the variability in the original set of p variables will be accounted for by c < p components.

Why use PCA?
PCA is generally used to reduce the number of variables considered in subsequent analyses, i.e. to reduce the “dimensionality” of the data. Examples include:
- Reducing the number of dependent variables in MANOVA, multivariate regression, correlation analysis, etc.
- Reducing the number of independent variables (predictors) in regression analysis.

Estimating principal components
The first principal component is obtained by “fitting” (i.e. estimating the coefficients of) the linear function Z1 = a11 X1 + a12 X2 + … + a1p Xp which maximizes Var(Z1), subject to: a11^2 + a12^2 + … + a1p^2 = 1.
The second principal component is obtained by “fitting” (i.e. estimating the coefficients of) the linear function Z2 = a21 X1 + a22 X2 + … + a2p Xp which maximizes Var(Z2), subject to: a21^2 + a22^2 + … + a2p^2 = 1 and Cov(Z1, Z2) = 0.

Estimating principal components (cont’d)
The third principal component is obtained by “fitting” (i.e. estimating the coefficients of) the linear function Z3 = a31 X1 + a32 X2 + … + a3p Xp which maximizes Var(Z3), subject to: a31^2 + a32^2 + … + a3p^2 = 1, as well as the additional constraints Cov(Z3, Z1) = 0 and Cov(Z3, Z2) = 0.
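The constrained maximization described above can be checked numerically. The following is a minimal sketch, assuming NumPy and a made-up data set (the mixing matrix and sample size are illustrative, not from the lecture): it takes the leading eigenvector of the sample covariance matrix as the coefficients of Z1 and compares Var(Z1) with the variance obtained from many random unit-length coefficient vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 observations of p = 3 correlated variables.
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X = rng.normal(size=(200, 3)) @ mixing
C = np.cov(X, rowvar=False)             # sample covariance matrix

# Eigenvectors of C, reordered so that the eigenvalues are decreasing.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
a1, a2 = eigvecs[:, 0], eigvecs[:, 1]   # coefficients of Z1 and Z2

# For Z = a1*X1 + ... + ap*Xp, Var(Z) = a' C a.
# Try many random coefficient vectors scaled to unit length.
trial = rng.normal(size=(10_000, 3))
trial /= np.linalg.norm(trial, axis=1, keepdims=True)
trial_var = np.einsum("ij,jk,ik->i", trial, C, trial)

print("Var(Z1):", a1 @ C @ a1)
print("largest variance among random unit vectors:", trial_var.max())
print("Cov(Z1, Z2):", a1 @ C @ a2)
```

No random unit vector should exceed Var(Z1), and Cov(Z1, Z2) should be zero up to rounding error.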

Estimating principal components
Estimation of the coefficients for each principal component can be accomplished through several different methods (e.g. least-squares estimation, maximum likelihood estimation, iterated principal axis, etc.). The extracted principal components may differ depending on the method of estimation.

The geometry of principal components
Principal components (Zi) are linear functions of the original variables, and as such define hyperplanes in the p-dimensional space of Z and the original variables. Because the Zi are uncorrelated, these planes meet at right angles.
[Figure: principal component axes (Z) drawn in the plane of the original variables X1 and X2.]

Multivariate variance: a geometric interpretation
Univariate variance is a measure of the “volume” occupied by sample points in one dimension. Multivariate variance involving p variables is the volume occupied by sample points in a p-dimensional space.
[Figure: sample points along a single axis X with larger and smaller variance; occupied volume in the (X1, X2) plane.]

Multivariate variance: effects of correlations among variables
Correlations between pairs of variables reduce the volume occupied by sample points, and hence reduce the multivariate variance.
[Figure: occupied volume in the (X1, X2) plane with no correlation, positive correlation, and negative correlation.]

C and the generalized multivariate variance
The determinant of the sample covariance matrix C is a generalized multivariate variance, because the squared area of a parallelogram with sides given by the individual standard deviations and angle determined by the correlation between variables equals the determinant of C.
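For two variables the parallelogram statement can be verified directly. This is a small sketch, assuming NumPy and simulated data; it also assumes that the angle θ between the sides satisfies cos θ = r, which is one common reading of "angle determined by the correlation". Under that assumption the squared area is s1^2 s2^2 sin^2 θ = s1^2 s2^2 (1 - r^2), which is the determinant of C.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical bivariate sample with correlated columns.
x1 = rng.normal(size=500)
x2 = 0.6 * x1 + rng.normal(scale=0.8, size=500)
C = np.cov(x1, x2)                      # 2 x 2 sample covariance matrix

s1, s2 = np.sqrt(np.diag(C))            # standard deviations (side lengths)
r = C[0, 1] / (s1 * s2)                 # sample correlation
theta = np.arccos(r)                    # assumed angle between the sides
area = s1 * s2 * np.sin(theta)          # area of the parallelogram

print("squared area:", area**2)
print("det(C):      ", np.linalg.det(C))
```

As r approaches ±1 the parallelogram collapses and the determinant goes to zero, which matches the claim that correlation reduces the multivariate variance.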

Eigenvalues and eigenvectors of C
Eigenvectors of the covariance matrix C are orthogonal directed line segments that “span” the variation in the data, and the corresponding (unsigned) eigenvalues are the lengths of these segments, so the product of the eigenvalues is the “volume” occupied by the data, i.e. the determinant of the covariance matrix.
[Figure: eigenvectors in the (X1, X2) plane with no correlation, positive correlation, and negative correlation.]

The geometry of principal components (cont’d)
The coefficients (aij) of the principal components (Zi) define vectors in the space of coefficients. These vectors are the eigenvectors (ai) of the sample covariance matrix C, and the corresponding (unsigned) eigenvalues (λi) are the variances of each component, i.e. Var(Zi), and the product of the eigenvalues is the “volume” occupied by the data, i.e. the determinant of the covariance matrix.
[Figure: eigenvectors a1 and a2 with eigenvalues λ1 and λ2, plotted on coefficient axes running from -1 to 1.]
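A short numerical check of these relationships, sketched with NumPy on simulated data (the data and mixing matrix are illustrative assumptions): the variance of each component, computed as a' C a, should equal the corresponding eigenvalue, and the product of the eigenvalues should equal the determinant of C.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: 300 observations of 3 correlated variables.
mixing = np.array([[1.5, 0.4, 0.2],
                   [0.0, 1.0, 0.5],
                   [0.0, 0.0, 0.7]])
X = rng.normal(size=(300, 3)) @ mixing
C = np.cov(X, rowvar=False)

# Columns of A are the eigenvectors of C, sorted by decreasing eigenvalue.
eigvals, A = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, A = eigvals[order], A[:, order]

# Var(Zi) = ai' C ai for each component i.
component_var = np.array([A[:, i] @ C @ A[:, i] for i in range(3)])

print("component variances:", component_var)
print("eigenvalues:        ", eigvals)
print("product of eigenvalues:", np.prod(eigvals))
print("det(C):                ", np.linalg.det(C))
```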

Another important relationship!
The sum of the eigenvalues of the covariance matrix C equals the sum of the diagonal elements of C, i.e. the trace of C. So, the sum of the variances of the principal components equals the sum of the variances of the original variables.
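The trace identity is what makes "proportion of variance explained" meaningful. A minimal sketch, assuming NumPy and simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data: 250 observations of 4 correlated variables.
X = rng.normal(size=(250, 4)) @ rng.normal(size=(4, 4))
C = np.cov(X, rowvar=False)

eigvals = np.linalg.eigvalsh(C)[::-1]   # eigenvalues in decreasing order
total_variance = np.trace(C)            # sum of the original variances
proportion = eigvals / total_variance   # share of variance per component

print("sum of eigenvalues:", eigvals.sum())
print("trace of C:        ", total_variance)
print("proportion of variance per component:", proportion)
```

Because the eigenvalues sum to the trace, the proportions sum to one.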

Scale and the correlation matrix
Since variables may be measured on different scales, and we want to eliminate scale effects, we usually work with standardized values so that each variable is scaled to have zero mean and unit variance. The sample covariance matrix of standardized variables is the sample correlation matrix R.
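The equivalence between the covariance matrix of standardized variables and the correlation matrix can be confirmed with a few lines. This sketch assumes NumPy and uses invented variables on deliberately different scales:

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical variables on very different measurement scales.
base = rng.normal(size=(200, 3)) @ np.array([[1.0, 0.5, 0.2],
                                              [0.0, 1.0, 0.4],
                                              [0.0, 0.0, 1.0]])
X = base * np.array([1.0, 100.0, 0.01])

# Standardize: subtract the mean, divide by the sample standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

R = np.corrcoef(X, rowvar=False)        # correlation matrix of the raw data
C_std = np.cov(X_std, rowvar=False)     # covariance of standardized data

print(np.allclose(C_std, R))
```

Note the use of `ddof=1` so that the standard deviation matches the divisor used by `np.cov`.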

Principal component scores
Because principal components are functions, we can “plug in” the values for each variable for each observation, and calculate a PC score for each observation and each principal component.
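Computing scores amounts to one matrix product. The sketch below assumes NumPy, simulated data, and variables centred on their means before the coefficients are applied (centring is an assumption here; it shifts the scores but does not change their variances or correlations).

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical data: 150 observations of 3 correlated variables.
X = rng.normal(size=(150, 3)) @ np.array([[1.2, 0.6, 0.1],
                                           [0.0, 0.9, 0.4],
                                           [0.0, 0.0, 0.5]])
C = np.cov(X, rowvar=False)

eigvals, A = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, A = eigvals[order], A[:, order]

# Plug each observation into each component: column i holds the Zi scores.
X_centred = X - X.mean(axis=0)
scores = X_centred @ A

# The scores are uncorrelated, with variances equal to the eigenvalues.
score_cov = np.cov(scores, rowvar=False)
print(np.round(score_cov, 6))
```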

Principal component loadings
Component loadings (Lij) are the covariances (correlations for standardized values) of the original variables used in the PCA with the components, and are proportional to the component coefficients (aij). For each component, the squared loadings summed over all variables equal the variance of the component.
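For standardized variables, the loadings can be obtained by scaling each eigenvector by the square root of its eigenvalue. The following sketch assumes NumPy, simulated data, and a PCA of the correlation matrix; it checks the scaled coefficients against directly computed correlations between variables and component scores.

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical data: 200 observations of 4 correlated variables.
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
p = X.shape[1]

X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)

eigvals, A = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, A = eigvals[order], A[:, order]
scores = X_std @ A

# Loadings: coefficients of component i scaled by sqrt(eigenvalue i).
# Rows are variables, columns are components.
loadings = A * np.sqrt(eigvals)

# Direct computation: correlation of each variable with each component.
full = np.corrcoef(np.hstack([X_std, scores]), rowvar=False)
direct = full[:p, p:]

print(np.allclose(loadings, direct))
print((loadings**2).sum(axis=0), eigvals)
```

The column sums of the squared loadings reproduce the eigenvalues, i.e. the component variances.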

More on loadings
Sometimes components have variables with similar loadings, which form a “natural” group. To assist in interpretation, we may want to choose another component frame which emphasizes these differences among groups.
[Figure: factor plot, FACTOR(2) axis.]

Orthogonal rotations: varimax
Orthogonal (angle preserving): new (rotated) components are still uncorrelated.
Varimax: rotation done so that each component loads high on a small number of variables and low on other variables (simplifies factors).
[Figure: factor plots, unrotated and Varimax-rotated, FACTOR(2) axis.]
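To make the varimax idea concrete, here is a minimal sketch of one widely used SVD-based iteration for raw varimax (no Kaiser normalization), written with NumPy. The function names `varimax` and `varimax_criterion`, the tolerance, and the example loadings are all assumptions for illustration; statistical packages may use a different algorithm or normalization and can give slightly different results.

```python
import numpy as np

def varimax_criterion(loadings):
    """Sum over components of the variance of the squared loadings."""
    return np.var(loadings**2, axis=0).sum()

def varimax(loadings, max_iter=100, tol=1e-8):
    """Rotate a (variables x components) loading matrix by raw varimax."""
    p, k = loadings.shape
    rotation = np.eye(k)
    objective = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Matrix proportional to the gradient of the varimax criterion.
        col_ss = (rotated**2).sum(axis=0)
        gradient = loadings.T @ (rotated**3 - rotated * col_ss / p)
        u, s, vt = np.linalg.svd(gradient)
        rotation = u @ vt               # orthogonal by construction
        new_objective = s.sum()
        if new_objective < objective * (1 + tol):
            break
        objective = new_objective
    return loadings @ rotation, rotation

# Hypothetical loadings with simple structure, then mixed by 30 degrees
# to imitate an unrotated solution.
simple = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.0],
                   [0.0, 0.9], [0.1, 0.8], [0.0, 0.7]])
angle = np.deg2rad(30)
mix = np.array([[np.cos(angle), -np.sin(angle)],
                [np.sin(angle),  np.cos(angle)]])
unrotated = simple @ mix

rotated, rotation = varimax(unrotated)
print(np.round(rotated, 2))
print(varimax_criterion(unrotated), varimax_criterion(rotated))
```

Because the rotation matrix is orthogonal, each variable's sum of squared loadings across components is unchanged; only the distribution of loadings among components changes.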

Orthogonal rotations: quartimax
Orthogonal (angle preserving): new (rotated) components are still uncorrelated.
Quartimax: rotation done so that each variable loads mainly on one factor (simplifies variables).
[Figure: factor plots, unrotated and Quartimax-rotated, FACTOR(2) axis.]

Orthogonal rotations: Equamax
Orthogonal (angle preserving): new (rotated) components are still uncorrelated.
Equamax: combines varimax and quartimax. Both the number of variables that load highly on a factor and the number of factors needed to explain each variable are optimized.
[Figure: factor plots, unrotated and Equamax-rotated, FACTOR(2) axis.]

Oblique rotations, e.g. Oblimin
Oblique (non-angle preserving): new (rotated) components are now correlated.
Most reasonable when significant intercorrelations among factors exist.
[Figure: factor plots, unrotated and Oblimin-rotated, FACTOR(2) axis.]

The consequences of rotation
Unrotated components are (1) uncorrelated; (2) ordered in terms of decreasing variance (i.e., Var(Z1) > Var(Z2) > …).
Orthogonally rotated components are (1) still uncorrelated, but (2) need not be ordered in terms of decreasing variance (e.g. for Varimax rotation).
Obliquely rotated components are (1) correlated; (2) unordered (in general).

The rotated pattern matrix for obliquely rotated factors
The elements of the matrix are analogous to standardized partial regression coefficients from a multiple regression analysis. So each element quantifies the importance of the variable in question to the component, once the effects of other variables are controlled.
[Table: Rotated Pattern Matrix (OBLIMIN, Gamma = ), rows HEIGHT, ARM_SPAN, FOREARM, LOWERLEG, WEIGHT, BITRO, CHESTGIR, CHESTWID.]

The rotated structure matrix for obliquely rotated factors
The elements of the rotated structure matrix are the simple correlations of the variable in question with the factor, i.e. the component loadings. For orthogonal factors, the factor pattern and factor structure matrices are identical.
[Table: rotated structure matrix, rows HEIGHT, ARM_SPAN, FOREARM, LOWERLEG, WEIGHT, BITRO, CHESTGIR, CHESTWID.]
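The link between the two matrices is the standard factor-analytic relation structure = pattern × Φ, where Φ is the matrix of correlations among the factors. That relation is not stated on the slide, and the numbers below are invented for illustration; the sketch simply shows why the two matrices coincide when the factors are uncorrelated (Φ = I).

```python
import numpy as np

# Hypothetical pattern matrix: 3 variables (rows) by 2 factors (columns).
pattern = np.array([[0.8, 0.1],
                    [0.7, 0.0],
                    [0.1, 0.9]])

# Hypothetical factor correlation matrix after an oblique rotation.
phi_oblique = np.array([[1.0, 0.4],
                        [0.4, 1.0]])

# Structure matrix: simple correlations of variables with factors.
structure_oblique = pattern @ phi_oblique

# With orthogonal factors the correlation matrix is the identity.
structure_orthogonal = pattern @ np.eye(2)

print(structure_oblique)
print(np.allclose(structure_orthogonal, pattern))
```

With correlated factors, a variable can correlate appreciably with a factor even when its pattern coefficient on that factor is near zero.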

Which rotation is the best?
Objective: find the rotation which achieves the simplest structure among component loadings, thereby making interpretation comparatively easy.
Thurstone’s criteria: for p variables and m < p components: (1) each component should have at least m near-zero loadings; (2) few components should have non-zero loadings on the same variable.

A final word on rotations
“You cannot say that any rotation is better than any other rotation from a statistical point of view: all rotations are equally good statistically. Therefore, the choice among different rotations must be based on non-statistical grounds…” SAS STAT User’s guide, Vol. 1, p. 776.

How many components to retain in subsequent analysis?
- Kaiser rule: retain only components with eigenvalues > 1 (for a PCA of the correlation matrix, where the average eigenvalue is 1).
- Scree test: plot eigenvalues against their ordinal numbers, and retain all components in the “steep descent” part of the curve.
- Retain as many factors as required to account for a specified amount of the total variance (e.g. 85%).
[Figure: scree plot of eigenvalue against component number, with the Kaiser threshold marked.]
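Two of these rules are easy to automate; the scree test is a visual judgement and is left out. The sketch below assumes NumPy, eigenvalues from a correlation-matrix PCA, and a hypothetical helper name `components_to_retain`; the eigenvalues in the example are invented.

```python
import numpy as np

def components_to_retain(eigvals, variance_target=0.85):
    """Return (Kaiser count, count needed for the variance target,
    cumulative proportion of variance)."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    kaiser = int((eigvals > 1.0).sum())
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    # Index of the first component at which the target is reached.
    first = int(np.searchsorted(cumulative, variance_target))
    by_variance = min(first + 1, len(eigvals))
    return kaiser, by_variance, cumulative

# Hypothetical eigenvalues for six standardized variables (they sum to 6).
example = [3.1, 1.4, 0.7, 0.4, 0.3, 0.1]
kaiser, by_variance, cumulative = components_to_retain(example)

print("Kaiser rule:", kaiser)
print("components for 85% of variance:", by_variance)
print("cumulative proportion:", np.round(cumulative, 3))
```

The rules need not agree: here the Kaiser rule keeps two components while the 85% target needs three, which is one reason the slide lists several criteria.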

More on interpretation: the significance of loadings
Since loadings are correlation coefficients (r), we can test the null hypothesis that each correlation equals zero. But analytic estimates of standard errors are often too small, especially for rotated loadings. So, as a rule of thumb, use double the critical value to test significance. E.g., for N = 100, r(α = 0.01) = 0.256, so “significant” loadings are those greater than 2(0.256) = 0.512 in absolute value.
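The critical value of r for a given N and α follows from the t distribution with N - 2 degrees of freedom, via r = t / sqrt(t^2 + N - 2). The sketch below assumes SciPy is available and a two-tailed test (the slide does not say which tail convention it uses); `critical_r` is a hypothetical helper name.

```python
import math
from scipy import stats

def critical_r(n, alpha=0.01):
    """Two-tailed critical value of Pearson's r for sample size n."""
    df = n - 2
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return t_crit / math.sqrt(t_crit**2 + df)

for n in (80, 100, 200):
    r_crit = critical_r(n)
    print(f"N = {n}: critical r = {r_crit:.3f}, doubled = {2 * r_crit:.3f}")
```

The doubled value is the rule-of-thumb threshold against which the absolute loading is compared.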

Component reliability: rules of thumb
The absolute magnitude and number of loadings are crucial for determining reliability.
Components with at least 4 loadings greater than 0.60 in absolute value, or with at least 3 loadings greater than 0.80 in absolute value, are reliable.
For N > 150, components with at least 10 loadings greater than 0.40 in absolute value are reliable.
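These rules of thumb translate directly into a small check. This is a sketch assuming NumPy; `is_reliable` is a hypothetical helper, and the loadings in the example are invented.

```python
import numpy as np

def is_reliable(component_loadings, n):
    """Apply the rules of thumb to the loadings of one component.

    component_loadings: loadings of all variables on the component.
    n: sample size.
    """
    size = np.abs(np.asarray(component_loadings, dtype=float))
    if (size > 0.60).sum() >= 4 or (size > 0.80).sum() >= 3:
        return True
    return bool(n > 150 and (size > 0.40).sum() >= 10)

# Hypothetical loadings for two components.
strong = [0.72, 0.65, -0.63, 0.61, 0.10, 0.05]
weak = [0.50, 0.45, 0.30, -0.20, 0.10, 0.05]

print(is_reliable(strong, n=50))
print(is_reliable(weak, n=200))
```

The check works on absolute values, so large negative loadings count in the same way as large positive ones.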

PCA: the procedure
1. Calculate the sample covariance matrix or correlation matrix. If all variables are on the same scale, use the sample covariance matrix; otherwise use the correlation matrix.
2. Run PCA to extract unrotated components (“initial extraction”).
3. Decide which components to use in subsequent analysis based on the Kaiser rule, scree plots, etc.
4. Based on (3), rerun the analysis using different orthogonal and oblique rotations and compare using factor plots (“follow-up extraction”).

PCA: the procedure (cont’d)
5. For obliquely rotated components, calculate correlations among components. Small correlations suggest that orthogonal rotations are reasonable.
6. Evaluate statistical significance of component loadings obtained from the “best” rotation.
7. Check component reliability by redoing steps (1) - (6) with another (independent) data set, and compare the component loadings obtained from the two data sets. Are they close?