Lecture 7: Principal component analysis (PCA)
Rationale and use of PCA
The underlying model (what is a principal component anyway?)
Eigenvectors and eigenvalues of the sample covariance matrix revisited!
PCA scores and loadings
The use of and rationale for rotations
Orthogonal and oblique rotations
Component retention, significance, and reliability
Bio 8100s Applied Multivariate Biostatistics 2001
What is PCA?
From a set of p variables X1, X2, …, Xp, we try to find (“extract”) a set of indices Z1, Z2, …, Zp that are uncorrelated and ordered in terms of their variability: Var(Z1) ≥ Var(Z2) ≥ … ≥ Var(Zp).
Because the Zi (the principal components) are uncorrelated, they measure different “dimensions” in the data. The hope (sometimes faint) is that most of the variability in the original set of p variables will be accounted for by c < p components.
Why use PCA?
PCA is generally used to reduce the number of variables considered in subsequent analyses, i.e. to reduce the “dimensionality” of the data. Examples include:
Reducing the number of dependent variables in MANOVA, multivariate regression, correlation analysis, etc.
Reducing the number of independent variables (predictors) in regression analysis.
Estimating principal components
The first principal component is obtained by “fitting” (i.e. estimating the coefficients of) the linear function $Z_1 = a_{11}X_1 + a_{12}X_2 + \dots + a_{1p}X_p$ which maximizes Var(Z1), subject to: $a_{11}^2 + a_{12}^2 + \dots + a_{1p}^2 = 1$.
The second principal component is obtained by “fitting” (i.e. estimating the coefficients of) the linear function $Z_2 = a_{21}X_1 + a_{22}X_2 + \dots + a_{2p}X_p$ which maximizes Var(Z2), subject to: $a_{21}^2 + a_{22}^2 + \dots + a_{2p}^2 = 1$ and $\mathrm{Cov}(Z_1, Z_2) = 0$.
Estimating principal components (cont’d)
The third principal component is obtained by “fitting” (i.e. estimating the coefficients of) the linear function $Z_3 = a_{31}X_1 + a_{32}X_2 + \dots + a_{3p}X_p$ which maximizes Var(Z3), subject to: $a_{31}^2 + a_{32}^2 + \dots + a_{3p}^2 = 1$, as well as the additional constraints $\mathrm{Cov}(Z_1, Z_3) = 0$ and $\mathrm{Cov}(Z_2, Z_3) = 0$.
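One standard way to solve these constrained maximizations is an eigen-decomposition of the sample covariance matrix. The sketch below uses simulated data (the array X, the mixing matrix, and the random seed are assumptions made purely for illustration) and checks that the resulting coefficient vectors have unit length and that the components are uncorrelated with decreasing variances.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 observations on p = 3 correlated variables.
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.5]])

C = np.cov(X, rowvar=False)             # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)    # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]       # reorder so Var(Z1) >= Var(Z2) >= ...
eigvals, A = eigvals[order], eigvecs[:, order]  # column i of A: coefficients of component i+1

Z = (X - X.mean(axis=0)) @ A            # component values for each observation

print(np.sum(A**2, axis=0))                    # squared coefficients sum to 1 per component
print(np.round(np.cov(Z, rowvar=False), 6))    # diagonal: eigenvalues; off-diagonal: ~0
```

The sign of each coefficient vector is arbitrary, so different software may report a component with all signs flipped.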
Estimating principal components
Estimation of the coefficients for each principal component can be accomplished through several different methods (e.g. least-squares estimation, maximum likelihood estimation, iterated principal axis, etc.). The extracted principal components may differ depending on the method of estimation.
The geometry of principal components
Principal components (Zi) are linear functions of the original variables and, as such, define hyperplanes in the (p + 1)-dimensional space of Z and the original variables. Because the Zi are uncorrelated, these planes meet at right angles.
[Figure: the plane Z1 drawn over the axes X1 and X2, with vertical axis Z]
Multivariate variance: a geometric interpretation
Univariate variance is a measure of the “volume” occupied by sample points in one dimension. Multivariate variance involving p variables is the volume occupied by sample points in a p-dimensional space.
[Figure: larger and smaller variance along a single axis X, and the volume occupied by sample points in the X1-X2 plane]
Multivariate variance: effects of correlations among variables
Correlations between pairs of variables reduce the volume occupied by sample points and hence reduce the multivariate variance.
[Figure: occupied volume in the X1-X2 plane with no correlation, positive correlation, and negative correlation]
C and the generalized multivariate variance
The determinant of the sample covariance matrix C is a generalized multivariate variance, because the squared area of a parallelogram with sides given by the individual standard deviations and angle determined by the correlation between the variables equals the determinant of C.
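For two variables this identity can be checked numerically. The sketch below assumes that the angle between the sides of the parallelogram satisfies cos(theta) = r, which is one common reading of “angle determined by the correlation”. The standard deviations and the correlation are hypothetical values.

```python
import numpy as np

s1, s2, r = 2.0, 3.0, 0.6    # hypothetical standard deviations and correlation

# Covariance matrix of two variables with these standard deviations and correlation.
C = np.array([[s1**2,       r * s1 * s2],
              [r * s1 * s2, s2**2      ]])

theta = np.arccos(r)             # assumed: angle between the sides, cos(theta) = r
area = s1 * s2 * np.sin(theta)   # area of a parallelogram with sides s1 and s2

det_C = np.linalg.det(C)
print(det_C, area**2)            # both equal s1^2 * s2^2 * (1 - r^2)
```

As r approaches 1 or -1 the parallelogram collapses and the determinant goes to zero, which is the sense in which correlation reduces the generalized variance.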
Eigenvalues and eigenvectors of C
Eigenvectors of the covariance matrix C are orthogonal directed line segments that “span” the variation in the data, and the corresponding (unsigned) eigenvalues are the lengths of these segments, so the product of the eigenvalues is the “volume” occupied by the data, i.e. the determinant of the covariance matrix.
[Figure: eigenvectors in the X1-X2 plane with no correlation, positive correlation, and negative correlation]
The geometry of principal components (cont’d)
The coefficients (aij) of the principal components (Zi) define vectors in the space of coefficients. These vectors are the eigenvectors (ai) of the sample covariance matrix C, and the corresponding (unsigned) eigenvalues (λi) are the variances of each component, i.e. Var(Zi), and the product of the eigenvalues is the “volume” occupied by the data, i.e. the determinant of the covariance matrix.
[Figure: eigenvectors a1 and a2, with eigenvalues λ1 and λ2, in the X1-X2 plane on axes running from -1 to 1]
Another important relationship!
The sum of the eigenvalues of the covariance matrix C equals the sum of the diagonal elements of C, i.e. the trace of C. So, the sum of the variances of the principal components equals the sum of the variances of the original variables.
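Both eigenvalue identities (sum equals trace, product equals determinant) are easy to confirm numerically. The following is a minimal check on simulated data; the data matrix and seed are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))     # hypothetical data: 50 observations, 4 variables
X[:, 1] += 0.8 * X[:, 0]         # induce some correlation between two variables

C = np.cov(X, rowvar=False)
eigvals = np.linalg.eigvalsh(C)  # eigenvalues of the symmetric matrix C

eig_sum, trace_C = eigvals.sum(), np.trace(C)
eig_prod, det_C = eigvals.prod(), np.linalg.det(C)

print(eig_sum, trace_C)          # total variance is preserved
print(eig_prod, det_C)           # generalized variance is preserved
```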
Scale and the correlation matrix
Since variables may be measured on different scales, and we want to eliminate scale effects, we usually work with standardized values so that each variable is scaled to have zero mean and unit variance. The sample covariance matrix of standardized variables is the sample correlation matrix R.
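A short check of this statement, assuming simulated variables on deliberately different scales (the scale factors and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical variables measured on very different scales.
X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 100.0])

# Standardize: subtract the mean, divide by the sample standard deviation.
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

C_std = np.cov(Xs, rowvar=False)   # covariance matrix of the standardized variables
R = np.corrcoef(X, rowvar=False)   # correlation matrix of the original variables

print(np.round(C_std - R, 12))     # differences are zero up to rounding error
```

Note that the standard deviation uses ddof=1 so that it matches the divisor used by np.cov.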
Principal component scores
Because principal components are functions, we can “plug in” the values for each variable for each observation, and calculate a PC score for each observation and each principal component.
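As a sketch of this “plugging in”, the hypothetical helper pc_scores below (not part of any package) standardizes the data, extracts components from the correlation matrix, and multiplies each observation by the component coefficients. The simulated data are an assumption for illustration.

```python
import numpy as np

def pc_scores(X):
    """Return (scores, eigenvalues, coefficients) from a PCA of the correlation matrix."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized values
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]                   # largest variance first
    eigvals, A = eigvals[order], eigvecs[:, order]
    return Xs @ A, eigvals, A                           # score = sum_j a_ij * x_j

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 3))    # hypothetical data
X[:, 2] += X[:, 0]

scores, eigvals, A = pc_scores(X)
print(scores.shape)                               # one row per observation, one column per component
print(np.round(scores.var(axis=0, ddof=1), 6))    # score variances equal the eigenvalues
print(np.round(eigvals, 6))
```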
Principal component loadings
Component loadings (Lij) are the covariances (correlations for standardized values) of the original variables used in the PCA with the components, and are proportional to the component coefficients (aij). For each component, the squared loadings summed over all variables equal the variance of the component.
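For standardized variables, the loadings can be obtained by scaling each coefficient vector by the square root of its eigenvalue. The sketch below, on simulated data (an assumption for illustration), compares that shortcut with correlations computed directly between variables and component scores, and checks the sum-of-squares property.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 4))    # hypothetical data with two correlated pairs
X[:, 1] += X[:, 0]
X[:, 3] += X[:, 2]

Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, A = eigvals[order], eigvecs[:, order]
Z = Xs @ A                      # component scores

# Loadings: column i of A scaled by sqrt(eigenvalue i).
loadings = A * np.sqrt(eigvals)

# Direct computation: correlation of variable j with component i.
p = X.shape[1]
direct = np.array([[np.corrcoef(Xs[:, j], Z[:, i])[0, 1] for i in range(p)]
                   for j in range(p)])

print(np.round(loadings - direct, 10))           # the two agree up to rounding
print(np.round((loadings**2).sum(axis=0), 6))    # column sums of squares = eigenvalues
```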
More on loadings
Sometimes components have variables with similar loadings, which form a “natural” group. To assist in interpretation, we may want to choose another component frame which emphasizes these differences among groups.
[Figure: factor plot of the loadings; axis label FACTOR(2)]
Orthogonal rotations: varimax
Orthogonal (angle preserving): new (rotated) components are still uncorrelated.
Varimax: rotation done so that each component loads high on a small number of variables and low on other variables (simplifies factors).
[Figure: factor plots of unrotated and varimax-rotated loadings; axis label FACTOR(2)]
Orthogonal rotations: quartimax
Orthogonal (angle preserving): new (rotated) components are still uncorrelated.
Quartimax: rotation done so that each variable loads mainly on one factor (simplifies variables).
[Figure: factor plots of unrotated and quartimax-rotated loadings; axis label FACTOR(2)]
Orthogonal rotations: Equamax
Orthogonal (angle preserving): new (rotated) components are still uncorrelated.
Equamax: combines varimax and quartimax. The number of variables that load highly on a factor and the number of factors needed to explain a variable are both optimized.
[Figure: factor plots of unrotated and Equamax-rotated loadings; axis label FACTOR(2)]
Oblique rotations, e.g. Oblimin
Oblique (non-angle preserving): new (rotated) components are now correlated.
Most reasonable when significant intercorrelations among factors exist.
[Figure: factor plots of unrotated and Oblimin-rotated loadings; axis label FACTOR(2)]
The consequences of rotation
Unrotated components are (1) uncorrelated; (2) ordered in terms of decreasing variance (i.e., Var(Z1) ≥ Var(Z2) ≥ …).
Orthogonally rotated components are (1) still uncorrelated, but (2) need not be ordered in terms of decreasing variance (e.g. for Varimax rotation).
Obliquely rotated components are (1) correlated; (2) unordered (in general).
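To make the effect of an orthogonal rotation concrete, here is a minimal sketch of raw varimax (no Kaiser normalization) using a commonly used SVD-based iteration. The function name, the loading matrix, and the stopping rule are assumptions for illustration; statistical packages may normalize or iterate differently and so give slightly different results.

```python
import numpy as np

def varimax(L, max_iter=100, tol=1e-8):
    """Rotate a loading matrix L (variables x components) by raw varimax.

    Returns the rotated loadings and the orthogonal rotation matrix T.
    """
    p, k = L.shape
    T = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        Lr = L @ T
        # Update matrix derived from the varimax criterion.
        G = L.T @ (Lr**3 - Lr @ np.diag(np.sum(Lr**2, axis=0)) / p)
        u, s, vt = np.linalg.svd(G)
        T = u @ vt                     # orthogonal matrix built from the SVD of G
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):
            break                      # no meaningful improvement: stop
        criterion = new_criterion
    return L @ T, T

# Hypothetical unrotated loadings for 6 variables on 2 components.
L = np.array([[0.70, 0.50], [0.80, 0.40], [0.75, 0.45],
              [0.60, -0.50], [0.70, -0.60], [0.65, -0.55]])
L_rot, T = varimax(L)

print(np.round(T.T @ T, 6))                   # identity: the rotation is orthogonal
print(np.round((L**2).sum(axis=0), 3))        # variance per component before rotation
print(np.round((L_rot**2).sum(axis=0), 3))    # variance is redistributed after rotation
```

The rotation leaves each variable's communality (row sum of squared loadings) and the total variance unchanged; only the split of variance among components changes, which is why the ordering by variance can be lost.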
The rotated pattern matrix for obliquely rotated factors
The elements of the matrix are analogous to standardized partial regression coefficients from a multiple regression analysis. So each element quantifies the importance of the variable in question to the component, once the effects of other variables are controlled.

Rotated Pattern Matrix (OBLIMIN, Gamma = 0.0000)
                 1         2
HEIGHT         0.909     0.060
ARM_SPAN       0.957    -0.017
FOREARM        0.953    -0.048
LOWERLEG       0.916     0.028
WEIGHT         0.054     0.897
BITRO         -0.011     0.864
CHESTGIR      -0.090     0.882
CHESTWID       0.088     0.749
The rotated structure matrix for obliquely rotated factors
                 1         2
HEIGHT         0.933     0.363
ARM_SPAN       0.935     0.452
FOREARM        0.950     0.396
LOWERLEG       0.928     0.423
WEIGHT         0.441     0.921
BITRO          0.410     0.787
CHESTGIR       0.362     0.860
CHESTWID       0.290     0.843

The elements of the rotated structure matrix are the simple correlations of the variable in question with the factor, i.e. the component loadings. For orthogonal factors, the factor pattern and factor structure matrices are identical.
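The link between the two matrices is that the structure matrix equals the pattern matrix multiplied by the matrix of correlations among the components. The sketch below uses a small hypothetical pattern matrix and an assumed between-component correlation of 0.4 (neither is taken from the analysis above) to show why the two matrices coincide when the components are uncorrelated.

```python
import numpy as np

# Hypothetical pattern matrix for 4 variables on 2 oblique components.
P = np.array([[0.90,  0.05],
              [0.95, -0.02],
              [0.05,  0.90],
              [-0.05, 0.85]])

def structure_matrix(pattern, phi):
    """Structure matrix S = P @ Phi, where Phi holds the correlations among components."""
    return pattern @ phi

phi_oblique = np.array([[1.0, 0.4],
                        [0.4, 1.0]])   # assumed correlation of 0.4 between components
phi_orthogonal = np.eye(2)             # uncorrelated components

S_oblique = structure_matrix(P, phi_oblique)
S_orthogonal = structure_matrix(P, phi_orthogonal)

print(np.round(S_oblique, 3))      # differs from P: correlations "leak" across components
print(np.round(S_orthogonal, 3))   # identical to P
```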
Which rotation is the best?
Object: find the rotation which achieves the simplest structure among component loadings, thereby making interpretation comparatively easy.
Thurstone’s criteria: for p variables and m < p components: (1) each component should have at least m near-zero loadings; (2) few components should have non-zero loadings on the same variable.
A final word on rotations
“You cannot say that any rotation is better than any other rotation from a statistical point of view: all rotations are equally good statistically. Therefore, the choice among different rotations must be based on non-statistical grounds…” SAS/STAT User’s Guide, Vol. 1, p. 776.
How many components to retain in subsequent analysis?
Kaiser rule: retain only components with eigenvalues > 1.
Scree test: plot eigenvalues against their ordinal numbers, and retain all components in the “steep descent” part of the curve.
Retain as many factors as required to account for a specified amount of the total variance (e.g. 85%).
[Figure: scree plot of eigenvalue against component number, with the Kaiser threshold marked]
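Two of these rules are simple enough to sketch in code. The helper names and the eigenvalues below are hypothetical; the eigenvalues are chosen to sum to 8, as they would for a correlation matrix of 8 variables. The scree test is a visual judgment and is not automated here.

```python
import numpy as np

def kaiser_rule(eigvals):
    """Number of components with eigenvalue > 1."""
    return int(np.sum(np.asarray(eigvals) > 1.0))

def n_for_variance(eigvals, target=0.85):
    """Smallest number of components whose cumulative variance share reaches target."""
    ev = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cumulative = np.cumsum(ev) / ev.sum()
    return int(np.searchsorted(cumulative, target) + 1)

# Hypothetical eigenvalues from a correlation matrix of 8 variables (sum = 8).
eigvals = [4.67, 1.77, 0.48, 0.42, 0.23, 0.19, 0.14, 0.10]

print(kaiser_rule(eigvals))            # components retained by the Kaiser rule
print(n_for_variance(eigvals, 0.85))   # components needed for 85% of total variance
```

With these values the two rules disagree (2 components versus 3). The eigenvalue threshold of 1 is meaningful for components extracted from a correlation matrix, where each standardized variable contributes a variance of 1.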
More on interpretation: the significance of loadings
Since loadings are correlation coefficients (r), we can test the null hypothesis that each correlation equals zero. But analytic estimates of standard errors are often too small, especially for rotated loadings. So, as a rule of thumb, use double the critical value to test significance. E.g., for N = 100, the two-tailed critical value is r(α = 0.01) = 0.256, so “significant” loadings are those greater than 2(0.256) = 0.512 in absolute value.
Component reliability: rules of thumb
The absolute magnitude and number of loadings are crucial for determining reliability.
Components with at least 4 loadings greater than 0.60 in absolute value, or with at least 3 loadings greater than 0.80 in absolute value, are reliable.
For N > 150, components with at least 10 loadings greater than 0.40 in absolute value are reliable.
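These rules of thumb translate directly into a small check. The function name is hypothetical; the first set of loadings is column 1 of the rotated structure matrix shown earlier, and the second set is made up to illustrate a failing case.

```python
import numpy as np

def is_reliable(loadings, n_obs):
    """Apply the rule-of-thumb reliability checks to one component's loadings."""
    a = np.abs(np.asarray(loadings, dtype=float))
    if np.sum(a > 0.80) >= 3:
        return True                  # at least 3 loadings above 0.80
    if np.sum(a > 0.60) >= 4:
        return True                  # at least 4 loadings above 0.60
    if n_obs > 150 and np.sum(a > 0.40) >= 10:
        return True                  # large sample, at least 10 loadings above 0.40
    return False

# Column 1 of the rotated structure matrix (sample size here is an assumption).
component_1 = [0.933, 0.935, 0.950, 0.928, 0.441, 0.410, 0.362, 0.290]
# A made-up weak component.
weak = [0.50, 0.45, 0.30, 0.20]

print(is_reliable(component_1, n_obs=100))
print(is_reliable(weak, n_obs=100))
```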
PCA: the procedure
1. Calculate the sample covariance matrix or correlation matrix. If all variables are on the same scale, use the sample covariance matrix; otherwise use the correlation matrix.
2. Run PCA to extract unrotated components (“initial extraction”).
3. Decide which components to use in subsequent analysis based on the Kaiser rule, scree plots, etc.
4. Based on (3), rerun the analysis using different orthogonal and oblique rotations and compare using factor plots (“follow-up extraction”).
PCA: the procedure (cont’d)
5. For obliquely rotated components, calculate correlations among components. Small correlations suggest that orthogonal rotations are reasonable.
6. Evaluate the statistical significance of component loadings obtained from the “best” rotation.
7. Check component reliability by redoing steps (1) - (6) with another (independent) data set, and compare the component loadings obtained from the two data sets. Are they close?