Lecture 7: Principal component analysis (PCA)

Presentation transcript:

Lecture 7: Principal component analysis (PCA)
Rationale and use of PCA
The underlying model (what is a principal component anyway?)
Eigenvectors and eigenvalues of the sample covariance matrix revisited!
PCA scores and loadings
The use of and rationale for rotations
Orthogonal and oblique rotations
Component retention, significance, and reliability
Bio 8100s Applied Multivariate Biostatistics 2001

What is PCA?
From a set of p variables X1, X2, …, Xp, we try to find (“extract”) a set of indices Z1, Z2, …, Zp that are uncorrelated and ordered in terms of their variability: Var(Z1) ≥ Var(Z2) ≥ … ≥ Var(Zp). Because the Zi (the principal components) are uncorrelated, they measure different “dimensions” in the data. The hope (sometimes faint) is that most of the variability in the original set of p variables will be accounted for by c < p components.

Why use PCA?
PCA is generally used to reduce the number of variables considered in subsequent analyses, i.e. to reduce the “dimensionality” of the data. Examples include:
Reducing the number of dependent variables in MANOVA, multivariate regression, correlation analysis, etc.
Reducing the number of independent variables (predictors) in regression analysis.

Estimating principal components
The first principal component is obtained by “fitting” (i.e. estimating the coefficients of) the linear function $Z_1 = a_{11}X_1 + a_{12}X_2 + \dots + a_{1p}X_p$ which maximizes Var(Z1), subject to:
$a_{11}^2 + a_{12}^2 + \dots + a_{1p}^2 = 1$
The second principal component is obtained by “fitting” the linear function $Z_2 = a_{21}X_1 + a_{22}X_2 + \dots + a_{2p}X_p$ which maximizes Var(Z2), subject to:
$a_{21}^2 + a_{22}^2 + \dots + a_{2p}^2 = 1$ and $\mathrm{Cov}(Z_1, Z_2) = 0$

Estimating principal components (cont’d)
The third principal component is obtained by “fitting” (i.e. estimating the coefficients of) the linear function $Z_3 = a_{31}X_1 + a_{32}X_2 + \dots + a_{3p}X_p$ which maximizes Var(Z3), subject to:
$a_{31}^2 + a_{32}^2 + \dots + a_{3p}^2 = 1$
as well as the additional constraints $\mathrm{Cov}(Z_3, Z_1) = 0$ and $\mathrm{Cov}(Z_3, Z_2) = 0$.
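One common way to obtain components that satisfy these constraints is an eigendecomposition of the sample covariance matrix. The sketch below assumes NumPy and simulated data; the function name `principal_components` and the mixing matrix are illustrative, not part of the lecture.

```python
import numpy as np

def principal_components(X):
    """Eigenvalues (largest first) and coefficient vectors (one row per component)."""
    C = np.cov(X, rowvar=False)            # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]      # reorder so Var(Z1) >= Var(Z2) >= ...
    return eigvals[order], eigvecs[:, order].T

# Hypothetical data: 200 observations of 3 correlated variables.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X = rng.normal(size=(200, 3)) @ mixing

variances, A = principal_components(X)
Z = (X - X.mean(axis=0)) @ A.T             # component values for each observation

print(np.sum(A**2, axis=1))                    # each coefficient vector has unit length
print(np.round(np.cov(Z, rowvar=False), 6))    # off-diagonals are ~0: components are uncorrelated
```

The diagonal of the printed covariance matrix holds the component variances in decreasing order, which is the ordering the constraints are meant to produce.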

Estimating principal components
Estimation of the coefficients for each principal component can be accomplished through several different methods (e.g. least-squares estimation, maximum likelihood estimation, iterated principal axis, etc.). The extracted principal components may differ depending on the method of estimation.

The geometry of principal components
Principal components (Zi) are linear functions of the original variables and, as such, define hyperplanes in the (p + 1)-dimensional space of Z and the original variables. Because the Zi are uncorrelated, these planes meet at right angles.
[Figure: axes X1, X2 and Z, with the plane for Z1.]

Multivariate variance: a geometric interpretation
Univariate variance is a measure of the “volume” occupied by sample points in one dimension. Multivariate variance involving p variables is the volume occupied by sample points in a p-dimensional space.
[Figure: samples with larger and smaller variance along X; occupied volume in the X1, X2 plane.]

Multivariate variance: effects of correlations among variables
Correlations between pairs of variables reduce the volume occupied by sample points and hence reduce the multivariate variance.
[Figure: occupied volume in the X1, X2 plane with no correlation, positive correlation, and negative correlation.]

C and the generalized multivariate variance
The determinant of the sample covariance matrix C is a generalized multivariate variance, because the squared area of a parallelogram with sides given by the individual standard deviations and angle determined by the correlation between the variables equals the determinant of C.
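For two variables the parallelogram argument can be written out directly. In the sketch below the symbols $s_1$, $s_2$ (standard deviations), $r$ (correlation) and $\theta$ (angle between the sides, taken so that $\cos\theta = r$) are notation assumed here rather than taken from the slide.

```latex
\[
C = \begin{pmatrix} s_1^2 & r s_1 s_2 \\ r s_1 s_2 & s_2^2 \end{pmatrix},
\qquad
\det C = s_1^2 s_2^2 - r^2 s_1^2 s_2^2
       = s_1^2 s_2^2 \,(1 - r^2)
       = (s_1 s_2 \sin\theta)^2 .
\]
```

With $r = 0$ the parallelogram is a rectangle and the determinant is largest; as $|r|$ approaches 1 the parallelogram collapses and the determinant goes to zero.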

Eigenvalues and eigenvectors of C
Eigenvectors of the covariance matrix C are orthogonal directed line segments that “span” the variation in the data, and the corresponding (unsigned) eigenvalues are the squared lengths of these segments, so the product of the eigenvalues is the squared “volume” occupied by the data, i.e. the determinant of the covariance matrix.
[Figure: eigenvectors in the X1, X2 plane with no correlation, positive correlation, and negative correlation.]

The geometry of principal components (cont’d)
The coefficients (aij) of the principal components (Zi) define vectors in the space of coefficients. These vectors are the eigenvectors (ai) of the sample covariance matrix C, and the corresponding (unsigned) eigenvalues (λi) are the variances of each component, i.e. Var(Zi). The product of the eigenvalues is the squared “volume” occupied by the data, i.e. the determinant of the covariance matrix.
[Figure: eigenvectors a1 and a2, with eigenvalues λ1 and λ2, in the X1, X2 plane.]

Another important relationship!
The sum of the eigenvalues of the covariance matrix C equals the sum of the diagonal elements of C, i.e. the trace of C. So, the sum of the variances of the principal components equals the sum of the variances of the original variables.
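Both eigenvalue identities (sum equals trace, product equals determinant) are easy to check numerically. The following is a minimal sketch using NumPy and a made-up 2 × 2 covariance matrix.

```python
import numpy as np

# Hypothetical covariance matrix of two variables.
C = np.array([[4.0, 1.2],
              [1.2, 1.0]])

eigvals = np.linalg.eigvalsh(C)          # eigenvalues of a symmetric matrix

print(eigvals.sum(), np.trace(C))        # sum of eigenvalues vs. trace of C
print(eigvals.prod(), np.linalg.det(C))  # product of eigenvalues vs. determinant of C
```

Each printed pair agrees up to floating-point rounding.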

Scale and the correlation matrix
Since variables may be measured on different scales, and we want to eliminate scale effects, we usually work with standardized values so that each variable is scaled to have zero mean and unit variance. The sample covariance matrix of standardized variables is the sample correlation matrix R.
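The link between standardization and the correlation matrix can be checked in a few lines. This sketch assumes NumPy and simulated variables on deliberately different scales.

```python
import numpy as np

# Hypothetical data: three variables measured on very different scales.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])
X[:, 1] += 5.0 * X[:, 0]                 # introduce some correlation

# Standardize to zero mean and unit (sample) variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

C_std = np.cov(X_std, rowvar=False)      # covariance matrix of standardized variables
R = np.corrcoef(X, rowvar=False)         # correlation matrix of the original variables

print(np.allclose(C_std, R))
```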

Principal component scores
Because principal components are functions, we can “plug in” the values for each variable for each observation, and calculate a PC score for each observation and each principal component.
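As a small worked illustration of “plugging in”, the sketch below computes scores as dot products of observations with coefficient vectors. The coefficients and observations are invented for the example, and `pc_scores` is a hypothetical helper name.

```python
import numpy as np

def pc_scores(X, A):
    """Scores for observations (rows of X) given coefficient vectors (rows of A)."""
    return np.asarray(X, dtype=float) @ np.asarray(A, dtype=float).T

# Hypothetical coefficients for two components of two standardized variables.
A = np.array([[0.6,  0.8],
              [0.8, -0.6]])

# Two observations, assumed already standardized.
X = np.array([[1.0, -0.5],
              [0.0,  2.0]])

scores = pc_scores(X, A)
print(scores)   # row i holds the scores of observation i on Z1 and Z2
```

For the first observation, the score on Z1 is 0.6(1.0) + 0.8(-0.5) = 0.2 and the score on Z2 is 0.8(1.0) - 0.6(-0.5) = 1.1.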

Principal component loadings
Component loadings (Lij) are the covariances (correlations for standardized values) of the original variables used in the PCA with the components, and are proportional to the component coefficients (aij). For each component, the squared loadings summed over all variables equal the variance of the component.
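For standardized variables, the proportionality constant is the square root of the component's eigenvalue. The sketch below checks this on simulated data with NumPy; the data and variable names are assumptions, and the array `loadings` is laid out with variables in rows and components in columns.

```python
import numpy as np

# Hypothetical data: 300 observations of 3 correlated variables.
rng = np.random.default_rng(1)
mixing = np.array([[1.0, 0.8, 0.2],
                   [0.0, 0.6, 0.1],
                   [0.0, 0.0, 1.0]])
X = rng.normal(size=(300, 3)) @ mixing

X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Loadings: coefficients scaled by the square root of each component's variance.
loadings = eigvecs * np.sqrt(eigvals)

# Direct check: correlations between each variable and each component score.
scores = X_std @ eigvecs
p = X.shape[1]
observed = np.corrcoef(np.hstack([X_std, scores]), rowvar=False)[:p, p:]

print(np.allclose(loadings, observed))
print((loadings**2).sum(axis=0), eigvals)   # squared loadings per component vs. eigenvalues
```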

More on loadings
Sometimes components have variables with similar loadings, which form a “natural” group. To assist in interpretation, we may want to choose another component frame which emphasizes the differences among these groups.
[Figure: factor plot, with FACTOR(2) as one axis.]

Orthogonal rotations: varimax
Orthogonal (angle preserving): new (rotated) components are still uncorrelated.
Varimax: rotation done so that each component loads high on a small number of variables and low on other variables (simplifies factors).
[Figure: factor plots, unrotated and after varimax rotation.]

Orthogonal rotations: quartimax
Orthogonal (angle preserving): new (rotated) components are still uncorrelated.
Quartimax: rotation done so that each variable loads mainly on one factor (simplifies variables).
[Figure: factor plots, unrotated and after quartimax rotation.]

Orthogonal rotations: equamax
Orthogonal (angle preserving): new (rotated) components are still uncorrelated.
Equamax: combines varimax and quartimax. The number of variables that load highly on a factor and the number of factors needed to explain each variable are both optimized.
[Figure: factor plots, unrotated and after equamax rotation.]
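The three orthogonal rotations above are usually treated as members of one “orthomax” family indexed by a weight gamma (gamma = 1 for varimax, gamma = 0 for quartimax, gamma = number of components / 2 for equamax). The following is a minimal sketch of a widely used SVD-based iteration for that family, assuming NumPy; the function name `orthomax` and the example loadings are hypothetical, and statistical packages may add steps such as Kaiser normalization that are omitted here.

```python
import numpy as np

def orthomax(loadings, gamma=1.0, max_iter=100, tol=1e-8):
    """Orthogonally rotate a loading matrix (rows: variables, columns: components).

    Returns the rotated loadings and the orthogonal rotation matrix T.
    """
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    T = np.eye(k)
    objective = 0.0
    for _ in range(max_iter):
        rotated = L @ T
        col_ss = np.sum(rotated**2, axis=0)            # sum of squared loadings per component
        gradient = L.T @ (rotated**3 - (gamma / p) * rotated * col_ss)
        u, s, vt = np.linalg.svd(gradient)
        T = u @ vt                                     # nearest orthogonal matrix to the gradient
        new_objective = s.sum()
        if objective > 0 and new_objective < objective * (1 + tol):
            break                                      # no meaningful improvement
        objective = new_objective
    return L @ T, T

# Hypothetical unrotated loadings for four variables on two components.
unrotated = np.array([[0.8,  0.4],
                      [0.7,  0.5],
                      [0.6, -0.5],
                      [0.7, -0.4]])

varimax_loadings, T_varimax = orthomax(unrotated, gamma=1.0)
quartimax_loadings, T_quartimax = orthomax(unrotated, gamma=0.0)

print(np.round(varimax_loadings, 3))
print(np.round(T_varimax.T @ T_varimax, 6))            # identity: the rotation is orthogonal
```

Because T is orthogonal, each variable's sum of squared loadings across components is unchanged by the rotation; only the way that sum is distributed over components changes.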

Oblique rotations, e.g. oblimin
Oblique (non-angle preserving): new (rotated) components are now correlated.
Most reasonable when significant intercorrelations among factors exist.
[Figure: factor plots, unrotated and after oblimin rotation.]

The consequences of rotation
Unrotated components are (1) uncorrelated; (2) ordered in terms of decreasing variance (i.e., Var(Z1) ≥ Var(Z2) ≥ …).
Orthogonally rotated components are (1) still uncorrelated, but (2) need not be ordered in terms of decreasing variance (e.g. for varimax rotation).
Obliquely rotated components are (1) correlated; (2) unordered (in general).

The rotated pattern matrix for obliquely rotated factors
The elements of the matrix are analogous to standardized partial regression coefficients from a multiple regression analysis. So each element quantifies the importance of the variable in question to the component, once the effects of other variables are controlled.
Rotated Pattern Matrix (OBLIMIN, Gamma = 0.0000)
Variable      1        2
HEIGHT        0.909    0.060
ARM_SPAN      0.957   -0.017
FOREARM       0.953   -0.048
LOWERLEG      0.916    0.028
WEIGHT        0.054    0.897
BITRO        -0.011    0.864
CHESTGIR     -0.090    0.882
CHESTWID      0.088    0.749

The rotated structure matrix for obliquely rotated factors
The elements of the rotated structure matrix are the simple correlations of the variable in question with the factor, i.e. the component loadings. For orthogonal factors, the factor pattern and factor structure matrices are identical.
Variable      1        2
HEIGHT        0.933    0.363
ARM_SPAN      0.935    0.452
FOREARM       0.950    0.396
LOWERLEG      0.928    0.423
WEIGHT        0.441    0.921
BITRO         0.410    0.787
CHESTGIR      0.362    0.860
CHESTWID      0.290    0.843

Which rotation is the best?
Objective: find the rotation which achieves the simplest structure among component loadings, thereby making interpretation comparatively easy.
Thurstone’s criteria, for p variables and m < p components: (1) each component should have at least m near-zero loadings; (2) few components should have non-zero loadings on the same variable.

A final word on rotations
“You cannot say that any rotation is better than any other rotation from a statistical point of view: all rotations are equally good statistically. Therefore, the choice among different rotations must be based on non-statistical grounds…” SAS STAT User’s guide, Vol. 1, p. 776.

How many components to retain in subsequent analysis?
Kaiser rule: retain only components with eigenvalues > 1 (for a PCA of the correlation matrix).
Scree test: plot eigenvalues against their ordinal numbers, and retain all components in the “steep descent” part of the curve.
Retain as many factors as required to account for a specified amount of the total variance (e.g. 85%).
[Figure: scree plot of eigenvalues, with the Kaiser threshold marked.]
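Two of these retention rules are simple to apply once the eigenvalues are in hand. The sketch below assumes NumPy and a hypothetical set of six eigenvalues from a correlation matrix (so they sum to 6); the scree test is a visual judgment and is not automated here.

```python
import numpy as np

# Hypothetical eigenvalues from a PCA of a 6-variable correlation matrix.
eigvals = np.array([3.0, 1.5, 0.7, 0.4, 0.25, 0.15])

# Kaiser rule: keep components whose eigenvalue exceeds 1.
n_kaiser = int(np.sum(eigvals > 1.0))

# Variance rule: keep the fewest components reaching the target share of total variance.
target = 0.85
cumulative = np.cumsum(eigvals) / eigvals.sum()
n_variance = int(np.argmax(cumulative >= target)) + 1

print(n_kaiser, n_variance)
print(np.round(cumulative, 3))
```

With these values the Kaiser rule keeps two components, while the 85% rule needs a third, which illustrates that the rules need not agree.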

More on interpretation: the significance of loadings
Since loadings are correlation coefficients (r), we can test the null hypothesis that each correlation equals zero. But analytic estimates of standard errors are often too small, especially for rotated loadings. So, as a rule of thumb, use double the critical value to test significance. E.g., for N = 100, the two-tailed critical value is r(α = 0.01) = 0.256, so “significant” loadings are those greater than 2(0.256) = 0.512 in absolute value.
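The critical value of r for a given sample size follows from the usual t test of a zero correlation, which has N − 2 degrees of freedom. As a sketch, with $t_{\alpha/2,\,N-2}$ denoting the two-tailed critical value of Student's t (notation assumed here, not taken from the slide):

```latex
\[
t = \frac{r\sqrt{N-2}}{\sqrt{1-r^{2}}}
\quad\Longrightarrow\quad
r_{\text{crit}} = \frac{t_{\alpha/2,\,N-2}}{\sqrt{t_{\alpha/2,\,N-2}^{2} + N - 2}} .
\]
```

Under the rule of thumb above, a loading would then be called significant only if its absolute value exceeds $2\,r_{\text{crit}}$.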

Component reliability: rules of thumb
The absolute magnitude and number of loadings are crucial for determining reliability.
Components with at least 4 loadings greater than 0.60 in absolute value, or with at least 3 loadings greater than 0.80 in absolute value, are reliable.
For N > 150, components with at least 10 loadings greater than 0.40 in absolute value are reliable.
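These rules of thumb translate directly into a small check. The sketch below assumes NumPy; `component_is_reliable` is a hypothetical helper, and the example loadings are the first column of the rotated structure matrix shown earlier in the lecture.

```python
import numpy as np

def component_is_reliable(loadings, n_obs):
    """Apply the three rules of thumb to one component's loadings."""
    a = np.abs(np.asarray(loadings, dtype=float))
    if np.sum(a > 0.60) >= 4:                   # at least 4 loadings above 0.60
        return True
    if np.sum(a > 0.80) >= 3:                   # at least 3 loadings above 0.80
        return True
    if n_obs > 150 and np.sum(a > 0.40) >= 10:  # at least 10 loadings above 0.40, N > 150
        return True
    return False

# First column of the rotated structure matrix (HEIGHT ... CHESTWID).
factor_1 = [0.933, 0.935, 0.950, 0.928, 0.441, 0.410, 0.362, 0.290]

print(component_is_reliable(factor_1, n_obs=100))          # four loadings exceed 0.60
print(component_is_reliable([0.5, 0.45, 0.3], n_obs=100))  # no rule is met
```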

PCA: the procedure
1. Calculate the sample covariance matrix or correlation matrix. If all variables are on the same scale, use the sample covariance matrix; otherwise use the correlation matrix.
2. Run PCA to extract unrotated components (“initial extraction”).
3. Decide which components to use in subsequent analysis based on the Kaiser rule, scree plots, etc.
4. Based on (3), rerun the analysis using different orthogonal and oblique rotations and compare using factor plots (“follow-up extraction”).

PCA: the procedure (cont’d)
5. For obliquely rotated components, calculate correlations among components. Small correlations suggest that orthogonal rotations are reasonable.
6. Evaluate statistical significance of component loadings obtained from the “best” rotation.
7. Check component reliability by redoing steps (1) - (6) with another (independent) data set, and compare the component loadings obtained from the two data sets. Are they close?
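For step 7, one simple way to ask “are they close?” is to line up the two loading matrices and look at the largest absolute difference. The sketch below is only one possible comparison and is not prescribed by the lecture. It assumes NumPy, assumes the components appear in the same order in both data sets, and flips signs first because the sign of a component is arbitrary. The helper names and loadings are hypothetical.

```python
import numpy as np

def align_signs(reference, other):
    """Flip columns of `other` so each points the same way as the matching reference column."""
    signs = np.sign(np.sum(reference * other, axis=0))
    signs[signs == 0] = 1.0
    return other * signs

def max_loading_difference(reference, other):
    """Largest absolute difference between two loading matrices after sign alignment."""
    return float(np.max(np.abs(reference - align_signs(reference, other))))

# Hypothetical loadings from two independent data sets (rows: variables, columns: components).
first = np.array([[0.90, 0.10],
                  [0.85, 0.20],
                  [0.15, 0.80]])
second = np.array([[-0.88, 0.12],
                   [-0.86, 0.18],
                   [-0.10, 0.83]])

difference = max_loading_difference(first, second)
print(round(difference, 3))
```

Here the first component of the second data set has the opposite sign, which the alignment step removes before the comparison.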