Xuhua Xia Slide 1 Principal Components Analysis
Objectives:
–Understand the principles of principal components analysis (PCA)
–Recognize conditions under which PCA may be useful
–Use the SAS procedure PRINCOMP to perform a principal components analysis and interpret PRINCOMP output
Slide 2 Typical Form of Data
A data set in an 8x3 matrix. The rows could be species and the columns sampling sites.
X = [8x3 data matrix; values not reproduced]
A matrix is often referred to as an n x p matrix (n for the number of rows and p for the number of columns). Our matrix has 8 rows and 3 columns, so it is an 8x3 matrix. A variance-covariance matrix has n = p and is called an n-dimensional square matrix.
Slide 3 What are Principal Components?
Principal components are linear combinations of the observed variables:
Y = b_1 X_1 + b_2 X_2 + … + b_n X_n
The coefficients of these principal components are chosen to meet three criteria. What are the three criteria?
Slide 4 What are Principal Components?
The three criteria:
–There are exactly p principal components (PCs), each being a linear combination of the observed variables;
–The PCs are mutually orthogonal (i.e., perpendicular and uncorrelated);
–The components are extracted in order of decreasing variance.
Slide 5 A Simple Data Set
[Data table of X and Y; values not reproduced]
Correlation matrix:
     X  Y
  X  1  1
  Y  1  1
Covariance matrix: [values not reproduced]
Slide 6 General Patterns
–The total variance is 3 (= 1 + 2).
–The two variables, X and Y, are perfectly correlated, with all points falling on the regression line.
–The spatial relationship among the 5 points can therefore be represented by a single dimension.
–PCA is a dimension-reduction technique.
What would happen if we apply PCA to the data?
Slide 7 Graphic PCA
[Figure: plot of the data with axes X and Y; not reproduced]
Slide 8 SAS Program
data pca;
  input x y;
  cards;
  [data lines not reproduced]
;
proc princomp cov out=pcscore;
proc print;
  var prin1 prin2;
proc princomp data=pca out=pcscore;
proc print;
  var prin1 prin2;
run;
The COV option in the first PROC PRINCOMP call requests that the PCA be carried out on the covariance matrix rather than the correlation matrix. Without the COV option, as in the second call, the PCA is carried out on the correlation matrix.
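The difference between the two analyses can be illustrated outside SAS. The following is a minimal Python sketch, not part of the original slides, showing that a correlation matrix is simply the covariance matrix of standardized variables, so the two PCA variants differ only in whether the variables are rescaled first. The data values and all function names (`mean`, `cov`, `standardize`) are hypothetical and chosen for illustration.

```python
import math

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample covariance with the n - 1 denominator."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def standardize(v):
    """Center v and scale it to unit sample variance."""
    m, s = mean(v), math.sqrt(cov(v, v))
    return [(a - m) / s for a in v]

# Hypothetical data, NOT the values from the slides.
x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0]

# Covariance matrix of the raw variables.
cov_matrix = [[cov(x, x), cov(x, y)],
              [cov(x, y), cov(y, y)]]

# Correlation matrix from the usual formula r = cov / (sd_x * sd_y).
r = cov(x, y) / math.sqrt(cov(x, x) * cov(y, y))
corr_matrix = [[1.0, r], [r, 1.0]]

# Covariance matrix of the standardized variables.
zx, zy = standardize(x), standardize(y)
cov_of_standardized = [[cov(zx, zx), cov(zx, zy)],
                       [cov(zx, zy), cov(zy, zy)]]

print(cov_matrix)
print(corr_matrix)
print(cov_of_standardized)
```

With these made-up numbers the raw variances differ (10 versus 2.5), so a covariance-based PCA would be dominated by x, whereas the correlation-based PCA treats both variables on an equal footing.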
Slide 9 A positive definite matrix
When you run the SAS program, the log file will warn that “The Correlation Matrix is not positive definite.” What does that mean?
A symmetric matrix M (such as a correlation matrix or a covariance matrix) is positive definite if z’Mz > 0 for all non-zero vectors z with real entries, where z’ is the transpose of z.
Given our correlation matrix with all entries being 1, it is easy to find a z that leads to z’Mz = 0. For example, z = (1, -1)’ gives z’Mz = 1 - 1 - 1 + 1 = 0. So the matrix is not positive definite.
Exercise: replace the correlation matrix with the covariance matrix and solve for z.
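The definition can be checked numerically. The sketch below is an illustration rather than part of the original slides: it evaluates z’Mz for the all-ones correlation matrix and for a covariance matrix consistent with the stated total variance of 3 (= 1 + 2) and perfect correlation. Assigning variance 1 to X and variance 2 to Y is an assumption, and `quad_form` is a hypothetical helper name.

```python
import math

def quad_form(m, z):
    """Return z' M z for a 2x2 matrix m and a length-2 vector z."""
    mz = [m[0][0] * z[0] + m[0][1] * z[1],
          m[1][0] * z[0] + m[1][1] * z[1]]
    return z[0] * mz[0] + z[1] * mz[1]

# Correlation matrix with all entries equal to 1 (as stated on the slide).
corr = [[1.0, 1.0], [1.0, 1.0]]
z_corr = [1.0, -1.0]
q_corr = quad_form(corr, z_corr)

# Covariance matrix: variances 1 and 2 (ASSUMED to belong to X and Y in
# that order); perfect correlation gives covariance sqrt(1 * 2).
s = math.sqrt(2.0)
cov = [[1.0, s], [s, 2.0]]
z_cov = [s, -1.0]          # one non-zero vector that makes z' M z vanish
q_cov = quad_form(cov, z_cov)

print(q_corr)  # zero
print(q_cov)   # zero up to floating-point rounding
```

In both cases a non-zero z yields a quadratic form of zero, so neither matrix is positive definite; this reflects the fact that the data occupy a single dimension.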
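Slide 10 SAS Output
Eigenvalues of the Covariance Matrix
[Table: Eigenvalue, Difference, Proportion, Cumulative for PRIN1 and PRIN2; values not reproduced. Annotation: variance accounted for by each principal component]
Eigenvectors
[Table: PRIN1 and PRIN2 coefficients for X and Y; values not reproduced]
Principal component scores
[Table: OBS, PRIN1, PRIN2; values not reproduced]
What’s the variance in PC1? How are the values computed?
PC1 = [coefficient]*X1 + [coefficient]*X2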
Slide 11 SAS Output
[Table: OBS, PRIN1, PRIN2; values not reproduced]
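Slide 12 SAS Output
Eigenvalues of the Correlation Matrix
[Table: Eigenvalue, Difference, Proportion, Cumulative for PRIN1 and PRIN2; values not reproduced. Annotation: variance accounted for by each principal component]
Eigenvectors
[Table: PRIN1 and PRIN2 coefficients for X and Y; values not reproduced]
Principal component scores
[Table: OBS, PRIN1, PRIN2; values not reproduced]
What’s the variance in PC1?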
Slide 13 Steps in a PCA
–Have at least two variables
–Generate a correlation or variance-covariance matrix
–Obtain eigenvalues and eigenvectors (this is called an eigenvalue problem, and will be illustrated with a simple numerical example)
–Generate principal component (PC) scores
–Plot the PC scores in the space with reduced dimensions
All of these steps can be automated by using SAS.
Slides 14-16 Covariance or Correlation Matrix?
[Figures: abundance of Sp1 and Sp2; not reproduced]
Slide 17 The Eigenvalue Problem
–The covariance matrix: A [values not reproduced]
–The eigenvalues are the set of values λ that satisfy the condition |A - λI| = 0, where I is the identity matrix.
–The resulting eigenvalues: [values not reproduced] (there are n eigenvalues for n variables).
–The sum of the eigenvalues is equal to the sum of the variances in the covariance matrix.
–Finding the eigenvalues and eigenvectors is called an eigenvalue problem (or a characteristic value problem).
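For a 2x2 symmetric matrix the eigenvalue problem has a closed-form solution, which the following Python sketch works through. It is an illustration only: the matrix is the one implied by the simple data set (variances 1 and 2, perfect correlation), with the assignment of 1 to X and 2 to Y being an assumption, and `eigenvalues_2x2_symmetric` is a hypothetical helper name.

```python
import math

def eigenvalues_2x2_symmetric(a, b, d):
    """Eigenvalues of [[a, b], [b, d]], largest first.

    They are the roots of the characteristic equation
    lambda^2 - (a + d) * lambda + (a * d - b * b) = 0.
    """
    half_trace = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    return half_trace + radius, half_trace - radius

# ASSUMED covariance matrix: variances 1 and 2, perfect correlation.
var_x, var_y = 1.0, 2.0
cov_xy = math.sqrt(var_x * var_y)

lam1, lam2 = eigenvalues_2x2_symmetric(var_x, cov_xy, var_y)
total_variance = var_x + var_y

print(lam1, lam2)                   # approximately 3 and 0
print(lam1 + lam2, total_variance)  # the two sums agree
```

The second eigenvalue is zero (up to rounding) because the points lie on a line, and the eigenvalues sum to the total variance.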
Slide 18 Get the Eigenvectors
An eigenvector is a vector x that satisfies the following condition, where λ is an eigenvalue: A x = λ x
In our case A is a variance-covariance matrix of order 2, and x is a vector specified by x_1 and x_2.
Slide 19 Get the Eigenvectors
We want to find an eigenvector of unit length, i.e., x_1^2 + x_2^2 = 1.
We therefore have [equations from the previous slide combined with the unit-length constraint; not reproduced], which we solve for x_1.
The first eigenvector is the one associated with the largest eigenvalue.
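A minimal Python sketch of this step follows, again as an illustration rather than the slide's own calculation. It assumes the covariance matrix with variances 1 and 2 and covariance sqrt(2), whose largest eigenvalue is 3; `unit_eigenvector` is a hypothetical helper name, and the sign convention (positive first entry) is a choice made here.

```python
import math

def unit_eigenvector(a, b, d, lam):
    """Unit-length eigenvector of [[a, b], [b, d]] for eigenvalue lam (b != 0).

    The first row of (A - lam*I) x = 0 gives x2 = ((lam - a) / b) * x1;
    the constraint x1^2 + x2^2 = 1 then fixes x1 (taken as positive).
    """
    ratio = (lam - a) / b
    x1 = 1.0 / math.sqrt(1.0 + ratio * ratio)
    return x1, ratio * x1

# ASSUMED covariance matrix [[1, sqrt(2)], [sqrt(2), 2]]; its largest
# eigenvalue is 3 (the trace is 3 and the determinant is 0).
a, d = 1.0, 2.0
b = math.sqrt(2.0)
lam1 = 3.0

x1, x2 = unit_eigenvector(a, b, d, lam1)

# Check the defining condition A x = lam * x and the unit length.
ax = (a * x1 + b * x2, b * x1 + d * x2)
print(x1, x2)
print(ax, (lam1 * x1, lam1 * x2))
print(x1 * x1 + x2 * x2)
```

Under these assumptions the first eigenvector is approximately (0.577, 0.816), i.e. (1, sqrt(2)) scaled to unit length.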
Slide 20 Get the PC Scores
[Matrix product: original data (x and y) combined with the eigenvectors to give the first and second PC scores; values not reproduced]
The original data in a two-dimensional space are reduced to one dimension.
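The score computation can be sketched in a few lines of Python. This is an illustration under stated assumptions: the five data points are hypothetical (not the slide's values), the variables are centered before projection, and the two unit eigenvectors are those of this hypothetical data set's sample covariance matrix [[2.5, 5], [5, 10]].

```python
import math

# Hypothetical, perfectly correlated data (y = 2x); NOT the slide's values.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

# Center each variable.
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
xc = [v - mean_x for v in x]
yc = [v - mean_y for v in y]

# Unit eigenvectors of the sample covariance matrix [[2.5, 5], [5, 10]].
root5 = math.sqrt(5.0)
e1 = (1.0 / root5, 2.0 / root5)    # eigenvalue 12.5
e2 = (2.0 / root5, -1.0 / root5)   # eigenvalue 0

# Each PC score is a linear combination of the centered variables.
pc1 = [e1[0] * a + e1[1] * b for a, b in zip(xc, yc)]
pc2 = [e2[0] * a + e2[1] * b for a, b in zip(xc, yc)]

# Sample variance of the PC1 scores (their mean is zero after centering).
var_pc1 = sum(s * s for s in pc1) / (len(pc1) - 1)

print(pc1)
print(pc2)      # all zero up to rounding: the second dimension is empty
print(var_pc1)  # matches the first eigenvalue
```

Because every PC2 score is zero, the five points are fully described by their PC1 scores, which is the one-dimensional representation described above.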
Slide 21 What Are Principal Components?
Principal components are a new set of variables, which are linear combinations of the observed ones, with these properties:
–Because of the decreasing variance property, much of the variance (the information in the original set of p variables) tends to be concentrated in the first few PCs. This implies that we can drop the last few PCs without losing much information. PCA is therefore considered a dimension-reduction technique.
–Because PCs are orthogonal, they can be used instead of the original variables in situations where having orthogonal variables is desirable (e.g., regression).
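How many PCs can be dropped is usually judged from the proportion of variance each one carries, which is what the Proportion and Cumulative columns of the PRINCOMP eigenvalue table report. The Python sketch below shows the arithmetic; the eigenvalues are hypothetical, and `components_needed` and the 80% threshold are illustrative choices, not a rule from the slides.

```python
# Hypothetical eigenvalues in decreasing order (e.g. from a correlation
# matrix of 4 variables, so they sum to 4). NOT values from the slides.
eigenvalues = [2.4, 1.0, 0.4, 0.2]

total = sum(eigenvalues)
proportion = [ev / total for ev in eigenvalues]

# Running sum of the proportions.
cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)

def components_needed(cumulative, threshold):
    """Smallest number of leading PCs whose cumulative proportion
    of variance reaches the threshold."""
    for k, c in enumerate(cumulative, start=1):
        if c >= threshold:
            return k
    return len(cumulative)

print(proportion)
print(cumulative)
print(components_needed(cumulative, 0.80))
```

With these made-up eigenvalues the first two PCs already account for roughly 85% of the total variance, so the last two could be dropped with little loss.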
Slide 22 Index of hidden variables
An illustrative example: the ranking of Asian universities by the Asian Week
–HKU is ranked second in financial resources, but seventh in academic research
–How did HKU get ranked third?
–Is there a more objective way of ranking?
Slide 23 A Simple Data Set
[Data table of schools; values not reproduced]
–School 5 is clearly the best school
–School 1 is clearly the worst school
Slide 24 Graphic PCA
[Figure not reproduced]
Slide 25 Crime Data in 50 States
[Table: columns STATE, MURDER, RAPE, ROBBE, ASSAU, BURGLA, LARCEN, AUTO; rows shown: ALABAMA, ALASKA, ARIZONA, ARKANSAS, CALIFORNIA, COLORADO, CONNECTICUT, DELAWARE, FLORIDA, GEORGIA, HAWAII, IDAHO, ILLINOIS; values not reproduced]
PROC PRINCOMP OUT=CRIMCOMP;
DATA CRIME;
  TITLE 'CRIME RATES PER 100,000 POP BY STATE';
  /* State name in columns 1-15, followed by the seven crime rates. */
  /* The numeric values on each data line are not reproduced here.  */
  INPUT STATENAME $1-15 MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO;
  CARDS;
Alabama
Alaska
Arizona
Arkansas
California
Colorado
Connecticut
Delaware
Florida
Georgia
Hawaii
Idaho
Illinois
Indiana
Iowa
Kansas
Kentucky
Louisiana
Maine
Maryland
Massachusetts
Michigan
Minnesota
Mississippi
Missouri
Montana
Nebraska
Nevada
New Hampshire
New Jersey
New Mexico
New York
North Carolina
North Dakota
Ohio
Oklahoma
Oregon
Pennsylvania
Rhode Island
South Carolina
South Dakota
Tennessee
Texas
Utah
Vermont
Virginia
Washington
West Virginia
Wisconsin
Wyoming
;
/* PCA on the correlation matrix (the default) */
PROC PRINCOMP out=crimcomp;
run;
PROC PRINT;
  ID STATENAME;
  VAR PRIN1 PRIN2 MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO;
run;
PROC GPLOT;
  PLOT PRIN2*PRIN1=STATENAME;
  TITLE2 'PLOT OF THE FIRST TWO PRINCIPAL COMPONENTS';
run;
/* PCA on the covariance matrix */
PROC PRINCOMP data=CRIME COV OUT=crimcomp;
run;
PROC PRINT;
  ID STATENAME;
  VAR PRIN1 PRIN2 MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO;
run;
/* Add to have a map view */
proc sort data=crimcomp out=crimcomp;
  by STATENAME;
run;
proc sort data=maps.us2 out=mymap;
  by STATENAME;
run;
data both;
  merge mymap crimcomp;
  by STATENAME;
run;
proc gmap data=both;
  id _map_geometry_;
  choro PRIN1 PRIN2/levels=15;
  /* choro PRIN1/discrete; */
run;
Slide 28 Correlation Matrix
[Table: correlations among MURDER, RAPE, ROBBERY, ASSAULT, BURGLARY, LARCENY and AUTO; values not reproduced]
–If the variables were not correlated, there would be no point in doing PCA.
–The correlation matrix is symmetric, so we only need to inspect either the upper or the lower triangular matrix.
Slide 29 Eigenvalues
[Table: Eigenvalue, Difference, Proportion, Cumulative for PRIN1 to PRIN7; values not reproduced]
Slide 30 Eigenvectors
[Table: coefficients of PRIN1 to PRIN7 for MURDER, RAPE, ROBBERY, ASSAULT, BURGLARY, LARCENY and AUTO; values not reproduced]
Do these eigenvectors mean anything?
–All crimes are positively correlated with the first eigenvector, which is therefore interpreted as a measure of overall crime rate.
–The 2nd eigenvector has positive loadings on AUTO, LARCENY and ROBBERY and negative loadings on MURDER, ASSAULT and RAPE. It is interpreted to measure the preponderance of property crime over violent crime.
Slide 31 PC Plot: Crime Data
[Figure: plot of PC scores; labeled groups of states: North and South Dakota; Nevada, New York, California; Mississippi, Alabama, Louisiana, South Carolina; Maryland]
[Figure: Plot of PC1]
[Figure: Plot of PC2]