PCA Example
Air pollution in 41 cities in the USA. R data “USairpollution”.
Variables:
SO2: SO2 content of air in micrograms per cubic meter
temp: average annual temperature in degrees Fahrenheit
manu: number of manufacturing enterprises employing 20 or more workers
popul: population size (1970 census) in thousands
wind: average annual wind speed in miles per hour
precip: average annual precipitation in inches
predays: average number of days with precipitation per year
PCA Example
Read the US air pollution data:
> USairpollution_data=read.csv("E:/Multivariate_analysis/Data/USairpollution.csv",header=TRUE)
Remove the first two columns (City, SO2). This leaves the two variables related to human ecology (popul, manu) and the four related to climate (temp, wind, precip, predays).
> USairpollution=USairpollution_data[,-c(1,2)]
Replace temperature by its negative, so that high values of all six variables indicate an unfriendly environment.
> USairpollution$negtemp=USairpollution$temp*(-1)
> USairpollution$temp=NULL
We extract the principal components from the correlation matrix rather than the covariance matrix, since the variables are on very different scales.
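The effect of scale can be seen in a minimal sketch. The data frame `toy` below is simulated and purely hypothetical (it is not the air pollution data); it only illustrates why `cor=TRUE` matters when one variable has a much larger variance than another.

```r
# Hypothetical toy data: two independent variables on very different scales
set.seed(1)
toy <- data.frame(a = rnorm(50, mean = 0, sd = 1),
                  b = rnorm(50, mean = 500, sd = 300))

pca_cov <- princomp(toy, cor = FALSE)  # based on the covariance matrix
pca_cor <- princomp(toy, cor = TRUE)   # based on the correlation matrix

# Share of total variance carried by each component
share_cov <- pca_cov$sdev^2 / sum(pca_cov$sdev^2)
share_cor <- pca_cor$sdev^2 / sum(pca_cor$sdev^2)

print(round(share_cov, 4))  # first component is essentially variable b alone
print(round(share_cor, 4))  # both variables contribute after standardization
```

With the covariance matrix the large-variance variable dominates the first component; with the correlation matrix every variable enters with unit variance.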
PCA Example
Correlation matrix for USairpollution:
> round(cor(USairpollution),2)
         manu popul  wind precip predays negtemp
manu     1.00  0.96  0.24  -0.03    0.13    0.19
popul    0.96  1.00  0.21  -0.03    0.04    0.06
wind     0.24  0.21  1.00  -0.01    0.16    0.35
precip  -0.03 -0.03 -0.01   1.00    0.50   -0.39
predays  0.13  0.04  0.16   0.50    1.00    0.43
negtemp  0.19  0.06  0.35  -0.39    0.43    1.00
The correlation between popul and manu is very high (0.96).
PCA Example
We construct a scatterplot matrix of the six variables, with a histogram of each variable on the main diagonal. For this we need the package “HSAUR2”.
> library(HSAUR2)
Build the function “panel.hist” to insert the histograms on the main diagonal.
> panel.hist<-function(x, ...){
    usr<-par("usr");on.exit(par(usr))
    par(usr=c(usr[1:2],0,1.5)) # set the plotting coordinates
    h=hist(x,plot=FALSE)
    breaks<-h$breaks;nB<-length(breaks)
    y<-h$counts;y<-y/max(y)
    rect(breaks[-nB],0,breaks[-1],y,col="grey",...)
  }
Plot the scatterplot matrix:
> pairs(USairpollution,diag.panel=panel.hist,pch=".",cex=1.5)
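PCA Example
[Figure: scatterplot matrix of the six variables with histograms on the main diagonal; outliers are marked.]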
PCA Example
Extract the principal components:
> USairpollution_PCA=princomp(USairpollution,cor=TRUE)
> summary(USairpollution_PCA,loadings=TRUE)
Importance of components:
                          Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
Standard deviation     1.4819456 1.2247218 1.1809526 0.8719099 0.33848287
Proportion of Variance 0.3660271 0.2499906 0.2324415 0.1267045 0.01909511
Cumulative Proportion  0.3660271 0.6160177 0.8484592 0.9751637 0.99425879
                            Comp.6
Standard deviation     0.185599752
Proportion of Variance 0.005741211
Cumulative Proportion  1.000000000

Loadings:
        Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
manu    -0.612  0.168 -0.273 -0.137  0.102  0.703
popul   -0.578  0.222 -0.350               -0.695
wind    -0.354 -0.131  0.297  0.869 -0.113
precip         -0.623 -0.505  0.171  0.568
predays -0.238 -0.708        -0.311 -0.580
negtemp -0.330 -0.128  0.672 -0.306  0.558 -0.136
(Blank entries are loadings smaller than 0.1 in absolute value, which R does not print.)
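How the scores follow from the loadings can be sketched on simulated data. The data frame `toy` is hypothetical and stands in for the real data set; the point is only that `princomp(..., cor=TRUE)` standardizes each variable with the divisor-n standard deviation before multiplying by the loadings.

```r
# Hypothetical toy data standing in for the six-variable data set
set.seed(42)
toy <- data.frame(x1 = rnorm(30),
                  x2 = rnorm(30, mean = 10, sd = 5),
                  x3 = runif(30, min = 0, max = 100))

pca <- princomp(toy, cor = TRUE)

# Standard deviations with divisor n (princomp's convention), not n - 1
sd_n <- apply(toy, 2, function(v) sqrt(mean((v - mean(v))^2)))

# Standardize, then project onto the loadings
z <- scale(toy, center = TRUE, scale = sd_n)
manual_scores <- z %*% unclass(pca$loadings)

# Largest discrepancy from the scores princomp reports
max_diff <- max(abs(manual_scores - pca$scores))
print(max_diff < 1e-8)
```

Using `sd()` (divisor n - 1) in place of `sd_n` would give scores that differ from `pca$scores` by a constant factor.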
PCA Example
Eigenvectors and eigenvalues
> eigen(cor(USairpollution))
$values
[1] 2.19616264 1.49994343 1.39464912 0.76022689 0.11457065 0.03444727

$vectors
            [,1]       [,2]        [,3]        [,4]        [,5]        [,6]
[1,] -0.61154243  0.1680577  0.27288633 -0.13684076 -0.10204211  0.70297051
[2,] -0.57782195  0.2224533  0.35037413 -0.07248126  0.07806551 -0.69464131
[3,] -0.35383877 -0.1307915 -0.29725334  0.86942583  0.11326688  0.02452501
[4,]  0.04080701 -0.6228578  0.50456294  0.17114826 -0.56818342 -0.06062222
[5,] -0.23791593 -0.7077653 -0.09308852 -0.31130693  0.58000387  0.02196062
[6,] -0.32964613 -0.1275974 -0.67168611 -0.30645728 -0.55805638 -0.13618780
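The two outputs describe the same decomposition. As a quick check, using only numbers copied from the outputs above, the squared standard deviations reported by `princomp` reproduce the eigenvalues of the correlation matrix:

```r
# Values copied from the princomp() summary and the eigen() output
sdev <- c(1.4819456, 1.2247218, 1.1809526, 0.8719099, 0.33848287, 0.185599752)
eigenvalues <- c(2.19616264, 1.49994343, 1.39464912,
                 0.76022689, 0.11457065, 0.03444727)

# Component variances (sdev^2) should equal the eigenvalues up to rounding
variances_match <- isTRUE(all.equal(sdev^2, eigenvalues, tolerance = 1e-6))
print(variances_match)

# Eigenvalues of a 6 x 6 correlation matrix sum to 6
total <- sum(eigenvalues)
print(total)
```

Columns 3 and 5 of `$vectors` have the opposite sign to the corresponding `princomp` loadings. Both are valid, since an eigenvector is only determined up to its sign.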
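PCA Example
The first three components account for almost 85% of the variance of the original variables, and each has an eigenvalue greater than one.
Equations of the first three principal components (coefficients are the loadings; each variable name stands for the standardized variable; loadings that princomp leaves blank are taken from the eigenvectors):
Y1 = -0.612 manu - 0.578 popul - 0.354 wind + 0.041 precip - 0.238 predays - 0.330 negtemp
Y2 =  0.168 manu + 0.222 popul - 0.131 wind - 0.623 precip - 0.708 predays - 0.128 negtemp
Y3 = -0.273 manu - 0.350 popul + 0.297 wind - 0.505 precip + 0.093 predays + 0.672 negtemp
The first component is an indicator of quality of life. Its sizeable loadings are all negative in this output, so strongly negative scores indicate a poor environment (the overall sign of a component is arbitrary). The second component is concerned with a city’s rainfall, with large loadings (in absolute value) on precip and predays. The third component is a contrast between precip and negtemp.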
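Both criteria for keeping three components can be recomputed from the eigenvalues alone. The sketch below uses only the values printed by `eigen()`; the variable names are illustrative.

```r
# Eigenvalues copied from the eigen() output
eigenvalues <- c(2.19616264, 1.49994343, 1.39464912,
                 0.76022689, 0.11457065, 0.03444727)

# Proportion of variance and cumulative proportion per component
prop_var <- eigenvalues / sum(eigenvalues)
cum_prop <- cumsum(prop_var)
print(round(cum_prop, 4))

# Eigenvalue-greater-than-one rule: count components with variance above 1
n_keep <- sum(eigenvalues > 1)
print(n_keep)
```

The third entry of `cum_prop` is the "almost 85%" quoted above, and `n_keep` is the number of components with eigenvalue above one.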
PCA Example
Scatterplot matrix of the first three principal components:
> library(MVA)
> pairs(USairpollution_PCA$scores[,1:3],xlim=c(-6,4),ylim=c(-6,4),
    panel=function(x,y,...){
      # label each point with the abbreviated city name
      text(x,y,abbreviate(USairpollution_data[,1]),cex=0.75)
      # overlay the bivariate boxplot from the MVA package
      bvbox(cbind(x,y),add=TRUE)
    })
PCA Example
The scatter plots indicate that Chicago, and possibly Phoenix and Philadelphia, are outliers.
PCA Example
We want to determine which of the climate and human ecology variables are the best predictors of the degree of air pollution in a city, as measured by the SO2 content of the air. We use multiple regression to address this problem.
Scatter plots of SO2 concentration vs. each of the 6 principal components:
> par(mfrow=c(2,3))
> out=sapply(1:6,function(i){
    # component score on the x-axis, SO2 on the y-axis, matching the axis labels
    plot(USairpollution_PCA$scores[,i],USairpollution_data$SO2,
      xlab=paste("PC",i,sep=""),
      ylab="SO2 concentration")})
The first principal component is the most predictive of sulfur dioxide concentration (p-value = 2.28e-07 in the regression on all six components).
PCA Example
> usair_reg=lm(SO2~USairpollution_PCA$scores,data=USairpollution_data)
> summary(usair_reg)

Call:
lm(formula = SO2 ~ USairpollution_PCA$scores, data = USairpollution_data)

Residuals:
    Min      1Q  Median      3Q     Max
-23.004  -8.542  -0.991   5.758  48.758

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)
(Intercept)                       30.049      2.286  13.146 6.91e-15 ***
USairpollution_PCA$scoresComp.1   -9.942      1.542  -6.446 2.28e-07 ***
USairpollution_PCA$scoresComp.2   -2.240      1.866  -1.200  0.23845
USairpollution_PCA$scoresComp.3   -0.375      1.935  -0.194  0.84752
USairpollution_PCA$scoresComp.4   -8.549      2.622  -3.261  0.00253 **
USairpollution_PCA$scoresComp.5   15.176      6.753   2.247  0.03122 *
USairpollution_PCA$scoresComp.6   39.271     12.316   3.189  0.00306 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.64 on 34 degrees of freedom
Multiple R-squared: 0.6695, Adjusted R-squared: 0.6112
F-statistic: 11.48 on 6 and 34 DF, p-value: 5.419e-07
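The pattern in the standard errors can be reproduced from numbers already shown. This is a sketch under one assumption: the component scores are centered and mutually uncorrelated, and the sum of squares of the k-th score equals n times its variance (`princomp` uses divisor n). Under that assumption, the standard error of the k-th coefficient is the residual standard error divided by sqrt(n) times the component's standard deviation.

```r
# Values copied from the outputs above
n <- 41          # number of cities
resid_se <- 14.64  # residual standard error of the regression
sdev <- c(1.4819456, 1.2247218, 1.1809526, 0.8719099, 0.33848287, 0.185599752)

# Standard error of each slope under orthogonal, centered regressors
se_slopes <- resid_se / (sqrt(n) * sdev)
print(round(se_slopes, 3))

# Standard error of the intercept
se_intercept <- resid_se / sqrt(n)
print(round(se_intercept, 3))

# Reported values from summary(usair_reg), for comparison
reported <- c(1.542, 1.866, 1.935, 2.622, 6.753, 12.316)
print(max(abs(se_slopes - reported)))
```

The low-variance components therefore have the largest standard errors. In the output above, Comp.4, Comp.5 and Comp.6 nevertheless have significant coefficients, so components with small variance are not automatically unimportant for predicting SO2.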