

PCA Example Air pollution in 41 cities in the USA.


1 PCA Example Air pollution in 41 cities in the USA.
R data “USairpollution”
Variables:
SO2: SO2 content of air in micrograms per cubic meter
temp: average annual temperature in degrees Fahrenheit
manu: number of manufacturing enterprises employing 20 or more workers
popul: population size (1970 census) in thousands
wind: average annual wind speed in miles per hour
precip: average annual precipitation in inches
predays: average number of days with precipitation per year

2 PCA Example Read the US air pollution data:
>USairpollution_data=read.csv("E:/Multivariate_analysis/Data/USairpollution.csv",header=TRUE)
Remove the first two columns (City, SO2), keeping the two variables related to human ecology (popul, manu) and the four related to climate (temp, wind, precip, predays).
>USairpollution=USairpollution_data[,-c(1,2)]
Negate temperature so that high values of all six variables indicate an unfriendly environment.
> USairpollution$negtemp=USairpollution$temp*(-1)
> USairpollution$temp=NULL
We extract the principal components from the correlation matrix rather than the covariance matrix, since the variables are on very different scales.
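
The reason for preferring the correlation matrix can be seen in a small sketch. It uses synthetic data rather than the air pollution file, and the names `toy`, `small` and `large` are made up for illustration. With the covariance matrix, the variable measured on the larger scale dominates the first component almost entirely; with the correlation matrix, both variables contribute on an equal footing.

```r
# Synthetic illustration (not the air pollution data): two independent
# variables measured on very different scales.
set.seed(1)
toy <- data.frame(
  small = rnorm(200, mean = 10, sd = 1),     # small numeric scale
  large = rnorm(200, mean = 500, sd = 300)   # large numeric scale
)

pca_cov <- princomp(toy, cor = FALSE)  # PCA on the covariance matrix
pca_cor <- princomp(toy, cor = TRUE)   # PCA on the correlation matrix

# Share of total variance attributed to the first component in each case
share_cov <- unname(pca_cov$sdev[1]^2 / sum(pca_cov$sdev^2))
share_cor <- unname(pca_cor$sdev[1]^2 / sum(pca_cor$sdev^2))

# Absolute loadings of each variable on the first covariance-based component
load_cov <- abs(unclass(pca_cov$loadings)[, 1])

print(round(c(covariance = share_cov, correlation = share_cor), 3))
print(round(load_cov, 3))
```

In the covariance-based fit the first component is essentially the `large` variable alone, which is the behaviour the slide warns about when population counts and wind speeds are analysed together.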

3 PCA Example Correlation matrix for USairpollution:
>round(cor(USairpollution),2)
        manu popul wind precip predays negtemp
manu
popul
wind
precip
predays
negtemp
The variables popul and manu are highly correlated.

4 PCA Example We construct a scatterplot matrix of the six variables, with a histogram of each variable on the main diagonal. For this we need the package “HSAUR2”.
>library(HSAUR2)
Build the function “panel.hist” to draw the histograms on the main diagonal.
>panel.hist<-function(x, ...){
   usr<-par("usr");on.exit(par(usr=usr))  # restore the plotting coordinates on exit
   par(usr=c(usr[1:2],0,1.5))             # set the plotting coordinates
   h=hist(x,plot=FALSE)
   breaks<-h$breaks;nB<-length(breaks)
   y<-h$counts;y<-y/max(y)                # rescale counts so the tallest bar has height 1
   rect(breaks[-nB],0,breaks[-1],y,col="grey",...)
 }
Plot the scatterplot matrix:
>pairs(USairpollution,diag.panel=panel.hist,pch=".",cex=1.5)

5 PCA Example Outliers

6 PCA Example Extract the principal components:
> USairpollution_PCA=princomp(USairpollution,cor=TRUE)
> summary(USairpollution_PCA,loadings=TRUE)
Importance of components:
                       Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
Standard deviation
Proportion of Variance
Cumulative Proportion
                       Comp.6
Standard deviation
Proportion of Variance
Cumulative Proportion
Loadings:
        Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
manu
popul
wind
precip
predays
negtemp

7 PCA Example Eigenvectors and eigenvalues
> eigen(cor(USairpollution))
$values
[1]
$vectors
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]
[2,]
[3,]
[4,]
[5,]
[6,]
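
The link between the `eigen` output and the `princomp` output can be checked numerically. The following is a minimal sketch on synthetic data (the data frame `toy` and its columns `a`, `b`, `c` are hypothetical): the eigenvalues of the correlation matrix equal the squared standard deviations of the components, and the eigenvectors equal the loadings up to the sign of each column.

```r
# Synthetic data with some correlation between columns a and b
set.seed(2)
n <- 100
z <- rnorm(n)
toy <- data.frame(
  a = z + rnorm(n),
  b = 50 * z + rnorm(n, sd = 40),
  c = rnorm(n, sd = 0.1)
)

pca <- princomp(toy, cor = TRUE)
eig <- eigen(cor(toy))

# Eigenvalues of the correlation matrix vs. squared component standard deviations
max_diff_values <- max(abs(unname(pca$sdev)^2 - eig$values))

# Eigenvectors vs. loadings, compared in absolute value because each
# column is only determined up to sign
max_diff_vectors <- max(abs(abs(unclass(pca$loadings)) - abs(eig$vectors)))

# Proportion of variance explained by each component
prop_var <- eig$values / sum(eig$values)

print(c(values = max_diff_values, vectors = max_diff_vectors))
print(round(prop_var, 3))
```

Because the eigenvalues of a correlation matrix sum to the number of variables, the proportion of variance for a component is its eigenvalue divided by that number.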

8 PCA Example The first three components account for almost 85% of the variance of the original variables, and each has an eigenvalue greater than one. Equations of the first three principal components: The first component is an indicator of quality of life, with high values indicating a poor environment; the second component reflects a city's rainfall, with high loadings on precip and predays; and the third component is a contrast between precip and negtemp.
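
The two retention criteria used here (eigenvalue greater than one, cumulative proportion of variance) and the form of the component equations can be sketched in R. The example below runs on synthetic data, so the data frame `toy`, its columns `v1` to `v6`, and the resulting numbers are illustrative assumptions and not the air pollution results. It also shows that each component score is a linear combination of the standardized variables, weighted by the loadings.

```r
# Synthetic data: six variables driven partly by two latent factors
set.seed(3)
n <- 41
f1 <- rnorm(n)
f2 <- rnorm(n)
toy <- data.frame(
  v1 = f1 + 0.3 * rnorm(n),
  v2 = f1 + 0.3 * rnorm(n),
  v3 = f2 + 0.3 * rnorm(n),
  v4 = f2 + 0.3 * rnorm(n),
  v5 = rnorm(n),
  v6 = rnorm(n)
)

pca <- princomp(toy, cor = TRUE)
eigenvalues <- unname(pca$sdev^2)

# Criterion 1: components whose eigenvalue exceeds one
keep <- which(eigenvalues > 1)

# Criterion 2: cumulative proportion of variance explained
cum_prop <- cumsum(eigenvalues) / sum(eigenvalues)

# Component equations: scores = standardized data %*% loadings.
# princomp stores the centering and scaling it used in pca$center and pca$scale.
z <- scale(toy, center = pca$center, scale = pca$scale)
manual_scores <- z %*% unclass(pca$loadings)
max_diff <- max(abs(manual_scores - pca$scores))

print(keep)
print(round(cum_prop, 3))
print(max_diff)
```

Reading down one column of the loadings matrix therefore gives the coefficients of the corresponding component equation.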

9 PCA Example Scatter plot matrix of the first three principal components:
> library(MVA)
> pairs(USairpollution_PCA$scores[,1:3],xlim=c(-6,4),ylim=c(-6,4),
    panel=function(x,y,...){
      text(x,y,abbreviate(USairpollution_data[,1]),cex=0.75)  # label points with abbreviated city names
      bvbox(cbind(x,y),add=TRUE)                              # add a bivariate boxplot
    })

10 PCA Example The scatter plots indicate that Chicago, and possibly Phoenix and Philadelphia, are outliers.

11 PCA Example We want to determine which of the climate and human ecology variables best predict the degree of air pollution in a city, as measured by the SO2 content of the air. We use multiple regression to address this problem. Scatter plots of SO2 concentration against each of the 6 principal components:
>par(mfrow=c(2,3))
>out=sapply(1:6,function(i){
   plot(USairpollution_PCA$scores[,i],USairpollution_data$SO2,  # component score on x, SO2 on y, matching the labels
        xlab=paste("PC",i,sep=""),
        ylab="SO2 concentration")})
The first principal component is the most predictive of sulfur dioxide concentration (p-value = 2.28e-07).

12 PCA Example

13 PCA Example
> usair_reg=lm(SO2~USairpollution_PCA$scores,data=USairpollution_data)
> summary(usair_reg)
Call:
lm(formula = SO2 ~ USairpollution_PCA$scores, data = USairpollution_data)
Residuals:
Min 1Q Median 3Q Max
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) e-15 ***
USairpollution_PCA$scoresComp.1 e-07 ***
USairpollution_PCA$scoresComp.2
USairpollution_PCA$scoresComp.3
USairpollution_PCA$scoresComp.4 **
USairpollution_PCA$scoresComp.5 *
USairpollution_PCA$scoresComp.6 **
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: on 34 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 6 and 34 DF, p-value: 5.419e-07
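
Two properties of regression on principal component scores are worth checking, and the sketch below does so on synthetic data (the variables `x1`, `x2`, `x3`, `y` and the coefficients used to generate them are assumptions for illustration only). First, because the scores are mutually uncorrelated, the estimated coefficient of a component does not change when other components are dropped from the model. Second, regressing on all components gives the same fit as regressing on the original variables.

```r
# Synthetic predictors and response
set.seed(4)
n <- 41
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
x$x2 <- x$x2 + 0.8 * x$x1                      # make x1 and x2 correlated
y <- 2 + 1.5 * x$x1 - 0.5 * x$x3 + rnorm(n)

pca <- princomp(x, cor = TRUE)
scores <- as.data.frame(pca$scores)            # columns Comp.1, Comp.2, Comp.3
dat <- cbind(y = y, scores)

fit_all <- lm(y ~ Comp.1 + Comp.2 + Comp.3, data = dat)  # all components
fit_one <- lm(y ~ Comp.1, data = dat)                    # first component only
fit_raw <- lm(y ~ x1 + x2 + x3, data = cbind(y = y, x))  # original variables

# Coefficient of Comp.1 with and without the other components
coef_diff <- abs(unname(coef(fit_all)["Comp.1"] - coef(fit_one)["Comp.1"]))

# R-squared using all components vs. using the original variables
r2_diff <- abs(summary(fit_all)$r.squared - summary(fit_raw)$r.squared)

print(c(coef_diff = coef_diff, r2_diff = r2_diff))
```

Putting the scores into a data frame with named columns, as done here, also yields shorter coefficient labels than passing the score matrix directly in the formula.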

