PCA Example Air pollution in 41 cities in the USA.


PCA Example
Air pollution in 41 cities in the USA. R data "USairpollution".
Variables:
SO2: SO2 content of air in micrograms per cubic meter
temp: average annual temperature in degrees Fahrenheit
manu: number of manufacturing enterprises employing 20 or more workers
popul: population size (1970 census) in thousands
wind: average annual wind speed in miles per hour
precip: average annual precipitation in inches
predays: average number of days with precipitation per year
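The same data set is also distributed with the HSAUR2 R package (used later in these slides), so as an alternative to a local CSV file it can be loaded directly. A minimal sketch, assuming HSAUR2 is installed; note that in the package version the city names are row names rather than a City column:
> library(HSAUR2)
> data("USairpollution", package = "HSAUR2")
> head(USairpollution)   # first few cities, all seven variables
> dim(USairpollution)    # 41 rows (cities), 7 columns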

PCA Example
Read the US air pollution data:
> USairpollution_data=read.csv("E:/Multivariate_analysis/Data/USairpollution.csv",header=T)
Remove the first two columns (City, SO2), leaving only the two variables related to human ecology (manu, popul) and the four related to climate (temp, wind, precip, predays).
> USairpollution=USairpollution_data[,-c(1,2)]
Replace temperature by its negative, so that high values of all six variables indicate an unfriendly environment.
> USairpollution$negtemp=USairpollution$temp*(-1)
> USairpollution$temp=NULL
We extract the principal components from the correlation matrix rather than the covariance matrix, since the variables are on very different scales.
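As a quick check of the scale argument (not shown on the original slides), the raw variances of the six variables can be compared; a minimal sketch:
> # variances of the six variables differ by several orders of magnitude,
> # which is why the correlation matrix is used instead of the covariance matrix
> round(sapply(USairpollution, var), 1)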

PCA Example
Correlation matrix for USairpollution:
> round(cor(USairpollution),2)
         manu popul  wind precip predays negtemp
manu     1.00  0.96  0.24  -0.03    0.13    0.19
popul    0.96  1.00  0.21  -0.03    0.04    0.06
wind     0.24  0.21  1.00  -0.01    0.16    0.35
precip  -0.03 -0.03 -0.01   1.00    0.50   -0.39
predays  0.13  0.04  0.16   0.50    1.00    0.43
negtemp  0.19  0.06  0.35  -0.39    0.43    1.00
Note the high correlation between manu and popul.

PCA Example
We construct a scatterplot matrix of the six variables, with a histogram of each variable on the main diagonal. For this we need the package "HSAUR2":
> library(HSAUR2)
Build the function "panel.hist" to insert the histograms on the main diagonal:
> panel.hist <- function(x, ...){
+   usr <- par("usr"); on.exit(par(usr))
+   par(usr = c(usr[1:2], 0, 1.5))   # set the plotting coordinates
+   h <- hist(x, plot = FALSE)
+   breaks <- h$breaks; nB <- length(breaks)
+   y <- h$counts; y <- y/max(y)
+   rect(breaks[-nB], 0, breaks[-1], y, col = "grey", ...)
+ }
Plot the scatterplot matrix:
> pairs(USairpollution, diag.panel = panel.hist, pch = ".", cex = 1.5)

PCA Example
[Figure: scatterplot matrix of the six variables with histograms on the diagonal; several extreme points (outliers) are visible.]

PCA Example
Extract the principal components:
> USairpollution_PCA=princomp(USairpollution,cor=TRUE)
> summary(USairpollution_PCA,loadings=TRUE)
Importance of components:
                          Comp.1    Comp.2    Comp.3    Comp.4     Comp.5      Comp.6
Standard deviation     1.4819456 1.2247218 1.1809526 0.8719099 0.33848287 0.185599752
Proportion of Variance 0.3660271 0.2499906 0.2324415 0.1267045 0.01909511 0.005741211
Cumulative Proportion  0.3660271 0.6160177 0.8484592 0.9751637 0.99425879 1.000000000

Loadings:
        Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
manu    -0.612  0.168 -0.273 -0.137  0.102  0.703
popul   -0.578  0.222 -0.350               -0.695
wind    -0.354 -0.131  0.297  0.869 -0.113       
precip         -0.623 -0.505  0.171  0.568       
predays -0.238 -0.708        -0.311 -0.580       
negtemp -0.330 -0.128  0.672 -0.306  0.558 -0.136

PCA Example
Eigenvalues and eigenvectors of the correlation matrix:
> eigen(cor(USairpollution))
$values
[1] 2.19616264 1.49994343 1.39464912 0.76022689 0.11457065 0.03444727

$vectors
            [,1]       [,2]        [,3]        [,4]        [,5]        [,6]
[1,] -0.61154243  0.1680577  0.27288633 -0.13684076 -0.10204211  0.70297051
[2,] -0.57782195  0.2224533  0.35037413 -0.07248126  0.07806551 -0.69464131
[3,] -0.35383877 -0.1307915 -0.29725334  0.86942583  0.11326688  0.02452501
[4,]  0.04080701 -0.6228578  0.50456294  0.17114826 -0.56818342 -0.06062222
[5,] -0.23791593 -0.7077653 -0.09308852 -0.31130693  0.58000387  0.02196062
[6,] -0.32964613 -0.1275974 -0.67168611 -0.30645728 -0.55805638 -0.13618780
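The two outputs are consistent: the squared component standard deviations reported by princomp are the eigenvalues of the correlation matrix, and the eigenvectors match the loadings up to an arbitrary sign for each component. A minimal check, not part of the original slides:
> # squared standard deviations from princomp = eigenvalues of cor(USairpollution)
> round(USairpollution_PCA$sdev^2, 5)
> round(eigen(cor(USairpollution))$values, 5)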

PCA Example
The first three components account for almost 85% of the variance of the original variables and have eigenvalues greater than one. Equations of the first three principal components (in terms of the standardized variables, with loadings smaller than 0.1 in absolute value omitted, as in the printed output):
y1 = -0.612 manu - 0.578 popul - 0.354 wind - 0.238 predays - 0.330 negtemp
y2 =  0.168 manu + 0.222 popul - 0.131 wind - 0.623 precip - 0.708 predays - 0.128 negtemp
y3 = -0.273 manu - 0.350 popul + 0.297 wind - 0.505 precip + 0.672 negtemp
The first component is an indicator of quality of life, with high values indicating a poor environment; the second component is mainly concerned with a city's rainfall, being dominated by precip and predays; and the third component is a contrast between precip and negtemp.
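These retention criteria (cumulative proportion of variance and eigenvalues greater than one) can also be inspected with a scree plot; a minimal sketch, not part of the original slides:
> # scree plot of the component variances (the eigenvalues);
> # the dashed line marks the eigenvalue-greater-than-one criterion
> screeplot(USairpollution_PCA, type = "lines", main = "Scree plot")
> abline(h = 1, lty = 2)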

PCA Example
Scatterplot matrix of the first three principal components:
> library(MVA)
> pairs(USairpollution_PCA$scores[,1:3], xlim=c(-6,4), ylim=c(-6,4),
+       panel=function(x,y,...){
+         text(x, y, abbreviate(USairpollution_data[,1]), cex=0.75)
+         bvbox(cbind(x,y), add=TRUE)
+       })

PCA Example
The scatter plots indicate that Chicago, and possibly Phoenix and Philadelphia, are outliers.
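To confirm which cities the extreme points are without reading them off the plot, the first-component scores can be sorted; a minimal sketch (not on the original slides), using the city names in the first column of USairpollution_data as in the pairs() call above:
> # cities ordered by their first principal component score;
> # the most negative scores correspond to large manu and popul values
> scores1 = USairpollution_PCA$scores[,1]
> head(data.frame(city = USairpollution_data[,1], PC1 = round(scores1, 2))[order(scores1), ], 3)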

PCA Example
We want to determine which of the climate and human ecology variables are the best predictors of the degree of air pollution in a city, as measured by the SO2 content of the air. We use multiple regression to address this problem. Scatter plots of SO2 concentration against each of the six principal components:
> par(mfrow=c(2,3))
> out=sapply(1:6, function(i){
+   plot(USairpollution_PCA$scores[,i], USairpollution_data$SO2,
+        xlab=paste("PC",i,sep=""),
+        ylab="SO2 concentration")})
The first principal component is the most predictive of sulfur dioxide concentration (p-value = 2.28e-07).

PCA Example
[Figure: scatter plots of SO2 concentration against each of the six principal component scores.]

PCA Example
> usair_reg=lm(SO2~USairpollution_PCA$scores,data=USairpollution_data)
> summary(usair_reg)

Call:
lm(formula = SO2 ~ USairpollution_PCA$scores, data = USairpollution_data)

Residuals:
    Min      1Q  Median      3Q     Max
-23.004  -8.542  -0.991   5.758  48.758

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)
(Intercept)                       30.049      2.286  13.146 6.91e-15 ***
USairpollution_PCA$scoresComp.1   -9.942      1.542  -6.446 2.28e-07 ***
USairpollution_PCA$scoresComp.2   -2.240      1.866  -1.200  0.23845
USairpollution_PCA$scoresComp.3   -0.375      1.935  -0.194  0.84752
USairpollution_PCA$scoresComp.4   -8.549      2.622  -3.261  0.00253 **
USairpollution_PCA$scoresComp.5   15.176      6.753   2.247  0.03122 *
USairpollution_PCA$scoresComp.6   39.271     12.316   3.189  0.00306 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.64 on 34 degrees of freedom
Multiple R-squared: 0.6695, Adjusted R-squared: 0.6112
F-statistic: 11.48 on 6 and 34 DF, p-value: 5.419e-07
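As a follow-up not shown on the slides, the usual residual diagnostics for this fit can be examined; a minimal sketch:
> # standard lm diagnostics: residuals vs fitted, normal Q-Q,
> # scale-location, and residuals vs leverage
> par(mfrow=c(2,2))
> plot(usair_reg)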