Principal Component Analysis, Canonical Correlation, Cluster Analysis
Xuhua Xia
xxia@uottawa.ca
http://dambe.bio.uottawa.ca
Multivariate statistics
- PCA: principal component analysis
- Canonical correlation
- Cluster analysis
- ...
PCA

Given a set of variables x1, x2, …, xn:
- find a set of coefficients a1 = (a11, a12, …, a1n) so that PC1 = a11x1 + a12x2 + … + a1nxn has the maximum variance (v1), subject to the constraint that a1 is a unit vector, i.e., a11^2 + a12^2 + … + a1n^2 = 1
- find a 2nd set of coefficients a2 so that PC2 has the maximum variance (v2), subject to the unit-vector constraint and the additional constraint that a2 is orthogonal to a1
- find the 3rd, 4th, …, nth sets of coefficients so that PC3, PC4, … have the maximum variance (v3, v4, …), subject to the unit-vector constraint and the constraint that each ai is orthogonal to all previous vectors a1, …, ai-1

It turns out that v1, v2, … are the eigenvalues, and a1, a2, … the eigenvectors, of the variance-covariance matrix of x1, x2, …, xn (or of the correlation matrix if x1, x2, …, xn are standardized).

Demonstrate how to find the eigenvalues, eigenvectors, and PC scores manually in EXCEL.
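The same manual computation can be sketched in R as an alternative to EXCEL: eigen() on the covariance matrix gives the eigenvalues and eigenvectors, and projecting the centered data onto the eigenvectors gives the PC scores. The data below are the first toy dataset from the following slide (X2 is roughly sqrt(2) times X1, so one eigenvalue is near 3 and the other near 0).

```r
# Toy data: two nearly collinear variables
X1 <- 0.632455532 * (-2:2)
X2 <- 0.894427191 * (-2:2)
md <- cbind(X1, X2)

S  <- cov(md)    # variance-covariance matrix
es <- eigen(S)   # eigen-decomposition
es$values        # variances v1, v2 of the PCs (eigenvalues)
es$vectors       # columns are the coefficient vectors a1, a2

# PC scores: center the data, then project onto the eigenvectors
ctr <- scale(md, center = TRUE, scale = FALSE)
scores <- ctr %*% es$vectors
```

Because the two variables are (almost) perfectly correlated, virtually all variance falls on PC1 and the PC2 scores are essentially zero.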
Toy Datasets

Dataset 1:
          X1            X2
-1.264911064  -1.788854382
-0.632455532  -0.894427191
 0             0
 0.632455532   0.894427191
 1.264911064   1.788854382

Dataset 2:
          X1            X2
 1.31832344    1.064630266
 2.341993124   2.921219905
 1.913119064   2.304482133
 3.00557377    2.691546491
 4.183120958   3.897404623

"To demonstrate the physical principles involved in forming a rainbow, it is not necessary to create a rainbow spanning the sky – a small one is sufficient." --C.C. Li
R functions

options("scipen"=100, "digits"=6)  # don't use scientific notation
fit.cov<-prcomp(~X1+X2, data=md)   # or prcomp(md); PCA on the covariance matrix (the default)
fit.cor<-prcomp(md, scale.=TRUE)   # scale.=TRUE requests PCA on the correlation matrix
predict(fit.cov, md)               # centered PC scores
predict(fit.cov, data.frame(X1=0.3, X2=0.5))
screeplot(fit.cov)                 # helps decide how many PCs to keep when there are many variables
prcomp Output

Standard deviations:
[1] 1.7320714226 0.0000440773
(better reported as the variance, i.e., the eigenvalue, accounted for by each PC)

Rotation (the eigenvectors):
        PC1       PC2
X1 0.577347  0.816499
X2 0.816499 -0.577347
i.e., PC1 = 0.57735X1 + 0.81650X2

Principal component scores, generated by as.matrix(md) %*% as.matrix(fit.cov$rotation); the centered scores are obtained from fit.cov$x:
          PC1           PC2
[1,] -2.19092  0.0000278767
[2,] -1.09545 -0.0000557540
[3,]  0.00000  0.0000000000
[4,]  1.09545  0.0000557540
[5,]  2.19092 -0.0000278767
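A quick sanity check of the note above, sketched with the toy data: the scores stored in the fitted object's $x component are exactly the centered data multiplied by the rotation (eigenvector) matrix.

```r
# Toy data (first dataset from the earlier slide)
X1 <- 0.632455532 * (-2:2)
X2 <- 0.894427191 * (-2:2)
md <- data.frame(X1, X2)

fit <- prcomp(md)  # PCA on the covariance matrix (default)

# Center the data, then multiply by the eigenvector matrix
ctr    <- scale(as.matrix(md), center = TRUE, scale = FALSE)
scores <- ctr %*% fit$rotation  # should reproduce fit$x
```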
PCA on correlation matrix (scale.=TRUE)

Standard deviations:
[1] 1.4142135619 0.0000381717

Rotation:
        PC1       PC2
X1 0.707107  0.707107
X2 0.707107 -0.707107

           PC1           PC2
[1,] -1.788850  0.0000241421
[2,] -0.894435 -0.0000482837
[3,]  0.000000  0.0000000000
[4,]  0.894435  0.0000482837
[5,]  1.788850 -0.0000241421
Crime Data in 50 States

STATE       MURDER RAPE ROBBE ASSAU BURGLA LARCEN  AUTO
ALABAMA       14.2 25.2  96.8 278.3 1135.5 1881.9 280.7
ALASKA        10.8 51.6  96.8 284.0 1331.7 3369.8 753.3
ARIZONA        9.5 34.2 138.2 312.3 2346.1 4467.4 439.5
ARKANSAS       8.8 27.6  83.2 203.4  972.6 1862.1 183.4
CALIFORNIA    11.5 49.4 287.0 358.0 2139.4 3499.8 663.5
COLORADO       6.3 42.0 170.7 292.9 1935.2 3903.2 477.1
CONNECTICUT    4.2 16.8 129.5 131.8 1346.0 2620.7 593.2
DELAWARE       6.0 24.9 157.0 194.2 1682.6 3678.4 467.0
FLORIDA       10.2 39.6 187.9 449.1 1859.9 3840.5 351.4
GEORGIA       11.7 31.1 140.5 256.5 1351.1 2170.2 297.9
HAWAII         7.2 25.5 128.0  64.1 1911.5 3920.4 489.4
IDAHO          5.5 19.4  39.6 172.5 1050.8 2599.6 237.6
ILLINOIS       9.9 21.8 211.3 209.0 1085.0 2828.5 528.6
(Full data set in EXCEL)

nd<-md[,2:8]; rownames(nd)<-md$STATE
PCA.cor<-prcomp(nd, scale.=TRUE)  # use the correlation matrix
PCA.cor
summary(PCA.cor)
PCScore<-predict(PCA.cor, nd)     # centered PC scores
screeplot(PCA.cor, type="l")
Correlation Matrix

         MURDER   RAPE ROBBERY ASSAULT BURGLARY LARCENY   AUTO
MURDER   1.0000 0.6012  0.4837  0.6486   0.3858  0.1019 0.0688
RAPE     0.6012 1.0000  0.5919  0.7403   0.7121  0.6140 0.3489
ROBBERY  0.4837 0.5919  1.0000  0.5571   0.6372  0.4467 0.5907
ASSAULT  0.6486 0.7403  0.5571  1.0000   0.6229  0.4044 0.2758
BURGLARY 0.3858 0.7121  0.6372  0.6229   1.0000  0.7921 0.5580
LARCENY  0.1019 0.6140  0.4467  0.4044   0.7921  1.0000 0.4442
AUTO     0.0688 0.3489  0.5907  0.2758   0.5580  0.4442 1.0000

If the variables were not correlated, there would be no point in doing PCA. The correlation matrix is symmetric, so we need to inspect only the upper or the lower triangle.
Eigenvalues

> summary(PCA.cor)
Importance of components:
                         PC1   PC2   PC3    PC4    PC5    PC6    PC7
Standard deviation     2.029 1.113 0.852 0.5625 0.5079 0.4712 0.3522
Proportion of Variance 0.588 0.177 0.104 0.0452 0.0369 0.0317 0.0177
Cumulative Proportion  0.588 0.765 0.869 0.9137 0.9506 0.9823 1.0000

screeplot(PCA.cor, type = "lines")
Eigenvectors

           PC1      PC2      PC3      PC4      PC5      PC6       PC7
MURDER -0.3003 -0.62918  0.17824 -0.23216  0.53810  0.25912  0.267589
RAPE   -0.4318 -0.16944 -0.24421  0.06219  0.18848 -0.77327 -0.296490
ROBBE  -0.3969  0.04224  0.49588 -0.55793 -0.52002 -0.11439 -0.003902
ASSAU  -0.3966 -0.34353 -0.06953  0.62984 -0.50660  0.17235  0.191751
BURGLA -0.4402  0.20334 -0.20990 -0.05757  0.10101  0.53599 -0.648117
LARCEN -0.3574  0.40233 -0.53922 -0.23491  0.03008  0.03941  0.601688
AUTO   -0.2952  0.50241  0.56837  0.41922  0.36980 -0.05729  0.147044

Do these eigenvectors mean anything? All crimes load negatively on the first eigenvector, which is therefore interpreted as a measure of overall safety. The 2nd eigenvector has positive loadings on AUTO, LARCENY and ROBBERY and negative loadings on MURDER, ASSAULT and RAPE; it is interpreted as measuring the preponderance of property crime over violent crime.
PC Plot: Crime Data

[Scatter plot of PC1 vs PC2 scores with states labeled. Notable groupings: Nevada, New York and California; Maryland; North and South Dakota; Mississippi, Alabama, Louisiana and South Carolina.]
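A plot like this can be sketched in R. The full 50-state dataset is in the EXCEL file; as a stand-in, the 13 states listed on the earlier slide are hard-coded below, and the plot is written to a temporary PDF.

```r
# 13-state subset of the crime data (rates per 100,000)
md <- data.frame(
  STATE  = c("ALABAMA","ALASKA","ARIZONA","ARKANSAS","CALIFORNIA","COLORADO",
             "CONNECTICUT","DELAWARE","FLORIDA","GEORGIA","HAWAII","IDAHO","ILLINOIS"),
  MURDER = c(14.2,10.8,9.5,8.8,11.5,6.3,4.2,6.0,10.2,11.7,7.2,5.5,9.9),
  RAPE   = c(25.2,51.6,34.2,27.6,49.4,42.0,16.8,24.9,39.6,31.1,25.5,19.4,21.8),
  ROBBE  = c(96.8,96.8,138.2,83.2,287.0,170.7,129.5,157.0,187.9,140.5,128.0,39.6,211.3),
  ASSAU  = c(278.3,284.0,312.3,203.4,358.0,292.9,131.8,194.2,449.1,256.5,64.1,172.5,209.0),
  BURGLA = c(1135.5,1331.7,2346.1,972.6,2139.4,1935.2,1346.0,1682.6,1859.9,1351.1,1911.5,1050.8,1085.0),
  LARCEN = c(1881.9,3369.8,4467.4,1862.1,3499.8,3903.2,2620.7,3678.4,3840.5,2170.2,3920.4,2599.6,2828.5),
  AUTO   = c(280.7,753.3,439.5,183.4,663.5,477.1,593.2,467.0,351.4,297.9,489.4,237.6,528.6))

nd <- md[, 2:8]; rownames(nd) <- md$STATE
PCA.cor <- prcomp(nd, scale. = TRUE)   # PCA on the correlation matrix
PCScore <- predict(PCA.cor, nd)        # centered PC scores

# Plot PC1 vs PC2 with state labels
pdf(tempfile(fileext = ".pdf"))
plot(PCScore[, 1], PCScore[, 2], type = "n", xlab = "PC1", ylab = "PC2")
text(PCScore[, 1], PCScore[, 2], labels = rownames(nd), cex = 0.7)
dev.off()
```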
Multivariate statistics
- PCA: principal component analysis
- Canonical correlation
- Cluster analysis
- ...
Correlation
- Simple correlation: between two variables
- Multiple correlation: between one variable and a set of other variables (e.g., x1 on x2 and x3)
- Partial correlation: between X and Y with Z being controlled for
- Canonical correlation: between two sets of variables, each containing more than one variable

Simple and multiple correlations are special cases of canonical correlation.
Review of correlation

X  Z  Y
1  4  14.0000
1  5  17.9087
1  6  16.3255
2  3  14.4441
2  4  15.2952
2  5  19.1587
2  6  16.0299
2  5  17.0000
3  3  14.7556
3  4  17.6823
3  5  20.5301
3  6  21.6408
4  3  15.0903
4  4  18.1603
4  5  22.2471
5  2  14.4450
5  3  16.5554
5  4  21.0047
5  5  22.0000
6  1  19.0000
6  2  18.0000
6  3  18.1863
6  4  21.0000

1. Compute Pearson correlation coefficients between X and Z, X and Y, and Z and Y.
2. Compute the partial correlation coefficient between X and Y, controlling for Z (i.e., the correlation coefficient between X and Y when Z is held constant), by using the equation in the previous slide.
3. Run R to verify your calculation:

# install.packages("ggm")
library(ggm)
md<-read.table("XYZ.txt",header=T)
cor(md)
s<-var(md)
parcor(s)  # or parcor(cor(md))
# install.packages("psych")
library(psych)
smc(s)     # squared multiple correlation
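The partial-correlation equation referred to above is r_XY.Z = (r_XY − r_XZ·r_YZ) / sqrt((1 − r_XZ²)(1 − r_YZ²)). A sketch (with simulated data, since XYZ.txt is not included here) verifying the formula against the equivalent residual-based definition:

```r
# Partial correlation of x and y controlling for z, computed two ways
set.seed(1)
z <- rnorm(50)
x <-  2 * z + rnorm(50)
y <- -1 * z + rnorm(50)

# (1) From the three pairwise Pearson correlations
rxy <- cor(x, y); rxz <- cor(x, z); ryz <- cor(y, z)
pc1 <- (rxy - rxz * ryz) / sqrt((1 - rxz^2) * (1 - ryz^2))

# (2) As the correlation of the residuals after regressing out z
pc2 <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))
```

The two approaches agree to machine precision, which is a handy check on a hand calculation.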
Data for canonical correlation

# First three variables: physical
# Last three variables: exercise
# Middle-aged men
weight waist pulse chins situps jumps
191 36 50  5 162  60
193 38 58 12 101 101
189 35 46 13 145  58
211 38 56  8 151  38
176 31 74 15 200  40
169 34 50 17 120  38
154 34 64 14 215 105
193 36 46  6 170  31
176 37 54  4 160  25
156 33 54 15 215  73
189 37 52  2 130  60
162 35 62 12 145  37
182 36 56  4 141  42
167 34 60  6 155  40
154 30 56 17 251 250
166 33 52 13 210 115
247 46 50  1  50  50
202 37 62 12 120 120
157 32 52 11 230  80
138 33 68  2 150  43

Use EXCEL to manually calculate the first canonical correlation.
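The "manual" computation can also be sketched in R: the squared canonical correlations are the eigenvalues of Sxx⁻¹ Sxy Syy⁻¹ Syx, where the S matrices are the blocks of the covariance matrix. The result can be cross-checked against base R's cancor().

```r
# Middle-aged men data from this slide
md <- read.table(header = TRUE, text = "
weight waist pulse chins situps jumps
191 36 50 5 162 60
193 38 58 12 101 101
189 35 46 13 145 58
211 38 56 8 151 38
176 31 74 15 200 40
169 34 50 17 120 38
154 34 64 14 215 105
193 36 46 6 170 31
176 37 54 4 160 25
156 33 54 15 215 73
189 37 52 2 130 60
162 35 62 12 145 37
182 36 56 4 141 42
167 34 60 6 155 40
154 30 56 17 251 250
166 33 52 13 210 115
247 46 50 1 50 50
202 37 62 12 120 120
157 32 52 11 230 80
138 33 68 2 150 43")

phys <- as.matrix(md[, 1:3]); exer <- as.matrix(md[, 4:6])
S   <- cov(md)
Sxx <- S[1:3, 1:3]; Syy <- S[4:6, 4:6]
Sxy <- S[1:3, 4:6]; Syx <- t(Sxy)

# Squared canonical correlations = eigenvalues of this matrix
M  <- solve(Sxx) %*% Sxy %*% solve(Syy) %*% Syx
r1 <- sqrt(max(Re(eigen(M)$values)))  # first canonical correlation
```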
Canonical correlation (cc)

install.packages("ggplot2")
install.packages("GGally")
install.packages("CCA")
install.packages("CCP")
require(ggplot2)
require(GGally)
require(CCA)
require(CCP)
phys<-md[,1:3]
exer<-md[,4:6]
matcor(phys,exer)
cc1<-cc(phys,exer)
cc1

http://www.ats.ucla.edu/stat/r/dae/canonical.htm
cc output

$cor (the canonical correlations)
[1] 0.87857805 0.26499182 0.06266112

$xcoef (raw canonical coefficient matrix U)
              [,1]        [,2]        [,3]
weight  0.007691932  0.08206036 -0.01089895
waist  -0.352367502 -0.46672576  0.12741976
pulse  -0.016888712  0.04500996  0.14113157

$ycoef (raw canonical coefficient matrix V)
              [,1]        [,2]         [,3]
chins   0.063996632  0.19132168  0.116137756
situps  0.017876736 -0.01743903  0.001201433
jumps  -0.002949483  0.00494516 -0.022700322

phys %*% U gives the raw canonical variates for phys, and exer %*% V those for exer. The standardized canonical variates are in $scores$xscores and $scores$yscores.
standardized canonical variates $scores$xscores [,1] [,2] [,3] [1,] -0.06587452 0.39294336 -0.90048466 [2,] -0.89033536 -0.01630780 0.46160952 [3,] 0.33866397 0.51550858 -1.57063280 [4,] -0.71810315 1.37075870 -0.01683463 [5,] 1.17525491 2.57590579 2.01305832 [6,] 0.46963797 -0.47893295 -0.91554740 [7,] 0.11781701 -1.07969890 1.22377873 [8,] 0.01706419 0.37702424 -1.48680882 [9,] -0.60117586 -1.12464792 -0.04505445 [10,] 0.65445550 -0.89895199 -0.33675460 [11,] -0.46740331 -0.14788320 -0.46900387 [12,] -0.13923760 -0.97996173 0.98174380 [13,] -0.23643419 -0.07554011 0.04439525 [14,] 0.28536698 -0.19295410 0.51756617 [15,] 1.66239672 0.42712450 -0.41495287 [16,] 0.76515225 -0.16836835 -0.72800719 [17,] -3.15880133 0.32106568 -0.23662794 [18,] -0.53629531 1.36900100 0.80062552 [19,] 1.04829236 -0.44018579 -0.75733645 [20,] 0.27955875 -1.74589901 1.83526836 $scores$yscores [,1] [,2] [,3] [1,] -0.23742244 -0.91888370 -0.28185833 [2,] -1.00085572 1.68690015 -0.47289464 [3,] -0.02345494 0.89826285 0.67222000 [4,] -0.17718803 -0.26188291 0.55274626 [5,] 1.14084951 0.23274696 1.37918010 [6,] -0.15539717 2.00062200 1.56074166 [7,] 1.15328755 0.10127530 -0.19445711 [8,] 0.05512308 -1.01048386 0.50220023 [9,] -0.23394065 -1.24840794 0.39411232 [10,] 1.31166763 0.13435186 0.64809096 [11,] -1.00146790 -0.93479995 -0.66871744 [12,] -0.02551244 0.60309281 1.03278901 [13,] -0.62373985 -0.83299874 -0.01462037 [14,] -0.23957331 -0.70439205 0.27987584 [15,] 1.56116497 0.76448365 -3.09433899 [16,] 0.97041241 0.04660035 -0.54360525 [17,] -2.46610861 0.21954878 -0.65396658 [18,] -0.71723790 1.44951672 -0.88137354 [19,] 1.30318577 -0.85790412 0.04265917 [20,] -0.59379197 -1.36764817 -0.25878331
Canonical structure: Correlations

$scores$corr.X.xscores (correlations between the phys variables and the canonical variates U)
            [,1]       [,2]       [,3]
weight -0.8028458 0.53345479 -0.2662041
waist  -0.9871691 0.07372001 -0.1416419
pulse   0.2061478 0.10981908  0.9723389

$scores$corr.Y.xscores (correlations between the exer variables and the canonical variates U)
            [,1]        [,2]         [,3]
chins  0.6101751  0.18985890  0.004125743
situps 0.8442193 -0.05748754 -0.010784582
jumps  0.3638095  0.09727830 -0.052192182

$scores$corr.X.yscores (correlations between the phys variables and the canonical variates V)
            [,1]       [,2]         [,3]
weight -0.7053627 0.14136116 -0.016680651
waist  -0.8673051 0.01953520 -0.008875444
pulse   0.1811170 0.02910116  0.060927845

$scores$corr.Y.yscores (correlations between the exer variables and the canonical variates V)
            [,1]       [,2]        [,3]
chins  0.6945030  0.7164708  0.06584216
situps 0.9608928 -0.2169408 -0.17210961
jumps  0.4140890  0.3670993 -0.83292764
Significance: p.asym in CCP

vCancor<-cc1$cor
# p.asym(rho, N, p, q, tstat = "Wilks|Hotelling|Pillai|Roy")
res<-p.asym(vCancor, length(md$weight), 3, 3, tstat = "Wilks")

Wilks' Lambda, using F-approximation (Rao's F):
              stat    approx df1      df2     p.value
1 to 3:  0.2112505 3.4003788   9 34.22293 0.004421278   # at least one cancor significant?
2 to 3:  0.9261286 0.2933756   4 30.00000 0.879945478   # significant relationship after excluding cancor 1?
3 to 3:  0.9960736 0.0630703   1 16.00000 0.804904236   # significant relationship after excluding cancors 1 and 2?

plt.asym(res,rhostart=1)
plt.asym(res,rhostart=2)
plt.asym(res,rhostart=3)
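The Wilks' lambda statistics themselves are simple functions of the canonical correlations: the test of correlations k through 3 uses the standard formula Λ = Π (1 − ρᵢ²) for i = k, …, 3. A sketch reproducing the stat column above from the canonical correlations reported earlier:

```r
# Canonical correlations reported by cc() for this data
rho <- c(0.87857805, 0.26499182, 0.06266112)

# Wilks' lambda for jointly testing canonical correlations k..3
wilks <- function(k) prod(1 - rho[k:3]^2)
wilks(1)  # "1 to 3" row
wilks(2)  # "2 to 3" row
wilks(3)  # "3 to 3" row
```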
Multivariate statistics
- PCA: principal component analysis
- Canonical correlation
- Cluster analysis
- ...
Cluster analysis

# Use US crime data
nd<-scale(md[,2:8])  # or nd<-md[,-1] for unscaled data
rownames(nd)<-md$STATE
d<-dist(nd,method="euclidean")
# other distances: "maximum", "manhattan", "Canberra", "binary", "minkowski"
hc<-hclust(d, method="average")  # average linkage (UPGMA); the default is "complete"
plot(hc,hang=-1)  # hang=-1: all labels hang down from the same level
rect.hclust(hc, k=4, border="red")

# export the tree in Newick format
library(ape)
class(hc)  # must be of class hclust
my_tree <- as.phylo(hc)
write.tree(phy=my_tree, file="clipboard")  # "clipboard" works on Windows only
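Since the crime data lives in an EXCEL file, a self-contained sketch using R's built-in USArrests data (also per-state crime rates) shows the same workflow, plus cutree() to extract the cluster memberships that rect.hclust() draws:

```r
# Hierarchical clustering of R's built-in USArrests data (50 states)
nd <- scale(USArrests)               # standardize the variables
d  <- dist(nd, method = "euclidean")
hc <- hclust(d, method = "average")  # UPGMA

memb <- cutree(hc, k = 4)            # cluster membership for 4 clusters
table(memb)                          # cluster sizes
```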
Group into clusters

[Dendrogram of the 50 states, with the four clusters boxed by rect.hclust]
kmeans & number of clusters

[Plot of within-group SS against the number of clusters]
Total sum of squares (TSS) = between-group sum of squares (BSS) + within-group sum of squares (WSS)
- BSS: sum of squared distances between the cluster centroids and the grand centroid, weighted by cluster size
- WSS: sum of squared distances between each point and its cluster centroid
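kmeans() returns all three quantities, so the identity TSS = BSS + WSS can be checked directly (a sketch on simulated data):

```r
set.seed(2)
x <- matrix(rnorm(100 * 2), ncol = 2)  # 100 points in 2 dimensions

km <- kmeans(x, centers = 3, nstart = 20)
km$totss        # TSS
km$betweenss    # BSS
km$tot.withinss # WSS (the sum of km$withinss over clusters)
```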
WSS plot

If the clustering is truly optimized at each k, then WSS should decrease monotonically with an increasing number of clusters. The gain in BSS from adding a cluster (BSSi − BSSi−1) levels off after the elbow of the curve; here the plot suggests 3 clusters.

[WSS plotted against the number of clusters, with an elbow at 3]
Determining number of clusters (k)

# Rationale: plot the within-cluster sum of squares (WSS) over k = 1:15
# apply(md,1|2,var): 1|2 means variables in rows|columns
# DF*var = SS, so the WSS for k = 1 is the total sum of squares:
totalSS <- (nrow(md)-1)*sum(apply(md,2,var))

# kmeans clustering with 1, 2, ..., 15 clusters; compute WSS for each k
WSS<-rep(0,15)
for (i in 1:15) WSS[i] <- sum(kmeans(md, centers=i)$withinss)
numCluster<-1:15
plot(numCluster, WSS, type="b", xlab="Num_Cluster", ylab="Within_groups_SS")
# Because the clustering algorithm is heuristic, run this multiple times
# and check the WSS[i] values.

# K-means cluster analysis; the fit contains $cluster, $centers, $totss,
# $withinss and $betweenss
fit <- kmeans(md, 5)  # 5-cluster solution

# Repeat the clustering 10 times and record the percentage of variance
# explained; change the second argument of kmeans() (e.g., from 5 to 4)
# for other cluster numbers
fitArray<-array(list(),10)
MSS<-rep(0,10)
for(i in 1:10) {
  fitArray[[i]] <- kmeans(md, 5)
  MSS[i]<-100*fitArray[[i]]$betweenss/fitArray[[i]]$totss
}