Principal Component Analysis, Canonical Correlation, and Cluster Analysis


Xuhua Xia
xxia@uottawa.ca
http://dambe.bio.uottawa.ca

Let me first emphasize that taking a systems biology perspective in doing biological research is not new. Its roots can be traced at least to Ronald Fisher...

Multivariate statistics
- PCA: principal component analysis
- Canonical correlation
- Cluster analysis
- ...

PCA
Given a set of variables x1, x2, ..., xn:
- Find a set of coefficients a11, a12, ..., a1n so that PC1 = a11x1 + a12x2 + ... + a1nxn has the maximum variance (v1), subject to the constraint that a1 is a unit vector, i.e., sqrt(a11^2 + a12^2 + ... + a1n^2) = 1.
- Find a 2nd set of coefficients a2 so that PC2 has the maximum variance (v2), subject to the unit-vector constraint and the additional constraint that a2 is orthogonal to a1.
- Find the 3rd, 4th, ..., nth sets of coefficients so that PC3, PC4, ... have the maximum variance (v3, v4, ...), subject to the unit-vector constraint and to each ai being orthogonal to all earlier vectors a1, ..., a(i-1).
It turns out that v1, v2, ... are the eigenvalues, and a1, a2, ... the eigenvectors, of the variance-covariance matrix of x1, x2, ..., xn (or of the correlation matrix if x1, x2, ..., xn are standardized).
Exercise: find the eigenvalues, eigenvectors, and PC scores manually in EXCEL (an equivalent R sketch follows the toy data below).

Toy Datasets

Dataset 1:
X1            X2
-1.264911064  -1.788854382
-0.632455532  -0.894427191
 0             0
 0.632455532   0.894427191
 1.264911064   1.788854382

Dataset 2:
X1           X2
1.31832344   1.064630266
2.341993124  2.921219905
1.913119064  2.304482133
3.00557377   2.691546491
4.183120958  3.897404623

"To demonstrate the physical principles involved in forming a rainbow, it is not necessary to create a rainbow spanning the sky – a small one is sufficient." --C.C. Li
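As a follow-up to the manual-EXCEL exercise on the PCA slide, here is a minimal R sketch of the same calculation on the first toy dataset (the data are entered here so the sketch is self-contained): eigen-decomposition of the covariance matrix gives the PC variances and loadings, and multiplying the centered data by the eigenvectors gives the PC scores. Signs of the eigenvectors may differ from prcomp's output.

# Manual PCA on the first toy dataset via eigen()
md <- data.frame(X1 = c(-1.264911064, -0.632455532, 0, 0.632455532, 1.264911064),
                 X2 = c(-1.788854382, -0.894427191, 0, 0.894427191, 1.788854382))
S <- cov(md)                                     # variance-covariance matrix
e <- eigen(S)                                    # eigen-decomposition
e$values                                         # eigenvalues = PC variances v1, v2
e$vectors                                        # eigenvectors = loadings a1, a2 (columns)
ctr <- scale(md, center = TRUE, scale = FALSE)   # center each variable
pcScores <- as.matrix(ctr) %*% e$vectors         # PC scores
# Compare with prcomp: sdev^2, rotation and x should match up to sign
fit <- prcomp(md)
fit$sdev^2; fit$rotation; fit$x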

R functions

options("scipen"=100, "digits"=6)              # don't use scientific notation
fit.cov<-prcomp(~X1+X2, data=md)               # or prcomp(md); PCA on the covariance matrix (the default)
fit.cor<-prcomp(md, scale.=TRUE)               # scale.=TRUE requests PCA on the correlation matrix
predict(fit.cov, md)                           # centered PC scores
predict(fit.cov, data.frame(X1=0.3, X2=0.5))   # score a new observation
screeplot(fit.cov)                             # helps decide how many PCs to keep when there are many variables

prcomp Output

Standard deviations:
[1] 1.7320714226 0.0000440773

Rotation:
        PC1       PC2
X1 0.577347  0.816499
X2 0.816499 -0.577347

PC scores:
          PC1           PC2
[1,] -2.19092  0.0000278767
[2,] -1.09545 -0.0000557540
[3,]  0.00000  0.0000000000
[4,]  1.09545  0.0000557540
[5,]  2.19092 -0.0000278767

Notes:
- The standard deviations are better reported as the variance (eigenvalue) accounted for by each PC.
- The rotation matrix holds the eigenvectors: PC1 = 0.57735 X1 + 0.81650 X2.
- The PC scores are generated by as.matrix(md) %*% fit.cov$rotation; the centered scores are obtained with fit.cov$x (here the two coincide because the toy data are already centered).
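A small sketch of how to get the eigenvalues and the proportion of variance directly from the prcomp object (assuming fit.cov from the R-functions slide):

ev <- fit.cov$sdev^2     # eigenvalues (variances of the PCs)
ev / sum(ev)             # proportion of total variance accounted for by each PC
summary(fit.cov)         # reports the same proportions and their cumulative sum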

PCA on the correlation matrix (scale.=TRUE)

Standard deviations:
[1] 1.4142135619 0.0000381717

Rotation:
        PC1       PC2
X1 0.707107  0.707107
X2 0.707107 -0.707107

PC scores:
           PC1           PC2
[1,] -1.788850  0.0000241421
[2,] -0.894435 -0.0000482837
[3,]  0.000000  0.0000000000
[4,]  0.894435  0.0000482837
[5,]  1.788850 -0.0000241421
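A quick check (assuming md still holds the first toy dataset): PCA on the correlation matrix is the same as PCA of the standardized variables, and the squared standard deviations above are the eigenvalues of the correlation matrix.

all.equal(prcomp(md, scale. = TRUE)$sdev, prcomp(scale(md))$sdev)   # should be TRUE
eigen(cor(md))$values   # ~2 and ~0, the squares of 1.41421... and 0.0000381717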

Crime Data in 50 States

STATE        MURDER  RAPE  ROBBE  ASSAU  BURGLA  LARCEN   AUTO
ALABAMA        14.2  25.2   96.8  278.3  1135.5  1881.9  280.7
ALASKA         10.8  51.6   96.8  284.0  1331.7  3369.8  753.3
ARIZONA         9.5  34.2  138.2  312.3  2346.1  4467.4  439.5
ARKANSAS        8.8  27.6   83.2  203.4   972.6  1862.1  183.4
CALIFORNIA     11.5  49.4  287.0  358.0  2139.4  3499.8  663.5
COLORADO        6.3  42.0  170.7  292.9  1935.2  3903.2  477.1
CONNECTICUT     4.2  16.8  129.5  131.8  1346.0  2620.7  593.2
DELAWARE        6.0  24.9  157.0  194.2  1682.6  3678.4  467.0
FLORIDA        10.2  39.6  187.9  449.1  1859.9  3840.5  351.4
GEORGIA        11.7  31.1  140.5  256.5  1351.1  2170.2  297.9
HAWAII          7.2  25.5  128.0   64.1  1911.5  3920.4  489.4
IDAHO           5.5  19.4   39.6  172.5  1050.8  2599.6  237.6
ILLINOIS        9.9  21.8  211.3  209.0  1085.0  2828.5  528.6
(Full data set in EXCEL)

# md holds the full 50-state data, read into R beforehand
nd<-md[,2:8]; rownames(nd)<-md$STATE
PCA.cor<-prcomp(nd, scale.=TRUE)    # use the correlation matrix
PCA.cor
summary(PCA.cor)
PCScore<-predict(PCA.cor, nd)       # centered PC scores
screeplot(PCA.cor, type="l")

Correlation Matrix

          MURDER   RAPE ROBBERY ASSAULT BURGLARY LARCENY   AUTO
MURDER    1.0000 0.6012  0.4837  0.6486   0.3858  0.1019 0.0688
RAPE      0.6012 1.0000  0.5919  0.7403   0.7121  0.6140 0.3489
ROBBERY   0.4837 0.5919  1.0000  0.5571   0.6372  0.4467 0.5907
ASSAULT   0.6486 0.7403  0.5571  1.0000   0.6229  0.4044 0.2758
BURGLARY  0.3858 0.7121  0.6372  0.6229   1.0000  0.7921 0.5580
LARCENY   0.1019 0.6140  0.4467  0.4044   0.7921  1.0000 0.4442
AUTO      0.0688 0.3489  0.5907  0.2758   0.5580  0.4442 1.0000

If the variables are not correlated, there would be no point in doing PCA. The correlation matrix is symmetric, so we only need to inspect either the upper or the lower triangular part.
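For reference, a one-liner that reproduces the matrix above (assuming nd from the previous slide):

round(cor(nd), 4)   # pairwise correlations among the seven crime variables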

Eigenvalues

> summary(PCA.cor)
Importance of components:
                         PC1   PC2   PC3    PC4    PC5    PC6    PC7
Standard deviation     2.029 1.113 0.852 0.5625 0.5079 0.4712 0.3522
Proportion of Variance 0.588 0.177 0.104 0.0452 0.0369 0.0317 0.0177
Cumulative Proportion  0.588 0.765 0.869 0.9137 0.9506 0.9823 1.0000

screeplot(PCA.cor, type = "lines")

Eigenvectors

            PC1      PC2      PC3      PC4      PC5      PC6       PC7
MURDER  -0.3003 -0.62918  0.17824 -0.23216  0.53810  0.25912  0.267589
RAPE    -0.4318 -0.16944 -0.24421  0.06219  0.18848 -0.77327 -0.296490
ROBBE   -0.3969  0.04224  0.49588 -0.55793 -0.52002 -0.11439 -0.003902
ASSAU   -0.3966 -0.34353 -0.06953  0.62984 -0.50660  0.17235  0.191751
BURGLA  -0.4402  0.20334 -0.20990 -0.05757  0.10101  0.53599 -0.648117
LARCEN  -0.3574  0.40233 -0.53922 -0.23491  0.03008  0.03941  0.601688
AUTO    -0.2952  0.50241  0.56837  0.41922  0.36980 -0.05729  0.147044

Do these eigenvectors mean anything? All crimes load negatively on the first eigenvector, which can therefore be interpreted as a measure of overall safety. The 2nd eigenvector has positive loadings on AUTO, LARCENY and ROBBERY and negative loadings on MURDER, ASSAULT and RAPE; it can be interpreted as measuring the preponderance of property crime over violent crime...

PC Plot: Crime Data

[Plot of the states on the PC scores; labelled groups include Maryland; Nevada, New York and California; North and South Dakota; and Mississippi, Alabama, Louisiana and South Carolina.]
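A minimal sketch of how such a plot can be drawn from the PC scores (assuming PCScore from the crime-data slide; the label size is an arbitrary choice):

plot(PCScore[, 1], PCScore[, 2], xlab = "PC1", ylab = "PC2", type = "n")   # empty plotting frame
text(PCScore[, 1], PCScore[, 2], labels = rownames(PCScore), cex = 0.6)    # state names at their PC coordinates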

Multivariate statistics
- PCA: principal component analysis
- Canonical correlation
- Cluster analysis
- ...

Correlation
- Simple correlation: between two variables.
- Multiple and partial correlations: between one variable and a set of other variables. Partial: between X and Y with Z being controlled for. Multiple: X1 on X2 and X3.
- Canonical correlation: between two sets of variables, each containing more than one variable. Simple and multiple correlations are special cases of canonical correlation.

Review of correlation

X  Z  Y
1  4  14.0000
1  5  17.9087
1  6  16.3255
2  3  14.4441
2  4  15.2952
2  5  19.1587
2  6  16.0299
2  5  17.0000
3  3  14.7556
3  4  17.6823
3  5  20.5301
3  6  21.6408
4  3  15.0903
4  4  18.1603
4  5  22.2471
5  2  14.4450
5  3  16.5554
5  4  21.0047
5  5  22.0000
6  1  19.0000
6  2  18.0000
6  3  18.1863
6  4  21.0000

Exercises:
- Compute the Pearson correlation coefficients between X and Z, X and Y, and Z and Y.
- Compute the partial correlation coefficient between X and Y, controlling for Z (i.e., the correlation between X and Y when Z is held constant), using the equation from the previous slide.
- Run R to verify your calculations:

# install.packages("ggm")
library(ggm)
md<-read.table("XYZ.txt", header=TRUE)   # the X, Z, Y data above
cor(md)
s<-var(md)
parcor(s)       # or parcor(cor(md)); partial correlations
# install.packages("psych")
library(psych)
smc(s)          # squared multiple correlations
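The equation referred to above is not reproduced in this transcript; for reference, the standard first-order partial correlation formula, written as an R sketch (assuming md holds the X, Z, Y data):

r <- cor(md)                       # pairwise Pearson correlations
rXY <- r["X","Y"]; rXZ <- r["X","Z"]; rYZ <- r["Z","Y"]
rXY.Z <- (rXY - rXZ*rYZ) / sqrt((1 - rXZ^2)*(1 - rYZ^2))   # partial r(X,Y | Z)
rXY.Z                              # should match the (X,Y) entry of parcor(cor(md))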

Data for canonical correlation

# First three variables: physical
# Last three variables: exercise
# Middle-aged men
weight waist pulse chins situps jumps
   191    36    50     5    162    60
   193    38    58    12    101   101
   189    35    46    13    145    58
   211    38    56     8    151    38
   176    31    74    15    200    40
   169    34    50    17    120    38
   154    34    64    14    215   105
   193    36    46     6    170    31
   176    37    54     4    160    25
   156    33    54    15    215    73
   189    37    52     2    130    60
   162    35    62    12    145    37
   182    36    56     4    141    42
   167    34    60     6    155    40
   154    30    56    17    251   250
   166    33    52    13    210   115
   247    46    50     1     50    50
   202    37    62    12    120   120
   157    32    52    11    230    80
   138    33    68     2    150    43

Exercise: use EXCEL to manually calculate the first canonical correlation (an R sketch of the same calculation follows).
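A minimal sketch of that manual calculation in R (assuming md now holds the 20 rows above): the squared canonical correlations are the eigenvalues of Rxx^-1 Rxy Ryy^-1 Ryx.

phys <- md[, 1:3]    # weight, waist, pulse
exer <- md[, 4:6]    # chins, situps, jumps
Rxx <- cor(phys); Ryy <- cor(exer); Rxy <- cor(phys, exer)
M <- solve(Rxx) %*% Rxy %*% solve(Ryy) %*% t(Rxy)
sqrt(eigen(M)$values)   # canonical correlations; the first should be ~0.879, matching cc(phys, exer)$cor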

Canonical correlation (cc)

install.packages("ggplot2")
install.packages("GGally")
install.packages("CCA")
install.packages("CCP")
require(ggplot2)
require(GGally)
require(CCA)
require(CCP)

phys<-md[,1:3]       # physical variables
exer<-md[,4:6]       # exercise variables
matcor(phys,exer)    # correlations within and between the two sets
cc1<-cc(phys,exer)
cc1

http://www.ats.ucla.edu/stat/r/dae/canonical.htm

cc output

$cor
[1] 0.87857805 0.26499182 0.06266112    # canonical correlations

$xcoef
               [,1]        [,2]        [,3]
weight  0.007691932  0.08206036 -0.01089895
waist  -0.352367502 -0.46672576  0.12741976
pulse  -0.016888712  0.04500996  0.14113157

$ycoef
               [,1]        [,2]         [,3]
chins   0.063996632  0.19132168  0.116137756
situps  0.017876736 -0.01743903  0.001201433
jumps  -0.002949483  0.00494516 -0.022700322

Notes:
- $xcoef and $ycoef are the raw canonical coefficient matrices U and V.
- phys * U gives the raw canonical variates for phys; exer * V gives those for exer.
- $scores$xscores (and $scores$yscores) hold the standardized canonical variates.

Standardized canonical variates

$scores$xscores
             [,1]        [,2]        [,3]
 [1,] -0.06587452  0.39294336 -0.90048466
 [2,] -0.89033536 -0.01630780  0.46160952
 [3,]  0.33866397  0.51550858 -1.57063280
 [4,] -0.71810315  1.37075870 -0.01683463
 [5,]  1.17525491  2.57590579  2.01305832
 [6,]  0.46963797 -0.47893295 -0.91554740
 [7,]  0.11781701 -1.07969890  1.22377873
 [8,]  0.01706419  0.37702424 -1.48680882
 [9,] -0.60117586 -1.12464792 -0.04505445
[10,]  0.65445550 -0.89895199 -0.33675460
[11,] -0.46740331 -0.14788320 -0.46900387
[12,] -0.13923760 -0.97996173  0.98174380
[13,] -0.23643419 -0.07554011  0.04439525
[14,]  0.28536698 -0.19295410  0.51756617
[15,]  1.66239672  0.42712450 -0.41495287
[16,]  0.76515225 -0.16836835 -0.72800719
[17,] -3.15880133  0.32106568 -0.23662794
[18,] -0.53629531  1.36900100  0.80062552
[19,]  1.04829236 -0.44018579 -0.75733645
[20,]  0.27955875 -1.74589901  1.83526836

$scores$yscores
             [,1]        [,2]        [,3]
 [1,] -0.23742244 -0.91888370 -0.28185833
 [2,] -1.00085572  1.68690015 -0.47289464
 [3,] -0.02345494  0.89826285  0.67222000
 [4,] -0.17718803 -0.26188291  0.55274626
 [5,]  1.14084951  0.23274696  1.37918010
 [6,] -0.15539717  2.00062200  1.56074166
 [7,]  1.15328755  0.10127530 -0.19445711
 [8,]  0.05512308 -1.01048386  0.50220023
 [9,] -0.23394065 -1.24840794  0.39411232
[10,]  1.31166763  0.13435186  0.64809096
[11,] -1.00146790 -0.93479995 -0.66871744
[12,] -0.02551244  0.60309281  1.03278901
[13,] -0.62373985 -0.83299874 -0.01462037
[14,] -0.23957331 -0.70439205  0.27987584
[15,]  1.56116497  0.76448365 -3.09433899
[16,]  0.97041241  0.04660035 -0.54360525
[17,] -2.46610861  0.21954878 -0.65396658
[18,] -0.71723790  1.44951672 -0.88137354
[19,]  1.30318577 -0.85790412  0.04265917
[20,] -0.59379197 -1.36764817 -0.25878331
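A quick sanity check (assuming cc1 from the cc() call above): the correlation between each pair of canonical variates reproduces the corresponding canonical correlation, since standardizing the variates does not change their correlations.

cor(cc1$scores$xscores[, 1], cc1$scores$yscores[, 1])   # ~0.879, the first canonical correlation
diag(cor(cc1$scores$xscores, cc1$scores$yscores))       # all three canonical correlations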

Canonical structure: correlations

$scores$corr.X.xscores     # correlations of the phys variables with the canonical variates U
             [,1]       [,2]       [,3]
weight -0.8028458 0.53345479 -0.2662041
waist  -0.9871691 0.07372001 -0.1416419
pulse   0.2061478 0.10981908  0.9723389

$scores$corr.Y.xscores     # correlations of the exer variables with the canonical variates U
            [,1]        [,2]         [,3]
chins  0.6101751  0.18985890  0.004125743
situps 0.8442193 -0.05748754 -0.010784582
jumps  0.3638095  0.09727830 -0.052192182

$scores$corr.X.yscores     # correlations of the phys variables with the canonical variates V
             [,1]       [,2]         [,3]
weight -0.7053627 0.14136116 -0.016680651
waist  -0.8673051 0.01953520 -0.008875444
pulse   0.1811170 0.02910116  0.060927845

$scores$corr.Y.yscores     # correlations of the exer variables with the canonical variates V
            [,1]       [,2]        [,3]
chins  0.6945030  0.7164708  0.06584216
situps 0.9608928 -0.2169408 -0.17210961
jumps  0.4140890  0.3670993 -0.83292764

Significance: p.asym in CCP

vCancor<-cc1$cor
# p.asym(rho, N, p, q, tstat = "Wilks|Hotelling|Pillai|Roy")
res<-p.asym(vCancor, length(md$weight), 3, 3, tstat = "Wilks")

Wilks' Lambda, using F-approximation (Rao's F):
              stat    approx df1      df2     p.value
1 to 3:  0.2112505 3.4003788   9 34.22293 0.004421278
2 to 3:  0.9261286 0.2933756   4 30.00000 0.879945478
3 to 3:  0.9960736 0.0630703   1 16.00000 0.804904236

plt.asym(res, rhostart=1)
plt.asym(res, rhostart=2)
plt.asym(res, rhostart=3)

The three rows test, in turn: is at least one canonical correlation significant? Is there a significant relationship after excluding the first canonical correlation? And after excluding the first and second? Here only the first test is significant (p ≈ 0.0044), so only the first canonical correlation is supported by the data.
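A small check of the first row (assuming vCancor from above): the "1 to 3" Wilks' lambda is the product of (1 - rho_i^2) over all three canonical correlations.

prod(1 - vCancor^2)   # ~0.2113, matching the "1 to 3" statistic above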

Multivariate statistics
- PCA: principal component analysis
- Canonical correlation
- Cluster analysis
- ...

Cluster analysis

# Use the US crime data (md, as read in for the PCA example)
nd<-scale(md[,2:8])                  # standardize; use nd<-md[,-1] to cluster on the raw values
rownames(nd)<-md$STATE
d<-dist(nd, method="euclidean")      # other distances: "maximum", "manhattan", "Canberra", "binary", "minkowski"
hc<-hclust(d, method="average")      # average linkage (UPGMA); hclust's default is complete linkage
plot(hc, hang=-1)                    # hang=-1: align all leaf labels at the same level
rect.hclust(hc, k=4, border="red")   # draw boxes around 4 clusters

# export the tree in Newick format
library(ape)
class(hc)                            # must be of class hclust
my_tree <- as.phylo(hc)
write.tree(phy=my_tree, file="clipboard")
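To get the cluster membership itself rather than just the boxes on the dendrogram, a small addition (assuming hc from above):

grp <- cutree(hc, k=4)   # assign each state to one of 4 clusters
table(grp)               # cluster sizes
sort(grp)                # states grouped by cluster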

Group into clusters

[Dendrogram of the 50 states from the hclust call above, with rect.hclust boxes marking the clusters.]

k-means & number of clusters

[Plot of the within-group sum of squares (WSS) against the number of clusters.]

- Total sum of squares (TSS): sum of squared distances between each point and the overall centroid.
- Between-group sum of squares (BSS): sum of squared distances between the group centroids and the overall centroid, weighted by group size.
- Within-group sum of squares (WSS): sum of squared distances between each point and its own group centroid.
TSS = BSS + WSS.
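A short check of this decomposition with kmeans (assuming nd, the scaled crime data from the cluster-analysis slide; the choice of 3 clusters is arbitrary):

fit <- kmeans(nd, centers=3, nstart=25)   # nstart > 1 to avoid poor local optima
fit$totss                                 # TSS
fit$betweenss + fit$tot.withinss          # BSS + WSS, equal to TSS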

WSS plot

If the clustering at each k is truly optimized, WSS should decrease monotonically as the number of clusters increases (equivalently, the gain BSS_i - BSS_(i-1) shrinks). Look for the "elbow" where the decrease levels off; in this plot it suggests 3 clusters.

Determining the number of clusters (k)

# Rationale: plot the within-cluster sum of squares (WSS) against k for k = 1:15
# Note: md here must be numeric, e.g., the scaled crime data nd from the earlier slide
# apply(md, 2, var): variance of each column; DF * var = SS
totalWSS <- (nrow(md)-1)*sum(apply(md, 2, var))   # WSS for k = 1 (a single cluster)

# k-means clustering with 1, 2, ..., 15 clusters, computing WSS for each
WSS <- rep(0, 15)
for (i in 1:15) WSS[i] <- sum(kmeans(md, centers=i)$withinss)
numCluster <- 1:15
plot(numCluster, WSS, type="b", xlab="Num_Cluster", ylab="Within_groups_SS")
# Because the k-means algorithm is heuristic, repeat this several times and check the WSS[i] values.

# K-means cluster analysis: the fit contains $cluster, $centers, $totss, $withinss, $betweenss
fit <- kmeans(md, 5)   # 5-cluster solution

# Repeat the 5-cluster solution 10 times and record the % of variance explained (BSS/TSS)
fitArray <- array(list(), 10)
MSS <- rep(0, 10)
for (i in 1:10) {
  fitArray[[i]] <- kmeans(md, 5)
  MSS[i] <- 100*fitArray[[i]]$betweenss/fitArray[[i]]$totss
}

# Same for a 4-cluster solution (overwrites the arrays above)
fitArray <- array(list(), 10)
MSS <- rep(0, 10)
for (i in 1:10) {
  fitArray[[i]] <- kmeans(md, 4)
  MSS[i] <- 100*fitArray[[i]]$betweenss/fitArray[[i]]$totss
}