1 Principal Component Analysis Canonical Correlation Cluster Analysis
Xuhua Xia
Let me first emphasize that taking a systems biology perspective in biological research is not new. Its roots can be traced at least to Ronald Fisher......

2 Multivariate statistics
PCA: principal component analysis
Canonical correlation
Cluster analysis
......

3 PCA
Given a set of variables \(x_1, x_2, \ldots, x_n\):
find a set of coefficients \(a_{11}, a_{12}, \ldots, a_{1n}\) so that \(PC_1 = a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n\) has the maximum variance \(v_1\), subject to the constraint that \(a_1\) is a unit vector, i.e., \(\sqrt{a_{11}^2 + a_{12}^2 + \cdots + a_{1n}^2} = 1\);
find a 2nd set of coefficients \(a_2\) so that \(PC_2\) has the maximum variance \(v_2\), subject to the unit-vector constraint and the additional constraint that \(a_2\) is orthogonal to \(a_1\);
find the 3rd, 4th, ..., nth sets of coefficients so that \(PC_3, PC_4, \ldots\) have the maximum variance \(v_3, v_4, \ldots\), subject to the unit-vector constraint and to each \(a_i\) being orthogonal to all preceding vectors \(a_1, \ldots, a_{i-1}\).
It turns out that \(v_1, v_2, \ldots\) are the eigenvalues, and \(a_1, a_2, \ldots\) the eigenvectors, of the variance-covariance matrix of \(x_1, x_2, \ldots, x_n\) (or of the correlation matrix if the variables are standardized).
Demonstrate how to find the eigenvalues, eigenvectors, and PC scores manually in EXCEL (see the R sketch below).
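As a minimal sketch of the same computation in R (the toy data frame md here is hypothetical, standing in for the toy dataset on the next slide, whose values were lost in the transcript):

# hypothetical toy data
md <- data.frame(X1 = c(1, 2, 3, 4, 5), X2 = c(1.2, 1.9, 3.2, 4.1, 4.8))
S <- cov(md)                    # variance-covariance matrix
eig <- eigen(S)                 # eigen decomposition
eig$values                      # v1, v2: variances of PC1, PC2
eig$vectors                     # a1, a2: unit-length coefficient vectors (columns)
ctr <- scale(md, center=TRUE, scale=FALSE)    # center the data
scores <- as.matrix(ctr) %*% eig$vectors      # PC scores
apply(scores, 2, var)           # matches eig$values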

4 Toy Datasets
[Two toy datasets with variables X1 and X2; values lost in the transcript.]
"To demonstrate the physical principles involved in forming a rainbow, it is not necessary to create a rainbow spanning the sky – a small one is sufficient." C.C. Li

5 R functions
options("scipen"=100, "digits"=6)      # don't use scientific notation
fit.cov <- prcomp(~X1+X2)              # or prcomp(md); PCA on the covariance matrix (default)
fit.cor <- prcomp(md, scale.=TRUE)     # scale.=TRUE requests PCA on the correlation matrix
predict(fit.cov, md)                   # centered PC scores
predict(fit.cov, data.frame(X1=0.3, X2=0.5))
screeplot(fit.cov)                     # helps decide how many PCs to keep when there are many variables

6 prcomp Output
Standard deviations: one per PC, the square roots of the eigenvalues (numeric values lost in the transcript; better reported as the variance, i.e., the eigenvalue, accounted for by each PC).
Rotation: the matrix of eigenvectors, one column per PC, so that PC1 is a linear combination of X1 and X2 with the column-1 coefficients.
Principal component scores are generated by as.matrix(md) %*% as.matrix(fit$rotation); centered ones are obtained from fit$x (fit being the prcomp object, e.g., fit.cov).
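A quick check that the two kinds of scores agree after centering (a sketch, using fit.cov from the previous slide):

raw <- as.matrix(md) %*% fit.cov$rotation    # uncentered PC scores
ctr <- fit.cov$x                             # centered PC scores
sweep(raw, 2, colMeans(raw)) - ctr           # ~ all zeros: same scores up to centering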

7 PCA on correlation matrix (scale.=TRUE)
Output has the same structure as on the previous slide: standard deviations (square roots of the eigenvalues of the correlation matrix), the rotation matrix of eigenvectors for PC1 and PC2, and the PC scores (numeric values lost in the transcript).

8 Crime Data in 50 States
Variables: STATE, MURDER, RAPE, ROBBERY, ASSAULT, BURGLARY, LARCENY, AUTO, with one row per state (ALABAMA, ALASKA, ARIZONA, ARKANSAS, CALIFORNIA, ...; full data set in EXCEL).
nd <- md[,2:8]
rownames(nd) <- md$STATE
PCA.cor <- prcomp(nd, scale.=TRUE)   # use correlation matrix
PCA.cor
summary(PCA.cor)
PCScore <- predict(PCA.cor, nd)      # centered PC scores
screeplot(PCA.cor, type="l")

9 Correlation Matrix
[7 x 7 correlation matrix of MURDER, RAPE, ROBBERY, ASSAULT, BURGLARY, LARCENY, and AUTO; numeric values lost in the transcript.]
If the variables were not correlated, there would be no point in doing PCA. The correlation matrix is symmetric, so we only need to inspect either the upper or the lower triangle.
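The matrix can be reproduced from nd as defined on the previous slide:

round(cor(nd), 3)    # 7 x 7 correlation matrix, rounded for inspection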

10 Eigenvalues
> summary(objPCA)
Importance of components: for PC1 through PC7, summary() reports the standard deviation, the proportion of variance, and the cumulative proportion (numeric values lost in the transcript).
screeplot(objPCA, type = "lines")
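The same quantities can be computed directly from the standard deviations (objPCA being the prcomp fit, called PCA.cor on slide 8):

ev <- objPCA$sdev^2    # eigenvalues (variances of the PCs)
pv <- ev / sum(ev)     # proportion of variance for each PC
cumsum(pv)             # cumulative proportion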

11 Eigenvectors
Do these eigenvectors mean anything?
[7 x 7 matrix of eigenvectors: loadings of MURDER, RAPE, ROBBERY, ASSAULT, BURGLARY, LARCENY, and AUTO on PC1 through PC7; numeric values lost in the transcript.]
All crimes are negatively correlated with the first eigenvector, which is therefore interpreted as a measure of overall safety. The 2nd eigenvector has positive loadings on AUTO, LARCENY, and ROBBERY and negative loadings on MURDER, ASSAULT, and RAPE; it is interpreted to measure the preponderance of property crime over violent crime......

12 PC Plot: Crime Data
[Scatterplot of the states on PC1 and PC2. Labeled extremes: Maryland; Nevada, New York, California; North and South Dakota; Mississippi, Alabama, Louisiana, South Carolina.]
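A sketch of how such a plot can be drawn from PCScore computed on slide 8:

plot(PCScore[,1], PCScore[,2], type="n", xlab="PC1", ylab="PC2")
text(PCScore[,1], PCScore[,2], labels=rownames(nd), cex=0.7)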

13 Multivariate statistics
PCA: principal component analysis
Canonical correlation
Cluster analysis
......

14 Correlation
Simple correlation: between two variables.
Multiple correlation: between one variable and a set of other variables (e.g., x1 on x2 and x3).
Partial correlation: between X and Y with Z being controlled for.
Canonical correlation: between two sets of variables, each containing more than one variable. Simple and multiple correlations are special cases of canonical correlation.
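The next slide refers to "the equation in the previous slide"; the equation itself was lost in the transcript, but it is presumably the standard first-order partial correlation formula:

\[ r_{XY\cdot Z} = \frac{r_{XY} - r_{XZ}\,r_{YZ}}{\sqrt{(1 - r_{XZ}^2)\,(1 - r_{YZ}^2)}} \]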

15 Review of correlation
[Data table with three variables X, Z, and Y; values lost in the transcript.]
Compute Pearson correlation coefficients between X and Z, between X and Y, and between Z and Y. Compute the partial correlation coefficient between X and Y, controlling for Z (i.e., the correlation between X and Y when Z is held constant), using the equation on the previous slide. Run R to verify your calculation:
# install.packages("ggm")
library(ggm)
md <- read.table("XYZ.txt", header=TRUE)
cor(md)
s <- var(md)
parcor(s)    # or parcor(cor(md))
# install.packages("psych")
library(psych)
smc(s)       # squared multiple correlation

16 Data for canonical correlation
# First three variables: physical (weight, waist, pulse)
# Last three variables: exercise (chins, situps, jumps)
# Subjects: middle-aged men (data values lost in the transcript)
Use EXCEL to manually calculate the first canonical correlation (see the R sketch below).
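As a minimal R sketch of the same calculation via the standard eigenvalue route, assuming md holds the six variables in the order above:

R <- cor(md)
Rxx <- R[1:3, 1:3]                    # within the physical set
Ryy <- R[4:6, 4:6]                    # within the exercise set
Rxy <- R[1:3, 4:6]                    # between the two sets
M <- solve(Rxx) %*% Rxy %*% solve(Ryy) %*% t(Rxy)
sqrt(eigen(M)$values)                 # canonical correlations, largest first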

17 Canonical correlation (cc)
install.packages("ggplot2") install.packages("Ggally") install.packages("CCA") install.packages("CCP") require(ggplot2) require(GGally) require(CCA) require(CCP) phys<-md[,1:3] exer<-md[,4:6] matcor(phys,exer) cc1<-cc(phys,exer) cc1

18 cc output
$cor: canonical correlations
[1] 0.87857805 0.26499182 0.06266112
$xcoef and $ycoef: the raw canonical coefficient matrices U (rows weight, waist, pulse) and V (rows chins, situps, jumps); numeric values lost in the transcript.
phys*U gives the raw canonical variates for phys; exer*V gives the raw canonical variates for exer. $scores$xscores holds the standardized canonical variates.
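For example (a sketch, using cc1, phys, and exer from slide 17):

U <- cc1$xcoef; V <- cc1$ycoef
rawU <- as.matrix(phys) %*% U      # raw canonical variates for phys
rawV <- as.matrix(exer) %*% V      # raw canonical variates for exer
cor(rawU[,1], rawV[,1])            # reproduces the first canonical correlation (up to sign)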

19 standardized canonical variates
$scores$xscores and $scores$yscores: 20 x 3 matrices of standardized canonical variates, one row per subject (numeric values lost in the transcript).

20 Canonical structure: Correlations
$scores$corr.X.xscores: correlations between the phys variables and the canonical variates U
$scores$corr.Y.xscores: correlations between the exer variables and the canonical variates U
$scores$corr.X.yscores: correlations between the phys variables and the canonical variates V
$scores$corr.Y.yscores: correlations between the exer variables and the canonical variates V
(each a 3 x 3 matrix; numeric values lost in the transcript)
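These are ordinary correlations and can be reproduced directly (a sketch, using cc1 from slide 17):

cor(phys, cc1$scores$xscores)   # same as $scores$corr.X.xscores
cor(exer, cc1$scores$xscores)   # same as $scores$corr.Y.xscores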

21 Significance: p.asym in CCP
vCancor<-cc1$cor # p.asym(rho,N,p,q, tstat = "Wilks|Hotelling|Pillai|Roy") res<-p.asym(vCancor,length(md$weight),3,3, tstat = "Wilks") Wilks' Lambda, using F-approximation (Rao's F): stat approx df df2 p.value 1 to 3: 2 to 3: 3 to 3: plt.asym(res,rhostart=1) plt.asym(res,rhostart=2) plt.asym(res,rhostart=3) At least one cancor significant? Significant relationship after excluding cancor 1? Significant relationship after excluding cancor 1 and 2?

22 Multivariate statistics
PCA: principal component analysis
Canonical correlation
Cluster analysis
......

23 Cluster analysis
# Use US crime data
nd <- scale(md[,2:8])       # or nd <- md[,-1] for unscaled data
rownames(nd) <- md$STATE
d <- dist(nd, method="euclidean")
# other distances: "maximum", "manhattan", "canberra", "binary", "minkowski"
hc <- hclust(d, method="average")   # "average" is UPGMA (hclust's default is "complete")
plot(hc, hang=-1)           # hang=-1: all labels hang down to height 0
rect.hclust(hc, k=4, border="red")
# export the tree in Newick format
library(ape)
class(hc)                   # must be of class hclust
my_tree <- as.phylo(hc)
write.tree(phy=my_tree, file="clipboard")

24 Group into clusters
[Dendrogram of the states, with the 4 clusters from rect.hclust boxed in red.]
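To extract the cluster memberships themselves (a sketch, using hc from the previous slide):

groups <- cutree(hc, k=4)   # assign each state to one of 4 clusters
table(groups)               # cluster sizes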

25 kmean & number of clusters
Within-group SS

26 Total sum of squares (TSS)
TSS: the sum of squared distances between the grand centroid and all points. It partitions into the between-group sum of squares (BSS, based on the distances between the grand centroid and the cluster centroids) and the within-group sum of squares (WSS, the sum of squared distances between each point and its own cluster centroid): TSS = BSS + WSS.
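kmeans() in R returns all three quantities, so the identity can be checked directly (a sketch, assuming the scaled crime data nd from slide 23):

fit <- kmeans(nd, centers=4)
fit$totss                             # TSS
fit$betweenss + fit$tot.withinss      # BSS + WSS, equals TSS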

27 WSS plot
If the clustering at every k were truly optimized, WSS would decrease monotonically with an increasing number of clusters. [Plot of WSS against the number of clusters; the annotated elbow, where the gain \(BSS_i - BSS_{i-1}\) levels off, suggests 3 clusters.]

28 Determining number of clusters (k)
# Rationale: plot the within-cluster sum of squares (WSS) over k = 1:15
# apply(md, 1|2, var): variance of each row|column (1 = rows, 2 = columns)
# sum(apply(md, 2, var)): sum of the column variances; df * var = SS
totalWSS <- (nrow(md)-1) * sum(apply(md, 2, var))    # WSS when k = 1
# kmeans clustering with 1, 2, ..., 15 clusters, computing WSS for each
WSS <- rep(0, 15)
for (i in 1:15) WSS[i] <- sum(kmeans(md, centers=i)$withinss)
numCluster <- 1:15
plot(numCluster, WSS, type="b", xlab="Num_Cluster", ylab="Within_groups_SS")
# Necessary to do this multiple times because of the heuristic nature of the
# clustering algorithms; check the WSS[i] values.

# K-means cluster analysis; the fit contains $cluster, $centers, $totss,
# $withinss, $betweenss
fit <- kmeans(md, 5)    # 5-cluster solution

# Ten repeated runs, recording the percent of variance explained each time
# (run again with kmeans(md, 4) for the 4-cluster solution)
fitArray <- vector("list", 10)
MSS <- rep(0, 10)
for (i in 1:10) {
  fitArray[[i]] <- kmeans(md, 5)
  MSS[i] <- 100 * fitArray[[i]]$betweenss / fitArray[[i]]$totss
}

