Introduction to Multivariate Analysis
Biology 4605/7220
Chih-Lin Wei, Canadian Healthy Oceans Network Postdoctoral Fellow, Ocean Sciences Centre, MUN

My Background
Benthic ecologist: community ecology
How environments control macroecological patterns in the deep sea
Interested in R but “NOT a statistician”
Education: BS in Zoology in Taiwan; MS & PhD in Biological Oceanography, Texas A&M University
Current project: scaling up regional benthic diversity and standing stock patterns using ecological modeling approaches

Lecture Contents
Visualization
Resemblance index
Cluster analysis
Ordination
Correlation
Testing for difference
Other stuff
Clarke & Warwick (2001)

Front Matter
Mostly non-parametric, permutation-based techniques
Each topic starts with the graphical concept
Followed by examples in simple R code
No more than 3 lines of code for each example
Most functions are in base R or the package “vegan”
All analyses are also available in the commercial software PRIMER-E (demo version available)

R packages
    # Install and load R packages
    install.packages( c("vegan", "scatterplot3d", "reshape2", "lattice", "clustsig") )
    library( vegan )
    library( scatterplot3d )
    library( reshape2 )
    library( lattice )
    library( clustsig )

First things first: plot the data
    # Violent crime rates by US state
    USArrests
    plot( USArrests[,1:2] )

3D Scatter Plot
    scatterplot3d( USArrests[,1:3] )

Scatterplot Matrices
    pairs( USArrests )

Lattice Graphs
    # Melt data frame to long (flat) format
    m = melt( USArrests, id.vars = "Assault" )
    m
    # Multipanel scatter plot
    xyplot( value ~ Assault | variable, data = m )

Resemblance/distance Indices
Clarke & Warwick (2001)
*Not good for data with many zeros (e.g. species abundance)

Resemblance/distance Indices
D = 0 if the two samples are identical
D = 1 if the two samples have no species in common
Better for species abundance data (with many zeros)

Resemblance/distance Indices
    # Euclidean distance
    dist( USArrests )
    # Bray-Curtis dissimilarity
    # Vegetation in lichen pastures
    data( varespec )
    varespec
    vegdist( varespec )
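The D = 0 and D = 1 properties can be checked by hand. The sketch below uses base R only and assumes the index in question is Bray-Curtis (the default method of vegdist()). The function name bray_curtis and the three abundance vectors are hypothetical, made up for illustration.

```r
# Bray-Curtis dissimilarity between two abundance vectors (base R only)
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Hypothetical abundances of four species in three samples
a <- c(10, 0, 5, 0)
b <- c(10, 0, 5, 0)   # identical to a
s <- c(0, 7, 0, 3)    # shares no species with a

bc_same     <- bray_curtis(a, b)   # 0: identical samples
bc_disjoint <- bray_curtis(a, s)   # 1: no species in common

# Euclidean distance has no fixed upper bound for samples with no shared species
eu_disjoint <- sqrt(sum((a - s)^2))
print(c(bc_same, bc_disjoint, eu_disjoint))
```

Joint absences (species missing from both samples) add nothing to either sum, which is one reason this kind of index suits zero-rich abundance data.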

Hierarchical Clustering
Patterns in a distance or dissimilarity matrix are difficult to detect by eye.
Find natural groupings by successively fusing samples.

Hierarchical Clustering
Linkage options:
Single linkage (nearest neighbour clustering)
Complete linkage (furthest neighbour clustering)
Group-average linkage
Ward’s minimum variance
[Figure: two groups of samples (Group 1, Group 2) plotted on Sp 1 and Sp 2 axes, showing the single-link and complete-link distances]
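The first three linkage options differ only in how the distance between two groups is summarised. A minimal base R sketch, using two hypothetical groups of two samples each (the coordinates are made up for illustration; Ward's method is not covered here):

```r
# Two hypothetical groups of samples described by two species (Sp1, Sp2)
g1 <- rbind(c(1, 1), c(2, 1))
g2 <- rbind(c(4, 1), c(6, 1))

# All pairwise Euclidean distances between members of group 1 and group 2
between <- as.matrix(dist(rbind(g1, g2)))[1:2, 3:4]

single_link   <- min(between)    # nearest neighbours
complete_link <- max(between)    # furthest neighbours
average_link  <- mean(between)   # group average
print(c(single_link, complete_link, average_link))
```

For these points the three summaries are 2, 5 and 3.5, so the same pair of groups can fuse at quite different heights depending on the linkage.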

Hierarchical Clustering
    # Normalization: with center = FALSE, each column is divided by its root mean square
    arrest = scale( USArrests, center = FALSE )
    # Euclidean distance
    d = dist( arrest )
    # Dendrograms
    plot( hclust( d, "single" ) )
    plot( hclust( d, "complete" ) )
    plot( hclust( d, "average" ) )
    # "ward" was renamed "ward.D" in R 3.1.0
    plot( hclust( d, "ward.D" ) )

Determine Numbers of Clusters
    # Using Ward's method ("ward" was renamed "ward.D" in R 3.1.0)
    clus = hclust( d, "ward.D" )
    plot( clus )
    # Cut into 3 groups
    rect.hclust( clus, k = 3 )
[Figure: dendrogram with cluster rectangles for K = 3 and K = 6]

Determine Significant Clusters
Clarke et al. (2008, JEMBE 366:56-69)

Similarity Profile Test
    # Defaults: 999 permutations, group-average clustering, alpha = 0.05
    clus2 = simprof( arrest )
    simprof.plot( clus2 )
* Colors = significant clusters

Motivations for Ordination
A dendrogram is still difficult to interpret.
Clustering forces samples into groups even though the compositional change may be continuous.
Ordination reduces the dimensionality of multivariate data (the "data cloud", so to speak).
Preferably, it captures most of the information in two dimensions, so that the multivariate patterns can be shown on a scatter plot.

Principal Component Analysis (PCA)
[Figure: 2-species example, Clarke & Warwick (2001)]

Principal Component Analysis (PCA)
PC1 maximizes the variance of the points projected onto it.
PC2 is perpendicular to PC1.
PC3 is perpendicular to PC1 and PC2.
The new orthogonal axes are linear combinations of the original variables:
    PC1 =  0.62 Sp1 + 0.52 Sp2 + 0.58 Sp3
    PC2 = -0.73 Sp1 + 0.65 Sp2 + 0.2 Sp3
    PC3 =  0.28 Sp1 + 0.55 Sp2 - 0.79 Sp3
[Figure: 3-species example, Clarke & Warwick (2001)]
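Both properties (orthogonal axes, scores as linear combinations) can be verified with an eigen decomposition in base R. This is a sketch, not the lecture's method: it assumes fully standardised variables (centred and scaled), which differs from the center = FALSE scaling and princomp() call used elsewhere in these slides, and the object names are illustrative.

```r
# PCA by hand on the built-in USArrests data (base R only)
x <- scale(USArrests, center = TRUE, scale = TRUE)

# Eigenvectors of the covariance matrix are the PC loadings
e <- eigen(cov(x))
loadings <- e$vectors

# Scores are linear combinations of the (standardised) variables
scores <- x %*% loadings

# Axes are orthogonal: t(loadings) %*% loadings is the identity matrix
print(round(crossprod(loadings), 10))

# Variance along each PC equals its eigenvalue, largest first
print(round(apply(scores, 2, var), 3))
print(round(e$values, 3))
```

Each column of loadings holds the coefficients of one PC, in the same sense as the PC1, PC2 and PC3 equations above.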

Principal Component Analysis (PCA)
    # PCA
    pca = princomp( arrest )
    # New orthogonal axes
    pairs( pca$scores )

Principal Component Analysis (PCA)
    # Variable contributions
    # PC1 = -0.65 Murder -0.6 Assault -0.46 Rape
    pca$loadings
    # Variance of PC axes
    plot( pca )
    # Total variance explained
    summary( pca )

Principal Component Analysis (PCA)
    # Cut dendrogram into 6 clusters
    group = cutree( clus, 6 )
    # Label each state on the PC1-PC2 plane, colored by cluster
    plot( pca$scores, type = "n" )
    text( pca$scores, names( group ), col = group )

Principal Component Analysis (PCA)
    # Add variable contributions
    biplot( pca, scale = 0 )

Non-Metric Multidimensional Scaling (nMDS)
Ordination based on a ranked resemblance (or distance) matrix
Robust and flexible: works with all kinds of resemblance indices
An iterative procedure successively refines the locations of the ordination points according to the ranked dissimilarities between samples
A better choice than PCA for species abundance data

Non-Metric Multidimensional Scaling (nMDS)
    mds = metaMDS( arrest )
    stressplot( mds )

Non-Metric Multidimensional Scaling (nMDS)
    # Ordination with 6 clusters
    plot( mds$points, type = "n" )
    text( mds$points, names( group ), pch = group, col = group )
    # Add variable scores (weighted averages)
    biplot( mds$points, mds$species )

Correlation between Matrices
    # Vegetation and environment in lichen pastures
    data( varespec )
    data( varechem )
    # Bray-Curtis dissimilarity
    veg.dist = vegdist( varespec )
    # Euclidean distance
    env.dist = dist( scale( varechem ) )

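Mantel Test
[Diagram: the sites-by-species matrix is converted to Bray-Curtis (BC) dissimilarities between sites and ranked; the sites-by-environment matrix is converted to Euclidean distances (ED) between sites and ranked; the two are then compared by correlation (ρ)]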

Mantel Test
    # Mantel test
    # Based on 999 permutations
    # Pearson's correlation
    man = mantel( veg.dist, env.dist )
    man
    # Distribution of permuted r
    hist( man$perm )
Result: r = 0.3
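The permutation logic behind the Mantel test can be sketched in base R. This is an illustrative reimplementation, not vegan's code. It assumes two distance matrices built from different columns of the built-in USArrests data, a hypothetical pairing chosen only so the example runs without extra packages.

```r
set.seed(1)  # reproducible permutations

# Two distance matrices for the same 50 objects (built-in USArrests)
d1 <- as.matrix(dist(scale(USArrests[, c("Murder", "Assault")])))
d2 <- as.matrix(dist(scale(USArrests[, c("UrbanPop", "Rape")])))

# Mantel statistic: Pearson correlation between the lower triangles
lower <- lower.tri(d1)
r_obs <- cor(d1[lower], d2[lower])

# Null distribution: permute rows and columns of one matrix together
n <- nrow(d1)
r_perm <- replicate(999, {
  i <- sample(n)
  cor(d1[i, i][lower], d2[lower])
})

# One-sided permutation p-value, counting the observed value itself
p_value <- (sum(r_perm >= r_obs) + 1) / (length(r_perm) + 1)
print(c(r = r_obs, p = p_value))
```

Rows and columns are shuffled together because the entries of a distance matrix are not independent; permuting individual distances would break the structure of the matrix.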

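Best Environmental Subsets
[Diagram: the sites-by-species matrix is converted to Bray-Curtis (BC) dissimilarities between sites and ranked; the sites-by-environment matrix is converted to Euclidean distances (ED) between sites and ranked; the two are then compared by correlation (ρ)]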

BIOENV
    # Find the subset of environmental variables with the best correlation to the community data
    # 16383 possible subsets
    bioenv( varespec, varechem )
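The count of 16383 follows from the number of candidate variables. A quick check, assuming the 14 environmental variables in varechem (the value of p is typed in by hand here so the snippet does not need vegan):

```r
# Number of non-empty subsets of p candidate variables
p <- 14                            # varechem has 14 environmental variables
n_subsets <- sum(choose(p, 1:p))   # subsets of size 1, 2, ..., p
print(n_subsets)                   # 16383
print(2^p - 1)                     # same count
```

The count doubles with every added variable, so an exhaustive search is only practical for a modest number of candidates.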

Testing Group Difference for Community Data
    # Vegetation in Dutch dune meadows
    data( dune )
    dune
More species (variables) than samples
Dominance of zero values
Violates multivariate normality and constant variance across groups
A robust, permutation-based test is needed for community data.

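Analysis of Similarity (ANOSIM)
R = 1: samples within groups are more similar to each other than to samples from other groups
R = 0: between-group and within-group similarities are the same on average
R is an absolute measure of group separation
r_B = average rank of dissimilarities between groups
r_W = average rank of dissimilarities within groups
n = total number of samples
[Diagram: the sites-by-species matrix is converted to Bray-Curtis (BC) dissimilarities between sites, which are then ranked]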
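With these definitions, the ANOSIM statistic is usually written as below. This is the standard textbook form (Clarke & Warwick 2001), given as an assumption about what the slide displayed:

```latex
R = \frac{\bar{r}_B - \bar{r}_W}{n(n-1)/4}
```

The denominator is half the number of sample pairs, n(n-1)/2, which scales R so that it equals 1 when every between-group dissimilarity ranks above every within-group dissimilarity.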

Analysis of Similarity (ANOSIM)
    # Environmental factors in Dutch dune meadows
    data( dune.env )
    # Does moisture have an effect on vegetation?
    Moisture = as.numeric( dune.env$Moisture )
    # Run an MDS on dune vegetation
    mds = metaMDS( dune )
    # MDS plot seems to suggest a moisture effect
    plot( mds$points, pch = 21, bg = Moisture, cex = Moisture )

Analysis of Similarity (ANOSIM)
    aos = anosim( dune, Moisture )
    aos
    # Distribution of permuted R
    hist( aos$perm )
Result: R = 0.43
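To see what the statistic measures, R can be computed by hand in base R. This sketch is an illustrative reimplementation, not vegan's code. The function name anosim_R and the six-sample data set are hypothetical, and Euclidean distance stands in for Bray-Curtis to avoid extra packages.

```r
# ANOSIM R statistic: (mean between-group rank - mean within-group rank) / (n(n-1)/4)
anosim_R <- function(d, grp) {
  r <- rank(as.vector(d))               # rank all pairwise dissimilarities
  same <- outer(grp, grp, "==")         # TRUE where a pair shares a group
  within <- same[lower.tri(same)]       # same pair order as the dist object
  n <- length(grp)
  (mean(r[!within]) - mean(r[within])) / (n * (n - 1) / 4)
}

# Hypothetical data: 6 samples, 2 species, 2 well-separated groups of 3
x <- rbind(c(1, 0), c(2, 1), c(1, 2),
           c(8, 9), c(9, 7), c(7, 8))
grp <- c(1, 1, 1, 2, 2, 2)
d <- dist(x)

R_obs <- anosim_R(d, grp)               # 1: groups are completely separated

# Permutation test: shuffle group labels and recompute R
set.seed(1)
R_perm <- replicate(999, anosim_R(d, sample(grp)))
p_value <- (sum(R_perm >= R_obs) + 1) / (length(R_perm) + 1)
print(c(R = R_obs, p = p_value))
```

With only six samples there are few distinct relabelings, so the p-value cannot become very small even though R is at its maximum.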

Other Useful Functions
Clustering:
pam() for clustering around medoids and clara() for clustering large data sets (both in “cluster”)
pvclust() in “pvclust” for assessing the uncertainty in hierarchical cluster analysis
Ordination:
Great PCA video explanation on YouTube
imputePCA() in “missMDA” for handling missing data
cca() and rda() in “vegan” for constrained ordination
Testing difference:
mrpp() in “vegan” for an ANOSIM-type analysis that uses the original dissimilarities instead of their ranks
adonis() in “vegan” for robust and flexible permutational multivariate analysis of variance (e.g. factorial & nested designs, mixed models, etc.)
betadisper() in “vegan” for testing constant multivariate variance (or dispersion)
