Multivariate Description

What Technique?
Response variable(s) ... is one
• Predictor(s) No: distribution summary
• Predictor(s) Yes: regression models
Response variable(s) ... are many
• Predictor(s) No: indirect gradient analysis (PCA, CA, DCA, MDS); cluster analysis
• Predictor(s) Yes: direct gradient analysis; constrained cluster analysis; discriminant analysis (CVA)

Rotate the Variable Space

Raw Data

Linear Regression

Two Regressions

Principal Components

Gulls Variables

Scree Plot

Output
> summary(gulls.pca2)
Importance of components:
                          Comp.1     Comp.2     Comp.3
Standard deviation     1.8133342 0.52544623 0.47501980
Proportion of Variance 0.8243224 0.06921464 0.05656722
Cumulative Proportion  0.8243224 0.89353703 0.95010425
> gulls.pca2$loadings
Loadings:
        Comp.1 Comp.2 Comp.3 Comp.4
Weight  -0.505 -0.343  0.285  0.739
Wing    -0.490  0.852 -0.143  0.116
Bill    -0.500 -0.381 -0.742 -0.232
H.and.B -0.505 -0.107  0.589 -0.622
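The following is a minimal sketch of how output of this kind can be produced with princomp(). The gulls data are not part of this transcript, so the built-in USArrests data set stands in for them (an assumption made purely for illustration), and cor = TRUE is used to standardise the variables, which may not match how gulls.pca2 was fitted.

```r
# PCA with princomp(); USArrests is a stand-in for the gulls measurements.
pca <- princomp(USArrests, cor = TRUE)

# Proportion of variance carried by each component (the values a scree plot shows).
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
cum_var  <- cumsum(prop_var)

print(summary(pca))    # standard deviations and proportions of variance
print(loadings(pca))   # weight of each variable on each component
```

Calling screeplot(pca) and biplot(pca) on the fitted object draws the scree plot and the bi-plot.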

Bi-Plot

Male or Female?

Linear Discriminant
> gulls.lda <- lda(Sex ~ Wing + Weight + H.and.B + Bill, gulls)
lda(Sex ~ Wing + Weight + H.and.B + Bill, data = gulls)

Prior probabilities of groups:
        0         1
0.5801105 0.4198895

Group means:
      Wing    Weight  H.and.B     Bill
0 410.0381  871.7619 115.1143 17.62524
1 430.6118 1054.3092 125.9474 19.50789

Coefficients of linear discriminants:
                LD1
Wing    0.045512619
Weight  0.001887236
H.and.B 0.138127194
Bill    0.444847743
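A hedged sketch of a two-group linear discriminant analysis with lda() from the MASS package follows. Because the gulls data are not available here, two species from the built-in iris data stand in for the two sexes; the variable names and the resubstitution accuracy are therefore illustrative only.

```r
library(MASS)  # provides lda()

# Two-group stand-in for the gulls data: versicolor vs virginica irises.
two_groups <- droplevels(subset(iris, Species != "setosa"))

fit <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
           data = two_groups)

# With two groups there is a single discriminant axis (LD1).
pred      <- predict(fit, two_groups)
confusion <- table(observed = two_groups$Species, predicted = pred$class)
accuracy  <- sum(diag(confusion)) / sum(confusion)

print(fit$scaling)   # coefficients of the linear discriminant
print(confusion)
```

The accuracy computed here is a resubstitution estimate, so it is optimistic; cross-validation (for example lda(..., CV = TRUE)) gives a fairer figure.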

Discriminating

Relationship between PCA and LDA

CVA

CVA

Managing Dimensionality (but not acronyms) PCA, CA, RDA, CCA, MDS, NMDS, DCA, DCCA, pRDA, pCCA

Type of Data Matrix species attributes desert macroph inverts uses sites species attributes attributes watervar rain gulls individuals sites

Models of Species Response
There are (at least) two models:
• Linear - species increase or decrease along the environmental gradient
• Unimodal - species rise to a peak somewhere along the environmental gradient and then fall again
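The two response models can be sketched as simple functions of a gradient. The Gaussian curve used for the unimodal case, and all parameter names and values (intercept, slope, optimum, tolerance, maximum), are assumptions chosen for illustration rather than anything specified in the slides.

```r
# Hypothetical environmental gradient from 0 to 10.
gradient <- seq(0, 10, length.out = 101)

# Linear model: abundance changes steadily along the gradient.
linear_response <- function(x, intercept = 1, slope = 0.5) {
  intercept + slope * x
}

# Unimodal model (Gaussian form assumed): abundance peaks at the optimum
# and declines on either side at a rate set by the tolerance.
unimodal_response <- function(x, optimum = 5, tolerance = 1.5, maximum = 10) {
  maximum * exp(-(x - optimum)^2 / (2 * tolerance^2))
}

lin <- linear_response(gradient)
uni <- unimodal_response(gradient)
peak_position <- gradient[which.max(uni)]
```

Plotting lin and uni against gradient shows the straight line and the bell-shaped curve that the next slides illustrate.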

A Theoretical Model

Linear

Unimodal

Ordination Techniques
                           Linear methods                        Weighted averaging (unimodal)
Unconstrained (indirect)   Principal Components Analysis (PCA)   Correspondence Analysis (CA)
Constrained (direct)       Redundancy Analysis (RDA)             Canonical Correspondence Analysis (CCA)

Inferring Gradients from Species (or Attribute) Data

Indirect Gradient Analysis
Environmental gradients are inferred from species data alone
Three methods:
• Principal Component Analysis - linear model
• Correspondence Analysis - unimodal model
• Detrended CA - modified unimodal model

PCA - linear model

PCA - linear model

Terschelling Dune Data

PCA gradient - site plot

PCA gradient - site/species biplot (legend: standard, biodynamic & hobby, nature)

Making Effective Use of Environmental Variables

Approaches
• Use single responses in linear models of environmental variables
• Use axes of a multivariate dimension reduction technique as responses in linear models of environmental variables
• Constrain the multivariate dimension reduction into the factor space defined by the environmental variables

Ordination Constrained by the Environmental Variables

Constrained?

Working with the Variability that we Can Explain
• Start with all the variability in the response variables.
• Replace the original observations with their fitted values from a model employing the environmental variables as explanatory variables (discarding the residual variability).
• Carry out gradient analysis on the fitted values.
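The three steps above can be sketched in base R for the linear case: fit a multivariate regression on the environmental variables, keep the fitted values, and run a PCA on them. The data are simulated, and every name (env, z1, z2, sp1 to sp5) is hypothetical.

```r
set.seed(1)
n <- 30

# Simulated environmental variables and five response variables (hypothetical).
env <- data.frame(z1 = rnorm(n), z2 = rnorm(n))
B   <- matrix(c(1, -1,  0.5, 0,   2,
                0,  1, -0.5, 1.5, 0.5), nrow = 2, byrow = TRUE)
Y   <- as.matrix(env) %*% B + matrix(rnorm(n * 5, sd = 0.5), n, 5)
colnames(Y) <- paste0("sp", 1:5)

# Step 1: all the variability (unconstrained PCA).
unconstrained <- prcomp(Y)

# Step 2: replace observations by fitted values from the environmental model.
Y_fitted <- fitted(lm(Y ~ z1 + z2, data = env))

# Step 3: gradient analysis (here PCA) on the fitted values.
constrained <- prcomp(Y_fitted)

# Two predictors can span at most two constrained axes.
constrained_rank <- sum(constrained$sdev > 1e-8)
explained <- sum(constrained$sdev^2) / sum(unconstrained$sdev^2)
```

Here explained is the share of the total variance that the environmental variables account for; the constrained axes partition only that share.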

Unconstrained/Constrained Unconstrained ordination axes correspond to the directions of the greatest variability within the data set. Constrained ordination axes correspond to the directions of the greatest variability of the data set that can be explained by the environmental variables.

Dune Data Unconstrained

Direct Gradient Analysis
Environmental gradients are constructed from the relationship between species and environmental variables
Three methods:
• Redundancy Analysis - linear model
• Canonical (or Constrained) Correspondence Analysis - unimodal model
• Detrended CCA - modified unimodal model

Direct Gradient Analysis
Basic PCA:
$y_{ik} = b_{0k} + b_{1k} x_i + e_{ik}$
$x_i$ - the sample scores on the ordination axis
$b_{1k}$ - the regression coefficients for each species (the species scores on the ordination axis)
In RDA there is a further constraint on $x_i$:
$x_i = c_1 z_{i1} + c_2 z_{i2}$
Making:
$y_{ik} = b_{0k} + b_{1k} c_1 z_{i1} + b_{1k} c_2 z_{i2} + e_{ik}$

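Direct Gradient Analysis
cca(species_data ~ e1 + e2 + ... + en, data=environmental_data)  # general form
library(vegan); data(dune); data(dune.env)  # dune and dune.env ship with the vegan package
cca(dune ~ Manure + Moisture + A1, data=dune.env)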

Dune Data Constrained

Lake Nasser - Egypt

Nasser Data
Sites: 23 sampling stations on Lake Nasser
3 Data Frames:
• Aquatic macrophytes
• Invertebrate classes
• Water chemistry

Lake Nasser Unconstrained

Lake Nasser Constrained

Modelling Environmental Variables

Ways of Building Models
Automated environmental variable selection (stepwise addition or removal of variables from the model, as with multiple regression)
mod0 <- cca(nasser.inverts ~ 1, nasser.watervar)  # null model, no constraints
mod1 <- cca(nasser.inverts ~ ., nasser.watervar)  # full model, all water variables
op <- options(digits=7)
mod <- step(mod0, scope=formula(mod1))
options(op)
mod
plot(mod)

Ways of Building Models
Manual selection of environmental variables using prior knowledge (e.g. starting with the full model and removing terms)
mod1 <- cca(nasser.inverts ~ ., nasser.watervar)
mod2 <- cca(nasser.inverts ~ . -WMg, nasser.watervar)
mod3 <- cca(nasser.inverts ~ . -WMg -WEC, nasser.watervar)
mod4 <- cca(nasser.inverts ~ . -WMg -WEC -WCa, nasser.watervar)

Ways of Evaluating Models
Graphically using Procrustes Rotation
plot(procrustes(mod2, mod1))
plot(procrustes(mod3, mod2))
plot(procrustes(mod4, mod3))
plot(procrustes(mod4, mod1))
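To show what a Procrustes rotation does, here is a minimal base-R sketch that rotates one centred configuration onto another by singular value decomposition. It performs rotation only, with no scaling, so it is a simplified illustration and not a reimplementation of the procrustes() function called above; the function name procrustes_rotate and the test coordinates are hypothetical.

```r
# Rotate `config` to match `target` as closely as possible (least squares).
procrustes_rotate <- function(target, config) {
  target_c <- scale(target, scale = FALSE)   # centre both configurations
  config_c <- scale(config, scale = FALSE)
  s        <- svd(crossprod(target_c, config_c))
  rotation <- s$v %*% t(s$u)                 # optimal orthogonal matrix
  fitted   <- config_c %*% rotation
  list(rotation    = rotation,
       fitted      = fitted,
       residual_ss = sum((target_c - fitted)^2))
}

# Check on a configuration that is an exact 30-degree rotation of another.
theta  <- pi / 6
spin   <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)
target <- cbind(c(0, 1, 2, 0), c(0, 0, 1, 3))
config <- target %*% spin
result <- procrustes_rotate(target, config)
```

A residual sum of squares near zero means the two ordinations differ only by rotation; larger values indicate that dropping a term changed the configuration.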

Procrustes

Ways of Evaluating Models
Permutation Tests can be used to assess adequacy of the models using a Pseudo ANOVA or Permutest
anova(mod1)
anova(mod2)
anova(mod3)
anova(mod4)
permutest.cca(mod1, perm=1000)
permutest.cca(mod2, perm=1000)
permutest.cca(mod3, perm=1000)
permutest.cca(mod4, perm=1000)

Removing the Effect of Nuisance Variables

Getting rid of the Variability that is Not of Interest
Amongst the explanatory variables there may be variability attributable to:
• Blocks and other design strata
• Covariates that we can measure but are not the focus of interest
We may want to use only the variability attributable to:
• Meaningful Environmental Variables

Partial Analyses
• Remove the effect of covariates: variables that we can measure but which are of no interest, e.g. block effects, start values, etc.
• Carry out the gradient analysis on what is left of the variation after removing the effect of the covariates.
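For the linear case, "removing the effect of covariates" can be sketched as regressing both the responses and the variable of interest on the covariate and continuing with the residuals. The data are simulated and all names (block, env, sp1 to sp3) are hypothetical.

```r
set.seed(42)
n <- 40

block <- rnorm(n)                 # nuisance covariate (hypothetical)
env   <- 0.6 * block + rnorm(n)   # variable of interest, correlated with block
Y <- cbind(sp1 =  2 * block + env       + rnorm(n, sd = 0.3),
           sp2 = -1 * block + 0.5 * env + rnorm(n, sd = 0.3),
           sp3 =      block - env       + rnorm(n, sd = 0.3))

# Remove the covariate from the responses and from the explanatory variable.
Y_res   <- residuals(lm(Y ~ block))
env_res <- residuals(lm(env ~ block))

# Gradient analysis on what is left: constrained PCA of the residualised data.
partial_axes <- prcomp(fitted(lm(Y_res ~ env_res)))

# The residualised responses no longer correlate with the covariate.
max_cor <- max(abs(cor(Y_res, block)))
```

With a single variable of interest there is one constrained axis, which summarises the variation in the responses that env explains after block has been accounted for.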

Lichen-rich Forest Understorey

Forest Data
• Sites: 28 sites in forests in Finland grazed by reindeer
• Species Data: 44 heathland plant species (including many lichens and mosses that are very sensitive to their chemical environment)
• Environmental Data: soil chemical composition (N P K Ca Mg S Al Fe Mn Zn Mo Baresoil Humdepth pH)

CCA

Removing pH Effect
cca(species_data ~ e1 + e2 + ... + en + Condition(e5), data=environmental_data)
cca(varespec ~ Al + P + K + Baresoil + Condition(pH), data=varechem)

Removing pH Effect

Interactions in Models
cca(species_data ~ e1 + e2 + ... + en + Condition(e5), data=environmental_data)
cca(varespec ~ Al + P*(K + Baresoil) + Condition(pH), data=varechem)

CCA

Removing pH Effect

Cluster Analysis

Different types of data (example)
• Continuous data: height
• Categorical data
  - ordered (ordinal): growth rate (very slow, slow, medium, fast, very fast)
  - not ordered (nominal): fruit colour (yellow, green, purple, red, orange)
• Binary data: fruit / no fruit

Similarity matrix We define a similarity between units – like the correlation between continuous variables. (also can be a dissimilarity or distance matrix) A similarity can be constructed as an average of the similarities between the units on each variable. (can use weighted average) This provides a way of combining different types of variables.
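A small sketch of a similarity built as an average over variables of different types is given below. It follows the general idea of Gower's coefficient (range-scaled differences for continuous variables, simple matching for the rest); the three example units, their values and the helper names sim_range and sim_match are made up for illustration.

```r
# Three hypothetical units measured on a continuous, a categorical
# and a binary variable.
units <- data.frame(
  height = c(1.0, 2.0, 4.0),
  colour = c("red", "red", "green"),
  fruit  = c(TRUE, FALSE, TRUE)
)

# Continuous variable: 1 minus the difference scaled by the range.
sim_range <- function(x) 1 - abs(outer(x, x, "-")) / diff(range(x))

# Categorical or binary variable: 1 if the two units match, 0 otherwise.
sim_match <- function(x) {
  x <- as.character(x)
  outer(x, x, "==") * 1
}

# Unweighted average of the per-variable similarities.
similarity <- (sim_range(units$height) +
               sim_match(units$colour) +
               sim_match(units$fruit)) / 3
print(round(similarity, 3))
```

Replacing the plain mean with a weighted mean gives the weighted version mentioned on the slide, and 1 - similarity gives a dissimilarity matrix.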

Distance metrics relevant for continuous variables:
• Euclidean
• city block or Manhattan
(also many other variations)
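The two metrics can be compared with the base-R dist() function. The coordinates of the points A and B below are arbitrary values chosen so that the results are easy to check by hand.

```r
# Two points in two dimensions (arbitrary illustrative coordinates).
pts <- rbind(A = c(0, 0), B = c(3, 4))

# Euclidean: straight-line distance, sqrt(3^2 + 4^2).
d_euclidean <- as.numeric(dist(pts, method = "euclidean"))

# City block (Manhattan): sum of absolute differences, 3 + 4.
d_manhattan <- as.numeric(dist(pts, method = "manhattan"))
```

The Manhattan distance is never smaller than the Euclidean distance between the same two points.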

A Distance Matrix

Uses of Distances
Distance/Dissimilarity can be used:
• To explore dimensionality in data (using PCO, principal coordinates analysis)
• As a basis for clustering/classification
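As a sketch of the first use, principal coordinates analysis is available in base R as cmdscale(). The built-in USArrests data are used here only as a stand-in; with Euclidean distances the principal coordinates coincide with PCA scores up to sign, which the code checks.

```r
# Standardised stand-in data and their Euclidean distance matrix.
x <- scale(USArrests)
d <- dist(x)

# Principal coordinates analysis (classical scaling) in two dimensions.
pco <- cmdscale(d, k = 2, eig = TRUE)

# For Euclidean distances, PCO axes equal PCA axes apart from sign.
pca_scores <- prcomp(x)$x[, 1:2]
max_diff   <- max(abs(abs(pco$points) - abs(pca_scores)))
```

The value of PCO is that d can be any dissimilarity matrix, including ones built from mixed or binary variables where PCA itself does not apply.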

UK Wet Deposition Network

Fitting Environmental Variables

A Map based on Measured Variables

Fitting Environmental Variables

Similarity coefficients for binary data
• simple matching: count if both units 0 or both units 1
• Jaccard: count only if both units 1
(also many other variants)
simple matching can be extended to categorical data
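The difference between the two coefficients is easy to see in code. The two binary vectors below are invented for illustration, and the function names are hypothetical.

```r
# Simple matching: proportion of variables on which the units agree (0/0 or 1/1).
simple_matching <- function(a, b) mean(a == b)

# Jaccard: joint presences divided by variables present in at least one unit.
# Joint absences (0/0) are ignored; the result is NaN if both units are all zero.
jaccard <- function(a, b) sum(a == 1 & b == 1) / sum(a == 1 | b == 1)

u <- c(1, 1, 0, 0, 1, 0)
v <- c(1, 0, 0, 0, 1, 1)

sm <- simple_matching(u, v)   # 4 agreements out of 6 variables
jc <- jaccard(u, v)           # 2 joint presences out of 4 presences
```

Jaccard is usually preferred for species presence/absence data, where a shared absence says little about how similar two sites are.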

Clustering methods
• hierarchical
  - divisive: put everything together and split (monothetic / polythetic)
  - agglomerative: keep everything separate and join the most similar points (classical cluster analysis)
• non-hierarchical
  - k-means clustering

Agglomerative hierarchical
Single linkage or nearest neighbour
• finds the minimum spanning tree: the shortest tree that connects all points
• chaining can be a problem

Agglomerative hierarchical
Complete linkage or furthest neighbour
• produces compact clusters of approximately equal size (makes compact groups even when none exist)

Agglomerative hierarchical
Average linkage methods
• between single and complete linkage
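The three linkage rules can be compared on the same distance matrix with the base-R hclust() function. The built-in USArrests data and the choice of four groups are assumptions made for illustration.

```r
# Standardised stand-in data and Euclidean distances.
d <- dist(scale(USArrests))

# Agglomerative clustering under the three linkage rules.
linkages <- c("single", "complete", "average")
trees <- lapply(linkages, function(m) hclust(d, method = m))
names(trees) <- linkages

# Cut each tree into four groups and tabulate the group sizes.
group_sizes <- lapply(trees, function(tree) table(cutree(tree, k = 4)))
print(group_sizes)

# All three rules start by joining the same closest pair of points.
first_merge <- sapply(trees, function(tree) tree$height[1])
```

Comparing the group sizes typically shows the chaining tendency of single linkage (one large group plus stragglers) against the more even groups from complete linkage; plot(trees$single) draws a dendrogram.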

From Alexandria to Suez

Hierarchical Clustering

Hierarchical Clustering

Hierarchical Clustering

Summarise by Weighted Averages

Species and Sites as Weighted Averages of each other
         1 1    1111 1 2111
SPP.    23466185750198304927
Bel per 3.2....2..2..22.....
Jun buf .3..........4.....42
Jun art ...3..4..3..4..4....
Air pra ........2........3..
Ele pal ...8..4..5.....44...
Rum ace ....6..5....2.....23
Vic lat ..........12.1......
Bra rut ..246.22.4242624.342
Ran fla .2.2..2..2.....42...
Hyp rad ........2..2.....5..
Leo aut 522.3.33223525222623
Pot pal .........2......2...
Poa pra 424.34421.44435....4
Cal cus ...3...........34...
Tri pra ....5..2...........2
Tri rep 521.5.22.163322.6232
Ant odo ....3..44.4......4.2
Sal rep .............3.5.3..
Ach mil 3...21.22.4........2
Poa tri 79524246..4.5.6...45
Ely rep 4.4..4.4....6.4.....
Sag pro .25...2....22....34.
Pla lan ....5..52.33.3.....5
Agr sto .587..4..4..3.454.4.
Lol per 5.5.6742..67226....6
Alo gen 2524..5.....3.7...8.
Bro hor 4.3....2..4........2

Species and Sites as Weighted Averages of each other

Reciprocal Averaging - unimodal
                    Site
Species             A     B     C     D     E     F
Prunus serotina     6     3     4     6     5     1
Tilia americana     2     0     7     0     6     6
Acer saccharum      0     0     8     0     4     9
Quercus velutina    0     8     0     8     0     0
Juglans nigra       3     2     3     0     6     0

Reciprocal Averaging - unimodal
                    Site                                   Species score (iteration)
Species             A     B     C     D     E     F        1
Prunus serotina     6     3     4     6     5     1        1.00
Tilia americana     2     0     7     0     6     6        0.63
Acer saccharum      0     0     8     0     4     9        0.63
Quercus velutina    0     8     0     8     0     0        0.18
Juglans nigra       3     2     3     0     6     0        0.00
Site score (iteration)
1                   1.00  0.00  0.86  0.60  0.62  0.99

Reciprocal Averaging - unimodal
                    Site                                   Species score (iteration)
Species             A     B     C     D     E     F        1     2
Prunus serotina     6     3     4     6     5     1        1.00  0.68
Tilia americana     2     0     7     0     6     6        0.63  0.84
Acer saccharum      0     0     8     0     4     9        0.63  0.87
Quercus velutina    0     8     0     8     0     0        0.18  0.30
Juglans nigra       3     2     3     0     6     0        0.00  0.67
Site score (iteration)
1                   1.00  0.00  0.86  0.60  0.62  0.99
2                   0.65  0.00  0.88  0.05  0.78  1.00

Reciprocal Averaging - unimodal
                    Site                                   Species score (iteration)
Species             A     B     C     D     E     F        1     2     3
Prunus serotina     6     3     4     6     5     1        1.00  0.68  0.50
Tilia americana     2     0     7     0     6     6        0.63  0.84  0.86
Acer saccharum      0     0     8     0     4     9        0.63  0.87  0.91
Quercus velutina    0     8     0     8     0     0        0.18  0.30  0.02
Juglans nigra       3     2     3     0     6     0        0.00  0.67  0.66
Site score (iteration)
1                   1.00  0.00  0.86  0.60  0.62  0.99
2                   0.65  0.00  0.88  0.05  0.78  1.00
3                   0.60  0.01  0.87  0.00  0.78  1.00

Reciprocal Averaging - unimodal
                    Site                                   Species score (iteration)
Species             A     B     C     D     E     F        1     2     3     9
Prunus serotina     6     3     4     6     5     1        1.00  0.68  0.50  0.48
Tilia americana     2     0     7     0     6     6        0.63  0.84  0.86  0.85
Acer saccharum      0     0     8     0     4     9        0.63  0.87  0.91  0.91
Quercus velutina    0     8     0     8     0     0        0.18  0.30  0.02  0.00
Juglans nigra       3     2     3     0     6     0        0.00  0.67  0.66  0.65
Site score (iteration)
1                   1.00  0.00  0.86  0.60  0.62  0.99
2                   0.65  0.00  0.88  0.05  0.78  1.00
3                   0.60  0.01  0.87  0.00  0.78  1.00
9                   0.59  0.01  0.87  0.00  0.78  1.00

Reordered Sites and Species
                    Site
Species             D      B      A      E      C      F        Species score
Quercus velutina    8      8      0      0      0      0        0.004
Prunus serotina     6      3      6      5      4      1        0.477
Juglans nigra       0      2      3      6      3      0        0.647
Tilia americana     0      0      2      6      7      6        0.845
Acer saccharum      0      0      0      4      8      9        0.909
Site score          0.000  0.008  0.589  0.778  0.872  1.000
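The iteration shown in the preceding tables can be sketched in a few lines of base R. The abundance matrix and the starting species scores are taken from the slides; the rescaling of site scores to the range 0 to 1 at each step, the fixed count of 50 iterations and the helper name rescale01 are assumptions about details the slides do not spell out.

```r
# Abundances from the slides: species in rows, sites A to F in columns.
abund <- matrix(
  c(6, 3, 4, 6, 5, 1,
    2, 0, 7, 0, 6, 6,
    0, 0, 8, 0, 4, 9,
    0, 8, 0, 8, 0, 0,
    3, 2, 3, 0, 6, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("Prunus serotina", "Tilia americana", "Acer saccharum",
      "Quercus velutina", "Juglans nigra"),
    c("A", "B", "C", "D", "E", "F"))
)

rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Starting species scores (iteration 1 on the slides).
species_score <- c(1.00, 0.63, 0.63, 0.18, 0.00)

for (iteration in 1:50) {
  # Site score: abundance-weighted average of species scores, rescaled to 0..1.
  site_score <- rescale01(drop(species_score %*% abund) / colSums(abund))
  # Species score: abundance-weighted average of the site scores.
  species_score <- drop(abund %*% site_score) / rowSums(abund)
}

site_order    <- names(sort(site_score))
species_order <- names(sort(species_score))
print(round(site_score, 3))
print(round(species_score, 3))
```

Sorting the rows and columns of abund by species_order and site_order should concentrate the large abundances along the diagonal, as in the reordered table.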

Gradient Length

Alpha and Beta Diversity
• alpha diversity is the diversity of a community (either measured in terms of a diversity index or species richness)
• beta diversity (also known as ‘species turnover’ or ‘differentiation diversity’) is the rate of change in species composition from one community to another along gradients
• gamma diversity is the diversity of a region or a landscape
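These quantities can be illustrated with the small species-by-site table used earlier for reciprocal averaging. The Shannon index is used as the example diversity index, and beta diversity is summarised with Whittaker's ratio of gamma to mean alpha; both are common choices assumed here, not ones named on the slide.

```r
# Abundances: species in rows, sites A to F in columns.
abund <- matrix(
  c(6, 3, 4, 6, 5, 1,
    2, 0, 7, 0, 6, 6,
    0, 0, 8, 0, 4, 9,
    0, 8, 0, 8, 0, 0,
    3, 2, 3, 0, 6, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(c("Prunus", "Tilia", "Acer", "Quercus", "Juglans"),
                  c("A", "B", "C", "D", "E", "F"))
)

# Alpha diversity per site: species richness and the Shannon index.
richness <- colSums(abund > 0)
shannon  <- apply(abund, 2, function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log(p))
})

# Gamma diversity: species richness of the whole set of sites.
gamma_richness <- sum(rowSums(abund) > 0)

# Whittaker's beta: how many times richer the region is than an average site.
beta_whittaker <- gamma_richness / mean(richness)
```

A beta value close to 1 means every site holds nearly all the species; larger values indicate more turnover between sites.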

A Short Coenocline

A Long Coenocline

Arches - Artifact or Feature?

The Arch Effect What is it? Why does it happen? What should we do about it?

CA - with arch effect (sites)

CA - with arch effect (species)

Long Gradients A B C D

Gradient End Compression

CA - with arch effect (species)

CA - with arch effect (sites)

Detrending by Segments

DCA - modified unimodal

Testing Significance in Ordination

Randomisation Tests

Randomisation Tests

Randomisation Example
Model: cca(formula = dune ~ Moisture + A1 + Management, data = dune.env)
         Df  Chisq      F N.Perm Pr(>F)
Model     7 1.1392 2.0007    200 < 0.005 ***
Residual 12 0.9761
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05
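The logic behind a table like this can be sketched with a hand-rolled permutation test in base R. This version uses a linear (RDA-style) pseudo-F on simulated data and is not the CCA test that produced the output above; the data, the names moisture, sp1 to sp3 and pseudo_F, and the choice of 199 permutations are all assumptions.

```r
set.seed(123)
n <- 20

# Simulated site data: one environmental variable, three responses.
env <- data.frame(moisture = rnorm(n))
Y <- cbind(sp1 =  2 * env$moisture + rnorm(n),
           sp2 = -1 * env$moisture + rnorm(n),
           sp3 = rnorm(n))
X <- model.matrix(~ moisture, data = env)

# Pseudo-F: explained variation per model df over residual variation per residual df.
pseudo_F <- function(Y, X) {
  fit      <- lm.fit(X, Y)
  ss_total <- sum(scale(Y, scale = FALSE)^2)
  ss_resid <- sum(fit$residuals^2)
  df_model <- fit$rank - 1
  df_resid <- nrow(Y) - fit$rank
  ((ss_total - ss_resid) / df_model) / (ss_resid / df_resid)
}

observed <- pseudo_F(Y, X)

# Null distribution: shuffle the rows of Y to break the link with X.
n_perm   <- 199
permuted <- replicate(n_perm, pseudo_F(Y[sample(n), , drop = FALSE], X))

# The observed value counts as one of the permutations.
p_value <- (sum(permuted >= observed) + 1) / (n_perm + 1)
```

With 199 permutations the smallest attainable p-value is 1/200 = 0.005, which is why permutation output is often reported as "< 0.005" or similar.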