Distance Measures and Ordination

Presentation transcript:

Distance Measures and Ordination. Adapted from Ecological Statistical Workshop, FLC, Daniel Laughlin.

Goals of Ordination: (1) to arrange items along an axis or multiple axes in a logical order; (2) to extract a few major gradients that explain much of the variability in the total dataset; (3) most importantly, to interpret the gradients, since important ecological processes generated them.

http://ordination.okstate.edu/

What makes ordination possible? Variables (species) are “correlated” (in a broad sense). Correlated variables = redundancy. Ordination thrives on the complex network of inter-correlations among species.

Ordination helps to: describe the strongest patterns of community composition; separate strong patterns from weak ones; reveal unforeseen patterns and suggest unforeseen processes.

“Direct” gradient analysis: Order plots along measured environmental gradients, e.g., regress diatom abundance on salinity.

“Indirect” gradient analysis: Order plots according to covariation among species, or dissimilarity among sample units. Following this step, we can examine correlations between environment and ordination axes. Axes = gradients. In PCA, these are called “principal components.”

Data reduction. Goal: to reduce the dimensionality of community datasets (e.g., from 100 species down to 2 or 3 main gradients), taking an n × p matrix (n sample units by p species) to an n × d matrix (n sample units by d axes, with d much smaller than p). These d dimensions represent the strongest correlation structure in the data. This is possible because of redundancy in the data (i.e., species are “correlated”).

Ordination Diagrams (example: NMS ordination with Axis 1 labeled “Abiotic” and Axis 2 labeled “Biotic”). Do not seek patterns as you would with a regression: axes are orthogonal (uncorrelated). Know two things: (1) what the points represent (plots or species?); (2) distance in the diagram is proportional to compositional dissimilarity.

How many axes? “How many discrete signals can be detected against a background of noise?” Typically we expect 2 or 3 gradients to be sufficient, but if we know that 5 independent environmental gradients are structuring the vegetation (water, light, CO2, nutrients, grazers, etc.), then perhaps 5 axes are justified

Two basic techniques. Eigenanalysis methods use information from a variance-covariance matrix or correlation matrix (e.g., PCA). They are appropriate for linear models, since covariance is a measure of linear association. Distance-based methods use information from a distance matrix (e.g., NMS). They are appropriate for nonlinear models, since some distance measures and ordination techniques can “linearize” nonlinear associations.

A summary table of ordination methods

Ecological Distance Measures

Distance measures Distance = Difference = Dissimilarity Distance matrix is like a triangular mileage chart on maps (symmetric) We are interested in the distances between sample units (plots) in species space

Distance measures. In univariate species space (one species), the distance between two points is their difference in abundance. We will examine two kinds of distance measures: Euclidean distance and Bray-Curtis (Sørensen) distance.

Domains and Ranges
Distance | Domain of x | Range of d = f(x)
Euclidean | all | non-negative
Sørensen | x ≥ 0 | 0 ≤ d ≤ 1 (or 0 ≤ d ≤ 100 as a percentage)
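As a minimal sketch of the two measures, both distances can be computed directly from the abundance vectors of two plots. The abundances below are hypothetical, and the function names are illustrative rather than taken from any package.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two plots' abundance vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def bray_curtis(a, b):
    """Bray-Curtis (Sørensen) distance; assumes non-negative abundances."""
    total = sum(a) + sum(b)
    if total == 0:
        raise ValueError("both plots are empty; the distance is undefined")
    return sum(abs(x - y) for x, y in zip(a, b)) / total

# Hypothetical abundances of three species in two plots.
plot_a = [10, 0, 5]
plot_b = [0, 8, 5]

print("Euclidean:  ", euclidean(plot_a, plot_b))
print("Bray-Curtis:", bray_curtis(plot_a, plot_b))
```

Plots that share no species get a Bray-Curtis distance of exactly 1 however large their abundances are, whereas the Euclidean distance keeps growing with abundance.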

Which one works best? “If species respond noiselessly to environmental gradients, then we seek a perfect linear relationship between distances in species space and distances in environmental space. Any departure from that represents a partial failure of our distance measure.” McCune p. 51

Easy dataset (low beta diversity) Figure 6.6

Difficult dataset (high beta diversity) Intuitive property Figure 6.7

NMS is able to linearize the relationship between distance in species space and environmental distance because it is based on ranked distances (stay tuned)

Theoretical basis. Our choice is primarily empirical: we should select measures that have shown superior performance. One important theoretical basis: Euclidean distance (ED) measures distance through uninhabitable, impossibly species-rich space. In contrast, city-block distances are measured along the edges of species space, exactly where the sample units lie in the dust bunny distribution!

Nonmetric Multidimensional Scaling (NMS, NMDS, MDS, NMMDS, etc.)

NMS: Uses a distance/dissimilarity matrix. Makes no assumptions regarding linear relationships among variables. Arranges plots in a space that best approximates the distances in a distance matrix.

From a map to a distance matrix Calculate distances
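A minimal sketch of the map-to-matrix step, assuming hypothetical town coordinates on a flat map. `distance_matrix` is an illustrative helper, not a PC-ORD function.

```python
import math

def distance_matrix(points):
    """Symmetric matrix of straight-line distances between all pairs of points."""
    n = len(points)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = math.dist(points[i], points[j])
    return d

# Hypothetical map coordinates (x, y) of three towns.
towns = [(0, 0), (3, 4), (6, 8)]

for row in distance_matrix(towns):
    print(row)
```

NMS works in the opposite direction: it starts from a matrix like this one and searches for coordinates whose distances reproduce its rank order.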

From a distance matrix to a map NMS Question: How well do the distances in the ordination match the distances in the distance matrix?

Advantages of NMS: Avoids the assumption of linear relationships. The use of ranked distances tends to linearize the relationship between distances in species space and distances in environmental space. You can use any distance measure.

Historical disadvantages of NMS: Failing to find the best solution (low “stress”) due to local minima. Slow computation time. These concerns have largely been dealt with given modern computer power.

In a nutshell: NMS is an iterative search for the positions of n entities on k dimensions (axes) that minimize the stress of the k-dimensional configuration. “Stress” is a measure of departure from monotonicity in the relationship between the distances in the original distance matrix and the distances in the ordination diagram.

Achieving monotonicity (Fig. 16.2): The closer the points lie to a monotonic line, the better the fit and the lower the stress. If S* = 0, the relationship is perfectly monotonic. Figure legend: blue = perfect fit, monotonic; red = high stress, not monotonic.
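The idea of departure from monotonicity can be sketched as follows. This is an illustration under stated assumptions, not PC-ORD's implementation: the monotone fit uses the pool-adjacent-violators algorithm, the raw stress S* is taken to be the sum of squared differences between ordination distances and their monotone fit, and the scaling to a 0-100 range, 100 × sqrt(S* / Σd²), follows Kruskal's convention. Ties among the original distances get no special treatment.

```python
import math

def monotone_fit(values):
    """Least-squares non-decreasing fit (pool-adjacent-violators algorithm)."""
    blocks = []  # each block holds [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while a block mean exceeds the mean of the block after it.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

def stress(original, ordination):
    """Scaled stress (0-100) from paired original and ordination distances."""
    # Put ordination distances in order of increasing original distance.
    order = sorted(range(len(original)), key=lambda k: original[k])
    d = [ordination[k] for k in order]
    d_hat = monotone_fit(d)
    s_star = sum((x - y) ** 2 for x, y in zip(d, d_hat))  # raw stress S*
    return 100 * math.sqrt(s_star / sum(x * x for x in d))

# Hypothetical distances for the same pairs of plots.
print(stress([1, 2, 3, 4], [2, 4, 6, 8]))  # rank order preserved
print(stress([1, 2, 3], [1, 3, 2]))        # one pair out of order
```

The first call shows that stress is zero whenever rank order is preserved, even though the ordination distances are not equal to the original ones.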

Instability: Instability is calculated as the standard deviation in stress over the preceding 10 iterations. Instabilities of 0.0001 or less are generally preferred. (sd = sqrt(var))
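A minimal sketch of the instability check, assuming the sample standard deviation is intended (the slide does not say whether the sample or population form is used). The stress history is hypothetical, and both function names are illustrative.

```python
import statistics

def instability(stress_history, window=10):
    """Standard deviation of stress over the last `window` iterations."""
    recent = stress_history[-window:]
    if len(recent) < 2:
        raise ValueError("need at least two iterations")
    return statistics.stdev(recent)  # assumption: sample standard deviation

def is_stable(stress_history, criterion=0.0001, window=10):
    """True once a full window of iterations varies by no more than the criterion."""
    return (len(stress_history) >= window
            and instability(stress_history, window) <= criterion)

# Hypothetical stress values from successive iterations.
history = [30.0, 22.0, 18.0, 15.5, 14.2, 13.6, 13.3, 13.2, 13.15, 13.12, 13.11, 13.105]

print(instability(history), is_stable(history))
```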

Mini Example

Landscape analogy for NMS Global minimum Local minimum (strong, regular, geometric patterns emerge)

Reliability of Ordination: Low stress and stable solutions. Proportion of variance represented (R²). Monte Carlo tests.

Variance represented? (“Ode to an eigenvalue”) NMS is not based on partitioning variance, so there is no direct method. Instead, calculate R² for the relationship between Euclidean distances in the ordination and Bray-Curtis distances in the distance matrix.
Axis | Increment R² | Cumulative R²
1 | 0.37 | 0.37
2 | 0.20 | 0.57
3 | 0.15 | 0.72
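One way to sketch this after-the-fact R² calculation is shown below. It assumes the cumulative R² for k axes is the squared correlation between ordination distances on the first k axes and the original distances, and that each increment is the gain from adding one more axis; PC-ORD's exact procedure may differ. The axis scores and Bray-Curtis values are hypothetical.

```python
import math

def pairwise_euclidean(coords):
    """Euclidean distances for all pairs (i < j), listed row by row."""
    n = len(coords)
    return [math.dist(coords[i], coords[j]) for i in range(n) for j in range(i + 1, n)]

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

def variance_represented(scores, original_distances):
    """Rows of (axis, increment R^2, cumulative R^2)."""
    rows, previous = [], 0.0
    for k in range(1, len(scores[0]) + 1):
        d_ord = pairwise_euclidean([s[:k] for s in scores])
        cumulative = r_squared(d_ord, original_distances)
        rows.append((k, cumulative - previous, cumulative))
        previous = cumulative
    return rows

# Hypothetical NMS scores for four plots on two axes, and hypothetical
# Bray-Curtis distances for the same pairs in the same (i < j) order.
scores = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
bray_curtis_distances = [0.30, 0.55, 0.80, 0.60, 0.50, 0.70]

for axis, increment, cumulative in variance_represented(scores, bray_curtis_distances):
    print(axis, round(increment, 3), round(cumulative, 3))
```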

Monte Carlo test: Has the final NMS configuration extracted stronger axes than expected by chance? Compare the stress obtained using your data with the stress obtained from multiple runs of randomized versions of your data (randomly shuffled within columns). P-value = (1 + n)/(1 + N), where n = number of randomized runs with final stress less than or equal to the observed minimum stress, and N = number of randomized runs. The P-value is thus the proportion of randomized runs with stress less than or equal to the observed stress.
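The randomization and the p-value formula can be sketched as below. The community matrix and stress values are hypothetical, and the NMS runs themselves are not shown; in practice each shuffled matrix would be ordinated to obtain its final stress.

```python
import random

def shuffle_within_columns(matrix, rng):
    """Shuffle each species column independently across plots (rows)."""
    columns = [list(col) for col in zip(*matrix)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]

def monte_carlo_p(observed_stress, randomized_stresses):
    """P = (1 + n) / (1 + N), as defined on the slide."""
    n = sum(1 for s in randomized_stresses if s <= observed_stress)
    N = len(randomized_stresses)
    return (1 + n) / (1 + N)

# Hypothetical plots-by-species abundance matrix.
community = [[1, 0, 3],
             [2, 5, 0],
             [0, 7, 4]]
print(shuffle_within_columns(community, random.Random(1)))

# Hypothetical final stress from the real data and from four randomized runs.
print(monte_carlo_p(10.0, [12.0, 15.0, 9.0, 20.0]))
```

Shuffling within columns keeps each species' abundance distribution intact while breaking the associations among species, which is the structure NMS relies on.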

Monte Carlo tests

Autopilot mode in PC-ORD (Table 16.3 in McCune and Grace 2002)
Parameter | Quick and dirty | Medium | Slow and thorough
Maximum number of iterations | 75 | 200 | 400
Instability criterion | 0.001 | 0.0001 | 0.00001
Starting number of axes | 3 | 4 | 6
Number of real runs | 5 | 15 | 40
Number of randomized runs | 20 | 30 | 50

Choosing the best solution: (1) Select the appropriate number of dimensions. (2) Seek low stress. (3) Use a Monte Carlo test. (4) Avoid unstable solutions.

1. How many dimensions? One dimension is generally not used unless the data are known to be unidimensional. More than three becomes difficult to interpret. Find the elbow in the plot of stress versus number of dimensions (Figure 16.3) and inspect the Monte Carlo tests.

2. Seek low stress (adapted from Table 16.4, p. 132)
<5 = excellent
5-10 = good
10-20 = fair, usable
20-30 = not great, still usable
>30 = dangerously close to random

A general procedure: Carefully read pages 135-136. In your papers, you should report the information that is listed on page 136. Autopilot mode works really well, but don’t publish ordinations obtained using the Quick and Dirty option! Be sure to publish the parameter settings.

Interpreting NMS axes: Two main, complementary approaches: (1) evaluate how species abundances are correlated with NMS axes; (2) evaluate how environmental variables are correlated with NMS axes.

Overlays: a flexible way to see whether a variable is patterned on an ordination; not limited to linear relationships.

Overlays

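Species versus Axes: Resist the temptation to use p-values when examining these relationships! The relationships are often nonlinear (unimodal as well as linear patterns occur), and testing them is circular reasoning, since the axes were derived from the species data.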

Environmental Variables: Joint plots are diagrams of radiating lines, where the angle and length of a line indicate the direction and strength of the relationship.

PerMANOVA

The analysis of community composition.
Continuous covariates: Use ordination to produce a continuous response variable (i.e., an axis), then use covariance analysis (multiple regression, SEM) to explain variance of the axis.
Categorical groups: Ordination is not required (remember, ordination is not the test). Options: permutational MANOVA (PerMANOVA), which can be used on any experimental design; MRPP (only one-way or blocked designs); ANOSIM (up to two factors, in R and PRIMER).

MANOVA (Multivariate Analysis of Variance): Traditional parametric method. Assumes linear relations among variables, multivariate normality, and equal variances and covariances. Not appropriate for community data.

PerMANOVA (Permutational MANOVA): Straightforward extension of ANOVA. Decomposes variance in the distance matrix. No distributional assumptions. Can still be sensitive to heterogeneous variances (dispersion) among groups. Anderson, M. 2001. Austral Ecology.

ANOVA Compare variability within groups versus variability among different groups

Decomposing an observation ($y_{ij}$):
$(y_{ij} - \bar{y}_{..}) = (\bar{y}_{i.} - \bar{y}_{..}) + (y_{ij} - \bar{y}_{i.})$
Variability of observations about the grand mean = variability of the ith treatment mean about the grand mean + variability of observations within each treatment.
$SS_{total} = SS_{among} + SS_{within}$, i.e., $SS_{total} = SS_{treatment} + SS_{error}$
PROBLEM: WE CAN’T CALCULATE MEANS WITH SEMIMETRIC BRAY-CURTIS
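The sum-of-squares partition can be checked numerically for a single response variable. The two treatment groups below are hypothetical, and `sums_of_squares` is an illustrative helper.

```python
def sums_of_squares(groups):
    """Return (SS_total, SS_among, SS_within) for a one-way layout."""
    values = [y for g in groups for y in g]
    grand_mean = sum(values) / len(values)
    ss_total = sum((y - grand_mean) ** 2 for y in values)
    ss_among = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g)
    return ss_total, ss_among, ss_within

# Hypothetical abundances of one species under two treatments.
groups = [[1, 2, 3], [4, 5, 6]]
ss_total, ss_among, ss_within = sums_of_squares(groups)

print(ss_total, ss_among, ss_within)
```

Each term depends on a mean (the grand mean or a treatment mean), which is why this form of the calculation is unavailable when only semimetric distances are known.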

ANOVA: Compare variability within groups versus variability among different groups. A simple 2-D case: the group centroids are unknowable with semimetric Bray-Curtis distances.

The key link The key to this method is that “the sum of squared distances between points and their centroid is equal to (and can be calculated directly from) the sum of squared interpoint distances divided by the number of points.”
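The identity quoted above can be verified numerically in Euclidean space, where the centroid is computable and both sides can be compared. The points are hypothetical and the function names are illustrative.

```python
import math

def ss_to_centroid(points):
    """Sum of squared distances from each point to the centroid."""
    n = len(points)
    centroid = [sum(coord) / n for coord in zip(*points)]
    return sum(math.dist(p, centroid) ** 2 for p in points)

def ss_from_distances(points):
    """Sum of squared interpoint distances divided by the number of points."""
    n = len(points)
    total = sum(math.dist(points[i], points[j]) ** 2
                for i in range(n) for j in range(i + 1, n))
    return total / n

# Hypothetical sample units in a two-species space.
points = [(0, 0), (2, 0), (0, 2), (4, 4)]

print(ss_to_centroid(points), ss_from_distances(points))
```

The second function never computes a centroid. It needs only the interpoint distances, which is what allows the calculation to be carried over to a distance matrix built with any measure.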

Why is this important? We couldn’t use semimetric Bray-Curtis distance in an ANOVA context because central locations cannot be found. But with this finding we no longer have to calculate the central locations. The analysis can proceed by using distances in any distance matrix.

One-way perMANOVA with two groups

Permuted p-values: P = (no. of Fπ ≥ F) / (total no. of Fπ), where each Fπ is obtained with randomly shuffled data. Use at least 999 random permutations. I tend to use 9999 permutations.

The link with ANOVA This F statistic is equal to Fisher’s original F-ratio in the case of one variable and when Euclidean distances are used
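A minimal sketch of a one-way PerMANOVA, built from the sums of squared interpoint distances. It is an illustration, not the code behind any particular package. Two assumptions are made here: the observed F is counted as one of the Fπ values, so 999 shuffles give 1000 values in total, and a small tolerance guards against floating-point ties. The data are hypothetical: one variable with Euclidean (absolute-difference) distances, so the pseudo-F can be compared with Fisher's F-ratio.

```python
import random

def pseudo_f(dist, labels):
    """Pseudo-F from a square distance matrix and one group label per row."""
    n = len(labels)
    groups = sorted(set(labels))
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for g in groups:
        members = [i for i in range(n) if labels[i] == g]
        ss_within += sum(dist[i][j] ** 2
                         for pos, i in enumerate(members)
                         for j in members[pos + 1:]) / len(members)
    ss_among = ss_total - ss_within
    a = len(groups)
    return (ss_among / (a - 1)) / (ss_within / (n - a))

def permanova(dist, labels, n_perm=999, seed=0):
    """Return (observed pseudo-F, permuted p-value)."""
    rng = random.Random(seed)
    f_obs = pseudo_f(dist, labels)
    shuffled = list(labels)
    count = 1  # assumption: the observed F counts as one of the F-pi values
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= f_obs - 1e-12:
            count += 1
    return f_obs, count / (n_perm + 1)

# Hypothetical single response variable measured in six plots.
values = [1, 2, 3, 4, 5, 6]
labels = ["grazed"] * 3 + ["ungrazed"] * 3
dist = [[abs(x - y) for y in values] for x in values]

f_obs, p_value = permanova(dist, labels)
print(f_obs, p_value)
```

For these data the ordinary ANOVA gives SS among = 13.5 on 1 degree of freedom and SS within = 4 on 4 degrees of freedom, so Fisher's F is 13.5, and `pseudo_f` reproduces that value from the distances alone.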

Example: grazing effects (one-way)

Example: two-way factorial