What we Measure vs. What we Want to Know

Presentation transcript:

What we Measure vs. What we Want to Know
"Not everything that counts can be counted, and not everything that can be counted counts." - Albert Einstein

Scales, Transformations, Vectors and Multi-Dimensional Hyperspace
All measurement is a proxy for what is really of interest - the relationship between them
The scale of measurement and the scale of analysis and reporting are not always the same - transformations
We often make measurements that are highly correlated - multi-component vectors

Multivariate Description

Gulls Variables

Scree Plot

Output

> summary(gulls.pca2)
Importance of components:
                          Comp.1     Comp.2     Comp.3
Standard deviation     1.8133342 0.52544623 0.47501980
Proportion of Variance 0.8243224 0.06921464 0.05656722
Cumulative Proportion  0.8243224 0.89353703 0.95010425

> gulls.pca2$loadings
Loadings:
        Comp.1 Comp.2 Comp.3 Comp.4
Weight  -0.505 -0.343  0.285  0.739
Wing    -0.490  0.852 -0.143  0.116
Bill    -0.500 -0.381 -0.742 -0.232
H.and.B -0.505 -0.107  0.589 -0.622
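To see where figures such as "Standard deviation" and "Proportion of Variance" come from, here is a minimal sketch in Python (the slides themselves show R output). It assumes only two standardised variables, because the correlation matrix then has the closed-form eigenvalues 1 + |r| and 1 - |r|. The weight and wing values are hypothetical and are not the gulls data.

```python
import math

def correlation(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pca_two_standardised(x, y):
    """PCA of two standardised variables via the 2x2 correlation matrix."""
    r = correlation(x, y)
    eigenvalues = [1 + abs(r), 1 - abs(r)]        # variances of the components
    sdev = [math.sqrt(v) for v in eigenvalues]    # "Standard deviation" row
    proportion = [v / 2 for v in eigenvalues]     # total variance is 2
    s = 1 / math.sqrt(2)
    sign = 1.0 if r >= 0 else -1.0
    loadings = [(s, sign * s), (s, -sign * s)]    # unit-length eigenvectors
    return sdev, proportion, loadings

# Hypothetical measurements, for illustration only.
weight = [620, 700, 750, 810, 880, 940]
wing = [395, 402, 410, 415, 428, 431]
sdev, proportion, loadings = pca_two_standardised(weight, wing)
```

With strongly correlated measurements the first proportion is close to 1, which is the same pattern as the large first component in the output above.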

Bi-Plot

Environmental Gradients

Inferring Gradients from Attribute Data (e.g. species)

Indirect Gradient Analysis
Environmental gradients are inferred from species data alone
Three methods:
  Principal Component Analysis - linear model
  Correspondence Analysis - unimodal model
  Detrended CA - modified unimodal model

Terschelling Dune Data

PCA gradient - site plot

PCA gradient - site/species biplot (legend: standard, biodynamic & hobby, nature)

Making Effective Use of Environmental Variables

Approaches
Use single responses in linear models of environmental variables
Use axes of a multivariate dimension reduction technique as responses in linear models of environmental variables
Constrain the multivariate dimension reduction into the factor space defined by the environmental variables

Dimension Reduction (Ordination) ‘Constrained’ by the Environmental Variables

Constrained?

Working with the Variability that we Can Explain
Start with all the variability in the response variables.
Replace the original observations with their fitted values from a model employing the environmental variables as explanatory variables (discarding the residual variability).
Carry out gradient analysis on the fitted values.
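The first two steps can be sketched as follows. This is a minimal Python illustration that assumes a single environmental variable and a separate straight-line fit for each response. The names `moisture`, `sp_a` and `sp_b` and all values are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept and slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def constrain(responses, env):
    """Replace each response by its fitted values on the environmental variable."""
    fitted = {}
    for name, y in responses.items():
        intercept, slope = fit_line(env, y)
        fitted[name] = [intercept + slope * e for e in env]
    return fitted

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

# Hypothetical site data, for illustration only.
moisture = [1, 2, 3, 4, 5, 6]
species = {"sp_a": [0, 1, 3, 4, 6, 7], "sp_b": [5, 5, 3, 4, 1, 2]}

fitted = constrain(species, moisture)
total = sum(variance(y) for y in species.values())
explained = sum(variance(f) for f in fitted.values())
share = explained / total   # fraction of variability kept after constraining
```

A gradient analysis (for example PCA) would then be run on `fitted` rather than on `species`. With one environmental variable all fitted responses are linear in that variable, so only one constrained axis is available.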

Unconstrained/Constrained Unconstrained ordination axes correspond to the directions of the greatest variability within the data set. Constrained ordination axes correspond to the directions of the greatest variability of the data set that can be explained by the environmental variables.

Direct Gradient Analysis
Environmental gradients are constructed from the relationship between species and environmental variables
Three methods:
  Redundancy Analysis - linear model
  Canonical (or Constrained) Correspondence Analysis - unimodal model
  Detrended CCA - modified unimodal model

Dune Data Unconstrained

Dune Data Constrained

How Similar are Objects/Samples/Individuals/Sites?

Similarity approaches or what do we mean by similar?

Different types of data (examples)
Continuous data: height
Categorical data
  ordered (ordinal): growth rate - very slow, slow, medium, fast, very fast
  not ordered (nominal): fruit colour - yellow, green, purple, red, orange
Binary data: fruit / no fruit

Different scales of measurement (examples)
Large range: soil ion concentrations
Restricted range: air pressure
Constrained: proportions
Large numbers: altitude
Small numbers: attribute counts
Do we standardise measurement scales to make them equivalent? If so, what do we lose?
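One common way to make scales equivalent is to convert each variable to z-scores. The sketch below is Python with hypothetical values and assumes the sample standard deviation (divisor n - 1). It also hints at the answer to the second question: after standardising, every variable has the same spread, so the original units and any real differences in variability between variables are lost.

```python
import math

def standardise(values):
    """Centre to mean 0 and scale to sample standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

# Hypothetical measurements on very different scales.
altitude_m = [120.0, 340.0, 560.0, 910.0]
proportion_cover = [0.10, 0.12, 0.11, 0.15]

z_alt = standardise(altitude_m)
z_cov = standardise(proportion_cover)
```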

Similarity matrix
We define a similarity between units – like the correlation between continuous variables (it can also be a dissimilarity or distance matrix).
A similarity can be constructed as an average of the similarities between the units on each variable (a weighted average can be used).
This provides a way of combining different types of variables.
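A minimal Python sketch of such an averaged similarity follows. It is in the spirit of Gower's general coefficient, which is an assumption here since the slide does not name a method. Continuous variables score 1 minus the absolute difference divided by the variable's range, and categorical or binary variables score 1 for a match and 0 otherwise. The plants, the range and the equal weights are all hypothetical.

```python
def mixed_similarity(u, v, kinds, ranges, weights=None):
    """Weighted average of per-variable similarities between units u and v."""
    if weights is None:
        weights = [1.0] * len(kinds)
    total = 0.0
    for a, b, kind, rng, w in zip(u, v, kinds, ranges, weights):
        if kind == "continuous":
            s = 1.0 - abs(a - b) / rng      # rng = range of that variable
        else:                               # categorical or binary: match or not
            s = 1.0 if a == b else 0.0
        total += w * s
    return total / sum(weights)

# Hypothetical units: (height in m, fruit colour, fruit present).
kinds = ["continuous", "categorical", "binary"]
ranges = [2.0, None, None]                  # assumed height range of 2.0 m
plant_a = (1.2, "red", 1)
plant_b = (1.7, "red", 0)

similarity = mixed_similarity(plant_a, plant_b, kinds, ranges)
```

Here the three per-variable similarities are 0.75, 1 and 0, so the unweighted average is 1.75 / 3.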

Distance metrics relevant for continuous variables
Euclidean
city block or Manhattan
[Figure: the two distances between points A and B]
(also many other variations)
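The two metrics differ only in how the per-variable differences are combined. A small Python sketch follows, with hypothetical coordinates for points A and B.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City block distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical points A and B.
A = (0.0, 0.0)
B = (3.0, 4.0)

d_euclid = euclidean(A, B)   # 5.0
d_city = manhattan(A, B)     # 7.0
```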

Similarity coefficients for binary data
simple matching: count if both units 0 or both units 1
Jaccard: count only if both units 1
(also many other variants, e.g. Bray-Curtis)
simple matching can be extended to categorical data
[Figure: example pairs of units 0,1 1,1 0,0 1,0]
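Both coefficients can be sketched in a few lines of Python. The two units below reproduce the pairs 0,1 1,1 0,0 1,0 from the slide. Returning 0.0 when neither unit has any 1s is an assumed convention, not something the slide specifies.

```python
def simple_matching(u, v):
    """Share of variables on which both units agree (0-0 or 1-1)."""
    matches = sum(1 for a, b in zip(u, v) if a == b)
    return matches / len(u)

def jaccard(u, v):
    """Share of 1-1 matches among variables where at least one unit is 1."""
    both = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(u, v) if a == 1 or b == 1)
    return both / either if either else 0.0   # assumed convention for all zeros

# Variable-by-variable pairs: (0,1), (1,1), (0,0), (1,0).
unit_1 = [0, 1, 0, 1]
unit_2 = [1, 1, 0, 0]

sm = simple_matching(unit_1, unit_2)   # 2 of 4 agree -> 0.5
jc = jaccard(unit_1, unit_2)           # 1 joint presence out of 3 -> 1/3
```

The joint absence (0,0) raises the simple matching score but is ignored by Jaccard, which is why the two values differ.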

A Distance Matrix

Uses of Distances
Distance/dissimilarity can be used:
  to explore dimensionality in data using principal coordinate analysis (PCO or PCoA)
  as a basis for clustering/classification

UK Wet Deposition Network

Shown with Environmental Variables

A Map based on Measured Variables

Fitting Environmental Variables

Grouping methods

Discriminating
If you have continuous measurements and you know which two groups you are looking for (e.g. male and female in the gulls data), linear discriminant analysis will find a function of the measurements that helps to allocate new subjects to the groups.
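A minimal two-group, two-variable sketch of the idea in Python follows. It assumes equal prior probabilities and a pooled within-group covariance matrix, and it places the cut-off midway between the group means. The (weight, wing) values are hypothetical and are not the gulls data.

```python
def mean_vec(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def pooled_cov(g1, g2):
    """Pooled within-group 2x2 covariance matrix."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows in (g1, g2):
        m = mean_vec(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    dof = len(g1) + len(g2) - 2
    return [[s[i][j] / dof for j in range(2)] for i in range(2)]

def fit_lda(g1, g2):
    """Discriminant weights w = S^-1 (m1 - m2) and a midpoint cut-off."""
    m1, m2 = mean_vec(g1), mean_vec(g2)
    s = pooled_cov(g1, g2)
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    mid = [(m1[0] + m2[0]) / 2, (m1[1] + m2[1]) / 2]
    cutoff = w[0] * mid[0] + w[1] * mid[1]
    return w, cutoff

def allocate(x, w, cutoff, labels=("male", "female")):
    """Assign a new subject to the first group if its score exceeds the cut-off."""
    score = w[0] * x[0] + w[1] * x[1]
    return labels[0] if score > cutoff else labels[1]

# Hypothetical (weight, wing) measurements, for illustration only.
males = [(900, 430), (950, 428), (880, 425), (920, 436)]
females = [(780, 405), (820, 412), (760, 408), (800, 400)]
w, cutoff = fit_lda(males, females)
new_bird = allocate((930, 432), w, cutoff)
```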

Canonical Variate Analysis
For more than two groups, canonical variate analysis maximises the ratio of between-group to within-group variance – this is related to a multivariate analysis of variance (MANOVA).

Cluster Analysis

Clustering methods
hierarchical
  divisive: put everything together and split (monothetic / polythetic)
  agglomerative: keep everything separate and join the most similar points (classical cluster analysis)
non-hierarchical
  k-means clustering

Agglomerative hierarchical: single linkage or nearest neighbour
Finds the minimum spanning tree: the shortest tree that connects all points.
Chaining can be a problem.

Agglomerative hierarchical: complete linkage or furthest neighbour
Produces compact clusters of approximately equal size (it makes compact groups even when none exist).

Agglomerative hierarchical: average linkage methods
These fall between single and complete linkage.
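The three linkage rules differ only in how the distance between two clusters is defined. The Python sketch below is a naive agglomerative procedure, written for clarity rather than speed. It assumes Euclidean distance and stops when `k` clusters remain. The function names and the points are hypothetical.

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(points, k, linkage="single"):
    """Join the two closest clusters repeatedly until k clusters remain."""
    clusters = [[i] for i in range(len(points))]   # start with every point separate

    def cluster_distance(c1, c2):
        ds = [euclid(points[i], points[j]) for i in c1 for j in c2]
        if linkage == "single":      # nearest neighbour
            return min(ds)
        if linkage == "complete":    # furthest neighbour
            return max(ds)
        if linkage == "average":     # mean of all pairwise distances
            return sum(ds) / len(ds)
        raise ValueError("unknown linkage: " + linkage)

    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
groups = agglomerate(points, k=2, linkage="single")
```

On clearly separated data like this all three rules give the same grouping. The rules tend to disagree on elongated or unevenly spaced data, where single linkage can chain points together.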

From Alexandria to Suez

Hierarchical Clustering


Building and testing models
Basically you approach this in the same way as for multiple regression, so there are the same issues of variable selection, interactions between variables, etc. However, the basis of any statistical test that relies on distributional assumptions is more problematic, so there is much greater use of randomisation tests and permutation procedures to evaluate the statistical significance of results.
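The logic of a permutation procedure can be sketched with a deliberately simple case. The Python example below assumes a single response and two groups, uses the absolute difference in group means as the test statistic, and enumerates every possible reallocation rather than sampling at random. The data are hypothetical. Multivariate procedures permute in the same spirit but use a multivariate statistic.

```python
from itertools import combinations

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in means."""
    pooled = group_a + group_b
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def statistic(sum_a):
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = statistic(sum(group_a))
    extreme = 0
    n_perm = 0
    # Every way of choosing which observations are labelled as group A.
    for idx in combinations(range(len(pooled)), n_a):
        n_perm += 1
        if statistic(sum(pooled[i] for i in idx)) >= observed - 1e-12:
            extreme += 1
    return extreme / n_perm

# Hypothetical, completely separated groups.
site_type_1 = [1, 2, 3, 4]
site_type_2 = [10, 11, 12, 13]
p = permutation_p_value(site_type_1, site_type_2)
```

With four observations per group there are 70 possible allocations, and only the observed one and its mirror image are as extreme, so the p-value is 2/70. No distributional assumption is needed to obtain it.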

Some Examples

Part of Fig 4.

What Technique?

Response variable(s) ...   Predictor(s): No                                  Predictor(s): Yes
... is one                 • distribution summary                            • regression models
... are many               • indirect gradient analysis (PCA, CA, DCA, MDS)  • direct gradient analysis
                           • cluster analysis                                • constrained cluster analysis
                                                                             • discriminant analysis (CVA)

Raw Data

Linear Regression

Two Regressions

Principal Components

Models of Species Response
There are (at least) two models:
  Linear - species increase or decrease along the environmental gradient
  Unimodal - species rise to a peak somewhere along the environmental gradient and then fall again
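The two shapes can be written as simple functions. In the Python sketch below the unimodal model is assumed to be a Gaussian curve with an optimum, a tolerance and a maximum, which is one common choice rather than something the slide specifies. The gradient values and parameters are hypothetical. The last function shows the weighted-averaging idea behind the unimodal methods: a species' optimum is estimated as the abundance-weighted mean of the gradient values.

```python
import math

def linear_response(x, intercept, slope):
    """Abundance rises or falls steadily along the gradient."""
    return intercept + slope * x

def unimodal_response(x, optimum, tolerance, maximum):
    """Gaussian curve: abundance peaks at the optimum and falls away either side."""
    return maximum * math.exp(-((x - optimum) ** 2) / (2 * tolerance ** 2))

def weighted_average_optimum(gradient, abundance):
    """Abundance-weighted mean of the gradient values."""
    return sum(g * a for g, a in zip(gradient, abundance)) / sum(abundance)

# Hypothetical gradient sampled evenly from 0 to 10.
gradient = [i * 0.5 for i in range(21)]
abundance = [unimodal_response(x, 5.0, 1.5, 20.0) for x in gradient]
estimate = weighted_average_optimum(gradient, abundance)
```

Because the sampling here is symmetric about the true optimum of 5.0, the weighted average recovers it. With uneven or truncated sampling the estimate would be pulled towards the sampled end of the gradient.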

Linear

Unimodal

Ordination Techniques

                           Linear methods                        Weighted averaging (unimodal)
Unconstrained (indirect)   Principal Components Analysis (PCA)   Correspondence Analysis (CA)
Constrained (direct)       Redundancy Analysis (RDA)             Canonical Correspondence Analysis (CCA)

Non-metric multidimensional scaling
NMDS maps the observed dissimilarities onto an ordination space by trying to preserve their rank order in a low number of dimensions (often 2) – but the solution depends on the number of dimensions chosen.
It is like a non-linear version of PCO.
Define a stress function and look for the mapping with minimum stress (e.g. the sum of squared residuals in a monotonic regression of the distances in NMDS space on the original dissimilarities).
An iterative process is needed, so try many different starting points; convergence is not guaranteed.
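The stress calculation for one candidate configuration can be sketched as below. This Python example assumes Kruskal's stress-1 (the square root of the squared residuals divided by the squared distances) and a pool-adjacent-violators fit for the monotonic regression. It does not perform the iterative search for the best configuration, and ties in the dissimilarities are simply left in sorted order. The dissimilarities and coordinates are hypothetical.

```python
import math

def isotonic_fit(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit to a sequence."""
    blocks = []                                   # each block is [mean, count]
    for v in values:
        blocks.append([v, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, n2 = blocks.pop()
            m1, n1 = blocks.pop()
            blocks.append([(m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2])
    fitted = []
    for mean, count in blocks:
        fitted.extend([mean] * count)
    return fitted

def stress(dissimilarities, config):
    """Stress-1 of a configuration against a matrix of dissimilarities."""
    pairs = []
    n = len(config)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(config[i], config[j])))
            pairs.append((dissimilarities[i][j], d))
    pairs.sort(key=lambda p: p[0])                # order by observed dissimilarity
    distances = [d for _, d in pairs]
    fitted = isotonic_fit(distances)              # monotonic regression
    num = sum((d - f) ** 2 for d, f in zip(distances, fitted))
    den = sum(d ** 2 for d in distances)
    return math.sqrt(num / den)

# Hypothetical three-site example in two dimensions.
config = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
agreeing = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]      # same rank order as the distances
reversed_order = [[0, 3, 1], [3, 0, 2], [1, 2, 0]]

stress_good = stress(agreeing, config)
stress_bad = stress(reversed_order, config)
```

When the rank order of the mapped distances matches that of the dissimilarities the stress is zero, whatever the actual values are. That is the sense in which the method is non-metric.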

Procrustes rotation
Used to compare two separate ordinations graphically.
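A simplified two-dimensional sketch in Python follows. It centres both ordinations and finds the single rotation angle that brings the second as close as possible to the first in the least-squares sense. Scaling and reflection, which a full Procrustes analysis usually allows, are left out here as a simplifying assumption. The site scores are hypothetical.

```python
import math

def centre(points):
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    return [(p[0] - mx, p[1] - my) for p in points]

def procrustes_rotation(target, source):
    """Rotate the centred source onto the centred target; return angle, result, residual."""
    a = centre(target)
    b = centre(source)
    num = sum(ay * bx - ax * by for (ax, ay), (bx, by) in zip(a, b))
    den = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(a, b))
    theta = math.atan2(num, den)                  # least-squares rotation angle
    c, s = math.cos(theta), math.sin(theta)
    rotated = [(bx * c - by * s, bx * s + by * c) for bx, by in b]
    residual = sum((ax - rx) ** 2 + (ay - ry) ** 2
                   for (ax, ay), (rx, ry) in zip(a, rotated))
    return theta, rotated, residual

# Hypothetical site scores; the second ordination is the first rotated by 40 degrees and shifted.
first = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 3.0)]
phi = math.radians(40)
second = [(x * math.cos(phi) - y * math.sin(phi) + 5.0,
           x * math.sin(phi) + y * math.cos(phi) - 2.0) for x, y in first]

theta, rotated, residual = procrustes_rotation(first, second)
```

The residual sum of squares measures how well the two ordinations agree after rotation, and plotting `rotated` over the centred first ordination gives the graphical comparison.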