Principal Coordinate Analysis, Correspondence Analysis and Multidimensional Scaling: Multivariate Analysis of Association Matrices BIOL4062/5062 Hal Whitehead.

Association matrices Principal Coordinates Analysis (PCO) Correspondence Analysis (COA) Multidimensional Scaling (MDS)

The Association Matrix [diagram: a square units × units matrix of association measures]

Association matrices (similarities or dissimilarities)
Social structure – association between individuals
Community ecology – similarities or dissimilarities between species, sites
Genetic distances
Correlation matrices
Covariance matrices
Distance matrices – Euclidean, Penrose, Mahalanobis

Association matrices may be symmetric or asymmetric
– symmetric: genetic relatedness among bottlenose dolphins (Krützen et al. 2003)
– asymmetric: grooming rates of capuchin monkeys (Perry 1996)

Principal Coordinates Analysis
Consider a symmetric dissimilarity matrix among four units:

     A  B  C
  B  5
  C  3  7
  D  5  4  4

Treat it as a distance matrix, and then plot it.

Principal Coordinates Analysis
The same A–D matrix, read as distances, can be drawn as points in space. In general we can represent:
distances between 2 points in 1 dimension
distances between 3 points in 2 dimensions
distances between 4 points in 3 dimensions
…
distances between k points in k−1 dimensions

Principal Coordinates Analysis
HOWEVER! The triangle inequality is violated if AB + AC < BC (e.g., AB = 5, AC = 3, but BC = 10), and then no representation in Euclidean space is possible.
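Before running PCO it is worth screening the matrix for such violations. A stdlib-only sketch (the function name `triangle_violations` is mine, not from the course):

```python
from itertools import permutations

def triangle_violations(D, labels):
    """List pairs whose distance exceeds the two sides of some triangle
    through a third point, i.e. D[i][j] > D[i][k] + D[k][j].
    D is a symmetric matrix (list of lists) with zeros on the diagonal."""
    bad = []
    for i, j, k in permutations(range(len(D)), 3):
        if i < j and D[i][j] > D[i][k] + D[k][j]:
            bad.append((labels[i], labels[k], labels[j]))
    return bad

# The slide's violating example: AB = 5, AC = 3, BC = 10
D = [[0, 5, 3],
     [5, 0, 10],
     [3, 10, 0]]
print(triangle_violations(D, "ABC"))  # the B-C distance via A breaks the inequality
```

An empty list means every triple of units can at least form a triangle (which, as noted below, is necessary but not quite sufficient for an exact Euclidean plot).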

Principal Coordinates Analysis
Take a distance (dissimilarity) matrix with k units
Represent it as k points in (k−1)-dimensional space – if the triangle inequality holds throughout
Find the direction of greatest variability – 1st principal coordinate
Find the direction of next greatest variability (orthogonal to the first) – 2nd principal coordinate
… and so on, up to k−1 principal coordinates
This reduces the dimensionality of the representation

Principal Coordinates Analysis
Eigenvectors of the double-centred (squared) distance matrix give the principal coordinates
Eigenvalues give the proportion of variance accounted for
An exact Euclidean representation requires (a slightly stronger condition than the triangle inequality):
– the centred matrix is positive semi-definite
– no imaginary coordinates
– no negative eigenvalues
The analysis is probably OK if there are only a few small negative eigenvalues
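The recipe above can be sketched in a few lines of NumPy. This is a generic illustration (the function name `pco` is mine), tested on distances that are exactly Euclidean:

```python
import numpy as np

def pco(D):
    """Classical principal coordinates analysis (Gower's method):
    double-centre the squared distance matrix and eigendecompose it.
    Returns the coordinates (one column per principal coordinate,
    largest eigenvalue first) and all eigenvalues."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centred Gram matrix
    evals, evecs = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1]          # largest eigenvalue first
    evals, evecs = evals[order], evecs[:, order]
    keep = evals > 1e-10                     # positive eigenvalues only
    return evecs[:, keep] * np.sqrt(evals[keep]), evals

# Distances among four points in a plane are recovered exactly
pts = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
coords, evals = pco(D)
Dhat = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1))
assert np.allclose(D, Dhat)   # the plot preserves all pairwise distances
```

Here only two eigenvalues are positive, so the four points embed in a plane; a negative eigenvalue in `evals` would signal a non-Euclidean matrix.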

Principal Coordinates Analysis (PCO) & Principal Components Analysis (PCA)
PCO is equivalent to PCA on the covariance matrix of the transposed data matrix if the distance matrix is Euclidean
PCO is equivalent to PCA on the correlation matrix of the transposed data matrix if the distance matrix is Penrose
PCO gives information only on the units or the variables, not both
Axes (principal coordinates) are rarely interpretable in PCO
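The Euclidean-distance equivalence can be checked numerically. A sketch on made-up random data (the Penrose/correlation case would standardize the variables first):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))              # 10 units, 3 variables

# PCA scores: project centred data onto the eigenvectors of X'X
Xc = X - X.mean(axis=0)
_, pca_vecs = np.linalg.eigh(Xc.T @ Xc)   # proportional to the covariance matrix
scores = Xc @ pca_vecs[:, ::-1]           # largest component first

# PCO of the Euclidean distance matrix among the same units
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
n = len(X)
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
evals, evecs = np.linalg.eigh(B)
coords = evecs[:, ::-1][:, :3] * np.sqrt(np.maximum(evals[::-1][:3], 0))

# Same configuration, up to the arbitrary sign of each axis
assert np.allclose(np.abs(scores), np.abs(coords))
```

The agreement holds because for Euclidean distances the double-centred matrix B equals Xc·Xcᵀ, which shares its non-zero eigenvalues with Xcᵀ·Xc.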

Principal Coordinates Analysis
Example: proportion of time chickadees were seen together at a feeder (Ficken et al. Behav. Ecol. Sociobiol. 1981).
[Similarity matrix among seven birds – SCAO, AOPR, ARPO, YOSA, ROAY, SORA, BJAO – with 1.00 on the diagonal; off-diagonal values not recovered from the slide.]

Principal Coordinates Analysis
The similarities X were transformed to a distance matrix using √(1 − X), giving 0.00 on the diagonal.
[Transformed matrix for the seven birds; values not recovered from the slide.]

Principal Coordinates Analysis: Chickadees at Feeder
[Plot of the seven birds on the first principal coordinates, with a table of eigenvalue, % explained and cumulative % for each principal coordinate; values not recovered from the slide.]

Correspondence Analysis
Uses an incidence matrix – counts indexed by two factors
– e.g., archaeology: tombs × artifacts
– e.g., community ecology: sites × species
A data matrix with counts and many zeros

Correspondence Analysis
The distance between two species, i and j, over sites k = 1, …, p is the “chi-squared” measure:
d(i,j) = √[ Σ_k (1/c_k) (x_ik/r_i − x_jk/r_j)² ]
where r_i are the species totals and c_k the site totals – i.e., the difference in the proportions of each species at each site.
Then do Principal Coordinates Analysis.

Correspondence Analysis
Distance between two species, i and j, over sites k = 1, …, p (“chi-squared” measure):
d(i,j) = √[ Σ_k (1/c_k) (x_ik/r_i − x_jk/r_j)² ]
Distance between two sites, k and l, over species i = 1, …, n:
d(k,l) = √[ Σ_i (1/r_i) (x_ik/c_k − x_il/c_l)² ]
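The species-by-species measure can be coded directly from the formula. A sketch using invented counts (rows = species, columns = sites; the function name is mine):

```python
import numpy as np

def chi2_distances(X):
    """Chi-squared distances between the rows of a count matrix:
    d(i,j) = sqrt( sum_k (1/c_k) * (x_ik/r_i - x_jk/r_j)**2 ),
    with r_i the row (species) totals and c_k the column (site) totals.
    Assumes no all-zero rows or columns."""
    r = X.sum(axis=1, keepdims=True)   # species totals
    c = X.sum(axis=0)                  # site totals
    P = X / r                          # species profiles (rows sum to 1)
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt(((diff ** 2) / c).sum(-1))

counts = np.array([[10.0, 0.0, 3.0],
                   [2.0, 8.0, 1.0],
                   [0.0, 5.0, 5.0]])
D = chi2_distances(counts)
# D is a symmetric dissimilarity matrix, ready for PCO
```

Feeding D (and the analogous site-by-site matrix) into principal coordinates analysis is exactly the "then do PCO" step the slide describes.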

Correspondence Analysis
Example: counts of sperm whale mtDNA haplotypes (#1–#15) in each of three clans (Regular, Short, 4-plus).
[Incidence matrix and correspondence-analysis plot; the counts were garbled in transcription; one axis eigenvalue was 0.205.]

Multidimensional Scaling
A “non-parametric version of principal coordinates analysis”
Given an association matrix between units, it:
– tries to find a representation of the units in a given number of dimensions
– preserving the pattern/ordering in the association matrix

Multidimensional Scaling
How it works:
1. Provide the association matrix (similarity/dissimilarity)
2. Provide the number of dimensions
3. Produce an initial plot, perhaps using principal coordinates
4. Order the distances on the plot and compare them with the ordering of the association matrix
5. Compute STRESS
6. Juggle the points to reduce STRESS
7. Go to 4, until STRESS has stabilized
8. Output the plot and STRESS
9. Perhaps repeat with new starting conditions

Multidimensional Scaling
STRESS = √[ Σ_ij (d_ij − x_ij)² / Σ_ij d_ij² ]
where d_ij are the associations between i and j, and x_ij are the associations between i and j predicted from the distances on the plot (by regression).
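For the metric case, where the "regression" is simply the identity, STRESS can be computed as follows (a sketch; the function name is mine, and non-metric MDS would first pass the plot distances through a monotone regression):

```python
import numpy as np

def stress1(D_obs, coords):
    """Kruskal-style stress: mismatch between the observed
    dissimilarities and the inter-point distances of a trial
    configuration (metric version)."""
    D_fit = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1))
    i, j = np.triu_indices(len(coords), k=1)     # each pair once
    num = ((D_obs[i, j] - D_fit[i, j]) ** 2).sum()
    den = (D_obs[i, j] ** 2).sum()
    return np.sqrt(num / den)

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
D = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
print(stress1(D, pts))    # a perfect configuration gives stress 0
```

The "juggling" in step 6 then amounts to moving the points so that this number decreases, e.g. by gradient descent.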

Multidimensional Scaling Iterative –No unique solution –Try with different starting positions Different possible definitions of STRESS

Multidimensional Scaling: Shepard Diagrams
[Shepard diagrams of plot distances against association values: metric scaling (stress 23%; plots similar to principal coordinates) and non-metric scaling (stress 16%; easier to fit).]

Genetic distances between sperm whale groups
[Plots: metric MDS (stress 23%); non-metric 2-D MDS (stress 16%); non-metric 3-D MDS (stress 8%).]
Principal coordinates: 13 of 14 eigenvalues negative – not a good representation.

Multidimensional Scaling
How many dimensions?
– STRESS < 10% is a “good representation”
– Scree diagram
– two (or three) dimensions for visual ease
Metric or non-metric?
– Metric has few advantages over principal coordinates analysis (unless there are many negative eigenvalues)
– Non-metric does better with fewer dimensions

Non-metric Multidimensional Scaling vs. Principal Coordinates Analysis

                           Principal Coordinates   MDSCAL
Scaling:                   Metric                  Non-metric
Input:                     Distance matrix         Association matrix
Matrix:                    Pos. semi-def.          –
Solution:                  Unique                  Iterative
Max. units:                –                       100's
Dimensions:                More                    Less
Choose no. of dimensions:  Afterwards              Before