Lecture 6: Ordination

Ordination comprises a number of techniques that classify multivariate data according to predefined standards. The simplest ordination technique is cluster analysis. An easy but powerful technique is principal component analysis (PCA).

Cluster analysis

Species   Sequence
P.symA    AATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTTTTATTTCGTATGCTATGTAGCTTAAGGGTACTGACGGTAG
P.xanA    AATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTAATATTCCGTATGCTATGTAGCTTAAGGGTACTGACGGTAG
P.polaA   AATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTTTTATTCCGTATGCTATGTAGCTGGAGGGTACTGACGGTAG
C.platA   AATGCCTGACGTGGGAAATCAATAGGGCTAAGGAATTTATTTCGTATGCTATGTAGCTTAAGGGTACTGATTTTAG
C.gradA   AATGCCTGACGTGGGAAATCAATAGGGCTAAGGAATTTATTTCGTATGCTATGTAGCTTCCGGGTACTGATTTTAG
D.symT    TATGCGAGACGTGAAAAATCTTTAGGGCTAAGGTGATTATTTCGGTTGCTATGTAGAGGAAGGGTACTGACGGTAG

A cluster analysis is a two-step process that requires the choice of a) a distance metric and b) a linkage algorithm.
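For aligned sequences, the "count the differences" metric mentioned below is the Hamming distance. A minimal sketch using three of the sequences from the table (species keys shortened to the names used in the distance matrices):

```python
# Hamming distance: number of positions at which two aligned
# sequences differ -- the simplest distance metric for sequence data.
seqs = {
    "P.sym": "AATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTTTTATTTCGTATGCTATGTAGCTTAAGGGTACTGACGGTAG",
    "P.xan": "AATGCCTGACGTGGGAAATCTTTAGGGCTAAGGTTAATATTCCGTATGCTATGTAGCTTAAGGGTACTGACGGTAG",
    "D.sym": "TATGCGAGACGTGAAAAATCTTTAGGGCTAAGGTGATTATTTCGGTTGCTATGTAGAGGAAGGGTACTGACGGTAG",
}

def hamming(a, b):
    """Number of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

# full pairwise distance matrix as a dictionary
dist = {(i, j): hamming(seqs[i], seqs[j]) for i in seqs for j in seqs}
```

P.sym and P.xan differ at three positions, so their distance is 3; the matrix is symmetric with zeros on the diagonal.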

Between clusters / within clusters: cluster analysis tries to minimize within-cluster distances and to maximize between-cluster distances.

The distance metric

[Slide: the sequence table from above together with the pairwise species-by-species distance matrix; the numerical entries are not recoverable from the transcript.]

A distance matrix counts, in the simplest case, the number of differences between two data sets.

Species presence-absence matrix A

         Site 1  Site 2  Site 3  Site 4
P.sym      1       0       1       1
P.xan      1       0       0       1
P.pola     0       1       0       1
C.plat     0       1       1       1
C.grad     1       0       0       0
D.sym      1       0       1       1
Sum        4       2       3       5

From A the site-by-site matrix D = AᵀA is obtained; its entries count the species shared by each pair of sites. [The two 4 x 4 site-by-site matrices shown on the slide are not recoverable from the transcript.]

With a = number of shared species and b, c = numbers of species unique to either site:
Soerensen index: 2a / (2a + b + c)
Jaccard index: a / (a + b + c)
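Both indices can be computed directly from the presence-absence data. A sketch, with each site represented as the set of species read off the matrix above:

```python
# Species present at each site, taken from the presence-absence matrix A.
sites = {
    "Site 1": {"P.sym", "P.xan", "C.grad", "D.sym"},
    "Site 2": {"P.pola", "C.plat"},
    "Site 3": {"P.sym", "C.plat", "D.sym"},
    "Site 4": {"P.sym", "P.xan", "P.pola", "C.plat", "D.sym"},
}

def jaccard(a, b):
    # shared species relative to all distinct species of the pair
    return len(a & b) / len(a | b)

def soerensen(a, b):
    # shared species counted twice, relative to the two richness values
    return 2 * len(a & b) / (len(a) + len(b))
```

Sites 1 and 3 share two species (P.sym and D.sym) out of five distinct ones, giving a Jaccard index of 0.4 and a Soerensen index of 4/7.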

Abundance data

[Slide: species-by-site abundance table and the resulting site-by-site correlation distance matrix; the numerical values are not recoverable from the transcript.]

Euclidean distance d = sqrt(Σ (x_i - y_i)²): due to the squaring, Euclidean distances put particular weight on outliers; the data need a linear scale.
Manhattan distance d = Σ |x_i - y_i|: also needs a linear scale.
Correlation distance d = 1 - r: sensitive to non-linearities in the data; despite a large difference in raw values the metric might be zero.
Bray-Curtis distance d = Σ |x_i - y_i| / Σ (x_i + y_i): equivalent to the Soerensen index for presence-absence data; suffers from the same shortcomings as the Manhattan distance.
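A sketch of the four abundance-based distances. The slide's own numbers are not recoverable, so the two abundance vectors below are made-up values for four species at two sites:

```python
import numpy as np

# hypothetical abundances of the same four species at two sites
x = np.array([12.0, 0.0, 3.0, 5.0])
y = np.array([10.0, 1.0, 0.0, 5.0])

euclidean = np.sqrt(((x - y) ** 2).sum())          # squaring emphasizes outliers
manhattan = np.abs(x - y).sum()                    # linear scale
bray_curtis = np.abs(x - y).sum() / (x + y).sum()  # bounded between 0 and 1
correlation = 1.0 - np.corrcoef(x, y)[0, 1]        # zero for proportional vectors
```

For these vectors the Manhattan distance is 6, the Euclidean distance sqrt(14), and the Bray-Curtis distance 6/36 = 1/6.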

Linkage algorithm

[Slide: the sequence table and its pairwise distance matrix, with the resulting dendrogram for P.sym, P.xan, P.pola, C.plat, C.grad and D.sym; the numerical values are not recoverable from the transcript.]

We first combine the species that are nearest, to form an initial cluster. In the next step we look for the species or cluster that is closest to the initial cluster (for instance by the average distance). We continue this procedure until all species are grouped. The single linkage algorithm tends to produce many small clusters.
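The agglomeration loop just described can be written out directly. This sketch uses a hypothetical four-item distance matrix and the single-linkage rule, where the distance between two clusters is the minimum over all cross pairs:

```python
# Single-linkage agglomerative clustering over a hypothetical distance matrix.
d = {
    "A": {"B": 1, "C": 4, "D": 6},
    "B": {"A": 1, "C": 4, "D": 6},
    "C": {"A": 4, "B": 4, "D": 2},
    "D": {"A": 6, "B": 6, "C": 2},
}

def single_linkage(d):
    clusters = [frozenset([name]) for name in d]
    merges = []                      # records (cluster, cluster, merge distance)
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: cluster-to-cluster distance is the
                # minimum distance over all cross pairs of members
                dist = min(d[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        merges.append((clusters[i], clusters[j], dist))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

merges = single_linkage(d)
```

Here A and B fuse first (distance 1), then C and D (distance 2), and finally the two clusters join at the single-linkage distance 4.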

Clustering using a predefined number of clusters: K-means

K-means clustering starts from a predefined number of clusters and then arranges the items so that the distances between clusters are maximized with respect to the distances within the clusters. Technically, the algorithm first randomly assigns cluster means and then places the items (each time recalculating the cluster means) until an optimal solution (convergence) has been reached. K-means always uses Euclidean distances.
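The assign-then-update iteration can be sketched for one-dimensional data. Real implementations initialize the means randomly, as described above; this sketch uses the first k points so that the result is reproducible:

```python
import numpy as np

def kmeans_1d(x, k, iters=100):
    """Minimal K-means for 1-D data with Euclidean distance.
    Deterministic init (first k points) keeps the sketch reproducible."""
    means = x[:k].astype(float).copy()
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        # assignment step: each point goes to its nearest mean
        labels = np.argmin(np.abs(x[:, None] - means[None, :]), axis=1)
        # update step: each mean moves to the centroid of its points
        new_means = np.array([x[labels == j].mean() for j in range(k)])
        if np.allclose(new_means, means):
            break                    # convergence: means stopped moving
        means = new_means
    return means, labels

x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
means, labels = kmeans_1d(x, k=2)
```

With two well-separated groups the algorithm converges to the group centroids 1 and 11 in a few iterations.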

Neighbour joining

Neighbour joining is particularly used to generate phylogenetic trees. You need dissimilarities (phylogenetic distances) d(X, Y) between all elements X and Y. For n elements, calculate for every pair

Q(X, Y) = (n - 2) d(X, Y) - Σ_Z d(X, Z) - Σ_Z d(Y, Z).

Select the pair with the lowest value of Q and join it into a new node U. Calculate the distances from every remaining element Z to the new node,

d(U, Z) = [d(X, Z) + d(Y, Z) - d(X, Y)] / 2,

and repeat with the reduced matrix of new dissimilarities.
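The pair-selection step can be sketched as follows. The distance matrix is hypothetical, and only the Q-criterion and the choice of the first pair to join are shown, not the full tree construction:

```python
import itertools

# hypothetical phylogenetic distances between four taxa
d = {
    ("A", "B"): 2, ("A", "C"): 4, ("A", "D"): 4,
    ("B", "C"): 4, ("B", "D"): 4, ("C", "D"): 2,
}
taxa = ["A", "B", "C", "D"]

def dist(x, y):
    return 0 if x == y else d.get((x, y), d.get((y, x)))

def q_matrix(taxa):
    """Q(X, Y) = (n - 2) d(X, Y) - sum_Z d(X, Z) - sum_Z d(Y, Z)."""
    n = len(taxa)
    row = {t: sum(dist(t, z) for z in taxa) for t in taxa}
    return {(x, y): (n - 2) * dist(x, y) - row[x] - row[y]
            for x, y in itertools.combinations(taxa, 2)}

q = q_matrix(taxa)
pair = min(q, key=q.get)   # the pair joined first
```

With these distances the close pairs (A, B) and (C, D) both reach the minimal Q of -16, and the first of them is selected for joining.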

Factor analysis

Is it possible to group the variables according to their values for the countries?

[Slide: data table with the variables T (Jan), T (July), Mean T, Diff T, GDP, GDP/C and Elev, their correlations, and three extracted factors (Factor 1 to Factor 3); the numerical values are not recoverable from the transcript.]

The task is to find the coefficients of correlation between the original variables and the extracted factors from the analysis of the coefficients of correlation between the original variables.

Because the f values are also Z-transformed, the sum of the squared loadings of a factor equals its eigenvalue:

λ_i = Σ_j a_ji²

How to compute the factor loadings? The dot product of orthonormal matrices gives the unity matrix. The fundamental theorem of factor analysis: the loading matrix A reproduces the correlation matrix of the original variables, R = A Aᵀ.
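A numerical check of the fundamental theorem, using a small hypothetical correlation matrix: the loadings are the eigenvectors of R scaled by the square roots of their eigenvalues, and with all factors kept they reproduce R exactly.

```python
import numpy as np

# hypothetical correlation matrix of three Z-transformed variables
R = np.array([[1.0, 0.8, 0.3],
              [0.8, 1.0, 0.4],
              [0.3, 0.4, 1.0]])

eigval, eigvec = np.linalg.eigh(R)           # eigh returns ascending eigenvalues
order = np.argsort(eigval)[::-1]             # sort factors by eigenvalue, largest first
eigval, eigvec = eigval[order], eigvec[:, order]

A = eigvec * np.sqrt(eigval)                 # loadings a_ji = v_ji * sqrt(lambda_i)
```

The column sums of the squared loadings recover the eigenvalues, matching the relation stated above.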

[Slide: the n cases-by-factors matrix of Z-transformed factor values b, with factors F1 and F2 in columns (entries f_11 ... f_62).]

Factors are new variables. They have factor values (independent of the loadings) for each case. These factor values can now be used in further analyses, for instance in regression analysis.

We are looking for a new x,y coordinate system in which the data are closest to the longest axis. PCA in fact rotates the original data set to find a solution where the data are closest to the axes. PCA leaves the number of axes unchanged; only a few of these rotated axes can be interpreted from the distances to the original axes. We interpret the new axes on the basis of their distance (measured by their angle) to the original axes. The new axes are the principal axes (eigenvectors) of the dispersion matrix obtained from the raw data. PCA is an eigenvector method: the principal axes are eigenvectors.
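The rotation can be sketched directly as an eigenanalysis of the dispersion matrix; the synthetic data below are hypothetical, stretched along one direction so that the first principal axis is obvious:

```python
import numpy as np

rng = np.random.default_rng(42)
# synthetic 2-D data stretched along one direction
t = rng.normal(size=300)
X = np.column_stack([t, 2.0 * t + rng.normal(scale=0.2, size=300)])

Xc = X - X.mean(axis=0)                    # center the data
C = Xc.T @ Xc / (len(X) - 1)               # dispersion (covariance) matrix
eigval, eigvec = np.linalg.eigh(C)         # principal axes = eigenvectors
eigval, eigvec = eigval[::-1], eigvec[:, ::-1]   # longest axis first
scores = Xc @ eigvec                       # data rotated onto the new axes
```

The variance along each new axis equals the corresponding eigenvalue, and the rotated coordinates are uncorrelated, which is exactly the "closest to the axes" property described above.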

Programs differ in the direction (sign) of the eigenvectors they return. This does not change the results but might pose problems for interpreting the factors in terms of the original variables.

Principal coordinate analysis

PCoA uses different distance metrics to generate the dispersion matrix.
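A minimal PCoA sketch: whatever metric produced the distance matrix D, the squared distances are double-centered (Gower's method) and the eigenvectors of the result give coordinates that reproduce D as well as possible. Checked here on exact Euclidean distances from a 3-4-5 triangle, which a two-dimensional embedding recovers perfectly:

```python
import numpy as np

def pcoa(D, k=2):
    """Principal coordinate analysis: embed a distance matrix D in k
    dimensions via Gower's double-centering of the squared distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    B = -0.5 * J @ (D ** 2) @ J
    eigval, eigvec = np.linalg.eigh(B)
    idx = np.argsort(eigval)[::-1][:k]      # largest eigenvalues first
    return eigvec[:, idx] * np.sqrt(np.maximum(eigval[idx], 0.0))

# Euclidean distances between three points in the plane (3-4-5 triangle)
pts = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
D = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1))
Y = pcoa(D, k=2)
```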

Using PCA or PCoA to group cases

Rules of thumb for interpreting a factor:
- A factor might be interpreted if more than two variables have loadings higher than 0.7.
- A factor might be interpreted if more than four variables have loadings higher than 0.6.
- A factor might be interpreted if more than ten variables have loadings higher than 0.4.

Correspondence analysis (reciprocal averaging, seriation, contingency table analysis)

Correspondence analysis ordinates the rows and columns of a matrix simultaneously according to their principal axes. It uses χ²-distances instead of correlation coefficients or Euclidean distances.

[Slide: contingency table and the derived χ²-distances; the numerical values are not recoverable from the transcript.]
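One standard way to compute the row and column ordination (rather than the iterative spreadsheet approach used later in these slides) is a singular value decomposition of the table of standardized χ² residuals. The contingency table below is hypothetical:

```python
import numpy as np

N = np.array([[20.0, 10.0,  5.0],
              [ 5.0, 15.0, 10.0],
              [10.0,  5.0, 20.0]])   # hypothetical contingency table

P = N / N.sum()                      # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)  # row and column masses
# standardized residuals: chi-square departures from independence
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, s, Vt = np.linalg.svd(S)
# principal coordinates (first two axes) for rows and columns
rows = (U[:, :2] * s[:2]) / np.sqrt(r)[:, None]
cols = (Vt.T[:, :2] * s[:2]) / np.sqrt(c)[:, None]
```

The sum of the squared singular values equals the total inertia (the χ² statistic divided by the table total), and one singular value is always zero because the trivial axis is removed.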

We take the transposed raw data matrix and calculate its eigenvectors in the same way: correspondence analysis is row and column ordination, and both sets of scores are displayed together in a joint plot.

The plots are similar but differ numerically and in orientation. The orientation problem comes again from the way Excel calculates eigenvalues. Row and column eigenvectors differ in scale; for a joint plot the vectors have to be rescaled.

Reciprocal averaging

Sorting the matrix according to the row and column scores rearranges it so that the largest values come to lie near the matrix diagonal.

Seriation using reciprocal averaging

Spreadsheet formulas from the slide:
=los()                                        starting scores
=(B85*B$97+C85*C$97+D85*D$97+E85*E$97)/$F85   weighted mean
=(H85-H$94)/H$95                              Z-transformed weighted mean

Repeat until the scores become stable.
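The spreadsheet loop above, written out: alternate weighted averaging of row and column scores with a Z-transformation until the scores stabilize. The incidence matrix here is a hypothetical banded one, so sorting by the converged scores lines the ones up along the diagonal; for reproducibility the sketch uses fixed rather than random starting scores.

```python
import numpy as np

# hypothetical banded presence-absence matrix (rows = species, cols = sites)
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)

def z(v):
    """Z-transform: zero mean, unit standard deviation."""
    return (v - v.mean()) / v.std()

col = z(np.arange(A.shape[1], dtype=float))   # fixed starting column scores
for _ in range(200):
    row = z(A @ col / A.sum(axis=1))          # Z-transformed weighted means
    new_col = z(A.T @ row / A.sum(axis=0))
    if np.allclose(new_col, col, atol=1e-12):
        break                                 # scores have become stable
    col = new_col

order_rows, order_cols = np.argsort(row), np.argsort(col)
```

For this already-banded matrix the converged scores are monotone, so the seriation order leaves rows and columns unchanged.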