Correspondence Analysis Ahmed Rebai Center of Biotechnology of Sfax

Correspondence analysis Introduced by Benzécri (1973). Used for uncovering and understanding the structure and patterns in contingency-table data. Involves finding coordinate values that represent the row and column categories in some optimal way.

Contingency tables A table with r rows and c columns; cell (i, j) holds the count N_ij, with row totals N_i., column totals N_.j and grand total N_..:

          1     ...   j     ...   c     | Total
  1       N_11  ...   N_1j  ...   N_1c  | N_1.
  ...
  i       N_i1  ...   N_ij  ...   N_ic  | N_i.
  ...
  r       N_r1  ...   N_rj  ...   N_rc  | N_r.
  Total   N_.1  ...   N_.j  ...   N_.c  | N_..
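A minimal numpy sketch of such a table (the counts in N are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical 3x3 contingency table: rows and columns are categories,
# cell (i, j) holds the count N_ij
N = np.array([[20, 10,  5],
              [10, 15, 10],
              [ 5, 10, 20]])

row_totals = N.sum(axis=1)   # N_i.
col_totals = N.sum(axis=0)   # N_.j
grand_total = N.sum()        # N_..
print(row_totals, col_totals, grand_total)
```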

Main idea Develop simple indices that show the relation between rows and columns: indices that tell us simultaneously which columns have more weight in a row category and vice versa. Reduce dimensionality, as in PCA. Indices are extracted in decreasing order of importance.

Which criteria? In a contingency table, global independence between the two variables is generally measured by a chi-square (χ²) statistic calculated as

  χ² = Σ_ij (N_ij − E_ij)² / E_ij,

where E_ij = N_i. × N_.j / N_.. are the expected counts under independence.
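A sketch of this computation on the hypothetical table above, cross-checked against scipy.stats.chi2_contingency (a standard implementation of Pearson's test):

```python
import numpy as np
from scipy.stats import chi2_contingency

N = np.array([[20, 10,  5],
              [10, 15, 10],
              [ 5, 10, 20]])

# Expected counts under independence: E_ij = N_i. * N_.j / N_..
E = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()
chi2_manual = ((N - E) ** 2 / E).sum()

# correction=False requests the plain Pearson statistic
chi2, p, dof, expected = chi2_contingency(N, correction=False)
print(chi2_manual, chi2)  # the two values agree
```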

Decomposition of χ² We have a departure from independence and we want to know why. To find the factors we use the matrix C of dimension (r × c) with elements

  c_ij = (N_ij − E_ij) / √E_ij.
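Continuing the sketch, C is just the matrix of standardized residuals (same hypothetical N):

```python
import numpy as np

N = np.array([[20, 10,  5],
              [10, 15, 10],
              [ 5, 10, 20]])
E = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()

# c_ij = (N_ij - E_ij) / sqrt(E_ij): signed contribution of each cell
# to the departure from independence
C = (N - E) / np.sqrt(E)
print(C)
```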

How to find factors? By singular value decomposition (SVD) of the matrix C: find matrices U, D and V such that C = U D V^T, where the columns of U are eigenvectors of C C^T, the columns of V are eigenvectors of C^T C, and D is a diagonal matrix with entries √λ_k, the λ_k being the eigenvalues of C C^T. The number of nonzero singular values is k = rank(C) ≤ min(r−1, c−1).
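A sketch of the decomposition with numpy (hypothetical N as before); note that the smallest singular value comes out numerically zero, illustrating rank(C) ≤ min(r−1, c−1):

```python
import numpy as np

N = np.array([[20, 10,  5],
              [10, 15, 10],
              [ 5, 10, 20]])
E = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()
C = (N - E) / np.sqrt(E)

# C = U D V^T; np.linalg.svd returns the singular values d_k = sqrt(lambda_k)
U, d, Vt = np.linalg.svd(C, full_matrices=False)
lam = d ** 2                      # eigenvalues of C C^T (and of C^T C)
rank = int(np.sum(d > 1e-10))     # numerical rank of C
print(lam, rank)                  # here rank = 2 = min(3-1, 3-1)
```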

Decomposition of inertia: Tr(C C^T) = Σ_k λ_k = χ² = Σ_ij c_ij². The projections of the rows and the columns are given by the eigenvectors U_k and V_k, linked by the transition relations C V_k = √λ_k U_k and C^T U_k = √λ_k V_k.
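These identities can be checked numerically on the sketch above:

```python
import numpy as np

N = np.array([[20, 10,  5],
              [10, 15, 10],
              [ 5, 10, 20]])
E = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()
C = (N - E) / np.sqrt(E)
U, d, Vt = np.linalg.svd(C, full_matrices=False)

chi2 = ((N - E) ** 2 / E).sum()
# Tr(C C^T) = sum of eigenvalues = chi-square = sum of squared residuals
print(np.trace(C @ C.T), (d ** 2).sum(), chi2, (C ** 2).sum())

# Transition relations between row and column eigenvectors
print(np.allclose(C @ Vt[0], d[0] * U[:, 0]))     # C V_1 = sqrt(lambda_1) U_1
print(np.allclose(C.T @ U[:, 0], d[0] * Vt[0]))   # C^T U_1 = sqrt(lambda_1) V_1
```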

How many factors? The adequacy of the representation by the first two coordinates is measured by the % of explained inertia, (λ_1 + λ_2) / Σ_k λ_k. In general one displays the rows on (U_1, U_2) and the columns on (V_1, V_2). The proximity between row points and column points is then interpreted.
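The explained-inertia shares are simply the normalized eigenvalues (same hypothetical table):

```python
import numpy as np

N = np.array([[20, 10,  5],
              [10, 15, 10],
              [ 5, 10, 20]])
E = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()
C = (N - E) / np.sqrt(E)

d = np.linalg.svd(C, compute_uv=False)   # singular values only
lam = d ** 2
share = lam / lam.sum()
print(f"axis 1: {share[0]:.1%}, axis 2: {share[1]:.1%}, "
      f"first two: {share[:2].sum():.1%}")
```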

CA in practice Proximity of two rows (or two columns) indicates a similar profile, that is, a similar conditional frequency distribution: the two rows (columns) are proportional. The origin is the average of each factor, so a point (row or column) close to the origin indicates an average profile. Proximity of a row to a column indicates that this row has a particularly important weight in this column (if both are far from the origin).
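A matplotlib sketch of the joint display described above, plotting rows on (U_1, U_2) and columns on (V_1, V_2) as the slides do; the labels are hypothetical, and full CA software usually rescales these coordinates by the masses and singular values before plotting:

```python
import numpy as np
import matplotlib.pyplot as plt

N = np.array([[20, 10,  5],
              [10, 15, 10],
              [ 5, 10, 20]])
E = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()
C = (N - E) / np.sqrt(E)
U, d, Vt = np.linalg.svd(C, full_matrices=False)

row_labels = ["row1", "row2", "row3"]   # hypothetical category names
col_labels = ["col1", "col2", "col3"]

fig, ax = plt.subplots()
ax.scatter(U[:, 0], U[:, 1], marker="o", label="rows")
ax.scatter(Vt[0], Vt[1], marker="^", label="columns")
for i, lbl in enumerate(row_labels):
    ax.annotate(lbl, (U[i, 0], U[i, 1]))
for j, lbl in enumerate(col_labels):
    ax.annotate(lbl, (Vt[0, j], Vt[1, j]))
ax.axhline(0, lw=0.5)   # the origin represents the average profile
ax.axvline(0, lw=0.5)
ax.legend()
plt.show()
```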

A first example: French Bac

[Figure: eigenvalues of the analysis]

[Figure: factorial map with Corsica included]

[Figure: factorial map without Corsica, separating classical bac from technical bac]

[Figure: coefficients for regions]

[Figure: coefficients for Bac type]

Properties of CA Allows the consideration of supplementary variables (called 'illustrative variables'): additional variables which do not contribute to the construction of the factorial space, but which can be displayed in it. With such a representation it is possible to assess the proximity between observations, active variables and illustrative variables.
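One way to realize this, sketched under the same conventions as above: an illustrative column is standardized like the active columns and placed on the map with the transition formula v_k = U_k^T c / √λ_k, which every active column satisfies by construction of the SVD. The supplementary counts s and the choice of expected values (taken from the active row margins) are assumptions of this sketch:

```python
import numpy as np

N = np.array([[20, 10,  5],
              [10, 15, 10],
              [ 5, 10, 20]])
E = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()
C = (N - E) / np.sqrt(E)
U, d, Vt = np.linalg.svd(C, full_matrices=False)

# Hypothetical illustrative column: counts over the same r rows
s = np.array([12, 8, 15])
E_s = N.sum(axis=1) * s.sum() / N.sum()   # expected counts from active margins
c_s = (s - E_s) / np.sqrt(E_s)

# From C = U D V^T, column j of C satisfies V_jk = U_k^T c_.j / d_k;
# applying the same formula projects the illustrative column onto the axes
# (use only the axes with nonzero singular values)
coords = (U[:, :2].T @ c_s) / d[:2]
print(coords)   # position of the illustrative column on the first two axes
```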

Tekaia and Yeramian (2006) 208 predicted proteomes representing the three phylogenetic domains and various lifestyles (hyperthermophiles, thermophiles, psychrophiles and mesophiles, including eukaryotes). Variables: amino-acid compositions of the proteomes. Illustrative variables: groups of amino acids (charged, polar, hydrophobic).

Why CA? To analyze the distribution of species in terms of global properties and to discriminate groups. To search for amino-acid signatures in groups of species. To try to understand potential evolutionary trends.

Results The first axis (63% of inertia) corresponds to GC content (from Mycoplasma (23% GC) to Streptomyces (72% GC)). The second axis (14%) corresponds to optimal growth temperature.