Published by Marion Arlene Miles. Modified over 9 years ago.
1
Correspondence Analysis Ahmed Rebai Center of Biotechnology of Sfax
2
Correspondence analysis Introduced by Benzécri (1973) for uncovering and understanding the structure and patterns of data in contingency tables. It involves finding coordinate values that represent the row and column categories in some optimal way.
3
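Contingency tables Table with r rows (categories of X2) and c columns (categories of X1)

X2 \ X1 | 1    | ... | j    | ... | c    | Total
1       | N_11 | ... | N_1j | ... | N_1c | N_1.
2       | N_21 | ... | N_2j | ... | N_2c | N_2.
...     |      |     |      |     |      |
i       | N_i1 | ... | N_ij | ... | N_ic | N_i.
...     |      |     |      |     |      |
r       | N_r1 | ... | N_rj | ... | N_rc | N_r.
Total   | N_.1 | ... | N_.j | ... | N_.c | N_..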
4
Main idea Develop simple indices that show the relation between rows and columns. The indices tell us simultaneously which columns have more weight in a row category and vice versa. Reduce dimensionality, as in PCA. Indices are extracted in decreasing order of importance.
5
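Which criteria? In a contingency table, global independence between the two variables is generally measured by a chi-square statistic ($\chi^2$), calculated as $\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} (N_{ij} - E_{ij})^2 / E_{ij}$, where $E_{ij} = N_{i.} N_{.j} / N_{..}$ are the expected counts under independence.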
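As an illustration of this criterion, here is a minimal NumPy sketch of the Pearson chi-square statistic, with expected counts taken as (row total x column total) / grand total. The function name `chi_square` and the 3 x 3 table `toy` are made up for the example and do not come from the slides.

```python
import numpy as np

def chi_square(table):
    """Return (chi2, expected) for a two-way table of counts."""
    N = np.asarray(table, dtype=float)
    row_totals = N.sum(axis=1, keepdims=True)    # N_i. as a column vector
    col_totals = N.sum(axis=0, keepdims=True)    # N_.j as a row vector
    expected = row_totals @ col_totals / N.sum() # E_ij under independence
    chi2 = ((N - expected) ** 2 / expected).sum()
    return chi2, expected

# Hypothetical counts, for illustration only.
toy = [[20, 10, 5],
       [10, 20, 10],
       [5, 10, 30]]
chi2, expected = chi_square(toy)
print(chi2)
```

A table whose rows are exactly proportional, such as `[[10, 20], [30, 60]]`, gives a statistic of zero, which is the independence case.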
6
Decomposition of $\chi^2$ We have a departure from independence and we want to know why. To find the factors we use the matrix C of dimension (r x c) with elements $c_{ij} = (N_{ij} - E_{ij}) / \sqrt{E_{ij}}$
7
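How to find factors? Singular value decomposition (SVD) of the matrix C, that is, find matrices U, D and V such that $C = U D V^T$. The columns of U are eigenvectors of $C C^T$, the columns of V are eigenvectors of $C^T C$, and D is a diagonal matrix of $\sqrt{\lambda_k}$, where $\lambda_k$ are the eigenvalues of $C C^T$, $k = 1, \dots, K$ with $K = \mathrm{Rank}(C) \le \min(r-1, c-1)$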
8
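$\mathrm{Tr}(C C^T) = \sum_k \lambda_k = \chi^2 = \sum_{i,j} c_{ij}^2$ The projections of the rows and the columns are given by the eigenvectors $U_k$ and $V_k$, which are linked by $C V_k = \sqrt{\lambda_k} U_k$ and $C^T U_k = \sqrt{\lambda_k} V_k$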
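The decomposition can be checked numerically. The sketch below assumes that the elements of C are the standardized residuals $(N_{ij} - E_{ij}) / \sqrt{E_{ij}}$, which is the choice that makes the sum of squared elements equal the chi-square statistic. The helper `residual_matrix` and the table `toy` are hypothetical.

```python
import numpy as np

def residual_matrix(table):
    """Matrix C of standardized residuals (N_ij - E_ij) / sqrt(E_ij)."""
    N = np.asarray(table, dtype=float)
    E = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()
    return (N - E) / np.sqrt(E)

# Hypothetical counts, for illustration only.
toy = np.array([[20, 10, 5],
                [10, 20, 10],
                [5, 10, 30]])

C = residual_matrix(toy)

# numpy returns U, the singular values d (diagonal of D), and V transposed.
U, d, Vt = np.linalg.svd(C, full_matrices=False)
eigenvalues = d ** 2          # eigenvalues of C C^T
chi2 = (C ** 2).sum()         # sum of squared elements of C

print(eigenvalues)            # last one is numerically zero for this 3 x 3 table
print(eigenvalues.sum(), chi2)
```

For an r x c table at most min(r-1, c-1) eigenvalues are non-zero, so the last eigenvalue of this 3 x 3 example vanishes up to rounding error.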
9
How many factors? The adequacy of the representation by the first two coordinates is measured by the percentage of explained inertia, $(\lambda_1 + \lambda_2) / \sum_k \lambda_k$. In general, rows are displayed on $(U_1, U_2)$ and columns on $(V_1, V_2)$. The proximity between row and column points is what is to be interpreted.
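The explained-inertia ratio is simple to compute once the eigenvalues are known. The following sketch uses made-up eigenvalues, and the function name `explained_inertia` is an assumption for the example.

```python
import numpy as np

def explained_inertia(eigenvalues, n_axes=2):
    """Share of total inertia carried by the n_axes largest eigenvalues."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # decreasing order
    return lam[:n_axes].sum() / lam.sum()

# Hypothetical eigenvalues, for illustration only.
lam = [0.5, 0.3, 0.15, 0.05]
share = explained_inertia(lam)
print(f"{100 * share:.0f}% of inertia on the first two axes")
```

A high share suggests that the two-dimensional display loses little of the structure; a low share suggests that further axes should be inspected.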
10
CA in practice Proximity of two rows (columns) indicates similar profiles, that is, similar conditional frequency distributions: the two rows (columns) are proportional. The origin is the average of each factor, so a point (row or column) close to the origin has an average profile. Proximity of a row to a column indicates that this row has a particularly large weight in this column (if far from the origin).
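To make the notion of a profile concrete, the sketch below computes row profiles (each row divided by its total) and the average profile (column totals divided by the grand total). The table `toy` is made up so that its first two rows are proportional; the names are hypothetical.

```python
import numpy as np

def row_profiles(table):
    """Conditional frequency distribution of each row (rows sum to 1)."""
    N = np.asarray(table, dtype=float)
    return N / N.sum(axis=1, keepdims=True)

# Hypothetical counts: row 1 is exactly twice row 0.
toy = np.array([[10, 20, 30],
                [20, 40, 60],
                [30, 10, 5]])

profiles = row_profiles(toy)
average_profile = toy.sum(axis=0) / toy.sum()   # profile located at the origin
row_masses = toy.sum(axis=1) / toy.sum()        # weight of each row

print(profiles)
print(average_profile)
```

The first two rows have identical profiles even though their counts differ, and the average profile is the mass-weighted mean of the row profiles.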
11
A first example: French Bac
12
Eigenvalues
13
With Corsica
14
Without Corsica (plot labels: Classical bac, Technical bac)
15
Coefficients for regions
16
Coefficients for Bac Type
17
Properties of CA Allows consideration of dummy variables (called ‘illustrative variables’) as additional variables that do not contribute to the construction of the factorial space but can be displayed on it. With such a representation it is possible to determine the proximity between observations and variables, and between illustrative variables and observations.
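One common way to display an illustrative point is to project its profile onto axes computed from the active table alone. The sketch below does this for an illustrative row; illustrative columns are handled symmetrically. It assumes the usual CA scaling (row principal coordinates, column standard coordinates), which the slides do not specify, and all names and counts are hypothetical.

```python
import numpy as np

def ca_fit(table):
    """Row principal coordinates and column standard coordinates."""
    N = np.asarray(table, dtype=float)
    n = N.sum()
    r = N.sum(axis=1) / n                      # row masses
    c = N.sum(axis=0) / n                      # column masses
    # Standardized residuals of the relative frequencies.
    S = (N / n - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, d, Vt = np.linalg.svd(S, full_matrices=False)
    k = min(N.shape) - 1                       # at most min(r, c) - 1 axes
    row_coords = (U[:, :k] * d[:k]) / np.sqrt(r)[:, None]
    col_std = Vt[:k].T / np.sqrt(c)[:, None]
    return row_coords, col_std

def project_supplementary_row(counts, col_std):
    """Place an illustrative row on the existing axes via its profile."""
    counts = np.asarray(counts, dtype=float)
    return (counts / counts.sum()) @ col_std

# Hypothetical active table and illustrative row.
active = np.array([[20, 10, 5],
                   [10, 20, 10],
                   [5, 10, 30]])
row_coords, col_std = ca_fit(active)
extra = project_supplementary_row([8, 8, 4], col_std)
print(extra)
```

The illustrative row plays no part in the SVD, so adding or removing it leaves the axes unchanged. Projecting an active row this way returns its own coordinates, which is a convenient sanity check.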
18
Tekaia and Yeramian (2006) 208 predicted proteomes representing the three phylogenetic domains and various lifestyles (hyperthermophiles, thermophiles, psychrophiles and mesophiles, including eukaryotes). Variables: amino-acid composition of proteomes. Illustrative variables: groups of amino acids (charged, polar, hydrophobic).
19
Why CA? To analyze the distribution of species in terms of global properties and discriminated groups. Search for amino-acid signatures in groups of species. Try to understand potential evolutionary trends.
21
Results First axis (63%) corresponds to GC content (from Mycoplasma (23%) to Streptomyces (72%)). Second axis (14%) corresponds to optimal growth temperature.