Introduction to Statistical Methods for Measuring "Omics" and Field Data: PCA, PCoA, Distance Measures, and AMOVA
Outline
- Covariance matrix and correlation matrix
- Matrix determinant
- Identity matrix
- Eigenvalues
- Eigenvectors
- Principal Component Analysis (PCA)
- Distance measures
- Principal Coordinate Analysis (PCoA)
- Analysis of Molecular Variance (AMOVA)
Covariance matrix and correlation matrix

Example data (10 paired observations):

Observation     X         Y
1               53,047    62,490
2               49,958    58,850
3               41,974    49,445
4               44,366    52,263
5               40,470    47,674
6               36,963    43,542
7               31,474    75,113
8               54,376    72,265
9               60,880    98,675
10              66,774    104,543

Covariance matrix:
        X            Y
X   121050272    170833020
Y   170833020    448841482

Correlation matrix:
        X           Y
X   1.0000000   0.7328961
Y   0.7328961   1.0000000
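The covariance and correlation matrices above can be reproduced with a short pure-Python sketch (the data values are taken from the table; the helper name `cov` is illustrative):

```python
import math

# Paired observations (X, Y) from the example table.
x = [53047, 49958, 41974, 44366, 40470, 36963, 31474, 54376, 60880, 66774]
y = [62490, 58850, 49445, 52263, 47674, 43542, 75113, 72265, 98675, 104543]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Sample covariance uses the n - 1 denominator.
def cov(a, b, ma, mb):
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (n - 1)

cov_xx = cov(x, x, mx, mx)
cov_yy = cov(y, y, my, my)
cov_xy = cov(x, y, mx, my)

# Correlation = covariance divided by the product of standard deviations.
corr_xy = cov_xy / math.sqrt(cov_xx * cov_yy)

print(round(cov_xx), round(cov_xy), round(cov_yy))
print(round(corr_xy, 7))
```

Correlation is just the covariance rescaled to be unit-free, which is why the correlation matrix has ones on its diagonal.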
Matrix determinant

The determinant of a matrix A is written |A|.

For a 2 x 2 matrix
    a11  a12
    a21  a22
the determinant is (a11 x a22) - (a12 x a21).

Applied to the covariance matrix above:
determinant = (121050272 x 448841482) - (170833020 x 170833020)
            = 2.51485 x 10^16
Matrix determinant

Applying the same 2 x 2 formula to the correlation matrix:
determinant = (1.0000000 x 1.0000000) - (0.7328961 x 0.7328961)
            = 0.462863307
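Both determinants follow directly from the 2 x 2 formula; a minimal sketch:

```python
# 2x2 determinant: det = a11*a22 - a12*a21, applied to both matrices above.
def det2(a11, a12, a21, a22):
    return a11 * a22 - a12 * a21

det_cov = det2(121050272, 170833020, 170833020, 448841482)
det_corr = det2(1.0, 0.7328961, 0.7328961, 1.0)
print(det_cov)   # on the order of 2.51485 x 10^16
print(det_corr)  # about 0.462863307
```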
Identity matrix

The identity matrix, denoted I, is a square matrix with ones on the main (NW-SE) diagonal and zeros elsewhere.

I (2 x 2) = 1 0
            0 1

I (3 x 3) = 1 0 0
            0 1 0
            0 0 1

The 4 x 4 and 5 x 5 identity matrices extend the same pattern.
Eigenvalues

Let A be a k x k square matrix and I the k x k identity matrix. The eigenvalues 𝜆1, 𝜆2, . . . 𝜆k satisfy the characteristic equation

|A - 𝝀I| = 0

For example, with

A = 1 0
    1 3

|A - 𝝀I| = | 1-𝝀   0   | = (1 - 𝝀)(3 - 𝝀) = 0
           | 1     3-𝝀 |

so the eigenvalues are 𝝀 = 1 and 𝝀 = 3.
Eigenvectors

Let A be a k x k matrix and 𝜆 an eigenvalue of A. An eigenvector x of A associated with 𝜆 satisfies

Ax = 𝜆x

For the example matrix A above, with eigenvalues 𝝀 = 1 and 𝝀 = 3, the eigenvectors are found by solving

1 0   x1       x1
1 3   x2  = 1  x2     for 𝝀 = 1, and

1 0   x1       x1
1 3   x2  = 3  x2     for 𝝀 = 3.
Eigenvectors

Solving these systems gives the eigenvectors (up to scaling):

For 𝝀 = 1: x = (-2, 1)
For 𝝀 = 3: x = (0, 1)
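The eigenvalue and eigenvector steps for this 2 x 2 example can be sketched in pure Python (the quadratic formula solves the characteristic equation; `eigenvector` is an illustrative helper name):

```python
import math

# The 2 x 2 example matrix A from the slides.
a11, a12, a21, a22 = 1.0, 0.0, 1.0, 3.0

# The characteristic equation |A - lambda*I| = 0 expands to
# lambda^2 - (a11 + a22)*lambda + (a11*a22 - a12*a21) = 0.
tr = a11 + a22
det = a11 * a22 - a12 * a21
disc = math.sqrt(tr * tr - 4 * det)
lam1, lam2 = (tr - disc) / 2, (tr + disc) / 2
print(lam1, lam2)  # 1.0 3.0

# Back-substitute each eigenvalue into (A - lambda*I) x = 0 and read off
# a nonzero solution from whichever row of A - lambda*I is nonzero.
def eigenvector(lam):
    if abs(a11 - lam) > 1e-12 or abs(a12) > 1e-12:
        return (-a12, a11 - lam)   # a vector that zeroes the first row
    return (-(a22 - lam), a21)     # fall back to the second row

print(eigenvector(lam1))  # (-2.0, 1.0), matching the slide
print(eigenvector(lam2))
```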
Principal Component Analysis
PCA: Principal Component Analysis

PCA is a mathematical procedure that transforms a set of variables into a smaller set of uncorrelated variables called principal components (PCs). The PCs are linear combinations of the original variables and can be thought of as "new" variables.

Uses of PCA:
a) Data screening (identifying outliers)
b) Clustering
c) Dimension reduction
PCA

From the k original variables x1, x2, ..., xk, PCA produces k new variables y1, y2, ..., yk:

y1 = a11 x1 + a12 x2 + ... + a1k xk
y2 = a21 x1 + a22 x2 + ... + a2k xk
...
yk = ak1 x1 + ak2 x2 + ... + akk xk

The y's are the principal components, and each coefficient vector (ai1, ai2, ..., aik) is an eigenvector of the covariance (or correlation) matrix.
PCA

Raw data (original variables)
→ Covariance/correlation matrix
→ Characteristic equation (determinant)
→ Eigenvalues
→ Eigenvectors
→ Principal component scores = centered original variables x eigenvectors
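This pipeline can be sketched end-to-end for the earlier two-variable X/Y data, assuming the covariance matrix (rather than the correlation matrix) is used; the 2 x 2 eigen-decomposition is done in closed form:

```python
import math

# Two-variable toy data (the X/Y table from the covariance example).
x = [53047, 49958, 41974, 44366, 40470, 36963, 31474, 54376, 60880, 66774]
y = [62490, 58850, 49445, 52263, 47674, 43542, 75113, 72265, 98675, 104543]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
xc = [v - mx for v in x]            # center each variable
yc = [v - my for v in y]

# Sample covariance matrix of the centered data.
sxx = sum(v * v for v in xc) / (n - 1)
syy = sum(v * v for v in yc) / (n - 1)
sxy = sum(a * b for a, b in zip(xc, yc)) / (n - 1)

# Closed-form eigen-decomposition of the symmetric 2x2 covariance matrix.
tr, det = sxx + syy, sxx * syy - sxy * sxy
disc = math.sqrt(tr * tr - 4 * det)
lam1, lam2 = (tr + disc) / 2, (tr - disc) / 2   # lam1 >= lam2

# Unit eigenvector for lam1; the second is its 90-degree rotation.
v1 = (sxy, lam1 - sxx)
norm = math.hypot(v1[0], v1[1])
v1 = (v1[0] / norm, v1[1] / norm)
v2 = (-v1[1], v1[0])

# PC scores = centered data times eigenvectors.
pc1 = [a * v1[0] + b * v1[1] for a, b in zip(xc, yc)]
pc2 = [a * v2[0] + b * v2[1] for a, b in zip(xc, yc)]
```

The variance of the PC1 scores equals the largest eigenvalue, and the PC scores are uncorrelated with each other, which is exactly what PCA promises.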
PCA for clustering using PC scores

Example input (protein measurements for samples M1-M3; NA marks values missing in the source):

Protein        M1    M2    M3
Protein_X1     124   99    4.3
Protein_X2     106   67    7.5
Protein_X3     111   90    9.2
Protein_X4     109   NA    NA
Protein_X5     113   112   9.6
Protein_X6     89    72    10.1
Protein_X7     NA    78    7.7
Protein_X8     190   87    6.8
Protein_X9     123   68    7.6
Protein_X10    116   NA    7.8
Protein_X11    NA    NA    NA
Protein_X12    NA    NA    NA

[PCA scatter plot: M1, M2 and M3 plotted on PC1 vs PC2]
Distance measures
Similarity and dissimilarity measures

Distance measures for continuous (quantitative) data:
- Euclidean distance
- Manhattan distance

Similarity coefficients for binary data (their complements give dissimilarities):
- Jaccard
- Dice
Euclidean distance

Euclidean distance is the most commonly used distance; when people speak of "distance" without qualification, they usually mean Euclidean distance. It is the root of the summed squared differences between the coordinates of a pair of objects.

For example, the distance between the points (1, 1) and (4, 5) is sqrt((4-1)^2 + (5-1)^2) = sqrt(25) = 5.
Euclidean distance

Example: Plant A has coordinates (0, 3, 4, 5) and Plant B has coordinates (7, 6, 3, -1) on four features:

Feature     cost   time   weight   incentive
Plant A     0      3      4        5
Plant B     7      6      3        -1

The Euclidean distance between A and B is
sqrt((0-7)^2 + (3-6)^2 + (4-3)^2 + (5-(-1))^2) = sqrt(49 + 9 + 1 + 36) = sqrt(95) ≈ 9.75
Manhattan distance

Manhattan distance (also known as city-block, boxcar, or absolute-value distance) is the sum of the absolute differences between the coordinates of a pair of objects. For the same two plants:

|0-7| + |3-6| + |4-3| + |5-(-1)| = 7 + 3 + 1 + 6 = 17
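Both distances can be sketched as small functions, using the Plant A/Plant B coordinates from the example:

```python
import math

# Plant A and Plant B across four features (cost, time, weight, incentive).
a = [0, 3, 4, 5]
b = [7, 6, 3, -1]

def euclidean(p, q):
    # Root of the summed squared coordinate differences.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def manhattan(p, q):
    # Sum of absolute coordinate differences.
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

print(euclidean(a, b))  # sqrt(95), about 9.75
print(manhattan(a, b))  # 17
```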
Dissimilarity distance

[Table: 0/1 scores of two samples i and j across Marker1-Marker7, used to illustrate dissimilarity between binary profiles]
Genetic distance

Two samples are scored 1 or 0 at each of N = 7 markers. Counting the four match/mismatch classes:

Fa = markers scored 1 in both samples   = 3
Fb = markers scored 1 in sample 1 only  = 1
Fc = markers scored 1 in sample 2 only  = 2
Fd = markers scored 0 in both samples   = 1
N  = Fa + Fb + Fc + Fd                  = 7

Simple matching distance = (Fb + Fc)/N = 3/7 ≈ 0.43
Jaccard similarity = Fa/(Fa + Fb + Fc) = 3/6 = 0.5, so the Jaccard genetic distance = 1 - 0.5 = 0.5
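A sketch with hypothetical 0/1 marker vectors chosen so they reproduce the slide's counts (Fa = 3, Fb = 1, Fc = 2, Fd = 1); the vectors themselves are invented for illustration:

```python
# Hypothetical 0/1 marker scores giving Fa = 3, Fb = 1, Fc = 2, Fd = 1.
s1 = [1, 1, 1, 1, 0, 0, 0]
s2 = [1, 1, 1, 0, 1, 1, 0]

fa = sum(1 for a, b in zip(s1, s2) if a == 1 and b == 1)  # both present
fb = sum(1 for a, b in zip(s1, s2) if a == 1 and b == 0)
fc = sum(1 for a, b in zip(s1, s2) if a == 0 and b == 1)
fd = sum(1 for a, b in zip(s1, s2) if a == 0 and b == 0)  # both absent
n = fa + fb + fc + fd

simple_match_dist = (fb + fc) / n      # mismatches over all markers
jaccard_sim = fa / (fa + fb + fc)      # ignores shared absences (fd)
jaccard_dist = 1 - jaccard_sim
dice_sim = 2 * fa / (2 * fa + fb + fc) # Dice weights shared presences double
print(simple_match_dist, jaccard_dist, dice_sim)
```

Note the practical difference: simple matching counts shared absences (Fd) as agreement, while Jaccard and Dice ignore them, which usually suits presence/absence marker data better.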
Distance matrix

The distance matrix gives the distance between each pair of elements (lower triangle shown):

     A    B    C    D
B    63
C    94   111
D    67   79   96
E    16   47   83   100
Principal Coordinate Analysis
Principal Coordinate Analysis (PCoA)

Raw data (original variables)
→ Distance matrix
→ Double-centered matrix
→ Eigenvalues and eigenvectors (of the double-centered matrix)
→ Principal coordinate scores = eigenvectors scaled by the square roots of their eigenvalues
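A minimal classical-PCoA sketch, assuming three example points that happen to lie on a line (at positions 0, 3 and 7) so a single recovered axis reproduces the distances exactly; the distance values are illustrative, not from the slides:

```python
import math

# Pairwise Euclidean distances among three collinear points.
D = [[0, 3, 7],
     [3, 0, 4],
     [7, 4, 0]]
n = len(D)

# Gower double-centering: B = -1/2 * J * D^2 * J, with J = I - (1/n) 11'.
sq = [[D[i][j] ** 2 for j in range(n)] for i in range(n)]
row = [sum(sq[i]) / n for i in range(n)]
grand = sum(row) / n
B = [[-0.5 * (sq[i][j] - row[i] - row[j] + grand) for j in range(n)]
     for i in range(n)]

# Power iteration for the leading eigenvalue/eigenvector of B.
v = [1.0, 0.0, 0.0]
for _ in range(200):
    w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
    norm = math.sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]
lam = sum(v[i] * sum(B[i][j] * v[j] for j in range(n)) for i in range(n))

# First principal coordinate: eigenvector scaled by sqrt(eigenvalue).
coord = [math.sqrt(lam) * vi for vi in v]
print(coord)
```

The recovered coordinates match the original spacing up to shift and reflection, which is all a distance matrix can determine.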
Examples of PCoA plots
Analysis of molecular variance (AMOVA)
Analysis of molecular variance (AMOVA)

AMOVA is used to detect population differentiation using molecular markers. It operates on a distance matrix. The P-value is obtained by permutation: the rows and columns of the distance matrix are randomized many times (e.g., 1000), and the P-value is the fraction of randomizations that yield a test statistic at least as extreme as the observed one. AMOVA is not ANOVA: it does not require the assumption of normality.
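The permutation idea behind the P-value can be sketched with a toy two-group example (this illustrates the randomization test only, not the full AMOVA variance partitioning; the data and the between-minus-within statistic are invented for illustration):

```python
import random

# Toy 1D measurements for two clearly separated groups of 5 samples each.
values = [0, 1, 2, 3, 4, 100, 101, 102, 103, 104]
labels = [0] * 5 + [1] * 5

def dist(i, j):
    return abs(values[i] - values[j])

def statistic(lab):
    # Mean between-group distance minus mean within-group distance.
    between, within = [], []
    for i in range(len(lab)):
        for j in range(i + 1, len(lab)):
            (between if lab[i] != lab[j] else within).append(dist(i, j))
    return sum(between) / len(between) - sum(within) / len(within)

random.seed(1)
observed = statistic(labels)
n_perm = 999
hits = 0
for _ in range(n_perm):
    perm = labels[:]
    random.shuffle(perm)  # equivalent to permuting matrix rows/columns
    if statistic(perm) >= observed:
        hits += 1
p_value = (hits + 1) / (n_perm + 1)
print(p_value)
```

Because the groups are strongly separated, almost no random relabeling matches the observed statistic, so the P-value is small; no normality assumption is used anywhere.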
AMOVA

An example of an AMOVA model: this model measures gene diversity among populations, with specific reference to areas of a region within a continent. Indices: i = individuals, j = alleles, k = populations.
RStudio