Evaluation of Two Methods to Cluster Gene Expression Data
Odisse Azizgolshani, Adam Wadsworth
Protein Pathways / SoCalBSI


Overview:
• Background information
• Statement of the project
• Materials and methods
• Results
• Discussion and conclusion
• Acknowledgements

Microarray Data
Transcriptional response of genes to variations in cellular states. Cellular states: mutations, compound-treated.

          State 1   State 2   State 3   …   State Y
Gene 1      …         …         …       …     …
Gene 2      …         …         …       …     …
Gene 3      …         …         …       …     …
 …          …         …         …       …     …
Gene X      …         …         …       …     …

The data values are the log ratios of the level of gene expression in the mutant or compound-treated state over the level of expression in the wild-type state.
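
For example, in MATLAB (the base-2 logarithm and the numbers here are purely illustrative; the slide does not state the log base used):

expressionMutant = 180;                            % hypothetical expression level in the mutant state
expressionWT = 90;                                 % hypothetical expression level in the wild-type state
logRatio = log2(expressionMutant / expressionWT);  % = 1, i.e. a two-fold up-regulation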

Clustering
• Clustering: organizing genes with similar expression profiles into groups.
• Correlation coefficient: the metric used to determine the similarity between two expression profiles.
• Hierarchical clustering: a way of forming a multi-level hierarchy of gene expression profiles, which can be cut at certain levels to form gene clusters.
• Project: evaluating two different methods of hierarchically clustering expression data (see the sketch below).
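
A minimal MATLAB sketch of this clustering workflow (assuming the Statistics Toolbox; the random data matrix, the 'average' linkage, and the five-cluster cutoff are illustrative choices, not taken from the project):

% data: X genes (rows) by Y states (columns) of log expression ratios
data = rand(50, 10);                        % hypothetical example data
dists = pdist(data, 'correlation');         % pairwise 1 - correlation between gene profiles
tree = linkage(dists, 'average');           % build the multi-level hierarchy
clusterIDs = cluster(tree, 'maxclust', 5);  % cut the dendrogram to form 5 clusters
dendrogram(tree);                           % visualize the hierarchy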

Hierarchical Clustering Method 1
EXPRESSION PROFILES: the gene × state matrix of log ratios (Gene 1 … Gene X across State 1 … State Y)
↓ correlation calculations
GENE CORRELATIONS: a gene × gene matrix of correlation coefficients between expression profiles (diagonal shown as 0 on the slide)
↓ linking genes by expression similarity
DENDROGRAM: a tree joining genes (gene 1 … gene 5 in the slide's sketch)

Method 1 Example
Hughes, T.R., et al. (2000). Functional Discovery via a Compendium of Expression Profiles. Cell 102, 109–126.

Hierarchical Clustering Method 2
EXPRESSION PROFILES: the gene × state matrix of log ratios
↓ correlation calculations
GENE CORRELATIONS 1: a gene × gene matrix of correlations between expression profiles
↓ correlation of correlations
GENE CORRELATIONS 2: a second gene × gene matrix, obtained by correlating each pair of genes' rows of the first correlation matrix
↓ linking genes by correlation similarity
DENDROGRAM: a tree joining genes (gene 1 … gene 5 in the slide's sketch)
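
A minimal MATLAB sketch of Method 2's metric (reusing the hypothetical data matrix from the sketch above; the linkage and cutoff choices are again illustrative):

corr1 = corrcoef(data');                     % Method 1 metric: gene-gene correlations of expression profiles
dists2 = pdist(corr1, 'correlation');        % Method 2 metric: 1 - correlation between rows of corr1
tree2 = linkage(dists2, 'average');          % hierarchy built from the correlation of correlations
clusterIDs2 = cluster(tree2, 'maxclust', 5); % cut into the same number of clusters for comparison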

Method 2 Example
Provided by Matteo Pellegrini, Protein Pathways.

Applications of Clustering
• Functional genomics: gaining information about the possible function of genes of unknown function, by looking at the functions of the genes that cluster together with them.
• Diagnostics: tissues from clinical samples can be clustered to determine disease subtypes (e.g. tumor classification).

Project Details
Project question: in the process of hierarchically clustering gene expression data, which metric generates better clusters?
1. The correlation of gene expression ratios (Method 1)
2. The correlation of the correlations (Method 2)
Dataset: yeast microarray gene expression data (6317 genes, 300 strains)*
Programming environment: MATLAB v6.5
* Hughes, T.R., et al. (2000). Functional Discovery via a Compendium of Expression Profiles. Cell 102, 109–126.

Two Approaches
Problem: determining the quality of the clusters formed, so as to evaluate the two clustering methods.
• Approach I: judge cluster quality by whether genes with the same function cluster together more often under one method than under the other.
• Approach II: judge cluster quality by analyzing the variances (volumes) of the clusters and seeing whether the two methods differ.

Approach I: Gene Function Analysis
If clusters contain many genes with the same function (e.g. transcription), then the clustering method is good.
Two function annotation options:
• 2221 annotated genes with 318 different functions, obtained from one annotation source
• … annotated genes with 99 different functions, obtained from a second annotation source

Approach I Steps
For both annotation options:
• Out of the 6317 yeast genes, select only those genes that have known functions.
• Cluster the genes according to the two methods.
• For each cluster, compare each gene to every other gene in that cluster and count how many pairs have the same function. If a cluster contains n genes, there are n(n−1)/2 gene pairs to compare.
Worked example (a code sketch follows): 6317 genes → ANNOTATE → 2000 genes with known function → CLUSTER → three clusters of 1000, 600, and 400 genes → 499,500, 179,700, and 79,800 pairs to compare, i.e. 759,000 pairs total. Counting the pairs in each cluster that have the same function (e.g. 40,200 same-function pairs in one cluster and 3,204 in another) and dividing by the total gives 0.25: when the genes are partitioned into three clusters, 25% of the gene pairs have the same function.
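
A minimal MATLAB sketch of this pair-counting score (the variable names are hypothetical: `functions` holds a numeric function label for each annotated gene, and `clusterIDs` comes from the clustering step):

% functions: vector of numeric function labels, one per annotated gene
% clusterIDs: cluster assignment for each of those genes
K = max(clusterIDs);
totalPairs = 0; samePairs = 0;
for k = 1:K
    f = functions(clusterIDs == k);       % function labels of the genes in cluster k
    n = length(f);
    totalPairs = totalPairs + n*(n-1)/2;  % n(n-1)/2 pairs to compare in this cluster
    for i = 1:n-1
        samePairs = samePairs + sum(f(i+1:n) == f(i)); % pairs sharing gene i's function
    end
end
score = samePairs / totalPairs;           % fraction of within-cluster pairs with the same function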

Approach I Results

Approach II (1): Determining the quality of the clusters based on their volume, by comparing the average volume of the clusters generated by method 1 and method 2.

Approach II (2): If there are M genes in each cluster and, for each gene, N experiments are chosen, then:
• We have M vectors in an N-dimensional space, which can be visualized as M points.
• The M points generate an ellipsoid if M > the dimensionality of the space.
• The closer the points are to each other, the more correlated they are, and the smaller the volume of the cluster.
• The smaller the volume of the ellipsoid (cluster), the better the quality of the cluster.

(Diagram: M original points in the 3-D space spanned by i, j, k, to be transformed into a centered ellipsoid with known axes.)

Approach II (3): To compute the volume of a cluster, we first compute its covariance matrix. We then use Principal Components Analysis (PCA) to estimate the dimensions of the cluster. PCA constructs a new space from N orthogonal linear combinations of the original basis vectors (each linear combination is a principal component).

Approach II (4): In the new space, the ellipsoid is centered and the covariance matrix is diagonalized. The diagonal elements of the diagonalized matrix are the variances of the data points along the principal components, and the axes D1, …, DN of the centered ellipsoid scale with their square roots (the standard deviations). In three dimensions the volume of the ellipsoid is (4/3) × π × D1 × D2 × D3; in general it is proportional to the product D1 × D2 × … × DN, which is the quantity compared between the two methods.

(Diagram: PCA and diagonalizing the covariance matrix. The M original points in the 3-D space with basis i, j, k are mapped to M points in a new 3-D space with axes D1, D2, D3 along the principal components v1 = c1·i + c2·j + c3·k, v2 = c4·i + c5·j + c6·k, v3 = c7·i + c8·j + c9·k; the covariance matrix becomes the diagonalized covariance matrix.)
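
A minimal MATLAB sketch of this volume estimate for a single cluster (`clusterData`, an M-genes-by-N-experiments matrix, is a hypothetical input name; as noted above, M must exceed N for the ellipsoid to be full-dimensional):

% clusterData: M genes (rows) x N experiments (columns) for one cluster
C = cov(clusterData);        % N x N covariance matrix of the cluster
[V, D] = eig(C);             % diagonalize: diag(D) holds the variances along the principal components
axesD = sqrt(diag(D));       % the ellipsoid axes scale with the standard deviations
volumeProxy = prod(axesD);   % proportional to the ellipsoid volume; compare across methods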

Approach II (5): Question: Is one of the methods systematically generating smaller ellipsoids?

Volume Calculation Results:

Similarity of the Clusters from the Two Methods: Make a K×K matrix (K: the number of clusters in each method) whose elements are:
a_ij = Difference(cluster i in method I, cluster j in method II)
Difference(A, B) = N(Δ(A,B)) / ( N(A∪B) + N(A∩B) )
where:
• A and B are two sets (here: cluster i from method I and cluster j from method II)
• Δ(A,B): the symmetric difference of A and B
• N(Δ(A,B)) = N(A−B) + N(B−A)
• A∪B: the union of the two sets
• A∩B: the intersection of the two sets
• 0 ≤ Difference(A,B) ≤ 1

An example (A: cluster i from method I; B: cluster j from method II):
A − B = {1, 2, 32, 7, 89}, so N(A−B) = 5
B − A = {26, 94, 10, 11}, so N(B−A) = 4
N(Δ(A,B)) = N(A−B) + N(B−A) = 5 + 4 = 9
A∪B = {1, 2, 6, 32, 21, 7, 89, 43, 26, 94, 10, 11}, so N(A∪B) = 12
A∩B = {6, 21, 43}, so N(A∩B) = 3
Dissimilarity score(A, B) = 9 / (12 + 3) = 0.6
Limiting cases: if A = B, then A − B = B − A = Ø, N(Δ(A,B)) = 0, and the score is 0. If A and B are disjoint, then N(Δ(A,B)) = N(A) + N(B), N(A∪B) = N(A) + N(B), N(A∩B) = 0, and the score is 1.
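
The same worked example as a minimal MATLAB check, using the gene sets reconstructed from the example above:

A = [1 2 6 32 21 7 89 43];                                % cluster i from method I
B = [6 21 43 26 94 10 11];                                % cluster j from method II
nSymDiff = length(setdiff(A, B)) + length(setdiff(B, A)); % N(A-B) + N(B-A) = 9
score = nSymDiff / (length(union(A, B)) + length(intersect(A, B))); % 9 / (12 + 3) = 0.6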

Results: Dissimilarity Matrix
(A K×K matrix, K being the number of clusters generated for each method; entry (i, j) is the dissimilarity score for cluster i from method 1 and cluster j from method 2.)

Discussion and Conclusion (1): Conclusions:
• Neither approach can favor one method over the other with certainty; however,
• Approach I favors method I when the number of clusters is small.
• Across the range of cluster numbers tested, while approach I favors method I, approach II fluctuates between the two methods.
• The efficiency of both approaches is dependent on the number of clusters.
• The similarity of the clusters from method I and method II decreases as the number of clusters increases; in fact, the two methods generate very different clusters.

Discussion and Conclusion (2): Problems faced and future questions:
• What is the best cutoff value for clustering?
• In approach I, not all genes were annotated, so around 2/3 of the dataset was ignored.
• Gene annotations are somewhat arbitrary.
• What are other ways to quantify the quality of clusters?
• Memory problem: we couldn't include all the genes and all the experiments at the same time when analyzing the quality of clusters.

Acknowledgments: Special thanks to:
• Our mentor: Dr. Matteo Pellegrini
• The Protein Pathways team: Dr. Darin Taverna, Dr. Peter Bowers, Dr. Mike Thompson, Leon Kopelevich
• The SoCalBSI faculty: Dr. Jamil Momand, Dr. Silvia Heubach, Dr. Sandra Sharp, Dr. Elizabeth Torres, Dr. Wendie Johnston, Dr. Jennifer Faust, Dr. Nancy Warter-Perez, Dr. Beverly Krilowicz
• NIH and NSF, whose funding made this internship possible.

Appendix I: Covariance (1)
The covariance of two features is a measure of how the two features vary together:
• If both have an increasing or decreasing trend, c_ij > 0.
• If one decreases while the other increases, c_ij < 0.
• If the changes of one are independent of the changes of the other, c_ij = 0.

Appendix I: Covariance (2)
If we have M variables and each variable has N measurements, the covariance matrix can be obtained as below:
c_ij = Σ (x_k − μ_x)(y_k − μ_y) / M
where x and y denote measurements i and j, the sum runs over the M variables, and c_ij (i ≠ j) is the covariance of measurement i and measurement j over all M variables. The diagonal elements are the variances of each measurement.
Variance: a measure of how much the points vary around the mean.
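
A quick MATLAB check of this formula on hypothetical numbers (note that MATLAB's cov normalizes by the number of samples minus one rather than by M as on the slide, so the manual sum below matches that convention):

x = [2.5 0.5 2.2 1.9 3.1]';                                       % hypothetical measurement 1
y = [2.4 0.7 2.9 2.2 3.0]';                                       % hypothetical measurement 2
c_manual = sum((x - mean(x)) .* (y - mean(y))) / (length(x) - 1); % covariance of x and y
C = cov([x y]);                                                   % C(1,2) equals c_manual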

Appendix I: Diagonalizing the Covariance Matrix
data = [ … ];                    % data matrix (values not shown)
covariance_data = cov(data);     % covariance matrix
[V, D] = eig(covariance_data);   % diagonalize the covariance matrix
% D is the diagonalized covariance matrix (eigenvalues on the diagonal);
% the columns of V are the corresponding eigenvectors.

Appendix II: Eigenvalues and Eigenvectors
An eigenvector of an n×n matrix A is a nonzero vector x such that Ax = λx for some scalar λ. A scalar λ is called an eigenvalue of A if there is a nontrivial solution x of Ax = λx; such an x is called an eigenvector corresponding to λ.*
*: Lay, David C. Linear Algebra and Its Applications. 3rd ed., p. 303.
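
A small MATLAB check of the defining relation (the matrix A below is an arbitrary hypothetical example):

A = [2 1; 1 2];
[V, D] = eig(A);                       % columns of V are eigenvectors, diag(D) the eigenvalues
residual = A*V(:,1) - D(1,1)*V(:,1);   % Ax - lambda*x for the first eigenpair: numerically zero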

Appendix III: Principal Components Analysis (PCA)
data = [ … ];                                     % data matrix (values not shown)
[pcs, newdata, variances, t2] = princomp(data);   % PCA
% pcs: the principal component coefficients;
% newdata: the data expressed in the principal component space;
% variances: the variances along each principal component;
% t2: Hotelling's T-squared statistic for each observation.

Volume Calculation Results (20 random experiments, … clusters):

Volume Calculation Results (300 experiments chosen first, then the dimensionality of the space was reduced to 20):

Method 1 vs. Method 2 in a More Conceptual View: Method 1 links together the two genes that have the most similar expression patterns. Method 2 links together the two genes whose correlations with all other genes are most similar; i.e., it looks at genes from a more global view (in the context of all the other genes).

Appendix IV: