Computational Biology, Part 12 Expression array cluster analysis Robert F. Murphy, Shann-Ching Chen Copyright All rights reserved.
Gene Expression Microarray
- A popular method for detecting mRNA expression levels, first reported by the Pat Brown laboratory in 1995 and by Affymetrix in 1996
- There are different technologies for producing the microarray chips and different approaches for analyzing microarray data
- The data should be processed and analyzed carefully
What is gene expression and why is it important?
Gene Expression in different organs
- A highly specific process in which a gene is switched on, and therefore begins production of its protein.
- Sources: image from the Cancer Genome Anatomy Project (CGAP), Conceptual Tour, July 21,
Gene Expression in single organ
- Gene expression also varies within a certain type of cell at different points in time.
- Sources: image from the Cancer Genome Anatomy Project (CGAP), Conceptual Tour, July 21,
Microarray raw data
- Label mRNA from one sample with a red fluorescent probe (Cy5) and mRNA from another sample with a green fluorescent probe (Cy3)
- Hybridize to a chip with specific DNAs fixed to each well
- Measure amounts of green and red fluorescence
- Flash animations: PCR, Microarray
Example microarray image
- A microarray is a technology for detecting mRNA expression levels globally, that is, for thousands of genes simultaneously.
mRNA expression microarray data for 9800 genes (gene number shown vertically) for 0 to 24 h (time shown horizontally) after addition of serum to a human cell line that had been deprived of serum (from
Data extraction
- Adjust fluorescent intensities using standards (as necessary)
- Calculate ratio of red to green fluorescence
- Convert to log2 and round to integer
- Display saturated green = -3 to black = 0 to saturated red = +3
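The ratio-to-display steps can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' pipeline: the function name `log_ratio` is hypothetical, the clipping of values beyond -3 and +3 (so that they display as saturated) is an assumed reading of the display rule, and the standards-based intensity adjustment is left out.

```python
import math

def log_ratio(red, green, limit=3):
    """Rounded log2(red / green), clipped to [-limit, +limit] for display.

    Assumption: ratios beyond the display range are shown as saturated,
    so they are clipped rather than rejected.
    """
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    value = round(math.log2(red / green))   # integer log2 ratio
    return max(-limit, min(limit, value))   # saturate at the display limits

# Red 8x green gives +3 (saturated red); equal intensities give 0 (black).
examples = [log_ratio(800.0, 100.0), log_ratio(100.0, 100.0), log_ratio(50.0, 1600.0)]
```

Working in log2 makes a two-fold increase (+1) and a two-fold decrease (-1) symmetric around zero, which is why the display scale is centred on black.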
Unsupervised clustering algorithms
- Many different types: hierarchical clustering, k-means clustering, self-organising maps, hill climbing, simulated annealing
- All have the same three basic tasks:
  1. Pattern representation: patterns or features in the data
  2. Pattern proximity: a measure of the distance or similarity defined on pairs of patterns
  3. Pattern grouping: methods and rules used in grouping the patterns
Distances
- High dimensionality
- Based on vector geometry: how close are two data points?
- Clusters are determined based on distances
            Array 1   Array 2
  Gene 1       1         4
  Gene 2       1         3
  ...
- Distance(Gene 1, Gene 2) = 1
Bivariate Euclidean Distance
[Figure: Gene a and Gene b plotted by their values in sample 1 and sample 2, at (a1, a2) and (b1, b2); the distance is the straight line between the two points.]
General Multivariate Dataset
- We are given values of p variables for n independent observations
- Construct an n x p matrix M consisting of vectors X_1 through X_n, each of length p
Multivariate Sample Mean
- Define the mean vector μ of length p
- Matrix notation: $\mu(j) = \frac{1}{n} \sum_{i=1}^{n} M(i,j)$
- Vector notation: $\mu = \frac{1}{n} \sum_{i=1}^{n} X_i$
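Multivariate Variance
- Define the variance vector σ² of length p
- Matrix notation: $\sigma^2(j) = \frac{1}{n-1} \sum_{i=1}^{n} \left( M(i,j) - \mu(j) \right)^2$
- Vector notation: $\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \mu)^2$, with the square taken element by element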
Covariance Matrix
- Define a p x p matrix cov (called the covariance matrix), analogous to σ²
- $\mathrm{cov}(j,k) = \frac{1}{n-1} \sum_{i=1}^{n} \left( M(i,j) - \mu(j) \right) \left( M(i,k) - \mu(k) \right)$
- Note that the covariance of a variable with itself is simply the variance of that variable: $\mathrm{cov}(j,j) = \sigma^2(j)$
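As a hedged sketch of the sample statistics described so far, the following pure-Python functions compute the mean vector and covariance matrix of an n x p data matrix stored as a list of rows. The function names are hypothetical, and the n - 1 normaliser (the usual unbiased sample estimate) is an assumption; some texts divide by n instead.

```python
def mean_vector(M):
    """Mean of each of the p columns of the n x p matrix M."""
    n, p = len(M), len(M[0])
    return [sum(row[j] for row in M) / n for j in range(p)]

def covariance_matrix(M):
    """p x p sample covariance matrix (assumes the n - 1 normaliser)."""
    n, p = len(M), len(M[0])
    mu = mean_vector(M)
    return [
        [
            sum((row[j] - mu[j]) * (row[k] - mu[k]) for row in M) / (n - 1)
            for k in range(p)
        ]
        for j in range(p)
    ]

# Three observations of two variables; the second is twice the first.
M = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
mu = mean_vector(M)          # [2.0, 4.0]
cov = covariance_matrix(M)   # [[1.0, 2.0], [2.0, 4.0]]
variances = [cov[j][j] for j in range(len(cov))]  # diagonal = per-variable variance
```

The diagonal of `cov` reproduces the variance vector, which is the point made on the covariance slide.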
Univariate Distance
- The simple distance between the values of a single variable j for two observations i and l is $d(i,l) = |M(i,j) - M(l,j)|$
Univariate z-score Distance
- To measure distance in units of standard deviation between the values of a single variable j for two observations i and l, we define the z-score distance $z(i,l) = \frac{|M(i,j) - M(l,j)|}{\sigma(j)}$, where σ(j) is the standard deviation of variable j
Bivariate Euclidean Distance
- The most commonly used measure of distance between two observations i and l on two variables j and k is the Euclidean distance $d(i,l) = \sqrt{\left( M(i,j) - M(l,j) \right)^2 + \left( M(i,k) - M(l,k) \right)^2}$
- [Figure: observations i and l plotted on variables j and k, at (M(i,j), M(i,k)) and (M(l,j), M(l,k)).]
Multivariate Euclidean Distance
- This can be extended to more than two variables: $d(i,l) = \sqrt{\sum_{j=1}^{p} \left( M(i,j) - M(l,j) \right)^2}$
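The distance measures introduced on these slides can be sketched as follows, assuming the standard definitions (absolute difference, difference scaled by the standard deviation, and square root of summed squared differences). The function names and the example values are illustrative only.

```python
import math

def univariate_distance(a, b):
    """Distance between two values of a single variable."""
    return abs(a - b)

def zscore_distance(a, b, sigma):
    """Distance between two values in units of the variable's standard deviation."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return abs(a - b) / sigma

def euclidean_distance(x, y):
    """Euclidean distance between two observations with any number of variables."""
    if len(x) != len(y):
        raise ValueError("observations must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Two hypothetical genes measured on two arrays.
gene_1 = [1.0, 4.0]
gene_2 = [1.0, 3.0]
d_12 = euclidean_distance(gene_1, gene_2)   # 1.0
```

The same `euclidean_distance` function covers the bivariate and multivariate cases, since the only difference is the number of terms in the sum.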
Effects of variance and covariance on Euclidean distance
- Points A and B have similar Euclidean distances from the mean, but point B is clearly “more different” from the population than point A.
- [Figure: points A and B around a population mean. The ellipse shows the 50% contour of a hypothetical population.]
Mahalanobis Distance
- To account for differences in variance between the variables, and to account for correlations between variables, we use the Mahalanobis distance
- $d^2_{\mathrm{Mahalanobis}}(i,l) = (X_i - X_l) \, \mathrm{cov}^{-1} \, (X_i - X_l)^T$, where cov is the covariance matrix and X_i, X_l are row vectors of length p
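To make the contrast with Euclidean distance concrete, here is a small sketch restricted to two variables, where the covariance matrix can be inverted by hand. The function name `mahalanobis_2d` and the example covariance values are assumptions chosen for illustration; a general implementation for p variables would need a proper matrix inverse.

```python
import math

def mahalanobis_2d(x, mu, cov):
    """Mahalanobis distance of point x from mean mu for a 2 x 2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (dx, dy) * inverse(cov) * (dx, dy)^T, with the
    # 2 x 2 inverse written out as [[d, -b], [-c, a]] / det.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)

mu = (0.0, 0.0)
cov = [[2.0, 1.0], [1.0, 2.0]]       # two positively correlated variables
point_a = (1.0, 1.0)                 # lies along the correlation
point_b = (1.0, -1.0)                # lies against the correlation

euclid_a = math.hypot(*point_a)      # same Euclidean distance for both points
euclid_b = math.hypot(*point_b)
maha_a = mahalanobis_2d(point_a, mu, cov)
maha_b = mahalanobis_2d(point_b, mu, cov)   # larger: B is "more different"
```

Both points are equally far from the mean in Euclidean terms, but the point lying against the direction of correlation has the larger Mahalanobis distance. With an identity covariance matrix the two measures coincide.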
Other distance functions
- We can use other distance functions, including ones in which the weights on each variable are learned
- Cluster analysis tools for microarray data most commonly use the Pearson correlation coefficient
Pearson correlation coefficient
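A minimal sketch of the Pearson correlation coefficient between two expression profiles, using the standard definition (covariance of the two profiles divided by the product of their standard deviations). The function names are hypothetical, and converting the coefficient to a distance as 1 - r is a common convention assumed here, not something the slides state.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)

def pearson_distance(x, y):
    """Assumed convention: 0 for perfectly correlated, 2 for anti-correlated."""
    return 1.0 - pearson(x, y)

gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]    # same shape, ten times the magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]        # opposite trend
r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
```

Unlike Euclidean distance, the correlation ignores the overall magnitude of a profile and compares only its shape, which is one reason it suits expression data where genes differ widely in absolute level.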
Software for performing microarray cluster analysis
- Michael Eisen’s Cluster (Windows only)
- M. de Hoon’s Cluster 3.0 (all OS)
- Tree viewing (links on same site): Java Treeview, Mapletree
Input data for clustering
- Genes in rows, conditions in columns
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Clustering of Microarray Data (cont’d)
- Clusters
Homogeneity and Separation Principles
- Homogeneity: Elements within a cluster are close to each other
- Separation: Elements in different clusters are further apart from each other
- ...clustering is not an easy task!
- Given these points, a clustering algorithm might make two distinct clusters as follows
Bad Clustering
- This clustering violates both Homogeneity and Separation principles
- Close distances from points in separate clusters
- Far distances from points in the same cluster
Good Clustering
- This clustering satisfies both Homogeneity and Separation principles
Clustering Techniques
- Agglomerative: Start with every element in its own cluster, and iteratively join clusters together
- Divisive: Start with one cluster and iteratively divide it into smaller clusters
- Hierarchical: Organize elements into a tree. Leaves represent genes, and the length of the paths between leaves represents the distances between genes. Similar genes lie within the same subtrees
Hierarchical Clustering
Hierarchical Clustering: Example
Hierarchical Clustering (cont’d)
- Hierarchical clustering is often used to reveal evolutionary history
Hierarchical Clustering Algorithm
HierarchicalClustering(d, n)
    Form n clusters, each with one element
    Construct a graph T by assigning one vertex to each cluster
    while there is more than one cluster
        Find the two closest clusters C1 and C2
        Merge C1 and C2 into a new cluster C with |C1| + |C2| elements
        Compute the distance from C to all other clusters
        Add a new vertex C to T and connect it to vertices C1 and C2
        Remove the rows and columns of d corresponding to C1 and C2
        Add a row and column to d corresponding to the new cluster C
    return T
- The algorithm takes an n x n distance matrix d of pairwise distances between points as input.
- Different ways to define distances between clusters may lead to different clusterings.
Hierarchical Clustering: Recomputing Distances
- $d_{\min}(C, C^*) = \min d(x, y)$ over all elements x in C and y in C*: the distance between two clusters is the smallest distance between any pair of their elements
- $d_{\mathrm{avg}}(C, C^*) = \frac{1}{|C^*| \, |C|} \sum d(x, y)$ over all elements x in C and y in C*: the distance between two clusters is the average distance between all pairs of their elements
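The agglomerative procedure and the two cluster-distance rules can be sketched in Python as below. This is an illustrative simplification, not the slides' exact algorithm: instead of deleting and adding rows of d, it recomputes cluster distances from the original pairwise matrix on each pass, and it returns the list of merges rather than an explicit tree T. The function name and the `linkage` parameter are hypothetical.

```python
def hierarchical_clustering(d, linkage="min"):
    """Agglomerative clustering on a symmetric n x n distance matrix d.

    Returns a list of merges (id_a, id_b, distance). Original elements have
    ids 0..n-1; each merge creates a new cluster with the next unused id.
    """
    clusters = {i: [i] for i in range(len(d))}
    merges = []
    next_id = len(d)

    def cluster_distance(a, b):
        pairs = [d[x][y] for x in clusters[a] for y in clusters[b]]
        if linkage == "min":
            return min(pairs)              # smallest pairwise distance
        return sum(pairs) / len(pairs)     # average pairwise distance

    while len(clusters) > 1:
        ids = sorted(clusters)
        # Find the two closest clusters among all unordered pairs.
        a, b = min(
            ((p, q) for i, p in enumerate(ids) for q in ids[i + 1:]),
            key=lambda pq: cluster_distance(*pq),
        )
        merges.append((a, b, cluster_distance(a, b)))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges

# Pairwise distances between four points at positions 0, 1, 5 and 7 on a line.
d = [
    [0, 1, 5, 7],
    [1, 0, 4, 6],
    [5, 4, 0, 2],
    [7, 6, 2, 0],
]
merges_min = hierarchical_clustering(d, linkage="min")
merges_avg = hierarchical_clustering(d, linkage="avg")
```

On this example both rules merge the same pairs in the same order, but the height of the final merge differs (4 with the minimum rule, 5.5 with the average rule), which shows how the choice of cluster distance changes the resulting tree.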
Squared Error Distortion
- Given a data point v and a set of points X, define the distance from v to X, d(v, X), as the (Euclidean) distance from v to the closest point in X.
- Given a set of n data points V = {v_1, ..., v_n} and a set of k points X, define the squared error distortion
  $d(V, X) = \frac{1}{n} \sum_{i=1}^{n} d(v_i, X)^2$
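The definition translates directly into code. The sketch below assumes points are tuples of coordinates and uses `math.dist` (Python 3.8 or later) for the Euclidean distance; the function names are illustrative.

```python
import math

def distance_to_set(v, X):
    """Euclidean distance from point v to the closest point in X."""
    return min(math.dist(v, x) for x in X)

def squared_error_distortion(V, X):
    """Mean of the squared distances from each point in V to its closest center in X."""
    return sum(distance_to_set(v, X) ** 2 for v in V) / len(V)

V = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
two_centers = squared_error_distortion(V, [(1.0, 0.0), (11.0, 0.0)])
one_center = squared_error_distortion(V, [(6.0, 0.0)])
```

With two well-placed centers every point is at distance 1 from its nearest center, so the distortion is far lower than with a single center in the middle.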
K-Means Clustering Problem: Formulation
- Input: A set, V, consisting of n points and a parameter k
- Output: A set X consisting of k points (cluster centers) that minimizes the squared error distortion d(V,X) over all possible choices of X
1-Means Clustering Problem: an Easy Case
- Input: A set, V, consisting of n points
- Output: A single point x (cluster center) that minimizes the squared error distortion d(V,x) over all possible choices of x
- The 1-Means Clustering problem is easy. However, it becomes very difficult (NP-hard) for more than one center.
- An efficient heuristic method for K-Means clustering is the Lloyd algorithm
K-Means Clustering: Lloyd Algorithm
Lloyd Algorithm
    Arbitrarily assign the k cluster centers
    while the cluster centers keep changing
        Assign each data point to the cluster C_i corresponding to the closest cluster representative (center) (1 ≤ i ≤ k)
        After the assignment of all data points, compute new cluster representatives according to the center of gravity of each cluster; that is, the new cluster representative is $\frac{1}{|C|} \sum_{v \in C} v$ for every cluster C
- This may lead to a merely locally optimal clustering.
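A compact sketch of the Lloyd iteration follows. Several details are assumptions made for illustration: the initial centers are passed in by the caller rather than chosen arbitrarily, a cluster that ends up empty keeps its previous center, and `max_iter` is a safety cap that the pseudocode does not include.

```python
import math

def lloyd(points, centers, max_iter=100):
    """Alternate assignment and center-of-gravity updates until centers stop changing."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the center of gravity of its cluster.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centers.append(center)   # assumption: keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
# Deliberately poor starting centers, both inside the left-hand group.
centers, clusters = lloyd(points, [(0.0, 0.0), (2.0, 0.0)])
```

On this small example the iteration recovers the two obvious groups, but in general the result depends on the starting centers, which is why the procedure is only guaranteed to reach a local optimum.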
[Figure, four slides: K-means clustering illustrated with three cluster centers x1, x2, x3]
Clique Graphs
- A clique is a graph with every vertex connected to every other vertex
- A clique graph is a graph where each connected component is a clique
Transforming an Arbitrary Graph into a Clique Graph
- A graph can be transformed into a clique graph by adding or removing edges
- Example: removing two edges to make a clique graph
Corrupted Cliques Problem
- Input: A graph G
- Output: The smallest number of additions and removals of edges that will transform G into a clique graph
Distance Graphs
- Turn the distance matrix into a distance graph
- Genes are represented as vertices in the graph
- Choose a distance threshold θ
- If the distance between two vertices is below θ, draw an edge between them
- The resulting graph may contain cliques
- These cliques represent clusters of closely located data points!
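Building the distance graph is a simple thresholding step, sketched below. Representing the graph as a set of vertex-index pairs is a choice made for this illustration, as are the function name and the example matrix.

```python
def distance_graph(d, theta):
    """Edges (i, j) with i < j for every pair whose distance is below theta."""
    n = len(d)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if d[i][j] < theta}

# Pairwise distances between four points at positions 0, 1, 5 and 7 on a line.
d = [
    [0, 1, 5, 7],
    [1, 0, 4, 6],
    [5, 4, 0, 2],
    [7, 6, 2, 0],
]
edges_tight = distance_graph(d, theta=3)
edges_loose = distance_graph(d, theta=5)
```

With θ = 3 the graph is already a clique graph with two components. Raising θ to 5 adds an edge between the two groups and breaks that property, which is the kind of corruption the Corrupted Cliques problem tries to undo.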
Transforming Distance Graph into Clique Graph
- The distance graph (threshold θ = 7) is transformed into a clique graph after removing the two highlighted edges
- After transforming the distance graph into the clique graph, the dataset is partitioned into three clusters
Heuristics for the Corrupted Cliques Problem
- The Corrupted Cliques problem is NP-hard, but some heuristics exist to solve it approximately
- CAST (Cluster Affinity Search Technique) is a practical and fast algorithm
- CAST is based on the notion of genes close to cluster C or distant from cluster C
- Distance between gene i and cluster C: d(i,C) = average distance between gene i and all genes in C
- Gene i is close to cluster C if d(i,C) < θ and distant otherwise
CAST Algorithm
CAST(S, G, θ)
    P ← Ø
    while S ≠ Ø
        v ← vertex of maximal degree in the distance graph G
        C ← {v}
        while a close gene i not in C or a distant gene i in C exists
            Find the nearest close gene i not in C and add it to C
            Remove the farthest distant gene i in C
        Add cluster C to partition P
        S ← S \ C
        Remove vertices of cluster C from the distance graph G
    return P
- S: set of elements, G: distance graph, θ: distance threshold
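The CAST pseudocode can be sketched in Python as follows. This is one possible reading, with assumptions labelled: the distance graph is derived from the distance matrix d on the fly instead of being passed in, ties for the seed vertex are broken by lowest index, a cluster is never emptied completely, and `max_rounds` is a safety cap added because the add/remove loop is not guaranteed to terminate in general.

```python
def cast(d, theta, max_rounds=1000):
    """Partition elements 0..n-1 using the CAST heuristic on distance matrix d."""
    remaining = set(range(len(d)))
    partition = []

    def avg_dist(i, cluster):
        # d(i, C): average distance between element i and all elements of C.
        return sum(d[i][j] for j in cluster) / len(cluster)

    while remaining:
        def degree(v):
            # Degree of v in the distance graph restricted to unclustered elements.
            return sum(1 for u in remaining if u != v and d[v][u] < theta)

        seed = max(sorted(remaining), key=degree)
        cluster = {seed}
        for _ in range(max_rounds):          # assumption: cap on add/remove rounds
            changed = False
            close = [i for i in remaining - cluster if avg_dist(i, cluster) < theta]
            if close:
                cluster.add(min(close, key=lambda i: avg_dist(i, cluster)))
                changed = True
            distant = [i for i in cluster if avg_dist(i, cluster) >= theta]
            if distant and len(cluster) > 1:
                cluster.remove(max(distant, key=lambda i: avg_dist(i, cluster)))
                changed = True
            if not changed:
                break
        partition.append(sorted(cluster))
        remaining -= cluster
    return partition

# Pairwise distances between four points at positions 0, 1, 5 and 7 on a line.
d = [
    [0, 1, 5, 7],
    [1, 0, 4, 6],
    [5, 4, 0, 2],
    [7, 6, 2, 0],
]
clusters = cast(d, theta=3)
```

Unlike K-means, the number of clusters is not fixed in advance here: it follows from the threshold θ, so a smaller θ tends to produce more, tighter clusters.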
Choosing the number of Centers
- A difficult problem
- The most common approach is to try to find the solution that minimizes the Bayesian Information Criterion
  $\mathrm{BIC} = -2 \ln L + K \ln n$
- L = the likelihood function for the estimated model, K = number of parameters, n = number of samples
Clustering genes and conditions
- Rows and columns can be clustered independently; hierarchical clustering is preferred for visualizing this