Microarray Data Analysis
Lecture for CS498-CXZ Algorithms in Bioinformatics, Oct 13, 2005
ChengXiang Zhai, Department of Computer Science, University of Illinois, Urbana-Champaign
Outline
- Microarrays
- Hierarchical Clustering
- K-Means Clustering
- Corrupted Cliques Problem
- CAST Clustering Algorithm
Motivation: Inferring Gene Functionality
- Researchers want to know the functions of newly sequenced genes.
- Simply comparing a new gene's sequence to known DNA sequences often does not reveal its function.
- For about 40% of sequenced genes, functionality cannot be ascertained by comparison to sequences of other known genes alone.
- Microarrays allow biologists to infer gene function even when sequence similarity alone is insufficient to infer it.
Microarrays and Expression Analysis
- Microarrays measure the activity (expression level) of genes under varying conditions and time points.
- Expression level is estimated by measuring the amount of mRNA for that particular gene:
  - A gene is active if it is being transcribed.
  - More mRNA usually indicates more gene activity.
Microarray Experiments
- Produce cDNA from mRNA (DNA is more stable).
- Attach phosphors to the cDNA to see when a particular gene is expressed; different-color phosphors are available to compare many samples at once.
- Hybridize the cDNA over the microarray.
- Scan the microarray with a phosphor-illuminating laser; illumination reveals transcribed genes.
- Scan the microarray multiple times, once for each phosphor color.
Microarray Experiments (cont'd)
(Figure from www.affymetrix.com illustrating the experimental workflow. Phosphors can be added at the labeling step instead; laser illumination can then be used in place of staining.)
Interpretation of Microarray Data
- Green: expressed only in the control sample
- Red: expressed only in the experimental sample
- Yellow: equally expressed in both samples
- Black: not expressed in either the control or the experimental sample
Microarray Data
Microarray data are usually transformed into an intensity matrix, in which each entry is the intensity (expression level) of a gene at a measured time point:

            Time X   Time Y   Time Z
  Gene 1      10       8       10
  Gene 2      10       0        9
  Gene 3       4       8.6      3
  Gene 4       7       8        3
  Gene 5       1       2        3

The intensity matrix allows biologists to make correlations between different genes (even if they are dissimilar in sequence) and to understand how gene functions might be related. This is where clustering comes into play.
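The matrix above can be written down directly in code. A minimal sketch in Python (the variable name `expr` is my own; values are taken from the table):

```python
# Intensity matrix: rows are genes, columns are the time points X, Y, Z.
expr = {
    "Gene 1": [10.0, 8.0, 10.0],
    "Gene 2": [10.0, 0.0, 9.0],
    "Gene 3": [4.0, 8.6, 3.0],
    "Gene 4": [7.0, 8.0, 3.0],
    "Gene 5": [1.0, 2.0, 3.0],
}

# Each row is one gene's expression profile across time.
profile = expr["Gene 5"]
mean = sum(profile) / len(profile)   # average expression of Gene 5 over time
```

Clustering then operates on these rows: genes whose profiles rise and fall together are candidates for co-regulation.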
Major Analysis Techniques
- Single-gene analysis: compare the expression levels of the same gene under different conditions. Main technique: significance testing (e.g., the t-test).
- Gene-group analysis: find genes that are expressed similarly across many different conditions (co-expressed genes, suggesting co-regulated genes). Main technique: clustering (many possibilities).
- Gene-network analysis: analyze gene regulation relationships at a large scale. Main technique: Bayesian networks (not covered here).
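For single-gene analysis, the t-test compares a gene's expression between two groups of samples. A rough sketch in pure Python, computing only Welch's t statistic (the function name and the sample values are my own):

```python
import math

def t_statistic(a, b):
    """Welch's t statistic for two independent samples."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)  # sample variances
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / na + var_b / nb)

# Hypothetical expression levels of one gene in control vs. treated samples.
control = [10.0, 11.0, 9.0, 10.0]
treated = [15.0, 16.0, 14.0, 15.0]
t = t_statistic(control, treated)  # large |t| suggests differential expression
```

In practice one would also obtain a p-value from the t distribution (e.g., via `scipy.stats.ttest_ind`); the statistic alone already ranks genes by how strongly their group means differ.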
Clustering of Microarray Data
(Figure: expression profiles with similar genes grouped together into clusters.)
Homogeneity and Separation Principles
- Homogeneity: elements within a cluster are close to each other.
- Separation: elements in different clusters are far apart from each other.
…clustering is not an easy task! Given a set of points, a clustering algorithm might form two distinct clusters as follows.
Bad Clustering
This clustering violates both the homogeneity and separation principles:
- close distances between points in separate clusters;
- far distances between points in the same cluster.
Good Clustering
This clustering satisfies both the homogeneity and separation principles.
Clustering Methods
- Similarity-based (needs a similarity function)
  - Construct a partition
  - Agglomerative, bottom-up
  - Typically "hard" clustering
- Model-based (latent models, probabilistic or algebraic)
  - First compute the model
  - Clusters are obtained easily once the model is available
  - Typically "soft" clustering
Similarity-based Clustering
- Define a similarity function to measure the similarity between two objects.
- Common clustering criterion: find a partition that maximizes intra-cluster similarity and minimizes inter-cluster similarity. Challenge: how to combine the two? (Many possibilities.)
- How to construct clusters? Agglomerative hierarchical clustering (greedy); needs a cutoff (either the number of clusters or a score threshold).
Method 1 (Similarity-based): Agglomerative Hierarchical Clustering
Similarity Measures
Given two gene expression vectors X = {X1, X2, …, Xn} and Y = {Y1, Y2, …, Yn}, define a similarity function s(X, Y) that measures how likely the two genes are to be co-regulated/co-expressed. However, "co-expressed" is hard to quantify, and biological knowledge would help. In practice there are many choices:
- Euclidean distance
- Pearson correlation coefficient
- Better measures are possible (e.g., focusing on a subset of values)
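The two standard choices can be sketched in a few lines of Python (function names are my own):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

Note that two proportional profiles, e.g. [1, 2, 3] and [2, 4, 6], have Pearson correlation 1 even though their Euclidean distance is large: correlation compares the shape of the profiles, which is often closer to what "co-expressed" means.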
Agglomerative Hierarchical Clustering
- Given a similarity function measuring the similarity between two objects,
- gradually group similar objects together in a bottom-up fashion,
- stopping when some stopping criterion is met.
Variations arise from the different ways of computing group similarity based on individual object similarity.
Hierarchical Clustering
(Figure: illustration of hierarchical clustering as a tree of successive merges.)
Hierarchical Clustering Algorithm

  HierarchicalClustering(d, n)
    form n clusters, each with one element
    construct a graph T by assigning one vertex to each cluster
    while there is more than one cluster:
      find the two closest clusters C1 and C2
      merge C1 and C2 into a new cluster C with |C1| + |C2| elements
      compute the distance from C to all other clusters
      add a new vertex C to T and connect it to vertices C1 and C2
      remove the rows and columns of d corresponding to C1 and C2
      add a row and column to d corresponding to the new cluster C
    return T

The algorithm takes as input an n x n distance matrix d of pairwise distances between points.
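The pseudocode above can be turned into a short runnable sketch. Here is one in Python that uses single-link cluster distances and records a merge list in place of the tree T (all names are my own, and the loop stops at a target number of clusters rather than at one):

```python
def hierarchical_clustering(d, n, target=1):
    """Single-link agglomerative clustering over an n x n distance matrix d."""
    clusters = [[i] for i in range(n)]            # n singleton clusters

    def dist(c1, c2):                             # single-link group distance
        return min(d[i][j] for i in c1 for j in c2)

    merges = []                                   # records the tree T as merges
    while len(clusters) > target:
        # find the two closest clusters C1 and C2
        a, b = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        c2 = clusters.pop(b)                      # pop the later index first
        c1 = clusters.pop(a)
        merges.append((c1, c2))
        clusters.append(c1 + c2)                  # the merged cluster C
    return clusters, merges

# Four points on a line: two obvious groups {0, 1} and {5, 6}.
pts = [0.0, 1.0, 5.0, 6.0]
d = [[abs(p - q) for q in pts] for p in pts]
clusters, merges = hierarchical_clustering(d, len(pts), target=2)
```

This naive version recomputes pair distances each round (cubic-or-worse time); the pseudocode's row/column updates to d are the standard way to make it faster.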
Different ways of defining the distance between two clusters in this algorithm lead to different clustering methods…
How to Compute Group Similarity?
Three popular methods. Given two groups g1 and g2:
- Single-link algorithm: s(g1, g2) = similarity of the closest pair
- Complete-link algorithm: s(g1, g2) = similarity of the farthest pair
- Average-link algorithm: s(g1, g2) = average similarity over all pairs
Three Methods Illustrated
(Figure: two groups g1 and g2, showing which pair(s) of points each of the single-link, complete-link, and average-link algorithms considers.)
Group Similarity Formulas
Given C (the new cluster) and C* (any other cluster in the current results), and a similarity function s(x, y) (or, equivalently, a distance function d(x, y), as in the book):
- Single link:   d(C, C*) = min over x in C, y in C* of d(x, y)
- Complete link: d(C, C*) = max over x in C, y in C* of d(x, y)
- Average link:  d(C, C*) = (1 / (|C| |C*|)) * sum over x in C, y in C* of d(x, y)
(With a similarity function s, min and max swap roles: single link takes the most similar pair, complete link the least similar.)
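Each of the three formulas is a one-liner in Python over a precomputed distance matrix (a sketch; names are my own):

```python
def single_link(d, c1, c2):
    """Distance of the closest pair across the two clusters."""
    return min(d[i][j] for i in c1 for j in c2)

def complete_link(d, c1, c2):
    """Distance of the farthest pair across the two clusters."""
    return max(d[i][j] for i in c1 for j in c2)

def average_link(d, c1, c2):
    """Average distance over all cross-cluster pairs."""
    return sum(d[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

# Three points on a line at 0, 1 and 5; compare cluster {0, 1} with {5}.
pts = [0.0, 1.0, 5.0]
d = [[abs(p - q) for q in pts] for p in pts]
```

For this example, single_link gives 4.0 (the 1-to-5 pair), complete_link gives 5.0 (the 0-to-5 pair), and average_link gives 4.5.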
Comparison of the Three Methods
- Single-link: "loose" clusters; individual decision, sensitive to outliers.
- Complete-link: "tight" clusters; individual decision, sensitive to outliers.
- Average-link: "in between"; group decision, insensitive to outliers.
Which one is the best? It depends on what you need!
Method 2 (Model-based): K-Means
A Model-based View of K-Means
- Each cluster is modeled by a centroid point.
- A partition of the data V = {v1, …, vn} into k clusters is modeled by k centroid points X = {x1, …, xk}; note that a model X uniquely determines the clustering results.
- Define an objective function f(X, V) to measure how good or bad V's clustering is under the model X.
- Clustering is thus reduced to finding a model X* that optimizes f(X*, V); once X* is found, the clustering results are simply computed according to X*.
- Major tasks: (1) define d(x, v); (2) define f(X, V); (3) efficiently search for X*.
Squared Error Distortion
Given a set of n data points V = {v1, …, vn} and a set of k points X = {x1, …, xk}:
- Define d(v_i, X) = the Euclidean distance from v_i to the closest point in X.
- Define the squared error distortion
    d(V, X) = (1/n) * sum over i = 1..n of d(v_i, X)^2
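The distortion formula above is straightforward to compute; a minimal Python sketch (names are my own):

```python
def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def squared_error_distortion(V, X):
    """Mean squared distance from each point in V to its closest center in X."""
    return sum(min(sqdist(v, x) for x in X) for v in V) / len(V)

# Three points, two centers: only (0, 1) is off its nearest center, by 1.
V = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0)]
X = [(0.0, 0.0), (4.0, 0.0)]
```

Here the distortion is (0 + 1 + 0) / 3, i.e. 1/3.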
K-Means Clustering Problem: Formulation
- Input: a set V of n points and a parameter k.
- Output: a set X of k points (cluster centers) that minimizes the squared error distortion d(V, X) over all possible choices of X.
Finding the optimal X is NP-hard for k > 1.
K-Means Clustering
Given a distance function:
1. Start with k randomly selected data points, assumed to be the centroids of k clusters.
2. Assign every data point to the cluster whose centroid is closest to it.
3. Recompute the centroid of each cluster.
4. Repeat steps 2-3 until the distance-based objective function converges.
K-Means Clustering: Lloyd Algorithm

  LloydAlgorithm(V, k)
    arbitrarily assign the k cluster centers x1, …, xk
    while the cluster centers keep changing:
      assign each data point to the cluster C_i corresponding to the closest
        cluster representative (i.e., the center x_i), 1 <= i <= k
      after all data points are assigned, recompute each cluster representative
        as the center of gravity of its cluster, i.e., the new representative
        of cluster C is (1/|C|) * sum of v over all v in C

Note: this may converge to a merely locally optimal clustering.
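A runnable sketch of the Lloyd algorithm in pure Python (the names, the random initialization, and the iteration cap are my own choices):

```python
import random

def lloyd(points, k, iters=100, seed=0):
    """Lloyd's algorithm for k-means over points in R^m (tuples of floats)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)               # arbitrary initial centers

    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    clusters = []
    for _ in range(iters):
        # assignment step: each point joins the cluster of its closest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: sqdist(p, centers[c]))
            clusters[i].append(p)
        # update step: move each center to its cluster's center of gravity
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:                # converged: a local optimum
            break
        centers = new_centers
    return centers, clusters

# Two well-separated pairs of points; k = 2 recovers them.
pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
centers, clusters = lloyd(pts, 2)
```

On this toy data the run converges to the centers (0, 0.5) and (5, 5.5) regardless of which two points are sampled as the initial centers; on harder data, different initializations can reach different local optima, which is exactly the caveat in the note above.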
(Figures: successive iterations of the Lloyd algorithm, with the three centers x1, x2, x3 moving as data points are reassigned.)
Evaluation of Clustering Results
- How many clusters?
  - Score threshold in hierarchical clustering (requires some confidence in the scores).
  - Try multiple numbers and see which works best ("best" could mean fitting the data better or recovering some known clusters).
- Quality of a single cluster:
  - A tight cluster is better (but "tightness" must be quantified).
  - Prior knowledge about clusters helps.
- Quality of a set of clusters:
  - Quality of the individual clusters.
  - How well different clusters are separated.
In general, evaluating clusterings and choosing the right clustering algorithm are difficult. The good news is that significant clusters tend to show up regardless of which clustering algorithm is used.
What You Should Know
- Why clustering microarray data is interesting
- How hierarchical agglomerative clustering works
- The three different ways of computing group similarity/distance
- How k-means works