Clustering (Gene Expression Data)
6.095/6.895 Computational Biology: Genomes, Networks, Evolution. Lecture, October 4, 2005
Challenges in Computational Biology
[Course overview figure: challenges spanning DNA, RNA transcript, and protein levels, including genome assembly, gene finding, regulatory motif discovery, database lookup, gene expression analysis, sequence alignment, evolutionary theory, cluster discovery, Gibbs sampling, protein network analysis, emerging network properties, regulatory network inference, comparative genomics, and RNA folding]
Plan
– Gene expression data / DNA microarrays
– Feature selection and clustering
DNA Microarrays
To measure levels of messages in a cell:
– Construct an array with DNA sequences for multiple genes
– Hybridize each RNA in your sample to a sequence on the array (all sequences from the same gene hybridize to the same spot)
– Measure the number of hybridizations for each spot
[Figure: RNA from the expressed genes is reverse-transcribed (RT) into cDNA, hybridized to the matching DNA spots on the array, and the signal at each spot is measured]
Result
– 6000 genes in one shot: the entire transcriptome observable in one experiment
– Can perform multiple experiments under varying conditions: temperature, time, sugar level, other chemicals, gene knock-outs, …
Noise
Sources of noise:
– Cross-hybridization
– Non-uniform hybridization kinetics
– Non-linearity of array response to concentration
– Non-linear amplification
– Improper probe sequence
– Differences in materials/procedures
Expression Value Normalization
Noise model: y_ij = n_i · α_ij · (c_j · t_ij) + ε_ij
– y_ij: observed level for gene j on chip i
– t_ij: true level
– c_j: gene constant
– n_i: multiplicative chip normalization
– α_ij, ε_ij: multiplicative and additive noise terms
Estimating the parameters:
– n_i: from spiked-in control probes not present in the genome studied
– c_j: from control experiments of known concentrations for gene j
– ε_ij: un-spiked control probes should read zero
– α_ij: from spiked controls that are constant across chips
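As a concrete illustration, here is a minimal sketch of how the chip factor n_i might be estimated from spiked-in controls and divided out. The function name and the median-ratio estimator are my assumptions for illustration, not the course's prescribed procedure.

```python
import numpy as np

def normalize_chips(y, spike_cols, spike_true):
    """Estimate per-chip normalization n_i from spiked-in control probes.

    y          : (chips x probes) matrix of observed intensities
    spike_cols : column indices of the spiked-in control probes
    spike_true : known true concentrations of those controls

    Assumes the multiplicative part of the model, y_ij ~ n_i * t_ij, holds
    for the controls, so n_i is estimated as the median observed/true ratio.
    """
    ratios = y[:, spike_cols] / spike_true          # observed/true per control
    n = np.median(ratios, axis=1, keepdims=True)    # robust per-chip estimate
    return y / n                                    # divide out the chip effect

# Toy example: 2 chips, 4 probes; the last two columns are spiked controls.
y = np.array([[10.0, 20.0, 5.0, 15.0],
              [20.0, 40.0, 10.0, 30.0]])
y_norm = normalize_chips(y, spike_cols=[2, 3], spike_true=np.array([5.0, 15.0]))
print(y_norm)  # the second chip (twice as bright) is scaled to match the first
```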
Gene expression data
For each gene j we have a vector t_j = (t_1j, t_2j, …, t_dj). Now what? That is, what can we do with this data?
Supervised vs. unsupervised “learning”
Note the parallel with modeling biological sequences, which we saw last week. What do we do when we don't have any models?
– We can look for patterns, i.e., similarities between the different genes
– We can look for recurring themes
Clustering
The problem
Group genes into co-regulated sets:
– Observe cells under different environmental changes
– Find genes whose expression profiles are affected in a similar way
– These genes are potentially co-regulated, i.e., regulated by the same transcription factor
Clustering!
Clustering expression levels
Clustering process:
1. How to tell if two expression profiles are similar? Define the (dis)similarity measure between two profiles.
2. How to group multiple profiles into meaningful subsets? Describe the clustering procedure.
3. Are the results meaningful? Evaluate the statistical significance of a clustering.
And don't forget about:
– De-noising
– Choice of experiments/features
(Dis)similarity measures
Distance metrics (between vectors x and y):
– “Manhattan” distance: MD(x,y) = ∑_i |x_i − y_i|
– Euclidean distance: ED(x,y) = [∑_i (x_i − y_i)²]^(1/2)
– SSE: SSE(x,y) = ∑_i (x_i − y_i)²
Correlation: C(x,y) = ∑_i x_i · y_i (possibly take the absolute value)
Data pre-processing: instead of clustering on direct observation of expression values…
– … can cluster based on differential expression from the mean, e.g., ∑_i |x_i − avg(x) − (y_i − avg(y))|
– … or differential expression normalized by standard deviation, e.g., ∑_i |(x_i − avg(x))/stdev(x) − (y_i − avg(y))/stdev(y)|
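A minimal sketch of these measures in Python (NumPy assumed). The function names are mine, and the pre-processing variants are expressed as simple standardization applied before a metric.

```python
import numpy as np

def manhattan(x, y):
    """MD(x, y) = sum_i |x_i - y_i|"""
    return np.sum(np.abs(x - y))

def euclidean(x, y):
    """ED(x, y) = sqrt(sum_i (x_i - y_i)^2)"""
    return np.sqrt(np.sum((x - y) ** 2))

def sse(x, y):
    """SSE(x, y) = sum_i (x_i - y_i)^2"""
    return np.sum((x - y) ** 2)

def correlation(x, y):
    """C(x, y) = sum_i x_i * y_i (uncentered, as on the slide)"""
    return np.dot(x, y)

def standardize(x):
    """Pre-processing: subtract the mean, divide by the standard deviation."""
    return (x - x.mean()) / x.std()

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(manhattan(x, y), euclidean(x, y), sse(x, y))
print(manhattan(standardize(x), standardize(y)))  # 0: same shape once standardized
```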
Clustering Algorithms
– Hierarchical: merge data points successively to construct a tree
– Non-hierarchical: place k means to best explain the data
[Figure: the same eight points a-h grouped by a hierarchical tree versus by three centers c1, c2, c3]
Hierarchical clustering
Bottom-up algorithm:
– Initialization: each point in a separate cluster
– At each step: choose the pair of closest clusters and merge them
The exact behavior of the algorithm depends on how we define the distance CD(X,Y) between clusters X and Y. This avoids the problem of specifying the number of clusters.
Distance between clusters
– Single-link method: CD(X,Y) = min_{x∈X, y∈Y} D(x,y)
– Complete-link method: CD(X,Y) = max_{x∈X, y∈Y} D(x,y)
– Average-link method: CD(X,Y) = avg_{x∈X, y∈Y} D(x,y)
– Centroid method: CD(X,Y) = D(avg(X), avg(Y))
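A minimal sketch of the bottom-up algorithm with a pluggable cluster-distance function (single-link shown). This is my own illustrative implementation, deliberately unoptimized; scipy.cluster.hierarchy.linkage offers a production version of the same idea.

```python
import numpy as np

def single_link(X, Y):
    """CD(X, Y) = min over x in X, y in Y of Euclidean D(x, y)."""
    return min(np.linalg.norm(x - y) for x in X for y in Y)

def agglomerate(points, cluster_dist, k=1):
    """Repeatedly merge the closest pair of clusters until only k remain."""
    clusters = [[p] for p in points]          # each point starts alone
    while len(clusters) > k:
        # Find the pair of clusters with the smallest cluster distance.
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into i
        del clusters[j]
    return clusters

pts = [np.array(p) for p in [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]]
for c in agglomerate(pts, single_link, k=3):
    print([tuple(p) for p in c])
```

Swapping in a complete-link or centroid version of cluster_dist changes the merge order without touching the rest of the algorithm, which is exactly the point of the slide.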
Example I [figure]
Example II [figure]
K-means algorithm
Each cluster X_i has a center c_i. Define the clustering cost criterion
COST(X_1, …, X_k) = ∑_i ∑_{x∈X_i} SSE(x, c_i), where SSE(x,y) = ∑_j (x_j − y_j)²
The algorithm tries to find clusters X_1, …, X_k and centers c_1, …, c_k that minimize COST.
K-means algorithm:
– Initialize centers “somehow” (how?)
– Repeat:
   – Compute the best clusters for the given centers → attach each point to the closest center
   – Compute the best centers for the given clusters → choose the centroid of the points in each cluster
– Until the COST is “small”
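A minimal k-means sketch under these definitions. Random initialization and a fixed iteration cap are my assumptions; the slide deliberately leaves both open.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Alternate the two steps: assign points to the closest centers,
    then recompute each center as the centroid of its cluster."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Best clusters for given centers: closest center per point.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Best centers for given clusters: centroid of each cluster
        # (keep the old center if a cluster goes empty).
        new_centers = np.array([points[labels == i].mean(axis=0)
                                if np.any(labels == i) else centers[i]
                                for i in range(k)])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    cost = np.sum((points - centers[labels]) ** 2)   # COST = sum of SSEs
    return labels, centers, cost

pts = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.], [10., 0.]])
labels, centers, cost = kmeans(pts, k=3)
print(labels, cost)
```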
Choosing the optimal center
Consider a cluster X and a center c (not necessarily a centroid). We want to minimize
∑_{x∈X} SSE(x,c) = ∑_{x∈X} ∑_i (x_i − c_i)² = ∑_i ∑_{x∈X} (x_i − c_i)²
We can optimize each c_i separately:
∑_{x∈X} (x_i − c_i)² = ∑_{x∈X} (x_i² − 2 x_i c_i + c_i²) = ∑_{x∈X} x_i² − 2 c_i ∑_{x∈X} x_i + |X| c_i²
This is a quadratic in c_i, minimized where its derivative vanishes, so the optimum is c_i = ∑_{x∈X} x_i / |X|, i.e., the centroid.
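A quick numeric check of this claim (a sketch; the cluster values are made up): the centroid should never be beaten by a perturbed center.

```python
import numpy as np

X = np.array([[1., 2.], [3., 4.], [5., 0.]])   # made-up cluster
centroid = X.mean(axis=0)

def cost(c):
    """sum over x in X of SSE(x, c)"""
    return np.sum((X - c) ** 2)

# Any randomly perturbed center must cost at least as much as the centroid.
rng = np.random.default_rng(0)
assert all(cost(centroid + rng.normal(size=2)) >= cost(centroid)
           for _ in range(1000))
print("centroid", centroid, "cost", cost(centroid))
```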
Links
– http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletKM.html
– http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletH.html
– http://www.cs.mcgill.ca/~papou/helppage.htm
Relationship between k-means and EM: both optimize two sets of variables by alternation. If you know one (the centers / the model parameters), you can compute the other (the cluster assignments / the expected hidden values), and vice versa. Note the parallel.
Clustering Algorithms: Running time
– Hierarchical: merge data points successively to construct a tree
– Non-hierarchical: place k means to best explain the data
Running time: hierarchical methods
Repeat: choose the pair of closest clusters, then merge.
– Number of iterations: exactly n − 1
– Iteration cost: at most n² computations of CD(·,·)
– How many point-point distance computations? At most n² as well!
Total running time: O(n³)
What about the running time for k-means?
Improvements
– Single-link clustering is equivalent to building a minimum spanning tree, which takes O(n²) time
– …
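A sketch of this connection (my own illustration, not from the slides): build the MST with Prim's algorithm in O(n²); deleting the k − 1 heaviest MST edges leaves connected components that are exactly the single-link clusters.

```python
import numpy as np

def prim_mst(points):
    """O(n^2) Prim's algorithm: returns MST edges as (weight, u, v)."""
    n = len(points)
    in_tree = [False] * n
    best = np.full(n, np.inf)     # cheapest known edge from the tree to each vertex
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):        # relax edges out of u
            d = np.linalg.norm(points[u] - points[v])
            if not in_tree[v] and d < best[v]:
                best[v], parent[v] = d, u
    return edges

def single_link_clusters(points, k):
    """Drop the k-1 heaviest MST edges; components = single-link clusters."""
    light = sorted(prim_mst(points))[:len(points) - k]   # keep the light edges
    labels = list(range(len(points)))                    # tiny union-find
    def find(a):
        while labels[a] != a:
            labels[a] = labels[labels[a]]
            a = labels[a]
        return a
    for _, u, v in light:
        labels[find(u)] = find(v)
    return [find(i) for i in range(len(points))]

pts = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.], [10., 0.]])
print(single_link_clusters(pts, k=3))
```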
What have we learned?
Gene expression data:
– Microarray technology
– De-noising
Two methods for clustering:
– Hierarchical clustering: non-parametric, general, bottom-up
– K-means clustering: 'model'-based; relationship with HMMs, alignment
Distance metrics
What's next?
– Evaluating clustering results
– Visualizing clustering output
Evaluating clustering output
Computing the statistical significance of clusters. Setup: N experiments, p labeled +, (N − p) labeled −; a computed cluster contains k elements, m of them positive.
– P-value of a single cluster containing k elements, m of which are positive: the probability that a randomly chosen set of k experiments would contain m positives and k − m negatives
– This is the p-value of uniformity in the computed cluster
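A sketch of this computation via the hypergeometric distribution (SciPy assumed; the tail-sum convention, at least m positives, is my choice):

```python
from scipy.stats import hypergeom

def cluster_pvalue(N, p, k, m):
    """P(at least m positives when drawing k of N experiments, p of which are +).

    hypergeom.sf(m - 1, N, p, k) gives P(X >= m) for X ~ Hypergeom(N, p, k).
    """
    return hypergeom.sf(m - 1, N, p, k)

# Toy example: 100 experiments, 20 labeled +; a cluster of 10 holds 8 positives.
print(cluster_pvalue(N=100, p=20, k=10, m=8))   # very small => significant cluster
```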
Visualizing clustering output
Rearranging tree branches
Optimizing the one-dimensional ordering of tree leaves: flipping the two children of any internal node (e.g., turning the leaf order abdefghc into abghcdef) changes the displayed ordering without changing the tree.
Ziv Bar-Joseph published a fast dynamic-programming algorithm (from what I remember) to calculate the optimal branch re-ordering. It would be fun to show it in lecture, if you have time.
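For reference, SciPy ships an implementation of optimal leaf ordering; a minimal usage sketch (the random data is made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, optimal_leaf_ordering, leaves_list
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))        # 20 genes x 5 experiments, made up

y = pdist(X)                        # condensed pairwise distance matrix
Z = linkage(y, method='average')    # hierarchical clustering (average-link)
Z_ordered = optimal_leaf_ordering(Z, y)  # flip branches to smooth the leaf order
print(leaves_list(Z_ordered))       # leaf ordering to use for display
```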