Comp. Genomics Recitation 7 Clustering and analysis of microarrays
Exercise 1 A microarray that contains probes for all the N metabolic enzymes of the bacterium D.Angerous was used for the following time-series microarray experiment: The bacteria population were exposed to a drug, and gene expression was measured every hour for M hours. The expression values are discretized to {-1,0,1}
Exercise 1 Find the longest expression pattern that is common to at least k enzymes. Each enzyme may start the pattern at a different time. T7T6T5T4T3T2T E E2 010 E E E5 0 1E6 K=3
Solution Treat each expression vector as a string Create a generalized suffix tree O(MN) Find longest k-common substring
Exercise 2 Expression of N genes was measured under a certain condition using a microarray. No discretization was performed. Give a polynomial time algorithm for clustering these genes into exactly k clusters. The objective function is
Pictorially G1G2G3G4G5G6 Expression level If {G3,G4,G5}is a cluster, its contribution to the objective function is d(G3,G5)
Solution Create a weighted directed graph, every gene is a node and the edge from i to j has weight d(i,j-1) if i’s expression is lower than j’s (otherwise ∞) G1G2G3G4G5G6 The path in the graph that corresponds to this clustering is G1 G3 G6. The value of the objective function is d(G1,G2)+d(G3,G5)+0
Solution Next: Find the shortest path that visits exactly k nodes Dynamic programming: Start from k because if l<k P l (k-1)=∞
Exercise 3 A microarray experiment with N genes and M conditions was conducted Describe a polynomial algorithm that determines whether the genes can be clustered into 2 clusters such that the maximum distance d(Gi,Gj) in each cluster < W
Illustration W=2 G1 G2 G3 G4
Solution Create a graph with a node for every gene Add an edge (i,j) if d(i,j)> W Check if the resulting graph is bipartite: Run BFS, if you discover an edge (u,v) to a gray node and the depths of u and v are both even or both odd, answer: “no”.
Solution Not Bipartite
Exercise 4 We are given a microarray with N genes and M experiments We want to cluster the genes into k clusters such that the distance between genes that belong to the same cluster will be < W Can you give a polynomial algorithm that solves this problem?
Solution Probably not More specifically, if we could solve this problem in polynomial time, we could solve a large class of problem that are widely believed to be unsolvable in polynomial time
Solution How can we show that we can probably not find a solution in polynomial time? We will take a problem for which this has already been shown We will construct a polynomial time reduction to our problem So, if our problem could be solved efficiently the “hard” problem could also be solved efficiently
Graph description The following graph can describe our problem: G1 G2 G3 G6 G5 G4 There’s an edge (Gi,Gj) if the distance between Gi and Gj is less than W
Graph description Clustering with k=3:
3COL 3-Colorability: Given a graph G, can we dye its vertices with 3 different colors such that no two adjacent nodes have the same color?
Comparing the problems What is common to both these problems? In both we “cluster” the nodes What are the differences? First, in 3COL there are only 3 clusters instead of k Second, the elements that belong to the same group in 3COL must not have edges between them
Reduction Now that we understand the differences, we can take a graph G that is an input to 3COL, and transform it to a graph G’ and a constant k that are the input to the k- clustering problem We assume that we have a polynomial k- clustering algorithm, and we apply it to (G’,k) and translate the solution to 3COL
Reduction Given the first difference that we noted, what should be the value of k? We set k to 3, i.e. the algorithm should find exactly 3 clusters How do we change G to get G’? G’ has the complement edges of G
Example
Proof Suppose that G is 3 colorable. Let V 1,V 2,V 3 be the groups of nodes that can be colored by distinct colors. There are no edges between any pair of nodes in V 1,and therefore it forms a legal cluster in G’. Similarly, the nodes of V 2 and V 3 form clusters. Since V 1 UV 2 UV 3 contains all the nodes all the genes are clustered in the 3 corresponding clusters.
Proof, second direction Suppose that G’ contains a clustering to 3 legal clusters. These clusters correspond to 3 nodes sets in G such that within each set there are no edges between pairs of nodes. Therefore, assigning a different color to every set is a 3-coloring.