COMP 790-90 Research Seminar, Spring 2011
Sequence Clustering
CLUSEQ
The primary structures of many biological (macro)molecules are "letter" sequences, despite their 3D structures:
Proteins are built from 20 amino acids: A R N D C Q E G H I L K M F P S T W Y V
DNA has an alphabet of four bases {A, T, G, C}
RNA has an alphabet {A, U, G, C}
Other sequence data: text documents, transaction logs, signal streams.
Structural similarity at the sequence level often suggests a high likelihood of being functionally/semantically related:
Similarity between genes from different organisms may indicate close evolutionary relationships.
Similarity between proteins may indicate biological/functional homologies.
Problem Statement
Clustering based on structural characteristics can serve as a powerful tool to discriminate sequences belonging to different functional categories. The goal is to group sequences such that sequences in each group share similar features. The result can potentially reveal unknown structural and functional categories and lead to a better understanding of nature.
Challenge: how to measure structural similarity?
Measure of Similarity
Edit distance:
computationally inefficient
captures only the optimal global alignment, ignoring many local alignments that often represent important features shared by a pair of sequences
may not be biologically accurate
q-gram based approach:
although it allows significant segments to be identified and matched regardless of their relative positions in different sequences, it ignores the sequential relationships (e.g., ordering, correlation, dependency) among q-grams, so valuable information may be lost, which can impact the quality of the clustering
Hidden Markov model:
captures some low-order correlations and statistics
vulnerable to noise and erroneous parameter settings
requires the number of clusters to be specified in advance, which may not be practical
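The q-gram criticism above can be made concrete: two different sequences can have identical q-gram profiles, so any purely profile-based distance treats them as the same sequence. A minimal sketch (the sequences aabab and abaab are illustrative choices, not from the slides):

```python
from collections import Counter

def qgram_profile(seq: str, q: int) -> Counter:
    """Multiset of all length-q substrings of seq."""
    return Counter(seq[i:i + q] for i in range(len(seq) - q + 1))

# Two distinct sequences with identical 2-gram profiles: both contain
# {aa: 1, ab: 2, ba: 1}, so a 2-gram distance between them is zero
# even though the symbol ordering differs.
s1, s2 = "aabab", "abaab"
print(qgram_profile(s1, 2) == qgram_profile(s2, 2))
```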
Measure of Similarity
Probabilistic suffix tree:
effective in capturing significant structural features
easy to compute and incrementally maintain
Sparse Markov transducer:
allows wild cards
Model of CLUSEQ
CLUSEQ explores significant patterns of sequence formation. Sequences belonging to one group/cluster may follow the same probability distribution of symbols (conditioned on the preceding segment of a certain length), while different groups/clusters may follow different underlying probability distributions. The assumption is that sequences belonging to one cluster are generated from a single source and therefore obey the same underlying probability distribution.
By extracting and maintaining significant patterns characterizing (potential) sequence clusters, one can easily determine whether a sequence should belong to a cluster by calculating the likelihood of (re)producing the sequence under the probability distribution that characterizes the cluster.
Model of CLUSEQ
Sequence: α = s1s2…sl
Cluster: S, modeled by a conditional probability distribution PS
A sequence α is considered a member of S iff α can be predicted under PS with relatively high probability. The probability of predicting α is
PS(α) = PS(s1) PS(s2|s1) … PS(sl|s1…sl-1) = ∏i=1..l PS(si|s1…si-1)
where PS(si|s1…si-1) is the conditional probability that si is the next symbol right after the segment s1…si-1 in cluster S.
If PS(α) is significantly higher than the probability Pr(α) of predicting α by a memoryless random process,
Pr(α) = p(s1) p(s2) … p(sl) = ∏i=1..l p(si),
then we may conclude that α follows a conditional probability distribution similar to that of S. In short: if PS(α) >> Pr(α), we may consider α a member of S.
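The ratio PS(α)/Pr(α) can be sketched directly. The conditional probabilities below are hypothetical stand-ins for a cluster model (in CLUSEQ they would come from the cluster's probabilistic suffix tree); the values reuse the bbaa example that appears later in the slides:

```python
# Hypothetical cluster model: conditional probability of the next
# symbol given the preceding prefix (keyed by (prefix, symbol)).
cond = {
    ("", "b"): 0.55,   ("b", "b"): 0.418,
    ("bb", "a"): 0.87, ("bba", "a"): 0.406,
}
base = {"a": 0.6, "b": 0.4}  # memoryless background model p(s)

def likelihood_ratio(seq: str) -> float:
    """P_S(seq) / Pr(seq): how much better the cluster model predicts
    seq than a memoryless random process does."""
    ps = pr = 1.0
    for i, s in enumerate(seq):
        ps *= cond[(seq[:i], s)]   # P_S(s_i | s_1 ... s_{i-1})
        pr *= base[s]              # p(s_i)
    return ps / pr

# A ratio well above 1 suggests membership in S.
print(likelihood_ratio("bbaa"))
```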
Model of CLUSEQ
Similarity between α and S: noise may be present, and different portions of a (long) sequence may follow different conditional probability distributions (e.g., multi-domain proteins). To accommodate this, the measure is modified to capture the maximum similarity between any contiguous segment of α and S.
Model of CLUSEQ
Given a sequence α = s1s2…sl and a cluster S, a dynamic programming method can be used to calculate the similarity SIMS(α) via a single scan of α. Intuitively, Xi, Yi, and Zi can be viewed as the similarity contributed by the symbol at the ith position of α (i.e., si), the maximum similarity possessed by any segment ending at the ith position, and the maximum similarity possessed by any segment ending at or before the ith position, respectively.
Model of CLUSEQ
Then SIMS(α) = Zl, which can be obtained by
Xi = PS(si|s1…si-1) / p(si)
Yi = max(Xi, Yi-1 · Xi)
Zi = max(Zi-1, Yi)
For example, SIMS(bbaa) = 2.10 if p(a) = 0.6 and p(b) = 0.4:

si               b      b      a      a
PS(si|s1…si-1)   0.55   0.418  0.87   0.406
Xi               1.38   1.05   1.45   0.677
Yi               1.38   1.45   2.10   1.42
Zi               1.38   1.45   2.10   2.10

Linear complexity.
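The single-scan computation can be sketched as below, assuming the recurrences Xi = PS(si|s1…si-1)/p(si), Yi = max(Xi, Yi-1·Xi), Zi = max(Zi-1, Yi), which match the slide's description of Xi, Yi, and Zi; the conditional probabilities are those of the bbaa example:

```python
def sim(cond_probs, base_probs, seq):
    """SIM_S(seq): maximum P_S/Pr ratio over all contiguous segments
    of seq, computed in a single linear-time scan."""
    y = z = 0.0
    for s, p_cond in zip(seq, cond_probs):
        x = p_cond / base_probs[s]   # X_i: ratio contributed by s_i
        y = max(x, y * x)            # Y_i: best segment ending at i
        z = max(z, y)                # Z_i: best segment ending <= i
    return z

base = {"a": 0.6, "b": 0.4}
cond = [0.55, 0.418, 0.87, 0.406]  # P_S(s_i | prefix) for "bbaa"
print(round(sim(cond, base, "bbaa"), 1))  # ~2.1, matching the slide
```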
Probabilistic Suffix Tree
A compact representation to organize the derived conditional probability distribution for a cluster, built on the reversed sequences. Each node corresponds to a segment β and is associated with a counter C(β) and a probability vector P(si|β).
Effective in capturing significant structural features; easy to compute and incrementally maintain.
Probabilistic Suffix Tree
[Figure: an example probabilistic suffix tree over the alphabet {a, b}; e.g., the node labeled ba has C(ba) = 96, P(a|ba) = 0.406, P(b|ba) = 0.594.]
To support efficient maintenance and retrieval of the probability entries, the probabilistic suffix tree serves as a compact representation of the derived (conditional) probability distribution for a cluster. A probabilistic suffix tree is a suffix tree built on the reversed sequences. Each node is labeled with a distinct segment, obtained by concatenating the edge labels on the path from the node to the root. A counter is associated with each node to record the number of occurrences of its label in a sequence cluster; a node is significant if its counter reaches a given count threshold and insignificant otherwise. A probability vector is associated with each node to track the conditional probability distribution PS(si|β) of the next symbol si given the node's label β as the preceding segment.
Given a segment s1…si-1, the node whose label is the longest significant suffix sj…si-1 of s1…si-1 is called the prediction node of s1…si-1, since we always use the value of PS(si|sj…si-1) to approximate the value of PS(si|s1…si-1) in the similarity estimation. Each segment has a unique prediction node, which can be located by traversing from the root along the path si-1 … s2 s1 until we reach either the node labeled with s1…si-1 or a significant node where any further advance would reach an insignificant node. It takes O(min{i, h}) computation to retrieve the (approximate) value of PS(si|s1…si-1), where h is the height of the probabilistic suffix tree.
Model of CLUSEQ
Retrieval of a conditional probability entry P(si|s1…si-1):
The longest suffix sj…si-1 can be located by traversing from the root along the path si-1 … s2 s1 until we reach either the node labeled with s1…si-1 or a node where no further advance can be made.
This takes O(min{i, h}) time, where h is the height of the tree.
Example: P(a|bbba)
P(a|bbba) ≈ P(a|bba) = 0.4
[Figure: the lookup traversal in the example probabilistic suffix tree.]
Start from the rightmost symbol of the preceding segment and traverse from the root along the path a, b, b (the reverse of bba); the longest significant suffix of bbba present in the tree is bba. Retrieve the probability entry corresponding to a at that node: P(a|bba) = 0.4.
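The lookup can be sketched with a dictionary-backed tree. The ba and bba entries match the values legible in the slide's figure; the other probability vectors are hypothetical fillers for illustration:

```python
# Node label -> probability vector for the next symbol.  The "ba" and
# "bba" entries match the slide's example; the root, "a", and "b"
# vectors are hypothetical.
pst = {
    "":    {"a": 0.45, "b": 0.55},
    "a":   {"a": 0.417, "b": 0.583},
    "b":   {"a": 0.406, "b": 0.594},
    "ba":  {"a": 0.406, "b": 0.594},
    "bba": {"a": 0.4, "b": 0.6},
}

def predict(pst, context, symbol):
    """Approximate P(symbol | context) at the prediction node: the node
    labeled with the longest suffix of `context` present in the tree."""
    node = ""
    for ch in reversed(context):       # walk the context right to left
        if ch + node in pst:
            node = ch + node           # extend the matched suffix
        else:
            break
    return pst[node][symbol]

# bbba itself is not in the tree, so the lookup falls back to bba.
print(predict(pst, "bbba", "a"))
```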
CLUSEQ
Sequence cluster: a set of sequences S is a sequence cluster if, for each sequence α in S, the similarity SIMS(α) between α and S is greater than or equal to some similarity threshold t.
Objective: automatically group a set of sequences into a set of possibly overlapping clusters.
Algorithm of CLUSEQ
An iterative process; each cluster is represented by a probabilistic suffix tree. The optimal number of clusters and the number of outliers allowed are adapted by CLUSEQ automatically via:
new cluster generation, cluster split, and cluster consolidation
adjustment of the similarity threshold
[Flowchart: unclustered sequences → generate new clusters → sequence re-clustering → similarity threshold adjustment → cluster split → cluster consolidation → if any improvement, repeat; otherwise output the sequence clusters.]
New Cluster Generation
New clusters are generated from unclustered sequences at the beginning of each iteration: k′ × f new clusters, where k′ is the number of new clusters generated at the previous iteration and f is the ratio of the number of consolidated clusters to the number of clusters.
Sequence Re-Clustering
For each (sequence, cluster) pair:
calculate the similarity
update the cluster's probabilistic suffix tree if necessary
only the similar portion of the sequence is used
the update is weighted by the similarity value
Cluster Split
Check the convergence of each existing cluster; imprecise probabilities are used for each probability entry in the probabilistic suffix tree.
Split non-convergent clusters.
Imprecise Probabilities
Imprecise probabilities use two values (p1, p2) (instead of one) for a probability: p1 is called the lower probability and p2 the upper probability. The true probability lies somewhere between p1 and p2; p2 − p1 is called the imprecision.
Update Imprecise Probabilities
Assume the prior knowledge of a (conditional) probability is (p1, p2) and the new experiment observes a occurrences out of b trials. Following the imprecise Dirichlet model update:
p1′ = (s · p1 + a) / (s + b)
p2′ = (s · p2 + a) / (s + b)
where s is the learning parameter, which controls the weight that each experiment carries.
Properties
Two properties are important:
If the probability distribution stays static, then p1 and p2 will converge to the true probability.
If the experiment agrees with the prior assumption, the range of imprecision decreases after applying the new evidence, i.e., p2′ − p1′ < p2 − p1.
The clustering process terminates when the imprecision of all significant nodes is less than a small threshold.
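Both properties can be checked numerically, assuming the imprecise-Dirichlet update form p′ = (s·p + a)/(s + b) for each bound; the batch sizes and the learning parameter s = 2 are illustrative choices, not from the slides:

```python
def update(p1, p2, a, b, s=2.0):
    """Fold `a` successes out of `b` trials into the interval (p1, p2).
    `s` is the learning parameter weighting the prior vs. the evidence."""
    return (s * p1 + a) / (s + b), (s * p2 + a) / (s + b)

p1, p2 = 0.0, 1.0          # vacuous prior: true probability unknown
for _ in range(50):
    # Repeated batches with a static 30% success rate.
    p1, p2 = update(p1, p2, a=30, b=100)

# Both bounds converge toward the true rate 0.3, and the
# imprecision p2 - p1 shrinks with every update.
print(p1, p2, p2 - p1)
```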
Cluster Consolidation
Starting from the smallest cluster, dismiss clusters that have few sequences not covered by other clusters.
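The dismissal step can be sketched as a greedy pass over clusters from smallest to largest; the min_unique cutoff and the set-of-ids representation are hypothetical details, not from the slides:

```python
def consolidate(clusters, min_unique=2):
    """Dismiss clusters, smallest first, whose sequences are almost
    entirely covered by the remaining clusters."""
    kept = dict(clusters)
    for name in sorted(clusters, key=lambda n: len(clusters[n])):
        # Sequences covered by every other surviving cluster.
        others = set().union(*(s for n, s in kept.items() if n != name)) \
            if len(kept) > 1 else set()
        if len(clusters[name] - others) < min_unique:
            del kept[name]
    return kept

# B is fully covered by A, and C contributes only one unique sequence,
# so only A survives.
clusters = {"A": {1, 2, 3, 4, 5}, "B": {1, 2, 3}, "C": {6}}
print(sorted(consolidate(clusters)))
```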
Adjustment of Similarity Threshold
Find the sharpest turn of the similarity distribution function (a plot of count versus similarity).
Algorithm of CLUSEQ
Implementation issues:
Limited memory space
prune the node with the smallest count first
prune the node with the longest label first
prune the node with the expected probability vector first
Probability smoothing
eliminates zero empirical probabilities
Other considerations
background probabilities
a priori knowledge
other structural features
Experimental Study
We experimented with a protein database of 8000 proteins from 30 families in the SWISS-PROT database.

Model                                Accuracy   Response Time (sec)
CLUSEQ                               92%        144
Edit Distance                        23%        487
Edit Distance with Block Operations  90%        13754
Hidden Markov Model                  91%        3117
Q-gram                               75%        132
Experimental Study
Synthetic data:

Initial t        1.05    1.5     2       3
Final t          1.99    2.01    —       —
Response time    8011    7556    6754    7234
Precision        81.3%   83.1%   83.4%   81.9%
Recall           82.1%   82.8%   83.6%   82.7%
Experimental Study
Synthetic data:

Initial cluster number   1       20      100     200
Final cluster number     102     99      101     —
Response time            10112   9023    6754    8976
Precision                81.3%   82.1%   82.6%   81%
Recall                   81.6%   82%     83.4%   81.7%
Experimental Study
On synthetic data, CLUSEQ shows linear scalability with respect to the number of clusters, the number of sequences, and the sequence length.
Remarks
Similarity measure:
powerful in capturing high-order statistics and dependencies
efficient in computation (linear complexity)
robust to noise
Clustering algorithm:
high accuracy
high adaptability
high scalability
high reliability
References
CLUSEQ: Efficient and Effective Sequence Clustering. Proceedings of the 19th IEEE International Conference on Data Engineering (ICDE), 2003.
A Framework Towards Efficient and Effective Protein Clustering. Proceedings of the 1st IEEE Computer Society Bioinformatics Conference (CSB), 2002.