1
Identification of Distinguishing Motifs
Zhanyong WANG (Master Degree Student)
Dept. of Computer Science, City University of Hong Kong
E-mail: zhyong@cs.cityu.edu.hk
Joint work with WangSen FENG and Lusheng WANG
2
Outline
The Definitions of Problems
Applications
Previous work
Our work
Algorithm for Single Group
Algorithm for Two Groups
Simulation Results for Single Group
Simulation Results for Two Groups
3
Motif Identification Two versions 1. Single Group 2. Two Groups
4
Single Group
Instance: a group of n sequences.
Objective: find a length-L motif that appears in each of the given sequences, such that the occurrences of the motif are similar to one another.
5
Two Groups
Instance: two groups of sequences, B (Bad) and G (Good).
Objective: find a length-L motif that appears in every sequence in group B and does not appear anywhere in the sequences of G. The occurrences of the motif may contain errors.
6
Applications
1. Finding Targets for Potential Drugs (T. Jiang, C. Trendall, S. Wang, T. Wareham, X. Zhang, 1998) (K. Lanctot, M. Li, B. Ma, S. Wang, and L. Zhang, 1999)
-- bad strings in B are from bacteria
-- good strings in G are from humans
-- find a substring s of length L that is conserved in all bad strings, but not conserved in good strings
-- use s to screen chemicals
-- the selected chemicals can then be tested as potential broad-range antibiotics
7
Applications 2. Creating Diagnostic Probes for Bacterial Infection (T. Brown, G.A. Leonard, E.D. Booth, G. Kneale, 1990) -- a group of closely related pathogenic bacteria -- find a substring that occurs in each of the bacterial sequences (with as few substitutions as possible) and does not occur in the human sequences
8
Applications 3. Locating binding sites and regulatory signals 4. Creating Universal PCR Primers 5. Creating Unbiased Consensus Sequences 6. Anti-sense Drug Design
9
Previous work
The closest substring problem was proved to be NP-hard. So are the single group and two groups problems (K. Lanctot, M. Li, B. Ma, S. Wang, and L. Zhang, 1999).
Polynomial time approximation schemes
-- theoretical results
-- too slow to solve practical instances
10
Previous Programs
Bailey and Elkan: MEME (1994); uses a modified EM algorithm and allows the motif to be absent in some of the given sequences
Waterman: Extended sample-driven approach (1984)
Keich and Pevzner: two programs (2002)
Buhler and Tompa: Projection (2002); combines EM and random projection
Price, Ramabhadran and Pevzner: PatternBranching (2003); uses branching from sample strings and is faster than the previously best known program, Projection
11
Previous Programs (continued)
Do not allow indels
Only for the one group problem
Some algorithms can handle one gap
12
Our work
An extension of the EM approach
A randomized algorithm for the single group problem which can handle indels
An algorithm for the two groups problem
13
Representation of motifs
Consensus pattern: choose the letter that appears most often in each of the L columns (Figure a).
Profile: a 4×L matrix W (rows A, C, G, T); each cell W(i,j) is the occurrence rate of letter i in column j (Figure b).
Use the profile representation in the early stage of the EM algorithm.
Use the consensus pattern representation to improve the accuracy.

(a)  caaccca
     caacccc
     catcccg
     catccct
     cacccca
     -------
     caaccca   consensus pattern
     catccca   another consensus pattern

(b)  A  0  1  0.4  0  0  0  0.4
     C  1  0  0.2  1  1  1  0.2
     G  0  0  0.0  0  0  0  0.2
     T  0  0  0.4  0  0  0  0.2
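The two representations can be computed directly from a set of aligned occurrences. The following is a minimal sketch, not the authors' code; the function names `profile` and `consensus` and the tie-breaking rule (earliest letter in `ALPHABET` wins) are assumptions made for illustration.

```python
ALPHABET = "acgt"

def profile(occurrences):
    """Return {letter: [frequency in column 1, ..., frequency in column L]}."""
    n = len(occurrences)
    length = len(occurrences[0])
    return {
        b: [sum(s[x] == b for s in occurrences) / n for x in range(length)]
        for b in ALPHABET
    }

def consensus(occurrences):
    """Most frequent letter per column; ties go to the earliest letter in ALPHABET (assumption)."""
    w = profile(occurrences)
    length = len(occurrences[0])
    return "".join(max(ALPHABET, key=lambda b: w[b][x]) for x in range(length))

# The five occurrences from Figure (a).
occ = ["caaccca", "caacccc", "catcccg", "catccct", "cacccca"]
W = profile(occ)
pattern = consensus(occ)
```

Column 3 has a tie between a and t, which is why the slide lists two valid consensus patterns; this sketch returns the first one.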
14
Computing the single group problem
The EM (Expectation Maximization) Algorithm (Wang, L., Dong, L. and Fan, H. 2004)
Input:
 – n sequences S_1, S_2, ..., S_n
 – a 4×L matrix W (the initial guess of the motif)
Output:
 – a new matrix W that is a locally maximal solution
Example W:
  A 0.25 0.0 1.0
  C 0.25 1.0 0.0
  G 0.25 0.0 0.0
  T 0.25 0.0 0.0
15
Step 1: An L-mer S_{ij} is a length-L substring.
For each L-mer S_{ij}, calculate the likelihood that S_{ij} is an occurrence of the motif:
  $P(i,j) = \prod_{x=1}^{L} W(S_{ij}(x), x)$
To avoid zero weights, a fixed small number (0.1) is added to W(i,j).
Step 2: Normalize the likelihood:
  $P'(i,j) = P(i,j) / \sum_{x=1}^{m-L+1} P(i,x)$, so that $\sum_{j=1}^{m-L+1} P'(i,j) = 1$
Example: S_{ij} = c a a
  W = a 0.25 0 1
      c 0.25 1 0
      g 0.25 0 0
      t 0.25 0 0
  P(i,j) = 0.25 × 0.1 × 1 = 0.025
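Steps 1 and 2 can be sketched as follows. This is an illustrative sketch only: the names `lmer_likelihood` and `normalized_likelihoods` are hypothetical, and the small number 0.1 is applied by substituting it for zero cells, which is an assumption chosen because it reproduces the slide's worked value of 0.025.

```python
PSEUDO = 0.1  # the fixed small number from the slide; used here in place of zero cells (assumption)

def lmer_likelihood(lmer, W):
    """Step 1: product over columns x of W[letter at x][x]."""
    p = 1.0
    for x, letter in enumerate(lmer):
        w = W[letter][x]
        p *= w if w > 0 else PSEUDO
    return p

def normalized_likelihoods(seq, W, L):
    """Step 2: likelihood of every L-mer of seq, scaled so the values sum to 1."""
    raw = [lmer_likelihood(seq[j:j + L], W) for j in range(len(seq) - L + 1)]
    total = sum(raw)
    return [p / total for p in raw]

# The example matrix from the slide (L = 3).
W = {
    "a": [0.25, 0.0, 1.0],
    "c": [0.25, 1.0, 0.0],
    "g": [0.25, 0.0, 0.0],
    "t": [0.25, 0.0, 0.0],
}
p_caa = lmer_likelihood("caa", W)            # 0.25 * 0.1 * 1
weights = normalized_likelihoods("ttgcatt", W, 3)
```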
16
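Step 3: Re-estimate the motif matrix W:
  $W = \sum_{i=1}^{n} \sum_{j=1}^{m-L+1} W_{ij}$
where W_{ij} is constructed from S_{ij}: in column x, the row of letter S_{ij}(x) holds the likelihood of S_{ij} and every other row holds 0.
Example: S_{ij} = c a a
  W = a 0.25 0 1
      c 0.25 1 0
      g 0.25 0 0
      t 0.25 0 0
  P(i,j) = 0.25 × 0.1 × 1 = 0.025

           S_{ij}(1)  S_{ij}(2)  S_{ij}(3)
  W_{ij} = a  0      0.025  0.025
           c  0.025  0      0
           g  0      0      0
           t  0      0      0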
17
Step 4: Normalize W:
  $W'(b,x) = W(b,x) / \sum_{b' \in \{A,C,G,T\}} W(b',x)$
Replace W with W'.
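The re-estimation and column normalization of Steps 3 and 4 might look like the sketch below. The function names are hypothetical, and the per-L-mer weights are passed in by the caller, so the sketch takes no position on whether the raw or the normalized likelihood is used; falling back to a uniform column when a column sums to zero is also an assumption.

```python
ALPHABET = "acgt"

def reestimate(seqs, weights, L):
    """Step 3: add each L-mer's weight into the cells (letter at x, column x)."""
    W = {b: [0.0] * L for b in ALPHABET}
    for seq, row in zip(seqs, weights):
        for j, p in enumerate(row):
            for x, letter in enumerate(seq[j:j + L]):
                W[letter][x] += p
    return W

def normalize_columns(W, L):
    """Step 4: divide every cell by its column sum (uniform column if the sum is 0)."""
    out = {b: [0.0] * L for b in W}
    for x in range(L):
        total = sum(W[b][x] for b in W)
        for b in W:
            out[b][x] = W[b][x] / total if total > 0 else 1.0 / len(W)
    return out

# Single L-mer "caa" with weight 0.025, as in the slide's example of W_ij.
W_ij = reestimate(["caa"], [[0.025]], 3)
W_norm = normalize_columns(W_ij, 3)
```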
18
Step 5
Steps 1 to 4 are called a cycle. If W changes very little from the last cycle, then EM has converged and the algorithm ends; otherwise, go to Step 1 and start the next cycle.
Determine the amount of change: $\max_{b,x} |W_q(b,x) - W_{q-1}(b,x)| < \epsilon$
Set $\epsilon = 0.05$ so that the algorithm stops within a few cycles.
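Putting the five steps together, one possible shape of the whole loop is sketched below. It is a compact illustration under stated assumptions rather than the published implementation: the names `em_cycle` and `run_em`, the `max_cycles` safety cap, the handling of zero cells, and the toy sequences are all made up for the example.

```python
ALPHABET = "acgt"
PSEUDO = 0.1     # small number standing in for zero cells (assumption)
EPSILON = 0.05   # convergence threshold from the slide

def em_cycle(seqs, W, L):
    """One cycle: Steps 1 to 4. Every sequence must have length >= L."""
    new = {b: [0.0] * L for b in ALPHABET}
    for seq in seqs:
        raw = []
        for j in range(len(seq) - L + 1):            # Step 1: likelihood of each L-mer
            p = 1.0
            for x in range(L):
                w = W[seq[j + x]][x]
                p *= w if w > 0 else PSEUDO
            raw.append(p)
        total = sum(raw)
        for j, p in enumerate(raw):                  # Steps 2 and 3: normalize, accumulate
            for x in range(L):
                new[seq[j + x]][x] += p / total
    for x in range(L):                               # Step 4: normalize columns
        col = sum(new[b][x] for b in ALPHABET)
        for b in ALPHABET:
            new[b][x] /= col
    return new

def run_em(seqs, W, L, epsilon=EPSILON, max_cycles=100):
    """Step 5: repeat cycles until the largest cell change drops below epsilon."""
    for cycle in range(1, max_cycles + 1):
        new = em_cycle(seqs, W, L)
        change = max(abs(new[b][x] - W[b][x]) for b in ALPHABET for x in range(L))
        W = new
        if change < epsilon:
            break
    return W, cycle

W0 = {"a": [0.25, 0.0, 1.0], "c": [0.25, 1.0, 0.0],
      "g": [0.25, 0.0, 0.0], "t": [0.25, 0.0, 0.0]}
toy = ["ttcaatt", "gcaagg", "acaac"]                 # hypothetical toy input
W_final, cycles = run_em(toy, W0, 3)
```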
19
Our Algorithm For Single Group (with indels)
The general framework is the same as in the previous algorithm:
1. Get an initial guess of the motif W.
2. With W as the initial value, use the new EM algorithm to update W.
3. Repeat 1–2 several (Maxtrials) times and choose the best result.
20
Incorporating Indels
We add the "space" as a letter, so the matrix for the EM algorithm becomes 5×L.
k: the maximum total number of indels.
For each starting position, consider all substrings of length L+h, where h = 0, 1, -1, …, k, -k is the number of indels.
For each length-(L+h) substring, align it with the matrix.
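The enumeration of candidate substrings at one starting position can be illustrated as below. The function name `candidate_substrings` and the choice to silently skip lengths that run past the end of the sequence are assumptions for this sketch.

```python
def candidate_substrings(seq, start, L, k):
    """Return (h, substring) for every length L+h, h in -k..k, that fits inside seq."""
    out = []
    for h in range(-k, k + 1):
        length = L + h
        if length > 0 and start + length <= len(seq):
            out.append((h, seq[start:start + length]))
    return out

at_start = candidate_substrings("acgtacgt", 0, 4, 1)
near_end = candidate_substrings("acgtacgt", 4, 4, 1)
```

Each returned substring would then be aligned against the 5×L matrix.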
21
Align a length-(L+h) string with a 5×L matrix
Dynamic programming, similar to pairwise string alignment.
d[i, j] is the score of aligning the first i columns of the matrix with the first j letters of the string:
  $d[i,j] = \max\{\, d[i-1,j-1] \times W[x,i],\; d[i-1,j] \times W[\triangle,i],\; d[i,j-1] \times e \,\}$
where x is the j-th letter of the string and △ is the space letter.
Bottom-up order; d[L, L+h] is the score of the best alignment (with indels).
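The recurrence can be filled in bottom-up as in the sketch below. Several details are assumptions, since the slide does not state them: the base case d[0,0] = 1, the value of `e` (the factor applied when a string letter is not matched to any column), the `"-"` key used for the space row, and the toy matrix.

```python
GAP = "-"  # key for the space row of the 5 x L matrix (naming is an assumption)

def align_score(W, s, e=0.1):
    """Best multiplicative score of aligning all L columns of W with all of s."""
    L = len(W[GAP])
    n = len(s)
    d = [[0.0] * (n + 1) for _ in range(L + 1)]
    d[0][0] = 1.0                                     # empty alignment (assumed base case)
    for i in range(L + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            best = 0.0
            if i > 0 and j > 0:                       # column i matched with letter j
                best = max(best, d[i - 1][j - 1] * W[s[j - 1]][i - 1])
            if i > 0:                                 # column i matched with a space
                best = max(best, d[i - 1][j] * W[GAP][i - 1])
            if j > 0:                                 # letter j matched with no column
                best = max(best, d[i][j - 1] * e)
            d[i][j] = best
    return d[L][n]

# Hypothetical 5 x 3 matrix strongly favouring the motif "cat".
W = {"a": [0.0, 0.9, 0.0], "c": [0.9, 0.0, 0.0], "g": [0.0, 0.0, 0.0],
     "t": [0.0, 0.0, 0.9], GAP: [0.1, 0.1, 0.1]}
exact = align_score(W, "cat")       # three matches
deleted = align_score(W, "ct")      # one column aligned with a space
inserted = align_score(W, "cgat")   # one extra letter in the string
```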
22
Continued
After computing the motif W (profile representation: matrix), we use the matrix W to find the occurrence of the motif in each sequence.
23
Find the motif occurrences
Find the occurrence of the motif in each string using the score
  $\sum_{i=1}^{L} W(a_i, i)$
where a_1 a_2 a_3 … a_L is a length-L substring (L-mer) and W is the matrix for the motif.
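A minimal sketch of this scan follows. It assumes the reported occurrence is the L-mer with the highest sum score, with ties going to the leftmost position; the slide gives the score but does not spell out the selection rule, and the name `best_occurrence` is hypothetical.

```python
def best_occurrence(seq, W, L):
    """Return (start index, L-mer) with the largest sum of W[a_i][i] over the L columns."""
    def score(j):
        return sum(W[seq[j + x]][x] for x in range(L))
    best_j = max(range(len(seq) - L + 1), key=score)  # leftmost maximum
    return best_j, seq[best_j:best_j + L]

W = {"a": [0.25, 0.0, 1.0], "c": [0.25, 1.0, 0.0],
     "g": [0.25, 0.0, 0.0], "t": [0.25, 0.0, 0.0]}
pos, lmer = best_occurrence("ttgcatt", W, 3)
```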
24
Algorithm for the Two Groups (no indels)
We follow the basic steps of the EM method.
Modify the formula used to reconstruct W.
Re-estimate the matrix W from both group B and group G.
25
Main idea
When the motif represented by the matrix W is too close to some L-mers from group G (P(i,j) > ave), we scoop the pattern from the matrix by subtracting the corresponding matrix W_{ij}.
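One way to read the "scooping" idea is sketched below. The slides do not give the modified formula, so everything beyond "subtract W_ij when P(i,j) > ave" is an assumption: how `ave` is obtained (here it is simply a parameter), subtracting the raw likelihood, clamping cells at zero, and the function names.

```python
ALPHABET = "acgt"
PSEUDO = 0.1  # small number standing in for zero cells (assumption)

def likelihood(lmer, W):
    p = 1.0
    for x, b in enumerate(lmer):
        p *= W[b][x] if W[b][x] > 0 else PSEUDO
    return p

def reestimate_two_groups(bad, good, W, L, ave):
    """Accumulate W from group B, scoop out close L-mers of group G, then normalize."""
    acc = {b: [0.0] * L for b in ALPHABET}
    for seq in bad:                                   # usual EM accumulation on B
        lmers = [seq[j:j + L] for j in range(len(seq) - L + 1)]
        raw = [likelihood(s, W) for s in lmers]
        total = sum(raw)
        for s, p in zip(lmers, raw):
            for x, b in enumerate(s):
                acc[b][x] += p / total
    for seq in good:                                  # scoop L-mers of G with P(i,j) > ave
        for j in range(len(seq) - L + 1):
            s = seq[j:j + L]
            p = likelihood(s, W)
            if p > ave:
                for x, b in enumerate(s):
                    acc[b][x] = max(0.0, acc[b][x] - p)   # clamp at zero (assumption)
    for x in range(L):                                # column normalization
        col = sum(acc[b][x] for b in ALPHABET)
        for b in ALPHABET:
            acc[b][x] = acc[b][x] / col if col > 0 else 1.0 / len(ALPHABET)
    return acc

W = {"a": [0.25, 0.0, 1.0], "c": [0.25, 1.0, 0.0],
     "g": [0.25, 0.0, 0.0], "t": [0.25, 0.0, 0.0]}
with_good = reestimate_two_groups(["ttgcatt"], ["aatcagg"], W, 3, ave=0.2)
without_good = reestimate_two_groups(["ttgcatt"], [], W, 3, ave=0.2)
```

In this toy run the G sequence contains "tca", which scores above `ave`, so the weight of t in the first column is removed while the B-specific g in that column is kept.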
26
Experiment Results (Single Group)
Input:
(1) Randomly generate n = 20 sequences of length m = 600.
(2) Insert the motif into the sequences:
    Center string s (length L)
    Mutate d positions (insertion, deletion, mutation)
    Implant the mutated copy into the sequences
Output: Use our program to find the implanted pattern.
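A test instance of this kind could be generated roughly as follows. This is a sketch under assumptions: the edit type is chosen uniformly at random, the mutated copy overwrites background letters so every sequence keeps length m, d is smaller than the length of the center string, and the center string, seed, and function names are made up for the example.

```python
import random

ALPHABET = "acgt"

def mutate(center, d, rng):
    """Apply d random edits (substitution, insertion or deletion); assumes d < len(center)."""
    s = list(center)
    for _ in range(d):
        op = rng.choice(["sub", "ins", "del"])
        if op == "sub":
            pos = rng.randrange(len(s))
            s[pos] = rng.choice([b for b in ALPHABET if b != s[pos]])
        elif op == "ins":
            s.insert(rng.randrange(len(s) + 1), rng.choice(ALPHABET))
        else:
            del s[rng.randrange(len(s))]
    return "".join(s)

def planted_instance(center, n=20, m=600, d=2, seed=0):
    """n random sequences of length m, each with one mutated copy of center implanted."""
    rng = random.Random(seed)
    seqs = []
    for _ in range(n):
        background = "".join(rng.choice(ALPHABET) for _ in range(m))
        copy = mutate(center, d, rng)
        pos = rng.randrange(m - len(copy) + 1)
        seqs.append(background[:pos] + copy + background[pos + len(copy):])
    return seqs

center = "acgtacgtacgtacg"            # hypothetical center string, L = 15
seqs = planted_instance(center)
mutated = mutate(center, 2, random.Random(1))
```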
27
Experiment Results (Single Group)
Table 1: 15 sequences: no indel; 5 sequences: one deletion
Table 2: 10 sequences: no indel; 5 sequences: one deletion; 5 sequences: one insertion
In Table 2, the running time increases significantly and the accuracy in many cases is slightly worse than in Table 1.
28
Experiment Results (Single Group)
Table 3: 5 sequences: one deletion; 5 sequences: two deletions; 10 sequences: no indel
Table 4: 5 sequences: one insertion; 5 sequences: two insertions; 10 sequences: no indel
The results in Table 4 are slightly better than those in Table 3. The reason might be that the case in Table 4 needs to insert two columns in the matrix for the motif, whereas the case in Table 3 needs to insert two spaces in the motif sequences.
29
Experiment Results (Single Group)
Table 5, the mixed case. Probabilities:
  one insertion: 1/8
  one deletion: 1/8
  two insertions: 1/8
  two deletions: 1/8
  one insertion and one deletion: 1/8
  no indel: 3/8
30
Experiment Results (Two Groups)
Centers (m = 600):
  c1: the center for group B, a random sequence
  c2: the center for group G, obtained by randomly mutating 200 positions of c1
Generate two groups of n = 10 sequences each, by randomly mutating 200 positions of the respective center.
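The two-group setup can be sketched as below. The slide does not say how positions and replacement letters are drawn; this sketch assumes positions are drawn with replacement and the new letter is uniform over the alphabet (so it may equal the old one). Under those assumptions the expected Hamming distance between c1 and c2 is roughly 128. All names and the seed are hypothetical.

```python
import random

ALPHABET = "acgt"

def random_mutations(center, count, rng):
    """Overwrite `count` randomly drawn positions (with replacement) with random letters."""
    s = list(center)
    for _ in range(count):
        s[rng.randrange(len(s))] = rng.choice(ALPHABET)
    return "".join(s)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def two_group_instance(m=600, n=10, mutations=200, seed=0):
    """Return (c1, c2, bad, good): two centers and n sequences mutated from each."""
    rng = random.Random(seed)
    c1 = "".join(rng.choice(ALPHABET) for _ in range(m))
    c2 = random_mutations(c1, mutations, rng)
    bad = [random_mutations(c1, mutations, rng) for _ in range(n)]
    good = [random_mutations(c2, mutations, rng) for _ in range(n)]
    return c1, c2, bad, good

c1, c2, bad, good = two_group_instance()
center_distance = hamming(c1, c2)
```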
31
Experiment Results (Two Groups)
From Table 6, we can see that it is easier to find a motif that distinguishes the two groups when L is large.
Comparing Table 7 with Table 6, we can see that it is easier to find a distinguishing motif when the distance between the two centers is large.
Table 7 shows the results when the average Hamming distance between c1 and c2 is about 175.
Table 6 shows the results when the average Hamming distance between c1 and c2 is about 128.
32
Summary An algorithm for the single group problem that can handle indels An algorithm for the two groups problem