Finding approximate occurrences of a pattern that contains gaps. Inbok Lee, Costas S. Iliopoulos, Alberto Apostolico, Kunsoo Park.


1 Finding approximate occurrences of a pattern that contains gaps. Inbok Lee, Costas S. Iliopoulos, Alberto Apostolico, Kunsoo Park

2 Contents: the exact/approximate gapped pattern matching problem; previous approaches; our contributions.

3 Exact gapped pattern matching problem. Definition: find the occurrences in the text of a pattern that contains gaps. Pattern P = AA *(2,3) GC *(1,3) TT, with subpatterns P1 = AA, P2 = GC, P3 = TT. The gap *(2,3) matches any string whose length is between 2 and 3, and *(1,3) matches any string whose length is between 1 and 3.

4 Example: exact matching. Pattern P = AA *(2,3) GC *(1,3) TT, text T = GCAATTGCACTTC. In T, AA is followed by the gap TT, then GC, then the gap AC, then TT.

5 Approximate gapped pattern matching problem. Definition: find all substrings of the text that match the pattern when each subpattern P_i may contain up to k_i insertions, deletions, and substitutions. Pattern P = AA *(2,3) GC *(1,3) TT, with P1 = AA (k1 = 0), P2 = GC (k2 = 1), P3 = TT (k3 = 0). The gap *(2,3) matches any string whose length is between 2 and 3, and *(1,3) matches any string whose length is between 1 and 3.

6 Example: approximate matching. Pattern P = AA *(2,3) GC *(1,3) TT with k1 = k3 = 0 and k2 = 1; text T = GCAATTGTACTTC. P2 = GC matches GT in the text with 1 substitution.

7 Class of characters. Allow more than one character at a position of the pattern. Pattern P = AA *(2,3) G[CT] *(1,3) TT, with P1 = AA (k1 = 0), P2 = G[CT] (k2 = 1), P3 = TT (k3 = 0). The class [CT] matches C or T. The gap *(2,3) matches any string whose length is between 2 and 3, and *(1,3) matches any string whose length is between 1 and 3.

8 Example: class of characters. Pattern P = AA *(2,3) G[CT] *(1,3) TT, text T = GCAATTGTACTTC. P2 = G[CT] matches GT in the text.

9 Applications of gapped pattern matching: information retrieval; data mining; computational biology, especially finding motifs in a sequence.

10 Motifs. A motif is a biologically important common region. (Figure: Sequence 1 to Sequence 4 with a common region.) Sometimes overall sequence alignment doesn't show the relation between biologically related sequences.

11 PROSITE database. Database of protein families, domains and motifs (http://www.expasy.ch/prosite). Motifs are represented as gapped patterns over the alphabet of 20 amino acids. Prion protein (Creutzfeldt-Jakob disease): E *(1,1) [ED] *(1,1) K [LIVM] [LIVM] *(1,1) [KR] [LIVM] [LIVM] *(1,1) [QE] M C *(2,2) Q Y. Ribosomal protein L1: [IM] *(2,2) [LIVA] *(2,3) [LIVM] [GA] *(2,2) [LMS] [GSNH] [PTKR] [KRAV] G *(1,1) [LIMF] P [DENSTKQ].

12 Finding hidden motifs. Given a set of sequences, how do we find unknown motifs?

13 Finding motifs in a sequence (our topic). Given a known motif and a new sequence, find the motif in the sequence. As biological sequences may contain errors, we should consider approximate occurrences.

14 Previous approaches. Regular expression approaches: exact matching. Navarro and Raffinot's approach [RECOMB 2002]: exact and approximate matching. Akutsu's approach [IEICE Trans. Information and Systems 1996]: approximate matching.

15 Regular expression approach. Pattern P = AA *(2,3) GC *(1,3) TT. Regular expression: AA**(*|ε)GC*(*|ε)(*|ε)TT, where * stands for any single character and ε for the empty string. The expression is run as a nondeterministic finite automaton (NFA) or its equivalent deterministic finite automaton (DFA). Too general!
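The translation on this slide can be tried directly with a general-purpose regex engine. The sketch below is only an illustration using Python's standard `re` module, not the automaton construction the slide refers to; `GAPPED` and `first_occurrence` are names chosen here, with `.` standing for the wildcard and `{a,b}` for the gap bounds.

```python
import re

# Gapped pattern P = AA *(2,3) GC *(1,3) TT written as a Python regex.
# "." is the single-character wildcard and {a,b} bounds the gap length.
GAPPED = re.compile(r"AA.{2,3}GC.{1,3}TT")


def first_occurrence(text):
    """Return the half-open span (start, end) of the first match, or None."""
    match = GAPPED.search(text)
    return match.span() if match else None


print(first_occurrence("GCAATTGCACTTC"))  # (2, 12): AA TT GC AC TT
print(first_occurrence("GCAATTGTACTTC"))  # None: GT is not an exact match for GC
```

A regex engine handles far more than gapped patterns, which is the sense in which the approach is "too general".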

16 Navarro and Raffinot's approach. The NFA is not easy to run and the DFA can be large. Simulate the NFA by the bit-parallelism technique: the active states are kept in a bit-vector such as 0101010010000, and all the bits of a word can be read and written simultaneously.
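To make the bit-parallel idea concrete, here is a minimal sketch of the classic Shift-And algorithm for a single gap-free string. It is a swapped-in, simpler technique, not Navarro and Raffinot's gapped automaton; the function name `shift_and` is chosen here, and a Python integer plays the role of the machine word.

```python
def shift_and(pattern, text):
    """Return the 0-based end positions of exact occurrences of pattern."""
    # masks[c] has bit i set when pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)

    accept = 1 << (len(pattern) - 1)
    state = 0  # bit i set: pattern[:i + 1] matches a suffix of the text read so far
    ends = []
    for j, c in enumerate(text):
        # One shift, one OR and one AND update every NFA state at once.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            ends.append(j)
    return ends


print(shift_and("GC", "GCAATTGCACTTC"))  # [1, 7]
print(shift_and("TT", "GCAATTGCACTTC"))  # [5, 11]
```

Each text character costs a constant number of word operations as long as the pattern fits in one word, which is where the division by w in such bounds comes from.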

17 Navarro and Raffinot's approach. Allow k errors for the whole pattern. (Figure: the NFA drawn in one row for 0 errors and one row for 1 error.) Works for small patterns and a small number of errors. O(km'n / w) time algorithm (m' is the total length of the pattern, n is the length of the text, w is the word size).

18 Akutsu's approach. Combination of dynamic programming and a balanced search tree; O(mn log n) time. (Figure: a dynamic programming table of the text against P1 *(a1, b1) P2 *(a2, b2) P3; the tree is used to compute the smallest values in the marked part of the table.)

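19 Drawbacks of the previous approaches. (Figure: alignments of the subpatterns, drawn with the marks O and X, one of them marked with a question mark.) The previous approaches use k = 3 for the whole pattern; bounding each subpattern separately, with k1 = 1, k2 = 1, k3 = 1, is more sensitive and desirable.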

20 Our contributions. An O(ln + m) time algorithm for the exact gapped pattern matching problem (l: number of subpatterns, n: length of the text, m: length of the pattern). An O(mn) time algorithm for the approximate gapped pattern matching problem.

21 Graph modeling. 1. Create a node wherever a subpattern appears (exactly or approximately) in the text. 2. Link two nodes with an edge if they represent two consecutive subpatterns and satisfy the gap condition. 3. If there is a path P1 – P2 – … – Pl in the graph, where l is the number of subpatterns, there is an occurrence of the pattern in the text.

22 Exact matching, step 1: create the nodes. P = AA *(2,3) GC *(1,3) TT, so P1 = AA, P2 = GC, P3 = TT; text T = GCAATTGCACTTC. The graph has one node for P1, two for P2 and two for P3.

23 Exact matching, step 2: connect the nodes with edges. Same pattern, text and nodes as in step 1.

24 Exact matching, step 3: find the path by depth-first search. Same pattern, text and nodes as in step 1.
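The three graph steps can be sketched as follows. This is an illustrative reading of the slides, not the authors' code: the names `find_all`, `build_graph` and `find_path` and the node representation (subpattern index, start position) are assumptions made here, and the text is T = GCAATTGCACTTC from the exact matching example. At least one subpattern is assumed.

```python
def find_all(sub, text):
    """0-based start positions of every exact occurrence of sub in text."""
    return [s for s in range(len(text) - len(sub) + 1) if text.startswith(sub, s)]


def build_graph(subpatterns, gaps, text):
    # Step 1: one node (subpattern index, start position) per occurrence.
    nodes = [[(i, s) for s in find_all(p, text)] for i, p in enumerate(subpatterns)]
    # Step 2: edge between consecutive subpatterns when the gap length fits.
    edges = {}
    for i in range(len(subpatterns) - 1):
        lo, hi = gaps[i]
        for node in nodes[i]:
            end = node[1] + len(subpatterns[i])
            edges[node] = [nxt for nxt in nodes[i + 1] if lo <= nxt[1] - end <= hi]
    return nodes, edges


def find_path(subpatterns, gaps, text):
    """Step 3: depth-first search for a path through all subpatterns, or None."""
    nodes, edges = build_graph(subpatterns, gaps, text)
    last = len(subpatterns) - 1

    def dfs(node, path):
        if node[0] == last:
            return path
        for nxt in edges.get(node, []):
            found = dfs(nxt, path + [nxt])
            if found is not None:
                return found
        return None

    for node in nodes[0]:
        found = dfs(node, [node])
        if found is not None:
            return found
    return None


print(find_path(["AA", "GC", "TT"], [(2, 3), (1, 3)], "GCAATTGCACTTC"))
# [(0, 2), (1, 6), (2, 10)]
```

On this text the graph has one node for AA, two for GC and two for TT, and only the path AA at 2, GC at 6, TT at 10 satisfies both gaps.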

25 A better idea. No need to build the graph explicitly. Step 1: find P1 = AA and compute the candidate range for P2. P = AA *(2,3) GC *(1,3) TT, text T = GCAATTGCACTTC.

26 A better idea. Step 2: find P2 = GC within the candidate range and compute the candidate range for P3.

27 A better idea. Step 3: after finding P3 = TT within the candidate range, we have found an occurrence of P.
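The candidate-range idea can be sketched without any graph. The code below is a simplified illustration that favors clarity over the O(ln + m) bound claimed in the talk (it rescans the text naively for each subpattern); the name `gapped_exact` and the boolean array `allowed` are assumptions made here.

```python
def gapped_exact(subpatterns, gaps, text):
    """Return the end positions (exclusive) of occurrences of the gapped pattern."""
    n = len(text)
    # allowed[s] is True when the current subpattern may start at position s.
    allowed = [True] * (n + 1)
    for i, sub in enumerate(subpatterns):
        ends = [s + len(sub)
                for s in range(n - len(sub) + 1)
                if allowed[s] and text.startswith(sub, s)]
        if i == len(subpatterns) - 1:
            return ends
        # Candidate range for the next subpattern: end + gap length.
        lo, hi = gaps[i]
        allowed = [False] * (n + 1)
        for end in ends:
            for s in range(end + lo, min(end + hi, n) + 1):
                allowed[s] = True
    return []


print(gapped_exact(["AA", "GC", "TT"], [(2, 3), (1, 3)], "GCAATTGCACTTC"))  # [12]
```

Here AA ends at position 4, so GC may start at 6 or 7; GC ends at 8, so TT may start at 9, 10 or 11; TT is found at 10 and the occurrence ends at 12.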

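28 Approximate matching. Almost the same idea as in the exact matching case: find the approximate occurrences of the subpatterns instead of the exact ones. Dynamic programming table for P1 = AA (k1 = 0) against the text GCAATTGCACTTC:

        G  C  A  A  T  T  G  C  A  C  T  T  C
     0  0  0  0  0  0  0  0  0  0  0  0  0  0
  A  1  1  1  0  0  1  1  1  1  0  1  1  1  1
  A  2  2  2  1  0  1  2  2  2  1  1  2  2  2

The entries of the last row that are at most k1 mark the ends of the occurrences of P1; the gap *(2,3) then gives the candidate range for P2 (k2 = 1).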

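29 Approximate matching. Table for P2 = GC (k2 = 1), started inside the candidate range, against the remaining text GCACTTC (columns shown for G, C, A, C):

        G  C  A  C
     0  0  ∞  ∞  ∞
  G  1  0  1  2  3
  C  2  1  0  1  2

∞ (infinity) means that no alignment can start from here. The gap *(1,3) then gives the candidate range for P3 (k3 = 0).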

30 Approximate matching. Table for P3 = TT (k3 = 0) against the remaining text TTC:

        T  T  C
     0  0  0  ∞
  T  1  0  0  1
  T  2  1  0  1

The 0 in the last row marks an approximate occurrence of the pattern.
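The tables on the approximate matching slides can be reproduced with a standard edit-distance recurrence whose top row is 0 inside the candidate range and infinity elsewhere. The following is a sketch under that reading, not the authors' implementation; `last_row` and `gapped_approx` are names chosen here, and column j stands for "after the first j characters of the text".

```python
INF = float("inf")


def last_row(sub, text, start_ok):
    """Bottom row of the table for sub; start_ok[j] allows a start at column j."""
    n = len(text)
    prev = [0 if start_ok[j] else INF for j in range(n + 1)]
    for i in range(1, len(sub) + 1):
        cur = [prev[0] + 1] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if sub[i - 1] == text[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # deletion from the subpattern
                         cur[j - 1] + 1)      # insertion into the subpattern
        prev = cur
    return prev


def gapped_approx(subpatterns, ks, gaps, text):
    """End columns of occurrences where subpattern i has at most ks[i] errors."""
    n = len(text)
    start_ok = [True] * (n + 1)
    for i, (sub, k) in enumerate(zip(subpatterns, ks)):
        row = last_row(sub, text, start_ok)
        ends = [j for j in range(n + 1) if row[j] <= k]
        if i == len(subpatterns) - 1:
            return ends
        lo, hi = gaps[i]
        start_ok = [False] * (n + 1)
        for end in ends:
            for s in range(end + lo, min(end + hi, n) + 1):
                start_ok[s] = True
    return []


print(last_row("AA", "GCAATTGCACTTC", [True] * 14))
# [2, 2, 2, 1, 0, 1, 2, 2, 2, 1, 1, 2, 2, 2]
print(gapped_approx(["AA", "GC", "TT"], [0, 1, 0], [(2, 3), (1, 3)], "GCAATTGTACTTC"))
# [12]
```

The first call prints the bottom row of the table for P1 = AA; the second finds the occurrence in the approximate matching example, where GC matches GT with one substitution.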

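31 Handling class of characters. Represent characters as bit masks with one bit per character, in the order A, G, T, C; the class [GC] is 0101. A text character matches a pattern position when the bitwise AND of the two masks is nonzero. Text G against pattern [GC]: 0100 & 0101 = 0100, nonzero. Text T against pattern [GC]: 0010 & 0101 = 0000, zero.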
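The mask test on this slide is small enough to write out. A minimal sketch, assuming the bit order A, G, T, C shown on the slide; the names `BIT`, `class_mask` and `matches` are chosen here for illustration.

```python
# One bit per character, in the slide's order A, G, T, C.
BIT = {"A": 0b1000, "G": 0b0100, "T": 0b0010, "C": 0b0001}


def class_mask(chars):
    """Mask of a character class, e.g. "GC" for [GC]."""
    mask = 0
    for c in chars:
        mask |= BIT[c]
    return mask


def matches(text_char, pattern_mask):
    """A text character matches when the AND of the masks is nonzero."""
    return (BIT[text_char] & pattern_mask) != 0


gc = class_mask("GC")
print(format(gc, "04b"))   # 0101
print(matches("G", gc))    # True:  0100 & 0101 = 0100
print(matches("T", gc))    # False: 0010 & 0101 = 0000
```

A single character is the special case of a class with one bit set, so the same comparison serves ordinary pattern positions too.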

32 Time complexity. O(mn), where m is the length of the pattern and n is the length of the text, but faster in practice. (Figure: the text and the subpatterns P1, P2, P3.)

33 Conclusion. An O(ln + m) time algorithm for the exact gapped pattern matching problem. An O(mn) time algorithm for the approximate gapped pattern matching problem. Open problem: time complexity in the average case?

