Published by Scot Tyrone Ferguson. Modified over 8 years ago.
1
Finding approximate occurrences of a pattern that contains gaps
Inbok Lee, Costas S. Iliopoulos, Alberto Apostolico, Kunsoo Park
2
Contents
The exact/approximate gapped pattern matching problem
Previous approaches
Our contributions
3
Exact gapped pattern matching problem
Definition: find the occurrences in the text of a pattern that contains gaps.
Pattern P = AA *(2,3) GC *(1,3) TT
Subpatterns: P1 = AA, P2 = GC, P3 = TT
*(2,3): any string whose length is between 2 and 3
*(1,3): any string whose length is between 1 and 3
4
Example – Exact matching
Pattern P = AA *(2,3) GC *(1,3) TT
Text T = GCAATTGCACTTC
[Figure: the subpatterns AA, GC, and TT aligned against the text, separated by the gaps *(2,3) and *(1,3)]
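The example can be checked with an ordinary regular expression. This is only a minimal sketch for verification, not the algorithm of the talk; the function name `exact_occurrences` is made up for illustration, and each gap *(a,b) is written as `.{a,b}`.

```python
import re

# P = AA *(2,3) GC *(1,3) TT written as a regular expression.
# The lookahead (?=...) lets us test every start position, so
# overlapping occurrences are not skipped.
GAPPED = re.compile(r"(?=(AA.{2,3}GC.{1,3}TT))")

def exact_occurrences(text):
    """Return (start, matched substring) for each start position that has an occurrence."""
    return [(m.start(), m.group(1)) for m in GAPPED.finditer(text)]

print(exact_occurrences("GCAATTGCACTTC"))
```

Positions are 0-based, and only one match is reported per start position even if several gap lengths would fit.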
5
Approximate gapped pattern matching problem
Definition: find all the substrings of the text that match the pattern, where each subpattern Pi may be matched with at most ki insertions, deletions, and substitutions.
Pattern P = AA *(2,3) GC *(1,3) TT
P1 = AA, k1 = 0; P2 = GC, k2 = 1; P3 = TT, k3 = 0
*(2,3): any string whose length is between 2 and 3
*(1,3): any string whose length is between 1 and 3
6
Example – Approximate matching
Pattern P = AA *(2,3) GC *(1,3) TT, k1 = k3 = 0, k2 = 1
Text T = GCAATTGTACTTC
[Figure: AA, GC, and TT aligned against the text; GC matches GT with 1 substitution]
7
Class of characters
Allow two or more different characters at a position of the pattern.
Pattern P = AA *(2,3) G[CT] *(1,3) TT
P1 = AA, k1 = 0; P2 = G[CT], k2 = 1; P3 = TT, k3 = 0
[CT]: C or T
*(2,3): any string whose length is between 2 and 3
*(1,3): any string whose length is between 1 and 3
8
Example – Class of characters
Pattern P = AA *(2,3) G[CT] *(1,3) TT
Text T = GCAATTGTACTTC
[Figure: AA, G[CT], and TT aligned against the text; G[CT] matches GT]
9
Applications of gapped pattern matching
Information retrieval
Data mining
Computational biology, especially finding motifs in a sequence
10
Motifs
Motifs: biologically important common regions.
[Figure: a motif shared by Sequence 1, Sequence 2, Sequence 3, and Sequence 4]
Sometimes overall sequence alignment doesn't show the relation between biologically related sequences.
11
PROSITE database
Database of protein families, domains and motifs: http://www.expasy.ch/prosite
Motifs are represented as gapped patterns over the alphabet of 20 amino acids.
Prion protein (Creutzfeldt-Jakob disease): E *(1,1) [ED] *(1,1) K[LIVM][LIVM] *(1,1) [KR][LIVM][LIVM] *(1,1) [QE]MC *(2,2) QY
Ribosomal protein L1: [IM] *(2,2) [LIVA] *(2,3) [LIVM][GA] *(2,2) [LMS][GSNH][PTKR][KRAV]G *(1,1) [LIMF]P[DENSTKQ]
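Patterns written in the notation of these slides can be turned into regular expressions mechanically. The sketch below assumes the slide notation only (gaps written as *(a,b), classes in square brackets); the helper name `gapped_to_regex` is hypothetical, and the test string is synthetic, not a real protein sequence.

```python
import re

# A gap such as "*(2,3)" or "* (2,3 )" in the notation of the slides.
GAP = re.compile(r"\*\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)")

def gapped_to_regex(pattern):
    """Translate a gapped pattern in slide notation into a compiled regex."""
    # Each gap *(a,b) becomes ".{a,b}"; classes like [LIVM] are already valid regex.
    with_gaps = GAP.sub(lambda m: ".{%s,%s}" % (m.group(1), m.group(2)), pattern)
    return re.compile(with_gaps.replace(" ", ""))

prion = gapped_to_regex(
    "E*(1,1)[ED]*(1,1)K[LIVM][LIVM]*(1,1)[KR][LIVM][LIVM]*(1,1)[QE]MC*(2,2)QY"
)
# Synthetic string built by hand to fit the pattern.
print(bool(prion.fullmatch("EAEAKLIAKLIAQMCAAQY")))
```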
12
Finding hidden motifs
Given a set of sequences, how do we find unknown motifs?
13
Finding motifs in a sequence (our topic)
Given a known motif and a new sequence, find the occurrences of the motif in the sequence.
As biological sequences may contain errors, we should consider approximate occurrences.
14
Previous approaches
Regular expression approaches: exact matching
Navarro and Raffinot's approach [RECOMB 2002]: exact and approximate matching
Akutsu's approach [IEICE Trans. Information and Systems 1996]: approximate matching
15
Regular expression approach
Pattern P = AA *(2,3) GC *(1,3) TT
Regular expression: AA**(*|ε)GC*(*|ε)(*|ε)TT, where * stands for any single character and ε for the empty string
Run as a nondeterministic finite automaton (NFA) or its equivalent deterministic finite automaton (DFA).
[Figure: NFA for the pattern]
Too general!
16
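Navarro and Raffinot's approach
The NFA is not easy to run and the DFA can be large.
Simulate the NFA with the bit-parallelism technique: the set of active states is kept in a bit vector (e.g. 0101010010000), and all the bits of a machine word can be read and written simultaneously.
[Figure: NFA for the pattern and its bit vector]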
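To illustrate what bit-parallel NFA simulation means, here is the classic Shift-And algorithm for a plain string without gaps or errors. It is not Navarro and Raffinot's algorithm, only the basic technique it builds on; the function name `shift_and` is made up, and Python integers stand in for machine words.

```python
def shift_and(pattern, text):
    """Return the 0-based end positions of the exact occurrences of pattern in text."""
    m = len(pattern)
    # masks[c] has bit i set when pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)  # bit of the final NFA state
    state = 0              # bit i set: pattern[:i+1] matches a suffix of the text read so far
    ends = []
    for j, ch in enumerate(text):
        # One shift, one OR and one AND update every NFA state at once.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            ends.append(j)
    return ends

print(shift_and("GC", "GCAATTGCACTTC"))
```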
17
Navarro and Raffinot's approach
Allow k errors over the whole pattern.
[Figure: one copy of the NFA per error level (0 errors, 1 error)]
Works for small patterns and small numbers of errors.
O(km'n / w) time algorithm (m' is the total length of the pattern, n is the length of the text, w is the word size)
18
Akutsu's approach
Combination of dynamic programming and a balanced search tree.
O(mn log n) time
[Figure: dynamic programming table over the text for P1 *(a1,b1) P2 *(a2,b2) P3; the tree is used to compute the smallest values in the marked region of the table]
19
Drawbacks of the previous approaches
k = 3 for the whole pattern
k1 = 1, k2 = 1, k3 = 1: more sensitive and desirable
[Figure: candidate occurrences with matches marked O and errors marked X, comparing errors concentrated in one subpattern with errors spread over the subpatterns]
20
Our contributions
O(ln + m) time algorithm for the exact gapped pattern matching problem (l: number of subpatterns, n: length of the text, m: length of the pattern)
O(mn) time algorithm for the approximate gapped pattern matching problem
21
Graph Modeling
1. Create a node wherever a subpattern appears (exactly or approximately) in the text.
2. Link two nodes with an edge if they represent two consecutive subpatterns and satisfy the gap condition.
3. If there is a path P1 - P2 - … - Pl in the graph, there is an occurrence of the pattern in the text.
22
Exact matching
Text: GCAATTGCACTTC
P = AA *(2,3) GC *(1,3) TT; P1 = AA, P2 = GC, P3 = TT
Step 1. Create nodes
[Figure: one P1 node, two P2 nodes, and two P3 nodes placed under the text]
23
Exact matching
Text: GCAATTGCACTTC
P = AA *(2,3) GC *(1,3) TT; P1 = AA, P2 = GC, P3 = TT
Step 2. Connect the nodes with edges
[Figure: the P1, P2, and P3 nodes under the text, with edges between consecutive subpatterns]
24
Exact matching
Text: GCAATTGCACTTC
P = AA *(2,3) GC *(1,3) TT; P1 = AA, P2 = GC, P3 = TT
Step 3. Find the path by depth-first search
[Figure: the path P1 - P2 - P3 in the graph]
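The three steps can be sketched directly, using the text T = GCAATTGCACTTC from the exact-matching example. This is a naive illustration under assumptions of my own: positions are 0-based, a node is a pair (subpattern index, start position), and the helper names `find_all`, `build_graph`, and `paths` are hypothetical. It does not attempt the O(ln + m) bound.

```python
def find_all(sub, text):
    """Start positions of every exact occurrence of sub in text."""
    return [i for i in range(len(text) - len(sub) + 1) if text.startswith(sub, i)]

def build_graph(text, subs, gaps):
    # Step 1: one node (i, s) per occurrence of subpattern i at start s.
    nodes = [[(i, s) for s in find_all(sub, text)] for i, sub in enumerate(subs)]
    # Step 2: edge between consecutive subpatterns when the gap length fits.
    edges = {}
    for i in range(len(subs) - 1):
        lo, hi = gaps[i]
        for (_, s) in nodes[i]:
            end = s + len(subs[i])
            edges[(i, s)] = [(i + 1, t) for (_, t) in nodes[i + 1] if lo <= t - end <= hi]
    return nodes, edges

def paths(nodes, edges, last):
    # Step 3: depth-first search from every node of the first subpattern.
    found = []
    def dfs(node, path):
        if node[0] == last:
            found.append(path)
            return
        for nxt in edges.get(node, []):
            dfs(nxt, path + [nxt])
    for start in nodes[0]:
        dfs(start, [start])
    return found

subs, gaps = ["AA", "GC", "TT"], [(2, 3), (1, 3)]
nodes, edges = build_graph("GCAATTGCACTTC", subs, gaps)
print(paths(nodes, edges, len(subs) - 1))
```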
25
A better idea
No need to build the graph explicitly.
Text: GCAATTGCACTTC
P = AA *(2,3) GC *(1,3) TT
Step 1. Find P1 = AA and compute the candidate range for P2.
[Figure: P1 under the text, followed by the candidate range]
26
A better idea
Text: GCAATTGCACTTC
P = AA *(2,3) GC *(1,3) TT
Step 2. Find P2 = GC within the candidate range and compute the candidate range for P3.
[Figure: P1 and P2 under the text, followed by the candidate range]
27
A better idea
Text: GCAATTGCACTTC
P = AA *(2,3) GC *(1,3) TT
Step 3. After finding P3 = TT within the candidate range, we have found an occurrence of P.
[Figure: P1, P2, and P3 under the text]
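The candidate-range idea can be sketched without any graph: keep, for the current subpattern, a table of start positions that are still allowed. The code below is a simplified illustration with assumed conventions (0-based positions, exclusive end positions, hypothetical name `gapped_match_ends`); it marks candidate ranges position by position and so does not attempt the O(ln + m) bound.

```python
def gapped_match_ends(text, subs, gaps):
    """Return the end positions (exclusive) of occurrences of the whole gapped pattern."""
    n = len(text)
    # allowed[s] is True when the current subpattern may start at position s.
    allowed = [True] * (n + 1)
    for i, sub in enumerate(subs):
        # Occurrences of the current subpattern that start inside the candidate range.
        ends = [s + len(sub) for s in range(n - len(sub) + 1)
                if allowed[s] and text.startswith(sub, s)]
        if i == len(subs) - 1:
            return ends
        # Candidate range for the next subpattern: every end plus a gap of lo..hi.
        lo, hi = gaps[i]
        allowed = [False] * (n + 1)
        for e in ends:
            for t in range(e + lo, min(e + hi, n) + 1):
                allowed[t] = True
    return []

print(gapped_match_ends("GCAATTGCACTTC", ["AA", "GC", "TT"], [(2, 3), (1, 3)]))
```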
28
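Approximate matching
Almost the same idea as the exact matching case: find the approximate occurrences of each subpattern instead of the exact ones.
Dynamic programming table for P1 = AA against the text GCAATTGCACTTC (k1 = 0, k2 = 1):
      G  C  A  A  T  T  G  C  A  C  T  T  C
   0  0  0  0  0  0  0  0  0  0  0  0  0  0
A  1  1  1  0  0  1  1  1  1  0  1  1  1  1
A  2  2  2  1  0  1  2  2  2  1  1  2  2  2
Entries of the last row that are at most k1 mark the ends of occurrences of P1; the gap *(2,3) then gives the candidate range for P2.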
29
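Approximate matching
Dynamic programming table for P2 = GC over the candidate range (k2 = 1, k3 = 0); the remaining text is GCACTTC:
      G  C  A  C
   0  0  ∞  ∞  ∞
G  1  0  1  2  3
C  2  1  0  1  2
∞ (infinity): no alignment can start from here.
The gap *(1,3) then gives the candidate range for P3.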
30
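Approximate matching
Dynamic programming table for P3 = TT over the candidate range (k3 = 0); the remaining text is TTC:
      T  T  C
   0  0  0  ∞
T  1  0  0  1
T  2  1  0  1
The 0 in the last row marks an approximate occurrence of the pattern.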
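The tables on these slides can be reproduced with a standard edit-distance recurrence whose top row is 0 inside the candidate range and infinity outside it. The following is a sketch under my own assumptions (0-based text boundaries, hypothetical names `approx_ends` and `approx_gapped_ends`), not the authors' implementation.

```python
INF = float("inf")

def approx_ends(text, sub, k, allowed):
    """End boundaries j where sub matches, with at most k edits, a substring that
    ends at j and starts at an allowed boundary."""
    n, m = len(text), len(sub)
    # Row 0: an alignment may start only at an allowed boundary.
    prev = [0 if allowed[j] else INF for j in range(n + 1)]
    for i in range(1, m + 1):
        cur = [prev[0] + 1] + [INF] * n
        for j in range(1, n + 1):
            cost = 0 if sub[i - 1] == text[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # pattern character missing from the text
                         cur[j - 1] + 1)      # extra character in the text
        prev = cur
    return [j for j in range(n + 1) if prev[j] <= k]

def approx_gapped_ends(text, subs, ks, gaps):
    """End boundaries of approximate occurrences of the whole gapped pattern."""
    n = len(text)
    allowed = [True] * (n + 1)
    ends = []
    for i, (sub, k) in enumerate(zip(subs, ks)):
        ends = approx_ends(text, sub, k, allowed)
        if i < len(subs) - 1:
            # Candidate range for the next subpattern.
            lo, hi = gaps[i]
            allowed = [False] * (n + 1)
            for e in ends:
                for t in range(e + lo, min(e + hi, n) + 1):
                    allowed[t] = True
    return ends

print(approx_gapped_ends("GCAATTGTACTTC", ["AA", "GC", "TT"], [0, 1, 0], [(2, 3), (1, 3)]))
```

The example uses the text T = GCAATTGTACTTC from the approximate-matching example, where GC matches GT with one substitution.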
31
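Handling class of characters
Represent characters as bit masks, one bit per alphabet character in the order A G T C.
[GC] = 0101
Text G, pattern [GC]: 0100 & 0101 = 0100, nonzero, so the characters match.
Text T, pattern [GC]: 0010 & 0101 = 0000, zero, so the characters do not match.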
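A minimal sketch of the bit-mask test, using the bit order A G T C from the slide. The names `BIT`, `class_mask`, and `matches` are chosen here for illustration.

```python
# One bit per alphabet character, in the order A G T C used on the slide.
BIT = {"A": 0b1000, "G": 0b0100, "T": 0b0010, "C": 0b0001}

def class_mask(chars):
    """Bit mask of a character class, e.g. 'GC' for [GC]."""
    mask = 0
    for ch in chars:
        mask |= BIT[ch]
    return mask

def matches(text_char, mask):
    """A text character matches the class when the AND of the masks is nonzero."""
    return (BIT[text_char] & mask) != 0

gc = class_mask("GC")
print(format(gc, "04b"), matches("G", gc), matches("T", gc))
```

With this representation a single character is just a class of size one, so the comparison in the matching loop costs one AND regardless of the class size.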
32
Time Complexity
O(mn), where m is the length of the pattern and n is the length of the text, but faster in practice.
[Figure: the regions of the text examined for P1, P2, and P3]
33
Conclusion
O(ln + m) time algorithm for the exact gapped pattern matching problem.
O(mn) time algorithm for the approximate gapped pattern matching problem.
Open problem: what is the time complexity in the average case?