Finding approximate occurrences of a pattern that contains gaps. Inbok Lee, Costas S. Iliopoulos, Alberto Apostolico, Kunsoo Park
Contents: the exact/approximate gapped pattern matching problem; previous approaches; our contributions.
Exact gapped pattern matching problem. Definition: find the occurrences in the text of a pattern that contains gaps. Pattern P = AA *(2,3) GC *(1,3) TT, with subpatterns P1 = AA, P2 = GC, P3 = TT; *(2,3) is any string whose length is between 2 and 3, and *(1,3) is any string whose length is between 1 and 3.
Example – Exact matching. Pattern P = AA *(2,3) GC *(1,3) TT. Text T = GCAATTGCACTTC. The occurrence is AA, a gap of length 2 (TT), GC, a gap of length 2 (AC), TT.
Approximate gapped pattern matching problem. Definition: find all the substrings of the text that match the pattern when each subpattern Pi is allowed at most ki insertions, deletions, and substitutions. Pattern P = AA *(2,3) GC *(1,3) TT, with P1 = AA, k1 = 0; P2 = GC, k2 = 1; P3 = TT, k3 = 0; *(2,3) is any string whose length is between 2 and 3, and *(1,3) is any string whose length is between 1 and 3.
Example – Approximate matching. Pattern P = AA *(2,3) GC *(1,3) TT, k1 = k3 = 0, k2 = 1. Text T = GCAATTGTACTTC. P2 = GC matches GT with 1 substitution.
Class of characters. Allow more than one character at a position of the pattern. Pattern P = AA *(2,3) G[CT] *(1,3) TT, where [CT] means C or T; P1 = AA, k1 = 0; P2 = G[CT], k2 = 1; P3 = TT, k3 = 0; *(2,3) is any string whose length is between 2 and 3, and *(1,3) is any string whose length is between 1 and 3.
Example – Class of characters. Pattern P = AA *(2,3) G[CT] *(1,3) TT. Text T = GCAATTGTACTTC. G[CT] matches GT without error, because T belongs to the class [CT].
Applications of gapped pattern matching. Information retrieval; data mining; computational biology, especially finding motifs in a sequence.
Motifs. Motifs are biologically important common regions. [Figure: a common region marked in Sequence 1, Sequence 2, Sequence 3, and Sequence 4] Sometimes overall sequence alignment doesn't show the relation between biologically related sequences.
PROSITE database. Database of protein families, domains and motifs. Motifs are represented as gapped patterns over the alphabet of 20 amino acids. Prion protein (Creutzfeldt-Jakob disease): E*(1,1)[ED]*(1,1)K[LIVM][LIVM]*(1,1)[KR][LIVM][LIVM]*(1,1)[QE]MC*(2,2)QY. Ribosomal protein L1: [IM]*(2,2)[LIVA]*(2,3)[LIVM][GA]*(2,2)[LMS][GSNH][PTKR][KRAV]G*(1,1)[LIMF]P[DENSTKQ].
Finding hidden motifs. Given a set of sequences, how do we find unknown motifs?
Finding motifs in a sequence (our topic). Given a known motif and a new sequence, find the occurrences of the motif in the sequence. As biological sequences may contain errors, we should consider approximate occurrences.
Previous approaches. Regular expression approaches: exact matching. Navarro and Raffinot’s approach [RECOMB 2002]: exact and approximate matching. Akutsu’s approach [IEICE Trans. Information and Systems 1996]: approximate matching.
Regular expression approach. Pattern P = AA *(2,3) GC *(1,3) TT. Regular expression: AA ** (*|ε) GC * (*|ε) (*|ε) TT, where * stands for any single character and ε for the empty string. The expression is run as a nondeterministic finite automaton (NFA) or its equivalent deterministic finite automaton (DFA). Too general!
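As a rough illustration of the regular expression approach, the sketch below translates a gapped pattern into a Python `re` expression, writing each gap *(a,b) as `.{a,b}`. The helper names `gapped_to_regex` and `find_occurrences` are hypothetical and not part of the talk, and Python's backtracking engine is only a stand-in for the NFA/DFA simulation mentioned on the slide.

```python
import re

def gapped_to_regex(subpatterns, gaps):
    """Build a regex string from literal subpatterns and (min, max) gap bounds.

    subpatterns: e.g. ["AA", "GC", "TT"]
    gaps: one (a, b) pair between each pair of consecutive subpatterns
    """
    parts = [re.escape(subpatterns[0])]
    for (a, b), sub in zip(gaps, subpatterns[1:]):
        parts.append(".{%d,%d}" % (a, b))  # a gap of a to b arbitrary characters
        parts.append(re.escape(sub))
    return "".join(parts)

def find_occurrences(regex, text):
    """Return one (start, end) span per start position at which the regex matches."""
    # The lookahead consumes nothing, so every start position gets tested.
    pattern = re.compile("(?=(%s))" % regex)
    return [(m.start(1), m.end(1)) for m in pattern.finditer(text)]

regex = gapped_to_regex(["AA", "GC", "TT"], [(2, 3), (1, 3)])
print(regex)                                     # AA.{2,3}GC.{1,3}TT
print(find_occurrences(regex, "GCAATTGCACTTC"))  # [(2, 12)]
```

Positions are 0-based and the end is exclusive, so the span (2, 12) covers AATTGCACTT in the example text.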
Navarro and Raffinot’s approach. An NFA is not easy to run, and the equivalent DFA can be large. The NFA is therefore stored in a bit vector and simulated with the bit-parallelism technique (all the bits of a machine word are read and written simultaneously).
Navarro and Raffinot’s approach. Allows k errors for the whole pattern. [Figure: copies of the NFA for 0 errors and for 1 error] Works for small patterns and a small number of errors. O(km'n / w) time algorithm (m' is the total length of the pattern, n is the length of the text, w is the word size).
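To give a feel for bit-parallelism, here is a minimal sketch of the classical Shift-And algorithm for a plain pattern without gaps. It is a simpler, well-known technique used only as an illustration; it is not Navarro and Raffinot's gapped-pattern algorithm. The function name `shift_and_search` is hypothetical, and because Python integers have arbitrary precision, the machine word size w is not modeled.

```python
def shift_and_search(pattern, text):
    """Return the 0-based end positions of the exact occurrences of pattern."""
    m = len(pattern)
    # masks[c] has bit i set when pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)
    # Bit i of state is set when pattern[:i + 1] is a suffix of the text read so far.
    state = 0
    ends = []
    for j, c in enumerate(text):
        # One shift, one OR and one AND update every NFA state at once.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            ends.append(j)
    return ends

print(shift_and_search("GC", "GCAATTGCACTTC"))  # [1, 7]
```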
Akutsu’s approach. Combination of dynamic programming and a balanced search tree. O(mn log n) time. [Figure: dynamic programming table over the text for P1 *(a1, b1) P2 *(a2, b2) P3; the tree is used to compute the smallest values in the table]
Drawbacks of the previous approaches. [Figure: error positions (X) and matching positions (O) in the subpatterns] k = 3 for the whole pattern, compared with k1 = 1, k2 = 1, k3 = 1, which is more sensitive and desirable.
Our contributions. O(ln + m) time algorithm for the exact gapped pattern matching problem (l: number of subpatterns, n: length of the text, m: length of the pattern). O(mn) time algorithm for the approximate gapped pattern matching problem.
Graph Modeling
1. Create a node wherever a subpattern appears (exactly or approximately) in the text.
2. Link two nodes with an edge if they represent two consecutive subpatterns and satisfy the gap condition.
3. If there is a path P1 - P2 - … - Pl in the graph (l is the number of subpatterns), there is an occurrence of the pattern in the text.
Exact matching. P = AA *(2,3) GC *(1,3) TT, with P1 = AA, P2 = GC, P3 = TT. Text T = GCAATTGCACTTC. Step 1. Create nodes: one for the occurrence of P1, two for the occurrences of P2, and two for the occurrences of P3.
Exact matching. P = AA *(2,3) GC *(1,3) TT, T = GCAATTGCACTTC. Step 2. Connect the nodes with edges.
Exact matching. P = AA *(2,3) GC *(1,3) TT, T = GCAATTGCACTTC. Step 3. Find the path by depth-first search.
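The three steps can be sketched as follows, assuming the example text T = GCAATTGCACTTC. The function names `find_all`, `build_graph` and `paths` are hypothetical, a node is represented as a pair (subpattern index, 0-based start position), and the naive occurrence search is chosen for brevity rather than speed.

```python
def find_all(sub, text):
    """0-based start positions of every exact occurrence of sub in text."""
    return [i for i in range(len(text) - len(sub) + 1) if text.startswith(sub, i)]

def build_graph(subpatterns, gaps, text):
    """Steps 1 and 2: create the nodes, then link consecutive subpatterns
    whose distance satisfies the gap bounds."""
    nodes = [[(i, s) for s in find_all(p, text)] for i, p in enumerate(subpatterns)]
    edges = {}
    for i in range(len(subpatterns) - 1):
        a, b = gaps[i]
        for u in nodes[i]:
            end = u[1] + len(subpatterns[i])  # first position after this occurrence
            edges[u] = [v for v in nodes[i + 1] if a <= v[1] - end <= b]
    return nodes, edges

def paths(nodes, edges, last_index):
    """Step 3: depth-first search from every P1 node to a node of the last subpattern."""
    found = []

    def dfs(node, path):
        if node[0] == last_index:
            found.append(path)
            return
        for nxt in edges.get(node, []):
            dfs(nxt, path + [nxt])

    for start in nodes[0]:
        dfs(start, [start])
    return found

subs, gaps, text = ["AA", "GC", "TT"], [(2, 3), (1, 3)], "GCAATTGCACTTC"
nodes, edges = build_graph(subs, gaps, text)
print(nodes)  # [[(0, 2)], [(1, 0), (1, 6)], [(2, 4), (2, 10)]]
print(paths(nodes, edges, len(subs) - 1))  # [[(0, 2), (1, 6), (2, 10)]]
```

The node list shows one occurrence of P1 and two each of P2 and P3, and only one path runs through all three subpatterns.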
A better idea. No need to build the graph explicitly. P = AA *(2,3) GC *(1,3) TT, T = GCAATTGCACTTC. Step 1. Find P1 = AA and compute the candidate range for P2.
A better idea. P = AA *(2,3) GC *(1,3) TT, T = GCAATTGCACTTC. Step 2. Find P2 = GC within the candidate range and compute the candidate range for P3.
A better idea. P = AA *(2,3) GC *(1,3) TT, T = GCAATTGCACTTC. Step 3. After finding P3 = TT within the candidate range, we have found an occurrence of P.
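A minimal sketch of the candidate-range idea follows, with the hypothetical names `occurrences` and `gapped_match_ends`. It keeps one boolean array of allowed start positions per subpattern instead of an explicit graph. The occurrence search uses `str.find`, which does not guarantee linear time; a linear-time matcher would be needed to reach the stated O(ln + m) bound, so treat this as an illustration of the idea only.

```python
def occurrences(sub, text):
    """Yield the 0-based start positions of sub in text (overlaps included)."""
    i = text.find(sub)
    while i != -1:
        yield i
        i = text.find(sub, i + 1)

def gapped_match_ends(subpatterns, gaps, text):
    """Return the 0-based exclusive end positions of occurrences of the gapped pattern."""
    n = len(text)
    # allowed[s] is True when the current subpattern may start at position s.
    allowed = [True] * (n + 1)
    ends = []
    for i, sub in enumerate(subpatterns):
        ends = [s + len(sub) for s in occurrences(sub, text) if allowed[s]]
        if i == len(subpatterns) - 1:
            break
        a, b = gaps[i]
        # Mark the union of the candidate ranges [e + a, e + b] with a difference
        # array, so the marking costs O(n) however many ranges overlap.
        diff = [0] * (n + 2)
        for e in ends:
            if e + a <= n:
                diff[e + a] += 1
                diff[min(e + b, n) + 1] -= 1
        allowed, running = [], 0
        for s in range(n + 1):
            running += diff[s]
            allowed.append(running > 0)
    return ends

print(gapped_match_ends(["AA", "GC", "TT"], [(2, 3), (1, 3)], "GCAATTGCACTTC"))  # [12]
```

In the example, AA ends at position 4, so GC may start at 6 or 7; GC ends at 8, so TT may start at 9, 10 or 11; TT is found at 10 and the occurrence ends at 12.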
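Approximate matching. Almost the same idea as the exact matching case: find the approximate occurrences of the subpatterns instead of the exact ones. Text T = GCAATTGCACTTC, P1 = AA, k1 = 0, k2 = 1. After P1 is found, the gap *(2,3) gives the candidate range for P2.
Approximate matching. Dynamic programming table for P2 = GC (k2 = 1) from the candidate range onward; ∞ (infinity) means that no alignment can start from here. The gap *(1,3) after an approximate occurrence of P2 gives the candidate range for P3 (k3 = 0).
Text:        G  C  A  C
(empty)   0  0  ∞  ∞  ∞
G         1  0  1  2  3
C         2  1  0  1  2
Approximate matching. Dynamic programming table for P3 = TT (k3 = 0); a value of at most k3 in the last row marks an approximate occurrence of the pattern.
Text:        T  T  C
(empty)   0  0  0  ∞
T         1  0  0  1
T         2  1  0  1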
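The dynamic programming step can be sketched as below, assuming the usual edit-distance recurrence with the first row set to 0 at the allowed start positions and to infinity elsewhere. The names `approx_ends` and `approx_gapped_ends` are hypothetical, and the candidate ranges are marked naively, so this is an illustration rather than the authors' exact implementation.

```python
INF = float("inf")

def approx_ends(sub, k, text, allowed):
    """Positions j such that a substring starting at an allowed position and
    ending just before j is within edit distance k of sub."""
    n = len(text)
    # Row for the empty prefix: 0 where an alignment may start, infinity elsewhere.
    prev = [0 if allowed[j] else INF for j in range(n + 1)]
    for r in range(1, len(sub) + 1):
        cur = [prev[0] + 1] + [INF] * n
        for j in range(1, n + 1):
            cost = 0 if sub[r - 1] == text[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # pattern character left unmatched
                         cur[j - 1] + 1)      # extra text character
        prev = cur
    return [j for j in range(n + 1) if prev[j] <= k]

def approx_gapped_ends(subpatterns, errors, gaps, text):
    """Exclusive end positions of approximate occurrences of the gapped pattern."""
    n = len(text)
    allowed = [True] * (n + 1)
    ends = []
    for i, (sub, k) in enumerate(zip(subpatterns, errors)):
        ends = approx_ends(sub, k, text, allowed)
        if i < len(gaps):
            a, b = gaps[i]
            # Candidate starts for the next subpattern: a to b characters after an end.
            allowed = [False] * (n + 1)
            for e in ends:
                for s in range(e + a, min(e + b, n) + 1):
                    allowed[s] = True
    return ends

# Example from the slides: k1 = 0, k2 = 1, k3 = 0, text GCAATTGTACTTC.
print(approx_gapped_ends(["AA", "GC", "TT"], [0, 1, 0],
                         [(2, 3), (1, 3)], "GCAATTGTACTTC"))  # [12]
```

With k2 = 0 the same call returns an empty list, because the text contains GT rather than GC inside the candidate range.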
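Handling class of characters. Represent characters as bit masks. With one bit for each of A, G, T, C in this order, the class [GC] is 0101. A text character matches a pattern position when the bitwise AND of their masks is nonzero: G & [GC] is nonzero (match), T & [GC] is zero (mismatch).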
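A small sketch of the bit-mask test, assuming the bit order A, G, T, C from the slide with A as the highest bit. The names `ALPHABET`, `BIT`, `class_mask` and `matches` are hypothetical.

```python
# One bit per alphabet symbol, in the order A, G, T, C (A is the highest bit).
ALPHABET = "AGTC"
BIT = {c: 1 << (len(ALPHABET) - 1 - i) for i, c in enumerate(ALPHABET)}

def class_mask(chars):
    """Bit mask of a character class such as "GC"."""
    mask = 0
    for c in chars:
        mask |= BIT[c]
    return mask

def matches(text_char, mask):
    """A text character matches a class when the AND of the two masks is nonzero."""
    return (BIT[text_char] & mask) != 0

gc = class_mask("GC")
print(format(gc, "04b"))  # 0101
print(matches("G", gc))   # True
print(matches("T", gc))   # False
```

A single pattern character is just a class with one bit set, so the same comparison serves both plain characters and classes.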
Time Complexity. O(mn), where m is the length of the pattern and n is the length of the text, but faster in practice.
Conclusion. O(ln + m) time algorithm for the exact gapped pattern matching problem. O(mn) time algorithm for the approximate gapped pattern matching problem. Open problem: time complexity in the average case?