Published by Hugh Adams. Modified over 9 years ago.
1
Filter Algorithms for Approximate String Matching Stefan Burkhardt
2
Outline: Motivation; Filter Algorithms; Gapped q-grams; Experimental Analysis
3
Motivation
Computational Biology: EST clustering, assembly, genome comparison (e.g. Human/Mouse)
Information Retrieval: phonebooks, dictionaries, search engines
Many more...
Section: Problems and Motivation (Why? Approximate String Matching. Edit and Hamming Distance.)
4
The global approximate string matching problem
Given a pattern P, a target S, an error level k and a string distance d(x,y):
Find all substrings y of S with d(P, y) ≤ k.
Example: P = GAT, S = ACTGATAACGTTAGCCATGG
5
The global approximate string matching problem
d(x,y) = Hamming distance: the k-mismatches problem
d(x,y) = edit distance: the k-differences problem
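The two distances and the k-mismatches problem can be illustrated with a naive scan. This is only a minimal sketch for orientation, not an algorithm from the talk; the function names `hamming`, `edit_distance` and `k_mismatches` are chosen here for illustration.

```python
def hamming(x: str, y: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def edit_distance(x: str, y: str) -> int:
    """Classic dynamic program: insertions, deletions, substitutions cost 1."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, 1):
        cur = [i]
        for j, b in enumerate(y, 1):
            cur.append(min(prev[j] + 1,              # delete a from x
                           cur[j - 1] + 1,           # insert b into x
                           prev[j - 1] + (a != b)))  # match or substitute
        prev = cur
    return prev[-1]


def k_mismatches(P: str, S: str, k: int) -> list[int]:
    """Start positions in S of windows within Hamming distance k of P."""
    m = len(P)
    return [i for i in range(len(S) - m + 1) if hamming(P, S[i:i + m]) <= k]


print(k_mismatches("GAT", "ACTGATAACGTTAGCCATGG", 0))
print(k_mismatches("GAT", "ACTGATAACGTTAGCCATGG", 1))
```

The naive scan compares P against every window of S, which is exactly the cost that filter algorithms try to avoid.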
6
Filter Algorithms (How? BLAST. The q-gram Lemma and QUASAR.)
Filtration phase: the filter algorithm applies a filter criterion to P and S and reports potential matches.
Verification phase: an exact algorithm examines the potential matches and separates true matches from false matches.
7
BLAST (Altschul, Karlin, et al.):
A sequential scan of S locates all q-grams that match q-grams of P.
Iterative extension with a cutoff finds good matches.
Problems for high similarity: the sequential scan is quite time-consuming, and single q-grams are unspecific.
8
Indexed filter algorithm
S is preprocessed into an index. The indexed filter algorithm evaluates P against the index and reports potential matches.
Verification phase: an exact algorithm examines the potential matches and separates true matches from false matches.
9
Indexed filter algorithm
Pro: potentially faster evaluation of the filter criterion
Con: preprocessing time; extra space required; only good for some filter criteria
10
QUASAR (Burkhardt, Rivals et al. 99), an indexed filter algorithm:
Filter criterion: q-gram Lemma (Jokinen, Ukkonen 91)
Index structure: lookup table (Jokinen, Ukkonen 91) with suffix array (Manber, Myers 90)
Match detection: overlapping rectangles in the DP matrix
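To make the idea of an indexed filter concrete, here is a minimal sketch of a q-gram lookup table. It is a simplification under stated assumptions: a plain Python dictionary stands in for the lookup table, no suffix array is used, and the names `build_index` and `qgram_hits` are hypothetical. It is not QUASAR's actual index.

```python
from collections import defaultdict


def build_index(S: str, q: int) -> dict[str, list[int]]:
    """Preprocessing: map every q-gram of S to its start positions in S."""
    index = defaultdict(list)
    for j in range(len(S) - q + 1):
        index[S[j:j + q]].append(j)
    return index


def qgram_hits(P: str, index: dict[str, list[int]], q: int) -> list[tuple[int, int]]:
    """All hits (i, j): the q-gram at position i of P occurs at position j of S."""
    return [(i, j)
            for i in range(len(P) - q + 1)
            for j in index.get(P[i:i + q], [])]


S = "GGGGGGGGGG" + "TCGAATAC" + "GGGGGGGGGG"
index = build_index(S, 3)
print(qgram_hits("TCGATTAC", index, 3))
```

Once the index exists, each query only touches the positions where one of its q-grams occurs, instead of scanning all of S.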
11
The q-gram Lemma (Jokinen, Ukkonen, 1991)
For a pattern P, a substring y of S and a value k, matches between P and y with at most k errors share at least t = |P| - q + 1 - kq substrings of length q (q-grams).
Example: |P| = 8, q = 3. Total number of q-grams: |P| - q + 1 = 6.
P = TCGATTAC has the q-grams TCG, CGA, GAT, ATT, TTA, TAC.
Each error can 'destroy' up to q matching q-grams, so k errors lose at most kq q-grams. In TCGAATAC, which differs from P by a single error, the q-grams GAT, ATT and TTA no longer match.
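The lemma can be checked on the slide's example with a few lines of code. This is a small sketch; the helper names `qgrams`, `lemma_threshold` and `shared_qgrams` are introduced here for illustration only.

```python
from collections import Counter


def qgrams(s: str, q: int) -> list[str]:
    """All overlapping substrings of length q, in order of occurrence."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def lemma_threshold(m: int, q: int, k: int) -> int:
    """t = |P| - q + 1 - k*q for a pattern of length m and k errors."""
    return m - q + 1 - k * q


def shared_qgrams(x: str, y: str, q: int) -> int:
    """Number of q-grams common to x and y, counted with multiplicity."""
    common = Counter(qgrams(x, q)) & Counter(qgrams(y, q))
    return sum(common.values())


P, y = "TCGATTAC", "TCGAATAC"
print(qgrams(P, 3))
print(lemma_threshold(len(P), 3, 1), shared_qgrams(P, y, 3))
```

For this pair the single error removes exactly q = 3 q-grams, so the number of shared q-grams equals the threshold t.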
12
Match detection (Jokinen, Ukkonen 91):
Overlapping rectangles of width 2|P| in the DP matrix (P against S).
A rectangle with at least t hits is a potential match.
[Figure: rectangles over S containing 3 hits, 2 hits and 1 hit, with t = 3.]
13
Match detection (Jokinen, Ukkonen 91): overlapping rectangles of width 2|P| in the DP matrix; a rectangle with at least t hits is a potential match.
QUASAR (Burkhardt, Rivals et al. 1999): wider rectangles are efficient in practice (2048 for QUASAR).
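The rectangle-based match detection can be sketched as hit counting over overlapping blocks of S. The sketch below assumes blocks of width 2|P| that start every |P| positions, so that neighbouring blocks overlap by |P|; the function name `potential_blocks` and the block layout are assumptions made for illustration, not the exact scheme of the cited papers.

```python
from collections import Counter


def potential_blocks(P: str, S: str, q: int, k: int) -> list[tuple[int, int]]:
    """Return (start, end) ranges of S whose hit count reaches the threshold t."""
    m = len(P)
    t = m - q + 1 - k * q
    if t < 1:
        raise ValueError("threshold below 1: the q-gram lemma gives no filter")
    # How often each q-gram occurs in P (one hit per occurrence pair).
    p_counts = Counter(P[i:i + q] for i in range(m - q + 1))
    hits = Counter()
    for j in range(len(S) - q + 1):
        n = p_counts.get(S[j:j + q], 0)
        if n:
            b = j // m           # block starting at b*m contains position j
            hits[b] += n
            if b > 0:
                hits[b - 1] += n  # so does the previous, overlapping block
    return sorted((b * m, min(len(S), b * m + 2 * m))
                  for b, h in hits.items() if h >= t)


S = "GGGGGGGGGG" + "TCGAATAC" + "GGGGGGGGGG"
print(potential_blocks("TCGATTAC", S, 3, 1))
```

Every reported block still has to go through the verification phase; the counting step only rules out regions that cannot contain a match.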
14
QUASAR (Burkhardt, Rivals et al. 1999):
BLAST for the verification of the potential matches
Wider rectangles as match regions
Index is a combination of lookup table and suffix array
Used for EST clustering at the DKFZ in Heidelberg
Searches for EST clustering about 30 times faster than BLAST
15
Gapped q-grams: A new (old?) idea; Hamming Distance; Finding good shapes
16
General idea: use gapped q-grams; call the arrangement of gaps the shape.
Gapped 3-shape: ##.# (# = match, . = don't care)
TCGATTAC yields TC.A, CG.T, GA.T, AT.A, TT.C
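Extracting gapped q-grams for a given shape takes only a few lines. In this sketch the shape is written as a string over `#` (match) and `.` (don't care), following the slide's notation; the function name `gapped_qgrams` is an illustrative choice.

```python
def gapped_qgrams(s: str, shape: str) -> list[str]:
    """Slide the shape over s; keep characters under '#', print '.' for gaps."""
    span = len(shape)
    return ["".join(s[i + j] if c == "#" else "." for j, c in enumerate(shape))
            for i in range(len(s) - span + 1)]


print(gapped_qgrams("TCGATTAC", "##.#"))
```

A shape of span 4 on a string of length 8 yields 8 - 4 + 1 = 5 gapped q-grams, one fewer than the 6 contiguous 3-grams of the same string.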
17
Previous work:
Califano, Rigoutsos (1993)
Pevzner, Waterman (1995)
Lehtinen, Sutinen, Tarhio (1996)
Limited attention paid to the choice of shapes; no exact threshold given for the general case.
Recently:
Buhler (2001): multiple shapes
Ma, Tromp, Li (2002): Pattern Hunter, threshold t = 1
18
The Threshold t
Definition: t is the number of remaining q-grams in a worst-case placement of k errors.
Classic 3-shape ###, k = 3: t = 0, no filter! With the placement OOXOOXOOXOO (X = error) the q-grams are OOX OXO XOO OOX OXO XOO OOX OXO XOO, and every one contains an error.
Gapped 3-shape ##.#, k = 3: t = 1. With the placement OOOXXOOXOOO the q-grams are OO.X OO.X OX.O XX.O XO.X OO.O OX.O XO.O, and OO.O remains.
19
The Threshold t
Definition: t is the number of remaining q-grams in a worst-case placement of k errors.
Classic 3-shape ###, k = 3: t = 0, no filter!
Gapped 3-shape ##.#, k = 3: t = 1
Gapped shapes can have higher(!) thresholds t than ungapped shapes.
There is no simple formula for t; we used a DP-based approach to compute t.
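For small patterns the threshold can be found by simply trying every placement of k errors. The sketch below is a brute-force enumeration, not the DP-based approach mentioned on the slide, and it is only practical for small |P| and k. The function name `threshold` and the shape encoding (`#` for match, `.` for don't care) are assumptions for illustration.

```python
from itertools import combinations


def threshold(shape: str, m: int, k: int) -> int:
    """Minimum number of error-free q-grams over all placements of k mismatches
    in a pattern of length m (Hamming distance)."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    starts = range(m - len(shape) + 1)
    best = len(starts)
    for errors in combinations(range(m), k):
        bad = set(errors)
        # A q-gram survives if none of its '#' positions carries an error.
        surviving = sum(all(s + o not in bad for o in offsets) for s in starts)
        best = min(best, surviving)
    return best


print(threshold("###", 11, 3), threshold("##.#", 11, 3))
```

For a pattern of length 11 and k = 3 this reproduces the slide's values: the contiguous shape has no surviving q-gram in the worst case, while the gapped shape always keeps at least one.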
20
[Figure: filter tradeoff plot. Axis labels: number of q-gram hits (filtration time) and number of potential matches (verification time), each marked high and low; q marked low to high. A tradeoff line separates good filters from bad filters.]
21
Finding good shapes
[Figure: the filter tradeoff plot, annotated. The number of q-gram hits scales as |S| · 1/|Σ|^q; the corresponding expression for the number of potential matches is marked with a question mark. A tradeoff line separates good filters from bad filters.]
22
Finding good shapes
For |P| = 13, k = 3 and q = 3 the shapes ##.# and ### both have a threshold of t = 2. However, the gapped shape returns fewer potential matches.
Reason: two overlapping occurrences of ##.# cover 5 characters, two overlapping occurrences of ### only 4. A random match requires 5 matching characters instead of only 4 for the ungapped q-gram. This makes random matches less likely.
23
Finding good shapes
We define the minimum coverage c_m as the minimum number of matching characters for any distinct arrangement of t matching shapes in P and S.
For t = 2 and the shape ##.# the minimum coverage is 5.
[Figure: the strings CGACGATTGAT and ACTCGATTAGA with the shape ##.# marking 5 covered characters.]
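The minimum coverage can also be computed by enumeration for small cases. This sketch reads the definition as: choose t distinct start positions for the shape inside a pattern of length m and count the distinct positions covered by `#`. The function name `minimum_coverage` and this reading of "distinct arrangement" are assumptions; the enumeration grows quickly with t and m.

```python
from itertools import combinations


def minimum_coverage(shape: str, t: int, m: int) -> int:
    """Smallest number of distinct positions covered by t distinct placements
    of the shape within a pattern of length m."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    starts = range(m - len(shape) + 1)
    return min(len({s + o for s in combo for o in offsets})
               for combo in combinations(starts, t))


print(minimum_coverage("##.#", 2, 13), minimum_coverage("###", 2, 13))
```

For t = 2 this gives 5 for the gapped shape and 4 for the contiguous one, matching the values on the slides.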
24
[Figure: the filter tradeoff plot with both estimates. Number of q-gram hits: |S| · 1/|Σ|^q, with q marked low to high. Number of potential matches: |S| · 1/|Σ|^(c_m), with c_m marked low to high. A tradeoff line separates good filters from bad filters.]
25
Compute t and minimum coverage for all shapes with |P| = 50 and k = 3, 4, 5, 6.
[Figure: number of shapes (0 to 600) with a given minimum coverage (8 to 22) for k = 5, q = 8, grouped by threshold t = 1 to t = 5, with markers for the median, contiguous and best shapes.]
26
Experimental Analysis: Speed and Filtration Efficiency; The Heuristic Zone
27
Experimental Analysis (A few different Filters. Speed and Filtration Efficiency. The Heuristic Zone.)
k = 5, |P| = 50, |S| = 50Mbps
[Figure: hits and matches for gapped (Hamming) and contiguous filters, plotted against q (6 to 12) and minimum coverage (8 to 24); axis ticks in powers of two, from 2^12 to 2^22 and from 2^-8 to 2^16.]
31
From Hits to Matches: Describing Filter Properties
Filters usually have 3 'recognition zones' depending on k:
1. Guarantee zone (finds all approximate matches)
2. Heuristic zone (finds some of the approximate matches)
3. Negative zone (guaranteed not to find matches)
[Figure: recognition rate (0% to 100%) against the number of errors (0 to |P|), with marks at k and |P|-mc.]
32
The Heuristic Zone
Problem: behaviour in the heuristic zone is hard to predict.
[Figure: recognition rate (0% to 100%) against the number of errors (0 to |P|), with the heuristic zone between k and |P|-mc.]
33
A simple idea: Sampling!
For a value i:
1. Generate s sample strings with i random errors each
2. Run a filter algorithm on these samples
3. Record how many strings were recognized (in percent)
This allows an experimental evaluation of the heuristic zone.
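The three sampling steps can be sketched as follows. The sketch makes several simplifying assumptions that are not from the talk: errors are mismatches only (Hamming distance), the alphabet is ACGT, and the "filter" merely counts error-free shape occurrences at aligned positions and compares the count with a given threshold t. The names `mutate` and `recognition_rate` are hypothetical.

```python
import random


def mutate(P: str, i: int, rng: random.Random, alphabet: str = "ACGT") -> str:
    """Step 1: copy P and introduce exactly i mismatches at random positions."""
    chars = list(P)
    for pos in rng.sample(range(len(P)), i):
        chars[pos] = rng.choice([c for c in alphabet if c != chars[pos]])
    return "".join(chars)


def recognition_rate(P: str, shape: str, t: int, i: int,
                     samples: int = 1000, seed: int = 0) -> float:
    """Steps 2 and 3: percentage of samples with at least t matching q-grams."""
    rng = random.Random(seed)
    offsets = [j for j, c in enumerate(shape) if c == "#"]
    starts = range(len(P) - len(shape) + 1)
    recognized = 0
    for _ in range(samples):
        y = mutate(P, i, rng)
        hits = sum(all(P[s + o] == y[s + o] for o in offsets) for s in starts)
        recognized += hits >= t
    return 100.0 * recognized / samples


P = "ACGT" * 12 + "AC"        # |P| = 50
t = 50 - 11 + 1 - 3 * 11      # contiguous q = 11, k = 3
for errors in (0, 3, 10, 20, 50):
    print(errors, recognition_rate(P, "#" * 11, t, errors, samples=200))
```

Up to k errors the rate is 100% by the q-gram lemma, and with every position mutated it is 0%; the values in between are what the sampling experiment is meant to reveal.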
34
|P| = 50, 1000 samples for each error level
[Figure: recognition rate (0% to 100%) against the number of errors (0 to 30) for contiguous filters with k=3, q=11 and k=4, q=9.]
35
|P| = 50, 1000 samples for each error level
[Figure: recognition rate (0% to 100%) against the number of errors (0 to 30). Contiguous: k=4, q=9 and k=3, q=11. Gapped, edit: k=5, q=10; k=4, q=11; k=3, q=11.]
36
|P| = 50, 1000 samples for each error level
[Figure: recognition rate (0% to 100%) against the number of errors (0 to 30). Contiguous: k=4, q=9 and k=3, q=11. Gapped, edit: k=5, q=10; k=4, q=11; k=3, q=11. BLAST is added, together with a further label k=4, q=10.]
38
|P| = 50, 1000 samples for each error level
[Figure: zoomed view, recognition rate (50% to 100%) against the number of errors (0 to 15). Contiguous: k=4, q=9 and k=3, q=11. Gapped, edit: k=3, q=11; k=4, q=11; k=5, q=10. BLAST, with further labels k=3, q=11 and k=4, q=10.]
39
Conclusion - Future Work
Our work:
Significant sensitivity improvement over existing filters
Required modifications easy to implement
Methods for describing filter properties
Future work:
Combination of 'orthogonal' shapes into one filter
Use of word neighborhoods
Database of filter properties for good shapes