Filter Algorithms for Approximate String Matching Stefan Burkhardt
Outline: Motivation, Filter Algorithms, Gapped q-grams, Experimental Analysis
Motivation
Computational Biology: EST Clustering, Assembly, Genome comparison (e.g. Human/Mouse)
Information Retrieval: Phonebooks, Dictionaries, Search Engines
Many more…
Problems and Motivation: Why? | Approximate String Matching | Edit and Hamming Distance
The global approximate string matching problem
Given a pattern P, a target S, an error level k and a string distance d(x,y):
find all substrings y of S with d(P,y) ≤ k.
Example: P = GAT, S = ACTGATAACGTTAGCCATGG
The global approximate string matching problem
d(x,y) = Hamming Distance: the k-mismatches problem
d(x,y) = Edit Distance: the k-differences problem
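The two distances and the k-mismatches problem can be made concrete with a small sketch. This is a naive reference implementation, not a filter algorithm; the function names `hamming_distance`, `edit_distance` and `k_mismatches` are chosen here for illustration and do not come from the slides.

```python
def hamming_distance(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and substitutions (row-wise DP)."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        curr = [i]
        for j, b in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # match or substitute
        prev = curr
    return prev[-1]


def k_mismatches(pattern: str, target: str, k: int) -> list[int]:
    """Start positions of all windows of the target within Hamming distance k."""
    m = len(pattern)
    return [i for i in range(len(target) - m + 1)
            if hamming_distance(pattern, target[i:i + m]) <= k]


print(k_mismatches("GAT", "ACTGATAACGTTAGCCATGG", 1))  # [3, 9, 15]
```

The scan compares every window of S against P, which is exactly the cost that filter algorithms try to avoid.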
Filter Algorithms: How? | BLAST | The q-gram Lemma and QUASAR
Filter Algorithm, Filtration Phase: apply the Filter Criterion to P and S to obtain Potential Matches
Exact Algorithm, Verification Phase: examine the Potential Matches and separate True Matches from False Matches
BLAST (Altschul, Karlin, et al.):
A sequential scan of S locates all q-grams that match P.
Iterative extension with a cutoff finds good matches.
Problems for high similarity: the sequential scan is quite time consuming, and single q-grams are unspecific.
Indexed Filter Algorithm
Preprocess S into an Index.
Indexed Filter Algorithm: use P and the Index to obtain Potential Matches.
Exact Algorithm, Verification Phase: examine the Potential Matches and separate True Matches from False Matches.
Indexed Filter Algorithm
Pro: potentially faster evaluation of the filter criterion
Con: preprocessing time, extra space required, only good for some filter criteria
QUASAR (Burkhardt, Rivals et al. 99):
Filter Criterion: q-gram Lemma (Jokinen, Ukkonen 91)
Index Structure: Lookup table (Jokinen, Ukkonen 91) with suffix array (Manber, Myers 90)
Match Detection: overlapping rectangles in DP-Matrix
The q-gram Lemma (Jokinen, Ukkonen, 1991)
For a pattern P, a substring y of S and a value k: if P and y match with at most k errors, they share at least t = |P| - q + 1 - kq substrings of length q (q-grams).
Example: P = TCGATTAC, |P| = 8, q = 3. Total # of q-grams: |P| - q + 1 = 6 (TCG, CGA, GAT, ATT, TTA, TAC).
Each error can 'destroy' up to q matching q-grams (e.g. TCGATTAC vs. TCGAATAC), so k errors lose at most kq q-grams.
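A short sketch of the lemma on the slide's example, assuming substitution errors only. The helper names `qgrams`, `lemma_threshold` and `shared_qgrams` are illustrative and not taken from the slides.

```python
from collections import Counter


def qgrams(s: str, q: int) -> list[str]:
    """All overlapping substrings of length q, in order of occurrence."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def lemma_threshold(m: int, q: int, k: int) -> int:
    """t = |P| - q + 1 - kq from the q-gram Lemma."""
    return m - q + 1 - k * q


def shared_qgrams(p: str, y: str, q: int) -> int:
    """Number of q-grams common to p and y, counted with multiplicity."""
    common = Counter(qgrams(p, q)) & Counter(qgrams(y, q))
    return sum(common.values())


P, y, q, k = "TCGATTAC", "TCGAATAC", 3, 1
print(qgrams(P, q))                    # ['TCG', 'CGA', 'GAT', 'ATT', 'TTA', 'TAC']
print(lemma_threshold(len(P), q, k))   # 3
print(shared_qgrams(P, y, q))          # 3
```

Here the single substitution destroys exactly q = 3 of the 6 q-grams, so the number of shared q-grams meets the threshold t = 3 with no slack.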
Match Detection (Jokinen, Ukkonen 91):
overlapping rectangles of width 2|P| in the DP-Matrix
a rectangle with at least t hits is a potential match
[Figure: DP-Matrix of S against P with overlapping rectangles containing 3 hits, 2 hits and 1 hit; t = 3]
QUASAR (Burkhardt, Rivals et al. 1999): wider rectangles are efficient in practice (2048 for QUASAR)
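The rectangle counting can be sketched as follows. This is a simplified illustration, not QUASAR's implementation: it uses a plain dictionary as the lookup table (no suffix array), identifies each hit by its diagonal j - i in the DP-Matrix, and counts hits in blocks of 2w diagonals that overlap by w. The names `build_index` and `candidate_blocks` and the default w = |P| are assumptions made for this example.

```python
from collections import defaultdict


def build_index(target: str, q: int) -> dict[str, list[int]]:
    """Lookup table: q-gram -> list of start positions in the target."""
    index = defaultdict(list)
    for j in range(len(target) - q + 1):
        index[target[j:j + q]].append(j)
    return index


def candidate_blocks(pattern, index, q, k, width=None):
    """Start diagonals of all blocks [b*w, b*w + 2w) holding at least t hits."""
    m = len(pattern)
    w = width or m
    t = m - q + 1 - k * q
    if t < 1:
        raise ValueError("threshold below 1: the lemma gives no filter")
    counts = defaultdict(int)
    for i in range(m - q + 1):
        for j in index.get(pattern[i:i + q], ()):
            b = (j - i) // w      # diagonal j - i lies in blocks b and b - 1
            counts[b] += 1
            counts[b - 1] += 1
    return sorted(b * w for b, c in counts.items() if c >= t)


S = "ACTGATAACGTTAGCCATGG"
index = build_index(S, 3)
# AACGATAGC equals S[6:15] = AACGTTAGC up to one substitution
print(candidate_blocks("AACGATAGC", index, q=3, k=1))  # [-9, 0]
```

Each reported block is only a potential match and still has to be verified by an exact algorithm. Because consecutive blocks overlap by w diagonals, hits that drift across up to w neighbouring diagonals (as insertions and deletions cause) always fall together into at least one block.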
QUASAR (Burkhardt, Rivals et al. 1999):
BLAST for the verification of the potential matches
wider rectangles as match regions
the index is a combination of Lookup Table and Suffix Array
used for EST-Clustering at the DKFZ in Heidelberg
searches for EST-Clustering about 30 times faster than BLAST
Gapped q-grams A new (old?) idea Hamming Distance Finding good shapes
General idea: use gapped q-grams; call the arrangement of gaps the shape
gapped 3-shape: ##.# (# = Match, . = Don’t care)
Example: TCGATTAC yields TC.A, CG.T, GA.T, AT.A, TT.C
Previous work...
Califano, Rigoutsos (1993)
Pevzner, Waterman (1995)
Lehtinen, Sutinen, Tarhio (1996)
limited attention paid to the choice of shapes
no exact threshold given for the general case
Recently...
Buhler (2001): Multiple Shapes
Ma, Tromp, Li (2002): PatternHunter
threshold t = 1
The Threshold t
Definition: t is the number of remaining q-grams in a worst-case placement of k errors
(O = matching character, X = error)
classic 3-shape ###, k = 3: OOXOOXOOXOO gives OOX OXO XOO OOX OXO XOO OOX OXO XOO, so t = 0 (no filter!)
gapped 3-shape ##.#, k = 3: OOOXXOOXOOO gives OO.X OO.X OX.O XX.O XO.X OO.O OX.O XO.O, so t = 1
gapped shapes can have higher(!) thresholds t than ungapped shapes
no simple formula for t
we used a DP-based approach to compute t
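For small cases the definition of t can be checked directly by trying every placement of k errors. The sketch below is a brute-force enumeration, not the DP-based approach mentioned on the slide, and it assumes Hamming distance (substitutions only). The function name `threshold` and the shape notation as a Python string are choices made for this example.

```python
from itertools import combinations


def threshold(shape: str, m: int, k: int) -> int:
    """Number of error-free q-grams left by the worst placement of k errors.

    shape uses '#' for a match position and '.' for a don't-care position;
    m is the pattern length |P|.
    """
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    starts = range(m - len(shape) + 1)

    def survivors(errors: set[int]) -> int:
        # a q-gram survives if none of its match positions carries an error
        return sum(all(s + o not in errors for o in offsets) for s in starts)

    return min(survivors(set(e)) for e in combinations(range(m), k))


print(threshold("###", 11, 3))   # 0
print(threshold("##.#", 11, 3))  # 1
```

The enumeration visits all C(m, k) error placements, so it is only practical for short patterns and small k.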
[Figure: tradeoff diagram. Axes: # of q-gram hits and filtration time (low to high), # of potential matches and verification time (low to high). A tradeoff line follows q from low to high; regions are marked good filters and bad filters.]
Finding good shapes
[Figure: tradeoff diagram of # of potential matches (low to high) against # of q-gram hits, with the label |S| · 1/|Σ|^q, a tradeoff line following q from low to high, regions marked good filters and bad filters, and a question mark]
Finding good shapes
For |P| = 13, k = 3 and q = 3 the shapes ##.# and ### both have a threshold of t = 2. However, the gapped shape returns fewer potential matches.
Reason: with ##.#, a random match of two overlapping q-grams requires 5 matching characters instead of only 4 for the ungapped shape ###. This makes random matches less likely.
Finding good shapes
We define the minimum coverage c_m as the minimum number of matching characters for any arrangement of t distinct matching shapes in P and S.
For t = 2 and the shape ##.# the minimum coverage is 5.
Example: CGACGATTGAT and ACTCGATTAGA share two matching ##.# shapes that together cover the 5 characters CGATT.
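The minimum coverage can also be found by enumeration for small t. The sketch below assumes that all t shapes lie on the same diagonal (the Hamming distance case), so an arrangement is just a set of t distinct start offsets; the function name `minimum_coverage` is illustrative.

```python
from itertools import combinations


def minimum_coverage(shape: str, t: int) -> int:
    """Fewest characters covered by t distinct placements of the shape."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    span = len(shape)
    best = t * len(offsets)  # upper bound: t placements without any overlap
    # Fix the first placement at 0. Start gaps larger than the span cannot
    # reduce the coverage, so later starts are limited to (t - 1) * span.
    for rest in combinations(range(1, (t - 1) * span + 1), t - 1):
        covered = {s + o for s in (0, *rest) for o in offsets}
        best = min(best, len(covered))
    return best


print(minimum_coverage("##.#", 2))  # 5
print(minimum_coverage("###", 2))   # 4
```

For t = 2 this reproduces the comparison on the slides: two ##.# hits need at least 5 matching characters, two ### hits only 4.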
[Figure: tradeoff diagram of # of potential matches against # of q-gram hits; labels |S| · 1/|Σ|^q with q from low to high and |S| · 1/|Σ|^c_m with c_m from low to high; a tradeoff line; regions marked good filters and bad filters]
Compute t and minimum coverage for all shapes with |P| = 50 and k = 3, 4, 5, 6.
[Figure: number of shapes with a given minimum coverage for k = 5, q = 8, grouped by threshold t = 1 to t = 5; marked shapes: median, contiguous, best]
Experimental Analysis Speed and Filtration Efficiency The Heuristic Zone
A few different Filters
k = 5, |P| = 50, |S| = 50 Mbp
[Figure: q, minimum coverage, matches and hits for the filters "gapped, Hamming" and "contiguous"; the values are not recoverable from the slide text]
From Hits to Matches: Describing Filter Properties
Filters usually have 3 'recognition zones' depending on k:
1. Guarantee zone (finds all approximate matches)
2. Heuristic zone (finds some of the approximate matches)
3. Negative zone (guaranteed not to find matches)
[Figure: recognition rate (0% to 100%) against the number of errors (0 to |P|), with marks at k and at |P| - mc on the error axis]
The Heuristic Zone
[Figure: recognition rate (0% to 100%) against the number of errors (0 to |P|), with marks at k and at |P| - mc and the Heuristic Zone highlighted]
Problem: behaviour in the Heuristic Zone is hard to predict
A simple idea: Sampling!
For a value i:
1. Generate s sample strings with i random errors each
2. Run a filter algorithm on these samples
3. Record how many strings were recognized (in percent)
This allows an experimental evaluation of the Heuristic Zone
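The three sampling steps can be sketched as below. Several things here are assumptions for the sake of a runnable example: errors are substitutions only (Hamming distance), the filter is a plain count of matching shape placements between the pattern and the sample at the same positions, and the names `mutate`, `recognized` and `recognition_rate` as well as the DNA alphabet are illustrative.

```python
import random


def mutate(pattern: str, i: int, rng: random.Random, alphabet: str = "ACGT") -> str:
    """Copy of the pattern with exactly i substituted positions."""
    sample = list(pattern)
    for pos in rng.sample(range(len(pattern)), i):
        sample[pos] = rng.choice([c for c in alphabet if c != pattern[pos]])
    return "".join(sample)


def recognized(pattern: str, sample: str, shape: str, t: int) -> bool:
    """Filter criterion: at least t placements of the shape match exactly."""
    offsets = [j for j, c in enumerate(shape) if c == "#"]
    hits = sum(all(pattern[s + o] == sample[s + o] for o in offsets)
               for s in range(len(pattern) - len(shape) + 1))
    return hits >= t


def recognition_rate(pattern, i, shape, t, samples=1000, seed=0) -> float:
    """Percentage of samples with i random errors that pass the filter."""
    rng = random.Random(seed)
    found = sum(recognized(pattern, mutate(pattern, i, rng), shape, t)
                for _ in range(samples))
    return 100.0 * found / samples


pattern = "ACGT" * 12 + "AC"      # |P| = 50
t = 50 - 3 + 1 - 3 * 3            # q-gram Lemma threshold for ###, k = 3
for i in (0, 3, 6, 9, 50):
    print(i, recognition_rate(pattern, i, "###", t, samples=200))
```

With i at most k the rate stays at 100% (guarantee zone), and with every position changed it drops to 0%; the values in between are the experimental picture of the heuristic zone for this filter.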
[Figure: recognition rate (0% to 100%) against the number of errors, measured by sampling at each error level; the value of |P| and the number of samples are missing from the slide text. Curves: contiguous (k=3, q=11; k=4, q=9); gapped, edit (k=3, q=11; k=4, q=11; k=5, q=10); BLAST; a further curve labeled k=4, q=10]
[Figure: recognition rate (50% to 100%) against the number of errors, measured by sampling at each error level; the value of |P| and the number of samples are missing from the slide text. Curve labels: contiguous (k=3, q=11; k=4, q=9); gapped, edit (k=3, q=11; k=4, q=11; k=5, q=10); BLAST; further labels k=3, q=11 and k=4, q=10]
Conclusion - Future Work
Our Work:
Significant sensitivity improvement over existing filters
Required modifications easy to implement
Methods for describing filter properties
Future Work:
Combination of 'orthogonal' shapes into one filter
Use of word neighborhoods
Database of filter properties for good shapes