Filter Algorithms for Approximate String Matching Stefan Burkhardt.

1 Filter Algorithms for Approximate String Matching Stefan Burkhardt

2 Outline
- Motivation
- Filter Algorithms
- Gapped q-grams
- Experimental Analysis

3 Motivation
Section outline: Problems and Motivation (Why?, Approximate String Matching, Edit and Hamming Distance)
Computational Biology:
- EST Clustering
- Assembly
- Genome comparison (e.g. Human/Mouse)
Information Retrieval:
- Phonebooks
- Dictionaries
- Search Engines
Many more…

4 The global approximate string matching problem
Given a pattern P, a target S, an error level k and a string distance d(x,y): find all substrings y of S with d(P, y) ≤ k.
Example: P = GAT, S = ACTGATAACGTTAGCCATGG

5 The global approximate string matching problem
- d(x,y) = Hamming Distance: the k-mismatches problem
- d(x,y) = Edit Distance: the k-differences problem
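
The two problem variants can be sketched in a few lines of Python. This is a naive illustration, not a method from the talk: the function names are made up for this example, the k-mismatches search simply compares every window, and the k-differences search uses the standard dynamic-programming recurrence in which a match may start at any position of S.

```python
def hamming(x, y):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))


def k_mismatches(P, S, k):
    """Start positions i with Hamming distance d(P, S[i:i+|P|]) <= k."""
    m = len(P)
    return [i for i in range(len(S) - m + 1) if hamming(P, S[i:i + m]) <= k]


def k_differences(P, S, k):
    """End positions j (exclusive) of substrings y of S with edit distance d(P, y) <= k."""
    m = len(P)
    prev = list(range(m + 1))      # column for the empty prefix of S
    ends = []
    for j, c in enumerate(S, start=1):
        cur = [0] * (m + 1)        # row 0 stays 0: a match may start anywhere in S
        for i in range(1, m + 1):
            cur[i] = min(prev[i - 1] + (P[i - 1] != c),  # match or substitution
                         prev[i] + 1,                    # extra character in S
                         cur[i - 1] + 1)                 # P[i-1] has no partner in S
        if cur[m] <= k:
            ends.append(j)
        prev = cur
    return ends


P, S = "GAT", "ACTGATAACGTTAGCCATGG"
print(k_mismatches(P, S, 1))   # start positions with at most one mismatch
print(k_differences(P, S, 1))  # end positions with at most one difference
```

Both functions touch every position of S, which is exactly the cost that filter algorithms try to avoid.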

6 Filter Algorithms
Section outline: How?, BLAST, The q-gram Lemma and QUASAR
[Diagram: the filter algorithm (filtration phase) applies the filter criterion to P and S and outputs potential matches; the exact algorithm (verification phase) examines the potential matches and separates true matches from false matches]

7 BLAST (Altschul, Karlin, et al.):
- Sequential scan of S locates all matching q-grams with P
- Iterative extension with cutoff to find good matches
Problem for high similarity:
- sequential scan quite time consuming
- single q-grams unspecific

8 Indexed Filter Algorithm
[Diagram: S is preprocessed into an index; the indexed filter algorithm uses P and the index to output potential matches; the exact algorithm (verification phase) examines the potential matches and separates true matches from false matches]

9 Indexed Filter Algorithm
Pro:
- potentially faster evaluation of the filter criterion
Con:
- preprocessing time
- extra space required
- only good for some filter criteria

10 Indexed Filter Algorithm: QUASAR (Burkhardt, Rivals et al. 99)
- Filter Criterion: q-gram Lemma (Jokinen, Ukkonen 91)
- Index Structure: Lookup table (Jokinen, Ukkonen 91) with suffix array (Manber, Myers 90)
- Match Detection: overlapping rectangles in DP-Matrix

11 The q-gram Lemma (Jokinen, Ukkonen, 1991)
For a pattern P, a substring y of S and a value k, matches between P and y with at most k errors share at least t = |P| - q + 1 - kq substrings of length q (q-grams).
Example: |P| = 8, q = 3, total # of q-grams: |P| - q + 1 = 6
P = TCGATTAC has the q-grams TCG, CGA, GAT, ATT, TTA, TAC.
Each error can 'destroy' up to q matching q-grams => for k errors lose at most kq q-grams.
With one error (TCGAATAC) the q-grams GAT, ATT and TTA are lost; TCG, CGA and TAC remain.
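
The lemma can be checked on the slide's example with a short Python sketch. The helper names (`qgrams`, `lemma_threshold`, `shared_qgrams`) are chosen for this illustration only, and counting shared q-grams as a multiset intersection is one reasonable reading of "share", not necessarily the one used in the talk.

```python
from collections import Counter


def qgrams(s, q):
    """All overlapping substrings of length q, in order of occurrence."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def lemma_threshold(m, q, k):
    """t = |P| - q + 1 - kq from the q-gram lemma (m = |P|)."""
    return m - q + 1 - k * q


def shared_qgrams(x, y, q):
    """Number of q-grams common to x and y, counted with multiplicity."""
    common = Counter(qgrams(x, q)) & Counter(qgrams(y, q))
    return sum(common.values())


P, y = "TCGATTAC", "TCGAATAC"          # y is P with one substitution
print(qgrams(P, 3))                     # the 6 q-grams of P
print(lemma_threshold(len(P), 3, 1))    # lower bound on shared q-grams for k = 1
print(shared_qgrams(P, y, 3))           # q-grams actually shared
```

For this pair the number of shared q-grams meets the bound exactly, because the single error sits in the middle of P and hits q different q-grams.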

12 Match Detection (Jokinen, Ukkonen 91):
- overlapping rectangles of width 2|P| in DP-Matrix
- rectangle with at least t hits => potential match
[Figure: DP-Matrix of S against P with rectangles containing 3 hits, 2 hits and 1 hit; t = 3]

13 Match Detection (Jokinen, Ukkonen 91):
- overlapping rectangles of width 2|P| in DP-Matrix
- rectangle with at least t hits => potential match
QUASAR (Burkhardt, Rivals et al. 1999):
- wider rectangles efficient in practice (2048 for QUASAR)
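
A minimal Python sketch of the filtration phase with overlapping rectangles follows. It is written under several assumptions that are not from the talk: a plain dictionary stands in for the lookup table with suffix array, rectangles start every |P| columns and are 2|P| wide, and the names `build_index` and `candidate_blocks` are invented for this example. With this layout every window of length |P| lies inside at least one rectangle, so the sketch is meant for the Hamming distance case.

```python
from collections import defaultdict


def build_index(S, q):
    """Map each q-gram of S to the list of its start positions (stand-in index)."""
    index = defaultdict(list)
    for j in range(len(S) - q + 1):
        index[S[j:j + q]].append(j)
    return index


def candidate_blocks(P, S, q, k):
    """Start columns of rectangles [b, b + 2|P|) that contain at least t hits."""
    m = len(P)
    t = m - q + 1 - k * q
    if t <= 0:
        raise ValueError("t <= 0: the q-gram lemma gives no filter for these values")
    index = build_index(S, q)
    hits = defaultdict(int)
    for i in range(m - q + 1):
        for j in index.get(P[i:i + q], ()):
            b = j // m               # rectangle starting at column b * m
            hits[b] += 1
            if b > 0:
                hits[b - 1] += 1     # the previous rectangle overlaps this column too
    return sorted(b * m for b, count in hits.items() if count >= t)


P = "TCGATTAC"
S = "AAAAAAAAAA" + "TCGAATAC" + "GGGGGGGGGG"
print(candidate_blocks(P, S, q=3, k=1))  # rectangles that need verification
```

Only the reported rectangles would be handed to the verification phase; everything else in S is discarded without being compared to P character by character.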

14 QUASAR (Burkhardt, Rivals et al. 1999):
- BLAST for the verification of the potential matches
- wider rectangles as match regions
- Index is a combination of Lookup Table and Suffix Array
- used for EST-Clustering at the DKFZ in Heidelberg
- searches for EST-Clustering about 30 times faster than BLAST

15 Gapped q-grams
- A new (old?) idea
- Hamming Distance
- Finding good shapes

16 General idea:
- use gapped q-grams
- call arrangement of gaps the shape
Example: gapped 3-shape ##.# (# = match, . = don't care)
TCGATTAC contains the gapped q-grams TC.A, CG.T, GA.T, AT.A, TT.C
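
Extracting gapped q-grams for a given shape takes only a few lines of Python. The function name `gapped_qgrams` and the convention of writing don't-care positions as `.` in the output are assumptions made for this sketch, chosen to match the notation on the slide.

```python
def gapped_qgrams(s, shape):
    """Gapped q-grams of s for a shape such as '##.#' ('#' = match, '.' = don't care)."""
    span = len(shape)
    return ["".join(s[i + o] if c == "#" else "." for o, c in enumerate(shape))
            for i in range(len(s) - span + 1)]


print(gapped_qgrams("TCGATTAC", "##.#"))
print(gapped_qgrams("TCGATTAC", "###"))   # a contiguous shape gives ordinary q-grams
```

A string of length n has n - span + 1 gapped q-grams, where span is the length of the shape including its gaps.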

17 Previous work...
- Califano, Rigoutsos (1993)
- Pevzner, Waterman (1995)
- Lehtinen, Sutinen, Tarhio (1996)
- limited attention paid to choice of shapes
- no exact threshold for the general case given
Recently...
- Buhler (2001): Multiple Shapes
- Ma, Tromp, Li (2002): Pattern Hunter
- threshold t = 1

18 The Threshold t
Definition: t is the number of remaining q-grams in a worst-case placement of k errors (O = match, X = error)
classic 3-shape ###, k = 3: t = 0, no filter!
OOXOOXOOXOO
OOX OXO XOO OOX OXO XOO OOX OXO XOO
gapped 3-shape ##.#, k = 3: t = 1
OOOXXOOXOOO
OO.X OO.X OX.O XX.O XO.X OO.O OX.O XO.O

19 The Threshold t
Definition: t is the number of remaining q-grams in a worst-case placement of k errors
classic 3-shape ###, k = 3: t = 0, no filter!
gapped 3-shape ##.#, k = 3: t = 1
- gapped shapes can have higher(!) thresholds t than ungapped shapes
- no simple formula for t
- we used a DP-based approach to compute t
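
The definition of t can be tried out with a brute-force Python sketch. This is not the DP-based approach mentioned on the slide: it simply enumerates every placement of k errors, which is only feasible for small |P| and k. The function name `threshold` and the parameter `m` (standing for |P|, here 11, the length of the example strings) are assumptions of this illustration.

```python
from itertools import combinations


def threshold(shape, m, k):
    """Fewest q-grams of the given shape that survive any placement of k errors in m positions."""
    care = [o for o, c in enumerate(shape) if c == "#"]   # offsets that must match
    starts = range(m - len(shape) + 1)
    worst = len(starts)
    for errors in combinations(range(m), k):
        bad = set(errors)
        surviving = sum(all(s + o not in bad for o in care) for s in starts)
        worst = min(worst, surviving)
    return worst


print(threshold("###", 11, 3))    # classic 3-shape
print(threshold("##.#", 11, 3))   # gapped 3-shape
```

The enumeration visits C(|P|, k) placements, so it is useful for checking small examples but not for searching over all shapes with |P| = 50.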

20 [Diagram: # of q-gram hits and filtration time (low to high) against # of potential matches and verification time (low to high), with q going from low to high; a tradeoff line separates good filters from bad filters]

21 Finding good shapes
[Diagram: # of q-gram hits against # of potential matches (low to high), with q going from low to high; a tradeoff line separates good filters from bad filters. Annotation on # of q-gram hits: |S| · 1/|Σ|^q ?]

22 Finding good shapes
For |P| = 13, k = 3 and q = 3 the shapes ##.# and ### both have a threshold of t = 2. However, the gapped shape returns fewer potential matches.
Reason: two overlapping occurrences of ##.# cover at least 5 characters, two overlapping occurrences of ### only 4. A random match requires 5 matching characters instead of only 4 for the ungapped q-gram. This makes random matches less likely.

23 Finding good shapes
We define the minimum coverage c_m as the minimum number of matching characters for any distinct arrangement of t matching shapes in P and S.
For t = 2 and the shape ##.# the minimum coverage is 5.
Example: CGACGATTGAT and ACTCGATTAGA agree in 5 consecutive characters (CGATT), which is enough for 2 hits of ##.#
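
The minimum coverage can also be computed by brute force for small cases. The Python sketch below assumes the Hamming distance setting, where all t occurrences of the shape lie on the same diagonal, so an arrangement is just a set of t distinct start positions in P; the function name `minimum_coverage` and the parameter `m` (for |P|) are chosen for this illustration.

```python
from itertools import combinations


def minimum_coverage(shape, t, m):
    """Fewest characters covered by any t distinct placements of the shape in m positions."""
    care = [o for o, c in enumerate(shape) if c == "#"]   # offsets that must match
    starts = range(m - len(shape) + 1)
    return min(len({s + o for s in chosen for o in care})
               for chosen in combinations(starts, t))


print(minimum_coverage("##.#", 2, 13))  # gapped 3-shape
print(minimum_coverage("###", 2, 13))   # contiguous 3-shape
```

Because a larger c_m means more characters must agree by chance, the value gives a quick way to compare shapes that share the same threshold t.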

24 Finding good shapes
[Diagram: # of q-gram hits, annotated |S| · 1/|Σ|^q with q going from low to high, against # of potential matches, annotated |S| · 1/|Σ|^(c_m) with c_m going from low to high; a tradeoff line separates good filters from bad filters]

25 Compute t and minimum coverage for all shapes with |P| = 50 and k = 3, 4, 5, 6
[Plot: number of shapes (0 to 600) with given minimum coverage (8 to 22) for k = 5, q = 8; curves for t = 1 to t = 5; the median, contiguous and best shapes are marked]

26 Experimental Analysis
- Speed and Filtration Efficiency
- The Heuristic Zone

27 Experimental Analysis
Section outline: A few different Filters, Speed and Filtration Efficiency, The Heuristic Zone
[Plot: hits (2^12 to 2^22) and matches (2^-8 to 2^16) for contiguous q-grams (q = 6 to 12) and for gapped shapes under Hamming distance (minimum coverage 8 to 24); k = 5, |P| = 50, |S| = 50 Mbp]

28 From Hits to Matches: Describing Filter Properties
Filters usually have 3 'recognition zones' depending on k:
1. Guarantee zone (finds all approximate matches)
2. Heuristic zone (finds some of the approximate matches)
3. Negative zone (guaranteed not to find matches)
[Plot: recognition rate (0% to 100%) against number of errors (0 to |P|)]

29 From Hits to Matches: Describing Filter Properties
Filters usually have 3 'recognition zones' depending on k:
1. Guarantee zone (finds all approximate matches)
2. Heuristic zone (finds some of the approximate matches)
3. Negative zone (guaranteed not to find matches)
[Plot: recognition rate (0% to 100%) against number of errors (0 to |P|), with k marked on the errors axis]

30 From Hits to Matches: Describing Filter Properties
Filters usually have 3 'recognition zones' depending on k:
1. Guarantee zone (finds all approximate matches)
2. Heuristic zone (finds some of the approximate matches)
3. Negative zone (guaranteed not to find matches)
[Plot: recognition rate (0% to 100%) against number of errors (0 to |P|), with k and |P| - c_m marked on the errors axis]

32 Heuristic Zone
Problem: Behaviour in the Heuristic Zone hard to predict
[Plot: recognition rate (0% to 100%) against number of errors (0 to |P|), with the heuristic zone between k and |P| - c_m highlighted]

33 A simple idea: Sampling!
For a value i:
1. Generate s sample strings with i random errors each
2. Run a filter algorithm on these samples
3. Record how many strings were recognized (in percent)
This allows an experimental evaluation of the Heuristic Zone
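
The three sampling steps can be sketched in Python as follows. The sketch makes assumptions that are not stated on the slide: errors are substitutions at i distinct random positions over the alphabet ACGT, the filter counts hits of one shape on the main diagonal only (Hamming distance), and the names `mutate`, `recognized` and `recognition_rate` are invented for this example.

```python
import random


def mutate(P, i, rng, alphabet="ACGT"):
    """Copy of P with substitutions at i distinct random positions."""
    y = list(P)
    for pos in rng.sample(range(len(P)), i):
        y[pos] = rng.choice([c for c in alphabet if c != P[pos]])
    return "".join(y)


def recognized(P, y, shape, t):
    """True if P and y share at least t q-grams of the shape at equal positions."""
    care = [o for o, c in enumerate(shape) if c == "#"]
    hits = sum(all(P[s + o] == y[s + o] for o in care)
               for s in range(len(P) - len(shape) + 1))
    return hits >= t


def recognition_rate(P, shape, t, i, s=1000, seed=0):
    """Percentage of s samples with i errors each that the filter recognizes."""
    rng = random.Random(seed)
    found = sum(recognized(P, mutate(P, i, rng), shape, t) for _ in range(s))
    return 100.0 * found / s


P = "".join(random.Random(1).choice("ACGT") for _ in range(50))
for i in (0, 5, 10, 20):
    print(i, recognition_rate(P, "##.#", 2, i))
```

Plotting the rate for every error level i from 0 to |P| gives a recognition curve of the kind shown for the different filters.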

34 [Plot: recognition rate (0% to 100%) against number of errors (0 to 30); |P| = 50, 1000 samples for each error level. Series: contiguous. Curve labels: k=3, q=11; k=4, q=9]

35 [Plot: recognition rate (0% to 100%) against number of errors (0 to 30); |P| = 50, 1000 samples for each error level. Series: contiguous; gapped, edit. Curve labels: k=4, q=9; k=3, q=11; k=5, q=10; k=4, q=11; k=3, q=11]

36 [Plot: recognition rate (0% to 100%) against number of errors (0 to 30); |P| = 50, 1000 samples for each error level. Series: contiguous; gapped, edit; BLAST. Curve labels: k=4, q=9; k=3, q=11; k=5, q=10; k=4, q=11; k=3, q=11; k=4, q=10]

38 [Plot, zoomed in: recognition rate (50% to 100%) against number of errors (0 to 15); |P| = 50, 1000 samples for each error level. Series: contiguous; gapped, edit; BLAST. Curve labels: k=4, q=9; k=3, q=11; k=3, q=11; k=4, q=11; k=5, q=10; k=3, q=11; k=4, q=10]

39 Conclusion - Future Work
Our Work:
- Significant sensitivity improvement over existing filters
- Required modifications easy to implement
- Methods for describing filter properties
Future Work:
- Combination of 'orthogonal' shapes into one filter
- Use of word neighborhoods
- Database of filter properties for good shapes

