Better Filtering with Gapped q-grams. S. Burkhardt, Center for Bioinformatics, Saarbrücken; J. Kärkkäinen, Max-Planck-Institut für Informatik, Saarbrücken.


1 Better Filtering with Gapped q-grams. S. Burkhardt, Center for Bioinformatics, Saarbrücken; J. Kärkkäinen, Max-Planck-Institut für Informatik, Saarbrücken

2 Outline: Motivation; The "classic" q-gram Lemma; q-shapes; Measuring filter quality/speed; Experimental results; Conclusion

3 The k-mismatches problem: For a pattern P, a string S, and a value k, find all occurrences of P in S with at most k character replacements.
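As a baseline for the problem statement above, here is a minimal brute-force sketch that checks every alignment of P against S. It is not the filter algorithm from the talk; the function name `k_mismatch_occurrences` is an assumption made for illustration.

```python
def k_mismatch_occurrences(pattern, text, k):
    """Return all start positions where pattern occurs in text
    with at most k character replacements (Hamming distance <= k)."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(pattern, text[start:start + m]):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # this alignment already has too many errors
        if mismatches <= k:
            hits.append(start)
    return hits
```

This takes O(|P| · |S|) time in the worst case, which is the cost a filter stage tries to avoid by discarding most alignments cheaply.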

4 Filter Algorithms
Filtration stage: examine S with a filter criterion and return areas with potential matches.
Verification stage: verify which areas contain true matches.

5 Example: find occurrences of P with at most k errors, k = 1
Pattern P: A C T C
String S: G C A T T C G A T G G A C T G G A C T A G T G A T T G A G T

6 The q-gram Lemma: For a pattern P, a string S, and a value k, every match of P in S with at most k errors shares at least t = |P| - q + 1 - kq substrings of length q (q-grams) with P. P contains |P| - q + 1 q-grams, and each error destroys at most q of them.

7 Example with q = 3, k = 1
Pattern P: T C G A T T A C, |P| = 8
q-grams of P: TCG, CGA, GAT, ATT, TTA, TAC
Number of q-grams: |P| - q + 1 = 6
String S: G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T
With k errors, a match has at least t = |P| - q + 1 - qk common q-grams within |P| letters: t = 8 - 3 + 1 - 3·1 = 3

8 In the DP matrix, one can count the number of matching q-grams per diagonal
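The per-diagonal counting can be sketched as follows. This is a minimal illustration for contiguous q-grams, assuming a simple dictionary index over the q-grams of P; the function names are hypothetical and the talk does not specify its implementation. A q-gram at position i in P and position j in S lies on diagonal j - i.

```python
from collections import defaultdict

def qgram_hits_per_diagonal(pattern, text, q):
    """Count matching q-grams per diagonal d = j - i, where i is the
    q-gram's position in pattern and j its position in text."""
    index = defaultdict(list)
    for i in range(len(pattern) - q + 1):
        index[pattern[i:i + q]].append(i)
    counts = defaultdict(int)
    for j in range(len(text) - q + 1):
        for i in index.get(text[j:j + q], ()):
            counts[j - i] += 1
    return dict(counts)

def candidate_diagonals(pattern, text, q, k):
    """Return diagonals with at least t = |P| - q + 1 - kq matching q-grams."""
    t = len(pattern) - q + 1 - k * q
    if t < 1:
        raise ValueError("threshold t < 1: the q-gram filter cannot exclude anything")
    counts = qgram_hits_per_diagonal(pattern, text, q)
    return sorted(d for d, c in counts.items() if c >= t)
```

Only the diagonals returned by `candidate_diagonals` need to be passed to the verification stage.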

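9 General idea: use substrings with gaps (q-shapes) and compute the correct threshold t. The total length s of a shape is called its span.
Example with |Q| = 11, k = 3 (O = match, X = mismatch):
3-gram ###: alignment OOXOOXOOXOO gives OOX OXO XOO OOX OXO XOO OOX OXO XOO. No q-gram matches completely, so t = 0: no filter!
3-shape ##.# (s = 4, 1 gap): alignment OOOXXOOXOOO gives OO.X OO.X OX.O XX.O XO.X OO.O OX.O XO.O. One shape (OO.O) matches completely, so t = 1.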

10 Judging the quality of q-shapes I
We developed a DP-based approach for computing the threshold t given a q-shape and a query length |P|.
Observation: the threshold t is not the only factor that influences the behaviour of a q-shape.
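To make the notion of a threshold concrete, the sketch below computes t by brute force: it tries every placement of k mismatches and keeps the smallest number of fully matching shape positions. This is not the DP-based approach mentioned on the slide, whose details the transcript does not give; the function name and the shape notation (`#` for a sampled position, `.` for a gap) are assumptions. Its running time grows with the number of mismatch placements, so it is only practical for small |P| and k.

```python
from itertools import combinations

def threshold(shape, m, k):
    """Minimum number of shape placements that match completely when a
    pattern of length m is aligned with exactly k mismatches.
    Adding mismatches never increases the count, so this is also the
    minimum over all alignments with at most k mismatches."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    starts = range(m - len(shape) + 1)
    best = None
    for mismatch_positions in combinations(range(m), k):
        bad = set(mismatch_positions)
        hits = sum(all(s + o not in bad for o in offsets) for s in starts)
        best = hits if best is None else min(best, hits)
    return best
```

For the values on slide 9 (length 11, k = 3), `threshold("###", 11, 3)` returns 0 and `threshold("##.#", 11, 3)` returns 1.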

11 Judging the quality of q-shapes II
We define the minimum coverage as the minimum number of matching characters for any arrangement of t matching q-shapes in P and a substring of length |P| in S.
For t = 2 and the 3-shape ##.#, the minimum coverage is 5.
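The definition above can be checked on small inputs with a brute-force sketch: choose every set of t distinct shape placements within a window of length m and count the positions they cover. The function name and the window length m are assumptions for illustration, and the enumeration is exponential in t, so this is not a method for shapes of realistic size.

```python
from itertools import combinations

def min_coverage(shape, t, m):
    """Smallest number of distinct positions covered by any t distinct
    placements of shape inside a window of length m.
    Raises ValueError if fewer than t placements fit into the window."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    starts = range(m - len(shape) + 1)
    return min(
        len({s + o for s in chosen for o in offsets})
        for chosen in combinations(starts, t)
    )
```

With a window of length 11, `min_coverage("##.#", 2, 11)` returns 5, in agreement with the slide: two placements of ##.# can share at most one position. For comparison, two overlapping contiguous 3-grams cover only 4 positions.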

12 Judging the quality of q-shapes III
The value q (i.e. the number of matching characters in a shape) determines the expected number of occurrences in a random string S.
3-shape: ##.#, alphabet Σ = {A, C, G, T}
Expected number of occurrences of a single 3-shape in S: occ = |S| · 1 / |Σ|^q
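The formula is simple enough to evaluate directly. The sketch below assumes a uniformly random string over an alphabet of size |Σ| and ignores boundary effects at the end of S; the function name is hypothetical.

```python
def expected_occurrences(text_length, alphabet_size, q):
    """Expected number of occurrences of one fixed q-shape in a uniformly
    random string: |S| * (1 / |Sigma|^q). Only q, the number of sampled
    positions, matters; the gaps of the shape do not enter the formula."""
    return text_length / alphabet_size ** q

# Example: a 50-million-character random DNA string and shapes with q = 9 and q = 12.
for q in (9, 12):
    print(q, expected_occurrences(50_000_000, 4, q))
```

Each additional sampled position divides the expected number of hits by |Σ|, which is why larger q makes the filter step faster.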

13 Judging the quality of q-shapes IV
The speed of the filter step is influenced by the expected number of matching q-shapes in S. The efficiency of the filtration correlates closely with the minimum coverage.
Speed: value of q
Efficiency: minimum coverage

14 Judging the quality of q-shapes V
Good shapes are not necessarily regular or predictable in their form.
Shapes with maximal minimum coverage for |Q| = 50, k = 5:
q=6 : ##......#..#..#.#
q=9 : ###..#..#.#...#.##
q=10: ###..#..#.#..###.#
q=11: #######.##.##
q=12: ###.#..###.#..###.#

15

16

17 Evaluating q-shapes
Experimental setup for q-shapes:
50 million character random (Bernoulli) string S
1000 random queries of length 500
queries have no approximate matches in S
compute threshold for |Q| = 50
actual value of |Q| is 500 (to reduce runtime of tests)
Experiments show 10x reduced filter efficiency; relative performance between shapes unaffected

18 Evaluating q-shapes
What we measured for every shape and all queries:
A) The total number of occurrences of all shapes: a good indicator of the total work for the filter phase
B) The number of diagonals containing at least t shapes: a good indicator of the filter efficiency
The experiments show a good correlation between A and the predicted values, as well as between B and the minimum coverage.

19

20

21

22 Our work: an analysis of q-grams with gaps (q-shapes)
Results include:
experimental evidence for their superiority when compared to standard q-grams
a method to roughly judge their quality, the minimum coverage
a way to calculate the parameters required to use them in a filter algorithm

23 Todo:
an algorithm to predict the best shapes
improve the quality measure for q-grams
extension to the k-differences problem (with insertions and deletions)
a thorough analysis of filter behaviour for > k differences (use as a heuristic filter)

