1
Better Filtering with Gapped q-grams
S. Burkhardt, Center for Bioinformatics, Saarbrücken
J. Kärkkäinen, Max-Planck Institut f. Informatik, Saarbrücken
2
Outline
Motivation
The `classic` q-gram Lemma
q-shapes
Measuring Filter quality/speed
Experimental Results
Conclusion
3
The k-mismatches problem
For a pattern P, a string S, and a value k: find all occurrences of P in S with at most k character replacements.
4
Filter Algorithms
Filtration stage: examine S with a filter criterion and return the areas with potential matches.
Verification stage: verify which of those areas contain true matches.
5
Example: find occurrences of P with at most k errors
Pattern P: A C T C
k = 1
String S: G C A T T C G A T G G A C T G G A C T A G T G A T T G A G T
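As a baseline for this example, a minimal brute-force sketch (not taken from the slides; the function name `k_mismatch_occurrences` is a hypothetical choice) compares P against every alignment in S and counts replacements. It does no filtering at all, which is exactly the work a filter algorithm tries to avoid.

```python
def k_mismatch_occurrences(pattern, text, k):
    """Return 0-based start positions where pattern matches text
    with at most k character replacements (Hamming distance <= k)."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(pattern, text[start:start + m]):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # this alignment already has too many errors
        if mismatches <= k:
            hits.append(start)
    return hits


# Pattern and string from the example slide, written without spaces.
P = "ACTC"
S = "GCATTCGATGGACTGGACTAGTGATTGAGT"
print(k_mismatch_occurrences(P, S, k=1))
```

For the pattern and string of this slide, the sketch reports the 0-based start positions 2, 11 and 16 (ATTC, ACTG and ACTA, each one replacement away from ACTC).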
6
The q-gram Lemma
For a pattern P, a string S, and a value k: a match of P in S with at most k errors contains at least t = |P| - q + 1 - kq of the substrings of length q (q-grams) of P.
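The bound can be checked on a small case. The sketch below is an illustration under the k-mismatches (replacement-only) setting; the helper names `qgram_threshold` and `common_qgrams` and the mutated window are assumptions made for the example, not part of the slides.

```python
def qgram_threshold(m, q, k):
    """Lower bound t = m - q + 1 - k*q from the q-gram lemma (may be <= 0)."""
    return m - q + 1 - k * q


def common_qgrams(pattern, window, q):
    """Count positions i where pattern and an equally long window
    agree on the whole q-gram starting at i."""
    return sum(
        1
        for i in range(len(pattern) - q + 1)
        if pattern[i:i + q] == window[i:i + q]
    )


pattern = "TCGATTAC"
window = "TCGAGTAC"  # hypothetical match with one replacement (T -> G)
t = qgram_threshold(len(pattern), q=3, k=1)
# A single replacement can destroy at most q of the |P| - q + 1 q-grams.
assert common_qgrams(pattern, window, q=3) >= t
```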
7
q = 3, k = 1
Pattern P: T C G A T T A C, |P| = 8
q-grams of P: TCG CGA GAT ATT TTA TAC
Number of q-grams: |P| - q + 1
String S: G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T
Error number k: at least t = |P| - q + 1 - (qk) common q-grams in |P| letters
|P| = 8 => t = 8-3+1-3 = 3
8
In the DP matrix, one can count the number of matching q-grams per diagonal
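A minimal sketch of this per-diagonal counting, followed by the verification stage, is given here for the k-mismatches setting. It replaces the DP matrix with a hash index of the pattern's q-grams, which is a swapped-in technique chosen for brevity; the function names `qgram_filter` and `verify` are hypothetical.

```python
from collections import defaultdict


def qgram_filter(pattern, text, q, k):
    """Filtration: return diagonals (start offsets in text) that share
    at least t = |P| - q + 1 - k*q q-grams with the pattern."""
    m = len(pattern)
    t = m - q + 1 - k * q
    if t <= 0:
        raise ValueError("threshold t <= 0: q-gram counting cannot filter")
    # Index every q-gram of the pattern by its start position.
    index = defaultdict(list)
    for i in range(m - q + 1):
        index[pattern[i:i + q]].append(i)
    # A q-gram at text position j and pattern position i lies on diagonal j - i.
    counts = defaultdict(int)
    for j in range(len(text) - q + 1):
        for i in index.get(text[j:j + q], ()):
            counts[j - i] += 1
    last_start = len(text) - m
    return sorted(d for d, c in counts.items() if c >= t and 0 <= d <= last_start)


def verify(pattern, text, candidates, k):
    """Verification: keep candidates with Hamming distance <= k to the pattern."""
    m = len(pattern)
    return [
        d for d in candidates
        if sum(a != b for a, b in zip(pattern, text[d:d + m])) <= k
    ]


P = "TCGATTAC"
S = "GCATTCGATGGACTGGACTAGTGAATCAGT"
candidates = qgram_filter(P, S, q=3, k=1)
print(candidates, verify(P, S, candidates, k=1))
```

With the pattern TCGATTAC and the 30-character string from the q-gram example, the filter passes only diagonal 4 (shared q-grams TCG, CGA, GAT), and verification then rejects it. This shows why the second stage is needed: the filter criterion is necessary but not sufficient.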
9
General idea
Use substrings with gaps (q-shapes) and compute the correct threshold t. The total length s of a shape is called its span.
Example with |Q| = 11, k = 3 (O = match, X = mismatch):
3-gram ###: t = 0, no filter!
OOXOOXOOXOO
OOX OXO XOO OOX OXO XOO OOX OXO XOO
3-shape ##.# (s = 4, 1 gap): t = 1
OOOXXOOXOOO
OO.X OX.O XX.O XO.X OO.O OX.O XO.O
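For small queries, the threshold of a shape can be found by exhaustive search: try every placement of k mismatches and take the smallest number of shape positions that stay fully matched. The sketch below does this; it is a brute-force illustration and not the method used in the talk, and the name `brute_force_threshold` is hypothetical. Its cost grows with the number of mismatch placements, so it is only practical for short queries.

```python
from itertools import combinations


def brute_force_threshold(shape, m, k):
    """Smallest number of error-free shape positions over all placements
    of k mismatches in a query of length m.

    shape is a string such as "##.#": '#' marks a compared position,
    '.' a gap. Fewer than k mismatches can only leave more positions
    intact, so checking exactly k is enough."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    span = len(shape)
    worst = None
    for errors in combinations(range(m), k):
        errors = set(errors)
        intact = sum(
            1
            for start in range(m - span + 1)
            if all(start + o not in errors for o in offsets)
        )
        worst = intact if worst is None else min(worst, intact)
    return worst


print(brute_force_threshold("###", 11, 3))   # contiguous 3-gram
print(brute_force_threshold("##.#", 11, 3))  # gapped 3-shape
```

For |Q| = 11 and k = 3 this reproduces the two thresholds on the slide: 0 for ### and 1 for ##.#.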
10
Judging the quality of q-shapes I
We developed a DP-based approach for computing the threshold t given a q-shape and a query length |P|.
Observation: the threshold t is not the only factor that influences the behaviour of a q-shape.
11
Judging the quality of q-shapes II
We define the minimum coverage as the minimum number of matching characters over all arrangements of t matching q-shapes between P and a substring of length |P| in S.
Example: for t = 2 and the 3-shape ##.#, the minimum coverage is 5.
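The definition can be made concrete with a small exhaustive search. The sketch assumes that an arrangement means t distinct start positions of the shape within a query of length m, and it ignores any further constraint from the error bound k. The function name `minimum_coverage` is hypothetical, and the search is only feasible for small m and t.

```python
from itertools import combinations


def minimum_coverage(shape, m, t):
    """Minimum number of distinct query positions covered by t shape
    occurrences at distinct start positions in a query of length m.

    Raises ValueError if fewer than t start positions exist."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    starts = range(m - len(shape) + 1)
    return min(
        len({s + o for s in chosen for o in offsets})
        for chosen in combinations(starts, t)
    )


print(minimum_coverage("##.#", 11, 2))
```

For ##.# and t = 2 the search returns 5, in agreement with the slide: any two distinct placements of the shape can overlap in at most one compared position.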
12
Judging the quality of q-shapes III
The value q (i.e. the number of matching characters in a shape) determines the expected number of occurrences in a random string S.
Example: 3-shape ##.# over the alphabet Σ = {A,C,G,T}
Expected number of occurrences of a single 3-shape in S: occ = |S| * 1 / |Σ|^q
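As a quick numeric illustration of this estimate, the sketch below evaluates occ = |S| / |Σ|^q for a uniform random string. The helper names are hypothetical, and the uniform-alphabet assumption is the one implied by the formula.

```python
def shape_weight(shape):
    """q of a shape: the number of compared ('#') positions."""
    return shape.count("#")


def expected_occurrences(text_length, q, alphabet_size=4):
    """Expected number of occurrences of one fixed q-shape instance
    in a uniform random string of the given length."""
    return text_length / alphabet_size ** q


q = shape_weight("##.#")
print(q, expected_occurrences(50_000_000, q))
```

Only q enters the estimate, not the span: the gap positions are not compared, so ##.# and ### have the same expected number of occurrences. Each additional compared position divides the expectation by the alphabet size, which is why larger q makes the filter step faster.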
13
Judging the quality of q-shapes IV
The speed of the filter step is influenced by the expected number of matching q-shapes in S.
The efficiency of the filtration correlates closely with the minimum coverage.
Speed: value of q
Efficiency: minimum coverage
14
Judging the quality of q-shapes V
Good shapes are not necessarily regular or predictable in their form.
Shapes with maximal minimum coverage for |Q| = 50, k = 5:
q=6 : ##......#..#..#.#
q=9 : ###..#..#.#...#.##
q=10: ###..#..#.#..###.#
q=11: #######.##.##
q=12: ###.#..###.#..###.#
17
Evaluating q-shapes
Experimental setup for q-shapes:
50 million character random (Bernoulli) string S
1000 random queries of length 500
queries have no approximate matches in S
compute threshold for |Q|=50
actual value of |Q| is 500! (to reduce runtime of tests)
Experiments show 10x reduced filter efficiency; relative performance between shapes unaffected
18
Evaluating q-shapes
What we measured for every shape and all queries:
A) The total number of occurrences of all shapes. Good indicator of the total work for the filter phase.
B) The number of diagonals containing at least t shapes. Good indicator of the filter efficiency.
The experiments show a good correlation between A and the predicted values as well as between B and the minimum coverage.
22
Our work…
An analysis of q-grams with gaps (q-shapes). Results include:
experimental evidence for their superiority when compared to standard q-grams
a method to roughly judge their quality, the minimum coverage
a way to calculate the parameters required to use them in a filter algorithm
23
Todo…
an algorithm to predict the best shapes
improve the quality measure for q-grams
extension to the k-differences problem (with insertions and deletions)
a thorough analysis of filter behaviour for > k differences (use as a heuristic filter)