Q-gram Based Database Searching Using A Suffix Array (QUASAR) S. Burkhardt A. Crauser H-P. Lenhof Max-Planck Institut f. Informatik, SaarbrückenDeutsches.

1 q-gram Based Database Searching Using A Suffix Array (QUASAR)
S. Burkhardt, A. Crauser, H-P. Lenhof
Max-Planck Institut f. Informatik, Saarbrücken
Deutsches Krebsforschungszentrum, Heidelberg
E. Rivals, P. Ferragina, M. Vingron

2 Outline
• Existing Work
• Motivation
• Problem
• Algorithm
• Results

3
• Examples: BLAST, FASTA
• Linear Scan (No Index)
• Good Sensitivity

4
• Today: New Applications
• Examples: EST-Clustering, Large Scale Shotgun Assembly
• Low Sensitivity
• Multiple Searches
• Specialized Algorithms Needed

5
Pattern P: T C G A T T A C A G T G A A T
Database D: G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T
• Local Alignment, minimum Length w (w = 8)
• Low Error Rate (<10% Edit Distance)

6
• Filter Step: Identify Hotspots
• Scan Step: Scan Hotspots with BLAST

7 q-gram Filtration
Steps: q-gram Filtration, Block Addressing, Suffix Array, Window Shifting
Database D: G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T
Pattern P: T C G A T T A C A G T G A A T
Window (w = 8): T C G A T T A C
q-grams of the window (q = 3): TCG, CGA, GAT, ATT, TTA, TAC
• Number of q-grams: |P| - q + 1
• Edit Distance e: at least t = |P| - q + 1 - (qe) common q-grams
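The q-gram count and the threshold t on this slide can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function names `qgrams` and `threshold` are illustrative and do not come from the slides, and the bound is applied to the 8-character window rather than to the whole pattern.

```python
def qgrams(s, q):
    """Return all overlapping substrings of length q, left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def threshold(length, q, e):
    """Lower bound on shared q-grams: t = length - q + 1 - q*e."""
    return length - q + 1 - q * e


window = "TCGATTAC"  # the w = 8 window shown on the slide

print(qgrams(window, 3))             # ['TCG', 'CGA', 'GAT', 'ATT', 'TTA', 'TAC']
print(threshold(len(window), 3, 1))  # 3
```

With q = 3 and e = 1 the window has 6 q-grams and a single edit can destroy at most 3 of them, which gives the t = 3 used later in the deck.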

8 Block Addressing
Database D: G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T
Window: T C G A T T A C
How to find the matching q-grams?
• Divide D into Blocks
• Count matching q-grams per Block
• Scan Blocks with counter ≥ t
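The block counting step can be sketched as follows. Several things here are assumptions made for illustration: the block size of 10, the non-overlapping block layout, the function name `block_counters`, and the plain dictionary that stands in for the suffix array introduced on the next slide.

```python
from collections import defaultdict


def block_counters(database, window, q, block_size):
    """Count, per database block, how many q-gram hits the window produces."""
    # Hypothetical stand-in index: q-gram -> start positions in the database.
    positions = defaultdict(list)
    for i in range(len(database) - q + 1):
        positions[database[i:i + q]].append(i)

    n_blocks = (len(database) + block_size - 1) // block_size
    counters = [0] * n_blocks
    for j in range(len(window) - q + 1):
        for pos in positions.get(window[j:j + q], []):
            counters[pos // block_size] += 1
    return counters


D = "GCATTCGATGGACTGGACTAGTGAATCAGT"
counters = block_counters(D, "TCGATTAC", q=3, block_size=10)
t = 3  # threshold for w = 8, q = 3, e = 1
hot_blocks = [b for b, c in enumerate(counters) if c >= t]

print(counters)    # [4, 0, 0]
print(hot_blocks)  # [0]
```

Only block 0 reaches the threshold, so only that block would be passed on to the scan step. A match that straddles a block boundary would be split between two counters in this simplified layout.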

9 Suffix Array
Database D: G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T
Window: T C G A T T A C
• Sorted List of Pointers to Suffixes, O(log |D|) Access Time
• Precompute Searches for q-grams, O(1) Time Access
Precomputed table entries shown on the slide:
AAA : 0   AAC : 0   AAG : 0   AAT : 0
ACA : 1   ACC : 1   ACG : 1   ACT : 1
AGA : 3   AGC : 3   AGG : 3   AGT : 3
ATA : 4   ATC : 4   ATG : 4   ATT : 5
...
TGA : 26  TGC : 27  TGG : 27  TGT : 29
TTA : 29  TTC : 29  TTG : 30  TTT : 30
[Figure: suffix array with indices 0 to 29; pointer values shown: 23, 16, 11, 3]
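A minimal sketch of the two structures named on this slide, assuming 0-based positions and a naive sort-based construction. The function names are illustrative, and the index conventions may differ from those used for the table on the slide.

```python
from itertools import product


def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lookup(text, sa, q, alphabet="ACGT"):
    """For each q-gram, store the first suffix-array index whose suffix
    prefix of length q is not smaller than that q-gram."""
    table = {}
    idx = 0
    for gram in ("".join(p) for p in product(alphabet, repeat=q)):
        while idx < len(sa) and text[sa[idx]:sa[idx] + q] < gram:
            idx += 1
        table[gram] = idx
    return table


def occurrences(gram, text, sa, table):
    """Jump to the precomputed start index and read off consecutive matches."""
    hits = []
    i = table[gram]
    while i < len(sa) and text[sa[i]:sa[i] + len(gram)] == gram:
        hits.append(sa[i])
        i += 1
    return sorted(hits)


D = "GCATTCGATGGACTGGACTAGTGAATCAGT"
sa = build_suffix_array(D)
table = build_lookup(D, sa, q=3)

print(occurrences("ACT", D, sa, table))  # [11, 16]
print(occurrences("TTA", D, sa, table))  # []
```

The table turns the binary search over the suffix array into a single dictionary access per q-gram, at the cost of one entry for each of the 4^q possible q-grams.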

10 Window Shifting
Database D: G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T
Pattern P: T C G A T T A C A G T G A A T
Current window: T C G A T T A C
Parameters: q = 3, w = 8, e = 1, t = 3
• Move Window over Query
• Mark full Blocks for each Window
• Scan Marked Blocks
[Figure: per-block q-gram counters]
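The window shifting step can be sketched as an incremental counter update: each shift adds the hits of the q-gram that enters the window and subtracts those of the q-gram that leaves it. The block size of 10, the non-overlapping blocks, the dictionary index, and the name `mark_blocks` are assumptions made for this illustration and are not taken from the slides.

```python
from collections import defaultdict


def mark_blocks(database, pattern, q, w, e, block_size):
    """Slide a length-w window over the pattern and return every block whose
    counter reaches t = w - q + 1 - q*e for at least one window."""
    t = w - q + 1 - q * e

    # Hypothetical stand-in index: q-gram -> block number of each occurrence.
    index = defaultdict(list)
    for i in range(len(database) - q + 1):
        index[database[i:i + q]].append(i // block_size)

    n_blocks = (len(database) + block_size - 1) // block_size
    counters = [0] * n_blocks
    marked = set()

    grams = [pattern[j:j + q] for j in range(len(pattern) - q + 1)]
    per_window = w - q + 1  # number of q-grams inside one window

    for j, gram in enumerate(grams):
        for b in index.get(gram, ()):       # q-gram entering the window
            counters[b] += 1
        if j >= per_window:                 # q-gram leaving the window
            for b in index.get(grams[j - per_window], ()):
                counters[b] -= 1
        if j >= per_window - 1:             # window is complete
            marked.update(b for b, c in enumerate(counters) if c >= t)
    return sorted(marked)


D = "GCATTCGATGGACTGGACTAGTGAATCAGT"
P = "TCGATTACAGTGAAT"
print(mark_blocks(D, P, q=3, w=8, e=1, block_size=10))  # [0, 2]
```

Block 0 is marked by the first windows of P (around TCGAT), and block 2 by the last ones (around AGTGAAT). Only marked blocks would then be scanned.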

11
• Influence of the Block Size
• Sensitivity
• Running Times
• Overhead for loading the Index
Benchmark System: Ultra Sparc Processor, 333 MHz, 4 GB RAM

12 Influence of Block Size

13 Sensitivity
• 1000 Queries
• BLAST Cutoff E = 0.00001
• Number of identical hitlists: Mouse EST DB: 91.4%, Human EST DB: 97.1%
• QUASAR finds many Hits below selected Error Level

14 Running Times
• Test Parameters:
  - 6% Error
  - w = 50
  - q = 11
  - block size 2048
  - scan with BLAST
  - time averaged for 1000 queries
• ~30 times faster than BLAST

15 Overhead for Loading the Index
• 1000 queries
• Human EST DB, 280 Mbp
• BLAST Test Run: 5 seconds Load Time, 13,270 seconds Search Time
• QUASAR Test Run: 90 seconds Load Time, 380 seconds Search Time
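The speedup implied by these timings can be worked out directly. The calculation below reads the BLAST search time as thirteen thousand two hundred seventy seconds, which is an assumption about the number notation on the slide; the variable names are illustrative.

```python
# Timings for 1000 queries, in seconds.
blast_load, blast_search = 5, 13270   # search time read as 13,270 s (assumption)
quasar_load, quasar_search = 90, 380

search_speedup = blast_search / quasar_search
total_speedup = (blast_load + blast_search) / (quasar_load + quasar_search)

print(round(search_speedup, 1))  # 34.9
print(round(total_speedup, 1))   # 28.2
```

Under that reading, both ratios are in line with the "~30 times faster" figure on the previous slide, and the index load time is a noticeable share of the QUASAR total (90 of 470 seconds).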
