q-gram Based Database Searching Using A Suffix Array (QUASAR). S. Burkhardt, A. Crauser, H-P. Lenhof, E. Rivals, P. Ferragina, M. Vingron. Max-Planck Institut f. Informatik, Saarbrücken; Deutsches Krebsforschungszentrum, Heidelberg.
Outline: Existing Work, Motivation, Problem, Algorithm, Results.
Examples: BLAST, FASTA. Linear Scan (No Index). Good Sensitivity.
Today: New Applications. Examples: EST-Clustering, Large Scale Shotgun Assembly. Low Sensitivity. Multiple Searches. Specialized Algorithms Needed.
Pattern P: TCGATTACAGTGAAT. Database D: GCATTCGATGGACTGGACTAGTGAATCAGT. Local Alignment, minimum Length w (w = 8). Low Error Rate (<10% Edit Distance).
Filter Step: Identify Hotspots. Scan Step: Scan Hotspots with BLAST.
Algorithm: q-gram Filtration, Block Addressing, Suffix Array, Window Shifting. q-gram Filtration. Pattern P: TCGATTACAGTGAAT; Window (w = 8): TCGATTAC. q = 3; q-grams of the Window: TCG, CGA, GAT, ATT, TTA, TAC. Database D: GCATTCGATGGACTGGACTAGTGAATCAGT. # of q-grams: |P| - q + 1. Edit Distance e: at least t = |P| - q + 1 - qe common q-grams.
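The q-gram count and the threshold t can be checked on the slide's own window. This is a minimal sketch; the function names `qgrams` and `threshold` are illustrative and not taken from QUASAR.

```python
def qgrams(s, q):
    """Return all overlapping substrings of length q, left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def threshold(length, q, e):
    """q-gram lemma: a string of this length shares at least this many
    q-grams with any string within edit distance e of it."""
    return length - q + 1 - q * e


window = "TCGATTAC"  # first window of the pattern, w = 8
print(qgrams(window, 3))             # ['TCG', 'CGA', 'GAT', 'ATT', 'TTA', 'TAC']
print(threshold(len(window), 3, 1))  # 3
```

With w = 8, q = 3 and e = 1 the window has 6 q-grams, and one edit can destroy at most q = 3 of them, which leaves t = 3.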
Block Addressing. How to find the matching q-grams? Divide D into Blocks. Count matching q-grams per Block. Scan Blocks with a counter of at least t. Database D: GCATTCGATGGACTGGACTAGTGAATCAGT; Window: TCGATTAC.
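Per-block counting can be sketched as follows. The block size of 10 is an assumption chosen for this toy database (the slide does not state one), and the linear pass over D is a naive stand-in for an index lookup. Each matching position is credited to the block containing its start, so matches that straddle a block boundary are not treated specially here.

```python
def block_counters(db, window, q, block_size):
    """Count, per block of db, the positions whose q-gram also occurs in window."""
    wanted = {window[i:i + q] for i in range(len(window) - q + 1)}
    n_blocks = (len(db) + block_size - 1) // block_size
    counters = [0] * n_blocks
    for i in range(len(db) - q + 1):
        if db[i:i + q] in wanted:
            counters[i // block_size] += 1  # credit the block holding position i
    return counters


D = "GCATTCGATGGACTGGACTAGTGAATCAGT"
window = "TCGATTAC"
q, e, block_size = 3, 1, 10              # block_size is a hypothetical choice
t = len(window) - q + 1 - q * e          # threshold from the q-gram lemma

counters = block_counters(D, window, q, block_size)
candidates = [b for b, c in enumerate(counters) if c >= t]
print(counters)    # [4, 0, 0]
print(candidates)  # [0]
```

Only block 0 reaches the threshold, so only that block would be passed on to the scan step.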
Suffix Array. Precompute Searches for q-grams, O(1) Time Access: AAA : 0, AAC : 0, AAG : 0, AAT : 0, ACA : 1, ACC : 1, ACG : 1, ACT : 1, AGA : 3, AGC : 3, AGG : 3, AGT : 3, ATA : 4, ATC : 4, ATG : 4, ATT : 5, ..., TGA : 26, TGC : 27, TGG : 27, TGT : 29, TTA : 29, TTC : 29, TTG : 30, TTT : . Sorted List of Pointers to Suffixes, O(log |D|) Access Time. Database D: GCATTCGATGGACTGGACTAGTGAATCAGT; Window: TCGATTAC.
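The two structures on this slide can be sketched together: a sorted list of suffix pointers, and a table that stores the precomputed search result for every possible q-gram. This is an illustrative sketch, not QUASAR's implementation. The suffix array is built naively by sorting, and the table stores a (start, end) interval per q-gram rather than a single start index, which is an assumption made here to keep the lookup simple.

```python
from bisect import bisect_left, bisect_right
from itertools import product


def build_suffix_array(text):
    """Naive construction: sort suffix start positions by the suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_qgram_table(text, sa, q, alphabet="ACGT"):
    """Precompute, for every possible q-gram, its interval in the suffix array."""
    # Truncating sorted suffixes to q characters keeps them in sorted order.
    prefixes = [text[i:i + q] for i in sa]
    table = {}
    for letters in product(alphabet, repeat=q):
        g = "".join(letters)
        table[g] = (bisect_left(prefixes, g), bisect_right(prefixes, g))
    return table


def occurrences(sa, table, g):
    """Constant-time table lookup, then a slice of the suffix array."""
    lo, hi = table[g]
    return sa[lo:hi]


D = "GCATTCGATGGACTGGACTAGTGAATCAGT"
sa = build_suffix_array(D)
table = build_qgram_table(D, sa, 3)
print(sorted(occurrences(sa, table, "ACT")))  # [11, 16]
print(occurrences(sa, table, "TTA"))          # []
```

Building the table costs one binary search per q-gram up front; afterwards every q-gram of a query is resolved without searching.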
Window Shifting. q = 3, w = 8, e = 1, t = 3. Move Window over Query. Mark full Blocks for each Window. Scan Marked Blocks. Database D: GCATTCGATGGACTGGACTAGTGAATCAGT; Query P: TCGATTACAGTGAAT; first Window: TCGATTAC.
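The window shift can be done incrementally: when the window moves by one position, one q-gram enters and one leaves, so only their occurrences touch the block counters. The sketch below assumes a block size of 10 for this toy database and uses a plain dictionary as a hypothetical stand-in for the suffix array lookup. It counts every occurrence of a q-gram, which can over-mark blocks but cannot drop a block that reaches the threshold.

```python
from collections import defaultdict


def mark_blocks(db, query, q, w, e, block_size):
    """Return the blocks whose counter reaches t for at least one query window."""
    t = w - q + 1 - q * e
    # Hypothetical stand-in for the suffix array: q-gram -> blocks of its occurrences.
    index = defaultdict(list)
    for i in range(len(db) - q + 1):
        index[db[i:i + q]].append(i // block_size)

    counters = [0] * ((len(db) + block_size - 1) // block_size)
    grams = [query[i:i + q] for i in range(len(query) - q + 1)]
    per_window = w - q + 1  # number of q-grams in one window
    marked = set()
    for j, g in enumerate(grams):
        for b in index.get(g, ()):  # q-gram entering the window
            counters[b] += 1
        if j >= per_window:         # q-gram leaving the window
            for b in index.get(grams[j - per_window], ()):
                counters[b] -= 1
        if j >= per_window - 1:     # a full window is in place
            marked.update(b for b, c in enumerate(counters) if c >= t)
    return sorted(marked)


D = "GCATTCGATGGACTGGACTAGTGAATCAGT"
P = "TCGATTACAGTGAAT"
print(mark_blocks(D, P, q=3, w=8, e=1, block_size=10))  # [0, 2]
```

On this example the first windows mark block 0 (the TCGAT region) and the last windows mark block 2 (the AGTGAAT region); block 1 never reaches t = 3 and would not be scanned.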
Influence of the Block Size, Sensitivity, Running Times, Overhead for Loading the Index. Benchmark System: UltraSPARC Processor, 333 MHz, 4 GB RAM.
Influence of Block Size
Sensitivity. 1000 Queries. BLAST Cutoff E = . Number of identical hitlists: Mouse EST DB: 91.4 %, Human EST DB: 97.1 %. QUASAR finds many Hits below selected Error Level.
Running Times. Test Parameters: 6% Error; w = 50; q = 11; block size 2048; scan with BLAST; time averaged for 1000 queries. ~30 times faster than BLAST.
Overhead for Loading the Index. 1000 queries, Human EST DB, 280 Mbps. BLAST Test Run: 5 seconds Load Time, seconds Search Time. QUASAR Test Run: 90 seconds Load Time, 380 seconds Search Time.