Published by Dominic Perkins. Modified over 9 years ago.
1
q-gram Based Database Searching Using A Suffix Array (QUASAR)
S. Burkhardt, A. Crauser, H-P. Lenhof, E. Rivals, P. Ferragina, M. Vingron
Max-Planck Institut f. Informatik, Saarbrücken; Deutsches Krebsforschungszentrum, Heidelberg
2
Outline: Existing Work, Motivation, Problem, Algorithm, Results
3
Examples: BLAST, FASTA
Linear Scan (No Index)
Good Sensitivity
4
Today: New Applications
Examples: EST-Clustering, Large Scale Shotgun Assembly
Low Sensitivity, Multiple Searches
Specialized Algorithms Needed
5
Pattern P: TCGATTACAGTGAAT
Database D: GCATTCGATGGACTGGACTAGTGAATCAGT
Local Alignment, minimum Length w (w = 8)
Low Error Rate (<10% Edit Distance)
6
Filter Step: Identify Hotspots
Scan Step: Scan Hotspots with BLAST
7
q-gram Filtration
Algorithm steps: q-gram Filtration, Block Addressing, Suffix Array, Window Shifting
Query: TCGATTACAGTGAAT
Window (w = 8): TCGATTAC
q-grams of the window (q = 3): TCG, CGA, GAT, ATT, TTA, TAC
Database D: GCATTCGATGGACTGGACTAGTGAATCAGT
Number of q-grams: |P| - q + 1
Edit Distance e: at least t = |P| - q + 1 - q*e common q-grams
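The q-gram count and the threshold t can be checked with a few lines of Python. This is only a sketch: the function names `qgrams` and `threshold` are made up for illustration, and the formula is applied to the window length w = 8 rather than to the whole pattern, an assumption that reproduces the t = 3 given later for q = 3, e = 1.

```python
def qgrams(s: str, q: int) -> list[str]:
    """Return all overlapping substrings of length q, left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def threshold(length: int, q: int, e: int) -> int:
    """Minimum number of shared q-grams: t = length - q + 1 - q*e."""
    return length - q + 1 - q * e


window = "TCGATTAC"                    # first w = 8 characters of the query
grams = qgrams(window, 3)              # 8 - 3 + 1 = 6 q-grams
t = threshold(len(window), q=3, e=1)   # 8 - 3 + 1 - 3*1 = 3
```

Each edit operation can destroy at most q of the overlapping q-grams, which is where the q*e term comes from.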
8
Block Addressing
Database D: GCATTCGATGGACTGGACTAGTGAATCAGT (figure: position marks at 10, 20, 30, 40)
Window: TCGATTAC
Divide D into Blocks
Count matching q-grams per Block
Scan Blocks with counter >= t
How to find the matching q-grams?
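A minimal sketch of per-block counting, assuming a block size of 10 (suggested by the position marks, but not stated on the slide) and non-overlapping blocks. It finds the q-grams with a plain linear scan of D, not with the index, so it illustrates only the counters; `block_counters` is a hypothetical name.

```python
def block_counters(db: str, window: str, q: int, block_size: int) -> list[int]:
    """Count, per block of db, the q-gram positions that match a window q-gram."""
    wanted = {window[i:i + q] for i in range(len(window) - q + 1)}
    n_blocks = (len(db) + block_size - 1) // block_size
    counters = [0] * n_blocks
    for pos in range(len(db) - q + 1):
        if db[pos:pos + q] in wanted:
            # A hit is credited to the block that contains its start position.
            counters[pos // block_size] += 1
    return counters


D = "GCATTCGATGGACTGGACTAGTGAATCAGT"
counters = block_counters(D, "TCGATTAC", q=3, block_size=10)
t = 3  # threshold for w = 8, q = 3, e = 1
hotspots = [b for b, c in enumerate(counters) if c >= t]
```

On this toy input, only the first block collects enough hits (ATT, TCG, CGA, GAT) to reach the threshold.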
9
Suffix Array
Database D: GCATTCGATGGACTGGACTAGTGAATCAGT
Window: TCGATTAC
Precompute Searches for q-grams, O(1) Time Access
Lookup table (q-gram : suffix array index):
AAA : 0, AAC : 0, AAG : 0, AAT : 0, ACA : 1, ACC : 1, ACG : 1, ACT : 1, AGA : 3, AGC : 3, AGG : 3, AGT : 3, ATA : 4, ATC : 4, ATG : 4, ATT : 5, ...
TGA : 26, TGC : 27, TGG : 27, TGT : 29, TTA : 29, TTC : 29, TTG : 30, TTT : 30
Sorted List of Pointers to Suffixes, O(log |D|) Access Time
Suffix array indices: 0 to 29; pointers shown in the figure: 23, 16, 11, 3
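The two structures on this slide can be sketched as follows. The construction is deliberately naive (it sorts the suffixes directly, which is fine for a 30-character example but not for a real database), and the names `build_suffix_array`, `build_qgram_table` and `occurrences` are hypothetical. The table here stores a half-open range of suffix array indices per q-gram, which is one possible layout and not necessarily the one QUASAR uses.

```python
def build_suffix_array(db: str) -> list[int]:
    """Start positions of all suffixes of db, in lexicographic order."""
    return sorted(range(len(db)), key=lambda i: db[i:])


def build_qgram_table(db: str, sa: list[int], q: int) -> dict[str, tuple[int, int]]:
    """Map each q-gram to the half-open range [start, end) of its suffixes in sa."""
    table: dict[str, tuple[int, int]] = {}
    for rank, pos in enumerate(sa):
        gram = db[pos:pos + q]
        if len(gram) < q:
            continue  # suffix too short to start a q-gram
        start, _ = table.get(gram, (rank, rank))
        table[gram] = (start, rank + 1)
    return table


def occurrences(gram: str, sa: list[int], table: dict[str, tuple[int, int]]) -> list[int]:
    """All positions of gram in the database, via one table lookup."""
    start, end = table.get(gram, (0, 0))
    return sorted(sa[start:end])


D = "GCATTCGATGGACTGGACTAGTGAATCAGT"
sa = build_suffix_array(D)
table = build_qgram_table(D, sa, q=3)
tgg_positions = occurrences("TGG", sa, table)
```

For this D the suffix array starts with the pointers 23, 16, 11 and ends with 3, and the range for TGA starts at index 26.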
10
Window Shifting
Database D: GCATTCGATGGACTGGACTAGTGAATCAGT (figure: block counter values)
Query: TCGATTACAGTGAAT
Window: TCGATTAC
q = 3, w = 8, e = 1, t = 3
Move Window over Query
Mark full Blocks for each Window
Scan Marked Blocks
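The window-shifting loop can be sketched as below, under several assumptions that are not on the slide: a block size of 10, non-overlapping blocks, counters recomputed from scratch for every window (a real implementation would update them incrementally), and a plain dictionary standing in for the suffix array lookup. The name `mark_blocks` is hypothetical.

```python
def mark_blocks(db: str, query: str, q: int, w: int, e: int, block_size: int) -> list[int]:
    """Return the blocks whose counter reaches t for at least one query window."""
    t = w - q + 1 - q * e
    # Positions of every q-gram in db; a stand-in for the suffix array lookup.
    index: dict[str, list[int]] = {}
    for pos in range(len(db) - q + 1):
        index.setdefault(db[pos:pos + q], []).append(pos)

    n_blocks = (len(db) + block_size - 1) // block_size
    marked: set[int] = set()
    for start in range(len(query) - w + 1):       # move the window over the query
        counters = [0] * n_blocks
        for i in range(start, start + w - q + 1):  # q-grams inside this window
            for pos in index.get(query[i:i + q], ()):
                counters[pos // block_size] += 1
        marked.update(b for b, c in enumerate(counters) if c >= t)
    return sorted(marked)


D = "GCATTCGATGGACTGGACTAGTGAATCAGT"
marked = mark_blocks(D, "TCGATTACAGTGAAT", q=3, w=8, e=1, block_size=10)
```

With these parameters the first and the last block are marked, which correspond to the regions of D that resemble the start and the end of the query.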
11
Influence of the Block Size
Sensitivity
Running Times
Overhead for loading the Index
Benchmark System: Ultra Sparc Processor, 333 MHz, 4 GB RAM
12
Influence of Block Size
13
Sensitivity
1000 Queries, BLAST Cutoff E = 0.00001
Number of identical hitlists: Mouse EST DB: 91.4 %, Human EST DB: 97.1 %
QUASAR finds many Hits below selected Error Level
14
Running Times
Test Parameters: 6% Error; w = 50; q = 11; block size 2048; scan with BLAST; time averaged for 1000 queries
~30 times faster than BLAST
15
Overhead for Loading the Index
1000 queries, Human EST DB, 280 Mbps
BLAST Test Run: 5 seconds Load Time, 13,270 seconds Search Time
QUASAR Test Run: 90 seconds Load Time, 380 seconds Search Time
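The speedup implied by these timings can be worked out directly. The calculation below assumes that the BLAST search time is 13270 seconds, that is, that the dot in the original figure is a thousands separator; this reading is an assumption, chosen because it agrees with the "~30 times faster" stated on the previous slide.

```python
# Timings for 1000 queries on the Human EST DB, in seconds.
blast_load, blast_search = 5, 13270      # search time read as 13270 s (assumption)
quasar_load, quasar_search = 90, 380

# Speedup of the search phase alone.
search_speedup = blast_search / quasar_search

# Speedup when the one-time load of the database or index is included.
total_speedup = (blast_load + blast_search) / (quasar_load + quasar_search)

# Share of the QUASAR run that is spent loading the index.
quasar_load_share = quasar_load / (quasar_load + quasar_search)
```

Under that reading, the search phase is roughly 35 times faster and the whole run roughly 28 times faster, with about a fifth of the QUASAR run spent loading the index.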