BLAST benchmarks
  George Coulouris
  NCBI/NLM/NIH
  June 2005
Motivation and goal
  It’s hard to define what constitutes a “typical” search.
  NCBI BLAST processes over 150,000 searches per day.
  The large-scale characteristics of this workload are stable over time.
  Goal: Design a test suite that approximates this workload.
Applications
  Evaluate the relative performance of BLAST running on different hardware.
  Evaluate the relative performance of different BLAST implementations.
Components
  Databases
  Queries
  Tasks
  Driver
Databases
  Protein “nr” and nucleotide “nt” account for >80% of all searches, which makes them a good choice for representative databases.
  Sequences are constantly added and removed; the databases are updated daily.
  The volatility and large size of these databases make them unsuitable for benchmarking purposes.
Databases
  Solution: Generate benchmark databases from subsets of “nr” and “nt”.
  Non-redundant proteins are sampled from “nr”.
  The size ratio of the nucleotide to the protein database is preserved to avoid skewing runtime results.
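The ratio-preserving step can be illustrated with a short sketch. It assumes database size is measured in residues and bases; the counts are hypothetical placeholders, not the real sizes of “nr” or “nt”, and `scaled_nucleotide_size` is a name introduced here for illustration only.

```python
def scaled_nucleotide_size(full_protein: int, full_nucleotide: int,
                           sample_protein: int) -> int:
    """Return the nucleotide benchmark size (in bases) that keeps the
    nucleotide-to-protein size ratio of the full databases."""
    if full_protein <= 0 or sample_protein <= 0:
        raise ValueError("protein sizes must be positive")
    return round(sample_protein * full_nucleotide / full_protein)


# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
full_nr = 1_000_000_000    # residues in the full protein database
full_nt = 15_000_000_000   # bases in the full nucleotide database
bench_nr = 100_000_000     # residues sampled into the protein benchmark

bench_nt = scaled_nucleotide_size(full_nr, full_nt, bench_nr)
print(bench_nt)
```

With these placeholder numbers the full databases have a 15:1 ratio, so a protein benchmark of 100 million residues pairs with a nucleotide benchmark of 1.5 billion bases.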
Queries
  More than 90% of protein queries are shorter than 1000 residues.
  More than 90% of nucleotide queries are shorter than 2000 base pairs.
  The query set should cover the major model organisms.
  Solution: Sample 200 queries from refseq_rna and refseq_protein.
  The resulting set covers many organisms and has a typical length distribution.
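One way to draw such a query set and check its length profile is sketched below. This is an assumption about how the sampling could be done, not the procedure actually used: the sequence pool is synthetic, and `sample_queries` and `fraction_shorter_than` are hypothetical helper names.

```python
import random


def sample_queries(lengths_by_id, n, seed=0):
    """Pick n distinct sequence IDs, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(sorted(lengths_by_id), n)


def fraction_shorter_than(lengths, limit):
    """Fraction of the given lengths that fall below limit."""
    lengths = list(lengths)
    return sum(1 for length in lengths if length < limit) / len(lengths)


# Synthetic pool standing in for a sequence collection: ID -> length.
pool = {f"seq{i}": 100 + i for i in range(1000)}

picked = sample_queries(pool, 200, seed=1)
share = fraction_shorter_than((pool[seq_id] for seq_id in picked), 1000)
print(len(picked), share)
```

Fixing the seed keeps the query set identical from run to run, which matters when results from different machines or implementations are compared.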
Tasks
  Program distribution: blastn 50%, megablast 10%, blastp 20%, blastx 10%, tblastn 5%, tblastx 5%
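The distribution translates into a fixed number of searches per program once a total is chosen. The sketch below uses a total of 200 searches as an example; `searches_per_program` is a hypothetical helper, and handing any rounding leftover to the largest shares first is an assumption made here, not something the slides specify.

```python
PROGRAM_SHARE = {  # percentages from the program distribution
    "blastn": 50,
    "megablast": 10,
    "blastp": 20,
    "blastx": 10,
    "tblastn": 5,
    "tblastx": 5,
}


def searches_per_program(total, shares):
    """Split total searches across programs according to percentage shares."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must sum to 100")
    counts = {prog: total * pct // 100 for prog, pct in shares.items()}
    # Assumption: searches lost to rounding go to the largest shares first.
    leftover = total - sum(counts.values())
    for prog in sorted(shares, key=shares.get, reverse=True)[:leftover]:
        counts[prog] += 1
    return counts


print(searches_per_program(200, PROGRAM_SHARE))
```

For 200 searches the split is exact: 100 blastn, 20 megablast, 40 blastp, 20 blastx, 10 tblastn, and 10 tblastx.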
Driver script
  Executes 200 searches according to the program distribution above.
  Runs in 35 minutes on current hardware.
  Can be used to measure speed or throughput.
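A minimal sketch of the timing logic such a driver needs is shown below. It is not the actual driver script: `run_benchmark` and the `run_search` callback are hypothetical, and the callback stands in for whatever launches one BLAST search. For reference, 200 searches in 35 minutes is about 5.7 searches per minute.

```python
import time


def run_benchmark(tasks, run_search, clock=time.perf_counter):
    """Run every (program, query) task in order.

    Returns (elapsed_seconds, searches_per_minute).
    """
    start = clock()
    completed = 0
    for program, query in tasks:
        run_search(program, query)  # hypothetical hook that runs one search
        completed += 1
    elapsed = clock() - start
    per_minute = completed / (elapsed / 60) if elapsed > 0 else float("inf")
    return elapsed, per_minute


# Usage with a stand-in search function that does no real work.
demo_tasks = [("blastn", "query1"), ("blastp", "query2")]
elapsed, rate = run_benchmark(demo_tasks, lambda program, query: None)
print(f"{elapsed:.6f} s, {rate:.1f} searches/min")
```

Elapsed time for a serial run reflects speed, while searches per minute is the natural figure for throughput. Passing the clock in as a parameter makes the harness easy to test without running real searches.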
Sample results