Effectively mapping deep sequencing reads by BOAT (Basic Oligonucleotide Alignment Tool)
Gao, Ge, Center for Bioinformatics, Peking University
Next-generation deep sequencing platforms produce millions of short reads in one run
Platform: 454 Genome Sequencer FLX | Illumina/Solexa Genome Analyzer | SOLiD™ 3 Analyzer
Amplification: emPCR | BridgePCR
Read length: 400bp | 36bp-50bp | 50-60bp
Read number: >1M | 30M | 400M
Time: 10h | 2-3day | 3.5day
Bases: 400-600M | 1.3G | 20G
Sample: 16 | 8
Comparative genomics, genotyping. Goal: identify variations
Profiling: RNA-Seq, ChIP-Seq, Methy-Seq. Goal: measure significant peaks
[Figure: short reads tiled along the reference …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…, once for variant detection and once for peak measurement]
Those reads need to be mapped back to the reference genome effectively for further analysis
[Figure: millions of sequence reads]
So why do we need yet another mapping tool? (http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html)
Effectively handle (large) sequence variants during mapping
BOAT workflow: Genome and inputted reads → Initialization (based on hash table & bitmap index) → Seeding (by hybrid indexing schema) → Extension (based on prefix tree) → Refining alignment → Generate alignment & calculate E-value → Hits list
Basic idea: hybrid index by integrating hash and tree
A prefix tree enables effective detection of the longest common substring with mismatches (http://en.wikipedia.org/wiki/Trie)
Trigger a new alignment: “double-window hit”
Example read: TTTTTTTTTTT ACGTA AAAAAAAAAA ACGAT, with Seed1 = ACGTA and Seed2 = ACGAT
Either of the two indexed seeds can initialize a new alignment
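The double-window idea can be illustrated with a minimal sketch. It assumes a plain k-mer hash index and fixed seed offsets within the read; the function names `build_seed_index` and `candidate_starts`, the seed length, and the offsets are all hypothetical and are not taken from BOAT itself. The point is only that a mismatch inside one seed window does not prevent the other window from proposing the alignment.

```python
from collections import defaultdict

def build_seed_index(reference, k):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def candidate_starts(read, index, k, offsets):
    """Return reference positions where an alignment of the read could start.

    Each offset marks one seed window in the read. A hit from ANY window
    proposes a start, so one damaged seed does not lose the read.
    """
    starts = set()
    for off in offsets:
        seed = read[off:off + k]
        for pos in index.get(seed, ()):
            start = pos - off  # shift back to where the read would begin
            if start >= 0:
                starts.add(start)
    return sorted(starts)

# Toy reference built around the read shown on the slide (assumed layout).
reference = "GG" + "T" * 11 + "ACGTA" + "A" * 10 + "ACGAT" + "CC"
index = build_seed_index(reference, 5)

# Seed1 carries a mismatch (ACCTA instead of ACGTA); Seed2 is intact.
read = "T" * 11 + "ACCTA" + "A" * 10 + "ACGAT"
print(candidate_starts(read, index, 5, offsets=(11, 26)))
```

In this toy case the second window alone recovers the start position, which a single-seed scheme anchored on Seed1 would have missed.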
Extension of alignment by depth-first traversing the index tree
[Figure: index tree traversed depth-first during extension]
ACGTACAGTAAACATACGAT
|||||||||||| |||||||
ACGTACAGTAAAGATACGAT
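As a rough illustration of depth-first extension with mismatches, the sketch below stores fixed-length reference windows in a nested-dict trie and walks it with a mismatch budget. This is a simplified stand-in, not BOAT's actual index: the names `build_trie` and `search`, the `"$"` leaf marker, and the choice of which slide sequence plays the reference are all assumptions.

```python
def build_trie(reference, length):
    """Insert every window of the given length into a nested-dict trie.

    Leaves keep the list of reference positions under the key "$".
    """
    root = {}
    for pos in range(len(reference) - length + 1):
        node = root
        for base in reference[pos:pos + length]:
            node = node.setdefault(base, {})
        node.setdefault("$", []).append(pos)
    return root

def search(node, query, max_mismatches, depth=0, mismatches=0):
    """Depth-first traversal that tolerates up to max_mismatches substitutions.

    Returns (reference_position, mismatch_count) pairs. A branch is pruned
    as soon as its mismatch count exceeds the budget.
    """
    if depth == len(query):
        return [(pos, mismatches) for pos in node.get("$", [])]
    hits = []
    for base, child in node.items():
        if base == "$":
            continue
        cost = mismatches + (base != query[depth])
        if cost <= max_mismatches:
            hits.extend(search(child, query, max_mismatches, depth + 1, cost))
    return hits

# The two sequences from the slide differ by a single substitution.
reference = "GGG" + "ACGTACAGTAAAGATACGAT" + "TTT"
read = "ACGTACAGTAAACATACGAT"
trie = build_trie(reference, len(read))
print(search(trie, read, max_mismatches=1))
```

Because pruning happens per branch, the traversal only explores paths that can still end within the mismatch budget.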
Refining alignment by bounded dynamic programming
For each row i, only the cells between (i, i-k) and (i, i+k) are considered
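A minimal sketch of the banded idea follows, using unit-cost edit distance as a stand-in scoring scheme (the slide does not specify BOAT's scoring, so this is an assumption, as is the function name `banded_edit_distance`). Only cells (i, j) with |i - j| <= k are filled, which reduces the work from O(nm) to roughly O(nk).

```python
def banded_edit_distance(a, b, k):
    """Edit distance between a and b, filling only cells with |i - j| <= k.

    Returns None when the true distance is larger than k, since the band
    cannot represent such alignments.
    """
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None
    inf = float("inf")
    # Row 0 restricted to the band: aligning "" against b[:j] costs j.
    prev = {j: j for j in range(0, min(m, k) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                cur[j] = i  # aligning a[:i] against "" costs i
                continue
            substitute = prev.get(j - 1, inf) + (a[i - 1] != b[j - 1])
            delete = prev.get(j, inf) + 1      # cell above, may lie outside the band
            insert = cur.get(j - 1, inf) + 1   # cell to the left, may lie outside the band
            cur[j] = min(substitute, delete, insert)
        prev = cur
    distance = prev[m]
    return distance if distance <= k else None

# The two sequences from the extension slide, refined with a band of k = 2.
print(banded_edit_distance("ACGTACAGTAAACATACGAT", "ACGTACAGTAAAGATACGAT", 2))
```

Cells outside the band are treated as infinitely expensive, so any result within k is exact, while anything larger is reported as not found.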
BOAT showed a significantly better recall rate in evaluation
5,000,000 simulated reads were mapped to an original two-million-bp mouse chrX region on a local Linux box with two Intel quad-core (E7310 @ 1.6 GHz) CPUs and 64 GB RAM. All programs were tuned to maximize their capability for tolerating no more than five mismatches.
Effectively handling multiple mismatches contributes significantly to the improved recall rate, especially with large sequence variance
And the performance of SNP calling is also improved
BOAT also provides several flexible and friendly features
Features compared: max allowed mismatches, gapped alignment, local alignment, BLAST-style E-value, paired-end reads, multiple threads, SNP calling
Tools compared, with the cell values that accompany each: BOAT (No hardcoded limitation, YES), RMAP (NO), MAQ (3), SOAP (5, YES*), SeqMap
BOAT is available as open-source software (http://boat.cbi.pku.edu.cn)
Acknowledgement: Zhao, Shu-Qi; Wang, Jun; Zhang, Li; Li, Jiong-Tang; Gu, Xiao-Cheng; Wei, Li-Ping
Contact: gaog@mail.cbi.pku.edu.cn