Gao, Ge Center for Bioinformatics Peking University


1 Effectively mapping deep sequencing reads by BOAT (Basic Oligonucleotide Alignment Tool)
Gao, Ge, Center for Bioinformatics, Peking University

2 Next-generation deep sequencing platforms produce millions of short reads in one run
Platforms: 454 Genome Sequencer FLX | Illumina/Solexa Genome Analyzer | SOLiD™ 3 Analyzer
Amplification: emPCR | BridgePCR
Read length: 400bp | 36bp-50bp | 50-60bp
Read number: >1M | 30M | 400M
Time: 10h | 2-3day | 3.5day
Bases: M | 1.3G | 20G
Sample: 16 | 8

3 Comparative genomics, Genotyping
Goal: identify variations
[Figure: short reads tiled along the reference …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…]
Profiling: RNA-Seq, ChIP-Seq, Methy-Seq
Goal: measure significant peaks
[Figure: short reads piled up over one region of the reference …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…]

4 And those reads need to be mapped back to the reference genome effectively for further analysis
Millions of sequence reads

5 So why do we need yet another mapping tool?

6 Effectively handle (large) sequence variants during mapping

7 Seeding by hybrid indexing schema
Input: genome, input reads
Seeding: initialization (based on hash table & bitmap index), then extension (based on prefix tree)
Refining alignment
Generate alignment & calculate E-value
Output: hits list

8 Basic idea: hybrid index by integrating hash and tree

9 Prefix tree enables effective detection of the longest common substring with mismatches

10 Trigger a new alignment: “double-window hit”
[Figure: example sequence TTTTTTTTTTT ACGTA AAAAAAAAAA ACGAT, with two indexed windows labeled Seed1 and Seed2]
Either of the two indexed seeds could initialize a new alignment
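
The double-window trigger can be illustrated with a small sketch. This is not BOAT's actual code: the window length `w`, the choice of the read's first and last `w` bases as the two windows, the plain dictionary index, and the function names `build_seed_index` and `candidate_starts` are all assumptions made for illustration.

```python
from collections import defaultdict


def build_seed_index(genome, w):
    """Map every length-w window of the genome to its start positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - w + 1):
        index[genome[pos:pos + w]].append(pos)
    return index


def candidate_starts(read, index, w, genome_len):
    """Return genome positions where an alignment of `read` could start.

    Two windows are taken from the read (assumed here: its first and last
    w bases). An exact hit of EITHER window triggers a candidate, so a
    mismatch inside one window does not prevent the read from being seeded.
    """
    starts = set()
    for offset in (0, len(read) - w):
        seed = read[offset:offset + w]
        for pos in index.get(seed, ()):
            start = pos - offset  # where the whole read would begin
            if 0 <= start <= genome_len - len(read):
                starts.add(start)
    return sorted(starts)


genome = "CCCCACGTAGGGGGACGATCCCC"
index = build_seed_index(genome, 5)
# The first window carries a mismatch (ACCTA vs ACGTA); the second still hits.
hits = candidate_starts("ACCTAGGGGGACGAT", index, 5, len(genome))
```

In this toy case the read is still placed at its true position because the second window matches exactly; a read with mismatches in both windows would yield no candidate under this simplified scheme.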

11 Extension of alignment by depth-first traversing the index tree
[Figure: index tree with node labels ACGTAC, AGTA, CGTAC, CACAT, ACG, AAGAT, TCG, TCGAT, GCGAA, ACGAT, GAGAAG, CGATAC, ACGATA, GACTAG]
ACGTACAGTAAACATACGAT
|||||||||||| |||||||
ACGTACAGTAAAGATACGAT
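
A depth-first extension with a mismatch budget might look like the sketch below. It is a simplified illustration under stated assumptions, not BOAT's index: the tree is an uncompressed dictionary trie over genome substrings, and the names `build_trie`, `extend`, and `max_mismatches` are hypothetical.

```python
def build_trie(genome, depth):
    """Prefix tree over every genome substring of length <= depth.

    Each node is a dict mapping a base to a child node; the special key
    "pos" lists the genome start positions whose substring reaches the node.
    """
    root = {}
    for start in range(len(genome)):
        node = root
        for base in genome[start:start + depth]:
            node = node.setdefault(base, {})
            node.setdefault("pos", []).append(start)
    return root


def extend(node, read, i=0, mismatches=0, max_mismatches=1):
    """Depth-first traversal of the tree along `read`.

    Returns (matched_length, mismatches_used, genome_start_positions) for
    the longest path that spells read[:matched_length] with at most
    max_mismatches substitutions. Ties prefer fewer mismatches.
    """
    best = (i, mismatches, node.get("pos", []))
    if i == len(read):
        return best
    for base, child in node.items():
        if base == "pos":
            continue
        cost = mismatches + (base != read[i])
        if cost > max_mismatches:
            continue  # prune this branch: mismatch budget exhausted
        cand = extend(child, read, i + 1, cost, max_mismatches)
        if cand[0] > best[0] or (cand[0] == best[0] and cand[1] < best[1]):
            best = cand
    return best


# The two sequences from the slide differ at a single position.
trie = build_trie("ACGTACAGTAAAGATACGAT", depth=20)
result = extend(trie, "ACGTACAGTAAACATACGAT", max_mismatches=1)
```

With one mismatch allowed the traversal spans the full 20 bases; with none allowed it stops after the first 12 matching bases.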

12 Refining alignment by bounded dynamic programming
For each cell between (i, i-k) and (i, i+k)
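
The band restriction can be sketched with a plain edit-distance recurrence. The scoring (unit cost for substitution, insertion, and deletion) and the function name `banded_edit_distance` are assumptions for illustration; the slide does not state BOAT's scoring scheme.

```python
def banded_edit_distance(a, b, k):
    """Edit distance between a and b, filling only cells with |i - j| <= k.

    Returns the distance if it is at most k, otherwise None. Cells outside
    the band are treated as unreachable (infinite cost).
    """
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None  # the final cell lies outside the band
    inf = float("inf")
    prev = {j: j for j in range(min(m, k) + 1)}  # row 0 of the DP matrix
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                cur[j] = i
                continue
            substitute = prev.get(j - 1, inf) + (a[i - 1] != b[j - 1])
            delete = prev.get(j, inf) + 1
            insert = cur.get(j - 1, inf) + 1
            cur[j] = min(substitute, delete, insert)
        prev = cur
    distance = prev.get(m, inf)
    return distance if distance <= k else None


d = banded_edit_distance("ACGTACAGTAAACATACGAT", "ACGTACAGTAAAGATACGAT", 2)
```

Only about (2k + 1) cells per row are filled, so the cost grows with the read length times k rather than with the full matrix size.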

13 BOAT showed a significantly better recall rate in evaluation
5,000,000 simulated reads were mapped to an original two-million-bp mouse chrX region on a local Linux box with two Intel quad-core (1.6 GHz) CPUs and 64 GB RAM. All programs were tuned to maximize their capability for tolerating no more than five mismatches.

14 Effectively handling multiple mismatches contributes significantly to the improved recall rate, especially with large sequence variance

15 And the performance of SNP calling is also improved

16 BOAT also provides several flexible and friendly features
Features compared: Max allowed mismatches | Gapped alignment | Local alignment | BLAST-style E-value | Paired-end reads | Multiple threads | SNP calling
BOAT: No hardcoded limitation | YES
RMAP: NO
MAQ: 3
SOAP: 5 | YES*
SeqMap

17 BOAT is available as open-source software

18 Acknowledgement
Zhao, Shu-Qi; Wang, Jun; Zhang, Li; Li, Jiong-Tang; Gu, Xiao-Cheng; Wei, Li-Ping

