Gao, Ge Center for Bioinformatics Peking University

Effectively mapping deep sequencing reads by BOAT (Basic Oligonucleotide Alignment Tool)

Next-generation deep sequencing platforms produce millions of short reads in one run
Platforms: 454 Genome Sequencer FLX; Illumina/Solexa Genome Analyzer; SOLiD™ 3 Analyzer
Amplification: emPCR; BridgePCR
Read length: 400 bp; 36-50 bp; 50-60 bp
Read number: >1M; 30M; 400M
Time: 10 h; 2-3 days; 3.5 days
Bases: 400-600M; 1.3G; 20G
Sample: 16; 8

Comparative genomics, genotyping. Goal: identify variations
[Figure: short reads tiled along the reference …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…]
Profiling (RNA-Seq, ChIP-Seq, Methy-Seq). Goal: measure significant peaks
[Figure: short reads piled up over part of the reference …CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGTATAC…]

And those reads need to be mapped back to the reference genome effectively for further analysis
[Figure: millions of sequence reads]

So why do we need yet another mapping tool? (http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html)

Effectively handle (large) sequence variants during mapping

Seeding by hybrid indexing schema
Input: genome, inputted reads
Seeding: initialization (based on hash table & bitmap index), then extension (based on prefix tree)
Refining alignment
Generate alignment & calculate E-value
Output: hits list

Basic idea: hybrid index by integrating hash and tree
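One way to picture a hash-plus-tree index is sketched below. This is a minimal illustration, not BOAT's actual data structure: the names `build_hybrid_index` and `exact_lookup`, the key length, and the window length are all assumptions chosen for the example. A hash table keyed on the first few bases of each genomic window points to a small prefix tree holding the remaining bases.

```python
from collections import defaultdict


def build_hybrid_index(genome, key_len=4, window=8):
    """Hash the first key_len bases of every window; store the rest in a trie.

    Hypothetical layout for illustration only (not BOAT's real index).
    """
    index = defaultdict(dict)
    for start in range(len(genome) - window + 1):
        key = genome[start:start + key_len]
        node = index[key]                      # hash step: O(1) bucket lookup
        for base in genome[start + key_len:start + window]:
            node = node.setdefault(base, {})   # tree step: walk or create child
        node.setdefault("$", []).append(start)  # "$" marks a stored window
    return index


def exact_lookup(index, read, key_len=4):
    """Return genome positions whose window equals the read exactly."""
    node = index.get(read[:key_len])
    if node is None:
        return []
    for base in read[key_len:]:
        node = node.get(base)
        if node is None:
            return []
    return node.get("$", [])


if __name__ == "__main__":
    idx = build_hybrid_index("ACGTACGTTGCA")
    print(exact_lookup(idx, "ACGTTGCA"))
```

The hash gives constant-time entry into a small subtree, and the tree part is what later allows a search to branch when a base does not match.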

Prefix tree enables effective detection of the longest common substring with mismatches (http://en.wikipedia.org/wiki/Trie)
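A rough sketch of mismatch-tolerant matching on a prefix tree follows. It assumes a plain dictionary-based trie and a hypothetical helper `longest_match`; it only illustrates the general technique and is not taken from the BOAT source. The search walks the tree depth-first, spends one unit of a mismatch budget whenever the tree base differs from the read base, and reports the longest read prefix that some path can cover.

```python
def build_trie(words):
    """Build a nested-dict prefix tree from a list of strings."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def longest_match(trie, read, max_mismatches):
    """Length of the longest read prefix covered by a trie path
    with at most max_mismatches substitutions (depth-first search)."""
    best = 0
    stack = [(trie, 0, 0)]  # (node, depth in read, mismatches used)
    while stack:
        node, depth, used = stack.pop()
        best = max(best, depth)
        if depth == len(read):
            continue
        for ch, child in node.items():
            cost = used + (ch != read[depth])
            if cost <= max_mismatches:
                stack.append((child, depth + 1, cost))
    return best


if __name__ == "__main__":
    trie = build_trie(["ACGTAC", "ACGATA"])
    print(longest_match(trie, "ACGTTC", 0))
    print(longest_match(trie, "ACGTTC", 1))
```

Because shared prefixes are stored once, the search explores each common prefix a single time no matter how many indexed words start with it.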

Trigger a new alignment: “double-window hit”
Read: TTTTTTTTTTT ACGTA AAAAAAAAAA ACGAT (Seed1 = ACGTA, Seed2 = ACGAT)
Either of the two indexed seeds could initialize a new alignment
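The benefit of using two seed windows can be sketched as follows: if a mismatch falls inside one seed, the other seed can still trigger the alignment. The function names, the seed length `k`, and the seed offsets are assumptions made for this example, and the plain k-mer hash stands in for whatever index BOAT really uses.

```python
from collections import defaultdict


def index_kmers(genome, k):
    """Hash every k-mer of the genome to its start positions."""
    table = defaultdict(list)
    for i in range(len(genome) - k + 1):
        table[genome[i:i + k]].append(i)
    return table


def candidate_starts(table, read, k, offsets):
    """Genome start positions proposed by ANY seed window of the read."""
    starts = set()
    for off in offsets:
        for pos in table.get(read[off:off + k], ()):
            if pos - off >= 0:
                starts.add(pos - off)  # shift the seed hit back to the read start
    return sorted(starts)


if __name__ == "__main__":
    genome = "CC" + "TTTT" + "ACGTA" + "AAAA" + "ACGAT" + "GG"
    table = index_kmers(genome, 5)
    read = "TTTT" + "ACCTA" + "AAAA" + "ACGAT"  # mismatch inside the first seed
    print(candidate_starts(table, read, 5, offsets=(4, 13)))
```

In this toy run the first seed no longer matches anything in the genome, yet the second seed still proposes the correct start position.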

Extension of alignment by depth-first traversing the index tree
[Figure: index tree with nodes labeled ACGTAC, AGTA, CGTAC, CACAT, ACG, AAGAT, TCG, TCGAT, GCGAA, ACGAT, GAGAAG, CGATAC, ACGATA, GACTAG]
ACGTACAGTAAACATACGAT
|||||||||||| |||||||
ACGTACAGTAAAGATACGAT

Refining alignment by bounded dynamic programming
For each cell between (i, i-k) and (i, i+k)
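Restricting the dynamic programming matrix to cells (i, j) with |i - j| <= k can be sketched as a banded edit distance. The function below is an illustrative assumption (unit costs, hypothetical name `banded_edit_distance`), not BOAT's scoring scheme. Only about (2k + 1) cells per row are filled, so the cost is roughly O(k * n) instead of O(n * m).

```python
def banded_edit_distance(a, b, k):
    """Edit distance between a and b using only cells with |i - j| <= k.

    Returns the distance if it is at most k, otherwise None.
    Unit costs for substitution, insertion and deletion (an assumption).
    """
    inf = float("inf")
    if abs(len(a) - len(b)) > k:
        return None
    # Row 0: aligning the empty prefix of a with b[:j] costs j insertions.
    prev = {j: j for j in range(min(len(b), k) + 1)}
    for i in range(1, len(a) + 1):
        cur = {}
        for j in range(max(0, i - k), min(len(b), i + k) + 1):
            if j == 0:
                cur[j] = i  # i deletions
                continue
            sub = prev.get(j - 1, inf) + (a[i - 1] != b[j - 1])
            dele = prev.get(j, inf) + 1
            ins = cur.get(j - 1, inf) + 1
            cur[j] = min(sub, dele, ins)
        prev = cur
    dist = prev[len(b)]
    return dist if dist <= k else None


if __name__ == "__main__":
    print(banded_edit_distance("ACGTACAGTAAACATACGAT",
                               "ACGTACAGTAAAGATACGAT", 2))
```

Cells outside the band are treated as unreachable, which is safe whenever the true distance is at most k, because an optimal path with k or fewer edits never leaves the band.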

BOAT showed a significantly better recall rate in evaluation
5,000,000 simulated reads were mapped to an original two-million-bp mouse chrX region on a local Linux box with two Intel quad-core (E7310 @ 1.6 GHz) CPUs and 64 GB RAM. All programs were tuned to maximize their capability for tolerating no more than five mismatches.

Effectively handling multiple mismatches contributes significantly to the improved recall rate, especially with large sequence variance

And the performance of SNP calling is also improved

BOAT also provides several flexible and friendly features
Features compared: max allowed mismatches, gapped alignment, local alignment, BLAST-style E-value, paired-end reads, multiple threads, SNP calling
Tools compared: BOAT, RMAP, MAQ, SOAP, SeqMap
Table cells as given: BOAT: No hardcoded limitation, YES; RMAP: NO; MAQ: 3; SOAP: 5, YES*; SeqMap

BOAT is available as open-source software (http://boat.cbi.pku.edu.cn)

Acknowledgement
Zhao, Shu-Qi; Wang, Jun; Zhang, Li; Li, Jiong-Tang; Gu, Xiao-Cheng; Wei, Li-Ping
gaog@mail.cbi.pku.edu.cn