Fast and accurate short read alignment with Burrows–Wheeler transform

Fast and accurate short read alignment with Burrows–Wheeler transform. Heng Li and Richard Durbin. Presented by: Yunji Wang, Sree Devineni, Zhen Gao

Motivation: The first generation of hash table-based methods (e.g. MAQ) is slow and does not support gapped alignment.

Suffix array interval: All occurrences of a substring fall in one contiguous interval of the suffix array (see the figure on the right). E.g. the suffix array interval of the pattern “go” is [1, 2]. What about “og”?
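
The interval quoted on this slide can be checked with a small sketch. It assumes the reference string in the figure is "googol$" (the word used on the prefix trie slide, with "$" as the smallest symbol); the helper names `suffix_array` and `sa_interval` are illustrative and are not taken from BWA. The suffix array is built by naive sorting, which is fine for a toy string but not for a genome.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_interval(text, sa, pattern):
    """Closed interval [lo, hi] of suffix array rows whose suffix starts with pattern.

    Returns None when the pattern does not occur in text.
    """
    rows = [row for row, start in enumerate(sa) if text.startswith(pattern, start)]
    return (rows[0], rows[-1]) if rows else None


text = "googol$"  # assumed reference string; '$' sorts before the letters
sa = suffix_array(text)
print(sa)                           # [6, 3, 0, 5, 2, 4, 1]
print(sa_interval(text, sa, "go"))  # (1, 2)
print(sa_interval(text, sa, "og"))  # (4, 4)
```

Under these assumptions "og" occupies the single-row interval [4, 4], because only the suffix "ogol$" starts with it.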

Prefix trie and inexact string matching: Prefix trie of the string “GOOGOL”. The dashed line shows how to find the string “LOL” (1 mismatch allowed). What about “LOG”?
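
A minimal sketch of mismatch-tolerant search, assuming the lowercase string "googol$" and substitutions only (no gaps). Instead of an explicit prefix trie it walks the Burrows–Wheeler transform backwards, which visits the same search space as a top-down trie traversal. The function names are hypothetical, and the occurrence counts are recomputed naively; a real aligner would precompute them and prune the search.

```python
def build_index(text):
    """Return (suffix array, BWT string, C table) for a text ending in '$'."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] is '$' when i == 0
    c_table, total = {}, 0
    for symbol in sorted(set(text)):
        c_table[symbol] = total  # number of symbols in text smaller than symbol
        total += text.count(symbol)
    return sa, bwt, c_table


def inexact_search(text, pattern, max_mismatches):
    """Start positions in text matching pattern with at most max_mismatches substitutions."""
    sa, bwt, c_table = build_index(text)
    alphabet = [symbol for symbol in c_table if symbol != "$"]
    hits = set()

    def recurse(i, budget, lo, hi):
        # [lo, hi) is the suffix array interval of the suffix matched so far.
        if i < 0:
            hits.update(sa[lo:hi])
            return
        for symbol in alphabet:
            new_lo = c_table[symbol] + bwt[:lo].count(symbol)
            new_hi = c_table[symbol] + bwt[:hi].count(symbol)
            if new_lo >= new_hi:
                continue  # this extension does not occur in the text
            cost = 0 if symbol == pattern[i] else 1
            if cost <= budget:
                recurse(i - 1, budget - cost, new_lo, new_hi)

    recurse(len(pattern) - 1, max_mismatches, 0, len(text))
    return sorted(hits)


print(inexact_search("googol$", "lol", 1))  # [3]  ("gol" at position 3)
print(inexact_search("googol$", "log", 1))  # [1]  ("oog" at position 1)
```

With one mismatch allowed, "lol" is found as "gol" and "log" as "oog"; with zero mismatches neither pattern occurs.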

Conclusions: The authors implemented the Burrows-Wheeler Alignment tool (BWA), which is based on the BWT. As a result it is fast, uses less memory, and allows gaps.

REFERENCES: Li, H. and Durbin, R. (2009) Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14), 1754–1760.

CS 6293: Advanced Topics: Current Bioinformatics. A probabilistic framework for aligning paired-end RNA-seq data. Presented by: Yunji Wang, Sree Devineni, Zhen Gao

A probabilistic framework for aligning paired-end RNA-seq data. Current Biology Method: Align RNA-seq reads to the reference genome rather than to a transcript database.

Current Biology Problem: A single read consists of 35-100 consecutive nucleotides from a fragment of an mRNA transcript. However, the expected size of mRNA fragments is around 182 bp. The paired-end read (PER) protocol sequences both ends of a size-selected mRNA fragment, doubling the sequence obtained per fragment compared with a single read.

Problem of PER fragment alignment: The expected distance between the two end reads within the transcript fragment is known as the mate-pair distance. When the two ends are aligned to the genome, the distance between them can be quite different from the mate-pair distance, because introns may lie between them.
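
A small worked example of why the two distances differ. The exon coordinates and read positions are invented for illustration, and `transcript_distance` is a hypothetical helper, not part of any tool named in these slides. It counts only the exonic bases between the two ends, which is what the mate-pair distance measures.

```python
def transcript_distance(exons, left_end, right_start):
    """Exonic bases between two genomic coordinates.

    exons is a list of half-open (start, end) genomic intervals.
    """
    return sum(
        max(0, min(end, right_start) - max(start, left_end))
        for start, end in exons
    )


# Hypothetical gene: two 100 bp exons separated by a 5,000 bp intron.
exons = [(100, 200), (5200, 5300)]
left_end, right_start = 150, 5250  # the two ends fall in different exons

print(right_start - left_end)                             # 5100 (distance on the genome)
print(transcript_distance(exons, left_end, right_start))  # 100  (distance in the transcript)
```

The genomic distance includes the whole intron, so only the transcript-level value is comparable with the expected fragment size.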

Current Tools: TopHat reports the closest end alignment for a PER. SpliceMap considers PERs whose ends map within 400,000 bp of each other on the genome.

Method-Step 1 Mapping the individual reads

Method-Step 2 Graphical model

Probabilistic framework: Splice graph G = (V, E). Nodes: individual nucleotides. Directed edges are of two types: edges that connect adjacent nodes, and edges that skip around the spliced-out portion of the genome.

Estimation of alignments: Maximize the likelihood of the PERs over all the putative alignments.

EM continued...

Method-Step 3 Expectation-maximization algorithm
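
The slides do not give the paper's update equations, so the following is only a generic EM sketch for ambiguous assignments, not the authors' model. It assumes each PER comes with a dictionary of candidate alignments and a fixed likelihood for each (for example, derived from how well the implied distance fits the expected mate-pair distance); the candidate names "A" and "B" and the function `em_alignment_weights` are hypothetical.

```python
def em_alignment_weights(reads, iterations=50):
    """Estimate candidate weights theta by expectation-maximization.

    reads is a list of dicts mapping candidate name -> likelihood of the
    read under that candidate (assumed given and strictly positive).
    """
    candidates = sorted({c for read in reads for c in read})
    theta = {c: 1.0 / len(candidates) for c in candidates}
    for _ in range(iterations):
        expected = dict.fromkeys(candidates, 0.0)
        # E-step: share each read among its candidates in proportion
        # to theta[candidate] * likelihood.
        for read in reads:
            norm = sum(theta[c] * lik for c, lik in read.items())
            for c, lik in read.items():
                expected[c] += theta[c] * lik / norm
        # M-step: new weight = expected fraction of reads per candidate.
        theta = {c: expected[c] / len(reads) for c in candidates}
    return theta


# Two reads support only candidate A; a third is ambiguous between A and B.
reads = [{"A": 1.0}, {"A": 1.0}, {"A": 0.5, "B": 0.5}]
theta = em_alignment_weights(reads)
print(max(theta, key=theta.get))  # A
```

The point of the sketch is that the ambiguous read is resolved using evidence from the whole read set, which is the role the slides attribute to EM.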

Discussion: The paper proposes a probabilistic framework that predicts the alignment of each PER fragment to a reference genome by maximizing the likelihood of all PER alignments through a splice graph model. Advantages: higher coverage and specificity than aligning the individual end reads alone. The framework is also capable of detecting trans-chromosome and trans-strand gene fusion events.

Advantages: First, the fragment alignments significantly increase coverage of the transcriptome. Reason: a PER contains almost double the information of a single read. Second, the approach has higher specificity than the junctions in the individual end reads. Reason: the EM algorithm uses information from the entire set of end read alignments.

Advantages: Third, the splice graph accurately captures alternative paths between the two end reads, and the expected mate-pair distance can effectively disambiguate them.
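
The disambiguation idea can be illustrated with a toy nucleotide-level splice graph. This is a sketch under invented assumptions (a 30 bp genome with one junction from position 9 to position 20), and it picks the path whose length is closest to the expected distance rather than using the paper's likelihood; all function names are hypothetical.

```python
def build_splice_graph(genome_length, junctions):
    """Adjacency lists: each nucleotide links to the next one, plus skip edges."""
    graph = {i: [i + 1] for i in range(genome_length - 1)}
    graph[genome_length - 1] = []
    for donor, acceptor in junctions:
        graph[donor].append(acceptor)  # edge skipping the spliced-out region
    return graph


def paths_between(graph, start, end):
    """Yield every left-to-right path from start to end as a list of nodes."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            yield path
            continue
        for nxt in graph[node]:
            if nxt <= end:  # edges only go forward, so overshooting is a dead end
                stack.append((nxt, path + [nxt]))


def best_path(graph, start, end, expected_distance):
    """Path whose edge count is closest to the expected mate-pair distance."""
    return min(
        paths_between(graph, start, end),
        key=lambda path: abs((len(path) - 1) - expected_distance),
    )


graph = build_splice_graph(30, [(9, 20)])
lengths = sorted(len(p) - 1 for p in paths_between(graph, 5, 25))
print(lengths)                               # [10, 20]
print(len(best_path(graph, 5, 25, 10)) - 1)  # 10 (the path through the junction)
```

Two paths connect the ends, one unspliced (20 bases) and one through the junction (10 bases); an expected distance of 10 selects the spliced path.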

Thank you