1 Velvet: Algorithms for De Novo Short Assembly Using De Bruijn Graphs March 12, 2008 Daniel R. Zerbino and Ewan Birney Presenter: Seunghak Lee.

Presentation transcript:

1 Velvet: Algorithms for De Novo Short Assembly Using De Bruijn Graphs March 12, 2008 Daniel R. Zerbino and Ewan Birney Presenter: Seunghak Lee

2 What are de Bruijn graphs? A de Bruijn graph is a directed graph in which an edge represents an overlap between sequences of symbols. Given symbols $S=\{s_1, s_2, \dots, s_m\}$, the nodes are the length-$n$ sequences over $S$, and $E=\{((v_1,v_2,\dots,v_n),(w_1,w_2,\dots,w_n)) : v_2=w_1, v_3=w_2, \dots, v_n=w_{n-1}\}$.
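The edge rule above can be checked on a tiny alphabet. The following is a minimal sketch, not part of the original slides; the function name `de_bruijn_edges` and the two-letter alphabet are illustrative assumptions.

```python
from itertools import product

def de_bruijn_edges(symbols, n):
    """Enumerate arcs (v, w) of the de Bruijn graph on length-n sequences.

    An arc exists when the last n-1 symbols of v equal the first n-1 of w,
    which is the condition v_2 = w_1, ..., v_n = w_{n-1}.
    """
    nodes = ["".join(p) for p in product(symbols, repeat=n)]
    # Every successor of v is v shifted left by one symbol plus one new symbol.
    return [(v, v[1:] + s) for v in nodes for s in symbols]

edges = de_bruijn_edges("AC", 2)
```

Each of the four nodes has one outgoing arc per symbol, so this toy graph has eight arcs.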

3 Introduction New sequencing techniques are commercially available (e.g. 454 Sequencing, Solexa). 454 Sequencing: ~100–200 bp reads; Solexa: ~30 bp reads. Algorithms for whole-genome shotgun (WGS) assembly are not suitable for short reads: an overlap graph with one node per read is extremely large, and the assembly contains more ambiguous connections.

4 Introduction (cont) The Euler assembler (Pevzner 2001) used k-mers as the nodes of a de Bruijn graph. Reads are mapped as paths through the de Bruijn graph. High redundancy does not affect the number of nodes. “Velvet” deals effectively with experimental errors and repeats by using de Bruijn graphs with k-mers.
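The claim that redundancy does not affect the number of nodes can be illustrated by counting distinct k-mers. This is a small sketch with made-up reads; `distinct_kmers` is a hypothetical helper and plain k-mer sets stand in for graph nodes.

```python
def distinct_kmers(reads, k):
    """Return the set of distinct k-mers seen in a collection of reads."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}

reads = ["ACGTAC", "CGTACG"]          # toy reads, assumed for illustration
once = distinct_kmers(reads, 3)
tenfold = distinct_kmers(reads * 10, 3)  # same reads at ten times the coverage
```

Sequencing the same region ten times over adds reads but no new k-mers, so the k-mer set is unchanged.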

5 De Bruijn Graphs – structure

6 De Bruijn Graphs – structure (cont) Adjacent k-mers overlap by k-1 nucleotides. Each node is attached to a twin node, which holds the reverse series of reverse-complement k-mers and captures overlaps between reads from opposite strands. The union of a node and its twin node is called a “block”. The last k-mer of a node overlaps with the first k-mer of its destination.
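The relation between a node and its twin can be sketched directly. The code below is illustrative only: the read and the helper names are assumptions, and a node is modelled simply as an ordered list of k-mers.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def kmers(seq, k):
    """List the k-mers of seq in order; neighbours overlap by k-1 bases."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

k = 3
node = kmers("ATGGCGT", k)  # toy sequence, assumed for illustration
# Twin node: the reverse series of reverse-complement k-mers.
twin = [reverse_complement(kmer) for kmer in reversed(node)]
```

Reading the twin's k-mers in order spells out the reverse complement of the original sequence, which is how overlaps with reads from the opposite strand are captured.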

7 De Bruijn Graphs – construction (cont) Reads are hashed with a predefined k-mer length. A small k increases connectivity but makes repeats more ambiguous; a large k increases specificity but decreases connectivity. k is chosen by balancing “sensitivity” against “specificity”.

8 De Bruijn Graphs – construction (cont) For each k-mer, the hash table records the ID of the first read containing it and its position in that read. Each k-mer is recorded together with its reverse complement. A node is created for each run of k-mers delimited by distinct interruption points. Reads are then traced through the graph, creating a directed arc where necessary.
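The hashing step can be sketched with a dictionary. This is not Velvet's actual data structure: storing each k-mer under the lexicographically smaller of itself and its reverse complement is an assumption made here as one common way to record both strands together, and `build_kmer_table` is a hypothetical name.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def build_kmer_table(reads, k):
    """Map each k-mer to (ID of first read containing it, position in that read)."""
    table = {}
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            kmer = read[pos:pos + k]
            # Assumption: one shared key for a k-mer and its reverse complement.
            key = min(kmer, reverse_complement(kmer))
            # setdefault keeps the first occurrence and ignores later ones.
            table.setdefault(key, (read_id, pos))
    return table

table = build_kmer_table(["ACGTA", "CGTAC"], 3)
```

In this toy input every k-mer of the second read already appears in the first (directly or as a reverse complement), so the table only holds entries pointing at read 0.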

9 De Bruijn Graphs – simplification Chains of blocks are simplified without any loss of information. If node A has only one outgoing arc, to node B, and node B has only one incoming arc, then A and B are merged.
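The merge rule can be sketched on a small graph. The representation below (a dict of node sequences plus a set of arcs) is an assumption for illustration, and twin nodes are ignored to keep the example short.

```python
def simplify(seqs, arcs, k):
    """Merge A and B whenever A has one outgoing arc (to B) and B has one incoming arc.

    seqs: node id -> sequence; arcs: set of (source, destination) pairs.
    """
    seqs, arcs = dict(seqs), set(arcs)
    changed = True
    while changed:
        changed = False
        for a, b in sorted(arcs):
            out_a = [d for s, d in arcs if s == a]
            in_b = [s for s, d in arcs if d == b]
            if a != b and len(out_a) == 1 and len(in_b) == 1:
                # B overlaps A by k-1 bases, so append only B's new bases.
                seqs[a] += seqs[b][k - 1:]
                # Drop the merged arc and let A inherit B's outgoing arcs.
                arcs = {(a if s == b else s, d) for s, d in arcs if (s, d) != (a, b)}
                del seqs[b]
                changed = True
                break
    return seqs, arcs

seqs, arcs = simplify(
    {1: "ACG", 2: "CGT", 3: "GTA", 4: "GTC"},
    {(1, 2), (2, 3), (2, 4)},
    k=3,
)
```

Nodes 1 and 2 form a chain and are merged into "ACGT"; the branch to nodes 3 and 4 is left alone because the merged node has two outgoing arcs.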

10 De Bruijn Graphs – error removal Velvet focuses on “topological features” of the graph. First step: remove tips. A tip is a chain of nodes disconnected on one end. Two criteria are used: (1) length and (2) minority count. Length: a tip is removed if it is shorter than 2k bp, since two nearby errors can create a tip of up to 2k bp.

11 De Bruijn Graphs – error removal (cont) Minority count: multiplicity m < n. Starting from node B, going through the tip (multiplicity m) is an alternative to a more common path (multiplicity n).
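The two tip criteria, length and minority count, can be combined into a single predicate. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and the tip is taken to lose the minority count if at least one competing arc out of the same junction has higher multiplicity.

```python
def is_removable_tip(tip_length_bp, tip_multiplicity, other_multiplicities, k):
    """Decide whether a tip should be clipped.

    tip_length_bp: length of the tip in base pairs
    tip_multiplicity: multiplicity m of the arc leading into the tip
    other_multiplicities: multiplicities of the other arcs leaving the junction
    """
    short_enough = tip_length_bp < 2 * k
    in_minority = any(tip_multiplicity < n for n in other_multiplicities)
    return short_enough and in_minority

clipped = is_removable_tip(40, 2, [15], k=31)        # short and rare
kept_long = is_removable_tip(70, 2, [15], k=31)      # longer than 2k = 62 bp
kept_common = is_removable_tip(40, 20, [15], k=31)   # the tip is the more common path
```

Requiring both criteria keeps a short but well-supported branch from being discarded.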

12 De Bruijn Graphs – error removal (cont) Second step: remove bubbles using Tour Bus. A bubble consists of redundant paths that start and end at the same nodes. Bubbles are created by errors or by biological variants such as SNPs.

13 De Bruijn Graphs – error removal (cont) Tour Bus: 1. Detect redundant paths. 2. Compare them using dynamic programming methods. 3. If they are similar, merge them.
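Step 2 can be sketched with a standard dynamic programming comparison. The slides do not say which alignment Velvet uses, so plain edit distance is swapped in here, and the 20% divergence threshold is an illustrative assumption rather than a documented setting.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # match or substitution
            ))
        prev = curr
    return prev[-1]

def paths_similar(a, b, max_divergence=0.2):
    """Treat two redundant paths as mergeable if they differ in few positions."""
    return edit_distance(a, b) <= max_divergence * max(len(a), len(b))

snp_like = paths_similar("ACGTACGT", "ACGAACGT")  # a single substitution
unrelated = paths_similar("ACGT", "TTTT")
```

Two paths that differ by one base, as with a sequencing error or a SNP, pass the test and would be merged; dissimilar paths are kept apart.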

14 De Bruijn Graphs – error removal (cont) Third step: remove erroneous connections. This is done after the Tour Bus algorithm, using a basic coverage cutoff. Genuine short nodes that cannot be simplified in the graph should have high coverage.
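A basic coverage cutoff amounts to a filter over nodes. The sketch below assumes a per-node coverage value and a user-chosen cutoff; the names and numbers are hypothetical.

```python
def apply_coverage_cutoff(coverage, arcs, cutoff):
    """Drop nodes whose coverage is below the cutoff, along with their arcs."""
    kept = {node for node, cov in coverage.items() if cov >= cutoff}
    kept_arcs = {(s, d) for s, d in arcs if s in kept and d in kept}
    return kept, kept_arcs

nodes, arcs = apply_coverage_cutoff(
    {"A": 30, "B": 2, "C": 28},            # toy coverage values
    {("A", "B"), ("A", "C"), ("B", "C")},
    cutoff=5,
)
```

Node B, supported by very few reads, is treated as an erroneous connection and removed, leaving the well-covered arc from A to C.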

15 Breadcrumb: resolution of repeats 1. Using read pairs, pair up the long nodes. 2. Flag paired reads using unambiguous long nodes.

16 Breadcrumb: resolution of repeats 1. Using read pairs, pair up the long nodes. 2. Flag paired reads using unambiguous long nodes.

17 Breadcrumb: resolution of repeats Extend the nodes as far as possible using the flagged paired reads. All nodes between A and B are paired up to either A or B.

18 Experimental Results The error removal pipeline was tested on simulated data. Simulated reads are from E. coli, S. cerevisiae, C. elegans, and H. sapiens. Figure: coverage density vs N50 for H. sapiens (series: Ideal, + Error (1%), + SNP). N50 is limited by the natural repetition of the reference genome.
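N50, the metric plotted against coverage, is the contig length at which the longest contigs together cover at least half of the assembly. A minimal sketch of the usual definition, with made-up contig lengths:

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly

value = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

Here the total is 54, and the three longest contigs (10, 9, 8) reach 27, so the N50 is 8.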

19 Experimental Results (cont) The error removal pipeline was tested on experimental data. A 173,428 bp human BAC was sequenced using Solexa machines. Reads were 35 bp long, and k=31. Tour Bus increased sensitivity by correcting errors while preserving the integrity of the graph structure.

20 Experimental Results (cont)

21 Experimental Results (cont)

22 Conclusions Velvet is a de Bruijn graph-based sequence assembly method for short reads. Errors are handled by removing tips and by the Tour Bus algorithm. A large number of repeats are resolved by the Breadcrumb algorithm. Velvet was assessed on simulated and real datasets and performed well.