Removing Erroneous Connections

Slides:



Advertisements
Similar presentations
Heuristic Search techniques
Advertisements

Graph Theory Aiding DNA Fragment Assembly Jonathan Kaptcianos advisor: Professor Jo Ellis-Monaghan Work.
Graphs Chapter 12. Chapter Objectives  To become familiar with graph terminology and the different types of graphs  To study a Graph ADT and different.
Gao Song 2010/04/27. Outline Concepts Problem definition Non-error Case Edge-error Case Disconnected Components Simulated Data Future Work.
Network Optimization Problems: Models and Algorithms This handout: Minimum Spanning Tree Problem.
Class 02: Whole genome sequencing. The seminal papers ``Is Whole Genome Sequencing Feasible?'' ``Whole-Genome DNA.
Connected Components, Directed Graphs, Topological Sort COMP171.
Sets and Maps Chapter 9. Chapter 9: Sets and Maps2 Chapter Objectives To understand the Java Map and Set interfaces and how to use them To learn about.
Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.
Assembly.
Sequencing and Assembly Cont’d. CS273a Lecture 5, Win07, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
Connected Components, Directed Graphs, Topological Sort Lecture 25 COMP171 Fall 2006.
CS273a Lecture 4, Autumn 08, Batzoglou Hierarchical Sequencing.
Connected Components, Directed graphs, Topological sort COMP171 Fall 2005.
CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics.
Genome sequencing and assembling
Genome sequencing. Vocabulary Bac: Bacterial Artificial Chromosome: cloning vector for yeast Pac, cosmid, fosmid, plasmid: cloning vectors for E. coli.
Data Structures Introduction Phil Tayco Slide version 1.0 Jan 26, 2015.
Time-Constrained Flooding A.Mehta and E. Wagner. Time-Constrained Flooding: Problem Definition ●Devise an algorithm that provides a subgraph containing.
Graphs and Sets Dr. Andrew Wallace PhD BEng(hons) EurIng
De-novo Assembly Day 4.
CS 394C March 19, 2012 Tandy Warnow.
A hierarchical approach to building contig scaffolds Mihai Pop Dan Kosack Steven L. Salzberg Genome Research 14(1), pp , 2004.
PE-Assembler: De novo assembler using short paired-end reads Pramila Nuwantha Ariyaratne.
Zvi Kohavi and Niraj K. Jha 1 Memory, Definiteness, and Information Losslessness of Finite Automata.
1 Velvet: Algorithms for De Novo Short Assembly Using De Bruijn Graphs March 12, 2008 Daniel R. Zerbino and Ewan Birney Presenter: Seunghak Lee.
Meraculous: De Novo Genome Assembly with Short Paired-End Reads
Sequence assembly using paired- end short tags Pramila Ariyaratne Genome Institute of Singapore SOC-FOS-SICS Joint Workshop on Computational Analysis of.
Steps in a genome sequencing project Funding and sequencing strategy source of funding identified / community drive development of sequencing strategy.
Fuzzypath – Algorithms, Applications and Future Developments
Threshold Phenomena and Fountain Codes Amin Shokrollahi EPFL Joint work with M. Luby, R. Karp, O. Etesami.
RNA Sequence Assembly WEI Xueliang. Overview Sequence Assembly Current Method My Method RNA Assembly To Do.
billion-piece genome puzzle
Graphs. Graphs Similar to the graphs you’ve known since the 5 th grade: line graphs, bar graphs, etc., but more general. Those mathematical graphs are.
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley. Ver Chapter 13: Graphs Data Abstraction & Problem Solving with C++
CS 173, Lecture B Introduction to Genome Assembly (using Eulerian Graphs) Tandy Warnow.
Sequencing technologies and Velvet assembly Lecturer : Du Shengyang September 29 , 2012.
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
Sets and Maps Chapter 9. Chapter Objectives  To understand the Java Map and Set interfaces and how to use them  To learn about hash coding and its use.
Chapter 5 Sequence Assembly: Assembling the Human Genome.
OPERA highthroughput paired-end sequences Reconstructing optimal genomic scaffolds with.
ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads
Data Structures and Algorithm Analysis Graph Algorithms Lecturer: Jing Liu Homepage:
MERmaid: Distributed de novo Assembler Richard Xia, Albert Kim, Jarrod Chapman, Dan Rokhsar.
Leda Demos By: Kelley Louie Credits: definitions from Algorithms Lectures and Discrete Mathematics with Algorithms by Albertson and Hutchinson graphics.
Sets and Maps Chapter 9.
Sequence assembly Jose Blanca COMAV institute bioinf.comav.upv.es.
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
Fragment Assembly (in whole-genome shotgun sequencing)
Data Structures Using C++ 2E
Depth-First Search.
Design and Analysis of Algorithms
Introduction to Genome Assembly
CS 598AGB Genome Assembly Tandy Warnow.
Dr. David Matuszek Heapsort Dr. David Matuszek
Graphs Chapter 13.
Graphs Chapter 11 Objectives Upon completion you will be able to:
Connected Components, Directed Graphs, Topological Sort
CSE 589 Applied Algorithms Spring 1999
Heapsort.
Chapter 15 Graphs © 2006 Pearson Education Inc., Upper Saddle River, NJ. All rights reserved.
Sets and Maps Chapter 9.
An Eulerian path approach to DNA fragment assembly
Heapsort.
Unit 2: Computational Thinking, Algorithms & Programming
Heapsort.
Fragment Assembly 7/30/2019.
CO 303 Algorithm Analysis and Design
Chapter 14 Graphs © 2011 Pearson Addison-Wesley. All rights reserved.
Presentation transcript:

Removing Erroneous Connections The erroneous connections left after Tour Bus cannot be identified via the graph topology Cannot find a recognizable loop or structure Cannot be associated directly to a corresponding correct path Solution: Create a coverage cutoff Currently set by the user Concern: doesn’t this undo Tour Bus’s goal of preserving ‘important’ low coverage nodes? No, Tour Bus turns those into long straight nodes which have near average coverage

Breadcrumb! Assembly is limited by the repeat structure in the sequenced genome We need to be able to resolve these repeats! We need to be able to connect contiguous regions through repeated regions, otherwise these repeats would create tangles in the de Bruijn graph. How do we do this? With Breadcrumb!

Breadcrumb algorithm. Breadcrumb algorithm. Two long contigs produced after error correction, A and B, are joined by several paired reads (red and blue arcs). The path between the two can be broken up because of a repeat internal to the connecting sequence, because of an overlap with a distinct part of the genome, or because of some unresolved errors. The small square nodes represent either nodes of the path between A and B, or other nodes of the graph connected to the former. Finding the exact path in the graph from A to B is not straightforward because of all the alternate paths that need to be explored. However, if we mark all the nodes that are paired up to either A or B (with a blue circle), we can define a subgraph much simpler to explore. Ideally, only a linear path connects both nodes. Daniel R. Zerbino, and Ewan Birney Genome Res. 2008;18:821-829 Copyright © 2008, Cold Spring Harbor Laboratory Press

The Breadcrumb Algorithm We start by selecting a cutoff length longer than most of the inserts, and designate “long nodes” to be the nodes that are longer than the cutoff. Using the read pairs, Breadcrumb starts by pairing up the long nodes. Because we did not set any restriction on uniqueness, some long nodes may consistently connect to several other long nodes, but they are simply flagged as ambiguous and left untouched. This selection eliminates some duplicated nodes, but not necessarily all of them. Select only unambiguous long nodes, Breadcrumb flags all the nodes containing the mate reads of the reads in that long node. If a single opposite long node is available, then all the nodes that pair up to it are also flagged. Because of the node length constraint, between two long contigs nearly all paired reads map onto a read on either of the long reads. Breadcrumb then extends the unique node by going as far as possible from one connected flagged node to the next and stopping if there are no connections or if the node is flagged. In the best case, a simple path can be found to the opposite long node, and the two contigs can be merged.

Breadcrum cont’d We must allow for erroneous reads! All reads detected while removing erroneous connections are marked unreliable. Long nodes may be erroneously connected to very few paired reads, in which case we discard the weak connections between long nodes. However, error also occurs in low complexity regions, so it is necessary to apply Tour Bus to the flagged nodes only, where we flag them instead of destroying them.

Time Complexity Main bottleneck is graph construction- the initializer graph for streptococcus needs 2 Gb of RAM alone! Tour bus uses a modified version of Dijkstra which has a time complexity of O(NlogN), where N is the number of nodes in the graph. This N depends on read coverage, error rate, and number of repeats. The runtime on breadcrumb is dependent on some subset K of N, where K is the size of the set of long nodes (remember that we set a threshold from which to select our K). It is a near constant amount of time per long node though, so its around O(K).

Why Velvet? There are other short read assemblers SSAKE and VCAKE explore a de Bruijn graph by searching for reads in a hash table. Since Velvet builds this graph, it uses more memory but tends to be much faster and produces longer contigs without misassembly! Velvet makes connected supercontigs out of very short reads. The connectivity of the supercontigs is super useful!