Fragment assembly of DNA

A typical approach to sequencing long DNA molecules is to sample and then sequence fragments from them.

Presentation transcript:


® Pei-Jie Wu
Fragment assembly of DNA
– Biological background
– Models
– Algorithms
– Heuristics

Biological background -- Problem as puzzle
We do not know which letter from the set {A, C, G, T} is written on each card, but we do know that cards in the same position on opposite strands form a complementary pair. Our goal is to obtain the letters using certain hints, which are (approximate) substrings of the rows.

Biological background
– Target: the long sequence to reconstruct.
– Fragment vs. subsequence
– Shotgun method: based on fragment overlap.
– Fragment assembly: putting a collection of fragments together.

Biological background -- The ideal case
– Case: p.106
– Align the input set, ignoring spaces at the extremities.
– Overlaps: the end part of one fragment is similar to the beginning of another.
– The consensus sequence is based on a majority vote.
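The majority vote can be sketched in a few lines of Python. This is only an illustration: the function name `consensus`, the use of ' ' for padding at the extremities and of '-' for internal spaces are assumptions of this sketch, and the sample rows are illustrative rather than taken from the slides.

```python
from collections import Counter


def consensus(layout):
    """Majority-vote consensus of an aligned layout.

    `layout` is a list of equal-length rows. A ' ' pads the extremities
    and is ignored; a '-' is an internal space and counts as a vote.
    """
    result = []
    for col in range(len(layout[0])):
        votes = Counter(row[col] for row in layout if row[col] != ' ')
        if votes:  # skip columns that no fragment covers
            result.append(votes.most_common(1)[0][0])
    # Columns won by the space character contribute nothing to the consensus.
    return ''.join(result).replace('-', '')


layout = [
    "  ACCGT  ",
    "    CGTGC",
    "TTAC     ",
    " TACCGT  ",
]
print(consensus(layout))
```

Ties are broken arbitrarily here; a practical assembler would want an explicit tie rule or quality values.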

Biological background -- Complications
The main factors that add to the complexity of the problem are:
– Errors
– Unknown orientation
– Repeated regions
– Lack of coverage

Biological background -- Complications: Errors
– Dealing with errors usually means that the algorithms require more time and space.
– The simplest errors are called base call errors and comprise base substitutions, insertions, and deletions in the fragments.
– Base call errors occur in practice at rates varying from 1 to 5 errors every 100 characters.
– Figures 4.2, 4.3, 4.4

Biological background -- Complications: Errors
Two other types of errors: chimeras and contamination.
– Chimeras arise when two regular fragments from distinct parts of the target molecule join end-to-end to form a fragment that is not a contiguous part of the target (Figure 4.5).
– Solution: chimeras must be recognized as such and removed from the fragment set in a preprocessing stage.
– Contamination comes from host or vector DNA.
– Solution: most vectors are well known, so we can screen the data before starting assembly.

Biological background -- Complications: Unknown orientation
– We generally do not know to which strand a particular fragment belongs.
– We regard the input fragments as being all approximate substrings of the consensus sought, either as given or in reverse complement.
– Figure 4.6
– Complexity: 2^n possible orientation assignments for n fragments.
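A minimal Python sketch of the reverse complement and of the 2^n orientation choices follows. The names `reverse_complement` and `orientations` are hypothetical helpers introduced here for illustration only.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(fragment):
    """Complement every base, then reverse the string."""
    return fragment.translate(COMPLEMENT)[::-1]


def orientations(fragments):
    """Yield every way of taking each fragment as given or reverse-complemented."""
    choices = [(f, reverse_complement(f)) for f in fragments]
    return product(*choices)


print(reverse_complement("ACCGT"))                   # ACGGT
print(len(list(orientations(["AC", "GT", "TT"]))))   # 8 = 2**3
```

Enumerating all assignments is only feasible for a handful of fragments, which is why practical assemblers settle orientation while they detect overlaps.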

Biological background -- Complications: Repeated regions
– Repeats are sequences that appear two or more times in the target molecule.
– Short repeats
– Longer repeats
– If the level of similarity between two copies of a repeat is high enough, the differences can be mistaken for base call errors.
– Figure 4.7

Biological background -- Complications: Repeated regions
Problems:
– If a fragment is totally contained in a repeat, we may have several places to put it in the final alignment. When the copies are not exactly equal, we may weaken the consensus by placing a fragment in the wrong copy.
– Repeats can be positioned in such a way as to render assembly inherently ambiguous (Figures 4.8 and 4.9).
Direct repeats: repeated copies in the same strand.
Inverted repeats: repeated regions in opposite strands (Figure 4.10).

Biological background -- Complications: Lack of coverage
– Coverage: the coverage at position i of the target is the number of fragments that cover this position.
– Contigs: the contiguously covered regions.
– Figure 4.11
Solutions:
– Sampling more fragments
– Directed sequencing or walking
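The two definitions above can be illustrated with a short Python sketch. It assumes, purely for illustration, that the position of each fragment on the target is already known and given as a half-open interval `(start, end)`; in a real project these positions are exactly what assembly has to discover.

```python
def coverage(target_length, intervals):
    """Number of fragments covering each position of the target."""
    cov = [0] * target_length
    for start, end in intervals:  # half-open interval [start, end)
        for i in range(start, end):
            cov[i] += 1
    return cov


def contigs(cov):
    """Maximal runs of positions with coverage >= 1, as half-open intervals."""
    runs, start = [], None
    for i, c in enumerate(cov):
        if c > 0 and start is None:
            start = i
        elif c == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(cov)))
    return runs


cov = coverage(10, [(0, 4), (2, 6), (8, 10)])
print(cov)           # [1, 1, 2, 2, 1, 1, 0, 0, 1, 1]
print(contigs(cov))  # [(0, 6), (8, 10)]
```

Positions 6 and 7 have coverage zero, so the sample splits into two contigs separated by a gap.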

Biological background -- Alternative methods for DNA sequencing
– Directed sequencing: a method that can be used to cover small remaining gaps in a shotgun project.
Problems:
– It is expensive to build special primers.
– It is sequential rather than parallel.
– Sequencing by hybridization (SBH): consists of assembling the target molecule based on many hybridization experiments with very short, fixed-length sequences called probes.

Models
– Shortest common superstring (SCS)
– RECONSTRUCTION
– MULTICONTIG
All three assume that the fragment collection is free of contamination and chimeras.

Models -- Shortest common superstring
Seeking the shortest superstring of a collection of given strings.
PROBLEM: Shortest common superstring (SCS)
INPUT: a collection F of strings.
OUTPUT: a shortest possible string S such that for every f ∈ F, S is a superstring of f.
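For very small collections the problem statement can be explored by exhaustive search. The Python sketch below is an illustration under stated assumptions, not an algorithm from the slides: it tries every ordering of the fragments and merges neighbours with their largest exact overlap, so its running time grows factorially. The helper names `overlap`, `merge`, and `shortest_common_superstring` are hypothetical.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0


def merge(a, b):
    """Append b to a, sharing the overlapping part; skip b if already contained."""
    if b in a:
        return a
    return a + b[overlap(a, b):]


def shortest_common_superstring(fragments):
    """Brute force over all orderings; only sensible for a handful of strings."""
    best = None
    for order in permutations(fragments):
        s = order[0]
        for f in order[1:]:
            s = merge(s, f)
        if best is None or len(s) < len(best):
            best = s
    return best


s = shortest_common_superstring(["ACT", "CTA", "AGT"])
print(s, len(s))
```

For the three fragments shown, the result has length 6 and contains every fragment as a substring.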

Models -- Shortest common superstring
– Example 4.1
– Example 4.2 (Figures 4.12 and 4.13)
– When the target has repeats, a shortest superstring may contain only one copy of the repeat, which will absorb all fragments totally contained in any of the copies.

Models -- Reconstruction
– Takes into account both errors and unknown orientation.
– Uses the dynamic programming sequence comparison algorithm.
– Uses distance rather than similarity.
– Expression: p.116

Models -- Reconstruction
PROBLEM: RECONSTRUCTION
INPUT: a collection F of strings and an error tolerance ε between 0 and 1.
OUTPUT (p.117): a string S as short as possible such that, for every f ∈ F, either f or its reverse complement is an approximate substring of S at error level ε.
This model does not account for repeats, lack of coverage, or the size of the target.
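The membership test behind this definition can be sketched in Python. The sketch assumes the usual formulation, which the slide only references by page number: f is an approximate substring of S at error level ε when the smallest edit distance between f and any substring of S is at most ε·|f|. Unit costs for substitutions, insertions, and deletions are also an assumption, as are all function names.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(fragment):
    return fragment.translate(COMPLEMENT)[::-1]


def substring_edit_distance(f, s):
    """Minimum edit distance between f and any substring of s.

    Dynamic programming with a free start (first row all zeros) and a
    free end (minimum over the last row) in s.
    """
    prev = [0] * (len(s) + 1)
    for i in range(1, len(f) + 1):
        cur = [i] + [0] * len(s)
        for j in range(1, len(s) + 1):
            cost = 0 if f[i - 1] == s[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # character of f unmatched
                         cur[j - 1] + 1)      # character of s unmatched
        prev = cur
    return min(prev)


def is_approximate_substring(f, s, eps):
    """True if f or its reverse complement fits in s within eps * len(f) edits."""
    best = min(substring_edit_distance(f, s),
               substring_edit_distance(reverse_complement(f), s))
    return best <= eps * len(f)


print(substring_edit_distance("ACGT", "TTACCTTT"))        # 1
print(is_approximate_substring("AAAC", "CCGTTTCC", 0.0))  # True, via GTTT
```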

Models -- Multicontig
– Involves the internal linkage of the fragments in the layout.
– Nonlink: an overlap for which there is a fragment that properly contains it on both sides.
– Weakest link: the smallest size of any link.
– t-contig: a layout whose weakest link is at least as large as t.
– Example 4.4
– Definition: p.119

Algorithms
– Greedy algorithm
– Acyclic subgraphs (no errors and known orientation)

Algorithms -- Representing overlaps
– The overlap multigraph OM(F) of a collection F is a directed, weighted multigraph.
– The set V of nodes of this structure is just F itself.
– A directed edge from a to a different fragment b with weight t ≥ 0 exists if the suffix of a with t characters is a prefix of b.
– There may be many edges from a to b.
– No self-loops.
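The definition translates almost directly into Python. The sketch below is illustrative only: it assumes the fragments are distinct strings, represents each edge as a tuple `(a, b, t)`, and uses the hypothetical name `overlap_multigraph`.

```python
def overlap_multigraph(fragments):
    """All edges (a, b, t) such that the last t characters of a
    equal the first t characters of b, for a != b and t >= 0."""
    edges = []
    for a in fragments:
        for b in fragments:
            if a == b:
                continue  # no self-loops
            for t in range(min(len(a), len(b)) + 1):
                if a[len(a) - t:] == b[:t]:
                    edges.append((a, b, t))
    return edges


for edge in overlap_multigraph(["ACT", "CTA"]):
    print(edge)
# ('ACT', 'CTA', 0)
# ('ACT', 'CTA', 2)
# ('CTA', 'ACT', 0)
# ('CTA', 'ACT', 1)
```

Because the empty suffix is always a prefix, every ordered pair has at least the edge of weight 0, which is why several edges can run from a to b.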

Algorithms -- Paths originating superstrings
– An edge e = (f, g) in the path has a certain weight t, which means that the last t bases of the tail f of e are the same as the first t bases of the head g of e.
– Figure 4.15
– Example in p.121
– Equation 4.3
– Hamiltonian path: a path that goes through every vertex.
– Equation 4.4
– Minimizing |S(P)| ⇔ maximizing w(P)
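The equivalence between a short superstring and a heavy path can be checked numerically. The Python sketch below assumes the standard relation for a Hamiltonian path P, namely |S(P)| = ||F|| − w(P), where ||F|| is the total length of the fragments and w(P) the sum of the edge weights; the slides cite the equations only by number, so this reading is an assumption. The function name and argument layout are hypothetical.

```python
def path_superstring(fragments, weights):
    """Superstring S(P) spelled by a path.

    `fragments` lists the nodes in path order; `weights[i]` is the weight
    of the edge from fragments[i] to fragments[i + 1].
    """
    s = fragments[0]
    for prev, cur, t in zip(fragments, fragments[1:], weights):
        # The edge must be a genuine overlap of t characters.
        assert prev.endswith(cur[:t]), "edge weight is not a real overlap"
        s += cur[t:]
    return s


fragments = ["ACT", "CTA", "TAG"]
weights = [2, 2]
s = path_superstring(fragments, weights)
total_length = sum(len(f) for f in fragments)  # ||F|| = 9
print(s)                                       # ACTAG
print(len(s) == total_length - sum(weights))   # True: 5 == 9 - 4
```

Since ||F|| is fixed by the input, every unit of weight gained on the path is a unit of length saved in the superstring.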

Algorithms -- Shortest superstrings as paths
– A collection F is said to be substring-free if there are no two distinct strings a and b in F such that a is a substring of b.
– THEOREM 4.1
– COROLLARY 4.1
– LEMMA 4.1
– THEOREM 4.2

Algorithms -- The greedy algorithm
– Looking for shortest common superstrings is the same as looking for Hamiltonian paths of maximum weight in a directed multigraph.
– OM(F) → OG(F)
– A “greedy” attempt at computing the heaviest path.
– The basic idea is to continuously add the heaviest available edge.

Algorithms -- The greedy algorithm
– Edges are processed in nonincreasing order by weight.
– Three conditions have to be tested before accepting an edge into our Hamiltonian path: no accepted edge already leaves its tail, no accepted edge already enters its head, and the edge does not close a cycle.
– The procedure ends when we have exactly n-1 edges, or when the accepted edges induce a connected subgraph.
– Figure 4.16
– Example 4.5 (Figure 4.17)
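A compact Python sketch of this greedy procedure is given below. It is an illustration under stated assumptions rather than the textbook's pseudocode: fragments are error-free, of known orientation, and substring-free; only the heaviest overlap between each ordered pair is kept; and cycles are detected with a small union-find structure. All names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0


def greedy_superstring(fragments):
    n = len(fragments)
    if n == 0:
        return ""
    edges = [(overlap(fragments[i], fragments[j]), i, j)
             for i in range(n) for j in range(n) if i != j]
    edges.sort(key=lambda e: -e[0])  # nonincreasing weight, stable

    succ, pred = {}, {}
    parent = list(range(n))  # union-find forest over path components

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    accepted = 0
    for t, i, j in edges:
        if accepted == n - 1:
            break
        # Reject if i already has a successor, j already has a
        # predecessor, or the edge would close a cycle.
        if i in succ or j in pred or find(i) == find(j):
            continue
        succ[i], pred[j] = j, i
        parent[find(i)] = find(j)
        accepted += 1

    # Walk the path from its only node without a predecessor.
    i = next(k for k in range(n) if k not in pred)
    s = fragments[i]
    while i in succ:
        j = succ[i]
        s += fragments[j][overlap(fragments[i], fragments[j]):]
        i = j
    return s


print(greedy_superstring(["ACT", "CTA", "AGT"]))  # ACTAGT
```

The greedy choice is a heuristic: it always returns a common superstring, but not necessarily a shortest one.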

Algorithms -- Acyclic subgraphs
– Assembling fragments without errors and with known orientation, assuming that the fragments have been obtained from a “good sampling” of the target DNA.
– “Good sampling”: the fragments cover the entire target molecule, and the collection as a whole exhibits enough linkage to guarantee a safe assembly.
– Figure 4.18

Algorithms -- Acyclic subgraphs
– The presence of repeated regions, or repeated elements, in the target string S is related to the existence of cycles in the overlap graph.
– Cycles in an overlap graph are necessarily due to repeats in S.
– The converse is not necessarily true; that is, we may have repeats but still an acyclic overlap graph.
– THEOREM 4.5
– Algorithm: topological sorting
– Example 4.6 (Figures 4.19, 4.20 and 4.21)
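The topological-sorting idea can be sketched with Python's standard `graphlib` module (Python 3.9 or later). The sketch assumes error-free fragments of known orientation, distinct strings, and a hypothetical threshold `t` below which overlaps are ignored; it reports a cycle, which by the statement above signals a repeat, by returning `None`. It illustrates the idea and is not the procedure of Theorem 4.5 itself.

```python
from graphlib import CycleError, TopologicalSorter


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0


def assemble_acyclic(fragments, t):
    """Order fragments by topologically sorting the overlap graph that
    keeps only edges of weight >= t, then merge them in that order."""
    predecessors = {f: set() for f in fragments}
    for a in fragments:
        for b in fragments:
            if a != b and overlap(a, b) >= t:
                predecessors[b].add(a)
    try:
        order = list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return None  # a cycle hints at a repeat in the target
    s = order[0]
    for f in order[1:]:
        s += f[overlap(s, f):]
    return s


print(assemble_acyclic(["CAATGG", "ACGTTG", "TTGCAA"], 3))  # ACGTTGCAATGG
print(assemble_acyclic(["ACA", "CAC"], 2))                  # None
```

With a good sampling the thresholded graph contains a Hamiltonian path, so the topological order is the fragment order along the target; without that guarantee the order may place unrelated fragments next to each other.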

Heuristics
– None of the formalisms proposed for fragment assembly is entirely adequate.
– Fragment assembly can be viewed as a multiple alignment problem with some additional features:
– Each fragment can participate with either the direct or the reverse-complemented sequence.
– The sequences themselves are usually much shorter than the alignment itself.

Heuristics
Three criteria according to the second feature:
– Scoring
  • Entropy is a quantity that is defined on a group of relative frequencies; it is low when one of these frequencies stands out from the others, and high when they are all more or less equal.
  • The lower the entropy, the better.
– Coverage
  • A fragment covers a column i if it participates in this column either with a character or with an internal space.
– Linkage
  • The way individual fragments are linked in the layout is another determinant of layout quality.
  • Figure 4.22
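The entropy criterion can be made concrete with a short Python sketch. It assumes the Shannon entropy in bits of the relative frequencies in one alignment column, which matches the qualitative description above; the slide does not state the formula, so this choice and the name `column_entropy` are assumptions.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the character frequencies in a column."""
    counts = Counter(column)
    n = len(column)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


print(column_entropy("AAAA"))  # 0.0: one frequency dominates completely
print(column_entropy("AAAC"))  # about 0.81
print(column_entropy("ACGT"))  # 2.0: all frequencies are equal
```

A layout score could then sum the entropy over all columns, with lower totals indicating a more consistent alignment.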

Heuristics -- Assembly in practice
Practical implementations often divide the whole problem into three phases:
– Finding overlaps
– Building a layout
– Computing the consensus

Heuristics -- Assembly in practice: Finding overlaps
– The first step in any assembly problem is fragment overlap detection.
– Determine reverse complements.
– Consider fragments entirely contained in other fragments.
– Recall Section
– Figure 4.23

Heuristics -- Assembly in practice: Ordering fragments
– Finding a good ordering of fragments in a contig.
– There is no algorithm that is simple and general enough.
There are four issues to keep in mind when building paths:
– Every path has a corresponding complement path.
– It is not necessary to include contained fragments.
– Cycles usually indicate the presence of repeats.
– Unbalanced coverage may be related to repeats as well (see Figure 4.13).

Heuristics -- Assembly in practice: Alignment and consensus
– Building a layout from a path in an overlap graph.
Two techniques related to alignment construction:
– The first one helps in building a good layout from a path in the presence of errors.
  • Example 4.7
  • Implementation: Figure 4.24
– The second one focuses on locally improving an already constructed layout.
  • Example 4.8 in Figure 4.25
  • Implementation: sum-of-pairs scoring scheme