Fragment Assembly of DNA BIO/CS 471 – Algorithms for Bioinformatics.

Slides:



Advertisements
Similar presentations
CS 336 March 19, 2012 Tandy Warnow.
Advertisements

22C:19 Discrete Math Graphs Fall 2010 Sukumar Ghosh.
22C:19 Discrete Math Graphs Fall 2014 Sukumar Ghosh.
Assembling Algorithms and Techniques Upmanyu Misra Computational Issues in Molecular Biology CSE
Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.
Midwestern State University Department of Computer Science Dr. Ranette Halverson CMPS 2433 CHAPTER 4 - PART 2 GRAPHS 1.
 Graph Graph  Types of Graphs Types of Graphs  Data Structures to Store Graphs Data Structures to Store Graphs  Graph Definitions Graph Definitions.
Computability and Complexity 23-1 Computability and Complexity Andrei Bulatov Search and Optimization.
1 Discrete Structures & Algorithms Graphs and Trees: II EECE 320.
Gene Prediction: Similarity-Based Approaches (selected from Jones/Pevzner lecture notes)
Introduction to Approximation Algorithms Lecture 12: Mar 1.
Class 02: Whole genome sequencing. The seminal papers ``Is Whole Genome Sequencing Feasible?'' ``Whole-Genome DNA.
NP-Complete Problems Reading Material: Chapter 10 Sections 1, 2, 3, and 4 only.
The Theory of NP-Completeness
DNA Fragment Assembly CIS 667 Spring 2004 February 18.
CSE 421 Algorithms Richard Anderson Lecture 27 NP Completeness.
9-1 Chapter 9 Approximation Algorithms. 9-2 Approximation algorithm Up to now, the best algorithm for solving an NP-complete problem requires exponential.
Approximation Algorithms Motivation and Definitions TSP Vertex Cover Scheduling.
Genome Assembly Charles Yan Fragment Assembly Given a large number of fragments, such as ACC AC AT AC AT GG …, the goal is to figure out the original.
Approximation Algorithms
1 Sequencing and Sequence Assembly --overview of the genome sequenceing process Presented by NIE, Lan CSE497 Feb.24, 2004.
22C:19 Discrete Math Graphs Spring 2014 Sukumar Ghosh.
De-novo Assembly Day 4.
The Theory of NP-Completeness 1. Nondeterministic algorithms A nondeterminstic algorithm consists of phase 1: guessing phase 2: checking If the checking.
1 Introduction to Approximation Algorithms. 2 NP-completeness Do your best then.
Physical Mapping of DNA Shanna Terry March 2, 2004.
DNA Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 8, 2005 ChengXiang Zhai Department of Computer Science University of Illinois,
CS 394C March 19, 2012 Tandy Warnow.
Nattee Niparnan. Easy & Hard Problem What is “difficulty” of problem? Difficult for computer scientist to derive algorithm for the problem? Difficult.
Graphs and DNA sequencing CS 466 Saurabh Sinha. Three problems in graph theory.
1 Introduction to Approximation Algorithms. 2 NP-completeness Do your best then.
GRAPHS CSE, POSTECH. Chapter 16 covers the following topics Graph terminology: vertex, edge, adjacent, incident, degree, cycle, path, connected component,
Graph Theory And Bioinformatics Jason Wengert. Outline Introduction to Graphs Eulerian Paths & Hamiltonian Cycles Interval Graph & Shape of Genes Sequencing.
Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake.
Physical Mapping of DNA BIO/CS 471 – Algorithms for Bioinformatics.
Approximation Algorithms
CSC 413/513: Intro to Algorithms NP Completeness.
Polynomial-time reductions We have seen several reductions:
Combinatorial Optimization Problems in Computational Biology Ion Mandoiu CSE Department.
Introduction to Modeling and Algorithms in Life Sciences Ananth Grama Purdue University
Fragment assembly of DNA A typical approach to sequencing long DNA molecules is to sample and then sequence fragments from them.
Chap. 4 FRAGMENT ASSEMBLY OF DNA Introduction to Computational Molecular Biology Chapter 4.
Greedy Algorithms for the Shortest Common Superstring Overview by Anton Nesterov Saint Petersburg State University Russia Original paper by A. Frieze,
CSE 589 Part VI. Reading Skiena, Sections 5.5 and 6.8 CLR, chapter 37.
Class 01 – Fragment assembly. DNA sequence data DNA sequence data is the motherlode of molecular biology. 10^10 base pairs. One human genome/year. It.
CS 3343: Analysis of Algorithms Lecture 25: P and NP Some slides courtesy of Carola Wenk.
Fragment Assembly 蔡懷寬 We would like to know the Target DNA sequence.
Strings Basic data type in computational biology A string is an ordered succession of characters or symbols from a finite set called an alphabet Sequence.
LIMITATIONS OF ALGORITHM POWER
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
NP Completeness Piyush Kumar. Today Reductions Proving Lower Bounds revisited Decision and Optimization Problems SAT and 3-SAT P Vs NP Dealing with NP-Complete.
Analyzing Sequences. Sequences: An Evolutionary Perspective Evolution occurs through a set of modifications to the DNA These modifications include point.
The Theory of NP-Completeness 1. Nondeterministic algorithms A nondeterminstic algorithm consists of phase 1: guessing phase 2: checking If the checking.
Introduction to NP Instructor: Neelima Gupta 1.
ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads
Review: Graph Theory in Bioinformatics Yunkai Liu Assistant Professor Computer Science Department University of South Dakota.
COSC 3101A - Design and Analysis of Algorithms 14 NP-Completeness.
ICS 353: Design and Analysis of Algorithms NP-Complete Problems King Fahd University of Petroleum & Minerals Information & Computer Science Department.
DNA Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics)
More NP-Complete and NP-hard Problems
Richard Anderson Lecture 26 NP-Completeness
Introduction to Genome Assembly
CS 598AGB Genome Assembly Tandy Warnow.
ICS 353: Design and Analysis of Algorithms
Graph Theory.
Richard Anderson Lecture 25 NP-Completeness
Richard Anderson Lecture 28 NP-Completeness
CSE 589 Applied Algorithms Spring 1999
Richard Anderson Lecture 27 Survey of NP Complete Problems
Fragment Assembly 7/30/2019.
Presentation transcript:

Fragment Assembly of DNA BIO/CS 471 – Algorithms for Bioinformatics

Fragment Assembly2 Limitations to sequencing  You must have a primer of known sequence to initiate PCR  Only about 1000nts can be sequenced in a single reaction  The sequencing process is slow, so it is beneficial to do as much in parallel as possible Primer hopping Shotgun approach

Fragment Assembly3 Shotgun Sequencing

Fragment Assembly4 The Ideal Case  Find maximal overlaps between fragments: ACCGT CGTGC TTAC TACCGT --ACCGT CGTGC TTAC TACCGT— TTACCGTGC Consensus sequence determined by vote

Fragment Assembly5 Quality Metrics  The coverage at position i of the target or consensus sequence is the number of fragments that overlap that position  Two contigs No coverage Target:

Fragment Assembly6 Quality Metrics  Linkage – the degree of overlap between fragments Target: Perfect coverage, poor average linkage poor minimum linkage

Fragment Assembly7 Real World Complications  Base call errors  Chimeric fragments, contamination (e.g. from the vector) --ACCGT CGTGC TTAC TGCCGT— TTACCGTGC --ACC-GT CAGTGC TTAC TACC-GT— TTACC-GTGC --ACCGT CGTGC TTAC TAC-GT— TTACCGTGC Base Call ErrorDeletion ErrorInsertion Error

Fragment Assembly8 Unknown Orientation A fragment can come from either strand CACGT ACGT ACTACG GTACT ACTGA CTGA  CACGT  -ACGT  --CGTAGT  -----AGTAC  ACTGA  CTGA

Fragment Assembly9 Repeats  Direct repeats AXBXCXD AXCXBXD

Fragment Assembly10 Repeats  Direct repeats AXBYCXDYE AXDYCXBYE

Fragment Assembly11 Repeats  Inverted repeats XX XX

Fragment Assembly12 Sequence Alignment Models  Shortest common superstring Input: A collection, F, of strings (fragments) Output: A shortest possible string S such that for every f  F, S is a superstring of f.  Example: F = { ACT, CTA, AGT } S = ACTAGT

Fragment Assembly13 Problems with the SCS model xx xx´x´  Directionality of fragments must be known  No consideration of coverage  Some simple consideration of linkage  No consideration of base call errors

Fragment Assembly14 Reconstruction  Deals with errors and unknown orientation  Definitions f is an approximate substring of S at error level  when d s (f, S)    | f | d s = substring edit distance:  Reconstruction Input: A collection, F, of strings, and a tolerance level,  Output: Shortest possible string, S, such that for every f  F : Match = 0 Mismatch = 1 Gap = 1

Fragment Assembly15 Reconstruction Example  Input: F = { ATCAT, GTCG, CGAG, TACCA }  = 0.25  Output: ATGAT CGAC -CGAG ----TACCA ACGATACGAC ATCAT GTCG d s ( CGAG, ACGATACGAC ) = 1 = 0.25  4 So this output is OK for  = 0.25

Fragment Assembly16 Gaps in Reconstruction  Reconstruction allows gaps in fragments: AT-GA----- ATCGATAGAC d s = 1

Fragment Assembly17 Limitations of Reconstruction  Models errors and unknown orientation  Doesn’t handle repeats  Doesn’t model coverage  Only handles linkage in a very simple way  Always produces a single contig

Fragment Assembly18 Contigs  Sometimes you just can’t put all of the fragments together into one contiguous sequence: No way to tell the order of these two contigs. ? No way to tell how much sequence is missing between them.

Fragment Assembly19 Multicontig  Definitions A layout, L, is a multiple alignment of the fragments  Columns numbered from 1 to | L | Endpoints of a fragment: l(f) and r(f) An overlap is a link is no other fragment completely covers the overlap LinkNot a link

Fragment Assembly20 Multicontig  More definitions The size of a link is the number of overlapping positions The weakest link is the smallest link in the layout A t-contig has a weakest link of size t A collection, F, admits a t-contig if a t-contig can be constructed from the fragments in F ACGTATAGCATGA GTA CATGATCA ACGTATAG GATCA A link of size 5

Fragment Assembly21 Perfect Multicontig  Input: F, and t  Output: a minimum number of collections, C i, such that every C i admits a t-contig Let F = { GTAC, TAATG, TGTAA } --TAATG TGTAA-- GTAC t = 3 TGTAA TAATG GTAC t = 1

Fragment Assembly22 Handling errors in Multicontig  The image of a fragment is the portion of the consensus sequence, S, corresponding to the fragment in the layout  S is an  -consensus for a collection of fragments when the edit distance from each fragment, f, and its image is at most   | f | TATAGCATCAT CGTC CATGATCA ACGGATAG GTCCA ACGTATAGCATGATCA An  -consensus for  = 0.4

Fragment Assembly23 Definition of Multicontig  Input: A collection, F, of strings, an integer t  0, and an error tolerance  between 0 and 1  Output: A partition of F into the minimum number of collections C i such that every C i admits a t-contig with an  -consensus

Fragment Assembly24 Example of Multicontig  Let  = 0.4, t = 3 TATAGCATCAT ACGTC CATGATCAG ACGGATAG GTCCAG ACGTATAGCATGATCAG

Fragment Assembly25 Algorithms  Most of the algorithms to solve the fragment assembly problem are based on a graph model  A graph, G, is a collection of edges, e, and vertices, v. Directed or undirected Weighted or unweighted  We will discuss representations and other issues shortly… A directed, unweighted graph

Fragment Assembly26 The Maximum Overlap Graph  The text calls it an overlap multigraph  Each directed edge, (u,v) is weighted with the length of the maximal overlap between a suffix of u and a prefix of v a b d c TACGA CTAAAG ACCC GACA weight edges omitted!

Fragment Assembly27 Paths and Layouts  The path dbc leads to the alignment: a b d c TACGA CTAAAG ACCC GACA GACA ACCC CTAAAG

Fragment Assembly28 Superstrings  Every path that covers every node is a superstring  Zero weight edges result in alignments like:  Higher weights produce more overlap, and thus shorter strings  The shortest common superstring is the highest weight path that covers every node GACA GCCC TTAAAG

Fragment Assembly29 Graph formulation of SCS  Input: A weighted, directed graph  Output: The highest-weight path that touches every node of the graph Does this problem sound familiar?

Fragment Assembly30 The Greedy Algorithm Algorithm greedy Sort edges in increasing weight order For each edge in this order If the edge does not form a cycle and the edge does not start or end at the same node as another edge in the set then add the edge to the current set End for End Algorithm Figure 4.16, page 125

Fragment Assembly31 Greedy Example

Fragment Assembly32 Greedy does not always find the best path ATGCTGCAT GCC 0

Fragment Assembly33 Tools for Shotgun Sequencing

Fragment Assembly34 Common Difficulty  Each of these problems is a method for modeling fragment assembly  Each of these problems is provably intractable  How?

Fragment Assembly35 Embedding problems  Suppose I told you that I had found a clever way to model the TSP as a shortest common superstring problem Paths between cities are represented as fragments The shortest path is the shortest common superstring of the fragments  If this is true, then there are only two possibilities: 1.This problem is just as intractable as TSP 2.TSP is actually a tractable problem!

Fragment Assembly36 NP-Complete Problems  There is a collection of problems that computer scientists believe to be intractable TSP is one of them  Each of them has been modeled as one or more of the other NP-complete problems  If you solve one, you solve them all  A problem, p, is NP-hard if you can model one of these NP-complete problems as an instance of p

Fragment Assembly37 NP-Completeness TSPP NP 3-SAT Graph 3-coloring Vertex cover Subset sum Set packing Bin packing

Fragment Assembly38 P = NP? NP 3-SAT Graph 3-coloring Vertex cover Subset sum Set packing Bin packing P NP