DNA Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 8, 2005 ChengXiang Zhai Department of Computer Science University of Illinois,

Slides:



Advertisements
Similar presentations
CS 336 March 19, 2012 Tandy Warnow.
Advertisements

Longest Common Subsequence
Introduction to Graph Theory Instructor: Dr. Chaudhary Department of Computer Science Millersville University Reading Assignment Chapter 1.
Midwestern State University Department of Computer Science Dr. Ranette Halverson CMPS 2433 – CHAPTER 4 GRAPHS 1.
Graph Algorithms in Bioinformatics. Outline Introduction to Graph Theory Eulerian & Hamiltonian Cycle Problems Benzer Experiment and Interval Graphs DNA.
Introduction to Bioinformatics Algorithms Graph Algorithms in Bioinformatics.
Chapter 8: Graph Algorithms July/23/2012 Name: Xuanyu Hu Professor: Elise de Doncker.
Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.
Combinatorial Algorithms
Combinatorial Pattern Matching CS 466 Saurabh Sinha.
1 Discrete Structures & Algorithms Graphs and Trees: II EECE 320.
Analysis of Algorithms CS 477/677
Approximation Algorithms for the Traveling Salesperson Problem.
Computational Genomics and Proteomics Sequencing genomes.
Genome Assembly Charles Yan Fragment Assembly Given a large number of fragments, such as ACC AC AT AC AT GG …, the goal is to figure out the original.
Sequencing tutorial Peter HANTZ EMBL Heidelberg.
CS 6030 – Bioinformatics Summer II 2012 Jason Eric Johnson
Introduction to Bioinformatics Algorithms Graph Algorithms in Bioinformatics.
GRAPH Learning Outcomes Students should be able to:
Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.
Theory of Computing Lecture 10 MAS 714 Hartmut Klauck.
Physical Mapping of DNA Shanna Terry March 2, 2004.
Sequence Assembly: Concepts BMI/CS 576 Sushmita Roy September 2012 BMI/CS 576.
Graphs and DNA sequencing CS 466 Saurabh Sinha. Three problems in graph theory.
Pairwise Sequence Alignment (I) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 22, 2005 ChengXiang Zhai Department of Computer Science University.
1 Introduction to Approximation Algorithms. 2 NP-completeness Do your best then.
Lecture 3 1.Protein Function prediction using network concepts 2.Application of network concepts in DNA sequencing.
394C March 5, 2012 Introduction to Genome Assembly.
Graph Theory And Bioinformatics Jason Wengert. Outline Introduction to Graphs Eulerian Paths & Hamiltonian Cycles Interval Graph & Shape of Genes Sequencing.
Sequence Assembly Fall 2015 BMI/CS 576 Colin Dewey
Fuzzypath – Algorithms, Applications and Future Developments
Sequence Assembly BMI/CS 576 Fall 2010 Colin Dewey.
CSE 326: Data Structures NP Completeness Ben Lerner Summer 2007.
JM - 1 Introduction to Bioinformatics: Lecture III Genome Assembly and String Matching Jarek Meller Jarek Meller Division of Biomedical.
Combinatorial Optimization Problems in Computational Biology Ion Mandoiu CSE Department.
Introduction to Bioinformatics Algorithms Graph Algorithms in Bioinformatics.
Assembly of Paired-end Solexa Reads by Kmer Extension using Base Qualities Zemin Ning The Wellcome Trust Sanger Institute.
CSE 589 Part VI. Reading Skiena, Sections 5.5 and 6.8 CLR, chapter 37.
Gene Prediction: Similarity-Based Methods (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 15, 2005 ChengXiang Zhai Department of Computer Science.
Approximation Algorithms for TSP Tsvi Kopelowitz 1.
Introduction to Graphs. This Lecture In this part we will study some basic graph theory. Graph is a useful concept to model many problems in computer.
Hashing Algorithm and its Applications in Bioinformatics By Zemin Ning Informatics Division The Wellcome Trust Sanger Institute.
Sequencing tutorial Peter HANTZ EMBL Heidelberg.
CIRCUITS, PATHS, AND SCHEDULES Euler and Königsberg.
Lecture 52 Section 11.2 Wed, Apr 26, 2006
A new Approach to Fragment Assembly in DNA Sequenceing Fei wu April,24,2006.
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
Introduction to Graph Theory
ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads
Review: Graph Theory in Bioinformatics Yunkai Liu Assistant Professor Computer Science Department University of South Dakota.
COSC 3101A - Design and Analysis of Algorithms 14 NP-Completeness.
1 Euler and Hamilton paths Jorge A. Cobb The University of Texas at Dallas.
Grade 11 AP Mathematics Graph Theory Definition: A graph, G, is a set of vertices v(G) = {v 1, v 2, v 3, …, v n } and edges e(G) = {v i v j where 1 ≤ i,
Graph Algorithms © Jones and Pevzner © Robert Simons
CSCI2950-C Lecture 2 DNA Sequencing and Fragment Assembly
Short reads: 50 to 150 nt (nucleotide)
DNA Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics)
CSCI2950-C Genomes, Networks, and Cancer
CSCI2950-C Lecture 3 DNA Sequencing and Fragment Assembly
CS296-5 Genomes, Networks, and Cancer
Eulerian tours Miles Jones MTThF 8:30-9:50am CSE 4140 August 15, 2016.
CSE 5290: Algorithms for Bioinformatics Fall 2011
Graph Algorithms in Bioinformatics
ICS 353: Design and Analysis of Algorithms
Genome Assembly.
Graph Algorithms in Bioinformatics
Graph Algorithms in Bioinformatics
CSE 5290: Algorithms for Bioinformatics Fall 2009
Fragment Assembly 7/30/2019.
Presentation transcript:

DNA Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 8, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign Many slides are taken/adapted from

Outline The Basic Shotgun Sequencing Strategy The shortest superstring problem –Graph algorithms Sequencing by Hybridization

The Basic Shotgun Sequencing Strategy Step 1: Fragment Sequencing cut many times at random (Shotgun) genomic segment Get one or two reads from each segment ~500 bp

The Basic Shotgun Sequencing Strategy Step 2: Fragment Assembly Cover region with ~7-fold redundancy Overlap reads and extend to reconstruct the original genomic region reads

Step 1: Generating Read (Sanger Method) 1. Start at primer (restriction site) 2. Grow DNA chain 3. Include ddNTPs 4. Stops reaction at all possible points 5. Separate products by length, using gel electrophoresis TAA... T …T

Step 2: Shortest Superstring Problem Problem: Given a set of strings, find a shortest string that contains all of them Input: Strings s 1, s 2,…., s n Output: A string s that contains all strings s 1, s 2,…., s n as substrings, such that the length of s is minimized Complexity: NP – complete How likely is the found s indeed the original genome? s approaches the genome as n  if no sequencing error and fragmentation is random

Shortest Superstring Problem: Example How do we solve such a problem efficiently? - Greedy algorithms (approximation) - Efficient algorithms exist for special cases and are related to “graph algorithms”

A Greedy Algorithm for SSP For each pair of (segment) strings, compute an overlap score Merge the pair with the highest score Repeat until no more strings can be merged If multiple strings are left, any concatenation of them would be a solution. Think about an example when this algorithm is not optimal…

A Special Case of SSP When each segment is an L-mer (L-gram), linear algorithm exists! This makes it attractive to do “sequencing by hybridization”…

Sequencing By Hybridization Attach all possible DNA probes of length l (e.g., l = 8) to a flat surface, each probe at a distinct and known location. This set of probes is called the DNA array. Apply a solution containing fluorescently labeled DNA fragment to the array. The DNA fragment hybridizes with those probes that are complementary to substrings of length l of the fragment, allowing us to see which l-mers match the DNA fragment

Hybridization on DNA Array

l-mer composition Define Spectrum ( s, l ) as the unordered multiset of all possible (n – l + 1) l-mers in a string s of length n The order of individual elements in Spectrum ( s, l ) does not matter

l-mer composition For example, for s = TATGGTGC all of the following are equivalent representations of Spectrum ( s, 3 ): {TAT, ATG, TGG, GGT, GTG, TGC} {ATG, GGT, GTG, TAT, TGC, TGG} {TGG, TGC, TAT, GTG, GGT, ATG}

l-mer composition For example, for s = TATGGTGC all of the following are equivalent representations of Spectrum ( s, 3 ): {TAT, ATG, TGG, GGT, GTG, TGC} {ATG, GGT, GTG, TAT, TGC, TGG} {TGG, TGC, TAT, GTG, GGT, ATG} We usually choose the lexicographically maximal representation as the canonical one.

Different sequences – the same spectrum Different sequences may have the same spectrum: Spectrum(GTATCT,2)= Spectrum(GTCTAT,2)= {AT, CT, GT, TA, TC}

The SBH Problem Goal: Reconstruct a string from its l-mer composition Input: A set S, representing all l-mers from an (unknown) string s Output: String s such that Spectrum ( s,l ) = S and the length of s is minimum How likely is the found s indeed the original genome? s approaches the genome as l  if no sequencing error

How do we solve the SBH problem efficiently? The solution is related to graph algorithms…

Detour … Graph Algorithms

The Bridge Obsession Problem Bridges of Königsberg Find a tour crossing every bridge just once Leonhard Euler, 1735

Formalization of Königsberg Bridge Problem: Graph & Eulerian Cycle Graph G=(V,E) –V= Vertices; E= Edges Eulerian cycle: A cycle that visits every edge exactly once Linear time algorithm exists More complicated Königsberg

Hamiltonian Cycle Problem Find a cycle that visits every vertex exactly once NP – complete Game invented by Sir William Hamilton in 1857

Balanced Graphs A graph is balanced if for every vertex the number of incoming edges equals to the number of outgoing vertices: in(v)=out(v)

Euler Theorem A graph is balanced if for every vertex the number of incoming edges equals to the number of outgoing vertices: in(v)=out(v) Theorem: A connected graph is Eulerian if and only if each of its vertices is balanced.

Euler Theorem: Proof Eulerian  balanced for every edge entering v (incoming edge) there exists an edge leaving v (outgoing edge). Therefore in(v)=out(v) balanced  Eulerian ???

Algorithm for Constructing an Eulerian Cycle a. Start with an arbitrary vertex v and form an arbitrary cycle with unused edges until a dead end is reached. Since the graph is Eulerian this dead end is necessarily the starting point, i.e., vertex v.

Algorithm for Constructing an Eulerian Cycle (cont’d) b. If cycle from (a) above is not an Eulerian cycle, it must contain a vertex w, which has untraversed edges. Perform step (a) again, using vertex w as the starting point. Once again, we will end up in the starting vertex w.

Algorithm for Constructing an Eulerian Cycle (cont’d) c.Combine the cycles from (a) and (b) into a single cycle and iterate step (b).

Euler Theorem: Extension Theorem: A connected graph has an Eulerian path if and only if it contains at most two semi- balanced vertices and all other vertices are balanced.

End of Detour… Let’s see how graph algorithms can help DNA sequencing…

Reducing SSP to TSP (Traveling Salesman Problem) Define overlap ( s i, s j ) as the length of the longest prefix of s j that matches a suffix of s i. Construct a graph with n vertices representing the n strings s 1, s 2,…., s n. Insert edges of length overlap ( s i, s j ) between vertices s i and s j. Find the shortest path which visits every vertex exactly once. This is the Traveling Salesman Problem (TSP), which is also NP – complete.

Reducing SSP to TSP (cont’d)

SSP to TSP: An Example S = { ATC, CCA, CAG, TCC, AGT } SSP AGT CCA ATC ATCCAGT TCC CAG ATCCAGT TSP ATC CCA TCC AGT CAG

SBH: Hamiltonian Path Approach S = { ATG AGG TGC TCC GTC GGT GCA CAG } Path visited every VERTEX once ATGAGGTGCTCC H GTC GGT GCACAG ATGCAGGTCC

SBH: Hamiltonian Path Approach A more complicated graph: S = { ATG TGG TGC GTG GGC GCA GCG CGT }

SBH: Hamiltonian Path Approach S = { ATG TGG TGC GTG GGC GCA GCG CGT } Path 1: ATGCGTGGCA ATGGCGTGCA Path 2:

SBH: Eulerian Path Approach S = { ATG, TGC, GTG, GGC, GCA, GCG, CGT } Vertices correspond to ( l – 1 ) – mers : { AT, TG, GC, GG, GT, CA, CG } Edges correspond to l – tuples from S AT GT CG CA GC TG GG Path visited every EDGE once

SBH: Eulerian Path Approach S = { AT, TG, GC, GG, GT, CA, CG } corresponds to two different paths: ATGGCGTGCA ATGCGTGGCA AT TG GC CA GG GT CG AT GT CG CA GCTG GG

Some Difficulties with SBH Fidelity of Hybridization: difficult to detect differences between probes hybridized with perfect matches and 1 or 2 mismatches Array Size: Effect of low fidelity can be decreased with longer l-mers, but array size increases exponentially in l. Array size is limited with current technology. Practicality: SBH is still impractical. As DNA microarray technology improves, SBH may become practical in the future

The Problem of Repeats Repeats: A major problem for fragment assembly > 50% of human genome are repeats: - over 1 million Alu repeats (about 300 bp) - about 200,000 LINE repeats (1000 bp and longer) Repeat Green and blue fragments are interchangeable when assembling repetitive DNA We’ll talk about how to deal with this in the next lecture….

What You Should Know The basic idea of shotgun sequencing The shortest superstring problem formulation and its limitation The reduction of the shortest superstring problem to graph traversal problems Eulerian path algorithm