Sequence Assembly BMI/CS 576 Fall 2010 Colin Dewey.

Slides:



Advertisements
Similar presentations
CSE 211 Discrete Mathematics
Advertisements

Introduction to Graph Theory Instructor: Dr. Chaudhary Department of Computer Science Millersville University Reading Assignment Chapter 1.
22C:19 Discrete Math Graphs Fall 2014 Sukumar Ghosh.
Graph Algorithms in Bioinformatics. Outline Introduction to Graph Theory Eulerian & Hamiltonian Cycle Problems Benzer Experiment and Interval Graphs DNA.
Introduction to Graphs
Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.
Data Structures Using C++
Combinatorial Algorithms
C++ Programming: Program Design Including Data Structures, Third Edition Chapter 21: Graphs.
Alignment Problem (Optimal) pairwise alignment consists of considering all possible alignments of two sequences and choosing the optimal one. Sub-optimal.
Genome Sequence Assembly: Algorithms and Issues Fiona Wong Jan. 22, 2003 ECS 289A.
Class 02: Whole genome sequencing. The seminal papers ``Is Whole Genome Sequencing Feasible?'' ``Whole-Genome DNA.
Approximation Algorithms for the Traveling Salesperson Problem.
CS273a Lecture 4, Autumn 08, Batzoglou Hierarchical Sequencing.
Graphs. Graph A “graph” is a collection of “nodes” that are connected to each other Graph Theory: This novel way of solving problems was invented by a.
Genome Assembly Charles Yan Fragment Assembly Given a large number of fragments, such as ACC AC AT AC AT GG …, the goal is to figure out the original.
CS 6030 – Bioinformatics Summer II 2012 Jason Eric Johnson
Introduction to Bioinformatics Algorithms Graph Algorithms in Bioinformatics.
GRAPH Learning Outcomes Students should be able to:
De-novo Assembly Day 4.
Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.
Data Structures Using C++ 2E
Theory of Computing Lecture 10 MAS 714 Hartmut Klauck.
Physical Mapping of DNA Shanna Terry March 2, 2004.
Chapter 9 – Graphs A graph G=(V,E) – vertices and edges
Sequence Assembly: Concepts BMI/CS 576 Sushmita Roy September 2012 BMI/CS 576.
DNA Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 8, 2005 ChengXiang Zhai Department of Computer Science University of Illinois,
CS 394C March 19, 2012 Tandy Warnow.
SPANNING TREES Lecture 21 CS2110 – Spring
Graphs and DNA sequencing CS 466 Saurabh Sinha. Three problems in graph theory.
Chapter 2 Graph Algorithms.
Sequence assembly using paired- end short tags Pramila Ariyaratne Genome Institute of Singapore SOC-FOS-SICS Joint Workshop on Computational Analysis of.
394C March 5, 2012 Introduction to Genome Assembly.
Graph Theory And Bioinformatics Jason Wengert. Outline Introduction to Graphs Eulerian Paths & Hamiltonian Cycles Interval Graph & Shape of Genes Sequencing.
Algorithms and Running Time Algorithm: Well defined and finite sequence of steps to solve a well defined problem. Eg.,, Sequence of steps to multiply two.
Sequence Assembly Fall 2015 BMI/CS 576 Colin Dewey
A Sequenciação em Análises Clínicas Polymerase Chain Reaction.
Combinatorial Optimization Problems in Computational Biology Ion Mandoiu CSE Department.
Fragment assembly of DNA A typical approach to sequencing long DNA molecules is to sample and then sequence fragments from them.
Week 11 - Monday.  What did we talk about last time?  Binomial theorem and Pascal's triangle  Conditional probability  Bayes’ theorem.
CSE 589 Part VI. Reading Skiena, Sections 5.5 and 6.8 CLR, chapter 37.
Human Genome.
WK15. Vertex Cover and Approximation Algorithm By Lin, Jr-Shiun Choi, Jae Sung.
SPANNING TREES Lecture 20 CS2110 – Fall Spanning Trees  Definitions  Minimum spanning trees  3 greedy algorithms (incl. Kruskal’s & Prim’s)
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley. Ver Chapter 13: Graphs Data Abstraction & Problem Solving with C++
© 2006 Pearson Addison-Wesley. All rights reserved 14 A-1 Chapter 14 Graphs.
A new Approach to Fragment Assembly in DNA Sequenceing Fei wu April,24,2006.
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
Chapter 20: Graphs. Objectives In this chapter, you will: – Learn about graphs – Become familiar with the basic terminology of graph theory – Discover.
Graphs. Graph Definitions A graph G is denoted by G = (V, E) where  V is the set of vertices or nodes of the graph  E is the set of edges or arcs connecting.
ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads
Review: Graph Theory in Bioinformatics Yunkai Liu Assistant Professor Computer Science Department University of South Dakota.
COSC 3101A - Design and Analysis of Algorithms 14 NP-Completeness.
Grade 11 AP Mathematics Graph Theory Definition: A graph, G, is a set of vertices v(G) = {v 1, v 2, v 3, …, v n } and edges e(G) = {v i v j where 1 ≤ i,
CSE 589 Applied Algorithms Spring 1999 Prim’s Algorithm for MST Load Balance Spanning Tree Hamiltonian Path.
Title: Studying whole genomes Homework: learning package 14 for Thursday 21 June 2016.
Short reads: 50 to 150 nt (nucleotide)
DNA Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics)
CSCI2950-C Genomes, Networks, and Cancer
Greedy Technique.
Genome sequence assembly
Graph theory Definitions Trees, cycles, directed graphs.
Eulerian tours Miles Jones MTThF 8:30-9:50am CSE 4140 August 15, 2016.
Graph Algorithm.
Graphs Chapter 13.
Genome Assembly.
Graph Algorithms in Bioinformatics
CSE 5290: Algorithms for Bioinformatics Fall 2009
Fragment Assembly 7/30/2019.
Presentation transcript:

Sequence Assembly BMI/CS 576 Fall 2010 Colin Dewey

The sequencing problem We want to determine the identity of the base pairs that make up: – A single large molecule of DNA – The genome of a single cell/individual organism – The genome of a species But we can’t (currently) “read” off the sequence of an entire molecule all at once

The strategy: substrings We do have the ability to read or detect short pieces (substrings) of DNA – Sanger sequencing: bp/read – Hybridization arrays: 8-30bp/probe – Latest technologies: 454 Genome Sequencer FLX: bp/read Illumina Genome Analyzer: bp/read

Sanger sequencing Classic sequencing technique: “Chain-termination method” Replication terminated by inclusion of dideoxynucleotide (ddNTP) ACTACGCATCTAGCTACGTACGTACGTACGTTAGCTGAC TGATGCGTAGAT tag primer CGATGC DNA polymerase ACTACGCATCTAGCTACGTACGTACGTACGTTAGCTGAC TGATGCGTAGAT CGATGCA ddATP

Sequencing gels Run replication in four separate test tubes – Each with one of some concentration of either ddATP, ddTTP, ddGTP, or ddCTP Depending on when ddNTP is included, different length fragments are synthesized Fragments separated by length with electrophoresis gel Sequence can be read from bands on gel

Universal DNA arrays Array with all possible oligonucleotides (short DNA sequence) of a certain length as probes Sample is labeled and then washed over array Hybridization is detected from labels ATCGAAATCGAA chip probe ATCGACATCGAC ATCGAGATCGAG ATCGATATCGAT ATCGCAATCGCA ATCGCCATCGCC … ACTAGCGTCACTAGCGTC tag

Reading a DNA array

Latest technologies 454: – “Sequencing by synthesis” – Light emitted and detected on addition of a nucleotide by polymerase – Mb / 10 hour run Illumina – Also “sequencing by synthesis” – up to ~6.5 Gb/day – Uses fluorescently-labeled reversible nucleotide terminators – Like Sanger, but detects added nucleotides with laser after each step

Two sequencing paradigms 1.Fragment assembly – For technologies that produce “reads” Sanger, 454, Illumina, SOLiD, etc. 2.Spectral assembly – For technologies that produce “spectra” Universal DNA arrays The two paradigms are actually closely related

Shotgun Sequencing Fragment Assembly Multiple copies of sample DNA Randomly fragment DNA Sequence sample of fragments Assemble reads

The fragment assembly problem Given: A set of reads (strings) {s 1, s 2, …, s n } Do: Determine a large string s that “best explains” the reads What do we mean by “best explains”? What assumptions might we require?

Shortest superstring problem Objective: Find a string s such that – all reads s 1, s 2, …, s n are substrings of s – s is as short as possible Assumptions: – Reads are 100% accurate – Identical reads must come from the same location on the genome – “best” = “simplest”

Shortest superstring example Reads: {ACG, CGA, CGC, CGT, GAC, GCG, GTA, TCG} What is the shortest superstring you can come up with? TCGACGCGTA (length 10)

Algorithms for shortest substring problem This problem turns out to be NP-complete Simple greedy strategy: while # strings > 1 do merge two strings with maximum overlap loop Conjectured to give string with length ≤ 2 × minimum length “2-approximation” Other algorithms will require graph theory…

Graph Basics A graph (G) consists of vertices (V) and edges (E) G = (V,E) Edges can either be directed (directed graphs) or undirected (undirected graphs)

Vertex degrees The degree of a vertex: the # of edges incident to that vertex For directed graphs, we also have the notion of – indegree: The number incoming edges – outdegree: The number of outgoing edges degree(v 2 ) = 3 indegree(v 2 ) = 1 outdegree(v 2 ) = 2

Overlap graph For a set of sequence reads S, construct a directed weighted graph G = (V,E,w) – with one vertex per read (v i corresponds to s i ) – edges between all vertices (a complete graph) – w(v i,v j ) = overlap(s i,s j ) = length of longest suffix of s i that is a prefix of s j

Overlap graph example Let S = {AGA, GAT, TCG, GAG} AGA GAT TCG GAG

Assembly as Hamiltonian Path Hamiltonian Path: path through graph that visits each vertex exactly once AGA GAT TCG GAG Path: AGAGATCG

Shortest superstring as TSP minimize superstring length  minimize hamiltonian path length in overlap graph with edge weights negated This is essentially the Traveling Salesman Problem (also NP-complete) AGA GAT TCG GAG Path: GAGATCG Path length: -5 String length: 7

The Greedy Algorithm Let G be a graph with fragments as vertices, and no edges to start Create a queue, Q, of overlap edges, with edges in order of increasing weight While G is disconnected – Pop the next possible edge e = (u,v) off of Q – If outdegree(u) = 0 and indegree(v) = 0 and e does not create a cycle Add e to G

Greedy Algorithms Definition: An algorithm that always takes the best immediate, or local, solution while finding an answer. Greedy algorithms find the overall, or globally, optimal solution for some optimization problems, but may find less-than-optimal solutions for some instances of other problems. Paul E. Black, "greedy algorithm", in Dictionary of Algorithms and Data Structures [online], Paul E. Black, ed., U.S. National Institute of Standards and Technology. 2 February

Greedy Algorithm Examples Prim’s Algorithm for Minimum Spanning Tree – Minimum spanning tree: a set of n-1 edges that connects a graph of n vertices and that has minimal total weight – Prim’s algorithm adds the edge with the smallest weight at each step Proven to give an optimal solution Traveling Salesman Problem – Greedy algorithm chooses to visit closest vertex at each step – Can give far-from-optimal answers

Sequencing by Hybridization (SBH) SBH array has probes for all possible k-mers For a given DNA sample, array tells us whether each k-mer is PRESENT or ABSENT in the sample The set of all k-mers present in a string s is called its spectrum Example: – s = ACTGATGCAT – spectrum(s, 3) = {ACT, ATG, CAT, CTG, GAT, GCA, TGA, TGC}

Example DNA Array AAACAGATCACCCGCTGAGCGGGTTATCTGTT AA ACX AG ATX CA CC CG CTX GAX GCX GG GT TA TC TGXX TT Sample: ACTGATGCAT Spectrum (k=4): {ACTG, ATGC, CTGA,GATG, GCAT,TGAT, TGCA}

SBH Problem Given: A set S of k-mers Do: Find a string s, such that spectrum(s,k) = S {ACT, ATG, CAT, CTG, GAT, GCA, TGA, TGC} ?

SBH as Eulerian path Could use Hamiltonian path approach, but not useful due to NP-completeness Instead, use Eulerian path approach Eulerian path: A path through a graph that traverses every edge exactly once Construct graph with all (k-1)-mers as vertices For each k-mer in spectrum, add edge from vertex representing first k-1 characters to vertex representing last k-1 characters

Properties of Eulerian graphs It will be easier to consider Eulerian cycles: Eulerian paths that form a cycle Graphs that have an Eulerian cycle are simply called Eulerian Theorem: A connected graph is Eulerian if and only if each of its vertices are balanced A vertex v is balanced if indegree(v) = outdegree(v) There is a polynomial-time algorithm for finding Eulerian cycles!

Seven Bridges of Königsberg Euler answered the question: “Is there a walk through the city that traverses each bridge exactly once?”

Eulerian cycle algorithm Start at any vertex v, traverse unused edges until returning to v While the cycle is not Eulerian – Pick a vertex w along the cycle for which there are untraversed outgoing edges – Traverse unused edges until ending up back at w – Join two cycles into one cycle

Joining cycles v v w v

Eulerian Path -> Eulerian Cycle If a graph has an Eulerian Path starting at s and ending at t then – All vertices must be balanced, except for s and t which may have |indegree(v) – outdegree(v)| = 1 – If and s and t are not balanced, add an edge between them to balance Graph now has an Eulerian cycle which can be converted to an Eulerian path by removal of the added edge

SBH graph example {ACT, ATG, CAT, CTG, GAT, GCA, TGA, TGC} AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT

SBH difficulties In practice, sequencing by hybridization is hard – Arrays are often inaccurate -> incorrect spectra False positives/negatives – Need long probes to deal with repetitive sequence But the number of probes needed is exponential in the length of the probes! There is a limit to the number of probes per array (currently between 1-10 million probes / array)

Fragment assembly challenges Read errors – Complicates computing read overlaps Repeats – Roughly half of the human genome is composed of repetitive elements – Repetitive elements can be long (1000s of bp) – Human genome 1 million Alu repeats (~300 bp) 200,000 LINE repeats (~1000 bp)

Overlap-Layout-Consensus Most common assembler strategy 1.Overlap: Find all significant overlaps between reads, allowing for errors 2.Layout: Determine path through overlapping reads representing assembled sequence 3.Consensus: Correct for errors in reads using layout

Consensus TGACTGCGCTGCATCGTATCGTATC ATCGTCTCGTAGCTGACTGCGCTGC ATCGTATCGAATCGTAG GTATCGTAGCTGACTGCGCTGC TGACTGCGCTGCATCGTATCGTATCGTAGCTGACTGCGCTGC Layout Consensus

Whole Genome Sequencing Two main strategies: 1.Clone-by-clone mapping Fragment genome into large pieces, insert into BACs (Bacterial Artificial Chromosomes) Choose tiling set of BACs: overlapping set that covers entire genome Shotgun sequence the BACs 2.Whole-genome shotgun Shotgun sequence the entire genome at once

Assembly in practice Assembly methods used in practice are complex – But generally follow one of the two approaches Reads as vertices Reads as edges (or paths of edges) Assemblies do not typically give whole chromosomes – Instead gives a set of “contigs” – contig: contiguous piece of sequence from overlapping reads – contigs can be ordered into scaffolds with extra information (e.g., paired end reads)

Cloning and Paired-end reads DNA fragment vector Insert fragment into vector Sequence ends of insert using flanking primers … transform bacteria with vector and grow

Paired-end read advantages Scaffolding: layout of adjacent, but not overlapping, contigs Gap filling: contig paired-end reads scaffold gap

Sequence assembly summary Two general algorithmic strategies – Overlap graph hamiltonian paths – Eulerian paths in k-mer graphs Biggest challenge – Repeats! Large genomes have a lot of repetitive sequence Sequencing strategies – Clone-by-clone: break the problem into smaller pieces which have fewer repeats – Whole-genome shotgun: use paired-end reads to assemble around and inside repeats