Sequence Assembly: Concepts BMI/CS 576 Sushmita Roy September 2012 BMI/CS 576.

Slides:



Advertisements
Similar presentations
CSE 211 Discrete Mathematics
Advertisements

CS 336 March 19, 2012 Tandy Warnow.
Graph Theory Aiding DNA Fragment Assembly Jonathan Kaptcianos advisor: Professor Jo Ellis-Monaghan Work.
Introduction to Graph Theory Instructor: Dr. Chaudhary Department of Computer Science Millersville University Reading Assignment Chapter 1.
Midwestern State University Department of Computer Science Dr. Ranette Halverson CMPS 2433 – CHAPTER 4 GRAPHS 1.
Graph Algorithms in Bioinformatics. Outline Introduction to Graph Theory Eulerian & Hamiltonian Cycle Problems Benzer Experiment and Interval Graphs DNA.
Introduction to Bioinformatics Algorithms Graph Algorithms in Bioinformatics.
Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.
Reducibility Class of problems A can be reduced to the class of problems B Take any instance of problem A Show how you can construct an instance of problem.
Section 2.1 Euler Cycles Vocabulary CYCLE – a sequence of consecutively linked edges (x 1,x2),(x2,x3),…,(x n-1,x n ) whose starting vertex is the ending.
© 2006 Pearson Addison-Wesley. All rights reserved14 A-1 Chapter 14 Graphs.
Analysis of Algorithms CS 477/677
Approximation Algorithms for the Traveling Salesperson Problem.
Complexity ©D.Moshkovitz 1 Paths On the Reasonability of Finding Paths in Graphs.
Genome Assembly Charles Yan Fragment Assembly Given a large number of fragments, such as ACC AC AT AC AT GG …, the goal is to figure out the original.
CS 6030 – Bioinformatics Summer II 2012 Jason Eric Johnson
Introduction to Bioinformatics Algorithms Graph Algorithms in Bioinformatics.
GRAPH Learning Outcomes Students should be able to:
Genome Reconstruction: A Puzzle With a Billion Pieces Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau & Pavel Pevzner University.
Theory of Computing Lecture 10 MAS 714 Hartmut Klauck.
Physical Mapping of DNA Shanna Terry March 2, 2004.
DNA Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 8, 2005 ChengXiang Zhai Department of Computer Science University of Illinois,
CS 394C March 19, 2012 Tandy Warnow.
Slide 14-1 Copyright © 2005 Pearson Education, Inc. SEVENTH EDITION and EXPANDED SEVENTH EDITION.
Graphs and DNA sequencing CS 466 Saurabh Sinha. Three problems in graph theory.
1 The TSP : NP-Completeness Approximation and Hardness of Approximation All exact science is dominated by the idea of approximation. -- Bertrand Russell.
Graph Theory Topics to be covered:
Sequence assembly using paired- end short tags Pramila Ariyaratne Genome Institute of Singapore SOC-FOS-SICS Joint Workshop on Computational Analysis of.
394C March 5, 2012 Introduction to Genome Assembly.
Graph Theory And Bioinformatics Jason Wengert. Outline Introduction to Graphs Eulerian Paths & Hamiltonian Cycles Interval Graph & Shape of Genes Sequencing.
Algorithms and Running Time Algorithm: Well defined and finite sequence of steps to solve a well defined problem. Eg.,, Sequence of steps to multiply two.
Sequence Assembly Fall 2015 BMI/CS 576 Colin Dewey
Fuzzypath – Algorithms, Applications and Future Developments
Sequence Assembly BMI/CS 576 Fall 2010 Colin Dewey.
CSE 326: Data Structures NP Completeness Ben Lerner Summer 2007.
Combinatorial Optimization Problems in Computational Biology Ion Mandoiu CSE Department.
Fragment assembly of DNA A typical approach to sequencing long DNA molecules is to sample and then sequence fragments from them.
Introduction to Bioinformatics Algorithms Graph Algorithms in Bioinformatics.
Assembly of Paired-end Solexa Reads by Kmer Extension using Base Qualities Zemin Ning The Wellcome Trust Sanger Institute.
Data Structures & Algorithms Graphs
CSE 589 Part VI. Reading Skiena, Sections 5.5 and 6.8 CLR, chapter 37.
NP-COMPLETE PROBLEMS. Admin  Two more assignments…  No office hours on tomorrow.
NP-Complete problems.
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley. Ver Chapter 13: Graphs Data Abstraction & Problem Solving with C++
© 2006 Pearson Addison-Wesley. All rights reserved 14 A-1 Chapter 14 Graphs.
A new Approach to Fragment Assembly in DNA Sequenceing Fei wu April,24,2006.
CS 173, Lecture B Introduction to Genome Assembly (using Eulerian Graphs) Tandy Warnow.
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
I can describe the differences between Hamilton and Euler circuits and find efficient Hamilton circuits in graphs. Hamilton Circuits I can compare and.
ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads
COSC 3101A - Design and Analysis of Algorithms 14 NP-Completeness.
Grade 11 AP Mathematics Graph Theory Definition: A graph, G, is a set of vertices v(G) = {v 1, v 2, v 3, …, v n } and edges e(G) = {v i v j where 1 ≤ i,
CSCI2950-C Lecture 2 DNA Sequencing and Fragment Assembly
Short reads: 50 to 150 nt (nucleotide)
DNA Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics)
CSCI2950-C Genomes, Networks, and Cancer
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
Eulerian tours Miles Jones MTThF 8:30-9:50am CSE 4140 August 15, 2016.
Introduction to Genome Assembly
Discrete Maths 9. Graphs Objective
CS 598AGB Genome Assembly Tandy Warnow.
Graph Theory.
Graphs Chapter 13.
Genome Assembly.
Graphs Chapter 11 Objectives Upon completion you will be able to:
Graph Algorithms in Bioinformatics
CSE 5290: Algorithms for Bioinformatics Fall 2009
Fragment Assembly 7/30/2019.
Presentation transcript:

Sequence Assembly: Concepts BMI/CS Sushmita Roy September 2012 BMI/CS 576

Goals for this unit Shortest superstring problem Greedy and non-polynomial time algorithms Graph-theoretic concepts – Euler path – Hamiltonian path k-mer spectrum – Debruijn graphs Popular algorithms for assembly

Sequencing problem What is the base composition of – A single molecule of DNA – A genome – Many genomes We cannot read the entire genomic sequence but can do this in pieces

Finding the genomic sequence Fragment the genome into short strings Sequence/read smaller strings Assemble these strings into the larger genome

Sequencing as a fragment assembly Multiple copies of sample DNA Randomly fragment DNA Sequence sample of fragments Assemble reads

Sequence assembly is like solving a big jigsaw puzzle

Different types of sequencing methods Sanger sequencing: 800bp-1000 bp – Chain termination method 454 sequencing: bp Illumina Genome Analyzer: bp Helicos: 30bp

First and second generation First generation – Automated Sanger sequencing – Amplification through bacterial colonies (in vivo) – bp – Low error – Slow and expensive (limited parallelization: in 96 or 384 capillary- based electrophoresis) Second generation – 454 sequencing, Illumina, Helicos, SOLiD – Amplification on arrays (in vitro) – bp – High error – Fast and cheap (parallel acquisition of many DNA sequences)

De novo assembly: The fragment assembly problem Given: A set of reads (strings) {s 1, s 2, …, s n } Do: Determine a superstring s that “best explains” the reads What do we mean by “best explains”?

Shortest superstring problem (SSP) Objective: Find a string s such that – all reads s 1, s 2, …, s n are substrings of s – s is as short as possible Assumptions: – Reads are 100% accurate – Identical reads must come from the same location on the genome – “best” = “simplest”

Shortest superstring example Given the reads: {ACG, CGA, CGC, CGT, GAC, GCG, GTA, TCG} What is the shortest superstring you can come up with? This problem is a computationally difficult problem – NP-complete TCGACGCGTA (length 10)

NP complete problems NP (non-deterministic polynomial) problems: – A problem for which we don’t know a polynomial algorithm – But we don’t whether a polynomial algorithm exists or not – You can check a solution in polynomial time NP complete – A problem that is NP – Is at least as hard as other problems in NP

SSP as the traveling salesman problem Traveling salesman problem – Given a set of cities and distances between them – Find the shortest path covering all cities, but no city more than once SSP->TSP – replace edge overlap by negative edge overlap – minimize over path length

Algorithms for shortest superstring problem Greedy Overlap Layout Consensus Eulerian path – Debruijn graphs

Greedy algorithms Definition: An algorithm that takes the best immediate, or local steps while finding an answer. May find less‐than-optimal solutions for some instances of problems.

Greedy algorithm for SSP Simple greedy strategy: while # strings > 1 do merge two strings with maximum overlap loop Overlap: size of substring common between two strings Conjectured to give string with length ≤ 2 × minimum length

Algorithms for shortest superstring problem Greedy Overlap Layout consensus (OLC) Eulerian path – Debruijn graphs The rest require graph theory.

Graph theory basics a graph (G) consists of vertices (V) and edges (E) G = (V,E) edges can either be directed (directed graphs) or undirected (undirected graphs)

Graph theory basics degree of a vertex: number of neighbors of a vertex – In degree: number of incoming edges – Out degree: number of outgoing edges path from u to v: number of connected edges starting from u and ending at v cycle is a path that begins and ends at the same vertex

Overlap Layout consensus methods Overlap graph creation Layout Consensus

Overlap graph For a set of sequence reads S, construct a directed weighted graph G = (V,E,w) – One vertex per read (v i corresponds to s i ) – edges between all vertices (a complete graph) – w(v i,v j ) = overlap(s i,s j ) length of longest suffix of s i that is a prefix of s j AGAT prefix suffix

Overlap graph example Let S = {AGA, GAT, TCG, GAG} AGA GAT TCG GAG

Overlap graph example Let S = {AGA, GAT, TCG, GAG} AGA GAT TCG GAG

Layout: Find the assembly Identify paths corresponding to segments of the genome The general case is finding the Hamiltonian path

Hamiltonian path Hamiltonian path: path through graph that visits each vertex exactly once AGA GAT TCG GAG Path: AGAGATCG Are there more Hamiltonian paths?

Hamiltonian path A path that visits every vertex only once Checking whether a graph has a Hamiltonian path or not is very difficult – NP complete

Assembly as a Hamiltonian path finding Hamiltonian path is an NP-complete problem – Similar to the Traveling salesman problem nevertheless overlap graphs are often used for sequence assembly

Consensus Once we have a path, determine the DNA sequence Implied by the arrangement of reads along the chosen path GTATCGTAGCTGACTGCGCTGC ATCGTCTCGTAGCTGACTGCGC TGCATCGTATCGAAGCGGAC TGACTGCGCTGCATCGTATCGTAGC Consensus: TGACTGCGCTGCATCGTATCGTAGCGGACTGCGCTGC

Summary OLC: Overlap layout consensus Overlap: compute the maximal substring common between two strings Layout: arrange the reads on a graph based on the overlap – Exact solution requires the Hamiltonian path->NP complete! – Greedy solution is used in practice

Euler Path and cycles A path in a graph that includes all edges An Euler cycle is an Euler path that begins and ends at the same vertex Finding an Euler path is easy – Linear in the number of edges of a graph

Seven Bridges of Königsberg Euler answered the question: “Is there a walk through the city that traverses each bridge exactly once?”

Properties of Eulerian graphs cycle: a path in a graph that starts/ends on the same vertex Eulerian cycle: a path that visits every edge of the graph exactly once Theorem: A connected graph has an Eulerian cycle if and only if each of its vertices are balanced A vertex v is balanced if indegree(v) = outdegree(v) There is a linear-time algorithm for finding Eulerian cycles!

Eulerian cycle algorithm start at any vertex v, traverse unused edges until returning to v while the cycle is not Eulerian – pick a vertex w along the cycle for which there are untraversed outgoing edges – traverse unused edges until ending up back at w – join two cycles into one cycle

Finding cycles 1) start at arbitrary vertex2) start at vertex along cycle with untraversed edges

Finding cycles 3) join cycles4) start at vertex along cycle with untraversed edges

Finding cycles 5) join cycles

Eulerian paths a vertex v is semibalanced if |indegree(v) – outdegree(v)| = 1 a connected graph has an Eulerian path if and only if it contains at most two semibalanced vertices

Eulerian path  Eulerian cycle If a graph has an Eulerian Path starting at w and ending at x then – All vertices must be balanced, except for w and x which are semibalanced – If and w and x are not balanced, add an edge between them to balance Graph now has an Eulerian cycle which can be converted to an Eulerian path by removal of the added edge

Eulerian path  Eulerian cycle

Spectrum of a sequence k-mer spectrum of s is the set of all k-mers that are present in the sequence ATTACAG – 3-spectrum: {ATT, TTA, TAC, ACA, CAG} – 4-spectrum: {ATTA,TTAC,TACA,ACAG}

deBruijn graph {ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT} AT CA CG GC GG GT TG ATG TGC TGG GGC GCA GCG CGT GTG in a deBruijn graph – edges represent k-mers that occur in s – vertices correspond to (k-1)-mers – Directionality goes from the k-1 prefix to the k-1 suffix

deBruijn graphs/Eulerian approach for assembly Find a DNA sequence containing all k-mers? In a deBruijn graph, can we find a path that visits every edge of the graph exactly once?

Assembly as finding Eulerian paths Eulerian path: path that visits every edge exactly once we can frame the assembly problem as finding Eulerian paths in a deBruijn graph resulting sequences contain all k-mers AT CA CG GC GG GT TG ATG TGC TGG GGC GCA GCG CGT GTG assembly: ATGGCGTGCA

Eulerian path-> Assembly But there are exponentially many cycles for a given graph! – Most assemblers pick one Better for next-generation sequencing data – Breaking reads into k-mers loses connectivity

Assembly is complicated because Of repeats – Hard to know where repeats begin or end – Reads are not long enough – Human genome has many different repeats SINEs (Short interspersed elements, 300bp) LINEs (Long interspersed elements, 1000 bps) Gene duplications Sequencing errors – Distinguishing true from error-based overlaps.

Sequence assembly in practice approaches are based on these ideas, but include a lot of heuristics “best” approach varies depending on length of reads, amount of repeats in the genome, availability of paired-end reads Assemblies are done iteratively – Contigs: contiguous piece of sequence from overlapping reads – Scaffolds: ordered contigs based on paired end reads

Sequence assembly in practice Most sequencing platforms produce a fastq file – Sequence quality scores: Phred scores Q=-10log 10 P, P is the probability of a base being wrong – Estimated from known sequences N50: a statistic used to assess the quality of assembly – The length l of a contig such that all contigs of length l’<l cover 50% of the assembly – Other NX also defined. – Higher the N50 the better.

Paired end reads cut many times at random genome reads are sequenced ends of each fragment one approach to reducing ambiguity in assembly is to use paired end reads Paired end tags

Implemented Algorithms for assembly Overlap layout consensus – Celera: was developed for human genome Finds uniquely assembled contigs Throws out regions with very high reads Resolve forks using mate pairs. – Arachne Uses mate pairs from the beginning – Use efficient k-mer based indexing to estimate read overlap Eulerian path/deBruijn path – Velvet

Velvet for assembling genomes from 32-50bp reads Debruijn graphs Generate k-mers per read Simplify – Merge linear chains Error correction – Remove tips: disconnected ends <2k length – Remove bubbles: Bus tour algorithm Do a breadth-first search. Each time you hit a node already seen consider merging Handle repeats – Breadcrumb

From reads to deBruijn graphs deBruijn Graph CTCCGA GGACTG TCCCCG GGGTGG AAGAGAGACACTCTTTTT 4-mer spectrum AAGA ACTC ACTG ACTT AGAC CCGA CGAC CTCC CTGG CTTT GACT GGAC GGGA TCCG TGGG human genome: ~ 3B nodes, ~10B edges AAGACA GGACTT … Reads

Simplifying the graph in Velvet CTCCG A CTGGGA AAGAGACACTCTTT CTCCGA GGACTG TCCCCG GGGTGG AAGAGAGACACTCTTTTT collapse linear subgraphs

Bus tour algorithm A A B B C C D D E E B1B1 B1B1 C1C1 C1C1 Z Z Trace shortest path from A A A B B C C D D E E B1B1 B1B1 C1C1 C1C1 Z Z D is reached through two paths

Bus tour algorithm A A B B C C D D E E B1B1 B1B1 C1C1 C1C1 Z Z 1 2 A A B B C C D D E E Z Z A B C C D D E E Z Z Align and merge to shorter path Path 1 < Path 2 Simplify

Breadcrumb for handling repeats It all boils down to connecting two long segments/contigs Use paired end reads to connect long segments Connect intermediate smaller segments using other paired end tags

Genome projects 1000 genomes to study human genetic variation 1001 Arabidopsis genomes i5K: 5000 insect genomes 16,000 vertebrate genomes Human microbiome project

Summary Sequence assembly: Shortest superstring problem Graph theoretic approaches to address this – Greedy algorithm – Eulerian paths – Debruijn path In practice much harder because – Repeats – Errors Algorithms use non-trivial post-processing to produce final assembly