

8. DNA Sequencing

DNA Sequencing
- Fred Sanger, Cambridge, England
- Partition copied DNA into four groups
- Each group has one of the four bases starved
  - ACGTAAGCTA with T starved produces ACG and ACGTAAGC
- Run the experiment with each of the four bases starved, producing a ladder (all sub-fragments ending where the starved base occurs)
- Collect the resulting fragments by length
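
The ladder idea above can be sketched in a few lines of Python. This is only an illustration under a simplifying assumption: each starved reaction yields every prefix of the template that stops just before an occurrence of the starved base, as in the ACGTAAGCTA example. The function names `starved_fragments` and `read_ladder` are hypothetical and do not come from the slides.

```python
def starved_fragments(template: str, starved_base: str) -> list[str]:
    """Prefixes of `template` that stop just before each occurrence of `starved_base`."""
    return [template[:i] for i, base in enumerate(template) if base == starved_base]


def read_ladder(template: str) -> str:
    """Recover the template by sorting the fragments of all four reactions by length."""
    next_base_at_length = {}
    for base in "ACGT":
        for fragment in starved_fragments(template, base):
            # A fragment of length n tells us that position n holds the starved base.
            next_base_at_length[len(fragment)] = base
    return "".join(next_base_at_length[n] for n in sorted(next_base_at_length))


print(starved_fragments("ACGTAAGCTA", "T"))
print(read_ladder("ACGTAAGCTA"))
```

Sorting the fragments from all four reactions by length gives, at each length, the next base of the template.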

DNA Sequencing
- Later, sequencing machines sequence nt fragments, called reads
- Reads are assembled into a continuous genome (difficult)
- Shotgun sequencing
- Current: Next Generation Sequencing (NGS)

Shot-gun Method
- Break up DNA into small fragments, each of which is sequenced
- Use a computer to search for overlaps
- Build a master sequence
- Good for short prokaryote genomes
- For n fragments, the number of possible overlaps is 2n(n-1)
- Repeats in sequences cause problems

Shot-gun Sequencing

Assembly with F-R constraint
Assembly without F-R constraint
Scaffold with F-R constraint

Genetic Maps
- For long genomes, use genetic markers
- Use the shotgun method and locate known markers in the master sequence
- Known genes can be markers

Restriction Map
- Restriction endonuclease: an enzyme that binds to specific DNA sequences and makes a double-stranded cut at or near those sequences
- Type II enzymes always cut at the same place (over 2,500 type II enzymes)
- e.g., HindII cuts at GTGCAC or GTTAAC

Complete and Partial Digest
- Probability of a restriction site being cut = 1: complete digest
  - The distance between successive cuts is known and accurate
- Probability < 1: partial digest
  - Distances spanning more than one restriction site are also generated

Partial Digest Problem (PDP)
- $X = \{x_1 = 0, x_2, \ldots, x_n\}$: an ordered set of $n$ points on a line
- $\Delta X = \{x_j - x_i \mid 1 \le i < j \le n\}$: a multiset of pairwise distances with $\binom{n}{2}$ elements
- Partial Digest Problem (PDP)
  - Given a multiset $L$ containing $\binom{n}{2}$ integers (pairwise distances)
  - Find a set $X$ of $n$ integers such that $\Delta X = L$
- Also called the Turnpike problem: reconstructing a highway from the distances between pairs of exits
- The solution $X$ is not always unique
  - e.g., $\Delta A = \Delta(A + v)$, where $A + v = \{a + v \mid a \in A\}$ (one set is a shift of the other)
    - $A = \{0, 2, 4, 7, 10\}$, $A + 100 = \{100, 102, 104, 107, 110\}$
  - e.g., $\Delta A = \Delta(-A)$
    - $A = \{0, 2, 4, 7, 10\}$, $-A = \{-10, -7, -4, -2, 0\}$
  - In general, $U + V$ and $U - V$ are homometric
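
A small Python check of the homometric examples above may help. It is a sketch, not part of the slides: `delta` is a hypothetical helper that builds the multiset of pairwise distances, and the sets `U` and `V` are illustrative values chosen here.

```python
from collections import Counter
from itertools import combinations


def delta(points):
    """Multiset of pairwise distances between the given points."""
    return Counter(abs(a - b) for a, b in combinations(points, 2))


A = [0, 2, 4, 7, 10]
shifted = [a + 100 for a in A]   # A + 100
mirrored = [-a for a in A]       # -A

# Illustrative U and V (an assumption, not taken from the slides).
U, V = [6, 7, 9], [-6, 2, 6]
u_plus_v = sorted(u + v for u in U for v in V)
u_minus_v = sorted(u - v for u in U for v in V)

print(delta(A) == delta(shifted) == delta(mirrored))
print(delta(u_plus_v) == delta(u_minus_v))
```

Shifting or mirroring a set never changes its pairwise distances, which is why a PDP solution can only be unique up to these transformations.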

PDP(1): Brute force approach
- Given $L$, compute $\Delta X$ for every possible choice of $X$ until an $X$ is found such that $\Delta X = L$
- Need to examine $\binom{M-1}{n-2}$ different sets of positions => $O(M^{n-2})$

BruteForcePDP(L, n)
  M ← max(L)
  for every set of n-2 integers 0 < x_2 < ... < x_{n-1} < M
    X ← {0, x_2, ..., x_{n-1}, M}
    Form ΔX from X
    if ΔX = L
      return X
  return “No Solution”
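
The brute-force search can be written as a short runnable Python sketch. The names `delta` and `brute_force_pdp` are chosen here for illustration, and returning `None` stands in for "No Solution".

```python
from collections import Counter
from itertools import combinations


def delta(points):
    """Multiset of pairwise distances of an increasing list of points."""
    return Counter(b - a for a, b in combinations(points, 2))


def brute_force_pdp(L, n):
    """Try every set of n-2 interior points strictly between 0 and max(L)."""
    target = Counter(L)
    M = max(L)
    for inner in combinations(range(1, M), n - 2):
        X = [0, *inner, M]
        if delta(X) == target:
            return X
    return None  # "No Solution"


print(brute_force_pdp([2, 2, 3, 3, 4, 5, 6, 7, 8, 10], 5))
```

The loop visits up to $\binom{M-1}{n-2}$ candidate sets, so it is only practical for very small inputs.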

PDP(2): Brute force approach
- Identical to BruteForcePDP() except that every $x_i \in L$
- Need to examine $\binom{|L|}{n-2}$ different sets of positions => $O(n^{2n-4})$, since $|L| = n(n-1)/2$

BruteForcePDP(L, n)
  M ← max(L)
  for every set of n-2 integers 0 < x_2 < ... < x_{n-1} < M from L
    X ← {0, x_2, ..., x_{n-1}, M}
    Form ΔX from X
    if ΔX = L
      return X
  return “No Solution”
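
A hedged Python sketch of this variant follows. It assumes all distances in L are positive and draws candidate positions from the distinct values of L only (every point of X is itself a distance from 0, so it must occur in L). The function name `another_brute_force_pdp` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def another_brute_force_pdp(L, n):
    """Brute force restricted to candidate positions that occur in L."""
    target = Counter(L)
    M = max(L)
    # Distinct values suffice because X is a set; M itself is already placed.
    candidates = sorted(set(L) - {M})
    for inner in combinations(candidates, n - 2):
        X = [0, *inner, M]
        distances = Counter(b - a for a, b in combinations(X, 2))
        if distances == target:
            return X
    return None  # "No Solution"


print(another_brute_force_pdp([2, 2, 3, 3, 4, 5, 6, 7, 8, 10], 5))
```

The search space no longer depends on the magnitude M, only on the number of distances in L.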

PDP(3): Steven Skiena, 1990
- The largest element in L determines the two outermost points in X
- e.g., L = {2,2,3,3,4,5,6,7,8,10}
  - Pick 10: X = {0,10}, L = {2,2,3,3,4,5,6,7,8}
  - Pick 8: X = {0,2,10} or X = {0,8,10}; taking X = {0,2,10}, L = {2,3,3,4,5,6,7}
  - Pick 7: placing a point at 3 would require the distance 3-2 = 1, which is not in L, so the point is placed at 7: X = {0,2,7,10}, L = {2,3,4,6}
  - ...

PartialDigest(L)
  width ← max(L)
  DELETE(width, L)
  X ← {0, width}
  PLACE(L, X)

[Δ(y, X): multiset of distances between a point y and all points in set X]

PLACE(L, X)
  if L is empty
    output X
    return
  y ← max(L)
  if Δ(y, X) is a subset of L
    add y to X and remove Δ(y, X) from L
    PLACE(L, X)
    remove y from X and add Δ(y, X) to L
  if Δ(width-y, X) is a subset of L
    add width-y to X and remove Δ(width-y, X) from L
    PLACE(L, X)
    remove width-y from X and add Δ(width-y, X) to L
  return
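
The PartialDigest/PLACE pseudocode can be turned into a runnable Python sketch as below. It is one possible rendering: the multiset L is held in a `Counter`, and solutions are collected in a list rather than printed. The names `partial_digest`, `place` and `distances_to` are chosen here for illustration.

```python
from collections import Counter


def partial_digest(L):
    """Return every X with delta(X) == L found by the backtracking search."""
    remaining = Counter(L)
    width = max(remaining)
    remaining[width] -= 1            # DELETE(width, L)
    X = [0, width]
    solutions = []

    def distances_to(y):
        """Delta(y, X): distances between y and all points currently in X."""
        return Counter(abs(y - x) for x in X)

    def place():
        if sum(remaining.values()) == 0:
            solutions.append(sorted(X))
            return
        y = max(d for d, count in remaining.items() if count > 0)
        # Try y, then width - y (dict.fromkeys drops a duplicate candidate).
        for candidate in dict.fromkeys((y, width - y)):
            d = distances_to(candidate)
            if all(remaining[dist] >= count for dist, count in d.items()):
                remaining.subtract(d)
                X.append(candidate)
                place()
                X.pop()              # backtrack
                remaining.update(d)

    place()
    return solutions


print(partial_digest([2, 2, 3, 3, 4, 5, 6, 7, 8, 10]))
```

On the example from the previous slide the search finds both the set {0, 2, 4, 7, 10} and its mirror image {0, 3, 6, 8, 10}.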

Shortest Superstring Problem
- Find a superstring of the reads, but the shortest one
- Given a set of strings, find a shortest string that contains all of them
  - Input: strings $s_1, s_2, \ldots, s_n$
  - Output: a shortest string $s$ that contains all strings $s_1, s_2, \ldots, s_n$

Shortest Superstring Problem - 2
- Define overlap($s_i$, $s_j$): the length of the longest prefix of $s_j$ that matches a suffix of $s_i$
- The Shortest Superstring Problem becomes a Traveling Salesman Problem with a vertex for each string and edges weighted by the overlaps
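
The overlap definition and the tour view of the problem can be illustrated with a small Python sketch. It tries every ordering of the strings (exponential, so only for tiny inputs) and assumes that no input string is a substring of another. The function names are hypothetical.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest prefix of b that matches a suffix of a."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def shortest_superstring(strings):
    """Exhaustive search over orderings, merging neighbours by their overlap."""
    best = None
    for order in permutations(strings):
        merged = order[0]
        for prev, nxt in zip(order, order[1:]):
            merged += nxt[overlap(prev, nxt):]
        if best is None or len(merged) < len(best):
            best = merged
    return best


print(overlap("ACGT", "GTAA"))
print(shortest_superstring(["ATG", "TGG", "GGC"]))
```

Each ordering corresponds to a tour through the strings, and maximizing the total overlap along the tour minimizes the superstring length.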

DNA Arrays
- Sequencing by Hybridization (SBH)
- Millions of short DNA fragments, called probes, on a chip
- The input DNA sequence hybridizes to the fragments in the array (chip) through base complementarity

Base coverage
- A sample (genome) is amplified
- Each base in the sample is copied into many reads
- But reads are randomly generated, so the number of reads covering a base follows a Poisson distribution
- Similarly for k-mers: still a Poisson distribution, but a different one

Coverage Depth and Extent
- Coverage depth: the average number of times each base or k-mer is sequenced
- Coverage extent: the fraction of the genome covered by at least one sequenced base or k-mer
- Given a genome of size $G$, read length $L$, and number of reads $N$
  - Total number of bases ($n_b$) and k-mers ($n_k$): $n_b = N L$; $n_k = N (L - k + 1)$
  - $n_b / n_k = L / (L - k + 1)$

Coverage Depth and Extent
- Coverage depth of bases ($d_b$) and k-mers ($d_k$)
  - $d_b = n_b / G$; $d_k = n_k / G$
  - $d_b / d_k = L / (L - k + 1)$
- For de novo sequencing, these relationships can be used to estimate the unknown genome size ($G$) and the coverage depth for bases ($d_b$) from read data before assembly, using $G = n_k / d_k$ and $d_b = d_k \cdot L / (L - k + 1)$
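
The estimation step can be sketched in Python as follows. The function name and the input numbers are made up for illustration; in particular, the k-mer depth `d_k` is assumed to have been estimated beforehand (for example from the peak of a k-mer frequency histogram), which the slides do not spell out.

```python
def estimate_genome(num_reads, read_len, k, d_k):
    """Estimate genome size G and base coverage depth d_b from read statistics."""
    n_b = num_reads * read_len               # total bases
    n_k = num_reads * (read_len - k + 1)     # total k-mers
    G = n_k / d_k                            # G = n_k / d_k
    d_b = d_k * read_len / (read_len - k + 1)
    return n_b, n_k, G, d_b


# Hypothetical data set: one million 100-bp reads, k = 21, k-mer depth 16.
n_b, n_k, G, d_b = estimate_genome(1_000_000, 100, 21, 16)
print(G, d_b, n_b / G)
```

The last two printed values agree, which is a quick consistency check of $d_b = n_b / G$.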

Coverage Depth or Sequencing Depth
- Coverage depth ($d_b$) is also called sequencing depth ($c$)
- From the Poisson distribution, the probability that a base is not covered is $P(X = 0) = e^{-c}$
- Coverage extent is $P(X > 0) = 1 - e^{-c}$
- To cover > 99% of a genome, $c > 4.6$
- To ensure the whole genome is covered, the expected number of uncovered bases must satisfy $G e^{-c} < 1$
  - Human genome (3 Gb): $c > 22$
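
The two thresholds quoted above follow directly from the Poisson formulas and can be reproduced with a few lines of Python. The function names are chosen here for illustration.

```python
import math


def coverage_extent(c):
    """Fraction of the genome covered at least once at sequencing depth c."""
    return 1 - math.exp(-c)


def depth_for_extent(fraction):
    """Smallest depth c with 1 - exp(-c) >= fraction."""
    return -math.log(1 - fraction)


def depth_for_full_coverage(genome_size):
    """Depth c at which the expected number of uncovered bases G*exp(-c) drops to 1."""
    return math.log(genome_size)


print(round(depth_for_extent(0.99), 2))
print(round(depth_for_full_coverage(3e9), 2))
```

The second value is just under 22, consistent with the rounded threshold given for the human genome.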

SBH
- Given an unknown DNA sequence, a DNA array provides
  - All strings of length $l$ that the sequence contains
  - No information about their positions
- Spectrum($s$, $l$): for a string $s$ of length $n$, the $l$-mer composition, a multiset of the $n - l + 1$ $l$-mers in $s$
- e.g., $l = 3$, $s$ = TATGGTGC: Spectrum($s$, $l$) = {TAT, ATG, TGG, GGT, GTG, TGC}
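
Computing a spectrum is a one-liner in Python. The sketch below uses a `Counter` to represent the multiset; the function name `spectrum` is chosen here for illustration.

```python
from collections import Counter


def spectrum(s, l):
    """Multiset of the len(s) - l + 1 l-mers contained in s."""
    return Counter(s[i:i + l] for i in range(len(s) - l + 1))


print(sorted(spectrum("TATGGTGC", 3).elements()))
```

Sorting the result deliberately discards the order of the l-mers, mirroring the fact that the array gives no positional information.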

SBH as a Hamiltonian Path Problem
- Two $l$-mers $p$ and $q$ overlap if overlap($p$, $q$) = $l - 1$
- Hamiltonian Path Problem
  - Given Spectrum($s$, $l$), create a vertex for every $l$-mer in Spectrum($s$, $l$)
  - Connect every two vertices whose $l$-mers overlap
  - Find a path that visits every vertex exactly once
- Overlap-Layout-Consensus (OLC)
- NP-complete
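
A naive Python sketch of the Hamiltonian-path formulation is shown below. It simply tries every ordering of the l-mers, which makes the exponential cost visible; it is an illustration, not how OLC assemblers are implemented. The function name is hypothetical.

```python
from itertools import permutations


def sbh_hamiltonian(lmers):
    """Find an ordering in which consecutive l-mers overlap by l-1 characters."""
    for order in permutations(lmers):
        if all(a[1:] == b[:-1] for a, b in zip(order, order[1:])):
            # Spell the sequence: first l-mer plus the last base of each successor.
            return order[0] + "".join(b[-1] for b in order[1:])
    return None


print(sbh_hamiltonian(["GTG", "TGG", "ATG", "TGC", "TAT", "GGT"]))
```

With six l-mers there are only 720 orderings, but the count grows factorially with the size of the spectrum.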

OLC
- Conventional shotgun sequencing
- Overlap-layout-consensus
  - Overlap: use a computer to search for overlaps, trying all possible pairs of fragments
  - Layout: putting fragments together
  - Consensus: error correction
- Good for short prokaryote genomes
- For n fragments, the number of possible overlaps is 2n(n-1)
- Difficult
  - No solution for the “repeat problem” of finding the correct path in the layout step
  - Produces sequencing errors
- Programs
  - PHRAP, CAP, TIGR, CELERA

SBH as an Eulerian Path Problem
- Build a graph whose vertices are all ($l$-1)-mers and whose edges correspond to the $l$-mers from Spectrum($s$, $l$)
- Find a path visiting every edge exactly once

Eulerian Path Problem
- Repeatedly find cycles in the graph and merge them into an Eulerian cycle
- Linear time
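
The Eulerian formulation can be sketched in Python with the classic stack-based Hierholzer procedure, which runs in time linear in the number of edges. This is an illustrative sketch: it assumes the l-mers come from a single error-free sequence, and the function name `eulerian_sbh` is hypothetical.

```python
from collections import defaultdict


def eulerian_sbh(lmers):
    """Reconstruct a sequence by walking every l-mer edge exactly once."""
    graph = defaultdict(list)
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for lmer in lmers:
        u, v = lmer[:-1], lmer[1:]   # edge between two (l-1)-mers
        graph[u].append(v)
        outdeg[u] += 1
        indeg[v] += 1
    nodes = set(indeg) | set(outdeg)
    # Start where out-degree exceeds in-degree; otherwise any node with an edge.
    start = next((n for n in nodes if outdeg[n] - indeg[n] == 1), lmers[0][:-1])
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())   # follow an unused edge
        else:
            path.append(stack.pop())       # dead end: node is final in the path
    path.reverse()
    if len(path) != len(lmers) + 1:
        return None                        # some edges could not be reached
    return path[0] + "".join(node[-1] for node in path[1:])


print(eulerian_sbh(["GTG", "TGG", "ATG", "TGC", "TAT", "GGT"]))
```

Unlike the Hamiltonian formulation, no search over orderings is needed: each edge is pushed and popped once.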

De Bruijn Graph
- Partition read fragments into fixed-size k-mers (k = 27, for example)
- Each (k-1)-mer becomes a graph node
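
A minimal Python sketch of this construction is given below, assuming error-free reads on a single strand (real assemblers also handle reverse complements). Edge multiplicities are kept so that parallel links are stored once with a count. The function name `de_bruijn` is chosen here for illustration.

```python
from collections import Counter, defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer node to its successor nodes with edge multiplicities."""
    graph = defaultdict(Counter)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1   # edge: prefix node -> suffix node
    return graph


graph = de_bruijn(["ACGTA", "CGTAC"], 3)
for node in sorted(graph):
    print(node, dict(graph[node]))
```

An edge seen in several reads gets a multiplicity above 1 instead of being duplicated.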

OLC vs. De Bruijn Graph

De Bruijn Graph
- Eulerian graph
- De Bruijn graph
  - Glue parallel links into one link with a multiplicity (e.g., multiplicity of 3)
- Tangle: the number of input edges is not equal to the number of output edges

De Bruijn Graph
- How to construct a de Bruijn graph from a collection of sequencing reads?
  - Gluing requires knowledge of the finished sequence
  - The de Bruijn graph cannot be constructed from a collection of sequencing reads until sequencing is completed
- Let $s$ be a sequencing read with errors
  - If the genome sequence $G$ is known, errors in $s$ can be corrected by aligning $s$ against $G$
  - But $G$ is not known until the last “consensus” step
  - EULER uses SA (Spectral Alignment) to minimize errors in the first step

Programs
- ABySS (Assembly By Short Sequences): Simpson
- Velvet: Zerbino and Birney
- EULER: Pevzner
- SOAPdenovo (Short Oligonucleotide Alignment Program): Beijing Genomics Institute

ABySS
- Proceeds in two stages
  - Stage 1: all possible k-mers are generated from the reads; read errors are removed and initial contigs are constructed
  - Stage 2: mate-pairs are used to extend contigs
- Distributed implementation of the de Bruijn graph on a cluster, using the Message Passing Interface (MPI) over multiple computers

ABySS – Stage 1
- Three steps
  - Load read data into a distributed de Bruijn graph
  - Resolve read errors
  - Merge graph nodes
- Load read data into the distributed de Bruijn graph
  - Reads with unknown bases are discarded
  - Each read is broken into (read_length - k + 1) overlapping k-mers
  - Each k-mer is assigned to one cluster node
- Compute adjacency of k-mers
  - For each k-mer, a message is sent to its eight possible neighbors
  - If a neighbor exists, there must be a (k-1)-bp overlap
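
The "eight possible neighbors" of a k-mer are the four one-base extensions on each side. The Python sketch below illustrates the idea on a single machine; it ignores reverse complements and the MPI message passing, and the function names are hypothetical rather than ABySS's own.

```python
def possible_neighbors(kmer):
    """The 4 possible predecessors and 4 possible successors of a k-mer."""
    predecessors = [base + kmer[:-1] for base in "ACGT"]
    successors = [kmer[1:] + base for base in "ACGT"]
    return predecessors + successors


def adjacency(kmers):
    """For each k-mer, keep only those possible neighbors that were observed."""
    observed = set(kmers)
    return {
        kmer: sorted(n for n in possible_neighbors(kmer) if n in observed)
        for kmer in observed
    }


print(possible_neighbors("ACG"))
print(adjacency(["ACG", "CGT", "TAC"])["ACG"])
```

Because the candidate neighbors can be generated from the k-mer alone, each lookup is a set membership test rather than a comparison against all other k-mers.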

ABySS – Stage 1 (cont’d)
- Resolve read errors
  - Remove dead-ends
    - When correct k-mers of a read connect to incorrect k-mers, the incorrect k-mers are likely to be unique and most will not have an extension
    - One end of the branch will terminate with no extension
    - Dead-end branches are traced backward to the ambiguous point and are removed if they are shorter than a threshold
  - Remove bubbles
    - A branch diverges and rejoins later
    - Caused by single-base differences

ABySS – Stages 1 and 2
- Vertex merging: merge vertices linked by unambiguous edges
- Contig merging: use paired-end information

ABySS Results
- Genome of an African male from the NCBI Short Read Archive: Accession # SRA
- 3.5 billion mate-paired reads, x42
- Read length: bp, median fragment 210 bp
- At k=27, 15h run time without paired-end info

ABySS Comparisons

Velvet
- Construct a graph
  - Transform reads into roadmaps
    - From a read, generate k-mers with the read ID and position in the read (called roadmaps)
    - Each read is transformed into a set of overlapping k-mers with hash links to previous reads containing the same k-mers
  - 2nd database: for each read, which k-mers are overlapped by subsequent reads
  - Trace reads through the graph using roadmaps

Velvet
- Graph simplification
  - A node with one outgoing link can be merged with a next node that has one incoming link
- Error removal
  - Focus on topological features
  - Tips (dead-ends) shorter than 2k
  - Bubbles due to internal read errors (Tour Bus algorithm)
  - Erroneous connections due to distant merging tips
- Breadcrumb: use read pairs to extend contigs

EULER, 2001
- EULER
  - Implements the Eulerian Path problem
- Issues with real data
  - Reads may have errors
    - Error correction is typically done in the ‘consensus’ stage
    - EULER corrects errors in the first step: SA (Spectral Alignment)
  - Repeat problem
    - De Bruijn graph

Spectral Alignment (SA)
- Although the genome sequence $G$ is not known, the set $G_l$ of all $l$-mers present in $G$ can be accurately predicted
  - An $l$-mer is called solid if it belongs to more than $M$ reads
  - EULER approach: approximate $G_l$ by the set of all solid $l$-mers
- SBH problem without read errors
  - Construct a graph with edges corresponding to $l$-mers from Spectrum($s$, $l$)
  - Find a path visiting every edge exactly once
- SA
  - Given a string $s$ and $G_l$, find the minimum number of mutations in $s$ that transform $s$ so that Spectrum($s$, $l$) $\subseteq G_l$
  - Can be solved efficiently by dynamic programming
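
The notion of a solid l-mer can be illustrated with a short Python sketch. It counts, for each l-mer, the number of reads that contain it and keeps those above the threshold M. Reverse complements are ignored here for brevity, and the function name `solid_lmers` is hypothetical.

```python
from collections import Counter


def solid_lmers(reads, l, M):
    """l-mers that occur in more than M reads (each read counted once per l-mer)."""
    read_counts = Counter()
    for read in reads:
        distinct = {read[i:i + l] for i in range(len(read) - l + 1)}
        read_counts.update(distinct)
    return {lmer for lmer, count in read_counts.items() if count > M}


reads = ["ACGT", "ACGA", "ACGT", "TCGT"]
print(sorted(solid_lmers(reads, 3, 1)))
```

In this toy input the l-mers seen in only one read (CGA and TCG) are treated as likely errors and excluded from the approximation of $G_l$.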

Spectral Alignment (SA)
- Formulation
  - Given a set of reads $R = \{r_1, \ldots, r_n\}$, an integer $l$, and an upper bound $\Delta$ on the number of errors in each read
  - Spectrum $S_l$ is the set of all $l$-mers from the reads $r_1, \ldots, r_n$ and their reverse complements $r_1', \ldots, r_n'$
  - Introduce up to $\Delta$ corrections in each read in $R$ such that $|S_l|$ is minimized
- Result
  - One correction in a read can correct $l$ $l$-mers from $R$ and $l$ from $R'$
  - Removes 86.5% of read errors
  - But it can also create errors
    - One change in a read may change all reads in the region
    - Error introduction is acceptable as long as the errors from overlapping reads covering the same position are consistent, corresponding to a single mutation in the genome
  - Corrects 234,410 errors and introduces 1,452 errors in NM

EULER - Results  No incorrect contigs

Summary of EULER
- Eulerian Path approach: de Bruijn graph
- Do error correction early: SA
- Fill gaps ASAP

EULER+, 2004
- A-Bruijn graph
  - To handle errors in reads, introduces vertices based on ungapped alignments that allow mismatches, rather than the exact $l$-mers of de Bruijn assembly
- Graph simplification algorithms to remove erroneous edges
- The size of the de Bruijn graph is proportional to the coverage, so short-read sequencing, with its higher coverage, requires more memory than long-read sequencing

EULER-SR, 2008
- Focus on a memory-efficient algorithm for dealing with short reads
- Results
  - Error correction
    - E. coli: 68% error-free reads
    - 99.6% of errors are corrected
  - 12h on a single processor, 1/2h for assembly

EULER – SR -- Result

EULER-USR, 2009
- Shows results of EULER-SR on error-prone Illumina read data
- Shows that 35-nt reads are sufficient when mate-pairs are used

SOAPdenovo – Schematic Overview

Stages
- Short read data
  - Fragment and paired-end libraries are sequenced using various insert sizes
  - Read lengths from 35 to 75 bp and insert sizes of 140 bp, 440 bp, 2.6 kb, 6 kb, and 9.6 kb
- Preassembly error correction
  - Identify low-frequency (occurring < 3 times) 17-mers and correct them to the candidate with the highest frequency
  - The number of distinct 25-mers was reduced from 14.6 B to 5.0 B for an Asian genome

Stage 2
- De Bruijn graph
  - Only the single-end reads and the paired-end reads with short insert sizes (< 1 kb) were used, because of the high probability of chimeric reads in the long-insert pairs from the circularization and fragmentation process
- Further error correction in the graph
  - Clip tips (less than 50 bp)
  - Remove low-coverage links
  - Resolve tiny repeats (longer than k but shorter than the read length)
  - Merge bubbles
- Removed 323 M (6.5%) tip nodes, filtered M low-coverage nodes, resolved 4.4 M tiny repeats, and merged 4.2 M bubbles for the Asian genome

OLC vs. de Bruijn Graph

Benefits

Scaffold linkage
- Connecting contigs
- Paired-end read
- Mate-pair