Published by Bethanie Mitchell. Modified over 9 years ago.
1
8. DNA Sequencing
2
DNA Sequencing
Fred Sanger, Cambridge, England.
Partition the copied DNA into four groups; in each group, one of the four bases is starved.
Example: ACGTAAGCTA with T starved produces ACG and ACGTAAGC.
Run the experiment with each of the four bases starved, producing a ladder (all sub-fragments that stop at that base).
Collect the resulting fragments by length.
Animations:
http://search.babylon.com/?q=sanger+sequencing+animation&AF=109929&babsrc=SP_ss&mntrId=28a4cdb3000000000000001e4fcc176b
http://www.youtube.com/watch?v=UT9wqaVCH5s (http://www.mun.ca/biology/scarr/4241_RMC_Sequencing.html)
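The starved-base example can be reproduced with a short sketch. It follows the slide's simplified model only (a copy stops wherever the starved base would be needed); the function names `starved_fragments` and `read_ladder` are hypothetical and chosen for illustration.

```python
def starved_fragments(template, base):
    """Prefixes of `template` that stop just before each occurrence of `base`."""
    return [template[:i] for i, b in enumerate(template) if b == base]


def read_ladder(template):
    """Recover the sequence from the four ladders, ordered by fragment length."""
    calls = {}
    for base in "ACGT":
        for fragment in starved_fragments(template, base):
            # A fragment of length i tells us that position i holds `base`.
            calls[len(fragment)] = base
    return "".join(calls[i] for i in sorted(calls))


print(starved_fragments("ACGTAAGCTA", "T"))  # ['ACG', 'ACGTAAGC']
print(read_ladder("ACGTAAGCTA"))             # ACGTAAGCTA
```

Sorting the fragments from all four experiments by length is what turns the four ladders back into one sequence.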
3
DNA Sequencing
Later, sequencing machines sequence 500-700 nt fragments, called reads.
Reads are assembled into a continuous genome (difficult).
Shotgun sequencing.
Current: Next Generation Sequencing (NGS), www.cs.uml.edu/~kim/580/10_ngs.pdf
4
Shot-gun Method
Break up the DNA into small fragments, each of which is sequenced.
Use a computer to search for overlaps.
Build a master sequence.
Good for short prokaryote genomes.
For n fragments, the number of possible overlaps is 2n(n-1).
Repeats in the sequence are a problem.
5
Shot-gun Sequencing
7
Assembly with F-R constraint. Assembly without F-R constraint. Scaffold with F-R constraint.
8
Genetic Maps
For long genomes, use genetic markers.
Use the shotgun method and locate known markers in the master sequence.
Known genes can serve as markers.
9
Restriction Map
Restriction endonuclease: an enzyme that binds to specific DNA sequences and makes a double-stranded cut at or near those sequences.
Type II enzymes always cut at the same place (over 2,500 type II enzymes are known).
e.g., HindII cuts at GTYRAC (Y = C or T, R = A or G), for example GTTAAC.
10
Complete and Partial Digest
Probability of each restriction site being cut = 1: complete digest. The distances between successive cuts are known and accurate.
Probability < 1: partial digest. Distances spanning more than one restriction site are also generated.
11
Partial Digest Problem (PDP)
$X = \{x_1 = 0, x_2, \ldots, x_n\}$: an ordered set of n points on a line.
$\Delta X = \{x_j - x_i \mid 1 \le i < j \le n\}$: the multiset of pairwise distances, with $\binom{n}{2}$ elements.
Partial Digest Problem: given a multiset L of $\binom{n}{2}$ integers (pairwise distances), find a set X of n integers such that $\Delta X = L$.
Also called the turnpike problem: reconstructing a highway from the distances between pairs of exits.
The solution X is not always unique.
e.g., $\Delta A = \Delta(A + v)$, where $A + v = \{a + v \mid a \in A\}$ (one set is a shift of the other): A = {0, 2, 4, 7, 10}, A + 100 = {100, 102, 104, 107, 110}.
e.g., $\Delta A = \Delta(-A)$: A = {0, 2, 4, 7, 10}, -A = {-10, -7, -4, -2, 0}.
In general, $U + V = \{u + v \mid u \in U, v \in V\}$ and $U - V = \{u - v \mid u \in U, v \in V\}$ are homometric, i.e. $\Delta(U + V) = \Delta(U - V)$.
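The non-uniqueness claims can be checked numerically. The sketch below is illustrative: `delta` is a hypothetical helper name, and the sets U and V are an assumed example, not taken from the slide.

```python
from collections import Counter
from itertools import combinations


def delta(points):
    """Multiset of pairwise distances between the points (the slide's ΔX)."""
    return Counter(abs(a - b) for a, b in combinations(points, 2))


A = [0, 2, 4, 7, 10]
print(sorted(delta(A).elements()))  # [2, 2, 3, 3, 4, 5, 6, 7, 8, 10]

# Shifting or mirroring a set leaves its pairwise distances unchanged.
assert delta([a + 100 for a in A]) == delta(A)
assert delta([-a for a in A]) == delta(A)

# U + V and U - V are homometric (U and V are an assumed example).
U, V = [6, 7, 9], [-6, 2, 6]
sums = [u + v for u in U for v in V]
diffs = [u - v for u in U for v in V]
assert delta(sums) == delta(diffs)
```

A `Counter` is used rather than a `set` because ΔX is a multiset: the distances 2 and 3 each occur twice for A.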
12
PDP (1): Brute force approach
Given L, compute ΔX for every possible X until an X is found such that ΔX = L.
Need to examine $\binom{M-1}{n-2}$ different sets of positions, i.e. $O(M^{n-2})$.
BruteForcePDP(L, n)
  M ← max(L)
  for every set of n-2 integers 0 < x_2 < ... < x_{n-1} < M
    X ← {0, x_2, ..., x_{n-1}, M}
    form ΔX from X
    if ΔX = L
      return X
  return “No Solution”
13
PDP (2): Brute force approach
Identical to BruteForcePDP() except that each x_i is drawn from L.
Need to examine $\binom{|L|}{n-2}$ different sets of positions; since $|L| = n(n-1)/2$, this is $O(n^{2n-4})$.
BruteForcePDP(L, n)
  M ← max(L)
  for every set of n-2 integers 0 < x_2 < ... < x_{n-1} < M taken from L
    X ← {0, x_2, ..., x_{n-1}, M}
    form ΔX from X
    if ΔX = L
      return X
  return “No Solution”
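Both brute-force variants fit in one Python sketch. The function name `brute_force_pdp` and the flag `restrict_to_L` are hypothetical; the flag switches between trying every integer in (0, M) and trying only values that occur in L. The restriction is safe because each point x_i is also the distance x_i - 0, so it must appear in L.

```python
from collections import Counter
from itertools import combinations


def delta(points):
    """Multiset of pairwise distances between the points."""
    return Counter(abs(a - b) for a, b in combinations(points, 2))


def brute_force_pdp(L, n, restrict_to_L=False):
    """Try every X = {0, x_2, ..., x_{n-1}, M}; return the first match or None."""
    target = Counter(L)
    M = max(L)
    if restrict_to_L:
        # PDP (2): inner points can only be values that occur in L.
        candidates = sorted(set(L) - {M})
    else:
        # PDP (1): inner points can be any integer strictly between 0 and M.
        candidates = range(1, M)
    for inner in combinations(candidates, n - 2):
        X = (0, *inner, M)
        if delta(X) == target:
            return list(X)
    return None


L = [2, 2, 3, 3, 4, 5, 6, 7, 8, 10]
print(brute_force_pdp(L, 5))                      # [0, 2, 4, 7, 10]
print(brute_force_pdp(L, 5, restrict_to_L=True))  # [0, 2, 4, 7, 10]
```

Both calls stop at the first solution in lexicographic order, so the mirror solution {0, 3, 6, 8, 10} is never reported.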
14
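PDP (3): Steven Skiena, 1990
The largest element of L determines the two outermost points of X.
e.g., L = {2, 2, 3, 3, 4, 5, 6, 7, 8, 10}
Pick 10: X = {0, 10}, L = {2, 2, 3, 3, 4, 5, 6, 7, 8}
Pick 8: X = {0, 2, 10} or X = {0, 8, 10}. Taking X = {0, 2, 10} removes 2 and 8: L = {2, 3, 3, 4, 5, 6, 7}
Pick 7: the new point is 3 or 7. Placing it at 3 would require the distance 3 - 2 = 1, which is not in L, so X = {0, 2, 7, 10}; removing 7, 5, and 3 leaves L = {2, 3, 4, 6}
...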
15
PartialDigest(L)
  width ← max(L)
  DELETE(width, L)
  X ← {0, width}
  PLACE(L, X)

Δ(y, X): the multiset of distances between a point y and all points in the set X.

PLACE(L, X)
  if L is empty
    output X
    return
  y ← max(L)
  if Δ(y, X) is a subset of L
    add y to X and remove Δ(y, X) from L
    PLACE(L, X)
    remove y from X and add Δ(y, X) to L
  if Δ(width-y, X) is a subset of L
    add width-y to X and remove Δ(width-y, X) from L
    PLACE(L, X)
    remove width-y from X and add Δ(width-y, X) to L
  return
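The PartialDigest/PLACE pseudocode translates fairly directly into Python. This is a sketch rather than a reference implementation: the names `partial_digest`, `place`, and `distances_to` are hypothetical, and instead of undoing changes to L it passes a reduced copy of the multiset into each recursive call.

```python
from collections import Counter


def distances_to(y, X):
    """Δ(y, X): multiset of distances between point y and every point in X."""
    return Counter(abs(y - x) for x in X)


def place(L, X, width, solutions):
    if not L:
        solutions.append(sorted(X))
        return
    y = max(L)
    # Try the largest remaining distance measured from the left end, then the right end.
    for point in dict.fromkeys((y, width - y)):
        d = distances_to(point, X)
        if all(L[dist] >= count for dist, count in d.items()):  # Δ(point, X) ⊆ L
            X.append(point)
            place(L - d, X, width, solutions)  # L - d drops exhausted distances
            X.pop()                            # backtrack


def partial_digest(L):
    """Return every point set X (containing 0) whose pairwise distances equal L."""
    width = max(L)
    remaining = Counter(L) - Counter([width])
    solutions = []
    place(remaining, [0, width], width, solutions)
    return solutions


print(partial_digest([2, 2, 3, 3, 4, 5, 6, 7, 8, 10]))
# [[0, 3, 6, 8, 10], [0, 2, 4, 7, 10]]
```

The two solutions are mirror images of each other, which matches the earlier remark that X is not always unique.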
16
Shortest Superstring Problem
Find a superstring of the reads, and the shortest one.
Shortest Superstring Problem: given a set of strings, find a shortest string that contains all of them.
Input: strings s_1, s_2, ..., s_n
Output: a shortest string s that contains all of s_1, s_2, ..., s_n as substrings
Example: for the set {000, 001, 010, 011, 100, 101, 110, 111}, concatenating the strings gives 001111110101100011010000 (24 characters), while a shortest superstring is 0001110100 (10 characters).
17
Shortest Superstring Problem (2)
Define overlap(s_i, s_j): the length of the longest prefix of s_j that matches a suffix of s_i.
The Shortest Superstring Problem then becomes a Traveling Salesman Problem, with a vertex for each string and edges weighted by the overlaps.
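To make the overlap definition concrete, the sketch below computes overlap(s_i, s_j) and then searches every ordering of the strings, which is the brute-force version of the traveling-salesman view. It assumes no input string is a substring of another, and it is only practical for a handful of strings; the function names are hypothetical.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest prefix of b that matches a suffix of a."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def shortest_superstring(strings):
    """Exhaustive search over orderings (assumes no string contains another)."""
    best = None
    for order in permutations(strings):
        s = order[0]
        for prev, nxt in zip(order, order[1:]):
            s += nxt[overlap(prev, nxt):]  # append only the non-overlapping tail
        if best is None or len(s) < len(best):
            best = s
    return best


words = ["000", "001", "010", "011", "100", "101", "110", "111"]
print(overlap("000", "001"))             # 2
print(len(shortest_superstring(words)))  # 10
```

Maximizing the total overlap along the ordering is equivalent to minimizing the superstring length, which is why the problem maps onto a traveling-salesman tour.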
18
DNA Arrays
Sequencing by Hybridization (SBH).
A chip holds millions of short DNA fragments, called probes.
The input DNA sequence hybridizes to the probes in the array (chip) through base complementarity.
19
Base coverage
A sample (genome) is amplified.
A base in the sample is copied into many reads.
But reads are generated at random, so the number of reads covering a base follows a Poisson distribution.
Similarly for k-mers: still a Poisson distribution, but with a different mean.
21
Coverage Depth and Extent
Coverage depth: the average number of times each base or k-mer is sequenced.
Coverage extent: the fraction of the genome covered by at least one read base or k-mer.
Given a genome of size G, read length L, and number of reads N, the total numbers of bases ($n_b$) and k-mers ($n_k$) are
$n_b = N L$, $n_k = N (L - k + 1)$, so $n_b / n_k = L / (L - k + 1)$.
22
Coverage Depth and Extent
Coverage depth of bases ($d_b$) and k-mers ($d_k$):
$d_b = n_b / G$, $d_k = n_k / G$, so $d_b / d_k = L / (L - k + 1)$.
For de novo sequencing, these relationships can be used to estimate the unknown genome size (G) and the coverage depth for bases ($d_b$) from the read data before assembly, using $G = n_k / d_k$ and $d_b = d_k \cdot L / (L - k + 1)$.
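A small worked example of these relationships follows. The numbers are made up for illustration, and the k-mer depth `d_k` is assumed to be supplied by the user (in practice it might be read off the peak of a k-mer frequency histogram, which is an assumption not stated on the slide).

```python
def estimate_genome(N, L, k, d_k):
    """Apply n_b = N*L, n_k = N*(L-k+1), G = n_k/d_k, d_b = d_k*L/(L-k+1)."""
    n_b = N * L                   # total sequenced bases
    n_k = N * (L - k + 1)         # total k-mers across all reads
    G = n_k / d_k                 # estimated genome size
    d_b = d_k * L / (L - k + 1)   # base coverage depth
    return n_b, n_k, G, d_b


# Illustrative values only: one million 100-bp reads, k = 21, k-mer depth 16.
n_b, n_k, G, d_b = estimate_genome(N=1_000_000, L=100, k=21, d_k=16)
print(n_b, n_k, G, d_b)  # 100000000 80000000 5000000.0 20.0
```

As a consistency check, n_b / G = 100,000,000 / 5,000,000 = 20, which agrees with the d_b computed from d_k.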
23
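Coverage Depth or Sequencing Depth
The coverage depth ($d_b$) is also called the sequencing depth (c).
From the Poisson distribution, the probability that a base is not covered is $P(X = 0) = e^{-c}$.
The coverage extent is $P(X > 0) = 1 - e^{-c}$.
To cover more than 99% of a genome, c > 4.6 (since $e^{-c} < 0.01$ requires $c > \ln 100 \approx 4.6$).
To ensure the whole genome is covered, the expected number of uncovered bases must satisfy $G e^{-c} < 1$, i.e. $c > \ln G$.
Human genome (3 Gb): c > 22.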
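The two thresholds quoted here follow directly from the Poisson formulas and can be recomputed in a few lines. The function names are hypothetical.

```python
import math


def coverage_extent(c):
    """Fraction of bases covered at least once: P(X > 0) = 1 - exp(-c)."""
    return 1 - math.exp(-c)


def depth_for_extent(fraction):
    """Smallest depth c such that 1 - exp(-c) reaches `fraction`."""
    return -math.log(1 - fraction)


def depth_for_full_coverage(G):
    """Depth c at which the expected number of uncovered bases G*exp(-c) is 1."""
    return math.log(G)


print(round(depth_for_extent(0.99), 2))         # 4.61
print(round(depth_for_full_coverage(3e9), 2))   # 21.82
```

The second value is why the slide rounds up to c > 22 for a 3 Gb genome.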
25
SBH
Given an unknown DNA sequence, a DNA array provides all strings of length l that the sequence contains, but no information about their positions.
Spectrum(s, l): for a string s of length n, the l-mer composition, i.e. the multiset of the n - l + 1 l-mers in s.
e.g., l = 3, s = TATGGTGC: Spectrum(s, l) = {TAT, ATG, TGG, GGT, GTG, TGC}
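Spectrum(s, l) is a one-line computation. The sketch below returns the l-mers in sorted order to mimic the fact that the array gives no positional information; the function name `spectrum` is chosen to match the slide's notation.

```python
def spectrum(s, l):
    """All n - l + 1 l-mers of s, sorted so that no position information remains."""
    return sorted(s[i:i + l] for i in range(len(s) - l + 1))


print(spectrum("TATGGTGC", 3))
# ['ATG', 'GGT', 'GTG', 'TAT', 'TGC', 'TGG']
```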
26
SBH as a Hamiltonian Path Problem
Two l-mers p and q overlap if overlap(p, q) = l - 1.
Hamiltonian Path Problem: given Spectrum(s, l), create a vertex for every l-mer in Spectrum(s, l), connect every pair of vertices whose l-mers overlap, and find a path that visits every vertex exactly once.
This is the Overlap-Layout-Consensus (OLC) view of assembly.
The Hamiltonian Path Problem is NP-complete.
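A backtracking sketch of the Hamiltonian-path formulation is shown below. It is exponential in the worst case, consistent with the NP-completeness noted above, and is meant only for tiny spectra; `hamiltonian_sbh` is a hypothetical name.

```python
def hamiltonian_sbh(lmers):
    """Backtracking search for a path that uses every l-mer (vertex) exactly once."""
    lmers = list(lmers)
    n = len(lmers)
    # Edge i -> j when the last l-1 letters of lmers[i] equal the first l-1 of lmers[j].
    succ = {i: [j for j in range(n) if j != i and lmers[i][1:] == lmers[j][:-1]]
            for i in range(n)}

    def extend(path, used):
        if len(path) == n:
            return path
        for j in succ[path[-1]]:
            if j not in used:
                found = extend(path + [j], used | {j})
                if found:
                    return found
        return None

    for start in range(n):
        path = extend([start], {start})
        if path:
            # Spell the sequence: first l-mer, then the last letter of each later one.
            return lmers[path[0]] + "".join(lmers[j][-1] for j in path[1:])
    return None


print(hamiltonian_sbh(["ATG", "GGT", "GTG", "TAT", "TGC", "TGG"]))  # TATGGTGC
```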
27
OLC
Conventional shotgun sequencing: overlap-layout-consensus.
Overlap: use a computer to search for overlaps, trying all possible pairs of fragments.
Layout: put the fragments together.
Consensus: error correction.
Good for short prokaryote genomes.
For n fragments, the number of possible overlaps is 2n(n-1).
Difficulties: there is no solution for the “repeat problem” of finding the correct path in the layout step, and the assembly can contain sequence errors.
Programs: PHRAP, CAP, TIGR, CELERA.
28
SBH as an Eulerian Path Problem
Build a graph whose vertices are all (l-1)-mers and whose edges correspond to the l-mers from Spectrum(s, l) (details later).
Find a path that visits every edge exactly once.
29
Eulerian Path Problem
Repeatedly find cycles in the graph and splice them together into one Eulerian cycle.
Runs in linear time.
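The Eulerian formulation can be sketched with an iterative, stack-based variant of Hierholzer's algorithm, which is one standard way to splice cycles together in time linear in the number of edges. The function name `eulerian_sbh` is hypothetical, and the sketch assumes the spectrum really does admit an Eulerian path.

```python
from collections import defaultdict


def eulerian_sbh(lmers):
    """Reconstruct a string whose l-mers are exactly `lmers` (edges used once each)."""
    graph = defaultdict(list)
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for m in lmers:
        u, v = m[:-1], m[1:]      # each l-mer is an edge between two (l-1)-mers
        graph[u].append(v)
        outdeg[u] += 1
        indeg[v] += 1
    nodes = set(indeg) | set(outdeg)
    # Start where out-degree exceeds in-degree by one; otherwise anywhere (a cycle).
    start = next((n for n in nodes if outdeg[n] - indeg[n] == 1), next(iter(graph)))
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())      # dead end: emit the vertex
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])


print(eulerian_sbh(["TAT", "ATG", "TGG", "GGT", "GTG", "TGC"]))  # TATGGTGC
```

Each edge is pushed and popped once, which is where the linear running time comes from.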
30
De Bruijn Graph
Partition the read fragments into fixed-size k-mers (k = 27, for example).
Each (k-1)-mer becomes a graph node.
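A minimal sketch of building de Bruijn graph edges from reads is given below. It ignores reverse complements and sequencing errors, uses a tiny k for readability, and stores each edge with a count of how many times its k-mer was seen; `de_bruijn_edges` is a hypothetical name.

```python
from collections import Counter


def de_bruijn_edges(reads, k):
    """Map (prefix (k-1)-mer, suffix (k-1)-mer) -> number of times the k-mer occurs."""
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges


edges = de_bruijn_edges(["ACGTA", "CGTAC"], k=3)
for (u, v), count in sorted(edges.items()):
    print(u, "->", v, "x", count)
# AC -> CG x 1
# CG -> GT x 2
# GT -> TA x 2
# TA -> AC x 1
```

Counting repeated k-mers on a single edge, rather than drawing parallel edges, keeps the graph size tied to the number of distinct k-mers.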
31
OLC vs. De Bruijn Graph
32
De Bruijn Graph
Figure panels: Eulerian graph; de Bruijn graph.
Glue parallel links into a single link with a multiplicity (e.g., a multiplicity of 3).
Tangle: a node whose number of input edges is not equal to its number of output edges.
34
De Bruijn Graph
How can a de Bruijn graph be constructed from a collection of sequencing reads?
Gluing requires knowledge of the finished sequence, so the de Bruijn graph cannot be constructed from the reads until sequencing is completed.
Let s be a sequencing read with errors.
If the genome sequence G is known, errors in s can be corrected by aligning s against G.
But G is not known until the last “consensus” step.
EULER uses spectral alignment (SA) to minimize errors in the first step.
36
Programs
ABySS (Assembly By Short Sequences): Simpson, 2009. www.cs.uml.edu/~kim/580/08_abyss.pdf
Velvet: Zerbino and Birney, 2008. www.cs.uml.edu/~kim/580/08_velvet.pdf
EULER: Pevzner, 2001. www.cs.uml.edu/~kim/580/01_pevzner.pdf www.cs.uml.edu/~kim/580/09_chaisson.pdf
SOAPdenovo (Short Oligonucleotide Alignment Program): Beijing Genomics Institute. www.cs.uml.edu/~kim/580/09_soap.pdf
37
ABySS
Proceeds in two stages.
Stage 1: all possible k-mers are generated from the reads; read errors are removed and initial contigs are constructed.
Stage 2: mate-pairs are used to extend contigs.
Distributed implementation of the de Bruijn graph on a cluster, using the Message Passing Interface (MPI) across multiple computers.
38
ABySS – Stage 1
Three steps: load read data into a distributed de Bruijn graph, resolve read errors, merge graph nodes.
Load read data into a distributed de Bruijn graph:
Reads with unknown bases are discarded.
Each read is broken into (read_length - k + 1) overlapping k-mers.
Each k-mer is assigned to one cluster node.
Compute the adjacency of k-mers: for each k-mer, a message is sent to its eight possible neighbors (four possible one-base extensions in each direction).
If a neighbor exists, the two k-mers share a k-1 bp overlap.
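The loading and adjacency steps can be imitated on a single machine. This sketch is not ABySS code: it has no MPI distribution and ignores reverse complements, and the function names are hypothetical. It only illustrates why each k-mer has exactly eight possible neighbors.

```python
def kmers_from_reads(reads, k):
    """Break reads into k-mers, discarding reads that contain unknown bases."""
    kmers = set()
    for read in reads:
        if set(read) - set("ACGT"):
            continue  # e.g. a read containing N
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


def possible_neighbors(kmer):
    """Four predecessors and four successors, each sharing a (k-1)-base overlap."""
    predecessors = [b + kmer[:-1] for b in "ACGT"]
    successors = [kmer[1:] + b for b in "ACGT"]
    return predecessors + successors


def adjacency(kmers):
    """For each k-mer, keep only the possible neighbors that are actually present."""
    return {k: [n for n in possible_neighbors(k) if n in kmers] for k in kmers}


kmers = kmers_from_reads(["ACGT", "ANGT"], k=3)
print(sorted(kmers))                    # ['ACG', 'CGT']
print(len(possible_neighbors("ACG")))   # 8
print(adjacency(kmers)["ACG"])          # ['CGT']
```

In the distributed setting described on the slide, the membership test `n in kmers` would become a message to whichever cluster node owns that k-mer.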
39
ABySS – Stage 1 (cont’d)
Resolve read errors.
Remove dead-ends: when the correct k-mers of a read connect to incorrect k-mers, the incorrect k-mers are likely to be unique and most will not have an extension, so one end of the branch terminates with no extension. Dead-end branches are traced backward to the ambiguous point and are removed if they are shorter than a threshold.
Remove bubbles: a bubble is a branch that diverges and rejoins later, caused by single-base differences.
40
ABySS – Stages 1 and 2
Vertex merging: merge vertices linked by unambiguous edges.
Contig merging: use paired-end information.
41
ABySS Results
Genome of an African male from the NCBI Short Read Archive, accession SRA000271.
3.5 billion mate-paired reads, 42x coverage.
Read length: 36-42 bp; median fragment size: 210 bp.
At k = 27, 15 h run time without paired-end information.
42
ABySS Comparisons
43
Velvet
Construct a graph.
Transform reads into roadmaps: from a read, generate k-mers with the read ID and the position in the read (called roadmaps).
Each read is transformed into a set of overlapping k-mers, with hash links to previous reads containing the same k-mers.
Second database: for each read, which of its k-mers are overlapped by subsequent reads.
Trace reads through the graph using the roadmaps.
45
Velvet
Graph simplification: a node with one outgoing link can be combined with a next node that has one incoming link.
Error removal: focus on topological features.
Tips (dead-ends) shorter than 2k.
Bubbles due to internal read errors (Tour Bus algorithm).
Erroneous connections due to distant tips merging.
Breadcrumb: use read pairs to extend contigs.
46
EULER, 2001
EULER implements the Eulerian Path approach.
Issues with real data:
Reads may have errors. Error correction is typically done in the “consensus” stage; EULER corrects errors in the first step using SA (Spectral Alignment).
Repeat problem: addressed with the de Bruijn graph.
47
Spectral Alignment (SA)
Although the genome sequence G is not known, the set G_l of all l-mers present in G can be accurately predicted.
An l-mer is called solid if it belongs to more than M reads.
EULER approach: approximate G_l by the set of all solid l-mers.
SBH problem without read errors: construct a graph with edges corresponding to the l-mers from Spectrum(s, l), and find a path visiting every edge exactly once.
SA: given a string s and G_l, find the minimum number of mutations in s that transform it into a string whose l-mers all belong to G_l.
Can be solved efficiently by dynamic programming.
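The definition of a solid l-mer is easy to state in code. The sketch below counts, for each l-mer, the number of reads containing it (not the number of occurrences) and keeps those found in more than M reads. It ignores reverse complements, the reads are toy data, and `solid_lmers` is a hypothetical name.

```python
from collections import Counter


def solid_lmers(reads, l, M):
    """l-mers that belong to more than M reads (an approximation of G_l)."""
    reads_containing = Counter()
    for read in reads:
        # A set, so an l-mer repeated inside one read is counted once for that read.
        for lmer in {read[i:i + l] for i in range(len(read) - l + 1)}:
            reads_containing[lmer] += 1
    return {lmer for lmer, count in reads_containing.items() if count > M}


reads = ["ACGT", "ACGA", "ACGT", "TCGT"]
print(sorted(solid_lmers(reads, l=3, M=1)))  # ['ACG', 'CGT']
```

Here CGA and TCG each occur in a single read, so they are treated as likely errors and excluded from the approximated G_l.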
48
Spectral Alignment (SA)
Formulation: given a set of reads R = {r_1, ..., r_n}, an integer l, and an upper bound Δ on the number of errors in each read, let the spectrum S_l be the set of all l-mers from the reads r_1, ..., r_n and their reverse complements r_1', ..., r_n'. Introduce up to Δ corrections in each read in R such that |S_l| is minimized.
Result:
One correction in a read can correct l l-mers from R and l from R'.
Removes 86.5% of read errors.
But it can also create errors: one change in a read may change all reads in the region.
Introducing an error is acceptable as long as the errors from overlapping reads covering the same position are consistent, corresponding to a single mutation in the genome.
Corrected 234,410 errors and introduced 1,452 errors in NM.
49
EULER - Results
No incorrect contigs.
50
Summary of EULER
Eulerian Path approach: de Bruijn graph.
Do error correction early: SA.
Fill gaps ASAP.
51
EULER+, 2004
A-Bruijn graph: to handle errors in reads, introduce vertices based on ungapped alignments that allow mismatches, rather than the exact l-mers used in de Bruijn assembly.
Graph simplification algorithms remove errors in edges.
The size of the de Bruijn graph is proportional to the coverage, so it requires a large memory; short reads need higher coverage than long-read sequencing.
52
EULER-SR, 2008
Focus on a memory-efficient algorithm for Short Reads.
Results (error correction, E. coli): 68% error-free reads; 99.6% of errors are corrected; 12 h on a single processor, 1/2 h for assembly.
53
EULER – SR -- Result
54
EULER-USR, 2009
Shows results of EULER-SR on error-prone Illumina read data.
Shows that 35-nt reads are sufficient when mate-pairs are used.
56
SOAPdenovo – Schematic Overview
57
Stages: Short Read Data
Data: fragment and paired-end libraries are sequenced using various insert sizes; read lengths range from 35 to 75 bp, with insert sizes of 140 bp, 440 bp, 2.6 kb, 6 kb, and 9.6 kb.
Preassembly error correction: identify low-frequency 17-mers (occurring fewer than 3 times) and correct them to the candidate with the highest frequency.
The number of distinct 25-mers was reduced from 14.6 billion to 5.0 billion for an Asian genome.
58
Stage 2: De Bruijn Graph
Only the single-end reads and the paired-end reads with short insert sizes (<1 kb) were used, because of the high probability of chimeric reads in the long-insert pairs from the circularization and fragmentation process.
Further error correction in the graph: clip tips (shorter than 50 bp), remove low-coverage links, resolve tiny repeats (longer than k but shorter than the read length), merge bubbles.
For the Asian genome: removed 323 M (6.5%) tip nodes, filtered 402.6 M low-coverage nodes, resolved 4.4 M tiny repeats, and merged 4.2 M bubbles.
59
OLC vs. de Bruijn Graph
61
Benefits
63
Scaffold linkage
Connecting contigs: paired-end reads, mate-pairs.