8. DNA Sequencing. Fred Sanger, Cambridge, England Partition copied DNA into four groups Each group has one of four bases starved ACGTAAGCTA with T starved.

1 8. DNA Sequencing

2 DNA Sequencing. Fred Sanger, Cambridge, England. Partition the copied DNA into four groups; each group is starved of one of the four bases. For example, ACGTAAGCTA with T starved produces ACG and ACGTAAGC. Run the experiment with each of the four bases starved, producing a ladder of all sub-fragments that stop just before the starved base, and collect the resulting fragments by length.
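The ladder idea can be illustrated with a small sketch. It is an idealized model, not a simulation of the chemistry: the function names `starved_fragments` and `reconstruct` are made up for this example, and it assumes every stop position yields exactly one fragment.

```python
def starved_fragments(template: str, starved: str) -> list[str]:
    """Prefixes of `template` that stop just before each occurrence of `starved`."""
    return [template[:i] for i, base in enumerate(template) if base == starved]


def reconstruct(ladders: dict[str, list[str]]) -> str:
    """Read the sequence back by sorting all fragments by length.

    A fragment of length n produced with base b starved means position n holds b.
    """
    by_length = {len(frag): base for base, frags in ladders.items() for frag in frags}
    return "".join(by_length[n] for n in sorted(by_length))


template = "ACGTAAGCTA"
print(starved_fragments(template, "T"))  # ['ACG', 'ACGTAAGC']

ladders = {base: starved_fragments(template, base) for base in "ACGT"}
print(reconstruct(ladders))  # ACGTAAGCTA
```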

3 DNA Sequencing. Later sequencing machines sequence fragments of 500-700 nt, called reads. Assembling the reads into a continuous genome is difficult. Approaches: shotgun sequencing and, currently, Next Generation Sequencing (NGS).

4 Shotgun Method. Break up the DNA into small fragments, each of which is sequenced. Use a computer to search for overlaps and build a master sequence. Good for short prokaryote genomes. For n fragments, the number of possible overlaps is 2n(n-1). Repeats in the sequence are a problem.

5 Shot-gun Sequencing


7 Scaffold with F-R Constraint. Assembly with the F-R constraint versus assembly without the F-R constraint.

8 Genetic Maps. For long genomes, use genetic markers: apply the shotgun method and locate known markers in the master sequence. Known genes can serve as markers.

9 Restriction Map. A restriction endonuclease is an enzyme that binds to specific DNA sequences and makes a double-stranded cut at or near them. Type II enzymes always cut at the same place (over 2,500 type II enzymes); e.g., HindII cuts at GTGCAC or GTTAAC.

10 Complete and Partial Digest. Complete digest: the probability of a restriction site being cut is 1, so the distance between successive cuts is known and accurate. Partial digest: the probability is less than 1, so distances spanning more than one restriction site are also generated.

11 Partial Digest Problem (PDP). $X = \{x_1 = 0, x_2, \ldots, x_n\}$ is an ordered set of n points on a line, and $\Delta X = \{x_j - x_i \mid 1 \le i < j \le n\}$ is the multiset of pairwise distances, with $\binom{n}{2}$ elements. Problem: given a multiset L of $\binom{n}{2}$ integers (the pairwise distances), find a set X of n integers such that $\Delta X = L$. Also called the Turnpike problem: reconstructing a highway from the distances between pairs of exits. The solution X is not always unique. e.g., a shift: $\Delta A = \Delta(A + v)$, where $A + v = \{a + v \mid a \in A\}$; for A = {0, 2, 4, 7, 10}, A + 100 = {100, 102, 104, 107, 110}. e.g., a reflection: $\Delta A = \Delta(-A)$; for A = {0, 2, 4, 7, 10}, -A = {-10, -7, -4, -2, 0}. In general, $U + V = \{u + v\}$ and $U - V = \{u - v\}$ (over all $u \in U$, $v \in V$) are homometric, i.e., they have the same multiset of pairwise distances.
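A short sketch of the distance multiset may help here. The helper name `delta` is chosen for this example only; it represents the multiset as a `collections.Counter` and checks that a shifted and a mirrored copy of A produce the same distances as A.

```python
from collections import Counter
from itertools import combinations


def delta(points):
    """Multiset of pairwise distances of a point set, as a Counter."""
    return Counter(abs(a - b) for a, b in combinations(points, 2))


A = [0, 2, 4, 7, 10]
shifted = [a + 100 for a in A]   # A + 100
mirrored = [-a for a in A]       # -A

print(sorted(delta(A).elements()))  # [2, 2, 3, 3, 4, 5, 6, 7, 8, 10]
print(delta(A) == delta(shifted) == delta(mirrored))  # True
```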

12 PDP (1): Brute-force approach. Given L, compute $\Delta X$ for every possible X until an X is found such that $\Delta X = L$. This examines $\binom{M-1}{n-2}$ different sets of positions, i.e., $O(M^{n-2})$.
    BruteForcePDP(L, n)
        M ← max(L)
        for every set of n-2 integers 0 < x_2 < ... < x_{n-1} < M
            X ← {0, x_2, ..., x_{n-1}, M}
            form ΔX from X
            if ΔX = L
                return X
        return "No Solution"

13 PDP (2): Brute-force approach. Identical to BruteForcePDP() except that every $x_i$ is taken from L (each point is at distance $x_i$ from 0, so $x_i \in L$). This examines $\binom{|L|}{n-2}$ different sets of positions; since $|L| = n(n-1)/2$, that is $O(n^{2n-4})$.
    BruteForcePDP(L, n)
        M ← max(L)
        for every set of n-2 integers 0 < x_2 < ... < x_{n-1} < M from L
            X ← {0, x_2, ..., x_{n-1}, M}
            form ΔX from X
            if ΔX = L
                return X
        return "No Solution"
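Both brute-force variants can be written as one small Python function. This is a minimal sketch rather than the slide author's code: the flag `restrict_to_L` is a name invented here to switch between trying every integer below M and trying only values that occur in L.

```python
from collections import Counter
from itertools import combinations


def brute_force_pdp(L, n, restrict_to_L=False):
    """Return the first point set X with delta(X) == L, or None if there is none."""
    target = Counter(L)
    M = max(L)
    # Variant 1 tries every integer in (0, M); variant 2 only distances present in L.
    candidates = sorted(set(L) - {M}) if restrict_to_L else range(1, M)
    for inner in combinations(candidates, n - 2):
        X = (0, *inner, M)
        if Counter(b - a for a, b in combinations(X, 2)) == target:
            return list(X)
    return None


L = [2, 2, 3, 3, 4, 5, 6, 7, 8, 10]
print(brute_force_pdp(L, 5))                      # [0, 2, 4, 7, 10]
print(brute_force_pdp(L, 5, restrict_to_L=True))  # [0, 2, 4, 7, 10]
```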

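14 PDP (3): Steven Skiena, 1990. The largest element of L determines the two outermost points of X. e.g., L = {2, 2, 3, 3, 4, 5, 6, 7, 8, 10}. Pick 10: X = {0, 10}, L = {2, 2, 3, 3, 4, 5, 6, 7, 8}. Pick 8: X = {0, 2, 10} or X = {0, 8, 10}; taking X = {0, 2, 10} leaves L = {2, 3, 3, 4, 5, 6, 7}. Pick 7: the new point is 7 or 3, but 3 would require the distance 3 - 2 = 1, which is not in L, so X = {0, 2, 7, 10} and L = {2, 3, 4, 6}. ...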

15 PartialDigest(L)
        width ← max(L)
        DELETE(width, L)
        X ← {0, width}
        PLACE(L, X)
    Δ(y, X): the multiset of distances between a point y and all points in the set X.
    PLACE(L, X)
        if L is empty
            output X
            return
        y ← max(L)
        if Δ(y, X) is a subset of L
            add y to X and remove Δ(y, X) from L
            PLACE(L, X)
            remove y from X and add Δ(y, X) to L
        if Δ(width-y, X) is a subset of L
            add width-y to X and remove Δ(width-y, X) from L
            PLACE(L, X)
            remove width-y from X and add Δ(width-y, X) to L
        return
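The PartialDigest pseudocode translates fairly directly into Python. The version below is a sketch under a few assumptions of its own: L is held in a `Counter` so multiset removal is simple, solutions are collected in a list instead of being printed, and the two candidate positions are de-duplicated when y equals width - y.

```python
from collections import Counter


def partial_digest(L):
    """Return every point set X (containing 0) whose pairwise distances equal L."""
    remaining = Counter(L)
    width = max(remaining)
    remaining[width] -= 1            # DELETE(width, L)
    X = [0, width]
    solutions = []

    def place():
        if sum(remaining.values()) == 0:
            solutions.append(sorted(X))
            return
        y = max(d for d, count in remaining.items() if count > 0)
        # Try the point at distance y from the left end, then from the right end.
        for candidate in dict.fromkeys((y, width - y)):
            dists = Counter(abs(candidate - x) for x in X)   # delta(candidate, X)
            if all(remaining[d] >= c for d, c in dists.items()):
                remaining.subtract(dists)
                X.append(candidate)
                place()
                X.pop()                                      # backtrack
                remaining.update(dists)

    place()
    return solutions


print(sorted(partial_digest([2, 2, 3, 3, 4, 5, 6, 7, 8, 10])))
# [[0, 2, 4, 7, 10], [0, 3, 6, 8, 10]]
```

The two results are mirror images of each other, which matches the non-uniqueness of Partial Digest solutions under reflection.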

16 Shortest Superstring Problem. Find a superstring of the reads, and the shortest one. Given a set of strings, find a shortest string that contains all of them. Input: strings s_1, s_2, ..., s_n. Output: a shortest string s that contains all of s_1, s_2, ..., s_n. Example: the eight 3-bit strings {000, 001, 010, 011, 100, 101, 110, 111} concatenate to the 24-character superstring 001111110101100011010000, while 0001110100 is a shortest superstring, of length 10.

17 Shortest Superstring Problem (2). Define overlap(s_i, s_j) as the length of the longest prefix of s_j that matches a suffix of s_i. The Shortest Superstring Problem then becomes a Traveling Salesman Problem, with a vertex for each string and edges weighted by the overlaps.
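The overlap function is easy to state in code. In the sketch below, `merge_in_order` is a helper invented for illustration: it builds the superstring for one given visiting order of the strings, which is what a single path in the Traveling Salesman formulation corresponds to. It does not search for the best order.

```python
def overlap(s_i: str, s_j: str) -> int:
    """Length of the longest prefix of s_j that matches a suffix of s_i."""
    for k in range(min(len(s_i), len(s_j)), 0, -1):
        if s_i.endswith(s_j[:k]):
            return k
    return 0


def merge_in_order(strings: list[str]) -> str:
    """Superstring obtained by visiting the strings in the given order."""
    result = strings[0]
    for prev, nxt in zip(strings, strings[1:]):
        result += nxt[overlap(prev, nxt):]   # append only the non-overlapping tail
    return result


order = ["000", "001", "011", "111", "110", "101", "010", "100"]
print(overlap("001", "011"))   # 2
print(merge_in_order(order))   # 0001110100
```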

18 DNA Arrays: Sequencing by Hybridization (SBH). A chip carries millions of short DNA fragments called probes. The input DNA sequence hybridizes to the fragments in the array (chip) through base complementarity.

19 Base coverage. A sample (genome) is amplified, and each base in the sample is copied into many reads. Because the reads are generated at random, the number of reads covering a base follows a Poisson distribution. The same holds for k-mers: still a Poisson distribution, but with a different mean.


21 Coverage Depth and Extent. Coverage depth: the average number of times each base or k-mer is sequenced. Coverage extent: the fraction of the genome covered by at least one base or k-mer. Given a genome of size G, read length L, and read number N, the total numbers of bases ($n_b$) and k-mers ($n_k$) are $n_b = N L$ and $n_k = N (L - k + 1)$, so $n_b / n_k = L / (L - k + 1)$.

22 Coverage Depth and Extent. Coverage depths of bases ($d_b$) and k-mers ($d_k$): $d_b = n_b / G$ and $d_k = n_k / G$, so $d_b / d_k = L / (L - k + 1)$. For de novo sequencing, these relationships can be used to estimate the unknown genome size G and the coverage depth for bases $d_b$ from read data before assembly: $G = n_k / d_k$ and $d_b = d_k \cdot L / (L - k + 1)$.
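As a worked example of these relationships, the sketch below estimates G and the base depth from a k-mer depth. The function name and the numbers (one million 100 bp reads, k = 21, observed k-mer depth 16) are made up for illustration; in practice the k-mer depth would be read off a k-mer frequency histogram.

```python
def estimate_genome(n_reads: int, read_len: int, k: int, kmer_depth: float):
    """Return (genome size G, base coverage depth d_b) from the k-mer depth d_k."""
    n_k = n_reads * (read_len - k + 1)                       # total k-mers
    genome_size = n_k / kmer_depth                           # G = n_k / d_k
    base_depth = kmer_depth * read_len / (read_len - k + 1)  # d_b = d_k * L / (L-k+1)
    return genome_size, base_depth


G, d_b = estimate_genome(n_reads=1_000_000, read_len=100, k=21, kmer_depth=16)
print(G, d_b)  # 5000000.0 20.0
```

The result is consistent with the direct definition: 1,000,000 reads of 100 bp over a 5 Mb genome give a base depth of 20.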

23 Coverage Depth or Sequencing Depth. The coverage depth $d_b$ is also called the sequencing depth c. Under the Poisson model, the probability that a base is not covered is $P(X = 0) = e^{-c}$, so the coverage extent is $P(X > 0) = 1 - e^{-c}$. To cover >99% of a genome, c > 4.6. To ensure the whole genome is covered, the expected number of uncovered bases must satisfy $G e^{-c} < 1$; for the human genome (3 Gb), c > 22.
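The thresholds quoted above follow directly from the Poisson formulas, as this small sketch shows. The function names are illustrative only.

```python
import math


def coverage_extent(c: float) -> float:
    """Expected fraction of the genome covered at sequencing depth c."""
    return 1 - math.exp(-c)


def depth_for_extent(fraction: float) -> float:
    """Depth c at which the expected covered fraction reaches `fraction`."""
    return -math.log(1 - fraction)


def depth_for_full_coverage(genome_size: float) -> float:
    """Depth c at which the expected number of uncovered bases, G*exp(-c), drops to 1."""
    return math.log(genome_size)


print(round(depth_for_extent(0.99), 2))        # 4.61
print(round(depth_for_full_coverage(3e9), 1))  # 21.8
```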


25 SBH. Given an unknown DNA sequence, a DNA array provides all strings of length l that the sequence contains, but no information about their positions. Spectrum(s, l): for a string s of length n, the l-mer composition, i.e., the multiset of the n - l + 1 l-mers in s. e.g., for l = 3 and s = TATGGTGC, Spectrum(s, l) = {TAT, ATG, TGG, GGT, GTG, TGC}.
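Computing a spectrum takes one line of Python. In this sketch the result is returned sorted, a choice made here to emphasize that the array reports which l-mers occur but not where.

```python
def spectrum(s: str, l: int) -> list[str]:
    """All n - l + 1 l-mers of s, sorted so that positional order is discarded."""
    return sorted(s[i:i + l] for i in range(len(s) - l + 1))


print(spectrum("TATGGTGC", 3))
# ['ATG', 'GGT', 'GTG', 'TAT', 'TGC', 'TGG']
```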

26 SBH as a Hamiltonian Path Problem. Two l-mers p and q overlap if overlap(p, q) = l - 1. Given Spectrum(s, l), create a vertex for every l-mer, connect p to q whenever they overlap, and find a path that visits every vertex exactly once. This is the approach of Overlap-Layout-Consensus (OLC). The Hamiltonian Path Problem is NP-complete.

27 OLC. Conventional shotgun sequencing uses overlap-layout-consensus. Overlap: use a computer to search for overlaps, trying all possible pairs of fragments. Layout: put the fragments together. Consensus: error correction. Good for short prokaryote genomes. For n fragments, the number of possible overlaps is 2n(n-1). Difficulties: there is no solution to the "repeat problem" of finding the correct path in the layout step, and it produces sequencing errors. Programs: PHRAP, CAP, TIGR, CELERA.

28 SBH as an Eulerian Path Problem. Build a graph whose vertices are all (l-1)-mers and whose edges correspond to the l-mers from Spectrum(s, l), then find a path that visits every edge exactly once.

29 Eulerian Path Problem Repeatedly find Eulerian cycles in the graph Linear time
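The Eulerian formulation can be sketched with Hierholzer's algorithm, which runs in time linear in the number of edges. The code below is an illustration, not a production assembler: it assumes the spectrum is error-free and that an Eulerian path exists, and the function name is made up for this example.

```python
from collections import defaultdict


def eulerian_reconstruct(lmers: list[str]) -> str:
    """Rebuild a string from its l-mers by walking every edge exactly once."""
    graph = defaultdict(list)    # (l-1)-mer -> list of successor (l-1)-mers
    balance = defaultdict(int)   # out-degree minus in-degree
    for lmer in lmers:
        u, v = lmer[:-1], lmer[1:]
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # A path starts at the vertex with one extra outgoing edge; a cycle can start anywhere.
    start = next((node for node, b in balance.items() if b == 1), next(iter(graph)))
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())   # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: emit vertex and back up
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])


print(eulerian_reconstruct(["TAT", "ATG", "TGG", "GGT", "GTG", "TGC"]))  # TATGGTGC
```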

30 De Bruijn Graph. Partition the read fragments into fixed-size k-mers (k = 27, for example). Each (k-1)-mer becomes a graph node.
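A minimal sketch of building such a graph from reads is shown below. It stores each edge with a multiplicity, so a k-mer seen in several reads becomes one edge with a count. The function name and the toy reads are assumptions for this example, and reverse complements are ignored for simplicity.

```python
from collections import Counter


def de_bruijn_edges(reads: list[str], k: int) -> Counter:
    """Map each edge (prefix (k-1)-mer, suffix (k-1)-mer) to its multiplicity."""
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges


edges = de_bruijn_edges(["ACGTAC", "CGTACG"], k=3)
for (u, v), multiplicity in sorted(edges.items()):
    print(u, "->", v, "x", multiplicity)
# AC -> CG x 2
# CG -> GT x 2
# GT -> TA x 2
# TA -> AC x 2
```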

31 OLC vs. De Bruijn Graph

32 De Bruijn Graph. From the Eulerian graph to the de Bruijn graph: glue parallel links into one edge with a multiplicity (e.g., a multiplicity of 3). Tangle: the number of input edges is not equal to the number of output edges.


34 De Bruijn Graph. How can a de Bruijn graph be constructed from a collection of sequencing reads? Gluing requires knowledge of the finished sequence, so the graph cannot be constructed from the reads until sequencing is completed. Let s be a sequencing read with errors. If the genome sequence G were known, the errors in s could be corrected by aligning s against G, but G is not known until the last "consensus" step. EULER uses spectral alignment (SA) to minimize errors in the first step.


36 Programs. ABySS (Assembly By Short Sequences): Simpson, 2009. Velvet: Zerbino and Birney, 2008. EULER: Pevzner, 2001. SOAPdenovo (Short Oligonucleotide Alignment Program): Beijing Genomics Institute.

37 ABySS. Proceeds in two stages. Stage 1: all possible k-mers are generated from the reads, read errors are removed, and initial contigs are constructed. Stage 2: mate-pairs are used to extend the contigs. The de Bruijn graph is implemented in a distributed fashion across a cluster, using the Message Passing Interface (MPI) over multiple computers.

38 ABySS – Stage 1. Three steps: load the read data into a distributed de Bruijn graph, resolve read errors, and merge graph nodes. Loading: reads with unknown bases are discarded; each read is broken into (read_length - k + 1) overlapping k-mers; each k-mer is assigned to one cluster node. Adjacency: for each k-mer, a message is sent to its eight possible neighbors; if a neighbor exists, the two k-mers must share a (k-1) bp overlap.
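The "eight possible neighbors" are the four one-base extensions on each side of a k-mer. The sketch below shows the loading and adjacency steps in a single process. It is not ABySS code: the function names are invented, a Python set stands in for the distributed hash table, a membership test stands in for the MPI messages, and reverse complements are ignored.

```python
def kmers_of_read(read: str, k: int) -> list[str]:
    """Break a read into its (read_length - k + 1) k-mers; discard reads with unknown bases."""
    if any(base not in "ACGT" for base in read):
        return []
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def possible_neighbors(kmer: str) -> list[str]:
    """The eight k-mers that share a (k-1)-base overlap with `kmer`."""
    predecessors = [base + kmer[:-1] for base in "ACGT"]
    successors = [kmer[1:] + base for base in "ACGT"]
    return predecessors + successors


def adjacency(reads: list[str], k: int) -> dict[str, list[str]]:
    """For every observed k-mer, list the possible neighbors that were also observed."""
    present = {kmer for read in reads for kmer in kmers_of_read(read, k)}
    return {kmer: [n for n in possible_neighbors(kmer) if n in present] for kmer in present}


print(kmers_of_read("ACGTA", 3))        # ['ACG', 'CGT', 'GTA']
print(adjacency(["ACGTA"], 3)["CGT"])   # ['ACG', 'GTA']
```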

39 ABySS – Stage 1 (cont'd). Resolve read errors. Remove dead-ends: when the correct k-mers of a read connect to incorrect k-mers, the incorrect ones are likely to be unique and most will have no extension, so one end of the branch terminates. Dead-end branches are traced backward to the ambiguous point and removed if they are shorter than a threshold. Remove bubbles: a bubble is a branch that diverges and rejoins later, caused by single-base differences.

40 ABySS – Stages 1 and 2. Vertex merging: merge vertices linked by unambiguous edges. Contig merging: use paired-end information.

41 ABySS Results. Genome of an African male from the NCBI Short Read Archive (accession SRA000271). 3.5 billion mate-paired reads, 42x coverage. Read length 36-42 bp; median fragment size 210 bp. At k = 27, 15 h run time without paired-end information.

42 ABySS Comparisons

43 Velvet. Construct a graph by transforming reads into roadmaps. From each read, generate k-mers tagged with the read ID and the position in the read (called roadmaps). Each read is thereby transformed into a set of overlapping k-mers with hash links to previous reads that contain the same k-mers. A second database records, for each read, which of its k-mers are overlapped by subsequent reads. Reads are then traced through the graph using the roadmaps.


45 Velvet. Graph simplification: a node with one outgoing link can be combined with a next node that has one incoming link. Error removal focuses on topological features: tips (dead-ends) shorter than 2k; bubbles due to internal read errors (Tour Bus algorithm); erroneous connections due to distant merging tips. Breadcrumb: use read pairs to extend contigs.

46 EULER, 2001. EULER implements the Eulerian Path approach. Issues with real data: reads may have errors. Error correction is typically done in the "consensus" stage, but EULER corrects errors in the first step using SA (Spectral Alignment). The repeat problem is handled with the de Bruijn graph.

47 Spectral Alignment (SA). Although the genome sequence G is not known, the set G_l of all l-mers present in G can be accurately predicted. An l-mer is called solid if it belongs to more than M reads. EULER approach: approximate G_l by the set of all solid l-mers. SBH problem without read errors: construct a graph with edges corresponding to the l-mers from Spectrum(s, l) and find a path visiting every edge exactly once. SA problem: given a string s and G_l, find the minimum number of mutations in s that transform it into a string whose l-mers all belong to G_l. It can be solved efficiently by dynamic programming.
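The definition of a solid l-mer can be sketched as follows. This is an illustration of the definition only, not EULER's implementation: the function name and toy reads are made up, each read is counted at most once per l-mer, and reverse complements are ignored.

```python
from collections import Counter


def solid_lmers(reads: list[str], l: int, M: int) -> set[str]:
    """l-mers that occur in more than M reads, used as an approximation of G_l."""
    counts = Counter()
    for read in reads:
        # A set ensures each read contributes at most once per l-mer.
        counts.update({read[i:i + l] for i in range(len(read) - l + 1)})
    return {lmer for lmer, count in counts.items() if count > M}


reads = ["ACGTA", "ACGTT", "CCGTA"]
print(sorted(solid_lmers(reads, l=3, M=1)))  # ['ACG', 'CGT', 'GTA']
```

Here GTT and CCG each appear in only one read, so they are treated as likely errors and left out of the approximation.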

48 Spectral Alignment (SA). Formulation: given a set of reads R = {r_1, ..., r_n}, an integer l, and an upper bound Δ on the number of errors in each read, the spectrum S_l is the set of all l-mers from the reads r_1, ..., r_n and their reverse complements r_1', ..., r_n'. Introduce up to Δ corrections in each read in R such that |S_l| is minimized. Result: one correction in a read can remove l erroneous l-mers from R and l from the reverse complements R'. Reduces read errors by 86.5%. However, it can also create errors: one change in a read may change all reads in the region. Introducing errors is acceptable as long as the errors from overlapping reads covering the same position are consistent, corresponding to a single mutation in the genome. Corrects 234,410 errors and introduces 1,452 errors in NM.

49 EULER - Results. No incorrect contigs.

50 Summary of EULER. Eulerian Path approach with the de Bruijn graph. Do error correction early (SA). Fill gaps as soon as possible.

51 EULER+, 2004. Uses the A-Bruijn graph. To handle errors in reads, vertices are defined by ungapped alignments that allow mismatches, rather than by exact l-mers as in de Bruijn assembly. Graph simplification algorithms remove errors in edges. The size of the de Bruijn graph is proportional to the coverage, so short reads, which need higher coverage than long reads, require a large memory.

52 EULER-SR, 2008. Focus on a memory-efficient algorithm for Short Reads. Results for error correction on E. coli: 68% error-free reads; 99.6% of errors are corrected; 12 h on a single processor, 1/2 h for assembly.

53 EULER-SR: Result

54 EULER-USR, 2009. Shows the results of EULER-SR on error-prone Illumina read data. Shows that 35-nt reads are sufficient when mate-pairs are used.


56 SOAPdenovo – Schematic Overview

57 Stages: Short Read Data. Data: fragment and paired-end libraries are sequenced using various insert sizes; read lengths range from 35 to 75 bp, with insert sizes of 140 bp, 440 bp, 2.6 kb, 6 kb, and 9.6 kb. Preassembly error correction: identify low-frequency 17-mers (occurring fewer than 3 times) and correct them to the candidate with the highest frequency. The number of distinct 25-mers was reduced from 14.6 B to 5.0 B for an Asian genome.

58 Stage 2: De Bruijn Graph. Only the single-end reads and the paired-end reads with short insert sizes (<1 kb) were used, because of the high probability of chimeric reads in the long-insert pairs arising from the circularization and fragmentation process. Further error correction in the graph: clip tips (shorter than 50 bp), remove low-coverage links, resolve tiny repeats (longer than k but shorter than the read length), and merge bubbles. For the Asian genome, this removed 323 M (6.5%) tip nodes, filtered 402.6 M low-coverage nodes, resolved 4.4 M tiny repeats, and merged 4.2 M bubbles.

59 OLC vs. de Bruijn Graph


61 Benefits


63 Scaffold linkage. Connecting contigs using paired-end reads and mate-pairs.
