Download presentation
Presentation is loading. Please wait.
Published byCatherine Simon Modified over 9 years ago
1
DNA Sequencing with Longer Reads Byung G. Kim Computer Science Dept. Univ. of Mass. Lowell kim@cs.uml.edu
2
Outline DNA distribution Counting problem DNA sequencing Reassemble DNA fragments Part of $1,000 genome sequencing
3
Bioinformatics What is bioinformatics ? Intersection biology statistics computer science Bioinformatics problems DNA sequences Protein sequences/structures Modeling/inference
4
DNA and RNA DNA (deoxyribonucleic acid) and RNA (ribonucleic acid) are composed of linear chains of monomeric units of nucleotides A nucleotide has three parts: a sugar, a phophate and a base Four bases (monomers)
5
Double helix – 1953 Watson and Crick using X-ray diffraction Sugar-phosphate backbone is the outer part of the helix Two strands run in antiparallel directions Two strands are complementary Base pairing: A-T; G-C DNA Secondary Structure
6
In double strands # of A = # of T; # of G = # of C Erwin Chargaff’s 1 st Parity Rule, 1951 In a single strand ? # of A = # of T; # of G = # of C Erwin Chargaff’s 2nd Parity Rule How about oligomer (a few successive bases) frequencies ? Monomer Counts in DNA
7
Oligomer Frequencies Oligomer length = k Window of k sliding by one base (overlapping k-1 bases) A simple counting program May have to contend with long sequences An oligomer and its reverse complement ACT vs. AGT A C T A A G C G ……
8
Trimer Frequencies in Yeast
9
Trimer Frequencies in a Few Species
10
Observations Symmetric in oligomer and reverse complements Very similar among chromosomes Issues Two successive windows share (k-1) bases Are they independent ? -- probably not (consider a large k) Need jumping window Given an oligomer distribution with k, generate a random sequence according to the same distribution Distributions with (k+1) from the random and the real sequences
11
DNA Sequencing Problem A sample (genome) is amplified, and broken into multiple fragments Reconstruct the original sequence from its fragments (reads)
12
Shot-gun Sequencing
13
Base coverage A sample (genome) is amplified A base in the sample is copied into many reads But, reads are randomly generated Poisson distribution How many copies of reads to cover the whole genome ? Called sequencing depth (c) From Poisson, prob. of non-coverage is P(X=0) = exp(-c) # of uncovered bases G*exp(-c)<1 Human genome (3 Gb): c>22
14
Sequencing By Hybridization (SBH) More recent sequencing machines produce short k-mers, called reads Given an unknown DNA sequence, DNA array provides All strings of length l that the sequence contains No information about their positions Sequencing by Hybridization (SBH) Problem Reconstruct a string from its k - mer composition
15
SBH Spectrum ( s, k ) For string s of length n, the k -mer composition with a multiset of n-k +1 k -mers in s k =3, s =TATGGTGC Spectrum( s.k ) = {TAT, ATG, TGG, GGT, GTG, TGC} Sequencing by Hybridization (SBH) Problem Reconstruct a string from its l- mer composition Input: A set S, representing all k -mers from unknown string s Output: string s such that Spectrum( s, k ) = S
16
Shortest Superstring Problem Find superstring of the reads, but shortest one Shortest Superstring Problem Given a set of strings, find a shortest string that contain all of them Input: Strings s 1, s 2, …., s n Output: A shortest string s that contains all strings s 1, s 2, …., s n {001}111110101100011010000 0 0 0 1 1 1 0 1 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 0 0 1 1 0 1 0 0 0 0
17
SBH as a Hamiltonian Path Problem Two k -mers overlap if overlap(p,q) = k-1 Hamiltonian Path Problem Given Spectrum ( s, k ), and a vertex for every k -mer in Spectrum ( s, k ) Connect every two vertices if two vertices overlap, and visit every vertex Overlap-Layout-Consensus (OLC) NP-complete
18
OLC Conventional shotgun sequencing Overlap-layout-consensus Use a computer to search for overlap: trying for all possible pairs of fragments Layout: putting fragments together Consensus: error correction Good for short prokaryote genomes For n fragments, # of possible overlaps is 2n(n-1) Difficult No solution for “repeat problem” to find correct path in the layout step Produce sequencing errors Programs PHRAP, CAP, TIGR, CELERA
19
SBH as an Eulerian Path Problem Break up reads into short k -mers ( k = 27, for example) A graph with all ( k -1)-mers : De Bruijn Graph Edges corresponding to k -mers from Spectrum ( s, k ) Find a path visiting every edge exactly once
20
Eulerian Path Problem Repeatedly find Eulerian cycles in the graph Linear time
21
OLC vs. De Bruijn Graph
23
Scaffold linkage Connecting contigs Paired-end read Mate-pair
24
ABySS (Assembly By Short Sequencing) Simpson, 2009 www.cs.uml.edu/~kim/580/08_abyss.pdf Velvet Zerbino and Birny, 2008 www.cs.uml.edu/~kim/580/08_velvet.pdf Euler Pevzner, 2001 www.cs.uml.edu/~kim/580/01_pevzner.pdf www.cs.uml.edu/~kim/580/01_pevzner.pdf www.cs.uml.edu/~kim/580/09_chaisson.pdf SOAPdenovo (Short Oligonucleotide Alignment Program) Beijing Genomics Institute www.cs.uml.edu/~kim/580/09_soap.pdf SBH Programs
25
SBH Schematic Overview
26
ABySS Proceeds in two stages Stage 1 All possible k-mers are generated from reads Remove read errors and construct initial contigs Stage 2 Use mate-pairs to extend contigs Distributed implementation of de Bruijn graph in a cluster using Message Passing Interface (MPI) over multiple computers
27
Velvet Construct a graph Transform reads into roadmaps From a read, generate k-mers with read ID and position in the read (called roadmaps) Each read is transformed to a set of k-mers with overlaps and hash links to previous reads with the same k-mers 2 nd database For each read, which k-mers are overlapped by subsequent reads Trace reads through the graph using roadmaps
28
EULER, 2001 EULER Implement Eulerian Path problem Issues with real data Reads may have errors Error correction is typically done in ‘consensus’ stage EULER corrects errors in the first step Repeat problem De Bruijn graph
29
EULER +, 2004 EULER+, 2004 A-Bruijn graph To handle errors in reads, introduce vertices with ungapped alignments that allow mismtaches rather than exact l -mer in de Bruijn assembly Graph simplication algorithms to remove errors in edges De Bruijn graph is proportional to the coverage and requires a large memory with a higher coverage with short reads than long–read sequencing
30
EULER – SR, 2008 EULER-SR, 2008 Focus on memory-efficient algorithm dealing with Short Reads Results Error correction E. coli – 68% error-free reads 99.6% errors are corrected 12h on a single processor, 1/2h for assembly
31
EULER – USR, 2009 EULER-USR Show results of EULER-SR on error-prone Illumina read data Show 35-nt reads are sufficient when mate-pairs are used
32
Comparisons
33
OLC vs. de Bruijn Graph
35
Difficulties in Graph-based Approaches De Bruijn Graph (DBG) approach A read (150-500 bp) is broken into a number of k-mers A read of 150 bp produces 123 k-mers when k=27 (n-k+1) New sequencing machines produce longer reads Problem Sizes OLC For a large genome, more than giga reads Eulerian Path k = 27 => ( k -1) = 26 4**(26) = 2**(52) (2**30 = 1 G)
36
New OLC Approach Preserve read info Data Partition Partition data according to front and rear k-mers Utilize known oligomer distribution of species
37
New OLC Approach Native OLC Pair-wise fragment overlap Partial OLC Partition data Create consensus string
38
New OLC Approach Data Partition Front buckets of identical first k-mers of reads Rear buckets of identical last k-mers of reads k=15, 4 15 = 2 30 = 1 billion buckets * 2 Join F and R buckets with the same k-mers Produce consensus contigs with base matches over a threshold Repeat generating new F & R buckets, Each time dealing with longer contigs Per-bucket sequencing
39
Benefits Hybrid of OLC and fixed-size de Bruijn graph First k or last k bp’s can be rapidly compared Parallelize the program with all possible combinations of s
40
Implementation Dual Index Bucket Map data structure according to prefix and suffix A set of reads A hash table (Java HashMap) mapping prefixes to a Java Vector of references (pointers) to fragments in the set Another hash table mapping suffixes to a vector of references to fragments in the set. Bucket size is a parameter
41
Algorithms Focused merge Start with a random (or the largest fragment) and keep merging from there until there is no way to merge further or the resulting contig is at least as large as the original input sequence. Sweep merge Sweep through each bucket repeatedly, making the best merge each time until no merges can be done Cutoff for maximum difference between merge of 2 fragments’ base ratios and the original sequence’s base ratios. Heuristic “oversize penalty” for merges that are significantly larger than the original sequence Heuristics for base ratio difference Average: calculate the base ratios for each sequence, average the differences for each base Maximum: take the maximum of the differences of ratio for any base.
42
Comparison of OLC algorithms Yeast genome Artificial Data with no read errors * with optional simulated annealing. StrategyContigsHeuristicTime Mixed Chromosome Support Worst Case Accuracy Ease of Use FocusedBeam Search* SingleBase ratios 2-15 seconds Unreliable/ Poor LowNot too bad SweepWide Search* MultipleBase ratios 1-30** seconds Excellent Good Needs Micro managi ng NaiveBrute Force SingleActual overlap ½ hour to a few hours NoneOptimalSimple
43
Summary Base and oligomer distributions in species are unique New OLC sequencing DBG may not be suitable for longer reads Utilize Data partition Base distribution
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.