CZ5225: Modeling and Simulation in Biology Lecture 9: Next Generation Sequencing Prof. Chen Yu Zong Tel: Room 08-14, level 8, S16, NUS
Outline First generation sequencing Next generation sequencing Third generation sequencing Analysis challenges
Sanger Sequencing DNA is fragmented Cloned to a plasmid vector Cyclic sequencing reaction Separation by electrophoresis Readout with fluorescent tags
Steps to Assemble a Genome
1. Find overlapping reads
2. Merge some “good” pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence (..ACGATTACAATAGGTT..)
Some Terminology
read: a long word that comes out of the sequencer
mate pair: a pair of reads from the two ends of the same insert fragment
contig: a contiguous sequence formed by several overlapping reads with no gaps
supercontig: an ordered and oriented set (scaffold) of contigs, usually linked by mate pairs
consensus sequence: the sequence derived from the multiple alignment of the reads in a contig
Sequencing Types and Applications
Cyclic-Array Methods DNA is fragmented Adaptors ligated to fragments Several possible protocols yield array of PCR colonies. Enzymatic extension with fluorescently tagged nucleotides. Cyclic readout by imaging the array.
Emulsion PCR Fragments, with adaptors, are PCR amplified within a water drop in oil. One primer is attached to the surface of a bead. Used by 454, Polonator and SOLiD.
Bridge PCR DNA fragments are flanked with adaptors. A flat surface coated with two types of primers, corresponding to the adaptors. Amplification proceeds in cycles, with one end of each bridge tethered to the surface. Used by Solexa.
Comparison of Existing Methods
Genome Assembly: Find Overlapping Reads
Each read is cut into overlapping words (here of length 9):
aaactgcagtacggatct: aaactgcag, aactgcagt, …, gtacggatc, tacggatct
gggcccaaactgcagtac: gggcccaaa, ggcccaaac, …, actgcagta, ctgcagtac
gtacggatctactacaca: gtacggatc, tacggatct, …, ctactacac, tactacaca
The words are listed as (read, pos., word, orient.) and then sorted by word into a (word, read, orient., pos.) table:
aaactgcag, aactgcagt, acggatcta, actgcagta, cccaaactg, cggatctac, ctactacac, ctgcagtac, gcccaaact, ggcccaaac, gggcccaaa, gtacggatc, tacggatct, tactacaca
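The word table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `kmer_index` is hypothetical, only the forward orientation is indexed, and the word length 9 is read off the example words.

```python
from collections import defaultdict

def kmer_index(reads, k):
    """Map each k-mer (word) to the list of (read_id, position) where it occurs."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].append((read_id, pos))
    # Sort by word, mirroring the (word, read, orient., pos.) table.
    return dict(sorted(index.items()))

reads = ["aaactgcagtacggatct", "gggcccaaactgcagtac", "gtacggatctactacaca"]
index = kmer_index(reads, k=9)

# Words that occur in more than one read point at candidate overlaps.
shared = {word: hits for word, hits in index.items()
          if len({read_id for read_id, _ in hits}) > 1}
for word, hits in shared.items():
    print(word, hits)
```

Sorting by word puts all occurrences of the same word next to each other, so reads that may overlap can be found in a single pass over the table.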
Genome Assembly: Find Overlapping Reads
Find pairs of reads sharing a k-mer, k ~ 24
Extend to full alignment; throw away if not >98% similar
[Figure: read pairs extended from a shared k-mer to a full base-by-base alignment]
Caveat: repeats
A k-mer that occurs N times causes O(N^2) read/read comparisons
ALU k-mers could cause up to 1,000,000^2 comparisons
Solution: discard all k-mers that occur “too often”
Set the cutoff to balance the sensitivity/speed tradeoff, according to the genome at hand and the computing resources available
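The effect of the cutoff can be illustrated with a small sketch. The names `candidate_pairs` and `max_occurrences` are hypothetical, the reads are made-up toy data, and "occurs N times" is simplified here to "occurs in N different reads"; the extension to a full alignment is not shown.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k, max_occurrences):
    """Return read pairs that share at least one k-mer, skipping over-frequent k-mers."""
    occurrences = defaultdict(set)
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            occurrences[read[pos:pos + k]].add(read_id)

    pairs = set()
    for kmer, read_ids in occurrences.items():
        if len(read_ids) > max_occurrences:
            continue  # treated as a repeat: N reads would add N*(N-1)/2 pairs
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs

# Toy reads (made up): "AAA" plays the role of a repeat shared by three reads.
reads = ["AAATTT", "TTTCCC", "AAAGGG", "AAACCC"]
print(sorted(candidate_pairs(reads, k=3, max_occurrences=3)))
print(sorted(candidate_pairs(reads, k=3, max_occurrences=2)))
```

Lowering the cutoff removes the pairs that were suggested only by the repeated word, at the price of possibly missing true overlaps that lie inside repeats.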
Genome Assembly: Find Overlapping Reads
Create local multiple alignments from the overlapping reads
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
Genome Assembly: Find Overlapping Reads
Correct errors using multiple alignment
Isolated errors (insert A; replace T with C):
TAGATTACACAGATTACTGA
TAGATTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTACTGA
Correlated errors, probably caused by repeats:
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
Disentangle overlaps:
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
In practice, error correction removes up to 98% of the errors
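One simple way to picture the correction step is a per-column majority vote over reads that are already placed in a multiple alignment. The sketch below assumes exactly that; the function name `majority_consensus` is hypothetical and this is not necessarily how a production assembler corrects errors.

```python
from collections import Counter

def majority_consensus(aligned_reads):
    """Column-wise majority vote over equal-length aligned reads ('-' marks a gap)."""
    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)

aligned = [
    "TAGATTACACAGATTACTGA",
    "TAGATTACACAGATTATTGA",  # T where the other reads have C
    "TAGATTACACAGATTACTGA",
    "TAG-TTACACAGATTACTGA",  # missing A
]
print(majority_consensus(aligned))  # TAGATTACACAGATTACTGA
```

When the same two differences appear together in half of the reads, no column has a clear majority. That is the correlated case, where the reads most likely come from different repeat copies and the overlaps have to be disentangled instead.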
Genome Assembly: Merge Reads into Contigs
Overlap graph:
–Nodes: reads r_1 … r_n
–Edges: overlaps (r_i, r_j, shift, orientation, score)
[Figure: reads that come from two regions of the genome (blue and red) that contain the same repeat]
Note: of course, we don’t know the “color” of these nodes
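A minimal sketch of such an overlap graph, under simplifying assumptions: overlaps are exact suffix-prefix matches, only the forward orientation is considered (so the orientation field is omitted), and the score is simply the overlap length. The names `Overlap`, `best_overlap` and `overlap_graph` are hypothetical.

```python
from typing import NamedTuple

class Overlap(NamedTuple):
    source: int  # index i of read r_i
    target: int  # index j of read r_j
    shift: int   # offset of r_j's start relative to r_i's start
    score: int   # here: the overlap length (an assumption)

def best_overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def overlap_graph(reads, min_len):
    """Return one edge per ordered read pair with a sufficiently long overlap."""
    edges = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            length = best_overlap(a, b, min_len)
            if length:
                edges.append(Overlap(i, j, len(a) - length, length))
    return edges

reads = ["gggcccaaactgcagtac", "aaactgcagtacggatct", "gtacggatctactacaca"]
for edge in overlap_graph(reads, min_len=9):
    print(edge)
```

The all-against-all loop is quadratic in the number of reads; it is used here only to keep the example short.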
Genome Assembly: Merge Reads into Contigs
We want to merge reads up to potential repeat boundaries
[Figure: a repeat region, a unique contig and an overcollapsed contig]
Genome Assembly: Merge Reads into Contigs
Ignore non-maximal reads
Merge only maximal reads into contigs
[Figure: reads around a repeat region]
Read Length and Pairing Short reads are problematic, because short sequences do not map uniquely to the genome. Solution #1: Get longer reads. Solution #2: Get paired reads. ACTTAAGGCTGACTAGC TCGTACCGATATGCTG
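Why longer reads help can be seen on a toy example. The sketch below counts which fraction of the length-k substrings of a sequence occur exactly once; the function name `unique_fraction` and the genome string are made up for illustration.

```python
from collections import Counter

def unique_fraction(genome, k):
    """Fraction of length-k substrings of genome that occur exactly once."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return sum(1 for kmer in kmers if counts[kmer] == 1) / len(kmers)

# Toy genome (made up): the 8-base block "ACGTTGCA" appears twice.
genome = "TTACGTTGCAGGCCACGTTGCAAT"
for k in (2, 4, 8, 10):
    print(k, round(unique_fraction(genome, k), 2))
```

Once k exceeds the length of the repeated block, every substring is unique, which is the situation that longer reads aim for. Paired reads attack the same problem differently, by anchoring an ambiguous read to a mate that maps uniquely.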
Third Generation Sequencing Nanopore sequencing –Nucleic acids driven through a nanopore. –Differences in conductance of pore provide readout. Real-time monitoring of PCR activity –Read-out by fluorescence resonance energy transfer between polymerase and nucleotides or –Waveguides allow direct observation of polymerase and fluorescently labeled nucleotides
Nanopore sequencing
Earlier Findings
–Transmembrane voltage drives RNA through the protein nanopore α-hemolysin.
–Passage of RNA through the pore reduces the ionic current
–Blockage current is modulated by base identity: PolyC: I_block = 5 pA, PolyA: I_block = 20 pA
–Translocation rate depends on base identity: PolyC: v = 3 µs/base, PolyA: v = 20 µs/base
Deamer, DW, and Akeson, M. ‘Nanopores and Nucleic Acids: prospects for ultrarapid sequencing’. Tibtech.
Meller, A. J. Phys.: Condens. Matter 15 (2003) R581–R607
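To get a feel for the per-base figures, the sketch below estimates how long a homopolymer takes to pass the pore. It assumes, as a simplification, that every base adds the same fixed delay; the function name `translocation_time_us` is hypothetical.

```python
# Per-base translocation times quoted above, in microseconds per base.
TIME_PER_BASE_US = {"C": 3.0, "A": 20.0}

def translocation_time_us(homopolymer):
    """Total passage time, assuming each base contributes a fixed delay."""
    return sum(TIME_PER_BASE_US[base] for base in homopolymer)

print(translocation_time_us("C" * 100))  # 300.0
print(translocation_time_us("A" * 100))  # 2000.0
```

Under this assumption a 100-base poly(A) strand takes roughly seven times longer than poly(C) of the same length, so both the duration and the depth of a blockade carry information about base identity.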
Automated Rapid DNA Sequencing with Nanopores Church, George M. ‘Genomes for All’ Scientific American, Jan 2006, pp Sequencing will require a better understanding of the physics of the interaction between DNA and protein pore during translocation.
Modeling of ssDNA Translocation
F = zeV/a
–ze = effective charge / base
–V = applied voltage
–a = base-to-base distance
F = (1)(1.6 x 10^-19 C)(0.125 V)/(0.4 x 10^-9 m) ~ 5 k_B T / a ~ 44 pN
Basis for modeling
–P(forward or backward) ~ exp(Fa/k_B T)
–Averaged over all monomers
Model Assumptions:
–Length of polymer = L >> pore length
–With short polymers, membrane has 0 thickness
D. K. Lubensky and D. R. Nelson, Biophys. J. 77, 1824 (1999).
Experiment of ssDNA Translocation
Conditions
–Temp: 2 °C
–Electrolyte solution: 1 M KCl, 1 mM Tris-EDTA buffer, pH 8.5
–Polymer: polydeoxyadenylic acid (poly(dA)), length: 4 – 100 bases
–Driving voltage: mV
Meller, A., L. Nivon, D. Branton, Voltage-Driven DNA Translocations Through a Nanopore, Phys. Rev. Lett., 86,
Sequence Alignment as a Mathematical Problem
Example:
Sequence a: ATTCTTGC
Sequence b: ATCCTATTCTAGC
Best alignment (one gap): ATTCTTGC against ATCCTATTCTAGC
Bad alignment (two gaps): AT TCTT GC against ATCCTATTCTAGC
What is a good alignment?
How to rate an alignment?
Match: +8 (w(x, y) = 8, if x = y)
Mismatch: -5 (w(x, y) = -5, if x ≠ y)
Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
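The scoring scheme translates directly into a small function. In this sketch `-` is the gap symbol; a column of two gap symbols is not defined by the scheme and is simply scored as a gap here, which is an assumption.

```python
MATCH, MISMATCH, GAP = 8, -5, -3

def w(x, y):
    """Score of one alignment column; '-' is the gap symbol."""
    if x == "-" or y == "-":
        return GAP
    return MATCH if x == y else MISMATCH

print(w("A", "A"), w("A", "C"), w("-", "T"), w("G", "-"))  # 8 -5 -3 -3
```

The score of a whole alignment is the sum of `w` over its columns.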
Pairwise Alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
An alignment of a and b:
C---TTAACT
CGGATCA--T
[Column labels in the figure: insertion gap, match, mismatch, deletion gap]
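The four column types can be told apart mechanically. The sketch below assumes that a gap in the row of a is what the slide calls an insertion gap and a gap in the row of b a deletion gap; the function name `classify_columns` is hypothetical.

```python
from collections import Counter

def classify_columns(row_a, row_b):
    """Label every column of a pairwise alignment (equal-length rows, '-' is a gap)."""
    labels = []
    for x, y in zip(row_a, row_b):
        if x == "-":
            labels.append("insertion gap")  # letter of b with no partner in a
        elif y == "-":
            labels.append("deletion gap")   # letter of a with no partner in b
        elif x == y:
            labels.append("match")
        else:
            labels.append("mismatch")
    return labels

labels = classify_columns("C---TTAACT", "CGGATCA--T")
print(Counter(labels))
```

Removing the gap symbols from each row gives back the original sequences, which is a quick sanity check for any alignment.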
Alignment Graph
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure, built up over several slides: a grid with the letters of b (C G G A T C A T) along one axis and the letters of a (C T T A A C T) along the other. The alignment
C---TTAACT
CGGATCA--T
is drawn as a path through the grid, extended step by step; the complete path is the pathway of the alignment. Insertion gaps and deletion gaps are labeled on the path.]
Alignment Score
Sequence a: CTTAACT
Sequence b: CGGATCAT
C---TTAACT
CGGATCA--T
Running score along the path: 8, 8-3 = 5, 5-3 = 2, 2-3 = -1, -1+8 = 7, 7-5 = 2, 2+8 = 10, 10-3 = 7, 7-3 = 4, 4+8 = 12
Alignment score: 12
An optimal alignment: the alignment of maximum score
Let A = a_1 a_2 … a_m and B = b_1 b_2 … b_n.
S_{i,j}: the score of an optimal alignment between a_1 a_2 … a_i and b_1 b_2 … b_j
With proper initializations, S_{i,j} can be computed as follows.
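Computing S_{i,j}
S_{i,j} = max { S_{i-1,j} + w(a_i, -), S_{i,j-1} + w(-, b_j), S_{i-1,j-1} + w(a_i, b_j) }
The score of an optimal alignment of A and B is S_{m,n}.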
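The recurrence can be turned into a table-filling routine almost line by line. This is a minimal sketch with the match, mismatch and gap scores used in this lecture as defaults; the function name `alignment_scores` is hypothetical.

```python
def alignment_scores(a, b, match=8, mismatch=-5, gap=-3):
    """Fill the table S, where S[i][j] is the best score for a[:i] versus b[:j]."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap  # a_i aligned to a gap
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap  # b_j aligned to a gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + pair,  # w(a_i, b_j)
                S[i - 1][j] + gap,       # w(a_i, -)
                S[i][j - 1] + gap,       # w(-, b_j)
            )
    return S

S = alignment_scores("CTTAACT", "CGGATCAT")
print(S[1][1], S[1][2], S[2][1], S[2][2])  # 8 5 5 3
print(S[7][8])                             # 14
```

The table has (m+1) x (n+1) entries and each one is computed from three neighbours, so time and memory both grow with the product of the two sequence lengths.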
Initializations
S_{0,0} = 0
S_{0,1} = -3, S_{0,2} = -6, S_{0,3} = -9, S_{0,4} = -12, S_{0,5} = -15, S_{0,6} = -18, S_{0,7} = -21, S_{0,8} = -24
S_{1,0} = -3, S_{2,0} = -6, S_{3,0} = -9, S_{4,0} = -12, S_{5,0} = -15, S_{6,0} = -18, S_{7,0} = -21
Gap symbol: -3
Match: 8, Mismatch: -5, Gap symbol: -3
S_{1,1} = ?
Option 1: S_{1,1} = S_{0,0} + w(a_1, b_1) = 0 + 8 = 8
Option 2: S_{1,1} = S_{0,1} + w(a_1, -) = -3 - 3 = -6
Option 3: S_{1,1} = S_{1,0} + w(-, b_1) = -3 - 3 = -6
Optimal: S_{1,1} = 8
S_{1,2} = ?
Option 1: S_{1,2} = S_{0,1} + w(a_1, b_2) = -3 - 5 = -8
Option 2: S_{1,2} = S_{0,2} + w(a_1, -) = -6 - 3 = -9
Option 3: S_{1,2} = S_{1,1} + w(-, b_2) = 8 - 3 = 5
Optimal: S_{1,2} = 5
S_{2,1} = ?
Option 1: S_{2,1} = S_{1,0} + w(a_2, b_1) = -3 - 5 = -8
Option 2: S_{2,1} = S_{1,1} + w(a_2, -) = 8 - 3 = 5
Option 3: S_{2,1} = S_{2,0} + w(-, b_1) = -6 - 3 = -9
Optimal: S_{2,1} = 5
S_{2,2} = ?
Option 1: S_{2,2} = S_{1,1} + w(a_2, b_2) = 8 - 5 = 3
Option 2: S_{2,2} = S_{1,2} + w(a_2, -) = 5 - 3 = 2
Option 3: S_{2,2} = S_{2,1} + w(-, b_2) = 5 - 3 = 2
Optimal: S_{2,2} = 3
S_{3,5} = ?
[Figure: the score table for a = CTTAACT and b = CGGATCAT; the last entry, S_{7,8}, is the optimal score]
C T T A A C - T
C G G A T C A T
8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14
[Figure: the score table for a = CTTAACT and b = CGGATCAT with the path of this alignment]
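An alignment that reaches the optimal score can be recovered by walking back through the score table from S_{m,n} and asking, at every cell, which of the three options produced its value. The sketch below is one way to do this; the function name `global_align` is hypothetical, and when several alignments tie, the order of the checks in the loop decides which one is returned.

```python
def global_align(a, b, match=8, mismatch=-5, gap=-3):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    def w(x, y):
        return match if x == y else mismatch

    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap
    for j in range(1, n + 1):
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j - 1] + w(a[i - 1], b[j - 1]),
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)

    # Walk back from S[m][n], re-deriving which option produced each cell.
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + w(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return S[m][n], "".join(reversed(row_a)), "".join(reversed(row_b))

score, row_a, row_b = global_align("CTTAACT", "CGGATCAT")
print(score)
print(row_a)
print(row_b)
```

Summing the column scores of the returned rows gives the same value as the last table entry, which is a useful check on any traceback.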
Multiple sequence alignment (MSA)
How to score an MSA?
Sum-of-Pairs (SP-score)
MSA:
GC-TC
A---C
G-ATC
Score = sum of the scores of the three induced pairwise alignments:
GC-TC and A---C
GC-TC and G-ATC
A---C and G-ATC
SP-score = 5 + 18 + 5 = 28