CZ5225: Modeling and Simulation in Biology Lecture 9: Next Generation Sequencing Prof. Chen Yu Zong Tel: 6516-6877


1 CZ5225: Modeling and Simulation in Biology Lecture 9: Next Generation Sequencing Prof. Chen Yu Zong Tel: 6516-6877 Email: phacyz@nus.edu.sg http://bidd.nus.edu.sg Room 08-14, level 8, S16, NUS

2 Outline First generation sequencing Next generation sequencing Third generation sequencing Analysis challenges

3 Sanger Sequencing DNA is fragmented Cloned to a plasmid vector Cyclic sequencing reaction Separation by electrophoresis Readout with fluorescent tags

4 Steps to Assemble a Genome
1. Find overlapping reads
2. Merge some “good” pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence ..ACGATTACAATAGGTT..
Some Terminology
read: a word 500-900 bases long that comes out of the sequencer
mate pair: a pair of reads from the two ends of the same insert fragment
contig: a contiguous sequence formed by several overlapping reads with no gaps
supercontig: an ordered and oriented set (scaffold) of contigs, usually linked by mate pairs
consensus sequence: derived from the multiple sequence alignment of the reads in a contig

5 Sequencing Types and Applications

6 Cyclic-Array Methods DNA is fragmented. Adaptors are ligated to the fragments. Several possible protocols yield an array of PCR colonies. Enzymatic extension with fluorescently tagged nucleotides. Cyclic readout by imaging the array.

7 Emulsion PCR Fragments, with adaptors, are PCR amplified within a water drop in oil. One primer is attached to the surface of a bead. Used by 454, Polonator and SOLiD.

8 Bridge PCR DNA fragments are flanked with adaptors. A flat surface coated with two types of primers, corresponding to the adaptors. Amplification proceeds in cycles, with one end of each bridge tethered to the surface. Used by Solexa.

9 Comparison of Existing Methods

10 Genome Assembly: Find Overlapping Reads
Reads and the 9-letter words they contain:
aaactgcagtacggatct: aaactgcag aactgcagt … gtacggatc tacggatct
gggcccaaactgcagtac: gggcccaaa ggcccaaac … actgcagta ctgcagtac
gtacggatctactacaca: gtacggatc tacggatct … ctactacac tactacaca
Words listed as (read, pos., word, orient.):
aaactgcag aactgcagt actgcagta … gtacggatc tacggatct
gggcccaaa ggcccaaac gcccaaact … actgcagta ctgcagtac
gtacggatc tacggatct acggatcta … ctactacac tactacaca
Words sorted as (word, read, orient., pos.):
aaactgcag aactgcagt acggatcta actgcagta cccaaactg cggatctac ctactacac ctgcagtac gcccaaact ggcccaaac gggcccaaa gtacggatc tacggatct tactacaca
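The word table on this slide can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `kmer_index` is hypothetical, only the forward strand is indexed (the slide also records an orientation), and the three reads are the ones shown on the slide.

```python
from collections import defaultdict

def kmer_index(reads, k):
    """Map every k-letter word to the (read_id, position) pairs where it occurs.

    Assumption: forward strand only; orientation is not tracked here.
    """
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].append((read_id, pos))
    # Sort by word, mirroring the (word, read, orient., pos.) table.
    return dict(sorted(index.items()))

reads = [
    "aaactgcagtacggatct",
    "gggcccaaactgcagtac",
    "gtacggatctactacaca",
]
index = kmer_index(reads, k=9)

# Words that occur in more than one place point at candidate overlaps.
for word, hits in index.items():
    if len(hits) > 1:
        print(word, hits)
```

Sorting the words brings identical words from different reads next to each other, which is what makes the shared words easy to spot.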

11 Genome Assembly: Find Overlapping Reads
Find pairs of reads sharing a k-mer, k ~ 24
Extend to full alignment; throw away if not >98% similar
[Figure: two reads overlapping on the stretch TAGATTACACAGATTAC]
Caveat: repeats
– A k-mer that occurs N times causes $O(N^2)$ read/read comparisons
– ALU k-mers could cause up to $1{,}000{,}000^2$ comparisons
Solution:
– Discard all k-mers that occur “too often”
– Set the cutoff to balance the sensitivity/speed tradeoff, according to the genome at hand and the computing resources available
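A minimal sketch of the k-mer filter described on this slide, assuming a hypothetical helper `candidate_pairs` and a simplified notion of "too often" (the number of distinct reads containing the k-mer). The follow-up step, extending each candidate to a full alignment and keeping it only if it is more than 98% similar, is not shown.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k, max_occurrences):
    """Return pairs of read ids that share at least one non-repetitive k-mer.

    Assumption: a k-mer is 'too frequent' when more than max_occurrences
    distinct reads contain it; such k-mers are skipped entirely.
    """
    readers = defaultdict(set)
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            readers[read[pos:pos + k]].add(read_id)

    pairs = set()
    for read_ids in readers.values():
        if len(read_ids) > max_occurrences:
            continue  # repeat-like k-mer: would cause O(N^2) comparisons
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs

reads = [
    "aaactgcagtacggatct",
    "gggcccaaactgcagtac",
    "gtacggatctactacaca",
]
print(sorted(candidate_pairs(reads, k=9, max_occurrences=10)))
```

Lowering `max_occurrences` trades sensitivity for speed: fewer k-mers survive the filter, so fewer read pairs are passed on to the expensive alignment step.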

12 Genome Assembly: Find Overlapping Reads
Create local multiple alignments from the overlapping reads
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA

13 Genome Assembly: Find Overlapping Reads
Correct errors using multiple alignment
TAGATTACACAGATTACTGA
TAGATTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTACTGA
Corrections: insert A, replace T with C
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
Correlated errors, probably caused by repeats → disentangle overlaps
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
In practice, error correction removes up to 98% of the errors
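One simple way to picture error correction from a multiple alignment is a per-column majority vote. The sketch below is an illustrative assumption, not necessarily the method used by any particular assembler; the function name `consensus` is hypothetical, and the five aligned reads are the ones from the multiple-alignment slide with the gap written as "-".

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority vote per alignment column; a column won by '-' is dropped."""
    result = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":
            result.append(base)
    return "".join(result)

aligned = [
    "TAGATTACACAGATTACTGA",
    "TAG-TTACACAGATTATTGA",
    "TAGATTACACAGATTACTGA",
    "TAG-TTACACAGATTATTGA",
    "TAGATTACACAGATTACTGA",
]
print(consensus(aligned))
```

A plain vote treats every disagreement as an independent error. When the same differences recur together in several reads, as in the correlated case on the slide, the reads more likely come from two repeat copies and should be separated rather than outvoted.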

14 Genome Assembly: Merge Reads into Contigs
Overlap graph:
– Nodes: reads $r_1, \dots, r_n$
– Edges: overlaps $(r_i, r_j, \text{shift}, \text{orientation}, \text{score})$
[Figure: reads that come from two regions of the genome (blue and red) that contain the same repeat]
Note: of course, we don’t know the “color” of these nodes

15 Genome Assembly: Merge Reads into Contigs
We want to merge reads up to potential repeat boundaries
[Figure labels: repeat region, Unique Contig, Overcollapsed Contig]

16 Genome Assembly: Merge Reads into Contigs
Ignore non-maximal reads
Merge only maximal reads into contigs
[Figure label: repeat region]

17 Read Length and Pairing Short reads are problematic, because short sequences do not map uniquely to the genome. Solution #1: Get longer reads. Solution #2: Get paired reads. ACTTAAGGCTGACTAGC TCGTACCGATATGCTG

18 Third Generation Sequencing Nanopore sequencing –Nucleic acids driven through a nanopore. –Differences in conductance of pore provide readout. Real-time monitoring of PCR activity –Read-out by fluorescence resonance energy transfer between polymerase and nucleotides or –Waveguides allow direct observation of polymerase and fluorescently labeled nucleotides

19 Nanopore sequencing
Deamer, DW, and Akeson, M. ‘Nanopores and Nucleic Acids: prospects for ultrarapid sequencing’. Tibtech.
Meller, A. J. Phys.: Condens. Matter 15 (2003) R581–R607
Earlier Findings
– Transmembrane voltage drives RNA through the protein nanopore α-hemolysin.
– Passage of RNA through the pore reduces the ionic current.
– Blockage current is modulated by base identity: PolyC, $I_{\text{block}} = 5$ pA; PolyA, $I_{\text{block}} = 20$ pA
– Translocation rate depends on base identity: PolyC, v = 3 µs/base; PolyA, v = 20 µs/base

20 Automated Rapid DNA Sequencing with Nanopores Church, George M. ‘Genomes for All’ Scientific American, Jan 2006, pp. 47-54. Sequencing will require a better understanding of the physics of the interaction between DNA and protein pore during translocation.

21 Modeling of ssDNA Translocation
$F = zeV / a$
– $ze$ = effective charge / base
– $V$ = applied voltage
– $a$ = base-to-base distance
$F = (1)(1.6 \times 10^{-19})(0.125) / (0.4 \times 10^{-9}) \sim 5 k_B T / a \sim 44$ pN
Basis for modeling
– P(forward or backward) ~ $\exp(Fa / k_B T)$
– Averaged over all monomers
Model Assumptions:
– Length of polymer = L >> pore length
– With short polymers, membrane has 0 thickness
D. K. Lubensky and D. R. Nelson, Biophys. J. 77, 1824 (1999).

22 Experiment of ssDNA Translocation
Conditions
– Temp: 2 °C
– Electrolyte solution: 1 M KCl, 1 mM Tris-EDTA buffer, pH 8.5
– Polymer: polydeoxyadenylic acid (poly(dA)), length 4–100 bases
– Driving voltage: 70–300 mV
Meller, A., L. Nivon, D. Branton, 2001. Voltage-Driven DNA Translocations Through a Nanopore, Phys. Rev. Lett., 86, 3435–39

23 Sequence Alignment as a Mathematical Problem
Example:
Sequence a: ATTCTTGC
Sequence b: ATCCTATTCTAGC
Best alignment (one gap): ATTCTTGC against ATCCTATTCTAGC
Bad alignment (two gaps): AT TCTT GC against ATCCTATTCTAGC
What is a good alignment?

24 How to rate an alignment?
Match: +8 ($w(x, y) = 8$ if $x = y$)
Mismatch: -5 ($w(x, y) = -5$ if $x \neq y$)
Each gap symbol: -3 ($w(-, x) = w(x, -) = -3$)
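The scoring scheme can be written down directly. The sketch below uses hypothetical names (`w`, `alignment_score`) and takes an alignment as two equal-length rows with "-" as the gap symbol; the example rows are the alignment of CTTAACT and CGGATCAT whose column scores are summed later in the lecture.

```python
MATCH, MISMATCH, GAP = 8, -5, -3

def w(x, y):
    """Score of one alignment column under the slide's scheme."""
    if x == "-" or y == "-":
        return GAP
    return MATCH if x == y else MISMATCH

def alignment_score(row_a, row_b):
    """Sum the column scores of a pairwise alignment given as two rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    return sum(w(x, y) for x, y in zip(row_a, row_b))

# Column scores: 8 - 5 - 5 + 8 - 5 + 8 - 3 + 8
print(alignment_score("CTTAAC-T", "CGGATCAT"))  # 14
```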

25 Pairwise Alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
An alignment of a and b:
C---TTAACT
CGGATCA--T
[Figure labels: insertion gap, match, mismatch, deletion gap]

26 Alignment Graph
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: grid with a = CTTAACT on the rows and b = CGGATCAT on the columns; the alignment C---TTAACT / CGGATCA--T is drawn as a path, with insertion and deletion gaps marked]

27 Graphic representation of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: path through the grid for the prefixes C of a and C of b; alignment C---TTAACT / CGGATCA--T]

28 Graphic representation of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: path through the grid for the prefixes C of a and CGGA of b; alignment C---TTAACT / CGGATCA--T]

29 Graphic representation of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: path through the grid for the prefixes CT of a and CGGAT of b; alignment C---TTAACT / CGGATCA--T]

30 Graphic representation of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: path through the grid for the prefixes CTTAAC of a and CGGATCA of b; alignment C---TTAACT / CGGATCA--T]

31 Graphic representation of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: complete path through the grid for CTTAACT and CGGATCAT; alignment C---TTAACT / CGGATCA--T]

32 Pathway of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: the pathway of the alignment C---TTAACT / CGGATCA--T through the grid]

33 Alignment Score
Sequence a: CTTAACT
Sequence b: CGGATCAT
C---TTAACT
CGGATCA--T
Running score along the path (match +8, mismatch -5, gap -3): 8, 8-3 = 5, 5-3 = 2, 2-3 = -1, -1+8 = 7, 7-5 = 2, 2+8 = 10, 10-3 = 7, 7-3 = 4, 4+8 = 12
Alignment score: 12

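34 An optimal alignment: the alignment of maximum score
Let $A = a_1 a_2 \dots a_m$ and $B = b_1 b_2 \dots b_n$.
$S_{i,j}$: the score of an optimal alignment between $a_1 a_2 \dots a_i$ and $b_1 b_2 \dots b_j$
With proper initializations, $S_{i,j}$ can be computed as follows:
$$S_{i,j} = \max \{\, S_{i-1,j-1} + w(a_i, b_j),\ S_{i-1,j} + w(a_i, -),\ S_{i,j-1} + w(-, b_j) \,\}$$

35 Computing $S_{i,j}$
[Figure: cell $(i, j)$ is reached from $(i-1, j-1)$ with weight $w(a_i, b_j)$, from $(i-1, j)$ with weight $w(a_i, -)$, and from $(i, j-1)$ with weight $w(-, b_j)$; the optimal score is $S_{m,n}$]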

36 Initializations
Gap symbol: -3
$S_{0,0} = 0$
$S_{0,1} = -3$, $S_{0,2} = -6$, $S_{0,3} = -9$, $S_{0,4} = -12$, $S_{0,5} = -15$, $S_{0,6} = -18$, $S_{0,7} = -21$, $S_{0,8} = -24$
$S_{1,0} = -3$, $S_{2,0} = -6$, $S_{3,0} = -9$, $S_{4,0} = -12$, $S_{5,0} = -15$, $S_{6,0} = -18$, $S_{7,0} = -21$
            C    G    G    A    T    C    A    T
       0   -3   -6   -9  -12  -15  -18  -21  -24
C     -3
T     -6
T     -9
A    -12
A    -15
C    -18
T    -21
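The initialization and the three-way choice for each cell (diagonal step, step down with a gap in b, step right with a gap in a) can be sketched as a table fill. The function name `fill_matrix` and its keyword parameters are hypothetical; the default scores are the ones from the lecture.

```python
def fill_matrix(a, b, match=8, mismatch=-5, gap=-3):
    """Fill the global-alignment score matrix S (rows: a, columns: b)."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]

    # Initialization: a prefix aligned against nothing costs one gap per symbol.
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + pair,  # align a_i with b_j
                S[i - 1][j] + gap,       # align a_i with a gap
                S[i][j - 1] + gap,       # align b_j with a gap
            )
    return S

S = fill_matrix("CTTAACT", "CGGATCAT")
for row in S:
    print(" ".join(f"{value:4d}" for value in row))
```

Each cell depends only on its upper, left, and upper-left neighbours, so filling the table row by row takes time proportional to m × n.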

37 $S_{1,1}$ = ?
Option 1: $S_{1,1} = S_{0,0} + w(a_1, b_1) = 0 + 8 = 8$
Option 2: $S_{1,1} = S_{0,1} + w(a_1, -) = -3 - 3 = -6$
Option 3: $S_{1,1} = S_{1,0} + w(-, b_1) = -3 - 3 = -6$
Optimal: $S_{1,1} = 8$
Match: 8, Mismatch: -5, Gap symbol: -3
[Figure: score matrix with rows CTTAACT and columns CGGATCAT; cell (1,1) marked "?"]

38 $S_{1,2}$ = ?
Option 1: $S_{1,2} = S_{0,1} + w(a_1, b_2) = -3 - 5 = -8$
Option 2: $S_{1,2} = S_{0,2} + w(a_1, -) = -6 - 3 = -9$
Option 3: $S_{1,2} = S_{1,1} + w(-, b_2) = 8 - 3 = 5$
Optimal: $S_{1,2} = 5$
Match: 8, Mismatch: -5, Gap symbol: -3
[Figure: score matrix with rows CTTAACT and columns CGGATCAT; cell (1,2) marked "?"]

39 $S_{2,1}$ = ?
Option 1: $S_{2,1} = S_{1,0} + w(a_2, b_1) = -3 - 5 = -8$
Option 2: $S_{2,1} = S_{1,1} + w(a_2, -) = 8 - 3 = 5$
Option 3: $S_{2,1} = S_{2,0} + w(-, b_1) = -6 - 3 = -9$
Optimal: $S_{2,1} = 5$
Match: 8, Mismatch: -5, Gap symbol: -3
[Figure: score matrix with rows CTTAACT and columns CGGATCAT; cell (2,1) marked "?"]

40 $S_{2,2}$ = ?
Option 1: $S_{2,2} = S_{1,1} + w(a_2, b_2) = 8 - 5 = 3$
Option 2: $S_{2,2} = S_{1,2} + w(a_2, -) = 5 - 3 = 2$
Option 3: $S_{2,2} = S_{2,1} + w(-, b_2) = 5 - 3 = 2$
Optimal: $S_{2,2} = 3$
Match: 8, Mismatch: -5, Gap symbol: -3
[Figure: score matrix with rows CTTAACT and columns CGGATCAT; cell (2,2) marked "?"]

41 $S_{3,5}$ = ?
            C    G    G    A    T    C    A    T
       0   -3   -6   -9  -12  -15  -18  -21  -24
C     -3    8    5    2   -1   -4   -7  -10  -13
T     -6    5    3    0   -3    7    4    1   -2
T     -9    2    0   -2   -5    ?
A    -12
A    -15
C    -18
T    -21
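Worked out with the same three options as for the earlier cells, and using the neighbouring entries already in the matrix ($S_{2,4} = -3$, $S_{2,5} = 7$, $S_{3,4} = -5$; here $a_3 = b_5 = \mathrm{T}$, so the diagonal step is a match):

```latex
\begin{aligned}
S_{3,5} &= \max \{\, S_{2,4} + w(a_3, b_5),\ S_{2,5} + w(a_3, -),\ S_{3,4} + w(-, b_5) \,\} \\
        &= \max \{\, -3 + 8,\ 7 - 3,\ -5 - 3 \,\} \\
        &= \max \{\, 5,\ 4,\ -8 \,\} = 5
\end{aligned}
```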

42 $S_{3,5}$ = ?
            C    G    G    A    T    C    A    T
       0   -3   -6   -9  -12  -15  -18  -21  -24
C     -3    8    5    2   -1   -4   -7  -10  -13
T     -6    5    3    0   -3    7    4    1   -2
T     -9    2    0   -2   -5    5    2   -1    9
A    -12   -1   -3   -5    6    3    0   10    7
A    -15   -4   -6   -8    3    1   -2    8    5
C    -18   -7   -9  -11    0   -2    9    6    3
T    -21  -10  -12  -14   -3    8    6    4   14
Optimal score: $S_{7,8} = 14$

43
C T T A A C - T
C G G A T C A T
8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14
[Figure: the completed score matrix for a = CTTAACT (rows) and b = CGGATCAT (columns)]
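An alignment like this one can be read off the matrix by walking back from the bottom-right cell. The sketch below is one possible traceback, with a hypothetical function name `align` and an assumed tie-breaking order (diagonal first, then a gap in b, then a gap in a); other tie-breaking rules can return a different alignment with the same score.

```python
def align(a, b, match=8, mismatch=-5, gap=-3):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    m, n = len(a), len(b)

    def w(x, y):
        return match if x == y else mismatch

    # Fill the score matrix.
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap
    for j in range(1, n + 1):
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j - 1] + w(a[i - 1], b[j - 1]),
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)

    # Trace back from S[m][n], preferring the diagonal when options tie.
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + w(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return S[m][n], "".join(reversed(row_a)), "".join(reversed(row_b))

score, top, bottom = align("CTTAACT", "CGGATCAT")
print(score)
print(top)
print(bottom)
```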

44 Multiple sequence alignment (MSA)

45 How to score an MSA?
Sum-of-Pairs (SP-score)
MSA:
GC-TC
A---C
G-ATC
Score = score(GC-TC, A---C) + score(GC-TC, G-ATC) + score(A---C, G-ATC)

46 How to score an MSA?
Sum-of-Pairs (SP-score)
GC-TC / A---C: -5 - 3 + 8 - 3 + 8 = 5
GC-TC / G-ATC: 8 - 3 - 3 + 8 + 8 = 18
A---C / G-ATC: -5 + 8 - 3 - 3 + 8 = 5
SP-score = 5 + 18 + 5 = 28
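The sum-of-pairs arithmetic can be checked with a short sketch. The names `w`, `pair_score`, and `sp_score` are hypothetical. One assumption is taken from the slide's own numbers: a column where both rows have a gap is scored +8, like any other pair of identical symbols.

```python
from itertools import combinations

def w(x, y):
    """Column score; identical symbols (including '-' with '-') score +8,
    which is what the slide's arithmetic implies."""
    if x == y:
        return 8
    if x == "-" or y == "-":
        return -3
    return -5

def pair_score(row_a, row_b):
    """Score the pairwise alignment induced by two rows of the MSA."""
    return sum(w(x, y) for x, y in zip(row_a, row_b))

def sp_score(rows):
    """Sum-of-pairs score: add up the scores of all pairs of rows."""
    return sum(pair_score(r, s) for r, s in combinations(rows, 2))

msa = ["GC-TC", "A---C", "G-ATC"]
for r, s in combinations(msa, 2):
    print(r, s, pair_score(r, s))
print("SP-score:", sp_score(msa))  # 28
```

Many formulations instead score a gap-against-gap column as 0; with that convention the total for this example would be lower than 28.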

