CZ5225: Modeling and Simulation in Biology Lecture 9: Next Generation Sequencing Prof. Chen Yu Zong Tel: 6516-6877


CZ5225: Modeling and Simulation in Biology Lecture 9: Next Generation Sequencing Prof. Chen Yu Zong Tel: Room 08-14, level 8, S16, NUS

Outline
- First generation sequencing
- Next generation sequencing
- Third generation sequencing
- Analysis challenges

Sanger Sequencing
- DNA is fragmented
- Cloned to a plasmid vector
- Cyclic sequencing reaction
- Separation by electrophoresis
- Readout with fluorescent tags

Steps to Assemble a Genome
1. Find overlapping reads
2. Merge some "good" pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence (..ACGATTACAATAGGTT..)
Some Terminology
- read: a long word that comes out of the sequencer
- mate pair: a pair of reads from the two ends of the same insert fragment
- contig: a contiguous sequence formed by several overlapping reads with no gaps
- supercontig (scaffold): an ordered and oriented set of contigs, usually by mate pairs
- consensus sequence: the sequence derived from the multiple alignment of reads in a contig

Sequencing Types and Applications

Cyclic-Array Methods
- DNA is fragmented
- Adaptors ligated to fragments
- Several possible protocols yield array of PCR colonies.
- Enzymatic extension with fluorescently tagged nucleotides.
- Cyclic readout by imaging the array.

Emulsion PCR Fragments, with adaptors, are PCR amplified within a water drop in oil. One primer is attached to the surface of a bead. Used by 454, Polonator and SOLiD.

Bridge PCR DNA fragments are flanked with adaptors. A flat surface coated with two types of primers, corresponding to the adaptors. Amplification proceeds in cycles, with one end of each bridge tethered to the surface. Used by Solexa.

Comparison of Existing Methods

Genome Assembly: Find Overlapping Reads
Example reads, cut into words of length 9:
- aaactgcagtacggatct: aaactgcag, aactgcagt, …, gtacggatc, tacggatct
- gggcccaaactgcagtac: gggcccaaa, ggcccaaac, …, actgcagta, ctgcagtac
- gtacggatctactacaca: gtacggatc, tacggatct, …, ctactacac, tactacaca
Table of (read, pos., word, orient.), in read order: aaactgcag aactgcagt actgcagta … gtacggatc tacggatct gggcccaaa ggcccaaac gcccaaact … actgcagta ctgcagtac gtacggatc tacggatct acggatcta … ctactacac tactacaca
Table of (word, read, orient., pos.), sorted by word: aaactgcag aactgcagt acggatcta actgcagta cccaaactg cggatctac ctactacac ctgcagtac gcccaaact ggcccaaac gggcccaaa gtacggatc tacggatct tactacaca
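The word table on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the lecture's implementation: the function name `kmer_index` is hypothetical, positions are 0-based, and only the forward orientation is indexed (a real assembler would also index reverse complements).

```python
def kmer_index(reads, k):
    """Return every k-mer of every read as (word, read_id, pos), sorted by word."""
    entries = []
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            entries.append((read[pos:pos + k], read_id, pos))
    entries.sort()  # identical words from different reads become adjacent
    return entries


# The three example reads from the slide, word length 9.
reads = ["aaactgcagtacggatct", "gggcccaaactgcagtac", "gtacggatctactacaca"]
index = kmer_index(reads, 9)

# Words that appear in more than one read point to candidate overlaps.
for (word, read_a, pos_a), (next_word, read_b, pos_b) in zip(index, index[1:]):
    if word == next_word and read_a != read_b:
        print(word, "read", read_a, "pos", pos_a, "/ read", read_b, "pos", pos_b)
```

Sorting by word is what makes shared words cheap to find: after the sort, a single linear scan over neighbouring entries is enough.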

Genome Assembly: Find Overlapping Reads
Find pairs of reads sharing a k-mer, k ~ 24
Extend to full alignment; throw away if not >98% similar
[Figure: overlap alignment of two reads that share the stretch TAGATTACACAGATTAC]
Caveat: repeats
- A k-mer that occurs N times causes O(N^2) read/read comparisons
- ALU k-mers could cause up to 1,000,000^2 comparisons
Solution:
- Discard all k-mers that occur "too often"
- Set cutoff to balance sensitivity/speed tradeoff, according to genome at hand and computing resources available
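The "discard k-mers that occur too often" rule can be illustrated as follows. This is a hedged sketch: `candidate_pairs`, `max_occurrences` and the toy reads are hypothetical, the toy uses k = 4 rather than the slide's k ~ 24, and "occurrences" is counted as the number of distinct reads containing the word.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k, max_occurrences):
    """Pair up reads that share a k-mer, skipping k-mers seen in too many reads."""
    reads_with_word = defaultdict(set)
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            reads_with_word[read[pos:pos + k]].add(read_id)

    pairs = set()
    skipped = 0
    for read_ids in reads_with_word.values():
        if len(read_ids) > max_occurrences:
            skipped += 1  # likely a repeat: would cost O(N^2) comparisons
            continue
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs, skipped


# Toy reads: "AAAA" plays the role of a repeat shared by three reads.
toy_reads = ["AAAACCCC", "CCCCGGGG", "TTAAAATT", "GTAAAAGT"]
print(candidate_pairs(toy_reads, 4, max_occurrences=2))
print(candidate_pairs(toy_reads, 4, max_occurrences=10))
```

Lowering the cutoff removes the comparisons triggered by the repeat word at the price of possibly missing a true overlap, which is the sensitivity/speed tradeoff the slide mentions.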

Genome Assembly: Find Overlapping Reads
Create local multiple alignments from the overlapping reads
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA

Genome Assembly: Find Overlapping Reads
Correct errors using multiple alignment
[Figure: multiple alignment of overlapping reads, with rows such as
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
and the annotations "insert A" and "replace T with C"]
- Correlated errors are probably caused by repeats -> disentangle overlaps
- In practice, error correction removes up to 98% of the errors
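One simple way to picture error correction from a multiple alignment is a per-column majority vote. The sketch below assumes exactly that and nothing more; it is not the corrector used by any particular assembler, the name `majority_consensus` is hypothetical, and it deliberately ignores the correlated-error case that the slide attributes to repeats.

```python
from collections import Counter


def majority_consensus(rows):
    """Column-wise majority vote over equal-length aligned rows ('-' = gap)."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("aligned rows must all have the same length")
    consensus = []
    for column in zip(*rows):
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != "-":  # a majority gap means no base at this column
            consensus.append(symbol)
    return "".join(consensus)


# Five aligned reads; two of them carry a missing A and a C->T change.
aligned_reads = [
    "TAGATTACACAGATTACTGA",
    "TAG-TTACACAGATTATTGA",
    "TAGATTACACAGATTACTGA",
    "TAG-TTACACAGATTATTGA",
    "TAGATTACACAGATTACTGA",
]
print(majority_consensus(aligned_reads))
```

With three rows against two, the vote restores the A and the C. When the minority rows agree with each other this consistently, the slide's warning applies: the "errors" may instead be a second copy of a repeat, and the overlaps should be separated rather than outvoted.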

Genome Assembly: Merge Reads into Contigs
Overlap graph:
- Nodes: reads r_1 … r_n
- Edges: overlaps (r_i, r_j, shift, orientation, score)
[Figure: reads that come from two regions of the genome (blue and red) that contain the same repeat]
Note: of course, we don't know the "color" of these nodes
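A minimal overlap graph can be built from exact suffix/prefix matches. The sketch below makes several simplifying assumptions that are not in the lecture: overlaps must be exact (no sequencing errors), only the forward orientation is considered so the orientation field is dropped, the score is simply the overlap length, and the names `suffix_prefix_overlap`, `overlap_graph` and `min_overlap` are hypothetical.

```python
def suffix_prefix_overlap(left, right, min_overlap):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for length in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def overlap_graph(reads, min_overlap):
    """Edges (i, j, shift, score): read j starts `shift` bases after read i."""
    edges = []
    for i, left in enumerate(reads):
        for j, right in enumerate(reads):
            if i == j:
                continue
            score = suffix_prefix_overlap(left, right, min_overlap)
            if score:
                edges.append((i, j, len(left) - score, score))
    return edges


# The three 18-base example reads used earlier in the lecture.
reads = ["aaactgcagtacggatct", "gggcccaaactgcagtac", "gtacggatctactacaca"]
for edge in overlap_graph(reads, min_overlap=5):
    print(edge)
```

The all-against-all loop is quadratic and is used here only for clarity; the k-mer filtering step described earlier is what keeps the number of compared pairs manageable on real data.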

Genome Assembly: Merge Reads into Contigs
We want to merge reads up to potential repeat boundaries
[Figure labels: repeat region, Unique Contig, Overcollapsed Contig]

Genome Assembly: Merge Reads into Contigs
- Ignore non-maximal reads
- Merge only maximal reads into contigs
[Figure label: repeat region]

Read Length and Pairing
Short reads are problematic, because short sequences do not map uniquely to the genome.
- Solution #1: Get longer reads.
- Solution #2: Get paired reads.
[Figure: a read pair, ACTTAAGGCTGACTAGC and TCGTACCGATATGCTG]

Third Generation Sequencing
Nanopore sequencing
- Nucleic acids driven through a nanopore.
- Differences in conductance of pore provide readout.
Real-time monitoring of PCR activity
- Read-out by fluorescence resonance energy transfer between polymerase and nucleotides, or
- Waveguides allow direct observation of polymerase and fluorescently labeled nucleotides

Nanopore sequencing
Deamer, DW, and Akeson, M. 'Nanopores and Nucleic Acids: prospects for ultrarapid sequencing'. Tibtech.
Meller, A. J. Phys.: Condens. Matter 15 (2003) R581–R607
Earlier Findings
- Transmembrane voltage drives RNA through the protein nanopore α-hemolysin.
- Passage of RNA through the pore reduces the ionic current.
- Blockage current is modulated by base identity: PolyC, I_block = 5 pA; PolyA, I_block = 20 pA
- Translocation rate depends on base identity: PolyC, v = 3 µs/base; PolyA, v = 20 µs/base

Automated Rapid DNA Sequencing with Nanopores
Church, George M. 'Genomes for All' Scientific American, Jan 2006, pp
Sequencing will require a better understanding of the physics of the interaction between DNA and protein pore during translocation.

Modeling of ssDNA Translocation
F = zeV/a
- ze = effective charge / base
- V = applied voltage
- a = base-to-base distance
F = (1)(1.6 × 10^-19 C)(0.125 V)/(0.4 × 10^-9 m) ~ 5 k_B T / a ~ 44 pN
Basis for modeling
- P(forward or backward) ~ exp(Fa/k_B T)
- Averaged over all monomers
Model Assumptions:
- Length of polymer = L >> pore length
- With short polymers, membrane has 0 thickness
D. K. Lubensky and D. R. Nelson, Biophys. J. 77, 1824 (1999).
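The order of magnitude of the driving force can be checked with a few lines of arithmetic. The sketch below rests on assumptions: the force is taken as the energy gained per base, zeV, divided by the base spacing a (the form implied by the slide's ~ 5 k_B T / a estimate); the inputs z = 1, V = 0.125 V and a = 0.4 nm are read from the slide's numbers, whose exponents are assumed; and the temperature of 2 °C is borrowed from the experimental conditions listed in the lecture.

```python
import math

E_CHARGE = 1.602176634e-19  # elementary charge in coulombs
K_B = 1.380649e-23          # Boltzmann constant in J/K


def driving_force(z, voltage, spacing):
    """Force in newtons on the polymer: energy per base (z*e*V) over spacing a."""
    return z * E_CHARGE * voltage / spacing


z, voltage, spacing = 1.0, 0.125, 0.4e-9   # assumed inputs (see text above)
temperature = 273.15 + 2.0                  # 2 degrees Celsius in kelvin

force = driving_force(z, voltage, spacing)
bias = force * spacing / (K_B * temperature)  # F*a / (k_B*T), dimensionless

print(f"F   = {force * 1e12:.1f} pN")
print(f"F*a = {bias:.2f} k_B T")
print(f"exp(F*a / k_B T) = {math.exp(bias):.0f}")
```

Under these inputs the force comes out at a few tens of piconewtons and F·a at roughly 5 k_B T, the same range as the slide's estimate; the exact figure depends on the values assumed for z, a and T.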

Experiment of ssDNA Translocation
Conditions
- Temp: 2 °C
- Electrolyte solution: 1 M KCl, 1 mM Tris-EDTA buffer, pH 8.5
- Polymer: polydeoxyadenylic acid (poly(dA)); length: 4 – 100 bases
- Driving voltage: mV
Meller, A., L. Nivon, D. Branton, Voltage-Driven DNA Translocations Through a Nanopore, Phys. Rev. Lett., 86,

Sequence Alignment as a Mathematical Problem
Example:
Sequence a: ATTCTTGC
Sequence b: ATCCTATTCTAGC
Best Alignment (one gap marked in the figure):
ATTCTTGC
ATCCTATTCTAGC
Bad Alignment (two gaps marked in the figure):
AT TCTT GC
ATCCTATTCTAGC
What is a good alignment?

How to rate an alignment?
- Match: +8 (w(x, y) = 8, if x = y)
- Mismatch: -5 (w(x, y) = -5, if x ≠ y)
- Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
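The scoring scheme translates directly into code. In the sketch below the function names `w` and `alignment_score` are illustrative, and rejecting a column that holds two gap symbols is an added assumption for pairwise alignments, which the slide does not discuss.

```python
MATCH, MISMATCH, GAP = 8, -5, -3


def w(x, y):
    """Score of one alignment column; '-' is the gap symbol."""
    if x == "-" and y == "-":
        raise ValueError("a pairwise alignment column cannot hold two gaps")
    if x == "-" or y == "-":
        return GAP
    return MATCH if x == y else MISMATCH


def alignment_score(row_a, row_b):
    """Sum of column scores for two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(w(x, y) for x, y in zip(row_a, row_b))


# Four matches, three mismatches and one gap symbol: 4*8 - 3*5 - 3 = 14.
print(alignment_score("CTTAAC-T", "CGGATCAT"))  # 14
```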

Pairwise Alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
An alignment of a and b:
C---TTAACT
CGGATCA--T
[Figure labels: Insertion gap, Match, Mismatch, Deletion gap]

Alignment Graph
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: grid with the letters of b (C G G A T C A T) across the top and the letters of a (C T T A A C T) down the side; labels: Insertion gap, Deletion gap]
C---TTAACT
CGGATCA--T

Graphic representation of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure sequence: the alignment below is drawn on the grid one step at a time, from the first column C/C to the last column T/T]
C---TTAACT
CGGATCA--T

Pathway of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: the complete path of the alignment through the grid]
C---TTAACT
CGGATCA--T

Alignment Score
Sequence a: CTTAACT
Sequence b: CGGATCAT
C---TTAACT
CGGATCA--T
Running score, column by column (match +8, mismatch -5, gap symbol -3):
8, 8-3=5, 5-3=2, 2-3=-1, -1+8=7, 7-5=2, 2+8=10, 10-3=7, 7-3=4, 4+8=12
Alignment score: 12

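An optimal alignment: the alignment of maximum score
Let A = a_1 a_2 … a_m and B = b_1 b_2 … b_n.
S_{i,j}: the score of an optimal alignment between a_1 a_2 … a_i and b_1 b_2 … b_j
With proper initializations, S_{i,j} can be computed as follows:
S_{i,j} = max { S_{i-1,j-1} + w(a_i, b_j), S_{i-1,j} + w(a_i, -), S_{i,j-1} + w(-, b_j) }

Computing S_{i,j}
[Figure: dynamic-programming grid; cell (i, j) is reached by a step scored w(a_i, -), w(-, b_j) or w(a_i, b_j); the optimal score is S_{m,n}]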
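The three moves into a cell (w(a_i, b_j), w(a_i, -) and w(-, b_j)) give the usual dynamic-programming table fill. The sketch below is one straightforward way to write it; the function name `score_table` and its keyword arguments are illustrative, with defaults taken from the lecture's scoring scheme.

```python
def score_table(a, b, match=8, mismatch=-5, gap=-3):
    """Fill S[i][j] = best score for aligning a[:i] with b[:j]."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # a[:i] against the empty string
        S[i][0] = i * gap
    for j in range(1, n + 1):          # the empty string against b[:j]
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + pair,  # align a_i with b_j
                S[i - 1][j] + gap,       # align a_i with a gap
                S[i][j - 1] + gap,       # align b_j with a gap
            )
    return S


S = score_table("CTTAACT", "CGGATCAT")
print(S[0])      # first row: multiples of the gap score
print(S[7][8])   # score of an optimal alignment of the full sequences
```

Each cell is computed once from three neighbours, so the table costs O(mn) time and space for sequences of lengths m and n.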

Initializations
S_{0,0} = 0
S_{0,1} = -3, S_{0,2} = -6, S_{0,3} = -9, S_{0,4} = -12, S_{0,5} = -15, S_{0,6} = -18, S_{0,7} = -21, S_{0,8} = -24
S_{1,0} = -3, S_{2,0} = -6, S_{3,0} = -9, S_{4,0} = -12, S_{5,0} = -15, S_{6,0} = -18, S_{7,0} = -21
Gap symbol: -3
[Figure: score table with b = CGGATCAT across the columns and a = CTTAACT down the rows]

S_{1,1} = ?
Option 1: S_{1,1} = S_{0,0} + w(a_1, b_1) = 0 + 8 = 8
Option 2: S_{1,1} = S_{0,1} + w(a_1, -) = -3 - 3 = -6
Option 3: S_{1,1} = S_{1,0} + w(-, b_1) = -3 - 3 = -6
Optimal: S_{1,1} = max(8, -6, -6) = 8
Match: 8, Mismatch: -5, Gap symbol: -3

S_{1,2} = ?
Option 1: S_{1,2} = S_{0,1} + w(a_1, b_2) = -3 - 5 = -8
Option 2: S_{1,2} = S_{0,2} + w(a_1, -) = -6 - 3 = -9
Option 3: S_{1,2} = S_{1,1} + w(-, b_2) = 8 - 3 = 5
Optimal: S_{1,2} = max(-8, -9, 5) = 5
Match: 8, Mismatch: -5, Gap symbol: -3

S_{2,1} = ?
Option 1: S_{2,1} = S_{1,0} + w(a_2, b_1) = -3 - 5 = -8
Option 2: S_{2,1} = S_{1,1} + w(a_2, -) = 8 - 3 = 5
Option 3: S_{2,1} = S_{2,0} + w(-, b_1) = -6 - 3 = -9
Optimal: S_{2,1} = max(-8, 5, -9) = 5
Match: 8, Mismatch: -5, Gap symbol: -3

S_{2,2} = ?
Option 1: S_{2,2} = S_{1,1} + w(a_2, b_2) = 8 - 5 = 3
Option 2: S_{2,2} = S_{1,2} + w(a_2, -) = 5 - 3 = 2
Option 3: S_{2,2} = S_{2,1} + w(-, b_2) = 5 - 3 = 2
Optimal: S_{2,2} = max(3, 2, 2) = 3
Match: 8, Mismatch: -5, Gap symbol: -3

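S_{3,5} = ?
[Figure: partially filled score table, b = CGGATCAT across the columns and a = CTTAACT down the rows, with the cell S_{3,5} marked "?"]

[Figure: completed score table; the bottom-right entry S_{m,n} is labelled "optimal score"]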

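C T T A A C - T
C G G A T C A T
Score: 8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14
[Figure: score table with b = CGGATCAT across the columns and a = CTTAACT down the rows, showing the path of this alignment]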
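An alignment that achieves the optimal score can be recovered by walking back through the filled table. The sketch below is one possible traceback, not necessarily the one used in the lecture: the function name `global_align` is hypothetical, and ties are broken in favour of the diagonal move, then the vertical move, which is an arbitrary choice. Other tie-breaking rules can return a different alignment with the same score.

```python
def global_align(a, b, match=8, mismatch=-5, gap=-3):
    """Return (row_a, row_b, score) for one optimal global alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap
    for j in range(1, n + 1):
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + pair,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)

    # Walk back from S[m][n], re-deriving which move produced each cell.
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + pair:
            row_a.append(a[i - 1]); row_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b)), S[m][n]


row_a, row_b, score = global_align("CTTAACT", "CGGATCAT")
print(row_a)
print(row_b)
print(score)
```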

Multiple sequence alignment (MSA)

How to score an MSA?
Sum-of-Pairs (SP-score)
MSA:
GC-TC
A---C
G-ATC
Score of each induced pairwise alignment:
- GC-TC / A---C: 5
- GC-TC / G-ATC: 18
- A---C / G-ATC: 5
SP-score = 5 + 18 + 5 = 28
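The sum-of-pairs score can be sketched by scoring every pair of rows column by column. One assumption matters here and is flagged in the code: a column in which both rows hold a gap symbol is scored as a match (+8), following the literal rule w(x, y) = 8 if x = y. That choice is what reproduces the pair scores 5, 18 and 5 on the slide; other treatments score such columns as 0. The names `column_score` and `sp_score` are illustrative.

```python
from itertools import combinations

MATCH, MISMATCH, GAP = 8, -5, -3


def column_score(x, y):
    """Score one column of an induced pairwise alignment."""
    if x == y:
        return MATCH     # ASSUMPTION: this includes the gap/gap column
    if x == "-" or y == "-":
        return GAP
    return MISMATCH


def pair_score(row_a, row_b):
    return sum(column_score(x, y) for x, y in zip(row_a, row_b))


def sp_score(rows):
    """Sum of the pairwise scores over all pairs of rows in the MSA."""
    return sum(pair_score(r1, r2) for r1, r2 in combinations(rows, 2))


msa = ["GC-TC", "A---C", "G-ATC"]
for r1, r2 in combinations(msa, 2):
    print(r1, r2, pair_score(r1, r2))
print("SP-score:", sp_score(msa))  # 28
```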