Next Generation DNA Sequencing

IPM-NUS Workshop on Computational Biology
Mehdi Sadeghi

DNA sequencing methodologies: 1977
Maxam-Gilbert: base modification by general and specific chemicals; depurination or depyrimidination; single-strand excision; not amenable to automation.
Sanger: DNA replication; substitution of substrate with chain-terminator chemical; more efficient; automation?

DNA sequencing: biochemistry
[Figure: two linked nucleotides, from the 5' triphosphate end to the free 3' OH end, each carrying a purine or pyrimidine base]

DNA sequencing: Sanger dideoxy method
[Figure: dideoxyribonucleoside triphosphate (ddNTP), with H in place of the 3' OH]

DNA sequencing: Sanger dideoxy method
Chain termination method
[Figure: growing strand ending in an incorporated ddNTP, with H instead of OH at the 3' end]

DNA sequencing: Chemistry

template + primers + polymerase + label at?
Four reactions, each with dCTP, dTTP, dGTP, dATP plus one labeled terminator: (1) ddATP*, (2) ddGTP*, (3) ddTTP*, (4) ddCTP*
Extension (base pairing A•T, G•C, T•A, C•G), then electrophoresis

template + polymerase
Four reactions, each with primer, dCTP, dTTP, dGTP, dATP plus one terminator: (1) ddATP, (2) ddGTP, (3) ddTTP, (4) ddCTP
Extension (base pairing A•T, G•C, T•A, C•G), then electrophoresis

template + polymerase
One reaction with dCTP, dTTP, dGTP, dATP and all four terminators: ddATP, ddGTP, ddTTP, ddCTP
Extension (base pairing A•T, G•C, T•A, C•G), then electrophoresis

Capillary electrophoresis

ABI 370s-series

DNA sequencing: Computation

DNA sequencing

DNA Sequencing
Goal: Find the complete sequence of A, C, G, T's in DNA
Challenge: There is no machine that takes long DNA as an input and gives the complete sequence as output. We can only sequence ~500 letters at a time.

Genome Sequencing
[Figure: the genome is broken into short fragments of DNA; the fragments are read as short DNA sequences (e.g. ACGTGGTAA, CGTATACAC, TAGGCCATA, GTAATGGCG, CACCCTTAG, TGGCGTATA) and assembled into the sequenced genome (ACGTGGTAATGGCGTATACACCCTTAGGCCATA)]

Sequencing strategies
Whole genome

DNA sequencing – vectors
[Figure: DNA fragments + vector = clones. The vector is a circular genome (bacterium, plasmid) opened at a known location (restriction site); shake with the DNA fragments.]

Different types of vectors
Size of insert (bp):
Plasmid: 2,000-10,000 (can control the size)
Cosmid: 40,000
BAC (Bacterial Artificial Chromosome): 70, ,000
YAC (Yeast Artificial Chromosome): > 300,000 (not used much recently)

Sanger sequencing
DNA is fragmented
Cloned to a plasmid vector
Cyclic sequencing reaction
Separation by electrophoresis
Readout with fluorescent tags

Sanger Sequencing
Advantages: long reads (~750 bp); suitable for small projects
Disadvantages: low throughput; expensive

Method to sequence longer regions
Genomic segment is cut many times at random (shotgun)
Get one or two reads (~500 bp each) from each segment

Reconstructing the Sequence (Fragment Assembly)
Cover region with ~7-fold redundancy (7X)
Overlap reads and extend to reconstruct the original genomic region

Definition of Coverage
Length of genomic segment: L
Number of reads: n
Length of each read: l
Definition: Coverage C = n l / L
How much coverage is enough?
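The coverage definition can be tried out numerically. The following is a minimal sketch, not part of the original slides: the function names are hypothetical, and the uncovered-fraction estimate assumes idealized reads placed uniformly at random (a Poisson approximation in the spirit of Lander and Waterman), which real libraries only approximate.

```python
import math


def coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Coverage C = n * l / L, as defined on the slide."""
    return n_reads * read_len / genome_len


def reads_needed(target_cov: float, read_len: int, genome_len: int) -> int:
    """Smallest number of reads n such that n * l / L >= target coverage."""
    return math.ceil(target_cov * genome_len / read_len)


def expected_fraction_uncovered(cov: float) -> float:
    """Assumption: reads land uniformly at random, so the chance that a
    given base is hit by no read is approximately e^(-C)."""
    return math.exp(-cov)


if __name__ == "__main__":
    L, l = 100_000, 500                      # illustrative sizes only
    n = reads_needed(7, l, L)                # reads for 7X coverage
    print(n, coverage(n, l, L))
    print(expected_fraction_uncovered(7))    # roughly 0.0009 under the model
```

Under this idealized model, 7X coverage leaves on the order of 0.1% of bases unread, which is one way to read the ~7-fold redundancy rule of thumb.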

Assembly: How Much DNA? (Lander and Waterman, 1988)
Low coverage: input is a few pieces to assemble; output is many contigs, many gaps
High coverage: input is many pieces to assemble; output is a few contigs, a few gaps

Challenges with Fragment Assembly
Sequencing errors: ~1-2% of bases are wrong
Repeats: false overlap due to repeat

Repeats
Bacterial genomes: 5%
Mammals: 50%
Repeat types:
Low-complexity DNA (e.g. ATATATATACATA…)
Microsatellite repeats: (a1…ak)^N where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG)
Transposons:
SINE (Short Interspersed Nuclear Elements), e.g. ALU: ~300 bp long, 10^6 copies
LINE (Long Interspersed Nuclear Elements): ~4000 bp long, 200,000 copies
LTR retroposons (Long Terminal Repeats (~700 bp) at each end): cousins of HIV
Gene families: genes duplicate and then diverge (paralogs)
Recent duplications: ~100,000 bp long, very similar copies

What can we do about repeats?
Two main approaches: Cluster the reads Link the reads

Strategies for whole-genome sequencing
1. Hierarchical (clone-by-clone)
Break genome into many long pieces
Map each long piece onto the genome
Sequence each piece with shotgun
Example: Yeast, Worm, Human, Rat
2. Online version of (1): Walking
Start sequencing each piece with shotgun
Construct map as you go
Example: Rice genome
3. Whole genome shotgun
One large shotgun pass on the whole genome
Example: Drosophila, Human (Celera), Neurospora, Mouse, Rat, Fugu

Hierarchical Sequencing

Hierarchical Sequencing Strategy
[Figure: genome, BAC clones, and the map]
Obtain a large collection of BAC clones
Map them onto the genome (physical mapping)
Select a minimum tiling path
Sequence each clone in the path with shotgun
Assemble
Put everything together

Methods of physical mapping
Goal: Make a map of the locations of each clone relative to one another
Use the map to select a minimal set of clones to sequence
Methods: hybridization, digestion

1. Hybridization
Short words, the probes p1 … pn, attach to complementary words
Construct many probes
Treat each BAC with all probes
Record which ones attach to it
Same words attaching to BACs X, Y ⇒ X and Y overlap

Hybridization – Computational Challenge
Matrix: m probes × n clones (rows p1, p2, …, pm; columns C1, C2, …, Cn)
Entry (i, j): 1 if pi hybridizes to Cj; 0 otherwise
Definition: consecutive-ones matrix: the 1s are consecutive in each row and column
Computational problem: reorder the probes so that the matrix is in consecutive-ones form
Can be solved in O(m^3) time (m > n)
[Figure: 0/1 hybridization matrix before reordering, and after reordering to rows pi1, pi2, …, pim and columns Cj1, Cj2, …, Cjn]

[Figure: reordered matrix with rows pi1, pi2, …, pim and columns Cj1, Cj2, …, Cjn, and the clone layout it implies]
If we put the matrix in consecutive-ones form, then we can deduce the order of the clones and which pairs of clones overlap

Additional challenge: a probe (short word) can hybridize in many places in the genome
Computational problem: find the order of probes that implies the minimal probe repetition
Equivalent: find the shortest string of probes such that each clone appears as a substring
APX-hard
Solutions: greedy, probabilistic, lots of manual curation
[Figure: 0/1 hybridization matrix with rows p1, p2, …, pm and columns C1, C2, …, Cn]

2. Digestion
Restriction enzymes cut DNA where specific words appear
Cut each clone separately with an enzyme
Run fragments on a gel and measure length
Clones Ca, Cb both have fragments of length { li, lj, lk } ⇒ Ca and Cb overlap
Double digestion: cut with enzyme A, enzyme B, then enzymes A + B

Online Clone-by-clone The Walking Method

The Walking Method
Build a very redundant library of BACs with sequenced clone-ends (cheap to build)
Sequence some “seed” clones
“Walk” from seeds using clone-ends to pick library clones that extend left and right

Walking: An Example

Walking off a Single Seed
Low redundant sequencing Many sequential steps

Walking off a single clone is impractical
Cycle time to process one clone: 1-2 months
Grow clone
Prepare and shear DNA
Prepare shotgun library and perform shotgun
Assemble in a computer
Close remaining gaps
A mammalian genome would need 15,000 walking steps!

Walking off several seeds in parallel
Efficient: few sequential steps
Inefficient: additional redundant sequencing
In general, can sequence a genome in ~5 walking steps, with <20% redundant sequencing

Using Two Libraries
Most inefficiency comes from closing a small gap with a much larger clone
Solution: use a second library of small clones

Whole-Genome Shotgun Sequencing

Whole Genome Shotgun Sequencing
Genome is cut many times at random and cloned into plasmids (2-10 Kbp) and cosmids (40 Kbp)
Forward-reverse paired reads (~500 bp each) at a known distance

Better assembly of contigs, gap lengths estimation
Cut DNA into larger pieces (2 Kbp, 15 Kbp) and sequence both ends of each piece (Fleischmann et al., 1994)
Each end read is ~500 bp; the unsequenced middle is ~(length - 1,000) bp
[Figure: 15 Kbp mates and 2 Kbp mates spanning contig 1 and contig 2; resolving repeats]

Advantages & Disadvantages of Strategies
Hierarchical Sequencing
ADV. Easy assembly
DIS. Build library and physical map; redundant sequencing
Whole Genome Shotgun (WGS)
ADV. No mapping, no redundant sequencing
DIS. Difficult to assemble and resolve repeats
The Walking method (motivation)
Sequence the genome clone-by-clone without a physical map
The only costs involved are: library of end-sequenced clones (cheap); sequencing

Sequencing of Human Genome
Public Consortium
Many years of hard work
More than BAC clones
Each containing about 100 kb fragment
Together provided a tiling path through each human chromosome
Amplification in bacterial culture
Isolation, select pieces about 2-3 kb
Subcloned into plasmid vectors, amplification, isolation
Recreate contigs
Refinement, gap closure, sequence quality improvement (less 1 error/ bases)
BAC based approaches toward WGS

Sanger Sequencing
[Figure: timeline 1980-2000s]
1982: lambda virus, DNA stretches up to Kbp (Sanger et al.)
1994: H. influenzae, 1.8 Mbp (Fleischmann et al.)
2001: H. sapiens, D. melanogaster, 3 Gbp (Venter et al.)
2007: Global Ocean Sampling, ~3,000 organisms, 7 Gbp (Venter et al.)

Sequencing the Human Genome
[Figure: Log10(price) vs. year, 2000-2010]
2001: Human Genome Project, 2.7G$, 11 years
2001: Celera, 100M$, 3 years
2007: 454, 1M$, 3 months
2008: ABI SOLiD, 60K$, 2 weeks
2009: Illumina, Helicos, 40-50K$
2010: 5K$, a few days
2012: 100$, <24 hrs?
I would like to begin with an overview of the history of human genome sequencing. Despite significant improvements … it was clear that Sanger sequencing would not make massive DNA sequencing at a low cost and high speed feasible. Several technologies were developed at the time, of which the 454 Life Sciences sequencer was the first to become commercial in years later it was used for … Whether …, but the direction is clear: in a few years from now very fast and cheap sequencing technologies will be available for commercial and research purposes.

2nd Generation: Pyrosequencing
Sequencing by synthesis
Advantages:
Accurate
Parallel processing
Easily automated
Eliminates the need for labeled primers and nucleotides
No need for gel electrophoresis

Pyrosequencing Basic idea:
Visible light is generated and is proportional to the number of incorporated nucleotides
1 pmol DNA = 6×10^11 ATP = 6×10^9 photons at 560 nm
[Figure labels: DNA polymerase I from E. coli; pyrophosphate; luciferase from fireflies, oxidizes luciferin and generates light]

Pyrosequencing 1st Method: Solid Phase
Immobilized DNA
3 enzymes
Wash step to remove nucleotides after each addition

Pyrosequencing 2nd Method Liquid Phase
3 enzymes + apyrase (nucleotide degradation enzyme)
Eliminates need for washing step
In the well of a microtiter plate: primed DNA template, 4 enzymes
Nucleotides are added stepwise
Nucleotide-degrading enzyme degrades previous nucleotides

Pyrosequencing

Pyrosequencing Results:

Pyrosequencing
Disadvantages:
Smaller sequences
Nonlinear light response after more than 5-6 identical nucleotides

Next Generation Sequencing: Why Now?
Motivation: HGP and its derivatives, personalized medicine
Short-read applications: (re-)sequencing, other methods (e.g. gene expression)
Advancements in technology
NGS is a general term referring to all post-Sanger sequencing technologies that enable massive sequencing at low cost. NGS may be further divided into polony-sequencing-based technologies, which require the amplification of DNA prior to sequencing, and single-molecule sequencing, which does not. Motivation for new technologies derives not only from potential commercial usage, such as in personalised medicine, but also from government-supported projects such as the HGP or the 1000 Genomes Project, which aims to sequence the genomes of 1000 individuals around the world, with the price tag for sequencing a single genome set at 50,000$. Other than de novo sequencing, potential applications include re-sequencing and gene expression analysis; both can make use of the short reads offered by all current technologies. So despite the read-length barrier of the new technologies, sequencers still became commercial. And of course, advancements in chemistry, microscopy and other related technologies enabled the new sequencing technologies.

High Parallelism is Achieved in Polony Sequencing
[Figure: Sanger vs. polony]
Polony sequencing refers to all commercial technologies except for Helicos. Polony sequencing takes place on an array of polonies, in which all amplicons of the same DNA fragment are clustered together in the same region of the array. These groups of amplicons were termed polonies, short for polymerase colonies. The degree of parallelism that can be achieved through Sanger sequencing is only a fraction of what can be achieved in polony sequencing.

Next Generation Sequencing
DNA is fragmented
Adaptors ligated to fragments
Several possible protocols yield an array of PCR colonies: emulsion PCR, bridge PCR
Enzymatic extension with fluorescently tagged nucleotides
Cyclic readout by imaging the array

Next Generation Sequencing
454 Life Sciences/Roche
Genome Sequencer FLX: currently produces million bases per day per machine
Published 1 million bases of Neanderthal DNA in 2006
May 2007: published complete genome of James Watson (3.2 billion bases, ~20x coverage)
Solexa/Illumina
10 GB per machine/week
May 2008: published complete genomes for 3 HapMap subjects (14x coverage)
ABI SOLiD
20 GB per machine/week

“Paradigm Shift” Standard ABI “Sanger” sequencing
96 samples/day
Read length ~750 bp
Total = 70,000 bases of sequence data
454 was the game changer!
~400,000 different templates (reads)/day
Read length ~250 bp
Total = 100,000,000 bases of sequence data!!!

Solexa ups the Game
Solexa (Illumina GA)
60,000,000 different sequence templates (yes, that is 60 million reads)
36 bp read length
4 billion bases of DNA per run (3 days)

Each system works differently, but they are all based on similar principles:
Shear target DNA into small pieces
Bind individual DNA molecules to a solid surface
Amplify each molecule into a cluster
Copy one base at a time and detect different signals for A, C, T, and G bases
Requires very precise high-resolution imaging of tiny features (charge-coupled device (CCD))

One tile on Solexa Sequencer

Huge Amount of Image Data
The raw image data is truly huge: 1-2 TB for the Solexa, more for ABI-SOLiD, less for 454
The images are immediately processed into intensity data (spots with location and brightness)
Intensity data is then processed into basecalls (A, C, T, or G plus a quality score for each)
Basecall data is on the order of 5-10 GB per run (or a week of runs for 454)

454
First high-throughput DNA sequencer, commercially available in 2004
Now produces ~500 MB in reads of 500 bp
Run of 8 samples in 10 hours, so can do multiple runs/week
Uses pyrosequencing, beads, and a microtiter plate
Low error rate, but insert/delete problems with homopolymers (stretches of a single base)

Illumina Genome Analyzer
Originally developed by Solexa, now a subsidiary of Illumina
Commercially available in 2006
Now produces 8-12 million reads per sample of 36 bp length = 10 GB/week
Run takes 3 days for 7 samples
Low error rate, mostly base changes, few indels

ABI-SOLiD
First commercially available in late 2007
Currently capable of producing 20 GB of data per run (week)
Most users generate 6 GB/run
Reads ~30 bp long
Uses unique sequence-by-ligation method
“Color-space” data
Very low error rate

Cost per GB
On Solexa, cost is ~$6,000 per GB
ABI-SOLiD: $6,000 per GB
454 is more like $85,000 per GB, but much longer reads are valuable for many projects [Mardis, E.R., Trends in Genetics 24: ]
Note that the human genome is 3 GB, but 1x coverage is not nearly adequate for clinical purposes.
But costs are coming down more than 2x per year.

Short Reads
Short reads from Next-Gen machines are a challenge (Solexa = 36 bp)
Very hard to assemble whole genomes
Difficult to get any information on repeat regions
Requires many-fold coverage
New algorithms needed for many traditional bioinformatics operations

Third generation
Nanopore sequencing: nucleic acids driven through a nanopore; differences in conductance of pore provide readout
Real-time monitoring of PCR activity: read-out by fluorescence resonance energy transfer between polymerase and nucleotides, or waveguides allow direct observation of polymerase and fluorescently labeled nucleotides

Comparison of existing methods

Terminology
Ligase: an enzyme that links two large molecules by forming a new chemical bond.
DNA ligase: a special type of ligase that can link together two DNA strands that have a double-strand break, or join the ends on only one of the two strands.
Polymerase Chain Reaction (PCR): a scientific technique in molecular biology to amplify a single or a few copies of a piece of DNA across several orders of magnitude, generating thousands to millions of copies of a particular DNA sequence.

Emulsion PCR Fragments, with adaptors, are PCR amplified within a water drop in oil. One primer is attached to the surface of a bead. Used by 454, Polonator and SOLiD.

Bridge PCR DNA fragments are flanked with adaptors.
A flat surface coated with two types of primers, corresponding to the adaptors. Amplification proceeds in cycles, with one end of each bridge tethered to the surface. Used by Solexa.

Generation of Polony array: DNA Beads (454, SOLiD)
DNA beads are generated using emulsion PCR
Generation of the polony array is done as follows. The process begins with mixing the DNA fragments, ligated to adaptors, with beads, PCR components and primers in water. The components are mixed with oil in order to create “microreactors”, which are droplets of water containing all necessary components for PCR. Next, PCR is performed, with the new copies in each microreactor being attached to the bead. Finally, the emulsion and empty beads are removed and we are left with only DNA-containing beads.

Generation of Polony array: DNA Beads (454, SOLiD)
DNA beads are placed in wells
The beads are loaded onto an array containing picoliter-scale wells. Together with small beads containing the enzymes required for the reactions, the DNA beads are placed into the wells.

Sequencing: Pyrosequencing (454)
Complementary strand elongation: DNA polymerase
Sequence readout is done using the
When a nucleotide is incorporated, pyrophosphate is released

Two Short Read Technologies
Illumina GA
ABI SOLiD

Work Flow

Technology Overview: Solexa/Illumina Sequencing
How does polony sequencing work? Here are the steps in our protocol.
Step 1: Make a library of linear DNA molecules. This library can be made by random priming or ligation. Every molecule in this library has a universal primer on either end. In the middle is the variable region that differs from molecule to molecule; this is the DNA that we want to sequence. The library is made without cloning into E. coli, so it has very high complexity.
Take very dilute amounts of this library and pour an acrylamide gel on a glass microscope slide, including PCR reagents. Cycle in a PCR machine adapted for slides. Because the solution is so dilute, each template molecule is relatively far from the other molecules. As PCR proceeds, the acrylamide restricts diffusion of the amplification products so that they remain localized to the template molecule. A polymerase colony (polony) forms.
[Figure label: 5,000,000]

Immobilize DNA to Surface
The goal is to amplify millions of “polonies” on a single slide. But then we want to perform enzymatic manipulations on this DNA. To facilitate this, one of the primers used in the PCR has a 5’ acrylic group that allows copolymerization into the gel.

Technology Overview: Solexa Sequencing
Now that we have amplified polonies, how are we going to sequence them? Here is how we do this.

Sequencing Technology Overview

Sequence Colonies
The first thing I want to ask is: can we amplify single molecules at high densities in acrylamide? Alexander Chetverin had previously demonstrated growing RNA colonies in agarose using the Q beta replicase system, so we were optimistic that DNA colonies could be grown. Add various amounts of template (one kind, 236 base pairs), amplify, and stain with the DNA dye SYBR Green: see fluorescent spheres? The number of spheres is linear with the amount of template added. Pick a sphere and run it on a gel: right size. We conclude that these spheres were DNA colonies, or polonies. Now, the size of the polonies shown here is much larger than we would like. We could fit 2100 of these polonies on a single slide, and we want millions.

Sequence Colonies
But, by increasing the length of the DNA template and increasing the acrylamide concentration, we can get polonies as small as 5 microns. This would enable 15,000,000 distinguishable polonies, assuming Poisson statistics.

Call Sequence
We established that we could amplify polonies. But can we sequence them? Rather than sequencing them directly, I decided to work out the protocol using oligonucleotide templates attached to acrylamide, to resolve issues with attachment chemistry as well as the nucleotide and polymerase chemistries. First question: can we get a specific single base extension with our

454 vs Solexa
Solexa: read length 40 bp; number of reads: millions; per-base cost cheaper; ideal for applications requiring short reads
454: read length 400 bp; number of reads: ; per-base cost greater; de novo assembly, metagenomics

ABI SOLiD

Sequencing By Ligation

Sequencing: Fluorescently Labeled Nucleotides (ABI SOLiD)
Complementary strand elongation: DNA ligase
No more ABI, now it’s Life Technologies
Preparation stage is similar to 454; bead diameter is 1 micron
Based on DNA ligase rather than polymerase
At the end of each iteration the z-z-z is removed before the next round
2 chars are read at each round, which enables error correction

ABI SOLiD
This allows for error correction (see board)
Raw error rate = ~3%
Corrected error rate = ~0.1%
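Since the error-correction argument was given on the board and is not in the slides, here is a small sketch of how two-base ("color-space") encoding is commonly described. It is an illustration under stated assumptions, not the vendor's implementation: each color encodes a pair of adjacent bases as the XOR of 2-bit base codes (A=0, C=1, G=2, T=3), and the function names are hypothetical.

```python
# Assumed 2-bit base codes; the color of a dinucleotide is the XOR of its codes.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_colors(seq):
    """Translate a base sequence into colors, one per adjacent base pair."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]


def decode_colors(first_base, colors):
    """Recover bases from a known first base and the list of colors."""
    bases = [first_base]
    for color in colors:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)


def color_differences(ref, read):
    """Positions at which two equally long color strings disagree."""
    return [i for i, (r, q) in enumerate(zip(ref, read)) if r != q]


if __name__ == "__main__":
    ref = encode_colors("ATGGC")
    snp = encode_colors("ATCGC")        # one base changed (G -> C)
    print(ref, snp, color_differences(ref, snp))
    print(decode_colors("A", ref))
```

Because every base takes part in two colors, a real single-base change alters two adjacent colors, whereas an isolated single-color difference is more likely a measurement error. That is the intuition behind the drop from the raw to the corrected error rate.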

Applications “If you build it, they will come.”
An explosion of scientific innovation! Every new technology enables new applications, which are not directly foreseen by the original developers of the tech. Cheap access to high-volume sequencing becomes a data collection method for many different types of experimental applications

Applications
Ancient DNA
DNA mixtures from diverse ecosystems, metagenomics
Resequencing previously published reference strains
Identification of all mutations in an organism
Expand the number of available genomes
Comparative studies
Deciphering a cell’s transcripts at sequence level without knowledge of the genome sequence
Sequencing extremely large genomes, crop plants
Detection of cancer-specific alleles, avoiding traditional cloning
ChIP-seq: protein-DNA interactions
Epigenomics
Detecting ncRNA
Human genetic variation: SNP, CNV (diseases)

Usage of sequencing data
Transcriptome (RNA) sequencing: differential expression; alternative splicing
Complete/targeted genome (DNA) resequencing: polymorphism and mutation discovery

De Novo sequencing
New species/strains
Challenge of assembly with short reads
8x coverage of 3 GB genome = 750 million fragments
Exponential problem for all-vs-all algorithm
Again big problem with repeats
Assemble contigs, fill gaps
Paired-end reads are essential
Can sequence the entire genome of a microbe in a single run

Assembly

Resequencing (mutation discovery/genotyping)
A lot of current sequencing effort is spent on re-sequencing genomes of known species
Individual humans (1000 Genomes Project)
Experimental organisms: looking for genetic variation, copy number variation
Challenge is to (quickly) align millions of sequence reads to a reference genome with some % of mismatches
Challenge to accurately call SNPs and indels
Problems with repeated sequences, both tandem and dispersed repeats

Read length and pairing
[Figure: example reads ACTTAAGGCTGACTAGC and TCGTACCGATATGCTG]
Short reads are problematic, because short sequences do not map uniquely to the genome.
Solution #1: Get longer reads.
Solution #2: Get paired reads.

Read Length is Not As Important For Resequencing

Paired End Reads are Important!
[Figure: repetitive DNA and unique DNA; Read 1 and Read 2 separated by a known distance]
Single read maps to multiple positions
Paired read maps uniquely

RNA Sequencing “Digital Gene Expression” or “RNA-Seq”
Truly accurate gene expression measurements
Can replace gene expression microarrays
25% more sensitive
Does not rely on hybridization (no %GC bias, no cross-hybridization between related genes)
Discover novel genes (and other kinds of RNA molecules)
One experiment found that 34% of human transcripts were not from known genes (Sultan et al, Science Aug 15;321(5891):)

More information from RNA
Can capture true alternative splicing information
Sequence of splice junctions
One study found 4,096 previously unknown splice junctions in 3,106 human genes
Different transcription start and end points for RNA molecules
Allelic variation (SNPs)
Small RNAs

Metagenomics
Survey/discovery of all the species present in an environmental or medical sample
“Human Microbiome”: disease vs. healthy microbe populations in mouth, intestines, skin, reproductive tract, etc.
Complete multiple genome sequencing
Complete multi-species transcript profiling (metabolic reconstruction)
Deep sampling of genetic variation in microbial populations (frequency of drug resistant, toxin producing, etc.)

Informatics is the Bottleneck
Scientists are currently able to generate sequence data much faster and more easily than they are able to analyze it
Customized analysis / bioinformatics consulting is needed for every project

Bioinformatics Challenges
Need for large amounts of CPU power
Informatics groups must manage compute clusters
Challenges in parallelizing existing software or redesigning algorithms to work in a parallel environment
Very large text files (~10 million lines long)
Impossible memory usage and execution time

Future Directions
Sequencing will continue to get much faster and cheaper, by 4-10x per year for several more years.
Complete human genome sequencing will be available as a clinical diagnostic tool within 2-3 years.
Data storage and analysis bottleneck
Data security/privacy issues

Overview
Whole genome shotgun sequencing
[Figure: genomic segment broken into short DNA sequences (e.g. ACGTGGTAA, CGTATACAC, TAGGCCATA, GTAATGGCG, CACCCTTAG, TGGCGTATA) and reassembled into ACGTGGTAATGGCGTATACACCCTTAGGCCATA]

Next-generation sequencing technologies, which were introduced in the last few years, have revolutionized the sequencing landscape.
These new technologies produce up to 30 GB of data through massively parallel sequencing within a few days.
Next-generation sequencing reads are also characterized by much shorter sequences and higher sequencing error rates.

Genomes Transcriptomes Metagenomes De Novo Assembly Template Based Assembly

De Novo sequencing
New species/strains
Challenge of assembly with short reads
8x coverage of 3 GB genome = 750 million fragments (32 bp)
Exponential problem for all-vs-all algorithm
Again big problem with repeats
Assemble contigs, fill gaps
Paired-end reads are essential
Can sequence the entire genome of a microbe in a single run
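The fragment count on this slide follows from the coverage formula n = C L / l. The sketch below reproduces the arithmetic and also counts read pairs for a naive all-vs-all overlap comparison. The function name is hypothetical, and the pair count assumes every read is compared with every other read once, which grows quadratically with the number of reads.

```python
def fragments_for_coverage(cov, genome_len, read_len):
    """Number of reads n with n * l / L = C (integer arithmetic)."""
    return cov * genome_len // read_len


n_reads = fragments_for_coverage(8, 3_000_000_000, 32)

# Naive all-vs-all comparison: one comparison per unordered pair of reads.
n_pairs = n_reads * (n_reads - 1) // 2

if __name__ == "__main__":
    print(f"{n_reads:,} reads")          # 750,000,000 reads
    print(f"{n_pairs:.3e} read pairs")   # about 2.8e17 comparisons
```

Even at a billion comparisons per second, that many pairwise comparisons would take several years, which is why short-read assemblers avoid explicit all-vs-all overlap computation.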

Genome Sequencing Assembly Algorithms
Shotgun sequencing assembly problem: find the shortest common superstring of a set of sequences.
Given strings {s1, s2, …}, find the shortest string T such that every si is a substring of T.
This is NP-hard.

Greedy Algorithm
Nodes are fragments
An edge means that an overlap exists
The weight of an edge is the length of the overlap found after calculating pairwise alignments of all fragments

Greedy Algorithm
An edge e = (f, g) in the path has a certain weight t, which means that the last t bases of the tail f of e are the same as the first t bases of its head g
Hamiltonian path: a path that goes through every vertex

Greedy Algorithm
Looking for shortest common superstrings is the same as looking for Hamiltonian paths of maximum weight in a directed multigraph.
The greedy algorithm is a “greedy” attempt at computing the heaviest path. Its basic idea is to continuously add the heaviest available edge.
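The greedy idea can be sketched in a few lines. This is an illustrative toy, not a production assembler: it uses exact suffix-prefix matches instead of pairwise alignments, ignores sequencing errors and reverse complements, and the function names are hypothetical. The example reads are the short fragments shown on the genome-sequencing slides.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0


def greedy_superstring(fragments):
    """Repeatedly merge the pair of fragments with the heaviest overlap."""
    frags = list(dict.fromkeys(fragments))            # drop exact duplicates
    # Drop fragments fully contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_t, best_i, best_j = -1, 0, 1
        for i, f in enumerate(frags):
            for j, g in enumerate(frags):
                if i != j:
                    t = overlap(f, g)
                    if t > best_t:                    # heaviest available edge
                        best_t, best_i, best_j = t, i, j
        merged = frags[best_i] + frags[best_j][best_t:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""


if __name__ == "__main__":
    reads = ["ACGTGGTAA", "CGTATACAC", "TAGGCCATA",
             "GTAATGGCG", "CACCCTTAG", "TGGCGTATA"]
    print(greedy_superstring(reads))
```

On this input the merges happen in order of overlap weight (6, 5, 4, then the two overlaps of 3) and reproduce the 33-base sequence from the slide. On repeat-rich input the greedy choice can commit to a false overlap, which is the limitation mentioned on the following slide.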

Genome Sequencing Assembly Algorithms
Shotgun sequencing assembly problem, continued.
Greedy algorithms were the first successful assembly algorithms implemented.
Used for organisms such as bacteria and single-celled eukaryotes.
Because of the greedy algorithm’s limitations, two other algorithms were derived.

Genome Sequencing Assembly Algorithms: Overlap-layout-consensus
An assembler builds the graph
Output is a set of nonintersecting simple paths, each path being a contig.

Overlap-layout-consensus
Overlap-layout-consensus method for assembly. Build an overlap graph where each node represents a read. An edge exists between two reads if they overlap. Traverse the graph to find unambiguous paths which form contigs.
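The overlap and layout steps can be sketched as follows. This is a simplified illustration: overlaps are exact suffix-prefix matches above a hypothetical `min_len` threshold, reads are assumed to be distinct and error-free, and the consensus step is omitted because the toy reads agree exactly wherever they overlap.

```python
from collections import defaultdict


def suffix_prefix_overlap(a, b, min_len):
    """Longest proper suffix of a matching a prefix of b, if >= min_len."""
    for t in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a[-t:] == b[:t]:
            return t
    return 0


def build_overlap_graph(reads, min_len=3):
    """Overlap step: node = read, edge a -> b weighted by overlap length."""
    graph = defaultdict(dict)
    for a in reads:
        for b in reads:
            if a != b:
                t = suffix_prefix_overlap(a, b, min_len)
                if t:
                    graph[a][b] = t
    return graph


def layout_contigs(reads, graph):
    """Layout step: follow edges only while the path is unambiguous."""
    indeg, pred = defaultdict(int), {}
    for a, nbrs in graph.items():
        for b in nbrs:
            indeg[b] += 1
            pred[b] = a          # only used when indeg[b] == 1
    contigs = []
    for r in reads:
        # Skip reads that are the unique continuation of their predecessor.
        if indeg[r] == 1 and len(graph.get(pred[r], {})) == 1:
            continue
        contig, cur = r, r
        while len(graph.get(cur, {})) == 1:
            (nxt, t), = graph[cur].items()
            if indeg[nxt] != 1:
                break
            contig += nxt[t:]
            cur = nxt
        contigs.append(contig)
    return contigs


if __name__ == "__main__":
    reads = ["ACGTGGTAA", "CGTATACAC", "TAGGCCATA",
             "GTAATGGCG", "CACCCTTAG", "TGGCGTATA"]
    print(layout_contigs(reads, build_overlap_graph(reads)))
```

With the example reads, the graph is a single unambiguous path, so the layout yields one contig. Where repeats create branching nodes, the walk stops and several shorter contigs are reported instead.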

Overlap-layout-consensus
Overlap graph for a bacterial genome. The thick edges in the picture on the left (a Hamiltonian cycle) correspond to the correct layout of the reads along the genome (figure on the right). The remaining edges represent false overlaps induced by repeats (exemplified by the red lines in the figure on the right)

Next-generation sequencing
Lower cost / base pair Very short fragment lengths (25-75bps) High error rate Inherent ability to do paired-end (mate-pair) sequencing.

Next generation sequencing
Paired-End sequencing (Mate pairs) Sequence two ends of a fragment of known size. Currently fragment length (insert size) can range from 200– 10,000 bps

Next-generation sequencing
Challenging to assemble the data.
Short fragment length = very small overlap, therefore many false overlaps
Sequenced up to 100x coverage, so data size increases.
Large number of reads + short overlap + higher error rate make the traditional overlap-layout-consensus approach impractical.

Current approaches: Euler / De Bruijn approach
Introduced as an alternative to the overlap-layout-consensus approach in capillary sequencing.
More suited for short read assembly.

Genome Sequencing Assembly Algorithms: Eulerian path
Eulerian path: a path that visits all edges of a graph
Breaks reads into overlapping n-mers.
For each n-mer, the source node is its (n-1)-prefix and the destination node is its (n-1)-suffix.
Basic problem is to find a path that uses all the edges.
Finding an Eulerian path is more efficient than finding a Hamiltonian path.

Eulerian Circuits and Paths
Eulerian Circuit: visits each edge in a graph exactly once, and ends at the same vertex at which it started.
[Figure: graph on vertices a, b, c, d, e, f] a-d-b-f-e-d-f-c-b-a is an Eulerian cycle in this particular graph
Eulerian Path: visits each edge in a graph exactly once.
[Figure: graph on vertices a through j] a-b-c-d-e-f-g-c-h-f-i-j is an Eulerian trail in this particular graph
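An Eulerian path or circuit can be found in time linear in the number of edges. Below is a minimal sketch of Hierholzer's algorithm for directed graphs. Directed edges are an assumption chosen to match the de Bruijn setting (the example graphs on this slide appear to be undirected), and the function name is hypothetical.

```python
from collections import defaultdict


def eulerian_path(edges):
    """Return the vertices of a path using every directed edge exactly once,
    or None if no such path exists. edges is a list of (u, v) pairs."""
    if not edges:
        return []
    out_edges = defaultdict(list)
    balance = defaultdict(int)               # out-degree minus in-degree
    for u, v in edges:
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    starts = [v for v, b in balance.items() if b == 1]
    ends = [v for v, b in balance.items() if b == -1]
    if (any(abs(b) > 1 for b in balance.values())
            or len(starts) > 1 or len(starts) != len(ends)):
        return None                          # degree condition violated
    start = starts[0] if starts else edges[0][0]

    # Hierholzer: walk until stuck, then back up and splice in detours.
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if out_edges[v]:
            stack.append(out_edges[v].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    # If some edges were unreachable, the graph was not connected.
    return path if len(path) == len(edges) + 1 else None


if __name__ == "__main__":
    print(eulerian_path([("a", "b"), ("b", "c"), ("c", "a")]))   # a circuit
    print(eulerian_path([("a", "b"), ("b", "c")]))               # a path
```

The degree check mirrors the classical condition: a circuit needs every vertex balanced, while a path allows one vertex with one extra outgoing edge (the start) and one with one extra incoming edge (the end).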

De Bruijn Graphs
k-mers: {AGC, ATC, ATT, CAG, CAT, GCA, TCA, TTC}
Nodes are (k-1)-mers: CA, GC, AG, TC, AT, TT
Edges are k-mers
The set of k-mers is called a k-spectrum
Problem: find the shortest string with the given k-spectrum

De Bruijn Graphs
Break each read sequence into overlapping fragments of size k (k-mers).
Form the De Bruijn graph such that each (k-1)-mer represents a node in the graph.
An edge exists from node a to node b iff there exists a k-mer whose prefix is a and whose suffix is b.
Traverse the graph along unambiguous paths to form contigs.
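The construction steps above can be sketched directly. This is a minimal illustration with hypothetical function names: it stores each edge once in a set, so k-mer multiplicities are discarded (an assumption that real assemblers usually do not make), and it ignores reverse complements and sequencing errors. The example input is the k-spectrum from the preceding De Bruijn slide, treated as reads of length k = 3.

```python
from collections import defaultdict


def kmers(read, k):
    """All overlapping substrings of length k, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


if __name__ == "__main__":
    spectrum = ["AGC", "ATC", "ATT", "CAG", "CAT", "GCA", "TCA", "TTC"]
    graph = de_bruijn_graph(spectrum, 3)
    for node in sorted(graph):
        print(node, "->", sorted(graph[node]))
```

For this spectrum the graph has the six nodes listed on the slide. Nodes AT and CA each have two outgoing edges, so the contig traversal has to stop at those branch points unless extra information resolves them.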

De Bruijn Graphs K = 4

Eulerian Path Approach to DNA Fragment Assembly
Ultimately, this converts an NP-complete Hamiltonian Path Problem into a simpler Eulerian Path Problem through construction of a de Bruijn graph. The number of ways to reconstruct the sequence equals the number of paths that follow the edge directions and traverse every edge. The resulting problem is that the graph may contain several different Eulerian paths, and we cannot tell which one corresponds to the original sequence.
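The ambiguity can be seen on the 3-spectrum {AGC, ATC, ATT, CAG, CAT, GCA, TCA, TTC} used earlier. The sketch below enumerates every reconstruction by brute-force backtracking; it is exponential and meant only for tiny examples, and the name `all_reconstructions` is illustrative.

```python
def all_reconstructions(kmers):
    """Return every string that uses each k-mer in `kmers` exactly once."""
    results = set()

    def extend(text, remaining):
        if not remaining:
            results.add(text)
            return
        for i, kmer in enumerate(remaining):
            # The next k-mer must overlap the current text by k-1 letters.
            if text[-(len(kmer) - 1):] == kmer[:-1]:
                extend(text + kmer[-1], remaining[:i] + remaining[i + 1:])

    for i, kmer in enumerate(kmers):
        extend(kmer, kmers[:i] + kmers[i + 1:])
    return sorted(results)

spectrum = ["AGC", "ATC", "ATT", "CAG", "CAT", "GCA", "TCA", "TTC"]
reconstructions = all_reconstructions(spectrum)
# ['ATCAGCATTC', 'ATTCAGCATC']
```

Both strings have exactly the same 3-spectrum, so the spectrum alone cannot decide between them.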

Eulerian Superpath Problem
Eulerian Superpath Problem (ESPP): given an Eulerian graph and a collection of paths on this graph, find an Eulerian path in this graph that contains all these paths as subpaths. The original Eulerian Path Problem (EPP) is a special case of the Eulerian Superpath Problem in which every path is a single edge. Solving: take the graph G and the system of paths P, and transform them into a new graph G1 and a new system P1. Keeping a one-to-one correspondence (equivalence) between (G,P) and (G1,P1), we make a series of these transformations: (G,P) → (G1,P1) → (G2,P2) → … → (Gk,Pk). The transformations should lead to a system Pk in which every path is a single edge. Since every transformation from beginning to end preserves equivalence, every solution of the EPP in (Gk,Pk) provides a solution to the ESPP in (G,P).

An x,y-detachment for no multiple edges
Let x = (vin,vmid) and y = (vmid,vout) be two consecutive edges in G, and let Px,y be all paths from P that include x,y as a subpath. P→x is the collection of paths from P that end with x, and Py→ is the collection of paths from P that start with y. The x,y-detachment adds a new edge z = (vin,vout) and deletes the edges x and y. We substitute z for x,y in all paths from Px,y, z for x in all paths from P→x, and z for y in all paths from Py→. Repeating such detachments reduces an ESPP to an EPP.
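A single detachment step can be sketched as below. This covers only the simple case described on the slide (x and y are not multiple edges) and makes several assumptions: edges are stored as a dictionary from a hypothetical edge id to a (source, destination) pair, paths are lists of edge ids, and the function name `xy_detachment` is illustrative.

```python
def xy_detachment(edges, paths, x, y, z):
    """Replace consecutive edges x, y by a new edge z and rewrite the paths accordingly."""
    v_in, v_mid = edges[x]
    v_mid_y, v_out = edges[y]
    assert v_mid == v_mid_y, "x and y must be consecutive"
    new_edges = {e: uv for e, uv in edges.items() if e not in (x, y)}
    new_edges[z] = (v_in, v_out)

    new_paths = []
    for path in paths:
        rewritten, i = [], 0
        while i < len(path):
            if path[i] == x and i + 1 < len(path) and path[i + 1] == y:
                rewritten.append(z)  # path contains x,y as a subpath
                i += 2
            elif path[i] == x and i == len(path) - 1:
                rewritten.append(z)  # path ends with x
                i += 1
            elif path[i] == y and i == 0:
                rewritten.append(z)  # path starts with y
                i += 1
            else:
                rewritten.append(path[i])
                i += 1
        new_paths.append(rewritten)
    return new_edges, new_paths

# Toy system (assumed): a chain a -> b -> c -> d with three paths on it.
toy_edges = {"x": ("a", "b"), "y": ("b", "c"), "w": ("c", "d")}
toy_paths = [["x", "y", "w"], ["x"], ["y", "w"]]
g1, p1 = xy_detachment(toy_edges, toy_paths, "x", "y", "z")
```

After the step, `g1` contains only the edges w and z, and every path in `p1` is expressed in terms of those two edges.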

De Bruijn Graphs
An elegant way of representing the problem. Very fast execution. Error correction can be handled in the graph. The de Bruijn graph can be huge: ~200 GB for human genomes. Pair information is not used in the initial phase, resulting in overly complicated graphs.

Repeats: repeats in the sequence
Assembly programs should detect repeats during the assembly process, not after it. Unresolved repeats lead to incorrect genome reconstruction. Assemblers should try to resolve as many repeats as possible correctly.

Repeats: detecting repeats with the Euler assembly program
Euler finds repeats as complex parts of the graph constructed during the assembly process. Researchers look into these complex areas to try to resolve repeats. Assemblers can use clone-mate (paired-end) information to find incorrect assemblies, based on finding clone-mate pairs that lie too close to or too far from one another.

ASSEMBLY OF READS WITH ERRORS
Errors in read data greatly complicate the task of fragment assembly. Error correction is performed prior to assembly by solving the error correction problem.

Resequencing (mutation discovery/genotyping)
A lot of current sequencing effort is spent on re-sequencing genomes of known species: individual humans (1000 Genomes Project) and experimental organisms, looking for genetic variation and copy number variation. One challenge is to (quickly) align millions of sequence reads to a reference genome with some percentage of mismatches. Another is to accurately call SNPs and indels. Repeated sequences, both tandem and dispersed repeats, cause problems.

New Challenge
Alignment programs are needed to map short reads from next-generation sequencing technologies to a reference genome.

The reads mapping problem
Given a set of reads R, for each read r ∈ R, find its target regions on the reference genome G such that, for each target region t, there are at most k mismatches between r and t.
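The problem statement can be pinned down with a brute-force baseline that slides each read along the reference and counts mismatches. This is only a reference implementation for small inputs, not how real aligners work; the function names and the toy sequences are assumptions for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def map_read(read, genome, k):
    """Return every start position where `read` matches `genome` with at most k mismatches."""
    m = len(read)
    return [i for i in range(len(genome) - m + 1)
            if hamming(read, genome[i:i + m]) <= k]

def map_reads(reads, genome, k):
    """Solve the read mapping problem for a whole set of reads."""
    return {read: map_read(read, genome, k) for read in reads}

toy_genome = "ACGTACGTTAGC"
toy_hits = map_reads(["ACGA", "TTAG"], toy_genome, k=1)
# {'ACGA': [0, 4], 'TTAG': [7]}
```

The cost is proportional to (number of reads) x (genome length) x (read length), which is why indexed approaches are needed for millions of reads.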

Aligner Program

Aligner algorithms
Aligner algorithms can be divided into two categories: seeded alignment algorithms (BLAST-like) and Burrows-Wheeler transform based algorithms.

Seed alignment algorithm
BLAST is the most popular tool. It requires a query sequence to search for and a sequence to search against. Step 1: make a k-letter word list of the query sequence. Step 2: list the possible matching words. Step 3: extend the match to find the high-similarity pair. Example sequences: TAGGACCTAACC and GACCACCTTTT. The first stage is the word match; the second stage is to extend the match in both directions until the sum of scores falls below some threshold. A word match is preliminary evidence of high similarity. Only the space around seed matches is explored, which is why BLAST is faster than applying the full Smith-Waterman (S/W) algorithm to the whole space. Most of the running time is spent extending seed matches with dynamic programming (DP), i.e. the Smith-Waterman algorithm.

Blast algorithm
Find seeded matches of 11 base pairs. Extend each match to the right and left, until the scores drop too much, to form an alignment. Report all local alignments. Example:
AGCGATGTCACGCGCCCGTATTTCCGTA
   ||| |||||||||||  || ||||
TCGGATCTCACGCGCCCGGCTTACCGTG
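A minimal seed-and-extend sketch on the two example sequences follows. It is not BLAST's actual implementation: the scoring (+1 match, -1 mismatch), the ungapped extension, the drop-off threshold `x_drop`, and all function names are assumptions chosen to keep the example small.

```python
def extend(q, s, qi, si, step, x_drop):
    """Ungapped extension from (qi, si) in direction `step`; returns (best gain, best length)."""
    score = best = best_len = n = 0
    while 0 <= qi < len(q) and 0 <= si < len(s):
        score += 1 if q[qi] == s[si] else -1
        n += 1
        if score > best:
            best, best_len = score, n
        elif best - score > x_drop:
            break  # score dropped too far below the best seen so far
        qi += step
        si += step
    return best, best_len

def seed_and_extend(query, subject, w=11, x_drop=2):
    """Return sorted (query_start, subject_start, length, score) for every extended seed."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)  # word list of the query
    hits = set()
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):  # exact w-letter seed match
            left_gain, left_len = extend(query, subject, i - 1, j - 1, -1, x_drop)
            right_gain, right_len = extend(query, subject, i + w, j + w, +1, x_drop)
            hits.add((i - left_len, j - left_len,
                      left_len + w + right_len, left_gain + w + right_gain))
    return sorted(hits)

query = "AGCGATGTCACGCGCCCGTATTTCCGTA"
subject = "TCGGATCTCACGCGCCCGGCTTACCGTG"
alignments = seed_and_extend(query, subject)  # [(3, 3, 24, 16)]
```

The single 11-letter seed TCACGCGCCCG (0-based position 7 in both sequences) is extended to a 24-column alignment with 20 matches and 4 mismatches.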

Spaced seed
Spaced seed: nonconsecutive matches and optimized match positions. A BLAST seed is represented as a spaced-seed string in which 1 means a required match and 0 means a “don’t care” position. The length of the seed is the string length, and the weight of the seed is the number of 1s in the string. This seemingly simple change makes a huge difference: it significantly increases hits to homologous regions while reducing bad hits.
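How a spaced seed is applied can be shown in a few lines. The sketch assumes two sequences that are already placed side by side on the same diagonal (no gaps); the function names and the toy sequences are illustrative.

```python
def seed_length(seed):
    return len(seed)

def seed_weight(seed):
    return seed.count("1")

def spaced_seed_hits(seed, a, b):
    """Offsets i where a and b agree at every '1' position of the seed placed at i."""
    required = [k for k, c in enumerate(seed) if c == "1"]
    n = min(len(a), len(b))
    return [i for i in range(n - len(seed) + 1)
            if all(a[i + k] == b[i + k] for k in required)]

# Toy pair (assumed): the sequences differ at positions 2 and 7.
a = "ACGTACGT"
b = "ACTTACGA"
contiguous_hits = spaced_seed_hits("111", a, b)   # [3, 4]
spaced_hits = spaced_seed_hits("1101", a, b)      # [0, 3]
```

The seed 1101 has length 4 and weight 3; at offset 0 it tolerates the mismatch at position 2 because that position falls on its 0.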

Multiple simultaneous seeds
Multiple simultaneous seeds (MSS) are defined as a set of seeds ∏ = {seed1, seed2, …, seedi, …, seedn}. ∏ detects a similarity if at least one of the component seeds detects the similarity. Example: the set of simultaneous seeds {1101, 1011}. An important fact about MSS is that they can find more seed matches than a single seed.
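The "at least one seed hits" rule can be sketched with the seed set {1101, 1011} from the slide. The aligned toy sequences and the function names are assumptions for illustration.

```python
def seed_hits(seed, a, b):
    """Offsets where the aligned sequences a and b agree at every '1' of the seed."""
    required = [k for k, c in enumerate(seed) if c == "1"]
    n = min(len(a), len(b))
    return [i for i in range(n - len(seed) + 1)
            if all(a[i + k] == b[i + k] for k in required)]

def mss_detects(seeds, a, b):
    """A seed set detects the similarity if any component seed has a hit."""
    return any(seed_hits(seed, a, b) for seed in seeds)

# Toy pair (assumed): match pattern is 1011 (mismatch at the second position).
a, b = "ACGT", "ATGT"
hits_1101 = seed_hits("1101", a, b)                # []
hits_1011 = seed_hits("1011", a, b)                # [0]
detected = mss_detects({"1101", "1011"}, a, b)     # True
```

The seed 1101 alone misses this similarity, while the set still detects it through 1011.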

Prefix Tree (Trie)
The prefix trie for string X is a tree where each edge is labeled with a symbol and the concatenation of the edge symbols on the path from a leaf to the root gives a unique prefix of X. On the prefix trie, the concatenation of the edge symbols from a node to the root gives a unique substring of X. The prefix trie of X is identical to the suffix trie of the reverse of X, and therefore suffix trie theory also applies to the prefix trie.

Burrows–Wheeler Transform
Let ∑ be an alphabet. The symbol $ is not present in ∑ and is lexicographically smaller than all the symbols in ∑. A string X = a0 a1 ... an−1 always ends with the symbol $ (i.e. an−1 = $). The suffix array S of X is a permutation of the integers 0...n−1 such that S(i) is the start position of the i-th smallest suffix.

To compute S(·), string X is rotated to generate n strings, which are then sorted lexicographically.

After sorting, the positions of the first symbols form the suffix array. BWT(X) is the last column of the sorted matrix.
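The rotate-and-sort construction can be written directly for a short string. The sketch uses X = "googol$", the example string that appears later in the slides; sorting all rotations explicitly is quadratic in memory and only suitable for illustration. The function name is an assumption.

```python
def suffix_array_and_bwt(x):
    """Sort all rotations of x; return (suffix array, last column of the sorted matrix)."""
    assert x.endswith("$") and x.count("$") == 1
    # '$' sorts before letters in ASCII, matching the convention on the slide.
    rotations = sorted((x[i:] + x[:i], i) for i in range(len(x)))
    suffix_array = [start for _, start in rotations]
    bwt = "".join(rotation[-1] for rotation, _ in rotations)
    return suffix_array, bwt

sa, bwt = suffix_array_and_bwt("googol$")
# sa  == [6, 3, 0, 5, 2, 4, 1]
# bwt == 'lo$oogg'
```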

Burrows–Wheeler Transform is Reversible
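Reversibility can be checked with the textbook "prepend and sort" inversion, sketched below. It is deliberately naive (it rebuilds the whole sorted matrix column by column); practical implementations use the last-to-first mapping instead. The function name is illustrative.

```python
def inverse_bwt(bwt):
    """Recover the original string (ending in '$') from its Burrows-Wheeler transform."""
    table = [""] * len(bwt)
    for _ in range(len(bwt)):
        # Prepending the BWT column and re-sorting reveals one more column each round.
        table = sorted(symbol + row for symbol, row in zip(bwt, table))
    # The row that ends with the terminator is the original string.
    return next(row for row in table if row.endswith("$"))

original = inverse_bwt("lo$oogg")  # 'googol$'
```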

Most algorithms for constructing a suffix array require at least n log2 n bits of working space, which amounts to 12 GB for the human genome. Hon et al. (2007) gave an algorithm that uses n bits of working space and requires <1 GB of memory at peak for constructing the BWT of the human genome.
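The 12 GB figure follows from simple arithmetic, sketched here under the assumption that the human genome has roughly n = 3 x 10^9 bases and that 1 GB is 10^9 bytes.

```python
import math

n = 3_000_000_000                             # assumed genome length in bases
bits_per_entry = math.ceil(math.log2(n))      # 32 bits to address any position
suffix_array_gb = n * bits_per_entry / 8 / 1e9    # n log2 n bits, in GB
n_bits_gb = n / 8 / 1e9                           # n bits, in GB
```

This gives 12.0 GB for the suffix array and 0.375 GB for n bits, consistent with the "<1 GB" peak quoted above.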

If string W is a substring of X, the positions of all occurrences of W in X fall in one interval of the suffix array. Based on this observation, we define: R(W) = min{k : W is a prefix of X_S(k)} and R’(W) = max{k : W is a prefix of X_S(k)}, where X_i = X[i, n−1] is a suffix of X. In particular, if W is the empty string, R(W) = 1 and R’(W) = n−1.

The interval [R(W), R’(W)] is called the SA interval of W, and the set of positions of all occurrences of W in X is {S(k) : R(W) ≤ k ≤ R’(W)}. For example, for X = “googol$” the SA interval of the string ‘go’ is [1, 2]. The suffix array values in this interval are 3 and 0, which give the positions of all occurrences of ‘go’ in “googol”.
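The ‘go’ example can be reproduced with two binary searches over the suffix array. This sketch compares truncated suffixes directly against the text, which is simple but not what BWT-based aligners do; the function name `sa_interval` is illustrative.

```python
def sa_interval(x, sa, w):
    """Return the inclusive SA interval (R, R') of w in x, or None if w does not occur."""
    def prefix(k):
        return x[sa[k]:sa[k] + len(w)]  # suffix k truncated to len(w)

    lo, hi = 0, len(sa)
    while lo < hi:                       # first k with prefix(k) >= w
        mid = (lo + hi) // 2
        if prefix(mid) < w:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                       # first k with prefix(k) > w
        mid = (lo + hi) // 2
        if prefix(mid) <= w:
            lo = mid + 1
        else:
            hi = mid
    return (start, lo - 1) if start < lo else None

x = "googol$"
sa = sorted(range(len(x)), key=lambda i: x[i:])   # [6, 3, 0, 5, 2, 4, 1]
interval = sa_interval(x, sa, "go")               # (1, 2)
positions = [sa[k] for k in range(interval[0], interval[1] + 1)]  # [3, 0]
```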

Knowing the intervals in the suffix array, we can get the positions. Therefore, sequence alignment is equivalent to searching for the SA intervals of substrings of X that match the query. For the exact matching problem, there is at most one such interval.

Search a Query
We can compute the SA interval for every node in the trie, so mapping each read is equivalent to searching the trie. Search query W = ‘gol’; reversed W: ‘log’.
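The same query can be answered with the standard backward-search recurrence over the BWT, which processes the query from its last symbol to its first (equivalently, reads the reversed query ‘log’ left to right). This sketch does not build an explicit trie, uses half-open intervals, and counts occurrences with `str.count` rather than a precomputed table; the function names are assumptions.

```python
def build_index(x):
    """Return (suffix array, BWT, C) where C[a] counts symbols in x smaller than a."""
    sa = sorted(range(len(x)), key=lambda i: x[i:])
    bwt = "".join(x[i - 1] for i in sa)  # x[-1] is '$' when i == 0
    C, total = {}, 0
    for symbol in sorted(set(x)):
        C[symbol] = total
        total += x.count(symbol)
    return sa, bwt, C

def backward_search(w, sa, bwt, C):
    """Return the sorted positions of all exact occurrences of w."""
    lo, hi = 0, len(bwt)                 # half-open SA interval of the empty string
    for a in reversed(w):
        if a not in C:
            return []
        lo = C[a] + bwt.count(a, 0, lo)  # occurrences of a before row lo
        hi = C[a] + bwt.count(a, 0, hi)
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])

sa, bwt, C = build_index("googol$")
gol_positions = backward_search("gol", sa, bwt, C)  # [3]
```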

Inexact Search
Search query ‘lol’ with one mismatch allowed.
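Inexact search can be sketched by letting the backward search branch over every symbol and charging one unit of a mismatch budget whenever the chosen symbol differs from the query. The sketch handles substitutions only (no gaps) and omits the pruning bounds that practical aligners use; the function name and structure are assumptions.

```python
def inexact_search(w, x, max_mismatches):
    """Positions in x where w occurs with at most max_mismatches substitutions."""
    sa = sorted(range(len(x)), key=lambda i: x[i:])
    bwt = "".join(x[i - 1] for i in sa)
    C, total = {}, 0
    for symbol in sorted(set(x)):
        C[symbol] = total
        total += x.count(symbol)
    alphabet = [s for s in C if s != "$"]
    found = set()

    def recurse(i, budget, lo, hi):
        if i < 0:                        # whole query consumed: report the interval
            found.update(sa[lo:hi])
            return
        for a in alphabet:
            cost = 0 if a == w[i] else 1
            if cost > budget:
                continue
            new_lo = C[a] + bwt.count(a, 0, lo)
            new_hi = C[a] + bwt.count(a, 0, hi)
            if new_lo < new_hi:          # some suffix still matches
                recurse(i - 1, budget - cost, new_lo, new_hi)

    recurse(len(w) - 1, max_mismatches, 0, len(bwt))
    return sorted(found)

lol_hits = inexact_search("lol", "googol$", 1)  # [3]
```

The only hit is position 3, where ‘gol’ differs from ‘lol’ in its first letter; with a budget of zero there is no hit.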
