Assembly by short paired-end reads Wing-Kin Sung National University of Singapore
State of genome sequencing Thousands of bacterial genomes plus a few dozen higher organisms have been sequenced. There are still a lot of genomes waiting to be sequenced. Personalized sequencing is also a big market. In summary, we need cheaper and faster sequencing.
Bio-technology: DNA-PETs What data is used for genome assembly? A DNA-PET is a paired-end tag extracted from the genome. – Each tag is of length readlength (e.g. readlength = 35). – The span of the DNA-PET is fixed (e.g. 1kb, 5kb, 10kb, or 20kb). (Figure: two tags, ACTCAGCACCTTACGGCGTGCATCA and TACGTTCTGAACGGCAGTACAAACT, annotated with the read length and the span of the paired-end read.)
Bio-technology: DNA-PETs Some genome → Sonication → Size selection → Paired-end sequencing
Sequence Assembly Problem Given the paired-end reads, can we assemble them to reconstruct the genome?
Agenda A short discussion on the data quality. A brief review of existing methods. PE-Assembler. An example demonstrating the use of assembly. Scaffolding.
QUALITY OF PAIRED-END SEQUENCING
Paired-end sequencing 1kb 10kb 20kb Size selection Circularize, ligation, and cut Sequencing
Size selection is not exact Sample fragment length distributions: 300bp paired-end library; 10,000bp mate pair library
Errors in DNA Sequencing Ligation errors – Occur in mate-pair libraries during library construction. – Two unrelated reads are paired together. Chr1 Chr2 5’ and 3’ ends of two different fragments put together
Errors in DNA Sequencing Sequencing errors – Caused by the sequencing machine ‘misreading’ bases. – In most sequencing technologies, sequencing errors are more likely to occur towards the end of the read. Actual DNA sequence: ACGTGAGGATGACACGATAGCCA. Sequence as interpreted by the machine: ACGTGAGCATGACACGATAGCCA (the machine incorrectly reads the eighth base, G, as a C).
EXISTING METHODS
SSAKE, VCAKE and SHARCGS Base-by-base 3’ extension. Currently, these methods can assemble only short genomes.
De Bruijn graph approach Velvet, Euler-USR, ABySS, IDBA. E.g. input = {AAGACTC, ACTCCGACTG, ACTGGGAC, GGACTTT}. List of distinct 3-mers = {AAG, AGA, GAC, ACT, CTC, TCC, CCG, CGA, CTG, TGG, GGG, GGA, CTT, TTT}. The reads AAGACTC, ACTCCGACTG, ACTGGGAC and GGACTTT overlap to give AAGACTCCGACTGGGACTTT. References: Mark J. Chaisson, Dumitru Brinza and Pavel A. Pevzner. De novo fragment assembly with short mate-paired reads: Does read length matter? Genome Res. 19: Daniel Zerbino and Ewan Birney. Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs. Genome Res. 18:
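The 3-mer list for the example reads can be reproduced with a few lines of Python. This is a minimal sketch, not the construction used by Velvet or any other named assembler: the helper names `kmers` and `successor_graph` are made up for illustration, and the graph convention (nodes are k-mers, with an edge between k-mers that are adjacent in some read) is an assumption.

```python
def kmers(read, k):
    """Return every length-k substring of read, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def successor_graph(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in a read.

    Assumed convention for this sketch: nodes are k-mers, and two nodes are
    linked when they are adjacent (overlap by k-1 bases) in at least one read.
    """
    graph = {}
    for read in reads:
        parts = kmers(read, k)
        for left, right in zip(parts, parts[1:]):
            graph.setdefault(left, set()).add(right)
        if parts:
            graph.setdefault(parts[-1], set())
    return graph


reads = ["AAGACTC", "ACTCCGACTG", "ACTGGGAC", "GGACTTT"]

# Distinct 3-mers in order of first appearance.
distinct = []
for read in reads:
    for kmer in kmers(read, 3):
        if kmer not in distinct:
            distinct.append(kmer)

graph = successor_graph(reads, 3)
print(distinct)
print(sorted(graph["ACT"]))  # ACT is a branching node in this example
```

The node ACT has more than one successor, which is the kind of ambiguity that repeats and sequencing errors introduce into the graph.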
ALLPATHS Builds a unipath graph by repeatedly overlapping the unipaths. Highly accurate. However, it is slow and memory intensive. (Figure: unipaths are combined to obtain the unipath graph.) Jonathan Butler, Iain MacCallum, Michael Kleber, Ilya A. Shlyakhter, Matthew K. Belmonte, Eric S. Lander, Chad Nusbaum, and David B. Jaffe. ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res. 18:
Summary of current status Base-by-base extension approaches work only for short genomes. De Bruijn graph approaches are fast. – However, the size of the De Bruijn graph increases exponentially with the error rate. – When the error rate increases, this approach is not accurate. ALLPATHS is accurate. – However, it is too slow and not memory efficient enough to handle big genomes. Furthermore, little has been done on using paired-end information explicitly. Is it possible to have a reasonably fast and memory-efficient method which is accurate?
PE-ASSEMBLER
Idea (I) Suppose we have two possible ways (green or red reads) to extend the sequence, with next base c or t. How can we decide whether to extend with the red or the green read?
Idea (II) If we use the paired-end information, we can make the decision early!
PE-Assembler Instead of using de Bruijn graph approaches, we use the simple base-by-base extension approach similar to SSAKE, VCAKE and SHARCGS. Moreover, we try to utilize paired-end reads to resolve ambiguity.
PE-Assembler Input: a set of paired-end reads. 1. Read screening 2. Seed building 3. Contig extension 4. Scaffolding 5. Gap filling
Read screening (I) Read screening identifies ‘solid’ reads, i.e. error-free and non-repetitive reads. From the set of reads, we count the frequency of each k-mer. If a k-mer occurs only once, it is likely to contain a sequencing error. If a k-mer occurs too many times, it is likely to come from a repeat. E.g. the 10-mers of the read acgtcgagtcaggtacgt are acgtcgagtc, cgtcgagtca, gtcgagtcag, tcgagtcagg, cgagtcaggt, gagtcaggta, agtcaggtac, gtcaggtacg, tcaggtacgt.
Read screening (II) A read is said to be ‘solid’ if the frequencies of all its k-mers are in the blue region of the k-mer frequency histogram, i.e. neither too low nor too high. The solid reads are the starting points for extension.
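The screening rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the PE-Assembler implementation: the function name `solid_reads` and the explicit thresholds `low` and `high` (standing in for the bounds of the blue region) are hypothetical, and a real screen would pick the thresholds from the observed k-mer frequency histogram.

```python
from collections import Counter


def kmers(read, k):
    """Return every length-k substring of read, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def solid_reads(reads, k, low, high):
    """Keep reads whose k-mers all occur between low and high times.

    low and high are assumed stand-ins for the bounds of the 'blue region':
    counts below low suggest sequencing errors, counts above high suggest
    repeats. Reads shorter than k are dropped.
    """
    counts = Counter(kmer for read in reads for kmer in kmers(read, k))
    return [
        read for read in reads
        if len(read) >= k
        and all(low <= counts[kmer] <= high for kmer in kmers(read, k))
    ]


# Toy data: two identical reads plus one read with a single-base error.
toy_reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
print(solid_reads(toy_reads, k=4, low=2, high=3))
```

In the toy data the erroneous read contributes k-mers seen only once, so it is screened out, while lowering `high` to 2 would also reject the correct reads because their shared k-mer ACGT occurs three times.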
Seed building (I) A seed is a contiguous region in the target genome which is of length at least MaxSpan. Starting from some ‘solid’ reads, we extend the read from both 5’ and 3’ ends.
Seed building (II) In case there are multiple feasible extensions, we can resolve it by checking the mates. In the following example, g has support while a does not have support. Hence, g is correct.
Seed building (III) The previous method cannot resolve ambiguities arising from sequencing errors. In such cases, we extend every candidate base up to a distance of ReadLength. Any extension path arising from sequencing errors is likely to terminate prematurely. If only one candidate path can reach the full distance, then that path is assumed to be the correct extension. (Figure: candidate extension paths branching from ACGTCA; paths marked X terminate before reaching ReadLength.)
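The dead-end test described above can be sketched as a small depth-limited search. This is a simplified illustration, not the PE-Assembler code: it assumes reads have been reduced to a set of k-mers and that an extension is valid when the last k-1 bases plus the new base form a known k-mer. The names `next_bases`, `can_extend` and `resolve` are hypothetical.

```python
def kmers(read, k):
    """Return every length-k substring of read, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def next_bases(suffix, kmer_set):
    """Bases that can follow a (k-1)-base suffix according to the k-mer set."""
    return [base for base in "ACGT" if suffix + base in kmer_set]


def can_extend(suffix, distance, kmer_set):
    """True if some path of `distance` further bases exists from suffix."""
    if distance == 0:
        return True
    return any(
        can_extend(suffix[1:] + base, distance - 1, kmer_set)
        for base in next_bases(suffix, kmer_set)
    )


def resolve(contig, kmer_set, k, read_length):
    """Return the single candidate base whose path reaches read_length bases.

    Returns None when zero or several candidates survive, i.e. when this
    test alone cannot resolve the ambiguity.
    """
    suffix = contig[-(k - 1):]
    survivors = [
        base for base in next_bases(suffix, kmer_set)
        if can_extend(suffix[1:] + base, read_length - 1, kmer_set)
    ]
    return survivors[0] if len(survivors) == 1 else None


# Toy data: k-mers from a true sequence plus a short erroneous fragment.
kmer_set = set(kmers("ACGTCATCGATGC", 4)) | set(kmers("TCAGC", 4))
print(resolve("ACGTCA", kmer_set, k=4, read_length=5))
```

In the toy data both G and T can follow ACGTCA, but the G path dies after two bases, so only T survives a look-ahead of five bases. With a look-ahead of one base both candidates survive and the function returns None.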
Contig extension The contig extension step aims to iteratively extend each verified seed to form a longer contig. Since each verified seed is longer than MaxSpan, we can extend the seed using paired-end reads.
Scaffolding problem Input: – A set of contigs of some genome X – A set of DNA-PETs of some genome X Scaffolding finds the correct ordering and the orientation of the contigs.
Scaffolding (II) Scaffolding proceeds in three steps: demarcate all repeat regions within the assembled contigs, build the contig graph, and identify a linear order of the contigs.
Demarcating all repeat regions within assembled contigs Map all paired-end reads onto the contigs. The mode of the read density is assumed to be the expected read coverage across the genome. Any region with read density higher than 1.5 times the mode is considered a repeat region. (Figure: histogram of read density against frequency, with the mode and 1.5 × mode marked; densities above 1.5 × mode are labelled as repeat regions.)
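The 1.5 × mode rule can be illustrated with a short Python sketch. It assumes the read density has already been summarised as one integer value per fixed-size window along a contig; the windowing scheme and the function name `repeat_windows` are assumptions made for this example only.

```python
from collections import Counter


def repeat_windows(density, factor=1.5):
    """Return (mode, indices of windows whose density exceeds factor * mode).

    density is an assumed per-window list of integer read densities.
    The mode is taken as the most frequent density value.
    """
    if not density:
        raise ValueError("density must not be empty")
    mode = Counter(density).most_common(1)[0][0]
    flagged = [i for i, value in enumerate(density) if value > factor * mode]
    return mode, flagged


window_density = [20, 21, 20, 45, 44, 20, 19, 20]
mode, flagged = repeat_windows(window_density)
print(mode, flagged)
```

Here the mode is 20, the threshold is 30, and the two windows with densities 45 and 44 are flagged as a candidate repeat region.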
Build contig graph
Identify a linear order of the contigs Case 1: 1 discordant edge Case 2: 2 discordant edges
Gap filling (I) From scaffolding, we identify adjacent contigs. The gaps between them are usually caused by repeat regions. Since we have paired-end reads from both the 5’ and 3’ sides, we may be able to fill in the gap.
Gap filling (II)
Parallelization 1. Read screening – Sequential: this step is largely disk bound. 2. Seed building – Run in multiple threads. If two threads use the same set of paired-end reads, rewind one of the threads. 3. Contig extension – Run in multiple threads. If two threads use the same set of paired-end reads, rewind one of the threads. 4. Scaffolding – Graph building: run in multiple threads. – Actual scaffolding: sequential. 5. Gap filling – Run in multiple threads.
Simulation data We perform simulation on 3 organisms.
Organism: E. coli | S. pombe | HG18 chr10
No. of contigs/chromosomes: 1 | 3 | 1
Genome length: 4,639,658bp | 12,571,820bp | 135,374,737bp
Library (for each organism): 200bp, 1kbp, 10kbp
Read length (bp): 35 75
Average insert size (bp):
Insert size range (average ± bp): ± 40, ± 200, ± 2000 for the 200bp, 1kbp and 10kbp libraries
No. of paired reads: 3.31m 8.98m 45.12m 9.02m
Coverage: 50x 10x
Seq. error rate: 2.0%
Ligation error rate: 0.0% 2.0% (repeated for each organism)
Simulation data Columns: E. coli (PA, Velvet, Allpaths2, Abyss, SOAP); S. pombe (PA, Velvet, Allpaths2, Abyss, SOAP); HG18 chr10 (PA, Abyss, SOAP). Rows: Contig statistics: No. of contigs (>200bp), Average length (kb), Maximum length (kb), Contig N50 size (kb), Contig N90 size (kb), Coverage. Evaluation: Large misassemblies, Segment maps. Performance: Total execution time (min) N/A240, Peak memory usage (gb) N/A48. Velvet and Allpaths2 are not efficient enough to handle the HG18 chr10 dataset. The N50 length of an assembly is the largest length such that contigs of that length or longer account for at least 50% of the total length. N90 is defined similarly. Segment map: divide the genome into bins of 1000bp and count the number of bins which are the same as in the reference genome.
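The N50 and N90 definitions can be made concrete with a small Python function. This is a sketch of the usual way such statistics are computed (sort contigs from longest to shortest and accumulate lengths); the function name `n_stat` and the example contig lengths are invented for illustration and are not taken from the tables above.

```python
def n_stat(contig_lengths, fraction=0.5):
    """Return the N-statistic of an assembly (N50 when fraction is 0.5).

    Contigs are visited from longest to shortest; the result is the length
    of the contig at which the running total first reaches the requested
    fraction of the total assembly length.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    raise ValueError("contig_lengths must not be empty")


example_lengths = [100, 80, 60, 40, 20]  # made-up contig lengths, total 300
print(n_stat(example_lengths))           # N50
print(n_stat(example_lengths, 0.9))      # N90
```

For the example lengths, contigs of 80 or longer cover 180 of 300 bases, so N50 is 80, and contigs of 40 or longer cover 280 of 300 bases, so N90 is 40.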
Experimental data We obtained 4 real-life datasets from the Allpaths2 paper.
Experimental data Organisms: S. aureus, E. coli, S. pombe, N. crassa; for each organism the columns are PA, Velvet, Allpaths2, ABySS. Rows: Contig statistics: No. of contigs (>200bp), Average length (kb), Maximum length (kb), Contig N50 size (kb), Contig N90 size (kb), Coverage. Evaluation: Large misassemblies, Segment maps. Performance: Total execution time (min), Peak memory usage (gb). Peak memory usage (gb) for S. pombe and N. crassa: 6.615N/A N/A25.6
Running time Single CPU, multiple core
EXAMPLE APPLICATION
Burkholderia species Burkholderia pseudomallei (Bp) – Causative agent of melioidosis, a serious infectious disease of humans and animals with an overall fatality rate of 50%. Burkholderia thailandensis (Bt) – Non-pathogenic to mammals. Why can Bp infect humans? – Some genomic differences from Bt are likely required for Bp to colonize and infect mammals. These include the gain of a Bp-specific capsular polysaccharide gene cluster. (Figure: wrinkled colonies and round colonies.)
Bt E555 My collaborator Patrick Tan thinks that virulence versus non-virulence is not a black-and-white issue; there should be some intermediate state. He looked at 28 Bt strains and found Bt E555, which shows a mixture of smooth and wrinkled colonies.
Sequencing of Burkholderia thailandensis (Bt E555) We sequenced Bt E555 using Solexa Genome Analyser II. – 12.5M paired-end reads – Each read is of length 100bp – Insert size is We map the sequences on the Bt reference E264.
De novo assembly of Bt E555 using PE-Assembler 521 contigs N50: bp Total length: bp Longest contig: bp Shortest contig: 250 bp In particular, contig 19 (24 kbp) is similar to the Bp-like CPS in Burkholderia pseudomallei. It replaces EPS.
Phenotype of Bp-like CPS Bp colonies are wrinkled. Bt colonies are round and smooth. BtE555 exhibits a mixture of smooth and wrinkled colonies. The BtE555 CPS knockout (KO) develops round colonies with no wrinkling. This suggests that Bp-like CPS expression may contribute to the wrinkled colonies. (Figure panels: wrinkled colonies; mixture of smooth and wrinkled colonies; round colonies.)
SCAFFOLDING
Formal definition of the scaffolding problem Input: A set of contigs and edges – each edge spans Output: An ordering of the contigs s.t. the number of discordant edges is minimized Discordant edge
Scaffolding problem is NP-hard Huson et al (2002) showed that scaffolding is NP-hard. A number of heuristic solutions exist: Celera Assembler [Myers et al, 2000], Euler [Pevzner et al, 2001], Jazz [Chapman et al, 2002], Arachne [Batzoglou et al, 2002], Velvet [Zerbino et al, 2008], Bambus [Pop et al, 2004]. Can we solve the problem optimally? Is the optimal solution better?
A parameter width (w) Since every contig has some minimum length and every edge spans a fixed length, we expect every edge to span at most w contigs for some constant width w. (Figure: an edge spanning at most w contigs.)
Two parts Fixed parameter polynomial time algorithm – We showed that the running time of the scaffolder depends on a parameter “width” Graph Contraction – We proposed a way to reduce the graph
Scaffolding when no discordant edge When there is no discordant edge, a naïve solution is: – Enumerate all possible signed permutations of the contigs in a tree. Prune a subtree if its scaffold is not feasible. This takes exponential time. (Figure: search tree over contigs A, B, C, D: +A; then +A+B, +A-B, +A+C, +A-C; then +A+B+C, +A+B-C, …, +A-C+B, +A-C-B, …)
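The naïve enumeration can be sketched as a recursive generator over signed contigs. This is an illustration of the search tree only, not the Opera algorithm: the feasibility test is passed in as a caller-supplied function because the slides do not spell it out, and the name `scaffolds` and the default always-true test are assumptions.

```python
def scaffolds(contigs, feasible=lambda partial: True):
    """Yield every complete signed ordering whose prefixes all pass feasible.

    feasible is a hypothetical callback that receives a partial scaffold
    (a list such as ['+A', '-C']) and returns False to prune that subtree.
    """
    def grow(partial, remaining):
        if not remaining:
            yield tuple(partial)
            return
        for contig in sorted(remaining):
            for sign in "+-":
                candidate = partial + [sign + contig]
                if feasible(candidate):
                    yield from grow(candidate, remaining - {contig})

    yield from grow([], frozenset(contigs))


all_orderings = list(scaffolds("ABC"))
# Toy pruning rule: only keep scaffolds that start with +A.
starting_with_a = list(scaffolds("ABC", feasible=lambda p: p[0] == "+A"))
print(len(all_orderings), len(starting_with_a))
```

Without pruning, n contigs give n! × 2^n signed orderings (48 for three contigs), which is why the slides look for a way to avoid exploring the whole tree.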
Observation Lemma: Consider two scaffolds S1 and S2. If both scaffolds share a common tail of width w, then either both S1 and S2 can be extended to a feasible solution or neither can. Proof: Based on the Bandwidth Problem [Saxe, 1980] – Orientation of Nodes – Direction of Edges – Discordant Edges … * J. Saxe: Dynamic programming algorithms for recognizing small-bandwidth graphs in polynomial time. SIAM J. on Algebraic and Discrete Methods, 1(4), (1980) Upper Bound (W)
Scaffold Tail is Sufficient Analogous to the Bandwidth Problem [Saxe, 1980] – Orientation of Nodes – Direction of Edges – Discordant Edges … Upper Bound (W)
Equivalence class of scaffolds – S1 and S2 have the same tail -> they are in the same class. – Features of an equivalence class: all members use the same set of contigs; all or none of them can be extended to a solution. Tail +A-B+C +D+E -A+C +D+E+F …
Scaffolding with p discordant edges When there are discordant edges, we just try all possible ways to discard the p discordant edges. Then, we run the scaffolding with no discordant edges. This gives an O(|E| |V|^{w+p})-time algorithm.
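The "try all ways to discard p edges" step can be sketched as a loop over edge subsets. This is only an outline under assumptions: `solve_concordant` stands for whatever routine solves the no-discordant-edge case and is supplied by the caller, and the toy solver below is a placeholder invented for the example, not a real scaffolder.

```python
from itertools import combinations


def scaffold_with_discordant(edges, p, solve_concordant):
    """Try every way of discarding p edges, then solve the concordant case.

    solve_concordant is a hypothetical callback: it takes the kept edges and
    returns a scaffold, or None if no scaffold satisfies all of them.
    Returns (scaffold, discarded_edges) for the first success, else None.
    """
    for dropped in combinations(range(len(edges)), p):
        kept = [edge for i, edge in enumerate(edges) if i not in dropped]
        scaffold = solve_concordant(kept)
        if scaffold is not None:
            return scaffold, [edges[i] for i in dropped]
    return None


# Placeholder solver for the demo: succeeds only if the edge (C, A) is gone.
def toy_solver(kept_edges):
    return None if ("C", "A") in kept_edges else ["+A", "+B", "+C"]


toy_edges = [("A", "B"), ("B", "C"), ("C", "A")]
print(scaffold_with_discordant(toy_edges, 1, toy_solver))
```

The loop runs the concordant solver once per subset of p edges, so the number of solver calls grows as the binomial coefficient of |E| and p.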
Graph Contraction 20k
Graph Contraction
Gap Estimation Utility – Genome finishing (genome size estimation) – Scaffold correctness. Calculate gap sizes – Maximum likelihood – Quadratic function – Solved through quadratic programming [Goldfarb et al, 1983] in polynomial time. (Figure labels: g1, g2, g3; μ, σ.) * Goldfarb, D., Idnani, A.: A numerically stable dual method for solving strictly convex quadratic programs. Mathematical Programming, 27 (1983)
Runtime Comparison
Dataset: E. coli ◆ | B. pseudomallei ★ | S. cerevisiae ◆ | D. melanogaster ◆
Bambus: 50s | 16m | 2m | 3m
SOPRA: 49m | - | 2h | 5h
Opera: 4s | 7m | 11s | 30s
◆ Simulated datasets using MetaSim. Coverage of 2x80bp PETs with insert size 300bp: 40X. Coverage of 2x50bp PETs with insert size 10kbp: 2X. Contigs assembled using Velvet.
★ In-house data (B. pseudomallei). Coverage of 100bp 454 reads: 20X. Coverage of 2x20kbp PETs with insert size library: 2.8X. Contigs assembled using Newbler.
Scaffold Contiguity
Scaffold Correctness
Scaffold Correctness Columns: E. coli, S. cerevisiae, D. melanogaster. Opera134 Bambus
Reference Pramila Nuwantha Ariyaratne, Wing-Kin Sung: PE-Assembler: de novo assembler using short paired-end reads. Bioinformatics 27(2): (2011) Song Gao, Niranjan Nagarajan, Wing-Kin Sung: Opera: Reconstructing Optimal Genomic Scaffolds with High-Throughput Paired-End Sequences. RECOMB 2011:
Acknowledgement Bioinformatics – Zhizhuo – Xueliang – Chandana – Rikky – Gao Song – Pramila – Charlie Lee – Guoliang Li – Han Xu – Fabi Infectious Disease – Patrick Tan Sequencing group – Ruan Yijun – Wei Chialin – Yao Fei – Liu Jun – Herve Thoreau – Sequencing team