Published by Maud Kristina Jenkins. Modified over 9 years ago.
1
Assembly by short paired-end reads
Wing-Kin Sung
National University of Singapore
2
State of genome sequencing
Thousands of bacterial genomes plus a few dozen higher organisms have been sequenced.
Many genomes are still waiting to be sequenced.
Personalized sequencing is also a big market.
In summary, we need cheaper and faster sequencing.
3
Bio-technology: DNA-PETs
What data is used for genome assembly?
A DNA-PET is a paired-end tag extracted from the genome.
– Each tag has length readlength (e.g. readlength = 35).
– The span of the DNA-PET is fixed (e.g. 1kb, 5kb, 10kb, or 20kb).
[Figure: example sequences ACTCAGCACCTTACGGCGTGCATCA and TACGTTCTGAACGGCAGTACAAACT, annotated with "Readlength" and "Span of the paired-end read"]
4
Bio-technology: DNA-PETs
[Figure: library preparation pipeline: genome, sonication, size selection, paired-end sequencing]
5
Sequence Assembly Problem Given the paired-end reads, can we assemble them to reconstruct the genome?
6
Agenda
A short discussion on data quality
A brief review of existing methods
PE-Assembler
An example demonstrating the use of assembly
Scaffolding
7
QUALITY OF PAIRED-END SEQUENCING
8
Paired-end sequencing
[Figure: size selection (1kb, 10kb, 20kb); circularization, ligation, and cutting; sequencing]
9
Size selection is not exact
[Figure: sample fragment length distributions for a 300bp paired-end library and a 10,000bp mate-pair library]
10
Errors in DNA Sequencing
Ligation errors
– Occur in mate-pair libraries during library construction.
– Two unrelated reads are paired together.
[Figure: the 5’ and 3’ ends of two different fragments (from Chr1 and Chr2) put together]
11
Errors in DNA Sequencing
Sequencing errors
– Caused by the sequencing machine ‘misreading’ bases.
– In most sequencing technologies, sequencing errors are more likely to occur towards the end of the read.
Actual DNA sequence: ACGTGAGGATGACACGATAGCCA
Sequence as interpreted by the machine: ACGTGAGCATGACACGATAGCCA
The machine incorrectly reads the eighth base (G) as a C.
12
EXISTING METHODS
13
SSAKE, VCAKE and SHARCGS
Base-by-base 3’ extension.
Currently, these tools can only assemble short genomes.
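As a rough illustration of base-by-base 3’ extension, here is a minimal Python sketch. It is not the actual SSAKE, VCAKE or SHARCGS algorithm: the function name `extend_3prime`, the `min_overlap` parameter and the toy genome are assumptions made for this example, and it ignores sequencing errors and reverse complements.

```python
from collections import Counter


def extend_3prime(seed, reads, min_overlap, max_steps=10_000):
    """Greedily extend `seed` one base at a time at its 3' end (toy sketch)."""
    contig = seed
    for _ in range(max_steps):
        votes = Counter()
        for read in reads:
            # Use the longest prefix of the read that matches the contig's 3' end.
            for olap in range(len(read) - 1, min_overlap - 1, -1):
                if contig.endswith(read[:olap]):
                    votes[read[olap]] += 1  # the read proposes its next base
                    break
        if len(votes) != 1:
            break  # no overlapping read, or an ambiguous (branching) extension
        contig += next(iter(votes))
    return contig


genome = "ACGTTGCAAGCTTAGGCTA"  # hypothetical toy genome; all of its 5-mers are unique
reads = [genome[i:i + 8] for i in range(len(genome) - 7)]
assembled = extend_3prime(genome[:8], reads, min_overlap=5)
```

The extension stops as soon as two different bases are proposed, which is exactly the ambiguity that repeats and sequencing errors create in larger genomes.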
14
De Bruijn graph approach
Velvet, Euler-USR, ABySS, IDBA
E.g. input = {AAGACTC, ACTCCGACTG, ACTGGGAC, GGACTTT}
List of 3-mers = {AAG, AGA, GAC, ACT, CTC, TCC, CCG, CGA, CTG, TGG, GGG, GGA, CTT, TTT}
[Figure: the reads AAGACTC, ACTCCGACTG, ACTGGGAC and GGACTTT aligned to the assembled sequence AAGACTCCGACTGGGACTTT]
Mark J. Chaisson, Dumitru Brinza and Pavel A. Pevzner. De novo fragment assembly with short mate-paired reads: Does read length matter? Genome Res. 19: 336-346. 2009
Daniel Zerbino and Ewan Birney. Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs. Genome Res. 18: 821-829. 2008
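The 3-mer list from the example can be reproduced with a short Python sketch. This is only an illustration, not how Velvet or the other tools are implemented: the function name `build_kmer_graph` is an assumption, each distinct k-mer is treated as a node, and an edge links two k-mers that appear consecutively in some read.

```python
from collections import defaultdict


def build_kmer_graph(reads, k):
    """Return (k-mers in first-seen order, adjacency between consecutive k-mers)."""
    kmers = {}                 # dict used as an insertion-ordered set
    edges = defaultdict(set)
    for read in reads:
        previous = None
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            kmers.setdefault(kmer, None)
            if previous is not None:
                edges[previous].add(kmer)  # consecutive k-mers overlap by k-1 bases
            previous = kmer
    return list(kmers), dict(edges)


reads = ["AAGACTC", "ACTCCGACTG", "ACTGGGAC", "GGACTTT"]
kmers, edges = build_kmer_graph(reads, k=3)
```

In this toy input, `edges["ACT"]` has three successors (CTC, CTG, CTT), which shows how a repeated k-mer creates a branch in the graph.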
15
ALLPATHS
Builds a unipath graph by repeatedly overlapping the unipaths.
Highly accurate. However, it is slow and memory intensive.
[Figure labels: "Combine these.." and "To obtain this."]
Jonathan Butler, Iain MacCallum, Michael Kleber, Ilya A. Shlyakhter, Matthew K. Belmonte, Eric S. Lander, Chad Nusbaum, and David B. Jaffe. ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res. 18: 810-820. 2008
16
Summary of current status
The base-by-base extension approach works only for short genomes.
De Bruijn graph approaches are fast.
– However, the size of the De Bruijn graph increases exponentially with the error rate.
– When the error rate increases, this approach is not accurate.
ALLPATHS is accurate.
– However, it is too slow and not memory efficient enough to handle large genomes.
Furthermore, little has been done on using paired-end information explicitly.
Is it possible to have a reasonably fast and memory-efficient method which is accurate?
17
PE-ASSEMBLER
18
Idea (I)
Suppose we have two possible ways (green or red reads) to extend the sequence.
How can we decide whether to extend with the red or the green read?
[Figure: candidate next base c or t]
19
Idea (II)
If we use the paired-end information, we can make the decision early!
[Figure: candidate next base c or t]
20
PE-Assembler Instead of using de Bruijn graph approaches, we use the simple base-by-base extension approach similar to SSAKE, VCAKE and SHARCGS. Moreover, we try to utilize paired-end reads to resolve ambiguity.
21
PE-Assembler
Input: a set of paired-end reads
1. Read screening
2. Seed building
3. Contig extension
4. Scaffolding
5. Gap filling
22
Read screening (I)
Read screening identifies ‘solid’ reads, i.e. error-free and non-repetitive reads.
From the set of reads, we count the frequency of each k-mer.
If a k-mer occurs only once, it is likely to contain a sequencing error.
If a k-mer occurs too many times, it is likely to come from a repeat.
Example: the read acgtcgagtcaggtacgt contains the 10-mers acgtcgagtc, cgtcgagtca, gtcgagtcag, tcgagtcagg, cgagtcaggt, gagtcaggta, agtcaggtac, gtcaggtacg, tcaggtacgt
23
Read screening (II)
A read is said to be ‘solid’ if the frequencies of all its k-mers are in the blue region, i.e. neither too low nor too high.
Those solid reads are the starting points for extension.
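The screening rule can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the PE-Assembler implementation: the names `kmers_of` and `solid_reads` are hypothetical, and the thresholds `min_count` and `max_count` stand in for the boundaries of the acceptable frequency region, which the slides do not give numerically.

```python
from collections import Counter


def kmers_of(read, k):
    """All k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def solid_reads(reads, k, min_count, max_count):
    """Keep reads whose k-mers all have a frequency within [min_count, max_count]."""
    counts = Counter(kmer for read in reads for kmer in kmers_of(read, k))
    return [
        read for read in reads
        if len(read) >= k
        and all(min_count <= counts[kmer] <= max_count for kmer in kmers_of(read, k))
    ]


# Toy data: three error-free copies of a read plus one copy with a base error.
reads = ["ACGTAC"] * 3 + ["ACGTTC"]
solid = solid_reads(reads, k=4, min_count=2, max_count=5)
```

Here the erroneous read is rejected because its k-mers CGTT and GTTC occur only once; lowering `max_count` to 3 would also reject the remaining reads, because ACGT occurs four times and would then look repetitive.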
24
Seed building (I) A seed is a contiguous region in the target genome which is of length at least MaxSpan. Starting from some ‘solid’ reads, we extend the read from both 5’ and 3’ ends.
25
Seed building (II)
In case there are multiple feasible extensions, we can resolve them by checking the mates.
In the following example, g has support from the mates while a does not. Hence, g is correct.
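One way to picture the mate check is the following Python sketch. It is an illustration under assumptions rather than the published procedure: `pick_extension`, the `candidates` and `mate_position` dictionaries, and the `min_span`/`max_span` window are all hypothetical names, and a candidate base counts as supported when at least one of its reads has a mate already placed on the contig at a distance consistent with the library span.

```python
def pick_extension(candidates, mate_position, contig_length, min_span, max_span):
    """Return the single mate-supported base, or None if zero or several are supported.

    candidates:    {base: [ids of reads proposing that base]}
    mate_position: {read id: start position of its mate on the contig}
    """
    supported = []
    for base, read_ids in candidates.items():
        for read_id in read_ids:
            pos = mate_position.get(read_id)
            # The mate must lie upstream at a distance compatible with the library span.
            if pos is not None and min_span <= contig_length - pos <= max_span:
                supported.append(base)
                break
    return supported[0] if len(supported) == 1 else None


candidates = {"g": ["r1", "r2"], "a": ["r3"]}
choice = pick_extension(candidates, {"r1": 100}, contig_length=1100,
                        min_span=800, max_span=1200)
```

In this toy case only the reads proposing g have a mate at a plausible distance, so g is chosen; if both candidates were supported, the function would return `None` and the ambiguity would remain.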
26
Seed building (III)
The previous method cannot resolve ambiguities that arise from sequencing errors.
In such cases, we extend every candidate base up to a distance of ReadLength.
Any extension path that arises from a sequencing error is likely to terminate prematurely.
If only one candidate path can reach the full distance, then that path is assumed to be the correct extension.
[Figure: candidate extension paths from ACGTCA; paths marked X terminate before reaching ReadLength]
27
Contig extension
The contig extension step iteratively extends each verified seed to form a longer contig.
Since each verified seed is longer than MaxSpan, we can extend the seed using paired-end reads.
28
Scaffolding problem Input: – A set of contigs of some genome X – A set of DNA-PETs of some genome X Scaffolding finds the correct ordering and the orientation of the contigs.
29
Scaffolding (II)
Demarcate all repeat regions within the assembled contigs
Build the contig graph
Identify a linear order of the contigs
30
Demarcating all repeat regions within assembled contigs
Map all paired-end reads onto the contigs.
The mode of the read density is assumed to be the expected read coverage across the genome.
Any region with read density higher than 1.5 times the mode is considered a repeat region.
[Figure: histogram of read density (x-axis) against frequency (y-axis), with the mode and 1.5 × mode marked; densities above 1.5 × mode form the repeat region]
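The thresholding rule can be sketched as follows. This is a minimal Python sketch under assumptions: the read density is given as one integer per fixed-size window (the window size is not stated in the slides), and `repeat_regions` is a hypothetical helper name.

```python
from collections import Counter


def repeat_regions(density, factor=1.5):
    """Return (mode, list of half-open window ranges whose density exceeds factor * mode)."""
    mode = Counter(density).most_common(1)[0][0]
    threshold = factor * mode
    regions, start = [], None
    for i, d in enumerate(density):
        if d > threshold and start is None:
            start = i                      # a repeat region opens here
        elif d <= threshold and start is not None:
            regions.append((start, i))     # the region closes before window i
            start = None
    if start is not None:
        regions.append((start, len(density)))
    return mode, regions


density = [10, 10, 11, 10, 30, 32, 10, 9, 10, 16, 10]
mode, regions = repeat_regions(density)
```

With this toy input the mode is 10, so windows with density above 15 are flagged, giving the window ranges 4 to 6 and 9 to 10 (end exclusive).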
31
Build contig graph
32
Identify a linear order of the contigs Case 1: 1 discordant edge Case 2: 2 discordant edges
33
Gap filling (I)
From scaffolding, we identify adjacent contigs.
The gaps between them are usually caused by repeat regions.
Since we have paired-end reads from both the 5’ and 3’ sides of a gap, we may be able to fill it in.
34
Gap filling (II)
35
Parallelization
1. Read screening
– Sequential: this step is largely disk bound.
2. Seed building
– Runs in multiple threads. If two threads use the same set of paired-end reads, rewind one of the threads.
3. Contig extension
– Runs in multiple threads. If two threads use the same set of paired-end reads, rewind one of the threads.
4. Scaffolding
– Graph building: runs in multiple threads.
– Actual scaffolding: sequential.
5. Gap filling
– Runs in multiple threads.
36
Simulation data
We perform simulations on 3 organisms.
Organism | E. coli | S. pombe | HG18 chr10
No. of contigs/chromosomes | 1 | 3 | 1
Genome length | 4,639,658bp | 12,571,820bp | 135,374,737bp
Library | 200bp, 1kbp, 10kbp | 200bp, 1kbp, 10kbp | 200bp, 1kbp, 10kbp
Read length (bp) | 35 | 35 | 75
Average insert size (bp) | 235, 1035, 10035 | 235, 1035, 10035 | 275, 1075, 10075
Insert size range (average ± bp) | ± 40, ± 200, ± 2000 | ± 40, ± 200, ± 2000 | ± 40, ± 200, ± 2000
No. of paired reads | 3.31m | 8.98m | 45.12m, 9.02m
Coverage | 50x | 50x | 50x, 10x
Seq. error rate | 2.0% | 2.0% | 2.0%
Ligation error rate | 0.0%, 2.0% | 0.0%, 2.0% | 0.0%, 2.0%
37
Simulation data
Columns: E. coli (PA, Velvet, Allpaths2, ABySS, SOAP) | S. pombe (PA, Velvet, Allpaths2, ABySS, SOAP) | HG18 chr10 (PA, ABySS, SOAP)
Contig statistics
No. of contigs (>200bp): 656442831993118116465034842624901518238
Average length (kb): 777.482606107.622.322.8394.767.975.323.13530.22.96.6
Maximum length (kb): 2492.6708.6593.71632323519.6856.1851297.3468.7403.565.2155.8
Contig N50 size (kb): 2492.6398.3373.363.849.91487.7273226.880.199.862.45.313
Contig N90 size (kb): 2146109.9115.433.912.4363.654.459.536.71911.11.73.6
Coverage: 1, 0.9959, 0.9985, 0.99, 0.9887 | 0.9778, 0.9935, 0.986, 0.9838, 0.9891 | 0.9089, 0.9204, 0.8714
Evaluation
Large misassemblies: 0110110170145171605
Segment maps: 0.9968, 0.9474, 0.9918, 0.9331, 0.9366 | 0.9642, 0.9444, 0.9683, 0.9272, 0.9431 | 0.8617, 0.6361, 0.322
Performance
Total execution time (min): 2110227435101407349811748N/A240
Peak memory usage (gb): 2.32.929.72.95.94.57.76668.115.1N/A48
Velvet and Allpaths2 are not efficient enough to handle the HG18 chr10 dataset.
The N50 length of an assembly is the largest length L such that the contigs of length at least L account for at least 50% of the total length. N90 is defined similarly.
Segment map: divide the genome into bins of 1000bp and count the bins that are the same as in the reference genome.
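The N50 and N90 definitions can be checked with a small Python sketch. The helper name `nx` and the toy contig lengths are assumptions for illustration only; they are not values from the tables.

```python
def nx(lengths, x=50):
    """Largest length L such that contigs of length >= L cover at least x% of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= x * total:   # integer arithmetic avoids rounding issues
            return length
    raise ValueError("no contigs given")


contigs = [80, 70, 50, 40, 30, 20]       # hypothetical contig lengths (kb), total 290
n50 = nx(contigs, 50)
n90 = nx(contigs, 90)
```

For these lengths the two longest contigs already cover 150 of 290 kb, so N50 is 70, while reaching 90% requires the five longest contigs, so N90 is 30.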
38
Experimental data
We obtained 4 real-life datasets from the Allpaths2 paper.
39
Experimental data
Columns: S. aureus (PA, Velvet, Allpaths2, ABySS) | E. coli (PA, Velvet, Allpaths2, ABySS)
Contig statistics
No. of contigs (>200bp): 2460141872112125277
Average length (kb): 119.84820518.3176.837.5184.121.4
Maximum length (kb): 949.9475.61122.8175.1895.9356.61015.3160.4
Contig N50 size (kb): 685.8314.9477.263.8428.8105.6337.155.2
Contig N90 size (kb): 107.537.798431.9143.125.481.731.8
Coverage: 0.9945, 0.9899, 0.9924, 0.9828 | 0.9956, 0.9919, 0.9963, 0.9896
Evaluation
Large misassemblies: 0, 5, 0, 1 | 0, 4, 0, 1
Segment maps: 0.9848, 0.9666, 0.9855, 0.9456 | 0.9873, 0.956, 0.9918, 0.9455
Performance
Total execution time (min): 1789513342522229
Peak memory usage (gb): 1.92.8202.63.36.937.65.3

Columns: S. pombe (PA, Velvet, Allpaths2, ABySS) | N. crassa (PA, Velvet, Allpaths2, ABySS)
Contig statistics
No. of contigs (>200bp): 16936235310282708507916879916
Average length (kb): 72.133.733.81312.86.818.33.8
Maximum length (kb): 571.1443257.2136.8156.271161.256
Contig N50 size (kb): 147.7110.6503620.711.617.68.1
Contig N90 size (kb): 4033.212.212.3---1
Coverage: 0.9697, 0.9782, 0.952, 0.9793 | 0.874, 0.877, 0.7838, 0.887
Evaluation
Large misassemblies: 3262271627318395
Segment maps: 0.9551, 0.9426, 0.926, 0.9108 | 0.8206, 0.7744, 0.7466, 0.7128
Performance
Total execution time (min): 36412548307214162665196331
Peak memory usage (gb): 6.615N/A6.62145N/A25.6
40
Running time
Single CPU, multiple cores
41
EXAMPLE APPLICATION
42
Burkholderia species
Burkholderia pseudomallei (Bp)
– Causative agent of melioidosis, a serious infectious disease of humans and animals with an overall fatality rate of 50%
Burkholderia thailandensis (Bt)
– Non-pathogenic to mammals
Why can Bp infect humans?
– Some Bp-specific genomic features are likely required for Bp to colonize and infect mammals. These include the gain of a Bp-specific capsular polysaccharide gene cluster.
[Figure: wrinkled colonies; round colonies]
43
Bt E555
My collaborator Patrick Tan thinks that virulence versus non-virulence is not a black-and-white issue. There should be some intermediate state.
He examined 28 Bt strains and found Bt E555, which exhibits a mixture of smooth and wrinkled colonies.
[Figure: mixture of smooth and wrinkled colonies]
44
Sequencing of Burkholderia thailandensis (Bt E555)
We sequenced Bt E555 using the Solexa Genome Analyser II.
– 12.5M paired-end reads
– Each read is 100bp long
– Insert size is 130-290bp
We mapped the reads onto the Bt reference E264.
45
De novo assembly of Bt E555 using PE-Assembler
521 contigs
N50: 20293 bp
Total length: 6145909 bp
Longest contig: 72827 bp
Shortest contig: 250 bp
In particular, contig 19 (24 kbp) is similar to the Bp-like CPS in Burkholderia pseudomallei. It replaces EPS.
46
Phenotype of Bp-like CPS
Bp colonies are wrinkled.
Bt colonies are round and smooth.
Bt E555 exhibits a mixture of smooth and wrinkled colonies.
The Bt E555 CPS knockout (KO) develops round colonies with no wrinkling.
This suggests that Bp-like CPS expression may contribute to the wrinkled colonies.
[Figure: wrinkled colonies; mixture of smooth and wrinkled colonies; round colonies]
47
SCAFFOLDING
48
Formal definition of the scaffolding problem
Input: A set of contigs and edges
– each edge spans
Output: An ordering of the contigs such that the number of discordant edges is minimized
[Figure label: discordant edge]
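To make the objective concrete, here is a minimal Python sketch that counts discordant edges for a given scaffold. The edge model is a simplifying assumption for this example, not the definition used in the talk: an edge `(a, sa, b, sb, max_gap)` is read as "contig a with sign sa precedes contig b with sign sb, with at most max_gap bases of other contigs in between", and the same edge is also satisfied when read from the opposite strand. The function name `count_discordant` and the toy contigs are hypothetical.

```python
def count_discordant(scaffold, edges, lengths):
    """scaffold: list of (contig, sign) with sign +1 or -1; returns number of violated edges."""
    order = [name for name, _ in scaffold]
    pos = {name: i for i, name in enumerate(order)}
    sign = dict(scaffold)
    discordant = 0
    for a, sa, b, sb, max_gap in edges:
        if sign[a] == sa and sign[b] == sb and pos[a] < pos[b]:
            first, second = a, b
        elif sign[a] == -sa and sign[b] == -sb and pos[b] < pos[a]:
            first, second = b, a          # same edge seen from the opposite strand
        else:
            discordant += 1               # wrong order or orientation
            continue
        between = sum(lengths[c] for c in order[pos[first] + 1:pos[second]])
        if between > max_gap:
            discordant += 1               # contigs in between do not fit in the span
    return discordant


lengths = {"A": 1000, "B": 5000, "C": 1000}
edges = [("A", 1, "B", 1, 2000), ("B", 1, "C", -1, 2000), ("A", 1, "C", -1, 2000)]
forward = count_discordant([("A", 1), ("B", 1), ("C", -1)], edges, lengths)
reverse = count_discordant([("C", 1), ("B", -1), ("A", -1)], edges, lengths)
```

Under this model a scaffold and its reverse complement receive the same count, and the third edge is discordant in both because contig B (5000 bases) does not fit between A and C.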
49
Scaffolding problem is NP-hard
Huson et al. (2002) showed that scaffolding is NP-hard.
A number of heuristic solutions exist:
– Celera Assembler [Myers et al., 2000]
– Euler [Pevzner et al., 2001]
– Jazz [Chapman et al., 2002]
– Arachne [Batzoglou et al., 2002]
– Velvet [Zerbino et al., 2008]
– Bambus [Pop et al., 2004]
Can we solve the problem optimally? Is the optimal solution better?
50
A parameter width (w)
Since every contig has some minimum length and every edge spans a fixed length,
– we expect every edge to span at most w contigs for some constant width w.
[Figure label: at most w contigs]
51
Two parts
Fixed-parameter polynomial-time algorithm
– We showed that the running time of the scaffolder depends on a parameter “width”.
Graph contraction
– We proposed a way to reduce the graph.
52
Scaffolding when there is no discordant edge
When there is no discordant edge, a naïve solution is:
– Enumerate all possible signed permutations of the contigs in a tree. Prune the subtree if the scaffold is not feasible.
– This takes exponential time.
[Figure: search tree over the contigs A, B, C, D with nodes +A; +A+B, +A-B, +A+C, +A-C; +A+B+C, +A+B-C, +A-C+B, +A-C-B; …]
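The naïve enumeration can be sketched in Python as a depth-first search over signed permutations with pruning. This is an illustrative sketch: the generator name `enumerate_scaffolds` and the `feasible` callback (which stands in for the feasibility check against the edges) are assumptions, not part of the original method's code.

```python
def enumerate_scaffolds(contigs, feasible=lambda partial: True):
    """Yield every signed permutation of `contigs` whose prefixes all pass `feasible`."""
    def extend(partial, remaining):
        if not remaining:
            yield tuple(partial)
            return
        for name in sorted(remaining):
            for sign in ("+", "-"):
                candidate = partial + [sign + name]
                if feasible(candidate):        # otherwise the whole subtree is pruned
                    yield from extend(candidate, remaining - {name})
    yield from extend([], frozenset(contigs))


all_scaffolds = list(enumerate_scaffolds("ABC"))
starting_with_plus_a = list(enumerate_scaffolds("ABC", lambda p: p[0] == "+A"))
```

Without pruning, n contigs give n! × 2^n scaffolds (48 for three contigs), which is why the enumeration alone takes exponential time; a feasibility check that rejects prefixes early cuts whole subtrees, as in the second call.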
53
Observation
Lemma: Consider two scaffolds S1 and S2. If both scaffolds share a common tail of width w,
– then either both S1 and S2 have a feasible solution or neither does.
Proof: Based on the Bandwidth Problem [Saxe, 1980]
– Orientation of nodes
– Direction of edges
– Discordant edges
…
[Figure label: upper bound (W)]
* J. Saxe: Dynamic programming algorithms for recognizing small-bandwidth graphs in polynomial time. SIAM J. on Algebraic and Discrete Methods, 1(4), 363-369 (1980)
54
Scaffold Tail is Sufficient
Analogous to the Bandwidth Problem [Saxe, 1980]
– Orientation of nodes
– Direction of edges
– Discordant edges
…
[Figure label: upper bound (W)]
* J. Saxe: Dynamic programming algorithms for recognizing small-bandwidth graphs in polynomial time. SIAM J. on Algebraic and Discrete Methods, 1(4), 363-369 (1980)
55
Equivalence class of scaffolds
– S1 and S2 have the same tail -> they are in the same class
– Features of an equivalence class:
– - Use of the same set of contigs;
– - All or none of them can be extended to a solution
[Figure labels: Tail; +A-B+C +D+E; -A+C +D+E+F; …]
56
Scaffolding with p discordant edges
When there are discordant edges, we just try all possible ways to discard the p discordant edges.
Then, we run the scaffolding with no discordant edges.
This gives an O(|E| |V|^(w+p))-time algorithm.
57
Graph Contraction 20k
58
Graph Contraction
60
Gap Estimation
Utility
– Genome finishing (genome size estimation)
– Scaffold correctness
Calculate gap sizes
– Maximum likelihood
– Quadratic function
– Solved through quadratic programming [Goldfarb et al., 1983]
Polynomial time
[Figure labels: g1, g2, g3; μ, σ]
* Goldfarb, D., Idnani, A.: A numerically stable dual method for solving strictly convex quadratic programs. Mathematical Programming, 27 (1983)
61
Runtime Comparison
Tool | ◆ E. coli | ★ B. pseudomallei | ◆ S. cerevisiae | ◆ D. melanogaster
Bambus | 50s | 16m | 2m | 3m
SOPRA | 49m | - | 2h | 5h
Opera | 4s | 7m | 11s | 30s
◆ Simulated datasets using MetaSim
– Coverage of 2x80bp PETs with insert size 300bp: 40X
– Coverage of 2x50bp PETs with insert size 10kbp: 2X
– Contigs assembled using Velvet
★ In-house data (B. pseudomallei)
– Coverage of 100bp 454 reads: 20X
– Coverage of 2x20kbp PETs with insert size library: 2.8X
– Contigs assembled using Newbler
62
Scaffold Contiguity
63
Scaffold Correctness
64
Scaffold Correctness
Columns: E. coli | S. cerevisiae | D. melanogaster
Opera: 1 | 3 | 4
Bambus: 1955423
65
References
Pramila Nuwantha Ariyaratne, Wing-Kin Sung: PE-Assembler: de novo assembler using short paired-end reads. Bioinformatics 27(2): 167-174 (2011)
Song Gao, Niranjan Nagarajan, Wing-Kin Sung: Opera: Reconstructing Optimal Genomic Scaffolds with High-Throughput Paired-End Sequences. RECOMB 2011: 437-451
66
Acknowledgement
Bioinformatics: Zhizhuo, Xueliang, Chandana, Rikky, Gao Song, Pramila, Charlie Lee, Guoliang Li, Han Xu, Fabi
Infectious Disease: Patrick Tan
Sequencing group: Ruan Yijun, Wei Chialin, Yao Fei, Liu Jun, Herve Thoreau, Sequencing team