Assembly by short paired-end reads Wing-Kin Sung National University of Singapore.

Presentation transcript:

State of genome sequencing Thousands of bacterial genomes plus a few dozen higher organisms have been sequenced. There are still a lot of genomes waiting for us to sequence. Personalized sequencing is also a big market. In summary, we need cheaper and faster sequencing.

Bio-technology: DNA-PETs What data is used for genome assembly? DNA-PET is a paired-end tag extracted from the genome – Each tag is of length readlength. (e.g. readlength = 35) – The span of the DNA-PET is fixed (e.g. 1kb, 5kb, 10kb, or 20kb) ACTCAGCACCTTACGGCGTGCATCA TACGTTCTGAACGGCAGTACAAACT Readlength Span of the paired-end read

Bio-technology: DNA-PETs Genome → Sonication → Size selection → Paired-end sequencing

Sequence Assembly Problem Given the paired-end reads, can we assemble them to reconstruct the genome?

Agenda – A short discussion on the data quality – A brief review of existing methods – PE-Assembler – An example demonstrating the use of assembly – Scaffolding

QUALITY OF PAIRED-END SEQUENCING

Paired-end sequencing 1kb 10kb 20kb Size selection Circularize, ligation, and cut Sequencing

Size selection is not exact Sample fragment length distributions: 300bp paired-end library; 10,000bp mate-pair library

Errors in DNA Sequencing Ligation errors – Occur in mate-pair libraries during library construction. – Two unrelated reads are paired together. Chr1 Chr2 5’ and 3’ ends of two different fragments put together

Errors in DNA Sequencing Sequencing errors – Caused by the sequencing machine ‘misreading’ bases. – In most sequencing technologies, sequencing errors are more likely to occur towards the end of the read. Actual DNA sequence: ACGTGAGGATGACACGATAGCCA Sequence as interpreted by the machine: ACGTGAGCATGACACGATAGCCA The machine incorrectly reads the 8th base (G) as a C.

EXISTING METHODS

SSAKE, VCAKE and SHARCGS Base-by-base 3’ extension. Currently, these tools can assemble only short genomes.

De Bruijn graph approach Velvet, Euler-USR, ABySS, IDBA E.g. input = {AAGACTC, ACTCCGACTG, ACTGGGAC, GGACTTT} List of distinct 3-mers = {AAG, AGA, GAC, ACT, CTC, TCC, CCG, CGA, CTG, TGG, GGG, GGA, CTT, TTT} Assembled sequence: AAGACTCCGACTGGGACTTT, covered by the reads AAGACTC, ACTCCGACTG, ACTGGGAC, GGACTTT. Mark J. Chaisson, Dumitru Brinza and Pavel A. Pevzner. De novo fragment assembly with short mate-paired reads: Does read length matter? Genome Res. 19: Daniel Zerbino and Ewan Birney. Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs. Genome Res. 18:
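The 3-mer example on this slide can be reproduced with a short sketch. The function names `kmers` and `de_bruijn_edges` are hypothetical, and the graph here is the plain textbook construction (nodes are (k-1)-mers, one edge per k-mer), not the data structure of any particular assembler named on the slide.

```python
from collections import defaultdict

def kmers(seq, k):
    """Return every length-k substring of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            # A k-mer contributes one edge: its prefix -> its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# The four reads from the slide.
reads = ["AAGACTC", "ACTCCGACTG", "ACTGGGAC", "GGACTTT"]
distinct_3mers = {kmer for read in reads for kmer in kmers(read, 3)}
graph = de_bruijn_edges(reads, 3)
```

The node `CT` has three outgoing edges in this graph, which is the kind of branching that repeats (here ACT) introduce and that an assembler has to resolve.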

ALLPATHS Builds a unipath graph by repeatedly overlapping the unipaths. Highly accurate. However, it is slow and memory-intensive. Jonathan Butler, Iain MacCallum, Michael Kleber, Ilya A. Shlyakhter, Matthew K. Belmonte, Eric S. Lander, Chad Nusbaum, and David B. Jaffe. ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res. 18: Combine these.. To obtain this.

Summary of current status The base-by-base extension approach works only for short genomes. De Bruijn graph approaches are fast. – However, the size of the De Bruijn graph increases exponentially with the error rate. – When the error rate increases, this approach is not accurate. ALLPATHS is accurate. – However, it is too slow and not memory-efficient enough to handle big genomes. Furthermore, little has been done on using paired-end information explicitly. Is it possible to have a reasonably fast and memory-efficient method which is accurate?

PE-ASSEMBLER

Idea (I) Suppose we have two possible ways (green or red reads) to extend the sequence. How can we decide whether to extend with the red or the green read? Candidate next base: c or t

Idea (II) If we use the paired-end information, we can make the decision early! Candidate next base: c or t

PE-Assembler Instead of using de Bruijn graph approaches, we use the simple base-by-base extension approach similar to SSAKE, VCAKE and SHARCGS. Moreover, we try to utilize paired-end reads to resolve ambiguity.

PE-Assembler Input: a set of paired-end reads 1.Read screening 2.Seed building 3.Contig extension 4.Scaffolding 5.Gap filling

Read screening (I) Read screening identifies ‘solid’ reads – i.e. error-free and non-repetitive reads. From the set of reads, we count the frequency of each k-mer. If a k-mer occurs only once, it is likely to contain a sequencing error. If a k-mer occurs too many times, it is likely to come from a repeat. acgtcgagtcaggtacgt acgtcgagtc cgtcgagtca gtcgagtcag tcgagtcagg cgagtcaggt gagtcaggta agtcaggtac gtcaggtacg tcaggtacgt

Read screening (II) A read is said to be ‘solid’ if the frequencies of all its k-mers are in the blue region of the k-mer frequency histogram (neither too low nor too high). These solid reads are the starting points for extension. acgtcgagtcaggtacgt acgtcgagtc cgtcgagtca gtcgagtcag tcgagtcagg cgagtcaggt gagtcaggta agtcaggtac gtcaggtacg tcaggtacgt
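The screening rule on these two slides can be sketched as a k-mer count followed by a per-read filter. The names `count_kmers` and `solid_reads` and the thresholds `min_count` and `max_count` are assumptions made for illustration; the slides only say that k-mers seen once are suspicious and that k-mers seen "too many times" indicate repeats, without giving cut-offs.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def solid_reads(reads, k, min_count, max_count):
    """Keep reads whose k-mers all have a count in [min_count, max_count]."""
    counts = count_kmers(reads, k)
    solid = []
    for read in reads:
        kmer_counts = [counts[read[i:i + k]] for i in range(len(read) - k + 1)]
        # Reads shorter than k have no k-mers and are never called solid.
        if kmer_counts and all(min_count <= c <= max_count for c in kmer_counts):
            solid.append(read)
    return solid

# Toy input: three clean copies, one read with an error, two repetitive reads.
toy_reads = ["ACGTAC"] * 3 + ["ACGTTC"] + ["TTTTTT"] * 2
toy_solid = solid_reads(toy_reads, k=4, min_count=2, max_count=5)
```

In the toy input, the read `ACGTTC` is rejected because two of its 4-mers occur only once, and `TTTTTT` is rejected because its 4-mer exceeds the upper threshold.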

Seed building (I) A seed is a contiguous region in the target genome which is of length at least MaxSpan. Starting from some ‘solid’ reads, we extend the read from both 5’ and 3’ ends.

Seed building (II) In case there are multiple feasible extensions, we can resolve it by checking the mates. In the following example, g has support while a does not have support. Hence, g is correct.

Seed building (III) The previous method cannot resolve ambiguities that arise from sequencing errors. In such cases, we extend every candidate base up to a distance of ReadLength. Any extension path arising from a sequencing error is likely to terminate prematurely. If only one candidate path can reach the full distance, then that path is assumed to be the correct extension. ACGTCA AC CCGT TC X GC X TCGAT GC X ReadLength
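The look-ahead rule described on this slide can be sketched as follows. This is a simplified illustration, not the PE-Assembler implementation: it uses exact prefix/suffix overlaps only, and the names `next_bases`, `reachable`, `resolve_by_lookahead` and the `min_overlap` parameter are hypothetical.

```python
def next_bases(contig, reads, min_overlap):
    """Bases that some read proposes immediately after the contig's 3' end."""
    candidates = set()
    for read in reads:
        for overlap in range(min_overlap, len(read)):
            # The read's first `overlap` bases must match the contig's tail.
            if contig.endswith(read[:overlap]):
                candidates.add(read[overlap])
    return candidates

def reachable(contig, reads, min_overlap, steps):
    """True if the contig can be extended by `steps` more bases."""
    if steps == 0:
        return True
    return any(
        reachable(contig + base, reads, min_overlap, steps - 1)
        for base in next_bases(contig, reads, min_overlap)
    )

def resolve_by_lookahead(contig, reads, min_overlap, read_length):
    """Return the single candidate base whose path survives for
    read_length bases, or None if zero or several candidates survive."""
    survivors = [
        base
        for base in sorted(next_bases(contig, reads, min_overlap))
        if reachable(contig + base, reads, min_overlap, read_length - 1)
    ]
    return survivors[0] if len(survivors) == 1 else None

# Reads tiled over a made-up genome ACGTCATCGATGCA, plus one read (CGTCAG)
# whose last base is a simulated sequencing error.
reads = ["ACGTCA", "CGTCAT", "GTCATC", "TCATCG", "CATCGA",
         "ATCGAT", "TCGATG", "CGATGC", "GATGCA", "CGTCAG"]
choice = resolve_by_lookahead("ACGTCA", reads, min_overlap=3, read_length=6)
```

Both `T` and `G` are proposed after `ACGTCA`, but the `G` path has no read overlapping it and dies immediately, so only `T` survives. The recursion is exponential in the worst case, which is acceptable here because the look-ahead depth is bounded by the read length.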

Contig extension The contig extension step iteratively extends each verified seed into a longer contig. Since each verified seed is longer than MaxSpan, we can extend the seed using paired-end reads.

Scaffolding problem Input: – A set of contigs of some genome X – A set of DNA-PETs of some genome X Scaffolding finds the correct ordering and the orientation of the contigs.

Scaffolding (II) 1. Demarcate all repeat regions within assembled contigs 2. Build the contig graph 3. Identify a linear order of the contigs

Demarcating all repeat regions within assembled contigs Map all paired-end reads onto the contigs. The mode of the read density is assumed to be the expected read coverage across the genome. Any region with read density higher than 1.5 times the mode is considered a repeat region. (Figure: histogram of read density against frequency, marking the mode, 1.5 × mode, and the repeat region.)
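The 1.5 × mode rule can be sketched on a list of per-window read counts. The windowing itself, the function name `repeat_regions`, and the integer densities are assumptions for illustration; the slide does not say how density is binned.

```python
from collections import Counter

def repeat_regions(density, factor=1.5):
    """Return (mode, regions), where regions are half-open (start, end)
    index ranges in which density exceeds factor * mode."""
    mode = Counter(density).most_common(1)[0][0]
    threshold = factor * mode
    regions, start = [], None
    for i, value in enumerate(density):
        if value > threshold and start is None:
            start = i                      # a repeat region opens here
        elif value <= threshold and start is not None:
            regions.append((start, i))     # the region closes before i
            start = None
    if start is not None:                  # region runs to the end
        regions.append((start, len(density)))
    return mode, regions

# Hypothetical read counts for consecutive windows along one contig.
window_density = [10, 10, 11, 10, 25, 30, 28, 10, 9, 10, 16, 10]
mode, regions = repeat_regions(window_density)
```

With real coverage data the mode would normally be taken over binned or smoothed densities, since exact ties between raw values are otherwise common.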

Build contig graph

Identify a linear order of the contigs Case 1: 1 discordant edge Case 2: 2 discordant edges

Gap filling (I) From scaffolding, we identify adjacent contigs. The gaps between them are usually caused by repeat regions. Since we have paired-end reads anchored on both the 5’ and 3’ sides of a gap, we may be able to fill in the gap.

Gap filling (II)

Parallelization 1.Read screening – Sequential: This step is largely disk bound. 2.Seed building – Run in multiple threads. If two threads use the same set of paired-end reads, rewind one of the threads. 3.Contig extension – Run in multiple threads. If two threads use the same set of paired-end reads, rewind one of the threads. 4.Scaffolding – Graph building: Run in multiple threads. – Actual scaffolding: Sequential 5.Gap filling – Run in multiple threads.

Simulation data We perform simulation on 3 organisms.
Organism: E. coli | S. pombe | HG18 chr10
No. of contigs/chromosomes: 1 | 3 | 1
Genome length: 4,639,658bp | 12,571,820bp | 135,374,737bp
Library: 200bp, 1kbp, 10kbp | 200bp, 1kbp, 10kbp | 200bp, 1kbp, 10kbp
Read length (bp): 35, 75
Average insert size (bp):
Insert size range (average ± bp): ± 40, ± 200, ± 2000 | ± 40, ± 200, ± 2000 | ± 40, ± 200, ± 2000
No. of paired reads: 3.31m, 8.98m, 45.12m, 9.02m
Coverage: 50x, 10x
Seq. error rate: 2.0%
Ligation error rate: 0.0%, 2.0% | 0.0%, 2.0% | 0.0%, 2.0%

Simulation data (results table)
Columns: E. coli (PA, Velvet, Allpaths2, Abyss, SOAP) | S. pombe (PA, Velvet, Allpaths2, Abyss, SOAP) | HG18 chr10 (PA, Abyss, SOAP)
Contig statistics: No. of contigs (>200bp); Average length (kb); Maximum length (kb); Contig N50 size (kb); Contig N90 size (kb); Coverage
Evaluation: Large misassemblies; Segment maps
Performance: Total execution time (min) N/A 240; Peak memory usage (gb) N/A 48
Velvet and Allpaths2 are not efficient enough to handle the dataset for HG18 chr10.
The N50 of an assembly is the largest length L such that contigs of length at least L account for at least 50% of the total assembly length. N90 is defined similarly with 90%.
Segment map: Divide the genome into bins of 1000bp. Count the number of bins which are the same as the reference genome.
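The N50 and N90 definitions above can be computed as below. This is a minimal sketch; the function name `nxx` and the sample contig lengths are made up for illustration and are not results from the talk.

```python
def nxx(lengths, percent):
    """Largest length L such that contigs of length >= L together cover
    at least `percent` percent of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the boundary.
        if running * 100 >= percent * total:
            return length
    return 0  # empty assembly

# Hypothetical contig lengths (total 300).
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
n50 = nxx(contig_lengths, 50)
n90 = nxx(contig_lengths, 90)
```

Here the two longest contigs (80 + 70) reach exactly half of the total, so N50 is 70; five contigs are needed to reach 90%, so N90 is 30.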

Experimental data We obtained 4 real-life datasets from the Allpaths2 paper.

Experimental data (results tables)
Table 1 columns: S. aureus (PA, Velvet, Allpaths2, ABySS) | E. coli (PA, Velvet, Allpaths2, ABySS)
Contig statistics: No. of contigs (>200bp); Average length (kb); Maximum length (kb); Contig N50 size (kb); Contig N90 size (kb); Coverage
Evaluation: Large misassemblies; Segment maps
Performance: Total execution time (min); Peak memory usage (gb)
Table 2 columns: S. pombe (PA, Velvet, Allpaths2, ABySS) | N. crassa (PA, Velvet, Allpaths2, ABySS)
Contig statistics: No. of contigs (>200bp); Average length (kb); Maximum length (kb); Contig N50 size (kb); Contig N90 size (kb); Coverage
Evaluation: Large misassemblies; Segment maps
Performance: Total execution time (min); Peak memory usage (gb) 6.615N/A N/A25.6

Running time Single CPU, multiple cores

EXAMPLE APPLICATION

Burkholderia species Burkholderia pseudomallei (Bp) – Causative agent of melioidosis, a serious infectious disease of humans and animals with an overall fatality rate of 50% Burkholderia thailandensis (Bt) – Non-pathogenic to mammals Why can Bp infect humans? – Certain Bp-specific features are likely required for Bp to colonize and infect mammals. These include the gain of a Bp-specific capsular polysaccharide gene cluster. Wrinkled colonies Round colonies

Bt E555 My collaborator Patrick Tan thinks virulence versus non-virulence is not a black-and-white issue. There should be some intermediate state. He examined 28 Bt strains and found Bt E555, which shows a mixture of smooth and wrinkled colonies. Mixture of smooth and wrinkled colonies

Sequencing of Burkholderia thailandensis (Bt E555) We sequenced Bt E555 using Solexa Genome Analyser II. – 12.5M paired-end reads – Each read is of length 100bp – Insert size is [value missing] We map the sequences onto the Bt reference E264.

De novo assembly of Bt E555 using PE-Assembler 521 contigs N50: bp Total length: bp Longest contig: bp Shortest contig: 250 bp In particular, contig 19 (24 kbp) is similar to the Bp-like CPS in Burkholderia pseudomallei. It replaces EPS.

Phenotype of Bp-like CPS Bp colonies are wrinkled. Bt colonies are round and smooth. Bt E555 exhibits a mixture of smooth and wrinkled colonies. A Bt E555 CPS knockout (KO) develops round colonies with no wrinkling. This suggests that Bp-like CPS expression may contribute to the wrinkled colonies. Wrinkled colonies Mixture of smooth and wrinkled colonies Round colonies

SCAFFOLDING

Formal definition of the scaffolding problem Input: A set of contigs and edges – each edge spans Output: An ordering of the contigs s.t. the number of discordant edges is minimized Discordant edge
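To make the objective concrete, here is a sketch that counts discordant edges for a given scaffold. The edge model is an assumption made for illustration and is simpler than the one in the Opera paper: an edge `(a, sign_a, b, sign_b, max_between)` says contig `a` in orientation `sign_a` should precede contig `b` in orientation `sign_b`, with at most `max_between` bases of other contigs between them. A scaffold read from the opposite strand (reversed order, flipped signs) is treated as equivalent.

```python
def is_concordant(edge, scaffold, lengths):
    """Check one edge against a scaffold given as [(contig, sign), ...]."""
    a, sign_a, b, sign_b, max_between = edge
    position = {name: i for i, (name, _) in enumerate(scaffold)}
    sign = dict(scaffold)
    if a not in position or b not in position:
        return False
    i, j = position[a], position[b]
    # Either the scaffold agrees directly, or it agrees on the other strand.
    forward = i < j and sign[a] == sign_a and sign[b] == sign_b
    reverse = j < i and sign[a] == -sign_a and sign[b] == -sign_b
    if not (forward or reverse):
        return False
    lo, hi = min(i, j), max(i, j)
    between = sum(lengths[name] for name, _ in scaffold[lo + 1:hi])
    return between <= max_between

def count_discordant(edges, scaffold, lengths):
    """Number of edges that the scaffold violates."""
    return sum(not is_concordant(e, scaffold, lengths) for e in edges)

# Hypothetical contigs and edges.
lengths = {"A": 1000, "B": 500, "C": 2000}
edges = [("A", 1, "B", 1, 0), ("B", 1, "C", -1, 0), ("A", 1, "C", -1, 600)]
good = [("A", 1), ("B", 1), ("C", -1)]
bad = [("A", 1), ("C", -1), ("B", 1)]
```

The scaffolding problem then asks for the ordering and orientation that minimizes `count_discordant`.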

Scaffolding problem is NP-hard Huson et al. (2002) showed that scaffolding is NP-hard. A number of heuristic solutions exist: – Celera Assembler [Myers et al., 2000] – Euler [Pevzner et al., 2001] – Jazz [Chapman et al., 2002] – Arachne [Batzoglou et al., 2002] – Velvet [Zerbino et al., 2008] – Bambus [Pop et al., 2004] Can we solve the problem optimally? Is the optimal solution better?

A parameter width (w) Since every contig has some minimum length and every edge spans a fixed length, – we expect every edge to span at most w contigs for some constant width w. At most w contigs

Two parts Fixed parameter polynomial time algorithm – We showed that the running time of the scaffolder depends on a parameter “width” Graph Contraction – We proposed a way to reduce the graph

Scaffolding when no discordant edge When there is no discordant edge, a naïve solution is: – Enumerate all possible signed permutations of the contigs in a tree. Prune a subtree if its partial scaffold is not feasible. This takes exponential time. Example tree over contigs A, B, C, D: +A +A+B +A-B +A+C +A-C +A+B+C +A+B-C +A-C+B +A-C-B … … …
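The naïve tree search can be sketched as a depth-first generator. The function name `signed_scaffolds` and the `is_feasible` callback are hypothetical; the callback stands in for whatever feasibility test is applied to a partial scaffold.

```python
def signed_scaffolds(contigs, is_feasible, prefix=()):
    """Depth-first enumeration of signed permutations of `contigs`,
    skipping every subtree whose prefix fails `is_feasible`."""
    if len(prefix) == len(contigs):
        yield prefix
        return
    used = {name for name, _ in prefix}
    for name in contigs:
        if name in used:
            continue
        for sign in (+1, -1):
            candidate = prefix + ((name, sign),)
            if is_feasible(candidate):   # prune infeasible branches early
                yield from signed_scaffolds(contigs, is_feasible, candidate)

# Without pruning, n contigs give n! * 2**n scaffolds.
all_scaffolds = list(signed_scaffolds(["A", "B", "C"], lambda s: True))

# Hypothetical constraint: the scaffold must start with +A.
starts_with_plus_a = list(
    signed_scaffolds(["A", "B", "C"], lambda s: s[0] == ("A", +1))
)
```

Three contigs already yield 48 complete scaffolds, which illustrates why the search is exponential unless pruning or the tail-based equivalence classes cut it down.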

Observation Lemma: Consider two scaffolds S1 and S2. If both scaffolds share a common tail of width w, – then either both S1 and S2 can be extended to a feasible solution or neither can. Proof: Based on the Bandwidth Problem [Saxe, 1980] – Orientation of Nodes – Direction of Edges – Discordant Edges … * J. Saxe: Dynamic programming algorithms for recognizing small-bandwidth graphs in polynomial time. SIAM J. on Algebraic and Discrete Methods, 1(4), (1980) Upper Bound (W)

Scaffold Tail is Sufficient Analogous to the Bandwidth Problem [Saxe, 1980] – Orientation of Nodes – Direction of Edges – Discordant Edges … * J. Saxe: Dynamic programming algorithms for recognizing small-bandwidth graphs in polynomial time. SIAM J. on Algebraic and Discrete Methods, 1(4), (1980) Upper Bound (W)

Equivalence class of scaffolds – S1 and S2 have the same tail -> They are in the same class – Features of an equivalence class: – Use of the same set of contigs; – All or none of them can be extended to a solution Tail +A-B+C +D+E -A+C +D+E+F …

Scaffolding with p discordant edges When there are discordant edges, we try all possible ways to discard the p discordant edges, then run the scaffolding algorithm for no discordant edges. This gives an O(|E| |V|^(w+p))-time algorithm.

Graph Contraction 20k

Graph Contraction

Gap Estimation Utility – Genome finishing (Genome Size Estimation) – Scaffold Correctness Calculate Gap Sizes – Maximum Likelihood – Quadratic Function – Solved through quadratic programming [Goldfarb et al., 1983] Polynomial Time g1 g2 g3 μ, σ * Goldfarb, D., Idnani, A.: A numerically stable dual method for solving strictly convex quadratic programs. Mathematical Programming, 27 (1983)
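One way to see why maximum likelihood leads to a quadratic function is sketched below. The modelling choices are assumptions, since the slide only names the ingredients (gap sizes g1, g2, g3 and library parameters μ, σ): insert sizes are taken to be independent and normally distributed with mean μ and standard deviation σ, each paired-end link e spans a set of gaps G(e), and c_e denotes the part of the implied insert size that is fixed by read positions and contig lengths.

```latex
% Assumption: link e implies an insert size c_e + \sum_{i \in G(e)} g_i,
% drawn independently from a normal distribution N(\mu, \sigma^2).
\begin{align*}
\hat{g}
  &= \arg\max_{g} \prod_{e}
     \frac{1}{\sqrt{2\pi}\,\sigma}
     \exp\!\left(
       -\frac{\bigl(c_e + \sum_{i \in G(e)} g_i - \mu\bigr)^2}{2\sigma^2}
     \right) \\
  &= \arg\min_{g} \sum_{e}
     \Bigl(c_e + \sum_{i \in G(e)} g_i - \mu\Bigr)^2
\end{align*}
```

Taking the logarithm turns the product into a sum, and dropping the constant terms and the positive factor 1/(2σ²) leaves a sum of squares in the unknown gap sizes, which is the kind of quadratic objective a quadratic programming solver handles.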

Runtime Comparison
Scaffolder | E. coli ◆ | B. pseudomallei ★ | S. cerevisiae ◆ | D. melanogaster ◆
Bambus | 50s | 16m | 2m | 3m
SOPRA | 49m | - | 2h | 5h
Opera | 4s | 7m | 11s | 30s
◆ Simulated datasets using MetaSim: coverage of 2x80bp PETs with insert size 300bp: 40X; coverage of 2x50bp PETs with insert size 10kbp: 2X; contigs assembled using Velvet.
★ In-house data (B. pseudomallei): coverage of 100bp 454 reads: 20X; coverage of 2x20kbp PETs with insert size library: 2.8X; contigs assembled using Newbler.

Scaffold Contiguity

Scaffold Correctness

Scaffold Correctness Columns: E. coli, S. cerevisiae, D. melanogaster Opera134 Bambus

Reference Pramila Nuwantha Ariyaratne, Wing-Kin Sung: PE-Assembler: de novo assembler using short paired-end reads. Bioinformatics 27(2): (2011) Song Gao, Niranjan Nagarajan, Wing-Kin Sung: Opera: Reconstructing Optimal Genomic Scaffolds with High-Throughput Paired-End Sequences. RECOMB 2011:

Acknowledgement Bioinformatics – Zhizhuo – Xueliang – Chandana – Rikky – Gao Song – Pramila – Charlie Lee – Guoliang Li – Han Xu – Fabi Infectious Disease – Patrick Tan Sequencing group – Ruan Yijun – Wei Chialin – Yao Fei – Liu Jun – Herve Thoreau – Sequencing team