Meraculous: De Novo Genome Assembly with Short Paired-End Reads

Jarrod A. Chapman, Isaac Ho, Sirisha Sunkara, Shujun Luo, Gary P. Schroth, Daniel S. Rokhsar
Joint Genome Institute, Illumina Inc., University of California, Berkeley
Presenter: Priyanka Ghosh


De Novo Genome Assembly
Short Paired-End Reads
1) Take 3 copies of the same DNA.
2) Break the sequence down into fragments to generate reads, and sequence both ends of each fragment. Why?
   - Generates high-quality, alignable sequence data
   - Likely to align better with a reference genome
3) Reconstruct the original DNA sequence from the read set.

Why use shorter reads?
- Readily aligned to a reference
- Error rates are low
- Lower cost per base, higher throughput
- Used by short-read assemblers such as Velvet, ABySS, and SOAPdenovo
- These assemblers take advantage of the de Bruijn graph representation of the assembly problem, using a k-mer approach

Genome Assembly vocabulary
- k-mer: a DNA sequence of k bases
- De Bruijn graph of k-mers: a graph representing overlaps between k-mers
- Contig: a contiguous sequence of DNA formed by combining k-mers
- Scaffold: one or more contigs linked together by unknown sequence
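The vocabulary above can be made concrete with a toy example. This is a minimal sketch, not Meraculous code: the function names `kmers` and `de_bruijn_edges` are hypothetical, and reverse-complement strands are ignored for brevity.

```python
from collections import defaultdict

def kmers(seq, k):
    """Return every length-k substring of seq, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def de_bruijn_edges(kmer_list):
    """Link k-mer a to k-mer b when the last k-1 bases of a equal the first k-1 bases of b."""
    distinct = set(kmer_list)
    by_prefix = defaultdict(set)
    for km in distinct:
        by_prefix[km[:-1]].add(km)
    edges = set()
    for km in distinct:
        for nxt in by_prefix.get(km[1:], ()):
            edges.add((km, nxt))
    return edges

toy_kmers = kmers("ACGTAC", 3)          # ['ACG', 'CGT', 'GTA', 'TAC']
toy_edges = de_bruijn_edges(toy_kmers)  # includes ('ACG', 'CGT') and ('TAC', 'ACG')
```

A contig corresponds to a path through these edges in which every step is unambiguous.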

Meraculous - Contributions
- De novo assembler developed at the Joint Genome Institute
- Efficient and conservative traversal of a subgraph of the de Bruijn graph (DBG)
- Takes into consideration only unique (single-base) high-quality extensions in the dataset
- Avoids an explicit error-correction step
- Incorporates a novel low-memory hash structure to access the DBG, giving a small memory footprint compared with other short-read assemblers

Steps for genome assembly using Meraculous:
1) Reads: reads may contain errors.
2) K-mers: reads are fragmented further into k-mers, to possibly exclude errors.
3) Contigs: construct and traverse the de Bruijn graph of k-mers to generate contigs.
4) Scaffolds: use read information to link contigs and generate scaffolds.
Slide courtesy: Evangelos Georganas, SC14 paper "Parallel De Bruijn Graph Construction and Traversal for de novo Genome Assembly"

Test input
- Pichia stipitis CBS 6054: predominantly haploid yeast genome (15.4 Mbp)
- Dataset: 3 lanes of 75 bp paired-end shotgun reads

Assembly Algorithm
1) Selection of the k-mer set (generation of k-mers)
2) Production of maximal linear sub-paths of the DBG (generation of contigs)
3) Use of read-pair information to produce scaffolds by ordering and orienting contigs (generation of scaffolds)
4) Closure of gaps contained within scaffolds, using reads projected to lie within the gaps

Generation of sub-graph of DBG
- Select an odd integer k (= 41) such that:
  a) the fraction of the sequence targeted for assembly is unique as k-mers
  b) reads have multiple overlapping error-free k-mers
- Threshold multiplicity dmin (= 10), compared against the number of occurrences of each k-mer in the dataset:
  - k-mers with count > dmin: likely to be error-free and to occur in the genome (part of the k-mer set)
  - k-mers with count < dmin: likely to contain sequence errors
- For each k-mer, count all single-base extensions (forward and backward) such that the next/previous base has Q >= Qmin
- Single-base extensions with Q >= Qmin are "high-quality extensions"
- Each end of a k-mer is marked X, U, or F, depending on whether it has 0, 1, or >= 2 distinct high-quality extensions:
  - "X": no high-quality extensions
  - "U": unique high-quality extension
  - "F": fork in the DBG
- A k-mer marked U at both ends is a "U-U k-mer"; chains of these form a "U-U contig"
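The marking scheme can be sketched in a few lines of Python. This is an illustrative simplification under stated assumptions, not the Meraculous implementation: `classify_extensions` is a hypothetical name, k-mers are kept when their count is at least dmin (the slide leaves the boundary case open), any single high-quality observation counts as an extension, and reverse complements are ignored.

```python
from collections import Counter, defaultdict

def classify_extensions(reads, k, dmin, qmin):
    """reads: list of (sequence, qualities); qualities is a list of ints, one per base.
    Returns {kmer: (left_code, right_code)} for k-mers seen at least dmin times."""
    counts = Counter()
    left = defaultdict(set)    # high-quality bases seen just before each k-mer
    right = defaultdict(set)   # high-quality bases seen just after each k-mer
    for seq, qual in reads:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            counts[kmer] += 1
            if i > 0 and qual[i - 1] >= qmin:
                left[kmer].add(seq[i - 1])
            if i + k < len(seq) and qual[i + k] >= qmin:
                right[kmer].add(seq[i + k])

    def code(bases):
        # 0 distinct extensions -> X, exactly 1 -> U, 2 or more -> F
        return "X" if not bases else ("U" if len(bases) == 1 else "F")

    return {
        kmer: (code(left.get(kmer, ())), code(right.get(kmer, ())))
        for kmer, n in counts.items()
        if n >= dmin  # assumption: threshold is inclusive
    }

good = [30] * 5
# The third read ends in a low-quality C, so it does not create a fork after CGT.
toy_reads = [("ACGTA", good), ("ACGTA", good), ("ACGTC", [30, 30, 30, 30, 5])]
codes = classify_extensions(toy_reads, k=3, dmin=2, qmin=20)
# codes == {'ACG': ('X', 'U'), 'CGT': ('U', 'U'), 'GTA': ('U', 'X')}
```

In this toy input only CGT is a U-U k-mer; GTC is dropped because it occurs once, below dmin.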

Assembly Algorithm (contd.)
- "U-U" contig: subgraph of the DBG consisting of a linear chain of U-U k-mers (extensions must be unique)
- Omit [F] k-mers: candidates for contig boundaries
- Error correction is made implicitly by adhering to the dmin and Qmin constraints
- The number of U-U contigs in the DBG depends on the choice of dmin:
  - dmin high: U-U contigs are likely to terminate at "X"
  - dmin low: U-U contigs are likely to terminate at "F"
- Next, contigs (>= 100 bp for Pichia) are linked into scaffolds using paired-end links to jump over unassembled repetitive regions, leaving gaps
- Finally, intra-scaffold gaps are closed using reads whose mate pairs constrain them to lie within a gap
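A U-U contig can be recovered by walking from any U-U k-mer in both directions until the chain leaves the U-U set. The sketch below assumes a plain dict mapping each U-U k-mer to its (previous base, next base) pair; `walk_contig` and the toy table are hypothetical and single-stranded.

```python
def walk_contig(uu, seed):
    """uu maps each U-U k-mer to (previous_base, next_base). Returns the contig through seed."""
    contig = seed
    seen = {seed}

    # Extend to the right while the neighbour is also U-U and points back at us.
    kmer = seed
    while True:
        nxt = kmer[1:] + uu[kmer][1]
        if nxt not in uu or nxt in seen or uu[nxt][0] != kmer[0]:
            break
        contig += nxt[-1]
        seen.add(nxt)
        kmer = nxt

    # Extend to the left under the same mutual-agreement rule.
    kmer = seed
    while True:
        prv = uu[kmer][0] + kmer[:-1]
        if prv not in uu or prv in seen or uu[prv][1] != kmer[-1]:
            break
        contig = prv[0] + contig
        seen.add(prv)
        kmer = prv

    return contig

# U-U 3-mers taken from the toy sequence GATTACAGC (the two end k-mers are not U-U).
toy_uu = {
    "ATT": ("G", "A"),
    "TTA": ("A", "C"),
    "TAC": ("T", "A"),
    "ACA": ("T", "G"),
    "CAG": ("A", "C"),
}
contig = walk_contig(toy_uu, "TAC")  # 'ATTACAG'
```

The walk stops at GAT and AGC because they are absent from the table, which mirrors termination at an X or F end.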

Lightweight Hash implementation – Contig generation phase
- Reduces the memory needed to store and randomly access the DBG
- Uses a recursive collision strategy with multiple hash functions to avoid explicitly storing the keys
- Key-value pair: key = U-U k-mer, value = 2-letter code [ACGT][ACGT]
Algorithm: the hash is first primed by exposing it to all k-mers (assume hash functions h0, h1, ..., hn are already defined)
1. Initialize hash depth d to 0 and write all keys to file Fd.
2. For all keys in Fd, evaluate hash function hd. Update a "primer object" Pd to keep track of the keys that collide under hd.
3. Write all colliding keys to file Fd+1 and increment hash depth d.
4. Repeat steps 2 and 3 until the number of colliding keys is 0.
- All primers P0 ... Pd are sent to the lightweight hash initializer to create the lightweight hash object
- K-mers are loaded with extension codes: for each key-value pair added to the hash object, the primer information determines the level of recursion at which to store the value; the key is discarded
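The priming loop and the key-free lookup can be sketched in memory as follows. Everything here is an assumption made for illustration: the salted BLAKE2 hash stands in for h0, h1, ..., lists stand in for the files Fd, each primer stores a per-slot hit count rather than a compact flag, and the names `slot_for`, `prime`, and `LightweightHash` are hypothetical.

```python
import hashlib

def slot_for(depth, key, size):
    """Stand-in for hash function h_depth: salt the key with the depth."""
    digest = hashlib.blake2b(f"{depth}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % size

def prime(keys, max_depth=64):
    """Build one primer per depth; keys that collide at depth d are retried at depth d+1."""
    primers = []
    pending = sorted(set(keys))
    while pending:
        if len(primers) >= max_depth:
            raise RuntimeError("collisions did not resolve")
        depth = len(primers)
        size = 2 * len(pending)
        hits = [0] * size
        for key in pending:
            hits[slot_for(depth, key, size)] += 1
        primers.append(hits)
        pending = [key for key in pending if hits[slot_for(depth, key, size)] > 1]
    return primers

class LightweightHash:
    """Stores values only; a key is located by replaying the primers."""

    def __init__(self, primers):
        self.primers = primers
        self.values = [[None] * len(p) for p in primers]

    def _locate(self, key):
        for depth, primer in enumerate(self.primers):
            slot = slot_for(depth, key, len(primer))
            if primer[slot] == 1:      # no collision here: this is the key's home
                return depth, slot
            if primer[slot] == 0:      # nothing was primed into this slot
                raise KeyError(key)
        raise KeyError(key)

    def put(self, key, value):
        depth, slot = self._locate(key)
        self.values[depth][slot] = value

    def get(self, key):
        depth, slot = self._locate(key)
        return self.values[depth][slot]

uu_codes = {"ATT": "GA", "TTA": "AC", "TAC": "TA", "ACA": "TG", "CAG": "AC"}
table = LightweightHash(prime(uu_codes))
for kmer, ext in uu_codes.items():
    table.put(kmer, ext)
```

Because keys are never stored, a k-mer outside the primed set can land on an occupied slot and return another k-mer's value, which is why the scheme needs the full key set to be known up front.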

Results - I
Memory advantages of the lightweight hash
Conventional hash:
- Number of distinct keys is comparable to the genome size G
- Memory required to naively store the hash = 2G * (k+1) bytes (with the majority of the cost associated with storing the keys)
- Example: human genome, G = 3 * 10^9, with k = 75. Total memory required = 450 GB
Lightweight hash:
- Applicable if the complete set of keys is known initially and does not change
- Average lookup time does not depend on genome size
- Hash requires e*G bytes of memory (independent of k, e= )
- Example: human genome, G = 3 * 10^9. Total memory required = 8 GB
Approx. 60-fold memory savings!
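The human-genome figures can be checked with a few lines of arithmetic. This is only a sanity check under the assumption that 1 GB = 10^9 bytes. Taking the product 2G(k+1) in bytes gives about 456 GB, in line with the 450 GB quoted; the same product counted in bits would be only about 57 GB.

```python
G = 3 * 10**9   # human genome size in bases
k = 75

naive_units = 2 * G * (k + 1)          # 456,000,000,000
naive_gb_if_bytes = naive_units / 1e9  # 456.0
naive_gb_if_bits = naive_units / 8e9   # 57.0

lightweight_gb = 8
savings = 450 / lightweight_gb         # 56.25, i.e. roughly 60-fold
```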

Results - II
- Benchmarking Meraculous against other short-read assemblers on an E. coli K-12 MG1655 dataset of 10.4 million pairs of 36 bp reads
- Finished reference sequence: approx. 4.64 Mbp genome
- Short-read dataset represents a nominal ~160X shotgun coverage
- k = 21, dmin = 9
- Meraculous assembled 97.8% of the genome into contigs ranging from 200 bp to 175 kbp, with no assembly errors
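The nominal coverage follows directly from the read counts. The helper below is a hypothetical one-liner, and it assumes a genome size of 4.64 Mbp for E. coli K-12 MG1655.

```python
def nominal_coverage(num_pairs, read_length, genome_size):
    """Total sequenced bases divided by genome size; each pair contributes two reads."""
    return num_pairs * 2 * read_length / genome_size

coverage = nominal_coverage(10_400_000, 36, 4_640_000)  # about 161, i.e. ~160X
```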

Also to be noted ...
- Accuracy on the Pichia genome: Meraculous reconstructs 95% of the genome in long contigs (N50 = 101 kbp) and scaffolds (N50 = 269 kbp)
- Most steps of the Meraculous assembly pipeline are parallelized, by partitioning reads (or k-mers) among processors
- The 2 steps NOT parallelized are:
  a) construction of the U-U subgraph (memory intensive)
  b) the scaffolding step
- Limitation: the current Meraculous algorithm assumes data from a haploid genome
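For readers unfamiliar with N50, here is a minimal sketch of the usual definition: the length L such that contigs of length L or longer contain at least half of the assembled bases. The function name `n50` and the toy lengths are illustrative, and the slides do not say whether their N50 is computed against assembly size or genome size.

```python
def n50(lengths):
    """Largest length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")

toy_n50 = n50([80, 70, 50, 40, 30, 20])  # 70: 80 + 70 = 150 is at least half of 290
```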

Summary
- Meraculous, a new short-read assembler, produces high-quality, near-complete de novo assemblies of small fungal genomes
- It does not construct the full DBG. Instead it is limited to the "U-U" subgraph, which includes k-mers that possess unique high-quality extensions at each end, thereby removing most error-containing k-mers
- The U-U subgraph is produced with a memory footprint that scales linearly with genome size
- Meraculous avoids explicit error correction by identifying outliers and disregarding them in a robust manner, without degrading the assembly
- Gap filling allows residual errors to be corrected, combining the initial set of contigs (from the DBG approach) with reads, using mate pairs to link contigs and fill the gaps between them

Thank you for listening! Questions?
