1 De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer David Hernandez, Patrice François, Laurent Farinelli, Magne Østerås, Jacques Schrenzel Presented by Lucas Lochovsky

2 Outline
1. Introduction
2. Edena’s Methodology
    Reducing Read Redundancy
    Overlap Graph Construction
    Transitive Edge Reduction
    Graph Cleanup
    Contig Production
3. Results
    Assemblers
    Assembly tasks
4. Additional Edena Analyses
    Graph Cleaning Effectiveness
    Effective Coverage Depth
5. Conclusions

4 1) Introduction NGS will allow us to explore strange new genomes, blah blah blah…. WGS assemblers we’ve covered so far: Medvedev-Brudno assembler Arachne AMOS-Cmp Velvet ALLPATHS Think you’ve seen it all?

5 1) Introduction (cont’d) Edena: De novo short read assembler Uses a classic overlap graph approach to assembly Anyone else get a feeling of déjà vu? Compare to other recently published NGS read assemblers De novo assembly of two bacterial genomes sequenced with the Illumina/Solexa platform

7 2) Edena’s Methodology
Built around a standard overlap-layout-consensus workflow
Opted to use exact matching for overlap detection
    Reduces the number of spurious overlaps
    Faster than using approximate matching
Also assumes that all reads have the same length
    Is this assumption valid?

8 2) Edena’s Methodology (cont’d)
Four major steps:
1. Remove redundant reads so that dataset size is more manageable
2. Overlap detection and overlap graph construction
3. Graph cleaning: simplification and ambiguity resolution
4. Produce contigs

9 2) Edena’s Methodology (cont’d) 1) Practice your 3 R’s: Reducing Read Redundancy Illumina Genome Analyzer has high amount of over-sampling → many redundant reads Reduce dataset so it contains only a single copy of each read → non-redundant Index all reads into a prefix tree Identical reads will be mapped to the same key → no duplicate reads in this structure

10 2) Edena’s Methodology (cont’d) Prefix trees are associative arrays for strings where all descendants of a node have a common prefix Reads and their reverse complements are considered the same read → merged into the same tree key

11 2) Edena’s Methodology (cont’d) Ambiguous reads discarded, since they won’t work with exact matching Opens up possibility of coverage gaps in read data (not explored by the authors) Original read data still useful for getting read frequencies Contig coverage depth Repeat identification
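A minimal sketch of the redundancy-reduction step described above, assuming reads are plain A/C/G/T strings. It swaps in a Python Counter (a hash table) for Edena's prefix tree, and the names reverse_complement, canonical, and deduplicate are illustrative, not Edena's own.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_VALID_BASES = frozenset("ACGT")


def reverse_complement(read: str) -> str:
    """Return the reverse complement of a DNA read."""
    return read.translate(_COMPLEMENT)[::-1]


def canonical(read: str) -> str:
    """Pick one representative for a read and its reverse complement."""
    return min(read, reverse_complement(read))


def deduplicate(reads):
    """Map each distinct (strand-independent) read to its frequency.

    Reads containing ambiguous bases (anything other than A, C, G, T)
    are discarded, since they cannot take part in exact matching.
    """
    counts = Counter()
    for read in reads:
        if not _VALID_BASES.issuperset(read):
            continue  # ambiguous read, e.g. contains 'N'
        counts[canonical(read)] += 1
    return counts


if __name__ == "__main__":
    print(deduplicate(["ACGT", "ACGT", "AAAC", "GTTT", "ACNT"]))
```

The keys form the non-redundant read set, while the counts keep the read frequencies that the slides mention for coverage depth and repeat identification.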

12 2) Edena’s Methodology (cont’d) 2) Overlap Graph Construction Non-redundant read dataset is indexed by a suffix array Déjà vu moment: Almost exactly like suffix trees from MUMmer/MUMmerGPU! Information used to produce a bidirected overlap graph Déjà vu moment: Just like the Medvedev-Brudno assembler! (which I presented!)

13 2) Edena’s Methodology (cont’d) This slide should be review for all of you! Bidirected graphs are kind of like directed graphs, except each edge has an orientation on each of its ends Gives rise to three types of edges: Edges where one arrow points out of a vertex, and one arrow points into a vertex Edges with both arrows pointing out, and Edges with both arrows pointing in (easiest one to do in PowerPoint!) For a walk in a bidirected graph, for each vertex on that walk, the orientation of the edge entering the vertex must be opposite that of the edge leaving the vertex

14 2) Edena’s Methodology (cont’d)
More review!
In a bidirected overlap graph, each vertex is a double-stranded read
Edges represent read overlaps
Three possible ways that two double-stranded reads can overlap (corresponds to the three types of edges)
Suppose we have two ds reads r1 and r2
Each read can be oriented to the left or to the right
The three possible overlaps are:
    i) Both strands point in the same direction (both reads can point left, or both can point right, it’s the same overlap either way)
    ii) r1 points left and r2 points right
    iii) r1 points right and r2 points left

15 2) Edena’s Methodology (cont’d) Parameter: Minimum overlap size Sensitivity vs. specificity tradeoff Small value: Higher frequency of chance overlaps → causes path branching in graph (sensitivity favoured) Large value: Creates more dead-end (DE) paths, i.e. reads not extended by overlapping reads on one side (specificity favoured)
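To make the minimum overlap parameter concrete, here is a rough sketch of exact suffix-prefix overlap detection. It is not Edena's implementation: it assumes equal-length reads, looks at the forward strand only (no bidirected edges), and uses a dictionary of read prefixes in place of a suffix array. The function name find_overlaps is made up for illustration.

```python
from collections import defaultdict


def find_overlaps(reads, min_overlap):
    """Return {(i, j): length} for the longest exact overlap where a
    suffix of reads[i] equals a prefix of reads[j].

    Assumes all reads have the same length and min_overlap < read length.
    """
    read_len = len(reads[0])

    # Index every read by its first min_overlap bases.
    prefix_index = defaultdict(list)
    for j, read in enumerate(reads):
        prefix_index[read[:min_overlap]].append(j)

    overlaps = {}
    for i, read in enumerate(reads):
        # Try the longest proper suffix first, down to min_overlap bases.
        for start in range(1, read_len - min_overlap + 1):
            suffix = read[start:]
            for j in prefix_index.get(suffix[:min_overlap], ()):
                if j != i and (i, j) not in overlaps and reads[j].startswith(suffix):
                    overlaps[(i, j)] = len(suffix)
    return overlaps


if __name__ == "__main__":
    reads = ["ACGTAC", "GTACGG", "TACGGA"]
    print(find_overlaps(reads, 3))
    print(find_overlaps(reads, 4))
```

On this toy input, raising the threshold from 3 to 4 drops the shortest overlap (between the first and third read), which is the kind of edge the tradeoff above is about.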

16 2) Edena’s Methodology (cont’d)
3a) Transitive Edge Reduction
Simplifies the graph by removing nonessential edges
Generally speaking, given the paths v1 → v2 → v3 and v1 → v3, the edge v1 → v3 is transitive: it represents the same sequence while skipping v2, so it can be removed
Reduces graph complexity by a factor of the over-sampling rate c = NL/G
    N: Number of reads
    L: Read length
    G: Genome size

17 2) Edena’s Methodology (cont’d)
For sequences, it’s about removing the overlap between two reads whenever an intermediate read overlaps both of them to a greater extent and spells out the same sequence
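A small sketch of transitive edge reduction on a plain directed graph, stored as a dict from node to successor set. It only checks two-hop paths and assumes that a skipped-over path always spells the same sequence, which a real assembler would have to verify. The name transitive_reduction is illustrative.

```python
def transitive_reduction(graph):
    """Remove every edge v -> x for which a path v -> w -> x also exists.

    graph maps each node to the set of its successors; every node must
    appear as a key. The input is left unmodified.
    """
    reduced = {v: set(succs) for v, succs in graph.items()}
    for v, succs in graph.items():
        for w in succs:
            for x in graph.get(w, ()):
                # v -> x "passes over" w, so it carries no extra information.
                if x in succs and x != w:
                    reduced[v].discard(x)
    return reduced


if __name__ == "__main__":
    overlap_graph = {0: {1, 2}, 1: {2}, 2: set()}
    print(transitive_reduction(overlap_graph))
```

Here the edge 0 → 2 is dropped because 0 → 1 → 2 already covers it.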

18 2) Edena’s Methodology (cont’d) 3b) Graph Cleanup Can have multiple paths branching off a single node (branching paths) Due to genomic repetitions, sequencing errors, and clonal polymorphisms Genomic repetitions cannot be fixed without additional information But the other two can be resolved

19 2) Edena’s Methodology (cont’d)
Sequencing errors produce short dead-end (DE) paths
Attempt to elongate branching nodes up to a certain depth md (minimum depth)
Reads that cannot be extended to a depth of md are removed
Experimentally determined that md=10 is the best value
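The dead-end rule can be sketched roughly as follows. This is a simplified, assumption-laden version: the graph is a plain directed graph (dict from node to successor set), only forward dead ends are handled, and a branch is dropped when it ends after fewer than md nodes. The function name remove_dead_ends is hypothetical.

```python
def remove_dead_ends(graph, md):
    """Drop branches that leave a branching node and dead-end after
    fewer than md nodes. Every node must appear as a key in graph."""
    # Build the predecessor map so branches can be walked backwards.
    preds = {v: set() for v in graph}
    for v, succs in graph.items():
        for w in succs:
            preds[w].add(v)

    to_remove = set()
    for tip in graph:
        if graph[tip]:
            continue  # has a successor, so it is not a dead end
        chain = [tip]
        node = tip
        # Walk back along the unbranched chain, at most md nodes deep.
        while len(chain) < md and len(preds[node]) == 1:
            parent = next(iter(preds[node]))
            if len(graph[parent]) > 1:
                # Reached a branching node: the chain is a short dead end.
                to_remove.update(chain)
                break
            chain.append(parent)
            node = parent

    return {
        v: {w for w in succs if w not in to_remove}
        for v, succs in graph.items()
        if v not in to_remove
    }


if __name__ == "__main__":
    g = {"a": {"b"}, "b": {"c", "x"}, "c": {"d"}, "d": {"e"}, "e": set(), "x": set()}
    print(remove_dead_ends(g, md=3))
```

With md=3 the single-node branch x is removed while the three-node branch c → d → e survives. A larger md would remove both branches in this toy graph, which hints at why the choice of md is a tradeoff.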

21 2) Edena’s Methodology (cont’d)
Also disambiguate bubbles in the graph caused by single base substitutions (aka “p-bubbles”)
Length of p-bubble is at most ms = 4L - 2T - 1
    L: Read length
    T: Min. overlap size
Explore each branching path up to length ms (guaranteed upper bound)
Remove path with less coverage
Polymorphisms can be retained for later analysis

23 2) Edena’s Methodology (cont’d)
4) Contig Production
If run in strict mode, Edena starts generating contig sequences
In non-strict mode, one more cleaning step is performed
    Longer overlaps more reliable than shorter ones
    Save only edges at branching nodes that have the highest overlap of all edges
Produce contig sequence by following non-intersecting simple paths in overlap graph
    Nodes must have in-degree and out-degree of exactly one
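As a rough illustration of the contig production step, the sketch below extracts unbranched paths from a plain directed overlap graph and spells out their sequences. It ignores strand orientation and isolated cycles, and the names simple_paths and spell_contig are invented for this example.

```python
def simple_paths(graph):
    """Return maximal unbranched paths as lists of nodes.

    graph maps each node to its successor set; every node must be a key.
    Cycles without any branching node are not reported.
    """
    preds = {v: set() for v in graph}
    for v, succs in graph.items():
        for w in succs:
            preds[w].add(v)

    def mergeable(v, w):
        # v -> w is unambiguous: v has one way out, w has one way in.
        return len(graph[v]) == 1 and len(preds[w]) == 1

    paths, visited = [], set()
    for v in graph:
        # Skip nodes that belong to the interior of some other path.
        if len(preds[v]) == 1 and mergeable(next(iter(preds[v])), v):
            continue
        path, node = [v], v
        visited.add(v)
        while len(graph[node]) == 1:
            nxt = next(iter(graph[node]))
            if not mergeable(node, nxt) or nxt in visited:
                break
            path.append(nxt)
            visited.add(nxt)
            node = nxt
        paths.append(path)
    return paths


def spell_contig(path, reads, overlap_len):
    """Concatenate reads along a path, skipping the overlapping bases."""
    seq = reads[path[0]]
    for prev, cur in zip(path, path[1:]):
        seq += reads[cur][overlap_len[(prev, cur)]:]
    return seq


if __name__ == "__main__":
    graph = {"a": {"b"}, "b": {"c"}, "c": set()}
    reads = {"a": "ACGTAC", "b": "GTACGG", "c": "TACGGA"}
    overlap_len = {("a", "b"): 4, ("b", "c"): 5}
    for path in simple_paths(graph):
        print(path, spell_contig(path, reads, overlap_len))
```

Paths stop at every branching node, so each ambiguity left in the graph becomes a contig boundary.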

25 3) Results Survivor: WGS Assembly Four assemblers Two challenges One winner

26 3) Results (cont’d) Contestant #1: SSAKE Indexes reads in a prefix tree based upon first eleven 5’ bases Identify highest possible overlap between pairs of reads Use most highly-covered reads as starting points for read extension (i.e. assembly “nucleation points”) So far only used for partial genome sequencing for comparative metagenomic analysis (e.g. bacterial species distinction)

27 3) Results (cont’d)
Contestant #2: Velvet
    k-mer/q-gram/k-gram/q-mer de Bruijn graph representation of reads
Contestant #3: SHARCGS
    Can accept base quality scores along with read data for read filtering (low quality reads discarded)
    Also filter out reads with low coverage
    Assembly performed with a prefix tree
Contestant #4: Edena

28 3) Results (cont’d) Reward Challenge Assemble the 2.82 Mbp genome sequence and the 20.7 Kbp plasmid sequence of the Staphylococcus aureus MW2 strain from Illumina reads Immunity Challenge Assemble 1.55 Mbp genome sequence and the 3.66 Kbp plasmid sequence of the Helicobacter acinonychis Sheeba strain from Illumina reads

29 3) Results (cont’d)
Staphylococcus aureus results
Evaluated each assembler on the parameter configurations that produced the best results
    Edena: Min. overlap size: 21 bases
    Velvet: k-mer value: 23
    SHARCGS: Max. gap span: 14
    SSAKE: Default parameters

30 3) Results (cont’d) Compared contig assembly to published reference sequence Non-strict mode tends to produce longer contigs at the expense of additional misassemblies Velvet comparable to Edena strict

31 3) Results (cont’d) SHARCGS unable to assemble significant contigs → insufficient coverage depth SSAKE produced a large number of mismatches mostly at contig boundaries

32 3) Results (cont’d) Authors also tried combining contig results from Edena and Velvet due to significant overlaps between their contigs N50 and mean contig size increased relative to original results Edena non-strict has similar influence on results as previously
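For readers unfamiliar with the N50 statistic used to compare the assemblies, here is a short sketch of how it and the mean contig size can be computed from a list of contig lengths. The function names are illustrative and the example lengths are made up, not taken from the paper.

```python
def n50(lengths):
    """Return the largest length L such that contigs of length >= L
    together cover at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def mean_contig_size(lengths):
    """Return the arithmetic mean of the contig lengths (0.0 if empty)."""
    return sum(lengths) / len(lengths) if lengths else 0.0


if __name__ == "__main__":
    contigs = [2, 3, 4, 5, 6]  # hypothetical contig lengths
    print(n50(contigs), mean_contig_size(contigs))
```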

33 3) Results (cont’d)
Helicobacter acinonychis results
Best parameter settings:
    Edena: Min. overlap size: 27 (strict), 26 (non-strict)
    Velvet: k-mer value: 27
    SHARCGS: Max. gap span: 10 (also must remove last four bases from each read)
    SSAKE: Default parameters

34 3) Results (cont’d) Results similar to those from the previous assembly challenge

35 3) Results (cont’d) Survivor: WGS Assembly Conclusion Granted Immunity: Edena, Velvet Sent to the Tribal Council: SSAKE, SHARCGS

37 4) Additional Edena Analyses
Graph Cleaning Effectiveness
Demonstrate the effectiveness of DE path removal and p-bubble fixing
Created an ideal read pool from the S. aureus MW2 strain
    Consists of one read at every possible position
    No errors
    No polymorphisms
Distinguish between positive and negative reads
    Positive reads have at least one exact occurrence in the reference sequence
    Negative reads have none

38 4) Additional Edena Analyses (cont’d)
Ideal dataset indicates branching nodes and p-bubbles caused by genomic repetition
Anomalies in real datasets only due to negative reads
Due to small quantity of branching nodes in the ideal dataset, branch removal procedure is extremely effective

39 4) Additional Edena Analyses (cont’d) Though many p-bubbles consist of sequences made of negative reads, most cannot be explained by base calling errors Thought to correspond to underrepresented clonal polymorphisms

40 4) Additional Edena Analyses (cont’d) Since there are no DE paths in the ideal dataset, expect that DE removal should remove all DE paths in real dataset (i.e. dead-ends correspond to negative reads) From tests with different md values (below), authors decided 10 was best Not so clear-cut to me

41 4) Additional Edena Analyses (cont’d)
Most DE paths have length 1
    Correspond to paths created by base calling errors
Longer DE paths exist that do not appear to be caused by such errors
    Thought to be clonal polymorphisms in low abundance → can’t form a complete p-bubble

42 4) Additional Edena Analyses (cont’d)
Effective Coverage Depth
Computed effective coverage depth according to formula from Lander and Waterman
    $E = N(L - T)/G$
    N: # of usable reads
    L: Read length
    T: Req. overlap length
    G: Genome size
Can also estimate the number of gaps in read coverage with $N e^{-E}$
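The coverage formulas on this slide are easy to evaluate directly. The sketch below implements them as stated. The read count and read length in the example are hypothetical placeholders, not figures from the paper, and the function names are made up for illustration.

```python
import math


def raw_coverage(n_reads, read_len, genome_size):
    """Over-sampling rate c = N * L / G."""
    return n_reads * read_len / genome_size


def effective_coverage(n_reads, read_len, min_overlap, genome_size):
    """Lander-Waterman effective coverage E = N * (L - T) / G."""
    return n_reads * (read_len - min_overlap) / genome_size


def expected_gaps(n_reads, effective_cov):
    """Expected number of coverage gaps, N * exp(-E)."""
    return n_reads * math.exp(-effective_cov)


if __name__ == "__main__":
    # Hypothetical inputs: N and L are placeholders chosen for illustration.
    n_reads, read_len, min_overlap, genome_size = 1_000_000, 35, 21, 2_820_000
    e = effective_coverage(n_reads, read_len, min_overlap, genome_size)
    print(raw_coverage(n_reads, read_len, genome_size), e, expected_gaps(n_reads, e))
```

Because the gap estimate decays exponentially in E, a modest increase in effective coverage sharply reduces the number of gaps expected under uniform sampling.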

43 4) Additional Edena Analyses (cont’d)
S. aureus sequencing
    Raw coverage depth: 48x
    Effective coverage depth: 14x
H. acinonychis sequencing
    Raw coverage depth: 284x
    Effective coverage depth: 36x
Statistics imply that there should be no gaps in H. acinonychis assembly, and only a few in S. aureus
But each actual assembly contained several hundred gaps

44 4) Additional Edena Analyses (cont’d) Statistics assume uniform read sampling Investigated underrepresented parts of genomes After alignment of reads to reference genome, extracted low coverage sequences These sequences have complex motifs and single base repeats → cause difficulty in replication

46 5) Conclusions Edena holds up well against other recent assemblers, in both assembly quality and computational resources Some assemblers are partially complementary to each other (Edena and Velvet) → can use together to produce results better than each individual assembler’s results Rise of NGS paired read data will help produce longer contigs and clean up ambiguities

47 Is Edena The One? The One that will herald the beginning of cost-effective whole genome assembly with NGS? Maybe you should ask the Oracle…

48 That’s all folks!
Discussion Questions
What were the strengths/weaknesses of Edena? How would you improve it?
How do you think Edena compares to the other assemblers tested? Would you test it against other assemblers not tested here?
Given Edena’s limitations, would you trust it for de novo genome assembly over traditional sequence assembly?
Why did we have to discuss yet another NGS genome assembler today?
