De novo genome assembly of Moniliophthora roreri
Group 4: Chen, Demeke, Habte, Namrata, Rajdeep, Xu
Introduction
- M. roreri is a fungal pathogen that causes frosty pod rot in cacao (Theobroma cacao), mainly in Central and South America
- Genomic information is important for understanding the biology of the pathogen
- Genome assembly is both more important and more challenging when no reference sequence is available
Assembly pipeline
1. Quality control (FastQC)
2. Adapter removal (Trimmomatic)
3. Contaminant cleaning (Bowtie)
4. Contig assembly (Minia)
5. Scaffolding (SSPACE)
6. Gap filling (GapFiller)
7. Gene prediction (QUAST)
Introduction to Minia
- An ultra-low-memory DNA sequence assembler
- A human genome can be assembled using 4 GB of memory
- Produces results of similar contiguity and accuracy to other de Bruijn graph assemblers such as Velvet
- Takes a set of short genomic reads as input (typically from an Illumina sequencer)
- Version used: Minia 1.5418-maxk128
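
To make the term "de Bruijn assembler" concrete, here is a minimal Python sketch of the general idea: split reads into k-mers, link overlapping (k-1)-mers, and walk unambiguous paths to build a contig. It is a toy illustration only, not Minia's implementation. It keeps the whole graph in a dictionary (Minia's low-memory data structure is not reproduced here), ignores reverse complements and sequencing errors, and the function names and example reads are made up for illustration.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def extend_contig(graph, start):
    """Walk forward from `start` for as long as the path is unambiguous."""
    # Count incoming edges so the walk stops where two paths merge.
    in_degree = defaultdict(int)
    for successors in graph.values():
        for nxt in successors:
            in_degree[nxt] += 1

    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen or in_degree[nxt] != 1:
            break  # cycle or merge point: end the contig here
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig

# Three hypothetical overlapping reads.
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
contig = extend_contig(graph, "ATG")
print(contig)  # ATGGCGTGCAAT
```

A larger k makes the graph less tangled by repeats but requires longer exact overlaps between reads, which is the trade-off explored in the k-mer slides.
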
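Contig assembly with the recommended k-mer
K-mer estimation: KmerGenie (recommended K, minimum coverage, predicted assembly size). Assembly: Minia (N50, longest, # contigs, total length).

| Library | Recommended K | Minimum coverage | Predicted assembly size (bp) | N50 (bp) | Longest (bp) | # Contigs | Total length (bp) |
|---|---|---|---|---|---|---|---|
| PE reads | 77 | 4 | 55,981,704 | 5,879 | 54,716 | 14,236 | 47,131,570 |
| MP reads | 93 | 7 | 56,496,136 | 1,384 | 21,495 | 35,033 | 43,387,625 |
| Unpaired reads | 71 | - | 56,004,620 | 7,431 | 54,409 | 11,989 | 45,327,642 |
| PE and MP reads | 87 | 16 | 57,157,776 | 6,488 | 45,213 | 13,355 | 48,323,321 |
| All | - | 11 | 56,949,824 | 4,017 | 35,607 | 18,455 | 48,488,928 |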
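
Since N50 is the main contiguity metric in these tables (and L50 appears in the final statistics), the sketch below shows how both are usually computed from a list of contig lengths: sort the lengths in descending order and find the contig at which the running total reaches half of the assembly length. The function name and the example lengths are illustrative and are not taken from the assemblies above.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of contig or scaffold lengths.

    N50 is the length of the contig at which the cumulative length, taken
    from longest to shortest, first reaches half of the total length.
    L50 is the number of contigs needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no contig lengths given")

# Hypothetical contig lengths in bp (total 290, so half is 145).
example = [80, 70, 50, 40, 30, 20]
print(n50_l50(example))  # (70, 2)
```
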
Optimizing the k-mer selection for the final assembly
- Libraries: all (PE, MP and unpaired)
- Minimum coverage set to 4

| K-mer size | N50 (bp) | Longest (bp) | # Contigs | Total length (bp) |
|---|---|---|---|---|
| 51 | 16,767 | 187,673 | 10,050 | 47,937,871 |
| 61 | 18,316 | 255,017 | 9,593 | 50,039,743 |
| 71 | 19,720 | 189,801 | 9,008 | 51,488,056 |
| 81 | 20,068 | 155,025 | 8,476 | 52,624,232 |
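Effect of k-mer and data type on the assembly

| Data used | k-mer | Abundance threshold | N50 (kb) | Longest contig (kb) | # Contigs | Assembly length (Mb) |
|---|---|---|---|---|---|---|
| All | 81 | 9 | 19.7 | 155.0 | 8,595 | 52.4 |
| All | 81 | 12 | 18.8 | 147.9 | 8,571 | 51.4 |
| All | 81 | 19 | 17.3 | 114.2 | 8,458 | 48.2 |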
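
The abundance threshold is the minimum number of times a k-mer must occur in the reads before it is kept for assembly. The sketch below illustrates that filter on a few made-up reads: a higher threshold discards more rare (often erroneous) k-mers but also drops genuine low-coverage sequence, which is one plausible reason the assembly length shrinks as the threshold rises. The function name is hypothetical, and real assemblers also merge each k-mer with its reverse complement, which is omitted here.

```python
from collections import Counter

def solid_kmers(reads, k, min_abundance):
    """Return the set of k-mers seen at least `min_abundance` times in the reads."""
    counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if n >= min_abundance}

# Hypothetical reads: the third one differs from the others by one base.
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
kept_at_2 = solid_kmers(reads, k=3, min_abundance=2)
kept_at_3 = solid_kmers(reads, k=3, min_abundance=3)
print(sorted(kept_at_2))  # ['ACG', 'CGT', 'GTA', 'TAC']
print(sorted(kept_at_3))  # ['ACG', 'CGT']
```
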
Scaffolding with SSPACE
- Standalone scaffolding program
- Extends and scaffolds pre-assembled contigs
- Uses Bowtie to map paired libraries to the pre-assembled contigs
- Uses the mapped positions and orientations of the reads for scaffolding
- Read pairs whose mates fall within the allowed distance, together with their orientations, are used to pair and order contigs
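
The way paired reads link contigs can be sketched as follows. This is a simplified illustration of the general principle, not SSPACE's actual algorithm: it assumes both contigs are already in the same orientation, that read 1 maps forward on the left contig and read 2 maps in reverse on the right contig, and that a link needs a minimum number of supporting pairs. All names and parameters (`max_error`, `min_links`, the tuple layout of `pairs`) are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def estimate_gap(insert_size, len_a, start_a, end_b):
    """Gap implied by one pair: insert size minus the parts lying on each contig.

    start_a: start of read 1 on contig A (0-based, forward strand)
    end_b:   end coordinate of read 2 on contig B (distance from B's start)
    """
    return insert_size - (len_a - start_a) - end_b

def link_contigs(pairs, contig_lengths, insert_size, max_error=0.25, min_links=2):
    """Return {(contig_a, contig_b): mean gap} for links with enough support."""
    tolerance = insert_size * max_error
    gaps = defaultdict(list)
    for contig_a, start_a, contig_b, end_b in pairs:
        if contig_a == contig_b:
            continue  # both mates on one contig: no scaffolding information
        gap = estimate_gap(insert_size, contig_lengths[contig_a], start_a, end_b)
        if gap >= -tolerance:  # mates lie within the allowed distance
            gaps[(contig_a, contig_b)].append(gap)
    return {link: round(mean(g)) for link, g in gaps.items() if len(g) >= min_links}

# Hypothetical mappings: (contig of read 1, start, contig of read 2, end).
contig_lengths = {"A": 1000, "B": 800, "C": 600}
pairs = [
    ("A", 900, "B", 150),  # implies a gap of 150 bp
    ("A", 850, "B", 100),  # implies a gap of 150 bp
    ("A", 200, "B", 100),  # too far from the end of A: rejected
    ("A", 950, "C", 50),   # only one supporting pair: dropped
]
links = link_contigs(pairs, contig_lengths, insert_size=400)
print(links)  # {('A', 'B'): 150}
```

A longer insert size lets pairs span larger gaps and repeats, which is consistent with the mate-pair libraries giving far fewer, longer scaffolds than the 400 bp paired-end library.
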
Effect of the library and insert size on scaffolding

| Library (insert size, bp) | # Scaffolds | N50 (kb) | Longest scaffold (Mb) | Total length in Mb (actual sequence) | Ns per 100 kb (total Ns) |
|---|---|---|---|---|---|
| MP (3500) | 899 | 217.2 | 1.32 | 65.1 (52.4) | 19.5 kb (12.7 Mb) |
| PE (400) | 3,417 | 50.6 | 0.58 | 52.24 (52.23) | 10.25 b (5,354 b) |
| MP (2500) | 763 | 233.3 | 1.91 | 58.4 (52.4) | 10.3 kb (6.02 Mb) |
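
The "Ns per 100 kb" column is simply the total number of N bases scaled to 100 kb of scaffold length. As a quick consistency check, the short sketch below recomputes it from the total Ns and total lengths reported in the table. The helper name is illustrative, and the inputs are the rounded values from the slide, so results agree only to the precision shown there.

```python
def ns_per_100kb(total_ns, total_length):
    """Number of N bases per 100,000 bases of scaffold sequence."""
    return total_ns / total_length * 100_000

# Values taken from the scaffolding table (bases).
mp3500 = ns_per_100kb(12.7e6, 65.1e6)
pe400 = ns_per_100kb(5354, 52.24e6)
mp2500 = ns_per_100kb(6.02e6, 58.4e6)

print(f"MP (3500): {mp3500 / 1000:.1f} kb of Ns per 100 kb")
print(f"PE (400):  {pe400:.2f} Ns per 100 kb")
print(f"MP (2500): {mp2500 / 1000:.1f} kb of Ns per 100 kb")
```
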
Introduction to GapFiller v1.11 (Boetzer et al. 2012, Genome Biology)
- Closes gaps within previously created scaffolds
- Gaps within scaffolds are runs of unknown nucleotides (Ns)
- The unknown nucleotides are replaced with true nucleotides in an attempt to close each gap
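
Because a gap is defined as a run of Ns, the gaps in a scaffold can be located and counted with a simple scan. The sketch below does this with a regular expression. It only shows how gaps are identified and counted, not how GapFiller fills them, and the function name and example sequence are made up.

```python
import re

def find_gaps(scaffold):
    """Return a list of (start, length) for every run of N bases in a scaffold."""
    return [
        (match.start(), match.end() - match.start())
        for match in re.finditer(r"[Nn]+", scaffold)
    ]

# Hypothetical scaffold with two gaps.
scaffold = "ACGTNNNNACGGTNNACT"
gaps = find_gaps(scaffold)
total_ns = sum(length for _, length in gaps)
print(gaps)      # [(4, 4), (13, 2)]
print(total_ns)  # 6
```

Counting gaps this way before and after a gap-filling run gives the kind of "gaps closed" figure reported in the pipeline results.
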
Gap filling pipeline
1. 1st cycle of gap filling (3 iterations, PE and MP libraries)
   - # scaffolds: 763
   - Total Ns: 737 kb (1280 per 100 kb)
   - Total length (with / without Ns): 57.57 / 56.83 Mb
   - N50: 232.19 kb
   - Longest scaffold: 1.90 Mb
   - Gaps closed: 7831 - 4350 = 3481
2. 2nd cycle of scaffolding (MP libraries)
   - # scaffolds: 459
   - Total Ns: 892 kb (1546 per 100 kb)
   - Total length (with / without Ns): 57.72 / 56.83 Mb
   - N50: 477.4 kb
   - Longest scaffold: 3.76 Mb
3. 2nd cycle of gap filling (8 iterations, PE and MP libraries)
   - # scaffolds: 459
   - Total Ns: 659.5 kb (1138 per 100 kb)
   - Total length (with / without Ns): 57.9 / 57.2 Mb
   - N50: 478.6 kb
   - Longest scaffold: 3.76 Mb
   - Gaps closed: 4635 - 4420 = 215
Gene prediction using QUAST
Summary
- A fairly good genome assembly pipeline
- Highest N50 and longest scaffold (3.7 Mb)
- Lowest number of scaffolds (< 500)
- Fairly low number of Ns
Assembly: all data sets; K=93; M=11; SSPACE

| Statistics without reference | Minia | 1st scaffold | 2nd scaffold |
|---|---|---|---|
| # contigs | 18455 | 831 | 472 |
| # contigs (>= 1000 bp) | 13394 | 734 | 410 |
| # contigs (>= 50000 bp) | - | 290 | 225 |
| Largest contig | 35607 | 2104580 | 3259578 |
| Total length | 48488928 | 57810380 | 58311298 |
| Total length (>= 10000 bp) | 4904746 | 56841305 | 57975028 |
| Total length (>= 50000 bp) | - | 50546624 | 54752354 |
| N50 | 4017 | 182599 | 320808 |
| L50 | 3663 | 76 | 47 |
| GC (%) | 46.71 | 46.81 | 46.8 |
| # N's | - | 5898443 | 6428349 |
Thanks