AGRY-600 Genomics
Genome Assembly
Professors: Dr. Gribskov and Dr. Weil
Group 3: Brett Lane, Amanpreet Kaur, Stefanie Griebel, Yulu Chen, Rupesh Gaire, Akanksha Singh

Genome Assembly Pipeline
Cleaned data as input files
Data Cleaning -> Kmergenie (k-mer size) -> SPAdes (assembly) -> SSPACE -> GapFiller -> REAPR -> QUAST (gene finding)

Results: Cleaning Steps - MP reads

Kmergenie
Kmergenie was used to determine the k-mer size for assembly.
All reads (paired-end and mate-pair) were used to predict the k-mer size.
Best k-mer size predicted: 87
Predicted assembly size: 56.6 Mb

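To illustrate why the choice of k matters, here is a minimal sketch that counts distinct k-mers in a set of reads for several values of k. It is a simplification: Kmergenie itself works from k-mer abundance histograms and a statistical model, whereas this toy version only counts raw distinct substrings. The function name and the toy reads are hypothetical and are not taken from the project data.

```python
def count_distinct_kmers(reads, k):
    """Return the number of distinct substrings of length k across all reads."""
    kmers = set()
    for read in reads:
        # A read shorter than k contributes no k-mers (the range is empty).
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return len(kmers)


# Toy reads for illustration only (not project data).
toy_reads = ["ACGTACGTGA", "CGTACGTGAC", "GTACGTGACC"]

# Compare the number of distinct k-mers for a few candidate k values.
distinct_by_k = {k: count_distinct_kmers(toy_reads, k) for k in (3, 5, 7)}
for k, n in distinct_by_k.items():
    print(f"k={k}: {n} distinct k-mers")
```

On real data the same comparison would be run over the cleaned PE and MP reads, and the k that best balances uniqueness against coverage would be preferred.
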
SPAdes Genome Assembler - Why?
SPAdes is suitable for:
  Illumina reads
  Bacterial and fungal data
  Small genomes rather than large genomes
  Paired-end, mate-pair, and unpaired reads

Assembly Statistics
Program: SPAdes
Data: PE, MP, both unpaired
N50: 70,639 bp (contigs); 74,736 bp (scaffolds)
Longest: 608,380 bp
Contigs/Scaffolds: 3293 (contigs); 3187 (scaffolds)
Total length: 55.9 Mb
Comments: used multiple k-mer values (51, 61, 71, 81, 83, 87); #N's = 2873; 5.14 N's per 100 kbp
SPAdes builds de Bruijn graph assemblies at multiple k-mer sizes rather than a single fixed one, and merges the assemblies from the different k-mer sizes.

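For reference, N50 is the length L such that contigs (or scaffolds) of length L or longer together cover at least half of the total assembly length. A minimal sketch of that calculation follows; the function name and the example lengths are illustrative and are not taken from the assembly itself.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Illustrative lengths only: total is 300, and 100 + 80 = 180 >= 150.
example_lengths = [100, 80, 60, 40, 20]
print(n50(example_lengths))  # prints 80
```

Because N50 depends only on the sorted lengths, joining contigs into longer scaffolds raises it even when no new sequence is added.
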
Scaffolding using SSPACE v3.0
Scaffolding was done with SSPACE v3.0 using the mate-pair reads.

Programme    Scaffolds               N50      Longest contig  Total length  #N’s per 100 kb
SPAdes       8073 (>= 500 bp: 3187)  74.7 kb  608 kb          55.9 Mb       5.14
SSPACE v3.0  5183 (>= 500 bp: 967)   193 kb   1.56 Mb         61.6 Mb       7855.43

Bridging the gaps using GapFiller
The number of N’s was very high after scaffolding: 4,783,167 (7855.43 N’s per 100 kb).
GapFiller was used to fill the gaps using the mate-pair reads.
It reduced the number of N’s to 1254.29 per 100 kb.

                    Scaffolds              N50     Longest contig  Total length  #N’s per 100 kb
Before gap filling  5183 (>= 500 bp: 967)  193 kb  1.56 Mb         61.6 Mb       7855.43
After gap filling                                                                1254.29

GapFiller substantially reduced the gaps.

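The "N's per 100 kb" figure is the count of N characters normalised by assembly length. A minimal sketch of that normalisation is shown below, assuming the scaffolds are available as plain strings; the function name and toy sequences are hypothetical, and the reporting tool may apply additional filters (for example a minimum scaffold length) that are not modelled here.

```python
def ns_per_100kb(sequences):
    """Return the number of N bases per 100,000 bases across all sequences."""
    total_len = sum(len(seq) for seq in sequences)
    if total_len == 0:
        return 0.0
    # Count both upper- and lower-case N (soft-masked sequences).
    n_count = sum(seq.upper().count("N") for seq in sequences)
    return n_count * 100_000 / total_len


# Toy scaffolds for illustration only: 2 N's in 10 bases.
toy_scaffolds = ["ACGTNNACGT"]
print(ns_per_100kb(toy_scaffolds))  # prints 20000.0
```

Running the same calculation before and after gap filling gives a direct measure of how many gap bases were closed.
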
Mapping the PE reads to the Assembly
We used Bowtie 2 to map the paired-end reads to our final assembly.
Aligned concordantly exactly 1 time: 59.58%
Aligned concordantly >1 times: 24.01%
Total aligned concordantly: 83.59%
Overall alignment rate: 99.94%

REAPR
Error-free bases: 85.75%
Total number of errors: 2652
  FCD (fragment coverage distribution) errors within a contig: 615
  FCD errors over a gap: 46
  Low fragment coverage within a contig: 146
  Low fragment coverage over a gap: 1845
With 85.75% of the bases error free, the assembly quality is good.

Gene Prediction
Genes were predicted using QUAST.
No. of predicted genes
  Unique: 17455
  >= 0 bp: 107219
  >= 300 bp: 22348
  >= 1500 bp: 1241
  >= 3000 bp: 77

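The length thresholds above are cumulative: every gene counted at >= 3000 bp is also counted at >= 1500 bp, >= 300 bp, and >= 0 bp. A minimal sketch of that kind of tally is given below, assuming a plain list of predicted gene lengths; the function name and toy lengths are hypothetical and do not reproduce the reported counts.

```python
def count_at_least(gene_lengths, thresholds=(0, 300, 1500, 3000)):
    """Return {threshold: number of genes with length >= threshold}."""
    return {
        t: sum(1 for length in gene_lengths if length >= t)
        for t in thresholds
    }


# Toy gene lengths in bp, for illustration only.
toy_gene_lengths = [100, 350, 1600, 3200]
print(count_at_least(toy_gene_lengths))
# prints {0: 4, 300: 3, 1500: 2, 3000: 1}
```
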
Conclusion
Good assembly
N50: 193 kb
Longest contig: 1.56 Mb
Few gaps remain after gap filling
85.75% of the bases are error free
The only concern is the total length of the genome (61.6 Mb), which exceeds the 56.6 Mb predicted by Kmergenie.

Thanks