Toward Next Generation Biodiversity Research

Morne Du Plessis1,2, Monica Mwale1, Essa Suleman1, Emily Mitchell1, Kim Labuschagne1, Desire Dalton1,3 and Antoinette Kotze1,4

1 National Zoological Gardens of South Africa, Research and Scientific Services Department, Pretoria, 0001
2 Department of Biotechnology, University of the Western Cape, Bellville, 7530
3 Department of Zoology, University of Venda, Thohoyandou, South Africa
4 Genetics Department, University of the Free State, Bloemfontein, South Africa

Introduction
Merging NGS + eDNA + Bioinformatics = Next Generation Biodiversity assessment

Techniques: variations of metagenomics approaches
- Microbiome analysis: 16S
- Barcoding
  - Animals: COI
  - Plants: rbcL, matK, trnH-psbA, ITS
- Shotgun sequencing: direct environmental sequencing
- Transcriptome analysis

eDNA
- Direct from the environment (soil, plant matter, animal matter, water)
- Indirect from the environment: fecal matter (the host and what it consumes)
- Indirect: parasites or insects that feed on other animals (e.g. blood-meal analysis)

The Technology – Why the hype
- Capillary electrophoresis: the gold standard for sequencing, generating on average 800 bp per sequence
- NGS on the Ion S5: massively parallel sequencing that can generate 200 bp, 400 bp or 600 bp reads
- Can therefore generate up to 15 Gb of sequence data (15 000 000 000 bp)
- Additionally has a significant capacity for multiplexing

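To put the figures above in perspective, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: it assumes the full 15 Gb is available at every read length, which may not hold for a given chip or kit, and the function and variable names are made up for the example.

```python
# Back-of-the-envelope comparison using only the figures quoted on the slide.
# Assumption: the full 15 Gb output is reached regardless of read length.
TOTAL_OUTPUT_BP = 15_000_000_000      # up to 15 Gb of sequence data
CAPILLARY_READ_BP = 800               # average capillary read length
NGS_READ_LENGTHS_BP = (200, 400, 600)


def reads_needed(total_bp: int, read_length_bp: int) -> int:
    """Number of reads of a given length needed to reach total_bp (rounded up)."""
    return -(-total_bp // read_length_bp)


# How many capillary sequences would add up to the same volume of data?
capillary_equivalent = reads_needed(TOTAL_OUTPUT_BP, CAPILLARY_READ_BP)

# How many NGS reads of each length does 15 Gb correspond to?
ngs_reads = {
    length: reads_needed(TOTAL_OUTPUT_BP, length)
    for length in NGS_READ_LENGTHS_BP
}

print(f"Capillary reads for 15 Gb: {capillary_equivalent:,}")
for length, count in ngs_reads.items():
    print(f"{length} bp reads for 15 Gb: {count:,}")
```

Under these assumptions, matching 15 Gb would take 18 750 000 capillary sequences of 800 bp each.
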
The Technology – How is this ridiculously large volume of data possible
- The chip contains millions of wells
- How it actually works:
  https://biosci-batzerlab.biology.lsu.edu/Genomics/documentation/S5_vs_S5xl_LinkedIn_post.pdf
  http://www.anthonybaldor.com/semiconductor-sequencing-ion-torrent/

Data Analysis – What sort of data is generated and how much

The Technology – Getting maximum value
Starting material: DNA from Site 1, Site 2 and Site 3
1. Shear the DNA of each sample separately
2. Size select each sample separately
3. Prepare a library for each sample separately
4. Add a unique barcode to each sample: barcode A to Site 1, barcode B to Site 2, barcode C to Site 3
5. Merge the barcoded libraries and sequence them together
6. Separate the reads according to barcode: sequences from Site 1, from Site 2 and from Site 3
Barcodes = short sequences that uniquely tag your respective experiments

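The final "separate according to barcode" step can be illustrated with a minimal Python sketch. The barcode sequences, the reads and the function name `demultiplex` are all hypothetical; real kits define their own barcodes, and production tools usually tolerate sequencing errors in the barcode, which this exact-match version does not.

```python
from collections import defaultdict

# Hypothetical barcodes for illustration only; real kits define their own.
BARCODES = {
    "ACGT": "Site 1",   # barcode A
    "TGCA": "Site 2",   # barcode B
    "GATC": "Site 3",   # barcode C
}


def demultiplex(reads, barcodes):
    """Group pooled reads by their leading barcode and strip the barcode.

    Reads that start with no known barcode are collected under "unassigned".
    Matching is exact: no mismatches are tolerated in this sketch.
    """
    by_sample = defaultdict(list)
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                by_sample[sample].append(read[len(barcode):])
                break
        else:
            by_sample["unassigned"].append(read)
    return dict(by_sample)


# Made-up pooled reads, as they might come off a single multiplexed run.
pooled_reads = ["ACGTTTGACC", "TGCAGGATTA", "GATCCCATGA", "ACGTAAACCC", "NNNNACGTAC"]
binned = demultiplex(pooled_reads, BARCODES)

for sample, sequences in binned.items():
    print(sample, sequences)
```

Keeping an "unassigned" bin makes it easy to see how many reads were lost to unreadable barcodes.
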
The Technology – What does the data look like

Data Analysis – What happens downstream
E.g. shotgun sequencing of environments (getting an idea of the diversity we might encounter)
1. Perform QC of sequences
2. Trim sequences
3. Redo QC
4. Optimize assembly of sequences
5. Generate assemblies
6. Merge into larger scaffolds
7. Group by similarity
8. Generate a reference database for comparison
9. Align scaffolds to references
10. Annotate the aligned hits
11. Categorize
12. Evaluate abundance and diversity

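The last step, evaluating abundance and diversity, can be sketched once reads or scaffolds have been assigned to categories. The slides do not name a specific measure, so the Shannon index used here is an assumption chosen because it is widely used, and the taxon counts are invented for illustration.

```python
import math


def relative_abundance(counts):
    """Convert raw counts per taxon into proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {taxon: 0.0 for taxon in counts}
    return {taxon: n / total for taxon, n in counts.items()}


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln(p_i)) over taxa with non-zero counts."""
    proportions = relative_abundance(counts).values()
    return -sum(p * math.log(p) for p in proportions if p > 0)


# Invented counts of annotated scaffolds per taxon at one site.
site_counts = {"Taxon A": 50, "Taxon B": 30, "Taxon C": 20}

print(relative_abundance(site_counts))
print(f"Shannon index: {shannon_index(site_counts):.3f}")
```

A community dominated by one taxon scores close to 0, while an even community of N taxa reaches the maximum of ln(N).
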
Data Analysis – Checking quality and cleaning
[Figure panels: Before, Trim Sequences, After]

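A minimal sketch of what trimming does to a single read is given below. It assumes FASTQ-style quality strings in the common Phred+33 encoding and a cutoff of Q20; both the cutoff and the function names are illustrative assumptions, not the settings used for the data shown on the slide.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20):
    """Remove low-quality bases from the 3' end of a read.

    Bases are dropped from the end until one with a Phred score of at least
    min_quality is found. The Q20 default is an assumed, illustrative cutoff.
    """
    scores = phred_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


# Made-up read: 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
read_sequence = "ACGTACGTAC"
read_quality = "IIIIIIII##"

trimmed_sequence, trimmed_quality = trim_3prime(read_sequence, read_quality)
print(trimmed_sequence, trimmed_quality)
```

Dedicated trimming tools typically also use sliding windows and adapter removal; this sketch only covers the simplest 3' end case.
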
The Analysis - Assembly of sequence reads

The Analysis – Making biological sense
Assembled sequences:
- Contig A
- Contig B
- Contig C
Reference database:
- Gene K from Org A
- Gene L from Org B
- Whole genome from Org C
- Mitochondrial genome from Org D
- Raw sequence data from Org E
Annotated sequences:
- Contig A = Organism A
- Contig B = Organism C
- Contig C = Organism E

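One simple way to go from alignment hits to an annotation is to keep the best-scoring reference for each contig. The sketch below assumes a hypothetical list of (contig, organism, score) hits; the scores are invented so that the result mirrors the slide's example, and real pipelines would usually apply identity and coverage thresholds as well.

```python
def best_hits(hits):
    """Return the organism of the highest-scoring hit for each contig.

    hits is an iterable of (contig, organism, score) tuples; this structure
    is an assumption made for the sketch, not a real aligner output format.
    """
    best = {}
    for contig, organism, score in hits:
        if contig not in best or score > best[contig][1]:
            best[contig] = (organism, score)
    return {contig: organism for contig, (organism, _score) in best.items()}


# Invented alignment scores against the reference database.
hits = [
    ("Contig A", "Organism A", 950.0),
    ("Contig A", "Organism B", 310.0),
    ("Contig B", "Organism C", 1200.0),
    ("Contig C", "Organism E", 640.0),
    ("Contig C", "Organism D", 600.0),
]

annotation = best_hits(hits)
for contig, organism in annotation.items():
    print(f"{contig} = {organism}")
```
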
The Analysis – Bioinformatics resources
- Parallel processing
- NZG bioinformatics server and storage server
- Also use the Centre for High Performance Computing (CHPC) - DST

What happens to all of the data
- The raw sequence data is typically stored in the NCBI SRA (Sequence Read Archive)
- All assembled genes, molecules (e.g. mitochondria) and genomes go to the NCBI nucleotide / genome databases
- The incidental assembled barcode data will feed back into the relevant barcode projects
- Specialised databases are evolving, e.g. Qiita for managing microbial studies (microbiomes)
- We also keep back-ups of all datasets on our server at NZG

Summary
What do we have / what can we supply:
- Access to:
  - NGS resources
  - Environmental samples / sampling
  - Bioinformatics resources
- Bioinformatics training for students
- Expertise in related studies and techniques
What we need next:
- Understand the requirements of SANBI in terms of the diversity assessment
- Evaluate which Next Generation strategies are possible and feasible
- Strengthen partnerships: shared environments and shared spp.
- Build bioinformatics capacity in terms of students
- Benchmark the next generation strategies against traditional ones
- Adequate mathematical and statistical models to accurately reflect biodiversity

Acknowledgements
NRF, SANBI, DST, NZG