Considerations for metagenomics data analysis and summary of workflows
Alex Mitchell (mitchell@ebi.ac.uk)
My background Doctorate in pharmacology (1995-1998) Post-doc in molecular biology (1998-2001) Bioinformatics research (2001-2011) Co-ordinator for InterPro and EBI metagenomics databases (2011-)
Overview What is metagenomics? Different types of metagenomic studies Challenges and considerations for metagenomic data analyses
What is metagenomics? “Metagenomics” literally means ‘beyond genomics’. The term “metagenome” was used by Handelsman et al. in 1998 to describe the “collective genomes of soil microflora”. Two working definitions: “Metagenomics is the study of metagenomes, genetic material recovered directly from environmental samples” and “Metagenomics is the study of all genomes present in any given environment without the need for prior individual identification or amplification”.
Sampling from the environment → filtering step → extraction of DNA → sequencing
Quality control → taxonomic analysis (16S rRNA, 18S rRNA, ITS, etc.) → functional analysis (identification and characterisation of protein-coding sequences); a toy sketch of the taxonomic step follows below
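As a toy illustration of the taxonomic step only (not any production classifier; the marker fragments and taxon names below are invented), reads can be assigned to taxa by matching marker-gene fragments against a reference table. Real tools use alignment- or k-mer-based methods against curated 16S databases:

```python
# Toy taxonomic assignment: exact substring match of reads against a
# tiny, invented reference of 16S marker fragments. Real classifiers
# work against curated databases and tolerate sequencing error; this
# sketch is for illustration only.

reference = {  # hypothetical marker fragment -> taxon label
    "ACGTACGTGGCC": "Taxon_A",
    "TTGACGGAATCC": "Taxon_B",
}

def classify(read: str) -> str:
    """Return the taxon whose marker fragment occurs within the read."""
    for fragment, taxon in reference.items():
        if fragment in read:
            return taxon
    return "unclassified"

reads = ["GGACGTACGTGGCCTT", "AATTGACGGAATCCGA", "CCCCCCCC"]
for r in reads:
    print(r, "->", classify(r))  # the third read falls through to 'unclassified'
```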
Why is metagenomics exciting? An estimated >90% of microbes are not culturable by current techniques, so they cannot be accessed by traditional sequencing methods; NGS techniques allow the reading of DNA from uncloned samples. For example, the human body contains ~10^14 human cells and ~10^15 microbial cells, i.e. roughly 90% of the cells in your body are not your own (10^15 microbial cells out of ~1.1 × 10^15 total), with a massive impact on understanding health and disease.
Applications of taxonomic analyses Identification of new species Diversity analysis Comparing populations from different sites or states
Applications of functional analyses Bioprospecting for novel sequences with functional applications Reconstruction of pathways present in the community Comparing functional activities from different sites or states
Why is metagenomics challenging? Short sequence fragments are hard to characterise, and assembly can lead to chimeras. Iddo Friedberg: ‘Metagenomics is like a disaster in a jigsaw shop’: millions of different pieces, thousands of different puzzles, all mixed together; most of the pieces are missing, and there are no boxes to refer to.
Limitations and pitfalls Data used for analysis can have limitations: 16S rRNA genes have limited resolving power and are subject to copy-number variation; viral sequences currently have no gold-standard reference database; protist sequences have little experimentally derived annotation of protein function in public databases.
Additional pitfalls Different functional and taxonomic analysis tools can give different results The same tools can give different results depending on the version and underlying algorithm (e.g., HMMER2 vs HMMER3) The same version of the same tools can give different results depending on the reference database used
Reference databases
Scientific workflows A number of linked components, each responsible for a small fragment of the overall functionality; workflows allow the description, management and sharing of scientific analyses.
Scientific workflows Analogous to laboratory protocols: lyse cells → remove proteins → precipitate DNA → re-suspend DNA → … (recording buffers, enzymes, volumes, temperatures, times, centrifugation steps)
Scientific workflows Analogous to laboratory protocols: quality control of reads → assembly into contigs → prediction of rRNAs → prediction of protein-coding genes → annotation → … (recording software used, versions, settings, reference databases); a provenance-recording sketch follows below
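To make the ‘record software, versions, settings and reference databases’ point concrete, here is a minimal sketch, assuming a generic POSIX environment; the commands in `steps` are placeholders for whatever QC, assembly and annotation tools your pipeline actually runs. Each step is executed and its exact invocation is appended to a JSON provenance log:

```python
import json
import subprocess
from datetime import datetime, timezone

# Placeholder commands: substitute the real QC/assembly/annotation tools,
# versions and reference databases used in your own pipeline.
steps = [
    {"name": "quality_control", "cmd": ["echo", "qc step"]},
    {"name": "assembly",        "cmd": ["echo", "assembly step"]},
    {"name": "annotation",      "cmd": ["echo", "annotation step"]},
]

provenance = []
for step in steps:
    started = datetime.now(timezone.utc).isoformat()
    result = subprocess.run(step["cmd"], capture_output=True, text=True)
    provenance.append({
        "step": step["name"],
        "command": " ".join(step["cmd"]),  # the exact invocation, for reproducibility
        "started": started,
        "return_code": result.returncode,
    })

# The log captures what was run, when, and how: the computational
# analogue of recording buffers, volumes and temperatures in the lab.
with open("provenance.json", "w") as fh:
    json.dump(provenance, fh, indent=2)
```

In practice you would also capture each tool’s reported version and the release number of every reference database, since, as noted earlier, the same tool can give different results across versions and databases.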
Tools to create, manage and automate workflows GALAXY: https://usegalaxy.org/ TAVERNA: http://www.taverna.org.uk/ Pipeline Pilot: http://accelrys.com/products/pipeline-pilot/
Tools to create, manage and automate workflows Offer a range of components (installed locally or called via web services) that are connected together into workflows using a drag-and-drop GUI
Workflows CloVR - a virtual machine for automated and portable sequence analysis (http://clovr.org/) Range of protocols for microbial genomics Specific protocol (CloVR-Metagenomics) for taxonomic and functional classifications of metagenomics shotgun sequence data
Workflows The CloVR-Metagenomics protocol http://clovr.org/methods/clovr-metagenomics/
Other considerations: data analysis speed Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program. Available at: www.genome.gov/sequencingcosts
Data analysis speed The cost of sequencing has fallen dramatically (‘Now I can do metagenomics! Awesome!’), but the amount of sequence generated has increased ~5,000-fold while computational speed has increased only ~10-fold, so the time taken to analyse has increased ~500-fold (5,000 / 10 = 500). ‘$@%*!!!’
Data analysis cost [figure: cost breakdowns at ~80 bp/$ versus ~2m bp/$, showing that as sequencing gets cheaper, downstream data analysis accounts for a growing share of the total cost; Sboner et al., Genome Biology (2011) 12:125]
What data to store? Raw sequence data? Important for metagenomics, as some samples are hard to replicate, but file sizes are large. Analysis results? Easiest to repeat, although repeating takes time and requires keeping track of analysis steps and versions. Data description, including metadata? Essential: what, where, who, how and when; if absent, the raw data have very limited usefulness.
The importance of metadata Metadata is the in-depth, controlled description of the sample your sequence data came from: How was it sampled? How was it extracted? How was it stored? What sequencing platform was used? Where did it come from? What were the environmental conditions (lat/long, depth, pH, salinity, temperature, …) or clinical observations?
The importance of metadata If metadata is adequately described, using a standardised vocabulary, querying and interpretation across projects becomes possible: show the microbial species found in the North Pacific … at depths of 50–100 m … in samples taken May–June … compared to the Indian Ocean, under the same conditions (a sketch of such a query follows below)
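A minimal sketch of why this matters (the records, field names and values below are invented for illustration; real submissions would follow a standardised checklist with controlled vocabularies, such as the GSC’s MIxS standards):

```python
from datetime import date

# Invented example records; real metadata would use controlled
# vocabularies so that values are comparable across projects.
samples = [
    {"id": "S1", "region": "North Pacific", "depth_m": 75,
     "collection_date": date(2011, 5, 20), "taxa": {"Taxon_A", "Taxon_B"}},
    {"id": "S2", "region": "North Pacific", "depth_m": 500,
     "collection_date": date(2011, 6, 2),  "taxa": {"Taxon_C"}},
    {"id": "S3", "region": "Indian Ocean", "depth_m": 60,
     "collection_date": date(2011, 5, 28), "taxa": {"Taxon_B", "Taxon_D"}},
]

def query(region, min_depth, max_depth, months):
    """Species observed in `region` between the given depths, in the given months."""
    taxa = set()
    for s in samples:
        if (s["region"] == region
                and min_depth <= s["depth_m"] <= max_depth
                and s["collection_date"].month in months):
            taxa |= s["taxa"]
    return taxa

# The query from the slide: North Pacific, 50-100 m, May-June,
# compared with the Indian Ocean under the same conditions.
print(sorted(query("North Pacific", 50, 100, {5, 6})))  # ['Taxon_A', 'Taxon_B']
print(sorted(query("Indian Ocean", 50, 100, {5, 6})))   # ['Taxon_B', 'Taxon_D']
```

Free-text variants such as ‘Pacific’, ‘N. Pacific’ and ‘pacific ocean’ would silently break this comparison, which is why a controlled, standardised vocabulary is essential.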
Considerations: storing data Where are you going to store this? Locally (back-up? long term? sharing? access?); Amazon, Google or specialist research clouds; public repositories, such as ENA, NCBI or DDBJ
Public repositories Free! Secure long term storage No need for local infrastructure Enforced compliance: Publisher requirements (accession numbers) Institutional requirements Funder requirements Data are more useful: Data are reusable and can be discovered by others Available for re- and meta-analyses
Considerations: moving data Transferring a 100 Gb NGS data file across the internet: ‘normal’ network bandwidth (1 Gigabit/s) ~1 week*; high-speed bandwidth (10 Gigabit/s) <1 day*. Traditional methods (e.g., physically shipping disks) may be the most effective! A back-of-envelope estimator follows below. *Stein, Genome Biol. (2010) 11:207
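A back-of-envelope helper for this kind of planning, as a sketch; the effective throughputs below are assumptions to be replaced with measured values, and the file is treated as 100 gigabytes. Sustained end-to-end throughput is typically a small fraction of the nominal line rate, which is how a nominally fast link can still take a week to move such a file:

```python
def transfer_days(size_gb: float, effective_mbit_per_s: float) -> float:
    """Days to move a file of `size_gb` gigabytes at a sustained effective throughput."""
    bits = size_gb * 8e9                           # gigabytes -> bits
    seconds = bits / (effective_mbit_per_s * 1e6)  # Mbit/s -> bit/s
    return seconds / 86400                         # seconds -> days

# Assumed sustained end-to-end throughputs, far below nominal line rates:
print(f"{transfer_days(100, 1.5):.1f} days")  # ~6.2 days: the '~1 week' regime
print(f"{transfer_days(100, 10):.1f} days")   # ~0.9 days: the '<1 day' regime
```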
Metagenomics portals http://www.ebi.ac.uk/metagenomics http://metagenomics.anl.gov/ http://img.jgi.doe.gov/ http://camera.calit2.net/
What do metagenomics portals offer? Submit data Tools to help transfer data Tools to help capture & store metadata Sequence archiving Quality filtering of sequences Sequence analysis (prebuilt workflows) Visualisation/Interpretation
Planning your experiments: things to consider Think data volumes: metagenomic data files can be vast (tens of Gb per file); how are you going to store and manipulate them? Plan ahead: in a short time-frame, data volumes could be orders of magnitude higher. Think timescales: data analysis times >> data generation times.
Planning your experiments: things to consider Consider economics: direct costs, opportunity costs, productivity. Think communication and collaboration: biologists; bioinformaticians and service providers; computer scientists.
Your views, concerns, frustrations…