Considerations for metagenomics data analysis and summary of workflows

Alex Mitchell

My background
Doctorate in pharmacology (1995-1998)
Post-doc in molecular biology ( )
Bioinformatics research ( )
Co-ordinator for InterPro and EBI metagenomics databases (2011-)

Overview
What is metagenomics?
Different types of metagenomic studies
Challenges and considerations for metagenomic data analyses

What is metagenomics?
“Metagenomics” literally means ‘beyond genomics’
“Metagenome” was used by Handelsman et al. in 1998 to describe the “collective genomes of soil microflora”
“Metagenomics is the study of metagenomes, genetic material recovered directly from environmental samples.”
“Metagenomics is the study of all genomes present in any given environment without the need for prior individual identification or amplification”

Sampling from environment
Filtering step
Extraction of DNA
Sequencing

Quality control
Taxonomic analysis: 16S rRNA, 18S rRNA, ITS etc.
Functional analysis: identification and characterisation of protein coding sequences

Why is metagenomics exciting?
Estimated >90% of microbes are not culturable by current techniques
  Can’t be accessed by traditional sequencing methods
NGS techniques allow the reading of DNA from uncloned samples
For example: the human body contains ~10^14 human cells and ~10^15 microbial cells
  i.e. 90% of the cells in your body are not your own
  Massive impact on understanding health and disease

Applications of taxonomic analyses
Identification of new species
Diversity analysis
Comparing populations from different sites or states

Applications of functional analyses
Bioprospecting for novel sequences with functional applications
Reconstruction of pathways present in the community
Comparing functional activities from different sites or states

Why is metagenomics challenging?
Short sequence fragments are hard to characterise
Assembly can lead to chimeras
Iddo Friedberg: ‘Metagenomics is like a disaster in a jigsaw shop’
  Millions of different pieces
  Thousands of different puzzles
  All mixed together
  Most of the pieces are missing
  No boxes to refer to

Limitations and pitfalls
Data used for analysis can have limitations:
  16S rRNA genes: limited resolving power and subject to copy number variation
  Viral sequences: currently no gold-standard reference database
  Protist sequences: little experimentally-derived annotation of protein function in public databases

Additional pitfalls
Different functional and taxonomic analysis tools can give different results
The same tool can give different results depending on the version and underlying algorithm (e.g., HMMER2 vs HMMER3)
The same version of the same tool can give different results depending on the reference database used

Reference databases

Scientific workflows
A number of linked components
Each component is responsible for a small fragment of overall functionality
Allow the description, management, and sharing of scientific analyses
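The idea of small linked components can be sketched in a few lines of Python. This is only an illustration, not how any particular workflow system is implemented: the component names (`drop_short_reads`, `uppercase`, `deduplicate`), the `run_workflow` helper and the toy reads are all assumptions made up for the example.

```python
from functools import reduce
from typing import Callable, List

Reads = List[str]
Component = Callable[[Reads], Reads]


def drop_short_reads(reads: Reads, min_length: int = 4) -> Reads:
    """Keep only reads of at least min_length bases (hypothetical cut-off)."""
    return [r for r in reads if len(r) >= min_length]


def uppercase(reads: Reads) -> Reads:
    """Normalise case so identical sequences compare equal."""
    return [r.upper() for r in reads]


def deduplicate(reads: Reads) -> Reads:
    """Remove exact duplicates while preserving first-seen order."""
    return list(dict.fromkeys(reads))


def run_workflow(components: List[Component], reads: Reads) -> Reads:
    """Pass the output of each component to the next one in the list."""
    return reduce(lambda data, component: component(data), components, reads)


# Toy input, invented for illustration only.
result = run_workflow(
    [drop_short_reads, uppercase, deduplicate],
    ["acgt", "ac", "ACGT", "ggcca"],
)
print(result)
```

Because each component only takes and returns a list of reads, components can be reordered, swapped or shared without touching the others.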

Scientific workflows
Analogous to laboratory protocols
Lyse cells
Remove proteins
Precipitate DNA
Re-suspend DNA
… buffers, enzymes, volumes, temperatures, times, centrifugation steps

Scientific workflows
Analogous to laboratory protocols
Quality control of reads
Assembly into contigs
Prediction of rRNAs
Prediction of protein coding genes
Annotation
… software used, versions, settings, reference databases
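One way to write such a protocol down is as a machine-readable record of each step. The sketch below is a minimal illustration under stated assumptions: the `WorkflowStep` class, the tool names, version strings, settings and database names are all hypothetical placeholders, not real software or releases.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class WorkflowStep:
    name: str                                   # what the step does
    tool: str                                   # software used (placeholder names below)
    version: str                                # exact version, since results can differ
    settings: Dict[str, str] = field(default_factory=dict)
    reference_db: Optional[str] = None          # database and release, if any


# Step names follow the list above; everything else is invented.
workflow: List[WorkflowStep] = [
    WorkflowStep("Quality control of reads", "example-qc", "1.0",
                 {"min_quality": "20"}),
    WorkflowStep("Assembly into contigs", "example-assembler", "2.3"),
    WorkflowStep("Prediction of rRNAs", "example-rrna-finder", "0.9",
                 reference_db="example-rRNA-db release 1"),
    WorkflowStep("Prediction of protein coding genes", "example-gene-caller", "1.4"),
    WorkflowStep("Annotation", "example-annotator", "3.1",
                 reference_db="example-protein-db release 42"),
]


def describe(steps: List[WorkflowStep]) -> str:
    """Serialise the workflow so it can be stored alongside the results."""
    return json.dumps([asdict(step) for step in steps], indent=2)


protocol = describe(workflow)
print(protocol)
```

Storing a record like this with the results makes it possible to tell later which tool version and reference database produced them.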

Tools to create, manage and automate workflows
GALAXY: TAVERNA: Pipeline Pilot:

Tools to create, manage and automate workflows
Offer a range of components (installed locally or called via web services) Connected together to create workflows using a drag and drop GUI

Workflows
CloVR: a virtual machine for automated and portable sequence analysis
Range of protocols for microbial genomics
Specific protocol (CloVR-Metagenomics) for taxonomic and functional classifications of metagenomics shotgun sequence data

Workflows The CloVR-Metagenomics protocol

Other considerations: data analysis speed
Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Available at:

Data analysis speed
The cost of sequencing has really gone down
Now I can do metagenomics! Awesome!
Amount of sequence generated has increased 5,000-fold
Computational speed has increased only 10-fold
Time taken to analyse has increased 500-fold
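The 500-fold figure follows from the other two numbers if analysis time is assumed to scale linearly with data volume and inversely with computational speed. That scaling is a simplifying assumption made for this sketch, and the function name is invented for the example.

```python
def analysis_time_fold_change(data_fold: float, compute_fold: float) -> float:
    """Fold change in analysis time, assuming time ~ data volume / compute speed."""
    return data_fold / compute_fold


# Figures quoted above: 5,000-fold more sequence, 10-fold faster computation.
fold = analysis_time_fold_change(5000, 10)
print(f"Analysis takes about {fold:.0f} times longer")
```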

Data analysis cost
[Figure: percentage breakdown of costs at ~80 bp/$ compared with ~2m bp/$. Sboner et al. Genome Biology (2011) 12:125]

What data to store?
Raw sequence data
  Important for metagenomics as some samples are hard to replicate
  Large file sizes
Analysis results?
  Easiest to repeat, although it takes time & requires keeping track of analysis steps and versions
Data description including metadata
  Essential: what, where, who, how and when
  If absent, raw data have very limited usefulness

The importance of metadata
Metadata includes the in-depth, controlled description of the sample that your sequence was taken from
How was it sampled?
How was it extracted?
How was it stored?
What sequencing platform was used?
Where did it come from?
What were the environmental conditions (lat/long, depth, pH, salinity, temperature…) or clinical observations?
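The questions above can be turned into a simple completeness check before data are archived. The sketch below is illustrative only: the field names in `REQUIRED_FIELDS` are assumptions chosen to mirror the questions, not a published metadata standard, and the example record contains invented values.

```python
from typing import Dict, List

# Hypothetical field names, one per question in the list above.
REQUIRED_FIELDS = (
    "sampling_method",
    "extraction_method",
    "storage",
    "sequencing_platform",
    "location",
    "environmental_conditions",
)


def missing_metadata(record: Dict[str, object]) -> List[str]:
    """Return the required fields that are absent or empty in a sample record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Invented example record; "storage" has deliberately been left out.
sample = {
    "sampling_method": "water sample, filtered",
    "extraction_method": "example extraction kit",
    "sequencing_platform": "example platform",
    "location": {"lat": 30.0, "long": -150.0},
    "environmental_conditions": {"depth_m": 75, "pH": 8.1},
}

missing = missing_metadata(sample)
print("Missing metadata fields:", missing)
```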

The importance of metadata
If metadata is adequately described, using a standardised vocabulary, querying and interpretation across projects becomes possible
Show the microbial species found in the North Pacific
… at depths of 50-100 m
… in samples taken May-June
… compared to the Indian Ocean, under the same conditions
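A query like the one above only works when every sample carries the same, consistently named fields. The following sketch shows the idea on in-memory records; the `Sample` class, its field names, the `species_found` helper and all sample values (including the `taxon_*` labels) are invented for illustration and do not describe any real repository's schema or data.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, Iterable, Set


@dataclass(frozen=True)
class Sample:
    sample_id: str
    region: str            # controlled vocabulary, e.g. "North Pacific"
    depth_m: float
    collected: date
    species: FrozenSet[str]


def species_found(samples: Iterable[Sample], region: str,
                  min_depth: float, max_depth: float,
                  months: Set[int]) -> Set[str]:
    """Union of species over samples matching region, depth range and month."""
    found: Set[str] = set()
    for s in samples:
        if (s.region == region
                and min_depth <= s.depth_m <= max_depth
                and s.collected.month in months):
            found |= s.species
    return found


# Invented records, for illustration only.
samples = [
    Sample("S1", "North Pacific", 75, date(2012, 5, 20), frozenset({"taxon_A", "taxon_B"})),
    Sample("S2", "North Pacific", 300, date(2012, 6, 1), frozenset({"taxon_C"})),
    Sample("S3", "Indian Ocean", 60, date(2012, 6, 10), frozenset({"taxon_B", "taxon_D"})),
    Sample("S4", "North Pacific", 80, date(2012, 11, 2), frozenset({"taxon_E"})),
]

pacific = species_found(samples, "North Pacific", 50, 100, {5, 6})
indian = species_found(samples, "Indian Ocean", 50, 100, {5, 6})
print("Only in North Pacific:", sorted(pacific - indian))
print("Shared:", sorted(pacific & indian))
```

If one project recorded depth in feet, or spelled the region differently, the same query would silently miss its samples, which is the practical argument for a standardised vocabulary.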

Considerations: storing data
Where are you going to store this?
Locally: back-up? long term? sharing? access?
Amazon, Google or specialist research clouds
Public repositories, such as ENA, NCBI or DDBJ

Public repositories
Free!
Secure long term storage
No need for local infrastructure
Enforced compliance:
  Publisher requirements (accession numbers)
  Institutional requirements
  Funder requirements
Data are more useful:
  Data are reusable and can be discovered by others
  Available for re- and meta-analyses

Considerations: moving data
Transferring a 100 Gb NGS data file across the internet
'Normal' network bandwidth (1 Gigabit/s): ~ 1 week*
High-speed bandwidth (10 Gigabit/s): < 1 day*
Traditional methods may be the most effective!
* Stein, Genome Biol. (2010) 11:207
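Transfer times can be estimated for your own connection with a short calculation. The sketch below is a rough estimate under stated assumptions: it reads "100 Gb" as 100 gigabytes, it takes the sustained end-to-end throughput as an input (which over the internet can be far below the nominal link bandwidth), and the two rates in the example are hypothetical values, not measurements or figures from the cited paper.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def transfer_time_seconds(size_bytes: int, sustained_bits_per_second: float) -> float:
    """Time to move size_bytes at a constant sustained throughput."""
    return size_bytes * 8 / sustained_bits_per_second


# Assumption: the 100 Gb file is 100 gigabytes (10**9 bytes each).
FILE_SIZE_BYTES = 100 * 10**9

# Hypothetical sustained throughputs, chosen only to show the scaling.
for label, rate in [("1 Mbit/s sustained", 1e6), ("10 Mbit/s sustained", 1e7)]:
    days = transfer_time_seconds(FILE_SIZE_BYTES, rate) / SECONDS_PER_DAY
    print(f"{label}: {days:.1f} days")
```

Measuring the sustained rate between the two sites on a small test file first gives a more realistic input than the nominal bandwidth of either network.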

Metagenomics portals http://www.ebi.ac.uk/metagenomics

What do metagenomics portals offer?
Submit data
Tools to help transfer data
Tools to help capture & store metadata
Sequence archiving
Quality filtering of sequences
Sequence analysis (prebuilt workflows)
Visualisation/Interpretation

Planning your experiments: things to consider
Think data volumes
  Metagenomic data files can be vast (10s of Gbs/file): how are you going to store and manipulate them?
Plan ahead
  In a short time-frame, data volumes could be orders of magnitude higher
Think timescales
  Data analysis times >> data generation times

Planning your experiments: things to consider
Consider economics
  Direct costs & opportunity costs
  Productivity
Think communication and collaboration
  Biologists
  Bioinformaticians & service providers
  Computer scientists

Your views, concerns, frustrations…
