Published by Pierce Powell. Modified over 6 years ago.
1
Considerations for metagenomics data analysis and summary of workflows
Alex Mitchell
2
My background
Doctorate in pharmacology (1995-1998)
Post-doc in molecular biology ( )
Bioinformatics research ( )
Co-ordinator for InterPro and EBI metagenomics databases (2011-)
4
Overview
What is metagenomics?
Different types of metagenomic studies
Challenges and considerations for metagenomic data analyses
5
What is metagenomics?
“Metagenomics” means literally ‘beyond genomics’
“Metagenome” was used by Handelsman et al. in 1998 to describe the “collective genomes of soil microflora”
“Metagenomics is the study of metagenomes, genetic material recovered directly from environmental samples.”
“Metagenomics is the study of all genomes present in any given environment without the need for prior individual identification or amplification”
6
Sampling from environment
Filtering step
Extraction of DNA
Sequencing
7
Quality control
Taxonomic analysis: 16S rRNA, 18S rRNA, ITS etc.
Functional analysis: identification and characterisation of protein coding sequences
8
Why is metagenomics exciting?
Estimated >90% of microbes are not culturable by current techniques
Can’t be accessed by traditional sequencing methods
NGS techniques allow the reading of DNA from uncloned samples
For example: the human body contains ~10^14 human cells and ~10^15 microbial cells
i.e. 90% of the cells in your body are not your own
Massive impact on understanding health and disease
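The 90% figure follows directly from the two cell counts quoted on the slide. A minimal arithmetic check, taking those order-of-magnitude estimates at face value (the variable names are illustrative only):

```python
# Cell counts as quoted on the slide (order-of-magnitude estimates).
human_cells = 1e14
microbial_cells = 1e15

# Share of all cells in the body that are microbial.
microbial_fraction = microbial_cells / (human_cells + microbial_cells)
print(f"Microbial share of all cells: {microbial_fraction:.1%}")  # 90.9%
```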
9
Applications of taxonomic analyses
Identification of new species
Diversity analysis
Comparing populations from different sites or states
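As one possible illustration of diversity analysis and of comparing populations between sites, the sketch below computes the Shannon diversity index from per-taxon read counts. The slides do not name a specific index, so this choice is an assumption, and the taxa and counts are invented for illustration.

```python
import math


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln(p_i)) over taxa with non-zero counts."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads assigned to any taxon")
    return -sum((n / total) * math.log(n / total) for n in counts.values() if n > 0)


# Hypothetical read counts per taxon at two sites (illustrative values only).
site_a = {"taxon_1": 50, "taxon_2": 50}   # evenly mixed community
site_b = {"taxon_1": 99, "taxon_2": 1}    # dominated by one taxon

for label, counts in (("site A", site_a), ("site B", site_b)):
    print(label, round(shannon_index(counts), 3))
```

The evenly mixed site scores higher than the site dominated by a single taxon, which is the kind of contrast a between-site comparison looks for.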
10
Applications of functional analyses
Bioprospecting for novel sequences with functional applications
Reconstruction of pathways present in the community
Comparing functional activities from different sites or states
11
Why is metagenomics challenging?
Short sequence fragments are hard to characterise
Assembly can lead to chimeras
Iddo Friedberg: ‘Metagenomics is like a disaster in a jigsaw shop’
Millions of different pieces
Thousands of different puzzles
All mixed together
Most of the pieces are missing
No boxes to refer to
12
Limitations and pitfalls
Data used for analysis can have limitations:
16S rRNA genes – limited resolving power and subject to copy number variation
Viral sequences – currently no gold-standard reference database
Protist sequences – little experimentally-derived annotation of protein function in public databases
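To make the copy number point concrete: a taxon carrying several 16S rRNA gene copies per genome contributes proportionally more reads per cell. The sketch below shows one simple way to adjust for this, assuming the copy numbers are known. The taxa, read counts and copy numbers are invented for illustration and the function name is hypothetical.

```python
def copy_number_corrected_abundance(read_counts, copies_per_genome):
    """Estimate relative cell abundance from 16S read counts.

    Each taxon's read count is divided by its 16S gene copy number,
    then the results are normalised so they sum to 1.
    """
    missing = set(read_counts) - set(copies_per_genome)
    if missing:
        raise KeyError(f"no copy number for: {sorted(missing)}")
    cell_estimates = {t: n / copies_per_genome[t] for t, n in read_counts.items()}
    total = sum(cell_estimates.values())
    return {t: value / total for t, value in cell_estimates.items()}


# Hypothetical data: taxon_A has 7 copies of the 16S gene, taxon_B has 1.
reads = {"taxon_A": 700, "taxon_B": 100}
copies = {"taxon_A": 7, "taxon_B": 1}

corrected = copy_number_corrected_abundance(reads, copies)
print(corrected)  # {'taxon_A': 0.5, 'taxon_B': 0.5}
```

By raw read counts taxon_A looks seven times more abundant, but after correction the two taxa are estimated to be present in equal numbers.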
13
Additional pitfalls
Different functional and taxonomic analysis tools can give different results
The same tools can give different results depending on the version and underlying algorithm (e.g., HMMER2 vs HMMER3)
The same version of the same tools can give different results depending on the reference database used
14
Reference databases
17
Scientific workflows
A number of linked components
Each component is responsible for a small fragment of overall functionality
Allow the description, management, and sharing of scientific analyses
18
Scientific workflows
Analogous to laboratory protocols
Lyse cells
Remove proteins
DNA precipitate
Re-suspend DNA
… buffers, enzymes, volumes, temperatures, times, centrifugation steps
19
Scientific workflows
Analogous to laboratory protocols
Quality control of reads
Assembly into contigs
Prediction of rRNAs
Prediction of protein coding genes
Annotation
… software used, versions, settings, reference databases
20
Tools to create, manage and automate workflows
GALAXY:
TAVERNA:
Pipeline Pilot:
21
Tools to create, manage and automate workflows
Offer a range of components (installed locally or called via web services)
Connected together to create workflows using a drag and drop GUI
22
Workflows
CloVR - a virtual machine for automated and portable sequence analysis
Range of protocols for microbial genomics
Specific protocol (CloVR-Metagenomics) for taxonomic and functional classifications of metagenomics shotgun sequence data
23
Workflows The CloVR-Metagenomics protocol
24
Other considerations: data analysis speed
Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Available at:
25
Data analysis speed
“The cost of sequencing has really gone down. Now I can do metagenomics! Awesome!”
Amount of sequence generated has increased 5,000-fold
Computational speed has increased only 10-fold
Time taken to analyse has increased 500-fold
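The 500-fold figure is the ratio of the two growth factors on the slide. A minimal check, assuming analysis time scales linearly with the amount of sequence and inversely with computational speed:

```python
# Growth factors as quoted on the slide.
sequence_growth = 5_000   # fold increase in sequence generated
compute_speedup = 10      # fold increase in computational speed

# Assumption: time is proportional to data volume divided by compute speed.
analysis_time_growth = sequence_growth / compute_speedup
print(f"Analysis time has grown {analysis_time_growth:.0f}-fold")  # 500-fold
```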
26
Data analysis cost
[Figure: charts comparing the percentage cost breakdown at ~80 bp/$ and at ~2m bp/$; segment labels not recovered]
Sboner et al. Genome Biology (2011) 12:125
27
What data to store?
Raw sequence data: important for metagenomics as some samples are hard to replicate; large file sizes
Analysis results? Easiest to repeat, although it takes time & requires keeping track of analysis steps and versions
Data description including metadata: essential (what, where, who, how and when); if absent, raw data have very limited usefulness
28
The importance of metadata
Metadata includes the in-depth, controlled description of the sample that your sequence was taken from
How was it sampled?
How was it extracted?
How was it stored?
What sequencing platform was used?
Where did it come from?
What were the environmental conditions (lat/long, depth, pH, salinity, temperature…) or clinical observations?
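The questions above can be turned into a simple completeness check on a sample record. The sketch below is one way to do that. The field names and the example record are hypothetical and do not follow any particular metadata standard.

```python
# Hypothetical required fields, one per question on the slide.
REQUIRED_FIELDS = (
    "sampling_method",
    "extraction_method",
    "storage_conditions",
    "sequencing_platform",
    "latitude",
    "longitude",
    "collection_date",
)


def missing_metadata(record):
    """Return the required fields that are absent or empty in a sample record."""
    return [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]


# Illustrative record with one empty field and one absent field.
sample = {
    "sampling_method": "surface water, bottle",
    "extraction_method": "",
    "sequencing_platform": "platform_x",
    "latitude": 0.0,
    "longitude": -140.5,
    "collection_date": "2012-05-14",
    "depth_m": 75,
}

print(missing_metadata(sample))  # ['extraction_method', 'storage_conditions']
```

A check like this can run before submission so that gaps are caught while the person who collected the sample can still fill them in.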
29
The importance of metadata
If metadata is adequately described, using a standardised vocabulary, querying and interpretation across projects becomes possible
Show the microbial species found in the North Pacific
… at depths of 50 – 100 m
… in samples taken May-June
… compared to the Indian Ocean, under the same conditions
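The example query can be sketched as a filter over sample records that share the same vocabulary for region, depth and month. The records, field names and species labels below are invented for illustration. A real portal would expose this through its own search interface.

```python
def species_found(samples, region, depth_range_m, months):
    """Collect species from samples matching a region, depth range and set of months."""
    low, high = depth_range_m
    found = set()
    for sample in samples:
        if (
            sample["region"] == region
            and low <= sample["depth_m"] <= high
            and sample["month"] in months
        ):
            found.update(sample["species"])
    return found


# Hypothetical samples drawn from different projects.
samples = [
    {"region": "North Pacific", "depth_m": 75, "month": 5, "species": {"species_1", "species_2"}},
    {"region": "North Pacific", "depth_m": 300, "month": 6, "species": {"species_3"}},
    {"region": "Indian Ocean", "depth_m": 60, "month": 6, "species": {"species_2", "species_4"}},
    {"region": "Indian Ocean", "depth_m": 80, "month": 11, "species": {"species_5"}},
]

conditions = {"depth_range_m": (50, 100), "months": {5, 6}}  # 50-100 m, May-June
pacific = species_found(samples, "North Pacific", **conditions)
indian = species_found(samples, "Indian Ocean", **conditions)

print(sorted(pacific & indian))  # ['species_2']
print(sorted(pacific - indian))  # ['species_1']
```

The comparison only works because every record uses the same field names, units and region labels, which is the point of a standardised vocabulary.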
30
Considerations: storing data
Where are you going to store this?
Locally: back-up? long term? sharing? access?
Amazon, Google or specialist research clouds
Public repositories, such as ENA, NCBI or DDBJ
31
Public repositories
Free!
Secure long term storage
No need for local infrastructure
Enforced compliance:
  Publisher requirements (accession numbers)
  Institutional requirements
  Funder requirements
Data are more useful:
  Data are reusable and can be discovered by others
  Available for re- and meta-analyses
32
Considerations: moving data
Transferring a 100 Gb NGS data file across the internet
'Normal' network bandwidth (1 Gigabit/s): ~1 week*
High-speed bandwidth (10 Gigabit/s): <1 day*
Traditional methods may be the most effective!
* Stein, Genome Biol. (2010) 11:207
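The transfer times above are quoted from the cited source. For comparison, the sketch below computes only the idealised lower bound at the nominal line rate, assuming that "Gb" here means gigabytes (10^9 bytes) and ignoring protocol overhead, shared links and disk speed. Sustained real-world throughput is usually well below the line rate, so practical transfer times are much longer than this bound.

```python
def ideal_transfer_seconds(file_size_gigabytes, bandwidth_gigabits_per_s):
    """Lower bound on transfer time if the full line rate were sustained."""
    file_size_bits = file_size_gigabytes * 1e9 * 8   # decimal gigabytes to bits
    return file_size_bits / (bandwidth_gigabits_per_s * 1e9)


for bandwidth in (1, 10):
    seconds = ideal_transfer_seconds(100, bandwidth)
    print(f"{bandwidth} Gigabit/s: at least {seconds / 60:.1f} minutes")
```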
33
Metagenomics portals http://www.ebi.ac.uk/metagenomics
34
What do metagenomics portals offer?
Submit data
Tools to help transfer data
Tools to help capture & store metadata
Sequence archiving
Quality filtering of sequences
Sequence analysis (prebuilt workflows)
Visualisation/Interpretation
35
Planning your experiments: things to consider
Think data volumes
Metagenomic data files can be vast (10s of Gbs/file): how are you going to store and manipulate them?
Plan ahead
In a short time-frame, data volumes could be orders of magnitude higher
Think timescales
Data analysis times >> data generation times
36
Planning your experiments: things to consider
Consider economics
Direct costs & opportunity costs
Productivity
Think communication and collaboration
Biologists
Bioinformaticians & service providers
Computer scientists
37
Your views, concerns, frustrations…