Considerations for metagenomics data analysis and summary of workflows

Considerations for metagenomics data analysis and summary of workflows Alex Mitchell mitchell@ebi.ac.uk

My background Doctorate in pharmacology (1995-1998) Post-doc in molecular biology (1998-2001) Bioinformatics research (2001-2011) Co-ordinator for InterPro and EBI metagenomics databases (2011-)

Overview What is metagenomics? Different types of metagenomic studies Challenges and considerations for metagenomic data analyses

What is metagenomics? “Metagenomics” means literally ‘beyond genomics’ “Metagenome” used by Handelsman et al., in 1998 to describe “collective genomes of soil microflora” “Metagenomics is the study of metagenomes, genetic material recovered directly from environmental samples.” “Metagenomics is the study of all genomes present in any given environment without the need for prior individual identification or amplification”

Sampling from environment Filtering step Extraction of DNA Sequencing

Quality control Taxonomic analysis: 16S rRNA, 18S rRNA, ITS, etc. Functional analysis: identification and characterisation of protein coding sequences

Why is metagenomics exciting? Estimated >90% of microbes are not culturable by current techniques Can’t be accessed by traditional sequencing methods NGS techniques allow the reading of DNA from uncloned samples For example: the human body contains ~10^14 human cells and ~10^15 microbial cells i.e. 90% of the cells in your body are not your own Massive impact on understanding health and disease

Applications of taxonomic analyses Identification of new species Diversity analysis Comparing populations from different sites or states
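As an illustration of the diversity analysis mentioned above, the sketch below computes the Shannon index, H = -Σ p_i log2 p_i, which is one common diversity measure (the slides do not say which measure is intended). The taxon names and read counts are made-up values for two hypothetical sites.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * log2(p_i)) over taxa with non-zero counts."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observations")
    h = 0.0
    for n in counts.values():
        if n > 0:
            p = n / total
            h -= p * math.log2(p)
    return h


# Hypothetical read counts per taxon at two sites (illustrative values only).
site_a = {"taxon_1": 25, "taxon_2": 25, "taxon_3": 25, "taxon_4": 25}
site_b = {"taxon_1": 97, "taxon_2": 1, "taxon_3": 1, "taxon_4": 1}

h_a = shannon_index(site_a)  # perfectly even community of four taxa
h_b = shannon_index(site_b)  # community dominated by one taxon

print(f"site A: H = {h_a:.3f} bits")
print(f"site B: H = {h_b:.3f} bits")
```

An even community scores higher than one dominated by a single taxon, which is the kind of contrast used when comparing populations from different sites or states.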

Applications of functional analyses Bioprospecting for novel sequences with functional applications Reconstruction of pathways present in the community Comparing functional activities from different sites or states

Why is metagenomics challenging? Short sequence fragments are hard to characterise Assembly can lead to chimeras Iddo Friedberg: ‘Metagenomics is like a disaster in a jigsaw shop’ Millions of different pieces Thousands of different puzzles All mixed together Most of the pieces are missing No boxes to refer to

Limitations and pitfalls Data used for analysis can have limitations: 16S rRNA genes – limited resolving power and subject to copy number variation Viral sequences – currently no gold-standard reference database Protist sequences – little experimentally-derived annotation of protein function in public databases
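To make the copy number point concrete, here is a minimal sketch of how 16S copy number variation can distort abundance estimates, and one simple way to adjust for it. The taxa, read counts and copy numbers are hypothetical. In practice copy numbers would have to come from a curated source and are often unknown for uncultured organisms.

```python
def copy_number_corrected_abundance(read_counts, copy_numbers):
    """Divide 16S read counts by per-taxon gene copy number, then normalise to proportions."""
    adjusted = {
        taxon: reads / copy_numbers[taxon]
        for taxon, reads in read_counts.items()
    }
    total = sum(adjusted.values())
    return {taxon: value / total for taxon, value in adjusted.items()}


# Hypothetical example: taxon_A carries 7 copies of the 16S gene, taxon_B only 1.
reads = {"taxon_A": 700, "taxon_B": 100}
copies = {"taxon_A": 7, "taxon_B": 1}

raw_total = sum(reads.values())
raw = {taxon: n / raw_total for taxon, n in reads.items()}
corrected = copy_number_corrected_abundance(reads, copies)

print("raw proportions:      ", raw)
print("corrected proportions:", corrected)
```

With these numbers the raw read counts suggest taxon_A makes up most of the community, while the adjusted estimate puts the two taxa at equal cell abundance.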

Additional pitfalls Different functional and taxonomic analysis tools can give different results The same tools can give different results depending on the version and underlying algorithm (e.g., HMMER2 vs HMMER3) The same version of the same tools can give different results depending on the reference database used

Reference databases

Scientific workflows A number of linked components Each component is responsible for a small fragment of overall functionality Allow the description, management, and sharing of scientific analyses

Scientific workflows Analogous to laboratory protocols Lyse cells Remove proteins DNA precipitate Re-suspend DNA … buffers, enzymes, volumes, temperatures, times, centrifugation steps

Scientific workflows Analogous to laboratory protocols Quality control of reads Assembly into contigs Prediction of rRNAs Prediction of protein coding genes Annotation … software used, versions, settings, reference databases
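The idea of a workflow as linked components, each recording the software, version, settings and reference database it used, can be sketched as below. This is a toy illustration, not how Galaxy, Taverna or any real workflow system is implemented. The step functions, tool names and version strings are placeholders invented for the example.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Step:
    """One workflow component plus the details needed to repeat it."""
    name: str
    func: Callable[..., Any]
    software: str            # placeholder tool name
    version: str             # placeholder version string
    settings: Dict[str, Any] = field(default_factory=dict)
    reference_db: str = ""   # empty when the step uses no reference database


def run_workflow(steps: List[Step], data: Any):
    """Run steps in order, feeding each output to the next, and log what was used."""
    provenance = []
    for step in steps:
        data = step.func(data, **step.settings)
        provenance.append({
            "step": step.name,
            "software": step.software,
            "version": step.version,
            "settings": step.settings,
            "reference_db": step.reference_db,
        })
    return data, provenance


# Toy stand-ins for real tools (hypothetical, for illustration only).
def quality_control(reads, min_length):
    return [r for r in reads if len(r) >= min_length and "N" not in r]


def predict_genes(reads, start_codon):
    return [r for r in reads if r.startswith(start_codon)]


steps = [
    Step("quality control", quality_control, "toy_qc", "0.1", {"min_length": 20}),
    Step("gene prediction", predict_genes, "toy_caller", "0.1", {"start_codon": "ATG"}),
]

reads = [
    "ATGAAACCCGGGTTTAAACCCGGG",  # passes both steps
    "ATGNNNCCCGGGTTTAAACCCGGG",  # ambiguous bases, removed by quality control
    "ATGAAA",                    # too short, removed by quality control
    "CCCAAACCCGGGTTTAAACCCGGG",  # passes quality control, no start codon
]

result, provenance = run_workflow(steps, reads)
print(result)
print(json.dumps(provenance, indent=2))
```

Storing the provenance record alongside the results is what makes an analysis repeatable, in the same way that a laboratory protocol records buffers, volumes and temperatures.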

Tools to create, manage and automate workflows GALAXY: https://usegalaxy.org/ TAVERNA: http://www.taverna.org.uk/ Pipeline Pilot: http://accelrys.com/products/pipeline-pilot/

Tools to create, manage and automate workflows Offer a range of components (installed locally or called via web services) Connected together to create workflows using a drag and drop GUI

Workflows CloVR - a virtual machine for automated and portable sequence analysis (http://clovr.org/) Range of protocols for microbial genomics Specific protocol (CloVR-Metagenomics) for taxonomic and functional classifications of metagenomics shotgun sequence data

Workflows The CloVR-Metagenomics protocol http://clovr.org/methods/clovr-metagenomics/

Other considerations: data analysis speed Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program. Available at: www.genome.gov/sequencingcosts.

Data analysis speed The cost of sequencing has really gone down Now I can do metagenomics! Awesome! Amount of sequence generated has increased 5,000-fold Computational speed has increased only 10-fold Time taken to analyse has increased 500-fold $@%*!!!

Data analysis cost [Figure: percentage breakdown of total project costs, shown for sequencing yields of ~80 bp/$ and ~2m bp/$] Sboner et al. Genome Biology (2011) 12:125

What data to store? Raw sequence data? Important for metagenomics, as some samples are hard to replicate. Large file sizes. Analysis results? Easiest to repeat, although it takes time and requires keeping track of analysis steps and versions. Data description including metadata? Essential: what, where, who, how and when. If absent, raw data have very limited usefulness.

The importance of metadata Metadata includes the in-depth, controlled description of the sample that your sequence was taken from How was it sampled? How was it extracted? How was it stored? What sequencing platform was used? Where did it come from? What were the environmental conditions (lat/long, depth, pH, salinity, temperature…) or clinical observations?

The importance of metadata If metadata is adequately described, using a standardised vocabulary, querying and interpretation across projects becomes possible Show the microbial species found in the North Pacific … at depths of 50 – 100 m … in samples taken May-June … compared to the Indian Ocean, under the same conditions
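The cross-project query described above can be sketched as a filter over structured metadata records. The field names, sample records and species labels below are hypothetical. A real system would use a standardised vocabulary and a proper database rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple


@dataclass
class SampleMetadata:
    sample_id: str
    region: str            # controlled vocabulary term, e.g. "North Pacific"
    depth_m: float         # sampling depth in metres
    collected: date        # sampling date
    species: Tuple[str, ...]  # taxa reported for this sample


def species_found(samples, region, min_depth, max_depth, months):
    """Return the sorted set of species in samples matching region, depth range and months."""
    hits = [
        s for s in samples
        if s.region == region
        and min_depth <= s.depth_m <= max_depth
        and s.collected.month in months
    ]
    return sorted({sp for s in hits for sp in s.species})


# Hypothetical records from different projects.
samples = [
    SampleMetadata("S1", "North Pacific", 75, date(2012, 5, 20), ("species_a", "species_b")),
    SampleMetadata("S2", "North Pacific", 300, date(2012, 6, 1), ("species_c",)),
    SampleMetadata("S3", "Indian Ocean", 60, date(2012, 6, 11), ("species_b", "species_d")),
    SampleMetadata("S4", "North Pacific", 80, date(2012, 11, 3), ("species_e",)),
]

may_june = {5, 6}
north_pacific = species_found(samples, "North Pacific", 50, 100, may_june)
indian_ocean = species_found(samples, "Indian Ocean", 50, 100, may_june)
shared = sorted(set(north_pacific) & set(indian_ocean))

print("North Pacific:", north_pacific)
print("Indian Ocean: ", indian_ocean)
print("Shared:       ", shared)
```

The query only works because every record uses the same field names, units and region labels, which is the argument for standardised metadata.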

Considerations: storing data Where are you going to store this? Locally : back-up ? long term ? sharing ? access ? Amazon, Google or specialist research clouds Public repositories, such as ENA, NCBI or DDBJ

Public repositories Free! Secure long term storage No need for local infrastructure Enforced compliance: Publisher requirements (accession numbers) Institutional requirements Funder requirements Data are more useful: Data are reusable and can be discovered by others Available for re- and meta-analyses

Considerations: moving data Transferring a 100 GB NGS data file across the internet 'Normal' network bandwidth (1 Megabit/s) ~ 1 week* High-speed bandwidth (10 Megabit/s) < 1 day* Traditional methods may be the most effective! * Stein, Genome Biol. (2010) 11:207
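A rough back-of-the-envelope check of transfer times can be done as follows. The sketch assumes a 100 gigabyte file (decimal gigabytes) and a link that is fully and continuously used, which real transfers rarely achieve, so actual times will be longer. The link speeds listed are illustrative choices.

```python
def transfer_time_seconds(file_bytes, bits_per_second):
    """Idealised transfer time: total bits divided by link speed, ignoring all overhead."""
    return file_bytes * 8 / bits_per_second


FILE_BYTES = 100 * 10**9  # 100 GB, assuming decimal gigabytes

# Illustrative link speeds in bits per second.
links = {
    "1 Mbit/s": 1e6,
    "10 Mbit/s": 1e7,
    "1 Gbit/s": 1e9,
}

times = {label: transfer_time_seconds(FILE_BYTES, rate) for label, rate in links.items()}

for label, seconds in times.items():
    print(f"{label:>9}: {seconds / 3600:8.1f} hours ({seconds / 86400:.2f} days)")
```

Under these assumptions a 1 Mbit/s link needs a little over nine days and a 10 Mbit/s link just under one day.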

Metagenomics portals http://www.ebi.ac.uk/metagenomics http://metagenomics.anl.gov/ http://img.jgi.doe.gov/ http://camera.calit2.net/

What do metagenomics portals offer? Submit data Tools to help transfer data Tools to help capture & store metadata Sequence archiving Quality filtering of sequences Sequence analysis (prebuilt workflows) Visualisation/Interpretation
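As a rough idea of what a quality filtering step involves, the sketch below keeps reads whose mean Phred quality and length pass fixed thresholds. It assumes Phred+33 quality encoding and uses arbitrary thresholds. It is not the filter used by any of the portals listed, whose criteria are not described here.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of a quality string, assuming Phred+33 encoding by default."""
    return sum(ord(c) - offset for c in quality_string) / len(quality_string)


def filter_reads(records, min_mean_quality=20, min_length=5):
    """Keep (name, sequence, quality) records that are long enough and of sufficient mean quality."""
    kept = []
    for name, seq, qual in records:
        # The length check runs first, so empty reads never reach mean_phred.
        if len(seq) >= min_length and mean_phred(qual) >= min_mean_quality:
            kept.append((name, seq, qual))
    return kept


# Hypothetical reads: 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
records = [
    ("read_1", "ACGTACGT", "IIIIIIII"),  # long enough, high quality
    ("read_2", "ACGTACGT", "########"),  # long enough, low quality
    ("read_3", "ACG", "III"),            # high quality, too short
]

kept = filter_reads(records)
print([name for name, _, _ in kept])
```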

Planning your experiments: things to consider Think data volumes Metagenomic data files can be vast (tens of GB per file): how are you going to store and manipulate them? Plan ahead In a short time-frame, data volumes could be orders of magnitude higher Think timescales Data analysis times >> data generation times

Planning your experiments: things to consider Consider economics Direct costs & opportunity costs Productivity Think communication and collaboration Biologists Bioinformaticians & service providers Computer scientists

Your views, concerns, frustrations…