http://www.ebi.ac.uk/metagenomics Hubert DENISE hudenise@ebi.ac.uk


About me
- 1997: PhD, Molecular Parasitology, Univ. Bordeaux II, France
- 1997 - 2003: PostDoc, WCMP, Univ. Glasgow, UK
- 2003 - 2005: Lecturer, Molecular Biology, Univ. Clermont-Ferrand II, France
- 2005 - 2011: Sr. Scientist, Pfizer Ltd, Sandwich, UK
- 2011 - 2012: MSc Bioinformatics, Univ. Cranfield, UK
- 2012: Bioinformatician, Sanger Institute then EBI, Hinxton, UK

Where is the true cost of NGS?
[Figure: cost breakdown charts; surviving labels: 14.5 %, 30 %, 28 % (~2m bp/$), 4.5 %, 70 % (~80 bp/$), 14.5 %, 55 %, 36.5 %, 14.5 %]
Sboner et al. Genome Biology (2011) 12:125

Data analysis using selected EBI and external software tools
- EBI Metagenomics pipeline
  - Philosophy
  - Submission to EBI Metagenomics
  - QC steps
  - Overview of functional analysis
  - Overview of taxonomy analysis
  - Metagenome assembly
  - Result outputs
- Other public pipelines

Philosophy behind the EBI Metagenomics pipeline
Helping metagenomics researchers make sense of their data
From chaos to structure:
- archiving of data with metadata
- performing stringent QC filtering prior to analysis: quality in, quality out
- performing robust taxonomy and functional analysis: model-based rather than similarity-based approaches; assignment done on reads rather than assembly
- intuitive navigation through the website
- constant drive to improvement: benchmarking and tool testing


http://www.ebi.ac.uk/metagenomics
[Screenshot of the home page; labels: secure login, navigation panes, resource stats, latest data and news]

Submitting to EBI Metagenomics
Your data is valuable to you:
- raw sequence data
- description of sample and experiment (sample metadata)
- analysis steps and results
All of this needs to be captured and stored to give context to your data. If so, your data can also be valuable to others.

Submitting to EBI Metagenomics
EBI Metagenomics wants to encourage people to supply as much detailed metadata as possible (who, where, when, what, how), but with the lowest possible overhead:
- development of intuitive web-based tools: ENA Webin and ISA tools
- use of templates and checklists (MIGS/MIxS standards)
- tutorial and direct support


Metagenomics data analysis: quality control, diversity analysis, functional analysis
Image credits: (1) Christina Toft & Siv G. E. Andersson; (2) Dalebroux Z D et al. Microbiol. Mol. Biol. Rev. 2010;74:171-199

Overview of EBI Metagenomics Pipeline
- raw reads -> trim and QC, remove short, remove duplicates -> processed reads (plus discarded reads)
- processed reads -> rRNASelector -> reads with rRNA / reads without rRNA
- reads with rRNA (and amplicon-based data) -> Qiime -> taxonomic analysis
- reads without rRNA -> FragGeneScan -> predicted CDS (pCDS) -> InterProScan -> function assignment / unknown function

EBI Metagenomics: QC rationale
Why? Garbage in, garbage out.
Base call errors:
- each base call has an associated quality score
- there are specific platform-dependent errors
Read quality decreases with read length.
NGS generates duplicate reads (false and real). Reducing duplication reduces analysis time and prevents analysis bias.
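To make the per-base quality score concrete, here is a minimal sketch of how a Phred score relates to the probability of a wrong base call (Q = -10 log10 P). It assumes the Phred+33 ASCII encoding used in many FASTQ files; the slides do not say which encoding the pipeline reads, and the function names are illustrative only.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def decode_phred33(quality_string):
    """Decode a FASTQ quality string, assuming the Phred+33 ASCII offset."""
    return [ord(ch) - 33 for ch in quality_string]


def mean_error_prob(quality_string):
    """Average per-base error probability over one read."""
    scores = decode_phred33(quality_string)
    return sum(phred_to_error_prob(q) for q in scores) / len(scores)
```

For instance, a score of 20 corresponds to one expected error in 100 base calls, and a score of 30 to one in 1,000.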

EBI Metagenomics: QC step by step
- Clipping: low quality ends trimmed and adapter sequences removed using the Biopython SeqIO package
- Quality filtering: sequences with > 10% undetermined nucleotides removed
- Read length filtering: short sequences are removed
- Duplicate sequences removal: clustered on 99% identity (UCLUST v 1.1.579) and representative sequence chosen
- Repeat masking: RepeatMasker (open-3.2.2); reads with 50% or more nucleotides masked are removed
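Three of the filters above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the pipeline's code: the minimum length of 100 bp is a hypothetical default (the slide gives no threshold), and exact-match de-duplication stands in for the 99% identity clustering that the pipeline performs with UCLUST.

```python
def undetermined_fraction(seq):
    """Fraction of undetermined nucleotides (N) in a read."""
    if not seq:
        return 1.0
    return seq.upper().count("N") / len(seq)


def qc_filter(reads, max_n_fraction=0.10, min_length=100):
    """Filter (read_id, sequence) pairs.

    max_n_fraction follows the slide (> 10% N removed).
    min_length is a hypothetical value chosen for illustration.
    Duplicates are detected by exact sequence match only, a simplification
    of clustering at 99% identity.
    """
    seen = set()
    kept = []
    for read_id, seq in reads:
        if undetermined_fraction(seq) > max_n_fraction:
            continue  # too many undetermined nucleotides
        if len(seq) < min_length:
            continue  # too short
        key = seq.upper()
        if key in seen:
            continue  # exact duplicate of a read already kept
        seen.add(key)
        kept.append((read_id, seq))
    return kept
```

The first read seen for each distinct sequence acts as the representative, which mirrors the idea of keeping one sequence per cluster.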

EBI Metagenomics: QC consequences
[Figure: effect of QC, shown for Roche 454, Ion Torrent and Illumina]

EBI Metagenomics: overview of functional analysis
reads without rRNA -> FragGeneScan -> predicted CDS (pCDS) -> InterProScan -> function assignment / unknown function

EBI Metagenomics: identification of coding sequences
Prediction of coding sequences is a challenge:
- read length
- sequencing errors: frame-shift
Two main types of approaches:
- homology-based methods: identify only known coding sequences
- feature-based approaches: predict probability that ORFs are coding
EBI Metagenomics uses FragGeneScan:
- hidden Markov models to correct frame-shift using codon usage
- probabilistic identification of start and stop codons
- 60 bp minimum ORF
Rho et al. (2010) NAR 38(20)
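For contrast with the model-based approach, the sketch below shows the simplest feature-based idea: scanning each reading frame for stop-free stretches of at least 60 bp. It is deliberately naive and is not FragGeneScan's algorithm; it uses no hidden Markov model, no codon usage, and cannot correct frame-shifts. All names are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement, so the opposite strand can be scanned too."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def open_frames(seq, min_length=60):
    """Yield (frame, start, end) for stop-free stretches on one strand.

    Coordinates are 0-based, end-exclusive, and exclude the stop codon.
    Only stretches of at least min_length bp are reported.
    """
    seq = seq.upper()
    for frame in range(3):
        start = frame
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if pos - start >= min_length:
                    yield frame, start, pos
                start = pos + 3
        # Stretch running to the last complete codon of the read.
        end = frame + ((len(seq) - frame) // 3) * 3
        if end - start >= min_length:
            yield frame, start, end
```

To cover all six frames, call `open_frames` on the read and on `reverse_complement(read)`. Because reads are short fragments, stretches that run off either end of the read are reported as well, since a gene may start or stop outside the read.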

EBI Metagenomics: annotation of coding sequences
Most available pipelines use pairwise alignment methods (such as BLAST), which:
- compare a query sequence with a database of sequences
- identify database sequences that resemble the query sequence with a homology score above a certain threshold
However, sequences may appear to have a low homology score because:
- proteins may share homology only in limited domains
- proteins from different species can differ in length
Example: first line of a BLAST alignment of 60S acidic ribosomal protein P0 from 2 closely-related species

Using BLAST for annotation

EBI Metagenomics: advantage of InterPro
The EBI Metagenomics pipeline does not use BLAST-based methods to associate functions with predicted protein sequences: instead we use InterProScan to mine the InterPro database.
The InterPro database (HMM- and profile-based functional analysis) is based on the presence of "signatures" (models) from eleven databases.
Specificity: mapping is manually curated
- IPR024185: 5-formyltetrahydrofolate cyclo-ligase-like
- IPR000847: Transcription regulator HTH, LysR
Speed: test set of 40,692 predicted protein sequences
- BLAST vs UniRef100 = 21.5 s/cds
- InterProScan (5 databases) = 3 s/cds
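To put the per-sequence timings in perspective, the short calculation below scales them to the whole test set. It assumes the sequences are processed one after another on a single worker, which the slide does not state; real runs are usually parallelised.

```python
def total_hours(n_sequences, seconds_per_sequence):
    """Total run time in hours, assuming strictly serial processing."""
    return n_sequences * seconds_per_sequence / 3600


N_CDS = 40_692  # size of the test set quoted on the slide

blast_hours = total_hours(N_CDS, 21.5)       # about 243 hours
interproscan_hours = total_hours(N_CDS, 3)   # about 34 hours
speedup = 21.5 / 3                           # roughly 7-fold
```

Under that assumption the difference is roughly ten days of compute against a day and a half.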

EBI Metagenomics: InterProScan annotations
Example annotation, field by field:
- pCDS: SRR413626.9733695_1_1_105_-
- member database: ProSitePatterns
- signature accession: PS00194
- signature description: Thioredoxin family active site
- score: 1.0E-13
- InterPro accession: IPR017937
- InterPro description: Thioredoxin, conserved site
- GO annotation: GO:0045454
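A row like the one above can be read into a small record for downstream filtering. The sketch assumes a tab-separated line holding exactly the eight fields shown on the slide, in that order, with multiple GO terms separated by "|"; actual InterProScan output files contain additional columns, so the column positions here are an assumption to adapt.

```python
from typing import List, NamedTuple


class Annotation(NamedTuple):
    pcds: str
    member_db: str
    signature_acc: str
    signature_desc: str
    score: float
    interpro_acc: str
    interpro_desc: str
    go_terms: List[str]


def parse_annotation(line):
    """Parse one tab-separated annotation row (assumed eight-field layout)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    go_terms = [term for term in fields[7].split("|") if term]
    return Annotation(
        pcds=fields[0],
        member_db=fields[1],
        signature_acc=fields[2],
        signature_desc=fields[3],
        score=float(fields[4]),
        interpro_acc=fields[5],
        interpro_desc=fields[6],
        go_terms=go_terms,
    )
```

Keeping the score as a float makes it easy to apply a cut-off later, and keeping GO terms as a list avoids re-splitting the string at every use.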

EBI Metagenomics: InterProScan annotations
[Screenshot; labels: links, signatures, description, GO terms]

Aims of the Gene Ontology
- Controlled vocabulary
- Unify the representation of gene and gene product attributes across species
- Allow cross-species and/or cross-database comparisons

Inconsistency in naming of biological concepts
English is not a very precise language:
- same name for different concepts
- different names for the same concept
An example: Taction? Tactition? Tactile sense? All describe: sensory perception of touch; GO:0050975

The Gene Ontology
A way to capture biological knowledge in a written and computable form.
A set of concepts and their relationships to each other, arranged as a hierarchy from less specific concepts to more specific concepts.
www.ebi.ac.uk/QuickGO

The Concepts in GO
1. Molecular Function: an elemental activity or task or job (protein kinase activity, insulin receptor activity)
2. Biological Process: a commonly recognised series of events (cell division)
3. Cellular Component: where a gene product is located (mitochondrion, mitochondrial matrix, mitochondrial inner membrane)

The relationship between InterPro and GO (InterPro2GO)
Curators manually add relevant GO terms to InterPro entries. When a sequence is searched against InterPro, it is assigned GO terms by virtue of the entries it matches.
Example: SRR413626.11302948_1_1_133_+ Pfam PF00005 ABC transporter 6 8.9E-6 IPR003439 ABC transporter-like GO:0005524|GO:0016887
- GO:0005524: ATP binding
- GO:0016887: ATPase activity
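The transfer of GO terms through matched entries amounts to a table lookup. The sketch below uses a tiny in-memory mapping built only from the two entries quoted in these slides; the full InterPro2GO mapping is far larger, and the function names are illustrative.

```python
from collections import Counter

# Toy mapping taken from the examples in the slides; not the full InterPro2GO file.
INTERPRO2GO = {
    "IPR003439": ["GO:0005524", "GO:0016887"],  # ABC transporter-like
    "IPR017937": ["GO:0045454"],                # Thioredoxin, conserved site
}


def go_terms_for(interpro_accessions, mapping=INTERPRO2GO):
    """GO terms a sequence inherits from the InterPro entries it matches."""
    terms = set()
    for accession in interpro_accessions:
        terms.update(mapping.get(accession, []))
    return sorted(terms)


def count_go_terms(matches_by_sequence, mapping=INTERPRO2GO):
    """Count how many sequences carry each GO term."""
    counts = Counter()
    for accessions in matches_by_sequence.values():
        counts.update(go_terms_for(accessions, mapping))
    return counts
```

Entries without a curated GO mapping simply contribute no terms, which is why some matched sequences end up with an InterPro annotation but no GO annotation.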

EBI Metagenomics: overview of taxonomy analysis
processed reads -> rRNASelector -> reads with rRNA (and amplicon-based data) -> Qiime -> taxonomic analysis

EBI Metagenomics: identification of suitable sequences
Taxonomy analysis is generally based on identification and classification of rRNA sequences:
- Prokaryotes (archaebacteria and eubacteria): 5S, 16S and 23S
- Eukaryotes: 5S, 5.8S, 18S and 28S
- there is no equivalent for viruses, so analysis depends on DNA polymerase or part of 5'-UTR (internal ribosomal entry site [IRES]) sequences
EBI Metagenomics currently only provides taxonomy analysis for Prokaryotes.
rRNA sequences are identified using rRNASelector:
- hidden Markov models to identify rRNA sequences
- 60 bp minimum overlap with well-curated HMM model
- E-value < 10^-5
Lee et al. (2011) J Microbiol. 49(4)
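The two thresholds quoted above can be expressed as a simple predicate over model hits. This sketch only illustrates the selection rule; it does not run any HMM search, and the input layout (a list of E-value and overlap pairs per read) is a hypothetical one chosen for the example.

```python
def is_rrna_hit(evalue, overlap_bp, max_evalue=1e-5, min_overlap_bp=60):
    """True if a hit passes both thresholds quoted on the slide."""
    return evalue < max_evalue and overlap_bp >= min_overlap_bp


def partition_reads(hits_by_read):
    """Split reads into those with and without a qualifying rRNA hit.

    hits_by_read maps a read id to a list of (evalue, overlap_bp) pairs.
    """
    with_rrna, without_rrna = [], []
    for read_id, hits in hits_by_read.items():
        if any(is_rrna_hit(evalue, overlap) for evalue, overlap in hits):
            with_rrna.append(read_id)
        else:
            without_rrna.append(read_id)
    return with_rrna, without_rrna
```

The two resulting lists correspond to the two branches of the pipeline: reads with rRNA go to taxonomy analysis, and the rest go to coding sequence prediction.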

EBI Metagenomics: identification of suitable sequences
Once identified, rRNA sequences are clustered and classified using Qiime.
"QIIME stands for Quantitative Insights Into Microbial Ecology. QIIME is an open source software package for comparison and analysis of microbial communities"
The main steps are:
- clustering sequences into Operational Taxonomic Units (OTUs) using uclust
- picking a representative sequence set (one sequence from each OTU)
- aligning the representative sequence set using PyNAST
- assigning taxonomy to the representative sequence set
- generating output files: filtering the alignment prior to tree building, building the phylogenetic tree, creating the OTU table
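Two of the steps listed above, picking one representative per OTU and creating the OTU table, can be illustrated without running QIIME. The sketch assumes the clustering result is already available as a mapping from OTU id to read ids, and it takes the first read of each cluster as representative; that choice and all names are hypothetical, and QIIME itself offers several ways of picking representatives.

```python
from collections import defaultdict


def pick_representatives(otu_map):
    """Pick one read per OTU (here simply the first read listed)."""
    return {otu_id: reads[0] for otu_id, reads in otu_map.items() if reads}


def build_otu_table(otu_map, read_to_sample):
    """Count reads per OTU and per sample.

    otu_map maps an OTU id to the read ids clustered into it.
    read_to_sample maps a read id to the sample it came from.
    """
    table = defaultdict(lambda: defaultdict(int))
    for otu_id, reads in otu_map.items():
        for read_id in reads:
            table[otu_id][read_to_sample[read_id]] += 1
    return {otu_id: dict(samples) for otu_id, samples in table.items()}
```

The resulting table (OTUs as rows, samples as columns) is the starting point for diversity comparisons between samples.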

EBI Metagenomics: validation of taxonomy analysis
Re-analysis of: Sutton et al., Appl. Environ. Microbiol. (2013), 79(2):619. Impact of Long-Term Diesel Contamination on Soil Microbial Community Structure.
[Figure: alpha diversity analysis; sample groups: clean, polluted, clean (outlier)]
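Alpha diversity summarises the diversity within a single sample. The slide does not say which metric was plotted, so the sketch below uses two common ones as an assumption: the number of observed OTUs and the Shannon index, H = -sum(p_i ln p_i), where p_i is the proportion of reads in OTU i.

```python
import math


def observed_otus(counts):
    """Number of OTUs with at least one read in the sample."""
    return sum(1 for c in counts if c > 0)


def shannon_index(counts):
    """Shannon diversity (natural log) from per-OTU read counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)
```

A sample dominated by one OTU scores close to 0, while a sample with reads spread evenly over many OTUs scores higher, which is the kind of contrast expected between disturbed and undisturbed communities.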

Assembly of metagenomics data
Metagenomics:
- not clear how to avoid assembling sequences from different species together (chimaeras)
- no reference sequence to align against

EBI Metagenomics currently does not perform assembly
We are still able to annotate metagenomes, as shown by this re-analysis of the rumen metagenomics study by Hess et al., Science (2011) 331:463.
What are the consequences?
- cannot link taxonomy information to functional annotations
- cannot currently perform viral taxonomy analysis

EBI Metagenomics pipeline in a nutshell
"Powerful and sophisticated alternative to BLAST-based functional metagenomic analysis"
QC:
- trim adaptor sequences and low quality sequence ends
- remove duplicates and short sequences
- remove low complexity sequences
Diversity analysis:
- identify prokaryotic rRNA sequences (5S, 16S and 23S)
- cluster rRNA-containing reads
- assign taxonomy classification using Qiime
Functional analysis:
- predict ORFs
- translate ORFs into peptides
- submit to InterProScan for functional annotation

Data analysis using selected EBI and external software tools
- EBI Metagenomics pipeline
  - Submission
  - Philosophy
  - Overview data analysis
  - QC steps
  - Overview of functional analysis
  - Overview of taxonomy analysis
  - Metagenome assembly
  - Result outputs
- Other public pipelines

Current outputs of EBI Metagenomics pipeline
Visualisation / Download:
- QC and sequence statistics
- Diversity analysis
- Functional analysis

Current outputs of EBI Metagenomics pipeline: access via the Sample page
[Screenshot; label: navigation tabs]

EBI Metagenomics pipeline: taxonomy visualisation
- switch to bar chart, column or Krona interactive views
- Krona interactive representation
- Google charts dynamic representation

EBI Metagenomics pipeline: functional visualisation
- Google charts dynamic representation
- links to InterPro website
- switch to bar chart view

EBI Metagenomics pipeline: download options
- 470 MB: needs high computing power to manipulate; EBI Metagenomics takes care of it and extracts meaningful information sets
- relatively small files: can be manipulated on a laptop/desktop computer; users can filter them according to their needs


Metagenomics data analysis
- Pipeline 1: quality control -> taxonomy analysis -> functional analysis -> results 1
- Pipeline 2: quality control -> taxonomy analysis -> functional analysis -> results 2
Results 1 and results 2 should share trends and main findings, but could differ in ratio and assignment.

Public Metagenomics portals
- MG-RAST: http://metagenomics.anl.gov/
- EBI Metagenomics: http://www.ebi.ac.uk/metagenomics/
- IMG: http://img.jgi.doe.gov/
- CAMERA: http://camera.calit2.net/

Simplified overview of MG-RAST pipeline
Stages shown on the slide:
- Sequencer output
- Quality control
- Feature prediction (FragGeneScan)
- Abundance profiles
- Similarities search (Blat)
- Clustering (Uclust)
- Community reconstruction
- Metabolic reconstruction
- Metabolic model
http://metagenomics.anl.gov/

Example: Analysis of Prairie Soil Sample. MG-RAST and EBI Metagenomics QC comparison
Metric | MG-RAST | EBI Metagenomics
Upload: bp Count | 391,415,961 bp | 391,415,961 bp
Upload: Mean Sequence Length | 413 ± 125 bp | 413.39 bp
Upload: Mean GC percent | 61 ± 8 % | 61.2 %
Processed: Predicted Protein Features | 972,409 | 999,433
Processed: Predicted rRNA Features | 5 | 3
Alignment: Identified Protein Features | 510,221 | 480,560
Alignment: Identified rRNA Features | 1,069 | 1,110
Annotation: Identified Functional Categories | 442,070 | 462,475
Rows for which the slide gives a single value:
- Upload: Sequences Count: 946,839
- Artificial Duplicate Reads: Sequence Count: 0
- Post QC: bp Count: 388,670,692 bp
- Post QC: Sequences Count: 908,602
- Post QC: Mean Sequence Length: 380.43 bp
- Post QC: Mean GC percent: 57.8 %

Example: Analysis of Prairie Soil Sample. MG-RAST and EBI Metagenomics functional analysis
ammonia monooxygenase: NH3 + A-H2 + O2 -> NH2OH + A + H2O
MG-RAST: 28 unique hits on 8 different protein databases
- 1 putative ammonia monooxygenase
- 3 Putative ammonia monooxygenase
- 5 Ammonia monooxygenase
- 1 ammonia monooxygenase family protein
- 2 ammonia monooxygenase subunit A
- 1 ammonia monooxygenase, putative
- 6 putative ammonia monooxygenase
- 2 Putative ammonia monooxygenase
- 1 putative ammonia monooxygenase subunit A
13 GenBank, 9 SEED
- 12 Ammonia monooxygenase
- 2 ammonia monooxygenase family protein
- 4 Ammonia monooxygenase subunit A
- 5 Ammonia monooxygenase, putative
- 62 Putative ammonia monooxygenase
- 3 putative ammonia monooxygenase protein
- 4 putative ammonia monooxygenase subunit A
8 KEGG, 18 eggNOG, 13 GenBank, 11 IMG, 8 PATRIC, 10 RefSeq, 12 TrEMBL, 9 SEED
What do the abundance numbers mean?
EBI Metagenomics:
- 3 IPR003393 Ammonia monooxygenase/particulate methane monooxygenase, subunit A
- 25 IPR007820 Putative ammonia monooxygenase/protein AbrB

Example: Analysis of Prairie Soil Sample. MG-RAST and EBI Metagenomics taxonomy analysis
- MG-RAST: domain level of taxonomy; chart labels: (55 categories), (15 categories), (98 categories), (3 types)
- EBI Metagenomics: only Archaea/Bacteria taxonomy (333 OTUs)

Overview of CAMERA workflow

Integrated Microbial Genomes and Metagenomes analysis tools

Some other Metagenomics tools http://ab.inf.uni-tuebingen.de/software/megan/ http://cbcb.umd.edu/software/metAMOS http://www.computationalbioenergy.org/software.html

Overview of MEGAN
- QC: ?
- Files shown: blast output, SAM files, rdp and biome files, csv and tsv files
- Taxonomy analysis: sequence comparison and assignment
- Functional analysis: SEED, KEGG, COG/EGGNOG
- Comparative visualisation: abundance plots, PCA, clustering, co-occurrence

Example of taxonomy analysis using MEGAN diverse single and multi-sample visualisations

Example of taxonomy analysis using MEGAN Comparison, PCA and co-occurrence plots


http://www.ebi.ac.uk/metagenomics