EBI is an Outstation of the European Molecular Biology Laboratory. Bioinformatics Challenges in Data Handling and Presentation to the Bioinformaticists.



Metagenomic nucleotide sequence and annotation: Range of environments
- Global ocean survey
- Human faecal virus communities
- Human distal gut microbiome
- Phosphorus removal sludge communities
- Obesity-associated gut microbiome
- Acidophilic bacterial community
- Mouse gut flora

Metagenomic nucleotide sequence and annotation: Data growth: projects

Metagenomic nucleotide sequence and annotation: Data growth: volume of dataset

Metagenomic nucleotide sequence and annotation: Assembly issues
- Most metagenome records have not been assembled into scaffolds in INSDC records (only 4 of 24 projects so far) and remain as unassembled WGS records
- Those that have been assembled into scaffolds show very limited assembly: of the four assembled projects, one contains almost as many scaffolds as contigs

Metagenomic nucleotide sequence and annotation: Metadata issues
- Metadata, particularly sampling information, are often not shown, or are provided with limited granularity, restricting re-analysis by users
- INSDC offers appropriate structures for such metadata, but they are frequently not used, even when the information is available to the submitters

Current:
    FT   source
    FT                   /organism="marine metagenome"
    FT                   /environmental_sample
    FT                   /mol_type="genomic DNA"
    FT                   /isolation_source="isolated as part of a large dataset
    FT                   composed predominantly from surface water marine samples
    FT                   collected along a voyage from Eastern North American coast
    FT                   to the Eastern Pacific Ocean, including locations in the
    FT                   Sargasso Sea, Panama Canal, and the Galapagos Islands"
    FT                   /note="metagenomic"
    FT                   /db_xref="taxon:408172"

Could be:
    FT   source
    FT                   /organism="marine metagenome"
    FT                   /environmental_sample
    FT                   /mol_type="genomic DNA"
    FT                   /country="French Polynesia: Moorea, Cooks Bay"
    FT                   /lat_lon=" S W"
    FT                   /isolation_source="marine surface water; sample
    FT                   depth: 34M; size range: microns; water
    FT                   temperature: ; salinity: "
    FT                   /db_xref="taxon:408172"
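The difference between the two source features can also be checked mechanically. The following is a minimal sketch, not part of the original slides: it assumes the feature table has been flattened into one string, and the helper names `parse_qualifiers` and `missing_structured` are hypothetical. It only looks for the two structured qualifiers named on the slide, `/country` and `/lat_lon`.

```python
import re


def parse_qualifiers(flat_text):
    """Parse /key="value" and bare /flag qualifiers from a flattened FT block."""
    # Drop the "FT" line markers (assumption: "FT" never occurs as a word inside a value).
    body = re.sub(r"\bFT\b", " ", flat_text)
    qualifiers = {}
    # A qualifier is /name, optionally followed by ="quoted value".
    for match in re.finditer(r'/(\w+)(?:="([^"]*)")?', body):
        name, value = match.group(1), match.group(2)
        # Collapse whitespace left by line wrapping; bare flags map to True.
        qualifiers[name] = " ".join(value.split()) if value is not None else True
    return qualifiers


def missing_structured(qualifiers, wanted=("country", "lat_lon")):
    """List the structured sampling qualifiers that a source feature lacks."""
    return [name for name in wanted if name not in qualifiers]


# Shortened versions of the two source features shown on the slide.
current = ('FT source FT /organism="marine metagenome" FT /environmental_sample '
           'FT /isolation_source="isolated as part of a large dataset FT composed '
           'predominantly from surface water marine samples" FT /db_xref="taxon:408172"')
could_be = ('FT source FT /organism="marine metagenome" FT /environmental_sample '
            'FT /country="French Polynesia: Moorea, Cooks Bay" FT /db_xref="taxon:408172"')

print(missing_structured(parse_qualifiers(current)))
print(missing_structured(parse_qualifiers(could_be)))
```

With free text packed into `/isolation_source`, the sampling location cannot be queried as a field; with `/country` and `/lat_lon` present, it can.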

Metagenomic nucleotide sequence and annotation: Taxonomy issues
- Taxonomic annotation in metagenomic data is simplistic: a very small number of non-specific taxa are necessarily used to describe all of the raw data
- Analysis methodology, particularly binning, is inconsistent across the dataset, so taxonomic assertions in assembled sequence are of uncertain provenance
- Standards on whether or not single contigs should contribute to scaffolds for more than one taxon are yet to be established

Metagenomes and UniProt (1/2)
- As of this month, ~6 million protein sequences from Global Ocean Survey have been released (vs. 4,534,260 UniProtKB entries)
- Future exponential increase is anticipated:
  - The growth of public protein sequence data is exponential, with a doubling time of about 20 months
  - Metagenomics data will have a substantially shorter doubling time
  - GOS data will more than double the existing protein-coding sequences in UniProtKB
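The growth claim can be turned into a rough projection. This is an illustrative sketch only: it assumes clean exponential growth with the 20-month doubling time quoted on the slide, and the function names are hypothetical.

```python
import math


def projected_size(current, months, doubling_months=20.0):
    """Size after `months`, assuming exponential growth with a fixed doubling time."""
    return current * 2 ** (months / doubling_months)


def months_to_reach(current, target, doubling_months=20.0):
    """Months needed to grow from `current` to `target` under the same assumption."""
    return doubling_months * math.log2(target / current)


uniprotkb_entries = 4_534_260  # figure quoted on the slide

for months in (0, 20, 40, 60):
    print(months, round(projected_size(uniprotkb_entries, months)))
```

A shorter doubling time for metagenomics data, as the slide anticipates, can be explored by passing a smaller `doubling_months`; the slides do not give a value for it.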

Metagenomes and UniProt (2/2): Perspectives
- Vast amount of sequence data
- Environmental context in metadata
- New kind of data requires new storage, processing, and data mining procedures
- Taxonomically unassigned data will not be included in the UniProt Knowledgebase
- UniMES: UniProt Metagenomics and Environmental Sequences (June 2007)

UniMes requirements
- Distinct storage and dissemination: separated from current UniProt databases
- Distinct production pipeline
- Distinct accession number range: MES followed by 11 hexadecimal digits, e.g. MES
- Distinct data mining pipelines: less restrictive rules, because little is known about the taxonomic origin of these sequences
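The accession rule above is simple enough to express as a pattern. The sketch below is an assumption-laden illustration, not UniProt's own validation code: it assumes upper-case hexadecimal digits, and the sample accession is made up for the example because the one on the slide did not survive in the transcript.

```python
import re

# "MES" followed by exactly 11 hexadecimal digits (upper case is an assumption).
UNIMES_ACCESSION = re.compile(r"MES[0-9A-F]{11}")


def is_unimes_accession(text):
    """Return True if `text` has the accession shape described on the slide."""
    return UNIMES_ACCESSION.fullmatch(text) is not None


# Hypothetical accessions, for illustration only.
print(is_unimes_accession("MES0000000000A"))
print(is_unimes_accession("MES123"))
```

A dedicated prefix makes these entries distinguishable from UniProtKB accessions by inspection, which fits the requirement for a distinct accession number range.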

UniMes pipeline overview (diagram; labels only)
- EMBL primary data: genomic sequence (EMBL), other submissions, metagenomics data (WGS)
- UniProt Knowledgebase, UniProt Metagenomics, UniProt Archive
- Classification, clustering, automatic annotation rules
- Secondary analysis
- DNA Metagenomics (to be established)

UniProtKB vs. UniMes: database growth

UniMes storage growth

UniMes hardware requirements (1/2)
- 2 HP/Compaq AlphaServers ES45 with MHz CPUs and 12 GB memory
- Oracle database designed to store and maintain data derived from EMBL
- Oracle Warehouse for data analysis, integration and display
- 64-bit Linux farm (AMD Opteron) using 40 nodes for data mining procedures

UniMes hardware requirements (2/2)
- New Oracle servers: Sun Fire V490 with MHz UltraSPARC IV CPUs and 16 GB memory
- We have enough physical storage and CPU power for 2007

UniMes dissemination
- FASTA and XML files
- UniProt Web Site: text and similarity searches
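Since the sequences are distributed as FASTA files, a consumer needs little more than a streaming reader. The following is a generic sketch of FASTA parsing, not a UniMes-specific tool: the header layout of the real files is not given in the slides, so the records in the usage example are invented.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if header is not None:
        yield header, "".join(chunks)


# Invented records, for illustration only.
example = [
    ">seq1 hypothetical protein",
    "MKTAYIAK",
    "QRQISFVK",
    ">seq2 hypothetical protein",
    "MSDNE",
]
for header, sequence in read_fasta(example):
    print(header, len(sequence))
```

Because the function is a generator, it can be fed an open file handle and will process a very large file one record at a time.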

GOS submission
- Submission of nucleic acid sequence data to EMBL/GenBank/DDBJ is mandatory for publication of a scientific paper
- Craig Venter Institute submission to EMBL/GenBank/DDBJ in March 2007
- Environmental metadata can only be found on the CAMERA website
- Metadata are of great importance for metagenomic sequence data:
  - Descriptions of sampling sites and habitats
  - Analysis of metagenomics sequence data
- URGENT need for the community to agree on what metadata must be included with the submission of any metagenomics sample

UniMes and GOS data

Top 10 InterPro entries hitting UniProt
Top 10 InterPro entries hitting GOS
Top 10 InterPro entries hitting UniParc (including GOS)

UniMes and GOS data: Analysis
- Calculation time: 763,425 CPU hours
- Storage for InterPro hits to GOS: 50 GB
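To put the CPU-hour figure in perspective, it can be converted to elapsed time for a given number of parallel workers. This is a back-of-the-envelope sketch: it assumes perfect parallel scaling and, purely for illustration, one worker per node of the 40-node farm mentioned on the hardware slide. The slides do not state how the analysis was actually scheduled.

```python
def wall_clock_days(cpu_hours, workers):
    """Elapsed days if `cpu_hours` of work is spread evenly over `workers`."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return cpu_hours / workers / 24


cpu_hours = 763_425  # figure quoted on the slide

# 40 workers is an assumption (one per farm node); other values are for comparison.
for workers in (1, 40, 400):
    print(workers, round(wall_clock_days(cpu_hours, workers), 1))
```

Under these assumptions, 40 workers would need roughly 795 days, so the real run presumably used more than one CPU per node, additional machines, or both.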