Why to submit your data and metadata?

Why to submit your data and metadata? Petra ten Hoopen, PhD European Nucleotide Archive

Why to submit your data and metadata to public data archives?
1. Molecular data archives
2. Metadata importance
3. Reporting of environmental data and metadata
4. Support for data reporting

1. Molecular data archives - European Nucleotide Archive
Permanent and comprehensive repository for public domain nucleotide sequences and associated information.
http://www.ebi.ac.uk/ena/home
Archiving, helpdesk, training, standards development, technology development, community building.

1. Molecular data archives - ENA data flow
[Diagram labels: raw reads, assemblies, annotation; ENA-captured data; brokered data; INSDC-exchanged data; large-scale programmatic services; small-scale WEBIN submission tool; validation, processing, archiving; direct presentation (browser/search/download); infrastructure services; domain-specific databases.]

1. Molecular data archives - ENA data model
Data = reads and nucleotide sequence assemblies.
Metadata = information associated with sequences, including the provenance of the biological sample (sample), the sequencing experiment (experiment) and its scope (study), the analysis and annotation of sequences (analysis), and the files of raw data (run).
[Diagram labels: Study, Experiment, Analysis, Sample, Run, Data]
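
The object names on this slide can be sketched as a small set of Python dataclasses. The slide only names the objects, so how they link to each other here (a study holding experiments and analyses, an experiment pointing at one sample and holding runs) is an assumption made for illustration, as are all aliases and file names.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sample:
    alias: str
    attributes: Dict[str, str] = field(default_factory=dict)  # provenance metadata


@dataclass
class Run:
    alias: str
    data_files: List[str]  # files of raw data


@dataclass
class Experiment:
    alias: str
    sample: Sample  # assumed link: one sample per experiment
    runs: List[Run] = field(default_factory=list)


@dataclass
class Analysis:
    alias: str
    description: str


@dataclass
class Study:
    alias: str
    title: str
    experiments: List[Experiment] = field(default_factory=list)
    analyses: List[Analysis] = field(default_factory=list)


def files_in_study(study: Study) -> List[str]:
    """Collect every raw data file reachable from a study."""
    return [f for e in study.experiments for r in e.runs for f in r.data_files]


# Hypothetical records, for illustration only.
sample = Sample("water-001", {"depth": "50"})
run = Run("run-001", ["reads_1.fastq.gz", "reads_2.fastq.gz"])
experiment = Experiment("exp-001", sample, [run])
study = Study("study-001", "Hypothetical marine survey", [experiment])
```

Keeping the sample as its own object means the same provenance description can be checked and searched independently of the sequence files.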

1. Molecular data archives - ENA data standardization
Standardized reporting requirements for all metadata and data objects (Study, Experiment, Analysis, Sample, Run, Data).
Agreed by: INSDC; consortia of scientific domain-specific experts (e.g. GSC, MicroB3, RNACentral, GMI).
Implemented with: community-agreed checklists and controlled vocabularies; data-type-specific file formats.

2. Metadata importance
Metadata are data about data.
Metadata on sampling and experimentation are essential: what, where, when, who, how.
Data without metadata are useless!

2. Metadata importance - data discovery
Umbrella project with component sequencing projects:
http://www.ebi.ac.uk/ena/data/view/PRJEB402

2. Metadata importance - data discovery
http://www.ebi.ac.uk/ena/browse (free text search, sequence similarity search, programmatic access, data download)
http://www.ebi.ac.uk/ena/data/warehouse/search (advanced search)

2. Metadata importance - data discovery
http://www.ebi.ac.uk/ena/data/warehouse/search?query=%22sampling_platform=%22SV%20Tara%22%22&domain=sample
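
A search URL of this shape can be assembled with the Python standard library. This is only a sketch of the percent-encoding involved: the helper name `build_search_url` is made up here, and the code makes no request, so it says nothing about whether the endpoint still answers in this form.

```python
from urllib.parse import quote

# Base address as it appears on the slide.
BASE = "http://www.ebi.ac.uk/ena/data/warehouse/search"


def build_search_url(field_name: str, value: str, domain: str) -> str:
    """Build a query of the form "field="value"" and percent-encode it.

    '=' is kept literal to match the URL shown on the slide; quotes and
    spaces become %22 and %20.
    """
    query = f'"{field_name}="{value}""'
    return f"{BASE}?query={quote(query, safe='=')}&domain={domain}"


url = build_search_url("sampling_platform", "SV Tara", "sample")
print(url)
```

The search only finds these samples because the sampling platform was reported in a dedicated, consistently named field.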

2. Metadata importance - data comparison
If metadata are adequately described, using a standardised vocabulary, comparing sequencing projects becomes possible.
Example: show the microbial species found in the North Pacific, at depths of 50-100 m, in samples taken May-June, compared to the Indian Ocean under the same conditions.
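
The comparison described on this slide can be sketched as a filter over sample records. Everything below is hypothetical: the field names (`region`, `depth_m`, `collection_date`, `taxa`), the records and the taxon labels are invented for illustration and do not come from any archive.

```python
from datetime import date

# Hypothetical sample records with standardised region names,
# depths in metres and real dates.
samples = [
    {"id": "S1", "region": "North Pacific Ocean", "depth_m": 75.0,
     "collection_date": date(2011, 5, 20), "taxa": {"taxon A", "taxon B"}},
    {"id": "S2", "region": "North Pacific Ocean", "depth_m": 300.0,
     "collection_date": date(2011, 5, 21), "taxa": {"taxon C"}},
    {"id": "S3", "region": "Indian Ocean", "depth_m": 60.0,
     "collection_date": date(2010, 6, 2), "taxa": {"taxon B", "taxon D"}},
    {"id": "S4", "region": "Indian Ocean", "depth_m": 80.0,
     "collection_date": date(2010, 1, 15), "taxa": {"taxon E"}},
]


def select(records, region, min_depth, max_depth, months):
    """Keep records from one region, one depth band and a set of months."""
    return [
        r for r in records
        if r["region"] == region
        and min_depth <= r["depth_m"] <= max_depth
        and r["collection_date"].month in months
    ]


def taxa_found(records):
    """Union of the taxa reported across the given records."""
    found = set()
    for r in records:
        found |= r["taxa"]
    return found


pacific = taxa_found(select(samples, "North Pacific Ocean", 50, 100, {5, 6}))
indian = taxa_found(select(samples, "Indian Ocean", 50, 100, {5, 6}))
shared = pacific & indian
only_pacific = pacific - indian
```

The filter is three simple comparisons only because region, depth and date are stored in one agreed form; free-text values such as "approx. 50m" would need parsing first.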

2. Metadata importance - data comparison
Unified description of marine datasets: Tara Oceans consortium
2009/2012 Tara Oceans, a global view
2013 Tara Oceans Polar Circle
[Figure: Sampling route of the Tara Oceans expedition (Scientific Data 2, Article number: 150023 (2015))]

2. Metadata importance - data comparison
Unified description of marine datasets: OSD consortium
21st June 2014 Ocean Sampling Day
21st June 2015 Ocean Sampling Day
[Figure: Map of OSD sites (http://mb3is.megx.net/osd-registry/list)]

2. Metadata importance - data analysis
OSD Analysis Consortium (unpublished)
Selected InterPro families are enriched (or otherwise) under various environmental conditions. For instance, IPR001661 is strongly underrepresented at certain PAR values and IPR001570 is overrepresented at certain latitudes.
[Figure: Heatmap of significant Spearman correlations between protein families and environmental conditions across 150 sites]
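
For readers unfamiliar with the statistic named in the figure, here is a minimal pure-Python sketch of a Spearman rank correlation between one protein family's abundance and one environmental variable. The numbers are invented for illustration and are not OSD results; the consortium's actual method and significance testing are not described on the slide.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values: latitude of five sites and the relative
# abundance of one protein family at each site.
latitude = [10, 25, 40, 55, 70]
abundance = [0.9, 0.7, 0.8, 0.3, 0.1]
rho = spearman(latitude, abundance)
```

Such a calculation across sites is only possible when every site reports the environmental variable in the same field and unit.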

2. Metadata importance - data analysis
If metadata are adequately described, using a standardised vocabulary, comparing sequencing projects becomes possible.

3. Reporting environmental data and metadata
http://www.ebi.ac.uk/ena/submit
INSDC-agreed standards for what should be reported and how:
Genome Assembly (GA): layers of reads, contigs, scaffolds, chromosomes
Read Data (RD): sequence reads and the associated sequencing study/experiment/analysis
Feature Table (FT): provenance and functional annotation of assembled sequences
Third Party Data (TPA): (re)assembly/annotation of existing records by a third party

3. Reporting environmental data and metadata
http://www.ebi.ac.uk/ena/about/reporting-standards
Community-developed standards for what should be reported and how:
MIGS, MIMS, MIMARKS (MIxS): Minimum Information about any (x) Sequence
M2B3: Minimum Information about marine microbial sample
GMI:MDM: Minimal Data for Mapping, in relation to the Global Microbial Identifier pathogen tracking initiative; a standard for sequence data of genome-scale pathogen surveys in clinical, animal and environmental samples
MINSEQE: Minimum Information about a high-throughput Nucleotide SeQuencing Experiment; describes high-throughput sequencing (HTS) experiments so that they can be interpreted unambiguously
BARCODE: biodiversity standard formulated by the Consortium for the Barcode of Life (CBOL), which currently approves the following loci as effective barcodes: cox1 (for animals), matK + rbcL (for plants) and ITS (for fungi)

3. Reporting environmental data and metadata - MIxS (MIGS, MIMS, MIMARKS)
Minimum Information about any (x) Sequence
Yilmaz et al., 2011

3. Reporting environmental data and metadata - MIxS (MIGS, MIMS, MIMARKS)
Minimum Information about any (x) Sequence
Example: MIMS-compliant water sample:
Submitted to INSDC
Project name
Investigation type
Sequencing method
Geographic location (latitude and longitude)
Geographic location (country and/or sea region)
Collection date
Environment (biome)
Environment (feature)
Environment (material)
Environmental package: water
Depth
http://wiki.gensc.org/index.php?title=MIxS
http://www.ebi.ac.uk/ena/submit/mixs-checklists

3. Reporting environmental data and metadata - M2B3
Minimum information about marine microbial sample
[Diagram labels: MIxS, CDI, OBIS]
Multidisciplinary standard: new insight into marine ecosystem functioning and its biotechnological potential by relating marine microbial diversity with its environmental oceanographic context.
ten Hoopen et al. 2015

3. Reporting environmental data and metadata - M2B3 Reporting Standard
Minimum information about marine microbial sampling

3. Reporting environmental data and metadata - M2B3 and MIxS
Example: M2B3-compliant marine sample: investigation campaign; investigation site; investigation platform; protocol label; sample title; geographic location (latitude and longitude); collection date; depth; biome; feature; material; temperature; salinity; scientific name; taxon ID; parameter ID
Example: MIMS-compliant water sample: submitted to INSDC; project name; investigation type; sequencing method; geographic location (latitude and longitude); geographic location (country and/or sea region); collection date; environment (biome); environment (feature); environment (material); environmental package: water; depth
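
A checklist like the MIMS example above can be applied as a simple completeness check before submission. This is a sketch under assumptions: the dictionary keys mirror the labels on the slide rather than any official field identifiers, every listed field is treated as mandatory, and the sample record is hypothetical.

```python
# Field labels taken from the MIMS water sample example on the slide.
# Treating all of them as mandatory is an assumption for illustration.
MIMS_WATER_FIELDS = [
    "submitted to INSDC",
    "project name",
    "investigation type",
    "sequencing method",
    "geographic location (latitude and longitude)",
    "geographic location (country and/or sea region)",
    "collection date",
    "environment (biome)",
    "environment (feature)",
    "environment (material)",
    "environmental package",
    "depth",
]


def missing_fields(record, required=MIMS_WATER_FIELDS):
    """Return the required fields that are absent or blank in a record."""
    return [name for name in required if not str(record.get(name, "")).strip()]


# Hypothetical record with one field left out (depth).
record = {
    "submitted to INSDC": "yes",
    "project name": "Hypothetical coastal survey",
    "investigation type": "metagenome",
    "sequencing method": "hypothetical sequencing platform",
    "geographic location (latitude and longitude)": "35.00 N 18.00 E",
    "geographic location (country and/or sea region)": "Mediterranean Sea",
    "collection date": "2014-06-21",
    "environment (biome)": "oceanic pelagic zone biome",
    "environment (feature)": "marine surface current",
    "environment (material)": "saline water",
    "environmental package": "water",
}

print(missing_fields(record))
```

Running the check on the record above reports the omitted depth, which is the kind of gap a checklist is meant to catch before the data reach the archive.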

3. Reporting environmental data and metadata - GMI
Minimum information about pathogen sample
GMI for global identification of pathogens to detect outbreaks and new pathogens.
[Diagram labels, with yes/no branches after "pathogen testing" and "environmental sample": collector name; collecting institution; collection date / receipt date; geographic location (latitude and longitude); geographic location (country and/or sea region); sample capture status (e.g. surveillance, farm); isolate; pathogen testing (no/yes); pathogen association; host name; host subject ID; host health state; host disease; isolation source host-associated; host sex/age/behaviour/habitat; host inoculation; isolation source non-host-associated; environmental sample (no/yes); strain; serovar / serotype; human surveillance data (antiviral treatment, vaccination details, illness details)]

3. Reporting environmental data and metadata
How many ways can you say ‘male’?
37 year old male; initial phase male; male fetus; six males mixed; 600 yr. old male; m; male plant; stallion; adult male; make; male, 8 weeks old; steer; bull; makle; male, castrated; sterile male; castrated male; mal e; male, pooled; strictly male; cm; male; males; tetraploide male; dioecious male; male (7-2872); man; type i males; diploid male; male (7-3074); men; type ii males; drone; male (m-a); normale male; virgin male; engorged male; male (m-o); ram; winged and wingless males; fertile male; male caucasian; rooster; young male; four males mixed; male child; s1 male sterile; individual male; male fertile; sex: male; male (note: this sample was originally provided as a "female" sample to us and therefore labeled this way in the brawand et al. paper and original geo submission; however, detailed data analyses carried out in the meantime clearly show that this sample stems from a male individual)
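
One way to tame free-text values like these is a lookup table that maps known spellings to a single controlled term and leaves everything else for manual review. The table below is a hypothetical sketch, not an archive's curation rule: reading "make", "makle" and "mal e" as typos for "male" is an assumption.

```python
# Raw spellings mapped to one controlled term. Values that carry extra
# information (age, species, pooling) are deliberately left out so they
# are flagged for a curator instead of being silently flattened.
SEX_LOOKUP = {
    "m": "male",
    "male": "male",
    "sex: male": "male",
    "make": "male",   # assumed typo
    "makle": "male",  # assumed typo
    "mal e": "male",  # assumed typo
}


def normalise_sex(raw: str):
    """Return the controlled term, or None when a curator should decide."""
    key = " ".join(raw.lower().split())  # lower-case, collapse whitespace
    return SEX_LOOKUP.get(key)


for value in [" Male ", "makle", "stallion", "six males mixed"]:
    print(repr(value), "->", normalise_sex(value))
```

Cleaning up after the fact is always lossy, which is why choosing the value from a controlled vocabulary at submission time is the better fix.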

3. Reporting environmental data and metadata - ENVO
Environment (biome) = broad ecological context, e.g. oceanic pelagic zone biome (ENVO:01000033)
Environment (feature) = local environment determined by, e.g. marine surface current (ENVO:01000099)
Environment (material) = surrounded by, e.g. saline water (ENVO:00002010)
http://environmentontology.org
http://ols.wordvis.com
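
An annotation written as a label followed by an ENVO identifier can be split and format-checked with a regular expression. This sketch assumes identifiers have the eight-digit form shown on the slide and only checks the pattern; it does not look the term up in the ontology, so a well-formed but wrong identifier would pass.

```python
import re

# Label, optional space, then an identifier such as (ENVO:01000033).
# The eight-digit form is assumed from the examples on the slide.
ANNOTATION = re.compile(r"^(?P<label>.+?)\s*\((?P<term_id>ENVO:\d{8})\)$")


def parse_envo(text: str):
    """Split 'label (ENVO:nnnnnnnn)' into (label, term_id), else None."""
    match = ANNOTATION.match(text.strip())
    if match is None:
        return None
    return match.group("label"), match.group("term_id")


# The three example annotations from the slide.
sample_environment = {
    "environment (biome)": "oceanic pelagic zone biome (ENVO:01000033)",
    "environment (feature)": "marine surface current (ENVO:01000099)",
    "environment (material)": "saline water (ENVO:00002010)",
}

parsed = {name: parse_envo(value) for name, value in sample_environment.items()}
```

A value without an identifier, such as plain "sea water", returns `None` and can be sent back to the submitter for a proper term.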

3. Reporting environmental data and metadata - Null value reporting language
not collected: information has not been collected
not provided: not given; a value may be given at a later stage
restricted access: information exists but cannot be released openly
not applicable: information is inappropriate to report
We strongly encourage you to provide true values whenever possible!
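
The four terms above can be told apart from true values and from ad hoc placeholders with a small classifier. The four standard terms come from the slide; the set of ad hoc placeholders ("n/a", "unknown" and so on) and the record are assumptions added for illustration.

```python
# The four terms of the null value reporting language on the slide.
STANDARD_NULLS = {
    "not collected",
    "not provided",
    "restricted access",
    "not applicable",
}

# Placeholders people commonly type instead; this list is an assumption.
AD_HOC_PLACEHOLDERS = {"", "n/a", "na", "none", "unknown", "-", "?"}


def classify(value: str) -> str:
    """Label a field value as a true value, a standard null or a placeholder."""
    text = value.strip().lower()
    if text in STANDARD_NULLS:
        return "standard null"
    if text in AD_HOC_PLACEHOLDERS:
        return "ad hoc placeholder"
    return "true value"


# Hypothetical sample record.
record = {
    "depth": "50",
    "salinity": "not collected",
    "temperature": "N/A",
    "collection date": "2014-06-21",
}

report = {name: classify(value) for name, value in record.items()}
print(report)
```

Flagging "N/A" separately makes it easy to ask the submitter which of the four standard reasons applies, or better, for the true value.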

4. Support for data reporting
Sequencing study description in the WEBIN submission tool: what, who
[Diagram labels: Study, Experiment, Analysis, Sample, Run, Data]

4. Support for data reporting
ENA checklists for sample provenance description in the WEBIN submission tool: where, when, how
MIxS microbial mat biofilm; MIxS host associated; MIxS human associated; MIxS human gut; MIxS human oral; MIxS human skin; MIxS human vaginal; MIxS plant associated; MIxS miscellaneous env.; MIxS sediment; MIxS soil; MIxS wastewater sludge; MIxS water; MIxS built env.
ENA default sample; ENA prokaryotic pathogen; GMI:MDM:1.1 pathogen; ENA virus pathogen; ENA Influenza; ENA M2B3 sample; ENA Tara sample; ENA plant sample
[Diagram labels: Study, Experiment, Analysis, Sample, Run, Data]

4. Support for data reporting
ENA checklists for sample provenance description in the WEBIN submission tool
Select checklist; select fields; field values

4. Support for data reporting
Controlled vocabularies for sequencing experiment description in the WEBIN submission tool: how
[Diagram labels: Study, Experiment, Analysis, Sample, Run, Data]

4. Support for data reporting
Checklists for functional annotation in the WEBIN submission tool
Select fields; field values
[Diagram labels: Study, Experiment, Analysis, Sample, Run, Data]

Take home message
Share your data to increase the impact of your research.
Data without metadata are useless.
Describe your data as richly as you would like to find the data of others.
Use standardised descriptors.
Use standardised values for descriptors.
Keep data updated: archives present YOUR data and the data give credit to YOU!

Thank you