1
Why submit your data and metadata?
Petra ten Hoopen, PhD, European Nucleotide Archive
2
Why submit your data and metadata to public data archives?
1. Molecular data archives
2. Metadata importance
3. Reporting of environmental data and metadata
4. Support for data reporting
3
European Nucleotide Archive
1. Molecular data archives
European Nucleotide Archive: permanent and comprehensive repository for public domain nucleotide sequences and associated information
Activities: archiving, helpdesk, training, standards development, technology development, community building
4
1. Molecular data archives: ENA data flow
[Figure: ENA data flow. Labels: raw reads, assemblies, annotation; ENA-captured data; large-scale programmatic services; small-scale WEBIN submission tool; brokered data; INSDC-exchanged data; validation, processing, archiving; direct presentation (Browser/Search/Download); infrastructure services; domain-specific databases]
5
1. Molecular data archives
ENA data model
Data = reads and nucleotide sequence assemblies
Metadata = information associated with sequences. It includes the provenance of the biological sample (sample), the sequencing experiment (experiment) and its scope (study), the analysis and annotation of sequences (analysis), and the files of raw data (run)
Objects: Study, Experiment, Analysis, Sample, Run, Data
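The object model on this slide can be sketched as a handful of linked records. This is only an illustration of how the objects refer to each other: the class fields, aliases and file name below are assumptions made for the example, not the ENA submission schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


# All field names here are hypothetical; they only mirror the roles
# named on the slide (study, sample, experiment, run, analysis).
@dataclass
class Study:
    alias: str
    title: str  # scope of the sequencing project


@dataclass
class Sample:
    alias: str
    attributes: Dict[str, str] = field(default_factory=dict)  # provenance


@dataclass
class Experiment:
    alias: str
    study: Study
    sample: Sample
    platform: str  # how the sample was sequenced


@dataclass
class Run:
    alias: str
    experiment: Experiment
    files: List[str] = field(default_factory=list)  # raw data files


@dataclass
class Analysis:
    alias: str
    study: Study
    samples: List[Sample] = field(default_factory=list)
    description: str = ""  # analysis or annotation of sequences


study = Study("demo-study", "Hypothetical marine survey")
sample = Sample("demo-sample", {"depth": "50", "collection date": "2014-06-21"})
experiment = Experiment("demo-experiment", study, sample, "ILLUMINA")
run = Run("demo-run", experiment, ["reads_1.fastq.gz"])

# Starting from a raw data file, the sample provenance is two links away.
print(run.experiment.sample.attributes["collection date"])
```

The point of the sketch is that every data file stays reachable from, and traceable to, the metadata that describes it.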
6
ENA data standardization
1. Molecular data archives
ENA data standardization
Standardized reporting requirements for all metadata and data objects (Study, Experiment, Analysis, Sample, Run, Data)
agreed by: INSDC; consortia of scientific domain-specific experts (e.g. GSC, MicroB3, RNACentral, GMI)
implemented with: community-agreed checklists and controlled vocabularies; data-type-specific file formats
7
Metadata on sampling and experimentation are essential:
2. Metadata importance
Metadata are data about data.
Metadata on sampling and experimentation are essential: what, where, when, who, how.
Data without metadata are useless!
8
2. Metadata importance - data discovery
Umbrella project with component sequencing projects
9
2. Metadata importance – data discovery
free text search, sequence similarity search, programmatic access, data download, advanced search
10
2. Metadata importance - data discovery
mpling_platform=%22SV%20Tara%22%22&domain=sample
11
2. Metadata importance – data comparison
If metadata are adequately described, using a standardised vocabulary, comparing sequencing projects becomes possible.
Example query: show the microbial species found in the North Pacific, at depths of 50–100 m, in samples taken May–June, compared to the Indian Ocean under the same conditions.
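A query like the one on this slide only works when region, depth and date are recorded in the same way for every sample. The sketch below runs it over a tiny in-memory list; the sample records, taxon labels and key names are invented for illustration and do not come from any real dataset or ENA service.

```python
from datetime import date

# Invented records: standardised region names, depths in metres, ISO dates.
samples = [
    {"id": "NP1", "region": "North Pacific Ocean", "depth_m": 75.0,
     "collected": date(2011, 5, 20), "taxa": {"taxon_a", "taxon_b"}},
    {"id": "NP2", "region": "North Pacific Ocean", "depth_m": 400.0,
     "collected": date(2011, 6, 1), "taxa": {"taxon_c"}},
    {"id": "IO1", "region": "Indian Ocean", "depth_m": 60.0,
     "collected": date(2010, 6, 2), "taxa": {"taxon_b", "taxon_d"}},
    {"id": "IO2", "region": "Indian Ocean", "depth_m": 80.0,
     "collected": date(2010, 11, 2), "taxa": {"taxon_e"}},
]


def taxa_for(records, region, min_depth, max_depth, months):
    """Collect taxa from samples matching region, depth range and months."""
    found = set()
    for rec in records:
        if (rec["region"] == region
                and min_depth <= rec["depth_m"] <= max_depth
                and rec["collected"].month in months):
            found |= rec["taxa"]
    return found


north_pacific = taxa_for(samples, "North Pacific Ocean", 50, 100, {5, 6})
indian = taxa_for(samples, "Indian Ocean", 50, 100, {5, 6})

print("shared:", sorted(north_pacific & indian))
print("North Pacific only:", sorted(north_pacific - indian))
```

If one submitter had written "N. Pacific" or reported depth in feet, the equality and range tests would silently miss those samples, which is the practical argument for standardised vocabularies and units.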
12
Unified description of marine datasets
2. Metadata importance - data comparison
Unified description of marine datasets: Tara Oceans consortium
2009/2012: Tara Oceans, a global view
2013: Tara Oceans Polar Circle
Figure: sampling route of the Tara Oceans expedition (Scientific Data 2, Article number: (2015)
13
Unified description of marine datasets
2. Metadata importance - data comparison
Unified description of marine datasets: OSD consortium
21st June 2014: Ocean Sampling Day
21st June 2015: Ocean Sampling Day
Figure: map of OSD Sites (
14
2. Metadata importance - data analysis
OSD Analysis Consortium (unpublished)
Selected InterPro families are enriched (or otherwise) under various environmental conditions. For instance, IPR001661 is strongly underrepresented at certain PAR values and IPR is overrepresented at certain latitudes.
Figure: heatmap of significant Spearman correlations between protein families and environmental conditions across 150 sites
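Each cell of such a heatmap is a Spearman rank correlation between one protein family's abundance and one environmental variable across sites. A minimal pure-Python sketch of that statistic follows; the PAR and abundance numbers are invented, and this is not the consortium's analysis code.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5  # undefined if either input is constant


# Invented values for five sites: light level and one family's abundance.
par = [10, 20, 30, 40, 50]
abundance = [9.1, 7.4, 7.9, 3.2, 1.0]
print(round(spearman(par, abundance), 3))
```

The calculation is only possible because the environmental variable was measured and reported for every site alongside the sequences.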
15
2. Metadata importance - data analysis
If metadata are adequately described, using a standardised vocabulary, comparing sequencing projects becomes possible.
16
3. Reporting environmental data and metadata
INSDC-agreed: what should be reported and how
Genome Assembly (GA) – layers of reads, contigs, scaffolds, chromosome
Read Data (RD) – sequence reads, associated sequencing study/experiment/analysis
Feature Table (FT) – provenance and functional annotation of assembled sequences
Third Party Data (TPA) – (re)assembly/annotation of existing records by third party
17
3. Reporting environmental data and metadata
community-developed: what should be reported and how
MIGS, MIMS, MIMARKS (MIxS) – Minimum Information about any (x) Sequence
M2B3 – Minimum Information about marine microbial sample
GMI:MDM – Minimal Data for Mapping in relation to the Global Microbial Identifier pathogen tracking initiative; a standard for sequence data of genome-scale pathogen surveys in clinical, animal and environmental samples
MINSEQE – Minimum Information about a high-throughput Nucleotide SeQuencing Experiment; describes high-throughput sequencing (HTS) experiments so that they can be interpreted unambiguously
BARCODE – biodiversity standard formulated by the Consortium for the Barcode of Life (CBOL), which currently approves the following loci as effective barcodes: cox1 (for animals), matK + rbcL (for plants) and ITS (for fungi)
18
MIxS (MIGS, MIMS, MIMARKS) Minimum Information about any (x) Sequence
3. Reporting environmental data and metadata MIxS (MIGS, MIMS, MIMARKS) Minimum Information about any (x) Sequence Yilmaz et al., 2011
20
MIxS (MIGS, MIMS, MIMARKS) Minimum Information about any (x) Sequence
3. Reporting environmental data and metadata
MIxS (MIGS, MIMS, MIMARKS) Minimum Information about any (x) Sequence
Example: MIMS-compliant water sample: submitted to INSDC; project name; investigation type; sequencing method; geographic location (latitude and longitude); geographic location (country and/or sea region); collection date; environment (biome); environment (feature); environment (material); environmental package-water; depth
21
3. Reporting environmental data and metadata: M2B3
Minimum information about marine microbial sample
[Figure labels: MIxS, CDI, OBIS]
A multidisciplinary standard that gives new insight into marine ecosystem functioning and its biotechnological potential by relating marine microbial diversity to its environmental oceanographic context (ten Hoopen et al. 2015).
22
3. Reporting environmental data and metadata
M2B3 Reporting Standard minimum information about marine microbial sampling
23
3. Reporting environmental data and metadata: M2B3 and MIxS
Example: M2B3-compliant marine sample: investigation campaign; investigation site; investigation platform; protocol label; sample title; geographic location (latitude and longitude); collection date; depth; biome; feature; material; temperature; salinity; scientific name; taxon ID; parameter ID
Example: MIMS-compliant water sample: submitted to INSDC; project name; investigation type; sequencing method; geographic location (latitude and longitude); geographic location (country and/or sea region); collection date; environment (biome); environment (feature); environment (material); environmental package-water; depth
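A checklist of this kind can be enforced mechanically before submission. The sketch below tests a sample record against the MIMS water fields listed on this slide. The function name, the record values and the spelling of the keys are assumptions for the example; the authoritative field names and formats are those of the checklist itself.

```python
# Field names copied from the slide's MIMS water example. The slide's
# "environmental package-water" is treated here as the field
# "environmental package" with the value "water" (an assumption).
MIMS_WATER_FIELDS = [
    "submitted to INSDC",
    "project name",
    "investigation type",
    "sequencing method",
    "geographic location (latitude and longitude)",
    "geographic location (country and/or sea region)",
    "collection date",
    "environment (biome)",
    "environment (feature)",
    "environment (material)",
    "environmental package",
    "depth",
]


def missing_fields(record, required):
    """Return required fields that are absent or blank, in checklist order."""
    return [name for name in required
            if not str(record.get(name, "")).strip()]


# Hypothetical sample record with two gaps: no feature, blank depth.
record = {
    "submitted to INSDC": "yes",
    "project name": "Hypothetical coastal survey",
    "investigation type": "metagenome",
    "sequencing method": "Illumina",
    "geographic location (latitude and longitude)": "54.18 N 7.90 E",
    "geographic location (country and/or sea region)": "North Sea",
    "collection date": "2014-06-21",
    "environment (biome)": "marine biome",
    "environment (material)": "sea water",
    "environmental package": "water",
    "depth": "  ",
}

print(missing_fields(record, MIMS_WATER_FIELDS))
```

A presence check is the weakest useful test; a fuller validator would also check formats and units for each field.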
24
3. Reporting environmental data and metadata
GMI: minimum information about pathogen sample
GMI is for global identification of pathogens, to detect outbreaks and new pathogens.
collector name; collecting institution; collection date / receipt date; geographic location (latitude and longitude); geographic location (country and/or sea region); sample capture status (e.g. surveillance, farm); isolate; pathogen testing (no / yes); pathogen association; host name; host subject ID; host health state; host disease; isolation source host-associated; host sex/age/behaviour/habitat; host inoculation; isolation source non-host-associated; environmental sample (no / yes); strain; serovar / serotype; human surveillance data (antiviral treatment, vaccination details, illness details)
25
3. Reporting environmental data and metadata
How many ways can you say ‘male’?
37 year old male initial phase male male fetus six males mixed 600 yr. old male m male plant stallion adult male make male, 8 weeks old steer bull makle male, castrated sterile male castrated male mal e male, pooled strictly male cm male males tetraploide male dioecious male male (7-2872) man type i males diploid male male (7-3074) men type ii males drone male (m-a) normale male virgin male engorged male male (m-o) ram winged and wingless males fertile male male caucasian rooster young male four males mixed male child s1 male sterile individual male male fertile sex: male male (note: this sample was originally provided as a "female" sample to us and therefore labeled this way in the brawand et al. paper and original geo submission; however, detailed data analyses carried out in the meantime clearly show that this sample stems from a male individual)
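Free-text values like these are why controlled vocabularies matter. One cautious way to clean them up afterwards is to map only exact, known spellings to a controlled term and to leave everything else for a curator. The synonym table and function below are hypothetical, written for this illustration; they are not an ENA or INSDC vocabulary.

```python
# Hypothetical synonym table; only unambiguous spellings are mapped.
SEX_SYNONYMS = {
    "male": "male",
    "m": "male",
    "female": "female",
    "f": "female",
}


def normalise_sex(raw):
    """Return a controlled term, or None when a curator has to decide."""
    key = raw.strip().lower()
    if key.startswith("sex:"):
        key = key[len("sex:"):].strip()  # drop a leading field label
    return SEX_SYNONYMS.get(key)


reported = ["male", " M ", "sex: male", "makle", "male, castrated",
            "six males mixed"]
for value in reported:
    print(repr(value), "->", normalise_sex(value))
```

Values such as "male, castrated" or "six males mixed" carry extra information (treatment, pooling) that belongs in separate fields, so the sketch deliberately does not guess at them. Choosing the value from a fixed list at submission time avoids the problem entirely.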
26
3. Reporting environmental data and metadata
ENVO
Environment (biome) = broad ecological context, e.g. oceanic pelagic zone biome (ENVO: )
Environment (feature) = local environment determined by, e.g. marine surface current (ENVO: )
Environment (material) = surrounded by, e.g. saline water (ENVO: )
27
3. Reporting environmental data and metadata
Null value reporting language
not collected – information has not been collected
not provided – not given; a value may be given at a later stage
restricted access – information exists but cannot be released openly
not applicable – information is inappropriate to report
We strongly encourage you to provide true values whenever possible!
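The four terms above can be told apart from real values, and from ad hoc placeholders, with a simple lookup. In this sketch the four standard terms come from the slide; the set of "non-standard" placeholders and the function name are assumptions chosen for illustration.

```python
# The four null terms listed on the slide.
NULL_TERMS = {"not collected", "not provided", "restricted access",
              "not applicable"}

# Assumed examples of ad hoc placeholders a submitter might type instead.
AD_HOC_NULLS = {"n/a", "na", "none", "unknown", "-", "?"}


def classify_value(raw):
    """Classify a reported value as 'empty', 'null term',
    'non-standard null' or 'value'."""
    text = raw.strip().lower()
    if not text:
        return "empty"
    if text in NULL_TERMS:
        return "null term"
    if text in AD_HOC_NULLS:
        return "non-standard null"
    return "value"


for reported in ["35.2", "Not Collected", "n/a", "   ", "restricted access"]:
    print(repr(reported), "->", classify_value(reported))
```

Distinguishing "not collected" from "restricted access" tells a data user whether asking the submitter could ever recover the value, which a blank cell or "n/a" cannot.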
28
4. Support for data reporting
Sequencing study description in the WEBIN submission tool: what, who
[Diagram: Study, Experiment, Analysis, Sample, Run, Data]
29
4. Support for data reporting
ENA checklists for sample provenance description in the WEBIN submission tool: where, when, how
MIxS checklists: microbial mat biofilm; host associated; human associated; human gut; human oral; human skin; human vaginal; plant associated; miscellaneous env.; sediment; soil; wastewater sludge; water; built env.
Other checklists: ENA default sample; ENA prokaryotic pathogen; GMI:MDM:1.1 pathogen; ENA virus pathogen; ENA Influenza; ENA M2B3 sample; ENA Tara sample; ENA plant sample
[Diagram: Study, Experiment, Analysis, Sample, Run, Data]
30
4. Support for data reporting
ENA checklists for sample provenance description in the WEBIN submission tool
Steps: select checklist, select fields, fill in field values
31
4. Support for data reporting
Controlled vocabularies for sequencing experiment description in the WEBIN submission tool: how
[Diagram: Study, Experiment, Analysis, Sample, Run, Data]
32
4. Support for data reporting
Checklists for functional annotation in the WEBIN submission tool
Steps: select fields, fill in field values
[Diagram: Study, Experiment, Analysis, Sample, Run, Data]
33
Take home message
Share your data to increase the impact of your research
Data without metadata are useless
Describe your data as richly as you would like to find the data of others described
Use standardised descriptors
Use standardised values for descriptors
Keep data updated – archives present YOUR data and the data give credit to YOU!
34
Thank you