
1 Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects Richard H. Scheuermann, Ph.D. Department of Pathology Division of Biomedical Informatics U.T. Southwestern Medical Center

2 NIAID Bioinformatics Resource Centers www.pathogenportal.net

3 Influenza Research Database www.fludb.org

4 NIAID Genome Sequencing Centers

5 Metadata Inconsistencies
- Each project was providing different types of metadata
- No consistent nomenclature being used
- Impossible to perform reliable comparative genomics analysis

6 Dengue Clinical Metadata

7 Complex Query Interface

8 Additional Clinical Characteristics

9 GSC-BRC Metadata Standards Working Group
- NIAID assembled a group of representatives from their three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR) programs
- Develop metadata standards for pathogen isolate sequencing projects

10 GSC-BRC Metadata Working Groups

11 Metadata Standards Process
- Divide into pathogen subgroups: viruses, bacteria, eukaryotic pathogens and vectors
- Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
- Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project specific
- For each data field, provide:
  - definitions
  - synonyms
  - allowed value sets, preferably using controlled vocabularies
  - expected syntax
  - examples
  - data categories
  - data providers
- Merge subgroup core elements into a common set of core metadata fields and attributes
- Assemble metadata fields into a semantic network
- Harmonize semantic network with the Ontology for Biomedical Investigations (OBI)
- Compare, harmonize, map to other relevant initiatives, including MIGS, MIMS, BioProjects, BioSamples
- Establish policies and procedures for metadata submission workflows and GenBank linkage
- Develop data submission spreadsheets to be used for all white paper and BRC-associated projects
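The per-field attributes listed above (definition, synonyms, allowed values, expected syntax, example, data category, data provider) map naturally onto a small record type. The following is a minimal sketch of that idea, not the working group's actual schema: the class name `FieldSpec`, the field names `collection_date` and `specimen_type`, and the value sets and syntax pattern are all hypothetical illustrations.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class FieldSpec:
    """One metadata field, described by the attributes named on the slide."""
    name: str
    definition: str
    synonyms: List[str] = field(default_factory=list)
    allowed_values: Optional[Set[str]] = None   # controlled vocabulary, if any
    syntax: Optional[str] = None                # expected syntax as a regex
    example: str = ""
    category: str = ""                          # data category
    provider: str = ""                          # data provider

    def validate(self, value: str) -> List[str]:
        """Return a list of problems; an empty list means the value passes."""
        problems = []
        if self.allowed_values is not None and value not in self.allowed_values:
            problems.append(f"{self.name}: '{value}' is not an allowed value")
        if self.syntax is not None and re.fullmatch(self.syntax, value) is None:
            problems.append(f"{self.name}: '{value}' does not match expected syntax")
        return problems


# Hypothetical field definitions, for illustration only.
collection_date = FieldSpec(
    name="collection_date",
    definition="Date on which the specimen was collected",
    synonyms=["sampling date"],
    syntax=r"\d{4}-\d{2}-\d{2}",
    example="2011-03-15",
    category="specimen",
    provider="specimen collector",
)

specimen_type = FieldSpec(
    name="specimen_type",
    definition="Kind of material collected",
    allowed_values={"blood", "serum", "swab"},
    example="serum",
    category="specimen",
    provider="specimen collector",
)

print(collection_date.validate("15/03/2011"))
print(specimen_type.validate("serum"))
```

Keeping the allowed values and the syntax pattern as data rather than code means the same validator could, in principle, be driven from a submission spreadsheet template.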

12 Core Sample Metadata 30 Core Sample Metadata Fields

13 Core Project Metadata 16 Core Project Metadata Fields

14 Metadata Standards Process
- Divide into pathogen subgroups: viruses, bacteria, eukaryotic pathogens and vectors
- Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
- Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project specific
- For each data field, provide:
  - definitions
  - synonyms
  - allowed value sets, preferably using controlled vocabularies
  - expected syntax
  - examples
  - data categories
  - data providers
- Merge subgroup core elements into a common set of core metadata fields and attributes
- Assemble metadata fields into a semantic network (Scheuermann)
- Harmonize semantic network with the Ontology for Biomedical Investigations (OBI) (Stoeckert, Zheng)
- Compare, harmonize, map to other relevant initiatives, including MIGS, MIMS, BioProjects, BioSamples
- Establish policies and procedures for metadata submission workflows and GenBank linkage
- Develop data submission spreadsheets to be used for all white paper and BRC-associated projects

15 Specimen Isolation
[Semantic network diagram; layout not recoverable from the transcript.]
- Entities: organism, organism part, environmental material, environment, equipment, person, specimen X, specimen type, specimen isolation procedure X, specimen isolation procedure type, isolation protocol, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location), name, affiliation, ID, hypothesis, IRB/IACUC approval, Comments
- Roles: specimen source role, specimen capture role, specimen collector role
- Relations: has_input, has_output, plays, has_specification, has_part, denotes, located_in, has_affiliation, instance_of, is_about, has_authorization, has_quality
- Node labels shown in the diagram: v2, v3-4, v5-6, v7, v8, v9, v15, v16, v17, v18, v19, b18, b22, b23, b24, b25, b26, b27, b28, b29, b30

16 Metadata Processes
[Semantic network diagram; layout not recoverable from the transcript.]
- Stages: Specimen Isolation, Material Processing, Sequencing Assay, Data Processing, Investigation
- Processes: specimen isolation process, sample processing, sequencing assay, data transformations (image processing, assembly), data transformations (variant detection, serotype marker detection, gene detection), data archiving process
- Entities: specimen source (organism or environmental), specimen collector, specimen, microorganism, enriched NA sample, microorganism genomic NA, input sample, reagents, technician, equipment, isolation protocol, primary data, sequence data, genotype/serotype/gene data, sequence data record, GenBank ID, type, ID, qualities, temporal-spatial region
- Relations: has_input, has_output, has_specification, has_part, is_about, denotes, located_in, has_quality, instance_of
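The has_input and has_output relations in the process diagram can be read as a provenance chain running from the specimen source to the archived sequence record. The sketch below stores such a chain as subject-predicate-object triples and walks it backwards. The entity and relation names come from the slide, but the specific edges are an assumed reading of the flattened diagram, and the helper `upstream` is hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Assumed edges: the transcript lists the nodes and relation names,
# but not which node connects to which.
TRIPLES: List[Triple] = [
    ("specimen isolation process", "has_input", "specimen source"),
    ("specimen isolation process", "has_output", "specimen"),
    ("sample processing", "has_input", "specimen"),
    ("sample processing", "has_output", "enriched NA sample"),
    ("sequencing assay", "has_input", "enriched NA sample"),
    ("sequencing assay", "has_output", "primary data"),
    ("assembly", "has_input", "primary data"),
    ("assembly", "has_output", "sequence data"),
    ("data archiving process", "has_input", "sequence data"),
    ("data archiving process", "has_output", "sequence data record"),
    ("GenBank ID", "denotes", "sequence data record"),
]


def upstream(entity: str, triples: List[Triple]) -> List[str]:
    """Follow has_output -> has_input links back from an entity.

    Returns the alternating list of processes and inputs encountered,
    taking the first producer and first input at each step.
    """
    chain: List[str] = []
    seen = set()
    current = entity
    while current not in seen:
        seen.add(current)
        producers = [s for s, p, o in triples if p == "has_output" and o == current]
        if not producers:
            break  # nothing produced this entity: start of the chain
        process = producers[0]
        chain.append(process)
        inputs = [o for s, p, o in triples if s == process and p == "has_input"]
        if not inputs:
            break
        current = inputs[0]
        chain.append(current)
    return chain


for step in upstream("sequence data record", TRIPLES):
    print(step)
```

A plain list of tuples is enough for this kind of traversal; a real implementation would more likely use an RDF store and OBI terms, which this sketch does not attempt.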

17 Core-Project

18 Core-Specimen

19 Generic Assay
[Semantic network diagram; layout not recoverable from the transcript.]
- Entities: assay X, assay type, assay protocol, objectives, sample material X, input sample material X, sample ID, sample type, analyte X, quality x, person X, name, species, equipment X, serial #, equipment type, reagent type, lot #, run ID, primary data, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location)
- Roles: reagent role, target role, technician role, signal detection role
- Relations: has_input, has_output, has_specification, has_part, has_quality, plays, denotes, instance_of, located_in, is_about

20 Generic Material Transformation
[Semantic network diagram; layout not recoverable from the transcript.]
- Entities: material transformation X, material transformation type, material transformation protocol, objectives, sample material X, sample ID, sample type, output material X, material type, quality x, person X, name, species, equipment X, serial #, equipment type, reagent type, lot #, run ID, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location)
- Roles: reagent role, target role, technician role, signal detection role
- Relations: has_input, has_output, has_specification, has_part, has_quality, plays, denotes, instance_of, located_in

21 Generic Data Transformation
[Semantic network diagram; layout not recoverable from the transcript.]
- Entities: data transformation X, data transformation type, input data, output data, material X, algorithm, software, person X, name, run ID, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location)
- Roles: data analyst role
- Relations: has_input, has_output, has_specification, has_part, is_about, plays, denotes, instance_of, located_in

22 Generic Material (IC)
[Semantic network diagram; layout not recoverable from the transcript.]
- Entities: material X, material Y, material Z, ID, material type, quality x, quality y, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location)
- Relations: has_part, has_quality, denotes, instance_of, located_in

23 Conclusions
- Utility of semantic representation
  - Identified gaps in data field list (e.g. temporal components)
  - Identified gaps in ontology data standards (use case-driven standard development)
  - Identified commonalities in data structures (reusable)
  - Support for semantic queries and inferential analysis in future
- Two flavors of MIBBI
  - Distinguish between minimum information to reproduce an experiment and the minimum information to structure in a database for query and analysis
- OBI-based framework is re-usable
  - Sequencing => "omics"
- Practical issues about implementation strategies
  - Challenge of using ontologies for preferred value sets
    - Can be large
    - May not directly match common language
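One common way to handle the last point, ontology terms that do not match the language submitters actually use, is a synonym index that maps everyday terms onto the preferred label. The sketch below illustrates the idea under stated assumptions: the preferred labels and synonyms are invented for illustration and are not taken from OBI or any other ontology, and `build_index` and `normalize` are hypothetical helpers.

```python
from typing import Dict, List, Optional

# Hypothetical preferred labels with common-language synonyms.
PREFERRED: Dict[str, List[str]] = {
    "blood serum": ["serum", "sera"],
    "nasopharyngeal swab": ["NP swab"],
}


def build_index(preferred: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every label and synonym (case-insensitively) to its preferred label."""
    index: Dict[str, str] = {}
    for label, synonyms in preferred.items():
        index[label.casefold()] = label
        for synonym in synonyms:
            index[synonym.casefold()] = label
    return index


def normalize(value: str, index: Dict[str, str]) -> Optional[str]:
    """Return the preferred label for a submitted value, or None if unknown."""
    return index.get(value.strip().casefold())


INDEX = build_index(PREFERRED)
print(normalize(" Sera ", INDEX))
print(normalize("plasma", INDEX))
```

Returning `None` for unknown values, rather than guessing, leaves the decision about unmatched terms to a curator.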

