Standardized Metadata for Human Pathogen/Vector Genomic Sequences
Richard H. Scheuermann, Ph.D., Director of Informatics, J. Craig Venter Institute
On behalf of the GSC-BRC Metadata Working Group
Genome Sequencing Centers for Infectious Diseases (GSCID)
Bioinformatics Resource Centers (BRC)
High Throughput Sequencing
Enabling technology
– Epidemiology of outbreaks
– Pathogen evolution
– Host range restriction
– Genetic determinants of virulence and pathogenicity
Metadata requirements
– Temporal-spatial information about isolates
– Selective pressures
– Host species of specimen source
– Disease severity and clinical manifestations
Metadata Submission Spreadsheets
Complex Query Interface
Metadata Inconsistencies
Each project was providing different types of metadata
No consistent nomenclature being used
Impossible to perform reliable comparative genomics analysis
Required extensive custom bioinformatics system development
GSC-BRC Metadata Standards Working Group
NIAID assembled representatives from its three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR)
Develop an approach for capturing standardized metadata for pathogen isolate sequencing projects
Bottom-up approach to capture data considered important by users
Compatible with data standards and submission requirements
Metadata Standardization Process
Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
Identify data fields that appear to be common across projects and samples (core) and data fields that appear to be pathogen- or project-specific
For each data field, provide a common set of attributes, including preferred term, definition, synonyms, allowed value sets (preferably using controlled vocabularies), expected syntax, etc.
Assemble all metadata fields into a semantic network based on the Ontology for Biomedical Investigations (OBI)
Compare, map, and harmonize with other relevant initiatives, including Genomic Standards Consortium MIxS and NCBI BioProject/BioSample
Draft data submission spreadsheets
Beta test the version 1.0 standard with new GSCID white paper projects, collecting feedback
Adopt the version 1.1 metadata standard and data submission spreadsheets for all GSCID white paper and BRC-associated projects
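The per-field attributes listed above (preferred term, definition, synonyms, allowed values, expected syntax) could be captured in a small record type. The following is a minimal sketch, not the working group's implementation: the class name, the definition strings, the gender value set, and the date pattern are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class MetadataField:
    """One entry in a metadata standard (hypothetical structure)."""
    field_id: str                            # e.g. "CS4"
    preferred_term: str                      # e.g. "Specimen Source Gender"
    definition: str
    synonyms: tuple = ()
    allowed_values: frozenset = frozenset()  # empty means no value set
    syntax: str = ""                         # regular expression; empty means any

    def validate(self, value: str) -> bool:
        """Return True if value satisfies the value set and the syntax."""
        if self.allowed_values and value not in self.allowed_values:
            return False
        if self.syntax and re.fullmatch(self.syntax, value) is None:
            return False
        return True


# Illustrative definitions only; value set and date pattern are assumptions.
GENDER = MetadataField(
    field_id="CS4",
    preferred_term="Specimen Source Gender",
    definition="Sex of the organism from which the specimen was taken.",
    synonyms=("host-sex", "sex"),
    allowed_values=frozenset({"male", "female", "unknown"}),
)
COLLECTION_DATE = MetadataField(
    field_id="CS8",
    preferred_term="Specimen Collection Date",
    definition="Date on which the specimen was collected.",
    synonyms=("collection_date", "collection date"),
    syntax=r"\d{4}-\d{2}-\d{2}",
)

print(GENDER.validate("female"))               # True
print(GENDER.validate("F"))                    # False
print(COLLECTION_DATE.validate("2012-03-05"))  # True
print(COLLECTION_DATE.validate("March 2012"))  # False
```

Keeping the value set and the syntax on the field definition means a submission spreadsheet can be checked row by row against the same definitions that document the standard.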
Core Project Metadata
Field ID | Metadata Field Descriptor | OBO Foundry ID | BioProject/BioSample; MIxS
CP1 | Project Title | | name
CP2 | Project ID | http://purl.obolibrary.org/obo/OBI_ |
CP3 | Project Description | http://purl.obolibrary.org/obo/OBI_ | Description
CP4 | Supporting Grants/Contract ID | http://purl.obolibrary.org/obo/OBI_ | Grant Agency
CP5 | Publication Citation | http://purl.obolibrary.org/obo/OBI_ | PubMed ID; ref_biomaterial
CP6 | Sample Provider Principal Investigator (PI) Name | |
CP7 | Sample Provider PI's Institution | |
CP8 | Sample Provider PI's | |
CP9 | Sequencing Facility | |
CP10 | Sequencing Facility Contact Name | |
CP11 | Sequencing Facility Contact's Institution | |
CP12 | Sequencing Facility Contact's | |
CP13 | Bioinformatics Resource Center | http://purl.obolibrary.org/obo/OBI_ |
CP14 | Bioinformatics Resource Center Contact Name | |
CP15 | Bioinformatics Resource Center Contact's Institution | |
CP16 | Bioinformatics Resource Center Contact's | |
CP17 | Target Material | | Material
CP18 | Project Method | | Methodology
CP19 | Project Objectives | | Objective
CP20 | Sample Scope | |
CP21 | Target Capture | | Capture
Core Sample Metadata
Field ID | Metadata Field Descriptor | OBO Foundry ID | NCBI BioSample; MIxS
CS1 | Specimen Source ID | http://purl.obolibrary.org/obo/OBI_ | host-subject-id; host_subject_id
CS2 | Specimen Source Species | http://purl.obolibrary.org/obo/OBI_ | specific_host; host_taxid
CS3 | Specimen Source Common Name | | host-common-name; host_common_name
CS4 | Specimen Source Gender | http://purl.obolibrary.org/obo/PATO_ | host-sex; sex
CS5 | Specimen Source Age - Value | http://purl.obolibrary.org/obo/OBI_ | host-age; age
CS6 | Specimen Source Age - Unit | http://purl.obolibrary.org/obo/UO_ | host-age
CS7 | Specimen Source Health Status | http://purl.obolibrary.org/obo/OGMS_ | host-health-state; disease status
CS8 | Specimen Collection Date | http://purl.obolibrary.org/obo/OBI_ | collection_date; collection date
CS9 | Specimen Collection Location - Latitude | http://purl.obolibrary.org/obo/OBI_ | lat_lon; geographic location (lat and long)
CS10 | Specimen Collection Location - Longitude | http://purl.obolibrary.org/obo/OBI_ | lat_lon; geographic location (lat and long)
CS11 | Specimen Collection Location - Location | http://purl.obolibrary.org/obo/GAZ_ | geo_loc_name
CS12 | Specimen Collection Location - Country | http://purl.obolibrary.org/obo/OBI_ | geo_loc_name; geographic location (country and/or sea)
CS13 | Specimen ID | http://purl.obolibrary.org/obo/OBI_ | sample name
CS14 | Specimen Type | http://purl.obolibrary.org/obo/OBI_ | host-tissue-sampled; body habitat, body site, body product
CS15 | Suspected Organism(s) in Specimen - Species | http://purl.obolibrary.org/obo/OBI_ | organism
CS16 | Suspected Organism(s) in Specimen - Subclass | | strain; subspecific genetic lineage
CS17 | Human Pathogenicity of Suspected Organism(s) in Specimen | | phenotype
CS18 | Environmental Material | http://purl.obolibrary.org/obo/ENVO_ | isolation-source; environment (material)
CS19 | Organism Detection Method | http://purl.obolibrary.org/obo/OBI_ | sample collection device or method
CS20 | Specimen Repository | | culture-collection; source material identifiers
CS21 | Specimen Repository Sample ID | | culture-collection; source material identifiers
CS22 | Sample ID - Sequencing Facility | |
CS23 | Nucleic Acid Extraction Method | http://purl.obolibrary.org/obo/OBI_ | samp_mat_process; sample material processing
CS24 | Nucleic Acid Preparation Method | | samp_mat_process; sample material processing
CS25 | Sequencing Method | http://purl.obolibrary.org/obo/OBI_ | sequencing method
CS26 | Assembly Algorithm | http://purl.obolibrary.org/obo/OBI_ | assembly
CS27 | Depth of Coverage - Average | http://purl.obolibrary.org/obo/OBI_ | finishing strategy
CS28 | Annotation Algorithm | http://purl.obolibrary.org/obo/OBI_ |
CS29 | GenBank Record ID | http://purl.obolibrary.org/obo/OBI_ |
CS30 | Comments | http://purl.obolibrary.org/obo/IAO_ | host-description
CS31 | Specimen Collector Name | | collected-by
CS32 | Specimen Collector's Institution | |
CS33 | Specimen Collector's | |
CS34 | Sample Category | | attribute_package
CS35 | Host Disease | | host-disease
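Because each core sample field carries its NCBI BioSample and MIxS equivalents, a record keyed by field ID can in principle be renamed for either target. The sketch below assumes a hypothetical `translate` helper and uses only rows of the core sample table where both mappings are present; the record values are invented sample data.

```python
# Field ID -> (NCBI BioSample name, MIxS name), taken from rows of the
# core sample table that list both mappings.
CORE_SAMPLE_MAP = {
    "CS1": ("host-subject-id", "host_subject_id"),
    "CS4": ("host-sex", "sex"),
    "CS8": ("collection_date", "collection date"),
    "CS16": ("strain", "subspecific genetic lineage"),
    "CS18": ("isolation-source", "environment (material)"),
}


def translate(record: dict, target: str) -> dict:
    """Rename field-ID keys for target 'biosample' or 'mixs'.

    Fields without a mapping keep their field ID as the key.
    """
    index = {"biosample": 0, "mixs": 1}[target]
    out = {}
    for field_id, value in record.items():
        names = CORE_SAMPLE_MAP.get(field_id)
        out[names[index] if names else field_id] = value
    return out


# Invented example record, keyed by core sample field ID.
record = {
    "CS1": "subject-001",
    "CS4": "female",
    "CS8": "2012-03-05",
    "CS13": "specimen-17",
}

print(translate(record, "biosample"))
print(translate(record, "mixs"))
```

Keying records by the standard's own field IDs keeps the submission format independent of either downstream naming scheme.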
Metadata Processes
[Figure: semantic network of the processes that generate metadata. Panels: Investigation, Specimen Isolation, Material Processing, Sequencing Assay, Data Processing, Host Characterization. Nodes include specimen source (organism or environmental), specimen collector, specimen, specimen isolation process, isolation protocol, sample processing, enriched NA sample, microorganism genomic NA, sequencing assay (input sample, reagents, technician, equipment), primary data, data transformations (image processing, assembly, variant detection, serotype marker detection, gene detection), sequence data, genotype/serotype/gene data, quality assessment assay, data archiving process, sequence data record, GenBank ID, and temporal-spatial region. Relations: has_input, has_output, has_specification, has_part, has_quality, is_about, denotes, located_in, instance_of.]
Specimen Isolation
[Figure: semantic network for the specimen isolation process. Nodes include organism, organism part, environmental material, environment, equipment, person, specimen source role, specimen capture role, specimen collector role, specimen X, specimen isolation procedure X, isolation protocol, specimen type, specimen isolation procedure type, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location), name, affiliation, ID, hypothesis, IRB/IACUC approval, and pathogenic disposition. Relations: has_input, has_output, has_specification, has_part, has_quality, has_disposition, has_affiliation, has_authorization, plays, denotes, located_in, is_about, instance_of. Core sample fields annotated on the network: CS1 (ID), CS2/3, CS4 (gender), CS5/6 (age), CS7 (health status), CS8, CS9/10, CS11/12, CS13, CS14, CS15/16, CS18.]
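A network like the specimen isolation figure can be held as subject-predicate-object triples and queried by following relations. This is only a sketch: the relation names come from the figure, but the specific subject/object pairings below are one reading of the flattened diagram and should be treated as assumptions, as should the helper names and the "specimen source ID" label.

```python
# (subject, predicate, object) statements; pairings are assumed, not
# confirmed against the original diagram.
TRIPLES = [
    ("specimen isolation procedure X", "has_input", "organism"),
    ("specimen isolation procedure X", "has_output", "specimen X"),
    ("specimen isolation procedure X", "has_specification", "isolation protocol"),
    ("specimen isolation procedure X", "located_in", "temporal-spatial region"),
    ("specimen source ID", "denotes", "organism"),
    ("organism", "has_quality", "gender"),
    ("organism", "has_quality", "age"),
    ("organism", "has_quality", "health status"),
]


def objects(subject, predicate):
    """Return all objects linked to subject by predicate, in list order."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def source_qualities(specimen):
    """Find the qualities of the source a specimen was isolated from.

    Walks has_output backwards to the isolation procedure, then follows
    has_input to the source and has_quality to its qualities.
    """
    procedures = [s for s, p, o in TRIPLES if p == "has_output" and o == specimen]
    qualities = []
    for procedure in procedures:
        for source in objects(procedure, "has_input"):
            qualities.extend(objects(source, "has_quality"))
    return qualities


print(source_qualities("specimen X"))  # ['gender', 'age', 'health status']
```

The query never names a spreadsheet column; it relies only on how the entities are related, which is what makes the representation reusable across projects.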
Core Project Semantics
Outcome of Metadata Standards WG
Consistent metadata captured across GSCID
Bottom-up approach focuses the standard on important features
Supports more standardized BRC interface development
Harmonization with related stakeholders
– Genomic Standards Consortium MIxS, OBO Foundry OBI and NCBI BioProject/BioSample
Represented in the context of an extensible semantic framework
Utility of semantic representation
Identified gaps in data field list (e.g. temporal components)
Includes logical structure for other, project-specific, data fields - extensible
Identified gaps in ontology data standards (use case-driven standard development)
Identified commonalities in data structures (reusable)
Support for semantic queries and inferential analysis in future
Ontology-based framework is extensible
– Sequencing => “omics”
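One kind of inferential query a semantic representation could support is transitive containment: if a specimen is located_in a site and the site is located_in a country, the specimen can be retrieved by country without that fact being recorded directly. The sketch below is an assumption about how such inference might look; the place names, specimens, and function names are invented placeholders.

```python
# Direct located_in links only; every name here is a placeholder.
LOCATED_IN = {
    "specimen X": "collection site A",
    "collection site A": "region B",
    "region B": "country C",
    "specimen Y": "collection site D",
    "collection site D": "country C",
}
SPECIMENS = {"specimen X", "specimen Y"}


def containing_regions(entity):
    """Follow located_in links upward and return every containing region."""
    regions = []
    seen = {entity}
    current = entity
    while current in LOCATED_IN:
        current = LOCATED_IN[current]
        if current in seen:  # stop if the links ever form a cycle
            break
        seen.add(current)
        regions.append(current)
    return regions


def specimens_in(region):
    """Return specimens located in region, directly or by inference."""
    return sorted(s for s in SPECIMENS if region in containing_regions(s))


print(specimens_in("country C"))  # ['specimen X', 'specimen Y']
print(specimens_in("region B"))   # ['specimen X']
```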
Acknowledgements
Bruce Birren 2,b, Lauren Brinkac 1,a, Vincent Bruno 3,c, Elizabeth Caler 1,a, Ishwar Chandramouliswaran 1,a, Sinéad Chapman 2,b, Frank Collins 8,h, Christina Cuomo 2,b, Joana Carneiro Da Silva 3,c, Valentina Di Francesco 4, Vivien Dugan 1,a, Scott Emrich 8,h, Mark Eppinger 3,c, Michael Feldgarden 2,b, Claire Fraser 3,c, W. Florian Fricke 3,c, Maria Giovanni 4, Gloria Giraldo-Calderon 8,h, Omar S. Harb 5,g, Matt Henn 2,b, Erin Hine 3,c, Julie Dunning Hotopp 3,c, Jessica C. Kissinger 6,g, Eun Mi Lee 4, Punam Mathur 4, Garry Myers 3,c, Emmanuel Mongodin 3,c, Cheryl Murphy 2,b, Dan Neafsey 2,b, Karen Nelson 1,a, Ruchi Newman 2,b, William Nierman 1,a, Brett E. Pickett 1,d,e, Julia Puzak 4, David Rasko 3,c, David S. Roos 5,g, Lisa Sadzewica 3,c, Richard H. Scheuermann 1,d,e, Lynn M. Schriml 3,c, Bruno Sobral 7,f, Tim Stockwell 1,a, Chris Stoeckert 5,g, Dan Sullivan 7,f, Luke Tallon 3,c, Herve Tettelin 3,c, Doyle V. Ward 2,b, David Wentworth 1,a, Owen White 3,c, Rebecca Will 7,f, Jennifer Wortman 2,b, Alison Yao 4, Jie Zheng 5,g
1 J. Craig Venter Institute, Rockville, MD and San Diego, CA, 2 Broad Institute, Cambridge, MA, 3 Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, 4 National Institute of Allergy and Infectious Diseases, Rockville, MD, 5 University of Pennsylvania, Philadelphia, PA, 6 University of Georgia, Athens, GA, 7 Cyberinfrastructure Division, Virginia Bioinformatics Institute, Blacksburg, VA, 8 University of Notre Dame, South Bend, IN
a J. Craig Venter Institute Genome Sequencing Center for Infectious Diseases, b Broad Institute Genome Sequencing Center for Infectious Diseases, c Institute for Genome Sciences Genome Sequencing Center for Infectious Diseases, d Influenza Research Database Bioinformatics Resource Center, e Virus Pathogen Resource Bioinformatics Resource Center, f PATRIC Bioinformatics Resource Center, g EuPathDB Bioinformatics Resource Center, h VectorBase Bioinformatics Resource Center
Tanya Barrett – NCBI
Pelin Yilmaz – Genomic Standards Consortium
N01AI /N01AI40041