Published by Tylor Ground. Modified over 9 years ago.
1
Standardized Metadata for Human Pathogen/Vector Genomic Sequences
Richard H. Scheuermann, Ph.D.
Director of Informatics, J. Craig Venter Institute
On behalf of the GSC-BRC Metadata Working Group
2
Genome Sequencing Centers for Infectious Disease (GSCID)
Bioinformatics Resource Centers (BRC)
www.viprbrc.org
www.fludb.org
3
High Throughput Sequencing
Enabling technology
– Epidemiology of outbreaks
– Pathogen evolution
– Host range restriction
– Genetic determinants of virulence and pathogenicity
Metadata requirements
– Temporal-spatial information about isolates
– Selective pressures
– Host species of specimen source
– Disease severity and clinical manifestations
4
Metadata Submission Spreadsheets
[Figure: metadata submission spreadsheets with callouts numbered 1 to 4]
5
Complex Query Interface
6
Metadata Inconsistencies
– Each project was providing different types of metadata
– No consistent nomenclature being used
– Impossible to perform reliable comparative genomics analysis
– Required extensive custom bioinformatics system development
7
GSC-BRC Metadata Standards Working Group
– NIAID assembled a group of representatives from its three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR)
– Develop an approach for capturing standardized metadata for pathogen isolate sequencing projects
– Bottom-up approach to capture data considered to be important by users
– Compatible with data standards and submission requirements
8
Metadata Standardization Process
– Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
– Identify data fields that appear to be common across projects and samples (core) and data fields that appear to be pathogen or project specific
– For each data field, provide a common set of attributes, including preferred term, definition, synonyms, allowed value sets (preferably using controlled vocabularies), expected syntax, etc.
– Assemble all metadata fields into a semantic network based on the Ontology of Biomedical Investigation (OBI)
– Compare, map, and harmonize to other relevant initiatives, including Genome Standards Consortium MIxS and NCBI BioProjects/BioSamples
– Draft data submission spreadsheets
– Beta test version 1.0 standard with new GSCID white paper projects, collecting feedback
– Adopt version 1.1 metadata standard and data submission spreadsheets for all GSCID white paper and BRC-associated projects
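The per-field attributes named in the process (preferred term, definition, synonyms, allowed value set, expected syntax) could be captured in a small record type. The following is only a sketch: the class name `MetadataField`, the definitions, the gender value set, and the date pattern are illustrative assumptions, not part of the working group's standard. The cross-standard names from the Core Sample table are stored as synonyms for simplicity.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetadataField:
    """One data field with the attributes listed in the standardization process."""
    field_id: str
    preferred_term: str
    definition: str
    synonyms: tuple = ()
    allowed_values: Optional[frozenset] = None  # controlled vocabulary, if any
    syntax: Optional[str] = None                # regular expression, if any

    def validate(self, value: str) -> list:
        """Return a list of problems; an empty list means the value passes."""
        problems = []
        if self.allowed_values is not None and value not in self.allowed_values:
            problems.append(f"{self.field_id}: {value!r} not in allowed value set")
        if self.syntax is not None and re.fullmatch(self.syntax, value) is None:
            problems.append(f"{self.field_id}: {value!r} does not match expected syntax")
        return problems


# Hypothetical definitions: the value set and the date pattern are assumptions.
GENDER = MetadataField(
    field_id="CS4",
    preferred_term="Specimen Source Gender",
    definition="Placeholder definition for illustration.",
    synonyms=("host-sex", "sex"),
    allowed_values=frozenset({"male", "female", "unknown"}),
)
COLLECTION_DATE = MetadataField(
    field_id="CS8",
    preferred_term="Specimen Collection Date",
    definition="Placeholder definition for illustration.",
    synonyms=("collection_date", "collection date"),
    syntax=r"\d{4}-\d{2}-\d{2}",
)

print(GENDER.validate("female"))
print(COLLECTION_DATE.validate("3/4/2012"))
```

Returning a list of problems rather than raising on the first one lets a submission spreadsheet be checked in a single pass, with every issue reported back to the submitter at once.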
9
Core Project Metadata

| Field ID | Metadata Field Descriptor | OBO Foundry ID | BioProject/BioSample | MIxS |
|---|---|---|---|---|
| CP1 | Project Title | http://purl.obolibrary.org/obo/OBI_0001622 | Title | project name |
| CP2 | Project ID | http://purl.obolibrary.org/obo/OBI_0001628 | | |
| CP3 | Project Description | http://purl.obolibrary.org/obo/OBI_0001615 | Description | |
| CP4 | Supporting Grants/Contract ID | http://purl.obolibrary.org/obo/OBI_0001629 | Grant Agency | |
| CP5 | Publication Citation | http://purl.obolibrary.org/obo/OBI_0001617 | PubMed ID | ref_biomaterial |
| CP6 | Sample Provider Principal Investigator (PI) Name | | | |
| CP7 | Sample Provider PI's Institution | | | |
| CP8 | Sample Provider PI's email | | | |
| CP9 | Sequencing Facility | | | |
| CP10 | Sequencing Facility Contact Name | | | |
| CP11 | Sequencing Facility Contact's Institution | | | |
| CP12 | Sequencing Facility Contact's email | | | |
| CP13 | Bioinformatics Resource Center | http://purl.obolibrary.org/obo/OBI_0001626 | | |
| CP14 | Bioinformatics Resource Center Contact Name | | | |
| CP15 | Bioinformatics Resource Center Contact's Institution | | | |
| CP16 | Bioinformatics Resource Center Contact's email | | | |
| CP17 | Target Material | | Material | |
| CP18 | Project Method | | Methodology | |
| CP19 | Project Objectives | | Objective | |
| CP20 | Sample Scope | | | |
| CP21 | Target Capture | | Capture | |
10
Core Sample Metadata

| Field ID | Metadata Field Descriptor | OBO Foundry ID | NCBI BioSample | MIxS |
|---|---|---|---|---|
| CS1 | Specimen Source ID | http://purl.obolibrary.org/obo/OBI_0001141 | host-subject-id | host_subject_id |
| CS2 | Specimen Source Species | http://purl.obolibrary.org/obo/OBI_0100026 | specific_host | host_taxid |
| CS3 | Species Source Common Name | | host-common-name | host_common_name |
| CS4 | Specimen Source Gender | http://purl.obolibrary.org/obo/PATO_0000047 | host-sex | sex |
| CS5 | Specimen Source Age - Value | http://purl.obolibrary.org/obo/OBI_0001167 | host-age | age |
| CS6 | Specimen Source Age - Unit | http://purl.obolibrary.org/obo/UO_0000003 | host-age | |
| CS7 | Specimen Source Health Status | http://purl.obolibrary.org/obo/OGMS_0000022 | host-health-state | disease status |
| CS8 | Specimen Collection Date | http://purl.obolibrary.org/obo/OBI_0001619 | collection_date | collection date |
| CS9 | Specimen Collection Location - Latitude | http://purl.obolibrary.org/obo/OBI_0001620 | lat_lon | geographic location (lat and long) |
| CS10 | Specimen Collection Location - Longitude | http://purl.obolibrary.org/obo/OBI_0001621 | lat_lon | geographic location (lat and long) |
| CS11 | Specimen Collection Location - Location | http://purl.obolibrary.org/obo/GAZ_00000448 | geo_loc_name | |
| CS12 | Specimen Collection Location - Country | http://purl.obolibrary.org/obo/OBI_0001627 | geo_loc_name | geographic location (country and/or sea) |
| CS13 | Specimen ID | http://purl.obolibrary.org/obo/OBI_0001616 | sample name | |
| CS14 | Specimen Type | http://purl.obolibrary.org/obo/OBI_0001479 | host-tissue-sampled | body habitat, body site, body product |
| CS15 | Suspected Organism(s) in Specimen - Species | http://purl.obolibrary.org/obo/OBI_0000925 | organism | |
| CS16 | Suspected Organism(s) in Specimen - Subclass | | strain | subspecific genetic lineage |
| CS17 | Human Pathogenicity of Suspected Organism(s) in Specimen | http://purl.obolibrary.org/obo/OBI_0000925 | | phenotype |
| CS18 | Environmental Material | http://purl.obolibrary.org/obo/ENVO_00010483 | isolation-source | environment (material) |
| CS19 | Organism Detection Method | http://purl.obolibrary.org/obo/OBI_0001624 | | sample collection device or method |
| CS20 | Specimen Repository | | culture-collection | source material identifiers |
| CS21 | Specimen Repository Sample ID | | culture-collection | source material identifiers |
| CS22 | Sample ID - Sequencing Facility | | | |
| CS23 | Nucleic Acid Extraction Method | http://purl.obolibrary.org/obo/OBI_0666667 | samp_mat_process | sample material processing |
| CS24 | Nucleic Acid Preparation Method | | samp_mat_process | sample material processing |
| CS25 | Sequencing Method | http://purl.obolibrary.org/obo/OBI_0600047 | | sequencing method |
| CS26 | Assembly Algorithm | http://purl.obolibrary.org/obo/OBI_0001522 | | assembly |
| CS27 | Depth of Coverage - Average | http://purl.obolibrary.org/obo/OBI_0001618 | | finishing strategy |
| CS28 | Annotation Algorithm | http://purl.obolibrary.org/obo/OBI_0001625 | | |
| CS29 | GenBank Record ID | http://purl.obolibrary.org/obo/OBI_0001614 | | |
| CS30 | Comments | http://purl.obolibrary.org/obo/IAO_0000300 | host-description | |
| CS31 | Specimen Collector Name | | collected-by | |
| CS32 | Specimen Collector's Institution | | | |
| CS33 | Specimen Collector's email | | | |
| CS34 | Sample Category | | attribute_package | |
| CS35 | Host Disease | | host-disease | |
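One practical use of the Core Sample mapping is translating a record keyed by field ID into NCBI BioSample attribute names. The sketch below uses a hand-picked subset of the table. The function name `to_biosample` and the sample values are made up for illustration, and this is not presented as any BRC's actual submission code.

```python
# Subset of the Core Sample table: field ID -> (descriptor, NCBI BioSample attribute).
CORE_SAMPLE_TO_BIOSAMPLE = {
    "CS1": ("Specimen Source ID", "host-subject-id"),
    "CS2": ("Specimen Source Species", "specific_host"),
    "CS4": ("Specimen Source Gender", "host-sex"),
    "CS8": ("Specimen Collection Date", "collection_date"),
    "CS12": ("Specimen Collection Location - Country", "geo_loc_name"),
    "CS15": ("Suspected Organism(s) in Specimen - Species", "organism"),
    "CS35": ("Host Disease", "host-disease"),
}


def to_biosample(record: dict) -> tuple:
    """Split a record keyed by field ID into mapped attributes and unmapped IDs."""
    mapped, unmapped = {}, []
    for field_id, value in record.items():
        entry = CORE_SAMPLE_TO_BIOSAMPLE.get(field_id)
        if entry is None:
            # No BioSample attribute in this subset; report it instead of dropping it.
            unmapped.append(field_id)
        else:
            mapped[entry[1]] = value
    return mapped, unmapped


# Made-up sample values for illustration only.
record = {
    "CS1": "subject-001",
    "CS2": "Homo sapiens",
    "CS8": "2012-03-04",
    "CS22": "SF-42",
}
attributes, leftover = to_biosample(record)
print(attributes)
print(leftover)
```

CS22 (Sample ID - Sequencing Facility) has no BioSample equivalent in the table, so it is returned in the unmapped list rather than silently discarded.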
11
Metadata Processes
[Figure: semantic network of the metadata processes. Labelled stages: Specimen Isolation, Material Processing, Sequencing Assay, Data Processing, Investigation, Host Characterization. Entities: specimen source (organism or environmental), specimen collector, specimen, microorganism enriched NA sample, microorganism genomic NA, input sample, reagents, technician, equipment, primary data, sequence data, genotype/serotype/gene data, sequence data record, GenBank ID, temporal-spatial region, type, ID, qualities. Processes: specimen isolation process (with isolation protocol), sample processing, sequencing assay, quality assessment assay, data transformations (image processing, assembly; variant detection, serotype marker detection, gene detection), data archiving process. Relations: has_input, has_output, has_specification, has_part, is_about, denotes, located_in, has_quality, instance_of.]
12
Specimen Isolation
[Figure: semantic network for specimen isolation. Entities: organism, organism part, environmental material, environment, equipment, person (name, affiliation), specimen source role, specimen capture role, specimen collector role, specimen X, specimen type, specimen isolation procedure X, specimen isolation procedure type, isolation protocol, hypothesis, IRB/IACUC approval, organism pathogenic disposition, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location), ID. Relations: has_input, has_output, plays, has_specification, has_part, denotes, located_in, has_affiliation, instance_of, is_about, has_authorization, has_quality, has disposition. Core Sample fields annotated on the diagram: CS1 (ID), CS2/3, CS4 (gender), CS5/6 (age), CS7 (health status), CS8, CS9/10, CS11/12, CS13, CS14, CS15/16, CS18.]
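A semantic network like the specimen isolation one can be held as subject-predicate-object triples and queried directly. In the sketch below the node and relation names come from the slide, but which nodes each edge connects is an assumed reading of the flattened diagram, not an authoritative OBI model. The helper names `objects` and `subjects` are hypothetical.

```python
# Assumed edges: node and relation names are from the slide, the wiring is a guess.
TRIPLES = [
    ("specimen isolation procedure X", "has_input", "organism"),
    ("specimen isolation procedure X", "has_output", "specimen X"),
    ("specimen isolation procedure X", "has_specification", "isolation protocol"),
    ("specimen isolation procedure X", "located_in", "temporal-spatial region"),
    ("specimen X", "instance_of", "specimen type"),
    ("organism", "plays", "specimen source role"),
]


def objects(subject, predicate, triples=TRIPLES):
    """All objects o such that (subject, predicate, o) is in the network."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def subjects(predicate, obj, triples=TRIPLES):
    """All subjects s such that (s, predicate, obj) is in the network."""
    return [s for s, p, o in triples if p == predicate and o == obj]


# Which process produced specimen X, and what did that process consume?
producers = subjects("has_output", "specimen X")
print(producers)
for process in producers:
    print(objects(process, "has_input"))
```

Chaining the two lookups traces a specimen back to its source, which is the kind of query a shared semantic representation is meant to make uniform across projects.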
13
Core Project Semantics
14
Outcome of Metadata Standards WG
– Consistent metadata captured across GSCID
– Bottom-up approach focuses standard on important features
– Support more standardized BRC interface development
– Harmonization with related stakeholders
  – Genome Standards Consortium MIxS, OBO Foundry OBI and NCBI BioProject/BioSample
– Represented in the context of an extensible semantic framework
15
Utility of semantic representation
– Identified gaps in data field list (e.g. temporal components)
– Includes logical structure for other, project-specific, data fields (extensible)
– Identified gaps in ontology data standards (use case-driven standard development)
– Identified commonalities in data structures (reusable)
– Support for semantic queries and inferential analysis in future
– Ontology-based framework is extensible
  – Sequencing => “omics”
16
Acknowledgements
Bruce Birren 2,b, Lauren Brinkac 1,a, Vincent Bruno 3,c, Elizabeth Caler 1,a, Ishwar Chandramouliswaran 1,a, Sinéad Chapman 2,b, Frank Collins 8,h, Christina Cuomo 2,b, Joana Carneiro Da Silva 3,c, Valentina Di Francesco 4, Vivien Dugan 1,a, Scott Emrich 8,h, Mark Eppinger 3,c, Michael Feldgarden 2,b, Claire Fraser 3,c, W. Florian Fricke 3,c, Maria Giovanni 4, Gloria Giraldo-Calderon 8,h, Omar S. Harb 5,g, Matt Henn 2,b, Erin Hine 3,c, Julie Dunning Hotopp 3,c, Jessica C. Kissinger 6,g, Eun Mi Lee 4, Punam Mathur 4, Garry Myers 3,c, Emmanuel Mongodin 3,c, Cheryl Murphy 2,b, Dan Neafsey 2,b, Karen Nelson 1,a, Ruchi Newman 2,b, William Nierman 1,a, Brett E. Pickett 1,d,e, Julia Puzak 4, David Rasko 3,c, David S. Roos 5,g, Lisa Sadzewica 3,c, Richard H. Scheuermann 1,d,e, Lynn M. Schriml 3,c, Bruno Sobral 7,f, Tim Stockwell 1,a, Chris Stoeckert 5,g, Dan Sullivan 7,f, Luke Tallon 3,c, Herve Tettelin 3,c, Doyle V. Ward 2,b, David Wentworth 1,a, Owen White 3,c, Rebecca Will 7,f, Jennifer Wortman 2,b, Alison Yao 4, Jie Zheng 5,g
1 J. Craig Venter Institute, Rockville, MD and San Diego, CA
2 Broad Institute, Cambridge, MA
3 Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD
4 National Institute of Allergy and Infectious Diseases, Rockville, MD
5 University of Pennsylvania, Philadelphia, PA
6 University of Georgia, Athens, GA
7 Cyberinfrastructure Division, Virginia Bioinformatics Institute, Blacksburg, VA
8 University of Notre Dame, South Bend, IN
a J. Craig Venter Institute Genome Sequencing Center for Infectious Diseases
b Broad Institute Genome Sequencing Center for Infectious Diseases
c Institute for Genome Sciences Genome Sequencing Center for Infectious Diseases
d Influenza Research Database Bioinformatics Resource Center
e Virus Pathogen Resource Bioinformatics Resource Center
f PATRIC Bioinformatics Resource Center
g EuPathDB Bioinformatics Resource Center
h VectorBase Bioinformatics Resource Center
Tanya Barrett – NCBI
Pelin Yilmaz – Genome Standards Consortium
N01AI2008038 /N01AI40041