Richard H. Scheuermann, Ph.D. Director of Informatics J. Craig Venter Institute On behalf of the GSC-BRC Metadata Working Group Standardized Metadata for.

Presentation transcript:

Richard H. Scheuermann, Ph.D. Director of Informatics J. Craig Venter Institute On behalf of the GSC-BRC Metadata Working Group Standardized Metadata for Human Pathogen/Vector Genomic Sequences

Genome Sequencing Centers for Infectious Diseases (GSCID)
Bioinformatics Resource Centers (BRC)

High Throughput Sequencing
Enabling technology
– Epidemiology of outbreaks
– Pathogen evolution
– Host range restriction
– Genetic determinants of virulence and pathogenicity
Metadata requirements
– Temporal-spatial information about isolates
– Selective pressures
– Host species of specimen source
– Disease severity and clinical manifestations

Metadata Submission Spreadsheets

Complex Query Interface

Metadata Inconsistencies
Each project was providing different types of metadata
No consistent nomenclature being used
Impossible to perform reliable comparative genomics analysis
Required extensive custom bioinformatics system development

GSC-BRC Metadata Standards Working Group
NIAID assembled a group of representatives from its three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR) programs
Develop an approach for capturing standardized metadata for pathogen isolate sequencing projects
Bottom-up approach to capture data considered to be important by users
Compatible with data standards and submission requirements

Metadata Standardization Process
Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
Identify data fields that appear to be common across projects and samples (core) and data fields that appear to be pathogen or project specific
For each data field, provide a common set of attributes, including preferred term, definition, synonyms, allowed value sets (preferably using controlled vocabularies), expected syntax, etc.
Assemble all metadata fields into a semantic network based on the Ontology for Biomedical Investigations (OBI)
Compare, map, and harmonize to other relevant initiatives, including Genomic Standards Consortium MIxS and NCBI BioProject/BioSample
Draft data submission spreadsheets
Beta test version 1.0 standard with new GSCID white paper projects, collecting feedback
Adopt version 1.1 metadata standard and data submission spreadsheets for all GSCID white paper and BRC-associated projects
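The per-field attributes named in the process above (preferred term, definition, synonyms, allowed value sets, expected syntax) could be captured in a small record type. The Python sketch below is illustrative only: the class name FieldSpec, the definition wording, the value set shown for CS4 and the ISO-style date pattern shown for CS8 are assumptions made for the example, not content of the working group's standard.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical description of one metadata field."""
    field_id: str
    preferred_term: str
    definition: str
    synonyms: tuple = ()
    allowed_values: Optional[frozenset] = None  # controlled vocabulary, if any
    syntax: Optional[str] = None                # regular expression, if any

    def is_valid(self, value: str) -> bool:
        """Check a value against the value set and syntax, when present."""
        if self.allowed_values is not None and value not in self.allowed_values:
            return False
        if self.syntax is not None and re.fullmatch(self.syntax, value) is None:
            return False
        return True


# Example entries; the value set and the date pattern are assumed.
GENDER = FieldSpec(
    field_id="CS4",
    preferred_term="Specimen Source Gender",
    definition="(assumed wording) sex of the organism the specimen came from",
    synonyms=("host sex",),
    allowed_values=frozenset({"male", "female", "unknown"}),
)
COLLECTION_DATE = FieldSpec(
    field_id="CS8",
    preferred_term="Specimen Collection Date",
    definition="(assumed wording) date on which the specimen was collected",
    syntax=r"\d{4}-\d{2}-\d{2}",
)
```

A submission spreadsheet row could then be checked field by field with is_valid before it is accepted.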

Core Project Metadata
Field ID | Metadata Field Descriptor | OBO Foundry ID | BioProject/BioSample; MIxS
CP1 | Project Title | | name
CP2 | Project ID | http://purl.obolibrary.org/obo/OBI_ |
CP3 | Project Description | http://purl.obolibrary.org/obo/OBI_ | Description
CP4 | Supporting Grants/Contract ID | http://purl.obolibrary.org/obo/OBI_ | Grant Agency
CP5 | Publication Citation | http://purl.obolibrary.org/obo/OBI_ | PubMed ID; ref_biomaterial
CP6 | Sample Provider Principal Investigator (PI) Name | |
CP7 | Sample Provider PI's Institution | |
CP8 | Sample Provider PI's | |
CP9 | Sequencing Facility | |
CP10 | Sequencing Facility Contact Name | |
CP11 | Sequencing Facility Contact's Institution | |
CP12 | Sequencing Facility Contact's | |
CP13 | Bioinformatics Resource Center | http://purl.obolibrary.org/obo/OBI_ |
CP14 | Bioinformatics Resource Center Contact Name | |
CP15 | Bioinformatics Resource Center Contact's Institution | |
CP16 | Bioinformatics Resource Center Contact's | |
CP17 | Target Material | | Material
CP18 | Project Method | | Methodology
CP19 | Project Objectives | | Objective
CP20 | Sample Scope | |
CP21 | Target Capture | | Capture

Core Sample Metadata
Field ID | Metadata Field Descriptor | OBO Foundry ID | NCBI BioSample; MIxS
CS1 | Specimen Source ID | http://purl.obolibrary.org/obo/OBI_ | host-subject-id; host_subject_id
CS2 | Specimen Source Species | http://purl.obolibrary.org/obo/OBI_ | specific_host; host_taxid
CS3 | Species Source Common Name | | host-common-name; host_common_name
CS4 | Specimen Source Gender | http://purl.obolibrary.org/obo/PATO_ | host-sex; sex
CS5 | Specimen Source Age - Value | http://purl.obolibrary.org/obo/OBI_ | host-age; age
CS6 | Specimen Source Age - Unit | http://purl.obolibrary.org/obo/UO_ | host-age
CS7 | Specimen Source Health Status | http://purl.obolibrary.org/obo/OGMS_ | host-health-state; disease status
CS8 | Specimen Collection Date | http://purl.obolibrary.org/obo/OBI_ | collection_date; collection date
CS9 | Specimen Collection Location - Latitude | http://purl.obolibrary.org/obo/OBI_ | lat_lon; geographic location (lat and long)
CS10 | Specimen Collection Location - Longitude | http://purl.obolibrary.org/obo/OBI_ | lat_lon; geographic location (lat and long)
CS11 | Specimen Collection Location - Location | http://purl.obolibrary.org/obo/GAZ_ | geo_loc_name
CS12 | Specimen Collection Location - Country | http://purl.obolibrary.org/obo/OBI_ | geo_loc_name; geographic location (country and/or sea)
CS13 | Specimen ID | http://purl.obolibrary.org/obo/OBI_ | sample name
CS14 | Specimen Type | http://purl.obolibrary.org/obo/OBI_ | host-tissue-sampled; body habitat, body site, body product
CS15 | Suspected Organism(s) in Specimen - Species | http://purl.obolibrary.org/obo/OBI_ | organism
CS16 | Suspected Organism(s) in Specimen - Subclass | | strain; subspecific genetic lineage
CS17 | Human Pathogenicity of Suspected Organism(s) in Specimen | | phenotype
CS18 | Environmental Material | http://purl.obolibrary.org/obo/ENVO_ | isolation-source; environment (material)
CS19 | Organism Detection Method | http://purl.obolibrary.org/obo/OBI_ | sample collection device or method
CS20 | Specimen Repository | | culture-collection; source material identifiers
CS21 | Specimen Repository Sample ID | | culture-collection; source material identifiers
CS22 | Sample ID - Sequencing Facility | |
CS23 | Nucleic Acid Extraction Method | http://purl.obolibrary.org/obo/OBI_ | samp_mat_process; sample material processing
CS24 | Nucleic Acid Preparation Method | | samp_mat_process; sample material processing
CS25 | Sequencing Method | http://purl.obolibrary.org/obo/OBI_ | sequencing method
CS26 | Assembly Algorithm | http://purl.obolibrary.org/obo/OBI_ | assembly
CS27 | Depth of Coverage - Average | http://purl.obolibrary.org/obo/OBI_ | finishing strategy
CS28 | Annotation Algorithm | http://purl.obolibrary.org/obo/OBI_ |
CS29 | GenBank Record ID | http://purl.obolibrary.org/obo/OBI_ |
CS30 | Comments | http://purl.obolibrary.org/obo/IAO_ | host-description
CS31 | Specimen Collector Name | | collected-by
CS32 | Specimen Collector's Institution | |
CS33 | Specimen Collector's | |
CS34 | Sample Category | | attribute_package
CS35 | Host Disease | | host-disease
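One way such a field-to-field mapping might be put to use is a lookup that renames core sample fields to their NCBI BioSample equivalents before submission. The Python sketch below is an assumption-laden illustration: it copies only a handful of mappings from the table above, and the function name to_biosample and the example record values are hypothetical.

```python
# A few Field ID -> NCBI BioSample attribute names, copied from the table above.
BIOSAMPLE_NAMES = {
    "CS1": "host-subject-id",
    "CS2": "specific_host",
    "CS3": "host-common-name",
    "CS4": "host-sex",
    "CS8": "collection_date",
    "CS18": "isolation-source",
}


def to_biosample(record):
    """Split a record keyed by Field ID into (mapped, unmapped) dictionaries.

    `mapped` is keyed by BioSample attribute name; `unmapped` keeps the
    original Field IDs for fields that have no entry in BIOSAMPLE_NAMES.
    """
    mapped, unmapped = {}, {}
    for field_id, value in record.items():
        name = BIOSAMPLE_NAMES.get(field_id)
        if name is None:
            unmapped[field_id] = value
        else:
            mapped[name] = value
    return mapped, unmapped


# Hypothetical record; the values are made up for illustration.
example = {
    "CS2": "Homo sapiens",
    "CS8": "2012-08-01",
    "CS28": "hypothetical annotation pipeline",
}
mapped, unmapped = to_biosample(example)
```

Returning the unmapped fields separately keeps fields without a BioSample equivalent visible instead of silently dropping them.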

Metadata Processes
[Figure: semantic network of the processes in a sequencing investigation. Panels: Specimen Isolation, Material Processing, Sequencing Assay, Data Processing and Host Characterization, plus a quality assessment assay. Processes (specimen isolation process, sample processing, sequencing assay, data transformations, data archiving process) are linked to their materials and data (specimen source, specimen, enriched NA sample, primary data, sequence data, genotype/serotype/gene data, sequence data record, GenBank ID) by has_input, has_output, has_specification, has_part, is_about, denotes, located_in, has_quality and instance_of relations; each process is located_in a temporal-spatial region.]

[Figure: Specimen Isolation. Semantic network for a specimen isolation procedure, annotated with the core sample fields it represents (CS1, CS2/3, CS4, CS5/6, CS7, CS8, CS9/10, CS11/12, CS13, CS14, CS15/16, CS18). Nodes include organism, organism part, environmental material, environment, equipment, person, specimen X, specimen isolation procedure X, isolation protocol, hypothesis, IRB/IACUC approval, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location) and the specimen source, specimen capture and specimen collector roles. Relations include has_input, has_output, plays, has_specification, has_part, denotes, located_in, has_affiliation, has_authorization, has_quality (gender, age, health status), has disposition (pathogenic disposition), is_about and instance_of.]
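The has_input and has_output relations in these diagrams lend themselves to a simple subject-predicate-object representation. The Python sketch below is a rough illustration under stated assumptions: the exact wiring of the triples is one reading of the flattened figure, not a confirmed copy of it, and the function name upstream is hypothetical. It walks backwards from a data item to everything it was derived from.

```python
# (subject, predicate, object) triples; the wiring is an assumed reading
# of the process diagram, using node and relation names that appear in it.
TRIPLES = [
    ("specimen isolation process", "has_input", "specimen source"),
    ("specimen isolation process", "has_output", "specimen"),
    ("sample processing", "has_input", "specimen"),
    ("sample processing", "has_output", "enriched NA sample"),
    ("sequencing assay", "has_input", "enriched NA sample"),
    ("sequencing assay", "has_output", "sequence data"),
    ("data archiving process", "has_input", "sequence data"),
    ("data archiving process", "has_output", "sequence data record"),
]


def upstream(entity, triples=TRIPLES):
    """Return every entity that `entity` was derived from, nearest first."""
    # Map each output to the process that produced it.
    producers = {o: s for s, p, o in triples if p == "has_output"}
    # Map each process to the list of its inputs.
    inputs = {}
    for s, p, o in triples:
        if p == "has_input":
            inputs.setdefault(s, []).append(o)

    found, queue = [], [entity]
    while queue:
        current = queue.pop(0)
        process = producers.get(current)
        for source in inputs.get(process, []):
            if source not in found:
                found.append(source)
                queue.append(source)
    return found


lineage = upstream("sequence data record")
```

A query of this kind is what would let a sequence record be traced back to its specimen and specimen source without custom code for each project.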

Core Project Semantics

Outcome of Metadata Standards WG
Consistent metadata captured across GSCID
Bottom-up approach focuses standard on important features
Support more standardized BRC interface development
Harmonization with related stakeholders – Genomic Standards Consortium MIxS, OBO Foundry OBI and NCBI BioProject/BioSample
Represented in the context of an extensible semantic framework

Utility of semantic representation
Identified gaps in data field list (e.g. temporal components)
Includes logical structure for other, project-specific, data fields - extensible
Identified gaps in ontology data standards (use case-driven standard development)
Identified commonalities in data structures (reusable)
Support for semantic queries and inferential analysis in future
Ontology-based framework is extensible – Sequencing => “omics”

Acknowledgements
Bruce Birren 2,b, Lauren Brinkac 1,a, Vincent Bruno 3,c, Elizabeth Caler 1,a, Ishwar Chandramouliswaran 1,a, Sinéad Chapman 2,b, Frank Collins 8,h, Christina Cuomo 2,b, Joana Carneiro Da Silva 3,c, Valentina Di Francesco 4, Vivien Dugan 1,a, Scott Emrich 8,h, Mark Eppinger 3,c, Michael Feldgarden 2,b, Claire Fraser 3,c, W. Florian Fricke 3,c, Maria Giovanni 4, Gloria Giraldo-Calderon 8,h, Omar S. Harb 5,g, Matt Henn 2,b, Erin Hine 3,c, Julie Dunning Hotopp 3,c, Jessica C. Kissinger 6,g, Eun Mi Lee 4, Punam Mathur 4, Garry Myers 3,c, Emmanuel Mongodin 3,c, Cheryl Murphy 2,b, Dan Neafsey 2,b, Karen Nelson 1,a, Ruchi Newman 2,b, William Nierman 1,a, Brett E. Pickett 1,d,e, Julia Puzak 4, David Rasko 3,c, David S. Roos 5,g, Lisa Sadzewica 3,c, Richard H. Scheuermann 1,d,e, Lynn M. Schriml 3,c, Bruno Sobral 7,f, Tim Stockwell 1,a, Chris Stoeckert 5,g, Dan Sullivan 7,f, Luke Tallon 3,c, Herve Tettelin 3,c, Doyle V. Ward 2,b, David Wentworth 1,a, Owen White 3,c, Rebecca Will 7,f, Jennifer Wortman 2,b, Alison Yao 4, Jie Zheng 5,g
1 J. Craig Venter Institute, Rockville, MD and San Diego, CA
2 Broad Institute, Cambridge, MA
3 Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD
4 National Institute of Allergy and Infectious Diseases, Rockville, MD
5 University of Pennsylvania, Philadelphia, PA
6 University of Georgia, Athens, GA
7 Cyberinfrastructure Division, Virginia Bioinformatics Institute, Blacksburg, VA
8 University of Notre Dame, South Bend, IN
a J. Craig Venter Institute Genome Sequencing Center for Infectious Diseases
b Broad Institute Genome Sequencing Center for Infectious Diseases
c Institute for Genome Sciences Genome Sequencing Center for Infectious Diseases
d Influenza Research Database Bioinformatics Resource Center
e Virus Pathogen Resource Bioinformatics Resource Center
f PATRIC Bioinformatics Resource Center
g EuPathDB Bioinformatics Resource Center
h VectorBase Bioinformatics Resource Center
Tanya Barrett – NCBI
Pelin Yilmaz – Genomic Standards Consortium
N01AI /N01AI40041