Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects


Standardizing Metadata Associated with NIAID Genome Sequencing Center Projects
Richard H. Scheuermann, Ph.D.
Department of Pathology, Division of Biomedical Informatics
U.T. Southwestern Medical Center

NIAID Bioinformatics Resource Centers

Influenza Research Database

NIAID Genome Sequencing Centers

Metadata Inconsistencies
- Each project was providing different types of metadata
- No consistent nomenclature being used
- Impossible to perform reliable comparative genomics analysis

Dengue Clinical Metadata

Complex Query Interface

Additional Clinical Characteristics

GSC-BRC Metadata Standards Working Group
- NIAID assembled a group of representatives from its three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR) programs
- Develop metadata standards for pathogen isolate sequencing projects

GSC-BRC Metadata Working Groups

Metadata Standards Process
- Divide into pathogen subgroups: viruses, bacteria, eukaryotic pathogens and vectors
- Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
- Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project specific
- For each data field, provide:
  - definitions
  - synonyms
  - allowed value sets, preferably using controlled vocabularies
  - expected syntax
  - examples
  - data categories
  - data providers
- Merge subgroup core elements into a common set of core metadata fields and attributes
- Assemble metadata fields into a semantic network
- Harmonize semantic network with the Ontology for Biomedical Investigations (OBI)
- Compare, harmonize, map to other relevant initiatives, including MIGS, MIMS, BioProjects, BioSamples
- Establish policies and procedures for metadata submission workflows and GenBank linkage
- Develop data submission spreadsheets to be used for all white paper and BRC-associated projects
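The per-field attributes listed above (definition, synonyms, allowed value set, expected syntax, example, data category, data provider) can be pictured as a small record type with a validator. The following is only a minimal sketch under stated assumptions: the field names `specimen_type` and `collection_date`, their value sets, the date pattern, and the category and provider strings are hypothetical illustrations, not the working group's actual field definitions.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FieldSpec:
    """One metadata field, described by the attributes named in the process."""
    name: str
    definition: str
    synonyms: list = field(default_factory=list)
    allowed_values: Optional[set] = None  # controlled vocabulary, if any
    syntax: Optional[str] = None          # expected syntax, as a regex
    example: str = ""
    category: str = ""                    # data category
    provider: str = ""                    # data provider

    def validate(self, value):
        """Return a list of problems found for one submitted value."""
        problems = []
        if self.allowed_values is not None and value not in self.allowed_values:
            problems.append(f"{self.name}: '{value}' is not in the allowed value set")
        if self.syntax is not None and not re.fullmatch(self.syntax, value):
            problems.append(f"{self.name}: '{value}' does not match the expected syntax")
        return problems


# Hypothetical field definitions, for illustration only.
SPECS = [
    FieldSpec(
        name="specimen_type",
        definition="Kind of material collected from the specimen source",
        synonyms=["sample type"],
        allowed_values={"blood", "nasal swab"},
        example="nasal swab",
        category="specimen",
        provider="specimen collector",
    ),
    FieldSpec(
        name="collection_date",
        definition="Date on which the specimen was collected",
        syntax=r"\d{4}-\d{2}-\d{2}",
        example="2011-03-05",
        category="specimen",
        provider="specimen collector",
    ),
]


def validate_record(record, specs):
    """Check one submitted row against every field specification."""
    problems = []
    for spec in specs:
        if spec.name not in record:
            problems.append(f"{spec.name}: missing")
        else:
            problems.extend(spec.validate(record[spec.name]))
    return problems
```

A submission spreadsheet row would then be passed to `validate_record` as a dictionary; an empty list means the row satisfies every specification in `SPECS`.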

Core Sample Metadata
- 30 core sample metadata fields

Core Project Metadata
- 16 core project metadata fields

Metadata Standards Process
- Divide into pathogen subgroups: viruses, bacteria, eukaryotic pathogens and vectors
- Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
- Identify data fields that appear to be common across projects within a pathogen subgroup (core) and data fields that appear to be project specific
- For each data field, provide:
  - definitions
  - synonyms
  - allowed value sets, preferably using controlled vocabularies
  - expected syntax
  - examples
  - data categories
  - data providers
- Merge subgroup core elements into a common set of core metadata fields and attributes
- Assemble metadata fields into a semantic network (Scheuermann)
- Harmonize semantic network with the Ontology for Biomedical Investigations (OBI) (Stoeckert, Zheng)
- Compare, harmonize, map to other relevant initiatives, including MIGS, MIMS, BioProjects, BioSamples
- Establish policies and procedures for metadata submission workflows and GenBank linkage
- Develop data submission spreadsheets to be used for all white paper and BRC-associated projects

Specimen Isolation
[Figure: semantic network diagram. Entities: organism, organism part, environmental material, environment, equipment, person, specimen X, specimen type, specimen isolation procedure X, specimen isolation procedure type, isolation protocol, hypothesis, IRB/IACUC approval, temporal-spatial region, spatial region, temporal interval, GPS location, geographic location, date/time, name, affiliation, ID, Comments (marked "????"). Roles: specimen source role, specimen capture role, specimen collector role. Relations: has_input, has_output, plays, has_specification, has_part, has_quality, has_affiliation, has_authorization, denotes, located_in, instance_of, is_about. Field labels on the diagram: v2, v3-4, v5-6, v7, v8, v9, v15, v16, v17, v18, v19, b18, b22 to b30.]

Metadata Processes
[Figure: semantic network diagram of an Investigation, divided into four stages: Specimen Isolation, Material Processing, Sequencing Assay, Data Processing. Entities: specimen source (organism or environmental), specimen collector, specimen, specimen isolation process, isolation protocol, sample processing, microorganism, enriched NA sample, microorganism genomic NA, input sample, reagents, technician, equipment, sequencing assay, primary data, sequence data, genotype/serotype/gene data, data archiving process, sequence data record, GenBank ID, type, ID, qualities, temporal-spatial region. Data transformations: image processing, assembly, variant detection, serotype marker detection, gene detection. Relations: has_input, has_output, has_specification, has_part, has_quality, is_about, denotes, located_in, instance_of.]
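One way to read the process diagram is as a set of subject-relation-object triples, where each process consumes one entity and produces the next. The sketch below assumes a simplified chain of edges reconstructed from the labels on the slide; the exact wiring of the original diagram is not recoverable from the transcript, so the `TRIPLES` list is an assumption, and `upstream` is a hypothetical helper name.

```python
from collections import deque

# (subject, relation, object) triples. The entity and relation names come from
# the slide; which entity each edge connects is an assumption for illustration.
TRIPLES = [
    ("specimen isolation process", "has_input", "specimen source"),
    ("specimen isolation process", "has_output", "specimen"),
    ("specimen isolation process", "has_specification", "isolation protocol"),
    ("sample processing", "has_input", "specimen"),
    ("sample processing", "has_output", "enriched NA sample"),
    ("sequencing assay", "has_input", "enriched NA sample"),
    ("sequencing assay", "has_output", "primary data"),
    ("data transformation", "has_input", "primary data"),
    ("data transformation", "has_output", "sequence data"),
]


def upstream(entity, triples):
    """List everything an entity was derived from, nearest ancestor first."""
    produced_by = {o: s for s, r, o in triples if r == "has_output"}
    inputs_of = {}
    for s, r, o in triples:
        if r == "has_input":
            inputs_of.setdefault(s, []).append(o)

    seen, order = set(), []
    queue = deque([entity])
    while queue:
        current = queue.popleft()
        process = produced_by.get(current)
        if process is None:
            continue  # nothing in the network produces this entity
        for source in inputs_of.get(process, []):
            if source not in seen:
                seen.add(source)
                order.append(source)
                queue.append(source)
    return order


print(upstream("sequence data", TRIPLES))
# ['primary data', 'enriched NA sample', 'specimen', 'specimen source']
```

Following `has_output` and `has_input` links backwards like this is what lets a sequence record be traced to the specimen and specimen source it came from.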

Core-Project

Core-Specimen

Generic Assay
[Figure: semantic network template. Entities: assay X, assay type, assay protocol, objectives, input sample material X, sample material X, sample type, sample ID, analyte X, quality x, person X, name, species, equipment X, equipment type, serial #, reagent type, lot #, run ID, primary data, temporal-spatial region, spatial region, temporal interval, GPS location, geographic location, date/time. Roles: reagent role, target role, technician role, signal detection role. Relations: has_input, has_output, has_specification, has_part, has_quality, plays, denotes, located_in, instance_of, is_about.]

Generic Material Transformation
[Figure: semantic network template. Entities: material transformation X, material transformation type, material transformation protocol, objectives, sample material X, sample type, sample ID, output material X, material type, quality x, person X, name, species, equipment X, equipment type, serial #, reagent type, lot #, run ID, temporal-spatial region, spatial region, temporal interval, GPS location, geographic location, date/time. Roles: reagent role, target role, technician role, signal detection role. Relations: has_input, has_output, has_specification, has_part, has_quality, plays, denotes, located_in, instance_of.]

Generic Data Transformation
[Figure: semantic network template. Entities: data transformation X, data transformation type, input data, output data, material X, algorithm, software, person X, name, run ID, temporal-spatial region, spatial region, temporal interval, GPS location, geographic location, date/time. Roles: data analyst role. Relations: has_input, has_output, has_specification, has_part, plays, denotes, located_in, instance_of, is_about.]

Generic Material (IC)
[Figure: semantic network template. Entities: material X, material Y, material Z, material type, ID, quality x, quality y, temporal-spatial region, spatial region, temporal interval, GPS location, geographic location, date/time. Relations: has_part, has_quality, denotes, located_in, instance_of.]

Conclusions
- Utility of semantic representation
  - Identified gaps in data field list (e.g. temporal components)
  - Identified gaps in ontology data standards (use case-driven standard development)
  - Identified commonalities in data structures (reusable)
  - Support for semantic queries and inferential analysis in future
- Two flavors of MIBBI
  - Distinguish between minimum information to reproduce an experiment and the minimum information to structure in a database for query and analysis
- OBI-based framework is re-usable
  - Sequencing => “omics”
- Practical issues about implementation strategies
  - Challenge of using ontologies for preferred value sets
    - Can be large
    - May not directly match common language
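The last practical issue, that ontology-derived value sets may not match the terms submitters actually use, is commonly handled by keeping a synonym table that maps common-language terms to the preferred label. The sketch below illustrates that idea only; it is not described in the slides, and the labels and synonyms in `PREFERRED` are hypothetical examples rather than terms taken from OBI or any other ontology.

```python
# Preferred label -> common-language synonyms. All entries are hypothetical.
PREFERRED = {
    "nasal swab": ["nose swab", "nasal swab specimen"],
    "blood": ["whole blood"],
}


def build_lookup(preferred):
    """Index every label and synonym by its case-folded form."""
    lookup = {}
    for label, synonyms in preferred.items():
        for term in [label, *synonyms]:
            lookup[term.casefold()] = label
    return lookup


def normalize(value, lookup):
    """Map a submitted value to its preferred label, or None if unknown."""
    key = " ".join(value.split()).casefold()  # collapse whitespace, ignore case
    return lookup.get(key)


LOOKUP = build_lookup(PREFERRED)
print(normalize("  Nose  Swab ", LOOKUP))  # nasal swab
print(normalize("sputum", LOOKUP))         # None
```

Returning `None` for unknown terms, rather than guessing, leaves unmatched values to be reviewed by a curator.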