Making Metadata More Comprehensive and More Searchable with CEDAR

Presentation transcript:

Making Metadata More Comprehensive and More Searchable with CEDAR Mark A. Musen, M.D., Ph.D. Stanford University musen@Stanford.EDU

An (in)famous commentary in Nature (Begley and Ellis, 2012)
- Amgen could reproduce the findings in only 6 of 53 “landmark” papers in cancer biology
- Bayer could validate only 25% of 67 pre-clinical studies
- Some “landmark” studies had spawned entire fields of investigation, with no one bothering to confirm the initial results
- Nonreproducible studies are more likely to be published in journals with high impact factors

The problem has many causes
- A lack of statistical power in many studies
- An “art form” in conducting experiments
- An eagerness to publish early
- Outright fraud
- A system that does not make it easy or rewarding to replicate the results of other investigators

At a minimum, science needs
- Open, online access to experimental data sets
- Annotation of online data sets with adequate metadata (data about the data)
- Use of controlled terms in metadata annotations
- Mechanisms to search for metadata to find relevant experimental results

To make sense of the metadata …
- We need standards
  - For the kinds of things we want to say about the data
  - For the language that we use to say those things
  - For how we structure the metadata
- We need a scientific culture that values making experimental data accessible to others

The OMICs people—whose actual data are much simpler—have focused on the metadata directly

How can we make sense of microarray data?
- What was the substrate of the experiment?
- What array platform was used?
- What were the experimental conditions?
[Image: DNA microarray]

MIAME structures metadata for online repositories such as the Gene Expression Omnibus (GEO) and ArrayExpress
- Provides a framework for assuring that scientists can understand the basics of what experiment was performed
- Offers a model that lots of groups have copied

BioSharing lists more than 100 guidelines like MIAME for reporting experimental data
[Screenshot: BioSharing listing, labeled “85 Guidelines”]

The Good News: Minimal information checklists, such as MIAME, are being advanced from all sectors of the biomedical community
The Bad News: Investigators view requests for even “minimal” information as burdensome; there is still the pervasive problem of “What’s in it for me?”

Submitting to GEO
[Figure labels: Metadata, Submission Summary, Data Matrix, Raw Data]

Reliance on free text leads to a mess! age Age AGE `Age age (after birth) age (in years) age (y) age (year) age (years) Age (years) Age (Years) age (yr) age (yr-old) age (yrs) Age (yrs) age [y] age [year] age [years] age in years age of patient Age of patient age of subjects age(years) Age(years) Age(yrs.) Age, year age, years age, yrs age.year age_years
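The mess above can be made concrete with a small sketch. It collapses a sample of the field names from the slide onto one canonical key using a deliberately crude heuristic: a name counts as "age" when its first alphabetic token is "age". The function name `canonical_field` and the heuristic itself are illustrative assumptions, not anything GEO or CEDAR is known to do; a controlled vocabulary applied at authoring time would avoid the need for this kind of cleanup.

```python
import re

# A sample of the free-text field names shown on the slide.
AGE_VARIANTS = [
    "age", "Age", "AGE", "`Age", "age (after birth)", "age (in years)",
    "age (y)", "Age (Years)", "age (yr-old)", "age [years]",
    "age of patient", "age(years)", "Age(yrs.)", "Age, year",
    "age.year", "age_years",
]


def canonical_field(name: str) -> str:
    """Map a free-text field name to 'age' if its first word is 'age'.

    Hypothetical heuristic for illustration only: lower-case the name,
    split it into alphabetic tokens, and inspect the first token.
    Names that do not match are returned unchanged.
    """
    tokens = re.findall(r"[a-z]+", name.lower())
    if tokens and tokens[0] == "age":
        return "age"
    return name


if __name__ == "__main__":
    for variant in AGE_VARIANTS:
        print(f"{variant!r} -> {canonical_field(variant)!r}")
```

Even this toy version shows the limits of after-the-fact repair: it discards the unit hints ("years", "yr") and cannot tell "age (after birth)" from gestational age, which is the argument for structured templates rather than free text.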

http://metadatacenter.org

CEDAR Partners
- Stanford University School of Medicine
- University of Oxford e-Science Centre
- Yale University School of Medicine
- Northrop Grumman Corporation
- Stanford University Libraries

From standard checklists to CDE-like elements to annotation templates
[Figure: BioSharing ReportingGuideline records (bsg-000174, bsg-000161; MINSEQE, MIMARKS) linked, with provenance, to template elements (el-000001, el-000002, el-000003); element labels include sample information, identifier, taxonomy, sequence read, geo location]
- High-level information about the metadata standards
- Representations of the standards’ elements
- Template elements
- Serve machine-readable content for metadata standards, providing provenance for their elements
- Inform the creation of metadata templates, rendering standards invisible to the researchers

HIPC centers submit data directly to the NIAID ImmPort data repository using detailed, experiment-specific templates
[Figure: Example entity–relationship diagram for describing metadata for annotating multiplex bead array assays (e.g., Luminex)]

A Metadata Ecosystem
- HIPC investigators perform experiments in human immunology
- HIPC Standards Working Group creates metadata templates to annotate experimental data in a uniform manner
- ImmPort stores HIPC data (and metadata) in its public repository
- CEDAR will ease
  - Template creation and management
  - The use of templates to author metadata for ImmPort
  - Analysis of existing metadata to inform the authoring of new metadata

The CEDAR Approach to Metadata

CEDAR technology will give us
- Mechanisms
  - To author metadata template elements
  - To assemble them into composite templates that reflect community standards
  - To fill out templates to encode experimental metadata
- A repository of metadata from which we can
  - Learn metadata patterns
  - Guide predictive entry of new metadata
- Links to the National Center for Biomedical Ontology to ensure that metadata are encoded using appropriate ontology terms
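One way to picture "learn metadata patterns" and "guide predictive entry" is a frequency count over previously submitted records. The sketch below is an assumption-laden illustration, not CEDAR's actual method: the function `suggest_values`, the record layout, and the sample values are all hypothetical. It ranks candidate values for an empty field by how often they co-occurred with the values the author has already entered.

```python
from collections import Counter


def suggest_values(records, filled, target, top_n=3):
    """Rank likely values for `target` given already-filled fields.

    records: list of dicts, one per previously stored metadata record
    filled:  dict of field -> value the author has entered so far
    target:  name of the field to suggest values for
    Returns up to `top_n` values, most frequent first.
    """
    counts = Counter(
        record[target]
        for record in records
        if target in record
        and all(record.get(key) == value for key, value in filled.items())
    )
    return [value for value, _ in counts.most_common(top_n)]


# Hypothetical repository contents, for illustration only.
records = [
    {"organism": "Homo sapiens", "tissue": "blood", "assay": "Luminex"},
    {"organism": "Homo sapiens", "tissue": "blood", "assay": "flow cytometry"},
    {"organism": "Homo sapiens", "tissue": "blood", "assay": "Luminex"},
    {"organism": "Mus musculus", "tissue": "spleen", "assay": "flow cytometry"},
]

if __name__ == "__main__":
    print(suggest_values(records, {"organism": "Homo sapiens"}, "assay"))
```

A real system would presumably draw suggestions from ontology terms rather than raw strings, so that the ranked options are already controlled values.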

Some key features of CEDAR
- All semantic components (template elements, templates, ontologies, and value sets) are managed as first-class entities
- User interfaces and drop-down menus are not hardcoded, but are generated on the fly from CEDAR’s semantic content
- All software components have well defined APIs, facilitating reuse of software by a variety of clients (see https://metadatacenter.org/tools-training/cedar-api)
- CEDAR generates all metadata in JSON-LD, a widely adopted Web standard that can be translated into other representations
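To illustrate what metadata in JSON-LD can look like, here is a minimal sketch of a single metadata record built and serialized in Python. The field names, the `example.org` IRIs, and the record identifier are assumptions made for illustration and do not reflect CEDAR's actual template model; the only real-world identifier used is the NCBI Taxonomy IRI for Homo sapiens.

```python
import json

# A hypothetical metadata instance. "@context" maps short field names to
# IRIs so that any JSON-LD processor can interpret them unambiguously.
instance = {
    "@context": {
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        "xsd": "http://www.w3.org/2001/XMLSchema#",
        # example.org IRIs are placeholders, not real CEDAR fields.
        "organism": "https://example.org/fields/organism",
        "ageInYears": "https://example.org/fields/ageInYears",
    },
    "@id": "https://example.org/metadata/instance-001",
    # The value is an ontology term (NCBI Taxonomy: Homo sapiens),
    # not a free-text string.
    "organism": {
        "@id": "http://purl.obolibrary.org/obo/NCBITaxon_9606",
        "rdfs:label": "Homo sapiens",
    },
    # A typed literal replaces the many "age (yrs)" spellings.
    "ageInYears": {"@value": "42", "@type": "xsd:integer"},
}

serialized = json.dumps(instance, indent=2)

if __name__ == "__main__":
    print(serialized)
```

Because the record is ordinary JSON, existing tools can read it as-is, while the `@context` and `@id` keys carry the links to ontology terms that make the values comparable across repositories.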

Library of Integrated Network-Based Cellular Signatures

Hydra-in-a-Box

How will CEDAR make data FAIR?
Data will be more findable because
- Datasets will be associated with rich metadata
- Use of community-based templates will set expectations for the metadata elements that need to be specified
- Use of predictive metadata entry will encourage the creation of more complete metadata specifications
- Metadata will use standard ontologies and value sets that are themselves based on FAIR principles
- Metadata, and the data sets that they reference, will be given unique identifiers

How will CEDAR make data FAIR?
Data will be more accessible because
- All metadata and their associated data sets will have unique identifiers
- The CEDAR metadata archive will be persistent, even if the target data repositories are ever decommissioned

How will CEDAR make data FAIR?
Data will be more interoperable because
- CEDAR metadata are stored in a standard, widely used knowledge-representation language (JSON-LD)
- Use of community-based content standards (for templates, terminologies, and value sets) assures shared representation of metadata

How will CEDAR make data FAIR?
Data will be more reusable because
- Metadata will include provenance information for
  - Data sets
  - Metadata themselves, an essential component because ontologies, experimental methods, and scientific knowledge evolve
- Metadata will be more comprehensive and more complete because
  - They will be easier to author
  - They will be constructed with community buy-in, due to the use of community-developed standards

Where is this ecosystem leading?
- Formalization of scientific knowledge in terms of ontologies
- Use of ontologies to standardize scientific communication
In the long term …
- Dissemination of the results of scientific investigations purely as machine-interpretable knowledge
- Intelligent agents that will help us to synthesize scientific knowledge from online representations

Imagine a Siri-like intelligent agent on the Web …
- That tells you when new studies have been done in disciplines that you care about
- That can compare and contrast study results across dimensions that matter to you
- That can identify those studies whose subjects best match the characteristics of particular patients
- That can tell investigators when the methods of their proposed studies are nearly identical to those of earlier studies

Once scientific knowledge is disseminated in machine-interpretable form …
- Technology such as CEDAR will assist in the automated “publication” of scientific results online
- Intelligent agents will
  - Search and “read” the “literature”
  - Integrate information
  - Track scientific advances
  - Suggest the next set of experiments to perform
  - And maybe even do them!

http://metadatacenter.org