A Funder's Perspective on Sustainability of Digital Data Repositories
Allen Dearry, PhD
Director, Office of Scientific Information Management
National Institute of Environmental Health Sciences, NIH
SciDataCon, Denver, CO
September 13, 2016
The Challenge
The US National Center for Biotechnology Information (NCBI) holds 20 PB of data. This is expected to increase 50% this year alone.
The European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI) estimates that biological data double every 12-18 months.
Dark data: 12% of data described in published papers is in recognized archives.
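To put the two growth figures above on the same scale, the following minimal sketch converts between an annual growth rate and a doubling time. It assumes constant compound growth, which the slide does not claim beyond the current year, and the function names are hypothetical helpers introduced here for illustration.

```python
import math


def annual_growth_from_doubling(months_to_double: float) -> float:
    """Annual fractional growth implied by a doubling time given in months."""
    return 2 ** (12.0 / months_to_double) - 1.0


def doubling_time_months(annual_growth: float) -> float:
    """Doubling time in months implied by a constant annual fractional growth."""
    return 12.0 * math.log(2) / math.log(1.0 + annual_growth)


def project(size_pb: float, annual_growth: float, years: int) -> float:
    """Projected holdings in PB after a number of years of compound growth."""
    return size_pb * (1.0 + annual_growth) ** years


# EMBL-EBI estimate: data double every 12-18 months.
fast = annual_growth_from_doubling(12)   # doubling every 12 months
slow = annual_growth_from_doubling(18)   # doubling every 18 months

# NCBI figure: 20 PB, growing 50% this year.
ncbi_doubling = doubling_time_months(0.5)
ncbi_next_year = project(20.0, 0.5, 1)

print(f"Doubling every 12 months = {fast:.0%} growth per year")
print(f"Doubling every 18 months = {slow:.0%} growth per year")
print(f"50% growth per year = doubling every {ncbi_doubling:.1f} months")
print(f"NCBI holdings after one year: {ncbi_next_year:.0f} PB")
```

Under this assumption, a 12-18 month doubling time corresponds to roughly 59-100% growth per year, so the 50% NCBI figure sits slightly below the slow end of the EMBL-EBI range.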
BD2K Sustainability Workgroup
Function: Develop an NIH vision for economic, technical, and social stewardship of biomedical data repositories.
Goals
Goal 1: Define metrics for evaluation of biomedical data repositories and assess value.
Goal 2: Develop a sustainable lifecycle and coherent funding plan in support of biomedical research data.
What Solutions Are We Exploring?
Metrics for review and evaluation
Enhance efficiency and effectiveness of curation
International collaboration on business models
Pilots: Model Organism Databases (MODs); Commons
Request for Information: Metrics to Assess Value of Biomedical Digital Repositories (NOT-OD-16-133)
Which data should be preserved and for how long?
Qualitative and quantitative metrics, such as:
Utilization at multiple levels (repository, dataset, data item): size and demand of community served
Indicators of repository quality and impact: publications, citations, altmetrics, patents
Quality of service: data quality measures; user support and training
Infrastructure and governance: advisory board; legal structure
Qualitative metrics for the above categories, e.g., use cases/case studies
Case studies demonstrating value, e.g., What would happen in the absence of the repository? If the repository weren’t available, how would that impact your work?
Responses due September 30
Interagency Workshop on Measuring the Impact of Data Repositories
Organized by Big Data Interagency Work Group
Planning group: NIH, NSF, NIST, NARA, DOT, NTIS, DHS
December 8 and 9, Washington DC
Repository managers, data producers & users, funders, publishers, metrics/evaluation experts
Workshop Objectives
Identify current metrics, tools and methodologies for assessing and communicating impact of digital repositories.
Identify technical, social and financial obstacles.
Synthesize results into best practices for both near- and long-term success.
Big Data to Knowledge (BD2K): Enhancing the Efficiency and Effectiveness of Digital Curation for Biomedical Big Data (RFA-LM-17-001)
Efficient Tools
Automated or semi-automated approaches
Improve speed and accuracy
Support data annotation at points throughout the research lifecycle
Distributed, crowdsourced approaches to curation
In more detail:
Tools and templates to facilitate consistent use of community-defined standards, such as common data elements and standards used by archival resources such as GenBank, SRA, BioSample, etc.
Automated or semi-automated approaches to merging (harmonizing) disparate or heterogeneous data sets for purposes of new research.
Approaches that improve the speed and accuracy of extracting metadata information from text or other digital sources, and linking the information to a data set or other digital asset.
Approaches that support data annotation at points throughout the research lifecycle (data gathering, preparation of data for sharing, public sharing of data sets, submission or review of articles supported by data sets, etc.).
Distributed approaches to curation processes that increase the efficiency, completeness, accuracy or quality of the digital asset.
Sustainable Business Models for Data Repositories
Organized by OECD Global Science Forum
Co-chaired with Simon Hodson, CODATA, & Ingrid Dillo, DANS
Landscape survey, July-September
Broad spectrum of 60 worldwide repositories
Characteristics, metrics, income streams, future funding, alternative income, cost optimization
Workshops
1.1 Innovative income streams, November 3, Paris
1.2 Cost restraint, November 4, Paris
2.0 Business models, March 2017, Brussels
Session at SciDataCon, September, Denver
Report and recommendations, April 2017
Future of Life Sciences and Biomedical Databases
Organized by International Human Frontier Science Program
Information gathering, June-September
20 life science repositories
Funding model, user community, usage, metrics, sustainability challenges, contingency plans
Workshop: "Life sciences data resources and the future," November 18-19, Strasbourg
International data resources, public & private funding agencies, scientific organizations, publishers
White paper
NIH, GA4GH, EBI, academic experts
Data management/curation, QA, IP, commons, DOI, FAIR, sustainable funding, improved efficiencies
Example: Model Organism Databases
Highly curated and valuable data
Siloed / not interoperable
Cumbersome to compute over all the data
Costly to maintain as individual resources
Pilot a new infrastructure model
Alliance Genomic Resources: SGD, FlyBase, WormBase (WB), MGD, ZFIN, GO Consortium (GOC)
MODs support biomedical research across NIH and internationally.
Aim is to support findable, accessible, interoperable and reusable (FAIR) model organism data, facilitated by the NIH Commons platform.
Different user access & analyses
User confusion from lack of homogeneity
User access interfaces need different navigation skills and data access approaches for each resource
Semantic inconsistencies and different data structures for the same genomic entities
Analyses: human/model organism association for disease and phenotypes; functional annotation; homology representation
Redundancy of operations
Standardize interfaces
Standardize curation, display of shared data
The Data Commons Framework
Compute Platform: Cloud or HPC
Services: APIs, Containers, Indexing
Software: Services & Tools (scientific analysis tools/workflows)
Data: "Reference" Data Sets; User defined data
Digital Object Compliance
App store/User Interface
Treats products of research (data, software, metadata, workflows, papers, etc.) as digital objects.
Digital objects exist in a shared virtual space: Deposit, Manage, Find, Share, and Re-Use digital objects.
Conforms to FAIR principles: Findable, Accessible, Interoperable, Reusable.
A detailed description of the Commons Framework can be found at: https://datascience.nih.gov/commons
Data Science at NIH
allen.dearry@nih.gov
https://datascience.nih.gov
bd2k@nih.gov
@NIH_BD2K #BD2K