Scientific Data as Research Infrastructure: The Biomedical Sciences
Strategies for Economic Sustainability of Publicly Funded Data Repositories
Board on Research Data and Information, March 12, 2014
Alexa T. McCray
Center for Biomedical Informatics
Harvard Medical School
Image created by Rachel Jones

Data Sharing Policies in the Life Sciences

There are many reasons to share data from NIH-supported studies … Data should be made as widely and freely available as possible while safeguarding the privacy of participants, and protecting confidential and proprietary data.
    NIH Data Sharing Policy and Implementation Guidance

The ICMJE member journals will require, as a condition of consideration for publication, registration in a public trials registry.
    Clinical Trial Registration: A Statement from the International Committee of Medical Journal Editors

A condition of publication in a Nature journal is that authors are required to make materials, data and associated protocols promptly available to others without undue qualifications. The preferred way to share large data sets is via public repositories…
    Instructions to authors, Nature Journals

We continue to request that the authors provide the “data underlying the findings described in their manuscript”… authors need to indicate where the data are housed, at the time of submission.
    PLOS Data Policy

Where Biomedical Data are Housed

National Institutes of Health
    NLM/NCBI: dozens of databases
    Institute-specific databases, e.g., National Database of Autism Research
Nucleic Acids Research 2014 database issue
    58 new molecular biology databases
    Updates to 123 databases
Community-driven domain-specific repositories
    BioDB catalogue lists 622 databases
    e.g., AgingGenesDB, MousePhenome Database, FlyBase

Data Stewardship in Transition

Plan for an NIH Data Discovery Index (DDI)
    Index of publicly available biomedical datasets
    Motivation
        Catalyze scientific progress
        Reduce duplication of experimental data collection
        Reward the data provider
Long-term sustainability of the underlying data sources not clear
    Long-term value of the data
    Long-term costs dependent on selection, curation, and access