Revisiting Self-Deposit of Scientific Data
Darren Hardy
Stanford University
Open Repositories, 10 June 2015, Indianapolis, IN

Abstract: Sharing scientific data is increasingly valuable for reproducible science, furthering investigation, and innovation. To this end, repositories facilitate data sharing by making scholarly data available. We are at an impasse, however. Librarian-mediated approaches to self-deposit of scientific data are very resource-intensive, and the repository services provided to researchers are often limited. Self-deposit is quite a challenging use case as it encompasses data preparation, metadata description, upload, visualization, annotation, sharing, publication, access, rights, preservation, citation, and discovery services. This editorial suggests we revisit the value proposition we make for self-deposit and mitigate its resource-intensive workflows.
Why share?
  Sharing scientific data is increasingly valuable
  Reproducible, open science
  Furthering investigation, innovation
  “share [data], and do so in such a way that the data are interpretable and reusable by others” (Borgman 2012)
Why repositories?
  Repositories in position to facilitate sharing
  “The centerpiece of such data sharing [for reuse] is the digital repository, which acts as the foundation for surrounding value-added services supporting and promoting effective publication, discovery, and dissemination of research data” (Abrams et al. 2013)
But, when researchers self-deposit scholarly scientific data, what are their expectations for services?
✔︎ Share Data
  Here’s my data… Email it!
  Preparation… not likely
  Citation… “personal communication”
  Access… email only
  Preservation… nope
  Discovery… nope
  Rights… nope
😃 Self-Publish Data
  Here’s my data… Personal or project website, maybe file sharing service like Dropbox
  Preparation… maybe
  Citation… via URL
  Access… as long as website works
  Preservation… nope
  Discovery… not assured, maybe Google works
  Rights… maybe
Self-Deposit Data
  Here’s my data… Deposited in institutional repository
  Preparation… recommended with suggestions
  Citation… persistent
  Access… ensured, data & metadata
  Preservation… long-term
  Discovery… many indexes
  Rights… explicit, multiple choices
Example
  Marine ecologist Malin Pinsky
  Published research on Pacific salmon conservation
  Article: Pinsky et al. 2009, Conservation Biology 23(3)
  Visible: Used in testimony before the US Senate in 2010
  Self-published GIS data on his personal website
  Graduated from Stanford, went to Rutgers
  Website taken down(!)… 404 Not Found
  Then, self-deposited into Stanford repository
  Now, discovery, access, and preservation services
Scientific data visualized as paper map in Pinsky et al. (2009)
Self-Deposit can provide direct data access Download the actual data!
…with auxiliary downloads
…with citation services
…with discovery services
  Via SearchWorks, our library catalog
  Via EarthWorks, our GIS data search engine
  Via Google, etc.
  “pinsky salmon data”
  Stanford self-deposit is first hit
(again) …with direct data access
Stanford Digital Repository (sdr.stanford.edu)
  Self-deposit interface to a Hydra repository
  2+ years in production
  300+ depositors
  2,000+ deposits
  20,000+ deposited files
  3+ TB preserved
  Self-training via video, quickstart guide
  But, no added services for scientific data
Barriers vs. Expectations
  Participation
    no extra work for depositors
  Metadata creation
  Data preparation
    will this be a requirement of open science?
  Resource limitations
    who will write the code? shepherd deposits?
Are we at an impasse?
  Librarian-mediated approaches are very resource-intensive
  Software and services are often resource-limited
Closing the gap
  Mitigate resource-intensive workflows for librarians, curators
  Improve the value proposition for depositors
  Data preparation, metadata description, upload, visualization, annotation, sharing, publication, access, rights, preservation, citation, related work, ontology, discovery, social media