Update on Data Publishing With Dataverse Intro to Dataverse and Data Science Team Options (3) for Journals to publish data associated with their articles OJS integration project Rigorous Data Publishing Workflows with 4.0 (Versioning) 4.0 Deaccession only once published w/ a reason (cannot delete). Always a landing page with a DOI Publishing Privacy Sensitive Data with Dataverse Eleni Castro, Research Coordinator Institute for Quantitative Social Science (IQSS) Harvard University DataCite Annual 2014 Nancy, France August 25, 2014
Introduction to Dataverse Software framework for publishing, citing and preserving research data (open source on github for others to install) Provides incentives for researchers to share: Recognition & credit via data citations Control over data & branding Fulfill Data Management Plan requirements Harvard Dataverse (open to all; repository instance at Harvard) currently has: 761 Dataverses > 1 Million Downloads 54,828 Datasets EZID DOI (2013) 748,554 Files
Who’s Using Dataverse? Worldwide Dataverse Installations Institutions can setup & host their own Dataverse installation (e.g., Odum, OCUL, DANS, Fudan, etc) and within them can support datasets from a variety of users (across all research domains): Researchers, Projects, Departments, Journals, etc.
Journals Publishing Data w/ Dataverse Option A. Journals include Dataverse as a Recommended Repository Option B. Authors Contribute Directly to a Journal Dataverse Option B. Step 3: This can be done at the same time as the journal article is being reviewed. Option C. Seamless Integration btw Journal + Dataverse (e.g., OJS)
OJS-Dataverse Integration OJS Journal Journal Dataverse Citation to Data Citation to Article Details/Updates: 2 Year Project 2012-2014 Integrating w/ PKP’s Open Journal Systems (Data Deposit API). Pilot with ~ 50 journals + expanding outreach (100s) . OJS’ Dataverse plugin now available with latest OJS release. Future: Embed Dataverse widgets into journal article. http://projects.iq.harvard.edu/ojs-dvn
OJS Plugin: Journal Data Policies Boilerplate Templates Including Guidelines for: Authors (w/ data citation) Reviewers Boilerplate policies for authors to deposit and cite data in OJS Dataverse plugin. Includes boiler plate for Reviewers and copyeditors. Read full Data Policies / Guidelines Template: http://bit.ly/1xkLjoZ
OJS Plugin: Author Manuscript + Data Submission Joint Declaration of Data Citation principles #3 In scholarly literature, whenever and wherever a claim relies upon data, the corresponding data should be cited Option to: (A) deposit into Dataverse AND/OR; (B) if data is already in a repository can include the data citation (w/ persistent URL/identifier).
OJS Plugin: Editor Reviews Article + Data Joint Declaration of Data Citation principles #3 In scholarly literature, whenever and wherever a claim relies upon data, the corresponding data should be cited
Data Published in Dataverse w/ OJS Plugin In OJS: In Dataverse: 2 Options in OJS: 1) Dataset Published (with DOI) at Article Approval. 2) Dataset Published when Journal Issue is Released.
OJS Plugin: Article Published w/ Data Citation Now that the Data Citation is listed on the same page as the article it helps it go one step closer to data citation principle #1 Importance: Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications
Towards An Integrated Publishing Lifecycle See: Data Citation Principle #1 Importance This is really a reference implementation that we hope will inspire other repositories and journal management / publishing systems to use our open source code to expand and extend the possibilities of data publishing. Its one step in the right direction to bringing data closer to the same level of importance as the published article. Image Credit: Mercè Crosas
Publishing in 4.0 (Late Fall 2014)
Rigorous Data Publishing Workflows Upload Draft Dataset See Data Citation Principle #7 Specificity & Verifiability Published Dataset v1 Publish Version 1 Authors, Title, Year, DOI, Repository, V1 Published Dataset v1.1 Publish Version 1.1: small metadata change; citation doesn’t change. Compliant w/ Joint Declaration of Data Citation Principles and based on the paper by Altman and King, 2007 were pushing for machine readable information to be added to data citations (UNF, persistent identifier). Principle #7 Specificity & Verifiability Data citations should facilitate identification of, access to, and verification of the specific data that support a claim. Citations or citation metadata should include information about provenance and fixity sufficient to facilitate verfiying that the specific timeslice, version and/or granular portion of data retrieved subsequently is the same as was originally cited Publish Version 2: File change (automatic); big metadata change (e.g., author, title). Published Dataset v2 Authors, Title, Year, DOI, Repository, UNF, V2 See: Altman, M., & King, G. (2007) doi:10.1045/march2007-altman
Dataset Versioning (1) If you don’t add or change any files and havent changed any default metadata fields (eg, author, title) you will get the option to publish the data citation as a minor or major version
Dataset Versioning (2) If you add or change files the version will go up a major version (ex. V1 to V2)
Dataset Versioning (3) Ex. Added files to a Dataset so it bumped up to a major version change. If you add or change files the version will go up a major version (ex. V1 to V2). And here is a screenshot of how you would see the version differences when changing/adding files.
Dataset Versioning (4) Ex. Added small metadata change to a Dataset so it bumped up to a minor version change. If you add or change files the version will go up a major version (ex. V1 to V2)
Deaccession Data in 4.0 Before a Dataset is published the DOI is private (reserved). Only when published is it made public & searchable. In accordance w/ Data Citation Principle #6 Persistence: A Published Dataset cannot be deleted; only deaccessioned, with a reason. You can Deaccession (in 4.0): a version(s) of a Dataset, or an entire Dataset. And now that we support more granular levels of versioning we are also working on improving dataset deaccession. Prior to 4.0 you could delete a published a dataset which goes completely against data citation best practices of Persistence (#6)
Deaccession Workflow (Step 1) Ex. This file was added in v2 and has identifiable information. To try out deaccessioning, go to a dataset you’ve already published (or add a new one and publish it), click on Edit Dataset, then Deaccession Dataset. If you have multiple versions of a dataset, you can select here which versions you want to deaccession or choose to deaccession the entire dataset.an entire dataset.
Deaccession Workflow (Step 2) You are then presented with the option to deaccession the entire dataset or just certain version(s) with a drop down list of "Reasons for deaccession". Are there any important reasons to deaccession that we might be missing here?
Deaccession Workflow (Step 3) Screenshot 3: Once you select a reason you can enter additional information. This is particularly important if you select Other. If the Dataset has moved then a URL may also be entered (we validate URLs).
Deaccession Workflow (Step 4) Deaccession Landing Page Data Citation Principle #6 Persistence A persistent landing page, that includes very limited metadata (see below), will always be accessible to the public if they use the DOI provided in the citation for that dataset. In accordance with Data Citation Principle #6 Persistence: Unique identifiers, and metadata describing the data, and its disposition, should persist -- even beyond the lifespan of the data they describe
Data Publishing After 4.0 (2015) Publishing Privacy Sensitive Data Secure Dataverse DataTags (demo) (based on Privacy Laws and DUAs) In the current version of Dataverse, identifiable data cannot be deposited safely. Based on the data publication need for sufficient information to understand and reuse the data (metadata, documentation, code) after 4.0 users will be able to safely upload datasets that have identifiable information, while a minimal risk version of the data is automatically rendered (synthetic data, differential privacy, statistical disclosure control (SDC), etc) and can be made available for the public. Full interview Integration with ORCID (API): create ORCID account, connect all Dataverse datasets to ORCID account. (Note: 4.0 will already allow for authors to enter ID.)
Thank you! Contact: ecastro@fas.harvard.edu More information: http://datascience.iq.harvard.edu Twitter: @thedataorg