Identifiers and Citation an introduction Joan Starr California Digital Library
Agenda Introducing identifiers Citation practices and issues Q&A ARKs and DOIs Using EZID to mint and assign them Using DataCite to track citations Citation practices and issues Granularity Versioning Dynamic data Q&A
Why identify data? get credit ensure transparency measure impact track use (even over a long project) aid reproducibility
Which identifier? machine (& human) actionable globally unique widely used by a community commitment to persistence your requirements here Credit: Jt. Declaration of Data Citation Principles
ezid.cdlib.org Introducing one option for creating and managing long-term globally unique identifiers is EZID, from California Digital Library. At the present time, EZID offers 2 identifier types that meet the above requirements, ARKs and DOIs. There are other identifiers that meet these requirements as well, but I am not qualified to speak at length about all of them. Today I’m going to talk to you about these two.
DOIs ARKs Strict metadata requirements Flexible metadata guidelines From the scholarly communication community From the archives and museums community Established “brand name” Option-rich, open source Use case: Data Citation Use case: Data Documentation
ARKs and DOIs in data management plans Image credit: http://www.flickr.com/photos/52041471@N04/6053042920
ARKS: Data Documentation What it looks like: At top-level directory/folder: Project Title Unique Identifier Date (yyyy or yyyy.mm.dd) At sub-directories: optional identifiers at granular levels Sample plan language: [Team] follows the recommended best practice for good data management by assigning unique identifiers (ARKs) to the data as part of the data documentation. Why it’s important: Researcher benefits: Data documentation helps you keep track of (and remember) aspects of your data throughout the research project. Reference for “documentation and metadata”: http://libraries.mit.edu/guides/subjects/data-management/metadata.html
DOIs: Data Citation What it looks like: Sample plan language: Aguilée R, Lambert A, Claessen D (2011) Data from: Ecological speciation in dynamic landscapes. Journal of Evolutionary Biology doi:10.5061/dryad.74024 Publication of data shall occur during the project, if appropriate, or at the end of the project, consistent with normal scientific practices. [Team] follows a standardized data product citation including DOI, that indicates the version and how to obtain a copy of that product. Why it’s important: OSTP mandate to: identify and provide “appropriate attribution to scientific data sets” Researcher benefits: Credit, increased citations, increased productivity OSTP Public Access Memo: https://web.archive.org/web/20161218061244/https://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf Increased citations: Piwowar’s 2007 study: http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0000308
Working with DOIs in EZID Image credit: http://www.flickr.com/photos/52041471@N04/6053042920
IDENTIFY image ©2014 University of California 1. Assign a DOI.
ezid.cdlib.org
ezid.cdlib.org
CITE image ©2014 University of California 2. Use the DOI.
TRACK image ©2014 University of California 3. Track the DOI’s use.
profiles.datacite.org orcid.org search.datacite.org EZID is a founding member of DataCite, so all EZID DOIs are DataCite DOIs and EZID clients are able to use all DataCite tools and services. Auto-update
AUTOMATE 4. Make assigning DOIs automatic. image ©2014 University of California 4. Make assigning DOIs automatic.
To try it: username: apitest pword: apitest
Agenda Introducing identifiers Citation practices and issues Q&A ARKs and DOIs Using EZID to mint and assign them Using DataCite to track citations Citation practices and issues Granularity Versioning Dynamic data Q&A
GRANULARITY Image: https://www.flickr.com/photos/39495096@N04/4656925337 by velmc Granularity describes the degree of aggregation of the object to be registered. Different levels of granularity can be useful depending on discipline or resource. The identification of an object can be executed to any desired level of granularity (element of a file, a file, a file collection, etc.) as the purpose dictates. When deciding which level of granularity is used to register an object, THERE ARE SEVERAL FACTORS TO KEEP IN MIND.
Granularity considerations Citation: What is likely to be cited? Use cases: How will funders/publishers/administrators/etc. use the data? Complexity: What type of resource/structure is being registered? Complex objects may require a more granular identifier structure than a document or image file. Sustainability: PID owners must be able to maintain target URLs and citation metadata for PIDs. Relationships: The DataCite Metadata Schema includes a flexible mechanism to specify relationships between objects. DOIs have strict contractual requirements for maintenance.
VERSIONING Image: https://www.flickr.com/photos/sweetascandybooth/13104215864 by sweet as candy photo I’ll base my versioning practice suggestions on 2 sources: DataCite Metadata Schema documentation and the Earth Science Information Partners (ESIP)
Versioning considerations ESIP: track major_version.minor_version For ESIP, see: http://wiki.esipfed.org/index.php/Interagency_Data_Stewardship/Citations/provider_guidelines#Note_on_Versioning_and_Locators Datacite: Register a new DOI for a major version change. Individual stewards need to determine major vs. minor. DataCite: Use AlternateIdentifier and RelatedIdentifier to indicate various information updates and Description to indicate the nature and file/record range of version. For DataCite, see: http://schema.datacite.org/
DYNAMIC DATA https://www.flickr.com/photos/adstream/2723201167 by Jon Anderson Citation of dynamic data is an evolving topic. There is no single best practice, because the best approach depends on the characteristics of the data, as well as the capabilities of the repository storing the data. I’m going to present the DataCite recommendations and then point you to some sources for more information and developments.
Citing dynamic data Cite a specific slice (the set of updates to the dataset made during a particular period of time or to a particular area of the dataset); Cite a specific snap-shot (a copy of the entire dataset made at a specific time); Cite the continuously updated dataset, and add an Access Date and Time to the citation; Cite a query time-stamped for re-execution against versioned DB (Research Data Alliance proposal) For datasets that are continuously and rapidly updated, there are special challenges both in citation and preservation. 4 approaches are possible. Note that a “slice” and “snap-shot” are versions of the dataset and require unique identifiers. The third and fourth options are controversial. The 3rd, because it necessarily means that following the citation does not result in observation of the resource as cited. The 4th, because it shifts a significant burden onto repositories to store database versions for all the queries.
More about dynamic data Ball, A. & Duke, M. (2015). ‘How to Cite Datasets and Link to Publications’. DCC How-to Guides. Edinburgh: Digital Curation Centre. Available online: http://www.dcc.ac.uk/resources/how-guides/cite-datasets#sec:versions For more information about the RDA approach, see this post on DataCite’s blog: https://blog.labs.datacite.org/dynamic-data-citation-webinar/ Another place to watch is Force11: https://www.force11.org/group/dcip
Agenda Introducing identifiers Citation practices and issues Q&A ARKs and DOIs Using EZID to mint and assign them Using DataCite to track citations Citation practices and issues Granularity Versioning Dynamic data Q&A
CONTACT TWEET VISIT ezid@ucop.edu @ezidCDL ezid.cdlib.org