Persistent Identifiers (PIDs) and Data Sharing

A Brief Overview
October 2016
Improving Data Sharing And Re-Use In And For Africa
Larry Lannom, Corporation for National Research Initiatives

PIDs for Data – Why Bother?
Managing increasing amounts of primary and secondary data on the Net over long periods of time
Managing increasingly complex data relationships on the Net over long periods of time
When the attributes of that data such as location(s), responsible parties, and the underlying systems may change dramatically over time
Science builds on past work and increasingly relies on collaboration within virtual distributed communities
All of this absolutely requires reliable, long-term persistent references to bind together the distributed data, processes, and parties involved – referential integrity
This slide is 8 years or so old, pre-dating RDA. I was happy to see that there wasn’t any reason to change it.

PID Considerations – Big Picture
No lack of unique identifiers in the world – that part is easy
  Unique identification is NOT a technical challenge (U.S. SS#)
Strength in numbers – at this point you would need a very good reason to start yet another PID scheme
  Smaller independent schemes will be more fragile and vulnerable to a small group moving on in any fashion, i.e., less persistent
  Reliable well-run systems will tend to grow (nobody gets fired for assigning DOIs?)
  If there is some aspect of a current widely used scheme that doesn’t work for your case, talk to that community
What problem are you trying to solve?
  Don’t start with deciding on a scheme, start with defining the requirements
Resolution Systems – basic decision point
  Single authoritative resolution system (≠ single point of failure): DNS, Handle
  No single authoritative system (but controlled minting): ISBN, SS#

Requirements: Identifier String
Not based on any changeable attributes of the entity
  Location
  Ownership
  Any other attribute that may change w/o changing data itself
Opaque, preferably a ‘dumb number’
  A well known pattern invites assumptions that may be misleading
  Meaningful semantics invite IP wars, language problems
Unique
  Avoid collisions, referential uncertainty
Nice to have
  Human-readable
  Cut-able, paste-able, embeddable
  Fits common systems, e.g., URI specification
All of the above contribute to persistence
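The string requirements above (opaque, unique, no changeable attributes) can be illustrated with a minimal Python sketch. The prefix value, the 16-character suffix length, and the function name mint_identifier are assumptions made for this example only; they are not taken from any particular PID system.

```python
import secrets

# Hypothetical prefix, for illustration only. In a real system the prefix
# is allotted by the identifier system's administrators.
EXAMPLE_PREFIX = "example-prefix"


def mint_identifier(existing: set, prefix: str = EXAMPLE_PREFIX) -> str:
    """Return a new opaque identifier that is not already in `existing`."""
    while True:
        # 16 random hex characters: a 'dumb number' that carries no
        # location, owner, title, or other attribute that could change.
        suffix = secrets.token_hex(8)
        candidate = f"{prefix}/{suffix}"
        if candidate not in existing:  # avoid collisions
            existing.add(candidate)
            return candidate


registry = set()
ids = [mint_identifier(registry) for _ in range(1000)]
```

The identifiers remain cut-able and paste-able plain ASCII, but nothing in them invites a reader to infer where the data lives or who is responsible for it.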

Requirements: Identifier Resolution System
Reliable
  Redundant, no single points of failure
  Fast enough to not appear broken
Scalable
  Higher loads managed with more computers, not new software
Flexible
  Adapt to changing computing environments
  Useful to new applications
Trusted
  Resolution/Administration must be trusted
  Organization must be committed to the long term
Open Architecture
  Leverage efforts of a community in building apps on your infrastructure
Transparent
  But users needing details of the id/infrastructure NOT a good feature
Persistence, again

Using a Resolution System with Existing Identifiers
No lack of identifiers in the world
ISBN mapped to DOI
Example: /99990
The syntax specification, reading from left to right, is:
  Handle System DOI name prefix = "10."
  ISBN (GS1) Bookland prefix = "978." or "979."
  ISBN Publisher prefix = variable length numeric string of 2 to 8 digits
  Prefix/suffix divider = "/"
  ISBN Title enumerator and checkdigit = variable length numeric string of 8 to 2 digits
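The left-to-right syntax above can be sketched as a small Python function. This is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, the caller is assumed to already know where the publisher prefix ends, and the check-digit test is the standard ISBN-13 rule added here as a sanity check rather than something stated on the slide.

```python
def isbn13_check_digit(first12: str) -> int:
    """Standard ISBN-13 check digit: weights alternate 1, 3, 1, 3, ..."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def isbn_to_doi_name(bookland: str, publisher: str, title_and_check: str) -> str:
    """Assemble a DOI name from ISBN-13 parts, following the syntax above."""
    if bookland not in ("978", "979"):
        raise ValueError("Bookland prefix must be 978 or 979")
    if not (publisher.isdigit() and 2 <= len(publisher) <= 8):
        raise ValueError("publisher prefix must be 2 to 8 digits")
    # Publisher prefix plus title enumerator and check digit total 10 digits.
    if not (title_and_check.isdigit() and len(publisher) + len(title_and_check) == 10):
        raise ValueError("title enumerator and check digit have the wrong length")
    digits = bookland + publisher + title_and_check
    if isbn13_check_digit(digits[:12]) != int(digits[12]):
        raise ValueError("ISBN-13 check digit does not match")
    return f"10.{bookland}.{publisher}/{title_and_check}"


# ISBN 978-0-306-40615-7 is used purely as a sample value; treating "0306"
# as the publisher prefix is an assumption made for this illustration.
doi_name = isbn_to_doi_name("978", "0306", "406157")
print(doi_name)  # 10.978.0306/406157
```

The existing identifier is carried over unchanged; only the "10." prefix and the "/" divider are added so that it fits the resolution system.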

Persistence is (primarily) an Organizational Issue
No technology runs itself
Organizations need to commit to persistence
Organizations need the resources to keep their commitments
  Size helps
  Business model needed (profits not required, but funds are)
Organizations need dedication to persistence
  Conflicts of interest, e.g., if profit is the motive (not the case in any major system of which I am aware) then lack of profit will be a problem
Regional organizations will have difficulty growing
  International organization is best, even with the accompanying political and cultural issues

Persistence is an Organizational Issue (but don’t make it harder than it needs to be)
Do not bake changeable attributes into the string, such that users and developers operate with mistaken assumptions
  Ownership if ownership can change
  Organizational names (count the orgs that are > 100 years old and still have the same name)
Assume resolution will change over time
  Usage can and will change
  Computing/networking environments that seem eternal will change
Indirection is a good thing
  ID string will exist as a static set of bits in various formats while the computing and usage environments shift
  Disconnect the string from those things that will change
New functions will evolve over time – don’t make it difficult to connect the ID to those new functions
  Best example of the problem – broken URLs
  An example of a solution – adding functionality to DOIs
    Going from 1-to-1 to 1-to-Many
    Adding linked data as a resolution option
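The indirection argument above can be made concrete with a toy in-memory resolver in Python. Everything here is an assumption for illustration: the Resolver class, the value type names "URL" and "LINKED_DATA", the identifier string, and the example hosts are hypothetical and do not describe how any real resolution system stores its records.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Resolver:
    """Toy resolution table: identifier -> list of (type, value) pairs."""
    records: dict = field(default_factory=dict)

    def register(self, pid: str, values: list) -> None:
        self.records[pid] = list(values)

    def add_value(self, pid: str, value_type: str, value: str) -> None:
        # New functions attach to the same identifier (1-to-Many).
        self.records[pid].append((value_type, value))

    def replace_value(self, pid: str, value_type: str, new_value: str) -> None:
        # The data moved: update the record, leave the identifier alone.
        self.records[pid] = [
            (t, new_value if t == value_type else v) for t, v in self.records[pid]
        ]

    def resolve(self, pid: str, value_type: Optional[str] = None) -> list:
        return [v for t, v in self.records[pid] if value_type in (None, t)]


resolver = Resolver()
pid = "example-prefix/0001"  # hypothetical identifier string
resolver.register(pid, [("URL", "https://old-host.example/data/42")])

# The hosting location changes; references to `pid` stay valid.
resolver.replace_value(pid, "URL", "https://new-host.example/archive/42")

# Later a linked-data view is added as a second resolution option.
resolver.add_value(pid, "LINKED_DATA", "https://new-host.example/archive/42.ttl")

print(resolver.resolve(pid, "URL"))
print(resolver.resolve(pid))
```

A URL embedded directly in a citation would have broken at the move; the identifier did not, because the only thing that changed was the record behind it.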

PID Systems: Which One?
Handle/DOI appears to be the leading candidate
DOI is a community using Handle System technology
  DOI Registration Agencies (RAs) serve specific communities
  The International DOI Foundation (IDF) guarantees the back-end and the continued viability of each DOI even if the responsible RA goes out of business
Other Handle communities are looking to establish that same level of trust
  GWDG, working with EPIC and Max Planck Institutes, is going down this same path
International Foundation (DONA) has been established in Geneva to oversee the Handle System and related efforts
  This was formerly the sole responsibility of CNRI
  Six organizations have joined DONA, will manage the root servers and have the ability to mint prefixes
  South Africa expected to join shortly

Many Other PID Systems
URN (Uniform Resource Name)
  RFC 1737: Functional Requirements for Uniform Resource Names (Sollins & Masinter, 1994)
  RFC 2141: URN Syntax (Moats, 1997)
  RFC 2168: Resolution of Uniform Resource Identifiers using the Domain Name System (Daniel & Mealling, 1997)
  Many updates – new IETF group currently at work
PURL (Persistent URL)
  OCLC 1995
  Close ties to W3C
ARK (Archival Resource Key)
  John Kunze, CDL 2001?
XRI (Extensible Resource Identifier)
  OASIS 2005?
Recommend Persistent identifiers – an overview by Juha Hakala, 2010

Related RDA Efforts
PID Interest Group
Data Type Registries
  Initial output at level of principles
  How do I understand the data of others (MIME ++)
  For PIDs – how do I interpret the return from PID resolution?
  Follow-on Working Group addressing common type record
PID Info Types Working Group
  In Maintenance/Adoption Mode
  Focused on the types of the type/value pairs of PID resolution
  Assumes a Data Type Registry
Data Citation Working Group
  Fourteen Recommendations on Citation of Evolving Data
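The question of how to interpret the return from PID resolution can be sketched in Python as a lookup of each type/value pair against a type registry. The registry contents, the type identifiers such as "example-type/url", and the interpret function are all hypothetical; this is not the RDA Data Type Registry interface, only an illustration of the idea that typed values become understandable once their types are registered somewhere.

```python
from datetime import date

# Hypothetical registry: type identifier -> human-readable name and parser.
TYPE_REGISTRY = {
    "example-type/url": {"name": "location", "parse": str},
    "example-type/date-created": {"name": "creation date", "parse": date.fromisoformat},
    "example-type/size-bytes": {"name": "size in bytes", "parse": int},
}


def interpret(record):
    """Split a resolution result into understood values and unknown pairs."""
    understood = {}
    unknown = []
    for type_id, raw_value in record:
        entry = TYPE_REGISTRY.get(type_id)
        if entry is None:
            # The type is not registered, so the client cannot interpret it.
            unknown.append((type_id, raw_value))
        else:
            understood[entry["name"]] = entry["parse"](raw_value)
    return understood, unknown


# A made-up resolution result consisting of type/value pairs.
resolution_result = [
    ("example-type/url", "https://repository.example/objects/42"),
    ("example-type/date-created", "2016-10-01"),
    ("example-type/size-bytes", "2048"),
    ("example-type/unregistered", "???"),
]

understood, unknown = interpret(resolution_result)
print(understood)
print(unknown)
```

Values whose types are registered come back parsed and named; anything else is surfaced as unknown rather than guessed at.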

Progressions
Pre-Internet
  Dedicated Systems – SSN, ISBN
  Self-explanatory (affordance) – bib cite
The Web hits
  DNS + file name – terrific for quick start, disaster for long-term info mgmt
  Permalink, Cool URIs, etc. – just be careful – it’s not enough for the important stuff over time
Our notion of PID
  One level of indirection, e.g., 1 DOI = 1 URL: good start, but is that all?
  One to Many – id resolves to current state data (original Handle goal)
  Multiple copies, Services such as LoD
  Typing of id values, typing of ids (DOI profiles did not take, RDA Data Type Registry, now going to ISO)
Organizational Efforts
  Research Orgs: CNRI, OCLC
  Standards Bodies: IETF, W3C, RDA (small s standards), NISO, ISO
  International Bodies: DONA, RDA again

What is Left?
Can we get to the same level of common infrastructure for digital object ids as we have for physical addresses (IP) on the Internet?
What would/should this include?
  a common reliable resolution architecture (could be multiple, but all well known and obvious, possibly interchangeable)
  an international governance body that was more dedicated to efficiency and openness than to profits
  a platform on which multiple applications and businesses could be built – as unconstrained as possible
  Enough accompanying information to allow the named entity to be understood and used
  an abstraction/indirection layer for digital entities to take us to the next level – the ability to conjure a named entity into existence, given permissions, with no worries about where it is and how it is formed.

The DO Cloud
End users, developers, and automated processes deal with persistently identified, consistently structured digital objects which are securely & redundantly managed & accessed via the Internet which is an overlay on existing or future information storage systems.
[Diagram labels: ID: 123… A; ID: 987/… F; ID: 843… G; Identifier Service; Repository]

“It's tough to make predictions, especially about the future.”
― Yogi Berra

PID Advantages
Persistent Identity via Indirection
  Static references into fluid systems over time
  Data on networks moves
  Ownership/responsibility change
  Formats change
Embedded Ids
  For data object in hand – current state data
  Updates
  New related entities
Networks of Persistent Links
  Data / metadata links
  Provenance chains
  Inheritance across a broad set of entities

PID Disadvantages
Extra level of effort / cost on creation
  Analysis – what to identify / granularity
  Coordination across organizations
  Maintain resolution system
Persistence requires sustained effort
  Organizational discipline
  Technology necessary but not sufficient
Analyze cost/benefit ratio
  Don’t start unless it’s worthwhile
  Is your data worth it?

20+ Years Later – Why Are We Still Talking About This?
Seems easy enough – assign an ID and, if there is a corresponding resolution system, keep it current
  Assign it to what?
  Resolve it to what?
  Any required metadata and if so, where?
  If it resolves how can I understand what comes back?
  What if multiple pieces of data come back?
  Who is in charge?
  Who can I trust?
  How long does it have to keep working?
  Why will it keep working?
BUT
