Persistent Identifiers (PIDs) and Data Sharing

Presentation transcript:

Persistent Identifiers (PIDs) and Data Sharing: A Brief Overview
October 2016
Improving Data Sharing And Re-Use In And For Africa
Larry Lannom, Corporation for National Research Initiatives
http://www.cnri.reston.va.us/
http://www.handle.net/

PIDs for Data – Why Bother?
- Managing increasing amounts of primary and secondary data on the Net over long periods of time
- Managing increasingly complex data relationships on the Net over long periods of time
- When the attributes of that data, such as location(s), responsible parties, and the underlying systems, may change dramatically over time
- Science builds on past work and increasingly relies on collaboration within virtual distributed communities
- All of this absolutely requires reliable, long-term persistent references to bind together the distributed data, processes, and parties involved – referential integrity

This slide is 8 years or so old, pre-dating RDA. I was happy to see that there wasn’t any reason to change it.

PID Considerations – Big Picture
- No lack of unique identifiers in the world – that part is easy
  - Unique identification is NOT a technical challenge (U.S. SS# - 1935)
- Strength in numbers – at this point you would need a very good reason to start yet another PID scheme
  - Smaller independent schemes will be more fragile and vulnerable to a small group moving on in any fashion, i.e., less persistent
  - Reliable well-run systems will tend to grow (nobody gets fired for assigning DOIs?)
  - If there is some aspect of a current widely used scheme that doesn’t work for your case, talk to that community
- What problem are you trying to solve?
  - Don’t start with deciding on a scheme, start with defining the requirements
- Resolution Systems – basic decision point
  - Single authoritative resolution system (≠ single point of failure): DNS, Handle
  - No single authoritative system (but controlled minting): ISBN, SS#

Requirements: Identifier String
- Not based on any changeable attributes of the entity
  - Location
  - Ownership
  - Any other attribute that may change w/o changing data itself
- Opaque, preferably a ‘dumb number’
  - A well known pattern invites assumptions that may be misleading
  - Meaningful semantics invite IP wars, language problems
- Unique
  - Avoid collisions, referential uncertainty
- Nice to have
  - Human-readable
  - Cut-able, paste-able, embeddable
  - Fits common systems, e.g., URI specification
- All of the above contribute to persistence
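One way to read the string requirements above is as a minting rule: the suffix is drawn at random instead of being derived from location, ownership, or title. The following is a minimal sketch under stated assumptions, not a description of how any real system mints identifiers. The prefix `20.500.EXAMPLE`, the function name `mint_identifier`, and the in-memory set standing in for a registry's uniqueness check are all hypothetical.

```python
import secrets

# Hypothetical prefix for illustration only; a real prefix is assigned by the
# identifier system's naming authority.
EXAMPLE_PREFIX = "20.500.EXAMPLE"

# Digits and consonants only, with 0, 1, l and o left out, so suffixes cannot
# spell words and are less likely to be mistyped.
ALPHABET = "23456789bcdfghjkmnpqrstvwxz"


def mint_identifier(issued: set, prefix: str = EXAMPLE_PREFIX, length: int = 10) -> str:
    """Return a new opaque identifier and record it in `issued`."""
    while True:
        # The suffix carries no information about the identified entity.
        suffix = "".join(secrets.choice(ALPHABET) for _ in range(length))
        candidate = f"{prefix}/{suffix}"
        if candidate not in issued:  # uniqueness: never issue the same string twice
            issued.add(candidate)
            return candidate
```

Because nothing about the entity is encoded in the string, a change of location or owner never forces a change of identifier.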

Requirements: Identifier Resolution System
- Reliable
  - Redundant, no single points of failure
  - Fast enough to not appear broken
- Scalable
  - Higher loads managed with more computers, not new software
- Flexible
  - Adapt to changing computing environments
  - Useful to new applications
- Trusted
  - Resolution/Administration must be trusted
  - Organization must be committed to the long term
- Open Architecture
  - Leverage efforts of a community in building apps on your infrastructure
- Transparent
  - Users needing to know details of the id/infrastructure is NOT a good feature
- Persistence, again

Using a Resolution System with Existing Identifiers
- No lack of identifiers in the world
- ISBN mapped to DOI
  - Example: 10.978.12345/99990
  - The syntax specification, reading from left to right, is:
    - Handle System DOI name prefix = "10."
    - ISBN (GS1) Bookland prefix = "978." or "979."
    - ISBN Publisher prefix = variable length numeric string of 2 to 8 digits
    - Prefix/suffix divider = "/"
    - ISBN Title enumerator and check digit = variable length numeric string of 8 to 2 digits
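The left-to-right syntax above can be assembled mechanically once the ISBN has been split into its parts. The sketch below is illustrative only: the function name `isbn_to_doi_name` is hypothetical, and the caller must supply the split because the publisher prefix has variable length. It follows the listed syntax literally, including the dot after the Bookland prefix, and it does not verify the ISBN check digit.

```python
import re


def isbn_to_doi_name(bookland: str, publisher: str, title_and_check: str) -> str:
    """Assemble a DOI name from the parts of a 13-digit ISBN."""
    if bookland not in ("978", "979"):
        raise ValueError("Bookland prefix must be 978 or 979")
    if not re.fullmatch(r"[0-9]{2,8}", publisher):
        raise ValueError("publisher prefix must be 2 to 8 digits")
    if not re.fullmatch(r"[0-9]{2,8}", title_and_check):
        raise ValueError("title enumerator plus check digit must be 2 to 8 digits")
    # 13 digits in total: 3 (Bookland) + publisher + title/check digit.
    if len(publisher) + len(title_and_check) != 10:
        raise ValueError("publisher and title parts must total 10 digits")
    # "10." + Bookland prefix + "." + publisher prefix + "/" + title and check digit
    return f"10.{bookland}.{publisher}/{title_and_check}"


print(isbn_to_doi_name("978", "12345", "99990"))  # prints 10.978.12345/99990
```

The existing identifier is carried over unchanged; only the "10." prefix and the separators are added, so no new numbering scheme is needed.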

Persistence is (primarily) an Organizational Issue
- No technology runs itself
- Organizations need to commit to persistence
- Organizations need the resources to keep their commitments
  - Size helps
  - Business model needed (profits not required, but funds are)
- Organizations need dedication to persistence
  - Conflicts of interest, e.g., if profit is the motive (not the case in any major system of which I am aware) then lack of profit will be a problem
- Regional organizations will have difficulty growing
  - International organization is best, even with the accompanying political and cultural issues

Persistence is an Organizational Issue (but don’t make it harder than it needs to be)
- Do not bake changeable attributes into the string, such that users and developers operate with mistaken assumptions
  - Ownership if ownership can change
  - Organizational names (count the orgs that are > 100 years old and still have the same name)
- Assume resolution will change over time
  - Usage can and will change
  - Computing/networking environments that seem eternal will change
- Indirection is a good thing
  - ID string will exist as a static set of bits in various formats while the computing and usage environments shift
  - Disconnect the string from those things that will change
- New functions will evolve over time – don’t make it difficult to connect the ID to those new functions
  - Best example of the problem – broken URLs
  - An example of a solution – adding functionality to DOIs
    - Going from 1-to-1 to 1-to-Many
    - Adding linked data as a resolution option
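The indirection argument above can be made concrete with a toy resolver. This is a sketch under stated assumptions, not the Handle System or DOI implementation: the `Resolver` class, its method names, the type labels, and the example identifier and URLs are all hypothetical. It shows the two points from the slide: the identifier string stays fixed while what it resolves to changes, and one identifier can resolve to several typed values (1-to-many).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Resolver:
    """Toy in-memory resolver: identifier -> list of (type, value) pairs."""

    records: Dict[str, List[Tuple[str, str]]] = field(default_factory=dict)

    def register(self, pid: str, type_: str, value: str) -> None:
        """Attach one more typed value to an identifier."""
        self.records.setdefault(pid, []).append((type_, value))

    def update(self, pid: str, type_: str, new_value: str) -> None:
        """Replace every value of the given type; the identifier is untouched."""
        self.records[pid] = [
            (t, new_value if t == type_ else v) for t, v in self.records[pid]
        ]

    def resolve(self, pid: str, type_: Optional[str] = None) -> List[Tuple[str, str]]:
        """Return all values, or only those of one type."""
        values = self.records.get(pid, [])
        return [tv for tv in values if type_ is None or tv[0] == type_]


resolver = Resolver()
pid = "20.500.EXAMPLE/k7d2"  # hypothetical identifier
resolver.register(pid, "URL", "https://old.example.org/data.csv")
resolver.register(pid, "METADATA", "https://old.example.org/data.json")

# The data moves; only the record changes, not the string people have cited.
resolver.update(pid, "URL", "https://new.example.org/data.csv")
print(resolver.resolve(pid, "URL"))
```

A new resolution option, such as a linked data description, would be one more `register` call with a new type label, leaving existing references untouched.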

PID Systems: Which One?
- Handle/DOI appears to be the leading candidate
  - DOI is a community using Handle System technology
  - DOI Registration Agencies (RAs) serve specific communities
  - The International DOI Foundation (IDF) guarantees the back-end and the continued viability of each DOI even if the responsible RA goes out of business
- Other Handle communities are looking to establish that same level of trust
  - GWDG, working with EPIC and Max Planck Institutes, is going down this same path
- International Foundation (DONA) has been established in Geneva to oversee the Handle System and related efforts
  - This was formerly the sole responsibility of CNRI
  - Six organizations have joined DONA, will manage the root servers and have the ability to mint prefixes
  - South Africa expected to join shortly

Many Other PID Systems
- URN (Uniform Resource Name)
  - RFC 1737: Functional Requirements for Uniform Resource Names (Sollins & Masinter, 1994)
  - RFC 2141: URN Syntax (Moats, 1997)
  - RFC 2168: Resolution of Uniform Resource Identifiers using the Domain Name System (Daniel & Mealling, 1997)
  - Many updates – new IETF group currently at work
- PURL (Persistent URL)
  - OCLC 1995
  - Close ties to W3C
- ARK (Archival Resource Key)
  - John Kunze, CDL 2001?
- XRI (Extensible Resource Identifier)
  - OASIS 2005?
- Recommend "Persistent identifiers – an overview" by Juha Hakala, 2010

Related RDA Efforts
- PID Interest Group
- Data Type Registries
  - Initial output at level of principles
  - How do I understand the data of others (MIME ++)
  - For PIDs – how do I interpret the return from PID resolution?
  - Follow-on Working Group addressing common type record
- PID Info Types Working Group
  - In Maintenance/Adoption Mode
  - Focused on the types of the type/value pairs of PID resolution
  - Assumes a Data Type Registry
- Data Citation Working Group
  - Fourteen Recommendations on Citation of Evolving Data
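The question of how to interpret the return from PID resolution can be illustrated with a small lookup. The sketch below assumes that resolution returns (type identifier, value) pairs and that a registry maps each type identifier to a definition. The registry contents, the type identifiers under `21.EXAMPLE`, and the function name `interpret` are hypothetical and are not taken from any RDA output.

```python
from datetime import date
from typing import Callable, NamedTuple


class TypeDefinition(NamedTuple):
    name: str                        # human-readable meaning of the value
    parse: Callable[[str], object]   # how to turn the raw string into a value


# Hypothetical registry content; a real registry assigns its own type identifiers.
TYPE_REGISTRY = {
    "21.EXAMPLE/type-url": TypeDefinition("location", str),
    "21.EXAMPLE/type-date": TypeDefinition("creation date", date.fromisoformat),
    "21.EXAMPLE/type-size": TypeDefinition("size in bytes", int),
}


def interpret(record):
    """Split raw (type id, value) pairs into parsed known values and unknown pairs.

    Pairs whose type is not in the registry are returned separately instead
    of being guessed at.
    """
    known, unknown = {}, []
    for type_id, raw in record:
        definition = TYPE_REGISTRY.get(type_id)
        if definition is None:
            unknown.append((type_id, raw))
        else:
            known[definition.name] = definition.parse(raw)
    return known, unknown
```

A client that has never seen a given identifier before can still act on the values, provided it can look up each type in a registry it trusts.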

Progressions
- Pre-Internet
  - Dedicated Systems – SSN, ISBN
  - Self-explanatory (affordance) – bib cite
- The Web hits
  - DNS + file name – terrific for quick start, disaster for long-term info mgmt
  - Permalink, Cool URIs, etc. – just be careful – it’s not enough for the important stuff over time
- Our notion of PID
  - One level of indirection, e.g., 1 DOI = 1 URL: good start, but is that all?
  - One to Many – id resolves to current state data (original Handle goal)
    - Multiple copies, Services such as LoD
  - Typing of id values, typing of ids (DOI profiles did not take, RDA Data Type Registry, now going to ISO)
- Organizational Efforts
  - Research Orgs: CNRI, OCLC
  - Standards Bodies: IETF, W3C, RDA (small s standards), NISO, ISO
  - International Bodies: DONA, RDA again

What is Left?
- Can we get to the same level of common infrastructure for digital object ids as we have for physical addresses (IP) on the Internet?
- What would/should this include?
  - a common reliable resolution architecture (could be multiple, but all well known and obvious, possibly interchangeable)
  - an international governance body that was more dedicated to efficiency and openness than to profits
  - a platform on which multiple applications and businesses could be built – as unconstrained as possible
  - enough accompanying information to allow the named entity to be understood and used
  - an abstraction/indirection layer for digital entities to take us to the next level – the ability to conjure a named entity into existence, given permissions, with no worries about where it is and how it is formed

The DO Cloud
End users, developers, and automated processes deal with persistently identified, consistently structured digital objects which are securely & redundantly managed & accessed via the Internet, which is an overlay on existing or future information storage systems.
[Diagram labels: ID: 123… A; ID: 987/… F; ID: 843… G; Identifier Service; Repository]

“It's tough to make predictions, especially about the future.” ― Yogi Berra

PID Advantages
- Persistent Identity via Indirection
  - Static references into fluid systems over time
    - Data on networks moves
    - Ownership/responsibility change
    - Formats change
- Embedded Ids
  - For data object in hand – current state data
    - Updates
    - New related entities
- Networks of Persistent Links
  - Data / metadata links
  - Provenance chains
  - Inheritance across a broad set of entities

PID Disadvantages
- Extra level of effort / cost on creation
  - Analysis – what to identify / granularity
  - Coordination across organizations
  - Maintain resolution system
- Persistence requires sustained effort
  - Organizational discipline
  - Technology necessary but not sufficient
- Analyze cost/benefit ratio
  - Don’t start unless it’s worthwhile
  - Is your data worth it?

20+ Years Later – Why Are We Still Talking About This?
- Seems easy enough – assign an ID and, if there is a corresponding resolution system, keep it current
  - Assign it to what?
  - Resolve it to what?
  - Any required metadata and if so, where?
  - If it resolves, how can I understand what comes back?
  - What if multiple pieces of data come back?
  - Who is in charge?
  - Who can I trust?
  - How long does it have to keep working?
  - Why will it keep working?
- BUT