Relevance of RDA Outputs in the Humanities

A Use Case “Deep Dive”
Bridget Almas, Tufts University (@BridgetAlmas)
RDA WG/IG Chairs Meeting, NIST, Gaithersburg, MD, 8-9 December 2015

Presentation Structure
- Overview of our use case
- Deep dive into our current infrastructure solution
- Identification of unmet needs
- Consideration of RDA Outputs as solutions
  - PID Types, Data Types Registry + proposed WG Research Data Collections
- Cost/Benefit analysis

Use Case: Perseus, Perseids and SoSOL
- Perseus Digital Library: an open-access online resource for classical texts and language resources. Its flagship collection contains over 68 million words of textual and linguistic data. Perseus is a practical experiment in which we explore the possibilities and challenges of digital collections in a networked world.
- The Perseids Project: developing an online platform for collaborative editing and publication of texts and annotations, integrating and extending existing open-source tools using standard ontologies and data formats. Core objectives include putting together large sets of data that allow for new analysis, and engaging students in the production of knowledge and data in a pedagogical context.
- SoSOL: a version-controlled, transparent and fully audited, multi-author, web-based, real-time editing environment developed to enable interoperability between five different content and data providers in the papyrological community.
Perseids and SoSOL have both been funded by the Andrew W. Mellon Foundation. Perseids has also been funded by the Institute of Museum and Library Services, the National Endowment for the Humanities and Tufts University.

Long Term Vision for the Perseus Digital Library
- Provide the ability for users to annotate and add their own contextual information
- Easily incorporate data as it comes off the OCR pipeline
- Seamlessly incorporate data from other open access platforms
- In return, easily make its own data available to these platforms (both inside and outside of the Perseus ecosystem)
- Archive all data used by and in Perseus for the long term in institutional repositories

Long Term Vision for the Perseus Digital Library
[Figure: the target workflow]

First Step: Perseids Data Publications
- Initial scope was inward-facing
- Assumed all published data would be local to the Perseus Digital Library (PDL)

SoSOL
- Supports working with collections of related data objects as publications, where each individual data object may belong to a different parent collection
- Provides a version-controlled, transparent and fully audited, multi-author, web-based, real-time editing environment
- Developed for a related community in the same domain: the Integrating Digital Papyrology (IDP) project, a multi-institutional project aimed at supporting interoperability between five different digital papyrological resources
Source: Ryan Baumann and Hugh Cayless, Duke University

SoSOL cont’d
- Each project had its own identifier, metadata and data schemas
- The IDP project enabled the following:
  - linking the digital transcriptions of texts with the images, dates, and other metadata
  - a common search interface on the merged and mapped data
  - the entire community of papyrologists to take control of the process of populating this shared data
  - third-party use of the data and tools
- Now maintained by the Duke Collaboratory for Classics Computing
Source: Ryan Baumann and Hugh Cayless, Duke University

SoSOL Source:

SoSOL as Infrastructure: Data Model
Publication
- Facilitates cross-collection work on related data objects under a single shared workflow
- Has the potential to be reused as the basis for a new persistent representation of a specific aggregation of resources
Identifier Type
- Each data object is defined by an “identifier type”

Each identifier type has:
- a format or syntax
- CRUD rules and operations
- 0 to many related identifier types
- an associated data format, which itself may have
  - validation rules
  - CRUD operations
  - specialized editing tools and strategies

Each publication has:
- provenance information
- versioning information
- one or more instantiations of identifier types, each of which has
  - a PID
  - data content
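The data model described above can be sketched in a few lines. This is a minimal illustration, not SoSOL's actual code: the class names, the `accepts` and `add` methods, the regular expression and the sample values are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class IdentifierType:
    """A kind of data object: a PID syntax plus the data format it carries."""
    name: str
    syntax: str                       # regular expression for valid PIDs (assumed)
    data_format: str                  # e.g. "TEI XML"
    related_types: List[str] = field(default_factory=list)  # zero to many

    def accepts(self, pid: str) -> bool:
        # A PID is valid for this type if the whole string matches the syntax.
        return re.fullmatch(self.syntax, pid) is not None


@dataclass
class DataObject:
    """One instantiation of an identifier type: a PID plus data content."""
    pid: str
    identifier_type: IdentifierType
    content: str


@dataclass
class Publication:
    """A versioned aggregation of data objects with provenance notes."""
    title: str
    version: int = 1
    provenance: List[str] = field(default_factory=list)
    objects: List[DataObject] = field(default_factory=list)

    def add(self, obj: DataObject, note: str) -> None:
        if not obj.identifier_type.accepts(obj.pid):
            raise ValueError(f"{obj.pid!r} is not a valid {obj.identifier_type.name} PID")
        self.objects.append(obj)
        self.provenance.append(note)
        self.version += 1  # every change yields a new version


# Hypothetical sample data.
text_type = IdentifierType("text", r"urn:cts:[^:]+:[^:]+", "TEI XML", ["annotation"])
pub = Publication("Example publication")
pub.add(DataObject("urn:cts:greekLit:tlg0012.tlg001", text_type, "<TEI/>"),
        "created by editor A")
```

Note that the identifier type here carries both the PID syntax and the data format, which is the coupling the later slides identify as a scalability problem.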

In the IDP deployment of this infrastructure for Papyri.info, responsibility for PID operations is split:
- a SPARQL endpoint provides query capabilities for existing data object identifiers and their relationships
- rules implemented in the SoSOL application code are used to create new PIDs for new versions of existing data objects, and for completely new objects
- final acceptance of new and changed files triggers a process that automatically ingests new identifiers into the triple store and publishes the data object(s)

Perseids use of SoSOL infrastructure
Provided much of the infrastructure we needed:
- the ability to work with text-centric publications that combined a variety of different data types
- support for stable persistent identifiers
- a versioned, collaborative editing environment
- the ability to extend with data-type specific behavior
- the ability to support different review workflows for different data types

We had our own identifier systems for our texts and data objects:
- CTS: a URN-based identifier structure for identifying texts and canonically cited passages of texts, and an API for retrieving fragments of texts by these URNs
- CITE: a URN/API protocol for identifying and retrieving digital representations of text-related objects, grouped in collections of like objects
- Each of these specifications had schemas for defining metadata about identifiers
- Each of these specifications provided a read-only API for using the identifiers to retrieve metadata and the data objects
- Each also provided support for identifying fragments for a subset of the supported data types (texts and images)
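To make the CTS identifier structure concrete, here is a small parser sketch. It assumes the commonly documented CTS URN layout `urn:cts:<namespace>:<textgroup>.<work>.<version>.<exemplar>:<passage>`, where everything after the text group is optional; the `CtsUrn` type and `parse_cts_urn` function are names invented for this example and are not part of the CTS API.

```python
from typing import List, NamedTuple, Optional


class CtsUrn(NamedTuple):
    namespace: str
    textgroup: str
    work: Optional[str]
    version: Optional[str]
    exemplar: Optional[str]
    passage: Optional[str]


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into its components (layout assumed as described above)."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[:2] != ["urn", "cts"]:
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace, work_part = parts[2], parts[3]
    levels: List[Optional[str]] = list(work_part.split("."))
    if not namespace or not levels[0] or len(levels) > 4:
        raise ValueError(f"malformed CTS URN: {urn!r}")
    # Pad the work hierarchy so missing levels come back as None.
    levels += [None] * (4 - len(levels))
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return CtsUrn(namespace, levels[0], levels[1], levels[2], levels[3], passage)


urn = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.perseus-grc1:1.1-1.10")
print(urn.textgroup, urn.work, urn.version, urn.passage)
```

The passage component is what lets a single URN address a canonically cited fragment of a text rather than the whole work.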

We expected:
- to replace the SPARQL endpoint with CTS and CITE APIs
- to be able to postpone automatic creation of identifiers for new data collections and texts by:
  - using rule-based approaches for automatic construction of new identifiers
  - using manual out-of-band means for creation of top-level collection identifiers for new collections and texts
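A rule-based approach to constructing new identifiers might look like the following sketch. The rule itself (append the next integer version to an object's base identifier) and the URN shape in the sample data are assumptions for illustration only; the slides do not specify the rules Perseids actually used.

```python
from typing import Iterable


def next_version_pid(base: str, existing: Iterable[str]) -> str:
    """Return base + "." + (highest existing version number + 1).

    Hypothetical rule: versions of an object are identified by appending an
    integer suffix to the object's base identifier.
    """
    prefix = base + "."
    versions = [
        int(pid[len(prefix):])
        for pid in existing
        if pid.startswith(prefix) and pid[len(prefix):].isdecimal()
    ]
    return f"{prefix}{max(versions, default=0) + 1}"


# Hypothetical identifiers in an "example" namespace.
known = [
    "urn:cite:example:coll.1.1",
    "urn:cite:example:coll.1.2",
    "urn:cite:example:coll.10.7",
]
print(next_version_pid("urn:cite:example:coll.1", known))
```

A rule like this covers new versions of existing objects; the top-level collection identifier (`coll` here) would still have to be minted by hand, as the slide notes.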

Perseids and SoSOL - Scalability Pain Points
- Tight coupling between PID Type and Data Type presents problems
  - Knowledge about the data structures, metadata and PID systems is implicit and embedded (in the code or in query formulation)
  - Applying a single identifier scheme (CITE URNs) to a wide variety of different data types results in unnecessary redundancy in the code

Perseids and SoSOL - Scalability Pain Points
- Perseids lacks defined APIs and implementations for CRUD of new texts, data and collection objects
- New use cases where CTS doesn’t fully meet our needs as PIDs for texts
  - we want to be able to continue supporting these identifiers without requiring a large rewrite of the code, while also expanding to other schemes
- In CITE, the line between metadata and data is too easily blurred
  - we are using it to define PIDs and metadata for data objects, whereas it was intended to contain the data (except in the case of images…)

Perseids: evolving view of Data Publication
Publication as a vertical collection of data objects, expressed as a possibly ever-expanding graph, where each node in the graph may itself belong to one or more other graphs (horizontal collections).
See also the Linguistics Data Interoperability presentation from RDA P6.
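One way to picture this view is a two-way membership index in which the same data object can sit in a publication (vertical collection) and in any number of horizontal collections. The sketch below is an illustration only; the `CollectionGraph` class and the identifiers in the sample data are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set


class CollectionGraph:
    """Many-to-many membership between collections and data object PIDs."""

    def __init__(self) -> None:
        self._members: DefaultDict[str, Set[str]] = defaultdict(set)      # collection -> PIDs
        self._memberships: DefaultDict[str, Set[str]] = defaultdict(set)  # PID -> collections

    def add(self, collection: str, pid: str) -> None:
        self._members[collection].add(pid)
        self._memberships[pid].add(collection)

    def members(self, collection: str) -> List[str]:
        return sorted(self._members.get(collection, set()))

    def collections_of(self, pid: str) -> List[str]:
        return sorted(self._memberships.get(pid, set()))


graph = CollectionGraph()
# A publication groups a text with an annotation on it (vertical collection).
graph.add("publication:example-study", "text:1")
graph.add("publication:example-study", "annotation:7")
# The same annotation also belongs to a collection of like objects (horizontal).
graph.add("collection:treebanks", "annotation:7")
print(graph.collections_of("annotation:7"))
```

Keeping membership outside the objects themselves means a publication can keep growing without the objects, or their other collections, having to change.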

Perseids: Infrastructure Needs
Persistent identification of text and related data objects
- that is stable throughout the creation, curation, publication and post-publication lifecycle
- that can be leveraged easily across projects and which can allow for a wide variety of persistent identifier schemes without requiring code changes for each one
- that is supported by institutional infrastructure and still allows for domain and project specific PID schemes

Perseids: Infrastructure Needs
- Support for concurrent annotation and curation of diverse data types, from diverse sources, in a common environment
- A means to formalize and express details about the data types we are working with without coupling them tightly to the identifier schemes used to identify data objects that adhere to them
- Support for multiple different models of collections (e.g. both horizontal and vertical) and a well-defined CRUD API for working with them

Analysis of RDA Outputs

PID Types API provides:
- a conceptual model for a PID record
- a CRUD API for interacting with PID records which can abstract the differences between PID implementations
Data Types Registry provides:
- a data model for formally expressing data types
- an API for CRUD and Query operations on a data type registry
- fulfillment of a dependency for the PIT API
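The appeal of these two outputs for Perseids is the separation of concerns they suggest: application code talks to one CRUD interface regardless of PID scheme, and a PID record refers to a registered data type instead of implying it through the identifier syntax. The sketch below illustrates that idea only. It is not the RDA PID Types API or the Data Types Registry API; every name, property key and registry entry in it is an assumption.

```python
from abc import ABC, abstractmethod
from typing import Dict


class PidRecordStore(ABC):
    """CRUD operations a PID backend is assumed to offer, whatever the scheme."""

    @abstractmethod
    def create(self, pid: str, record: Dict[str, str]) -> None: ...

    @abstractmethod
    def read(self, pid: str) -> Dict[str, str]: ...

    @abstractmethod
    def update(self, pid: str, changes: Dict[str, str]) -> None: ...

    @abstractmethod
    def delete(self, pid: str) -> None: ...


class InMemoryStore(PidRecordStore):
    """Toy backend; a real one might wrap Handles, DOIs, or CTS/CITE URNs."""

    def __init__(self) -> None:
        self._records: Dict[str, Dict[str, str]] = {}

    def create(self, pid: str, record: Dict[str, str]) -> None:
        if pid in self._records:
            raise KeyError(f"{pid!r} already exists")
        self._records[pid] = dict(record)

    def read(self, pid: str) -> Dict[str, str]:
        return dict(self._records[pid])  # copy, so callers cannot mutate the store

    def update(self, pid: str, changes: Dict[str, str]) -> None:
        self._records[pid].update(changes)

    def delete(self, pid: str) -> None:
        del self._records[pid]


# Hypothetical data type registry: types are described once, independently of PIDs.
DATA_TYPES: Dict[str, Dict[str, str]] = {
    "treebank-annotation": {"format": "XML", "schema": "treebank.xsd"},
}


def data_type_of(store: PidRecordStore, pid: str) -> Dict[str, str]:
    """Look up the registered data type named in a PID record."""
    return DATA_TYPES[store.read(pid)["dataType"]]


store = InMemoryStore()
store.create("urn:cite:example:coll.1.1", {"dataType": "treebank-annotation"})
print(data_type_of(store, "urn:cite:example:coll.1.1")["format"])
```

Supporting an additional PID scheme would then mean adding another `PidRecordStore` implementation rather than touching the code that works with data types.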

Research Data Collections WG (pending) would provide:
- formalization of conceptual models for collections
- a CRUD API for collection operations

RDA Adoption - Concrete Benefits
- Domain-neutrality
- Improves our data management practices
- Enhances the ability for our data to be reused by others
  - more clearly and thoughtfully expressed
  - documented APIs for interacting with it
- Gives us the ability to scale more rapidly to support new PID types

RDA Implementation - Potential Benefits
- Sustainability
  - if implementations of the APIs are built and maintained by diverse communities and projects
- Greater bi-directional interoperability
  - if other projects with whom we want to share data find the same benefit and implement
- Institutional support and long-term preservation
  - if our libraries and institutional repositories also deploy and implement the solutions

RDA Implementation - Costs/Risks
- Significant amount of code refactoring
  - but at least some of this would need to be done anyway, if we rolled our own solutions
- Overhead of fully understanding the RDA outputs and figuring out how to implement them
  - always more difficult to understand a solution somebody else developed, even if documented
- Overhead in participating in a collaborative effort on the RD Collections definition
  - but hopefully would result in a more robust solution
- Possibility of gaps/disconnects requiring custom extensions of the RDA outputs
  - might not be fully discoverable until implementation
