Sustainable Preservation of Linked Data Vassilis Christophides
Linked Data vs Cultural Artifacts!
Linked datasets are digitally-born objects that are designed to be copied, rely on vocabularies and integrity constraints (understandable by both people and programs), and have data and structures that change over time
Digital Object vs Data Preservation Source: Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report. A Collaborative Initiative of the Library of Congress
Frame Linked Data Preservation as a Sustainable Economic Activity
Economic activity: deliberate allocation of resources
  – Cost of losing datasets
Sustainable: ongoing resource allocation over long periods of time
  – Involved data subjects
Articulate the problem/provide recommendations & guidelines
  – Economic and societal benefits
[Figure labels: Technical, Social, Economic]
Blue Ribbon Task Force on Sustainable Digital Preservation and Access, Final report 2010
Sustainability Conditions
Who benefits from use of the preserved data?
Who selects what data to preserve?
Who owns the data?
Who preserves the data?
Who pays both for data and preservation services?
– Recognition of the benefits of preservation by decision makers
– Selection of datasets with long-term value
– Incentives for decision makers to act in the public interest or to elaborate new business models
– Appropriate governance of preservation activities
– Ongoing and efficient allocation of resources to preservation
– Timely actions to ensure long-term data access and usability
Benefits & Incentives
Clearly articulate benefits of digital preservation activity
  – “Value proposition” for digital preservation
  – Benefits should emphasize outcomes
  – Articulated benefits cultivate a sense of value and “willingness to pay”
Clearly articulate incentives for decision-makers to act
  – Accept responsibility to undertake preservation
  – Identify and leverage institutional “self-interest”: e.g., business opportunity; mission-driven; policy compliance
  – Orchestrate incentives over the complete digital lifecycle
Selection & Allocation of Resources
Selection: can’t “preserve everything for all time”
  – Prioritization: allocate resources where they generate most value
  – Circumscribed set of materials; realistic preservation goals
  – Manage expectations; align expectations and capacity
Support ongoing, efficient allocation of resources
  – Coordinate resource transfer from those who are willing to pay to those who are willing to preserve (pricing, donations, fees/taxes)
  – Efficiency: productive use of resources; leverage economies of scale, economies of scope
Organization & Governance
Preservation activities can be managed through a variety of organizational forms, e.g.:
  – Organization with no private interest in preservation (e.g., third-party service)
  – Organization with private interest in preservation; preserves on behalf of itself and other organizations (e.g., research library)
  – Organization with a mandate to preserve, conferred by public policy, to fulfill a stated public interest (e.g., national archive)
Governance: strategy, responsibility, accountability
Organization/governance trust
The Scientific Data Life Cycle
Source: Data Life Cycle Labs: A New Concept to Support Data-Intensive Science
Cost of Curated Data Production
When we think of the cost of digital preservation, we naturally separate it from the cost of creating that data in the first place
  – Certainly this is true for data we find in traditional libraries
  – Unfortunately, this separation is inappropriate for curated databases
Like traditional reference works, curated databases have been created with an enormous amount of effort and require continuous updating by experts to keep them consistent with the most recent scientific discoveries
Curated data/embedded code                     10^7
Production code                                10^6
Book                                           10^5
Movie                                          10^3
High-energy Physics (Large Hadron Collider)    0.1
Data-as-a-Service (DaaS) Pricing Models
By far the most common case is a fixed price for the entire data set (CustomLists, Infochimps) or a fixed number of transactions per month based on client subscriptions (Azure DataMarket, Infochimps API)
DaaS pricing models are based on tiered data access and fall into three groups
  – Volume-based model: 1) quantity-based pricing and 2) pay per call (a “call” is a single request/response interaction with the API for data)
  – Data type-based model: for example, a mapping API offers the geo-coordinates and zip codes of the neighbourhoods in an urban area, while additional attributes such as school or post office locations are sold for an additional charge
  – Hybrid pricing models combine value with volume charges to create finer-grained pricing that better meets both the buyers’ and sellers’ needs
Existing pricing models mainly favour big customers, who can typically afford to purchase the entire data sets they need; small customers often need only a few data items from them and cannot afford to pay the full price
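The pricing models above can be compared with a minimal sketch. Every price, tier boundary, add-on name, and function name here is a hypothetical value chosen for illustration; none is taken from the providers named on the slide.

```python
from __future__ import annotations

# Hypothetical illustration: all prices, tiers, and names below are
# assumptions, not figures from any real DaaS provider.


def pay_per_call(calls: int, price_per_call: float) -> float:
    """Volume-based, pay per call: each API request/response is billed."""
    return calls * price_per_call


def quantity_based(records: int, tiers: list[tuple[int, float]]) -> float:
    """Volume-based, quantity pricing.

    `tiers` holds (minimum_records, unit_price) pairs in ascending order;
    the unit price of the highest tier reached applies to all records.
    """
    unit_price = tiers[0][1]
    for minimum_records, price in tiers:
        if records >= minimum_records:
            unit_price = price
    return records * unit_price


def data_type_based(base_price: float,
                    addon_prices: dict[str, float],
                    selected: list[str]) -> float:
    """Data type-based: a base attribute set plus separately sold add-ons."""
    return base_price + sum(addon_prices[name] for name in selected)


def hybrid(calls: int, price_per_call: float,
           addon_prices: dict[str, float], selected: list[str]) -> float:
    """Hybrid: a volume charge (per call) plus value charges (add-ons)."""
    return pay_per_call(calls, price_per_call) + sum(
        addon_prices[name] for name in selected
    )


# Assumed example values.
FLAT_DATASET_PRICE = 5000.0               # fixed price for the whole data set
TIERS = [(0, 0.50), (1000, 0.25)]         # (minimum records, price per record)
ADDONS = {"schools": 40.0, "post_offices": 25.0}

if __name__ == "__main__":
    small_customer_calls = 20
    print("flat price:  ", FLAT_DATASET_PRICE)
    print("pay per call:", pay_per_call(small_customer_calls, 0.50))
    print("quantity:    ", quantity_based(2000, TIERS))
    print("data type:   ", data_type_based(100.0, ADDONS, ["schools"]))
    print("hybrid:      ", hybrid(200, 0.50, ADDONS, ["schools"]))
```

Under these assumed numbers, a customer who needs only 20 items pays far less per call than under the flat data set price, which is the gap between big and small customers that the slide points out.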
DataMarket Target Customers
Source: Hjalmar Gislason, DataMarket, Inc., “Emerging DaaS business models: A case study”, European Data Forum (EDF), Dublin, 2013