A Perspective on Preservation of Linked Data
Richard Cyganiak
DERI, NUI Galway
How is Linked Data preservation different?
Easier, because RDF is (sometimes) self-describing
– Representation information and context tend to be explicit and machine-processable
Harder, because it is tied to a particular technology infrastructure
– If the domain name is lost, a dataset can no longer be LD (cf. TimBL's four principles)
– That doesn't mean the data is no longer useful
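A minimal sketch of the "self-describing" point, using Python and rdflib (my illustration, not part of the slides; it assumes the FOAF site still serves RDF at its term URIs): the vocabulary term used in a statement is itself a dereferenceable URI, so its representation information can be fetched mechanically. The flip side is visible too: this only works while the xmlns.com domain keeps resolving.

```python
from rdflib import Graph

# A tiny dataset: the predicate is a URI from the FOAF vocabulary.
data = Graph()
data.parse(data="""
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    <http://example.org/alice> foaf:name "Alice" .
""", format="turtle")

# Dereference the vocabulary term itself to fetch its machine-readable
# definition (representation information). This depends on the FOAF
# domain remaining online and serving RDF at that URI.
vocab = Graph()
vocab.parse("http://xmlns.com/foaf/0.1/name")

print(len(vocab), "triples of representation information retrieved")
```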
Why think about preservation of LD?
1. Can the preservation community teach us how to make data more self-describing?
2. Preservation requires packaging. LD needs better data packaging.
3. Preservation requires versioning. LD needs better versioning.
4. LD datasets do go offline. How can we deal with it?
Preserving the bits is not necessarily the hardest problem!
Access and formats
Multiple methods of publishing/accessing LD:
– Dereferenceable URIs
– SPARQL endpoints
– RDF dumps (triple/quad)
– Embedding into web pages (RDFa, microdata)
Focus on RDF dumps to keep things tractable and to maximise usefulness for non-RDF data (see the sketch below)
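To illustrate why dumps are the tractable target, here is a small sketch with Python/rdflib (mine, not from the slides; the dump URL and file name are hypothetical): a quad dump is one self-contained file, so preserving it is ordinary file archiving, whereas SPARQL endpoints and dereferenceable URIs require a live service to remain available.

```python
from rdflib import ConjunctiveGraph

# Parse an N-Quads dump; quads keep named-graph information, so a single
# file can carry the whole dataset.
dump = ConjunctiveGraph()
dump.parse("https://example.org/dumps/mydataset-2013-04-01.nq", format="nquads")

# Preserving the bits is now an ordinary file-archiving problem.
dump.serialize(destination="mydataset-2013-04-01.nq", format="nquads")

print(len(dump), "triples across", len(list(dump.contexts())), "graphs")
```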
Vocabularies
The meaning of an LD dataset depends on the vocabularies (a.k.a. ontologies) it uses
– The most important representation information
– Vocabularies can change and disappear too
– Need to be preserved alongside the data
Vocabularies would be a good starting point for LD preservation
– Note: LOV (Linked Open Vocabularies) already archives versions of hundreds of vocabularies
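A sketch of how "the vocabularies a dataset uses" can be determined mechanically (Python/rdflib assumed; the input file name is hypothetical): collect the namespaces of all predicates and classes. That namespace list is exactly what would need to be archived, or fetched from an archive such as LOV, alongside the data.

```python
from rdflib import Graph, RDF
from rdflib.namespace import split_uri

g = Graph()
g.parse("mydataset.ttl", format="turtle")

namespaces = set()
for s, p, o in g:
    # Predicates always come from some vocabulary; rdf:type objects are classes.
    terms = [p] + ([o] if p == RDF.type else [])
    for term in terms:
        try:
            ns, _local = split_uri(term)
            namespaces.add(ns)
        except ValueError:
            pass  # URI cannot be split into namespace + local name

for ns in sorted(namespaces):
    print(ns)
```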
Versioning
How to package individual versions of a dataset in an explicit, machine-readable way?
There is no strong notion of versioning in the RDF community.
– Books have editions. Software products have releases. This is important for data too.
– What version of Dataset X are you using?
“Dependencies” between datasets and vocabularies, incl. versions?
See also: Memento
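One possible convention, sketched with Python/rdflib and Dublin Core terms (the URIs are hypothetical, and this is an illustration rather than a standard the slides endorse): give each version of a dataset its own URI, link it to the abstract dataset, date it, and state dependencies on specific vocabulary versions explicitly, so that "what version of Dataset X are you using?" has a machine-readable answer.

```python
from rdflib import Graph, URIRef, Literal, Namespace
from rdflib.namespace import DCTERMS, XSD

EX = Namespace("http://example.org/")
g = Graph()
g.bind("dcterms", DCTERMS)

dataset = EX["mydataset"]
version = EX["mydataset/2013-04-01"]

# Link the concrete version to the abstract dataset and date it.
g.add((version, DCTERMS.isVersionOf, dataset))
g.add((version, DCTERMS.issued, Literal("2013-04-01", datatype=XSD.date)))

# A "dependency" on a specific vocabulary version, stated the same way.
g.add((version, DCTERMS.requires, URIRef("http://example.org/vocab/1.2")))

print(g.serialize(format="turtle"))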
Cataloging and packaging
How can the various parts of a dataset and its surrounding information be packaged and held together in an explicit, machine-readable way?
What metadata needs to be recorded about these packages to preserve context and make them findable?
Potential benefit: tooling for setting up a local copy of a published/archived dataset, including all its dependencies
See also: OKFN's data packages
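A rough sketch of what such a package descriptor could look like, loosely modelled on OKFN's data packages (Python assumed; the field names, file names, and dependency list are illustrative and may not match the data package specification exactly):

```python
import json

# A descriptor that holds the dump, its VoID description, and its vocabulary
# dependencies together, so a local copy can be set up from the archive alone.
descriptor = {
    "name": "mydataset",
    "version": "2013-04-01",
    "resources": [
        {"name": "data", "path": "mydataset-2013-04-01.nq", "format": "nquads"},
        {"name": "void", "path": "void.ttl", "format": "turtle"},
    ],
    "dependencies": [
        "http://xmlns.com/foaf/0.1/",
        "http://purl.org/dc/terms/",
    ],
}

with open("datapackage.json", "w") as f:
    json.dump(descriptor, f, indent=2)
```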
Existing relevant (?) standards
– VoID – metadata vocabulary for describing RDF datasets
– DCAT – upcoming W3C standard for data catalogs
– PROV – W3C standard for provenance
– DDI Discovery Vocabulary – used by data archives to document statistical microdata, survey data, etc.
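A minimal sketch of how these standards fit together in one dataset description (Python/rdflib; all URIs are hypothetical, and pointing the DCAT distribution straight at the dump file is a modelling simplification): VoID describes the RDF dataset and its vocabularies, DCAT catalogs it and points at a downloadable distribution, and PROV records where it came from.

```python
from rdflib import Graph, URIRef, Literal, Namespace
from rdflib.namespace import RDF, DCTERMS

VOID = Namespace("http://rdfs.org/ns/void#")
DCAT = Namespace("http://www.w3.org/ns/dcat#")
PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
ds = URIRef("http://example.org/mydataset")
dump = URIRef("http://example.org/dumps/mydataset-2013-04-01.nq")

g.add((ds, RDF.type, VOID.Dataset))
g.add((ds, RDF.type, DCAT.Dataset))
g.add((ds, DCTERMS.title, Literal("My dataset")))

# VoID: which vocabularies are used, and where the bulk dump lives.
g.add((ds, VOID.vocabulary, URIRef("http://xmlns.com/foaf/0.1/")))
g.add((ds, VOID.dataDump, dump))

# DCAT: catalog-level pointer to a distribution; PROV: where the data came from.
g.add((ds, DCAT.distribution, dump))
g.add((ds, PROV.wasDerivedFrom, URIRef("http://example.org/sourcedata")))

print(g.serialize(format="turtle"))
```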
Summary
1. The most important repository for LD preservation will be one that versions vocabularies
2. Focus on bulk RDF (dumps, not SPARQL endpoints or deref URI crawling)
3. Work towards good practices for making data self-describing and for metadata?
4. Work towards standards and good practices for packaging, versioning, dependencies?
5. Use existing standards: VoID, DCAT, PROV, Disco
6. Preservation across time…
7. But also preservation across space and communities