Provenance of scientific information as experienced in DRIVER
6th e-Infrastructure Concertation Event, Lyon, 24th November 2008
Wolfram Horstmann, Bielefeld University / DRIVER
Notions of Provenance
Where do data objects* originate from?
– Scientific work, for example:
  Instrumentation techniques
    – Manufacturers of hard- and software
  Methodologies
    – Processes, e.g. gene sequencing
– Technical / local, for example:
  (Web) identifiers
  Database, repository name
* Primary data, documents, metadata …
Why Provenance?
Quoting / citing / referencing as a global scientific principle
– „Reproducible research“
Giving credit to authors / creators in distributed environments
Original location / context has to be known
Experienced in Grid environments [1]
Provenance & Interoperability
Re-use / sharing: “Addressing/Accessing”
– Common view, common use
– Unidirectional: no change of data objects!
Federation: “Discovering in Context”
– Remote representation of distributed DOs
Aggregation: “Contextualizing”
– Add an unchanged object to a context
Processing / annotation: “Changing”
– Uni- vs. bidirectional: change of DOs and remote representation vs. back-storage (e.g. CVS)
Scenarios in DRIVER
[Diagram: Digital Scientific Data; Digital Object Collections; Digital Object Repositories; Digital Information Space; Conventional Web Data; „Simple“ Applications; Metadata Infrastructure]
Basic Provenance Settings
Indicate production situation
– Metadata: author, instrumentation etc.
Remote representation
– Indicate place of origin in remote systems
Metadata as digital objects / first order citizens
– Allow lineage representation
Credits in remote environments / versioning
Orders of Provenance
1st order: metadata
– Provenance attached to data
– Minimal „knowledge“ required in application
– Allows remote handling of data objects
– Requires metadata infrastructure
– Metadata introduce two objects (data and metadata), which requires linkage
2nd order: context / compounds
– Expresses multiple relations between objects
– May introduce a semantic model
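The two orders can be sketched in a few lines of Python. This is only an illustration, not DRIVER's data model: the class names, field names, the "hasPart" relation and all example.org URIs are assumptions made for this sketch. The metadata record is a separate object that links back to the data object (1st order), while a context holds relations between several objects (2nd order).

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class MetadataRecord:
    """1st order: provenance attached to a data object via metadata."""
    identifier: str   # identifier of the metadata record itself
    describes: str    # linkage back to the data object it describes
    origin: str       # name of the repository of origin
    author: str


@dataclass
class Context:
    """2nd order: multiple relations between objects."""
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def relate(self, subject: str, predicate: str, obj: str) -> None:
        self.relations.append((subject, predicate, obj))

    def related(self, subject: str, predicate: str) -> List[str]:
        return [o for s, p, o in self.relations
                if s == subject and p == predicate]


# Hypothetical identifiers, invented for this sketch.
DATA = "http://repo.example.org/data/1"
ARTICLE = "http://repo.example.org/article/7"

record = MetadataRecord(
    identifier="http://repo.example.org/meta/1",
    describes=DATA,
    origin="Example Repository",
    author="A. Author",
)

context = Context()
context.relate(ARTICLE, "hasPart", DATA)
context.relate(ARTICLE, "hasPart", "http://repo.example.org/data/2")
```

Note that the 1st order record needs only the `describes` link to be useful, whereas the 2nd order context needs some agreement on what relation names such as "hasPart" mean.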
Provenance in DRIVER #1
Simple objects: OAI-PMH [2]
– 1st order provenance
  Metadata: minimum OAI-DC
– 2nd order provenance
  DRIVER: explicit identifiers for repositories
  OAI-PMH: inline representation („about“)
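The inline „about“ representation can be illustrated with a short Python sketch that reads the origin of a harvested record. It assumes the record carries a provenance container as described in the OAI-PMH provenance guidelines (an `originDescription` with `baseURL`, `identifier` and a `harvestDate` attribute). The sample record, the example.org addresses and the function name `origin_of` are invented for illustration and do not come from DRIVER.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
PROV_NS = "http://www.openarchives.org/OAI/2.0/provenance"

# Hypothetical record as re-exposed by an aggregator (all values invented).
SAMPLE_RECORD = """\
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:aggregator.example.org:1234</identifier>
    <datestamp>2008-11-20</datestamp>
  </header>
  <metadata/>
  <about>
    <provenance xmlns="http://www.openarchives.org/OAI/2.0/provenance">
      <originDescription harvestDate="2008-11-19T10:00:00Z" altered="false">
        <baseURL>http://repository.example.org/oai</baseURL>
        <identifier>oai:repository.example.org:1234</identifier>
        <datestamp>2008-10-01</datestamp>
        <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
      </originDescription>
    </provenance>
  </about>
</record>
"""


def origin_of(record_xml):
    """Return the place of origin stated in a record's about container, or None."""
    ns = {"oai": OAI_NS, "prov": PROV_NS}
    root = ET.fromstring(record_xml)
    desc = root.find("oai:about/prov:provenance/prov:originDescription", ns)
    if desc is None:
        return None  # the record carries no provenance statement
    return {
        "base_url": desc.findtext("prov:baseURL", namespaces=ns),
        "identifier": desc.findtext("prov:identifier", namespaces=ns),
        "harvest_date": desc.get("harvestDate"),
        "altered": desc.get("altered") == "true",
    }
```

Calling `origin_of(SAMPLE_RECORD)` yields the base URL and identifier of the originating repository, which is the information a federator or aggregator needs in order to link back.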
[Diagram: Semantic/Compound Data; „Semantic“ Applications]
Provenance in DRIVER #2
„Enhanced Publications“
– Research project in DRIVER-II
– Representation of data / document packages
– Use of OAI-ORE
Provenance in OAI-ORE
OAI-ORE: Object Reuse and Exchange [4]
– Uses Resource Maps (based on Named Graphs)
– Uses „lineage“ to represent explicit provenance
– Future: explicit provenance model [7]?
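How „lineage“ supports trace-back can be sketched with plain Python tuples standing in for RDF triples. The sketch assumes that ore:lineage links a proxy in one aggregation to the proxy through which the resource was discovered. The example.org URIs and the helper `lineage_chain` are hypothetical, and a real application would more likely use an RDF library than hand-written tuples.

```python
ORE = "http://www.openarchives.org/ore/terms/"

# Hypothetical (subject, predicate, object) triples; all URIs are invented.
TRIPLES = [
    ("http://repo.example.org/proxy/1", ORE + "proxyFor", "http://repo.example.org/data/1"),
    ("http://agg-a.example.org/proxy/1", ORE + "proxyFor", "http://repo.example.org/data/1"),
    ("http://agg-a.example.org/proxy/1", ORE + "lineage", "http://repo.example.org/proxy/1"),
    ("http://agg-b.example.org/proxy/1", ORE + "proxyFor", "http://repo.example.org/data/1"),
    ("http://agg-b.example.org/proxy/1", ORE + "lineage", "http://agg-a.example.org/proxy/1"),
]


def lineage_chain(triples, proxy):
    """Follow lineage links from a proxy back towards the place of origin."""
    chain = [proxy]
    while True:
        targets = sorted(o for s, p, o in triples
                         if s == chain[-1] and p == ORE + "lineage")
        # Stop at the origin (no further lineage) or when a cycle is detected.
        if not targets or targets[0] in chain:
            return chain
        chain.append(targets[0])
```

Starting from the proxy in the second aggregator, the chain leads through the first aggregator to the repository of origin. If a proxy had several lineage targets, this sketch would simply follow the first one in sorted order.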
Summary
Provenance is essential for …
– Indicating origin in distributed data spaces
  Accessing / addressing
  Federation / aggregation
  Processing / annotation
– Document and data citation / trace-back
– 1st order: describing data > metadata
– 2nd order: describing context > semantic data
Lessons learnt in DRIVER
Use web-enabled identification (URI/UDDI etc.)
– „Dark“ databases don't interoperate
1st order provenance at place of origin
– Requires metadata to describe origin
– Enables a metadata infrastructure
– Introduces linkage problem
2nd order provenance in contexts
– Requires data provider identification in federators / aggregators in order to link back
– May require semantic model for context
– Would benefit from a semantic infrastructure
Resources
[1] On provenance in the eScience / grid environment –
    – In GLITE
[2] On provenance in OAI-PMH –
[3] On provenance in OAI-ORE (referred to as ore:lineage) – (general) – (definition)
[4] Named Graphs, Provenance and Trust (Carroll et al.) –
[5] W3C: On provenance in RDF –
[6] Open Provenance Model –
[7] DRIVER: Digital Repository Infrastructure Vision for European Research –