Linking Multiple Workflow Provenance Traces for Interoperable Collaborative Science
Paolo Missier (1), Bertram Ludäscher (2), Shawn Bowers (3), Saumen Dey (2), Anandarup Sarkar (3), Biva Shrestha (4), Ilkay Altintas (5), Manish Kumar Anand (5), Carole Goble (1)
(1) School of Computer Science, University of Manchester
(2) Dept. of Computer Science, University of California, Davis
(3) Dept. of Computer Science, Gonzaga University
(4) Dept. of Computer Science, Appalachian State University
(5) San Diego Supercomputer Center, University of California, San Diego
WORKS’10, New Orleans
Linking Provenance Traces … P Missier, B Ludäscher et al. WORKS’10
Context: Data Sharing
Implicit collaboration through data sharing:
- Alice uses the nth-generation input dataset x and produces the (n+1)st-generation output dataset z
- … as part of run R_A of workflow W_A
- … output z is published in some data space
- Bob uses Alice’s output z and produces the (n+2)nd-generation dataset v
- … using workflow W_B, possibly with pre-processing f
- Alice and Bob may not know each other
Motivation: Virtual Joint Experiments
- How do we ensure that Charlie gets a complete account of the history of W_C’s outputs?
- How do we ensure that Alice gets her due (partial) credit when Charlie uses Bob’s data v?
- Traces T_A and T_B will be critical: we need to compose them to obtain T_C
- We can view the composition W_C as a new, virtual workflow
Provenance Composition: the Data Tree of Life (DToL)
We can formulate our questions in terms of the provenance of the datasets produced by the virtual workflow W_C:
- What is the complete provenance of v?
Answering the question requires tracing v’s derivation all the way back to x. To achieve this, we need to ensure that:
- T_A and T_B are properly connected
- Provenance queries run seamlessly over and across T_A and T_B
Test scenario: 1st Provenance Challenge Workflow
DataONE Summer-of-Code Project:
- Split the First Provenance Challenge workflow at various points
- Publish Part-I from system X, use it as input for Part-II on system Y
- X, Y in { Kepler/SDF, Kepler/COMAD, Taverna }
Common Model of Provenance (approx. OPM)
Data provenance for a single workflow run is well understood.
Workflow spec: digraph W = (V_W, E_W)
- V_W = A ∪ C: actors A (processors), channels C (FIFO data buffers)
- E_W = E_in ∪ E_out: in-edges E_in ⊆ C × A, out-edges E_out ⊆ A × C
Trace graph: acyclic digraph T = (V_T, E_T)
- V_T = I ∪ D: invocations I, data D
- E_T = E_read ∪ E_write: read edges E_read ⊆ D × I, write edges E_write ⊆ I × D
T_A is a trace instance of W_A: there is a homomorphism h: T_A ➔ W_A, e.g.
h(x_1 ➔ a_1) = h(x_2 ➔ a_2) = X ➔ A, h(a_1 ➔ y_1) = h(a_2 ➔ y_2) = A ➔ Y, ...
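The definitions above can be sketched in a few lines of Python. This is only an illustration, not the prototype's code: the class names `Workflow` and `Trace`, the function `is_homomorphism`, and the node names are all assumptions made for the example. The check confirms that h sends invocations to actors, data to channels, and every trace edge to a workflow edge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    actors: frozenset    # A
    channels: frozenset  # C
    edges: frozenset     # channel->actor in-edges and actor->channel out-edges


@dataclass(frozen=True)
class Trace:
    invocations: frozenset  # I
    data: frozenset         # D
    edges: frozenset        # data->invocation reads and invocation->data writes


def is_homomorphism(h, trace, workflow):
    """Return True if h maps trace nodes and edges onto the workflow graph."""
    if any(h.get(i) not in workflow.actors for i in trace.invocations):
        return False
    if any(h.get(d) not in workflow.channels for d in trace.data):
        return False
    return all((h.get(u), h.get(v)) in workflow.edges for (u, v) in trace.edges)


# Hypothetical workflow W_A: X -> A -> Y
W_A = Workflow(
    actors=frozenset({"A"}),
    channels=frozenset({"X", "Y"}),
    edges=frozenset({("X", "A"), ("A", "Y")}),
)

# Hypothetical trace T_A: two invocations of A
T_A = Trace(
    invocations=frozenset({"a1", "a2"}),
    data=frozenset({"x1", "x2", "y1", "y2"}),
    edges=frozenset({("x1", "a1"), ("a1", "y1"), ("x2", "a2"), ("a2", "y2")}),
)

h = {"x1": "X", "x2": "X", "a1": "A", "a2": "A", "y1": "Y", "y2": "Y"}
print(is_homomorphism(h, T_A, W_A))
```

Mapping x1 to Y instead of X would send the read edge x1 ➔ a1 to Y ➔ A, which is not an edge of W_A, so the check would fail.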
Data and Invocation Dependencies (ddep, idep)
- read and write are the natural observables for a workflow run
- possible additional relations (recorded explicitly or inferred from read and write):
  - invocation dependencies (idep): “a_2 depends on a_1” because a_1 has written data d and a_2 has read d
  - data dependencies (ddep): “d_2 depends on d_1” because some actor invocation a read d_1 prior to writing d_2
(Note: in some models of computation the rules above are not correct)
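As a rough sketch of how the two relations could be inferred from the observables, assume reads are stored as (data, invocation) pairs and writes as (invocation, data) pairs. The function names `derive_idep` and `derive_ddep` are made up for this example, and the sketch ignores event ordering, so it inherits the caveat in the note above.

```python
def derive_idep(reads, writes):
    """idep(a2, a1): a1 wrote some data item d that a2 read."""
    return {
        (a2, a1)
        for (a1, d_written) in writes
        for (d_read, a2) in reads
        if d_written == d_read
    }


def derive_ddep(reads, writes):
    """ddep(d2, d1): some invocation read d1 and wrote d2 (ordering ignored)."""
    return {
        (d2, d1)
        for (d1, reader) in reads
        for (writer, d2) in writes
        if reader == writer
    }


# Hypothetical run: x -> a1 -> d -> a2 -> y
reads = {("x", "a1"), ("d", "a2")}
writes = {("a1", "d"), ("a2", "y")}

idep = derive_idep(reads, writes)  # {("a2", "a1")}
ddep = derive_ddep(reads, writes)  # {("d", "x"), ("y", "d")}
```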
Provenance queries
Local (“non-closure”) queries on a trace T:
(1) Find the data and traces published by Alice / Bob
(2) Find the inputs, outputs, and intermediate data products of T
(3) Find (selected) actors and channels used in T
(4) Find the inputs and outputs of an invocation a_i in T
These are easy and not very interesting, e.g. the answer to (3) is just the set of nodes in h(T).
Closure queries operate on the transitive closure ddep* of ddep:
- suppose ddep* spans multiple traces T_A, T_B
- we must then define the standard closure query so that it operates on the composition of T_A and T_B
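A minimal sketch of a closure query, assuming ddep is held as a set of (dependent, source) pairs. The function name `lineage` and the identifier `z_copy` (Bob's local name for Alice's z) are hypothetical. The last call shows why a plain union of two traces is not enough: the closure stops at the boundary between them.

```python
from collections import defaultdict


def lineage(d, ddep):
    """Return every data item that d transitively depends on (ddep* minus d)."""
    parents = defaultdict(set)
    for dependent, source in ddep:
        parents[dependent].add(source)
    seen, stack = {d}, [d]
    while stack:
        for src in parents[stack.pop()]:
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen - {d}


# Alice's trace: z depends on y, y depends on x
ddep_A = {("z", "y"), ("y", "x")}
# Bob's trace: v depends on his local copy of z
ddep_B = {("v", "z_copy")}

within_alice = lineage("z", ddep_A)           # {"y", "x"}
across_both = lineage("v", ddep_A | ddep_B)   # {"z_copy"}: never reaches x
```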
Issues in Provenance Composition
Closure queries now must span multiple provenance traces, despite heterogeneity of both workflow and provenance models.
Main problems and approaches:
I - Trace disconnect:
- problem: traces that should “join” on the shared data are really disconnected
- approach: make the data sharing process itself provenance-aware
II - Model heterogeneity:
- problem: different workflow and provenance models
- approach: common provenance model with local ➔ global mapping
III - Data identifier mismatch:
- problem: different workflows adopt different data identification schemes
- approach: assert data equivalence as part of provenance
Part I – Provenance Stitching
The missing link: make every data copy step provenance-aware
- r: data reference in store S
- trace-equivalence of data items d in S, d’ in S’: d ≃ d’ if d’ is obtained by copying d from S to S’
Part II - Mapping to a Common Provenance Model
Mapping rules (= code, queries) are defined from the Kepler and Taverna provenance models to the common model (details omitted).
In the result T_P, each reference r found in T_S is replaced with ρ(r).
- OPM used as intermediate target model
- … doesn’t “nail” everything: a mixed blessing …
- … but teamwork made it work!
Part III – Data Identifier Reconciliation
We have seen that the copy operation r’ = copy(r, S, S’) on a shared data store S generates a data equivalence assertion.
It also keeps track of ID mappings, which are added to a renaming map from a set of S-specific references to a set of public references.
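One way a provenance-aware copy could behave is sketched below. Everything here is an assumption made for illustration: the `Store` and `CopyLog` classes, the way new local references are generated, and the `pub:` prefix for public references are not taken from the prototype. The sketch only shows the two side effects described above: an equivalence assertion and an entry in the renaming map.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    name: str
    items: dict = field(default_factory=dict)  # local reference -> value


class CopyLog:
    """Records what each copy asserts (hypothetical helper)."""

    def __init__(self):
        self.equivalences = []  # pairs ((store, ref), (store, ref)) with d ≃ d'
        self.renaming = {}      # (store, local ref) -> public ref

    def copy(self, r, src, dst):
        """r' = copy(r, S, S'): copy the item, assert equivalence, map IDs."""
        r_new = f"{dst.name}:{len(dst.items)}"  # assumed local ID scheme of S'
        dst.items[r_new] = src.items[r]
        self.equivalences.append(((src.name, r), (dst.name, r_new)))
        # Both local references resolve to the same public reference.
        public = self.renaming.setdefault((src.name, r), f"pub:{src.name}/{r}")
        self.renaming[(dst.name, r_new)] = public
        return r_new


alice_store = Store("alice", {"z": "dataset z"})
bob_store = Store("bob")
log = CopyLog()
r_prime = log.copy("z", alice_store, bob_store)
```

After the call, `log.renaming` maps both `("alice", "z")` and `("bob", r_prime)` to the same public reference, so traces recorded on either side can be rewritten to a shared identifier.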
Extended (across-runs) Provenance Queries
Closure queries are redefined on the extended provenance trace, which includes the trace-equivalences d ≃ d’.
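A possible reading of the extended query is sketched here, under the assumption that an equivalence d ≃ d’ can be followed in either direction while computing the closure. The function name `extended_lineage` and the identifiers are hypothetical, and the slides' own definition may differ in detail.

```python
from collections import defaultdict


def extended_lineage(d, ddep, equiv):
    """Closure over ddep plus trace-equivalences, followed in both directions."""
    step = defaultdict(set)
    for dependent, source in ddep:
        step[dependent].add(source)
    for left, right in equiv:
        step[left].add(right)
        step[right].add(left)
    seen, stack = {d}, [d]
    while stack:
        for nxt in step[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen - {d}


ddep_A = {("z", "x")}       # Alice: z derived from x
ddep_B = {("v", "z_copy")}  # Bob: v derived from his copy of z
equiv = {("z", "z_copy")}   # asserted by the provenance-aware copy

full = extended_lineage("v", ddep_A | ddep_B, equiv)
# full == {"z_copy", "z", "x"}: v is now traced all the way back to x
```

Without the equivalence the same query returns only `{"z_copy"}`, which is the trace disconnect described earlier.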
Prototype Architecture
Conclusions 1/2
In theory, provenance interoperability should be solved/easy using e.g. OPM.
In practice it isn’t (cf. Provenance Challenge workshops), e.g.
- different mappings to OPM
- different identifier schemes
- traces broken “at the seams”
The Summer-of-Code DToL prototype demonstrates the feasibility of provenance-aware collaboration / workflow interoperation through data
- Extends the potential of provenance analysis beyond isolated workflow-based experiments
Findings relevant for data preservation
- Tracing data access is key
Conclusions 2/2
DataONE:
- Data Tree-of-Life (DToL Summer Project)
- Runtime workflow systems interoperability can be very hard
- … and benefits not clear (unless “layered” approach with different roles of workflow systems)
- Workflow provenance interoperability to the rescue!
Next Steps:
- DataONE Working Group on Provenance for Scientific Workflows
- Develop DOPM (DataONE Provenance Model; OPM++)