Published by Jasmin Young. Modified over 9 years ago.
1
Automatic Generation of Workflow Execution Provenance
Roger S. Barga
Database Group, Microsoft Research (MSR)
2
My interest in scientific workflow and provenance…
In a previous life…
  Research Scientist, PNNL, DOE National Laboratory
  Machine learning, pattern recognition over large data sets
  Scientific experiment management system (EMSL)
  Electronic laboratory notebook for experiment capture
More recently…
  Database Group, Microsoft Research in Redmond, WA
  ImmortalDB (ICDE’06, SIGMOD’06), Event Processing, Phoenix
  Extend commercial software to support scientific research
  Tailor software for the sciences, provide free of charge
  Serve as a positive force in the community (Tony Hey)
Practical value, challenging information management research issues…
3
Objectives for this initial effort
Provenance capture that is automatic & transparent
  Should persist provenance data for a fixed period of time
Support multiple levels of representation
  WF description, logical log (o & p), deviations, step-by-step trace
Version and lock the executables
Efficient representation and management
  Opportunity to significantly reduce execution provenance storage costs
An enactment engine for scientific workflows that documents all steps linking original inputs with final results, so an experiment (execution) can be verified, reproduced or rerun
4
Issues NOT considered in our initial effort
Annotations and provenance of the workflow
How to include external provenance
Evaluate our prototype on actual scientific workflows
Provide query and analysis support over execution provenance traces…
Focus on mechanism, implement something simple but useful, consider how to manage this virtual data product
  Provenance capture that is automatic & transparent
  Support multiple levels of representation
  Version and lock the executables
  Efficient representation and management
5
Types of Provenance to Capture in Workflow Execution
Experiment Design
  Serialize the workflow schedule (XOML)
Invocation Record
  Invocation of specific activities, events and rules
  Deviations from the defined schedule (shims, etc.)
Interaction Provenance
  Input variables, runtime parameters, activation inputs
  External services invoked, return value(s), etc.
Job Provenance
  Start/complete time, etc.
[Diagram labels: a workflow schedule (sequential, event, rule driven); an activity. What about internal state?]
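The four kinds of provenance listed above could be modeled as tagged records. The following is a minimal sketch in Python rather than the prototype's actual representation; `ProvenanceKind`, `ProvenanceRecord`, `records_of`, and the sample values are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict


class ProvenanceKind(Enum):
    EXPERIMENT_DESIGN = "experiment_design"  # serialized workflow schedule
    INVOCATION = "invocation"                # activity, event, or rule invoked
    DEVIATION = "deviation"                  # departure from the schedule, e.g. a shim
    INTERACTION = "interaction"              # inputs, parameters, external service calls
    JOB = "job"                              # start/complete times


@dataclass(frozen=True)
class ProvenanceRecord:
    kind: ProvenanceKind
    activity: str  # hypothetical activity name
    payload: Dict[str, Any] = field(default_factory=dict)


def records_of(records, kind):
    """Return the records of one provenance kind, preserving their order."""
    return [r for r in records if r.kind is kind]


# Illustrative trace; the values are made up.
trace = [
    ProvenanceRecord(ProvenanceKind.EXPERIMENT_DESIGN, "WF1", {"xoml": "<SequentialWorkflow/>"}),
    ProvenanceRecord(ProvenanceKind.INVOCATION, "Invoke1"),
    ProvenanceRecord(ProvenanceKind.INTERACTION, "Invoke1", {"inputs": {"threshold": 0.5}}),
    ProvenanceRecord(ProvenanceKind.JOB, "Invoke1", {"start": 0.0, "complete": 1.5}),
]
```

Deviations are given their own tag here so they can be filtered separately from ordinary invocations; the slide groups both under the invocation record.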
6
Architecture Overview
[Diagram components, in the order extracted]
Query and Management Interface (QMI)
Provenance Storage Service Interface (PSI)
Workflow Execution Provenance Storage Service (built using CLFS)
Logical Logging Utility
Problem Solving Environment
Workflow Enactment Engine (WinWF)
Client: Query Library, Management Routines
Provenance Services: trace execution, difference analysis, reload runtime state, …
HPC Job Scheduler: CreateJOB(XOML), ExecuteTask(JID, Act)
7
Implementation: extending base activity classes
Activities are the basic building blocks
  They are the unit of execution, re-use and composition
  The root of the entire workflow is itself an activity
Composite activities contain other activities
  E.g.: Sequence, Parallel, Synchronize, Exclusive Choice, Merge, …
Basic activities are steps within a workflow
Activities are simply classes
  Properties and events are introduced to intercept and pass control to the provenance capture service at runtime…
  Each class defines provenance persistence methods that are invoked by the workflow runtime
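The idea of putting provenance capture into the base activity class, so that concrete activities need no logging code of their own, might look roughly like this. It is a Python sketch of the pattern only, not Windows Workflow Foundation code; `ProvenanceService`, `Activity`, `Sequence`, and the `"started"`/`"completed"` event names are assumptions.

```python
class ProvenanceService:
    """Hypothetical in-memory stand-in for the provenance capture service."""

    def __init__(self):
        self.records = []

    def capture(self, activity_name, event):
        self.records.append((activity_name, event))


class Activity:
    """Base class: run() wraps execute() so every activity is intercepted."""

    def __init__(self, name):
        self.name = name

    def run(self, provenance):
        provenance.capture(self.name, "started")
        self.execute(provenance)
        provenance.capture(self.name, "completed")

    def execute(self, provenance):
        """Overridden by concrete activities; the default does nothing."""


class Sequence(Activity):
    """Composite activity: runs its child activities in order."""

    def __init__(self, name, children):
        super().__init__(name)
        self.children = list(children)

    def execute(self, provenance):
        for child in self.children:
            child.run(provenance)


service = ProvenanceService()
# The root of the workflow is itself an activity.
workflow = Sequence("WF1", [Activity("Step1"), Activity("Step2")])
workflow.run(service)
```

Because the interception lives in `run()`, a new activity type only overrides `execute()` and is traced automatically.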
8
Workflow Execution
My Experiment: rt.StartWorkflow(typeof(WF1));
1 App calls StartWorkflow(…)
2 Instance Manager: loads workflow type, creates instance, enqueues WF1 with Scheduler
3 Scheduler dequeues WF1, serializes XOML, and calls Executor (SequentialWorkflow base), which enqueues the Sequence activity
4 Scheduler dequeues Sequence and calls Executor, which serializes ActRec and enqueues OnEvent1
5 Scheduler dequeues OnEvent1, serializes ActRec, and calls Executor, which subscribes to the event
6 Instance Manager calls Flush() on WF1 (Activity base class) to flush provenance records and gets back a stream
7 Instance Manager calls the Provenance service, passing the serialized stream; the Provenance Storage service saves it to disk
[Diagram labels: MyWF.dll; Instance Manager (create instance, execute until idle, persist provenance to disk); Scheduler; WF1 instance (Sequence, OnEvent1, Invoke1); Base Activity Library; Runtime Engine; Runtime Services]
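The numbered steps above amount to a queue-driven loop that buffers one provenance record per dequeue and flushes the buffer once the instance goes idle. A simplified Python sketch under that reading follows; `run_workflow`, `save`, and the JSON stream format are hypothetical and do not reflect the real runtime's interfaces.

```python
from collections import deque
import json


def run_workflow(name, activities):
    """Dequeue activities until idle, buffering a record for each, then flush."""
    queue = deque([name])        # step 2: the instance is enqueued with the scheduler
    pending = deque(activities)  # activities the executor will enqueue in turn
    buffered = []                # provenance records held by the instance
    while queue:                 # steps 3-5: execute until idle
        current = queue.popleft()
        buffered.append({"activity": current, "event": "dequeued"})
        if pending:
            queue.append(pending.popleft())
    # Step 6: flush the buffered records into one serialized stream.
    return json.dumps(buffered)


def save(storage, stream):
    """Step 7: hand the stream to a (hypothetical) storage service."""
    storage.append(stream)


storage = []  # stands in for the on-disk provenance store
save(storage, run_workflow("WF1", ["Sequence", "OnEvent1"]))
```

Buffering in the instance and writing once at the idle point keeps storage writes off the per-activity path, which matches the Flush() step in the slide.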
9
Transparent Interception and Logical Logging
[Diagram labels: SEQUENCE activity; Workflow Activity 1 ... Workflow Activity N]
Each activity creates an operation history: a time-serial stream of provenance records.
Each record represents a change in operational state, such as a sequence advancing, a synchronize or branch being taken, or activities passing data via method calls.
Replay of the log is an accurate repeated history of state changes, up to and including the “present” state.
The Provenance Service “weaves” these records into the workflow XOML, recording LSNs for individual activities, insertions (shims), etc.
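To make the replay property concrete, here is a small Python sketch of an append-only log whose records carry log sequence numbers (LSNs) and can be replayed to rebuild state at any point. `LogicalLog` and its record layout are illustrative assumptions, not the CLFS-based implementation.

```python
class LogicalLog:
    """Append-only log; each record gets a monotonically increasing LSN."""

    def __init__(self):
        self._records = []

    def append(self, activity, key, value):
        """Record one state change and return its LSN (starting at 1)."""
        lsn = len(self._records) + 1
        self._records.append((lsn, activity, key, value))
        return lsn

    def replay(self, up_to_lsn=None):
        """Rebuild per-activity state by reapplying records in LSN order."""
        state = {}
        for lsn, activity, key, value in self._records:
            if up_to_lsn is not None and lsn > up_to_lsn:
                break
            state.setdefault(activity, {})[key] = value
        return state


log = LogicalLog()
log.append("Sequence", "position", 0)
log.append("Activity1", "status", "running")
log.append("Activity1", "status", "done")
last = log.append("Sequence", "position", 1)
```

Calling `log.replay()` reconstructs the present state, while `log.replay(up_to_lsn=2)` reconstructs the state as it was after the second record.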
10
Provenance Capture Integrated into Runtime Engine and Services
[Diagram labels: Host Process; Workflow Foundation; My Experiment]
Base Activity Library: classes augmented with provenance capture
Runtime Engine: provides intrinsic behaviors to activities
  Tracking Infrastructure, State Management, Workflow Execution, Provenance Management
Runtime Services: hosting flexibility, pluggable implementations (with defaults)
  Provenance Storage (PSI), Communication, Tracking, …
11
Query Support (initial)
Individual Workflow Execution Trace
  Display a graphical trace of the execution
  Query for skipped steps, inserted steps, etc.
  Query for the codes (activities) invoked
  Query for machine execution statuses
Multiple Workflow Execution Traces
  Comparative trace (shallow versus deep compare)
Still “early days” for our query support over a workflow execution provenance trace store
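Queries for skipped and inserted steps can be thought of as differences between the planned schedule and the executed trace. The Python sketch below assumes traces are simple lists of activity names, and it interprets a "shallow" compare as matching names in order; the slides do not define either, so both are assumptions, as are all function names.

```python
def skipped_steps(planned, executed):
    """Steps in the schedule that never ran, in schedule order."""
    ran = set(executed)
    return [step for step in planned if step not in ran]


def inserted_steps(planned, executed):
    """Steps that ran but were not in the schedule (e.g. shims)."""
    scheduled = set(planned)
    return [step for step in executed if step not in scheduled]


def shallow_compare(trace_a, trace_b):
    """Shallow comparison: same activity names in the same order."""
    return list(trace_a) == list(trace_b)


# Illustrative data; the step names are made up.
planned = ["Load", "Normalize", "Analyze", "Report"]
run_1 = ["Load", "UnitShim", "Analyze", "Report"]
run_2 = ["Load", "Normalize", "Analyze", "Report"]
```

A deep compare would presumably also examine inputs, parameters, and return values for each step, which this sketch leaves out.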
12
An Issue to Consider…
It may not be possible to rerun an experiment, to either validate or recreate a result, because the original workflow is lost (activities have been updated).
Assign a version identifier (strong name) to the workflow assembly so it can be associated with the result; only retain the assembly if its provenance is retained.
Updating any activity in the workflow will change this version number, resulting in a new version being created.
The user is able to rerun the experiment by invoking the workflow using the fully-specified reference found in the provenance record.
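The versioning scheme can be illustrated with a content hash standing in for a .NET strong name. This is a swapped-in technique for illustration only (strong names involve assembly signing, which is not modeled here); `version_id`, the registry, and the sample sources are hypothetical.

```python
import hashlib


def version_id(activity_sources):
    """Derive a version identifier from the content of every activity."""
    digest = hashlib.sha256()
    for name in sorted(activity_sources):  # sorted so dict order does not matter
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(activity_sources[name].encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()


# Two workflow versions; updating one activity yields a new identifier.
v1_sources = {"Load": "read(file)", "Analyze": "fit(data)"}
v2_sources = {"Load": "read(file)", "Analyze": "fit(data, robust=True)"}

# Retained workflow versions, keyed by identifier.
registry = {version_id(v1_sources): v1_sources, version_id(v2_sources): v2_sources}

# The provenance record stores the fully specified version reference.
provenance_record = {"result": "run-001", "workflow_version": version_id(v1_sources)}

# Rerun: look up the exact workflow version named in the provenance record.
workflow_to_rerun = registry[provenance_record["workflow_version"]]
```

Keeping the identifier in the provenance record is what lets a later rerun pick the original activities even after newer versions exist.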
13
To Sum Up…
Extended Windows Workflow Foundation
  Transparently capture execution trace leading to a result
  Towards a layered provenance model
  Initial query facility built over this provenance data
This summer: evaluation and necessary extensions, analysis support
  Luciano Digiampietri (UniCamp/Brazil), project intern
Tying provenance to code versioning
  In general, how to manage provenance data and code so the scientist simply doesn’t have to worry about it…
An interesting data management challenge
  Provenance as a first class derived data item
14
Closing Comments…
Provenance presents many, many open questions, but offers so much potential…
Execution provenance (sadly) is just the tip…
  Is this even provenance? Where to draw the line?
Shall we revel in complexity, or focus on the low-hanging fruit?
  Can’t we do both?
Standards (agreements) on representation/protocols
  Try to reach a “tipping point”
We welcome your feedback and suggestions, and are open to opportunities to collaborate on this problem…