

1 Automated Provenance Capture Architecture and Implementation Karen Schuchardt, Eric Stephan, Tara Gibson, George Chin PNNL

2 Methodology
–Simulated using the Kepler workflow system; we did not attempt to leverage looping
–Programmed stub actors for each step with proper inputs, outputs, and user-controlled parameters
–Implemented an execution event listener that optionally records the workflow; no changes to core Kepler were made
–Applied/extended our existing content management/provenance system to see how far we could go with it
–Implemented actors/workflows for queries and visual analysis using XSLT/GraphViz
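The execution-event-listener approach can be sketched as below. This is a minimal stand-in, not the actual Kepler listener API: the class, method, and event names are assumptions, and the "provenance store" is just an in-memory list.

```python
import datetime


class ProvenanceListener:
    """Illustrative execution-event listener (not the real Kepler API).

    Records start/finish events for actors when capture is enabled,
    mirroring the optional, event-based capture described above.
    """

    def __init__(self, capture_enabled=True):
        self.capture_enabled = capture_enabled
        self.events = []  # stand-in for the provenance capture service

    def actor_started(self, actor_name):
        self._record("startedExecution", actor_name)

    def actor_finished(self, actor_name, outputs=None):
        self._record("finishedExecution", actor_name, outputs or {})

    def _record(self, event_type, actor_name, payload=None):
        if not self.capture_enabled:
            return  # user chose not to capture provenance for this actor
        self.events.append({
            "event": event_type,
            "actor": actor_name,
            "payload": payload,
            "time": datetime.datetime.now().isoformat(),
        })


# A stub actor firing would notify the listener at its boundaries:
listener = ProvenanceListener()
listener.actor_started("AlignWarp1")
listener.actor_finished("AlignWarp1", {"out": "warp1.warp"})
```

Keeping capture on the listener side is what lets the engine stay unmodified: the workflow only emits events.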

3 Provenance Capture Architecture
[Architecture diagram. Provenance/Content System: Naming Service, Content Store, Triple Store, Prov Capture Service, Prov Service, Query Service, metadata extraction, translation, indexing. The Workflow Engine (with its Workflow UI and Workflow Tools) sends events to the Prov Capture Service. Client Tools: Analysis Tools, Query Tools, Browsing Tools, Annotation Tools, Harvesting Tools.]

4 SDG Provenance Capture Implementation
[Implementation diagram instantiating the architecture: naming via URL/LSID; content store and triple store in SAM, accessed via URIQA (RDF) and WebDAV; Lucene indexing; Defuddle metadata extraction; triple harvesters; GML/WebDAV translation; query processor and prov processor driven by SEDASL. The Kepler engine (Kepler UI, Kepler workflows, Nettool, all within Kepler) sends events to the Prov Capture Service. Client tools: node-by-node comparison, Batagelj/Mrvar algorithm, DAVExplorer/Ecce.]

5 Physical Model
[Model diagram: a Named Thing has 0..n Properties and associated Content; a Property can be a link or a value.]
Any “thing” for which we want to capture some information is given a unique id, with which properties and relationships can be associated. Additionally, content can be associated with these “things”.
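A minimal in-memory sketch of this physical model (all names are hypothetical): each thing gets a unique id, properties are either values or links to other ids, and raw content can be attached.

```python
import itertools

_ids = itertools.count(1)


class NamedThing:
    """Sketch of the physical model: a uniquely identified 'thing'
    with 0..n properties (values or links) and optional content."""

    def __init__(self):
        self.uid = f"urn:thing:{next(_ids)}"  # stand-in for the naming service
        self.properties = []   # list of (name, value_or_uid, kind)
        self.content = None    # raw bytes associated with the thing

    def add_value(self, name, value):
        self.properties.append((name, value, "value"))

    def add_link(self, name, other):
        # A property can also be a link to another named thing.
        self.properties.append((name, other.uid, "link"))


workflow = NamedThing()
workflow.add_value("title", "fMRI workflow run")
actor = NamedThing()
actor.add_link("isPartOf", workflow)
actor.content = b"<actor configuration>"
```

The point of the model is exactly this uniformity: workflows, actors, and data values all reduce to ids plus properties plus optional content.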

6 Logical Overlay
[Schema diagram overlaying the physical model. A Workflow Instance carries title, creator, owningInstitution, created, createdWith, uid, hasStatus, wasRunBy, startedExecution, and finishedExecution. Each of its 0..n Actor Instances (isPartOf the workflow) carries title, startedExecution, and finishedExecution, and has Parameters (title, format, hasValue OR hasHashOfValue), input Port Values (isInput), and output Port Values (hasOutput; title, format, created, hasSource, hasValue OR hasHashOfValue). Arbitrary additional triples can be attached to workflows, actors, and port values.]
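Using the overlay's predicates, one run might be asserted as triples like these; all ids, dates, and values below are made up for illustration, not captured data.

```python
# Hypothetical triples describing one run with the overlay's predicates;
# subjects, dates, and values are invented for illustration.
triples = [
    ("wf:run42",     "title",             "fMRI workflow run"),
    ("wf:run42",     "wasRunBy",          "user:demo"),
    ("wf:run42",     "startedExecution",  "2006-08-01T10:00:00"),
    ("wf:run42",     "finishedExecution", "2006-08-01T10:05:00"),
    ("actor:align1", "isPartOf",          "wf:run42"),
    ("actor:align1", "title",             "AlignWarp1"),
    ("actor:align1", "hasParameter",      "param:model"),
    ("param:model",  "hasValue",          "12"),
    ("actor:align1", "hasOutput",         "port:warp1"),
    ("port:warp1",   "format",            "warp"),
    ("port:warp1",   "hasHashOfValue",    "sha1:deadbeef"),  # hash captured instead of a large value
]

# Because everything is plain triples, simple pattern queries fall out naturally:
def outputs_of(actor, facts):
    return [o for s, p, o in facts if s == actor and p == "hasOutput"]
```

Note how `hasHashOfValue` substitutes for `hasValue` when storing the actual value would be too costly.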

7 Semantically Extended DASL Queries
Select
–all properties or a specific list
–format (GXL, RDF, WebDAV)
Scope
–a URL or query (i.e. 2 phase)
–names of properties to follow (and direction)
–stop conditions (property/value comparisons, depth)
Where
–property name/value comparisons, content search
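A semantically extended DASL request might look roughly like the sketch below, built with ElementTree. The element names (`select`, `scope`, `follow`, `depth`, `where`) and the URL are illustrative assumptions, not the actual SEDASL grammar.

```python
import xml.etree.ElementTree as ET

# Illustrative SEDASL-style query body; element names are assumptions,
# not the real extended-DASL grammar used by the system.
query = ET.Element("searchrequest")

select = ET.SubElement(query, "select")
ET.SubElement(select, "prop").text = "title"
ET.SubElement(select, "format").text = "rdf"   # could also be gxl or webDAV

scope = ET.SubElement(query, "scope")
ET.SubElement(scope, "href").text = "http://example.org/prov/run42"
ET.SubElement(scope, "follow").text = "hasOutput"  # property (and direction) to traverse
ET.SubElement(scope, "depth").text = "3"           # stop condition

where = ET.SubElement(query, "where")
cond = ET.SubElement(where, "eq")
ET.SubElement(cond, "prop").text = "format"
ET.SubElement(cond, "value").text = "warp"

body = ET.tostring(query, encoding="unicode")
```

The key extension over plain DASL is the scope: instead of a collection URL, it names a starting node plus properties to follow and conditions for stopping the traversal.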

8 Workflow Comparisons
Node-by-node comparisons
–Nodes match if all node attributes and incoming and outgoing edges match
–Nodes are similar if attributes and edges match to some specified XX%
After node comparisons, edges are compared
–Edges match if connecting nodes were found to be exactly matching or similar and edge attributes match
–Edges are similar if attributes match to some specified XX%
Outputs include:
–Matching or similar nodes
–Matching or similar edges
–Nodes only in first or second graph
–Edges only in first or second graph
[Example output. Nodes only in First Graph: node52 (atlas-z.gif), node14 (imageformat), node53 (convertyimage), node57 (atlas-y.gif), node36 (convertzimage), node34 (imageformat), node15 (imageformat), node78 (atlas-x.gif), node26 (convertximage). Count: 9. Compared node/edge attributes include title, instantiationOf, source, value, and format; edge types include isPartOf, isInput, and hasOutput, plus their reverses.]
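The node-matching rule above can be sketched as follows; the overlap scoring and the similarity threshold here are simplified assumptions, not the tool's exact metric.

```python
def compare_node(a, b, threshold=0.8):
    """Compare two nodes, each a dict with 'attrs' (dict of attributes)
    plus 'in_edges'/'out_edges' (sets of edge labels).

    Returns 'match' when all attributes and edges agree, 'similar' when
    the feature overlap reaches the threshold (a simplified stand-in for
    the XX% rule above), else 'different'.
    """
    features_a = set(a["attrs"].items()) | a["in_edges"] | a["out_edges"]
    features_b = set(b["attrs"].items()) | b["in_edges"] | b["out_edges"]
    if features_a == features_b:
        return "match"
    overlap = len(features_a & features_b) / max(len(features_a | features_b), 1)
    return "similar" if overlap >= threshold else "different"


n1 = {"attrs": {"title": "convertximage"}, "in_edges": {"isInput"}, "out_edges": {"hasOutput"}}
n2 = {"attrs": {"title": "convertximage"}, "in_edges": {"isInput"}, "out_edges": {"hasOutput"}}
```

Edge comparison then proceeds the same way, restricted to edges whose endpoints were already found matching or similar.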

9 Workflow Graph Distances
–Implements a social network algorithm based on the triad census (Batagelj and Mrvar, 2001; Chin, Whitney, Powers, and Johnson, 2004)
–Examines every possible set of three nodes of a workflow graph
–Every such triple falls into 1 of 64 possible triad states
–The census is the count of triads in each state, which may be used to summarize or profile overall graph structure
–Distance is computed by taking the Euclidean distance between two triad censuses, normalized to a 0.0..1.0 value
–Most useful for assessing similarity across large, complex workflows
Example: distance computed for two workflow graphs: 0.095888 (census begins 0, 4, 6, 1, …)
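The distance step can be sketched as below. The normalization chosen here (dividing by the sum of the two census norms, which bounds the result in 0.0..1.0 by the triangle inequality) is one plausible choice, not necessarily the one published by Chin et al.

```python
import math


def triad_distance(census_a, census_b):
    """Normalized Euclidean distance between two 64-element triad censuses.

    Dividing by ||a|| + ||b|| keeps the result in 0.0..1.0 (triangle
    inequality); this normalization is an illustrative assumption.
    """
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(census_a, census_b)))
    norm = (math.sqrt(sum(a * a for a in census_a))
            + math.sqrt(sum(b * b for b in census_b)))
    return d / norm if norm else 0.0


# Identical censuses are at distance 0.0; disjoint ones approach 1.0.
same = triad_distance([0, 4, 6, 1], [0, 4, 6, 1])
```

Because the census compresses an arbitrarily large graph into 64 counts, the distance stays cheap to compute even for large, complex workflows.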

10 What’s Cool
–Combined RDF assertions with scientific content management
–Flexible capabilities for metadata extraction (e.g. Defuddle to extract data from a warp file); existing RDF harvesters could be plugged in through the same mechanism
–Extensible translation mechanism (browse tools can provide views of raw data, such as a table of warp parameters)
–Conceptually simple model that can apply to much more than workflow execution
–Readily adaptable to alternative models, constructs, relationships
–Indexing and query of content or metadata
–All relationships are reverse indexed automatically; you can search up or down and even mix directions on specific properties
–Flexible event-based model so as to minimize connections into the workflow engine
–Actors can contribute their own metadata easily through events
–User control over which actors to capture provenance on
–Automatic content type determination
–Multiple output formats
–Capability to capture hashes instead of values
–Leveraged DASL extension mechanisms
–Based on existing standards (HTTP), so existing tools can be leveraged
–Pluggable authentication model based on JAAS
–Everything is open source
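The automatic reverse indexing can be sketched as a triple store that indexes every assertion both forward and under a `-reverse` predicate, so traversals can go down, up, or mix directions; the class, method names, and data below are hypothetical.

```python
from collections import defaultdict


class TripleStore:
    """Sketch of automatic reverse indexing: every triple is indexed
    forward and under '<predicate>-reverse', so searches can go up,
    down, or mix directions on specific properties."""

    def __init__(self):
        self.index = defaultdict(list)  # (node, predicate) -> neighbors

    def assert_triple(self, s, p, o):
        self.index[(s, p)].append(o)
        self.index[(o, p + "-reverse")].append(s)  # the automatic reverse index

    def follow(self, start, path):
        nodes = [start]
        for pred in path:  # mixed forward/reverse steps are fine
            nodes = [o for n in nodes for o in self.index[(n, pred)]]
        return nodes


store = TripleStore()
store.assert_triple("actor:align1", "hasOutput", "port:warp1")
store.assert_triple("port:warp1", "isInput", "actor:reslice1")
```

With this, following `["hasOutput", "isInput"]` walks downstream from an actor to its consumer, while the `-reverse` forms walk the same chain back up without any extra assertions.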

11 Limitations
–Prov capture is slow; we do one assertion at a time currently, but they could all be packaged up into one request
–RDF predicates can’t contain special characters, but things like parameters often have these characters
–SAM can be made to work, but the current implementation based on WebDAV ties resources to metadata; we had to create dummy resources
–SAM is not RDF based
–Big files (reference images) are duplicated as part of provenance tracking because they are data inputs to multiple actors
–Did not get to the LSID service, but it would be nice if this weren’t a separate protocol to deal with

12 Kepler/Workflow Comments
–Decided to stay with the brute-force model instead of a loop-based model; a loop-based model would probably introduce controller actors that would obscure the provenance capture
–Issue of what to capture provenance on for more general workflows
–Coding actors for each thing you want to do doesn’t scale and is a barrier to adoption by scientists
–Can’t control actor firing order, which resulted in things like AlignWarp4 producing warp1.warp
–We used string-constant actors to supply input files, but it makes more sense for Kepler to support the concept of a data source
–We could not tell if a port value was a file except by using File.exists()
–Would like to see events be external for complete separation from the workflow engine

13 Out of (Current) Scope
–Dynamically changing and continuing workflows (i.e. evolving workflows)
–Pointing back to provenance on actors; a real system would do this, and the actors themselves would have global ids that could be referenced
–Capturing provenance on workflow descriptors and pointing back to them (same as above for actors)
–Use of LSIDs; we have the service running but never got to the point of inserting it, and instead used a URL name-generation service
–Signing results

14 Brainstorming Categorization
How data was generated
–User-set parameter values
–Workflow structure/execution capture
–Outside tools
–Auto-generated metadata/content
The structure of the query
–2-phase query (or recursive)
–Specifying what to include
–Specifying what to exclude
What it will be used for
–Exploratory analysis
–Directed query to answer a specific question
–Debugging
–Verification
–Comparison

