Querying Workflow Provenance Susan B. Davidson University of Pennsylvania Joint work with Zhuowei Bao, Xiaocheng Huang and Tova Milo.

Slides:



Advertisements
Similar presentations
R O O T S Field-Sensitive Points-to-Analysis Eda GÜNGÖR
Advertisements

Algorithms (and Datastructures) Lecture 3 MAS 714 part 2 Hartmut Klauck.
The Pumping Lemma for CFL’s
1 CIS 461 Compiler Design and Construction Fall 2014 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon Lecture-Module.
School of EECS, Peking University “Advanced Compiler Techniques” (Fall 2011) SSA Guo, Yao.
IPAW'08 – Salt Lake City, Utah, June 2008 Data lineage model for Taverna workflows with lightweight annotation requirements Paolo Missier, Khalid Belhajjame,
Optimizing Join Enumeration in Transformation-based Query Optimizers ANIL SHANBHAG, S. SUDARSHAN IIT BOMBAY VLDB 2014
1 CS 201 Compiler Construction Machine Code Generation.
Michael Alves, Patrick Dugan, Robert Daniels, Carlos Vicuna
1.6 Behavioral Equivalence. 2 Two very important concepts in the study and analysis of programs –Equivalence between programs –Congruence between statements.
Closure Properties of CFL's
SOFTWARE TESTING. INTRODUCTION  Software Testing is the process of executing a program or system with the intent of finding errors.  It involves any.
A Propagation Model for Provenance Views of Public/Private Workflows Susan Davidson U. of Pennsylvania Tova Milo Tel Aviv U. Sudeepa Roy U. of Washington.
SYMBOLIC MODEL CHECKING: STATES AND BEYOND J.R. Burch E.M. Clarke K.L. McMillan D. L. Dill L. J. Hwang Presented by Rehana Begam.
1 Potential for Parallel Computation Module 2. 2 Potential for Parallelism Much trivially parallel computing  Independent data, accounts  Nothing to.
GRAMMAR & PARSING (Syntactic Analysis) NLP- WEEK 4.
A Visual Query Language for Business Processes Catriel Beeri Hebrew University Anat Eyal, Simon Kamenkovich, Tel Aviv University Tova Milo.
Lectures on Network Flows
Introduction to Computability Theory
1 Introduction to Computability Theory Lecture12: Decidable Languages Prof. Amos Israeli.
1 Introduction to Computability Theory Lecture15: Reductions Prof. Amos Israeli.
1 Introduction to Computability Theory Lecture12: Reductions Prof. Amos Israeli.
1 Introduction to Computability Theory Lecture7: The Pumping Lemma for Context Free Languages Prof. Amos Israeli.
1 Introduction to Computability Theory Lecture5: Context Free Languages Prof. Amos Israeli.
Lecture 15UofH - COSC Dr. Verma 1 COSC 3340: Introduction to Theory of Computation University of Houston Dr. Verma Lecture 15.
Constraint Logic Programming Ryan Kinworthy. Overview Introduction Logic Programming LP as a constraint programming language Constraint Logic Programming.
Penn ESE 535 Spring DeHon 1 ESE535: Electronic Design Automation Day 13: March 4, 2009 FSM Equivalence Checking.
Firewall Policy Queries Author: Alex X. Liu, Mohamed G. Gouda Publisher: IEEE Transaction on Parallel and Distributed Systems 2009 Presenter: Chen-Yu Chang.
1 Intermediate representation Goals: –encode knowledge about the program –facilitate analysis –facilitate retargeting –facilitate optimization scanning.
Programming Language Semantics Mooly SagivEran Yahav Schrirber 317Open space html://
CS 206 Introduction to Computer Science II 12 / 10 / 2008 Instructor: Michael Eckmann.
Semantics with Applications Mooly Sagiv Schrirber html:// Textbooks:Winskel The.
Topic 6 -Code Generation Dr. William A. Maniatty Assistant Prof. Dept. of Computer Science University At Albany CSI 511 Programming Languages and Systems.
Prof. Bodik CS 164 Lecture 16, Fall Global Optimization Lecture 16.
1 CSCI 2400 section 3 Models of Computation Instructor: Costas Busch.
1 Introduction to Parsing Lecture 5. 2 Outline Regular languages revisited Parser overview Context-free grammars (CFG’s) Derivations.
Complexity 2-1 Problems and Languages Complexity Andrei Bulatov.
An Integration Framework for Sensor Networks and Data Stream Management Systems.
1 Automatic Refinement and Vacuity Detection for Symbolic Trajectory Evaluation Orna Grumberg Technion Haifa, Israel Joint work with Rachel Tzoref.
Logic Circuits Chapter 2. Overview  Many important functions computed with straight-line programs No loops nor branches Conveniently described with circuits.
Overview of Previous Lesson(s) Over View  An NFA accepts a string if the symbols of the string specify a path from the start to an accepting state.
Compactly Representing Parallel Program Executions Ankit Goel Abhik Roychoudhury Tulika Mitra National University of Singapore.
CALTECH CS137 Spring DeHon CS137: Electronic Design Automation Day 9: May 6, 2002 FSM Equivalence Checking.
“Software” Esterel Execution (work in progress) Dumitru POTOP-BUTUCARU Ecole des Mines de Paris
Introduction to Parsing
Closure Properties Lemma: Let A 1 and A 2 be two CF languages, then the union A 1  A 2 is context free as well. Proof: Assume that the two grammars are.
1 Turing’s Thesis. 2 Turing’s thesis: Any computation carried out by mechanical means can be performed by a Turing Machine (1930)
Verification & Validation By: Amir Masoud Gharehbaghi
CS654: Digital Image Analysis Lecture 34: Different Coding Techniques.
CHAPTER 8 THE DISJOINT SET ADT §1 Equivalence Relations 【 Definition 】 A relation R is defined on a set S if for every pair of elements (a, b), a, b 
1 Reasoning with Infinite stable models Piero A. Bonatti presented by Axel Polleres (IJCAI 2001,
Putting Lipstick on Pig: Enabling Database-styleWorkflow Provenance YaelAmsterdamer,Susan B.Davidson,Daniel Deutch Tova Milo,Julia Stoyanovich,ValTannen.
Copyright © Curt Hill Other Trees Applications of the Tree Structure.
Huffman code and Lossless Decomposition Prof. Sin-Min Lee Department of Computer Science.
Unit-8 Sorting Algorithms Prepared By:-H.M.PATEL.
 2004 SDU Lecture8 NON-Context-free languages.  2004 SDU 2 Are all languages context free? Ans: No. # of PDAs on  < # of languages on  Pumping lemma:
SOFTWARE TESTING LECTURE 9. OBSERVATIONS ABOUT TESTING “ Testing is the process of executing a program with the intention of finding errors. ” – Myers.
Chapter 11 Sorting Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and Mount.
More NP-Complete and NP-hard Problems
Capability-Sensitive Query Processing on Internet Sources
RE-Tree: An Efficient Index Structure for Regular Expressions
Lectures on Network Flows
Program Slicing Baishakhi Ray University of Virginia
Parsing Costas Busch - LSU.
The Pumping Lemma for CFL’s
Interval Partitioning of a Flow Graph
Disjoint Sets DS.S.1 Chapter 8 Overview Dynamic Equivalence Classes
Applications of Regular Closure
Unit III – Chapter 3 Path Testing.
Presentation transcript:

Querying Workflow Provenance Susan B. Davidson University of Pennsylvania Joint work with Zhuowei Bao, Xiaocheng Huang and Tova Milo

Provenance in Scientific Workflows Scientific workflow systems – Automate the execution of scientific workflows – Capture the provenance at various levels of details Provenance graphs are currently captured by these systems as DAGS: 2 Taverna [Hull+06] Kepler [Altintas+04] VisTrails [Callahan+06] BP-QL [Beeri+06] M1: Split Entry M4: Run Alignment M3: Format Annotation M2: Check Annotation M6: Build Phylo. Tree M5: Format Alignment d3 d7 d1 d5 d8 d6 d2 d9 d4

and I mean that provenance is “captured”… 3

How can we “release” provenance? Since provenance is a DAG, can use any graph query language – E.g. ProQL [Karvouranakis+ SIGMOD’10] Provenance Challenges showed a number of desired queries which should be optimized for. – Many were simple reachability queries “Identify the data sources that contributed some data leading to the production of publication p”. – Others required regular path queries “Find all publications p that resulted from starting with data of type x, then performing a repeated analysis using either technique a1 or technique a2, terminated by producing a result of type s, and eventually ending by publishing p.” Provenance can also be very large, so we need “views” – Zoom-in/Zoom-out [Biton+ ICDE’08], [Amsterdamer+ VLDB’11] 4

Overview The problem – How to efficiently answer reachability and regular path queries over provenance graphs representing workflow executions? The approach – Reachability labeling The setting – Dynamic provenance graphs – Derivation-based model of executions 5

Reachability Labeling Given a graph, we assign each node a label such that using only the labels of any two nodes, we can decide if one can reach the other. Goal: design a compact and efficient labeling scheme Labeling scheme should also be dynamic and persistent. Labeling function Decoding predicate dl(d)l(d1)l(d2) Yes iff d1 reaches d2 (1, 9) (2, 7) (3, 5) (5, 8) (4, 6) Interval encoding [Santoro+85] Test interval containment 6 Logarithmic label sizeLinear construction timeConstant query time

“Surprising” things Compact and efficient labeling is achievable for DAGs generated by workflow specifications We can efficiently implement regular path queries using labeling. 7

Roadmap Introduction Workflow Model Reachability Labeling Scheme From Reachability to Regular Path Queries Conclusions 8

Grammar-Based Workflow Model Specification and execution (a.k.a. run or provenance graph) – Specification is small (a “constant”), execution is big (due to repeated parallel/sequential behaviors) Specification = context-free graph grammar Formally, G = (Σ, Δ, S, P) – Σ: a set of atomic modules – Δ: a set of composite modules – S: start module – P: a set of graph-based productions of form M := W All possible executions = context-free graph language Formally, L(G) = { R in Σ* | S =>* R } 9 terminals variables start variable composite module sub-workflow

Example: Workflow Specification 10 Alternative productions Recursive productions

Example: Workflow Execution 11 Specification : An execution :

Derivation-based view of execution 12

Fine-Grained Dependencies Modules may have multiple inputs and output (we have assume single input/output so far). Typical assumption: Black-box dependencies – Every output depends on every input. – Creates problems for composite modules. In a fine-grained model, an output depends on a subset of inputs. 13 d5= d1+d2 d4= d1 d1 d2 M3

Roadmap Introduction Workflow Model Reachability Labeling Scheme From Reachability to Regular Path Queries Conclusions 14

Reachability Labeling Labels represent the sequence of node replacements (derivations) that produced the module execution – “compressed” parse tree, whose height is proportional to the size of the workflow grammar 15

Testing Reachability Problem Statement: Given two nodes (x,y) in an execution is y reachable from x? – a node in the execution is a leaf in the compressed parse tree Method (roughly speaking): Look for least common ancestor of nodes, and examine the specification to see if the node-types are reachable. – Is b:3 reachable from c:1? – Is d:1 reachable from a:1? 16

Roadmap Introduction Workflow Model Reachability Labeling Scheme From Reachability to Regular Path Queries Conclusions 17

Idea: Intersect Query with Execution 18 Query _*e_* as a DFA: Fine-grained representation of sample execution for _*e_* : Sample execution:

Regular Path Query becomes Reachability Query 19 Is b:1 reachable from c:1 via _*e_* ? Reachable?

But labeling is dynamic… Executions are not query-dependent, labels cannot be changed once they are generated – Labels represent reachability – Intersect the specification with query at query-time 20

Revisiting example 21 Is b:1 reachable from c:1 via _*e_* ?

Some caveats Workflow grammar must be strictly linear-recursive – All recursions are node-disjoint The query must be safe with respect to the specification – Connections between input/output “states” must be stable for any execution of each composite module 22 Example: Query R1= _*e_* is safe, but R2= e is unsafe.

Conclusions Since workflow provenance graphs are derived from a specification, there is additional information that can be used to develop compact and efficient labeling schemes To answer (safe) regular path queries: – Create reachability labels at execution time – Intersect query with the specification at query time – Combine to answer query Queries which are not safe can be broken into smaller, safe subqueries and their results joined together These ideas can also be used to answer queries over views of fine-grained workflows. 23

24