

Manish Kumar Anand Eighth Biennial Ptolemy Miniconference Berkeley, California A Provenance Framework to Capture, Store, Query, and Browse Data Lineage in Kepler

2 Scientific Workflows
- Discoveries achieved via complex computations
- Workflows replacing traditional scripting approaches
- Enable automation, reproducibility, sharing, provenance
[Figure: Perl script vs. scientific workflow system]

3 Provenance
- A record of processes, inputs/outputs, dependencies
- Supports reproducibility, interpretation, verification
[Figure: example workflow with steps AlignWarp, Reslice, Softmean, Slicer, Convert and data items AXG, AYG, AZG]

4 Outline
- Capturing Provenance
- Storing Provenance
- Querying Provenance
- Browsing Provenance

5 Conventional Provenance Models
- Records: inputs / outputs of invocations
- Infers: data dependency, invocation dependency
- Example records: input(a, s1), output(a, s2), input(b, s2), input(c, s2), …
- Assumptions:
  - Data is atomic
  - Invocations consume all inputs and produce new outputs
  - Every output depends on all inputs
[Figure: workflow execution graph with the derived data dependency and invocation dependency graphs]
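The inference step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the Kepler implementation: the tuple layout `(kind, invocation, data)` and the function name `infer_dependencies` are assumptions made for the example, and the extra record `output(b, s3)` is added only so that a second data dependency appears.

```python
from collections import defaultdict


def infer_dependencies(records):
    """Infer dependencies from input/output records.

    records: iterable of (kind, invocation, data), kind is "input" or "output".
    Returns (data_deps, invocation_deps) as sets of (dependent, depended_on).
    """
    inputs = defaultdict(set)   # invocation -> data items it read
    outputs = defaultdict(set)  # invocation -> data items it wrote
    for kind, invocation, data in records:
        if kind == "input":
            inputs[invocation].add(data)
        elif kind == "output":
            outputs[invocation].add(data)
        else:
            raise ValueError(f"unknown record kind: {kind!r}")

    # Conventional assumption: every output depends on all inputs
    # of the invocation that produced it.
    data_deps = {
        (out, inp)
        for inv in list(outputs)
        for out in outputs[inv]
        for inp in inputs[inv]
    }

    # An invocation depends on whichever invocation produced its input.
    producer = {d: inv for inv, items in outputs.items() for d in items}
    invocation_deps = {
        (inv, producer[d])
        for inv, items in inputs.items()
        for d in items
        if d in producer
    }
    return data_deps, invocation_deps


records = [
    ("input", "a", "s1"),
    ("output", "a", "s2"),
    ("input", "b", "s2"),
    ("input", "c", "s2"),
    ("output", "b", "s3"),  # hypothetical extra record for illustration
]
data_deps, invocation_deps = infer_dependencies(records)
```

Because the model treats data as atomic and every output as depending on every input, the inferred graph can overstate dependencies for pass-through data, which is the limitation the next slide raises.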

6 Challenges in Modeling Provenance
Many scientific workflow systems also support:
a) Both data “transformers” and “pass-through”
b) Processes with different dependency patterns
c) Structured data (XML)
Models of provenance must consider these factors
[Figure: panels (a), (b), (c) illustrating each case for an invocation a over data items s1 to s5]

7 Unified Provenance Model

8 Efficient Provenance Representation
- Instead of storing each version, only store a single combined version along with a set of updates (Δ's)
- Updates and dependencies represented as annotations
Condensed: Δa = {ins(5,a), dep(5,2), del(3,a)}
Expanded: Δa = {ins(5,a), dep(5,2), del(3,a), ins(6,a), dep(5,3), dep(5,4), dep(6,2), dep(6,3), dep(6,4)}
[Figure: expanded and condensed annotated trees for invocation a]
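To make the "single combined version plus updates" idea concrete, the following Python sketch reconstructs the nodes visible before and after an invocation from insertion and deletion annotations. It is an illustration under stated assumptions, not the framework's code: annotations are taken in expanded form (every affected node carries its own annotation), invocations are assumed to be totally ordered, and the names `visible_nodes`, `order`, and `upto` are hypothetical.

```python
def visible_nodes(all_nodes, annotations, order, upto):
    """Return the nodes visible after the first `upto` invocations in `order`.

    all_nodes:   ids of every node in the combined version
    annotations: set of ("ins" | "del", node, invocation) tuples
    order:       list of invocation names in execution order (assumed total)
    """
    done = set(order[:upto])
    inserted_by = {n: inv for kind, n, inv in annotations if kind == "ins"}
    deleted_by = {n: inv for kind, n, inv in annotations if kind == "del"}
    return {
        n
        for n in all_nodes
        # A node without an "ins" annotation belongs to the initial input.
        if (n not in inserted_by or inserted_by[n] in done)
        and not (n in deleted_by and deleted_by[n] in done)
    }


nodes = {1, 2, 3, 4, 5, 6}
annotations = {("ins", 5, "a"), ("ins", 6, "a"), ("del", 3, "a")}
before_a = visible_nodes(nodes, annotations, ["a"], 0)
after_a = visible_nodes(nodes, annotations, ["a"], 1)
```

Each version is derived on demand from one stored structure, so nothing is duplicated per invocation.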

9 Expanding and Condensing Traces
[Figure: the same trace for invocation a shown in expanded and condensed form]

10 Trace Views
[Figure: trace for the workflow alignwarp, reslicewarp, softmean, slicer, convert over collection versions S1 to S6, with nodes Images, AnatomyImage, RefImage, Image, Header, WarpParamSet, ReslicedImage, AtlasImage, AtlasSlice, AtlasGraphic]
Condensed Trace
- Using a postorder (i.e., bottom-up, left-to-right) traversal
- Remove annotations from a node n:
  (i) dep(n,c) if dep(n,p) and child(p,c)
  (ii) dep(n,d) if child(p,n) and dep(p,d)
  (iii) ins(n,x) if child(p,n) and ins(p,x)
  (iv) del(n,y) if child(p,n) and del(p,y)
- Remove invocation order annotations: those implied according to rules (3-8)
Expanded Trace
- Uses three distinct preorder (i.e., top-down, left-to-right) traversals
- Pass 1: rules (1-2) and rules (3-5)
  - Infers insertion and deletion annotations
  - Infers invocation order from nodes and parent-child relationships
- Pass 2: rules (6-8)
  - Infers remaining invocation precedence relationships
- Pass 3: rules (9-10)
  - Expands dependency sets and propagates dependencies to child nodes
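The four removal rules (i) to (iv) can be sketched as a postorder pass in Python. This is a hedged illustration only: the tuple encoding of annotations, the `children` map, and the example tree (node 2 with children 3 and 4, node 5 with child 6) are assumptions chosen so that the expanded Δa from slide 8 can be used as input. Invocation-order annotations are left out because rules (3-8) are not spelled out on the slides.

```python
def postorder(children, root):
    """Yield node ids bottom-up, left-to-right."""
    for child in children.get(root, []):
        yield from postorder(children, child)
    yield root


def condense(children, root, annotations):
    """Drop annotations that rules (i)-(iv) mark as implied.

    annotations: set of ("dep" | "ins" | "del", node, target) tuples.
    """
    parent = {c: p for p, cs in children.items() for c in cs}
    kept = set(annotations)
    for n in postorder(children, root):
        own = {a for a in kept if a[1] == n}  # snapshot of n's annotations
        p = parent.get(n)  # not yet visited, so its annotations are intact
        for kind, _, target in own:
            if kind == "dep":
                target_parent = parent.get(target)
                # (i) dep(n,c) is implied by dep(n,p') with child(p',c)
                rule_i = target_parent is not None and ("dep", n, target_parent) in own
                # (ii) dep(n,d) is implied by dep(p,d) with child(p,n)
                rule_ii = p is not None and ("dep", p, target) in kept
                implied = rule_i or rule_ii
            else:
                # (iii) / (iv): ins or del inherited from the parent node
                implied = p is not None and (kind, p, target) in kept
            if implied:
                kept.discard((kind, n, target))
    return kept


# Hypothetical tree shape consistent with the annotations on slide 8.
children = {1: [2, 5], 2: [3, 4], 5: [6]}
expanded = {
    ("ins", 5, "a"), ("dep", 5, 2), ("del", 3, "a"), ("ins", 6, "a"),
    ("dep", 5, 3), ("dep", 5, 4), ("dep", 6, 2), ("dep", 6, 3), ("dep", 6, 4),
}
condensed = condense(children, 1, expanded)
```

On this input the pass keeps only ins(5,a), dep(5,2), and del(3,a), which matches the three-element Δa listed on slide 8.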

11 Outline
- Capturing Provenance
- Storing Provenance
- Querying Provenance
- Browsing Provenance

12 Storage Strategies
- Use a standard relational DBMS and minimize storage size, update time, and query time
- Store immediate and transitive dependencies
  - Faster query execution
- Reduction techniques
  - Represent dependencies in reduced form
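The trade-off behind storing transitive dependencies can be seen in a small Python sketch. It is illustrative only: the dictionary representation and the name `transitive_dependencies` are assumptions, and the slides do not say how the framework materializes the closure.

```python
def transitive_dependencies(immediate):
    """Compute, for each node, every node it depends on directly or indirectly.

    immediate: dict mapping node -> set of nodes it directly depends on.
    """
    closure = {}
    for start in immediate:
        seen = set()
        stack = list(immediate[start])
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(immediate.get(node, ()))
        closure[start] = seen
    return closure


immediate = {"s4": {"s3"}, "s3": {"s2"}, "s2": {"s1"}}
closure = transitive_dependencies(immediate)

# Number of (node, dependency) rows each representation would store.
immediate_rows = sum(len(deps) for deps in immediate.values())
transitive_rows = sum(len(deps) for deps in closure.values())
```

With the closure stored, a lineage query becomes a single lookup instead of a recursive traversal, at the cost of more rows to store and to maintain on update. For this chain of four items the row count doubles from 3 to 6.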

13 Storage Strategies
5 storage strategies:
- NC (Naive Collapsed): trace, collapsed
- NE (Naive Expanded): trace, expanded
- SE (Simple Expanded): trace, expanded, transitive dependencies
- RE (Reduced Expanded): reduced trace, expanded, transitive dependencies
- RC (Reduced Collapsed): reduced trace, collapsed, transitive dependencies
The reduced traces (RE, RC) are produced by reduction algorithms.
Compare: storage size, update time, query time

14 Analysis of Storage Strategies
- SE: worst storage size and update time
- RC: very expensive query time
- RE: recommended storage strategy
Storage size: RC < RE < NC < NE < SE
Update time: RC < RE < NC < NE < SE
Query time: SE < RE < NE < RC < NC
[Figure: three plots over traces for NC, NE, SE, RE, RC: Storage Size (cells, in thousands), Update Time (seconds), Query Time]

15 Outline
- Capturing Provenance
- Storing Provenance
- Querying Provenance
- Browsing Provenance

16 Querying Provenance can be Expensive
- Queries are often recursive
  - Complex to formulate
  - Expensive to evaluate
- Standard querying approaches
  - Tied to storage representation
  - Require query language expertise
- Need to query across structures, lineage, or both
How to express provenance queries easily and execute them efficiently?
(Q) Select lineage path that derived all children of AtlasImage created by slicer
[Figure: query Q spanning the structures and lineage dimensions]

17 To Express this Query …
- Hard for domain scientists (… and SQL experts)
- Optimization depends on SQL engine [He et al. SIGMOD 08]
- Need for higher-level provenance query language

SQL (e.g., transitive dependencies):

    select t.runId, t2.nodeId, t.nodeId as depNodeId
    from (
        select d1.runId, d1.pDep, d1.nodeId
        from dependency d1
        where runId=runId_in
        union
        select p1.runId, p1.fromPointer as pDep, d2.nodeId
        from dependency d2, depSubsetPointer p1
        where p1.runId=runId_in and d2.runId=runId_in and d2.pDep=p1.toPointer
      ) as t,
      depMinMaxPointer p2,
      (
        select t.runId, r1.nodeId, t.pDep
        from (
            select dc1.runId, dc1.pDepC, dc1.pDep
            from depCdepPointer dc1
            where runId=runId_in
            union
            select p1.runId, p1.fromPointer as pDepC, dc2.pDep
            from depCdepPointer dc2, depCSubsetPointer p1
            where p1.runId=runId_in and dc2.runId=runId_in and dc2.pDepC=p1.toPointer
          ) as t,
          depCMinMaxPointer p2, runCollData r1, runItemProv rp1
        where p2.runId = runId_in and r1.runId=runId_in and rp1.runId=runId_in
          and r1.nodeId=nodeId_in and r1.pointer=rp1.pointer
          and rp1.pDep = p2.fromPointer and t.pDepC=p2.toPointer
          and t.pDep BETWEEN p2.depMin AND p2.depMax
        union
        …
    …

SQL (stored procedures):

    create procedure depc(in runId_in varchar(255), in nodeId_in Integer)
    begin
      DECLARE finished integer default 0;
      …
      declare cur_1 cursor for
        select depNodeId from dependency
        where runId=runId_in and itemNodeId=nodeId_tmp;
      set nodeId_tmp = nodeId_in;
      set depCnt = (select count(*) from dependency
                    where runId=runId_in and itemNodeId=nodeId_tmp);
      if (depCnt is not null) then
        open cur_1;
        get_cur_1: loop
          fetch cur_1 into depNodeId_tmp;
          if finished then leave get_cur_1; end if;
          insert into depcT (nodeId) values(depNodeId_tmp);
        end LOOP get_cur_1;
        close cur_1;
        set cnt = 1;
        while (cnt <= depCnt) do
          set nodeId_tmp = (select nodeId from depcT where no=cnt);
          set row_limit = (select count(*) from dependency
                           where itemnodeId=nodeId_tmp and runId=runId_in);
          set row_cnt =0;
          open cur_1;
          get_cur_1: loop
            fetch cur_1 into depNodeId_tmp;
            set flag = (select 1 from depcT where nodeId = depNodeId_tmp);
            if (flag is null) then insert into depcT (nodeId) values(depNodeId_tmp); end if;
            if (row_cnt > row_limit) then leave get_cur_1; end if;
            set row_cnt = row_cnt + 1;
            …
    …
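For comparison, engines that support recursive common table expressions can express the same transitive-dependency lookup more compactly. The Python sketch below uses the standard-library `sqlite3` module with an in-memory database. It is an assumption-laden illustration, not the approach taken in the talk: only the table name `dependency` and the columns `runId`, `itemNodeId`, and `depNodeId` are taken from the stored procedure above, while the column types, the sample rows, and the function name `transitive_deps` are hypothetical.

```python
import sqlite3


def transitive_deps(conn, run_id, node_id):
    """Return all node ids that node_id transitively depends on within one run."""
    rows = conn.execute(
        """
        WITH RECURSIVE depc(nodeId) AS (
            SELECT depNodeId FROM dependency
            WHERE runId = ? AND itemNodeId = ?
            UNION
            SELECT d.depNodeId FROM dependency d
            JOIN depc ON d.itemNodeId = depc.nodeId
            WHERE d.runId = ?
        )
        SELECT nodeId FROM depc ORDER BY nodeId
        """,
        (run_id, node_id, run_id),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dependency (runId TEXT, itemNodeId INTEGER, depNodeId INTEGER)"
)
conn.executemany(
    "INSERT INTO dependency VALUES (?, ?, ?)",
    [("r1", 4, 3), ("r1", 3, 2), ("r1", 2, 1), ("r2", 4, 9)],  # hypothetical rows
)
deps_of_4 = transitive_deps(conn, "r1", 4)
```

Even in this form the query remains tied to the storage schema and its performance to the engine's handling of recursion, which are the slide's two objections to querying provenance directly in SQL.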

18 QLP Constructs
First Provenance Challenge Queries Formulated in QLP
Query 1: *..//AtlasXGraphic
Query 2: #softmean..//AtlasXGraphic
Query 3: #softmean..#slicer..#convert..//AtlasXGraphic
Query 4: invocations(#align_warp[m=“12”, dateofExecution="Monday"]
Query 5: outputs(//AnatomyHeaders[maximum=“4096”]..//AtlasGraphic)
Query 6: outputs(#align_warp[-m=“12”]..#softmean)
Query 7: #convert..*, #pgmtoppm..*
Query 8: outputs(//AnatomyImages[center=“Uchicago”].#align_warp)
Query 9: //AtlasGraphic[studyModality=“speech” | “visual” |

19 Querying Multiple Dimensions
1. Obtain structures (version operators)
2. Apply XPath expressions to structure
3. Apply lineage queries to each resulting node
(Q) Select lineage path that derived all children of AtlasImage created by slicer
[Figure: the QLP form of Q, combining a structure part (//AtlasImage/*, restricted to slicer) with lineage parts (* derived …), shown over the nodes Images, AtlasImage, AtlasSlice]
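The three evaluation steps can be mimicked in Python with the standard-library `xml.etree.ElementTree` module. This is a rough sketch of the idea, not QLP's evaluator: the XML snapshot, the `id` and `createdBy` attributes, the `DEPENDS_ON` map, and the function names are all hypothetical, and step 1 is reduced to parsing a single structure version.

```python
import xml.etree.ElementTree as ET

# Hypothetical structure version and dependency map, for illustration only.
TRACE_XML = """
<Images id="1">
  <AtlasImage id="2">
    <AtlasSlice id="3" createdBy="slicer"/>
    <Header id="4" createdBy="softmean"/>
  </AtlasImage>
</Images>
"""
DEPENDS_ON = {"3": {"10"}, "10": {"11"}, "4": {"12"}}


def lineage(node_id, depends_on):
    """Return every node id that node_id was directly or indirectly derived from."""
    seen, stack = set(), [node_id]
    while stack:
        for dep in depends_on.get(stack.pop(), ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def hybrid_query(xml_text, path, created_by, depends_on):
    root = ET.fromstring(xml_text)  # step 1: obtain the structure to query
    matches = root.findall(path)    # step 2: apply the path expression
    selected = [n for n in matches if n.get("createdBy") == created_by]
    # Step 3: apply the lineage query to each resulting node.
    return {n.get("id"): lineage(n.get("id"), depends_on) for n in selected}


result = hybrid_query(TRACE_XML, ".//AtlasImage/*", "slicer", DEPENDS_ON)
```

The structural part narrows the candidates first, so the recursive lineage step runs only for the nodes that survive the path expression and the invocation filter.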

20 Outline
- Capturing Provenance
- Storing Provenance
- Querying Provenance
- Browsing Provenance

21 Provenance Browser
- Browse different views of a trace
- Data dependencies, collection structure, actor invocations
- Move “forward” and “backward” through execution

22 Collection History
- Collection and invocation view
- Incrementally step through execution history

23 Conclusion
- Capture
  - Supports nested data collections, explicit data dependency, update semantics
- Storage
  - Reduce update time, storage size, and query time
- Query
  - A high-level provenance query language (QLP)
  - Query structures with lineage graphs
  - Formulate queries easily and concisely
- Browse/Visualize
  - Provenance Browser, a visualization tool to view and navigate across provenance views

24 References
1. M. K. Anand, S. Bowers, T. McPhillips, B. Ludäscher. Exploring Scientific Workflow Provenance using Hybrid Queries over Nested Data and Lineage Graphs. SSDBM
2. M. K. Anand, S. Bowers, T. McPhillips, B. Ludäscher. Efficient Provenance Storage over Nested Data Collections. EDBT
3. S. Bowers, T. McPhillips, S. Riddle, M. K. Anand, B. Ludäscher. Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life. IPAW 2008