A Logic Programming Approach to Scientific Workflow Provenance Querying* Shiyong Lu Department of Computer Science Wayne State University, Detroit, MI.

Slides:



Advertisements
Similar presentations
TU e technische universiteit eindhoven / department of mathematics and computer science Modeling User Input and Hypermedia Dynamics in Hera Databases and.
Advertisements

Database System Concepts and Architecture
Michael Pizzo Software Architect Data Programmability Microsoft Corporation.
Provenance GGF18 Kepler/COW+RWS, Kepler/COW+RWS, Bowers, McPhiilips et al. Provenance Management in a COllection-oriented Scientific Workflow.
1 CHAPTER 4 RELATIONAL ALGEBRA AND CALCULUS. 2 Introduction - We discuss here two mathematical formalisms which can be used as the basis for stating and.
Data Modeling and Database Design Chapter 1: Database Systems: Architecture and Components.
Database Concepts Lec. 5. What Is a Database? Data are unprocessed raw facts that include text, number, images, audio, and video. Information is processed.
An Approach to Evaluate Data Trustworthiness Based on Data Provenance Department of Computer Science Purdue University.
An Intelligent Broker Approach to Semantics-based Service Composition Yufeng Zhang National Lab. for Parallel and Distributed Processing Department of.
Marakas: Decision Support Systems, 2nd Edition © 2003, Prentice-Hall Chapter Chapter 1: Introduction to Decision Support Systems Decision Support.
Comparing path-based and vertically-partitioned RDF databases Preetha Lakshmi & Chris Mueller 12/10/2007 CSCI 8715 Shashi Shekhar.
ICS (072)Database Systems Background Review 1 Database Systems Background Review Dr. Muhammad Shafique.
Chapter 4 Relational Databases Copyright © 2012 Pearson Education, Inc. publishing as Prentice Hall 4-1.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 2 Overview of Database Languages and Architectures.
Mapping Physical Formats to Logical Models to Extract Data and Metadata Tara Talbott IPAW ‘06.
Chapter 4 Relational Databases Copyright © 2012 Pearson Education 4-1.
Chapter 4 Relational Databases and Enterprise Systems
Chapter One Overview of Database Objectives: -Introduction -DBMS architecture -Definitions -Data models -DB lifecycle.
Introduction to Data bases concepts
Database Management System Lecture 2 Introduction to Database management.
1 Introduction to databases concepts CCIS – IS department Level 4.
DBMS Lecture 9  Object Database Management Group –12 Rules for an OODBMS –Components of the ODMG standard  OODBMS Object Model Schema  OO Data Model.
Introduction to Database Systems Motivation Irvanizam Zamanhuri, M.Sc Computer Science Study Program Syiah Kuala University Website:
Tutorial 11 Using and Writing Visual Basic for Applications Code
Database Design - Lecture 2
1 INTRODUCTION TO DATABASE MANAGEMENT SYSTEM L E C T U R E
Evolution of Programming Languages Generations of PLs.
Information Systems: Databases Define the role of general information systems Describe the elements of a database management system (DBMS) Describe the.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
Database Support for Semantic Web Masoud Taghinezhad Omran Sharif University of Technology Computer Engineering Department Fall.
Introduction to Database Systems Fundamental Concepts Irvanizam Zamanhuri, M.Sc Computer Science Study Program Syiah Kuala University Website:
Towards Low Overhead Provenance Tracking in Near Real-Time Stream Filtering Nithya N. Vijayakumar, Beth Plale DDE Lab, Indiana University {nvijayak,
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Hung-chih Yang(Yahoo!), Ali Dasdan(Yahoo!), Ruey-Lung Hsiao(UCLA), D. Stott Parker(UCLA)
Provenance Challenge Simon Miles, Mike Wilde, Ian Foster and Luc Moreau.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Provenance Challenge gLite Job Provenance.
Efficient RDF Storage and Retrieval in Jena2 Written by: Kevin Wilkinson, Craig Sayers, Harumi Kuno, Dave Reynolds Presented by: Umer Fareed 파리드.
1Mr.Mohammed Abu Roqyah. Database System Concepts and Architecture 2Mr.Mohammed Abu Roqyah.
DATABASE MANAGEMENT SYSTEM ARCHITECTURE
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 2 Database System Concepts and Architecture.
REDUX – automatic capture, efficient storage Roger S. Barga Microsoft Research (MSR) Luciano Digiampietri University of Campinas, Sao Paolo, Brazil.
1 Querying the Physical World Son, In Keun Lim, Yong Hun.
Chapter 18 Object Database Management Systems. Outline Motivation for object database management Object-oriented principles Architectures for object database.
A Portrait of the Semantic Web in Action Jeff Heflin and James Hendler IEEE Intelligent Systems December 6, 2010 Hyewon Lim.
Data The fact and figures that can be recorded in system and that have some special meaning assigned to it. Eg- Data of a customer like name, telephone.
44271: Database Design & Implementation Physical Data Modelling Ian Perry Room: C49 Tel Ext.: 7287
1 Copyright © 2009, Oracle. All rights reserved. Oracle Business Intelligence Enterprise Edition: Overview.
Presentation on Database management Submitted To: Prof: Rutvi Sarang Submitted By: Dharmishtha A. Baria Roll:No:1(sem-3)
Data Models. 2 The Importance of Data Models Data models –Relatively simple representations, usually graphical, of complex real-world data structures.
Introduction: Databases and Database Systems Lecture # 1 June 19,2012 National University of Computer and Emerging Sciences.
D ATABASE MANAGEMENT SYSTEM By Rubel Biswas. W HAT IS I NFORMATION ? It’s just something you can’t avoid. It is generally referred to as data.
1 / 23 Presenter: Dong Dai, DISCL Lab. TTU Data-Intensive Scalable Computing Laboratory Department of Computer Science Accelerating Scientific.
XML Databases Presented By: Pardeep MT15042 Anurag Goel MT15006.
Introduction to DBMS Purpose of Database Systems View of Data
CS4222 Principles of Database System
CSCI-235 Micro-Computer Applications
System Design.
improve the efficiency, collaborative potential, and
Introduction What is a Database?.
Database Database is a large collection of related data that can be stored, generally describes activities of an organization. An organised collection.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 2 Database System Concepts and Architecture.
Chapter 2 Database Environment Pearson Education © 2009.
Chapter 2 Database Environment.
Introduction to DBMS Purpose of Database Systems View of Data
Query Execution Presented by Jiten Oswal CS 257 Chapter 15
Probabilistic Databases
UNIT-I Introduction to Database Management Systems
Database System Concepts and Architecture
Chaitali Gupta, Madhusudhan Govindaraju
Chapter 2 Database Environment Pearson Education © 2009.
Chapter 2 Database Environment Pearson Education © 2009.
Presentation transcript:

A Logic Programming Approach to Scientific Workflow Provenance Querying* Shiyong Lu Department of Computer Science Wayne State University, Detroit, MI *joint work with Yong Zhao, Microsoft Corporation.

Introduction A scientific workflow is a formal specification of a scientific process. Provenance management is essential for scientific workflows to support reproducibility, result interpretation, and workflow debugging. Provenance querying is a fundamental operation for provenance management for provenance access and various provenance usages. Two major concerns of provenance querying: expressiveness and efficiency.

Existing scientific workflow systems What about a logic programming approach?

Our preliminary contributions 1.We identify a set of requirements that are desirable for a scientific workflow provenance query language, 2.Based on these requirements, we propose FLOQ, a Frame Logic based query language for scientific workflow provenance, 3.We show by examples that FLOQ is expressive enough to formulate common provenance queries, including all the provenance challenge queries proposed in the provenance challenge series.

Outline of the talk Introduction Requirements for scientific workflow languages A gentle introduction to F-Logic F-logic approach to workflow provenance storage Query examples Conclusions and future work

Requirements of scientific workflow languages The language should be based on a well-defined semantics to represent computations and their relationships; The language should be able to define, query, and manipulate data structures, as they are essential components of workflows, on which various operations are performed; The language should have declarative syntax for both computation and data declarations and flexible query specification; The language should allow the composition of simple queries into more complex queries; The language ideally should support inference capability for provenance reasoning.

A gentle introduction to F-logic (1) Objects and properties Zhao[name -> “Yong Zhao”, affiliation -> UChicago] Lu[name-> “Shiyong Lu”, affiliation->WSU, teaches(2008)- >{CS300, CS501}] UChicago[name-> “University of Chicago”, location-> Chicago] WSU[name-> “Wayne State University”, location-> Detroit]

A gentle introduction to F-logic (2) Class or schema information A set of similar objects can be categorized into a class. An F-Logic program can also represent the structural information of a class and its type signature (types for method arguments and results): Employee[name *=> string, affiliation *=> University] Faculty[teaches(integer) => Course] University[name => string, location => City]

A gentle introduction to F-logic (3) Class hierarchy and class membership Faculty::Employee Zhao:Employee Lu:Faculty UChicago:University WSU:University

A gentle introduction to F-logic (4) Predicates Location(UChicago, Chicago) Rules ?U[offers(?Y)->?C] :- ?F[teaches(?Y)->?C],?F[affiliation->?U]

A gentle introduction to F-logic (5) Nested specification Lu[affiliation->WSU[name-> “Wayne State University”]]:Faculty Path expression Lu.affiliation Aggregation ?N = count{?C[?Y]|Lu[teaches(?Y)->?C]} > counts the number of courses taught by Lu in each year.

Sample fMRI scientific workflow

Defines types and procedures in F-logic

F-logic program of the sample fMRI workflow

An invocation record align_warp_inv_uuid[call->align_warp_uuid, host-> “uchost”, arch-> “ia64”, start_time-> “ T09:55:33”^^_dateTime, duration-> “00:29:55”^^_duration, exit_code->0].

Basic queries Find all the types defined in the workflow: Q: ?X::_type. A: ?X = Image ?X = Header ?X = Volume Find all procedures that take Volume as a parameter: Q: ?X[?Y=>Volume]::_procedure. A: ?X=align_warp ?Y=iv ?X=align_warp ?Y=reference Find procedure calls that ran on ia64 processors: Q: ?[call->?X, arch-> “ia64”]. A: ?X=align_warp_uuid

Lineage queries DirectlyDerived(?X, ?Y) :- ?Proc[in->?X, out->?Y],?Proc:_procedure. Derived(?X,?Y) :- DirectlyDerived(?X,?Y). Derived(?X,?Y) :- Derived(?X,?Z), Derived(?Z,?Y). Q: Derived(vol1, ?X). A: ?X=w1, svol1, atlas, atlas_x.ppm, atlas_x.jpg, atlas_y.ppm, … Q: Deirved(?Y, atlas). A: ?Y=svol1, svol2, svol3, svol4, w1, w2, w3, w4, vol1, vol2, … Similarly, if we want to track what procedures are used to process a dataset and its derived datasets, we can define: ConsumedBy(?X, ?P) :- ?P[in->?X], ?P:_procedure. ConsumedBy(?X, ?P) :- DirectlyDerived(?X, ?Y),ConsumedBy(?Y, ?P).

Virtual data relationship query The query of “find all the procedures that have an input of type Volume and an output of type Warp” can be formulated as: ?Proc[input->?X, output->?Y]:_procedure, ?Proc[?X=>Volume, ?Y=>Warp].

Annotation query The query of “show the values of all annotation predicates developerName of procedures that accept or produce an argument of type Volume with predicate Model=nonlinear” is formulated as: developerName[on(?Proc)->?Name], ?Proc[?=>Volume]::_procedure, Model[on(?Proc)->‘nonlinear’].

Aggregation query The query of “find all the align_warp invocations that ran three times longer than the average run time” is formulated as: ?X=avg{?d|?Inv[call->?C, duration->?d], ?C:align_warp}, ?Inv[call->?C, duration->?d], ?C:algin_warp, ?d > 3* ?X.

Provenance challenge query Q2 Q2: Find the process that led to Atlas X Graphic, excluding everything prior to the averaging of images with softmean. ProducedBy(?X, ?P) :- ?P[out->?X], ?P:_procedure. ProducedInBetween(?X, ?Y, ?P1, ?P2) :- ProducedBy(?X, ?P1), Derived(?Y, ?X), ConsumedBy(?Y, ?P2). Q: ProducedInBetween(‘atlas_x.jpg’, ?Y, ?P, softmean). A: ?P = convert, slicer, softmean ?Y = altas_x.ppm, atlas, svol1, svol2, svol3, svol4

Provenance challenge query Q3 Q3. Find the Stage 3, 4 and 5 details of the process that led to Atlas X Graphic. ProducedBy(?X, ?P, ?D) :- ?P[out->?X], ?P:_procedure, ?D is 1. ProducedBy(?X, ?P, ?D) :- DirectlyDerived(?X, ?Y), ProducedBy(?Y, ?P, ?D0), ?D is ?D Q: ProducedBy(‘atlas_x.jpg’, ?P, ?S), ?S >=3, ?S=<5. A: ?S = 3, ?P = softmean ?S = 4, ?P = reslice ?S = 5, ?P = align_warp

Provenance challenge query Q7 Q7. A user has run the workflow twice, in the second instance replacing each procedure (convert) in the final stage with two procedures: ppmtopnm, then pnmtojpeg. Find the differences between the two workflow runs. Q: ProceducedInBetween(‘atlas_x.jpg’, ?Y, ?P, softmean). A1: ?P=convert, slicer, softmean A2: ?P=pnmtojpeg, ppmtopng, slicer, softmean

Modification queries Modification queries allow the ability to couple queries with updates to define new procedures, annotations, and work requests. F-logic supports updating the knowledge base, such as inserting and deleting facts and rules on-the-fly. In our case, we can specify that for each call to convert, we insert two consecutive calls ppmtopnm and pnmtojpeg, and then delete the calls to convert, in this way, the workflow is transformed into a new one with similar functionality.

Conclusions and future work We have described FLOQ, our Frame Logic based approach to representing and querying scientific workflow provenance. We have demonstrated that our previously identified provenance problems and challenges can be addressed by this approach. We plan to formally define the syntax and semantic of FLOQ and address the scalability issue: To develop a partitioning scheme for provenance data so that different portions of provenance information are loaded into memory dynamically as needed. To investigate database persistence techniques for logic programs where tables can be stored externally in relational databases and loaded on-demand..