Provenance GGF18 Kepler/COW+RWS, Bowers, McPhillips et al. Provenance Management in a Collection-Oriented Scientific Workflow Framework.


Provenance Management in a Collection-Oriented Scientific Workflow Framework, aka Kepler/DAKS
(for Luc’s collection: before: “We do provenance!”; now: “… and it almost killed us!”)
Shawn Bowers, Timothy McPhillips, Bertram Ludaescher
in collaboration with Ilkay Altintas and Norbert Podhorszki

Goals for the Provenance Challenge
Implement an RWS-style provenance model for Collection-Oriented Scientific Workflows
Take advantage of Collection-Oriented SWFs to
–Automatically infer state-reset events
–Reduce the number of provenance-relevant events that need to be recorded (keep it minimal)
–Simplify association of traces and provenance into one self-contained “trace” file for input, output, and dependencies
Support science-oriented provenance and queries
–Emphasize data dependencies (lineage) as well as process details
Decouple provenance representation from any particular scientific workflow technology (Kepler)

Collection-Oriented Workflows
Generic support for workflows that operate over nested data collections (trees)
Abstract Model
–Actors receive input trees, read the contents of subtrees matching some criteria (scope), and optionally add or delete subtree nodes
–Each scope instance corresponds to one actor invocation
[Figure: the AlignWarp actor with Scope = AnatomyImage, shown over trees of AnatomyImage collections containing Image, ReferenceImage, and WarpParamSet nodes.]
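
The abstract model above can be illustrated with a minimal Python sketch. It is not the Kepler/COW implementation: the `Node` class, `find_scopes`, `run_actor`, and the toy `align_warp` function are hypothetical names introduced here, and the sketch assumes that a scope is simply a label match on a subtree root.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A labeled node in a nested data collection (hypothetical structure)."""
    label: str
    children: list = field(default_factory=list)


def find_scopes(root, scope_label):
    """Yield each subtree whose root label matches the actor's scope."""
    if root.label == scope_label:
        yield root
        return  # assumption: matching subtrees are not searched for nested matches
    for child in root.children:
        yield from find_scopes(child, scope_label)


def run_actor(root, scope_label, actor):
    """Invoke the actor once per scope instance; return the invocation count."""
    count = 0
    for subtree in find_scopes(root, scope_label):
        count += 1
        actor(subtree, count)
    return count


def align_warp(anatomy_image, invocation):
    """Toy stand-in for AlignWarp: adds a WarpParamSet node to its scope."""
    anatomy_image.children.append(Node("WarpParamSet"))
```

Calling `run_actor(tree, "AnatomyImage", align_warp)` on a tree with four AnatomyImage collections would invoke the actor four times, once per scope instance, and leave the rest of the tree untouched.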

Collection-Oriented Workflows
Kepler Implementation
–Collections are serialized within heterogeneous token streams
–Actor execution is pipelined based on each actor’s scope
–Enables concurrent processing of nested data collections
–Collections can contain data, metadata, actor parameters, and other collections
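
One way to picture a nested collection serialized into a flat token stream is with paired open/close delimiter tokens. The sketch below assumes that encoding; the token names (`"open"`, `"data"`, `"close"`) and the tuple-based tree representation are illustrative choices, not the actual Kepler token types.

```python
def serialize(item):
    """Flatten a nested collection into a stream of (kind, value) tokens.

    A collection is a (label, children) tuple; a data item is a plain string.
    """
    if isinstance(item, str):
        yield ("data", item)
        return
    label, children = item
    yield ("open", label)
    for child in children:
        yield from serialize(child)
    yield ("close", label)


def deserialize(tokens):
    """Rebuild the nested collection from a token stream using a stack."""
    stack = [[]]   # child lists of the collections currently open
    labels = []    # labels of the collections currently open
    for kind, value in tokens:
        if kind == "open":
            stack.append([])
            labels.append(value)
        elif kind == "close":
            children = stack.pop()
            stack[-1].append((labels.pop(), children))
        else:
            stack[-1].append(value)
    return stack[0][0]
```

Because each token is self-describing, a downstream actor could in principle begin working on the first complete subtree before the rest of the stream has arrived, which is the property that pipelining relies on.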

Collection-Oriented Provenance Challenge SWF
Input data is read by the collection reader
–Execution is driven by the number and size of the anatomy image sets specified in an XML file
Slicer is configured on the fly via parameter tokens
–E.g., to create the 3 slices required for each image set
Output trace is serialized into XML by the collection writer
–Trace implicitly contains input data, output data, and lineage

Collection-Oriented Provenance
Embedded Provenance Tokens
–Data and invocation dependencies are stored as tokens within the stream
–Actor API for declaring data dependencies
–Invocation dependencies are added automatically
Data Dependencies
–Insertion and deletion events capture actor, invocation count, and direct data dependencies
Process Dependencies
–Invocation dependencies record which steps created data or modified collections used by another actor invocation
[Figure: insertion dependencies among AnatomyImage, Image, ReferenceImage, and WarpParamSet.]
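
As a rough sketch of how invocation dependencies might be derived automatically from insertion events, the Python below records each insertion with its actor, invocation count, and direct data dependencies, then infers which invocations depend on which. The `Insertion` record, its field names, and the example actor name `Reslice` are assumptions made for illustration; the actual actor API is not shown in the slides.

```python
from collections import namedtuple

# Hypothetical record for one insertion event in the trace.
Insertion = namedtuple("Insertion", "node actor invoc depends_on")


def invocation_dependencies(events):
    """Derive (dependent invocation, upstream invocation) pairs.

    An invocation depends on another if it inserted a node whose direct
    data dependencies include a node inserted by that other invocation.
    """
    creator = {e.node: (e.actor, e.invoc) for e in events}
    deps = set()
    for e in events:
        current = (e.actor, e.invoc)
        for source in e.depends_on:
            upstream = creator.get(source)  # None for workflow inputs
            if upstream is not None and upstream != current:
                deps.add((current, upstream))
    return deps
```

Nodes that no actor inserted (the workflow inputs) have no creator, so they contribute data dependencies but no invocation dependencies.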

Minimal Provenance Information
[Figure: two panels, “Without Provenance” and “With Provenance”.]

Querying Collection-Oriented Provenance
Execution traces imply provenance graphs
–Graph edges encode data lineage and process relations
–Lineage(Trace, Node, DependentNode, Actor, InvocCount)
Provenance operations work over traces and graphs:
–Input(Trace, Node)
–Output(Trace, Node)
–Param(Trace, Name, Value, Actor, InvocCount)
–Metadata(Trace, Key, Value, Node)
–etc.
[Figure: two graphs, “Data/Collection creation lineage” and “Collection ‘last version’ lineage”, over the nodes AtlasImage (308), Image (311), Header (312), and AtlasSlice (337), with edges labeled Slicer : 1.]

Challenge Results
We used two different runs
–Each run has embedded metadata and parameter settings
–First run is equivalent to the challenge workflow
–Second run contains three sets of image collections, each with a different number of images
[Figure: input to first run. WorkflowInput contains an ImageCollection of four AnatomyImage collections, labeled with ReferenceImage1 to ReferenceImage4 (Image) and Header1 to Header4 (Header).]

Challenge Results
[Figure: input to second run. WorkflowInput contains several ImageCollection nodes, each holding AnatomyImage collections; the remaining contents are elided with “…” on the slide.]

Challenge Results (Trace 1)
Full Data Dependencies
Query:
?- trace(1, T), nodeId(T, 341, N1), nodeId(T, 349, N2), nodeId(T, 357, N3),
   lineageEdges(T, [N1, N2, N3], Edges), drawEdges(Edges).

Challenge Results (Trace 1)
Question 1: Process that led to Atlas X Graphic
Query:
?- trace(1, T), nodeId(T, 341, N), lineageEdges(T, N, Edges), drawEdges(Edges).
Returns a subset of the lineage edges
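
The slides do not show how `lineageEdges` is defined. One plausible reading, sketched below in Python, is a backward traversal that collects every lineage edge reachable from the requested nodes. The edge tuples mirror the argument order of Lineage(Trace, Node, DependentNode, Actor, InvocCount) without the trace; interpreting the second element as the node depended upon is an assumption, and the node IDs and actor names in the usage note are made up.

```python
def lineage_edges(edges, targets):
    """Return all edges reachable backwards from the target nodes.

    edges: iterable of (node, depends_on, actor, invoc) tuples.
    targets: iterable of node IDs whose lineage is requested.
    """
    by_node = {}
    for edge in edges:
        by_node.setdefault(edge[0], []).append(edge)

    seen, result, stack = set(), [], list(targets)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for edge in by_node.get(node, []):
            result.append(edge)
            stack.append(edge[1])  # continue with the node this one depends on
    return result
```

With hypothetical edges `(20, 10, "B", 1)`, `(10, 1, "A", 1)`, `(10, 2, "A", 1)`, `(21, 11, "B", 2)`, and `(11, 3, "A", 2)`, asking for the lineage of node 20 returns only the first three edges. Edges belonging to node 21 are left out, so the result is a subset of the full lineage graph.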

Challenge Results (Trace 2)
Question 1: Process that led to Atlas X Graphic
Query:
?- trace(2, T), nodeId(T, 973, N1), nodeId(T, 1093, N2), nodeId(T, 1193, N3),
   lineageEdges(T, [N1, N2, N3], Edges), drawEdges(Edges).
A single workflow run in which not all outputs depend on all inputs.

Summary
Benefits of our approach
–Provenance support for Collection-Oriented SWFs
–Minimal provenance information stored in a self-contained trace file
–Provenance automatically embedded within the data stream; simple actor provenance API
–Able to answer the provenance challenge queries using simple operations (see Wiki entry); note that we ignored question 7
Suggestions for Future Provenance Challenges
–More complex/realistic workflows (e.g., from bioinformatics): loops, nesting, partial dependencies, concurrency
–More “scientist-oriented” provenance queries: explicit queries for data dependencies (e.g., see Wiki entry); assume the user doesn’t know the structure of the trace (Query 5)

References
An Approach for Pipelining Nested Collections in Scientific Workflows. Timothy McPhillips and Shawn Bowers. SIGMOD Record 34, 12-17, 2005.
A Model for User-Oriented Data Provenance in Pipelined Scientific Workflows. Shawn Bowers, Timothy McPhillips, Bertram Ludaescher, Shirley Cohen, Susan B. Davidson. International Provenance and Annotation Workshop (IPAW'06), 2006.
Collection-Oriented Scientific Workflows for Integrating and Analyzing Biological Data. Timothy McPhillips, Shawn Bowers, Bertram Ludaescher. 3rd International Workshop on Data Integration in the Life Sciences (DILS'06), 2006.