Karma Provenance Collection Framework for Data-driven Workflows
Yogesh Simmhan, Microsoft Research
Beth Plale, Dennis Gannon, Ai Zhang, Girish Subramanian, Abhijit Borude, et al., Indiana University

Putting the ‘e’ in e-Science
Many scientific domains are moving to in silico experiments: Earth Sciences, Life Sciences, Astronomy
Common requirements
◦ Complex & dynamic systems
◦ Adaptive resources
◦ Data deluge
◦ Need for collaboration
Cyberinfrastructure to support these needs
◦ Massively parallel systems
◦ High-bandwidth computer networks
◦ Petascale data archives
Grid middleware provides the glue to tie these together using a Service Oriented Architecture

Workflows as Experiments
Data-driven applications are designed as workflows
Data flows across applications as it is transformed, fused and used, generating derived data
Control flow determines the path to execute, but data flow determines data movement and dependency
Manually keeping track of the input & derived data of experiments is challenging given the volume of data and the complexity of the applications

Data Management Challenges
Complex, dynamic data-processing pipelines
Remote execution on Grid resources
How was a particular dataset created?
Collaboratory environments with shared resources
Large search space & missing metadata
How good is a given dataset for one’s application?

Data Provenance
Metadata that describes the causality of an event
◦ Along with context to interpret it
What, when, where, who, how, …
We consider provenance for
◦ Workflow execution
◦ Service invocations
◦ Data products
Workflow & Service Provenance
◦ Describes execution of a workflow & invocation of a service
Data Provenance
◦ Describes usage and generation of data products
Provenance /ˈprɒvənəns, -ˌnɑns/: The history or pedigree of a work of art, manuscript, etc. A record of the ultimate derivation and passage of an item through its various owners. Source: The Oxford English Dictionary

Benefits
What if the experiment fails?
◦ Did the workflow run correctly? Completely?
◦ Was the correct data/service/parameter used?
◦ Verification, Validation
Can my peer run the experiment & get the same result?
◦ Repeatability
Can I use the results in my publication?
◦ Attribution, Copyright
Can I trust the results of prediction?
◦ Data Quality
How much did it cost? How much will it cost?
◦ Resource Usage & Prediction

LEAD Science Gateway Architecture
[Figure: layered architecture diagram. User’s Grid Desktop; Grid Portal Server; Gateway Services (Proxy Certificate Server (Vault), Events & Messaging, Resource Broker, Community & User Metadata Catalog, Workflow engine, Resource Registry, Application Deployment); Core Grid Services (Execution Management, Information Services, Self Management, Data Services, Resource Management, Security Services); Resource Virtualization (OGSA); Compute Resources, Data Resources, Instruments & Sensors]

What is Karma Provenance Framework
A standalone framework
◦ to collect data provenance
◦ for adaptive workflows
◦ with low overhead and lightweight schema
◦ able to answer complex queries
Data Provenance is a form of metadata to track the derivation history of data created by a workflow run executing across organizations (space) over a period of time
Data Usage: Move forward in time
Workflow trace: Inverse view from the actors

A Typical e-Science Experiment
Weather forecast using WRF in LEAD
[Figure labels: Pre-Processing, Assimilation, Visualization, Forecast]

Workflows: Abstract Workflow Model
Temporal & Spatial composition
◦ Data Flow vs. Invocation Flow
Central vs. Distributed Orchestration
Assumption: Directed Graph of Service Nodes & Data Edges
◦ Data Driven Applications
Hierarchical Composition: Workflows are a form of Service
Workflow definition not required
Standalone, independent of Workflow System
[Figure labels: Provides Port, Uses Port, Data Flow]

Workflows: Simple & Complex Workflow Models
[Figure: simple and complex workflow models built from Workflow WF1, Workflow WF2, Service S1, Service S2, Service S3 and data products D1, D2, D3, D4]

Activities: Collecting Provenance
Activities are generated during the lifecycle of a workflow
“Sensors” generate activities: instrumentation of services, clients
Track execution across space, time, depth & operation
◦ Space: which service
◦ Time: when (logical time)
◦ Depth: distance from invocation root (client » workflow » service … nested workflows)
◦ Operation: Track dataflow
18 activities defined
Support Dynamic, Adaptive Workflow
[Footer: Provenance Framework in Support of Data Quality Estimation]

Instrumentation of Services & WF
[Figure labels: WF Engine, Web Service, WS Client, WF Tracking, Karma Service]

Karma Architecture
[Figure: a Workflow Engine runs a workflow instance of 10 services (Service 1 … Service 10), with 10 data products consumed & produced by each service (10P/10C). Provenance activities are published as notifications (Application Started & Finished, Data Produced & Consumed, Workflow Started & Finished) to the message bus, the WS-Messenger Notification Broker (WS-Eventing Service API). In the Karma Provenance Service, the Provenance Listener subscribes & listens to activity notifications and feeds the Activity DB. A Provenance Browser Client uses the Provenance Query API to query for workflow, process, & data provenance.]
A Framework for Collecting Provenance in Data-Centric Scientific Workflows, Simmhan, Y., et al.; ICWS, 2006

Service Invocation State Diagram
[Figure: state diagram with SERVICE and CLIENT lanes. States: Invoking Service, Service Invoked, Service Invocation Failed, Data Transfer In, Computation, Data Consumed, Data Produced, Data Transfer Out, Sending Result, Sending Fault, Received Response. Phases: Start, I/P Staging, Computation, O/P Staging, End]

Activities: Types & Source
Activity                                   Generated By
[Service | Workflow] Initialized           Service
[Service | Workflow] Terminated            Service
Invoking Service                           Client
Service Invoked                            Service
Invoking Service [Succeeded | Failed]      Client
Data Transfer                              Service
Computation                                Service
Data Produced                              Service
Data Consumed                              Service
Sending [Result | Fault]                   Service
Received [Result | Fault]                  Client
Sending Response [Succeeded | Failed]      Service
[Type groupings shown on the slide: Independent, Bounding, Operational, Bounding]
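The activity table above can be written down as a small lookup, which also makes it easy to check the count. The sketch below is illustrative only: the names `TABLE`, `expand` and `ACTIVITY_SOURCE` are made up here and are not part of Karma. Expanding each `[A | B]` alternative gives 18 concrete activity names, which agrees with the "18 activities defined" stated on the Collecting Provenance slide.

```python
import re
from itertools import product

# Rows transcribed from the slide: (activity pattern, generating component).
TABLE = [
    ("[Service | Workflow] Initialized", "Service"),
    ("[Service | Workflow] Terminated", "Service"),
    ("Invoking Service", "Client"),
    ("Service Invoked", "Service"),
    ("Invoking Service [Succeeded | Failed]", "Client"),
    ("Data Transfer", "Service"),
    ("Computation", "Service"),
    ("Data Produced", "Service"),
    ("Data Consumed", "Service"),
    ("Sending [Result | Fault]", "Service"),
    ("Received [Result | Fault]", "Client"),
    ("Sending Response [Succeeded | Failed]", "Service"),
]


def expand(pattern):
    """Expand every '[A | B]' group in a pattern into its concrete alternatives."""
    parts = re.split(r"(\[[^\]]*\])", pattern)
    choices = []
    for part in parts:
        if part.startswith("["):
            choices.append([alt.strip() for alt in part[1:-1].split("|")])
        else:
            choices.append([part])
    return ["".join(combo).strip() for combo in product(*choices)]


# Concrete activity name -> component that generates it.
ACTIVITY_SOURCE = {
    name: source for pattern, source in TABLE for name in expand(pattern)
}

print(len(ACTIVITY_SOURCE))
```

A lookup like this lets a listener reject an activity that arrives from the wrong kind of component, for example a "Received Fault" reported by a service rather than a client.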

Sequence Diagram for Basic Workflow
[Figure: activities exchanged between a Client (C) and a Service (S) for input data D1 and output data D2, plotted along the time, space, depth and operation (Transfer, Consume, Compute, Produce) dimensions. Activities in time order: S: Initialize; C: Invoke Service; S: Invoked; C: Invocation Successful; S: Transfer Input Data D1; S: Consume Data D1; S: Perform Computation; S: Produce Data D2; S: Transfer Output Data D2; S: Send Response; C: Receive Response; S: Send Response Successful; S: Terminate]

Sequence Diagram for Simple Workflow
[Figure: sequence diagram involving the Workflow Engine, Workflow WF, Service S1, Service S2 and data products D1, D2, D3]

Activities: Naming
Uniquely identifying data & services is critical for provenance
Data product has GUID. Replicas have URLs.
Service & Workflow instances have GUID
Services defined in the context of workflows have a Node ID in the workflow name space
Clients have GUID
Entity: 4-tuple
Invocation: 2-tuple

Activities: Provenance Activity Contents
Activity Type
Source Entity: 4-tuple
Remote Entity: 4-tuple
Attributes
◦ todo
Annotations

Activities: Modeling Activities in XML
<notificationSource workflowNodeID="ConvertService_4" workflowTimestep="36" workflowID="tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1" serviceID="urn:qname: />
T23:56:28.677Z
Convert Service was Invoked
<initiator serviceID="tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1" />
<notificationSource workflowNodeID="ConvertService_4" workflowTimestep="36" workflowID="tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1" serviceID="urn:qname: />
T23:56:32.324Z
lead:uuid: atlas-x.gif
gsiftp://tyr1.cs.indiana.edu/tmp/ _Convert/outputData/atlas-x.gif
T23:56:32.324Z
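As a rough illustration of how a listener might read the source of such a notification, the sketch below parses a single notificationSource element with Python's standard xml.etree.ElementTree. The attribute names are the ones visible on the slide; the serviceID value is truncated in the transcript, so the sample uses an obvious placeholder. The `NotificationSource` class and `parse_source` function are hypothetical names, not Karma's API, and treating these four attributes as one record is an assumption made for the example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class NotificationSource:
    """The four attributes carried by a notificationSource element."""
    workflow_id: str
    service_id: str
    workflow_node_id: str
    workflow_timestep: int


# Sample element; the serviceID value is a PLACEHOLDER because the
# original value is cut off in the slide transcript.
SAMPLE = (
    '<notificationSource workflowNodeID="ConvertService_4" '
    'workflowTimestep="36" '
    'workflowID="tag:gpel.leadproject.org,2006:69B/'
    'ProvenanceChallengeBrainWorkflow17/instance1" '
    'serviceID="urn:qname:PLACEHOLDER" />'
)


def parse_source(xml_text):
    """Parse one notificationSource element into a typed record."""
    elem = ET.fromstring(xml_text)
    return NotificationSource(
        workflow_id=elem.attrib["workflowID"],
        service_id=elem.attrib["serviceID"],
        workflow_node_id=elem.attrib["workflowNodeID"],
        workflow_timestep=int(elem.attrib["workflowTimestep"]),
    )


source = parse_source(SAMPLE)
print(source.workflow_node_id, source.workflow_timestep)
```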

Activities: Publishing Activities as Notifications
Activities are modeled as notifications that are sent by different components
◦ Loosely coupled, easy to generate provenance
XML representation of provenance activities
WS-Messenger Notification Broker acts as message bus
◦ WS-Eventing & WS-Notification
Provenance service & interested clients subscribe to notifications
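The loose coupling described above can be illustrated with a toy publish/subscribe loop. This is a minimal in-memory sketch, not WS-Messenger, WS-Eventing or WS-Notification: the `InMemoryBroker` class, the `Activity` record and the topic name "provenance" are all hypothetical. The point it shows is that the publishing components never reference the provenance service directly; both the provenance store and any other interested client receive the same notifications through the broker.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """A provenance activity notification (hypothetical shape)."""
    activity_type: str
    source: str
    payload: str = ""


class InMemoryBroker:
    """Toy stand-in for a notification broker: topic -> subscriber callbacks."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, activity):
        # Deliver to every subscriber; the publisher does not know who they are.
        for callback in self._subscribers[topic]:
            callback(activity)


broker = InMemoryBroker()

stored = []            # what a provenance service would persist
seen_by_monitor = []   # what a monitoring client would display

broker.subscribe("provenance", stored.append)
broker.subscribe("provenance", lambda a: seen_by_monitor.append(a.activity_type))

# An instrumented service publishes activities as it runs.
broker.publish("provenance", Activity("Service Invoked", "ConvertService_4"))
broker.publish("provenance", Activity("Data Produced", "ConvertService_4", "atlas-x.gif"))

print(len(stored), seen_by_monitor)
```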

Backend: Provenance Database
~Union of provenance models
Provenance incrementally built
Relational database (MySQL)
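To make "incrementally built" concrete, the sketch below folds activities into relational tables one at a time as they arrive, so an invocation record exists as soon as the service is invoked and is filled in by later activities. It uses Python's built-in sqlite3 in place of MySQL so that it runs standalone, and the table names, column names and status values are assumptions for illustration, not Karma's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE invocation (id TEXT PRIMARY KEY, invokee TEXT, status TEXT);
    CREATE TABLE data_use (invocation_id TEXT, data_id TEXT, role TEXT);
    """
)


def record(activity_type, invocation_id, service=None, data_id=None):
    """Fold one activity into the tables; unrecognized types are ignored."""
    if activity_type == "Service Invoked":
        conn.execute(
            "INSERT INTO invocation (id, invokee, status) VALUES (?, ?, 'invoked')",
            (invocation_id, service),
        )
    elif activity_type in ("Data Consumed", "Data Produced"):
        role = "consumed" if activity_type == "Data Consumed" else "produced"
        conn.execute(
            "INSERT INTO data_use VALUES (?, ?, ?)",
            (invocation_id, data_id, role),
        )
    elif activity_type == "Sending Response Succeeded":
        conn.execute(
            "UPDATE invocation SET status = 'completed' WHERE id = ?",
            (invocation_id,),
        )


# Activities arrive one by one; the record grows with each of them.
record("Service Invoked", "inv-1", service="S1")
record("Data Consumed", "inv-1", data_id="D1")
record("Data Produced", "inv-1", data_id="D2")
record("Sending Response Succeeded", "inv-1")

rows = conn.execute(
    "SELECT data_id, role FROM data_use WHERE invocation_id = ? ORDER BY data_id",
    ("inv-1",),
).fetchall()
status = conn.execute(
    "SELECT status FROM invocation WHERE id = ?", ("inv-1",)
).fetchone()[0]
print(rows, status)
```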

Information Model: Data Provenance View
Entity is the state of a service or a client
Invocation relates a client (invoker) to a service (invokee). Status.
Data provenance of produced data relates invocation with consumed data
Lightweight schema
[Figure labels: Data Provenance; Client ENTITY (Invoker); Service ENTITY (Invokee); Request; Response]
Karma2: Provenance Management for Data Driven Workflows, Simmhan, Y., et al.; J. Web Svc. Res., 2008

Information Model: Data Provenance & Usage Views
[Figure labels: Client ENTITY (Invoker); Service ENTITY (Invokee); Request; Response]

Information Model: Workflow & Process Provenance Views
[Figure labels: Client ENTITY (Invoker); Service ENTITY (Invokee); Request; Response]

Dissemination: Querying Provenance
All 5 provenance models can be queried for by ID
◦ Data Provenance (by Data ID)
◦ Recursive Data Provenance (by Data ID, depth)
◦ Data Usage (by Data ID)
◦ Process provenance (by Invoker & Invokee)
◦ Workflow Trace (by Invoker & Invokee, depth)
Service API to query and return results as XML Document
Provenance Challenge Workshop
◦ Direct API, Incremental client, Graph matching algorithm
Incremental building of complex queries
Query Capabilities of the Karma Provenance Framework, Simmhan, Y., et al.; 1st Provenance Challenge & CCPE J., 2007
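Three of the queries listed above (data provenance, recursive data provenance with a depth bound, and data usage) can be sketched over a handful of in-memory invocation records. This is only an illustration under stated assumptions: the record layout, the function names, the dictionary result shape (the real service returns XML documents) and the reading of "depth" as the number of derivation steps to follow backwards are all assumptions, and the S1/S2 and D1 to D4 identifiers are sample data.

```python
# Each record says which data a service invocation consumed and produced.
INVOCATIONS = [
    {"service": "S1", "consumed": ["D1"], "produced": ["D2"]},
    {"service": "S2", "consumed": ["D2"], "produced": ["D3", "D4"]},
]


def data_provenance(data_id):
    """Return the invocation that produced data_id, or None for source data."""
    for inv in INVOCATIONS:
        if data_id in inv["produced"]:
            return inv
    return None


def recursive_data_provenance(data_id, depth):
    """Follow derivations backwards from data_id for at most `depth` steps."""
    inv = data_provenance(data_id)
    if inv is None or depth <= 0:
        return {"data": data_id}
    return {
        "data": data_id,
        "producedBy": inv["service"],
        "inputs": [
            recursive_data_provenance(d, depth - 1) for d in inv["consumed"]
        ],
    }


def data_usage(data_id):
    """Forward view: which services consumed data_id."""
    return [inv["service"] for inv in INVOCATIONS if data_id in inv["consumed"]]


print(recursive_data_provenance("D3", 2))
print(data_usage("D2"))
```

Bounding the recursion by depth keeps a query cheap when only the immediate ancestry of a data product is needed, while a larger depth walks further back toward the original inputs.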

Applications: Process Monitoring
Realtime monitoring using XBaya

Applications: Information Integration
Visual exploration using Karma GUI

Performance & Scalability Study: Experimental Setup
Provenance clients (generate provenance, query provenance)
◦ Odin computer cluster (128 nodes, odin001 … odin128)
◦ Dual-Processor 2.0 GHz 64-bit Opteron, 4GB RAM
◦ SLURM job manager for parallel job submission on Odin
Provenance service components
◦ Tyr web-services cluster (16 nodes; tyr10 to tyr13 shown)
◦ Dual-Processor 2.0 GHz 64-bit Opteron, 16GB RAM, local IDE disk
◦ Karma Service, WS-Messenger Notification Broker, MySQL
◦ PReServ in Tomcat 5.0 container, Embedded Java DB
Gigabit Ethernet, local IDE disk storage
Java 1.5, Jython

Performance & Scalability Study: Collecting Provenance
Comparative study of Karma with PReServ (U. Soton)
Provenance services on tyr (2 GHz/16GB/64-bit) & clients on odin (2 GHz/4GB/64-bit)
Time to collect provenance activities synchronously
1. Single service with increasing number of service invocations
◦ Karma scales linearly
2. Linear workflow with increasing number of data produced/consumed
◦ Karma scales linearly, PReServ constant
Performance Evaluation of the Karma Provenance Framework, Simmhan, Y., et al.; IPAW & LNCS 4145, 2006

Performance & Scalability Study: Collecting Provenance
Time to collect provenance from simulated ensemble WRF forecasting workflow
Scalability with increasing # of parallel runs
◦ 1–20 concurrent workflows
◦ Karma scales sub-linearly
Performance Evaluation of the Karma Provenance Framework, Simmhan, Y., et al.; IPAW & LNCS 4145, 2006

Performance & Scalability Study: Querying Provenance
Response time to query workflow, process, and data provenance from Karma (PReServ was an order of magnitude slower)
Scalability with increasing # of concurrent clients
◦ Karma contains 1000 workflow invocations
◦ Query for 20 workflow/200 process/200 data provenance documents
Performance Evaluation of the Karma Provenance Framework, Simmhan, Y., et al.; IPAW & LNCS 4145, 2006

Related Work
PReServ, U. of Southampton (Luc Moreau)
◦ Standalone, annotation support
◦ No data provenance, workflow concept; poor performance
VisTrails, U. of Utah (Juliana Freire)
◦ Workflows for graphical modeling
◦ Constrained to browser
PASS, Harvard U. (Margo Seltzer)
◦ System level provenance
◦ No service/data abstraction
Trio, Stanford U. (Jennifer Widom)
◦ Tuple level provenance on database operations
◦ Restricted to databases
Data Collector, IBM (alphaWorks)
◦ Automatically record & track SOAP messages
◦ No data provenance

What is new in Karma3?
Process control flow tracking
Vertical integration across applications
◦ Support for database queries
Process & data abstraction
Mining provenance logs
◦ WF composition
◦ Semantic support (S-OGSA)