Published by Philip Harvey. Modified over 9 years ago.
1
Karma Provenance Collection Framework for Data-driven Workflows
Yogesh Simmhan, Microsoft Research
Beth Plale, Dennis Gannon, Ai Zhang, Girish Subramanian, Abhijit Borude, et al., Indiana University
2
Putting the ‘e’ in e-Science
Many scientific domains are moving to in silico experiments: Earth Sciences, Life Sciences, Astronomy
Common requirements
◦ Complex & Dynamic Systems
◦ Adaptive Resources
◦ Data Deluge
◦ Need for Collaboration
Cyberinfrastructure to support these needs
◦ Massively Parallel Systems
◦ High Bandwidth Computer Networks
◦ Petascale Data Archives
Grid Middleware provides the glue to tie these together using a Service Oriented Architecture
3
Workflows as Experiments
Data-driven applications are designed as workflows
Data flows across applications as it is transformed, fused, and used, generating derived data
Control flow determines the execution path, but data flow determines data movement and dependency
Manually keeping track of the input & derived data of experiments is challenging given the volume of data and the complexity of the applications
4
Data Management Challenges
Complex, dynamic data-processing pipelines
Remote execution on Grid resources
◦ How was a particular dataset created?
Collaboratory environments with shared resources
Large search space & missing metadata
◦ How good is a given dataset for one’s application?
5
Data Provenance
Provenance /ˈprɒvənəns, -ˌnɑns/: The history or pedigree of a work of art, manuscript, etc. A record of the ultimate derivation and passage of an item through its various owners. Source: The Oxford English Dictionary
Metadata that describes the causality of an event
◦ Along with context to interpret it
What, when, where, who, how, …
We consider provenance for
◦ Workflow execution
◦ Service invocations
◦ Data products
Workflow & Service Provenance
◦ Describes execution of a workflow & invocation of a service
Data Provenance
◦ Describes usage and generation of data products
6
Benefits
What if the experiment fails?
◦ Did the workflow run correctly? Completely?
◦ Was the correct data/service/parameter used?
◦ Verification, Validation
Can my peer run the experiment & get the same result?
◦ Repeatability
Can I use the results in my publication?
◦ Attribution, Copyright
Can I trust the results of prediction?
◦ Data Quality
How much did it cost? How much will it cost?
◦ Resource Usage & Prediction
7
[7/43][2007-08-16] LEAD Science Gateway Architecture
[Figure: architecture diagram. Labels: User’s Grid Desktop; Grid Portal Server; Gateway Services; Core Grid Services; Proxy Certificate Server (Vault); Events & Messaging; Resource Broker; Community & User Metadata Catalog; Workflow engine; Resource Registry; Application Deployment; Execution Management; Information Services; Self Management; Data Services; Resource Management; Security Services; Resource Virtualization (OGSA); Compute Resources; Data Resources; Instruments & Sensors]
8
What is the Karma Provenance Framework?
A standalone framework to collect data provenance for adaptive workflows, with low overhead and a lightweight schema that can answer complex queries
Data Provenance is a form of metadata that tracks the derivation history of data created by a workflow run executing across organizations (space) over a period of time
Data Usage: Move forward in time
Workflow Trace: Inverse view from the actors
9
A Typical e-Science Experiment
Weather forecast using WRF in LEAD
[Figure: workflow stages labeled Pre-Processing, Assimilation, Visualization, Forecast]
10
Workflows: Abstract Workflow Model
Temporal & Spatial composition
◦ Data Flow vs. Invocation Flow
Central vs. Distributed Orchestration
Assumption: Directed Graph of Service Nodes & Data Edges
◦ Data Driven Applications
Hierarchical Composition: Workflows are a form of Service
Workflow definition not required
Standalone, independent of Workflow System
[Figure legend: Provides Port, Uses Port, Data Flow]
11
Workflows: Simple & Complex Workflow Models
[Figure: workflow diagrams. Labels: Workflow WF1, Workflow WF2, Service S1, Service S2, Service S3, data products D1, D2, D3, D4]
12
Provenance Framework in Support of Data Quality Estimation
Activities: Collecting Provenance
Activities are generated during the lifecycle of a workflow
“Sensors” generate activities: instrumentation of services, clients
Track execution across space, time, depth & operation
◦ Space: which service
◦ Time: when (logical time)
◦ Depth: distance from invocation root (client » workflow » service … nested workflows)
◦ Operation: track dataflow
18 activities defined
Support Dynamic, Adaptive Workflows
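The four tracking dimensions on this slide can be sketched as a small record type. This is only an illustrative sketch: the class name `Activity`, its field names, and the sample values are assumptions made for the example, not Karma's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """Hypothetical activity record covering the four tracked dimensions."""
    kind: str          # operation: what happened, e.g. "DataProduced"
    service_id: str    # space: which service emitted the activity
    logical_time: int  # time: logical timestep, not wall-clock time
    depth: int         # depth: distance from the invocation root


def order_trace(activities):
    """Order activities by logical time, then by depth, to read them as a trace."""
    return sorted(activities, key=lambda a: (a.logical_time, a.depth))


# Activities may arrive out of order from distributed sensors.
trace = order_trace([
    Activity("DataProduced", "S1", 3, 2),
    Activity("ServiceInvoked", "S1", 1, 2),
    Activity("WorkflowInitialized", "WF", 0, 1),
])
print([a.kind for a in trace])
```

Sorting on logical time rather than arrival order reflects the slide's point that time is tracked logically, which matters when sensors on different hosts report independently.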
13
Instrumentation of Services & WF
[Figure: instrumentation diagram. Labels: WF Engine, Web Service, WS Client, WF Tracking (in two places), Karma Service]
14
Karma Architecture
[Figure: architecture diagram]
Workflow Engine runs a Workflow Instance of Service 1, Service 2, …, Service 9, Service 10; 10 data products consumed & produced by each service (edge labels 10P, 10C, 10P/10C)
Publish Provenance Activities as Notifications
◦ Application–Started & –Finished, Data–Produced & –Consumed Activities
◦ Workflow–Started & –Finished Activities
Message Bus: WS-Messenger Notification Broker, WS-Eventing Service API
Karma Provenance Service: Provenance Listener, Activity DB
◦ Subscribe & Listen to Activity Notifications
Provenance Query API: Provenance Browser Client
◦ Query for Workflow, Process, & Data Provenance
A Framework for Collecting Provenance in Data-Centric Scientific Workflows, Simmhan, Y., et al.; ICWS, 2006
15
Service Invocation State Diagram
[Figure: state diagram with CLIENT and SERVICE lanes]
States: Invoking Service, Service Invoked, Service Invocation Failed, Data Transfer In, Computation, Data Consumed, Data Produced, Data Transfer Out, Sending Result, Sending Fault, Received Response
Phases: Start, I/P Staging, Computation, O/P Staging, End
16
Activities: Types & Source
Activity                                    Generated By
[Service | Workflow] Initialized            Service
[Service | Workflow] Terminated             Service
Invoking Service                            Client
Service Invoked                             Service
Invoking Service [Succeeded | Failed]       Client
Data Transfer                               Service
Computation                                 Service
Data Produced                               Service
Data Consumed                               Service
Sending [Result | Fault]                    Service
Received [Result | Fault]                   Client
Sending Response [Succeeded | Failed]       Service
Type (row-group labels in the original table): Independent, Bounding, Operational, Bounding
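The activity table can be written as a lookup from activity type to the component that generates it. This is a minimal sketch: the identifier spellings (for example `ServiceInitialized`) are hypothetical and were formed by expanding each bracketed alternative in the table into its own entry. Expanded this way the table yields 18 entries, which agrees with the "18 activities defined" stated on the Collecting Provenance slide.

```python
# Hypothetical identifiers; each bracketed alternative in the table is one entry.
GENERATED_BY = {
    "ServiceInitialized": "service",
    "WorkflowInitialized": "service",
    "ServiceTerminated": "service",
    "WorkflowTerminated": "service",
    "InvokingService": "client",
    "ServiceInvoked": "service",
    "InvokingServiceSucceeded": "client",
    "InvokingServiceFailed": "client",
    "DataTransfer": "service",
    "Computation": "service",
    "DataProduced": "service",
    "DataConsumed": "service",
    "SendingResult": "service",
    "SendingFault": "service",
    "ReceivedResult": "client",
    "ReceivedFault": "client",
    "SendingResponseSucceeded": "service",
    "SendingResponseFailed": "service",
}


def is_valid_source(activity, source):
    """Return True if `source` is the component expected to emit `activity`."""
    return GENERATED_BY.get(activity) == source


print(len(GENERATED_BY))
print(is_valid_source("InvokingService", "client"))
```

A collector could use such a check to reject, say, a `DataProduced` activity reported by a client rather than by the service that did the work.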
17
Activities: Sequence Diagram for Basic Workflow
[Figure: sequence diagram between Client (C) and Service (S) for data products D1 and D2. Axes: Time, Space, Operation, Depth. Operation groups: Transfer, Consume, Compute, Produce]
Activities in time order:
S: Initialize
C: Invoke Service
S: Invoked
C: Invocation Successful
S: Transfer Input Data D1
S: Consume Data D1
S: Perform Computation
S: Produce Data D2
S: Transfer Output Data D2
S: Send Response
C: Receive Response
S: Send Response Successful
S: Terminate
18
Sequence Diagram for Simple Workflow
[Figure: sequence diagram. Labels: Workflow Engine, Workflow WF, Service S1, Service S2, data products D1, D2, D3]
19
Activities: Naming
Uniquely identifying data & services is critical for provenance
Data products have a GUID. Replicas have URLs.
Service & Workflow instances have a GUID
Services defined in the context of workflows have a Node ID in the workflow name space
Clients have a GUID
Entity: 4-tuple ◦
Invocation: 2-tuple ◦
20
Activities: Provenance Activity Contents
Activity Type
Source Entity: 4-tuple ◦
Remote Entity: 4-tuple
Attributes
◦ todo
Annotations
21
Activities: Modeling Activities in XML
<notificationSource workflowNodeID="ConvertService_4" workflowTimestep="36" workflowID="tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1" serviceID="urn:qname:http://www.extreme.indiana.edu/karma/challenge06:ConvertService" />
2006-09-10T23:56:28.677Z
Convert Service was Invoked......
<initiator serviceID="tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1" />
<notificationSource workflowNodeID="ConvertService_4" workflowTimestep="36" workflowID="tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1" serviceID="urn:qname:http://www.extreme.indiana.edu/karma/challenge06:ConvertService" />
2006-09-10T23:56:32.324Z
lead:uuid:1157946992-atlas-x.gif
gsiftp://tyr1.cs.indiana.edu/tmp/20060910235628_Convert/outputData/atlas-x.gif
2006-09-10T23:56:32.324Z
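As a rough illustration of how a consumer might read the `notificationSource` element shown on this slide, the sketch below parses its four attributes with Python's standard `xml.etree.ElementTree`. Treating those four attributes as the entity "4-tuple" mentioned on the Naming slide is an assumption (the slide does not list the tuple's fields), and the `Entity` type and `parse_entity` helper are hypothetical names.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


class Entity(NamedTuple):
    """Hypothetical 4-field entity; the field choice is an assumption."""
    workflow_id: str
    service_id: str
    workflow_node_id: str
    workflow_timestep: int


# Element copied from the slide, with straight quotes so that it parses.
SAMPLE = (
    '<notificationSource workflowNodeID="ConvertService_4" '
    'workflowTimestep="36" '
    'workflowID="tag:gpel.leadproject.org,2006:69B/'
    'ProvenanceChallengeBrainWorkflow17/instance1" '
    'serviceID="urn:qname:http://www.extreme.indiana.edu/karma/'
    'challenge06:ConvertService" />'
)


def parse_entity(xml_text):
    """Build an Entity from the attributes of a notificationSource element."""
    attrs = ET.fromstring(xml_text).attrib
    return Entity(
        workflow_id=attrs["workflowID"],
        service_id=attrs["serviceID"],
        workflow_node_id=attrs["workflowNodeID"],
        workflow_timestep=int(attrs["workflowTimestep"]),
    )


entity = parse_entity(SAMPLE)
print(entity.workflow_node_id, entity.workflow_timestep)
```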
22
Activities: Publishing Activities as Notifications
Activities are modeled as notifications that are sent by different components
◦ Loosely coupled, easy to generate provenance
XML representation of provenance activities
WS-Messenger Notification Broker acts as message bus
◦ WS-Eventing & WS-Notification
Provenance service & interested clients subscribe to notifications
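The loose coupling described here follows the publish/subscribe pattern. The sketch below is an in-memory stand-in for that pattern only; it does not use the WS-Messenger, WS-Eventing, or WS-Notification APIs, and the class name, topic name, and `<serviceInvoked/>` payload are hypothetical.

```python
from collections import defaultdict


class NotificationBroker:
    """In-memory stand-in for a message bus (not the WS-Messenger API)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a callback to receive every message published on `topic`."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, activity_xml):
        """Deliver one activity to all subscribers; the publisher knows none of them."""
        for callback in self._subscribers[topic]:
            callback(activity_xml)


broker = NotificationBroker()
activity_db = []   # plays the role of the provenance service's store
monitor_log = []   # plays the role of another interested client

broker.subscribe("provenance", activity_db.append)
broker.subscribe("provenance", monitor_log.append)

# An instrumented service publishes without knowing who is listening.
broker.publish("provenance", "<serviceInvoked/>")
print(activity_db, monitor_log)
```

Because publishers address only the broker, a monitoring client can be added or removed without touching the instrumented services.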
23
Backend: Provenance Database
~Union of provenance model
Provenance incrementally built
Relational database (MySQL)
24
Information Model: Data Provenance View
Entity is the state of a service or a client
Invocation relates a client (invoker) to a service (invokee). Status.
Data provenance of produced data relates invocation with consumed data
Lightweight schema
[Figure: Client ENTITY (Invoker) sends Request to Service ENTITY (Invokee), which returns Response]
Karma2: Provenance Management for Data Driven Workflows, Simmhan, Y., et al.; J. Web Svc. Res., 2008
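One way to picture this model is a store that builds invocation records incrementally as activities arrive and derives data provenance by relating a produced data product to its invocation and that invocation's consumed data. This is a sketch under stated assumptions: the classes, field names, activity identifiers, and status strings are hypothetical, and an in-memory dict stands in for the relational database.

```python
from dataclasses import dataclass, field


@dataclass
class Invocation:
    """Relates a client (invoker) to a service (invokee), with a status."""
    invoker: str
    invokee: str
    status: str = "in progress"
    consumed: list = field(default_factory=list)
    produced: list = field(default_factory=list)


class ProvenanceStore:
    """Hypothetical in-memory store, updated one activity at a time."""

    def __init__(self):
        self.invocations = {}

    def record(self, activity, invoker, invokee, data_id=None):
        """Fold a single activity into the invocation it belongs to."""
        key = (invoker, invokee)
        inv = self.invocations.setdefault(key, Invocation(invoker, invokee))
        if activity == "DataConsumed":
            inv.consumed.append(data_id)
        elif activity == "DataProduced":
            inv.produced.append(data_id)
        elif activity == "SendingResponseSucceeded":
            inv.status = "succeeded"
        return inv

    def data_provenance(self, data_id):
        """Relate a produced data product to its invocation and consumed data."""
        for inv in self.invocations.values():
            if data_id in inv.produced:
                return {
                    "data": data_id,
                    "invoker": inv.invoker,
                    "invokee": inv.invokee,
                    "status": inv.status,
                    "consumed": list(inv.consumed),
                }
        return None


store = ProvenanceStore()
store.record("DataConsumed", "WF1", "S1", "D1")
store.record("DataProduced", "WF1", "S1", "D2")
store.record("SendingResponseSucceeded", "WF1", "S1")
result = store.data_provenance("D2")
print(result)
```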
25
Information Model: Data Provenance & Usage Views
[Figure: Client ENTITY (Invoker) sends Request to Service ENTITY (Invokee), which returns Response]
26
Information Model: Workflow & Process Provenance Views
[Figure: Client ENTITY (Invoker) sends Request to Service ENTITY (Invokee), which returns Response]
27
Dissemination: Querying Provenance
All 5 provenance models can be queried by ID
◦ Data Provenance (by Data ID)
◦ Recursive Data Provenance (by Data ID, depth)
◦ Data Usage (by Data ID)
◦ Process Provenance (by Invoker & Invokee)
◦ Workflow Trace (by Invoker & Invokee, depth)
Service API to query and return results as an XML document
Provenance Challenge Workshop
◦ Direct API, Incremental client, Graph matching algorithm
Incremental building of complex queries
Query Capabilities of the Karma Provenance Framework, Simmhan, Y., et al.; 1st Provenance Challenge & CCPE J., 2007
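The recursive data provenance and data usage queries listed above can be sketched over a toy lineage table. The table contents (`D1` to `D3`, `S1`, `S2`), the function names, and the dict-shaped results are assumptions made for illustration; the slide states that the real service API returns XML documents.

```python
# Toy lineage: data product -> (producing service, consumed inputs).
PRODUCED_BY = {
    "D2": ("S1", ["D1"]),
    "D3": ("S2", ["D2"]),
}


def data_provenance(data_id, depth):
    """Walk backwards from a data product, up to `depth` derivation steps."""
    if depth <= 0 or data_id not in PRODUCED_BY:
        return {"data": data_id}
    service, inputs = PRODUCED_BY[data_id]
    return {
        "data": data_id,
        "produced_by": service,
        "inputs": [data_provenance(i, depth - 1) for i in inputs],
    }


def data_usage(data_id):
    """Look forwards: which services consumed this data product?"""
    return sorted(
        service
        for service, inputs in PRODUCED_BY.values()
        if data_id in inputs
    )


print(data_provenance("D3", 2))
print(data_usage("D2"))
```

The `depth` argument bounds how far back the recursion goes, so a depth of 1 corresponds to plain (non-recursive) data provenance in this sketch.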
28
Applications: Process Monitoring Realtime Monitoring using XBaya
29
Applications: Information Integration Visual Exploration using Karma GUI
30
Performance & Scalability Study: Experimental Setup
[Figure: experimental setup diagram]
Provenance clients (generate provenance, query provenance): Odin compute cluster (128 nodes, odin001 … odin128)
◦ Dual-processor 2.0 GHz 64-bit Opteron, 4 GB RAM
Provenance service components: Tyr web-services cluster (16 nodes; tyr10, tyr11, tyr12, tyr13 shown)
◦ Dual-processor 2.0 GHz 64-bit Opteron, 16 GB RAM, local IDE disk
◦ Karma Service, WS-Messenger Notification Broker, MySQL
◦ PReServ in Tomcat 5.0 container, embedded Java DB
Gigabit Ethernet, local IDE disk storage
SLURM job manager for parallel job submission on Odin
Java 1.5, Jython
31
Performance & Scalability Study: Collecting Provenance
Comparative study of Karma with PReServ (U. Soton)
Provenance services on tyr (2 GHz/16 GB/64-bit) & clients on odin (2 GHz/4 GB/64-bit)
Time to collect provenance activities synchronously
1. Single service with increasing number of service invocations
◦ Karma scales linearly
2. Linear workflow with increasing number of data products produced/consumed
◦ Karma scales linearly, PReServ constant
Performance Evaluation of the Karma Provenance Framework, Simmhan, Y., et al.; IPAW & LNCS 4145, 2006
32
Performance & Scalability Study: Collecting Provenance
Time to collect provenance from a simulated ensemble WRF forecasting workflow
Scalability with increasing # of parallel runs
◦ 1–20 concurrent workflows
◦ Karma scales sub-linearly
Performance Evaluation of the Karma Provenance Framework, Simmhan, Y., et al.; IPAW & LNCS 4145, 2006
33
Performance & Scalability Study: Querying Provenance
Response time to query workflow, process, and data provenance from Karma (PReServ was an order of magnitude slower)
Scalability with increasing # of concurrent clients
◦ Karma contains 1000 workflow invocations
◦ Query for 20 workflow/200 process/200 data provenance documents
Performance Evaluation of the Karma Provenance Framework, Simmhan, Y., et al.; IPAW & LNCS 4145, 2006
34
Related Work
PReServ, U. of Southampton (Luc Moreau)
◦ Standalone, annotation support
◦ No data provenance, workflow concept; poor performance
VisTrails, U. of Utah (Juliana Freire)
◦ Workflows for graphical modeling
◦ Constrained to browser
PASS, Harvard U. (Margo Seltzer)
◦ System level provenance
◦ No service/data abstraction
Trio, Stanford U. (Jennifer Widom)
◦ Tuple level provenance on database operations
◦ Restricted to databases
Data Collector, IBM (alphaWorks)
◦ Automatically record & track SOAP messages
◦ No data provenance
35
What is new in Karma3?
Process control flow tracking
Vertical integration across applications
◦ Support for database queries
Process & data abstraction
Mining provenance logs
◦ WF composition
◦ Semantic support (S-OGSA)