Monitoring Infrastructure for Grid Scientific Workflows
Bartosz Baliś, Marian Bubak
Institute of Computer Science and ACC CYFRONET AGH, Kraków, Poland
WORKS08, Austin, Texas, November 17th, 2008
Outline
Challenges in Monitoring of Grid Scientific Workflows
GEMINI infrastructure
Event model for workflow execution monitoring
On-line workflow monitoring support
Information model for recording workflow executions
Motivation
Monitoring of Grid Scientific Workflows is important in many scenarios:
On-line & off-line performance analysis, dynamic resource reconfiguration, on-line steering, performance optimization, provenance tracking, experiment mining, experiment repetition, ...
Consumers of monitoring data: humans (provenance) and processes
On-line & off-line scenarios
Historic records: provenance, retrospective analysis (enhancement of next executions)
Grid Scientific Workflows
Traditional scientific applications: parallel, homogeneous, tightly coupled
Scientific workflows: distributed, heterogeneous, loosely coupled; legacy applications often in the backends; Grid environment
Challenges for monitoring arise
Challenges
Monitoring infrastructure that conceals workflow heterogeneity
Event subscription and instrumentation requests
Standardized event model for Grid workflow execution
Currently events are tightly coupled to workflow environments
On-line monitoring support
Existing Grid information systems not suitable for fast notification-based discovery
Monitoring information model to record executions
GEMINI: monitoring infrastructure
Standardized, abstract interfaces for subscription and instrumentation
Complex Event Processing: subscription management via continuous querying
Event representation:
XML: self-describing, extensible, but poor performance
Google Protocol Buffers: under investigation
Monitors: query & subscription engine, event caching, services
Sensors: lightweight collectors of events
Mutators: manipulation of monitored entities (e.g. dynamic instrumentation)
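To make the XML representation concrete, here is a minimal Java sketch of the kind of self-describing event payload such a monitoring infrastructure might exchange; the element and attribute names (event, wfId, attr) and the WorkflowEvent class are invented for this illustration and are not the actual GEMINI schema.

import java.time.Instant;
import java.util.Map;

// Hypothetical sketch only: not the actual GEMINI event class or schema.
public class WorkflowEvent {
    // A minimal self-describing event: type, workflow id, timestamp, free-form attributes.
    public static String toXml(String type, String workflowId, Map<String, String> attrs) {
        StringBuilder xml = new StringBuilder();
        xml.append("<event type=\"").append(type)
           .append("\" wfId=\"").append(workflowId)
           .append("\" timestamp=\"").append(Instant.now()).append("\">");
        for (Map.Entry<String, String> a : attrs.entrySet()) {
            xml.append("<attr name=\"").append(a.getKey()).append("\">")
               .append(a.getValue()).append("</attr>");
        }
        return xml.append("</event>").toString();
    }

    public static void main(String[] args) {
        System.out.println(toXml("started.workflow", "Wf_1234", Map.of("initiator", "enactment engine")));
    }
}

The free-form attributes are what keep such a format extensible, which makes XML attractive despite its serialization overhead (hence the interest in Protocol Buffers).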
Outline
Event model for workflow execution monitoring
On-line workflow monitoring support
Information model for recording workflow executions
Workflow execution events
Motivation: capture commonly used monitoring measurements concerning workflow execution
Attempts to standardize monitoring events exist, but are oriented towards resource monitoring:
GGF DAMED 'Top N'
GGF NMWG Network Performance Characteristics
Typically, monitoring systems introduce a single event type for application events
Workflow Execution Events – taxonomy
Extension of GGF DAMED Top N events
Extensible hierarchy; example extensions:
Loop entered – started.codeRegion.loop
MPI app invocation – invoking.application.MPI
MPI calls – started.codeRegion.call.MPISend
Application-specific events
Events for initiators and performers: invoking, invoked; started, finished
Events for various execution levels: workflow, task, code region, data operations
Events for various execution states: failed, suspended, resumed, ...
Events for execution metrics: progress, rate
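One reason a dotted naming scheme is convenient is that subscriptions can match whole subtrees of the hierarchy by prefix. The Java sketch below illustrates that idea under an assumed prefix-subsumption rule; the rule and the helper are illustrative, not part of the published taxonomy.

// Hypothetical sketch of the dotted, extensible event-type hierarchy.
// The prefix-matching rule below is an assumption, not a documented GEMINI semantics.
public class EventTypes {
    public static final String STARTED_LOOP     = "started.codeRegion.loop";
    public static final String STARTED_MPI_SEND = "started.codeRegion.call.MPISend";
    public static final String INVOKING_MPI_APP = "invoking.application.MPI";

    // A subscriber interested in "started.codeRegion" would also match the more
    // specific loop and MPI-call events, if prefix subsumption is used.
    public static boolean isSubtypeOf(String eventType, String ancestor) {
        return eventType.equals(ancestor) || eventType.startsWith(ancestor + ".");
    }

    public static void main(String[] args) {
        System.out.println(isSubtypeOf(STARTED_MPI_SEND, "started.codeRegion")); // true
        System.out.println(isSubtypeOf(INVOKING_MPI_APP, "started.codeRegion")); // false
    }
}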
Outline
Event model for workflow execution monitoring
On-line workflow monitoring support
Information model for recording workflow executions
On-line Monitoring of Grid Workflows
Motivation: reaction to time-varying resource availability and application demands; up-to-date execution status
Typical scenario: 'subscribe to all execution events related to workflow Wf_1234'
Distributed producers, not known a priori
Prerequisite: automatic resource discovery of workflow components
New producers are automatically discovered and transparently receive appropriate active subscription requests
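A hedged sketch of how the typical scenario might look on the consumer side; the Monitor/Subscription interfaces and the query syntax below are invented for illustration and do not reflect the actual GEMINI API.

import java.util.function.Consumer;

// Hypothetical consumer-side sketch; this Monitor/Subscription API is invented
// for illustration and is not the actual GEMINI interface.
public class OnlineMonitoringClient {

    interface Subscription { void cancel(); }

    interface Monitor {
        // Register a continuous query; matching events are pushed to the callback as
        // they arrive, including events from producers discovered only later.
        Subscription subscribe(String continuousQuery, Consumer<String> onEvent);
    }

    public static void main(String[] args) {
        // In-memory stub standing in for a real Monitor endpoint.
        Monitor monitor = (query, onEvent) -> {
            onEvent.accept("started.workflow wfId=Wf_1234");  // demo event
            return () -> { /* nothing to cancel in the stub */ };
        };

        // The typical scenario from the slide: all execution events of workflow Wf_1234.
        Subscription sub = monitor.subscribe(
            "SELECT * FROM events WHERE wfId = 'Wf_1234'",
            event -> System.out.println("execution event: " + event));
        sub.cancel();
    }
}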
Resource discovery in workflow monitoring
Challenge: complex execution life cycle of a Grid workflow
Abstract workflows: mapping of tasks to resources at runtime
Many services involved: enactment engines, resource brokers, schedulers, queue managers, execution managers, ...
No single place to subscribe for notifications about new workflow components
Discovery for monitoring must proceed bottom-up: (1) local discovery, (2) global advertisement, (3) global discovery
Problem: current Grid information services are not suitable
Oriented towards query performance; slow propagation of resource status changes
Example: average delay from event occurrence to notification in the EGEE infrastructure ~ 200 seconds (Berger et al., Analysis of Overhead and Waiting Times in the EGEE Production Grid)
Resource discovery: solution
What kind of resource discovery is required?
Identity-based, not attribute-based
Full-blown information service functionality not needed – just a simple, efficient key-value store
Solution: a DHT infrastructure federated with the monitoring infrastructure to store the shared state of monitoring services
Key = workflow identifier
Value = producer record (Monitoring service URL, etc.)
Multiple values (= producers) can be registered
Efficient key-value stores:
OpenDHT
Amazon Dynamo: efficiency, high availability, scalability; lack of strong data consistency ('eventual consistency')
Avg get/put delay ~ 15/30 ms; 99th percentile ~ 200/300 ms (DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store)
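The put/get contract that the monitoring services need from the DHT is small. The following Java sketch mimics it with an in-memory map purely to illustrate the register/lookup pattern (key = workflow id, list of producer records as values); the class and method names are invented, and a real deployment would sit on OpenDHT or a Dynamo-style store instead.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch of the key-value usage pattern described on the slide.
// The in-memory map only illustrates the put/get contract, not the distributed store.
public class ProducerRegistry {
    private final Map<String, List<String>> dht = new ConcurrentHashMap<>();

    // Local discovery + global advertisement: a Monitor that starts observing a
    // workflow component registers itself under the workflow id.
    public void register(String workflowId, String producerRecord) {
        dht.computeIfAbsent(workflowId, k -> new CopyOnWriteArrayList<>()).add(producerRecord);
    }

    // Global discovery: a consumer looks up all producers for a workflow and can
    // then forward its active subscriptions to each of them.
    public List<String> lookup(String workflowId) {
        return dht.getOrDefault(workflowId, List.of());
    }

    public static void main(String[] args) {
        ProducerRegistry registry = new ProducerRegistry();
        registry.register("Wf_1234", "http://monitor-a.example.org/gemini");
        registry.register("Wf_1234", "http://monitor-b.example.org/gemini");
        System.out.println(registry.lookup("Wf_1234"));
    }
}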
Monitoring + DHT (simplified architecture)
DHT-based scenario
Evaluation
Goal: measure performance & scalability; comparison with a centralized approach
Main characteristic measured: delay between the occurrence of a new workflow component and the beginning of data transfer, for different workloads
Two methodologies:
Queuing Network models with multiple classes, analytical solution
Simulation models (CSIM simulation package)
1st methodology: Queuing Networks
Solved analytically
(a) DHT solution QN model
(b) Centralized solution QN model
2nd methodology: discrete-event simulation
CSIM simulation package
Input parameters for models
Workload intensity
Measured in job arrivals per second
Taken from EGEE, a large-scale production infrastructure: 3000 to ... jobs per day
Assumed range: from 0.3 to 10 job arrivals per second
Service demands
Monitors and Coordinator: prototypes built and measured
DHT: from available reports on large-scale deployments (OpenDHT, Amazon Dynamo)
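As a rough illustration of the analytical methodology above, the Java sketch below evaluates a single-class open queueing network over the assumed arrival-rate range using the standard operational formulas U = λD and R = D/(1−U) per queueing center; the per-center service demands in the code are illustrative placeholders, not the measured values from the service demand matrices, and the models actually used are multi-class.

// Simplified, single-class open queueing network sketch (the real models are multi-class);
// the service demands below are illustrative placeholders, not measured values.
public class OpenQnModel {

    // Residence time at a single queueing center: R = D / (1 - U), with U = lambda * D.
    static double residenceTime(double arrivalRate, double serviceDemand) {
        double utilization = arrivalRate * serviceDemand;
        if (utilization >= 1.0) return Double.POSITIVE_INFINITY; // saturated center
        return serviceDemand / (1.0 - utilization);
    }

    public static void main(String[] args) {
        double[] demands = {0.030, 0.015, 0.005}; // e.g. Monitor, DHT put, DHT get (illustrative, seconds)
        for (double lambda = 0.3; lambda <= 10.0; lambda *= 2) { // job arrivals per second
            double total = 0.0;
            for (double d : demands) total += residenceTime(lambda, d);
            System.out.printf("lambda=%.1f/s -> end-to-end delay %.3f s%n", lambda, total);
        }
    }
}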
Service demand matrices
Results (centralized model)
Results (DHT model)
Scalability comparison: centralized vs. DHT
Conclusion: DHT solution scalable as expected, but the centralized solution can still handle relatively large workloads before saturation
Outline
Event model for workflow execution monitoring
On-line workflow monitoring support
Information model for recording workflow executions
Information model for wf execution records
Motivation: need for structured information about past experiments executed as scientific workflows in e-Science environments
Provenance querying
Mining over past experiments
Experiment repetition
Execution optimization based on history
State of the art:
Monitoring information models do exist, but for resource monitoring (GLUE), not execution monitoring
Provenance models are not sufficient
Repositories for performance data are oriented towards event traces or simple performance-oriented information
Experiment Information (ExpInfo) model
Ontologies used to describe the model and represent the records
ExpInfo model
A simplified example with particular domain ontologies
ExpInfo model: set of ontologies
General experiment information: purpose, execution stages, input/output data sets
Provenance information: who, where, why, data dependencies
Performance information: duration of computation stages, scheduling, queueing, performance metrics (possible)
Resource information: physical resources (hosts, containers) used in the computation
Connection with domain ontologies: data sets with the Data ontology, execution stages with the Application ontology
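For readers who prefer code to ontology diagrams, the sketch below flattens the four information areas of one ExpInfo record into a plain Java class; the field names are invented for illustration, since the actual model is expressed as OWL ontologies rather than Java types.

import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical flattening of one ExpInfo record; field names are illustrative only.
public class ExperimentRecord {
    // General experiment information
    String purpose;
    List<String> executionStages;
    List<String> inputDataSets;
    List<String> outputDataSets;

    // Provenance information
    String who;
    String where;
    String why;
    List<String> dataDependencies;

    // Performance information
    Instant started;
    Instant finished;
    Duration computationTime;
    Duration queueingTime;

    // Resource information
    List<String> hosts;
    List<String> containers;

    // Links to domain ontologies (data sets -> Data ontology, stages -> Application ontology)
    List<String> dataOntologyRefs;
    List<String> applicationOntologyRefs;
}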
Aggregation of data to information
From monitoring events to ExpInfo records
Standardized process described by aggregation rules and derivation rules
Aggregation rules specify how to instantiate individuals; ontology classes are associated with aggregation rules through object properties
Derivation rules specify how to compute attributes, including object properties (= associations between individuals); attributes are associated with derivation rules via annotations
The Semantic Aggregator collects wf execution events and produces ExpInfo records according to the aggregation and derivation rules
Aggregation rules

ExperimentAggregation:
  eventTypes = started.workflow, finished.workflow
  instantiatedClass = protos/Experiment
  ecidCoherency = 1

ComputationAggregation:
  eventTypes = invoking.wfTask, invoked.wfTask
  instantiatedClass = protos/Computation
  acidCoherency = 2
Derivation rules (expressed as ext-ns:Derivation XML annotations)

The simplest case – an XML element mapped directly to a functional property:
  MonitoringData/experimentStarted/ownerLogin

A more complex case specifies which XML elements are needed and how to compute the attribute via a delegate class:
  delegate: cyfronet.gs.aggregator.delegates.ExpPlugin
  required elements:
    software.execution.started.application/time
    software.execution.finished.application/time
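To show how a delegate-based derivation rule might work end to end, here is a hypothetical Java sketch of the kind of computation a class like cyfronet.gs.aggregator.delegates.ExpPlugin could perform, deriving a duration attribute from the two listed time elements; the derive signature and the choice of derived attribute are assumptions, not the actual plugin interface.

import java.time.Duration;
import java.time.Instant;
import java.util.Map;

// Hypothetical sketch of a derivation-rule delegate; the interface and the derived
// attribute (a duration) are assumptions, not the actual ExpPlugin API.
public class DurationDerivation {

    // The aggregator would pass in the values of the XML elements listed in the rule,
    // keyed by their paths, and expect the derived attribute value back.
    public static Object derive(Map<String, String> elements) {
        Instant started  = Instant.parse(elements.get("software.execution.started.application/time"));
        Instant finished = Instant.parse(elements.get("software.execution.finished.application/time"));
        return Duration.between(started, finished);
    }

    public static void main(String[] args) {
        System.out.println(derive(Map.of(
            "software.execution.started.application/time",  "2008-11-17T10:00:00Z",
            "software.execution.finished.application/time", "2008-11-17T10:12:30Z")));
    }
}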
Applications
Coordinated Traffic Management
Executed within the K-Wf Grid infrastructure for workflows
Workflows with legacy backends
Instrumentation & tracing
Drug Resistance application
Executed within the ViroLab virtual laboratory for infectious diseases (virolab.cyfronet.pl)
Recording executions, provenance querying, visual ontology-based querying based on the ExpInfo model
Conclusion
Several monitoring challenges specific to Grid scientific workflows
Standardized taxonomy for workflow execution events
DHT infrastructure to improve performance of resource discovery and enable on-line monitoring
Information model for recording workflow executions
Future Work
Enhancement of event & information models
Work in progress; requires an extensive review of existing systems to enhance the event taxonomy, event data structures and the information model
Model enhancement & validation
Performance of large-scale deployment
Classification of workflows w.r.t. generated workloads (preliminary study: S. Ostermann, R. Prodan, and T. Fahringer, A Trace-Based Investigation of the Characteristics of Grid Workflows)
Information model for workflow status
Similar to resource status in information systems