End-to-End Workflow Monitoring: Panorama 360
George Papadimitriou, georgpap@isi.edu
http://pegasus.isi.edu

Panorama 360: Project Overview
- Leverage the Pegasus WMS to structure, execute, and monitor workflow execution
- Characterize performance: instrument data capture, summarize data, and publish results
- Create an open-access common repository for storing end-to-end workflow performance and resource data captured using a variety of tools (open for external contributors)
- Apply and develop ML techniques for workflow performance analysis and infrastructure troubleshooting
- Record findings, distill best practices, and share and refine them with other program teams
Goals:
1) Tools that enable end-to-end capture of performance and resource usage
2) An open-access data repository
3) ML techniques for workflow performance analysis and infrastructure troubleshooting
4) Highlighted best practices

Data Sources: Application and Infrastructure
- Pegasus Stampede: events about the workflow and its status
- Pegasus Kickstart Online: collects resource usage traces in real time, at intervals as short as 1 second
- Darshan: collects file access statistics (e.g., POSIX, MPI-IO) during execution
- Globus: collects transfer statistics and general information about each transfer (throughput, file transfer errors, etc.)

Pegasus Kickstart Online Traces
In the Panorama branch, pegasus-kickstart supports fine-grained monitoring by invoking a helper tool called pegasus-monitor. pegasus-monitor can pull resource usage statistics for running workflow tasks at a predefined time interval; the minimum interval is 1 second. The statistics are read from the /proc entry of the running task and include:
- Number of processes and threads
- stime and utime
- Bytes read and written
- iowait
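To illustrate the kind of data available under /proc on Linux, here is a minimal Python sketch of such a sampler; it is not the pegasus-monitor implementation, and the field offsets follow the proc(5) man page:

    import os

    def sample_proc(pid):
        """Sample CPU, thread, and I/O statistics for one process from /proc (Linux)."""
        with open(f"/proc/{pid}/stat") as f:
            # Strip "pid (comm) " so the remaining fields start at field 3 of proc(5)
            fields = f.read().rsplit(")", 1)[1].split()
        ticks = os.sysconf("SC_CLK_TCK")  # clock ticks per second
        stats = {
            "utime_sec": int(fields[11]) / ticks,   # field 14: user-mode CPU time
            "stime_sec": int(fields[12]) / ticks,   # field 15: kernel-mode CPU time
            "threads": int(fields[17]),             # field 20: num_threads
            "iowait_sec": int(fields[39]) / ticks,  # field 42: delayacct_blkio_ticks
        }
        # /proc/<pid>/io requires permission to inspect the target process
        with open(f"/proc/{pid}/io") as f:
            io = dict(line.split(": ") for line in f.read().splitlines())
        stats["read_bytes"] = int(io["read_bytes"])
        stats["write_bytes"] = int(io["write_bytes"])
        return stats

    print(sample_proc(os.getpid()))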

Darshan
Darshan is a lightweight, application-level I/O profiling tool for HPC that captures statistics on the behavior of HPC I/O operations. It captures data for each file opened by the application, including I/O operation counts, common I/O access sizes, cumulative timers, etc. I/O behavior is captured for the POSIX I/O, MPI-IO, HDF5, and Parallel netCDF data interface layers. It also captures a set of job-level characteristics, such as the number of application processes, the job's start and end times, and the unique job identifier provided by the scheduler. In Panorama, we expose only the accumulated performance data from the STDIO and POSIX modules.
Reference: https://www.mcs.anl.gov/research/projects/darshan/

Globus
Globus is a research data management service built on top of GridFTP. It can be used to transfer files for your own computations or to share files with the community. For every transfer request, Globus creates logs containing transfer statistics, such as:
- Request and completion time
- Source and destination
- Transfer rate
- Number of failures
Reference: https://www.globus.org

Data Sources: Problems
- The data are scattered across multiple locations (e.g., the execution site, a cloud service, the Pegasus logs)
- They contain no metadata about the workflow, which makes them very hard to locate and match later
- The captured data do not share a common format:
  - Pegasus Kickstart logs are in XML format
  - Pegasus Stampede events are in JSON format
  - Pegasus Kickstart online logs are in JSON format
  - Globus logs are in JSON format
  - Darshan logs are in binary format

Data Collection: End-to-End Workflow Execution Monitoring
- Pegasus, apart from planning and running the workflow, orchestrates the data collection
- A message queueing system is used to decouple the publishers from the datastore
- Flexible search and visualization engines are used to explore the data

Data Collection: Architecture Overview

Data Collection: Tool Enhancements and New Tools
- pegasus-monitord: extended with a JSON output format, the ability to pick up job-related monitoring messages, and publishing to AMQP endpoints
- pegasus-transfer: extended to support Globus transfers and to publish statistics in JSON format to AMQP endpoints
- pegasus-darshan: a wrapper around darshan-parser that pushes Darshan logs in JSON format to pegasus-monitord
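To illustrate the publish side of this pattern, here is a minimal Python sketch that pushes one JSON message to a RabbitMQ exchange using the pika library; the URL, exchange name, and payload are hypothetical, not the tools' actual output:

    import json
    import pika  # pip install pika

    # Hypothetical endpoint and event; real values come from the Pegasus configuration.
    url = "amqp://panorama:panorama@monitoring.example.edu:5672/%2F"
    payload = {"event": "stampede.job_inst.main.end", "status": 0}

    connection = pika.BlockingConnection(pika.URLParameters(url))
    channel = connection.channel()
    channel.exchange_declare(exchange="panorama", exchange_type="topic", durable=True)
    channel.basic_publish(exchange="panorama",
                          routing_key=payload["event"],
                          body=json.dumps(payload))
    connection.close()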

Visualization: Detailed Workflow and Job Characteristics

Visualization: Time Series Data of Workflow Performance
- Cumulative time spent in user and system execution
- Average CPU utilization
- Bytes read and written
- Number of compute threads

Repository: Organization

ElasticSearch Index    Description
panorama_transfer      Globus logs
panorama_kickstart     Pegasus-Kickstart online traces
panorama_stampede      Workflow events and Darshan logs
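As an illustration of querying these indices, here is a hedged Python sketch using Elasticsearch's standard _search API; the assumption that the REST API is exposed directly at the repository host is mine, not from the slides:

    import requests  # pip install requests

    # Assumption: the repository exposes the Elasticsearch REST API at this host.
    resp = requests.get(
        "https://data.panorama.isi.edu/panorama_kickstart/_search",
        json={"query": {"match_all": {}}, "size": 5},
        timeout=30,
    )
    resp.raise_for_status()
    for hit in resp.json()["hits"]["hits"]:
        print(hit["_index"], hit["_id"])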

Repository: Open Access Data
- https://data.panorama.isi.edu
- https://kibana.panorama.isi.edu

How to Deploy: Prerequisites
- HTCondor 8.6+: https://research.cs.wisc.edu/htcondor/downloads/
- Pegasus Panorama:
  - Compile from source: https://github.com/pegasus-isi/pegasus/tree/panorama
  - Pre-compiled binaries: http://download.pegasus.isi.edu/pegasus/4.9.2panorama/
- Docker 17.02+: https://docs.docker.com/install/
- Docker Compose: https://docs.docker.com/compose/

How to Deploy: Monitoring Backend (RabbitMQ, ELK Stack)
On a host that has Docker and Docker Compose installed, clone https://github.com/Panorama360/data-collection-arch, change to the cloned directory, and execute: docker-compose up -d
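Example (the repository URL is from the slide; the rest is standard git and Docker Compose usage):

    git clone https://github.com/Panorama360/data-collection-arch
    cd data-collection-arch
    docker-compose up -d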

How to Deploy: Checking Services (RabbitMQ, ELK Stack)
The host should now have RabbitMQ, Elasticsearch, Logstash, and Kibana running as Docker containers with their service ports exposed. Try to access them:
- RabbitMQ: http://<hostname or ip>:15672
- Elasticsearch: http://<hostname or ip>:9200
- Logstash: http://<hostname or ip>:9600
- Kibana: http://<hostname or ip>:5601
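A quick command-line sanity check against the same ports (the paths are the services' standard status endpoints; substitute your hostname, and your credentials if you changed RabbitMQ's defaults):

    curl http://localhost:9200/                              # Elasticsearch cluster info
    curl http://localhost:9600/_node                         # Logstash node info
    curl http://localhost:5601/api/status                    # Kibana status
    curl -u guest:guest http://localhost:15672/api/overview  # RabbitMQ management API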

How to Deploy: Enabling Stampede Events
To get pegasus-monitord to publish all of its events to an AMQP endpoint in JSON format, three properties must be set in the workflow's properties file (e.g., "pegasus.properties"):

pegasus.monitord.encoding = json
pegasus.catalog.workflow.amqp.url = amqp://[username:password]@<hostname>[:port]/<exchange_name>
pegasus.catalog.workflow.amqp.events = stampede.*

More about stampede events: https://pegasus.isi.edu/documentation/stampede_wf_events.php
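Example, with a hypothetical host, credentials, and exchange name substituted into the template above:

    pegasus.monitord.encoding = json
    pegasus.catalog.workflow.amqp.url = amqp://panorama:panorama@monitoring.example.edu:5672/panorama
    pegasus.catalog.workflow.amqp.events = stampede.*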

How to Deploy: Enabling Transfer Events
To get pegasus-transfer to publish transfer statistics from the Globus Transfer Service to an AMQP endpoint in JSON format, two profiles must be set in the workflow's sites catalog (e.g., "sites.xml"), under the site where pegasus-transfer is going to be invoked (e.g., "local"):

env.PEGASUS_TRANSFER_PUBLISH = 1
env.PEGASUS_AMQP_URL = amqp://[username:password]@<hostname>[:port]/<exchange_name>
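Example, as a sites.xml fragment (the profile syntax is the Pegasus XML site catalog's; the hostname and credentials are placeholders):

    <site handle="local" arch="x86_64" os="LINUX">
        ...
        <profile namespace="env" key="PEGASUS_TRANSFER_PUBLISH">1</profile>
        <profile namespace="env" key="PEGASUS_AMQP_URL">amqp://panorama:panorama@monitoring.example.edu:5672/panorama</profile>
    </site>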

How to Deploy: Enabling Kickstart Online Traces
To get pegasus-kickstart to publish traces of resource usage statistics to an AMQP endpoint in JSON format, two profiles must be set in the workflow's sites catalog (e.g., "sites.xml") under the compute site:

pegasus.gridstart.arguments = -m <interval in seconds>
env.KICKSTART_MON_URL = rabbitmq://[USERNAME:PASSWORD]@<hostname>[:port]/api/exchanges/<exchange_name>/publish

Alternatively, to customize the monitoring interval per computational task, the profile can be specified in the workflow's transformation catalog (e.g., "tx.txt").
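Example, as a sites.xml fragment for the compute site (a 10-second interval; the hostname, credentials, and exchange name are placeholders, and the URL follows the RabbitMQ HTTP API template above):

    <site handle="condorpool" arch="x86_64" os="LINUX">
        ...
        <profile namespace="pegasus" key="gridstart.arguments">-m 10</profile>
        <profile namespace="env" key="KICKSTART_MON_URL">rabbitmq://panorama:panorama@monitoring.example.edu:15672/api/exchanges/panorama/publish</profile>
    </site>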

How to Deploy: Enabling Kickstart Online Traces (MPI Jobs)
MPI jobs are usually not launched by pegasus-kickstart, so adding the gridstart.arguments profile has no effect. We can work around this by using a wrapper script for the MPI job that invokes pegasus-monitor directly. We still need to specify KICKSTART_MON_URL in the sites catalog.
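A minimal sketch of such a wrapper, assuming pegasus-monitor is on the PATH, picks up KICKSTART_MON_URL from the environment, and monitors the job it is started alongside; the invocation details are assumptions, not documented behavior, and the application name and rank count are placeholders:

    #!/bin/bash
    # KICKSTART_MON_URL is expected to be set via the sites catalog.
    pegasus-monitor &     # assumption: samples this job and publishes to the endpoint
    MONITOR_PID=$!
    mpiexec -n 64 ./my_mpi_app "$@"
    RC=$?
    kill $MONITOR_PID 2>/dev/null
    exit $RC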

How to Deploy: Enabling Darshan Logs (MPI Jobs)
If your MPI application was not compiled and statically linked against Darshan's library, we need to set a profile in the transformation catalog that adds the path of the library to LD_PRELOAD. We then launch the application through a wrapper shell script that, as post-job steps:
- Builds the Darshan log path from the environment variables
- Invokes pegasus-darshan with the log files as input
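A sketch of such a wrapper, assuming Darshan's default <log_dir>/<year>/<month>/<day> log layout, a Slurm job ID in the log filename, and a hypothetical pegasus-darshan command line (its real flags are not shown in the slides):

    #!/bin/bash
    # LD_PRELOAD=/path/to/libdarshan.so is injected via the transformation catalog profile.
    mpiexec -n 64 ./my_mpi_app "$@"
    RC=$?
    # Post-job step 1: build the Darshan log path from the environment (layout assumption)
    LOG_DIR="${DARSHAN_LOG_DIR}/$(date +%Y/%-m/%-d)"
    LOG_FILE=$(ls -t "${LOG_DIR}"/*_id${SLURM_JOB_ID}_*.darshan 2>/dev/null | head -1)
    # Post-job step 2: hypothetical invocation; emits the monitoring payload on stdout
    pegasus-darshan "${LOG_FILE}"
    exit $RC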

How to Deploy: Enabling Darshan Logs (MPI Jobs)
pegasus-darshan outputs a monitoring payload on stdout; this is picked up by pegasus-monitord, which in turn publishes it to the AMQP endpoint. This mechanism can also serve as a generic way of adding new tools to the architecture.

DEMO

George Papadimitriou
Computer Science PhD Student, University of Southern California
email: georgpap@isi.edu
GitHub: https://github.com/Panorama360
Website: https://panorama360.github.io

Pegasus (est. 2001)
Automate, recover, and debug scientific computations.
Get Started:
- Pegasus Website: http://pegasus.isi.edu
- Users Mailing List: pegasus-users@isi.edu
- Support: pegasus-support@isi.edu
- Pegasus Online Office Hours: https://pegasus.isi.edu/blog/online-pegasus-office-hours/ (held on a bi-monthly basis, on the second Friday of the month, where we address user questions and apprise the community of new developments)