Workflows and the Pegasus Workflow Management System

Scientific Workflows
Enable automation of complex, multi-step pipelines
Provide reliable execution on unreliable infrastructure
Support reproducibility of computations
Can be shared and reused with other data/parameters/algorithms
Enable recording of data provenance
Support distributed, parallel execution to reduce time to solution

Benefits of Scientific Workflows (from the point of view of an application scientist)
Framework for collaboration and composition:
Conducts a series of computational tasks, with resources distributed across the Internet
Chaining (outputs become inputs) replaces manual hand-offs and accelerates creation of products
Ease of use: gives non-developers access to sophisticated codes; avoids the need to download, install, and learn how to use someone else's code
Provides a framework to host or assemble a community set of applications
Honors original codes and allows for heterogeneous coding styles
Framework to define common formats or standards when useful; promotes exchange of data, products, codes, and community metadata
Multi-disciplinary workflows can promote even broader collaborations (e.g., ground motions fed into a simulation of building shaking)
Certain rules or guidelines make it easier to add a code into a workflow
Slide courtesy of David Okaya, SCEC, USC

Science-grade Mosaic of the Sky

Science-grade Mosaic of the Sky: Montage Workflow
Stages: re-projection, background rectification, co-addition (input images to output mosaic)

Size of mosaic (square degrees) | Input data files | Tasks | Intermediate files | Total data footprint | Cumulative wall time
1 | 84   | 387   | 850   | 1.9 GB | 21 mins
2 | 300  | 1442  | 3176  | 6.8 GB | 54 mins
4 | 685  | 3738  | 8258  | 18 GB  | 3 hours, 18 mins
6 | 1461 | 7462  | 16458 | 37 GB  | 7 hours, 7 mins
8 | 2565 | 12757 | 28113 | 64 GB  | 11 hours, 44 mins

Hunting Exoplanets with Kepler (http://kepler.nasa.gov)
Kepler continuously monitors the brightness of over 175,000 stars
Search for periodic dips in signals as Earth-like planets transit in front of the host star
Over 380,000 light curves have been released
Periodogram analysis of a single Kepler light curve can take 1 hour
Need to perform a bulk analysis of all the data to search for these periodic signals
(Figure: Kepler-6b transit)
210K input files, 630K output files; 210K tasks total

Southern California Earthquake Center CyberShake PSHA Workflow
Builders ask seismologists: "What will the peak ground motion be at my new building in the next 50 years?"
Seismologists answer this question using Probabilistic Seismic Hazard Analysis (PSHA)
286 sites, 4 models
Each site = one workflow
Each workflow has 420,000 tasks in 21 jobs

Some workflows are structurally complex
Examples: gravitational-wave physics workflow, genomics workflow

Some workflows are large-scale and data-intensive
Montage Galactic Plane Workflow (image credit: John Good, Caltech)
18 million input images (~2.5 TB)
900 output images (2.5 GB each, 2.4 TB total)
10.5 million tasks (34,000 CPU hours)
Need to support hierarchical workflows and scale (× 17)

Sometimes the environment is just not exactly right
Single-core workload (SCEC CyberShake)
Execution on a supercomputer: Cray XT system / ALPS / aprun, designed for MPI codes

Sometimes the environment is complex
(Diagram: work is defined on a local resource; a workflow management system dispatches work to campus clusters, XSEDE, NERSC, ALCF, the Open Science Grid, FutureGrid, and the Amazon cloud, and moves data to and from data storage.)
You don't want to recode your workflow for each environment.

Workflow Management
You may want to use different resources within a workflow or over time
Need a high-level workflow specification
Need a planning capability to map from high-level to executable workflow
Need to manage the task dependencies
Need to manage the execution of tasks on the remote resources
Need to provide scalability, performance, reliability

Pegasus Workflow Management System
A collaboration since 2001 between USC and the Condor team at UW-Madison (includes DAGMan)
Maps a resource-independent "abstract" workflow onto resources and executes the "concrete" workflow
Provides reliability: can retry computations from the point of failure
Provides scalability: can handle large data and many computations (KB to TB of data, 1 to 10^6 tasks)
Infers data transfers, restructures workflows for performance
Automatically captures provenance information
Can run on resources distributed among institutions: laptop, campus cluster, Grid, Cloud

Planning Process
The Pegasus planner compiles abstract workflows into executable workflows:
Adds data management jobs (transfer, cleanup, registration)
Performs optimizations (task clustering, reduction, etc.)
Generates executable artifacts (DAG, submit scripts, etc.)
This separation enables portability and optimization: workflows are expressed as DAXes, which contain no information about resources, and Pegasus maps each DAX into a DAG, which is executable.
Mapping involves steps such as site selection and replica selection, driven by information catalogs.
(Diagram: the Pegasus planner consults information catalogs to turn a DAX into a DAG plus Condor submit scripts, which DAGMan executes.)

Workflows as Directed Acyclic Graphs
Nodes are tasks: typically executables with arguments; can also be sub-DAGs
Edges are dependencies: typically data dependencies; can also be control dependencies
No loops; recursion is possible
DAGMan executes these DAGs and provides throttling, retries, rescue DAGs, etc. (a minimal DAG file sketch follows)
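To make the DAGMan layer concrete, here is a minimal sketch of a DAGMan input file; the node names and submit-file names are hypothetical, chosen only for illustration.

    # example.dag -- a minimal DAGMan input file
    # (node names and submit-file names here are hypothetical)
    JOB align align.sub
    JOB merge merge.sub
    # merge runs only after align succeeds
    PARENT align CHILD merge
    # resubmit a failed align node up to 3 times
    RETRY align 3

Submitting the file with condor_submit_dag runs the nodes in dependency order; DAGMan can throttle how many jobs are in the queue at once, retries failed nodes, and writes a rescue DAG so an interrupted run can resume from where it left off.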

Abstract vs Executable Workflow (DAX)

DAX
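As an illustration of what a DAX contains (and, just as importantly, what it leaves out), here is a rough sketch of a DAX 3.x file for a two-step pipeline; element and attribute details vary by schema version, and the job and file names are invented:

    <adag xmlns="http://pegasus.isi.edu/schema/DAX" version="3.4" name="diamond">
      <!-- logical job: no site, host, or physical path information appears here -->
      <job id="ID0000001" name="preprocess">
        <argument>-i <file name="f.a"/> -o <file name="f.b"/></argument>
        <uses name="f.a" link="input"/>
        <uses name="f.b" link="output"/>
      </job>
      <job id="ID0000002" name="analyze">
        <argument>-i <file name="f.b"/> -o <file name="f.c"/></argument>
        <uses name="f.b" link="input"/>
        <uses name="f.c" link="output" register="true"/>
      </job>
      <!-- analyze depends on preprocess -->
      <child ref="ID0000002">
        <parent ref="ID0000001"/>
      </child>
    </adag>

Everything resource-specific, such as which cluster runs each job, which file replica is used, and how data moves, is added later by the planner when it produces the executable DAG.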

DAX Generator
The DAX is just one instance of the workflow; the DAX generator, the program that produces it, is the real workflow.
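A minimal sketch of such a generator, using the DAX3 Python API that ships with Pegasus 4.x (the executables and file names are hypothetical, and newer Pegasus 5.x releases use a different Python API):

    #!/usr/bin/env python
    # Generate the two-job abstract workflow sketched above.
    from Pegasus.DAX3 import ADAG, File, Job, Link

    dax = ADAG("diamond")

    f_a = File("f.a")   # raw input, expected to be in the replica catalog
    f_b = File("f.b")   # intermediate product
    f_c = File("f.c")   # final output

    preprocess = Job(name="preprocess")
    preprocess.addArguments("-i", f_a, "-o", f_b)
    preprocess.uses(f_a, link=Link.INPUT)
    preprocess.uses(f_b, link=Link.OUTPUT)
    dax.addJob(preprocess)

    analyze = Job(name="analyze")
    analyze.addArguments("-i", f_b, "-o", f_c)
    analyze.uses(f_b, link=Link.INPUT)
    analyze.uses(f_c, link=Link.OUTPUT, register=True)
    dax.addJob(analyze)

    # Data dependency: analyze consumes what preprocess produces.
    dax.depends(parent=preprocess, child=analyze)

    with open("diamond.dax", "w") as f:
        dax.writeXML(f)

Because the generator is an ordinary program, it can loop over thousands of inputs, read configuration, and be version-controlled; pegasus-plan then maps the resulting DAX onto concrete resources as described in the planning slide above.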

Advanced features
Performs data reuse (data reduction)
Registers data in data catalogs; supports checkpointing
Reduces the storage footprint by deleting data no longer needed
Can cluster tasks together to reduce overhead and improve performance
Understands complex data architectures (shared and non-shared file systems, distributed data sources, data transfer protocols)
Supports many different execution modes that leverage different computing architectures (Condor pools, HPC resources, etc.)

Data Cleanup
(Figure: Montage 1-degree workflow run with cleanup)

Workflow Reduction (Data Reuse)
Sometimes it is cheaper to access existing data than to regenerate it
Keeping track of data as it is generated supports workflow-level checkpointing
(A toy sketch of the reduction idea follows below.)
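As a rough illustration of the data-reuse idea (not Pegasus's actual implementation), the sketch below prunes jobs whose outputs are already registered, then cascades upward to producers whose outputs are no longer needed by anyone:

    def reduce_workflow(outputs, consumers, available):
        """Return the set of jobs that can be pruned from the workflow.

        outputs   : dict mapping each job to the set of files it produces
        consumers : dict mapping each file to the set of jobs that read it
                    (files absent from this dict are final workflow outputs)
        available : set of files already registered in a replica catalog

        A toy sketch of workflow reduction under these assumptions; the real
        planner also plans transfers for reused files, cleanup, and so on.
        """
        pruned = set()

        def not_needed(f):
            # A file need not be (re)generated if it is already available, or
            # if every job that reads it has itself been pruned. Files with no
            # readers are final outputs and are only skippable when available.
            if f in available:
                return True
            readers = consumers.get(f)
            return bool(readers) and readers <= pruned

        changed = True
        while changed:
            changed = False
            for job, outs in outputs.items():
                if job in pruned or not outs:
                    continue
                if all(not_needed(f) for f in outs):
                    pruned.add(job)
                    changed = True
        return pruned

For example, if job B's output f.c is already registered and B is the only consumer of job A's output f.b, the loop prunes B on the first pass and A on the second, which is exactly the "cheaper to access than to regenerate" effect described above.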

Workflow Restructuring to improve application performance
Cluster short-running jobs together to achieve better performance
Why? Each job carries scheduling overhead, so that overhead needs to be worth paying; ideally a job submitted to the grid/cloud should take at least 10/30/60/? minutes to execute
Clustered tasks can reuse common input data, which means fewer data transfers
Level-based clustering (a toy sketch follows below)
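As a toy illustration of level-based ("horizontal") clustering, and not Pegasus's implementation, the sketch below groups tasks that sit at the same depth in the DAG into fixed-size clusters:

    from collections import defaultdict

    def level_cluster(parents, cluster_size):
        """Group tasks into clusters, merging only tasks on the same level.

        parents      : dict mapping every task to the set of its parent tasks
                       (roots map to an empty set)
        cluster_size : maximum number of tasks per clustered job
        """
        # A task's level is the length of the longest path from any root.
        level = {}
        def depth(task):
            if task not in level:
                ps = parents[task]
                level[task] = 0 if not ps else 1 + max(depth(p) for p in ps)
            return level[task]
        for task in parents:
            depth(task)

        by_level = defaultdict(list)
        for task, lvl in level.items():
            by_level[lvl].append(task)

        # Chop each level into chunks; each chunk becomes one clustered job,
        # amortizing per-job scheduling overhead over many short tasks.
        clusters = []
        for lvl in sorted(by_level):
            tasks = by_level[lvl]
            for i in range(0, len(tasks), cluster_size):
                clusters.append(tasks[i:i + cluster_size])
        return clusters

In Pegasus itself clustering is controlled through planner options and profiles rather than user code; the point of the sketch is only that grouping within a level never places a dependency inside a cluster, so the clustered jobs inherit a valid ordering from the original DAG.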

Pegasus-MPI-Cluster (PMC)
A master/worker task scheduler for running fine-grained workflows on batch systems
Runs as an MPI job; uses MPI to implement the master/worker protocol
Works on most HPC systems; requires MPI, a shared file system, and fork()
Allows sub-graphs of a Pegasus workflow to be submitted as monolithic jobs to remote resources
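PMC reads a compact DAG description of the fine-grained tasks. A sketch of such a file is below; the task IDs and executables are invented, and the exact syntax is based on the pegasus-mpi-cluster documentation, so treat it as an assumption to check against your Pegasus version:

    TASK extract  /bin/tar xzf lightcurves.tgz
    TASK analyze1 /usr/local/bin/periodogram lc_0001.tbl
    TASK analyze2 /usr/local/bin/periodogram lc_0002.tbl
    TASK merge    /usr/local/bin/merge_results out.tbl
    EDGE extract  analyze1
    EDGE extract  analyze2
    EDGE analyze1 merge
    EDGE analyze2 merge

The whole file then runs as one MPI job (for example under aprun or mpiexec): rank 0 acts as the master and hands TASK entries to worker ranks as their EDGE dependencies are satisfied.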

PMC enables earthquake scientists (SCEC) to run post-processing (single-core) computations on new supercomputer architectures (e.g., the Cray XT)

Workflow Monitoring
Leverages the STAMPEDE monitoring framework with a database backend
Populates data at runtime: a background daemon monitors the log files and loads information about the workflow into a database
Stores the workflow structure and runtime statistics for each task
Command-line tools for querying the monitoring framework:
pegasus-status: status of the workflow
pegasus-statistics: detailed statistics about the runtime of the workflow and groups of jobs
pegasus-plots: generates plots of workflow behavior
pegasus-analyzer: identifies failed jobs

Sample pegasus-statistics summary:
Type          | Succeeded | Failed | Incomplete | Total  | Retries | Total+Retries
Tasks         | 135002    | 0      | 0          | 135002 | 0       | 135002
Jobs          | 4529      | 0      | 0          | 4529   | 0       | 4529
Sub-Workflows | 2         | 0      | 0          | 2      | 0       | 2

Workflow wall time: 13 hrs, 2 mins (46973 secs)
Workflow cumulative job wall time: 384 days, 5 hrs (33195705 secs)
Cumulative job wall time as seen from the submit side: 384 days, 18 hrs (33243709 secs)

Workflow Monitoring Dashboard
A web-based workflow dashboard that queries the STAMPEDE database
Troubleshooting: lists all the user's workflows with status information; identifies failed workflows and jobs; displays job info including stdout and stderr
Workflow statistics: workflow runtime, job runtime, overheads, etc.
Charts: Gantt chart of jobs over time, host chart with jobs per host, jobs over time
Collaboration with Dan Gunter and Taghrid Samak, LBNL

(Dashboard screenshots: host chart, workflow statistics, jobs and runtime over time, Gantt chart.)

Integration with HUBzero (credit: Frank McKenna, UC Berkeley; NEES; HUBzero)

Benefits of Pegasus
Provides support for complex computations (can be hidden behind a portal)
Portability / reuse: user-created workflows can easily be run in different environments without alteration (XSEDE, OSG, FutureGrid, Amazon)
Performance: the Pegasus mapper can reorder, group, and prioritize tasks in order to increase overall workflow performance
Scalability: Pegasus can easily scale both the size of the workflow and the resources that the workflow is distributed over

Benefits of Pegasus (continued)
Provenance: performance and provenance data are collected in a database and can be summarized with tools such as pegasus-statistics and pegasus-plots, or queried directly with SQL
Reliability: jobs and data transfers are automatically retried in case of failures
Analysis tools: debugging tools such as pegasus-analyzer help the user debug the workflow in case of non-recoverable failures

If you get stuck… and you can draw… we can help you!

Time to solution: ~3 months
Execution on USC resources

How to get started?
Pegasus: http://pegasus.isi.edu
Tutorial and documentation: http://pegasus.isi.edu/wms/docs/latest
Virtual machine with all software and examples: http://pegasus.isi.edu/downloads
Take a look at some Pegasus applications: http://pegasus.isi.edu/applications
Support: pegasus-users@isi.edu (public), pegasus-support@isi.edu (private)