Simplify Your Science with Workflow Tools

Simplify Your Science with Workflow Tools
International HPC Summer School, June 27, 2017
Scott Callaghan, Southern California Earthquake Center, University of Southern California
scottcal@usc.edu

Overview
- What are scientific workflows?
- What problems do workflow tools solve?
- Overview of available workflow tools
- CyberShake (seismic hazard application)
  - Computational overview
  - Challenges and solutions
- Ways to simplify your work
- Goal: help you figure out if this would be useful for you

Scientific Workflows
- A formal way to express a scientific calculation
- Multiple tasks with dependencies between them
- No limitations on what the tasks are
- Capture task parameters, inputs, and outputs
- Independence of workflow process and data
  - Often, the same workflow is run with different data
- You use workflows all the time…

Perhaps your workflow is simple…
[diagram: stage-in → parallel job → post-process → stage-out]

…or maybe a bit more complicated…
[diagram: pre-process feeding Jobs 1–4, followed by visualize and publish steps]

…or maybe just confusing
[diagram of a tangled workflow: edit file names, stage-in, parallel job, post-process, post-process again if it failed, stage-in that other file, stage-out, "do you still need this job?", the job you always forget, a job from a past post-doc, email your supervisor]

Workflow Components
- Task executions
  - Specify a series of tasks to run
- Data and control dependencies between tasks
  - Outputs from one task may be inputs for another
- Task scheduling
  - Some tasks may be able to run in parallel with other tasks
- File and metadata management
  - Track when a task was run and with what key parameters
- Resource provisioning (getting cores)
  - Computational resources are needed to run jobs
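To make these components concrete, here is a minimal sketch (mine, not from the slides or any particular workflow tool) of a four-task workflow expressed as a dependency graph and executed in an order that respects those dependencies; the task names and echo commands are placeholders.

```python
# A minimal sketch of a workflow as a task graph, run in dependency order.
# This only illustrates the idea; it is not how any workflow tool is built.
from graphlib import TopologicalSorter  # Python 3.9+
import subprocess

# Each task maps to the set of tasks it depends on.
dependencies = {
    "stage_in":     set(),
    "parallel_job": {"stage_in"},
    "post_process": {"parallel_job"},
    "stage_out":    {"post_process"},
}

# Placeholder commands standing in for real executables.
commands = {
    "stage_in":     ["echo", "staging input files"],
    "parallel_job": ["echo", "running the big parallel job"],
    "post_process": ["echo", "post-processing outputs"],
    "stage_out":    ["echo", "archiving results"],
}

# Run tasks one at a time in an order that respects the dependencies.
for task in TopologicalSorter(dependencies).static_order():
    print(f"running {task}")
    subprocess.run(commands[task], check=True)  # stop if a task fails
```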

What do we need help with?
- Task executions
  - What if something fails in the middle?
- Data and control dependencies
  - Make sure inputs are available before tasks run
- Task scheduling
  - Minimize execution time while preserving dependencies
- Metadata
  - Automatically capture and track it
- Getting cores

Workflow tools can help!
- Automate your pipeline
- Define your workflow via programming or a GUI
- Run the workflow on a local or remote system
- Can support all kinds of workflows
- Use your existing code (no changes required)
- Provide many kinds of fancy features and capabilities
  - Flexible, but can be complex
- Will discuss one set of tools (Pegasus) as an example, but the concepts are shared across tools

Pegasus-WMS
- Developed at USC's Information Sciences Institute
- Used in many domains, including the LIGO project
- Workflows are executed from the local machine
  - Jobs can run on the local machine or on distributed resources
- You use an API to write code describing the workflow ("create")
  - Python, Java, or Perl
  - Tasks with parent/child relationships
  - Files and their roles
- Pegasus creates an XML file describing the workflow, called a DAX
  - Workflow is represented by a directed acyclic graph

Sample Workflow
[diagram: input.txt → my_job → output.txt]

```python
# Pegasus DAX3 Python API (Pegasus 4.x)
from Pegasus.DAX3 import ADAG, Job, File, Link

# Create DAX object
dax = ADAG("test_dax")

# Define my job
myJob = Job(name="my_job")

# Input and output files to my job
inputFile = File("input.txt")
outputFile = File("output.txt")

# Arguments to my_job (./my_job input=input.txt output=output.txt)
myJob.addArgument("input=input.txt", "output=output.txt")

# Role of the files for the job
myJob.uses(inputFile, link=Link.INPUT)
myJob.uses(outputFile, link=Link.OUTPUT)

# Add the job to the workflow
dax.addJob(myJob)

# Write to file
fp = open("test.dax", "w")
dax.writeXML(fp)
fp.close()
```

Getting ready to run ("Planning")
- The DAX is an "abstract workflow"
  - Logical filenames and executables
  - Algorithm description
- Use Pegasus to "plan" the workflow for execution
  - Uses catalogs to resolve logical names and compute resource info
- Pegasus automatically augments the workflow
  - Staging jobs (if needed) with GridFTP or Globus Online
  - Registers output files in a catalog to find them later
  - Wraps jobs in pegasus-kickstart for detailed statistics
- Generates a DAG
  - Top-level workflow description (tasks and dependencies)
  - Submission file for each job
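For a sense of what planning looks like in practice, the commands below are a rough sketch in the spirit of the Pegasus 4.x command-line tools of this era; the script name create_dax.py and the site handle hpc_site are hypothetical, and the exact flags and catalog setup vary by version and installation, so consult the Pegasus documentation rather than copying this verbatim.

```
# Generate the abstract workflow with the script from the previous slide
python create_dax.py                 # writes test.dax

# Plan: resolve logical names via the catalogs, produce a DAG, and submit it
pegasus-plan --dax test.dax --sites hpc_site --output-site local --submit

# Check on the running workflow (pass the submit directory reported by plan)
pegasus-status -l <submit_dir>
pegasus-analyzer <submit_dir>
```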

Pegasus Workflow Path
[diagram: you create the workflow description with the DAX API → abstract workflow (DAX: logical names, algorithm) → Plan → concrete workflow (DAG: physical paths, job scripts) → Run via the scheduler (HTCondor) → jobs execute]

Other tools in the stack
- HTCondor (UW Madison)
  - Pegasus "submits" the workflow to HTCondor
  - Supervises runtime execution of DAG files: maintains the queue, monitors dependencies, schedules jobs, retries failures, writes checkpoints
- GRAM (Globus Toolkit)
  - Uses certificate-based authentication for remote job submission
  - Supported by many HPC resources
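To make DAGMan's role concrete, a minimal DAGMan input file looks roughly like the following; the submit file names jobA.sub and jobB.sub are hypothetical HTCondor submit descriptions, and the file would be handed to HTCondor with condor_submit_dag.

```
# Two-job DAG: run A, then B once A succeeds; retry A up to 3 times
JOB A jobA.sub
JOB B jobB.sub
PARENT A CHILD B
RETRY A 3
```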

Full workflow stack
[diagram: what you do — create the workflow description; what the tools do — on the local machine, Pegasus plans the DAG and HTCondor manages the local queue, while GRAM submits jobs to the remote scheduler and remote queue on the remote machine]

Other Workflow Tools
- Regardless of the tool, they all try to solve the same problems:
  - Describe your workflow (Pegasus "Create")
  - Prepare your workflow for the execution environment (Pegasus "Plan")
  - Send jobs to resources (HTCondor, GRAM)
  - Monitor the execution of the jobs (HTCondor DAGMan)
- Brief overview of some other available tools follows

Other Workflow Tools
- Swift (U of Chicago)
  - Workflow defined via a scripting language
  - Workflow compiled internally and executed
  - Focus on large data, many tasks
- Askalon (U of Innsbruck)
  - Create workflow description
    - Use a workflow language, or a UML editor to create it graphically
  - Conversion: like planning, to prepare for execution
  - Submit jobs to the Enactment Engine, which distributes jobs for execution at remote grid or cloud sites
  - Provides monitoring tools

Example Swift script:

```
//Create new type
type messagefile;

//Create app definition, returns messagefile
app (messagefile t) greeting() {
    //Print and pipe stdout to t
    echo "Hello, world!" stdout=@filename(t);
}

//Create a new messagefile, linked to hello.txt
messagefile outfile <"hello.txt">;

//Run greeting() and store results
outfile = greeting();
```

More Workflow Tools
- Kepler (diverse US collaboration)
  - GUI interface
  - Many models of computation; "actors" provide built-in components (tasks)
- RADICAL Cybertools
  - Sits atop SAGA, a Python API for submitting remote jobs
  - Uses pilot jobs to provision resources
- UNICORE (Jülich Supercomputing Center)
  - GUI interface to describe workflows
  - Branches, loops, parallel loops
- Many more: ask me about specific use cases
  - NCSA Blue Waters has webinars on several tools

Workflow Application: CyberShake
- What will peak ground motion be over the next 50 years?
  - Used in building codes, insurance, government, planning
- Answered via Probabilistic Seismic Hazard Analysis (PSHA)
- Communicated with hazard curves and maps
  [plot annotation: e.g., a hazard curve point at 0.4 g with a 2% probability of exceedance in 50 years]
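As a brief aside (not on the slides): the "2% in 50 years" annotation follows the standard PSHA convention of treating exceedances as a Poisson process, so a probability of exceedance P over T years corresponds to an annual rate lambda = -ln(1 - P) / T. A quick check of the numbers:

```python
import math

# Poisson relation between annual exceedance rate and probability in T years:
#   P = 1 - exp(-rate * T)
T = 50.0                       # years
P = 0.02                       # 2% probability of exceedance in 50 years
rate = -math.log(1 - P) / T    # annual exceedance rate
print(rate)                    # ~4.04e-4 exceedances per year
print(1 / rate)                # return period of ~2,475 years
```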

CyberShake Computational Requirements
- Determine shaking from ~500,000 earthquakes per site
- Large parallel jobs
  - 2 GPU wave propagation jobs: 800 nodes x 1 hr, 1.5 TB output
- Small serial jobs
  - 500,000 seismogram calculation jobs: 1 core x 4.7 min, 30 GB output
- Need ~300 sites for a hazard map
- Decided to use scientific workflows for:
  - Automation
  - Data management
  - Error recovery

Challenge: Resource Provisioning
- For large parallel jobs, submit to the remote scheduler
  - GRAM (or another tool) puts jobs in the remote queue
  - Runs like a normal batch job
  - Can specify either CPU or GPU nodes
- For small serial jobs, need high throughput
  - Putting lots of jobs in the batch queue is ill-advised
    - Scheduler isn't designed for a heavy job load
    - Scheduler cycle is ~5 minutes
    - Policy limits, too
- Solution: Pegasus-mpi-cluster (PMC)

Pegasus-mpi-cluster
- MPI wrapper around serial or thread-parallel jobs
  - Master-worker paradigm
  - Preserves dependencies
- HTCondor submits one job to multiple nodes, which starts PMC
- Specify jobs as usual; Pegasus does the wrapping
- Uses intelligent scheduling
  - Core counts
  - Memory requirements
- Can combine writes
  - Workers write to the master, which aggregates into fewer files
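The sketch below illustrates the master-worker idea that PMC is built around, using mpi4py; it is an illustration only, not PMC's actual implementation, and the task list (squaring integers) is a stand-in for real serial jobs.

```python
# Illustrative master-worker pattern (not PMC's code): rank 0 hands out
# independent tasks to workers and collects results. Assumes mpi4py.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
STOP = None

if rank == 0:
    tasks = list(range(20))          # hypothetical independent task IDs
    results = []
    active = 0
    status = MPI.Status()
    # Seed each worker with one task, or a stop signal if none remain.
    for worker in range(1, size):
        if tasks:
            comm.send(tasks.pop(), dest=worker)
            active += 1
        else:
            comm.send(STOP, dest=worker)
    # Receive a result, then send that worker more work or stop it.
    while active > 0:
        result = comm.recv(source=MPI.ANY_SOURCE, status=status)
        results.append(result)
        worker = status.Get_source()
        if tasks:
            comm.send(tasks.pop(), dest=worker)
        else:
            comm.send(STOP, dest=worker)
            active -= 1
    print(f"master collected {len(results)} results")
else:
    # Worker: receive tasks until told to stop.
    while True:
        task = comm.recv(source=0)
        if task is STOP:
            break
        comm.send(task * task, dest=0)   # stand-in for real work
```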

Challenge: Data Management
- Millions of data files
- Pegasus provides staging
  - Symlinks files if possible, transfers files if needed
  - Transfers output back to local archival disk
  - Supports running parts of workflows on separate machines
  - Cleans up temporary files when no longer needed
  - Directory hierarchy to reduce files per directory
- We added automated integrity checks
  - Correct number of files, NaN and zero-value checks
  - Included as new jobs in the workflow
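A rough sketch of what such an integrity-check job could look like; the directory layout, file extension, expected count, and binary float32 format are all hypothetical stand-ins rather than CyberShake's actual conventions.

```python
import sys
import glob
import numpy as np

# Hypothetical check: verify the file count and scan seismograms for NaNs
# or all-zero traces, failing the workflow job if anything looks wrong.
expected_count = 500000
files = glob.glob("seismograms/*.grm")

errors = []
if len(files) != expected_count:
    errors.append(f"expected {expected_count} files, found {len(files)}")

for path in files:
    data = np.fromfile(path, dtype=np.float32)
    if data.size == 0 or np.isnan(data).any():
        errors.append(f"{path}: empty or contains NaNs")
    elif not np.any(data):
        errors.append(f"{path}: all zeros")

if errors:
    print("\n".join(errors), file=sys.stderr)
    sys.exit(1)   # non-zero exit marks the check job as failed
print("all checks passed")
```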

CyberShake Study 17.3
- Hazard curves for 876 sites
- Used OLCF Titan and NCSA Blue Waters
  - Averaged 1295 nodes (CPUs and GPUs) for 31 days
- Workflow tools scheduled 15,581 jobs
  - 23.9 workflows running concurrently, on average
- Generated 285 million seismograms
- Workflow tools managed 777 TB of data
  - 308 TB of intermediate data transferred
  - 10.7 TB (~17M files) staged back to local disk
- Workflow tools scale!

Problems Workflows Solve
- Task executions
  - Workflow tools will retry and checkpoint if needed
- Data management
  - Stage data in and out for jobs automatically
- Task scheduling
  - Optimal execution on available resources
- Metadata
  - Automatically track runtime, environment, arguments, inputs
- Getting cores
  - Whether for large parallel jobs or high throughput

Should you use workflow tools?
- You are probably using a workflow already
- They replace manual hand-offs and polling to monitor progress
- Provide a framework to assemble community codes
- Scale from a local computer to large clusters
- Provide a portable algorithm description independent of the data
  - CyberShake has run on 9 systems since 2007 with the same workflow
- They do add additional software layers and complexity
  - Some development time is required

Final Thoughts
- Automation is vital, even without workflow tools
  - Eliminate human polling
  - Get everything to run automatically if successful
  - Be able to recover from common errors
- Put ALL processing steps in the workflow
  - Include validation, visualization, publishing, notifications
- Avoid premature optimization
- Consider new compute environments (dream big!)
  - Larger clusters, XSEDE/PRACE/RIKEN/CC, Amazon EC2
- Tool developers want to help you!

Links
- SCEC: http://www.scec.org
- Pegasus: http://pegasus.isi.edu
- Pegasus-mpi-cluster: http://pegasus.isi.edu/wms/docs/latest/cli-pegasus-mpi-cluster.php
- HTCondor: http://www.cs.wisc.edu/htcondor/
- Globus: http://www.globus.org/
- Swift: http://swift-lang.org
- Askalon: http://www.dps.uibk.ac.at/projects/askalon/
- Kepler: https://kepler-project.org/
- RADICAL Cybertools: https://radical-cybertools.github.io/
- UNICORE: http://www.unicore.eu/
- CyberShake: http://scec.usc.edu/scecpedia/CyberShake

Questions?