Condor DAGMan and Pegasus
Selim Kalayci, Florida International University
07/28/2009
Note: Slides are compiled from various TeraGrid documentation.


DAGMan: Directed Acyclic Graph Manager
DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you (e.g., "Don't run job B until job A has completed successfully").

What is a DAG?
A DAG is the data structure used by DAGMan to represent these dependencies. Each job is a "node" in the DAG. Each node can have any number of "parent" or "child" nodes, as long as there are no loops!
(Diagram: a diamond-shaped DAG with Job A at the top, Jobs B and C in the middle, and Job D at the bottom.)

Defining a DAG
A DAG is defined by a .dag file, listing each of its nodes and their dependencies:

  # diamond.dag
  Job A a.sub
  Job B b.sub
  Job C c.sub
  Job D d.sub
  Parent A Child B C
  Parent B C Child D

Each node will run the Condor job specified by its accompanying Condor submit file.
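For reference, here is a minimal sketch of what one of the accompanying submit files (a.sub) might contain; the executable name and the output/error/log file names are illustrative assumptions, not taken from the original slides:

  # a.sub -- submit description for node A (illustrative)
  universe   = vanilla
  executable = a.sh
  output     = a.out
  error      = a.err
  log        = diamond.log
  queue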

Submitting a DAG
To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs:

  % condor_submit_dag diamond.dag

condor_submit_dag submits a Scheduler Universe job with DAGMan as the executable. Thus the DAGMan daemon itself runs as a Condor job, so you don't have to baby-sit it.
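Once submitted, the DAGMan job can be watched like any other Condor job. Two commonly used commands (standard Condor tools, nothing specific to this tutorial):

  % condor_q          # shows the DAGMan job plus the node jobs it has submitted
  % condor_q -dag     # groups node jobs under their parent DAGMan job

DAGMan also writes a detailed log named after the DAG file (e.g., diamond.dag.dagman.out), which is usually the first place to look when something misbehaves.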

Running a DAG
DAGMan acts as a "meta-scheduler", managing the submission of your jobs to Condor-G based on the DAG dependencies.
(Diagram: DAGMan reads the .dag file and places jobs into the Condor-G job queue.)

Running a DAG (cont'd)
DAGMan holds & submits jobs to the Condor-G queue at the appropriate times.

Running a DAG (cont'd)
In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a "rescue" file with the current state of the DAG.

Recovering a DAG (fault tolerance)
Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.
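How the rescue file is picked up depends on the Condor/DAGMan version, so treat the following as a sketch rather than an exact recipe:

  # newer DAGMan versions: simply resubmit the original DAG; an existing
  # rescue file (e.g., diamond.dag.rescue001) is detected and used automatically
  % condor_submit_dag diamond.dag

  # older versions: submit the rescue file itself
  % condor_submit_dag diamond.dag.rescue

In either case, nodes that already completed successfully are not re-run.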

Recovering a DAG (cont'd)
Once that job completes, DAGMan will continue the DAG as if the failure never happened.

Finishing a DAG
Once the DAG is complete, the DAGMan job itself is finished and exits.

Additional DAGMan Features
DAGMan provides other handy features for job management:
– nodes can have PRE & POST scripts
– failed nodes can be automatically re-tried a configurable number of times
– job submission can be "throttled"
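A sketch of how these features look in DAG-file syntax, extending the diamond example (the script names are hypothetical):

  # optional PRE/POST scripts around node A
  SCRIPT PRE  A stage_in.sh
  SCRIPT POST A check_output.sh

  # re-try node C up to 3 times if it fails
  RETRY C 3

Throttling is typically requested at submit time, for example:

  % condor_submit_dag -maxjobs 10 diamond.dag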

HANDS-ON

Ewa Deelman, USC Information Sciences Institute (pegasus.isi.edu)

Pegasus: Planning for Execution in Grids
Abstract Workflows: Pegasus input workflow description
– a workflow "high-level language"
– only identifies the computations that a user wants to do
– devoid of resource descriptions
– devoid of data locations
Pegasus (pegasus.isi.edu)
– a workflow "compiler"
– target language: DAGMan's DAG and Condor submit files
– transforms the workflow for performance and reliability
– automatically locates physical locations for both workflow components and data
– finds appropriate resources to execute the components
– provides runtime provenance
DAGMan
– a workflow executor
– scalable and reliable execution of an executable workflow
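To make "abstract" concrete, here is a heavily simplified sketch of one job from a Pegasus input workflow (a DAX file) for the diamond example; the elements follow the general shape of the DAX schema, but the exact schema version, job names, and file names are illustrative assumptions:

  <adag name="diamond">
    <!-- one computation; note: no hosts, paths, or physical file locations -->
    <job id="ID0000001" name="preprocess">
      <argument>-i <file name="f.a"/> -o <file name="f.b"/></argument>
      <uses name="f.a" link="input"/>
      <uses name="f.b" link="output"/>
    </job>
    <!-- further jobs omitted -->
    <child ref="ID0000002"><parent ref="ID0000001"/></child>
  </adag>

Pegasus supplies the physical locations, resources, and data-movement jobs when it compiles this into DAGMan's DAG and Condor submit files.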

Pegasus Workflow Management System
A reliable, scalable workflow management system that an application or workflow composition service can depend on to get the job done; a client tool with no special requirements on the infrastructure. Its components, from top to bottom:
– Abstract Workflow (the input)
– Pegasus mapper: a decision system that develops strategies for reliable and efficient execution in a variety of environments
– DAGMan: reliable and scalable execution of dependent tasks
– Condor Schedd: reliable, scalable execution of independent tasks (locally, across the network), priorities, scheduling
– Cyberinfrastructure: local machine, cluster, Condor pool, OSG, TeraGrid

Generating a Concrete Workflow
Information used:
– location of files and component instances
– state of the Grid resources
Select specific:
– resources
– files
– add jobs required to form a concrete workflow that can be executed in the Grid environment
  – data movement
  – data registration
Each component in the abstract workflow is turned into an executable job.

Information Components used by Pegasus
Globus Monitoring and Discovery Service (MDS)
– locates available resources
– finds resource properties
  – dynamic: load, queue length
  – static: location of GridFTP server, RLS, etc.
Globus Replica Location Service (RLS)
– locates data that may be replicated
– registers new data products
Transformation Catalog
– locates installed executables
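As a rough illustration of the kind of mappings these catalogs hold (Pegasus can also read file-based catalogs instead of the Globus services; formats vary by version, so the entries below are illustrative only, with made-up host and site names):

  # replica catalog: logical file name -> physical location
  f.a  gsiftp://serverA.example.org/storage/f.a  pool="siteA"

  # transformation catalog: site, logical transformation -> installed executable
  siteA  preprocess  /usr/local/bin/preprocess  INSTALLED  INTEL32::LINUX

The planner queries these mappings to decide where data already lives and where each component can run.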

Example Workflow Reduction
Starting from the original abstract workflow: if "b" already exists (as determined by a query to the RLS), the workflow can be reduced, pruning the jobs that would recreate "b".

Mapping from abstract to concrete
Query the RLS, MDS, and TC; schedule computation and data movement.
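A typical planning/submission command looks roughly like the following; the DAX file name and the site handles are placeholders, and exact option names vary across Pegasus versions:

  % pegasus-plan --dax diamond.dax --sites siteA --output local --dir ./dags --submit

This produces (and, with --submit, hands to DAGMan) the concrete DAG and Condor submit files for the chosen sites.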

Pegasus Research
– resource discovery and assessment
– resource selection
– resource provisioning
– workflow restructuring: tasks merged together or reordered to improve overall performance
– adaptive computing: workflow refinement adapts to the changing execution environment

Benefits of the workflow & Pegasus approach
The workflow exposes
– the structure of the application
– the maximum parallelism of the application
Pegasus can take advantage of this structure to
– set a planning horizon (how far into the workflow to plan)
– cluster a set of workflow nodes to be executed as one (for performance)
Pegasus shields the user from the Grid details.

Benefits of the workflow & Pegasus approach (cont'd)
– Pegasus can run the workflow on a variety of resources
– Pegasus can run a single workflow across multiple resources
– Pegasus can opportunistically take advantage of available resources (through dynamic workflow mapping)
– Pegasus can take advantage of pre-existing intermediate data products
– Pegasus can improve the performance of the application