Using Stork: An Introduction (Condor Week 2006)

Audience: Users already familiar with Condor and DAGMan who need advanced data placement capabilities. This tutorial makes an excellent extension to this morning's Condor User's tutorial.

Tutorial Outline
Classical data transfer problems
Stork solutions
Stork data placement job management
Managing mixed data and CPU job workflows with DAGMan

Meet Friedrich*: He is a scientist. But he has a big problem. (*Frieda's twin brother)

I have a LOT of data to process. Where can I get help?

Friedrich's problem… Many large data sets to process. For each data set:
stage in data from a remote server
run the CPU data processing job
stage out data to a remote server

"Classic" Data Transfer Job
  #!/bin/sh
  globus-url-copy source dest
Scripts like this often work fine for short, simple data transfers, but…

Many things can go wrong! Errors are more likely with large data sets: “The network was down.” “The data server was unavailable.” “The transferred data was corrupt.” “My workflow didn’t know the data was bad.”
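Before Stork, the usual fix was to wrap the script in hand-rolled retry logic. A minimal sketch of such a wrapper (the URLs, retry count, and sleep interval are placeholders, not from these slides) might look like this:
  #!/bin/sh
  # Hand-rolled retry wrapper around globus-url-copy (illustrative only).
  SRC="gsiftp://server/path"
  DEST="file:///dir/file"
  MAX_TRIES=3
  i=1
  while [ "$i" -le "$MAX_TRIES" ]; do
      if globus-url-copy "$SRC" "$DEST"; then
          echo "transfer succeeded on attempt $i"
          exit 0
      fi
      echo "attempt $i failed, retrying..." >&2
      i=`expr $i + 1`
      sleep 60
  done
  echo "transfer failed after $MAX_TRIES attempts" >&2
  exit 1
Every user ends up writing and debugging a script like this; the point of Stork is to move that retry and error handling into the scheduler instead.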

Enter Stork: Creates the notion of a data placement job, managed and scheduled just like a Condor CPU job. Friedrich will benefit from:
Built-in fault tolerance
Compatibility with the Condor DAGMan workflow manager

Supported Data Transfers
local file system
GridFTP
FTP
HTTP
SRB
NeST
SRM
modular: extensible to other protocols
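For illustration, the protocol is chosen simply by the URL scheme on each end of the transfer. The host names and paths below are made up for this example, and the scheme strings for SRB, NeST, and SRM should be checked against the Stork sections of the Condor Manual:
  [
    dap_type = "transfer";
    src_url  = "http://data-server.example.edu/input/dataset.1";
    dest_url = "file:///scratch/friedrich/dataset.1";
  ]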

Fault Tolerance
Retries failed jobs
Can also retry a failed job using an alternate protocol, e.g. "first try GridFTP, then try FTP"
Retries "stuck" jobs
Configurable fault responses
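The "first try GridFTP, then try FTP" behavior is requested in the submit description. The attribute name alt_protocols below is a guess at the syntax used by Stork in this era, not something these slides confirm, so treat it as a sketch and check the Condor Manual's Stork sections for the exact spelling:
  [
    dap_type      = "transfer";
    src_url       = "gsiftp://server/path";   // primary attempt: GridFTP
    dest_url      = "file:///dir/file";
    alt_protocols = "ftp-file";               // hypothetical attribute: fall back to FTP -> local file
  ]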

Getting Stork
Stork is bundled with Condor 6.7 and all future releases.
Available as a free download from http://www.cs.wisc.edu/condor
Currently available for Linux platforms.

Friedrich installs a "Personal Stork" on his workstation… What do we mean by a "Personal" Stork? Condor/Stork on your own workstation: no root access required, no system administrator intervention needed. After installation, Friedrich submits his jobs to his Personal Stork…

Friedrich's Personal Stork
[Diagram: Friedrich's workstation runs the Condor Central Manager, Master, SchedD, StartD, and Stork daemons. A DAG of data jobs and CPU jobs is submitted locally; CPU jobs go to N compute elements and data jobs move files to and from external data servers.]

Your Personal Stork will...
Keep an eye on your data and CPU jobs, and keep you posted on their progress
Throttle the maximum number of jobs running
Keep a log of your job activities
Add fault tolerance to your jobs
Detect and retry failed data placement jobs
Enforce ordering dependencies between data placement and CPU jobs

Creating a Submit Description File
A plain ASCII text file. Neither Stork nor Condor cares about file extensions or statement order.
Tells Stork about your job: data placement type, source/destination location and protocol, proxy location, input, output, error and log files to use, command-line arguments, etc.

Simple Submit File
  // C++ style comment lines
  [
    dap_type = "transfer";
    src_url = "gsiftp://server/path";
    dest_url = "file:///dir/file";
    x509proxy = "default";
    log = "stage-in.out.log";
    output = "stage-in.out.out";
    err = "stage-in.out.err";
  ]
Note: this is a different format from Condor submit files.

Running stork_submit
You give stork_submit the name of the submit file you have created:
  # stork_submit stage-in.stork
stork_submit parses the submit file, checks it for errors, and sends the job to the Stork server.
stork_submit returns the new job id: the job "handle".

Sample stork_submit
  # stork_submit stage-in.stork
  using default proxy: /tmp/x509up_u19100
  ================
  Sending request:
  [
    dest_url = "file:///dir/file";
    src_url = "gsiftp://server/path";
    err = "path/stage-in.out.err";
    output = "path/stage-in.out.out";
    dap_type = "transfer";
    log = "path/stage-in.out.log";
    x509proxy = "default"
  ]
  Request assigned id: 1    # this is the returned job id

The Job Queue
stork_submit sends your job to the Stork server, which manages the local job queue.
View the queue with stork_q or stork_status…

Getting Job Status
stork_q queries all active jobs.
stork_status queries the named job id, which may be active or complete:
  # stork_status 12

Removing jobs
If you want to remove a job from the job queue, you use stork_rm.
You can only remove jobs that you own (you can't run stork_rm on someone else's jobs unless you are root).
You must give a specific job ID:
  # stork_rm 21    # removes a single job

More information about jobs
Stork creates a log file (the user log), controlled by the submit file log setting: "The Life Story of a Job".
It shows all events in the life of a job. Always have a log file.
To turn it on, in the submit file: log = "filename";

Sample Stork User Log
  000 (001.-01.-01) 04/17 19:30:00 Job submitted from host: <128.105.121.53:54027>
  ...
  001 (001.-01.-01) 04/17 19:30:01 Job executing on host: <128.105.121.53:9621>
  008 (001.-01.-01) 04/17 19:30:01 job type: transfer
  008 (001.-01.-01) 04/17 19:30:01 src_url: gsiftp://server/path
  008 (001.-01.-01) 04/17 19:30:01 dest_url: file:///dir/file
  005 (001.-01.-01) 04/17 19:30:02 Job terminated.
      (1) Normal termination (return value 0)
          Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
          Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
          Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
          Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
      0  -  Run Bytes Sent By Job
      0  -  Run Bytes Received By Job
      0  -  Total Bytes Sent By Job
      0  -  Total Bytes Received By Job

My jobs have dependencies… Can Stork help solve my dependency* problems? (* Not your personal problems!)

Friedrich learns DAGMan Directed Acyclic Graph Manager DAGMan allows you to specify the dependencies between your jobs, so it can manage them automatically for you. (e.g., “Don’t run job “B” until job “A” has completed successfully.”) In the simplest case…

What is a DAG?
A DAG is the data structure used by DAGMan to represent these dependencies. Each job is a "node" in the DAG. Each node can have any number of "parent" or "child" nodes, as long as there are no loops!
[Diagram: the example "diamond" DAG, with Job A at the top, Jobs B and C in the middle, and Job D at the bottom.]
A DAG is the best data structure to represent a workflow of jobs with dependencies. Children may not run until their parents have finished; this is why the graph is a directed graph: there is a direction to the flow of work. In this example, called a "diamond" DAG, job A must run first; when it finishes, jobs B and C can run together; when they are both finished, D can run; when D is finished the DAG is finished. Loops, where two jobs are both descended from one another, are prohibited because they would lead to deadlock: in a loop, neither node could run until the other finished, and so neither would start. This restriction is what makes the graph acyclic.

Defining Friedrich's DAG
A DAG is defined by a text file, listing each of its nodes and their dependencies:
  # data-process.dag
  Data IN in.stork
  Job CRUNCH crunch.condor
  Data OUT out.stork
  Parent IN Child CRUNCH
  Parent CRUNCH Child OUT
Each node will run the Condor or Stork job specified by its accompanying submit file.
[Diagram: IN, then CRUNCH, then OUT.]
This is all it takes to specify Friedrich's stage-in, crunch, stage-out workflow.
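On top of Stork's own retries, DAGMan can retry a whole failed node. Retry is a standard DAGMan directive; the counts below are only examples, and whether it applies to Data nodes exactly as to Job nodes in this release is worth confirming in the manual:
  # data-process.dag, with node-level retries added
  Data IN in.stork
  Job CRUNCH crunch.condor
  Data OUT out.stork
  Parent IN Child CRUNCH
  Parent CRUNCH Child OUT
  Retry CRUNCH 3
  Retry OUT 2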

Submitting a DAG
To start your DAG, just run condor_submit_dag with your dag file, and Condor will start a personal DAGMan daemon to begin running your jobs:
  # condor_submit_dag friedrich.dag
condor_submit_dag submits a Scheduler Universe job with DAGMan as the executable. Thus the DAGMan daemon itself runs as a Condor job, so you don't have to baby-sit it. Just like any other Condor job, you get fault tolerance in case the machine crashes or reboots, or if there's a network outage. And you're notified when it's done, and whether it succeeded or failed.
  % condor_q
  -- Submitter: foo.bar.edu : <128.105.175.133:1027> : foo.bar.edu
   ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
  2456.0   user    3/8  19:22   0+00:00:02 R  0   3.2  condor_dagman -f -

In Review
With Stork, Friedrich now can…
Submit his data processing jobs and go home!
Stork manages the data transfers, including fault detection and retries.
Condor DAGMan manages his job dependencies.

Additional Resources
http://www.cs.wisc.edu/condor/stork/
Condor Manual, Stork sections
stork-announce@cs.wisc.edu list
stork-discuss@cs.wisc.edu list

Questions?

Additional Slides

Important Parameters
STORK_MAX_NUM_JOBS limits the number of active jobs.
STORK_MAX_RETRY limits job attempts before the job is marked as failed.
STORK_MAXDELAY_INMINUTES specifies the "hung job" threshold.
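For a Personal Stork these knobs go in your Condor configuration file. The values below are illustrative defaults made up for this sketch, not recommendations from the slides:
  # condor_config.local (example values only)
  STORK_MAX_NUM_JOBS       = 5      # run at most 5 data placement jobs at once
  STORK_MAX_RETRY          = 10     # mark a job as failed after 10 attempts
  STORK_MAXDELAY_INMINUTES = 10     # treat a job as "hung" after 10 minutes
You can verify what the server actually picked up with condor_config_val, for example: condor_config_val STORK_MAX_RETRY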

Current Restrictions
Currently best suited for "Personal Stork" mode.
Local file paths must be valid on the Stork server, including the submit directory.
To share data, successive jobs in a DAG must use a shared filesystem.

Future Work
Enhance multi-user fair share
Enhance support for DAGs without shared filesystem
Enhance scheduling with configurable job requirements, rank
Add job matchmaking
Additional platform ports

Access to Data in Condor
Use a shared filesystem if available.
No shared filesystem? Condor can transfer files:
Can automatically send back changed files
Atomic transfer of multiple files
Can be encrypted over the wire
Remote I/O Socket
Standard Universe can use remote system calls (more on this later)

Condor File Transfer
ShouldTransferFiles = YES: always transfer files to the execution site
ShouldTransferFiles = NO: rely on a shared filesystem
ShouldTransferFiles = IF_NEEDED: automatically transfer the files if the submit and execute machines are not in the same FileSystemDomain
  Universe = vanilla
  Executable = my_job
  Log = my_job.log
  ShouldTransferFiles = IF_NEEDED
  Transfer_input_files = dataset$(Process), common.data
  Transfer_output_files = TheAnswer.dat
  Queue 600

condor_master
Starts up all other Condor daemons, including Stork.
If there are any problems and a daemon exits, it restarts the daemon and sends email to the administrator.
Acts as the server for many Condor remote administration commands: condor_reconfig, condor_restart, condor_off, condor_on, condor_config_val, etc.
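For example, after editing the configuration knobs above you would tell the running daemons to re-read their configuration; condor_reconfig with no arguments acts on the local host (shown here as a sketch):
  % condor_reconfig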

Running a DAG
DAGMan acts as a "meta-scheduler", managing the submission of your jobs to Condor based on the DAG dependencies.
[Diagram: DAGMan reads the .dag file and feeds jobs into the Condor job queue.]
First, job A will be submitted alone…

Running a DAG (cont'd)
DAGMan holds and submits jobs to the Condor queue at the appropriate times.
[Diagram: jobs B and C now sit in the Condor job queue.]
Once job A completes successfully, jobs B and C will be submitted at the same time…

Running a DAG (cont'd)
In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a "rescue" file with the current state of the DAG.
[Diagram: job C has failed; DAGMan writes a rescue file.]
If job C fails, DAGMan will wait until job B completes, and then will exit, creating a rescue file. Job D will not run. In its log, DAGMan will provide additional details of which node failed and why.
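In DAGMan of this era the rescue file is essentially a copy of the original DAG with DONE markers on the nodes that already completed. A sketch for Friedrich's DAG, assuming the CRUNCH node failed after IN finished (the file name and exact layout may differ), might look like:
  # data-process.dag.rescue (sketch)
  Data IN in.stork DONE
  Job CRUNCH crunch.condor
  Data OUT out.stork
  Parent IN Child CRUNCH
  Parent CRUNCH Child OUT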

Recovering a DAG
Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.
[Diagram: DAGMan reads the rescue file; job C re-enters the Condor job queue.]
Since jobs A and B have already completed, DAGMan will start by re-submitting job C.

Recovering a DAG (cont'd)
Once that job completes, DAGMan will continue the DAG as if the failure never happened.
[Diagram: job D now runs in the Condor job queue.]

Finishing a DAG
Once the DAG is complete, the DAGMan job itself is finished, and exits.
[Diagram: all nodes of the DAG have completed; the Condor job queue is empty.]