
HTCondor Training Florentia Protopsalti IT-CM-IS 1/16/2019

Overview: HTCondor Batch System, Job Submission, Investigating Failed Jobs, Input and Output Files, Exercises

HTCondor Batch System

Machine Ownership: 1K × 100 one-hour jobs = 100K CPU hours, which corresponds to roughly 11.4 job slots kept busy for a whole year (100,000 h ÷ 8,760 h/year ≈ 11.4). With serial submission of 100 jobs to 10 owned machines, the batch takes a long time and resources are wasted. (Chart: running jobs vs. time in seconds.)

Timesharing: with parallel submission of 100 jobs to 100 machines, the batch runs quicker and the machines can be used by others. (Chart: running jobs vs. time in seconds.)

Batch Scheduling: running jobs over time in the CERN Batch System. (Chart: running jobs vs. time in hours.)

CERN Batch Service: delivers computing resources to the experiments and departments for tasks such as physics event reconstruction, data analysis and simulation. It shares the resources fairly between all users. Current capacity is approximately 120,000 cores.

HTCondor: an open-source batch system implementation developed by the Center for High Throughput Computing at the University of Wisconsin–Madison. It provides a job queueing mechanism, scheduling policy, resource monitoring and resource management. http://research.cs.wisc.edu/htcondor/manual/

HTCondor Workflow: users submit their jobs to the HTCondor system, which dispatches them to worker nodes for execution. (Diagram: users, HTCondor system, worker nodes, jobs.)

HTCondor Service Components: the Central Manager (Negotiator and Collector), the submit machine (Scheduler and File Transfer Mechanism, where jobs enter via condor_submit <name_of_file>), the worker nodes, and shared storage (AFS, EOS). (Diagram.)

The Central Manager: composed of the Collector and Negotiator daemons. The Collector collects information on the machines in the pool and on the jobs in the queue, gathering it from all the daemons in the pool, and accepts queries from other daemons and from user-level commands (e.g. condor_q). The Negotiator negotiates between machines and job requests: it asks the Collector for a list of all the available machines and matches jobs to machines, taking the job requirements into account.

Negotiator: Matchmaking — the schedd's daemon (condor_schedd) runs one shadow for every running job; the worker node's daemon (condor_startd) runs one starter for every claimed slot. (Diagram.)

ClassAds: the framework by which Condor matches jobs with machines. They are analogous to the classified advertising section of a newspaper: users submitting jobs are buyers of compute resources, and machine owners are sellers. ClassAds are used for describing and advertising jobs and machines, for matching jobs to machines, and for statistical and debugging purposes.

Example ClassAds
Job ClassAd:
  AccountingGroup = "name of accounting group"
  ClusterId = 9
  Cmd = "/afs/cern.ch/…/welcome.sh"
  CompletionDate = 0
  CondorPlatform = "$CondorPlatform: x86_64_RedHat7 $"
  CondorVersion = "$CondorVersion: 8.5.8 Dec 13 2016 BuildID: 390781 $"
  DiskUsage = 1
  EnteredCurrentStatus = 1493728837
  Err = "error/welcome.9.0.err"
  ExitBySignal = false
  ExitStatus = 0
  FileSystemDomain = "cern.ch"
  GlobalJobId = "bigbird06.cern.ch#9.0#1493728837"
  Hostgroup = "$$(HostGroup:bi/condor/gridworker/share)"
  JobPrio = 0
  JobStatus = 1
  JobUniverse = 5
  NumJobCompletions = 0
  NumJobStarts = 0
  NumRestarts = 0
Machine ClassAd:
  COLLECTOR_HOST_STRING = "*.cern.ch, *.cern.ch"
  CondorLoadAvg = 0.0
  CondorPlatform = "$CondorPlatform: x86_64_RedHat6 $"
  CondorVersion = "$CondorVersion: 8.5.8 Dec 13 2016 BuildID: 390781 $"
  Cpus = 8
  FileSystemDomain = "cern.ch"
  JobStarts = 156
  Machine = "b658ea5902.cern.ch"
  Memory = 22500
  RecentJobStarts = 0
  SlotType = "Partitionable"
  SlotTypeID = 1
  SlotWeight = Cpus
  Start = ( StartJobs =?= true ) && ( RecentJobStarts < 5 ) && ( SendCredential =?= true )
  StartJobs = true
  TotalMemory = 22500
  TotalSlotCpus = 8
  TotalSlotDisk = 223032980.0
  TotalSlotMemory = 22500
  TotalSlots = 1

Job Startup: machines periodically send their ClassAds to the Collector. The user submits a job to the Schedd, which informs the Collector about the job. The Negotiator queries the Collector for waiting jobs and available machines, queries the Schedd for the job's requirements, and matches the job with a machine. The Schedd and the matched machine then contact each other and the job starts.

Job Submission

Submit File: provides commands on how to execute the job. It contains basic information about the executable, the arguments, the paths for the input and output files, and the number and names of the jobs in the queue.
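As an illustration (not taken from the slides; the script name is borrowed from the ClassAd example earlier), a minimal submit description file could look like:
  executable = welcome.sh
  arguments  = $(ProcId)
  queue 5
Running condor_submit on this file queues five jobs (processes 0-4) in one cluster; $(ProcId) expands to each job's process number.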

Condor Output and Logs: these files are defined in the submit file. Output is the STDOUT of the command or script; Error is the STDERR of the command or script; Log holds information about the job's execution (the execution host, the number of times the job has been submitted, the exit value, etc.). Relative or absolute paths can be used for all of them, but HTCondor will only search for the directory, so it must already have been created.
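A hedged sketch of the corresponding submit file lines, continuing the example above (the output/, error/ and log/ directory names are hypothetical and must exist before submission):
  output = output/welcome.$(ClusterId).$(ProcId).out
  error  = error/welcome.$(ClusterId).$(ProcId).err
  log    = log/welcome.$(ClusterId).log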

Requirements: in the submit file, job requirements can be set on the operating system, the number of CPUs, the memory, or specific machines. Custom ClassAd attributes can also be provided for the job by using "+Name_Of_Variable", e.g. +JobFlavour = espresso. The job is submitted by executing: condor_submit <name_of_submit_file>
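A hedged sketch of such requirements and custom attributes in the submit file (values are illustrative; request_cpus and request_memory are the standard HTCondor knobs for CPUs and memory in MB, and at CERN the JobFlavour value is usually written quoted):
  requirements   = (OpSys == "LINUX" && Arch == "X86_64")
  request_cpus   = 4
  request_memory = 2000
  +JobFlavour    = "espresso"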

Progress of Job Submission: to follow the progress of the job, execute: condor_wait path_to_log_file [job ID]. This watches the job event log file, created with the log command in the submit description file, and returns when one or more jobs from the log have completed or aborted. It will wait forever for jobs to finish unless a shorter wait time is specified: condor_wait [-wait seconds] log-file [job ID]. http://research.cs.wisc.edu/htcondor/manual/current/condor_wait.html
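For example, reusing the hypothetical log file from the submit file sketch above (cluster 9):
  condor_wait log/welcome.9.log                  # block until every job in the log completes or aborts
  condor_wait -wait 3600 log/welcome.9.log 9.0   # watch only job 9.0 and give up after one hour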

Inspecting Queues: the condor_q command queries the schedd for information about the jobs in the queue. Arguments can be used to filter the jobs of interest. Possible filters: cluster.process matches a job with that cluster and process id that is still in the queue; owner matches jobs in the queue that belong to this owner; -constraint expression matches jobs that satisfy this ClassAd expression.
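For instance, using the owner and job id that appear in the example output on the next slide:
  condor_q 8.0                            # one specific job (cluster.process)
  condor_q fprotops                       # every queued job owned by fprotops
  condor_q -constraint 'JobStatus == 1'   # every idle job (JobStatus 1 = idle)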

Example condor_q output:
  -- Schedd: bigbird06.cern.ch : <128.142.194.67:9618?... @ 05/02/17 10:04:46
  OWNER     BATCH_NAME        SUBMITTED   DONE   RUN   IDLE   TOTAL  JOB_IDS
  fprotops  CMD: welcome.sh   5/2 10:04      _     _      1       1  8.0
  1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
But which are the possible states of the jobs?

Job States: Idle (I) — the job is waiting in the queue to be executed. Running (R) — the job is running. Completed (C) — the job has exited. Held (H) — something is wrong with the submit file, or the user executed condor_hold <job_id>. Suspended (S) — the user executed condor_suspend. Removed (X) — the job has been removed by the user.
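The commands behind the last three states, with an illustrative job id (condor_release and condor_continue are the standard counterparts, although they are not covered on this slide):
  condor_hold 8.0       # put the job on hold (H)
  condor_release 8.0    # release a held job back to idle
  condor_suspend 8.0    # suspend a running job (S); condor_continue resumes it
  condor_rm 8.0         # remove the job from the queue (X)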

Investigating Failed Jobs

Diagnostics with condor_q: condor_q displays the current jobs in the queue. To see the ClassAd of a specific job id, execute: condor_q -l <JobId>. A very useful option for debugging when a job stays in the idle state is: condor_q -better <JobId> or condor_q -analyze <JobId>. Both display the reason why a job is not running; they perform an analysis against the constraints, the owner's preferences about the machines, etc., and sometimes also provide suggestions on how to solve the problem. For a more detailed analysis of complex requirements and of the job ClassAd attributes, execute: condor_q -better-analyze <JobId>.
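Put together, a typical debugging sequence for an idle job might be (job id illustrative):
  condor_q -l 8.0               # dump the full job ClassAd
  condor_q -analyze 8.0         # why is the job still idle?
  condor_q -better-analyze 8.0  # detailed match analysis against the requirements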

Investigating Jobs: the reason a job is Held can be found with: condor_q -hold <JobId>. In the case where a machine is not accepting the job, execute: condor_q -better -reverse <name of machine> <JobId>. After the completion of the job, condor_history can be used: it displays information about completed jobs from the history files. To display the ClassAd of a specific completed job, execute: condor_history -l <JobId>.
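For example (job id illustrative):
  condor_q -hold 8.0       # show the hold reason for a held job
  condor_history -l 8.0    # full ClassAd of the job after it has left the queue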

Diagnostics with condor_status: condor_status queries the Collector for information about the machines. Specifying a machine name as an argument displays its slots, their state and their activity:
  Name                        OpSys  Arch    State    Activity  LoadAv  Mem    ActvtyTime
  slot1@b6c70d5f39.cern.ch    LINUX  X86_64  Owner    Idle      0.500   10500  0+00:02:47
  slot1_1@b6c70d5f39.cern.ch  LINUX  X86_64  Claimed  Busy      0.000   2000   0+00:18:55
To display the ClassAd of a specific machine, execute: condor_status -l <name of the machine>
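Concretely, using the machine from the output above (the slot-name form of the argument is assumed here):
  condor_status b6c70d5f39.cern.ch              # slots on that machine, their state and activity
  condor_status -l slot1@b6c70d5f39.cern.ch     # full ClassAd of a single slot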

Input and Output Files

transfer_input_files = path — only the executable is transferred; to use other files the File Transfer Mechanism is required. Input files are defined in the submit file by adding: transfer_input_files = path. The path can be absolute or relative, and multiple files can be specified using comma separation.
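A hedged sketch of the corresponding submit file lines (file names hypothetical; should_transfer_files is the standard switch that enables the File Transfer Mechanism, not shown on the slide):
  should_transfer_files = YES
  transfer_input_files  = data.txt, config/settings.ini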

transfer_output_files = name_of_file — by default, files created or modified during job execution will be transferred back. The set of output files transferred can be restricted by adding to the submit file: transfer_output_files = name_of_file. HTCondor will look for the file in the scratch directory of the executing machine. Multiple files can be specified using comma separation. Note that this command is not related to the error, output and log files.
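For instance (file names hypothetical), to bring back only two result files:
  transfer_output_files = results.root, summary.txt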

Spooling Files: another way to restrict the output files is to use the spool option: condor_submit -spool <name of submit file>. The files are stored in the schedd's spool directory after the completion of the job. To transfer all or some of them back to the submit machine, execute: condor_transfer_data <name of user> || <JobId>. This acts upon all files, including the error, log and output files.
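A sketch of the workflow (submit file name hypothetical; the user and job id are reused from the earlier condor_q output):
  condor_submit -spool welcome.sub     # submit with spooling
  condor_transfer_data fprotops        # fetch spooled output for all of this user's jobs
  condor_transfer_data 8.0             # or for a single job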

Exercises: these will help you to understand HTCondor and how to use it correctly. Let’s play with HTCondor!! http://cern.ch/htcondor