Profiling Applications to Choose the Right Computing Infrastructure plus Batch Management with HTCondor Kyle Gross – Operations Support.

Profiling Applications to Choose the Right Computing Infrastructure plus Batch Management with HTCondor
Kyle Gross, Operations Support Manager, Open Science Grid, Research Technologies, Indiana University
Content contributed by the University of Wisconsin Condor Team, Rob Quick, and Scot Kronenfeld

Follow along at: n/ASP2016/

Some thoughts on the exercises
- It's okay to move ahead on exercises if you have time
- It's okay to take longer on them if you need to
- If you move along quickly, try the "On Your Own" sections and "Challenges"

Most important!
Please ask me questions:
- during the lectures
- during the exercises
- during the breaks
- during the meals
- over dinner
- via email after we depart
If I don't know the answer, I'll find the right person to answer your question.

Goals for this session
- Profiling your application
- Picking the appropriate resources
- Understanding the basics of HTCondor

Let's take one step at a time
- Can you run one job on one computer?
- Can you run one job on another computer?
- Can you run 10 jobs on a set of computers?
- Can you run a multiple-job workflow?
- How do we put this all together?
This is the path we'll take in the school.
(Slide diagram: axes running from Small to Large and from Local to Distributed.)

What does the user provide?
- A "headless" job: not interactive, no GUI (how could you interact with 1000 simultaneous jobs?)
- A set of input files
- A set of output files
- A set of parameters (command-line arguments)
- Requirements, e.g. "my job requires at least 2 GB of RAM" or "my job requires Linux"
- Control/policy, e.g. "send me email when the job is done", "job 2 is more important than job 1", "kill my job if it runs for more than 6 hours"
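As a preview, here is a minimal sketch of how such requirements and policies can be written in an HTCondor submit file; the executable and file names are hypothetical, and submit files are introduced properly later in this session:

# hypothetical headless program with one input file
universe             = vanilla
executable           = analyze.sh
arguments            = -n 100
transfer_input_files = input.dat
request_memory       = 2048                  # "my job requires at least 2 GB of RAM"
requirements         = (OpSys == "LINUX")    # "my job requires Linux"
notification         = Complete              # send me mail when the job is done
priority             = 10                    # rank this job above my other jobs
# remove the job if it has been running for more than 6 hours
periodic_remove      = (JobStatus == 2) && ((time() - EnterCurrentStatus) > 6*3600)
queue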

What does the system provide?
Methods to:
- Submit/cancel a job
- Check on the state of a job
- Check on the state of available computers
Processes to:
- Reliably track the set of submitted jobs
- Reliably track the set of available computers
- Decide which job runs on which computer
- Manage a single computer
- Start up a single job

Gedankenexperiment
Let's assume you have a "large job". What factors could make it large?
- Large data input or output (or both)
- Needs to do heavy calculation
- Needs a lot of memory
- Needs to communicate with other jobs
- Reads and writes a lot of data/files
- Heavy graphics processing
- Any combination of the above

There is no "one size fits all" solution
But some solutions are more "open" than others:
- Local laptop/desktop
- Local cluster
- HPC system
- Shared HTC resources
- Clouds

Why is HTC hard?
The HTC system has to keep track of:
- Individual tasks (a.k.a. jobs) and their inputs
- Computers that are available
The system has to recover from failures:
- There will be failures! Distributed computers mean more chances for failure.
You have to share computers:
- Sharing can be within an organization or between organizations
- So you have to worry about security
- And you have to worry about policies on how you share
If you use a lot of computers, you have to handle variety:
- Different kinds of computers (architecture, OS, speed, etc.)
- Different kinds of storage (access method, size, speed, etc.)
- Different networks interacting (network problems are hard to debug!)

Surprise! Condor does this (and more)
Methods to:
- Submit/cancel a job: condor_submit / condor_rm
- Check on the state of a job: condor_q
- Check on the state of available computers: condor_status
Processes to:
- Reliably track the set of submitted jobs: schedd
- Reliably track the set of available computers: collector
- Decide which job runs where: negotiator
- Manage a single computer: startd
- Start up a single job: starter
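A minimal sketch of these commands at the shell; the submit file name job.submit and the job ID 20.1 are hypothetical:

$ condor_submit job.submit    # submit the job(s) described in job.submit
$ condor_q                    # check on the state of your jobs
$ condor_status               # check on the state of the computers in the pool
$ condor_rm 20.1              # cancel job 20.1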

But not only Condor
You can use other systems:
- PBS/Torque
- Oracle Grid Engine (né Sun Grid Engine)
- LSF
- SLURM
- …
But I won't cover them:
- My experience is with Condor
- My bias is with Condor
- Overlays exist
What should you learn at the school?
- How do you think about computing resources?
- How can you do your science with HTC?
- …
For now, learn it with Condor; you can apply it to other systems later.

A brief introduction to Condor
Please note, we will only scratch the surface of Condor. We won't cover MPI, master-worker, advanced policies, site administration, security mechanisms, submission to other batch systems, virtual machines, cron, high availability, computing on demand, or containers.

Condor takes computers… and jobs… and matches them
(Slide diagram: dedicated clusters and desktop computers on one side, jobs with requests such as "I need a Mac!" and "I need a Linux box with 2GB RAM!" on the other, joined by the Matchmaker.)

Quick terminology
- Cluster: a dedicated set of computers, not for interactive use
- Pool: a collection of computers used by Condor; may be dedicated, may be interactive
Remember:
- Condor can manage a cluster in a machine room
- Condor can use desktop computers
- Condor can access remote computers
- HTC uses all available resources

Matchmaking
- Matchmaking is fundamental to Condor
- Matchmaking is two-way:
  - The job describes what it requires: "I need Linux && 8 GB of RAM"
  - The machine describes what it requires: "I will only run jobs from the Physics department"
- Matchmaking allows preferences: "I need Linux, and I prefer machines with more memory, but will run on any machine you provide me"
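A sketch of how the two sides might express this. The requirements and rank lines go in the job's submit file; the START expression is set by the machine's administrator in its HTCondor configuration. The user name "smith" is made up, and restricting by owner is just one example of a machine-side policy:

# Job side (submit file): require Linux, prefer machines with more memory,
# but accept any machine that matches the requirements
requirements = (OpSys == "LINUX")
rank         = Memory

# Machine side (condor_config): only run jobs submitted by user "smith"
START = (Owner == "smith")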

Why two-way matching?
Condor conceptually divides people into three groups:
- Job submitters
- Computer owners
- Pool (cluster) administrators
These may or may not be the same people, and all three groups have preferences.

ClassAds
- ClassAds state facts:
  - My job's executable is analysis.exe
  - My machine's load average is 5.6
- ClassAds state preferences:
  - I require a computer with Linux
- ClassAds are extensible:
  - They say whatever you want them to say

Example ClassAd
MyType = "Job"
TargetType = "Machine"
ClusterId = 1377
Owner = "roy"
Cmd = "analysis.exe"
Requirements = (Arch == "INTEL") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize)
…
(Slide annotations: "roy" is a string, 1377 is a number, Requirements is an expression.)

Schema-free ClassAds
- Condor imposes some schema: Owner is a string, ClusterId is a number, …
- But users can extend it however they like, for jobs or machines:
  - AnalysisJobType = "simulation"
  - HasJava_1_6 = TRUE
  - ShoeLength = 10
- Matchmaking can use these attributes:
  - Requirements = OpSys == "LINUX" && HasJava_1_6 == TRUE
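For example, a job can add a custom attribute to its own ClassAd from the submit file with a leading '+'. AnalysisJobType and HasJava_1_6 are invented attributes, not ones HTCondor defines; the machine-side attribute only works if the machines in the pool actually advertise it:

# In the submit file: add a custom attribute to the job's ClassAd
+AnalysisJobType = "simulation"

# Match only machines that advertise the custom attribute HasJava_1_6
requirements = (OpSys == "LINUX") && (HasJava_1_6 == TRUE)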

Don't worry
- You won't write ClassAds (usually):
  - You'll create a simple submit file
  - Condor will write the ClassAd
  - You can extend the ClassAd if you want to
- You won't write requirements (usually):
  - Condor writes them for you
  - You can extend them
  - In some environments you provide attributes instead of requirements expressions
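A minimal submit file might look like this sketch (the file names are hypothetical); Condor turns it into the job's ClassAd for you:

universe   = vanilla
executable = analysis.exe
output     = analysis.out      # stdout of the job
error      = analysis.err      # stderr of the job
log        = analysis.log      # Condor's record of what happened to the job
queue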

Matchmaking diagram
(Slide diagram: the condor_schedd holds the queue and provides the job queue service; the matchmaker consists of the collector (information service) and the negotiator (matchmaking service); numbered arrows 1-3 show the matchmaking steps.)

Why do jobs fail?
- The computer running the job fails (or the network, or the disk, or the OS, or …)
- Your job might be preempted: Condor decides your job is less important than another, so your job is stopped and another is started. Not a "failure" per se, but it may feel like it to you.

Reliability
When a job fails or is preempted:
- It stays in the queue (on the schedd)
- A note is written to the job log file
- It reverts to the "idle" state
- It is eligible to be matched again
Relax! Condor will run your job again.
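A sketch of checking on a job that was preempted; the job ID 20.1 and the log file name my_job.log are hypothetical and match the earlier examples:

$ condor_q 20.1       # the job is still in the queue, back in the idle state
$ tail my_job.log     # the job log records the eviction and the later restart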

Access to data in Condor
- Option 1: Shared filesystem. Simple to use, but make sure your filesystem can handle the load.
- Option 2: Condor's file transfer. Can automatically send back changed files, transfers multiple files atomically, can be encrypted over the wire; most common for small applications and data.
- Option 3: Remote I/O

Condor File Transfer
Universe = vanilla
Executable = my_job
Log = my_job.log
should_transfer_files = YES
Transfer_input_files = dataset$(Process), common.data
Queue 600

should_transfer_files = YES: always transfer files to the execution site
should_transfer_files = NO: rely on a shared filesystem
should_transfer_files = IF_NEEDED: transfer the files automatically only if needed

Condor File Transfer with URLs
Transfer_input_files can be a URL. For example:
transfer_input_files =
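A sketch with a made-up URL (example.org is a placeholder, not the address from the original slide); HTCondor fetches the file on the execute machine:

transfer_input_files = http://example.org/dataset.tar.gz, common.data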

Clusters & Processes
- One submit file can describe lots of jobs: all the jobs in a submit file are a cluster of jobs (yes, the same term as a cluster of computers)
- Each cluster has a unique cluster number
- Each job in a cluster is called a process
- A Condor job ID is the cluster number, a period, and the process number (e.g. "20.1")
- A cluster is allowed to have one or more processes; there is always a cluster for every job

The $(Process) macro
- The initial directory for each job can be specified as run_$(Process), and instead of submitting a single job, we use "Queue 600" to submit 600 jobs at once
- The $(Process) macro will be expanded to the process number for each job in the cluster (0 through 599), so we'll have run_0, run_1, …, run_599 directories
- All the input/output files will be in different directories!

Example of $(Process)
# Example condor_submit input file that defines
# a cluster of 600 jobs with different directories
Universe = vanilla
Executable = my_job
Log = my_job.log
Arguments = -arg1 -arg2
Input = my_job.stdin
Output = my_job.stdout
Error = my_job.stderr
InitialDir = run_$(Process)
Queue 600

Queue 600 creates job 3.0 … (and the other processes in the same cluster), each running in its own directory run_0 … run_599.

More $(Process)
You can use $(Process) anywhere:
Universe = vanilla
Executable = my_job
Log = my_job.$(Process).log
Arguments = -randomseed $(Process)
Input = my_job.stdin
Output = my_job.stdout
Error = my_job.stderr
InitialDir = run_$(Process)
Queue

Sharing a directory
You don't have to use separate directories; $(Cluster) will help distinguish runs:
Universe = vanilla
Executable = my_job
Arguments = -randomseed $(Process)
Input = my_job.input.$(Process)
Output = my_job.stdout.$(Cluster).$(Process)
Error = my_job.stderr.$(Cluster).$(Process)
Log = my_job.$(Cluster).$(Process).log
Queue

Not only one programming language
- You ran a C program yesterday
- You can also run scripting languages such as bash, Python, and Perl
- You can also execute programs via commands like R and root
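A sketch of a submit file that runs a hypothetical Python script instead of a compiled program; the script needs a shebang line and assumes python3 is installed on the execute machine:

universe             = vanilla
executable           = my_analysis.py      # first line: #!/usr/bin/env python3
arguments            = --seed $(Process)
log                  = my_analysis.log
transfer_input_files = common.data
queue 10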

Things to remember so far…
- There are several different computing environments
- There is a very diverse set of computing jobs
- Matching jobs to resources is key to not wasting resources
- Not all of the available environments are open environments
- Research computing is complex

Quick UNIX refresher before we start
- Editors: nano, vi, emacs, cat >, etc.
- Commands: source, module, chmod, ls
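A quick sketch of those commands; the file names are hypothetical:

$ nano myscript.sh        # edit a file (vi, emacs, or cat > work too)
$ chmod +x myscript.sh    # make the script executable
$ ls -l myscript.sh       # check its permissions
$ source setup.sh         # load environment settings into the current shell
$ module load python      # on systems that provide environment modules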

That was a whirlwind tour!
Enough with the presentation: let's use Condor!
Goal: extend the diversity of our jobs and add some data to the mix.

Questions?
Questions? Comments? Feel free to ask me questions now or later: Kyle Gross, Rob Quick
Exercises start here: ucation/ASP2016/
Presentations are also available from this URL.