1 High Throughput Computing with Condor at Purdue. XSEDE ECSS Monthly Symposium.

2 Topics: What is Condor? What is High Throughput Computing? Why Condor? Why not Condor? Condor at Purdue. Submitting and managing jobs. Suitable jobs.

3 What is Condor? A product of the University of Wisconsin-Madison: a job scheduler, a resource manager, and a workflow management system, focused on High Throughput Computing.

4 What is High Throughput Computing (HTC)? Delivering large amounts of processing over a long period of time.

5 HTC v. HPC: FLOPS extracted v. peak FLOPS; distributed ownership v. central ownership; capturing idle cycles v. losing idle cycles; throughput v. response time; distributed memory v. tightly-coupled memory; 1,000 jobs v. 1 job.

6 Why Condor? To recover otherwise-wasted compute cycles, to schedule related jobs, and to gain access to more cores.

7 Advantages of Condor: many tasks running at once; access to more powerful computers; use of otherwise-wasted cycles; minimal impact on the remote computers; security; little or no code modification.

8 Disadvantages of Condor: jobs compete for access; a task may take longer to complete; processing can be lost; parallel jobs aren't available; large files can impact the remote computer; the remote computers are heterogeneous; few compatible compilers.

9 Condor at Purdue: installed on the large cyberinfrastructure clusters and on distributed desktops; used as a scavenger of free cycles; parallel jobs not supported; ~27K Linux cores and ~1K Windows cores, with several more kilocores at DiaGrid partner sites.

10 Condor at Purdue: jobs are vacated when a PBS job starts, so long-running jobs may never complete; common home directory across clusters; scratch directories are roughly per-cluster; ~7 TB of checkpoint storage for standard universe jobs.

11 Job Universes. Vanilla universe: doesn't require a recompile; no native checkpoint mechanism. Standard universe: streams I/O (can overload the submit node); supports checkpointing; no fork(), shared memory, or pipes.

12 File transfer: a vanilla universe feature that allows jobs to flow to other sites.

13 Compiling for Condor: a standard universe requirement. The condor_compile command wraps a limited set of compilers and links against the Condor libraries to add support for I/O streaming and checkpointing.
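A minimal sketch of relinking a program for the standard universe; the source file name simpletest.c is a hypothetical stand-in, and condor_compile simply wraps the usual compiler invocation:

condor_compile gcc -o bin/simpletest simpletest.c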

14 Checkpointing: saves all of the job's state information, transfers that state back to Condor's management, removes the job from the processor, and restarts the interrupted job on another unused processor.

15 Job lifecycle: the job is submitted; the scheduler process contacts the negotiator process; the negotiator matches the job to an available slot; if no slots are available, the scheduler contacts a remote negotiator; the execute node runs the job; if the job gets evicted, the scheduler process contacts the negotiator process again.
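As an illustrative aside, the pool of execute slots the negotiator matches against can be listed from the submit node with condor_status (shown here with no options; the exact output depends on the pool):

condor_status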

16 Submitting a job. Create a submit file:

# Simple Condor job file
Executable = bin/simpletest
Arguments = 600
Universe = standard
Log = log/$(Cluster).$(Process).log
Error = log/$(Cluster).$(Process).err
Output = log/$(Cluster).$(Process).out
+TGProject = TG-STA060013N
Queue 10

17 Submitting a job. With file transfer:

# Simple Condor job file
Executable = bin/process_files.sh
Universe = vanilla
Should_transfer_files = IF_NEEDED
When_to_transfer_output = ON_EXIT
Transfer_input_files = input.dat
Transfer_output_files = output.png
Log = log/$(Cluster).$(Process).log
+TGProject = TG-STA060013N
Queue

18 Submitting a job. The job is submitted with the condor_submit command:
condor_submit myjobfile.condor

19 Managing jobs.
Get all jobs in the queue: condor_q
Get only one user's jobs: condor_q user
Why isn't my job running? condor_q -better-analyze jobid
Remove a job: condor_rm jobid

20 Getting the most cores: Requirements = ... Condor tries to be helpful by inserting automatic job requirements (OpSys, Arch, FileSystemDomain, Memory >= ImageSize). This sometimes over-constrains jobs.
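To see exactly which Requirements expression condor_submit generated for a queued job, the full job ClassAd can be dumped; the job id 1234.0 below is only a hypothetical placeholder:

condor_q -l 1234.0 | grep -i Requirements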

21 Getting the most cores: Requirements = ... The Requirements expression gives you the flexibility to add or remove execute nodes.
Example: job files are in your home directory:
Requirements = regexp("rcac.purdue.edu", FileSystemDomain)
Example: job executable is a Windows binary:
Requirements = (OpSys == "WINNT61")

22 A special note about Memory. Condor sometimes overestimates the memory usage of a job: it reports each slot's memory as total memory divided by the number of cores, but jobs are not actually memory-constrained. It's best to put a dummy memory requirement in the submission file.
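A minimal sketch of such a dummy requirement; the 512 MB threshold is an arbitrary placeholder, and once your own expression mentions Memory, condor_submit should skip inserting its automatic Memory >= ImageSize clause:

Requirements = (Memory >= 512)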

23 Getting the most out of your cores: Rank = ... You can express a preference for the nodes a job lands on. Example: prefer 64-bit nodes with lots of memory:
Rank = (Arch == "X86_64")*1000 + Memory

24 Workflow management with DAGMan (Directed Acyclic Graph Manager). Defines parent-child relationships among jobs; allows pre- and post-execution hooks; submit with condor_submit_dag.

25 Diamond DAG (diagram: node A at the top feeding B1 and B2, which both feed C at the bottom)

26 Diamond DAG

# Diamond-shaped DAG
Job First p_00060.A.sub
Job Second_1 p_00060.B1.sub
Job Second_2 p_00060.B2.sub
Job Third p_00060.C.sub
PARENT First CHILD Second_1 Second_2
PARENT Second_1 Second_2 CHILD Third
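As a sketch of the pre- and post-execution hooks mentioned on slide 24, DAGMan also accepts SCRIPT PRE and SCRIPT POST lines in the same DAG file, and the whole workflow is submitted with condor_submit_dag; the script names and the diamond.dag file name are hypothetical:

SCRIPT PRE First prepare_inputs.sh
SCRIPT POST Third collect_results.sh

condor_submit_dag diamond.dag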

27 More complex DAGs

28 Who Benefits from Condor? Monte Carlo simulations, parameter sweeps, and other "embarrassingly parallel" jobs.

29 Purdue's Condor Users: Structural Biology, Education, Chemical Engineering, Bioinformatics, Climate, Visualization, Distributed Rendering, High Energy Physics.

30 For more information. University of Wisconsin website: http://research.cs.wisc.edu/condor Email: bcotton@purdue.edu or rcac-help@purdue.edu

