Published by Aleesha Little. Modified over 9 years ago.
1
Advanced Condor mechanisms
Condor Project, Computer Sciences Department, University of Wisconsin-Madison
CERN, Feb 14 2011
2
www.condorproject.org
A better title… “Condor Potpourri”
› Igor feedback: “Could be useful to people, but not Monday”
› If not of interest, new topic in 1 minute
3
Central Manager Failover
› Condor Central Manager has two services
› condor_collector
  Now a list of collectors is supported
› condor_negotiator (matchmaker)
  If it fails, an election process runs and another negotiator takes over
  Contributed technology from Technion
4
Submit node robustness: Job Progress continues if connection is interrupted
› Condor supports reestablishment of the connection between the submitting and executing machines.
  If there is a network outage between the execute and submit machines
  If the submit machine restarts
› To take advantage of this feature, put the following line into your job’s submit description file:
  JobLeaseDuration =
  For example:
  job_lease_duration = 1200
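The lease idea behind `job_lease_duration` can be illustrated with a small model. This is only a sketch under stated assumptions: the class `JobLease` and its methods are hypothetical names invented here, not Condor code, and the model assumes the lease is measured in seconds since the last contact with the submit machine.

```python
from dataclasses import dataclass


@dataclass
class JobLease:
    """Hypothetical model of a job lease held on the execute side."""
    duration: int        # seconds, e.g. job_lease_duration = 1200
    last_renewal: float  # time of the last contact with the submit machine

    def renew(self, now: float) -> None:
        # Called whenever the submit side is heard from again.
        self.last_renewal = now

    def expired(self, now: float) -> bool:
        # In this model the job keeps running only while the lease is valid.
        return now - self.last_renewal > self.duration


lease = JobLease(duration=1200, last_renewal=0.0)
still_ok_during_outage = not lease.expired(now=600.0)   # 10 minute outage
lease.renew(now=600.0)                                  # connection reestablished
still_ok_after_renewal = not lease.expired(now=1700.0)  # 1100 s since renewal
gave_up = lease.expired(now=1801.0)                     # 1201 s since renewal
print(still_ok_during_outage, still_ok_after_renewal, gave_up)
```

In this model a short outage or a submit machine restart does not cost any job progress as long as contact resumes before the lease runs out.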
5
Submit node robustness: Job Progress continues if submit machine fails
Automatic Schedd Failover
Condor can support a submit machine “hot spare”
  If your submit machine A is down for longer than N minutes, a second machine B can take over
  Requires shared filesystem (or just DRBD*?) between machines A and B
*Distributed Replicated Block Device – www.drbd.org
6
DRBD
7
Interactive Debugging
› Why is my job still running?
  Is it stuck accessing a file?
  Is it in an infinite loop?
› condor_ssh_to_job
  Interactive debugging in UNIX
  Use ps, top, gdb, strace, lsof, …
  Forward ports, X, transfer files, etc.
8
condor_ssh_to_job Example
% condor_q
-- Submitter: perdita.cs.wisc.edu : :
 ID   OWNER     SUBMITTED    RUN_TIME    ST PRI SIZE CMD
 1.0  einstein  4/15 06:52   1+12:10:05  R  0   10.0 cosmos
1 jobs; 0 idle, 1 running, 0 held
% condor_ssh_to_job 1.0
Welcome to slot4@c025.chtc.wisc.edu!
Your condor job is running with pid(s) 15603.
$ gdb -p 15603
…
9
How it works
› ssh keys created for each invocation
› ssh
  Uses OpenSSH ProxyCommand to use the connection created by ssh_to_job
› sshd
  Runs as same user id as job
  Receives connection in inetd mode
  So nothing new listening on network
  Works with CCB and shared_port
10
What?? Ssh to my worker nodes??
› Why would any sysadmin allow this?
› Because the process tree is managed
  Cleanup at end of job
  Cleanup at logout
› Can be disabled by nonbelievers
11
Concurrency Limits
› Limit job execution based on admin-defined consumable resources
  E.g. licenses
› Can have many different limits
› Jobs say what resources they need
› Negotiator enforces limits pool-wide
12
Concurrency Example
› Negotiator config file
  MATLAB_LIMIT = 5
  NFS_LIMIT = 20
› Job submit file
  concurrency_limits = matlab,nfs:3
  This requests 1 Matlab token and 3 NFS tokens
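The token accounting in this example can be sketched in a few lines of Python. This is an illustration only, not the negotiator's implementation: the function names `parse_concurrency_limits` and `can_start` are made up here, and treating a resource with no configured limit as unlimited is an assumption of the sketch.

```python
def parse_concurrency_limits(spec: str) -> dict[str, int]:
    """Turn 'matlab,nfs:3' into {'matlab': 1, 'nfs': 3}."""
    requests = {}
    for item in spec.split(","):
        name, _, count = item.strip().partition(":")
        # A bare name requests one token; 'name:N' requests N tokens.
        requests[name.lower()] = int(count) if count else 1
    return requests


def can_start(requests: dict[str, int],
              limits: dict[str, int],
              in_use: dict[str, int]) -> bool:
    """True if every requested token fits under its pool-wide limit."""
    return all(
        # Assumption of this sketch: no configured limit means unlimited.
        in_use.get(name, 0) + count <= limits.get(name, float("inf"))
        for name, count in requests.items()
    )


limits = {"matlab": 5, "nfs": 20}            # MATLAB_LIMIT = 5, NFS_LIMIT = 20
requests = parse_concurrency_limits("matlab,nfs:3")
print(requests)
print(can_start(requests, limits, in_use={"matlab": 4, "nfs": 17}))
print(can_start(requests, limits, in_use={"matlab": 4, "nfs": 18}))
```

The second call is refused because 18 NFS tokens in use plus the 3 requested would exceed the limit of 20, even though a Matlab token is still free.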
13
Green Computing
› The startd has the ability to place a machine into a low power state (Standby, Hibernate, Soft-Off, etc.)
  Controlled by HIBERNATE, HIBERNATE_CHECK_INTERVAL
  If all slots return non-zero, then the machine can be powered down via the condor_power hook
  A final acked ClassAd is sent to the collector that contains wake-up information
› Machine ads in “Offline State”
  Stored persistently to disk
  Ad updated with “demand” information: if this machine were around, would it be matched?
14
Now what?
15
condor_rooster
› Periodically wake up based on ClassAd expression (Rooster_UnHibernate)
› Throttling controls
› Hook callouts make for interesting possibilities…
16
Job Router
› Automated way to let jobs run on a wider array of resources
  Transform jobs into different forms
  Reroute jobs to different destinations
17
What is “job routing”?
original (vanilla) job:
  Universe = “vanilla”
  Executable = “sim”
  Arguments = “seed=345”
  Output = “stdout.345”
  Error = “stderr.345”
  ShouldTransferFiles = True
  WhenToTransferOutput = “ON_EXIT”
routed (grid) job:
  Universe = “grid”
  GridType = “gt2”
  GridResource = \
    “cmsgrid01.hep.wisc.edu/jobmanager-condor”
  Executable = “sim”
  Arguments = “seed=345”
  Output = “stdout”
  Error = “stderr”
  ShouldTransferFiles = True
  WhenToTransferOutput = “ON_EXIT”
[Diagram labels: original (vanilla) job; JobRouter with Routing Table: Site 1 …, Site 2 …; routed (grid) job; final status]
18
Routing is just site-level matchmaking
› With feedback from job queue
  number of jobs currently routed to site X
  number of idle jobs routed to site X
  rate of recent success/failure at site X
› And with power to modify job ad
  change attribute values (e.g. Universe)
  insert new attributes (e.g. GridResource)
  add a “portal” grid proxy if desired
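The two halves of routing described above (choosing a site from queue feedback, then rewriting the job ad) can be sketched as follows. This is a toy model, not the JobRouter's actual logic or configuration syntax: the route dictionary keys `name`, `max_idle_jobs`, and `set`, and the function names, are hypothetical. The attribute values are taken from the job routing example.

```python
import copy


def pick_route(routes: list[dict], idle_counts: dict[str, int]):
    """Return the first route whose site still has room for idle jobs."""
    for route in routes:
        if idle_counts.get(route["name"], 0) < route["max_idle_jobs"]:
            return route
    return None


def route_job(job_ad: dict, route: dict) -> dict:
    """Copy the original ad, then change or insert the route's attributes."""
    routed = copy.deepcopy(job_ad)
    routed.update(route["set"])
    return routed


original = {
    "Universe": "vanilla",
    "Executable": "sim",
    "Arguments": "seed=345",
}
routes = [
    {   # Hypothetical routing table entry for one site
        "name": "Site 1",
        "max_idle_jobs": 10,
        "set": {
            "Universe": "grid",
            "GridType": "gt2",
            "GridResource": "cmsgrid01.hep.wisc.edu/jobmanager-condor",
        },
    },
]

route = pick_route(routes, idle_counts={"Site 1": 3})
routed = route_job(original, route) if route else None
print(routed)
```

The original ad is left untouched, so in this sketch the vanilla job can still run locally if no route has room.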
19
Condor-G Matchmaking
› Use Condor’s matchmaking to select sites to send grid universe jobs to
› You must create the “machine” ads yourself using condor_advertise
› No claiming protocol
› Each machine ad can match to multiple jobs
20
Dynamic Slot Partitioning
› Divide slots into chunks sized for matched jobs
› Readvertise remaining resources
› Partitionable resources are cpus, memory, and disk
› See Matt Farrellee’s talk
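A minimal sketch of the carve-and-readvertise idea, assuming a simplified model in which a slot is just a bundle of cpus, memory, and disk. The `Resources` class and the `carve` function are hypothetical names for illustration, not part of Condor.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resources:
    cpus: int
    memory: int  # MB in this sketch
    disk: int    # KB in this sketch


def carve(slot: Resources, request: Resources) -> Optional[tuple]:
    """Split a chunk sized for the job off the partitionable slot.

    Returns (dynamic_slot, remaining), or None if the job does not fit.
    """
    if (request.cpus > slot.cpus
            or request.memory > slot.memory
            or request.disk > slot.disk):
        return None
    remaining = Resources(
        cpus=slot.cpus - request.cpus,
        memory=slot.memory - request.memory,
        disk=slot.disk - request.disk,
    )
    # The remaining resources are what would be readvertised.
    return request, remaining


partitionable = Resources(cpus=8, memory=16384, disk=100000)
result = carve(partitionable, Resources(cpus=2, memory=4096, disk=10000))
print(result)
```

A job that asks for more than what remains gets `None` here, which mirrors why jobs with large resource requirements can wait a long time once a slot has been split into small pieces.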
21
Dynamic Partitioning Caveats
› Cannot preempt original slot or group of sub-slots
  Potential starvation of jobs with large resource requirements
› Partitioning happens once per slot each negotiation cycle
  Scheduling of large slots may be slow
22
High Throughput Parallel Computing
› Parallel jobs that run on a single machine
  Today 8-16 cores, tomorrow 32+ cores
› Use whatever parallel software you want
  It ships with the job
  MPI, OpenMP, your own scripts
  Optimize for on-board memory access
23
Configuring Condor for HTPC
› Two strategies:
  Suspend/drain jobs to open HTPC slots
  Hold empty cores until HTPC slot is open
› We have a recipe for the former on the Condor Wiki
  http://condor-wiki.cs.wisc.edu
› User accounting enabled by Condor’s notion of “Slot Weights”
24
CPU Affinity
Four core Machine running four jobs w/o affinity
[Figure: jobs j1, j2, j3, j4, with j3 shown as j3a, j3b, j3c, j3d, placed on core1, core2, core3, core4]
25
CPU Affinity to the rescue
SLOT1_CPU_AFFINITY = 0
SLOT2_CPU_AFFINITY = 1
SLOT3_CPU_AFFINITY = 2
SLOT4_CPU_AFFINITY = 3
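What pinning means at the operating system level can be shown with Python's `os.sched_setaffinity`, which is available on Linux only. This is a sketch of the general mechanism, not of how Condor applies the settings above; the helper names `core_for_slot` and `pin_current_process` are hypothetical.

```python
import os


def core_for_slot(slot_id: int) -> int:
    """Mirror the mapping above: slot N is tied to core N-1."""
    return slot_id - 1


def pin_current_process(slot_id: int) -> bool:
    """Pin this process to its slot's core, if the platform allows it.

    Returns True if the affinity was set, False otherwise.
    """
    core = core_for_slot(slot_id)
    # sched_setaffinity/sched_getaffinity exist only on some platforms (Linux).
    if not hasattr(os, "sched_setaffinity"):
        return False
    if core not in os.sched_getaffinity(0):
        return False
    os.sched_setaffinity(0, {core})  # pid 0 means the calling process
    return True


print([core_for_slot(n) for n in (1, 2, 3, 4)])
```

On Linux the affinity mask is inherited by child processes, so a job that spawns several workers stays on the single core it was given rather than spilling onto cores used by other slots.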
26
Four core Machine running four jobs w/affinity
[Figure: jobs j1, j2, j3, j4, with j3 shown as j3a, j3b, j3c, j3d, placed on core1, core2, core3, core4]
27
Condor + Hadoop FS (HDFS)
Condor+HDFS = 2 + 2 = 5 !!!
A Synergy exists (next slide)
  Hadoop as distributed storage system
  Condor as cluster management system
Large number of distributed disks in a compute cluster
Managing disk as a resource
28
condor_hdfs daemon
› Main integration point of HDFS within Condor
› Configures HDFS cluster based on existing condor_config files
› Runs under condor_master and can be controlled by existing Condor utilities
› Publishes interesting parameters to the Collector, e.g. IP address, node type, disk activity
› Currently deployed at UW-Madison
29
Condor + HDFS : Next Steps?
› Integrate with File Transfer Mechanism
› FileNode Failover
› Management of HDFS
› What about HDFS in a GlideIn environment??
› Online transparent access to HDFS??
30
Remote I/O Socket
› Job can request that the condor_starter process on the execute machine create a Remote I/O Socket
› Used for online access of files on the submit machine – without Standard Universe.
  Use in Vanilla, Java, …
› Libraries provided for Java and for C, e.g.:
  Java: FileInputStream -> ChirpInputStream
  C: open() -> chirp_open()
› Or use Parrot!
31
[Figure: Remote I/O architecture. Execution Site: starter, Fork, Job, I/O Library, I/O Proxy, Local System Calls, Local I/O (Chirp). Submission Site: shadow, I/O Server, Home File System. Link between sites: Secure Remote I/O.]
33
DMTCP
› Written at Northeastern U. and MIT
› User-level process checkpoint/restart library
› Fewer restrictions than Condor’s Standard Universe
  Handles threads and multiple processes
  No re-link of executable
› DMTCP and Condor Vanilla Universe integration exists via a job wrapper script
34
Questions? Thank You!