1
Talking Points: Deployment on Big Infrastructures
INFN HTCondor Workshop, Oct 2016
2
Examples
Example: UW-Madison CHTC
  Pool Size: ~15k slots
  Central Manager: 8 cores (load average of 2), 8GB RAM (5GB in use), no special config
  Submit Machines: ~80 submit machines, 3 "big" general-purpose ones; each big one typically has ~10k running / 100k queued jobs, 32 cores, 96GB RAM, SSD
Example: Global CMS Pool
  Pool Size: ~150k - 200k slots
  Central Manager: collector tree, no preemption
  Submit Machines: 15, with ~15k running
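For comparison with your own pool, two quick queries (a sketch; exact output varies by HTCondor version):

  condor_status -total      # slot totals for the pool
  condor_status -schedd     # per-schedd job counts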
3
Central Manager Planning
Memory: ~1GB of RAM per 4,000 slots, plus RAM for other services (e.g. monitoring)
  …or even better, run those services somewhere else
CPU: 4 cores can work if < 20k slots; 8 cores if bigger or many users
  Speed per core (clock) helps
Network: a 1 gig connection is OK
Create CCB brokers separate from the Central Manager at > ~20k slots (config sketch below)
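A minimal sketch of splitting CCB brokers off the Central Manager; the broker hostnames and port are placeholders:

  # On each dedicated broker host: run just a condor_collector to act as the CCB server
  DAEMON_LIST = MASTER, COLLECTOR
  # On execute (and other firewalled) nodes: register with the brokers
  # instead of with the central manager's collector
  CCB_ADDRESS = ccb1.example.org:9618, ccb2.example.org:9618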
4
Central Manager Planning, cont.
Use a "collector tree" if using strong authentication / encryption, esp. over the WAN (config sketch below)
  One child collector per ~1500 execute nodes
  Hides latency, more parallelism
  See HOWTO at
[Diagram: Negotiator and Top-Level Collector at the root, Child collectors below, each serving a group of Execute Nodes]
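A hedged sketch of one child collector on the CM host, loosely following the multi-tier collector HOWTO; the instance name, port, and environment-based per-instance settings are assumptions to adapt to your site:

  # Define an extra "child" collector instance on a secondary port
  COLLECTOR_CHILD1 = $(COLLECTOR)
  COLLECTOR_CHILD1_ARGS = -f -p 9620
  # Per-instance knobs via the daemon's environment: forward ads up to the
  # top-level collector (port 9618) that the negotiator queries, and use a
  # separate log file
  COLLECTOR_CHILD1_ENVIRONMENT = "_CONDOR_CONDOR_VIEW_HOST=$(CONDOR_HOST):9618 _CONDOR_COLLECTOR_LOG=$(LOG)/CollectorChild1Log"
  DAEMON_LIST = $(DAEMON_LIST), COLLECTOR_CHILD1
  # Execute nodes then spread across the children, e.g. in their local config:
  # COLLECTOR_HOST = cm.example.org:9620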
5
Submit Machine Planning
Memory: ~50KB per queued job, ~1MB per running job (actual is ~400KB; the rest is safety factor); worked example below
CPU: 2 or 3 cores fine, BUT base the CPU decision on the needs of logged-in users (i.e. compiling, test jobs, etc.)
More than 5-10k jobs? Buy an SSD!
Our setup typically has
  a dedicated, small, low-latency SSD for the job queue, AND
  large high-throughput (striped) storage for user home/working directories
Network: 1gig, or 10gig if primarily using HTCondor File Transfer
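Worked example of the memory rule of thumb: one of the "big" CHTC submit machines above (~10k running / ~100k queued jobs) plans for roughly 10,000 x 1MB + 100,000 x 50KB ≈ 15GB of RAM for the schedd alone, comfortably within its 96GB.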
6
How to move files from submit machine to execute machine?
Shared File System (NFS, AFS, Gluster, …)
  Pro: Less work for users - no need to specify input files
  Con: No management often leads to meltdown
HTCondor File Transfer (submit file sketch below)
  Con: Users need to specify input and/or output files
  Pro: File transfers are managed
  Pro: Simpler to then run the job offsite
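A minimal submit file sketch using HTCondor File Transfer; the executable and file names are placeholders:

  executable = analyze.sh
  should_transfer_files = YES
  when_to_transfer_output = ON_EXIT
  transfer_input_files = input.dat, config.json
  # files the job creates in its scratch directory come back automatically on exit;
  # list extra files/directories explicitly with transfer_output_files if needed
  queue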
7
Note Brian B's warning…
8
Horizontal Scaling
Submit node scaling problems? Add more
  A pool can have an arbitrary number of schedds
  How many are needed? Depends on many things:
    Hertz rate of job starts (schedd safe at ~ starts/sec)
    Submission one at a time vs. big batches
    Amount of job I/O
  How to detect a scaling problem? RecentDaemonCoreDutyCycle > 98% (query sketch below)
  SCHEDD_HOST in ~/.condor/user_config can point to a remote schedd
Central manager scaling problems? Add more
  Add another pool, then federate via "Flocking"
  How to detect a scaling problem? Metrics on dropped packets, negotiation cycle time (UW-Madison's is typically a couple of minutes)
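A sketch of the pieces above; hostnames are placeholders:

  # Check schedd saturation from anywhere in the pool:
  condor_status -schedd -af Name RecentDaemonCoreDutyCycle

  # Per-user redirection to a remote schedd, in ~/.condor/user_config:
  SCHEDD_HOST = submit2.example.org

  # Federate with a second pool via flocking, in the schedd's config:
  FLOCK_TO = cm2.example.org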
9
Some User/Admin Training
Train users to submit jobs in large batches
  Instead of running condor_submit 5,000 times, do:
    executable = /bin/foo.exe
    initialdir = run_$(Process)
    queue 5000
Train users re: what is a reasonable number of queued jobs? a reasonable job runtime?
Avoid constant polling with condor_[q|status]
  Consider the job event log, DAGMan POST scripts (condor_wait example below)
  Consider monitoring with condor_gangliad, Fifemon
Use selection and projection
  Bad:  condor_status -l | grep Busy
  Good: condor_status -cons 'Activity=="Busy"' -af Name
  Custom Print Formats ( )
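One way to avoid polling, assuming the submit file names a job event log (file name is a placeholder):

  # In the submit file:
  log = mybatch.log
  # Then block until the jobs from that submission finish, instead of
  # repeatedly running condor_q:
  condor_wait mybatch.log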
10
Tuning and Customization for large scale
Kernel tuning
  Automatically done w/ HTCondor v8.4.x+
Enable the Shared Port daemon
  Automatically done w/ HTCondor v8.5.x+
  CCB required to let one schedd have more than ~25k running jobs
"Circuit Breaker" config knobs; we have lots of knobs (illustrative values sketched below)
  Schedd: MAX_JOBS_PER_OWNER, MAX_JOBS_PER_SUBMISSION, MAX_JOBS_RUNNING, FILE_TRANSFER_DISK_LOAD_THROTTLE, MAX_CONCURRENT_UPLOADS/DOWNLOADS, …
  Central Manager: NEGOTIATOR_MAX_TIME_PER_SCHEDD, NEGOTIATOR_MAX_TIME_PER_SUBMITTER
  Schedd: SUBMIT_REQUIREMENTS
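A config sketch with purely illustrative values for some of these knobs; tune to your own site:

  # On the schedd (submit machine):
  MAX_JOBS_PER_OWNER = 50000
  MAX_JOBS_PER_SUBMISSION = 20000
  MAX_JOBS_RUNNING = 10000
  FILE_TRANSFER_DISK_LOAD_THROTTLE = 2.0
  MAX_CONCURRENT_UPLOADS = 100
  MAX_CONCURRENT_DOWNLOADS = 100
  # A submit requirement that rejects bad jobs at submit time:
  SUBMIT_REQUIREMENT_NAMES = $(SUBMIT_REQUIREMENT_NAMES) MemLimit
  SUBMIT_REQUIREMENT_MemLimit = RequestMemory <= 100000
  SUBMIT_REQUIREMENT_MemLimit_REASON = "RequestMemory is too large for this pool"

  # On the central manager (negotiator), seconds spent per schedd / per submitter:
  NEGOTIATOR_MAX_TIME_PER_SCHEDD = 120
  NEGOTIATOR_MAX_TIME_PER_SUBMITTER = 60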
11
Tuning, cont. Improve scalability by disabling unneeded features, e.g.
Preemption
  negotiator_consider_preemption = false
Job ranking of machines
  negotiator_ignore_job_ranks = true
Durable commits in the event of power failure
  condor_fsync = false
(where each of these knobs is set is sketched below)
Improve scalability by enabling experimental features
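A placement sketch for the three knobs above, assuming a standard config.d layout (the file name is illustrative); apply with condor_reconfig:

  # /etc/condor/config.d/99-scale.conf on the central manager (negotiator):
  NEGOTIATOR_CONSIDER_PREEMPTION = False
  NEGOTIATOR_IGNORE_JOB_RANKS = True

  # /etc/condor/config.d/99-scale.conf on each submit machine (schedd):
  CONDOR_FSYNC = False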
12
Questions?