HTCondor Project Plans
Zach Miller, OSG AHM 2013
Even More Lies
chtc.cs.wisc.edu
Outline
› Fewer Lies Predictions
› More Accomplishments
› Some hints at Future Work
HTCondor
› As you’ve probably noticed, the project’s name has changed.
› HTCondor specifically is the software developed by the Center for High-Throughput Computing at UW-Madison.
› However, the code, the names of binaries, and the configuration file names and entries have NOT changed.
User Tools
› condor_ssh_to_job
  If enabled by the admin, allows users to get a shell in the remote execution sandbox as the user that is running the job
  Great for debugging!
  % condor_ssh_to_job
  Welcome to
  Your condor job is running with pid(s)
  > ls
  condor_exec.exe _condor_stderr _condor_stdout
  > whoami
  zmiller
User Tools
› condor_q -analyze
  No longer needs to fetch user priorities (but still can), so it does not have to contact the negotiator daemon; response time on busy pools is much improved
  Can also analyze the Requirements of the condor_startd, in addition to the Requirements of the job
User Tools
› condor_tail
  Allows users to view the output of their jobs while the job is still running
  Like the UNIX “tail -f”, it can follow the contents of a file as it grows (real-time streaming)
  Not yet part of the HTCondor release (but should be there Real Soon Now™)
User Tools
› condor_qsub
  Allows a user to submit a PBS or SGE job to HTCondor directly
  Translates the command-line arguments, as well as inline (#PBS or #$) directives, to their equivalent condor_submit commands
  This is in no way complete. We do not aim to emulate every feature of qsub, only to capture the main functionality that supports the majority of simple use cases.
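To make the translation idea concrete, here is a minimal Python sketch of turning inline directives into submit-description lines. It is not the condor_qsub implementation: the function name `translate_script` and the two-entry option table are assumptions made for illustration, and only the `-o` and `-e` options are mapped (to the `output` and `error` submit commands).

```python
import shlex

# Hypothetical, deliberately tiny mapping from qsub options to
# condor_submit commands; a real translator would cover many more.
OPTION_TO_SUBMIT_COMMAND = {
    "-o": "output",
    "-e": "error",
}

# Inline directive prefixes used by PBS and SGE job scripts.
DIRECTIVE_PREFIXES = ("#PBS", "#$")


def translate_script(script_path, script_text):
    """Return submit-description text for a PBS/SGE-style job script."""
    lines = ["executable = " + script_path]
    for raw in script_text.splitlines():
        stripped = raw.strip()
        prefix = next(
            (p for p in DIRECTIVE_PREFIXES if stripped.startswith(p)), None
        )
        if prefix is None:
            continue  # not a directive line
        tokens = shlex.split(stripped[len(prefix):])
        # Walk the tokens looking for "option value" pairs we know about;
        # anything unrecognized is skipped rather than guessed at.
        i = 0
        while i < len(tokens) - 1:
            command = OPTION_TO_SUBMIT_COMMAND.get(tokens[i])
            if command is not None:
                lines.append(f"{command} = {tokens[i + 1]}")
                i += 2
            else:
                i += 1
    lines.append("queue")
    return "\n".join(lines)


if __name__ == "__main__":
    example = "#!/bin/sh\n#PBS -o job.out\n#$ -e job.err\necho hello\n"
    print(translate_script("job.sh", example))
```

Unrecognized options are silently ignored here, which mirrors the "capture the simple cases" goal but would need a warning in anything beyond a sketch.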
User Tools
› condor_ping
  Tests the authentication, mapping, and authorization of a user submitting a job to the running HTCondor daemons
  Tries to provide helpful debugging info in the case of failure
  globusrun -a
Admin Tools
› condor_ping
  Tests the authentication, mapping, and authorization of daemon-to-daemon communications of running HTCondor daemons
  Helps assure administrators they have configured things correctly
  Tries to provide helpful debugging info in the case of failure
Admin Tools
› condor_who
  Shows which jobs, owned by which users, are running on the local machine
  Does not depend on contacting HTCondor daemons: it gets all info from logs and ‘ps’
  % condor_who
  OWNER CLIENT SLOT JOB RUNTIME PID PROGRAM
  ingwe.cs.wisc.edu :00: /scratch/
Networking
› IPv6 support in HTCondor has been around for some time, but we continue to test and harden that code
› The condor_shared_port daemon allows an HTCondor instance to listen on a single port, easing configuration in firewalled environments
› We would like this to be the default, with port 9618 (registered by name with IANA)
Security
› Creation of official policy
› Documented on HTCondor web site
  Reporting Process
  Release Process
  Known vulnerabilities
› Running Coverity nightly
Security
› Added ability to increase number of bits in delegated proxies
› condor_ping already mentioned
› Audit Log
  Will record the authenticated identity of each user and as much information as is practical about each job that runs
  In progress…
Sandboxing
› Per-job PID namespaces
  Make it impossible for jobs to see outside of their own process tree, so they cannot interfere with the system or with other jobs, even jobs owned by the same user
  Also allows filesystem namespaces, so that each job can have its own /tmp directory, which is actually mounted in HTCondor’s temporary execute directory
Sandboxing
› cgroups
  Allows for more accurate accounting of a job’s memory and CPU usage
  Guarantees proper cleanup of jobs
Partitionable Slots
› p-slots
  Can contain “generic” resources, beyond CPU, Memory, Disk
  Now work in HTCondor’s “parallel” universe
  Support for “quick claiming” where a whole machine can be subdivided and claimed in a single negotiation cycle
› Can lead to “fragmenting” of a pool
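The fragmentation problem can be illustrated with a toy model. The Python sketch below is an assumption-laden illustration, not HTCondor code: the `PartitionableSlot` class, its `carve` method, and the `fits_anywhere` helper are hypothetical names, and only CPUs and memory are tracked.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionableSlot:
    """Toy model of one machine's partitionable slot (hypothetical)."""
    cpus: int
    memory_mb: int
    dynamic_slots: list = field(default_factory=list)

    def carve(self, cpus, memory_mb):
        """Split off a dynamic slot if enough resources remain."""
        if cpus > self.cpus or memory_mb > self.memory_mb:
            return False
        self.cpus -= cpus
        self.memory_mb -= memory_mb
        self.dynamic_slots.append((cpus, memory_mb))
        return True


def fits_anywhere(pool, cpus, memory_mb):
    """True if some single machine still has room for the request."""
    return any(m.cpus >= cpus and m.memory_mb >= memory_mb for m in pool)


if __name__ == "__main__":
    # Two 8-core machines, each filled with six 1-core jobs.
    pool = [PartitionableSlot(8, 16384) for _ in range(2)]
    for machine in pool:
        for _ in range(6):
            machine.carve(1, 1024)

    free_cpus = sum(m.cpus for m in pool)
    print("free cores in pool:", free_cpus)
    print("4-core job fits:", fits_anywhere(pool, 4, 4096))
```

Four cores are idle across the pool, yet no single machine can host a 4-core job. That is the fragmentation the slide refers to.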
Defrag
› condor_defrag
  A daemon which periodically drains and recombines a certain portion of the pool
  Leads to the recreation of larger slots which can then be used for larger jobs
  Necessarily causes some “badput”
› condor_drain is a command-line tool that does this on an individual machine
Statistics
› HTCondor daemons can now report a wide variety of statistics to the condor_collector
  Statistics about the daemons, like response times to incoming requests
  Statistics about jobs and quantities of data transferred
› Still to be done: tools that help make sense of those statistics, either as a new and improved CondorView or as a Gratia probe
Scalability
› Working on reducing the memory footprint of daemons, particularly the condor_shadow
› Queued file transfers are now processed round-robin instead of FIFO, so individual users are not starved
› ClassAd caching in the schedd and collector has resulted in 30-40% savings in memory
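The difference between FIFO and round-robin servicing can be sketched in a few lines of Python. This is an illustration of the general technique only: the function name `round_robin_order` and the choice of "user" as the unit of fairness are assumptions, not a description of how the HTCondor transfer queue is implemented.

```python
from collections import OrderedDict, deque


def round_robin_order(requests):
    """Reorder (user, filename) requests so users take turns.

    `requests` is a list in arrival (FIFO) order. Each pass over the
    users services one pending transfer per user.
    """
    per_user = OrderedDict()
    for user, filename in requests:
        per_user.setdefault(user, deque()).append(filename)

    order = []
    while per_user:
        # Iterate over a snapshot so users can be removed mid-pass.
        for user in list(per_user):
            pending = per_user[user]
            order.append((user, pending.popleft()))
            if not pending:
                del per_user[user]
    return order


if __name__ == "__main__":
    arrivals = [("alice", "a1"), ("alice", "a2"), ("alice", "a3"), ("bob", "b1")]
    print("FIFO:       ", arrivals)
    print("round-robin:", round_robin_order(arrivals))
```

Under FIFO, bob's single transfer waits behind all of alice's. Under round-robin it is serviced second, which is the starvation fix the slide describes.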
HTCondor Version
› Will contain all of the above goodness
› Should be released approximately “during HTCondor week”, April 29 – May 3
› What lies beyond?
Future Work
› Scalability
  We always need to be improving this in an attempt to stay ahead of Igor the curve
  New tools will be needed: nobody wants to run condor_status and see individual state for 100,000 cores
  Reducing the amount of per-job memory used on the submit machine
  Collector hierarchies to deal with high-latency, wide-area cloud pools
Future Work
› Weather report: 100% chance of clouds
  Support for more types of resource acquisition models: EC2 Spot Instance, Azure, OpenStack, The Next Big Thing™
  Simple creation of single-purpose clusters
    Homogeneous
    Ephemeral
    Single user, single job
  Seamless integration of cloud resources with locally deployed infrastructure
Future Work
› Dealing with more hardware complexity
  More and more cores
  GPUs
› Simplifying deployment for widely disparate use cases
› Improving support for black-box / commercial applications
› Meeting increasing data challenges
Conclusion
› Many things accomplished…
› Many more to do…
› Questions? Ask me! Or me at
› Thanks!