Current methods for negotiating firewalls for the Condor® system
Bruce Beckles (University of Cambridge Computing Service)
Se-Chang Son (University of Wisconsin)
John Kewley (CCLRC Daresbury Laboratory)
What is Condor?
- A specialised, cross-platform, distributed batch scheduling system
- Often used for utilising idle CPU cycles on workstations
- Distributed systems architecture:
  - Different components run on different machines
  - Can provide greater resilience and improve performance…
  - …at the expense of simplicity (particularly simplicity of its use of the network)
Main Condor machine roles
- Central Manager: Monitors all Condor nodes and matches jobs to execute nodes
- Submit nodes: Submit jobs to pool
- Execute nodes: Execute jobs
- Checkpoint server (optional): Stores checkpoints of jobs (for supported job types)
Machines may have more than one role, and there may be multiple machines with each of the above roles (except that there can only ever be one active central manager)
Diagrammatic overview
[Diagram: Central Manager, Submit Node, Execute Node and Checkpoint Server, with the communication between them]
- Central Manager: Condor daemons (normally listen on ports 9614 and 9618)
- Submit Node: Condor daemons
- Execute Node: Condor daemons; user's job (user's executable code plus Condor libraries)
- Checkpoint Server: Condor daemons (listen on ports 5651 – 5654)
Communication shown in the diagram:
- Execute Node tells Central Manager about itself. Central Manager tells it when to accept a job from Submit Node.
- Submit Node tells Central Manager about job. Central Manager tells it to which Execute Node it should send job.
- Start job on Execute Node
- Send results to Submit Node
- For some jobs, system calls performed as remote procedure calls back to Submit Node
- Condor daemons on Execute Node spawn job and signal it when to abort, suspend, or checkpoint
- For some jobs, write checkpoints to Checkpoint Server
- For some jobs, check status of Checkpoint Server
- Checkpoint server advertises itself to Central Manager
Who? How?
- Machine communication: Which machine talks to which
- Protocol(s) used
(Does not include high availability daemon (Condor and later))
Firewalls: Basic problems
- Pattern of network communication:
  - Many-to-many
  - Often bidirectional
- Port usage:
  - Large range of dynamic ports
  - Checkpoint server ports not configurable
- Protocols used: TCP and UDP
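A common way to ease the dynamic-port problem is to confine an application to a known port range, so that the firewall needs one range exception rather than many. To our knowledge Condor's configuration offers LOWPORT and HIGHPORT settings for this purpose. The sketch below is not Condor code: `bind_in_range` is a hypothetical helper and the range 40000 to 40049 is an arbitrary choice for illustration.

```python
import socket


def bind_in_range(low, high, host="127.0.0.1"):
    """Bind a TCP socket to the first free port in [low, high].

    Hypothetical helper for illustration; not part of Condor.
    """
    for port in range(low, high + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            sock.close()  # port already in use; try the next one
            continue
        return sock
    raise OSError(f"no free port in {low}-{high}")


# Arbitrary illustrative range: the firewall would only need to admit
# this one range instead of every possible ephemeral port.
listener = bind_in_range(40000, 40049)
print("listening on port", listener.getsockname()[1])
listener.close()
```

The narrower the range, the simpler the firewall rule, but the range must still be large enough for the number of simultaneous connections a node is expected to hold.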
Firewalls: Other problems
- Administrative overhead: Large pool may mean many exceptions
- Personal firewalls: Like having a different firewall for each machine(!)
- Condor does not handle certain network connectivity failures gracefully
- Inadequate/inaccurate documentation
- Bugs in Condor:
  - Didn't always set SO_KEEPALIVE (now fixed)
  - Machines disappearing from pool (although machine still has network connectivity)
  - Problems with Windows Firewall (now resolved?)
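SO_KEEPALIVE matters behind a firewall because keepalive probes let an endpoint notice a dead peer on an otherwise idle TCP connection, and the periodic traffic can stop a stateful firewall from discarding the connection's state. As a minimal sketch of what setting the option looks like (standard socket API in Python, not Condor's own code; `make_keepalive_socket` is a name chosen for illustration):

```python
import socket


def make_keepalive_socket():
    """Create a TCP socket with SO_KEEPALIVE enabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the kernel to send keepalive probes on an idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock


sock = make_keepalive_socket()
# A non-zero value means the option is enabled; the exact value
# reported varies between operating systems.
enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
print("keepalive enabled:", enabled)
sock.close()
```

How often probes are sent is controlled by operating system settings, so the option alone does not guarantee that probes arrive before a particular firewall's idle timeout.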
Solutions: Identified requirements
- Respect the security boundary
- Reduce administrative overhead
- Minimal impact on firewall performance
- NAT/firewall traversal
- Allow incremental implementation
- Scalability
- Robustness (in the face of network problems)
- Fail gracefully
- Integrated into Condor's security framework
- Logging
- Documentation
Types of solution
- Mitigation (avoidance): Mitigating the effects of firewalls
- Altering pattern of network communication: Reducing it from many-to-many to one-to-many, few-to-many, etc.
- NAT/firewall traversal: Traversing the security boundary
Current solutions
- CCLRC's Firewall Mirroring (FM)
- Using centralised submit nodes (CS)
- Remote job submission/Condor-C (C-C)
- Generic Connection Brokering (GCB)
- Dynamic Port Forwarding (DPF)
Firewall Mirroring
- Developed by John Kewley
- Ensures that jobs are never given to execute nodes that cannot run them because of network connectivity issues (e.g. personal firewalls)
- Achieved by duplicating firewall configuration in machine's ClassAd and then modifying job requirements appropriately
- Works well with personal firewalls
- Some administrative overhead
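The idea can be sketched with plain Python dictionaries standing in for ClassAds. This is only an illustration of the matching logic, not the actual Firewall Mirroring implementation: the attribute name `FirewallBlocksInbound`, the `needs_inbound` flag and all function names are hypothetical, and treating an unadvertised firewall state as "blocked" is an assumption made here to stay on the safe side.

```python
def mirror_firewall(machine_ad, blocks_inbound):
    """Return a copy of the machine ad with the firewall state mirrored in."""
    ad = dict(machine_ad)
    ad["FirewallBlocksInbound"] = bool(blocks_inbound)  # hypothetical attribute
    return ad


def add_connectivity_requirement(job):
    """Return a copy of the job whose requirements also check the firewall."""
    job = dict(job)
    base = job.get("requirements", lambda ad: True)
    if job.get("needs_inbound", False):
        # Assumption: a machine that does not advertise its firewall
        # state is treated as blocked.
        job["requirements"] = lambda ad: (
            base(ad) and not ad.get("FirewallBlocksInbound", True)
        )
    else:
        job["requirements"] = base
    return job


def eligible_nodes(job, machine_ads):
    """Names of the machines whose ads satisfy the job's requirements."""
    return [ad["Name"] for ad in machine_ads if job["requirements"](ad)]


machines = [
    mirror_firewall({"Name": "node-a"}, blocks_inbound=False),
    mirror_firewall({"Name": "node-b"}, blocks_inbound=True),
]
job = add_connectivity_requirement({"needs_inbound": True})
print(eligible_nodes(job, machines))
```

Because the check happens at matchmaking time, a job that needs connectivity a node cannot offer is never sent there in the first place, rather than failing after it starts.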
Centralised submit nodes
- Reduce pattern of network communication (few-to-many or better)
- Lowers administrative overhead
- Can have minimal impact on firewall performance
- …but may impact performance of the Condor pool
- Ideal for centrally managed campus grid scenarios
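A back-of-the-envelope calculation shows why reducing the communication pattern lowers administrative overhead. The sketch below simply counts submit/execute host pairs that must be able to reach each other; the pool sizes are made-up numbers for illustration, and `host_pairs` is a hypothetical helper.

```python
def host_pairs(submit_nodes, execute_nodes):
    """Upper bound on submit/execute host pairs that must communicate."""
    return submit_nodes * execute_nodes


# Made-up pool sizes, for illustration only.
many_to_many = host_pairs(submit_nodes=500, execute_nodes=500)
few_to_many = host_pairs(submit_nodes=2, execute_nodes=500)

print("every workstation submits:", many_to_many)
print("two central submit nodes: ", few_to_many)
```

With few submit nodes, each execute node's firewall only needs exceptions for a handful of known hosts, at the cost of concentrating submission load on those hosts.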
Remote job submission/Condor-C
- Remote job submission:
  - Submit node submits job to a different submit node, which then submits the job to the Condor pool
  - Poorly documented
  - Doesn't scale well
  - Security implications
- Condor-C:
  - New feature as of Condor
  - Moves job submission queue of one submit node to another submit node
  - Scales gracefully when compared with Condor's flocking mechanism
  - Maintains only a single network connection between the two submit nodes
  - Can use to reduce pattern of network communication
Generic Connection Brokering
- NAT/firewall traversal technique
- Developed by Se-Chang Son
- Transparent to application
- Can reverse direction of network connection…
- …or relay network packets between two machines that could not otherwise communicate
- Some scalability issues
- Not yet part of any official Condor release
- Not yet integrated into Condor's security framework
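The choice between connecting directly, reversing the connection and relaying can be sketched as a simple decision function. This is a deliberate simplification for illustration and does not reproduce GCB's actual logic or API; the function name and the two boolean parameters are hypothetical.

```python
def choose_strategy(client_accepts_inbound, server_accepts_inbound):
    """Pick how a client could reach a server across firewalls/NAT.

    Hypothetical simplification of connection brokering.
    """
    if server_accepts_inbound:
        # The server is reachable: connect to it as usual.
        return "direct"
    if client_accepts_inbound:
        # The server cannot be reached, but it can connect outwards
        # to the client, so the direction is reversed.
        return "reverse"
    # Neither side accepts inbound connections: a broker reachable by
    # both must relay the traffic.
    return "relay"


for client_ok in (True, False):
    for server_ok in (True, False):
        print(client_ok, server_ok, choose_strategy(client_ok, server_ok))
```

Relaying is the only option that works when both machines are hidden, but every relayed packet passes through the broker, which is one plausible reason for scalability concerns.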
Dynamic Port Forwarding
- NAT/firewall traversal technique
- Developed by Se-Chang Son
- Add-on to firewall: Currently only supports Linux netfilter-based firewalls
- Application asks DPF to open hole in firewall
- DPF closes hole when connection finished
- Highly scalable
- Not yet part of any official Condor release
- Not yet integrated into Condor's security framework
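The open-on-request, close-when-finished lifecycle can be modelled as a context manager. The sketch below uses a toy in-memory firewall rather than netfilter, and none of the names (`ToyFirewall`, `forwarded_port`) come from DPF itself; they are assumptions made purely to illustrate the lifecycle.

```python
from contextlib import contextmanager


class ToyFirewall:
    """In-memory stand-in for a firewall's set of open ports."""

    def __init__(self):
        self.open_ports = set()

    def open(self, port):
        self.open_ports.add(port)

    def close(self, port):
        self.open_ports.discard(port)


@contextmanager
def forwarded_port(firewall, port):
    """Open a hole for the duration of one connection, then close it."""
    firewall.open(port)
    try:
        yield port
    finally:
        # The hole is closed even if the connection ends with an error.
        firewall.close(port)


fw = ToyFirewall()
with forwarded_port(fw, 40123) as port:
    print("open during connection:", port in fw.open_ports)
print("open afterwards:", 40123 in fw.open_ports)
```

Tying the hole to the lifetime of a connection keeps the firewall closed by default, which is in keeping with the requirement to respect the security boundary.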
Solutions v. Requirements
See paper for notes and explanations
Conclusion
- No perfect solution (meets all requirements)
- Careful design of Condor pool can help
- Many solutions still experimental / not yet generally available
- Se-Chang working on further technical solutions not discussed here
- Some issues best addressed within Condor (e.g. failing gracefully if loss of network connectivity)
- Further development of Condor required to properly address many of these issues