Ian C. Smith Towards a greener Condor pool: adapting Condor for use with energy-efficient PCs
Overview
A quick description of the University of Liverpool Condor pool; power saving at Liverpool; a home-grown approach to dealing with power-saving PCs; power management using Condor 7.4.X; implementing Condor power management; results; future directions.
University of Liverpool Condor Pool
Contains around 300 machines running the University’s Managed Windows Service (XP, soon Windows 7). Most have 2.33 GHz Intel Core 2 processors with 2 GB RAM and an 80 GB disk, configured with two job slots per machine. A single combined submit host and central manager runs on a Sun V445 SMP server. Currently running Condor on execute hosts (moving to 7.2.x soon). The policy is to run jobs only after at least 5 minutes of inactivity and with a low load average during office hours, and at any time outside office hours. Jobs are killed rather than suspended.
Power saving at Liverpool
We have around centrally managed PCs across campus which were powered up overnight, at weekends and during vacations. The original power-saving policy was to “power off” machines after 30 minutes of inactivity; we now hibernate them after 15 minutes of inactivity. The policy has reduced wasteful inactivity time by ~ – hours per week (equivalent to MWh), leading to an estimated saving of approx. £ p.a. It makes extensive use of the PowerMAN system from Data Synergy, which comprises: a service which forces machines into a low-power state and reports machine activity to the Management Reporting Platform; and the Management Reporting Platform itself, a central server from which usage statistics can be retrieved and viewed via a web browser.
Typical monthly Condor activity
A home-grown approach to power management
There are two main problems to deal with: how to ensure Condor jobs are not evicted by hibernating PCs, and how to wake up dormant PCs to run Condor jobs on demand. The PowerMAN service prevents job eviction: PowerMAN can be given a list of “protected programs”, and a machine remains active while any of them is running. We include the condor_starter process as a protected program (it is only present while a Condor job is running). Wake-on-LAN (“WoL”) is used to bring hibernating machines back to full power: NICs must remain powered up during hibernation; NICs must be capable of waking machines on receipt of a “magic packet”; and the network must be able to route “magic packets” (not a problem for us, but YMMV).
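The “magic packet” mentioned above has a simple, well-known layout: six 0xFF bytes followed by the target MAC address repeated sixteen times. The following is a minimal Python sketch of building and broadcasting one. The function names, the UDP port (9) and the broadcast address are illustrative assumptions, not details taken from the Liverpool setup.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Return the 102-byte Wake-on-LAN payload for a MAC such as '00:1A:2B:3C:4D:5E'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    # Six 0xFF synchronisation bytes, then the MAC address sixteen times.
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP (port and address are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))
```

Whether such a broadcast reaches a machine on another subnet depends on the network configuration, which is the routing caveat noted on the slide.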
Adapting Condor for use with power-saving PCs
A cron job runs on the submit host and periodically examines the state of the queue (condor_status -schedd) and the pool (condor_status). If there are more idle jobs in the queue than Unclaimed machines, hibernating machines need to be woken up: find the number of powered-up machines in each “teaching centre” (classroom); estimate the number of hibernating machines in each teaching centre from the total number of machines in each; sort the centres from highest number of available machines to lowest; wake up centres in turn until sufficient machines have been woken to meet the demand (or all centres have been woken up). The MAC addresses of the machines, needed for Wake-on-LAN, are stored in files sorted according to teaching centre.
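The selection logic described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the actual cron script: the function and parameter names are hypothetical, the counts are assumed to have been parsed from condor_status output already, and “available machines” is read here as the estimated number of hibernating machines in a centre.

```python
def centres_to_wake(idle_jobs, unclaimed, total_per_centre, powered_up_per_centre):
    """Return the teaching centres to wake, largest estimated gain first."""
    shortfall = idle_jobs - unclaimed
    if shortfall <= 0:
        return []  # enough Unclaimed machines already; wake nothing

    # Estimate hibernating machines as (total in centre) - (currently powered up).
    hibernating = {
        centre: max(total - powered_up_per_centre.get(centre, 0), 0)
        for centre, total in total_per_centre.items()
    }

    chosen, woken = [], 0
    for centre in sorted(hibernating, key=hibernating.get, reverse=True):
        if woken >= shortfall:
            break  # demand met
        if hibernating[centre] == 0:
            continue  # nothing to gain from this centre
        chosen.append(centre)
        woken += hibernating[centre]
    return chosen
```

For example, with 25 idle jobs, 5 Unclaimed machines and estimated hibernating counts of 5, 18 and 10 in three centres, the sketch picks the centres with 18 and 10 machines and stops there.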
Problems with the home-grown approach
It assumes that any job can run on any machine: users cannot choose particular teaching centres or machines in their job Requirements; ideally, the pool needs to be homogeneous; errors in the Requirements specification can cause severe problems (machines repeatedly wake up then hibernate again), so the cron job includes a “sanity check” for this. We can only estimate the number of hibernating machines in each centre. The same machines get woken up first.
Power management in Condor 7.4.X
Condor daemons can now place an execute host in a low-power state according to a given policy. The execute host signals to the Condor central manager that it is about to enter a low-power state. The central manager records persistent offline ClassAds for hibernating machines. The negotiator can perform matchmaking with offline ClassAds. Matches are passed to condor_rooster, which pipes information to condor_power, which in turn wakes up machines using WoL.
Implementing Condor power management
We still use PowerMAN, rather than Condor, to power down inactive PCs, so we need a way of advertising available offline machines to the condor_collector. If we know which machines are currently active (A) and which machines make up the pool in total (P), then the offline machines form the subset O = P – A. A cron job periodically advertises the offline machines and updates the timestamps (ClockMin / ClockDay). Finding P (the total set of machines which are out there) turns out to be a very difficult problem.
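The set arithmetic O = P – A and the timestamp refresh can be illustrated with a short Python sketch. Everything here is an assumption for illustration: the function names are hypothetical, and the ad text is a bare skeleton. A usable offline ad would need to carry the machine's full set of attributes so that the negotiator can match against it, and it would then be passed to the collector with a tool such as condor_advertise.

```python
from datetime import datetime


def offline_machines(pool, active):
    """O = P - A: machines known to the pool that are not currently active."""
    return set(pool) - set(active)


def clock_fields(now: datetime):
    """Return (ClockMin, ClockDay): minutes since midnight and day of week, Sunday = 0."""
    clock_min = now.hour * 60 + now.minute
    clock_day = (now.weekday() + 1) % 7  # Python counts Monday = 0, Sunday = 6
    return clock_min, clock_day


def offline_ad(name: str, now: datetime) -> str:
    """Build a skeletal offline machine ad (illustrative subset of attributes only)."""
    clock_min, clock_day = clock_fields(now)
    lines = [
        'MyType = "Machine"',
        f'Name = "{name}"',
        "Offline = True",
        f"ClockMin = {clock_min}",
        f"ClockDay = {clock_day}",
    ]
    return "\n".join(lines) + "\n"
```

Regenerating the ad on every cron run keeps ClockMin and ClockDay current, which matters for time-dependent policy expressions.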
How do we determine which machines are available to Condor?
Try waking them up! Wake up all machines in each teaching centre once a week using WoL. After the wakeup call, wait a few minutes and test each machine in turn with condor_status -direct (a sanity check similar to UNIX ping). Record which machines respond and publish ClassAds for them.
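A rough Python sketch of the “test each machine in turn” step follows. The function names are hypothetical, and treating “exit status 0 and non-empty output” as a successful response is an assumption about how condor_status -direct behaves, so it should be checked against the installed Condor version before being relied on.

```python
import subprocess


def probe_with_condor_status(host: str, timeout: float = 30.0) -> bool:
    """Query the host's startd directly; True if it appears to answer."""
    try:
        result = subprocess.run(
            ["condor_status", "-direct", host],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # condor_status missing, or the machine never answered
    # Assumption: success means exit status 0 plus some output.
    return result.returncode == 0 and bool(result.stdout.strip())


def responding_machines(hosts, probe=probe_with_condor_status):
    """Return the hosts that respond, in the order they were tested."""
    return [host for host in hosts if probe(host)]
```

Passing the probe in as a parameter makes the bookkeeping easy to exercise without a live pool, for example `responding_machines(hosts, probe=lambda h: h in known_good)`.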
Unforeseen problems
Not all woken-up machines begin to run jobs. The number of wakeups is limited by our “roll-your-own” version of condor_power: condor_rooster originally attempted to wake up all offline machines which matched job requirements, so we included another limit in our condor_power script (the number of wakeups must be < the number of idle jobs). Condor should fix this: it adds a ROOSTER_MAX_UNHIBERNATE configuration option. We wanted to wake up machines in random order so that the same machines are not used repeatedly, but found that condor_negotiator ignored Rank values, so we used the condor_power script to implement this (it “shuffles the deck”). Should be fixed in using ROOSTER_UNHIBERNATE_RANK config option.
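The two workarounds described above, capping the number of wakeups below the number of idle jobs and “shuffling the deck”, can be sketched together in Python. This is not the Liverpool condor_power script: the function name and the way candidates and job counts are supplied are assumptions for illustration.

```python
import random


def pick_wakeups(candidates, idle_jobs, rng=None):
    """Choose which matched offline machines to wake.

    The order is randomised so the same machines are not always woken first,
    and the count is kept strictly below the number of idle jobs.
    """
    rng = rng or random.Random()
    shuffled = list(candidates)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    limit = max(idle_jobs - 1, 0)  # "number of wakeups must be < no of idle jobs"
    return shuffled[:limit]
```

Passing a seeded `random.Random` instance makes the selection reproducible when testing.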
Unforeseen problems / cont’d
Condor continued to wake up machines after jobs were removed (or completed). Use Unhibernate = CurrentTime - MachineLastMatchTime < 300, not Unhibernate =!= Undefined. It is difficult to distinguish Unclaimed offline machines from online ones in condor_status output, and also in Condor View graphs.
To see all offline machines:
$ condor_status -constraint Offline==True
To see all powered-up machines:
$ condor_status -constraint Offline=!=True
Results – wakeup test
Future Directions
Condor power management will allow us to expand the pool to include even low-spec machines: if machines are not needed or are unsuitable they need not be woken up, and Rank can be used so that newer (more energy-efficient) machines are used first. We would like a more accurate way of determining which machines are available. One possible method: record the amount of time since each machine last appeared in the pool and/or ran a job; confidence in waking a PC can then be described by a monotonically decreasing function of this time; machines may still need to be woken occasionally for testing. We also want to encourage users to incorporate their own checkpointing code to reduce “badput” and energy wastage (see the Liverpool Condor website for details).
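As one illustration of a “monotonically decreasing function” of the time since a machine was last seen, the Python sketch below uses exponential decay with a half-life. The slides do not specify a functional form: the decay shape, the 14-day half-life and the function name are all assumptions chosen purely for illustration.

```python
def wake_confidence(days_since_seen: float, half_life_days: float = 14.0) -> float:
    """Confidence (between 0 and 1) that a machine would still respond to a wakeup.

    Equals 1.0 for a machine seen just now and halves every `half_life_days`.
    """
    if days_since_seen < 0:
        raise ValueError("days_since_seen must be non-negative")
    if half_life_days <= 0:
        raise ValueError("half_life_days must be positive")
    return 0.5 ** (days_since_seen / half_life_days)
```

A wakeup policy could then skip machines whose confidence falls below some threshold, while still waking them occasionally for testing as the slide suggests.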
Further Information