Ian C. Smith Experiences with running MATLAB jobs on a power-saving Condor Pool.


1 Ian C. Smith Experiences with running MATLAB jobs on a power-saving Condor Pool

2 University of Liverpool Condor Pool
- Contains around 300 machines running the University’s Managed Windows (XP) Service.
- Most have 2.33 GHz Intel Core 2 processors with 2 GB RAM and 80 GB disk, configured with two job slots per machine.
- Software updates via a weekly re-imaging process.
- Single combined submit host / central manager running on a Sun V440 SMP server.
- Restricted access to submit host for registered Condor users.
- Currently running Condor 7.0.2 (moving to 7.2.x soon).
- Policy is to run jobs only after at least 10 minutes of inactivity and with a low load average during office hours, and at any time outside office hours.
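The job-start policy on this slide can be read as a simple predicate. The sketch below is only an illustration of that reading: the office hours, the load-average cut-off and the function names are assumptions, since the slide states only the 10-minute idle threshold.

```python
from datetime import datetime

IDLE_THRESHOLD_S = 10 * 60          # from the slide: at least 10 minutes idle
MAX_LOAD_AVG = 0.3                  # assumed value for a "low" load average
OFFICE_START, OFFICE_END = 9, 17    # assumed office hours, Monday to Friday


def in_office_hours(now: datetime) -> bool:
    """True on weekdays between the assumed office start and end hours."""
    return now.weekday() < 5 and OFFICE_START <= now.hour < OFFICE_END


def may_start_job(now: datetime, idle_seconds: float, load_avg: float) -> bool:
    """Outside office hours jobs may always start; during office hours
    the machine must have been idle long enough and be lightly loaded."""
    if not in_office_hours(now):
        return True
    return idle_seconds >= IDLE_THRESHOLD_S and load_avg <= MAX_LOAD_AVG
```

In a real pool this decision would be expressed in the execute hosts' Condor configuration rather than in a script; the Python form is used here only to make the logic explicit.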

3 MATLAB advantages
- Originally developed for linear algebra algorithm development but now contains many built-in functions geared to different disciplines, divided into toolboxes.
- Intuitive interactive environment allows rapid code development.
- Simple but powerful file I/O: save, load (useful for checkpointing).
- Allows users to create their own functions stored as M-files.
- “Standalone” applications can be built from M-files:
  - can run on platforms without MATLAB installed
  - do not need a licence to be able to run
  - can include all toolbox functions
- APIs available for FORTRAN and C codes (“MEX files”)

4 MATLAB disadvantages
- Even standalone applications can run slower than equivalent C or FORTRAN implementations.
- Standalone applications aren’t quite what they may seem:
  - more than just an .exe: several files need to be packaged and deployed
  - need access to MATLAB run-time libraries, usually via MATLAB Component Runtime (150 MB self-extracting .exe)
  - luckily we have MATLAB pre-installed on all PCs in the Condor pool (originally used a network drive)
- Run-time errors can be difficult to trace when MATLAB jobs are run under Condor:
  - need to run under Condor on local PC
  - configure with USE_VISIBLE_DESKTOP=True to see pop-up messages
- Jobs submitted in a UNIX environment but code developed under Windows.

5 Minor MATLAB irritations
- Output files occasionally go missing:
  - specify all required files using transfer_output_files
  - identify problem jobs with condor_q -held
  - resubmit with condor_release -all
- Jobs sometimes run “forever”:
  - use condor_vacate to move job to another machine
  - less of a problem during term time as jobs usually get evicted by logins
- Difficult to reproduce these problems:
  - happen quite rarely (< 1 in ~1000 jobs)
  - many jobs based on stochastic methods

6 MATLAB Research Applications
- Predicting the spread of avian influenza outbreaks in poultry flocks (Veterinary Clinical Science).
- Modelling of E. coli propagation in dairy cattle (Veterinary Clinical Science).
- Testing of parallel genetic algorithms in a complex classification system (Electrical Engineering and Electronics).
- Simulation of the infection of a bacterial cell by a virus (Mathematical Sciences).
- Modelling the effects of radiotherapy on normal tissue using 3D voxel arrays (Medical Imaging and Radiotherapy).

7 Power-saving at Liverpool
- Have around 2 000 centrally managed PCs across campus which used to be left powered up overnight, at weekends and during vacations.
- Original power-saving policy was to power off machines after 30 minutes of inactivity; now hibernate them after 10 minutes of inactivity.
- Policy has reduced wasteful inactivity time by ~200 000 to 250 000 hours per week (equivalent to 20 to 25 MWh), leading to an estimated saving of approx. £125 000 p.a.
- Makes extensive use of the PowerMAN system from Data Synergy, comprising:
  - service which forces machines into a low-power state and reports machine activity to the Management Reporting Platform
  - Management Reporting Platform: central server from which usage stats can be retrieved and viewed via a web browser
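The figures on this slide can be cross-checked with a little arithmetic. The sketch below derives the per-PC power draw and the electricity price that the slide's numbers imply; neither value is stated in the source, and the 52-week year is an assumption.

```python
# Figures taken from the slide (low and high ends of each range).
HOURS_SAVED_PER_WEEK = (200_000, 250_000)
ENERGY_SAVED_MWH_PER_WEEK = (20, 25)
ANNUAL_SAVING_GBP = 125_000


def implied_power_watts(hours: float, mwh: float) -> float:
    """Average power per idle PC implied by an energy/time pair."""
    return mwh * 1_000_000 / hours   # 1 MWh = 1,000,000 Wh


def implied_price_gbp_per_kwh(annual_saving: float, mwh_per_week: float,
                              weeks: int = 52) -> float:
    """Electricity price implied by the annual saving (assumes the weekly
    saving applies for `weeks` weeks per year)."""
    return annual_saving / (mwh_per_week * 1_000 * weeks)


for hours, mwh in zip(HOURS_SAVED_PER_WEEK, ENERGY_SAVED_MWH_PER_WEEK):
    watts = implied_power_watts(hours, mwh)
    price = implied_price_gbp_per_kwh(ANNUAL_SAVING_GBP, mwh)
    print(f"{hours} h/week, {mwh} MWh/week: {watts:.0f} W per PC, "
          f"about £{price:.3f} per kWh")
```

Both ends of the range imply an idle draw of 100 W per PC, and the annual saving corresponds to roughly £0.10 to £0.12 per kWh under the 52-week assumption.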

8 Adapting Condor for use with power-saving PCs
- Two main problems:
  - how to ensure Condor jobs are not evicted by hibernating/powered-off PCs
  - how to wake up dormant PCs to run Condor jobs on demand
- Originally used Microsoft system service to power down PCs after 30 min inactivity:
  - runs .bat file which checks if a user is logged in and shuts machine down if not
  - doesn’t detect owner of Condor job as a logged-in user
  - need to check for presence of condor_exe.bat
- PowerMAN service now prevents job eviction:
  - can provide PowerMAN with a list of “protected programs”
  - ensures that system remains active if a protected program is running
  - include condor_starter process as a protected program (only present while a Condor job is running).
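The "protected programs" idea amounts to treating a running Condor job as activity when deciding whether a machine may sleep. The sketch below illustrates that decision only; it is not PowerMAN's actual implementation, and the process name and function signature are hypothetical.

```python
# Hypothetical process name for the Condor starter on Windows.
PROTECTED_PROGRAMS = {"condor_starter.exe"}


def may_enter_low_power(idle_minutes: float, running_processes,
                        idle_limit_minutes: float = 10) -> bool:
    """Allow hibernation only when the machine has been idle for the
    configured time AND no protected program is running."""
    running = {name.lower() for name in running_processes}
    if running & PROTECTED_PROGRAMS:
        return False   # a Condor job is active: keep the machine awake
    return idle_minutes >= idle_limit_minutes
```

Because condor_starter exists only while a job is running, listing it as protected keeps the machine awake for exactly the lifetime of the job and no longer.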

9 Adapting Condor for use with power-saving PCs
- Wake-on-LAN (“WoL”) used to bring hibernating machines back to full power:
  - NICs must remain powered up during hibernation/power-off
  - NICs must be capable of waking machines on receipt of a “magic packet”
  - network must be able to route “magic packets”
- cron job runs on the submit host which examines state of queue (condor_q) and pool (condor_status):
  - if more idle jobs in queue than Unclaimed machines then need to wake up hibernating machines
  - find number of powered-up machines in each “teaching centre” (classroom)
  - estimate the number of hibernating machines in each teaching centre from total number of machines in each
  - sort centres from highest number of available machines to lowest
  - wake up centres in turn until sufficient machines woken to meet the demand (or all centres woken up)
- MAC addresses of machines are stored in files sorted according to teaching centre (needed for Wake-on-LAN)
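For reference, a Wake-on-LAN "magic packet" is six 0xFF bytes followed by the target MAC address repeated 16 times, usually sent as a UDP broadcast. The sketch below shows one way to build and send such a packet; the function names, the broadcast address and the port are assumptions, and the slides do not say which tool the Liverpool cron job actually uses.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Return the 102-byte magic packet for a MAC such as 00:11:22:33:44:55."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)   # raises ValueError on non-hex input
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str, broadcast: str = "255.255.255.255",
                      port: int = 9) -> None:
    """Broadcast the packet over UDP. Port 9 is a common convention; a routed
    campus network would typically need a subnet-directed broadcast address."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))
```

Waking a whole teaching centre would then be a loop over the MAC addresses stored in that centre's file, calling `send_magic_packet` for each.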

10 Automatic wake up issues
- Assumes that any job can run on any machine:
  - users cannot choose particular teaching centres or machines in their job Requirements
  - ideally, pool needs to be homogeneous
  - errors in Requirements specification can cause severe problems (machines repeatedly wake up then hibernate)
  - cron now includes a “sanity check” for this
- Large clusters of jobs can cause Condor scheduler to become overloaded:
  - condor_q times out so cron cannot determine queue state
  - only a transient problem: load eventually drops off and condor_q responds again
- Can only estimate number of hibernating machines in each centre
- May wake up more machines than needed
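The last two points follow from waking whole teaching centres based on estimated counts. A minimal sketch of such a planner is given below. It is a reconstruction from the slides, not the actual cron script: the function name and data layout are hypothetical, and "available machines" is assumed to mean the estimated number of hibernating machines in a centre.

```python
def plan_wakeups(idle_jobs: int, unclaimed: int, centres: dict) -> list:
    """Choose which teaching centres to wake.

    centres maps a centre name to (total_machines, powered_up_machines).
    Returns centre names in the order they should be woken.
    """
    shortfall = idle_jobs - unclaimed
    if shortfall <= 0:
        return []   # enough Unclaimed machines already; wake nothing

    # Hibernating machines cannot be counted directly, only estimated.
    estimates = {name: max(total - up, 0)
                 for name, (total, up) in centres.items()}

    # Largest estimated yield first, then wake centres in turn.
    order = sorted(estimates, key=lambda name: estimates[name], reverse=True)
    to_wake, expected = [], 0
    for name in order:
        if expected >= shortfall:
            break
        if estimates[name] == 0:
            continue
        to_wake.append(name)
        expected += estimates[name]
    return to_wake


centres = {"A": (30, 5), "B": (40, 40), "C": (20, 0)}
print(plan_wakeups(idle_jobs=50, unclaimed=20, centres=centres))
```

With these made-up numbers the shortfall is 30 machines, so centres A and C are woken for an estimated 45 machines, which illustrates how waking at centre granularity can overshoot the demand.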

11 Automatic wake up in action – Condor pool machine statistics

12 Automatic wake up in action – PowerMAN statistics

13 Recent and Future Developments
- Recently moved to a policy of hibernating machines after 10 minutes of inactivity
  - submit host / central manager needs to work harder to get jobs running before recently woken machines go back to hibernation
  - move execute hosts from Owner to Unclaimed state after just 5 minutes idle
  - update activity timer every 1 minute (default is 5 minutes)
  - increase frequency of scheduler and negotiator cycles using SCHEDD_INTERVAL=60, NEGOTIATOR_INTERVAL=60
  - around 25% of machines still hibernate after first wake-up
  - see a ramp up in machines running Condor jobs over about an hour
  - little impact on Condor users
  - energy wastage offset by savings with user logouts

14 Recent and Future Developments
- Migrating to Condor 7.2 shortly
- Has some interesting power-management features
- Automatic power-down on execute hosts could provide a useful “safety net” but PowerMAN likely to remain primary power management tool
- Can retain records of ClassAds of machines in low-power state
  - could be useful in matchmaking jobs to powered-down machines
  - matchmaking logic already in Condor
  - nice if Condor could use this to provide a list of machines to wake up on demand ... and wake them up with condor_wakeup?
  - would like to ensure that powered-down machines are still out there (not broken, permanently turned off, not listening etc.)
  - also useful to see powered-off machines represented in condor_status output
- Couple of extra “wishes”
  - allow jobs to claim all slots on a machine (useful if they have large memory requirements)
  - provide a “logged-in user” machine ClassAd attribute

15 Further Information
- http://www.liv.ac.uk/e-science/condor
- i.c.smith@liverpool.ac.uk

