Ian C. Smith Towards a greener Condor pool: adapting Condor for use with energy-efficient PCs.

Overview
- Quick description of the University of Liverpool Condor Pool
- Power saving at Liverpool
- A home-grown approach to dealing with power-saving PCs
- Power management using Condor 7.4.X
- Implementing Condor power management
- Results
- Future directions

University of Liverpool Condor Pool
- Contains around 300 machines running the University’s Managed Windows Service (XP, soon Windows 7).
- Most have 2.33 GHz Intel Core 2 processors with 2 GB RAM and 80 GB disk, configured with two job slots per machine.
- Single combined submit host / central manager running on a Sun V445 SMP server.
- Currently running Condor on execute hosts (moving to 7.2.x soon).
- Policy is to run jobs only after at least 5 minutes of inactivity and with a low load average during office hours, and at any time outside office hours.
- Jobs are killed rather than suspended.
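The job-start policy on this slide can be read as a simple predicate. The sketch below is illustrative only: the office hours (Monday to Friday, 09:00 to 17:00) and the load threshold are assumptions, since the slide gives only the 5-minute inactivity figure, and the function names are hypothetical.

```python
from datetime import datetime

# Assumed values for illustration; only the 5-minute figure comes from the slide.
OFFICE_START_HOUR = 9      # hypothetical start of office hours
OFFICE_END_HOUR = 17       # hypothetical end of office hours
MIN_IDLE_SECONDS = 5 * 60  # "at least 5 minutes of inactivity"
MAX_LOAD_AVG = 0.3         # hypothetical "low load average" threshold


def in_office_hours(now: datetime) -> bool:
    """True on Monday-Friday within the assumed office hours."""
    return now.weekday() < 5 and OFFICE_START_HOUR <= now.hour < OFFICE_END_HOUR


def may_start_job(now: datetime, idle_seconds: float, load_avg: float) -> bool:
    """Outside office hours a job may always start; during office hours the
    machine must have been idle long enough and be lightly loaded."""
    if not in_office_hours(now):
        return True
    return idle_seconds >= MIN_IDLE_SECONDS and load_avg <= MAX_LOAD_AVG
```

In a real pool this logic would live in the execute hosts' Condor configuration rather than in a script; the Python form is only meant to make the two cases explicit.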

Power saving at Liverpool
- We have around centrally managed PCs across campus which were powered up overnight, at weekends and during vacations.
- Original power-saving policy was to “power off” machines after 30 minutes of inactivity; we now hibernate them after 15 minutes of inactivity.
- Policy has reduced wasteful inactivity time by ~ – hours per week (equivalent to MWh) leading to an estimated saving of approx. £ p.a.
- Makes extensive use of the PowerMAN system from Data Synergy, comprising:
  - a service which forces machines into a low-power state and reports machine activity to the Management Reporting Platform
  - the Management Reporting Platform: a central server from which usage stats can be retrieved and viewed via a web browser

Typical monthly Condor activity

A home-grown approach to power management
- Two main problems to deal with:
  - how to ensure Condor jobs are not evicted by hibernating PCs
  - how to wake up dormant PCs to run Condor jobs on demand
- PowerMAN service prevents job eviction:
  - PowerMAN can be given a list of “protected programs”; the machine remains active while any of them is running
  - include the condor_starter process as a protected program (only present while a Condor job is running)
- Wake-on-LAN (“WoL”) used to bring hibernating machines back to full power:
  - NICs must remain powered up during hibernation
  - NICs must be capable of waking machines on receipt of a “magic packet”
  - network must be able to route “magic packets” (not a problem for us but YMMV)
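For reference, a Wake-on-LAN “magic packet” is six 0xFF bytes followed by the target MAC address repeated sixteen times. The sketch below builds and broadcasts one; it is not the script used at Liverpool, and the broadcast address and UDP port 9 are assumptions chosen as common defaults.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Return the 102-byte WoL payload: 6 x 0xFF, then the MAC 16 times."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the packet over UDP (address and port are assumed defaults)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))
```

Whether a broadcast like this reaches a machine on another subnet depends on the routing caveat noted on the slide.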

Adapting Condor for use with power-saving PCs
- A cron job runs on the submit host and periodically examines the state of the queue ( condor_status -schedd ) and the pool ( condor_status ):
  - if there are more idle jobs in the queue than Unclaimed machines, hibernating machines need to be woken up
  - find out the number of powered-up machines in each “teaching centre” (classroom)
  - estimate the number of hibernating machines in each teaching centre from the total number of machines in each
  - sort centres from highest number of available machines to lowest
  - wake up centres in turn until sufficient machines have been woken to meet the demand (or all centres have been woken up)
- MAC addresses of machines are stored in files sorted according to teaching centre (needed for Wake-on-LAN)
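The selection logic described on this slide can be sketched as follows. This is a reconstruction under assumptions, not the original cron script: the function and parameter names are hypothetical, the counts are assumed to have been gathered already, and “available machines” is interpreted as the estimated number of hibernating machines in a centre.

```python
def centres_to_wake(idle_jobs, unclaimed, total_per_centre, powered_up_per_centre):
    """Return teaching-centre names to wake, largest estimated gain first."""
    shortfall = idle_jobs - unclaimed
    if shortfall <= 0:
        return []  # enough Unclaimed machines already; wake nothing

    # Estimate hibernating machines as (total in centre) - (currently powered up).
    hibernating = {
        centre: max(total - powered_up_per_centre.get(centre, 0), 0)
        for centre, total in total_per_centre.items()
    }

    # Sort centres from most to fewest machines expected to become available.
    ordered = sorted(hibernating, key=hibernating.get, reverse=True)

    chosen, expected = [], 0
    for centre in ordered:
        # Stop once demand is met or no remaining centre has machines to wake.
        if expected >= shortfall or hibernating[centre] == 0:
            break
        chosen.append(centre)
        expected += hibernating[centre]
    return chosen
```

Because whole centres are woken at a time, the number of machines woken can exceed the shortfall.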

Problems with the home-grown approach
- Assumes that any job can run on any machine:
  - users cannot choose particular teaching centres or machines in their job Requirements
  - ideally, the pool needs to be homogeneous
  - errors in the Requirements specification can cause severe problems (machines repeatedly wake up then hibernate again)
  - the cron job includes a “sanity check” for this
- Can only estimate the number of hibernating machines in each centre
- Same machines get woken up first

Power management in Condor 7.4.X
- Condor daemons can now place an execute host in a low-power state according to a given policy
- An execute host signals to the Condor central manager that it is about to enter a low-power state
- Central manager records persistent offline ClassAds for hibernating machines
- Negotiator can perform matchmaking with offline ClassAds
- Matches are passed to condor_rooster
- condor_rooster pipes information to condor_power, which wakes up machines using WoL

Implementing Condor power management
- Still use PowerMAN to power down inactive PCs rather than using Condor
- Need a way of advertising available offline machines to the condor_collector
- If we know which machines are currently active (A) and which machines make up the pool in total (P), then the offline machines form the subset O = P – A
- A cron job periodically advertises the offline machines and updates the timestamps (ClockMin / ClockDay)
- Finding P (the total set of machines which are out there) turns out to be a very difficult problem
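The relation O = P – A is a plain set difference. A minimal sketch, assuming both sets are held as collections of hostnames (the hostnames and the function name are hypothetical):

```python
def offline_machines(pool, active):
    """Return O = P - A as a sorted list of hostnames."""
    return sorted(set(pool) - set(active))


# Example with made-up hostnames: three machines in the pool, one active.
pool = ["pc1.example", "pc2.example", "pc3.example"]
active = ["pc2.example"]
offline = offline_machines(pool, active)
```

The difficult part, as the slide notes, is obtaining a trustworthy P; the difference itself is trivial once P and A are known.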

How do we determine which machines are available to Condor?
- Try waking them up!
- Wake up all machines in each teaching centre once a week using WoL
- After the wakeup call, wait a few minutes and test each machine in turn with: condor_status -direct
- Sanity check similar to UNIX ping
- Record which machines respond and publish ClassAds for them
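The weekly probe could be scripted along the following lines. This is a sketch under assumptions rather than the Liverpool script: it treats a machine as responding when `condor_status -direct <host>` exits with status 0 and prints some output, which is an assumed criterion, and the `run` parameter exists only so the command can be stubbed out when testing.

```python
import subprocess


def responds(host, run=subprocess.run, timeout=30):
    """Probe one host; True if the query exits 0 and prints something."""
    try:
        result = run(
            ["condor_status", "-direct", host],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Command missing or no answer in time: treat as not responding.
        return False
    return result.returncode == 0 and bool(result.stdout.strip())


def responding_hosts(hosts, run=subprocess.run):
    """Return the hosts that answered, in the order they were tested."""
    return [host for host in hosts if responds(host, run=run)]
```

The list returned by `responding_hosts` would then be the set of machines for which ClassAds are published.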

Unforeseen problems
- Not all woken-up machines begin to run jobs:
  - number of wakeups is limited by our “roll-your-own” version of condor_power
  - condor_rooster originally attempted to wake up all offline machines which matched job requirements
  - included another limit in our condor_power script (number of wakeups must be < number of idle jobs)
  - Condor should fix this, adds ROOSTER_MAX_UNHIBERNATE configuration option
- Wanted to wake up machines in random order so that the same machines are not used repeatedly:
  - found that condor_negotiator ignored Rank values
  - used condor_power script to implement this (“shuffles the deck”)
  - Should be fixed in using ROOSTER_UNHIBERNATE_RANK config option
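The two workarounds on this slide, capping the number of wakeups and shuffling the candidates, can be sketched together. The function below is hypothetical and is not the site's condor_power replacement; it applies the limit exactly as the slide states it (strictly fewer wakeups than idle jobs).

```python
import random


def choose_wakeups(offline_matches, idle_jobs, rng=None):
    """Pick which matched offline machines to wake.

    Candidates are shuffled so the same machines are not always chosen,
    then truncated so that the number of wakeups is < idle_jobs.
    """
    if rng is None:
        rng = random.Random()
    candidates = list(offline_matches)
    rng.shuffle(candidates)          # "shuffles the deck"
    limit = max(idle_jobs - 1, 0)    # strict inequality, as on the slide
    return candidates[:limit]
```

Passing a seeded `random.Random` makes the choice reproducible, which is convenient when checking the behaviour by hand.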

Unforeseen problems / cont’d
- Condor continued to wake up machines after jobs were removed (or completed):
  - use Unhibernate = CurrentTime - MachineLastMatchTime < 300, not Unhibernate =!= Undefined
- Difficult to distinguish Unclaimed offline machines from online ones in condor_status:
  - to see all offline machines:
    $ condor_status -constraint Offline==True
  - to see all powered-up machines:
    $ condor_status -constraint Offline=!=True
- Also difficult to distinguish in Condor View graphs
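The effect of the time-windowed Unhibernate expression can be illustrated outside Condor. In this sketch times are seconds since the epoch, an Undefined MachineLastMatchTime is modelled as None (an assumption about how to mirror ClassAd semantics), and the function name is hypothetical.

```python
WAKE_WINDOW_SECONDS = 300  # the 300 from the slide's expression


def should_unhibernate(current_time, machine_last_match_time):
    """Mirror CurrentTime - MachineLastMatchTime < 300.

    A machine that has never been matched (None here) is not woken, and a
    match older than the window no longer triggers a wakeup.
    """
    if machine_last_match_time is None:
        return False
    return current_time - machine_last_match_time < WAKE_WINDOW_SECONDS
```

With the window in place, a match made for a job that has since been removed stops causing wakeups once it is five minutes old.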

Results – wakeup test

Future Directions
- Condor power management will allow us to expand the pool to include even low-spec machines:
  - if machines are not needed or are unsuitable they need not be woken up
  - Rank can be used so that newer (more energy-efficient) machines are used first
- We would like a more accurate way of determining which machines are available. One possible method:
  - record the amount of time since each machine last appeared in the pool and/or ran a job
  - confidence in waking a PC can be described by a monotonically decreasing function of this
  - may still need to wake machines for testing occasionally
- Encourage users to incorporate their own checkpointing code to reduce “badput” and energy wastage (see Liverpool Condor website for details).
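One possible shape for the confidence function mentioned on this slide is exponential decay. The slide does not specify a function, so both the form and the one-week half-life below are assumptions chosen purely for illustration.

```python
ONE_WEEK_SECONDS = 7 * 24 * 3600  # assumed half-life, not from the slides


def wake_confidence(seconds_since_seen, half_life_seconds=ONE_WEEK_SECONDS):
    """Confidence in [0, 1] that a machine will respond to a wakeup.

    Starts at 1.0 for a machine seen just now and halves every
    half_life_seconds, so it decreases monotonically with elapsed time.
    """
    if seconds_since_seen < 0:
        raise ValueError("elapsed time cannot be negative")
    return 0.5 ** (seconds_since_seen / half_life_seconds)
```

Machines whose confidence falls below some chosen threshold could be scheduled for an occasional test wakeup rather than being advertised as available.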

Further Information