Intelligent Power Management over Large Clusters Stephen McGough *, Clive Gerrard *, Paul Haldane *, Sindre Hamlander +, Paul Robinson +, Dave Sharples *, Dan Swan *, Stuart Wheater + * Newcastle University + Arjuna Technologies Ltd Condor Week 2011

Overview
- What is the DI?
- Background
- Motivation
- Power Saving in Condor
- Intelligent Power Management
- Conclusion

What is the DI?
Society is increasingly dependent on digital technologies: energy, transport, healthcare, commerce, engineering…
- How can Newcastle researchers develop and harness these technologies to shape the future for everyone's benefit?
Research is increasingly dependent on digital technologies: record, store, analyse, model, share, visualise…
- How can Newcastle researchers harness these technologies to do new, innovative research?
Institute goals:
- Facilitate internationally-leading research – innovative, exciting, pioneering – with impact on researchers & society
- A thriving, inter-disciplinary research community

Background

Condor at Newcastle
- Using open-access cluster computers around campus
  – ~1300 Windows-based computers
  – ~250 Linux-based computers
- Using staff desktops
  – ~3000 computers (not all Condor-enabled – yet!)
- Using dedicated resources
  – ~200 computers in research groups around campus
- All computers at least dual-core, moving to quad
  – ~6000-core cluster
Condor was originally installed 5 years ago as an 'unsupported' service. Now we're making it a supported service.

Power Efficiency of Computers
Power efficiency: efficiency = flops / (PUE ∗ watts)
- flops/watts – pre-determined – an average value is used
- Power Usage Effectiveness (PUE) – depends on the location of the computer (and time)
Type              | flops | Wattage
Ultraslim Desktop | 941M  | 54
CAD Computer      | 1265M | 87
Legacy            | 600M  | 100
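The efficiency formula above can be checked against the table directly; a minimal sketch (the PUE default of 1.0 is an illustrative assumption, since the slides only say PUE is below or above 1 depending on the room):

```python
# Efficiency of each machine type, using the slide's formula:
#   efficiency = flops / (PUE * watts)
MACHINES = {
    "Ultraslim Desktop": {"flops": 941e6,  "watts": 54},
    "CAD Computer":      {"flops": 1265e6, "watts": 87},
    "Legacy":            {"flops": 600e6,  "watts": 100},
}

def efficiency(flops, watts, pue=1.0):
    """Useful flops per watt actually drawn from the grid."""
    return flops / (pue * watts)

for name, m in MACHINES.items():
    print(name, efficiency(m["flops"], m["watts"]))
```

Despite its higher raw flops, the CAD machine is less efficient per watt than the ultraslim desktop, which is why rank should use the ratio rather than raw speed.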

Cluster Locations
Old Library basement cluster room
- Needs heating all year
- PUE < 1 (offset heat from computers against room heating)
MSc Computing cluster
- South-facing cluster room in a high tower
- PUE > 1 (needs air-con all year)

Motivation

The University has a strong desire to power down computers to save energy (money): if a computer is not 'working' it should be powered down.
(Charts: Windows XP cluster and Windows 7 cluster)

Motivation
Description                              | Number | %
Total number of jobs submitted to Condor |        |
- Which completed                        |        |
  - Without wasted time                  |        |
  - With wasted time                     | 533778%
- Which were removed                     |        |
  - Before execution                     |        |
  - After some execution                 |        |
We have five years of Condor logging to mine
- Other computers were used, though this is the main log
- Initial analysis looks promising…

Motivation
Total time used by Condor           | 51 years, 29 days, 20h 2m 52s   | 100%
Total wasted time                   | 33 years, 129 days, 4h 6m 13s   | 65%
Total time wasted on removed jobs   | 26 years, 33 days, 22h 36m 28s  | 51%
Total time wasted on completed jobs | 7 years, 95 days, 5h 29m 45s    | 14%
Total real job execution time       | 17 years, 265 days, 15h 56m 39s | 35%
Further analysis didn't look as good… (data from the main Condor submit computer – others exist)
For every 1 second of useful time we require ~3 seconds through Condor

Average Wasted Time (minutes)

Rank
Rank is used to express preference. The vast majority of people don't use Rank
– Some who do get it wrong!
Rank Statement                                    | %       | Meaning
Null                                              | 99.87%  | No preference
LoadAvg                                           | 0.082%  | Prefer computers with higher loads
((OpSys == "WINNT51")) + ((OpSys == "LINUX") * 2) | 0.047%  | Prefer Linux to Windows
Memory                                            | 0.0015% | Prefer computers with more memory !!!
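For the few users who do express a preference, the table's Linux-preferring expression would sit in a submit description like this sketch (the executable name is a placeholder):

```
universe   = vanilla
executable = my_job
# Prefer Linux machines (rank 2) over Windows XP machines (rank 1);
# anything else ranks 0. Among matching machines, higher Rank wins.
Rank = ((OpSys == "WINNT51")) + ((OpSys == "LINUX") * 2)
queue
```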

Aims
- Reduce power consumption
- Produce no impact on interactive machine users
- Produce no impact on Condor users
- Provide auditing of computing time used
…on a cycle-scavenging Condor system.
All these things can be done using Condor – so what's new?

Power Saving in Condor

What's Wrong With Condor?
Nothing – Condor is "too good", but…
- With a good administrator all of this can be done
- With plenty of time all of this can be done
Put another way:
- Administrators are busy people
- They have plenty of other tasks to do besides monitoring and tweaking Condor, which gets along just fine without them
Most users don't know Condor well enough
- They can't specify preferences over resources
- Only 3 users had done this at Newcastle – and one got it wrong

Power Saving in Condor
- A power rating (watts) and a PUE are added to each worker description
- A Rank expression needs to be added to each job submission: Rank = flops / (PUE ∗ watts)
  – Automatically added by our system
  – And merged with the user's Rank, if present
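The merge step could look like this sketch (Python; the ClassAd attribute names KFlops, PUE and Watts, and the tie-breaking scale factor, are assumptions about how a site might publish its power data):

```python
# Assumed machine-ad attributes advertising flops and power data.
EFFICIENCY = "(TARGET.KFlops / (TARGET.PUE * TARGET.Watts))"

def merged_rank(user_rank=None, efficiency=EFFICIENCY):
    """Build the Rank expression injected into a job submission.

    The user's own Rank (if any) dominates; the power-efficiency
    term only breaks ties between machines the user ranks equally.
    """
    if user_rank:
        return "((" + user_rank + ") * 1000000) + " + efficiency
    return efficiency
```

A job with no Rank gets the pure efficiency expression; a job ranking on Memory keeps that preference first and falls back to efficiency among equals.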

Powering up and down
We use our own computer power-down script:
- When no one is using the computer
- When no Condor jobs are running
A persistent ClassAd is sent to Condor
- Indicates the computer is powered down
Rooster can power up the computer
- If a job description requests this
- This calls our wakeup script
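HTCondor ships this machinery as the condor_rooster daemon; a configuration sketch along those lines (knob names are from the HTCondor manual, but the threshold values and the script path are site-specific assumptions):

```
# Worker-side: hibernate when the slot is unclaimed and the keyboard
# has been idle for 30 minutes ("S3" = suspend-to-RAM).
HIBERNATE = ifThenElse(State == "Unclaimed" && KeyboardIdle > 1800, "S3", "NONE")

# Central manager: run condor_rooster, which watches the persistent
# (offline) ClassAds and wakes machines that queued jobs have matched.
DAEMON_LIST = $(DAEMON_LIST) ROOSTER
ROOSTER_WAKEUP_CMD = /usr/local/condor/wakeup.sh   # hypothetical site wakeup script
```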

Powering down Windows
We suspend computers. Windows can (wake up and) suspend for many reasons
- Not all under our control: anti-virus updates, software updates
- We can catch many of these, but ~20% we can't
We monitor the cluster for these
- Monitor condor_status for nodes which don't update
- If a node fails to update, ping it
- If it's no longer there, copy its last ClassAd and post a fake replacement for a hibernating computer
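A minimal sketch of that sweep's decision logic (Python; the bookkeeping of update times and the 15-minute staleness threshold are assumptions, and actually posting the fake ad would go through condor_advertise, which is not shown):

```python
import time

STALE_AFTER = 15 * 60  # seconds without a ClassAd update (assumed threshold)

def find_stale(last_update_by_node, now=None):
    """Return nodes whose last condor_status update is too old."""
    now = time.time() if now is None else now
    return [n for n, t in last_update_by_node.items() if now - t > STALE_AFTER]

def replacement_ad(last_ad):
    """Copy a node's last ClassAd, marked as an offline/hibernating machine."""
    ad = dict(last_ad)
    ad["Offline"] = True
    ad["State"] = "Hibernating"
    return ad

# Example sweep over recorded update times (seconds since epoch):
stale = find_stale({"node1": 1000.0, "node2": 1900.0}, now=2000.0)
```

Nodes flagged as stale would then be pinged; only the unreachable ones get the fake hibernating ad, so a machine that is merely slow to report is left alone.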

Intelligent Power Management

Aims
- Reduce power consumption
- Produce no impact on interactive machine users
- Produce no impact on Condor users
- Provide auditing of computing time used
…on a cycle-scavenging Condor system.
All these things can be done using Condor – so what's new?
- Do all of this in a more automatic, 'intelligent' way

Arjuna's Agility is a federation tool
- Often used with cloud computing platforms
- Supports structuring the cloud into federated sub-clouds through Service Agreements
- User-configured (or user-implemented) policies can be installed to manage the interaction via Service Agreements

Using Agility to provide intelligent management between parts of Condor:
- Adding in Rank
- Deciding if a user has the right to power up computers
- Monitoring activity and looking for anomalous behavior
Behavior is modified by the addition of new Agility policies
- A Service Agreement change will be accepted only if no policy rejects it and at least one policy accepts it
- Modification of Agility policies can lead to modification of the Condor policies and/or modification of incoming jobs
Audit individuals' worker usage for potential billing

General Architecture

Job Submission Route

Defined Policy
- Favor energy-efficient computers
  – Unless the user states otherwise
- Prioritize submissions
  – Professors over PhDs
- Mark and manage rogue jobs
  – Jobs that have executed for too long
  – Jobs that have been restarted too many times
  – Jobs incorrectly submitted
- Backlog reduction
  – Modify the above policy to deal with backlogs
- Auditing
  – Provide auditing for all jobs (per user, group, school, faculty, university)
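The rogue-job test reduces to a small predicate; a sketch with illustrative thresholds (the actual limits would be set by site policy, not these values):

```python
# Assumed policy parameters: a week of wall-clock time, five restarts.
MAX_RUNTIME_S = 7 * 24 * 3600
MAX_RESTARTS = 5

def is_rogue(runtime_s, restarts, valid_submit=True):
    """Flag a job that has run too long, restarted too often,
    or was incorrectly submitted."""
    return (runtime_s > MAX_RUNTIME_S
            or restarts > MAX_RESTARTS
            or not valid_submit)
```

A policy engine would evaluate this per job and, on a hit, hold or remove the job rather than let it keep wasting machine time.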

Auditing
Agility stores all attempted job runs; we can use this to provide audits of computer use.
A job which takes n attempts to run consumes power of:
- Do we include evicted job time?
Audits can be produced per user, group, school or university.
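The formula itself did not survive transcription; one plausible form, charging each of the n attempts its PUE-weighted power draw for its duration, with eviction time switchable to match the open question above (field names are illustrative):

```python
def job_energy_wh(attempts, include_evicted=True):
    """Energy billed to a job, in watt-hours, summed over all attempts."""
    total = 0.0
    for a in attempts:
        if a.get("evicted") and not include_evicted:
            continue  # policy choice: drop time from evicted attempts
        total += a["watts"] * a["pue"] * a["hours"]
    return total

# A job evicted once before completing on a second attempt:
runs = [{"watts": 54, "pue": 1.2, "hours": 2.0, "evicted": True},
        {"watts": 54, "pue": 1.2, "hours": 3.0, "evicted": False}]
```

Grouping these totals by the submitter's attributes gives the per-user, group, school or university audits.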

Future Policy
- Federating between Condor pools
- Federating with the cloud
  – Cloud bursting
- Resource profiling
  – Working out when resources are idle, and for how long
- Job profiling
  – Trying to estimate job lengths for users
- Dedicated workers for evictees
- Running jobs on computers with active users
- Education of users

Conclusion
Condor is good
- Most users aren't as good
- System administrators are busy
- They need tooling to automatically monitor and tweak the setup
This, along with preferring power-efficient computers and turning off idle computers, can save a lot of energy (and money).

Questions?