High-Performance Power-Aware Computing Vincent W. Freeh Computer Science NCSU vin@csc.ncsu.edu
Acknowledgements NCSU Tyler K. Bletsch Mark E. Femal Nandini Kappiah Feng Pan Daniel M. Smith U of Georgia Robert Springer Barry Rountree Prof. David K. Lowenthal
The case for power management Eric Schmidt, Google CEO: “it’s not speed but power—low power, because data centers can consume as much electricity as a small city.” Power/energy consumption becoming key issue Power limitations Energy = Heat; Heat dissipation is costly Non-trivial amount of money Consequence Excessive power consumption limits performance Fewer nodes can operate concurrently Goal Increase power/energy efficiency More performance per unit power/energy
CPU scaling How: reduce frequency & voltage, which reduces power & performance (power ∝ frequency × voltage²). Energy/power gears: a frequency-voltage pair, i.e., a power-performance setting and an energy-time tradeoff. Why CPU scaling? The CPU is a large power consumer, and the scaling mechanism already exists. [Figure: power and application throughput both fall as frequency/voltage is reduced.]
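As a concrete illustration of the mechanism (not the project's own tool), a gear change can be requested on a Linux system through the cpufreq sysfs interface; the path, the userspace governor, and write permission are assumptions about the platform.

```c
/* Sketch: request a CPU frequency "gear" via the Linux cpufreq sysfs
 * interface. Assumes the "userspace" governor is active and the caller
 * may write the file; the testbed's actual shifting mechanism may differ. */
#include <stdio.h>

/* Hypothetical helper: set core `cpu` to `khz` (e.g., 800000 for 800 MHz). */
static int set_gear(int cpu, unsigned long khz)
{
    char path[128];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);
    f = fopen(path, "w");
    if (!f)
        return -1;          /* no permission, or governor is not "userspace" */
    fprintf(f, "%lu\n", khz);
    fclose(f);
    return 0;
}
```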
Is CPU scaling a win? [Figure: system power over time T at the full gear, split into P_CPU and P_other; the corresponding energies E_CPU and E_other.]
Is CPU scaling a win? Benefit: E_CPU shrinks at the reduced gear. Cost: the run stretches from T to T+ΔT, so E_other grows. [Figure: power vs. time at the full and reduced gears, showing P_CPU, P_other, and P_system.]
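Stated as a formula (a restatement of the figure, with ΔT the run-time extension and P'_CPU the reduced CPU power), scaling is a net win only when the CPU energy saved exceeds the extra energy the rest of the system draws during the extension:

```latex
\[
E_{\mathrm{full}} = (P_{\mathrm{CPU}} + P_{\mathrm{other}})\,T ,
\qquad
E_{\mathrm{reduced}} = (P'_{\mathrm{CPU}} + P_{\mathrm{other}})\,(T + \Delta T)
\]
\[
E_{\mathrm{reduced}} < E_{\mathrm{full}}
\;\Longleftrightarrow\;
P_{\mathrm{CPU}}\,T - P'_{\mathrm{CPU}}\,(T + \Delta T) > P_{\mathrm{other}}\,\Delta T
\]
```

That is, the CPU energy saved (the benefit) must exceed the extra energy drawn by the rest of the system over the extension (the cost).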
Our work Exploit bottlenecks Application waiting on bottleneck resource Reduce power consumption (non-critical resource) Generally CPU not on critical path Bottlenecks we exploit Intra-node (memory) Inter-node (load imbalance) Contributions Impact studies [HPPAC ’05] [IPDPS ’05] Varying gears/nodes [PPoPP ’05] [PPoPP ’06 (submitted)] Leveraging load imbalance [SC ’05]
Methodology Cluster used: 10 nodes, AMD Athlon-64. The processor supports 7 frequency-voltage settings (gears): 2000 MHz / 1.5 V, 1800 MHz / 1.4 V, 1600 MHz / 1.35 V, 1400 MHz / 1.3 V, 1200 MHz / 1.2 V, 1000 MHz / 1.1 V, 800 MHz / 1.0 V. Measure: wall clock time (gettimeofday system call) and energy (external power meter).
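For concreteness, a minimal sketch of the wall-clock measurement; the energy side, read from the external power meter, is not shown.

```c
/* Sketch: bracket the run with gettimeofday() and report elapsed seconds. */
#include <stdio.h>
#include <sys/time.h>

static double now_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    double t0 = now_seconds();
    /* ... run the benchmark ... */
    double t1 = now_seconds();
    printf("wall clock: %.3f s\n", t1 - t0);
    return 0;
}
```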
NAS
CG – 1 node Not CPU bound: little time penalty (+1%) and large energy savings (-17%) at 800 MHz relative to 2000 MHz.
EP – 1 node CPU bound: big time penalty (+11%) and little or no energy savings (-3%).
Operations per miss SP: 49.5, CG: 8.60, BT: 79.6, EP: 844
Multiple nodes – EP Perfect speedup: E constant as N increases
Multiple nodes – LU Good speedup: E-T tradeoff as N increases. [Plot annotations: S8 = 5.3; gear 2: S8 = 5.8, E8 = 1.28; S4 = 3.3, E4 = 1.15; S2 = 1.9, E2 = 1.03.]
Phases
Phases: LU
Phase detection First, divide the program into blocks; all code in a block executes in the same gear. Block boundaries: MPI operations and points where OPM is expected to change. Then, merge adjacent blocks into phases: merge if memory pressure is similar (| OPMi – OPMi+1 | is small) or if the block is small (short time); see the sketch below. Note, in future: leverage the large body of phase detection research [Kennedy & Kremer 1998] [Sherwood, et al 2002]
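A minimal sketch of the merge test, assuming hypothetical per-block records of OPM and execution time; the thresholds are illustrative, not values from the study.

```c
/* Sketch: merge adjacent blocks into phases when their memory pressure
 * (operations per L2 miss, OPM) is similar, or when a block is too short
 * to be worth shifting for. Records and thresholds are illustrative. */
#include <math.h>
#include <stddef.h>

struct block { double opm; double seconds; };

/* Returns nonzero if blocks i and i+1 should belong to the same phase. */
static int should_merge(const struct block *b, size_t i,
                        double opm_tol, double min_seconds)
{
    if (fabs(b[i].opm - b[i + 1].opm) < opm_tol)   /* similar memory pressure */
        return 1;
    if (b[i + 1].seconds < min_seconds)            /* too short to shift for  */
        return 1;
    return 0;
}
```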
Data collection Use MPI-jack: pre and post hooks around MPI calls, used for example for program tracing and gear shifting. Gather profile data during execution: define an MPI-jack hook for every MPI operation and insert a pseudo MPI call at the end of loops. Information collected: type of call and location (PC), status (gear, time, etc.), statistics (uops and L2 misses for OPM calculation). [Figure: MPI-jack code interposed between the application and the MPI library.]
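The slides do not show MPI-jack's internals; the sketch below uses the standard MPI profiling interface (PMPI), which gives the same pre/post hook structure around each MPI operation.

```c
/* Sketch of a pre/post hook around one MPI call, in the style described
 * above. Uses the standard PMPI profiling interface; MPI-jack's actual
 * implementation may differ. */
#include <mpi.h>

static void pre_hook(const char *name)  { /* e.g., record gear, time, counters */ }
static void post_hook(const char *name) { /* e.g., record elapsed block time   */ }

int MPI_Barrier(MPI_Comm comm)
{
    int rc;
    pre_hook("MPI_Barrier");
    rc = PMPI_Barrier(comm);    /* forward to the real implementation */
    post_hook("MPI_Barrier");
    return rc;
}
```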
Example: bt
Comparing two schedules What is the “best” schedule? Depends on user User supplies “better” function bool better(i, j) Several metrics can be used Energy-delay Energy-delay squared [Cameron et al. SC2004]
Slope metric The project uses the slope of the energy-time tradeoff. Slope = -1 means the energy savings equals the time delay. The user defines the limit: limit = 0 minimizes energy; limit = -∞ minimizes time. If slope < limit, then the schedule is better. We do not advocate this metric over others.
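A sketch of a slope-based better() test consistent with the description above; the exact normalization of the energy and time deltas is an assumption.

```c
/* Sketch: compare two schedules by the ratio of relative energy change to
 * relative time change, accepting the slower schedule only if the slope is
 * steeper (more negative) than the user-supplied limit. */
#include <stdbool.h>

struct schedule { double energy; double time; };

static bool better(struct schedule i, struct schedule j, double limit)
{
    double d_energy = (j.energy - i.energy) / i.energy;   /* relative delta E */
    double d_time   = (j.time   - i.time)   / i.time;     /* relative delta T */

    if (d_time <= 0.0)              /* j is no slower: accept if it saves energy */
        return d_energy < 0.0;
    return (d_energy / d_time) < limit;   /* e.g., limit = -1.5 in the bt example */
}
```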
Example: bt Solutions compared (is slope < -1.5?): 1: 00 vs 01, slope -11.7, true; 2: 01 vs 02, slope -1.78, true; 3: 02 vs 03, slope -1.19, false; 4: 02 vs 12, slope -1.44, false. 02 is the best.
Benefit of multiple gears: mg
Current work: no. of nodes, gear/phase
Load imbalance
Node bottleneck The best course is to keep the load balanced, but load balancing is hard. Instead, slow down a node that is not critical. How to tell if a node is not critical? Consider a barrier: all nodes must arrive before any leave, so there is no benefit to arriving early. Measure block time and assume it is (mostly) the same between iterations. Assumptions: iterative application; the past predicts the future.
Example Reduced performance & power yields energy savings: with predicted slack, a node runs iteration k+1 at performance = (t - slack)/t instead of performance = 1. [Figure: synchronization points across iterations k and k+1, with the predicted slack marked.]
Measuring slack Blocking operations (Receive, Wait, Barrier) are measured with MPI-jack. Individual operations are too frequent (hundreds or thousands per second), so slack is aggregated over one or more iterations. Computing slack, S: measure the times of the computing and blocking phases, T = C1 + B1 + C2 + B2 + … + Cn + Bn, and compute the aggregate slack S = (B1 + B2 + … + Bn) / T.
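A minimal sketch of this bookkeeping, with hypothetical accumulator fields for the compute and block times.

```c
/* Sketch: the hooks accumulate time spent computing (C_k) and blocking (B_k)
 * over an iteration; slack is the blocked fraction S = (B_1+...+B_n) / T. */
struct slack_acc { double compute; double block; };

static void account(struct slack_acc *a, double compute_s, double block_s)
{
    a->compute += compute_s;
    a->block   += block_s;
}

static double aggregate_slack(const struct slack_acc *a)
{
    double total = a->compute + a->block;           /* T = sum C_k + sum B_k */
    return total > 0.0 ? a->block / total : 0.0;    /* S = sum B_k / T       */
}
```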
Slack Slack varies between nodes and between applications. [Figure: communication slack for Aztec, Sweep3d, and CG.] Use net slack: each node individually determines its slack, and a reduction finds the minimum slack.
Shifting When to reduce performance? When there is enough slack. When to increase performance? When application performance suffers. Create high and low limits for slack; damping is needed. The limits are learned dynamically (they are not the same for all applications): the range starts small and increases if necessary. [Figure: slack above the range → reduce gear; within the range → same gear; below → increase gear.] A sketch of this decision appears below.
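A sketch of the per-iteration decision, combining the min-slack reduction from the previous slide with high/low slack limits; the limit values, their adaptation, and the exact definition of net slack are assumptions.

```c
#include <mpi.h>

/* Gear 0 is the fastest setting; larger gear numbers are slower, lower power. */
static int choose_gear(int gear, int max_gear, double my_slack,
                       double lo_limit, double hi_limit)
{
    double min_slack, net_slack;

    /* Each node measures its own slack; the minimum over all nodes marks the
     * critical node. Net slack is taken here as slack beyond that minimum
     * (an assumption about the exact definition used). */
    MPI_Allreduce(&my_slack, &min_slack, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);
    net_slack = my_slack - min_slack;

    if (net_slack > hi_limit && gear < max_gear)
        return gear + 1;   /* ample slack: shift down to a slower, lower-power gear */
    if (net_slack < lo_limit && gear > 0)
        return gear - 1;   /* performance suffering: shift back up */
    return gear;           /* inside the damping range: keep the current gear */
}
```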
Aztec gears
Performance Aztec Sweep3d
Synthetic benchmark
Summary Contributions: improved the energy efficiency of HPC applications; found a simple metric for phase boundary location; developed a simple, effective linear-time algorithm for determining proper gears; leveraged load imbalance. Future work: reduce the sampling interval to a handful of iterations; reduce algorithm time with modeling and prediction; develop AMPERE, a message passing environment for reducing energy. http://fortknox.csc.ncsu.edu:osr/ vin@csc.ncsu.edu dkl@cs.uga.edu
End
Shifting test NAS LU – 1 node [Plot annotations: 7.7%, 1%, 1%, 4.5%.]
Beta Hsu & Kremer [PLDI ‘03]: β relates application slowdown to CPU slowdown, β = (T(f)/T(fmax) − 1) / (fmax/f − 1). β = 1: time is CPU dependent; β = 0: time is independent of the CPU. OPM and β are correlated. [Figure: β vs. log(OPM).]
OPM and β and slack OPM is not strongly correlated to β in multi-node runs. Why? There is another bottleneck: communication slack, i.e., waiting time (e.g., MPI_Receive, MPI_Wait, MPI_Barrier). MG: OPM = 70.6, slack = 25%; LU: OPM = 73.5, slack = 11%. β can be predicted with log(OPM) and slack.
Energy savings (synthetic)
Normalized – MG With communication bottleneck E-T tradeoff improves as N increases
SPEC FP
SPEC INT
Single node – MG Modest memory pressure: gears offer an E-T tradeoff. [Plot annotations: +6%, -7%, +12%, -8%.]
Dynamically adjust performance [Figure: net slack over time, with gear shifts marked.]
Adjust performance [Figure: net slack over time, with gear shifts marked.]
Dampening [Figure: net slack over time, showing the damping range.]
Power consumption Average for NAS suite
Related work: Energy conservation Goal: conserve energy; performance degradation is acceptable. Usually in mobile environments (finite energy source, i.e., a battery). Primary goal: extend battery life. Secondary goal: re-allocate energy to increase the “value” of energy use. Tertiary goal: increase energy efficiency (more tasks per unit energy). Example: feedback-driven energy conservation controls average power usage, Pave = (E0 – Ef)/T. [Figure: power and frequency over time.]
Related work: Realtime DVS Goal: reduce energy consumption with no performance degradation. Mechanism: eliminate slack time in the system. Savings: Eidle with frequency scaling; additionally Etask – Etask’ with voltage scaling. [Figure: power vs. time with and without DVS, showing Etask, Etask’, Eidle, Pmax, and the deadline.]
Related work Previous studies in power-aware HPC Cameron et al., SC 2004 & IPDPS 2005, Freeh et al., IPDPS 2005 Energy-aware server clusters Many projects; e.g., Heath PPoPP 2005 Low-power supercomputer design Green Destiny (Warren et al., 2002) Orion Multisystems
Related work: Fixed installations Goal: Reduce cost (in heat generation or $) Goal is not to conserve a battery Mechanisms Scaling Fine-grain – DVS Coarse-grain – power down Load balancing
Memory pressure Why different tradeoffs? CG is memory bound: the CPU is not on the critical path. EP is CPU bound: the CPU is on the critical path. Operations per miss: a metric of memory pressure that indicates the criticality of the CPU. Use performance counters to count micro-operations and cache misses, as sketched below.
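A sketch of the OPM computation, assuming PAPI's classic high-level counter API; total instructions stand in here for the micro-operation count the study uses, and error handling is omitted.

```c
/* Sketch: compute operations per miss (OPM) for a code region using
 * hardware performance counters via PAPI (an assumed interface). */
#include <papi.h>
#include <stdio.h>

int main(void)
{
    int events[2] = { PAPI_TOT_INS, PAPI_L2_TCM };  /* instructions, L2 misses */
    long long values[2];

    PAPI_start_counters(events, 2);
    /* ... run the code region being characterized ... */
    PAPI_stop_counters(values, 2);

    double opm = values[1] ? (double)values[0] / values[1] : 0.0;
    printf("operations per miss: %.1f\n", opm);
    return 0;
}
```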
Single node – MG
Single node – LU