1
High-Performance Power-Aware Computing
Vincent W. Freeh, Computer Science, NCSU
2
Acknowledgements
NCSU: Tyler K. Bletsch, Mark E. Femal, Nandini Kappiah, Feng Pan, Daniel M. Smith
U of Georgia: Robert Springer, Barry Rountree, Prof. David K. Lowenthal
3
The case for power management
Eric Schmidt, Google CEO: “it’s not speed but power—low power, because data centers can consume as much electricity as a small city.”
Power/energy consumption is becoming a key issue:
- Power limitations
- Energy = heat, and heat dissipation is costly
- A non-trivial amount of money
Consequence: excessive power consumption limits performance; fewer nodes can operate concurrently
Goal: increase power/energy efficiency (more performance per unit power/energy)
4
CPU scaling
Power ∝ frequency × voltage²
How: CPU scaling
- Reduce frequency & voltage
- Reduce power & performance
Energy/power gears:
- A frequency-voltage pair
- A power-performance setting
- An energy-time tradeoff
Why CPU scaling?
- The CPU is a large power consumer
- A mechanism exists
[Figure: power and application throughput, each plotted against frequency/voltage]
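A worked instance of the scaling relation, using an assumed voltage pair (the gear voltages are not listed in this transcript): dropping from a 2000 MHz / 1.5 V gear to an 800 MHz / 1.0 V gear gives

\[ \frac{P_{low}}{P_{high}} = \frac{f_{low}}{f_{high}} \left( \frac{V_{low}}{V_{high}} \right)^2 = \frac{800}{2000} \left( \frac{1.0}{1.5} \right)^2 \approx 0.18 \]

i.e., roughly a 5× reduction in CPU power under these assumed voltages.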
5
Is CPU scaling a win?
[Figure: system power over time at the full gear; P_system = P_CPU + P_other over runtime T, giving energy components E_CPU and E_other]
6
Is CPU scaling a win?
[Figure: the same breakdown at a reduced gear; lowering P_CPU shrinks E_CPU (benefit), but the run stretches from T to T + ΔT, so E_other grows (cost)]
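Written out in the slide’s own symbols, the tradeoff is

\[ E_{full} = (P_{CPU} + P_{other}) \, T, \qquad E_{reduced} = (P'_{CPU} + P_{other})(T + \Delta T) \]

and scaling is a win exactly when E_reduced < E_full, i.e., when the CPU power saved over T outweighs the whole-system power drawn over the extra ΔT.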
7
Our work: exploit bottlenecks
When the application is waiting on a bottleneck resource, reduce the power consumption of the non-critical resource; generally the CPU is not on the critical path
Bottlenecks we exploit:
- Intra-node (memory)
- Inter-node (load imbalance)
Contributions:
- Impact studies [HPPAC ’05] [IPDPS ’05]
- Varying gears/nodes [PPoPP ’05] [PPoPP ’06 (submitted)]
- Leveraging load imbalance [SC ’05]
8
Methodology
Cluster used: 10 nodes, AMD Athlon-64
Processor supports 7 frequency-voltage settings (gears)
[Table: frequency (MHz) and voltage (V) of each gear]
Measurements:
- Wall-clock time (gettimeofday system call)
- Energy (external power meter)
9
NAS benchmarks
10
CG – 1 node
[Figure: energy and time at 2000 MHz vs. 800 MHz; time +1%, energy −17%]
Not CPU bound: little time penalty, large energy savings
11
EP – 1 node
[Figure: time +11%, energy −3%]
CPU bound: big time penalty, no (little) energy savings
12
Operations per miss
CG: 8.60
SP: 49.5
BT: 79.6
EP: 844
13
Multiple nodes – EP
Perfect speedup: E constant as N increases
14
Multiple nodes – LU
Good speedup: E-T tradeoff as N increases
S8 = 5.3
Gear 2: S2 = 1.9 (E2 = 1.03), S4 = 3.3 (E4 = 1.15), S8 = 5.8 (E8 = 1.28)
15
Phases
16
Phases: LU
17
Phase detection
First, divide the program into blocks:
- All code in a block executes in the same gear
- Block boundaries: MPI operations, or an expected OPM change
Then, merge adjacent blocks into phases:
- Merge if similar memory pressure (using OPM): |OPM_i − OPM_i+1| small
- Merge if small (short time)
(a sketch of the merge step follows)
Note, in the future: leverage the large body of phase detection research [Kennedy & Kremer 1998] [Sherwood, et al. 2002]
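A minimal sketch of the merge step, assuming illustrative thresholds (the slide gives the criteria, not the cutoff values):

```c
#include <math.h>
#include <stddef.h>

typedef struct {
    double opm;   /* measured operations per cache miss */
    double time;  /* seconds spent in the block */
} Block;

#define OPM_SIMILAR   10.0   /* assumed cutoff for |OPM_i - OPM_i+1| */
#define MIN_PHASE_SEC 0.01   /* assumed cutoff for "short" blocks */

/* Merge adjacent blocks[0..n-1] in place; returns the phase count. */
size_t merge_blocks(Block *blocks, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (out > 0 &&
            (fabs(blocks[out-1].opm - blocks[i].opm) < OPM_SIMILAR ||
             blocks[i].time < MIN_PHASE_SEC)) {
            /* similar memory pressure or a short block: fold into
               the previous phase, keeping a time-weighted OPM */
            double t = blocks[out-1].time + blocks[i].time;
            blocks[out-1].opm = (blocks[out-1].opm * blocks[out-1].time +
                                 blocks[i].opm * blocks[i].time) / t;
            blocks[out-1].time = t;
        } else {
            blocks[out++] = blocks[i];  /* start a new phase */
        }
    }
    return out;
}
```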
18
Data collection
Use MPI-jack: pre and post hooks around the application’s calls into the MPI library
For example: program tracing, gear shifting
Gather profile data during execution:
- Define an MPI-jack hook for every MPI operation
- Insert a pseudo MPI call at the end of loops
Information collected:
- Type of call and location (PC)
- Status (gear, time, etc.)
- Statistics (uops and L2 misses, for the OPM calculation)
(a generic interposition sketch follows)
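MPI-jack’s own interface is not shown in these slides; the sketch below uses the standard PMPI profiling layer, which supports the same pre/post-hook pattern. The record_event body is hypothetical:

```c
#include <mpi.h>

/* hypothetical hook: trace the call, shift gears, read counters */
static void record_event(const char *op, int is_post)
{
    (void)op; (void)is_post;
}

/* Intercept MPI_Barrier; the application links against this wrapper,
   which forwards to the real implementation via PMPI_Barrier. */
int MPI_Barrier(MPI_Comm comm)
{
    record_event("MPI_Barrier", 0);   /* pre hook */
    int rc = PMPI_Barrier(comm);      /* real MPI operation */
    record_event("MPI_Barrier", 1);   /* post hook */
    return rc;
}
```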
19
Example: bt
20
Comparing two schedules
What is the “best” schedule? It depends on the user
The user supplies a “better” function: bool better(i, j)
Several metrics can be used: energy-delay (sketched below), energy-delay squared [Cameron et al. SC2004]
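One possible better(), using the energy-delay product; this formulation is illustrative, not the project’s chosen metric:

```c
#include <stdbool.h>

/* Schedule j is better than schedule i if it has a lower E*T product.
   e_* and t_* are the measured energy and time of each schedule. */
bool better_edp(double e_i, double t_i, double e_j, double t_j)
{
    return (e_j * t_j) < (e_i * t_i);
}
```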
21
Slope metric
The project uses slope: the energy-time tradeoff
[Figure: E-T plane with schedules i and j and the limit slope]
Slope = −1: energy savings equal the time delay
The user defines the limit:
- Limit = 0: minimize energy
- Limit = −∞: minimize time
If slope < limit, then better (see the sketch below)
We do not advocate this metric over others
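A direct sketch of the slope rule, with energy and time normalized to the full-power run:

```c
#include <stdbool.h>

/* Schedule j is better than schedule i if moving from i to j saves
   enough energy per unit of added time (assumes t_j > t_i).
   `limit` is user-defined: 0 minimizes energy, a very negative
   value minimizes time. */
bool slope_better(double e_i, double t_i, double e_j, double t_j,
                  double limit)
{
    double slope = (e_j - e_i) / (t_j - t_i);   /* dE / dT */
    return slope < limit;
}
```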
22
Example: bt
Step  i   j   Slope   Slope < −1.5?
1     00  01  −11.7   true
2     01  02  −1.78   true
3     02  03  −1.19   false
4     02  12  −1.44   false
02 is the best
23
Benefit of multiple gears: mg
24
Current work: no. of nodes, gear/phase
25
Load imbalance
26
Node bottleneck
The best course is to keep the load balanced, but load balancing is hard
Instead, slow down a node if it is not the critical node
How to tell if a node is not critical? Suppose a barrier:
- All nodes must arrive before any leave
- There is no benefit to arriving early
Measure block time; assume it is (mostly) the same between iterations
Assumptions: iterative application; the past predicts the future
27
Example
[Figure: iterations k and k+1 between synchronization points; slack measured in iteration k is predicted for iteration k+1; full speed is performance = 1, and the slowed node runs at performance = (t − slack)/t]
Reduced performance & power; energy savings
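A worked instance with illustrative numbers (t = 10 s per iteration, 2 s of measured slack):

\[ performance = \frac{t - slack}{t} = \frac{10 - 2}{10} = 0.8 \]

so the non-critical node can run at 80% of full speed and still reach the synchronization point on time.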
28
Measuring slack
Blocking operations: Receive, Wait, Barrier
Measure with MPI-jack
Too frequent: there can be hundreds or thousands per second, so aggregate slack over one or more iterations
Computing slack S:
- Measure times for the computing and blocking phases: T = C1 + B1 + C2 + B2 + … + Cn + Bn
- Compute the aggregate slack: S = (B1 + B2 + … + Bn) / T
(a bookkeeping sketch follows)
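A minimal bookkeeping sketch, assuming the hooks above timestamp each blocking call; the function names are illustrative:

```c
#include <mpi.h>

static double block_time = 0.0;   /* B1 + B2 + ... + Bn so far */
static double iter_start = 0.0;

void iteration_begin(void)
{
    iter_start = MPI_Wtime();
    block_time = 0.0;
}

/* Called from the pre/post hooks around Receive/Wait/Barrier;
   here a barrier stands in for any blocking operation. */
void blocking_call(MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    PMPI_Barrier(comm);
    block_time += MPI_Wtime() - t0;
}

/* S = (B1 + ... + Bn) / T for the iteration just ended */
double iteration_slack(void)
{
    double T = MPI_Wtime() - iter_start;
    return block_time / T;
}
```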
29
Slack
[Figure: communication slack for Aztec, Sweep3d, and CG]
Slack varies between nodes and between applications
Use net slack:
- Each node individually determines its slack
- A reduction finds the minimum slack
30
Shifting
When to reduce performance? When there is enough slack
When to increase performance? When application performance suffers
Create high and low limits for slack:
- Slack above the high limit: reduce gear
- Slack within the limits: same gear
- Slack below the low limit: increase gear
Damping is needed
Dynamically learn the limits: they are not the same for all applications; the range starts small and increases if necessary
(a decision sketch follows)
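A sketch of the shifting decision with a high/low slack band and simple damping; the band values, growth factor, and set_gear() are assumptions:

```c
#define NUM_GEARS 7                /* from the methodology slide */

static double low_limit  = 0.05;   /* assumed initial band */
static double high_limit = 0.10;
static int    gear = 0;            /* 0 = fastest gear */

void adjust_gear(double slack)     /* e.g., from iteration_slack() */
{
    if (slack > high_limit && gear < NUM_GEARS - 1) {
        gear++;                    /* enough slack: slow down, save energy */
    } else if (slack < low_limit && gear > 0) {
        gear--;                    /* performance suffers: speed up */
        high_limit *= 1.5;         /* damping: widen the band so the
                                      gear does not ping-pong */
    }
    /* set_gear(gear);  hypothetical platform call */
}
```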
31
Aztec gears
32
Performance: Aztec and Sweep3d
33
Synthetic benchmark
34
Summary
Contributions:
- Improved the energy efficiency of HPC applications
- Found a simple metric for phase boundary location
- Developed a simple, effective linear-time algorithm for determining proper gears
- Leveraged load imbalance
Future work:
- Reduce the sampling interval to a handful of iterations
- Reduce algorithm time with modeling and prediction
- Develop AMPERE, a message passing environment for reducing energy
35
End
36
Shifting test
[Figure: NAS LU, 1 node; annotations: 7.7%, 1%, 1%, 4.5%]
37
Beta [Hsu & Kremer, PLDI ’03]
Relates application slowdown to CPU slowdown:
β = (T(f) / T(f_max) − 1) / (f_max / f − 1)
- β = 1: time is CPU dependent
- β = 0: time is independent of the CPU
[Figure: log(OPM) vs. β; the two are correlated]
38
OPM and β and slack
OPM is not strongly correlated with β in multi-node runs
Why? There is another bottleneck: communication slack (waiting time, e.g., MPI_Recv, MPI_Wait, MPI_Barrier)
MG: OPM = 70.6, slack = 25%
LU: OPM = 73.5, slack = 11%
β can be predicted from log(OPM) and slack
39
Energy savings (synthetic)
40
Normalized – MG
With a communication bottleneck, the E-T tradeoff improves as N increases
41
SPEC FP
42
SPEC INT
43
Single node – MG
[Figure: time +6%, energy −7%; time +12%, energy −8%]
Modest memory pressure: gears offer an E-T tradeoff
44
Dynamically adjust performance
[Figure: net slack over time; gear annotations 2, 1, 2]
45
Adjust performance
[Figure: net slack over time; gear annotations 1, 1, 1]
46
Dampening
[Figure: net slack over time; gear annotations 1, 1, 1]
47
Power consumption: average for the NAS suite
48
Related work: Energy conservation
Goal: conserve energy; performance degradation is acceptable
Usually in mobile environments (finite energy source: a battery)
- Primary goal: extend battery life
- Secondary goal: re-allocate energy to increase the “value” of energy use
- Tertiary goal: increase energy efficiency (more tasks per unit energy)
Example: feedback-driven energy conservation; control the average power usage, P_ave = (E_0 − E_f) / T
[Figure: power and frequency over time, from initial energy E_0 to final energy E_f]
49
Related work: Realtime DVS
Goal: reduce energy consumption with no performance degradation
Mechanism: eliminate slack time in the system
Savings: E_idle with frequency scaling; additionally E_task − E'_task with voltage scaling
[Figure: power vs. time; a full-speed run at P_max finishes early and idles until the deadline, while the scaled run finishes exactly at the deadline]
50
Related work
Previous studies in power-aware HPC: Cameron et al., SC 2004 & IPDPS 2005; Freeh et al., IPDPS 2005
Energy-aware server clusters: many projects, e.g., Heath, PPoPP 2005
Low-power supercomputer design: Green Destiny (Warren et al., 2002); Orion Multisystems
51
Related work: Fixed installations
Goal: reduce cost (in heat generation or $); the goal is not to conserve a battery
Mechanisms:
- Scaling: fine-grain (DVS) or coarse-grain (power down)
- Load balancing
52
Memory pressure
Why the different tradeoffs?
- CG is memory bound: the CPU is not on the critical path
- EP is CPU bound: the CPU is on the critical path
Operations per miss (OPM):
- A metric of memory pressure; indicates the criticality of the CPU
- Use performance counters: count micro-operations and cache misses
(a counter sketch follows)
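A sketch of the OPM measurement using hardware counters via the classic PAPI high-level API; the slides count micro-operations, for which PAPI_TOT_INS (retired instructions) stands in here, so treat the event choice as an assumption:

```c
#include <papi.h>

/* Run `region` and return its operations-per-miss ratio. */
double measure_opm(void (*region)(void))
{
    int events[2] = { PAPI_TOT_INS, PAPI_L2_TCM };  /* ops, L2 misses */
    long long counts[2];

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_start_counters(events, 2);
    region();                          /* the code being profiled */
    PAPI_stop_counters(counts, 2);

    return (double)counts[0] / (double)counts[1];
}
```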
53
Single node – MG
54
Single node – LU