1
High-Performance Power-Aware Computing
Vincent W. Freeh, Computer Science, NCSU
2
Acknowledgements
NCSU: Tyler K. Bletsch, Mark E. Femal, Nandini Kappiah, Feng Pan, Daniel M. Smith
U of Georgia: Robert Springer, Barry Rountree, Prof. David K. Lowenthal
3
The case for power management
Eric Schmidt, Google CEO: “it’s not speed but power—low power, because data centers can consume as much electricity as a small city.”
Power/energy consumption is becoming a key issue:
- Power limitations
- Energy = heat, and heat dissipation is costly
- A non-trivial amount of money
Consequence: excessive power consumption limits performance; fewer nodes can operate concurrently
Goal: increase power/energy efficiency (more performance per unit power/energy)
4
CPU scaling
Power ∝ frequency × voltage²
How: CPU scaling
- Reduce frequency & voltage
- Reduce power & performance
Energy/power gears:
- A frequency-voltage pair
- A power-performance setting
- An energy-time tradeoff
Why CPU scaling?
- The CPU is a large power consumer
- A mechanism exists
[Figure: power and application throughput, each plotted against frequency/voltage]
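A worked instance of the scaling relation, using an assumed voltage pair (the gear voltages are not listed in this transcript): dropping from a 2000 MHz / 1.5 V gear to an 800 MHz / 1.0 V gear gives

\[ \frac{P_{low}}{P_{high}} = \frac{f_{low}}{f_{high}} \left( \frac{V_{low}}{V_{high}} \right)^2 = \frac{800}{2000} \left( \frac{1.0}{1.5} \right)^2 \approx 0.18 \]

i.e., roughly a 5× reduction in CPU power under these assumed voltages.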
5
Is CPU scaling a win?
[Figure: system power over time at the full gear; P_system = P_CPU + P_other over runtime T, giving energy components E_CPU and E_other]
6
Is CPU scaling a win?
[Figure: the same breakdown at a reduced gear; lowering P_CPU shrinks E_CPU (benefit), but the run stretches from T to T + ΔT, so E_other grows (cost)]
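Written out in the slide’s own symbols, the tradeoff is

\[ E_{full} = (P_{CPU} + P_{other}) \, T, \qquad E_{reduced} = (P'_{CPU} + P_{other})(T + \Delta T) \]

and scaling is a win exactly when E_reduced < E_full, i.e., when the CPU power saved over T outweighs the whole-system power drawn over the extra ΔT.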
7
Our work: exploit bottlenecks
When the application is waiting on a bottleneck resource, reduce the power consumption of the non-critical resource; generally the CPU is not on the critical path
Bottlenecks we exploit:
- Intra-node (memory)
- Inter-node (load imbalance)
Contributions:
- Impact studies [HPPAC ’05] [IPDPS ’05]
- Varying gears/nodes [PPoPP ’05] [PPoPP ’06 (submitted)]
- Leveraging load imbalance [SC ’05]
8
Methodology
Cluster used: 10 nodes, AMD Athlon-64
Processor supports 7 frequency-voltage settings (gears)
[Table: frequency (MHz) and voltage (V) of each gear]
Measurements:
- Wall-clock time (gettimeofday system call)
- Energy (external power meter)
9
NAS benchmarks
10
CG – 1 node
[Figure: energy and time at 2000 MHz vs. 800 MHz; time +1%, energy −17%]
Not CPU bound: little time penalty, large energy savings
11
EP – 1 node
[Figure: time +11%, energy −3%]
CPU bound: big time penalty, no (little) energy savings
12
Operations per miss
CG: 8.60
SP: 49.5
BT: 79.6
EP: 844
13
Multiple nodes – EP
Perfect speedup: E constant as N increases
14
Multiple nodes – LU
Good speedup: E-T tradeoff as N increases
S8 = 5.3
Gear 2: S2 = 1.9 (E2 = 1.03), S4 = 3.3 (E4 = 1.15), S8 = 5.8 (E8 = 1.28)
15
Phases
16
Phases: LU
17
Phase detection
First, divide the program into blocks:
- All code in a block executes in the same gear
- Block boundaries: MPI operations, or an expected OPM change
Then, merge adjacent blocks into phases:
- Merge if similar memory pressure (using OPM): |OPM_i − OPM_i+1| small
- Merge if small (short time)
(a sketch of the merge step follows)
Note, in the future: leverage the large body of phase detection research [Kennedy & Kremer 1998] [Sherwood, et al. 2002]
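A minimal sketch of the merge step, assuming illustrative thresholds (the slide gives the criteria, not the cutoff values):

```c
#include <math.h>
#include <stddef.h>

typedef struct {
    double opm;   /* measured operations per cache miss */
    double time;  /* seconds spent in the block */
} Block;

#define OPM_SIMILAR   10.0   /* assumed cutoff for |OPM_i - OPM_i+1| */
#define MIN_PHASE_SEC 0.01   /* assumed cutoff for "short" blocks */

/* Merge adjacent blocks[0..n-1] in place; returns the phase count. */
size_t merge_blocks(Block *blocks, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (out > 0 &&
            (fabs(blocks[out-1].opm - blocks[i].opm) < OPM_SIMILAR ||
             blocks[i].time < MIN_PHASE_SEC)) {
            /* similar memory pressure or a short block: fold into
               the previous phase, keeping a time-weighted OPM */
            double t = blocks[out-1].time + blocks[i].time;
            blocks[out-1].opm = (blocks[out-1].opm * blocks[out-1].time +
                                 blocks[i].opm * blocks[i].time) / t;
            blocks[out-1].time = t;
        } else {
            blocks[out++] = blocks[i];  /* start a new phase */
        }
    }
    return out;
}
```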
18
Data collection
Use MPI-jack: pre and post hooks around the application’s calls into the MPI library
For example: program tracing, gear shifting
Gather profile data during execution:
- Define an MPI-jack hook for every MPI operation
- Insert a pseudo MPI call at the end of loops
Information collected:
- Type of call and location (PC)
- Status (gear, time, etc.)
- Statistics (uops and L2 misses, for the OPM calculation)
(a generic interposition sketch follows)
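MPI-jack’s own interface is not shown in these slides; the sketch below uses the standard PMPI profiling layer, which supports the same pre/post-hook pattern. The record_event body is hypothetical:

```c
#include <mpi.h>

/* hypothetical hook: trace the call, shift gears, read counters */
static void record_event(const char *op, int is_post)
{
    (void)op; (void)is_post;
}

/* Intercept MPI_Barrier; the application links against this wrapper,
   which forwards to the real implementation via PMPI_Barrier. */
int MPI_Barrier(MPI_Comm comm)
{
    record_event("MPI_Barrier", 0);   /* pre hook */
    int rc = PMPI_Barrier(comm);      /* real MPI operation */
    record_event("MPI_Barrier", 1);   /* post hook */
    return rc;
}
```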
19
Example: bt
20
Comparing two schedules
What is the “best” schedule? It depends on the user
The user supplies a “better” function: bool better(i, j)
Several metrics can be used: energy-delay (sketched below), energy-delay squared [Cameron et al. SC2004]
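One possible better(), using the energy-delay product; this formulation is illustrative, not the project’s chosen metric:

```c
#include <stdbool.h>

/* Schedule j is better than schedule i if it has a lower E*T product.
   e_* and t_* are the measured energy and time of each schedule. */
bool better_edp(double e_i, double t_i, double e_j, double t_j)
{
    return (e_j * t_j) < (e_i * t_i);
}
```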
21
Slope metric
The project uses slope: the energy-time tradeoff
[Figure: E-T plane with schedules i and j and the limit slope]
Slope = −1: energy savings equal the time delay
The user defines the limit:
- Limit = 0: minimize energy
- Limit = −∞: minimize time
If slope < limit, then better (see the sketch below)
We do not advocate this metric over others
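A direct sketch of the slope rule, with energy and time normalized to the full-power run:

```c
#include <stdbool.h>

/* Schedule j is better than schedule i if moving from i to j saves
   enough energy per unit of added time (assumes t_j > t_i).
   `limit` is user-defined: 0 minimizes energy, a very negative
   value minimizes time. */
bool slope_better(double e_i, double t_i, double e_j, double t_j,
                  double limit)
{
    double slope = (e_j - e_i) / (t_j - t_i);   /* dE / dT */
    return slope < limit;
}
```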
22
Example: bt
Step  i   j   Slope   Slope < −1.5?
1     00  01  −11.7   true
2     01  02  −1.78   true
3     02  03  −1.19   false
4     02  12  −1.44   false
02 is the best
23
Benefit of multiple gears: mg
24
Current work: no. of nodes, gear/phase
25
Load imbalance
26
Node bottleneck
The best course is to keep the load balanced, but load balancing is hard
Instead, slow down a node if it is not the critical node
How to tell if a node is not critical? Suppose a barrier:
- All nodes must arrive before any leave
- There is no benefit to arriving early
Measure block time; assume it is (mostly) the same between iterations
Assumptions: iterative application; the past predicts the future
27
Example
[Figure: iterations k and k+1 between synchronization points; slack measured in iteration k is predicted for iteration k+1; full speed is performance = 1, and the slowed node runs at performance = (t − slack)/t]
Reduced performance & power; energy savings
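A worked instance with illustrative numbers (t = 10 s per iteration, 2 s of measured slack):

\[ performance = \frac{t - slack}{t} = \frac{10 - 2}{10} = 0.8 \]

so the non-critical node can run at 80% of full speed and still reach the synchronization point on time.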
28
Measuring slack
Blocking operations: Receive, Wait, Barrier
Measure with MPI-jack
Too frequent: there can be hundreds or thousands per second, so aggregate slack over one or more iterations
Computing slack S:
- Measure times for the computing and blocking phases: T = C1 + B1 + C2 + B2 + … + Cn + Bn
- Compute the aggregate slack: S = (B1 + B2 + … + Bn) / T
(a bookkeeping sketch follows)
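A minimal bookkeeping sketch, assuming the hooks above timestamp each blocking call; the function names are illustrative:

```c
#include <mpi.h>

static double block_time = 0.0;   /* B1 + B2 + ... + Bn so far */
static double iter_start = 0.0;

void iteration_begin(void)
{
    iter_start = MPI_Wtime();
    block_time = 0.0;
}

/* Called from the pre/post hooks around Receive/Wait/Barrier;
   here a barrier stands in for any blocking operation. */
void blocking_call(MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    PMPI_Barrier(comm);
    block_time += MPI_Wtime() - t0;
}

/* S = (B1 + ... + Bn) / T for the iteration just ended */
double iteration_slack(void)
{
    double T = MPI_Wtime() - iter_start;
    return block_time / T;
}
```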
29
Slack
[Figure: communication slack for Aztec, Sweep3d, and CG]
Slack varies between nodes and between applications
Use net slack:
- Each node individually determines its slack
- A reduction finds the minimum slack
30
Shifting
When to reduce performance? When there is enough slack
When to increase performance? When application performance suffers
Create high and low limits for slack:
- Slack above the high limit: reduce gear
- Slack within the limits: same gear
- Slack below the low limit: increase gear
Damping is needed
Dynamically learn the limits: they are not the same for all applications; the range starts small and increases if necessary
(a decision sketch follows)
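A sketch of the shifting decision with a high/low slack band and simple damping; the band values, growth factor, and set_gear() are assumptions:

```c
#define NUM_GEARS 7                /* from the methodology slide */

static double low_limit  = 0.05;   /* assumed initial band */
static double high_limit = 0.10;
static int    gear = 0;            /* 0 = fastest gear */

void adjust_gear(double slack)     /* e.g., from iteration_slack() */
{
    if (slack > high_limit && gear < NUM_GEARS - 1) {
        gear++;                    /* enough slack: slow down, save energy */
    } else if (slack < low_limit && gear > 0) {
        gear--;                    /* performance suffers: speed up */
        high_limit *= 1.5;         /* damping: widen the band so the
                                      gear does not ping-pong */
    }
    /* set_gear(gear);  hypothetical platform call */
}
```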
31
Aztec gears
32
Performance: Aztec and Sweep3d
33
Synthetic benchmark
34
Summary
Contributions:
- Improved the energy efficiency of HPC applications
- Found a simple metric for phase boundary location
- Developed a simple, effective linear-time algorithm for determining proper gears
- Leveraged load imbalance
Future work:
- Reduce the sampling interval to a handful of iterations
- Reduce algorithm time with modeling and prediction
- Develop AMPERE, a message passing environment for reducing energy
35
End
36
Shifting test
[Figure: NAS LU, 1 node; annotations: 7.7%, 1%, 1%, 4.5%]
37
Beta [Hsu & Kremer, PLDI ’03]
Relates application slowdown to CPU slowdown:
β = (T(f) / T(f_max) − 1) / (f_max / f − 1)
- β = 1: time is CPU dependent
- β = 0: time is independent of the CPU
[Figure: log(OPM) vs. β; the two are correlated]
38
OPM and β and slack
OPM is not strongly correlated with β in multi-node runs
Why? There is another bottleneck: communication slack (waiting time, e.g., MPI_Recv, MPI_Wait, MPI_Barrier)
MG: OPM = 70.6, slack = 25%
LU: OPM = 73.5, slack = 11%
β can be predicted from log(OPM) and slack
39
Energy savings (synthetic)
40
Normalized – MG
With a communication bottleneck, the E-T tradeoff improves as N increases
41
SPEC FP
42
SPEC INT
43
Single node – MG
[Figure: time +6%, energy −7%; time +12%, energy −8%]
Modest memory pressure: gears offer an E-T tradeoff
44
Dynamically adjust performance
[Figure: net slack over time; gear annotations 2, 1, 2]
45
Adjust performance
[Figure: net slack over time; gear annotations 1, 1, 1]
46
Dampening
[Figure: net slack over time; gear annotations 1, 1, 1]
47
Power consumption: average for the NAS suite
48
Related work: Energy conservation
Goal: conserve energy; performance degradation is acceptable
Usually in mobile environments (finite energy source: a battery)
- Primary goal: extend battery life
- Secondary goal: re-allocate energy to increase the “value” of energy use
- Tertiary goal: increase energy efficiency (more tasks per unit energy)
Example: feedback-driven energy conservation; control the average power usage, P_ave = (E_0 − E_f) / T
[Figure: power and frequency over time, from initial energy E_0 to final energy E_f]
49
Related work: Realtime DVS
Goal: reduce energy consumption with no performance degradation
Mechanism: eliminate slack time in the system
Savings: E_idle with frequency scaling; additionally E_task − E'_task with voltage scaling
[Figure: power vs. time; a full-speed run at P_max finishes early and idles until the deadline, while the scaled run finishes exactly at the deadline]
50
Related work
Previous studies in power-aware HPC: Cameron et al., SC 2004 & IPDPS 2005; Freeh et al., IPDPS 2005
Energy-aware server clusters: many projects, e.g., Heath, PPoPP 2005
Low-power supercomputer design: Green Destiny (Warren et al., 2002); Orion Multisystems
51
Related work: Fixed installations
Goal: reduce cost (in heat generation or $); the goal is not to conserve a battery
Mechanisms:
- Scaling: fine-grain (DVS) or coarse-grain (power down)
- Load balancing
52
Memory pressure
Why the different tradeoffs?
- CG is memory bound: the CPU is not on the critical path
- EP is CPU bound: the CPU is on the critical path
Operations per miss (OPM):
- A metric of memory pressure; indicates the criticality of the CPU
- Use performance counters: count micro-operations and cache misses
(a counter sketch follows)
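A sketch of the OPM measurement using hardware counters via the classic PAPI high-level API; the slides count micro-operations, for which PAPI_TOT_INS (retired instructions) stands in here, so treat the event choice as an assumption:

```c
#include <papi.h>

/* Run `region` and return its operations-per-miss ratio. */
double measure_opm(void (*region)(void))
{
    int events[2] = { PAPI_TOT_INS, PAPI_L2_TCM };  /* ops, L2 misses */
    long long counts[2];

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_start_counters(events, 2);
    region();                          /* the code being profiled */
    PAPI_stop_counters(counts, 2);

    return (double)counts[0] / (double)counts[1];
}
```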
53
Single node – MG
54
Single node – LU