High-Performance Power-Aware Computing

High-Performance Power-Aware Computing Vincent W. Freeh Computer Science NCSU vin@csc.ncsu.edu

Acknowledgements
NCSU: Tyler K. Bletsch, Mark E. Femal, Nandini Kappiah, Feng Pan, Daniel M. Smith
U of Georgia: Robert Springer, Barry Rountree, Prof. David K. Lowenthal

The case for power management
- Eric Schmidt, Google CEO: “it’s not speed but power—low power, because data centers can consume as much electricity as a small city.”
- Power/energy consumption is becoming a key issue: power is limited, energy becomes heat, and heat dissipation costs a non-trivial amount of money
- Consequence: excessive power consumption limits performance, so fewer nodes can operate concurrently
- Goal: increase power/energy efficiency, i.e., more performance per unit power/energy

How: CPU scaling
- Power ∝ frequency × voltage²
- Reducing frequency and voltage reduces both power and performance
- Energy/power gears: each gear is a frequency-voltage pair, i.e., a power-performance setting and an energy-time tradeoff
- Why CPU scaling? The CPU is a large power consumer, and the scaling mechanism already exists
[Figure: power and application throughput both fall as frequency/voltage is reduced]
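
The proportionality on this slide comes from the standard CMOS dynamic-power model; the worked form below is background, not taken from the slides.

```latex
% Standard CMOS dynamic-power model behind "power ∝ frequency × voltage²":
%   alpha = activity factor, C = switched capacitance.
\[
  P_{\text{dyn}} \approx \alpha\, C\, V^{2} f ,
  \qquad
  E \approx P \cdot T .
\]
% Because f and V are lowered together, power drops faster than linearly,
% but the run time T grows; that is the energy-time tradeoff the gears expose.
```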

Is CPU scaling a win?
[Figure: at the full gear, system power P_system splits into P_CPU and P_other over run time T, giving energies E_CPU and E_other]

Is CPU scaling a win? (continued)
[Figure: in the reduced gear, P_CPU drops (benefit: smaller E_CPU), but the run stretches from T to T+ΔT, so E_other grows (cost); scaling wins when the benefit outweighs the cost]
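
To make the benefit/cost picture concrete, here is the algebra implied by the figure (notation from the slide; the inequality is my gloss, not a quoted result):

```latex
% Full gear: run time T.  Reduced gear: run time T + ΔT, CPU power P'_CPU.
\[
  E_{\text{full}} = (P_{\text{CPU}} + P_{\text{other}})\,T ,
  \qquad
  E_{\text{reduced}} = (P'_{\text{CPU}} + P_{\text{other}})(T + \Delta T) .
\]
% Scaling is a win when E_reduced < E_full, i.e. when the CPU-power benefit
% over the original run time exceeds the cost of running everything longer:
\[
  (P_{\text{CPU}} - P'_{\text{CPU}})\,T \;>\; (P'_{\text{CPU}} + P_{\text{other}})\,\Delta T .
\]
```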

Our work
- Exploit bottlenecks: when the application is waiting on a bottleneck resource, reduce the power of the non-critical resource; generally the CPU is not on the critical path
- Bottlenecks we exploit: intra-node (memory) and inter-node (load imbalance)
- Contributions: impact studies [HPPAC ’05] [IPDPS ’05]; varying gears/nodes [PPoPP ’05] [PPoPP ’06 (submitted)]; leveraging load imbalance [SC ’05]

Methodology
- Cluster used: 10 nodes, AMD Athlon-64; the processor supports 7 frequency-voltage settings (gears):
    Frequency (MHz):  2000  1800  1600  1400  1200  1000  800
    Voltage (V):       1.5   1.4   1.35  1.3   1.2   1.1   1.0
- Measurements: wall clock time (gettimeofday system call) and energy (external power meter)

NAS

CG – 1 node
Not CPU bound: little time penalty, large energy savings (800MHz vs. 2000MHz: +1% time, -17% energy)

EP – 1 node
CPU bound: big time penalty, little or no energy savings (+11% time, -3% energy)

Operations per miss
SP: 49.5   CG: 8.60   BT: 79.6   EP: 844

Multiple nodes – EP Perfect speedup: E constant as N increases

Multiple nodes – LU
Good speedup: an energy-time tradeoff emerges as N increases (S8 = 5.3; at gear 2: S8 = 5.8, E8 = 1.28; S4 = 3.3, E4 = 1.15; S2 = 1.9, E2 = 1.03)

Phases

Phases: LU

Phase detection
- First, divide the program into blocks; all code in a block executes in the same gear. Block boundaries are placed at MPI operations and where a change in OPM is expected.
- Then, merge adjacent blocks into phases: merge if they have similar memory pressure (using OPM, i.e., |OPMi – OPMi+1| is small) or if a block is small (short time).
- Note, in future: leverage the large body of phase-detection research [Kennedy & Kremer 1998] [Sherwood et al. 2002]
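
A minimal sketch of the block-merging step described here, assuming each block is summarized as an (OPM, duration) pair; the thresholds and representation are illustrative assumptions, not the authors' implementation.

```python
def merge_blocks(blocks, opm_tol=0.1, min_time=0.05):
    """Greedily merge adjacent blocks into phases.

    blocks: list of (opm, seconds) tuples in program order.
    opm_tol: merge neighbors whose relative OPM difference is below this.
    min_time: merge blocks shorter than this into the current phase.
    """
    phases = []
    for opm, secs in blocks:
        if phases:
            p_opm, p_secs = phases[-1]
            similar = abs(opm - p_opm) / max(p_opm, 1e-9) < opm_tol
            if similar or secs < min_time:
                # Time-weighted OPM for the merged phase.
                total = p_secs + secs
                phases[-1] = ((p_opm * p_secs + opm * secs) / total, total)
                continue
        phases.append((opm, secs))
    return phases
```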

Data collection
- Use MPI-jack: pre and post hooks around each MPI call, used for example for program tracing and gear shifting
- Gather profile data during execution: define an MPI-jack hook for every MPI operation, and insert a pseudo MPI call at the end of loops
- Information collected: type of call and location (PC), status (gear, time, etc.), statistics (uops and L2 misses for the OPM calculation)
[Figure: MPI-jack code interposed between the application and the MPI library]
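
MPI-jack's actual interface is not shown in these slides; purely as an illustration of the pre/post-hook idea, here is a generic wrapper with hypothetical helpers (read_uops, read_l2_misses, current_gear):

```python
import time
from collections import defaultdict

profile = defaultdict(list)            # (call name, call site) -> list of samples

# Placeholders: real code would read hardware counters and the CPU's current gear.
def read_uops():       return 0
def read_l2_misses():  return 0
def current_gear():    return 0

def hooked(call_name, call_site, fn, *args, **kwargs):
    """Pre/post hook around one MPI operation: time it and record profile data."""
    t0 = time.time()
    u0, m0 = read_uops(), read_l2_misses()
    result = fn(*args, **kwargs)        # the actual MPI call
    elapsed = time.time() - t0
    uops = read_uops() - u0
    misses = read_l2_misses() - m0
    profile[(call_name, call_site)].append({
        "gear": current_gear(),
        "time": elapsed,
        "opm": uops / misses if misses else float("inf"),
    })
    return result
```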

Example: bt

Comparing two schedules
- What is the “best” schedule? It depends on the user, who supplies a “better” function: bool better(i, j)
- Several metrics can be used: energy-delay, energy-delay squared [Cameron et al. SC2004]

Slope metric
- This project uses the slope of the energy-time tradeoff between two schedules
- Slope = -1 → energy savings equals time delay
- The user defines the limit: limit = 0 → minimize energy; limit = -∞ → minimize time
- If slope < limit, then the new schedule is better
- We do not advocate this metric over others
[Figure: energy-time points for schedules i and j, with the limit line]
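
A minimal sketch of a slope-based better(i, j); defining slope as relative energy change over relative time change is my reading of "slope = -1 → energy savings = time delay", not a formula quoted from the slides.

```python
def slope(i, j):
    """Relative energy change per relative time change between schedules i and j."""
    dE = (j["energy"] - i["energy"]) / i["energy"]
    dT = (j["time"] - i["time"]) / i["time"]
    return dE / dT

def better(i, j, limit=-1.0):
    """True if moving from schedule i to schedule j is worthwhile under the slope limit."""
    return slope(i, j) < limit

# Example: j saves 17% energy for a 1% slowdown relative to i.
i = {"energy": 100.0, "time": 10.0}
j = {"energy": 83.0, "time": 10.1}
print(better(i, j))   # slope = -0.17 / 0.01 = -17 < -1  ->  True
```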

Example: bt
Candidate solutions, tested against the limit (slope < -1.5?):
1. 00 → 01: slope = -11.7 → true
2. 01 → 02: slope = -1.78 → true
3. 02 → 03: slope = -1.19 → false
4. 02 → 12: slope = -1.44 → false
Schedule 02 is the best.

Benefit of multiple gears: mg

Current work: no. of nodes, gear/phase

Load imbalance

Node bottleneck
- The best course is to keep the load balanced, but load balancing is hard
- Instead, slow down a node if it is not the critical node
- How to tell a node is not critical? Consider a barrier: all nodes must arrive before any leave, so there is no benefit to arriving early
- Measure block time and assume it is (mostly) the same between iterations
- Assumptions: iterative application; the past predicts the future

Example
Reduced performance & power → energy savings.
[Figure: successive synch points for iterations k and k+1; the slack observed in iteration k predicts the slack in iteration k+1, so performance is scaled from 1 to (t - slack)/t]

Measuring slack
- Blocking operations (Receive, Wait, Barrier) are measured with MPI-jack
- Individual operations are too frequent (hundreds or thousands per second), so slack is aggregated over one or more iterations
- Computing slack S: measure the times of the computing and blocking phases, T = C1 + B1 + C2 + B2 + … + Cn + Bn, then compute the aggregate slack S = (B1 + B2 + … + Bn) / T
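
The aggregate-slack formula above translates directly into code; this small sketch assumes the per-phase compute and block times have already been measured:

```python
def aggregate_slack(compute_times, block_times):
    """S = (B1 + ... + Bn) / T, where T is the sum of all compute and block phases."""
    total_block = sum(block_times)
    total = sum(compute_times) + total_block
    return total_block / total if total > 0 else 0.0

# Example: three compute phases and three blocking phases in one iteration.
print(aggregate_slack([0.8, 1.1, 0.9], [0.1, 0.3, 0.2]))   # ~0.18 -> ~18% slack
```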

Slack
- Slack varies between nodes and between applications (communication slack shown for Aztec, Sweep3d, and CG)
- Use net slack: each node individually determines its slack, then a reduction finds the minimum slack

Shifting
- When to reduce performance? When there is enough slack. When to increase performance? When application performance suffers.
- Create high and low limits for slack: above the high limit reduce the gear, below the low limit increase the gear, otherwise keep the same gear
- Damping is needed: the limits are learned dynamically (they are not the same for all applications); the range starts small and increases if necessary
[Figure: slack over time T, with regions for reduce gear / same gear / increase gear]
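
A rough sketch of the kind of slack-driven gear controller this slide describes, with hysteresis limits that widen when the controller oscillates; the thresholds, widening rule, and gear numbering are assumptions, not the authors' algorithm.

```python
class GearController:
    """Per-node gear selection driven by measured slack, with a dead band for damping."""

    def __init__(self, num_gears=7, low=0.05, high=0.10, widen=0.02):
        self.gear = 0                 # 0 = fastest gear
        self.num_gears = num_gears
        self.low, self.high = low, high
        self.widen = widen            # how much to grow the dead band on oscillation
        self.last_move = 0

    def update(self, slack):
        """Call once per iteration with that iteration's aggregate (net) slack."""
        move = 0
        if slack > self.high and self.gear < self.num_gears - 1:
            self.gear += 1            # lots of waiting: slow down, save energy
            move = +1
        elif slack < self.low and self.gear > 0:
            self.gear -= 1            # little waiting: speed back up
            move = -1
        if move and move == -self.last_move:
            self.high += self.widen   # oscillating: widen the dead band (damping)
        self.last_move = move
        return self.gear
```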

Aztec gears

Performance Aztec Sweep3d

Synthetic benchmark

Summary
Contributions:
- Improved the energy efficiency of HPC applications
- Found a simple metric for locating phase boundaries
- Developed a simple, effective linear-time algorithm for determining proper gears
- Leveraged load imbalance
Future work:
- Reduce the sampling interval to a handful of iterations
- Reduce algorithm time with modeling and prediction
- Develop AMPERE, a message passing environment for reducing energy
http://fortknox.csc.ncsu.edu:osr/  vin@csc.ncsu.edu  dkl@cs.uga.edu

End

Shifting test: NAS LU – 1 node
[Figure: measured results, labeled 7.7%, 1%, 1%, and 4.5%]

Beta
- Hsu & Kremer [PLDI ’03]: β relates application slowdown to CPU slowdown
- β = 1 → time is CPU dependent; β = 0 → time is independent of the CPU
- OPM vs. β: correlated; log(OPM) predicts β
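
The slide's definition of β is garbled in this transcript; the standard Hsu & Kremer form is shown below, under the assumption that these slides follow it.

```latex
% Beta relates the application's slowdown to the CPU's slowdown:
%   T(f)  = execution time at frequency f,   f_max = the top frequency.
\[
  \frac{T(f)}{T(f_{\max})} \;=\; 1 + \beta\left(\frac{f_{\max}}{f} - 1\right),
  \qquad 0 \le \beta \le 1 .
\]
% beta = 1: time scales fully with the CPU; beta = 0: time is independent of the CPU.
```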

OPM, β, and slack
- OPM is not strongly correlated with β in the multi-node case. Why? There is another bottleneck: communication slack, i.e., waiting time (e.g., MPI_Receive, MPI_Wait, MPI_Barrier)
- MG: OPM = 70.6, slack = 25%; LU: OPM = 73.5, slack = 11%
- β can be predicted from log(OPM) and slack

Energy savings (synthetic)

Normalized – MG: with a communication bottleneck, the E-T tradeoff improves as N increases

SPEC FP

SPEC INT

Single node – MG
Modest memory pressure: gears offer an E-T tradeoff (+6% time, -7% energy; +12% time, -8% energy)

Dynamically adjust performance
[Figure: net slack vs. time as the gear is adjusted]

Adjust performance
[Figure: net slack vs. time]

Dampening
[Figure: net slack vs. time with dampening]

Power consumption Average for NAS suite

Related work: Energy conservation
- Goal: conserve energy; performance degradation is acceptable. Usually in mobile environments (finite energy source, i.e., a battery)
- Primary goal: extend battery life. Secondary goal: re-allocate energy to increase the “value” of energy use. Tertiary goal: increase energy efficiency (more tasks per unit energy)
- Example: feedback-driven energy conservation that controls average power usage, Pave = (E0 – Ef) / T
[Figure: power and frequency over time, between initial energy E0 and final energy Ef]

Related work: Realtime DVS
- Goal: reduce energy consumption with no performance degradation
- Mechanism: eliminate slack time in the system
- Savings: Eidle with frequency scaling, plus an additional Etask – Etask’ with voltage scaling
[Figure: power vs. time before and after scaling; the task stretches to its deadline, eliminating Eidle]

Related work
- Previous studies in power-aware HPC: Cameron et al., SC 2004 & IPDPS 2005; Freeh et al., IPDPS 2005
- Energy-aware server clusters: many projects, e.g., Heath, PPoPP 2005
- Low-power supercomputer design: Green Destiny (Warren et al., 2002); Orion Multisystems

Related work: Fixed installations
- Goal: reduce cost (in heat generation or dollars); the goal is not to conserve a battery
- Mechanisms: scaling (fine-grain – DVS; coarse-grain – power down) and load balancing

Memory pressure
- Why the different tradeoffs? CG is memory bound: the CPU is not on the critical path. EP is CPU bound: the CPU is on the critical path.
- Operations per miss (OPM) is the metric of memory pressure and indicates the criticality of the CPU
- Measured with performance counters: count micro-operations and cache misses
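
For reference, the metric described on this slide written out; the counter names are generic, not specific hardware event names.

```latex
\[
  \text{OPM} \;=\; \frac{\text{retired micro-operations}}{\text{L2 cache misses}}
\]
% Low OPM (e.g., CG at 8.60) means the CPU stalls on memory and is off the
% critical path; high OPM (e.g., EP at 844) means the CPU is the bottleneck.
```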

Single node – MG

Single node – LU