
1 Using Multiple Energy Gears in MPI Programs on a Power-Scalable Cluster
Vincent W. Freeh, David K. Lowenthal, Feng Pan, and Nandani Kappiah
Presented by: Huaxia Xia, CSAG, CSE of UCSD

2 of 17 Introduction
Power-aware computing: HPC uses large-scale systems with high power consumption.
Two extremes: performance-at-all-costs vs. lower performance but greater energy efficiency.
This paper aims to save energy with little performance penalty.

3 of 17 Related Work
Server/desktop systems:
- Minimize the number of servers needed to handle the load, and put the others into a low-energy state (standby or power-off)
- Set node voltage independently
Disk:
- Modulate disk speed dynamically
- Improve cache policies
- Aggregate disk accesses into bursts of requests
Mobile systems: energy-aware OS, voltage-changeable CPU, disk spindown, memory, network

4 of 17 Assumptions
HPC applications: performance is the primary concern; behavior is highly regular and predictable.
CPU has multiple "gears": variable frequency and variable voltage.
CPU is a major power consumer; energy consumption of disks, memory, and network is not considered.

5 of 17 Methodology: Profile-Directed
1. Get a program trace.
2. Divide the program into blocks.
3. Merge the blocks into phases.
4. Heuristically search for the best gear for each phase.

6 of 17 Divide Code into "Blocks"
Rule 1: Any MPI operation demarcates a block boundary.
Rule 2: If the memory pressure changes abruptly, a block boundary occurs at the change.
Memory pressure is measured as operations per miss (OPM); both rules are sketched below.
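A minimal sketch of the two boundary rules, assuming the trace supplies per-interval counts of retired operations and L2 misses from hardware performance counters; the change-factor threshold is an illustrative assumption, not a value from the paper:

    /* Block-boundary test over a profile trace (illustrative sketch). */
    #include <math.h>
    #include <stdbool.h>

    #define OPM_CHANGE_FACTOR 2.0   /* "abrupt" OPM change: assumed threshold */

    /* Memory pressure: operations per L2 miss. */
    static double opm(long ops, long l2_misses) {
        return l2_misses > 0 ? (double)ops / (double)l2_misses : INFINITY;
    }

    /* Rule 1: any MPI operation demarcates a boundary.
     * Rule 2: an abrupt change in OPM demarcates a boundary. */
    static bool is_block_boundary(bool at_mpi_call, double prev_opm, double cur_opm) {
        if (at_mpi_call)
            return true;
        double ratio = cur_opm > prev_opm ? cur_opm / prev_opm
                                          : prev_opm / cur_opm;
        return ratio > OPM_CHANGE_FACTOR;
    }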

7 of 17 Merge "Blocks" into "Phases"
Two adjacent blocks are merged into a phase if their memory pressures (OPM) fall within the same threshold.
[Figure: OPM in the trace of LU (Class C)]
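A sketch of the merge step under the same assumptions; the relative tolerance is invented for illustration:

    #include <math.h>

    #define OPM_MERGE_TOLERANCE 0.10   /* "same threshold": assumed 10% relative */

    typedef struct { double opm; long start, end; } Block;

    /* Merge adjacent blocks whose OPM values are close; compacts the
     * array in place and returns the resulting number of phases. */
    static int merge_into_phases(Block *b, int n) {
        int phases = 0;
        for (int i = 0; i < n; i++) {
            if (phases > 0 &&
                fabs(b[i].opm - b[phases-1].opm) / b[phases-1].opm
                    <= OPM_MERGE_TOLERANCE) {
                b[phases-1].end = b[i].end;   /* same memory pressure: extend phase */
            } else {
                b[phases++] = b[i];           /* start a new phase */
            }
        }
        return phases;
    }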

8 of 17 Data Collection
Use MPI-jack:
- Intercepts any MPI call transparently
- Can execute arbitrary code before/after an intercepted call
Insert pseudo MPI calls at non-MPI phase boundaries.
Collect time, operation counts, and L2 misses.
Question: mutual dependence? Trace data ↔ block boundaries.
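MPI-jack's transparent interception is in the spirit of the standard PMPI profiling interface; a minimal sketch wrapping MPI_Send (record_counters is an assumed stand-in for the bookkeeping, not MPI-jack's actual API):

    #include <mpi.h>

    void record_counters(const char *where);  /* assumed: logs time, ops, L2 misses */

    /* The application's call to MPI_Send lands here; the real work is
     * forwarded to PMPI_Send, with bookkeeping on either side. */
    int MPI_Send(const void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm) {
        record_counters("pre-MPI_Send");
        int rc = PMPI_Send(buf, count, type, dest, tag, comm);
        record_counters("post-MPI_Send");
        return rc;
    }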

9 of 17 Solution Search (1)
Metric: the energy-time tradeoff
- Normalized energy and time
- Total system energy
A larger negative slope indicates a near-vertical energy-time curve, i.e., a significant energy saving for little extra time.
Question: How can energy consumption be measured accurately?
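One plausible reconstruction of the slope metric (my reading, not a formula quoted from the paper): normalize energy and time to the full-speed run, then take the slope from the full-speed point to a candidate gear g:

    \[
      \mathrm{slope}(g) \;=\; \frac{E_g / E_{\mathrm{full}} \;-\; 1}{T_g / T_{\mathrm{full}} \;-\; 1}
    \]

Slowing down saves energy (negative numerator) while adding time (positive denominator), so the slope is negative; a larger magnitude means more energy saved per unit of added time, i.e., a near-vertical segment on the energy-time curve.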

10 of 17 Solution Search (2)
Phase prioritization: sort the phases in order of OPM (low to high).
Question: why is sorting necessary?
"Novel" heuristic search: find the locally optimal gear for each phase, one phase at a time (sketched below).
Running time is at most n × g (phases × gears).
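A sketch of the heuristic, consistent with the n × g bound; run_and_measure is an assumed helper that executes the program under a per-phase gear assignment and returns the tradeoff score (lower is better):

    extern double run_and_measure(const int *gears, int n_phases);

    /* Visit phases in sorted (low-to-high OPM) order; for each phase try
     * every gear while the other phases keep their current setting, and
     * keep the locally best one. Exactly n_phases * n_gears measured runs. */
    void search_gears(int *gears, int n_phases, int n_gears) {
        for (int p = 0; p < n_phases; p++) {
            int best_gear = 0;
            gears[p] = 0;
            double best = run_and_measure(gears, n_phases);
            for (int g = 1; g < n_gears; g++) {
                gears[p] = g;
                double score = run_and_measure(gears, n_phases);
                if (score < best) { best = score; best_gear = g; }
            }
            gears[p] = best_gear;   /* fix this phase's gear before moving on */
        }
    }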

11 of 17 Solution Search (3)

12 of 17 Experiments
10 AMD Athlon-64 CPUs:
- Frequency-scalable (MHz)
- Voltage-scalable (V)
- 1 GB main memory
- 128 KB L1 cache, 512 KB L2 cache
- 100 Mb/s network
The CPU consumes 45-55% of overall system energy.
Benchmarks: NAS Parallel Benchmarks (NPB)

13 of 17 Results: Multiple-Gear Benefit
IS: 16% energy saving with 1% extra time
BT: 10% energy saving with 5% extra time
MG: 11% energy saving with 4% extra time

14 of 17 Results: Single-Gear Benefit
The order of phases matters!
CG: 8% energy saving with 3% extra time
SP: 15% energy saving with 7% extra time

15 of 17 Results: No Benefit

16 of 17 Conclusions and Future Work
A profile-directed method achieves a good energy-time tradeoff for HPC applications.
Future work:
- Enhance profile-directed techniques
- Consider inter-node bottlenecks
- Automate the entire process

17 of 17 Discussion
How important is power consumption to HPC? Is 10% energy worth 5% time?
Is the profile-directed method practical? It is effective for applications that run repeatedly, but how much of the process can be automated?
Is OPM (operations per miss) a good metric for finding phases? Its key purpose is to identify CPU utilization; other options include instructions per second and CPU usage.
Is OPM a good metric for sorting phases?