Studying the Impact of Bit Switching on CPU Energy Ghassan Shobaki, California State Univ., Sacramento Najm Eldeen Abu Rmaileh, Princess Sumaya Univ. for Technology, Jordan Jafar Jamal, IDSIA Research Institute, Switzerland SCOPES 2016 Wednesday, May 25, 2016

Acknowledgment
This research was partially supported by a Google Faculty Research Award granted in August 2013.

Outline
- Background and Algorithms
- Experimental Setup
- Experimental Results
- Conclusions and Future Work

Background
- Many compiler optimizations for power/energy reduction have been proposed
- There are only a limited number of experimental studies using real hardware measurements
- Most research results are based on simulation, energy models, and theoretical calculations
- Most production compilers, such as GCC and LLVM, don't offer energy-specific or energy-aware optimizations

Relevant Optimizations
- All performance optimizations: reducing execution time reduces energy consumption
- Energy-specific optimizations, such as switching-energy minimization and voltage-frequency scaling
- Optimizations that balance multiple conflicting objectives
  - Is the best balance for performance the same as the best balance for energy?
  - Examples: pre-allocation scheduling (balances ILP and register pressure), loop unrolling and function inlining (balance dynamic and static instruction count)

Switching Energy
- It has been proposed that a compiler may reorder instructions to minimize switching energy
- Example:
  - Instr1 encoding: 1010
  - Instr2's encoding differs in 3 bits (Hamming distance = 3)
  - Fetching Instr2 after Instr1 will therefore require switching three bits on the bus
- Switching-energy minimization problem: given an instruction stream, find the order that minimizes the total switching energy
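As a rough illustration (not code from the paper), the switching cost of a straight-line stream is the sum of Hamming distances between consecutive encodings. The slide does not give Instr2's encoding, so 0b0111 below is a hypothetical 4-bit value at Hamming distance 3 from 1010:

```python
def hamming(a: int, b: int) -> int:
    # Number of bits that differ between two encodings.
    return bin(a ^ b).count("1")

def total_switching(stream) -> int:
    # Total bus-bit switches when the stream is fetched in order.
    return sum(hamming(a, b) for a, b in zip(stream, stream[1:]))

instr1 = 0b1010   # encoding from the slide
instr2 = 0b0111   # hypothetical: one 4-bit value at Hamming distance 3 from 1010

print(hamming(instr1, instr2))                     # 3 bits switch on the bus
print(total_switching([instr1, instr2, instr1]))   # 3 + 3 = 6
```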

Instruction Scheduling
- Compilers do instruction scheduling before and after register allocation (pre-allocation and post-allocation scheduling)
- In pre-allocation scheduling, a compiler needs to balance register pressure and ILP
- In post-allocation scheduling, a compiler schedules spill code and does fine tuning
- Scheduling for minimum switching energy must be done post-allocation, because
  - it needs to know all the instructions, including spill code
  - it needs to know the complete encoding, including operands
- In theory, if the hardware does good out-of-order execution, the post-allocation scheduler may focus on switching energy

Switching Energy Algorithms
- Multiple algorithms have been proposed for switching-energy minimization
- The earliest algorithm is Cold Scheduling (Su et al. 1994), which is equivalent to the Nearest Neighbor (NN) heuristic
- Simulation-based results report energy reductions of up to 30% (Parikh et al. 2003)
- There are no experimental results using real hardware measurements
- In this work, we evaluate the performance of previously proposed algorithms, including our exact algorithm (Shobaki and Jamal, 2015)

Critical Path (CP) Algorithm
Example (a DAG over instructions A–G with a switching cost matrix):
- Cycle 1: A
- Cycle 2: B (switching energy 2.0)
- Cycle 3: C (2.0)
- Cycle 4: D (0.5)
- Switching energy in the first 4 cycles = 4.5

Nearest Neighbor (NN) Algorithm (Cold Scheduling, Su et al. 1994)
Same example DAG and switching cost matrix:
- Cycle 1: A
- Cycle 2: C (switching energy 0.5)
- Cycle 3: D (0.5)
- Cycle 4: B (0.5)
- Switching energy in the first 4 cycles = 1.5
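A minimal sketch of the NN (Cold Scheduling) heuristic: at each step, pick the ready instruction whose switching cost from the last scheduled instruction is smallest. The transcript does not preserve the slide's cost matrix or DAG, so the 4-instruction data below is hypothetical, chosen so the greedy order and its total cost of 1.5 match this slide:

```python
def nn_schedule(cost, preds, start):
    """Nearest-Neighbor order: repeatedly append the ready instruction
    with the smallest switching cost from the last scheduled one."""
    order = [start]
    remaining = set(range(len(cost))) - {start}
    while remaining:
        # An instruction is ready once all of its predecessors are scheduled.
        ready = [i for i in remaining if preds[i] <= set(order)]
        nxt = min(ready, key=lambda i: cost[order[-1]][i])
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Hypothetical data: instructions A, B, C, D (0-3); B, C, D depend on A.
A, B, C, D = range(4)
cost = [
    [0.0, 2.0, 0.5, 2.0],   # from A
    [2.0, 0.0, 2.0, 0.5],   # from B
    [0.5, 2.0, 0.0, 0.5],   # from C
    [2.0, 0.5, 0.5, 0.0],   # from D
]
preds = [set(), {A}, {A}, {A}]

sched = nn_schedule(cost, preds, A)
print(sched)   # [0, 2, 3, 1], i.e., A, C, D, B with total switching 1.5
```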

Combinatorial Algorithm (Shobaki and Jamal 2015)
- Formulates the problem as a Precedence-Constrained Traveling Salesman Problem (PCTSP), aka the Sequential Ordering Problem (SOP)
- Searches for an exact solution using a branch-and-bound approach
- With a time limit of 10 ms per instruction, it optimally solves 99.8% of the basic blocks in MiBench (over 30 thousand blocks)
- It optimally schedules blocks with hundreds of instructions within a few seconds
- On average, its switching cost is 16% less than CP's and 5% less than NN's
- The B&B algorithm and the compiler-derived SOP instances are of interest to the operations research community
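The SOP formulation can be sketched as a tiny branch-and-bound. This toy version is not the paper's algorithm: it uses only a crude lower bound (the cheapest edge times the number of unscheduled instructions), whereas the real solver relies on much stronger pruning. The 4-instruction example at the bottom is hypothetical:

```python
import math

def optimal_schedule(cost, preds):
    """Toy branch-and-bound for the minimum-switching order under
    precedence constraints (the Sequential Ordering Problem)."""
    n = len(cost)
    best_cost, best_order = math.inf, None
    cheapest = min(cost[i][j] for i in range(n) for j in range(n) if i != j)

    def search(order, remaining, total):
        nonlocal best_cost, best_order
        # Prune: even if every remaining transition were the cheapest one,
        # this partial schedule could not beat the best complete one.
        if total + cheapest * len(remaining) >= best_cost:
            return
        if not remaining:
            best_cost, best_order = total, order[:]
            return
        # Branch on ready successors, cheapest-first, so good bounds appear early.
        for i in sorted(remaining, key=lambda i: cost[order[-1]][i]):
            if preds[i] <= set(order):
                order.append(i)
                remaining.remove(i)
                search(order, remaining, total + cost[order[-2]][i])
                remaining.add(i)
                order.pop()

    for s in range(n):
        if not preds[s]:   # any instruction with no predecessors may start
            search([s], set(range(n)) - {s}, 0.0)
    return best_order, best_cost

# Hypothetical example: instructions 1-3 depend on instruction 0.
order, c = optimal_schedule(
    [[0.0, 2.0, 0.5, 2.0],
     [2.0, 0.0, 2.0, 0.5],
     [0.5, 2.0, 0.0, 0.5],
     [2.0, 0.5, 0.5, 0.0]],
    [set(), {0}, {0}, {0}],
)
print(order, c)   # [0, 2, 3, 1] with total switching cost 1.5
```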

Experimental Setup
- OMAP5432 EVM board with a dual ARM® Cortex™-A15 MPCore™ processor
- The OMAP5432 board has shunt resistors and connections that allow measuring CPU energy and memory energy; we only measured CPU energy
- The processor is out-of-order, but it does not reorder instructions until the execution stage, so instructions are fetched in the order determined by the compiler
- Energy measurements were performed using an ARM Energy Probe

Compiler and Benchmarks
- Algorithms implemented as post-allocation schedulers in LLVM 3.3:
  - CP_NN
  - NN_CP
  - Combinatorial
- The base algorithm is LLVM's default post-allocation scheduler
- LLVM does local scheduling (within the basic block)
- 12 benchmarks selected from MiBench and SPEC CPU2006
- Cross-compiled on an Intel machine for the ARM target

Extreme Switching Experiment
- Explore the limits of switching energy
- Instruction order that gives maximum switching: BIC, ANDS, BIC, ANDS, BIC, ANDS, ...
- Instruction order that gives minimum switching: BIC, BIC, BIC, ..., BIC, ANDS, ANDS, ANDS, ..., ANDS
- A similar experiment was done by Zhurikhin et al. (2009)
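The gap between the two orders can be sketched numerically. The 8-bit values below are hypothetical stand-ins for two encodings that differ in many bit positions (the actual BIC and ANDS encodings are 32-bit ARM instructions):

```python
def switches(stream) -> int:
    # Total bus-bit transitions when fetching the stream in order.
    return sum(bin(a ^ b).count("1") for a, b in zip(stream, stream[1:]))

# Hypothetical 8-bit stand-ins, Hamming distance 8 apart.
BIC, ANDS = 0b11110000, 0b00001111

n = 8
alternating = [BIC, ANDS] * n            # BIC, ANDS, BIC, ANDS, ...
grouped     = [BIC] * n + [ANDS] * n     # BIC, ..., BIC, ANDS, ..., ANDS

print(switches(alternating))   # 120: each of the 15 transitions flips 8 bits
print(switches(grouped))       # 8: only the single BIC-to-ANDS boundary flips bits
```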

Extreme Switching Results
Columns: Block Size; Max Switching (Time (s), CPU Energy (J)); Min Switching (Time (s), CPU Energy (J)); %Diff (Time, CPU Energy)
%Diff in CPU energy across the tested block sizes: -0.30%, 1.48%, 6.02%, 7.68%, 7.58%, 7.64%, 4.90%, 3.89%, 4.02%, 4.27%

Computed Switching Cost Reductions

Benchmark   CP_NN    NN_CP    Combinatorial
Susan_s      3.99%    7.90%   15.18%
Susan_e      4.01%    8.10%   15.32%
Jpeg_c       2.34%    5.29%   11.61%
Lbm         11.94%   19.03%   26.92%
Bzip2        3.26%    5.52%   11.19%
Hmmer        5.06%    7.90%   15.92%
Mcf          2.76%    7.03%   12.08%
Bwaves       5.81%   12.86%   22.95%
Gobmk        6.30%    9.51%   15.90%
Astar        2.81%    6.04%   12.45%
Sjeng        5.62%    9.01%   15.32%
Leslie       4.98%   10.66%   21.54%
AVG          4.91%    9.07%   16.37%

Time and Energy Variation
Columns: Benchmark; Time (s); Energy (J); Time Var.; Energy Var.
Energy variation across runs: Susan_s 0.62%, Susan_e 0.42%, Jpeg_c 0.22%, Lbm 0.63%, Bzip2 0.46%, Hmmer 4.21%, Mcf 0.59%, Bwaves 1.31%, Gobmk 1.00%, Astar 0.53%, Sjeng 0.52%, Leslie 0.74%

Algorithm Comparison

Benchmark   CP_NN Time  CP_NN Energy  NN_CP Time  NN_CP Energy  Comb. Time  Comb. Energy
Jpeg_c        -0.16%      -0.44%        0.02%       0.29%        -0.04%       0.61%
Lbm            0.24%      -1.10%       -0.81%      -0.61%        -0.77%      -0.43%
Bwaves        -0.38%      -0.03%       -3.67%      -0.37%        -4.06%      -0.95%
Gobmk         -0.42%      -2.56%       -0.20%       0.11%        -0.19%       0.12%
Astar          0.11%      -2.28%       -0.45%      -0.17%        -0.22%      -0.13%
Sjeng         -0.94%      -2.54%       -1.09%      -1.33%        -0.63%       0.00%
Leslie        -1.78%      -3.32%       -2.59%      -1.82%        -2.10%      -1.40%
Average       -0.59% and -0.38% (remaining cells not recoverable)

Observations
- The impact of post-allocation scheduling on time and energy is limited
- On average, all algorithms degrade performance, probably because LLVM does a better job at handling hardware restrictions
  - This leads to increased energy consumption
  - The reduction in switching energy appears to partially compensate for that
- On average, the energy-first algorithms (NN_CP and Combinatorial) reduced energy by 1% relative to CP_NN, although they caused slightly more performance degradation
  - This 1% is believed to be real
  - And it is free!

Conclusions
- The statement that "compiling for performance is equivalent to compiling for energy" is not strictly true
- Switching energy is measurable
- The impact of switching on CPU energy is not as high as that of execution time
- A scheduling algorithm that minimizes switching must avoid increasing execution time; this is easier on out-of-order processors
- Energy savings from compiler optimizations are interesting because they are essentially free

Future Work
- Develop more effective algorithms for balancing energy and performance
- Conduct similar experiments on a wider range of processors, including in-order processors; switching energy could be more significant on other processors
- Study the energy impact of other compiler optimizations, such as pre-allocation scheduling, loop unrolling, and function inlining

Questions?