Studying the Impact of Bit Switching on CPU Energy
Ghassan Shobaki, California State Univ., Sacramento
Najm Eldeen Abu Rmaileh, Princess Sumaya Univ. for Technology, Jordan
Jafar Jamal, IDSIA Research Institute, Switzerland
SCOPES 2016, Wednesday, May 25, 2016
Acknowledgment This research was partially supported by a Google Faculty Research Award granted in August 2013
Outline Background and Algorithms Experimental Setup Experimental Results Conclusions and Future Work
Background
- Many compiler optimizations for power/energy reduction have been proposed
- Only a limited number of experimental studies use real hardware measurements
- Most research results are based on simulation, energy models, or theoretical calculations
- Most production compilers, such as GCC and LLVM, do not offer energy-specific or energy-aware optimizations
Relevant Optimizations
- All performance optimizations: reducing execution time reduces energy consumption
- Energy-specific optimizations, such as switching-energy minimization and voltage-frequency scaling
- Optimizations that balance multiple conflicting objectives: is the best balance for performance the same as the best balance for energy?
- Examples: pre-allocation scheduling (balances ILP and register pressure); loop unrolling and function inlining (balance dynamic and static instruction count)
Switching Energy
- It has been proposed that a compiler may reorder instructions to minimize switching energy
- Example: Instr1 encoding: 1010; Instr2's encoding differs in three bit positions (Hamming distance = 3)
- Fetching Instr2 after Instr1 requires switching three bits on the bus
- Switching energy minimization problem: given an instruction stream, find the order that minimizes the total switching energy
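The switching cost between two consecutive instructions is just the Hamming distance of their binary encodings. A minimal sketch; the slide gives only Instr1's encoding (1010) and the distance, so the 4-bit encoding used for Instr2 below is an assumption chosen to differ in three bits:

```python
def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two instruction encodings."""
    return bin(a ^ b).count("1")

# Instr1 = 1010 as on the slide; 0001 is an assumed encoding for Instr2
# that differs from it in three bit positions, so fetching it after
# Instr1 switches three bus lines.
print(hamming(0b1010, 0b0001))  # 3
```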
Instruction Scheduling
- Compilers do instruction scheduling before and after register allocation (pre-allocation and post-allocation scheduling)
- In pre-allocation, a compiler needs to balance register pressure and ILP
- In post-allocation, a compiler schedules spill code and does fine tuning
- Scheduling for minimum switching energy must be done post-allocation, because:
  - it needs to know all the instructions, including spill code
  - it needs to know the complete encoding, including operands
- In theory, if the hardware does good out-of-order execution, the post-allocation scheduler may focus on switching energy
Switching Energy Algorithms
- Multiple algorithms have been proposed for switching-energy minimization
- The earliest is Cold Scheduling (Su et al. 1994), equivalent to the Nearest Neighbor (NN) heuristic
- Simulation-based results report energy reductions of up to 30% (Parikh et al. 2003)
- No experimental results using real hardware measurements
- In this work, we evaluate the performance of previously proposed algorithms, including our exact algorithm (Shobaki and Jamal, 2015)
Critical Path (CP) Algorithm
[Figure: dependence DAG over instructions A-G and a switching cost matrix over A, B, C, D, E, F]
Cycle  Instr  Sw. Energy
1      A      -
2      B      2.0
3      C      2.0
4      D      0.5
Switching energy in first 4 cycles = 4.5
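The CP algorithm is ordinary list scheduling: among the ready instructions, pick the one with the highest critical-path priority, ignoring switching costs entirely. A sketch on a 4-instruction fragment of the slide's example; the 2.0/0.5 costs come from the slide's trace, while the 9.0 cost entries, the dependence edges, and the priorities are assumptions:

```python
def cp_schedule(costs, preds, prio, start):
    """List scheduling by critical-path priority (switching cost ignored)."""
    n = len(costs)
    order, done, energy = [start], {start}, 0.0
    while len(done) < n:
        # ready = all predecessors already scheduled
        ready = [i for i in range(n) if i not in done and preds[i] <= done]
        nxt = max(ready, key=lambda i: prio[i])  # highest CP priority wins
        energy += costs[order[-1]][nxt]
        order.append(nxt)
        done.add(nxt)
    return order, energy

# Instructions A, B, C, D as indices 0..3. The 9.0 entries, the DAG,
# and the priorities are assumed for illustration.
costs = [[0.0, 2.0, 0.5, 9.0],
         [9.0, 0.0, 2.0, 9.0],
         [9.0, 2.0, 0.0, 0.5],
         [9.0, 0.5, 9.0, 0.0]]
preds = [set(), {0}, {0}, {2}]
prio = [3, 2, 1, 0]  # assumed critical-path priorities
print(cp_schedule(costs, preds, prio, 0))  # order A, B, C, D; energy 4.5
```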
Nearest Neighbor (NN) Algorithm (Cold Scheduling, Su et al. 1994)
[Figure: same dependence DAG over instructions A-G and switching cost matrix as in the CP example]
Cycle  Instr  Sw. Energy
1      A      -
2      C      0.5
3      D      0.5
4      B      0.5
Switching energy in first 4 cycles = 1.5
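The NN heuristic replaces the priority comparison with a greedy switching-cost comparison: among the ready instructions, pick the one that is cheapest to fetch after the last scheduled instruction. A sketch on the same 4-instruction toy example; as before, the 9.0 cost entries and the dependence DAG are assumptions:

```python
def nn_schedule(costs, preds, start):
    """Cold Scheduling / Nearest Neighbor: greedily minimize the
    switching cost from the last scheduled instruction."""
    n = len(costs)
    order, done, energy = [start], {start}, 0.0
    while len(done) < n:
        # ready = all predecessors already scheduled
        ready = [i for i in range(n) if i not in done and preds[i] <= done]
        nxt = min(ready, key=lambda i: costs[order[-1]][i])  # nearest neighbor
        energy += costs[order[-1]][nxt]
        order.append(nxt)
        done.add(nxt)
    return order, energy

# Instructions A, B, C, D as indices 0..3; the 2.0/0.5 costs are from
# the slide's trace, the 9.0 entries and the DAG are assumed.
costs = [[0.0, 2.0, 0.5, 9.0],
         [9.0, 0.0, 2.0, 9.0],
         [9.0, 2.0, 0.0, 0.5],
         [9.0, 0.5, 9.0, 0.0]]
preds = [set(), {0}, {0}, {2}]
print(nn_schedule(costs, preds, 0))  # order A, C, D, B; energy 1.5
```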
Combinatorial Algorithm (Shobaki and Jamal 2015)
- Formulates the problem as a Precedence-Constrained Traveling Salesman Problem (PCTSP), a.k.a. the Sequential Ordering Problem (SOP)
- Searches for an exact solution using a branch-and-bound approach
- With a time limit of 10 ms per instruction, it optimally solves 99.8% of the basic blocks in MiBench (over 30 thousand blocks)
- It optimally schedules blocks with hundreds of instructions within a few seconds
- On average, switching cost is 16% less than CP and 5% less than NN
- The B&B algorithm and COMPILER instances are of interest to the operations research community
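The branch-and-bound idea can be sketched as follows. This is not the paper's algorithm: the lower bound used here (each unscheduled instruction must be entered via its cheapest incoming edge) is a basic admissible bound chosen for illustration, and the toy cost matrix and DAG are the same assumed example as above:

```python
import math

def bnb_schedule(costs, preds, start):
    """Exhaustive precedence-feasible search with lower-bound pruning."""
    n = len(costs)
    best_cost, best_order = math.inf, None

    def lower_bound(remaining):
        # Optimistic estimate: every remaining instruction is entered
        # via its cheapest possible incoming edge.
        return sum(min(costs[i][j] for i in range(n) if i != j)
                   for j in remaining)

    def branch(order, done, cost):
        nonlocal best_cost, best_order
        if len(order) == n:
            if cost < best_cost:
                best_cost, best_order = cost, order[:]
            return
        remaining = [j for j in range(n) if j not in done]
        if cost + lower_bound(remaining) >= best_cost:
            return  # prune: cannot beat the incumbent schedule
        last = order[-1]
        for j in remaining:
            if preds[j] <= done:  # precedence-feasible extension
                order.append(j); done.add(j)
                branch(order, done, cost + costs[last][j])
                order.pop(); done.remove(j)

    branch([start], {start}, 0.0)
    return best_order, best_cost

# Toy data (assumed, as in the heuristic examples): A, B, C, D = 0..3.
costs = [[0.0, 2.0, 0.5, 9.0],
         [9.0, 0.0, 2.0, 9.0],
         [9.0, 2.0, 0.0, 0.5],
         [9.0, 0.5, 9.0, 0.0]]
preds = [set(), {0}, {0}, {2}]
print(bnb_schedule(costs, preds, 0))  # optimal order A, C, D, B; cost 1.5
```

On this tiny instance the optimum coincides with NN's schedule; the point of the exact search is that on larger blocks it can beat both heuristics, at the cost of exponential worst-case time bounded by the per-instruction time limit.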
Experimental Setup
- OMAP5432 EVM board with a dual ARM® Cortex™-A15 MPCore™ processor
- The OMAP5432 board has shunt resistors and connections that allow measuring CPU energy and memory energy; we only measured CPU energy
- The processor is out-of-order, but it does not reorder instructions until the execution stage, so instructions are fetched in the order determined by the compiler
- Energy measurements were performed using an ARM Energy Probe
Compiler and Benchmarks
- Algorithms implemented as post-allocation schedulers in LLVM 3.3: CP_NN, NN_CP, Combinatorial
- Base algorithm is LLVM's default post-allocation scheduler
- LLVM does local scheduling (within the basic block)
- 12 benchmarks selected from MiBench and SPEC CPU2006
- Cross-compiled on an Intel machine for the ARM target
Extreme Switching Experiment
- Explore the limits of switching energy
- Instruction order that gives maximum switching: BIC, ANDS, BIC, ANDS, BIC, ANDS, ...
- Instruction order that gives minimum switching: BIC, BIC, BIC, ..., BIC, ANDS, ANDS, ANDS, ..., ANDS
- A similar experiment was done by Zhurikhin et al. (2009)
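The two extremes are easy to quantify. The 32-bit encodings below are placeholders (the real BIC/ANDS encodings depend on the chosen operands); what matters is the Hamming distance d between them: alternating the two instructions over a block of 2n instructions switches about (2n-1)*d bits, while grouping them switches d bits exactly once:

```python
def total_switching(seq):
    """Total bits switched over a fetch sequence of encodings."""
    return sum(bin(a ^ b).count("1") for a, b in zip(seq, seq[1:]))

BIC  = 0x03C11002  # placeholder encoding, not the real ARM encoding
ANDS = 0x02011002  # placeholder encoding; differs from BIC in 3 bits

n = 100
alternating = [BIC, ANDS] * n      # BIC, ANDS, BIC, ANDS, ...
grouped = [BIC] * n + [ANDS] * n   # BIC, ..., BIC, ANDS, ..., ANDS

print(total_switching(alternating), total_switching(grouped))  # 597 vs 3
```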
Extreme Switching Results
[Table: run time (s) and CPU energy (J) of the max-switching vs. min-switching instruction orders, per block size]
CPU energy %Diff per block size: -0.30%, 1.48%, 6.02%, 7.68%, 7.58%, 7.64%, 4.90%, 3.89%, 4.02%, 4.27%
Computed Switching Cost Reductions
Benchmark   CP_NN    NN_CP    Combinatorial
Susan_s     3.99%    7.90%    15.18%
Susan_e     4.01%    8.10%    15.32%
Jpeg_c      2.34%    5.29%    11.61%
Lbm         11.94%   19.03%   26.92%
Bzip2       3.26%    5.52%    11.19%
Hmmer       5.06%    7.90%    15.92%
Mcf         2.76%    7.03%    12.08%
Bwaves      5.81%    12.86%   22.95%
Gobmk       6.30%    9.51%    15.90%
Astar       2.81%    6.04%    12.45%
Sjeng       5.62%    9.01%    15.32%
Leslie      4.98%    10.66%   21.54%
AVG         4.91%    9.07%    16.37%
Time and Energy Variation
[Table: run time (s), CPU energy (J), and measurement variation per benchmark]
Benchmark   Energy Var.
Susan_s     0.62%
Susan_e     0.42%
Jpeg_c      0.22%
Lbm         0.63%
Bzip2       0.46%
Hmmer       4.21%
Mcf         0.59%
Bwaves      1.31%
Gobmk       1.00%
Astar       0.53%
Sjeng       0.52%
Leslie      0.74%
Algorithm Comparison
            CP_NN            NN_CP            Combinatorial
Benchmark   Time    Energy   Time    Energy   Time    Energy
Jpeg_c      -0.16%  -0.44%   0.02%   0.29%    -0.04%  0.61%
Lbm         0.24%   -1.10%   -0.81%  -0.61%   -0.77%  -0.43%
Bwaves      -0.38%  -0.03%   -3.67%  -0.37%   -4.06%  -0.95%
Gobmk       -0.42%  -2.56%   -0.20%  0.11%    -0.19%  0.12%
Astar       0.11%   -2.28%   -0.45%  -0.17%   -0.22%  -0.13%
Sjeng       -0.94%  -2.54%   -1.09%  -1.33%   -0.63%  0.00%
Leslie      -1.78%  -3.32%   -2.59%  -1.82%   -2.10%  -1.40%
Average     —       —        —       -0.59%   —       -0.38%
Observations
- The impact of post-allocation scheduling on time and energy is limited
- On average, all algorithms degrade performance, probably because LLVM does a better job at handling hardware restrictions; this leads to increased energy consumption
- The reduction in switching energy appears to partially compensate for that
- On average, the energy-first algorithms (NN_CP and Combinatorial) reduced energy by 1% relative to CP_NN, although they caused slightly more performance degradation
- This 1% is believed to be real, and it is free!
Conclusions
- The statement that compiling for performance is equivalent to compiling for energy is not strictly true
- Switching energy is measurable
- The impact of switching on CPU energy is not as high as that of execution time
- A scheduling algorithm that minimizes switching must avoid increasing execution time; this is easier on out-of-order processors
- Energy savings from compiler optimizations are interesting, because they are essentially free
Future Work
- Develop more effective algorithms for balancing energy and performance
- Conduct similar experiments on a wider range of processors, including in-order processors; switching energy could be more significant on other processors
- Study the energy impact of other compiler optimizations, such as pre-allocation scheduling, loop unrolling, and function inlining
Questions?