An Efficient Compiler Technique for Code Size Reduction using Reduced Bit-width ISAs S. Ashok Halambi, Aviral Shrivastava, Partha Biswas, Nikil Dutt, Alex.

Presentation transcript:

An Efficient Compiler Technique for Code Size Reduction using Reduced Bit-width ISAs S. Ashok Halambi, Aviral Shrivastava, Partha Biswas, Nikil Dutt, Alex Nicolau Center for Embedded Computer Systems University of California, Irvine, USA

2 Outline Introduction to rISA Challenges Problem definition Existing approach Our approach Architectural Model for rISA Compiling for rISA Summary Future directions

3 Introduction Code size is a critical design factor for many embedded applications. A “reduced bit-width Instruction Set Architecture” is a promising architectural feature for code size reduction: support for a “reduced bit-width Instruction Set” alongside the normal IS. Many contemporary processors use this feature: ARM7TDMI, MIPS, ST100, ARC-Tangent.

4 reduced Bit-width Instruction Set The “reduced Bit-width Instruction Set” along with the supporting hardware is termed “reduced Bit-width Instruction Set Architecture (rISA)”. rISA Features: Instructions from both the IS reside in the memory. rIS instructions are dynamically expanded to normal instructions before or during the decode stage. Only normal instructions are executed.

5 rISA Most frequently occurring instructions are compressed to make reduced Bit-width Instruction Set. Each rISA instruction maps to a unique normal instruction. Simple and fast lookup table based “translator” logic. Can be implemented without increasing cycle length or cycle penalty. Achieve good code size reduction, without much architectural modification. Best Case : 50 % code size reduction
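The lookup-table translator described on this slide can be sketched as follows. This is a minimal illustration, not the hardware from the talk: the 16-bit layout (7-bit opcode, three 3-bit register fields), the 32-bit layout, and the opcode table contents are all assumptions chosen for the example.

```python
# Sketch of a table-driven rISA-to-normal expander.
# All field layouts and opcode values below are hypothetical.

# Assumed mapping: each 7-bit rISA opcode maps to exactly one normal opcode.
RISA_TO_NORMAL_OPCODE = {
    0x01: 0x20,  # e.g. a reduced "add" -> normal "add" (made-up values)
    0x02: 0x21,  # e.g. a reduced "sub" -> normal "sub" (made-up values)
}


def expand(risa_word):
    """Expand one 16-bit rISA instruction into a 32-bit normal instruction."""
    opcode = (risa_word >> 9) & 0x7F   # assumed 7-bit opcode field
    rd = (risa_word >> 6) & 0x7        # assumed 3-bit destination register
    rs = (risa_word >> 3) & 0x7        # assumed 3-bit source register
    rt = risa_word & 0x7               # assumed 3-bit source register
    normal_opcode = RISA_TO_NORMAL_OPCODE[opcode]  # one table lookup
    # Assumed 32-bit layout: 8-bit opcode, then three 4-bit register fields.
    return (normal_opcode << 24) | (rd << 20) | (rs << 16) | (rt << 12)


# Reduced instruction with opcode 0x01 and registers 1, 2, 3.
word = (0x01 << 9) | (1 << 6) | (2 << 3) | 3
print(hex(expand(word)))
```

Because every rISA opcode maps to a single normal opcode, the expansion is one lookup plus a few shifts, which is why the slide argues it can fit before or inside the decode stage.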

6 Architectures supporting rISA ARM7TDMI: 32-bit normal IS and 16-bit rIS. Switching between normal and rISA instructions is done by the BX (Branch Exchange) instruction. –Basic block level granularity. Kwon et al. made each rISA instruction write to a partition of the register file. MIPS: 32-bit normal IS and 16-bit rIS. Switching between normal and rISA instructions is done implicitly by code alignment. –Routine not aligned to word boundary → rISA instructions. –Routine level granularity. ST100 from STMicro and the Tangent ARC core also support rISA.

7 Bit-width Restrictions Only a few instructions in rIS. Not all normal instructions can be converted to rISA instructions. 7-bit opcodes in a 3-address ARM Thumb instruction. Operands of rISA instructions can access only a part of register file. Code in terms of rISA instructions has high register pressure causing extra move/load/store instructions. 3-address instructions in ARM Thumb have accessibility to only 8 registers (out of 16).
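The two restrictions above (a small opcode space and a reduced register window) amount to a simple encodability check. The sketch below assumes the 16-bit, 3-address layout mentioned on the slide (7-bit opcode, 3-bit register fields); the function name and the way opcodes are indexed are hypothetical.

```python
# Sketch: can a 3-address instruction be expressed in a 16-bit rISA format?
# Field widths follow the slide (7-bit opcode, 8 accessible registers);
# everything else is an assumption for illustration.

RISA_OPCODE_BITS = 7
RISA_REG_BITS = 3
assert RISA_OPCODE_BITS + 3 * RISA_REG_BITS == 16  # fields fill the 16-bit word


def can_encode(opcode_index, registers):
    """Return True if the opcode and all three registers fit the rISA fields."""
    if not 0 <= opcode_index < (1 << RISA_OPCODE_BITS):
        return False  # opcode has no slot in the reduced opcode space
    if len(registers) != 3:
        return False  # this sketch only covers the 3-address format
    return all(0 <= r < (1 << RISA_REG_BITS) for r in registers)


print(can_encode(5, [0, 1, 7]))   # registers 0..7 are reachable
print(can_encode(5, [0, 1, 8]))   # register 8 is outside the 3-bit field
```

An instruction that fails this check either stays in the normal IS or needs extra move instructions to bring its operands into the reachable registers, which is the source of the added register pressure.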

8 Challenges in code generation Register pressure increases in the block which contains rISA instructions, resulting in increased code size because of spilling, and in performance degradation. Estimating the code size increase due to spilling before register allocation is difficult. A heuristic to estimate spill code because of rISA might be useful. [Figure: 16-bit rISA instruction format. 7-bit opcode field: fewer opcodes. 3-bit operand fields: accessibility to only 8 registers.]

9 Problem Definition Compile for rISA to achieve – Maximum code size reduction. Least degradation in performance.

10 Existing Compilers for rISA Work on routine level or basic-block level granularity. Convert to reduced bit-width instructions only if all the instructions in the routine/basic-block have mappings to rISA instructions. Code generation for rISA is done as a post-assembly pass or a pre-instruction selection pass.

11 Our Approach The rISA architectural model contains a mode exchange instruction to change mode at an instruction level granularity. Code generation for rISA is done as a part of instruction selection, tightly coupled with the compiler flow. Use rISA instructions whenever profitable, even within a function. We term the process of code generation for rISA “rISAization”.

12 Advantage of Our Approach [Figure: Functions 1, 2 and 3 laid out as 32-bit and 16-bit code under both approaches. Existing approach: function level granularity. Our approach: instruction level granularity, giving higher code density.]

13 Architectural Model rISA instructions to normal instructions mapping. Explicit mode exchange instructions (mx and rISA_mx) allow instruction level granularity for conversion to rISA instructions. Useful rISA instructions: rISA_nop: to align the code to a word boundary. rISA_move: to access all the registers in the register file and minimize spills in rISA code. rISA_extend: to increase the length of the immediate in the successive instruction. The bit-width restrictions for the above three rISA instructions are relaxed because they have fewer operands.

14 Compiling for rISA Flow: Source File (C/C++) → gcc Front End → Generic Instruction Set (3-address code) → Instruction Selection - I → Augmented Instruction Set (with rISA Blocks) → Profitability Analysis → Instruction Selection - II → Target Instruction Set (Normal + rISA) → Register Allocation → Assembly

15 Compiling for rISA – An Example
Stage: Source File (C/C++) → gcc Front End → Generic Instruction Set (3-address code).
G_ADD GR1 GR2 4
G_MUL GR3 GR1 GR2
G_ADD GR4 GR3 1
G_SUB GR4 GR4 16
G_LI GR4 200
G_ADD GR5 GR6 GR7
G_MUL GR9 GR8 GR6
G_ADD GR10 GR5 GR9
G_SUB GR11 GR10 GR7

16 Compiling for rISA – An Example
Stage: Instruction Selection - I (Generic Instruction Set → Augmented Instruction Set with rISA Blocks).
1. Mark instructions that can be converted to rISA instructions. (The slide highlights the candidates for rISA instructions.)
G_ADD GR1 GR2 4
G_MUL GR3 GR1 GR2
G_ADD GR4 GR3 1
G_SUB GR4 GR4 16
G_LI GR4 200
G_ADD GR5 GR6 GR7
G_MUL GR9 GR8 GR6
G_ADD GR10 GR5 GR9
G_SUB GR11 GR10 GR7

17 Compiling for rISA – An Example
Stage: Profitability Analysis (on the Augmented Instruction Set with rISA Blocks).
2. Decide whether it is profitable to convert a rISA Block.
G_ADD GR1 GR2 4
G_MUL GR3 GR1 GR2
G_ADD GR4 GR3 1
G_SUB GR4 GR4 16
G_LI GR4 200
G_ADD GR5 GR6 GR7
G_MUL GR9 GR8 GR6
G_ADD GR10 GR5 GR9
G_SUB GR11 GR10 GR7

18 Compiling for rISA – An Example
Stage: Instruction Selection - II (Augmented Instruction Set → Target Instruction Set, Normal + rISA).
3. Replace marked instructions with rISA instructions.
T_ADD_R GR1 GR2 4
T_MUL_R GR3 GR1 GR2
T_ADD_R GR4 GR3 1
T_SUB_R GR4 GR4 16
T_MX_R
T_LI GR4 200
T_ADD GR5 GR6 GR7
T_MUL GR9 GR8 GR6
T_ADD GR10 GR5 GR9
T_SUB GR11 GR10 GR7
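Step 3 on this slide can be mimicked with a small rewriting pass. The sketch below is only an illustration of the renaming visible in the example (G_x becomes T_x_R inside the rISA block and T_x elsewhere, with T_MX_R closing the block); the tuple representation, the function name, and the T_MX entry instruction used when a block does not start the listing are assumptions.

```python
# Sketch of "Instruction Selection - II" on the slide's example.
# Instructions are modelled as tuples: (opcode, operand, operand, ...).

GENERIC = [
    ("G_ADD", "GR1", "GR2", "4"),
    ("G_MUL", "GR3", "GR1", "GR2"),
    ("G_ADD", "GR4", "GR3", "1"),
    ("G_SUB", "GR4", "GR4", "16"),
    ("G_LI", "GR4", "200"),
    ("G_ADD", "GR5", "GR6", "GR7"),
    ("G_MUL", "GR9", "GR8", "GR6"),
    ("G_ADD", "GR10", "GR5", "GR9"),
    ("G_SUB", "GR11", "GR10", "GR7"),
]


def select_target_instructions(instructions, risa_block):
    """Rewrite generic instructions; risa_block is a half-open (start, end) range."""
    start, end = risa_block
    out = []
    for i, (op, *operands) in enumerate(instructions):
        name = op.replace("G_", "T_", 1)
        if start <= i < end:
            if i == start and start > 0:
                out.append(("T_MX",))      # hypothetical name: enter rISA mode
            out.append((name + "_R", *operands))
            if i == end - 1:
                out.append(("T_MX_R",))    # leave rISA mode, as on the slide
        else:
            out.append((name, *operands))
    return out


for ins in select_target_instructions(GENERIC, (0, 4)):
    print(" ".join(ins))
```

With the block covering the first four instructions, the printed listing matches the one on the slide, including the single T_MX_R before T_LI.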

19 Compiling for rISA – An Example
Stage: Register Allocation (Target Instruction Set, Normal + rISA → Assembly).
4. Perform register allocation.
T_ADD_R TR1 TR2 4
T_MUL_R TR3 TR1 TR2
T_ADD_R TR4 TR3 1
T_SUB_R TR4 TR4 16
T_MX_R
T_LI TR4 200
T_ADD TR5 TR6 TR7
T_MUL TR9 TR8 TR6
T_ADD TR10 TR5 TR9
T_SUB TR11 TR10 TR7

20 Compilation for rISA
1. Mark instructions that can be converted to rISA instructions. Contiguous marked instructions form a “rISA Block”.
2. Decide whether it is profitable to convert a rISA Block.
3. Replace marked instructions with rISA instructions.
4. Perform register allocation.
Flow: Source File (C/C++) → gcc Front End → Generic Instruction Set (3-address code) → Instruction Selection - I → Generic Instruction Set (with rISA Blocks) → Profitability Analysis → Instruction Selection - II → Target Instruction Set (Normal + rISA) → Register Allocation → Assembly
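Step 1 groups contiguous marked instructions into rISA Blocks. A minimal sketch of that grouping is shown below; the candidate predicate is passed in as a parameter because the slides do not spell out the exact marking rule, and the predicate used in the usage lines is purely hypothetical.

```python
# Sketch: group contiguous rISA candidates into blocks.
# Blocks are returned as half-open (start, end) index ranges.


def form_risa_blocks(instructions, is_candidate):
    """Return the maximal runs of consecutive instructions accepted by is_candidate."""
    blocks = []
    start = None
    for i, ins in enumerate(instructions):
        if is_candidate(ins):
            if start is None:
                start = i            # a new run begins here
        elif start is not None:
            blocks.append((start, i))  # the run ended just before i
            start = None
    if start is not None:
        blocks.append((start, len(instructions)))  # run reaches the end
    return blocks


# Usage with opcode names only; the predicate is an assumption:
# here every instruction except G_LI is treated as convertible.
opcodes = ["G_ADD", "G_MUL", "G_ADD", "G_SUB", "G_LI",
           "G_ADD", "G_MUL", "G_ADD", "G_SUB"]
print(form_risa_blocks(opcodes, lambda op: op != "G_LI"))
```

Each returned range is then handed to the profitability analysis of step 2, which may accept or reject it independently of the others.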

21 Profitability Heuristic Decides whether or not to convert a rISA Block to rISA instructions. Ideal decrease in code size: rISA_block_size(normalMode) – rISA_block_size(rISAMode). Increase in code size: CS1, due to mode change instructions; CS2, due to NOPs; CS3, due to extra rISA load/store/move instructions.
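The comparison above can be sketched as a net-saving computation. The instruction sizes (4 bytes normal, 2 bytes rISA) follow the 32/16-bit split used in the talk; the way CS1 and CS2 are counted here (one normal mx plus one rISA_mx per block, and one rISA_nop when the rISA code would otherwise end off a word boundary) is an assumption, and CS3 is taken as an input because it comes from the register pressure heuristic.

```python
# Sketch of the profitability test for one rISA Block (sizes in bytes).

NORMAL_BYTES = 4  # 32-bit normal instruction
RISA_BYTES = 2    # 16-bit rISA instruction


def net_saving(n_instructions, extra_risa_instructions):
    """Ideal decrease minus (CS1 + CS2 + CS3); positive means conversion pays off."""
    ideal_decrease = n_instructions * (NORMAL_BYTES - RISA_BYTES)
    # CS1 (assumed): one normal mx to enter and one rISA_mx to leave rISA mode.
    cs1 = NORMAL_BYTES + RISA_BYTES
    # CS2 (assumed): pad with one rISA_nop if the rISA code, including the
    # rISA_mx and any extra instructions, has an odd number of 16-bit slots.
    risa_slots = n_instructions + 1 + extra_risa_instructions
    cs2 = RISA_BYTES if risa_slots % 2 else 0
    # CS3: extra rISA load/store/move instructions, estimated elsewhere.
    cs3 = extra_risa_instructions * RISA_BYTES
    return ideal_decrease - (cs1 + cs2 + cs3)


print(net_saving(8, 1))  # long block, one extra move
print(net_saving(2, 0))  # short block: overhead dominates
```

Under these assumptions short blocks lose to the mode-change overhead, which is consistent with converting only where it is profitable rather than wherever a mapping exists.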

22 Register Pressure Heuristic Estimate the extra spill/load/move instructions. CS3 = (spill/reload code needed if the block is converted to rISA instructions) – (spill/reload code needed if the block is converted to normal instructions). Spill code for a block is a function of: average register pressure, number of instructions, and average live length.

23 Spill Code Estimation Estimate the extra average register pressure: average register pressure – K1 * number of registers. Estimate the number of spills needed to reduce the register pressure by 1 for the block: number of instructions / average live length. Estimate the number of spills: average extra register pressure * number of spills needed to reduce the register pressure by 1.
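The three estimates on this slide compose into a single function. The sketch below follows the formulas as stated; the default value of K1 and the clamping of negative extra pressure to zero are assumptions, since the slides give neither.

```python
# Sketch of the spill estimate for one block.


def estimate_spills(avg_pressure, num_registers, n_instructions,
                    avg_live_length, k1=1.0):
    """Estimated number of spills needed for the block."""
    # Extra average register pressure; clamped at 0 (assumption) so that
    # blocks which fit in the registers contribute no spills.
    extra_pressure = max(0.0, avg_pressure - k1 * num_registers)
    # Spills needed to lower the register pressure by 1 across the block.
    spills_per_unit = n_instructions / avg_live_length
    return extra_pressure * spills_per_unit


# 20 instructions, values live for 5 instructions on average,
# average pressure of 10 against 8 available registers.
print(estimate_spills(10, 8, 20, 5))
```

Intuitively, a value that stays live for about 5 instructions covers a quarter of a 20-instruction block, so about 4 spills are needed to remove one unit of pressure everywhere.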

24 Register Pressure Heuristic
Spill code if converted to rISA = (1) + (2)
(1) Estimated spill code for rISA variables in the block, with number of available registers = rISA RF size.
(2) Estimated spill code for non-rISA variables in the block, with number of available registers = RF size – rISA RF size – average extra rISA register pressure.
Spill code if converted to normal IS = estimated spill code for all variables in the block, with number of available registers = RF size.
Reload code is estimated as: K2 * Spill code * average number of uses per variable definition.
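Putting the pieces of the register pressure heuristic together gives an estimate of CS3. The sketch below is one reading of the slide, not the authors' implementation: K1 and K2 default to 1.0, negative extra pressure is clamped to zero, and the pressure of "all variables" in normal mode is taken as the sum of the rISA and non-rISA pressures. All parameter names are hypothetical.

```python
# Sketch of CS3: extra spill/reload code caused by converting a block to rISA.


def estimate_spills(avg_pressure, available_registers, n_instructions,
                    avg_live_length, k1=1.0):
    extra = max(0.0, avg_pressure - k1 * available_registers)  # clamp: assumption
    return extra * (n_instructions / avg_live_length)


def estimate_cs3(risa_pressure, other_pressure, rf_size, risa_rf_size,
                 n_instructions, avg_live_length, avg_uses, k1=1.0, k2=1.0):
    """(spill + reload in rISA mode) minus (spill + reload in normal mode)."""
    extra_risa = max(0.0, risa_pressure - k1 * risa_rf_size)
    # (1) rISA variables compete for the rISA-visible part of the register file.
    spill_1 = estimate_spills(risa_pressure, risa_rf_size,
                              n_instructions, avg_live_length, k1)
    # (2) non-rISA variables get what is left of the register file.
    spill_2 = estimate_spills(other_pressure,
                              rf_size - risa_rf_size - extra_risa,
                              n_instructions, avg_live_length, k1)
    spill_risa_mode = spill_1 + spill_2
    # Normal mode: all variables share the whole register file
    # (total pressure taken as the sum, which is an assumption).
    spill_normal_mode = estimate_spills(risa_pressure + other_pressure, rf_size,
                                        n_instructions, avg_live_length, k1)

    def with_reloads(spill):
        return spill + k2 * spill * avg_uses  # reload = K2 * spill * uses

    return with_reloads(spill_risa_mode) - with_reloads(spill_normal_mode)


# 16 registers, 8 visible to rISA; rISA variables exert a pressure of 10.
print(estimate_cs3(10, 2, 16, 8, 20, 5, 2))
```

The result is a count of extra instructions; multiplying it by the rISA instruction size gives the CS3 term used in the profitability heuristic.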

25 Experimental Set-up Platform: MIPS 32/16 architecture. Benchmarks: Livermore loops. Baseline compiler: GCC for MIPS32 and MIPS16, optimized for code size; metric: percentage code size reduction in MIPS16 over MIPS32. Our compiler: retargetable EXPRESS compiler for MIPS 32/16; metrics: percentage code size reduction and percentage performance degradation.

26 Experiments EXPRESS achieves 38% average code size reduction, while GCC achieves 14%. Performance impact: average 6% (worst case: 24%).

27 Summary rISA is an architectural feature that can potentially achieve large code size reductions with minimal hardware alterations. We presented a compiler technique to achieve code size reduction using rISA: the ability to operate at instruction level granularity; integration of this technique in the compiler flow; and a heuristic to estimate the amount of spills/reloads/moves due to the restricted availability of registers to some instructions. On average, a 38% improvement in code size.

28 Future directions The profitability heuristic for code generation can be modified to account for the performance degradation due to rISA. Design space exploration for choosing the best rISA suitable for a given embedded application.