Speculative Software Management of Datapath-width for Energy Optimization G. Pokam, O. Rochecouste, A. Seznec, and F. Bodin IRISA, Campus de Beaulieu 35042.

Presentation transcript:

Speculative Software Management of Datapath-width for Energy Optimization
G. Pokam, O. Rochecouste, A. Seznec, and F. Bodin
IRISA, Campus de Beaulieu, 35042 Rennes Cedex, France

2 Context
Embedded applications often operate on 8-/16-bit data: > 50% of program instructions in some cases.
New opportunities for energy reduction … clock-gating at a finer granularity, i.e. at the operand level.

3 Exploiting narrow-width operands
Dynamic approach (Brooks et al., HPCA-99): 1. cycle-by-cycle operand gating 2. complex hardware mechanisms required
Compiler approach (Stephenson et al., PLDI 2000): 1. based on static data flow analysis 2. must be overly conservative to preserve program correctness

4 Our approach
Don't want to pay the cost of a hardware scheme to detect when to clock-gate.
Don't want to rely on static data flow analysis to discover bit-width ranges.
Combine the dynamic approach and the compiler approach:
– take advantage of the dynamic approach to expose dynamic narrow-width operands to the compiler (via profiling)
– use the compiler approach to switch from normal to narrow-width mode and vice versa (via a reconfiguration instruction)
The narrow-width execution mode is speculative: exception management allows recovery to the correct mode.

5 Bit-width distribution analysis
[Figure: cumulative distribution of narrow-width operand occurrence, Powerstone benchmarks; panels: one operand, two operands]
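The kind of measurement behind such a cumulative distribution can be sketched as follows. This is only an illustration, not the authors' profiling tool: the function names are hypothetical, and the sketch assumes operands are 32-bit two's-complement values whose width is the smallest of 8, 16 or 32 bits that holds them after sign extension (the slides do not say whether sign or zero extension is used).

```python
from collections import Counter


def operand_width(value):
    """Smallest of 8/16/32 bits that holds `value` as a signed integer.

    Assumption: the value is first wrapped to a signed 32-bit integer.
    """
    value &= 0xFFFFFFFF
    if value >= 0x80000000:
        value -= 1 << 32
    for width in (8, 16):
        if -(1 << (width - 1)) <= value < (1 << (width - 1)):
            return width
    return 32


def cumulative_distribution(values):
    """Fraction of operands that fit in at most 8, 16 and 32 bits."""
    values = list(values)
    if not values:
        raise ValueError("need at least one operand value")
    counts = Counter(operand_width(v) for v in values)
    running = 0
    result = {}
    for width in (8, 16, 32):
        running += counts[width]
        result[width] = running / len(values)
    return result
```

Feeding it the operand values observed in a trace gives, for each width, the share of operands that would fit in a data-path of that width.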

6 Bit-width distribution analysis Dynamic distribution of narrow-width operands at basic block level (adpcm)
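One way a per-basic-block profile like this could feed the choice of a speculative execution mode is sketched below. The slides do not describe the selection rule, so both the function name and the 90% coverage threshold are assumptions made purely for illustration.

```python
def choose_mode(width_counts, threshold=0.9):
    """Pick the narrowest mode (8, 16 or 32 bits) for a basic block.

    `width_counts` maps an operand width (8, 16, 32) to how many dynamic
    operands of that width the profile observed in the block.
    `threshold` is a hypothetical coverage requirement: the chosen mode
    must cover at least this fraction of the observed operands.
    """
    total = sum(width_counts.values())
    if total == 0:
        return 32  # no profile data: stay in normal mode
    covered = 0
    for width in (8, 16, 32):
        covered += width_counts.get(width, 0)
        if covered / total >= threshold:
            return width
    return 32
```

A lower threshold would select narrow modes more aggressively at the price of more mispredictions at run time.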

7 Outline Motivation Micro-architectural support Narrow-width regions formation Simulation platform Evaluation Conclusions

8 Register file model
We address a new dimension: –reduce register file activity by reducing register file width
We propose the byte-slice register file approach.
[Figure: register file logically split into 8-bit, 16-bit and 32-bit slices; labels: tag bits, slice enable signal, row decoder]
Prior work to reduce the energy consumption of the register file: –limited port connectivity –limited number of registers –low-power mode via the drowsy technique, which preserves the register cell contents (Flautner et al. ISCA)
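The effect of slice-enable signals on register file activity can be sketched with a small model. It assumes a 32-bit register divided into four byte slices, with one, two or four slices enabled in 8-, 16- or 32-bit mode; the function names are hypothetical and the model counts slice activations only, not energy.

```python
SLICE_BITS = 8
SLICES_PER_REGISTER = 4  # assumption: 32-bit registers, byte-wide slices


def enabled_slices(mode_bits):
    """Slice-enable signals (lowest byte first) for an access in a given mode."""
    if mode_bits not in (8, 16, 32):
        raise ValueError("mode must be 8, 16 or 32")
    active = mode_bits // SLICE_BITS
    return [i < active for i in range(SLICES_PER_REGISTER)]


def slice_activity(access_modes):
    """Slice activations relative to performing every access in 32-bit mode."""
    access_modes = list(access_modes)
    used = sum(sum(enabled_slices(mode)) for mode in access_modes)
    return used / (SLICES_PER_REGISTER * len(access_modes))
```

For instance, a mix of two 8-bit accesses, one 16-bit access and one 32-bit access activates half as many slices as four full-width accesses.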

9 Reconfigurable data-path
Data-path resizable to accommodate the bit-width execution mode (via clock-gating): –pipeline latches –ALU
Clock-gating at coarser granularity.
[Figure labels: slice-enable signal, write-back, bypass (each in 8/16/32 mode), ALU, LSU]

10 Exception management
Data-path width misprediction may occur due to a dynamic event.
Simple recovery scheme:
–the tag bits indicate the true data-width
–upon a misprediction: trigger an exception, then recover to the correct execution mode
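The recovery scheme can be illustrated with a toy software model. This is a sketch under assumptions, not the hardware mechanism: each operand carries a tag giving its true width, a read in a mode narrower than the tag raises an exception, and the handler switches to the width given by the tag and replays the read. All names are hypothetical.

```python
class WidthMisprediction(Exception):
    """Raised when an operand is wider than the current execution mode."""

    def __init__(self, true_width):
        super().__init__("operand needs %d bits" % true_width)
        self.true_width = true_width


def read_operand(value, tag, mode):
    """Return the operand, or signal a misprediction if its tag exceeds the mode."""
    if tag > mode:
        raise WidthMisprediction(tag)
    return value


def run_region(tagged_operands, mode):
    """Process (value, tag) pairs; return the final mode and the recovery count."""
    recoveries = 0
    for value, tag in tagged_operands:
        try:
            read_operand(value, tag, mode)
        except WidthMisprediction as exc:
            mode = exc.true_width           # recover to the correct mode
            recoveries += 1
            read_operand(value, tag, mode)  # replay the faulting access
    return mode, recoveries
```

Starting a region in 8-bit mode and meeting a 16-bit and then a 32-bit operand triggers two recoveries and ends in 32-bit mode.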

11 Address instructions
Special care must be taken with address instructions: –separate address calculation from memory access
Use of dedicated registers for address computation: –accumulator registers with additional ISA support (see paper for details)

12 Outline Motivation Micro-architectural support Narrow-width regions formation Simulation platform Evaluation Conclusions

13 A two-step process
[Figure labels: machine, input data sets, annotated .s file, address transformation, modified .s file; Step 1, Step 2]

14 Profiling
Bit-width characteristics of selected regions.
[Figure: weight of regions in program, 0% to 100%; labels: narrow-width operands, 8/16 bits, LD/ST with 32 bits, 32 bits other]

15 Address instructions transformation
Problem: transform memory instructions into equivalent accumulator-based instructions.
A graph partitioning formulation:
–G, the DDG of a BB
–there is an edge (n,m) iff there is a def-use relation between n and m
–select (n,m) such that n has a 32-bit width operand and m is a LD/ST instruction
–replace m with accumulator-based instructions
–minimize the cut size, i.e. the number of instructions needed to move data from the register file to the accumulators
[Figure: example with nodes add1, load, add2; transformed sequence: add -> Rx, mov Rx -> ACC, LDACC Ry]
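The selection step of this formulation can be sketched on a small DDG. The sketch only enumerates the candidate (n,m) edges and counts them; it does not solve the cut-size minimization, and counting one move per selected edge is an assumption. The graph encoding and function names are hypothetical.

```python
def select_address_edges(ddg, wide_nodes, mem_nodes):
    """Return the def-use edges (n, m) where n has a 32-bit operand and m is a LD/ST.

    `ddg` maps each instruction to the list of instructions that use its result.
    """
    return sorted(
        (n, m)
        for n, users in ddg.items()
        for m in users
        if n in wide_nodes and m in mem_nodes
    )


def move_count(edges):
    """Assumed cost: one register-to-accumulator move per selected edge."""
    return len(edges)
```

On the slide's example (add1 feeding a load that feeds add2), the only selected edge is the one from add1 to the load.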

16 Instructions reordering
Problem: –reorder instructions in a basic block such that operations with 32-bit operands are moved around 8/16-bit operations
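A simple greedy heuristic gives a feel for this reordering problem. It is not the algorithm from the paper: it is a dependence-respecting list ordering that keeps emitting ready instructions of the current width class (narrow or 32-bit) before switching, so that instructions of the same class end up grouped. All names are hypothetical, and dependences are assumed to be acyclic and local to the basic block.

```python
def reorder(instrs, deps, wide):
    """Group narrow and 32-bit instructions while respecting dependences.

    `instrs` lists the instructions in program order, `deps` maps an
    instruction to the set of instructions it depends on, and `wide` is
    the set of instructions with 32-bit operands.
    """
    remaining = list(instrs)
    done = set()
    order = []
    current_wide = None  # width class of the last emitted instruction
    while remaining:
        ready = [i for i in remaining if deps.get(i, set()) <= done]
        same = [i for i in ready if (i in wide) == current_wide]
        pick = (same or ready)[0]
        current_wide = pick in wide
        order.append(pick)
        done.add(pick)
        remaining.remove(pick)
    return order


def mode_switches(order, wide):
    """Number of narrow/32-bit transitions in an instruction order."""
    return sum((a in wide) != (b in wide) for a, b in zip(order, order[1:]))
```

Fewer transitions mean fewer reconfiguration instructions for the same basic block.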

17 Outline Motivation Micro-architectural support Narrow-width regions formation Evaluation Conclusions

18 Simulation platform
Lx processor platform: –in-order –4-issue width –64 32-bit GPRs –8 1-bit CBRs –6-stage pipeline –4 ALUs, 1 LSU –2 MULs
Tools: –CACTI: register file access energy –HotLeakage: leakage energy

19 Analytical energy model
Dynamic energy: CACTI is used to determine its parameters.
Static energy: HotLeakage is used to determine its parameters.

20 Summary of results IPC degradation with varying misprediction penalty and varying bit-width convergence

21 Summary of results Dynamic energy reduction

22 Summary of results Register file static energy savings

23 Outline Motivation Micro-architectural support Narrow-width regions formation Evaluation Conclusions

24 Conclusions
Contribution to power-aware compilation:
–speculative management of the processor data-path in software
–simple exception management scheme to repair a software misprediction
Evaluation results:
–17% data-path dynamic energy savings
–22% register file static energy savings
–performance impact varies with the implementation cost of the recovery scheme
Future work:
–evaluation with a larger granularity (e.g. trace), which can reduce the number of mispredictions and the number of reconfiguration instructions

Thanks ! Questions …