PATMOS 2003 Energy Efficient Register Renaming *supported in part by DARPA through the PAC-C program and NSF Gurhan Kucuk, Oguz Ergin, Dmitry Ponomarev,

Presentation transcript:

PATMOS 2003
Energy Efficient Register Renaming*
Gurhan Kucuk, Oguz Ergin, Dmitry Ponomarev, Kanad Ghose
Department of Computer Science, State University of New York, Binghamton, NY
13th International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS’03), September 11th, 2003
*Supported in part by DARPA through the PAC-C program and NSF

Outline
- Motivations
- The Register Alias Table (RAT) complexity
- Complexity-effective RAT designs
  - Exploiting the intra-group dependencies
  - Buffering recent address translations
- Results and discussions
- Conclusions

Motivation
- The RAT maintains the register address translations needed for handling true data dependencies
- High power dissipation: 14% of the overall power is attributed to the RAT in the global power analysis performed in [FG 01]
- High power density

The RAT Complexity (W-way CPU)
- W write ports to update the W RAT entries for the destinations of the co-dispatched instructions
- 2W read ports to translate the source register addresses
- W read ports for checkpointing the old mapping of each destination register
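The port counts above scale linearly with the dispatch width W. The short sketch below just tabulates them; the function name `rat_port_counts` is made up for illustration, and the 2W figure assumes two source operands per instruction, as the slide implies.

```python
def rat_port_counts(width: int) -> dict:
    """Tabulate RAT ports for a W-way machine, following the slide's breakdown."""
    return {
        "write": width,            # one per destination of a co-dispatched instruction
        "source_read": 2 * width,  # assumes two source registers per instruction
        "checkpoint_read": width,  # old mapping of each destination register
    }


if __name__ == "__main__":
    for w in (2, 4, 8):
        ports = rat_port_counts(w)
        print(w, ports, "total:", sum(ports.values()))
```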

Register Renaming Steps
Step 1. The following substeps are performed in parallel:
- RAT reads for the sources of each of the co-dispatched instructions are performed, assuming that no dependencies exist among the instructions.
- New physical registers are allocated for the destination registers of all of the co-dispatched instructions.
- Data dependencies among the instructions are noted, using a set of comparators. The address of each destination register in a group of instructions is compared against the sources of all following instructions in the group; a match signals a dependency.
Step 2. If a data dependency is detected between a pair of instructions, the source physical register read out from the RAT for the dependent instruction is replaced with the destination register address allocated to the instruction producing that source. This preserves the true dependencies.
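The two steps can be sketched as a small software model. This is only an illustration under stated assumptions: the hardware performs the substeps of Step 1 in parallel, while the code runs them sequentially; the instruction format `(dest, src1, src2)`, the list-based free list, and the name `rename_group` are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical instruction format: (dest, src1, src2) architectural register numbers.
Instr = Tuple[int, int, int]


def rename_group(rat: List[int], free_list: List[int], group: List[Instr]):
    """Rename one group of co-dispatched instructions; returns (dests, sources, old maps)."""
    # Step 1a: read the RAT for every source as if the group had no dependencies.
    src_maps = [[rat[s1], rat[s2]] for (_, s1, s2) in group]
    # Step 1b: allocate a new physical register for every destination.
    new_dests = [free_list.pop(0) for _ in group]
    # Step 1c + Step 2: compare each destination against the sources of the
    # following instructions; on a match, override the value read from the RAT.
    for j, (_, s1, s2) in enumerate(group):
        for k, src in enumerate((s1, s2)):
            for i in range(j):  # later producers overwrite earlier ones
                if group[i][0] == src:
                    src_maps[j][k] = new_dests[i]
    # Update the RAT in program order, saving the old mapping for checkpointing.
    old_maps = []
    for (dest, _, _), phys in zip(group, new_dests):
        old_maps.append(rat[dest])
        rat[dest] = phys
    return new_dests, src_maps, old_maps


if __name__ == "__main__":
    rat = list(range(8))               # R0..R7 initially map to P0..P7
    free = [8, 9, 10]
    # ADD R1,R2,R3 ; LOAD R4,R1,R3 ; SUB R5,R4,R2
    group = [(1, 2, 3), (4, 1, 3), (5, 4, 2)]
    print(rename_group(rat, free, group))
```

In the example group, LOAD picks up the new register allocated to ADD for R1, and SUB picks up the one allocated to LOAD for R4, instead of the stale RAT values.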

Conditional Sensing (CSense): Exploiting the Intra-Group Dependencies
CSense disables parts of the RAT read accesses if an intra-group data dependency is noted.

ADD  R1, R2, R3
LOAD R4, R1, R3
SUB  R5, R4, R2

[Figure, shown in three animation builds: comparators (==) applied to the instruction group above. Build labels: "Disable sense amp" with values 01 0; "Enable sense amp" with values 00 1.]
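As a rough illustration of the idea, the sketch below marks which source reads in a group still need the RAT sense amps and which can be disabled because an earlier co-dispatched instruction produces the value. It models only the decision, not the circuit; the `(dest, src1, src2)` tuple format and the name `csense_enables` are assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical instruction format: (dest, src1, src2) architectural register numbers.
Instr = Tuple[int, int, int]


def csense_enables(group: List[Instr]) -> List[List[bool]]:
    """For each source operand: True = sense amp enabled (RAT read needed),
    False = disabled (value comes from an earlier instruction in the group)."""
    enables = []
    for j, (_, s1, s2) in enumerate(group):
        earlier_dests = {group[i][0] for i in range(j)}
        enables.append([src not in earlier_dests for src in (s1, s2)])
    return enables


if __name__ == "__main__":
    # ADD R1,R2,R3 ; LOAD R4,R1,R3 ; SUB R5,R4,R2
    group = [(1, 2, 3), (4, 1, 3), (5, 4, 2)]
    flags = csense_enables(group)
    disabled = sum(not f for row in flags for f in row)
    print(flags, "disabled reads:", disabled)
```

For the slide's three-instruction group, the R1 source of LOAD and the R4 source of SUB are the two reads that can be disabled.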

Percentage of source operands that are produced by co-dispatched instructions in the same cycle
[Figure: chart, vertical axis in %]

Buffering Recent Address Translations
- SPEC 2000 simulations show that dependent instructions are usually very close to each other
- If the register needed as a source is defined by an earlier co-dispatched instruction, the CSense scheme becomes useful
- If the register needed as a source is defined by an instruction dispatched in a previous cycle, we use External Latches (ELs) to serve recent register mappings faster

Buffering Recent Address Translations
The RAT access for a source register now proceeds as follows:
- Start accessing the RAT and, at the same time, address the ELs to see if the desired entry is located in one of the ELs
- If a matching entry is found, discontinue the access from the RAT
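The access sequence can be sketched in software as follows. This is a behavioral model only: in hardware the EL lookup and the RAT access start together, while the code simply checks the ELs first. The class name `ExternalLatches`, the four-entry default, and the policy of evicting the oldest mapping are assumptions for illustration; the slides do not specify the replacement policy.

```python
from typing import List, Optional, Tuple


class ExternalLatches:
    """Small buffer of the most recent (architectural, physical) mappings."""

    def __init__(self, size: int = 4):
        self.size = size
        self.entries: List[Tuple[int, int]] = []  # oldest first

    def record(self, arch: int, phys: int) -> None:
        # Drop any stale mapping for the same register, then append the new one.
        self.entries = [(a, p) for (a, p) in self.entries if a != arch]
        self.entries.append((arch, phys))
        if len(self.entries) > self.size:
            self.entries.pop(0)  # assumed policy: evict the oldest mapping

    def lookup(self, arch: int) -> Optional[int]:
        for a, p in self.entries:
            if a == arch:
                return p
        return None


def translate(rat: List[int], els: ExternalLatches, arch: int) -> Tuple[int, bool]:
    """Return (physical register, el_hit). On a hit the RAT read is discontinued."""
    hit = els.lookup(arch)
    if hit is not None:
        return hit, True
    return rat[arch], False


if __name__ == "__main__":
    rat = list(range(8))
    els = ExternalLatches()
    rat[1] = 8
    els.record(1, 8)                # a previous cycle renamed R1 to P8
    print(translate(rat, els, 1))   # served from the ELs
    print(translate(rat, els, 2))   # falls through to the RAT
```

Every new mapping is written to both the RAT and the ELs, so an EL miss still returns the correct translation from the RAT.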

Renaming Logic with Four External Latches
[Figure: block diagram]

The Hit Ratio to External Latches
[Figure: chart, vertical axis in %]

Experimental Setup (AccuPower, DATE’02)
[Figure: tool-flow diagram. Labels: Compiled SPEC benchmarks; Datapath specs; Microarchitectural Simulator; Performance stats; Transition counts, context information; VLSI layout data; SPICE deck; SPICE; SPICE measures of energy per transition; Energy/Power Estimator; Power/energy stats]

Energy of the Baseline and Proposed RAT Designs
[Figure: chart, energy in pJ; energy savings labels: 30%, 27%, 19%, 15%]

Conclusions
- Two techniques to reduce RAT power have been proposed:
  - CSense
  - Buffering recent register mappings
- 30% energy savings
- No performance penalty
- Little additional complexity
- No increase in the processor’s cycle time