Exploiting Value Locality in Physical Register Files Saisanthosh Balakrishnan Guri Sohi University of Wisconsin-Madison 36 th Annual International Symposium.

Slides:

Advertisements

Similar presentations

Topics Left Superscalar machines IA64 / EPIC architecture

Advertisements

A Performance Comparison of DRAM Memory System Optimizations for SMT Processors Zhichun ZhuZhao Zhang ECE Department Univ. Illinois at ChicagoIowa State.

1 Reducing Datapath Energy Through the Isolation of Short-Lived Operands Dmitry Ponomarev, Gurhan Kucuk, Oguz Ergin, Kanad Ghose Department of Computer.

Dynamically Collapsing Dependencies for IPC and Frequency Gain Peter G. Sassone D. Scott Wills Georgia Tech Electrical and Computer Engineering { sassone,

1 A Hybrid Adaptive Feedback Based Prefetcher Santhosh Verma, David Koppelman and Lu Peng Louisiana State University.

Federation: Repurposing Scalar Cores for Out- of-Order Instruction Issue David Tarjan*, Michael Boyer, and Kevin Skadron* University of Virginia Department.

Reducing Leakage Power in Peripheral Circuits of L2 Caches Houman Homayoun and Alex Veidenbaum Dept. of Computer Science, UC Irvine {hhomayou,

1 Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ.

UPC Microarchitectural Techniques to Exploit Repetitive Computations and Values Carlos Molina Clemente LECTURA DE TESIS, (Barcelona,14 de Diciembre de.

Alpha Microarchitecture Onur/Aditya 11/6/2001.

Microprocessor Microarchitecture Dependency and OOO Execution Lynn Choi Dept. Of Computer and Electronics Engineering.

Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)

Register Renaming & Value Prediction. Overview ► Need for Post-RISC ► Register Renaming vs. Allocation Strategies ► How to compile for Post-RISC machines.

Limits on ILP. Achieving Parallelism Techniques – Scoreboarding / Tomasulo’s Algorithm – Pipelining – Speculation – Branch Prediction But how much more.

WCED: June 7, 2003 Matt Ramsay, Chris Feucht, & Mikko Lipasti University of Wisconsin-MadisonSlide 1 of 26 Exploring Efficient SMT Branch Predictor Design.

June 20 th 2004University of Utah1 Microarchitectural Techniques to Reduce Interconnect Power in Clustered Processors Karthik Ramani Naveen Muralimanohar.

UPC Reducing Misspeculation Penalty in Trace-Level Speculative Multithreaded Architectures Carlos Molina ψ, ф Jordi Tubella ф Antonio González λ,ф ISHPC-VI,

1 Lecture 7: Out-of-Order Processors Today: out-of-order pipeline, memory disambiguation, basic branch prediction (Sections 3.4, 3.5, 3.7)

1 Lecture 8: Branch Prediction, Dynamic ILP Topics: branch prediction, out-of-order processors (Sections )

UPC Dynamic Removal of Redundant Computations Carlos Molina, Antonio González and Jordi Tubella Universitat Politècnica de Catalunya - Barcelona

Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure Oguz Ergin*, Deniz Balkan, Kanad Ghose, Dmitry Ponomarev Department.

Trace Caches J. Nelson Amaral. Difficulties to Instruction Fetching Where to fetch the next instruction from? – Use branch prediction Sometimes there.

EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12.

Reducing the Complexity of the Register File in Dynamic Superscalar Processors Rajeev Balasubramonian, Sandhya Dwarkadas, and David H. Albonesi In Proceedings.

CS 7810 Lecture 11 Delaying Physical Register Allocation Through Virtual-Physical Registers T. Monreal, A. Gonzalez, M. Valero, J. Gonzalez, V. Vinals.

1 Lecture 8: Instruction Fetch, ILP Limits Today: advanced branch prediction, limits of ILP (Sections , )

1 Lecture 18: Pipelining Today’s topics:  Hazards and instruction scheduling  Branch prediction  Out-of-order execution Reminder:  Assignment 7 will.

Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt.

1 Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction (Sections )

The Memory Behavior of Data Structures Kartik K. Agaram, Stephen W. Keckler, Calvin Lin, Kathryn McKinley Department of Computer Sciences The University.

Adaptive Cache Compression for High-Performance Processors Alaa R. Alameldeen and David A.Wood Computer Sciences Department, University of Wisconsin- Madison.

ICS’02 1 Low-Complexity Reorder Buffer Architecture* *supported in part by DARPA through the PAC-C program and NSF Gurhan Kucuk, Dmitry Ponomarev, Kanad.

Dyer Rolan, Basilio B. Fraguela, and Ramon Doallo Proceedings of the International Symposium on Microarchitecture (MICRO’09) Dec /7/14.

UPC Trace-Level Speculative Multithreaded Architecture Carlos Molina Universitat Rovira i Virgili – Tarragona, Spain Antonio González.

1 Practical Selective Replay for Reduced-Tag Schedulers Dan Ernst and Todd Austin Advanced Computer Architecture Lab The University of Michigan June 8.

Transient Fault Detection via Simultaneous Multithreading Shubhendu S. Mukherjee VSSAD, Alpha Technology Compaq Computer Corporation.

1/36 by Martin Labrecque How to Fake 1000 Registers Oehmke, Binkert, Mudge, Reinhart to appear in Micro 2005.

On the Value Locality of Store Instructions Kevin M. Lepak Mikko H. Lipasti University of Wisconsin—Madison

|Processors designed for low power |Architectural state is correct at basic block granularity rather than instruction granularity 2.

MadCache: A PC-aware Cache Insertion Policy Andrew Nere, Mitch Hayenga, and Mikko Lipasti PHARM Research Group University of Wisconsin – Madison June 20,

Out-of-Order Commit Processors Adrián Cristal (UPC), Daniel Ortega (HP Labs), Josep Llosa (UPC) and Mateo Valero (UPC) HPCA-10, Madrid February th.

11 Online Computing and Predicting Architectural Vulnerability Factor of Microprocessor Structures Songjun Pan Yu Hu Xiaowei Li {pansongjun, huyu,

1/25 June 28 th, 2006 BranchTap: Improving Performance With Very Few Checkpoints Through Adaptive Speculation Control BranchTap Improving Performance With.

Reduction of Register File Power Consumption Approach: Value Lifetime Characteristics - Pradnyesh Gudadhe.

Addressing Instruction Fetch Bottlenecks by Using an Instruction Register File Stephen Hines, Gary Tyson, and David Whalley Computer Science Dept. Florida.

The life of an instruction in EV6 pipeline Constantinos Kourouyiannis.

Cache Miss-Aware Dynamic Stack Allocation Authors: S. Jang. et al. Conference: International Symposium on Circuits and Systems (ISCAS), 2007 Presenter:

1 Lecture: Out-of-order Processors Topics: branch predictor wrap-up, a basic out-of-order processor with issue queue, register renaming, and reorder buffer.

1 Lecture: Out-of-order Processors Topics: a basic out-of-order processor with issue queue, register renaming, and reorder buffer.

CS717 1 Hardware Fault Tolerance Through Simultaneous Multithreading (part 2) Jonathan Winter.

Cache Pipelining with Partial Operand Knowledge Erika Gunadi and Mikko H. Lipasti Department of Electrical and Computer Engineering University of Wisconsin—Madison.

Samira Khan University of Virginia Feb 9, 2016 COMPUTER ARCHITECTURE CS 6354 Precise Exception The content and concept of this course are adapted from.

Lecture: Out-of-order Processors

Dynamic Scheduling Why go out of style?

Amir Roth and Gurindar S. Sohi University of Wisconsin-Madison

Zhichun Zhu Zhao Zhang ECE Department ECE Department

Physical Register Inlining (PRI)

Lecture: Out-of-order Processors

Microprocessor Microarchitecture Dynamic Pipeline

Out-of-Order Commit Processors

DynaMOS: Dynamic Schedule Migration for Heterogeneous Cores

Milad Hashemi, Onur Mutlu, Yale N. Patt

Tolerating Long Latency Instructions

Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2

Hyesoon Kim Onur Mutlu Jared Stark* Yale N. Patt

Sampoorani, Sivakumar and Joshua

Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/30/2011

Avoiding Initialization Misses to the Heap

Conceptual execution on a processor which exploits ILP

Sizing Structures Fixed relations Empirical (simulation-based)

Presentation transcript:

Exploiting Value Locality in Physical Register Files Saisanthosh Balakrishnan Guri Sohi University of Wisconsin-Madison 36 th Annual International Symposium on Microarchitecture

Exploiting Value Locality in Physical Register Files2 Motivation Need more physical registers Increase in area, access time, power More in-flight instructions (ILP, SMT) Access locality: Hierarchical and register caches Communication locality: Banked and clustered design Late allocation: Virtual-physical registers Value locality: Physical register reuse Optimized Design Optimized Storage Reduce storage requirements: 1.Exploit register value locality 2.Simplify for common values

Exploiting Value Locality in Physical Register Files3 Outline The property: Value locality in register file Optimized storage schemes Results Conclusion

Exploiting Value Locality in Physical Register Files4 Value Locality Locality of the results produced by all dynamic instructions 1.Identify the most common results 2.Locality in the results produced (register writes) 3.Duplication in register file

Exploiting Value Locality in Physical Register Files5 Value Locality bzip2 crafty gap gcc gzip mcf ammp art e e most common values in some SPEC CPU2000 benchmarks 0 and 1 are most common results on all benchmarks

Exploiting Value Locality in Physical Register Files6 Value Locality Consider valid registers only Many registers could hold the same value Duplication in register file Locality in all results produced (register writes) Percentage of values written to registers already present in the physical register file (80, 128, 160 regs.)

Exploiting Value Locality in Physical Register Files7 Value Locality Duplication of values in physical register file # of duplicate registers = # of valid physical registers – # of unique values in the register file Value being held in n registers  (n-1) duplicate registers Number of duplicate registers depends on register lifetime 60% to 85% of duplication because of 0 and 1

Exploiting Value Locality in Physical Register Files8 Observations 1. Many physical registers hold same value 2. 0’s and 1’s are most common results Reuse physical registers Instructions with same result mapped to one physical register First proposed by [Jourdan et al. MICRO-31] Reduced storage requirements Optimized storage for common values Registerless Storage Extended Registers and Tags Simplified micro-architectural changes

Exploiting Value Locality in Physical Register Files9 Outline Value locality in register file Exploiting it: Optimized storage schemes Physical Register Reuse Registerless Storage Extended Registers and Tags Results Conclusion

Exploiting Value Locality in Physical Register Files10 Physical Register Reuse 1.Conventional renaming of destination register 2.On execution, detect if result present in register file 3.Free assigned physical register 4.Remap instruction’s destination register 5.Handle dependent instructions Many instructions with same result map to one physical register Register file with unique values  More free registers

Exploiting Value Locality in Physical Register Files11 Physical Register Reuse Fetch Decode Rename Queue Schedule Reg. Read Writeback Commit Value Cache Value cache – to detect duplicate results  Maps physical register tags to values  CAM structure, returns tag for a value  Actions to invalidate / add entries Execute

Exploiting Value Locality in Physical Register Files12 Physical Register Reuse Fetch Decode Rename Queue Schedule Writeback Commit Value Cache  Reference counter for every register  Increment when mapped  Decrement when freed  Return to free list when 0 Reference counting in Register File Reference counters Physical Register File ExecuteReg. Read

Exploiting Value Locality in Physical Register Files13 Physical Register Reuse Fetch Decode Queue Schedule Reg. Read Execute Writeback Commit Value Cache Reference counters Physical Register File  Update rename map. Remap instruction’s destination register Handling dependent instructions Rename

Exploiting Value Locality in Physical Register Files14 Physical Register Reuse Fetch Decode Rename Schedule Execute Writeback Commit Value Cache Reference counters Physical Register File Alias Table Queue  Re-broadcast new destination register tag  Re-broadcast. Lookup alias table on register read Handling dependent instructions Reg. Read

Exploiting Value Locality in Physical Register Files15 Physical Register Reuse Reduced register requirements Avoids register write of duplicate values Non-trivial micro-architectural changes Value Cache lookup, Alias Table indirection, Reference counts Recovering from exceptions Remapping of destination register requires re-broadcast

Exploiting Value Locality in Physical Register Files16 Outline Value locality in register file Optimized storage schemes Physical register reuse Registerless Storage (0’s & 1’s) Extended Registers and Tags (0’s & 1’s) Results Conclusion

Exploiting Value Locality in Physical Register Files17 Registerless Storage Exploit common value locality – state bits for 0 and 1 Register file without 0s and 1s  More free registers 1.Conventional renaming of destination register 2.On execution, detect if result is 0 or 1 3.Free assigned physical register 4.Remap instruction’s destination register to reserved tags 5.Communicate value directly to dependent instructions

Exploiting Value Locality in Physical Register Files18 Registerless Storage Simplified micro-architectural changes No Value Cache, Alias Table, Reference counts No registers for 0 and 1: Reduced register requirements Eliminates register reads and writes of 0 and 1 Optimize storage for common values. But, avoid remapping destination register tag Remapping of destination register requires re-broadcast

Exploiting Value Locality in Physical Register Files19 Extended Registers and Tags Associate physical register with 2-bit extension V: Valid and D: data = {0, 1} Most 0’s and 1’s in register extensions Physical registerVD Rename: Assign physical register with its extension (if available) Execute: If result is 0 or 1 Use extension, if available. Free physical register. Physical register can be assigned to some other instruction

Exploiting Value Locality in Physical Register Files20 Extended Register and Tags Extended tagging scheme eliminates remapping Increase tag ( N -bits) namespace by 1 -bit (MSB) Unmodified free list To assign a tag: 1.Get tag from free list 2.Get MSB from the corresponding extension’s valid bit MSB = 0  {register, extension} MSB = 1  {register} Tag management

Exploiting Value Locality in Physical Register Files21 Extended Registers and Tags Physical registerVD MSB = 1 Physical register1D MSB = 0 Result = {0, 1} Physical registerVD MSB = 1 Register read operation Physical register1D MSB = 0 Physical register0D MSB = 0 Register write operation Physical register0D MSB = 0 Result != {0, 1} Never used

Exploiting Value Locality in Physical Register Files22 Extended Registers and Tags Simplified micro-architectural changes Extended registers hold 0’s and 1’s Better design Some common values use physical registers

Exploiting Value Locality in Physical Register Files23 Outline Value locality Optimized storage schemes Results Performance – more in-flight instructions Benefit – smaller register file Reduced register traffic Conclusion

Exploiting Value Locality in Physical Register Files24 Performance Relative performance improvement with increased instruction window Speedup

Exploiting Value Locality in Physical Register Files25 Performance Relative performance improvement with increased instruction window Speedup

Exploiting Value Locality in Physical Register Files26 Performance Relative performance improvement with increased instruction window Speedup

Exploiting Value Locality in Physical Register Files27 Performance Relative performance improvement with increased instruction window Speedup

Exploiting Value Locality in Physical Register Files28 Register Traffic Register read reduction with Registerless storage Register write reduction with Registerless storage

Exploiting Value Locality in Physical Register Files29 Conclusions Value locality Observe duplication in physical register file Significant common value locality Optimization schemes Physical Register Reuse 0’s and 1’s: Registerless Storage and Extended Registers and Tags Benefits Reduced register requirements Reduced register traffic Power savings, better design

Questions

Exploiting Value Locality in Physical Register Files31 Comparison Value cache Identify {0, 1} Reuse register holding the same value. Free reg. Free register Write 0 or 1 to extension, if available. Free register Update rename map. Alias table or Re-broadcast Update rename map. Re-broadcast {0, 1} Recover ref. counts, alias table, value cache Reg. File with unique values Reg. File without 0, 10, 1 in extensions Exploit Detect Handle dep instns Handle exception Outcome Physical Reg. Reuse Reg.less storage ERT

Exploiting Value Locality in Physical Register Files32 Simulation Methodology Alpha ISA. SimpleScalar based. Nops removed at the front-end Register r31 not included in value locality measurements Detailed out-of-order simulator 4-wide machine, 14-stage pipeline, 1-cycle register file access Base case: 128 physical registers, 256 entry instruction window Instruction and data caches  64KB, 2-way set associative, 64-byte line size, 3-cycle access  L2 is 2MB unified with 128-byte line size, 6-cycle access Benchmarks SPEC CPU2000 suite (12 int and 14 fp benchmarks) Reference inputs run to completion

Exploiting Value Locality in Physical Register Files33 Performance Relative performance improvement with increased instruction window