1/36 by Martin Labrecque How to Fake 1000 Registers Oehmke, Binkert, Mudge, Reinhart to appear in Micro 2005.

2/36 Outline ● Motivation: – Observations on registers ● Idea – Virtual Context Architecture ● Evaluation in 2 types of applications

3/36 Some definitions ● Activation record: Data structure { ● variables belonging to one particular scope (e.g. a procedure body) ● links to other activation records }; Synonyms: "data frame", "stack frame" ● Context: – Activation record of a thread of execution A register is only meaningful to the current activation record

4/36 Key observation ● Virtual Memory: – From the ISA standpoint, each process has an 'infinite' amount of memory available – Memory is managed in caches, RAM and disk – Memory is context free ● This is not true for registers – Limited resource ● Need to virtualize registers

5/36 How registers are used Compiler Pipeline Source code: variables IR: virtual registers Binary: logical registers Data path: physical registers Register allocation Decode/Rename

6/36 Registers are useful ● Can't get rid of registers: – Efficient address encoding in instructions – Unambiguous data dependences – Efficient integration in the micro-architecture

7/36 Attach a memory address to the content of the register! Dawn of a New Idea

8/36 Virtualizing registers

9/36 Mapping registers to memory ● Registers are virtualized because they hold the content of a memory location ● 2 options – At register allocation, map compiler virtual registers to memory ● Memory to memory operations ● Doesn't make use of ISA registers – Map ISA registers to memory ● Key Idea of the Virtual Context Architecture

10/36 Programming the VCA ● Where are the registers mapped in memory? ● The Stack Pointer is the Reference – Allows memory to be 'allocated' dynamically – Efficient way of passing parameters to a function – Needs some architectural support to address with offsets from the stack pointer

11/36 Renaming ● To get the register memory address, combine: – the source/destination register index of the binary program – base pointer (stack pointer) ● ISA register index → register memory address → physical register
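The first step of the mapping on this slide (ISA register index to register memory address) can be sketched as a single address computation. This is only an illustration under stated assumptions: the 8-byte register width, the name `register_address`, and the choice to add the offset to the base pointer are hypothetical and not taken from the paper.

```python
REG_BYTES = 8  # assumption: 64-bit registers, one 8-byte memory slot per register


def register_address(base_pointer: int, reg_index: int) -> int:
    """Memory address backing logical register `reg_index` for the context at `base_pointer`."""
    return base_pointer + reg_index * REG_BYTES


# The same logical register in two contexts (two call frames or two threads)
# maps to two distinct memory addresses, hence to two distinct rename-table keys.
addr_a = register_address(0x7FFF0000, 5)
addr_b = register_address(0x7FFF0100, 5)
print(hex(addr_a), hex(addr_b))
```

Because the rename table is keyed by this address rather than by the bare register index, changing only the base pointer is enough to switch context.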

12/36 Register memory address → physical reg. ● The address = base pointer + offset ● Exploit the locality of the addresses to reduce the number of bits used in the conversion; low probability of a capacity miss

13/36 Register File is a Cache ● Hardware controlled cache ● An instruction requires its source operands and destination register to execute What happens on a “cache” miss? We need some hardware control!

14/36 Some additional HW ● Each register has 3 new attributes: 1) A reference count: ● Incremented when an instruction using it goes through rename ● Decremented when the instruction is committed ● A non-zero value means that the register cannot be reallocated to other logical registers ● Guarantees correct instruction execution

15/36 Some additional HW (ctnd) 2) A 'committed' bit ● Valid, non speculative value 3) A 'dirty' bit ● Value more up-to-date than memory Using those attributes, a state machine controls which registers are available. Branch recovery works by having a duplicate renaming table containing the committed architectural state
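The three per-register attributes can be pictured as a small record with a few transitions. The sketch below is an assumption-laden illustration: the class name `PhysReg`, its method names, and the exact moments at which the bits are set are hypothetical, and the real state machine in the paper may differ.

```python
from dataclasses import dataclass


@dataclass
class PhysReg:
    ref_count: int = 0       # in-flight instructions that renamed to this register
    committed: bool = False  # holds a valid, non-speculative value
    dirty: bool = False      # value is more up to date than the copy in memory

    def on_rename(self) -> None:
        # An instruction using this register went through rename.
        self.ref_count += 1

    def on_commit(self, wrote_value: bool) -> None:
        # The instruction committed; if it wrote the register, the value is
        # now architectural and newer than memory (assumed transition).
        self.ref_count -= 1
        if wrote_value:
            self.committed = True
            self.dirty = True

    def can_reallocate(self) -> bool:
        # A non-zero reference count pins the register.
        return self.ref_count == 0

    def needs_spill(self) -> bool:
        # A committed, dirty value must reach memory before the register is reused.
        return self.committed and self.dirty


r = PhysReg()
r.on_rename()
pinned_while_in_flight = not r.can_reallocate()
r.on_commit(wrote_value=True)
print(pinned_while_in_flight, r.can_reallocate(), r.needs_spill())
```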

16/36 Source operand to physical register conversion

17/36 Destination logical register to physical register conversion

18/36 Allocation of an entry for destination register ● Replacement policy in rename table

19/36 Pipeline modifications ● Changes in the renaming ● ATSQ: architectural state transfer queue – Adds to the queue upon fills and spills – Has priority on the instruction to execute – Addresses for fills and spills are pre-calculated – No memory disambiguation required – No data dependences
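To make the fill/spill traffic concrete, here is a toy model of a register file used as a cache, with transfers appended to a queue standing in for the ATSQ. Everything specific here is an assumption for illustration: the class `RegisterCache`, the LRU replacement policy, and the rule that a destination miss needs no fill are not taken from the slides, and reference counts are ignored for brevity.

```python
from collections import OrderedDict, deque


class RegisterCache:
    """Toy model: physical registers cache memory-mapped logical registers."""

    def __init__(self, num_phys_regs: int):
        self.capacity = num_phys_regs
        self.table = OrderedDict()  # register memory address -> dirty flag, LRU order
        self.atsq = deque()         # pending ("spill" | "fill", address) transfers

    def _make_room(self) -> None:
        if len(self.table) >= self.capacity:
            victim, dirty = self.table.popitem(last=False)  # least recently used
            if dirty:
                self.atsq.append(("spill", victim))  # write the value back to memory

    def read(self, addr: int) -> None:
        """Source operand: a miss must fill the value from memory."""
        if addr in self.table:
            self.table.move_to_end(addr)
            return
        self._make_room()
        self.table[addr] = False
        self.atsq.append(("fill", addr))

    def write(self, addr: int) -> None:
        """Destination operand: the value is overwritten, so no fill is queued."""
        if addr in self.table:
            self.table.move_to_end(addr)
        else:
            self._make_room()
        self.table[addr] = True


rc = RegisterCache(num_phys_regs=2)
rc.write(0x100)  # allocate a destination, now dirty
rc.read(0x108)   # miss: queue a fill
rc.read(0x110)   # miss with a full table: spill the dirty victim, then fill
print(list(rc.atsq))
```

Since every queued address is known when the entry is created, the transfers carry no data dependences on each other, which matches the "pre-calculated addresses" point on the slide.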

20/36 Outline ● Motivation: – Observations on registers ● Idea – Virtual Context Architecture ● Evaluation in 2 types of applications – Baseline & Methodology – Register windows w/ results – SMT w/ results – Combined register windows + SMT

21/36 Baseline machine

22/36 More on methodology ● Uses SimPoints to find representative simulation intervals ● SPEC CPU 2000 ● Baseline doesn't have register windows – (Alpha’s register remapping with issue queues) ● Window overflow/underflow: 10 cycles

23/36 Applications ● Register windows ● Multithreading http://en.wikipedia.org/wiki/Register_window http://www.sics.se/~psm/sparcstack.html

24/36 Register Windows ● Global register allocation – How many registers should we reserve for the current procedure versus the rest of the program? – SPARC example: ● usually contains as many as 128 GPRs ● At any point only 32 are available: – 8 global, 8 params in, 8 params out, 8 local values – Up to 32 windows – Windows changed by an instruction usually along with 'call' and 'return' – Partial overlap: 'params out' of caller are 'params in' of callee – Also used in Itanium (variable sized window) – Alternative is e.g.: renaming with reservation stations Save some memory (stack) traffic on function calls
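The overlap between caller and callee windows can be sketched as an index computation. The logical numbering (r0-r7 global, r8-r15 out, r16-r23 local, r24-r31 in) follows SPARC; the window count `NWINDOWS = 8`, the function name `physical_index`, and the convention that a call moves to window `w + 1` are assumptions chosen for readability (real SPARC hardware decrements its window pointer on `save`).

```python
NWINDOWS = 8             # assumption: number of windows in the circular pool
POOL = NWINDOWS * 16     # each window contributes 8 'in' + 8 'local' registers


def physical_index(window: int, reg: int) -> int:
    """Map logical register `reg` (0..31) of `window` to a physical register index."""
    if not 0 <= reg < 32:
        raise ValueError("logical register must be in 0..31")
    if reg < 8:                        # globals are shared by every window
        return reg
    if reg < 16:                       # outs live in the next window's 'in' slots
        slot = (window + 1) * 16 + (reg - 8)
    elif reg < 24:                     # locals are private to this window
        slot = window * 16 + 8 + (reg - 16)
    else:                              # ins
        slot = window * 16 + (reg - 24)
    return 8 + slot % POOL             # wrap around the circular pool


# The caller's first 'out' register is the callee's first 'in' register.
print(physical_index(0, 8), physical_index(1, 24))
```

With these assumed parameters the file holds 8 + 8 × 16 = 136 physical registers; an overflow occurs when the call depth wraps around the pool and a window still in use would be overwritten.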

25/36 Register Windows Caveats ● Problem: – Overflow of windows: call depth too deep – Underflow of window: need to restore a window from memory ● Solution – Operating system handler – typical scheme saves and restores windows – VCA handles registers individually Performance Advantage of the Register Stack in Intel® Itanium™ Processors

26/36 Register windows evaluation ● ‘Ideal’: fills and spills are free ● VCA is especially good with few registers ● Close to ideal at 256 registers ● VCA 4% faster than baseline @256 regs ● Fewer registers means fewer in-flight instructions and fewer branch mispredictions → increase ● For others → decrease

27/36 Single data cache port experiment ● Normalized to 2-port baseline ● 7% faster than baseline @ 256 regs ● 0.5 % slower than ideal @ 256 regs

28/36 2nd App: multithreading

29/36 SMT: simultaneous multi-threading ● Lots of replicated resources (larger register file) ● VCA: renaming table is not replicated, only the per-thread base pointer ● VCA: – # of in-flight instructions determines the number of registers required – not # of threads

30/36 SMT: 2 and 4 threads ● Normalized to single thread baseline 256 regs (not shown) ● @ 192 regs, VCA 2T is 97% of baseline @ 320 regs (baseline is at 88%) ● @192 regs, VCA 4T is at 98.7% of baseline @448 regs

31/36 Combined SMT w/ register windows ● Normalized to single thread baseline @ 256 regs ● VCA 4T: 98% of peak performance @ 192 regs

32/36 SMT + register windows ● Register windows reduce cache accesses while SMT increases them ● VCA 4T non-windowed @192 regs reaches 98% of baseline performance but still has 24% more cache accesses; adding windows brings cache accesses 5% below baseline

33/36 VCA summarized ● unifies support for both multiple independent threads and register windowing within each thread; ● backwards compatible with existing ISAs at the application level for multithreaded contexts; ● requires only minimal ISA changes for register windowing; ● requires no changes to the physical register file design and the performance-critical schedule/execute/writeback loop; ● builds on existing rename logic to map logical registers to physical registers and handles register cache misses in the decode/rename stages;

34/36 VCA summarized (ctnd) ● completely decouples physical register file size from the number of logical registers by using memory as a backing store, rather than another larger register file; ● does not involve speculation or prediction, avoiding the need for recovery mechanisms.

35/36 Conclusions ● A VCA-based implementation of register windows in an out-of-order processor reduces execution time by 4% while reducing data cache accesses by nearly 20% compared to a non-windowed machine, with an even larger performance advantage over a conventional register-window implementation. ● VCA's data cache traffic reduction is large enough that it can achieve the same performance with one cache port as an otherwise similar conventional machine would with two cache ports.

36/36 Conclusions (ctnd) ● VCA is also able to manage thread contexts efficiently, enabling effective implementation of simultaneous multithreading (SMT) using as few as half the registers of a standard architecture. ● VCA allows SMT to be combined with register windows with no additional physical registers. ● a 4-thread VCA machine with 192 registers can achieve higher performance than a conventional non-windowed SMT machine with twice as many registers.
