1/36 by Martin Labrecque How to Fake 1000 Registers Oehmke, Binkert, Mudge, Reinhart to appear in Micro 2005.


1 1/36 by Martin Labrecque How to Fake 1000 Registers Oehmke, Binkert, Mudge, Reinhart to appear in Nov @ Micro 2005

2 2/36 Outline ● Motivation: – Observations on registers ● Idea – Virtual Context Architecture ● Evaluation in 2 types of applications

3 3/36 Some definitions ● Activation record: data structure { ● variables belonging to one particular scope (e.g. a procedure body) ● links to other activation records }; synonyms: "data frame", "stack frame" ● Context: – Activation record of a thread of execution ● A register is only meaningful within the current activation record

4 4/36 Key observation ● Virtual Memory: – From the ISA standpoint: each process has an 'infinite' amount of memory available – Memory is managed in caches, RAM and disk – Memory is context free ● This is not true for registers – Limited resource ● Need to virtualize registers

5 5/36 How registers are used ● Compiler: source code (variables) → IR (virtual registers) → binary (logical registers), through register allocation ● Pipeline: binary (logical registers) → data path (physical registers), through decode/rename

6 6/36 Registers are useful ● Can't get rid of registers: – Efficient address encoding in instructions – Unambiguous data dependences – Efficient integration in the micro-architecture

7 7/36 Dawn of a New Idea ● Attach a memory address to the content of the register!

8 8/36 Virtualizing registers

9 9/36 Mapping registers to memory ● Registers are virtualized because they hold the content of a memory location ● 2 options – At register allocation, map compiler virtual registers to memory ● Memory to memory operations ● Doesn't make use of ISA registers – Map ISA registers to memory ● Key Idea of the Virtual Context Architecture

10 10/36 Programming the VCA ● Where are the registers mapped in memory? ● The Stack Pointer is the Reference – Allows memory to be 'allocated' dynamically – Efficient way of passing parameters to a function – Needs some architectural support to address with offsets from the stack pointer

11 11/36 Renaming ● To get the register memory address, combine: – the source/destination register index of the binary program – base pointer (stack pointer) ● ISA register index → register memory address → physical register

12 12/36 Register memory address → physical reg. ● The address = base pointer + offset ● Exploit the locality of the addresses to compress the number of bits in the conversion; low probability of a capacity miss
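
The two-step mapping on slides 11 and 12 can be sketched in a few lines. This is a minimal sketch, not the paper's hardware: the 8-byte register width, the positive offsets from the base pointer, and the names `register_address` and `RenameTable` are assumptions made for illustration.

```python
REG_BYTES = 8  # assumed register width in bytes (not stated in the slides)


def register_address(base_pointer, reg_index):
    """Step 1: ISA register index -> register memory address (base + offset)."""
    return base_pointer + reg_index * REG_BYTES


class RenameTable:
    """Step 2: register memory address -> physical register (hypothetical model)."""

    def __init__(self):
        self._map = {}

    def bind(self, base_pointer, reg_index, phys_reg):
        self._map[register_address(base_pointer, reg_index)] = phys_reg

    def lookup(self, base_pointer, reg_index):
        # None models a miss: the value currently lives only in memory.
        return self._map.get(register_address(base_pointer, reg_index))


table = RenameTable()
table.bind(base_pointer=0x1000, reg_index=3, phys_reg=17)
hit = table.lookup(0x1000, 3)    # same context: found in the table
miss = table.lookup(0x2000, 3)   # same register index, other context: not found
```

Because the key is a memory address rather than a bare register index, the same logical register in two different contexts maps to two different entries.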

13 13/36 Register File is a Cache ● Hardware controlled cache ● An instruction requires its source operands and destination register to execute ● What happens on a “cache” miss? We need some hardware control!

14 14/36 Some additional HW ● Each register has 3 new attributes: 1) A reference count: ● Incremented when an instruction using it goes through rename ● Decremented when the instruction is committed ● A nonzero value means that the register cannot be reallocated to other logical registers ● Guarantees correct instruction execution

15 15/36 Some additional HW (cont'd) 2) A 'committed' bit ● Valid, non-speculative value 3) A 'dirty' bit ● Value more up-to-date than memory ● Using those attributes, a state machine controls which registers are available ● Branch recovery works by having a duplicate renaming table containing the committed architectural state
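
The three per-register attributes can be modeled as a small record. The sketch below is only an illustration of the rules stated on the slides; the class and method names are hypothetical, and the spill condition (committed and dirty) is an assumption rather than something the slides spell out.

```python
from dataclasses import dataclass


@dataclass
class PhysicalRegister:
    ref_count: int = 0       # in-flight instructions that use this register
    committed: bool = False  # holds a valid, non-speculative value
    dirty: bool = False      # value is more up to date than memory

    def on_rename(self):
        self.ref_count += 1

    def on_commit(self):
        if self.ref_count == 0:
            raise ValueError("commit without a matching rename")
        self.ref_count -= 1

    def can_reallocate(self):
        # From the slides: a nonzero count pins the register.
        return self.ref_count == 0

    def needs_spill(self):
        # Assumption: only committed, dirty values must be written back.
        return self.committed and self.dirty


reg = PhysicalRegister()
reg.on_rename()
pinned = not reg.can_reallocate()      # an instruction is still in flight
reg.committed, reg.dirty = True, True  # the instruction produced a new value
reg.on_commit()                        # count drops back to zero
```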

16 16/36 Source operand to physical register conversion

17 17/36 Destination logical register to physical register conversion

18 18/36 Allocation of an entry for destination register ● Replacement policy in rename table

19 19/36 Pipeline modifications ● Changes in the renaming ● ATSQ: architectural state transfer queue – Entries are added to the queue upon fills and spills – Has priority over the instructions to execute – Addresses for fills and spills are pre-calculated – No memory disambiguation required – No data dependences
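
To make the "register file is a cache" idea concrete, here is a toy model in which misses and evictions add fill and spill entries to a queue standing in for the ATSQ. It is a sketch under strong assumptions: replacement is least recently used, reference counts are ignored (every victim is assumed unreferenced), and a destination reuses an existing mapping instead of always taking a fresh physical register as real renaming would. The class name and its methods are hypothetical.

```python
from collections import OrderedDict, deque


class RegisterCache:
    """Toy model: physical registers cache memory-backed logical registers."""

    def __init__(self, num_phys):
        self.free = list(range(num_phys))  # unused physical registers
        self.table = OrderedDict()         # address -> physical register, oldest first
        self.dirty = set()                 # addresses whose value is newer than memory
        self.atsq = deque()                # pending (kind, address, phys) transfers

    def _allocate(self):
        if self.free:
            return self.free.pop()
        victim_addr, phys = self.table.popitem(last=False)  # evict the oldest mapping
        if victim_addr in self.dirty:
            self.dirty.discard(victim_addr)
            self.atsq.append(("spill", victim_addr, phys))
        return phys

    def read(self, addr):
        """Source operand: a miss queues a fill from memory."""
        if addr in self.table:
            self.table.move_to_end(addr)
            return self.table[addr]
        phys = self._allocate()
        self.table[addr] = phys
        self.atsq.append(("fill", addr, phys))
        return phys

    def write(self, addr):
        """Destination register: the old value is overwritten, so no fill is queued."""
        if addr in self.table:
            self.table.move_to_end(addr)
            phys = self.table[addr]
        else:
            phys = self._allocate()
            self.table[addr] = phys
        self.dirty.add(addr)
        return phys


cache = RegisterCache(num_phys=2)
cache.write(0x100)  # takes a free register, marks the address dirty
cache.read(0x108)   # miss: queues a fill
cache.read(0x110)   # miss with no free register: spills 0x100, then fills
```

The queued entries carry only an address and a physical register number, which matches the slide's point that fill and spill addresses are known in advance.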

20 20/36 Outline ● Motivation: – Observations on registers ● Idea – Virtual Context Architecture ● Evaluation in 2 types of applications – Baseline & Methodology – Register windows w/ results – SMT w/ results – Combined register windows + SMT

21 21/36 Baseline machine

22 22/36 More on methodology ● Uses SimPoints to find representative simulation intervals ● SPEC CPU 2000 ● Baseline doesn't have register windows – (Alpha’s register remapping with issue queues) ● Window overflow/underflow: 10 cycles

23 23/36 Applications ● Register windows ● Multithreading http://en.wikipedia.org/wiki/Register_window http://www.sics.se/~psm/sparcstack.html

24 24/36 Register Windows ● Global register allocation – How many registers should we reserve for the current procedure versus the rest of the program? – SPARC example: ● usually contains as many as 128 GPRs ● At any point only 32 are available: – 8 global, 8 params in, 8 params out, 8 local values – Up to 32 windows – Windows changed by an instruction, usually along with 'call' and 'return' – Partial overlap: 'params out' of caller are 'params in' of callee – Also used in Itanium (variable sized window) – Alternative, e.g.: renaming with reservation stations ● Saves some memory (stack) traffic on function calls
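
The partial overlap between caller and callee windows can be shown with a small index calculation. This is a sketch only: the function name, the default of 8 windows, and the convention that a call moves from window w to window w + 1 are assumptions chosen for readability. The logical numbering follows SPARC (r0-r7 globals, r8-r15 outs, r16-r23 locals, r24-r31 ins).

```python
NUM_GLOBALS = 8
REGS_PER_WINDOW = 16  # 8 locals + 8 registers shared with the neighbouring window


def physical_index(window, reg, num_windows=8):
    """Map (window, logical register r0..r31) to a physical register index.

    Assumption: a call moves from window w to window w + 1 (modulo num_windows).
    """
    if not 0 <= reg < 32:
        raise ValueError("logical register must be in 0..31")
    if reg < 8:    # globals are shared by every window
        return reg
    windowed = REGS_PER_WINDOW * num_windows
    if reg < 16:   # outs: stored where the next window's ins live
        slot = (window + 1) * REGS_PER_WINDOW + (reg - 8)
    elif reg < 24:  # locals: private to this window
        slot = window * REGS_PER_WINDOW + 8 + (reg - 16)
    else:          # ins: start of this window's block
        slot = window * REGS_PER_WINDOW + (reg - 24)
    return NUM_GLOBALS + slot % windowed


caller_out0 = physical_index(window=0, reg=8)   # caller's first 'params out'
callee_in0 = physical_index(window=1, reg=24)   # callee's first 'params in'
```

Here `caller_out0` and `callee_in0` refer to the same physical register, so parameters are passed without any memory traffic, while the locals of the two windows stay distinct.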

25 25/36 Register Windows Caveats ● Problem: – Overflow of windows: call depth too deep – Underflow of windows: need to restore a window from memory ● Solution – Operating system handler – typical scheme saves and restores whole windows – VCA handles registers individually ● Reference: Performance Advantage of the Register Stack in Intel® Itanium™ Processors

26 26/36 Register windows evaluation ● 'Ideal': fills and spills are free ● VCA is especially good with few registers ● Close to ideal at 256 registers ● VCA 4% faster than baseline @ 256 regs ● Fewer registers means fewer in-flight instructions and fewer branch mispredictions → increase ● For others → decrease

27 27/36 Single data cache port experiment ● Normalized to 2-port baseline ● 7% faster than baseline @ 256 regs ● 0.5% slower than ideal @ 256 regs

28 28/36 2nd App: multithreading

29 29/36 SMT: simultaneous multi-threading ● Lots of replicated resources (larger register file) ● VCA: the renaming table is not replicated, only the per-thread base pointer ● VCA: – The number of in-flight instructions, not the number of threads, determines the number of registers required

30 30/36 SMT: 2 and 4 threads ● Normalized to single thread baseline 256 regs (not shown) ● @ 192 regs, VCA 2T is 97% of baseline @ 320 regs (baseline is at 88%) ● @192 regs, VCA 4T is at 98.7% of baseline @448 regs

31 31/36 Combined SMT w/ register windows ● Normalized to single thread baseline @ 256 regs ● VCA 4T: 98% of peak performance @ 192 regs

32 32/36 SMT + register windows ● Register windows reduce cache accesses while SMT increases them ● VCA 4T non-windowed @ 192 regs reaches 98% of baseline performance but still has 24% more cache accesses; adding windows brings cache accesses 5% below baseline

33 33/36 VCA summarized ● unifies support for both multiple independent threads and register windowing within each thread; ● backwards compatible with existing ISAs at the application level for multithreaded contexts; ● requires only minimal ISA changes for register windowing; ● requires no changes to the physical register file design and the performance-critical schedule/execute/writeback loop; ● builds on existing rename logic to map logical registers to physical registers and handles register cache misses in the decode/rename stages;

34 34/36 VCA summarized (cont'd) ● completely decouples physical register file size from the number of logical registers by using memory as a backing store, rather than another larger register file; ● does not involve speculation or prediction, avoiding the need for recovery mechanisms.

35 35/36 Conclusions ● A VCA-based implementation of register windows in an out-of-order processor reduces execution time by 4% while reducing data cache accesses by nearly 20% compared to a non-windowed machine, with an even larger performance advantage over a conventional register-window implementation. ● VCA's data cache traffic reduction is large enough that it can achieve the same performance with one cache port as an otherwise similar conventional machine would with two cache ports.

36 36/36 Conclusions (cont'd) ● VCA is also able to manage thread contexts efficiently, enabling effective implementation of simultaneous multithreading (SMT) using as few as half the registers of a standard architecture. ● VCA allows SMT to be combined with register windows with no additional physical registers. ● A 4-thread VCA machine with 192 registers can achieve higher performance than a conventional non-windowed SMT machine with twice as many registers.

