Published by Melissa McDowell. Modified over 9 years ago.
1/36 How to Fake 1000 Registers
Oehmke, Binkert, Mudge, Reinhardt, to appear in Nov @ MICRO 2005
by Martin Labrecque
2/36 Outline
● Motivation:
  – Observations on registers
● Idea
  – Virtual Context Architecture
● Evaluation in 2 types of applications
3/36 Some definitions
● Activation record: Data structure {
    ● variables belonging to one particular scope (e.g. a procedure body)
    ● links to other activation records
  };
  Synonyms: "data frame", "stack frame"
● Context:
  – Activation record of a thread of execution
A register is only meaningful to the current activation record
4/36 Key observation
● Virtual Memory:
  – From the ISA standpoint, each process has an 'infinite' amount of memory available
  – Memory is managed in caches, RAM and disk
  – Memory is context free
● This is not true for registers
  – Limited resource
Need to virtualize registers
5/36 How registers are used
● Compiler
  – Source code: variables
  – IR: virtual registers
  – Register allocation: virtual registers → logical registers
● Pipeline
  – Binary: logical registers
  – Decode/Rename: logical registers → physical registers
  – Data path: physical registers
6/36 Registers are useful
● Can't get rid of registers:
  – Efficient address encoding in instructions
  – Unambiguous data dependences
  – Efficient integration in the micro-architecture
7/36 Dawn of a New Idea
Attach a memory address to the content of the register!
8/36 Virtualizing registers
9/36 Mapping registers to memory
● Registers are virtualized because they hold the content of a memory location
● 2 options
  – At register allocation, map compiler virtual registers to memory
    ● Memory to memory operations
    ● Doesn't make use of ISA registers
  – Map ISA registers to memory
    ● Key Idea of the Virtual Context Architecture
10/36 Programming the VCA
● Where are the registers mapped in memory?
● The Stack Pointer is the Reference
  – Allows memory to be 'allocated' dynamically
  – Efficient way of passing parameters to a function
  – Needs some architectural support to address with offsets from the stack pointer
11/36 Renaming
● To get the register memory address, combine:
  – the source/destination register index of the binary program
  – base pointer (stack pointer)
● ISA register index → register memory address → physical register
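As a minimal sketch of the address computation described on this slide, assume 8-byte registers laid out at increasing offsets from the base pointer. The slides specify neither the register width nor the layout, and `REG_BYTES`, `register_address` and the base-pointer values below are illustrative names and numbers only.

```python
REG_BYTES = 8  # assumption: 64-bit registers, one memory slot per register index


def register_address(base_pointer: int, reg_index: int) -> int:
    """Memory address that backs logical register `reg_index` in one context."""
    return base_pointer + reg_index * REG_BYTES


# The same logical register in two contexts maps to two different addresses,
# so the rename stage can tell them apart. The base pointers are made up.
caller_r5 = register_address(0x7FFF_F000, 5)
callee_r5 = register_address(0x7FFF_F100, 5)
print(hex(caller_r5), hex(callee_r5))
```

Keying the rename lookup on this address rather than on the bare register index is what lets registers from different activation records or threads coexist in one physical register file.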
12/36 Register memory address → physical reg.
● The address = base pointer + offset
● Exploit locality of the addresses to compress the number of bits in the conversion, low probability of capacity miss
13/36 Register File is a Cache
● Hardware controlled cache
● An instruction requires its source operands and destination register to execute
What happens on a “cache” miss?
We need some hardware control!
14/36 Some additional HW
● Each register has 3 new attributes:
1) A reference count:
  ● Incremented when an instruction using it goes through rename
  ● Decremented when the instruction is committed
  ● Non zero value means that the register cannot be reallocated to other logical registers
  ● Guarantees correct instruction execution
15/36 Some additional HW (ctnd)
2) A 'committed' bit
  ● Valid, non speculative value
3) A 'dirty' bit
  ● Value more up-to-date than memory
Using those attributes, a state machine controls which registers are available or not
Branch recovery works by having a duplicate renaming table containing the committed architectural state
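The three per-register attributes can be pictured with a small model. This is only a sketch: the class and method names are hypothetical, and the rule that a value needs a write-back only when it is both committed and dirty is an assumption, since the slides do not spell out the state machine.

```python
from dataclasses import dataclass


@dataclass
class PhysRegState:
    ref_count: int = 0       # in-flight instructions that reference this register
    committed: bool = False  # holds a valid, non-speculative value
    dirty: bool = False      # value is more up to date than memory

    def on_rename(self) -> None:
        """An instruction using this register passes through rename."""
        self.ref_count += 1

    def on_commit(self, wrote_register: bool) -> None:
        """That instruction commits; a writer leaves a committed, dirty value."""
        self.ref_count -= 1
        if wrote_register:
            self.committed = True
            self.dirty = True

    def can_reallocate(self) -> bool:
        # From the slides: a non-zero count blocks reallocation.
        return self.ref_count == 0

    def needs_spill(self) -> bool:
        # Assumption: only committed, dirty values must be written back.
        return self.committed and self.dirty


reg = PhysRegState()
reg.on_rename()
blocked = not reg.can_reallocate()      # still referenced by an in-flight instruction
reg.on_commit(wrote_register=True)
free_but_dirty = reg.can_reallocate() and reg.needs_spill()
print(blocked, free_but_dirty)
```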
16/36 Source operand to physical register conversion
17/36 Destination logical register to physical register conversion
18/36 Allocation of an entry for destination register
● Replacement policy in rename table
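The figures for the last three slides did not survive, so the following is a toy model of how source and destination lookups might differ, not the authors' design. It assumes one physical register per register memory address (real renaming allocates a fresh physical register for each destination write and tracks speculation), an LRU choice among entries with a zero reference count, and that a destination miss needs no fill because the value is about to be overwritten. `RegisterCache` and its methods are hypothetical names.

```python
from collections import OrderedDict


class RegisterCache:
    """Toy model: one physical register per register memory address."""

    def __init__(self, num_physical: int):
        self.capacity = num_physical
        self.table = OrderedDict()  # address -> {"refs": int, "dirty": bool}
        self.transfers = []         # ("fill" | "spill", address) in issue order

    def _evict(self) -> None:
        # Assumed policy: least recently used entry with a zero reference count.
        victim = next((a for a, e in self.table.items() if e["refs"] == 0), None)
        if victim is None:
            raise RuntimeError("every physical register is in use; rename stalls")
        if self.table[victim]["dirty"]:
            self.transfers.append(("spill", victim))
        del self.table[victim]

    def access(self, address: int, is_dest: bool) -> None:
        if address not in self.table:
            if len(self.table) >= self.capacity:
                self._evict()
            if not is_dest:
                # A source operand needs its old value; a destination is overwritten.
                self.transfers.append(("fill", address))
            self.table[address] = {"refs": 0, "dirty": False}
        entry = self.table[address]
        entry["refs"] += 1
        entry["dirty"] = entry["dirty"] or is_dest
        self.table.move_to_end(address)  # mark as most recently used

    def commit(self, address: int) -> None:
        self.table[address]["refs"] -= 1


cache = RegisterCache(num_physical=2)
cache.access(0x100, is_dest=True)   # destination miss: allocate, no fill
cache.access(0x108, is_dest=False)  # source miss: fill
cache.commit(0x100)
cache.commit(0x108)
cache.access(0x110, is_dest=False)  # evicts 0x100 (dirty, so spilled), then fills
print(cache.transfers)
```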
19/36 Pipeline modifications
● Changes in the renaming
● ATSQ: architectural state transfer queue
  – Entries are added to the queue upon fills and spills
  – Has priority over the instructions to execute
  – Addresses for fills and spills are pre-calculated
  – No memory disambiguation required
  – No data dependences
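One way to read the priority rule is that, each cycle, queued fills and spills claim the data cache ports before ordinary loads and stores. The sketch below illustrates that reading only; `select_memory_ops`, the tuple format and the addresses are hypothetical, and the real arbitration may differ.

```python
from collections import deque


def select_memory_ops(atsq: deque, program_ops: deque, cache_ports: int) -> list:
    """Pick up to `cache_ports` operations for this cycle, ATSQ entries first."""
    issued = []
    while atsq and len(issued) < cache_ports:
        issued.append(atsq.popleft())
    while program_ops and len(issued) < cache_ports:
        issued.append(program_ops.popleft())
    return issued


atsq = deque([("spill", 0x100), ("fill", 0x110)])
program_ops = deque([("load", 0x2000), ("store", 0x2008)])
first_cycle = select_memory_ops(atsq, program_ops, cache_ports=2)
second_cycle = select_memory_ops(atsq, program_ops, cache_ports=2)
print(first_cycle, second_cycle)
```

Because fill and spill addresses are known in advance and carry no dependences, the queue can be drained in simple FIFO order, which is what the `deque` stands in for here.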
20/36 Outline
● Motivation:
  – Observations on registers
● Idea
  – Virtual Context Architecture
● Evaluation in 2 types of applications
  – Baseline & Methodology
  – Register windows w/ results
  – SMT w/ results
  – Combined register windows + SMT
21/36 Baseline machine
22/36 More on methodology
● Uses SimPoints to find representative simulation intervals
● SPEC CPU 2000
● Baseline doesn't have register windows
  – (Alpha’s register remapping with issue queues)
● Window overflow/underflow: 10 cycles
23/36 Applications
● Register windows
● Multithreading
http://en.wikipedia.org/wiki/Register_window
http://www.sics.se/~psm/sparcstack.html
24/36 Register Windows
● Global register allocation
  – How many registers should we reserve for the current procedure versus the rest of the program?
  – SPARC example:
    ● usually contains as many as 128 GPRs
    ● At any point only 32 are available:
      – 8 global, 8 params in, 8 params out, 8 local values
  – Up to 32 windows
  – Windows changed by an instruction usually along with 'call' and 'return'
  – Partial overlap: 'params out' of caller are 'params in' of callee
  – Also used in Itanium (variable sized window)
  – Alternative is e.g.: renaming with reservation stations
Save some memory (stack) traffic on function calls
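The partial overlap can be made concrete with a small mapping from (window, logical register) to a slot in a circular windowed file. This is a sketch under stated assumptions: it uses the SPARC numbering r0-r7 global, r8-r15 out, r16-r23 local, r24-r31 in; it treats the callee as window + 1 for readability (real SPARC hardware decrements its window pointer on a call); and `n_windows=8` is an arbitrary choice, not a figure from the slides.

```python
N_GLOBALS = 8
REGS_PER_WINDOW = 16  # each window adds 8 'in' + 8 'local'; its 'out' is the next window's 'in'


def physical_index(window: int, reg: int, n_windows: int = 8) -> int:
    """Slot in the register file for logical register `reg` of `window`."""
    if not 0 <= reg < 32:
        raise ValueError("logical register must be in 0..31")
    if reg < 8:                      # r0-r7: globals, shared by all windows
        return reg
    base = window * REGS_PER_WINDOW
    if reg < 16:                     # r8-r15: params out = next window's params in
        slot = base + REGS_PER_WINDOW + (reg - 8)
    elif reg < 24:                   # r16-r23: locals, private to this window
        slot = base + 8 + (reg - 16)
    else:                            # r24-r31: params in
        slot = base + (reg - 24)
    # The windowed part of the file is circular, so the last window wraps around.
    return N_GLOBALS + slot % (REGS_PER_WINDOW * n_windows)


caller_out0 = physical_index(window=0, reg=8)
callee_in0 = physical_index(window=1, reg=24)
print(caller_out0 == callee_in0)  # the argument is passed without any copy
```

The wrap-around in the last line of `physical_index` is also where overflow comes from: once the call depth exceeds the number of windows, a new window lands on registers that still hold a live older window.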
25/36 Register Windows Caveats
● Problem:
  – Overflow of windows: call depth too deep
  – Underflow of window: need to restore a window from memory
● Solution
  – Operating system handler
  – typical scheme saves and restores windows
  – VCA handles registers individually
Performance Advantage of the Register Stack in Intel® Itanium™ Processors
26/36 Register windows evaluation
● ‘Ideal’: fills and spills are free
● VCA is especially good with few registers
● Close to ideal at 256 registers
● VCA 4% faster than baseline @ 256 regs
● Fewer registers means fewer in-flight instructions and fewer branch mispredictions (increase for some; for others, decrease)
27/36 Single data cache port experiment
● Normalized to 2-port baseline
● 7% faster than baseline @ 256 regs
● 0.5% slower than ideal @ 256 regs
28/36 2nd App: multi-threading
29/36 SMT: simultaneous multi-threading
● Lots of replicated resources (larger register file)
● VCA: renaming table is not replicated, only base thread pointer
● VCA:
  – # of in-flight instructions determines the number of registers required, not # of threads
30/36 SMT: 2 and 4 threads
● Normalized to single thread baseline 256 regs (not shown)
● @ 192 regs, VCA 2T is 97% of baseline @ 320 regs (baseline is at 88%)
● @ 192 regs, VCA 4T is at 98.7% of baseline @ 448 regs
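The baseline sizes quoted here (256, 320 and 448 registers for 1, 2 and 4 threads) fit a simple model in which each thread adds a full set of architected registers on top of a fixed rename pool. The decomposition below is an inference from those numbers, not something the slides state: 64 registers per thread matches Alpha's 32 integer plus 32 floating-point registers, and 192 rename registers is the value that makes the single-thread case come out to 256.

```python
LOGICAL_REGS_PER_THREAD = 64  # assumption: Alpha's 32 integer + 32 FP registers
RENAME_REGS = 192             # assumption: chosen so that one thread gives 256
VCA_REGS = 192                # VCA configuration quoted on the slide


def conventional_file_size(threads: int) -> int:
    """Physical registers a conventional SMT core needs under the assumptions above."""
    return threads * LOGICAL_REGS_PER_THREAD + RENAME_REGS


sizes = {threads: conventional_file_size(threads) for threads in (1, 2, 4)}
for threads, size in sizes.items():
    # Fraction of the conventional register file that the 192-register VCA uses.
    print(threads, size, round(VCA_REGS / size, 2))
```

Under this reading, the conventional file grows linearly with thread count while the VCA file stays fixed, which is the point made on the previous slide about in-flight instructions, not threads, setting the register requirement.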
31/36 Combined SMT w/ register windows
● Normalized to single thread baseline @ 256 regs
● VCA 4T: 98% of peak performance @ 192 regs
32/36 SMT + register windows
● Register windows reduce cache accesses while SMT increases them
● VCA 4T non-windowed @ 192 regs reaches 98% of baseline performance but still has 24% more cache accesses; adding windows brings cache accesses 5% below baseline
33/36 VCA summarized
● unifies support for both multiple independent threads and register windowing within each thread;
● backwards compatible with existing ISAs at the application level for multithreaded contexts;
● requires only minimal ISA changes for register windowing;
● requires no changes to the physical register file design and the performance-critical schedule/execute/writeback loop;
● builds on existing rename logic to map logical registers to physical registers and handles register cache misses in the decode/rename stages;
34/36 VCA summarized (ctnd)
● completely decouples physical register file size from the number of logical registers by using memory as a backing store, rather than another larger register file;
● does not involve speculation or prediction, avoiding the need for recovery mechanisms.
35/36 Conclusions
● A VCA-based implementation of register windows in an out-of-order processor reduces execution time by 4% while reducing data cache accesses by nearly 20% compared to a non-windowed machine, with an even larger performance advantage over a conventional register-window implementation.
● VCA's data cache traffic reduction is large enough that it can achieve the same performance with one cache port as an otherwise similar conventional machine would with two cache ports.
36/36 Conclusions (ctnd)
● VCA is also able to manage thread contexts efficiently, enabling effective implementation of simultaneous multithreading (SMT) using as few as half the registers of a standard architecture.
● VCA allows SMT to be combined with register windows with no additional physical registers.
● A 4-thread VCA machine with 192 registers can achieve higher performance than a conventional non-windowed SMT machine with twice as many registers.