1
Cost-Effective Physical Register Sharing
Arthur Perais, INRIA
André Seznec, INRIA
This talk is about the physical registers of a processor, and how to share them between instructions to achieve various effects.
Cost-Effective Physical Register Sharing 11/19/2018
2
Register Renaming is Clever
Modern out-of-order processors execute instructions as soon as their dependencies are satisfied.

    add rax, rbx        //Defines rax
    st  rax, [rbp + 8]  //Reads rax
    sub rax, rcx        //Redefines rax

Read-after-Write (RAW): the read of rax by the store must take place after the write of rax by the add.
Write-after-Write (WAW): the writes to rax by the add and the sub must be sequential.
Write-after-Read (WAR): the store must read rax before the sub overwrites it.
RAW dependencies are true dependencies. WAW and WAR are false dependencies, introduced by the limited number of registers defined by the ISA.
3
Register Renaming is Clever
"All problems in computer science can be solved by another level of indirection [...]" (David Wheeler)
Fortunately, all problems in computer science can be solved by another level of indirection, hence register renaming.
Provide each instruction with a unique location to store its result:
Can do add and sub in any order (no WAW).
Can do sub before the store (no WAR).
Much more Instruction-Level Parallelism (ILP) for out-of-order execution.

Before renaming:
    add rax, rbx        //Defines rax
    st  rax, [rbp + 8]  //Reads rax
    sub rax, rcx        //Redefines rax
After renaming:
    add pr0, pr42       //Defines rax (pr0)
    st  pr0, [pr43 + 8] //Reads rax
    sub pr1, pr44       //Redefines rax (pr1)
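A minimal sketch of the renaming step may help here. It is an illustration only: the function name `rename`, the tuple encoding of instructions, the initial mapping of rax to pr41, and the free-list contents are assumptions made for the example, and operands are modeled the way the slide shows them (the add and sub list only their second operand as a source).

```python
from collections import deque

def rename(program, rename_map, free_list):
    """Rename sources through the map, then give each destination a fresh physical register."""
    renamed = []
    for op, dest, sources in program:
        # Sources read the current mapping, which preserves true (RAW) dependencies.
        phys_sources = [rename_map[s] for s in sources]
        phys_dest = None
        if dest is not None:
            # A fresh physical register per result removes WAW and WAR dependencies.
            phys_dest = free_list.popleft()
            rename_map[dest] = phys_dest
        renamed.append((op, phys_dest, phys_sources))
    return renamed

# Hypothetical encoding of the slide's three instructions: (opcode, dest, sources).
program = [
    ("add", "rax", ["rbx"]),
    ("st", None, ["rax", "rbp"]),
    ("sub", "rax", ["rcx"]),
]
# Assumed initial state: pr41 for rax is made up; pr42..pr44 follow the slide.
rename_map = {"rax": "pr41", "rbx": "pr42", "rbp": "pr43", "rcx": "pr44"}
free_list = deque(["pr0", "pr1", "pr2"])

renamed = rename(program, rename_map, free_list)
for op, dest, sources in renamed:
    print(op, dest, sources)
```

After the call, the add writes pr0, the store reads pr0, and the sub writes pr1, so the sub no longer conflicts with either older instruction.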
4
Register Renaming is Genius
Register renaming is actually more powerful than that. Since it decouples the architectural name from the physical location, it also allows two architectural names to share a physical location. This in turn allows various optimizations.
For instance: zero-cycle register copy (move elimination).

    mov rax <- rbx

Rename the architectural destination to the physical register of the architectural source.

    Rename Map   before   after
    rax          inv      pr0
    rbx          pr0      pr0
5
Register Renaming is Genius
Decoupling architectural name from physical location: two architectural names can share a physical location.
For instance: zero-cycle load ([speculative] memory bypassing).

    add rax <- 42
    store rax,
    load rbx,
    add rax <- rbx

Rename the load destination to the store source.

    Rename Map   before   after
    rax          pr0      pr0
    rbx          inv      pr0
6
Too Good to be True?
Now, let's assume that we can share a physical register between instructions. The question is: when can we reclaim the register, and who frees it?
Intuitively: the youngest owner, because older owners have committed.
What about branch mispredictions? Ownership goes back to an older instruction.
Need for some form of reference counting.
7
Reference Counting for Physical Registers
Usually: one counter per physical register.
Increase when referenced by another instruction.
Decrease when overwritten in the rename map.
Con: Impractical to checkpoint: checkpointed state must be modified when the counter is decreased.
Reasonable for lower-complexity cores without checkpointing.
8
Reference Counting for Physical Registers
#(preg) * (ROB + Arch_regs) bit-matrix [Roth, CAL'08]:
A refinement to the previous scheme is to use a bit matrix. There are as many rows as physical registers, and as many columns as entities that can own a register: ROB entries and commit rename map entries. If a bit is set, the physical register corresponding to the row is owned by the entity corresponding to the column. A register is free if ORing its whole row returns 0.
Columns: ROB0, ROB1, ROB2, ROB3, rax, … In the slide's example, rows pr0 and pr1 each have two bits set.
Pro: Checkpointable.
Con: Big (~8KB for a Haswell-like processor) and most likely not scalable due to its matrix nature.
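The ownership rule of the bit matrix can be sketched in a few lines. This is a toy model under stated assumptions, not the structure from [Roth, CAL'08]: the class name `OwnershipMatrix`, its methods, and the tiny sizes are hypothetical, and each row is held in a Python integer used as a bit-vector.

```python
class OwnershipMatrix:
    """Toy bit matrix: one row per physical register, one column per possible owner."""

    def __init__(self, num_pregs, num_rob_entries, arch_regs):
        # Columns: ROB entries first, then commit rename map entries (assumed layout).
        self.columns = {("rob", i): i for i in range(num_rob_entries)}
        for j, name in enumerate(arch_regs):
            self.columns[("arch", name)] = num_rob_entries + j
        self.rows = [0] * num_pregs

    def set_owner(self, preg, owner):
        self.rows[preg] |= 1 << self.columns[owner]

    def clear_owner(self, preg, owner):
        self.rows[preg] &= ~(1 << self.columns[owner])

    def is_free(self, preg):
        # ORing the whole row yields 0 exactly when no owner bit is set.
        return self.rows[preg] == 0

    def storage_bits(self):
        return len(self.rows) * len(self.columns)

matrix = OwnershipMatrix(num_pregs=4, num_rob_entries=4, arch_regs=["rax", "rbx"])
matrix.set_owner(1, ("rob", 0))    # first owner of physical register 1
matrix.set_owner(1, ("rob", 2))    # a second instruction shares it
matrix.clear_owner(1, ("rob", 0))
still_owned = not matrix.is_free(1)
matrix.clear_owner(1, ("rob", 2))
now_free = matrix.is_free(1)
print(still_owned, now_free, matrix.storage_bits())
```

The `storage_bits` product makes the scaling concern visible: the structure grows with both the number of physical registers and the number of possible owners.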
9
Reference Counting for Physical Registers
#(preg) bit-vector array [Battle et al., HPCA'12].
A second refinement consists in using an array of bit-vectors. Each entry has n bits, with n the maximum number of times the register can be shared.

    pr0 01 //Shared once: regular allocation
    pr1 11 //Shared twice: e.g., move elimination

Pro: More scalable, since it grows linearly with the number of physical registers, and checkpointable.
Con: Checkpoint size is large since the whole structure is checkpointed (~800 bits just to implement two sharers for ~400 registers).
10
One Counter is not Enough…
If we go back to reference counters: checkpointing them is impractical.
Our solution: two counters per register.
Referenced: speculative, the number of additional references to the physical register since it was first allocated.
Committed: architectural, the number of times a mapping containing the physical register was overwritten by a committed instruction.
Register can be freed if committed > referenced.
Only checkpoint speculative state (referenced).
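The freeing rule can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the class `SharingCounters` and its method names are hypothetical, and the trace replays the counter values the talk uses for pr1 in its move elimination example.

```python
from dataclasses import dataclass

@dataclass
class SharingCounters:
    referenced: int = 0  # speculative: extra references since first allocation
    committed: int = 0   # architectural: committed overwrites of a mapping to this register

    def share(self):
        """Called at Rename when another architectural name starts pointing here."""
        self.referenced += 1

    def commit_overwrite(self):
        """Called at Commit when a mapping to this register is overwritten.

        Returns True when the register can be freed (committed > referenced).
        """
        self.committed += 1
        if self.committed > self.referenced:
            self.referenced = 0
            self.committed = 0
            return True
        return False

pr1 = SharingCounters()
pr1.share()                             # move rax <- rbx: rax now also maps to pr1
freed_after_first = pr1.commit_overwrite()   # add rax <- 18 commits: 1 > 1 is false
freed_after_second = pr1.commit_overwrite()  # add rbx <- 42 commits: 2 > 1, free pr1
print(freed_after_first, freed_after_second)
```

A register that is never shared keeps referenced at 0, so its first committed overwrite already satisfies committed > referenced and frees it, which matches conventional renaming.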
11
Example: Move Elimination
Initial state:

    Rename Map      Ref. counters
    areg  preg      preg  ref'd  com'd
    rax   pr0       pr0
    rbx   pr1       pr1

move rax <- rbx //Share preg
Move Elimination: point rax to the preg mapped to rbx. Increase the number of references for pr1.

    areg  preg      preg  ref'd  com'd
    rax   pr1       pr0
    rbx   pr1       pr1   1

add rax <- 18 //Overwrite rax
Attribute a new preg to rax. Increase committed for pr1 (at commit).

    areg  preg      preg  ref'd  com'd
    rax   pr2       pr0
    rbx   pr1       pr1   1      1
12
Example: Move Elimination
    areg  preg      preg  ref'd  com'd
    rax   pr2       pr0
    rbx   pr1       pr1   1      1

add rbx <- 42 //Overwrite rbx
Attribute a new preg to rbx. Increase committed for pr1 (at commit).
committed > referenced: free register pr1, reset counters.

    areg  preg      preg  ref'd  com'd
    rax   pr2       pr0
    rbx   pr3       pr1   1      2

Now, I mentioned that this scheme was checkpointable.
13
Example: Now With Checkpointing
Before the branch:

    areg  preg      preg  ref'd  com'd
    rax   pr2       pr0
    rbx   pr1       pr1   1      1

jz rax //Checkpoint ref'd
move rax <- rbx //ME

    areg  preg      preg  ref'd  com'd
    rax   pr1       pr0
    rbx   pr1       pr1   2      1

Branch is mispredicted: restore ref'd and the rename map.

    areg  preg      preg  ref'd  com'd
    rax   pr2       pr0
    rbx   pr1       pr1   1      1

committed = referenced, so pr1 is not freed yet: the next instruction overwriting rbx will free pr1. If, after restoring referenced, committed had been greater, we would have freed the register and the entry.
So we have an alternative scheme to share registers that is checkpointable. Nonetheless, if we checkpoint one counter per register, checkpoint storage is still significant.
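The misprediction scenario above can be replayed with plain dictionaries. This is a sketch under assumptions: the variable names are made up for illustration, and only the speculative state (referenced counters and rename map) is snapshotted while committed is left alone, as the talk describes.

```python
# State before the branch, as in the example: pr1 has referenced = 1, committed = 1.
rename_map = {"rax": "pr2", "rbx": "pr1"}
referenced = {"pr1": 1}
committed = {"pr1": 1}   # architectural state, never checkpointed

# jz rax: checkpoint the speculative state only.
snapshot = (dict(rename_map), dict(referenced))

# move rax <- rbx is eliminated on the wrong path: share pr1 once more.
rename_map["rax"] = rename_map["rbx"]
referenced["pr1"] += 1
wrong_path_referenced = referenced["pr1"]

# Branch is mispredicted: restore the rename map and the referenced counters.
rename_map, referenced = dict(snapshot[0]), dict(snapshot[1])
free_at_restore = committed["pr1"] > referenced["pr1"]

# Later, an instruction overwriting rbx commits and pushes committed past referenced.
committed["pr1"] += 1
free_after_overwrite = committed["pr1"] > referenced["pr1"]

print(wrong_path_referenced, free_at_restore, free_after_overwrite)
```

Because committed only changes at commit, restoring referenced alone is enough to bring the pair back to a consistent state.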
14
Many Counters Are not Necessary
Our intuition is that many counters are not necessary. If we take a snapshot of the allocated registers:
Many will be allocated once.
A few will be allocated more than once.
Reference counting is only required for the latter ones.
Allocate an entry in a small structure (Inflight Shared Registers Buffer) when a new register requires sharing.
15
The Inflight Shared Registers Buffer
Fully-associative, tagged by the physical register ID.
Management:
Allocate entries/update referenced at Rename.
Free entries/update committed at Commit.
Checkpoint referenced counters and tags*.
*Small modifications to the scheme presented in the paper allow the tags not to be checkpointed.
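A small model of the buffer's Rename-side and Commit-side management might look like the following. It is a sketch, not the paper's design: the class `ISRB`, its method names, and the policy of refusing to share when the buffer is full are assumptions, a dictionary stands in for the fully-associative tag match, and checkpointing is left out.

```python
class ISRB:
    """Toy Inflight Shared Registers Buffer: counters exist only for shared registers."""

    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.entries = {}  # physical register ID (tag) -> [referenced, committed]

    def share(self, preg):
        """At Rename: record one more reference. Returns False if sharing is refused."""
        entry = self.entries.get(preg)
        if entry is None:
            if len(self.entries) == self.num_entries:
                return False  # assumed policy: buffer full, do not share
            entry = self.entries[preg] = [0, 0]
        entry[0] += 1
        return True

    def commit_overwrite(self, preg):
        """At Commit: a mapping to preg was overwritten. Returns True if preg is freed."""
        entry = self.entries.get(preg)
        if entry is None:
            return True  # never shared: free immediately, as in a conventional renamer
        entry[1] += 1
        if entry[1] > entry[0]:
            del self.entries[preg]  # free the entry together with the register
            return True
        return False

isrb = ISRB(num_entries=1)
shared_first = isrb.share(1)              # allocates the only entry for register 1
shared_second = isrb.share(2)             # refused: the buffer is full
unshared_freed = isrb.commit_overwrite(5) # register 5 has no entry
first_commit = isrb.commit_overwrite(1)   # committed 1, referenced 1: keep
second_commit = isrb.commit_overwrite(1)  # committed 2 > referenced 1: free
shared_after_free = isrb.share(2)         # the entry is available again
print(shared_first, shared_second, unshared_freed, first_commit, second_commit, shared_after_free)
```

Registers without an entry behave exactly as in a conventional renamer, so the number of entries bounds how many registers can be shared at once rather than how many can be allocated.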
16
Evaluating the ISRB Using Typical Mechanisms
Move Elimination (ME): When a register-to-register move is renamed, just rename the destination to the source. Particularly interesting in x86, where only a few architectural registers are available. Move elimination is straightforward because it is non-speculative: when Decode detects a sharing candidate, it just tells Rename.
Speculative Memory Bypassing (SMB): Predict store-load pairs, and rename the destination of the load to the source of the store. SMB, as its name suggests, is speculative, so we have to provide a way to identify store-load pairs.
17
SMB Through the ROB
Distance Prediction [Sha et al., MICRO'06]:
Use the Commit Sequence Number (CSN) to compute the instruction distance between store-load pairs.
Have register-producing instructions mark the commit rename map with their CSN: CRMAP_CSN[rax] = CSN0
Have stores put the CSN of the stored register in an effective-address-indexed table (DDT): DDT[…] = CRMAP_CSN[rax] = CSN0
Have loads read the CSN in the DDT and subtract it from their own CSN: Distance = CSN2 - DDT[…] = 2 - 0 = 2.

    CSN0 - add rax <- 42
    CSN1 - store rax,
    CSN2 - load rbx,
    CSN3 - add rax <- rbx

Inst. Distance: 2
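The distance computation can be sketched over a committed instruction stream. This is an illustration under assumptions: the function `train_distances`, the tuple encoding, and the address 0x1000 are hypothetical (the addresses are not given in the example), and dictionaries stand in for the commit rename map annotations and the finite DDT.

```python
def train_distances(committed_stream):
    """Return {load CSN: instruction distance to the producer of the stored value}."""
    crmap_csn = {}  # architectural register -> CSN of its last committed producer
    ddt = {}        # effective address -> CSN of the producer of the stored register
    distances = {}
    for csn, (kind, reg, addr) in enumerate(committed_stream):
        if kind == "store":
            if reg in crmap_csn:
                ddt[addr] = crmap_csn[reg]
        elif kind == "load":
            if addr in ddt:
                distances[csn] = csn - ddt[addr]
            crmap_csn[reg] = csn  # a load also produces a register
        else:
            crmap_csn[reg] = csn  # any other register-producing instruction
    return distances

# The four instructions of the example, with a made-up shared address.
stream = [
    ("alu", "rax", None),      # CSN0 - add rax <- 42
    ("store", "rax", 0x1000),  # CSN1 - store rax
    ("load", "rbx", 0x1000),   # CSN2 - load rbx
    ("alu", "rax", None),      # CSN3 - add rax <- rbx
]
distances = train_distances(stream)
print(distances)
```

For this stream the load at CSN2 finds CSN0 in the table, giving the distance of 2 shown in the example.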
18
SMB Through the Register File and ROB
Distance Prediction: Train a TAGE-like predictor with the computed distance. The predictor gives a distance before Rename. At Rename, use the distance to index into the ROB and get the physical register index of the producing store or load.
Also consider load-load pairs to share registers for a longer time.
Validation: Mark the load as depending on the producing instruction, read the shared register at issue, and compare its value to the value coming from the D-Cache at execute.
To keep things simple, squash the pipeline on a misprediction.
19
Experimental Framework
Gem5-x86, Haswell-like (4GHz, 192ROB, 60IQ, 72LQ, 42SQ, 19-cycle, 256/256 INT/FP reg.), 8-wide, 6-issue.
32KB I/Dcache, 1MB L2.
12KB TAGE-like distance predictor, 156KB DDT*
SPEC'00/'06, 50M warmup, 100M simulated.
Experiments: A case for the ISRB. Not a case for ME. Not a case for SMB.
*Comparable results with an 8.6KB DDT
20
Move Elimination
Small average speedup (~1%).
A 32-entry ISRB captures all the potential; 16 entries capture most of it.
21
Speculative Memory Bypassing
A 24-entry ISRB captures most of the potential.
Most of the speedup can be explained by the fact that the distance predictor is able to correct some of the mistakes made by the memory dependency predictor implemented in gem5.
22
Combined
Less potential for SMB if too many move eliminations take place and the ISRB is too small.
32-entry to get the best of both worlds.
Starting at 24 entries, SMB only is more interesting than SMB + ME.
23
Summary
The renamer can execute some instructions through register sharing.
Register sharing is not trivial.
The ISRB: small, checkpointable structure permitting physical register sharing.