Practical Value Speculation for Future High-End Processors
Arthur Perais & André Seznec
Hi everyone, my name is Arthur Perais and I am a third-year PhD student at Inria, where I am advised by André Seznec. These past two years I have been working on Value Prediction, so this is what we will talk about today.
Why Value Prediction?
Sequential performance is important but hard to improve. The « natural » way is to increase the superscalar width, but this raises complexity, power, and timing issues. Currently, speculation is used to maximize the utilization of the resources we can implement: branch prediction to feed the execution core, and memory dependency prediction to increase ILP. What about Value Prediction to increase ILP?
A remaining problem today is that increasing sequential performance is hard, but it is still needed, even in the era of multicores/manycores, because of Amdahl's Law. The « natural » way to increase sequential performance is to leverage ILP better by increasing the superscalar width; however, we quickly run into power and timing issues. So processors speculate instead: branch prediction to feed the core, and memory dependency prediction to reorder some memory instructions and increase ILP. But there are other ways to speculate in order to increase ILP that are not implemented today, such as Value Prediction, which is the focus of this presentation.
Value Prediction [Lipasti96][Mendelson97]
Breaks true data dependencies to extract more ILP. For example, a dependent chain I1 → I2 → I3 → I4 → I5 becomes two chains if I3 is predicted. The key observation behind VP is that instructions often produce the same result, so the idea is to predict those results to break true data dependencies: the single chain of instructions can be broken into two chains that can be executed concurrently.
Map of the Problematique
Existing predictors may not be adapted. Validation & recovery at execute: validation in the OoO core, selective replay. Register file: write ports to write predictions, read ports to validate predictions. Predictor implementation considerations: multiporting for superscalar VP, big storage footprint.
Unfortunately, VP has so far been considered too complex to implement, for several reasons. First, just as for branch prediction, we need a predictor; however, existing prediction mechanisms may not be adapted for implementation, and we will see why in a minute. Second, predictions need to be checked, and recovery must take place in case of a misprediction. This is usually done as soon as possible, at execute, so there is basically a comparator behind each functional unit; moreover, recovery is done by selectively replaying only the instructions that need to be, which entails complex hardware in an already very complex piece of hardware. Third, predictions need access to the register file: they are written at dispatch and read at validation time. This entails additional ports on the PRF, which one would usually try to avoid since size and power grow quadratically with the port count. Lastly, regardless of its prediction scheme, the predictor must provide several predictions per cycle, and since it predicts 64-bit values, it can be quite big. These are the issues we will try to address during this talk.
1.1 A Novel Prediction Scheme Avoiding Usual Value Predictor Shortcomings
So let's begin with some state of the art on value predictors, their shortcomings, and how to avoid them.
Prediction Schemes
Computational: apply a function to the last result/prediction. Examples: Last Value Predictor [Lipasti96], Stride [Gabbay98], 2-delta Stride [Eickemeyer93].
There are two main families of predictors. First, computational predictors "compute" the prediction by applying a function to the result produced by the last occurrence.
Stride Predictor: Regular Prediction
For a given dynamic instance of a static instruction, get the retired last value and the stride from the table indexed by the PC, and add them to form the prediction. If there is already an instance of the same static instruction in flight, the current instance should use the result of that first instance as its last value. However, that result has not been retired, or maybe not even computed, so it is not available in the retired last value table.
Stride Predictor: Inflight Instances
A speculative window is required in the predictor to provide coherent predictions: the first instance puts its predicted result in the speculative window; then, when the second instance arrives, it can fetch this predicted result and use it as its last value.
Stride Predictor: Back-to-back Prediction
If two instances are back-to-back (e.g., in tight loops), the prediction of the previous instance is not yet in the speculative window when the second instance has to be predicted. We need to bypass the prediction from the output of the adder to its input so that the second instance can be predicted in the next cycle. For back-to-back prediction, there exists a 1-cycle critical prediction loop that consists of the multiplexer and the adder. This should be implementable, as a full 64-bit addition is done in one cycle in the OoO core.
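To make the mechanism concrete, here is a minimal Python sketch of a 2-delta stride predictor with a speculative window. It is a behavioral illustration under assumptions of my own (dictionary-based tables, training at retirement, no entry eviction), not the hardware organization or the exact configuration evaluated in the talk.

```python
# Behavioral sketch of a 2-delta stride predictor with a speculative window.
# Table organization and sizes are illustrative assumptions.

class TwoDeltaStride:
    def __init__(self):
        self.table = {}        # pc -> (retired_last_value, stride1, stride2)
        self.spec_window = {}  # pc -> most recent in-flight prediction

    def predict(self, pc):
        last, _, stride2 = self.table.get(pc, (0, 0, 0))
        # An older in-flight instance supplies the last value if present,
        # since its result is not in the retired table yet.
        if pc in self.spec_window:
            last = self.spec_window[pc]
        pred = last + stride2
        self.spec_window[pc] = pred  # visible to younger instances
        return pred

    def train(self, pc, result):
        # Called at retirement with the actual result. Removing the matching
        # speculative window entry at retirement is omitted for brevity.
        last, stride1, stride2 = self.table.get(pc, (result, 0, 0))
        new_stride = result - last
        if new_stride == stride1:    # 2-delta: confirm a stride twice
            stride2 = new_stride     # before using it for prediction
        self.table[pc] = (result, new_stride, stride2)
```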
Prediction Schemes
Context-based: observe the stream of local values and identify patterns. The main representative is the Finite Context Method (FCM) [Sazeides97&98]. Context-based predictors try to identify repeating patterns in the value history of instructions.
FCM Predictor: Regular Prediction
The idea is to get the last n retired values produced by instances of a given instruction from the Value History Table (VHT), then hash them to access the prediction in the Value Prediction Table (VPT). If there already are previous instances in flight for this static instruction, their results are part of the history for the current instance, but they have not retired yet.
FCM Predictor: Inflight Predictions
Once again, we need a speculative window to get the most recent history and potentially merge it with the retired history.
FCM Predictor: Back-to-back Prediction
Lastly, if we want to predict two instances back-to-back, the value predicted for the first instance is not available in the speculative window when the second instance arrives. We need to bypass it from the output of the VPT so it can be merged with the most recent history from the speculative window. This means that to predict back-to-back instances, the merge, the hash, and the VPT read must all fit in a single cycle. Contrary to the stride predictor, all those steps are unlikely to fit in a single cycle, especially if one wants a large VPT.
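For comparison, here is a similarly simplified sketch of an order-4 FCM, again under illustrative assumptions (a toy hash, dictionary tables, no speculative window shown). Note how predict() depends on the most recent history: in hardware, the merge, the hash, and the VPT read would all have to fit in one cycle for back-to-back prediction.

```python
# Behavioral sketch of an order-4 FCM: the Value History Table (VHT) keeps
# the last 4 values per PC; their hash indexes the Value Prediction Table
# (VPT). The hash function and sizes are placeholders, not the evaluated ones.

VPT_SIZE = 8192

class OrderFourFCM:
    def __init__(self):
        self.vht = {}              # pc -> last 4 values (oldest first)
        self.vpt = [0] * VPT_SIZE  # hashed history -> predicted value

    def _index(self, history):
        h = 0
        for v in history:          # toy fold standing in for a real hash
            h = (h * 31 + v) % VPT_SIZE
        return h

    def predict(self, pc):
        history = self.vht.get(pc, [0, 0, 0, 0])
        return self.vpt[self._index(history)]

    def train(self, pc, result):
        history = self.vht.get(pc, [0, 0, 0, 0])
        self.vpt[self._index(history)] = result  # learn value after pattern
        self.vht[pc] = history[1:] + [result]    # shift the result in
```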
Prediction Schemes: Shortcomings
A speculative window is usually required. How do you build it? Existing context-based predictors may not be adapted in case of predictable tight loops.
If we summarize the main shortcomings of existing predictors: a speculative window is usually required to provide coherent predictions when several instances are in flight, and the design of this window has seldom been considered in previous work. Moreover, because of the critical prediction loop, context-based predictors may be unable to predict back-to-back occurrences, so they may have to give up some potential in some cases.
Leveraging Branch Prediction
Remove the need for the previous result(s) to generate a prediction by using control flow as context. Indirect target prediction is a specific case of value prediction, so we modify ITTAGE [Seznec06] to handle all instructions eligible for VP.
To address these shortcomings, we propose to remove the need for the previous result(s) to generate a prediction for the current instance, by using control flow as context instead of dataflow. Noting that indirect target prediction is a specific case of VP, we can modify a state-of-the-art indirect target predictor to handle all instructions. We choose the ITTAGE predictor.
Introducing VTAGE
There are two parts in VTAGE: a base predictor, a tagless Last Value Predictor (VT0), and N partially tagged components (VT1..VTN). The tagged components are accessed with a hash of the PC and increasingly long slices of the global branch history (ghist), in a geometric fashion: 2 bits of ghist for VT1, 4 for VT2, 8 for VT3, and so on. Thanks to this, the majority of the storage is dedicated to short histories, but the predictor is still able to capture correlation using a very long history. Each tagged entry holds a prediction, a confidence counter, a partial tag, and a useful bit; a prediction is used only if its confidence counter is saturated.
Predicting with VTAGE
To predict, all components are accessed in parallel, and the prediction flows from the rightmost matching component, i.e., the one indexed with the longest history.
Predicting with VTAGE: No Tag Match
If there is no tag match in the tagged components, the prediction flows from the base predictor (VT0). Then, the value of the confidence counter tells us whether we can use the prediction or not.
What VTAGE is Really About
Context is available and easy to manage. The result of the previous instance is not required: no speculative window, and no prediction critical loop, so back-to-back occurrences can be seamlessly predicted. VTAGE is an alternative, practical context-based predictor.
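As an illustration, here is a sketch of the VTAGE lookup in Python. The hash functions, tag width, and saturation threshold are assumptions of mine; the point is the structure: parallel lookups, the longest-history (rightmost) tag match wins, and the tagless base predictor is the fallback. Nothing in predict() depends on the previous result, which is why no speculative window is needed.

```python
# Sketch of a VTAGE lookup. Components are modeled as dictionaries mapping
# an index to (tag, prediction, confidence); real tables are SRAM arrays.

HISTORY_LENGTHS = [2, 4, 8, 16, 32, 64]   # geometric history lengths

def vtage_predict(pc, ghist, base, tagged):
    """base: list of (pred, conf); tagged: one dict per tagged component."""
    pred, conf = base[pc % len(base)]                # VT0 fallback
    for comp, L in zip(tagged, HISTORY_LENGTHS):
        folded = hash((pc, tuple(ghist[:L])))        # toy hash of pc + ghist
        idx = folded % 1024
        tag = (folded >> 10) & 0xFFF                 # toy partial tag
        entry = comp.get(idx)
        if entry is not None and entry[0] == tag:
            pred, conf = entry[1], entry[2]          # rightmost match so far
    return pred if conf == 7 else None               # 3-bit counter saturated?
```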
1.2 VTAGE: Results
So let us see how it performs against other predictors.
Simulator and Benchmarks
Cycle-level simulator: gem5 (x86_64). 4GHz, 8-wide, 20-cycle branch misprediction penalty (fetch-to-commit latency of 19 cycles), 256-entry ROB, 128-entry IQ, 48LQ/48SQ. 32KB L1D/L1I, 2MB unified L2 with stride prefetcher, 4GB DDR. Single-thread benchmarks: a subset of SPEC'00 and SPEC'06 (19 benchmarks). Simpoint: one slice per benchmark, warmup for 50M instructions, run for 50M instructions.
Predictors
8K-entry 2-delta Stride predictor [Eickemeyer93]. 8K-entry VHT / 8K-entry VPT order-4 FCM [Sazeides97]. 6+1-component VTAGE (6 x 1K-entry tagged + 8K-entry tagless LVP), history lengths from 2 to 64 bits. 3-bit saturating confidence counters (a prediction is used only if its counter is saturated), incremented on a correct prediction and reset on an incorrect one. Recovery uses an optimistic 0-cycle selective reissue. All predictors can predict back-to-back occurrences.
We evaluate one representative of the computational family, the 2-delta stride predictor, which is good at predicting strided patterns; one representative of the context-based family, an order-4 FCM predictor, which is good at predicting repeating patterns; and finally VTAGE, which is good at predicting control-flow-dependent patterns. To recover, we use an optimistic 0-cycle selective reissue mechanism: we instantly identify all the instructions that need to be replayed and mark them as not executed, so they can begin to execute again in the cycle following the misprediction.
Some Important Metrics
Only confident predictions are actually used in the pipeline. We differentiate:
Coverage = correct predictions / total dynamic predictable instructions
Accuracy = correct predictions / (correct predictions + incorrect predictions)
One is meaningless without the other, and neither is really conclusive regarding speedup. Let's just define these metrics to be sure that we talk about the same thing.
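A toy numeric example (the counts are made up, purely illustrative) showing how the two metrics can diverge:

```python
# Made-up counts for one run: high accuracy does not imply high coverage.
correct, incorrect, predictable = 600, 10, 1000
coverage = correct / predictable            # 0.60: how much we predict
accuracy = correct / (correct + incorrect)  # ~0.984: how often we are right
```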
Predictor Coverage and Accuracy
The top part shows coverage while the bottom part shows accuracy. Overall, VTAGE has better coverage than FCM, and sometimes even better than stride; however, stride has better coverage in some other benchmarks. This hints that different predictors are good at predicting distinct types of patterns. Accuracy is roughly similar in general, except in some particular cases such as gzip, gamess, and hmmer, where VTAGE is above. However, the principal interest of VTAGE is not its accuracy but the way it operates.
Speedup – 0-cycle Selective Reissue
Interesting speedup in several benchmarks, minor speedup in the remaining ones. First, there is not much correlation with coverage: for instance, namd has high coverage but no speedup, because ILP is already quite high and instructions on the critical path are not well predicted. Nonetheless, we find the same patterns: in applu, gcc, gamess, and h264, VTAGE performs better, and in bzip, stride performs better, so different predictors are adapted to different benchmarks. Overall, we get very interesting speedups in some cases, and minor ones in the remaining benchmarks.
Map of the Problematique
Existing predictors may not be adapted → VTAGE. Validation & recovery at execute: validation in the OoO core, selective replay. Register file: write ports to write predictions, read ports to validate predictions. Predictor implementation considerations: multiporting for superscalar VP, big storage footprint.
So now we have a predictor that solves the high-level issues with existing predictors, namely the need for a speculative window and the inability to predict back-to-back instructions in tight loops. This predictor even performs better than an equivalent context-based predictor.
2.1 Pushing Validation and Recovery Out of the Out-of-Order Engine
Let us now deal with the issue of prediction validation and recovery, which usually happen inside the out-of-order engine.
Validation and Recovery for Value Prediction
Prediction validation is usually done out-of-order, at the output of the FUs. Recovery: pipeline squashing, simple but slow; selective replay (reissue), very complex but faster. Correctness must be ensured, so no misprediction should go unnoticed.
So far, validation has usually been done at execute time, at the output of the FUs. This implies modifying a piece of hardware you don't really want to modify. And once a misprediction is found, how do you recover?
Avoid Selective Replay if Possible
The issue with selective replay is that the dependency chain to replay is arbitrarily long (although bounded by the ROB size): for instance, a mispredicted value can flow through a store in the store queue and be forwarded to a dependent load in the load queue, which then feeds further dependents. The penalty might not be that low either, because selective replay is clearly not trivial to implement.
Getting Performance from Value Prediction
Mispredictions are expensive: minimize the overall time spent recovering.
Overall penalty ≈ N_misp × Misp_penalty
Selective replay reduces the misprediction penalty; accuracy reduces the number of mispredictions. So let's take a step back and note that the overall penalty can be decreased by decreasing the number of mispredictions and/or decreasing the cost of a misprediction. However, a fast recovery mechanism is very expensive, and we want to limit the hardware cost of VP, so we focus on providing very high accuracy at a reasonable cost in coverage.
Late Validation and Recovery
Assuming high accuracy, the cost of recovering is not as important anymore: validate and recover at commit time, in-order. There is no need to modify the out-of-order engine.
If we can provide this very high accuracy, we won't mispredict often, so we can pay a high recovery cost. That is, we can delay validation and recovery until commit time. This gives us the pipeline diagram shown here, where the stages in which VP intervenes are in blue. Complexity is removed from the OoO core.
Providing Very High Accuracy
Wide (10-bit) saturating counters can do the trick, but this takes area. Instead, use 3-bit counters and a PRNG to control incrementing [Riley06]: Forward Probabilistic Counters (FPC). On a correct prediction, increment only when !(rand() % (1/proba)) holds; on an incorrect prediction, reset the counter.
These are smaller counters (for instance 3 bits) whose forward transitions are controlled by probabilities; we refer to this mechanism as Forward Probabilistic Counters and emulate it with a simple PRNG. In further experiments we use the probability vectors below (7 transition probabilities for 3-bit counters). We consider selective replay as an upper bound, and for it we use higher probabilities, since it can handle more mispredictions without hurting performance. We now have a simple confidence estimation mechanism allowing us to remove complexity from the out-of-order core by providing very high accuracy.
p_selective = {1, 1/8, 1/8, 1/8, 1/8, …}
p_squash = {1, 1/16, 1/16, …}
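A sketch of a 3-bit FPC in Python follows. The leading probabilities are the ones recoverable from the slide; the tail entries of the vector are placeholders I assume, since the slide elides them.

```python
import random

# Forward Probabilistic Counter (3 bits, states 0..7): reset on a wrong
# prediction, probabilistic increment on a correct one. Only the first
# probabilities come from the slide; the tail values are assumed.
P_SQUASH = [1.0, 1/16, 1/16, 1/16, 1/16, 1/16, 1/16]

def fpc_update(counter, correct):
    if not correct:
        return 0                                   # reset on misprediction
    if counter < 7 and random.random() < P_SQUASH[counter]:
        counter += 1                               # emulates !(rand()%(1/p))
    return counter

def fpc_confident(counter):
    return counter == 7                            # use only when saturated
```

The effect is that a counter reaches saturation only after many consecutive correct predictions on average, approximating a much wider counter at a fraction of the storage.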
2.2 Late Validation: Results
Now for some results. We use the exact same framework as in the first part: gem5, 8-wide, 4GHz, and big predictors.
Impact of FPC on Coverage
The top part shows coverage with baseline 3-bit counters, and the bottom part with FPC. While there are noticeable, sometimes significant, losses of coverage in some cases, they happen in benchmarks where accuracy was the lowest, so performance was diminished with baseline counters anyway. In the remaining cases, some coverage is lost, but this will not prevent VP from increasing performance.
Speedup – Validation & Squash at Commit
Speedup over the baseline, that is, without value prediction, when validation and squash are done at commit. The top part shows speedup with baseline 3-bit counters, and the bottom part with 3-bit FPC counters. Baseline counters are not sufficient: they lead to slowdowns. FPC guarantees no slowdown because accuracy is very high (> 99.5%). Even with this conservative recovery mechanism, we get interesting speedups.
Speedup – 0-cycle Selective Reissue
But how do we fare against the « best » recovery mechanism? From the top part, the 0-cycle recovery mechanism yields almost no slowdown even with baseline counters: even though accuracy is lower, the cost of a misprediction is much lower than for squashing. With FPC, most of the speedup is preserved, and it is similar to the speedup with validation and squash at commit. In other words, a very complex recovery mechanism does not appear necessary to avoid slowdown.
Map of the Problematique
Existing predictors may not be adapted → VTAGE. Validation & recovery at execute → validation & squash at commit. Register file: write ports to write predictions, read ports to validate predictions. Predictor implementation considerations: multiporting for superscalar VP, big storage footprint.
Thanks to a very simple confidence estimation mechanism, we are able to delay validation and recovery until commit time. In other words, we removed a great deal of the complexity usually involved with VP from the out-of-order engine.
3.1 Addressing the Register File Complexity with the EOLE Architecture
Now onwards to the next issue: the register file. A quick warning: in this section I use "scheduler" and "instruction queue" (IQ) interchangeably, and likewise "out-of-order core" and "out-of-order engine", that is, the backend where the scheduler and functional units are located.
The – Slightly – Hidden Costs of VP
More ports on the PRF: write ports to write predictions, read ports to validate/train.
What happens to the PRF if we add value prediction to a baseline pipeline? First, we need write ports on the PRF to write predictions into it at dispatch, so that the out-of-order engine can use them. Second, we need read ports to read the actual result from the PRF and validate the prediction against it, as well as to train the predictor.
Let's Count
Baseline 8-wide, 6-issue: 12 read ports, 6 write ports. VP 8-wide, 6-issue: 12R/6W for OoO execution, plus 8W to write 8 predictions per cycle into the PRF and 8R to validate/train 8 instructions per cycle. That is 12R/6W vs. 20R/14W! We cannot bear such an increase in the number of ports because of area and power constraints; we need a way to reduce complexity in the PRF.
Leveraging the – Slightly – Hidden Benefits of VP
Value Prediction provides: instructions with ready operands flowing from the value predictor, and predicted instructions that do not need to be executed before retirement. We can therefore offload execution to other, in-order parts of the core to reduce complexity in the out-of-order core, and save PRF ports in the process.
We make two key observations. First, VP provides instructions with ready operands flowing from the predictor, so some instructions are ready to execute long before they are dispatched. Second, VP provides predicted instructions that do not actually need to be executed before retirement, since dependents can use the predicted result to execute. Therefore, we can offload part of the execution from the OoO engine without lengthening the execution critical path, and we will see how to reduce the number of ports on the PRF in the process.
Introducing Early Execution
Execute ready single-cycle instructions in parallel with Rename, in-order, and do not dispatch them to the IQ. Since execution is in-order, it can be done in parallel with Rename. Early-executed instructions are not dispatched to the IQ; their results simply have to be written into the PRF, like regular predictions.
Introducing Late Execution
Execute single-cycle predicted instructions just before retirement, in-order, and do not dispatch them to the IQ either. Late execution takes place at retirement time, just before the prediction is validated. Like early-executed instructions, late-executed instructions are not dispatched to the IQ. The diagram also shows the prediction FIFO queue, where predictions are pushed at fetch and popped at validation.
{Early | OoO | Late} Execution: EOLE
Many fewer instructions enter the IQ, so we may be able to reduce the issue width: a simpler IQ, fewer ports on the PRF, less bypass, hence a simpler OoO engine overall. Moreover, non-critical predictions become useful, since the corresponding instructions can be late-executed: they no longer use any resource in the out-of-order engine, so predicting even a simple addition becomes interesting. But what about the hardware cost of this first proposition? (A sketch of how instructions might be classified follows below.)
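An illustrative way to express the classification; the predicate names and the instruction record are mine, not the paper's.

```python
# Which instructions bypass the IQ under EOLE? A sketch, assuming a simple
# decoded-instruction record with the fields used below.

SINGLE_CYCLE = {"add", "sub", "and", "or", "xor", "mov"}

def early_executable(inst):
    # Early Execution: single-cycle op whose operands are all ready in the
    # front-end (immediates or values supplied by the predictor).
    return inst["op"] in SINGLE_CYCLE and all(inst["operands_ready"])

def late_executable(inst):
    # Late Execution: single-cycle op predicted with high confidence;
    # dependents consume the prediction, so it can wait until retirement.
    return inst["op"] in SINGLE_CYCLE and inst["high_confidence_pred"]

def enters_iq(inst):
    return not (early_executable(inst) or late_executable(inst))
```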
Hardware Cost of EOLE
Early Execution: a single rank of simple ALUs and the associated bypass network; no additional PRF ports. Late Execution & Validation: a rank of simple ALUs and comparators (to validate), no bypass; but the n read ports needed to validate become 2n to also late-execute n instructions per cycle, i.e., 16R for an 8-wide pipeline.
Early Execution appears fairly light, since execution is in-order and the ALUs only handle simple operations; we do not require additional ports on top of a baseline value prediction processor, and the most expensive piece of hardware might be the full bypass. For Late Execution, which is also done in-order, we need a rank of simple ALUs and some comparators, but no bypass. However, we now need up to 16 read ports to handle 8 instructions per cycle, double what we needed to validate 8 instructions per cycle. Therefore, from 20R/14W for an 8-wide, 6-issue core with VP, we now need 28R/14W, while the baseline only needs 12R/6W, which is quite counterproductive. Fortunately, we can greatly reduce this number with simple optimizations.
3.2 Achieving Lighter Value Prediction with EOLE
That is, EOLE actually enables lighter value prediction if it is implemented carefully.
Reducing the Issue Width
If fewer instructions enter the IQ, then we can reduce the issue width (and maybe the IQ size): from 6 to 4, saving 4 read ports and 2 write ports, giving 24R/12W. The remaining issue capacity is offloaded to the Early/Late Execution stages. In our framework, we found that we could reduce the issue width from 6 to 4 without sacrificing performance. Still, 24R/12W is too many ports.
Banking the Physical Register File
Prediction and validation are done in-order, so we can bank the PRF and attribute predictions to consecutive banks, as sketched below. With 4 banks and 8 predictions per cycle (predictions pI0..pI7, and actual results aI0..aI7 for validation, spread round-robin across banks 0..3), all predictions can be written to the PRF with only 2 write ports per bank instead of 8; the idea is similar for validation and read ports. This saves 6 write ports per bank. However, due to late execution, the read port savings are not as straightforward.
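A small runnable sketch of the round-robin bank assignment, assuming 4 banks and 8 predictions written per cycle:

```python
# In-order dispatch lets consecutive destinations map to consecutive banks,
# so 8 predictions per cycle need only 2 write ports per bank (4 banks).

NUM_BANKS = 4

def bank_of(dispatch_slot):
    return dispatch_slot % NUM_BANKS

writes = {}
for slot in range(8):                       # pI0..pI7 in one cycle
    writes.setdefault(bank_of(slot), []).append(f"pI{slot}")
print(writes)   # {0: ['pI0', 'pI4'], 1: ['pI1', 'pI5'], ...}
```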
Read Port Sharing
8 instructions can be validated with 2R per bank by construction, but Late Execution might need 16R per bank to process 8 instructions per cycle, since operands can come from any bank. Fortunately, not all instructions are predictable (e.g., stores) or late-executable (e.g., loads), so we do not need the ideal number of ports to get performance. We constrain the number of read ports and share them between late execution and validation as needed; we found that 4R per bank is a good tradeoff, still assuming 4 banks.
Let's Count, Again
4-issue out-of-order engine: 4W/8R per bank. 8 predictions per cycle: 2W per bank. Constrained late execution/validation: 4R per bank. That is 12R/6W per bank in total, assuming 4 banks. From 28R/14W, we now only need 12R/6W, the same amount as the PRF without VP, except the issue width is 33% lower.
Putting It All Together
The resulting block diagram works as follows: predictions flow through Early Execution, where some instructions are also executed; predictions and early results are written to the PRF at dispatch using only 2 ports per bank (assuming 8-wide dispatch and 4 banks); regular but narrower out-of-order execution happens; single-cycle predicted instructions are late-executed just before retirement by reading their operands from the PRF; and finally, all predicted instructions are validated at commit time, with squashing on a misprediction.
Putting It All Together
EOLE provides a way to nullify the pressure applied by VP on the PRF (assuming banking is cheap). It also reduces the complexity of the OoO engine itself: smaller issue width, simpler Wakeup & Select, less bypass. This is a big plus, since the scheduler is responsible for 15 to 20% of the core power, at least in the Pentium Pro and the Alpha. EOLE needs VP to provide instructions to early/late-execute, while VP needs EOLE to mitigate the complexity it introduces: the two features are complementary.
3.3 EOLE: Results
Let's see whether we still get good speedups with EOLE on top of value prediction. We mostly use the same simulator and parameters as in the previous study, except that we consider a base issue width of 6 instead of 8, as we did not see any improvement with the latter on this set of benchmarks.
Speedup over Baseline 8-wide/6-issue
The first graph shows the speedup brought by a hybrid of VTAGE and 2-delta Stride (the predictors used in the first part) over the baseline, without EOLE. As expected, we obtain good speedup with VP, and no slowdown is observed thanks to the very high accuracy of the predictor. In further experiments, this is the performance we use as reference.
Early Executed – Late Executed
This graph shows the proportion of dynamic instructions that can respectively be early-executed, late-executed because they are high-confidence branches, and late-executed because they are single-cycle predicted instructions. Given those numbers, we expect EOLE to perform quite well, except in a few benchmarks with low EOLE potential, where we expect performance to decrease if we reduce the issue width because the predictor is not performing that well, namely milc, hmmer, and lbm.
Reducing the Issue Width
We consider both simple VP and EOLE models where the issue width is reduced from 6 to 4. In the legend, items with 4I are 4-issue and the item with 6I is 6-issue; 64IQ is the number of entries in the instruction queue. Performance is relative to the 6-issue, 64IQ model featuring the hybrid predictor from two slides ago. If we reduce the issue width of the simple value prediction model, we obtain noticeable slowdowns in almost all benchmarks. However, if we reduce the issue width of our EOLE pipeline, we observe a single slowdown, in hmmer; note that milc and lbm are not slowed down. Furthermore, if we keep the issue width of the baseline model but add Early and Late Execution (white bar), we actually get more speedup in several benchmarks, and a slight speedup in general. As a result, EOLE appears as a way to either slightly increase performance or to keep performance roughly constant while decreasing the issue width.
Limited Issue and PRF Ports
Finally, we consider the EOLE model with only 12R/6W per bank, assuming 4 banks, as discussed previously. The first bar gives performance for the 6-issue model without value prediction, for reference. The next two bars show speedup for a 4-issue EOLE without any port constraints and with port constraints, respectively. The main conclusion is that 4 read ports per bank for late execution/validation are sufficient to obtain the same performance as the unconstrained (ideal) model. Therefore, we can implement VP with a PRF that has as many ports as the baseline 6-issue without value prediction, while having reduced the issue width by 33%.
Map of the Problematique
Existing predictors may not be adapted → VTAGE. Validation & recovery at execute → validation & squash at commit. Register file → EOLE (bonus: OoO complexity reduced). Predictor implementation considerations: multiporting for superscalar VP, big storage footprint.
EOLE solves one of the big remaining issues with VP, namely the number of additional accesses made to the register file, and as a bonus we reduce complexity in the out-of-order engine. Lastly, since we have a little time, we will quickly dive into some technical details of the prediction infrastructure: although VTAGE is attractive from a high-level standpoint, most of the remaining complexity of VP actually lies in the predictor infrastructure.
4.1 Addressing Port Requirement on the Predictor with Block-Based Prediction (BeBoP)
First, let's deal with superscalar VP.
Issues with Superscalar Prediction
Superscalar VP is the ability to predict several instructions per cycle: n instructions per cycle require n predictor accesses per cycle. Multiporting and replication are too expensive, especially if n is around 8. Banking can be envisioned for RISC and for predictors that do not use local value history (e.g., LVP, Stride, VTAGE). But what if we want to use CISC or a different predictor? There is a call for a more general scheme that is efficient (i.e., avoids multiporting and replication) and accommodates both RISC and CISC.
Introducing Block-Based Prediction (BeBoP)
Instructions are usually fetched in chunks (fetch blocks). The key idea of block-based prediction, or BeBoP, as opposed to the usual instruction-based prediction, is to group the predictions belonging to a single fetch block in a single predictor entry. At fetch, the predictor is accessed using the block PC, information available in both CISC and RISC, and n predictions are retrieved in a single read: one port, n predictions.
Introducing Block-Based Prediction (BeBoP)
To illustrate this in more detail, consider BeBoP on CISC with an abstract predictor holding two predictions per entry. We fetch a 4-byte block using the fetch PC and access the predictor in parallel. Pre-decode gives the instruction boundaries; decode gives the micro-ops; rename gives the destination registers. Using this information, and the state of the confidence counters (because we only use confident predictions), we can attribute predictions to micro-ops in a sequential fashion: Pred0 goes to µop0 of I1, so to r0, and Pred1 goes to µop1 of I1, so to r1. This is just the key idea; see the paper for more technical details.
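The following sketch shows the attribution step, under my own simplifications: one prediction slot per result-producing µop, a dictionary predictor, and 3-bit confidence counters.

```python
# BeBoP sketch: one predictor entry per fetch block holds n (prediction,
# confidence) pairs; after decode/rename, the pairs are attributed to the
# block's result-producing µops in sequential order.

def attribute_predictions(block_pc, uops, predictor, n=2):
    """uops: decoded µops of the block, in program order."""
    entry = predictor.get(block_pc, [(0, 0)] * n)   # one read, n predictions
    producers = [u for u in uops if u.get("dest") is not None]
    for uop, (pred, conf) in zip(producers, entry):
        if conf == 7:                               # confident slots only
            uop["predicted_value"] = pred           # written to dest at dispatch
```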
4.2 Addressing the Predictor Storage Footprint with the Differential VTAGE Predictor
Now that we have a layout able to generate several predictions per cycle, we need to ensure that the predictor does not require a huge amount of storage to improve performance.
Limitations of VTAGE
VTAGE is big, since it stores full 64-bit values in all its components, even though there may be redundancy among the values. It cannot handle strided patterns efficiently, while those patterns are frequent in programs: each value in the pattern occupies its own entry, versus a single entry for the whole pattern in the Stride predictor. A « naïve » hybrid of Stride and VTAGE remedies this, but is inefficient: both components are trained for every instruction, and the storage requirement is high since both components must be big enough to perform well. VTAGE is not easy to shrink.
Introducing the Differential VTAGE Predictor
Inspired by the D-FCM of [Goeman et al., HPCA'01]: store differences (strides) instead of full 64-bit values in the predictor, and use a table to track last values (the Last Value Table, LVT). We get a stride from the predictor and add it to the last outcome from the LVT.
Pros: space efficient, since strides can be small (8/16/32 bits); tightly coupled, so it can predict control-flow-dependent strided patterns. It can predict like Stride, like VTAGE, and like a combination of both, which a naïve hybrid cannot, so its prediction scheme is more powerful.
Cons: slower, since an adder sits on the prediction critical path; and a prediction now depends on the previous result (the last value), so we reintroduce the critical prediction loop of the stride predictor into D-VTAGE.
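A sketch of the resulting prediction path, reusing the VTAGE-style lookup from earlier but with components returning strides. The dictionaries and the confidence threshold are illustrative assumptions; the speculative window described on the next slides would override the LVT read for in-flight instances.

```python
# D-VTAGE sketch: last value (LVT) + stride (VTAGE-like tagged components).

HISTORY_LENGTHS = [2, 4, 8, 16, 32, 64]

def dvtage_predict(pc, ghist, lvt, stride_components):
    # The last value should come from the speculative window when an
    # instance of this instruction is still in flight (not shown here).
    last = lvt.get(pc, 0)
    stride, conf = 0, 0
    for comp, L in zip(stride_components, HISTORY_LENGTHS):
        entry = comp.get((pc, tuple(ghist[:L])))     # toy tagged lookup
        if entry is not None:
            stride, conf = entry                     # rightmost matching stride
    return last + stride if conf == 7 else None      # adder on critical path
```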
A Major Issue with D-VTAGE
What if the last value has not been retired, or even computed? Ideally, we use the result of instance 0 to predict instance 1, and the result of instance 1 to predict instance 2. But in some cases the LVT has not been updated with the result of instance 1 when instance 2 must be predicted, so instance 2 sees the result of instance 0 in the LVT, which is incorrect. So we need a speculative window from which to get up-to-date last outcomes, under penalty of completely desynchronizing the predictor.
Speculative Window
Use in-flight prediction blocks as last values by means of a small, fully associative window. We can do this thanks to block-based prediction: there are fewer blocks than instructions in flight, so the window needs only a few entries; and fewer blocks than instructions are fetched each cycle, so only 1 or 2 associative lookups per cycle are needed instead of 6 or 8. This is much less complex than the broadcast-based IQ model.
Associative Speculative Window
The window is searched in parallel with the LVT when a prediction block must be generated. It is a fully associative buffer, chronologically ordered by internal sequence numbers, so a priority encoder can select the up-to-date prediction block. It is partially tagged (15 bits in our case), since false positives are allowed.
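A sketch of the lookup, assuming the 15-bit partial tags and sequence numbers described above; selecting the youngest match stands in for the priority encoder.

```python
# Fully associative speculative window: partial-tag match, youngest entry
# (highest sequence number) wins.

def spec_window_lookup(block_pc, window):
    """window: iterable of dicts {'tag', 'seq', 'last_values'}."""
    tag = block_pc & 0x7FFF                  # 15-bit partial tag
    matches = [e for e in window if e["tag"] == tag]
    if not matches:
        return None                          # fall back to the LVT
    return max(matches, key=lambda e: e["seq"])["last_values"]
```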
What is New in D-VTAGE
Let us summarize what is new in D-VTAGE, combined with BeBoP here. On the left, we add the LVT, which is direct-mapped and provides n last values per block. The speculative window is not shown for lack of space, but it is there. The rest is similar to VTAGE, except that the components store n strides (with their confidence counters) instead of full values, and the selected strides are added to the last values to form the predictions. This ends the overview of D-VTAGE.
4.3 BeBoP & D-VTAGE: Results
Finally, let us take a look at results.
Experimental Framework
Configurations without BeBoP: Baseline_6_60 (6-issue, 60-entry IQ); Baseline_VP_6_60 (6-issue, 60-entry IQ, VP with a > 256kB D-VTAGE); EOLE_4_60 (VP with a > 256kB D-VTAGE plus 4-issue EOLE [Perais & Seznec, ISCA'14], 60-entry IQ). Forward Probabilistic Counters are used for confidence estimation [Perais & Seznec, HPCA'14], and D-VTAGE has 6 partially tagged components. We mostly retain the parameters used for the study on EOLE, and we use more SPEC benchmarks, as more recent versions of gem5 are able to run them on x86.
Speedup over Baseline_VP_6_60, Instruction-Based VP
This graph shows the performance of a 4-issue EOLE model with D-VTAGE over a 6-issue model also with D-VTAGE; it simply serves to show that EOLE is viable with this predictor. EOLE reduces the issue width from 6 to 4 at a marginal cost in performance (around 0.982 in the worst case), and substantially improves performance in a few cases, thanks to the additional issue bandwidth provided by Early/Late Execution through VP. This is the performance we want to attain with a small D-VTAGE and BeBoP, so it is the baseline in further experiments. EOLE is required for a practical implementation of VP, or at least it is a way to make VP more practical.
Grouping Predictions using BeBoP on EOLE_4_60
We vary the number of predictions per entry, n, from 4 to 8 while keeping the size roughly constant. Performance is relative to EOLE_4_60 without BeBoP. The graph shows [min, max] box plots of the speedups, with the geometric mean in the box; 1/2K is the number of entries in the LVT/base predictor, and 128/256 the number of entries in each of the 6 tagged components. Since we work with CISC and 16-byte fetch blocks, more predictions per entry give better coverage inside blocks, but each entry requires more space, so there are fewer entries. At a similar number of entries, 6 predictions per entry perform similarly to 8, but noticeably better than 4. At a similar storage budget, more entries but fewer predictions per entry may perform slightly better, although this really depends on the storage budget. For now, 6 predictions per entry appears to be a good design point.
Towards Realistic Configurations with BeBoP
With 6 predictions per entry and a 2K + 6x256-entry D-VTAGE predictor, we gauge the usefulness of the block-based speculative window, going from an infinite window to no window at all. Performance is relative to EOLE_4_60 without BeBoP. The window is needed in a few benchmarks (wupwise, applu, bzip, and xalancbmk): without it, performance decreases by up to 18%. However, only a few entries are necessary, around 32 to 56. This means that even though the window is associative, its complexity is not really comparable to that of a broadcast-based scheduler.
Towards Realistic Configurations with BeBoP
Tradeoff on predictor size: all D-VTAGE components contain strides, except the LVT, so partial strides can be used. There is virtually no performance difference between 16-, 32-, and 64-bit strides, and only a marginal slowdown with 8-bit strides (0.5% on average, 3% max). The predictor size is reduced by a factor of 2.1 by using 8-bit strides for the 2K + 6x256 configuration (6 predictions per entry). Partial strides are a very efficient way to reduce the predictor footprint; note that it would not have been possible to reduce the size of VTAGE this way.
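A small sketch of how 8-bit partial strides might be stored and used; the encoding below is an assumption of mine, and the paper's exact handling of out-of-range strides may differ.

```python
# 8-bit partial strides: store the low byte, sign-extend on read. Strides
# outside [-128, 127] simply cannot be captured by this configuration.

def stride_fits(stride, bits=8):
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return lo <= stride <= hi

def encode_stride(stride, bits=8):
    return stride & ((1 << bits) - 1)        # keep the low `bits` bits

def decode_stride(raw, bits=8):
    sign = 1 << (bits - 1)
    return (raw ^ sign) - sign               # sign-extend

assert decode_stride(encode_stride(-3)) == -3
```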
Final D-VTAGE Models
Three final configurations, all with 6 predictions per entry:
Small: 128-entry LVT/base + 6x128 tagged, 32-entry speculative window, 8-bit strides: 17.18kB.
Medium: 256-entry LVT/base + 6x256 tagged, 32-entry speculative window, 8-bit strides: 32.76kB.
Large: 512-entry LVT/base + 6x256 tagged, 56-entry speculative window, 16-bit strides: 61.6kB.
We use Baseline_6_60 as the baseline: this is the modern CPU model we want to outperform. The objective is to minimize the slowdown compared to EOLE_4_60 with an instruction-based D-VTAGE, the « ideal » model in our case. In accordance with previous observations, we try to keep the tagged components as big as possible, even if it means using a fairly small LVT/base component; most of the reduction in size comes from the use of 8/16-bit strides, which would not be possible in VTAGE.
Final D-VTAGE Models
Realistic VP with 16kB of storage (Small) increases performance by almost 9% on average. Performance is increased by around or more than 30% in 6 benchmarks out of 36 (and by more than 10% in 11 out of 36). Not so bad for a predictor with so few entries.
Final D-VTAGE Models
VP with 32/64kB of storage increases performance by 11% and 13% on average, respectively (EOLE_4_60 is at 15%). Medium is generally on par with Large, and it is very close to the unrealistic instruction-based EOLE_4_60 in 5 of the 14 benchmarks showing good speedup; performance is still good in most of the 9 remaining « well behaving » benchmarks. Overall, we can obtain good enough speedup with only 32kB of storage.
Map of the Problematique
Existing predictors may not be adapted → VTAGE. Validation & recovery at execute → validation & squash at commit. Register file → EOLE (bonus: OoO complexity reduced). Multiporting for superscalar VP → BeBoP, which provides superscalar VP with single-ported arrays.
Map of the Problematique
Big storage footprint → D-VTAGE, an efficient predictor with a reasonable storage footprint. So we claim to have addressed the complexity remaining in the predictor infrastructure.
4 Concluding Remarks
Ship it!
Value Prediction in a Processor?
Validation and squash at commit: no overhaul of the OoO engine [Perais & Seznec, HPCA'14]. EOLE: avoid additional ports on the PRF for VP and reduce the complexity of the OoO engine [Perais & Seznec, ISCA'14]. BeBoP: block-based prediction to enable superscalar prediction with single-ported arrays. D-VTAGE: a tightly coupled hybrid with good performance at a reasonable budget, thanks mostly to partial strides [Perais & Seznec, HPCA'15].
That's All Folks! (Backup Q&A)
What about using a simple Stride/2-delta Stride predictor? D-VTAGE contains a stride predictor, and a per-path stride predictor is better, so global branch history should help. We did not simulate a very small stride predictor, so I cannot give a definitive answer; what we provide is a way to benefit from the prediction scheme of VTAGE while being able to reduce the size of the predictor to an acceptable one, gaining the ability to predict strided patterns and even more. The only issue then arises in code where there is aliasing because the Last Value Table is too small and instructions in the block are predictable by stride; the aliasing disappears when the size of the LVT is doubled. For the tagged components, the least significant bits can be XORed with the PC of the fetch block instead of using local tags; this cannot be done for the last values.
Accuracy is > 99.6% in all cases, and > 99.9% in 21 benchmarks out of 36. Coverage is 47.6% for D-VTAGE vs. 34.3% for VTAGE and 35.7% for 2-delta Stride, on average. On average, for D-VTAGE with the baseline configuration (no EOLE), loads represent 26.12% of the predictions; among those, 9.1% are long-latency (latency > L2 hit), and they amount to 7.5% of the total number of long-latency loads.
Power and energy are mentioned in all reviews. VP as implemented here decreases consumption because performance increases and the issue width is reduced (thanks to EOLE), but it increases power consumption because of the simple additional ALUs required by EOLE, the value predictor, and the speculative value prediction window. Given that the scheduler is responsible for a substantial part of the consumption of the core (18% in the Alpha 21264, 16% in the Pentium Pro, as summarized by Ernst and Austin, "Efficient Dynamic Scheduling Through Tag Elimination", ISCA'02), a reduction of the issue width and its implications clearly gives us headroom. The value predictor itself is comparable in design (number of tables, ports, storage volume, pressure), and therefore in power consumption, to an aggressive branch predictor. Regarding the speculative window, we argue that 32 entries are a good tradeoff: this is roughly two times fewer entries than Haswell's scheduler, so not equivalent to the scheduler we simulate. Indeed, assuming a naive CAM-like scheduler and 6/8 results per cycle, each scheduler entry must provision 12/16 comparators for wakeup, assuming 2 operands per entry (AMD Bulldozer's actually has 4; see "40-Entry Unified Out-of-Order Scheduler and Integer Execution Unit for the AMD Bulldozer x86-64 Core", ISSCC 2011). The speculative window only requires as many comparators per entry as there are blocks fetched per cycle (2 in our study), granted that the comparators are bigger since we match 15 bits. As a result, the complexity and power consumption of the speculative window should be much lower than those of the scheduler.