1 BeBoP: A Cost Effective Predictor Infrastructure for Superscalar Value Prediction
Arthur Perais & André Seznec
This presentation is about how to build a simple prediction infrastructure that handles several predictions per cycle.

2 Outline
Value Prediction so Far.
Cost Effective Mechanisms for Superscalar VP: Block-Based Prediction (BeBoP); the Differential VTAGE Predictor.
Experimental Results.
Conclusion.
First, I'll begin with a summary of what has been done recently in this field. Second, I will present two mechanisms: BeBoP, which allows superscalar value prediction with single-ported structures, and the D-VTAGE predictor, which allows good VP performance at a reasonable storage budget. Third, I will show that a realistic infrastructure using those two mechanisms performs on par with a more idealistic implementation of VP. Finally, I'll give some concluding remarks.

3 Value Prediction so Far
So, let's jump into the state of the art on Value Prediction.

4 Value Prediction [Lipasti & al. 96][Mendelson & al. 97]
Breaks true data dependencies to extract more ILP.
[Figure: a dependency chain I1 -> I2 -> I3 -> I4 -> I5 becomes two independent chains if I3 is predicted.]
First of all, a quick refresher on the idea behind VP: a chain of dependent instructions can be broken into two independent chains if I3 is predicted.

5 Value Prediction [Lipasti & al. 96][Mendelson & al. 97]
Instruction-based VP. Prediction functions:
Last Value: Pred = result of the last instance.
Stride: Pred = result of the last instance + a constant.
FCM: Pred based on the results of the n most recent instances.
VTAGE: ≈ control-flow-based Last Value predictor.
[Figure: a predictor receives an instruction PC and returns a prediction.]
Prediction is handled by a predictor to which we provide an instruction PC and from which we get a prediction. Several prediction schemes have been proposed, such as Last Value, Stride, local value history-based (FCM), and global branch history-based (VTAGE). A sketch of the simplest stride scheme follows.
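To make the stride scheme concrete, here is a minimal software sketch of a PC-indexed stride predictor. The table size, index hash, and absence of confidence counters are illustrative simplifications, not the designs evaluated in this work.

```python
# Minimal sketch of a PC-indexed stride value predictor. Each entry keeps
# the result of the last instance and the last observed stride; the
# prediction is last + stride. Table size and hashing are illustrative.

class StridePredictor:
    def __init__(self, n_entries=4096):
        self.last = [0] * n_entries
        self.stride = [0] * n_entries
        self.n = n_entries

    def _index(self, pc):
        return (pc >> 2) % self.n       # simple direct-mapped index

    def predict(self, pc):
        i = self._index(pc)
        return self.last[i] + self.stride[i]

    def update(self, pc, result):       # called when the instruction retires
        i = self._index(pc)
        self.stride[i] = result - self.last[i]
        self.last[i] = result
```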

6 Implementing Value Prediction
Validation & recovery at execute: HW in the OoO Core. Selective replay.
Register File: Write ports to write predictions. Read ports to validate predictions.
The predictor is an issue in itself: Multiporting for superscalar VP. Big storage footprint.
Unfortunately, VP has so far been considered too complex to implement, for several reasons. First, predictions need to be checked, and recovery must take place on a misprediction. This is usually done at execute by selectively replaying instructions, which entails complex hardware in an already very complex part of the machine. Second, predictions need access to the register file to be written and validated, which entails additional ports on the PRF, something one would usually try to avoid. Third, regardless of its prediction scheme, the predictor must provide several predictions per cycle, and since it predicts 64-bit values, it can be quite big.

7 Value Prediction Recently
Validation & Recovery [Perais & Seznec, HPCA'14]:
Only use highly confident predictions to minimize the number of mispredictions.
Squash the pipeline on a wrong prediction.
Validate outside the OoO engine, at commit.
[Figure: Fetch and VPredict feed an n-issue out-of-order engine (ROB, IQ, PRF, FUs); validation and squashing happen at commit.]
Fortunately, some mechanisms were recently proposed to make VP more practical. The first one dealt with prediction validation and recovery. In particular, given very high accuracy from the predictor, predictions are seldom wrong, so a high recovery cost can be absorbed. This means that we can squash the pipeline on a misprediction instead of using selective replay, and that validation can be done as late as possible, just before commit, hence outside the out-of-order engine. This yields a pipeline organization where VP intervenes almost only in the front-end, outside the OoO engine.

8 Implementing Value Prediction
Validation & recovery at execute: HW in the OoO Core. Selective replay. -> Validation & squash at commit.
Register File: Write ports to write predictions. Read ports to validate predictions.
The predictor is an issue in itself: Multiporting for superscalar VP. Storage footprint.
This mechanism solves one of the big remaining issues with VP.

9 Value Prediction Recently
Leverage VP to reduce complexity in the OoO engine. Early Execution: execute instructions in the front-end if their operands are predicted.
[Figure: Fetch and VPredict feed an Early Exec stage in parallel with Rename, before the OoO engine.]
Next, previous work notes that VP provides operands for instructions early in the pipeline; therefore, instructions with predicted sources can be executed early, in order, in parallel with Rename.

10 Value Prediction Recently
Leverage VP to reduce complexity in the OoO engine. Late Execution: execute predicted instructions outside the OoO engine, at commit.
[Figure: same pipeline, with a Late Execution stage fed from the OoO engine, next to validation and squashing at commit.]
Similarly, predicted instructions become non-critical since their result is already available: it is predicted. Therefore, with Late Execution, predicted ALU instructions are executed as late as possible, just before commit.

11 Value Prediction Recently
Leverage VP to reduce complexity in the OoO engine [Perais & Seznec, ISCA'14]: Early/Late Execution (EOLE) to reduce the issue width. Simpler wakeup & select, simpler bypass, fewer PRF ports.
With Early and Late Execution, also known as the EOLE architecture, the OoO issue width can be reduced since far fewer instructions need to be executed out of order. This mechanically saves PRF ports and greatly reduces complexity in a critical piece of hardware.

12 Implementing Value Prediction
Validation & recovery at execute: HW in the OoO Core. Selective replay. -> Validation & squash at commit.
Register File: Write ports to write predictions. Read ports to validate predictions. -> EOLE. Bonus: OoO complexity--.
The predictor is an issue in itself: Multiporting for superscalar VP. Storage footprint.
In conclusion, we get VP-enabled performance with as many PRF ports as a pipeline without VP, and a simpler OoO engine (e.g., 4-issue instead of 6-issue). Yet we still have to deal with the predictor itself for a practical implementation of VP.

13 Cost Effective Mechanisms for Superscalar VP
Block-Based Prediction (BeBoP)
This is why we need cost-effective mechanisms to implement the predictor. The first one we introduce today is Block-Based Prediction, abbreviated BeBoP, which deals with superscalar VP.

14 Issues with Superscalar Prediction
n instructions per cycle require n predictor accesses per cycle.
Multiporting and replication are too expensive, especially if n is around 8.
Banking can be envisioned for RISC and for predictors that do not use local value history (e.g., LVP, Stride, VTAGE).
What if we want to use CISC or a different predictor? We need a more general scheme.
Superscalar VP is simply the ability to predict several instructions per cycle. The intuitive view is that n instructions per cycle require n accesses per cycle. How can this be handled? We can multiport or replicate the predictor, but that is very expensive. If some constraints are met (a RISC ISA and a predictor that does not use local value history), we can bank it. But what if we want CISC or another predictor? There is a call for a more general scheme that is efficient (i.e., avoids multiporting and replication) and accommodates both RISC and CISC.

15 Introducing Block-Based Prediction (BeBoP)
Instructions are usually fetched in chunks (fetch blocks).
Group the predictions belonging to a single fetch block in a single predictor entry.
At fetch, the predictor is accessed using the block PC, information available in both CISC and RISC.
n predictions are retrieved in a single read: one port, n predictions.
The general scheme we propose, block-based prediction (BeBoP), as opposed to the usual instruction-based prediction, is based on the observation that sequential instructions are grouped in fetch blocks in the instruction cache. The key idea is to group the predictions belonging to a single fetch block in a single predictor entry.

16 Introducing Block-Based Prediction (BeBoP)
Attribute predictions to µops in sequential order.
[Figure: a 4-byte fetch block b0..b3 is fetched and pre-decoded into instructions I0..I2 while the predictor entry (Pred0/Conf0, Pred1/Conf1) is read; after decode and rename, confident ("sat?") predictions are attributed to µops in sequential order and written to r0 and r1.]
To illustrate this in more detail, consider BeBoP on CISC with an abstract predictor holding two predictions per entry. We fetch a 4-byte block using the fetch PC and access the predictor in parallel. At pre-decode, we get the instruction boundaries; at decode, the µops; at rename, the destination registers. Using this information and the state of the confidence counters (we only use confident predictions), we can attribute predictions to µops in a sequential fashion: Pred0 goes to µop0 of I1, hence to r0, and Pred1 goes to µop1 of I1, hence to r1. This is only the big picture, but it gives the key idea of BeBoP; a sketch of the attribution step follows.
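As a rough illustration of the attribution step, here is a sketch in Python. The `Uop` shape, the eligibility test, and the slot-consumption policy are assumptions made for the example, not the exact hardware behavior.

```python
from collections import namedtuple

# Hypothetical micro-op descriptor: destination register and whether the
# uop produces a register value (e.g., stores would not).
Uop = namedtuple("Uop", ["dest_reg", "produces_value"])

def attribute_predictions(block_preds, block_confs, uops, threshold=1.0):
    """Assign a fetch block's predictions to its value-producing uops,
    in sequential order, keeping only confident predictions."""
    assigned = {}
    slot = 0
    for uop in uops:
        if slot >= len(block_preds):
            break                        # more producers than prediction slots
        if not uop.produces_value:
            continue                     # non-producing uops consume no slot
        if block_confs[slot] >= threshold:   # the "sat?" confidence check
            assigned[uop.dest_reg] = block_preds[slot]
        slot += 1                        # slot consumed whether used or not
    return assigned

# Example: two predictions per entry, the block decodes to three uops.
uops = [Uop("r0", True), Uop("r1", True), Uop(None, False)]
print(attribute_predictions([42, 7], [1.0, 0.5], uops))  # {'r0': 42}
```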

17 Cost Effective Mechanisms for Superscalar VP
The Differential VTAGE Predictor
We are now able to generate several predictions per cycle with a single read port. Yet we still need to reduce the predictor footprint. To that end, we improve on the VTAGE predictor by proposing the differential VTAGE predictor, D-VTAGE.

18 A reminder on VTAGE
Base predictor: tagless LVP. N partially tagged components.
[Figure: a tagless base table VT0 and tagged tables VT1..VTN, indexed by hashing the PC with global history prefixes ghist[0,L(1)] .. ghist[0,L(N)]; each tagged entry holds a prediction, a confidence counter, a tag, and a useful bit; tag comparators select the prediction, used only if the counter is saturated.]
First of all, a quick reminder on VTAGE, shown without block-based prediction for clarity. VTAGE has two parts: a base predictor, which is a tagless Last Value predictor, and N partially tagged components, accessed with the PC and some bits of the global branch history, in a geometric fashion: 2 bits of history for VT1, 4 for VT2, 8 for VT3, and so on.

19 Predicting with VTAGE
The rightmost matching component predicts (VTN).
[Figure: same structure; all components match, so the prediction flows from VTN.]
To predict, all components are accessed in parallel. The prediction flows from the rightmost matching component, that is, the one indexed with the longest global branch history. In this example, all components match, so the prediction flows from VTN.

20 Predicting with VTAGE
No tag match: the base predictor predicts (VT0).
[Figure: same structure; no tagged component matches, so the prediction flows from VT0.]
If there is no match in the tagged components, the prediction flows from the base predictor. The value of the confidence counter then tells us whether we can use the selected prediction. The sketch below summarizes the lookup.
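The lookup can be summarized with the following sketch, under illustrative assumptions: tables are plain Python lists, `hash` stands in for the real index and tag functions, and the history lengths grow geometrically (e.g., 2, 4, 8, ...).

```python
# Minimal sketch of a VTAGE-style lookup. VT0 is a tagless last-value
# table; VT1..VTN are partially tagged tables indexed by hashing the PC
# with geometrically longer global branch history prefixes.

def vtage_predict(pc, ghist, base, tagged, hist_lens, conf_threshold=3):
    """Return (prediction, use_it); the rightmost matching component wins."""
    pred, ctr = base[pc % len(base)]            # default: base predictor VT0
    for table, hlen in zip(tagged, hist_lens):  # VT1 .. VTN, short to long
        idx = hash((pc, ghist[:hlen])) % len(table)
        tag = hash(("tag", pc, ghist[:hlen])) % (1 << 10)  # partial tag
        entry = table[idx]                      # (pred, ctr, tag) or None
        if entry is not None and entry[2] == tag:
            pred, ctr = entry[0], entry[1]      # longer history overrides
    return pred, ctr >= conf_threshold          # "sat?": use only if confident
```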

21 Limitations of VTAGE
VTAGE is big since it stores full 64-bit values in all its components.
VTAGE cannot handle strided patterns efficiently: each value in the pattern occupies its own entry (versus a single entry for the whole pattern in the Stride predictor).
A "naive" hybrid of Stride and VTAGE remedies this, but is inefficient: both components are trained for every instruction, and the storage requirement is high.
Of course, VTAGE is not perfect. In particular, it cannot handle strided patterns efficiently, although those patterns are frequent in programs. The storage requirement of a hybrid is high because both components must be big enough to perform well, and VTAGE is not easy to shrink.

22 Introducing the Differential VTAGE Predictor
Inspired by the D-FCM of [Goeman & al., HPCA'01]:
Store differences (strides) instead of full 64-bit values in the predictor.
Use a table to track last values (the Last Value Table, LVT).
Pros: space efficient, since strides can be small (8/16/32 bits); tightly coupled, so it can predict control-flow-dependent strided patterns.
Cons: slower (an adder sits on the prediction critical path); a prediction now depends on the previous result (the last value).
To address these shortcomings, D-VTAGE stores strides instead of full 64-bit values, and uses a table, the Last Value Table, to track the last outcomes of dynamic instructions. We get a stride from the predictor and add it to the last outcome from the LVT. D-VTAGE is easy to shrink because small strides can be used. It can predict like Stride, like VTAGE, or like a combination of both, which a naive hybrid cannot; its prediction scheme is thus more powerful. However, it may be slower because of the adder on the prediction critical path, and, more importantly, a prediction now depends on the previous result, which raises the question addressed next. A sketch of the prediction path follows.
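Putting the pieces together, a D-VTAGE prediction is the last value plus a selected stride. A minimal sketch, reusing a VTAGE-style lookup like the one above for the stride tables; the LVT indexing is illustrative.

```python
# Minimal sketch of the D-VTAGE prediction path: the stride comes from a
# VTAGE-like lookup, the last value from the Last Value Table (LVT), and
# the two are added -- the adder sits on the prediction critical path.

def dvtage_predict(pc, ghist, lvt, stride_lookup):
    last = lvt[pc % len(lvt)]                     # last value for this entry
    stride, confident = stride_lookup(pc, ghist)  # stride + confidence
    return last + stride, confident

def dvtage_update_lvt(pc, result, lvt):
    lvt[pc % len(lvt)] = result                   # stride tables trained separately
```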

23 Introducing the Differential VTAGE Predictor
What if the last value has not been retired, or even computed?
[Figure: timeline. Ideally, instance 1 is predicted from Last_0 and instance 2 from Last_1. In practice, the LVT may still hold Last_0 when instance 2 must be predicted.]
We also introduce the following issue: what if the last value has not been retired, or even computed? Ideally, we use the result of instance 0 to predict instance 1, and the result of instance 1 to predict instance 2. But in some cases the LVT has not yet been updated with the result of instance 1 when instance 2 must be predicted, so instance 2 sees the result of instance 0 in the LVT, which is incorrect. We therefore need a speculative window from which to get up-to-date results.

24 Speculative Window
Use in-flight prediction blocks as last values by means of a fully associative speculative window.
Doable thanks to BeBoP:
Fewer blocks than instructions in flight: few entries in the window.
Fewer blocks than instructions fetched each cycle: only 1 or 2 associative lookups per cycle instead of 6 or 8.
Much less complex than the broadcast-based IQ model.
We propose to implement this window as a fully associative structure in which in-flight prediction blocks can be used as last values. This is doable thanks to BeBoP: first, there are only a few blocks in flight, so the window needs few entries; second, only one or two blocks are fetched each cycle, so only 1 or 2 associative lookups are needed instead of 8.

25 Speculative Window
[Figure: the block PC is checked against the partial tags of the window entries in parallel with the LVT; a priority encoder selects the youngest matching entry, which supplies the n last values.]
The window is searched in parallel with the LVT when a prediction block must be generated. It is a small associative buffer, chronologically ordered by internal sequence numbers, so the up-to-date prediction block can be retrieved. It is partially tagged (15 bits in our case) since false positives are allowed, and it is managed as a circular buffer. A sketch follows.
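A software sketch of such a window, with partial tags and sequence numbers. The entry count and tag width follow the paper's design points; the linear search stands in for the parallel tag match and priority encoder of the hardware.

```python
from collections import deque

# Minimal sketch of the block-based speculative window: a small circular
# buffer of in-flight prediction blocks, partially tagged, ordered by
# sequence number so the youngest matching block supplies last values.

class SpeculativeWindow:
    def __init__(self, n_entries=32, tag_bits=15):
        self.entries = deque(maxlen=n_entries)  # oldest entry evicted first
        self.tag_mask = (1 << tag_bits) - 1
        self.seq = 0

    def insert(self, block_pc, last_values):
        self.seq += 1                           # internal sequence number
        self.entries.append((block_pc & self.tag_mask, self.seq, last_values))

    def lookup(self, block_pc):
        """Return the youngest matching block's last values, or None
        (in which case the LVT, read in parallel, provides them)."""
        tag = block_pc & self.tag_mask
        best = None
        for entry_tag, seq, values in self.entries:   # associative search
            if entry_tag == tag and (best is None or seq > best[0]):
                best = (seq, values)                  # keep youngest match
        return best[1] if best else None
```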

26 What is new in D-VTAGE
[Figure: D-VTAGE combined with BeBoP. A direct-mapped LVT holds n last values per block; the tagged components VT1..VTN hold n strides and n confidence counters per entry instead of full values; the selected strides are added to the last values to form n predictions.]
Let us summarize what is new in D-VTAGE, combined with BeBoP here. On the left, we add the LVT, which is direct mapped and contains the last values. The speculative window is not shown for lack of space, but it is there. The rest is similar to VTAGE, except that strides are stored instead of full values. This ends the overview of D-VTAGE.

27 Experimental Results
We now have a predictor that can be made small thanks to the use of strides, and that can be made single ported with block-based prediction. So it is time to see whether we can obtain good performance with those two mechanisms.

28 Experimental Framework
Simulator: gem5 (x86_64).
Haswell-like core: 4GHz, 8-wide, 6-issue, 20/21-cycle minimum branch/value misprediction penalty.
192 ROB, 60 IQ, 72 LQ / 48 SQ, 256 INT / 256 FP regs.
32KB L1D/L1I, 1MB unified L2 with stride prefetcher.
4GB DDR (min. ~75 cycles).
16-byte fetch block, 2 blocks/cycle (single taken branch).
To evaluate our proposals, we use the gem5 simulator with the x86 ISA. We model a Haswell-like core with the parameters above: 4GHz, 8-wide, 6-issue, with the given branch/value misprediction penalties. It is pretty standard otherwise, except that we allow two 16-byte blocks to be fetched each cycle, potentially across a taken branch. There is no time to go into details, but D-VTAGE with BeBoP can accommodate this using the banking technique of the Alpha EV8 branch predictor; more on this in the paper.

29 Experimental Framework
Configurations without BeBoP:
Baseline_6_60 (6-issue, 60 IQ).
Baseline_VP_6_60 (6-issue, 60 IQ, VP with a 256kB D-VTAGE).
EOLE_4_60 (VP with a 256kB D-VTAGE + 4-issue EOLE [Perais & Seznec, ISCA'14], 60 IQ).
Value predictors use Forward Probabilistic Counters for confidence estimation [Perais & Seznec, HPCA'14]; a sketch follows this list.
VTAGE/D-VTAGE have 6 partially tagged components.
In more detail, we use three main pipeline configurations. First, our baseline: 6-issue with a 60-entry IQ, as on the previous slide. Second, the same pipeline except that it features a big D-VTAGE predictor with validation and recovery at commit. Last, since VP needs EOLE to limit complexity and PRF ports, a 4-issue EOLE configuration with a big D-VTAGE.
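For reference, here is a sketch of confidence estimation with Forward Probabilistic Counters: a narrow counter is incremented only with some probability on a correct prediction, so saturation implies a long run of correct predictions. The probability vector below is illustrative, not necessarily the one used in [Perais & Seznec, HPCA'14].

```python
import random

# Per-level increment probabilities of a forward probabilistic counter
# (illustrative values). A low increment probability at a level emulates
# a much wider deterministic counter at that level.
FPC_PROBS = [1.0, 1/16, 1/16, 1/16, 1/16, 1/32, 1/32]
FPC_MAX = len(FPC_PROBS)

def fpc_update(ctr, correct):
    if not correct:
        return 0                                   # reset on a misprediction
    if ctr < FPC_MAX and random.random() < FPC_PROBS[ctr]:
        return ctr + 1                             # probabilistic increment
    return ctr

def fpc_confident(ctr):
    return ctr >= FPC_MAX                          # use prediction only if saturated
```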

30 Experimental Framework
Single-thread benchmarks: subset of SPEC'00 and SPEC'06 (36 benchmarks), ref inputs.
Simpoint: one slice per benchmark, warm up for 50M instructions, run for 100M instructions.
We use single-thread benchmarks as we focus on sequential performance. For each benchmark, we identify one region of interest using Simpoint, warm up for 50M instructions, and collect results over 100M instructions.

31 Speedup over Baseline_6_60, instruction-based VP
Generally on par with the hybrid, plus a few "well behaving" benchmarks.
D-VTAGE substantially outperforms a hybrid tracking as many PCs in several cases; it performs noticeably worse in five cases (by 4% at most).
This first graph studies the performance of D-VTAGE against existing, similarly sized predictors. Only the hybrid is bigger, since it consists of 2d-Stride and VTAGE put side by side; nonetheless, it cannot track more PCs than each individual component. D-VTAGE thus appears to be a good predictor overall, and we only consider D-VTAGE in further experiments.

32 Speedup over Baseline_VP_6_60, instruction-based VP
Reduces the issue width from 6 to 4 at a marginal cost in performance (geometric mean speedup of 0.982).
Improves performance in a few cases, thanks to the additional issue bandwidth provided by Early/Late Execution through VP.
This is the performance we want to attain with D-VTAGE and BeBoP.
EOLE is required for a practical implementation of VP, or at least it is a way to make VP more practical. This graph shows the performance of a 4-issue EOLE pipeline with D-VTAGE over a 6-issue pipeline, also with D-VTAGE (the grey bar of the previous graph). It simply serves to show that EOLE is viable with this predictor: EOLE reduces the issue width from 6 to 4 at a marginal cost in performance, and substantially improves it in a few cases thanks to the additional issue bandwidth provided by Early/Late Execution.

33 Grouping Predictions using BeBoP on EOLE_4_60
Vary n (predictions per entry) from 4 to 8 while keeping the size roughly constant. Performance is relative to EOLE_4_60 without BeBoP.
At a similar number of entries, 6 predictions per entry perform similarly to 8, but better than 4.
At a similar storage budget, more entries with fewer predictions per entry may perform slightly better.
[Chart: [min, max] box plots of speedups with geometric means; 1/2K denotes the number of entries in the LVT/base predictor, 128/256 the number of entries in each of the 6 tagged components.]
Now that we are confident that D-VTAGE is a good predictor, we need to make it more realistic. First, since we work with CISC, how many predictions per entry do we need, given 16-byte fetch blocks? More predictions give better coverage inside blocks, but each entry requires more space, so there are fewer entries. In general, 6 predictions perform similarly to 8 but noticeably better than 4. At a similar storage budget, more entries with fewer predictions per entry may perform slightly better, although this really depends on the budget. For now, 6 predictions per entry appear to be a good design point.

34 Towards Realistic Configurations with BeBoP
With 6 predictions per entry and a 2K + 6x256 D-VTAGE predictor, we study the performance of the block-based speculative window. Performance is relative to EOLE_4_60 without BeBoP.
The window is needed in a few benchmarks (wupwise, applu, bzip and xalancbmk): without it, performance decreases by up to 18%.
However, only a few entries are necessary, around 32 to 56.
Next, for D-VTAGE, we gauge the usefulness of the speculative window. In this graph, we go from an infinite window to no window at all on a big D-VTAGE predictor with 6 predictions per entry. Since only a few entries are needed, even though the window is associative, its complexity is not really comparable to that of a broadcast-based scheduler, for instance.

35 Towards Realistic Configurations with BeBoP
Tradeoff on predictor size: all D-VTAGE components contain strides, except the LVT, so partial strides can be used.
Virtually no performance difference between 16-, 32- and 64-bit strides.
Marginal slowdown with 8-bit strides (0.5% on average, 3% at most).
Predictor size is reduced by a factor of 2.1 by using 8-bit strides for the 2K + 6x256 configuration (6 predictions per entry).
Lastly, we need to reduce the size of the predictor as much as possible. The best way is to use partial strides. We found virtually no difference in performance between 16-, 32- and 64-bit strides, and even with 8-bit strides the difference is marginal (0.5% on average, 3% at most, between 64-bit and 8-bit). It is a very efficient way to reduce the predictor footprint; note that VTAGE could not have been shrunk this way. The sketch below shows the mechanics.
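The mechanics of partial strides amount to storing only the low bits of the stride and sign-extending on use. A minimal sketch; the helper names are ours, not the paper's:

```python
# Store only the low `bits` bits of a stride; sign-extend when predicting.

def sign_extend(value, bits=8):
    mask = (1 << bits) - 1
    value &= mask                                  # keep only the stored bits
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def fits(stride, bits=8):
    """True if the stride can be stored losslessly in `bits` bits."""
    return -(1 << (bits - 1)) <= stride < (1 << (bits - 1))

assert sign_extend(0xFF, 8) == -1                  # stored pattern 0xFF means -1
assert fits(-128, 8) and not fits(200, 8)
```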

36 Final D-VTAGE Models
Three final configurations:
Small: 128-entry LVT/base + 6x128 tagged, 32-entry spec. window, 8-bit strides -> 17.18kB.
Medium: 256-entry LVT/base + 6x256 tagged, 32-entry spec. window, 8-bit strides -> 32.76kB.
Large: 512-entry LVT/base + 6x256 tagged, 56-entry spec. window, 16-bit strides -> 61.6kB.
Use Baseline_6_60 as the baseline: this is the modern CPU model we want to outperform.
Objective: minimize the slowdown compared to EOLE_4_60 with instruction-based D-VTAGE, the "ideal" model in our case.
Using the insight gained in the previous experiments, we devise three D-VTAGE configurations, all with 6 predictions per entry, requiring roughly 17kB, 33kB and 62kB respectively. In accordance with previous observations, we try to keep the tagged components as big as possible, even if it means using a fairly small LVT/base component. Note that most of the reduction in size comes from the use of 8/16-bit strides; this would not be possible in VTAGE, for instance.

37 Final D-VTAGE Models
[Chart: speedups over Baseline_6_60; labels include 1.322, 1.622, 1.088, 1.294, 1.350, 1.288 and 1.312.]
Realistic VP with 16kB of storage (Small) increases performance by almost 9% on average. Performance increases by around 30% or more in 6 benchmarks out of 36, and by more than 10% in 11 out of 36. Not bad for a predictor with so few entries.

38 Final D-VTAGE Models
[Chart: speedups over Baseline_6_60; labels include 1.110 and 1.130.]
VP with 32/64kB of storage increases performance by 11% and 13% on average (EOLE_4_60 is at 15%). Medium generally performs on par with Large, and is very close to the unrealistic instruction-based EOLE_4_60 in 5 of the 14 benchmarks showing good speedup. Performance is still good in most of the 9 remaining "well behaving" benchmarks. Overall, we can obtain good performance with only 32kB of storage.

39 Concluding Remarks
Ship it!

40 Value Prediction In a Processor?
Validation and squash at commit: no overhaul of the OoO engine [Perais & Seznec, HPCA'14].
EOLE: avoid additional ports on the PRF for VP, and reduce the complexity of the OoO engine [Perais & Seznec, ISCA'14].
If we summarize the current state of VP: validation and squash at commit mean that no overhaul of the OoO engine is required for VP, and EOLE gives VP-enabled performance with a simpler OoO engine.

41 Value Prediction In a Processor?
Now the predictor:
BeBoP: block-based prediction to enable superscalar prediction with single-ported arrays. Also enables the use of an associative speculative window for stride-based predictors. Accommodates both CISC and RISC.
D-VTAGE: a tightly coupled hybrid with good performance at a reasonable budget, thanks mostly to partial strides.
We now have a cost-effective prediction infrastructure: BeBoP to enable superscalar value prediction with single-ported arrays, and a predictor whose structure allows a great reduction in size while retaining good performance.

42 Implementing Value Prediction
Validation & recovery at execute: HW in the OoO Core. Selective replay. -> Validation & squash at commit.
Register File: Write ports to write predictions. Read ports to validate predictions. -> EOLE. Bonus: OoO complexity--.
The predictor is an issue in itself: Multiporting for superscalar VP -> BeBoP. Storage footprint.

43 Implementing Value Prediction
Validation & recovery at execute: HW in the OoO Core. Selective replay. -> Validation & squash at commit.
Register File: Write ports to write predictions. Read ports to validate predictions. -> EOLE. Bonus: OoO complexity--.
The predictor is an issue in itself: Multiporting for superscalar VP -> BeBoP. Storage footprint -> D-VTAGE.
The predictor infrastructure is not an issue anymore.

44 That's all folks!
Anticipated questions:
Q: What about using a simple Stride/2d-Stride predictor?
A: D-VTAGE has a stride predictor, and a per-path stride predictor is better, so global branch history should help. We did not simulate a very small stride predictor, so I cannot give a definitive answer, but here we give a way to benefit from the prediction scheme of VTAGE while reducing the size of the predictor to an acceptable one. In the process, we gain the ability to predict strided patterns and even more. Issues would only arise in code where the Last Value Table is too small and instructions in the block are stride-predictable; that aliasing disappears when the size of the LVT is doubled.
Q: Why not XOR the least significant bits of the fetch-block PC into the index instead of using local tags?
A: That cannot be done for last values.
Accuracy is > 99.6% in all cases, and > 99.9% in 21 benchmarks out of 36. Coverage is 47.6% for D-VTAGE vs. 34.3% for VTAGE and 35.7% for 2d-Stride, on average.
On average, for D-VTAGE with the baseline configuration (no EOLE), loads represent 26.12% of the predictions, and among those, 9.1% are long latency (latency > L2 hit). They amount to 7.5% of the total number of long-latency loads on average.
Power and energy are mentioned in all reviews. VP as implemented in this paper decreases consumption because (1) performance increases and (2) the issue width is reduced (thanks to EOLE), but increases power consumption because of (1) the simple additional ALUs required by EOLE, (2) the value predictor, and (3) the speculative value prediction window. Given that the scheduler is responsible for a substantial part of the consumption of the core (18% in the Alpha 21264, 16% in the PentiumPro, as summarized by Ernst and Austin, "Efficient Dynamic Scheduling Through Tag Elimination", ISCA'02), a reduction of the issue width and its implications clearly gives us headroom. The value predictor itself is comparable in design (number of tables, ports, storage volume, pressure), and therefore in power consumption, to an aggressive branch predictor. Regarding the speculative window, we argue that 32 entries are a good tradeoff: this is roughly half the entries of Haswell's scheduler, so it is not equivalent to the scheduler we simulate. Indeed, assuming a naive CAM-like scheduler and 6/8 results per cycle, each scheduler entry must provision 12/16 comparators for wakeup, assuming 2 operands per entry (AMD Bulldozer's actually has 4; see "40-Entry Unified Out-of-Order Scheduler and Integer Execution Unit for the AMD Bulldozer x86-64 Core", ISSCC 2011). The speculative window only requires as many comparators per entry as there are blocks fetched per cycle (2 in our study), granted that the comparators are bigger since we match 15 bits. As a result, the complexity and power consumption of the speculative window should be much lower than those of the scheduler.

