EOLE: Paving the Way for an Effective Implementation of Value Prediction
Arthur Perais & André Seznec
Presentation transcript:

EOLE: Paving the Way for an Effective Implementation of Value Prediction
Arthur Perais & André Seznec — ISCA 2014

Let's talk about value prediction!

Increasing Sequential Performance is Hard
- The "natural" way: increase the superscalar width. But this runs into complexity, power, and timing issues.
- Current practice: maximize the utilization of the resources we can already implement:
  - Branch prediction to feed the execution core.
  - Memory dependency prediction to increase ILP.
  - Value Prediction to increase ILP.

Amdahl's law tells us that sequential performance is still important, yet increasing it is hard. The "natural" way is to leverage ILP better by increasing the superscalar width, but we quickly run into power and timing issues. Instead, we speculate: branch prediction feeds the core, and memory dependency prediction reorders some memory instructions to increase ILP. There are other ways to speculate and increase ILP that are not implemented today, such as Value Prediction, which is the focus of this presentation.

Outline
- Value Prediction Today.
- Introducing the EOLE Architecture.
- Lighter Value Prediction with EOLE.
- Results.
- Conclusion.

First I'll describe state-of-the-art value prediction. Then I'll introduce our new architecture, which aims to reduce the remaining complexity associated with Value Prediction. Then I'll show some results on how EOLE fares against the state of the art. Finally, I'll conclude and give directions for future work.

1. Value Prediction Today — What we have.

So, let's begin with Value Prediction today.

Value Prediction [Lipasti96][Mendelson97]
- Breaks true data dependencies to extract more ILP.
- Example: the serial chain I1 → I2 → I3 → I4 → I5 becomes two independent chains (I1 → I2 → I3 and I4 → I5) if I3 is predicted.

First of all, what is value prediction? Value prediction breaks RAW dependencies by speculating on instruction results. For instance, a chain of dependent instructions can be broken into two independent chains if I3 is predicted: I4 consumes the predicted value instead of waiting for I3 to execute.
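To make the chain-splitting concrete, here is a minimal sketch, not from the paper, that computes the dataflow height of the example chain with and without predicting I3:

```python
# Minimal sketch (illustrative only): a 5-instruction RAW dependence chain
# I1 -> I2 -> I3 -> I4 -> I5, each instruction taking 1 cycle. Predicting
# I3 lets I4 issue immediately, splitting the chain into two halves that
# execute in parallel.

def dataflow_height(chain, predicted=frozenset()):
    """Length of the longest dependence chain, in cycles."""
    finish = {}
    for insn, producer in chain:
        # A predicted producer supplies its (speculative) result right away.
        ready = 0 if producer is None or producer in predicted else finish[producer]
        finish[insn] = ready + 1
    return max(finish.values())

chain = [("I1", None), ("I2", "I1"), ("I3", "I2"), ("I4", "I3"), ("I5", "I4")]
print(dataflow_height(chain))                    # 5 cycles: fully serial
print(dataflow_height(chain, predicted={"I3"}))  # 3 cycles: I4/I5 overlap I1-I3
```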

Why is VP not Implemented Yet?
Predictors:
- Stride-based and FCM predictors: how do you deal with the speculative window?
- FCM: timing issues with instructions in tight loops.
Validation & Recovery:
- Validate in the OoO core.
- Selective replay to absorb the cost of a misprediction.
Too complex overall.

First, the predictors. Most of them require an instruction's previous result to predict the current instance, which is problematic because several instances can be in flight in the instruction window at once. To provide coherent predictions, you must keep track of all in-flight predictions and pick the correct one to use as the speculative last value; how do you build that? Some existing predictors also have timing issues when the same instruction is fetched in consecutive cycles. Second, validation was usually done out of order, meaning more hardware in the OoO core, and selective replay was assumed as the recovery mechanism to absorb the cost of mispredictions, because it is faster than squashing. Complexity increases both in the predictor and in the out-of-order engine. As described here, Value Prediction is simply too complex to implement.
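The speculative-window problem can be illustrated with a simplified stride-predictor sketch (structure and names are ours): without the in-flight map, back-to-back instances of the same instruction would all predict from the same stale retired value.

```python
# Simplified stride-predictor sketch showing why a speculative window is
# needed. prediction = last_value + stride, but "last_value" must be the
# value of the previous *in-flight* instance, which has not retired yet.

class StridePredictor:
    def __init__(self):
        self.table = {}     # pc -> (last_value, stride), updated at retirement
        self.inflight = {}  # pc -> newest speculative value: the "window"

    def predict(self, pc):
        last, stride = self.table.get(pc, (0, 0))
        base = self.inflight.get(pc, last)  # chain from the in-flight instance
        pred = base + stride
        self.inflight[pc] = pred
        return pred

    def retire(self, pc, actual):
        last, _ = self.table.get(pc, (0, 0))
        self.table[pc] = (actual, actual - last)
        # A real design must track one entry per in-flight instance, select
        # the right one here, and repair the window on every pipeline squash;
        # that bookkeeping is exactly the complexity the slide alludes to.
        self.inflight.pop(pc, None)
```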

A First Solution [Perais & Seznec, HPCA 2014]
- A new predictor leveraging branch history, VTAGE:
  - No speculative window required.
  - No issues with tight loops.
- Validation & recovery at retirement:
  - Validate outside the OoO core, in order.
  - Recover by squashing, relying on very high predictor accuracy.
- Actually still too complex.

Recently, we proposed a first solution to reduce complexity. First, we devised a new predictor, VTAGE, that leverages the global branch history. It does not require an instruction's previous result to compute a prediction for the current instance, so it needs no speculative window and has no issues with tight loops. Second, we found that validation and recovery can be delayed until retirement: we validate outside the out-of-order core, in order, and recover at commit by squashing, since we can push predictor accuracy very high. Yet, even with these improvements, VP is still too complex.
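A rough sketch of the VTAGE lookup idea, with placeholder hash and tag functions rather than the paper's exact ones: since the index depends only on the PC and the global branch history, no per-instruction last value, and hence no speculative window, is needed.

```python
# VTAGE-style lookup sketch (placeholder hashes). Tagged components are
# indexed by the PC combined with increasingly long slices of the global
# branch history; the hitting component with the longest history wins,
# TAGE-style.

HIST_LENGTHS = [2, 4, 8, 16, 32, 64]   # geometric series, one per component

def fold(ghist, length, out_bits=10):
    """Fold the `length` youngest history bits down to `out_bits` bits."""
    h = ghist & ((1 << length) - 1)
    folded = 0
    while h:
        folded ^= h & ((1 << out_bits) - 1)
        h >>= out_bits
    return folded

def vtage_lookup(base_table, tagged_tables, pc, ghist):
    best = base_table[pc % len(base_table)]     # PC-indexed fallback component
    for table, hlen in zip(tagged_tables, HIST_LENGTHS):
        entry = table[(pc ^ fold(ghist, hlen)) % len(table)]
        if entry is not None and entry["tag"] == (pc >> 2) & 0xFF:
            best = entry                        # longer history overrides
    return best
```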

The – Slightly – Hidden Costs of VP

[Figure: an n-issue out-of-order engine (Fetch, ROB, IQ, PRF, FUs) with a value predictor indexed by the PC, and validation + squashing at commit.]

More ports on the PRF:
- Write ports to write predictions.
- Read ports to validate and train.

What happens to the PRF if we add value prediction to a baseline pipeline? First, we need write ports on the PRF to write predictions into it at Dispatch, so that the out-of-order engine can use them. Second, we need read ports to read the actual result from the PRF and validate the prediction against it, as well as to train the predictor.

Let's Count.
- Baseline 8-wide, 6-issue: 12 read ports, 6 write ports.
- VP 8-wide, 6-issue:
  - 12R/6W for OoO execution.
  - +8W to write 8 predictions/cycle into the PRF.
  - +8R to validate/train 8 instructions/cycle.
- 12R/6W vs. 20R/14W!

We cannot bear such an increase in the number of ports: we need a way to reduce complexity in the PRF.
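The arithmetic behind these numbers, as a quick sanity check (assuming 2 reads and 1 write per issue slot, as above):

```python
# Port-count back-of-the-envelope: OoO execution needs 2R/1W per issue
# slot; value prediction adds one write per prediction and one read per
# validated/trained instruction, both at the full 8-wide pipeline width.
issue_width, pipeline_width = 6, 8
base_r, base_w = 2 * issue_width, issue_width            # 12R / 6W
vp_r, vp_w = base_r + pipeline_width, base_w + pipeline_width
print(f"baseline: {base_r}R/{base_w}W  ->  with VP: {vp_r}R/{vp_w}W")
# baseline: 12R/6W  ->  with VP: 20R/14W
```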

2. The EOLE Architecture — What we propose.

And this is precisely the goal of the EOLE architecture we propose.

Leveraging the – Slightly – Hidden Benefits of VP
Value Prediction provides:
- Instructions with ready operands flowing from the value predictor.
- Predicted instructions that need not execute before retirement.
Idea: offload execution to in-order parts of the core to reduce complexity in the out-of-order core, and save PRF ports in the process.

EOLE is based on two observations. First, VP provides instructions whose operands are ready straight from the predictor, so some instructions can execute long before they are dispatched. Second, VP provides predicted instructions that do not actually need to execute before retirement, since their dependents can use the predicted result. Therefore, we can offload part of the execution from the OoO engine without lengthening the execution critical path. We will see how the number of PRF ports can be reduced in the process.

Introducing Early Execution

[Figure: Fetch → Decode → Rename, with an Early Execution stage fed by the value predictor, ahead of Dispatch and the out-of-order engine.]

- Execute ready single-cycle instructions in order, in parallel with Rename.
- Do not dispatch them to the IQ.

First, we propose Early Execution: ready single-cycle instructions are executed in order in the front-end. Since this execution is in order, it can be done in parallel with Rename. Early-executed instructions are not dispatched to the IQ; their results simply have to be written into the PRF, like regular predictions.

Early Execution Hardware

[Figure: a rank of simple ALUs with its bypass network, between Decode/the value predictor and Dispatch.]

- Values come from Decode (immediates), the value predictor, or the bypass network.
- ALU stages can be chained, but we found a single stage sufficient.
- Execute what you can; write results into the PRF with the ports provisioned for VP.

The hardware required is a rank of simple ALUs and the associated bypass network. Operands come from Decode, the value predictor, or the bypass network. Once early-executed, results are written into the PRF using the ports provisioned for Value Prediction.
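Functionally, the early-execution test looks roughly like the sketch below; the instruction representation is invented for illustration:

```python
# Early-execution sketch (invented IR). A single-cycle instruction whose
# sources all come from Decode (immediates), the value predictor, or the
# in-rank bypass is executed in order, alongside Rename, and skips the IQ.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Src:
    imm: Optional[int] = None    # immediate from Decode, if any
    reg: Optional[int] = None    # otherwise a register operand

@dataclass
class Insn:
    op: Callable
    sources: List[Src]
    dest: int
    single_cycle: bool = True

def try_early_execute(insn, predicted, bypass):
    """Return the result if early-executable, else None (dispatch to IQ)."""
    if not insn.single_cycle:
        return None
    vals = []
    for s in insn.sources:
        if s.imm is not None:
            vals.append(s.imm)
        elif s.reg in predicted:         # operand flows from the predictor
            vals.append(predicted[s.reg])
        elif s.reg in bypass:            # produced earlier in the same rank
            vals.append(bypass[s.reg])
        else:
            return None                  # operand not ready at Rename
    result = insn.op(*vals)
    bypass[insn.dest] = result           # written to the PRF like a prediction
    return result
```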

Introducing Late Execution

[Figure: the out-of-order engine followed by a Validation/Late Execution stage, comparators fed by a prediction FIFO queue, before Retire, with outcomes training the value predictor.]

- Execute single-cycle predicted instructions in order, just before retirement.
- Do not dispatch them to the IQ either.

Second, as discussed before, the execution of predicted instructions becomes non-critical, since dependents can execute using the prediction instead of the actual result. We therefore propose Late Execution: single-cycle predicted instructions are executed at retirement time, just before the prediction is validated. Like early-executed instructions, late-executed instructions are not dispatched to the IQ.

Late Execution Hardware

[Figure: a rank of simple ALUs and comparators (one "correct?" comparison per instruction) between the PRF, the prediction FIFO queue, and the value predictor.]

- Execute just before validation and retirement, leveraging the ports provisioned for validation.

To late-execute instructions, we add a rank of simple ALUs after the out-of-order engine and before validation. To read operands, we can reuse the read ports we assumed were there for validation. However, to ensure smooth late execution we may need more read ports: validation requires one port per instruction, while late execution requires two, since there can be two operands to read.
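Here is a sketch of what the combined late-execution/validation stage does each cycle, with simplified stand-in structures; note the two operand reads for a late-executed instruction versus the single read for a plain validation:

```python
# Late-execution + validation sketch (simplified). Entries leave the
# prediction FIFO in program order; single-cycle predicted instructions are
# executed here, then every prediction is compared against the actual
# result. On a mismatch, the pipeline squashes from the offending
# instruction at commit.

def late_execute_and_validate(pred_fifo, prf, commit_width=8):
    for _ in range(min(commit_width, len(pred_fifo))):
        e = pred_fifo.pop(0)
        if e["late_exec"]:                       # single-cycle, predicted
            ops = [prf[r] for r in e["srcs"]]    # up to 2 reads, vs. 1 to validate
            actual = e["op"](*ops)
            prf[e["dest"]] = actual
        else:
            actual = prf[e["dest"]]              # produced by the OoO engine
        if e["pred"] != actual:
            return e                             # misprediction: squash here
    return None                                  # all validated, retire
```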

{Early | OoO | Late} Execution: EOLE
Far fewer instructions enter the IQ, so we may be able to reduce the issue width:
- Simpler IQ.
- Fewer ports on the PRF.
- Less bypass.
- A simpler OoO engine overall.
Even non-critical predictions become useful, since the corresponding instructions can be late-executed. But what about hardware cost?

With EOLE, fewer instructions enter the IQ. We could therefore reduce the issue width, and hence the number of ports required for OoO execution: a simpler scheduler, fewer ports on the register file, and less bypass network, that is, an overall simpler out-of-order engine. Moreover, even non-critical predictions become useful, because the predicted instructions are late-executed and no longer use any resources in the out-of-order engine. But what is the hardware cost of this first proposition?

Hardware Cost of EOLE
Early Execution:
- A single rank of simple ALUs plus the associated bypass network.
- No additional PRF ports.
Late Execution & Validation:
- A rank of simple ALUs and comparators (to validate); no bypass.
- The n read ports needed to validate become 2n to late-execute n instructions per cycle: 16R for an 8-wide pipeline.
From 20R/14W for an 8-wide, 6-issue core with VP, we now need 28R/14W, against only 12R/6W for the baseline.

Early Execution is fairly light: execution is in order, the ALUs handle only simple operations, and no ports are required beyond those of a baseline Value Prediction processor; the most expensive piece of hardware is probably the full bypass. Late Execution, also done in order, needs a rank of simple ALUs and some comparators, but no bypass. However, it needs up to 16 read ports to handle 8 instructions per cycle, double what validating 8 instructions per cycle required. So an EOLE pipeline with the same issue width as a baseline Value Prediction pipeline actually increases the port requirement, from 20R/14W to 28R/14W, which is quite counterproductive. Fortunately, simple optimizations greatly reduce this number.

3. Lighter Value Prediction with EOLE — What we can optimize.

That is, EOLE enables lighter Value Prediction if it is implemented carefully.

Reducing the Issue Width
- If fewer instructions enter the IQ, we can reduce the issue width (and maybe the IQ size).
- From 6 to 4 (−4R and −2W): 24R/12W.
- The remaining issue capacity is offloaded to the Early/Late Execution stages.
- Still too many ports.

First, as previously mentioned, we can reduce the issue width. In our framework, we found we could reduce it from 6 to 4 without sacrificing performance; in a sense, the missing issue capacity is offloaded to the early/late execution stages. We save 2 write ports and 4 read ports, but 24R/12W is still too many.

Banking the Physical Register File
- Prediction and validation are done in order.
- Bank the PRF and attribute predictions to consecutive banks.

[Figure: 8 predictions/cycle and 8 validations/cycle spread across Banks 0–3.]

- 2 write ports per bank instead of 8, for a 4-bank file.
- Read-port savings are not as straightforward because of Late Execution.

Second, we can exploit the fact that prediction and validation are done in order: bank the PRF and allocate the destination registers of sequential instructions to different banks. With 4 banks and 8 predictions per cycle, we can guarantee that all predictions are written to the PRF with only 2 write ports per bank; the idea is similar for validation and read ports. This saves 6 write ports per bank. However, because of late execution, the read-port savings are not as straightforward.
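A small sketch of the allocation invariant, assuming a simple round-robin scheme: because destination registers are attributed to banks in program order, any group of 8 consecutive writes touches each of the 4 banks exactly twice.

```python
# Bank-allocation sketch: destinations of consecutive instructions go to
# consecutive banks, so 8 in-order prediction writes per cycle hit each of
# the 4 banks exactly twice -- 2 write ports per bank suffice by construction.
from collections import Counter

NUM_BANKS = 4

def banks_touched(first_seq_num, group_width=8):
    return [(first_seq_num + i) % NUM_BANKS for i in range(group_width)]

print(Counter(banks_touched(0)))    # Counter({0: 2, 1: 2, 2: 2, 3: 2})
print(Counter(banks_touched(37)))   # still 2 per bank, whatever the offset
```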

Read Port Sharing
- 8 instructions can be validated with 2R per bank (by construction)...
- ...but Late Execution needs up to 16R per bank to process 8 instructions.
- Fortunately, not all instructions are predictable (e.g., stores) or late-executable (e.g., loads).
- Constrain the number of read ports and share them between late execution and validation as needed: 4R per bank is a good tradeoff.

By construction, 8 instructions can be validated with 2R per bank. But late execution might need 16R per bank to process 8 instructions per cycle, since operands can come from any bank. Fortunately, not all instructions are predictable or late-executable, so we do not need the ideal number of ports to get full performance. In particular, we found that constraining the ports dedicated to late execution and validation to 4 per bank is a good tradeoff, still assuming 4 banks.

Let's Count, Again.
- 4-issue out-of-order engine: 4W/8R per bank.
- 8 predictions per cycle: 2W per bank.
- Constrained late execution/validation: 4R per bank.
- 12R/6W per bank in total.
From 28R/14W, we now only need 12R/6W: the same amount as the PRF without VP, except the issue width is 33% smaller.

To summarize: a 4-issue out-of-order engine needs 4W and 8R per bank; writing 8 predictions per cycle needs 2W per bank; and late execution plus validation need 4R per bank. That is 12 read ports and 6 write ports per bank in total, assuming 4 banks. So from 28R/14W for the first EOLE proposition, we are back to the same number of ports as the baseline model without VP, with a 33% smaller issue width.
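The per-bank tally, again as a quick check (4 banks assumed throughout):

```python
# Per-bank port budget of the optimized EOLE design (4 banks). The OoO
# engine may touch any bank, so it needs its full 2R/1W per issue slot on
# every bank; prediction writes and late-exec/validation reads are spread
# over the banks by the in-order allocation above.
banks, issue_width, pipeline_width = 4, 4, 8
ooo_r, ooo_w = 2 * issue_width, issue_width      # 8R / 4W per bank
pred_w = pipeline_width // banks                 # 2W per bank, by banking
late_val_r = 4                                   # constrained, shared ports
print(f"{ooo_r + late_val_r}R / {ooo_w + pred_w}W per bank")   # 12R / 6W
```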

Putting It All Together

[Figure: Fetch → Rename/Early Exec, fed by VPredict via the PC, → a narrower out-of-order engine (ROB, IQ, FUs) over a 4-bank PRF → Late Execution → Validation + squashing at commit.]

The resulting pipeline works as follows:
- Predictions flow through Early Execution, where some instructions are also executed.
- Predictions and early-execution results are written into the PRF at Dispatch, using only 2 ports per bank (assuming 8-wide dispatch).
- Regular, but narrower, out-of-order execution happens.
- Single-cycle predicted instructions are late-executed just before retirement, reading their operands from the PRF.
- Finally, all predicted instructions are validated at commit time.

Putting It All Together
- EOLE nullifies the pressure applied by VP on the PRF (assuming banking is cheap).
- It reduces the complexity of the OoO engine itself: smaller issue width, simpler Wakeup & Select, less bypass.
- EOLE needs VP to provide instructions to early/late-execute, while VP needs EOLE to mitigate the complexity it introduces. The two features are complementary.

In a nutshell, EOLE provides a way to nullify the pressure applied by VP on the PRF. It also reduces the complexity of the OoO engine, which could therefore be clocked higher, since Wakeup & Select should be faster. However, EOLE really needs VP to provide instructions to early/late-execute, and VP needs EOLE to mitigate the complexity it introduces: both features are complementary.

4. Experimental Results — What we get.

Now let us see how a reduced-issue-width EOLE pipeline fares against a simple Value Prediction pipeline.

Experimental Framework
Simulator: gem5 (x86_64).
- 4GHz, 8-wide, 6-issue; minimum branch misprediction penalty of 20 cycles; 192-entry ROB, 64-entry IQ, 48LQ/48SQ, 256 INT/256 FP registers.
- 32KB L1D/L1I, 2MB unified L2 with a stride prefetcher, 4GB DDR3-1600 (min. ~75 cycles).
Value predictor: VTAGE (8K-entry base predictor + 6 1K-entry tagged components) hybridized with an 8K-entry 2-delta Stride predictor, using Forward Probabilistic Counters [Perais & Seznec, HPCA 2014].

We model a 4GHz, 8-wide but 6-issue pipeline with a fetch-to-commit latency of 19 cycles without VP/EOLE and 20 cycles with, because prediction validation and late execution add a cycle; this yields the respective minimum branch misprediction penalties reported here. The most important remaining parameters are listed above. As the value predictor, we use a hybrid of the VTAGE predictor described in our previous work and a 2-delta Stride predictor; 2-delta adds a constant, the stride, to the last result produced by an instruction to generate the prediction. To reach very high accuracy, we use Forward Probabilistic Counters, which mimic wide confidence counters. Note that the predictor is sized generously so that it is not the limiting factor in this study.
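For reference, here is a sketch of a 2-delta Stride entry and a Forward Probabilistic Counter update; field names and the probability vector are our assumptions based on the description above:

```python
import random

# 2-delta Stride entry: the stride used for prediction (stride2) is only
# promoted once the same stride has been observed twice in a row, which
# filters out one-off noise in the value stream.
class TwoDeltaEntry:
    def __init__(self):
        self.last = 0
        self.stride1 = 0   # most recently observed stride
        self.stride2 = 0   # confirmed stride, used to predict

    def predict(self):
        return self.last + self.stride2

    def update(self, actual):
        stride = actual - self.last
        if stride == self.stride1:
            self.stride2 = stride       # seen twice in a row: promote
        self.stride1 = stride
        self.last = actual

# Forward Probabilistic Counter: a narrow confidence counter incremented
# only with some probability, emulating a much wider counter so predictions
# are used only at very high confidence (probability vector assumed).
FPC_PROBS = [1.0, 1/16, 1/16, 1/16, 1/16, 1/32, 1/32]

def fpc_increment(counter):
    if counter < len(FPC_PROBS) and random.random() < FPC_PROBS[counter]:
        counter += 1
    return counter
```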

Experimental Framework
- Single-thread benchmarks: a subset of SPEC'00 and SPEC'06 (19 benchmarks), ref inputs.
- SimPoint: one slice per benchmark; warm up for 50M instructions, run for 100M instructions.

We use single-thread benchmarks since we focus on sequential performance. For each benchmark, we identify one region of interest using SimPoint, then warm up for 50M instructions and collect results over 100M instructions.

Speedup over Baseline 8-wide/6-issue

[Figure: per-benchmark speedup of the hybrid predictor over the baseline, without EOLE.]

This first graph shows the speedup of the hybrid predictor over the baseline, without EOLE. As expected, VP yields good speedups, and no slowdown is observed thanks to the predictor's very high accuracy. In the following experiments, this is the performance we use as the reference.

Early Executed – Late Executed

[Figure: per-benchmark fraction of dynamic instructions that are early-executable, late-executable high-confidence branches, or late-executable single-cycle predicted instructions; a few benchmarks show low EOLE potential, though one of them turns out fine nonetheless.]

This graph shows the proportion of dynamic instructions that can respectively be early-executed, late-executed because they are high-confidence branches, and late-executed because they are single-cycle predicted instructions. Given these numbers, we expect EOLE to perform well, except on a few benchmarks where the predictor does less well, namely milc, hmmer, and lbm, where reducing the issue width or the IQ size may cost performance.

Reducing the Issue Width

[Figure: per-benchmark speedup over the 6-issue/64IQ VP baseline. The 4-issue plain VP model slows down in almost all cases; 4-issue EOLE slows down in a single case; 6-issue EOLE shows slight speedups in general.]

Next, we consider both simple VP and EOLE models with the issue width reduced from 6 to 4. In the legend, items marked 4I are 4-issue and the item marked 6I is 6-issue; 64IQ denotes a 64-entry instruction queue. Performance is relative to the baseline 6-issue, 64IQ model with the hybrid predictor that we saw two slides ago. If we simply reduce the issue width of the plain Value Prediction model, we get noticeable slowdowns on almost all benchmarks. If we reduce the issue width of our EOLE pipeline instead, we observe a single slowdown, on hmmer, around 0.98; note that milc and lbm are not slowed down. Furthermore, if we keep the baseline issue width but add Early and Late Execution (the white bar), we actually get additional speedup on several benchmarks. EOLE thus appears as a way either to slightly increase performance or to keep performance roughly constant while decreasing the issue width.

Reducing the IQ Size

[Figure: per-benchmark speedup with a 48-entry IQ. Plain VP shows noticeable slowdowns in all cases; EOLE shows slowdowns in many cases.]

Another degree of freedom is the number of entries in the instruction queue, or scheduler; in this experiment we reduce it from 64 to 48. Legend items marked 48IQ have a 48-entry IQ, the item marked 64IQ has a 64-entry IQ, and the reference keeps a 64-entry IQ. Once again, shrinking the IQ in the plain Value Prediction model causes substantial slowdowns, greater than when we reduced the issue width. With our EOLE pipeline, performance is better than in the plain VP case, but we now observe slowdowns on more than one benchmark, down to 0.9. The last bar is the same as in the previous figure, so there is not much to add. In a nutshell, although EOLE mitigates the slowdown due to a smaller IQ, reducing the issue width appears more interesting from a performance standpoint. It is also more interesting from a complexity standpoint, as it has more benevolent side effects than shrinking the IQ, such as requiring fewer ports on the PRF.

Limited Issue and PRF Ports

[Figure: speedups for the 6-issue model without VP, 4-issue EOLE with unconstrained ports, and 4-issue EOLE with 4R per bank; the constrained model matches the ideal one.]

Finally, we consider the EOLE model with only 12R/6W per bank, assuming 4 banks, as discussed previously. The first bar gives the performance of the 6-issue model without Value Prediction, for context; the reference is the same as before. The next two bars show speedups for a 4-issue EOLE model without any port constraints and with them. The main conclusion is that 4 read ports per bank are sufficient to match the unconstrained model. We can therefore implement VP with a PRF that has no more ports than the baseline 6-issue core without Value Prediction, while having reduced the issue width by 33%.

5. Concluding Remarks — What remains to be done.

VP in a Processor with EOLE?
Pros:
- No additional ports on the PRF, assuming enough banks.
- A simpler out-of-order engine.
- Performance very similar to the baseline VP pipeline.
Cons:
- Additional hardware (Early and Late Execution, the predictor).
- The impact on power consumption is unclear.

What EOLE gets us is Value Prediction with no additional ports on the PRF, a simpler OoO engine because the issue width is reduced, and performance very similar to the baseline VP pipeline, that is, a speedup. But some additional hardware is required, and the impact on power is unclear: on the one hand, power consumption drops with the issue width; on the other hand, there are new hardware structures.

Future Work
- What about the predictor? 8-wide fetch means 8 predictions/cycle: 8-ported tables?
- In a hybrid with Stride, how do you implement the speculative window?

The remaining complexity really lies in the predictor. It must deliver several predictions per cycle: can you implement big multi-ported tables? And if you want a Stride component in the value predictor, how do you implement the speculative window I talked about at the beginning?

Questions?