Effective ahead pipelining of instruction block address generation André Seznec and Antony Fraboulet IRISA/ INRIA.


1 Effective ahead pipelining of instruction block address generation André Seznec and Antony Fraboulet IRISA/ INRIA

2 AS-AF Caps Team Instruction fetch on wide issue superscalar processors
- Fetching 6-10 instructions in parallel on each cycle:
  - Fetch can be pipelined
  - I-cache can be banked
  - Instruction streams are "relatively long"
- Next block address generation is critical
Not the real issue

3 Instruction block address generation
- One block per cycle
- Speculative: accuracy is critical
- Accuracy comes with hardware complexity:
  - Conditional branch predictor
  - Sequential block address computation
  - Return address stack read
  - Jump prediction
  - Branch target prediction/computation
  - Final address selection

4 Using a complex instruction address generator (IAG)
- PentiumPro, Alpha EV6, Alpha EV8:
  - Fast, (relatively) inaccurate IAG responding in a single cycle, backed by a complex multicycle IAG
  - Loss of a significant part of the instruction bandwidth
  - Overfetching implemented on Alpha EV8
- Trend to deeper pipelines:
  - Smaller line predictor: less accurate
  - Deeper pipelined IAG: longer misfetch penalty
10% misfetches, 3 cycles penalty: 30% bandwidth loss
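The closing figure on this slide can be checked with a few lines of arithmetic. The sketch below assumes a very simple model that is not stated on the slide: one fetch block per cycle at peak, and every misfetch adding a fixed number of bubble cycles. The function name `misfetch_cost` is hypothetical.

```python
def misfetch_cost(misfetch_rate, penalty_cycles):
    """Return (extra cycles per fetch block, delivered fraction of peak bandwidth).

    Assumed model: one block per cycle at peak, and each misfetch
    costs `penalty_cycles` additional cycles.
    """
    extra = misfetch_rate * penalty_cycles
    delivered = 1.0 / (1.0 + extra)
    return extra, delivered


extra, delivered = misfetch_cost(0.10, 3)
print(f"extra cycles per block: {extra:.2f}")
print(f"delivered fraction of peak: {delivered:.3f}")
```

Under this model the slide's 30% corresponds to 0.3 extra cycles per fetched block; expressed as a fraction of peak bandwidth, the delivered rate is 1/1.3, so the exact loss depends on which of the two readings is intended.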

5 Ahead pipelining the IAG
- Suggested with multiple block ahead prediction in Seznec, Jourdan, Sainrat and Michaud (ASPLOS 96):
  - Conventional IAG: use information available with block A to predict block B
  - Multiple block ahead prediction: use information available with A to predict block C (or D)
- This paper:
  - How to really do it

6 Fetch blocks for us
[Figure: fetch block starting at the 1st inst]
- Rc: B inst/fetch block
- Rs: ends with the cache block
- Rl: ends with the second block

7 Fetch blocks (2)
[Figure: fetch block starting at the 1st inst, with conditional branches (Cond)]
- 1nt: 1NT bypassing
- 0nt: no NT bypassing
- aNT: all NT bypassing

8 Fetch blocks (3)
[Figure: fetch blocks. Labels: Cond, Uncond, 0nt, 0NT]
- 0NT+: no extra cond
- 0NT: no extra CTI

9 Technological hypothesis
- Alpha EV8 front-end + twice faster clock:
  - Cycle = 1 EV8 cycle phase
  - Cycle = time to cross an 8-to-10-entry multiplexor
  - Cycle = time to read a 2Kb table and route back the read data as an index
  - 2 cycles for reading a 16Kb table and routing back the read data as an index

10 Hierarchical IAG
- Complex IAG + Line predictor
- Conventional complex IAG spans over four cycles:
  - 3 cycles for conditional branch prediction
  - 3 cycles for I-cache read and branch target computation
  - Jump prediction, return stack read
  - + 1 cycle for final address selection
- Line prediction:
  - a single 2Kb table + 1-bit direction table
  - select between fallthrough and line predictor read

11 Hierarchical IAG (2)
[Figure: block diagram. Labels: LP, RAS, Pred Check, Cond., Jump Pred, Final Selection, Branch target addresses + decode info]

12 Ahead pipelining the IAG
- Same functionalities required:
  - Final selection: uses the last cycle in the address generation
  - Conditional branch predictor
  - Jump predictor
  - Return address stack
  - Branch target prediction: recomputation impossible! Use of a branch target buffer
  - Decode information: MUST BE PREDICTED

13 Ahead pipelining the IAG (2)
- Initiate table reads N cycles ahead with the information available:
  - N-block ahead address
  - N-block ahead branch history
- Problem: significant loss of accuracy
Use of N-block ahead (address, history) + intermediate path!

14 Inflight path incorporation
- Pipelined table access + final logic:
  - Column selection
  - Wordline selection
  - Final logic
- Insert one bit of inflight information per intermediate block
- Not the same bit for each table in the predictor!
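The idea of starting a table read early and folding in path information late can be sketched in software. This is only an illustration under assumptions: the row hash (a plain XOR), the table geometry (`ROW_BITS`, `INFLIGHT_BITS`) and the function names are all hypothetical and are not taken from the slides.

```python
ROW_BITS = 8       # assumed: 256 rows, chosen early from N-block-ahead information
INFLIGHT_BITS = 3  # assumed: one bit of path information per intermediate block


def start_read(table, ahead_addr, ahead_hist):
    """Early stages: select a row using only the N-block-ahead address and history."""
    row = (ahead_addr ^ ahead_hist) & ((1 << ROW_BITS) - 1)
    return table[row]  # the row holds 2**INFLIGHT_BITS candidate entries


def finish_read(row_entries, inflight_bits):
    """Final logic: one bit per intermediate block picks the entry within the row."""
    sel = 0
    for bit in inflight_bits:
        sel = (sel << 1) | (bit & 1)
    return row_entries[sel]


# Each entry records its own (row, column) so the selection is easy to inspect.
table = [[(r, c) for c in range(1 << INFLIGHT_BITS)] for r in range(1 << ROW_BITS)]
row = start_read(table, ahead_addr=0x1234, ahead_hist=0x0F0F)
entry = finish_read(row, [1, 0, 1])
print(entry)
```

The point of the split is that `start_read` needs nothing from the intermediate blocks, so it can begin N cycles ahead, while `finish_read` is a small multiplexor that fits in the last stage. A predictor with several tables would take a different inflight bit for each table, as the slide notes.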

15 Ahead conditional branch predictor
- Global history branch predictor
- 512 Kbits
- 2bc-gskew
- 4 cycles prediction delay:
  - Indexed using 5 blocks ahead (address + history)
  - One bit per table per intermediate block
Accuracy equivalent to conventional conditional branch predictors
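For readers unfamiliar with 2bc-gskew, the sketch below shows only how the final prediction is combined from its four banks of 2-bit counters (bimodal, two skewed global-history banks, and a meta bank). Bank indexing with the ahead address, history and inflight bits is deliberately left out, and the polarity of the meta counter is an assumption of this sketch.

```python
def taken(counter):
    """A 2-bit saturating counter (0..3) predicts taken when it is 2 or 3."""
    return counter >= 2


def predict_2bc_gskew(bim, g0, g1, meta):
    """Combine the four counters read from the BIM, G0, G1 and Meta banks."""
    votes = taken(bim) + taken(g0) + taken(g1)
    majority = votes >= 2
    # Assumed polarity: a high meta counter selects the majority vote,
    # a low one falls back to the bimodal prediction.
    return majority if taken(meta) else taken(bim)


print(predict_2bc_gskew(bim=0, g0=3, g1=3, meta=3))  # majority vote is used
print(predict_2bc_gskew(bim=0, g0=3, g1=3, meta=0))  # bimodal prediction is used
```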

16 Ahead Branch Target Prediction
- Use of a BTB
- Tradeoffs:
  - Size: 2Kb/16Kb, 1 vs 2 cycles
  - Tagless or tags (+1 cycle)
  - Associativity: +1 cycle?
- Difficulty:
  - The longer the pipelined read, the larger the number of possible paths, especially if not-taken branches are bypassed

17 Ahead Branch Target Prediction (2)
- 2Kb is too small:
  - 16Kb, 2 cycles access time
- Associativity is required, but an extra cycle for tag check is disastrous:
  - tagless + way-prediction!
  - 2-way skewed-associativity: to incorporate different inflight information on the two ways

18 Ahead Branch Target Prediction (3)
- Bypassing not-taken branches:
  - N possible branches: N possible targets
  - N targets per entry: waste of space, many single entries used
  - Read of N contiguous entries

19 Ahead jump predictor, return stack
- Jump predictor: history + address
  - 3 cycles ahead + inflight information
- Return address stack as usual:
  - Direct access to the previous cycle top
  - Access to the possible address to be pushed
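One possible reading of the two return-stack bullets is that the top seen by the current block is either the top from the previous cycle or the address the previous block is about to push. The sketch below illustrates that reading only; the class, the helper `ahead_top` and the stack depth are hypothetical, and the case where the previous block was itself a return is not modelled.

```python
class ReturnAddressStack:
    """Plain return address stack; the oldest entry is dropped on overflow."""

    def __init__(self, depth=16):  # depth is an assumed value
        self.depth = depth
        self.entries = []

    def top(self):
        return self.entries[-1] if self.entries else None

    def push(self, addr):
        if len(self.entries) == self.depth:
            self.entries.pop(0)
        self.entries.append(addr)

    def pop(self):
        return self.entries.pop() if self.entries else None


def ahead_top(prev_cycle_top, pending_push, prev_block_is_call):
    """Pick the return target without waiting for the stack update to complete."""
    return pending_push if prev_block_is_call else prev_cycle_top


ras = ReturnAddressStack()
ras.push(0x400)
prev_top = ras.top()
print(hex(ahead_top(prev_top, 0x800, prev_block_is_call=True)))
print(hex(ahead_top(prev_top, 0x800, prev_block_is_call=False)))
```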

20 Decode information!!
- Needs the decode information of the current block to select among:
  - Fallthrough
  - Branch targets
  - Jump target
  - Return address stack top
Cannot be obtained from the I-cache, must be predicted!
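The role of decode information in the final selection stage can be sketched as a small multiplexor. The `Exit` enumeration and the function signature are hypothetical; the slides do not give an encoding for the decode information, and a real block may contain several branches.

```python
from enum import Enum


class Exit(Enum):
    """Assumed summary of (predicted) decode information: how the block ends."""
    FALLTHROUGH = 0
    COND_BRANCH = 1
    JUMP = 2
    RETURN = 3


def select_next(decode_exit, cond_taken, fallthrough, branch_target, jump_target, ras_top):
    """Final selection: choose the next block address among the candidates."""
    if decode_exit is Exit.RETURN:
        return ras_top
    if decode_exit is Exit.JUMP:
        return jump_target
    if decode_exit is Exit.COND_BRANCH and cond_taken:
        return branch_target
    return fallthrough


candidates = dict(fallthrough=0x1040, branch_target=0x2000, jump_target=0x3000, ras_top=0x4000)
print(hex(select_next(Exit.COND_BRANCH, True, **candidates)))
print(hex(select_next(Exit.COND_BRANCH, False, **candidates)))
```

Every candidate address can be prepared ahead of time, but `decode_exit` describes the block currently being fetched, which is why it has to be predicted rather than read from the I-cache.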


22 Ahead pipelined IAG and decode
[Figure: block diagram. Labels: Final Selection, Cond., Predicted Decode, RAS, Jump, BTB, Decode Info ??]

23 Decode information (2)
- 1st try (not in the paper): in the BTB
  - Entries without targets (more capacity misses)
  - Duplicated decode information if multiple targets
  - Decode mispredictions, but correct target prediction by jump predictor, RAS, fallthrough
Not very convincing results

24 Predicting Decode information
Principle: the decode information for a block is associated with the block address itself whenever possible
- Whenever possible:
  - BTB, jump predictor
- But not for returns and fallthrough blocks

25 Predicting decode information (2)
- Return block decode prediction:
  - Return decode prediction table indexed with the (one cycle ahead) top of the return stack
Systematic decode misprediction when a call and its associated return are chained in two successive blocks. Sorry!

26 Fallthrough block decode prediction
- Just after a taken control-flow instruction, there is no time to read an extra table before the final selection stage:
  - Store decode information for block A+1 with address A in the BTB and jump predictor
- Fall through after fall through:
  - 2-block ahead decode prediction table: A to read A+2 decode info
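The 2-block ahead decode table can be illustrated with a toy model. This is a sketch under assumptions: addresses are counted in fetch-block units, a Python dict stands in for what would be a finite hardware table, and the class and method names are hypothetical.

```python
class TwoBlockAheadDecodeTable:
    """Toy model: the read started with address A returns decode info for block A+2."""

    def __init__(self):
        self.table = {}  # assumed unbounded; a real table is finite

    def update(self, block_addr, decode_info):
        # Decode info of `block_addr` is stored where block_addr - 2 will look it up.
        self.table[block_addr - 2] = decode_info

    def predict(self, addr_a):
        # Indexed with A, two blocks before the information is needed.
        return self.table.get(addr_a)


dec = TwoBlockAheadDecodeTable()
dec.update(102, "ends with cond branch")  # learned when block 102 was decoded
print(dec.predict(100))                   # read started while fetching block 100
```

Indexing with A instead of A+2 gives the read two cycles of slack, which is what a run of consecutive fallthrough blocks needs.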

27 Hierarchical IAG vs ahead pipelined IAG
[Figure: timing diagram on a cycle axis (ticks -4, -3, -2, 0, 1, 2, 3). Labels: CB, BTB, JP, Ret dec, 2A dec, selection completed, IAG init., Decode pred check, LP selection, LP check]

28 Recovering after a misprediction
- Generation of the address for the next block after recovery should have begun 4 cycles before recovery:
  - 4 cycles extra misprediction penalty!? Unacceptable
- Checkpoint/repair mechanism should provide information for a smooth restart

29 Recovering after a misprediction (2): predicting the next block
- There is no time to read any of the tables in the IAG, only time to cross the final selection stage
- The checkpoint must provide everything that was at the entry of the final stage one cycle ahead
Less than 200 bits for all policies, except bypassing all not-taken branches

30 Recovering from misprediction (3): 3rd block and subsequent
3rd block
- BTB and jump predictor cannot provide targets in time:
  - All possible targets must come from the checkpoint
- Conditional branch predictions are not available
4th and 5th block
- Conditional branch predictions are not available
(Approximately) 4 possible sets of entries for the final selection stage in the checkpoint repair, but 2 cycles access time.

31 Recovering from misprediction (4): one or two bubbles restart
- If resuming I-fetch at full speed is too costly in the checkpoint mechanism, then:
  - Two bubbles restart: all possible targets recomputed/predicted; only conditional predictions for the next and third block to be checkpointed
  - One bubble restart: only one set of exits from BTB, jump predictor + conditional branch predictions to be checkpointed

32 Performance evaluation
- SPEC 2000 int + float
- Traces of 100 million instructions after skipping the initialization
- Immediate update

33 Average fetch block size (integer applications)
- Instruction streams: 9.79 instructions
- 2 * 8 inst/cache line
  - No bypassing of not-taken branches: 5.81 instructions
  - Bypassing one not-taken branch: 7.28 instructions
  - Bypassing all not-taken branches: 7.40 instructions
Bypassing a single not-taken branch is sufficient
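The conclusion of this slide follows from the relative gains between the three bypassing policies. The short calculation below uses only the averages quoted on the slide; the dictionary keys are hypothetical labels.

```python
# Average fetch block sizes (instructions) quoted on the slide.
sizes = {"no_bypass": 5.81, "bypass_one": 7.28, "bypass_all": 7.40, "streams": 9.79}

gain_one = sizes["bypass_one"] / sizes["no_bypass"] - 1
gain_all_over_one = sizes["bypass_all"] / sizes["bypass_one"] - 1

print(f"bypassing one branch vs none: +{gain_one:.1%}")           # +25.3%
print(f"bypassing all vs one branch:  +{gain_all_over_one:.1%}")  # +1.6%
```

Bypassing the first branch captures almost all of the available gain, while bypassing every further branch adds under 2%.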

34 Accuracy of IAGs (integer benchmarks)
[Figure: misp/KI]
Very similar accuracies

35 Misfetches: IAG faults corrected at decode time
[Figure: misfetches/KI, integer applications]

36 Misfetches (2)
[Figure: misfetches/KI, floating-point applications]

37 (Maximum) Instruction fetch bandwidth
[Figure: integer applications]

38 Summary
- Ahead pipelining the IAG is a valid alternative to the use of a hierarchy of IAGs:
  - Accuracy in the same range
  - Significantly higher instruction bandwidth
- Main contributions:
  - Decode prediction
  - Checkpoint/repair analysis

39 Future work
- Sufficient to feed 4-to-6 way processors
- May be a little bit short for 8-way or more processors
- We plan to extend the study to:
  - Decoupled instruction fetch front end (Reinman et al.)
  - Multiple (non-contiguous) block address generation
  - Trace caches

