Published by Osborne Blair. Modified over 8 years ago.
1
Effective ahead pipelining of instruction block address generation. André Seznec and Antony Fraboulet, IRISA/INRIA
2
AS-AF Caps Team. Instruction fetch on wide-issue superscalar processors. Fetching 6-10 instructions in parallel on each cycle is not the real issue: fetch can be pipelined, the I-cache can be banked, and instruction streams are “relatively long”. Next block address generation is critical.
3
Instruction block address generation. One block per cycle. Speculative: accuracy is critical. Accuracy comes with hardware complexity: conditional branch predictor; sequential block address computation; return address stack read; jump prediction; branch target prediction/computation; final address selection.
4
Using a complex instruction address generator (IAG). PentiumPro, Alpha EV6, Alpha EV8: a fast but (relatively) inaccurate IAG responding in a single cycle, backed by a complex multicycle IAG. Loss of a significant part of instruction bandwidth; overfetching implemented on Alpha EV8. Trend to deeper pipelines: smaller line predictor, hence less accurate; deeper pipelined IAG, hence longer misfetch penalty. Example: 10% misfetches with a 3-cycle penalty give a 30% bandwidth loss.
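The closing figure on this slide can be reproduced with a first-order model. The sketch below is an assumption, not the authors' model: it charges every misfetched block a fixed number of lost fetch cycles and assumes one block per cycle otherwise. The function names are hypothetical.

```python
def extra_cycle_fraction(misfetch_rate: float, penalty_cycles: int) -> float:
    """Extra fetch cycles per fetched block (the first-order 'loss')."""
    return misfetch_rate * penalty_cycles


def effective_bandwidth(misfetch_rate: float, penalty_cycles: int) -> float:
    """Fraction of peak bandwidth kept, assuming one block per cycle at peak."""
    return 1.0 / (1.0 + extra_cycle_fraction(misfetch_rate, penalty_cycles))
```

With 10% misfetches and a 3-cycle penalty, the first function gives the 30% of extra cycles quoted on the slide; under this simple model that corresponds to keeping about 77% of the peak fetch bandwidth.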
5
Ahead pipelining the IAG. Suggested with multiple block ahead prediction in Seznec, Jourdan, Sainrat and Michaud (ASPLOS 96). Conventional IAG: use information available with block A to predict block B. Multiple block ahead prediction: use information available with block A to predict block C (or D). This paper: how to really do it.
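The difference between conventional and multiple block ahead prediction can be illustrated with a toy last-value table. This is only a sketch under simplifying assumptions (a dictionary instead of a finite hardware table, no history); the class name `AheadPredictor` is hypothetical.

```python
class AheadPredictor:
    """Toy last-value predictor: block i predicts block i + distance."""

    def __init__(self, distance: int):
        self.distance = distance
        self.table = {}

    def train(self, trace):
        # Record, for every block, the block seen `distance` steps later.
        for i in range(len(trace) - self.distance):
            self.table[trace[i]] = trace[i + self.distance]

    def predict(self, block):
        # Returns None when the block has never been seen.
        return self.table.get(block)
```

With `distance=1` the table behaves like a conventional IAG (A predicts B); with `distance=2` the read can start one block earlier, since A predicts C.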
6
Fetch blocks for us. [Figure; label: 1st inst.] Rc: B inst/fetch block. Rs: ends with the cache block. Rl: ends with the second block.
7
Fetch blocks (2). [Figure; labels: Cond, 1st inst.] 1nt: 1NT bypassing. 0nt: no NT bypassing. aNT: all NT bypassing.
8
Fetch blocks (3). [Figures; labels: Cond, Uncond, 0nt, 0NT; Cond, 0nt.] 0NT+: no extra cond. 0NT: no extra CTI.
9
Technological hypothesis. Alpha EV8 front-end with a clock twice as fast: cycle = 1 EV8 cycle phase; cycle = time to cross an 8- to 10-entry multiplexer; cycle = time to read a 2Kb table and route the read data back as an index. 2 cycles to read a 16Kb table and route the read data back as an index.
10
Hierarchical IAG: complex IAG + line predictor. Conventional complex IAG spans four cycles: 3 cycles for conditional branch prediction; 3 cycles for I-cache read and branch target computation; jump prediction, return stack read; + 1 cycle for final address selection. Line prediction: a single 2Kb table + a 1-bit direction table to select between fallthrough and line predictor read.
11
Hierarchical IAG (2). [Block diagram; labels: LP, RAS, Pred Check, Cond., Jump Pred, Final Selection, Branch target addresses + decode info.]
12
Ahead pipelining the IAG. Same functionalities required. Final selection: uses the last cycle in the address generation. Conditional branch predictor. Jump predictor. Return address stack. Branch target prediction: recomputation impossible! Use of a branch target buffer. Decode information: MUST BE PREDICTED.
13
Ahead pipelining the IAG (2). Initiate table reads N cycles ahead with the information available: N-block ahead address; N-block ahead branch history. Problem: significant loss of accuracy. Use of N-block ahead (address, history) + intermediate path!
14
Inflight path incorporation. Pipelined table access + final logic: column selection; wordline selection; final logic. Insert one bit of inflight information per intermediate block. Not the same bit for each table in the predictor!
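One way to picture inflight path incorporation is to split the table index into an early part and a late part. The sketch below is an assumption made for illustration: the XOR hash, the `row_bits` parameter, and the choice of feeding the late bits into column selection are not taken from the slides, and `ahead_index` is a hypothetical name.

```python
def ahead_index(ahead_addr: int, ahead_hist: int, inflight_bits, row_bits: int = 8):
    """Split a table index into an early part and a late part.

    The row is chosen from information known N blocks ahead. The column
    is built from one bit per intermediate block, which only becomes
    known while the table read is already in flight.
    """
    row = (ahead_addr ^ ahead_hist) & ((1 << row_bits) - 1)
    col = 0
    for bit in inflight_bits:  # oldest intermediate block first
        col = (col << 1) | (bit & 1)
    return row, col
```

Two different paths between the ahead block and the predicted block start the same row read but end in different columns, which is what lets the predictor recover part of the accuracy lost by indexing ahead.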
15
Ahead conditional branch predictor. Global history branch predictor: 512 Kbits 2bc-gskew. 4 cycles prediction delay: indexed using 5 blocks ahead (address + history); one bit per table per intermediate block. Accuracy equivalent to conventional conditional branch predictors.
16
Ahead branch target prediction. Use of a BTB. Tradeoffs: size: 2Kb/16Kb, 1 vs 2 cycles; tagless or tags (+1 cycle); associativity: +1 cycle? Difficulty: the longer the pipelined read, the larger the number of possible paths, especially if not-taken branches are bypassed.
17
Ahead branch target prediction (2). 2Kb is too small: 16Kb, 2 cycles access time. Associativity is required, but an extra cycle for tag check is disastrous: tagless + way-prediction! 2-way skewed associativity: to incorporate different inflight information on the two ways.
18
Ahead branch target prediction (3). Bypassing not-taken branches: N possible branches, N possible targets. N targets per entry: waste of space, since many entries use a single target. Read of N contiguous entries.
19
Ahead jump predictor, return stack. Jump predictor: history + address 3 cycles ahead + inflight information. Return address stack as usual: direct access to the previous cycle top; access to the possible address to be pushed.
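The two return stack accesses listed on this slide can be sketched as follows. This is a minimal illustration, not the authors' design: the stack depth, the overflow policy, and the names `ReturnAddressStack` and `ras_candidate` are assumptions.

```python
class ReturnAddressStack:
    """Toy fixed-depth return address stack (depth is an assumed parameter)."""

    def __init__(self, depth: int = 16):
        self.depth = depth
        self.entries = []

    def push(self, return_addr: int) -> None:
        # On a predicted call: drop the oldest entry when the stack is full.
        if len(self.entries) == self.depth:
            self.entries.pop(0)
        self.entries.append(return_addr)

    def top(self):
        # Address a predicted return would use as the next block address.
        return self.entries[-1] if self.entries else None

    def pop(self):
        return self.entries.pop() if self.entries else None


def ras_candidate(ras: ReturnAddressStack, pending_push=None):
    """Return-target candidate offered to the final selection stage.

    If the previous block ended with a call whose push is not yet visible,
    the address about to be pushed replaces the previous-cycle top.
    """
    return pending_push if pending_push is not None else ras.top()
```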
20
Decode information!! Needs the decode information of the current block to select among: fallthrough; branch targets; jump target; return address stack top. Cannot be obtained from the I-cache, must be predicted!
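The role of decode information in the final selection can be sketched as a simple multiplexer. The block-kind labels (`'none'`, `'cond'`, `'uncond'`, `'jump'`, `'return'`), the `Candidates` record, and the function name are assumptions made for this illustration; the slides only list the four sources being selected among.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidates:
    fallthrough: int
    branch_target: Optional[int]
    jump_target: Optional[int]
    ras_top: Optional[int]


def final_selection(kind: str, taken: bool, c: Candidates) -> int:
    """Pick the next block address from (predicted) decode information."""
    if kind == "return" and c.ras_top is not None:
        return c.ras_top
    if kind == "jump" and c.jump_target is not None:
        return c.jump_target
    if kind == "uncond" and c.branch_target is not None:
        return c.branch_target
    if kind == "cond" and taken and c.branch_target is not None:
        return c.branch_target
    # No control-flow instruction, not-taken branch, or missing target.
    return c.fallthrough
```

Without `kind`, the selection cannot be made, which is why an ahead pipelined IAG has to predict the decode information instead of waiting for the I-cache read.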
22
Ahead pipelined IAG and decode. [Block diagram; labels: Final Selection, Cond., Predicted Decode, RAS, Jump, BTB, Decode Info ??]
23
Decode information (2). 1st try (not in the paper): in the BTB. Entries without targets (more capacity misses). Duplicated decode information if multiple targets. Decode mispredictions, but correct target prediction by jump predictor, RAS, fallthrough. Not very convincing results.
24
Predicting decode information. Principle: decode information associated with a block is associated with the block address itself whenever possible. Whenever possible: BTB, jump predictor. But not for returns and fallthrough blocks.
25
Predicting decode information (2). Return block decode prediction: return decode prediction table indexed with (one cycle ahead) top of the return stack. Systematic decode misprediction if chaining a call and the associated return in two successive blocks. Sorry!
26
Fallthrough block decode prediction. Just after a taken control-flow instruction, there is no time to read an extra table before the final selection stage: store decode information for block A+1 with address A in BTB, jump predictor. Fall through after fall through: 2-block ahead decode prediction table: use A to read A+2 decode info.
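The 2-block ahead decode prediction table can be pictured as a table indexed with block A that holds the decode information of block A+2. The sketch below is an assumption-laden illustration: addresses are counted in block units, the table is an unbounded dictionary, and the class and method names are hypothetical.

```python
class TwoAheadDecodeTable:
    """Toy table: indexed with block A, holds decode info of block A + 2."""

    def __init__(self):
        self.table = {}

    def update(self, block: int, decode_info: str) -> None:
        # Decode info of `block` is stored under the block two positions earlier.
        self.table[block - 2] = decode_info

    def predict_for_second_next(self, block: int):
        # Reading with A yields the predicted decode info of A + 2, if known.
        return self.table.get(block)
```

Because the read is started with A, the result is available by the time two consecutive fallthroughs reach A+2.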
27
Hierarchical IAG vs ahead pipelined IAG. [Timing diagram over cycles -4 to 3; labels: CB, BTB, JP, Ret dec, 2A dec, selection, completed IAG init., Decode pred check, LP selection, LP check.]
28
Recovering after a misprediction. Generation of the address for the next block after recovery should have begun 4 cycles before recovery: 4 cycles extra misprediction penalty!? Unacceptable. The checkpoint/repair mechanism should provide information for a smooth restart.
29
Recovering after a misprediction (2): predicting the next block. There is no time to read any of the tables in the IAG, only time to cross the final selection stage. The checkpoint must provide everything that was at the entry of the final stage one cycle ahead. Less than 200 bits for all policies, except bypassing all not-taken branches.
30
Recovering from misprediction (3): 3rd block and subsequent. 3rd block: BTB and jump predictor cannot provide targets in time, so all possible targets must come from the checkpoint; conditional branch predictions are not available. 4th and 5th block: conditional branch predictions are not available. (Approximately) 4 possible sets of entries for the final selection stage in the checkpoint repair, but 2 cycles access time.
31
Recovering from misprediction (4): one or two bubbles restart. If resuming I-fetch at full speed is too costly in the checkpoint mechanism, then: Two bubbles restart: all possible targets recomputed/predicted; only conditional predictions for next and third block to be checkpointed. One bubble restart: only one set of exits from BTB, jump predictor + conditional branch predictions to be checkpointed.
32
Performance evaluation. SPEC 2000 int + float. Traces of 100 million instructions after skipping the initialization. Immediate update.
33
Average fetch block size (integer applications). Instruction streams: 9.79 instructions. 2 * 8 inst/cache line. No bypassing of not-taken branches: 5.81 instructions. Bypassing one not-taken branch: 7.28 instructions. Bypassing all not-taken branches: 7.40 instructions. Bypassing a single not-taken branch is sufficient.
34
Accuracy of IAGs (integer benchmarks). [Chart; metric: misp/KI.] Very similar accuracies.
35
Misfetches: IAG faults corrected at decode time. [Chart; metric: misfetches/KI; integer applications.]
36
Misfetches (2). [Chart; metric: misfetches/KI; floating-point applications.]
37
(Maximum) instruction fetch bandwidth. [Chart; integer applications.]
38
Summary. Ahead pipelining the IAG is a valid alternative to the use of a hierarchy of IAGs: accuracy in the same range; significantly higher instruction bandwidth. Main contributions: decode prediction; checkpoint/repair analysis.
39
Future work. Sufficient to feed 4-to-6 way processors. May be a little bit short for 8-way or more processors. We plan to extend the study to: decoupled instruction fetch front end (Reinman et al.); multiple (non-contiguous) block address generation; trace caches.