Slide 1: Accurate Analytical Modeling of Superscalar Processors
J. E. Smith and Tejas Karkhanis
October 27, 2003. Copyright J. E. Smith, 2003.

Slide 2: Superscalar Processor Evaluation
- Processors are typically evaluated via simulation:
  - highly detailed simulator
  - many cycles of simulation
  - has a black-box character; provides little insight
- Workload implications:
  - All workload characteristics are needed for detailed simulation, BUT not all are critical for determining performance.
  - The workload space is limited to specific benchmarks.
- Alternative approach: use an analytical model.
Slide 3: Analytical Approach
- Analytical model driven by relevant benchmark properties.
- Helps isolate important workload characteristics: if the performance estimate is accurate, then the workload characteristics must be the important ones.
- Workload characteristics can be varied over a "workload space".
- Apply characteristics directly by short-circuiting simulation.
Slide 4: Basis for Model
- Consider a profile of dynamic instructions issued per cycle: a background constant IPC with a never-ending series of transient events.
- Determine performance with ideal caches and predictors, then account for transient events.
- (Figure: IPC vs. time, with dips for branch mispredicts, i-cache misses, and long d-cache misses.)
Slide 5: IBID Model
- Based on a generic superscalar processor.
- Useful for reasoning about transient events.
Slide 6: Series/Parallel Performance Penalties
- Branch misprediction and I-cache miss penalties "serialize", i.e., the penalties add linearly.
- Long D-cache misses may overlap with I-cache and branch-predict misses (and with each other); overlap with other long D-cache misses is more important.
- Short D-cache misses are handled differently (later).
- (Figure: branch mispredicts, I-cache misses, long D-cache misses.)
Slide 7: Validating the Series/Parallel Model
- Combined: simulated performance with realistic caches/predictor.
- Independent: ideal performance minus individually determined performance losses.
- Overlap compensated: account for overlaps with D-cache misses.
- Configuration: 4-way issue, 48-entry window, 128-entry ROB, 16K I-cache and D-cache, 8K gshare branch predictor.
Slide 8: I-W Characteristic
- Key result (Michaud, Seznec, Jourdan): a square-root relationship between issue rate (I) and window size (W).
Slide 9: Similar Experiment
- Ideal caches and predictor; efficient I-fetch keeps the window full.
- Graph issue rate I as a function of window size W.
- Straight lines on a log-log graph.
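The straight-line-on-log-log observation can be checked numerically: fitting I = a * W^b by linear regression in log space should recover an exponent b near 0.5. A minimal sketch, where the (window size, issue rate) samples are illustrative stand-ins for simulator measurements, not data from the talk:

```python
import math

# Hypothetical (window size W, issue rate I) pairs; a real study would
# obtain these from simulation with ideal caches and an ideal predictor.
samples = [(8, 1.4), (16, 2.0), (32, 2.8), (64, 4.0)]

# Fit I = a * W**b by least squares on log-log data:
# log I = log a + b * log W.
xs = [math.log(w) for w, _ in samples]
ys = [math.log(i) for _, i in samples]
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
a = math.exp(my - b * mx)
print(f"I ~ {a:.3f} * W^{b:.3f}")  # exponent near 0.5 indicates the sqrt law
```

An exponent close to 0.5 confirms the square-root I-W characteristic; a markedly different slope would suggest the window, not fetch, is no longer the limiter.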
Slide 10: I-W Characteristic
- Allows determination of the "background" IPC.
- Allows evaluation of transients to determine penalties.
- (Figure: IPC vs. time, with dips for branch mispredicts, i-cache misses, and long d-cache misses.)
Slide 11: Transient #1: Branch Mispredictions
Typical behavior:
- steady state
- mispredicted branch enters the window; misspeculated instructions issue until the misprediction is detected
- flush pipeline
- re-fill pipeline; instructions re-enter the window
- issue ramps back up to steady state
Slide 12: Branch Misprediction Penalty
1) Lost opportunity: performance lost by issuing soon-to-be-flushed instructions.
2) Pipeline re-fill penalty: the obvious penalty; most people equate this with the whole penalty.
3) Window-fill penalty: performance lost due to window startup.
Slide 13: Use Sqrt Model
Slide 14: Experimental Data
Slide 15: Branch Mispredict Penalty
- Short pipeline = 5 stages before issue; long pipeline = 10 stages before issue.
- Insight from the analytical model: the penalty from drain/fill is significant, and the penalty is similar across all benchmarks for a given pipeline length.
Slide 16: Implication of Wider Pipes
- Assume 1 mispredict every 96 instructions, e.g., the SPEC benchmark crafty with a 4K gshare predictor.
- Graph a full mispredict "cycle".
- Issue = 8 gives a very modest improvement vs. issue = 4 (the window is never full enough to issue 8).
- Issue = 4 barely reaches peak performance.
Slide 17: Importance of Branch Prediction
- Insight: doubling the issue width means the predictor has to be four times better for a similar performance profile (issue efficiency).
Slide 18: Implication of Deeper Pipelines
- Assume 1 misprediction per 96 instructions; vary the fetch/decode/rename section of the pipe.
- The advantage of wide issue diminishes as the pipe deepens.
- Pentium 4: decode pipe depth = 15 and issue width = 3.
Slide 19: Transient #2: I-Cache Misses
Timeline:
- steady state
- cache miss occurs; window drains
- instructions buffered in the decode pipe continue to enter the window during the miss delay
- instructions fill the decode pipe and re-enter the window
- issue ramps back up to steady state
Slide 20: I-Cache Miss Penalty
- Penalty = miss delay (L2 or memory latency) minus window drain, plus window re-fill penalty.
- Instructions buffered in the window offset the re-fill penalty.
- Insight: the penalty is independent of pipeline length; instructions buffered in the pipe compensate for pipe re-fill.
Slide 21: I-Cache Miss Penalty
- Estimated I-cache penalty for n consecutive (clustered) misses:
  Avg. miss penalty = (miss delay - drain + fill + (n - 1)(miss delay - 1)) / n
                    ~ miss delay - 1 + 1/n
- For an isolated miss (n = 1): miss delay.
- For a long cluster: miss delay - 1.
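The formula above can be sketched directly. A minimal version, assuming (as the slide's approximation does) that the window drain and re-fill terms roughly cancel; the function and parameter names are mine, not from the talk:

```python
# Average penalty for a cluster of n consecutive I-cache misses.
def avg_icache_penalty(miss_delay, drain, fill, n):
    # The first miss pays miss_delay - drain + fill; each of the n - 1
    # follow-on misses in the cluster pays miss_delay - 1.
    return (miss_delay - drain + fill + (n - 1) * (miss_delay - 1)) / n

print(avg_icache_penalty(10, 3, 3, 1))  # isolated miss: 10.0 (full delay)
print(avg_icache_penalty(10, 3, 3, 8))  # cluster: approaches delay - 1
```

With drain = fill, the expression reduces to miss_delay - 1 + 1/n, matching the slide's two limiting cases.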
Slide 22: Independence from Pipe Length
- 16K I-cache; ideal D-cache and predictor.
- Two different pipeline lengths (4 and 8 cycles); I-cache miss delay of 10 cycles.
- The penalty is independent of pipe length and similar across benchmarks.
Slide 23: Reducing Miss Penalty: I-Caches
- Add an I-fetch buffer: it overlaps execution with miss handling and is bypassed by miss instructions.
- To be effective, it should be enhanced with fetch bandwidth greater than the issue width.
- (Figure: the buffer increases the instructions buffered in the decode pipe without increasing the pipe re-fill time.)
Slide 24: Transient #3: D-Cache Misses
- More complex than front-end miss events: branch mispredicts and I-cache misses block I-fetch, but data cache misses can be handled in parallel with I-fetch and execution.
- Divide into:
  - short misses: handle like a long-latency functional unit
  - long misses: get special treatment
Slide 25: D-Cache Long Miss Penalty
Three things can reduce performance:
1) Structural hazard: the ROB fills up behind the load (or an instruction dependent on the load) and dispatch stalls.
2) Data dependences: instructions dependent on the load pile up and stall the window.
3) Control dependences: a mispredicted branch depends on the load data, and instructions beyond the branch are wasted.
Slide 26: ROB Blockage Experiment
- Window size 32, issue width 4, ROB size 64.
- Ideal branch prediction; cache miss delay of 1000 cycles.
- Simulate sampled, isolated cache misses and observe what happens.
Slide 27: Results

Benchmark   Avg. # insts issued after miss   # insts in window after miss   Fract. samples where ROB fills
Bzip2       44.1                             13.1                           1.0
Crafty      44.6                              9.6                           0.9
Eon         55.2                              6.0                           1.0
Gap         56.8                             10.7                           1.0
Gcc         51.7                              8.2                           0.9
Mcf         55.8                              5.5                           0.9
Parser      44.2                              7.4                           1.0
Twolf       49.6                             12.9                           0.8
Vortex      49.7                              3.5                           1.0
Vpr         27.0                             16.9                           0.6

- The full ROB stalls most of the time.
- Relatively few dependent instructions pile up in the window.
Slide 28: D-Cache Miss Penalty
- For typical ROBs, data and control dependences are not the limiters; assume a structural (ROB) stall.
- If the load is at the tail of the window:
  Penalty = miss delay - ROB fill - window drain + ramp-up ~ miss delay - ROB fill
- If the load is at the head of the window:
  Penalty = miss delay - window drain + ramp-up ~ miss delay
- If a second long load miss is within ROB distance of the first, its penalty is completely overlapped.
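The two cases above can be sketched as one function. A hedged sketch under the slide's structural-stall assumption; the function and parameter names are mine, and the numeric values are illustrative, chosen so that drain and ramp-up cancel as in the slide's approximations:

```python
# Penalty of a long D-cache miss under an ROB-full (structural) stall.
def long_miss_penalty(miss_delay, rob_fill, drain, ramp_up, load_at_tail):
    if load_at_tail:
        # The ROB keeps filling behind the load, hiding part of the delay.
        return miss_delay - rob_fill - drain + ramp_up
    # Load at the head of the window: almost nothing is hidden.
    return miss_delay - drain + ramp_up

# With drain ~ ramp-up, the slide's approximations fall out:
print(long_miss_penalty(200, 16, 5, 5, load_at_tail=True))   # delay - ROB fill
print(long_miss_penalty(200, 16, 5, 5, load_at_tail=False))  # full delay
```

A second long miss within ROB distance of the first would contribute no additional penalty at all, which is why only clustering relative to ROB size matters for D-cache misses.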
Slide 29: Transient #3: D-Cache Misses
Timeline:
- steady state
- D-cache miss occurs; window drains
- ROB fills
- miss delay
- miss data returns
- commit resumes; issue ramps to steady state
Slide 30: Reducing Miss Penalty: D-Caches
- Enlarge the ROB, window, and rename resources to overlap the miss delay with execution.
- (Figure timeline: steady state; D-cache miss occurs; ROB fills; window drains; independent instructions execute during the miss delay while the ROB is full; miss data returns; commit resumes; issue ramps back up to steady state.)
Slide 31: Put It Together
- Issue width 4, window size 48 => peak CPI.
- 8-cycle L1 I-cache miss delay; 200-cycle L2 cache miss delay (both I and D); 6.4-cycle branch mispredict delay (4 in pipeline).
- Performance (cycles) =
    #insts * peak CPI
  + (total # branch mispredicts - mispredicts within ROB size of a long miss) * penalty
  + (total # I-cache misses - misses within ROB size of a long miss) * penalty
  + (total # long misses - long misses within ROB size of a long miss) * penalty
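The combined model is just a weighted sum. A minimal sketch, assuming each event count has already been reduced by the events that fall within ROB size of a long D-cache miss (whose cost is overlapped); the counts below are illustrative assumptions, not measured data, while the penalties are the delays quoted on the slide:

```python
# Total execution time from the additive first-order model.
def total_cycles(n_insts, peak_cpi,
                 mispredicts, mispredict_penalty,
                 icache_misses, icache_penalty,
                 long_misses, long_miss_penalty):
    # Background issue time plus one serialized penalty per visible event.
    return (n_insts * peak_cpi
            + mispredicts * mispredict_penalty
            + icache_misses * icache_penalty
            + long_misses * long_miss_penalty)

# 1M instructions on a 4-wide machine (peak CPI 0.25), with the slide's
# 6.4-cycle mispredict, 8-cycle L1-I, and 200-cycle long-miss delays:
print(total_cycles(1_000_000, 0.25, 10_000, 6.4, 2_000, 8, 1_000, 200))
```

Each term can be computed from a trace-driven cache or predictor simulation alone, which is what lets the model short-circuit detailed timing simulation.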
Slide 32: Compare with Detailed Simulation
- Very accurate; the greatest inaccuracy comes from D-cache long misses.
Slide 33: Important Workload Characteristics
Slide 34: Conclusions: Key Workload Characteristics
- Instruction dependences are important for establishing the background (ideal) IPC, not for performance penalties.
- All "major" events are important: branch mispredicts, I-cache misses (both short and long), and D-cache misses (long). But ONLY "major" events are important in a well-balanced design.
- Clustering of events matters only for D-cache misses: is the miss within ROB distance of the preceding miss?
Slide 35: Conclusions: Performance Evaluation
- Accurate analytical models can (and should) be developed.
- Trace-driven cache/predictor simulators have an important role.
- Hybrid analytical/simulation models should also be considered: combine real address streams with analytical processor models; statistical simulation.
- If you really need detailed simulation, you're not doing research, you're doing development!