Slide 1: Accurate Analytical Modeling of Superscalar Processors
J. E. Smith and Tejas Karkhanis
October 27, 2003. Copyright J. E. Smith, 2003.

Slide 2: Superscalar Processor Evaluation
- Processors are typically evaluated via simulation:
  - highly detailed simulator
  - many cycles of simulation
  - has a black-box character; provides little insight
- Workload implications:
  - All workload characteristics are needed for detailed simulation, BUT not all are critical for determining performance.
  - The workload space is limited to specific benchmarks.
- Alternative approach: use an analytical model.
Slide 3: Analytical Approach
- Analytical model driven by relevant benchmark properties.
- Helps isolate important workload characteristics: if the performance estimate is accurate, then the workload characteristics must be the important ones.
- Workload characteristics can be varied over a "workload space".
- Apply characteristics directly by short-circuiting simulation.
Slide 4: Basis for Model
- Consider a profile of dynamic instructions issued per cycle: a background constant IPC with a never-ending series of transient events.
- Determine performance with ideal caches and predictors, then account for transient events.
- (Figure: IPC vs. time, with dips for branch mispredicts, i-cache misses, and long d-cache misses.)
Slide 5: IBID Model
- Based on a generic superscalar processor.
- Useful for reasoning about transient events.
Slide 6: Series/Parallel Performance Penalties
- Branch misprediction and I-cache miss penalties "serialize", i.e., the penalties add linearly.
- Long D-cache misses may overlap with I-cache and branch-predict misses (and with each other); overlap with other long D-cache misses is more important.
- Short D-cache misses are handled differently (later).
- (Figure: branch mispredicts, I-cache misses, long D-cache misses.)
Slide 7: Validating the Series/Parallel Model
- Combined: simulated performance with realistic caches/predictor.
- Independent: ideal performance minus individually determined performance losses.
- Overlap compensated: account for overlaps with D-cache misses.
- Configuration: 4-way issue, 48-entry window, 128-entry ROB, 16K I-cache and D-cache, 8K gshare branch predictor.
Slide 8: I-W Characteristic
- Key result (Michaud, Seznec, Jourdan): a square-root relationship between issue rate (I) and window size (W).
Slide 9: Similar Experiment
- Ideal caches and predictor; efficient I-fetch keeps the window full.
- Graph issue rate I as a function of window size W.
- Straight lines on a log-log graph.
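The straight-line-on-log-log observation can be checked numerically: fitting I = a * W^b by linear regression in log space should recover an exponent b near 0.5. A minimal sketch, where the (window size, issue rate) samples are illustrative stand-ins for simulator measurements, not data from the talk:

```python
import math

# Hypothetical (window size W, issue rate I) pairs; a real study would
# obtain these from simulation with ideal caches and an ideal predictor.
samples = [(8, 1.4), (16, 2.0), (32, 2.8), (64, 4.0)]

# Fit I = a * W**b by least squares on log-log data:
# log I = log a + b * log W.
xs = [math.log(w) for w, _ in samples]
ys = [math.log(i) for _, i in samples]
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
a = math.exp(my - b * mx)
print(f"I ~ {a:.3f} * W^{b:.3f}")  # exponent near 0.5 indicates the sqrt law
```

An exponent close to 0.5 confirms the square-root I-W characteristic; a markedly different slope would suggest the window, not fetch, is no longer the limiter.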
Slide 10: I-W Characteristic
- Allows determination of the "background" IPC.
- Allows evaluation of transients to determine penalties.
- (Figure: IPC vs. time, with dips for branch mispredicts, i-cache misses, and long d-cache misses.)
Slide 11: Transient #1: Branch Mispredictions
Typical behavior:
- steady state
- mispredicted branch enters the window; misspeculated instructions issue until the misprediction is detected
- flush pipeline
- re-fill pipeline; instructions re-enter the window
- issue ramps back up to steady state
Slide 12: Branch Misprediction Penalty
1) Lost opportunity: performance lost by issuing soon-to-be-flushed instructions.
2) Pipeline re-fill penalty: the obvious penalty; most people equate this with the whole penalty.
3) Window-fill penalty: performance lost due to window startup.
Slide 13: Use Sqrt Model
Slide 14: Experimental Data
Slide 15: Branch Mispredict Penalty
- Short pipeline = 5 stages before issue; long pipeline = 10 stages before issue.
- Insight from the analytical model: the penalty from drain/fill is significant, and the penalty is similar across all benchmarks for a given pipeline length.
Slide 16: Implication of Wider Pipes
- Assume 1 mispredict every 96 instructions, e.g., the SPEC benchmark crafty with a 4K gshare predictor.
- Graph a full mispredict "cycle".
- Issue = 8 gives a very modest improvement vs. issue = 4 (the window is never full enough to issue 8).
- Issue = 4 barely reaches peak performance.
Slide 17: Importance of Branch Prediction
- Insight: doubling the issue width means the predictor has to be four times better for a similar performance profile (issue efficiency).
Slide 18: Implication of Deeper Pipelines
- Assume 1 misprediction per 96 instructions; vary the fetch/decode/rename section of the pipe.
- The advantage of wide issue diminishes as the pipe deepens.
- Pentium 4: decode pipe depth = 15 and issue width = 3.
Slide 19: Transient #2: I-Cache Misses
Timeline:
- steady state
- cache miss occurs; window drains
- instructions buffered in the decode pipe continue to enter the window during the miss delay
- instructions fill the decode pipe and re-enter the window
- issue ramps back up to steady state
Slide 20: I-Cache Miss Penalty
- Penalty = miss delay (L2 or memory latency) minus window drain, plus window re-fill penalty.
- Instructions buffered in the window offset the re-fill penalty.
- Insight: the penalty is independent of pipeline length; instructions buffered in the pipe compensate for pipe re-fill.
Slide 21: I-Cache Miss Penalty
- Estimated I-cache penalty for n consecutive (clustered) misses:
  Avg. miss penalty = (miss delay - drain + fill + (n - 1)(miss delay - 1)) / n
                    ~ miss delay - 1 + 1/n
- For an isolated miss (n = 1): miss delay.
- For a long cluster: miss delay - 1.
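The formula above can be sketched directly. A minimal version, assuming (as the slide's approximation does) that the window drain and re-fill terms roughly cancel; the function and parameter names are mine, not from the talk:

```python
# Average penalty for a cluster of n consecutive I-cache misses.
def avg_icache_penalty(miss_delay, drain, fill, n):
    # The first miss pays miss_delay - drain + fill; each of the n - 1
    # follow-on misses in the cluster pays miss_delay - 1.
    return (miss_delay - drain + fill + (n - 1) * (miss_delay - 1)) / n

print(avg_icache_penalty(10, 3, 3, 1))  # isolated miss: 10.0 (full delay)
print(avg_icache_penalty(10, 3, 3, 8))  # cluster: approaches delay - 1
```

With drain = fill, the expression reduces to miss_delay - 1 + 1/n, matching the slide's two limiting cases.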
Slide 22: Independence from Pipe Length
- 16K I-cache; ideal D-cache and predictor.
- Two different pipeline lengths (4 and 8 cycles); I-cache miss delay of 10 cycles.
- The penalty is independent of pipe length and similar across benchmarks.
Slide 23: Reducing Miss Penalty: I-Caches
- Add an I-fetch buffer: it overlaps execution with miss handling and is bypassed by miss instructions.
- To be effective, it should be enhanced with fetch bandwidth greater than the issue width.
- (Figure: the buffer increases the instructions buffered in the decode pipe without increasing the pipe re-fill time.)
Slide 24: Transient #3: D-Cache Misses
- More complex than front-end miss events: branch mispredicts and I-cache misses block I-fetch, but data cache misses can be handled in parallel with I-fetch and execution.
- Divide into:
  - short misses: handle like a long-latency functional unit
  - long misses: get special treatment
Slide 25: D-Cache Long Miss Penalty
Three things can reduce performance:
1) Structural hazard: the ROB fills up behind the load (or an instruction dependent on the load) and dispatch stalls.
2) Data dependences: instructions dependent on the load pile up and stall the window.
3) Control dependences: a mispredicted branch depends on the load data, and instructions beyond the branch are wasted.
Slide 26: ROB Blockage Experiment
- Window size 32, issue width 4, ROB size 64.
- Ideal branch prediction; cache miss delay of 1000 cycles.
- Simulate sampled, isolated cache misses and observe what happens.
Slide 27: Results

Benchmark   Avg. # insts issued after miss   # insts in window after miss   Fract. samples where ROB fills
Bzip2       44.1                             13.1                           1.0
Crafty      44.6                              9.6                           0.9
Eon         55.2                              6.0                           1.0
Gap         56.8                             10.7                           1.0
Gcc         51.7                              8.2                           0.9
Mcf         55.8                              5.5                           0.9
Parser      44.2                              7.4                           1.0
Twolf       49.6                             12.9                           0.8
Vortex      49.7                              3.5                           1.0
Vpr         27.0                             16.9                           0.6

- The full ROB stalls most of the time.
- Relatively few dependent instructions pile up in the window.
Slide 28: D-Cache Miss Penalty
- For typical ROBs, data and control dependences are not the limiters; assume a structural (ROB) stall.
- If the load is at the tail of the window:
  Penalty = miss delay - ROB fill - window drain + ramp-up ~ miss delay - ROB fill
- If the load is at the head of the window:
  Penalty = miss delay - window drain + ramp-up ~ miss delay
- If a second long load miss is within ROB distance of the first, its penalty is completely overlapped.
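The two cases above can be sketched as one function. A hedged sketch under the slide's structural-stall assumption; the function and parameter names are mine, and the numeric values are illustrative, chosen so that drain and ramp-up cancel as in the slide's approximations:

```python
# Penalty of a long D-cache miss under an ROB-full (structural) stall.
def long_miss_penalty(miss_delay, rob_fill, drain, ramp_up, load_at_tail):
    if load_at_tail:
        # The ROB keeps filling behind the load, hiding part of the delay.
        return miss_delay - rob_fill - drain + ramp_up
    # Load at the head of the window: almost nothing is hidden.
    return miss_delay - drain + ramp_up

# With drain ~ ramp-up, the slide's approximations fall out:
print(long_miss_penalty(200, 16, 5, 5, load_at_tail=True))   # delay - ROB fill
print(long_miss_penalty(200, 16, 5, 5, load_at_tail=False))  # full delay
```

A second long miss within ROB distance of the first would contribute no additional penalty at all, which is why only clustering relative to ROB size matters for D-cache misses.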
Slide 29: Transient #3: D-Cache Misses
Timeline:
- steady state
- D-cache miss occurs; window drains
- ROB fills
- miss delay
- miss data returns
- commit resumes; issue ramps to steady state
Slide 30: Reducing Miss Penalty: D-Caches
- Enlarge the ROB, window, and rename resources to overlap the miss delay with execution.
- (Figure timeline: steady state; D-cache miss occurs; ROB fills; window drains; independent instructions execute during the miss delay while the ROB is full; miss data returns; commit resumes; issue ramps back up to steady state.)
Slide 31: Put It Together
- Issue width 4, window size 48 => peak CPI.
- 8-cycle L1 I-cache miss delay; 200-cycle L2 cache miss delay (both I and D); 6.4-cycle branch mispredict delay (4 in pipeline).
- Performance (cycles) =
    #insts * peak CPI
  + (total # branch mispredicts - mispredicts within ROB size of a long miss) * penalty
  + (total # I-cache misses - misses within ROB size of a long miss) * penalty
  + (total # long misses - long misses within ROB size of a long miss) * penalty
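The combined model is just a weighted sum. A minimal sketch, assuming each event count has already been reduced by the events that fall within ROB size of a long D-cache miss (whose cost is overlapped); the counts below are illustrative assumptions, not measured data, while the penalties are the delays quoted on the slide:

```python
# Total execution time from the additive first-order model.
def total_cycles(n_insts, peak_cpi,
                 mispredicts, mispredict_penalty,
                 icache_misses, icache_penalty,
                 long_misses, long_miss_penalty):
    # Background issue time plus one serialized penalty per visible event.
    return (n_insts * peak_cpi
            + mispredicts * mispredict_penalty
            + icache_misses * icache_penalty
            + long_misses * long_miss_penalty)

# 1M instructions on a 4-wide machine (peak CPI 0.25), with the slide's
# 6.4-cycle mispredict, 8-cycle L1-I, and 200-cycle long-miss delays:
print(total_cycles(1_000_000, 0.25, 10_000, 6.4, 2_000, 8, 1_000, 200))
```

Each term can be computed from a trace-driven cache or predictor simulation alone, which is what lets the model short-circuit detailed timing simulation.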
Slide 32: Compare with Detailed Simulation
- Very accurate; the greatest inaccuracy comes from D-cache long misses.
Slide 33: Important Workload Characteristics
Slide 34: Conclusions: Key Workload Characteristics
- Instruction dependences are important for establishing the background (ideal) IPC, not for performance penalties.
- All "major" events are important: branch mispredicts, I-cache misses (both short and long), and D-cache misses (long). But ONLY "major" events are important in a well-balanced design.
- Clustering of events matters only for D-cache misses: is the miss within ROB distance of the preceding miss?
Slide 35: Conclusions: Performance Evaluation
- Accurate analytical models can (and should) be developed.
- Trace-driven cache/predictor simulators have an important role.
- Hybrid analytical/simulation models should also be considered: combine real address streams with analytical processor models; statistical simulation.
- If you really need detailed simulation, you're not doing research, you're doing development!