1
Reducing the Scheduling Critical Cycle using Wakeup Prediction
HPCA-10, Feb. 18, 2004
Todd Ehrhart and Sanjay Patel
Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign
2
Outline
- Overview
- Analysis
- Architecture
- Experiments
- Conclusions
3
Intuition
- Loops and other execution patterns may cause a steady state in machine delays (e.g., a repeating sequence ABCD ABCD ...)
- After a few iterations, the machine may reach a steady state
  - May have (near-)constant delays
4
Basic Observation
- Wakeup delay is highly invariant
  - Bias toward positive deviations
5
So...
- Wakeup times can be estimated based on the static IP
- Idea:
  - Ignore dependencies
  - Estimate wakeup times
  - Wake an instruction when its time expires
- Breaks the scheduling critical cycle
  - Thus, can reduce cycle time
- But, there are problems
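The idea above can be sketched as a timer-driven scheduler: rather than tracking dependencies, each instruction is given a wakeup timer from its static IP's prediction and issues when the timer expires. This is a minimal illustration, not the paper's hardware; the `predict` callback and instruction tuples are assumptions.

```python
import heapq

def self_schedule(instructions, predict):
    """Issue instructions when their predicted wakeup timers expire.

    instructions: list of (dispatch_cycle, static_ip) pairs.
    predict: maps a static IP to a predicted wakeup delay (assumed).
    """
    heap = []
    for dispatch_cycle, ip in instructions:
        # Ignore dependencies: just set a timer from the IP's prediction.
        heapq.heappush(heap, (dispatch_cycle + predict(ip), ip))
    issue_order = []
    while heap:
        wake_cycle, ip = heapq.heappop(heap)
        issue_order.append((wake_cycle, ip))  # would issue here; checked later
    return issue_order
```

Because no dependency check gates issue, a wrong prediction means the instruction executes with stale inputs and must replay, which is the problem the rest of the talk addresses.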
6
Outline
- Overview
- Analysis
- Architecture
- Experiments
- Conclusions
7
Architectural Flow
Predict Wakeup Time → Wait for Wakeup Timeout → Execute Instructions
If the wakeup time is wrong, the instruction must replay
8
Architectural Flow
Predict Wakeup Time → Wait for Wakeup Timeout → Execute Instructions → Determine Actual Wakeup Time → Check Feedback
Mis-speculated instructions are re-predicted
9
Fixing the Problems
- Replays
  - Cost-adjust the wakeup estimate by the probability of a replay and the cost of a replay
- Replay cost is unknown/unmeasurable
  - Make replay cost an adjustable parameter; it depends on machine load
  - Use load as the feedback value
  - Goal: maximize retire bandwidth
- Re-prediction
  - Exponential backoff
12
Cost-adjusted Wakeup Estimate
- Being close counts
13
Cost-adjusted Wakeup Estimate, II
- After some assumptions and math, the minimum cost occurs where:
  - F(d) = R·f(d) and f(d) > R·f′(d), where R is the replay-cost estimate
- f(d) is unknown, so use a gradient-descent technique
  - The resulting update looks like a running average
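One plausible form of such a gradient-descent update is sketched below: when the prediction was too early (a replay), push the estimate up by a step weighted by the replay cost R; otherwise drift it down. The step size and weighting are illustrative assumptions, not the paper's exact update, but like the slide says, the result behaves like a running average that settles at a cost-adjusted allowance above the true delay.

```python
def update_estimate(d, actual, alpha=0.25, R=4.0):
    """Stochastic-gradient update of the wakeup-time estimate d (sketch).

    If the prediction was too early (actual > d), a replay occurred,
    so raise the estimate, weighted by the replay-cost estimate R;
    otherwise shrink the allowance slowly. alpha and R are assumed values.
    """
    if actual > d:          # mis-speculated: replay happened
        return d + alpha * R
    return d - alpha        # safe: drift back toward the true delay
```

At equilibrium the up-steps and down-steps balance, so the estimate hovers where the replay probability is roughly 1/(1+R), mirroring the cost trade-off in the minimum-cost condition.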
14
Feedback-adjusted Replay Cost
- The cost of a replay changes during execution
  - Program phases, etc.
- Add a second feedback layer
  - Observe the load on each class of functional unit and adjust the replay cost accordingly
  - To prevent wild oscillations, adjust once every 1000 cycles
  - Cheap: needs a few accumulators, and is off the critical path
- r is the estimated cost of a single replay; R = r × count
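A minimal sketch of this second feedback layer, assuming simple high/low load thresholds and step sizes (the slide gives only the 1000-cycle window; everything else here is an assumption):

```python
class ReplayCostAdjuster:
    """Adjust the estimated per-replay cost r from functional-unit load.

    Once per 1000-cycle window, raise r when the functional units are
    heavily loaded (replays are expensive then) and lower it when they
    are idle. Thresholds and step size are illustrative assumptions.
    """
    def __init__(self, r=1.0, window=1000, high=0.8, low=0.4, step=0.1):
        self.r, self.window = r, window
        self.high, self.low, self.step = high, low, step
        self.busy = 0.0      # accumulator: one per FU class in hardware
        self.cycles = 0

    def observe(self, fu_busy_fraction):
        """Called each cycle with the observed FU utilization."""
        self.busy += fu_busy_fraction
        self.cycles += 1
        if self.cycles == self.window:   # adjust only at window boundaries
            load = self.busy / self.window
            if load > self.high:
                self.r += self.step
            elif load < self.low:
                self.r = max(0.0, self.r - self.step)
            self.busy = 0.0
            self.cycles = 0
```

Adjusting only at window boundaries is what damps oscillation, and since the accumulators are read once per 1000 cycles, the logic stays off the critical path.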
15
Re-prediction
- An observation (covers 99% of instances): the actual wakeup time falls under a slope-2 line against the predicted time
- Return the instruction to the Self-Schedule Array, but with twice its previous wakeup-time estimate
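The re-prediction rule is plain exponential backoff, sketched below; the cap is an assumption added so a pathological instruction cannot double forever.

```python
def repredict(prev_estimate, max_estimate=1 << 16):
    """Exponential backoff on a wakeup mis-speculation (sketch).

    The deck observes that doubling covers ~99% of re-prediction cases,
    so the replayed instruction re-enters the Self-Schedule Array with
    twice its old wakeup-time estimate. max_estimate is an assumed cap,
    not from the talk.
    """
    return min(2 * prev_estimate, max_estimate)
```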
16
Outline
- Overview
- Analysis
- Architecture
- Experiments
- Conclusions
17
High-Level Architecture
18
Scheduler Architecture
19
Predictor Architectures
- Local allowance
20
Predictor Architectures
- Global allowance
21
Predictor Architectures
- Problem: on a miss, cannot fall back on dependency-based wakeup
  - Cycle-time constraints
- Default Predictor
  - Used on a miss in the main predictor
  - Updated the same way as the Global Allowance predictor
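A sketch of the miss path, assuming an IP-indexed prediction table (the experiments mention a 128×4 geometry) backed by a single default prediction that is updated like a global running average; the table organization, update rule, and step size here are all assumptions:

```python
class WakeupPredictor:
    """IP-indexed wakeup-time predictor with a default predictor on miss.

    A miss cannot fall back on dependency-based wakeup (cycle-time
    constraints), so it returns the default prediction instead.
    """
    def __init__(self):
        self.table = {}      # static IP -> predicted wakeup time
        self.default = 1.0   # default predictor (global, assumed initial value)

    def predict(self, ip):
        return self.table.get(ip, self.default)

    def update(self, ip, actual):
        self.table[ip] = actual
        # The default predictor updates like the global-allowance
        # predictor; a simple running average over all instructions
        # is assumed here.
        self.default += 0.1 * (actual - self.default)
```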
22
Finding the Actual Wakeup Time
- Done in the register file; conceptually:
  - Each register entry holds a valid bit and a cycle count; the count is set to zero when the register is written and counts up each cycle
  - Source register numbers index the register file; an AND of the valid bits answers "ready to execute?", and the MIN of the cycle counts, combined with the instruction's wait time, gives the actual wakeup time
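The block diagram on this slide can be sketched as follows. The MIN counter tells how many cycles ago the last source was written, so subtracting it from the instruction's total wait gives when it could first have woken; the exact operand order of the subtraction in the original diagram is ambiguous, so this direction is an assumption, as are the interface names.

```python
def actual_wakeup_time(sources, regfile, wait_time):
    """Register-file computation of the actual wakeup time (sketch).

    regfile maps register number -> (valid, cycles_since_write); the
    counter is zeroed when the register is written and counts up each
    cycle. wait_time is how long the instruction has waited so far.
    """
    ready = all(regfile[s][0] for s in sources)       # AND of valid bits
    if not ready:
        return None                                    # not ready to execute
    youngest = min(regfile[s][1] for s in sources)     # MIN of cycle counts
    # The last source arrived `youngest` cycles ago, so the instruction
    # could have woken wait_time - youngest cycles after dispatch.
    return wait_time - youngest
```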
23
Outline
- Overview
- Analysis
- Architecture
- Experiments
- Conclusions
24
Setup
- x86 trace-driven simulator
  - Fetch timing effects simulated
  - 7 traces, 26M to 100M consecutive instructions, from SPECint
- 8-wide, 18-deep pipeline
- Configurations:
  - Baseline: 1-cycle wakeup/select
  - BasePipeSched: 2-cycle wakeup/select
  - WPLocal: local wakeup prediction (128x4 predictor), r=1
  - WPLocalAdj: WPLocal + feedback-adjusted r
  - WPGlobal: global wakeup prediction (128x4 predictor), r=1
  - WPGlobalAdj: WPGlobal + feedback-adjusted r
25
Results
- 7% IPC drop
26
Ideal Fetch
- Approximates high-bandwidth fetch
  - Trace cache, etc.
- Otherwise, same as before
27
Results: Ideal Fetch
- 7% IPC drop
28
Resource-constrained
- Half the number of functional units in each class
- Uses i-cache fetch (like the first experiment)
- Otherwise, same as the others
29
Results: Resource-constrained
- 9% IPC drop
30
Other Results
- Some leeway in prediction accuracy
  - Doubling the predictions results in a 27% IPC drop
- Works consistently in deep pipelines
  - Without pipelined wakeup:
  - With pipelined wakeup:
31
Outline
- Overview
- Analysis
- Architecture
- Experiments
- Conclusions
32
Conclusions
- Likely to increase performance
  - IPC drop is ~7%
  - The performance gain from the cycle-time decrease could exceed the loss from the IPC decrease
- Feedback paths are not critical
  - Simpler design process