Download presentation
Presentation is loading. Please wait.
Published byElisabeth Caldwell Modified over 9 years ago
1
1 Lecture 20: Core Design Today: Innovations for ILP, TLP, power Sign up for class presentations
2
Clustering Reg-rename & Instr steer IQ Regfile FF IQ Regfile FF r1 r2 + r3 r4 r1 + r2 r5 r6 + r7 r8 r1 + r5 p21 p2 + p3 p22 p21 + p2 p42 p21 p41 p56 + p57 p43 p42 + p41 40 regs in each cluster r1 is mapped to p21 and p42 – will influence steering and instr commit – on average, only 8 replicated regs
3
3 Recent Trends Not much change in structure capacities Not much change in cycle time Pipeline depths have become shorter (circuit delays have reduced); this is good for energy efficiency Optimal performance is observed at about 50 pipeline stages (we are currently at ~20 stages for energy reasons) Deep pipelines improve parallelism (helps if there’s ILP); Deep pipelines increase the gap between dependent instructions (hurts when there is little ILP)
4
ILP Limits Wall 1993
5
5 Techniques for High ILP Better branch prediction and fetch (trace cache) cascading branch predictors? More physical registers, ROB, issue queue, LSQ two-level regfile/IQ? Higher issue width clustering? Lower average cache hierarchy access time Memory dependence prediction Latency tolerance techniques: ILP, MLP, prefetch, runahead, multi-threading
6
2Bc-gskew Branch Predictor Address Address+History BIM Meta G1 G0 Pred Vote 44 KB; 2-cycle access; used in the Alpha 21464
7
Rules On a correct prediction if all agree, no update if they disagree, strengthen correct preds and chooser On a misprediction update chooser and recompute the prediction on a correct prediction, strengthen correct preds on a misprediction, update all preds
8
Impact of Mem-Dep Prediction In the perfect model, loads only wait for conflicting stores; in naïve model, loads issue speculatively and must be squashed if a dependence is later discovered From Chrysos and Emer, ISCA’98
9
Runahead Mutlu et al., HPCA’03 Trace Cache Current Rename IssueQ Regfile (128) Checkpointed Regfile (32) Retired Rename ROB FUs L1 D Runahead Cache When the oldest instruction is a cache miss, behave like it causes a context-switch: checkpoint the committed registers, rename table, return address stack, and branch history register assume a bogus value and start a new thread this thread cannot modify program state, but can prefetch
10
Memory Bottlenecks 128-entry window, real L2 0.77 IPC 128-entry window, perfect L2 1.69 2048-entry window, real L2 1.15 2048-entry window, perfect L2 2.02 128-entry window, real L2, runahead 0.94
11
SMT Pipeline Structure Front End Front End Front End Front End Execution Engine RenameROB I-CacheBpred RegsIQ FUsDCache Private/ Shared Front-end Private Front-end Shared Exec Engine SMT maximizes utilization of shared execution engine
12
12 SMT Fetch Policy Fetch policy has a major impact on throughput: depends on cache/bpred miss rates, dependences, etc. Commonly used policy: ICOUNT: every thread has an equal share of resources faster threads will fetch more often: improves thruput slow threads with dependences will not hoard resources low probability of fetching wrong-path instructions higher fairness
13
Area Effect of Multi-Threading The curve is linear for a while Multi-threading adds a 5-8% area overhead per thread (primary caches are included in the baseline) From Davis et al., PACT 2005
14
Single Core IPC 4 bars correspond to 4 different L2 sizes IPC range for different L1 sizes
15
Maximal Aggregate IPCs
16
Power/Energy Basics Energy = Power x time Power = Dynamic power + Leakage power Dynamic Power = C V 2 f switching activity factor C capacitances being charged V voltage swing f processor frequency
17
17 Guidelines Dynamic frequency scaling (DFS) can impact power, but has little impact on energy Optimizing a single structure for power/energy is good for overall energy only if execution time is not increased A good metric for comparison: ED (because DVFS is an alternative way to play with the E-D trade-off) Clock gating is commonly used to reduce dynamic energy, DFS is very cheap (few cycles), DVFS and power gating are more expensive (micro-seconds or tens of cycles, fewer margins, higher error rates) 2
18
Criticality Metrics Criticality has many applications: performance and power; usually, more useful for power optimizations QOLD – instructions that are the oldest in the issueq are considered critical can be extended to oldest-N does not need a predictor young instrs are possibly on mispredicted paths young instruction latencies can be tolerated older instrs are possibly holding up the window older instructions have more dependents in the pipeline than younger instrs
19
Other Criticality Metrics QOLDDEP: Producing instructions for oldest in q ALOLD: Oldest instr in ROB FREED-N: Instr completion frees up at least N dependent instrs Wake-Up: Instr completion triggers a chain of wake-up operations Instruction types: cache misses, branch mpreds, and instructions that feed them
20
20 Title Bullet
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.