1 Lecture 19: Core Design Today: issue queue, ILP, clock speed, ILP innovations.

Slides:



Advertisements
Similar presentations
Lecture 19: Cache Basics Today’s topics: Out-of-order execution
Advertisements

CS 7810 Lecture 22 Processor Case Studies, The Microarchitecture of the Pentium 4 Processor G. Hinton et al. Intel Technology Journal Q1, 2001.
CS 7810 Lecture 4 Overview of Steering Algorithms, based on Dynamic Code Partitioning for Clustered Architectures R. Canal, J-M. Parcerisa, A. Gonzalez.
Federation: Repurposing Scalar Cores for Out- of-Order Instruction Issue David Tarjan*, Michael Boyer, and Kevin Skadron* University of Virginia Department.
1 Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ.
CS 7810 Lecture 16 Simultaneous Multithreading: Maximizing On-Chip Parallelism D.M. Tullsen, S.J. Eggers, H.M. Levy Proceedings of ISCA-22 June 1995.
CS Lecture 10 Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers N.P. Jouppi Proceedings.
CS 7810 Lecture 2 Complexity-Effective Superscalar Processors S. Palacharla, N.P. Jouppi, J.E. Smith U. Wisconsin, WRL ISCA ’97.
CS 7810 Lecture 20 Initial Observations of the Simultaneous Multithreading Pentium 4 Processor N. Tuck and D.M. Tullsen Proceedings of PACT-12 September.
1 Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1)
1 Lecture 18: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.
1 Lecture 11: ILP Innovations and SMT Today: out-of-order example, ILP innovations, SMT (Sections 3.5 and supplementary notes)
1 Lecture 7: Out-of-Order Processors Today: out-of-order pipeline, memory disambiguation, basic branch prediction (Sections 3.4, 3.5, 3.7)
CS 7810 Paper critiques and class participation: 25% Final exam: 25% Project: Simplescalar (?) modeling, simulation, and analysis: 50% Read and think about.
1 Lecture 8: Branch Prediction, Dynamic ILP Topics: branch prediction, out-of-order processors (Sections )
CS 7810 Lecture 10 Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors O. Mutlu, J. Stark, C. Wilkerson, Y.N.
1 Lecture 8: Branch Prediction, Dynamic ILP Topics: branch prediction, out-of-order processors (Sections )
1 Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections )
Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures Rajeev Balasubramonian School of Computing, University of Utah July 1.
CS 7810 Lecture 3 Clock Rate vs. IPC: The End of the Road for Conventional Microarchitectures V. Agarwal, M.S. Hrishikesh, S.W. Keckler, D. Burger UT-Austin.
1 Lecture 12: ILP Innovations and SMT Today: ILP innovations, SMT, cache basics (Sections 3.5 and supplementary notes)
CS 7810 Lecture 11 Delaying Physical Register Allocation Through Virtual-Physical Registers T. Monreal, A. Gonzalez, M. Valero, J. Gonzalez, V. Vinals.
1 Lecture 8: Instruction Fetch, ILP Limits Today: advanced branch prediction, limits of ILP (Sections , )
1 Lecture 18: Pipelining Today’s topics:  Hazards and instruction scheduling  Branch prediction  Out-of-order execution Reminder:  Assignment 7 will.
1 Lecture 10: ILP Innovations Today: handling memory dependences with the LSQ and innovations for each pipeline stage (Section 3.5)
1 Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction (Sections )
1 Lecture 10: ILP Innovations Today: ILP innovations and SMT (Section 3.5)
1 Lecture 9: Dynamic ILP Topics: out-of-order processors (Sections )
1 Lecture 7: Branch prediction Topics: bimodal, global, local branch prediction (Sections )
CS 7810 Lecture 21 Threaded Multiple Path Execution S. Wallace, B. Calder, D. Tullsen Proceedings of ISCA-25 June 1998.
1 Lecture 20: Core Design Today: Innovations for ILP, TLP, power ISCA workshops Sign up for class presentations.
CS 7810 Lecture 6 The Impact of Delay on the Design of Branch Predictors D.A. Jimenez, S.W. Keckler, C. Lin Proceedings of MICRO
CS Lecture 4 Clock Rate vs. IPC: The End of the Road for Conventional Microarchitectures V. Agarwal, M.S. Hrishikesh, S.W. Keckler, D. Burger UT-Austin.
Complexity-Effective Superscalar Processors S. Palacharla, N. P. Jouppi, and J. E. Smith Presented by: Jason Zebchuk.
Microprocessor Microarchitecture Limits of Instruction-Level Parallelism Lynn Choi Dept. Of Computer and Electronics Engineering.
CS Lecture 14 Delaying Physical Register Allocation Through Virtual-Physical Registers T. Monreal, A. Gonzalez, M. Valero, J. Gonzalez, V. Vinals.
CS Lecture 2 Limits of Instruction-Level Parallelism David W. Wall WRL Research Report 93/6 Also appears in ASPLOS’91.
The life of an instruction in EV6 pipeline Constantinos Kourouyiannis.
1 Lecture: SMT, Cache Hierarchies Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1)
1 Lecture 20: Core Design Today: Innovations for ILP, TLP, power Sign up for class presentations.
1 Lecture: Out-of-order Processors Topics: branch predictor wrap-up, a basic out-of-order processor with issue queue, register renaming, and reorder buffer.
1 Lecture: Out-of-order Processors Topics: a basic out-of-order processor with issue queue, register renaming, and reorder buffer.
15-740/ Computer Architecture Lecture 12: Issues in OoO Execution Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011.
1 Lecture 20: OOO, Memory Hierarchy Today’s topics:  Out-of-order execution  Cache basics.
Fall 2012 Parallel Computer Architecture Lecture 4: Multi-Core Processors Prof. Onur Mutlu Carnegie Mellon University 9/14/2012.
Lecture: Out-of-order Processors
Lecture: Out-of-order Processors
Lecture: SMT, Cache Hierarchies
Lecture 6: Advanced Pipelines
Lecture 16: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.
Lecture 10: Out-of-order Processors
Lecture 11: Out-of-order Processors
Lecture: Out-of-order Processors
Lecture 18: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.
Lecture: SMT, Cache Hierarchies
Lecture 11: Memory Data Flow Techniques
Lecture 17: Core Design Today: implementing core structures – rename, issue queue, bypass networks; innovations for high ILP and clock speed.
Lecture: Out-of-order Processors
Lecture 8: Dynamic ILP Topics: out-of-order processors
Lecture 20: OOO, Memory Hierarchy
Lecture: SMT, Cache Hierarchies
Lecture 20: OOO, Memory Hierarchy
15-740/ Computer Architecture Lecture 14: Runahead Execution
Lecture 19: Core Design Today: implementing core structures – rename, issue queue, bypass networks; innovations for high ILP and clock speed.
Lecture 10: ILP Innovations
Lecture 9: ILP Innovations
Lecture 22: Multithreading
Lecture 9: Dynamic ILP Topics: out-of-order processors
ECE 721, Spring 2019 Prof. Eric Rotenberg.
Sizing Structures Fixed relations Empirical (simulation-based)
Presentation transcript:

1 Lecture 19: Core Design Today: issue queue, ILP, clock speed, ILP innovations

Wakeup Logic rdyLrdyRtagRtagL or == tag1tagIW … rdyLrdyRtagRtagL

Selection Logic Issue window reqgrant anyreq enable Arbiter cell For multiple FUs, will need sequential selectors

4 Structure Complexities Critical structures: register map tables, issue queue, LSQ, register file, register bypass Cycle time is heavily influenced by: window size (physical register size), issue width (#FUs) Conflict between the desire to increase IPC and clock speed Can achieve both if we use large structures and deep pipelining; but, some structures can’t be easily pipelined and long-latency structures can also hurt IPC

5 Deep Pipelines What does it mean to have  2-cycle wakeup  2-cycle bypass  2-cycle regread

Scaling Options 20-IQ 40 Regs F F F F 20-IQ 40 Regs F F F F 2-cycle wakeup 2-cycle regread 2-cycle bypass 15-IQ 30 Regs F F F 15-IQ 30 Regs F F F 15-IQ 30 Regs F F F Pipeline Scaling Capacity Scaling Replicated Capacity Scaling

7 Recent Trends Not much change in structure capacities Not much change in cycle time Pipeline depths have become shorter (circuit delays have reduced); this is good for energy efficiency Optimal performance is observed at about 50 pipeline stages (we are currently at ~20 stages for energy reasons) Deep pipelines improve parallelism (helps if there’s ILP); Deep pipelines increase the gap between dependent instructions (hurts when there is little ILP)

ILP Limits Wall 1993

9 Techniques for High ILP Better branch prediction and fetch (trace cache)  cascading branch predictors? More physical registers, ROB, issue queue, LSQ  two-level regfile/IQ? Higher issue width  clustering? Lower average cache hierarchy access time Memory dependence prediction Latency tolerance techniques: ILP, MLP, prefetch, runahead, multi-threading

Impact of Mem-Dep Prediction In the perfect model, loads only wait for conflicting stores; in naïve model, loads issue speculatively and must be squashed if a dependence is later discovered From Chrysos and Emer, ISCA’98

Clustering Reg-rename & Instr steer IQ Regfile FF IQ Regfile FF r1  r2 + r3 r4  r1 + r2 r5  r6 + r7 r8  r1 + r5 p21  p2 + p3 p22  p21 + p2 p42  p21 p41  p56 + p57 p43  p42 + p41 40 regs in each cluster r1 is mapped to p21 and p42 – will influence steering and instr commit – on average, only 8 replicated regs

2Bc-gskew Branch Predictor Address Address+History BIM Meta G1 G0 Pred Vote 44 KB; 2-cycle access; used in the Alpha 21464

Rules On a correct prediction  if all agree, no update  if they disagree, strengthen correct preds and chooser On a misprediction  update chooser and recompute the prediction  on a correct prediction, strengthen correct preds  on a misprediction, update all preds

Runahead Mutlu et al., HPCA’03 Trace Cache Current Rename IssueQ Regfile (128) Checkpointed Regfile (32) Retired Rename ROB FUs L1 D Runahead Cache When the oldest instruction is a cache miss, behave like it causes a context-switch: checkpoint the committed registers, rename table, return address stack, and branch history register assume a bogus value and start a new thread this thread cannot modify program state, but can prefetch

Memory Bottlenecks 128-entry window, real L2  0.77 IPC 128-entry window, perfect L2  entry window, real L2  entry window, perfect L2  entry window, real L2, runahead  0.94

16 Title Bullet