Lecture 19: Core Design Today: implementing core structures – rename, issue queue, bypass networks; innovations for high ILP and clock speed.

Slides:



Advertisements
Similar presentations
Lecture 19: Cache Basics Today’s topics: Out-of-order execution
Advertisements

1 Lecture 11: Modern Superscalar Processor Models Generic Superscalar Models, Issue Queue-based Pipeline, Multiple-Issue Design.
CS 7810 Lecture 22 Processor Case Studies, The Microarchitecture of the Pentium 4 Processor G. Hinton et al. Intel Technology Journal Q1, 2001.
CS 7810 Lecture 4 Overview of Steering Algorithms, based on Dynamic Code Partitioning for Clustered Architectures R. Canal, J-M. Parcerisa, A. Gonzalez.
Federation: Repurposing Scalar Cores for Out- of-Order Instruction Issue David Tarjan*, Michael Boyer, and Kevin Skadron* University of Virginia Department.
1 Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ.
Exploring Wakeup-Free Instruction Scheduling Jie S. Hu, N. Vijaykrishnan, and Mary Jane Irwin Microsystems Design Lab The Pennsylvania State University.
CS 7810 Lecture 16 Simultaneous Multithreading: Maximizing On-Chip Parallelism D.M. Tullsen, S.J. Eggers, H.M. Levy Proceedings of ISCA-22 June 1995.
CS Lecture 10 Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers N.P. Jouppi Proceedings.
CS 7810 Lecture 2 Complexity-Effective Superscalar Processors S. Palacharla, N.P. Jouppi, J.E. Smith U. Wisconsin, WRL ISCA ’97.
1 Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1)
1 Lecture 18: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.
1 Lecture 11: ILP Innovations and SMT Today: out-of-order example, ILP innovations, SMT (Sections 3.5 and supplementary notes)
1 Lecture 7: Out-of-Order Processors Today: out-of-order pipeline, memory disambiguation, basic branch prediction (Sections 3.4, 3.5, 3.7)
1 Lecture 19: Core Design Today: issue queue, ILP, clock speed, ILP innovations.
CS 7810 Lecture 10 Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors O. Mutlu, J. Stark, C. Wilkerson, Y.N.
1 Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections )
Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures Rajeev Balasubramonian School of Computing, University of Utah July 1.
1 Lecture 12: ILP Innovations and SMT Today: ILP innovations, SMT, cache basics (Sections 3.5 and supplementary notes)
1 Lecture 8: Instruction Fetch, ILP Limits Today: advanced branch prediction, limits of ILP (Sections , )
1 Lecture 18: Pipelining Today’s topics:  Hazards and instruction scheduling  Branch prediction  Out-of-order execution Reminder:  Assignment 7 will.
1 Lecture 10: ILP Innovations Today: handling memory dependences with the LSQ and innovations for each pipeline stage (Section 3.5)
1 Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction (Sections )
1 Lecture 10: ILP Innovations Today: ILP innovations and SMT (Section 3.5)
1 Lecture 9: Dynamic ILP Topics: out-of-order processors (Sections )
1 Lecture 7: Branch prediction Topics: bimodal, global, local branch prediction (Sections )
CS 7810 Lecture 21 Threaded Multiple Path Execution S. Wallace, B. Calder, D. Tullsen Proceedings of ISCA-25 June 1998.
1 Lecture 20: Core Design Today: Innovations for ILP, TLP, power ISCA workshops Sign up for class presentations.
CS 7810 Lecture 6 The Impact of Delay on the Design of Branch Predictors D.A. Jimenez, S.W. Keckler, C. Lin Proceedings of MICRO
Microprocessor Microarchitecture Limits of Instruction-Level Parallelism Lynn Choi Dept. Of Computer and Electronics Engineering.
CS Lecture 14 Delaying Physical Register Allocation Through Virtual-Physical Registers T. Monreal, A. Gonzalez, M. Valero, J. Gonzalez, V. Vinals.
The life of an instruction in EV6 pipeline Constantinos Kourouyiannis.
1 Lecture: SMT, Cache Hierarchies Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1)
1 Lecture 20: Core Design Today: Innovations for ILP, TLP, power Sign up for class presentations.
1 Lecture: Out-of-order Processors Topics: branch predictor wrap-up, a basic out-of-order processor with issue queue, register renaming, and reorder buffer.
1 Lecture: Out-of-order Processors Topics: a basic out-of-order processor with issue queue, register renaming, and reorder buffer.
1 Lecture 20: OOO, Memory Hierarchy Today’s topics:  Out-of-order execution  Cache basics.
Fall 2012 Parallel Computer Architecture Lecture 4: Multi-Core Processors Prof. Onur Mutlu Carnegie Mellon University 9/14/2012.
Lecture: Out-of-order Processors
Lynn Choi Dept. Of Computer and Electronics Engineering
Lecture: Out-of-order Processors
Lecture: SMT, Cache Hierarchies
Lecture 6: Advanced Pipelines
Lecture 16: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.
Lecture 10: Out-of-order Processors
Lecture 11: Out-of-order Processors
Lecture: Out-of-order Processors
Lecture 19: Branches, OOO Today’s topics: Instruction scheduling
Lecture 18: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.
Smruti R. Sarangi IIT Delhi
Lecture: SMT, Cache Hierarchies
Lecture 11: Memory Data Flow Techniques
Lecture 17: Core Design Today: implementing core structures – rename, issue queue, bypass networks; innovations for high ILP and clock speed.
Lecture 18: Pipelining Today’s topics:
Ka-Ming Keung Swamy D Ponpandi
Lecture: Out-of-order Processors
Lecture 8: Dynamic ILP Topics: out-of-order processors
Lecture 19: Branches, OOO Today’s topics: Instruction scheduling
Lecture 20: OOO, Memory Hierarchy
Lecture: SMT, Cache Hierarchies
Lecture 20: OOO, Memory Hierarchy
15-740/ Computer Architecture Lecture 14: Runahead Execution
Lecture: SMT, Cache Hierarchies
Lecture 10: ILP Innovations
Lecture 9: ILP Innovations
Lecture 22: Multithreading
Lecture 9: Dynamic ILP Topics: out-of-order processors
Ka-Ming Keung Swamy D Ponpandi
Spring 2019 Prof. Eric Rotenberg
Sizing Structures Fixed relations Empirical (simulation-based)
Presentation transcript:

Lecture 19: Core Design Today: implementing core structures – rename, issue queue, bypass networks; innovations for high ILP and clock speed

Register Rename Logic Map Table Physical Source Regs Physical Dest Logical Source Regs Mux Free Pool Logical Dest Regs Dependence Check Logic Logical Source Reg

Shadow copies (shift register) Map Table – RAM 7-bits 7-bits 7-bits 7-bits 7-bits Phys reg id Num entries = Num logical regs Shadow copies (shift register)

Map Table – CAM 5-bits 1-bit 1-bit Logical reg id v a l i d Num entries = Num phys regs Shadow copies

Wakeup Logic . . … = = tag1 tagIW or or rdyL tagL tagR rdyR rdyL tagL

Selection Logic For multiple FUs, will need sequential selectors Issue window req grant enable anyreq Arbiter cell enable For multiple FUs, will need sequential selectors

Structure Complexities Critical structures: register map tables, issue queue, LSQ, register file, register bypass Cycle time is heavily influenced by: window size (physical register size), issue width (#FUs) Conflict between the desire to increase IPC and clock speed Can achieve both if we use large structures and deep pipelining; but, some structures can’t be easily pipelined and long-latency structures can also hurt IPC

Deep Pipelines What does it mean to have  2-cycle wakeup  2-cycle bypass  2-cycle regread

Frequency Scaling Options 2-cycle wakeup 2-cycle regread 2-cycle bypass Pipeline Scaling 20-IQ F 20-IQ F F F 40 Regs 40 Regs F F F F Capacity Scaling 15-IQ F F Replicated Capacity Scaling 30 Regs F 15-IQ F 15-IQ F F F 30 Regs 30 Regs F F

Recent Trends Not much change in structure capacities Not much change in cycle time Pipeline depths have become shorter (circuit delays have reduced); this is good for energy efficiency Optimal performance is observed at about 50 pipeline stages (we are currently at ~20 stages for energy reasons) Deep pipelines improve parallelism (helps if there’s ILP); Deep pipelines increase the gap between dependent instructions (hurts when there is little ILP)

ILP Limits Wall 1993

Techniques for High ILP Better branch prediction and fetch (trace cache)  cascading branch predictors? More physical registers, ROB, issue queue, LSQ  two-level regfile/IQ? Higher issue width  clustering? Lower average cache hierarchy access time Memory dependence prediction Latency tolerance techniques: ILP, MLP, prefetch, runahead, multi-threading

Impact of Mem-Dep Prediction In the perfect model, loads only wait for conflicting stores; in naïve model, loads issue speculatively and must be squashed if a dependence is later discovered From Chrysos and Emer, ISCA’98

Clustering Reg-rename & Instr steer IQ IQ Regfile Regfile F F F F 40 regs in each cluster r1  r2 + r3 r4  r1 + r2 r5  r6 + r7 r8  r1 + r5 p21  p2 + p3 p22  p21 + p2 p42  p21 p41  p56 + p57 p43  p42 + p41 r1 is mapped to p21 and p42 – will influence steering and instr commit – on average, only 8 replicated regs

2Bc-gskew Branch Predictor BIM Address Pred G0 Vote Address+History G1 Meta 44 KB; 2-cycle access; used in the Alpha 21464

Rules On a correct prediction if all agree, no update if they disagree, strengthen correct preds and chooser On a misprediction update chooser and recompute the prediction on a correct prediction, strengthen correct preds on a misprediction, update all preds

Runahead Mutlu et al., HPCA’03 Trace Cache Current Rename IssueQ Regfile (128) Checkpointed Regfile (32) ROB L1 D Runahead Cache Retired Rename FUs When the oldest instruction is a cache miss, behave like it causes a context-switch: checkpoint the committed registers, rename table, return address stack, and branch history register assume a bogus value and start a new thread this thread cannot modify program state, but can prefetch

Memory Bottlenecks 128-entry window, real L2  0.77 IPC 128-entry window, perfect L2  1.69 2048-entry window, real L2  1.15 2048-entry window, perfect L2  2.02 128-entry window, real L2, runahead  0.94

Title Bullet