Lecture 20: Core Design
Today: Innovations for ILP, TLP, power
Sign up for class presentations.

Clustering: Register Rename & Instruction Steering
Each cluster has its own issue queue, register file, and functional units (40 registers in each cluster). Rename maps architectural to physical registers and steers each instruction to a cluster:
- r1 ← r2 + r3 becomes p21 ← p2 + p3
- r4 ← r1 + r2 becomes p22 ← p21 + p2
- p42 ← p21 is an inserted copy that replicates r1 into the other cluster
- r5 ← r6 + r7 becomes p41 ← p56 + p57
- r8 ← r1 + r5 becomes p43 ← p42 + p41
r1 is mapped to both p21 and p42; this replication will influence steering and instruction commit. On average, only 8 registers are replicated.
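The renaming step above can be sketched in software. The toy below is illustrative only (register numbering and free-list policy are made up, and clustering/replication are omitted): each destination gets a fresh physical register while sources read the current mapping.

```python
# Minimal single-cluster sketch of register renaming (numbering and
# free-list policy are illustrative, not the slide's).
def rename(instrs, num_phys=64):
    """instrs: list of (dest, src1, src2) architectural register names."""
    free_list = [f"p{i}" for i in range(num_phys)]
    table = {}  # architectural reg -> current physical reg
    renamed = []
    for dest, src1, src2 in instrs:
        # Sources are looked up BEFORE the destination is remapped,
        # so an instruction like r1 <- r1 + r2 reads the old r1.
        ps1 = table.get(src1, src1)
        ps2 = table.get(src2, src2)
        pd = free_list.pop(0)        # allocate a fresh physical register
        table[dest] = pd
        renamed.append((pd, ps1, ps2))
    return renamed

# The code sequence from the slide:
code = [("r1", "r2", "r3"), ("r4", "r1", "r2"),
        ("r5", "r6", "r7"), ("r8", "r1", "r5")]
print(rename(code))
```

Note how the second and fourth instructions pick up the renamed r1 from the table rather than the architectural name.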

Recent Trends
- Not much change in structure capacities
- Not much change in cycle time
- Pipeline depths have become shorter (circuit delays have shrunk); this is good for energy efficiency
- Optimal performance is observed at about 50 pipeline stages; we are currently at ~20 stages for energy reasons
- Deep pipelines improve parallelism (helps when there is ILP), but increase the gap between dependent instructions (hurts when there is little ILP)
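The depth trade-off can be seen with a back-of-envelope model (all delay numbers below are hypothetical): each added stage shrinks the logic per cycle but pays a fixed latch overhead, so frequency gains flatten at high depths.

```python
# Back-of-envelope pipeline-depth model (all numbers hypothetical):
# cycle time = total logic delay / depth + per-stage latch overhead.
logic_delay_ps = 10000    # total combinational logic delay, ps
latch_overhead_ps = 50    # pipeline-register overhead per stage, ps

for depth in (5, 10, 20, 50, 100):
    cycle = logic_delay_ps / depth + latch_overhead_ps
    freq_ghz = 1000.0 / cycle    # ps per cycle -> GHz
    print(f"depth {depth:3}: cycle {cycle:6.1f} ps, freq {freq_ghz:.2f} GHz")
```

Doubling depth from 50 to 100 stages cuts the cycle from 250 ps to only 150 ps: frequency approaches the latch-overhead asymptote while branch-miss and dependence penalties keep growing with depth.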

ILP Limits
[Figure: limits of instruction-level parallelism, from D. Wall, 1993]

Techniques for High ILP
- Better branch prediction and fetch (trace cache); cascading branch predictors?
- More physical registers, ROB, issue queue, LSQ; two-level regfile/IQ?
- Higher issue width; clustering?
- Lower average cache hierarchy access time
- Memory dependence prediction
- Latency tolerance techniques: ILP, MLP, prefetch, runahead, multi-threading

2Bc-gskew Branch Predictor
[Figure: four tables -- BIM indexed by address; G0, G1, and Meta indexed by address+history; BIM, G0, and G1 vote on the prediction, and Meta chooses between the vote and BIM alone]
44 KB; 2-cycle access; used in the Alpha 21464.

Rules
- On a correct prediction:
  - if all components agree, no update
  - if they disagree, strengthen the correct predictors and the chooser
- On a misprediction:
  - update the chooser and recompute the prediction
  - if the recomputed prediction is correct, strengthen the correct predictors
  - if the recomputed prediction is also wrong, update all predictors
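These rules can be sketched as code. The toy below collapses each component to a single unindexed 2-bit counter (real 2Bc-gskew uses hashed tables per branch), so it only illustrates the partial-update logic, not the predictor structure.

```python
# Toy sketch of 2Bc-gskew-style partial update (heavily simplified:
# one unindexed 2-bit counter per component instead of hashed tables).
def taken(c):      # 2-bit saturating counter 0..3; predict taken if >= 2
    return c >= 2

def bump(c, t):    # strengthen toward taken (t=True) or not-taken
    return min(3, c + 1) if t else max(0, c - 1)

def update(bim, g0, g1, meta, outcome):
    maj = sum(map(taken, (bim, g0, g1))) >= 2     # majority vote
    pred = maj if taken(meta) else taken(bim)     # chooser: vote vs. BIM

    def strengthen(cs):  # bump only the components that were correct
        return tuple(bump(c, outcome) if taken(c) == outcome else c
                     for c in cs)

    if pred == outcome:
        if taken(bim) == taken(g0) == taken(g1) == outcome:
            return bim, g0, g1, meta              # all agree: no update
        # disagreement: strengthen correct predictors and the chooser
        return strengthen((bim, g0, g1)) + (bump(meta, maj == outcome),)
    # misprediction: update the chooser first, then recompute
    meta = bump(meta, maj == outcome)
    new_pred = maj if taken(meta) else taken(bim)
    if new_pred == outcome:                       # recomputed: now correct
        return strengthen((bim, g0, g1)) + (meta,)
    return tuple(bump(c, outcome) for c in (bim, g0, g1)) + (meta,)
```

For example, a correct prediction with one dissenting component strengthens only the two correct counters, and a full misprediction saturates everything toward the outcome.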

Impact of Memory-Dependence Prediction
In the perfect model, loads wait only for conflicting stores; in the naïve model, loads issue speculatively and must be squashed if a dependence is later discovered.
From Chrysos and Emer, ISCA '98.

Runahead (Mutlu et al., HPCA '03)
[Figure: pipeline with trace cache, current rename table, issue queue, 128-entry register file, 32-entry checkpointed register file, retired rename table, ROB, FUs, L1 D-cache, and a runahead cache]
When the oldest instruction is a cache miss, behave as if it caused a context switch:
- checkpoint the committed registers, rename table, return address stack, and branch history register
- assume a bogus value and start a new thread
- this thread cannot modify program state, but can prefetch
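A deliberately tiny model of the prefetch benefit (the latency and lookahead depth below are made up, and checkpointing, bogus values, and re-execution are abstracted away): while a miss is outstanding, the addresses of upcoming loads are prefetched instead of stalling.

```python
# Toy runahead model: the program is just a list of load addresses.
# On a miss, "run ahead" and prefetch the next few addresses while
# the miss is being serviced (parameters are hypothetical).
MISS_LAT = 100        # cycles for a miss to main memory
RUNAHEAD_DEPTH = 4    # how far ahead the runahead thread gets

def execute(addrs, cache):
    cycles = 0
    for i, a in enumerate(addrs):
        if a not in cache:
            cycles += MISS_LAT               # stall on the miss...
            cache.add(a)
            # ...but meanwhile the runahead thread prefetches ahead
            for b in addrs[i + 1 : i + 1 + RUNAHEAD_DEPTH]:
                cache.add(b)
        cycles += 1                          # one cycle per load issue
    return cycles

print(execute(list(range(8)), set()))
```

Eight cold loads cost only two full miss latencies here (208 cycles) instead of eight (808 cycles), which is the memory-level-parallelism effect runahead is after.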

Memory Bottlenecks
[Data slide: IPC for 128-entry and larger instruction windows with real and perfect L2 caches; a 128-entry window with a real L2 achieves 0.77 IPC, and adding runahead raises this to 0.94 IPC]

SMT Pipeline Structure
[Figure: per-thread front ends (each with I-cache, branch predictor, rename, and ROB) feeding one shared execution engine (register file, issue queue, FUs, D-cache); the front end may be private or shared per thread]
SMT maximizes utilization of the shared execution engine.

SMT Fetch Policy
Fetch policy has a major impact on throughput: it depends on cache/bpred miss rates, dependences, etc.
Commonly used policy, ICOUNT: fetch from the thread with the fewest in-flight instructions, so every thread gets an equal share of pipeline resources
- faster threads will fetch more often: improves throughput
- slow threads with long dependence chains will not hoard resources
- low probability of fetching wrong-path instructions
- higher fairness
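ICOUNT's selection step is essentially a one-liner; the in-flight counts below are hypothetical.

```python
# ICOUNT fetch selection: pick the thread with the fewest in-flight
# instructions (a stalled thread accumulates instructions and so
# stops being chosen; counts here are hypothetical).
def icount_pick(inflight):
    """inflight: dict thread_id -> number of in-flight instructions."""
    return min(inflight, key=lambda t: inflight[t])

queues = {0: 12, 1: 3, 2: 27, 3: 5}
print(icount_pick(queues))   # prints 1: the least-loaded thread
```

Thread 2, stalled behind a long-latency miss, has 27 instructions in flight and will not be fetched from until it drains.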

Area Effect of Multi-Threading
[Figure: area vs. number of threads; the curve is linear for a while]
Multi-threading adds a 5-8% area overhead per thread (primary caches are included in the baseline).
From Davis et al., PACT 2005.

Single Core IPC
[Figure: four bars per configuration correspond to four different L2 sizes; each bar shows the IPC range across different L1 sizes]

Maximal Aggregate IPCs
[Figure]

Power/Energy Basics
Energy = Power × time
Power = Dynamic power + Leakage power
Dynamic power = α C V² f, where
- α: switching activity factor
- C: capacitance being charged
- V: voltage swing
- f: processor frequency
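Plugging illustrative numbers into the dynamic-power formula (all values hypothetical) shows the quadratic leverage of voltage:

```python
# Dynamic power from the slide's formula: P = alpha * C * V^2 * f
# (all example values below are hypothetical).
def dynamic_power(alpha, cap_farads, v_volts, f_hz):
    return alpha * cap_farads * v_volts ** 2 * f_hz

# e.g. 10% activity, 1 nF switched capacitance, 1.0 V, 3 GHz:
p_full = dynamic_power(0.1, 1e-9, 1.0, 3e9)   # ~0.3 W
p_half = dynamic_power(0.1, 1e-9, 0.5, 3e9)   # halve V: power drops 4x
print(p_full, p_half)
```

Halving the voltage swing quarters dynamic power, which is why voltage scaling dominates frequency scaling for energy.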

Guidelines
- Dynamic frequency scaling (DFS) can impact power, but has little impact on energy
- Optimizing a single structure for power/energy helps overall energy only if execution time is not increased
- A good metric for comparison: ED² (because DVFS is an alternative way to play with the E-D trade-off)
- Clock gating is commonly used to reduce dynamic energy; DFS is very cheap (a few cycles); DVFS and power gating are more expensive (micro-seconds or tens of cycles, smaller margins, higher error rates)
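A quick check of why ED² is the DVFS-neutral metric: in the idealized model where scaling V and f down by a factor k cuts energy by k² (E ∝ V²) and stretches delay by k (D ∝ 1/f), ED² is unchanged, so only genuine design improvements move it.

```python
# ED^2 is invariant under idealized DVFS: scaling V and f by 1/k
# changes E by 1/k^2 and D by k, so E * D^2 stays put.
def ed2(energy, delay):
    return energy * delay ** 2

base = ed2(energy=4.0, delay=1.0)      # nominal operating point
scaled = ed2(energy=1.0, delay=2.0)    # V and f halved: E/4, D*2
assert base == scaled                  # DVFS alone cannot change ED^2
print(base, scaled)
```

A design change that reaches, say, the same delay at lower energy lowers ED² and is therefore a real win, not a DVFS artifact.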

Criticality Metrics
Criticality has many applications, in both performance and power; it is usually more useful for power optimizations.
QOLD: instructions that are the oldest in the issue queue are considered critical
- can be extended to oldest-N
- does not need a predictor
- young instructions are possibly on mispredicted paths
- young instructions' latencies can be tolerated
- older instructions are possibly holding up the window
- older instructions have more dependents in the pipeline than younger ones
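Selecting the QOLD-critical instructions is simple enough to sketch directly (queue contents below are hypothetical):

```python
# QOLD / oldest-N heuristic sketch: the N oldest instructions waiting
# in the issue queue are marked critical (entries are hypothetical).
def qold_critical(issue_queue, n=1):
    """issue_queue: list of (seq_num, op); oldest = smallest seq_num."""
    return sorted(issue_queue)[:n]

iq = [(7, "mul"), (2, "ld"), (9, "add"), (4, "add")]
print(qold_critical(iq, n=2))   # the two oldest waiting instructions
```

A power optimization might then run only these entries' functional units at full voltage and let the rest go slow.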

Other Criticality Metrics
- QOLDDEP: instructions producing values for the oldest instruction in the issue queue
- ALOLD: the oldest instruction in the ROB
- FREED-N: an instruction whose completion frees up at least N dependent instructions
- Wake-Up: an instruction whose completion triggers a chain of wake-up operations
- Instruction types: cache misses, branch mispredictions, and the instructions that feed them
