1 Lecture 20: Core Design
Today:
- Innovations for ILP, TLP, power
- ISCA workshops
- Sign up for class presentations

ILP Limits
[Figure: ILP limit data from Wall, 1993]

3 Techniques for High ILP
- Better branch prediction and fetch (trace cache)
    - cascading branch predictors?
- More physical registers, ROB, issue queue, LSQ
    - two-level regfile/IQ?
- Higher issue width
    - clustering?
- Lower average cache hierarchy access time
- Memory dependence prediction
- Latency tolerance techniques: ILP, MLP, prefetch, runahead, multi-threading

2Bc-gskew Branch Predictor
[Diagram: the branch address indexes the BIM table; the address combined with global history indexes the G0, G1, and Meta tables; the final prediction (Pred) is either BIM's output or a majority vote of BIM, G0, and G1, selected by Meta]
44 KB; 2-cycle access; used in the Alpha 21464

Rules
- On a correct prediction:
    - if all components agree, no update
    - if they disagree, strengthen the correct predictors and the chooser
- On a misprediction:
    - update the chooser and recompute the prediction
    - if the recomputed prediction is correct, strengthen the correct predictors
    - if it is still a misprediction, update all predictors
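
A rough Python sketch of this predict/update flow is below; it is illustrative only: the table sizes, index hashes, and counter details are assumptions, not the Alpha 21464 implementation.

```python
# Sketch of 2Bc-gskew prediction and the update rules above.
# Table sizes, hash functions, and counter initialization are assumptions.

SIZE = 1 << 12                      # entries per table (assumed)
bim  = [2] * SIZE                   # 2-bit saturating counters, weakly taken
g0   = [2] * SIZE
g1   = [2] * SIZE
meta = [2] * SIZE                   # chooser: taken => use the vote, else use BIM

def idx(pc, hist, skew):            # stand-in for the skewed index hashes (assumed)
    return (pc ^ (hist * skew)) % SIZE

def taken(ctr):
    return ctr >= 2

def strengthen(table, i, outcome):  # push a counter toward the outcome
    table[i] = min(3, table[i] + 1) if outcome else max(0, table[i] - 1)

def predict(pc, hist):
    ib, i0, i1, im = pc % SIZE, idx(pc, hist, 3), idx(pc, hist, 5), idx(pc, hist, 7)
    vote = (taken(bim[ib]) + taken(g0[i0]) + taken(g1[i1])) >= 2
    pred = vote if taken(meta[im]) else taken(bim[ib])
    return pred, (ib, i0, i1, im)

def update(pc, hist, outcome):
    pred, (ib, i0, i1, im) = predict(pc, hist)
    comps = [(bim, ib), (g0, i0), (g1, i1)]
    if pred == outcome:
        if all(taken(t[i]) == outcome for t, i in comps):
            return                                        # all agree: no update
        for t, i in comps:                                # disagreement: strengthen
            if taken(t[i]) == outcome:                    # the correct components
                strengthen(t, i, outcome)
        strengthen(meta, im, taken(meta[im]))             # and the chooser's choice
    else:
        strengthen(meta, im, not taken(meta[im]))         # update chooser, then recompute
        vote = sum(taken(t[i]) for t, i in comps) >= 2
        repred = vote if taken(meta[im]) else taken(bim[ib])
        if repred == outcome:
            for t, i in comps:                            # recomputed prediction correct:
                if taken(t[i]) == outcome:                # strengthen correct components
                    strengthen(t, i, outcome)
        else:
            for t, i in comps:                            # still wrong: update everything
                strengthen(t, i, outcome)
```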

Impact of Mem-Dep Prediction
In the perfect model, loads only wait for conflicting stores; in the naïve model, loads issue speculatively and must be squashed if a dependence is later discovered.
From Chrysos and Emer, ISCA'98
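
The Chrysos and Emer paper predicts these dependences with store sets. Below is a minimal sketch of that idea; the table sizes, indexing, and set-merging policy are simplifying assumptions and differ from the real design.

```python
# Minimal store-set sketch in the spirit of Chrysos and Emer (ISCA'98).
# Table sizes, the indexing, and the set-merging policy are simplifying assumptions.

SSIT_SIZE, LFST_SIZE = 1024, 128

ssit = [None] * SSIT_SIZE      # PC index -> store-set ID (SSID), or None
lfst = [None] * LFST_SIZE      # SSID -> tag of the last fetched, still in-flight store
next_ssid = 0

def ssit_index(pc):
    return pc % SSIT_SIZE

def on_store_fetch(pc, store_tag):
    """A store records itself as the last fetched store of its set (if it has one)."""
    ssid = ssit[ssit_index(pc)]
    if ssid is None:
        return None
    prev = lfst[ssid]           # the store also orders behind the previous store in the set
    lfst[ssid] = store_tag
    return prev

def on_load_fetch(pc):
    """A load must wait for the last fetched store of its set, if one is in flight."""
    ssid = ssit[ssit_index(pc)]
    return lfst[ssid] if ssid is not None else None

def on_store_complete(pc, store_tag):
    ssid = ssit[ssit_index(pc)]
    if ssid is not None and lfst[ssid] == store_tag:
        lfst[ssid] = None       # the store has left the window

def on_violation(load_pc, store_pc):
    """Memory-order violation: put the offending load and store in the same set."""
    global next_ssid
    li, si = ssit_index(load_pc), ssit_index(store_pc)
    if ssit[li] is None and ssit[si] is None:
        ssit[li] = ssit[si] = next_ssid % LFST_SIZE
        next_ssid += 1
    elif ssit[si] is not None:
        ssit[li] = ssit[si]
    else:
        ssit[si] = ssit[li]

on_violation(0x40, 0x80)        # a violation links load PC 0x40 and store PC 0x80
on_store_fetch(0x80, "st7")     # the store is fetched and recorded for its set
print(on_load_fetch(0x40))      # -> 'st7': the load now waits for that store
```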

Runahead (Mutlu et al., HPCA'03)
[Diagram: trace cache, current and retired rename tables, issue queue, ROB, FUs, a 128-entry register file plus a 32-entry checkpointed register file, the L1 D-cache, and a small runahead cache]
When the oldest instruction is a cache miss, behave as if it caused a context switch:
- checkpoint the committed registers, rename table, return address stack, and branch history register
- assume a bogus value and start a new thread
- this thread cannot modify program state, but can prefetch
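
To see why this helps, here is a toy, runnable model (not the HPCA'03 mechanism itself): while the oldest load is an outstanding miss, the core runs ahead and starts the later misses it encounters, so their latencies overlap. All latencies and the trace are made-up numbers, and loads whose addresses depend on the missing data are ignored for simplicity.

```python
# Toy model of the runahead benefit: overlapping independent L2 misses.

MISS_LATENCY = 200     # cycles for an L2 miss (assumed)

def run(trace, runahead=True):
    """trace: list of booleans; True means that instruction misses in L2."""
    cycle, outstanding, i = 0, {}, 0       # outstanding: index -> cycle when data returns
    while i < len(trace):
        if trace[i] and i not in outstanding:
            outstanding[i] = cycle + MISS_LATENCY        # start the miss
        if trace[i] and outstanding[i] > cycle:
            # the oldest instruction is a miss that has not returned yet
            if runahead:
                # run ahead under a checkpoint: touch later misses now (prefetch)
                for j in range(i + 1, len(trace)):
                    if trace[j] and j not in outstanding:
                        outstanding[j] = cycle + MISS_LATENCY
            cycle = outstanding[i]          # stall until the oldest miss returns
        cycle += 1                           # commit one instruction per cycle (toy model)
        i += 1
    return cycle

trace = [False] * 50 + [True] + [False] * 50 + [True] + [False] * 50
print("no runahead:", run(trace, runahead=False), "cycles")
print("runahead:   ", run(trace, runahead=True), "cycles")
```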

Memory Bottlenecks
- 128-entry window, real L2 → 0.77 IPC
- 128-entry window, perfect L2 → …
- …-entry window, real L2 → …
- …-entry window, perfect L2 → …
- …-entry window, real L2, runahead → 0.94 IPC

SMT Pipeline Structure
[Diagram: several per-thread front ends (I-Cache, Bpred, Rename, ROB) feeding one execution engine (Regs, IQ, FUs, DCache); the front end can be private or shared per thread, while the execution engine is shared]
SMT maximizes utilization of the shared execution engine

10 SMT Fetch Policy
Fetch policy has a major impact on throughput: it depends on cache/bpred miss rates, dependences, etc.
Commonly used policy is ICOUNT: fetch from the thread with the fewest in-flight instructions, so every thread gets an equal share of resources
- faster threads will fetch more often: improves throughput
- slow threads with dependences will not hoard resources
- low probability of fetching wrong-path instructions
- higher fairness
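
A minimal sketch of ICOUNT-style thread selection; the thread ids, counts, and the blocked set are illustrative assumptions, not the Tullsen et al. implementation.

```python
# ICOUNT-style fetch selection: each cycle, fetch from the thread with the
# fewest instructions in the pre-issue stages (decode/rename/issue queue).

def icount_pick(inflight_counts, blocked=frozenset()):
    """inflight_counts: {thread_id: instructions in decode/rename/IQ};
       blocked: threads that cannot fetch this cycle (e.g., I-cache miss)."""
    candidates = [t for t in inflight_counts if t not in blocked]
    if not candidates:
        return None
    # threads hogging the front of the machine naturally lose fetch priority
    return min(candidates, key=lambda t: inflight_counts[t])

# Example: thread 2 has the fewest in-flight instructions, so it fetches next.
print(icount_pick({0: 18, 1: 9, 2: 4, 3: 12}, blocked={1}))   # -> 2
```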

Area Effect of Multi-Threading
[Figure: core area versus number of threads; the curve is linear for a while]
Multi-threading adds a 5-8% area overhead per thread (primary caches are included in the baseline)
From Davis et al., PACT 2005

Single Core IPC
[Figure: four bars correspond to four different L2 sizes; each bar shows the IPC range across different L1 sizes]

Maximal Aggregate IPCs

Power/Energy Basics
Energy = Power × time
Power = Dynamic power + Leakage power
Dynamic power = α C V² f, where
- α = switching activity factor
- C = capacitance being charged
- V = voltage swing
- f = processor frequency
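
A small worked example of these formulas (all values are made-up toy numbers). It also shows why scaling frequency alone changes power but not dynamic energy, while lowering the voltage as well does reduce energy.

```python
# Dynamic power and energy, with frequency-only vs. voltage+frequency scaling.

def dynamic_power(alpha, C, V, f):
    return alpha * C * V**2 * f           # dynamic power = alpha * C * V^2 * f

def energy(power, seconds):
    return power * seconds                 # energy = power * time

base = dynamic_power(alpha=0.1, C=1e-9, V=1.0, f=2e9)        # 0.2 W (toy numbers)
work_time = 1.0                                               # program takes 1 s

# Frequency scaling alone: half the frequency -> half the power,
# but the program takes twice as long, so dynamic energy is unchanged.
dfs = dynamic_power(0.1, 1e-9, 1.0, 1e9)
print(energy(base, work_time), energy(dfs, 2 * work_time))    # same dynamic energy

# Scaling the voltage as well gives a quadratic energy win despite the slowdown.
dvfs = dynamic_power(0.1, 1e-9, 0.8, 1e9)
print(energy(dvfs, 2 * work_time))                            # lower energy
```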

15 Guidelines
- Dynamic frequency scaling (DFS) can impact power, but has little impact on energy
- Optimizing a single structure for power/energy is good for overall energy only if execution time is not increased
- A good metric for comparison: ED² (because DVFS is an alternative way to play with the E-D trade-off)
- Clock gating is commonly used to reduce dynamic energy
- DFS is very cheap (few cycles); DVFS and power gating are more expensive (micro-seconds or tens of cycles, fewer margins, higher error rates)
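
A tiny illustration of comparing design points with ED²; the energy and delay numbers are made up.

```python
# Under ideal DVFS, delay scales roughly as 1/s and energy as s^2 when frequency
# and voltage scale by s, so E*D^2 stays roughly constant; a design change is a
# genuine win only if it lowers E*D^2.

def ed2(energy_j, delay_s):
    return energy_j * delay_s**2

baseline = ed2(energy_j=10.0, delay_s=1.0)

# Design A: saves 20% energy but runs 15% slower
design_a = ed2(10.0 * 0.8, 1.0 * 1.15)
print(design_a < baseline)   # False: the energy saving does not justify the slowdown

# Design B: saves 20% energy and is only 5% slower
design_b = ed2(10.0 * 0.8, 1.0 * 1.05)
print(design_b < baseline)   # True: a genuine improvement under the ED^2 metric
```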

Criticality Metrics
Criticality has many applications: performance and power; usually, it is more useful for power optimizations.
QOLD: instructions that are the oldest in the issue queue are considered critical
- can be extended to oldest-N
- does not need a predictor
- young instructions are possibly on mispredicted paths
- young instruction latencies can be tolerated
- older instructions are possibly holding up the window
- older instructions have more dependents in the pipeline than younger instructions
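
A minimal sketch of picking the QOLD (oldest-N) critical instructions from an issue queue; the instruction representation and the choice of N are assumptions for illustration.

```python
# QOLD criticality: treat the N oldest instructions in the issue queue as
# critical (e.g., steer them to fast/high-power resources).

def qold_critical(issue_queue, n=1):
    """issue_queue: list of (seq_no, instr) waiting to issue.
       Returns the set of seq_nos considered critical (the oldest N)."""
    oldest_first = sorted(issue_queue, key=lambda entry: entry[0])
    return {seq for seq, _ in oldest_first[:n]}

# Example: with oldest-2, sequence numbers 17 and 21 are treated as critical.
iq = [(42, "add"), (17, "load"), (33, "mul"), (21, "sub")]
print(qold_critical(iq, n=2))   # -> {17, 21}
```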

Other Criticality Metrics
- QOLDDEP: instructions producing values for the oldest instruction in the issue queue
- ALOLD: the oldest instruction in the ROB
- FREED-N: instruction completion frees up at least N dependent instructions
- Wake-Up: instruction completion triggers a chain of wake-up operations
- Instruction types: cache misses, branch mispredictions, and instructions that feed them
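
As one example of these metrics, a minimal FREED-N check is sketched below; the issue-queue representation is an assumption for illustration.

```python
# FREED-N: an instruction is critical if its completion would directly wake up
# (free) at least N dependent instructions waiting in the issue queue.

def freed_n_critical(completing_tag, waiting, n=2):
    """waiting: list of (instr, unsatisfied_source_tags) entries in the issue queue."""
    freed = sum(1 for _, srcs in waiting
                if completing_tag in srcs and len(srcs) == 1)   # last missing source
    return freed >= n

waiting = [("add", {"t5"}), ("mul", {"t5", "t9"}), ("sub", {"t5"})]
print(freed_n_critical("t5", waiting, n=2))   # -> True: completing t5 frees add and sub
```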
