EECS 470 Memory Scheduling Lecture 11 Coverage: Chapter 3

Dynamic “Register” Scheduling Recap

Q: Do we need to know the result of an instruction to schedule its dependent operations?
– A: Once again, no. We need to know only the dependencies and the latency.
To decouple the wakeup-select loop:
– Broadcast the dstID back into the scheduler N cycles after the instruction enters REG, where N is the latency of the instruction
What if the latency of an operation is non-deterministic? E.g., load instructions (2-cycle hit, 8-cycle miss):
– Wait until the latency is known before scheduling dependents (SLOW)
– Predict the latency, and reschedule if incorrect (reschedule all vs. selective)

[Slide diagram: pipeline with IF, ID, Alloc, REN in order; RS, REG, EX, MEM, WB in any order; ROB and CT in order; plus the ARF. RS entries hold Op, dstID, phyID, and valid bits.]
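
Below is a minimal sketch (in C, with hypothetical names, and assuming a single broadcast bus) of the delayed-broadcast idea above: at issue time an instruction schedules its destination tag to be driven onto the wakeup bus N cycles later, where N is its latency (fixed for ALU ops, predicted for loads).

```c
#include <stdbool.h>

#define MAX_LATENCY 16          /* ring buffer covers latencies < 16 */

typedef struct {
    int  dst_id;                /* destination physical-register tag */
    bool valid;
} Broadcast;

/* One slot per future cycle, indexed by (cycle % MAX_LATENCY).
 * Assumes one broadcast bus, i.e., one completion per cycle. */
static Broadcast bus_schedule[MAX_LATENCY];

/* At issue: latency is fixed for ALU ops, predicted for loads. */
void schedule_broadcast(int cycle, int dst_id, int latency) {
    bus_schedule[(cycle + latency) % MAX_LATENCY] =
        (Broadcast){ .dst_id = dst_id, .valid = true };
}

/* Every cycle: returns the tag to drive onto the wakeup bus, or -1
 * if none.  If a load's latency was mispredicted (e.g., a miss),
 * this broadcast fired too early and the dependents that woke up
 * must be rescheduled. */
int broadcast_this_cycle(int cycle) {
    Broadcast *b = &bus_schedule[cycle % MAX_LATENCY];
    if (!b->valid)
        return -1;
    b->valid = false;
    return b->dst_id;
}
```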

Dynamic “Register” Scheduling Recap

[Slide diagram: reservation-station entries, each holding src1, src2, and dstID fields; comparators (==) match the broadcast dstID against waiting source tags, a timer delays each broadcast by the instruction's latency, and selection logic arbitrates req/grant among ready entries. Pipeline stages shown: RS, REG, EX, MEM, WB, with the ROB.]

Benefits of Register Communication

Directly specified dependencies (contained within the instruction):
– Accurate description of communication: no false or missing dependence edges; permits realization of the dataflow schedule
– Early description of communication: allows scheduler pipelining without impacting the speed of communication
Small communication name space:
– Fast access to communication storage
– Possible to map/rename the entire communication space (no tags)
– Possible to bypass communication storage

Why Memory Scheduling is Hard (Or, Why is it called HARDware?)

Loads/stores also have dependencies, through memory:
– Described by effective addresses
Cannot directly leverage the existing register-scheduling infrastructure:
– Memory dependencies are indirectly specified: the dataflow schedule is a function of the program's computation, which prevents an accurate description of communication early in the pipeline, and a pipelined scheduler is slow to react to addresses
– Large communication space (2^32 bytes!): cannot fully map the communication space; requires a more complicated cache and/or store-forward network

  *p = …
  *q = …
  … = *p      (does this load depend on either store?)
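
The pointer example above in compilable form. Whether the load depends on either store is decided only by the runtime values of p and q, i.e., by the effective addresses:

```c
/* Dependence through memory is indirect: it exists only if the
 * effective addresses collide, which the scheduler learns late. */
int example(int *p, int *q) {
    *p = 1;         /* store to [p]                          */
    *q = 2;         /* store to [q]: aliases [p] iff q == p  */
    return *p;      /* load: returns 2 if q == p, else 1     */
}
```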

Requirements for a Solution

Accurate description of memory dependencies:
– No (or few) missing or false dependencies
– Permit realization of the dataflow schedule
Early presentation of dependencies:
– Permit pipelining of the scheduler logic
Fast access to the communication space:
– Preferably as fast as register communication (zero cycles)

In-order Load/Store Scheduling

Schedule all loads and stores in program order
– Cannot violate true data dependencies (non-speculative)
Capabilities/limitations:
– Not accurate: may add many false dependencies
– Early presentation of dependencies (no addresses needed)
– Not fast: all communication goes through memory structures
Found in in-order issue pipelines

[Slide diagram: program-order sequence st X, ld Y, st Z, ld X, ld Z, comparing true dependence edges against realized edges; in-order scheduling realizes an edge between every adjacent pair of memory operations.]
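
A minimal sketch (hypothetical structures) of the policy: a load or store may issue only when every older memory operation in the window has already issued, whether or not their addresses could actually conflict.

```c
#include <stdbool.h>

typedef struct {
    bool is_mem;    /* load or store */
    bool issued;
} Inst;

/* idx: position of the candidate in the program-ordered window. */
bool may_issue_mem(const Inst *window, int idx) {
    for (int i = 0; i < idx; i++)
        if (window[i].is_mem && !window[i].issued)
            return false;   /* an older memory op is still pending */
    return true;
}
```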

In-order Load/Store Scheduling Example

[Slide animation: the sequence st X, ld Y, st Z, ld X, ld Z issues one memory operation per cycle in strict program order over five cycles, even though st X -> ld X is the only true dependence.]

Blind Dependence Speculation

Schedule loads and stores as soon as their register dependencies are satisfied
– May violate true data dependencies (speculative)
Capabilities/limitations:
– Accurate, if there is little in-flight communication through memory
– Early presentation of dependencies (no dependencies at all!)
– Not fast: all communication goes through memory structures
Most common with small windows

[Slide diagram: the same program-order sequence st X, ld Y, st Z, ld X, ld Z; blind speculation realizes none of the true dependence edges.]
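
One way to detect a violation under blind speculation (a minimal sketch with hypothetical structures): when an older store's address resolves, search the already-executed younger loads for an address match; any hit means the load read stale data and recovery is required.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t addr;      /* resolved effective address      */
    bool     executed;  /* has the load already run?       */
    int      seq;       /* program-order sequence number   */
} LoadEntry;

/* Called when a store's address resolves.  Returns the sequence
 * number of a violating load (recovery then squashes or selectively
 * re-executes from the oldest such load), or -1 if the blind
 * speculation turned out to be safe. */
int check_store(const LoadEntry *loads, int n,
                uint64_t st_addr, int st_seq) {
    for (int i = 0; i < n; i++)
        if (loads[i].executed && loads[i].seq > st_seq &&
            loads[i].addr == st_addr)
            return loads[i].seq;  /* younger load beat an older store */
    return -1;
}
```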

Blind Dependence Speculation Example

[Slide animation: all five memory operations issue as soon as their registers are ready; ld X executes before the older st X, violating the true st X -> ld X dependence: "mispeculation detected!"]

Discussion Points

Suggest two ways to detect blind load mispeculation.
Suggest two ways to recover from blind load mispeculation.

The Case for More/Less Accurate Dependence Speculation

Small windows: blind speculation is accurate for most programs; the compiler can register-allocate most short-term communication
Large windows: blind speculation performs poorly; many memory communications are in flight within the execution window [For 099.go, from Moshovos96]

Conservative Dataflow Scheduling

Schedule loads and stores only when all dependencies are known to be satisfied
– Conservative: won't violate true dependencies (non-speculative)
Capabilities/limitations:
– Accurate only if addresses arrive early
– Late presentation of dependencies (verified with addresses)
– Not fast: all communication goes through memory and/or a complex store-forward network
Common for larger windows

[Slide diagram: program-order sequence st X, ld Y, st? Z, ld X, ld Z, where "st?" marks a store whose address is not yet known, comparing true against realized dependence edges.]
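
A minimal sketch (hypothetical structures) of the conservative policy: a load stalls while any older store's address is unknown; once all are known, it either forwards from the youngest matching store or issues to the cache.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     addr_known;
    uint64_t addr;
    uint64_t data;
} StoreEntry;

typedef enum { STALL, FORWARD, ISSUE_TO_CACHE } LoadAction;

/* stores[0..n-1] are the in-flight stores older than the load,
 * oldest first.  Simplest conservative policy: stall on any
 * unknown older store address. */
LoadAction schedule_load(const StoreEntry *stores, int n,
                         uint64_t ld_addr, uint64_t *fwd_data) {
    LoadAction act = ISSUE_TO_CACHE;
    for (int i = 0; i < n; i++) {
        if (!stores[i].addr_known)
            return STALL;                /* unknown address: wait   */
        if (stores[i].addr == ld_addr) { /* youngest match wins     */
            *fwd_data = stores[i].data;
            act = FORWARD;
        }
    }
    return act;
}
```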

Conservative Dataflow Scheduling Example

[Slide animation: the loads younger than st? Z must wait until its unknown address resolves (to Z); each cycle the address remains unresolved adds a stall cycle that blind speculation would have avoided.]

Discussion Points

What if no dependent store is found, or a store's address is still unknown?
Describe the logic used to locate dependent store instructions.
What is the tradeoff between small and large memory schedulers?
How should uncached loads/stores be handled? Video RAM?

Memory Dependence Speculation [Moshovos96]

Schedule loads and stores when their data dependencies are satisfied
– Uses a dependence predictor to match sourcing stores to loads
– Doesn't wait for addresses, so it may violate true dependencies (speculative)
Capabilities/limitations:
– As accurate as the predictor
– Early presentation of dependencies (data addresses are not used in the prediction)
– Not fast: all communication still goes through memory structures

[Slide diagram: program-order sequence st? X, ld Y, st? Z, ld X, ld Z, comparing true against realized dependence edges.]

Dependence Speculation - In a Nutshell

Assumes the static placement of dependence edges is persistent
– Good assumption!
Common cases:
– Accesses to global variables
– Stack accesses
– Accesses to aliased heap data
The predictor tracks store/load PCs and reproduces the last sourcing store PC when given a load PC:

  A: *p = …
  B: *q = …
  C: … = *p

Given load PC C, the dependence predictor answers "A or B".
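
A minimal sketch (hypothetical hashing and sizes; the 4K-entry figure comes from the evaluation later in the lecture) of the predictor described above: a table indexed by load PC that remembers the PC of the store that last sourced the load's data.

```c
#include <stdint.h>

#define PRED_ENTRIES 4096   /* 4K entries, as in the evaluation below */

typedef struct {
    uint64_t load_pc;   /* tag                      */
    uint64_t store_pc;  /* last sourcing store PC   */
} DepEntry;

static DepEntry table[PRED_ENTRIES];

static unsigned hash(uint64_t pc) { return (pc >> 2) & (PRED_ENTRIES - 1); }

/* Training: on a detected store-to-load communication, record the pair. */
void train(uint64_t load_pc, uint64_t store_pc) {
    table[hash(load_pc)] = (DepEntry){ load_pc, store_pc };
}

/* Prediction: returns the predicted sourcing store PC, or 0 if the
 * load has no predicted in-flight dependence. */
uint64_t predict(uint64_t load_pc) {
    DepEntry *e = &table[hash(load_pc)];
    return (e->load_pc == load_pc) ? e->store_pc : 0;
}
```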

Memory Dependence Speculation Example

[Slide animation: ld X waits only on the store the predictor names as its source (st? X, which resolves to st X); the other memory operations issue without waiting, realizing just the predicted edge instead of stalling on every unknown store address.]

Memory Renaming [Tyson/Austin97]

Design maxims:
– Registers Good, Memory Bad
– Stores/Loads Contribute Nothing to Program Results
Basic idea:
– Leverage the dependence predictor to map memory communication onto the register synchronization and communication infrastructure
Benefits:
– Accurate dependence info, if the predictor is accurate
– Early presentation of dependence predictions
– Fast communication through the register infrastructure

Memory Renaming Example

Renamed dependence edges operate at bypass speed
The load/store address stream becomes a "checker" stream
– It need only be high-bandwidth (if the predictor performs well)
– Risky to remove the memory accesses completely

[Slide diagram: the sequence st X, ld Y, st Z, ld X, ld Z (I1-I5); the predicted st X -> ld X edge is renamed onto a register, so ld X receives its value through the bypass network while the memory accesses merely check the prediction.]

Memory Renaming Implementation

Speculative loads require a recovery mechanism
Enhancements muddy the boundaries between dependence, address, and value prediction:
– Long-lived edges reside in the rename table as addresses
– Semi-constants are also promoted into the rename table

[Slide diagram: the Dependence Predictor maps store/load PCs to a predicted edge name (a 5-9 bit tag); the Edge Rename Table maps each edge name to a physical storage assignment (the destination for stores, the source for loads), one entry per edge.]
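
A minimal sketch (hypothetical names) of the second level above: the edge rename table, indexed by the predicted edge name, binds a store's value register to the edge so the predicted-dependent load can read it at bypass speed; the real memory access is demoted to checking the prediction.

```c
#define EDGE_NAMES 512   /* 512 LRU-allocated edge names (see below) */

/* edge name -> physical register carrying the communicated value */
static int edge_rename_table[EDGE_NAMES];

/* Store side: bind the store's data register to its predicted edge. */
void rename_store(unsigned edge, int value_preg) {
    edge_rename_table[edge % EDGE_NAMES] = value_preg;
}

/* Load side: source the load's result from the edge's register
 * instead of waiting on the cache; the cache access still runs as
 * a checker and triggers recovery if the prediction was wrong. */
int rename_load(unsigned edge) {
    return edge_rename_table[edge % EDGE_NAMES];
}
```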

Experimental Evaluation

Implemented on a SimpleScalar 2.0 baseline
Dynamic scheduling timing simulation (sim-outorder):
– 256-instruction RUU
– Aggressive front end
– Typical two-level cache memory hierarchy
Aggressive memory renaming support:
– 4K entries in the dependence predictor
– 512 edge names, LRU-allocated
Load speculation support:
– Squash recovery
– Selective re-execution recovery

Dependence Predictor Performance

Good coverage of "in-flight" communication
Lots of room for improvement

Dependence Prediction Breakdown

Program Performance

Performance is predicated on:
– A high-bandwidth fetch mechanism
– An efficient mispeculation recovery mechanism
Better speedups with:
– Larger execution windows
– Increased store-forward latency
– A confidence mechanism

Additional Work

Turning of the crank: continue to improve the base mechanisms
– Predictors (loop-carried dependencies, better stack/global prediction)
– Improved mispeculation recovery performance
Value-oriented memory hierarchy
Data value speculation
Compiler-based renaming (tagged stores and loads):

  store r1,(r2):t1
  store r3,(r4):t2
  load  r5,(r6):t1