Lecture 13 Slide 1 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch.

Lecture 13 Slide 1 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch EECS 470 Lecture 13 Memory Speculation Winter 2015 Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch of Carnegie Mellon University, Purdue University, University of Michigan, Univerity of Pennsylvania, and University of Wisconsin.

Lecture 13 Slide 2 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch Tomasulo-Style Scheduler Implementation Scheduler Logic Results Input/Result Networks Reservation Stations Inputs Network Control Valid Bits Tag V V Value FlagsOp Synchronization managed by scheduler logic Communication through input/output networks Infrastructure geared towards register communication

Lecture 13 Slide 3 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch Out-of-Order Memory Operations Scheduling is straightforward in out-of-order… r Register inputs only r Register renaming captures all true dependences r Tags tell you exactly when you can execute … except loads r Register and memory inputs (older stores) r Register renaming does not tell you all dependences r How do loads find older in-flight stores to same address (if any)?  Issue of finding if addresses match is called “memory disambiguation”

Lecture 13 Slide 4 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch The Good: Register Communication Directly specified dependencies (contained in inst) r Accurate description of communication m no false or missing dependency edges m permits realization of dataflow schedule r Early description of communication m know dependencies upon decode m allows scheduler logic to be pipelined without impacting speed of communication Small communication name space (32-64 usually) r Fast access to communication storage m possible to map entire communication space (no tags) m possible to bypass communication storage

Lecture 13 Slide 5 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch The Bad (and the ugly): Memory Scheduling Loads/stores also have dependencies through memory r Described by effective addresses Cannot directly leverage existing (register) infrastructure r Indirectly specified memory dependencies m Memory dependencies are a function of program computation, prevents early accurate description of communication m Pipelined scheduler slow to react to addresses r Large communication space (2 32-64 bytes!) m Cannot fully map communication space, requires more complicated cache and/or store forward network Memory latency is variable Complicates scheduling

Lecture 13 Slide 6 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch Requirements for a Solution Accurate description of memory dependencies r No (or few) missing or false dependencies r Permit realization of dataflow schedule Early presentation of dependencies r Permit pipelining of scheduler logic Fast access to communication space r Preferably as fast as register communication (zero cycles)

Lecture 13 Slide 8 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch Some trivial ways to handle Loads Allow only one load or store in OoO core r Stall other operations at dispatch – very slow r No need for LSQ Load/store only issues when LSQ/ROB head (in-order scheduling) r Stall other operations at dispatch r Loads always get value from cache, only 1 outstanding load More aggressive options for forwarding: r Conservative load-to-store forwarding, when addresses known r Optimistic load-to-store forwarding – requires rewind mechanism Aggressive options for memory: r Aggressive fetching from memory if no load-to-store forwarding.

Lecture 13 Slide 9 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch Implementation Several hardware realizations: r Unified LSQ (easier to understand, but nasty hardware) r Separate LQ* and SQ (more complicated, but fairly elegant) We’ll start with a unified LSQ and move to separate LB and SQ. *Might end up with a load buffer rather than a queue…

Lecture 13 Slide 10 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch In-order Load/Store Scheduling Schedule all loads and stores in program order r Cannot violate true data dependencies (non-speculative) Capabilities/limitations: r Overly restrictive – likely to add many false dependencies r Early presentation of dependencies (no addresses) r Not fast, all communication through memory structures Found in in-order issue pipelines st X ld Y st Z ld X ld Z truerealized Dependencies program order

Lecture 13 Slide 11 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch ld Y st X In-order Load/Store Scheduling Example st X ld Y st Z ld X ld Z truerealized Dependencies program order time ld Y st Z ld X ld Z st Z ld X ld Z ld Y st X st Z ld X ld Z ld Y st X st Z ld X ld Z ld Y st X st Z ld X ld Z

Lecture 13 Slide 12 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch Consider the LSQ cases SlotLSQ addresses (top is head) AStore 0x44 BLoad 0x20 CLoad 0x44 DStore ???? ELoad 0x44 FStore 0x20 GLoad 0x20 Which of the loads are we sure we will fulfill via D$/Memory? Which of the loads will we fulfill via load-to- store forwarding? Which aren’t we sure of? Identify what to do with each load.

Lecture 13 Slide 13 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch Unified Load/Store Queue Operates as a circular FIFO r Allocate on dispatch r De-allocate on retirement Calc address in register dataflow order A NxN comparator matrix detects memory address dependence (also considers relative age of entries) r Store ops are held until retirement r Load ops are issued when no dependency exists & all older store addresses known = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = address calculation+ translation

Lecture 13 Slide 14 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch Unified Load/Store Queue Questions When do we search for store-to-load forwarding?  As soon as we have the load address What could happen once we have the load address?  There is a store whose data we’ll use  There is no store whose data we’ll use  We aren’t sure which store’s data, if any, we’ll use. What should we do for each of those three cases?

EECS 470 Lecture 9 Slide 15 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch Split LQ and SQ D$/TLB + structures to handle in-flight loads/stores Performs four functions In-order store retirement Writes stores to D$ in order Basic, implemented by store queue (SQ) Store-load forwarding Allows loads to read values from older un-retired stores Data provided to LQ from SQ. Memory ordering violation detection Checks load speculation (more later) Advanced, implemented by load queue (LQ) Memory ordering violation avoidance Advanced, implemented by dependence predictors

EECS 470 Lecture 9 Slide 16 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch Simple Data Memory FU: D$/TLB + SQ Just like any other FU 2 register inputs (addr, data in) 1 register output (data out) 1 non-register input (load pos)? Store queue (SQ) In-flight store address/value In program order (like ROB) Addresses associatively searchable Size heuristic: 15-20% of ROB But what does it do? valueaddress == age D$/TLB head tail load position address data indata out Store Queue (SQ)

EECS 470 Lecture 9 Slide 17 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch “Out-of-Order” Load Execution In parallel with D$ access Send address to SQ Compare with all store addresses CAM: like FA$, or RS tag match Select all matching addresses Age logic selects youngest store that is older than load Uses load position input Three possibilities There is a store we can forward from Do so! There is no store we can forward from Get from D$ We don’t know. ????? valueaddress == age D$/TLB head tail load position address data indata out Store Queue (SQ)

EECS 470 Lecture 9 Slide 18 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch D$/TLB + SQ + LQ Load queue (LQ) In-flight load addresses In program-order (like ROB,SQ) Associatively searchable Size heuristic: 20-30% of ROB == D$/TLB head tail load queue (LQ) address == tail head age store positionflush? SQ

EECS 470 Lecture 9 Slide 19 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch Conservative Memory Scheduling Schedule loads and stores when all dependencies known satisfied Conservative - won’t violate true dependencies (non-speculative) Capabilities/limitations: Accurate only if addresses arrive early Late presentation of dependencies (verified with addresses) Not fast, all communication through memory and/or complex store forward network Better for small windows st X ld Y st?Z ld X ld Z truerealized Dependencies program order

EECS 470 Lecture 9 Slide 20 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch ld Y st X Conservative Dataflow Scheduling st X ld Y st?Z ld X ld Z truerealized Dependencies program order time ld Y st?Z ld X ld Z st?Z ld X ld Z ld Y st X st?Z ld X ld Z ld Y st X st Z ld X ld Z ld Y st X st Z ld X ld Z ld Y st X st Z ld X ld Z Z stall cycle

EECS 470 Lecture 9 Slide 21 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch Opportunistic Memory Scheduling Observe: on average, < 10% of loads forward from SQ Even if older store address is unknown, chances are it won’t match Let loads execute in presence of older “ambiguous stores” +Increases performance But what if ambiguous store does match? Memory ordering violation: load executed too early Must detect… And fix (e.g., by flushing/refetching instruction starting at load)

EECS 470 Lecture 9 Slide 22 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch Opportunistic Memory Speculation Schedule loads and stores when register dependencies satisfied May violate true data dependencies (speculative) Capabilities/limitations: Accurate - if little in-flight communication through memory Early presentation of dependencies (no dependencies!) Not fast, all communication through memory structures Most common with small windows st X ld Y st Z ld X ld Z truerealized Dependencies program order

EECS 470 Lecture 9 Slide 23 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch Less aggressive speculation Once you have a load address, check if you will forward for certain (from somewhere, even if you aren’t sure where) If not, get from memory/D$ Don’t send to CDB until certain.

EECS 470 Lecture 9 Slide 24 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch Yet another idea… Let’s use a predictor Predict if a load is likely to be forwarding from a store If not, get it from D$ and send it to the CDB If so, wait. Thoughts on cost/benefit of speculation? How should that impact our predictor? “Memory dependence prediction”

Lecture 13 Slide 1 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch.

Similar presentations

Presentation on theme: "Lecture 13 Slide 1 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Lecture 13 Slide 1 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch.

Similar presentations

Presentation on theme: "Lecture 13 Slide 1 EECS 470 © 2015 -- Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch."— Presentation transcript:

Similar presentations

About project

Feedback