Handling Stores and Loads

Presentation transcript:

ECE 721, Spring 2019 Prof. Eric Rotenberg

Handling Stores and Loads

Three key principles:
  - Commit stores to memory in program order.
  - Dynamic memory disambiguation: determine which store a load depends on.
  - Store-load forwarding: forward the store value to the dependent load.
All three are facilitated by the Store Queue (SQ).

Complication:
  - Also need a Load Queue (LQ) to detect mispredicted loads.
  - A mispredicted load is a load that executed out of order (OOO) with respect to a prior store on which it depends.

Examples: All prior store addresses known

  Scenario 1: store B, store C, load A
    Get data from D$.
  Scenario 2: store A, store C, load A
    Get data from 1st store.
  Scenario 3: store A, load A [a second store line was not captured in this transcript]
    Get data from 2nd store.

Examples: Unknown Prior Store Addresses

  Scenario 4: store ?, store A, load A
    Get data from 2nd store.
  Scenario 5: store B, store ?, load A
    Get data from D$. (speculative)
  Scenario 6: store A, store ?, load A
    Get data from 1st store. (speculative)
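The scenarios above can be sketched as a search over the stores that precede the load, walking from the youngest to the oldest. This is only an illustrative model, not the course's implementation: `Store`, `search_sq`, the use of `None` for an unknown address, and the letter-valued addresses are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Store:
    # None models a store whose address has not been computed yet ("store ?").
    addr: Optional[str]


def search_sq(prior_stores: List[Store], load_addr: str) -> Tuple[str, Optional[int], bool]:
    """Return (source, position of forwarding store, speculative?).

    prior_stores is in program order (oldest first) and holds only the
    stores that come before the load.
    """
    unknown_seen = False
    # Walk from the youngest prior store toward the oldest.
    for pos in range(len(prior_stores) - 1, -1, -1):
        st = prior_stores[pos]
        if st.addr is None:
            # An unknown address younger than any match makes the result speculative.
            unknown_seen = True
        elif st.addr == load_addr:
            return ("SQ", pos, unknown_seen)
    # No matching store: the data cache supplies the value.
    return ("D$", None, unknown_seen)


if __name__ == "__main__":
    print(search_sq([Store("B"), Store("C")], "A"))   # Scenario 1
    print(search_sq([Store("A"), Store(None)], "A"))  # Scenario 6
```

Note that in Scenario 4 the unknown store is older than the matching store, so the sketch reports the result as non-speculative, in line with the slide.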

Load/Store Execution Lane

AGEN unit for computing load and store addresses.

Three structures:
  - L1 D$ (and L1 D-TLB)
  - Store Queue (SQ): contains all active stores in program order
      - Stores are speculative until they reach the head of the Active List.
      - SQ commits stores to D$ non-speculatively and in order.
      - Loads search the SQ for store values on which they depend.
  - Load Queue (LQ): contains all active loads in program order
      - Loads may execute out of order with respect to prior stores.
      - An executed load gets the wrong value if it depends on an older store that hasn’t executed yet.
      - Stores search the LQ for mispredicted loads.

The store will get the following indices at dispatch time:

  - SQ_index = SQ_tail: The store’s entry in the SQ.
    When the store executes later, it uses SQ_index to place its address and value into the SQ. In turn, these are needed for store-load forwarding and committing stores.
  - LQ_index = LQ_tail: Index of the first load after the store, in program order.
    When the store executes later, it searches the LQ for mispredicted loads: loads after the store, in program order, that depend on the store but executed before the store. Loads between LQ_index and LQ_tail are after the store in program order.

The load will get the following indices at dispatch time:

  - LQ_index = LQ_tail: The load’s entry in the LQ.
    When the load executes later, it uses LQ_index to place its address into the LQ. In turn, the address is needed for detecting mispredicted loads.
  - SQ_index = SQ_tail - 1: Index of the immediately preceding store, in program order.
    When the load executes later, it searches the SQ for a dependence on a prior store. It only considers stores between SQ_head and SQ_index: these are the stores before the load, in program order.
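A minimal sketch of the index assignment at dispatch, assuming circular queues addressed modulo their size. The class name `MemQueues`, its methods, and the queue sizes are hypothetical; the sketch also leaves out head pointers and full/empty tracking, which a real design needs (for example, to recognize that a load's SQ_index is meaningless when the SQ is empty).

```python
class MemQueues:
    """Hypothetical model of LQ/SQ tail pointers at dispatch time."""

    def __init__(self, lq_size: int = 8, sq_size: int = 8):
        self.lq_size = lq_size
        self.sq_size = sq_size
        self.lq_tail = 0
        self.sq_tail = 0

    def dispatch_store(self):
        # SQ_index: the store's own entry. LQ_index: first load after the store.
        sq_index = self.sq_tail
        lq_index = self.lq_tail
        self.sq_tail = (self.sq_tail + 1) % self.sq_size
        return sq_index, lq_index

    def dispatch_load(self):
        # LQ_index: the load's own entry. SQ_index: immediately preceding store.
        lq_index = self.lq_tail
        sq_index = (self.sq_tail - 1) % self.sq_size
        self.lq_tail = (self.lq_tail + 1) % self.lq_size
        return lq_index, sq_index


if __name__ == "__main__":
    q = MemQueues(lq_size=4, sq_size=4)
    print("store:", q.dispatch_store())
    print("load: ", q.dispatch_load())
```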

Pipeline stages: fetch, decode, rename, dispatch, schedule, register read, execute, writeback, retire

Dispatch:
  (1) Place load or store in Active List at tail.
  (2) Place load or store in Issue Queue (IQ).
  (3) Place load or store in Load Queue (LQ) or Store Queue (SQ), respectively, at tail.
      - A load gets LQ tail (LQ_index: where it resides in LQ) and SQ tail minus 1 (SQ_index: index of immediately preceding store in SQ).
      - A store gets SQ tail (SQ_index: where it resides in SQ) and LQ tail (LQ_index: index of to-be-dispatched, immediately succeeding load in LQ).

Execute:
  (1) Calculate address (AGEN).
  (2) Load: Use address to access D$ and search SQ for matching addresses (D$ and SQ accessed in parallel); based on result of SQ search, load gets value from SQ (closest matching store) or D$ (no matching store in SQ). Also record load’s address in LQ.
      Store: Use address to search LQ for matching addresses; if there is a future load that already executed, and its address matches, mark that load in the Active List as “mispredicted”. Also record store’s address and value in the SQ.

Retire:
  - Load: If marked as “mispredicted”, initiate recovery actions (e.g., use “Approach #1” or “Approach #2”); otherwise commit load the same way as other register-producing instructions. (Note: Re-executing a mispredicted load after recovery will succeed because all prior stores have committed.)
  - Store: Signal the store at the head of the SQ to write its value to the D$ at its address. (Note: The store at the head of the Active List is the same as the store at the head of the SQ.)
  - After load or store successfully commits, pop from LQ or SQ, respectively.
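The store's LQ search at execute can be sketched as a walk from the store's LQ_index up to LQ_tail. This is a simplified model under several assumptions: `LQEntry` and `store_search_lq` are hypothetical names, the "mispredicted" mark is kept in the LQ entry rather than the Active List, the walk treats LQ_index == LQ_tail as "no later loads", and every executed later load with a matching address is flagged, even one that might have been forwarded from a younger store.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LQEntry:
    addr: Optional[int] = None   # filled in when the load executes
    executed: bool = False
    mispredicted: bool = False


def store_search_lq(lq: List[LQEntry], store_lq_index: int, lq_tail: int,
                    store_addr: int) -> List[int]:
    """Flag executed loads after the store whose address matches the store's."""
    flagged = []
    i = store_lq_index
    # Entries from the store's LQ_index up to (not including) LQ_tail are
    # the loads that follow the store in program order.
    while i != lq_tail:
        entry = lq[i]
        if entry.executed and entry.addr == store_addr:
            entry.mispredicted = True
            flagged.append(i)
        i = (i + 1) % len(lq)
    return flagged


if __name__ == "__main__":
    lq = [LQEntry() for _ in range(8)]
    lq[0] = LQEntry(addr=0x40, executed=True)   # load older than the store
    lq[2] = LQEntry(addr=0x40, executed=True)   # later load, same address
    lq[3] = LQEntry(addr=0x80, executed=True)   # later load, other address
    print(store_search_lq(lq, store_lq_index=2, lq_tail=5, store_addr=0x40))
```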

[Figure: Store execution datapath (two slides)]

[Figure: Load execution datapath (two slides)]

Speculative Load Handling: A Rich Design Space

A ready load is speculative if there are unknown store addresses between it and the closest matching store (if any).

Four dimensions of speculative load handling:
  - Memory dependence prediction
  - Store-load synchronization strategy
  - Load misprediction recovery strategy
  - Impact of store execution (split stores vs. no split stores)

Memory Dependence Prediction

Static prediction policies:
  - Always predict no dependence with prior unexecuted stores (this was our initial policy)
      - Always speculatively execute the load
  - Always predict a dependence with a prior unexecuted store
      - Always stall the load

Dynamic memory dependence prediction:
  - A speculative load is either stalled or speculatively executed based on history
  - Examples:
      - Table of sticky-bits indexed by load PC (a mispredicted load sets its sticky-bit; periodically reset all sticky-bits to retrain)
      - Store Sets (learn dependencies between store PCs and load PCs)
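The sticky-bit table mentioned above could look roughly like the following. This is a sketch under assumptions the slides do not specify: the class name `StickyBitPredictor`, the table size, the `pc >> 2` indexing, and a reset driven by a simple tick counter are all illustrative choices.

```python
class StickyBitPredictor:
    """Hypothetical sticky-bit memory dependence predictor indexed by load PC."""

    def __init__(self, entries: int = 1024, reset_interval: int = 100_000):
        self.bits = [False] * entries
        self.reset_interval = reset_interval
        self.ticks = 0

    def _index(self, load_pc: int) -> int:
        # Assumed indexing: drop the low two PC bits, then wrap to table size.
        return (load_pc >> 2) % len(self.bits)

    def should_stall(self, load_pc: int) -> bool:
        # A set bit predicts a dependence on a prior unexecuted store.
        return self.bits[self._index(load_pc)]

    def train_mispredict(self, load_pc: int) -> None:
        # A mispredicted load sets its sticky bit.
        self.bits[self._index(load_pc)] = True

    def tick(self) -> None:
        # Periodically clear every bit so the predictor can retrain.
        self.ticks += 1
        if self.ticks >= self.reset_interval:
            self.bits = [False] * len(self.bits)
            self.ticks = 0


if __name__ == "__main__":
    p = StickyBitPredictor(entries=16, reset_interval=3)
    p.train_mispredict(0x100)
    print(p.should_stall(0x100), p.should_stall(0x104))
```

Because the table is untagged in this sketch, two load PCs that map to the same entry share one prediction.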

Store-Load Synchronization Strategy

How and when is a stalled load unstalled? Two example approaches:

  - IQ-based synchronization
      - Augment the Issue Queue (IQ) to synchronize stores and their predicted-dependent loads.
      - Linking stores and loads in the IQ requires a memory dependence predictor like Store Sets to set up the linkages.
  - LQ-based synchronization
      - A load issues unimpeded. If the SQ search is inconclusive (speculative load) and the prediction is “stall”, don’t complete the load (like a cache miss).
      - Periodically replay the load from the LQ until the SQ search is conclusive; or replay the load when it gets near or reaches the head of the Active List; etc.

Load Misprediction Recovery Strategies

Squash:
  - Wait for the mispredicted load to reach the head of the Active List.
  - Squash the pipeline, roll back the RMT, etc.
  - Restart fetching from the PC of the load.

Selective Re-execution:
  - When a store detects a mispredicted load, “replay” the load and its dependent instructions from either the IQ (must hold onto IQ entries until proven non-speculative) or a Replay Buffer.
  - How exactly? Wait until the value prediction lectures for details on: (1) identifying a load’s dependent instructions, (2) re-injecting load-dependent instructions back into the IQ.

Impact of Store Issue Policy

“Split Stores”:
  - Crack a store instruction into its address-generation micro-op (agen) and its value-read micro-op (val).
      - The store takes two IQ entries, for its two independent micro-ops.
      - The store takes one SQ entry (agen and val recombine in the SQ).
  - A stalled val does not stall a ready agen.
      - Permits stores to deposit their addresses in the SQ as soon as possible.
      - Loads have the best information possible when they search the SQ. Prevents unnecessary mispredictions.
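A minimal sketch of how the two micro-ops of a split store might recombine in a single SQ entry. The names `SQEntry`, `exec_agen`, and `exec_val` are hypothetical, and using `None` for "not yet written" is an assumption of the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SQEntry:
    addr: Optional[int] = None    # written by the agen micro-op
    value: Optional[int] = None   # written by the val micro-op

    @property
    def addr_known(self) -> bool:
        # Loads searching the SQ only need the address to disambiguate.
        return self.addr is not None

    @property
    def complete(self) -> bool:
        # Both halves have arrived, so the store can forward and later commit.
        return self.addr is not None and self.value is not None


def exec_agen(sq: List[SQEntry], sq_index: int, addr: int) -> None:
    """agen micro-op: deposit the store address as soon as it is computed."""
    sq[sq_index].addr = addr


def exec_val(sq: List[SQEntry], sq_index: int, value: int) -> None:
    """val micro-op: deposit the store data once its source register is ready."""
    sq[sq_index].value = value


if __name__ == "__main__":
    sq = [SQEntry(), SQEntry()]
    exec_agen(sq, 0, 0x40)            # agen is ready, val is still stalled
    print(sq[0].addr_known, sq[0].complete)
    exec_val(sq, 0, 7)
    print(sq[0].addr_known, sq[0].complete)
```

In this model a load that searches the SQ after `exec_agen` already sees the store's address, even though the value has not arrived yet.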

Interactions

Accurate memory dependence prediction:
  - Suggests simple, low-performance recovery (easy and low-cost, doesn’t occupy IQ entries longer than needed).
  - May render split stores unnecessary (split stores increase IQ pressure and issue width pressure).

Split stores, selective re-execution:
  - With these aggressive mechanisms, is explicit memory dependence prediction unnecessary? I.e., always assume no dependencies with unknown store addresses?