Scalable Load and Store Processing in Latency Tolerant Processors
Amit Gandhi 1,2, Haitham Akkary 1, Ravi Rajwar 1, Srikanth T. Srinivasan 1, Konrad Lai 1
1 Intel, 2 Portland State University

2 Problem: tolerating miss latencies
Increasing miss latencies to memory
  – large instruction windows tolerate latencies
  – naïve window scaling impractical
Resource efficient large instruction windows
  – sustain 1000s of instructions in-flight
  – need small register files and schedulers
  – do not address memory buffer efficiency
Must track all memory operations
Memory consistency, ordering, and forwarding

3 Why is this a problem?
Memory operations tracked in load & store buffers
  – buffers require CAMs for scanning and matching
  – CAMs have high area and power requirements
Don’t always need large memory buffers
  – L2 cache hit → small buffers sufficient
  – L2 cache miss → large buffers necessary
Scaling CAM is difficult
Why pay the price when not necessary?
Must eliminate CAMs from large buffers

4 Loads: Unordered buffer
Hierarchical load buffers
Conventional level one load buffer
  – effective in the absence of a miss
Un-ordered level two load buffer
  – used only when long latency miss occurs
  – set-associative cache structure
      – no scan, only indexed lookup necessary
  – does not track precise order of loads
      – sufficient to know if violation occurred (not where)
      – checkpoint rollback
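The un-ordered level two load buffer can be pictured with a small model. This is a minimal sketch under stated assumptions, not the design from the talk: the class name, the modulo index, the set and way counts, and the conservative overflow policy are all hypothetical. It only illustrates the two properties named on the slide: lookup is by index (no scan of every entry), and the answer is "a violation may have occurred", not which load caused it.

```python
class UnorderedLoadBuffer:
    """Toy set-associative record of load addresses; program order is not kept."""

    def __init__(self, num_sets=4, ways=2):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [[] for _ in range(num_sets)]
        # Assumption: if a set is full we stop tracking precisely and
        # treat every later store as a potential conflict.
        self.overflowed = False

    def record_load(self, addr):
        """Remember that some load read `addr` (which load is not recorded)."""
        entries = self.sets[addr % self.num_sets]
        if addr in entries:
            return
        if len(entries) == self.ways:
            self.overflowed = True
        else:
            entries.append(addr)

    def store_conflicts(self, addr):
        """Indexed lookup in one set only: was `addr` read by a recorded load?"""
        return self.overflowed or addr in self.sets[addr % self.num_sets]
```

Because the buffer cannot say where the violation happened, the only recovery it supports is the one the slide names: roll back to a checkpoint.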

5 Stores: CAM-free buffers
Hierarchical store queue
Conventional level one store queue
  – effective in the absence of a miss
CAM-free level two store queue
  – used only when long latency miss occurs
  – used only for ordering → no scanning or matching necessary in queue
Decouple ordering from forwarding
  1. Redo stores to enforce order
  2. Forward from cache instead of queue

6 Outline
Motivation
Resource efficient processors
  – Continual Flow Pipelines
  – memory buffer demands
Store processing
Results
Summary

7 Implications of a miss
Long latency misses to memory
  – place pressure on critical resources
  – pipeline quickly stalls due to blocked resources
Large instruction window processors
  – execute useful instructions in shadow of miss
  – tolerate latency by overlapping miss with useful work
  – naïve scaling impractical
Resource-efficient instruction windows
  – scale window to thousands
  – do not require scaled cycle-critical structures

8 Resource-efficient latency tolerance
Significant fraction of instructions in the shadow of a miss are independent of the miss
Exploit above program property
Treat and process miss-dependent and miss-independent instructions differently

9 Continual Flow Pipeline processor
Miss dependent instructions
  – release critical resources
  – leave pipeline, and wait outside pipeline in slice buffer
Miss independent instructions
  – execute
  – release critical resources and retire
When miss returns
  – miss-dependent instructions re-acquire resources
  – execute and retire
After miss-dependent instructions execute
  – results automatically integrated
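The split between miss-dependent and miss-independent instructions can be illustrated by propagating a "poisoned" mark from the destination register of the missing load. This is only a sketch under that assumption: the slides do not say how dependence is tracked, and the function name, the `(dest, sources)` tuple encoding, and the register names are hypothetical.

```python
def split_on_miss(instructions, miss_dest):
    """Partition instructions that follow a missing load.

    instructions: list of (dest_register, [source_registers]) in program order.
    miss_dest: destination register of the load that missed.
    Returns (executed, slice_buffer), both in program order.
    """
    poisoned = {miss_dest}          # registers whose value depends on the miss
    executed, slice_buffer = [], []
    for dest, sources in instructions:
        if poisoned.intersection(sources):
            # Miss-dependent: leaves the pipeline and waits in the slice buffer.
            poisoned.add(dest)
            slice_buffer.append((dest, sources))
        else:
            # Miss-independent: executes now; its result is no longer poisoned.
            poisoned.discard(dest)
            executed.append((dest, sources))
    return executed, slice_buffer
```

When the miss returns, only the contents of `slice_buffer` need to re-acquire resources and execute.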

10 Continual Flow Pipeline processor
Critical resource efficient
  – don’t require large register files, large schedulers
Need to track all memory operations
  – large load buffer → large CAM footprint and power
  – hierarchical store queue
      – small, fast L1 store queue (32 entries)
      – large, slow L2 store queue (~512 entries) → large CAM footprint → high leakage power
      – good performance

11 Why track all memory operations?
Stores must update in program order
Load/store dependence speculation
Multiprocessor memory consistency
Continual Flow Pipeline processors
  – execute independents ahead of dependents
  – aggressively reorder the execution of memory operations

12 Outline
Motivation
Resource efficient processors
Store processing
  – store queue overview
  – SRL key idea
  – SRL workings
Results
Summary

13 Functions of a store queue
Ordering
  – ensure memory updates are in program order
  – correctness
Forwarding
  – provide data to subsequent loads
  – performance
  – CAM
[Figure: a load (LD) address (AD) is matched against the addresses held in the store queue (STQ); on a match, the store data is forwarded to the load.]
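The two functions of a conventional store queue can be sketched in a few lines. This is a software model under assumptions, not hardware: the class and method names are hypothetical, and the linear search in `forward` stands in for the CAM, which in hardware compares the load address against every entry in parallel.

```python
from collections import deque


class StoreQueue:
    """Toy store queue: one structure does both ordering and forwarding."""

    def __init__(self):
        self.entries = deque()      # (addr, data), oldest store on the left

    def insert(self, addr, data):
        """Allocate an entry in program order."""
        self.entries.append((addr, data))

    def forward(self, addr):
        """Forwarding: return data of the youngest matching store, else None."""
        for entry_addr, data in reversed(self.entries):
            if entry_addr == addr:  # models the CAM address match
                return data
        return None

    def retire_oldest(self, memory):
        """Ordering: memory is updated strictly oldest-first."""
        addr, data = self.entries.popleft()
        memory[addr] = data
```

The cost of `forward` grows with the number of entries, which is the scaling problem the later slides address by moving forwarding to the cache.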

14 Conventional store queue
Single structure for ordering, forwarding
Large sizes increase CAM area & leakage
  – CAM contribution to area and power dominates
Efficiency → Eliminate CAMs

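15 Decoupling ordering from forwarding
[Figure: the CAM-based L2 STQ is replaced by two CAM-free structures. The Store Redo Log (SRL) is a FIFO that records program order (no CAM); the data cache provides forwarding (no CAM). Other labels in the figure: AD, SRAM.]
No CAMs for ordering/forwarding!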

16 Store Redo Log workings (1)
In shadow of a miss
Allocate FIFO L2 store queue (SRL) entry for all stores
  – records program order for stores
Dependent stores
  – not ready, release L1 store queue entry, and enter SRL
Independent stores
  – update cache temporarily, and enter SRL
Loads
  – independent loads forward from cache & retire
  – dependent loads go to slice buffer
  – do not scan L2 store queue for forwarding

17 Store Redo Log workings (2)
When miss returns
Discard all independent store updates to cache
  – these stores don’t re-execute
  – their dependents don’t re-execute
Drain the SRL in program order
  – reconstruct memory live-outs
  – program order maintained
  – no re-execution, only re-update
      – no extra cache ports required
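The two phases of the Store Redo Log can be put together in a small model. It is a sketch under assumptions, not the authors' implementation: the class and method names are hypothetical, a dictionary stands in for the data cache, a separate `temp` dictionary stands in for the temporary cache updates, and miss-dependent stores are assumed to have received their data before the log is drained.

```python
from collections import deque


class StoreRedoLog:
    """Toy FIFO store log: ordering by redo, forwarding from the cache."""

    def __init__(self, cache):
        self.cache = cache          # committed contents, addr -> data
        self.temp = {}              # temporary updates from independent stores
        self.log = deque()          # [addr, data] entries in program order

    def store(self, addr, data=None):
        """Log every store; data=None marks a miss-dependent (not ready) store."""
        entry = [addr, data]
        self.log.append(entry)
        if data is not None:
            self.temp[addr] = data  # independent store: temporary cache update
        return entry

    def load(self, addr):
        """Independent load: reads the cache view; the log is never scanned."""
        return self.temp.get(addr, self.cache.get(addr))

    def miss_returned(self):
        """Discard temporary updates, then redo all stores oldest-first."""
        self.temp.clear()
        while self.log:
            addr, data = self.log.popleft()
            if data is None:
                raise ValueError("dependent store has not executed yet")
            self.cache[addr] = data  # re-update only, no re-execution
```

With illustrative values (not the ones from the slides): if the cache holds X=38 and Y=2, a dependent store to X is followed by independent stores Y=17 and X=5, and the dependent store later resolves to 12, draining the log writes X=12, Y=17, X=5 in that order, so the cache ends with X=5 and Y=17. Redoing in program order is what keeps the older store from overwriting the younger one.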

18 Hazards
Write after Write (WAW)
Write after Read (WAR)
Read after Write (RAW)

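19 Handling hazards: WAW
[Figure: stores ST X, ST Y, ST X (in program order) shown moving through the L1 STQ, the SRL, and the cache, before and after the miss returns. The data values in the figure are not recoverable from the transcript.]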

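20 Handling hazards: WAR
[Figure: LD X, ST X, ST Y (in program order) shown moving through the L1 STQ, the L1 LDQ, the slice buffer, the SRL, and the cache, before and after the miss returns. The data values in the figure are not recoverable from the transcript.]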

21 Handling hazards: RAW
Detect by snooping completed stores
Restart execution in case of violations
  – restore to checkpoint
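The detect-and-restart policy can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, a set of addresses stands in for the record of loads that executed in the shadow of the miss, and only stores that complete late (after such loads) are assumed to be snooped.

```python
def raw_violation(completed_store_addrs, speculative_load_addrs):
    """True if any late-completing store hits an address a load already read."""
    loaded = set(speculative_load_addrs)
    return any(addr in loaded for addr in completed_store_addrs)


def resolve(state, checkpoint, completed_store_addrs, speculative_load_addrs):
    """Keep the speculative state if no violation, else restore the checkpoint."""
    if raw_violation(completed_store_addrs, speculative_load_addrs):
        return dict(checkpoint)     # restart execution from the checkpoint
    return state
```

The check reports only that a violation occurred, which is enough when recovery is a full rollback to the checkpoint.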

22 Outline
Motivation
Latency tolerant processor background
Store processing
Results
Summary

23 Evaluation
Ideal store queue
  – large L1 STQ (Latency = 3 cycles)
  – gives upper-bound (impractical to build)
Hierarchical store queue
  – L1 STQ (Latency = 3 cycles)
  – L2 STQ (with CAMs) (Latency = 8 cycles)
SRL store processing
  – L1 STQ (Latency = 3 cycles)
  – FIFO CAM-free Store Redo Log
Baseline
  – L1 STQ (Latency = 3 cycles)

24 SRL performance
Performance within 6% of ideal store queue

25 Power and area comparison
Hierarchical store queue
  – 90nm CMOS technology
  – SPICE simulations
  – circuit optimized to reduce leakage power
  – banked structure to reduce dynamic power
SRL over Hierarchical STQ
  – more than 50% reduction in leakage power
  – more than 90% reduction in dynamic power
  – 75% reduction in the area

26 Summary
CAM-free secondary structures
Set-associative L2 Load buffer
FIFO L2 Store queue
  – Don’t constantly enforce order
  – Ensure correct order by redoing the stores
75% area and 50% leakage power savings
No CAM → scalable design
