Scalable Load and Store Processing in Latency Tolerant Processors
Amit Gandhi (1,2), Haitham Akkary (1), Ravi Rajwar (1), Srikanth T. Srinivasan (1), Konrad Lai (1)
(1) Intel  (2) Portland State University

Slide 2: Problem: tolerating miss latencies
Increasing miss latencies to memory
– large instruction windows tolerate latencies
– naïve window scaling is impractical
Resource-efficient large instruction windows
– sustain 1000s of instructions in flight
– need only small register files and schedulers
– do not address memory buffer efficiency
Must track all memory operations
– memory consistency, ordering, and forwarding

Slide 3: Why is this a problem?
Memory operations are tracked in load and store buffers
– buffers require CAMs for scanning and matching
– CAMs have high area and power requirements
Large memory buffers are not always needed
– L2 cache hit → small buffers sufficient
– L2 cache miss → large buffers necessary
Scaling a CAM is difficult; why pay the price when it is not necessary?
Must eliminate CAMs from the large buffers

Slide 4: Loads: unordered buffer
Hierarchical load buffers
Conventional level-one load buffer
– effective in the absence of a miss
Unordered level-two load buffer
– used only when a long-latency miss occurs
– set-associative cache structure: no scan, only indexed lookup necessary (see the sketch below)
– does not track the precise order of loads: it is sufficient to know that a violation occurred (not where), and recovery is by checkpoint rollback
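A minimal sketch of such an unordered, set-associative level-two load buffer follows. The geometry (SETS, WAYS), line granularity, and overflow policy are illustrative assumptions, not the paper's exact design; the point is that both insertion and snooping are indexed lookups, so no CAM scan over all entries is ever needed and no ordering information is kept.

```python
SETS = 256          # assumed geometry
WAYS = 4
LINE_SHIFT = 6      # track addresses at 64-byte cache-line granularity

class UnorderedL2LoadBuffer:
    def __init__(self):
        self.sets = [[] for _ in range(SETS)]

    def _locate(self, addr):
        line = addr >> LINE_SHIFT
        return line, self.sets[line % SETS]

    def record_load(self, addr):
        """Remember that a load to addr executed in the miss shadow."""
        line, ways = self._locate(addr)
        if line not in ways:
            if len(ways) >= WAYS:
                ways.pop(0)   # overflow policy is an assumption; a real design
                              # must handle overflow conservatively to stay correct
            ways.append(line)

    def store_conflicts(self, addr):
        """Indexed lookup when a store completes: a hit means a violation may
        have occurred (we don't know where), forcing checkpoint rollback."""
        line, ways = self._locate(addr)
        return line in ways

buf = UnorderedL2LoadBuffer()
buf.record_load(0x1040)
assert buf.store_conflicts(0x1078)   # same 64-byte line: possible violation
```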

Slide 5: Stores: CAM-free buffers
Hierarchical store queue
Conventional level-one store queue
– effective in the absence of a miss
CAM-free level-two store queue
– used only when a long-latency miss occurs
– used only for ordering → no scanning or matching necessary in the queue
Decouple ordering from forwarding
1. Redo stores to enforce order
2. Forward from the cache instead of the queue

Slide 6: Outline
Motivation
Resource-efficient processors
– Continual Flow Pipelines
– memory buffer demands
Store processing
Results
Summary

Slide 7: Implications of a miss
Long-latency misses to memory
– place pressure on critical resources
– the pipeline quickly stalls due to blocked resources
Large-instruction-window processors
– execute useful instructions in the shadow of the miss
– tolerate latency by overlapping the miss with useful work
– naïve scaling is impractical
Resource-efficient instruction windows
– scale the window to thousands of instructions
– do not require scaling of cycle-critical structures

Slide 8: Resource-efficient latency tolerance
A significant fraction of the instructions in the shadow of a miss are independent of the miss
Exploit this program property: treat and process miss-dependent and miss-independent instructions differently

Slide 9: Continual Flow Pipeline processor
Miss-dependent instructions
– release critical resources
– leave the pipeline and wait outside it in a slice buffer
Miss-independent instructions
– execute
– release critical resources and retire
When the miss returns
– miss-dependent instructions re-acquire resources
– execute and retire
After the miss-dependent instructions execute
– their results are automatically integrated

Slide 10: Continual Flow Pipeline processor
Efficient in critical resources
– does not require large register files or large schedulers
Still needs to track all memory operations
– large load buffer → large CAM footprint and power
– hierarchical store queue:
  small, fast L1 store queue (32 entries)
  large, slow L2 store queue (~512 entries) → large CAM footprint → high leakage power
  good performance, but at a high cost

Slide 11: Why track all memory operations?
Stores must update memory in program order
Load/store dependence speculation
Multiprocessor memory consistency
Continual Flow Pipeline processors
– execute independent instructions ahead of dependent ones
– aggressively reorder the execution of memory operations

Slide 12: Outline
Motivation
Resource-efficient processors
Store processing
– store queue overview
– SRL key idea
– SRL workings
Results
Summary

Slide 13: Functions of a store queue
Ordering
– ensure memory updates are in program order
– needed for correctness
Forwarding
– provide data to subsequent loads
– needed for performance
– implemented with a CAM (modeled in the sketch below)
[Figure: a load to address Z matches associatively against the address (AD) column of the store queue (STQ); the matching entry forwards its data to the load.]
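In software terms, the forwarding search of a conventional store queue behaves like the loop below. This is a simplified model, not the paper's hardware: in a real design the loop is a single-cycle CAM match over all entries, which is exactly the area and power cost the talk wants to eliminate.

```python
# Simplified software model of conventional store-queue forwarding:
# the youngest older store to the same address supplies the load's data.

class StoreQueue:
    def __init__(self):
        self.entries = []              # (addr, data), oldest first = program order

    def insert(self, addr, data):
        self.entries.append((addr, data))

    def forward(self, load_addr):
        """Return data from the youngest older store to load_addr, if any."""
        for addr, data in reversed(self.entries):   # search youngest first
            if addr == load_addr:
                return data
        return None                    # no match: the load reads the cache

stq = StoreQueue()
stq.insert(0x100, 7)
stq.insert(0x200, 9)
stq.insert(0x100, 11)
assert stq.forward(0x100) == 11        # youngest matching store wins
```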

Slide 14: Conventional store queue
A single structure provides both ordering and forwarding
Large sizes increase CAM area and leakage
– the CAM's contribution to area and power dominates
Efficiency → eliminate CAMs

Slide 15: Decoupling ordering from forwarding
[Figure: the CAM-based L2 STQ, searched by address (AD), is replaced by the Store Redo Log (SRL), a plain SRAM FIFO that records stores in program order, while forwarding is served by the data cache. No CAMs are needed for either ordering or forwarding; slides 16 and 17 give the mechanism.]

Slide 16: Store Redo Log workings (1): in the shadow of a miss
Allocate a FIFO L2 store queue (SRL) entry for all stores
– records the program order of stores
Dependent stores
– not ready; release their L1 store queue entry and enter the SRL
Independent stores
– update the cache temporarily and enter the SRL
Loads
– independent loads forward from the cache and retire
– dependent loads go to the slice buffer
– loads never scan the L2 store queue for forwarding
A behavioral sketch of this policy follows.
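The sketch below writes this policy down as a small behavioral model. All class, field, and function names are illustrative assumptions, not the paper's design; it captures the two invariants that matter: every store claims a program-order SRL slot, and no load ever searches the L2 store queue.

```python
from dataclasses import dataclass

# Behavioral sketch of load/store handling in the shadow of a miss.

@dataclass
class SrlSlot:
    addr: int = 0
    data: int = 0
    ready: bool = False     # dependent stores fill this in after the miss returns

class Srl:
    def __init__(self):
        self.slots = []     # FIFO: position = program order

    def allocate(self):
        slot = SrlSlot()
        self.slots.append(slot)
        return slot

def handle_store(addr, data, depends_on_miss, srl, cache, slice_buffer):
    slot = srl.allocate()                    # every store records its order slot
    if depends_on_miss:
        slice_buffer.append(("ST", slot))    # releases its L1 STQ entry; waits
    else:
        slot.addr, slot.data, slot.ready = addr, data, True
        cache[addr] = data                   # temporary update, discarded later

def handle_load(addr, depends_on_miss, cache, slice_buffer):
    if depends_on_miss:
        slice_buffer.append(("LD", addr))    # re-executes after the miss returns
        return None
    return cache.get(addr)                   # forwards from the cache; the SRL
                                             # is never scanned for forwarding
```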

Slide 17: Store Redo Log workings (2): when the miss returns
Discard all independent store updates to the cache
– these stores do not re-execute
– their dependents do not re-execute
Drain the SRL in program order
– reconstructs the memory live-outs
– program order is maintained
– no re-execution, only re-update, so no extra cache ports are required
The drain itself is sketched below.
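A minimal sketch of the drain, assuming the temporary cache updates have already been discarded and each store is replayed as a plain re-update. It also shows why the WAW case on slide 19 is handled for free: the last logged write to an address is applied last. The concrete values (12, 17, 5) are taken from the slide-19 example.

```python
# Drain sketch: reapply the logged stores front-to-back in program order.
# Re-update only; no store re-executes (one store per cycle through the
# normal cache write port is an assumption of this model).

def drain_srl(srl_fifo, cache):
    """srl_fifo: list of (addr, data) tuples in program order."""
    for addr, data in srl_fifo:
        cache[addr] = data

# WAW example (cf. slide 19): program order is ST X=12, ST Y=17, ST X=5.
cache = {}
drain_srl([("X", 12), ("Y", 17), ("X", 5)], cache)
assert cache == {"X": 5, "Y": 17}   # the younger store to X correctly wins
```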

Slide 18: Hazards
Write after Write (WAW)
Write after Read (WAR)
Read after Write (RAW)

Slide 19: Handling hazards: WAW
[Figure: program order is ST X, ST Y, ST X (values 12, 17, and 5 in the example). All three stores are logged in the 512-entry SRL; when the miss returns, the FIFO drain re-updates the cache in program order, so the younger store to X is applied last and the WAW hazard is resolved, as in the drain sketch above.]

Slide 20: Handling hazards: WAR
[Figure: program order is LD X, ST X, ST Y. The miss-dependent LD X waits in the slice buffer while the younger stores enter the SRL, the independent ones temporarily updating the cache; the example walks through how the earlier load is still supplied the pre-store value of X when the miss returns.]

Slide 21: Handling hazards: RAW
Detect by snooping completed stores against the load buffer (sketched below)
Restart execution in case of a violation
– restore to a checkpoint
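Tying this back to the unordered L2 load buffer of slide 4, the snoop-and-recover flow reduces to a few lines. This is a sketch: `store_conflicts` is the method from the slide-4 sketch, and `rollback_to_checkpoint` is an assumed recovery hook.

```python
# RAW detection sketch: every completed store snoops the unordered L2
# load buffer via an indexed lookup (no CAM scan).  A hit only says that
# *some* already-executed load may have read stale data, so recovery is
# a coarse rollback to a checkpoint rather than selective replay.

def complete_store(addr, load_buffer, rollback_to_checkpoint):
    if load_buffer.store_conflicts(addr):   # indexed lookup, per slide 4
        rollback_to_checkpoint()
```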

Slide 22: Outline
Motivation
Latency-tolerant processor background
Store processing
Results
Summary

Slide 23: Evaluation
Ideal store queue
– large L1 STQ (latency = 3 cycles)
– gives an upper bound (impractical to build)
Hierarchical store queue
– L1 STQ (latency = 3 cycles)
– L2 STQ with CAMs (latency = 8 cycles)
SRL store processing
– L1 STQ (latency = 3 cycles)
– FIFO, CAM-free Store Redo Log
Baseline
– L1 STQ only (latency = 3 cycles)

Slide 24: SRL performance
Performance is within 6% of the ideal store queue

Slide 25: Power and area comparison
Hierarchical store queue (reference design)
– 90nm CMOS technology, SPICE simulations
– circuit optimized to reduce leakage power
– banked structure to reduce dynamic power
SRL versus hierarchical STQ
– more than 50% reduction in leakage power
– more than 90% reduction in dynamic power
– 75% reduction in area

Slide 26: Summary
CAM-free secondary structures
– set-associative L2 load buffer
– FIFO L2 store queue (SRL)
  – does not constantly enforce order
  – ensures correct order by redoing the stores
75% area and 50% leakage power savings
No CAM → scalable design