Silent Stores for Free (or, Silent Stores Darn Cheap) Kevin M. Lepak Mikko H. Lipasti University of Wisconsin—Madison

Slides:

Advertisements

Similar presentations

Characterization of Silent Stores Gordon B.Bell Kevin M. Lepak Mikko H. Lipasti University of Wisconsin—Madison

Advertisements

Lecture 19: Cache Basics Today’s topics: Out-of-order execution

Federation: Repurposing Scalar Cores for Out- of-Order Instruction Issue David Tarjan*, Michael Boyer, and Kevin Skadron* University of Virginia Department.

Miss Penalty Reduction Techniques (Sec. 5.4) Multilevel Caches: A second level cache (L2) is added between the original Level-1 cache and main memory.

LEVERAGING ACCESS LOCALITY FOR THE EFFICIENT USE OF MULTIBIT ERROR-CORRECTING CODES IN L2 CACHE By Hongbin Sun, Nanning Zheng, and Tong Zhang Joseph Schneider.

Performance of Cache Memory

Sim-alpha: A Validated, Execution-Driven Alpha Simulator Rajagopalan Desikan, Doug Burger, Stephen Keckler, Todd Austin.

Instruction-Level Parallelism (ILP)

Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)

CSCI 232© 2005 JW Ryder1 Cache Memory Systems Introduced by M.V. Wilkes (“Slave Store”) Appeared in IBM S360/85 first commercially.

ECE/CS 552: Cache Performance Instructor: Mikko H Lipasti Fall 2010 University of Wisconsin-Madison Lecture notes based on notes by Mark Hill Updated by.

Spring 2003CSE P5481 Introduction Why memory subsystem design is important CPU speeds increase 55% per year DRAM speeds increase 3% per year rate of increase.

1 Lecture 18: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.

1 Lecture 14: Cache Innovations and DRAM Today: cache access basics and innovations, DRAM (Sections )

1 Lecture 7: Out-of-Order Processors Today: out-of-order pipeline, memory disambiguation, basic branch prediction (Sections 3.4, 3.5, 3.7)

1 COMP 206: Computer Architecture and Implementation Montek Singh Mon, Oct 31, 2005 Topic: Memory Hierarchy Design (HP3 Ch. 5) (Caches, Main Memory and.

Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure Oguz Ergin*, Deniz Balkan, Kanad Ghose, Dmitry Ponomarev Department.

Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures Rajeev Balasubramonian School of Computing, University of Utah July 1.

1  Caches load multiple bytes per block to take advantage of spatial locality  If cache block size = 2 n bytes, conceptually split memory into 2 n -byte.

Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt.

The PowerPC Architecture  IBM, Motorola, and Apple Alliance  Based on the IBM POWER Architecture Facilitate parallel execution Scale well with advancing.

1 Lecture 9: Dynamic ILP Topics: out-of-order processors (Sections )

An Intelligent Cache System with Hardware Prefetching for High Performance Jung-Hoon Lee; Seh-woong Jeong; Shin-Dug Kim; Weems, C.C. IEEE Transactions.

OOO execution © Avi Mendelson, 4/ MAMAS – Computer Architecture Lecture 7 – Out Of Order (OOO) Avi Mendelson Some of the slides were taken.

Cache Organization of Pentium

Multiprocessor Cache Coherency

Revisiting Load Value Speculation:

Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.

Ch2. Instruction-Level Parallelism & Its Exploitation 2. Dynamic Scheduling ECE562/468 Advanced Computer Architecture Prof. Honggang Wang ECE Department.

Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy.

How to Build a CPU Cache COMP25212 – Lecture 2. Learning Objectives To understand: –how cache is logically structured –how cache operates CPU reads CPU.

On the Value Locality of Store Instructions Kevin M. Lepak Mikko H. Lipasti University of Wisconsin—Madison

Implicitly-Multithreaded Processors Il Park and Babak Falsafi and T. N. Vijaykumar Presented by: Ashay Rane Published in: SIGARCH Computer Architecture.

1 Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5)

Abdullah Aldahami ( ) March 23, Introduction 2. Background 3. Simulation Techniques a.Experimental Settings b.Model Description c.Methodology.

1 Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Speculation Amir Roth University of Pennsylvania.

Caches Where is a block placed in a cache? –Three possible answers  three different types AnywhereFully associativeOnly into one block Direct mappedInto.

UltraSPARC III Hari P. Ananthanarayanan Anand S. Rajan.

1 Lecture: Cache Hierarchies Topics: cache innovations (Sections B.1-B.3, 2.1)

The life of an instruction in EV6 pipeline Constantinos Kourouyiannis.

Exploiting Value Locality in Physical Register Files Saisanthosh Balakrishnan Guri Sohi University of Wisconsin-Madison 36 th Annual International Symposium.

Eager Writeback — A Technique for Improving Bandwidth Utilization

OOO Pipelines - III Smruti R. Sarangi Computer Science and Engineering, IIT Delhi.

SOFTENG 363 Computer Architecture Cache John Morris ECE/CS, The University of Auckland Iolanthe I at 13 knots on Cockburn Sound, WA.

Transactional Memory Coherence and Consistency Lance Hammond, Vicky Wong, Mike Chen, Brian D. Carlstrom, John D. Davis, Ben Hertzberg, Manohar K. Prabhu,

CS717 1 Hardware Fault Tolerance Through Simultaneous Multithreading (part 2) Jonathan Winter.

Cache Pipelining with Partial Operand Knowledge Erika Gunadi and Mikko H. Lipasti Department of Electrical and Computer Engineering University of Wisconsin—Madison.

COMPSYS 304 Computer Architecture Cache John Morris Electrical & Computer Enginering/ Computer Science, The University of Auckland Iolanthe at 13 knots.

Improving Multi-Core Performance Using Mixed-Cell Cache Architecture

Implementing Optimizations at Decode Time

CSC 4250 Computer Architectures

Cache Memory Presentation I

Smruti R. Sarangi Computer Science and Engineering, IIT Delhi

Lu Peng, Jih-Kwon Peir, Konrad Lai

PIII Data Stream Power Saving Modes Buses Memory Order Buffer

Temporally Silent Stores (Alternatively: Louder Silent Stores)

Lecture 16: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.

Lecture 18: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.

Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2

Smruti R. Sarangi Computer Science and Engineering, IIT Delhi

Lecture 8: Dynamic ILP Topics: out-of-order processors

How to improve (decrease) CPI

EE108B Review Session #6 Daxia Ge Friday February 23rd, 2007

* From AMD 1996 Publication #18522 Revision E

Avoiding Initialization Misses to the Heap

Update : about 8~16% are writes

Lecture 9: Dynamic ILP Topics: out-of-order processors

Conceptual execution on a processor which exploits ILP

Overview Problem Solution CPU vs Memory performance imbalance

Presentation transcript:

Silent Stores for Free (or, Silent Stores Darn Cheap) Kevin M. Lepak Mikko H. Lipasti University of Wisconsin—Madison

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 Introduction Recent work shows that many memory writes do not update the system state “Silent Stores” are memory writes which are writing the same value that already exists at that memory location Intuitively, we might be able to exploit this observation for performance benefit

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 Silent Stores—Is this for Real? Percentage of silent stores is non-trivial in all cases, 20%-68%

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 Motivation Silent Stores are real and non-trivial 20-60% of dynamic stores are silent Multiprocessor benefits: Reduced address and data bus traffic Uniprocessor benefits: Reduced writebacks, pressure on write buffers Write port utilization, etc.

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 Standard Store Verifies Issue a store verify (SV) for every store Fetch Decode/ Rename Dispatch EX/ Agen WB Commit D-Cache Store Verify (Load) Hit Silent? Store “Standard” Store Verifies are expensive Load, compare, (store) overhead for every store Increase cache port utilization Can block loads that may be on critical path =

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 Is There a Better Way? Predict which stores are likely to be silent and only store verify those Subject of ongoing research Find lower cost mechanisms for verifying stores Exploit OoO core  -arch features Exploit core reliability features for deep- submicron technology trends Silent Stores for Free: Reducing the cost of store verification

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 Outline OoO core enabled Free Silent Store Squashing (FSSS) Mechanisms Read port stealing Temporal & spatial locality in the load/store queue (LSQ) FSSS in ECC cache architectures Data cache protection methods ECC-L1-D$ FSSS Trading FSSS for physical bandwidth Conclusions

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 Performance--Machine Model SimpleScalar PISA w/realistic memory system 8 issue; 64 entry RUU; 64K entry Gshare 64KB each L1 I/D cache; 512KB unified L2 32 entry load/store queue Two fully-pipelined memory access ports 32B L1-L2 interface, single cycle occupancy Write-through-allocate L1, write-back L2 2 Write buffers, 32B write-combining

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 Read Port Stealing Only issue a store verify if a cache port is available (schedule ready loads/stores first) If a store reaches the head of the ROB before it can be verified, assume it is non-silent Fetch Decode/ Rename Dispatch EX/ Agen WB Commit D-Cache Read Port Available? Hit Silent? Store Similar to standard SV, but does not delay ready loads/stores and does not capture all silent stores =

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 Read Port Stealing--Opportunities Captures minimum of 84% of store verify opportunities

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 LSQs OoO cores implement LSQs to track in- memory dependences for improved performance store forwarding consistency model violations LSQs provide temporal and spatial context for a memory operation Surrounds an operation with other references local to it in dynamic program order

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 LSQ Temporal Locality We can exploit temporal locality (same address aliases) in the LSQ to verify stores WAW: Can forward store to load, why not store to store? WAR: Load allocates data from the cache, use it to squash a subsequent store RAW: In many  -arch, cache port scheduled before aliasing to an entry in the LSQ is known, use the port to store verify

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 LSQ Spatial Locality Obtaining a wide datapath to L1-D$ is possible due to on-chip caches Assume a memory reference can provide an entire cacheline of data Exploit spatial locality to issued memory references WAR: Load allocates an entire line WAW: Use read port stealing to allocate RAW: Load allocates an entire line

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 LSQ Squashing--Silent Stores Captured Over 90% of silent stores captured; Greater than 40% in most cases using locality in LSQ

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 Memory Storage Soft Errors Detecting and correcting soft errors is becoming more important Deep-submicron manufacturing Uptime & system reliability concerns Many methods exist for ECC: Rely on redundancy for detection/correction Coding: Keep extra bits that allow both detection & correction Explicit copies: Keep multiple copies with extra bits for detection, correct by loading the copy

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 Redundant Data ECC for L1-D$ Duplication of parity protected L1-D$ High overhead--100% over L1-D$ with parity 2x read bandwidth vs. write bandwidth Leads to configurations with higher load throughput Parity Check L1-D$ w/Parity Address Parity Check L1-D$ w/Parity Address Ok ! Ok Load Datapath L1-D$ w/Parity L1-D$ w/Parity Store Datapath Address Data

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 Redundant Data ECC for L1-D$ Write-through parity protected L1-D$ with inclusive (ECC code protected) L2 Write-through creates high demand on the L1-L2 interface Can use previous FSSS techniques to reduce stores (and hence write-throughs) L1-D$ w/Parity L2 w/ECC Address Parity Check Ok ! Ok Load Datapath Store Datapath L1-D$ w/Parity L2 w/ECC Address Data

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 Coding ECC for L1-D$ Protect the L1-D$ with ECC directly ECC-data words relatively large to reduce overhead (ex: 64-bit in 21264, RS64-III) Sub-ECC-word store datapath Sub-ECC-word stores consist of four operations: Read original ECC-word, Merge, ECC-gen, Write D-Cache ECC Logic (Correction) Data Reg. Address Check bits Data bits ECC Logic (Generate)

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 ECC L1 Free Silent Store Squash If sub-ECC-word stores are read-modify- writes, why not squash? Sub-ECC-word store datapath with ECC-FSSS Store verify in parallel with correction & check bit generation gives ECC-FSSS D-Cache ECC Logic (Correction) Data Reg. Address Check bits Data bits ECC Logic (Generate) = (!Silent || ECC Error)

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 Effectiveness of ECC-L1 FSSS Can detect 100% of sub-ECC-word L1-D$ hits store-byte (8b), store-half (16b), store-word (32b) in 64b-ECC-data-word  -arches Can also capture many more which might not be so obvious IBM RS64-III (Pulsar) has maximal 32b integer stores in 32b mode (common for user programs) All of these can be captured with ECC-FSSS

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 Increasing Write-Through Bandwidth via FSSS We expect squashing silent stores to reduce pressure on the L1-L2 interface Can we implement a narrower/slower L1-L2 physical interface and exploit FSSS for greater effective interface bandwidth? Potentially reduce power consumption Ease circuit & physical design

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 Increasing Write-Through BW-- Write-Through Reduction 15% average write-through traffic reduction

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 Increasing Write-Through BW--IPC 75% lower physical BW+FSSS yields 9% IPC improvement over fast physical interface without FSSS

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 Conclusions Standard store verifies are expensive Three methods of squashing silent stores for reduced cost Using read port stealing Exploiting temporal and spatial locality in the LSQ Using ECC logic in the L1 data cache These methods verify a large fraction of silent stores for non-trivial speedups Trade implementation of silent store squashing for higher physical BW between L1 & L2

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 Current and Future Work Silent stores in MPs, as well as program structure and message passing store value locality [Lepak & Lipasti, ISCA-2k] Characterizing & Critical silent stores [Bell et. al PACT-2k] Silence confidence mechanism(s) Exploiting predictable stores in MP systems Applying all types of store value locality in different system paradigms...

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 Backup Slides

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 Read Port Stealing--IPC HM improvement of 10%, 0-56% range across benchmarks

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 LSQ Squashing--IPC HM improvement of 11%, 0-56% range across benchmarks

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 LSQ Temporal--Silent Stores Captured Captures an average of 30% of silent stores across benchmarks

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 FSSS Method Comparison Read Port Stealing and LSQ squashing provide similar performance results However, LSQ squashing reduces the percent of store verifies issued to the memory system by 50% Temporal LSQ squashing is not effective in isolation for this machine May be useful to reduce sharing ECC squashing is truly free

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 LSQ Cache Design Assume FIFO LSQ cache operated in lock-step with LSQ Avoids explicit tags, replacement policy considerations MPs: Flush on memory barriers (WC) MPs: Use existing LSQ logic for SC to invalidate (e.g. R10K)

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 Terminology Program Structure Store Value Locality (PSSVL): The value locality exhibited by a given static store (can write to many addresses) Message Passing Store Value Locality (MPSVL): The value locality exhibited for a specific memory location (can be written by many PCs) Stochastically Silent Store: A store value which is trivially predictable by any well known method

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 MPSVL and PSSVL Percentage of stochastically silent (PSSVL, MPSVL) stores is non-trivial 27%-72% for PSSVL, 39%-70% for MPSVL

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 Multiprocessor Sharing Measurable reduction in true/false sharing for simple update silent squashing (UFS) Substantial reductions by squashing update silent store hits and misses (UFS-P) and stochastically silent stores (SFS) Squashing store misses (UFS-P) can be substantially better than simple UFS Motivates silence confidence mechanism for store misses

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 Multiprocessor Traffic Measurable reduction in invalidate traffic for simple update silent store squashing (UFS)— more effective than Exclusive state Substantial reduction for UFS-P and Stochastic False Sharing (SFS) Writeback data traffic reduction by squashing update silent store hits and misses (UFS-P) 5%-82% in oltp 16%-17% in ocean 5%-16% in barnes

December 11, 2000Kevin Lepak and Mikko Lipasti--PHARM Team, University of WisconsinMICRO-33 Multiprocessor Sharing