Stash: Have Your Scratchpad and Cache it Too


Stash: Have Your Scratchpad and Cache it Too Matthew D. Sinclair et al., UIUC Presented by Sharmila Shridhar

SoCs Need an Efficient Memory Hierarchy
An energy-efficient memory hierarchy is essential. Heterogeneous SoCs use specialized memories, e.g., scratchpads, FIFOs, stream buffers, ...
Scratchpad vs. cache:
- Directly addressed (no tags/TLB accesses/conflicts): scratchpad yes, cache no
- Compact storage (no holes in cache lines): scratchpad yes, cache no

SoCs Need an Efficient Memory Hierarchy
An energy-efficient memory hierarchy is essential. Heterogeneous SoCs use specialized memories, e.g., scratchpads, FIFOs, stream buffers, ...
Can specialized memories be globally addressable and coherent? Can we have our scratchpad and cache it too?
Scratchpad vs. cache:
- Directly addressed (no tags/TLB accesses/conflicts): scratchpad yes, cache no
- Compact storage (no holes in cache lines): scratchpad yes, cache no
- Global address space (implicit data movement): cache yes, scratchpad no
- Coherent (reuse, lazy writebacks): cache yes, scratchpad no

Can We Have Our Scratchpad and Cache it Too?
The stash is directly addressable, has compact storage, lives in the global address space, and is coherent.
Approach: make specialized memories globally addressable and coherent via an efficient address mapping and an efficient coherence protocol.
Focus: CPU-GPU systems with scratchpads and caches. Up to 31% less execution time and 51% less energy.

Outline Motivation Background: Scratchpads & Caches Stash Overview Implementation Results Conclusion

Global Addressability
Scratchpads: part of the private address space, not globally addressable. Explicit data movement, pollution, poor support for conditional accesses.
Caches: globally addressable, part of the global address space. Implicit copies, no pollution, support for conditional accesses.
[Diagram: a GPU core with registers, scratchpad, and L1 cache and a CPU core with registers and L1 cache, both connected to banked L2 through an interconnection network]

Coherence: Globally Visible Data
Scratchpads: part of the private address space, not globally visible. Eager writebacks and invalidations on synchronization.
Caches: globally visible, data kept coherent. Lazy writebacks as space is needed; data is reused across synchronization points.

Stash – A Scratchpad/Cache Hybrid
The stash combines all four properties:
- Directly addressed: no tags/TLB accesses/conflicts
- Compact storage: no holes in cache lines
- Global address space: implicit data movement
- Coherent: reuse, lazy writebacks

Outline Motivation Background: Scratchpads & Caches Stash Overview Implementation Results Conclusion

Stash: Directly & Globally Addressable

Scratchpad (explicit copy into local storage before use):

    // A is a global memory address; scratch_base == 500
    for (i = 500; i < 600; i++) {
        reg ri = load[A + i - 500];  // explicit global load
        scratch[i] = ri;             // explicit scratchpad store
    }
    reg r = scratch_load[505];

Stash (no explicit copy; the compiler records a mapping):

    // A is a global memory address
    // Compiler info: stash_base[500] -> A (map entry M0)
    // Rk = M0 (index into the stash map)
    reg r = stash_load[505, Rk];     // on a miss, hardware generates load[A+5]

[Diagram: an accelerator with scratchpad entries 500-505 filled by explicit copies vs. an accelerator with a stash whose map entry M0 records 500 -> A, so a miss on stash word 505 generates load[A+5]]

Like a scratchpad: directly addressable (for hits). Like a cache: globally addressable (for misses). Implicit loads, no cache pollution.

Stash: Globally Visible
Stash data can be accessed by other units, so it needs coherence support. Like a cache, the stash keeps data around (lazy writebacks), enabling intra- and inter-kernel data reuse on the same core.
[Diagram: the GPU core's scratchpad is replaced by a stash with its map; the GPU and CPU caches and the stash connect to banked L2 through an interconnection network]

Stash: Compact Storage
[Diagram: a strided field of a global array gathered into contiguous stash storage]
Caches store data at cache-line granularity, so "holes" in the lines waste capacity; caches do not compact data. Like a scratchpad, the stash compacts data (see the sketch below).
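To make "compacting" concrete, here is a software analogy (hypothetical type and function names, not the paper's code). A cache touching only the r field of each Pixel still fills whole 64-byte lines with all four fields; the stash instead stores just the mapped field, back to back:

    #include <cstddef>

    struct Pixel { float r, g, b, a; };  // 16B object; a 64B line holds 4 pixels

    // A cache keeping only 'r' of each pixel wastes 3/4 of every line ("holes").
    // The stash stores the mapped field contiguously, like this gather:
    void gather_field(const Pixel* global, float* stash, size_t n) {
        for (size_t i = 0; i < n; i++)
            stash[i] = global[i].r;      // compact: no holes between elements
    }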

Outline Motivation Background: Scratchpads & Caches Stash Overview Implementation Results Conclusion

Stash Software Interface
Software provides a mapping for each stash allocation:

    AddMap(stashBase, globalBase, fieldSize, objectSize,
           rowSize, strideSize, numStrides, isCoherent)

A sketch of plausible parameter values follows.
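As an illustration, here is one plausible reading of these parameters (the semantics of each argument are my inference from the names, not the paper's exact definitions): mapping only the vel field of an 8x64 tile of an array-of-structures grid into the stash.

    #include <cstdio>

    struct Body { float pos[2]; float vel[2]; float force[2]; };  // 24B object

    int main() {
        const unsigned ROW = 64, STRIDES = 8, WIDTH = 256;
        static Body grid[STRIDES * WIDTH];             // global AoS array

        unsigned    stashBase  = 0;                    // tile's offset in the stash
        const void* globalBase = grid[0].vel;          // first mapped field in memory
        unsigned    fieldSize  = sizeof(grid[0].vel);  // bytes kept per object (8)
        unsigned    objectSize = sizeof(Body);         // bytes per object in memory (24)
        unsigned    rowSize    = ROW;                  // objects per stride
        unsigned    strideSize = WIDTH * sizeof(Body); // bytes between consecutive rows
        unsigned    numStrides = STRIDES;              // rows in the tile
        bool        isCoherent = true;

        // AddMap(stashBase, globalBase, fieldSize, objectSize,
        //        rowSize, strideSize, numStrides, isCoherent);
        printf("field=%uB object=%uB row=%u stride=%uB strides=%u\n",
               fieldSize, objectSize, rowSize, strideSize, numStrides);
        (void)globalBase; (void)stashBase; (void)isCoherent;
    }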

Stash Hardware
[Diagram: a stash instruction carries a map index; a per-thread-block map index table selects an entry in the shared stash map; the map entry, the VP-map, and the TLB/RTLB produce a PA/VA for the stash data array and its state bits]
Map index per thread block; shared stash map; six arithmetic operations to get the VA (modeled below).
Each stash-map entry holds: valid bit, stash base, VA base, field size, object size, row size, stride size, #strides, isCoherent, and a #DirtyData counter.
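The translation can be modeled in software. The exact arithmetic below is my reconstruction from the map fields, not the paper's hardware, but it shows why a div/mod pair plus a few multiplies and adds suffice:

    #include <cstdint>

    struct StashMapEntry {
        uint64_t vaBase;      // globalBase from AddMap
        uint32_t fieldSize;   // bytes of each object kept in the stash
        uint32_t objectSize;  // bytes per object in global memory
        uint32_t rowSize;     // objects per stride
        uint64_t strideSize;  // bytes between consecutive strides
        uint32_t numStrides;
        bool     isCoherent;
        uint32_t dirtyData;   // #DirtyData counter
    };

    // Given a stash byte offset (relative to the entry's stash base),
    // recover the global virtual address it maps to.
    uint64_t stash_to_va(const StashMapEntry& m, uint64_t off) {
        uint64_t obj       = off / m.fieldSize;   // which object
        uint64_t withinFld = off % m.fieldSize;   // byte within the field
        uint64_t stride    = obj / m.rowSize;     // which row/stride
        uint64_t objInRow  = obj % m.rowSize;     // object within the row
        return m.vaBase + stride * m.strideSize
                        + objInRow * m.objectSize + withinFld;
    }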

Stash Instruction Example
[Diagram: stash_load[505, Rk] misses in the stash; Rk selects the entry in the shared stash map via the per-thread-block map index table; the VA is computed with the six arithmetic operations, the TLB translation hits, and the resulting PA fetches the word into the stash data array and sets its state]

Lazy Writebacks
Stash writebacks happen lazily, in chunks of 64B with a per-chunk dirty bit (sketched below).
On a store miss, for the chunk: set the dirty bit, update the stash-map index, and increment the #DirtyData counter.
On eviction: get the PA using the stash-map index, write the chunk back, and decrement the #DirtyData counter.
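A simplified model of this bookkeeping (a pared-down MapEntry stands in for the StashMapEntry shown earlier; stash_to_pa() and writeback() are declared as stubs standing in for the translation path and the memory system):

    #include <cstdint>
    #include <vector>

    struct MapEntry { uint32_t dirtyData = 0; /* plus the fields shown earlier */ };

    uint64_t stash_to_pa(const MapEntry& m, uint64_t off);   // translation + TLB
    void writeback(uint64_t pa, const uint8_t* src, int n);  // memory-system stub

    struct Stash {
        static constexpr int CHUNK = 64;     // writeback granularity (bytes)
        std::vector<uint8_t>  data;          // stash storage
        std::vector<bool>     dirty;         // one dirty bit per 64B chunk
        std::vector<uint32_t> mapIdx;        // stash-map index per chunk
        std::vector<MapEntry> map;           // the stash map

        // Store miss: mark the chunk dirty and charge it to its mapping.
        void on_store_miss(uint64_t off, uint32_t idx) {
            uint64_t c = off / CHUNK;
            if (!dirty[c]) {                 // first write to this chunk
                dirty[c]  = true;
                mapIdx[c] = idx;             // remember which map entry owns it
                map[idx].dirtyData++;        // entry stays live while counter > 0
            }
        }

        // Eviction: translate via the recorded map entry, then write back.
        void on_evict(uint64_t off) {
            uint64_t c = off / CHUNK;
            if (!dirty[c]) return;           // clean chunks are simply dropped
            MapEntry& m = map[mapIdx[c]];
            writeback(stash_to_pa(m, c * CHUNK), &data[c * CHUNK], CHUNK);
            dirty[c] = false;
            m.dirtyData--;
        }
    };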

Coherence Support for Stash
Stash data needs to be kept coherent. The authors extend a coherence protocol with three features:
- Track stash data at word granularity
- Merge partial lines when the stash sends data
- Modify the directory to record the modifier and the stash-map ID
The extension is built on the DeNovo protocol: simple, low overhead, a hybrid of CPU and GPU protocols.

DeNovo Coherence (1/3) [DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism]
Designed for deterministic code without conflicting accesses.
Line-granularity tags, word-granularity coherence.
Only three coherence states: Valid, Invalid, Registered.
Explicit self-invalidation at the end of each phase; data written in the previous phase stays in the Registered state.
The shared LLC keeps either valid data or the registered core's ID.

DeNovo Coherence (2/3)
Private L1, shared L2; single-word lines. Relies on data-race freedom at word granularity.
No transient states, no invalidation traffic, no directory storage overhead, no false sharing (word-granularity coherence).
[State diagram: Invalid -> Valid on a read; Invalid or Valid -> Registered on a write; reads and writes keep a Registered word Registered]

DeNovo Coherence (3/3): Extensions for Stash
Store the stash-map ID along with the registered core ID.
Newly written data goes to the Registered state.
At the end of the kernel, self-invalidate entries that are not registered; a scratchpad, in contrast, invalidates all entries.
Still only three stable states; a 4th state is used for writebacks. A toy model follows.
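A toy model of the three states and the transitions above, extended with the stash-map ID recorded at registration (my reading of the slides, not the full protocol):

    #include <cstdint>
    #include <vector>

    enum class State { Invalid, Valid, Registered };

    struct Word {
        State    st    = State::Invalid;
        uint32_t owner = 0;   // registered core ID (tracked at the LLC)
        uint32_t mapId = 0;   // stash-map ID stored alongside the owner
    };

    void on_read(Word& w) {
        if (w.st == State::Invalid) w.st = State::Valid;  // fetch from LLC/owner
        // Valid and Registered words hit locally.
    }

    void on_write(Word& w, uint32_t core, uint32_t map) {
        w.st    = State::Registered;  // obtain registration (exclusive ownership)
        w.owner = core;
        w.mapId = map;                // lets the directory forward requests to the stash
    }

    void end_of_kernel(std::vector<Word>& words) {
        // Self-invalidate everything this core did not register;
        // a plain scratchpad would instead invalidate all entries.
        for (Word& w : words)
            if (w.st != State::Registered) w.st = State::Invalid;
    }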

Outline Motivation Background: Scratchpads & Caches Stash Overview Implementation Results Conclusion

Evaluation
Simulation environment: GEMS + Simics + Princeton Garnet network + GPGPU-Sim; McPAT and GPUWattch extended for energy evaluations.
Configuration: 1 CPU core (15 for microbenchmarks), 15 GPU compute units (1 for microbenchmarks); 32 KB L1 caches, 16 KB stash/scratchpad.
Workloads: 4 microbenchmarks plus heterogeneous workloads from Rodinia, Parboil, and SURF.
Microbenchmarks:
- Implicit: exercises implicit loads and lazy writebacks
- Pollution: two AoS arrays; B fits in the cache only without pollution from A
- On-demand: reads and writes only 1 element out of 32
- Reuse: subsequent kernels reuse the data

Evaluation (Microbenchmarks) – Execution Time
Configurations: Scr = baseline (scratchpad); C = all requests use the cache; Scr+D = all requests use the scratchpad with DMA; St = scratchpad requests converted to stash.

Evaluation (Microbenchmarks) – Execution Time
Implicit: no explicit loads/stores.

Evaluation (Microbenchmarks) – Execution Time
Pollution: no cache pollution.

Evaluation (Microbenchmarks) – Execution Time
On-Demand: only the needed data is brought in.

Evaluation (Microbenchmarks) – Execution Time
Reuse: data compaction and reuse.

Evaluation (Microbenchmarks) – Execution Time
Average execution-time reduction: 27% vs. scratchpad, 13% vs. cache, 14% vs. DMA.

Evaluation (Microbenchmarks) – Energy
Average energy reduction: 53% vs. scratchpad, 36% vs. cache, 32% vs. DMA.

Evaluation (Apps) – Execution Time
[Chart: execution time for SURF, BP, NW, PF, SGEMM, ST, and the average under each configuration]
Configurations: Scr = requests use the type specified by the original app; C = all requests use the cache; St = scratchpad requests converted to stash.

Evaluation (Apps) – Execution Time
[Chart: execution time for LUD, SURF, BP, NW, PF, SGEMM, ST, and the average]
Average reduction: 10% vs. scratchpad, 12% vs. cache (max: 22%, 31%). The main source is implicit data movement. Performance is comparable to scratchpad+DMA.

Evaluation (Apps) – Energy
[Chart: energy for LUD, SURF, BP, NW, PF, SGEMM, ST, and the average]
Average reduction: 16% vs. scratchpad, 32% vs. cache (max: 30%, 51%).

Conclusion
Make specialized memories globally addressable and coherent: an efficient address mapping (needed only on misses) and an efficient software-driven hardware coherence protocol.
Stash = scratchpad + cache. Like scratchpads: directly addressable, compact storage. Like caches: globally addressable, globally visible.
Result: reduced execution time and energy.
Future work: more accelerators and specialized memories; consistency models.

Critique
In GPUs, data in shared memory is visible per thread block, and __syncthreads() is used to ensure the data is available. How is that behavior implemented for the stash? Otherwise multiple threads can miss on the same data; how is that handled?
Why is Scratchpad+DMA not compared in the GPU application results?