APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs
Presented by: Isaac Martin
GPU Overview
- Streaming Multiprocessors (SMs): dozens of cores each (128*); a GPU contains multiple SMs
- Single Instruction, Multiple Thread (SIMT): many threads run the same code (1024-2048 per SM*), launched as kernels (see the kernel sketch after this slide)
- Threads are grouped into warps
- Limited cache space per SM (16-48 KB*) results in lots of cache misses and long-latency trips to GPU device memory
- How can we improve this?
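To ground the SIMT terminology above, here is a minimal, hypothetical CUDA kernel (not taken from the paper or these slides): every thread executes the same instruction stream, threads are launched in blocks that the hardware splits into warps of 32, and blocks are distributed across the SMs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one element; threads are scheduled in warps of 32 on an
// SM, all executing this same instruction stream (SIMT).
__global__ void scaleKernel(float *data, float factor, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < n) {
        data[idx] *= factor;
    }
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // 256 threads per block = 8 warps per block; blocks are distributed across SMs.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scaleKernel<<<blocks, threads>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```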
Two Common Types of Loads
- Small memory range: strong locality, same or very close addresses
  - Ex: a single variable shared across all warps
- Large memory range with striding: each address is accessed only once, and addresses are evenly spaced
  - Common in image processing, where the thread index is used to access data
  - Ex: reading pixel values from an image in parallel
  - (Both load types are illustrated in the sketch after this slide)

SIMT Design
- In good SIMT code, all threads in a warp execute the same instruction (performance suffers if they diverge)
- All threads in a warp should have the same PC for a kernel
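To make the two load types concrete, this hypothetical thresholding kernel (not from the paper) issues both: the read of the shared threshold touches a single address for every warp, while the per-pixel read strides through the image using the thread index.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The read of *threshold* is a small-memory-range load: every warp hits the
// same address at the same PC. The read of pixels[idx] is a large-memory-range
// strided load: each thread touches a different, evenly spaced address exactly
// once, derived from its thread index.
__global__ void thresholdKernel(const unsigned char *pixels, unsigned char *out,
                                const unsigned char *threshold, int numPixels) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < numPixels) {
        unsigned char t = *threshold;   // small range: shared scalar
        unsigned char p = pixels[idx];  // large range: strided by thread index
        out[idx] = (p > t) ? 255 : 0;
    }
}

int main() {
    const int n = 1 << 20;  // 1 MB image, contents left uninitialized for the sketch
    unsigned char *d_pixels, *d_out, *d_threshold;
    cudaMalloc(&d_pixels, n);
    cudaMalloc(&d_out, n);
    cudaMalloc(&d_threshold, 1);

    unsigned char t = 128;
    cudaMemcpy(d_threshold, &t, 1, cudaMemcpyHostToDevice);

    thresholdKernel<<<(n + 255) / 256, 256>>>(d_pixels, d_out, d_threshold, n);
    cudaDeviceSynchronize();

    cudaFree(d_pixels);
    cudaFree(d_out);
    cudaFree(d_threshold);
    return 0;
}
```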
Cache Misses
- Cold misses: the cache block is empty; unavoidable
- Conflict misses: under the associativity scheme, the cache slot is already occupied by other data
- Capacity misses: the cache is out of space
- How do we avoid evicting important data?

Compute vs. Memory Intensive
- Compute-intensive kernels see mostly cold misses
- Memory-intensive kernels see lots of capacity and conflict misses
- (A hypothetical pair of kernels contrasting the two follows this slide)
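As a rough illustration of the compute- vs. memory-intensive distinction, the hypothetical pair of kernels below (names and constants are illustrative, not from the paper) contrasts a kernel that touches each element once, so it sees only cold misses, with one whose working set overflows the small per-SM L1 and so suffers capacity and conflict misses.

```cuda
#include <cuda_runtime.h>

// Compute-intensive: each thread reads its element once, so the only misses
// are cold misses; arithmetic in registers dominates the runtime.
__global__ void computeIntensive(float *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float x = data[idx];              // one cold miss per cache line
        for (int i = 0; i < 1000; ++i)
            x = x * 1.000001f + 0.5f;     // compute dominates
        data[idx] = x;
    }
}

// Memory-intensive: threads repeatedly read a lookup table far larger than the
// 16-48 KB L1, so lines are evicted before reuse, producing capacity and
// conflict misses.
__global__ void memoryIntensive(const float *table, float *out, int n, int tableSize) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float acc = 0.0f;
        for (int i = 0; i < 64; ++i) {
            int j = (idx * 97 + i * 1021) % tableSize;  // scattered accesses
            acc += table[j];
        }
        out[idx] = acc;
    }
}

int main() {
    const int n = 1 << 20, tableSize = 1 << 22;  // 16 MB table >> L1 capacity
    float *d_data, *d_table, *d_out;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_table, tableSize * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    computeIntensive<<<(n + 255) / 256, 256>>>(d_data, n);
    memoryIntensive<<<(n + 255) / 256, 256>>>(d_table, d_out, n, tableSize);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    cudaFree(d_table);
    cudaFree(d_out);
    return 0;
}
```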
Adaptive PREfetching and Scheduling (APRES)
- An architectural solution to improve hit rate and reduce the latency caused by the two common load types
- Groups sets of warps based on load type
- Short memory range:
  - If warps load the same address at the same PC and the data is in cache, no memory latency is expected
  - Prioritize these warps; they will complete sooner
- Long memory range with striding:
  - Loads for this data usually miss the first time
  - If the PC is the same, the address the next warp will use can be guessed from the stride
  - Compare addresses across warps to calculate the predicted address for warps at that PC
  - Prefetch the predicted address into the cache (a sketch of the prediction follows this slide)
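The stride-based prediction can be sketched as follows. This is a simplified host-side model of the idea, not the paper's actual hardware table design, and the struct and function names are invented for illustration.

```cuda
#include <cstdint>
#include <cinttypes>
#include <cstdio>

// Per-PC tracking entry (illustrative only): if consecutive warps executing
// the same load PC access evenly spaced addresses, the next warp's address can
// be predicted as lastAddr + stride and prefetched before that warp runs.
struct StrideEntry {
    uint64_t pc;        // PC of the load instruction being tracked
    uint64_t lastAddr;  // address loaded by the most recent warp at this PC
    int64_t  stride;    // observed spacing between consecutive warps
};

uint64_t predictNextWarpAddress(StrideEntry *e, uint64_t pc, uint64_t addr) {
    if (e->pc == pc) {
        e->stride = (int64_t)(addr - e->lastAddr);  // update observed stride
    } else {
        e->pc = pc;                                  // new PC: start tracking
        e->stride = 0;                               // no prediction yet
    }
    e->lastAddr = addr;
    return addr + (uint64_t)e->stride;               // candidate address to prefetch
}

int main() {
    StrideEntry e = {0, 0, 0};
    // Warp 0 loads 0x10000 and warp 1 loads 0x10080 at the same PC (128-byte
    // stride), so warp 2 is predicted to load 0x10100.
    predictNextWarpAddress(&e, 0x400, 0x10000);
    uint64_t predicted = predictNextWarpAddress(&e, 0x400, 0x10080);
    printf("predicted address for next warp: 0x%" PRIx64 "\n", predicted);
    return 0;
}
```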
Hardware Solution - LAWS & SAP
Locality Aware Warp Scheduler (LAWS)
Scheduling Aware Prefetching (SAP)
APRES Impact on Baseline GPU Performance
- 31.7% improvement over the baseline GPU
- 7.2% improvement over state-of-the-art prefetching & scheduling schemes

Hardware Overhead
- Additional storage is only 2.06% of a standard L1 cache
- Additional functional units (4 integer adders, 1 integer multiplier, 1 integer divider) are negligible compared to the Fused Multiply-Add (FMA) functional units in CUDA cores (NVIDIA GPUs)

Questions?