Cache-Conscious Wavefront Scheduling

Presentation transcript:

Cache-Conscious Wavefront Scheduling
Timothy G. Rogers (The University of British Columbia), Mike O'Connor (AMD Research), Tor M. Aamodt (The University of British Columbia)

Wavefronts and Caches: High-Level Overview of a GPU
- 10's of thousands of concurrent threads
- High-bandwidth memory system
- Includes data caches: an L1D per compute unit and a shared L2
[Slide diagram: compute units, each with a wavefront scheduler (wavefronts W1, W2, ...), ALUs, and an L1D, connected through a memory unit to the L2 cache and DRAM]
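As a minimal structural model of the organization on this slide (the type names are illustrative, not from GPGPU-Sim or any vendor API):

```cpp
#include <vector>

// Illustrative model of the GPU on this slide: many compute units,
// each with its own wavefront scheduler, ALUs, and L1 data cache,
// all sharing an L2 cache backed by high-bandwidth DRAM.
struct Wavefront {
    int id;                       // W1, W2, ...
    std::vector<int> thread_ids;  // threads that issue in lockstep
};

struct ComputeUnit {
    std::vector<Wavefront> wavefronts;  // scheduler picks one to issue each cycle
    // Per-compute-unit L1D sits between the ALUs and the memory unit.
};

struct GPU {
    std::vector<ComputeUnit> compute_units;  // 10's of thousands of threads total
    // Shared L2 cache and DRAM behind the compute units.
};
```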

Motivation
Improve performance of highly parallel applications with irregular or data-dependent access patterns on the GPU:
- Breadth-First Search (BFS)
- K-Means (KMN)
- Memcached-GPU (MEMC)
- Parallel Garbage Collection (GC)
These workloads can be highly cache-sensitive: increasing the 32 kB L1D to 8 MB gives a minimum 3x speedup and a mean speedup of more than 5x.

Where does the locality come from? Classify two types of locality:
- Intra-wavefront locality: a wavefront hits on a cache line it loaded itself earlier.
- Inter-wavefront locality: a wavefront hits on a cache line loaded by a different wavefront.
[Slide diagram: Wave0 loads cache line X twice and hits on its own data (intra-wavefront); Wave0 loads line X and Wave1 then hits on it (inter-wavefront)]
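One way to make the distinction concrete: remember which wavefront last touched each line and compare IDs on a hit. A minimal sketch (the structure and names are illustrative, not the paper's implementation):

```cpp
#include <cstdint>
#include <unordered_map>

enum class HitType { Miss, IntraWavefront, InterWavefront };

// Maps a cache-line address to the ID of the wavefront that last touched
// it; a hit is intra-wavefront if the same wavefront comes back.
class LocalityClassifier {
    std::unordered_map<uint64_t, int> last_toucher_;  // line addr -> wavefront ID
public:
    HitType access(uint64_t line_addr, int wavefront_id, bool cache_hit) {
        HitType type = HitType::Miss;
        if (cache_hit) {
            auto it = last_toucher_.find(line_addr);
            type = (it != last_toucher_.end() && it->second == wavefront_id)
                       ? HitType::IntraWavefront
                       : HitType::InterWavefront;
        }
        last_toucher_[line_addr] = wavefront_id;
        return type;
    }
};
```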

Quantifying intra-/inter-wavefront locality
[Chart: hits and misses per kilo-instruction (PKI), averaged over the highly cache-sensitive workloads, broken into misses PKI, inter-wavefront hits PKI, and intra-wavefront hits PKI; y-axis 0-120. Intra-wavefront hits account for most of the hits.]

Greedy-Then-Oldest Scheduler
Observation: the issue-level scheduler chooses the access stream.
- A round-robin scheduler interleaves wavefronts, so Wave0's loads (A, B, C, D, ...) and Wave1's loads (Z, Y, X, W, ...) mix in the memory system.
- A greedy-then-oldest (GTO) scheduler keeps issuing the same wavefront, so each wavefront's loads reach the memory system together.
[Slide diagram: the wavefront scheduler feeding the memory system under round-robin vs. greedy-then-oldest ordering]
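A greedy-then-oldest pick can be sketched as: keep issuing the wavefront that issued last while it stays ready, otherwise fall back to the oldest ready wavefront. A hedged sketch (field names are assumptions):

```cpp
#include <climits>
#include <vector>

struct Wave {
    int id;
    long age;    // lower = older (e.g., dispatch order)
    bool ready;  // has an issuable instruction this cycle
};

// Greedy-then-oldest (GTO): prefer the wavefront that issued last; if it
// is stalled, pick the oldest ready wavefront. This keeps one wavefront's
// accesses (A, B, C, D, ...) together in the memory stream.
int pick_gto(const std::vector<Wave>& waves, int last_issued_id) {
    int pick = -1;
    long best_age = LONG_MAX;
    for (const Wave& w : waves) {
        if (!w.ready) continue;
        if (w.id == last_issued_id) return w.id;  // greedy part
        if (w.age < best_age) { best_age = w.age; pick = w.id; }  // oldest part
    }
    return pick;  // -1 if nothing is ready
}
```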

Need a better replacement policy? Consider a difficult access stream:
- Under round-robin scheduling, W0, W1, and W2 interleave their loads (A,B,C,D / E,F,G,H / I,J,K,L, then the same again), and even optimal replacement manages only 4 hits.
- If the scheduler instead lets each wavefront issue its repeated loads back to back (A,B,C,D A,B,C,D; E,F,G,H E,F,G,H; I,J,K,L I,J,K,L), plain LRU replacement gets 12 hits.
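The hit counts on this slide can be reproduced with a small simulation, assuming a fully associative 4-entry cache (the entry count is an assumption consistent with the slide's diagram): Belady-optimal replacement on the round-robin interleaved stream still gets only 4 hits, while plain LRU on the scheduler-grouped stream gets 12.

```cpp
#include <algorithm>
#include <cstdio>
#include <list>
#include <vector>

// Count hits under LRU replacement in a fully associative cache.
int lru_hits(const std::vector<char>& stream, size_t ways) {
    std::list<char> cache;  // front = most recently used
    int hits = 0;
    for (char a : stream) {
        auto it = std::find(cache.begin(), cache.end(), a);
        if (it != cache.end()) { hits++; cache.erase(it); }
        else if (cache.size() == ways) cache.pop_back();  // evict LRU
        cache.push_front(a);
    }
    return hits;
}

// Count hits under Belady-optimal replacement: on a miss, evict the
// resident line whose next use lies farthest in the future.
int belady_hits(const std::vector<char>& stream, size_t ways) {
    std::vector<char> cache;
    int hits = 0;
    for (size_t i = 0; i < stream.size(); i++) {
        char a = stream[i];
        if (std::find(cache.begin(), cache.end(), a) != cache.end()) { hits++; continue; }
        if (cache.size() == ways) {
            size_t victim = 0, farthest = 0;
            for (size_t c = 0; c < cache.size(); c++) {
                size_t next = stream.size();  // "never used again"
                for (size_t j = i + 1; j < stream.size(); j++)
                    if (stream[j] == cache[c]) { next = j; break; }
                if (next >= farthest) { farthest = next; victim = c; }
            }
            cache.erase(cache.begin() + victim);
        }
        cache.push_back(a);
    }
    return hits;
}

int main() {
    // Round-robin interleaving: W0/W1/W2 each load A-D, E-H, I-L, twice over.
    std::vector<char> rr = {'A','B','C','D','E','F','G','H','I','J','K','L',
                            'A','B','C','D','E','F','G','H','I','J','K','L'};
    // Scheduler-grouped stream: each wavefront's two rounds back to back.
    std::vector<char> grouped = {'A','B','C','D','A','B','C','D',
                                 'E','F','G','H','E','F','G','H',
                                 'I','J','K','L','I','J','K','L'};
    std::printf("Belady on RR stream:   %d hits\n", belady_hits(rr, 4));   // 4
    std::printf("LRU on grouped stream: %d hits\n", lru_hits(grouped, 4)); // 12
    return 0;
}
```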

Why miss rate is more sensitive to scheduling than replacement
1024 threads generate thousands of in-flight memory accesses.
- A replacement policy's decision is limited to one of A possible ways, where A is the cache's associativity (e.g., 8 choices in an 8-way cache).
- The wavefront scheduler's decision picks from thousands of potential accesses.

Does this ever happen? Consider two simple schedulers.
[Chart: MPKI averaged over the highly cache-sensitive workloads, y-axis 10-90, comparing loose round-robin with LRU, loose round-robin with Belady-optimal replacement, and greedy-then-oldest with LRU. GTO with plain LRU produces fewer misses than round-robin even with optimal replacement.]

Key Idea
Use the wavefront scheduler to shape the access pattern.
[Slide diagram: under the greedy-then-oldest scheduler, Wave0's loads (A,B,C,D) and Wave1's loads (Z,Y,X,W) still end up interleaved in the memory system; the cache-conscious wavefront scheduler keeps each wavefront's repeated loads together so they can hit in the cache.]

CCWS Components (more details in the paper)
- Locality scoring system: balances cache miss rate against overall throughput; each wavefront (W0, W1, W2, ...) carries a score that evolves over time.
- Lost-locality detector: detects when wavefronts have lost intra-wavefront locality, using L1 victim tags organized by wavefront ID.
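A sketch of the detector's victim tag array as described on this slide: a small tag store partitioned by wavefront ID, filled on L1 evictions and probed on misses (the sizes, FIFO policy, and names here are illustrative assumptions):

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <vector>

// Victim tags organized by wavefront ID: when wavefront `wid` misses on a
// line whose tag sits in its own victim entries, it recently evicted data
// it is now re-fetching, i.e. it has lost intra-wavefront locality.
class LostLocalityDetector {
    std::vector<std::deque<uint64_t>> victim_tags_;  // [wavefront ID] -> tags
    size_t entries_per_wave_;
public:
    LostLocalityDetector(size_t num_waves, size_t entries_per_wave)
        : victim_tags_(num_waves), entries_per_wave_(entries_per_wave) {}

    // Record the tag of a line evicted by wavefront `wid` from the L1.
    void on_evict(int wid, uint64_t tag) {
        auto& q = victim_tags_[wid];
        if (q.size() == entries_per_wave_) q.pop_front();  // oldest tag out
        q.push_back(tag);
    }

    // Probe on a miss by `wid`: true means lost intra-wavefront locality.
    bool probe(int wid, uint64_t tag) const {
        const auto& q = victim_tags_[wid];
        return std::find(q.begin(), q.end(), tag) != q.end();
    }
};
```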

CCWS Implementation (more details in the paper)
Example: W0 issues a load of X that misses in the cache. The miss probes the victim tags with (W0, X); a matching tag under W0's entries means W0 recently evicted X, so W0 has detected lost locality. W0's score in the locality scoring system rises, which deprioritizes the other wavefronts: while W0's score is high, the wave scheduler issues no W2 loads ("W2: ld Y" waits), giving W0's working set time to stay resident in the cache.
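Putting the pieces together, a hedged sketch of the locality scoring system: each wavefront's score jumps when the detector flags lost locality and decays over time, and a wavefront may issue loads only while it fits under a cutoff when wavefronts are stacked in descending score order, which throttles the rest (the constants, decay rule, and gating model are assumptions, not the paper's exact mechanism):

```cpp
#include <algorithm>
#include <vector>

// Locality scoring system (sketch): a high lost-locality score crowds
// other wavefronts out, trading issue-level parallelism for locality.
class LocalityScoring {
    std::vector<int> score_;  // per-wavefront lost-locality score
    int cutoff_;              // total score "height" allowed to issue loads
public:
    LocalityScoring(int num_waves, int base, int cutoff)
        : score_(num_waves, base), cutoff_(cutoff) {}

    void on_lost_locality(int wid, int bump) { score_[wid] += bump; }

    void decay() {  // scores fall back as locality is regained
        for (int& s : score_) s = std::max(1, s - 1);
    }

    // Stack wavefronts in descending score order; `wid` may issue loads
    // only if the cumulative score up to and including it fits the cutoff.
    bool can_issue_load(int wid) const {
        std::vector<int> order(score_.size());
        for (int i = 0; i < (int)order.size(); i++) order[i] = i;
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return score_[a] > score_[b]; });
        int sum = 0;
        for (int w : order) {
            sum += score_[w];
            if (w == wid) return sum <= cutoff_;
        }
        return false;
    }
};
```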

Methodology
GPGPU-Sim (version 3.1.0):
- 30 compute units (1.3 GHz)
- 32 wavefront contexts (1024 threads total)
- 32 kB L1D cache per compute unit: 8-way, 128 B lines, LRU replacement
- 1 MB unified L2 cache
Stand-alone GPGPU-Sim cache simulator:
- Trace-based cache simulator, fed with GPGPU-Sim traces
- Used for oracle (Belady-optimal) replacement

Performance Results
[Chart: speedup, harmonic mean over the highly cache-sensitive workloads, y-axis 0.5-2, for LRR, GTO, and CCWS.]
Also compared against:
- A 2-LVL (two-level) scheduler: similar to GTO performance.
- A profile-based oracle scheduler: application- and input-data-dependent. CCWS captures 86% of the oracle scheduler's performance.
On a variety of cache-insensitive benchmarks, CCWS causes no performance degradation.

Cache Miss Rate (full sensitivity study in the paper)
[Chart: MPKI averaged over the highly cache-sensitive workloads, y-axis 10-90.] CCWS has fewer cache misses than the other schedulers, even when those schedulers use optimal replacement.

Related Work
Wavefront scheduling:
- Georgia Tech - GPGPU Workshop 2010
- UBC - HPCA 2011
- UT Austin - MICRO 2011
- UT Austin/NVIDIA/UIUC/Virginia - ISCA 2011
OS-level scheduling:
- SFU - ASPLOS 2010
- Intel/MIT - ASPLOS 2012

Conclusion
- A different approach to fine-grained cache management, good for both power and performance.
- The high-level insight is not tied to the specifics of a GPU: any system with many threads sharing a cache can potentially benefit.
Questions?