Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance

Presentation transcript:

Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance Rachata Ausavarungnirun, Saugata Ghose, Onur Kayiran, Gabriel H. Loh, Chita Das, Mahmut Kandemir, Onur Mutlu

Overview of This Talk
Problem: memory divergence
- Threads execute in lockstep, but not all threads hit in the cache
- A single long-latency thread can stall an entire warp
Key Observations:
- Memory divergence characteristics differ across warps: some warps mostly hit in the cache, some mostly miss
- Divergence characteristics are stable over time
- L2 queuing exacerbates the memory divergence problem
Our Solution: Memory Divergence Correction (MeDiC)
- Uses cache bypassing, cache insertion, and memory scheduling to prioritize mostly-hit warps and deprioritize mostly-miss warps
Key Results: 21.8% better performance and 20.1% better energy efficiency compared to the state-of-the-art GPU caching policy

Outline: Background, Key Observations, Memory Divergence Correction (MeDiC), Results

Latency Hiding in GPGPU Execution
[Diagram: GPU core status over time — warps A through D execute in lockstep, and the core switches to another active warp whenever the current warp stalls, hiding memory latency]

Problem: Memory Divergence
[Diagram: within warp A, some threads hit in the cache while others miss and must go to main memory; the entire warp stalls until the slowest miss returns]

Outline: Background, Key Observations, Memory Divergence Correction (MeDiC), Results

Observation 1: Divergence Heterogeneity
[Diagram: warps span a spectrum — all-hit, mostly-hit, mostly-miss, and all-miss; moving a mostly-hit warp to all-hit reduces its stall time]
Key Idea: Convert mostly-hit warps to all-hit warps, and mostly-miss warps to all-miss warps

Observation 2: Stable Divergence Characteristics
A warp retains its hit ratio during a program phase.
Hit ratio = number of hits / number of accesses
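To make the observation concrete, here is a minimal C++ sketch of the per-warp bookkeeping it implies; the struct and its reset policy are illustrative assumptions, only the hit-ratio definition comes from the slide:

```cpp
#include <cstdint>

// Illustrative per-warp hit-ratio counter (names are assumptions).
// The slide's definition: hit ratio = number of hits / number of accesses.
struct WarpHitStats {
    uint32_t hits = 0;
    uint32_t accesses = 0;

    void recordAccess(bool hit) {
        ++accesses;
        if (hit) ++hits;
    }

    double hitRatio() const {
        return accesses == 0 ? 0.0 : static_cast<double>(hits) / accesses;
    }

    // Because the ratio is stable only within a program phase,
    // a real implementation would reset these counters periodically.
    void reset() { hits = 0; accesses = 0; }
};
```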

Observation 3: Queuing at L2 Banks
[Diagram: shared L2 cache banks with per-bank request buffers feeding a memory scheduler and DRAM]
45% of requests stall 20+ cycles at the L2 queue
Long queuing delays exacerbate the effect of memory divergence

Outline: Background, Key Observations, Memory Divergence Correction (MeDiC), Results

Our Solution: MeDiC
Key Ideas:
- Convert mostly-hit warps to all-hit warps
- Convert mostly-miss warps to all-miss warps
- Reduce L2 queuing latency
- Prioritize mostly-hit warps at the memory scheduler
- Maintain memory bandwidth

Memory Divergence Correction
[Architecture diagram: warp-type identification logic classifies each request; memory request bypassing logic sends mostly-miss and all-miss requests past the shared L2 cache banks directly to DRAM, remaining requests use the L2 under a warp-type-aware cache insertion policy, and a warp-type-aware memory scheduler serves high-priority requests before low-priority ones]

Mechanism to Identify Warp Type
- Profile the hit ratio of each warp
- Group each warp into one of five categories, ordered from higher to lower priority: all-hit, mostly-hit, balanced, mostly-miss, all-miss
- Periodically reset each warp's type to adapt to phase changes
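A sketch of the five-way classification, assuming the hit ratio is profiled as in the earlier sketch; the five categories come from the slide, but the numeric thresholds below are placeholders, not the paper's tuned values:

```cpp
enum class WarpType { AllHit, MostlyHit, Balanced, MostlyMiss, AllMiss };

// Map a profiled hit ratio to a warp type. Thresholds are assumed
// for illustration; the paper's actual cutoffs may differ.
WarpType classifyWarp(double hitRatio) {
    if (hitRatio >= 1.0) return WarpType::AllHit;
    if (hitRatio >= 0.7) return WarpType::MostlyHit;   // assumed threshold
    if (hitRatio >  0.3) return WarpType::Balanced;    // assumed threshold
    if (hitRatio >  0.0) return WarpType::MostlyMiss;
    return WarpType::AllMiss;
}
```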

Warp-type-aware Cache Bypassing
[Architecture diagram repeated, highlighting the memory request bypassing logic]

Warp-type-aware Cache Bypassing
Goal:
- Convert mostly-hit warps to all-hit warps
- Convert mostly-miss warps to all-miss warps
Our Solution:
- Requests from all-miss and mostly-miss warps bypass the L2
- Adjust how warps are identified so the miss rate is maintained
Key Benefits:
- More all-hit warps
- Reduced queuing latency for all warps
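The bypass rule itself is then a one-liner; a sketch reusing the WarpType enum from the classification sketch (the function name is hypothetical):

```cpp
// All-miss and mostly-miss warps gain little from the L2, so their
// requests skip it: this frees cache space and shortens the L2 queue
// for warps that do benefit from caching.
bool shouldBypassL2(WarpType type) {
    return type == WarpType::AllMiss || type == WarpType::MostlyMiss;
}
```

Once a warp's requests bypass the L2 they no longer produce hits, which is presumably why the slide notes that warp identification must be adjusted: bypassed warps still need a way to be reclassified when their behavior changes.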

Warp-type-aware Cache Insertion
[Architecture diagram repeated, highlighting the warp-type-aware cache insertion policy]

Warp-type-aware Cache Insertion
Goal:
- Utilize the cache well: prioritize mostly-hit warps and retain blocks with high reuse
Our Solution:
- All-miss and mostly-miss warps: insert at the LRU position
- All-hit, mostly-hit, and balanced warps: insert at the MRU position
Benefits:
- Blocks from all-hit and mostly-hit warps are less likely to be evicted
- Heavily reused cache blocks from mostly-miss warps are likely to remain in the cache
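A sketch of the corresponding insertion decision, under the same assumptions as the previous sketches:

```cpp
enum class InsertPosition { MRU, LRU };

// Blocks fetched for all-miss and mostly-miss warps enter at the LRU
// position: they are evicted first unless hits promote them, so only
// heavily reused blocks from those warps stay cached. All other blocks
// enter at the MRU position.
InsertPosition insertPositionFor(WarpType type) {
    if (type == WarpType::AllMiss || type == WarpType::MostlyMiss)
        return InsertPosition::LRU;
    return InsertPosition::MRU;  // all-hit, mostly-hit, balanced
}
```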

Warp-type-aware Memory Scheduler
[Architecture diagram repeated, highlighting the warp-type-aware memory scheduler]

Not All Blocks Can Be Cached
Despite best efforts, accesses from mostly-hit warps can still miss in the cache:
- Compulsory misses
- Cache thrashing
Solution: Warp-type-aware memory scheduler

Warp-type-aware Memory Scheduler
Goal: Prioritize mostly-hit warps over mostly-miss warps
Mechanism: Two memory request queues
- High-priority: all-hit and mostly-hit
- Low-priority: balanced, mostly-miss, and all-miss
Benefit: Mostly-hit warps stall less
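A sketch of the two-queue scheduler; for brevity it drains each queue in FCFS order, whereas the talk's scheduler applies FR-FCFS within each queue (MemRequest is a stand-in type):

```cpp
#include <deque>

struct MemRequest { /* address, warp id, ... */ };

class WarpTypeAwareScheduler {
    std::deque<MemRequest> highPrio;  // all-hit, mostly-hit
    std::deque<MemRequest> lowPrio;   // balanced, mostly-miss, all-miss
public:
    void enqueue(const MemRequest& req, WarpType type) {
        if (type == WarpType::AllHit || type == WarpType::MostlyHit)
            highPrio.push_back(req);
        else
            lowPrio.push_back(req);
    }

    // Low-priority requests are served only when no high-priority
    // request is waiting, so mostly-hit warps stall less.
    bool next(MemRequest& out) {
        auto& q = !highPrio.empty() ? highPrio : lowPrio;
        if (q.empty()) return false;
        out = q.front();
        q.pop_front();
        return true;
    }
};
```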

MeDiC: Example
[Diagram: mostly-miss and all-miss warps bypass the cache, go straight to the DRAM queue, and insert fetched blocks at the LRU position; mostly-hit and all-hit warps insert at the MRU position and get high priority at the memory scheduler — lowering both queuing latency and stall time]
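Tying the sketches together, a hypothetical routing function showing where each component acts on a single request (all names come from the earlier sketches, not the paper):

```cpp
// Illustrative request flow: bypass check first, then L2 probe with
// warp-type-aware insertion, then warp-type-aware DRAM scheduling.
void routeRequest(const MemRequest& req, WarpType type,
                  WarpTypeAwareScheduler& sched) {
    if (shouldBypassL2(type)) {
        sched.enqueue(req, type);  // straight to the (low-priority) DRAM queue
        return;
    }
    bool hit = false;  // placeholder for a real L2 lookup
    if (!hit) {
        // On a miss, the fill is inserted at insertPositionFor(type)
        // and the request goes to the DRAM queues (high priority for
        // all-hit / mostly-hit warps).
        sched.enqueue(req, type);
    }
}
```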

Outline: Background, Key Observations, Memory Divergence Correction (MeDiC), Results

Methodology
- Modified GPGPU-Sim modeling a GTX 480
  - 15 GPU cores, 6 memory partitions
  - 16KB 4-way L1, 768KB 16-way L2
  - Models the L2 queue and L2 queuing latency
  - 1674 MHz GDDR5
- Workloads from the CUDA SDK, Rodinia, Mars, and Lonestar suites

Comparison Points
- Baseline: FR-FCFS memory scheduler [Rixner+, ISCA'00]
- Cache insertion: EAF [Seshadri+, PACT'12]
  - Tracks recently evicted blocks to detect high reuse and inserts them at the MRU position
  - Does not take divergence heterogeneity into account
  - Does not lower queuing latency
- Cache bypassing: PCAL [Li+, HPCA'15]
  - Uses tokens to limit the number of warps that can access the L2 cache, reducing cache thrashing
  - Warps with highly reused accesses get higher priority
- PC-based and random bypassing policies

Results: Performance of MeDiC
[Chart: 21.8% average performance improvement]
MeDiC is effective in identifying warp types and taking advantage of divergence heterogeneity

Results: Energy Efficiency of MeDiC
[Chart: 20.1% average energy efficiency improvement]
The performance improvement outweighs the additional energy from extra cache misses

Other Results in the Paper
- Breakdown of each component of MeDiC: each component is effective
- Comparison against PC-based and random cache bypassing policies: MeDiC provides better performance
- Analysis of combining MeDiC with a reuse mechanism: MeDiC is effective in caching highly reused blocks
- Sensitivity analysis of each individual component:
  - Minimal impact on L2 miss rate
  - Minimal impact on row buffer locality
  - Improved L2 queuing latency

Conclusion
Problem: memory divergence
- Threads execute in lockstep, but not all threads hit in the cache
- A single long-latency thread can stall an entire warp
Key Observations:
- Memory divergence characteristics differ across warps: some warps mostly hit in the cache, some mostly miss
- Divergence characteristics are stable over time
- L2 queuing exacerbates the memory divergence problem
Our Solution: Memory Divergence Correction (MeDiC)
- Uses cache bypassing, cache insertion, and memory scheduling to prioritize mostly-hit warps and deprioritize mostly-miss warps
Key Results: 21.8% better performance and 20.1% better energy efficiency compared to the state-of-the-art GPU caching policy

Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance Rachata Ausavarungnirun, Saugata Ghose, Onur Kayiran, Gabriel H. Loh, Chita Das, Mahmut Kandemir, Onur Mutlu

Backup Slides

Queuing at L2 Banks: Real Workloads
[Chart callout: 53.8%]

Adding More Banks
[Chart callout: 5%]

Queuing Latency Reduction
[Chart callout: 69.8%]

MeDiC: Performance Breakdown

MeDiC: Miss Rate

MeDiC: Row Buffer Hit Rate

MeDiC-Reuse

L2 Queuing Penalty

Divergence Distribution

Stable Divergence Characteristics
A warp retains its hit ratio during a program phase.
Heterogeneity:
- Control divergence
- Memory divergence
- Edge cases in the data the program is operating on
- Coalescing
- Affinity to different memory partitions
Stability:
- Temporal + spatial locality

Warps Can Fetch Data for Others
All-miss and mostly-miss warps can fetch cache blocks for other warps:
- Blocks with high reuse
- Addresses shared with all-hit and mostly-hit warps
Solution: Warp-type-aware cache insertion

Warp-type Aware Cache Insertion
[Diagram: example L2 cache sets showing future cache requests from warps A and B, and how blocks are placed at the MRU or LRU position depending on the fetching warp's type]

Warp-type Aware Memory Scheduler
[Diagram: each memory request is steered by warp type — all-hit and mostly-hit requests enter the high-priority queue; balanced, mostly-miss, and all-miss requests enter the low-priority queue; an FR-FCFS scheduler drains each queue into main memory]