Managing GPU Concurrency in Heterogeneous Architectures (MICRO 2014)
Baseline heterogeneous architecture
- Throughput-optimized GPU cores and latency-optimized CPU cores on the same chip
- Cores are connected to the last-level cache (LLC) and memory controllers (MCs) via an interconnect
Motivation – application interference
- Contention in shared hardware resources causes up to 20% performance loss
- GPU: the dominant consumer of shared resources due to its high thread-level parallelism (TLP)
- CPU applications are affected much more than GPU applications (up to 85% performance loss)
Motivation – latency tolerance
- High GPU TLP causes memory system congestion and low CPU performance
- The GPU can tolerate memory latency through multithreading
- Better performance is possible at lower TLP, motivating GPU TLP management
Effects of GPU TLP on GPU performance
- When GPU TLP is reduced, GPU performance can be:
  - Better: less cache thrashing and congestion in the memory subsystem
  - Worse: reduced parallelism and latency tolerance
  - Unchanged
Effects of GPU TLP on CPU performance
- When GPU TLP is reduced, CPU performance can be:
  - Better: less congestion in the memory subsystem
  - Unchanged
Proposal I – CM-CPU: CPU-Centric Concurrency Management
- Main goal: reduce GPU concurrency to boost CPU performance
- Two congestion metrics:
  - Memory congestion: number of requests stalled because a memory controller (MC) is full
  - Network congestion: number of requests stalled because the reply network is full
- Higher congestion implies lower CPU performance
Proposal I – CM-CPU
- Congestion level classified as low, medium, or high
- At least one metric is high: decrease the number of active warps
- Both metrics are low: increase the number of active warps
- Otherwise: keep the number of warps unchanged (see the sketch after this list)
- Downside: insufficient GPU latency tolerance due to low TLP
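A minimal C++ sketch of this decision logic. The thresholds, the warp-count step, and the per-core warp limits are illustrative assumptions, not the paper's calibrated parameters:

```cpp
#include <algorithm>

enum class Level { Low, Medium, High };

// Classify a per-epoch stall counter into a congestion level.
// The two thresholds are assumed values, not the paper's.
Level classify(int stalledRequests, int lo, int hi) {
    if (stalledRequests < lo) return Level::Low;
    if (stalledRequests < hi) return Level::Medium;
    return Level::High;
}

// CM-CPU update, invoked once per sampling epoch.
// memStalls: requests stalled because a memory controller queue is full.
// netStalls: requests stalled because the reply network is full.
// Returns the new per-core active-warp limit.
int cmCpuUpdate(int activeWarps, int memStalls, int netStalls) {
    const int kLo = 8, kHi = 16;              // assumed congestion thresholds
    const int kStep = 2;                      // assumed warp-count step
    const int kMinWarps = 1, kMaxWarps = 48;  // assumed per-core warp limits

    Level mem = classify(memStalls, kLo, kHi);
    Level net = classify(netStalls, kLo, kHi);

    if (mem == Level::High || net == Level::High)         // at least one high
        return std::max(kMinWarps, activeWarps - kStep);  // throttle GPU TLP
    if (mem == Level::Low && net == Level::Low)           // both low
        return std::min(kMaxWarps, activeWarps + kStep);  // restore GPU TLP
    return activeWarps;                                   // otherwise unchanged
}
```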
Proposal II – CM-BAL: Balanced Concurrency Management
- stallGPU: number of cycles in which a GPU core fails to issue any warp
- stallGPU measures the latency tolerance of GPU cores
- Low latency tolerance indicates high memory contention
Proposal II – CM-BAL
- Part 1: the same as CM-CPU
- Part 2: override CM-CPU and increase TLP when stallGPU rises by more than k (see the sketch below)
- Higher k: more difficult to improve the GPU's latency tolerance
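A sketch of how CM-BAL's override could layer on top of the CM-CPU logic above. The epoch-delta comparison against k follows the slide; the variable names and the reuse of cmCpuUpdate are assumptions:

```cpp
// CM-BAL update, building on cmCpuUpdate() from the CM-CPU sketch.
// stallGpuPrev/stallGpuNow: stallGPU counts from the previous and
// current epochs. k is the user-tunable threshold from the talk:
// a larger k makes this override harder to trigger, biasing the
// system toward CPU performance.
int cmBalUpdate(int activeWarps, int memStalls, int netStalls,
                int stallGpuPrev, int stallGpuNow, int k) {
    const int kStep = 2, kMaxWarps = 48;  // assumed, as in the CM-CPU sketch

    // Part 1: start from the CM-CPU decision.
    int decision = cmCpuUpdate(activeWarps, memStalls, netStalls);

    // Part 2: if GPU latency tolerance dropped (stallGPU grew by more
    // than k since the last epoch), override CM-CPU and raise TLP.
    if (stallGpuNow - stallGpuPrev > k)
        decision = std::min(kMaxWarps, activeWarps + kStep);

    return decision;
}
```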
GPU performance results
- DYNCTA: +2%
- CM-CPU: −11%
- CM-BAL1: +7%
CPU performance results
- DYNCTA: +2%
- CM-CPU: +24%
- CM-BAL1: +7%
System performance
- Overall System Speedup = (1 − α) × WS_CPU + α × SU_GPU
- α is between 0 and 1; higher α gives higher GPU importance
- [Plot: overall system speedup of CM-CPU and CM-BAL across α values]
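As a worked illustration with assumed values: for α = 0.5, WS_CPU = 1.10, and SU_GPU = 0.95, Overall System Speedup = 0.5 × 1.10 + 0.5 × 0.95 = 1.025. Sweeping α from 0 to 1 shifts the metric from pure CPU weighted speedup to pure GPU speedup, which is how the plot above compares CM-CPU and CM-BAL under different user preferences.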
Conclusions
- Sharing the memory hierarchy causes CPU and GPU applications to interfere with each other
- Existing GPU TLP management techniques are not well-suited for heterogeneous architectures
- We propose two GPU TLP management techniques for heterogeneous architectures:
  - CM-CPU reduces GPU TLP to improve CPU performance
  - CM-BAL is similar to CM-CPU, but increases GPU TLP when it detects low latency tolerance in GPU cores
- TLP can be tuned based on the user's preference for higher CPU or GPU performance