Classification-Driven Search for Effective SM Partitioning in Multitasking GPUs Xia Zhao*, Zhiying Wang+, Lieven Eeckhout* *Ghent University, +National University of Defense Technology The 32nd ACM International Conference on Supercomputing (ICS-2018)
GPU deployment in the cloud GPU multitasking support
GPU spatial multitasking GPU with 20 streaming multiprocessors (SMs) Apps execute on different SMs Even SM partitioning (widely used) Same SM count per app 2 applications App1 - 10 SMs App2 - 10 SMs
Alternative SM partitioning GPU with 20 streaming multiprocessors (SMs) Even SM partitioning App1 - 10 SMs App2 - 10 SMs Uneven SM partitioning Power-gating SMs
Why SM partitioning matters Exploring uneven SM partitioning 24 SMs in GPU Even and uneven SM partitioning 3 typical workloads GAUSSIAN_LEU - performance opportunity LBM_DWT2D - energy opportunity BINO_LEU - no opportunity How to find the most effective SM partition? X_Y denotes the number of SMs assigned to the left and right application
Workload classification Heterogeneous workloads Memory-intensive apps: LBM, DWT2D, LAVAMD Compute-intensive apps: DXTC, LEU, MERGE Key take-away messages Opportunity: optimize system throughput (STP) Shift SMs from the memory-intensive app to the compute-intensive app Overall system performance can be significantly improved through SM partitioning for heterogeneous workload mixes; performance is maximized by assigning a relatively small fraction of the available SMs to the memory-intensive application
Workload classification Memory-intensive workloads Memory-intensive apps: LBM, DWT2D, LAVAMD, GAUSSIAN Key take-away messages Opportunity: reduce power consumption Fewer SMs should be assigned to memory-intensive apps Performance may also improve for cache-sensitive apps (i) when co-executing memory-intensive applications, there is no need to allocate all available SMs in the GPU — optimum performance is achieved when allocating a fraction of the available SMs, which creates an opportunity to save power; (ii) the number of SMs assigned to each memory-intensive application depends on the co-executing application and hence needs to be determined dynamically; and (iii) reducing the number of active SMs sometimes leads to a performance boost due to reduced cache contention.
Workload classification Compute-intensive workloads Compute-intensive apps: BINO, LEU, DXTC, MERGE Key take-away messages Cannot optimize performance Cannot optimize power Should keep even SM partitioning Workloads offer different opportunities Towards classification-driven search
SM sensitivity - compute apps Performance increases linearly
SM sensitivity - memory apps Performance saturates quickly How to classify applications?
BW utilization is not enough LLC APKI (accesses per kilo instructions) & LLC miss rate & memory BW utilization Previously used memory bandwidth utilization does not work well Classification based on a single metric is not effective
Off-SM bandwidth model SM part: total bandwidth the apps demand Off-SM part: total bandwidth the GPU provides Bandwidth demand < provided bandwidth: performance increases linearly Bandwidth demand > provided bandwidth: performance saturates quickly [Figure: streaming multiprocessors (SMs) connect through a crossbar network to L2 slices and memory controllers (MCs); the SM part and off-SM part meet at the crossbar]
Data bandwidth the GPU provides Total data bandwidth: memory bandwidth, L2 cache bandwidth, NoC bandwidth Bandwidth formula: Min{NoC_BW, F(MEM, L2)} Build F(MEM, L2) based on L2 bandwidth, L2 cache hit ratio, and memory bandwidth utilization
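The provided-bandwidth side of the model can be illustrated with a minimal Python sketch. The slide only says F(MEM, L2) is built from L2 bandwidth, L2 hit ratio, and memory bandwidth; the specific combination below (DRAM bandwidth capping L2 traffic in proportion to the miss ratio) is an assumed, illustrative form, not the paper's exact formula.

```python
# Sketch of the off-SM "provided bandwidth" model.
# Assumption: each byte of L2 traffic generates (1 - hit_ratio) bytes of
# DRAM traffic, so DRAM bandwidth caps deliverable L2 traffic.

def provided_bandwidth(noc_bw, l2_bw, mem_bw, l2_hit_ratio):
    """Total data bandwidth the off-SM part can deliver (GB/s)."""
    if l2_hit_ratio < 1.0:
        # DRAM limits L2 traffic to mem_bw / (1 - hit_ratio) GB/s.
        f_mem_l2 = min(l2_bw, mem_bw / (1.0 - l2_hit_ratio))
    else:
        f_mem_l2 = l2_bw  # all hits: only L2 bandwidth limits traffic
    # The NoC is the final cap: Min{NoC_BW, F(MEM, L2)}.
    return min(noc_bw, f_mem_l2)

# Numbers from the experimental setup: 538 GB/s NoC bisection bandwidth,
# 177 GB/s memory bandwidth; the 600 GB/s L2 bandwidth is an assumption.
print(provided_bandwidth(noc_bw=538, l2_bw=600, mem_bw=177, l2_hit_ratio=0.5))
```

With a 50% L2 hit ratio, DRAM is the bottleneck; with a 100% hit ratio the NoC cap takes over.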
Data bandwidth per app needs Data bandwidth demand = SM count x bandwidth demand per SM Bandwidth demand per SM: memory accesses per cycle, derived from IPC and L2 accesses per kilo instructions (L2 APKI), times the SM operating frequency
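The demand side and the resulting classification test can be sketched the same way. The function names, the 128-byte line size, and the exact scaling are assumptions for illustration; the slide only gives the ingredients (SM count, IPC, L2 APKI, SM frequency).

```python
# Sketch of the per-app bandwidth-demand model and the classification test.
# Assumptions: one 128 B cache line moves per L2 access; APKI is per
# 1000 instructions, hence the /1000 scaling.

def bandwidth_demand(sm_count, ipc_per_sm, l2_apki, freq_ghz, line_bytes=128):
    """Data bandwidth an app demands (GB/s) when running on sm_count SMs."""
    # L2 accesses per cycle on one SM = IPC * (APKI / 1000).
    accesses_per_cycle = ipc_per_sm * l2_apki / 1000.0
    per_sm = accesses_per_cycle * freq_ghz * line_bytes  # GB/s per SM
    return sm_count * per_sm

def is_memory_intensive(demand, provided):
    # Demand above the provided off-SM bandwidth => performance saturates
    # quickly with SM count; below it => performance scales linearly.
    return demand > provided
```

For example, 10 SMs at IPC 1.0, 50 L2 APKI, and 0.7 GHz demand about 44.8 GB/s under these assumptions; comparing that against the provided bandwidth classifies the app.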
CD-search: Performance mode GPU with 20 streaming multiprocessors (SMs) Workload consists of a memory-intensive app and a compute-intensive app How it works: Start with even partitioning and record performance Iteratively stall the SMs assigned to the memory-intensive app Stop stalling when the memory-intensive app's performance changes Assign the stalled SMs to the compute-intensive app
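The performance-mode search loop can be sketched in a few lines of Python. Here `measure_ipc` stands in for the hardware performance counters, and the 5% tolerance for "performance changes" is an assumed threshold, not a value from the paper.

```python
# Minimal sketch of CD-search performance mode for a heterogeneous workload.
# `measure_ipc(n)` abstracts measuring the memory-intensive app's IPC when
# it runs on n SMs; stalled SMs are handed to the compute-intensive app.

def performance_mode(total_sms, measure_ipc, tolerance=0.05):
    """Return (mem_app_sms, compute_app_sms) found by the search."""
    mem_sms = total_sms // 2              # start with even partitioning
    baseline = measure_ipc(mem_sms)       # record even-partition performance
    while mem_sms > 1:
        ipc = measure_ipc(mem_sms - 1)    # trial: stall one more SM
        if ipc < baseline * (1.0 - tolerance):
            break                         # performance changed: stop stalling
        mem_sms -= 1
    # Remaining SMs go to the compute-intensive app.
    return mem_sms, total_sms - mem_sms
```

With a memory-intensive app whose performance saturates at 6 SMs, the search shifts the remaining 14 SMs of a 20-SM GPU to the compute-intensive app.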
CD-search: Energy mode GPU with 20 streaming multiprocessors (SMs) Workload consists of two memory-intensive apps (app1 and app2) How it works: Start with even partitioning and record performance Stall all but one SM for each memory-intensive app Estimate the required SM count from IPC_10SMs and IPC_1SM Iteratively resume one SM at a time until performance saturates Power-gate the stalled SMs
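The energy-mode estimate can be sketched similarly. The slide gives the inputs (IPC at the even partition, IPC with one SM); the linear-scaling assumption behind the first guess and the 2% saturation tolerance are assumptions for illustration.

```python
import math

# Sketch of CD-search energy mode for one memory-intensive app.
# `measure_ipc(n)` abstracts the counters. Assumption: IPC scales roughly
# linearly with SM count until off-SM bandwidth saturates, so the number
# of SMs needed is about IPC_even / IPC_1SM.

def energy_mode(even_sms, measure_ipc, tolerance=0.02):
    """Return the number of SMs to keep active; the rest are power-gated."""
    ipc_even = measure_ipc(even_sms)   # e.g. IPC with 10 SMs (even split)
    ipc_one = measure_ipc(1)           # IPC with all but one SM stalled
    # First estimate: SMs needed to reproduce the even-partition IPC.
    n = min(even_sms, math.ceil(ipc_even / ipc_one))
    # Resume one SM at a time until an extra SM no longer helps.
    while n < even_sms and measure_ipc(n + 1) > measure_ipc(n) * (1 + tolerance):
        n += 1
    return n
```

For an app whose IPC saturates at the level of 4 SMs, the estimate lands on 4 active SMs out of 10, and the other 6 can be power-gated.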
Experimental setup GPGPU-sim 24 SMs - Fermi-like architecture 14 applications from Rodinia, Parboil, SDK, etc. 7 compute-intensive apps 7 memory-intensive apps 91 2-app workloads 49 heterogeneous workloads 21 memory workloads 21 compute workloads Crossbar NoC Bisection bandwidth - 538 GB/s 6 MCs Memory bandwidth - 177 GB/s 2 LLC slices per MC Total LLC size - 768 KB GPUWattch 40 nm technology node
Performance results Heterogeneous workloads STP improvement 10.4% on average ANTT improvement 22% on average
Energy results Homogeneous workloads Power reduction 25% on average
Energy results Homogeneous workloads STP improves for cache-sensitive applications STP degrades for a few applications with time-varying execution behavior Performance neutral on average
SMK discussion GPU with 20 streaming multiprocessors (SMs) SMK [ISCA'16] Co-executes 2 applications on every SM 2 applications App1 - 20 SMs App2 - 20 SMs Maestro [ASPLOS'17] Combines SMK with spatial multitasking Still assumes even SM partitioning
SMK discussion SMK severely degrades performance for some applications CD-search and Maestro both improve performance CD-search + Maestro achieves highest performance 24.4% performance improvement on average
Conclusions We demonstrate the opportunity for uneven SM partitioning Performance potential Power potential We propose CD-search which First classifies the workloads Then identifies a suitable SM partitioning We propose an off-SM bandwidth model to classify apps In the evaluation, CD-search Improves system throughput by 10.4% for heterogeneous workloads Reduces power by 25% for homogeneous workloads