1 Xia Zhao*, Zhiying Wang+, Lieven Eeckhout*
Classification-Driven Search for Effective SM Partitioning in Multitasking GPUs Xia Zhao*, Zhiying Wang+, Lieven Eeckhout* *Ghent University, +National University of Defense Technology The 32nd ACM International Conference on Supercomputing (ICS-2018)

2 GPU deployment in the cloud
GPU multitasking support

3 GPU spatial multitasking
GPU with 20 streaming multiprocessors (SMs). Two applications execute on disjoint sets of SMs. Even SM partitioning (widely used): the same SM count per app.

4 Alternative SM partitioning
GPU with 20 streaming multiprocessors (SMs). Beyond even SM partitioning, two alternatives exist: uneven SM partitioning across the apps, and power-gating unused SMs.

5 Why SM partitioning matters
Exploring uneven SM partitioning on a 24-SM GPU, comparing even and uneven partitionings for 3 typical workloads (X_Y denotes the number of SMs assigned to the left and right application): GAUSSIAN_LEU shows a performance opportunity, LBM_DWT2D an energy opportunity, and BINO_LEU no opportunity. How do we find the most effective SM partition?

6 Workload classification
Heterogeneous workloads pair a memory-intensive app (LBM, DWT2D, LAVAMD) with a compute-intensive app (DXTC, LEU, MERGE). Key take-away: there is an opportunity to optimize system throughput (STP) by shifting SMs from the memory-intensive app to the compute-intensive app. Overall system performance can be significantly improved through SM partitioning for heterogeneous workload mixes; performance is maximized by assigning a relatively small fraction of the available SMs to the memory-intensive application.

7 Workload classification
Memory-intensive workloads co-execute two memory-intensive apps (LBM, DWT2D, LAVAMD, GAUSSIAN). Key take-aways: there is an opportunity to reduce power consumption; fewer SMs should be assigned to memory-intensive apps; and performance may also improve for cache-sensitive apps. (i) When co-executing memory-intensive applications, there is no need to allocate all available SMs in the GPU: optimum performance is achieved when allocating a fraction of the available SMs, which creates an opportunity to save power; (ii) the number of SMs assigned to each memory-intensive application depends on the co-executing application and hence needs to be determined dynamically; and (iii) reducing the number of active SMs sometimes leads to a performance boost due to reduced cache contention.

8 Workload classification
Compute-intensive workloads co-execute two compute-intensive apps (BINO, LEU, DXTC, MERGE). Key take-aways: there is no opportunity to optimize performance or power, so even SM partitioning should be kept. Different workload types thus offer different opportunities, motivating a classification-driven search.

9 SM sensitivity - compute apps
Performance increases linearly

10 SM sensitivity - memory apps
Performance saturates quickly. How to classify applications?

11 BW utilization is not enough
Candidate metrics: LLC APKI (accesses per kilo instructions), LLC miss rate, and memory bandwidth utilization. The previously used memory bandwidth utilization does not work well; classification based on any single metric is not effective.

12 Off-SM bandwidth model
SM part: the total bandwidth the apps demand. Off-SM part (crossbar network, L2 cache slices, memory controllers): the total bandwidth the GPU provides. If bandwidth demand < provided bandwidth, performance increases linearly with SM count; if bandwidth demand > provided bandwidth, performance saturates quickly.
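The resulting classification rule can be sketched in a few lines of Python. This is an illustration of the rule as stated on the slide, not the paper's implementation; the function name and the example bandwidth numbers are made up.

```python
# Sketch of the off-SM bandwidth classification rule; both inputs are
# assumed to be in the same units (e.g. GB/s).
def classify(demand_gbps: float, provided_gbps: float) -> str:
    # Demand below supply: performance scales linearly with SM count.
    # Demand above supply: performance saturates quickly.
    if demand_gbps < provided_gbps:
        return "compute-intensive"
    return "memory-intensive"

print(classify(80.0, 120.0))   # compute-intensive
print(classify(150.0, 120.0))  # memory-intensive
```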

13 Data bandwidth GPU provides
The total data bandwidth the GPU provides combines the memory bandwidth, the L2 cache bandwidth, and the NoC bandwidth. Bandwidth formula: min {NoC_BW, F(MEM, L2)}, where F(MEM, L2) is built from the L2 bandwidth, the L2 cache hit ratio, and the memory bandwidth utilization.
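One plausible instantiation of min {NoC_BW, F(MEM, L2)} is sketched below. The exact form of F is not given on the slide, so the assumption here is that DRAM caps L2 throughput via the miss fraction; the function and its toy numbers are illustrative only.

```python
def provided_bandwidth(noc_bw, l2_bw, mem_bw, l2_hit_ratio):
    """Off-SM bandwidth the GPU provides (all values in GB/s).
    Assumed form of F(MEM, L2): with L2 hit ratio h, only the miss
    fraction (1 - h) reaches DRAM, so DRAM caps sustainable L2
    throughput at mem_bw / (1 - h)."""
    miss = 1.0 - l2_hit_ratio
    f_mem_l2 = l2_bw if miss == 0.0 else min(l2_bw, mem_bw / miss)
    return min(noc_bw, f_mem_l2)

# Toy numbers: NoC 400 GB/s, L2 300 GB/s, DRAM 180 GB/s, 50% hit ratio.
print(provided_bandwidth(400.0, 300.0, 180.0, 0.5))  # 300.0
```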

14 Data bandwidth per app needs
Data bandwidth demand = SM count × bandwidth demand per SM. The per-SM demand follows from the memory accesses per cycle (IPC × L2 accesses per kilo instructions) and the SM operating frequency.
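The demand side of the model can be sketched as follows. The 32-byte access granularity is an assumption made for illustration, as are the toy inputs; the slide only gives the IPC × APKI × frequency structure.

```python
def bandwidth_demand(sm_count, ipc, l2_apki, freq_ghz, bytes_per_access=32):
    """Aggregate data bandwidth an app demands, in GB/s.
    L2 accesses per cycle per SM = IPC * (APKI / 1000); multiplying by
    the SM frequency (GHz) and bytes per access gives GB/s per SM."""
    accesses_per_cycle = ipc * l2_apki / 1000.0
    per_sm_gbps = accesses_per_cycle * freq_ghz * bytes_per_access
    return sm_count * per_sm_gbps

# Toy example: 10 SMs, IPC 1.5, 50 L2 accesses per kilo-instruction,
# 1.4 GHz SM clock -> roughly 33.6 GB/s of aggregate demand.
print(bandwidth_demand(10, 1.5, 50, 1.4))
```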

15 CD-search: Performance mode
GPU with 20 streaming multiprocessors (SMs); the workload consists of a memory-intensive app and a compute-intensive app. How it works: start with even partitioning and record performance; iteratively stall the SMs assigned to the memory-intensive app; stop stalling when the memory-intensive app's performance changes; assign the stalled SMs to the compute-intensive app.
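The steps above can be sketched as a small search loop. `perf_fn(n)` is a hypothetical hook returning the memory-intensive app's performance with n active SMs (in the real system this is measured online); the 2% change tolerance is an assumption.

```python
def cd_search_performance(perf_fn, total_sms=20, tol=0.02):
    """Shrink the memory-intensive app's SM share until its performance
    drops; hand the freed SMs to the compute-intensive app.
    Returns (mem_app_sms, compute_app_sms)."""
    mem_sms = total_sms // 2             # start from even partitioning
    baseline = perf_fn(mem_sms)          # record baseline performance
    # Iteratively stall one SM of the memory-intensive app; stop once
    # its performance changes by more than the tolerance.
    while mem_sms > 1 and perf_fn(mem_sms - 1) >= baseline * (1 - tol):
        mem_sms -= 1
    return mem_sms, total_sms - mem_sms

# Toy memory-bound app whose performance saturates beyond 4 SMs.
print(cd_search_performance(lambda n: min(n, 4)))  # (4, 16)
```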

16 CD-search: Energy mode
GPU with 20 streaming multiprocessors (SMs); the workload consists of two memory-intensive apps (app1, app2). How it works: start with even partitioning and record performance; stall all but one SM for each memory-intensive app; estimate the required SM count from the IPC with 10 SMs and the IPC with 1 SM; iteratively resume one SM at a time until performance saturates; power-gate the remaining stalled SMs.
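The energy-mode steps can be sketched similarly for one of the two apps. `perf_fn(n)` is again a hypothetical measurement hook, and using the ratio of the two IPCs as a linear-scaling estimate of the needed SM count is an assumption about how the slide's estimate is formed.

```python
def cd_search_energy(perf_fn, even_sms=10, tol=0.02):
    """For one memory-intensive app: find the fewest SMs that keep
    performance near the even-partition level; power-gate the rest.
    Returns (active_sms, power_gated_sms)."""
    target = perf_fn(even_sms)           # performance at even partition
    # Estimate the needed SM count from the IPC at even_sms SMs and at
    # 1 SM, assuming roughly linear scaling up to saturation.
    n = max(1, min(even_sms, round(target / perf_fn(1))))
    # Resume one SM at a time until performance saturates near target.
    while n < even_sms and perf_fn(n) < target * (1 - tol):
        n += 1
    return n, even_sms - n

# Toy memory-bound app saturating at 4 SMs: 4 stay active, 6 are gated.
print(cd_search_energy(lambda m: min(m, 4)))  # (4, 6)
```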

17 Experimental setup GPGPU-sim 24 SMs - Fermi-like architecture
14 applications from Rodinia, Parboil, the CUDA SDK, etc.: 7 compute-intensive apps and 7 memory-intensive apps, yielding 91 2-app workloads (49 heterogeneous, 21 memory-intensive, 21 compute-intensive). GPGPU-sim, 24 SMs, Fermi-like architecture; crossbar NoC, bisection bandwidth GB/s; 6 MCs, memory bandwidth GB/s; 2 LLC slices per MC, total LLC size KB; GPUWattch, 40 nm technology node.

18 Performance results Heterogeneous workloads STP improvement
10.4% on average; ANTT improvement: 22% on average.

19 Energy results Homogeneous workloads Power reduction 25% on average

20 Performance neutral on average
Energy results for homogeneous workloads: STP improves due to cache-sensitive applications; STP degrades due to the time-varying execution behavior of a few applications; performance is neutral on average.

21 streaming multiprocessors (SMs)
SMK discussion. GPU with 20 streaming multiprocessors (SMs). SMK [ISCA'16] co-executes 2 applications on each SM. Maestro [ASPLOS'17] combines SMK with spatial multitasking, but still assumes even SM partitioning.

22 SMK discussion SMK severely degrades performance for some applications
CD-search and Maestro both improve performance; CD-search + Maestro achieves the highest performance: 24.4% improvement on average.

23 Conclusions We demonstrate the opportunity for uneven SM partitioning
Performance potential and power potential. We propose CD-search, which first classifies the workloads and then identifies a suitable SM partitioning. We propose an off-SM bandwidth model to classify apps. In the evaluation, CD-search improves system throughput by 10.4% for heterogeneous workloads and reduces power by 25% for homogeneous workloads.

24 Xia Zhao*, Zhiying Wang+, Lieven Eeckhout*
Classification-Driven Search for Effective SM Partitioning in Multitasking GPUs Xia Zhao*, Zhiying Wang+, Lieven Eeckhout* *Ghent University, +National University of Defense Technology The 32nd ACM International Conference on Supercomputing (ICS-2018)


