1 Xia Zhao*, Zhiying Wang+, Lieven Eeckhout*
Classification-Driven Search for Effective SM Partitioning in Multitasking GPUs Xia Zhao*, Zhiying Wang+, Lieven Eeckhout* *Ghent University, +National University of Defense Technology The 32nd ACM International Conference on Supercomputing (ICS-2018)

2 GPU deployment in the cloud
GPU multitasking support

3 GPU spatial multitasking
GPU with 20 streaming multiprocessors (SMs). Two applications execute on disjoint sets of SMs. Even SM partitioning (widely used): the same SM count per app.

4 Alternative SM partitioning
GPU with 20 streaming multiprocessors (SMs). Beyond even SM partitioning, two alternatives exist: uneven SM partitioning across the apps, and power-gating unused SMs.

5 Why SM partitioning matters
Exploring uneven SM partitioning on a 24-SM GPU, comparing even and uneven partitionings for 3 typical workloads (X_Y denotes the number of SMs assigned to the left and right application): GAUSSIAN_LEU shows a performance opportunity, LBM_DWT2D an energy opportunity, and BINO_LEU no opportunity. How do we find the most effective SM partition?

6 Workload classification
Heterogeneous workloads pair a memory-intensive app (LBM, DWT2D, LAVAMD) with a compute-intensive app (DXTC, LEU, MERGE). Key take-away: there is an opportunity to optimize system throughput (STP) by shifting SMs from the memory-intensive app to the compute-intensive app. Overall system performance can be significantly improved through SM partitioning for heterogeneous workload mixes; performance is maximized by assigning a relatively small fraction of the available SMs to the memory-intensive application.

7 Workload classification
Memory-intensive workloads co-execute two memory-intensive apps (LBM, DWT2D, LAVAMD, GAUSSIAN). Key take-aways: there is an opportunity to reduce power consumption; fewer SMs should be assigned to memory-intensive apps; and performance may also improve for cache-sensitive apps. (i) When co-executing memory-intensive applications, there is no need to allocate all available SMs in the GPU: optimum performance is achieved when allocating a fraction of the available SMs, which creates an opportunity to save power; (ii) the number of SMs assigned to each memory-intensive application depends on the co-executing application and hence needs to be determined dynamically; and (iii) reducing the number of active SMs sometimes leads to a performance boost due to reduced cache contention.

8 Workload classification
Compute-intensive workloads co-execute two compute-intensive apps (BINO, LEU, DXTC, MERGE). Key take-aways: there is no opportunity to optimize performance or power, so even SM partitioning should be kept. Different workload types thus offer different opportunities, motivating a classification-driven search.

9 SM sensitivity - compute apps
Performance increases linearly

10 SM sensitivity - memory apps
Performance saturates quickly. How to classify applications?

11 BW utilization is not enough
Candidate metrics: LLC APKI (accesses per kilo instructions), LLC miss rate, and memory bandwidth utilization. The previously used memory bandwidth utilization does not work well; classification based on any single metric is not effective.

12 Off-SM bandwidth model
SM part: the total bandwidth the apps demand. Off-SM part (crossbar network, L2 cache slices, memory controllers): the total bandwidth the GPU provides. If bandwidth demand < provided bandwidth, performance increases linearly with SM count; if bandwidth demand > provided bandwidth, performance saturates quickly.
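The resulting classification rule can be sketched in a few lines of Python. This is an illustration of the rule as stated on the slide, not the paper's implementation; the function name and the example bandwidth numbers are made up.

```python
# Sketch of the off-SM bandwidth classification rule; both inputs are
# assumed to be in the same units (e.g. GB/s).
def classify(demand_gbps: float, provided_gbps: float) -> str:
    # Demand below supply: performance scales linearly with SM count.
    # Demand above supply: performance saturates quickly.
    if demand_gbps < provided_gbps:
        return "compute-intensive"
    return "memory-intensive"

print(classify(80.0, 120.0))   # compute-intensive
print(classify(150.0, 120.0))  # memory-intensive
```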

13 Data bandwidth GPU provides
The total data bandwidth the GPU provides combines the memory bandwidth, the L2 cache bandwidth, and the NoC bandwidth. Bandwidth formula: min {NoC_BW, F(MEM, L2)}, where F(MEM, L2) is built from the L2 bandwidth, the L2 cache hit ratio, and the memory bandwidth utilization.
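One plausible instantiation of min {NoC_BW, F(MEM, L2)} is sketched below. The exact form of F is not given on the slide, so the assumption here is that DRAM caps L2 throughput via the miss fraction; the function and its toy numbers are illustrative only.

```python
def provided_bandwidth(noc_bw, l2_bw, mem_bw, l2_hit_ratio):
    """Off-SM bandwidth the GPU provides (all values in GB/s).
    Assumed form of F(MEM, L2): with L2 hit ratio h, only the miss
    fraction (1 - h) reaches DRAM, so DRAM caps sustainable L2
    throughput at mem_bw / (1 - h)."""
    miss = 1.0 - l2_hit_ratio
    f_mem_l2 = l2_bw if miss == 0.0 else min(l2_bw, mem_bw / miss)
    return min(noc_bw, f_mem_l2)

# Toy numbers: NoC 400 GB/s, L2 300 GB/s, DRAM 180 GB/s, 50% hit ratio.
print(provided_bandwidth(400.0, 300.0, 180.0, 0.5))  # 300.0
```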

14 Data bandwidth per app needs
Data bandwidth demand = SM count × bandwidth demand per SM. The per-SM demand follows from the memory accesses per cycle (IPC × L2 accesses per kilo instructions) and the SM operating frequency.
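The demand side of the model can be sketched as follows. The 32-byte access granularity is an assumption made for illustration, as are the toy inputs; the slide only gives the IPC × APKI × frequency structure.

```python
def bandwidth_demand(sm_count, ipc, l2_apki, freq_ghz, bytes_per_access=32):
    """Aggregate data bandwidth an app demands, in GB/s.
    L2 accesses per cycle per SM = IPC * (APKI / 1000); multiplying by
    the SM frequency (GHz) and bytes per access gives GB/s per SM."""
    accesses_per_cycle = ipc * l2_apki / 1000.0
    per_sm_gbps = accesses_per_cycle * freq_ghz * bytes_per_access
    return sm_count * per_sm_gbps

# Toy example: 10 SMs, IPC 1.5, 50 L2 accesses per kilo-instruction,
# 1.4 GHz SM clock -> roughly 33.6 GB/s of aggregate demand.
print(bandwidth_demand(10, 1.5, 50, 1.4))
```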

15 CD-search: Performance mode
GPU with 20 streaming multiprocessors (SMs); the workload consists of a memory-intensive app and a compute-intensive app. How it works: start with even partitioning and record performance; iteratively stall the SMs assigned to the memory-intensive app; stop stalling when the memory-intensive app's performance changes; assign the stalled SMs to the compute-intensive app.
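The steps above can be sketched as a small search loop. `perf_fn(n)` is a hypothetical hook returning the memory-intensive app's performance with n active SMs (in the real system this is measured online); the 2% change tolerance is an assumption.

```python
def cd_search_performance(perf_fn, total_sms=20, tol=0.02):
    """Shrink the memory-intensive app's SM share until its performance
    drops; hand the freed SMs to the compute-intensive app.
    Returns (mem_app_sms, compute_app_sms)."""
    mem_sms = total_sms // 2             # start from even partitioning
    baseline = perf_fn(mem_sms)          # record baseline performance
    # Iteratively stall one SM of the memory-intensive app; stop once
    # its performance changes by more than the tolerance.
    while mem_sms > 1 and perf_fn(mem_sms - 1) >= baseline * (1 - tol):
        mem_sms -= 1
    return mem_sms, total_sms - mem_sms

# Toy memory-bound app whose performance saturates beyond 4 SMs.
print(cd_search_performance(lambda n: min(n, 4)))  # (4, 16)
```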

16 CD-search: Energy mode
GPU with 20 streaming multiprocessors (SMs); the workload consists of two memory-intensive apps (app1, app2). How it works: start with even partitioning and record performance; stall all but one SM for each memory-intensive app; estimate the required SM count from the IPC with 10 SMs and the IPC with 1 SM; iteratively resume one SM at a time until performance saturates; power-gate the remaining stalled SMs.
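The energy-mode steps can be sketched similarly for one of the two apps. `perf_fn(n)` is again a hypothetical measurement hook, and using the ratio of the two IPCs as a linear-scaling estimate of the needed SM count is an assumption about how the slide's estimate is formed.

```python
def cd_search_energy(perf_fn, even_sms=10, tol=0.02):
    """For one memory-intensive app: find the fewest SMs that keep
    performance near the even-partition level; power-gate the rest.
    Returns (active_sms, power_gated_sms)."""
    target = perf_fn(even_sms)           # performance at even partition
    # Estimate the needed SM count from the IPC at even_sms SMs and at
    # 1 SM, assuming roughly linear scaling up to saturation.
    n = max(1, min(even_sms, round(target / perf_fn(1))))
    # Resume one SM at a time until performance saturates near target.
    while n < even_sms and perf_fn(n) < target * (1 - tol):
        n += 1
    return n, even_sms - n

# Toy memory-bound app saturating at 4 SMs: 4 stay active, 6 are gated.
print(cd_search_energy(lambda m: min(m, 4)))  # (4, 6)
```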

17 Experimental setup GPGPU-sim 24 SMs - Fermi-like architecture
14 applications from Rodinia, Parboil, the CUDA SDK, etc.: 7 compute-intensive apps and 7 memory-intensive apps, yielding 91 2-app workloads (49 heterogeneous, 21 memory-intensive, 21 compute-intensive). GPGPU-sim, 24 SMs, Fermi-like architecture; crossbar NoC, bisection bandwidth GB/s; 6 MCs, memory bandwidth GB/s; 2 LLC slices per MC, total LLC size KB; GPUWattch, 40 nm technology node.

18 Performance results Heterogeneous workloads STP improvement
10.4% on average; ANTT improvement: 22% on average.

19 Energy results Homogeneous workloads Power reduction 25% on average

20 Performance neutral on average
Energy results for homogeneous workloads: STP improves due to cache-sensitive applications; STP degrades due to the time-varying execution behavior of a few applications; performance is neutral on average.

21 streaming multiprocessors (SMs)
SMK discussion. GPU with 20 streaming multiprocessors (SMs). SMK [ISCA'16] co-executes 2 applications on each SM. Maestro [ASPLOS'17] combines SMK with spatial multitasking, but still assumes even SM partitioning.

22 SMK discussion SMK severely degrades performance for some applications
CD-search and Maestro both improve performance; CD-search + Maestro achieves the highest performance: 24.4% improvement on average.

23 Conclusions We demonstrate the opportunity for uneven SM partitioning
Performance potential and power potential. We propose CD-search, which first classifies the workloads and then identifies a suitable SM partitioning. We propose an off-SM bandwidth model to classify apps. In the evaluation, CD-search improves system throughput by 10.4% for heterogeneous workloads and reduces power by 25% for homogeneous workloads.

24 Xia Zhao*, Zhiying Wang+, Lieven Eeckhout*
Classification-Driven Search for Effective SM Partitioning in Multitasking GPUs Xia Zhao*, Zhiying Wang+, Lieven Eeckhout* *Ghent University, +National University of Defense Technology The 32nd ACM International Conference on Supercomputing (ICS-2018)


