Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming. Qiumin Xu*, Hyeran Jeon✝, Keunsoo Kim❖, Won Woo Ro❖, Murali Annavaram*. *University of Southern California, ✝San Jose State University, ❖Yonsei University
2 GPU-Accelerated Computing Platforms [1]: many vendors now provide cloud computing with GPU services on CPU + GPU servers, so multiple applications need to run on the same server. [1] GTC 2016, Jen-Hsun Huang, "GPU-Accelerated Computing Changing the World"
3 Ever-increasing amount of GPU resources: the number of execution resources in each GPU keeps growing, so resource underutilization is a growing problem. Idea: share resources across multiple kernels.
4 Resource usage varies across applications: register file and execution unit utilization is imbalanced, and the breakdown of pipeline stalls differs across kernels. Idea: exploit diverse application characteristics to alleviate the resource imbalance.
5 GPU Background. CTA: Cooperative Thread Array, a.k.a. thread block.
6 Previous work: spatial multitasking (inter-SM slicing) [1]. Figure: with spatial multitasking, each of the four SMs runs only one of Kernel 1 and Kernel 2 (per-SM utilizations of 75% and 100%), totaling 350%; with intra-SM slicing, each SM co-runs both kernels and the total reaches 400%. [1] J. T. Adriaens, K. Compton, N. S. Kim and M. J. Schulte, "The case for GPGPU spatial multitasking," HPCA 2012.
7 Example: a memory-intensive kernel causes underutilization inside an SM. All the warps from the memory-intensive Kernel 1 are stalled on the memory bottleneck: the LDST path is full while the INT/FP/SFU execution units sit idle. Can we use the idle resources?
8 Warped-Slicer: co-locate with a compute-intensive kernel. While the warps of the memory-intensive Kernel 1 are stalled, warps from the compute-intensive Kernel 2 execute on the otherwise-idle execution units, so both the memory pipeline and the compute units stay busy.
9 Intra-SM slicing strategies. Baseline allocation strategy: state-of-the-art GPUs provide basic hardware support for co-locating multiple kernels in the same SM and follow a leftover allocation policy. Another reasonable comparison point: even partitioning. Our main contribution: Warped-Slicer, a more efficient resource-sharing technique than both of the above policies.
10 Warped-Slicer: efficient resource partitioning. Unlike even partitioning, Warped-Slicer partitions resources unevenly: it can assign more of the register file to kernel A while giving more shared memory to kernel B, maximizing the usage of both resources.
11 Warped-Slicer techniques: intelligently decide whether to use intra-SM or inter-SM slicing.
12 Warped-Slicer techniques: Part 1, kernel co-location optimization; Part 2, water-filling algorithm; Part 3, dynamic profiling.
13 Performance-vs-CTA curves. Compute-intensive: more CTAs are better. Memory-intensive: performance saturates quickly. L1-cache-sensitive: performance decreases when the L1 cache is over-subscribed. Challenge: performance is NOT directly proportional to resources.
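To make the three curve shapes concrete, here is a small illustrative Python model; the functions and numbers below are invented for illustration only, whereas the real curves come from the dynamic profiling in Part 3:

```python
# Illustrative (made-up) performance-vs-CTA curves for the three kernel classes
# described above. Real curves are measured by dynamic profiling.

def compute_intensive(ctas):
    # More CTAs keep more warps ready to issue: performance keeps scaling.
    return 10.0 * ctas

def memory_intensive(ctas):
    # Performance saturates quickly once memory bandwidth becomes the bottleneck.
    return min(10.0 * ctas, 25.0)

def l1_cache_sensitive(ctas):
    # Performance peaks, then drops once too many CTAs thrash the L1 cache.
    return 10.0 * ctas if ctas <= 4 else 40.0 - 5.0 * (ctas - 4)

for c in range(1, 9):
    print(c, compute_intensive(c), memory_intensive(c), l1_cache_sensitive(c))
```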
14 Part 1: Kernel co-location optimization function. We propose to assign T_i thread blocks to kernel i by optimizing a MAX-MIN performance objective subject to the SM's resource limits (a reconstruction is sketched below). P: performance (IPC). R: resources, including the register file, shared memory, maximum threads, and the maximum number of thread blocks in each SM.
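As a hedged reconstruction of the objective that the slide leaves implicit (the water-filling slide later calls it a MAX-MIN problem), the formulation is likely of the following shape; the exact form in the paper may differ:

```latex
\[
\max_{T_1,\dots,T_K} \; \min_{i} \, P_i(T_i)
\quad \text{s.t.} \quad
\sum_{i=1}^{K} T_i \, r_i^{(k)} \;\le\; R^{(k)} \;\; \text{for every resource type } k
\]
% T_i: thread blocks assigned to kernel i; P_i(T_i): its IPC with T_i resident blocks;
% r_i^{(k)}: per-thread-block demand of kernel i for resource k (registers, shared
% memory, threads, thread-block slots); R^{(k)}: per-SM capacity of resource k.
```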
15 Naïve brute-force algorithm: search all combinations of allocations. With at most N CTAs from each of K kernels in an SM, the brute-force search has time complexity O(N^K).
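A minimal sketch of the brute-force search, showing why the search space is O(N^K); `perf_curves` and the `fits` resource check are hypothetical helpers, and the MAX-MIN objective is assumed from the water-filling slide:

```python
from itertools import product

def brute_force_partition(perf_curves, fits, max_ctas):
    """Exhaustively try every CTA allocation (max_ctas ** len(perf_curves)
    combinations) and keep the one maximizing the minimum per-kernel IPC.
    perf_curves[i](t): IPC of kernel i when it runs t CTAs in the SM.
    fits(alloc): True if the allocation fits within the SM's resources."""
    best_alloc, best_score = None, float("-inf")
    for alloc in product(range(1, max_ctas + 1), repeat=len(perf_curves)):
        if not fits(alloc):
            continue
        score = min(curve(t) for curve, t in zip(perf_curves, alloc))
        if score > best_score:
            best_alloc, best_score = alloc, score
    return best_alloc
```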
16 Part 2: Water-filling algorithm. We propose an efficient greedy algorithm to solve the MAX-MIN problem. General idea: repeatedly assign the minimum amount of the remaining resources to the kernel with the MINIMUM performance, so that that kernel's performance increases (see the sketch after the iteration walkthrough below).
17-23 Part 2: Water-filling algorithm, iterations 1-6 (figures). Example with three kernels and MaxCTA = 8: starting from one CTA per kernel (CurCTA = 3), each iteration gives one more CTA to the kernel with the minimum current performance and updates the per-kernel performance and CTAs-allocated bars, until no further allocation helps. The resulting complexity is linear in the total number of CTAs assigned, far below the O(N^K) brute-force search.
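The iterative procedure above can be summarized by the following minimal Python sketch; `perf_curves` and the `fits` resource check are hypothetical helpers, and this is a simplification of the hardware mechanism rather than the paper's exact implementation:

```python
def water_filling(perf_curves, fits, max_ctas):
    """Greedy MAX-MIN partitioning: repeatedly give one more CTA to the kernel
    with the lowest current performance, as long as the allocation still fits
    in the SM and the extra CTA actually improves that kernel's performance."""
    alloc = [1] * len(perf_curves)              # start with one CTA per kernel
    if not fits(alloc):
        return None                             # cannot co-locate: use inter-SM slicing
    while True:
        # Kernel with the minimum performance under the current allocation.
        i = min(range(len(alloc)), key=lambda k: perf_curves[k](alloc[k]))
        if alloc[i] == max_ctas:
            break
        trial = list(alloc)
        trial[i] += 1
        if not fits(trial) or perf_curves[i](trial[i]) <= perf_curves[i](alloc[i]):
            break                               # no room or no benefit: stop
        alloc = trial
    return alloc
```

With K kernels and at most N CTAs per kernel, the loop body runs at most N·K times, in contrast to the O(N^K) combinations examined by the brute-force search.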
24 Part 2: Water-filling algorithm. Limitation: we need to set an upper bound on the acceptable performance loss. Solution: set a threshold (60% for two kernels) and fall back on spatial multitasking when the performance loss exceeds the threshold.
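A minimal sketch of that fallback decision, assuming the loss is measured against each kernel's performance when it has the SM to itself; the helper name and its inputs are illustrative:

```python
def choose_slicing(shared_perf, isolated_perf, loss_threshold=0.60):
    """Fall back to inter-SM (spatial) multitasking when any co-located kernel's
    performance loss relative to running alone exceeds the threshold (the 60%
    value quoted above for two kernels); otherwise keep intra-SM slicing."""
    for shared, alone in zip(shared_perf, isolated_perf):
        loss = 1.0 - shared / alone
        if loss > loss_threshold:
            return "inter-SM"
    return "intra-SM"
```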
25 How do we get the CTA-count-vs-IPC curve? Dynamic profiling: measure performance as different numbers of CTAs are assigned. Recall that GPUs have many identical SMs, so we can profile all the points concurrently!
26 Part 3: Dynamic profiling. We split the SMs into two equal groups, one per kernel, and start by assigning one CTA to SM0.
27 Part 3: Dynamic profiling: 2 CTAs on SM1.
28 Part 3: Dynamic profiling: 3 CTAs on SM2.
29 Part 3: Dynamic profiling: 4 CTAs on SM3. Now we can plot the performance-vs-CTA curves for both kernels. For more than two kernels, one SM is time-shared to run different numbers of CTAs sequentially.
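A sketch of how such a profiling plan could be laid out so that both kernels' CTA-count-vs-IPC points are sampled concurrently; the SM counts and the group assignment below are illustrative, not the paper's exact mechanism:

```python
def build_profiling_plan(num_sms, max_ctas_per_kernel):
    """Split the SMs into two equal groups, one per kernel; within a group,
    the j-th SM runs j+1 CTAs of its kernel, so every point of the
    CTA-count-vs-IPC curve is sampled at the same time (assumes num_sms >= 2)."""
    half = num_sms // 2
    plan = {}
    for sm in range(num_sms):
        kernel = 0 if sm < half else 1
        ctas = (sm % half) + 1
        plan[sm] = (kernel, min(ctas, max_ctas_per_kernel))
    return plan

# Example: 8 SMs, up to 4 CTAs -> SM0..SM3 profile kernel 0 with 1..4 CTAs,
# SM4..SM7 profile kernel 1 with 1..4 CTAs.
print(build_profiling_plan(8, 4))
```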
30 How do we deal with interference? Challenge: the profiled curves may be distorted, because profiling assumes independent accesses per SM while L2 and memory accesses are in fact shared across all SMs. We design a scaling factor* to correct the curves. *Please refer to the detailed derivation in the paper.
31 Evaluation Methodology. Simulation environment: GPGPU-Sim v3.2.2; the simulator front end was extensively modified to support multiprogramming. Benchmarks: 10 GPU applications from the image processing, math, data mining, scientific computing, and finance domains. Experiments: the total number of instructions executed for each workload is kept the same across the different strategies. Oracle partitioning: obtained from thousands of experiments covering all possible allocations.
32 Performance results (2 kernels): inter-SM (spatial) multitasking is 6% better than the Leftover policy, while Warped-Slicer is 23% better; Warped-Slicer is within 4% of Oracle partitioning.
33 Resource utilization results (2 kernels): 10-20% better resource utilization and an 18% reduction in stall cycles.
34 Performance results (3 kernels): Warped-Slicer outperforms even partitioning by 21%.
35 Summary. We studied various intra-SM slicing techniques for GPU multiprogramming. We proposed Warped-Slicer, which achieves a 23% performance improvement for 2 kernels and a 40% improvement for 3 kernels over the Leftover policy. We implemented the water-filling algorithm and the associated counters in Verilog; the hardware overhead is minimal.
University of Southern California