Automatically Exploiting Implicit Pipeline Parallelism from Multiple Dependent Kernels for GPUs
Gwangsun Kim, Jiyun Jeong, John Kim, Mark Stephenson

GPU Background
A kernel is launched as a grid of CTAs (Cooperative Thread Arrays, also called thread blocks), and each CTA consists of many threads. The GPU contains multiple SMs (Streaming Multiprocessors), each with cores and control logic; the CTAs of the grid are dispatched to the SMs for execution.
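As a concrete illustration of this hierarchy (a minimal sketch, not from the slides; the kernel and sizes are hypothetical), each thread uses its CTA index and thread index to pick its work, and the hardware is free to dispatch the CTAs onto any SM:

```cuda
// Minimal CUDA sketch of the kernel-grid / CTA / thread hierarchy (hypothetical kernel).
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    // blockIdx selects the CTA within the grid; threadIdx selects the thread within the CTA.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Launch a grid of CTAs with 256 threads per CTA; CTAs are scheduled onto SMs by hardware.
// vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```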

Different Stages of GPU Workloads
A GPU workload passes through five stages over time: Prelude, H2D (Host-to-Device) copy, Kernel execution, D2H (Device-to-Host) copy, and Postlude.
Prelude: input data initialization (e.g., reading from an SSD at ~5 GB/s).
Postlude: writing output data (e.g., to an SSD).
The H2D/D2H copies cross PCIe 3.0 at ~32 GB/s, while device memory (GDDR5) provides ~290 GB/s.
Non-kernel overhead takes ~77% of runtime on average [Zhang et al., PACT'15].
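As an illustration of these stages from the host's point of view, here is a minimal sketch (file names, sizes, and the kernel are hypothetical, and error checking is omitted):

```cuda
// Hypothetical end-to-end workload showing the five stages; error checking omitted.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);                              // pinned host memory
    cudaMalloc(&d_buf, bytes);

    FILE *in = fopen("input.bin", "rb");                        // Prelude: SSD -> host memory
    fread(h_buf, 1, bytes, in);
    fclose(in);

    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);    // H2D copy over PCIe
    scale<<<(n + 255) / 256, 256>>>(d_buf, n);                  // Kernel execution
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);    // D2H copy over PCIe

    FILE *out = fopen("output.bin", "wb");                      // Postlude: host memory -> SSD
    fwrite(h_buf, 1, bytes, out);
    fclose(out);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```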

Overlapping Stages
Prior work enabled overlapping the different stages:
GPUfs [ASPLOS'13]: file I/O from GPUs.
GPUnet [OSDI'14]: network I/O from GPUs.
The full/empty bit approach [HPCA'13] can overlap memcpy and kernel execution.
HSA (Heterogeneous System Architecture) allows page faults from GPUs.
Overlapping the prelude, copies, kernel, and postlude yields a significant reduction in runtime.

Limitation with Multiple Dependent Kernels
Many workloads have multiple dependent kernels (e.g., 2/3 of the workloads from the Rodinia and Parboil benchmark suites).
Dependent kernels are serialized by implicit synchronization barriers, so the speedup from overlapping stages is limited.
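For concreteness, a hypothetical chain of dependent kernels (a fragment-style sketch; kernel names and buffers are placeholders): each kernel consumes the previous kernel's output, so on a single stream they execute strictly back to back.

```cuda
// Hypothetical dependent-kernel chain: k1 reads what k0 wrote, k2 reads what k1 wrote, etc.
// The implicit barrier between launches means no CTA of k1 starts until every CTA of k0
// has finished, even if the data k1 needs first is already produced.
k0<<<grid, block>>>(d_in,   d_tmp0);
k1<<<grid, block>>>(d_tmp0, d_tmp1);
k2<<<grid, block>>>(d_tmp1, d_tmp2);
k3<<<grid, block>>>(d_tmp2, d_out);
cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);   // D2H copy waits for the whole chain
```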

Our Contributions
Overlap multiple dependent kernels without any programmer effort.
Coarse-grained Reference Counting-based Scoreboarding (CRCS): the enabling mechanism that tracks dependencies across kernels.
Pipeline Parallelism-aware CTA Scheduler (PPCS): properly schedules CTAs from the overlapped kernels, since CRCS alone gives only limited overlap.

Outline
Introduction/Background
Coarse-grained Reference Counting-based Scoreboarding (CRCS)
Pipeline Parallelism-aware CTA Scheduler (PPCS)
Methodology
Results
Conclusion

Enabling Overlapped Kernel Execution
Coarse-grained Reference Counting-based Scoreboarding (CRCS).
An existing scoreboard tracks dependencies between instructions through registers; CRCS tracks dependencies between CTAs through pages.
Each page in the address space carries two fields: the owner kernel (which kernel owns this page?) and a reference counter (how many CTAs from the current owner kernel access this page?).

Enabling Overlapped Kernel Execution (cont.)
The per-page reference counter is decremented (the −1 updates in the figure) as CTAs of the owner kernel finish.
Two types of information are needed:
How many CTAs access each page? (to initialize the counter) → obtained by pre-profiling.
Which pages are accessed by this CTA? (to decrement the counter) → obtained by post-profiling.
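A rough software analogy for the per-page CRCS state (the paper implements this in hardware; the struct and field names below are assumptions made purely for illustration):

```cuda
// Simplified, hypothetical view of per-page CRCS state.
struct PageEntry {
    int      owner_kernel;  // which kernel currently owns this page
    unsigned ref_count;     // how many CTAs of the owner kernel still access this page
};

// Pre-profiling initializes ref_count before a kernel's CTAs run;
// post-profiling decrements it as each CTA of the owner kernel finishes.
// A CTA of the next dependent kernel may use the page only once the
// previous owner's ref_count has dropped to zero.
```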

Profiling Memory Access Range
Determine the memory access range of each CTA. We use a sampling method [Kim et al., PPoPP'11]: only the "corner" threads of each CTA (for 1D, 2D, and 3D CTAs) are inspected, giving low overhead. The union of the ranges of multiple statements is taken, and all pages in the resulting range are assumed to be accessed by the CTA. Profiler kernels are generated through source-to-source translation.
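A hedged sketch of corner-thread sampling for a 1D CTA with an affine access pattern (the kernel name and output arrays are hypothetical; the real profiler kernels are generated by source-to-source translation):

```cuda
// Hypothetical 1D range profiler: for an access like a[blockIdx.x * blockDim.x + threadIdx.x],
// inspecting only the two "corner" threads of each CTA bounds the CTA's access range,
// assuming the index is monotonic in threadIdx.x.
__global__ void range_profiler(const float *a, const float **lo, const float **hi) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x == 0)              lo[blockIdx.x] = &a[i];  // lowest address this CTA touches
    if (threadIdx.x == blockDim.x - 1) hi[blockIdx.x] = &a[i];  // highest address this CTA touches
}
// All pages between lo and hi are conservatively assumed to be accessed by the CTA;
// for multiple statements, the union of the per-statement ranges is taken.
```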

Pre-profiling
Compute the reference count for each page. One pre-profiler kernel is launched for each original kernel; it executes before the corresponding kernel and can be overlapped with other kernels.
Each page has a reference count table holding, per kernel ID, a read reference count and a write reference count.
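Building on the corner-thread idea, a hypothetical pre-profiler kernel might initialize per-page counters like this (page size, table layout, and all names are assumptions; the real pre-profiler kernels are generated automatically):

```cuda
// Hypothetical pre-profiler for one original kernel: each CTA computes the page range it
// will read and increments that range's read reference counts. Write counts would be
// handled analogously. Layout and names are illustrative only.
__global__ void pre_profiler(size_t base,            // byte offset of the accessed array
                             size_t page_size,       // page granularity used by CRCS
                             unsigned *read_ref_count) {
    if (threadIdx.x != 0) return;                    // one representative thread per CTA
    size_t lo = base + (size_t)blockIdx.x * blockDim.x * sizeof(float);
    size_t hi = lo + (size_t)(blockDim.x - 1) * sizeof(float);
    for (size_t p = lo / page_size; p <= hi / page_size; ++p)
        atomicAdd(&read_ref_count[p], 1u);           // one more CTA will read page p
}
```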

Post-profiling
Decrements the reference counters after each CTA finishes. Keeping all CTA-page dependency information would be very costly (the maximum number of CTAs per kernel is ~10^19 on an NVIDIA Kepler GPU), so profiling is simply redone, but for the finishing CTA only. Each page tracks a remaining read CTA counter, a remaining write CTA counter, and the owner kernel.
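By analogy, post-profiling recomputes the same range for the single CTA that just finished and releases its references; a purely illustrative sketch (the exact mechanism and where it runs are not detailed in the slides, so treat every name below as an assumption):

```cuda
// Purely illustrative post-profiling for one finished CTA of the owner kernel:
// recompute its page range and decrement the remaining-CTA counters. Once a page's
// counters reach zero, CTAs of the next dependent kernel may start using that page.
void post_profile_cta(unsigned cta_id, unsigned cta_dim, size_t base, size_t page_size,
                      unsigned *remaining_read_ctas) {
    size_t lo = base + (size_t)cta_id * cta_dim * sizeof(float);
    size_t hi = lo + (size_t)(cta_dim - 1) * sizeof(float);
    for (size_t p = lo / page_size; p <= hi / page_size; ++p)
        remaining_read_ctas[p]--;                    // this CTA no longer needs page p
}
```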

Baseline Execution (No Overlap)
[Timeline figure: pages P0-P7 are initialized during the prelude; kernels 0-3 then each run all of their CTAs (two CTA slots on each of SM0 and SM1) over pages P0-P7 in sequence, and the postlude writes out P0-P7 at the end. No stage overlaps another.]

Execution with CRCS + FIFO Scheduler
[Timeline figure: with CRCS there are some overlaps between kernels, but with a FIFO CTA scheduler SMs remain idle, since two dependent kernels should not run on the same SM, and there is no kernel overlap during most of the prelude.]

Execution with CRCS + PPCS
PPCS schedules the kernel with the largest value of (page share) − (SM share).
Page share of a kernel: the portion of pages owned by the kernel out of all initialized pages.
SM share of a kernel: the portion of SMs running the kernel.

Execution with CRCS + PPCS (cont.)
[Timeline figure: with PPCS, CTAs of kernels 0-3 are interleaved across the SMs as the pages they need become available, so the kernels overlap with one another and with the prelude and postlude, and SM idle time is much smaller than with the FIFO scheduler.]
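The selection rule can be expressed compactly; a minimal host-side sketch of the stated policy (the data structures and names are hypothetical, and the real scheduler is implemented in hardware):

```cuda
// Hypothetical illustration of the PPCS rule: when an SM has a free CTA slot,
// pick the kernel with the largest (page share) - (SM share).
struct KernelState {
    int pages_owned;   // pages currently owned by this kernel
    int sms_running;   // SMs currently running CTAs of this kernel
};

int ppcs_pick(const KernelState *k, int num_kernels, int initialized_pages, int total_sms) {
    int best = -1;
    float best_score = -1.0e9f;
    for (int i = 0; i < num_kernels; ++i) {
        float page_share = (float)k[i].pages_owned / initialized_pages; // share of initialized pages it owns
        float sm_share   = (float)k[i].sms_running / total_sms;         // share of SMs already running it
        float score = page_share - sm_share;
        if (score > best_score) { best_score = score; best = i; }
    }
    return best;   // index of the kernel whose CTA should be scheduled next
}
```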

Methodology
Simulator: modified GPGPU-sim version 3.0.1.
Configuration: 64 SMs @ 700 MHz, 6 memory controllers.
I/O devices: two SSDs (throughput: 500 MB/s each); PCIe 3.0 at 15.75 GB/s in each direction.
Prior-work models: a model with no memory copy between host and device, and a model with perfect single-kernel overlap (first and last kernels only).
Profiler code generator based on clang from the LLVM compiler.
Workloads: focus on those with multiple dependent kernels.

Performance Result
[Speedup chart; labeled per-workload values: 50%, 51%, 33%, 39%, 19%, 14%, 2%.]

Impact of Kernel Portion in Runtime
The number of kernels is varied for Hotspot from Rodinia.
[Chart: labeled speedups of 51%, 39%, 30%, 27%, 26%, 17%, 9%, 11%, 12%, 14%, with the portion of kernel execution in runtime for each configuration shown in parentheses (9%, 66%, 23%, 33%, 50%).]

Overhead
Storage overhead: CRCS requires 1 KB per SM; PPCS requires 3.25 KB per GPU.
Assumptions: 64 SMs, a 128-entry TLB in each SM, and a maximum of 1024 kernels to overlap.
This amounts to a 0.77% storage overhead for the entire GPU.

Conclusion
System performance can be improved by overlapping the different stages of GPU workloads, but prior work cannot overlap multiple dependent kernels.
Coarse-grained Reference Counting-based Scoreboarding (CRCS) enables overlapped execution of multiple dependent kernels.
The Pipeline Parallelism-aware CTA Scheduler (PPCS) further improves performance by properly scheduling CTAs across kernels.
Combining CRCS with PPCS resulted in speedups of up to 67%, and 33% on average.