(1) Kernel Execution ©Sudhakar Yalamanchili and Jin Wang unless otherwise noted

(2) Objectives
Understand the operation of the major microarchitectural stages in launching and executing a kernel on a GPU device
Enumerate the generic information that must be maintained to track and sequence the execution of a large number of fine-grained threads
 What types of information are maintained at each step
 Where can we augment functionality, e.g., scheduling

(3) Reading
CUDA Programming Guide
 J. Wang, A. Sidelnik, N. Rubin, and S. Yalamanchili, “Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs,” IEEE/ACM International Symposium on Computer Architecture (ISCA), June 2015 (Section 2)

(4) Execution Model Overview
GPU: many-core architecture
CUDA: NVIDIA's GPU programming model
[Figure: a CUDA kernel is a 1D-3D grid of thread blocks (TBs); each TB is executed as warps of 32 threads, which may branch-diverge and later reconverge; the GPU device comprises multiple SMXs, each with a register file, cores, and L1 cache/shared memory, backed by device memory.]
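
To make the hierarchy concrete, here is a minimal CUDA example (illustrative only, not taken from the slides): the launch configuration below creates a 1D grid of TBs, and each 128-thread TB is executed on an SMX as four 32-thread warps.

    #include <cuda_runtime.h>

    // Each thread scales one element; threads past n simply do nothing
    // (this is the per-thread view of one TB in the grid).
    __global__ void scale(float *x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)
            x[i] = a * x[i];
    }

    int main() {
        int n = 1 << 20;
        float *x;
        cudaMalloc(&x, n * sizeof(float));

        dim3 block(128);                          // 128 threads = 4 warps per TB
        dim3 grid((n + block.x - 1) / block.x);   // enough TBs to cover n elements
        scale<<<grid, block>>>(x, 2.0f, n);       // kernel = 1D grid of TBs

        cudaDeviceSynchronize();
        cudaFree(x);
        return 0;
    }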

(5) Baseline Microarchitecture
Modeled after the Kepler GK110 (best guess)
14 streaming multiprocessors (SMX)
32 threads/warp
2048 maximum threads/SMX
16 maximum TBs/SMX
Remaining stats
 4 warp schedulers
 Registers
 64 KB L1/shared memory
From the Kepler GK110 Whitepaper

(6) Kernel Launch
[Figure: the host processor issues commands through CUDA streams within a CUDA context; streams are mapped to HW queues in the Kernel Management Unit (KMU) on the device, which performs kernel dispatch.]
Commands from the host are issued through streams
 Kernels in the same stream execute sequentially
 Kernels in different streams may execute concurrently
Streams are mapped to hardware queues in the kernel management unit (KMU) on the device
 Multiple streams may be mapped to the same queue, which serializes some kernels
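
A minimal host-side sketch of these stream semantics; kernelA/kernelB and the launch configurations are hypothetical, and whether the kernels actually overlap depends on resource availability and on how the streams map to HW queues.

    #include <cuda_runtime.h>

    // Hypothetical kernels used only to illustrate stream ordering.
    __global__ void kernelA(float *x) { if (threadIdx.x == 0) x[blockIdx.x] += 1.0f; }
    __global__ void kernelB(float *y) { if (threadIdx.x == 0) y[blockIdx.x] += 2.0f; }

    int main() {
        float *x, *y;
        cudaMalloc(&x, 1024 * sizeof(float));
        cudaMalloc(&y, 1024 * sizeof(float));

        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);

        // Same stream: the two kernelA launches execute in order, one after the other.
        kernelA<<<8, 128, 0, s0>>>(x);
        kernelA<<<8, 128, 0, s0>>>(x);

        // Different stream: kernelB may execute concurrently with the kernels in s0,
        // resources permitting, since it is dispatched from a different HW queue.
        kernelB<<<8, 128, 0, s1>>>(y);

        cudaDeviceSynchronize();
        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
        cudaFree(x); cudaFree(y);
        return 0;
    }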

(7) CUDA Context
Analogous to a CPU process
Encapsulates all CUDA resources and actions
 Streams
 Memory objects
 Kernels
Distinct address space
Created by the runtime during initialization and shared by all host threads in the same CPU process
Different CUDA contexts for different processes
 E.g., one per MPI rank
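
A minimal driver-API sketch of explicit context creation (the runtime API creates and shares a context implicitly per process); error checking is omitted.

    #include <cuda.h>   // CUDA driver API

    int main() {
        cuInit(0);

        CUdevice dev;
        cuDeviceGet(&dev, 0);

        // Each process (e.g., each MPI rank) ends up with its own context;
        // with the runtime API this happens implicitly on first CUDA use.
        CUcontext ctx;
        cuCtxCreate(&ctx, 0, dev);

        // ... allocate memory, create streams, and launch kernels within this context ...

        cuCtxDestroy(ctx);
        return 0;
    }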

(8) Baseline Microarchitecture: KMU
Kernels are dispatched from the heads of the HW queues
The next kernel from a HWQ cannot be dispatched before the prior kernel from the same queue completes execution (today)

(9) Kernel Distributor
Same number of entries as HW queues
 This is therefore the maximum number of independent kernels that can be dispatched by the KMU

(10) Kernel Distributor (2)
The kernel distributor provides one entry at a time, on a FIFO basis, to the SMX scheduler
Each entry holds:
 Entry PC
 Grid and CTA dimension info
 Parameter addresses
 #CTAs to complete
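
A hypothetical C-style sketch of what a Kernel Distributor Entry (KDE) might hold, mirroring the fields listed above; the actual hardware layout is not public.

    #include <cstdint>
    #include <cuda_runtime.h>   // for dim3

    // Illustrative only: one entry per dispatched kernel.
    struct KernelDistributorEntry {
        uint64_t entry_pc;        // start PC of the kernel's code
        dim3     grid_dim;        // grid dimensions (in CTAs/TBs)
        dim3     cta_dim;         // CTA (thread block) dimensions (in threads)
        uint64_t param_addr;      // address of the kernel parameters in device memory
        uint32_t ctas_remaining;  // #CTAs still to complete before the kernel finishes
    };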

(11) Kernel Parameters
Managed by the runtime and driver
Dynamically allocated in device memory and mapped to the constant parameter address space
4 KB per kernel launch

    .entry foo (.param .u32 len)
    {
        .reg .u32 %n;
        ld.param.u32 %n, [len];   // load from the parameter address space
        ...
    }
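
On the host side, the arguments supplied at launch are what the runtime and driver stage into this parameter space; a small illustrative example (hypothetical kernel and struct) is sketched below.

    #include <cuda_runtime.h>

    struct Params { int len; float scale; };

    // The by-value struct and the pointer both arrive via the parameter space
    // that the PTX above reads with ld.param.
    __global__ void apply(float *data, Params p) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < p.len)
            data[i] *= p.scale;
    }

    int main() {
        float *d;
        cudaMalloc(&d, 1024 * sizeof(float));
        Params p = {1024, 0.5f};
        apply<<<8, 128>>>(d, p);   // runtime/driver copy the arguments for the launch
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }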

(12) SMX Scheduler
Initialize control registers
Distribute thread blocks (TBs) to SMXs
 Limited by #registers, shared memory, #TBs, #threads
Update KDE and SMX scheduler entries as kernels complete execution
[SMX Scheduler Control Registers (SSCR): KDEI = KDE index, NextBL = next TB to be scheduled]

(13) SMX Scheduler: TB Distribution
Distribute TBs every cycle
 Get the next TB from the control registers
 Determine the destination SMX
  o Round-robin (today)
 Distribute if enough resources are available, and update the control registers
 Otherwise wait until the next cycle
[SSCR: KDEI = KDE index, NextBL = next TB to be scheduled]
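
A hypothetical, simulator-style sketch of this per-cycle distribution loop; the SSCR fields and round-robin choice follow the slide, but the resource check (a simple TB-slot count here) is an illustrative assumption.

    #include <cstdint>
    #include <vector>

    struct SSCR { uint32_t kdei; uint32_t next_bl; };   // KDE index, next TB to schedule

    struct SMX {
        int free_tb_slots = 16;                          // max 16 TBs per SMX (Kepler)
        bool has_resources_for(uint32_t /*kdei*/) const {
            return free_tb_slots > 0;                    // register/shared-memory checks omitted
        }
        void accept_tb(uint32_t kdei, uint32_t tb_id) {  // record in a TBCR, start the TB
            (void)kdei; (void)tb_id;
            free_tb_slots--;
        }
    };

    // Called once per scheduler cycle.
    void distribute_one_tb(SSCR& sscr, std::vector<SMX>& smxs, uint32_t num_tbs) {
        static size_t rr = 0;                            // round-robin pointer over SMXs
        if (sscr.next_bl >= num_tbs) return;             // all TBs of this kernel issued

        SMX& target = smxs[rr % smxs.size()];            // round-robin destination (today)
        rr++;
        if (target.has_resources_for(sscr.kdei)) {
            target.accept_tb(sscr.kdei, sscr.next_bl);
            sscr.next_bl++;                              // update control registers
        }
        // otherwise: wait and retry next cycle
    }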

(14) SMX Scheduler: TB Control
Each SMX has a set of TBCRs, one for each TB executing on the SMX (max. 16)
 Record the KDEI and TB ID for this TB
 After TB execution finishes, notify the SMX scheduler
Supports multiple kernels
[Thread Block Control Registers (TBCR) per SMX: KDEI = KDE index, BLKID = scheduled TB ID (in execution)]

(15) Streaming Multiprocessor (SMX)
Thread blocks are organized as warps (here, 32 threads)
Select from a pool of ready warps and issue an instruction to the 32-lane SMX
Interleave warps for fine-grained multithreading
Control-flow divergence and memory divergence
[Figure: a core, lane, or streaming processor (SP): dispatch, ALU, FPU, LD/ST, result queue]

(16) Warp Management
Warp context
 PC
 Active mask (one bit per thread)
 Control-divergence stack
 Barrier information
Warp contexts are stored in a set of hardware registers per SMX for fast warp context switching
 Max. 64 warp context registers (one for each warp)
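
A hypothetical sketch of the per-warp context as a C++ struct, mirroring the fields listed above; real hardware keeps this state in dedicated registers rather than a memory-backed structure.

    #include <cstdint>
    #include <vector>

    // Illustrative only: one entry pushed per control divergence, popped at reconvergence.
    struct DivergenceStackEntry {
        uint64_t reconverge_pc;   // where the diverged paths rejoin
        uint64_t next_pc;         // PC of the pending (not-yet-executed) path
        uint32_t active_mask;     // threads that will execute the pending path
    };

    struct WarpContext {
        uint64_t pc;                                   // next instruction for the warp
        uint32_t active_mask;                          // one bit per thread (32 threads/warp)
        std::vector<DivergenceStackEntry> div_stack;   // control-divergence (SIMT) stack
        bool     at_barrier;                           // waiting at __syncthreads()
    };

    // An SMX holds up to 64 such contexts, one per resident warp.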

(17) Warp Scheduling
Ready warps
 The next instruction has no unresolved dependencies
 E.g., its memory operands are ready
Round-robin
 Switch to the next ready warp after executing one instruction from each warp
Greedy-Then-Oldest (GTO)
 Execute the current warp until there is an unresolved dependency
 Then move to the oldest warp in the ready-warp pool
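
A hypothetical sketch of GTO selection in simulator style; the Warp fields and the pool representation are illustrative assumptions, not hardware detail.

    #include <cstdint>
    #include <vector>

    struct Warp {
        int      id;
        uint64_t age;        // lower = older (assigned when the warp is created)
        bool     is_ready;   // next instruction has no unresolved dependency
    };

    // Pick the warp to issue from this cycle.
    Warp* gto_select(std::vector<Warp*>& warps, Warp* current) {
        // Greedy: keep issuing from the current warp while it remains ready.
        if (current && current->is_ready)
            return current;

        // Otherwise: fall back to the oldest ready warp in the pool.
        Warp* oldest = nullptr;
        for (Warp* w : warps)
            if (w->is_ready && (!oldest || w->age < oldest->age))
                oldest = w;
        return oldest;   // nullptr if no warp is ready this cycle
    }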

(18) Concurrent Kernel Execution
If a kernel does not occupy all SMXs, distribute TBs from the next kernel
When the SMX scheduler issues the next TB
 It can issue from the next kernel in the KDE FIFO if the current kernel has all of its TBs issued
On each SMX there can be TBs from different kernels
 Recall the per-SMX TBCR: KDEI = KDE index, BLKID = scheduled TB ID (in execution)

(19) Executing Kernels
[Figure: kernels from the kernel distributor are decomposed into TBs and warps on the SMXs, illustrating concurrent kernel execution.]

(20)-(26) Occupancy, TBs, Warps, & Utilization
One TB has 64 threads, i.e., 2 warps
On a K20c (GK110) GPU: 13 SMXs, max 64 warps/SMX, max 32 concurrent kernels
[Animation across slides 20-26: 100 such small kernels (Kernel 1 ... Kernel 100) are launched; TBs are distributed round-robin, Kernel 1 to SMX0, Kernel 2 to SMX1, ..., Kernel 13 to SMX12, then Kernel 14 wraps back to SMX0, and so on up to Kernel 32; Kernel 33 must wait until a previous kernel finishes because only 32 kernels can be resident concurrently.]
Achieved SMX occupancy: 32 kernels * 2 warps/kernel / (64 warps/SMX * 13 SMXs) = 64/832 ≈ 7.7%
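
The same arithmetic can be reproduced on the host by querying the device limits; the kernel and warp counts below simply assume the slides' scenario (32 concurrent kernels of one 64-thread TB each), not a real launch.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int smx_count     = prop.multiProcessorCount;                         // 13 on K20c
        int warps_per_smx = prop.maxThreadsPerMultiProcessor / prop.warpSize; // 64 on GK110

        int concurrent_kernels = 32;   // concurrent-kernel limit assumed from the slides
        int warps_per_kernel   = 2;    // one 64-thread TB per kernel

        double occupancy = (double)(concurrent_kernels * warps_per_kernel)
                         / (double)(warps_per_smx * smx_count);
        printf("Achieved occupancy: %.1f%%\n", occupancy * 100.0);   // ~7.7 on a K20c
        return 0;
    }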

(27) Balancing Resources
Limit: 32 concurrent kernels
[Figure: with only a single small TB per kernel, the SMXs are underutilized.]

(28) Performance: Kernel Launch
For a synchronous host kernel launch
 Launch time is ~10 us (measured on K20c)
Launch time breakdown
 Software:
  o Runtime: resource allocation, e.g., parameters, streams, kernel information
  o Driver: allocation of data structures for SW-HW interaction
 Hardware:
  o Kernel scheduling
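
A simple way to reproduce this kind of measurement is to time a synchronous launch of an empty kernel, as sketched below; the ~10 us figure is specific to a K20c and will vary with GPU, driver, and host CPU.

    #include <cstdio>
    #include <chrono>
    #include <cuda_runtime.h>

    __global__ void emptyKernel() {}   // no work: the elapsed time is launch overhead

    int main() {
        emptyKernel<<<1, 32>>>();      // warm up: the first launch pays extra init cost
        cudaDeviceSynchronize();

        const int iters = 1000;
        auto t0 = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < iters; ++i) {
            emptyKernel<<<1, 32>>>();
            cudaDeviceSynchronize();   // make each launch synchronous
        }
        auto t1 = std::chrono::high_resolution_clock::now();

        double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
        printf("Average synchronous launch time: %.2f us\n", us);
        return 0;
    }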