Kernel Execution ©Sudhakar Yalamanchili and Jin Wang unless otherwise noted

Objectives
Understand the operation of the major microarchitectural stages in launching and executing a kernel on a GPU device
Enumerate the generic information that must be maintained to track and sequence the execution of a large number of fine-grained threads
What types of information are maintained at each step
Where we can augment functionality, e.g., scheduling

Reading
CUDA Programming Guide: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#abstract
J. Wang, A. Sidelnik, N. Rubin, and S. Yalamanchili, "Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs," IEEE/ACM International Symposium on Computer Architecture (ISCA), June 2015 (Section 2)
You can find all papers, patents, and documents in Tsquare under the corresponding module directory

Execution Model Overview
GPU: many-core architecture; CUDA: NVIDIA's GPU programming model
[Figure: a CUDA kernel is a 1D-3D grid of thread blocks (TBs); each TB is divided into warps of 32 threads, which may branch-diverge and reconverge; TBs are distributed across the SMXs on the GPU, each with a register file, cores, and L1 cache/shared memory, backed by device memory.]
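As a concrete illustration of this hierarchy, here is a minimal CUDA sketch (the kernel and sizes are hypothetical, chosen only for illustration): it launches a 1D grid of thread blocks, and each block executes on one SMX as a set of 32-thread warps.

#include <cuda_runtime.h>

// Hypothetical kernel: each thread computes its global index from the
// grid/block hierarchy described in the overview slide.
__global__ void scaleKernel(float *data, float alpha, int n) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // thread position in the grid
    if (gid < n)
        data[gid] *= alpha;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // 256 threads per thread block (TB) = 8 warps of 32 threads each.
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);   // 1D grid of TBs
    scaleKernel<<<grid, block>>>(d_data, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}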

Baseline Microarchitecture
Modeled after the Kepler GK110 (best guess)
14 streaming multiprocessors (SMX)
32 threads/warp
2048 maximum threads/SMX
16 maximum TBs/SMX
Remaining stats: 4 warp schedulers/SMX, 65,536 registers/SMX, 64 KB L1 cache/shared memory per SMX
From the Kepler GK110 whitepaper
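Many of these per-device limits can be queried at runtime; a minimal sketch using the CUDA runtime API (assuming device 0 is the Kepler-class GPU of interest):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("SMX count             : %d\n",  prop.multiProcessorCount);
    printf("Warp size             : %d\n",  prop.warpSize);
    printf("Max threads per SMX   : %d\n",  prop.maxThreadsPerMultiProcessor);
    printf("Registers per SMX     : %d\n",  prop.regsPerMultiprocessor);
    printf("Shared memory per SMX : %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("Concurrent kernels    : %d\n",  prop.concurrentKernels);
    return 0;
}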

Kernel Launch
Commands by the host are issued through CUDA streams
Kernels in the same stream are executed sequentially
Kernels in different streams may be executed concurrently
Streams are mapped to hardware queues (HWQs) in the device, in the kernel management unit (KMU)
Multiple streams mapped to the same queue serialize some kernels
[Figure: the host processor dispatches kernels through CUDA streams in a CUDA context to the hardware queues of the Kernel Management Unit on the device.]
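A minimal host-side sketch of this behavior (kernelA and kernelB are hypothetical): launches into the same stream serialize, while launches into different streams may overlap if resources and queue mapping allow.

#include <cuda_runtime.h>

__global__ void kernelA() { /* ... */ }
__global__ void kernelB() { /* ... */ }

int main() {
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Same stream: kernelB does not start until kernelA completes.
    kernelA<<<64, 256, 0, s0>>>();
    kernelB<<<64, 256, 0, s0>>>();

    // Different streams: the device may execute these concurrently,
    // subject to SMX resources and the stream-to-HWQ mapping.
    kernelA<<<64, 256, 0, s0>>>();
    kernelB<<<64, 256, 0, s1>>>();

    cudaDeviceSynchronize();
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    return 0;
}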

Devices and the CUDA Context
Each device is accessed through a construct called its context, which holds:
Allocated memory
Streams
Loaded modules
etc.
Most applications just use the runtime API (a stateful API with a primary context), which hides context management; the driver API provides explicit context management

CUDA Context
Each CUDA device has a context: consider multi-CPU configurations
Encapsulates all CUDA resources and actions: streams, memory objects, kernels
Distinct address space
Created by the runtime during initialization, on the first API call, and shared by all host threads in the same CPU process
Different CUDA contexts for different processes, e.g., MPI ranks

Multiple Contexts
Each thread accesses a device through its context; a single thread may swap amongst contexts
CPU thread management is independent of GPU management
[Figure: several host threads, each bound to its own context and GPU; a single host thread switching among multiple contexts/GPUs.]

More on Contexts
A context is created on the device selected by a cudaSetDevice() call (for multi-GPU programming)
A GPU can belong to multiple contexts but can execute calls from only one context at a time; the device driver manages switching among contexts
Kernels launched in different contexts cannot execute concurrently
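A minimal multi-GPU sketch (the kernel work() is hypothetical): each cudaSetDevice() call selects the device whose primary context subsequent runtime calls use.

#include <cuda_runtime.h>

__global__ void work(float *buf) { /* ... */ }

int main() {
    int numDevices = 0;
    cudaGetDeviceCount(&numDevices);

    for (int dev = 0; dev < numDevices; ++dev) {
        cudaSetDevice(dev);                        // subsequent calls target this device's primary context
        float *d_buf;
        cudaMalloc(&d_buf, 1024 * sizeof(float));  // allocation lives in device dev's context
        work<<<32, 128>>>(d_buf);                  // kernel launched into device dev's context
        cudaDeviceSynchronize();                   // wait before releasing the buffer
        cudaFree(d_buf);
    }
    return 0;
}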

Baseline Microarchitecture: KMU
Kernels are dispatched from the head of the HWQs
The next kernel from the same HWQ cannot be dispatched before the prior kernel from that queue completes execution (today)

Kernel Distributor
Has the same number of entries as HWQs, i.e., the maximum number of independent kernels that can be dispatched by the KMU

Kernel Distributor (2)
Each Kernel Distributor Entry (KDE) holds: entry PC, #CTAs to complete, grid and CTA dimension info, and parameter addresses
The kernel distributor provides one entry on a FIFO basis to the SMX scheduler

Kernel Parameters
Parameter space (.param) is used to pass input arguments from the host to the kernel
Managed by the runtime and driver
Dynamically allocated in device memory and mapped to the constant parameter address space; the exact location is implementation specific
4KB per kernel launch

.entry foo ( .param .u32 len )
{
    .reg .u32 %n;
    ld.param.u32 %n, [len];   // load from parameter address space
    ...
}
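On the host side, the kernel arguments supplied at launch are what the runtime copies into this parameter space; a minimal sketch (foo is a hypothetical kernel matching the PTX above):

#include <cuda_runtime.h>

// Hypothetical kernel corresponding to .entry foo(.param .u32 len) above:
// the argument 'len' reaches the device through the .param space.
__global__ void foo(unsigned int len) {
    unsigned int n = len;   // read of 'len' compiles to ld.param.u32 from the parameter space
    // ... use n ...
}

int main() {
    unsigned int len = 1024;
    foo<<<1, 32>>>(len);    // the runtime copies 'len' into the kernel's parameter buffer
    cudaDeviceSynchronize();
    return 0;
}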

SMX Scheduler
SMX Scheduler Control Registers (SSCR): KDEI (KDE index) and NextBL (next TB to be scheduled)
Initialize control registers
Distribute thread blocks (TBs) to SMXs, limited by #registers, shared memory, #TBs, and #threads
Update KDE and SMX scheduler entries as kernels complete execution

SMX Scheduler: TB Distribution
Distribute TBs every cycle (see the sketch below):
Get the next TB from the control registers (KDEI, NextBL)
Determine the destination SMX: round-robin (today)
Distribute if enough resources are available and update the control registers; otherwise wait until the next cycle
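A rough software model of this per-cycle distribution step, offered only as a sketch (the structures SSCR, KDE, and SMXState are hypothetical; this is not NVIDIA's implementation):

#include <vector>

struct KDE      { int totalTBs; /* grid/CTA dims, PC, parameter addresses, ... */ };
struct SSCR     { int kdei; int nextBL; };          // KDE index, next TB to schedule
struct SMXState { int freeRegs, freeSmem, freeTBs, freeThreads; };

// Hypothetical per-cycle TB distribution: pick the next SMX round-robin and
// issue the next TB of the current kernel if that SMX has enough resources.
bool distributeOneTB(SSCR &ctrl, const std::vector<KDE> &kdes,
                     std::vector<SMXState> &smxs, int &rrNext,
                     int regsPerTB, int smemPerTB, int threadsPerTB) {
    if (ctrl.nextBL >= kdes[ctrl.kdei].totalTBs)
        return false;                                // all TBs of this kernel issued

    int smx = rrNext;                                // round-robin destination SMX
    rrNext = (rrNext + 1) % (int)smxs.size();

    SMXState &s = smxs[smx];
    if (s.freeRegs < regsPerTB || s.freeSmem < smemPerTB ||
        s.freeTBs < 1 || s.freeThreads < threadsPerTB)
        return false;                                // not enough resources: wait until next cycle

    s.freeRegs -= regsPerTB;  s.freeSmem    -= smemPerTB;
    s.freeTBs  -= 1;          s.freeThreads -= threadsPerTB;
    ctrl.nextBL++;                                   // update control registers
    return true;
}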

SMX Scheduler: TB Control
Each SMX has a set of Thread Block Control Registers (TBCRs), one for each TB executing on that SMX (max. 16)
Each TBCR records the KDEI (KDE index) and BLKID (scheduled TB ID, in execution) for that TB
After TB execution finishes, the SMX notifies the SMX scheduler
Support for multiple kernels

Streaming Multiprocessor (SMX)
[Figure: SMX pipeline with dispatch, ALU, FPU, and LD/ST units feeding a result queue; each lane is a core, lane, or stream processor (SP).]
Thread blocks are organized as warps (here, 32 threads)
Select from a pool of ready warps and issue an instruction to the 32-lane SMX
Interleave warps for fine-grain multithreading
Control flow divergence and memory divergence

Warp Management
Warp context:
Active mask (one bit per thread)
Control divergence stack
Barrier information
Warp contexts are stored in a set of hardware registers per SMX for fast warp context switching
Max. 64 warp context registers (one for each warp)

Warp Scheduling
Ready warps: a warp is ready when there is no unresolved dependency for its next instruction, e.g., the memory operand is ready
Round-robin: switch to the next ready warp after executing one instruction in each warp
Greedy-Then-Oldest (GTO): execute the current warp until there is an unresolved dependency, then move to the oldest warp in the ready warp pool
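A rough sketch of GTO selection (the Warp structure is hypothetical; this is a simulation-style model, not hardware):

#include <vector>

struct Warp { long age; bool ready; };   // ready = no unresolved dependency for the next instruction

// Hypothetical Greedy-Then-Oldest (GTO) selection: keep issuing from the
// current warp while it is ready; otherwise pick the oldest ready warp.
int selectWarpGTO(const std::vector<Warp> &warps, int current) {
    if (current >= 0 && warps[current].ready)
        return current;                              // greedy: stay on the current warp
    int oldest = -1;
    for (int i = 0; i < (int)warps.size(); ++i)
        if (warps[i].ready && (oldest < 0 || warps[i].age < warps[oldest].age))
            oldest = i;
    return oldest;                                   // oldest ready warp, or -1 if none is ready
}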

Concurrent Kernel Execution
If a kernel does not occupy all SMXs, distribute TBs from the next kernel
When the SMX scheduler issues the next TB, it can issue from the next kernel in the KDE FIFO once the current kernel has had all of its TBs issued
On each SMX there can be TBs from different kernels (recall the per-SMX TBCRs, which record the KDEI and the scheduled TB ID for each TB in execution)

Executing Kernels
[Figure: concurrent kernel execution, with TBs and their warps from several kernels resident on the SMXs at the same time.]

Occupancy, TBs, Warps, & Utilization
Example: each TB has 64 threads, or 2 warps; on a K20c GK110 GPU there are 13 SMXs, max 64 warps/SMX, and max 32 concurrent kernels
Kernels 1-13 are distributed to SMX0-SMX12, then kernels 14-26, then kernels 27-32; kernel 33 and beyond (up to kernel 100) must wait until a previous kernel finishes because of the 32-concurrent-kernel limit
Achieved SMX occupancy: 32 kernels * 2 warps/kernel / (64 warps/SMX * 13 SMXs) = 0.077
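The occupancy arithmetic in this example, as a small helper (a sketch; the values follow the slide's K20c assumptions):

#include <cstdio>

// Achieved occupancy = resident warps / (max warps per SMX * number of SMXs).
// Values follow the slide's example: 32 concurrent kernels, one 64-thread TB
// (2 warps) per kernel, on a 13-SMX K20c with 64 warps/SMX.
int main() {
    const int concurrentKernels = 32;
    const int warpsPerKernel    = 2;      // 64 threads / 32 threads per warp
    const int maxWarpsPerSMX    = 64;
    const int numSMX            = 13;

    double occupancy = (double)(concurrentKernels * warpsPerKernel) /
                       (maxWarpsPerSMX * numSMX);
    printf("Achieved SMX occupancy: %.3f\n", occupancy);  // ~0.077
    return 0;
}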

Balancing Resources
With the limit of 32 concurrent kernels, the SMXs are underutilized
[Figure: only a few TBs resident, leaving most SMX resources idle.]

Performance: Kernel Launch
For a synchronous host kernel launch, the launch time is ~10us (measured on K20c)
Launch time breakdown:
Software
Runtime: resource allocation, e.g., parameters, streams, kernel information
Driver: allocation of SW-HW interaction data structures
Hardware: kernel scheduling
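A minimal way to measure this launch overhead from the host, as a sketch (a null kernel timed with a CPU clock; actual numbers vary with driver version and GPU):

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void nullKernel() {}

int main() {
    nullKernel<<<1, 1>>>();              // warm up: the first launch pays extra initialization cost
    cudaDeviceSynchronize();

    const int iters = 1000;
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i) {
        nullKernel<<<1, 1>>>();
        cudaDeviceSynchronize();         // synchronous launch: wait for each kernel to complete
    }
    auto t1 = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
    printf("Average synchronous launch + completion time: %.2f us\n", us);
    return 0;
}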