Analyzing CUDA Workloads Using a Detailed GPU Simulator

Presentation transcript:

Analyzing CUDA Workloads Using a Detailed GPU Simulator CS 3580 - Advanced Topics in Parallel Computing Mohammad Hasanzadeh Mofrad, University of Pittsburgh, November 14, 2017

Article information Title: Analyzing CUDA workloads using a detailed GPU simulator. Authors: Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, Tor M. Aamodt. Affiliation: University of British Columbia (UBC), Vancouver, Canada. Conference: 2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2009). Citations (as of November 2017): 1031 https://scholar.google.com/citations?user=-TSkCDMAAAAJ&hl=en&oi=sra

GPGPU-Sim in a nutshell Motivation: present a novel microarchitecture performance simulator for NVIDIA GPUs (GPGPU-Sim1) and characterize the following design decisions: interconnect topology, use of caches, memory controller design, parallel workload distribution, and memory request coalescing. Outcomes: performance is more sensitive to interconnect bisection bandwidth than to latency, and running fewer threads can improve performance by reducing contention. Our objective in this presentation is to learn what GPGPU-Sim simulates and describe how it works. https://github.com/gpgpu-sim

CUDA programming model Kernel launch = grid of blocks of threads: Kernel<<<# blocks, # threads>>>; blockDim, threadIdx, blockIdx. Example thread identifier: x = blockDim.x * blockIdx.x + threadIdx.x. Declarations: __global__ runs on the GPU and is callable from the CPU; __device__ is only callable from the GPU. Transferring data between the CPU (host) and the GPU (device): malloc allocates on the CPU and cudaMalloc on the GPU; copy input data from host to device with cudaMemcpy(device destination address, host source address, size, cudaMemcpyHostToDevice); copy the results back from device to host with cudaMemcpy(host destination address, device source address, size, cudaMemcpyDeviceToHost); free releases CPU memory and cudaFree releases GPU memory. https://devblogs.nvidia.com/parallelforall/easy-introduction-cuda-c-and-c/ GPGPU-Sim Tutorial (MICRO 2012)
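The following is a minimal, self-contained sketch of this host/device flow, assuming a simple vector-add kernel; the names (vecAdd, h_a, d_a, n) are illustrative and not taken from the paper or the slides.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    // Global thread identifier, as on the slide.
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    if (x < n) c[x] = a[x] + b[x];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host (CPU) allocations use malloc; device (GPU) allocations use cudaMalloc.
    float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

    // Copy inputs host -> device.
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Kernel<<<# blocks, # threads per block>>>
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Copy the result device -> host.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    // free for CPU memory, cudaFree for GPU memory.
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}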

CUDA programming model (continued) In the CUDA programming model, the GPU is treated as a co-processor onto which an application running on a CPU can launch a massively parallel kernel. Revisiting the CUDA kernel: the CUDA kernel is composed of a grid of threads; within a grid, threads are grouped into thread blocks; within a block, threads have access to common fast memory (e.g., shared memory or the L1 cache) and can perform barrier synchronization; each thread is given a unique identifier which can be used to help divide up work among the threads. A. Bakhoda, et al. "Analyzing CUDA workloads using a detailed GPU simulator." IEEE ISPASS 2009.
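As a hedged illustration of block-level cooperation (not from the paper), the sketch below has each thread block sum 256 of its inputs in shared memory, using __syncthreads() as the barrier; blockSum and the fixed block size of 256 are assumptions for the example.

// Assumes a launch with blockDim.x == 256 (a power of two).
__global__ void blockSum(const float* in, float* blockResults, int n) {
    __shared__ float buf[256];                 // fast, per-block shared memory
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;   // unique global identifier
    buf[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                           // barrier across all threads in the block

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0) blockResults[blockIdx.x] = buf[0];
}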

What GPGPU-Sim simulates GPGPU-Sim simulates the Parallel Thread eXecution (PTX) Instruction Set Architecture (ISA), a pseudo-assembly language used in NVIDIA's CUDA. GPGPU-Sim also simulates SASS, the native ISA for NVIDIA GPUs, which is a low-level assembly language native to NVIDIA GPU hardware. GPGPU-Sim uses PTXPlus, which can represent both PTX and SASS instructions. Compilation flow: CUDA program (.cu) -> nvcc -> PTX -> ptxas -> binary executable. https://en.wikipedia.org/wiki/Parallel_Thread_Execution PTX ISA Application Guide Version 6.0, NVIDIA, 2017-09-xx A. Bakhoda, et al. "Analyzing CUDA workloads using a detailed GPU simulator." IEEE ISPASS 2009.

GPGPU-Sim timing model GPGPU-Sim simulates timing for the CUDA cores, caches, interconnection network, and memory, but not for the CPU, the PCIe bus, or graphics-specific hardware such as the rasterizer. For each kernel, GPGPU-Sim counts the number of cycles spent running the kernel and excludes PCIe bus latency; the CPU may run concurrently with asynchronous kernels. (Timeline figure: an asynchronous kernel launch returns immediately and the CPU continues until the kernel is done, while a synchronous launch blocks the CPU.) GPGPU-Sim Tutorial (MICRO 2012)
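A small sketch of the launch semantics being modeled, under the assumption of a placeholder kernel named busyKernel: the launch itself is asynchronous, so host work overlaps with GPU execution until cudaDeviceSynchronize() blocks.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void busyKernel(float* out) {          // placeholder kernel to keep the GPU busy
    float v = 0.0f;
    for (int i = 0; i < 1000000; ++i) v += 1.0f;
    out[threadIdx.x] = v;
}

int main() {
    float* d_out;
    cudaMalloc(&d_out, 256 * sizeof(float));
    busyKernel<<<1, 256>>>(d_out);    // asynchronous: control returns to the CPU immediately
    printf("CPU continues while the kernel runs\n");  // overlapped CPU work
    cudaDeviceSynchronize();          // blocking: wait until the kernel has finished
    cudaFree(d_out);
    return 0;
}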

Comparing the thread hierarchy and the GPU compute hierarchy A CUDA kernel is a grid of blocks of threads; each thread block can contain up to 1024 threads. The Maxwell GPU microarchitecture consists of four Graphics Processing Clusters (GPCs) containing 16 Streaming Multiprocessors (SMs) in total; each SM is partitioned into four processing blocks of 32 CUDA cores (Streaming Processors, SPs), each with one warp scheduler1. For example, on an NVIDIA GeForce GTX 980: the warp size is 32, which matches the number of CUDA cores per processing block; each SM has 4 * 32 = 128 cores; each SM can run 128 threads in parallel. In the CUDA programming model, each thread block is dispatched to an SM as a unit of work; the SM's warp schedulers then schedule and run groups of 32 threads (warps) on the CUDA cores. 1. "Nvidia GeForce GTX 980: Featuring Maxwell, the most advanced GPU ever made," White paper, NVIDIA Corporation, 2014
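To see these hardware parameters from a program, one can query them through the CUDA runtime; the sketch below is illustrative, and the values printed depend on the installed GPU (on a GTX 980 one would expect 16 SMs and a warp size of 32).

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Query the compute hierarchy of device 0.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SM count:            %d\n", prop.multiProcessorCount);
    printf("Warp size:           %d\n", prop.warpSize);
    printf("Max threads / block: %d\n", prop.maxThreadsPerBlock);
    printf("Max threads / SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}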

GPGPU-Sim microarchitecture A set of shader cores connected to a set of memory controllers via an interconnection network; each shader core corresponds to a Streaming Multiprocessor (SM). Threads are distributed to shader cores at the granularity of thread blocks, and a core's registers, shared memory space, and thread slots are divided among the thread blocks assigned to it. Multiple thread blocks can be assigned to a single shader core, sharing a common pipeline. A. Bakhoda, et al. "Analyzing CUDA workloads using a detailed GPU simulator." IEEE ISPASS 2009.

GPGPU-Sim microarchitecture (continued) A shader core has a SIMD width of 8 and uses a 24-stage, in-order pipeline without forwarding; thus, at least 192 (8 * 24) active threads are needed to avoid stalling. A shader core has 6 logical pipeline stages: fetch, decode, execute, memory1, memory2, and writeback. A set of 32 threads constitutes a warp; all 32 threads in a given warp execute the same instruction with different data values over 4 consecutive clock cycles in all pipelines. The immediate post-dominator reconvergence mechanism is used to handle branch divergence. A. Bakhoda, et al. "Analyzing CUDA workloads using a detailed GPU simulator." IEEE ISPASS 2009.
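The sketch below (not from the paper) shows the kind of branch divergence this mechanism handles: even and odd lanes of a warp take different paths, which the SIMD hardware serializes, and the warp reconverges at the immediate post-dominator, the first instruction after the if/else.

// Illustrative kernel; assumes a single block so threadIdx.x can index out directly.
__global__ void divergent(int* out) {
    int tid = threadIdx.x;
    int v;
    if (tid % 2 == 0) {        // even lanes take this path...
        v = tid * 2;
    } else {                   // ...odd lanes take this one; the warp runs the two paths serially
        v = tid * 3;
    }
    out[tid] = v;              // immediate post-dominator: all 32 lanes active again
}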

Revisiting the NVIDIA GeForce GTX 980 (Maxwell) Streaming Multiprocessor (SM) The GPU has 4 Graphics Processing Clusters (GPCs)1 with 4 Maxwell SMs per GPC (16 SMs total); each SM contains 4 processing blocks of 32 CUDA cores, i.e., 128 CUDA cores per SM and 2048 CUDA cores in total, plus 4 warp schedulers, each capable of issuing two instructions per cycle. The warp size is 32, equal to the number of cores per processing block. On each cycle, each warp scheduler picks an eligible warp (i.e., a warp that is not stalled) and issues 1 or 2 independent instructions from it.2 (Figures: a single SM; the 4 GPCs of the GTX 980.) 1. "Nvidia GeForce GTX 980: Featuring Maxwell, the most advanced GPU ever made," White paper, NVIDIA Corporation, 2014 2. https://devtalk.nvidia.com/default/topic/743377/cuda-programming-and-performance/understanding-cuda-scheduling/post/4221610/

GPGPU-Sim memory unit Constant cache: a read-only cache for constant memory; a warp can access one constant cache location in a single memory unit cycle. Texture cache: a read-only cache with FIFO retirement. Shared memory: 16 KB of low-latency, highly-banked, per-core memory; threads within a block can cooperate via shared memory; it is as fast as the register file in the absence of bank conflicts; each bank serves one address per cycle, and multiple accesses to a bank in a single cycle cause a bank conflict. Global memory: off-chip DRAM; accesses must go through the interconnect; global texture memory is cached in a per-core texture cache, and global constant memory in a per-core constant cache. A. Bakhoda, et al. "Analyzing CUDA workloads using a detailed GPU simulator." IEEE ISPASS 2009. GPGPU-Sim Tutorial (MICRO 2012)
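The following hedged sketch contrasts a conflict-free shared-memory access pattern with one that causes bank conflicts on the 16-bank organization described above; bankConflictDemo and the one-block, 256-thread launch are assumptions for the example.

// Assumes a launch of one block of 256 threads; 16 banks of 4-byte words.
__global__ void bankConflictDemo(float* out) {
    __shared__ float s[256];
    int tid = threadIdx.x;

    s[tid] = (float)tid;              // stride-1: consecutive threads hit consecutive banks
    __syncthreads();

    float a = s[tid];                 // conflict-free read, serviced in one cycle per half-warp
    float b = s[(tid * 16) % 256];    // stride-16: every thread maps to bank 0 -> accesses serialized

    out[tid] = a + b;
}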

GPGPU-Sim interconnection network The on-chip interconnection network can be designed in various ways based on cost and performance: cost is determined by the complexity and number of routers as well as the density and length of wires, while performance depends on the latency, bandwidth, and path diversity of the network. To access global memory, memory requests must be sent via the interconnection network to the corresponding memory controllers, which are physically distributed over the chip. The shader cores and memory controllers are laid out in a 6x6 mesh configuration. A. Bakhoda, et al. "Analyzing CUDA workloads using a detailed GPU simulator." IEEE ISPASS 2009.

GPGPU-Sim thread scheduling Thread scheduling is done inside the shader core. Every 4 cycles, warps ready for execution are selected by the warp scheduler and scheduled in a round-robin fashion. If a thread inside a warp faces a long-latency operation, all threads in the warp are taken out of the scheduling pool until the long-latency operation completes. The ready warps are sent to the pipeline for execution in round-robin order. Having many threads running on each shader core thus allows the core to tolerate long-latency operations without reducing throughput. A. Bakhoda, et al. "Analyzing CUDA workloads using a detailed GPU simulator." IEEE ISPASS 2009.

GPGPU-Sim thread block distribution Interleaving the execution of warps allows GPUs to tolerate memory access latencies. Running multiple smaller thread blocks on a shader core is better than using a single larger thread block: threads from one thread block can make progress while threads from another block are waiting at a barrier synchronization point, and larger thread blocks require more registers and more shared memory. Running more thread blocks on a shader core provides additional memory latency tolerance; however, if a memory-intensive kernel completely fills all thread block slots, it may increase contention in the interconnection network. A breadth-first strategy is used to distribute thread blocks among shader cores, selecting the core with the minimum number of running thread blocks. A. Bakhoda, et al. "Analyzing CUDA workloads using a detailed GPU simulator." IEEE ISPASS 2009.
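On real hardware, the same per-core resource limits can be inspected with the CUDA occupancy API (introduced in CUDA 6.5, so newer than the toolkit GPGPU-Sim targets); the sketch below is illustrative, with smallKernel as a placeholder.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void smallKernel(float* data) {   // placeholder kernel
    data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}

int main() {
    // How many thread blocks of a given size can be resident on one SM at once,
    // given the kernel's register and shared-memory usage.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, smallKernel, 256, 0);
    printf("Resident blocks per SM at 256 threads/block:  %d\n", blocksPerSM);

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, smallKernel, 1024, 0);
    printf("Resident blocks per SM at 1024 threads/block: %d\n", blocksPerSM);
    return 0;
}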

GPGPU-Sim memory access coalescing (intra-warp) Objective: combine memory accesses made by threads in a warp into fewer transactions. Example: if the threads in a warp access consecutive 4-byte locations in memory, send one 128-byte request to DRAM (coalescing) instead of 32 4-byte requests. This reduces the number of transactions between the CUDA cores and DRAM, meaning less work for the interconnect, the memory partitions, and DRAM. (Figure: a warp's 32 consecutive 4-byte accesses combined into one 128-byte transaction.) GPGPU-Sim Tutorial (MICRO 2012)
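A hedged sketch of the two access patterns (not from the paper): in coalesced() a warp's 32 threads touch 32 consecutive 4-byte words, which the hardware can combine into one 128-byte transaction, while in strided() each thread falls in a different 128-byte segment, so the same warp generates many separate transactions.

__global__ void coalesced(const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];                  // consecutive addresses within a warp -> one transaction
}

// Caller must ensure `in` holds at least gridDim.x * blockDim.x * stride elements.
__global__ void strided(const float* in, float* out, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];         // e.g. stride = 32 defeats coalescing
}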

GPGPU-Sim read memory access coalescing (inter-warp) Cache accesses are tracked using Miss Status Holding Registers (MSHRs), which keep track of outstanding memory requests and merge simultaneous requests for the same cache block. In GPGPU-Sim, each cache has its own set of MSHR entries; each MSHR entry holds one or more requests to the same memory address; the number of MSHR entries is configurable; and the memory unit stalls if the cache runs out of MSHR entries. Inter-warp memory coalescing consolidates read memory requests from later warps that need data for which a memory request is already in progress due to another warp running on the same shader core. A. Bakhoda, et al. "Analyzing CUDA workloads using a detailed GPU simulator." IEEE ISPASS 2009.

GPGPU-Sim caching While coalescing memory requests captures spatial locality among threads, memory bandwidth requirements may be further reduced with caching if an application contains temporal locality or spatial locality within the access pattern of individual threads. A. Bakhoda, et al. “Analyzing CUDA workloads using a detailed GPU simulator.“ IEEE ISPASS 2009.
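As an illustrative example (an assumption, not taken from the paper), the kernel below has every thread walk sequentially through the same small coefficient array; these repeated, per-thread reads exhibit the kind of locality a cache can capture beyond what coalescing alone provides.

// Evaluate a polynomial at x[i] for each thread; coeff is a small array of degree+1 values.
__global__ void polyEval(const float* coeff, int degree,
                         const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f, p = 1.0f;
    for (int d = 0; d <= degree; ++d) {
        acc += coeff[d] * p;   // coeff[d] is read sequentially by each thread and reused across threads,
        p *= x[i];             // so later accesses can hit in a cache instead of going to DRAM
    }
    y[i] = acc;
}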

Software dependencies Linux machine (preferably Ubuntu 11.04, 64-bit); CUDA Toolkit 4.2 and NVIDIA GPU Computing SDK 4.2; gcc 4.{4,5}.x, g++ 4.{4,5}.x, and gfortran 4.{4,5}.x; development tools (especially the build-essential package), OpenMPI, the Boost library, etc. The source can be cloned with: git clone git://dev.ece.ubc.ca/gpgpu-sim or git clone https://github.com/gpgpu-sim/gpgpu-sim_distribution GPGPU-Sim Tutorial (MICRO 2012) http://cs922.blogspot.com/2014/01/gpgpu-sim-is-tool-developed-for.html

Simulation setup for the baseline GPU configuration
Number of shader cores: 28
Warp size: 32
SIMD pipeline width: 8
Threads / CTAs / registers per core: 1024 / 8 / 16384
Shared memory per core: 16 KB (16 banks)
Constant cache per core: 8 KB (2-way set assoc., 64 B lines, LRU)
Texture cache per core: 64 KB (2-way set assoc., 64 B lines, LRU)
Memory channels, BW per memory module: 8 bytes/cycle
DRAM request queue size, memory controller: out-of-order (FR-FCFS)
Branch divergence handling method: immediate post-dominator
Warp scheduling policy: round-robin among ready warps
A. Bakhoda, et al. "Analyzing CUDA workloads using a detailed GPU simulator." IEEE ISPASS 2009.

Benchmarks GPGPU-Sim is an academic cycle-level simulator for modeling a modern GPU running non-graphics workloads. The study uses 14 benchmarks written in CUDA, selected because they needed software tuning or changes in hardware design and reported speedups of less than 50x: Advanced Encryption Standard (AES), Breadth First Search (BFS), Coulombic Potential (CP), gpuDG (DG), 3D Laplace Solver (LPS), LIBOR Monte Carlo (LIB), MUMmerGPU (MUM), Neural Network Digit Recognition (NN), N-Queens Solver (NQU), Ray Tracing (RAY), StoreGPU (STO), Weather Prediction (WP), Black-Scholes option pricing (BLK), and Fast Walsh Transform (FWT). A. Bakhoda, et al. "Analyzing CUDA workloads using a detailed GPU simulator." IEEE ISPASS 2009.

Benchmarks (continued) Claimed speedups by benchmark (abbreviation):
AES Cryptography (AES): 12x
Breadth First Search (BFS): 2x-3x
Coulombic Potential (CP): 647x
gpuDG (DG): 50x
3D Laplace Solver (LPS): -
LIBOR Monte Carlo (LIB): -
MUMmerGPU (MUM): 3.5x-10x
Neural Network (NN): 10x
N-Queens Solver (NQU): 2.5x
Ray Tracing (RAY): 16x
StoreGPU (STO): 9x
Weather Prediction (WP): 20x
A. Bakhoda, et al. "Analyzing CUDA workloads using a detailed GPU simulator." IEEE ISPASS 2009.

Comparison between GPGPU-Sim and an NVIDIA GeForce 8600 GTS Maximum GPGPU-Sim IPC = 224 (28 cores * 8-wide pipeline). NVIDIA GeForce 8600 GTS IPC = # PTX instructions / (runtime * core frequency). Correlation coefficient: 0.899. A. Bakhoda, et al. "Analyzing CUDA workloads using a detailed GPU simulator." IEEE ISPASS 2009.

Interconnection network latency sensitivity Results for various mesh network latencies, adding extra pipeline latency of 4, 8, or 16 cycles: a slight increase in interconnect latency has no severe effect on overall performance, so there is no need to overdesign the interconnect to decrease latency. A. Bakhoda, et al. "Analyzing CUDA workloads using a detailed GPU simulator." IEEE ISPASS 2009.

Interconnection network bandwidth sensitivity Results for modifying the mesh interconnect bandwidth by varying the channel bandwidth from 8 bytes to 64 bytes: low channel bandwidth (8 B) significantly decreases performance. Overall, performance is more sensitive to interconnect bandwidth than to latency; in other words, restricting the channel bandwidth causes the interconnect to become a bottleneck. A. Bakhoda, et al. "Analyzing CUDA workloads using a detailed GPU simulator." IEEE ISPASS 2009.

Effects of varying the number of thread blocks Results for exploring the effects of varying the resources that limit the number of threads, including the number of registers, the amount of shared memory, and the thread slots; these resources limit how many thread blocks can run concurrently on a core. Most benchmarks do not benefit substantially, and some even perform better with fewer concurrent threads (e.g., AES) because of reduced contention in DRAM. Given the widely-varying, workload-dependent behavior, always scheduling the maximal number of thread blocks supported by a core is not always the best scheduling policy. A. Bakhoda, et al. "Analyzing CUDA workloads using a detailed GPU simulator." IEEE ISPASS 2009.
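One knob a CUDA programmer has over this trade-off is the __launch_bounds__ qualifier, which caps threads per block and requests a minimum number of resident blocks per SM, trading registers per thread for more concurrent thread blocks; the sketch below is a hedged illustration with a placeholder kernel, not part of the paper's methodology.

// At most 256 threads per block; ask the compiler to target at least 4 resident blocks per SM.
__global__ void __launch_bounds__(256, 4)
boundedKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}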

Summary GPGPU-Sim is a GPU simulator capable of simulating CUDA workloads. Various aspects of GPGPU-Sim were studied, including: branch divergence, interconnect topology, interconnect latency and bandwidth, DRAM utilization and efficiency, cache effects, thread block size, and memory request coalescing. Running fewer thread blocks can improve performance (less DRAM contention). A. Bakhoda, et al. "Analyzing CUDA workloads using a detailed GPU simulator." IEEE ISPASS 2009.

Key references A. Bakhoda, et al. “Analyzing CUDA workloads using a detailed GPU simulator.“ IEEE ISPASS 2009. http://www.gpgpu-sim.org/micro2012-tutorial/