Published by Emmanuel St-Georges. Modified over 6 years ago.
1
GPU Introduction: Uses, Architecture, and Programming Model
Lee Barford firstname dot lastname at gmail dot com
2
Outline
Why GPUs?
What GPUs are and what they provide
Overview of GPU architecture
  Enough to orient the discussion of programming them
  Future changes
Overview of tool chains
  We will cover NVIDIA’s CUDA
3
Power: energy used per unit time
Dominant practicality and cost constraint
4
Insatiable need for floating point computing
Graphics: gaming, animation
Simulation: electronics, aerodynamics, automotive, biochemistry
Machine learning
More flops/s means more realistic, more accurate results
Power & cooling are the limits on more flops: key metric is flops/s/W
Supercomputer: dominant to c. 1990
Compute cluster (unaccelerated CPUs): dominant before c. 2010
5
Extreme throughput integer applications
Cryptography
Cryptocurrency mining
Blockchain
Profit set by rate of computations, offset by energy costs
Want to maximize integer operations/s/W
6
Exponentially growing gap
GPU designs maximize the number of cores to improve ops/s/W
Graph (serial app performance) from UC Berkeley ParLab
7
Graphics Processor (GPU) as Parallel Accelerator
Commodity priced, massively parallel floating point
Claimed performance on various problems: x CPU running serial code
Graph from
8
The GPU as a Co-Processor to the CPU: The physical and logical connections
Diagram: CPU, chipset, main memory, and I/Os (video, Ethernet, USB hub, Firewire, …); GPU with its own GPU memory attached over PCIe (slow)
CPU sends control actions & code (kernels) to run to the GPU
Running GPU code is like requesting asynchronous I/O
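The "asynchronous I/O" analogy can be sketched in CUDA C++ roughly as follows. This is a minimal illustration, not code from the course: the names `scale_by_two` and `scale_async` are hypothetical, error checking is omitted, and pinned host memory is assumed so that the copies can actually overlap with CPU work.

```cuda
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

// Hypothetical kernel: doubles each element in place.
__global__ void scale_by_two(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical host function: queues GPU work, does unrelated CPU work,
// then waits, much like issuing an asynchronous I/O request.
// Error checking is omitted for brevity.
std::vector<float> scale_async(const std::vector<float>& input, long* cpu_side_sum) {
    const int n = static_cast<int>(input.size());
    if (n == 0) { *cpu_side_sum = 0; return {}; }
    const size_t bytes = n * sizeof(float);

    float* host_buf = nullptr;  // pinned (page-locked) host memory
    float* dev_buf = nullptr;   // GPU memory
    cudaStream_t stream;
    cudaMallocHost(&host_buf, bytes);
    cudaMalloc(&dev_buf, bytes);
    cudaStreamCreate(&stream);
    std::memcpy(host_buf, input.data(), bytes);

    // Queue copy-in, kernel, and copy-out; these calls return without waiting.
    cudaMemcpyAsync(dev_buf, host_buf, bytes, cudaMemcpyHostToDevice, stream);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale_by_two<<<blocks, threads, 0, stream>>>(dev_buf, n);
    cudaMemcpyAsync(host_buf, dev_buf, bytes, cudaMemcpyDeviceToHost, stream);

    // The CPU is free to do other work while the GPU request is in flight.
    long sum = 0;
    for (int i = 0; i < 1000; ++i) sum += i;
    *cpu_side_sum = sum;

    // Block until everything queued on the stream has completed.
    cudaStreamSynchronize(stream);
    std::vector<float> result(host_buf, host_buf + n);

    cudaStreamDestroy(stream);
    cudaFree(dev_buf);
    cudaFreeHost(host_buf);
    return result;
}
```

Calling `scale_async({1.0f, 2.0f, 3.0f}, &s)` should return the doubled values once the stream has been synchronized; until that point the contents of the host buffer are not safe to read.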
9
Now from AMD & Intel: Fusion of CPU and GPU
Diagram labels: Multiple cores, Hardware task scheduler, Main memory, I/O subsystem
Running GPU code will be like pending method pointers for future execution (like C++11, TBB, TPL, PPL).
10
Programming implications
Write two programs, in two languages
Main program on CPU:
  Startup, shutdown, I/O, networking, databases, non-GPU functionality
  Control passing of data between CPU and GPU
  Invoke code to run on GPU
Kernels on GPU
  Term comes from simulation (partial differential equations)
  Computation-heavy subroutines
The GPU must save enough time to make the work of moving data between CPU and GPU pay off
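The split into a CPU program and a GPU kernel might look like the sketch below. It is only an illustration under stated assumptions: `add_kernel` and `add_on_gpu` are hypothetical names, both inputs are assumed to have the same non-zero length, and error checking is left out.

```cuda
#include <cuda_runtime.h>
#include <vector>

// GPU program (hypothetical kernel): each thread adds one pair of elements.
__global__ void add_kernel(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

// CPU program (hypothetical host function): moves data, invokes the kernel,
// and brings the result back. Assumes a.size() == b.size() and a.size() > 0.
// Error checking is omitted for brevity.
std::vector<float> add_on_gpu(const std::vector<float>& a, const std::vector<float>& b) {
    const int n = static_cast<int>(a.size());
    const size_t bytes = n * sizeof(float);

    float* d_a = nullptr;
    float* d_b = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_out, bytes);

    // Pass data from CPU memory to GPU memory.
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    // Invoke the kernel: enough 256-thread blocks to cover n elements.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    add_kernel<<<blocks, threads>>>(d_a, d_b, d_out, n);

    // Copy the result back; this copy waits for the kernel to finish.
    std::vector<float> out(n);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    return out;
}
```

An element-wise add does very little arithmetic per byte moved, so on a PCIe-attached GPU the copies will likely cost more than the kernel saves. It illustrates the structure, not a case where the transfer pays off.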
11
CUDA (NVIDIA) GPU Compute Architecture: Many Simple, Floating-Point Cores
12
Cores organized into groups
32 cores (Streaming Multiprocessor) share:
  Instruction stream
  Registers
Execute same program (kernel) in lock step
  SPMD: ~ [Same place in same kernel at the same time]
Act as many more cores by switching context instead of waiting for memory
1000’s of virtual cores executing same lines of code together, but
  Sharing limited resources
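In CUDA, a group of 32 threads that runs in lock step is called a warp. The sketch below shows how a kernel could work out which lock-step group each thread belongs to. The names `record_warp_ids` and `warp_ids_for_block` are hypothetical, a block size between 1 and 1024 is assumed, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread records the index of its lock-step group.
// warpSize is a CUDA built-in variable (32 on NVIDIA GPUs to date).
__global__ void record_warp_ids(int* warp_of_thread) {
    int t = threadIdx.x;
    warp_of_thread[t] = t / warpSize;
}

// Hypothetical host function: launches one block and returns, for each thread,
// the warp it ran in. Assumes 0 < threads_per_block <= 1024.
// Error checking is omitted for brevity.
std::vector<int> warp_ids_for_block(int threads_per_block) {
    const size_t bytes = threads_per_block * sizeof(int);
    int* d_ids = nullptr;
    cudaMalloc(&d_ids, bytes);

    record_warp_ids<<<1, threads_per_block>>>(d_ids);

    std::vector<int> ids(threads_per_block);
    cudaMemcpy(ids.data(), d_ids, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_ids);
    return ids;
}
```

With a 64-thread block, threads 0 to 31 should report warp 0 and threads 32 to 63 warp 1. Threads in the same warp share one instruction stream, so a branch that splits a warp makes both sides execute one after the other.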
13
GPU has multiple SMs
SMs run in parallel
  Do not need to be executing same location in the same program at the same time
In aggregate, many 1000’s of parallel copies of same kernel running simultaneously
  Total of up to 1 Tflop/s at peak
CENTRAL SOFTWARE ISSUES:
  How to generate and control this much parallelism
  How to avoid slowing down due to waiting for off-GPU DRAM memory access
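One common way to generate enough parallel work for every SM is a grid-stride loop, sketched below. This is a swapped-in illustration rather than anything from the slides: `square_all` and `square_on_gpu` are hypothetical names, the choice of four blocks per SM is an arbitrary assumption, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: a grid-stride loop, so any grid size covers all n elements.
__global__ void square_all(float* data, int n) {
    const int stride = gridDim.x * blockDim.x;  // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] = data[i] * data[i];
    }
}

// Hypothetical host function: sizes the grid from the number of SMs on the device.
// Error checking is omitted for brevity.
std::vector<float> square_on_gpu(const std::vector<float>& input) {
    const int n = static_cast<int>(input.size());
    if (n == 0) return {};
    const size_t bytes = n * sizeof(float);

    int device = 0;
    cudaGetDevice(&device);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    const int threads = 256;
    // Assumption: a few blocks per SM, so each SM has other work to switch to
    // while some threads wait on memory.
    const int blocks = prop.multiProcessorCount * 4;

    float* d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, input.data(), bytes, cudaMemcpyHostToDevice);

    square_all<<<blocks, threads>>>(d_data, n);

    std::vector<float> out(n);
    cudaMemcpy(out.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return out;
}
```

Blocks are distributed across the SMs by the hardware, and different blocks may be at different points in the kernel at any moment.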
14
GPU Programming Options
Libraries: called from CPU code. Write no GPU code.
  Examples: Image/video processing, dense & sparse matrix, FFT, random numbers
Generic programming for GPU
  Thrust
    Like C++ Standard Template Library
    Specialize & use built-in data structures and algorithms
    NVIDIA GPUs only
Programming GPU kernels in a special-purpose language (emphasis in this course)
  CUDA C/C++, PyCUDA, CUDA Fortran
  OpenCL, WebCL, …
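As a rough illustration of the generic-programming option, the sketch below uses Thrust to compute a sum of squares without writing a kernel by hand. The functor `Square` and the function `sum_of_squares` are hypothetical names chosen for this example.

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/transform_reduce.h>
#include <vector>

// Hypothetical functor: the "specialization" supplied to the generic algorithm.
struct Square {
    __host__ __device__ float operator()(float x) const { return x * x; }
};

// Hypothetical host function: copies the data to the GPU, then squares and sums it there.
float sum_of_squares(const std::vector<float>& host_values) {
    // Constructing a device_vector from host iterators copies the data to GPU memory.
    thrust::device_vector<float> d_values(host_values.begin(), host_values.end());

    // Apply Square to every element and add up the results, starting from 0.
    return thrust::transform_reduce(d_values.begin(), d_values.end(),
                                    Square(), 0.0f, thrust::plus<float>());
}
```

The shape mirrors STL usage: a container, iterators, and an algorithm parameterized by a small function object.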
15
Questions
16
Two Programming Environments that We’ll Cover
CUDA C/C++:
  Very efficient code
  Lots of fussy detail to get that efficiency
  Robust tool chains for Linux, Windows, MacOS
  Specific to NVIDIA
Thrust:
  Easy to write
  Algorithms provided among the fastest (e.g., sort)
  NVIDIA GPUs only
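To give a feel for the "easy to write" side, a GPU sort with Thrust can be sketched in a few lines. `sort_on_gpu` is a hypothetical name; the Thrust calls themselves are standard.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <vector>

// Hypothetical host function: sorts integers on the GPU and returns them in a host vector.
std::vector<int> sort_on_gpu(const std::vector<int>& host_values) {
    // Copy the input to GPU memory.
    thrust::device_vector<int> d_values(host_values.begin(), host_values.end());

    // Sort in place on the GPU, ascending.
    thrust::sort(d_values.begin(), d_values.end());

    // Copy the sorted data back to CPU memory.
    std::vector<int> sorted(d_values.size());
    thrust::copy(d_values.begin(), d_values.end(), sorted.begin());
    return sorted;
}
```

The equivalent hand-written CUDA C/C++ would need explicit memory management, kernel launches, and a parallel sorting algorithm, which is the "fussy detail" trade-off noted above.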
18
BACKUP SLIDES
19
CUDA C/C++ vs OpenCL
CUDA C/C++:
  Proprietary (NVIDIA)
  Code runs on NVIDIA GPUs
  Reportedly 10-50% faster than OpenCL
  Compiles at build time to binary code for particular targeted hardware
    Specific NVIDIA hardware architecture versions
  No compiler available at run time
OpenCL:
  Open standard (Khronos)
  Code runs on NVIDIA & AMD GPUs, x86 multicore, FPGAs (academic research) at the same time
  Compiles at build time to intermediate form that is compiled at run time for the hardware that is present
  Compiler is available at run time
    Can execute downloaded or dynamically generated source code
20
Class Project Idea: Accurate edge finding in a 1D signal
Journal paper published on multicore version
Student project last year doing Thrust implementation
Project: Do CUDA version + performance tests
Paper combining previous student’s work with above:
  60% probability of getting accepted in a particular IEEE conference
  3 co-authors, including previous student & Lee
  Extended abstract due: Nov 6
  Class project due during finals, same as everyone else
  Camera ready paper due: March 4
See or me in the next week or two if interested
21
Programming Tomorrow’s CPU will be Like Programming Today’s GPU
GPUs that compute will come “for free” with computers
Slow step of moving data to/from GPU will be eliminated
Hardware task scheduler for both CPU and GPU will
  Almost eliminate OS & I/O overhead for invoking GPU kernels
  Also almost eliminate OS overhead for invoking parallel tasks on CPU
AMD laptop chip; Intel laptops (e.g. fall ‘12 refresh MacBook Pros)
NVIDIA GPU+ARM chip available now for battery operated devices
Both promise desktop chips in next year or two
Programming models will probably evolve from what we’ll cover
Course will use current, PCIe-based GPUs
  We will be dealing with overheads that will pass away over next few years
22
Teraflop GPU that runs on a (biggish) battery