
1 Introduction to CUDA Programming Introduction to Programming Massively Parallel Graphics processors Andreas Moshovos moshovos@eecg.toronto.edu ECE, Univ. of Toronto Summer 2010 Some slides/material from: UIUC course by Wen-Mei Hwu and David Kirk UCSB course by Andrea Di Blas Universitat Jena by Waqar Saleem NVIDIA by Simon Green and others as noted on slides

2 How to Get High Performance Computation –Calculations –Data communication/Storage Tons of Compute Engines Tons of Storage Unlimited Bandwidth Zero/Low Latency

3 Calculation capabilities How many calculation units can be built? Today’s silicon chips –About 1B transistors –30K transistors for a 52b multiplier ~30K multipliers –260mm^2 area (mid-range) –112microns^2 for FP unit (overestimated) ~2K FP units Frequency ~3GHz common today –TFLOPs possible Disclaimer: back-of-the-envelope calculations – take with a grain of salt Can build lots of calculation units (ALUs) Tons of Compute Engines ?

4 How about Communication/Storage Need data feed and storage The larger the slower Takes time to get there and back –Multiple cycles even on the same die Tons of Compute Engines Tons of Slow Storage Unlimited Bandwidth Zero/Low Latency

5 Is there enough parallelism? Keep this busy? –Needs lots of independent calculations Parallelism/Concurrency Much of what we do is sequential –First do 1, then do 2, then if X do 3 else do 4 Tons of Compute Engines Tons of Storage Unlimited Bandwidth Zero/Low Latency

6 Today’s High-End General Purpose Processors Localize Communication and Computation Try to automatically extract parallelism time Tons of Slow Storage Faster cache Slower Cache Automatically extract instruction level parallelism Large on-die caches to tolerate off-chip memory latency

7 Some things are naturally parallel

8 Sequential Execution Model int a[N]; // N is large for (i =0; i < N; i++) a[i] = a[i] * fade; time Flow of control / Thread One instruction at the time Optimizations possible at the machine level

9 Data Parallel Execution Model / SIMD int a[N]; // N is large for all elements do in parallel a[index] = a[index] * fade; time This has been tried before: ILLIAC III, UIUC, 1966

10 Single Program Multiple Data / SPMD int a[N]; // N is large for all elements do in parallel if (a[i] > threshold) a[i]*= fade; time The model used in today’s Graphics Processors
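To make the SPMD model concrete, here is a minimal CUDA kernel sketch of the example above (the kernel name and the threshold/fade parameters are illustrative, not from the slides):

__global__ void fade_kernel (float *a, float fade, float threshold, int N)
{
    // One thread per element: each thread computes its own global index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Same program for every thread; control flow depends on the thread's own data.
    if (i < N && a[i] > threshold)
        a[i] *= fade;
}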

11 CPU vs. GPU overview CPU: –Handles sequential code well –Can’t take advantage of massively parallel code –Off-chip bandwidth lower –Peak Computation capability lower GPU: –Requires massively parallel computation –Handles some control flow –Higher off-chip bandwidth –Higher peak computation capability

12 Programmer’s view GPU as a co-processor (2008) CPU Memory GPU GPU Memory 1GB on our systems 3GB/s – 8GB/s 6.4GB/sec – 31.92GB/sec 8B per transfer 141GB/sec

13 Target Applications int a[N]; // N is large for all elements of a compute a[i] = a[i] * fade Lots of independent computations –CUDA threads need not be independent

14 Programmer’s View of the GPU GPU: a compute device that: –Is a coprocessor to the CPU or host –Has its own DRAM (device memory) –Runs many threads in parallel Data-parallel portions of an application are executed on the device as kernels which run in parallel on many threads

15 Why are threads useful? Parallelism Concurrency: –Do multiple things in parallel –Uses more hardware → Gets higher performance Needs more functional units

16 Why are threads useful #2 – Tolerating stalls Often a thread stalls, e.g., memory access Multiplex the same functional unit Get more performance at a fraction of the cost

17 GPU vs. CPU Threads GPU threads are extremely lightweight Very little creation overhead In the order of microseconds All done in hardware GPU needs 1000s of threads for full efficiency Multi-core CPU needs only a few

18 Execution Timeline time 1. Copy to GPU mem 2. Launch GPU Kernel GPU / Device 2’. Synchronize with GPU 3. Copy from GPU mem CPU / Host

19 Programmer’s view First create data on CPU memory CPU Memory GPU GPU Memory

20 Programmer’s view Then Copy to GPU CPU Memory GPU GPU Memory

21 Programmer’s view GPU starts computation  runs a kernel CPU can also continue CPU Memory GPU GPU Memory

22 Programmer’s view CPU and GPU Synchronize CPU Memory GPU GPU Memory

23 Programmer’s view Copy results back to CPU CPU Memory GPU GPU Memory

24 Computation partitioning: At the highest level: –Think of computation as a series of loops: for (i = 0; i < big_number; i++) –a[i] = some function for (i = 0; i < big_number; i++) –a[i] = some other function for (i = 0; i < big_number; i++) –a[i] = some other function Kernels
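As a rough sketch of this partitioning (assuming the loop iterations are independent; step1/step2 and the element-wise operations are placeholders for the slide's "some function"), each loop becomes one kernel that the host launches in sequence:

__global__ void step1 (float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] * 2.0f;        // stands in for "some function"
}

__global__ void step2 (float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] + 1.0f;        // stands in for "some other function"
}

// Host side: one kernel launch per loop, issued back to back.
// step1 <<< blocks, threads_block >>> (d_a, big_number);
// step2 <<< blocks, threads_block >>> (d_a, big_number);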

25 Computation Partitioning -- Kernel CUDA exposes the hardware to the programmer Programmer must manually partition work appropriately Programmer’s view is hierarchical: –Think of data as an array

26 Per Kernel Computation Partitioning Computation Grid: 2D Case Threads within a block can communicate/synchronize –Run on the same multiprocessor Threads across blocks can’t communicate –Shouldn’t touch each other’s data –Behavior undefined Block thread

27 Thread Coordination Overview Race-free access to data

28 GBT: Grids of Blocks of Threads Why? Realities of integrated circuits: need to cluster computation and storage to achieve high speeds Programmers view of data and computation partitioning

29 Block and Thread IDs Threads and blocks have IDs –So each thread can decide what data to work on –Block ID: 1D or 2D –Thread ID: 1D, 2D, or 3D Simplifies memory addressing when processing multidimensional data –Convenience not necessity [Figure: the device runs Grid 1, a 3x2 arrangement of blocks, Block (0, 0) through Block (2, 1); each block, e.g., Block (1, 1), is a 5x3 arrangement of threads, Thread (0, 0) through Thread (4, 2)] IDs and dimensions are accessible through predefined “variables”, e.g., blockDim.x and threadIdx.x
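A minimal sketch of how these IDs and dimensions are used for a 2D problem (the kernel and variable names are illustrative; dim3 is the standard CUDA type for specifying grid and block dimensions):

__global__ void touch2d (float *img, int width, int height)
{
    // 2D block of threads within a 2D grid of blocks.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        img[y * width + x] = 0.0f;
}

// Host side: dimensions are passed via dim3 values in the launch configuration.
// dim3 block (16, 16);                                // 16 x 16 = 256 threads per block
// dim3 grid ((width + 15) / 16, (height + 15) / 16);  // enough blocks to cover the image
// touch2d <<< grid, block >>> (d_img, width, height);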

30 Execution Model: Ordering Execution order is undefined Do not assume or rely on: block 0 executes before block 1 Thread 10 executes before thread 20 Or any other ordering even if you can observe it –Future implementations may break this ordering –It’s not part of the CUDA definition –Why? More flexible hardware options

31 Programmer’s view: Memory Model Different memories with different uses and performance –Some managed by the compiler –Some must be managed by the programmer Arrows show whether read and/or write is possible

32 Execution Model Summary (for your reference) Grid of blocks of threads –1D/2D grid of blocks –1D/2D/3D blocks of threads All blocks are identical: –same structure and # of threads Block execution order is undefined Same block threads: –can synchronize and share data fast (shared memory) Threads from different blocks: –Cannot cooperate –Communication through global memory Threads and Blocks have IDs –Simplifies data indexing –Can be 1D, 2D, or 3D (threads) Blocks do not migrate: execute on the same processor Several blocks may run over the same processor

33 CUDA Software Architecture Layered stack: CUDA libraries (e.g., fft()), the runtime API (cuda…() calls), and the driver API (cu…() calls)

34 Reasoning about CUDA call ordering GPU communication via cuda…() calls and kernel invocations –cudaMalloc, cudaMemcpy Asynchronous from the CPU’s perspective –CPU places a request in a “CUDA” queue –requests are handled in-order Streams allow for multiple queues –Order within each queue honored –No order across queues –More on this much later on
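A minimal sketch of this in-order behaviour on the default queue (kernel1/kernel2 and the launch parameters are placeholders; d_a, h_a, and SIZE follow the naming of the program on the next slide):

cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);   // request 1
kernel1 <<< blocks, threads_block >>> (d_a, N);        // request 2: starts after 1 completes
kernel2 <<< blocks, threads_block >>> (d_a, N);        // request 3: starts after 2 completes
cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);   // request 4: starts after 3 completes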

35 My first CUDA Program __global__ void arradd (float *a, float f, int N) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < N) a[i] = a[i] + f; } int main() { float h_a[N]; float *d_a; cudaMalloc ((void **) &d_a, SIZE); cudaThreadSynchronize (); cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice); arradd <<< (N + 255) / 256, 256 >>> (d_a, 10.0f, N); cudaThreadSynchronize (); cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost); CUDA_SAFE_CALL (cudaFree (d_a)); }

36 CUDA API: Example int a[N]; for (i =0; i < N; i++) a[i] = a[i] + x; 1.Allocate CPU Data Structure 2.Initialize Data on CPU 3.Allocate GPU Data Structure 4.Copy Data from CPU to GPU 5.Define Execution Configuration 6.Run Kernel 7.CPU synchronizes with GPU 8.Copy Data from GPU to CPU 9.De-allocate GPU and CPU memory

37 1. Allocate CPU Data float *ha; main (int argc, char *argv[]) { int N = atoi (argv[1]); ha = (float *) malloc (sizeof (float) * N);... } No memory allocated on the GPU side Pinned memory allocation results in faster CPU to/from GPU copies But pinned memory cannot be paged-out More on this later cudaMallocHost (…)
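A minimal sketch of the pinned alternative (same shape as the malloc() version above; cudaFreeHost() releases what cudaMallocHost() allocated):

float *ha;
int N = atoi (argv[1]);
// Pinned (page-locked) host allocation instead of malloc():
cudaMallocHost ((void **) &ha, sizeof (float) * N);
...
cudaFreeHost (ha);    // not free()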

38 2. Initialize CPU Data (dummy) float *ha; int i; for (i = 0; i < N; i++) ha[i] = i;

39 3. Allocate GPU Data float *da; cudaMalloc ((void **) &da, sizeof (float) * N); Notice: da is not assigned directly –NOT: da = cudaMalloc (…) Assignment is done internally: –That’s why we pass &da Space is allocated in Global Memory on the GPU

40 GPU Memory Allocation The host manages GPU memory allocation: –cudaMalloc (void **ptr, size_t nbytes) –Must explicitly cast to ( void **) cudaMalloc ((void **) &da, sizeof (float) * N); –cudaFree (void *ptr); cudaFree (da); –cudaMemset (void *ptr, int value, size_t nbytes); cudaMemset (da, 0, N * sizeof (int)); Check the CUDA Reference Manual

41 4. Copy Initialized CPU data to GPU float *da; float *ha; cudaMemcpy ((void *) da, // DESTINATION (void *) ha, // SOURCE sizeof (float) * N, // #bytes cudaMemcpyHostToDevice); // DIRECTION

42 Host/Device Data Transfers The host initiates all transfers: cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction) Asynchronous from the CPU’s perspective –CPU thread continues In-order processing with other CUDA requests enum cudaMemcpyKind –cudaMemcpyHostToDevice –cudaMemcpyDeviceToHost –cudaMemcpyDeviceToDevice

43 5. Define Execution Configuration How many blocks and threads/block int threads_block = 64; int blocks = N / threads_block; if (N % threads_block != 0) blocks += 1; Alternatively: blocks = (N + threads_block - 1) / threads_block;

44 6. Launch Kernel & 7. CPU/GPU Synchronization Instructs the GPU to launch blocks x threads_block threads: darradd <<< blocks, threads_block >>> (da, 10.0f, N); cudaThreadSynchronize (); // forces CPU to wait darradd: kernel name <<< blocks, threads_block >>>: execution configuration –More on this soon (da, x, N): arguments –256 byte limit / No variable arguments

45 CPU/GPU Synchronization CPU does not block on cuda…() calls –Kernel/requests are queued and processed in-order –Control returns to CPU immediately Good if there is other work to be done –e.g., preparing for the next kernel invocation Eventually, CPU must know when GPU is done Then it can safely copy the GPU results cudaThreadSynchronize () –Block CPU until all preceding cuda…() and kernel requests have completed
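A sketch of how this can be exploited (prepare_next_input() and hb are hypothetical CPU-side names, not from the slides; the other names follow the surrounding slides):

darradd <<< blocks, threads_block >>> (da, x, N);  // queued; control returns immediately
prepare_next_input (hb, N);                        // useful CPU work overlaps the kernel
cudaThreadSynchronize ();                          // block until the GPU is done
cudaMemcpy (ha, da, sizeof (float) * N, cudaMemcpyDeviceToHost);   // now safe to copy results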

46 8. Copy data from GPU to CPU & 9. DeAllocate Memory float *da; float *ha; cudaMemcpy ((void *) ha, // DESTINATION (void *) da, // SOURCE sizeof (float) * N, // #bytes cudaMemcpyDeviceToHost); // DIRECTION cudaFree (da); // display or process results here free (ha);

47 The GPU Kernel __global__ void darradd (float *da, float x, int N) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < N) da[i] = da[i] + x; } blockIdx: Unique Block ID –Numerically ascending: 0, 1, … blockDim: Dimensions of Block = how many threads it has –blockDim.x, blockDim.y, blockDim.z –Unused dimensions default to 1 threadIdx: Unique per Block Index –0, 1, … –Per Block

48 Array Index Calculation Example int i = blockIdx.x * blockDim.x + threadIdx.x; Assuming blockDim.x = 64: block 0 (threadIdx.x = 0…63) covers a[0]…a[63], i.e., i = 0…63; block 1 covers a[64]…a[127] (i = 64…127); block 2 covers a[128]…a[191] (i = 128…191); block 3 starts at a[192] (i = 192)

49 CUDA Function Declarations __global__ defines a kernel function –Must return void –Can only call __device__ functions __device__ and __host__ can be used together –Two different versions generated __device__ float DeviceFunc(): executed on the device, callable only from the device __global__ void KernelFunc(): executed on the device, callable only from the host __host__ float HostFunc(): executed on the host, callable only from the host
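A small sketch of combining __host__ and __device__ (clampf and clamp_all are illustrative names): the same source is compiled once for the CPU and once for the GPU.

__host__ __device__ float clampf (float v, float lo, float hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

__global__ void clamp_all (float *a, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) a[i] = clampf (a[i], 0.0f, 1.0f);   // uses the __device__ version
}

// On the CPU, the same clampf () can be called directly (the __host__ version).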

50 __device__ Example Add x to a[i] multiple times __device__ float addmany (float a, float b, int count) { while (count--) a += b; return a; } __global__ void darradd (float *da, float x, int N) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < N) da[i] = addmany (da[i], x, 10); }

51 Kernel and Device Function Restrictions __device__ functions cannot have their address taken –e.g., f = &addmany; *f(…); For functions executed on the device: –No recursion darradd (…) { darradd (…) } –No static variable declarations inside the function darradd (…) { static int canthavethis; } –No variable number of arguments e.g., something like printf (…)

52 My first CUDA Program __global__ void arradd (float *a, float f, int N) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < N) a[i] = a[i] + f; } int main() { float h_a[N]; float *d_a; cudaMalloc ((void **) &d_a, SIZE); cudaThreadSynchronize (); cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice); arradd <<< (N + 255) / 256, 256 >>> (d_a, 10.0f, N); cudaThreadSynchronize (); cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost); CUDA_SAFE_CALL (cudaFree (d_a)); }
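For reference, a self-contained sketch of the same program that compiles with nvcc; the heap allocation, dummy initialization, launch configuration, and printout are assumptions added here, not part of the original slide:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N     (1 << 20)
#define SIZE  (N * sizeof (float))

__global__ void arradd (float *a, float f, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] + f;
}

int main (void)
{
    float *h_a = (float *) malloc (SIZE);   // heap instead of a stack array so large N works
    float *d_a;
    int threads_block = 256;
    int blocks = (N + threads_block - 1) / threads_block;

    for (int i = 0; i < N; i++) h_a[i] = (float) i;          // dummy initialization

    cudaMalloc ((void **) &d_a, SIZE);
    cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);

    arradd <<< blocks, threads_block >>> (d_a, 10.0f, N);
    cudaThreadSynchronize ();                                 // wait for the kernel to finish

    cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);
    printf ("a[0] = %f  a[N-1] = %f\n", h_a[0], h_a[N - 1]);

    cudaFree (d_a);
    free (h_a);
    return 0;
}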

53 How to get high-performance #1 Programmer managed Scratchpad memory –Bring data in from global memory –Reuse –16KB/banked –Accessed in parallel by 16 threads Programmer needs to: –Decide what to bring and when –Decide which thread accesses what and when –Coordination paramount
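A minimal sketch of the pattern (the tile size and kernel are illustrative): each block stages its slice of the array in the scratchpad (__shared__) memory, synchronizes, and then computes out of it.

#define TILE 64   // must match the number of threads per block in the launch

__global__ void scale_tiled (float *a, float fade, int N)
{
    __shared__ float tile[TILE];                 // programmer-managed scratchpad
    int i = blockIdx.x * TILE + threadIdx.x;

    if (i < N) tile[threadIdx.x] = a[i];         // bring data in from global memory
    __syncthreads ();                            // whole block waits until the tile is loaded

    if (i < N) a[i] = tile[threadIdx.x] * fade;  // compute out of shared memory, write back
}

// Launched as, e.g.:  scale_tiled <<< (N + TILE - 1) / TILE, TILE >>> (d_a, 1.5f, N);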

54 How to get high-performance #2 Global memory accesses –32 threads access memory together –Can coalesce into a single reference –E.g., a[threadID] works well Control flow –32 threads run together –If they diverge there is a performance penalty Texture cache –When you think there is locality
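A sketch contrasting the two global-memory access patterns mentioned above (illustrative kernel; assumes the array is large enough for the strided reads):

__global__ void access_patterns (float *a, float *out, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float good = a[tid];            // a[threadID]: a warp's 32 threads read 32 consecutive
                                    // words, which coalesces into few memory transactions
    float bad  = a[tid * stride];   // strided reads scatter across memory and cannot
                                    // coalesce for stride > 1, costing many transactions
    out[tid] = good + bad;
}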

55 Are GPUs really that much faster than CPUs? 50x – 200x speedups typically reported Recent work found –Not enough effort goes into optimizing code for CPUs But: –The learning curve and expertise needed for CPUs is much larger

56 ECE Overview -ECE research Profile -Personnel and budget -Partnerships with industry Our areas of expertise -Biomedical Engineering -Communications -Computer Engineering -Electromagnetics -Electronics -Energy Systems -Photonics -Systems Control -Slides from F. Najm (Chair) and T. Sargent (Research Vice Chair)

57 About our group Computer Architecture –How to build the best possible system –Best: performance, power, cost, etc. Expertise in high-end systems –Micro-architecture –Multi-processor and Multi-core systems Current Research Support: –AMD, IBM, NSERC, Qualcomm (planned) Claims to fame –Memory Dependence Prediction Commercially implemented and licensed –Snoop Filtering: IBM Blue Gene

58

59 UofT-DRDC Partnership

60

61

62 Examples of industry research contracts with ECE in the past 8 years AMD Agile Systems Inc Altera ARISE Technologies Asahi Kasei Microsystems Bell Canada Bell Mobility Cellular Bioscrypt Inc Broadcom Corporation Ciclon Semiconductor Cybermation Inc Digital Predictive Systems Inc. DPL Science Eastman Kodak Electro Scientific Industries EMS Technologies Exar Corp FOX-TEK Firan Technology Group Fuji Electric Fujitsu Gennum H2Green Energy Corporation Honeywell ASCa, Inc. Hydro One Networks Inc. IBM Canada Ltd. IBM IMAX Corporation Intel Corporation Jazz Semiconductor KT Micro LG Electronics Maxim MPB Technologies Microsoft Motorola Northrop Grumman NXP Semiconductors ON Semiconductor Ontario Lottery and Gaming Corp Ontario Power Generation Inc. Panasonic Semiconductor Singapore Peraso Technologies Inc. Philips Electronics North America Redline Communications Inc. Research in Motion Ltd. Right Track CAD Robert Bosch Corporation Samsung Thales Co., Ltd Semiconductor Research Corporation Siemens Aktiengesellschaft Sipex Corporation STMicroelectronics Inc. Sun Microsystems of Canada Inc. Telus Mobility Texas Instruments Toronto Hydro-Electric System Toshiba Corporation Xilinx Inc.

63 Eight Research Groups 1.Biomedical Engineering 2.Communications 3.Computer Engineering 4.Electromagnetics 5.Electronics 6.Energy Systems 7.Photonics 8.Systems Control

64 Computer Engineering Group Human-Computer Interaction –Willy Wong, Steve Mann Multi-sensor information systems –Parham Aarabi Computer Hardware –Jonathan Rose, Steve Brown, Paul Chow, Jason Anderson Computer Architecture –Greg Steffan, Andreas Moshovos, Tarek Abdelrahman, Natalie Enright Jerger Computer Security –David Lie, Ashvin Goel

65 Biomedical Engineering Neurosystems –Berj L. Bardakjian, Roman Genov –Willy Wong, Hans Kunov –Moshe Eizenman Rehabilitation –Milos Popovic, Tom Chau Medical Imaging –Michael Joy, Adrian Nachman –Richard Cobbold –Ofer Levi Proteomics –Brendan Frey –Kevin Truong

66 Communications Group Study of the principles, mathematics and algorithms that underpin how information is encoded, exchanged and processed Three Sub-Groups: 1.Networks 2.Signal Processing 3.Information Theory

67 Sequence Analysis

68 Image Analysis and Computer Vision Computer vision and graphics Embedded computer vision Pattern recognition and detection

69 Networks

70 Quantum Cryptography and Computing

71 Computer Engineering System Software –Michael Stumm, H-A. Jacobsen, Cristiana Amza, Baochun Li Computer-Aided Design of Circuits –Farid Najm, Andreas Veneris, Jianwen Zhu, Jonathan Rose

72 Electronics Group UofT-IBM Partnership –14 active professors; largest electronics group in Canada –Breadth of research topics: Electronic device modelling, Semiconductor technology, VLSI CAD and Systems, FPGAs, DSP and Mixed-mode ICs, Biomedical microsystems, High-speed and mm-wave ICs and SoCs –Lab for (on-wafer) SoC and IC testing through 220 GHz

73 Intelligent Sensory Microsystems –Mixed-signal VLSI circuits: Low-power, low-noise signal processing, computing and ADCs –On-chip micro-sensors: Electrical, chemical, optical –Project examples: Brain-chip interfaces, On-chip biochemical sensors, CMOS imagers

74 mm-Wave and 100+GHz systems on chip –Modelling mm-wave and noise performance of active and passive devices past 300 GHz –60-120GHz multi-gigabit data rate phased-array radios –Single-chip 76-79 GHz automotive radar –170 GHz transceiver with on-die antennas

75 Electromagnetics Group Metamaterials: From microwaves to optics –Super-resolving lenses for imaging and sensing –Small antennas –Multiband RF components –CMOS phase shifters Electromagnetics of High-Speed Circuits –Signal integrity in high-speed digital systems Microwave integrated circuit design, modeling and characterization Computational Electromagnetics –Interaction of Electromagnetic Fields with Living Tissue Antennas –Telecom and Wireless Systems –Reflectarrays –Wave electronics –Integrated antennas –Controlled-beam antennas –Adaptive and diversity antennas

76 METAMATERIALS (MTMs) Super-lens capable of resolving sub-wavelength details Small and broadband antennas Scanning antennas with CMOS MTM chips

77 Computational Electromagnetics Fast CAD for RF/ optical structures Modeling of Metamaterials Plasmonic Left-Handed Media Leaky-Wave Antennas Microstrip spiral inductor Optical power splitter

78 Energy Systems Group Power Electronics –High power (> 1.2 MW) converters modeling, control, and digital control realization –Micro-Power Grids converters for distributed resources, dc distribution systems, and HVdc systems –Low-Power Electronics Integrated power supplies and power management systems-on-chip for low-power electronics –computers, cell phones, PDAs, MP3 players, body implants –Harvesting Energy from humans

79 Energy Systems Research IC for cell phone power supplies U of T Matrix Converter for Micro-Turbine Generator Voltage Control System for Wind Power Generators

80 Photonics Group

81

82

83 Photonics Group: Bio-Photonics

84 Systems Control Group Basic & applied research in control engineering World-leading group in Control theory Optical Signal-to-Noise Ratio optimization with game theory Erbium-doped fibre amplifier design Analysis and design of digital watermarks for authentication Nonlinear control theory –application to magnetic levitation, micro positioning systems Distributed control of mobile autonomous robots –Formations, collision avoidance

