1
Chun-Yuan Lin: A Brief Introduction to GPU & CUDA
2
Introducing Myself  Name: Chun-Yuan Lin (林俊淵) (1977). Education: Ph.D., Dept. of CS, FCU. Experience: Postdoctoral Fellow, Institute of Molecular and Cellular Biology; Postdoctoral Fellow, Dept. of CS, NTHU. Research: parallel and distributed processing, parallel and distributed programming languages, algorithm analysis, information retrieval, genomics, proteomics, and bioinformatics. Contact: cyulin@mail.cgu.edu.tw, http://sslab.cs.nthu.edu.tw/~cylin/
3
My family 2015/12/16
4
Please let me know who you are
5
Introduction to Shared Memory Programming
6
Types of Parallel Computers  Two principal types: the shared memory multiprocessor and the distributed memory multicomputer. Another type: the distributed shared memory multiprocessor.
7
Shared Memory Multiprocessor System  A natural way to extend the single-processor model: have multiple processors connected to multiple memory modules, such that each processor can access any memory module. (Figure: processors connected through an interconnection network to memory modules, forming one address space.)
8
Simplistic view of a small shared memory multiprocessor  Examples: dual Pentiums, quad Pentiums. (Figure: processors attached to shared memory over a bus.)
9
Programming Shared Memory Multiprocessors  Threads: the programmer decomposes the program into individual parallel sequences (threads), each able to access variables declared outside the threads. Example: Pthreads. Alternatively, a sequential programming language with preprocessor compiler directives to declare shared variables and specify parallelism. Example: OpenMP, an industry standard that needs an OpenMP compiler.
10
Flynn's Classifications  Flynn (1966) created a classification for computers based upon instruction streams and data streams. Single instruction stream-single data stream (SISD) computer: a single-processor computer with a single stream of instructions generated from the program; the instructions operate upon a single stream of data items.
11
Multiple Instruction Stream-Multiple Data Stream (MIMD) Computer General-purpose multiprocessor system - each processor has a separate program and one instruction stream is generated from each program for each processor. Each instruction operates upon different data. Both the shared memory and the message-passing multiprocessors so far described are in the MIMD classification. 2015/12/16
12
Single Instruction Stream-Multiple Data Stream (SIMD) Computer  A specially designed computer: a single instruction stream from a single program, but multiple data streams exist. Instructions from the program are broadcast to more than one processor. Each processor executes the same instruction in synchronism, but using different data. Developed because a number of important applications mostly operate upon arrays of data.
13
Multiple Program Multiple Data (MPMD) Structure  Within the MIMD classification, each processor will have its own program to execute. (Figure: each processor has its own program and data, with a separate instruction stream feeding each processor.)
14
Single Program Multiple Data (SPMD) Structure Single source program written and each processor executes its personal copy of this program, although independently and not in synchronism. Source program can be constructed so that parts of the program are executed by certain computers and not others depending upon the identity of the computer. 2015/12/16
15
The MIMD category includes a wide class of computers. For this reason, in 1988, E. E. Johnson proposed a further classification of such machines based on their memory structure (global or distributed) and the mechanism used for communication/synchronization (shared variables or message passing). 2015/12/16
16
Programming with Shared Memory  We outline the methods of programming systems that have shared memory, including the use of processes, threads, parallel programming languages, and sequential languages with compiler directives and library routines: the standard UNIX process "fork-join" model, the IEEE thread standard Pthreads, and OpenMP, a widely accepted industry standard for parallel programming on a shared memory multiprocessor.
17
Shared memory multiprocessor system  Any memory location can be accessed by any of the processors (dual- and quad-Pentium systems are cost-effective examples). A single address space exists, meaning that each memory location is given a unique address within a single range of addresses. Generally, shared memory programming is more convenient, although it does require access to shared data to be controlled by the programmer (using critical sections, etc.).
18
Threads  The process created with UNIX fork is a "heavyweight" process; it is a completely separate program with its own variables, stack, and personal memory allocation. Heavyweight processes are expensive in time and memory space. A much more efficient mechanism is one in which independent concurrent sequences are defined within a process, so-called threads. The threads all share the same memory space and global variables of the process and are much less expensive in time and memory space than the processes themselves.
19
Each thread needs its own stack and also stores information regarding registers, but shares the code and other parts. Creation of a thread can take three orders of magnitude less time than process creation. In addition, a thread will immediately have access to shared global variables. Equally important, threads can be synchronized much more efficiently than processes. Multithreading also helps alleviate the long latency of message passing.
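To make the thread model concrete, here is a minimal Pthreads sketch (the worker function, the counter, and the thread count of 4 are illustrative, not from the slides): every thread created in the process sees the same global variable, and creation and joining are done with pthread_create and pthread_join.

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

int shared_counter = 0;                /* global: visible to every thread */

void *worker(void *arg)
{
    long id = (long)arg;               /* thread index passed by main */
    printf("thread %ld sees shared_counter = %d\n", id, shared_counter);
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_THREADS];

    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);   /* lightweight creation */

    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);                          /* wait for all threads */

    return 0;
}

Compile with cc -pthread; the exact output order depends on scheduling.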
20
Differences between a process and threads 2015/12/16
21
Creating Shared Data  The key aspect of shared memory programming is that shared memory provides the possibility of creating variables and data structures that can be accessed directly by every processor. If UNIX heavyweight processes are to share data, additional shared memory system calls are necessary; typically, each process has its own virtual address space within the virtual memory management system. It is not necessary to create shared data items explicitly when using threads: variables declared at the top of the main program (main thread) are global and are available to all threads.
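As a hedged illustration of the difference, the sketch below uses the System V shmget/shmat calls so that a parent and a forked child share one integer; with threads, an ordinary global variable would already be shared and none of these calls would be needed. The segment size and values are illustrative.

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* Heavyweight processes must create an explicit shared-memory segment. */
    int  shmid = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);
    int *x = (int *)shmat(shmid, NULL, 0);    /* map the segment into this address space */
    *x = 10;

    if (fork() == 0) {                        /* child inherits the mapping, sees the same x */
        *x += 1;
        _exit(0);
    }
    wait(NULL);
    printf("x = %d\n", *x);                   /* prints 11: the parent sees the child's update */

    shmdt(x);
    shmctl(shmid, IPC_RMID, NULL);            /* detach and remove the segment */
    return 0;
}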
22
Accessing Shared Data  Accessing shared data needs careful control (for both processes and threads); writes are the problem. Consider two processes, each of which is to add one to a shared data item, x. It is necessary for the contents of location x to be read, x + 1 computed, and the result written back to the location. If x = 10, the correct answer is 12, but the result may be 11.
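A sketch of that race using two Pthreads (the thread code is illustrative): each thread performs the read, compute, write-back sequence without protection, so the final value may be 11 instead of 12.

#include <pthread.h>
#include <stdio.h>

int x = 10;                            /* shared data item */

void *add_one(void *arg)
{
    int tmp = x;                       /* read x            */
    tmp = tmp + 1;                     /* compute x + 1     */
    x = tmp;                           /* write result back */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, add_one, NULL);
    pthread_create(&t2, NULL, add_one, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* The correct answer is 12, but if both threads read x == 10 before
       either writes back, the result is 11.                              */
    printf("x = %d\n", x);
    return 0;
}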
23
Critical Section  A mechanism for ensuring that only one process accesses a particular resource at a time is to establish sections of code involving the resource as so-called critical sections, and to arrange that only one such critical section is executed at a time (a process reaching a critical section must wait if another is already inside). This mechanism is known as mutual exclusion. The concept also appears in operating systems.
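A minimal sketch of one way to build the critical section with a Pthreads mutex (the variable names are illustrative): the lock/unlock pair guarantees that only one thread performs the read-modify-write at a time.

#include <pthread.h>

int x = 10;
pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;

void *add_one_safe(void *arg)
{
    pthread_mutex_lock(&x_lock);      /* enter critical section: mutual exclusion */
    x = x + 1;                        /* the read-modify-write can no longer interleave */
    pthread_mutex_unlock(&x_lock);    /* leave critical section */
    return NULL;
}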
24
Deadlock  Can occur with two processes when one requires a resource held by the other, and the other requires a resource held by the first.
25
Barriers  Process/thread synchronization is often needed in shared memory programs. The original Pthreads standard does not have a native barrier, so barriers have to be hand-coded using a condition variable and a mutex. A global counter variable is incremented each time a thread reaches the barrier, and all the threads are released when the counter has reached a defined number of threads. The threads are released by the last thread reaching the barrier using a broadcast signal.
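A sketch of such a hand-coded barrier (a phase counter is added here so the barrier can be reused safely; the function and variable names are illustrative). Note that later versions of the POSIX standard also provide pthread_barrier_t.

#include <pthread.h>

/* Reusable counter barrier built from a mutex and a condition variable. */
static int             arrived = 0;     /* threads that have reached the barrier    */
static int             phase   = 0;     /* bumped each time the barrier is released */
static pthread_mutex_t bar_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  bar_cond = PTHREAD_COND_INITIALIZER;

void barrier(int nthreads)
{
    pthread_mutex_lock(&bar_lock);
    int my_phase = phase;
    arrived++;
    if (arrived == nthreads) {          /* last thread to arrive releases everyone */
        arrived = 0;
        phase++;
        pthread_cond_broadcast(&bar_cond);
    } else {
        while (phase == my_phase)       /* loop guards against spurious wakeups */
            pthread_cond_wait(&bar_cond, &bar_lock);
    }
    pthread_mutex_unlock(&bar_lock);
}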
26
Language Constructs for Parallelism Shared Data Shared memory variables might be declared as shared with, say, shared int x; shared int *p; 2015/12/16
27
par Construct  For specifying concurrent statements: par { S1; S2; ... Sn; } The keyword par indicates that statements in the body are to be executed concurrently. This is instruction-level parallelism.
28
Multiple concurrent processes or threads could be specified by listing the routines that are to be executed concurrently: par { proc1; proc2; ... procn; }
29
forall Construct  To start multiple similar processes together: forall (i = 0; i < n; i++) { S1; S2; ... Sm; } which generates n processes, each consisting of the statements forming the body of the for loop, S1, S2, ..., Sm. Each process uses a different value of i.
30
Example forall (i = 0; i < 5; i++) a[i] = 0; clears a[0], a[1], a[2], a[3], and a[4] to zero concurrently. 2015/12/16
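The par and forall constructs above are generic language constructs rather than real C. As a hedged sketch, the same array-clearing loop would typically be written in C with an OpenMP directive (OpenMP is introduced below):

#include <omp.h>

void clear(int a[], int n)
{
    /* Each iteration may be executed by a different thread. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = 0;
}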
31
OpenMP  An accepted standard developed in the late 1990s by a group of industry specialists. Consists of a small set of compiler directives, augmented with a small set of library routines and environment variables, using the base languages Fortran and C/C++. The compiler directives can specify such things as the par and forall operations described previously. Several OpenMP compilers are available.
32
For C/C++, the OpenMP directives are contained in #pragma statements. The OpenMP #pragma statements have the format: #pragma omp directive_name... where omp is an OpenMP keyword. There may be additional parameters (clauses) after the directive name for different options. Some directives require code to be specified in a structured block (a statement or statements) that follows the directive; the directive and the structured block then form a "construct".
33
Parallel Directive  #pragma omp parallel structured_block creates multiple threads, each one executing the specified structured_block, which is either a single statement or a compound statement created with {...} with a single entry point and a single exit point. There is an implicit barrier at the end of the construct. The directive corresponds to the forall construct described previously.
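A minimal sketch of the parallel construct (the printed message is illustrative): each thread in the team executes the structured block once, and the implicit barrier falls at the closing brace.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        printf("hello from thread %d of %d\n", tid, omp_get_num_threads());
    }   /* implicit barrier here */
    return 0;
}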
35
Number of threads in a team  Established by either: 1. a num_threads clause after the parallel directive, or 2. the omp_set_num_threads() library routine being previously called, or 3. the environment variable OMP_NUM_THREADS. These are considered in the order given; if none of them applies, the number is system dependent. The number of threads available can also be altered automatically to achieve the best use of system resources by a "dynamic adjustment" mechanism.
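The three mechanisms, sketched below in order of precedence; the team size of 8 is arbitrary.

#include <omp.h>

void demo(void)
{
    /* 1. num_threads clause on the parallel directive itself. */
    #pragma omp parallel num_threads(8)
    { /* ... */ }

    /* 2. Library routine called before the parallel region. */
    omp_set_num_threads(8);
    #pragma omp parallel
    { /* ... */ }

    /* 3. Environment variable, set before the program runs:
          export OMP_NUM_THREADS=8                            */
}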
36
Work-Sharing  Three constructs in this classification: sections, for, and single. In all cases, there is an implicit barrier at the end of the construct unless a nowait clause is included. Note that these constructs do not start a new team of threads; that is done by an enclosing parallel construct. A sketch of the three constructs follows.
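A sketch of the three work-sharing constructs inside one parallel region (the loop, the array, and the commented-out helper calls are illustrative):

#include <omp.h>

void work(double a[], int n)
{
    #pragma omp parallel
    {
        #pragma omp for                 /* loop iterations divided among the team */
        for (int i = 0; i < n; i++)
            a[i] = 2.0 * a[i];

        #pragma omp sections            /* each section goes to a different thread */
        {
            #pragma omp section
            { /* phase_one(); */ }
            #pragma omp section
            { /* phase_two(); */ }
        }

        #pragma omp single nowait       /* executed by one thread; nowait skips the barrier */
        { /* log_progress(); */ }
    }
}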
37
Shared Memory Programming Performance Issues 2015/12/16
38
Shared Data in Systems with Caches  All modern computer systems have cache memory: high-speed memory closely attached to each processor for holding recently referenced data and code. Cache coherence protocols: Update policy - copies of data in all caches are updated at the time one copy is altered. Invalidate policy - when one copy of data is altered, the same data in any other cache is invalidated (by resetting a valid bit in the cache); these copies are only updated when the associated processor makes reference to them.
39
False Sharing  Different parts of a cache block may be required by different processors, but not the same bytes. If one processor writes to one part of the block, copies of the complete block in other caches must be updated or invalidated even though the actual data is not shared. A padding sketch follows.
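A common remedy is to pad per-thread data so that each item occupies its own cache block; the sketch below assumes a 64-byte block size and one counter per thread (both assumptions, not from the slides).

#define CACHE_LINE 64    /* assumed cache-block size in bytes */

/* Without padding, counters for different threads can land in the same block,
   so every write invalidates the other processors' copies (false sharing).    */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];   /* push the next element into a new block */
};

struct padded_counter counts[16];          /* one element per thread */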
40
Critical Sections Serializing Code  High-performance programs should have as few critical sections as possible, since their use can serialize the code. Suppose all processes happen to reach their critical sections together; they will execute their critical sections one after the other. In that situation, the execution time becomes almost that of a single processor.
41
What is a GPU? Graphics Processing Units
42
The Challenge  Render infinitely complex scenes, at extremely high resolution, in 1/60th of one second. Luxo Jr., 1985, took 2-3 hours per frame to render on a Cray-1 supercomputer. Today we can easily render that in 1/30th of one second: over 300,000x faster. Still not even close to where we need to be... but look how far we've come! © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
43
PC/DirectX Shader Model Timeline (1998-2004)  DirectX 5: Riva 128. DirectX 6 (multitexturing): Riva TNT. DirectX 7 (T&L, TextureStageState): GeForce 256. DirectX 8 (SM 1.x): GeForce 3, Cg. DirectX 9 (SM 2.0): GeForceFX. DirectX 9.0c (SM 3.0): GeForce 6. Representative titles of the period: Half-Life, Quake 3, Giants, Halo, Far Cry, UE3. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
44
Why Massively Parallel Processors  A quiet revolution and potential build-up. Calculation: 367 GFLOPS vs. 32 GFLOPS. Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s. Until last year, programmed through the graphics API. A GPU in every PC and workstation: massive volume and potential impact. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
46
GeForce 8800  16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to the CPU. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
47
G80 Characteristics  367 GFLOPS peak performance (25-50 times that of current high-end microprocessors); 265 GFLOPS sustained for apps such as VMD. Massively parallel: 128 cores, 90 W. Massively threaded: sustains 1000s of threads per app. 30-100 times speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics. "I think they're right on the money, but the huge performance differential (currently 3 GPUs ~= 300 SGI Altix Itanium2s) will invite close scrutiny so I have to be careful what I say publically until I triple check those numbers." - John Stone, VMD group, Physics UIUC. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
48
Objective  To understand the major factors that dictate performance when using the GPU as a compute accelerator for the CPU: the feeds and speeds of the traditional CPU world, and the feeds and speeds when employing a GPU. To form a solid knowledge base for performance programming on modern GPUs. Knowing yesterday, today, and tomorrow: the PC world is becoming flatter, and outsourcing of computation is becoming easier... © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
49
Future Apps Reflect a Concurrent World  Exciting applications in the future mass computing market have traditionally been considered "supercomputing applications": molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products. These "super-apps" represent and model the physical, concurrent world. Various granularities of parallelism exist, but the programming model must not hinder parallel implementation, and data delivery needs careful management. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
50
Stretching from Both Ends for the Meat  New GPUs cover the massively parallel parts of applications better than CPUs. Attempts to grow current CPU architectures "out", or domain-specific architectures "in", have lacked success. Using a strong combination of the two on applications is a compelling idea: CUDA. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
51
Bandwidth – Gravity of Modern Computer Systems  The bandwidth between key components ultimately dictates system performance. This is especially true for massively parallel systems processing massive amounts of data. Tricks like buffering, reordering, and caching can temporarily defy the rules in some cases, but ultimately the performance falls back to what the "speeds and feeds" dictate. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
52
Classic PC architecture  The northbridge connects three components that must communicate at high speed: the CPU, DRAM, and video. Video also needs first-class access to DRAM. Previous NVIDIA cards were connected to AGP, with up to 2 GB/s transfers. The southbridge serves as a concentrator for slower I/O devices. (Figure: CPU and core-logic chipset.) © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
53
PCI Bus Specification  Connected to the southbridge. Originally 33 MHz, 32-bit wide, 132 MB/second peak transfer rate; more recently 66 MHz, 64-bit, 512 MB/second peak. Upstream bandwidth remains slow for devices (256 MB/s peak). Shared bus with arbitration: the winner of arbitration becomes bus master and can connect to the CPU or DRAM through the southbridge and northbridge. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
54
An Example of Physical Reality Behind CUDA  CPU (host) and GPU with local DRAM (device). The northbridge handles "primary" PCIe to the video/GPU and to DRAM. PCIe x16 bandwidth is 8 GB/s (4 GB/s in each direction). © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
55
Graphics Processing Unit
56
Parallel Computing on a GPU  NVIDIA GPU computing architecture, accessed via a separate HW interface; available in laptops, desktops, workstations, and servers (G80 to G200). 8-series GPUs deliver 50 to 200 GFLOPS on compiled parallel C applications. Programmable in C with the CUDA tools. The multithreaded SPMD model uses application data parallelism and thread parallelism. Products: Tesla C870, Tesla D870, Tesla S870, Tesla C1060 (about 1 TFLOPS), Tesla S1070. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
57
TESLA S1070  NVIDIA® Tesla™ S1070: a 4-teraflop 1U system.
58
What is GPGPU?  General-purpose computation using the GPU in applications other than 3D graphics; the GPU accelerates the critical path of the application. Data-parallel algorithms leverage GPU attributes: large data arrays, streaming throughput, fine-grain SIMD parallelism, low-latency floating point (FP) computation. Applications (see GPGPU.org): game effects (FX) physics, image processing, physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
59
DirectX 5 / OpenGL 1.0 and Before Hardwired pipeline Inputs are DIFFUSE, FOG, TEXTURE Operations are SELECT, MUL, ADD, BLEND Blended with FOG RESULT = (1.0-FOG)*COLOR + FOG*FOGCOLOR Example Hardware RIVA 128, Voodoo 1, Reality Engine, Infinite Reality No “ops”, “stages”, programs, or recirculation © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign 2015/12/16
60
The 3D Graphics Pipeline  (Figure: Application and Scene Management run on the host; Geometry, Rasterization, Pixel Processing, and ROP/FBI/Display run on the GPU, writing to Frame Buffer Memory.) © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
61
The GeForce Graphics Pipeline  (Figure: Host, Vertex Control with Vertex Cache, VS/T&L, Triangle Setup, Raster, Shader with Texture Cache, ROP, FBI, Frame Buffer Memory.) © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
62
Traditional Graphics Pipeline vs. Unified Shader Pipeline (G80 onward)
63
Feeding the GPU GPU accepts a sequence of commands and data Vertex positions, colors, and other shader parameters Texture map images Commands like “draw triangles with the following vertices until you get a command to stop drawing triangles”. Application pushes data using Direct3D or OpenGL GPU can pull commands and data from system memory or from its local memory © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign 2015/12/16
64
CUDA  "Compute Unified Device Architecture". A general-purpose programming model: the GPU is a dedicated, super-threaded, massively data-parallel co-processor. A targeted software stack: compute-oriented drivers, language, and tools. A driver for loading computation programs into the GPU: a standalone driver optimized for computation, an interface designed for compute (a graphics-free API), guaranteed maximum download and readback speeds, and explicit GPU memory management. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
65
GeForce-8 Series HW Overview  (Figure: the Streaming Processor Array is made up of Texture Processor Clusters (TPC), each containing a TEX unit and Streaming Multiprocessors (SM); each SM has instruction fetch/dispatch, instruction L1 and data L1 caches, shared memory, Streaming Processors (SP), and SFUs.)
66
CUDA Programming Model: A Highly Multithreaded Coprocessor The GPU is viewed as a compute device that: Is a coprocessor to the CPU or host Has its own DRAM (device memory) Runs many threads in parallel Data-parallel portions of an application are executed on the device as kernels which run in parallel on many threads Differences between GPU and CPU threads GPU threads are extremely lightweight Very little creation overhead GPU needs 1000s of threads for full efficiency Multi-core CPU needs only a few © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign 2015/12/16
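A minimal CUDA C sketch of a data-parallel kernel and its launch (the scaling operation, the block size of 256, and the names are illustrative): the __global__ function runs on the device, with one lightweight thread per array element.

#include <cuda_runtime.h>

__global__ void scale(float *d_a, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        d_a[i] = s * d_a[i];                          // one element per thread
}

void scale_on_device(float *d_a, float s, int n)
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;        // enough blocks to cover n elements
    scale<<<blocks, threads>>>(d_a, s, n);            // kernel launch from the host
    cudaDeviceSynchronize();                          // wait for the kernel to finish
}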
67
Thread Batching: Grids and Blocks  A kernel is executed as a grid of thread blocks; all threads share the data memory space. A thread block is a batch of threads that can cooperate with each other by synchronizing their execution and by efficiently sharing data through a low-latency shared memory. Two threads from two different blocks cannot cooperate. (Figure: the host launches Kernel 1 and Kernel 2 on the device; each kernel runs as a grid of blocks such as Block (0,0) through Block (2,1), and each block contains a 2-D arrangement of threads such as Thread (0,0) through Thread (4,2).) Courtesy: NVIDIA. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
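A sketch of how a 2-D grid of 2-D blocks is configured and how each thread recovers its position; the 16x16 block size and the kernel body are illustrative.

__global__ void clear2d(float *out, int width, int height)
{
    // Position of this thread within the whole grid.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = 0.0f;
}

void launch_clear2d(float *d_out, int width, int height)
{
    dim3 block(16, 16);                                   // threads per block
    dim3 grid((width  + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);          // blocks per grid
    clear2d<<<grid, block>>>(d_out, width, height);
}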
68
CUDA Device Memory Space Overview  Each thread can: read/write per-thread registers; read/write per-thread local memory; read/write per-block shared memory; read/write per-grid global memory; read per-grid constant memory; read per-grid texture memory. The host can read/write the global, constant, and texture memories. (Figure: each block has shared memory; each thread has registers and local memory; global, constant, and texture memories span the grid and are accessible from the host.) © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
69
Global, Constant, and Texture Memories (Long-Latency Accesses)  Global memory: the main means of communicating read/write data between host and device; its contents are visible to all threads. Texture and constant memories: constants are initialized by the host; their contents are visible to all threads. (Figure: same memory hierarchy as the previous slide.) Courtesy: NVIDIA. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
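From the host side, this boils down to allocating global memory, copying data across the PCIe link, and initializing constant memory; a hedged sketch (the buffer names, sizes, and coefficient array are illustrative):

#include <cuda_runtime.h>

__constant__ float d_coeff[4];          // per-grid constant memory, read-only to kernels

void roundtrip(const float *h_in, float *h_out, const float coeff[4], int n)
{
    float *d_buf;
    size_t bytes = n * sizeof(float);

    cudaMalloc(&d_buf, bytes);                                 // per-grid global memory
    cudaMemcpy(d_buf, h_in, bytes, cudaMemcpyHostToDevice);    // host -> device
    cudaMemcpyToSymbol(d_coeff, coeff, 4 * sizeof(float));     // host initializes constants

    // ... launch kernels that read d_coeff and read/write d_buf ...

    cudaMemcpy(h_out, d_buf, bytes, cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(d_buf);
}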
70
SIMT  SIMT (single instruction, multiple threads): execution alternates between serial code on the CPU and parallel kernels launched on the GPU (serial code, parallel kernel, serial code, and so on).
71
What is Behind such an Evolution?  The GPU is specialized for compute-intensive, highly data-parallel computation (exactly what graphics rendering is about), so more transistors can be devoted to data processing rather than data caching and flow control. (Figure: die-area comparison; the CPU devotes a large share of its area to control logic and cache, while the GPU devotes it to ALUs.) © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL1, University of Illinois, Urbana-Champaign
72
CPU vs. GPU
73
Resources  CUDA ZONE: http://www.nvidia.com.tw/object/cuda_home_tw.html  CUDA Courses: http://www.nvidia.com.tw/object/cuda_university_courses_tw.html