Dr A Sahu, Dept of Comp Sc & Engg, IIT Guwahati
– Graphics System
– GPU Architecture
– Memory Model: Vertex Buffer, Texture Buffer
– GPU Programming Model: DirectX, OpenGL, OpenCL
– GP-GPU Programming: Introduction to NVIDIA CUDA Programming
[Figure: the graphics pipeline. 3D application → 3D API (OpenGL / Direct3D) → API commands across the CPU–GPU boundary → GPU command & data stream → Programmable Vertex Processor (pretransformed vertices → transformed vertices) → Primitive Assembly (vertex index stream → assembled polygons, lines & points) → Rasterisation & Interpolation (pixel location stream → rasterised pretransformed fragments) → Programmable Fragment Processors (→ transformed fragments) → Raster Operations (pixel updates) → Frame Buffer]
[Figure: Vertices (x, y, z) → Vertex Processing (Vertex Shader) → Pixel Processing (Pixel Shader) → Pixels (R, G, B); the shaders read Texture Memory and write the Frame Buffer, both part of the Memory System]
Primitives are processed in a series of stages; each stage forwards its result on to the next stage. The pipeline can be drawn and implemented in different ways: some stages may be in hardware, others in software. Optimizations & additional programmability are available at some stages.
Stages: Modeling Transformations → Illumination (Shading) → Viewing Transformation (Perspective / Orthographic) → Clipping → Projection (to Screen Space) → Scan Conversion (Rasterization) → Visibility / Display
Graphics pipeline (simplified): IN (object space) → Vertex Shader → (window space) → Pixel Shader → OUT (framebuffer); both shaders can sample Textures.
The computing capacity of graphics processing units (GPUs) has improved exponentially in the recent decade. NVIDIA released the CUDA programming model for its GPUs; the CUDA programming environment applies the parallel processing capabilities of GPUs to, for example, medical image processing research.
– CUDA cores: 480 (Compute Unified Device Architecture)
– Microsoft DirectX 11 support
– 3D Vision Surround ready
– Interactive ray tracing
– 3-way SLI technology
– PhysX technology
– CUDA technology
– 32x anti-aliasing technology
– PureVideo HD technology
– PCI Express 2.0 support
– Dual-link DVI support, HDMI 1.4
This generation is the first generation of fully-programmable graphics cards. Different versions have different resource limits on fragment/vertex programs.
[Figure: AGP → Vertex Transforms (programmable vertex shader) → Primitive Assembly → Rasterization and Interpolation (programmable fragment processor) → Raster Operations → Frame Buffer]
Writing assembly is
– Painful
– Not portable
– Not optimizable
High-level shading languages solve these: Cg, HLSL
CPU and GPU memory hierarchy
– CPU side: Disk → CPU Main Memory → CPU Caches → CPU Registers
– GPU side: GPU Video Memory → GPU Caches → GPU Temporary Registers, GPU Constant Registers
GPU memory model: much more restricted memory access
– Allocate/free memory only before computation
– Limited memory access during computation (kernel)
Registers: read/write
Local memory: does not exist
Global memory: read-only during computation; write-only at end of computation (pre-computed address)
Disk access: does not exist
CPU memory model: at any program point
– Allocate/free local or global memory
– Random memory access
Registers: read/write
Local memory: read/write to stack
Global memory: read/write to heap
Disk: read/write to disk
Where is GPU data stored?
– Vertex buffer
– Frame buffer
– Texture
[Figure: VS 3.0 GPUs: Vertex Buffer → Vertex Processor → Rasterizer → Fragment Processor → Frame Buffer(s); Texture is readable by both processors]
Each GPU memory type supports a subset of the following operations, split across two interfaces: the CPU interface and the GPU interface.
CPU interface
– Allocate
– Free
– Copy CPU → GPU
– Copy GPU → CPU
– Copy GPU → GPU
– Bind for read-only vertex stream access
– Bind for read-only random access
– Bind for write-only framebuffer access
GPU (shader/kernel) interface
– Random-access read
– Stream read
Vertex buffers
– GPU memory for vertex data
– Vertex data is required to initiate a render pass
Supported operations
– CPU interface: Allocate; Free; Copy CPU → GPU; Copy GPU → GPU (render-to-vertex-array); Bind for read-only vertex stream access
– GPU interface: Stream read (vertex program only)
Limitations
– CPU: no copy GPU → CPU; no bind for read-only random access; no bind for write-only framebuffer access
– GPU: no random-access reads; no access from fragment programs
Textures: random-access GPU memory (readable by both the vertex and fragment processors on VS 3.0 GPUs)
Supported operations
– CPU interface: Allocate; Free; Copy CPU → GPU; Copy GPU → CPU; Copy GPU → GPU (render-to-texture); Bind for read-only random access (vertex or fragment); Bind for write-only framebuffer access
– GPU interface: Random read
Frame buffers: memory written by the fragment processor; write-only GPU memory
Fixed-function pipeline
– Made early games look fairly similar
– Little freedom in rendering: "one way to do things", e.g. glShadeModel(GL_SMOOTH);
Different render methods
– Triangle rasterization proved to be very efficiently implemented in hardware
– Ray tracing and voxels produce nice results, but are very slow and require large amounts of memory
DirectX before version 8 was entirely fixed-function. OpenGL before version 2.0 was entirely fixed-function: extensions were often added for different effects, but there was no real programmability on the GPU.
OpenGL is just a specification: vendors must implement the specification, but on whatever platform they wish. DirectX is a library, Windows only; Direct3D is its graphics component.
Direct3D 8.0 (2000) and OpenGL 2.0 (2004) added support for assembly-language programming of vertex and fragment shaders (NVIDIA GeForce 3, ATI Radeon 8000). Direct3D 9.0 (2002) added HLSL (High Level Shader Language) for much easier programming of GPUs (NVIDIA GeForce FX 5000, ATI Radeon 9000). There were only minor increments on this for a long time, with more capabilities being added to shaders.
Vertex data is sent in by the graphics API (mostly OpenGL or DirectX), processed in a vertex program ("vertex shader"), rasterized into pixels, and then processed in a "fragment shader".
[Figure: Vertex Data → Vertex Shader → Rasterize to Pixels → Fragment Shader → Output]
No longer any need to write shaders in assembly: GLSL, HLSL, and Cg offer C-style programming languages. Write two main() functions, which are executed on each vertex/pixel; declare auxiliary functions and local variables; output by setting position and color.
Prior to Direct3D 10 / GeForce 8000 / Radeon 2000, vertex and fragment shaders were executed on separate hardware. Direct3D 10 (with Vista) brought shader unification and added geometry shaders: GPUs now use the same 'cores' to run geometry/vertex/fragment shader code. CUDA came out alongside the GeForce 8000 line, allowing the 'cores' to run general C code rather than being restricted to graphics APIs.
[Figure: 3D geometric primitives → GPU programmable unified processors (vertex, geometry, pixel, and compute programs) plus rasterization and hidden-surface removal → GPU memory (DRAM) → final image]
CUDA was the first to drop the graphics API, allowing the GPU to be treated as a coprocessor to the CPU.
– Linear memory accesses (no more buffer objects)
– Run thousands of threads on separate scalar cores (with limitations)
– High theoretical/achieved performance for data-parallel applications
ATI has the Stream SDK: closer to assembly-language programming for Stream.
Apple announced the OpenCL initiative in 2008.
– Officially owned by the Khronos Group, the same body that controls OpenGL
– Released in 2009, with support from NVIDIA/ATI
– Another specification for parallel programming, not entirely specific to GPUs (support for CPU SSE instructions, etc.)
DirectX 11 (and a Direct3D 10 extension) add DirectCompute shaders: a similar idea to OpenCL, just tied in with Direct3D.
DirectX 11 also adds multithreaded rendering and tessellation stages to the pipeline.
– Two new shader stages in the unified pipeline: Hull and Domain shaders
– Allow high-detail geometry to be created on the GPU, rather than flooding the PCI-E bus with geometry data
– More programmable geometry
OpenGL 4 (specification just released) is close to feature parity with Direct3D 11; namely, it also adds tessellation.
The newest GPUs have incredible compute power: 1–3 TFLOPS and 100+ GB/s memory access bandwidth. More parallel constructs: high-speed atomic operations, more control over thread interaction/synchronization. Becoming easier to program: NVIDIA's 'Fermi' architecture has support for C++ code, 64-bit pointers, etc. GPU computing is starting to go mainstream: Photoshop CS5, video encode/decode, physics/fluid simulation, etc.
GPUs are fast…
– 3.0 GHz dual-core Pentium 4: 24.6 GFLOPS
– NVIDIA GeForce 7800: 165 GFLOPS
– 1066 MHz FSB Pentium Extreme Edition: 8.5 GB/s
– ATI Radeon X850 XT Platinum Edition: 37.8 GB/s
GPUs are getting faster, faster
– CPUs: 1.4× annual growth
– GPUs: 1.7× (pixels) to 2.3× (vertices) annual growth
Modern GPUs are deeply programmable
– Programmable pixel, vertex, and video engines
– Solidifying high-level language support
Modern GPUs support high precision
– 32-bit floating point throughout the pipeline
– High enough for many (not all) applications
GPUs are designed for & driven by video games
– Programming model is unusual
– Programming idioms are tied to computer graphics
– Programming environment is tightly constrained
The underlying architectures are
– Inherently parallel
– Rapidly evolving (even in basic feature set!)
– Largely secret
You can't simply "port" CPU code!
The application specifies the geometry to be rasterized. Each fragment is shaded with a SIMD program; shading can use values from texture memory, and the resulting image can be used as a texture on future passes.
Interpreted for computation: draw a screen-sized quad (the stream), run a SIMD kernel over each fragment, "gather" is permitted from texture memory, and the resulting buffer can be treated as a texture on the next pass.
CUDA
– Introduced in November of 2006
– Turns the GPU into a general-purpose processor
– Required hardware changes: only available on G80 or later GPUs (GeForce 8000 series or newer)
– Implemented as an extension to C/C++, resulting in a lower learning curve
16 Streaming Multiprocessors (SMs)
– Each one has 8 Streaming Processors (SPs)
– Each SM can execute 32 threads simultaneously
– 512 threads execute per cycle (16 SMs × 32 threads)
– SPs hide instruction latencies
768 MB DRAM
– 86.4 GB/s memory bandwidth to the GPU cores
– 4 GB/s memory bandwidth with system memory
[Figure: Host → Input Assembler → Thread Execution Manager → array of SMs, each with a parallel data cache and texture unit; load/store paths to Global Memory]
CUDA execution model: starts with a kernel. A kernel is a function, called from the host, that executes on the GPU. Thread resources are abstracted into 3 levels:
– Grid: the highest level
– Block: a collection of threads
– Thread: the execution unit
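The grid/block/thread hierarchy can be sketched as follows; this is a minimal illustrative example (the kernel name `scale` and its parameters are placeholders, not from the slides):

```cuda
// Each thread computes one element; blockIdx/threadIdx come from
// the grid/block configuration chosen at launch time.
__global__ void scale(float *data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the tail block
        data[i] *= s;
}

// Host side: a grid of blocks, each block a collection of threads.
// scale<<<numBlocks, threadsPerBlock>>>(d_data, 2.0f, n);
```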
768 MB global memory
– Accessible to all threads globally
– 86.4 GB/s throughput
16 KB shared memory per SM
– Accessible to all threads within a block
– 384 GB/s throughput
32 KB register file per SM
– Allocated to threads at runtime (local variables)
– 384 GB/s throughput
– Threads can only see their own registers
[Figure: CUDA memory model: a Grid contains Blocks (0,0) and (1,0), each with its own Shared Memory; each Thread has its own Registers; all blocks access Global Memory, which the Host also reads and writes]
(From a C/C++ function)
– Allocate memory on the CUDA device
– Copy data to the CUDA device
– Configure thread resources: grid layout (max 65536 × 65536) and block layout (3-dimensional, max of 512 threads)
– Execute the kernel with those thread resources
– Copy data out of the CUDA device
– Free memory on the CUDA device
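The host-side steps above can be sketched in CUDA as below; `h_in`, `h_out`, `d_in`, `d_out`, and `myKernel` are placeholder names, and error checking is omitted for brevity:

```cuda
float *d_in, *d_out;
size_t bytes = n * sizeof(float);

cudaMalloc(&d_in,  bytes);               // allocate on the device
cudaMalloc(&d_out, bytes);
cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);   // copy in

dim3 grid(64);     // grid layout (up to 65536 x 65536 blocks)
dim3 block(256);   // block layout (3-D, at most 512 threads total)
myKernel<<<grid, block>>>(d_in, d_out, n);  // execute the kernel

cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost); // copy out
cudaFree(d_in);                          // free device memory
cudaFree(d_out);
```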
Multiply matrices M and N to form result R. General algorithm: for each row i in matrix R, for each column j in matrix R, cell (i, j) = dot product of row i of M and column j of N. The algorithm runs in O(length³).
Each thread represents one cell (i, j) and calculates the value for that cell. Use a single block. Should run in O(length) time: much better than O(length³).
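A sketch of this single-block scheme (the kernel name and `Width` parameter are illustrative): thread (tx, ty) within the one block computes cell (ty, tx) of R, and each thread does O(length) work for its dot product.

```cuda
__global__ void matrixMulKernel(float *M, float *N, float *R, int Width)
{
    int i = threadIdx.y;   // row of R handled by this thread
    int j = threadIdx.x;   // column of R handled by this thread
    float sum = 0.0f;
    for (int k = 0; k < Width; ++k)   // dot product of row i and column j
        sum += M[i * Width + k] * N[k * Width + j];
    R[i * Width + j] = sum;
}

// Launch with a single block of Width x Width threads:
// matrixMulKernel<<<1, dim3(Width, Width)>>>(d_M, d_N, d_R, Width);
```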
[Figure: matrices M and N multiplied to form P; all are WIDTH × WIDTH]
The max threads allowed per block is 512, so this only supports a max matrix size of 22 × 22: 22 × 22 = 484 threads are needed, which fits, while 23 × 23 = 529 would not.
Split the result matrix into smaller blocks. This utilizes more SMs than the single-block approach and gives better speed-up.
[Figure: tiled matrix multiplication: Pd is divided into TILE_WIDTH × TILE_WIDTH sub-blocks Pdsub; a block (bx, by) of threads (tx, ty), with tx, ty in 0 … TILE_WIDTH−1, computes one sub-block from strips of Md and Nd]
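The tiled scheme in the figure can be sketched as below: each block computes one TILE_WIDTH × TILE_WIDTH sub-block of Pd, staging tiles of Md and Nd in shared memory to cut global-memory traffic. This is an assumed implementation (it takes Width to be a multiple of TILE_WIDTH):

```cuda
#define TILE_WIDTH 16

__global__ void tiledMatMul(float *Md, float *Nd, float *Pd, int Width)
{
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];  // tile of Md
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];  // tile of Nd

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < Width / TILE_WIDTH; ++t) {
        // Threads of the block cooperatively load one tile of each input.
        Ms[threadIdx.y][threadIdx.x] = Md[row * Width + t * TILE_WIDTH + threadIdx.x];
        Ns[threadIdx.y][threadIdx.x] = Nd[(t * TILE_WIDTH + threadIdx.y) * Width + col];
        __syncthreads();                   // wait until the tiles are loaded

        for (int k = 0; k < TILE_WIDTH; ++k)
            sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();                   // wait before overwriting the tiles
    }
    Pd[row * Width + col] = sum;
}
```

Each element of Md and Nd is now read from global memory only Width / TILE_WIDTH times instead of Width times, which is why shared memory pays off here.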
Runs 10 times as fast as the serial approach. The solution runs at 21.4 GFLOPS, but the GPU is capable of 384 GFLOPS. What gives?
Each block is assigned to an SP (8 SPs to 1 SM). An SM executes a single SP at a time, and switches SPs when a long-latency operation is found (works similar to Intel's Hyper-Threading). An SM executes a batch of 32 threads at a time; a batch of 32 threads is called a warp.
Global memory bandwidth is 86.4 GB/s; shared memory bandwidth is 384 GB/s; register file bandwidth is 384 GB/s. The key is to use shared memory and registers whenever possible.
Each SM has 16 KB of shared memory and a 32 KB register file. Local variables in a function take up registers. The register file must support all threads in the SM: if there are not enough registers, then fewer blocks are scheduled. The program still executes, but less parallelism occurs.
An SM can only handle 768 threads. An SM can handle 8 blocks, 1 block for each SP. Each block can therefore have up to 96 threads (8 × 96 = 768) to max out the SM's resources.
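Under these G80-era limits, a launch configuration that fills an SM might be sketched as follows (`someKernel` and `n` are placeholders): 8 resident blocks × 96 threads = 768 threads, the SM maximum.

```cuda
dim3 block(96);               // 96 threads per block
dim3 grid((n + 95) / 96);     // enough blocks to cover n elements
// someKernel<<<grid, block>>>(...);
```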
Intel's new approach to a GPU: considered to be a hybrid between a multi-core CPU and a GPU, combining the functions of a multi-core CPU with the functions of a GPU.