1
General-Purpose Computation on Graphics Hardware David Luebke University of Virginia
2
Course Introduction The GPU on commodity video cards has evolved into an extremely flexible and powerful processor –Programmability –Precision –Power We are interested in how to harness that power for general-purpose computation
3
Motivation: Computational Power
GPUs are fast…
–3 GHz Pentium4 theoretical: 6 GFLOPS, 5.96 GB/sec peak
–GeForceFX 5900 observed: 20 GFLOPS, 25.3 GB/sec peak
–GeForce 6800 Ultra observed: 53 GFLOPS, 35.2 GB/sec peak
–GeForce 8800 GTX: estimated at 520 GFLOPS, 86.4 GB/sec peak
That’s almost 100 times faster than a 3 GHz Pentium4!
GPUs are getting faster, faster
–CPUs: annual growth 1.5×, decade growth 60×
–GPUs: annual growth > 2.0×, decade growth > 1000×
Courtesy Kurt Akeley, Ian Buck & Tim Purcell, GPU Gems (see course notes)
4
Motivation: Computational Power
[Chart: GPU vs. CPU performance over time]
Courtesy Naga Govindaraju
5
Motivation: Computational Power
[Chart: programmable GFLOPS (multiplies per second) for NVIDIA NV30/35/40 and ATI R300/360/420 vs. Pentium 4, July 2001 – Jan 2004]
Courtesy Ian Buck
6
An Aside: Computational Power
Why are GPUs getting faster so fast?
–Arithmetic intensity: the specialized nature of GPUs makes it easier to spend additional transistors on computation rather than cache
–Economics: the multi-billion-dollar video game market is a pressure cooker that drives innovation
7
Motivation: Flexible and precise Modern GPUs are deeply programmable –Programmable pixel, vertex, video engines –Solidifying high-level language support Modern GPUs support high precision –32 bit floating point throughout the pipeline –High enough for many (not all) applications
8
Motivation: The Potential of GPGPU The power and flexibility of GPUs makes them an attractive platform for general-purpose computation Example applications range from in-game physics simulation to conventional computational science Goal: make the inexpensive power of the GPU available to developers as a sort of computational coprocessor
9
The Problem: Difficult To Use GPUs designed for and driven by video games –Programming model is unusual & tied to computer graphics –Programming environment is tightly constrained Underlying architectures are: –Inherently parallel –Rapidly evolving (even in basic feature set!) –Largely secret Can’t simply “port” code written for the CPU!
10
GPU Fundamentals: The Graphics Pipeline
A simplified graphics pipeline
–Note that pipe widths vary
–Many caches, FIFOs, and so on not shown
[Diagram: CPU (Application, Graphics State) feeds the GPU pipeline Transform → Rasterizer → Shade, with Video Memory (Textures) attached; data flows as Vertices (3D) → Xformed, Lit Vertices (2D) → Fragments (pre-pixels) → Final pixels (Color, Depth), plus a Render-to-texture path back into memory]
11
GPU Fundamentals: The Modern Graphics Pipeline
Programmable vertex processor!
Programmable pixel processor!
[Diagram: the same pipeline with Transform replaced by a programmable Vertex Processor and Shade replaced by a programmable Fragment Processor]
12
GPU Pipeline: Transform Vertex Processor (multiple operate in parallel) –Transform from “world space” to “image space” –Compute per-vertex lighting
13
GPU Pipeline: Rasterizer Rasterizer –Convert geometric rep. (vertex) to image rep. (fragment) Fragment = image fragment –Pixel + associated data: color, depth, stencil, etc. –Interpolate per-vertex quantities across pixels
14
GPU Pipeline: Shade Fragment Processors (multiple in parallel) –Compute a color for each pixel –Optionally read colors from textures (images)
15
Importance of Data Parallelism GPU: Each vertex / fragment is independent –Temporary registers are zeroed –No static data –No read-modify-write buffers Data parallel processing –Best for ALU-heavy architectures: GPUs Multiple vertex & pixel pipelines –Hide memory latency (with more computation) Courtesy of Ian Buck
16
Arithmetic Intensity Lots of ops per word transferred GPGPU demands high arithmetic intensity for peak performance –Ex: solving systems of linear equations –Physically-based simulation on lattices –All-pairs shortest paths Courtesy of Pat Hanrahan
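The "ops per word transferred" ratio can be made concrete with a quick calculation. The saxpy example below is ours, not the slide's, and is only a rough sketch: it counts one multiply and one add against three 4-byte memory accesses per element.

```c
#include <assert.h>

/* Arithmetic intensity = arithmetic operations per byte of memory
 * traffic. Illustrative only: saxpy (y[i] = a*x[i] + y[i]) does
 * 2 flops but moves 12 bytes (read x, read y, write y), so its
 * intensity is low (~0.17 flops/byte) -- a poor fit for GPGPU,
 * unlike the lattice and linear-algebra examples above. */
static double arithmetic_intensity(double flops, double bytes) {
    return flops / bytes;
}

/* Per-element cost of single-precision saxpy. */
static double saxpy_intensity(void) {
    double flops = 2.0;                 /* one multiply, one add   */
    double bytes = 3.0 * sizeof(float); /* load x, load y, store y */
    return arithmetic_intensity(flops, bytes);
}
```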
17
Data Streams & Kernels Streams –Collection of records requiring similar computation Vertex positions, Voxels, FEM cells, etc. –Provide data parallelism Kernels –Functions applied to each element in stream Transforms, PDE, … –No dependencies between stream elements Encourage high arithmetic intensity Courtesy of Ian Buck
18
Example: Simulation Grid Common GPGPU computation style –Textures represent computational grids = streams Many computations map to grids –Matrix algebra –Image & Volume processing –Physical simulation –Global Illumination ray tracing, photon mapping, radiosity Non-grid streams can be mapped to grids
19
Stream Computation Grid Simulation algorithm –Made up of steps –Each step updates entire grid –Must complete before next step can begin Grid is a stream, steps are kernels –Kernel applied to each stream element
20
Scatter vs. Gather Grid communication –Grid cells share information
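The distinction can be sketched on the CPU: a gather kernel reads from computed addresses, a scatter kernel writes to them. The 3-point stencil below is our own illustration (not from the slides); both functions compute the same result, but only the gather form maps directly onto a fragment processor, whose output address is fixed.

```c
#include <assert.h>
#include <string.h>

/* Gather: each output cell READS its neighbors -- the natural
 * fragment-processor pattern (random texture reads).
 * out[i] = in[i-1] + in[i] + in[i+1], clamped at the edges. */
static void gather_step(const float *in, float *out, int n) {
    for (int i = 0; i < n; i++) {
        float left  = in[i > 0 ? i - 1 : i];
        float right = in[i < n - 1 ? i + 1 : i];
        out[i] = left + in[i] + right;
    }
}

/* Scatter: each input cell WRITES into its neighbors -- easy on a
 * CPU, awkward on a fragment processor. Same stencil, computed by
 * distributing each value to the cells it contributes to. */
static void scatter_step(const float *in, float *out, int n) {
    memset(out, 0, n * sizeof(float));
    for (int i = 0; i < n; i++) {
        if (i > 0)     out[i - 1] += in[i]; else out[i] += in[i];
        out[i] += in[i];
        if (i < n - 1) out[i + 1] += in[i]; else out[i] += in[i];
    }
}
```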
21
Computational Resources Inventory Programmable parallel processors –Vertex & Fragment pipelines Rasterizer –Mostly useful for interpolating addresses (texture coordinates) and per-vertex constants Texture unit –Read-only memory interface Render to texture –Write-only memory interface
22
Vertex Processor Fully programmable (SIMD / MIMD) Processes 4-vectors (RGBA / XYZW) Capable of scatter but not gather –Can change the location of current vertex –Cannot read info from other vertices –Can only read a small constant memory Future hardware enables gather! –Vertex textures
23
Fragment Processor Fully programmable (SIMD) Processes 4-vectors (RGBA / XYZW) Random access memory read (textures) Capable of gather but not scatter –No random access memory writes –Output address fixed to a specific pixel Typically more useful than vertex processor –More fragment pipelines than vertex pipelines –RAM read –Direct output
24
CPU-GPU Analogies CPU programming is familiar –GPU programming is graphics-centric Analogies can aid understanding
25
CPU-GPU Analogies
CPU: Stream / Data Array = GPU: Texture
CPU: Memory Read = GPU: Texture Sample
26
CPU-GPU Analogies
CPU: Loop body / kernel / algorithm step = GPU: Fragment Program
27
Feedback
Each algorithm step depends on the results of previous steps
Each time step depends on the results of the previous time step
28
CPU-GPU Analogies
CPU: Array Write (… Grid[i][j] = x; …) = GPU: Render to Texture
29
GPU Simulation Overview Analogies lead to implementation –Algorithm steps are fragment programs Computational “kernels” –Current state variables stored in textures –Feedback via render to texture One question: how do we invoke computation?
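The feedback loop maps onto "ping-pong" buffering: each pass reads the current state texture and renders the new state into a second texture, then the two swap roles. A minimal CPU sketch of that structure (the diffusion kernel here is our own illustrative choice, not from the slides):

```c
#include <assert.h>

#define W 4
#define H 4

/* One "fragment program" pass: new cell = average of the old cell
 * and its 4-neighborhood (edges clamped). Reads only from `src`,
 * writes only to `dst`, mirroring the read-texture /
 * render-to-texture split: no read-modify-write on one buffer. */
static void diffuse_pass(const float src[H][W], float dst[H][W]) {
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            float up    = src[y > 0 ? y - 1 : y][x];
            float down  = src[y < H - 1 ? y + 1 : y][x];
            float left  = src[y][x > 0 ? x - 1 : x];
            float right = src[y][x < W - 1 ? x + 1 : x];
            dst[y][x] = (src[y][x] + up + down + left + right) / 5.0f;
        }
}

/* Feedback loop: ping-pong between two buffers, as a GPU would
 * alternate which texture is the render target. After an odd
 * number of steps the newest state is in `b`, after an even
 * number it is back in `a`. */
static void simulate(float a[H][W], float b[H][W], int steps) {
    float (*src)[W] = a, (*dst)[W] = b;
    for (int s = 0; s < steps; s++) {
        diffuse_pass(src, dst);
        float (*tmp)[W] = src; src = dst; dst = tmp; /* swap */
    }
}
```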
30
Invoking Computation Must invoke computation at each pixel –Just draw geometry! –Most common GPGPU invocation is a full-screen quad
31
Typical “Grid” Computation
Initialize “view” (so that pixels:texels::1:1)
glMatrixMode(GL_MODELVIEW);
glLoadIdentity();
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
glOrtho(0, 1, 0, 1, 0, 1);
glViewport(0, 0, outTexResX, outTexResY);
For each algorithm step:
–Activate render-to-texture
–Setup input textures, fragment program
–Draw a full-screen quad (1x1)
32
Branching Techniques Fragment program branches can be expensive –No true fragment branching on NV3X & R3X0 Better to move decisions up the pipeline –Replace with math –Occlusion Query –Domain decomposition –Z-cull –Pre-computation
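"Replace with math" deserves a concrete sketch: on hardware without true fragment branching, both sides of an `if` are effectively evaluated anyway, so a conditional can be rewritten as a blend against a 0/1 mask. This is our own illustration of the idiom (the same trick as shader `lerp`/`cmp` sequences), not code from the slides.

```c
#include <assert.h>

/* Branch-free form of `cond ? a : b`, built from a 0/1 mask.
 * Both a and b are "computed"; the mask selects the result. */
static float select_by_mask(float mask, float a, float b) {
    /* mask must be exactly 0.0f or 1.0f */
    return mask * a + (1.0f - mask) * b;
}

/* Example: branch-free clamp-to-zero, i.e. max(x, 0). */
static float relu(float x) {
    float mask = (x > 0.0f) ? 1.0f : 0.0f; /* step function */
    return select_by_mask(mask, x, 0.0f);
}
```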
33
Branching with OQ
Use occlusion query for iteration termination
do { // outer loop on CPU
  BeginOcclusionQuery {
    // Render with fragment program that
    // discards fragments that satisfy
    // termination criteria
  } EndQuery
} while (query returns > 0);
Can be used for subdivision techniques
–Demo later
34
Domain Decomposition Avoid branches where outcome is fixed –One region is always true, another false –Separate FPs for each region, no branches Example: boundaries
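On the GPU this means drawing separate geometry with a separate fragment program for each region; on the CPU the analogue is splitting one branchy loop into two branch-free loops. A minimal sketch under assumptions of ours (a 1D grid with fixed-value boundaries and an averaging interior stencil):

```c
#include <assert.h>

/* Domain decomposition: instead of one kernel that branches on
 * "am I a boundary cell?", run a branch-free interior kernel over
 * the interior cells and a separate boundary kernel over the two
 * edge cells -- two fragment programs over two regions on a GPU. */
static void interior_kernel(const float *in, float *out, int n) {
    for (int i = 1; i < n - 1; i++)       /* no branches inside */
        out[i] = 0.5f * (in[i - 1] + in[i + 1]);
}

static void boundary_kernel(const float *in, float *out, int n) {
    out[0]     = in[0];      /* fixed boundary condition */
    out[n - 1] = in[n - 1];
}
```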
35
Z-Cull In early pass, modify depth buffer –Write depth=0 for pixels that should not be modified by later passes –Write depth=1 for rest Subsequent passes –Enable depth test (GL_LESS) –Draw full-screen quad at z=0.5 –Only pixels with previous depth=1 will be processed Can also use early stencil test Not available on NV3X –Depth replace disables ZCull
36
Pre-computation Pre-compute anything that will not change every iteration! Example: arbitrary boundaries –When user draws boundaries, compute texture containing boundary info for cells –Reuse that texture until boundaries modified –Combine with Z-cull for higher performance!
37
High-level Shading Languages
Cg, HLSL, & GLSlang
–Cg: http://www.nvidia.com/cg
–HLSL: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/directx/graphics/reference/highlevellanguageshaders.asp
–GLSlang: http://www.3dlabs.com/support/developer/ogl2/whitepapers/index.html
High-level: float3 cPlastic = Cd * (cAmbi + cDiff) + Cs*cSpec …
Compiles to: MULR R0.xyz, R0.xxx, R4.xyxx; MOVR R5.xyz, -R0.xyzz; DP3R R3.x, R0.xyzz, R3.xyzz; …
38
GPGPU Languages Why do we want them? –Make programming GPUs easier! Don’t need to know OpenGL, DirectX, or ATI/NV extensions Simplify common operations Focus on the algorithm, not on the implementation Sh –University of Waterloo –http://libsh.sourceforge.net –http://www.cgl.uwaterloo.ca Brook –Stanford University –http://brook.sourceforge.net –http://graphics.stanford.edu
39
Brook: general purpose streaming language
stream programming model
–enforce data parallel computing: streams
–encourage arithmetic intensity: kernels
C with stream extensions
GPU = streaming coprocessor
40
System outline
–.br: Brook source files
–brcc: source-to-source compiler
–brt: Brook run-time library
41
Brook language: streams
streams
–collection of records requiring similar computation
particle positions, voxels, FEM cells, …
float3 positions<200>;
float3 velocityfield<100,100,100>;
–similar to arrays, but…
index operations disallowed: position[i]
read/write stream operators:
streamRead(positions, p_ptr);
streamWrite(velocityfield, v_ptr);
–encourage data parallelism
42
Brook language: kernels
kernels
–functions applied to streams
similar to for_all construct
kernel void foo (float a<>, float b<>, out float result<>) {
  result = a + b;
}
float a<100>;
float b<100>;
float c<100>;
foo(a,b,c);
Equivalent CPU loop:
for (i=0; i<100; i++)
  c[i] = a[i]+b[i];
43
Brook language: kernels
kernels
–functions applied to streams
similar to for_all construct
kernel void foo (float a<>, float b<>, out float result<>) {
  result = a + b;
}
–no dependencies between stream elements
–encourage high arithmetic intensity
44
Brook language: kernels
kernel arguments
–input/output streams
–constant parameters
–gather streams
–iterator streams
kernel void foo (float a<>, float b<>, float t,
                 float array[], iter float n<>,
                 out float result<>) {
  result = array[a] + t*b + n;
}
float a<100>;
float b<100>;
float c<100>;
float array<25>;
iter float n<100> = iter(0, 10);
foo(a,b,3.2f,array,n,c);
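The behavior of each argument class in the foo kernel above can be spelled out as a CPU loop. This is our own sketch of the semantics: stream arguments advance per element, the constant `t` does not, the gather stream is indexed freely, and the iterator supplies a generated value per element (precomputed here for simplicity).

```c
#include <assert.h>

/* CPU equivalent of: result = array[a] + t*b + n;
 * `a`, `b`, `n`, `result` are streams (one element per iteration),
 * `t` is a constant parameter, `array` is a gather stream indexed
 * by the value of `a`. */
static void foo_cpu(const float *a, const float *b, float t,
                    const float *array, const float *n,
                    float *result, int len) {
    for (int i = 0; i < len; i++)
        result[i] = array[(int)a[i]] + t * b[i] + n[i];
}
```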
45
Brook language: kernels
ray-triangle intersection
kernel void krnIntersectTriangle(Ray ray<>, Triangle tris[],
                                 RayState oldraystate<>,
                                 GridTrilist trilist[],
                                 out Hit candidatehit<>) {
  float idx, det, inv_det;
  float3 edge1, edge2, pvec, tvec, qvec;
  if (oldraystate.state.y > 0) {
    idx = trilist[oldraystate.state.w].trinum;
    edge1 = tris[idx].v1 - tris[idx].v0;
    edge2 = tris[idx].v2 - tris[idx].v0;
    pvec = cross(ray.d, edge2);
    det = dot(edge1, pvec);
    inv_det = 1.0f/det;
    tvec = ray.o - tris[idx].v0;
    candidatehit.data.y = dot( tvec, pvec ) * inv_det;
    qvec = cross( tvec, edge1 );
    candidatehit.data.z = dot( ray.d, qvec ) * inv_det;
    candidatehit.data.x = dot( edge2, qvec ) * inv_det;
    candidatehit.data.w = idx;
  } else {
    candidatehit.data = float4(0,0,0,-1);
  }
}
46
Brook language: reductions
reductions
–compute single value from a stream
reduce void sum (float a<>, reduce float r<>) {
  r += a;
}
float a<100>;
float r;
sum(a,r);
Equivalent CPU loop:
r = a[0];
for (int i=1; i<100; i++)
  r += a[i];
47
Brook language: reductions
reductions
–associative operations only: (a+b)+c = a+(b+c)
sum, multiply, max, min, OR, AND, XOR
matrix multiply
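Associativity is exactly what lets a reduction be evaluated as a log-depth tree of pairwise combines (the way a GPU runtime evaluates it in passes) instead of a serial loop. A CPU sketch of that pass structure, our own illustration:

```c
#include <assert.h>

/* Tree-style sum: each iteration of the while loop is one
 * "parallel pass" combining the upper half of the array into the
 * lower half; any odd leftover element is folded into slot 0.
 * Valid only because + is associative. Modifies `data` in place. */
static float tree_sum(float *data, int n) {
    while (n > 1) {
        int half = n / 2;
        for (int i = 0; i < half; i++)     /* one parallel pass */
            data[i] = data[i] + data[half + i];
        if (n % 2) data[0] += data[n - 1]; /* odd leftover */
        n = half;
    }
    return data[0];
}
```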
48
Brook language: reductions
multi-dimension reductions
–stream “shape” differences resolved by reduce function
reduce void sum (float a<>, reduce float r<>) {
  r += a;
}
float a<20>;
float r<5>;
sum(a,r);
Equivalent CPU loop:
for (int i=0; i<5; i++) {
  r[i] = a[i*4];
  for (int j=1; j<4; j++)
    r[i] += a[i*4 + j];
}
49
Brook language: stream repeat & stride
kernel arguments of different shape
–resolved by repeat and stride
kernel void foo (float a<>, float b<>, out float result<>);
float a<20>;
float b<5>;
float c<10>;
foo(a,b,c);
Equivalent calls:
foo(a[0], b[0], c[0])
foo(a[2], b[0], c[1])
foo(a[4], b[1], c[2])
foo(a[6], b[1], c[3])
foo(a[8], b[2], c[4])
foo(a[10], b[2], c[5])
foo(a[12], b[3], c[6])
foo(a[14], b[3], c[7])
foo(a[16], b[4], c[8])
foo(a[18], b[4], c[9])
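The call trace above follows a simple resampling rule: each input stream is stretched or strided to the output's length. With 20 elements of a, 5 of b, and 10 outputs, output i reads a[2i] (stride) and b[i/2] (repeat). This sketch of the index mapping is our own inference from the trace, not code from the slides:

```c
#include <assert.h>

/* Resample an input stream of length in_len to an output of
 * length out_len: output element out_idx reads input element
 * out_idx * in_len / out_len (integer division). in_len > out_len
 * gives a stride; in_len < out_len repeats elements. */
static int resolve_index(int out_idx, int in_len, int out_len) {
    return out_idx * in_len / out_len;
}
```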
50
Brook language: matrix-vector multiply
kernel void mul (float a<>, float b<>, out float result<>) {
  result = a*b;
}
reduce void sum (float a<>, reduce float result<>) {
  result += a;
}
float matrix<20,10>;
float vector<1,10>;
float tempmv<20,10>;
float result<20,1>;
mul(matrix,vector,tempmv);
sum(tempmv,result);
[Diagram: matrix M times the row-repeated vector V gives temporary T]
51
Brook language: matrix-vector multiply
kernel void mul (float a<>, float b<>, out float result<>) {
  result = a*b;
}
reduce void sum (float a<>, reduce float result<>) {
  result += a;
}
float matrix<20,10>;
float vector<1,10>;
float tempmv<20,10>;
float result<20,1>;
mul(matrix,vector,tempmv);
sum(tempmv,result);
[Diagram: each row of temporary T is reduced by sum to the result R]
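The two-kernel formulation above is just y = M v written as map-then-reduce: `mul` multiplies the matrix elementwise by the vector repeated across every row, and `sum` reduces each row of the temporary. A CPU equivalent (our own sketch, with small fixed dimensions for illustration):

```c
#include <assert.h>

#define ROWS 3
#define COLS 4

/* matvec as two stream operations:
 * 1. mul kernel: temp[r][c] = m[r][c] * v[c]
 *    (the vector is conceptually repeated across rows)
 * 2. sum reduction: y[r] = sum over c of temp[r][c]
 * Together these compute y = M v. */
static void matvec(const float m[ROWS][COLS], const float v[COLS],
                   float y[ROWS]) {
    float temp[ROWS][COLS];
    for (int r = 0; r < ROWS; r++)        /* mul kernel */
        for (int c = 0; c < COLS; c++)
            temp[r][c] = m[r][c] * v[c];
    for (int r = 0; r < ROWS; r++) {      /* sum reduction */
        y[r] = 0.0f;
        for (int c = 0; c < COLS; c++)
            y[r] += temp[r][c];
    }
}
```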
52
Running Brook
Compiling .br files:
Brook CG Compiler
Version: 0.2  Built: Apr 24 2004, 18:11:59
brcc [-hvndktyAN] [-o prefix] [-w workspace] [-p shader] foo.br
-h            help (print this message)
-v            verbose (print intermediate generated code)
-n            no codegen (just parse and reemit the input)
-d            debug (print cTool internal state)
-k            keep generated fragment program (in foo.cg)
-t            disable kernel call type checking
-y            emit code for ATI 4-output hardware
-A            enable address virtualization (experimental)
-N            deny support for kernels calling other kernels
-o prefix     prefix prepended to all output files
-w workspace  workspace size (16 - 2048, default 1024)
-p shader     cpu / ps20 / fp30 / cpumt (can specify multiple)
53
Running Brook
BRT_RUNTIME selects platform:
–CPU backend: BRT_RUNTIME = cpu
–CPU multithreaded backend: BRT_RUNTIME = cpumt
–NVIDIA NV30 backend: BRT_RUNTIME = nv30
–OpenGL ARB backend: BRT_RUNTIME = arb
–DirectX9 backend: BRT_RUNTIME = dx9
54
Runtime: accessing stream data for graphics apps
–Brook runtime API available in C++ code
–autogenerated .hpp files for Brook code
brook::initialize( "dx9", (void*)device );
// Create streams
fluidStream0 = stream::create( kFluidSize, kFluidSize );
normalStream = stream::create( kFluidSize, kFluidSize );
// Get a handle to the texture being used by
// the normal stream as a backing store
normalTexture = (IDirect3DTexture9*)
  normalStream->getIndexedFieldRenderData(0);
// Call the simulation kernel
simulationKernel( fluidStream0, fluidStream0, controlConstant, fluidStream1 );
55
Applications Includes lots of sample applications –Ray-tracer –FFT –Image segmentation –Linear algebra
56
Brook performance
2-3x faster than CPU implementation
Compared against 3 GHz P4:
–Intel Math Library
–FFTW
–Custom cache-blocked segment C code
GPUs still lose against cache-friendly, SSE super-optimized code (ATLAS, FFTW)
[Chart: results on ATI Radeon 9800 XT and NVIDIA GeForce 6800]
57
Brook for GPUs
Release v0.3 available on SourceForge
Project page: http://graphics.stanford.edu/projects/brook
Source: http://www.sourceforge.net/projects/brook
Over 4K downloads!
Brook for GPUs: Stream Computing on Graphics Hardware
–Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, Pat Hanrahan
Fly-fishing fly images from The English Fly Fishing Shop
58
GPGPU Examples Fluids Reaction/Diffusion Multigrid Tone mapping