GPU Computation Strategies & Tricks Ian Buck NVIDIA
2 Recent Trends
3 Compute is Cheap parallelism parallelism to keep 100s of ALUs per chip busy to keep 100s of ALUs per chip busy shading is highly parallel shading is highly parallel millions of fragments per frame millions of fragments per frame 90nm Chip 64-bit FPU (to scale) 12mm 0.5mm courtesy of Bill Dally
4...but Bandwidth is Expensive courtesy of Bill Dally latency tolerance latency tolerance to cover 500 cycle remote memory access time to cover 500 cycle remote memory access time locality locality to match 20Tb/s ALU bandwidth to ~100Gb/s chip bandwidth to match 20Tb/s ALU bandwidth to ~100Gb/s chip bandwidth 90nm Chip 12mm 0.5mm 1 clock
5 Optimizing for GPUs shading is compute intensive shading is compute intensive 100s of floating point operations 100s of floating point operations output 1 32-bit color value output 1 32-bit color value compute to bandwidth ratio compute to bandwidth ratio arithmetic intensity arithmetic intensity 90nm Chip 12mm 0.5mm 1 clock courtesy of Bill Dally
6 Compute vs. Bandwidth R300R360 R420 GFLOPS GFloats/sec
7 Arithmetic Intensity R300R360 R420 GFLOPS GFloats/sec 7x Gap
8 Arithmetic Intensity GPU wins when… Arithmetic intensity Arithmetic intensity Segment Segment 3.7 ops per word 11 GFLOPS GeForce 7800 GTX Pentium GHz
9 Arithmetic Intensity Overlapping computation with communication Overlapping computation with communication
10 Memory Bandwidth GPU wins when… Streaming memory bandwidth Streaming memory bandwidth SAXPY SAXPY FFT GeForce 7800 GTX Pentium GHz
11 Memory Bandwidth Streaming Memory System Streaming Memory System Optimized for sequential performance Optimized for sequential performance GPU cache is limited GPU cache is limited Optimized for texture filtering Optimized for texture filtering Read-only Read-only Small Small Local storage Local storage CPU >> GPU CPU >> GPU GeForce 7800 GTX Pentium 4
12 Computational Intensity Considering GPU transfer costs: T r Considering GPU transfer costs: T r GPU Memory CPU
13 Computational Intensity Considering GPU transfer costs: T r Considering GPU transfer costs: T r Computational intensity: Computational intensity: to outperform the CPU: to outperform the CPU: speedup: s K cpu / K gpu K gpu / T r work per word transferred 1 s - 1 > >
14 Kernel Overhead Considering CPU cost to issuing a kernel Considering CPU cost to issuing a kernel Generating compute geometry Generating compute geometry Graphics driver Graphics driver CPUlimitedGPUlimited
15 Floating Point Precision NVIDIA FP32 NVIDIA FP32 s23e8 s23e8 ATI 24-bit float ATI 24-bit float s16e7 s16e7 NVIDIA FP16 NVIDIA FP16 s10e5 s10e5 mantissaexponents sign * 1.mantissa * 2 (exponent+bias)
16 Floating Point Precision Common Bug Common Bug Pack large 1D array in 2D texture Pack large 1D array in 2D texture Compute 1D address in shader Compute 1D address in shader Convert 1D address into 2D Convert 1D address into 2D FP precision will leave unaddressable texels! FP precision will leave unaddressable texels! NVIDIA FP32: 16,777,217 ATI 24-bit float: 131,073 NVIDIA FP16: 2,049 Largest Counting Number
17 Scatter Techniques Problem: a[i] = p Problem: a[i] = p Indirect write Indirect write Can’t set the x,y of fragment in pixel shader Can’t set the x,y of fragment in pixel shader Often want to do: a[i] += p Often want to do: a[i] += p
18 Scatter Techniques Solution 1: Convert to Gather Solution 1: Convert to Gather m1m2 f2 f3 f1 for each spring f = computed force mass_force[left] += f; mass_force[right] -= f;
19 Scatter Techniques Solution 1: Convert to Gather Solution 1: Convert to Gather m1m2 f2 f3 f1 for each spring f = computed force for each mass mass_force = f[left] - f[right];
20 Scatter Techniques Solution 2: Address Sorting Solution 2: Address Sorting Sort & Search Sort & Search Shader outputs destination address and data Shader outputs destination address and data Bitonic sort based on address Bitonic sort based on address Run binary search shader over destination buffer Run binary search shader over destination buffer –Each fragment searches for source data
21 Scatter Techniques Solution 3: Vertex processor Solution 3: Vertex processor Render points Render points Use vertex shader to set destination Use vertex shader to set destination or just read back the data and re-issue or just read back the data and re-issue Vertex Textures Vertex Textures Render data and address to texture Render data and address to texture Issue points, set point x,y in vertex shader using address texture Issue points, set point x,y in vertex shader using address texture Requires texld instruction in vertex program Requires texld instruction in vertex program
22 Conditionals Strategies & Tricks:
23 Conditionals Problem: Problem: Limited fragment shader conditional support Limited fragment shader conditional support if (a) b = f(); else b = g();
24 Pre-computation Pre-compute anything that will not change every iteration! Pre-compute anything that will not change every iteration! Example: static obstacles in fluid sim Example: static obstacles in fluid sim When user draws obstacles, compute texture containing boundary info for cells When user draws obstacles, compute texture containing boundary info for cells Reuse that texture until obstacles are modified Reuse that texture until obstacles are modified Combine with Z-cull for higher performance! Combine with Z-cull for higher performance!
25 Static Branch Resolution Avoid branches where outcome is fixed Avoid branches where outcome is fixed One region is always true, another false One region is always true, another false Separate FPs for each region, no branches Separate FPs for each region, no branches Example: boundaries Example: boundaries
26 Branching with Occlusion Query Use it for iteration termination Use it for iteration terminationDo { // outer loop on CPU BeginOcclusionQuery{ // Render with fragment program that // discards fragments that satisfy // termination criteria // Render with fragment program that // discards fragments that satisfy // termination criteria } EndQuery } While query returns > 0
27 Conditionals Using the depth buffer Using the depth buffer Set Z buffer to a Set Z buffer to a Z-test can prevent shader execution Z-test can prevent shader execution glEnable(GL_DEPTH_TEST) glEnable(GL_DEPTH_TEST) Locality in conditional Locality in conditional if (a) b = f(); else b = g();
28 Conditionals Using the depth buffer Using the depth buffer Optimization disabled with: Optimization disabled with: ATI: Writing Z in shaderWriting Z in shader Enabling Alpha testEnabling Alpha test Using texkill in shaderUsing texkill in shader NVIDIA: Changing depth test direction in frameChanging depth test direction in frame Writing stencil while rejecting based on stencilWriting stencil while rejecting based on stencil Changing stencil func/ref/mask in frameChanging stencil func/ref/mask in frame
29 Depth Conditionals GeForce 7800 GTX
30 Conditionals Predication Predication Execute both Execute both f and g f and g Use CMP instruction Use CMP instruction CMP b, -a, f, g CMP b, -a, f, g Executes all conditional code Executes all conditional code if (a) b = f(); else b = g();
31 Conditionals Predication Predication Use DP4 instruction Use DP4 instruction DP4 b.x, a, f DP4 b.x, a, f Executes all conditional code Executes all conditional code if (a.x) b = x; else if (a.y) b = y; else if (a.z) b = z; else if (a.w) b = w; a = (0, 1, 0, 0) f = (x, y, z, w)
32 Conditionals Conditional Instructions Conditional Instructions Available with NV_fragment_program2 Available with NV_fragment_program2 MOVC CC, R0; IF GT.x; MOV R0, R1; # executes if R0.x > 0 ELSE; MOV R0, R2; # executes if R0.x <= 0 ENDIF;
33 GeForce 6+ Series Branching True, SIMD branching True, SIMD branching Lots of incoherent branching can hurt performance Lots of incoherent branching can hurt performance Should have coherent regions of ~1000 pixels Should have coherent regions of ~1000 pixels That is only about 30x30 pixels, so still very useable! That is only about 30x30 pixels, so still very useable! Don’t ignore overhead of branch instructions Don’t ignore overhead of branch instructions Branching over < 5 instructions may not be worth it Branching over < 5 instructions may not be worth it Use branching for early exit from loops Use branching for early exit from loops Save a lot of computation Save a lot of computation
34 Conditional Instructions GeForce 7800 GTX
35 Branching Techniques Fragment program branches can be expensive Fragment program branches can be expensive No true fragment branching on GeForce FX or Radeon 9x00-X850 No true fragment branching on GeForce FX or Radeon 9x00-X850 SIMD branching on GeForce 6/7 Series SIMD branching on GeForce 6/7 Series Incoherent branching hurts performance Incoherent branching hurts performance Sometimes better to move decisions up the pipeline Sometimes better to move decisions up the pipeline Pre-computation Pre-computation Replace with math Replace with math Occlusion Query Occlusion Query Static Branch Resolution Static Branch Resolution Depth Buffer Depth Buffer