GPU Computation Strategies & Tricks Ian Buck NVIDIA.

Slides:



Advertisements
Similar presentations
Is There a Real Difference between DSPs and GPUs?
Advertisements

Kayvon Fatahalian, Jeremy Sugerman, Pat Hanrahan
Brook for GPUs Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman Pat Hanrahan GCafe December 10th, 2003.
COMPUTER GRAPHICS CS 482 – FALL 2014 NOVEMBER 10, 2014 GRAPHICS HARDWARE GRAPHICS PROCESSING UNITS PARALLELISM.
Photon Mapping on Programmable Graphics Hardware Timothy J. Purcell Mike Cammarano Pat Hanrahan Stanford University Craig Donner Henrik Wann Jensen University.
Understanding the graphics pipeline Lecture 2 Original Slides by: Suresh Venkatasubramanian Updates by Joseph Kider.
Status – Week 257 Victor Moya. Summary GPU interface. GPU interface. GPU state. GPU state. API/Driver State. API/Driver State. Driver/CPU Proxy. Driver/CPU.
Graphics Hardware CMSC 435/634. Transform Shade Clip Project Rasterize Texture Z-buffer Interpolate Vertex Fragment Triangle A Graphics Pipeline.
GI 2006, Québec, June 9th 2006 Implementing the Render Cache and the Edge-and-Point Image on Graphics Hardware Edgar Velázquez-Armendáriz Eugene Lee Bruce.
Why GPUs? Robert Strzodka. 2Overview Computation / Bandwidth / Power CPU – GPU Comparison GPU Characteristics.
Graphics Hardware CMSC 435/634. Transform Shade Clip Project Rasterize Texture Z-buffer Interpolate Vertex Fragment Triangle A Graphics Pipeline.
The Programmable Graphics Hardware Pipeline Doug James Asst. Professor CS & Robotics.
Control Flow Virtualization for General-Purpose Computation on Graphics Hardware Ghulam Lashari Ondrej Lhotak University of Waterloo.
Modified from: A Survey of General-Purpose Computation on Graphics Hardware John Owens University of California, Davis David Luebke University of Virginia.
Brook for GPUs Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, Pat Hanrahan Stanford University DARPA Site Visit, UNC.
Data Parallel Computing on Graphics Hardware Ian Buck Stanford University.
1 Shader Performance Analysis on a Modern GPU Architecture Victor Moya, Carlos González, Jordi Roca, Agustín Fernández Jordi Roca, Agustín Fernández Department.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Chapter.
3D Graphics Processor Architecture Victor Moya. PhD Project Research on architecture improvements for future Graphic Processor Units (GPUs). Research.
Z-Buffer Optimizations Patrick Cozzi Analytical Graphics, Inc.
ATI GPUs and Graphics APIs Mark Segal. ATI Hardware X1K series 8 SIMD vertex engines, 16 SIMD fragment (pixel) engines 3-component vector + scalar ALUs.
Z-Buffer Optimizations Patrick Cozzi Analytical Graphics, Inc.
Evolutions of GPU Architectures Andrew Coile CMPE220 3/2007.
Evolution of the Programmable Graphics Pipeline Patrick Cozzi University of Pennsylvania CIS Spring 2011.
Data Parallel Computing on Graphics Hardware Ian Buck Stanford University.
The programmable pipeline Lecture 10 Slide Courtesy to Dr. Suresh Venkatasubramanian.
Some Things Jeremy Sugerman 22 February Jeremy Sugerman, FLASHG 22 February 2005 Topics Quick GPU Topics Conditional Execution GPU Ray Tracing.
Mapping Computational Concepts to GPU’s Jesper Mosegaard Based primarily on SIGGRAPH 2004 GPGPU COURSE and Visualization 2004 Course.
Interactive k-D Tree GPU Raytracing Daniel Reiter Horn, Jeremy Sugerman, Mike Houston and Pat Hanrahan.
Status – Week 260 Victor Moya. Summary shSim. shSim. GPU design. GPU design. Future Work. Future Work. Rumors and News. Rumors and News. Imagine. Imagine.
Accelerating Machine Learning Applications on Graphics Processors Narayanan Sundaram and Bryan Catanzaro Presented by Narayanan Sundaram.
GPGPU CS 446: Real-Time Rendering & Game Technology David Luebke University of Virginia.
GPU Tutorial 이윤진 Computer Game 2007 가을 2007 년 11 월 다섯째 주, 12 월 첫째 주.
Graphics Processors CMSC 411. GPU graphics processing model Texture / Buffer Texture / Buffer Vertex Geometry Fragment CPU Displayed Pixels Displayed.
GPU Graphics Processing Unit. Graphics Pipeline Scene Transformations Lighting & Shading ViewingTransformations Rasterization GPUs evolved as hardware.
GPGPU overview. Graphics Processing Unit (GPU) GPU is the chip in computer video cards, PS3, Xbox, etc – Designed to realize the 3D graphics pipeline.
Ray Tracing and Photon Mapping on GPUs Tim PurcellStanford / NVIDIA.
CSE 690: GPGPU Lecture 4: Stream Processing Klaus Mueller Computer Science, Stony Brook University.
Slide 1 / 16 On Using Graphics Hardware for Scientific Computing ________________________________________________ Stan Tomov June 23, 2006.
CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA
Database and Stream Mining using GPUs Naga K. Govindaraju UNC Chapel Hill.
Mapping Computational Concepts to GPUs Mark Harris NVIDIA Developer Technology.
1 The Performance Potential for Single Application Heterogeneous Systems Henry Wong* and Tor M. Aamodt § *University of Toronto § University of British.
Mapping Computational Concepts to GPUs Mark Harris NVIDIA.
GPU Computation Strategies & Tricks Ian Buck Stanford University.
GPUs and Accelerators Jonathan Coens Lawrence Tan Yanlin Li.
Cg Programming Mapping Computational Concepts to GPUs.
1 SIC / CoC / Georgia Tech MAGIC Lab Rossignac GPU  Precision, Power, Programmability –CPU: x60/decade, 6 GFLOPS,
Fast Computation of Database Operations using Graphics Processors Naga K. Govindaraju Univ. of North Carolina Modified By, Mahendra Chavan forCS632.
The programmable pipeline Lecture 3.
CSE 690: GPGPU Lecture 7: Matrix Multiplications Klaus Mueller Computer Science, Stony Brook University.
Finding Body Parts with Vector Processing Cynthia Bruyns Bryan Feldman CS 252.
Stencil Routed A-Buffer
1)Leverage raw computational power of GPU  Magnitude performance gains possible.
Havok FX Physics on NVIDIA GPUs. Copyright © NVIDIA Corporation 2004 What is Effects Physics? Physics-based effects on a massive scale 10,000s of objects.
David Angulo Rubio FAMU CIS GradStudent. Introduction  GPU(Graphics Processing Unit) on video cards has evolved during the last years. They have become.
Mapping Computational Concepts to GPUs Mark Harris NVIDIA.
GPGPU introduction. Why is GPU in the picture Seeking exa-scale computing platform Minimize power per operation. – Power is directly correlated to the.
3/12/2013Computer Engg, IIT(BHU)1 CUDA-3. GPGPU ● General Purpose computation using GPU in applications other than 3D graphics – GPU accelerates critical.
My Coordinates Office EM G.27 contact time:
GPU Architecture and Its Application
COMPUTER GRAPHICS CHAPTER 38 CS 482 – Fall 2017 GRAPHICS HARDWARE
CS427 Multicore Architecture and Parallel Computing
Graphics Processing Unit
Graphics Hardware CMSC 491/691.
Graphics Processing Unit
UMBC Graphics for Games
Data Parallel Computing on Graphics Hardware
Ray Tracing on Programmable Graphics Hardware
RADEON™ 9700 Architecture and 3D Performance
Presentation transcript:

GPU Computation Strategies & Tricks Ian Buck NVIDIA

2 Recent Trends

3 Compute is Cheap parallelism parallelism to keep 100s of ALUs per chip busy to keep 100s of ALUs per chip busy shading is highly parallel shading is highly parallel millions of fragments per frame millions of fragments per frame 90nm Chip 64-bit FPU (to scale) 12mm 0.5mm courtesy of Bill Dally

4...but Bandwidth is Expensive courtesy of Bill Dally latency tolerance latency tolerance to cover 500 cycle remote memory access time to cover 500 cycle remote memory access time locality locality to match 20Tb/s ALU bandwidth to ~100Gb/s chip bandwidth to match 20Tb/s ALU bandwidth to ~100Gb/s chip bandwidth 90nm Chip 12mm 0.5mm 1 clock

5 Optimizing for GPUs shading is compute intensive shading is compute intensive 100s of floating point operations 100s of floating point operations output 1 32-bit color value output 1 32-bit color value compute to bandwidth ratio compute to bandwidth ratio arithmetic intensity arithmetic intensity 90nm Chip 12mm 0.5mm 1 clock courtesy of Bill Dally

6 Compute vs. Bandwidth R300R360 R420 GFLOPS GFloats/sec

7 Arithmetic Intensity R300R360 R420 GFLOPS GFloats/sec 7x Gap

8 Arithmetic Intensity GPU wins when… Arithmetic intensity Arithmetic intensity Segment Segment 3.7 ops per word 11 GFLOPS GeForce 7800 GTX Pentium GHz

9 Arithmetic Intensity Overlapping computation with communication Overlapping computation with communication

10 Memory Bandwidth GPU wins when… Streaming memory bandwidth Streaming memory bandwidth SAXPY SAXPY  FFT GeForce 7800 GTX Pentium GHz

11 Memory Bandwidth Streaming Memory System Streaming Memory System Optimized for sequential performance Optimized for sequential performance GPU cache is limited GPU cache is limited Optimized for texture filtering Optimized for texture filtering Read-only Read-only Small Small Local storage Local storage CPU >> GPU CPU >> GPU GeForce 7800 GTX Pentium 4

12 Computational Intensity Considering GPU transfer costs: T r Considering GPU transfer costs: T r GPU Memory CPU

13 Computational Intensity Considering GPU transfer costs: T r Considering GPU transfer costs: T r Computational intensity:  Computational intensity:  to outperform the CPU: to outperform the CPU: speedup: s  K cpu / K gpu   K gpu / T r work per word transferred 1 s - 1  > >

14 Kernel Overhead Considering CPU cost to issuing a kernel Considering CPU cost to issuing a kernel Generating compute geometry Generating compute geometry Graphics driver Graphics driver CPUlimitedGPUlimited

15 Floating Point Precision NVIDIA FP32 NVIDIA FP32 s23e8 s23e8 ATI 24-bit float ATI 24-bit float s16e7 s16e7 NVIDIA FP16 NVIDIA FP16 s10e5 s10e5 mantissaexponents sign * 1.mantissa * 2 (exponent+bias)

16 Floating Point Precision Common Bug Common Bug Pack large 1D array in 2D texture Pack large 1D array in 2D texture Compute 1D address in shader Compute 1D address in shader Convert 1D address into 2D Convert 1D address into 2D FP precision will leave unaddressable texels! FP precision will leave unaddressable texels! NVIDIA FP32: 16,777,217 ATI 24-bit float: 131,073 NVIDIA FP16: 2,049 Largest Counting Number

17 Scatter Techniques Problem: a[i] = p Problem: a[i] = p Indirect write Indirect write Can’t set the x,y of fragment in pixel shader Can’t set the x,y of fragment in pixel shader Often want to do: a[i] += p Often want to do: a[i] += p

18 Scatter Techniques Solution 1: Convert to Gather Solution 1: Convert to Gather m1m2 f2 f3 f1 for each spring f = computed force mass_force[left] += f; mass_force[right] -= f;

19 Scatter Techniques Solution 1: Convert to Gather Solution 1: Convert to Gather m1m2 f2 f3 f1 for each spring f = computed force for each mass mass_force = f[left] - f[right];

20 Scatter Techniques Solution 2: Address Sorting Solution 2: Address Sorting Sort & Search Sort & Search Shader outputs destination address and data Shader outputs destination address and data Bitonic sort based on address Bitonic sort based on address Run binary search shader over destination buffer Run binary search shader over destination buffer –Each fragment searches for source data

21 Scatter Techniques Solution 3: Vertex processor Solution 3: Vertex processor Render points Render points Use vertex shader to set destination Use vertex shader to set destination or just read back the data and re-issue or just read back the data and re-issue Vertex Textures Vertex Textures Render data and address to texture Render data and address to texture Issue points, set point x,y in vertex shader using address texture Issue points, set point x,y in vertex shader using address texture Requires texld instruction in vertex program Requires texld instruction in vertex program

22 Conditionals Strategies & Tricks:

23 Conditionals Problem: Problem: Limited fragment shader conditional support Limited fragment shader conditional support if (a) b = f(); else b = g();

24 Pre-computation Pre-compute anything that will not change every iteration! Pre-compute anything that will not change every iteration! Example: static obstacles in fluid sim Example: static obstacles in fluid sim When user draws obstacles, compute texture containing boundary info for cells When user draws obstacles, compute texture containing boundary info for cells Reuse that texture until obstacles are modified Reuse that texture until obstacles are modified Combine with Z-cull for higher performance! Combine with Z-cull for higher performance!

25 Static Branch Resolution Avoid branches where outcome is fixed Avoid branches where outcome is fixed One region is always true, another false One region is always true, another false Separate FPs for each region, no branches Separate FPs for each region, no branches Example: boundaries Example: boundaries

26 Branching with Occlusion Query Use it for iteration termination Use it for iteration terminationDo { // outer loop on CPU BeginOcclusionQuery{ // Render with fragment program that // discards fragments that satisfy // termination criteria // Render with fragment program that // discards fragments that satisfy // termination criteria } EndQuery } While query returns > 0

27 Conditionals Using the depth buffer Using the depth buffer Set Z buffer to a Set Z buffer to a Z-test can prevent shader execution Z-test can prevent shader execution glEnable(GL_DEPTH_TEST) glEnable(GL_DEPTH_TEST) Locality in conditional Locality in conditional if (a) b = f(); else b = g();

28 Conditionals Using the depth buffer Using the depth buffer Optimization disabled with: Optimization disabled with: ATI: Writing Z in shaderWriting Z in shader Enabling Alpha testEnabling Alpha test Using texkill in shaderUsing texkill in shader NVIDIA: Changing depth test direction in frameChanging depth test direction in frame Writing stencil while rejecting based on stencilWriting stencil while rejecting based on stencil Changing stencil func/ref/mask in frameChanging stencil func/ref/mask in frame

29 Depth Conditionals GeForce 7800 GTX

30 Conditionals Predication Predication Execute both Execute both f and g f and g Use CMP instruction Use CMP instruction CMP b, -a, f, g CMP b, -a, f, g Executes all conditional code Executes all conditional code if (a) b = f(); else b = g();

31 Conditionals Predication Predication Use DP4 instruction Use DP4 instruction DP4 b.x, a, f DP4 b.x, a, f Executes all conditional code Executes all conditional code if (a.x) b = x; else if (a.y) b = y; else if (a.z) b = z; else if (a.w) b = w; a = (0, 1, 0, 0) f = (x, y, z, w)

32 Conditionals Conditional Instructions Conditional Instructions Available with NV_fragment_program2 Available with NV_fragment_program2 MOVC CC, R0; IF GT.x; MOV R0, R1; # executes if R0.x > 0 ELSE; MOV R0, R2; # executes if R0.x <= 0 ENDIF;

33 GeForce 6+ Series Branching True, SIMD branching True, SIMD branching Lots of incoherent branching can hurt performance Lots of incoherent branching can hurt performance Should have coherent regions of ~1000 pixels Should have coherent regions of ~1000 pixels That is only about 30x30 pixels, so still very useable! That is only about 30x30 pixels, so still very useable! Don’t ignore overhead of branch instructions Don’t ignore overhead of branch instructions Branching over < 5 instructions may not be worth it Branching over < 5 instructions may not be worth it Use branching for early exit from loops Use branching for early exit from loops Save a lot of computation Save a lot of computation

34 Conditional Instructions GeForce 7800 GTX

35 Branching Techniques Fragment program branches can be expensive Fragment program branches can be expensive No true fragment branching on GeForce FX or Radeon 9x00-X850 No true fragment branching on GeForce FX or Radeon 9x00-X850 SIMD branching on GeForce 6/7 Series SIMD branching on GeForce 6/7 Series Incoherent branching hurts performance Incoherent branching hurts performance Sometimes better to move decisions up the pipeline Sometimes better to move decisions up the pipeline Pre-computation Pre-computation Replace with math Replace with math Occlusion Query Occlusion Query Static Branch Resolution Static Branch Resolution Depth Buffer Depth Buffer