Mapping Computational Concepts to GPUs Mark Harris NVIDIA.


2 Outline
- Data Parallelism and Stream Processing
- Computational Resources Inventory
- CPU-GPU Analogies
- Example:
  - N-body gravitational simulation
  - Parallel reductions
- Overview of Branching Techniques

3 The Importance of Data Parallelism
- GPUs are designed for graphics
  - Highly parallel tasks
- GPUs process independent vertices & fragments
  - Temporary registers are zeroed
  - No shared or static data
  - No read-modify-write buffers
- Data-parallel processing
  - GPU architecture is ALU-heavy
    - Multiple vertex & pixel pipelines, multiple ALUs per pipe
  - Hide memory latency (with more computation)

4 Arithmetic Intensity
- Arithmetic intensity
  - Ops per word transferred
  - Computation / bandwidth
- Best to have high arithmetic intensity
- Ideal GPGPU apps have
  - Large data sets
  - High parallelism
  - High independence between data elements
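
As a rough illustration of "ops per word transferred", the sketch below compares two made-up kernels. The function name and the per-element counts are assumptions chosen for illustration; they are not measurements from the talk.

```python
def arithmetic_intensity(ops: float, words_transferred: float) -> float:
    """Operations performed per word moved to or from memory."""
    return ops / words_transferred

# Hypothetical kernel A: element-wise add (1 op; 2 words read + 1 word written).
vector_add = arithmetic_intensity(ops=1, words_transferred=3)

# Hypothetical kernel B: 25 ops on the same 3 words per element.
heavy_kernel = arithmetic_intensity(ops=25, words_transferred=3)
```

Under these assumed counts, kernel B does 25 times more arithmetic per word moved, which is the property the slide describes as desirable for GPGPU workloads.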

5 Data Streams & Kernels
- Streams
  - Collection of records requiring similar computation
    - Vertex positions, Voxels, FEM cells, etc.
  - Provide data parallelism
- Kernels
  - Functions applied to each element in stream
    - Transforms, PDE, …
  - Few dependencies between stream elements
    - Encourage high Arithmetic Intensity

6 Example: Simulation Grid
- Common GPGPU computation style
  - Textures represent computational grids = streams
- Many computations map to grids
  - Matrix algebra
  - Image & Volume processing
  - Physically-based simulation
  - Global Illumination
    - Ray tracing, photon mapping, radiosity
- Non-grid streams can be mapped to grids

7 Stream Computation
- Grid Simulation algorithm
  - Made up of steps
  - Each step updates entire grid
  - Must complete before next step can begin
- Grid is a stream, steps are kernels
  - Kernel applied to each stream element
[Figure: cloud simulation algorithm]
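
A minimal CPU sketch of "kernel applied to each stream element", assuming the grid is a list of lists. The helper names `apply_kernel` and `average_neighbors` and the clamped-border rule are illustrative assumptions, not part of the talk.

```python
def apply_kernel(kernel, grid):
    """Apply `kernel` to every cell and return a new grid.

    The input grid is only read, so every cell can be computed
    independently (this is what makes the step data-parallel).
    """
    rows, cols = len(grid), len(grid[0])
    return [[kernel(grid, i, j) for j in range(cols)] for i in range(rows)]


def average_neighbors(grid, i, j):
    """Hypothetical step: mean of the 4 neighbors, clamping at the borders."""
    rows, cols = len(grid), len(grid[0])
    up = grid[max(i - 1, 0)][j]
    down = grid[min(i + 1, rows - 1)][j]
    left = grid[i][max(j - 1, 0)]
    right = grid[i][min(j + 1, cols - 1)]
    return (up + down + left + right) / 4.0


grid = [[0, 0, 0],
        [0, 4, 0],
        [0, 0, 0]]
result = apply_kernel(average_neighbors, grid)
```

Because the whole output grid is produced before anything reads it, the "must complete before next step can begin" constraint is satisfied by construction.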

8 Scatter vs. Gather
- Grid communication
  - Grid cells share information
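
The two communication patterns can be sketched on plain Python lists. In a gather, each output element reads from a computed address; in a scatter, each input element writes to a computed address. The function names are illustrative assumptions.

```python
def gather(src, indices):
    """out[i] = src[indices[i]]: the read address is computed, the write address is fixed."""
    return [src[k] for k in indices]


def scatter(src, indices, size):
    """out[indices[i]] = src[i]: the write address is computed, the read address is fixed."""
    out = [0] * size
    for value, k in zip(src, indices):
        out[k] = value
    return out


gathered = gather([10, 20, 30], [2, 0, 1])
scattered = scatter([10, 20, 30], [2, 0, 1], 3)
```

A gather maps naturally onto a texture fetch, since the output location stays fixed while the read location varies.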

9 Computational Resources Inventory
- Programmable parallel processors
  - Vertex & Fragment pipelines
- Rasterizer
  - Mostly useful for interpolating addresses (texture coordinates) and per-vertex constants
- Texture unit
  - Read-only memory interface
- Render to texture
  - Write-only memory interface

10 Vertex Processor
- Fully programmable (SIMD / MIMD)
- Processes 4-vectors (RGBA / XYZW)
- Capable of scatter but not gather
  - Can change the location of current vertex
  - Cannot read info from other vertices
  - Can only read a small constant memory
- Latest GPUs: Vertex Texture Fetch
  - Random access memory for vertices
  - Arguably still not gather

11 Fragment Processor
- Fully programmable (SIMD)
- Processes 4-component vectors (RGBA / XYZW)
- Random access memory read (textures)
- Capable of gather but not scatter
  - RAM read (texture fetch), but no RAM write
  - Output address fixed to a specific pixel
- Typically more useful than vertex processor
  - More fragment pipelines than vertex pipelines
  - Direct output (fragment processor is at end of pipeline)

12 CPU-GPU Analogies
- CPU programming is familiar
  - GPU programming is graphics-centric
- Analogies can aid understanding

13 CPU-GPU Analogies
  CPU                      GPU
  Stream / Data Array   =  Texture
  Memory Read           =  Texture Sample

14 Kernels
  CPU                                       GPU
  Kernel / loop body / algorithm step    =  Fragment Program

15 Feedback
- Each algorithm step depends on the results of previous steps
  - Each time step depends on the results of the previous time step

16 Feedback
  CPU                                        GPU
  Array Write (... Grid[i][j] = x; ...)   =  Render to Texture

17 GPU Simulation Overview
- Analogies lead to implementation
  - Algorithm steps are fragment programs
    - Computational kernels
  - Current state is stored in textures
  - Feedback via render to texture
- One question: how do we invoke computation?
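
The "state in textures, feedback via render to texture" loop is often implemented with two buffers that swap roles after every step (commonly called ping-ponging). The CPU sketch below models that pattern under the assumption that each kernel is a Python function of `(read_buffer, i, j)`; the names are hypothetical and no graphics API is involved.

```python
def run_steps(state, kernels):
    """Run each kernel over the whole grid, swapping read/write buffers between steps."""
    rows, cols = len(state), len(state[0])
    read_buf = [row[:] for row in state]              # plays the role of the input texture
    write_buf = [[0.0] * cols for _ in range(rows)]   # plays the role of the render target
    for kernel in kernels:
        for i in range(rows):
            for j in range(cols):
                # A kernel may read any cell of read_buf but writes only cell (i, j).
                write_buf[i][j] = kernel(read_buf, i, j)
        # The freshly written buffer becomes the input of the next step.
        read_buf, write_buf = write_buf, read_buf
    return read_buf


def double(buf, i, j):
    """Hypothetical algorithm step: double the cell value."""
    return 2 * buf[i][j]


initial = [[1, 2], [3, 4]]
final = run_steps(initial, [double, double])
```

Keeping the read and write buffers separate mirrors the read-only texture unit and write-only render target described in the resource inventory.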

18 Invoking Computation
- Must invoke computation at each pixel
  - Just draw geometry!
  - Most common GPGPU invocation is a full-screen quad
- Other Useful Analogies
  - Rasterization = Kernel Invocation
  - Texture Coordinates = Computational Domain
  - Vertex Coordinates = Computational Range

19 Typical “Grid” Computation
- Initialize “view” (so that pixels:texels::1:1)
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    glOrtho(0, 1, 0, 1, 0, 1);
    glViewport(0, 0, outTexResX, outTexResY);
- For each algorithm step:
  - Activate render-to-texture
  - Set up input textures, fragment program
  - Draw a full-screen quad (1x1)

20 Example: N-Body Simulation
- Brute force
- N = 8192 bodies
- N^2 gravity computations
- 64M force comps. / frame
- ~25 flops per force
- 7.5 fps
- GFLOPs sustained
- GeForce 6800 Ultra
Nyland, Harris, Prins, GP poster
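
The sustained-GFLOPs figure did not survive in this transcript. As a back-of-the-envelope check only, the other numbers on the slide can be multiplied together; the result below is an estimate derived from those numbers, not the value quoted in the talk.

```python
n_bodies = 8192
flops_per_force = 25        # "~25 flops per force" on the slide
frames_per_second = 7.5

# N^2 pairwise force computations per frame: 8192^2 = 64 * 2^20 ("64M").
forces_per_frame = n_bodies ** 2

# Estimated throughput in GFLOPS (1e9 floating-point operations per second).
estimated_gflops = forces_per_frame * flops_per_force * frames_per_second / 1e9
```

With these inputs the estimate comes to roughly 12.6 GFLOPS.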

21 Computing Gravitational Forces
- Each body attracts all other bodies
  - N bodies, so N^2 forces
- Draw into an NxN buffer
  - Pixel (i,j) computes force between bodies i and j
  - Very simple fragment program
  - More than 2048 bodies makes it trickier
    - Limited by max pbuffer size…
    - “exercise for the reader”

22 Computing Gravitational Forces
F(i,j) = g M_i M_j / r(i,j)^2, where r(i,j) = |pos(i) - pos(j)|
Force is proportional to the inverse square of the distance between bodies

23 Computing Gravitational Forces
F(i,j) = g M_i M_j / r(i,j)^2, where r(i,j) = |pos(i) - pos(j)|
Coordinates (i,j) in force texture used to find bodies i and j in body position texture
[Figure: N x N N-body force texture, whose texel (i,j) holds force(i,j), next to the body position texture]

24 Computing Gravitational Forces
    // g and getBodyCoords() are defined elsewhere (not shown on the slide).
    float4 force(float2 ij : WPOS, uniform sampler2D pos) : COLOR0
    {
        // Pos texture is 2D, not 1D, so we need to
        // convert body index into 2D coords for pos tex
        float4 iCoords  = getBodyCoords(ij);
        // Each texel stores position in xyz and mass in w
        float4 iPosMass = tex2D(pos, iCoords.xy);
        float4 jPosMass = tex2D(pos, iCoords.zw);
        float3 dir = iPosMass.xyz - jPosMass.xyz;
        float r2 = dot(dir, dir);
        dir = normalize(dir);
        return float4(dir * g * iPosMass.w * jPosMass.w / r2, 0);
    }
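
For checking results on the CPU, the same per-pixel computation can be written as a plain Python reference. This is a sketch under stated assumptions: `G = 1.0` is a placeholder (the slide's `g` is given no value), each body is an `(x, y, z, mass)` tuple, and the zero-distance guard for the `i == j` case is an addition that the slide's shader does not contain. The direction follows the slide's `pos(i) - pos(j)` convention.

```python
import math

G = 1.0  # assumed placeholder for the slide's constant g


def pair_force(body_i, body_j):
    """Force vector for one (i, j) texel; bodies are (x, y, z, mass) tuples."""
    dx = body_i[0] - body_j[0]
    dy = body_i[1] - body_j[1]
    dz = body_i[2] - body_j[2]
    r2 = dx * dx + dy * dy + dz * dz
    if r2 == 0.0:
        # Assumption: a body exerts no force on itself (avoids dividing by zero).
        return (0.0, 0.0, 0.0)
    r = math.sqrt(r2)
    magnitude = G * body_i[3] * body_j[3] / r2
    return (dx / r * magnitude, dy / r * magnitude, dz / r * magnitude)


def force_texture(bodies):
    """N x N table whose entry [i][j] plays the role of texel (i, j)."""
    n = len(bodies)
    return [[pair_force(bodies[i], bodies[j]) for j in range(n)] for i in range(n)]


bodies = [(0.0, 0.0, 0.0, 2.0), (2.0, 0.0, 0.0, 3.0)]
forces = force_texture(bodies)
```

Entries `[i][j]` and `[j][i]` are equal and opposite, which gives a quick sanity check when comparing against GPU output.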

25 Computing Total Force
- Have: array of (i,j) forces
- Need: total force on each particle i
[Figure: N x N N-body force texture, texel (i,j) holds force(i,j)]

26 Computing Total Force
- Have: array of (i,j) forces
- Need: total force on each particle i
  - Sum of each column of the force array
[Figure: N x N N-body force texture, texel (i,j) holds force(i,j)]

27 Computing Total Force
- Have: array of (i,j) forces
- Need: total force on each particle i
  - Sum of each column of the force array
- Can do all N columns in parallel
This is called a Parallel Reduction
[Figure: N x N N-body force texture, texel (i,j) holds force(i,j)]

28 Parallel Reductions
- 1D parallel reduction:
  - sum N columns or rows in parallel
  - add two halves of texture together
[Figure: the two halves of the NxN texture are added]

29 Parallel Reductions
- 1D parallel reduction:
  - sum N columns or rows in parallel
  - add two halves of texture together
  - repeatedly...
[Figure: the two halves of the Nx(N/2) texture are added]

30 Parallel Reductions
- 1D parallel reduction:
  - sum N columns or rows in parallel
  - add two halves of texture together
  - repeatedly...
[Figure: the two halves of the Nx(N/4) texture are added]

31 Parallel Reductions
- 1D parallel reduction:
  - sum N columns or rows in parallel
  - add two halves of texture together
  - repeatedly...
  - Until we’re left with a single row of texels
[Figure: Nx1 result]
Requires log_2 N steps
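
The halving scheme above can be sketched on the CPU as follows, assuming the texture is a list of rows and the row count is a power of two. The function name and the returned step count are illustrative additions.

```python
def reduce_columns(texture):
    """Sum every column by repeatedly adding the second half of the rows onto the first.

    Returns (column_sums, number_of_steps). Assumes the row count is a power of two.
    """
    n = len(texture)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("row count must be a power of two")
    rows = [row[:] for row in texture]
    steps = 0
    while len(rows) > 1:
        half = len(rows) // 2
        # One pass: every output texel adds two input texels, all independently.
        rows = [[a + b for a, b in zip(rows[r], rows[r + half])] for r in range(half)]
        steps += 1
    return rows[0], steps


sums, steps = reduce_columns([[1, 2],
                              [3, 4],
                              [5, 6],
                              [7, 8]])
```

Four rows collapse in two passes, matching the log_2 N step count stated on the slide.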

32 Update Positions and Velocities
- Now we have a 1-D array of total forces
  - One per body
- Update Velocity
  - u(i, t+dt) = u(i, t) + F_total(i) * dt
  - Simple pixel shader reads previous velocity and force textures, creates new velocity texture
- Update Position
  - x(i, t+dt) = x(i, t) + u(i, t) * dt
  - Simple pixel shader reads previous position and velocity textures, creates new position texture
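
A CPU sketch of the two update rules, written exactly as they appear on the slide. Two things to note as assumptions: the slide adds `F_total * dt` directly to the velocity, which treats the force as an acceleration (equivalent to unit mass), and the position update uses the previous velocity. The function name and tuple layout are hypothetical.

```python
def euler_step(positions, velocities, total_forces, dt):
    """One time step; every argument is a list of (x, y, z) tuples, one per body."""
    # u(i, t+dt) = u(i, t) + F_total(i) * dt
    new_velocities = [
        tuple(u + f * dt for u, f in zip(vel, force))
        for vel, force in zip(velocities, total_forces)
    ]
    # x(i, t+dt) = x(i, t) + u(i, t) * dt   (previous velocity, as on the slide)
    new_positions = [
        tuple(x + u * dt for x, u in zip(pos, vel))
        for pos, vel in zip(positions, velocities)
    ]
    return new_positions, new_velocities


new_pos, new_vel = euler_step(
    positions=[(0.0, 0.0, 0.0)],
    velocities=[(1.0, 0.0, 0.0)],
    total_forces=[(2.0, 0.0, 0.0)],
    dt=0.5,
)
```

Both outputs are new lists, mirroring the shader passes that read the previous textures and write fresh ones.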

GPGPU Flow Control Strategies Branching and Looping

34 Branching Techniques
- Fragment program branches can be expensive
  - No true fragment branching on GeForce FX or Radeon 9x00-X850
  - SIMD branching on GeForce 6+ Series
    - Incoherent branching hurts performance
- Sometimes better to move decisions up the pipeline
  - Replace with math
  - Occlusion Query
  - Static Branch Resolution
  - Z-cull
  - Pre-computation
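
"Replace with math" usually means computing both outcomes and blending them with a 0/1 mask instead of branching. The sketch below shows the idea in Python with hypothetical helper names. On a GPU the mask would typically come from a built-in such as `step`, so the Python conditional inside `step` here only stands in for that built-in.

```python
def step(edge, x):
    """1.0 where x >= edge, else 0.0 (stand-in for a shading-language built-in)."""
    return 1.0 if x >= edge else 0.0


def select(mask, a, b):
    """Return a where mask is 1.0 and b where mask is 0.0, using arithmetic only."""
    return b + mask * (a - b)


def clamp_negative_to_zero(x):
    """Branch-free version of: if x < 0 then 0 else x."""
    return select(step(0.0, x), x, 0.0)
```

The trade-off is that both alternatives are always evaluated, so this pays off mainly when each alternative is cheap.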

35 Branching with Occlusion Query
- Use it for iteration termination
    Do { // outer loop on CPU
      BeginOcclusionQuery {
        // Render with fragment program that
        // discards fragments that satisfy
        // termination criteria
      } EndQuery
    } While query returns > 0
- Can be used for subdivision techniques
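
The control flow of that loop can be modeled on the CPU by counting how many elements were not "discarded" in each pass, which is the number an occlusion query would report. All names below are hypothetical, and `max_passes` is a safety limit added for the sketch.

```python
def iterate_until_done(values, update, is_done, max_passes=1000):
    """Repeat full passes until no element survives the termination test.

    Returns (final_values, passes_that_updated_something).
    """
    passes = 0
    while passes < max_passes:
        survivors = 0                     # what the occlusion query would return
        next_values = []
        for v in values:
            if is_done(v):
                next_values.append(v)     # "discarded" fragment: value left untouched
            else:
                next_values.append(update(v))
                survivors += 1
        values = next_values
        if survivors == 0:                # query returned 0: stop the outer loop
            break
        passes += 1
    return values, passes


final_values, passes = iterate_until_done(
    [1, 4, 16],
    update=lambda v: v * 2,
    is_done=lambda v: v >= 16,
)
```

As in the slide's loop, one last pass is always spent discovering that nothing is left to update.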

36 Static Branch Resolution
- Avoid branches where outcome is fixed
  - One region is always true, another false
  - Separate FPs for each region, no branches
- Example: boundaries

37 Z-Cull
- In early pass, modify depth buffer
  - Clear Z to 1
  - Draw quad at Z=0
  - Discard pixels that should be modified in later passes
- Subsequent passes
  - Enable depth test (GL_LESS)
  - Draw full-screen quad at z=0.5
  - Only pixels with previous depth=1 will be processed
- Can also use stencil cull on GeForce 6 series
  - Not available on GeForce FX (NV3X)
- Discard and shader depth output disable Z-Cull
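
The two-pass logic can be modeled on the CPU with a per-pixel depth list. This sketch only mirrors the depth comparisons described above; it does not model the hardware's early rejection, and the function names are assumptions.

```python
def build_depth_mask(values, needs_update):
    """Early pass: Z is cleared to 1, then a quad at Z=0 is drawn.

    Pixels that need later updates are discarded, so they keep depth 1;
    all other pixels end up with depth 0.
    """
    return [1.0 if needs_update(v) else 0.0 for v in values]


def z_culled_pass(values, depth, kernel, quad_z=0.5):
    """Later pass: run `kernel` only where quad_z < stored depth (GL_LESS)."""
    return [kernel(v) if quad_z < d else v for v, d in zip(values, depth)]


values = [1, -2, 3]
depth = build_depth_mask(values, needs_update=lambda v: v > 0)
updated = z_culled_pass(values, depth, kernel=lambda v: v * 10)
```

Only the pixels left at depth 1 pass the `0.5 < depth` test, so the kernel never runs on the masked-out pixel.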

38 Pre-computation
- Pre-compute anything that will not change every iteration!
- Example: static obstacles in fluid sim
  - When user draws obstacles, compute texture containing boundary info for cells
  - Reuse that texture until obstacles are modified
  - Combine with Z-cull for higher performance!

39 GeForce 6 Series Branching
- True, SIMD branching
  - Lots of incoherent branching can hurt performance
  - Should have coherent regions of about 1000 pixels
    - That is only about 30x30 pixels, so still very usable!
- Don’t ignore overhead of branch instructions
  - Branching over < 5 instructions may not be worth it
- Use branching for early exit from loops
  - Save a lot of computation

40 Summary
- Presented mappings of basic computational concepts to GPUs
  - Basic concepts and terminology
  - For introductory “Hello GPGPU” sample code, see
- Only the beginning:
  - Rest of course presents advanced techniques, strategies, and specific algorithms.