Modified from: A Survey of General-Purpose Computation on Graphics Hardware John Owens University of California, Davis David Luebke University of Virginia.

Presentation transcript:

Modified from: A Survey of General-Purpose Computation on Graphics Hardware
John Owens, University of California, Davis
David Luebke, University of Virginia
with Naga Govindaraju, Mark Harris, Jens Krüger, Aaron Lefohn, Tim Purcell

2 Motivation: The Potential of GPGPU
  In short:
    The power and flexibility of GPUs make them an attractive platform for general-purpose computation
    Example applications range from in-game physics simulation to conventional computational science
    Goal: make the inexpensive power of the GPU available to developers as a sort of computational coprocessor

3 Problems: Difficult To Use
  GPUs designed for & driven by video games
    Programming model unusual
    Programming idioms tied to computer graphics
    Programming environment tightly constrained
  Underlying architectures are:
    Inherently parallel
    Rapidly evolving (even in basic feature set!)
    Largely secret
  Can’t simply “port” CPU code!

4 STAR Goals
  Detailed & useful survey of general-purpose computing on graphics hardware
  Hardware and software developments behind GPGPU
  Building blocks: techniques for mapping general-purpose computation to the GPU
  Applications: important applications of GPGPU
  A comprehensive GPGPU bibliography

5 [Figure: NVIDIA GeForce pipeline diagram, courtesy Nick Triantos, NVIDIA. Labeled blocks: Vertex, Triangle Setup, Z-Cull, Shader Instruction Dispatch, Fragment, L2 Tex, Fragment Crossbar, Composite, four Memory Partitions]

6 Programming a GPU for Graphics
  Each fragment is shaded with a SIMD program
  Shading can use values from texture memory
  Image can be used as texture on future passes
  Application specifies geometry → rasterized

7 Programming a GPU for GP Programs
  Run a SIMD kernel over each fragment
  “Gather” is permitted from texture memory
  Resulting buffer can be treated as texture on next pass
  Draw a screen-sized quad → stream

8 Feedback
  Each algorithm step depends on the results of previous steps
  Each time step depends on the results of the previous time step
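
The pass structure described on the two slides above can be mimicked on the CPU to make the data flow explicit. What follows is a minimal Python sketch, not GPU code: `run_pass`, `blur_kernel`, and `simulate` are hypothetical names chosen for illustration, and a nested list stands in for a texture. Each pass evaluates the kernel once per output pixel, the kernel may gather from the read-only input, and the result is fed back as the input of the next pass (often called ping-ponging).

```python
def run_pass(kernel, texture):
    """Emulate one render pass: evaluate `kernel` once per output pixel.

    The kernel may read (gather) any texel of the read-only input `texture`,
    but it can only produce the value of the pixel it was invoked for.
    """
    height, width = len(texture), len(texture[0])
    return [[kernel(texture, x, y) for x in range(width)] for y in range(height)]


def blur_kernel(tex, x, y):
    """Average a texel with its left and right neighbours (clamped at edges)."""
    width = len(tex[0])
    left = tex[y][max(x - 1, 0)]
    right = tex[y][min(x + 1, width - 1)]
    return (left + tex[y][x] + right) / 3.0


def simulate(texture, steps):
    """Feedback loop: the output of each pass becomes the next pass's input."""
    for _ in range(steps):
        texture = run_pass(blur_kernel, texture)
    return texture
```

Because every pass builds a fresh output buffer, no kernel invocation ever observes a partially updated input, which mirrors the read-only texture / write-only render target split.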

9 CPU-GPU Analogies
  CPU: array write (Grid[i][j] = x; ...) = GPU: render to texture

10 CPU-GPU Analogies
  CPU: stream / data array = GPU: texture
  CPU: memory read = GPU: texture sample

11 Kernels
  CPU: kernel / loop body / algorithm step = GPU: fragment program

12 Scatter vs. Gather
  Grid communication
    Grid cells share information
  Gather
    Indirect read from memory (x = a[i])
    Naturally maps to a texture fetch
    Used to access data structures and data streams
  Scatter
    Indirect write to memory (a[i] = x)
    Difficult to emulate: usually done on CPU
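
The difference between the two access patterns can be shown in a few lines. This is an illustrative Python sketch with hypothetical function names; `scatter_as_gather` is one possible gather-only emulation (each output cell searches for the value addressed to it), shown only to suggest why scatter is costly without write-to-arbitrary-address support. It is not presented as the method used in the survey.

```python
def gather(a, indices):
    """Gather: output i reads from a computed address, x = a[indices[i]]."""
    return [a[i] for i in indices]


def scatter(values, indices, size, fill=0):
    """Scatter: input i writes to a computed address, out[indices[i]] = v."""
    out = [fill] * size
    for v, i in zip(values, indices):
        out[i] = v
    return out


def scatter_as_gather(values, indices, size, fill=0):
    """Emulate scatter using reads only: every output cell scans all inputs
    for one addressed to it (the last writer wins, matching `scatter`)."""
    def kernel(dst):
        result = fill
        for v, i in zip(values, indices):
            if i == dst:
                result = v
        return result
    return [kernel(dst) for dst in range(size)]
```

The emulation does work proportional to the number of inputs for every output cell, whereas a true scatter touches each input once.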

13 Computational Resources Inventory
  Programmable parallel processors
    Vertex & Fragment pipelines
  Rasterizer
    Mostly useful for interpolating addresses (texture coordinates) and per-vertex constants
  Texture unit
    Read-only memory interface
  Render to texture
    Write-only memory interface

14 Vertex Processor
  Fully programmable (SIMD / MIMD)
  Processes 4-vectors (RGBA / XYZW)
  Capable of scatter but not gather
    Can change the location of current vertex
    Cannot read info from other vertices
    Can only read a small constant memory
  Latest GPUs: Vertex Texture Fetch
    Random access memory for vertices
    Enables gather (but not from the vertex stream itself)

15 Fragment Processor
  Fully programmable (SIMD)
  Processes 4-component vectors (RGBA / XYZW)
  Random access memory read (textures)
  Capable of gather but not scatter
    RAM read (texture fetch), but no RAM write
    Output address fixed to a specific pixel
  Typically more useful than vertex processor
    More fragment pipelines than vertex pipelines
    Direct output (fragment processor is at end of pipeline)

Building Blocks & Applications

17 GPGPU Building Blocks
  Fundamental techniques & computational building blocks:
    Flow control (a very fundamental building block)
    Stream operations
    Data structures
    Differential equations & linear algebra
    Data queries

18 Flow control
  Surprising number of issues on GPUs
  Main themes:
    Avoid branching when possible
    Move branching earlier in the pipeline when possible
    Largely SIMD → coherent branching most efficient
  Mechanisms:
    Rasterized geometry
    Z-cull
    Occlusion query

19 Domain Decomposition
  Avoid branches where outcome is fixed
    One region is always true, another false
    Separate fragment programs (FPs) for each region, no branches
  Example: boundaries
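
As a rough illustration of the boundary example, the sketch below runs one branch-free kernel over the interior of a grid and a second branch-free kernel over the edge cells, instead of a single kernel that tests "am I on the boundary?" at every cell. It is plain Python with hypothetical names; the four-neighbour average and the fixed-value boundary are assumptions made for the example, not rules taken from the slides.

```python
def interior_kernel(grid, x, y):
    """Four-neighbour average; only valid where all four neighbours exist."""
    return (grid[y][x - 1] + grid[y][x + 1] + grid[y - 1][x] + grid[y + 1][x]) / 4.0


def boundary_kernel(grid, x, y):
    """Assumed boundary rule: edge cells keep their current value."""
    return grid[y][x]


def step(grid):
    """One update, decomposed into an interior region and four edge strips.

    On a GPU each region would be drawn as separate geometry with its own
    fragment program, so neither kernel contains a boundary test.
    """
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    # Region 1: interior rectangle, handled by the interior kernel only.
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = interior_kernel(grid, x, y)
    # Region 2: top and bottom rows, then left and right columns.
    for x in range(w):
        out[0][x] = boundary_kernel(grid, x, 0)
        out[h - 1][x] = boundary_kernel(grid, x, h - 1)
    for y in range(h):
        out[y][0] = boundary_kernel(grid, 0, y)
        out[y][w - 1] = boundary_kernel(grid, w - 1, y)
    return out
```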

20 Flat 3D Textures

21 Flat 3D Textures
  Advantages
    One texture update per operation
    Better use of GPU parallelism
    Non-power-of-two Textures
    Quick simulation preview
  Disadvantage
    Must compute texture offsets
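
The offset computation mentioned as the disadvantage might look like the following. This is a sketch under stated assumptions: the volume's z-slices are assumed to be laid out as tiles in row-major order inside one 2D texture, and `flat_address`, `volume_address`, and `tiles_per_row` are hypothetical names, not identifiers from the original slides.

```python
def flat_address(x, y, z, width, height, tiles_per_row):
    """Map a 3D cell (x, y, z) to 2D texel coordinates (u, v).

    Slice z is stored as a width-by-height tile; tiles are laid out
    left to right, then top to bottom.
    """
    tile_col = z % tiles_per_row
    tile_row = z // tiles_per_row
    return (tile_col * width + x, tile_row * height + y)


def volume_address(u, v, width, height, tiles_per_row):
    """Inverse mapping: recover (x, y, z) from 2D texel coordinates."""
    z = (v // height) * tiles_per_row + (u // width)
    return (u % width, v % height, z)
```

A neighbour lookup in z then becomes a lookup in a different tile, which is where the extra address arithmetic per sample comes from.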

22 Staggered Simulation
  Non-interactive application:
    Simulate as fast as possible
    Frame rate suffers
  [Figure label: 20ms]

23 Staggered Simulation
  Interactive frame rate!
  Simulation still proceeds pretty fast
  [Figure labels: 10, 20ms]

24 Z-Cull
  In early pass, modify depth buffer
    Write depth=0 for pixels that should not be modified by later passes
    Write depth=1 for rest
  Subsequent passes
    Enable depth test (GL_LESS)
    Draw full-screen quad at z=0.5
    Only pixels with previous depth=1 will be processed
  Can also use early stencil test
  Note: Depth replace disables ZCull
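
The masking logic of this slide can be emulated on the CPU to check the depth values. The sketch below is plain Python rather than OpenGL; `build_depth_mask` and `masked_pass` are hypothetical names, and the comparison `QUAD_DEPTH < depth[y][x]` stands in for the GL_LESS depth test applied to a quad drawn at z=0.5.

```python
QUAD_DEPTH = 0.5  # depth at which the full-screen quad is drawn


def build_depth_mask(active):
    """Early pass: depth 1 where later passes should run, depth 0 elsewhere."""
    return [[1.0 if cell else 0.0 for cell in row] for row in active]


def masked_pass(kernel, texture, depth):
    """Later pass: run `kernel` only where the quad passes the depth test.

    Returns the new buffer and the number of kernel invocations, so the
    saved work is visible. Culled pixels keep their previous value.
    """
    out = [row[:] for row in texture]
    invocations = 0
    for y in range(len(texture)):
        for x in range(len(texture[0])):
            if QUAD_DEPTH < depth[y][x]:  # GL_LESS: incoming < stored
                out[y][x] = kernel(texture, x, y)
                invocations += 1
    return out, invocations
```

With stored depth 0 the test 0.5 < 0 fails and the pixel is skipped; with stored depth 1 the test 0.5 < 1 passes and the kernel runs.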

25 Pre-computation
  Pre-compute anything that will not change every iteration!
  Example: arbitrary boundaries
    When user draws boundaries, compute texture containing boundary info for cells
    Reuse that texture until boundaries modified
    Combine with Z-cull for higher performance!

26 Stream Operations
  Several stream operations in GPGPU toolkit:
    Map: apply a function to every element in a stream
    Reduce: use a function to reduce a stream to a smaller stream (often 1 element)
    Scatter/gather: indirect read and write
    Filter: select a subset of elements in a stream
    Sort: order elements in a stream
    Search: find a given element, nearest neighbors, etc.
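
Three of these operations are sketched below in plain Python, with hypothetical function names. The reduction is written as repeated passes that each roughly halve the stream, which is an assumption about how a multi-pass reduction could be organised on pass-based hardware; the slide itself only defines what reduce does. `stream_reduce` expects a non-empty stream.

```python
def stream_map(f, stream):
    """Map: apply `f` independently to every element."""
    return [f(x) for x in stream]


def reduce_pass(op, stream):
    """One reduction pass: each output combines two adjacent inputs."""
    out = [op(stream[i], stream[i + 1]) for i in range(0, len(stream) - 1, 2)]
    if len(stream) % 2:  # odd length: carry the last element forward
        out.append(stream[-1])
    return out


def stream_reduce(op, stream):
    """Reduce to one element; returns (result, number of passes used)."""
    passes = 0
    while len(stream) > 1:
        stream = reduce_pass(op, stream)
        passes += 1
    return stream[0], passes


def stream_filter(pred, stream):
    """Filter: keep only the elements for which `pred` is true."""
    return [x for x in stream if pred(x)]
```

For a stream of n elements this reduction needs about log2(n) passes, each of which is itself a map-like operation over a smaller output.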

27 Simple Fire Effect
  Blur and scroll upward
  Trails of blur emerge from bright source ‘embers’ at the bottom
  [Figure labels: VA, VB, VC, VD]

28 Cellular Automata
  Great for generating noise and other animated patterns to use in blending
  Game of Life in a Pixel Shader
  Cell ‘state’ relative to the rules is computed at each texel
  Dependent texture read
    State accesses ‘rules’ table, which is a texture
  Highly complex rules are easy!
  The Rules
    For a space that is 'populated':
      Each cell with one or no neighbors dies, as if by loneliness.
      Each cell with four or more neighbors dies, as if by overpopulation.
      Each cell with two or three neighbors survives.
    For a space that is 'empty' or 'unpopulated':
      Each cell with three neighbors becomes populated.
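
The rules-table idea can be sketched on the CPU as follows. This is illustrative Python, not shader code: `RULES` plays the role of the rules texture (indexed by current state and live-neighbour count, analogous to a dependent texture read), and treating cells outside the grid as empty is an assumption made for the example.

```python
# RULES[state][live_neighbours] -> next state (0 = empty, 1 = populated).
RULES = [
    [0, 0, 0, 1, 0, 0, 0, 0, 0],  # empty: becomes populated with exactly 3
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # populated: survives with 2 or 3
]


def count_neighbours(grid, x, y):
    """Count live cells among the eight neighbours; off-grid counts as empty."""
    h, w = len(grid), len(grid[0])
    total = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if (dx or dy) and 0 <= y + dy < h and 0 <= x + dx < w:
                total += grid[y + dy][x + dx]
    return total


def life_step(grid):
    """One generation: each cell's next state is a lookup in the rules table."""
    h, w = len(grid), len(grid[0])
    return [
        [RULES[grid[y][x]][count_neighbours(grid, x, y)] for x in range(w)]
        for y in range(h)
    ]
```

Changing the automaton only requires editing the table, which is the sense in which complex rules stay easy: the kernel itself does not change.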

29 Lattice Computations
  How far can we take them?
  Anything we can describe with discrete PDEs!
    Discrete in space and time
  Also other approximations

30 Approximate Methods
  Several different approximations
    Cellular Automata (CA)
    Coupled Map Lattice (CML)
    Lattice-Boltzmann Methods (LBM)

31 Coupled Map Lattice
  Mapping:
    Continuous state → lattice nodes
  Coupling:
    Nodes interact with each other to produce new state according to specified rules

32 Coupled Map Lattice
  CML introduced by Kaneko (1980s)
    Used CML to study spatio-temporal chaos
  Others adapted CML to physical simulation:
    Boiling [Yanagita 1992]
    Convection [Yanagita 1993]
    Clouds [Yanagita 1997; Miyazaki 2001]
    Chemical reaction-diffusion [Kapral ’93]
    Saltation (sand ripples / dunes) [Nishimori ’93]
    And more

33 CML vs. CA
  CML extends cellular automata (CA)

           CA         CML
    SPACE  Discrete   Discrete
    TIME   Discrete   Discrete
    STATE  Discrete   Continuous

34 CML vs. CA
  Continuous state is more useful
  Discrete: physical quantities difficult
    Must filter over many nodes to get “real” values
  Continuous: physical quantities easy
    Real physical values at each node
    Temperature, velocity, concentration, etc.

35 Rules?
  CML updated via simple, local rules
    Simple: same rule applied at every cell (SIMD)
    Local: cells updated according to some function of their neighbors’ state

36 Example: Buoyancy
  Used in temperature-based boiling simulation
  At each cell:
    If neighbors to left and right of cell are warmer, raise the cell’s temperature
    If neighbors are cooler, lower its temperature
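
Read literally, the rule on this slide could be coded as below. This is a deliberately simplified Python sketch: the fixed adjustment `rate`, the horizontal wrap-around, and the function names are all assumptions made for illustration, and the slide does not give the actual formula used in the boiling simulation it refers to.

```python
def buoyancy_kernel(temp, x, y, rate):
    """Raise the cell if both horizontal neighbours are warmer, lower it if
    both are cooler, otherwise leave it unchanged (wraps at left/right edges)."""
    w = len(temp[0])
    t = temp[y][x]
    left = temp[y][(x - 1) % w]
    right = temp[y][(x + 1) % w]
    if left > t and right > t:
        return t + rate
    if left < t and right < t:
        return t - rate
    return t


def buoyancy_step(temp, rate=0.5):
    """Apply the same local rule at every cell, reading only the old state."""
    return [
        [buoyancy_kernel(temp, x, y, rate) for x in range(len(temp[0]))]
        for y in range(len(temp))
    ]
```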

37 CML Operations
  Implement operations as building blocks for use in multiple simulations
    Diffusion
    Buoyancy (2 types)
    Latent Heat
    Advection
    Viscosity / Pressure
    Gray-Scott Chemical Reaction
    Boundary Conditions
    User interaction (drawing)
    Transfer function (color gradient)

38 Anatomy of a CML operation
  Neighbor Sampling
    Select and read values, v, of nearby cells
  Computation on Neighbors
    Compute f(v) for each sample (f can be arbitrary computation)
  Combine new values (arithmetic)
  Store new values back in lattice
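
The four stages listed above can be sketched as one generic routine. The Python below is an illustration under assumptions: `cml_operation` and `diffusion` are hypothetical names, the lattice wraps around at its edges, and the diffusion formula (move each cell toward the mean of its four neighbours by a factor `c`) is a common textbook form chosen for the example rather than taken from the slides.

```python
def cml_operation(lattice, offsets, f, combine):
    """Generic CML step: sample neighbours, compute f(v), combine, store."""
    h, w = len(lattice), len(lattice[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # 1. Neighbour sampling (periodic boundaries assumed).
            samples = [lattice[(y + dy) % h][(x + dx) % w] for dx, dy in offsets]
            # 2. Computation on neighbours.
            values = [f(v) for v in samples]
            # 3. Combine new values; 4. store back into the output lattice.
            out[y][x] = combine(lattice[y][x], values)
    return out


def diffusion(lattice, c=0.25):
    """Assumed diffusion rule: relax each cell toward its neighbours' mean."""
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    return cml_operation(
        lattice,
        offsets,
        lambda v: v,
        lambda centre, vals: centre + c * (sum(vals) / len(vals) - centre),
    )
```

Other operations from the previous slide would plug different `offsets`, `f`, and `combine` into the same skeleton.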

39 Graphics Hardware
  Why use it?
    Speed: up to 25x speedup in our sims
    GPU perf. grows faster than CPU perf.
    Cheap: GeForce 4 Ti 4200 < $130
    Load balancing in complex applications
  Why not use it?
    Low precision computation (not anymore!)
    Difficult to program (not anymore!)

40 Hardware Implementation (GF4)

Simulating the world
  Simulate a wide variety of phenomena on GPUs
  Anything we can describe with discrete PDEs or approximations of PDEs