GPU Data Formatting and Addressing

Presentation transcript:

GPU Data Formatting and Addressing Aaron Lefohn University of California, Davis

Overview GPU Memory Model GPU-Based Data Structures Performance Considerations

GPU memory model. GPU Data Storage: vertex data, texture data, frame buffer. [Diagram: PS3.0 GPUs; Vertex Data, Texture Data, Vertex Processor, Rasterizer, Fragment Processor, Frame Buffer(s)]

Notes:
- Copy framebuffer result to “other GPU memory” (i.e., texture)
- Write directly to “other GPU memory” instead of framebuffer.
- Does the OS or OpenGL own GPU memory?
- What other memory can we write to?
  - Textures
  - Vertex array buffers?
  - Fbuffer?
- Mechanisms by which GPU can write to its own memory
  - Copy from framebuffer/pbuffer to texture
    - Cross platform
    - 2D output, save in 1D, 2D, 3D texture memory
    - Slow...
  - WGL_ARB_render_texture
    - RTT using pbuffers (only on Windows)
    - Fast RTT, but context switch is slow (time this!)
    - Current state of the art and lots of hacks to speed up
    - See next section for details of hackery
  - GL_EXT_render_target
    - Lightweight extension to enable cross-platform, efficient RTT.
    - Spec. not yet approved and no implementation
  - GL_EXT_pixel_buffer_object
    - Copy from frame buffer to vertex buffer
    - Asynchronous CPU readbacks
    - Supported by current NVIDIA drivers
    - TODO: Can I talk about this?
  - Uber buffers
    - General memory model for GPUs
    - Textures, frame buffers, vertex buffers are all just “memory”
    - Render to any GPU memory: N-D texture, vertex arrays, stencil buffer, frame buffer, etc.
    - Cross platform (OpenGL owns the memory, not the OS)
    - Mix-and-match depth buffers/color buffers/etc.
    - Alpha ATI drivers; spec. not approved
- Stream/GPU-Based Data Structures
  1) Multi-dimensional streams
    - Read/Write GPU memory optimized for 2D (images!)
    - But isn't memory all really 1D?
      - Yes, but GPU memory hierarchy is optimized for 2D accesses. Texture caches must capture multidimensional locality for texture filtering and 2D rasterization.
      - Reference: Stanford texture cache paper.
    - Result is that GPGPU programmer should use illusion of 2D physical memory.
    - Large 1D streams
      - Lay out in 2D
    - 3D streams
      - Update slice-by-slice (potentially limits parallelism)
      - Flatten parts or all into large 2D texture(s)
    - Streams of higher dimension (> 3D)
      - Lay out in 2D memory in the same way that N-D arrays use 1D CPU memory.
      - 2D memory is limited in size.
  4) How does the GPU get memory addresses?
    - Per-Vertex
      - Vertex attributes
      - Computed in vertex program
      - Read from vertex texture
    - Per-Fragment
      - Per-vertex addresses interpolated by rasterizer
      - Computed in fragment stage
      - Read from texture memory
  2) Pointers
    - Dependent texture lookups
  3) Sparse Data
    - Two options
      - Store entire dataset on GPU and create substreams out of it (depth culling or geometry-based substreams).
        - Sherbondy et al., IEEE Visualization 2003
        - Purcell et al.
      - Only store sparse data on GPU (in packed format)
        - Sparse matrices: Krueger, Bolz
        - Sparse volume: Lefohn
- Performance
  - Pbuffers
    - Currently the state of the art for RTT
    - Most implementations optimized for RGBA??? (TODO: Is this true?)
    - Avoid context switches (TIME this!!)
    - Pack scalar data into RGBA channels
    - Use multiple surfaces (front/back/aux0/...)
    - Pack 2D domains into larger buffers (dangerous!)
  - Texture Cache Considerations
    - Caches designed to capture 2D locality with respect to rasterization and texture filtering.
  - Dependent Texture Reads
    - NVIDIA: Based on cache locality
    - ATI: ???
  - Compute addresses at lowest possible computational frequency
    - Neighbor offsets in vertex program
    - Avoid fragment-level address computation whenever possible

GPU memory model. Read-Only: traditional use of GPU memory; CPU writes, GPU reads. Read/Write: save frame buffer(s) for later use as texture or vertex array; save up to 16 32-bit floating-point values per pixel: Multiple Render Targets (MRTs).

How to Save Render Result Copy framebuffer result to “other GPU memory” Copy-to-texture Copy-to-vertex-array Write directly to “other GPU memory'' Render-to-texture Render-to-vertex-array

OpenGL GPU Memory Writes Texture Copy frame buffer to texture Render-to-texture WGL_ARB_render_texture GL_EXT_render_target Superbuffers Vertex Array Copy frame buffer to vertex array GL_EXT_pixel_buffer_object Render-to-vertex-array

Render-To-Texture: 1 Copy-To-Texture Good Cross-Platform texture writes Flexible output 2D output  Copy to 1D, 2D, or 3D texture Bad Slow Consumes internal GPU memory bandwidth

Render-To-Texture: 2 WGL_ARB_render_texture Render-to-texture (RTT) using pbuffers http://oss.sgi.com/projects/ogl-sample/registry/ARB/wgl_render_texture.txt Good Fast RTT Current state of the art for RTT Bad Only works on Windows Slow OpenGL context switches Many hacks to avoid this bottleneck

Render-To-Texture: 3 GL_EXT_render_target Proposed extension for cross-platform RTT http://www.opengl.org/resources/features/GL_EXT_render_target.txt Good Cross-platform, efficient RTT solution Lightweight, simple extension Bad Specification not approved (April 24, 2004) No implementations exist (April 24, 2004)

Render-To-Texture: 4 Superbuffers Proposed new memory model for GPUs http://www.ati.com/developer/gdc/SuperBuffers.pdf Good Unified GPU memory model Render to any GPU memory Cross platform (OpenGL owns memory, not OS) Mix-and-match depth/stencil/color buffers Bad Large, complex extension Specification not approved (April 24, 2004) Only driver support is alpha version (ATI)

Render-To-Texture Summary OpenGL RTT Currently Only Under Windows Pbuffers Complex and awkward RTT mechanism Current state of the art Cross-Platform RTT Coming Soon…

Render-To-Vertex-Array: 1 GL_EXT_pixel_buffer_object Copy framebuffer to vertex buffer object http://developer.nvidia.com/object/nvidia_opengl_specs.html Good Only GPU/AGP memory bandwidth Works with current drivers (NVIDIA) Bad No direct render-to-vertex-array (slower than true RTVA) No ATI implementation

Render-To-Vertex-Array: 2 Superbuffers Write to “memory object” as render target Read from “memory object” as vertex array Good Direct render-to-vertex-array (fast) Bad Can render results always be interpreted as vertex data? Large, complex, unapproved extension, …

Render-To-Vertex-Array Summary Current OpenGL Support NVIDIA: GL_EXT_pixel_buffer_object ATI: Superbuffers Semantics Still Under Development…

Fbuffer: Capturing Fragments Idea “Rasterization-Order FIFO Buffer” Render results are fragment values instead of pixel values Mark and Proudfoot, Graphics Hardware 2001 http://graphics.stanford.edu/projects/shading/pubs/hwws2001-fbuffer/ Uses Designed for multi-pass rendering with transparent geometry New possibilities for GPGPU? Varying number of results per pixel RTT and RTVA with an fbuffer?

Fbuffer: Capturing Fragments Implementations ATI Radeon 9800 and newer ATI GPUs Not yet exposed to user (ask for it!) Problems Size of fbuffer is not known before rendering GPUs cannot perform dynamic memory allocation How to handle buffer overflow?

Overview GPU Memory Model GPU-Based Data Structures Performance Considerations

GPU-Based Data Structures Building Blocks GPU memory addresses Address Generation Address Use Pointers Multi-dimensional arrays Sparse representations

GPU Memory Addresses: Where Are Addresses Generated? CPU: vertex stream or textures. Vertex processor: input stream, ALU ops, or textures. Rasterizer: interpolation. Fragment processor: input stream, ALU ops, or textures. [Diagram: CPU, Vertex Processor, Rasterizer, Fragment Processor]

GPU Memory Addresses: Where Are Addresses Used? Vertex textures (PS3.0 GPUs). Fragment textures. [Diagram: CPU, Vertex Processor, Rasterizer, Fragment Processor, Texture Data]

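GPU Memory Addresses: Pointers. Store addresses in a texture and follow them with a dependent texture read. Example: see Tim Purcell’s ray tracing talk.

    // First read: fetch the stored address (the "pointer").
    float2 addr = tex2D( addrTex, texCoord ).xy;
    // Second, dependent read: fetch the data at that address.
    float2 data = tex2D( dataTex, addr ).xy;

[Diagram: address texture entries pointing to entries in a data texture]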
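
The dependent read above can be mimicked on the CPU to make the pointer idea concrete. The following is a minimal sketch, not the talk's implementation: the `Addr2`, `AddrTexture`, and `DataTexture` types and all function names are hypothetical, and integer texel coordinates stand in for the normalized floating-point coordinates a real fragment program would use.

```c
#include <assert.h>
#include <stddef.h>

/* Integer texel address (assumption: real hardware uses normalized floats). */
typedef struct { int x, y; } Addr2;

/* Hypothetical CPU stand-in for a 2D texture that stores addresses. */
typedef struct { int width, height; const Addr2 *texels; } AddrTexture;

/* Hypothetical CPU stand-in for a 2D texture that stores scalar data. */
typedef struct { int width, height; const float *texels; } DataTexture;

/* First read: fetch the stored address (the "pointer"). */
static Addr2 fetch_addr(const AddrTexture *t, Addr2 coord)
{
    assert(coord.x >= 0 && coord.x < t->width);
    assert(coord.y >= 0 && coord.y < t->height);
    return t->texels[(size_t)coord.y * (size_t)t->width + (size_t)coord.x];
}

/* Second read: fetch the data element at a given address. */
static float fetch_data(const DataTexture *t, Addr2 coord)
{
    assert(coord.x >= 0 && coord.x < t->width);
    assert(coord.y >= 0 && coord.y < t->height);
    return t->texels[(size_t)coord.y * (size_t)t->width + (size_t)coord.x];
}

/* Dependent read: the result of the first fetch is the address of the second. */
static float dereference(const AddrTexture *a, const DataTexture *d, Addr2 coord)
{
    return fetch_data(d, fetch_addr(a, coord));
}
```

Because the second fetch cannot start until the first returns, the address texture controls the access pattern into the data texture, which is why cache behavior of dependent reads matters later in the talk.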

GPU-Based Data Structures Building Blocks GPU memory addresses Address Generation Address Use Pointers Multi-dimensional arrays Sparse representations

Multi-Dimensional Arrays Build Data Structures in 2D Memory Read/Write GPU memory optimized for 2D Images But Isn’t Physical Memory 1D? GPU memory hierarchy optimized to capture 2D locality Rasterization Texture filtering Igehy, Eldridge, Proudfoot, “Prefetching in a Texture Cache Architecture,” Graphics Hardware, 1998 Conclusion: Use illusion of 2D physical memory

GPU Arrays Large 1D Arrays Current GPUs limit 1D array sizes to 2048 or 4096 Pack into 2D memory 1D-to-2D address translation
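
One common way to do the 1D-to-2D address translation is a row-major layout. The sketch below assumes that layout; the type and function names are hypothetical, and the buffer width is whatever 2D size the hardware allows.

```c
#include <assert.h>

/* Integer texel address in a 2D buffer. */
typedef struct { int x, y; } Addr2;

/* Map a 1D element index to a texel in a 2D buffer of the given width. */
static Addr2 addr_1d_to_2d(int index, int width)
{
    assert(index >= 0 && width > 0);
    Addr2 a = { index % width, index / width };
    return a;
}

/* Inverse mapping: recover the 1D index from a 2D texel address. */
static int addr_2d_to_1d(Addr2 a, int width)
{
    assert(width > 0 && a.x >= 0 && a.x < width && a.y >= 0);
    return a.y * width + a.x;
}

/* Number of rows needed to hold `count` elements at the given width. */
static int rows_needed(int count, int width)
{
    assert(count >= 0 && width > 0);
    return (count + width - 1) / width;
}
```

For example, with a width of 2048, element 5000 of the 1D array lands at texel (904, 2), and 5000 elements need 3 rows.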

GPU Arrays 3D Arrays Problem GPUs do not have 3D frame buffers No RTT to slice of 3D texture (except Superbuffers) Solutions Stack of 2D slices Multiple slices per 2D buffer
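
The "multiple slices per 2D buffer" option can be sketched as tiling the z-slices of the volume across one large 2D buffer. This is an illustrative layout under stated assumptions (slices placed left to right, then row by row); the `FlatVolume` type and function names are hypothetical.

```c
#include <assert.h>

/* Integer texel address in the flattened 2D buffer. */
typedef struct { int x, y; } Addr2;

/* A 3D volume stored as a grid of 2D slices inside one 2D buffer. */
typedef struct {
    int sx, sy, sz;     /* volume dimensions in voxels */
    int tiles_per_row;  /* number of slices placed side by side */
} FlatVolume;

/* Width of the 2D buffer that holds the flattened volume. */
static int flat_width(const FlatVolume *v)
{
    return v->tiles_per_row * v->sx;
}

/* Height of the 2D buffer (rounds up to whole rows of slices). */
static int flat_height(const FlatVolume *v)
{
    int tile_rows = (v->sz + v->tiles_per_row - 1) / v->tiles_per_row;
    return tile_rows * v->sy;
}

/* Translate a voxel address (x, y, z) into a texel of the 2D buffer. */
static Addr2 addr_3d_to_2d(const FlatVolume *v, int x, int y, int z)
{
    assert(v->tiles_per_row > 0);
    assert(x >= 0 && x < v->sx && y >= 0 && y < v->sy && z >= 0 && z < v->sz);
    Addr2 a;
    a.x = (z % v->tiles_per_row) * v->sx + x;  /* column of the slice tile */
    a.y = (z / v->tiles_per_row) * v->sy + y;  /* row of the slice tile */
    return a;
}
```

With a 64x64x64 volume and 8 slices per row, the flattened buffer is 512x512 and voxel (3, 5, 10) maps to texel (131, 69).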

GPU Arrays: Problems With 3D Arrays for GPGPU. Problems: cannot read stack of 2D slices as 3D texture; must know which slices are needed in advance; visualization of 3D data difficult. Solutions: need render-to-slice-of-3D-texture (Superbuffers); volume rendering of slice-based 3D data (Course 28, “Real-Time Volume Graphics”, Siggraph 2004).

GPU Arrays: Higher Dimensional Arrays. Pack into 2D buffers; N-D to 2D address translation; same problems as 3D arrays if data does not fit in a single 2D texture. Conclusions: fundamental GPU memory primitive is a fixed-size 2D array; GPGPU needs more general memory model.
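
N-D to 2D address translation can be done in two steps: linearize the N-D index the way a CPU array would, then split the linear index across the rows of a 2D buffer. The sketch below assumes row-major order (last dimension varies fastest); the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Integer texel address in a 2D buffer. */
typedef struct { int x, y; } Addr2;

/* Row-major linearization of an N-D index; the last dimension varies fastest. */
static long linearize(const int *index, const int *dims, size_t rank)
{
    long linear = 0;
    for (size_t i = 0; i < rank; ++i) {
        assert(index[i] >= 0 && index[i] < dims[i]);
        linear = linear * dims[i] + index[i];
    }
    return linear;
}

/* Translate an N-D index into a texel of a 2D buffer of the given width. */
static Addr2 addr_nd_to_2d(const int *index, const int *dims, size_t rank, int width)
{
    assert(width > 0);
    long linear = linearize(index, dims, rank);
    Addr2 a = { (int)(linear % width), (int)(linear / width) };
    return a;
}
```

For a 4x5x6x7 array stored in a buffer 256 texels wide, index (1, 2, 3, 4) has linear offset 319 and maps to texel (63, 1). The total element count still has to fit within the maximum 2D buffer size, which is the limit the slide points out.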

GPU-Based Data Structures Building Blocks GPU memory addresses Address Generation Address Use Pointers Multi-dimensional arrays Sparse representations

Sparse Data Structures Why Sparse Data Structures? Reduce computational workload Reduce memory pressure Examples Sparse matrices Krueger et al., Siggraph 2003 Bolz et al., Siggraph 2003 Implicit surface computations (sparse volumes) Sherbondy et al., IEEE Visualization 2003 Lefohn et al., IEEE Visualization 2003 Premoze et al. Eurographics 2003

Sparse Computation Option 1: Store Complete Data Set on GPU Cull unused data Conditional execution tricks (discussed earlier) Option 2: Store Only Sparse Data on GPU Saves memory Potentially much faster than culling Much more complicated (especially if time-varying)

Sparse Data Structures Basic Idea Pack “active” data elements into GPU memory For more information Linear algebra section in this course : Static structures Level-set case study in this course : Dynamic structures

Sparse Data Structures Addressing Sparse Data Neighborhoods no longer implicitly defined on grid Use pointer-based data structures to locate neighbors Pre-compute neighbor addresses if possible Use CPU or vertex processor Removes pointer dereference from fragment program Separate common addressing case from boundary conditions Common case must be cache coherent See Harris and Lefohn case studies for “substream” technique
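
As a rough illustration of packing active elements and pre-computing neighbor addresses on the CPU, the sketch below packs the active cells of a 2D grid into a dense list and records, for each packed cell, the packed address of its four grid neighbors. It is not the data structure from the cited papers; the `Neighbors` type, `NO_NEIGHBOR` marker, and `build_sparse` function are hypothetical.

```c
#include <assert.h>

/* Marker for a neighbor that is inactive or outside the grid. */
enum { NO_NEIGHBOR = -1 };

/* Packed addresses of the four grid neighbors of one active cell. */
typedef struct { int left, right, down, up; } Neighbors;

/* Pack active cells and pre-compute their neighbor addresses.
   mask:      w*h entries, nonzero marks an active cell
   packed_of: w*h outputs, packed address of each cell or NO_NEIGHBOR
   nbr:       outputs, one entry per active cell
   Returns the number of active cells. */
static int build_sparse(const unsigned char *mask, int w, int h,
                        int *packed_of, Neighbors *nbr)
{
    assert(w > 0 && h > 0);
    int count = 0;

    /* Pass 1: assign consecutive packed addresses to active cells. */
    for (int i = 0; i < w * h; ++i)
        packed_of[i] = mask[i] ? count++ : NO_NEIGHBOR;

    /* Pass 2: look up each active cell's neighbors in the packed layout. */
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            int p = packed_of[y * w + x];
            if (p == NO_NEIGHBOR)
                continue;
            nbr[p].left  = (x > 0)     ? packed_of[y * w + x - 1]   : NO_NEIGHBOR;
            nbr[p].right = (x < w - 1) ? packed_of[y * w + x + 1]   : NO_NEIGHBOR;
            nbr[p].down  = (y > 0)     ? packed_of[(y - 1) * w + x] : NO_NEIGHBOR;
            nbr[p].up    = (y < h - 1) ? packed_of[(y + 1) * w + x] : NO_NEIGHBOR;
        }
    }
    return count;
}
```

Cells whose neighbor table contains `NO_NEIGHBOR` are the boundary cases; processing them separately from fully interior cells mirrors the slide's advice to separate the common addressing case from boundary conditions.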

Overview GPU Memory Model GPU-Based Data Structures Performance Considerations

Memory Performance Issues Pbuffer Survival Guide Dependent Texture Costs Computational Frequency

Pbuffer Survival Guide Pbuffers Give us Render-To-Texture Designed to create an environment map or two Never intended to be used for GPGPU (100s of pbuffers) Problem Each pbuffer has its own OpenGL render context Each pbuffer may have depth and/or stencil buffer Changing OpenGL contexts is slow Solution Many optimizations to avoid this bottleneck…

Pbuffer Survival Guide: Pack Scalar Data Into RGBA. >4x memory savings. 4x reduction in context switches. Be careful of read-modify-write hazard. [Diagram: scalar data in 4 RGBA pbuffers versus 1 RGBA pbuffer]
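
The packing idea can be sketched on the CPU as interleaving four scalar fields into the four channels of one RGBA array. This is only an illustration of the layout; the `Rgba` type and function names are hypothetical and no pbuffer is involved.

```c
#include <assert.h>

/* One RGBA texel holding one element from each of four scalar fields. */
typedef struct { float r, g, b, a; } Rgba;

/* Interleave four scalar fields of length n into one RGBA array. */
static void pack_rgba(const float *f0, const float *f1,
                      const float *f2, const float *f3,
                      Rgba *out, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        out[i].r = f0[i];
        out[i].g = f1[i];
        out[i].b = f2[i];
        out[i].a = f3[i];
    }
}

/* Read element i of field `channel` (0..3) back out of the packed array. */
static float unpack_channel(const Rgba *texels, int i, int channel)
{
    assert(i >= 0 && channel >= 0 && channel < 4);
    switch (channel) {
    case 0:  return texels[i].r;
    case 1:  return texels[i].g;
    case 2:  return texels[i].b;
    default: return texels[i].a;
    }
}
```

Once the fields share a texel, a pass that updates one field while reading another from the same buffer runs into the read-modify-write hazard the slide warns about, so packed fields are best updated together or through a second buffer.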

Pbuffer Survival Guide: Use Multi-Surface Pbuffers. Each RGBA surface is its own render-texture: Front, Back, AuxN (N = 0,1,2,…). Greatly reduces context switches. Technically illegal, but “blessed” by ATI. Works on NVIDIA. [Diagram: 5 pbuffers with 1 RGBA surface each versus 1 pbuffer with 5 RGBA surfaces]

Pbuffer Survival Guide: Using Multi-Surface Pbuffers
1. Allocate a double-buffered pbuffer (and/or one with AUX buffers).
2. Set the render target to the back buffer: glDrawBuffer(GL_BACK)
3. Bind the front buffer as a texture: wglBindTexImageARB(hpbuffer, WGL_FRONT_LEFT_ARB)
4. Render.
5. Switch buffers: wglReleaseTexImageARB(hpbuffer, WGL_FRONT_LEFT_ARB), then glDrawBuffer(GL_FRONT), then wglBindTexImageARB(hpbuffer, WGL_BACK_LEFT_ARB)
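
The read-from-one-surface, write-to-the-other pattern can be illustrated without any OpenGL calls. The sketch below is a plain-C stand-in, not pbuffer code: two arrays play the roles of the front and back surfaces, and the `PingPong` type, the function names, and the smoothing pass are all hypothetical.

```c
#include <assert.h>

/* Two surfaces of one buffer; one is read while the other is written. */
typedef struct {
    float *surface[2];  /* stand-ins for the front and back surfaces */
    int n;              /* elements per surface */
    int read_index;     /* surface currently bound as input ("texture") */
} PingPong;

/* Surface that the current pass reads from. */
static const float *pp_input(const PingPong *p)
{
    return p->surface[p->read_index];
}

/* Surface that the current pass writes to. */
static float *pp_output(PingPong *p)
{
    return p->surface[1 - p->read_index];
}

/* Switch roles: the surface just written becomes the next input. */
static void pp_swap(PingPong *p)
{
    p->read_index = 1 - p->read_index;
}

/* One pass: each output is the mean of its two input neighbors
   (indices are clamped at the ends), followed by a surface swap. */
static void pp_smooth_pass(PingPong *p)
{
    assert(p->n > 0);
    const float *in = pp_input(p);
    float *out = pp_output(p);
    for (int i = 0; i < p->n; ++i) {
        int l = (i > 0) ? i - 1 : i;
        int r = (i < p->n - 1) ? i + 1 : i;
        out[i] = 0.5f * (in[l] + in[r]);
    }
    pp_swap(p);
}
```

Because a pass never writes the surface it reads, neighbor reads always see values from the previous pass, which is the property the bind, render, and switch sequence above is designed to preserve.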

Pbuffer Survival Guide: Pack 2D domains into large buffer (“flat 3D textures”). Be careful of read-modify-write hazard. [Diagram: 3D volume and its flattened volume]

Dependent Texture Costs Cache Coherency Dependent reads fast if they hit cache Even chained dependencies can be same speed as non-dependent reads Very slow if out of cache Example: 3 levels of dependent cache misses can be >10x slower More detail in “GPU Computation Strategies and Tricks”

Computational Frequency Compute Memory Addresses at Low Frequency Compute memory addresses in vertex program Let rasterizer interpolation create per-fragment addresses Compute neighbor addresses this way Avoid fragment-level address computation whenever possible Consumes fragment instructions Computation often redundant with neighboring fragments May defeat texture pre-fetch
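
The reason neighbor addresses can move to the vertex stage is that adding a constant offset commutes with linear interpolation. The sketch below checks this along a single edge; it is a simplified model with hypothetical function names, and it ignores the perspective correction a real rasterizer applies.

```c
#include <assert.h>

/* Linear interpolation between two per-vertex values. */
static float lerp(float v0, float v1, float t)
{
    return v0 + (v1 - v0) * t;
}

/* High frequency: interpolate the center coordinate, then add the
   neighbor offset once per fragment. */
static float neighbor_per_fragment(float c0, float c1, float t, float offset)
{
    return lerp(c0, c1, t) + offset;
}

/* Low frequency: add the offset once per vertex and let interpolation
   produce the per-fragment neighbor coordinate. */
static float neighbor_per_vertex(float c0, float c1, float t, float offset)
{
    return lerp(c0 + offset, c1 + offset, t);
}
```

Both functions give the same coordinate up to rounding, but in the second form the addition runs once per vertex rather than once per fragment, and the fragment program receives a ready-made texture coordinate.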

Conclusions. GPU Memory Model Evolving: writable GPU memory forms loop-back in an otherwise feed-forward streaming pipeline; memory model will continue to evolve as GPUs become more general stream processors. GPGPU Data Structures: basic memory primitive is limited-size, 2D texture; use address translation to fit all array dimensions into 2D; maintain 2D cache locality. Render-To-Texture: use pbuffers with care and eagerly adopt their successor.

Selected References J. Bolz, I. Farmer, E. Grinspun, P. Schröder, “Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid,” SIGGRAPH 2003 N. Goodnight, C. Woolley, G. Lewin, D. Luebke, G. Humphreys, “A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware,” Graphics Hardware 2003 M. Harris, W. Baxter, T. Scheuermann, A. Lastra, “Simulation of Cloud Dynamics on Graphics Hardware,” Graphics Hardware 2003 H. Igehy, M. Eldridge, K. Proudfoot, “Prefetching in a Texture Cache Architecture,” Graphics Hardware 1998 J. Krueger, R. Westermann, “Linear Algebra Operators for GPU Implementation of Numerical Algorithms,” SIGGRAPH 2003 A. Lefohn, J. Kniss, C. Hansen, R. Whitaker, “A Streaming Narrow-Band Algorithm: Interactive Deformation and Visualization of Level Sets,” IEEE Transactions on Visualization and Computer Graphics 2004

Selected References A. Lefohn, J. Kniss, C. Hansen, R. Whitaker, “Interactive Deformation and Visualization of Level Set Surfaces Using Graphics Hardware,” IEEE Visualization 2003 W. Mark, K. Proudfoot, “The F-Buffer: A Rasterization-Order FIFO Buffer for Multi-Pass Rendering,” Graphics Hardware 2001 T. Purcell, C. Donner, M. Cammarano, H. W. Jensen, P. Hanrahan, “Photon Mapping on Programmable Graphics Hardware,” Graphics Hardware 2003 A. Sherbondy, M. Houston, S. Napel, “Fast Volume Segmentation With Simultaneous Visualization Using Programmable Graphics Hardware,” IEEE Visualization 2003

OpenGL References GL_EXT_pixel_buffer_object http://www.nvidia.com/dev_content/nvopenglspecs/GL_EXT_pixel_buffer_object.txt GL_EXT_render_target, http://www.opengl.org/resources/features/GL_EXT_render_target.txt OpenGL Extension Registry http://oss.sgi.com/projects/ogl-sample/registry/ Superbuffers http://www.ati.com/developer/gdc/SuperBuffers.pdf WGL_ARB_render_texture http://oss.sgi.com/projects/ogl-sample/registry/ARB/wgl_render_texture.txt http://oss.sgi.com/projects/ogl-sample/registry/ARB/wgl_pbuffer.txt

Questions? Acknowledgements Cass Everitt, Craig Kolb, Chris Seitz, and Jeff Juliano at NVIDIA Mark Segal, Rob Mace, and Evan Hart at ATI GPGPU Siggraph 2004 course presenters Joe Kniss and Ross Whitaker Brian Budge John Owens National Science Foundation Graduate Fellowship Pixar Animation Studios