High-Performance Software

Slides:



Advertisements
Similar presentations
Accelerating Real-Time Shading with Reverse Reprojection Caching Diego Nehab 1 Pedro V. Sander 2 Jason Lawrence 3 Natalya Tatarchuk 4 John R. Isidoro 4.
Advertisements

COMPUTER GRAPHICS SOFTWARE.
COMPUTER GRAPHICS CS 482 – FALL 2014 NOVEMBER 10, 2014 GRAPHICS HARDWARE GRAPHICS PROCESSING UNITS PARALLELISM.
Lecture 38: Chapter 7: Multiprocessors Today’s topic –Vector processors –GPUs –An example 1.
RealityEngine Graphics Kurt Akeley Silicon Graphics Computer Systems.
Graphics Hardware CMSC 435/634. Transform Shade Clip Project Rasterize Texture Z-buffer Interpolate Vertex Fragment Triangle A Graphics Pipeline.
Stratified Sampling for Stochastic Transparency
Prepared 5/24/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
Direct3D New Rendering Features
1 Questions about GPUs AS-Temps réel – Bordeaux Philippe Decaudin Fabrice Neyret Stéphane Guy Sylvain Lefebvre.
Two Methods for Fast Ray-Cast Ambient Occlusion Samuli Laine and Tero Karras NVIDIA Research.
Graphics Hardware CMSC 435/634. Transform Shade Clip Project Rasterize Texture Z-buffer Interpolate Vertex Fragment Triangle A Graphics Pipeline.
High-Quality Parallel Depth-of- Field Using Line Samples Stanley Tzeng, Anjul Patney, Andrew Davidson, Mohamed S. Ebeida, Scott A. Mitchell, John D. Owens.
GCAFE 28 Feb Real-time REYES Jeremy Sugerman.
Real-Time Reyes: Programmable Pipelines and Research Challenges Anjul Patney University of California, Davis.
I3D Fast Non-Linear Projections using Graphics Hardware Jean-Dominique Gascuel, Nicolas Holzschuch, Gabriel Fournier, Bernard Péroche I3D 2008.
Fragment-Parallel Composite and Filter Anjul Patney, Stanley Tzeng, and John D. Owens University of California, Davis.
2009/04/07 Yun-Yang Ma.  Overview  What is CUDA ◦ Architecture ◦ Programming Model ◦ Memory Model  H.264 Motion Estimation on CUDA ◦ Method ◦ Experimental.
Rasterization and Ray Tracing in Real-Time Applications (Games) Andrew Graff.
1 Shader Performance Analysis on a Modern GPU Architecture Victor Moya, Carlos González, Jordi Roca, Agustín Fernández Jordi Roca, Agustín Fernández Department.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Chapter.
Microtriangle thesis mid-status September 22th, 2010.
1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 19, 2011 Emergence of GPU systems and clusters for general purpose High Performance Computing.
3D Graphics Processor Architecture Victor Moya. PhD Project Research on architecture improvements for future Graphic Processor Units (GPUs). Research.
Many-Core Programming with GRAMPS Jeremy Sugerman Kayvon Fatahalian Solomon Boulos Kurt Akeley Pat Hanrahan.
Z-Buffer Optimizations Patrick Cozzi Analytical Graphics, Inc.
GPU Simulator Victor Moya. Summary Rendering pipeline for 3D graphics. Rendering pipeline for 3D graphics. Graphic Processors. Graphic Processors. GPU.
Z-Buffer Optimizations Patrick Cozzi Analytical Graphics, Inc.
Status – Week 283 Victor Moya. 3D Graphics Pipeline Akeley & Hanrahan course. Akeley & Hanrahan course. Fixed vs Programmable. Fixed vs Programmable.
© 2004 Tomas Akenine-Möller1 Shadow Generation Hardware Vision day at DTU 2004 Tomas Akenine-Möller Lund University.
GPU Graphics Processing Unit. Graphics Pipeline Scene Transformations Lighting & Shading ViewingTransformations Rasterization GPUs evolved as hardware.
University of Texas at Austin CS 378 – Game Technology Don Fussell CS 378: Computer Game Technology Beyond Meshes Spring 2012.
1 A Hierarchical Shadow Volume Algorithm Timo Aila 1,2 Tomas Akenine-Möller 3 1 Helsinki University of Technology 2 Hybrid Graphics 3 Lund University.
© Copyright Khronos Group, Page 1 Harnessing the Horsepower of OpenGL ES Hardware Acceleration Rob Simpson, Bitboys Oy.
GPU Programming Robert Hero Quick Overview (The Old Way) Graphics cards process Triangles Graphics cards process Triangles Quads.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 7: Threading Hardware in G80.
CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA
Chris Kerkhoff Matthew Sullivan 10/16/2009.  Shaders are simple programs that describe the traits of either a vertex or a pixel.  Shaders replace a.
Week 2 - Friday.  What did we talk about last time?  Graphics rendering pipeline  Geometry Stage.
Programming Concepts in GPU Computing Dušan Gajić, University of Niš Programming Concepts in GPU Computing Dušan B. Gajić CIITLab, Dept. of Computer Science.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 CS 395 Winter 2014 Lecture 17 Introduction to Accelerator.
Stream Processing Main References: “Comparing Reyes and OpenGL on a Stream Architecture”, 2002 “Polygon Rendering on a Stream Architecture”, 2000 Department.
Real-time Graphics for VR Chapter 23. What is it about? In this part of the course we will look at how to render images given the constrains of VR: –we.
GRAPHICS PIPELINE & SHADERS SET09115 Intro to Graphics Programming.
Interactive Rendering With Coherent Ray Tracing Eurogaphics 2001 Wald, Slusallek, Benthin, Wagner Comp 238, UNC-CH, September 10, 2001 Joshua Stough.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 8: Threading Hardware in G80.
Xbox MB system memory IBM 3-way symmetric core processor ATI GPU with embedded EDRAM 12x DVD Optional Hard disk.
09/16/03CS679 - Fall Copyright Univ. of Wisconsin Last Time Environment mapping Light mapping Project Goals for Stage 1.
A SEMINAR ON 1 CONTENT 2  The Stream Programming Model  The Stream Programming Model-II  Advantage of Stream Processor  Imagine’s.
COMPUTER GRAPHICS CS 482 – FALL 2015 SEPTEMBER 29, 2015 RENDERING RASTERIZATION RAY CASTING PROGRAMMABLE SHADERS.
Image Processing A Study in Pixel Averaging Building a Resolution Pyramid With Parallel Computing Denise Runnels and Farnaz Zand.
Sunpyo Hong, Hyesoon Kim
UW EXTENSION CERTIFICATE PROGRAM IN GAME DEVELOPMENT 2 ND QUARTER: ADVANCED GRAPHICS The GPU.
Fast and parallel implementation of Image Processing Algorithm using CUDA Technology On GPU Hardware Neha Patil Badrinath Roysam Department of Electrical.
COMPUTER GRAPHICS CHAPTER 38 CS 482 – Fall 2017 GRAPHICS HARDWARE
Week 2 - Friday CS361.
CS427 Multicore Architecture and Parallel Computing
Graphics Processing Unit
Cache Memory Presentation I
Real-Time Ray Tracing Stefan Popov.
The Graphics Rendering Pipeline
Graphics Hardware CMSC 491/691.
Graphics Processing Unit
Lecture 13 Clipping & Scan Conversion
A Hierarchical Shadow Volume Algorithm
Mattan Erez The University of Texas at Austin
UMBC Graphics for Games
RADEON™ 9700 Architecture and 3D Performance
CIS 6930: Chip Multiprocessor: GPU Architecture and Programming
Presentation transcript:

High-Performance Software Rasterization on GPUs Samuli Laine Tero Karras NVIDIA Research

Graphics and Programmability Graphics pipeline (OpenGL/D3D) Driven by dedicated hardware Executes user code in shaders What’s next in programmability?

Programmable Graphics Pipeline But what does it mean? Another API? More programmable stages? Coupling between CUDA/OpenCL and OpenGL/D3D? Or ”it’s just a program”?

Our Approach Try and implement a full pixel pipeline using CUDA From triangle setup to ROP Obey fundamental requirements of gfx pipe Maintain input order Hole-free rasterizer with correct rasterization rules Prefer speed over features

Project Goals Establish a firm data point through careful experiment Provide fully programmable platform for exploring algorithms that extend the hardware gfx pipe Ideas for future hardware Ultimate goal = flexibility of software, performance of fixed-function hardware Programmable ROP Stochastic rasterization Non-linear rasterization Non-quad derivatives Quad merging Decoupled sampling Compact after discard etc.

Previous Work: FreePipe [Liu et al. 2010] Very simple rasterization pipeline Each triangle processed by one thread Blit pixels directly to DRAM using atomics Limitations Cannot retain inputs order Limited support for ROP operations (dictated by atomics) Large triangles  game over We are 15x faster on average Larger difference for high resolutions

Design Considerations Run everything in parallel We need a lot of threads to fill the machine Minimize amount of synchronization Avoid excessive use of atomics Focus on load balancing Graphics workloads are wild

Pipeline Structure Chunker-style pipeline with four stages Triangle setup  Bin raster  Coarse raster  Fine raster Run data in large batches Separate kernel launch for each stage Keep data in input order all the time

Chunking to Bins and Tiles Frame buffer Bin 16x16 tiles 128x128 px Tile 8x8 px Pixel

Pipeline Stages Quick look at each stage More details in the paper

Triangle Setup Triangle Setup Vertex buffer positions, attributes Index buffer . . . Triangle Setup edge eqs. u/v pleqs zmin etc. Triangle data buffer . . .

Triangle Setup Fetch vertex indices and positions Clip if necessary (has guardband) Frustum, backface and between-samples cull Setup screen-space pleqs for u/w, v/w, z/w, 1/w Compute zmin for early depth cull in fine raster One-to-one mapping between input and output Trivial to employ full chip while preserving ordering

IDs of triangles that overlap bin Bin Raster Triangle data buffer . . . Bin Raster SM 0 Bin Raster SM 1 Bin Raster SM 14 . . . IDs of triangles that overlap bin

Bin Raster, First Phase Pick triangle batch (atomic, 16k tris) Read 512 set-up triangles Compact/expand according to culling/clipping results Efficient within thread block Repeat until has enough triangles to utilize all threads

Bin Raster, Second Phase Rasterize Determine bin coverage for each triangle (1 thread per tri) Fast paths for 2x2 and smaller footprints Output Output to per-SM queue  no sync between SMs

IDs of triangles that overlap tile Coarse Raster . . . Coarse Raster SM n One coarse raster SM has exclusive access to the bin it’s processing IDs of triangles that overlap tile

Coarse Raster Input Phase Rasterize Output Merge from 15 input queues (one per bin SM) Continue until enough triangles to utilize all threads Rasterize Determine tile coverage for each triangle (1 thread per tri) Fast paths for small / largely overlapping triangles Output Highly varying number of output items  divide evenly to threads Only one SM outputs to tiles of any given bin  no sync needed

IDs of triangles that overlap tile Fine Raster IDs of triangles that overlap tile Fine Raster warp n One fine raster warp has exclusive access to the tile it’s processing Read tile once from DRAM to shared Write tile once to DRAM Pixel data in FB

Fine Raster Pick tile, read FB tile, process, write FB tile Input Read 32 triangles in shared memory Early z kill based on triangle zmin and tile zmax Calculate pixel coverage using LUTs (153 instr. for 8x8 stamp) Repeat until has at least 32 fragments Raster Process one fragment per thread, full utilization Shade and ROP

Tidbit 1: Coverage Calculation Step along edge (Bresenham-like) Use look-up tables to generate coverage masks ~50 instructions for 8x8 stamp, one edge

Tidbit 2: Fragment Distribution Input Phase Shading Phase In input phase, calculate coverage and store in list In shading phase, detect triangle changes and calculate triangle index and fragment in triangle

Test Scenes Call of Juarez scene courtesy of Techland S.T.A.L.K.E.R.: Call of Pripyat scene courtesy of GSC Game World

Frame rendering time in ms (depth test + color, no MSAA, no blending) Results: Performance Frame rendering time in ms (depth test + color, no MSAA, no blending)

Results: Memory Usage San Miguel Juarez Stalker City Buddha Scene data 189 24 11 51 29 Triangle setup data 420.0 42.2 26.9 67.9 84.0 Bin queues 4.0 1.5 1.2 0.9 2.0 Tile queues 4.4 2.9 2.2 Memory usage in MB

Comparison to Hardware (1/3) – Resolution Cannot match hardware in raster, z kill + compact Currently support max 2K x 2K frame buffer, 4 subpixel bits – Attributes Fetched when used  bad latency hiding Expensive interpolation – Antialiasing Hardware nearly oblivious to MSAA, we much less so

Comparison to Hardware (2/3) – Memory usage, buffering through DRAM Performance implications of reduced buffering unknown Streaming through on-chip memory would be much better + Shader complexity Shader performance theoretically the same as in graphics pipe + Frame buffer bandwidth Each pixel touched only once in DRAM

Comparison to Hardware (3/3) + Extensibility Need one stage to do something extra? Need a new stage altogether? You can actually implement it + Specialization to individual applications Rip out what you don’t need, hard-code what you can

Exploration Potential Shader performance boosters Compact after discard, quad merging, decoupled sampling, … Things to do with programmable ROP A-buffering, order-independent transparency, … Stochastic rasterization Non-linear rasterization (Your idea here)

The Code is Out There The entire codebase is open-sourced and released http://code.google.com/p/cudaraster/

Thank You Questions