Real-Time Reyes-Style Adaptive Surface Subdivision

Presentation transcript:

Real-Time Reyes-Style Adaptive Surface Subdivision. Anjul Patney, John Owens. University of California, Davis

Offline Rendering: looks realistic; virtually no visible artifacts; renders on clusters of CPUs; slow (hours per frame); flexible rendering pipelines. Images courtesy: Pixar

Real-Time Rendering: looks good enough; minor artifacts are OK; renders on commodity GPUs; fast (60+ frames per second); restrictive rendering pipeline. Images courtesy: www.ign.com

There is a gap. Geometric complexity: from polygon meshes to smooth surfaces. Shading complexity: from HLSL to RenderMan shaders. Special effects: motion blur, depth of field. Other complex effects: global illumination, subsurface scattering, ambient occlusion.

But commodity GPUs… [Diagram labels: Vertex Processors, Fragment, Composite, Host, Rasterization, Primitive Setup, Fragment Crossbar, Framebuffer. Ref: NVIDIA GeForce 6 Architecture]

… have changed a lot. [Diagram labels: Host, Thread Issue, Setup / Rasterize, General-Purpose Processors, Thread Processor, Framebuffer. Ref: NVIDIA G80 Architecture]

How does this affect things? Increased programmability: arbitrary computation, dynamic memory management, irregular data structures. Flexible rendering: compute for graphics; offline quality in real time? But we must be careful.

Real-Time Reyes-Style Adaptive Surface Subdivision Anjul Patney and John D. Owens SIGGRAPH Asia 2008 http://graphics.idav.ucdavis.edu/publications/print_pub?pub_id=952

Outline: Motivation; Reyes Subdivision, the algorithm (challenges, parallel formulation); Subdivision on GPU, the implementation (issues, solutions); Results.

Motivation: Polygon-based rendering is insufficient (undesirable artifacts, especially along silhouettes; complicated model representation; model resolution is view-independent). Can we expect performance from irregular computation on GPUs? Can GPUs support completely new pipelines?

Image courtesy: www.ign.com

Enter Reyes: industry standard in high-quality rendering; forms the architecture beneath RenderMan. Pipeline: higher-order surfaces -> Bound, Cull and Split -> split surfaces -> Dice (tessellate) -> micropolygon grids -> Shade -> shaded grids -> Sample and Filter -> final pixels. Pipeline features: input is parametric surfaces; rendering primitive is 0.5 x 0.5 pixel micropolygons; adaptive tessellation; per-micropolygon programmable shading; stochastic sampling.

Reyes Subdivision [flowchart]: Bound and Cull -> Diceable? If yes, Dice. If no, Split and send the pieces back to Bound and Cull.
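
A minimal serial sketch of the loop in this flowchart, not the authors' code. It assumes each primitive can be stood in for by its screen-space bounding box; `Box`, `SCREEN`, and `MAX_DICE_SIZE` are hypothetical names and values chosen for illustration, and a real implementation would bound and split the parametric patch itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for a primitive: its screen-space bounding box."""
    x0: float
    y0: float
    x1: float
    y1: float


SCREEN = Box(0.0, 0.0, 640.0, 480.0)  # assumed viewport
MAX_DICE_SIZE = 16.0                  # assumed diceable threshold, in pixels


def visible(b: Box) -> bool:
    """Bound and cull: keep a box only if it overlaps the viewport."""
    return (b.x1 > SCREEN.x0 and b.x0 < SCREEN.x1 and
            b.y1 > SCREEN.y0 and b.y0 < SCREEN.y1)


def diceable(b: Box) -> bool:
    """A box is diceable once both of its sides fit under the threshold."""
    return (b.x1 - b.x0) <= MAX_DICE_SIZE and (b.y1 - b.y0) <= MAX_DICE_SIZE


def split(b: Box):
    """Halve the box along its longer axis."""
    if (b.x1 - b.x0) >= (b.y1 - b.y0):
        xm = 0.5 * (b.x0 + b.x1)
        return Box(b.x0, b.y0, xm, b.y1), Box(xm, b.y0, b.x1, b.y1)
    ym = 0.5 * (b.y0 + b.y1)
    return Box(b.x0, b.y0, b.x1, ym), Box(b.x0, ym, b.x1, b.y1)


def subdivide(b: Box, out: list) -> None:
    """Depth-first recursion: cull, or collect for dicing, or split and recurse."""
    if not visible(b):
        return
    if diceable(b):
        out.append(b)
        return
    for child in split(b):
        subdivide(child, out)


grids = []
subdivide(Box(-64.0, 0.0, 64.0, 64.0), grids)
print(len(grids))
```

The input box straddles the left screen edge, so its off-screen half is culled after the first split and only the on-screen half is refined down to diceable pieces.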

What is bad? Depth-first subdivision is recursive! The list of primitives is not static: cull and split may destroy or generate primitives. Unbounded memory: dicing produces a huge number of micropolygons. (Speaker note: emphasize a real-time implementation.)

Can we do this in parallel? A lot of independent operations (our simplest model: 5k primitives, 1.2M micropolygons), so the workload is massively parallel. Regular computation (bound/split/dice all primitives together), so it is SPMD friendly. (Speaker note: emphasize the importance of moving geometry processing to the GPU, with no unnecessary transfers.)

Parallel Reyes Subdivision [flowchart]: the same Bound and Cull -> Diceable? -> Dice (yes) / Split (no) loop, with all primitives passing through each stage together.
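
A hedged sketch of the breadth-first formulation: every primitive in the current queue is classified in one pass, and the split children form the queue for the next pass. It runs serially on the CPU and uses screen-space boxes `(x0, y0, x1, y1)` as stand-in primitives; `SCREEN_W`, `SCREEN_H`, and `MAX_DICE_SIZE` are assumed values, not taken from the talk.

```python
SCREEN_W, SCREEN_H = 640.0, 480.0  # assumed viewport
MAX_DICE_SIZE = 16.0               # assumed diceable threshold, in pixels


def classify(b):
    """Return the action for one primitive: 'cull', 'dice', or 'split'."""
    x0, y0, x1, y1 = b
    if x1 <= 0.0 or x0 >= SCREEN_W or y1 <= 0.0 or y0 >= SCREEN_H:
        return "cull"
    if x1 - x0 <= MAX_DICE_SIZE and y1 - y0 <= MAX_DICE_SIZE:
        return "dice"
    return "split"


def split(b):
    """Halve the box along its longer axis."""
    x0, y0, x1, y1 = b
    if x1 - x0 >= y1 - y0:
        xm = 0.5 * (x0 + x1)
        return (x0, y0, xm, y1), (xm, y0, x1, y1)
    ym = 0.5 * (y0 + y1)
    return (x0, y0, x1, ym), (x0, ym, x1, y1)


def subdivide_breadth_first(primitives):
    """Process the whole queue once per pass until no primitive needs splitting."""
    queue, diceable, passes = list(primitives), [], 0
    while queue:
        # On a GPU, every element of this pass could be handled by its own threads.
        actions = [classify(b) for b in queue]
        next_queue = []
        for b, action in zip(queue, actions):
            if action == "dice":
                diceable.append(b)
            elif action == "split":
                next_queue.extend(split(b))
            # "cull": the primitive is simply dropped
        queue = next_queue
        passes += 1
    return diceable, passes


grids, passes = subdivide_breadth_first([(-64.0, 0.0, 64.0, 64.0)])
print(len(grids), passes)
```

The number of passes equals the depth of the deepest split chain plus the final pass that finds everything diceable, which is why the recursion disappears.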

Analogy: A Dynamic Work Queue. [Diagram: input queue A B C D E F G H I; output queue A A B C C E E F H; per-element actions: Cull, Split, No Action.] (Speaker note: explain here why the work queue.)

How can we do these efficiently? Creating new primitives: how to dynamically allocate space? Culling unneeded primitives: how to avoid fragmentation?

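Our Choice: keep it simple… A child primitive is offset by the queue length. [Diagram: queue A B C D E F G H I; after the pass, A B C E F H remain at their own positions and the children A C E sit one queue length further on.]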
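
One way to read "offset by the queue length", sketched serially: with `n` primitives in the queue, the output has `2 * n` slots, the first child (or an unsplit survivor) stays in slot `i`, and the second child goes to slot `i + n`, so no thread needs to allocate space dynamically. The action pattern below is an interpretation of the slide's diagram, and `split_pass` and the string-labelled children are hypothetical.

```python
def split_pass(queue, actions, split):
    """One pass: slot i keeps the survivor, slot i + n receives the second child."""
    n = len(queue)
    out = [None] * (2 * n)  # None marks a hole
    for i, (prim, action) in enumerate(zip(queue, actions)):
        if action == "keep":
            out[i] = prim
        elif action == "split":
            first, second = split(prim)
            out[i] = first
            out[i + n] = second
        # "cull": both slots stay empty
    return out


queue = list("ABCDEFGHI")
# Assumed reading of the diagram: A, C, E split; D, G, I are culled; the rest are kept.
actions = ["split", "keep", "split", "cull", "split",
           "keep", "cull", "keep", "cull"]
sparse = split_pass(queue, actions, lambda p: (p + "0", p + "1"))
print(sparse)
```

Every write index depends only on `i` and `n`, so the writes of different primitives never collide; the price is a sparse output that still contains holes.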

…and get rid of the holes later. The work queue stays contiguous. Scan-based compaction is fast! (Sengupta '07) [Diagram: the sparse queue A B C E F H … A C E is compacted to A B C E F H A C E.]
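
A small sketch of scan-based compaction under simple assumptions: holes are marked with `None`, a flag array records which slots are occupied, and an exclusive prefix sum over the flags gives each surviving element its destination index. The scan here is a serial loop standing in for a parallel scan primitive; `exclusive_scan` and `compact` are illustrative names.

```python
def exclusive_scan(flags):
    """Exclusive prefix sum: out[i] = sum(flags[:i])."""
    out, running = [], 0
    for f in flags:
        out.append(running)
        running += f
    return out


def compact(sparse):
    """Move all non-hole elements to the front, preserving their order."""
    flags = [0 if item is None else 1 for item in sparse]
    dest = exclusive_scan(flags)
    total = dest[-1] + flags[-1] if sparse else 0
    dense = [None] * total
    for i, item in enumerate(sparse):
        if flags[i]:
            dense[dest[i]] = item  # scatter; each destination is unique
    return dense


sparse = ["A", "B", "C", None, "E", "F", None, "H", None,
          "A", None, "C", None, "E", None, None, None, None]
dense = compact(sparse)
print(dense)
```

Because each destination index is unique, the scatter step has no write conflicts, which is what makes the approach suitable for one thread per slot.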

Platform: NVIDIA GeForce 8800 GTX (16 SMs, each with 32-wide effective SIMD; 16 KB shared memory per SM; 768 MB total GPU memory, no cache). NVIDIA CUDA 1.1 (Grid/Block/Thread programming model; OpenGL interface through shared buffers).

Implementation Details: Input primitives are bicubic Bézier surfaces (the choice of primitive only affects implementation). View-dependent subdivision every frame (CPU-GPU input transfer only once; suitable for animating control points). Final micropolygons are sent to OpenGL as a VBO (flat-shaded and displayed for preview).

Kernels Implemented: Dice (regular, symmetric computation on a highly parallel workload; 256 threads per primitive; primitive information in shared memory). Bound/Split (non-trivial to ensure efficiency in implementation).
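
To make the Dice step concrete, here is a hedged serial sketch that evaluates a bicubic Bézier patch on a uniform grid. It assumes a 16 x 16 grid of sample points, one per thread of the 256 mentioned on the slide; that grid size is an inference, and `bernstein3`, `eval_patch`, and `dice` are illustrative names rather than the authors' kernels.

```python
def bernstein3(t):
    """Cubic Bernstein basis values at parameter t."""
    s = 1.0 - t
    return (s * s * s, 3.0 * s * s * t, 3.0 * s * t * t, t * t * t)


def eval_patch(ctrl, u, v):
    """Evaluate a bicubic Bezier patch; ctrl is a 4x4 grid of (x, y, z) points."""
    bu, bv = bernstein3(u), bernstein3(v)
    point = [0.0, 0.0, 0.0]
    for i in range(4):
        for j in range(4):
            w = bu[i] * bv[j]
            for k in range(3):
                point[k] += w * ctrl[i][j][k]
    return tuple(point)


def dice(ctrl, n=16):
    """Sample the patch on a uniform n x n parameter grid (256 points for n = 16)."""
    return [[eval_patch(ctrl, i / (n - 1), j / (n - 1)) for j in range(n)]
            for i in range(n)]


# Hypothetical flat test patch: control point (i, j) sits at (i, j, 0).
flat = [[(float(i), float(j), 0.0) for j in range(4)] for i in range(4)]
grid = dice(flat)
print(grid[0][0], grid[15][15])
```

Each grid point depends only on the 16 control points and its own `(u, v)`, which matches the slide's description of dicing as regular, symmetric computation.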

Bound/Split: Efficiency Goals. Memory coherence: off-chip memory accesses must be efficient. Computational efficiency: hardware SIMD must be maximally utilized.

Memory Coherence: After each iteration the work queue is compacted, so primitives are always contiguous in memory. Structure-of-arrays representation: attributes across primitives are adjacent in memory. 99.5% of all accesses were fully coalesced.
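
A short illustration of why the structure-of-arrays layout helps, using plain index arithmetic. It compares the addresses touched when neighbouring threads each read the same attribute of their own primitive; the attribute count of 48 (16 control points times 3 floats) is an assumption for the example.

```python
def aos_index(prim, attr, num_attrs):
    """Array-of-structures: all attributes of one primitive are adjacent."""
    return prim * num_attrs + attr


def soa_index(prim, attr, num_prims):
    """Structure-of-arrays: one attribute of all primitives is adjacent."""
    return attr * num_prims + prim


NUM_PRIMS, NUM_ATTRS = 8, 48  # assumed: 16 control points x 3 floats per patch

# Addresses touched when 8 neighbouring threads each read attribute 5
# of their own primitive.
aos = [aos_index(p, 5, NUM_ATTRS) for p in range(NUM_PRIMS)]
soa = [soa_index(p, 5, NUM_PRIMS) for p in range(NUM_PRIMS)]


def stride(addresses):
    """Set of gaps between consecutive addresses."""
    return {b - a for a, b in zip(addresses, addresses[1:])}


print(stride(aos), stride(soa))
```

With the structure-of-arrays layout the threads touch consecutive addresses, which is the access pattern that hardware can combine into a single coalesced transaction.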

SIMD Utilization: Intra-primitive parallelism (vectorized Bound/Split with 16 threads per primitive; a primitive's control points are mostly independent; shared memory is used for communication). Execution path divergence is negligible: 90.16% of all branches were SIMD coherent. [Diagram: 16 threads per primitive, 32 threads (1 warp).]
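
A sketch of the thread mapping these numbers imply, assuming one thread per control point of a 4 x 4 patch. The flat thread index is split into a primitive index and a control-point index, and the bound is written as a serial min/max standing in for a reduction that cooperating threads could perform through shared memory. `thread_assignment` and `bound` are hypothetical helpers.

```python
WARP_SIZE = 32
THREADS_PER_PRIM = 16  # one thread per control point of a 4x4 patch


def thread_assignment(tid):
    """Map a flat thread index to (primitive, control point)."""
    return tid // THREADS_PER_PRIM, tid % THREADS_PER_PRIM


def bound(points):
    """Axis-aligned bounds of one primitive's 16 projected control points.

    Serial stand-in for a min/max reduction across 16 cooperating threads.
    """
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)


prims_in_warp = {thread_assignment(t)[0] for t in range(WARP_SIZE)}
patch = [(float(i % 4), float(i // 4)) for i in range(16)]
print(sorted(prims_in_warp), bound(patch))
```

Under this mapping a warp of 32 threads works on two primitives at once, and all 16 threads of a primitive follow the same branch when its action is decided.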

Results - Killeroo: 11532 patches -> 14426 grids; 5 levels of subdivision; Bound/Split: 6.99 ms; Dice: 7.21 ms; 4.2 frames per second (subdivision-only: 70). Killeroo NURBS model courtesy headus 3D tools: http://headus.com.au

Results - Teapot: 32 patches -> 4823 grids; 11 levels of subdivision; Bound/Split: 3.46 ms; Dice: 2.42 ms; 12.4 frames per second (subdivision-only: 170).
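
The subdivision-only frame rates can be checked from the kernel timings on the two result slides. This assumes that "subdivision-only" means a frame consisting of just the Bound/Split and Dice times added together, which is an interpretation and not stated in the talk.

```python
results_ms = {
    "Killeroo": {"bound_split": 6.99, "dice": 7.21},
    "Teapot": {"bound_split": 3.46, "dice": 2.42},
}


def subdivision_fps(times):
    """Frames per second if a frame consisted of Bound/Split plus Dice only."""
    return 1000.0 / (times["bound_split"] + times["dice"])


fps = {name: subdivision_fps(t) for name, t in results_ms.items()}
for name, value in fps.items():
    print(name, round(value, 1))
```

Both values round to the figures quoted in parentheses on the slides, so the gap to the full frame rates (4.2 and 12.4) lies outside the subdivision kernels.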

Results - Random Models: Subdivision time is proportional to the number of micropolygons.

Overheads: GPUs are inefficient at rendering lots of small triangles. [Chart annotation: should ideally be zero; this is an acknowledged CUDA limitation.]

Storage Issues: The Reyes pipeline suffers from unbounded memory demand (a huge number of micropolygons are generated; transparency and blending preclude early rejection). Most implementations use screen-space buckets. But how does this work in parallel? Large buckets present a more parallel workload; small buckets have a smaller memory footprint.
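
A minimal sketch of screen-space bucketing, assuming primitives are represented by screen-space boxes `(x0, y0, x1, y1)` and buckets are square tiles. Each primitive is listed in every tile its box overlaps; `bucket_primitives` and the sample boxes are hypothetical.

```python
from collections import defaultdict


def bucket_primitives(boxes, bucket_size):
    """Assign each screen-space box to every bucket it overlaps (conservatively)."""
    buckets = defaultdict(list)
    for index, (x0, y0, x1, y1) in enumerate(boxes):
        for by in range(int(y0 // bucket_size), int(y1 // bucket_size) + 1):
            for bx in range(int(x0 // bucket_size), int(x1 // bucket_size) + 1):
                buckets[(bx, by)].append(index)
    return dict(buckets)


boxes = [(10.0, 10.0, 20.0, 20.0),
         (60.0, 10.0, 70.0, 20.0),
         (100.0, 100.0, 110.0, 110.0)]
small = bucket_primitives(boxes, 64)
large = bucket_primitives(boxes, 256)
print(small)
print(large)
```

With 64-pixel buckets the three boxes spread over three tiles holding at most two primitives each; with 256-pixel buckets they all land in one tile, which gives more work to process together but a larger working set.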

Screen-Space Buckets Acceptable performance! Small memory footprint!

Limitations: All primitives must be split before dicing; cracks / pinholes; uniform dicing is wasteful. (Speaker note: why is each of these limitations important?)

Conclusions: Recursive subdivision maps well to current GPUs, and works fast! It is advantageous to use smooth primitives in interactive rendering. Fixed-function tessellation can be emulated: dicing is already very fast (2 Mgrids/sec == 512M micropolygons/sec). It's time to experiment with alternate pipelines.
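
The dicing throughput can be cross-checked against the grid counts and Dice timings reported on the Killeroo and Teapot result slides. The micropolygons-per-grid factor is derived only from the slide's own equality; the variable names are illustrative.

```python
dice_results = {
    "Killeroo": {"grids": 14426, "dice_ms": 7.21},
    "Teapot": {"grids": 4823, "dice_ms": 2.42},
}


def grids_per_second(entry):
    """Grids diced per second, from a grid count and a Dice time in milliseconds."""
    return entry["grids"] / (entry["dice_ms"] / 1000.0)


# 512M micropolygons/sec divided by 2 Mgrids/sec, as stated on the slide.
MICROPOLYGONS_PER_GRID = 512_000_000 // 2_000_000

rates = {name: grids_per_second(e) for name, e in dice_results.items()}
for name, rate in rates.items():
    print(name, round(rate / 1e6, 2), "Mgrids/sec")
print(MICROPOLYGONS_PER_GRID, "micropolygons per grid")
```

Both models come out within one percent of 2 Mgrids/sec, and the stated equality corresponds to 256 micropolygons per grid.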

Future Work: Crack filling (add dummy polygons during post-processing). More of Reyes: displacement mapping; offline-quality shading on GPUs; parallel stochastic sampling (Wei '08); A-buffer.

Thanks to: Feedback and suggestions from Per Christensen, Charles Loop, Dave Luebke, Matt Pharr, and Daniel Wexler. Financial support from DOE Early Career PI Award, National Science Foundation, and SciDAC Institute for Ultrascale Visualization. Equipment support from NVIDIA.

Real-Time Reyes-Style Adaptive Surface Subdivision Backup Slides

CUDA Thread Structure Image courtesy: NVIDIA CUDA Programming Guide, 1.1

CUDA Memory Architecture Image courtesy: NVIDIA CUDA Programming Guide, 1.1