Two-Level Grids for Ray Tracing on GPUs

Slides:



Advertisements
Similar presentations
Sven Woop Computer Graphics Lab Saarland University
Advertisements

Christian Lauterbach COMP 770, 2/16/2009. Overview  Acceleration structures  Spatial hierarchies  Object hierarchies  Interactive Ray Tracing techniques.
Physically Based Real-time Ray Tracing Ryan Overbeck.
Ray Tracing Ray Tracing 1 Basic algorithm Overview of pbrt Ray-surface intersection (triangles, …) Ray Tracing 2 Brute force: Acceleration data structures.
IIIT Hyderabad Hybrid Ray Tracing and Path Tracing of Bezier Surfaces using a mixed hierarchy Rohit Nigam, P. J. Narayanan CVIT, IIIT Hyderabad, Hyderabad,
Ray Tracing CMSC 635. Basic idea How many intersections?  Pixels  ~10 3 to ~10 7  Rays per Pixel  1 to ~10  Primitives  ~10 to ~10 7  Every ray.
A Coherent Grid Traversal Algorithm for Volume Rendering Ioannis Makris Supervisors: Philipp Slusallek*, Céline Loscos *Computer Graphics Lab, Universität.
Latency considerations of depth-first GPU ray tracing
Real-Time Reyes: Programmable Pipelines and Research Challenges Anjul Patney University of California, Davis.
RT06 conferenceVlastimil Havran On the Fast Construction of Spatial Hierarchies for Ray Tracing Vlastimil Havran 1,2 Robert Herzog 1 Hans-Peter Seidel.
Cost-based Workload Balancing for Ray Tracing on a Heterogeneous Platform Mario Rincón-Nigro PhD Showcase Feb 17 th, 2012.
Experiences with Streaming Construction of SAH KD Trees Stefan Popov, Johannes Günther, Hans-Peter Seidel, Philipp Slusallek.
Afrigraph 2004 Interactive Ray-Tracing of Free-Form Surfaces Carsten Benthin Ingo Wald Philipp Slusallek Computer Graphics Lab Saarland University, Germany.
Rasterization and Ray Tracing in Real-Time Applications (Games) Andrew Graff.
RT 08 Efficient Clustered BVH Update Algorithm for Highly-Dynamic Models Symposium on Interactive Ray Tracing 2008 Los Angeles, California Kirill Garanzha.
GH05 KD-Tree Acceleration Structures for a GPU Raytracer Tim Foley, Jeremy Sugerman Stanford University.
Specialized Acceleration Structures for Ray-Tracing Warren Hunt Bill Mark.
Ray Tracing Dynamic Scenes using Selective Restructuring Sung-eui Yoon Sean Curtis Dinesh Manocha Univ. of North Carolina at Chapel Hill Lawrence Livermore.
All-Pairs-Shortest-Paths for Large Graphs on the GPU Gary J Katz 1,2, Joe Kider 1 1 University of Pennsylvania 2 Lockheed Martin IS&GS.
Interactive k-D Tree GPU Raytracing Daniel Reiter Horn, Jeremy Sugerman, Mike Houston and Pat Hanrahan.
Real-Time Ray Tracing 3D Modeling of the Future Marissa Hollingsworth Spring 2009.
RT08, August ‘08 Large Ray Packets for Real-time Whitted Ray Tracing Ryan Overbeck Columbia University Ravi Ramamoorthi Columbia University William R.
RAY TRACING ON GPU By: Nitish Jain. Introduction Ray Tracing is one of the most researched fields in Computer Graphics A great technique to produce optical.
Computer Graphics 2 Lecture x: Acceleration Techniques for Ray-Tracing Benjamin Mora 1 University of Wales Swansea Dr. Benjamin Mora.
Interactive Ray Tracing: From bad joke to old news David Luebke University of Virginia.
Ray Tracing and Photon Mapping on GPUs Tim PurcellStanford / NVIDIA.
Interactive Volume Visualization of General Polyhedral Grids Philipp Muigg 1,3, Markus Hadwiger 2, Helmut Doleisch 3, M. Eduard Gröller 1 1, Vienna University.
1 Speeding Up Ray Tracing Images from Virtual Light Field Project ©Slides Anthony Steed 1999 & Mel Slater 2004.
GPUs and Accelerators Jonathan Coens Lawrence Tan Yanlin Li.
Chris Kerkhoff Matthew Sullivan 10/16/2009.  Shaders are simple programs that describe the traits of either a vertex or a pixel.  Shaders replace a.
Stefan PopovHigh Performance GPU Ray Tracing Real-time Ray Tracing on GPU with BVH-based Packet Traversal Stefan Popov, Johannes Günther, Hans- Peter Seidel,
Gregory Fotiades.  Global illumination techniques are highly desirable for realistic interaction due to their high level of accuracy and photorealism.
On a Few Ray Tracing like Algorithms and Structures. -Ravi Prakash Kammaje -Swansea University.
GPU Broad Phase Collision Detection GPU Graphics Gary J. Katz University of Pennsylvania CIS 665 Adapted from articles taken from GPU Gems III.
Saarland University, Germany B-KD Trees for Hardware Accelerated Ray Tracing of Dynamic Scenes Sven Woop Gerd Marmitt Philipp Slusallek.
Understanding the Efficiency of Ray Traversal on GPUs Timo Aila Samuli Laine NVIDIA Research.
GPU Computation Strategies & Tricks Ian Buck NVIDIA.
Interactive Rendering With Coherent Ray Tracing Eurogaphics 2001 Wald, Slusallek, Benthin, Wagner Comp 238, UNC-CH, September 10, 2001 Joshua Stough.
Fast BVH Construction on GPUs (Eurographics 2009) Park, Soonchan KAIST (Korea Advanced Institute of Science and Technology)
Department of Computer Science 1 Beyond CUDA/GPUs and Future Graphics Architectures Karu Sankaralingam University of Wisconsin-Madison Adapted from “Toward.
Hierarchical Penumbra Casting Samuli Laine Timo Aila Helsinki University of Technology Hybrid Graphics, Ltd.
Compiler and Runtime Support for Enabling Generalized Reduction Computations on Heterogeneous Parallel Configurations Vignesh Ravi, Wenjing Ma, David Chiu.
Dynamic Scenes Paul Arthur Navrátil ParallelismJustIsn’tEnough.
Ray Tracing Animated Scenes using Motion Decomposition Johannes Günther, Heiko Friedrich, Ingo Wald, Hans-Peter Seidel, and Philipp Slusallek.
Interactive Ray Tracing of Dynamic Scenes Tomáš DAVIDOVIČ Czech Technical University.
Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation Ping Xiang, Yi Yang, Huiyang Zhou 1 The 20th IEEE International Symposium On High.
Maximizing Parallelism in the Construction of BVHs, Octrees, and k-d Trees Tero Karras NVIDIA Research.
Compact, Fast and Robust Grids for Ray Tracing Ares Lagae & Philip Dutré 19 th Eurographics Symposium on Rendering EGSR 2008Wednesday, June 25th.
Compact, Fast and Robust Grids for Ray Tracing
Ray Tracing by GPU Ming Ouhyoung. Outline Introduction Graphics Hardware Streaming Ray Tracing Discussion.
David Luebke 3/5/2016 Advanced Computer Graphics Lecture 4: Faster Ray Tracing David Luebke
Ray Tracing Acceleration (5). Ray Tracing Acceleration Techniques Too Slow! Uniform grids Spatial hierarchies K-D Octtree BSP Hierarchical grids Hierarchical.
Path/Ray Tracing Examples. Path/Ray Tracing Rendering algorithms that trace photon rays Trace from eye – Where does this photon come from? Trace from.
Ray Tracing Acceleration (1). Acceleration of Ray Tracing Goal: Reduce the number of ray/primitive intersections.
IIIT, Hyderabad Ray Tracing Dynamic Scenes on the GPU Sashidhar Guntury Centre for Visual Information Technology IIIT, Hyderabad. India.
CHC ++: Coherent Hierarchical Culling Revisited Oliver Mattausch, Jiří Bittner, Michael Wimmer Institute of Computer Graphics and Algorithms Vienna University.
1 The Method of Precomputing Triangle Clusters for Quick BVH Builder and Accelerated Ray Tracing Kirill Garanzha Department of Software for Computers Bauman.
Fabianowski · DinglianaInteractive Global Photon Mapping1 / 22 Interactive Global Photon Mapping Bartosz Fabianowski · John Dingliana Trinity College Dublin.
Reza Yazdani Albert Segura José-María Arnau Antonio González
Hybrid Ray Tracing and Path Tracing of Bezier Surfaces using a mixed hierarchy Rohit Nigam, P. J. Narayanan CVIT, IIIT Hyderabad, Hyderabad, India.
Deep Partitioned Shadow Volumes Using Stackless and Hybrid Traversals
Two-Level Grids for Ray Tracing on GPUs
Real-Time Ray Tracing Stefan Popov.
Hybrid Ray Tracing of Massive Models
Christian Lauterbach GPGPU presentation 3/5/2007
Using Packet Information for Efficient Communication in NoCs
Shared Memory and Distributed Memory Parallel Ray Tracer
Degree-aware Hybrid Graph Traversal on FPGA-HMC Platform
A Comparison-FREE SORTING ALGORITHM ON CPUs
GEARS: A General and Efficient Algorithm for Rendering Shadows
Presentation transcript:

Two-Level Grids for Ray Tracing on GPUs Hello everyone, my name is Javor, and I will talk about grids, ray tracing and GPUs. Javor Kalojanov, Markus Billeter, Philipp Slusallek

Motivation Real time ray tracing of dynamic scenes While ray tracing performance today is fast enough for interactive walkthroughs of static scenes, there are hardly any applications that handle dynamic scenes, not to mention dynamic scenes with arbitrary motion such as explosions for instance.

While ray tracing performance today is fast enough for interactive walkthroughs of static scenes, there are hardly any applications that handle dynamic scenes, not to mention dynamic scenes with arbitrary motion such as explosions for instance.

Problems (Acceleration Structures) Time to image Build time Traversal performance High quality structures Slow construction Uniform Grids, LBVH Slow traversal Handling dynamic scenes for ray tracing is difficult, because high-performance ray tracers rely on high quality acceleration structures that are built offline. Other acceleration structures like uniform grids or linear BVHs offer very fast build times and indeed good time to image, but compared to the structures used for static scenes, they offer poor acceleration and limit the amount of rays you can shoot in each frame.

The problem with uniform grids is that they cannot handle non-uniform primitive distributions in the scene – the so-called teapot in stadium problem. This image visualizes the amount of intersection tests performed per ray for a test scene. It is obvious that there are large parts of the scene where the uniform grid fails to eliminate intersection candidates, which is the purpose of every acceleration structure.

Two-Level (Hierarchical) Grids Work around uniform grids shortcomings C B A D We wanted to work around this problems of the uniform grids, by making as little modifications to the uniform grid as possible. A natural choice was to introduce another level of hierarchical subdivision. The two level grid is a uniform grid and each cell of this grid is a uniform grid of an arbitrary resolution.

Grid Resolution Most common heuristic: 𝑅 𝑥 𝑅 𝑦 𝑅 𝑧 = 𝑑 𝑥 𝑑 𝑦 𝑑 𝑧 3 𝜆𝑛 𝑉 𝜆 – Grid density… …which minimizes the expected cost for traversing a random ray We base our choice of grid resolution on a heuristic commonly used for uniform grids. I will not go trough the formula, but the basic idea is to have grid cells as close to cubes as possible and select a number of cells that is linear in the number of primitives. This is where the only user defined constant called “grid density” comes from. To “guess” the best grid density for our setup, we modified a small part of the construction algorithm and used it to compute a grid density that minimizes the expected cost for traversing a random ray trough the structure. This strategy (and it’s variations) are considered the best choice for other acceleration structures like kd-trees and BHVs. You can find more details on that in the paper.

Teapot in Stadium Intersection tests Uniform Grid and Two-Level Grid Going back to the teapot in the stadium problem, we can see that a teapot in the stadium scene like the Fairy Forest does not present a problem and that the two-level grids are able to eliminate significantly more intersection candidates than the uniform grids.

Two-Level Grid Construction The rest of the talk is divided into two parts. We first describe a massively parallel construction algorithm for the two-level grids, thanks to whom we not only preserve the fast build times of uniform grids, but actually improve on them.

Data structure [0,0) [0,1) [1,2) [2,5) [5,7) [7,8) [0,0) [0,1) [1,2) Top Level Cells: C B A D 17,1x1 18,1x1 19,1x1 14,1x1 16,1x1 0, 3x3 9,1x1 10,2x2 15,1x1 0, 3x3 [0,0) [0,1) [1,2) [2,5) [5,7) [7,8) Leaf Array: [0,0) [0,1) [1,2) [2,5) [5,7) [0,0) [0,1) [1,2) [2,5) [5,7) [2,5) Before we take a look at the algorithms for construction, we need to understand the data structure and its representation in memory. To use the grid as an acceleration structure for ray tracing we need to be able to tell which are the primitives inside a given cell in constant time. The first level cells store the information needed to access the leaves inside it. This is the resolution of the cell and a location inside a leaf array where the first leaf that belongs to the cell is stored. Each leaf stores data that describes which primitives does it contain. We keep all references to primitives inside a reference array and in the leaves we store the parts of this array with the corresponding references. Reference Array: 6 1 2 3 4 C B 10 5 A 7 8 9 D 11 12 2 3 4 C B A

Sort-Based Construction Spatial index structure = partial ordering GPU friendly Works for LBVHs and uniform grids Idea Pair primitives with a key Sort by the key Extract the structure from the sorted data The data organization is important because it allows to reduce the grid construction to a sorting problem. This is to some extent intuitive, because spatial index structures actually store a partial ordering of the input objects. And you want to do this, because if there is one thing GPUs are good at, that’s sorting! The approach is not new and known to work well for a type of BVHs and uniform grids. The outline of a general sort-based algorithm for construction is the following: First we pair each of the input objects with a key, we then sort these pairs and read out the acceleration structure from the sorted pairs.

Top-Level Cells (Cell ID, Prim ID) Write Pairs Radix Sort Sorted Pairs B A D Write Pairs (Cell ID, Prim ID) D 2 A C B 1 3 4 Radix Sort Here is an example. To construct the top level of our structure, we use a sort-based method for uniform grid construction. What we do is, we load the primitives in parallel and for each compute which cells does it intersect. For each primitive-cell intersection, or fragment, we write a pair of the cell index and the primitive index in a global array. We then sort this array. As a result we get an array in which references to primitives inside each cell are stored continuously and we know where the data for each individual cell is to be found. From that we can either build a uniform grid, or in our case the top level cells of the two-level grid. Note that at this point I am skipping quite a few details, that are to be found in the paper. 17,1x1 18,1x1 19,1x1 14,1x1 16,1x1 0, 3x3 9,1x1 10,2x2 15,1x1 Extract Cells Sorted Pairs D 2 A C B 1 3 4

Leaves (Naïve Version) Build top level cells consecutively Sequential ⇒ slow One thread per cell Poor work distribution ⇒ slow Top Cells 17,1x1 18,1x1 19,1x1 14,1x1 16,1x1 0, 3x3 9,1x1 10,2x2 15,1x1 D 2 A C B 1 3 4 The next step of the construction algorithm is to build the leaf level cells. There are several straight forward approaches to do that…

Leaves Same Leaf! Top Cells 0, 3x3 14,1x1 Write Pairs D 2 A C B 1 3 4 Write Pairs (Leaf ID, Prim ID) A 4 B 8 D 1 5 C 2 3 C B Same Leaf! The main contribution of our paper is an algorithm for construction of multiple uniform grids in parallel. The performance and work distribution is independent of the primitive distribution and the resolutions of the individual grids. This allows for perfect work distribution and no need for atomic operations during the build. We write a pair of a key and primitive index in an array as we did previously. This time however we can not use the leaf index as a key. If we would do that, we would have pairs with key 0 and both the dark blue triangle C and the green triangle B. This will put them in the same leaf, but they will be in leaf 0 of different top-level cells.

Which Key? Keys and leaf cells must be one-one correspondence Use the leaf positions in the array as keys [0,0) [0,1) [1,2) [2,5) [5,7) [7,8) Leaf Array: 1 2 3 4 5 6 7 8 9 The good thing is that we can solve this problem with a simple trick. The issue we are facing is that leaf indices are not in a one-one correspondence with the leafs. In general, we need a unique key to identify each leaf cell. If you recall, when describing the data layout of the structure, we mentioned that all leaves are stored in a 1D array. We exploit this fact and take the position of the leaf inside this global array as a key.

Leaves (2) Single sort for all leaves Top Cells 0, 3x3 14,1x1 D 2 A C B 1 3 4 Write Pairs (Key, Prim ID) A 4 B 8 D 11 5 C 2 9 15 12 13 14 This, not only allows us to construct the complete array of primitive references in one pass, but also simplifies the generation of the array of leaf cells to a few lookups. So the complete construction of the second level of the structure looks like that: We write a pair of a unique key and primitive index in an array and sort just as we did previously. From the sorted data we can easily compute the leaf cells and the primitive array. Note that we need a single sort for the entire second level and we parallelize over the pairs of level 1 in the first step, and the pairs of level 2 in the next steps. Radix Sort Sorted Pairs A 5 B 8 D 11 4 C 2 9 15 12 13 14 Extract Leaves [0,0) [0,1) [1,2) [8,9) [2,5) [5,7) [7,8)

Build Performance Very scalable Small memory footprint 100M pairs per second Small memory footprint The benefits from such construction strategy are the very high performance and scalability of the construction. The runtime only depends on the number or pairs we generate, and the spatial distribution of the scene primitives is irrelevant. Even if all primitives happen to be in the same leaf, the performance will not suffer. In addition the memory footprint of the structure is very low, especially compared to uniform grids, because of the low optimal density for the two-level grid.

Build Performance (2)

Two-Level Grid Traversal To summarize. So far we found optimal parameters and build strategy for the two-level grids, which by construction resolve the teapot in the stadium problem. In the next and last part of the talk we will address the ray traversal and describe how to make it SIMD-efficient.

Two-Level Grid Traversal on GPUs Uniform grid traversal For the first level For the second level Avoid code divergence Coherent memory access for coherent rays Must start in the same cell The ray traversal algorithm we used in our tests is a straight forward extension of a standard algorithm used for uniform grids. We treated the top level and each of the top-level cells as individual uniform grids. To improve the coherency of execution, we force the SIMD units of the GPU to traverse on the same level of the structure. This eliminates code divergence and slightly improves performance. The traversal algorithm favors coherent rays, because they usually require more local memory access.

Ray Coherence Memory coherence for rays starting in the same cell 1 This is an example of what happens for uniform grids. Rays that start from the same cell, will always load a cell intersected by them at the same time.

Ray Coherence Memory coherence for rays starting in the same cell 2 Even if they split at some point…

Ray Coherence Memory coherence for rays starting in the same cell 3 They rejoin at the same step of the traversal if the go trough a given cell. This is why tracing coherent rays is faster for uniform grids, and we exploit it by synchronizing the traversal of the top and bottom level of the two-level grids. This means that we finish tracing all rays in the second level, before we move to the next top-level cell.

Ray Incoherence SIMD inefficiency due to early termination Thread Activity Thread 1: active Thread 2: active Thread 3: active Thread 6: idle Thread 7: idle Thread 4: active Thread 5: idle Thread 8: idle While the advancing through the cells is completely coherent in terms of executed instructions and almost so in terms of memory access, this is not true by default during the ray-triangle intersection stage of the traversal. You can see that even for primary rays, it can happen that only part of the threads do some work during intersection. While others are idle, because they are responsible for tracing rays that are already terminated.

Secondary Rays SIMD inefficiency due to incoherence Thread Activity Thread 1: active Thread 2: active Thread 3: idle Thread 6: idle Thread 7: idle Thread 4: idle Thread 5: idle Thread 8: idle The problem is of course bigger for incoherent rays, produced for example by a standard path tracer.

Work Distribution for Incoherent Workloads Standard packet ray tracer rays R1 R2 R3 R4 R5 R6 R7 R8 intersection candidates In this particular example only two of the rays have intersection candidates. A standard packet implementation will perform intersection tests horizontally, which will result in poor SIMD utilization. And 6 consecutive execution steps.

Work Distribution for Incoherent Workloads Vertical parallelization rays R1 R2 R3 R4 R5 R6 R7 R8 intersection candidates Note that this particular case can be computed more efficiently if we were to parallelize vertically instead of horizontally.

Hybrid Intersector Detect rays with high workload Compute intersections for single ray in parallel Proceed with the remaining rays Up to 25% overall speedup Brute force path tracing No other optimizations No overhead on Fermi GPUs Based on this observation, we tried a hybrid parallelization strategy in the ray triangle intersection stage. We detected rays with high workload, and computed intersections for each of those vertically. Afterwards we proceeded with the standard packet intersector for the remaining tests. We could achieve several times faster ray-triangle intersection in the best case, but because the method introduces some overheads for shuffling data the best overall speedup in rendering times was around 25%. The good thing is that on Fermi GPUs we could exploit hardware support for functions like population count and prefix sum on warp level and eliminate the overhead. While the overall speedup was not so big – got up to 10%, when we used other optimizations, we did observe performance boost even for primary rays on more complex scenes.

Results Finally I would like to briefly discuss some of our results, you can find in the paper.

Results Model (Triangles) LBVH Grid 2lvl Grid Hybrid BVH Fairy (174K) 10.3 ms 1.8 fps 24 ms 3.5 fps 8 ms 9.2 fps 124 ms 11.6 fps Bunny/Dragon (252K) 17 ms 7.3 fps 13 ms 7.7 fps 13 ms 10.3 fps 66 ms 7.6 fps Conference (284K) 19 ms 6.7 fps 27 ms 7.0 fps 17 ms 12 fps 105 ms 22.9 fps Soda Hall (2.2M) 66 ms 3.0 fps 130 ms 6.3 fps 67 ms 12.6 fps 445 ms 20.7 fps Times for primary rays and simple shading (no shadows). Frame rate does not include build time. GeForce 280 GTX

Results (2) GeForce 280 GTX Model (Triangles) Simple Shading FPS (MRays/s) Direct Illumination Path Tracing Ogre (50K) 31 fps ( 31 ) 7.8 fps ( 39 ) 3.4 fps ( 10.2 ) Fairy (174K) 15 fps ( 15 ) 2.7 fps ( 13.5 ) 2.1 fps ( 6.3 ) Conference (284K) 22 fps ( 22 ) 3.0 fps ( 15 ) 1.6 fps ( 4.8 ) Venice (1.2M) 18 fps ( 18 ) 3.3 fps ( 16.5 ) 1.4 fps ( 4.2 ) GeForce 280 GTX

Conclusion Fast(est) rebuild from scratch Fast time to image Data parallel and scalable Fast time to image Even for “Teapot in Stadium” – scenes (Almost) no code divergence during traversal We showed how to construct the two-level grids very fast enough, to allow rebuild the entire structure every frame, and render animated scenes with arbitrary motion. Mostly because of the build time, but also because of the good ray traversal performance we are able to achieve fast render times, even for scenes with complex primitive distributions, that are known to be hard for grids. Last but not least we will take a look at how to improve rendering performance for incoherent rays, used for example when path tracing. In addition the algorithms and ideas we presented are not restricted to grids. The main concept of the construction algorithm can be used for construction of other hierarchies, and the hybrid ray-triangle intersection algorithm makes sense for every sparse acceleration structure.

Acknowledgements Anonymous Reviewers Vincent Pegoraro

Thank You!