Download presentation
Presentation is loading. Please wait.
Published byBuddy Burns Modified over 9 years ago
1
Stefan PopovHigh Performance GPU Ray Tracing Real-time Ray Tracing on GPU with BVH-based Packet Traversal Stefan Popov, Johannes Günther, Hans- Peter Seidel, Philipp Slusallek
2
Stefan PopovHigh Performance GPU Ray Tracing Background GPUs attractive for ray tracing High computational power Shading oriented architecture GPU ray tracers Carr – the ray engine Purcell – Full ray tracing on the GPU, based on grids Ernst – KD trees with parallel stack Carr, Thrane & Simonsen – BVH Foley, Horn, Popov – KD trees - stackless traversal
3
Stefan PopovHigh Performance GPU Ray Tracing Motivation So far Interactive RT on GPU, but Limited model size No dynamic scene support The G80 – new approach to the GPU High performance general purpose processor with graphics extensions PRAM architecture BVH allow for Dynamic/deformable scenes Small memory footprint Goal: Recursive ordered traversal of BVH on the G80
4
Stefan PopovHigh Performance GPU Ray Tracing GPU Architecture (G80) Multi-threaded scalar architecture 12K HW threads Threads cover latencies Off-chip memory ops Instruction dependencies 4 or 16 cycles to issue instr. 16 (multi-)cores 8-wide SIMD 128 scalar cores in total Cores process threads in 32 wide SIMD chunks … Multi-Core 1 … Thread 1 Thread 32 IP Chunk Pool … Thread 1 Thread 32 … Thread 1 Thread 32 … Thread 1 Thread` 32 … Thread 1 Thread 32 Multi-Core 16 … Thread 1 Thread 32 IP Chunk Pool … Thread 1 Thread 32 … Thread 1 Thread 32 … Thread 1 Thread` 32 … Thread 1 Thread 32
5
Stefan PopovHigh Performance GPU Ray Tracing GPU Architecture (G80) Scalar register file (8K) Partitioned among running threads Shared memory (16KB) On-chip, 0 cycle latency On-board memory (768MB) Large latency (~ 200 cycles) R/W from within thread Un-cached Read-only L2 cache (128KB) On chip, shared among all threads On-board memory Multi-Core 1 … Thread 1 Thread 32 Shared Memory Registers … Multi-Core 16 L2 Cache (128KB)
6
Stefan PopovHigh Performance GPU Ray Tracing Programming the G80 CUDA C based language with parallel extensions GPU utilization at 100% only if Enough threads are present (>> 12K) Every thread uses less than 10 registers and 5 words (32 bit) of shared memory Enough computations per transferred word of data Bandwidth << computational power Adequate memory access pattern to allow read combining
7
Stefan PopovHigh Performance GPU Ray Tracing Performance Bottlenecks Efficient per-thread stack implementation Shared memory too small – will limit parallelism On-board memory – uncached Need enough computations between stack ops Efficient memory access pattern Use texture caches However, only few words of cache / thread Read successive memory locations in successive threads of a chunk Single roundtrip to memory (read combining) Cover latency with enough computations
8
Stefan PopovHigh Performance GPU Ray Tracing Ray Tracing on the G80 Map each ray to one thread Enough threads to keep the GPU busy Recursive ray tracing Use per-thread stack stored on on-board memory Efficient, since enough computations are present But how to do the traversal ? Skip pointers (Thrane) – no ordered traversal Geometric images (Carr) – single mesh only Shared stack traversal
9
Stefan PopovHigh Performance GPU Ray Tracing SIMD Packet Traversal of BVH Traverse a node with the whole packet At an internal node: Intersect all rays with both children and determine traversal order Push far child (if any) on a stack and descend to the near one with the packet At a leaf: Intersect all rays with contained geometry Pop next node to visit from the stack
10
Stefan PopovHigh Performance GPU Ray Tracing PRAM Basics The PRAM model Implicitly synchronized processors (threads) Shared memory between all processors Basic PRAM operations Parallel OR in O(1) Parallel reduction in O(log N) false true false true false true 11 9 9 12 32 44 20 ++ + 64 20 11 9 9 9 9
11
Stefan PopovHigh Performance GPU Ray Tracing PRAM Packet Traversal of BVH The G80 – PRAM machine on chunk level Map packet chunk, ray thread Threads behave as in the single ray traversal At leaf: Intersect with geometry. Pop next node from stack At node: Decide which children to visit and in what order. Push far child Difference: How rays choose which node to visit first Might not be the one they want to
12
Stefan PopovHigh Performance GPU Ray Tracing PRAM Packet Traversal of BVH Choose child traversal order PRAM OR to determine if all rays agree on visiting the same node first The result is stored in shared memory In case of divergence: choose child with more ray candidates Use PRAM SUM on +/- 1 for each thread, -1 left node Look at result’s sign Guarantees synchronous traversal of BVH
13
Stefan PopovHigh Performance GPU Ray Tracing PRAM Packet Traversal of BVH Stack: Near & far child – the same for all threads => store once Keep stack in shared memory. Only few bits per thread! Only Thread 0 does all stack ops. Reading data: All threads work with the same node / triangle Sequential threads bring in sequential words Single load operation. Single round trip to memory Implementable in CUDA
14
Stefan PopovHigh Performance GPU Ray Tracing Results
15
Stefan PopovHigh Performance GPU Ray Tracing Analysis Coherent branch decisions / memory access Small footprint of the data structure Can trace up to 12 million triangle models Program becomes compute bound Determined by over/under-clocking the core/memory No frustums required Good for secondary rays, bad for primary Can use rasterization for primary rays Implicit SIMD – easy shader programming Running on a GPU – shading “for free”
16
Stefan PopovHigh Performance GPU Ray Tracing Dynamic Scenes Update parts / whole BVH and geometry on GPU Use GPU for RT and CPU for BVH construction / refitting Construct BVH using binning Similar to Wald RT07 / Popov RT06 Bin all 3 dimensions using SIMD Results in > 10% better trees Measured as SAH quality, not FPS Speed loss is almost negligible
17
Stefan PopovHigh Performance GPU Ray Tracing Results
18
Stefan PopovHigh Performance GPU Ray Tracing Conclusions New recursive PRAM BVH traversal algorithm Very well suited for the new generation of GPUs No additional pre-computed data required First GPU ray tracer to handle large models Previous implementations were limited to < 300K Can handle dynamic scenes By using the CPU to update the geometry / BVH
19
Stefan PopovHigh Performance GPU Ray Tracing Future Work More features Shaders, adaptive anti-aliasing, … Global illumination Code optimizations Current implementation uses too many registers
20
Stefan PopovHigh Performance GPU Ray Tracing Thank you!
21
Stefan PopovHigh Performance GPU Ray Tracing CUDA Hello World __global__ void addArrays(int *arr1, int *arr2) { unsigned t = threadIdx.x + blockIdx.x * blockDim.x; arr1[t] += arr2[t]; } int main() { int *inArr1 = malloc(4194304), *inArr2 = malloc(4194304); int *ta1, *ta2; cudaMalloc((void**)&ta1, 4194304); cudaMalloc((void**)&ta2, 4194304); for(int i = 0; i < 4194304; i++) { inArr1[i] = rand(); inArr2[i] = rand(); } cudaMemcpy(ta1, inArr1, 4194304, cudaMemcpyHostToDevice); cudaMemcpy(ta2, inArr2, 4194304, cudaMemcpyHostToDevice); addArrays >>(ta1, ta2); cudaMemcpy(inArr1, ta1, 4194304, cudaMemcpyDeviceToHost); for(int i = 0; i < 4194304; i++) printf("%d ", inArr1[i]); return 0; }
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.