Latency considerations of depth-first GPU ray tracing
Michael Guthe, University of Bayreuth, Visual Computing
Depth-first GPU ray tracing
- Based on a bounding box or spatial hierarchy
- Recursive traversal, usually using a stack
- Threads inside a warp may access different data
- They may also diverge
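The scheme above can be sketched host-side as follows. This is a minimal illustration, not the talk's actual kernel: the node layout, field names, and the explicit stack are assumptions; on the GPU each CUDA thread runs one such loop for its own ray, which is where the per-warp data divergence comes from.

```cpp
#include <algorithm>
#include <cfloat>
#include <utility>
#include <vector>

// Hypothetical node layout: an AABB plus child/primitive indices.
struct AABB { float min[3], max[3]; };
struct Node {
    AABB box;
    int left = -1, right = -1;  // child indices, -1 if absent
    int prim = -1;              // primitive index if this is a leaf
};

// Slab test: does the ray org + t*dir hit the box for some t in [0, tmax]?
// Relies on IEEE semantics (1/0 = inf) for axis-parallel rays.
bool hitBox(const AABB& b, const float org[3], const float inv[3], float tmax) {
    float t0 = 0.0f, t1 = tmax;
    for (int a = 0; a < 3; ++a) {
        float n = (b.min[a] - org[a]) * inv[a];
        float f = (b.max[a] - org[a]) * inv[a];
        if (n > f) std::swap(n, f);
        t0 = std::max(t0, n);
        t1 = std::min(t1, f);
    }
    return t0 <= t1;
}

// Depth-first traversal with an explicit stack, as done per thread/ray.
std::vector<int> traverse(const std::vector<Node>& nodes, const float org[3],
                          const float dir[3]) {
    float inv[3] = {1.0f / dir[0], 1.0f / dir[1], 1.0f / dir[2]};
    std::vector<int> hits;
    std::vector<int> stack = {0};  // start at the root
    while (!stack.empty()) {
        int i = stack.back(); stack.pop_back();
        if (!hitBox(nodes[i].box, org, inv, FLT_MAX)) continue;  // prune subtree
        if (nodes[i].prim >= 0) { hits.push_back(nodes[i].prim); continue; }
        if (nodes[i].left  >= 0) stack.push_back(nodes[i].left);
        if (nodes[i].right >= 0) stack.push_back(nodes[i].right);
    }
    return hits;
}
```

Because the stack contents depend on which boxes each ray hits, neighboring threads quickly end up at different nodes, causing the divergent data accesses mentioned above.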
Performance Analysis
What limits the performance of the trace kernel?
- Device memory bandwidth? Obviously not!
- Maximum (warp) instructions per clock? Not really!
Performance Analysis
Why doesn't the kernel fully utilize the cores? Three possible reasons:
- Instruction fetch, e.g. due to branches
- Memory latency (a.k.a. data request), mainly due to random access
- Read-after-write (RAW) latency (a.k.a. execution dependency): it takes 22 clock cycles on Kepler until a result is written back to a register
Profiling shows: memory and RAW latency limit performance!
Reducing Latency
Standard solutions for latency:
- Increase occupancy: no option here due to register pressure
- Relocate memory accesses: automatically performed by the compiler, but not across iterations of a while loop, so we unroll the triangle-test loop
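The unrolling idea can be illustrated host-side as follows; the Möller–Trumbore test and all names are illustrative assumptions, not the talk's kernel. Processing two triangles per iteration gives the compiler two independent chains of loads and arithmetic to interleave, hiding part of the RAW latency.

```cpp
#include <algorithm>
#include <cfloat>
#include <cmath>
#include <vector>

struct V3 { float x, y, z; };
static V3 sub(V3 a, V3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static V3 cross(V3 a, V3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
static float dot(V3 a, V3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
struct Tri { V3 a, b, c; };

// Moeller-Trumbore test; returns the hit distance, or FLT_MAX on a miss.
float intersectTri(const Tri& t, V3 org, V3 dir) {
    V3 e1 = sub(t.b, t.a), e2 = sub(t.c, t.a);
    V3 p = cross(dir, e2);
    float det = dot(e1, p);
    if (std::fabs(det) < 1e-8f) return FLT_MAX;  // ray parallel to triangle
    float inv = 1.0f / det;
    V3 s = sub(org, t.a);
    float u = dot(s, p) * inv;
    if (u < 0.0f || u > 1.0f) return FLT_MAX;
    V3 q = cross(s, e1);
    float v = dot(dir, q) * inv;
    if (v < 0.0f || u + v > 1.0f) return FLT_MAX;
    float tt = dot(e2, q) * inv;
    return tt > 0.0f ? tt : FLT_MAX;
}

// Manual 2-way unroll of the leaf loop: tA and tB are independent, so
// their memory loads and arithmetic can overlap instead of serializing.
float closestHit(const std::vector<Tri>& tris, V3 org, V3 dir) {
    float best = FLT_MAX;
    size_t i = 0;
    for (; i + 1 < tris.size(); i += 2) {
        float tA = intersectTri(tris[i], org, dir);      // independent chain A
        float tB = intersectTri(tris[i + 1], org, dir);  // independent chain B
        best = std::min(best, std::min(tA, tB));
    }
    if (i < tris.size())  // odd-count remainder
        best = std::min(best, intersectTri(tris[i], org, dir));
    return best;
}
```

In CUDA the same effect can often be requested with `#pragma unroll` when the trip count is known; the manual form above is needed when it is not.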
Reducing Latency
Instruction-level parallelism via wider trees:
- Not directly supported by the GPU
- Increases the number of eligible warps, with the same effect as higher occupancy; we may even spend a few more registers for it
- A 4-ary tree yields four independent instruction paths and almost doubles the number of eligible warps during node tests
- Higher widths increase the number of node tests; a width of 4 is the optimum
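The four independent instruction paths can be sketched as follows; the box layout and function names are assumptions for illustration. The four slab tests have no data dependence on each other, so their loads and arithmetic can be scheduled back to back instead of waiting 22 cycles between dependent instructions.

```cpp
#include <algorithm>
#include <cfloat>

struct AABB { float min[3], max[3]; };

// Slab test; returns the entry distance along the ray, or FLT_MAX on a miss.
float enterDist(const AABB& b, const float org[3], const float inv[3]) {
    float t0 = 0.0f, t1 = FLT_MAX;
    for (int a = 0; a < 3; ++a) {
        float n = (b.min[a] - org[a]) * inv[a];
        float f = (b.max[a] - org[a]) * inv[a];
        if (n > f) std::swap(n, f);
        t0 = std::max(t0, n);
        t1 = std::min(t1, f);
    }
    return t0 <= t1 ? t0 : FLT_MAX;
}

// 4-wide node test: the four child boxes of a 4-ary node are tested in
// one go. The four calls are mutually independent, which is exactly the
// instruction-level parallelism a 4-ary tree exposes.
void testNode4(const AABB box[4], const float org[3], const float inv[3],
               float dist[4]) {
    dist[0] = enterDist(box[0], org, inv);  // independent path 0
    dist[1] = enterDist(box[1], org, inv);  // independent path 1
    dist[2] = enterDist(box[2], org, inv);  // independent path 2
    dist[3] = enterDist(box[3], org, inv);  // independent path 3
}
```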
Reducing Latency
Tree construction:
- Start from the root and recursively pull the largest child up
- Special rules for leaves reduce memory consumption
- Goal: four child nodes whenever possible
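A host-side sketch of this widening pass, under stated assumptions: the "largest" child is taken to be the subtree with the most primitives (the talk does not specify the measure), the special leaf rules are omitted, and all names are hypothetical.

```cpp
#include <vector>

struct BNode { int left = -1, right = -1, prims = 1; };  // binary input tree
struct WNode { std::vector<int> kids; int src = -1; };   // 4-ary output tree

// Widen a binary BVH into a 4-ary one: starting from each node's two
// children, repeatedly replace the largest inner child by its own two
// children until four slots are filled (or only leaves remain).
int widen(const std::vector<BNode>& bin, int root, std::vector<WNode>& out) {
    int id = (int)out.size();
    out.push_back({});
    out[id].src = root;
    if (bin[root].left < 0) return id;  // leaf: kept as-is
    std::vector<int> open = {bin[root].left, bin[root].right};
    while ((int)open.size() < 4) {
        int best = -1;  // largest inner node currently in the open list
        for (int i = 0; i < (int)open.size(); ++i)
            if (bin[open[i]].left >= 0 &&
                (best < 0 || bin[open[i]].prims > bin[open[best]].prims))
                best = i;
        if (best < 0) break;            // only leaves left, cannot widen
        int n = open[best];             // pull this child's children up
        open.erase(open.begin() + best);
        open.push_back(bin[n].left);
        open.push_back(bin[n].right);
    }
    for (int c : open) {
        int kid = widen(bin, c, out);   // recurse into the chosen children
        out[id].kids.push_back(kid);
    }
    return id;
}
```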
Reducing Latency
Overhead: sorting the intersected nodes by distance
- Can use two independent paths with a parallel merge sort
- No sorting is needed for occlusion rays
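For up to four children this merge sort degenerates into a tiny sorting network, sketched below under assumed names: the first two compare-exchanges are independent (the two parallel paths), followed by a three-comparator merge. In a real kernel each distance would carry its child index along; here only the keys are sorted.

```cpp
#include <utility>

// Compare-exchange: after the call, a <= b.
inline void cswap(float& a, float& b) { if (b < a) std::swap(a, b); }

// Sort four hit distances with a 5-comparator network.
void sort4(float d[4]) {
    cswap(d[0], d[1]);  // path A: sort first pair   } independent, can
    cswap(d[2], d[3]);  // path B: sort second pair  } execute in parallel
    cswap(d[0], d[2]);  // merge: smallest to front
    cswap(d[1], d[3]);  // merge: largest to back
    cswap(d[1], d[2]);  // merge: fix the middle pair
}
```

For occlusion (shadow) rays any hit terminates the ray, so the traversal order does not matter and the network is skipped entirely.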
Results
Improved instructions per clock, but this does not directly translate into speedup.
Results
Up to 20.1% speedup over Aila et al.: "Understanding the Efficiency of Ray Traversal on GPUs", 2012
Test scenes: Sibenik (80k tris), Fairy forest (174k tris), Conference (283k tris), San Miguel (11M tris)
Results
Latency is still the performance limiter; mostly the memory latency was improved.