1
Latency considerations of depth-first GPU ray tracing
Michael Guthe, University of Bayreuth, Visual Computing
2
Depth-first GPU ray tracing
- Based on a bounding box or spatial hierarchy
- Recursive traversal, usually using a stack (see the sketch below)
- Threads inside a warp may access different data and may also diverge
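A minimal sketch of this traversal pattern, assuming a hypothetical node layout and placeholder helpers (Ray, Hit, intersectAABB, intersectLeaf are illustration only, not the kernel from the talk):

```cuda
struct Ray  { float3 org, dir; float tMax; };
struct Hit  { int triId; float t; };
struct Node { float3 bbMin, bbMax; int left, right; bool isLeaf; };

__device__ bool intersectAABB(const Ray& r, float3 bbMin, float3 bbMax); // assumed helper
__device__ void intersectLeaf(const Ray& r, const Node& n, Hit& hit);    // assumed helper

__device__ void traverse(const Node* nodes, const Ray& ray, Hit& hit)
{
    int stack[64];                  // per-thread traversal stack
    int top = 0;
    stack[top++] = 0;               // push the root node
    while (top > 0) {               // depth-first traversal loop
        const Node n = nodes[stack[--top]];
        if (!intersectAABB(ray, n.bbMin, n.bbMax))
            continue;               // threads of a warp may diverge here
        if (n.isLeaf) {
            intersectLeaf(ray, n, hit);  // triangle tests
        } else {
            stack[top++] = n.left;       // children of different threads may
            stack[top++] = n.right;      // live in different cache lines
        }
    }
}
```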
3
Performance Analysis
What limits the performance of the trace kernel?
- Device memory bandwidth? Obviously not!
4
Performance Analysis
What limits the performance of the trace kernel?
- Maximum (warp) instructions per clock? Not really!
5
Performance Analysis
Why doesn’t the kernel fully utilize the cores? Three possible reasons:
- Instruction fetch, e.g. due to branches
- Memory latency (a.k.a. data request), mainly due to random access
- Read-after-write latency (a.k.a. execution dependency): it takes 22 clock cycles (Kepler) until a result is written back to a register (see the sketch below)
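To see why execution dependencies hurt, compare a dependent chain with independent operations; this toy example is not from the talk, it only illustrates the stall:

```cuda
// Each FMA below reads the result of the previous one, so on Kepler the
// warp waits ~22 cycles per step unless other warps can be scheduled.
__device__ float dependentChain(float x)
{
    float a = x * x + 1.0f;
    float b = a * a + 1.0f;   // stalls until 'a' is written back
    float c = b * b + 1.0f;   // stalls until 'b' is written back
    return c;
}

// Three independent FMAs: the scheduler can issue them back to back,
// hiding each one's write-back latency behind the others.
__device__ float independentOps(float x, float y, float z)
{
    float a = x * x + 1.0f;
    float b = y * y + 1.0f;
    float c = z * z + 1.0f;
    return a + b + c;
}
```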
6
Performance Analysis
Why doesn’t the kernel fully utilize the cores?
- Profiling shows: memory and RAW latency limit performance!
7
Reducing Latency
Standard solutions for latency:
- Increase occupancy: not an option here due to register pressure
- Relocate memory accesses: performed automatically by the compiler, but not between iterations of a while loop; hence loop unrolling for the triangle test (see the sketch below)
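A sketch of the unrolling idea for the triangle test, reusing the hypothetical Ray/Hit types from the traversal sketch above; the leaf layout, the unroll factor of 4, and intersectTriangle are assumptions:

```cuda
struct Triangle { float3 v0, v1, v2; };

__device__ void intersectTriangle(const Ray& r, const Triangle& t, Hit& hit); // assumed helper

__device__ void testLeaf(const Triangle* tris, int first, int count,
                         const Ray& ray, Hit& hit)
{
    #pragma unroll 4                       // unrolling lets the compiler hoist
    for (int i = 0; i < count; ++i) {      // independent triangle loads above
        Triangle t = tris[first + i];      // the tests of earlier iterations,
        intersectTriangle(ray, t, hit);    // overlapping their memory latency
    }
}
```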
8
Reducing Latency
Instruction-level parallelism:
- Not directly supported by the GPU
- Increases the number of eligible warps, the same effect as higher occupancy
- We might even spend some more registers
Wider trees:
- A 4-ary tree means four independent instruction paths
- Almost doubles the number of eligible warps during node tests
- Higher widths increase the number of node tests; 4 is the optimum (see the sketch below)
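A sketch of a 4-wide node test; the structure-of-arrays Node4 layout is an assumption, but it shows how the four child slab tests form independent instruction paths:

```cuda
struct Node4 {
    float bbMin[3][4], bbMax[3][4];   // four child boxes, structure-of-arrays
    int   child[4];                   // child node indices
};

// Returns a bit mask of intersected children and their entry distances.
__device__ int testNode4(const Node4& n, const float org[3],
                         const float invDir[3], float tMax, float dist[4])
{
    int hitMask = 0;
    #pragma unroll
    for (int c = 0; c < 4; ++c) {     // four independent instruction paths
        float tNear = 0.0f, tFar = tMax;
        #pragma unroll
        for (int a = 0; a < 3; ++a) { // slab test, one axis at a time
            float t0 = (n.bbMin[a][c] - org[a]) * invDir[a];
            float t1 = (n.bbMax[a][c] - org[a]) * invDir[a];
            tNear = fmaxf(tNear, fminf(t0, t1));
            tFar  = fminf(tFar,  fmaxf(t0, t1));
        }
        dist[c] = tNear;              // entry distance, used for sorting
        if (tNear <= tFar) hitMask |= 1 << c;
    }
    return hitMask;
}
```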
9
Reducing Latency
Tree construction:
- Start from the root
- Recursively pull the largest child up
- Special rules for leaves to reduce memory consumption
- Goal: 4 child nodes whenever possible (see the sketch below)
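A host-side sketch of this collapse from a binary BVH, under the assumption that "largest" means largest surface area (the talk does not spell out the metric) and with hypothetical names (Bvh2Node, collectChildren):

```cuda
#include <vector>

struct Bvh2Node { float bbMin[3], bbMax[3]; int left, right; bool isLeaf; };

// Surface area of a node's bounding box (assumed "largest child" metric).
static float area(const Bvh2Node& n)
{
    float dx = n.bbMax[0] - n.bbMin[0];
    float dy = n.bbMax[1] - n.bbMin[1];
    float dz = n.bbMax[2] - n.bbMin[2];
    return 2.0f * (dx * dy + dy * dz + dz * dx);
}

// Collect up to four children for one 4-ary node: repeatedly replace the
// largest inner child by its own two children.
void collectChildren(const Bvh2Node* bvh2, int root, std::vector<int>& out)
{
    out = { bvh2[root].left, bvh2[root].right };
    while (out.size() < 4) {
        int best = -1;
        float bestArea = -1.0f;
        for (int i = 0; i < (int)out.size(); ++i) {
            const Bvh2Node& c = bvh2[out[i]];
            if (!c.isLeaf && area(c) > bestArea) { bestArea = area(c); best = i; }
        }
        if (best < 0) break;               // only leaves left: keep < 4 children
        int pulled = out[best];
        out[best] = bvh2[pulled].left;     // pull the node up: replace it
        out.push_back(bvh2[pulled].right); // by its two children
    }
    // recurse on each inner child in 'out' to build the next 4-ary level
}
```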
10
Reducing Latency
Overhead: sorting the intersected nodes by distance
- Can use two independent paths with a parallel merge sort (see the sketch below)
- We don’t need sorting for occlusion rays
(Figure: example of merge-sorting the child entry distances 0.2, 0.3, and 0.7)
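Since at most four children can be hit, the sort can be a fixed five-swap merge network whose first two compare-swaps are independent, matching the "two independent paths" above; the function names are illustration only:

```cuda
__device__ __forceinline__ void compareSwap(float& d0, int& c0, float& d1, int& c1)
{
    if (d1 < d0) {                       // keep the nearer child first
        float td = d0; d0 = d1; d1 = td;
        int   tc = c0; c0 = c1; c1 = tc;
    }
}

// Sorts up to four (distance, child) pairs by distance.
__device__ void sortChildren(float d[4], int c[4])
{
    compareSwap(d[0], c[0], d[1], c[1]); // these two swaps are independent:
    compareSwap(d[2], c[2], d[3], c[3]); // two parallel instruction paths
    compareSwap(d[0], c[0], d[2], c[2]); // merge the two sorted pairs
    compareSwap(d[1], c[1], d[3], c[3]);
    compareSwap(d[1], c[1], d[2], c[2]);
}
```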
11
Results
- Improved instructions per clock
- Doesn’t directly translate into a speedup
12
Results
Up to 20.1% speedup over Aila et al., “Understanding the Efficiency of Ray Traversal on GPUs”, 2012. Sibenik, 80k tris.
13
Results
Up to 20.1% speedup over Aila et al., “Understanding the Efficiency of Ray Traversal on GPUs”, 2012. Fairy forest, 174k tris.
14
Results
Up to 20.1% speedup over Aila et al., “Understanding the Efficiency of Ray Traversal on GPUs”, 2012. Conference, 283k tris.
15
Results
Up to 20.1% speedup over Aila et al., “Understanding the Efficiency of Ray Traversal on GPUs”, 2012. San Miguel, 11M tris.
16
Results
- Latency is still the performance limiter
- The improvement comes mostly from reduced memory latency