1
Latency considerations of depth-first GPU ray tracing
Michael Guthe, University of Bayreuth, Visual Computing
2
Depth-first GPU ray tracing
- Based on a bounding box or spatial hierarchy
- Recursive traversal, usually using a stack (see the sketch below)
- Threads inside a warp may access different data and may also diverge
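A minimal sketch of this traversal pattern, assuming a hypothetical node layout and placeholder helpers (Ray, Hit, intersectAABB, intersectLeaf are illustration only, not the kernel from the talk):

```cuda
struct Ray  { float3 org, dir; float tMax; };
struct Hit  { int triId; float t; };
struct Node { float3 bbMin, bbMax; int left, right; bool isLeaf; };

__device__ bool intersectAABB(const Ray& r, float3 bbMin, float3 bbMax); // assumed helper
__device__ void intersectLeaf(const Ray& r, const Node& n, Hit& hit);    // assumed helper

__device__ void traverse(const Node* nodes, const Ray& ray, Hit& hit)
{
    int stack[64];                  // per-thread traversal stack
    int top = 0;
    stack[top++] = 0;               // push the root node
    while (top > 0) {               // depth-first traversal loop
        const Node n = nodes[stack[--top]];
        if (!intersectAABB(ray, n.bbMin, n.bbMax))
            continue;               // threads of a warp may diverge here
        if (n.isLeaf) {
            intersectLeaf(ray, n, hit);  // triangle tests
        } else {
            stack[top++] = n.left;       // children of different threads may
            stack[top++] = n.right;      // live in different cache lines
        }
    }
}
```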
3
Performance Analysis
What limits the performance of the trace kernel?
- Device memory bandwidth? Obviously not!
4
Performance Analysis
What limits the performance of the trace kernel?
- Maximum (warp) instructions per clock? Not really!
5
Performance Analysis
Why doesn’t the kernel fully utilize the cores? Three possible reasons:
- Instruction fetch, e.g. due to branches
- Memory latency (a.k.a. data request), mainly due to random access
- Read-after-write latency (a.k.a. execution dependency): it takes 22 clock cycles (Kepler) until a result is written back to a register (see the sketch below)
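To see why execution dependencies hurt, compare a dependent chain with independent operations; this toy example is not from the talk, it only illustrates the stall:

```cuda
// Each FMA below reads the result of the previous one, so on Kepler the
// warp waits ~22 cycles per step unless other warps can be scheduled.
__device__ float dependentChain(float x)
{
    float a = x * x + 1.0f;
    float b = a * a + 1.0f;   // stalls until 'a' is written back
    float c = b * b + 1.0f;   // stalls until 'b' is written back
    return c;
}

// Three independent FMAs: the scheduler can issue them back to back,
// hiding each one's write-back latency behind the others.
__device__ float independentOps(float x, float y, float z)
{
    float a = x * x + 1.0f;
    float b = y * y + 1.0f;
    float c = z * z + 1.0f;
    return a + b + c;
}
```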
6
Performance Analysis
Why doesn’t the kernel fully utilize the cores?
- Profiling shows: memory and RAW latency limit performance!
7
Reducing Latency
Standard solutions for latency:
- Increase occupancy: not an option here due to register pressure
- Relocate memory accesses: performed automatically by the compiler, but not between iterations of a while loop; hence loop unrolling for the triangle test (see the sketch below)
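A sketch of the unrolling idea for the triangle test, reusing the hypothetical Ray/Hit types from the traversal sketch above; the leaf layout, the unroll factor of 4, and intersectTriangle are assumptions:

```cuda
struct Triangle { float3 v0, v1, v2; };

__device__ void intersectTriangle(const Ray& r, const Triangle& t, Hit& hit); // assumed helper

__device__ void testLeaf(const Triangle* tris, int first, int count,
                         const Ray& ray, Hit& hit)
{
    #pragma unroll 4                       // unrolling lets the compiler hoist
    for (int i = 0; i < count; ++i) {      // independent triangle loads above
        Triangle t = tris[first + i];      // the tests of earlier iterations,
        intersectTriangle(ray, t, hit);    // overlapping their memory latency
    }
}
```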
8
Reducing Latency
Instruction-level parallelism:
- Not directly supported by the GPU
- Increases the number of eligible warps, the same effect as higher occupancy
- We might even spend some more registers
Wider trees:
- A 4-ary tree means four independent instruction paths
- Almost doubles the number of eligible warps during node tests
- Higher widths increase the number of node tests; 4 is the optimum (see the sketch below)
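A sketch of a 4-wide node test; the structure-of-arrays Node4 layout is an assumption, but it shows how the four child slab tests form independent instruction paths:

```cuda
struct Node4 {
    float bbMin[3][4], bbMax[3][4];   // four child boxes, structure-of-arrays
    int   child[4];                   // child node indices
};

// Returns a bit mask of intersected children and their entry distances.
__device__ int testNode4(const Node4& n, const float org[3],
                         const float invDir[3], float tMax, float dist[4])
{
    int hitMask = 0;
    #pragma unroll
    for (int c = 0; c < 4; ++c) {     // four independent instruction paths
        float tNear = 0.0f, tFar = tMax;
        #pragma unroll
        for (int a = 0; a < 3; ++a) { // slab test, one axis at a time
            float t0 = (n.bbMin[a][c] - org[a]) * invDir[a];
            float t1 = (n.bbMax[a][c] - org[a]) * invDir[a];
            tNear = fmaxf(tNear, fminf(t0, t1));
            tFar  = fminf(tFar,  fmaxf(t0, t1));
        }
        dist[c] = tNear;              // entry distance, used for sorting
        if (tNear <= tFar) hitMask |= 1 << c;
    }
    return hitMask;
}
```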
9
Reducing Latency
Tree construction:
- Start from the root
- Recursively pull the largest child up
- Special rules for leaves to reduce memory consumption
- Goal: 4 child nodes whenever possible (see the sketch below)
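A host-side sketch of this collapse from a binary BVH, under the assumption that "largest" means largest surface area (the talk does not spell out the metric) and with hypothetical names (Bvh2Node, collectChildren):

```cuda
#include <vector>

struct Bvh2Node { float bbMin[3], bbMax[3]; int left, right; bool isLeaf; };

// Surface area of a node's bounding box (assumed "largest child" metric).
static float area(const Bvh2Node& n)
{
    float dx = n.bbMax[0] - n.bbMin[0];
    float dy = n.bbMax[1] - n.bbMin[1];
    float dz = n.bbMax[2] - n.bbMin[2];
    return 2.0f * (dx * dy + dy * dz + dz * dx);
}

// Collect up to four children for one 4-ary node: repeatedly replace the
// largest inner child by its own two children.
void collectChildren(const Bvh2Node* bvh2, int root, std::vector<int>& out)
{
    out = { bvh2[root].left, bvh2[root].right };
    while (out.size() < 4) {
        int best = -1;
        float bestArea = -1.0f;
        for (int i = 0; i < (int)out.size(); ++i) {
            const Bvh2Node& c = bvh2[out[i]];
            if (!c.isLeaf && area(c) > bestArea) { bestArea = area(c); best = i; }
        }
        if (best < 0) break;               // only leaves left: keep < 4 children
        int pulled = out[best];
        out[best] = bvh2[pulled].left;     // pull the node up: replace it
        out.push_back(bvh2[pulled].right); // by its two children
    }
    // recurse on each inner child in 'out' to build the next 4-ary level
}
```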
10
Reducing Latency
Overhead: sorting the intersected nodes by distance
- Can use two independent paths with a parallel merge sort (see the sketch below)
- We don’t need sorting for occlusion rays
(Figure: example of merge-sorting the child entry distances 0.2, 0.3, and 0.7)
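Since at most four children can be hit, the sort can be a fixed five-swap merge network whose first two compare-swaps are independent, matching the "two independent paths" above; the function names are illustration only:

```cuda
__device__ __forceinline__ void compareSwap(float& d0, int& c0, float& d1, int& c1)
{
    if (d1 < d0) {                       // keep the nearer child first
        float td = d0; d0 = d1; d1 = td;
        int   tc = c0; c0 = c1; c1 = tc;
    }
}

// Sorts up to four (distance, child) pairs by distance.
__device__ void sortChildren(float d[4], int c[4])
{
    compareSwap(d[0], c[0], d[1], c[1]); // these two swaps are independent:
    compareSwap(d[2], c[2], d[3], c[3]); // two parallel instruction paths
    compareSwap(d[0], c[0], d[2], c[2]); // merge the two sorted pairs
    compareSwap(d[1], c[1], d[3], c[3]);
    compareSwap(d[1], c[1], d[2], c[2]);
}
```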
11
Results
- Improved instructions per clock
- Doesn’t directly translate into a speedup
12
Results
Up to 20.1% speedup over Aila et al., “Understanding the Efficiency of Ray Traversal on GPUs”, 2012. Sibenik, 80k tris.
13
Results
Up to 20.1% speedup over Aila et al., “Understanding the Efficiency of Ray Traversal on GPUs”, 2012. Fairy forest, 174k tris.
14
Results
Up to 20.1% speedup over Aila et al., “Understanding the Efficiency of Ray Traversal on GPUs”, 2012. Conference, 283k tris.
15
Results
Up to 20.1% speedup over Aila et al., “Understanding the Efficiency of Ray Traversal on GPUs”, 2012. San Miguel, 11M tris.
16
Results
- Latency is still the performance limiter
- The improvement comes mostly from reduced memory latency