Latency considerations of depth-first GPU ray tracing

Latency considerations of depth-first GPU ray tracing
Michael Guthe, University of Bayreuth, Visual Computing

Depth-first GPU ray tracing
Based on a bounding box or spatial hierarchy
Recursive traversal, usually implemented with a stack
Threads inside a warp may access different data and may also diverge
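The stack-based traversal on this slide can be illustrated host-side. A minimal C++ sketch, assuming a hypothetical flat node layout (`Node`, `countLeafHits` and the leaf encoding are illustrative, not the actual kernel):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical flat BVH node; left < 0 marks a leaf.
struct Node {
    float bmin[3], bmax[3];  // axis-aligned bounding box
    int   left, right;       // child indices into the node array
};

// Slab test: does the ray o + t*d hit the node's box for some t in [tmin, tmax]?
static bool hitBox(const Node& n, const float o[3], const float invD[3],
                   float tmin, float tmax) {
    for (int a = 0; a < 3; ++a) {
        float t0 = (n.bmin[a] - o[a]) * invD[a];
        float t1 = (n.bmax[a] - o[a]) * invD[a];
        if (t0 > t1) std::swap(t0, t1);
        tmin = std::max(tmin, t0);
        tmax = std::min(tmax, t1);
    }
    return tmin <= tmax;
}

// Depth-first traversal: recursion is replaced by an explicit stack of
// node indices, like a per-thread traversal stack on the GPU.
int countLeafHits(const std::vector<Node>& nodes,
                  const float o[3], const float d[3]) {
    float invD[3] = { 1.0f / d[0], 1.0f / d[1], 1.0f / d[2] };
    int stack[64], sp = 0, hits = 0;
    stack[sp++] = 0;                           // push the root
    while (sp > 0) {
        const Node& n = nodes[stack[--sp]];
        if (!hitBox(n, o, invD, 0.0f, 1e30f)) continue;
        if (n.left < 0) { ++hits; continue; }  // reached a leaf
        stack[sp++] = n.left;                  // push both children
        stack[sp++] = n.right;
    }
    return hits;
}
```

The random node-index pops from this stack are exactly what makes the memory accesses within a warp divergent.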

Performance Analysis
What limits the performance of the trace kernel?
Device memory bandwidth? Obviously not!

Performance Analysis
What limits the performance of the trace kernel?
Maximum (warp) instructions per clock? Not really!

Performance Analysis
Why doesn’t the kernel fully utilize the cores? Three possible reasons:
Instruction fetch, e.g. due to branches
Memory latency (data requests), mainly due to random access
Read-after-write latency (execution dependency): on Kepler it takes 22 clock cycles until a result is written to a register
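The third point, read-after-write latency, is hidden only if independent instructions are available between producing and consuming a value. A small CPU-side analogue of the same effect (function names are illustrative):

```cpp
#include <cstddef>
#include <vector>

// One accumulator: every add reads the result of the previous add,
// forming a single read-after-write chain that fully exposes latency.
float sumSerial(const std::vector<float>& v) {
    float s = 0.0f;
    for (float x : v) s += x;
    return s;
}

// Four accumulators: four independent dependency chains that the
// hardware can overlap, hiding most of the per-add latency.
float sumILP(const std::vector<float>& v) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0, n = v.size();
    for (; i + 4 <= n; i += 4) {
        s0 += v[i]; s1 += v[i + 1]; s2 += v[i + 2]; s3 += v[i + 3];
    }
    for (; i < n; ++i) s0 += v[i];  // leftover elements
    return (s0 + s1) + (s2 + s3);
}
```

On the GPU the same trick appears as more eligible warps or as independent instruction paths within one thread, which is what the wider-tree idea below exploits.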

Performance Analysis
Why doesn’t the kernel fully utilize the cores?
Profiling shows: memory and read-after-write latency limit performance!

Reducing Latency
Standard solutions for latency:
Increase occupancy: not an option here due to register pressure
Relocate memory accesses: automatically performed by the compiler, but not across iterations of a while loop, so the triangle test loop is unrolled
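A sketch of the unrolling idea for the triangle test: with the loop unrolled by two, the two intersections per iteration are independent, so their loads and arithmetic can be in flight at the same time. Möller-Trumbore is used here as a stand-in intersector; the vertex layout is an assumption:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };
static Vec3  sub(Vec3 a, Vec3 b)   { return {a.x-b.x, a.y-b.y, a.z-b.z}; }
static Vec3  cross(Vec3 a, Vec3 b) {
    return {a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x};
}
static float dot(Vec3 a, Vec3 b)   { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Möller-Trumbore ray-triangle test: hit distance t, or +inf on a miss.
static float intersect(Vec3 o, Vec3 d, Vec3 v0, Vec3 v1, Vec3 v2) {
    const float inf = 1e30f;
    Vec3 e1 = sub(v1, v0), e2 = sub(v2, v0);
    Vec3 p = cross(d, e2);
    float det = dot(e1, p);
    if (std::fabs(det) < 1e-8f) return inf;   // parallel to the plane
    float invDet = 1.0f / det;
    Vec3 s = sub(o, v0);
    float u = dot(s, p) * invDet;
    if (u < 0.0f || u > 1.0f) return inf;
    Vec3 q = cross(s, e1);
    float v = dot(d, q) * invDet;
    if (v < 0.0f || u + v > 1.0f) return inf;
    float t = dot(e2, q) * invDet;
    return t > 0.0f ? t : inf;
}

// Leaf test unrolled by 2: the two intersect() calls per iteration have
// no mutual dependency, so they overlap instead of serializing.
float nearestHit(Vec3 o, Vec3 d, const std::vector<Vec3>& tri) {
    const float inf = 1e30f;                 // tri holds 3 vertices per triangle
    float best = inf;
    std::size_t i = 0, n = tri.size() / 3;
    for (; i + 2 <= n; i += 2) {
        float t0 = intersect(o, d, tri[3*i],   tri[3*i+1], tri[3*i+2]);
        float t1 = intersect(o, d, tri[3*i+3], tri[3*i+4], tri[3*i+5]);
        best = std::fmin(best, std::fmin(t0, t1));
    }
    if (i < n)                               // odd leftover triangle
        best = std::fmin(best, intersect(o, d, tri[3*i], tri[3*i+1], tri[3*i+2]));
    return best;
}
```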

Reducing Latency
Instruction-level parallelism: not directly supported by the GPU, but it increases the number of eligible warps, with the same effect as higher occupancy; we may even spend some more registers for it
Wider trees: a 4-ary tree yields 4 independent instruction paths and almost doubles the number of eligible warps during node tests; a higher width increases the number of node tests, so 4 is the optimum
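What the 4 independent paths look like at a 4-wide node: the four slab tests below do not depend on each other, so the scheduler can keep all four chains in flight at once. A host-side sketch with an assumed box layout:

```cpp
#include <algorithm>

// Test a ray against the four child boxes of a 4-ary node; returns a
// hit mask with bit c set if child c is intersected. The four slab
// tests are mutually independent instruction chains.
unsigned testChildren4(const float bmin[4][3], const float bmax[4][3],
                       const float o[3], const float invD[3]) {
    unsigned mask = 0;
    for (int c = 0; c < 4; ++c) {  // fixed trip count: trivially unrollable
        float tmin = 0.0f, tmax = 1e30f;
        for (int a = 0; a < 3; ++a) {
            float t0 = (bmin[c][a] - o[a]) * invD[a];
            float t1 = (bmax[c][a] - o[a]) * invD[a];
            if (t0 > t1) std::swap(t0, t1);
            tmin = std::max(tmin, t0);
            tmax = std::min(tmax, t1);
        }
        if (tmin <= tmax) mask |= 1u << c;
    }
    return mask;
}
```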

Reducing Latency
Tree construction: start from the root and recursively pull the largest child up, with special rules for leaves to reduce memory consumption
Goal: 4 child nodes whenever possible
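One way the pull-up step can be sketched: collapse a binary node into a wide node by repeatedly expanding its largest inner child until four children are collected. The `BNode` layout and the area criterion for "largest" are assumptions, not the paper's exact rules:

```cpp
#include <vector>

// Hypothetical binary BVH node; area stands in for whatever "largest"
// metric the builder uses. left < 0 marks a leaf.
struct BNode { float area; int left, right; };

// Build the child list of one 4-wide node: start with the two binary
// children and repeatedly replace the largest inner child with its own
// two children until 4 slots are filled or only leaves remain.
std::vector<int> makeWideNode(const std::vector<BNode>& n, int root) {
    std::vector<int> kids = { n[root].left, n[root].right };
    while (kids.size() < 4) {
        int best = -1; float bestArea = -1.0f;
        for (std::size_t i = 0; i < kids.size(); ++i)  // largest inner child
            if (n[kids[i]].left >= 0 && n[kids[i]].area > bestArea) {
                best = (int)i; bestArea = n[kids[i]].area;
            }
        if (best < 0) break;                           // only leaves left
        int b = kids[best];                            // pull its children up
        kids[best] = n[b].left;
        kids.push_back(n[b].right);
    }
    return kids;
}
```

Applied recursively from the root, this yields 4 child nodes whenever the subtree is deep enough, which is exactly what the 4-wide node test needs.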

Reducing Latency
Overhead: the intersected child nodes must be sorted by hit distance (e.g. 0.7, 0.3, 0.2 → 0.2, 0.3, 0.7)
Can use two independent paths with a parallel merge sort
We don’t need sorting for occlusion rays
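With at most four intersected children, the sort is a tiny merge network: two independent compare-exchanges sort the pairs, then three more merge them, matching the two parallel paths on the slide. A host-side sketch:

```cpp
#include <utility>

// Compare-exchange on a (distance, child index) slot pair.
static void cmpSwap(float& a, float& b, int& ia, int& ib) {
    if (a > b) { std::swap(a, b); std::swap(ia, ib); }
}

// Sort four child slots by hit distance. The first two exchanges are
// independent (the two parallel paths); occlusion rays skip sorting
// entirely, since any hit terminates the ray.
void sortChildren4(float t[4], int idx[4]) {
    cmpSwap(t[0], t[1], idx[0], idx[1]);  // independent pair 1
    cmpSwap(t[2], t[3], idx[2], idx[3]);  // independent pair 2
    cmpSwap(t[0], t[2], idx[0], idx[2]);  // merge the two sorted pairs
    cmpSwap(t[1], t[3], idx[1], idx[3]);
    cmpSwap(t[1], t[2], idx[1], idx[2]);
}
```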

Results
Improved instructions per clock
This doesn’t directly translate into speedup

Results
Up to 20.1% speedup over Aila et al.: “Understanding the Efficiency of Ray Traversal on GPUs”, 2012
Sibenik, 80k triangles

Results
Up to 20.1% speedup over Aila et al.: “Understanding the Efficiency of Ray Traversal on GPUs”, 2012
Fairy forest, 174k triangles

Results
Up to 20.1% speedup over Aila et al.: “Understanding the Efficiency of Ray Traversal on GPUs”, 2012
Conference, 283k triangles

Results
Up to 20.1% speedup over Aila et al.: “Understanding the Efficiency of Ray Traversal on GPUs”, 2012
San Miguel, 11M triangles

Results
Latency is still the performance limiter
Mostly the memory latency was improved