Understanding the Efficiency of Ray Traversal on GPUs
Timo Aila and Samuli Laine, NVIDIA Research

Agenda
- What limits the performance of the fastest traversal methods on GPUs?
  - Memory speed (bandwidth, latency)? Computation?
  - Resource conflicts (serialization, scoreboarding, ...)? Load balancing?
- How far are we from the "theoretical optimum"? How much performance is left on the table?
- Solutions that help today
- Solutions that may help tomorrow

Terminology
- Trace(): an unpredictable sequence of acceleration structure traversal and primitive intersection
- SIMT: SIMD with execution divergence handling built into hardware; computes everything on all lanes (a small divergence example follows below)
- Warp: a group of threads that execute simultaneously in a SIMD/SIMT unit; 32 threads in NVIDIA hardware
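
A minimal CUDA illustration of the "computes everything on all lanes" point (our example, not from the talk):

    // Both sides of a divergent branch are executed by the warp in turn,
    // with non-participating lanes masked off: the warp pays for the
    // multiply AND the add even though each lane runs only one of them.
    __global__ void divergent(int* a)
    {
        int i = threadIdx.x;
        if (i & 1) a[i] = a[i] * 3;   // odd lanes active here
        else       a[i] = a[i] + 7;   // even lanes active here
    }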

"Theoretical optimum"
- Peak FLOPS as quoted by marketing? Achievable only in special cases; too conservative a bound.
- Instructions issued per second for a given kernel (i.e., program), assuming:
  - infinitely fast memory
  - complete absence of resource conflicts
- => A tight upper bound when limited by computation (one way to write this bound follows below)
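
Written as a formula (our formalization of the definition above, assuming the kernel's average instruction count per ray is known):

    \text{upper bound [rays/s]} \;=\; \frac{\text{instruction issue rate [instructions/s]}}{\text{average instructions per ray for the kernel}}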

Simulator
- Custom simulator written for this project
- Inputs:
  - sequence of operations for each ray (traversal, intersection, enter leaf, mainloop)
  - native asm instructions for each operation
- Execution:
  - mimics GTX285 instruction issue [Lindholm et al. 2008]
  - assumes all instructions have zero latency
  - configurable SIMD width, default 32
- Output: a practical upper bound on performance

Test setup
- NVIDIA GeForce GTX285, CUDA 2.1
- BVH (64 bytes per 2 child nodes; a sketch of one possible layout follows below)
  - Both children are tested together; traversal proceeds to the closer one
  - Greedy SAH construction, early triangle splits
  - Max 8 triangles per leaf
- Woop's unit triangle test (48 bytes per triangle)
- Nodes in a 1D texture (cached); triangles in global memory (uncached)
- Hierarchical optimizations not used in traversal
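
A plausible struct for the 64-byte node format; the field order and the leaf-flag convention are our assumptions, chosen so that one node fetch yields both children's bounds in four aligned float4 reads:

    // Hypothetical 64-byte layout holding both children of a split node
    // ("64 bytes per 2 child nodes"); four 16-byte groups suit aligned
    // float4 texture fetches. Field order and leaf encoding are assumed.
    struct BVHNodePair
    {
        float c0lox, c0hix, c0loy, c0hiy;   // child 0 bounds: x, y
        float c1lox, c1hix, c1loy, c1hiy;   // child 1 bounds: x, y
        float c0loz, c0hiz, c1loz, c1hiz;   // both children: z
        int   child0, child1;               // node indices; negative marks a leaf (assumed)
        int   pad0, pad1;                   // pad to 64 bytes
    };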

Test scenes
- Sibenik (80K tris, 54K nodes), Fairy (174K tris, 66K nodes), Conference (282K tris, 164K nodes)
- 1024x768, 32 secondary rays per pixel (Halton)
- Average of 5 viewpoints; all timings include only trace()

Packet traversal
- Assign one ray to each thread
- Follows Günther et al. [2007], slightly optimized for GT200
- All rays in a packet (i.e., warp) follow exactly the same path in the tree
  - Single traversal stack per warp, in shared memory (a sketch follows below)
  - Rays visit redundant nodes, but memory accesses are coherent
- Could expect measured performance close to the simulated upper bound
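
A minimal sketch of the one-stack-per-warp idea; sizes and names are illustrative, not the published kernels, and it is written in the warp-synchronous style that was safe on GT200:

    // A single traversal stack per warp: all 32 rays in a packet follow
    // the same path, so one shared-memory stack serves the whole warp and
    // one lane writes on behalf of everyone.
    #define WARPS_PER_BLOCK 6
    #define STACK_DEPTH     64

    __device__ void pushNode(volatile int* warpStack, int& stackPtr, int node)
    {
        if ((threadIdx.x & 31) == 0)     // lane 0 writes for the packet
            warpStack[stackPtr] = node;
        stackPtr++;                      // kept in a register, identical on all lanes
    }

    __global__ void packetKernel(/* rays, nodes, ... */)
    {
        __shared__ volatile int stacks[WARPS_PER_BLOCK][STACK_DEPTH];
        volatile int* warpStack = stacks[threadIdx.y];  // assumes blockDim.x == 32
        int stackPtr = 0;
        // ... traversal loop pushes the far child via pushNode(),
        //     pops by reading warpStack[--stackPtr] on all lanes ...
    }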

Packet traversal
[Table: simulated vs. measured Mrays/s and % of simulated, for primary, AO, and diffuse rays; the numbers were not preserved in this transcript]
- Only 40% of simulated!? Similar in other scenes.
- Not limited by computation
  - Memory speed, even with coherent accesses? Simulator broken?
  - Resource conflicts? Load balancing?
- 2.5X performance on the table

Per-ray traversal
- Assign one ray to each thread
- Full traversal stack for each ray, in thread-local (external) memory [Zhou et al. 2008]
- No extra computation on SIMT
- Rays visit exactly the nodes they intersect
  - Less coherent memory accesses
  - Stacks cause additional memory traffic
- If memory speed really is the culprit, the gap between measured and simulated should be larger than for packet traversal

Per-ray traversal
[Table: simulated vs. measured Mrays/s and % of simulated, for primary, AO, and diffuse rays]
- Getting closer to the simulated bound with all ray types
- Memory not guilty after all?

while-while vs. if-if (CUDA-style skeletons of both loops follow below)

    while-while trace():
        while ray not terminated
            while node does not contain primitives
                traverse to the next node
            while node contains untested primitives
                perform ray-primitive intersection

    if-if trace():
        while ray not terminated
            if node does not contain primitives
                traverse to the next node
            if node contains untested primitives
                perform ray-primitive intersection
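
The same two structures as CUDA skeletons. All helpers (terminated, isInner, traverseStep, ...) and the ROOT constant are hypothetical placeholders standing in for the real kernels:

    struct Ray;
    __device__ bool terminated(Ray& r);
    __device__ bool isInner(int node);
    __device__ int  traverseStep(Ray& r, int node);   // descend one level, push far child
    __device__ bool hasUntestedTris(Ray& r, int node);
    __device__ void intersectNextTri(Ray& r, int node);
    __device__ int  popStack(Ray& r);
    #define ROOT 0

    // while-while: stay in each phase as long as possible.
    __device__ void traceWhileWhile(Ray& ray)
    {
        int node = ROOT;
        while (!terminated(ray)) {
            while (isInner(node))               // descend until a leaf
                node = traverseStep(ray, node);
            while (hasUntestedTris(ray, node))  // then test the leaf's primitives
                intersectNextTri(ray, node);
            node = popStack(ray);               // continue from the stack
        }
    }

    // if-if: at most one traversal step and one batch of tests per iteration.
    __device__ void traceIfIf(Ray& ray)
    {
        int node = ROOT;
        while (!terminated(ray)) {
            if (isInner(node))
                node = traverseStep(ray, node);
            if (hasUntestedTris(ray, node))
                intersectNextTri(ray, node);
            else if (!isInner(node))
                node = popStack(ray);           // leaf exhausted: fetch next node
        }
    }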

Per-ray traversal (if-if)
[Table: simulated vs. measured Mrays/s and % of simulated, for primary, AO, and diffuse rays]
- ~20% slower code gives the same measured performance, even though memory accesses are less coherent
- Faster than while-while when leaf nodes are smaller
- Neither memory communication nor computation should favor if-if
  - These results are possible only if some cores idle?

Work distribution
- Histograms of warp execution times:
  - fewer extremely slow warps in if-if
  - slowest warp 30% faster in if-if; otherwise similar
- CUDA work distribution units are optimized for homogeneous work items
  - Applies to all of NVIDIA's current cards
- Trace() has wildly varying execution time
  - May cause starvation issues in work distribution
  - Need to bypass it in order to quantify

Persistent threads
- Launch only enough threads to fill the machine once
- Warps fetch work from a global pool using an atomic counter until the pool is empty
- Bypasses the hardware work distribution
- Simple and generic solution; pseudocode in the paper (a sketch follows below)
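
A minimal sketch of the persistent-threads loop, assuming a precomputed ray pool and a global atomic counter; the names (g_nextRay, trace, WARPS_PER_BLOCK) are illustrative, not the paper's exact code, and the shared-memory handoff is warp-synchronous as was safe on GT200 (newer GPUs would add __syncwarp()):

    struct Ray { float ox, oy, oz, dx, dy, dz; };  // placeholder ray type
    __device__ void trace(Ray& r) { /* traversal kernel body */ }
    __device__ int  g_nextRay;                     // reset to 0 before launch
    #define WARPS_PER_BLOCK 6

    __global__ void persistentTrace(Ray* rays, int numRays)
    {
        __shared__ volatile int baseIdx[WARPS_PER_BLOCK];
        int warp = threadIdx.y;                    // assumes blockDim.x == 32
        int lane = threadIdx.x;

        while (true) {
            if (lane == 0)                         // one pool fetch per warp
                baseIdx[warp] = atomicAdd(&g_nextRay, 32);
            int myRay = baseIdx[warp] + lane;      // warp-synchronous read
            if (myRay >= numRays)                  // pool exhausted: retire
                return;
            trace(rays[myRay]);
        }
    }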

Persistent packet traversal
[Table: simulated vs. measured Mrays/s and % of simulated, for primary, AO, and diffuse rays]
- ~2X performance from persistent threads
- ~85% of simulated, also in other scenes
- Hard to get much closer: the bound assumes optimal dual issue, no resource conflicts, infinitely fast memory, 20K threads, ...

Persistent while-while
[Table: simulated vs. measured Mrays/s and % of simulated, for primary, AO, and diffuse rays]
- ~1.5X performance from persistent threads
- ~80% of simulated in this scene, ~85% in other scenes
- Always faster than packet traversal; 2X with incoherent rays

Speculative traversal
- "If a warp is going to execute node traversal anyway, why not let all rays participate?" (the alternative is to be idle; a sketch follows below)
- Can perform redundant node fetches
- Should help when not bound by memory speed
- 5-10% higher performance for primary and AO rays; no improvement for diffuse
- Disagrees with simulation (10-20% expected)
  - First evidence of memory bandwidth issues? Not latency, not computation, not load balancing...
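
A sketch of the decision inside an if-if style loop, reusing the hypothetical helpers from the skeleton above. __any is the GT200-era warp vote; current CUDA spells it __any_sync(mask, p):

    // Speculative traversal: if any lane still wants a node fetch, the
    // whole warp traverses (possibly fetching redundant nodes) instead
    // of idling; primitive tests wait until no lane wants traversal.
    if (__any(isInner(node)))
        node = traverseStep(ray, node);   // all 32 lanes participate
    else if (hasUntestedTris(ray, node))
        intersectNextTri(ray, node);      // whole warp does primitive tests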

Two further improvements
- Currently these are slow because crucial instructions are missing
- The simulator says two warp-wide instructions would help (sketched below):
  - ENUM (prefix sum): enumerates the threads for which a condition is true; returns indices [0, M-1]
  - POPC (population count): returns the number of threads for which a condition is true, i.e., M above
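
Both operations later became expressible with warp intrinsics; a sketch using CUDA 9-style names (__ballot_sync, __popc), where `condition` stands for any per-lane bool:

    // POPC: how many lanes satisfy the predicate. ENUM: this lane's rank
    // among those lanes, i.e. a warp-wide prefix sum of a 1-bit value.
    int      lane = threadIdx.x & 31;
    unsigned mask = __ballot_sync(0xffffffffu, condition);
    int      POPC = __popc(mask);                       // M
    int      ENUM = __popc(mask & ((1u << lane) - 1));  // index in [0, M-1]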

1. Replacing terminated rays
- Threads with terminated rays are idle until the whole warp terminates
- Replace terminated rays with new ones
  - Less coherent execution & memory accesses (remember: per-ray kernels beat packets)
- Currently helps in some cases, usually not
- With ENUM & POPC the simulator says +20% is possible in ambient occlusion and diffuse, iff not limited by memory speed (sketch below)
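
A hypothetical use of the two operations to refill dead lanes in one step, building on the earlier sketches (g_nextRay is the same pool counter as in the persistent-threads sketch; bounds checks omitted):

    // Count the terminated lanes, fetch that many fresh rays with a
    // single atomic, and let each dead lane take the ray at its ENUM rank.
    unsigned dead  = __ballot_sync(0xffffffffu, terminated(ray));
    int      nDead = __popc(dead);
    int      base  = 0;
    if (lane == 0 && nDead > 0)
        base = atomicAdd(&g_nextRay, nDead);
    base = __shfl_sync(0xffffffffu, base, 0);           // broadcast from lane 0
    if (terminated(ray))
        ray = rays[base + __popc(dead & ((1u << lane) - 1))];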

2. Local work queues
- Assign 64 rays to a 32-wide warp; keep the other 32 rays in shared mem/registers
- 32+ rays will then always require either node traversal or primitive intersection
  - Almost perfect SIMD efficiency (% of threads active)
- Shuffling takes time: too slow on GTX285
- With ENUM + POPC, in the Fairy scene: ambient occlusion +40%, diffuse +80%, iff not limited by memory speed

Conclusions (1/2)
- The primary bottleneck was load balancing
- Reasonably coherent rays are not limited by memory bandwidth on GTX285, even without a cache hierarchy
  - Larger scenes should work fine
  - Only faster code and better trees can help, e.g. Stich et al. [2009]
- ~40% performance on the table for incoherent rays
  - Memory layout optimizations etc. might help
  - Wide trees, maybe with ENUM & POPC?

Conclusions (2/2)
- Encouraging performance, especially for incoherent rays
  - Randomly shuffled diffuse rays: 20-40 Mrays/sec
  - GPUs are not so bad at ray tracing after all
- Persistent threads are likely useful in other applications that have heterogeneous workloads

Acknowledgements
- David Luebke, for comments on drafts and for suggesting we include Section 5
- David Tarjan and Jared Hoberock, for additional implementations and tests
- Reviewers, for clarity improvements
- Bartosz Fabianowski, for helping to make the CUDA kernels more understandable
- Marko Dabrovic for Sibenik; University of Utah for the Fairy Forest

CUDA kernels available
- On the NVIDIA Research website (soon): download, test, improve

Thank you for listening!