Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel.

Ray Tracing Performance Zero to Millions in 45 Minutes?! Gordon Stoll, Intel

Goals for this talk
Goals
– point you toward the current state-of-the-art (“BKM”, best known method)
  for non-researchers: off-the-shelf performance
  for researchers: baseline for comparison
– get you interested in poking at the problem
Non-Goals
– present lowest-level details of kernels
– present “the one true way”

Acceleration Structures
BKM is to use a kD-tree (AA BSP)
Previous BKM was to use a uniform grid
– Only scheme with comparable speed
– Performance is not robust
– No packet tracing algorithm
Other grids, octrees, etc…just use a kD-tree.
Don’t use bounding volume hierarchies.

kD-Trees

Advantages of kD-Trees Adaptive – Can handle the “Teapot in a Stadium” Compact – Relatively little memory overhead Cheap Traversal – One FP subtract, one FP multiply

Take advantage of advantages Adaptive – You have to build a good tree Compact – At least use the compact node representation (8-byte) – You can’t be fetching whole cache lines every time Cheap traversal – No sloppy inner loops! (one subtract, one multiply!)

“Bang for the Buck” ( !/$ ) A basic kD-tree implementation will go pretty fast… …but extra effort will pay off big.

Fast Ray Tracing w/ kD-Trees Adaptive Compact Cheap traversal

Building kD-trees Given: – axis-aligned bounding box (“cell”) – list of geometric primitives (triangles?) touching cell Core operation: – pick an axis-aligned plane to split the cell into two parts – sift geometry into two batches (some redundancy) – recurse – termination criteria!

“Intuitive” kD-Tree Building Split Axis – Round-robin; largest extent Split Location – Middle of extent; median of geometry (balanced tree) Termination – Target # of primitives, limited tree depth

“Hack” kD-Tree Building Split Axis – Round-robin; largest extent Split Location – Middle of extent; median of geometry (balanced tree) Termination – Target # of primitives, limited tree depth All of these techniques stink. Don’t use them. – I mean it.

Building good kD-trees
What split do we really want?
– Clever Idea: The one that makes ray tracing cheap
– Write down an expression of cost and minimize it
– Cost Optimization
What is the cost of tracing a ray through a cell?
Cost(cell) = C_trav + Prob(hit L) * Cost(L) + Prob(hit R) * Cost(R)

Splitting with Cost in Mind

Split in the middle Makes the L & R probabilities equal Pays no attention to the L & R costs

Split at the Median Makes the L & R costs equal Pays no attention to the L & R probabilities

Cost-Optimized Split Automatically and rapidly isolates complexity Produces large chunks of empty space

Building good kD-trees
Need the probabilities
– Turns out to be proportional to surface area
Need the child cell costs
– Simple triangle count works great (very rough approx.)
Cost(cell) = C_trav + Prob(hit L) * Cost(L) + Prob(hit R) * Cost(R)
           = C_trav + SA(L) * TriCount(L) + SA(R) * TriCount(R)
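
A minimal sketch of the cost expression above, not the talk's actual code. It assumes the hit probability of a child is its surface area divided by the parent cell's surface area (the slide only says "proportional"), and it adds a hypothetical per-triangle weight `c_isect` next to `C_trav`. The function and parameter names are illustrative.

```python
def box_surface_area(lo, hi):
    """Surface area of an axis-aligned box given its min and max corners."""
    dx, dy, dz = (hi[i] - lo[i] for i in range(3))
    return 2.0 * (dx * dy + dy * dz + dz * dx)


def split_cost(lo, hi, axis, pos, n_left, n_right, c_trav=1.0, c_isect=1.0):
    """Estimated cost of splitting cell [lo, hi] at `pos` along `axis`.

    Assumption: Prob(hit child) = SA(child) / SA(cell), and the cost of a
    child is approximated by its triangle count times c_isect.
    """
    left_hi = list(hi)
    left_hi[axis] = pos
    right_lo = list(lo)
    right_lo[axis] = pos
    sa_cell = box_surface_area(lo, hi)
    p_left = box_surface_area(lo, left_hi) / sa_cell
    p_right = box_surface_area(right_lo, hi) / sa_cell
    return c_trav + c_isect * (p_left * n_left + p_right * n_right)


# Unit cube, 10 triangles all on the left side of the plane.
cost_middle = split_cost((0, 0, 0), (1, 1, 1), 0, 0.5, 10, 0)
cost_tight = split_cost((0, 0, 0), (1, 1, 1), 0, 0.1, 10, 0)
```

With these assumed weights, the tight split that fences the triangles into a small cell is estimated to be cheaper than the middle split, which is the "large chunks of empty space" behavior the earlier slide describes.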

Termination Criteria When should we stop splitting? – Another Clever idea: When splitting isn’t helping any more. – Use the cost estimates in your termination criteria Threshold of cost improvement – Stretch over multiple levels Threshold of cell size – Absolute probability so small there’s no point

Building good kD-trees Basic build algorithm – Pick an axis, or optimize across all three – Build a set of “candidates” (split locations) BBox edges or exact triangle intersections – Sort them or bin them – Walk through candidates or bins to find minimum cost split Characteristics you’re looking for – “stringy”, depth , ~2 triangle leaves, big empty cells
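
The basic build step can be sketched roughly as follows, under several assumptions: one axis at a time, candidates taken from bounding-box edges, sorting rather than binning, and the same normalized surface-area cost as the formula above. Starting from the cost of leaving the cell as a leaf is one simple way to fold a termination test into the search. All names here are hypothetical.

```python
from bisect import bisect_left, bisect_right


def surface_area(extent):
    """Surface area of a box given its three side lengths."""
    dx, dy, dz = extent
    return 2.0 * (dx * dy + dy * dz + dz * dx)


def best_split(cell_lo, cell_hi, axis, intervals, c_trav=1.0, c_isect=1.0):
    """Return (position, cost) of the cheapest split along `axis`.

    `intervals` holds the (lo, hi) extent of each primitive's bounding box
    along `axis`. Position is None when no split beats making a leaf.
    """
    los = sorted(lo for lo, _ in intervals)
    his = sorted(hi for _, hi in intervals)
    extent = [cell_hi[i] - cell_lo[i] for i in range(3)]
    sa_cell = surface_area(extent)

    # Baseline: cost of not splitting at all (a leaf with every primitive).
    best_pos, best_cost = None, c_isect * len(intervals)

    for pos in sorted(set(los + his)):
        if not (cell_lo[axis] < pos < cell_hi[axis]):
            continue  # planes on the cell boundary do not split anything
        n_left = bisect_left(los, pos)                # boxes starting before the plane
        n_right = len(his) - bisect_right(his, pos)   # boxes ending after the plane
        left = list(extent)
        left[axis] = pos - cell_lo[axis]
        right = list(extent)
        right[axis] = cell_hi[axis] - pos
        cost = c_trav + c_isect * (
            surface_area(left) * n_left + surface_area(right) * n_right
        ) / sa_cell
        if cost < best_cost:
            best_pos, best_cost = pos, cost
    return best_pos, best_cost


# Four primitives clustered at one end of a long, thin cell.
clustered = [(0.0, 0.5), (0.2, 0.8), (0.4, 1.0), (0.1, 0.9)]
pos, cost = best_split((0, 0, 0), (10, 1, 1), 0, clustered)
no_split = best_split((0, 0, 0), (10, 1, 1), 0, [(0.0, 10.0)])
```

In the clustered example the chosen plane sits right at the end of the geometry, cutting off the empty nine tenths of the cell. Primitives lying exactly in the split plane are ignored by this sketch; a real builder needs a rule for them.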

Just Do It Benefits of a good tree are not small – not 10%, 20%, 30%... – several times faster than a mediocre tree

Building kD-trees quickly Very important to build good trees first – otherwise you have no basis for comparison Don’t give up cost optimization! – Use the math, Luke… Luckily, lots of flexibility… – axis picking (“hack” pick vs. full optimization) – candidate picking (bboxes, exact; binning, sorting) – termination criteria (“knob” controlling tradeoff)

Building kD-trees quickly
Remember, profile first! Where’s the time going?
– split personality
  memory traffic all at the top (NO cache misses at bottom)
  – sifting through bajillion triangles to pick one split (!)
  – hierarchical building?
  computation mostly at the bottom
  – lots of leaves, need more exact candidate info
  – lazy building? change criteria during the build?

Fast Ray Tracing w/ kD-Trees adaptive – build a cost-optimized kD-tree w/ the surface area heuristic compact cheap traversal

What’s in a node? A kD-tree internal node needs: – Am I a leaf? – Split axis – Split location – Pointers to children

Compact (8-byte) nodes kD-Tree node can be packed into 8 bytes – Leaf flag + Split axis 2 bits – Split location 32 bit float – Always two children, put them side-by-side One 32-bit pointer So close! Sweep those 2 bits under the rug…
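
One possible way to sweep the 2 bits under the rug, shown as a sketch: if child offsets are always multiples of 4, the low two bits of the 32-bit offset word are free to carry the leaf flag and split axis. The slide does not say where the bits go, so this placement, the little-endian layout, and the names `pack_node`, `unpack_node`, and `LEAF` are all assumptions.

```python
import struct

LEAF = 3  # assumed encoding: 0, 1, 2 = split on x, y, z; 3 = leaf


def pack_node(split, child_offset, flags):
    """Pack one node into 8 bytes: a 32-bit float plus a 32-bit word.

    Assumes child_offset is a multiple of 4 so its low 2 bits can hold flags.
    """
    assert child_offset % 4 == 0 and 0 <= child_offset < 2 ** 32
    assert 0 <= flags <= 3
    return struct.pack("<fI", split, child_offset | flags)


def unpack_node(raw):
    """Return (split, child_offset, flags) from an 8-byte node."""
    split, word = struct.unpack("<fI", raw)
    return split, word & ~3, word & 3


raw = pack_node(0.5, 64, 1)   # split on y at 0.5, children start at offset 64
```

Because the two children sit side by side, a single offset is enough: the second child lives at a fixed distance after the first.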

No Bounding Box! kD-Tree node corresponds to an AABB Doesn’t mean it has to *contain* one – 24 bytes (six 32-bit floats) – 4X explosion (!) (an 8-byte node grows to 32 bytes)

Memory Layout Cache lines are much bigger than 8 bytes! – advantage of compactness lost with poor layout Pretty easy to do something reasonable – Building depth first, watching memory allocator
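
A rough illustration of "something reasonable", assuming the tree is first built as nested tuples and then flattened depth first into one array, with the two children of every node in adjacent slots. The tuple shapes and the name `flatten` are hypothetical, and list slots stand in for 8-byte nodes.

```python
def flatten(tree):
    """Lay out a nested tree depth first with siblings side by side.

    Input nodes are ('leaf', payload) or ('node', axis, split, left, right).
    Output entries are ('leaf', payload) or ('node', axis, split, first_child),
    where the right child is always at first_child + 1.
    """
    nodes = [None]  # slot 0 is reserved for the root

    def emit(subtree, slot):
        if subtree[0] == 'leaf':
            nodes[slot] = ('leaf', subtree[1])
            return
        _, axis, split, left, right = subtree
        first_child = len(nodes)
        nodes.extend([None, None])          # allocate both children together
        nodes[slot] = ('node', axis, split, first_child)
        emit(left, first_child)             # depth first: left subtree next
        emit(right, first_child + 1)

    emit(tree, 0)
    return nodes


tree = ('node', 0, 0.5,
        ('leaf', [1]),
        ('node', 1, 0.2, ('leaf', [2]), ('leaf', [3])))
flat = flatten(tree)
```

The recursion keeps each subtree contiguous in the array, so a ray that descends a few levels tends to touch neighboring memory rather than nodes scattered across the heap.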

Other Data
Memory should be separated by rate of access
– Frames
– << Pixels
– << Samples [ Ray Trees ]
– << Rays [ Shading (not quite) ]
– << Triangle intersections
– << Tree traversal steps
Example: pre-processed triangle, shading info…

Fast Ray Tracing w/ kD-Trees adaptive – build a cost-optimized kD-tree w/ the surface area heuristic compact – use an 8-byte node – lay out your memory in a cache-friendly way cheap traversal

kD-Tree Traversal Step [Figure; labels: split, t_split, t_min, t_max]

kD-Tree Traversal Step
Given: ray P & iV (1/V), t_min, t_max, split_location, split_axis
t_at_split = ( split_location - ray->P[split_axis] ) * ray->iV[split_axis]
if t_at_split > t_min
    need to test against near child
if t_at_split < t_max
    need to test against far child
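
The step above is small enough to write out directly. This is a sketch with assumed names (`origin` for P, `inv_dir` for iV); it returns the two decisions instead of descending, so the one subtract and one multiply are easy to see.

```python
def traversal_step(origin, inv_dir, t_min, t_max, split_location, split_axis):
    """One kD-tree traversal step: one subtract, one multiply, two compares."""
    t_at_split = (split_location - origin[split_axis]) * inv_dir[split_axis]
    visit_near = t_at_split > t_min   # some of [t_min, t_max] lies before the plane
    visit_far = t_at_split < t_max    # some of [t_min, t_max] lies beyond the plane
    return t_at_split, visit_near, visit_far


origin = (0.0, 0.0, 0.0)
inv_dir = (1.0, 1.0, 1.0)             # reciprocal of direction (1, 1, 1)
both = traversal_step(origin, inv_dir, 0.0, 5.0, 2.0, 0)
near_only = traversal_step(origin, inv_dir, 0.0, 5.0, 7.0, 0)
far_only = traversal_step(origin, inv_dir, 0.0, 5.0, -1.0, 0)
```

Which child counts as "near" depends on the sign of the ray direction along the split axis; that choice is left to the caller here.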

Optimize Your Inner Loop kD-Tree traversal is the most critical kernel – It happens about a zillion times – It’s tiny – Sloppy coding will show up Optimize, Optimize, Optimize – Remove recursion and minimize stack operations – Other standard tuning & tweaking

kD-Tree Traversal
while ( not a leaf )
    t_split = ( split_location - ray->P[split_axis] ) * ray->iV[split_axis]
    if t_split <= t_min
        continue with far child // hit either far child or none
    if t_split >= t_max
        continue with near child // hit near child only
    // hit both children
    push (far child, t_split, t_max) onto stack
    continue with (near child, t_min, t_split)
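
A runnable sketch of the loop above, with recursion replaced by an explicit stack. The node tuples, the choice of near and far child from the sign of the ray direction, and the restriction to rays with no zero direction component are assumptions made to keep the example short. Instead of intersecting triangles, it just records which leaves are reached, in order.

```python
def traverse(nodes, origin, direction, t_min, t_max):
    """Return leaf payloads in front-to-back order along the ray.

    nodes[i] is ('leaf', payload) or ('node', axis, split, left, right),
    where left covers coordinates below the split plane.
    """
    inv_dir = [1.0 / d for d in direction]   # assumes no zero components
    stack = []
    visited = []
    node = 0
    while True:
        while nodes[node][0] != 'leaf':
            _, axis, split, left, right = nodes[node]
            t_split = (split - origin[axis]) * inv_dir[axis]
            near, far = (left, right) if direction[axis] > 0 else (right, left)
            if t_split <= t_min:
                node = far                    # far child or nothing
            elif t_split >= t_max:
                node = near                   # near child only
            else:
                stack.append((far, t_split, t_max))   # both: defer far child
                node, t_max = near, t_split
        visited.append(nodes[node][1])
        if not stack:
            return visited
        node, t_min, t_max = stack.pop()


# Root splits x at 2; its right child splits y at 2.
nodes = [
    ('node', 0, 2.0, 1, 2),
    ('leaf', 'L'),
    ('node', 1, 2.0, 3, 4),
    ('leaf', 'RB'),
    ('leaf', 'RT'),
]
forward = traverse(nodes, (0.0, 1.0, 1.0), (1.0, 0.5, 0.1), 0.0, 4.0)
backward = traverse(nodes, (4.0, 0.5, 1.0), (-1.0, 0.5, 0.1), 0.0, 4.0)
```

A real tracer would intersect the triangles in each leaf and stop as soon as it finds a hit inside the leaf's [t_min, t_max] interval, which is why front-to-back order matters.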

Can it go faster? How do you make fast code go faster? Parallelize it!

Ray Tracing and Parallelism Classic Answer: Ray-Tree parallelism – independent tasks – # of tasks = millions (at least) – size of tasks = thousands of instructions (at least) So this is wonderful, right?

Parallelism in CPUs Instruction-Level Parallelism (ILP) – pipelining, superscalar, OOO, SIMD – fine granularity (~100 instruction “window” tops) – easily confounded by unpredictable control – easily confounded by unpredictable latencies So…what does ray tracing look like to a CPU?

No joy in ILP-ville
At <1000 instruction granularity, ray tracing is anything but “embarrassingly parallel”
kD-Tree traversal (CPU view):
1) fetch a tiny fraction of a cache line from who knows where
2) do two piddling floating-point operations
3) do a completely unpredictable branch, or two, or three
4) repeat until frustrated
PS: Each operation is dependent on the one before it.
PPS: No SIMD for you! Ha!

Split Personality
Coarse-Grained parallelism (TLP) is perfect
– millions of independent tasks
– thousands of instructions per task
Fine-Grained parallelism (ILP) is awful
– look at a scale of <1000 instructions
  sequential dependencies
  unpredictable control paths
  unpredictable latencies
  no SIMD

Options Option #1: Forget about ILP, go with TLP – improve low-ILP efficiency and use multiple CPU cores Option #2: Let TLP stand in for ILP – run multiple independent threads (ray trees) on one core Option #3: Improve the ILP situation directly – how? Option #4: …

…All of the above! multi-core CPUs are already here (more coming) – better performance, better low-ILP performance – on the right performance curve multi-threaded CPUs are already here – improve well-written ray tracer by ~20-30% packet tracing – trace multiple rays together in a packet – bulk up the inner loop with ILP-friendly operations

Packet Tracing Very, very old idea from vector/SIMD machines – Vector masks Old way – if the ray wants to go left, go left – if the ray wants to go right, go right New way – if any ray wants to go left, go left with mask – if any ray wants to go right, go right with mask

Key Observations Doesn’t add “bad” stuff – Traverses the same nodes – Adds no global fetches – Adds no unpredictable branches What it does add – SIMD-friendly floating-point operations – Some messing around with masks Result: Very robust in relation to single rays
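
The "new way" can be sketched for a single traversal step as below. Plain Python lists stand in for SIMD registers and masks, so this only shows the logic, not the speedup. It assumes every ray in the packet has the same direction sign along the split axis, so they all agree on which child is near; the names are hypothetical.

```python
def packet_step(origins, inv_dirs, t_mins, t_maxs, active, split_location, split_axis):
    """One traversal step for a whole packet of rays.

    Returns per-ray t_split plus two masks: which active rays want the near
    child and which want the far child.
    """
    t_split = [
        (split_location - o[split_axis]) * d[split_axis]
        for o, d in zip(origins, inv_dirs)
    ]
    near_mask = [a and ts > lo for a, ts, lo in zip(active, t_split, t_mins)]
    far_mask = [a and ts < hi for a, ts, hi in zip(active, t_split, t_maxs)]
    return t_split, near_mask, far_mask


origins = [(0.0, 0.0, 0.0)] * 3
inv_dirs = [(1.0, 1.0, 1.0)] * 3
t_split, near_mask, far_mask = packet_step(
    origins, inv_dirs,
    [0.0, 0.0, 0.0],        # t_min per ray
    [5.0, 1.0, 5.0],        # t_max per ray
    [True, True, False],    # third ray is already masked off
    2.0, 0,
)
go_near = any(near_mask)    # if any ray wants to go near, go near with near_mask
go_far = any(far_mask)      # if any ray wants to go far, go far with far_mask
```

The packet makes one branch decision per child for all rays together, and rays that did not want that child simply ride along masked off.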

How many rays in a packet? Packet tracing gives us a “knob” with which to adjust computational intensity. Do natural SIMD width first Real answer is potentially much more complex – diminishing returns due to per-ray costs – lack of coherence to support big packets – register pressure, L1 pressure Makes hardware much more likely/possible

Fast Ray Tracing w/ kD-Trees Adaptive – build a cost-optimized tree (w/ surface area heuristic) Compact – use an 8-byte node – lay out your memory in a cache-friendly way Cheap traversal – optimize your inner loop – trace packets

Getting started… Read PBRT (yeah, I know, it’s 1300 pages) – great book, pretty decent kD-tree builder Read Ingo Wald’s thesis – lots of coding details for this stuff Track down the interesting references Learn SIMD programming (e.g. SSE intrinsics) Use a profiler. I mean it.

If you remember nothing else “Rays per Second” is measured in millions.