Introduction to Realtime Ray Tracing Course 41 Philipp Slusallek Peter Shirley Bill Mark Gordon Stoll Ingo Wald
Hardware for Realtime Ray Tracing Custom Hardware for Realtime Ray Tracing – Characteristics and requirements – RPU Design and Implementation GPU + Recursion + Custom Traversal HW – Programming Model – FPGA Prototype – Performance and Scalability
Ray Tracing on CPUs Characteristics – Commodity, well understood HW – High FP performance, yet still too slow – Limited parallelism, bulky clusters – Poor silicon usage (e.g. cache) Outlook – Multi-core designs are coming – Will still take too long
Ray Tracing on GPUs Characteristics – Very high raw FP performance – High degree of parallelism – Fast development cycle Stream programming model – Still too limited for efficient ray tracing No support for recursion Limited memory access
Ray Tracing Characteristics: kd-Tree Traversal One-dimensional computation along ray – Compute location of d relative to t_min / t_max – Iterate or recurse with updated t_min / t_max [Figure: three rays crossing a split plane at distance d, one per case] Near: t_min < t_max < d Both: t_min < d < t_max Far: d < t_min < t_max
Ray Tracing Characteristics: kd-Tree Traversal Inner traversal loop
    tmp = node.split - ray.origin
    d = tmp * 1/ray.direction
    near = d > t_min
    far = d < t_max
    if (near & far) push(node.far, d, t_max)
    if (near) iterate(node.near, t_min, d)
    else iterate(node.far, d, t_max)
Advantages of using kd-trees – Simple and fast traversal & building algorithm – Robust & very good handling of large scenes
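The inner loop above can be sketched in software as follows. This is a minimal illustration, not the RPU implementation: the `Node` layout, the `traverse` name and the choice of near/far child by the sign of the ray direction are assumptions made for the example. It also keeps t_max unchanged in the near-only case rather than passing d, and it collects every visited leaf instead of stopping at the first hit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical kd-tree node: inner nodes use axis/split, leaves use items."""
    axis: int = 0
    split: float = 0.0
    left: Optional["Node"] = None     # child below the split plane
    right: Optional["Node"] = None    # child above the split plane
    items: Optional[List[str]] = None  # set only on leaves


def traverse(root, origin, direction, t_min, t_max):
    """Return the item lists of all leaves the ray visits, front to back."""
    visited = []
    stack = []  # entries are (node, t_min, t_max) for far children still to visit
    node = root
    while True:
        while node.items is None:
            o, dv = origin[node.axis], direction[node.axis]
            if dv == 0.0:
                # Ray parallel to the plane: it stays on the origin's side.
                node = node.left if o < node.split else node.right
                continue
            # The ray moves from the near child to the far child as t grows.
            if dv > 0:
                near_child, far_child = node.left, node.right
            else:
                near_child, far_child = node.right, node.left
            d = (node.split - o) / dv   # distance to the split plane
            near = d > t_min
            far = d < t_max
            if near and far:
                stack.append((far_child, d, t_max))
                node, t_max = near_child, d
            elif near:
                node = near_child
            else:
                node = far_child
        visited.append(node.items)
        if not stack:
            return visited
        node, t_min, t_max = stack.pop()
```

Each pass through the inner `while` corresponds to one traversal step, which is the unit of work the custom traversal hardware is meant to finish in one clock tick.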
Ray Tracing Characteristics: kd-Tree Traversal Traversal Processing – kd-tree steps at 10 instructions/step: many instructions, many clock cycles – Serial dependency: low pipeline efficiency, stalls, latency – Limited but flexible control flow and memory access Custom HW unit – One clock tick per traversal step (fully pipelined) – Up to 100:1 improvement
Ray Tracing Characteristics: Intersection Intersection computation – Triggered by traversal at every leaf node Called with: ray and address of geometry – Option 1: Custom hardware [SaarCOR’05] – Option 2: Software on programmable processor Can be implemented efficiently Enables arbitrary programmable primitives Do not use costly dedicated hardware
Ray Tracing Characteristics: Shading Shading computation – Triggered by finished ray traversal Called with: ray, hit point, shader-id, address of parameters – Characteristics: General-purpose computation, many 3-/4-vectors Needs support for efficient texture and memory access Needs support for arbitrary recursive ray tracing – E.g. support for dependent rays Main feature of ray tracing: Do not put limits on it
Ray Tracing Characteristics: Coherence Ray coherence – Neighboring primary rays Traverse very similar kd-tree nodes in the same order Often hit the same geometric primitives Often execute the same shader, access the same textures, … – Similar for shadow rays to one light source – Often (but not always) applies to secondary rays HW should take advantage of this coherence
Previous Work SaarCOR I – Fixed function ray tracing chip [GH’05]
RPU Approach Take GPUs as basis and core component – Highly parallel, highly efficient Improve programming model – Add efficient recursion, conditionals – Add memory access options Add custom traversal unit – Slave to RPU – Performs indirect, data-dependent function calls
RPU Design Shader Processing Units (SPU) – General purpose computation – For shading, geometry, lighting computations – Operates on 4-component vectors – Integer and float – Dual issue, split vector – GPU-like instruction set – Arbitrary read/write – Texture addressing mode – No texture filtering in HW (done in SW)
RPU Design Shader Processing Units (SPU) Custom Ray Traversal Unit (TPU) -Efficient traversal of k-D trees -Communicates with SPU over dedicated registers
RPU Design Shader Processing Units (SPU) Custom Ray Traversal Unit (TPU) Multi-Threading -Increases usage of HW resources -Hides latency due to -Memory access -Instruction dependencies -Long traversal operations -Separate thread pool for SPU & TPU -Software scheduling (compiler) -No overhead for switching threads -Increases resources (mainly register file)
RPU Design Shader Processing Units (SPU) Custom Ray Traversal Unit (TPU) Multi-Threading Chunking – SIMD execution (SPUs & TPUs) – Takes advantage of coherence – Reduces hardware complexity – Can combine memory requests – Reduces external bandwidth – Must allow for incoherence – Chunks may split at conditionals – Inactive sub-chunk put on stack – Masked execution – Worst case: serial computation
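To make the chunk-splitting idea concrete, here is a small software model of masked SIMD execution at a single conditional. It is only a sketch: `run_chunk` and its arguments are hypothetical names, the chunk is a plain Python list, and the real hardware details (chunk size, stack layout) are not modelled.

```python
def run_chunk(values, cond, then_fn, else_fn):
    """Evaluate `then_fn(v) if cond(v) else else_fn(v)` for a whole chunk.

    Returns (results, passes), where `passes` counts how many masked
    executions were needed: 1 for a coherent chunk, 2 after a split.
    """
    n = len(values)
    results = [None] * n
    taken = [bool(cond(v)) for v in values]
    not_taken = [not t for t in taken]

    stack = []  # parked sub-chunks: (branch function, activity mask)
    if any(taken) and any(not_taken):
        # Incoherent chunk: split it and park the inactive sub-chunk.
        stack.append((else_fn, not_taken))
        active = (then_fn, taken)
    elif any(not_taken):
        active = (else_fn, not_taken)
    else:
        active = (then_fn, taken)

    passes = 0
    while active is not None:
        fn, mask = active
        passes += 1
        for lane in range(n):
            if mask[lane]:          # masked execution: idle lanes do nothing
                results[lane] = fn(values[lane])
        active = stack.pop() if stack else None
    return results, passes
```

A coherent chunk finishes in one pass, while a chunk whose lanes disagree needs two. With nested conditionals the splits compound, which is how the serial worst case arises.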
RPU Design Shader Processing Units (SPU) Custom Ray Traversal Unit (TPU) Multi-Threading Chunking Mailbox Processing (MPU) Per thread caching mechanism Avoids multiple processing of same kd-tree entry (e.g. triangle) 10x performance for some scenes
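The mailbox idea can be illustrated with a tiny per-ray cache of recently processed entry IDs. This is a sketch under assumptions: the `Mailbox` class, its capacity of 4 and the FIFO replacement are choices made for the example, not details given on the slide.

```python
from collections import deque


class Mailbox:
    """Hypothetical per-thread mailbox remembering the last few entry IDs."""

    def __init__(self, capacity=4):
        self._recent = deque(maxlen=capacity)  # oldest ID drops out when full

    def should_process(self, entry_id):
        """Return False if this entry was processed recently, else record it."""
        if entry_id in self._recent:
            return False
        self._recent.append(entry_id)
        return True


def count_intersections(leaves, mailbox=None):
    """Count intersection tests for one ray visiting `leaves` in order.

    Each leaf is a list of entry IDs (e.g. triangle IDs); an entry that
    overlaps several leaves appears in each of them.
    """
    tests = 0
    for leaf in leaves:
        for entry_id in leaf:
            if mailbox is None or mailbox.should_process(entry_id):
                tests += 1
    return tests
```

For a ray that visits leaves `[[1, 2], [2, 3], [3, 1]]`, the version without a mailbox performs six tests, while a mailbox of capacity 4 reduces this to three. A mailbox that is too small forgets entries and gives back part of the saving.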
RPU Architecture
SPU Vector Registers All registers have 4 components (float or integer) R0 to R15: General registers – Index into a HW managed register stack – Allows for single-cycle function call P0 to P15: shader parameters I0 to I3: data read from memory A = (A0,A1,A2,A3) – Memory addressing ORG, DIR,... – TPU communication registers
Instruction Set of SPU – Ray traversal operation: trace – Conditional instructions (paired): if jmp label, if call, if return – Dual issue (pairing): 3/1 and 2/2 arithmetic splitting; arithmetic + load; arithmetic + conditional jump, call, return – Short vector instruction set: mov, add, mul, mad, frac, dph2, dp3, dph3, dp4 – Input modifiers: swizzling, negation, masking, multiply with power of 2 – Special operations (modifiers): rcp, rsq, sat – Fast 2D texture lookups: texload, texload4x – Read from and write to memory: load, load4x, store
Ray Triangle Intersection: Unit-Triangle Test
    ; load triangle transformation
    load4x A.y,0
    ; transform ray
    dp3_rcp R7.z,I2,R3
    dp3 R7.y,I1,R3
    dp3 R7.x,I0,R3
    dph3 R6.x,I0,R2
    dph3 R6.y,I1,R2
    dph3 R6.z,I2,R2
    ; compute hit distance
    mul R8.z,-R6.z,S.z + if z <0 return
    ; barycentric coordinates
    mad R8.xy,R8.z,R7,R6 + if or xy (<0 or >=1) return
    ; hit if u + v < 1
    add R8.w,R8.x,R8.y + if w >=1 return
    ; hit distance closer than last one?
    add R8.w,R8.z,-R4.z + if w >=0 return
    ; save hit information
    mov SID,I3.x + mov MAX,R8.z
    mov R4.xyz,R8 + return
Slide annotations: Input; Arithmetic (dot products); Multi-issue (arith. & cond.)
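For readers less familiar with the assembly, the same unit-triangle test can be written out in Python. This is a hedged sketch rather than a translation of the listing: the function name and argument layout are hypothetical, `rows` stands for the three 4-component rows of the affine transformation that maps the triangle onto the unit triangle, and the explicit check for a ray parallel to the triangle plane is an addition that does not appear in the assembly.

```python
def intersect_unit_triangle(rows, origin, direction, t_best):
    """Return (t, u, v) for a hit closer than t_best, or None.

    rows      -- three (a, b, c, d) rows of the triangle's affine transform
    origin    -- ray origin (x, y, z)
    direction -- ray direction (x, y, z)
    t_best    -- distance of the closest hit found so far
    """
    # Transform the ray: directions use only the 3x3 part (dp3),
    # points also add the translation column (dph3).
    d = [sum(r[i] * direction[i] for i in range(3)) for r in rows]
    o = [sum(r[i] * origin[i] for i in range(3)) + r[3] for r in rows]

    if d[2] == 0.0:
        return None  # assumption: treat rays parallel to the plane as misses

    # Hit distance to the plane z = 0 in triangle space.
    t = -o[2] / d[2]
    if t < 0:
        return None

    # Barycentric coordinates are the x and y of the transformed hit point.
    u = o[0] + t * d[0]
    v = o[1] + t * d[1]
    if u < 0 or v < 0 or u + v >= 1:
        return None

    # Keep the hit only if it is closer than the previous one.
    if t >= t_best:
        return None
    return t, u, v
```

With the identity transform, the triangle is the unit triangle in the z = 0 plane, so a ray from (0.25, 0.25, 1) along (0, 0, -1) hits at distance 1 with u = v = 0.25.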
Shader Processing Unit Pipelining [Figure: SPU pipeline diagram. Stages shown: read instruction; read 3 source registers; swizzling; masking; writeback; memory access with writeback to I0 – I3; clamp; thread control; branching; stack control; RCP, RSQ. Example instructions in the pipeline: mov R0,R1; mov R2,R3; mov R0,R2.]
RPU Programming Model [Figure: shader call graph. SPU processing: Primary Ray Shader, Top-Level Object Intersector, Geometry Intersector, Surface/BRDF Shader, Lighting Shader, Light Source Shaders. TPU/MPU processing: primary ray, secondary rays, shadow rays. Legend: ↨ direct function calls; ↔ indirect function calls via TPU.]
RPU Programming Model Threads are started for each pixel Registers initialized from an input stream – 2D Hilbert curve generator sampling the screen – Memory stream for multi-pass Shader computes ray
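A Hilbert curve visits pixels so that consecutive threads cover neighboring screen positions, which helps keep rays in a chunk coherent. The following sketch uses the standard index-to-coordinate conversion for a Hilbert curve on a square grid. It is not the hardware generator, and the function name and the power-of-two square grid are assumptions for illustration.

```python
def hilbert_d2xy(order, d):
    """Map index d in [0, 4**order) to (x, y) on a 2**order by 2**order grid."""
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/flip the quadrant so the sub-curves join up.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def pixel_stream(order):
    """Yield pixel coordinates in Hilbert order, one per thread to start."""
    for d in range(4 ** order):
        yield hilbert_d2xy(order, d)
```

Consecutive pixels in this stream are always direct neighbors, so a chunk of consecutive threads covers a compact screen region.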
RPU Programming Model Shooting Primary Rays – Ray traversal performed on the TPU – Started in top-level kd-tree – Intersector transforms ray into local coordinate system
RPU Programming Model Shooting Primary Rays (II) – Transformed ray traversed through object-level kd-tree on TPU – Geometry intersection performed on programmable SPU – Programmable geometry: triangles, spheres, bicubic splines, quadrics, …
RPU Programming Model Surface shading performed on programmable SPU – Surface shader is called directly from primary shader – Arguments passed on HW stack – May trace secondary rays at any time: reflection, refraction, … – Writing shaders is easy due to global access to the scene and physically-based computation
RPU Programming Model Light properties and illumination can be abstracted using function calls – Illumination shader iterates over all light sources – For each light source, a light source shader is called
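The abstraction described here, an illumination shader that calls one light source shader per light, might look like this in software. Everything named below is hypothetical: `point_light`, `illuminate`, the Lambertian weighting and the `occluded` callback that stands in for tracing a shadow ray are assumptions made for the sketch.

```python
import math


def point_light(position, intensity):
    """Build a light source shader for a point light.

    The returned function maps a hit point to
    (unit direction to the light, distance, incoming radiance).
    """
    def shader(p):
        to_light = [position[i] - p[i] for i in range(3)]
        dist = math.sqrt(sum(c * c for c in to_light))
        direction = [c / dist for c in to_light]
        return direction, dist, intensity / (dist * dist)
    return shader


def illuminate(p, normal, lights, occluded):
    """Illumination shader: sum the diffuse contribution of every light.

    occluded(p, direction, dist) stands in for a shadow ray and returns
    True when something blocks the light.
    """
    total = 0.0
    for light in lights:
        direction, dist, radiance = light(p)   # one function call per light
        cos_theta = sum(normal[i] * direction[i] for i in range(3))
        if cos_theta <= 0.0:
            continue                            # light is behind the surface
        if occluded(p, direction, dist):
            continue                            # shadow ray hit an occluder
        total += radiance * cos_theta
    return total
```

Because each light is just a callable, new light types can be added without touching the illumination shader, which is the point of abstracting lights behind function calls.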
Prototype Implementation
Prototype Performance FPGA prototype – Xilinx Virtex II 6000 – 128 MB DDR-RAM at 350 MB/s – PCI bus for up-/download (no VGA) Single RPU at only 66 MHz – Up to 4 million rays per second – Up to x384 – Same ray tracing performance as Intel 2.66 GHz
Scalability Larger Chunk Size – Less ray coherence – More data is accessed – Increased cache bandwidth – Larger caches
Scalability Larger Chunk Size Multiple RPUs on a Chip – Limited by VLSI technology – Memory bandwidth – Current GPUs versus the FPGA prototype: floating point units 50x, memory bandwidth 100x, clock rate 7x
Scalability Larger Chunk Size Multiple RPUs on a Chip Multiple chips on a board – Fast interconnect for data exchange – Cache sizes accumulate – Managed through virtual memory [Schmittler’2003] – Limited through external bandwidth due to scene changes
Scalability Larger Chunk Size Multiple RPUs on a Chip Multiple chips on a board Multiple boards in a PC – Similar to today’s PC clusters in a much smaller form factor
Video
Future Work Support for fully dynamic scenes – Vertex shader + building kd-trees Efficient photon mapping – kd-tree construction + kNN filtering OpenRT-API [Dietrich’03] ASIC prototype
Questions?