1
DRPU: A Programmable Hardware Architecture for Real-Time Ray Tracing of Coherent Dynamic Scenes
Sven Woop, Computer Graphics Lab, Saarland University
2
Overview Motivation: Why Ray Tracing? Previous Work DRPU Architecture
FPGA Prototype ASIC Performance Estimates Conclusion & Future Work
3
Why not Rasterization ... Primitive Operation: Rasterize Isolated Triangles Perfect for dynamic scenes Very simple operation (good for HW) Parallel processing of triangles and fragments (good for HW) No global access to the scene All Interesting Visual Effects Need 2+ Triangles (Shadows, Reflection, Global Illumination, …) Approximations via multiple pass approaches have many issues Difficult to Use Algorithm Very Fast Hardware Implementations
4
... but Ray Tracing? Primitive Operation: Trace a Ray
O(log n) traversal operation Demand driven Global Access to Scene Automatic combination of effects (orthogonal shaders) Recursive evaluation Physical Light Simulation Embarrassingly parallel (good for HW) Accurate and realistic images Easy to use Algorithm Low Performance
11
Previous Work Ray Tracers for Static Scenes
CPU based: [OpenRT], [MLRT SIGGRAPH05] GPU based: Purcell (Grids) [SIGGRAPH02], Foley et al. (KD Trees) [GH05] Stefan Popov (Stackless KD Tree traversal) [EG07] Custom Hardware: ART-VPS (AR350 Chip for offline rendering) Schmittler (SaarCOR) [GH04] Woop (RPU) [SIGGRAPH05] Ray Tracers for Dynamic Scenes CPU based: Wald (Grids) [SIGGRAPH06] Wald (AABVHs) [TOG / Tech. Rep. 2006] Wächter and Keller (BIH) [EG06] Johannes Günther (Motion Decomposition) [EG06] Custom Hardware: Woop (B-KD Trees) [GH06] Woop (DRPU-ASIC) [RT06]
12
Why isn’t everybody using Ray Tracing …
Low Performance High computational complexity 1 million pixels (minimal) 30 frames per second (minimal) 10 rays per pixel (minimal) At least 300 million rays per second 24 billion traversal steps (80 trav. steps per ray) 240 billion instructions (10 instructions per traversal step) 0.5 trillion (5E11) cycles (instruction dependencies) Limited Support for Dynamic Scenes Due to the need for spatial index structures (costly rebuild, O(n log n)) But most graphics applications are highly dynamic (e.g. computer games)
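The workload estimate on this slide can be reproduced with a few multiplications. The sketch below uses only the slide's own figures; the constant and function names are illustrative assumptions. The slide's 5E11 cycles then corresponds to roughly two cycles per instruction, which it attributes to instruction dependencies.

```python
# Back-of-the-envelope workload estimate using the figures from the slide.
PIXELS = 1_000_000             # pixels per frame
FPS = 30                       # frames per second
RAYS_PER_PIXEL = 10
TRAVERSAL_STEPS_PER_RAY = 80
INSTRUCTIONS_PER_STEP = 10


def workload():
    """Return (rays, traversal steps, instructions) per second."""
    rays = PIXELS * FPS * RAYS_PER_PIXEL
    steps = rays * TRAVERSAL_STEPS_PER_RAY
    instructions = steps * INSTRUCTIONS_PER_STEP
    return rays, steps, instructions


rays, steps, instructions = workload()
print(f"{rays:.1e} rays/s, {steps:.1e} steps/s, {instructions:.1e} instructions/s")
```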
13
… and what can be done? Hardware Implementation (DRPU)
High performance through dedicated hardware units A high end ASIC implementation would provide enough performance for computer games using RT (about 200 million rays/s) Algorithmic Changes B-KD Trees as spatial index structure Supports most kinds of dynamic scenes
14
DRPU Architecture Task Parallelism Optimized Hardware Units
vertices from memory
15
DRPU Architecture Rendering Units
Synchronous execution of packets of 4 rays Memory bandwidth reduction (combining) Sharing of HW (e.g. caches) Highly multi-threaded Higher hardware usage First level caches Memory bandwidth reduction Memory latency reduction vertices from memory
16
DRPU Hardware Architecture
vertices from memory
17
DRPU Architecture Programmable Shading Processor Fully programmable
In-order execution 4-component SIMD operations Similar Instruction set to GPUs, but: Efficient recursion Flexible memory access Programming Model Material shading Ray generation tasks Calls Ray Casting Units to cast rays vertices from memory
18
DRPU Architecture Programmable Shading Unit Ray Casting Units
Find closest intersection of a ray with the scene High-performance traversal and intersection Implement the atomic “trace” instruction of Shading Processor SP can continue scheduling instruction not dependent on intersection result vertices from memory
19
DRPU Architecture Programmable Shading Unit Ray Casting Units
Traversal Processor B-KD Tree approach vertices from memory
20
Definition of B-KD Trees
B-KD Tree (Bounded KD-Tree) Binary Tree 1D bounding intervals (or slabs) for each child Leaf nodes point to a single primitive Bounding Volume Hierarchy (subdivides geometry)
21
B-KD Tree Semantics B-KD Tree (Bounded KD-Tree) B(T)
Each node T can be assigned a box B(T) B(T)
22
B-KD Tree Semantics B-KD Tree (Bounded KD-Tree) B(T)
Hright(min_1) = { (x,y,z) | x >= min_1 } B(T)
23
B-KD Tree Semantics B-KD Tree (Bounded KD-Tree)
Hright(min_1) = { (x,y,z) | x >= min_1 } Hleft(max_1) = { (x,y,z) | x <= max_1 }
24
B-KD Tree Semantics B-KD Tree (Bounded KD-Tree) B(T)
B(T) ∩ Hright(min_1) ∩ Hleft(max_1)
25
B-KD Tree Semantics B-KD Tree (Bounded KD-Tree) B(T) B(T0)
B(Troot) = R³
B(T0) = B(T) ∩ Hright(min_0) ∩ Hleft(max_0)
B(T1) = B(T) ∩ Hright(min_1) ∩ Hleft(max_1)
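A minimal sketch of these box semantics, assuming axis-aligned boxes stored as a pair of corner points: the root box is all of R³, and each child box is the parent box clipped to the child's 1D bounding interval along the node's split axis. The names `Box`, `ROOT_BOX` and `child_box` are illustrative and do not come from the slides.

```python
from dataclasses import dataclass

INF = float("inf")


@dataclass(frozen=True)
class Box:
    lo: tuple  # (x, y, z) minimum corner
    hi: tuple  # (x, y, z) maximum corner


# B(Troot) = R^3: unbounded in every direction.
ROOT_BOX = Box((-INF, -INF, -INF), (INF, INF, INF))


def child_box(parent, axis, interval_min, interval_max):
    """Clip the parent box to interval_min <= p[axis] <= interval_max."""
    lo = list(parent.lo)
    hi = list(parent.hi)
    lo[axis] = max(parent.lo[axis], interval_min)  # half-space p[axis] >= min
    hi[axis] = min(parent.hi[axis], interval_max)  # half-space p[axis] <= max
    return Box(tuple(lo), tuple(hi))
```

Applying `child_box` once per level along the path from the root yields B(T) for any node T.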
26
B-KD Tree Example
30
B-KD Tree Example Boxes may Overlap
More traversal steps than for KD Trees Support for dynamic scenes
32
Traversal of B-KD Trees
Interval Algorithm B(T)
33
Traversal of B-KD Trees
Interval Algorithm Early ray termination B(T)
34
Traversal of B-KD Trees
Interval Algorithm Early ray termination Compute Distances
36
Traversal of B-KD Trees
Interval Algorithm Early ray termination Compute Distances Clipping of near/far interval against both bounding intervals Simple min/max operations
37
Traversal of B-KD Trees
Interval Algorithm Early ray termination Compute Distances Clipping of near/far interval against both bounding intervals Take closer child, push farther child to stack Traversal order does not affect correctness
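The interval algorithm can be sketched for a single ray as follows. This is a software illustration under stated assumptions, not the hardware design: the node layout (`Leaf`, `Inner`), the `intersect` callback and the toy plane primitive in the example are all hypothetical. Each step clips the ray's near/far interval against both children's 1D bounding intervals with min/max operations, visits the closer child first, and pushes the farther child onto a stack.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    prim: object          # a single primitive


@dataclass
class Inner:
    axis: int             # 0, 1 or 2
    bounds: tuple         # ((min0, max0), (min1, max1)), one interval per child
    children: tuple       # (child0, child1)


def clip(origin, direction, axis, lo, hi, near, far):
    """Clip the ray interval [near, far] against the slab lo <= p[axis] <= hi."""
    o, d = origin[axis], direction[axis]
    if d == 0.0:
        # Ray parallel to the slab: keep the interval or return an empty one.
        return (near, far) if lo <= o <= hi else (1.0, 0.0)
    t0, t1 = (lo - o) / d, (hi - o) / d
    if t0 > t1:
        t0, t1 = t1, t0
    return max(near, t0), min(far, t1)


def trace(root, origin, direction, intersect, near=0.0, far=float("inf")):
    """Return (t, primitive) of the closest hit, or None if nothing is hit."""
    best_t, best_prim = far, None
    stack = [(root, near, far)]
    while stack:
        node, n, f = stack.pop()
        f = min(f, best_t)                # early ray termination
        if n > f:
            continue
        if isinstance(node, Leaf):
            t = intersect(node.prim, origin, direction)
            if t is not None and near <= t < best_t:
                best_t, best_prim = t, node.prim
            continue
        hits = []
        for child, (lo, hi) in zip(node.children, node.bounds):
            cn, cf = clip(origin, direction, node.axis, lo, hi, n, f)
            if cn <= cf:
                hits.append((cn, cf, child))
        hits.sort(key=lambda h: h[0], reverse=True)   # farther child first
        for cn, cf, child in hits:
            stack.append((child, cn, cf))             # closer child ends on top
    return None if best_prim is None else (best_t, best_prim)


# Toy example: each primitive is an infinite plane x = c (hypothetical).
def hit_plane(c, origin, direction):
    if direction[0] == 0.0:
        return None
    t = (c - origin[0]) / direction[0]
    return t if t >= 0.0 else None


tree = Inner(0, ((1.0, 1.0), (3.0, 3.0)), (Leaf(1.0), Leaf(3.0)))
print(trace(tree, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), hit_plane))
```

Visiting the closer child first only improves the chance of early termination; as the slide notes, the traversal order does not affect correctness.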
38
Traversal Processor Stack control computes next address 36 FPUs
39
Traversal Processor 36 FPUs Stack control computes next address
Next node is fetched from cache 36 FPUs
40
Traversal Processor 36 FPUs Stack control computes next address
Next node is fetched from cache 4 traversal slices compute 4x4 distances to bounding planes 36 FPUs
41
Traversal Processor 36 FPUs Stack control computes next address
Next node is fetched from cache 4 traversal slices compute 4x4 distances to bounding planes 4 Decision Units compute per ray traversal decision 36 FPUs
42
Traversal Processor Stack control computes next address
Next node is fetched from cache 4 traversal slices compute 4x4 distances to bounding planes 4 Decision Units compute per-ray traversal decision Packet Decision Unit computes packet traversal decision Packet goes left if there exists a ray that goes left Packet goes right if there exists a ray that goes right Packet goes from left to right if there exists a ray that goes into both children from left to right
43
Traversal Processor Stack control computes next address
Next node is fetched from cache 4 traversal slices compute 4x4 distances to bounding planes 4 Decision Units compute per-ray traversal decision Packet Decision Unit computes packet traversal decision Packet goes left if there exists a ray that goes left Packet goes right if there exists a ray that goes right Packet goes from left to right if there exists a ray that goes into both children from left to right Incoherent packets possible
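The three packet rules can be sketched as a small function over the per-ray decisions. The string encoding of a per-ray decision is a hypothetical choice for illustration. The slide only states when a packet goes from left to right, so falling back to right-to-left in all other two-child cases is an assumption.

```python
def packet_decision(ray_decisions):
    """Combine per-ray decisions into an ordered list of children to visit.

    Each per-ray decision is one of: "none", "left", "right",
    "left_right" (both children, left first), "right_left" (both, right first).
    """
    visits_left = {"left", "left_right", "right_left"}
    visits_right = {"right", "left_right", "right_left"}
    go_left = any(d in visits_left for d in ray_decisions)
    go_right = any(d in visits_right for d in ray_decisions)
    if go_left and go_right:
        # Left first if any ray enters both children from left to right.
        left_first = any(d == "left_right" for d in ray_decisions)
        return ["left", "right"] if left_first else ["right", "left"]
    if go_left:
        return ["left"]
    if go_right:
        return ["right"]
    return []


print(packet_decision(["left", "right", "none", "none"]))
```

A packet whose rays disagree visits both children even though no single ray needs both, which is one way to read "incoherent packets possible".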
44
DRPU Architecture Programmable Shading Unit Ray Casting Units
Traversal Processor Geometry Processor Ray transformations Vertex-based ray/triangle intersection [Möller Trumbore] Solve linear system of equations with 3 unknowns Shared vertices save memory 6x 1 ray/triangle intersection every 2 cycles 38 floating point units vertices from memory
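For reference, a software sketch of the Möller-Trumbore test named on this slide. It solves orig + t*d = (1-u-v)*v0 + u*v1 + v*v2 for the three unknowns (t, u, v) directly from the triangle's vertices. This is the textbook formulation, not the Geometry Processor's datapath; the helper names and the `eps` tolerance are assumptions.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def intersect_triangle(orig, d, v0, v1, v2, eps=1e-9):
    """Return (t, u, v) for a hit in front of the origin, else None."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(d, e2)
    det = dot(e1, p)
    if abs(det) < eps:            # ray is parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    s = sub(orig, v0)
    u = dot(s, p) * inv_det       # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(d, q) * inv_det       # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv_det      # distance along the ray
    return (t, u, v) if t > eps else None


tri = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
print(intersect_triangle((0.25, 0.25, 1.0), (0.0, 0.0, -1.0), *tri))
```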
45
DRPU Architecture Programmable Shading Unit Ray Casting Units
Scene Changes Skinning Processor Skeleton Subspace Deformation Re-uses Geometry Unit 4 additional floating point units Pure stream architecture vertices from memory
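Skeleton subspace deformation (also known as linear blend skinning) moves each vertex by a weighted sum of bone transformations applied to its rest position. A minimal sketch, assuming 3x4 affine bone matrices and per-vertex weights that sum to 1; the function names and data layout are illustrative only.

```python
def transform(m, v):
    """Apply a 3x4 affine matrix (three rows of four floats) to a point."""
    x, y, z = v
    return tuple(r[0] * x + r[1] * y + r[2] * z + r[3] for r in m)


def skin_vertex(rest_pos, influences):
    """Blend bone transforms: sum of weight * (bone_matrix applied to rest_pos)."""
    out = [0.0, 0.0, 0.0]
    for weight, bone_matrix in influences:
        p = transform(bone_matrix, rest_pos)
        for i in range(3):
            out[i] += weight * p[i]
    return tuple(out)


IDENTITY = ((1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0), (0.0, 0.0, 1.0, 0.0))
SHIFT_X = ((1.0, 0.0, 0.0, 2.0), (0.0, 1.0, 0.0, 0.0), (0.0, 0.0, 1.0, 0.0))
print(skin_vertex((1.0, 1.0, 1.0), [(0.5, IDENTITY), (0.5, SHIFT_X)]))
```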
46
B-KD Trees for Dynamic Scenes
B-KD Tree Approach Initially build B-KD tree O(n log n) Update after each frame O(n) Updating Works well for Continuous motion where structure of motion matches tree structure E.g. skinned meshes, characters, water surfaces, ... Not Optimal for Random motions, turbulence However amortizing O(n log n) reconstruction over many frames is feasible
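The O(n) per-frame update can be sketched as a bottom-up refit: the tree topology stays fixed, and only the bounding intervals are recomputed from the moved vertices. The dictionary-based node layout below is a hypothetical software stand-in, and the recursion is used for brevity only.

```python
def refit(node, vertices):
    """Recompute bounds bottom-up; return the subtree's full box (lo, hi)."""
    if "tri" in node:
        # Leaf: merge the triangle's three vertices into a box.
        pts = [vertices[i] for i in node["tri"]]
        lo = tuple(min(p[k] for p in pts) for k in range(3))
        hi = tuple(max(p[k] for p in pts) for k in range(3))
        return lo, hi
    boxes = [refit(child, vertices) for child in node["children"]]
    axis = node["axis"]
    # Store only the 1D interval of each child along the node's axis.
    node["bounds"] = [(lo[axis], hi[axis]) for lo, hi in boxes]
    # Merge the two child boxes into this node's box.
    lo = tuple(min(b[0][k] for b in boxes) for k in range(3))
    hi = tuple(max(b[1][k] for b in boxes) for k in range(3))
    return lo, hi


vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
            (2.0, 0.0, 0.0), (3.0, 0.0, 0.0), (2.0, 1.0, 0.0)]
tree = {"axis": 0, "bounds": None,
        "children": [{"tri": (0, 1, 2)}, {"tri": (3, 4, 5)}]}
root_box = refit(tree, vertices)
print(tree["bounds"], root_box)
```

Every node is visited once, so the cost is linear in the tree size. If the motion does not match the tree structure, the refitted boxes grow and overlap, which is the case where an occasional O(n log n) rebuild pays off.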
47
Examples Bounding Approaches Perform well for Continuous motion
Structure of motion must match tree structure E.g. skinned meshes, characters, water surfaces, ...
53
Examples Bounding Volume Approaches are less Efficient for
Non-continuous motion Structure of motion does not match tree structure High traversal cost due to large overlapping boxes
57
DRPU Architecture Programmable Shading Unit Ray Casting Units
Scene Changes Skinning Processor Update Processor In-order execution 32 bit instructions Precomputed Instruction Stream Load vertex, merge 3 vertices, merge 2 boxes ¼ more memory (#vertices + #nodes) instructions One B-KD node update every two clock cycles (peak) vertices from memory
58
FPGA Implementation Hardware Implementation Virtex4 Board
HWML Hardware Description Xilinx Virtex4 LX160 66 MHz clock frequency 1.0 GB/s memory bandwidth 7.5 Gflops (113 floating point units) 2.3 Gflops programmable 5.2 Gflops fixed function Implementation Packets of 4 rays 32 packets of rays 3x 8 KB caches, direct mapped 24 bit floating point Virtex4 Board
59
Video
60
ASIC Implementation Implementation Differences Synthesis Place & Route
Larger caches (3x 16 KB, 4-way associative) 32 bit floating point Synthesis Synopsys Synthesis UMC 130nm CMOS process Place & Route Cadence Encounter Manual placements to achieve good results Only DRPU Core No chip interface designed (PCI Express, DRAM, ...) DRPU-ASIC
61
DRPU-ASIC Hardware Very Efficient Fixed Function Units
UMC 130nm CMOS process 49 mm² 266 MHz clock 2.1 GB/s bandwidth 30 Gflops 10 Gflops programmable 20 Gflops fixed function Very Efficient Fixed Function Units GP via SP: 5x smaller area, 3x higher performance 15 times more efficient (performance per area) 7 mm × 7 mm
62
DRPU8-ASIC Hardware 90nm CMOS process
extrapolated using constant field scaling 186 mm² die 400 MHz clock speed 25.6 GB/s bandwidth 361 Gflops 110 Gflops programmable 471 Gflops fixed function About million shaded rays per second 19.3 mm × 9.6 mm
63
Results at 1024x768 with shadows
64
Conclusion and Future Work
Efficient Hardware Ray Tracing is Possible Performance levels sufficient for computer games could be achieved Even support for Dynamic Scenes Ray Tracing ready to replace rasterization? But Still Open Questions Anti-aliasing (many rays per pixel) Arbitrary dynamics (reconstruction) What about advanced global illumination (e.g. photon mapping) ?
65
Questions?
66
Instruction Set of Shading Processor
Short vector instruction set mov, add, mul, mad, frac dph2, dp3, dph3, dp4 Input modifiers Swizzling, negation, masking Multiply with power of 2 Special operations (modifiers) rcp, rsq, sat Fast 2D texture lookups texload, texload4x Read from and write to memory load, load4x, store Ray traversal operation trace Conditional instructions (paired) if <condition> jmp label if <condition> call <fun> if <condition> return Dual issue (pairing) 3/1 and 2/2 arithmetic splitting Arithmetic + load Arithmetic + conditional jump, call, return
72
Hardware Description Problem HWML
Most hardware description languages operate on a low abstraction level (e.g. VHDL, Verilog, …) High level languages are behavioral (Handel-C, Mitrion-C, …) no reliable mapping to hardware Need high-level structural language HWML Structural hardware description Implemented as an SML library Design add, mul, fadd, fmul, rcp, rsq, … Allows a compact description of HW algorithms, e.g.: 8000 LOC for the entire DRPU 160 LOC for a full implementation of the Tomasulo algorithm
73
HWML Features Functional Recursive circuit descriptions
Circuit descriptions are SML functions Functions can operate on circuits (e.g. arbitrary reductions) Recursive circuit descriptions Important for the implementation of arithmetic units (e.g. adders) Abstract Data Types Polymorphic functions (e.g. a single FIFO operates on different types of data) Allows for fully parameterized designs (e.g. change floating point precision) Data Stream Abstraction Only one communication protocol in the complete chip Automatic pipelining of circuits (higher order operator) Automatically generates highly efficient implementation Atomic support for multiported (typed) memories Allows memories to be mapped efficiently to different platforms (e.g. memory compilers for CMOS processes) Generate FPGA and ASIC from one description
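The idea of a function that operates on circuits, such as an arbitrary reduction, can be illustrated with a plain recursive higher-order function. The sketch below is Python, not HWML or SML, and does not reproduce HWML's actual API: a "circuit" is modelled as a nested tuple, and `reduce_tree` is a hypothetical name.

```python
def reduce_tree(op, inputs):
    """Recursively combine inputs with a balanced binary tree of `op` nodes."""
    if not inputs:
        raise ValueError("reduce_tree needs at least one input")
    if len(inputs) == 1:
        return inputs[0]
    mid = len(inputs) // 2
    return op(reduce_tree(op, inputs[:mid]), reduce_tree(op, inputs[mid:]))


def adder(a, b):
    """Stand-in for instantiating an adder: records the structure as a tuple."""
    return ("add", a, b)


print(reduce_tree(adder, ["a", "b", "c", "d"]))
```

Because the combining operation is a parameter, the same recursive description yields an adder tree, a max tree, or any other reduction, with logarithmic depth in the number of inputs.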
74
Brute Force Ray Tracing Demands
Property         Standard Quality   Medium Quality   High Quality
Resolution       1024x768           1920x1080        1920x1080
FPS              30                 60               60
Rays per Pixel   10                 10               200
Total Rays/s     250M               1.2B             24.0B
...
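The totals follow from resolution × frame rate × rays per pixel. The sketch below assumes that the medium and high quality levels share the 1920x1080 resolution and 60 FPS, and that the standard and medium levels share 10 rays per pixel; this pairing is inferred from the totals and is not stated outright on the slide. The slide's totals appear to be rounded.

```python
# (width, height, frames per second, rays per pixel); the pairing is an assumption.
LEVELS = {
    "standard": (1024, 768, 30, 10),
    "medium": (1920, 1080, 60, 10),
    "high": (1920, 1080, 60, 200),
}


def rays_per_second(width, height, fps, rays_per_pixel):
    return width * height * fps * rays_per_pixel


for name, params in LEVELS.items():
    print(f"{name}: {rays_per_second(*params) / 1e9:.2f} billion rays/s")
```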