Coherent ray tracing via stream filtering christiaan gribble karthik ramani ieee/eurographics symposium on interactive ray tracing august 2008.

Slides:



Advertisements
Similar presentations
Sven Woop Computer Graphics Lab Saarland University
Advertisements

Christian Lauterbach COMP 770, 2/16/2009. Overview  Acceleration structures  Spatial hierarchies  Object hierarchies  Interactive Ray Tracing techniques.
Multidimensional Lightcuts Bruce Walter Adam Arbree Kavita Bala Donald P. Greenberg Program of Computer Graphics, Cornell University.
Data Marshaling for Multi-Core Architectures M. Aater Suleman Onur Mutlu Jose A. Joao Khubaib Yale N. Patt.
Physically Based Real-time Ray Tracing Ryan Overbeck.
IIIT Hyderabad Hybrid Ray Tracing and Path Tracing of Bezier Surfaces using a mixed hierarchy Rohit Nigam, P. J. Narayanan CVIT, IIIT Hyderabad, Hyderabad,
Coherent ray tracing via stream filtering christiaan gribble karthik ramani ieee/eurographics symposium on interactive ray tracing august 2008.
Zhao Dong 1, Jan Kautz 2, Christian Theobalt 3 Hans-Peter Seidel 1 Interactive Global Illumination Using Implicit Visibility 1 MPI Informatik Germany 2.
Interactive Rendering using the Render Cache Bruce Walter, George Drettakis iMAGIS*-GRAVIR/IMAG-INRIA Steven Parker University of Utah *iMAGIS is a joint.
GPGPU Introduction Alan Gray EPCC The University of Edinburgh.
Cost-based Workload Balancing for Ray Tracing on a Heterogeneous Platform Mario Rincón-Nigro PhD Showcase Feb 17 th, 2012.
Paper Presentation - An Efficient GPU-based Approach for Interactive Global Illumination- Rui Wang, Rui Wang, Kun Zhou, Minghao Pan, Hujun Bao Presenter.
Afrigraph 2004 Interactive Ray-Tracing of Free-Form Surfaces Carsten Benthin Ingo Wald Philipp Slusallek Computer Graphics Lab Saarland University, Germany.
Rasterization and Ray Tracing in Real-Time Applications (Games) Andrew Graff.
Distributed Interactive Ray Tracing for Large Volume Visualization Dave DeMarle Steven Parker Mark Hartner Christiaan Gribble Charles Hansen.
Programming with CUDA, WS09 Waqar Saleem, Jens Müller Programming with CUDA and Parallel Algorithms Waqar Saleem Jens Müller.
Shadow Configurations: A Network Management Primitive Richard Alimi, Ye Wang, Y. Richard Yang Laboratory of Networked Systems Yale University.
11/14/05ELEC Fall Multi-processor SoCs Yijing Chen.
3D Graphics Processor Architecture Victor Moya. PhD Project Research on architecture improvements for future Graphic Processor Units (GPUs). Research.
GRAMPS: A Programming Model For Graphics Pipelines Jeremy Sugerman, Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan.
Many-Core Programming with GRAMPS Jeremy Sugerman Kayvon Fatahalian Solomon Boulos Kurt Akeley Pat Hanrahan.
Ray Tracing Dynamic Scenes using Selective Restructuring Sung-eui Yoon Sean Curtis Dinesh Manocha Univ. of North Carolina at Chapel Hill Lawrence Livermore.
Hybrid PC architecture Jeremy Sugerman Kayvon Fatahalian.
Fast Agglomerative Clustering for Rendering Bruce Walter, Kavita Bala, Cornell University Milind Kulkarni, Keshav Pingali University of Texas, Austin.
Interactive k-D Tree GPU Raytracing Daniel Reiter Horn, Jeremy Sugerman, Mike Houston and Pat Hanrahan.
Real-Time Ray Tracing 3D Modeling of the Future Marissa Hollingsworth Spring 2009.
RT08, August ‘08 Large Ray Packets for Real-time Whitted Ray Tracing Ryan Overbeck Columbia University Ravi Ramamoorthi Columbia University William R.
Pipelining. Overview Pipelining is widely used in modern processors. Pipelining improves system performance in terms of throughput. Pipelined organization.
Exploring the Tradeoffs of Configurability and Heterogeneity in Multicore Embedded Systems + Also Affiliated with NSF Center for High- Performance Reconfigurable.
Ray Tracing Primer Ref: SIGGRAPH HyperGraphHyperGraph.
Computer Graphics 2 Lecture x: Acceleration Techniques for Ray-Tracing Benjamin Mora 1 University of Wales Swansea Dr. Benjamin Mora.
Interactive Ray Tracing: From bad joke to old news David Luebke University of Virginia.
Ray Tracing and Photon Mapping on GPUs Tim PurcellStanford / NVIDIA.
Intel Architecture. Changes in architecture Software architecture: –Front end (Feature changes such as adding more graphics, changing the background colors,
Technology and Historical Overview. Introduction to 3d Computer Graphics  3D computer graphics is the science, study, and method of projecting a mathematical.
Challenges Bit-vector approach Conclusion & Future Work A subsequence of a string of symbols is derived from the original string by deleting some elements.
Multi-core architectures. Single-core computer Single-core CPU chip.
Cg Programming Mapping Computational Concepts to GPUs.
Stefan PopovHigh Performance GPU Ray Tracing Real-time Ray Tracing on GPU with BVH-based Packet Traversal Stefan Popov, Johannes Günther, Hans- Peter Seidel,
On a Few Ray Tracing like Algorithms and Structures. -Ravi Prakash Kammaje -Swansea University.
1 Exploring Custom Instruction Synthesis for Application-Specific Instruction Set Processors with Multiple Design Objectives Lin, Hai Fei, Yunsi ACM/IEEE.
An Evaluation of Existing BVH Traversal Algorithms for Efficient Multi-Hit Ray Tracing Jefferson Amstutz (SURVICE) Johannes Guenther (Intel) Ingo Wald.
Introduction to Realtime Ray Tracing Course 41 Philipp Slusallek Peter Shirley Bill Mark Gordon Stoll Ingo Wald.
Stream Processing Main References: “Comparing Reyes and OpenGL on a Stream Architecture”, 2002 “Polygon Rendering on a Stream Architecture”, 2000 Department.
Interactive Visualization of Exceptionally Complex Industrial CAD Datasets Andreas Dietrich Ingo Wald Philipp Slusallek Computer Graphics Group Saarland.
Saarland University, Germany B-KD Trees for Hardware Accelerated Ray Tracing of Dynamic Scenes Sven Woop Gerd Marmitt Philipp Slusallek.
1 Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly Koushik Chakraborty Philip Wells Gurindar Sohi
Interactive Rendering With Coherent Ray Tracing Eurogaphics 2001 Wald, Slusallek, Benthin, Wagner Comp 238, UNC-CH, September 10, 2001 Joshua Stough.
Fast BVH Construction on GPUs (Eurographics 2009) Park, Soonchan KAIST (Korea Advanced Institute of Science and Technology)
SIGGRAPH 2011 ASIA Preview Seminar Rendering: Accuracy and Efficiency Shinichi Yamashita Triaxis Co.,Ltd.
Department of Computer Science 1 Beyond CUDA/GPUs and Future Graphics Architectures Karu Sankaralingam University of Wisconsin-Madison Adapted from “Toward.
1 Ray Tracing with Existing Graphics Systems Jeremy Sugerman, FLASHG 31 January 2006.
Hierarchical Penumbra Casting Samuli Laine Timo Aila Helsinki University of Technology Hybrid Graphics, Ltd.
Multi-Hit Ray Traversal Christiaan Gribble Alexis Naveros Ethan Kerzner ACM SIGGRAPH Symposium on Interactive 3D Graphics & Games 15 March 2014.
Memory Management and Parallelization Paul Arthur Navrátil The University of Texas at Austin.
Time Parallel Simulations I Problem-Specific Approach to Create Massively Parallel Simulations.
Parallel and Distributed Simulation Time Parallel Simulation.
Subject Name: Computer Graphics Subject Code: Textbook: “Computer Graphics”, C Version By Hearn and Baker Credits: 6 1.
A SEMINAR ON 1 CONTENT 2  The Stream Programming Model  The Stream Programming Model-II  Advantage of Stream Processor  Imagine’s.
A Memory-hierarchy Conscious and Self-tunable Sorting Library To appear in 2004 International Symposium on Code Generation and Optimization (CGO ’ 04)
From Turing Machine to Global Illumination Chun-Fa Chang National Taiwan Normal University.
COMPUTER GRAPHICS CS 482 – FALL 2015 SEPTEMBER 29, 2015 RENDERING RASTERIZATION RAY CASTING PROGRAMMABLE SHADERS.
Virtual Light Field Group University College London Ray Tracing with the VLF (VLF-RT) Jesper Mortensen
Efficient Partitioning of Fragment Shaders for Multiple-Output Hardware Tim Foley Mike Houston Pat Hanrahan Computer Graphics Lab Stanford University.
Learning A Better Compiler Predicting Unroll Factors using Supervised Classification And Integrating CPU and L2 Cache Voltage Scaling using Machine Learning.
Auburn University COMP8330/7330/7336 Advanced Parallel and Distributed Computing Parallel Hardware Dr. Xiao Qin Auburn.
Hybrid Ray Tracing and Path Tracing of Bezier Surfaces using a mixed hierarchy Rohit Nigam, P. J. Narayanan CVIT, IIIT Hyderabad, Hyderabad, India.
Combining Edges and Points for Interactive High-Quality Rendering
Real-Time Ray Tracing Stefan Popov.
COMPUTER ORGANIZATION AND ARCHITECTURE
Presentation transcript:

coherent ray tracing via stream filtering christiaan gribble karthik ramani ieee/eurographics symposium on interactive ray tracing august 2008

early implementation –andrew kensler (utah) –ingo wald (intel) –solomon boulos (stanford) other contributors –steve parker & pete shirley (nvidia) –al davis & erik brunvand (utah) acknowledgements

ray packets  SIMD processing increasing SIMD widths –current GPUs –intel’s larrabee –future processors how to exploit wide SIMD units for fast ray tracing? wide SIMD environments

recast ray tracing algorithm –series of filter operations –applied to arbitrarily-sized groups of rays apply filters throughout rendering –eliminate inactive rays –improve SIMD efficiency –achieve interactive performance stream filtering

ray streams –groups of rays –arbitrary size –arbitrary order stream filters –set of conditional statements –executed across stream elements –extract only rays with certain properties core concepts

abdef stream element input stream out_stream filter (in_stream) { foreach e in in_stream if (test(e) == true) out_stream.push(e) return out_stream } c test conditional statement(s)

core concepts abdef input stream c test output stream a a out_stream filter (in_stream) { foreach e in in_stream if (test(e) == true) out_stream.push(e) return out_stream }

core concepts abdef input stream c test output stream a b b out_stream filter (in_stream) { foreach e in in_stream if (test(e) == true) out_stream.push(e) return out_stream }

core concepts abdef input stream c test output stream a c b out_stream filter (in_stream) { foreach e in in_stream if (test(e) == true) out_stream.push(e) return out_stream }

core concepts abdef input stream c test output stream a d bd out_stream filter (in_stream) { foreach e in in_stream if (test(e) == true) out_stream.push(e) return out_stream }

core concepts abdef input stream c test output stream a e bd out_stream filter (in_stream) { foreach e in in_stream if (test(e) == true) out_stream.push(e) return out_stream }

core concepts abdef input stream c test output stream a f bdf out_stream filter (in_stream) { foreach e in in_stream if (test(e) == true) out_stream.push(e) return out_stream }

core concepts abdef input stream c test output stream abdf out_stream filter (in_stream) { foreach e in in_stream if (test(e) == true) out_stream.push(e) return out_stream }

process stream in groups of N elements two steps –N-wide groups  boolean mask –boolean mask  partitioned stream SIMD filtering

abdef input stream c test boolean mask step one

SIMD filtering abdef input stream c test boolean mask abc ttf step one

SIMD filtering abdef input stream c test boolean mask t def tf tft step one

SIMD filtering abdef input stream c test boolean mask ttf tft

SIMD filtering abdef input stream c test boolean mask ttf tft partition abdec output stream f

all rays requiring same sequence of ops will always perform those ops together independent of execution path independent of order within stream coherence defined by ensuing ops no guessing with heuristics adapts to geometry, etc. key characteristics

all rays requiring same sequence of ops will always perform those ops together independent of execution path independent of order within stream coherence defined by ensuing ops no guessing with heuristics adapts to geometry, etc. key characteristics

all rays requiring same sequence of ops will always perform those ops together independent of execution path independent of order within stream coherence defined by ensuing ops no guessing with heuristics adapts to geometry, etc. key characteristics

SIMD execution  exploits parallelism within stream improves efficiency  underutilization only when (# rays % N) > 0 potential benefits

wide SIMD ops (N > 4) scatter/gather memory ops partition op hardware requirements

recast ray tracing algorithm as a sequence of filter operations possible to use filters in all three major stages of ray tracing –traversal –intersection –shading application to ray tracing

sequence of stream filters –extract certain rays for processing –ignore others, process later –implicit or explicit traversal  implicit filter stack shading  explicit filter stack filter stacks

while (!stack.empty()) (node, stream) = stack.pop() substream = filter (stream) if (substream.size() == 0) continue else if (node.is_leaf()) prims.intersect(substream) else determine front/back children stack.push(back_child, substream) stack.push(front_child, substream) traversal

drop inactive rays traversal abdef input stream c stack current node  x w(0, 5) … abdec output stream f y z filter against node

traversal abdef input stream c stack current node  x y(0, 3) … abd output stream f y z w(0, 5) push back child

traversal abdef input stream c stack current node  x z(0, 3) … abd output stream f y z w(0, 5) y(0, 3) push front child

traversal abdef input stream c stack current node  x z(0, 3) … abd output stream f y z w(0, 5) y(0, 3) continue to next traversal step

explicit filter stacks –decompose test into sequence of filters sequence of barycentric coordinate tests … –too little coherence to necessitate additional filter ops simply apply test in N-wide SIMD fashion intersection

explicit filter stacks –extract & process elements shadow rays for explicit direct lighting rays that miss geometry rays whose children sample direct illumination … –streams are quite long –filter stacks are used to good effect shading achieves highest utilization shading

general & flexible supports parallel execution –process only active elements –yields highest possible efficiency –adapts to geometry, etc. incurs low overhead algorithm – summary

why a custom core? –skeptical that algorithm could perform interactively –provides upper bound on expected performance –explore parameter space more easily if successful, implement for available architectures hardware simulation

cycle-accurate –models stalls & data dependencies –models contention for components conservative –could be synthesized at nm –we assume nm simulator highlights

four major subsystems –distributed memory –N-wide SIMD execution units –dedicated address generation units –flexible, hierarchical interconnect additional details available in companion papers simulator highlights

stream filter  C++ class template filter tests  C++ functors –ray/node intersection –ray/triangle intersection –… scatter/gather ops used to manipulate operands, rendering state programming model

does sufficient coherence exist to use wide SIMD units efficiently? focus on SIMD utilization is interactive performance achievable with a custom core? initial exploration of design space key questions

does sufficient coherence exist to use wide SIMD units efficiently? focus on SIMD utilization is interactive performance achievable with a custom core? key questions

does sufficient coherence exist to use wide SIMD units efficiently? focus on SIMD utilization is interactive performance achievable with a custom core? initial exploration of design space key questions

monte carlo path tracing –explicit direct lighting –glossy, dielectric, & lambertian materials –depth-of-field effects tile-based, breadth-first rendering rendering

1024x1024 images stream size  1K or 4K rays –1 spp  32x32 or 64x64 pixels/tile –64 spp  4x4 or 8x8 pixels/tile per-frame stats –O(100s millions) rays/frame –O(100s millions) traversal ops –O(10s millions) intersection ops experimental setup

high geometric & illumination complexity representative of common scenarios test scenes rtrtconfkala

primary rays initial stream size = 64x64 rays

secondary rays initial stream size = 64x64 rays

predicted performance

achieve high utilization –as high as 97% –SIMD widths of up to 16 elements –utilization increases with stream size achieve interactive performance –15-25 fps –performance increases with stream size –currently requires custom core results – summary

too few common ops  no improvement in utilization possible remedies –longer ray streams –parallel traversal limitations – parallelism

ray stream –generalized ray packet –longer streams  better performance –could perform poorly possible remedies –carefully designed memory system –streaming memory systems limitations – memory system

conventional cpus –narrow SIMD (4-wide SSE & altivec) –limited support for scatter/gather ops –partition op  software implementation possible remedies –custom core –current GPUs –time limitations – hw support

new approach to coherent ray tracing –process arbitrarily-sized groups of rays in SIMD fashion with high utilization –eliminates inactive elements, process only active rays stream filtering provides –sufficient coherence for wider-than-four SIMD processing –interactive performance with custom core conclusions

key strengths –parallel processing –implicit reordering –modest hw requirements –general key challenges –hw support for required ops conclusions

additional hw simulation –parameter tuning –homogeneous multicore –heterogeneous multicore –… improved GPU-based implementation implementations for future processors future work

temple of kalabsha –veronica sundstedt –patrick ledda –other members of the university of bristol computer graphics group financial support –swezey scientific instrumentation fund –utah graduate research fellowship –nsf grants & (more) acknowledgements