Architecture Considerations for Tracing Incoherent Rays


Architecture Considerations for Tracing Incoherent Rays
Timo Aila, Tero Karras
NVIDIA Research

Outline
Our research question: What can be done if memory bandwidth becomes the primary bottleneck in ray tracing?
- Test setup
- Architecture overview
- Optimizing stack traffic
- Optimizing scene traffic
- Results
- Future work

Test setup – methodology
- Hypothetical parallel architecture
- All measurements done on a custom simulator
Assumptions:
- Processors and L1 are fast (not the bottleneck)
- L1s ↔ last-level cache (LLC) may be a bottleneck
- LLC ↔ DRAM assumed to be the primary bottleneck
- Minimum transfer size 32 bytes (DRAM atom)
- Measurements include all memory traffic
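
As a concrete illustration of the 32-byte DRAM-atom assumption, here is a minimal sketch of how such traffic accounting can be modeled; the function and constant names are hypothetical, not taken from the simulator:

```cuda
#include <cstdint>
#include <cstdio>

// Hypothetical traffic counter: every LLC<->DRAM transfer is rounded up to
// whole 32-byte atoms, so even a small read costs at least one full atom.
static const uint64_t DRAM_ATOM = 32;

uint64_t atomsForAccess(uint64_t addr, uint64_t bytes)
{
    uint64_t first = addr / DRAM_ATOM;               // first atom touched
    uint64_t last  = (addr + bytes - 1) / DRAM_ATOM; // last atom touched
    return last - first + 1;
}

int main()
{
    // An 8-byte access straddling an atom boundary touches two atoms,
    // i.e. 64 bytes of DRAM traffic.
    printf("%llu atoms\n", (unsigned long long)atomsForAccess(/*addr=*/28, /*bytes=*/8));
    return 0;
}
```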

Test setup – scenes
- Simulator cannot deal with large scenes
- Two organic scenes with difficult structure, one car interior with simple structure
- BVH, 32 bytes per node/triangle

              Vegetation   Hairball    Veyron
Triangles     1.1M         2.8M        1.3M
BVH nodes     629K         1089K       751K
Memory        86 MB        209 MB      47 MB

Test setup – rays
In global illumination, rays typically:
- Start from a surface
- Need the closest intersection
- Are not coherent
We used diffuse interreflection rays:
- 16 rays per primary hit point, 3M rays in total
- Submitted to the simulator as batches of 1M rays
Ray ordering:
- Random shuffle, ~worst possible order
- Morton (6D space-filling curve), ~best possible order
- Ideally, ray ordering wouldn't matter
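
The Morton ordering can be reproduced with a simple sort. The sketch below quantizes ray origin and direction and interleaves the bits into a 6D Morton key; the bit counts, the Ray layout, and the function names are assumptions for illustration, not the paper's exact parameters:

```cuda
#include <algorithm>
#include <cstdint>
#include <vector>

struct Ray { float ox, oy, oz, dx, dy, dz; };   // hypothetical layout

// Spread the low 10 bits of v so that consecutive bits land 6 positions apart.
static uint64_t spread6(uint64_t v)
{
    uint64_t r = 0;
    for (int i = 0; i < 10; ++i)
        r |= ((v >> i) & 1ull) << (6 * i);
    return r;
}

// 6D Morton key: quantize origin (within the scene AABB) and direction to
// 10 bits each, then interleave the six coordinates bit by bit.
static uint64_t mortonKey6D(const Ray& r, const float lo[3], const float hi[3])
{
    auto q = [](float x, float a, float b) -> uint64_t {
        float t = (x - a) / (b - a);
        t = t < 0.f ? 0.f : (t > 1.f ? 1.f : t);
        return (uint64_t)(t * 1023.f);
    };
    uint64_t c[6] = {
        q(r.ox, lo[0], hi[0]), q(r.oy, lo[1], hi[1]), q(r.oz, lo[2], hi[2]),
        q(r.dx, -1.f, 1.f),    q(r.dy, -1.f, 1.f),    q(r.dz, -1.f, 1.f)
    };
    uint64_t key = 0;
    for (int d = 0; d < 6; ++d)
        key |= spread6(c[d]) << d;
    return key;
}

void sortRaysMorton(std::vector<Ray>& rays, const float lo[3], const float hi[3])
{
    std::sort(rays.begin(), rays.end(), [&](const Ray& a, const Ray& b) {
        return mortonKey6D(a, lo, hi) < mortonKey6D(b, lo, hi);
    });
}
```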

Architecture (1/2)
We copy several parameters from Fermi:
- 16 processors, each with a private L1 (48KB, 128-byte lines, 6-way)
- Shared L2 (768KB, 128-byte lines, 16-way)
Otherwise our architecture is not Fermi.
Additionally:
- Write-back caches with LRU eviction policy
Processors:
- 32-wide SIMD, 32 warps** for latency hiding
- Round-robin warp scheduling
- Fast; fixed function or programmable, we don't care
** Warp = static collection of threads that execute together in SIMD fashion
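
For reference, the simulated machine parameters from this slide collected in one place; the struct and field names are my own shorthand, not the simulator's:

```cuda
// Parameters of the simulated machine as listed on this slide
// (struct and field names are illustrative, not from the simulator).
struct SimulatedArchitecture {
    int  numProcessors     = 16;          // each with a private L1
    int  l1SizeBytes       = 48 * 1024;   // 48KB, 128-byte lines, 6-way
    int  l1LineBytes       = 128;
    int  l1Ways            = 6;
    int  l2SizeBytes       = 768 * 1024;  // shared L2, 128-byte lines, 16-way
    int  l2LineBytes       = 128;
    int  l2Ways             = 16;
    int  simdWidth          = 32;         // 32-wide SIMD processors
    int  warpsPerProcessor  = 32;         // for latency hiding
    bool writeBackCaches    = true;       // with LRU eviction
    int  dramAtomBytes      = 32;         // minimum DRAM transfer
};
```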

Architecture (2/2)
- Each processor is bound to an input queue
- Launcher fetches work
Compaction:
- When a warp has <50% of its threads alive, terminate the warp and re-launch
- Improves SIMD utilization from 25% to 60%
- Enabled in all tests
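
A sketch of the compaction rule in CUDA-style code, assuming a hypothetical persistent-threads ray loop; the queue interface and names are illustrative, not the paper's code:

```cuda
#include <cstdint>

// After a traversal step each lane reports whether its ray is still alive.
// If fewer than half of the 32 lanes are, surviving rays are handed back to
// a global relaunch queue and the warp terminates, so the launcher can start
// a fresh, fully populated warp (raising SIMD utilization).
__device__ void compactWarpIfNeeded(bool& rayAlive, int myRayIndex,
                                    int* relaunchQueue, int* relaunchCount)
{
    unsigned aliveMask  = __ballot_sync(0xFFFFFFFFu, rayAlive);
    int      aliveCount = __popc(aliveMask);

    if (aliveCount > 0 && aliveCount < 16)        // below 50% of the 32 lanes
    {
        if (rayAlive)
        {
            int slot = atomicAdd(relaunchCount, 1);  // hand the ray back to the launcher
            relaunchQueue[slot] = myRayIndex;
        }
        rayAlive = false;                            // whole warp exits and is re-launched
    }
}
```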

Outline
- Test setup
- Architecture overview
- Optimizing stack traffic: baseline ray tracer and how to reduce its stack traffic
- Optimizing scene traffic
- Results
- Future work

Stack traffic – baseline method
- While-while CUDA kernel [Aila & Laine 2009]
- One-to-one mapping between threads and rays
- Stacks interleaved in memory (CUDA local memory): 1st stack entry from 32 rays, 2nd stack entry from 32 rays, …
- Good for coherent rays, less so for incoherent
- 50% of traffic is caused by traversal stacks with random sort!
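
A minimal sketch of the while-while structure (not the actual [Aila & Laine 2009] kernel): one thread owns one ray, and the per-thread traversal stack lives in CUDA local memory, which the hardware interleaves across the 32 lanes of a warp so that entry k of all 32 rays is contiguous. With incoherent rays those 32 entries point all over the tree, which is what makes the stack traffic expensive. The node layout and leaf handling below are simplified placeholders:

```cuda
#include <cstdint>
#include <cfloat>

struct BVHNode {           // 32 bytes, matching the test setup
    float bmin[3], bmax[3];
    int   leftOrPrim;      // inner node: left child index; leaf: first primitive
    int   count;           // 0 for inner nodes, >0 for leaves
};

__device__ bool intersectAABB(const float orig[3], const float invDir[3],
                              const BVHNode& n, float tMax)
{
    float t0 = 0.0f, t1 = tMax;
    for (int a = 0; a < 3; ++a) {
        float lo = (n.bmin[a] - orig[a]) * invDir[a];
        float hi = (n.bmax[a] - orig[a]) * invDir[a];
        t0 = fmaxf(t0, fminf(lo, hi));
        t1 = fminf(t1, fmaxf(lo, hi));
    }
    return t0 <= t1;
}

__global__ void whileWhileTraversal(const BVHNode* nodes, const float* rayOrig,
                                    const float* rayInvDir, int* hitPrim, int numRays)
{
    int ray = blockIdx.x * blockDim.x + threadIdx.x;
    if (ray >= numRays) return;

    const float* o  = rayOrig   + 3 * ray;
    const float* id = rayInvDir + 3 * ray;

    int stack[64];                 // CUDA local memory: interleaved across the warp
    int stackPtr = 0;
    int node = 0;                  // root
    int hit  = -1;

    while (node != -1 || stackPtr > 0) {
        while (node != -1 && nodes[node].count == 0) {       // descend inner nodes
            int left = nodes[node].leftOrPrim, right = left + 1;
            bool hitL = intersectAABB(o, id, nodes[left],  FLT_MAX);
            bool hitR = intersectAABB(o, id, nodes[right], FLT_MAX);
            if (hitL && hitR) { stack[stackPtr++] = right; node = left; }
            else              node = hitL ? left : (hitR ? right : -1);
        }
        if (node != -1) {                                     // leaf: placeholder hit
            hit  = nodes[node].leftOrPrim;                    // real kernel intersects triangles here
            node = -1;
        }
        if (node == -1 && stackPtr > 0) node = stack[--stackPtr];
    }
    hitPrim[ray] = hit;
}
```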

Stack traffic – stacktop caching
- Non-interleaved stacks, cached in L1: requires 128KB of L1 (32×32×128B), severe thrashing
- Instead, keep the N latest entries in registers [Horn07]; the rest goes to DRAM with optimized direct DRAM communication
- N=4 eliminates almost all stack-related traffic
- Costs 16KB of register file (1/8th of the L1 requirement…)
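
A sketch of the stacktop idea with N = 4; the register layout and spill policy here are illustrative, not the paper's exact scheme. The four most recently pushed entries live in registers, and only overflow/underflow touches the per-thread stack in DRAM:

```cuda
// Illustrative N=4 stacktop cache. e0..e3 are register-resident entries
// (e0 is the newest). Only when they overflow does the oldest one spill to
// the per-thread DRAM stack, and only when they run dry is one re-fetched.
struct StackTop4 {
    int  e0, e1, e2, e3;   // register stacktop, e0 = newest
    int  regCount;         // how many of e0..e3 are valid (0..4)
    int* dramStack;        // this thread's non-interleaved spill area in DRAM
    int  dramCount;        // entries currently spilled

    __device__ void push(int v) {
        if (regCount == 4)                   // overflow: spill the oldest register
            dramStack[dramCount++] = e3;
        else
            ++regCount;
        e3 = e2; e2 = e1; e1 = e0; e0 = v;
    }

    __device__ int pop() {
        int v = e0;
        e0 = e1; e1 = e2; e2 = e3;
        if (regCount == 4 && dramCount > 0)  // underflow refill from DRAM
            e3 = dramStack[--dramCount];
        else
            --regCount;
        return v;
    }

    __device__ bool empty() const { return regCount == 0 && dramCount == 0; }
};
```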

Outline
- Test setup
- Architecture overview
- Optimizing stack traffic
- Optimizing scene traffic: treelets, treelet assignment, queues, scheduling
- Results
- Future work

Scene traffic – treelets (1/2)
Scene traffic is about 100X the theoretical minimum:
- Each ray traverses independently
- The concurrent working set is large
- Quite heavily dependent on ray ordering

Scene traffic – treelets (2/2)
Divide the tree into treelets (extends [Pharr97, Navratil07]):
- Each treelet fits into cache (nodes, geometry)
- Treelet membership is encoded into the node index
Assign one queue per treelet:
- A ray that enters another treelet is enqueued there and suspended
When many rays have been collected:
- Bind the treelet/queue to processor(s); this amortizes scene transfers
- Repeat until done
A ray is in one treelet at a time, and can go up the tree as well.
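
A sketch of the treelet transition test during traversal, assuming the treelet index is packed into the high bits of each node index; the bit split, the RayState layout, and the queue arrays are illustrative, not the paper's exact scheme:

```cuda
#include <cstdint>

// Illustrative node-index encoding: high 14 bits select the treelet,
// low 18 bits are the node offset inside it.
__host__ __device__ inline uint32_t treeletOf(uint32_t nodeIndex) { return nodeIndex >> 18; }
__host__ __device__ inline uint32_t offsetOf (uint32_t nodeIndex) { return nodeIndex & 0x3FFFFu; }

// Minimal ray state kept in a treelet queue (the next slide cites 16 bytes).
struct RayState { uint32_t rayIndex; uint32_t resumeNode; float tMax; uint32_t hitId; };

// During traversal: if the next node lives in another treelet, the ray is not
// followed there; it is appended to that treelet's queue and suspended, so the
// treelet can later be processed for many rays at once. queueData/queueCount
// are hypothetical per-treelet arrays.
__device__ bool stepOrSuspend(uint32_t currentTreelet, uint32_t nextNode, RayState rs,
                              RayState** queueData, int* queueCount)
{
    uint32_t t = treeletOf(nextNode);
    if (t != currentTreelet) {
        rs.resumeNode = nextNode;                 // where traversal will resume
        int slot = atomicAdd(&queueCount[t], 1);  // reserve a slot in treelet t's queue
        queueData[t][slot] = rs;                  // (real system also flushes the stacktop)
        return false;                             // caller suspends this ray
    }
    return true;                                  // same treelet: keep traversing
}
```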

Treelet assignment
- Done when the BVH is constructed; the treelet index is encoded into the node index
- Tradeoff:
  - Treelets should fit into the cache; we set a max memory footprint
  - Treelet transitions cause non-negligible memory traffic
- Minimize the total surface area of the treelets (the probability of hitting a treelet is proportional to its surface area)
- Optimization done using dynamic programming; more details in the paper
- E.g. 15000 treelets for Hairball (max footprint 48KB)

Queues (1/2)
- Queues contain ray states (16B: current hit, …)
- The stacktop is flushed on push; the ray (32B) is re-fetched on pop
- Queue traffic is not cached: a postponed ray is not expected to be needed for a while
Bypassing:
- Is the target queue already bound to some processor? If so, forward ray + ray state + stacktop directly to that processor
- Reduces DRAM traffic
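
A simulator-level sketch of the bypass decision; the types and names are illustrative, not the paper's implementation:

```cuda
#include <vector>
#include <cstdint>

// When a ray is postponed into a treelet queue, its state normally goes to
// DRAM (uncached). But if that treelet is already bound to a processor, the
// ray + ray state + stacktop are forwarded to that processor directly,
// skipping the DRAM round trip.
struct PostponedRay { uint32_t rayIndex; uint32_t resumeNode; float tMax; };

struct TreeletQueue {
    int boundProcessor = -1;              // -1: not currently bound
    std::vector<PostponedRay> inDram;     // queued ray states (DRAM traffic)
};

struct Processor {
    std::vector<PostponedRay> forwarded;  // rays received via bypass
};

void enqueueRay(TreeletQueue& q, std::vector<Processor>& procs, const PostponedRay& r)
{
    if (q.boundProcessor >= 0)
        procs[q.boundProcessor].forwarded.push_back(r);  // bypass: stays on chip
    else
        q.inDram.push_back(r);                           // normal path: write to DRAM queue
}
```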

Queues (2/2)
Static or dynamic memory allocation?
Static:
- Simple to implement
- Memory consumption proportional to scene size
- A queue can get full, so we must pre-empt to avoid deadlocks
Dynamic:
- Needs a fast pool allocator
- Memory consumption proportional to ray batch size
- Queues never get full, no pre-emption
We implemented both, and used dynamic.
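
A sketch of the kind of fast pool allocator dynamic queues need; this is illustrative, not the paper's allocator. Queue storage is carved into fixed-size pages taken from a preallocated slab with a single atomic bump, and the whole pool is reset between ray batches instead of freeing pages individually:

```cuda
#include <cstdint>
#include <cstddef>

struct PagePool {
    uint8_t* storage;     // preallocated slab: pageCount * pageBytes
    int      pageBytes;
    int      pageCount;
    int      nextPage;    // bump counter, reset to 0 between ray batches

    // One atomic per page; returns nullptr when the pool is exhausted.
    __device__ uint8_t* allocPage() {
        int idx = atomicAdd(&nextPage, 1);
        return (idx < pageCount) ? storage + (size_t)idx * pageBytes : nullptr;
    }

    // Called on the host between ray batches; memory use is therefore
    // proportional to the batch size, as the slide states.
    __host__ void reset() { nextPage = 0; }
};
```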

Scheduling (1/2)
Task: bind processors to queues. Goal: minimize binding changes.
Lazy:
- When a processor's input queue gets empty → bind it to the queue that has the most rays
- Optimal with one processor… but binds many processors to the same queue
- Prefers L2-sized treelets and expects very fast L1↔L2 transfers. Unrealistic?
(Plot: Vegetation, random ray ordering)
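
A simulator-level sketch of the lazy policy; names are illustrative. A processor keeps its current binding until its queue runs dry, then rebinds to whichever treelet queue currently holds the most rays:

```cuda
#include <vector>

struct QueueState { int treelet; int rayCount; };

// Pick the queue with the most rays (-1 when all queues are empty).
int lazyRebind(const std::vector<QueueState>& queues)
{
    int best = -1, bestCount = 0;
    for (int i = 0; i < (int)queues.size(); ++i)
        if (queues[i].rayCount > bestCount) { bestCount = queues[i].rayCount; best = i; }
    return best;
}

// Called per processor: only rebind when the current queue has drained,
// which minimizes binding changes (and thus treelet reloads).
int scheduleLazy(int currentQueue, const std::vector<QueueState>& queues)
{
    if (currentQueue >= 0 && queues[currentQueue].rayCount > 0)
        return currentQueue;
    return lazyRebind(queues);
}
```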

Scheduling (2/2)
Balanced:
- Queues request a number of processors; grants are based on “who needs most”
- Processors are (often) bound to different queues → more bypassing
- Prefers L1-sized treelets
- Used in the results
(Plot: Vegetation, random ray ordering)
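
A sketch of one way to realize the balanced policy; this greedy largest-unmet-demand assignment is an assumption, not the paper's exact algorithm. Each queue's request is proportional to its ray count, and processors are granted one at a time to the queue with the largest unmet demand, which spreads processors across different queues:

```cuda
#include <vector>

// Returns, for each processor, the index of the queue it is bound to (-1 = idle).
std::vector<int> scheduleBalanced(const std::vector<int>& rayCount, int numProcessors)
{
    int totalRays = 0;
    for (int c : rayCount) totalRays += c;

    std::vector<int> granted(rayCount.size(), 0);   // processors already given to each queue
    std::vector<int> binding(numProcessors, -1);
    if (totalRays == 0) return binding;

    for (int p = 0; p < numProcessors; ++p) {
        int best = -1;
        double bestNeed = 0.0;
        for (size_t q = 0; q < rayCount.size(); ++q) {
            // "who needs most": requested share minus what is already granted
            double need = (double)rayCount[q] * numProcessors / totalRays - granted[q];
            if (rayCount[q] > 0 && need > bestNeed) { bestNeed = need; best = (int)q; }
        }
        if (best < 0) break;      // no queue has unmet demand left
        binding[p] = best;
        ++granted[best];
    }
    return binding;
}
```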

Treelet results
- Scene traffic reduced ~90%
- Unfortunately, aux traffic (queues + rays + stacks) dominates
- Scales well with #processors
- Virtually independent of ray ordering: 2–5X difference for the baseline, now <10%

Conclusions
- Scene traffic mostly solved; open question: how to reduce the auxiliary traffic?
- The necessary features are generally useful:
  - Queues [Sugerman2009]
  - Pool allocation [Lalonde2009]
  - Compaction
  - Instruction set improvements
  - Custom units [RPU, SaarCOR]
- Today memory bandwidth is perhaps not the #1 bottleneck, but it is likely to become one:
  - Flops are still scaling faster than bandwidth
  - Bandwidth is expensive to build and consumes power

Future work
Complementary memory traffic reduction:
- Wide trees
- Multiple threads per ray? Reduces the number of rays in flight
- Compression?
Batch processing vs. a continuous flow of rays:
- Guaranteeing fairness?
- Memory allocation?

Thank you for listening!
Acknowledgements:
- Samuli Laine for Vegetation and Hairball
- Peter Shirley and Lauri Savioja for proofreading
- Jacopo Pantaleoni, Martin Stich, Alex Keller, Samuli Laine, and David Luebke for discussions