Memory-Savvy Distributed Interactive Ray Tracing
David E. DeMarle, Christiaan Gribble, Steven Parker

Impetus for the Paper
- data sets are growing
- memory access time is a bottleneck
- use parallel memory resources efficiently
- three techniques for faster access to scene data

System Overview
- base system presented at IEEE PVG '03
- cluster port of an interactive ray tracer for shared memory supercomputers (IEEE VIS '98)
- image parallel work division
- fetch scene data over the network from peers and cache it locally

Three Techniques for Memory Efficiency
1. ODSM → PDSM
2. central work queue → distributed work sharing
3. polygonal mesh reorganization

Distributed Shared Memory
- data is kept in memory blocks
- each node has 1/nth of the blocks
- fetch the rest over the network from peers
- cache recently fetched blocks
(figure: abstract view of memory vs. each node's local resident set and cache)
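
To make the resident-set/cache split concrete, the sketch below shows one way a block read could be resolved in C: blocks this node owns are served from its resident set, everything else goes through a small cache filled from the owning peer. All of the names (dsm_t, fetch_block_from_peer, the round-robin placement) are assumptions for illustration, not the paper's code.

    /* Minimal sketch of block lookup in a DSM with an owner-computes
     * distribution; every identifier here is hypothetical. */
    #include <stdlib.h>

    #define CACHE_SLOTS 1024             /* arbitrary direct-mapped cache  */

    typedef struct {
        int     rank, nnodes;            /* this node and the cluster size */
        size_t  block_size;
        void  **resident;                /* blocks this node owns (1/nth)  */
        void   *cache[CACHE_SLOTS];      /* recently fetched remote blocks */
        long    cache_tag[CACHE_SLOTS];  /* which handle each slot holds   */
    } dsm_t;

    extern void fetch_block_from_peer(dsm_t *d, long handle, void *dst);

    static int owner_of(const dsm_t *d, long handle)
    {
        return (int)(handle % d->nnodes);          /* round-robin placement */
    }

    void *dsm_get_block(dsm_t *d, long handle)
    {
        if (owner_of(d, handle) == d->rank)         /* local: resident set  */
            return d->resident[handle / d->nnodes];

        long slot = handle % CACHE_SLOTS;           /* remote: consult cache */
        if (d->cache[slot] == NULL || d->cache_tag[slot] != handle) {
            if (d->cache[slot] == NULL)
                d->cache[slot] = malloc(d->block_size);
            fetch_block_from_peer(d, handle, d->cache[slot]);   /* miss */
            d->cache_tag[slot] = handle;
        }
        return d->cache[slot];
    }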

Object Based DSM
- each block has a unique handle
- application finds the handle for each datum
- acquire and release for every block access

    //locate data
    handle, offset = ODSM_location(datum);
    block_start_addr = acquire(handle);
    //use data
    datum = *(block_start_addr + offset);
    //relinquish space
    release(handle);
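
Read as plain C, the slide's access pattern might look like the sketch below; the exact signatures of ODSM_location, acquire, and release are assumptions, as is the idea that acquire pins the block in the cache until the matching release.

    /* One possible C reading of the slide's pseudocode; the signatures of
     * ODSM_location(), acquire() and release() are assumed, not the paper's. */
    #include <stddef.h>

    typedef long odsm_handle_t;

    extern void  ODSM_location(const void *datum_id,
                               odsm_handle_t *handle, size_t *offset);
    extern char *acquire(odsm_handle_t handle);    /* pins block in the cache */
    extern void  release(odsm_handle_t handle);    /* allows later eviction   */

    float read_scene_float(const void *datum_id)
    {
        odsm_handle_t handle;
        size_t        offset;

        ODSM_location(datum_id, &handle, &offset); /* locate the datum     */
        char *block = acquire(handle);             /* fetch/pin its block  */
        float value = *(float *)(block + offset);  /* use the data         */
        release(handle);                           /* relinquish the space */
        return value;
    }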

ODSM Observations
- the handle adds a level of indirection → can address more than 4 GB
- mapping scene data to blocks is tricky
- acquire and release add overhead
- address computations add overhead
- 7.5 GB Richtmyer-Meshkov time step, 64 CPUs: ~3 fps with view and isovalue changes

Page Based DSM
- like ODSM: each node keeps 1/nth of the scene, fetches from peers, uses caching
- the difference is how memory is accessed
- normal virtual memory addressing, using addresses between the heap and the stack
- PDSM installs a segmentation fault signal handler: on a miss it obtains the page from a peer, then returns
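
A minimal sketch of that fault-handler idea, assuming a POSIX/Linux node; fetch_page_from_peer and the PDSM_BASE/PDSM_SIZE bounds are hypothetical names, not the paper's implementation.

    /* Sketch: resolve a PDSM miss inside a SIGSEGV handler, then return so
     * the faulting instruction is restarted transparently. */
    #include <signal.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    #define PDSM_PAGE 4096UL

    extern uintptr_t PDSM_BASE, PDSM_SIZE;          /* bounds of shared region  */
    extern void fetch_page_from_peer(void *page);   /* fill page from its owner */

    static void pdsm_fault_handler(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)ctx;
        uintptr_t addr = (uintptr_t)info->si_addr;

        if (addr < PDSM_BASE || addr >= PDSM_BASE + PDSM_SIZE)
            abort();                                /* a genuine segfault */

        /* map the missing page and fill it from the owning peer */
        void *page = (void *)(addr & ~(PDSM_PAGE - 1));
        mmap(page, PDSM_PAGE, PROT_READ | PROT_WRITE,
             MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        fetch_page_from_peer(page);
    }

    void pdsm_install_handler(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = pdsm_fault_handler;
        sa.sa_flags     = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);
    }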

PDSM Observations
- no handles, normal memory access
- no acquire/release or address computations
- easy to place any type of scene data in shared space
- limited to 2^32 bytes
- hard to make thread safe
- DSM acts only in the exceptional case of a miss
- with a ray tracing acceleration structure, hit rates exceed 90%

               ODSM      PDSM
    Hit time   10.2 µs   4.97 µs
    Miss time  629 µs    632 µs

Head-to-Head Comparison
- compare replication, PDSM, and ODSM
- use a small 512^3 volumetric data set
- PDSM and ODSM keep only 1/16th locally
- change viewpoint and isovalue throughout
- first half: large working set; second half: small working set

Head-to-Head Comparison
(note: accelerated ~2x for presentation)

Head-to-Head Comparison (results)
- replicated: 3.74 frames/sec average
- ODSM: 32% the speed of replication
- PDSM: 82% the speed of replication

Three Techniques for Memory Efficiency
1. ODSM → PDSM
2. central work queue → distributed work sharing
3. polygonal mesh reorganization

Load Balancing Options
- central work queue
  - legacy from the original shared memory implementation
  - display node keeps the task queue
  - render nodes get tiles from the queue
- now: distributed work sharing
  - start with the tiles traced last frame → hit rates increase
  - workers get tiles from each other → communication happens in parallel, better scalability
  - steal from random peers; the slowest worker gives up work
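
The paper shares tiles between cluster nodes with messages; as a compact, runnable stand-in, the sketch below applies the same stealing policy between threads in one process: each worker drains its own tile queue first and, when empty, tries to take a tile from a random peer. All names are illustrative.

    /* Stand-in for distributed work sharing, using threads instead of nodes. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NWORKERS  4
    #define MAX_TILES 4096

    typedef struct {
        pthread_mutex_t lock;
        int tiles[MAX_TILES];            /* tile ids assigned to this worker */
        int count;
    } tile_queue_t;

    static tile_queue_t queues[NWORKERS];

    static void trace_tile(int tile_id)  /* stub standing in for the renderer */
    {
        printf("tracing tile %d\n", tile_id);
    }

    /* pop from our own queue, or steal a tile from a random peer */
    static int get_tile(int self, unsigned *seed)
    {
        pthread_mutex_lock(&queues[self].lock);
        if (queues[self].count > 0) {
            int t = queues[self].tiles[--queues[self].count];
            pthread_mutex_unlock(&queues[self].lock);
            return t;
        }
        pthread_mutex_unlock(&queues[self].lock);

        for (int attempt = 0; attempt < 4 * NWORKERS; ++attempt) {
            int victim = (int)(rand_r(seed) % NWORKERS);
            if (victim == self) continue;
            int t = -1;
            pthread_mutex_lock(&queues[victim].lock);
            if (queues[victim].count > 0)
                t = queues[victim].tiles[--queues[victim].count];
            pthread_mutex_unlock(&queues[victim].lock);
            if (t >= 0) return t;
        }
        return -1;   /* found nothing anywhere we looked: give up this frame */
    }

    static void *worker(void *arg)
    {
        int self = (int)(long)arg;
        unsigned seed = (unsigned)self + 1;
        for (int t = get_tile(self, &seed); t >= 0; t = get_tile(self, &seed))
            trace_tile(t);   /* in the real system a traced tile stays with this
                                worker next frame, keeping its data cache warm */
        return NULL;
    }

    int main(void)
    {
        for (int w = 0; w < NWORKERS; ++w) {
            pthread_mutex_init(&queues[w].lock, NULL);
            for (int i = 0; i < 16; ++i)             /* 16 tiles per worker */
                queues[w].tiles[queues[w].count++] = w * 16 + i;
        }
        pthread_t tid[NWORKERS];
        for (long w = 0; w < NWORKERS; ++w)
            pthread_create(&tid[w], NULL, worker, (void *)w);
        for (int w = 0; w < NWORKERS; ++w)
            pthread_join(tid[w], NULL);
        return 0;
    }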

(diagram: Central Work Queue vs. Distributed Work Sharing; in the former a supervisor node hands tiles 0, 1, 2, 3, … to the worker nodes, in the latter the worker nodes exchange tiles among themselves)

Comparison
- bunny, dragon, and acceleration structures in PDSM
- measure misses and frame rates
- vary local memory to simulate data much larger than physical memory

(charts: misses and frames/sec vs. MB stored locally, central queue vs. distributed sharing)

Three Techniques for Memory Efficiency
1. ODSM → PDSM
2. central work queue → distributed work sharing
3. polygonal mesh reorganization

Mesh “Bricking”
- similar to volumetric bricking
- increase hit rates by reorganizing scene data for better data locality
- place neighboring triangles on the same page
(figure: memory addresses laid out for volume bricking and for mesh “bricking”)
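
For reference, a minimal sketch of the volumetric bricking the slide compares against: voxel addresses are computed so that each small 3D brick of voxels is contiguous in memory, which is the locality the mesh reorganization mimics for triangles. The 4x4x4 brick size and the function name are arbitrary choices, not taken from the paper.

    /* Minimal sketch of bricked voxel addressing: neighbouring voxels in x, y
     * and z land in the same small brick, and therefore usually on the same page. */
    #include <stddef.h>

    #define BRICK 4   /* voxels per brick edge */

    /* volume dimensions are assumed to be multiples of BRICK in this sketch */
    size_t bricked_index(size_t x, size_t y, size_t z, size_t nx, size_t ny)
    {
        size_t bricks_x = nx / BRICK, bricks_y = ny / BRICK;

        /* which brick the voxel falls in ... */
        size_t bx = x / BRICK, by = y / BRICK, bz = z / BRICK;
        size_t brick_id = (bz * bricks_y + by) * bricks_x + bx;

        /* ... and where it sits inside that brick */
        size_t ox = x % BRICK, oy = y % BRICK, oz = z % BRICK;
        size_t offset = (oz * BRICK + oy) * BRICK + ox;

        return brick_id * (BRICK * BRICK * BRICK) + offset;
    }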

Input Mesh

Sorted Mesh

Reorganizing the Mesh
- based on a grid acceleration structure
- each grid cell contains pointers to the triangles within it
- our grid structure is bricked in memory
1. create the grid acceleration structure
2. traverse the cells as stored in memory
3. append copies of the triangles to a new mesh
- the new mesh has triangles sorted in space and in memory
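
A rough C sketch of those three steps, assuming a flat grid whose cells are already stored in bricked order and that each cell holds indices into the input triangle list; all types and names are illustrative, not the paper's code.

    /* Walk the grid cells in their in-memory (bricked) order and emit copies of
     * their triangles into a new, spatially sorted mesh. Triangles overlapping
     * several cells get duplicated, as the slides note. */
    #include <stdlib.h>

    typedef struct { float v[3][3]; } triangle_t;

    typedef struct {
        int  ntris;
        int *tri_ids;            /* indices into the input mesh */
    } grid_cell_t;

    typedef struct {
        grid_cell_t *cells;      /* already stored in bricked order */
        size_t       ncells;
    } grid_t;

    triangle_t *sort_mesh(const triangle_t *in, const grid_t *g, size_t *out_n)
    {
        /* count first so the output can be allocated in one shot */
        size_t total = 0;
        for (size_t c = 0; c < g->ncells; ++c)
            total += (size_t)g->cells[c].ntris;

        triangle_t *out = malloc(total * sizeof *out);
        size_t next = 0;
        for (size_t c = 0; c < g->ncells; ++c)            /* step 2: traverse */
            for (int i = 0; i < g->cells[c].ntris; ++i)   /* step 3: append   */
                out[next++] = in[g->cells[c].tri_ids[i]];

        *out_n = total;   /* neighbouring triangles now share pages */
        return out;
    }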

Comparison
- same test as before
- compare input and sorted mesh

(charts: misses and frames/sec vs. MB stored locally, input mesh vs. sorted mesh)

(chart: frames/sec vs. MB stored locally, input mesh vs. sorted mesh)
note: the grid based approach duplicates split triangles (triangles overlapping more than one cell)

Summary
three techniques for more efficient memory use:
1. PDSM adds overhead only in the exceptional case of a data miss
2. reuse tile assignments with parallel load balancing heuristics
3. mesh reorganization puts related triangles onto nearby pages

Future Work
- need a 64-bit architecture for very large data
- thread safe PDSM for hybrid parallelism
- distributed pixel result gathering
- surface based mesh reorganization

Acknowledgments
- funding agencies: NSF, DOE VIEWS, NIH
- reviewers, for tips and for seeing through the rough initial data presentation
- EGPGV organizers
Thank you!