1
Distributed Interactive Ray Tracing for Large Volume Visualization. Dave DeMarle, Steven Parker, Mark Hartner, Christiaan Gribble, Charles Hansen. Scientific Computing and Imaging Institute, University of Utah.
2
Ray tracing. For every pixel, cast a ray and find the first object it hits. Every pixel is independent, so image parallelism is a natural choice for acceleration; a sketch of the idea follows. (Diagram: the screen's pixels divided among CPU 1 through CPU 4.)
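A minimal sketch of the image-parallel idea, not the authors' actual code (trace_ray, the image size, and the tile size are assumptions for illustration): worker threads pull tiles of pixels from a shared counter and trace each pixel independently.

#include <atomic>
#include <thread>
#include <vector>

struct Color { float r, g, b; };
Color trace_ray(int x, int y) { return {0.f, 0.f, 0.f}; }   // stand-in for casting a ray and shading the first hit

constexpr int W = 512, H = 512, TILE = 32;                  // assumed image and tile sizes
std::vector<Color> image(W * H);
std::atomic<int> next_tile{0};

void worker() {
    const int tiles_x = W / TILE, tiles_y = H / TILE;
    for (int t = next_tile++; t < tiles_x * tiles_y; t = next_tile++) {
        int x0 = (t % tiles_x) * TILE, y0 = (t / tiles_x) * TILE;
        for (int y = y0; y < y0 + TILE; ++y)        // every pixel is independent,
            for (int x = x0; x < x0 + TILE; ++x)    // so tiles render in parallel with no coordination
                image[y * W + x] = trace_ray(x, y);
    }
}

int main() {
    std::vector<std::thread> cpus;
    for (int i = 0; i < 4; ++i) cpus.emplace_back(worker);  // CPU 1 through 4, as in the diagram
    for (auto& c : cpus) c.join();
}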
3
Ray-traced forest. (Image: the scene rendered with image parallelism, tiles colored to show the division of work.)
4
Interactive ray tracing of scientific data
5
Large volume visualization. Richtmyer-Meshkov instability simulation; each timestep is 1920 x 2048 x 2048 voxels at 8 bits per voxel (roughly 7.5 GB per timestep).
6
Zooming in on the test data set
7
Architectural comparison

                      SGI test machine                      Cluster test machine
Cost                  ~$1.5 million                         ~$150 thousand
Programming model     Threaded programming                  Custom or add-on APIs
CPUs                  1 x 32 400 MHz R12K                   32 x 2 1.7 GHz Xeon
Addressing            64 bit                                32 bit
Memory                16 GB RAM (shared)                    32 GB RAM (1 GB per node)
Network               ccNUMA hypercube                      Switched Gbit Ethernet
Round-trip latency    335 ns avg (spec)                     34,000 ns avg (measured)
Bandwidth             12.8 Gbit/s (spec)                    0.6 Gbit/s (measured)
8
Overcoming cluster limitations
- Lack of a parallel programming model: build a minimal networking library based on TCP.
- Lower network performance: perform I/O asynchronously to overlap computation and communication; workers try to keep a backlog of work, and the supervisor does upkeep tasks while workers render (see the sketch below).
- Memory limited to 4 GB and isolated within each node: create an object-based DSM in the network library to share memory between nodes.
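A sketch of the overlap just described, with hypothetical helpers (request_tile, render, send_result stand in for the authors' TCP library): the worker always keeps one request to the supervisor in flight while it renders the previous tile, so communication latency hides behind computation.

#include <atomic>
#include <future>
#include <optional>

struct Tile { int x, y, w, h; };

std::optional<Tile> request_tile() {            // stand-in for a blocking network request to the supervisor
    static std::atomic<int> n{0};
    int i = n++;
    if (i >= 64) return std::nullopt;           // pretend the frame has 64 tiles
    return Tile{(i % 8) * 64, (i / 8) * 64, 64, 64};
}
void render(const Tile&) {}                     // stand-in for tracing the tile's rays
void send_result(const Tile&) {}                // stand-in for shipping the pixels back

void worker() {
    // keep a backlog of one: the request for the next tile is outstanding while this one renders
    auto next = std::async(std::launch::async, request_tile);
    while (true) {
        std::optional<Tile> tile = next.get();
        if (!tile) break;                                        // no more work this frame
        next = std::async(std::launch::async, request_tile);     // overlap communication...
        render(*tile);                                           // ...with computation
        send_result(*tile);
    }
}

int main() { worker(); }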
9
System architecture. (Diagram: a supervisor node and worker nodes 1-3; each worker runs two ray threads against its own local memory.)
10
Central executive network limitation. (Image: a moderately complex scene.)
11
Central executive network limitation. Frame rate scales until the supervisor bottleneck dominates.
12
Central executive network limitation. Latency = 19 µs per tile; bandwidth = 600 Mbit/s.
13
With enough processors, latency is the limitation.
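A back-of-the-envelope number from the 19 µs figure (the tile count is an assumption for illustration): if the supervisor pays roughly 19 µs of latency per tile it hands out, it can assign at most about 1 / 19 µs ≈ 52,600 tiles per second, no matter how many workers are attached. A frame split into, say, 256 tiles would therefore top out near 52,600 / 256 ≈ 205 f/s, and splitting the frame into more, smaller tiles lowers that ceiling further; the number of tiles, not the number of processors, sets the maximum frame rate.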
14
Beyond 2^32 bytes. We make the entire memory space of the cluster usable through our object-based DSM. Each node owns part of the data, and application threads acquire bricks from the DSM. If the data isn't locally owned, the DSM gets a copy from the owner, and it tries to cache blocks for reuse. (Corrie and Mackerras, 1993, "Parallel Volume Rendering and Data Coherence".)
15
System architecture (as before). (Diagram: the supervisor node and worker nodes 1-3, each with two ray threads and local memory.)
16
System architecture, extended with SDSM. (Diagram: the same nodes, with a software distributed shared memory layer added across the nodes' local memories.)
17
Beyond 2^32 bytes. (Diagram: each node runs ray threads plus a DSM communication thread; bricks are assigned round-robin, so Node 1 owns bricks 0, 3, 6, Node 2 owns 1, 4, 7, and Node 3 owns 2, 5, 8, and every node's cache starts empty.)
18
Beyond 2^32 bytes. (Diagram: Node 1's ray thread acquires brick 4 and Node 2's acquires brick 3, so each fetches a copy from the owning node and caches it; Node 3's acquire of brick 2 is satisfied from its own owned set.)
19
Beyond 2^32 bytes. (Diagram: as rendering continues, ray threads release bricks they have finished with and acquire new ones; copies of remotely owned bricks accumulate in each node's cache for reuse.)
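A minimal sketch of the acquire path these diagrams walk through, under assumed names (ObjectDSM, Brick, fetch_from_owner); the real DSM lives in the authors' networking library and also handles pinning and eviction. Ownership here is round-robin by brick number, matching the 0/3/6, 1/4/7, 2/5/8 assignment above.

#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <vector>

using Brick = std::vector<uint8_t>;                 // one brick of voxel data (assumed representation)

class ObjectDSM {
public:
    ObjectDSM(int rank, int nnodes) : rank_(rank), nnodes_(nnodes) {}

    int owner_of(int brick) const { return brick % nnodes_; }   // round-robin ownership, as in the diagrams

    // Returns the brick's data, fetching and caching a copy if another node owns it.
    const Brick& acquire(int brick) {
        if (owner_of(brick) == rank_) return owned_[brick];     // locally owned: no network traffic
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = cache_.find(brick);
        if (it == cache_.end())                                 // miss: ask the owning node for a copy
            it = cache_.emplace(brick, fetch_from_owner(brick)).first;
        return it->second;                                      // hit: reuse the cached copy
    }

    void release(int /*brick*/) { /* unpin; cached copies stay until evicted (not shown) */ }

private:
    Brick fetch_from_owner(int) { return Brick(32 * 1024); }    // stand-in for the request to the owner's communication thread
    int rank_, nnodes_;
    std::unordered_map<int, Brick> owned_, cache_;
    std::mutex mutex_;
};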
20
Which node owns what? (Image: an isosurface of the Visible Female data set, colored to show which node owns each brick.)
21
Acceleration structure. A "macrocell" lists the min and max data values inside it; space leaping over macrocells reduces the number of data accesses. (Diagram: with isovalue 92, macrocells with max=85 or min=100 are missed, while a macrocell with min=89, max=97 must be considered.) Parker et al., Vis '98, "Interactive Ray Tracing for Isosurface Rendering".
22
Acceleration structure. Enter only those macrocells whose min-max range contains the isovalue. (Diagram: with isovalue 92, only the min=89, max=97 macrocell is entered.)
23
Acceleration structure. Recurse inside interesting macrocells until you have to access the actual volume data; a sketch of the test follows. (Diagram: inside the min=89, max=97 macrocell, children such as max=90 are missed or not traversed, while children whose ranges contain 92, such as min=91, max=95 and min=90, max=93, are descended into.)
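A minimal sketch of the macrocell test and recursion, with assumed names (Macrocell, children); the actual traversal in Parker et al. visits cells in the order the ray crosses them, which this simplified version omits.

#include <vector>

struct Macrocell {
    float vmin, vmax;                        // min and max data values inside this cell
    std::vector<Macrocell> children;         // empty at the finest level
};

// True if the isosurface for this isovalue can pass through the cell and real voxel data must be read.
bool needs_voxel_access(const Macrocell& cell, float isovalue) {
    if (isovalue < cell.vmin || isovalue > cell.vmax)
        return false;                        // space leap: skip the whole cell without touching its data
    if (cell.children.empty())
        return true;                         // finest level reached: go to the actual volume data
    for (const Macrocell& child : cell.children)
        if (needs_voxel_access(child, isovalue))
            return true;                     // recurse only inside interesting children
    return false;
}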
24
Data bricking. Use multi-level, 3D tiling for memory coherence: a 64-byte cache-line level and a 4 KB OS-page level. (Diagram: a 2D analogy comparing the linear element order &0 through &15 with the bricked order that keeps neighboring elements close together in memory.) Parker et al., Vis '98, "Interactive Ray Tracing for Isosurface Rendering".
25
Data bricking. Use three-level, 3D tiling for memory coherence: a 64-byte cache line, a 4 KB OS page, and a network-transfer brick of 4 KB x L^3. For the data sets we've tried, a level-three brick size of 32 KB is the best trade-off between data locality and transmission time; a sketch of bricked addressing follows. (Diagram: the same 2D bricking analogy as the previous slide.)
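A sketch of how a bricked (x, y, z) lookup might be computed, with a single level of cubic bricks for clarity (the paper's scheme nests a cache-line level and a page level inside the network-transfer brick, which this simplified version omits): 32^3 one-byte voxels per brick gives the 32 KB transfer size mentioned above.

#include <cstddef>
#include <cstdint>
#include <vector>

struct BrickedVolume {
    static constexpr int B = 32;              // brick edge in voxels; 32^3 bytes = 32 KB per brick
    int nx, ny, nz;                           // volume dimensions in voxels
    std::vector<uint8_t> data;                // bricks stored back to back, each brick contiguous

    size_t index(int x, int y, int z) const {
        int bx = x / B, by = y / B, bz = z / B;                    // which brick
        int ox = x % B, oy = y % B, oz = z % B;                    // offset inside the brick
        size_t bricks_x = (nx + B - 1) / B, bricks_y = (ny + B - 1) / B;
        size_t brick  = (size_t(bz) * bricks_y + by) * bricks_x + bx;
        size_t offset = (size_t(oz) * B + oy) * B + ox;
        return brick * (size_t(B) * B * B) + offset;               // neighboring voxels usually share a brick
    }

    uint8_t at(int x, int y, int z) const { return data[index(x, y, z)]; }
};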
26
Isosurface intersection. We analytically test for ray-isosurface intersection within a voxel by solving a cubic polynomial defined by the ray parameters and the 8 voxel corner values, and we use the data gradient for the surface normal. (Diagram: a voxel whose corner values, 88.1 through 94.0, straddle the isovalue 92.)
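Why that test is a cubic (a standard derivation consistent with Parker et al.; the notation is ours, not from the slide): inside the voxel the field is reconstructed by trilinear interpolation of the 8 corner values,

\rho(u,v,w) = \sum_{i,j,k \in \{0,1\}} \rho_{ijk}\, w_i(u)\, w_j(v)\, w_k(w), \qquad w_0(t) = 1 - t, \quad w_1(t) = t .

Substituting the ray (u,v,w) = \mathbf{o} + t\,\mathbf{d} makes every factor linear in t, so the product is at most cubic:

\rho(\mathbf{o} + t\,\mathbf{d}) = A t^3 + B t^2 + C t + D .

The intersection is the smallest root of A t^3 + B t^2 + C t + D = \rho_{\mathrm{iso}} that lies inside the voxel, and \nabla\rho at that point supplies the surface normal.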
27
Benchmark test. (Plot: the scripted isovalue and viewpoint changes over the 589-frame benchmark path.)
28
Consolidated data access. Most of the time is spent accessing data that is locally owned or cached: the hit time is 7 µs, the miss time is 600 µs, and the hit rate is 98%. Reduce the number of DSM accesses by eliminating redundant accesses: when a ray needs data, sort its accesses to get all of the needed data in one shot.
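A worked number from the figures on this slide: the expected cost of one DSM access is about 0.98 x 7 µs + 0.02 x 600 µs ≈ 18.9 µs, so even at a 98% hit rate every access carries a real cost, and cutting the total number of acquires (shown on the next slides) pays off directly.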
29
Acquire on every voxel corner. (Diagram: a macrocell overlapping bricks 1 through 6.) Accesses = 3,279,000 per worker per frame; frame rate = 0.115 f/s.
30
Reducing redundant acquires. (Diagram: the same macrocell and bricks 1 through 6.) Accesses = 453,400 per worker per frame; frame rate = 0.709 f/s.
31
Reducing acquires further. (Diagram: the same macrocell and bricks 1 through 6.) Accesses = 53,290 per worker per frame; frame rate = 1.69 f/s. A sketch of the consolidation follows.
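A sketch of the consolidation idea behind these numbers, with hypothetical helpers (brick_of, acquire, release, voxel_in_brick stand in for the bricking scheme and the DSM): the eight corner values of a cell are grouped by brick so that each brick is acquired once, rather than once per corner.

#include <array>
#include <cstdint>
#include <map>
#include <vector>

struct Corner { int x, y, z; };

// Toy stand-ins for the real bricking and DSM calls.
int brick_of(const Corner& c) { return (c.x / 32) + 8 * (c.y / 32) + 64 * (c.z / 32); }
static uint8_t fake_brick[32 * 32 * 32];
const uint8_t* acquire(int) { return fake_brick; }        // would be a DSM acquire (possibly remote)
void release(int) {}
uint8_t voxel_in_brick(const uint8_t* d, const Corner& c) {
    return d[((c.z % 32) * 32 + (c.y % 32)) * 32 + (c.x % 32)];
}

// Fetch the 8 corner values of one cell with at most one acquire per distinct brick,
// instead of one acquire per corner.
std::array<uint8_t, 8> corners_consolidated(const std::array<Corner, 8>& corners) {
    std::array<uint8_t, 8> values{};
    std::map<int, std::vector<int>> by_brick;              // brick id -> which corners live in it
    for (int i = 0; i < 8; ++i)
        by_brick[brick_of(corners[i])].push_back(i);
    for (const auto& [brick, idx] : by_brick) {            // one acquire/release per brick touched
        const uint8_t* data = acquire(brick);
        for (int i : idx)
            values[i] = voxel_in_brick(data, corners[i]);
        release(brick);
    }
    return values;
}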
32
Results: cache behaviour. (Plot: measured frame rate [f/s] and transfers [MB/node/frame] versus frame number over the 589-frame run; the isovalue portion and the viewpoint portion of the benchmark meet near frame 290.)
33
Results: cache behaviour. (Plot: frame rate with the cache pre-filled versus the measured frame rate over frames 1 through 290.) Cache fill costs 22% here.
34
Results: machine comparison, full 589-frame run. SGI, 1 x 31 CPUs: avg = 4.7 f/s. Cluster, 31 x 2 CPUs: avg = 1.7 f/s. Cluster, 31 x 1 CPUs: avg = 1.1 f/s.
35
Results: machine comparison, frames 1 through 290. SGI, 1 x 31 CPUs: avg = 5.7 f/s. Cluster, 31 x 2 CPUs: avg = 1.5 f/s. The SGI is 3.8x faster.
36
Results: machine comparison, frames 290 through 589. SGI, 1 x 31 CPUs: avg = 4.2 f/s. Cluster, 31 x 2 CPUs: avg = 2.6 f/s. The SGI is 1.6x faster.
37
Conclusions
- Confirmed that interactive ray tracing on a cluster is possible.
- Scaling is limited by latency, and the number of tiles determines the maximum frame rate.
- Data sets that exceed the memory space of any one node can be handled with a DSM.
- For isosurfacing, the DSM hit time is the limiting factor, not network time.
- Overheads make the cluster slower than the supercomputer, but the new solution has a significant price advantage.
38
Future work: make it faster!
- Use a lower-latency network.
- Remove the central bottleneck.
- Use block prefetch and ray rescheduling.
- Optimize the DSM for faster hit times.
- Use more parallelism: SIMD, hyperthreading, GPU.
39
Acknowledgments. NSF grants 9977218 and 9978099; DOE Views; NIH grants; Mark Duchaineau at LLNL; our anonymous reviewers. For more information: www.cs.utah.edu/~demarle/research or demarle@sci.utah.edu.