

Sort-First, Distributed Memory Parallel Visualization and Rendering
Wes Bethel, R3vis Corporation and Lawrence Berkeley National Laboratory
Parallel Visualization and Graphics Workshop, Sunday October 18, 2003, Seattle, Washington

2 The Actual Presentation Title: Why Distributed Memory Parallel Rendering is a Challenge: Combining OpenRM Scene Graph and Chromium for General Purpose Use on Distributed Memory Clusters

Outline:
- Problem Statement, Desired Outcomes
- Sort-First Parallel Architectural Overview
- The Many Ways I Broke Chromium
- Scene Graph Considerations
- Demo?
- Conclusions

3 Motivation and Problem Statement

The Allure of COTS solutions:
- Performance of COTS GPUs exceeds that of custom silicon.
- Attractive price/performance of COTS platforms (e.g., x86 PCs).
- Gigabit Ethernet is cheap: $100/NIC, $500 for an 8-port switch.
- Can build a screamer cluster for about $2K/node.
- We're accustomed to "nice, friendly" software infrastructure, e.g., hardware-accelerated Xinerama.

Enter Chromium – the means to use a bunch of PCs to do parallel rendering.
- Parallel submission of graphics streams is a "custom solution," and presents challenges.
- Want a flexible, resilient API to interface between parallel visualization/rendering applications and Chromium.

4

5 Our Approach

- Distributed memory parallel visualization application design: amortizes expensive data I/O and visualization across many nodes.
- The scene graph layer mediates interaction between the application and the rendering subsystem: it provides portability, "hides" the icky parallel rendering details, and provides an infrastructure for accelerating rendering.
- Chromium provides routing of graphics commands to support hardware-accelerated rendering on a variety of platforms.
- Focus on COTS solutions: all hardware and software we used is cheap (PC cluster) or free (software).
- Focus on simplicity: our sample applications are straightforward in implementation, easily reproducible by others, and highly portable.
- Want an infrastructure suitable for use regardless of the parallel programming model used by the application: no "parallel objects" in the scene graph!

6 Our Approach, ctd.

7 The Many Ways I Broke Chromium

- Retained-mode object "namespace" collisions in parallel submission.
- Broadcasting: how to burn up bandwidth without even trying!
- Scene Graph Issues: to be discussed in our PVG paper presentation on Monday.

8 The "Collision" Problem

- Want to use OpenGL retained-mode semantics and structures to realize performance gains in a distributed memory environment.
- Problem: "namespace collision" of retained-mode identifiers during parallel submission of graphics commands.
- The problem exists for all OpenGL retained-mode objects: display lists, texture object ids, and programs.
- Example: two submitters each allocate a display list id locally, and both receive the same value:

  Process A:
    GLuint n = glGenLists(1);
    printf("id==%u\n", n); // same id as Process B
    // build list, draw with list

  Process B:
    GLuint n = glGenLists(1);
    printf("id==%u\n", n); // same id as Process A
    // build list, draw with list
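A minimal sketch (in Python, with hypothetical class names, not Chromium's actual code) of why per-process id allocation collides at a shared rendering server: each submitter hands out ids starting from the same value, and a server that keys retained-mode objects by id alone silently clobbers one submitter's object with another's.

```python
class Server:
    """Stands in for a crserver: retained-mode objects keyed only by id."""
    def __init__(self):
        self.display_lists = {}

    def new_list(self, list_id, commands):
        # A second submitter reusing the same id silently overwrites
        # the first submitter's object -- the "collision."
        self.display_lists[list_id] = commands

class Client:
    """Stands in for one parallel submitter; ids are allocated locally."""
    def __init__(self):
        self.next_id = 1

    def gen_list(self):
        list_id = self.next_id
        self.next_id += 1
        return list_id

server = Server()
a, b = Client(), Client()

id_a = a.gen_list()   # both clients allocate independently...
id_b = b.gen_list()   # ...so both get id 1
server.new_list(id_a, ["draw quad A"])
server.new_list(id_b, ["draw quad B"])  # clobbers A's list
```

With four submitters each building one textured quad this way, the server ends up holding a single object, which is exactly the "four textured quads gone wrong" symptom shown on the next slide.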

9 Manifestation of Collision Problem Show image of four textured quads when the problem is present.

10 Desired Result Show image of four textured quads when the problem is fixed.

11 Resolving the Collision Problem

- New CR configuration file options: shared_textures, shared_display_lists, shared_programs.
- When set to 1, beware of collisions in parallel submission.
- When set to 0, collisions are resolved in parallel submission: setting shared_* to zero enforces unique retained-mode identifiers across all parallel submitters.
- Thanks, Brian!
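Conceptually, the shared_* = 0 behavior amounts to namespacing each submitter's ids. A sketch under that assumption (hypothetical names, not Chromium's implementation): the server keys each retained-mode object by (submitter rank, local id) rather than by the local id alone.

```python
class RemappingServer:
    """Keys retained-mode objects by (submitter rank, local id), so
    equal local ids from different submitters no longer collide."""
    def __init__(self):
        self.display_lists = {}

    def new_list(self, rank, local_id, commands):
        self.display_lists[(rank, local_id)] = commands

server = RemappingServer()
server.new_list(rank=0, local_id=1, commands=["draw quad A"])
server.new_list(rank=1, local_id=1, commands=["draw quad B"])
# Both objects survive: the same local id from two submitters
# maps to two distinct server-side objects.
```

The point of doing this in the Chromium layer is that applications keep using ordinary OpenGL id allocation; no application-side renaming is required.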

12 The Broadcast Problem

What's the problem?
- Geometry and textures from N application PEs are replicated across M crservers.
- Bumped into limits of memory and bandwidth.
- To Chromium, a display list is an opaque blob of stuff: tilesort doesn't peek inside a display list to see where it should be sent.

Early performance testing showed two types of broadcasting:
- Display lists being broadcast from one tilesort to all servers.
- Textures associated with textured geometry in display lists being broadcast from one tilesort to all servers.
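The bandwidth cost is simple multiplication. A sketch with hypothetical numbers (not measurements from this work) showing why broadcasting opaque blobs from N submitters to M servers hits network limits:

```python
def traffic_bytes(num_clients, num_servers, blob_bytes, broadcast):
    """Total bytes on the wire for one retained-mode blob per client.
    Broadcasting sends every blob to every server; ideal routing sends
    each blob only to the one server whose tile needs it."""
    copies_per_client = num_servers if broadcast else 1
    return num_clients * copies_per_client * blob_bytes

# Hypothetical configuration: 8 submitters, 24 crservers, 1 MB blob each.
MB = 1 << 20
broadcast_total = traffic_bytes(8, 24, 1 * MB, broadcast=True)
routed_total = traffic_bytes(8, 24, 1 * MB, broadcast=False)
# Broadcasting multiplies per-frame traffic by the server count (24x here).
```

The same multiplier applies to server-side memory, since every crserver retains its own copy of every broadcast blob.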

13 Broadcast Workarounds (Short Term)

- Don't use display lists (for now): immediate mode geometry isn't broadcast. For scientific visualization, which generates lots of geometry, this is clearly a problem.
- Help Chromium decide how to route textures with the GL_OBJECT_BBOX_CR extension. Sorting/routing is accelerated by using GL_OBJECT_BBOX_CR to provide hints to tilesort: it doesn't have to look at all vertices in a geometry blob to make routing decisions.
- Our volume rendering application uses 3D textures (N**3 data) and textured geometry (N**2 data), so the "heavy payload" data isn't broadcast. The cost is immediate mode transmission of geometry (approximately 36 KB/frame of geometry as compared to 160 MB/frame of texture data).
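A sketch (hypothetical helper names, simplified 2D screen-space tiles) of the routing decision a bounding-box hint enables: tilesort can test one box per blob against each server's tile extent instead of inspecting every vertex, and only the intersected servers receive the data.

```python
def overlaps(bbox, tile):
    """Axis-aligned overlap test in screen space; boxes are
    (xmin, ymin, xmax, ymax)."""
    bx0, by0, bx1, by1 = bbox
    tx0, ty0, tx1, ty1 = tile
    return bx0 < tx1 and tx0 < bx1 and by0 < ty1 and ty0 < by1

def route(bbox, tiles):
    """Return indices of the servers whose tile the blob's bbox touches.
    Without a bbox hint, the only safe choice is to send to all tiles."""
    return [i for i, tile in enumerate(tiles) if overlaps(bbox, tile)]

# A 2x2 tiled display, each tile 1024x768 pixels.
tiles = [(0, 0, 1024, 768), (1024, 0, 2048, 768),
         (0, 768, 1024, 1536), (1024, 768, 2048, 1536)]

small_blob = (100, 100, 200, 200)     # lands entirely on tile 0
wide_blob = (900, 100, 1200, 200)     # straddles tiles 0 and 1
```

Here route(small_blob, tiles) selects only tile 0, so a texture bound to that geometry travels to one server instead of four.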

14 Broadcast Workarounds (Long Term)

- Funding for Chromium developers to implement display list caching and routing, similar to the existing capabilities for managing texture objects.
- Lurking problems:
  - "Aging" of display lists: the crserver is like a "roach motel" – display lists check in, but they never check out.
  - Adding retained-mode object aging and management to applications is an unreasonable burden (IMO). There is no commonly accepted mechanism for LRU aging, etc., in the graphics API.
  - Such an aging mechanism will probably show up as an extension. Better as an extension with tunable parameters than requiring applications to "reach deeply" into graphics API implementations.

15 Scene Graph Issues and Performance Analysis

- Discussed in our 2003 PVG paper (Monday afternoon).
- Our "parallel scene graph" implementation can be used by any parallel application, regardless of the parallel programming model used by the application developer.
- The "big issue" in sort-first is how much data is duplicated. What we've seen so far: about 1.8x duplication was required for the first frame (in a hardware-accelerated volume rendering application).
- While the scene graph supports any type of parallel operation, certain types of synchronization are required to ensure correct rendering. These can be achieved using only Chromium barriers – no "parallel objects" in the scene graph are required.
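The duplication figure can be defined concretely. A sketch (hypothetical helper names) of how a sort-first duplication factor is measured: each primitive counts once per tile its screen-space bounds touch, so the factor is total tile assignments divided by primitive count, with 1.0 meaning no duplication.

```python
def tiles_touched(bbox, tiles):
    """Number of tiles a primitive's screen-space bbox overlaps."""
    bx0, by0, bx1, by1 = bbox
    return sum(1 for (tx0, ty0, tx1, ty1) in tiles
               if bx0 < tx1 and tx0 < bx1 and by0 < ty1 and ty0 < by1)

def duplication_factor(bboxes, tiles):
    """Average number of tiles (hence servers) each primitive is sent to."""
    assignments = sum(tiles_touched(b, tiles) for b in bboxes)
    return assignments / len(bboxes)

tiles = [(0, 0, 100, 100), (100, 0, 200, 100)]   # a 1x2 tiled display
bboxes = [(10, 10, 40, 40),    # entirely within tile 0
          (80, 10, 120, 40)]   # straddles both tiles
# duplication_factor(bboxes, tiles) -> 1.5
```

Primitives that straddle tile boundaries drive the factor above 1.0, which is consistent with the ~1.8x first-frame duplication observed in the volume rendering application.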

16 Some Performance Graphs Bandwidth vs. Traffic

17 Parallel Scene Graph API Stuff

Collectives:
- rmPipeSetCommSize(), rmPipeGetCommSize()
- rmPipeSetMyRank(), rmPipeGetMyRank()

Chromium-specific:
- rmPipeBarrierCreateCR(): creates a Chromium barrier; the number of participants is the value specified via rmPipeSetCommSize().
- rmPipeBarrierExecCR(): doesn't block application code execution. Used to synchronize rendering execution among rmPipeGetCommSize() streams of graphics commands.
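A conceptual sketch (hypothetical names, not the OpenRM or Chromium API) of why a barrier-exec call can return without blocking the application: the barrier is enqueued into the graphics stream like any other command, and it is the stream consumers that synchronize when they execute it.

```python
import threading

class Stream:
    """One submitter's graphics command stream; appending never blocks."""
    def __init__(self):
        self.commands = []

    def draw(self, what):
        self.commands.append(("DRAW", what))

    def exec_barrier(self, barrier):
        # Producer side: enqueue a barrier token and return immediately.
        self.commands.append(("BARRIER", barrier))

def consume(stream, log):
    """Consumer side (stands in for a server executing the stream):
    the barrier synchronizes here, across all consuming threads."""
    for op, arg in stream.commands:
        if op == "BARRIER":
            arg.wait()
        else:
            log.append(arg)

streams = [Stream() for _ in range(3)]
barrier = threading.Barrier(3)
for i, s in enumerate(streams):
    s.draw(f"frame-0 from rank {i}")
    s.exec_barrier(barrier)   # returns at once; no app-side blocking

log = []
threads = [threading.Thread(target=consume, args=(s, log)) for s in streams]
for t in threads: t.start()
for t in threads: t.join()
```

No consumer passes its barrier token until all three streams reach theirs, which is the rendering-side synchronization the slide describes, achieved without any blocking in the submitting application.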

18 Demo Applications: Parallel Isosurface

19 Demo Application: Parallel Volume Rendering

20 Demo Application: Parallel Volume Rendering with LOD Volumes

21 Conclusions

We met our objectives:
- General purpose infrastructure for doing parallel visualization and hardware-accelerated rendering on PC clusters.
- The infrastructure can be used by any parallel application, regardless of the parallel programming model.
- The architecture scaled well from one to 24 displays, supporting extremely high-resolution output (e.g., 7860x4096).
- We bumped into network bandwidth limits (not a big surprise).
- Display lists are still broadcast in Chromium. Please fund the Chromium developers to add this much-needed capability, which is fundamental for efficient sort-first operation of clusters.

22 Sources of Software

- OpenRM Scene Graph:
- Source code for OpenRM+Chromium applications:
- Chromium: chromium.sourceforge.net

23 Acknowledgement

This work was supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research under SBIR grant DE-FE03-02ER. The authors wish to thank Randall Frank of the ASCI/VIEWS program at Lawrence Livermore National Laboratory and the Scientific Computing and Imaging Institute at the University of Utah for use of computing facilities during the course of this research. The Argon shock bubble dataset was provided courtesy of John Bell and Vince Beckner at the Center for Computational Sciences and Engineering, Lawrence Berkeley National Laboratory.