cudaMPI and glMPI: Message Passing for GPGPU Clusters. Orion Sky Lawlor, U. Alaska Fairbanks, 2009-10-13.


1 cudaMPI and glMPI: Message Passing for GPGPU Clusters Orion Sky Lawlor U. Alaska Fairbanks

2 Where is Fairbanks, Alaska?

3 Fairbanks in summer: 80F

4 Fairbanks in winter: -20F

5 CSAR Retrospective. CSAR work: Charm++ and the FEM Framework; Mesh Modification; Solution Transfer.

Charm++ FEM Framework Handles parallel details in the runtime; leaves physics and numerics to the user. Presents a clean, "almost serial" interface: one call to update cross-processor boundaries. Not just for Finite Element computations! (renamed "ParFUM") Now supports ghost cells. Builds on top of AMPI or native MPI. Allows use of advanced Charm++ features: adaptive parallel computation; dynamic, automatic load balancing. Other libraries: collision, adaptation, visualization, ...

FEM Mesh Partitioning: Serial to Parallel

FEM Mesh Partitioning: Communication Summing forces from other processors only takes one call: FEM_Update_field Can also access values in ghost elements

Charm++ FEM Framework Users Rocflu fluids solver, a part of GENX [Haselbacher] Frac3D mechanics/fracture code [Geubelle] DG, metal solidification/dendritic growth implicit solver [Danzig] CPSD spacetime meshing code [Haber/Erickson]

Charm++ Collision Detection Detect collisions (intersections) between objects scattered across processors. Built on Charm++ Arrays. Overlay a regular, sparse 3D grid of voxels (boxes). Send objects to all voxels they touch. Collide voxels independently and collect results. Leaves collision response to the caller.

Collision Detection Algorithm: Sparse 3D voxel grid

Remeshing As the solids burn away, the domain changes dramatically ALE: Fluids mesh expands ALE: Solids mesh contracts This distorts the elements of the mesh We need to be able to fix the deformed mesh

Initial mesh consists of small, good quality tetrahedra

As solids burn away, fluids domain becomes more and more stretched

Significant degradation in element quality compared to original!

Remeshing: Solution Transfer Can use existing (off-the-shelf) tools to remesh our domain Must also handle solution data Density, velocity, displacement fields Gas pressure/temperature Boundary conditions! Accurate transfer of solution data is a difficult mathematical problem Solution data (and mesh) are scattered across processors

Remeshing and Solution Transfer FEM: reassemble a serial boundary mesh. Call serial remeshing tools: YAMS, TetMesh(tm). FEM: partition the new serial mesh. Collision Library: match up old and new volume meshes. Transfer Library: conservative, accurate volume-to-volume data transfer using the Jiao method.

Original mesh after deformation

Closeup: before remeshing-- note stretched elements on boundary

After remeshing-- better element size and shape

Remeshing restores element size and quality

Fluids Gas Velocity Transfer [figure: deformed mesh vs. new mesh]

Remeshing: Continue the Simulation Theory: just start the simulation using the new mesh, solution, and boundaries! In practice: not so easy with the real code (genx). Lots of little pieces: fluids, solids, combustion, interface, ... Each has its own set of needs and input file formats! Prototype: treat remeshing like a restart. The remeshing system writes mesh, solution, and boundaries to ordinary restart files. The integrated code thinks this is an ordinary restart. Few changes needed inside genx.

Can now continue simulation using new mesh (prototype)

Remeshed simulation resolves boundary better than old!

Remeshing: Future Work Automatically decide when to remesh (currently manual). Remesh the solids domain (currently only Rocflu is supported). Remesh during a real parallel run (currently only a serial data format is supported). Remesh without using restart files. Remesh only part of the domain, e.g. a burning crack (currently remeshes the entire domain at once). Remesh without using serial tools (currently YAMS and TetMesh are completely serial).

Rocrem: Remeshing For Rocstar Fully parallel mesh search & solution transfer. Built on the Charm++ parallel collision detection framework. Scalable to millions of elements -- full 3.8m-tet Titan IV mesh. Supports node- and element-centered data. Supports node- and face-centered boundary conditions, preserved during remeshing and data transfer. Unique method to support transfer along boundaries: boundary extrusion. Integrated with Rocket simulation data formats. [figure: Titan IV boundary conditions]

Parallel Online Mesh Modification 2D triangle mesh refinement (3D on the way!) [figure: refinement on 1, 2, and 4 processors] Longest-edge bisection [Rivara et al]. Fully parallel implementation. Works with real application code [Geubelle]. Refines solution data. Updates parallel communication and external boundary conditions.
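The serial core of longest-edge bisection, finding a triangle's longest edge and its midpoint, can be sketched as follows (the parallel version must also refine solution data and update communication lists; the names and layout here are mine):

```c
#include <assert.h>

/* 2D vertex (illustrative layout). */
typedef struct { double x, y; } pt;

static double dist2(pt a, pt b) {
    double dx = a.x - b.x, dy = a.y - b.y;
    return dx*dx + dy*dy;
}

/* Return the index (0..2) of the triangle's longest edge, where edge i
   is the edge opposite vertex i, and write its midpoint to *mid.
   Bisecting at that midpoint splits the triangle into two. */
static int longest_edge(const pt v[3], pt *mid) {
    int best = 0;
    double len = -1.0;
    for (int i = 0; i < 3; i++) {
        double d = dist2(v[(i+1)%3], v[(i+2)%3]);
        if (d > len) { len = d; best = i; }
    }
    pt a = v[(best+1)%3], b = v[(best+2)%3];
    mid->x = 0.5*(a.x + b.x);
    mid->y = 0.5*(a.y + b.y);
    return best;
}
```

For the 3-4-5 right triangle the hypotenuse (opposite the right-angle vertex) is chosen, which is what keeps element quality bounded under repeated refinement.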

38 New Work: GPGPU. Outline: Modern GPU Hardware; Quantitative GPU Performance; GPU-Network Interfacing; cudaMPI and glMPI Performance; MPIglut: Powerwall Parallelization; Conclusions & Future Work.

Modern GPU Hardware

40 Latency, or Bandwidth? time = latency + size / bandwidth. To minimize latency: caching (memory), speculation (branches, memory). To maximize bandwidth: clockrate (since 1970), pipelining (since 1985), parallelism (since 2004). For software, everything's transparent except parallelism!
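The model on this slide is a one-liner in C; the numbers used in the usage note below are illustrative round figures, not the talk's measurements:

```c
#include <assert.h>

/* time = latency + size / bandwidth, in nanoseconds.
   Note 1 GB/s == 1 byte/ns, so bytes / (GB/s) gives nanoseconds. */
static double xfer_ns(double latency_ns, double bytes, double gb_per_s) {
    return latency_ns + bytes / gb_per_s;
}
```

For an 8-byte transfer at 4000 ns latency and 80 GB/s, the bandwidth term is 0.1 ns and latency utterly dominates; for an 8 MB transfer the bandwidth term is 100,000 ns and dominates instead. That asymmetry is why batching matters.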

41 CPU vs GPU Design Targets Typical CPU optimizes for instruction execution latency Millions of instructions / thread Less than 100 software threads Hardware caches, speculation Typical GPU optimizes for instruction execution bandwidth Thousands of instructions / thread Over 1 million software threads Hardware pipelining, queuing, request merging

42 Modern GPUs: Good Things Fully programmable: branches, loops, functions, etc. IEEE 32-bit float storage. Huge parallelism in hardware: a typical card has 100 cores, each core threaded many ways, with hardware thread switching (SMT). E.g., the $250 GTX 280 has 240 cores, >900 gigaflops peak. The GPU is best in flops, flops/watt, flops/$ [Datta... Yelick 2008]

43 Modern GPUs: Bad Things Programming is really difficult. Branch & load divergence penalties. Non-IEEE NaNs; 64-bit much slower. Huge parallelism in software: need millions of separate threads! Usual problems: decomposition, synchronization, load balance. E.g., the $250 GTX 280 has *only* 1GB of storage, 4GB/s to host RAM. The GPU is definitely a mixed blessing for real applications.

44 Modern GPU Programming Model “Hybrid” GPU-CPU model GPU cranks out the flops CPU does everything else Run non-GPU legacy or sequential code Disk, network I/O Interface with OS and user Supported by CUDA, OpenCL, OpenGL, DirectX,... Likely to be the dominant model for the foreseeable future CPU may eventually “wither away”

45 GPU Programming Interfaces OpenCL (2008): hardware independent (in theory); SDK still difficult to find. NVIDIA's CUDA (2007): mixed C++/GPU source files; mature driver, widespread use. OpenGL GLSL, DirectX HLSL (2004): supported on NVIDIA and ATI on a wide range of hardware; graphics-centric interface (pixels).

46 NVIDIA CUDA Fully programmable (loops, etc). Read and write memory anywhere. Memory accesses must be "coalesced" for high performance. Zero protection against multithreaded race conditions. The "texture cache" can only be accessed via restrictive CUDA arrays. Manual control over a small shared memory region near the ALUs. The CPU calls a GPU "kernel" with a "block" of threads. Only runs on new NVIDIA hardware.

47 OpenGL: GL Shading Language Fully programmable (loops, etc). Can only read "textures", only write to the "framebuffer" (2D arrays). Reads go through the "texture cache", so performance is good (iff locality). Attach a non-read texture to the framebuffer, so you cannot have synchronization bugs! Today, greyscale GL_LUMINANCE texture pixels store floats, not GL_RGBA 4-vectors (zero penalty!). Rich selection of texture filtering (array interpolation) modes. GLSL runs on nearly every GPU.

48 Modern GPU Fast Stuff Divide, sqrt, rsqrt, exp, pow, sin, cos, tan, dot: all sub-nanosecond. Fast like arithmetic, not a library call. Branches are fast iff: they're short (a few instructions), or they're coherent across threads. GPU memory accesses are fast iff: they're satisfied by the "texture cache", or they're coalesced across threads (contiguous and aligned). GPU "coherent" ≈ MPI "collective"

49 Modern GPU Slow Stuff Not enough work per kernel >4us latency for kernel startup Need >>1m flops / kernel! Divergence: costs up to 97%! Noncoalesced memory accesses Adjacent threads fetching distant regions of memory Divergent branches Adjacent threads running different long batches of code Copies to/from host >10us latency, <4GB/s bandwidth

Quantitative GPU Performance

51 CUDA: Memory Output Bandwidth Measure speed for various array sizes and numbers of threads per block. Build a trivial CUDA test harness:

__global__ void write_arr(float *arr) {
    arr[threadIdx.x+blockDim.x*blockIdx.x]=0.0;
}
int main() {
    ... start timer
    write_arr<<<nBlocks,threadsPerBlock>>>(arr);
    ... stop timer
}

52 CUDA: Memory Output Bandwidth NVIDIA GeForce GTX 280 [plot: output bandwidth (gigafloats/second) vs. number of floats/kernel and threads per block]

53 CUDA: Memory Output Bandwidth [same plot] Over 64K blocks needed

54 CUDA: Memory Output Bandwidth [same plot] Kernel Startup Time Dominates

55 CUDA: Memory Output Bandwidth [same plot] Block Startup Time Dominates

56 CUDA: Memory Output Bandwidth [same plot] Good Performance

57 CUDA: Memory Output Bandwidth NVIDIA GeForce GTX 280, fixed 128 threads per block

58 CUDA: Memory Output Bandwidth NVIDIA GeForce GTX 280, fixed 128 threads per block Kernel startup latency: 4us Kernel output bandwidth: 80 GB/s

59 CUDA: Memory Output Bandwidth NVIDIA GeForce GTX 280, fixed 128 threads per block Kernel startup latency: 4us Kernel output bandwidth: 80 GB/s t = 4us / kernel + bytes / 80 GB/s

60 CUDA: Memory Output Bandwidth NVIDIA GeForce GTX 280, fixed 128 threads per block Kernel startup latency: 4us Kernel output bandwidth: 80 GB/s t = 4us / kernel + bytes * 0.0125 ns / byte

61 CUDA: Memory Output Bandwidth NVIDIA GeForce GTX 280, fixed 128 threads per block Kernel startup latency: 4us Kernel output bandwidth: 80 GB/s t = 4000ns / kernel + bytes * 0.0125 ns / byte

62 CUDA: Memory Output Bandwidth NVIDIA GeForce GTX 280, fixed 128 threads per block t = 4000ns / kernel + bytes * 0.0125 ns / byte WHAT!? time/kernel = 320,000 x time/byte!?

63 CUDA: Device-to-Host Memcopy NVIDIA GeForce GTX 280, pinned memory Copy startup latency: 10us Copy output bandwidth: 1.7 GB/s t = 10000ns / kernel + bytes * 0.59 ns / byte

64 CUDA: Device-to-Host Memcopy NVIDIA GeForce GTX 280, pinned memory t = 10000ns / kernel + bytes * 0.59 ns / byte

65 CUDA: Memory Output Bandwidth NVIDIA GeForce GTX 280, fixed 128 threads per block t = 4000ns / kernel + bytes * 0.0125 ns / byte

66 Performance Analysis Take-Home GPU kernels: high latency, and high bandwidth. GPU-CPU copy: high latency, but low bandwidth. High latency means use big batches (not many small ones). Low copy bandwidth means you MUST leave most of your data on the GPU (can't copy it all out). Similar lessons as network tuning! (Message combining, local storage.)
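Plugging in the fitted numbers above (4 us startup + 80 GB/s for kernels, 10 us startup + 1.7 GB/s for copies), the break-even size, below which latency dominates and operations should be batched, falls out directly. A small sketch (the function name is mine):

```c
#include <assert.h>

/* Break-even size for a latency+bandwidth cost model
   t = latency_ns + bytes * ns_per_byte: the size at which the
   latency term equals the bandwidth term. Below it, startup
   overhead dominates, so combine work into bigger batches. */
static double breakeven_bytes(double latency_ns, double ns_per_byte) {
    return latency_ns / ns_per_byte;
}
```

For kernels, 4000 ns / 0.0125 ns/byte = 320 KB; for device-to-host copies, 10000 ns / 0.59 ns/byte ≈ 17 KB. Anything smaller is pure overhead, which is exactly the "use big batches" lesson.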

Why GPGPU clusters?

68 Parallel GPU Advantages Multiple GPUs: more compute & storage bandwidth

69 Parallel GPU Communication Shared memory, or messages? History: on the CPU, shared memory won for small machines; messages won for big machines. Shared-memory GPUs: Tesla. Message-passing GPUs: clusters. Easy to make in a small department: e.g., $2,500 for ten GTX 280's (8 TF!). Or both, like NCSA Lincoln. Messages are the lowest common denominator: they run easily on shared memory (not vice versa!)

70 Parallel GPU Application Interface On the CPU, for message passing the Message Passing Interface (MPI) is standard and good. Design by committee means if you need it, it's already in there! E.g., noncontiguous derived datatypes, communicator splitting, buffer mgmt. So let's steal that! Explicit send and receive calls to communicate GPU data on the network.

71 Parallel GPU Implementation Should we send and receive from GPU code, or CPU code? GPU code seems more natural for GPGPU (the data is on the GPU already). CPU code makes it easy to call MPI. The network and the GPU-CPU data copy have high latency, so message combining is crucial! Hard to combine efficiently on the GPU (atomic list addition?) [Owens]. But easy to combine data on the CPU.

72 Parallel GPU Send and Receives cudaMPI_Send is called on the CPU with a pointer to GPU memory: cudaMemcpy into a (pinned) CPU communication buffer, then MPI_Send the communication buffer. cudaMPI_Recv is similar: MPI_Recv to a (pinned) CPU buffer, then cudaMemcpy off to GPU memory. cudaMPI_Bcast, Reduce, Allreduce, etc. are all similar: copy data to the CPU, call MPI (easy!)
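A minimal plain-C sketch of this staging pattern, with memcpy standing in for both cudaMemcpy and the MPI calls (the fake "network" buffer, names, and sizes are all illustrative stand-ins, not cudaMPI's real implementation):

```c
#include <assert.h>
#include <string.h>

enum { STAGING_BYTES = 1 << 16 };
static char staging[STAGING_BYTES];  /* plays the pinned CPU buffer */
static char network[STAGING_BYTES];  /* plays the wire */

/* cudaMPI_Send pattern: stage GPU data into a pinned CPU buffer,
   then hand that buffer to MPI. */
static void fake_cudaMPI_Send(const char *gpu_buf, size_t n) {
    assert(n <= (size_t)STAGING_BYTES);
    memcpy(staging, gpu_buf, n);   /* stands in for cudaMemcpy D->H */
    memcpy(network, staging, n);   /* stands in for MPI_Send */
}

/* cudaMPI_Recv pattern: receive into the pinned CPU buffer,
   then copy up to the GPU. */
static void fake_cudaMPI_Recv(char *gpu_buf, size_t n) {
    assert(n <= (size_t)STAGING_BYTES);
    memcpy(staging, network, n);   /* stands in for MPI_Recv */
    memcpy(gpu_buf, staging, n);   /* stands in for cudaMemcpy H->D */
}
```

The point of the pattern is that the CPU owns the staging buffer, so message combining and ordinary MPI semantics come for free; the cost is the extra copy, which the performance model above shows is dominated by latency for small messages anyway.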

73 Asynchronous Send and Receives cudaMemcpy, MPI_Send, and MPI_Recv all block the CPU. cudaMemcpyAsync, MPI_Isend, and MPI_Irecv don't: they return a handle you occasionally poll instead. Sadly, cudaMemcpyAsync has much higher latency than the blocking version (40us vs 10us). cudaMPI_Isend and Irecv exist, but usually are not worth using!

74 OpenGL Communication It's easy to write corresponding glMPI_Send and glMPI_Recv. glMPI_Send: glReadPixels, then MPI_Send. glMPI_Recv: MPI_Recv, then glDrawPixels. Sadly, glDrawPixels is very slow: only 250MB/s! glReadPixels is over 1GB/s. glTexSubImage2D from a pixel buffer object is much faster (>1GB/s).

75 CUDA or OpenGL? cudaMemcpy requires a contiguous memory buffer. glReadPixels takes an arbitrary rectangle of texture data: rows, columns, squares all equally fast! Still, there's no way to simulate MPI's rich noncontiguous derived data types.
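The row-by-row packing that glReadPixels does for you, and that a cudaMemcpy user must do by hand (or delegate to cudaMemcpy2D), looks roughly like this sketch (names are mine):

```c
#include <assert.h>
#include <string.h>

/* Pack a w x h subrectangle of a row-major img_w-wide float image
   into a contiguous buffer, one row at a time. This is the kind of
   gather a contiguous-only copy API forces on the caller. */
static void pack_rect(const float *img, int img_w,
                      int x0, int y0, int w, int h, float *out) {
    for (int row = 0; row < h; row++)
        memcpy(out + row*w,
               img + (size_t)(y0 + row)*img_w + x0,
               (size_t)w * sizeof(float));
}
```

h small strided copies like this are exactly the noncontiguous case MPI derived datatypes describe declaratively; with plain cudaMemcpy each row is a separate transfer unless packed first.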

76 Noncontiguous Communication [diagram: network data buffer scattered into a GPU target buffer] Scatter kernel? Fold into the read?

cudaMPI and glMPI Performance

78 Parallel GPU Performance [diagram: two CPU+GPU nodes] Networks are really low bandwidth! Graphics card memory: 80 GB/s. GPU-CPU copy: 1.7 GB/s. Gigabit Ethernet: 0.1 GB/s.

79 Parallel GPU Performance: CUDA Ten GTX 280 GPUs over gigabit Ethernet; CUDA

80 Parallel GPU Performance: OpenGL Ten GTX 280 GPUs over gigabit Ethernet; OpenGL

Conclusions

82 Conclusions Consider OpenGL or DirectX instead of CUDA or OpenCL. Watch out for latency on the GPU! Combine data for kernels & copies. Use cudaMPI and glMPI for free: Future work: more real applications! GPU-JPEG network compression. Load balancing by migrating pixels. DirectXMPI? cudaSMP?

Related work: MPIglut

MPIglut: Motivation ● All modern computing is parallel: multi-core CPUs and clusters (Athlon 64 X2, Intel Core2 Duo); multiple multi-unit GPUs (nVidia SLI, ATI CrossFire); multiple displays, disks, ... ● But languages and many existing applications are sequential. Software problem: run existing serial code on a parallel machine. Related: easily write parallel code.

What is a "Powerwall"? ● A powerwall has: several physical display devices, one large virtual screen, i.e. "parallel screens" ● UAF CS/Bioinformatics Powerwall: twenty LCD panels, 9000 x 4500 pixels combined resolution, 35+ megapixels

Sequential OpenGL Application

Parallel Powerwall Application

MPIglut: The basic idea ● Users compile their OpenGL/glut application using MPIglut, and it “just works” on the powerwall ● MPIglut's version of glutInit runs a separate copy of the application for each powerwall screen ● MPIglut intercepts glutInit, glViewport, and broadcasts user events over the network ● MPIglut's glViewport shifts to render only the local screen
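The per-screen coordinate shift can be sketched as plain arithmetic. This is a simplification (MPIglut actually adjusts the OpenGL projection matrix rather than the viewport origin alone), and the struct and function names are illustrative:

```c
#include <assert.h>

/* A rectangle in pixels: origin plus size (illustrative layout). */
typedef struct { int x, y, w, h; } rect;

/* Given the viewport the application requested in global powerwall
   coordinates and this backend's screen rectangle, return the
   viewport to hand to OpenGL so only this screen's pixels are drawn
   into the local window. */
static rect local_viewport(rect global_vp, rect screen) {
    rect r = { global_vp.x - screen.x,
               global_vp.y - screen.y,
               global_vp.w, global_vp.h };
    return r;
}
```

A backend whose screen starts at global x=1800 renders the full-wall 9000-pixel-wide viewport shifted left by 1800, so its window shows exactly its slice; the application never sees local coordinates.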

MPIglut Conversion: Original Code

#include <GL/glut.h>
void display(void) {
    glBegin(GL_TRIANGLES); ... glEnd();
    glutSwapBuffers();
}
void reshape(int x_size,int y_size) {
    glViewport(0,0,x_size,y_size);
    glLoadIdentity();
    gluLookAt(...);
}
...
int main(int argc,char *argv[]) {
    glutInit(&argc,argv);
    glutCreateWindow("Ello!");
    glutMouseFunc(...);
    ...
}

MPIglut: Required Code Changes

#include <GL/mpiglut.h>
... (the rest of the code is unchanged) ...

This is the only source change. Or, you can just copy mpiglut.h over your old glut.h header!

MPIglut Runtime Changes: Init [same code as above] MPIglut starts a separate copy of the program (a "backend") to drive each powerwall screen

MPIglut Runtime Changes: Events [same code as above] Mouse and other user input events are collected and sent across the network. Each backend gets identical user events (collective delivery)

MPIglut Runtime Changes: Sync [same code as above] Frame display is (optionally) synchronized across the cluster

MPIglut Runtime Changes: Coords [same code as above] User code works only in global coordinates, but MPIglut adjusts OpenGL's projection matrix to render only the local screen

MPIglut Runtime Non-Changes [same code as above] MPIglut does NOT intercept or interfere with rendering calls, so programmable shaders, vertex buffer objects, framebuffer objects, etc. all run at full performance

MPIglut Assumptions/Limitations ● Each backend app must be able to render its part of its screen Does not automatically imply a replicated database, if application uses matrix-based view culling ● Backend GUI events (redraws, window changes) are collective All backends must stay in synch Automatic for applications that are deterministic function of events Non-synchronized: files, network, time

MPIglut: Bottom Line ● Tiny source code change ● Parallelism hidden inside MPIglut: the application still "feels" sequential ● Fairly major runtime changes: serial code now runs in parallel (!) Multiple synchronized backends running in parallel. User input events go across the network. The OpenGL rendering coordinate system is adjusted per-backend. But rendering calls are left alone.

Load Balancing a Powerwall ● Problem: the sky is really easy, the terrain is really hard ● Solution: move the rendering for load balance, but you've got to move the finished pixels back for display!

Future Work: Load Balancing ● AMPIglut: principle of persistence should still apply ● But need a cheap way to ship back finished pixels every frame ● Exploring GPU JPEG compression: DCT + quantize is really easy; Huffman/entropy is really hard; probably need a CPU/GPU split [figure: throughput in MB/s inside the GPU, MB/s on the CPU, 100+ MB/s on the network]