1 Parallel Rendering In the GPU Era Orion Sky Lawlor U. Alaska Fairbanks 2009-04-16 8.

Slides:



Advertisements
Similar presentations
Programming with OpenGL - Getting started - Hanyang University Han Jae-Hyek.
Advertisements

COMPUTER GRAPHICS SOFTWARE.
COMPUTER GRAPHICS CS 482 – FALL 2014 NOVEMBER 10, 2014 GRAPHICS HARDWARE GRAPHICS PROCESSING UNITS PARALLELISM.
Scalable Multi-Cache Simulation Using GPUs Michael Moeng Sangyeun Cho Rami Melhem University of Pittsburgh.
GPU Virtualization on VMware’s Hosted I/O Architecture Micah Dowty Jeremy Sugerman USENIX WIOV
Control Flow Virtualization for General-Purpose Computation on Graphics Hardware Ghulam Lashari Ondrej Lhotak University of Waterloo.
Portable Image File Viewer ENEE 408G: Multimedia Signal Processing Seun Fabayo John Glancy Gordon Krauthamer.
Real-Time Geometric and Color Calibration for Multi-Projector Displays Christopher Larson, Aditi Majumder Large-Area High Resolution Displays Motivation.
DDDDRRaw: A Prototype Toolkit for Distributed Real-Time Rendering on Commodity Clusters Thu D. Nguyen and Christopher Peery Department of Computer Science.
Network coding on the GPU Péter Vingelmann Supervisor: Frank H.P. Fitzek.
Parallel Rendering Ed Angel
Programming with OpenGL Part 1: Background Mohan Sridharan Based on slides created by Edward Angel CS4395: Computer Graphics 1.
Hardware and Software Basics. Computer Hardware  Central Processing Unit - also called “The Chip”, a CPU, a processor, or a microprocessor  Memory (RAM)
Sort-Last Parallel Rendering for Viewing Extremely Large Data Sets on Tile Displays Paper by Kenneth Moreland, Brian Wylie, and Constantine Pavlakos Presented.
Under the Hood: 3D Pipeline. Motherboard & Chipset PCI Express x16.
1 Perception, Illusion and VR HNRS 299, Spring 2008 Lecture 19 Other Graphics Considerations Review.
Visualization Linda Fellingham, Ph. D Manager, Visualization and Graphics Sun Microsystems Shared Visualization 1.1 Software Scalable Visualization 1.1.
1 KIPA Game Engine Seminars Jonathan Blow Seoul, Korea November 29, 2002 Day 4.
Computer System Architectures Computer System Software
Chapter 2 IT Foundation Data: facts about objects Store data in computer: – binary data – bits – bytes Five types of data.
Electronic Visualization Laboratory University of Illinois at Chicago “Sort-First, Distributed Memory Parallel Visualization and Rendering” by E. Wes Bethel,
Chep06 1 High End Visualization with Scalable Display System By Dinesh M. Sarode, S.K.Bose, P.S.Dhekne, Venkata P.P.K Computer Division, BARC, Mumbai.
Parallel Rendering 1. 2 Introduction In many situations, standard rendering pipeline not sufficient ­Need higher resolution display ­More primitives than.
Introduction to Computers By: Najam Khan What we will learn about: Hardware: The term used to describe the physical parts of a computer. Ex. The box,
CSE 381 – Advanced Game Programming Basic 3D Graphics
Computer Graphics Graphics Hardware
Cg Programming Mapping Computational Concepts to GPUs.
BASS Application Sharing System Omer Boyaci September 10,
CS 450: COMPUTER GRAPHICS REVIEW: INTRODUCTION TO COMPUTER GRAPHICS – PART 2 SPRING 2015 DR. MICHAEL J. REALE.
1 GPGPU Clusters Hardware Performance Model & SW: cudaMPI, glMPI, and MPIglut Orion Sky Lawlor U. Alaska Fairbanks
Seminar II: Rendering Architectures Yan Cui Love Joy Mendoza Oscar Kozlowski John Tang.
Gregory Fotiades.  Global illumination techniques are highly desirable for realistic interaction due to their high level of accuracy and photorealism.
Graphics Systems and OpenGL. Business of Generating Images Images are made up of pixels.
Introduction to Parallel Rendering Jian Huang, CS 594, Spring 2002.
Hardware. Make sure you have paper and pen to hand as you will need to take notes and write down answers and thoughts that you can refer to later on.
1 Graphics CSCI 343, Fall 2015 Lecture 4 More on WebGL.
CS 4363/6353 OPENGL BACKGROUND. WHY IS THIS CLASS SO HARD TO TEACH? (I’LL STOP WHINING SOON) Hardware (GPUs) double in processing power ever 6 months!
1 Introduction to Computer Graphics with WebGL Ed Angel Professor Emeritus of Computer Science Founding Director, Arts, Research, Technology and Science.
Stream Processing Main References: “Comparing Reyes and OpenGL on a Stream Architecture”, 2002 “Polygon Rendering on a Stream Architecture”, 2000 Department.
Parallel Rendering. 2 Introduction In many situations, a standard rendering pipeline might not be sufficient ­Need higher resolution display ­More primitives.
1 cudaMPI and glMPI: Message Passing for GPGPU Clusters Orion Sky Lawlor U. Alaska Fairbanks
GRAPHICS PIPELINE & SHADERS SET09115 Intro to Graphics Programming.
Computer Graphics II University of Illinois at Chicago Volume Rendering Presentation for Computer Graphics II Prof. Andy Johnson By Raj Vikram Singh.
GIS in the cloud: implementing a Web Map Service on Google App Engine Jon Blower Reading e-Science Centre University of Reading United Kingdom
Based on paper by: Rahul Khardekar, Sara McMains Mechanical Engineering University of California, Berkeley ASME 2006 International Design Engineering Technical.
A Neural Network Implementation on the GPU By Sean M. O’Connell CSC 7333 Spring 2008.
1 MPIglut: Powerwall Programming made Easier Dr. Orion Sky Lawlor U. Alaska Fairbanks WSCG '08,
COMPUTER GRAPHICS CS 482 – FALL 2015 SEPTEMBER 29, 2015 RENDERING RASTERIZATION RAY CASTING PROGRAMMABLE SHADERS.
Ray Tracing using Programmable Graphics Hardware
What are shaders? In the field of computer graphics, a shader is a computer program that runs on the graphics processing unit(GPU) and is used to do shading.
3/12/2013Computer Engg, IIT(BHU)1 PARALLEL COMPUTERS- 2.
3/12/2013Computer Engg, IIT(BHU)1 CUDA-3. GPGPU ● General Purpose computation using GPU in applications other than 3D graphics – GPU accelerates critical.
GPU Computing for GIS James Mower Department of Geography and Planning University at Albany.
1 E. Angel and D. Shreiner: Interactive Computer Graphics 6E © Addison-Wesley 2012 Models and Architectures 靜宜大學 資訊工程系 蔡奇偉 副教授 2012.
An Introduction to the Cg Shading Language Marco Leon Brandeis University Computer Science Department.
GLSL Review Monday, Nov OpenGL pipeline Command Stream Vertex Processing Geometry processing Rasterization Fragment processing Fragment Ops/Blending.
1 Network Access to Charm Programs: CCS Orion Sky Lawlor 2003/10/20.
COMP 175 | COMPUTER GRAPHICS Remco Chang1/XX13 – GLSL Lecture 13: OpenGL Shading Language (GLSL) COMP 175: Computer Graphics April 12, 2016.
Image Fusion In Real-time, on a PC. Goals Interactive display of volume data in 3D –Allow more than one data set –Allow fusion of different modalities.
1 Chapter 2: Operating-System Structures Services Interface provided to users & programmers –System calls (programmer access) –User level access to system.
Computer Graphics (Fall 2003) COMS 4160, Lecture 5: OpenGL 1 Ravi Ramamoorthi Many slides courtesy Greg Humphreys.
Our Graphics Environment Landscape Rendering. Hardware  CPU  Modern CPUs are multicore processors  User programs can run at the same time as other.
Computer Graphics Graphics Hardware
GPU Architecture and Its Application
COMPUTER GRAPHICS CHAPTER 38 CS 482 – Fall 2017 GRAPHICS HARDWARE
What is GPU? how does it work?
VirtualGL.
Chapter 6 GPU, Shaders, and Shading Languages
Computer Graphics Graphics Hardware
High Performance Computing
Presentation transcript:

1 Parallel Rendering In the GPU Era Orion Sky Lawlor U. Alaska Fairbanks

2 Importance of Computer Graphics “The purpose of computing is insight, not numbers!” R. Hamming Vision is a key tool for analyzing and understanding the world Your eyes are your brain’s highest bandwidth input device Vision: >300MB/s 1600x bit 60Hz Sound: <1 MB/s 44KHz 24-bit 5.1 Surround sound Touch: <1 KB/s (?) ‏ Smell/taste: <10 per second Plus, pictures look really cool...

Prior work: GPUs, NetFEM, impostors

4 GPU Rendering Drawbacks Graphics cards are fast But not at rendering lots of tiny geometry: 1M primitives/frame OK 1G pixels/frame OK 1G primitives/frame not OK Problems with billions of primitives do not utilize current graphics hardware well Graphics cards only have a few gigabytes of RAM (vs. parallel machine, with terabytes of RAM) ‏

5 Graphics Card: Usable Fill Rate Small triangles Large triangles NVIDIA GeForce 8800M GTS

6 Parallel Rendering Advantages Multiple processors can render geometry simultaneously Achieved rendering speedup for large particle dataset Can store huge datasets in memory BUT: No display on parallel machine! Ignores cost of shipping images to client 48 nodes of Hal cluster: 2-way 550MHz Pentium III nodes connected with fast ethernet

7 Parallel Rendering Disadvantage Parallel Machine Desktop Machine Display 100 MB/s Gigabit Ethernet 100 GB/s Graphics Card Memory Link to client is too slow! Cannot ship frames to client at full framerate/ full resolution WAY TOO SLOW!

8 Basic model: NetFEM Serial OpenGL Client Parallel FEM Framework Server Client connects Server sends client the current FEM mesh (nodes and elements) ‏ Includes all attributes Client can display, rotate, examine Not just for postmortem! Making movies on the fly Dumping simulation output Monitoring running simulation ‏

9 NetFEM: visualization tool Connect to running parallel machine See, e.g., wave dispersion off a crack

10 Impostors : Basic Idea Camera Impostor Geometry

11 Parallel Impostors Technique Key observation: impostor images don’t depend on one another So render impostors in parallel! Uses the speed and memory of the parallel machine Fine grained-- lots of potential parallelism Geometry is partitioned by impostors No “shared model” assumption Reassemble world on serial client Uses rendering bandwidth of client graphics card Impostor reuse cuts required network bandwidth to client Only update images when necessary Impostors provide latency tolerance

12 Client/Server Architecture Parallel machine can be anywhere on network Keeps the problem geometry Renders and ships new impostors as needed Impostors shipped using TCP/IP sockets CCS & PUP protocol [Jyothi and Lawlor 04] Works over NAT/firewalled networks Client sits on user’s desk Sends server new viewpoints Receives and displays new impostors

13 Client Architecture Latency tolerance: client never waits for server Displays existing impostors at fixed framerate Even if they’re out of date Prefers spatial error (due to out of date impostor) to temporal error (due to dropped frames) ‏ Implementation uses OpenGL for display Two separate kernel threads for network handling

New work: liveViz pixel transport

15 Basic model: LiveViz Serial 2D Client Parallel Charm++ Server Client connects Server sends client the current 2D image pixels (just pixels) ‏ Can be from a 3D viewpoint (liveViz3D mode) ‏ Can be color (RGB) or grayscale Recently extended to support JPEG compressed network transport Big win on slow networks!

LiveViz – What is it? Charm++ library Visualization tool Inspect your program’s current state Java client runs on any machine You code the image generation 2D and 3D modes

LiveViz Request Model LiveViz Server Library Client GUI LiveViz Application Client sends request Server code broadcasts request to application Application array element render image pieces Server code assembles full 2D image Server sends 2D image back to client Client displays image

LiveViz Request Model LiveViz Server Library Client GUI LiveViz Application Client sends request Server code broadcasts request to application Application array element render image pieces Server code assembles full 2D image Server sends 2D image back to client Client displays image Bottleneck!

LiveViz Compressed requests LiveViz Server Library Client GUI LiveViz Application Client sends request Server code broadcasts request to application Application array element render image pieces Server code assembles full 2D image Server compresses 2D image to a JPEG Server sends JPEG to client Client decompresses and displays image

LiveViz Compressed requests On a gigabit network, JPEG compression is CPU-bound, and just slows us down! Compression hence optional

LiveViz Compressed requests On a slow 2MB/s wireless or WAN network, uncompressed liveViz is network bound Here, JPEG data transport is a big win!

New work: Cosmology Rendering

23 Large astrophysics simulation (Quinn et al) ‏ >=50M particles >=20 bytes/particle => 1 GB of data Large Particle Dataset

24 Rendering process (in principle) ‏ For each pixel: Find maximum mass along 3D ray Look up mass in color table Large Particle Rendering

25 Rendering process (in practice) ‏ For each particle: Project 3D particle onto 2D screen Keep maximum mass at each pixel Ship image to client Apply color table to 2D image at client Large Particle Rendering

26 Large Particle Rendering (2D) ‏

27 Rendering process (in practice) ‏ For each particle: Project 3D particle onto 2D screen Keep maximum mass at each pixel Ship image to client Apply color table to 2D image at client Large Particle Rendering (2D) ‏

28 Particle Set to Volume Impostors

29 Shipping Volume Impostors Slices of 3D Volume Stack of 2D Slices

30 Shipping Volume Impostors Stack of 2D Slices Hey, that's just a 2D image! So we can use liveViz: Render slices in parallel Assemble slices across processors (Optionally) JPEG compress image Ship across network to (new) client

31 Volume Impostors Technique 2D impostors are flat, and can't rotate 3D voxel dataset can be rendered from any viewpoint on the client Practical problem: Render voxels into a 2D image on the client by drawing slices with OpenGL Store maximum across all slices: glBlendEquation(GL_MAX); To look up (rendered) maximum in color table, render slices to texture and run a programmable shader

32 Volume Impostors: GLSL Code GLSL code to look up the rendered color in our color table texture: varying vec2 texcoords; uniform sampler2D rendered, color_table; void main() ‏ { vec4 rend=texture2D(rendered,texcoords ); gl_FragColor = texture2D(color_table, vec2(rend.r+0.5/255,0)); } Frustration: color table values don't interpolate (use GL_NEAREST, but it's pretty blocky!) ‏

New Work: MPIglut

MPIglut: Motivation ● All modern computing is parallel Multi-Core CPUs, Clusters Athlon 64 X2, Intel Core2 Duo Multiple Multi-Unit GPUs nVidia SLI, ATI CrossFire Multiple Displays, Disks,... ● But languages and many existing applications are sequential Software problem: run existing serial code on a parallel machine Related: easily write parallel code

What is a “Powerwall”? ● A powerwall has: Several physical display devices One large virtual screen I.E. “parallel screens” ● UAF CS/Bioinformatics Powerwall Twenty LCD panels 9000 x 4500 pixels combined resolution 35+ Megapixels

Sequential OpenGL Application

Parallel Powerwall Application

MPIglut: The basic idea ● Users compile their OpenGL/glut application using MPIglut, and it “just works” on the powerwall ● MPIglut's version of glutInit runs a separate copy of the application for each powerwall screen ● MPIglut intercepts glutInit, glViewport, and broadcasts user events over the network ● MPIglut's glViewport shifts to render only the local screen

MPIglut uses glut sequential code ● GL Utilities Toolkit Portable window, event, and GUI functionality for OpenGL apps De facto standard for small apps Several implementations: Mark Kilgard original, FreeGLUT,... Totally sequential library, until now! ● MPIglut intercepts several calls But many calls still unmodified We run on a patched freeglut 2.4 Minor modification to window creation

Parallel Rendering Taxonomy ● Molnar's influential 1994 paper Sort-first: send geometry across network before rasterization (GLX/DMX, Chromium) ‏ Sort-middle: send scanlines across network during rasterization Sort-last: send rendered pixels across the network after rendering (Charm++ liveViz, IBM's Scalable Graphics Engine, ATI CrossFire) ‏

Parallel Rendering Taxonomy ● Expanded taxonomy: Send-event (MPIglut, VR Juggler) ‏ Send only user events (mouse clicks, keypresses). Just kilobytes/sec! Send-database Send application-level primitives, like terrain model. Can cache/replicate data! Send-geometry (Molnar sort-first) ‏ Send-scanlines (Molnar sort-middle) ‏ Send-pixels (Molnar sort-last) ‏

MPIglut Code & Runtime Changes

MPIglut Conversion: Original Code #include void display(void) { glBegin(GL_TRIANGLES);... glEnd(); glutSwapBuffers(); } void reshape(int x_size,int y_size) { glViewport(0,0,x_size,y_size); glLoadIdentity(); gluLookAt(...); }... int main(int argc,char *argv[]) { glutInit(&argc,argv); glutCreateWindow(“Ello!”); glutMouseFunc(...);... }

MPIglut: Required Code Changes #include void display(void) { glBegin(GL_TRIANGLES);... glEnd(); glutSwapBuffers(); } void reshape(int x_size,int y_size) { glViewport(0,0,x_size,y_size); glLoadIdentity(); gluLookAt(...); }... int main(int argc,char *argv[]) { glutInit(&argc,argv); glutCreateWindow(“Ello!”); glutMouseFunc(...);... } This is the only source change. Or, you can just copy mpiglut.h over your old glut.h header!

MPIglut Runtime Changes: Init #include void display(void) { glBegin(GL_TRIANGLES);... glEnd(); glutSwapBuffers(); } void reshape(int x_size,int y_size) { glViewport(0,0,x_size,y_size); glLoadIdentity(); gluLookAt(...); }... int main(int argc,char *argv[]) { glutInit(&argc,argv); glutCreateWindow(“Ello!”); glutMouseFunc(...);... } MPIglut starts a separate copy of the program (a “backend”) to drive each powerwall screen

MPIglut Runtime Changes: Events #include void display(void) { glBegin(GL_TRIANGLES);... glEnd(); glutSwapBuffers(); } void reshape(int x_size,int y_size) { glViewport(0,0,x_size,y_size); glLoadIdentity(); gluLookAt(...); }... int main(int argc,char *argv[]) { glutInit(&argc,argv); glutCreateWindow(“Ello!”); glutMouseFunc(...);... } Mouse and other user input events are collected and sent across the network. Each backend gets identical user events (collective delivery) ‏

MPIglut Runtime Changes: Sync #include void display(void) { glBegin(GL_TRIANGLES);... glEnd(); glutSwapBuffers(); } void reshape(int x_size,int y_size) { glViewport(0,0,x_size,y_size); glLoadIdentity(); gluLookAt(...); }... int main(int argc,char *argv[]) { glutInit(&argc,argv); glutCreateWindow(“Ello!”); glutMouseFunc(...);... } Frame display is (optionally) synchronized across the cluster

MPIglut Runtime Changes: Coords #include void display(void) { glBegin(GL_TRIANGLES);... glEnd(); glutSwapBuffers(); } void reshape(int x_size,int y_size) { glViewport(0,0,x_size,y_size); glLoadIdentity(); gluLookAt(...); }... int main(int argc,char *argv[]) { glutInit(&argc,argv); glutCreateWindow(“Ello!”); glutMouseFunc(...);... } User code works only in global coordinates, but MPIglut adjusts OpenGL's projection matrix to render only the local screen

MPIglut Runtime Non-Changes #include void display(void) { glBegin(GL_TRIANGLES);... glEnd(); glutSwapBuffers(); } void reshape(int x_size,int y_size) { glViewport(0,0,x_size,y_size); glLoadIdentity(); gluLookAt(...); }... int main(int argc,char *argv[]) { glutInit(&argc,argv); glutCreateWindow(“Ello!”); glutMouseFunc(...);... } MPIglut does NOT intercept or interfere with rendering calls, so programmable shaders, vertex buffer objects, framebuffer objects, etc all run at full performance

MPIglut Assumptions/Limitations ● Each backend app must be able to render its part of its screen Does not automatically imply a replicated database, if application uses matrix-based view culling ● Backend GUI events (redraws, window changes) are collective All backends must stay in synch Automatic for applications that are deterministic function of events Non-synchronized: files, network, time

MPIglut: Bottom Line ● Tiny source code change ● Parallelism hidden inside MPIglut Application still “feels” sequential ● Fairly major runtime changes Serial code now runs in parallel (!) ‏ Multiple synchronized backends running in parallel User input events go across network OpenGL rendering coordinate system adjusted per-backend But rendering calls are left alone

MPIglut Application Performance

Performance Testing ● MPIglut programs perform about the same on 20 screens as they do on 1 screen ● We compared performance against two other packages for running unmodified OpenGL apps: DMX: OpenGL GLX protocol interception and replication (MPIglut gets screen sizes via DMX) ‏ Chromium: libgl OpenGL rendering call interception and routing

Benchmark Applications soar UAF CS Bioinformatics Powerwall Switched Gigabit Ethernet Interconnect 10 Dual-Core 2GB Linux Machines: 7 nVidia QuadroFX nVidia QuadroFX 1400

MPIglut Performance

Chromium Tilesort Performance

Gigabit Ethernet Network Saturated!

DMX Performance

MPIglut Conclusions ● MPIglut: an easy route to high- performance parallel rendering ● Hiding parallelism inside a library is a broadly-applicable technique THREADirectX? OpenMPQt? ● Still much work to do: Multicore / multi-GPU support Need better GPGPU support (tiles, ghost edges, load balancing) ‏ Need load balancing (AMPIglut!) ‏

Load Balancing a Powerwall ● Problem: Sky really easy Terrain really hard ● Solution: Move the rendering for load balance, but you've got to move the finished pixels back for display!

Future Work: Load Balancing ● AMPIglut: principle of persistence should still apply ● But need cheap way to ship back finished pixels every frame ● Exploring GPU JPEG compression DCT + quantize: really easy Huffman/entropy: really hard Probably need a CPU/GPU split MB/s inside GPU MB/s on CPU 100+ MB/s on network