Efficient Data Parallel Computing on GPUs
Cliff Woolley
University of Virginia / NVIDIA

Overview
– Data Parallel Computing
– Computational Frequency
– Profiling and Load Balancing

Data Parallel Computing

Data Parallel Computing
Vector Processing – small-scale parallelism
Data Layout – large-scale parallelism

Learning by example: A really naïve shader

frag2frame Smooth(vert2frag IN,
                  uniform samplerRECT Source   : texunit0,
                  uniform samplerRECT Operator : texunit1,
                  uniform samplerRECT Boundary : texunit2,
                  uniform float4 params)
{
    frag2frame OUT;
    float2 center = IN.TexCoord0.xy;
    float4 U = f4texRECT(Source, center);

    // Calculate Red-Black (odd-even) masks
    float2 intpart;
    float2 place = floor(1.0f - modf(round(center + float2(0.5f, 0.5f)) / 2.0f, intpart));
    float2 mask = float2((1.0f-place.x) * (1.0f-place.y), place.x * place.y);

    if (((mask.x + mask.y) && params.y) || (!(mask.x + mask.y) && !params.y)) {
        float2 offset = float2(params.x*center.x - 0.5f*(params.x-1.0f),
                               params.x*center.y - 0.5f*(params.x-1.0f));
        ...
        float4 neighbor = float4(center.x - 1.0f, center.x + 1.0f,
                                 center.y - 1.0f, center.y + 1.0f);
        float central = -2.0f*(O.x + O.y);
        float poisson = ((params.x*params.x)*U.z +
                         (-O.x * f1texRECT(Source, float2(neighbor.x, center.y)) +
                          -O.x * f1texRECT(Source, float2(neighbor.y, center.y)) +
                          -O.y * f1texRECT(Source, float2(center.x, neighbor.z)) +
                          -O.z * f1texRECT(Source, float2(center.x, neighbor.w)))) / O.w;
        OUT.COL.x = poisson;
    }
    ...
    return OUT;
}

Data Parallel Computing: Vector Processing

float2 offset = float2(params.x*center.x - 0.5f*(params.x-1.0f),
                       params.x*center.y - 0.5f*(params.x-1.0f));
float4 neighbor = float4(center.x - 1.0f, center.x + 1.0f,
                         center.y - 1.0f, center.y + 1.0f);

Data Parallel Computing: Vector Processing

float2 offset = center.xy - 0.5f;
offset = offset * params.xx + 0.5f;   // MADR is cool too
float4 neighbor = center.xxyy + float4(-1.0f, 1.0f, -1.0f, 1.0f);

Data Parallel Computing: Data Layout
Pack scalar data into RGBA in texture memory
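One way to picture the packing (a minimal host-side C++ sketch, not from the original slides; the 2x2-block layout and the float-texture extension are assumptions): each RGBA texel stores four neighboring scalar values, so one texture fetch in the fragment program returns four useful numbers instead of one.

// Hypothetical packing sketch: grid is an n-by-n scalar field with n even;
// each RGBA texel of the m-by-m texture holds a 2x2 block of the grid.
#include <GL/gl.h>
#include <GL/glext.h>
#include <vector>

void uploadPackedGrid(const std::vector<float>& grid, int n)
{
    const int m = n / 2;
    std::vector<float> packed(4 * m * m);
    for (int j = 0; j < m; ++j) {
        for (int i = 0; i < m; ++i) {
            float* texel = &packed[4 * (j * m + i)];
            texel[0] = grid[(2*j    ) * n + (2*i    )];  // R
            texel[1] = grid[(2*j    ) * n + (2*i + 1)];  // G
            texel[2] = grid[(2*j + 1) * n + (2*i    )];  // B
            texel[3] = grid[(2*j + 1) * n + (2*i + 1)];  // A
        }
    }
    // Assumes ARB_texture_rectangle and ARB_texture_float are available,
    // matching the samplerRECT / f4texRECT usage in the shaders shown here.
    glTexImage2D(GL_TEXTURE_RECTANGLE_ARB, 0, GL_RGBA32F_ARB,
                 m, m, 0, GL_RGBA, GL_FLOAT, &packed[0]);
}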

Computational Frequency

Precompute

Computational Frequency: Precomputing texcoords
Take advantage of under-utilized hardware
– vertex processor
– rasterizer
Reduce instruction count at the per-fragment level
Avoid lookups being treated as texture indirections

Computational Frequency: Precomputing texcoords

vert2frag smooth(app2vert IN,
                 uniform float4x4 xform : C0,
                 uniform float2 srcoffset,
                 uniform float size)
{
    vert2frag OUT;
    OUT.position  = mul(xform, IN.position);
    OUT.center    = IN.center;
    OUT.redblack  = IN.center - srcoffset;
    OUT.operator  = size*(OUT.redblack - 0.5f) + 0.5f;
    OUT.hneighbor = IN.center.xxyx + float4(-1.0f, 1.0f, 0.0f, 0.0f);
    OUT.vneighbor = IN.center.xyyy + float4(0.0f, -1.0f, 1.0f, 0.0f);
    return OUT;
}

Computational Frequency: Precomputing other values
Same deal! Factor other computations out:
– Anything that varies linearly across the geometry
– Anything that has a complex value computed per-vertex
– Anything that is uniform across the geometry

Computational Frequency: Precomputing on the CPU
Use glMultiTexCoord4f() creatively
Extract as much uniformity from uniform parameters as you can
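A hedged reading of "use glMultiTexCoord4f() creatively" (the GL call is real; the particular constants and names below are illustrative assumptions): anything that is the same for every fragment of a pass can be folded on the CPU and handed in through a spare texture-coordinate set, so the rasterizer delivers it to the fragment program for free.

// Hypothetical per-pass draw call: values that are constant across the whole
// quad (here the squared grid spacing and a source offset) are computed once
// here instead of once per fragment.
#include <GL/gl.h>
#include <GL/glext.h>

void drawSolverQuad(float w, float h, float gridSpacing,
                    float srcOffsetX, float srcOffsetY)
{
    const float h2 = gridSpacing * gridSpacing;  // one multiply per pass, not per pixel

    glBegin(GL_QUADS);
    // Texture unit 1 carries the precomputed constants; set once, it applies
    // to all four vertices and therefore to every fragment of the quad.
    glMultiTexCoord4f(GL_TEXTURE1, h2, srcOffsetX, srcOffsetY, 0.0f);
    glMultiTexCoord2f(GL_TEXTURE0, 0.0f, 0.0f); glVertex2f(0.0f, 0.0f);
    glMultiTexCoord2f(GL_TEXTURE0, w,    0.0f); glVertex2f(w,    0.0f);
    glMultiTexCoord2f(GL_TEXTURE0, w,    h);    glVertex2f(w,    h);
    glMultiTexCoord2f(GL_TEXTURE0, 0.0f, h);    glVertex2f(0.0f, h);
    glEnd();
}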

Computational Frequency: Precomputed lookup tables

// Calculate Red-Black (odd-even) masks
float2 intpart;
float2 place = floor(1.0f - modf(round(center + 0.5f) / 2.0f, intpart));
float2 mask = float2((1.0f-place.x) * (1.0f-place.y), place.x * place.y);
if (((mask.x + mask.y) && params.y) || (!(mask.x + mask.y) && !params.y)) {
...

Computational Frequency: Precomputed lookup tables

half4 mask = f4texRECT(RedBlack, IN.redblack);
/*
 * mask.x and mask.w tell whether IN.center.x and IN.center.y
 * are both odd or both even, respectively. Either of these two
 * conditions indicates that the fragment is red. params.x==1
 * selects red; params.y==1 selects black.
 */
if (dot(mask, params.xyyx)) {
...
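A sketch of how such a lookup texture might be generated on the host (the texture name RedBlack comes from the shader above; the exact per-channel encoding is an assumption, chosen to match the shader comment and the dot(mask, params.xyyx) test):

// Hypothetical construction of the RedBlack mask texture: for texel (i, j),
// R flags "both coordinates odd" and A flags "both even" (the red cells),
// while G and B flag the two mixed-parity (black) cases.
#include <GL/gl.h>
#include <GL/glext.h>
#include <vector>

void uploadRedBlackTexture(int width, int height)
{
    std::vector<float> mask(4 * width * height);
    for (int j = 0; j < height; ++j) {
        for (int i = 0; i < width; ++i) {
            float* t = &mask[4 * (j * width + i)];
            const bool xOdd = (i & 1) != 0, yOdd = (j & 1) != 0;
            t[0] = ( xOdd &&  yOdd) ? 1.0f : 0.0f;  // R: both odd      -> red
            t[1] = ( xOdd && !yOdd) ? 1.0f : 0.0f;  // G: x odd, y even -> black
            t[2] = (!xOdd &&  yOdd) ? 1.0f : 0.0f;  // B: x even, y odd -> black
            t[3] = (!xOdd && !yOdd) ? 1.0f : 0.0f;  // A: both even     -> red
        }
    }
    // Assumes ARB_texture_rectangle, to match samplerRECT in the shader.
    glTexImage2D(GL_TEXTURE_RECTANGLE_ARB, 0, GL_RGBA, width, height,
                 0, GL_RGBA, GL_FLOAT, &mask[0]);
}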

Computational Frequency: Avoid inner-loop branching
Fast-path vs. slow-path
– write several variants of each fragment program to handle boundary cases
– eliminates conditionals in the fragment program
– equivalent to avoiding CPU inner-loop branching
(Diagram: slow path with boundaries vs. fast path, no boundaries)

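On the host side, the fast-path/slow-path split typically becomes two program bindings per pass. A minimal sketch, assuming a drawQuad helper and two already-compiled Cg programs (none of which come from the original slides):

// Hedged sketch: the interior is drawn with the branch-free fast-path
// program, and only the one-pixel border strips run the boundary-handling
// slow-path program. Assumes the fragment profile is already enabled
// (e.g., via cgGLEnableProfile) and programs are loaded.
#include <Cg/cg.h>
#include <Cg/cgGL.h>

extern void drawQuad(float x0, float y0, float x1, float y1);  // emits texcoords + vertices

void smoothPass(CGprogram fastProgram, CGprogram slowProgram, int w, int h)
{
    // Interior: the vast majority of fragments, no conditionals in the shader.
    cgGLBindProgram(fastProgram);
    drawQuad(1.0f, 1.0f, w - 1.0f, h - 1.0f);

    // Boundary strips: few fragments, so the extra shader work is cheap.
    cgGLBindProgram(slowProgram);
    drawQuad(0.0f,     0.0f,     (float)w, 1.0f);       // bottom row
    drawQuad(0.0f,     h - 1.0f, (float)w, (float)h);   // top row
    drawQuad(0.0f,     1.0f,     1.0f,     h - 1.0f);   // left column
    drawQuad(w - 1.0f, 1.0f,     (float)w, h - 1.0f);   // right column
}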

Profiling and Load Balancing

Software profiling
GPU pipeline profiling
GPU load balancing

Profiling and Load Balancing: Software profiling
Run a standard software profiler!
– Rational Quantify
– Intel VTune

Profiling and Load Balancing: GPU pipeline profiling
This is where it gets tricky. Some tools exist to help you:
– NVPerfHUD
– Apple OpenGL Profiler

Profiling and Load Balancing: GPU load balancing
This is a whole talk in and of itself
– e.g., http://developer.nvidia.com/docs/IO/8343/Performance-Optimisation.pdf
Sometimes you can get more hints from third parties than from the vendors themselves

Conclusions

Conclusions
– Get used to thinking in terms of vector computation
– Understand how frequently each computation will run, and reduce that frequency wherever possible
– Track down bottlenecks in your application, and shift work to other parts of the system that are idle

Questions?
Acknowledgements
– Pat Brown and Nick Triantos at NVIDIA
– NVIDIA for having given me a job this summer
– Dave Luebke, my advisor
– GPGPU course presenters
– NSF Award #