Balancing the Graphics Pipeline for Optimal Performance

Slides:

Advertisements

Similar presentations

DirectX11 Performance Reloaded

Advertisements

COMPUTER GRAPHICS CS 482 – FALL 2014 NOVEMBER 10, 2014 GRAPHICS HARDWARE GRAPHICS PROCESSING UNITS PARALLELISM.

Multi-Level Caches Vittorio Zaccaria. Preview What you have seen: Data organization, Associativity, Cache size Policies -- how to manage the data once.

Optimized Stencil Shadow Volumes

Understanding the graphics pipeline Lecture 2 Original Slides by: Suresh Venkatasubramanian Updates by Joseph Kider.

Graphics Pipeline.

Status – Week 257 Victor Moya. Summary GPU interface. GPU interface. GPU state. GPU state. API/Driver State. API/Driver State. Driver/CPU Proxy. Driver/CPU.

RealityEngine Graphics Kurt Akeley Silicon Graphics Computer Systems.

Graphics Hardware CMSC 435/634. Transform Shade Clip Project Rasterize Texture Z-buffer Interpolate Vertex Fragment Triangle A Graphics Pipeline.

Real-Time Rendering TEXTURING Lecture 02 Marina Gavrilova.

Status – Week 243 Victor Moya. Summary Current status. Current status. Tests. Tests. XBox documentation. XBox documentation. Post Vertex Shader geometry.

Z-Buffer Optimizations Patrick Cozzi Analytical Graphics, Inc.

Z-Buffer Optimizations Patrick Cozzi Analytical Graphics, Inc.

The programmable pipeline Lecture 10 Slide Courtesy to Dr. Suresh Venkatasubramanian.

Mapping Computational Concepts to GPU’s Jesper Mosegaard Based primarily on SIGGRAPH 2004 GPGPU COURSE and Visualization 2004 Course.

3D Rendering & Algorithms__ Sean Reichel & Chester Gregg a.k.a. “The boring stuff happening behind the video games you really want to play right now.”

AGD: 5. Game Arch.1 Objective o to discuss some of the main game architecture elements, rendering, and the game loop Animation and Games Development.

Sprite Batching and Texture Atlases Randy Gaul. Overview Batches Sending data to GPU Texture atlases Premultiplied alpha Note: Discussion on slides is.

GPU Graphics Processing Unit. Graphics Pipeline Scene Transformations Lighting & Shading ViewingTransformations Rasterization GPUs evolved as hardware.

University of Texas at Austin CS 378 – Game Technology Don Fussell CS 378: Computer Game Technology Beyond Meshes Spring 2012.

High Performance in Broad Reach Games Chas. Boyd

© Copyright Khronos Group, Page 1 Harnessing the Horsepower of OpenGL ES Hardware Acceleration Rob Simpson, Bitboys Oy.

REAL-TIME VOLUME GRAPHICS Christof Rezk Salama Computer Graphics and Multimedia Group, University of Siegen, Germany Eurographics 2006 Real-Time Volume.

GPU Programming Robert Hero Quick Overview (The Old Way) Graphics cards process Triangles Graphics cards process Triangles Quads.

NVIDIA PROPRIETARY AND CONFIDENTIAL Occlusion (HP and NV Extensions) Ashu Rege.

® GDC’99 Performance Tuning with Intel ® Graphics Tools Larry Wickstrom Sr. Software Engineer Judith Stanley Application Engineer Intel Corporation March.

Chris Kerkhoff Matthew Sullivan 10/16/2009.  Shaders are simple programs that describe the traits of either a vertex or a pixel.  Shaders replace a.

OpenGL ES Performance (and Quality) on the GoForce5500 Handheld GPU Lars M. Bishop, NVIDIA Developer Technologies.

OpenGL Performance John Spitzer. 2 OpenGL Performance John Spitzer Manager, OpenGL Applications Engineering

CS 638, Fall 2001 Multi-Pass Rendering The pipeline takes one triangle at a time, so only local information, and pre-computed maps, are available Multi-Pass.

The programmable pipeline Lecture 3.

Stream Processing Main References: “Comparing Reyes and OpenGL on a Stream Architecture”, 2002 “Polygon Rendering on a Stream Architecture”, 2000 Department.

GRAPHICS PIPELINE & SHADERS SET09115 Intro to Graphics Programming.

Graphics Performance: Balancing the Rendering Pipeline Cem Cebenoyan and Matthias Wloka.

Xbox MB system memory IBM 3-way symmetric core processor ATI GPU with embedded EDRAM 12x DVD Optional Hard disk.

09/16/03CS679 - Fall Copyright Univ. of Wisconsin Last Time Environment mapping Light mapping Project Goals for Stage 1.

CSCE 552 Spring D Models By Jijun Tang. Triangles Fundamental primitive of pipelines  Everything else constructed from them  (except lines and.

Computer Graphics 3 Lecture 6: Other Hardware-Based Extensions Benjamin Mora 1 University of Wales Swansea Dr. Benjamin Mora.

Maths & Technologies for Games Graphics Optimisation - Batching CO3303 Week 5.

What are shaders? In the field of computer graphics, a shader is a computer program that runs on the graphics processing unit(GPU) and is used to do shading.

Mesh Skinning Sébastien Dominé. Agenda Introduction to Mesh Skinning 2 matrix skinning 4 matrix skinning with lighting Complex skinning for character.

Shadows David Luebke University of Virginia. Shadows An important visual cue, traditionally hard to do in real-time rendering Outline: –Notation –Planar.

UW EXTENSION CERTIFICATE PROGRAM IN GAME DEVELOPMENT 2 ND QUARTER: ADVANCED GRAPHICS The GPU.

The Graphics Pipeline Revisited Real Time Rendering Instructor: David Luebke.

VAR/Fence: Using NV_vertex_array_range and NV_fence Cass Everitt.

© Copyright 3Dlabs, Page 1 - PROPRIETARY & CONFIDENTIAL Virtual Textures Texture Management in Silicon Chris Hall Director, Product Marketing 3Dlabs.

COMP 175 | COMPUTER GRAPHICS Remco Chang1/XX13 – GLSL Lecture 13: OpenGL Shading Language (GLSL) COMP 175: Computer Graphics April 12, 2016.

1 Geometry for Game. Geometry Geometry –Position / vertex normals / vertex colors / texture coordinates Topology Topology –Primitive »Lines / triangles.

COMPUTER GRAPHICS CHAPTER 38 CS 482 – Fall 2017 GRAPHICS HARDWARE

- Introduction - Graphics Pipeline

Discrete Techniques.

Week 2 - Friday CS361.

Patrick Cozzi University of Pennsylvania CIS Fall 2013

Computer Graphics Index Buffers

A Crash Course on Programmable Graphics Hardware

Graphics Processing Unit

Deferred Lighting.

Chapter 6 GPU, Shaders, and Shading Languages

From Turing Machine to Global Illumination

The Graphics Rendering Pipeline

CS451Real-time Rendering Pipeline

Graphics Hardware CMSC 491/691.

Introduction to Programmable Hardware

Graphics Processing Unit

UMBC Graphics for Games

Introduction to geometry instancing

UMBC Graphics for Games

RADEON™ 9700 Architecture and 3D Performance

Computer Graphics Introduction to Shaders

CIS 6930: Chip Multiprocessor: GPU Architecture and Programming

Presentation transcript:

Balancing the Graphics Pipeline for Optimal Performance Cyril Zeller

Overview GPU CPU The bottleneck determines the throughput In general, the bottleneck varies over the course of a frame In a pipeline architecture, getting good performance is about finding and eliminating bottlenecks GPU CPU application potential bottleneck driver

Locating and eliminating bottlenecks For each stage, vary its workload If pipeline throughput varies → It is the bottleneck Eliminating: Decrease workload of the bottleneck: Increase workload of the non-bottleneck stages: workload workload workload

Graphics rendering pipeline video memory on-chip cache memory vertex shading (T&L) pre-TnL cache geometry system memory commands post-TnL cache CPU triangle setup rasterization texture cache textures fragment shading and raster operations frame buffer

Graphics rendering pipeline bottlenecks video memory on-chip cache memory transfer limited vertex shading (T&L) pre-TnL cache transform limited geometry system memory commands post-TnL cache setup limited CPU triangle setup texture limited raster limited rasterization CPU limited fragment shader limited texture cache textures fragment shading and raster operations frame buffer frame buffer limited

Graphics rendering pipeline bottlenecks The term “transform bound” often means the bottleneck is “anywhere before the rasterizer” The term “fill bound” often means the bottleneck is “anywhere after setup” Can be both transform and fill bound over the course of a single frame!

Bottleneck measurement Vary frame buffer (and / or other render targets) color depth (16-bit to 32-bit) If frame rate varies, application is frame buffer limited Otherwise, vary texture sizes and / or texture filtering If frame rate varies, application is texture limited Otherwise, vary resolution If frame rate varies, Vary the number of instructions of your fragment programs If frame rate varies, application is fragment shader limited Otherwise, application is raster limited Otherwise, vary the number of instructions of your vertex programs If frame rate varies, application is vertex shader limited Otherwise, vary vertex format size and / or AGP transfer rate If frame rate varies, application is transfer limited Otherwise, application is CPU limited

Overall optimization: Indexing, batching, sorting Use indexed primitives (strips > fans or lists) Only way to use the pre- and post-TnL cache! Re-order vertices to be sequential in use To maximize cache usage! Eliminate small batches: Use thousands of vertices per vertex buffer/array Draw >500 triangles per call Use degenerate triangles to join strips together Use texture page Use a vertex shader to batch instanced geometry Sort objects front to back Sort batches per texture and render states

Overall optimization: Occlusion query Use occlusion query to protect vertex and pixel throughput: Multi-pass rendering: During the first pass, attach a query to every object If not enough pixels have been drawn for an object, skip the subsequent passes Rough visibility determination: Draw a quad with a query to know how much of the sun is visible for lens flare Draw a bounding box with a query to know if a portal or a complex object is visible and if not, skip its rendering

Overall optimization: Beware of resource locking! A call that locks a resource (Lock, glReadPixels) is potentially blocking if misplaced: CPU is idling, waiting for the GPU to flush Avoid it if possible Otherwise place it so that the GPU has time to flush: CPU GPU Render to texture N Lock texture N Idle Render to texture N+1 Lock texture N+1 Render to texture N Lock texture N-1 Render to texture N+1 Lock texture N Render to texture N Render to texture N+1

CPU bottlenecks: Causes and solutions Application limited: Game logic, AI, network, file I/O Graphics should be limited to simple culling and sorting Driver or API limited: It usually means something is wrong! Off the fast path Pathological use of the API To confirm that the application is CPU or GPU limited, vary the CPU and / or the GPU clock. Most graphics applications are CPU limited Use VTune (Intel performance analyzer) Driver should spend most of its time idling

GPU bottlenecks – transfer: Causes and solutions Too much data transferred across the AGP bus: Useless data Solution: Eliminate useless vertex attributes Use 16-bit indices instead of 32-bit if possible Too much traffic of dynamic vertices Solution: Decrease number of dynamic vertices by using vertex shaders to animate static vertices, for example Poor management of dynamic data Solution: Use the right API calls Overloaded video memory Solution: Make sure frame buffer, textures and static vertex buffers fit into video memory

GPU bottlenecks – transfer: Causes and solutions Data transferred in an inadequate format: Vertex size should be multiples of 32 bytes Solution: Adjust vertex size to multiples of 32 bytes by: Compressing components and use the vertex shader to decompress Or padding to next multiple Non sequential use of vertices (pre-TnL cache) Solution: Re-order vertices to be sequential in use Use NVTriStrip

Optimizing geometry transfer - OpenGL Instead of immediate mode or ordinary vertex arrays, it is better to use one of the following extensions: EXT_draw_range_elements (in OpenGL 1.3) simple, and gets great perf benefits when ( end – start )  count EXT_compiled_vertex_array lock/unlock objects for drawing objects still must be copied every frame ATI_vertex_array_object useful for static objects obviates per-frame copy of static objects NV_vertex_array_range and NV_vertex_array_range2 great for static and dynamic data rules for effective use can be somewhat complex use GL_VERTEX_ARRAY_RANGE_WITHOUT_FLUSH_NV

Optimizing geometry transfer - Direct3D Static geometry: Create a write-only vertex buffer and only write to it once Dynamic geometry: Lock with DISCARD at start of vertex buffer Then append with NOOVERWRITE until full Use NOOVERWRITE more often than DISCARD Each DISCARD takes either more time or more memory So NOOVERWRITE should be most common Never use no flags Semi-dynamic geometry: For procedural or demand-loaded geometry Lock once, use for many frames Try both static & dynamic methods

GPU bottlenecks – transform: Causes and solutions Too many vertices Solution: Use level of detail (LOD) But: Rarely a problem because GPU has a lot of vertex processing power (~ 500000 triangles per frame) So: Don’t over-analyze your level of details determination or computation in the CPU

GPU bottlenecks – transform: Causes and solutions Too much computation per vertex: Vertex lighting with lots of lights or expensive lights or lighting model (local viewer) Texgen enabled or texture matrices aren’t identity Vertex shaders with: Lots of instructions Lots of loop iterations or branching Post-TnL vertex cache is under-utilized

GPU bottlenecks – transform: Causes and solutions Too much computation per vertex Solution: Re-order vertices to be sequential in use (NVTriStrip) Take per-object calculations out of the shader, compute in the CPU, and save as program constants Use fewer instructions by leveraging complex instructions and vector operations Scrutinize every mov instruction These can often be eliminated using swizzling Consider using shader level of details Do far-away objects really need 4-bone skinning? Consider moving per-vertex work to per-fragment Or change nothing and increase screen-resolution or use anti-aliasing!

GPU bottlenecks – setup Never the bottleneck since it is the fastest stage Speed influenced by: The number of triangles The number of vertex attributes to be rasterized Slowest for extremely small triangles (like degenerate triangles)

GPU bottlenecks – rasterization It is the bottleneck in the case of lots of big z-culled triangles, which is rare Speed influenced by: The number of triangles The size of the triangles

GPU bottlenecks – fragment shader In past architectures, the fixed, then simply configurable nature of the shader made its performance match the rest of the pipeline pretty well In NV1X (DirectX 7), using more general combiners could reduce fragment shading performance, but often it was still not the bottleneck In NV2X (DirectX 8), more complex fragment shader modes introduced an even larger range of throughput in fragment shading NV3X (CineFX / DirectX 9) can run fragment shaders of 1024 instructions (OpenGL only - DX9 has 512 limit) It’s easy to make fragment shading the bottleneck now.

GPU bottlenecks – fragment shader: Causes and solutions Too many fragments Solution: Draw in rough front-to-back order Consider using a first pass to lay down Z That way you only shade the visible fragments in subsequent passes But: You also spend vertex throughput to improve fragment throughput So: Don’t do this for fragments with a simple shader

GPU bottlenecks – fragment shader: Causes and solutions Too much computation per fragment Solution: Use fewer instructions by leveraging complex instructions, vector operations and co-issuing (RGB/Alpha) Use a mix of texture and combiner instructions (they run in parallel) Use an even number of combiner instructions Use an even number of (simple) textures instructions Use the alpha blender to help SRCCOLOR*SRCALPHA for modulating in the dot3 result SRCCOLOR*SRCCOLOR for a free squaring Consider using shader level of details Turn off detail map computations in the distance Consider moving per-fragment work to per-vertex

CineFX fragment shader optimizations Additional guidance to maximize performance: Use 12 bit fixed-precision [-2, 2) instructions when you don’t need floating point precision Works great for traditional color blending Minimize temporary storage Use 16-bit registers where applicable (most cases) Reuse registers and use all components in each

GPU bottlenecks – texture: Causes and solutions Textures are too big: Overload texture cache: Lots of cache misses Overload video memory: Textures are fetched from system memory Solution: Texture resolutions should be as big as needed and no bigger Avoid expensive internal formats CineFX allows floating point 4xfp16 and 4xfp32 formats Compress textures: Collapse monochrome channels into alpha Use 16-bit color depth when possible (environment maps and shadow maps) Use palletized texture (normal maps) Use DXT compression

GPU bottlenecks – texture: Causes and solutions Texture cache is under-utilized: Lots of cache misses Solution: Localize texture access Beware of dependent texture look-up Use mipmapping: Avoid negative LOD bias to sharpen: Texture caches are tuned for standard LODs Prefer anisotropic filtering for sharpening

GPU bottlenecks – texture: Causes and solutions Too many samples per look-up Trilinear filtering cuts fillrate in half Anisotropic filtering is even worse Depending on level of anisotropy The hardware is intelligent in this regard, you only pay for the anisotropy you use Solution: Use trilinear or anisotropic filtering only when needed: Typically, only diffuse maps truly benefit Light maps are too low resolution to benefit Environment maps are distorted anyway Reduce the maximum ratio of anisotropy

Fast Texture Uploads - Direct3D Use managed resources rather than your own scheme Rely on the run-time and the driver For truly dynamic textures: Mark them D3DUSAGE_DYNAMIC (DirectX 8.1 only) Lock them with the D3DLOCK_WRITEONLY & D3DLOCK_DISCARD flags

Fast Texture Uploads - OpenGL Stick to native texture formats (hardware dependent) Use subtexture loads where practical (glTexSubImage) For maximum texture download performance (for applications like video streaming / editing), check out the NV_pixel_data_range extension Allocates fast memory to load the texture Over twice as fast as the standard method

GPU bottlenecks – frame buffer: Causes and solutions Too much read / write to the frame buffer: Solution: Turn off z writes: For subsequent passes of a multi-pass rendering scheme where you lay down z in the first pass For alpha-blended geometry (like particles) But, do not mask off only some color channels: It is actually slower because the GPU has to read the masked color channels from the frame buffer first before writing them again Use alpha test (except when you mask off all colors)

GPU bottlenecks – frame buffer: Causes and solutions Solution (continued): Use 16-bit z depth if you don’t use stencil Many indoor scenes can get away with this just fine Reduce size of render-to-texture targets Cube maps and shadow maps can be of small resolution and at 16-bit color depth and still look good Try turning cube-maps into hemisphere maps for reflections instead Can be smaller than an equivalent cube map Fewer render target switches

GPU bottlenecks – frame buffer: Causes and solutions Solution (continued): Use hardware fast paths: Buffer clears z buffer and stencil buffer are one buffer, so: If you use the stencil buffer, clear the z and stencil buffers together If you don’t use the stencil buffer, say it otherwise it defeats z clear optimizations Z-cull is optimized for when zbias and alpha tests are turned off and stencil buffer is not used Try using the new DirectX 9 constant color blend instead of a full-screen quad for tinting effects D3DRS_BLENDFACTOR Also standard in OpenGL 1.2

Conclusion Modern GPUs are programmable pipelines, as opposed to simply configurable, which means more potential bottlenecks, more complex tuning The goal is to keep each stage (including the CPU) busy creating interesting portions of the scene Understand what you are bound by in various sections of the scene The skybox is probably texture limited The skinned, dot3 characters are probably transfer or transform limited Exploit inefficiencies to get things for free Objects with expensive fragment shaders can often utilize expensive vertex shaders at little or no additional cost

Questions, comments, feedback? Cyril Zeller, czeller@nvidia.com For developer support: Go to developer.nvidia.com Email DeveloperRelations@nvidia.com