Graphics Performance: Balancing the Rendering Pipeline Cem Cebenoyan and Matthias Wloka.

NVIDIA PROPRIETARY AND CONFIDENTIAL. Introduction: At a minimum, a PC is a two-processor system (CPU and GPU). Maximum efficiency is reached if and only if all processors are busy all the time. [Diagram: CPU and GPU connected by the AGP bus]

Actually, It’s Worse. [Diagram labels: CPU, AGP Bus, Application, Large Cache, API, GPU, Vertex Processing, Triangle Setup, Fragment Shading, Framebuffer Access]

Multi-Processor System: Conceptually, there are 5 processors: the CPU, vertex processor(s), setup processor(s), fragment processor(s), and blending processor(s). All are connected via some form of cache, to smooth data flow and to keep things humming.

MP Systems Become Inefficient If… One or more processors sync to each other. For example, a frame-buffer lock ensures that all caches drain and that all processors idle (CPU and GPU!), and it adds overhead in restarting the processors. Or a single processor bottlenecks all others.
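
The point that a single processor bottlenecks all others can be sketched with a toy model. This is only an illustration under the assumption that the stages run fully in parallel and each has a fixed standalone rate; the stage names and the numbers are hypothetical, not measurements from the talk.

```python
def pipeline_rate(stage_rates):
    """Return (bottleneck_stage, sustained_rate) for stages chained in a pipeline.

    stage_rates maps a stage name to the rate it could sustain on its own.
    The pipeline as a whole can go no faster than its slowest stage.
    """
    stage = min(stage_rates, key=stage_rates.get)
    return stage, stage_rates[stage]


# Hypothetical standalone rates, in frames per second.
rates = {
    "cpu": 45.0,
    "vertex": 240.0,
    "setup": 600.0,
    "fragment": 90.0,
    "framebuffer": 120.0,
}
bottleneck, fps = pipeline_rate(rates)

# Speeding up a stage that is not the bottleneck changes nothing.
faster_gpu = dict(rates, fragment=180.0)
```

In this model doubling the fragment rate leaves the result at the CPU's rate, which is the behavior the later measurement slides rely on.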

Overview: CPU; AGP bus; vertex processing; triangle setup; rasterization; memory bandwidth (writing to and blending with video memory).

Overview: For Each Stage. What are its characteristics? How does it behave? How do you measure whether it is the bottleneck? How do you influence it?

CPU Characteristics: Stay within on-chip cache for maximum performance. Use the CPU for collision detection, physics, AI, etc.

CPU Characteristics (cont.): Note that graphics is capable of 20+ MTri/s (2-year-old high-end), 20+ MTri/s (integrated graphics), and 100+ MTri/s (current high-end). The CPU is also responsible for pushing data to the GPU, so it cannot look at every triangle. Don’t limit graphics with CPU processing.

CPU Measurement: Use VTune or any other profiler. Most games are CPU-limited. If little to no time is spent in the graphics driver, the CPU is the bottleneck and a faster GPU will NOT result in faster graphics. Use VTune to track where you spend your time, and optimize those places.

CPU Measurement (cont.): But even if most time is spent in the graphics driver, the CPU might still be the bottleneck, and a faster GPU will NOT result in faster graphics. Timing graphics calls is pointless: remember the large cache between CPU and GPU. Use the Nvidia Stats-driver (NVTune) instead to trace into the GPU. NVTune is available from Nvidia’s registered developer site.

CPU Common Problems: Small batches of geometry are being sent to the GPU. 100 triangles per batch should be your minimum; we would like to see ~500 triangles/batch, and up to 10,000 triangles/batch. A combination of causes kills your performance: runtime, driver, and hardware.

CPU: Batch Size Characteristic

CPU: Batching Solutions. Sort by render-state. For texture switches: combine textures into one large (4k x 4k) texture and modify uv-coordinates accordingly; tessellate geometry to overcome mirroring and wrapping; mip-mapping works just fine. For transform switches: pre-transform on the CPU into world-space, or replicate data into VBs (costs AGP memory).
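
As a rough sketch of sorting by render-state, the snippet below groups draw requests that share a state key and merges their triangle counts into one batch per state. It assumes that geometry sharing a state can really be submitted in one call (for example because it was pre-transformed into world-space and shares one atlas texture, as the slide suggests). The names batch_by_state and undersized, and the string state keys, are hypothetical; the 100-triangle floor is the minimum quoted on the CPU Common Problems slide.

```python
from itertools import groupby

MIN_TRIANGLES_PER_BATCH = 100  # minimum quoted on the "CPU Common Problems" slide


def batch_by_state(draws):
    """Merge draw requests that share a render state.

    draws is a list of (state_key, triangle_count) pairs. The result has one
    (state_key, total_triangles) entry per distinct state, in sorted order.
    """
    ordered = sorted(draws, key=lambda d: d[0])  # stable sort by state key
    return [
        (state, sum(count for _, count in group))
        for state, group in groupby(ordered, key=lambda d: d[0])
    ]


def undersized(batches):
    """Return the state keys whose merged batch is still below the minimum."""
    return [state for state, count in batches if count < MIN_TRIANGLES_PER_BATCH]


# Hypothetical frame: four draw requests using two different textures.
draws = [("atlas", 120), ("stone", 40), ("atlas", 300), ("stone", 30)]
batches = batch_by_state(draws)
too_small = undersized(batches)
```

Here four draw requests collapse into two batches, and the remaining small batch is a candidate for folding into the atlas as well.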

Other Common CPU Problems: Specify vertex buffers as WRITEONLY. Minimize state changes; consider using a PURE device, iff you are optimal. Do not lock and read data from the GPU: that is a multi-processor sync!

AGP Bus Characteristics: AGP 4x supports 20+ MTri/s, even if all vertices and indices are dynamic; BenMark5 does just that. Too often AGP 4x support is busted, so use BenMark5 to test for AGP 4x support. AGP bus throughput is influenced by the size of the vertex format of dynamically written vertices and by how many vertices are dynamically written.
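
The two factors named above (vertex format size and number of dynamically written vertices) multiply into bus traffic. The following back-of-the-envelope sketch assumes the nominal AGP 4x peak of about 1066 MB/s, which is not stated on the slide, and ignores index data and protocol overhead; sustained rates are lower in practice. The function names and the example workload are hypothetical.

```python
# Nominal AGP 4x peak in bytes per second (assumption, not from the slides).
AGP_4X_PEAK_BYTES_PER_S = 1066 * 10**6


def dynamic_vertex_traffic(vertices_per_frame, vertex_size_bytes, frames_per_second):
    """Bytes per second crossing the bus for dynamically written vertices."""
    return vertices_per_frame * vertex_size_bytes * frames_per_second


def bus_utilization(traffic_bytes_per_s, peak=AGP_4X_PEAK_BYTES_PER_S):
    """Fraction of the nominal peak that the traffic would occupy."""
    return traffic_bytes_per_s / peak


# Hypothetical workload: 100,000 dynamic vertices of 32 bytes at 60 frames/s.
traffic = dynamic_vertex_traffic(100_000, 32, 60)
share = bus_utilization(traffic)
```

Halving the vertex size or the number of dynamic vertices halves the traffic, which is why both appear as tuning knobs on the later slides.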

AGP Bus Characteristics (cont.): But if frame-buffer and textures exceed video-memory, AGP is also used to transfer STATIC vertices to the GPU every frame and to transfer textures to the GPU every frame. Make sure you avoid partial writes; see “Fast AGP Writes for Dynamic Vertex Data” by Dean Macri for details. Always modify all vertex data, even if only some data changes. Pentium 3: write in 32-byte chunks. Pentium 4: write in 64-byte chunks.

AGP Bus Characteristics (cont.): The GPU caches vertex fetches. Hitting this cache causes no data to cross the bus. The cache has 32-byte lines, so vertex sizes that are multiples of 32 are beneficial. See also fer_Statistics

AGP Bus Characteristics

AGP Bus Measurement: You can tell you’re bound by the bus if increasing/decreasing vertex format size significantly impacts performance. It is best to increase vertex format size using components not needed by the rasterizer, for example, normals.

Increasing AGP Bus Performance: Make sure frame buffer and textures fit into video-memory. Decrease the number of dynamic objects (vertices): use vertex shaders to animate static VBs! Decrease vertex size: let the vertex shader generate vertex components, or compress components and use the vertex shader to decompress (for example, use 16-bit short normals). Reorder vertices in the VB to be sequential in use; NVTriStrip can do this. Pad to multiples of 32 bytes.
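
Two of the suggestions above are simple arithmetic and can be sketched on the CPU side: padding a vertex size up to a multiple of 32 bytes, and quantizing a unit normal into three signed 16-bit shorts (6 bytes instead of 12 for three floats). The helper names and the scale factor of 32767 are assumptions for illustration; on real hardware the decompression step would live in the vertex shader, and unpack_normal16 merely stands in for it here.

```python
import struct


def padded_size(size_bytes, line=32):
    """Round a vertex size up to the next multiple of the cache-line size."""
    return (size_bytes + line - 1) // line * line


def pack_normal16(nx, ny, nz):
    """Quantize a unit normal to three little-endian signed 16-bit integers."""
    def quantize(v):
        return max(-32767, min(32767, round(v * 32767)))
    return struct.pack("<3h", quantize(nx), quantize(ny), quantize(nz))


def unpack_normal16(data):
    """Undo pack_normal16; a vertex shader would apply the same scale."""
    return tuple(c / 32767.0 for c in struct.unpack("<3h", data))


# A hypothetical 36-byte vertex format pads to 64 bytes; trimming it to
# 32 bytes (for example by compressing the normal) avoids the padding.
size_before = padded_size(36)
size_after = padded_size(32)
packed = pack_normal16(0.0, 0.0, 1.0)
```

The quantization loses a little precision for arbitrary normals, so whether 16 bits per component is enough is a per-asset judgment call.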

Vertex Processing Characteristics: Each vertex is transformed and lit. Performance correlates directly to the number of vertices processed; to the length of the vertex shader or to fixed-function factors, such as the number of active lights, the type of lights, specular on/off, LOCALVIEWER on/off, and texgen on/off; and to the GPU core clock frequency.

Vertex Processing Characteristics

Vertex Processing Characteristics: After processing, vertices land in the post-TnL FIFO. GeForce1/2/4 MX: effectively 10 entries. GeForce3/4 Ti: effectively 18 entries. A cache hit saves all TnL work and everything before TnL in the pipeline. This only works with indexed primitives.

Vertex Processing Performance: Do not be afraid to use triangles. Vertex processing is rarely the bottleneck; even if it is, it would make us happy. A lot of vertex processing power is available: a 6 * 6 pixel-quad with 2 tris is not vertex bound. If you can tell an object is made from triangles, you are not using enough triangles. ~10k triangles/frame is off by 2 (two!) orders of magnitude.

Code Creatures Demo: Grass scenes are NOT vertex-bound. In excess of 1,000,000 tris/frame for the opening scene; ~250k tris/frame minimum. CodeCreatures demo available from:

Vertex Processing Measurement: You are bound by vertex processing if increasing/decreasing vertex shader length significantly influences performance. Unnecessary added instructions may be optimized out by the driver, though; instead, use instructions that access constant memory, for example to add zero to a result. For fixed-function TnL, you are bound if performance improves when reducing the number of lights, turning off texgen, or simplifying light types.

Improving Vertex Processing: Optimize for the post-TnL vertex cache: use indexed primitives; access vertices mostly sequentially, revisiting only recently accessed vertices; let NVTriStrip or ID3DXMesh do the work. Turn off unnecessary calculations: LOCALVIEWER is often unnecessary for specular. Prefer cheap approximations for lighting and other math when using vertex shaders.
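
To see why vertex order matters for the post-TnL cache, the sketch below replays an index list through a small FIFO and reports the hit rate. It is a simplified software model, not a description of the actual hardware: it assumes a pure FIFO in which a hit does not refresh the entry, and the function name cache_hit_rate is hypothetical. The sizes 10 and 18 are the effective entry counts quoted on the earlier slide.

```python
from collections import deque


def cache_hit_rate(indices, cache_size):
    """Fraction of index references served from a FIFO of transformed vertices."""
    if not indices:
        return 0.0
    fifo = deque(maxlen=cache_size)  # appending to a full deque evicts the oldest
    hits = 0
    for index in indices:
        if index in fifo:
            hits += 1           # hit: no TnL work, FIFO order unchanged
        else:
            fifo.append(index)  # miss: vertex is transformed and enters the FIFO
    return hits / len(indices)


# Two triangles sharing an edge: vertices 1 and 2 are reused by the second one.
shared_edge = [0, 1, 2, 2, 1, 3]
rate_mx = cache_hit_rate(shared_edge, 10)  # GeForce1/2/4 MX size from the slides
rate_ti = cache_hit_rate(shared_edge, 18)  # GeForce3/4 Ti size from the slides
rate_tiny = cache_hit_rate(shared_edge, 1)
```

Running a real mesh's index buffer through such a model before and after reordering gives a quick estimate of how much TnL work a tool like NVTriStrip saves.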

Improving Vertex Processing (cont.): Optimize your vertex shaders: use swizzling/masking extensively; question all MOV instructions; store lookup tables in constant memory, for example, to compute sin/cos. See “Implementation of ‘Missing’ Vertex Shader Instructions” for more ideas: tion_Missing_Instructions

Improving Vertex Processing (cont.): Consider moving per-vertex work to per-pixel. Consider using ‘shader-LODing’: do far-away objects really need 4-bone skinning? You can always increase screen-res or use AA to NOT be vertex-processing bound!

Triangle Setup Characteristics: Triangle setup is never the bottleneck, except when rating the GPU, since it is the fastest stage. Setup speed is influenced by the number of triangles and by the vertex attributes needed by rasterization. Extremely small triangles running very simple TnL, i.e., degenerate triangles: no TnL cost, since they most likely hit the post-TnL cache, and no fill cost, since they are rejected in setup.

Measuring/Improving Triangle Setup: Has never come up. Reduce the ratio of degenerate triangles to real triangles. Reduce unnecessary components written out from the vertex shader.
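
The ratio of degenerate to real triangles can be checked offline from an index buffer. The sketch below assumes an indexed triangle list and treats a triangle as degenerate when two of its three indices coincide; it does not catch zero-area triangles that use distinct indices with identical positions. The function name degenerate_ratio is hypothetical.

```python
def degenerate_ratio(indices):
    """Ratio of degenerate triangles to real triangles in a flat triangle list."""
    triangles = [indices[i:i + 3] for i in range(0, len(indices) - 2, 3)]
    degenerate = sum(1 for tri in triangles if len(set(tri)) < 3)
    real = len(triangles) - degenerate
    if degenerate == 0:
        return 0.0
    if real == 0:
        return float("inf")
    return degenerate / real


# Two real triangles joined by one degenerate stitch triangle (2, 2, 3).
stitched = [0, 1, 2, 2, 2, 3, 3, 4, 5]
ratio = degenerate_ratio(stitched)
```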

Rasterization Characteristics: Prefer the term “fragment” to “pixel”: a fragment may not correspond to any pixel in the framebuffer, for example, due to z/stencil/alpha tests, and it may correspond to more than one pixel due to multisampling. Rasterization performance is commonly referred to as “fill-rate”.

Fill-Rate Characteristics: Fill-rate is a function of the number of fragments filled, the cost of each fragment, and the GPU’s core clock. Parallel SIMD operation processes up to 4 pixels per clock on GeForce1/2/3/4 Ti and up to 2 pixels per clock on GeForce2 MX / 4 MX. Fill-rate is broken into a number of parts: texture fetching, texture addressing operations, and color blending operations.

Texture Fetching Characteristics: Texture fetches go from AGP to local video-memory, only if frame-buffer and textures exceed video-memory (to be avoided), and then from local video-memory to on-chip cache.

Texture Fetching Characteristics (cont.): Minimize cache misses. Use mip-mapping! Avoid LOD bias to sharpen: it hurts caching and adds aliasing; prefer anisotropic filtering for sharpening. Use DXT everywhere you can. Make the texture size as big as needed and no bigger. Make the texture format as small as possible (16 vs. 32 bit). Localize texture access: e.g., normal texture reads are local, dependent texture reads are less local, and per-pixel reflection is potentially really bad.

Texture Fetching Characteristics (cont.): The number of samples taken also affects performance. Trilinear filtering cuts fill-rate in half. Anisotropic filtering is even worse, depending on the level of anisotropy; the hardware is intelligent in this regard, so you only pay for the anisotropy you use.

Texture Addressing Characteristics: Different texture addressing operations have wildly different performance characteristics, but texture cache hits/misses are more significant.

Texture Addressing Characteristics: Also, each additional pair of textures costs another clock: 1 or 2 textures run at full speed; 3 or 4 textures run at half speed (two clocks).

Color Blending Characteristics: Color blending operations are also called ‘Register Combiners’. 1 or 2 instructions (combiners): full speed. 3 or 4 instructions: half speed. 5 or 6 instructions: one-third speed. 7 or 8 instructions: one-quarter speed. These numbers are for GF3 / 4 Ti. But if you are using 4 textures, you are already at half speed or less, so using up to 4 combiners is free.
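
The texture and combiner tables can be folded into one small cost model. This sketch assumes that the two costs overlap, so the slower of the two determines the clocks per fragment; that reading is inferred from the remark that up to 4 combiners are free once 4 textures are in use, and the numbers apply only to the GF3 / 4 Ti class named on the slide. The function names are hypothetical.

```python
from math import ceil


def clocks_per_fragment(textures, combiners):
    """Clocks per fragment: one clock per pair of textures or pair of combiners."""
    texture_clocks = max(1, ceil(textures / 2))
    combiner_clocks = max(1, ceil(combiners / 2))
    return max(texture_clocks, combiner_clocks)


def relative_fill_rate(textures, combiners):
    """Fill-rate relative to full speed (1.0) for a given shading setup."""
    return 1.0 / clocks_per_fragment(textures, combiners)


full_speed = relative_fill_rate(2, 2)
four_textures = relative_fill_rate(4, 4)   # combiners hidden behind texture cost
eight_combiners = relative_fill_rate(1, 8)
```

The model ignores texture cache misses and filtering modes, which the earlier slides call out as the more significant factors.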

Fill-Rate Measurement: You are bound by fill-rate if reducing texture sizes, or better, turning off texturing, increases performance significantly; if turning trilinear on/off affects performance; or if increasing the texture units used to 4, but not actually fetching from any textures (using pixel shader instructions like texcoord), causes you to slow down.

Improving Fill-Rate: Render a z-only pass first, because z-optimizations happen before rasterization. This helps with memory bandwidth as well, even for older chips without z-optimizations. Do everything to reduce texture cache misses. Turn on anisotropic, but turn off trilinear filtering: mip-map transitions are less visible with anisotropic filtering on.
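
The benefit of a z-only pass can be illustrated by counting, for a single pixel, how many fragments run the full shading path with and without it. The sketch assumes that fragments failing the depth test are rejected before shading and that the color pass after the z-only pass uses an equal-depth test; both are simplifying assumptions, and shaded_fragments is a hypothetical helper.

```python
def shaded_fragments(depths, z_prepass):
    """Count fragments at one pixel that run the full (expensive) shader.

    depths lists fragment depths in draw order; smaller values are nearer.
    """
    if not depths:
        return 0
    if z_prepass:
        # The z-only pass has already stored the nearest depth, so only
        # fragments at exactly that depth pass the test in the color pass.
        nearest = min(depths)
        return sum(1 for d in depths if d == nearest)
    shaded = 0
    z_buffer = float("inf")
    for d in depths:
        if d < z_buffer:  # passes the less-than test, so it gets shaded
            shaded += 1
            z_buffer = d
    return shaded


# Worst case: five surfaces drawn back to front over the same pixel.
back_to_front = [5.0, 4.0, 3.0, 2.0, 1.0]
without_prepass = shaded_fragments(back_to_front, z_prepass=False)
with_prepass = shaded_fragments(back_to_front, z_prepass=True)
```

The z-only pass itself is not free, so the trade is extra cheap geometry submission against fewer expensive fragments.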

Improving Fill-Rate (cont.): Consider palettized normal maps for compression. Consider moving per-pixel work to per-vertex. Consider ‘shader LODing’: turn off detail map computations in the distance.

Memory Bandwidth Characteristics: Memory bandwidth is often the bottleneck, especially at high resolutions. Memory bandwidth is influenced by: screen and render-target resolutions; render-target color / z bit depth; FSAA; texture sizes and formats (texture fetching); overdraw complexity; alpha blending; the GPU’s memory-interface width; and the memory clock.
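
Several of the factors listed above combine multiplicatively, which a rough estimate makes visible. The sketch assumes that every drawn fragment reads z, writes z, and writes color, and that blending adds one color read; it ignores texture fetches, FSAA, and any hardware compression, so it is an order-of-magnitude illustration only. The function name and the example resolution are hypothetical.

```python
def framebuffer_traffic(width, height, color_bytes, z_bytes, overdraw, fps, blended=False):
    """Approximate frame-buffer bytes per second for opaque or blended rendering."""
    per_fragment = 2 * z_bytes + color_bytes  # z read + z write + color write
    if blended:
        per_fragment += color_bytes           # blending also reads the old color
    return width * height * overdraw * per_fragment * fps


# Hypothetical scene: 1024x768, overdraw of 3, 60 frames per second.
traffic_32bit = framebuffer_traffic(1024, 768, 4, 4, 3, 60)
traffic_16bit = framebuffer_traffic(1024, 768, 2, 2, 3, 60)
traffic_blend = framebuffer_traffic(1024, 768, 4, 4, 3, 60, blended=True)
```

In this model, dropping color and z from 32 to 16 bits halves the traffic, which is the basis of the measurement trick on the next slides.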

Memory Bandwidth Characteristics: FSAA hits memory bandwidth exclusively; there is no fill-rate hit with multi-sample. Failing the z/stencil/alpha test means pixel color is not written and z is not written.

Measuring Memory Bandwidth: Switch the frame-buffer format to 16 bit and switch all render-targets to 16 bit. If performance doubles, the app was 100% memory-bandwidth bound. If performance is unchanged, the app is not memory-bandwidth bound.
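
The two endpoints above (performance doubles, performance unchanged) can be interpolated with a simple model. Assume the frame time splits into a bandwidth part B and a remainder R that do not overlap, and that switching to 16 bit halves only B. Then T = B + R and T' = B/2 + R, so the bandwidth-bound fraction is B/T = 2(1 - T'/T) = 2(1 - fps32/fps16). This linear split is an assumption for illustration, not something the slides state, and the function name is hypothetical.

```python
def bandwidth_bound_fraction(fps_32bit, fps_16bit):
    """Estimate the fraction of frame time that was memory-bandwidth bound.

    Uses B/T = 2 * (1 - fps_32bit / fps_16bit), clamped to [0, 1] to absorb
    measurement noise.
    """
    fraction = 2.0 * (1.0 - fps_32bit / fps_16bit)
    return max(0.0, min(1.0, fraction))


fully_bound = bandwidth_bound_fraction(30.0, 60.0)  # performance doubled
not_bound = bandwidth_bound_fraction(30.0, 30.0)    # performance unchanged
partly_bound = bandwidth_bound_fraction(30.0, 40.0)
```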

Improving Memory Bandwidth: Reduce overdraw as much as possible. Lightly sort objects front to back; all architectures benefit, since the z-test fails. Reduce blending as much as possible. Always enable alpha-test when blending, and tweak the test-value as much as possible. Consider using a 2-pass alpha-test/-blend technique. Always clear z/stencil (using clear()). Do not clear color if not necessary. Writing z from a shader destroys early z.
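
One possible reading of "lightly sort" is a coarse sort into depth buckets rather than an exact per-object sort, which keeps the CPU cost low and leaves room to keep render-state order inside each bucket. The sketch below follows that interpretation; the bucket size, the object names, and the function name lightly_sorted are all hypothetical.

```python
def lightly_sorted(objects, bucket_size):
    """Order objects front to back by coarse depth bucket.

    objects is a list of (name, view_depth) pairs. Python's sort is stable, so
    objects in the same bucket keep their original submission order.
    """
    return sorted(objects, key=lambda obj: int(obj[1] // bucket_size))


# Hypothetical scene with view-space depths in world units.
scene = [("tree", 42.0), ("wall", 3.5), ("hill", 47.0), ("crate", 8.0)]
draw_order = [name for name, _ in lightly_sorted(scene, bucket_size=10.0)]
```

Near objects are drawn first, so more of the far geometry fails the z-test and writes neither color nor z.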

Improving Memory Bandwidth (cont.): Prefer FSAA over high resolution. Consider using a z-only pass, and turn off z-writing for all subsequent passes.

Conclusion: There are a lot of different performance bottlenecks. Know which one to tweak. Use the suggestions here to make things faster w/o making them visibly worse. Make things prettier for free!

Questions…?