Batch, Batch, Batch: What Does It Really Mean? Matthias Wloka.

Presentation transcript:

What Is a Batch?
Every DrawIndexedPrimitive() is a batch
  – Submits n triangles to the GPU
  – Same render state applies to all tris in batch
  – SetState calls prior to Draw are part of batch
Assuming efficient use of API
  – No Draw*PrimitiveUP()
  – DrawPrimitive() permissible if warranted
  – No unnecessary state changes
Changing state means at least two batches
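The point that every state change starts a new batch can be illustrated with a small counting sketch. It is not from the talk: the draw list, the state keys, and the helper `count_batches` are hypothetical, and it assumes that consecutive draws sharing a state key could be submitted as one batch (for example, because their geometry lives in the same vertex buffer).

```python
from itertools import groupby

def count_batches(draws):
    """Count batches, assuming consecutive draws with the same state key merge into one."""
    return sum(1 for _ in groupby(draws, key=lambda d: d[0]))

# Hypothetical scene: (state_key, triangle_count) per object, in submission order.
draws = [("brick", 12), ("wood", 8), ("brick", 20), ("wood", 4), ("brick", 6)]

# Submitting in scene order changes state before every draw.
unsorted_batches = count_batches(draws)

# Sorting by state key first removes the unnecessary state changes.
sorted_batches = count_batches(sorted(draws, key=lambda d: d[0]))

print(unsorted_batches, sorted_batches)
```

With this made-up draw list, sorting by state cuts five batches down to two without changing the triangle count.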

Why Are Small Batches Bad?
Games would rather draw 1M objects/batches of 10 tris each
  – versus 10 objects/batches of 1M tris each
Lots of guesses
  – Changing state inefficient on GPUs (WRONG)
  – GPU triangle start-up costs (WRONG)
  – OS kernel transitions (WRONG)
Future GPUs will make it better!? Really?

Let's Write Code! Testing Small Batch Performance
Test app does…
  – Degenerate triangles (no fill cost)
  – 100% PostTnL cache vertices (no xform cost)
  – Static data (minimal AGP overhead)
  – ~100k tris/frame, i.e., floor(100k/x) draws at x tris/batch
  – Toggles state between draw calls: (VBs, w/v/p matrix, tex-stage and alpha states)
Timed across 1000 frames
Theoretical maximum triangle rates!
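As a rough illustration of the test setup, the number of draw calls per frame follows directly from the fixed triangle budget. The function name `draws_per_frame` and the sample batch sizes are assumptions for this sketch; only the ~100k tris/frame figure and the floor division come from the slide.

```python
def draws_per_frame(tris_per_batch, tris_per_frame=100_000):
    """Draw calls needed per frame: floor(tris_per_frame / tris_per_batch)."""
    if tris_per_batch < 1:
        raise ValueError("a batch must contain at least one triangle")
    return tris_per_frame // tris_per_batch

# Sample batch sizes (chosen for illustration only).
for size in (2, 130, 10_000):
    print(size, draws_per_frame(size))
```

Small batches therefore mean tens of thousands of Draw/SetState calls per frame for the same amount of geometry.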

Measured Batch-Size Performance [chart; note axis scale change]

Optimization Opportunities [chart annotated with 40x and >100x gaps; note axis scale change]

Measured Batch-Size Performance [chart; note axis scale change]
<130 tris/batch:
  – App is GPU-independent
  – Completely CPU-limited

CPU-Limited?
Then performance results only depend on
  – How fast the CPU is (not the GPU)
  – How much data the CPU processes (not how many triangles per batch!)
CPU processes draw calls (and SetStates), i.e., batches
Let's graph batches/s!

What To Expect If CPU Limited [chart: batches/s versus batch size (triangles/batch), for a fast CPU and a slow CPU, each with GPU 1, GPU 2, GPU 3]

Effects of Different CPU Speeds
Two distinct bands, corresponding to different CPU speeds

Effects of Number of Tris/Batch
Straight horizontal lines: batches/s independent of number of triangles per batch

Effects of Different GPUs
Different GPUs perform similarly; slight variations due to different driver paths

Measured Batches Per Second
  – Athlon XP 2.7+: ~170k batches/s
  – 1GHz Pentium 3: ~60k batches/s
  – x ~2.7

Side Note: OpenGL Performance [chart: OpenGL versus Direct3D]

CPU Limited?
Yes, at < 130 tris/batch (avg) you are
  – completely,
  – utterly,
  – totally,
  – 100%
  – CPU limited!
CPU is busy doing nothing but submitting batches!

How Real Is Test App?
Test app only does SetState, Draw, repeat;
  – Stays in CPU cache
  – No frustum culling, no nothing
  – So pretty much best case
Test app changes arbitrary set of states
  – Types of state changes?
  – And how many states change?
  – Maybe real apps do fewer/better state changes?

Real World Performance
  – % 1.4GHz CPU: 26fps
  – % 1.4GHz CPU: 25fps
  – % 1.4GHz CPU: 25fps
  – % 1.4GHz CPU: 25fps
  – % (!) 1.5GHz CPU: 50fps
  – % (!) 1.5GHz CPU: 40fps
  – % (?) 2.2GHz CPU: 27fps
  – % (?) 3.0GHz CPU: 18fps
  – % (?) 3.0GHz CPU: 21fps

Normalized Real World Performance
  – ~41k batches/s @ 100% of 1GHz CPU
  – ~32k batches/s @ 100% of 1GHz CPU
  – ~42k batches/s @ 100% of 1GHz CPU
  – ~38k batches/s @ 100% of 1GHz CPU
  – ~25k batches/s @ 100% of 1GHz CPU
  – ~8k batches/s @ 100% of 1GHz CPU
  – ~25k batches/s @ 100% of 1GHz CPU
10k – 40k batches/s (100% 1GHz CPU)

Small Batches Feasible In Future?
VTune (1GHz Pentium 3 w/ 2 tri/batch):
  – 78% driver; 14% D3D; 6% Other32; rest noise
Driver doing little per Draw/SetState, but
  – Little times very large multiplier is still large
Nvidia is optimizing drivers, but…
Submitting X batches: O(X) work for CPU
  – CPU (game, runtime, driver) processes batch
  – Can reduce constants but not order O()

GPUs Getting Faster More Quickly Than CPUs
Avg. 18-month CPU Speedup: 2.2
Avg. 18-month GPU Speedup:

GPUs Continue To Outpace CPUs
CPU processes batches, thus
  – Number of batches/frame MUST scale with:
      Driver/Runtime optimizations
      CPU speed increases
GPU processes triangles (per batch), thus
  – Number of triangles/batch scales with:
      GPU speed increases
GPUs getting faster more quickly than CPUs
  – Batch sizes CAN increase

So, How Many Tris Per Batch?
500? 1000? It does not matter!
  – Impossible to fit everything into large batches
  – A few 2 tris/batch do NOT kill performance!
  – N tris/batch: N increases every 6 months
I am a donut! Ask not how many tris/batch, but rather how many batches/frame!
You get X batches per frame, depending on:
  – Target CPU spec
  – Desired frame-rate
  – How much % CPU available for submitting batches

You get X batches per frame; X mainly depends on CPU spec

What is X?
25k batches/s @ 100% 1 GHz CPU
  – Target: 30fps; 2GHz CPU; 20% (0.2) Draw/SetState:
  – X = 333 batches/frame
Formula: 25k * GHz * Percentage/Framerate
  – GHz = target spec CPU frequency
  – Percentage = value 0..1 corresponding to CPU percentage available for Draw/SetState calls
  – Framerate = target frame rate in fps
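The budget formula above is easy to turn into a small helper for planning purposes. This is a minimal sketch: the function name `batch_budget` and its parameter names are assumptions, while the 25k baseline and the worked example (30fps, 2GHz, 20%) are the slide's own figures.

```python
import math

def batch_budget(cpu_ghz, cpu_fraction, fps, base_rate=25_000):
    """Batches per frame: base_rate * GHz * Percentage / Framerate.

    base_rate is the talk's rule of thumb: 25k batches/s at 100% of a 1GHz CPU.
    cpu_fraction is the 0..1 share of CPU time available for Draw/SetState calls.
    """
    if not 0.0 <= cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be between 0 and 1")
    if fps <= 0:
        raise ValueError("fps must be positive")
    return base_rate * cpu_ghz * cpu_fraction / fps

# The slide's example target: 30fps on a 2GHz CPU with 20% of CPU time for submission.
x = math.floor(batch_budget(cpu_ghz=2.0, cpu_fraction=0.2, fps=30))
print(x)
```

Halving the target frame rate or doubling the CPU share doubles the budget, since the formula is linear in each term.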

Please Hang Over Your Bed
25k batches/s @ 100% 1GHz CPU

How Many Triangles Per Batch?
Up to you!
  – Anything from 1 to 10,000+ tris possible
If small number, either
  – Triangles are large or extremely expensive
  – Only GPU vertex engines are idle
Or
  – Game is CPU bound, but don't care because you budgeted your CPU ahead of time, right?
  – GPU idle (available for upping visual quality)

GPU Idle? Add Triangles For Free!

GPU Idle? Complicate Pixel Shaders For Free!

300 Batches Per Frame Sucks
(Ab)use GPU to pack multiple batches together
Critical NOW!
  – For increasing number of objects in game world
Will only become more critical in the future

Batch Breaker: Texture Change
Use all of GeForce FX's 16 textures
  – Fit 8 distinct dual-textured batches into 1 single batch
Pack multiple textures into 1 surface
  – Works as long as no wrap/repeat
  – Requires tool support
  – Potentially wastes texture space
  – Potential problems w/ multi-sampling
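Packing several textures into one surface means each mesh's texture coordinates have to be remapped into its texture's sub-rectangle, which is also why wrap/repeat addressing stops working. The sketch below is illustrative only: `atlas_uv`, the `(x0, y0, width, height)` rectangle layout, and the sample rectangle are assumptions, not something the talk specifies.

```python
def atlas_uv(u, v, rect):
    """Map a texture coordinate in [0, 1] into an atlas sub-rectangle.

    rect = (x0, y0, width, height), all expressed in atlas UV units (0..1).
    """
    if not (0.0 <= u <= 1.0 and 0.0 <= v <= 1.0):
        # Coordinates outside [0, 1] rely on wrap/repeat, which would
        # sample a neighbouring texture once everything shares one surface.
        raise ValueError("wrap/repeat coordinates cannot be remapped into an atlas")
    x0, y0, width, height = rect
    return (x0 + u * width, y0 + v * height)

# Hypothetical layout: this texture occupies the top-right quarter of the atlas.
top_right = (0.5, 0.0, 0.5, 0.5)
print(atlas_uv(0.5, 0.5, top_right))
```

In practice this remapping would be done offline by the tool that builds the atlas, which matches the slide's note that the approach requires tool support.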

Batch Breaker: Transform Change
Pre-transform static geometry
  – Once in a while
  – Video memory overhead: model replication
1-Bone matrix palette skinning
  – Encode world matrix as 2 float4s
      axis/angle
      translate/uniform scale
  – Video memory overhead: model replication
Data-dependent vertex branching
  – Render variable # of bones/lights in one batch
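To make the "2 float4s" encoding concrete, here is a minimal CPU-side sketch of how a vertex could be transformed from such a packed pair. Several details are assumptions rather than facts from the talk: the packing order `(axis.xyz, angle)` and `(translate.xyz, scale)`, the angle being in radians, the axis being unit length, and the scale, rotate, translate order. The function name `transform_point` is hypothetical; in a real renderer this logic would live in the vertex shader.

```python
import math

def transform_point(p, axis_angle, translate_scale):
    """Apply a world transform packed as two 4-tuples to point p.

    axis_angle      = (ax, ay, az, angle): unit rotation axis and angle in radians.
    translate_scale = (tx, ty, tz, s): translation and uniform scale.
    Assumed order: scale, then rotate, then translate.
    """
    ax, ay, az, angle = axis_angle
    tx, ty, tz, s = translate_scale
    x, y, z = s * p[0], s * p[1], s * p[2]

    # Rodrigues' rotation formula: v*cos + (k x v)*sin + k*(k . v)*(1 - cos)
    c, sn = math.cos(angle), math.sin(angle)
    dot = ax * x + ay * y + az * z
    cx, cy, cz = ay * z - az * y, az * x - ax * z, ax * y - ay * x
    rx = x * c + cx * sn + ax * dot * (1.0 - c)
    ry = y * c + cy * sn + ay * dot * (1.0 - c)
    rz = z * c + cz * sn + az * dot * (1.0 - c)

    return (rx + tx, ry + ty, rz + tz)

# Rotate (1, 0, 0) by 90 degrees about +Z, scale by 2, then move by (1, 1, 1).
result = transform_point((1.0, 0.0, 0.0),
                         (0.0, 0.0, 1.0, math.pi / 2),
                         (1.0, 1.0, 1.0, 2.0))
print(result)
```

Eight floats per object instead of a full matrix is what lets many differently placed objects share one batch, at the cost of only supporting rotation, translation, and uniform scale.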

Batch Breaker: Material Change
Compute multiple materials in pixel-shaders
  – Choose/Interpolate based on
      Per-vertex attribute
      Texture-map
More performance optimization tips and tricks: Friday 3:00pm, Graphics Pipeline Performance, C. Cebenoyan and M. Wloka

But Only High-End GPUs Have That Feature!?
Yes, but high-end GPUs most likely CPU-bound
High-end GPUs most suited to deal with:
  – Longer vertex-shaders
  – Longer pixel-shaders
  – More texture accesses
  – Bigger video memory requirements
To improve batching

But These Things Slow GPU Down!?
Remember: CPU-limited
  – GPU is mostly idle
Making GPU work, so CPU does NOT
Overall effect: faster game

25k batches/s @ 100% 1GHz CPU

Acknowledgements
Many thanks to
  – Gary McTaggart, Valve
  – Jay Patel, Blizzard
  – Tom Gambill, NCSoft
  – Scott Brown, NetDevil
  – Guillermo Garcia-Sampedro, PopTop

Questions, Comments, Feedback? Matthias Wloka:

Can You Afford to Lose These Speed-Ups?
2 tris/batch
  – Max. of ~0.1 MTriangles/s for 1GHz Pentium 3
      Factor 1500x away from max. throughput
  – Max. of ~0.4 MTriangles/s for Athlon XP 2.7+
      Factor 375x away from max. throughput
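These figures can be sanity-checked with simple arithmetic: when the CPU is the limit, triangles/s is just batches/s times tris/batch. The sketch below uses the measured rates quoted earlier in the talk (~60k and ~170k batches/s); pairing them with the 1GHz Pentium 3 and the Athlon XP 2.7+ respectively is an assumption, and the function names are hypothetical.

```python
def cpu_limited_tri_rate(batches_per_sec, tris_per_batch):
    """Triangles per second when batch submission is the only limit."""
    return batches_per_sec * tris_per_batch

def implied_peak(tri_rate, factor_below_peak):
    """Peak throughput implied by 'factor N away from max. throughput'."""
    return tri_rate * factor_below_peak

# Rough triangle rates at 2 tris/batch (compare with the slide's ~0.1M and ~0.4M).
p3_rate = cpu_limited_tri_rate(60_000, 2)
athlon_rate = cpu_limited_tri_rate(170_000, 2)

# Both rows of the slide imply the same peak throughput.
peak_from_p3 = implied_peak(100_000, 1500)
peak_from_athlon = implied_peak(400_000, 375)

print(p3_rate, athlon_rate, peak_from_p3, peak_from_athlon)
```

Both rows work out to the same implied peak of 150 MTriangles/s, so the two "factor away" numbers are consistent with a single GPU-side maximum.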