Ultimate Graphics Performance for DirectX 10 Hardware Nicolas Thibieroz European Developer Relations AMD Graphics Products Group

Slides:

Advertisements

Similar presentations

Introduction to Direct3D 10 Course Porting Game Engines to Direct3D 10: Crysis / CryEngine2 Carsten Wenzel.

Advertisements

DirectCompute Performance on DX11 Hardware

The A to Z of DX10 Performance

Grass, Fur and all things hairy Nicolas ThibierozKarl Hillesland Gaming Engineering Manager, AMDSenior Research Engineer, AMD.

OIT and Indirect Illumination using DX11 Linked Lists

Exploration of advanced lighting and shading techniques

DirectX11 Performance Reloaded

Deferred Shading Optimizations

DirectX Graphics: Direct3D 10 and Beyond

CS123 | INTRODUCTION TO COMPUTER GRAPHICS Andries van Dam © 1/16 Deferred Lighting Deferred Lighting – 11/18/2014.

COMPUTER GRAPHICS CS 482 – FALL 2014 NOVEMBER 10, 2014 GRAPHICS HARDWARE GRAPHICS PROCESSING UNITS PARALLELISM.

Lecture 38: Chapter 7: Multiprocessors Today’s topic –Vector processors –GPUs –An example 1.

Understanding the graphics pipeline Lecture 2 Original Slides by: Suresh Venkatasubramanian Updates by Joseph Kider.

Graphics Pipeline.

Status – Week 257 Victor Moya. Summary GPU interface. GPU interface. GPU state. GPU state. API/Driver State. API/Driver State. Driver/CPU Proxy. Driver/CPU.

RealityEngine Graphics Kurt Akeley Silicon Graphics Computer Systems.

CS 4363/6353 BASIC RENDERING. THE GRAPHICS PIPELINE OVERVIEW Vertex Processing Coordinate transformations Compute color for each vertex Clipping and Primitive.

Graphics Hardware CMSC 435/634. Transform Shade Clip Project Rasterize Texture Z-buffer Interpolate Vertex Fragment Triangle A Graphics Pipeline.

Direct3D 11 Performance Tips & Tricks Holger GruenAMD ISV Relations Cem CebenoyanNVIDIA ISV Relations.

Tools for Investigating Graphics System Performance

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Chapter.

IN4151 Introduction 3D graphics 1 Introduction to 3D computer graphics part 2 Viewing pipeline Multi-processor implementation GPU architecture GPU algorithms.

GPU Simulator Victor Moya. Summary Rendering pipeline for 3D graphics. Rendering pipeline for 3D graphics. Graphic Processors. Graphic Processors. GPU.

Mapping Computational Concepts to GPU’s Jesper Mosegaard Based primarily on SIGGRAPH 2004 GPGPU COURSE and Visualization 2004 Course.

Status – Week 260 Victor Moya. Summary shSim. shSim. GPU design. GPU design. Future Work. Future Work. Rumors and News. Rumors and News. Imagine. Imagine.

Advanced D3D10 Rendering Emil Persson May 24, 2007.

You can use 3D graphics to enhance and differentiate your Metro style app.

Shader Model 5.0 and Compute Shader

High Performance in Broad Reach Games Chas. Boyd

Filtering Approaches for Real-Time Anti-Aliasing /

© Copyright Khronos Group, Page 1 Harnessing the Horsepower of OpenGL ES Hardware Acceleration Rob Simpson, Bitboys Oy.

REAL-TIME VOLUME GRAPHICS Christof Rezk Salama Computer Graphics and Multimedia Group, University of Siegen, Germany Eurographics 2006 Real-Time Volume.

GPU Programming Robert Hero Quick Overview (The Old Way) Graphics cards process Triangles Graphics cards process Triangles Quads.

UW EXTENSION CERTIFICATE PROGRAM IN GAME DEVELOPMENT 2 ND QUARTER: ADVANCED GRAPHICS Visual quality techniques.

Geometric Objects and Transformations. Coordinate systems rial.html.

4/23/2017 4:23 AM © 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered.

Cg Programming Mapping Computational Concepts to GPUs.

The Graphics Rendering Pipeline 3D SCENE Collection of 3D primitives IMAGE Array of pixels Primitives: Basic geometric structures (points, lines, triangles,

Advanced Computer Graphics Depth & Stencil Buffers / Rendering to Textures CO2409 Computer Graphics Week 19.

Computer Graphics The Rendering Pipeline - Review CO2409 Computer Graphics Week 15.

GRAPHICS PIPELINE & SHADERS SET09115 Intro to Graphics Programming.

GPU Computation Strategies & Tricks Ian Buck NVIDIA.

Shader Study 이동현. Vision engine   Games Helldorado The Show Warlord.

Accelerated Stereoscopic Rendering using GPU François de Sorbier - Université Paris-Est France February 2008 WSCG'2008.

Ritual ™ Entertainment: Next-Gen Effects on Direct3D ® 10 Sam Z. Glassenberg Program Manager Microsoft ® – Direct3D ® Doug Service Director of Technology.

Xbox MB system memory IBM 3-way symmetric core processor ATI GPU with embedded EDRAM 12x DVD Optional Hard disk.

Stencil Routed A-Buffer

Emerging Technologies for Games Deferred Rendering CO3303 Week 22.

Computer Graphics 3 Lecture 6: Other Hardware-Based Extensions Benjamin Mora 1 University of Wales Swansea Dr. Benjamin Mora.

Maths & Technologies for Games Graphics Optimisation - Batching CO3303 Week 5.

Emerging Technologies for Games Capability Testing and DirectX10 Features CO3301 Week 6.

Parallel Lighting Directional Lighting Distant Lighting (take your pick) Paul Taylor 2010.

UW EXTENSION CERTIFICATE PROGRAM IN GAME DEVELOPMENT 2 ND QUARTER: ADVANCED GRAPHICS The GPU.

DX10, Batching, and Performance Considerations Bryan Dudash NVIDIA Developer Technology.

3/12/2013Computer Engg, IIT(BHU)1 CUDA-3. GPGPU ● General Purpose computation using GPU in applications other than 3D graphics – GPU accelerates critical.

Memory Hierarchy— Five Ways to Reduce Miss Penalty.

GPU Architecture and Its Application

COMPUTER GRAPHICS CHAPTER 38 CS 482 – Fall 2017 GRAPHICS HARDWARE

Computer Graphics Index Buffers

A Crash Course on Programmable Graphics Hardware

Deferred Lighting.

Chapter 6 GPU, Shaders, and Shading Languages

The Graphics Rendering Pipeline

CS451Real-time Rendering Pipeline

UMBC Graphics for Games

Introduction to geometry instancing

UMBC Graphics for Games

RADEON™ 9700 Architecture and 3D Performance

Computer Graphics Introduction to Shaders

Emerging Technologies for Games Review & Revision Strategy

Presentation transcript:

Ultimate Graphics Performance for DirectX 10 Hardware Nicolas Thibieroz European Developer Relations AMD Graphics Products Group V1.00

Generic API Usage DX10 designed for performance No legacy code No fixed function Validation at creation time Immutable state objects User mode driver Powerful API

DX10 Batch Performance The truth about DX10 batch performance “Simple” porting job will not yield expected performance Need to use DX10 features to yield gains: Geometry instancing Intelligent usage of state objects Intelligent usage of constant buffers Texture arrays Render Target selection (but...)

Geometry Instancing Great instancing support in DX10 Use “System Values” to vary rendering SV_InstanceID, SV_PrimitiveID, SV_VertexID Additional streams not strictly required Pass these to PS for texture array indexing Highly-varied visual results in a single draw call Watch out for: Texture cache trashing if sampling textures from system values (SV_PrimitiveID)

State Management DX10 uses immutable “state objects” DX10 state objects require a new way to manage states A naïve DX9 to DX10 port will cause problems here Always create state objects at load-time Avoid duplicating state objects Recommendation to sort by states still valid in DX10! Implement “dirty states” mechanism to avoid redundancy

Constant Buffer Management Major cause of low performance in DX10 apps! Constants are declared in buffers in DX10 When a CB is updated and bound to a shader stage its whole contents are uploaded to the GPU Need to strike a good balance between: Amount of constant data to upload Number of calls required to do it cbuffer PerFrameUpdateConstants { float4x4 mView; float fTime; float3 fWindForce; // etc. }; cbuffer SkinningConstants { float4x4 mSkin[64]; };

Constant Buffer Management 2 Always use a pool of constant buffers sorted by frequency of updates Don’t go overboard with number of CBs! Less than 5 is a good target CB sharing between shader stages can be a good thing Global constant buffer unlikely to yield good performance Especially with regard to CB contention Group constants by access patterns in a given buffer cbuffer PerFrameConstants { float4 vLightVector; float4 vLightColor; float4 vOtherStuff[32]; }; GOOD cbuffer PerFrameConstants { float4 vLightVector; float4 vOtherStuff[32]; float4 vLightColor; }; BAD

Resource Updates In-game creation and destruction of resources is slow! Runtime validation, driver checks, memory alloc... Take into account for resource management Especially with regard to texture management Create all resources in non-performance situations Up-front, level load, cutscenes, etc. At run-time replace contents of resources rather than destroying/creating new ones

Resource Updates: Textures Avoid UpdateSubresource() for texture updates Slow path in DX10 (think DrawPrimitiveUP() in DX9) Especially bad with larger textures! E.g. texture atlas, imposters, streaming data... Perform all updates into a pool of D3D10_USAGE_STAGING textures Use Map(D3D10_MAP_WRITE,...) with D3D10_MAP_FLAG_DO_NOT_WAIT to avoid stalls Then upload staging resources into video memory CopyResource() CopySubresourceRegion()

Resource Updates: Textures (2) UpdateSubresource D3D10_USAGE_DEFAULT D3D10_USAGE_STAGING Map CopySubresourceRegion Non-local Video Memory

UpdateSubresource Video Memory D3D10_USAGE_DEFAULT Non-local Video Memory D3D10_USAGE_DEFAULT Map CopySubresourceRegion Map CopySubresourceRegion D3D10_USAGE_STAGING Resource Updates: Textures (3)

Resource Updates: Buffers To update a Constant buffer Map(D3D10_MAP_WRITE_DISCARD, …); UpdateSubResource() To update a dynamic Vertex/Index buffer Map(D3D10_MAP_WRITE_NO_OVERWRITE, …); Ring-buffer type; only write to empty portions of buffer Map(D3D10_MAP_DISCARD) when buffer full UpdateSubresource() not as good as Map() in this case

GS Recommendations Geometry Shader can write data out to memory This gives you “free” ALUs because of latency Minimize the size of your output vertex structure Yields higher throughput Allows more output vertices (max output is 1024 floats) Consider adding pixel shader work to reduce output size [maxvertexcount(18)] GSOUTPUT GS(point GSINPUT Input, inout PointStream OutputStream) { GSOUTPUT Output; for (int i=0; i<nNumPoints; i++) { //... Output.vLight = LightPos.xyz - Position.xyz; Output.vView = CameraPos.xyz – Position.xyz; OutputStream.Append(Output); } } [maxvertexcount(18)] GSOUTPUT GS(point GSINPUT Input, inout PointStream OutputStream) { GSOUTPUT Output; for (int i=0; i<nNumPoints; i++) { //... OutputStream.Append(Output); } float4 PS(float4 Pos : SV_POSITION) : SV_TARGET { float4 position = mul(float4(Pos, 1.0), mInvViewProjectionViewport); vLight = LightPos.xyz - Position.xyz; vView = CameraPos.xyz – Position.xyz; }

GS Recommendations 2 Cull triangles in Geometry Shader Backface culling Frustum culling Minimize the size of your input vertex structure Combine inputs into single 4D iterator (e.g. 2 UV pairs) Pack data and use ALU instructions to unpack it Calculate values instead of storing them (e.g. Binormal)

Shader Model 4 Good writing practices enable optimal throughput: Write parallel code Avoid scalar dependencies Execute scalar instructions first when mixed with vector ops Parentheses can help Some instructions operate at a slower rate Integer multiplication and division Type conversions (float to int, int to float) Others (depends on IHV) Declare constants in the right format float a; float b; float4 V; V = V * (a * b);

Shader Model 4 - cont. All DX10(.1) hardware support more maths instructions than texture operations Always target a high ALU:TEX ratio when writing shaders! Ratio affected by texture type, format and instruction Use dynamic branching to skip instructions Can be a huge saving Especially when TEX instructions can be skipped But make sure branching has high coherency Use IHV tools to check compiler output E.g. GPUShaderAnalyzer for AMD Radeon cards

Alpha Test Alpha test deprecated in DirectX 10 Use discard() or clip() in HLSL pixel shaders This requires the creation of two shader versions One without clip() for opaque primitives One with clip() for transparent primitives Don’t be tempted to a single clip()-enabled shader This will impact GPU performance w.r.t. Z culling A single shader with a static/dynamic branch will still not be as good as having two versions Side-effect: contribution towards “shaders explosion” Put clip() / discard() as early as possible in shader Compiler may be able to skip remaining instructions

Accessing Depth and Stencil DX10 enables the depth buffer to be read back as a texture Enables features without requiring a separate depth render Atmosphere pass Soft particles DOF Forward shadow mapping Screen-space ambient occlusion This is proving very popular to most game engines

Accessing Depth and Stencil with MSAA DX10.0: reading a depth buffer as SRV is only supported in single sample mode Requires a separate depth render path for MSAA Workarounds: Store depth in alpha of main FP16 RT Render depth into texture in a depth pre-pass Use a secondary render target in main color pass DX10.1 allows depth buffer access as Shader Resource View in all cases Fewer shaders Smaller memory footprint Better orthogonality

Multisampling Anti-Aliasing No need to allocate SwapChain as MSAA Apply MSAA onto to render targets that matter Back buffer only receives resolved contents of RT MSAA resolve operations are not free (on any HW) This means ResolveSubresource() costs performance It is essential to limit the number of MSAA resolves Requires good design of effects and post-process chain MSAA FP16 Render Target Resolve Operation Non-MSAA 8888 Back Buffer

DX10.1 Overview DX10.1 incremental update to DX10 Released with Windows Vista SP1 Supported on current graphic hardware Adds new features and performance paths Fixed most DX10 “unorthogonalities” Mandatory HW support (4xAA, FP32 filtering...) Resource copies Better support for MSAA Allow new algorithms to be implemented Mandatory AA sample location Shader Model 4.1 Etc.

DX10.1 Performance Advantages Multi-pass reduction: MSAA depth buffer readback as texture Per-sample PS execution & per-pixel coverage output mask 32 pixel shader inputs Individual MRT blend modes More with less: Gather4 (better filter kernel for e.g. shadows) Cube map arrays for e.g. global illumination Etc.

Conclusion Ensure the DX10 API is used optimally Sort by states Geometry Instancing Right balance of resource updates Right flags to Map() calls Limit Geometry Shader output size Write ALU-friendly shaders Use DX10.1 if supported Talk to IHVs!

Questions? ?