Direct3D 11 Performance Tips & Tricks Holger GruenAMD ISV Relations Cem CebenoyanNVIDIA ISV Relations.

Slides:



Advertisements
Similar presentations
Introduction to Direct3D 10 Course Porting Game Engines to Direct3D 10: Crysis / CryEngine2 Carsten Wenzel.
Advertisements

Advanced Visual Effects with Direct3D
OIT and Indirect Illumination using DX11 Linked Lists
DirectX11 Performance Reloaded
Deferred Shading Optimizations
Agenda Windows Display Driver Model (WDDM) What is GPUView?
CS123 | INTRODUCTION TO COMPUTER GRAPHICS Andries van Dam © 1/16 Deferred Lighting Deferred Lighting – 11/18/2014.
COMPUTER GRAPHICS CS 482 – FALL 2014 NOVEMBER 10, 2014 GRAPHICS HARDWARE GRAPHICS PROCESSING UNITS PARALLELISM.
Lecture 38: Chapter 7: Multiprocessors Today’s topic –Vector processors –GPUs –An example 1.
Understanding the graphics pipeline Lecture 2 Original Slides by: Suresh Venkatasubramanian Updates by Joseph Kider.
Status – Week 257 Victor Moya. Summary GPU interface. GPU interface. GPU state. GPU state. API/Driver State. API/Driver State. Driver/CPU Proxy. Driver/CPU.
RealityEngine Graphics Kurt Akeley Silicon Graphics Computer Systems.
Graphics Hardware CMSC 435/634. Transform Shade Clip Project Rasterize Texture Z-buffer Interpolate Vertex Fragment Triangle A Graphics Pipeline.
GI 2006, Québec, June 9th 2006 Implementing the Render Cache and the Edge-and-Point Image on Graphics Hardware Edgar Velázquez-Armendáriz Eugene Lee Bruce.
N-Buffers for efficient depth map query Xavier Décoret Artis GRAVIR/IMAG INRIA.
Fast GPU Histogram Analysis for Scene Post- Processing Andy Luedke Halo Development Team Microsoft Game Studios.
I3D Fast Non-Linear Projections using Graphics Hardware Jean-Dominique Gascuel, Nicolas Holzschuch, Gabriel Fournier, Bernard Péroche I3D 2008.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Chapter.
Skin Rendering GPU Graphics Gary J. Katz University of Pennsylvania CIS 665 Adapted from David Gosselin’s Power Point and article, Real-time skin rendering,
3D Graphics Processor Architecture Victor Moya. PhD Project Research on architecture improvements for future Graphic Processor Units (GPUs). Research.
Z-Buffer Optimizations Patrick Cozzi Analytical Graphics, Inc.
Z-Buffer Optimizations Patrick Cozzi Analytical Graphics, Inc.
The programmable pipeline Lecture 10 Slide Courtesy to Dr. Suresh Venkatasubramanian.
Mapping Computational Concepts to GPU’s Jesper Mosegaard Based primarily on SIGGRAPH 2004 GPGPU COURSE and Visualization 2004 Course.
Status – Week 260 Victor Moya. Summary shSim. shSim. GPU design. GPU design. Future Work. Future Work. Rumors and News. Rumors and News. Imagine. Imagine.
GPU Graphics Processing Unit. Graphics Pipeline Scene Transformations Lighting & Shading ViewingTransformations Rasterization GPUs evolved as hardware.
You can use 3D graphics to enhance and differentiate your Metro style app.
Shader Model 5.0 and Compute Shader
High Performance in Broad Reach Games Chas. Boyd
Computer Graphics Mirror and Shadows
Filtering Approaches for Real-Time Anti-Aliasing /
© Copyright Khronos Group, Page 1 Harnessing the Horsepower of OpenGL ES Hardware Acceleration Rob Simpson, Bitboys Oy.
Basics of a Computer Graphics System Introduction to Computer Graphics CSE 470/598 Arizona State University Dianne Hansford.
Enhancing GPU for Scientific Computing Some thoughts.
Mapping Computational Concepts to GPUs Mark Harris NVIDIA Developer Technology.
4/23/2017 4:23 AM © 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered.
Cg Programming Mapping Computational Concepts to GPUs.
CS 450: COMPUTER GRAPHICS REVIEW: INTRODUCTION TO COMPUTER GRAPHICS – PART 2 SPRING 2015 DR. MICHAEL J. REALE.
Next-Generation Graphics APIs: Similarities and Differences Tim Foley NVIDIA Corporation
The programmable pipeline Lecture 3.
Computer Graphics The Rendering Pipeline - Review CO2409 Computer Graphics Week 15.
GPU Computation Strategies & Tricks Ian Buck NVIDIA.
Ritual ™ Entertainment: Next-Gen Effects on Direct3D ® 10 Sam Z. Glassenberg Program Manager Microsoft ® – Direct3D ® Doug Service Director of Technology.
Xbox MB system memory IBM 3-way symmetric core processor ATI GPU with embedded EDRAM 12x DVD Optional Hard disk.
Stencil Routed A-Buffer
Computer Graphics Blending CO2409 Computer Graphics Week 14.
Havok FX Physics on NVIDIA GPUs. Copyright © NVIDIA Corporation 2004 What is Effects Physics? Physics-based effects on a massive scale 10,000s of objects.
Computer Graphics 3 Lecture 6: Other Hardware-Based Extensions Benjamin Mora 1 University of Wales Swansea Dr. Benjamin Mora.
Parallel Lighting Directional Lighting Distant Lighting (take your pick) Paul Taylor 2010.
DX10, Batching, and Performance Considerations Bryan Dudash NVIDIA Developer Technology.
VAR/Fence: Using NV_vertex_array_range and NV_fence Cass Everitt.
COMP 175 | COMPUTER GRAPHICS Remco Chang1/XX13 – GLSL Lecture 13: OpenGL Shading Language (GLSL) COMP 175: Computer Graphics April 12, 2016.
Image Fusion In Real-time, on a PC. Goals Interactive display of volume data in 3D –Allow more than one data set –Allow fusion of different modalities.
Build your own 2D Game Engine and Create Great Web Games using HTML5, JavaScript, and WebGL. Sung, Pavleas, Arnez, and Pace, Chapter 5 Examples 1.
Using the VTune Analyzer on Multithreaded Applications
GPU Architecture and Its Application
COMPUTER GRAPHICS CHAPTER 38 CS 482 – Fall 2017 GRAPHICS HARDWARE
CS427 Multicore Architecture and Parallel Computing
5.2 Eleven Advanced Optimizations of Cache Performance
Graphics Processing Unit
Renderer Design for Multiple Lights
Deferred Lighting.
Chapter 6 GPU, Shaders, and Shading Languages
From Turing Machine to Global Illumination
Graphics Processing Unit
UMBC Graphics for Games
UMBC Graphics for Games
UE4 Vulkan Updates & Tips
RADEON™ 9700 Architecture and 3D Performance
Frame Buffer Applications
Presentation transcript:

Direct3D 11 Performance Tips & Tricks Holger GruenAMD ISV Relations Cem CebenoyanNVIDIA ISV Relations

Agenda  Introduction  Shader Model 5  Resources and Resource Views  Multithreading  Miscellaneous  Q&A

Introduction  Direct3D 11 has numerous new features  However these new features need to be used wisely for good performance  For generic optimization advice please refer to last year‘s talk sets/The A to Z of DX10 Performance.pps sets/The A to Z of DX10 Performance.pps

Shader Model 5 (1)  Use Gather*/GatherCmp*() for fast multi-channel texture fetches  Use smaller number of RTs while still fetching efficiently  Store depth to FP16 alpha for SSAO  Use Gather*() for region fetch of alpha/depth  Fetch 4 RGB values in just three ops  Image post processing

Fetch 4 RGB values in just three texture ops red0 blue0 alpha0 red1 green1 blue1 alpha1 red2 green2 blue2 alpha2 red3 green3 blue3 alpha3 green0 red0 green0blue0 alpha0 red1 green1blue1 alpha1 red2 green2blue2 alpha2 red3 green3blue3 alpha3 red2 red3 red1red0 green2green3green1 blue2 blue3blue1 blue0

Shader Model 5 (2)  Use ‘Conservative Depth’ to keep early depth rejection active for fast depth sprites  Output SV_DepthGreater/LessEqual instead of SV_Depth from your PS  Keeps early depth rejection active even with shader-modified Z  The hardware/driver will enforce legal behavior  If you write an invalid depth value it will be clamped to the rasterized value

Depth Sprites under Direct3D 11 SceneGeometry drawn first Depth sprite for a sphere Direct3D 11 can fully cull this depth sprite if SV_DepthGreaterEqual is output by the PS

Shader Model 5 (3)  Use EvaluateAttribute*() for fast shader AA without super sampling  Call EvaluateAttribute*() at subpixel positions  Simpler shader AA for procedural materials  Input SV_COVERAGE to compute a color for each covered subsample and write average color  Slightly better image quality than pure MSAA  Output SV_Coverage for MSAA alpha-test  This feature has been around since 10.1  EvaluateAttribute*() makes implementation simpler  But check if alpha to coverage gives you what you need already, as it should be faster.

Shader Model 5 (4)  A quick Refresher on UAVs and Atomics  Use PS scattering and UAVs wisely  Use Interlocked*() Operations wisely  See DirectCompute performance presentation!

Shader Model 5 (5)  Reduce stream out passes  Addressable stream output  Output to up to 4 streams in one pass  All streams can have multiple elements  Write simpler code using Geometry shader instancing  Use SV_SInstanceID instead of loop index

Shader Model 5 (6)  Force early depth-stencil testing for your PS using [earlydepthstencil]  Can introduce significant speedup specifically if writing to UAVs or AppendBuffers  AMD‘s OIT demo uses this  Put ‘[earlydepthstencil]’ above your pixel shader function declaration to enable it

Early Depth Stencil and OIT Opaque Geometry drawn first Transparent Geometry Drawn after all opaque Geometry A ‘[earlydepthstencil]’ pixel shader that writes OIT color layers to a UAV only will cull all pixels outside the purple area! Projection Plane

Shader Model 5 (7)  Use the numerous new intrinsics for faster shaders  Fast bitops – countbits(), reversebits() (needed in FFTs), etc.  Conversion instructions - fp16 to fp32 and vice versa (f16to32() and f32to16())  Faster packing/unpacking  Fast coarse deriatives (ddx/y_coarse) ...

Shader Model 5 (8)  Use Dynamic shader linkage of subroutines wisely  Subroutines are not free  No cross function boundary optimizations  Only use dynamic linkage for large subroutines  Avoid using a lot of small subroutines

Resources and Resource Views (1)  Reduce memory size and bandwidth for more performance  BC6 and BC7 provide new capabilities  Very high quality, and HDR support  All static textures should now be compressible

BC7 image quality BC7 Compressed BC1 Compressed Original Image

Resources and Resource Views (2)  Use Read-Only depth buffers to avoid copying the depth buffer  Direct3D 11 allows the sampling of a depth buffer still bound for depth testing  Useful for deferred lighting if depth is part of the g-buffer  Useful for soft particles  AMD: Using a depth buffer as a SRV may trigger a decompression step  Do it as late in the frame as possible

Free Threaded Resource Creation  Use fast Direct3D 11 asynchronous resource creation  In general it should just be faster and more parallel  Do not destroy a resource in a frame in which it’s used  Destroying resources would most likely cause synchronizing events  Avoid create-render-destroy sequences

Display Lists (aka command lists created from a deferred context)  First make sure your app is multi- threaded well  Only use display lists if command construction is a large enough bottleneck  Now consider display lists to express parallelism in GPU command construction  Avoid fine grained command lists  Drivers are already multi-threade d

Deferred Contexts  On deferred contexts Map() and UpdateSubResource() will use extra memory  Remember, all initial Maps need to use the DISCARD semantic  Note that on a single core system a deferred context will be slower than just using the immediate context  For dual core, it is also probably best to just use the immediate context  Don’t use Deferred Contexts unless there is significant parallelism

Miscellaneous  Use DrawIndirect to further lower your CPU overhead  Kick off instanced draw calls/dispatch using args from a GPU written buffer  Could use the GPU for limited scene traversal and culling  Use Append/Consume Buffers for fast ’stream out‘  Faster than GS as there are no input ordering constraints  One pass SO with ’unlimited‘ data amplification

Questions?