Chas. Boyd, Principal PM, Microsoft OSG Graphics


Direct3D 12
Chas. Boyd, Principal PM, Microsoft OSG Graphics

The goal is to highlight details in a few areas that are newest (haven't been presented before), and also to provide a deeper explanation of the why: the philosophy.

Outline
- Overall objectives of DirectX 12
- Schedule (shipped last week)
- DirectX 12 execution model: root signatures, ExecuteIndirect, multi-engine, multi-adapter
- Tools, debugging
- Hardware feature levels and tiers

Direct3D
- The 3D graphics API for DirectX
- Targeted primarily at games
- Innovation and evolution over time
- Balance: ease of programming, hardware features, performance

Evolution
1995  DirectX 1     DirectDraw: hardware blit and page flip
1996  DirectX 2     Direct3D: software render, execute buffers
1996  DirectX 3     Hardware-accelerated rasterization
1997  DirectX 5     DrawPrimitive, dual-texture, 1-bit 'shader'
1998  DirectX 6     Multi-texture blending, DXTC compression, bump mapping
1999  DirectX 7     Hardware vertex processing (transformation and lighting)
2000  DirectX 8     First programmable shaders
2001  DirectX 8.1   More instructions
2002  DirectX 9     High Level Shading Language, shaders of 32 instructions
2003  DirectX 9.0c  Float pixels, HLSL with 1000s of instructions per shader
2006  DirectX 10    Caps-free, geometry shaders
2009  DirectX 11    Tessellation, DirectCompute
2012  DirectX 11.1  Performance and ARM CPU support
2013  DirectX 11.2  Tiled resources (aka megatexture)
2015  DirectX 12    Performance: multithreading, multi-engine, multi-adapter

Evolution of DirectX releases and the key features of each version.

Direct3D 12
- This version is about performance
- API/DDI model runs on most current GPUs: don't wait for a hardware install base
- Optimizes the entire stack: app, engine, driver, OS, GPU (especially the driver)
- Result is a major shift in work distribution
- A more 'Direct' API: work is more consistent, less magic behind the scenes

Back in 2007, we noticed that drivers were analyzing command streams to identify scheduling and parallelism opportunities. This is a potentially unbounded search problem, not something you want to have happen on every Draw call.

Core Features
- Command buffers and queues
- Resource indexing and tables
- Heaps, resources, views
- Resource transitions are finite duration
- Pipeline State Objects (PSOs), with caching
- Execution model

Asynchronous Resource Access
- Execution is not constrained by resource access pattern
- No enforced serialization of access to memory objects
- Resource synchronization is now 'opt-in'
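For instance, where D3D11 tracked hazards automatically, in D3D12 the app opts in by recording explicit transition barriers. A minimal sketch using the shipped API, assuming pTexture and pCommandList already exist:

// Opt-in synchronization: explicitly transition a texture from
// render target to shader resource before sampling it.
D3D12_RESOURCE_BARRIER barrier = {};
barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
barrier.Transition.pResource = pTexture;       // assumed ID3D12Resource*
barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
barrier.Transition.StateAfter = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
pCommandList->ResourceBarrier(1, &barrier);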

A GPU Function Call
- Executing code on the GPU is *like* calling a function
- GPUs have special memory for the function 'arguments'
- This is not a stack, but very fast 32-bit 'registers'
- Apps can use this to pass in high-frequency-change parameters like constants or resources (via descriptors)

Language-style explanation: executing code on the GPU is like a function call. It is asynchronous since it runs on a separate core, but it is still a call. It turns out that the hardware has some memory that can be used to pass the arguments of that function call. This makes sense, since all our GPU code uses register-based calling conventions.

GPU Root Arguments
- Resource descriptors take 2 DWORDs
- Matrices take many constants...
- What if you need more than 32-64 DWORDs of state?
  - Create a constant buffer and specify its descriptor
  - Create a resource descriptor table and specify its index
- The root signature is the declaration of these arguments: their number, types, etc.

Root Signature
- The root signature defines the number of arguments and their types: constants, descriptors, descriptor tables
- Performance improves with fewer DWORDs used, so keep the argument list short
- Try not to change the signature too often: a few times per frame
- Analog of a function signature and its call arguments: the main(int argc, char *argv[]) of your GPU code

Using Root Signatures
- Defined using API syntax so both app and driver agree
- Specified as part of PSO creation; the PSO will likely have many dependencies on it
- Separate signatures for graphics and compute tasks
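A minimal sketch of how the root signature is specified at PSO creation in the shipped API (the shader blobs and the remaining state fields are assumed to be set up elsewhere):

D3D12_GRAPHICS_PIPELINE_STATE_DESC psoDesc = {};
psoDesc.pRootSignature = pSig;   // the PSO bakes in its root signature
psoDesc.VS = { pVSBlob->GetBufferPointer(), pVSBlob->GetBufferSize() };
psoDesc.PS = { pPSBlob->GetBufferPointer(), pPSBlob->GetBufferSize() };
// ... rasterizer, blend, depth-stencil, input layout, RT formats ...

ID3D12PipelineState* pPSO = nullptr;
pDevice->CreateGraphicsPipelineState(&psoDesc, IID_PPV_ARGS(&pPSO));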

API – Root Parameter Types

struct D3D12_ROOT_SIGNATURE_SLOT
{
    D3D12_ROOT_ARGUMENT_TYPE ArgumentType;
    union
    {
        D3D12_DESCRIPTOR_TABLE_LAYOUT DescriptorTable;
        D3D12_ROOT_CONSTANTS Constants;
        D3D12_ROOT_DESCRIPTOR Descriptor;
    };
};
...

Root Signature Creation

D3D12_ROOT_SIGNATURE_SLOT SigSlots[4];
ID3D12RootSignature* pSig;

SigSlots[0].ArgumentType = D3D12_ROOT_ARGUMENT_32BIT_CONSTANTS;
SigSlots[1].ArgumentType = D3D12_ROOT_ARGUMENT_CBV;
SigSlots[2].ArgumentType = D3D12_ROOT_ARGUMENT_DESCRIPTOR_TABLE;
SigSlots[3].ArgumentType = D3D12_ROOT_ARGUMENT_DESCRIPTOR_TABLE;
...
pDevice->CreateRootSignature(SigSlots, sizeof(SigSlots), &pSig);

Setting Root Arguments

pCommandList->SetGraphicsRootSignature(pSignature);
pCommandList->SetGraphicsRoot32bitConstant(0, BaseOffsetInCBV);
pCommandList->SetGraphicsRootConstantBufferView(1, CBVDescriptorHandle);
pCommandList->SetGraphicsDescriptorTable(2, SamplerDescriptorTable);
pCommandList->SetGraphicsDescriptorTable(3, TextureDescriptorTable);

This is how you actually set the values that you are passing to the GPU via the root arguments.

HLSL Works Unchanged

cbuffer DrawConstants : register(b0)
{
    uint ConstantBufferOffset;
};

Buffer ObjectPerDrawParams : register(t7);
Texture2D ObjectTextureArray[5] : register(t2);
SamplerState ObjectSamplers[2] : register(s0);

Can Define Signature in HLSL

#define MyRS1 "RootFlags( ALLOW_INPUT_ASSEMBLER_INPUT_LAYOUT | " \
              "DENY_VERTEX_SHADER_ROOT_ACCESS), " \
              "CBV(b0, space = 1), " \
              "SRV(t0), " \
              "UAV(u0), " \
              "DescriptorTable( CBV(b1), " \
              "SRV(t1, numDescriptors = 8), " \
              "UAV(u1, numDescriptors = unbounded)), " \
              "DescriptorTable(Sampler(s0, space=1, numDescriptors = 4)), " \
              "RootConstants(num32BitConstants=3, b10), " \
              "StaticSampler(s1)," \
              "StaticSampler(s2, " \
              "addressU = TEXTURE_ADDRESS_CLAMP, " \
              "filter = FILTER_MIN_MAG_MIP_LINEAR )"

If you want to define the root signature using shader syntax, you can.
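In the shipped HLSL compiler, a string like this is attached to a shader entry point with the [RootSignature(MyRS1)] attribute, so the compiled shader carries its root signature with it.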

ExecuteIndirect()
- Perform multiple draws with a single API call
- 'Arguments' of the draw calls come from a buffer
- App defines the buffer contents via a 'command signature' struct
- Number of draws can be controlled by the CPU or by the GPU
- Works on all DirectX 12-capable hardware, from FL 11.0 and up

Much like everything else in DirectX, we've abstracted the nuances of all the hardware and enabled this feature on every 12 GPU.

ExecuteIndirect Cmd Signature
- Operations performed by ExecuteIndirect are described by a 'command signature'
- Describes the layout of the argument buffer and the set of commands
- Operations include:
  - Set vertex or index buffer
  - Change root constants
  - Set root resource views (SRV, UAV, CBV)
  - Draw, DrawIndexed, or Dispatch

Currently the draw-call type is fixed for the entire buffer, at least on PC, since the PSO is fixed for the entire command.
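A minimal sketch of creating such a command signature with the shipped API; here each record in the argument buffer holds one root CBV pointer followed by DrawIndexed arguments (the struct layout and root parameter index are illustrative, and pDevice/pSig are assumed from earlier slides):

// One record per draw in the indirect argument buffer.
struct IndirectCommand
{
    D3D12_GPU_VIRTUAL_ADDRESS Cbv;          // root CBV argument
    D3D12_DRAW_INDEXED_ARGUMENTS DrawArgs;  // DrawIndexedInstanced args
};

D3D12_INDIRECT_ARGUMENT_DESC args[2] = {};
args[0].Type = D3D12_INDIRECT_ARGUMENT_TYPE_CONSTANT_BUFFER_VIEW;
args[0].ConstantBufferView.RootParameterIndex = 1;  // matches root slot 1
args[1].Type = D3D12_INDIRECT_ARGUMENT_TYPE_DRAW_INDEXED;

D3D12_COMMAND_SIGNATURE_DESC desc = {};
desc.ByteStride = sizeof(IndirectCommand);
desc.NumArgumentDescs = _countof(args);
desc.pArgumentDescs = args;

// The root signature is required because the signature changes root views.
ID3D12CommandSignature* pCmdSig = nullptr;
pDevice->CreateCommandSignature(&desc, pSig, IID_PPV_ARGS(&pCmdSig));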

ExecuteIndirect vs Draw Loop

// Per-draw loop version: the CPU binds and draws each object.
for (UINT drawIdx = drawStart; drawIdx < drawEnd; ++drawIdx)
{
    // Set bindings
    cmdLst->SetGraphicsRootConstantBufferView(RT_CBV, constantsPointer);
    constantsPointer += sizeof(DrawConstantBuffer);
    auto textureSRV = textureStartSRV.MakeOffsetted(staticData->textureIndex, handleIncrementSize);
    cmdLst->SetGraphicsRootDescriptorTable(RT_SRV, textureSRV);
    cmdLst->DrawIndexedInstanced(dynamicData->indexCount, 1, dynamicData->indexStart, staticData->vertexStart, 0);
}

// ExecuteIndirect version: one call replays the whole argument buffer.
mCmdLst->SetGraphicsRootDescriptorTable(RT_SRV, mTextureStart);
mCmdLst->ExecuteIndirect(mCommandSignature, settings.numAsteroids, frame->mIndirectArgBuffer->Heap(), 0, nullptr, 0);

ExecuteIndirect() Performance

        DX11       DX12       DX12 Bindless   DX12 ExecuteIndirect
CPU     39.19 ms   33.41 ms   28.77 ms        5.69 ms
GPU     34.81 ms   12.85 ms   11.86 ms        10.59 ms
FPS     13.5 fps   21.6 fps   24.6 fps        60.0 fps

(CPU row is total CPU time.) Some simple apps have been able to put almost all their work for a given frame into one ExecuteIndirect call: orders of magnitude reduction in CPU API overhead.

Multi-Engine

Multi-Engine
- GPUs contain multiple cores today: 3D cores, compute cores, copy engines, etc.
- In most hardware these can operate asynchronously
- Some variance in granularity of pre-emption

Programming Model in DirectX 11
[Diagram: CPU0-CPU3 all feeding a single GPU]

Asynchronous Execution in DirectX 12
[Diagram: CPU0-CPU3 submitting independently to a graphics engine, a compute engine, and multiple copy engines]

And there are other components on there, like the encoders and decoders, the display scan-out engines, etc.

Multi-Engine Model
- All of these are just cores, aka 'engines', and they can be invoked asynchronously
- Model is a queue per core for independent async operation
- A queue guarantees serial order of execution on a single engine
- Can specify priorities between queues: enables background processing in 'idle' clock cycles
- Can also implement semaphores across queues
- Implementations vary only in the granularity of pre-emption
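A minimal sketch of the queue-per-engine model with cross-queue synchronization via a fence; pGraphicsQueue, pGfxList, and pComputeList are assumed to exist:

// Create a dedicated compute queue.
D3D12_COMMAND_QUEUE_DESC cqDesc = {};
cqDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
ID3D12CommandQueue* pComputeQueue = nullptr;
pDevice->CreateCommandQueue(&cqDesc, IID_PPV_ARGS(&pComputeQueue));

// A fence acts as the cross-queue 'semaphore'.
ID3D12Fence* pFence = nullptr;
pDevice->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&pFence));

// Graphics work signals the fence; compute work waits on it.
pGraphicsQueue->ExecuteCommandLists(1, &pGfxList);
pGraphicsQueue->Signal(pFence, 1);
pComputeQueue->Wait(pFence, 1);   // GPU-side wait; the CPU is not blocked
pComputeQueue->ExecuteCommandLists(1, &pComputeList);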

Multi-Engine Hierarchy
- Queue types: 3D, Compute, Copy
- Extract all the parallelism out of the hardware that's available

Why are these nested? Because that's how the hardware actually works: the 3D engine can do anything, including compute tasks and the highest-bandwidth copy tasks. A compute queue is just using the 3D engine when you know you can power down the graphics-specific portions of that core. A copy queue can be serviced by a separate blitter core, aka a DMA engine.

Tools for Multi-Engine
[Screenshot: timeline view with GPU engine lanes alongside CPU core lanes]

This shows how the model is expressed even in the tools. You can see that the GPU engines (3D and Copy) are peers to the CPU cores in the model.

Multi-Engine Scenario: Hybrid Device
- Main rendering on the discrete GPU
- Asynchronous copy engine sends the image to the integrated GPU
- Discrete GPU can start on the next frame
- Integrated GPU applies post-processing effects

A prototype of this is working now. We see benefits from this, and they increase as the performance of the integrated GPU grows.

Multi-adapter

Multi-adapter
- PCs can contain multiple graphics cards
- Some graphics cards have multiple GPUs
- Applications should be able to assign work to any engine on any graphics card, and create memory resources in any engine's memory
- Driver can override the app if it thinks it can improve performance

Multi-adapter
- App can enumerate 'adapters' (graphics cards) from PCI and create a D3D device for each
- Each adapter may have multiple 'nodes' (GPUs), each with its own engines and memory
- Apps can create queues on any engine and submit command buffers
- Apps can allocate resources in memory associated with any GPU

Drivers can 'link' multiple adapters to make them look to the app/runtime like a single adapter. They usually won't do this unless/until the app does a poor job.
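A minimal sketch of enumerating adapters and creating a device on each, using DXGI and the shipped D3D12 entry points:

IDXGIFactory4* pFactory = nullptr;
CreateDXGIFactory1(IID_PPV_ARGS(&pFactory));

IDXGIAdapter1* pAdapter = nullptr;
for (UINT i = 0;
     pFactory->EnumAdapters1(i, &pAdapter) != DXGI_ERROR_NOT_FOUND; ++i)
{
    ID3D12Device* pDevice = nullptr;
    if (SUCCEEDED(D3D12CreateDevice(pAdapter, D3D_FEATURE_LEVEL_11_0,
                                    IID_PPV_ARGS(&pDevice))))
    {
        // GetNodeCount() reports how many GPUs ('nodes') this adapter
        // exposes; queue and resource creation take a node mask to pick one.
        UINT nodes = pDevice->GetNodeCount();
    }
    pAdapter->Release();
}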

Hardware Model
[Diagram: a CPU and multiple GPUs connected over PCIe]

More API Capability
- Predication, queries, and counters: efficiently managed in large numbers via the heap model
- Resource transitions are finite duration
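A minimal sketch of the heap model for queries, feeding predication; pResultBuffer is an assumed ID3D12Resource* in the appropriate state:

// Many queries live in one heap instead of being individual objects.
D3D12_QUERY_HEAP_DESC qhDesc = {};
qhDesc.Type = D3D12_QUERY_HEAP_TYPE_OCCLUSION;
qhDesc.Count = 256;
ID3D12QueryHeap* pQueryHeap = nullptr;
pDevice->CreateQueryHeap(&qhDesc, IID_PPV_ARGS(&pQueryHeap));

pCommandList->BeginQuery(pQueryHeap, D3D12_QUERY_TYPE_OCCLUSION, 0);
// ... draw the bounding volume ...
pCommandList->EndQuery(pQueryHeap, D3D12_QUERY_TYPE_OCCLUSION, 0);
pCommandList->ResolveQueryData(pQueryHeap, D3D12_QUERY_TYPE_OCCLUSION,
                               0, 1, pResultBuffer, 0);

// Later draws can be skipped on the GPU based on the query result.
pCommandList->SetPredication(pResultBuffer, 0,
                             D3D12_PREDICATION_OP_EQUAL_ZERO);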

New Hardware Features
- Conservative Rasterization
- Tiled Resource Volumes
- Standard Swizzle
- Rasterizer Ordered Views (ROVs)
- Compute Shader Pixel Format Conversion

The hardware ecosystem is not standing still. ROVs enable spatial random access but temporal serialization: useful when starting from a graphics task and writing to a general data structure (UAV), e.g. when you sort input triangles beforehand and want to retain that order, or other algorithms where order matters. ASTC texture compression was on this list, but I personally messed up some paperwork on that; the hardware is still coming to the ecosystem as fast as possible.

Reporting Implementations
- Need to inform app developers about hardware characteristics
- Original model was individual caps bits: DirectX 9 had ~400 caps (~500 counting pixel formats)
- Issues:
  - What is good vs. bad? Combinatoric explosion?
  - What if I need multiple features for a technique?
  - Did not provide an indication of direction for the industry
  - Looked to developers like millions of combinations were possible, even though there were only a few implementations
  - No way to know how much hardware supported the specific set of caps required for a particular technique

Organizing Implementations
- Individual features now have 'tiers', e.g. tiled resources tier 2, conservative rasterization tier 1
- A 'feature level' is a grouping of tiers: enables devs to identify a set of features as a unit
- Orthogonal to API version! The API version number defines the syntax/API used:
  - Direct3D 12 API supports FEATURE_LEVEL_11, _12, etc.
  - Direct3D 11 API supports FEATURE_LEVEL_9_3 through _11_3

DX10/11 introduced feature levels as a category for grouping implementations; DX11 first introduced tiers with tiled resources, and DX12 uses tiers for several things. When you go back and target existing hardware, it is hard to get it to align. These are getting simpler over time: we are able to reduce the number of tiers in the hardware as we work with the IHVs.
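A minimal sketch of querying tiers at startup with the shipped CheckFeatureSupport API:

D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
pDevice->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS,
                             &options, sizeof(options));

if (options.ConservativeRasterizationTier >=
    D3D12_CONSERVATIVE_RASTERIZATION_TIER_1)
{
    // Enable the conservative-rasterization code path.
}
// options.TiledResourcesTier, options.ResourceBindingTier, etc.
// report the other tiers the same way.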

Tools
- SDK layer can be enabled for detailed validation
- Tools are now built in concert with the API: capture/playback, timing analysis, visualization of intermediate results
- Collaboration with the other tools teams (Visual Studio)
- New instrumentation has been added to drivers: detailed stats on internal registers
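Enabling the SDK validation layer is a one-liner, done before creating the device:

ID3D12Debug* pDebug = nullptr;
if (SUCCEEDED(D3D12GetDebugInterface(IID_PPV_ARGS(&pDebug))))
{
    pDebug->EnableDebugLayer();   // must precede D3D12CreateDevice
}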

Visual Studio 2015
Unified CPU, GPU, and system profiling and debugging tool for the Universal App Platform and the full breadth of Windows devices.

Shader Edit and Apply
- Side-by-side windows for HLSL source code and shader compiler output
- Edit shader code and apply the changes to the captured log file to view their impact

Summary
The DirectX 12 execution model enables:
- Flexible access to CPU/GPU memory resources
- Multi-threaded scalability for CPU efficiency
- GPU-side work preparation via ExecuteIndirect
- Multiple asynchronous queues: 3D, Compute, Copy
- The ability to target any processor in the machine via multi-engine and multi-adapter

GPGPU was not the main focus of DX12, yet several of these features massively improve DirectCompute capability and performance. Support for multi-GPU, and for VR/stereo.

Fin

Resources
Follow @DirectX12 on Twitter
http://blogs.msdn.com/directx
Sign up for the Early Access program at http://tinyurl.com/o9wq7fb or http://1drv.ms/1pmVF6c

DirectX12 the Movie
BUILD 2014: https://channel9.msdn.com/Events/Build/2014/3-564
GDC 2015: https://channel9.msdn.com/Events/GDC/GDC-2015/Advanced-DirectX12-Graphics-and-Performance
GDC 2015: https://channel9.msdn.com/Events/GDC/GDC-2015/Better-Power-Better-Performance-Your-Game-on-DirectX12
BUILD 2015: https://channel9.msdn.com/Events/Build/2015/3-673 (slightly updated version of Max's GDC 2015 talk)
GDC 2015: http://channel9.msdn.com/events/GDC/GDC-2015/Solve-the-Tough-Graphics-Problems-with-your-Game-Using-DirectX-Tools

DirectX12 Videos
New YouTube channel: Microsoft Graphics Education, with talks by the developers.