Tessellation in a Low Poly World Nicolas Thibieroz AMD Graphics Products Group Original materials from Bill Bilodeau 1 15/01/2014.

Tessellation in a Low Poly World
Nicolas Thibieroz, AMD Graphics Products Group, nicolas.thibieroz@amd.com
Original materials from Bill Bilodeau
GDC Paris 2008

What is Tessellation?
- Tessellation is the process of adding new primitives to an existing model.
- Triangle counts can be dialed in by adjusting the tessellation level.
[Figure: the same model at low, medium, and high tessellation levels]

AMD Hardware Tessellator
[Diagram: graphics pipeline with the stages Input Assembler, Tessellator, Vertex Shader, Rasterizer, Pixel Shader, and Output Merger, with Memory / Resources feeding the shader stages]

Hardware tessellation allows you to render more polygons for better silhouettes
[Figure: initial concept artwork from Bay Raitt, Valve]

Surface control cages are easier to work with than individual triangles
- Artists prefer to create models this way.
- Animations are simpler on a control cage.
- The control cage can be animated on the GPU, then tessellated in a second pass.
[Diagram: the animated control cage goes through Vertex Shader and Pixel Shader and is written out with R2VB; a second pass runs Tessellator, Vertex Shader, and Pixel Shader]

Hardware tessellation is a form of compression
- Smaller footprint: you only need to store the control cage and possibly a displacement map.
- Improved bandwidth: less data to transfer from memory to the GPU.

Three types of primitives, or superprims, are supported
- Triangles
- Quads
- Lines

There are two tessellation modes
- Continuous
- Adaptive

Continuous Tessellation
- Specify a floating point tessellation level per draw call.
  - Tessellation levels range from 1.0 to 14.99.
- Eliminates popping as vertices are added through tessellation.
[Figure: the same primitive at tessellation levels 1.0, 1.1, 1.3, 1.7, and 2.0]
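One way to drive continuous tessellation is to derive the per-draw level from camera distance and clamp it to the 1.0 to 14.99 range quoted above. The following is only a sketch: the helper name, the linear falloff, and the near/far parameters are assumptions for illustration and are not part of the tessellation library.

```cpp
#include <algorithm>
#include <cassert>

// Range quoted on the slide for continuous tessellation.
constexpr float kMinTessLevel = 1.0f;
constexpr float kMaxTessLevel = 14.99f;

// Hypothetical helper: maps camera distance to a tessellation level.
// Objects at or inside nearDist get the maximum level, objects at or beyond
// farDist get the minimum level, with a linear blend in between.
float TessLevelFromDistance(float distance, float nearDist, float farDist)
{
    assert(farDist > nearDist);
    float t = (distance - nearDist) / (farDist - nearDist);
    t = std::max(0.0f, std::min(1.0f, t));  // clamp to [0, 1]
    return kMaxTessLevel * (1.0f - t) + kMinTessLevel * t;
}
```

Because the level is a float that changes smoothly with distance, new vertices appear gradually instead of popping in. The result could be passed to TSSetMaxTessellationLevel( pd3dDevice, level ) before the draw call.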

Adaptive allows different levels of tessellation within the same mesh
[Figure: a mesh whose edges carry different edge tessellation factors (3.x, 5.x, and 7.x)]

Adaptive tessellation can be done in real-time using multiple passes
[Diagram: multi-pass flow with the labels Superprim Mesh, Vertex Shader, Pixel Shader, R2VB, Transformed Superprim Mesh, Sampler, Tessellation Factors, Stream 0, Stream 1, and Tessellator]
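The slides do not say how the per-edge factors are chosen. A minimal sketch of one common heuristic follows: each edge gets a factor from its length relative to its distance from the camera, clamped between a minimum and maximum level. All names, the heuristic itself, and the edge ordering are assumptions. The key property is that a factor depends only on the two endpoints of the edge, so two triangles sharing an edge compute the same factor and the tessellated mesh stays free of cracks.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

static float Distance(const Vec3& a, const Vec3& b)
{
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Hypothetical heuristic: longer and closer edges get higher factors.
// The result is symmetric in a and b, so shared edges always match.
float EdgeTessFactor(const Vec3& a, const Vec3& b, const Vec3& eye,
                     float scale, float minLevel, float maxLevel)
{
    assert(minLevel <= maxLevel);
    const Vec3 mid{ (a.x + b.x) * 0.5f, (a.y + b.y) * 0.5f, (a.z + b.z) * 0.5f };
    const float edgeLen = Distance(a, b);
    const float dist = std::max(Distance(mid, eye), 1e-4f);  // avoid divide by zero
    const float factor = scale * edgeLen / dist;
    return std::max(minLevel, std::min(maxLevel, factor));
}

// One factor per triangle edge, in the (assumed) order v0-v1, v1-v2, v2-v0.
std::array<float, 3> TriangleEdgeFactors(const Vec3& v0, const Vec3& v1,
                                         const Vec3& v2, const Vec3& eye,
                                         float scale, float minLevel,
                                         float maxLevel)
{
    return {{ EdgeTessFactor(v0, v1, eye, scale, minLevel, maxLevel),
              EdgeTessFactor(v1, v2, eye, scale, minLevel, maxLevel),
              EdgeTessFactor(v2, v0, eye, scale, minLevel, maxLevel) }};
}
```

In the multi-pass setup above this computation would run on the GPU and be written out with R2VB; the CPU version here only illustrates the arithmetic.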

Code Example: Continuous Tessellation
    // Enable tessellation:
    TSSetTessellationMode( pd3dDevice, TSMD_ENABLE_CONTINUOUS );
    // Set tessellation level:
    TSSetMaxTessellationLevel( pd3dDevice, sg_fMaxTessellationLevel );
    // Select appropriate technique to render our tessellated objects:
    sg_pEffect->SetTechnique( "RenderTessellatedDisplacedScene" );
    // Render all passes with tessellation
    V( sg_pEffect->Begin( &cPasses, 0 ) );
    for ( iPass = 0; iPass < cPasses; iPass++ )
    {
        V( sg_pEffect->BeginPass( iPass ) );
        V( TSDrawMeshSubset( sg_pMesh, 0 ) );
        V( sg_pEffect->EndPass() );
    }
    V( sg_pEffect->End() );
    // Disable tessellation:
    TSSetTessellationMode( pd3dDevice, TSMD_DISABLE );

The vertex shader is used as an evaluation shader
[Diagram: Super-prim Mesh -> Tessellator -> Tessellated Mesh -> Vertex Shader (Evaluation Shader), which reads the Displacement Map through a Sampler -> Tessellated and Displaced Mesh]

Example Code: Evaluation Vertex Shader
    struct VsInputTessellated
    {
        // Barycentric weights for this vertex
        float3 vBarycentric   : BLENDWEIGHT0;
        // Data from superprim vertex 0:
        float4 vPositionVert0 : POSITION0;
        float2 vTexCoordVert0 : TEXCOORD0;
        float3 vNormalVert0   : NORMAL0;
        // Data from superprim vertex 1:
        float4 vPositionVert1 : POSITION4;
        float2 vTexCoordVert1 : TEXCOORD4;
        float3 vNormalVert1   : NORMAL4;
        // Data from superprim vertex 2:
        float4 vPositionVert2 : POSITION8;
        float2 vTexCoordVert2 : TEXCOORD8;
        float3 vNormalVert2   : NORMAL8;
    };

Example Code: Evaluation Vertex Shader
    VsOutputTessellated VSRenderTessellatedDisplaced( VsInputTessellated i )
    {
        VsOutputTessellated o;
        // Compute new position based on the barycentric coordinates:
        float3 vPosTessOS = i.vPositionVert0.xyz * i.vBarycentric.x +
                            i.vPositionVert1.xyz * i.vBarycentric.y +
                            i.vPositionVert2.xyz * i.vBarycentric.z;
        // Output world-space position:
        o.vPositionWS = vPosTessOS;
        // Compute new normal vector for the tessellated vertex:
        o.vNormalWS = i.vNormalVert0.xyz * i.vBarycentric.x +
                      i.vNormalVert1.xyz * i.vBarycentric.y +
                      i.vNormalVert2.xyz * i.vBarycentric.z;
        // Compute new texture coordinates based on the barycentric coordinates:
        o.vTexCoord = i.vTexCoordVert0.xy * i.vBarycentric.x +
                      i.vTexCoordVert1.xy * i.vBarycentric.y +
                      i.vTexCoordVert2.xy * i.vBarycentric.z;
        // Displace the tessellated vertex (sample the displacement map)
        o.vPositionWS = DisplaceVertex( vPosTessOS, o.vTexCoord, o.vNormalWS );
        // Transform position to screen-space:
        o.vPosCS = mul( float4( o.vPositionWS, 1.0 ), g_mWorldViewProjection );
        return o;
    } // End of VsOutputTessellated VSRenderTessellatedDisplaced(..)
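The arithmetic of the evaluation shader can be checked on the CPU. The sketch below mirrors the barycentric blend in C++ and adds one plausible form of displacement. The slides do not show the body of DisplaceVertex, so moving the vertex along its unit normal by a scaled height is an assumption, as are all type and function names here.

```cpp
#include <cassert>
#include <cmath>

struct Float3 { float x, y, z; };

inline Float3 operator*(const Float3& v, float s) { return { v.x * s, v.y * s, v.z * s }; }
inline Float3 operator+(const Float3& a, const Float3& b) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }

// Weighted sum of one attribute from the three superprim vertices,
// matching the per-attribute blend in the evaluation shader.
Float3 BaryLerp(const Float3& a0, const Float3& a1, const Float3& a2, const Float3& bary)
{
    return a0 * bary.x + a1 * bary.y + a2 * bary.z;
}

Float3 Normalize(const Float3& v)
{
    const float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    assert(len > 0.0f);
    return v * (1.0f / len);
}

// Assumed displacement model: push the vertex along its unit normal.
// 'height' stands in for the value sampled from the displacement map.
Float3 DisplaceAlongNormal(const Float3& pos, const Float3& normal, float height, float scale)
{
    return pos + Normalize(normal) * (height * scale);
}
```

The interpolated normal is renormalized before use because a weighted sum of unit vectors is generally shorter than unit length.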

What if you want to do more?
- DirectX 9 has a limit of 15 float4 vertex input components.
  - High order surfaces need more inputs.
- TSToggleIndicesRetrieval() allows you to fetch the super-prim data from a vertex texture.
[Diagram: the Tessellator passes (u,v) to the Vertex Shader, which fetches the Bezier control points P0,0, P0,1 … P3,3 through a Sampler]
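As an illustration of a high order surface that needs 16 control points, here is a sketch of bicubic Bezier patch evaluation at a (u,v) location using the cubic Bernstein basis. In the setup described above the control points would be fetched from a vertex texture in the vertex shader; this CPU version only shows the math. The type and function names, and the choice of which index pairs with u, are assumptions.

```cpp
#include <cassert>

struct Point3 { float x, y, z; };

// Cubic Bernstein basis functions evaluated at t in [0, 1].
void Bernstein3(float t, float b[4])
{
    const float s = 1.0f - t;
    b[0] = s * s * s;
    b[1] = 3.0f * s * s * t;
    b[2] = 3.0f * s * t * t;
    b[3] = t * t * t;
}

// Evaluates a bicubic Bezier patch with control points P[i][j], where the
// first index is paired with u and the second with v (an assumed convention).
Point3 EvalBezierPatch(const Point3 P[4][4], float u, float v)
{
    assert(u >= 0.0f && u <= 1.0f && v >= 0.0f && v <= 1.0f);
    float bu[4], bv[4];
    Bernstein3(u, bu);
    Bernstein3(v, bv);
    Point3 r{ 0.0f, 0.0f, 0.0f };
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            const float w = bu[i] * bv[j];  // tensor-product weight
            r.x += P[i][j].x * w;
            r.y += P[i][j].y * w;
            r.z += P[i][j].z * w;
        }
    }
    return r;
}
```

Sixteen float3 control points exceed the vertex input limit mentioned above, which is why fetching them from a texture is attractive.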

Other Tessellation Library Functions
- TSDrawIndexed(…)
  - Analogous to DrawIndexedPrimitive(…)
- TSDrawNonIndexed(…)
  - Needed for adaptive tessellation, since every edge needs its own tessellation level
- TSSetMinTessellationLevel(…)
  - Sets the minimum tessellation level for adaptive tessellation
- TSComputeNumTessellatedPrimitives(…)
  - Calculates the number of tessellated primitives that will be generated by the tessellator

Displacement mapping alters tangent space
- To do normal mapping we need to rotate tangent space.
- Alternatively, use model space normal maps.
  - This doesn't work with animation or tiling.

Displacement map lighting
- Use the displacement map to calculate the per-pixel normal.
- Central differencing with neighboring displacements can approximate the derivative.
- Light with the computed normal.
- No need to use a normal map.
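The central differencing step can be sketched as follows. In practice this would run per pixel in the pixel shader; the CPU version below only illustrates the math. It assumes a height field where the displacement runs along +z, clamped addressing at the borders, and hypothetical names throughout.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct Normal3 { float x, y, z; };

// Row-major height field with clamp-to-edge addressing.
struct HeightMap {
    int width;
    int height;
    std::vector<float> texels;

    float At(int x, int y) const
    {
        x = std::max(0, std::min(width - 1, x));
        y = std::max(0, std::min(height - 1, y));
        return texels[static_cast<std::size_t>(y) * width + x];
    }
};

// Central differences approximate the slope from the two neighbors on each
// axis. For a surface z = h(x, y), the unnormalized normal is
// (-dh/dx, -dh/dy, 1).
Normal3 NormalFromHeights(const HeightMap& hm, int x, int y,
                          float heightScale, float texelSize)
{
    assert(texelSize > 0.0f);
    const float dhdx = heightScale * (hm.At(x + 1, y) - hm.At(x - 1, y)) / (2.0f * texelSize);
    const float dhdy = heightScale * (hm.At(x, y + 1) - hm.At(x, y - 1)) / (2.0f * texelSize);
    const float len = std::sqrt(dhdx * dhdx + dhdy * dhdy + 1.0f);
    return { -dhdx / len, -dhdy / len, 1.0f / len };
}
```

A flat region yields the normal (0, 0, 1), and a ramp rising along +x tilts the normal toward -x, which is the behavior a lighting pass expects.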

Terrain Rendering: Performance Results
Both use the same displacement map (2K x 2K) and identical pixel shaders.
                                                | Low Resolution with Tessellation | High Resolution, No Tessellation
On-disk model polygon count (pre-tessellation)  | 840 triangles                    | 1,280,038 triangles
Original model rendering cost                   | 1210 fps (0.83 ms)               |
Actual rendered model polygon count             | 1,008,038 triangles              | 1,280,038 triangles
VRAM vertex buffer size                         | 70 KB                            | 31 MB
VRAM index buffer size                          | 23 KB                            | 14 MB
Rendering time                                  | 821.41 fps (1.22 ms)             | 301 fps (3.32 ms)
Rendering with tessellation is > 6X faster (subtracting the cost of shading) and provides memory savings of over 44 MB!
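The "> 6X" figure is not derived on the slide. One reading that reproduces it, assuming the 0.83 ms "original model rendering cost" is the shading cost subtracted from both rendering times, is the following worked arithmetic:

```latex
\[
\frac{3.32\,\text{ms} - 0.83\,\text{ms}}{1.22\,\text{ms} - 0.83\,\text{ms}}
  = \frac{2.49}{0.39} \approx 6.4
\]
\[
(31\,\text{MB} + 14\,\text{MB}) - (70\,\text{KB} + 23\,\text{KB}) \approx 44.9\,\text{MB}
\]
```

Without the subtraction the ratio of raw frame times would be 3.32 / 1.22, or about 2.7, so the headline number does depend on that assumption.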

Terrain Tessellation Sample

AMD GPU MeshMapper
- New tool for generating normal, displacement, and ambient occlusion maps from hi-res and low-res mesh pairs

Advantages of the Tessellator
- Saves memory bandwidth and reduces memory footprint
- Flexible support for displacement mapping and many kinds of high order surfaces
- Easier content creation: artists and animators only need to work with low resolution geometry
- Continuous LOD avoids unnecessary triangles
- The tessellator is available now on the Xbox 360 and the latest ATI Radeon and FireGL graphics cards
- Public availability of the tessellation SDK very soon

Harnessing the Power of Multiple GPUs
Nicolas Thibieroz, AMD Graphics Products Group, nicolas.thibieroz@amd.com
Original materials from Jon Story & Holger Grün
GDC Paris 2008

Why MGPU?
- MGPUs can be used to dramatically increase performance and visual quality
  - At higher screen resolutions
  - Especially with increased use of MSAA
- Many applications become GPU limited at higher screen resolutions
  - High resolution monitors => mainstream affordability
- Achieve next generation performance on today's HW
  - Prototype your next engine
- Provides an upgrade path for mainstream parts

Multiple Boards
- An increasing number of motherboards can accept 2 or more discrete video cards
- Connected by high speed crossover cables
- Now possible to fit 4 Radeon HD3850 boards to a single motherboard
- CrossFireX technology allows you to harness that performance
[Figure: 2x and 4x board configurations]

Multiple GPUs per Board
- The Radeon HD3870 X2 is a single-board multi-GPU architecture
  - AFR is on by default
- Heavy peer to peer communication
  - Bi-directional 16x lane pipe connecting the 2 GPUs
- CrossFireX supports 2 HD3870 X2 boards for Quad GPU performance
[Figure: 2x and 4x GPU configurations]

Hybrid CrossFire
- Combination of integrated and discrete graphics
- 3D graphics performance boost
  - Laptops
  - Mainstream desktop PCs
- Uses less power during non-taxing graphical tasks

CrossFire Rendering Modes
- Split Frame Rendering / Scissor
  - Screen is divided into as many regions as there are GPUs
  - Dynamic load balancing
- Alternate Frame Rendering
  - GPUs take alternate frames
  - Vertex processing not duplicated
  - Highest performing mode

How does AFR Work?
[Diagram: the CPU issues commands that are consumed alternately by GPU0 (Frame N) and GPU1 (Frame N+1)]

Hardware Considerations
- Current MGPU setups are not shared memory architectures
  - Resources placed in local video memory are duplicated for each GPU
- Driver initiates peer to peer (P2P) copies to keep resources in sync
  - On some chipsets this may involve the CPU
  - Synchronizes all GPUs
  - Very heavy impact on performance that can even result in negative scaling

Driver Modes
- Compatible AFR Mode
  - Default mode
  - Driver checks for AFR unfriendly behaviour
  - Will P2P copy stale resources
- Full AFR Mode (Application Profile)
  - Driver recognises EXE name
  - Use a unique name and don't change it
  - Behaviour fully guided by profile
  - Best performance: no checking
  - Rename EXE to AFR-FriendlyD3D.exe
  - Use AFR-FriendlyOGL.exe for OpenGL
  - No checking: speed & compatibility test

Detecting the Number of GPUs
- Visit http://ati.amd.com/developer
  - Download the project called CrossFire Detect
- Statically link to:
  - atimgpud_s_x86.lib (32 bit version)
  - atimgpud_s_x64.lib (64 bit version)
- Include header file:
  - atimgpud.h
- Call this function:
  - INT count = AtiMultiGPUAdapters();

Common Pitfalls & Solutions

Pitfall: Dependencies Between Frames
[Diagram: timelines for GPU0 (Frame N) and GPU1 (Frame N+1), each with "Update resource A", "Draw using A", and "Present"; resource A carries over from Present (N-1), and a P2P copy from GPU0 to GPU1 is required]

Solution: Resources that Change Every Frame
- There are no P2P copies if one always modifies the resource before using it within a frame!
[Diagram: on both GPU0 (Frame N) and GPU1 (Frame N+1), "Update resource A" comes before "Draw using A" and "Present"]

Solution: Resources that Change Every Few Frames
- Repeat the modification for N frames, where N is the number of GPUs, to ensure that each GPU has the same data!
- No P2P copies will happen!
[Diagram: "Update resource A" is issued before "Draw using A" in frames N and N+1 (one per GPU); frames N+2, N+3, and N+4 only draw using A]
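The bookkeeping for "repeat the modification once per GPU" can be sketched with a small countdown. The class and method names are hypothetical; the GPU count is assumed to come from a call such as AtiMultiGPUAdapters(), as described on the detection slide.

```cpp
#include <cassert>

// Tracks how many more frames an update must be re-issued so that every
// GPU in an AFR setup has executed it once on its own copy of the resource.
class AfrResourceUpdater {
public:
    explicit AfrResourceUpdater(int gpuCount)
        : gpuCount_(gpuCount), framesLeft_(0)
    {
        assert(gpuCount >= 1);
    }

    // Call when the application changes the resource contents.
    void MarkDirty() { framesLeft_ = gpuCount_; }

    // Call once per frame, before drawing with the resource. Returns true
    // when the update commands must be issued again this frame.
    bool ShouldUpdateThisFrame()
    {
        if (framesLeft_ == 0) {
            return false;
        }
        --framesLeft_;
        return true;
    }

private:
    int gpuCount_;
    int framesLeft_;
};
```

With two GPUs, a single MarkDirty() makes the next two frames re-issue the update, so each GPU refreshes its own copy and the driver has no stale data to synchronize.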

Pitfalls: In DX10 there are Other Ways to Update Resources...
- Drawing to vertex/index buffers
- Stream Out
- CopyResource() calls
- CopySubresourceRegion() calls
- GenerateMips() calls
- ResolveSubresource() calls

Pitfall: Waiting on Queries
- Waiting starves GPU queues
- Waiting limits parallelism
- Waiting => CPU limitation
[Diagram: the CPU stalls with "Waiting for Query Result!!!" while commands for GPU0 (Frame N) and GPU1 (Frame N+1) are held up]

Solution: Queries
- Avoid using queries whenever possible
  - For occlusion queries consider a CPU-based approach
- Avoid waiting on query results
  - Pick up the result of a query at least N-GPU frames after it was issued
- For queries issued every frame
  - Create additional query objects for each GPU
  - Cycle through them
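The "cycle through them" advice can be sketched as a ring of query slots, one per GPU. Each frame the application first reads the slot it is about to reuse (issued one full cycle earlier, so the result should be available without stalling) and then issues a new query into it. QuerySlot is a stand-in for a real API query object, and every name here is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for an API query object such as an occlusion query.
struct QuerySlot {
    bool issued = false;
    long frameIssued = -1;
};

class QueryRing {
public:
    explicit QueryRing(int gpuCount)
        : slots_(static_cast<std::size_t>(gpuCount))
    {
        assert(gpuCount >= 1);
    }

    // Slot whose result should be collected this frame, or -1 if that slot
    // has not been issued yet. Call before IssueSlot() for the same frame.
    int SlotToRead(long frame) const
    {
        const std::size_t idx = IndexFor(frame);
        return slots_[idx].issued ? static_cast<int>(idx) : -1;
    }

    // Marks this frame's slot as issued and returns its index.
    int IssueSlot(long frame)
    {
        const std::size_t idx = IndexFor(frame);
        slots_[idx].issued = true;
        slots_[idx].frameIssued = frame;
        return static_cast<int>(idx);
    }

    long FrameIssued(int slot) const
    {
        return slots_[static_cast<std::size_t>(slot)].frameIssued;
    }

private:
    std::size_t IndexFor(long frame) const
    {
        assert(frame >= 0);
        return static_cast<std::size_t>(frame) % slots_.size();
    }

    std::vector<QuerySlot> slots_;
};
```

With two GPUs, the result read in frame 2 belongs to the query issued in frame 0, which matches the guidance to pick up results at least N-GPU frames after issue.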

Pitfall: CPU Access to a Renderable Resource
- When the CPU locks a renderable resource it must wait for all GPUs to finish using the resource before acquiring the pointer
- All GPUs now have to wait until the CPU unlocks the resource pointer
- After the unlock the driver has to update the resource on each GPU via P2P copies
- Just don't do this: it destroys performance even on a single GPU setup, and is catastrophic for MGPUs

Solutions: Locks / Maps
- In DX10 stream to and copy from STAGING textures
- In DX9 StretchRect() is always better than Lock()
- At resource creation time use the appropriate flags from:
  - D3D10_USAGE
  - D3D10_CPU_ACCESS_FLAG
- In DX9 never lock static Vertex/Index Buffers because it will cause P2P copies

Concluding Pitfalls & Solutions
- Drivers take a conservative approach
  - Perform checks on resource synchronization
  - P2P copy if necessary
- You know the application best
  - Determine if a P2P copy is necessary
  - Talk to us about a profile

AFR-Friendly SDK Sample
- Part of the ATI developer SDK
  - http://ati.amd.com/developer
- Detects the number of GPUs
- Correctly deals with textures used as render targets
- Provides a solution for dealing with mouse cursor lag
- Go and take a look!

Call to Action
- MGPUs provide demonstrable performance gains
- MGPUs boost visual quality
- Plan from day one to make your rendering scale
- Detect the number of GPUs
- Regularly check for AFR unfriendly behavior
- Talk to us...

Questions?
nicolas.thibieroz@amd.com
