Technology Behind AMD’s “Leo Demo” Jay McKee MTS Engineer, AMD
Why Forward Rendering? Complex materials Multiple light types Supports hardware anti-aliasing Efficient memory usage Supports transparency BUT, previously could not support a large number of lights
Forward+ Rendering Modified forward renderer. Add compute shader for light culling. Modify main light loop. Lighting and shading are done in the same place, so all information is preserved.
Forward+ Rendering (continued) No limits on parameters for lights and materials Omni Spot Cinematic (arbitrary falloffs, barndoor) BRDF per material instance Simple design, concentrate on rendering, not engine maintenance.
Important DX11 features Compute Shaders UAV support.
Compute Shaders In the Leo demo we use two compute shaders: one for culling lights, another for spawning Virtual Point Lights (VPLs) for indirect lighting. Culling 3,072 lights takes 1.7 ms on a high-end GPU.
UAVs Array(s) of scene light information. Array of u32 light indices for storing start/end lights per-tile. Array of material instance data
Algorithm summary Depth Pre-Pass Light Culling Screen divided into tiles. Launch compute shader per tile. Light info such as position, radius, direction, length passed to light culling compute shader. Light culling shader projects light bounds to screen-space tiles. Uses scene depth from z pre-pass for z testing against light volumes. Outputs to UAV describing per-tile light list start/end along with a large UAV holding a u32 array of light indices. Output UAVs are passed to main light shaders for looping through lights per-pixel.
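The culling step above can be sketched on the CPU as a reference. This is a minimal illustration, not the demo's shader: it assumes each light has already been projected to a screen-space circle with a view-space depth extent, and that per-tile depth bounds come from the depth pre-pass. `ScreenLight`, `TileLists`, `overlapsTile` and `cullLights` are hypothetical names introduced here.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical, simplified light: a screen-space circle plus a view-space depth extent.
struct ScreenLight {
    float x, y;         // projected center in pixels
    float radius;       // projected radius in pixels
    float zNear, zFar;  // view-space depth extent of the light volume
};

struct TileLists {
    std::vector<uint32_t> start;    // per tile: first entry in indices
    std::vector<uint32_t> end;      // per tile: one past the last entry
    std::vector<uint32_t> indices;  // concatenated light indices for all tiles
};

// Circle vs. axis-aligned rectangle overlap test (closest point on the rectangle).
inline bool overlapsTile(const ScreenLight& l, float x0, float y0, float x1, float y1) {
    float cx = std::max(x0, std::min(l.x, x1));
    float cy = std::max(y0, std::min(l.y, y1));
    float dx = l.x - cx, dy = l.y - cy;
    return dx * dx + dy * dy <= l.radius * l.radius;
}

// tileMinZ / tileMaxZ: per-tile depth bounds taken from the depth pre-pass.
TileLists cullLights(const std::vector<ScreenLight>& lights,
                     uint32_t width, uint32_t height, uint32_t tileRes,
                     const std::vector<float>& tileMinZ,
                     const std::vector<float>& tileMaxZ) {
    uint32_t tilesX = (width + tileRes - 1) / tileRes;
    uint32_t tilesY = (height + tileRes - 1) / tileRes;
    TileLists out;
    out.start.resize(tilesX * tilesY);
    out.end.resize(tilesX * tilesY);
    for (uint32_t ty = 0; ty < tilesY; ++ty) {
        for (uint32_t tx = 0; tx < tilesX; ++tx) {
            uint32_t tile = ty * tilesX + tx;
            float x0 = float(tx * tileRes), y0 = float(ty * tileRes);
            float x1 = x0 + float(tileRes), y1 = y0 + float(tileRes);
            out.start[tile] = uint32_t(out.indices.size());
            for (uint32_t i = 0; i < lights.size(); ++i) {
                const ScreenLight& l = lights[i];
                // Depth test against the tile's min/max depth, then the 2D overlap test.
                bool depthOverlap = l.zNear <= tileMaxZ[tile] && l.zFar >= tileMinZ[tile];
                if (depthOverlap && overlapsTile(l, x0, y0, x1, y1))
                    out.indices.push_back(i);
            }
            out.end[tile] = uint32_t(out.indices.size());
        }
    }
    return out;
}
```

On the GPU the outer loops become one work group per tile and the inner loop is spread across the group's threads; the serial version only shows what ends up in the start/end and index arrays.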
Algorithm summary continued Render scene materials Base light accumulation function Use screen x, y location to determine tileID From tileID, get light start and end indices From start index to end index, loop Entry is index into light array. Accumulate light hitting pixel Returns total direct and indirect light hitting pixel.
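The per-pixel loop described above can be sketched as follows. It is a CPU-side illustration under stated assumptions: `PointLight` with a linear distance falloff is a stand-in for the demo's real light types, and `tileIndexAt` and `accumulateLight` are hypothetical names. Only the tile lookup and the start/end loop mirror the slide.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical point light with a linear falloff; real lights carry more parameters.
struct PointLight { float x, y, z; float radius; float intensity; };

// Screen x, y location -> tile ID (row-major over tiles).
inline uint32_t tileIndexAt(uint32_t px, uint32_t py, uint32_t width, uint32_t tileRes) {
    uint32_t tilesX = (width + tileRes - 1) / tileRes;
    return (py / tileRes) * tilesX + (px / tileRes);
}

// Accumulates scalar light arriving at world position (wx, wy, wz) for pixel (px, py).
float accumulateLight(uint32_t px, uint32_t py, float wx, float wy, float wz,
                      uint32_t width, uint32_t tileRes,
                      const std::vector<PointLight>& lights,
                      const std::vector<uint32_t>& tileStart,
                      const std::vector<uint32_t>& tileEnd,
                      const std::vector<uint32_t>& lightIndices) {
    uint32_t tile = tileIndexAt(px, py, width, tileRes);
    float total = 0.0f;
    // Loop from the tile's start index to its end index; each entry indexes the light array.
    for (uint32_t k = tileStart[tile]; k < tileEnd[tile]; ++k) {
        const PointLight& l = lights[lightIndices[k]];
        float dx = l.x - wx, dy = l.y - wy, dz = l.z - wz;
        float dist = std::sqrt(dx * dx + dy * dy + dz * dz);
        float falloff = std::max(0.0f, 1.0f - dist / l.radius);  // assumed falloff model
        total += l.intensity * falloff;
    }
    return total;
}
```

Only lights listed for the pixel's tile are visited, which is what keeps the cost per pixel bounded even when the scene holds thousands of lights.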
Algorithm summary continued Material shader Decides what to do with total incoming light Passed into material’s BRDF for example Uses light accumulation building blocks Env. lighting, base light accumulation, BRDF, etc. are put together for final pixel color.
Light Culling Shader Details (1/3)
// 1. prepare
float4 frustum[4];
float minZ, maxZ;
{
    ConstructFrustum( frustum );
    minZ = thread_REDUCE(MIN, depth );
    maxZ = thread_REDUCE(MAX, depth );
    ldsMinZ = SIMD_REDUCE(MIN, minZ );
    ldsMaxZ = SIMD_REDUCE(MAX, maxZ );
    minZ = ldsMinZ;
    maxZ = ldsMaxZ;
}
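The min/max reduction in the prepare step amounts to finding the depth range of the pixels inside one tile. A serial CPU sketch of that result, assuming a row-major depth buffer, might look like this; `DepthBounds` and `tileDepthBounds` are hypothetical names, and the thread-level and SIMD-level reductions of the shader are collapsed into a plain loop.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

struct DepthBounds { float minZ, maxZ; };

// depth: row-major depth buffer of size width * height (from the depth pre-pass).
DepthBounds tileDepthBounds(const std::vector<float>& depth,
                            uint32_t width, uint32_t height,
                            uint32_t tileX, uint32_t tileY, uint32_t tileRes) {
    DepthBounds b{ std::numeric_limits<float>::max(),
                   std::numeric_limits<float>::lowest() };
    uint32_t x0 = tileX * tileRes, y0 = tileY * tileRes;
    // Clamp so partial tiles at the right and bottom edges stay inside the buffer.
    uint32_t x1 = std::min(x0 + tileRes, width);
    uint32_t y1 = std::min(y0 + tileRes, height);
    for (uint32_t y = y0; y < y1; ++y) {
        for (uint32_t x = x0; x < x1; ++x) {
            float z = depth[y * width + x];
            b.minZ = std::min(b.minZ, z);
            b.maxZ = std::max(b.maxZ, z);
        }
    }
    return b;
}
```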
Light Culling Shader Details (2/3)
__local u32 ldsNLights = 0;
__local u32 ldsLightBuffer[MAX];
// 2. overlap check, accumulate in LDS
for(int i=threadIdx; i<nLights; i+=WG_SIZE)
{
    Light light = fetchAndTransform( lightBuffer[ i ] );
    if( overlaps( light, frustum ) && overlaps( light, minZ, maxZ ) )
        AtomicAppend( ldsLightBuffer, i );
}
Light Culling Shader Details (3/3)
// 3. export to global
__local u32 ldsOffset;
if( threadIdx == 0 )
{
    // atomic add reserves ldsNLights entries in the global index buffer and yields the start offset
    ldsOffset = AtomAdd( ldsNLights );
    globalLightStart[tileIdx] = ldsOffset;
    globalLightEnd[tileIdx] = ldsOffset + ldsNLights;
}
for(int i=threadIdx; i<ldsNLights; i+=WG_SIZE)
{
    int dstIdx = ldsOffset + i;
    globalLightIndexBuffer[dstIdx] = ldsLightBuffer[i];
}
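The export step can be mimicked on the CPU with `std::atomic`. This is a sketch under assumptions, not the shader itself: it assumes the atomic add targets a single global counter shared by all tiles, and `GlobalLightLists` and `exportTile` are hypothetical names. `fetch_add` returns the counter's previous value, which becomes the tile's start offset.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct GlobalLightLists {
    std::atomic<uint32_t> counter{0};            // next free slot in indices
    std::vector<uint32_t> start, end, indices;   // per-tile ranges and the shared index list
    GlobalLightLists(std::size_t numTiles, std::size_t capacity)
        : start(numTiles), end(numTiles), indices(capacity) {}
};

// Copies one tile's local light list into the shared buffer.
void exportTile(GlobalLightLists& g, uint32_t tileIdx,
                const std::vector<uint32_t>& tileLights) {
    uint32_t n = uint32_t(tileLights.size());
    // Reserve n contiguous slots; the old counter value is this tile's offset.
    uint32_t offset = g.counter.fetch_add(n);
    assert(offset + n <= g.indices.size());  // the index buffer must be sized up front
    g.start[tileIdx] = offset;
    g.end[tileIdx] = offset + n;
    for (uint32_t i = 0; i < n; ++i)
        g.indices[offset + i] = tileLights[i];
}
```

Because each tile reserves its range atomically, tiles can export in any order and their ranges never overlap; the order of ranges in the shared buffer simply follows the order in which tiles reach the reservation.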
Light Accumulation Pseudo-code
// BaseLighting.inc
// THIS INC FILE IS ALL THE COMMON LIGHTING CODE
StructuredBuffer<float4> LightParams : register(u0);
StructuredBuffer<uint> LowerBoundLights : register(u1);
StructuredBuffer<uint> UpperBoundLights : register(u2);
StructuredBuffer<uint> LightIndexBuffer : register(u3); // one u32 light index per entry
uint GetTileIndex(float2 screenPos)
{
    float tileRes = (float)m_tileRes;
    uint numCellsX = (m_width + m_tileRes - 1)/m_tileRes;
    uint tileIdx = floor(screenPos.x/tileRes)+floor(screenPos.y/tileRes)*numCellsX;
    return tileIdx;
}
Light Accumulation (2):
StartHLSL BaseLightLoopBegin
// THIS IS A MACRO, INCLUDED IN MATERIAL SHADERS
uint tileIdx = GetTileIndex( pixelScreenPos );
uint startIdx = LowerBoundLights[tileIdx];
uint endIdx = UpperBoundLights[tileIdx];
[loop]
for ( uint lightListIdx = startIdx; lightListIdx < endIdx; lightListIdx++ )
{
    int lightIdx = LightIndexBuffer[lightListIdx];
    // Set common light parameters
    float ndotl = max(0, dot(normal, lightVec));
    float3 directLight = 0;
    float3 indirectLight = 0;
Light Accumulation (3):
    if( lightIdx >= numDirectLightsThisFrame )
    {
        CalculateIndirectLight(lightIdx, indirectLight);
    }
    else
    {
        if( IsConeLight( lightIdx ) ) // <<== Can add more light types here
        {
            CalculateDirectSpotlight(lightIdx, directLight);
        }
        else
        {
            CalculateDirectSpherelight(lightIdx, directLight);
        }
    }
    float3 incomingLight = (directLight + indirectLight)*ndotl;
    float shadowTerm = CalcShadow();
EndHLSL
StartHLSL BaseLightLoopEnd
Material Shader Template:
#include "BaseLighting.inc"
float4 PS ( PSInput i ) : SV_TARGET
{
    float3 totalDiffuse = 0;
    float3 totalSpec = GetEnvLighting();
    $include BaseLightLoopBegin
    // unique material code goes here!! Light accumulation on the pixel for a given light
    // we have total incoming light and direct/indirect light components as well as material params and shadow term
    // use these building blocks to integrate lighting terms
    totalDiffuse += GetDiffuse(incomingLight);
    totalSpec += CalcPhong(incomingLight);
    $include BaseLightLoopEnd
    float3 finalColor = totalDiffuse + totalSpec;
    return float4( finalColor, 1 );
}
Debug Mode Demo
Benchmark 3k dynamic lights
Compute-based Deferred vs. Forward+ Takahiro Harada, Jay McKee, Jason C. Yang, Forward+: Bringing Deferred Lighting to the Next Level, Eurographics Short Paper (2012)
Depth Pre-Pass Critical. Pixel overdraw cripples this technique, so a depth pre-pass is required. The depth pre-pass is a good opportunity to use MRT to generate other full-screen data needed for post-fx and other render fx (optional).
Other important points Xbox 360 has good bandwidth, so given the limitations on forward rendering, deferred makes a lot of sense. However, ALU computation is growing at a faster rate than bandwidth. It is more and more feasible to just do the calculations than to read/write so much data. Dynamic branching penalties are not nearly as bad as before. As an optimization, the compute shader can sort by light type, for example, to minimize penalties. All that "light management" CPU-side code to decide which lights hit each object for setting constant registers can be ditched!
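The sort-by-light-type optimization mentioned above can be illustrated on the CPU. This is only a sketch of the idea, assuming a per-light type tag is available: `LightType` and `sortTileByType` are hypothetical names, and in the demo the sort would run inside the culling compute shader rather than through `std::stable_sort`.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical type tag per light; the enum order defines the sort order.
enum class LightType : uint8_t { Sphere = 0, Cone = 1, Indirect = 2 };

// Reorders one tile's slice [start, end) of the light index list so that lights
// of the same type are contiguous. Neighbouring pixels then take the same
// branch for long runs of the light loop.
void sortTileByType(std::vector<uint32_t>& indices, uint32_t start, uint32_t end,
                    const std::vector<LightType>& types) {
    std::stable_sort(indices.begin() + start, indices.begin() + end,
                     [&types](uint32_t a, uint32_t b) { return types[a] < types[b]; });
}
```

For a tile whose list is {0, 1, 2, 3} with types Cone, Sphere, Cone, Sphere, the sorted list is {1, 3, 0, 2}: both sphere lights first, then both cone lights, with the original order kept within each type.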
Summary Modified forward renderer that handles scenes with 1000s of lights. Hardware anti-aliasing (MSAA) is “automatic”. Bandwidth friendly. Makes the most of the GPU's ALU power (which is growing faster than bandwidth).
Thanks! Contact: Takahiro.Harada@amd.com jay.mckee@amd.com jasonc.yang@amd.com Leo Demo website: http://developer.amd.com/samples/demos/pages/AMDRadeonHD7900SeriesGraphicsReal-TimeDemos.aspx Eurographics 2012: 'Forward+: Bringing Deferred Lighting to the Next Level'