DirectCompute Accelerated Separable Filtering. AMD's Favorite Effects, 28th February 2011.

Presentation transcript:

DirectCompute Accelerated Separable Filtering
AMD's Favorite Effects, 28th February 2011

Separable Filters
Much faster than executing a box filter
Classically performed by the Pixel Shader
Consists of a horizontal and vertical pass
Source image over-sampling increases with kernel size
  – Shader is usually TEX instruction limited
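As a rough illustration of the two-pass idea, here is a minimal CPU-side sketch in Python (not the presenters' shader code). It assumes a symmetric 1D kernel and clamped border reads; the function names are hypothetical. It filters rows into an intermediate image, then filters columns, and compares the result with applying the equivalent full 2D kernel directly.

```python
def clamp(value, low, high):
    return max(low, min(value, high))


def filter_rows(image, kernel):
    """Apply a 1D kernel along each row, clamping reads at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [
        [
            sum(w * row[clamp(x + k - radius, 0, width - 1)]
                for k, w in enumerate(kernel))
            for x in range(width)
        ]
        for row in image
    ]


def transpose(image):
    return [list(column) for column in zip(*image)]


def separable_filter(image, kernel):
    """Horizontal pass into an intermediate image, then a vertical pass."""
    intermediate = filter_rows(image, kernel)
    return transpose(filter_rows(transpose(intermediate), kernel))


def full_2d_filter(image, kernel):
    """Reference: apply the outer-product 2D kernel in a single pass."""
    radius = len(kernel) // 2
    height, width = len(image), len(image[0])
    return [
        [
            sum(wy * wx
                * image[clamp(y + j - radius, 0, height - 1)]
                       [clamp(x + i - radius, 0, width - 1)]
                for j, wy in enumerate(kernel)
                for i, wx in enumerate(kernel))
            for x in range(width)
        ]
        for y in range(height)
    ]


def taps_per_pixel(diameter):
    """Samples read per output pixel: (two 1D passes, one full 2D pass)."""
    return 2 * diameter, diameter * diameter
```

For a kernel diameter of 9 this is 18 samples per pixel instead of 81, which is why the single-pass version becomes TEX limited so quickly as the kernel grows.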

Separable? – Who Cares
In many cases developers use this technique even though the filter may not actually be separable
  – Results are often still acceptable
  – Much faster than performing a real box filter
  – Accelerates many bilateral cases

Typical Pipeline Steps
Source RT → Horizontal Pass → Intermediate RT → Vertical Pass → Destination RT

Use Bilinear HW filtering?
Bilinear filter HW can halve the number of ALU and TEX instructions
  – Just need to compute the correct sampling offsets
Not possible with more advanced filters
  – Usually because weighting is a dynamic operation
  – Think about bilateral cases...
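The slide does not spell out how the sampling offsets are computed. A common approach, assumed here, merges each pair of adjacent taps into one fetch whose weight is the sum of the pair and whose offset is their weighted average. The Python sketch below emulates linear filtering in 1D to show that the merged fetches reproduce the original weighted sum; all names are hypothetical, and it only holds for fixed weights of the same sign.

```python
import math


def merge_tap_pairs(offsets, weights):
    """Merge consecutive tap pairs into single linearly filtered fetches."""
    merged = []
    for i in range(0, len(weights) - 1, 2):
        weight = weights[i] + weights[i + 1]
        # Weighted average of the two texel offsets.
        offset = (offsets[i] * weights[i] + offsets[i + 1] * weights[i + 1]) / weight
        merged.append((offset, weight))
    if len(weights) % 2 == 1:
        # An odd tap count leaves one unpaired tap.
        merged.append((offsets[-1], weights[-1]))
    return merged


def sample_linear(signal, position):
    """Emulate hardware linear filtering of a 1D signal (clamped)."""
    position = max(0.0, min(float(position), len(signal) - 1.0))
    i0 = int(math.floor(position))
    i1 = min(i0 + 1, len(signal) - 1)
    t = position - i0
    return signal[i0] * (1.0 - t) + signal[i1] * t


def filter_direct(signal, x, offsets, weights):
    return sum(w * signal[x + o] for o, w in zip(offsets, weights))


def filter_merged(signal, x, merged):
    return sum(w * sample_linear(signal, x + o) for o, w in merged)
```

With a 5-tap kernel this gives 3 fetches instead of 5. Once the weights depend on the sampled data, as in the bilateral case, the offsets can no longer be precomputed, which is the limitation the slide points at.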

Where to start with DirectCompute
Is the Pixel Shader version TEX or ALU limited?
  – You need to know what to optimize for!
  – Use IHV tools to establish this
Achieving peak performance is not easy, so write a highly configurable kernel
  – Will allow you to easily experiment and fine tune

Thread Group Shared Memory (TGSM)
TGSM can be used to reduce TEX ops
TGSM can also be used to cache results
  – Thus saving ALU ops too
Load a sensible run length, based on the HW wavefront/warp size (AMD = 64, NVIDIA = 32)
  – Choose a good common factor (multiples of 64)

Kernel #1
128 threads load 128 texels
128 – (Kernel Radius * 2) threads compute results
Drawback: redundant compute threads

Avoid Redundant Threads
Should ensure that all threads in a group have useful work to do, wherever possible
Redundant threads will not be reassigned work from another group
This would involve a lot of redundancy for a large kernel diameter
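To put a number on the redundancy, the small Python sketch below applies the Kernel #1 arithmetic (group size minus twice the kernel radius). The function name is hypothetical; only the formula comes from the slides.

```python
def kernel1_utilization(group_size, kernel_radius):
    """Return (compute threads, fraction of the group doing useful work)."""
    compute_threads = group_size - 2 * kernel_radius
    if compute_threads <= 0:
        raise ValueError("kernel radius too large for this group size")
    return compute_threads, compute_threads / group_size
```

With 128 threads and a radius of 16, only 96 threads (75%) compute results; the remaining 32 idle through the compute phase.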

Kernel #2
128 threads load 128 texels
Kernel Radius * 2 threads load 1 extra texel each
128 threads compute results
No redundant compute threads
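The load pattern of Kernel #2 can be simulated on the CPU. The Python sketch below is an assumption about the indexing, not the presenters' HLSL: a list stands in for TGSM, the loops stand in for threads, and the group barrier is only a comment. Every thread loads one texel, the first Kernel Radius * 2 threads load one extra texel, and then every thread computes one result.

```python
def kernel2_filter_line(line, group_start, kernel, group_size=128):
    """Simulate one thread group filtering `group_size` pixels of one line.

    Returns (results, texel_fetches). Assumes 2 * radius <= group_size.
    """
    radius = len(kernel) // 2
    shared = [0.0] * (group_size + 2 * radius)  # stands in for TGSM
    fetches = 0

    def fetch(x):
        # Clamped read from the source line (stands in for a texture load).
        return line[min(max(x, 0), len(line) - 1)]

    # Load phase: every thread loads one texel; the first 2 * radius
    # threads also load one extra texel to cover the right-hand apron.
    for tid in range(group_size):
        shared[tid] = fetch(group_start - radius + tid)
        fetches += 1
        if tid < 2 * radius:
            shared[group_size + tid] = fetch(group_start - radius + group_size + tid)
            fetches += 1

    # A GPU kernel would need a group-wide barrier here.

    # Compute phase: all threads produce a result, none are redundant.
    results = [
        sum(kernel[k] * shared[tid + k] for k in range(len(kernel)))
        for tid in range(group_size)
    ]
    return results, fetches
```

For a 9-tap kernel the group issues 136 fetches for 128 results, where a pixel shader reading every tap itself would issue 128 * 9 = 1152.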

Multiple Pixels per Thread
Allows for natural vectorization
  – 4 works well on AMD HW
  – Doesn't hurt performance on scalar HW
Possible to cache TGSM reads in General Purpose Registers (GPRs)
  – Quartering TGSM reads: absolute winner!!
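One plausible reading of the register caching point, sketched in Python under that assumption: a thread that computes 4 adjacent pixels reads the shared window once into local variables and reuses the overlapping values, so it issues diameter + 3 reads instead of 4 * diameter. The function and parameter names are hypothetical.

```python
def filter_pixels_from_registers(shared, first, kernel, pixels_per_thread=4):
    """Compute adjacent results from one cached window of shared memory.

    Returns (results, shared_reads).
    """
    diameter = len(kernel)
    window = diameter + pixels_per_thread - 1
    # Each shared value is read exactly once into "registers".
    registers = [shared[first + i] for i in range(window)]
    results = [
        sum(kernel[k] * registers[p + k] for k in range(diameter))
        for p in range(pixels_per_thread)
    ]
    return results, window


def naive_shared_reads(diameter, pixels_per_thread=4):
    """Reads needed if every pixel re-read its whole neighbourhood."""
    return diameter * pixels_per_thread
```

For a 17-tap kernel that is 20 reads instead of 68, and the ratio approaches one quarter as the kernel diameter grows.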

Kernel #3
32 threads load 128 texels
Kernel Radius * 2 threads load 1 extra texel each
32 threads compute 128 results
Drawback: compute threads not a multiple of 64

Multiple Lines per Thread Group
Process multiple lines per thread group
  – Better than one long line
  – 2 or 4 works well
Improved texture cache efficiency
Compute threads back to a multiple of 64

Kernel #4
64 threads load 256 texels
Kernel Radius * 4 threads load 1 extra texel each
64 threads compute 256 results

Kernel Diameter
Kernel diameter needs to be > 7 to see a DirectCompute win
  – Otherwise the overhead cancels out the advantage
The larger the kernel diameter, the greater the win

Use Packing in TGSM
Use packing to reduce storage space required in TGSM
  – Only have 32 KB per SIMD
Reduces reads/writes from TGSM
Often a uint is sufficient for color filtering
Use SM5.0 instructions f32tof16(), f16tof32()
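The packing idea can be tried out on the CPU. The Python sketch below mimics the HLSL intrinsics with the standard `struct` module's half-precision format and stores two 16-bit floats in one 32-bit unsigned value. The pair layout (first value in the low 16 bits) is an assumption for illustration, not something the slides specify.

```python
import struct


def f32tof16(value):
    """Half-precision bit pattern of `value`, in the low 16 bits of an int."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def f16tof32(bits):
    """Convert the low 16 bits of `bits` from half precision back to a float."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]


def pack_pair(a, b):
    """Pack two floats into one 32-bit value: `a` low half, `b` high half."""
    return f32tof16(a) | (f32tof16(b) << 16)


def unpack_pair(packed):
    return f16tof32(packed), f16tof32(packed >> 16)
```

Four colour channels then fit in two uints rather than four floats, halving both the TGSM footprint and the number of reads and writes. Note that Python's `struct` raises `OverflowError` for values outside the half-precision range, so this sketch only suits values that fit.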

High Definition Ambient Occlusion
Depth + Normals → HDAO buffer
Original Scene * HDAO buffer = Final Scene

Perform at Half Resolution
HDAO at full resolution is expensive
Running at half resolution captures more occlusion, and is obviously much faster
Problem: Artifacts are introduced when combined with the full resolution scene

Bilateral Dilate & Blur
HDAO buffer doesn't match with scene
A bilateral dilate & blur fixes the issue
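The slides do not give the bilateral weighting, so the Python sketch below uses one simple assumption: a neighbouring sample contributes only if its depth is within a threshold of the centre pixel's depth, and the accepted weights are renormalised. It covers a single 1D blur pass and leaves out the dilate step; all names are hypothetical.

```python
def bilateral_blur_line(ao, depth, kernel, depth_threshold):
    """Blur one line of AO values without bleeding across depth edges.

    Assumes the centre tap of `kernel` has a positive weight.
    """
    radius = len(kernel) // 2
    count = len(ao)
    result = []
    for x in range(count):
        total = 0.0
        weight_sum = 0.0
        for k, weight in enumerate(kernel):
            sx = min(max(x + k - radius, 0), count - 1)
            # Dynamic weighting: reject samples from a different surface.
            if abs(depth[sx] - depth[x]) <= depth_threshold:
                total += weight * ao[sx]
                weight_sum += weight
        # The centre tap always passes the depth test, so weight_sum > 0.
        result.append(total / weight_sum)
    return result
```

Because the weights depend on the sampled depth, they cannot be folded into precomputed bilinear offsets, but the pass still maps onto the same TGSM kernels as a plain separable blur.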

New Pipeline...
Diagram labels: ½ Res, Horizontal Pass, Intermediate UAV, Vertical Pass, Dilated & Blurred, Bilinear Upsample
Still much faster than performing at full res!

Pixel Shader vs DirectCompute
*Tested on a range of AMD and NVIDIA DX11 HW, DirectCompute is between ~2.53x and ~3.17x faster than the Pixel Shader

Depth of Field
Many techniques exist to solve this problem
A common technique is to figure out how blurry a pixel should be
  – Often called the Circle of Confusion (CoC)
A Gaussian blur weighted by CoC is a pretty efficient way to implement this effect
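How the CoC enters the weights is not stated on the slide. The Python sketch below assumes one simple scheme: each neighbouring tap's Gaussian weight is scaled by the centre pixel's CoC in [0, 1], so in-focus pixels stay sharp and fully out-of-focus pixels receive the whole Gaussian. The names and the weighting are illustrative assumptions, not the shipped Shogun 2 implementation.

```python
import math


def gaussian_weights(radius, sigma):
    """Unnormalised Gaussian weights for offsets -radius..radius."""
    return [math.exp(-(i * i) / (2.0 * sigma * sigma))
            for i in range(-radius, radius + 1)]


def coc_weighted_blur_line(color, coc, radius, sigma):
    """One 1D pass of a Gaussian blur scaled by each pixel's CoC."""
    weights = gaussian_weights(radius, sigma)
    count = len(color)
    result = []
    for x in range(count):
        total = 0.0
        weight_sum = 0.0
        for k, gauss in enumerate(weights):
            offset = k - radius
            sx = min(max(x + offset, 0), count - 1)
            # The centre tap keeps full weight; neighbours scale with CoC.
            weight = gauss if offset == 0 else gauss * coc[x]
            total += weight * color[sx]
            weight_sum += weight
        result.append(total / weight_sum)
    return result
```

Running the same function over rows and then columns gives the horizontal and vertical passes of the pipeline, although with per-pixel weights the result is only approximately separable.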

The Pipeline...
Diagram labels: CoC, Horizontal Pass, Intermediate UAV, Vertical Pass

Shogun 2: DoF OFF

Shogun 2: DoF ON

Pixel Shader vs DirectCompute
*Tested on a range of AMD and NVIDIA DX11 HW, DirectCompute is between ~1.48x and ~1.86x faster than the Pixel Shader

Summary
DirectCompute greatly accelerates larger kernel diameter filters
Allows for filtering at full resolution
For access to source code:
  – HDAO11:
  – DoF11:

Questions?
Please fill in the feedback forms!