Fragment-Parallel Composite and Filter Anjul Patney, Stanley Tzeng, and John D. Owens University of California, Davis.

Presentation transcript:

Fragment-Parallel Composite and Filter Anjul Patney, Stanley Tzeng, and John D. Owens University of California, Davis

Parallelism in Interactive Graphics
Well-expressed in hardware as well as APIs
Consistently growing in degree & expression
  – More and more cores on upcoming GPUs
  – From programmable shaders to pipelines
We should rethink algorithms to exploit this
This paper provides one example
  – Parallelization of composite/filter stages

A Feed-Forward Rendering Pipeline
Primitives → Geometry Processing → Rasterization → Composite → Filter → Pixels

Composite & Filter
Input
  – Unordered list of fragments
Output
  – Pixel colors
Assumption
  – No fragments are discarded
[Figure labels: Pixel, Sample Locations]

Basic Idea
[Figure labels: Insufficient parallelism, Irregularity]

Pixel-Parallel Processors

Basic Idea
[Figure labels: Insufficient parallelism, Irregularity, Fragment-Parallel Processors]

Motivation
Most applications have low depth complexity
  – Pixel-level parallelism is sufficient
We are interested in applications with
  – Very high depth complexity
  – High variation in depth complexity
Further
  – Future platforms will demand more parallelism
  – High depth complexity can limit pixel-parallelism

Motivation

Related Work
Order-Independent Transparency (OIT)
Depth-Peeling [Everitt 01]
  – One pass per transparent layer
Stencil-Routed A-buffer [Myers & Bavoil 07]
  – One pass per 8 depth layers (1)
Bucket Depth-Peeling [Liu et al. 09]
  – One pass per up to 32 layers (2)
(1) Maximum MSAA samples per pixel
(2) Maximum render targets

Related Work
Order-Independent Transparency (OIT)
OIT using Direct3D 11 [Gruen et al. 10]
  – Use fragment linked-lists
  – Per-pixel sort and composite
Hair Self-Shadowing [Sintorn et al. 09]
  – Each fragment computes its contribution
  – Assumes constant opacity

Related Work
Programmable Rendering Pipelines
RenderAnts [Zhou et al. 09]
  – Sort fragments globally
  – Per-pixel composite/filter
FreePipe [Liu et al. 10]
  – Sort fragments globally
  – Per-pixel composite/filter

Pixel-Parallel Formulation
[Figure: pixels P_i, P_(i+1), P_(i+2) and subsamples S_j through S_(j+6), with thread IDs j through j+6. P: Pixel, S: Subsample]

Pixel-Parallel Formulation
Workload size
  – Depends on number of fragments
  – Limits the size of rendering
Degree of parallelism
  – Depends on number of pixels/subpixels
These two may not always correspond
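The mismatch between workload size and degree of parallelism can be illustrated with a toy counting model. This is only a sketch: the function name `parallelism_profile` and the example fragment counts are hypothetical, and it counts threads and serial steps rather than modeling real GPU performance.

```python
def parallelism_profile(fragments_per_subsample):
    """Count threads and serial work for a list of per-subsample fragment counts.

    Hypothetical helper for illustration only; not a performance model.
    """
    total_fragments = sum(fragments_per_subsample)
    return {
        # Pixel-parallel: one thread per subsample, each walking its own fragment list.
        "pixel_parallel_threads": len(fragments_per_subsample),
        # The busiest subsample bounds how long the pixel-parallel pass takes.
        "pixel_parallel_longest_serial_chain": max(fragments_per_subsample, default=0),
        # Fragment-parallel: one thread per fragment, so parallelism tracks the workload.
        "fragment_parallel_threads": total_fragments,
    }


# Four subsamples, one of which has very high depth complexity.
profile = parallelism_profile([1, 1, 200, 2])
print(profile)
```

In this made-up case the workload is 204 fragments, but the pixel-parallel view exposes only 4 threads, one of which does 200 steps on its own.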

Fragment-Parallel Formulation
[Figure: pixels P_i, P_(i+1), P_(i+2) and subsamples S_j through S_(j+6), with thread IDs j through j+23. P: Pixel, S: Subsample]

Fragment-Parallel Formulation
How can this behavior be achieved?
Revisit the composite equation (fragment 1, fragment 2, …, background):
$$C_s = \alpha_1 C_1 + (1-\alpha_1)\Bigl\{\alpha_2 C_2 + (1-\alpha_2)\bigl(\dots\bigl(\alpha_N C_N + (1-\alpha_N)\,C_B\bigr)\dots\bigr)\Bigr\}$$
Expanded term by term:
$$C_s = 1\cdot\alpha_1 C_1 + (1-\alpha_1)\,\alpha_2 C_2 + (1-\alpha_1)(1-\alpha_2)\,\alpha_3 C_3 + \dots + (1-\alpha_1)(1-\alpha_2)\cdots(1-\alpha_{k-1})\,\alpha_k C_k + \dots + (1-\alpha_1)(1-\alpha_2)\cdots(1-\alpha_N)\,C_B$$
In the k-th term, $\alpha_k C_k$ is the Local Contribution $L_k$ and the leading product $(1-\alpha_1)\cdots(1-\alpha_{k-1})$ is the Global Contribution $G_k$.

Fragment-Parallel Formulation
$L_k$ is trivially parallel (local computation)
$G_k$ is the result of a scan operation (product)
For the list of input fragments
  – Compute G[ ] and L[ ], multiply
  – Perform reduction to add subpixel contributions
$$C_s = G_1 L_1 + G_2 L_2 + G_3 L_3 + \dots + G_N L_N$$
$$G_k = (1-\alpha_1)(1-\alpha_2)\cdots(1-\alpha_{k-1}), \qquad L_k = \alpha_k C_k$$
(The background enters the sum as one more term, $G_{N+1}\,C_B$.)
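As a check on the scan formulation, the following sketch compares the nested composite with the sum of G_k·L_k for a single subsample. It assumes scalar colors, fragments already sorted front to back, and a background treated as one extra fragment with opacity 1. The function names are hypothetical, and the serial `accumulate` call stands in for a parallel scan.

```python
from itertools import accumulate
import operator


def composite_recursive(alphas, colors, background):
    """Nested form of the composite equation, evaluated from the back forward."""
    result = background
    for a, c in zip(reversed(alphas), reversed(colors)):
        result = a * c + (1.0 - a) * result
    return result


def composite_scan(alphas, colors, background):
    """Same value as sum of G_k * L_k, with G from an exclusive product scan."""
    # Assumption of this sketch: the background is appended as an opaque fragment.
    a_all = list(alphas) + [1.0]
    c_all = list(colors) + [background]
    transmittance = [1.0 - a for a in a_all]
    # Exclusive scan: G_1 = 1, G_k = (1 - a_1) ... (1 - a_{k-1}).
    G = list(accumulate(transmittance[:-1], operator.mul, initial=1.0))
    # Local contributions are independent of each other.
    L = [a * c for a, c in zip(a_all, c_all)]
    return sum(g * l for g, l in zip(G, L))


alphas, colors, background = [0.5, 0.25], [1.0, 0.0], 0.8
assert abs(composite_recursive(alphas, colors, background)
           - composite_scan(alphas, colors, background)) < 1e-12
```

Because every L_k is independent and G comes from a scan, no step here needs to walk a subsample's fragment list serially.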

Fragment-Parallel Formulation
Filter, for every pixel:
$$C_p = C_{s_1}\kappa_1 + C_{s_2}\kappa_2 + \dots + C_{s_M}\kappa_M$$
This can be expressed as another reduction
  – After multiplying with subpixel weights $\kappa_m$
  – Can be merged with previous reduction
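A small sketch of why the two reductions can be merged: scaling every per-fragment term by its subsample's filter weight before summing gives the same pixel color as compositing each subsample first and filtering afterwards. The data layout (a list of (G_k, L_k) pairs per subsample) and the function names are assumptions made for illustration.

```python
def filter_two_stage(subsamples, weights):
    """Reduce each subsample to C_s, then take the weighted sum over subsamples."""
    subsample_colors = [sum(g * l for g, l in frags) for frags in subsamples]
    return sum(c * k for c, k in zip(subsample_colors, weights))


def filter_merged(subsamples, weights):
    """Single reduction over all fragments of the pixel, each term pre-scaled by kappa_m."""
    return sum(k * g * l
               for frags, k in zip(subsamples, weights)
               for g, l in frags)


# One pixel with two subsamples; each entry is a (G_k, L_k) pair.
subsamples = [[(1.0, 0.5), (0.5, 0.4)], [(1.0, 0.2)]]
weights = [0.5, 0.5]  # hypothetical box-filter weights
assert abs(filter_two_stage(subsamples, weights)
           - filter_merged(subsamples, weights)) < 1e-12
```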

Fragment-Parallel Composite & Filter
Final Algorithm
1. Two-key sort (subpixel ID, depth)
2. Segmented scan (obtain $G_k$)
3. Premultiply with weights ($L_k$, $\kappa_m$)
4. Segmented reduction
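The four steps can be sketched serially as below. This is not the authors' CUDA implementation: the tuple layout, the convention that a smaller depth is nearer, the mapping of subsample ID to pixel by integer division, and the function name are all assumptions. Plain loops stand in for the parallel sort, segmented scan, and segmented reduction, and the background is left out unless supplied as an opaque fragment.

```python
from collections import defaultdict


def fragment_parallel_composite_filter(fragments, weights, subsamples_per_pixel):
    """fragments: list of (subsample_id, depth, alpha, color) with scalar color.

    Returns {pixel_id: filtered color}. Serial stand-in for the parallel steps.
    """
    # 1. Two-key sort by (subsample ID, depth); smaller depth is assumed nearer.
    frags = sorted(fragments, key=lambda f: (f[0], f[1]))

    # 2. Segmented exclusive product scan of (1 - alpha), restarting per subsample.
    G = []
    prev_id, running = None, 1.0
    for sid, _, alpha, _ in frags:
        if sid != prev_id:
            prev_id, running = sid, 1.0
        G.append(running)
        running *= 1.0 - alpha

    # 3. Premultiply: G_k * L_k * kappa_m, where L_k = alpha_k * C_k.
    terms = [g * alpha * color * weights[sid % subsamples_per_pixel]
             for g, (sid, _, alpha, color) in zip(G, frags)]

    # 4. Segmented reduction (sum) over all fragments belonging to each pixel.
    pixels = defaultdict(float)
    for term, frag in zip(terms, frags):
        pixels[frag[0] // subsamples_per_pixel] += term
    return dict(pixels)


fragments = [(0, 2.0, 1.0, 0.8), (0, 1.0, 0.5, 1.0),
             (1, 1.0, 1.0, 0.2), (2, 5.0, 1.0, 0.4)]
result = fragment_parallel_composite_filter(fragments, [0.5, 0.5], 2)
print(result)
```

In this sketch the filter weight is folded into step 3, so step 4 is a single sum per pixel rather than one per subsample followed by one per pixel.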

Fragment-Parallel Formulation
[Figure: pixels P_i, P_(i+1), P_(i+2); a segmented scan (product) followed by a segmented reduction (sum). P: Pixel, S: Subsample]

Implementation
Hardware used: NVIDIA GeForce GTX 280
We require fast Segmented Scan and Reduce
  – CUDPP library provides that
  – Restricts implementation to NVIDIA CUDA
No direct access to hardware rasterizer
  – We wrote our own

Example System – Polygons
Applications
  – Games
Depth Complexity
  – 1 to few tens of layers
  – Suited to pixel-parallel
Fragment-parallel software rasterizer

Example System – Particles
Applications
  – Simulations, games
Depth Complexity
  – Hundreds of layers
  – High depth-variance
Particle-parallel sprite rasterizer

Example System – Volumes
Applications
  – Scientific Visualization
Depth Complexity
  – Tens to hundreds of layers
  – Low depth-variance
Major-axis-slice rasterizer

Example System – Reyes
Applications
  – Offline rendering
Depth Complexity
  – Tens of layers
  – Moderate depth variance
Data-parallel micropolygon rasterizer

Performance Results

Performance Variation

Limitations
Increased memory traffic
  – Several passes through CUDPP primitives
Unclear how to optimize for special cases
  – Threshold opacity
  – Threshold depth complexity

Summary and Conclusion
Parallel formulation of composite equation
  – Maps well to known primitives
  – Can be integrated with filter
  – Consistent performance across varying workloads
FPC is applicable to future rendering pipelines
  – Exploits higher degree of parallelism
  – Better related to size of rendering workload
A tool for building programmable pipelines

Future Work
Performance
  – Reduction in memory traffic
  – Extension to special-case scenes
  – Hybrid PPC-FPC formulations
Applications
  – Integration with hardware rasterizer
  – Cinematic rendering, Photoshop

Acknowledgments
NSF Award
SciDAC Institute for Ultrascale Visualization
NVIDIA Research Fellowship
Equipment donated by NVIDIA
Discussions and Feedback
  – Shubho Sengupta (UC Davis), Matt Pharr (Intel), Aaron Lefohn (Intel), Mike Houston (AMD)
  – Anonymous reviewers
Implementation assistance
  – Jeff Stuart, Shubho Sengupta

Thanks!