1
Fragment-Parallel Composite and Filter
Anjul Patney, Stanley Tzeng, and John D. Owens
University of California, Davis
2
Parallelism in Interactive Graphics
Well-expressed in hardware as well as APIs
Consistently growing in degree and expression
– More and more cores on upcoming GPUs
– From programmable shaders to pipelines
We should rethink algorithms to exploit this
This paper provides one example
– Parallelization of the composite and filter stages
3
A Feed-Forward Rendering Pipeline
Primitives → Geometry Processing → Rasterization → Composite → Filter → Pixels
4
Composite & Filter
Input
– Unordered list of fragments
Output
– Pixel colors
Assumption
– No fragments are discarded
[Figure: pixel sample locations]
5
Basic Idea
[Figure: composite workload, annotated "insufficient parallelism" and "irregularity"]
6
Pixel-Parallel Processors
7
Basic Idea
[Figure: fragment-parallel processors; annotations "insufficient parallelism" and "irregularity"]
8
Motivation
Most applications have low depth complexity
– Pixel-level parallelism is sufficient
We are interested in applications with
– Very high depth complexity
– High variation in depth complexity
Further
– Future platforms will demand more parallelism
– High depth complexity can limit pixel-parallelism
9
Motivation
10
Related Work – Order-Independent Transparency (OIT)
Depth-Peeling [Everitt 01]
– One pass per transparent layer
Stencil-Routed A-buffer [Myers & Bavoil 07]
– One pass per 8 depth layers [1]
Bucket Depth-Peeling [Liu et al. 09]
– One pass per up to 32 layers [2]
[1] Maximum MSAA samples per pixel
[2] Maximum render targets
11
Related Work – Order-Independent Transparency (OIT)
OIT using Direct3D 11 [Gruen et al. 10]
– Uses per-pixel fragment linked lists
– Per-pixel sort and composite
Hair Self-Shadowing [Sintorn et al. 09]
– Each fragment computes its contribution
– Assumes constant opacity
12
Related Work – Programmable Rendering Pipelines
RenderAnts [Zhou et al. 09]
– Sort fragments globally
– Per-pixel composite/filter
FreePipe [Liu et al. 10]
– Sort fragments globally
– Per-pixel composite/filter
13
Pixel-Parallel Formulation
[Figure: one thread per subsample; thread IDs j … (j+6) map to subsamples S_j … S_(j+6) of pixels P_i, P_(i+1), P_(i+2); P: pixel, S: subsample]
14
Pixel-Parallel Formulation
Workload size
– Depends on the number of fragments
– Limits the size of the rendering
Degree of parallelism
– Depends on the number of pixels/subpixels
These two may not always correspond
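For contrast, a minimal CPU sketch of what the pixel-parallel formulation computes (illustrative types and names, not the paper's code): one independent composite loop per subsample over that subsample's depth-sorted fragments, so the available parallelism equals the number of subsamples while the work per thread grows with depth complexity.

```cpp
// Minimal CPU sketch of the pixel-parallel formulation (illustrative names, not the
// paper's code): each subsample composites its own front-to-back sorted fragment list.
#include <vector>

struct Fragment { float color[3]; float alpha; };

// Composite one subsample's depth-sorted fragments over a background color.
// Parallelism = number of subsamples; work per call = that subsample's depth complexity.
void compositeSubsample(const std::vector<Fragment>& frags,
                        const float background[3], float out[3])
{
    float transmittance = 1.0f;                    // running product of (1 - alpha)
    out[0] = out[1] = out[2] = 0.0f;
    for (const Fragment& f : frags) {              // front to back
        for (int i = 0; i < 3; ++i)
            out[i] += transmittance * f.alpha * f.color[i];
        transmittance *= (1.0f - f.alpha);
    }
    for (int i = 0; i < 3; ++i)                    // light that reaches the background
        out[i] += transmittance * background[i];
}
```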
15
Fragment-Parallel Formulation
[Figure: one thread per fragment; thread IDs j … (j+23) map to the individual fragments of subsamples S_j … S_(j+6) across pixels P_i, P_(i+1), P_(i+2); P: pixel, S: subsample]
16
Fragment-Parallel Formulation
How can this behavior be achieved? Revisit the composite equation:

C_s = α_1 C_1 + (1 − α_1){ α_2 C_2 + (1 − α_2)( … ( α_N C_N + (1 − α_N) C_B ) … ) }
      (fragment 1)        (fragment 2)                (fragment N)   (background)

Expanding the recursion:

C_s =   α_1 C_1
      + (1 − α_1) α_2 C_2
      + (1 − α_1)(1 − α_2) α_3 C_3
      + …
      + (1 − α_1)(1 − α_2) … (1 − α_(k−1)) α_k C_k
      + …
      + (1 − α_1)(1 − α_2) … (1 − α_N) C_B

In term k, the prefix product (1 − α_1)(1 − α_2) … (1 − α_(k−1)) is the global contribution G_k, and α_k C_k is the local contribution L_k.
17
Fragment-Parallel Formulation
L_k is trivially parallel (local computation)
G_k is the result of a scan operation (product)
For the list of input fragments
– Compute G[ ] and L[ ], multiply
– Perform a reduction to add subpixel contributions

C_s = G_1 L_1 + G_2 L_2 + G_3 L_3 + … + G_N L_N
G_k = (1 − α_1)(1 − α_2) … (1 − α_(k−1))
L_k = α_k C_k
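A minimal single-subsample sketch of the identity above (illustrative, not the paper's implementation): G is an exclusive scan of the (1 − α) values under multiplication, L_k = α_k C_k is computed locally, and C_s is their dot product plus the background term; the sequential front-to-back loop is included as a cross-check. In the full algorithm the scan becomes a segmented scan so that the running product restarts at every subsample boundary.

```cpp
// Single-subsample sketch (illustrative): compute G via an exclusive product scan,
// L locally per fragment, and C_s as sum(G_k * L_k) plus the background term.
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

int main()
{
    // Hypothetical fragments, front to back: grayscale color and alpha.
    std::vector<float> color = {0.9f, 0.4f, 0.7f};
    std::vector<float> alpha = {0.5f, 0.25f, 0.8f};
    const float Cb = 0.2f;                              // background color
    const size_t N = color.size();

    // G_k = product of (1 - alpha_j) for j < k  -->  exclusive scan with multiplication.
    std::vector<float> oneMinusA(N), G(N), L(N);
    for (size_t k = 0; k < N; ++k) { oneMinusA[k] = 1.0f - alpha[k]; L[k] = alpha[k] * color[k]; }
    std::exclusive_scan(oneMinusA.begin(), oneMinusA.end(), G.begin(), 1.0f,
                        [](float a, float b) { return a * b; });

    // C_s = sum_k G_k * L_k + (product of all (1 - alpha)) * C_B
    float Cs = std::inner_product(G.begin(), G.end(), L.begin(), 0.0f);
    Cs += G.back() * oneMinusA.back() * Cb;

    // Sequential front-to-back composite for comparison.
    float ref = 0.0f, T = 1.0f;
    for (size_t k = 0; k < N; ++k) { ref += T * L[k]; T *= oneMinusA[k]; }
    ref += T * Cb;

    assert(std::fabs(Cs - ref) < 1e-6f);                // both are ~0.725 for these inputs
    return 0;
}
```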
18
Fragment-Parallel Formulation
Filter, for every pixel:

C_p = C_s1 κ_1 + C_s2 κ_2 + … + C_sM κ_M

This can be expressed as another reduction
– After multiplying with subpixel weights κ_m
– Can be merged with the previous reduction
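A short expansion of why the filter reduction can be merged with the composite reduction (this follows directly from the two sums above; m(k) is just notation introduced here for the subsample covered by fragment k):

```latex
C_p \;=\; \sum_{m=1}^{M} \kappa_m\, C_{s_m}
    \;=\; \sum_{m=1}^{M} \kappa_m \sum_{k \in \text{subsample}\, m} G_k L_k
    \;=\; \sum_{k \in \text{pixel}\, p} \kappa_{m(k)}\, G_k L_k
```

Scaling each fragment's term by its subsample's filter weight therefore turns the two nested reductions into a single segmented reduction over the fragments of each pixel.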
19
Fragment-Parallel Composite & Filter
Final Algorithm
1. Two-key sort (subpixel ID, depth)
2. Segmented scan (obtain G_k)
3. Premultiply with weights (L_k, κ_m)
4. Segmented reduction
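A compact CPU reference of the four steps (a sketch under illustrative assumptions: grayscale colors, globally unique subsample IDs ordered by pixel, background term omitted; not the paper's CUDA code):

```cpp
// CPU reference of the final algorithm (illustrative data layout, not the paper's code).
// The background term is omitted for brevity.
#include <algorithm>
#include <map>
#include <vector>

struct Frag {
    int   pixel;        // pixel this fragment lands in
    int   subsample;    // globally unique subsample ID
    float depth;
    float color;        // grayscale for brevity
    float alpha;
};

// kappa[s]: filter weight of subsample s within its pixel (assumed given).
std::map<int, float> compositeAndFilter(std::vector<Frag> frags,
                                        const std::vector<float>& kappa)
{
    // 1. Two-key sort: by subsample ID, then by depth (front to back).
    std::sort(frags.begin(), frags.end(), [](const Frag& a, const Frag& b) {
        return a.subsample != b.subsample ? a.subsample < b.subsample : a.depth < b.depth;
    });

    // 2. Segmented scan: G_k = exclusive product of (1 - alpha) within each subsample.
    //    (Sequential loop standing in for a parallel segmented scan.)
    std::vector<float> G(frags.size());
    float run = 1.0f;
    for (size_t k = 0; k < frags.size(); ++k) {
        if (k == 0 || frags[k].subsample != frags[k - 1].subsample) run = 1.0f;
        G[k] = run;
        run *= (1.0f - frags[k].alpha);
    }

    // 3. Premultiply: term_k = kappa_{m(k)} * G_k * L_k, with L_k = alpha_k * color_k.
    // 4. Segmented reduction: sum the terms of each pixel.
    std::map<int, float> pixelColor;
    for (size_t k = 0; k < frags.size(); ++k)
        pixelColor[frags[k].pixel] += kappa[frags[k].subsample] * G[k]
                                      * frags[k].alpha * frags[k].color;
    return pixelColor;
}
```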
20
Fragment-Parallel Formulation
[Figure: a segmented scan (product) over each subsample's fragments followed by a segmented reduction (sum) per pixel, shown for pixels P_i, P_(i+1), P_(i+2); P: pixel, S: subsample]
21
Implementation
Hardware used: NVIDIA GeForce GTX 280
We require fast segmented scan and reduce
– The CUDPP library provides that
– Restricts the implementation to NVIDIA CUDA
No direct access to the hardware rasterizer
– We wrote our own
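The paper relies on CUDPP's segmented scan and reduce; as a stand-in, the sketch below maps the same four steps onto Thrust primitives (Thrust is not what the paper uses, and the array layout and function names are illustrative). Segments are encoded as key arrays: subsample IDs for the scan, pixel IDs for the reduction; subsample IDs are assumed to be ordered by pixel so that a pixel's fragments stay contiguous after the sort.

```cpp
// Sketch of the four steps using Thrust primitives as a stand-in for CUDPP
// (illustrative only; input arrays are sorted in place, grayscale color for brevity).
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/gather.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/sort.h>
#include <thrust/transform.h>

using thrust::placeholders::_1;
using thrust::placeholders::_2;

// Inputs: one entry per fragment; kappa: one filter weight per subsample.
// Outputs: outPixel/outColor must be pre-sized to the number of covered pixels.
void compositeAndFilterGPU(thrust::device_vector<int>&   subsample,
                           thrust::device_vector<int>&   pixel,
                           thrust::device_vector<float>& depth,
                           thrust::device_vector<float>& alpha,
                           thrust::device_vector<float>& color,
                           const thrust::device_vector<float>& kappa,
                           thrust::device_vector<int>&   outPixel,
                           thrust::device_vector<float>& outColor)
{
    const size_t n = alpha.size();

    // 1. Two-key sort (subsample ID, depth): stable-sort by the minor key (depth),
    //    then stable-sort by the major key (subsample ID).
    auto payload1 = thrust::make_zip_iterator(thrust::make_tuple(
        subsample.begin(), pixel.begin(), alpha.begin(), color.begin()));
    thrust::stable_sort_by_key(depth.begin(), depth.end(), payload1);
    auto payload2 = thrust::make_zip_iterator(thrust::make_tuple(
        depth.begin(), pixel.begin(), alpha.begin(), color.begin()));
    thrust::stable_sort_by_key(subsample.begin(), subsample.end(), payload2);

    // 2. Segmented scan: G_k = exclusive product of (1 - alpha) within each subsample.
    thrust::device_vector<float> oneMinusA(n), G(n);
    thrust::transform(alpha.begin(), alpha.end(), oneMinusA.begin(), 1.0f - _1);
    thrust::exclusive_scan_by_key(subsample.begin(), subsample.end(), oneMinusA.begin(),
                                  G.begin(), 1.0f, thrust::equal_to<int>(),
                                  thrust::multiplies<float>());

    // 3. Premultiply: term_k = kappa[subsample_k] * G_k * alpha_k * color_k.
    thrust::device_vector<float> term(n), kappaFrag(n);
    thrust::transform(G.begin(), G.end(), alpha.begin(), term.begin(), _1 * _2);
    thrust::transform(term.begin(), term.end(), color.begin(), term.begin(), _1 * _2);
    thrust::gather(subsample.begin(), subsample.end(), kappa.begin(), kappaFrag.begin());
    thrust::transform(term.begin(), term.end(), kappaFrag.begin(), term.begin(), _1 * _2);

    // 4. Segmented reduction: sum the terms of each pixel.
    thrust::reduce_by_key(pixel.begin(), pixel.end(), term.begin(),
                          outPixel.begin(), outColor.begin());
}
```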
22
Example System – Polygons
Applications
– Games
Depth complexity
– 1 to a few tens of layers
– Suited to pixel-parallel
Fragment-parallel software rasterizer
23
Example System – Particles
Applications
– Simulations, games
Depth complexity
– Hundreds of layers
– High depth variance
Particle-parallel sprite rasterizer
24
Example System – Volumes
Applications
– Scientific visualization
Depth complexity
– Tens to hundreds of layers
– Low depth variance
Major-axis-slice rasterizer
25
Example System – Reyes
Applications
– Offline rendering
Depth complexity
– Tens of layers
– Moderate depth variance
Data-parallel micropolygon rasterizer
26
Performance Results
27
Performance Variation
28
Limitations
Increased memory traffic
– Several passes through CUDPP primitives
Unclear how to optimize for special cases
– Threshold opacity
– Threshold depth complexity
29
Summary and Conclusion
Parallel formulation of the composite equation
– Maps well to known primitives
– Can be integrated with the filter
– Consistent performance across varying workloads
Fragment-parallel composite (FPC) is applicable to future rendering pipelines
– Exploits a higher degree of parallelism
– Better matched to the size of the rendering workload
A tool for building programmable pipelines
30
Future Work
Performance
– Reduction in memory traffic
– Extension to special-case scenes
– Hybrid pixel-parallel/fragment-parallel (PPC/FPC) formulations
Applications
– Integration with the hardware rasterizer
– Cinematic rendering, Photoshop
31
Acknowledgments
NSF Award 0541448
SciDAC Institute for Ultrascale Visualization
NVIDIA Research Fellowship
Equipment donated by NVIDIA
Discussions and feedback
– Shubho Sengupta (UC Davis), Matt Pharr (Intel), Aaron Lefohn (Intel), Mike Houston (AMD)
– Anonymous reviewers
Implementation assistance
– Jeff Stuart, Shubho Sengupta
32
Thanks!