Efficient Partitioning of Fragment Shaders for Multiple-Output Hardware. Tim Foley, Mike Houston, Pat Hanrahan. Computer Graphics Lab, Stanford University.

1 Efficient Partitioning of Fragment Shaders for Multiple-Output Hardware. Tim Foley, Mike Houston, Pat Hanrahan. Computer Graphics Lab, Stanford University.

2 Motivation. GPU programming: interactive shading; offline rendering; computation (physical simulations, numerical methods); BrookGPU [Buck et al. 2004]. Programs shouldn't be constrained by hardware limits, but they demand high runtime performance.

3 Motivation – Multipass Partitioning. Divide a GPU program (shader) into a partition, that is, a set of rendering passes: each pass satisfies all resource constraints, and intermediate values are saved to and restored from textures. Many possible partitions exist. The problem: given a program, find the best partition.

4 Related Work. SGI's ISL [Peercy et al. 2000]: treat the OpenGL machine as a SIMD processor. Recursive Dominator Split (RDS) [Chan et al. 2002]: graph partitioning of the shader dag. Data-Dependent Multipass Control Flow on GPU [Popa and McCool 2004]: partition around flow control and schedule passes. Mio [Riffel et al. 2004]: instruction scheduling with backtracking.

5 Contribution. Merging Recursive Dominator Split (MRDS). MRDS extends RDS: supports shaders with multiple outputs; supports hardware with multiple render targets; generates better partitions; same asymptotic running time as RDS.

6 Outline: Motivation; Related Work; RDS Algorithm; MRDS Algorithm; Results; Future Work.

7 RDS – Overview. Input: dag of n nodes, made up of shader ops and inputs (interpolants, constants, textures). Goal: mark a subset of nodes as splits; split nodes define pass boundaries; there are 2^n possible subsets.

10 RDS – Overview. Combination of approaches to limit the search space. Save/recompute decisions: the primary performance tradeoff. Dominator tree: used to avoid save/recompute tradeoffs.

11 RDS – Save / Recompute. M: multiply referenced node.
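
The save/recompute tradeoff for a multiply referenced node M can be sketched with a toy cost model. This is only an illustration: the function names, the per-pass overhead, and the texture-read cost below are hypothetical and are not taken from the talk or from the RDS cost function.

```python
def save_cost(subtree_ops, consumers, pass_overhead, tex_read):
    # Save: evaluate M once in its own pass, write it to a texture,
    # then every consumer fetches the stored value.
    return pass_overhead + subtree_ops + consumers * tex_read


def recompute_cost(subtree_ops, consumers):
    # Recompute: every consumer re-evaluates M's subtree inline.
    return consumers * subtree_ops


def choose(subtree_ops, consumers, pass_overhead=10, tex_read=2):
    # Hypothetical default costs; a real partitioner would measure these.
    s = save_cost(subtree_ops, consumers, pass_overhead, tex_read)
    r = recompute_cost(subtree_ops, consumers)
    return ("save", s) if s < r else ("recompute", r)
```

Under this model a cheap subtree is recomputed and an expensive one is saved, which is why the decision has to be made per node.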

15 Dominator. B dom G: all paths to G go through B.

16 Dominator Tree

17 Key Insight: if B and G are in the same pass and B dom G, then there are no save/recompute costs for G.
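
A minimal sketch of how dominators could be computed on a shader dag, assuming a single output node as the root, edges pointing from each op to its operands, and every node reachable from the root. Here "B dom G" means every path from the root to G passes through B. The `children` dictionary layout and the function name are assumptions made for illustration, not the RDS implementation.

```python
def dominators(children, root):
    """Return {node: set of nodes that dominate it} for a rooted dag."""
    parents = {n: [] for n in children}
    for n, operands in children.items():
        for c in operands:
            parents[c].append(n)

    # Topological order from the root (Kahn's algorithm).
    indegree = {n: len(parents[n]) for n in children}
    order, ready = [], [root]
    while ready:
        n = ready.pop()
        order.append(n)
        for c in children[n]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)

    # A node is dominated by itself plus whatever dominates all its parents.
    dom = {root: {root}}
    for n in order[1:]:
        dom[n] = {n} | set.intersection(*(dom[p] for p in parents[n]))
    return dom


# G is referenced by both D and E, but every path to it goes through B.
dag = {"R": ["B", "C"], "B": ["D", "E"], "D": ["G"], "E": ["G"],
       "G": [], "C": []}
dom = dominators(dag, "R")
```

In this example G is multiply referenced, yet B dominates it, so placing B and G in the same pass keeps all uses of G inside that pass.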

18 MRDS – Multiple-Output Shaders

20 MRDS – Multiple-Output Hardware: float4 x, y; ... for (i = 0; i < N; i++) { x' = x*x - y*y; y' = 2*x*y; x = x'; y = y'; } ...

21 MRDS – Multiple-Output Hardware: float4 x, y; ... for (i = 0; i < N; i++) { x' = f(x, y); y' = g(x, y); x = x'; y = y'; } ...

23 MRDS – Multiple-Output Hardware. State cannot fit in a single output: float4 x, y; ... for (i = 0; i < N; i++) { x' = f(x, y); y' = g(x, y); x = x'; y = y'; } ...
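
The effect of the output limit on this loop can be sketched by counting passes. The simulation below is an assumption-laden illustration: scalars stand in for float4 values, f and g reuse the bodies from slide 20, and one "pass" is modeled as one shader invocation that writes either one or two results.

```python
def f(x, y):
    return x * x - y * y


def g(x, y):
    return 2 * x * y


def iterate_single_output(x, y, n):
    # One render target: x' and y' each need their own pass,
    # and both passes read the old x and y.
    passes = 0
    for _ in range(n):
        x_new = f(x, y)
        passes += 1
        y_new = g(x, y)
        passes += 1
        x, y = x_new, y_new
    return x, y, passes


def iterate_multi_output(x, y, n):
    # Two render targets: one pass writes x' and y' together.
    passes = 0
    for _ in range(n):
        x, y = f(x, y), g(x, y)
        passes += 1
    return x, y, passes
```

Both versions produce the same state; the single-output version simply needs twice as many passes, and any work shared between f and g would be duplicated across them.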

25 MRDS – Dominating Sets. Dominating set S = {A, D}. S dom G: all paths to G go through an element of S. If S and G are in the same pass, save/recompute is avoided for G.
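
One way to test whether a set S dominates G is to block the nodes of S and check whether G is still reachable from the root. This is a sketch under the same assumptions as a rooted dag with edges from ops to operands; the function name and graph layout are hypothetical, and the talk does not say how MRDS performs this check.

```python
def set_dominates(children, root, S, g):
    """True if every path from root to g passes through a node in S."""
    blocked = set(S)
    if g in blocked or root in blocked:
        return True
    stack, seen = [root], {root}
    while stack:
        n = stack.pop()
        if n == g:
            return False  # reached g without touching S
        for c in children[n]:
            if c not in blocked and c not in seen:
                seen.add(c)
                stack.append(c)
    return True  # g unreachable once S is blocked


# Two outputs A and D both use G; neither dominates G alone.
dag = {"R": ["A", "D"], "A": ["G"], "D": ["G"], "G": []}
```

Here {A, D} dominates G while {A} alone does not, matching the slide's example of a dominating set with two elements.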

26 MRDS – Pass Merging. Generate initial passes with RDS. Find potential merges: check if valid; evaluate change in cost. Execute merges from best to worst, revalidating each one. Stop when no more beneficial merges remain.
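
The merging loop described above can be sketched as a greedy procedure. Everything concrete here is an assumption: passes are modeled as sets of node names, and `is_valid` and `cost` are hypothetical callbacks standing in for the compiler validity check and the cost model.

```python
from itertools import combinations


def merge_passes(passes, is_valid, cost):
    passes = [frozenset(p) for p in passes]
    while True:
        # Find potential merges: keep the valid ones that lower total cost.
        candidates = []
        for a, b in combinations(passes, 2):
            merged = a | b
            if is_valid(merged):
                gain = cost(a) + cost(b) - cost(merged)
                if gain > 0:
                    candidates.append((gain, a, b))
        if not candidates:
            return passes  # no more beneficial merges

        # Execute from best to worst.
        candidates.sort(key=lambda c: c[0], reverse=True)
        for _, a, b in candidates:
            # Revalidate: an earlier merge may have consumed a or b.
            if a in passes and b in passes:
                passes.remove(a)
                passes.remove(b)
                passes.append(a | b)


# Passes A and D both recompute G; the third pass is unrelated.
initial = [{"G", "A"}, {"G", "D"}, {"E"}]
result = merge_passes(initial,
                      is_valid=lambda p: len(p) <= 3,  # toy resource limit
                      cost=lambda p: 10 + len(p))      # toy per-pass overhead
```

Because merged passes are set unions, a node recomputed in two passes is counted once after the merge, so merges that remove duplicated work score highest.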

31 MRDS – Pass Merging. What if RDS chose to recompute G? A merge between passes A and D eliminates the duplicate instructions and gets a high score.

33 MRDS – Time Complexity. Cost of merging is dominated by the initial search: it iterates over s^2 pairs of splits, and each pair requires size-s set operations and 1 compiler call, giving O(s^2 (s + n)). s = O(n) in the worst case, so MRDS = O(n^3) in the worst case; in practice we expect s << n. This assumes compiler calls are linear, which is not true for fxc.

34 MRDS'. RDS uses a linear search for save/recompute: it evaluates the cost of both alternatives with RDS_h, so RDS = O(n * RDS_h) = O(n^3). MRDS merges after RDS has made these decisions: MRDS = O(RDS + n^3) = O(n^3). MRDS' merges during cost evaluation, which adds a linear factor in the worst case: MRDS' = O(n * (RDS_h + n^3)) = O(n^4).
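
The bounds on the last two slides can be written out step by step. The T notation is introduced here only for illustration, and T_{RDS_h} = O(n^2) is inferred from the slide's RDS = O(n * RDS_h) = O(n^3) rather than stated in the talk.

```latex
\begin{align*}
T_{\text{merge}} &= O\big(s^2 (s + n)\big)
  = O\big(n^2 (n + n)\big) = O(n^3) && \text{for } s = O(n), \\
T_{\text{MRDS}} &= O\big(T_{\text{RDS}} + n^3\big)
  = O\big(n^3 + n^3\big) = O(n^3), \\
T_{\text{MRDS}'} &= O\big(n \cdot (T_{\text{RDS}_h} + n^3)\big)
  = O\big(n \cdot (n^2 + n^3)\big) = O(n^4).
\end{align*}
```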

35 Results. 3 Brook programs: Procedural Fire, Mandelbrot Fractal, Matrix Multiply. Compiled for ATI Radeon 9800 XT with RDS, MRDS, and MRDS'.

36 Results – Procedural Fire. MRDS' is better than MRDS and RDS: better save/recompute decisions result in less bandwidth used.

37 Results – Compile Times

38 Results – Mandelbrot Fractal. MRDS' and MRDS are better than RDS: the computation is iterative with state in 2 variables, and RDS duplicates computation.

39 Results – Matrix Multiply. Matrix-matrix multiply benefits from blocking: blocking cuts computation by ~2x. Blocking requires multiple outputs, so performance is limited by MRT performance.

40 Summary. Modified RDS algorithm, MRDS: supports multiple-output shaders; generates code for multiple render targets; easy to implement, same running time; generates better-performing partitions.

41 Future Work. Implementations: Ashli; combine with Mio. Exploit new hardware: data-dependent flow control; large numbers of outputs.

42 Acknowledgements. Eric Chan, Ren Ng, Pradeep Sen, Kekoa Proudfoot: RDS implementation, design discussions. Kayvon Fatahalian, Ian Buck: GPUBench results. ATI: hardware. DARPA, ATI, IBM, NVIDIA, SONY: funding.
