Efficient Partitioning of Fragment Shaders for Multiple-Output Hardware
Tim Foley, Mike Houston, Pat Hanrahan
Computer Graphics Lab, Stanford University

Motivation
GPU programming spans interactive shading, offline rendering, and computation (physical simulations, numerical methods) with systems such as BrookGPU [Buck et al. 2004].
Programs shouldn't be constrained by hardware limits, yet they demand high runtime performance.

Motivation – Multipass Partitioning
Divide a GPU program (shader) into a partition: a set of rendering passes in which
- each pass satisfies all resource constraints
- intermediate values are saved to and restored from textures
Many possible partitions exist. The problem: given a program, find the best partition.
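
To make this concrete, here is a minimal Python sketch of what a multipass partition might record: the DAG nodes each pass executes and the intermediates it must save to textures for later passes. The structure and toy cost model are illustrative assumptions, not the paper's implementation.

    from dataclasses import dataclass, field

    @dataclass
    class Pass:
        ops: list                                   # DAG nodes executed in this pass
        saved: list = field(default_factory=list)   # intermediates written to textures

    @dataclass
    class Partition:
        passes: list                                # ordered list of Pass objects

        def cost(self, per_pass_overhead, per_save_cost):
            # Toy cost model: a fixed overhead per rendering pass, plus a
            # bandwidth charge for every intermediate saved to (and later
            # restored from) a texture.
            return (per_pass_overhead * len(self.passes)
                    + per_save_cost * sum(len(p.saved) for p in self.passes))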

Related Work
- SGI's ISL [Peercy et al. 2000]: treats the OpenGL machine as a SIMD processor
- Recursive Dominator Split (RDS) [Chan et al. 2002]: graph partitioning of the shader DAG
- Data-Dependent Multipass Control Flow on the GPU [Popa and McCool 2004]: partitions around flow control and schedules passes
- Mio [Riffel et al. 2004]: instruction scheduling with backtracking

Contribution
Merging Recursive Dominator Split (MRDS) extends RDS to:
- support shaders with multiple outputs
- support hardware with multiple render targets
- generate better partitions
- keep the same running time as RDS

Outline
- Motivation
- Related Work
- RDS Algorithm
- MRDS Algorithm
- Results
- Future Work

RDS – Overview
Input: a DAG of n nodes
- shader ops
- inputs: interpolants, constants, textures
Goal: mark a subset of nodes as splits
- split nodes define pass boundaries
- 2^n possible subsets
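
A possible Python representation of this input and output, with names of my own choosing (not the RDS code): each DAG node is a shader op or an input, and the algorithm's job is to decide which nodes to mark as splits.

    from dataclasses import dataclass, field

    @dataclass(eq=False)
    class Node:
        op: str                                    # shader op, or an input:
                                                   # interpolant / constant / texture
        args: list = field(default_factory=list)   # operand nodes (the DAG edges)
        is_split: bool = False                     # chosen as a pass boundary

    def splits(dag_nodes):
        # The chosen subset, out of 2^n possibilities, defines the passes.
        return [n for n in dag_nodes if n.is_split]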

RDS – Overview
RDS combines approaches to limit the search space:
- save/recompute decisions are the primary performance tradeoff
- a dominator tree is used to avoid save/recompute tradeoffs

RDS – Save / Recompute
M – a multiply-referenced node. When a pass boundary separates M from its uses, M's value must either be saved to a texture or recomputed in each pass that consumes it.
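
One way to state the tradeoff for a multiply-referenced node M, using toy cost terms of my own (RDS actually evaluates both alternatives by compiling them, rather than with a closed-form model):

    def choose_save_or_recompute(write_cost, read_cost, subtree_cost, uses):
        # Saving M: one texture write, plus one texture read in each of the
        # `uses` passes that consume M.
        # Recomputing M: re-execute M's subtree in every consuming pass.
        cost_if_saved = write_cost + read_cost * uses
        cost_if_recomputed = subtree_cost * uses
        return "save" if cost_if_saved <= cost_if_recomputed else "recompute"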

Dominator
B dom G: all paths to G go through B
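
For illustration, dominators can be computed with the standard iterative dataflow algorithm (a generic sketch, not the RDS implementation; it assumes every non-root node has at least one predecessor on a path from the root):

    def dominators(nodes, preds, root):
        # dom(root) = {root}
        # dom(n)    = {n} | intersection of dom(p) over n's predecessors p
        dom = {n: set(nodes) for n in nodes}
        dom[root] = {root}
        changed = True
        while changed:
            changed = False
            for n in nodes:
                if n == root:
                    continue
                new = {n} | set.intersection(*(dom[p] for p in preds[n]))
                if new != dom[n]:
                    dom[n] = new
                    changed = True
        return dom   # b in dom[g]  <=>  B dom G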

Dominator Tree
Each node's immediate dominator becomes its parent, arranging the shader DAG's nodes into a tree.

Key Insight
If B and G are in the same pass and B dom G, then there are no save/recompute costs for G.

MRDS – Multiple-Output Shaders

MRDS – Multiple-Output Hardware

    float4 x, y;
    ...
    for( i=0; i<N; i++ ) {
        x' = x*x - y*y;
        y' = 2*x*y;
        x = x';
        y = y';
    }
    ...

MRDS – Multiple-Output Hardware

    float4 x, y;
    ...
    for( i=0; i<N; i++ ) {
        x' = f( x, y );
        y' = g( x, y );
        x = x';
        y = y';
    }
    ...

MRDS – Multiple-Output Hardware
State cannot fit in a single output:

    float4 x, y;
    ...
    for( i=0; i<N; i++ ) {
        x' = f( x, y );
        y' = g( x, y );
        x = x';
        y = y';
    }
    ...

MRDS – Dominating Sets
Dominating set S = {A, D}
S dom G: all paths to G go through an element of S
If S and G are in the same pass, save/recompute for G is avoided.
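
The set version of the relation has a direct reachability reading, sketched below with assumed names (succ maps each node to its successors on root-to-node paths): S dominates G exactly when deleting S disconnects G from the root.

    def set_dominates(succ, root, S, g):
        # S dom g: every path from the root to g passes through an element
        # of S.  Test by removing S and checking g is no longer reachable.
        if root in S or g in S:
            return True
        stack, seen = [root], {root}
        while stack:
            n = stack.pop()
            if n == g:
                return False      # found a root-to-g path avoiding S
            for m in succ.get(n, ()):
                if m not in S and m not in seen:
                    seen.add(m)
                    stack.append(m)
        return True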

MRDS – Pass Merging
- Generate initial passes with RDS
- Find potential merges: check each for validity, evaluate its change in cost
- Execute merges from best to worst, revalidating each
- Stop when no more beneficial merges remain
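
In outline, the merging phase might look like the following sketch; valid_merge, merge_benefit, and apply_merge are hypothetical helpers standing in for the resource-limit check, the cost-model evaluation, and the actual pass fusion.

    from itertools import combinations

    def merge_passes(passes, valid_merge, merge_benefit, apply_merge):
        # Score every valid pair of initial passes, then apply merges from
        # best to worst.  Each candidate is revalidated before applying,
        # since an earlier merge may have consumed one of its passes or
        # pushed the combined pass over a resource limit.
        passes = list(passes)
        candidates = sorted(
            ((merge_benefit(a, b), a, b)
             for a, b in combinations(passes, 2) if valid_merge(a, b)),
            key=lambda c: c[0], reverse=True)
        for benefit, a, b in candidates:
            if benefit <= 0:
                break             # stop: no more beneficial merges
            if a in passes and b in passes and valid_merge(a, b):
                passes.remove(a)
                passes.remove(b)
                passes.append(apply_merge(a, b))
        return passes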

MRDS – Pass Merging
What if RDS chose to recompute G? A merge between passes A and D eliminates the duplicated instructions, so it receives a high score.

MRDS – Time Complexity
The cost of merging is dominated by the initial search:
- it iterates over O(s^2) pairs of splits
- each pair requires size-s set operations and one compiler call
- total: O(s^2 (s + n)); s = O(n) in the worst case, so MRDS = O(n^3) in the worst case
- in practice we expect s << n
This assumes compiler calls take linear time (not true for fxc).

MRDS'
RDS uses a linear search for save/recompute decisions, evaluating the cost of both alternatives with the heuristic subroutine RDS_h:
- RDS = O(n * RDS_h) = O(n^3)
MRDS merges after RDS has made these decisions:
- MRDS = O(RDS + n^3) = O(n^3)
MRDS' merges during cost evaluation, adding a linear factor in the worst case:
- MRDS' = O(n * (RDS_h + n^3)) = O(n^4)

Results
Three Brook programs:
- Procedural Fire
- Mandelbrot Fractal
- Matrix Multiply
Each compiled for an ATI Radeon 9800 XT with RDS, MRDS, and MRDS'.

Results – Procedural Fire
MRDS' beats MRDS and RDS: better save/recompute decisions result in less bandwidth used.

Results – Compile Times

Results – Mandelbrot Fractal
MRDS' and MRDS beat RDS: the iterative computation keeps its state in two variables, which RDS can only handle by duplicating computation.

Results – Matrix Multiply
Matrix-matrix multiply benefits from blocking, which cuts computation by a factor of ~2. Blocking requires multiple outputs, so performance is limited by MRT performance.

Summary
MRDS, a modified RDS algorithm:
- supports multiple-output shaders
- generates code for multiple render targets
- is easy to implement, with the same running time as RDS
- generates better-performing partitions

Future Work
Implementations:
- Ashli
- combining with Mio
Exploiting new hardware:
- data-dependent flow control
- large numbers of outputs

Acknowledgements
Eric Chan, Ren Ng, Pradeep Sen, Kekoa Proudfoot – RDS implementation, design discussions
Kayvon Fatahalian, Ian Buck – GPUBench results
ATI – hardware
DARPA, ATI, IBM, NVIDIA, SONY – funding