Published by Jemimah Jennings. Modified over 9 years ago.
1
Compiling Metaprogrammed Shaders to Stream GPUs
Michael D. McCool
Computer Graphics Lab, University of Waterloo
Graphics Hardware 2003
2
Topics
- GPUs are “Stream Processors”…
  - But what does that mean, exactly?
- Can general programs be compiled to GPUs?
- Can they run efficiently on GPUs?
- How can GPUs be evolved to support more powerful programming models without negatively impacting performance?
- What abstractions should programming languages for GPUs support?
3
Imagine Stream Processor
- SIMD kernel processing on streams containing homogeneous records
- Memory hierarchy
  - Local registers
  - Stream register file
  - External memory
- Streaming external memory access
- Conditional read and write
4
Stream GPU Architecture
[Pipeline diagram. Stages: Vertex Shader, Rasterizer, Fragment Shader, Compositor, Display. Legend: New, Optional.]
5
Stream GPU Architecture
- Stream input to vertex unit
- Array inputs to fragment unit
- At least two stream outputs from fragment unit supporting conditional writes
- Array output from fragment unit via compositor
6
Sh Metaprogramming Library
- Embedded metaprogramming
- Both a library and a high-level programming language
- Available from SourceForge: http://libsh.sourceforge.net
- Currently semantically “Cg-equivalent”
- Adding control constructs, stream algebra in next phase…
7
Julia Set: Sh Example

ShAttrib1f julia_max_iter = 20.0;
ShAttrib1f julia_scale = 0.05;
ShAttrib2f julia_c(1.0, -0.3);
ShTexture2D julia_map(32,32);
...

// Vertex shader: pass the texture coordinate through and transform the position.
ShProgram julia0 = SH_BEGIN_VERTEX_SHADER {
  ShInputTexCoord2f ui;
  ShInputPosition3f pm;
  ShOutputTexCoord2f uo(ui);
  ShOutputPosition4f pd;
  pd = (perspective | modelview) | pm;
} SH_END_SHADER;

// Fragment shader: iterate u <- u^2 + c and colour by iteration count.
ShProgram julia1 = SH_BEGIN_FRAGMENT_SHADER {
  ShInputTexCoord2f u;
  ShInputPosition2f pdxy;
  ShOutputColor3f fc;
  ShAttrib1f i = 0.0;
  // The condition tests u, since v is only declared inside the loop body.
  // The product of the two comparisons acts as a logical AND.
  SH_WHILE(((u|u) < 2.0) * (i < julia_max_iter)) {
    ShTexCoord2f v;
    v(0) = u(0)*u(0) - u(1)*u(1);
    v(1) = 2.0*u(0)*u(1);
    u = v + julia_c;
    i++;
  } SH_ENDWHILE;
  ShTexCoord2f lookup(0.0,0.0);
  lookup(0) = julia_scale * i;
  fc = julia_map(lookup);
} SH_END_SHADER;
8
Compiler: Control Flow Graph
9
Control Graph
- Control flow graph from compiler also describes multipass stream program!
- Need conditional write to avoid accumulation of “garbage records”
- Iteration and conditionals may scramble order of records, but can always sort by ID later if necessary.
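The points above can be illustrated with a small CPU-side sketch. It is not Sh code and not the compiler's output: the `Record` type and the `julia_pass` and `run_julia` functions are hypothetical names introduced here, and the loop condition reuses the constants from the Julia set slide. Each pass conditionally writes a record to either a feedback stream or a finished stream, and the final sort by ID restores the input order that the iteration scrambled.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical record: one Julia-set sample tagged with its stream ID.
struct Record {
    int id;      // original position in the input stream
    float x, y;  // current iterate
    int iter;    // iterations performed so far
};

// One pass of the loop kernel. A record that still satisfies the loop
// condition is written (updated) to `again`; otherwise it is written
// unchanged to `done`. Each record goes to exactly one output, so
// neither stream accumulates garbage records.
void julia_pass(const std::vector<Record>& in, float cx, float cy, int max_iter,
                std::vector<Record>& again, std::vector<Record>& done) {
    for (const Record& r : in) {
        bool keep_going = (r.x * r.x + r.y * r.y < 2.0f) && (r.iter < max_iter);
        if (!keep_going) {
            done.push_back(r);
            continue;
        }
        Record n = r;
        n.x = r.x * r.x - r.y * r.y + cx;
        n.y = 2.0f * r.x * r.y + cy;
        n.iter = r.iter + 1;
        again.push_back(n);
    }
}

// Repeat passes until the feedback stream is empty, then sort by ID.
std::vector<Record> run_julia(std::vector<Record> work, float cx, float cy,
                              int max_iter) {
    std::vector<Record> done;
    while (!work.empty()) {
        std::vector<Record> again;
        julia_pass(work, cx, cy, max_iter, again, done);
        work = std::move(again);
    }
    std::sort(done.begin(), done.end(),
              [](const Record& a, const Record& b) { return a.id < b.id; });
    return done;
}
```

Records that leave the loop early reach `done` before records that started ahead of them, which is why the ID field is carried along.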
10
Julia Set: Control Graph
[Control graph figure. Nodes: Rasterize, Iterator, Render. Arc labels as extracted: 197.48 Kwords, 9450 Kwords, 197.48 Kwords, 197.48 Kwords, 56.25 Kwords (800 tris).]
11
Adaptive Tessellation: Control Graph
[Control graph figure. Nodes: Oracle, Tess4, Tess3, Tess2, Bump, Split. Other label: Stack arc. Arc labels as extracted: 4010 Kwords (42771 tris), 5748 Kwords, 5661 Kwords, 3.46 Kwords, 368.6 Kwords, 1137 Kwords, 86.71 Kwords (800 tris).]
12
Scheduler
- Local arcs are system-allocated stream buffers (ideally stream registers)
- System picks kernel to run:
  - Has enough input data
  - Has space available in output buffer
  - Picks kernel that maximizes throughput
- Repeat until no more data in input stream
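The scheduling loop above might look roughly like the following sketch. Everything here is an assumption made for illustration: `Buffer`, `Kernel`, `runnable`, and `schedule` are hypothetical names, records are plain `int`s, "enough input data" is taken to mean at least one record, and "maximizes throughput" is approximated by picking the kernel that can process the most records in one launch.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

// Hypothetical bounded stream buffer standing in for a local arc.
struct Buffer {
    std::deque<int> data;
    std::size_t capacity;
    std::size_t free_space() const { return capacity - data.size(); }
};

// Hypothetical kernel: reads from `in`, applies `fn`, writes to `out`,
// handling at most `batch` records per launch.
struct Kernel {
    Buffer* in;
    Buffer* out;
    std::size_t batch;
    std::function<int(int)> fn;
};

// Records the kernel could process right now, limited by available
// input, free output space, and its batch size.
std::size_t runnable(const Kernel& k) {
    return std::min({k.in->data.size(), k.out->free_space(), k.batch});
}

// Repeatedly launch the kernel that can process the most records.
// Stops when no kernel can make progress. Returns the launch count.
int schedule(std::vector<Kernel>& kernels) {
    int launches = 0;
    for (;;) {
        Kernel* best = nullptr;
        std::size_t best_n = 0;
        for (Kernel& k : kernels) {
            std::size_t n = runnable(k);
            if (n > best_n) {
                best = &k;
                best_n = n;
            }
        }
        if (best == nullptr) break;
        for (std::size_t i = 0; i < best_n; ++i) {
            best->out->data.push_back(best->fn(best->in->data.front()));
            best->in->data.pop_front();
        }
        ++launches;
    }
    return launches;
}
```

With a small intermediate buffer, the scheduler alternates between producer and consumer kernels, which is the behaviour that makes on-chip stream registers attractive for local arcs.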
13
Observations:
- True conditionals and iteration:
  - Implementable with conditional write to stream output
  - NEED NULL COMPRESSION!
  - Multiple stream outputs also desirable
- Fragment scatter:
  - Implementable with render-to-vertex-array
  - F-buffer feedback also desirable
14
Simulating Null Compression
- Want conditional write to stream
  - No space wasted for nullified records
- Can simulate on current GPUs:
  - Write to array
  - Use occlusion test to count number of non-null records
  - Sort array by mark bit (use depth channel to mark)
  - Discard null records (now at end of array)
- Expensive, perhaps other ways…
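The simulation steps listed above can be mirrored on the CPU as a rough sketch. This is not GPU code: the `Marked` type and `compress_nulls` function are hypothetical, a `bool` field stands in for the mark written to the depth channel, and `std::count_if` plays the role of the occlusion test.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical array element: a payload plus the mark bit.
struct Marked {
    float value;
    bool is_null;  // stands in for the mark stored in the depth channel
};

// Compacts the array in place and returns the number of live records.
std::size_t compress_nulls(std::vector<Marked>& a) {
    // Count non-null records (the occlusion test's job on the GPU).
    std::size_t live = static_cast<std::size_t>(std::count_if(
        a.begin(), a.end(), [](const Marked& m) { return !m.is_null; }));

    // Sort by mark bit so non-null records come first. A stable sort
    // keeps the surviving records in their original relative order.
    std::stable_sort(a.begin(), a.end(), [](const Marked& x, const Marked& y) {
        return !x.is_null && y.is_null;
    });

    // Discard the null records, which are now at the end of the array.
    a.resize(live);
    return live;
}
```

The sort dominates the cost here, which matches the slide's remark that the simulation is expensive.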
15
HW Stream Null Compression
16
Stream Algebra

ShProgram p;
(a,b) = p(d,e,f);
(a,b) = p << (d,e,f);
(a,b) = p << d << e << f;
(a,b) = p << q << (d,e,f);
(a,b,u,v,w) = (p ** q) << (d,e,f,j,k,l);
fb += p << r << q << (c,n,v)[i];
ShStream cq = optimize(q << (c,n,v)[i]);
fb += p << r << cq;
a += s * t;
ShCampaign k =...
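To see why a single `<<` operator can express both composition and application in expressions like `p << q << (d,e,f)`, consider the following toy sketch. It is not the Sh API: `Program` and `Stream` are hypothetical stand-ins, a stream is simplified to a flat vector of floats, and tuples of streams are not modelled.

```cpp
#include <functional>
#include <vector>

// Hypothetical stream: a flat sequence of float records.
using Stream = std::vector<float>;

// Hypothetical program: any function from one stream to another.
struct Program {
    std::function<Stream(const Stream&)> run;
};

// program << program: composition. The right-hand program runs first.
Program operator<<(const Program& p, const Program& q) {
    return Program{[p, q](const Stream& s) { return p.run(q.run(s)); }};
}

// program << stream: application.
Stream operator<<(const Program& p, const Stream& s) {
    return p.run(s);
}
```

Because `<<` associates to the left in C++, `p << q << s` parses as `(p << q) << s`: the two programs are composed first and the combined program is then applied to the stream. Under the assumptions above, composing before applying is what would give an `optimize` step a whole pipeline to work on.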
17
Targets
- GPUs (via Cg, OGL Slang, etc.)
- SIMD
- Multithreaded MIMD
- SSE, SSE2 (via Intel compiler)
- Cluster computers
- Shared-mem computers
- PS2, PS3
18
Issues:
- Null compression can be simulated with sparse texture compression, but slow.
- H/W support would be useful.
  - On-chip stream registers…
  - Off-chip stream buffer compression…
  - On-GPU scheduler…
- Compilation of recursive algorithms?
- Virtualization: registers, stream record size, stream length, textures, array read-write, synchronization, etc.
- Abstractions: streams, sequences, sets, indexes, arrays, programs, campaigns, shapes, etc.
19
Material Mapping: Control Graph
[Control graph figure. Nodes as extracted: Rast, Split, HF, Wood, HF + Wood.]
20
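Control Construct Templates
[Diagram: WHILE (C) { A } template. Node labels as extracted: P, C, A, S, Predecessor, Successor.]
[Diagram: IF (C) { A } ELSE { B } template. Node labels as extracted: P, C, A, B, S, Predecessor, Successor.]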