Published by Christina Caldwell. Modified over 9 years ago.
1
Polygon Rendering on a Stream Architecture
John D. Owens, William J. Dally, Ujval J. Kapasi, Scott Rixner, Peter Mattson, Ben Mowery
Concurrent VLSI Architecture Group, Computer Systems Laboratory, Stanford University
2
Today’s Best Hardware
Commercial hardware: fast, cheap, ubiquitous; flexibility limited.
OpenGL scenes: programmable streams deliver comparable performance.
Frame from Quake 3 Arena, © id Software
3
Today’s Best Software
Today’s software solutions: powerful and flexible, but slow.
OpenGL scenes: streams deliver 20x performance.
Frame from A Bug’s Life, © Pixar Animation Studios, 1998
4
The Vision
Performance of a special-purpose processor + programmability of a general-purpose processor
“Real-Time Renderman”
5
Outline
What is stream processing?
The Imagine architecture
Polygon rendering on a stream architecture
Results
Conclusions
6
Kernels and Streams
A stream is a set of elements of an arbitrary datatype.
A computational kernel operates on streams.
Figure labels: Kernel (Transform), Streams
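The kernel and stream definitions above can be sketched in a few lines of Python. This is only an illustration, not Imagine's programming interface: the stream is modeled as a plain list, and the names `make_transform_kernel`, `translate`, and `vertex_stream` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Vec4 = Tuple[float, float, float, float]
Matrix4 = Sequence[Sequence[float]]  # 4 rows of 4 floats


def make_transform_kernel(m: Matrix4) -> Callable[[List[Vec4]], List[Vec4]]:
    """Build a kernel that multiplies every vertex in a stream by matrix m."""

    def transform(vertices: List[Vec4]) -> List[Vec4]:
        out: List[Vec4] = []
        for v in vertices:
            # Same computation for every element: one matrix-vector product.
            out.append(tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(4)))
        return out

    return transform


# Illustrative matrix: translate by (1, 2, 3) in homogeneous coordinates.
translate = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 2.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 0.0, 1.0],
]

transform = make_transform_kernel(translate)
vertex_stream = [(0.0, 0.0, 0.0, 1.0), (1.0, 1.0, 1.0, 1.0)]
transformed_stream = transform(vertex_stream)  # input stream in, output stream out
```

The kernel touches each element independently and in the same way, which is the property that lets a stream processor run it in parallel across its clusters.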
7
Stream Processing
All data is streams!
2 levels of programming: stream-level code and kernel-level code.
Figure labels: kernels Transform, Shader, Zcompare; buffers Z Buffer, Color Buffer; streams offset, z, and (z, color)
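A minimal sketch of the two programming levels, using the Zcompare stage named on the slide. It assumes that smaller z means nearer, that the Z and color buffers are flat lists indexed by offset, and that no two fragments in a batch share an offset; the function names are hypothetical.

```python
from typing import List, Tuple

Color = Tuple[int, int, int]
Fragment = Tuple[int, float, Color]  # (offset, z, color)


# Kernel-level code: computation on stream elements only, no memory access.
def zcompare(fragments: List[Fragment], old_z: List[float]) -> List[Fragment]:
    """Keep the fragments that are nearer than the stored depth."""
    return [f for f, z_old in zip(fragments, old_z) if f[1] < z_old]


# Stream-level code: moves streams to and from memory and sequences kernels.
def composite(fragments: List[Fragment],
              z_buffer: List[float],
              color_buffer: List[Color]) -> None:
    offsets = [f[0] for f in fragments]       # offset stream
    old_z = [z_buffer[o] for o in offsets]    # indexed load of the z stream
    visible = zcompare(fragments, old_z)      # kernel call
    for offset, z, color in visible:          # indexed store of z and color
        z_buffer[offset] = z
        color_buffer[offset] = color


z_buffer = [1.0, 1.0, 1.0, 1.0]
color_buffer = [(0, 0, 0)] * 4
composite([(0, 0.5, (255, 0, 0)), (2, 1.5, (0, 255, 0))], z_buffer, color_buffer)
```

Only the first fragment passes the depth test here, so only offset 0 of each buffer is updated.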
8
Media Apps and Streams
Producer-consumer locality
High arithmetic requirements
Homogeneous computation
Efficient control
Data parallelism
… poor match for microprocessors
Figure labels: kernels Transform, Shader, Zcompare; buffers Z Buffer, Color Buffer; streams offset, z, and (z, color)
9
The Imagine Architecture
10
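Bandwidth Hierarchy
Peak BW: 4 GB/s between SDRAM and the Stream Register File; 32 GB/s between the Stream Register File and the ALU clusters; 544 GB/s within the ALU clusters.
ALU clusters run under SIMD/VLIW control.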
11
Cluster Organization
12
Imagine Stats & Status
0.59 cm² CMOS chip
500 MHz
Circuits/Logic: expected completion 9/15/00
Tapeout: expected Q4/2000
Fab: TI GS30KA process (0.15 µm drawn)
13
Polygon Rendering Outline
Overview of OpenGL pipeline
How we map OpenGL into streams & kernels
How stream operations are sequenced
How kernels are mapped onto Imagine
Use of stream recirculation
Detail of 3 steps in the pipeline: matrix transformation, scan conversion, enforcing ordering in composite stage
14
OpenGL Pipeline Overview
Pipeline stages: Application, Geometry, Rasterization, Composite, Image
OpenGL: has state, requires immediate mode, respects ordering.
15
Pipeline Detail
Input Data
Geometry: Transform, GLShader, Primitive Assembly, Cull, Project
Rasterization: Spanprep, Spangen, Spanrast, Texture Lookup
Composite: Hash, Sort / Merge, Z Lookup, Zcompare, Compact, Color/Z Write
Image
16
Pipeline Stream Datatypes
Geometry: Transform, GLShader, Primitive Assembly, Cull, Project
Rasterization: Spanprep, Spangen, Spanrast, Texture Lookup
Composite: Hash, Sort / Merge, Z Lookup, Zcompare, Compact, Color/Z Write
Image
Stream datatypes: vertices, triangles, spans, fragments, offsets, depths
Most data is floating point.
17
Stream Recirculation
Strip-mining
Memory accesses: initial load of vertices; lookup of color/z/texture; writeback of color/z.
All other data accesses are local to the SRF.
Figure labels: Memory, SRF, Clusters; kernels Transform, Shader, Zcompare; buffers Z Buffer, Color Buffer; streams offset, z, and (z, color)
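Strip-mining as described above can be sketched as a loop over fixed-size batches: each batch is loaded once, passed through every kernel while it stays in fast local storage, and only the final result is written back. The batch size and the helper names `strip_mine` and `run_pipeline` are assumptions for illustration; on Imagine the batch would be sized to fit the SRF.

```python
from typing import Callable, Iterator, List, Sequence

Kernel = Callable[[List[int]], List[int]]


def strip_mine(data: Sequence[int], batch_size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


def run_pipeline(data: Sequence[int], kernels: List[Kernel], batch_size: int) -> List[int]:
    results: List[int] = []
    for batch in strip_mine(data, batch_size):
        stream = list(batch)        # initial load from memory
        for kernel in kernels:      # intermediate streams are never written back
            stream = kernel(stream)
        results.extend(stream)      # writeback of the final stream only
    return results


double = lambda s: [2 * x for x in s]
increment = lambda s: [x + 1 for x in s]
output = run_pipeline(list(range(10)), [double, increment], batch_size=4)
```

With ten inputs and a batch size of four, the loop runs three times, and the last batch holds only two elements.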
18
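Stream and Kernel Flow
Excerpt from ADVS-1 run
Figure: timeline with rows CLUSTERS, MEM STR 0, and MEM STR 1. Cluster kernels: xform, project, assemble, rasterize, zcompare, then xform for the next batch. Memory streams: Z load, Z store, Color store, Texture load, Vertex load for next batch.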
19
Mapping Xform to Imagine
Figure labels: RAM, SRF, Cluster; Transform kernel shown across Memory, SRF, Clusters
20
Mapping Spanrast to Imagine
Figure labels: SRF, Cluster; Spanrast kernel shown across Memory, SRF, Clusters
21
Enforcing ordering
General sort is possible, but too expensive. Hash is much cheaper!
Hash function: 12 bits (low 6 bits of x, low 6 bits of y)
Hash table: 2^12 entries, 2 bits/entry, 16 words/scratchpad/cluster
Compact: enforces ordering constraint
Figure labels: Hash, Sort, Merge, Compact, Zcompare
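The 12-bit hash above can be sketched as follows. The slide gives only the inputs (low 6 bits of x and of y); placing y in the high bits is an assumption, and `split_conflicts` is a hypothetical helper showing one way such a hash could separate fragments that may touch the same pixel, not the authors' exact kernel.

```python
from typing import Any, List, Set, Tuple

Fragment = Tuple[int, int, Any]  # (x, y, payload)


def pixel_hash(x: int, y: int) -> int:
    """12-bit hash: low 6 bits of y in bits 11..6, low 6 bits of x in bits 5..0."""
    return ((y & 0x3F) << 6) | (x & 0x3F)


def split_conflicts(fragments: List[Fragment]) -> Tuple[List[Fragment], List[Fragment]]:
    """Split a batch into a conflict-free stream and a deferred stream.

    The first fragment seen for each hash value goes to the conflict-free
    stream; later fragments with the same hash are deferred. Input order is
    preserved within each output stream.
    """
    seen: Set[int] = set()
    conflict_free: List[Fragment] = []
    deferred: List[Fragment] = []
    for frag in fragments:
        h = pixel_hash(frag[0], frag[1])
        if h in seen:
            deferred.append(frag)
        else:
            seen.add(h)
            conflict_free.append(frag)
    return conflict_free, deferred


batch = [(1, 1, "a"), (2, 2, "b"), (1, 1, "c"), (65, 1, "d")]
conflict_free, deferred = split_conflicts(batch)
```

Because only the low bits are hashed, distinct pixels such as (1, 1) and (65, 1) collide. The scheme is therefore conservative: it may defer fragments that do not actually conflict, but it never lets two fragments for the same pixel into the conflict-free stream.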
22
Image Composition
Figure labels: RAM, SRF, Cluster (Memory, SRF, Clusters); Zcompare kernel; Z Buffer and Color Buffer; streams offset, z, (z, color), and (offset, z, color)
23
Benchmarks
ADVS-1: 62k vertices as point-sampled polygons (SPECviewperf 6.1.1 Advanced Visualizer)
ADVS-8: mipmapped version of ADVS-1
Sphere: 82k lit, Gouraud-shaded triangles; 3 positional lights
Fill: 20k mipmapped 25-pixel triangles
Figure labels: ADVS, Sphere
24
Experimental setup
Comparison systems: Microsoft opengl32.dll (sustained), NVIDIA Quadro (sustained), NVIDIA Quadro (peak)
Test system: 450 MHz PIII Xeon, NT 4.0
For comparison: low-overhead trace player (no application overhead); average over 100s of frames (no startup costs); disabled vsync
25
Results Summary
26
Stream-level Performance
Computation, not memory, bound
Highest memory system occupancy: 58.7%
Cluster occupancy: 94.3% - 98.8%
Reuse 5.6 GOPS on Sphere
Figure: timeline with rows CLUSTERS, MEM STR 0, and MEM STR 1
27
Imagine Kernel Breakdown
Majority of time is in rasterization
ADVS-8 has 2.5x the ops/frame of ADVS-1
Figure label: ADVS-8
28
Future Directions
Extend generality of OpenGL pipeline
Add more complex scenes
Programmable shading and lighting
Straightforward to add per-vertex/per-fragment ops
Eliminate multipass
Goal: “Toolbox” of flexible elements
Non-polygon rendering: raytracing, IBR, …
Scalability: multi-Imagine implementations
29
Conclusions
Streams: powerful primitive
Stream architectures: enable high performance
Flexibility of a general-purpose processor: 20x better frame rates than commercial software
Performance of a special-purpose processor: comparable frame rates to commercial hardware
30
Acknowledgements
DARPA
Industrial sponsors: Texas Instruments, Intel Corporation
Matthew Eldridge and Kekoa Proudfoot
Brian Towles and Brucek Khailany
Anonymous reviewers, for helpful comments
The US Passport Office: same-day turnaround!