Stream Architecture: Rethinking Media Processor Design Scott Rixner April 9, 2001 Rice University Computer Systems Laboratory
Media Processing: video/image compression and decompression (MPEG, JPEG, ...); signal processing (DSL modems, cellular base stations, ...); image synthesis (polygon rendering, image-based rendering, ...); image understanding (face recognition, depth extraction, ...). Scott Rixner Stream Architecture
Stereo Depth Extraction: left and right camera images (640x480 @ 30 fps) are processed into a depth map. Requirements: 11 GOPS. Imagine stream processor: 12.1 GOPS, 4.6 GOPS/W.
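As a rough check on the stated requirement, the 11 GOPS figure can be converted into a per-pixel operation budget. This is a minimal sketch that assumes the work is spread evenly over every pixel of one 640x480 frame pair at 30 fps; the function name `ops_per_pixel` is illustrative and does not come from the slides.

```python
def ops_per_pixel(gops: float, width: int, height: int, fps: int) -> float:
    """Operations available per pixel for a sustained rate given in GOPS."""
    pixels_per_second = width * height * fps
    return gops * 1e9 / pixels_per_second


# Figures taken from the slide: 11 GOPS required at 640x480 @ 30 fps.
required = ops_per_pixel(11.0, 640, 480, 30)
print(f"{required:.0f} operations per pixel")
```

Under that assumption the budget works out to roughly 1,200 operations per pixel, which is consistent with the slide's point that the application is compute intensive.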
Outline: Stream Processing; VLSI Constraints; Register Organization; Imagine; Conclusions
Media Processing Characteristics: low-precision data (24% 8-bit integer operations, 29% 16-bit integer operations); abundant data parallelism; little global data reuse (average of 1.5 references per global data word); numerous computations per global reference (50-500 operations per global data reference)
Stream Processing: streams carry input data into kernels and output data out of them. [Figure: Image 0 and Image 1 -> convolve -> SAD -> Depth Map.] Little data reuse (pixels never revisited); highly data parallel (output pixels not dependent on other output pixels); compute intensive (>60 operations per memory reference)
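The stream model on this slide, in which kernels consume input streams and produce output streams, can be sketched with Python generators. This is an illustrative sketch only and not the Imagine programming model: the names `convolve_kernel` and `sad_kernel`, the 1-D filter, the window size, and the disparity search range are all assumptions made for the example.

```python
from typing import Iterable, Iterator, List, Sequence

Row = List[int]


def convolve_kernel(rows: Iterable[Sequence[int]], taps: Sequence[int]) -> Iterator[Row]:
    """Kernel: filter each row of the input stream with a 1-D filter (valid region only)."""
    k = len(taps)
    for row in rows:
        yield [sum(taps[j] * row[i + j] for j in range(k))
               for i in range(len(row) - k + 1)]


def sad_kernel(left_rows: Iterable[Sequence[int]], right_rows: Iterable[Sequence[int]],
               max_disparity: int, window: int) -> Iterator[Row]:
    """Kernel: per pixel, pick the disparity with the smallest sum of absolute differences."""
    for left, right in zip(left_rows, right_rows):
        out = []
        for x in range(max_disparity, len(left) - window + 1):
            costs = [sum(abs(left[x + w] - right[x - d + w]) for w in range(window))
                     for d in range(max_disparity + 1)]
            out.append(costs.index(min(costs)))  # first minimum wins on ties
        yield out


# Toy one-row images: the left row is the right row shifted by two pixels.
right_image = [[0, 0, 5, 9, 5, 0, 0, 0, 0, 0]]
left_image = [[0, 0, 0, 0, 5, 9, 5, 0, 0, 0]]
taps = [1, 2, 1]  # hypothetical smoothing filter

depth_map = list(sad_kernel(convolve_kernel(left_image, taps),
                            convolve_kernel(right_image, taps),
                            max_disparity=3, window=3))
print(depth_map)
```

Each row flows through the kernels once and is never revisited, and every output pixel is computed independently of the others, which mirrors the three properties listed on the slide. For the toy input the recovered disparity is 2 at every output position.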
Locality and Concurrency: operations within a kernel operate on local data; kernels can be partitioned across chips to exploit control parallelism; streams expose data parallelism. [Figure: Image 0 -> convolve -> convolve and Image 1 -> convolve -> convolve, both feeding SAD -> Depth Map.]
Sony PlayStation2 Emotion Engine. [Figure: block diagram with labels FPU, MIPS Core, VPU0, VPU1, IPU, Graphics Synthesizer, Display, and RDRAM, I/O, DMAC, etc.]
Special vs. General Purpose. Special purpose: fixed function, high performance. General purpose: programmable, insufficient performance. [Figure labels: Instruction Cache, IR, IP, Registers.]
Register Files Dwarf ALUs
Register File Area. Each register bit cell requires 1 word line per port and 1 bit line per port, so each cell grows as $p^2$. With $R$ registers in the file, area: $p^2 R \propto N^3$.
Register File Access Delay. The signal must traverse a word line to access the cell and a bit line to transfer data; wire capacitance dominates. Delay: $p R^{1/2} \propto N^{3/2}$.
Register File Power Dissipation. 100% utilization requires driving all $p R^{1/2}$ bit lines; wire capacitance dominates. Power: $p^2 R \propto N^3$.
Centralized Register Organization: area, power $\propto N^3$; delay $\propto N^{3/2}$
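The step from the port count $p$ and register count $R$ to the number of ALUs $N$ is not spelled out on the slides. A worked version follows, assuming that both the number of ports and the number of registers in a centralized file grow linearly with $N$; that assumption is inferred from the stated results rather than quoted from the source.

```latex
\begin{align*}
  p &\propto N, \qquad R \propto N
      && \text{(assumed: ports and registers scale with the ALU count)} \\
  \text{Area}  &\propto p^{2} R \propto N^{2} \cdot N = N^{3} \\
  \text{Delay} &\propto p R^{1/2} \propto N \cdot N^{1/2} = N^{3/2} \\
  \text{Power} &\propto p^{2} R \propto N^{3}
\end{align*}
```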
Partitioned Organizations: SIMD (data-parallel axis); Distributed Register Files, DRF (instruction-level parallel axis); Hierarchical (memory hierarchy axis); Stream (optimizing for streams)
SIMD Register Organization: area, power $\propto N^3/C^2$; delay $\propto (N/C)^{3/2}$
Distributed Register Organization: area, power $\propto N^2$; delay $\propto N$
Combining SIMD and DRF. [Figure labels: Scalar, SIMD, Central, DRF.]
Hierarchical Register Organization: area, power $\propto N^3$; delay $\propto N^{3/2}$. [Figure label: Hierarchical, T=40.]
Hierarchical Organizations. [Figure labels: Scalar, SIMD, Central, DRF.]
Stream Register Organization: area, power $\propto N^2/C$; delay $\propto N/C$
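The scaling rules quoted for the central, SIMD, distributed, and stream organizations can be compared side by side with a small calculation. This is a sketch under stated assumptions: $N$ is the number of ALUs, $C$ is taken to be the number of clusters, and all constant factors are dropped, so only ratios between organizations are meaningful. The function name `relative_costs` is illustrative.

```python
def relative_costs(n_alus: int, n_clusters: int) -> dict:
    """Unitless cost terms from the slides' proportionalities (constants omitted)."""
    n, c = float(n_alus), float(n_clusters)
    return {
        "central": {"area_power": n ** 3, "delay": n ** 1.5},
        "simd": {"area_power": n ** 3 / c ** 2, "delay": (n / c) ** 1.5},
        "drf": {"area_power": n ** 2, "delay": n},
        "stream": {"area_power": n ** 2 / c, "delay": n / c},
    }


# Example: 48 ALUs split into 8 clusters (the cluster count is an assumption here).
costs = relative_costs(48, 8)
for name, terms in costs.items():
    print(f"{name:8s} area/power ~ {terms['area_power']:10.1f}  delay ~ {terms['delay']:7.2f}")
```

Because the proportionalities omit constant factors, the ratios this produces are not expected to reproduce the detailed improvement figures reported on the comparison slide; they only show how each organization scales.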
Stream Organizations. [Figure labels: Scalar, SIMD, Central, DRF.]
Comparison of Organizations (48 32-bit ALUs, 500 MHz). The stream organization improves on the central organization by 195x in area, 20x in delay, and 430x in power.
Performance: 16% performance drop (8% with latency constraints); 180x improvement
Stream Architecture. Stream processing: matched to media processing; exposes locality and concurrency. Stream register organization: efficiency of special-purpose hardware; optimized for streaming applications. Data bandwidth: bandwidth hierarchy; memory access scheduling; conditional streams.
The Imagine Stream Processor. [Figure: block diagram of the Imagine stream processor with labels Host Processor, Stream Controller, Microcontroller, Network Interface, Stream Register File, ALU Clusters 0-7, Streaming Memory System, and SDRAM.]
Arithmetic Clusters. [Figure: local register files feeding functional units labeled +, +, +, *, *, /, and CU (communication unit), with a scratch-pad register file, a cross point, the intercluster network, and connections to and from the SRF.]
Bandwidth Hierarchy. [Figure: SDRAM -> stream register file -> ALU clusters, labeled 2 GB/s, 32 GB/s, 544 GB/s.] 41.2 32-bit operations per word of memory bandwidth.
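The bandwidth figures on this slide can be related to one another with a short calculation. This is a sketch under assumptions not stated on the slide: the three bandwidths are read, in order, as memory, stream register file, and cluster-local bandwidth; 1 GB is taken as 10^9 bytes; and the 41.2 figure is read as peak operations per 32-bit word of memory bandwidth.

```python
BYTES_PER_WORD = 4  # 32-bit words

# (label, GB/s); the labels are an assumed reading of the figure.
levels = [
    ("memory (SDRAM)", 2.0),
    ("stream register file", 32.0),
    ("ALU clusters (local)", 544.0),
]


def bandwidth_ratios(levels):
    """Bandwidth of each level relative to the first (outermost) level."""
    base = levels[0][1]
    return [bw / base for _, bw in levels]


ratios = bandwidth_ratios(levels)

# Operation rate implied by 41.2 operations per word of memory bandwidth.
memory_words_per_second = levels[0][1] * 1e9 / BYTES_PER_WORD
implied_gops = 41.2 * memory_words_per_second / 1e9

for (label, bw), ratio in zip(levels, ratios):
    print(f"{label:22s} {bw:6.1f} GB/s  ({ratio:.0f}x memory)")
print(f"implied rate: {implied_gops:.1f} GOPS")
```

Read this way, each level of the hierarchy supplies roughly an order of magnitude more bandwidth than the level outside it, which is what allows many operations to be performed per word fetched from memory.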
Stream Recirculation
Bandwidth Demands of FIR Filter
Bandwidth Utilization of FIR Filter
Performance. [Chart labels: 16-bit kernels, floating-point kernel, 16-bit applications, floating-point application.]
Power. GOPS/W: 4.6, 6.9, 4.1, 10.2, 9.6, 2.4, 6.3
Relative Performance and Power Efficiency. [Chart labels: FFT, Performance, Power Efficiency.]
Imagine Floorplan. Tapeout ~Q2 ’01. 21 million transistors: 6M SRF SRAM, 6M UC SRAM, 6M clusters, 3M other. Target: 32 FO4 (300 MHz at SSSS, 500 MHz at TTSS). TI GS30KA: 0.15 µm Ldrawn. 457 signal pins.
Imagine Team: William J. Dally, Ujval Kapasi, Brucek Khailany, Peter Mattson, Jinyung Namkoong, John Owens, Ben Serebrin, Brian Towles, Scott Rixner, Don Alpert (Intel), Ghazi Ben Amor, Chris Buehler (MIT), JP Grossman (MIT), Brad Johanson, Abelardo Lopez-Lagunas, Ben Mowery, Manman Ren
Conclusions. Media processing: little data reuse; highly data parallel; compute intensive. VLSI: stream register organization; bandwidth hierarchy. Imagine: stream architecture; 10 GOPS sustained application performance; 5 GOPS/W application power efficiency.