1
Stream Architecture: Rethinking Media Processor Design
Scott Rixner
April 9, 2001
Rice University Computer Systems Laboratory
2
Media Processing
Video/image compression & decompression: MPEG, JPEG, ...
Signal processing: DSL modems, cellular base stations, ...
Image synthesis: polygon rendering, image-based rendering, ...
Image understanding: face recognition, depth extraction, ...
Scott Rixner Stream Architecture
3
Stereo Depth Extraction
[Figure: left camera image and right camera image, 30 fps, producing a depth map]
Requirements: 11 GOPS
Imagine stream processor: 12.1 GOPS, 4.6 GOPS/W
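As a quick check of the figures on this slide, dividing the sustained rate by the power efficiency gives the implied power draw for this application, and dividing it by the requirement gives the headroom. This is arithmetic on the slide's own numbers only; the symbol $P$ is introduced here for illustration.

```latex
\begin{align*}
P &\approx \frac{12.1\ \text{GOPS}}{4.6\ \text{GOPS/W}} \approx 2.6\ \text{W} \\
\frac{12.1\ \text{GOPS}}{11\ \text{GOPS}} &= 1.1
\end{align*}
```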
4
Outline
Stream Processing
VLSI Constraints
Register Organization
Imagine
Conclusions
5
Media Processing Characteristics
Low-precision data
  24% 8-bit integer operations
  29% 16-bit integer operations
Abundant data-parallelism
Little global data reuse
  Average of 1.5 references per global data word
Numerous computations per global reference
  operations per global data reference
6
Stream Processing
[Figure: Image 0 and Image 1 (stream input data) pass through a convolve kernel and a SAD kernel to produce the Depth Map (output data)]
Little data reuse (pixels never revisited)
Highly data parallel (output pixels not dependent on other output pixels)
Compute intensive (>60 operations per memory reference)
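A minimal sketch of what a sum-of-absolute-differences (SAD) kernel might look like when written over streams of image rows. It is an illustration only, not the Imagine kernel: the function names, the window size, the disparity range, and the border clamping are all assumptions made here. Each output pixel is computed independently of every other output pixel, which is the data parallelism the slide refers to.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple


def _clamp(i: int, n: int) -> int:
    """Clamp an index into [0, n - 1] (assumed border handling)."""
    return max(0, min(n - 1, i))


def sad_row(left: Sequence[int], right: Sequence[int],
            max_disparity: int = 3, half_window: int = 1) -> List[int]:
    """For each pixel, pick the disparity with the smallest SAD cost."""
    n = len(left)
    out = []
    for x in range(n):
        best_d, best_cost = 0, None
        for d in range(max_disparity + 1):
            # Sum of absolute differences over a small 1-D window.
            cost = 0
            for k in range(-half_window, half_window + 1):
                l = left[_clamp(x + k, n)]
                r = right[_clamp(x + k - d, n)]
                cost += abs(l - r)
            # Strict comparison keeps the smallest disparity on ties.
            if best_cost is None or cost < best_cost:
                best_d, best_cost = d, cost
        out.append(best_d)
    return out


def sad_kernel(rows: Iterable[Tuple[Sequence[int], Sequence[int]]],
               **kwargs) -> Iterator[List[int]]:
    """Consume a stream of (left_row, right_row) pairs, emit disparity rows."""
    for left, right in rows:
        yield sad_row(left, right, **kwargs)


if __name__ == "__main__":
    right = [0, 0, 10, 20, 30, 0, 0, 0]
    left = [0, 0, 0, 0, 10, 20, 30, 0]  # same feature, shifted by 2 pixels
    for disparities in sad_kernel([(left, right)]):
        print(disparities)
```

The kernel reads each input row once and never revisits it, so the only state it keeps is local to the row being processed.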
7
Locality and Concurrency
Operations within a kernel operate on local data
Kernels can be partitioned across chips to exploit control parallelism
Streams expose data parallelism
[Figure: Image 0 → convolve → convolve and Image 1 → convolve → convolve, both feeding SAD → Depth Map]
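The kernel graph on this slide can be sketched with Python generators, where each generator stands in for a kernel and the values passed between them stand in for streams. This is a hypothetical illustration: the names `convolve_kernel`, `sad_kernel`, and `sad_pipeline`, the 1-D filter taps, and the per-row SAD score are assumptions, and the real application also searches over disparities, which is omitted here.

```python
from typing import Iterable, Iterator, List, Sequence


def convolve_row(row: Sequence[int], taps: Sequence[int]) -> List[int]:
    """1-D filter with clamped borders (taps are assumed symmetric)."""
    n, h = len(row), len(taps) // 2
    return [
        sum(t * row[max(0, min(n - 1, x + j - h))] for j, t in enumerate(taps))
        for x in range(n)
    ]


def convolve_kernel(rows: Iterable[Sequence[int]],
                    taps: Sequence[int]) -> Iterator[List[int]]:
    """Kernel: touches only the row it is currently given (local data)."""
    for row in rows:
        yield convolve_row(row, taps)


def sad_kernel(rows_a: Iterable[Sequence[int]],
               rows_b: Iterable[Sequence[int]]) -> Iterator[int]:
    """Kernel: sum of absolute differences between two row streams."""
    for a, b in zip(rows_a, rows_b):
        yield sum(abs(x - y) for x, y in zip(a, b))


def sad_pipeline(image0, image1, taps=(1, 2, 1)) -> Iterator[int]:
    """Image 0 and Image 1 each pass through two convolve kernels, then SAD."""
    s0 = convolve_kernel(convolve_kernel(iter(image0), taps), taps)
    s1 = convolve_kernel(convolve_kernel(iter(image1), taps), taps)
    return sad_kernel(s0, s1)


if __name__ == "__main__":
    img0 = [[1, 1, 1], [4, 5, 6]]
    img1 = [[0, 0, 0], [4, 5, 6]]
    print(list(sad_pipeline(img0, img1)))
```

Because each stage only consumes and produces streams, the two convolve chains are independent of each other and could in principle run on separate hardware, which is the kind of partitioning the slide describes.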
8
Sony PlayStation2
[Block diagram labels: Emotion Engine, FPU, MIPS Core, VPU0, VPU1, IPU, Graphics Synthesizer, Display, RDRAM, I/O, DMAC, etc.]
9
Special vs. General Purpose
Special purpose: fixed function, high performance
General purpose: programmable, insufficient performance
[Figure labels: Instruction Cache, IR, IP, Registers]
10
Register Files Dwarf ALUs
11
Register File Area
Each cell requires:
  1 word line per port
  1 bit line per port
Each cell grows as $p^2$
$R$ registers in the file
Area: $p^2 R \propto N^3$
[Figure: register bit cell]
12
Register File Access Delay
Signal must traverse:
  Word line to access cell
  Bit line to transfer data
Wire capacitance dominates
Delay: $pR^{1/2} \propto N^{3/2}$
[Figure: register file]
13
Register File Power Dissipation
100% utilization requires driving all $pR^{1/2}$ bit lines
Wire capacitance dominates
Power: $p^2 R \propto N^3$
[Figure: register file]
14
Centralized Register Organization
Area, Power $\propto N^3$; Delay $\propto N^{3/2}$
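The area, delay, and power relations on the register file slides reduce to powers of $N$ under one assumption that the slides imply but do not state: both the port count $p$ and the register count $R$ grow linearly with the number of ALUs $N$. Taking that as given, the steps can be written out as:

```latex
\begin{align*}
p &\propto N, \qquad R \propto N
  && \text{(assumption)} \\
\text{Area} &\propto p^{2} R \propto N^{2} \cdot N = N^{3} \\
\text{Delay} &\propto p R^{1/2} \propto N \cdot N^{1/2} = N^{3/2} \\
\text{Power} &\propto p^{2} R \propto N^{3}
\end{align*}
```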
15
Partitioned Organizations
SIMD: data-parallel axis
Distributed Register Files (DRF): instruction-level parallel axis
Hierarchical: memory hierarchy axis
Stream: optimizing for streams
16
SIMD Register Organization
Area, Power $\propto N^3/C^2$; Delay $\propto (N/C)^{3/2}$
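One way to read the SIMD result, assuming (this is an interpretation, not stated on the slide) that $C$ is the number of clusters and that each cluster holds its own centralized register file serving $N/C$ ALUs:

```latex
\begin{align*}
\text{Area, Power} &\propto C \left(\frac{N}{C}\right)^{3} = \frac{N^{3}}{C^{2}} \\
\text{Delay} &\propto \left(\frac{N}{C}\right)^{3/2}
\end{align*}
```

The total area and power sum over the $C$ register files, while the delay is set by a single one of them.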
17
Distributed Register Organization
Area, Power $\propto N^2$; Delay $\propto N$
18
Combining SIMD and DRF
[Figure labels: Scalar, SIMD, Central, DRF]
19
Hierarchical Register Organization
T=40
Area, Power $\propto N^3$; Delay $\propto N^{3/2}$
20
Hierarchical Organizations
[Figure labels: Scalar, SIMD, Central, DRF]
21
Stream Register Organization
Area, Power $\propto N^2/C$; Delay $\propto N/C$
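The stream result has the same shape as applying the distributed (DRF) scaling within each of $C$ clusters of $N/C$ ALUs. The derivation below is an interpretation under that assumption, using the DRF relations from the earlier slide (area and power growing as the square of the ALU count, delay growing linearly); it ignores any cost of the stream register file itself.

```latex
\begin{align*}
\text{Area, Power} &\propto C \left(\frac{N}{C}\right)^{2} = \frac{N^{2}}{C} \\
\text{Delay} &\propto \frac{N}{C}
\end{align*}
```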
22
Stream Organizations
[Figure labels: Scalar, SIMD, Central, DRF]
23
Comparison of Organizations
48 ALUs (32-bit), 500 MHz
Stream organization improves central organization by:
  Area: 195x, Delay: 20x, Power: 430x
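To see how the asymptotic relations from the organization slides compare at this design point, the sketch below evaluates them for $N = 48$ ALUs. The cluster count $C = 8$ is an assumption taken from the eight ALU clusters of Imagine; the function name and dictionary layout are made up for illustration. These are proportionalities with constant factors dropped, so the ratios are not expected to reproduce the 195x, 20x, and 430x figures above, which presumably come from a more detailed model.

```python
def register_scaling(n_alus: int, n_clusters: int) -> dict:
    """Asymptotic area/power and delay terms for each register organization.

    Values are unitless proportionality terms, not physical quantities.
    """
    n, c = float(n_alus), float(n_clusters)
    return {
        "central": {"area_power": n ** 3, "delay": n ** 1.5},
        "simd": {"area_power": n ** 3 / c ** 2, "delay": (n / c) ** 1.5},
        "drf": {"area_power": n ** 2, "delay": n},
        "stream": {"area_power": n ** 2 / c, "delay": n / c},
    }


def improvement_over_central(n_alus: int, n_clusters: int) -> dict:
    """Ratio of the central term to each organization's term."""
    terms = register_scaling(n_alus, n_clusters)
    central = terms["central"]
    return {
        name: {key: central[key] / value[key] for key in value}
        for name, value in terms.items()
    }


if __name__ == "__main__":
    # N = 48 from the slide; C = 8 is an assumption (Imagine's cluster count).
    for name, ratios in improvement_over_central(48, 8).items():
        print(f"{name:8s} area/power x{ratios['area_power']:.1f} "
              f"delay x{ratios['delay']:.1f}")
```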
24
Performance
16% performance drop (8% with latency constraints)
180x improvement
25
Stream Architecture
Stream Processing
  Matched to media processing
  Exposes locality and concurrency
Stream Register Organization
  Efficiency of special-purpose hardware
  Optimized for streaming applications
Data bandwidth
  Bandwidth hierarchy
  Memory access scheduling
  Conditional streams
26
The Imagine Stream Processor
[Block diagram labels: Host Processor, Stream Controller, Network Interface, Microcontroller, Stream Register File, ALU Cluster 0 through ALU Cluster 7, Streaming Memory System, SDRAM]
27
Arithmetic Clusters
[Block diagram labels: Local Register File; function units +, +, +, *, *, /; Communication Unit (CU); Scratch-pad Register File; Cross Point; Intercluster Network; To SRF; From SRF]
28
Bandwidth Hierarchy
[Figure labels: SDRAM (four), Stream Register File, ALU Clusters]
2 GB/s, 32 GB/s, 544 GB/s
bit operations per word of memory bandwidth
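The three bandwidth figures can be put on a common scale by expressing each level relative to memory. The sketch below assumes the figures map, in order, to SDRAM, the stream register file, and the ALU clusters; that mapping is a reading of the flattened figure, and the dictionary keys and function name are hypothetical.

```python
# Peak bandwidth per level in GB/s, as listed on the slide.
# The level names are an assumed mapping of the figure labels.
BANDWIDTH_GB_PER_S = {
    "memory (SDRAM)": 2.0,
    "stream register file": 32.0,
    "ALU clusters": 544.0,
}


def relative_to_memory(levels: dict) -> dict:
    """Express each level's bandwidth as a multiple of memory bandwidth."""
    base = levels["memory (SDRAM)"]
    return {name: bandwidth / base for name, bandwidth in levels.items()}


if __name__ == "__main__":
    for name, ratio in relative_to_memory(BANDWIDTH_GB_PER_S).items():
        print(f"{name:22s} {ratio:6.1f}x memory bandwidth")
```

On these numbers each step down the hierarchy supplies roughly an order of magnitude more bandwidth than the one above it, which is what lets a kernel with little global data reuse keep the ALUs busy.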
29
Stream Recirculation
30
Bandwidth Demands of FIR Filter
31
Bandwidth Utilization of FIR Filter
32
Performance
[Chart groups: 16-bit kernels, floating-point kernel, 16-bit applications, floating-point application]
33
Power
GOPS/W: 4.6, 6.9, 4.1, 10.2, 9.6, 2.4, 6.3
34
Relative Performance and Power Efficiency
[Chart labels: FFT, Performance, Power Efficiency]
35
Imagine Floorplan
Tapeout ~Q2 '01
21 million transistors
  6M SRF SRAM
  6M UC SRAM
  6M clusters
  3M other
Target: 32 FO4
  300 MHz at SSSS
  500 MHz at TTSS
TI GS30KA: 0.15 µm L_drawn
457 signal pins
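The clock targets and the FO4 budget on this slide are linked by simple arithmetic, sketched below. It assumes the 32 FO4 target is the full clock-cycle budget, and it reads SSSS and TTSS as slower and more typical operating corners; neither is spelled out on the slide. The function and variable names are hypothetical.

```python
def fo4_delay_ps(clock_mhz: float, fo4_per_cycle: int = 32) -> float:
    """Implied delay of one FO4 inverter, in picoseconds.

    Assumes the whole clock period is divided into `fo4_per_cycle` FO4 delays.
    """
    cycle_ps = 1e6 / clock_mhz  # 1 MHz corresponds to a 1,000,000 ps period
    return cycle_ps / fo4_per_cycle


# Transistor budget from the slide, in millions.
TRANSISTORS_M = {"SRF SRAM": 6, "UC SRAM": 6, "clusters": 6, "other": 3}

if __name__ == "__main__":
    for corner, mhz in (("SSSS", 300.0), ("TTSS", 500.0)):
        print(f"{corner}: {mhz:.0f} MHz -> {fo4_delay_ps(mhz):.1f} ps per FO4")
    print("total transistors (M):", sum(TRANSISTORS_M.values()))
```

The per-block transistor counts add up to the 21 million total stated on the slide.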
36
Imagine Team
William J. Dally, Ujval Kapasi, Brucek Khailany, Peter Mattson, Jinyung Namkoong, John Owens, Ben Serebrin, Brian Towles, Scott Rixner
Don Alpert (Intel), Ghazi Ben Amor, Chris Buehler (MIT), JP Grossman (MIT), Brad Johanson, Abelardo Lopez-Lagunas, Ben Mowery, Manman Ren
37
Conclusions
Media Processing
  Little data reuse
  Highly data parallel
  Compute intensive
VLSI
  Stream register organization
  Bandwidth hierarchy
Imagine
  Stream architecture
  10 GOPS sustained application performance
  5 GOPS/W application power efficiency