Download presentation
Presentation is loading. Please wait.
1
1Hot Chips 2000Imagine IMAGINE: Signal and Image Processing Using Streams William J. Dally, Scott Rixner, Ujval J. Kapasi, Peter Mattson, Jinyung Namkoong, John D. Owens, Brian Towles Concurrent VLSI Architecture Group Computer Systems Laboratory Stanford University Brucek Khailany http://cva.stanford.edu/imagine
2
2Hot Chips 2000Imagine Imagine: A Programmable Signal and Image Processor Motivation –Applications poorly matched to conventional architectures Key stream architecture features –High computational bandwidth (Imagine: 48 on-chip ALUs) –Stream register organization –Data bandwidth hierarchy Performance density of a special purpose processor –0.59 cm 2 CMOS chip, 0.13 m standard cell, 500 MHz –20 GFLOPS peak performance (40 GOPS fixed point) –10 GFLOPS sustained on several apps –> 2 GFLOPS/W, > 5 GOPS/W
3
3Hot Chips 2000Imagine Stereo Depth Extraction Polygon Rendering MPEG Encoding/Decoding Encoded 2D Data 2D Video Stream Encode/Decode Representative Applications Render 101100 010110 001001
4
4Hot Chips 2000Imagine Stream Processing Little data reuse (pixels never revisited) Highly data parallel (output pixels not dependent on other output pixels) Compute intensive (60 arithmetic operations per memory reference) SAD Kernel Stream Input Data Output Data Image 1 convolve Image 0 convolve Depth Map
5
5Hot Chips 2000Imagine Characteristics of Media Applications Poorly matched to conventional architectures –Caches –Instruction-Level Parallelism –Few arithmetic units Well-matched to modern VLSI technology –Lots (100’s - 1000’s) of ALUs fit on a single chip –Communication bandwidth is the scarce resource
6
6Hot Chips 2000Imagine Special-Purpose Processors: ALUs fed by dedicated wires/memories Regs Instr. Cache IR IP General-Purpose Processors: ‘Feeding’ Structure Dwarfs ALUs Communication Bandwidth: Care and Feeding of ALUs
7
7Hot Chips 2000Imagine Stream Architecture Provides Data Bandwidth Hierarchy 2GB/s32GB/s SDRAM Stream Register File ALU Cluster 544GB/s ALU Cluster SIMD/VLIW Control Peak BW:
8
8Hot Chips 2000Imagine Application Data Bandwidth Usage 2GB/s32GB/s SDRAM Stream Register File ALU Cluster 544GB/s
9
9Hot Chips 2000Imagine Stream Register File: Details SRF: Single-ported 128KB SRAM (1024 x 32W) Stream buffers 32W/cycle Arbiter To/From Arithmetic Clusters
10
10Hot Chips 2000Imagine CU Intercluster Network + From SRF To SRF + * / Cross Point Local Register File Arithmetic Cluster: Details Units support floating-point / 32-bit / dual 16-bit / quad 8-bit instructions –4-cycle pipelined FMUL,FADD,FSUB,FTOI,ITOF,FFRAC –17-cycle FDIV (pipelined for 1 FDIV every 7 cycles) + *
11
11Hot Chips 2000Imagine Imagine Programming Environment StereoDepthExtraction(…) { // Load Input Images... // Run Kernels convolve7x7 (RawImage,ConvImage); convolve3x3 (ConvImage,Conv2Image);... // Store Output } Convolve7x7(…) {... while(!In.empty()) {... p0 = k0 * in10; p12 = k21 * in32; p34 = k43 * in54; p56 = k65 * in76; sum = (p0 + p12) + (p34 + p56);... }
12
12Hot Chips 2000Imagine Performance 16-bit kernels 16-bit applications floating-point application floating-point kernel
13
13Hot Chips 2000Imagine Stereo Depth Extraction –320x240 8-bit grayscale –200 frames/second Polygon Rendering –4.5 Million Vertices/sec –5.1 Million Pixels/sec MPEG Encoding –720x486 24-bit color –120 frames/second Encoded 2D Data 2D Video Stream Sustained Application Performance Render 101100 010110 001001 Encode SPECviewperf ADVS benchmark (unlit)
14
14Hot Chips 2000Imagine GOPS/W: 4.6 10.7 4.1 10.2 9.6 2.4 6.9 Power Estimates
15
15Hot Chips 2000Imagine The Imagine Stream Processor Stream Register File: 32kW SRAM Network Interface Stream Controller Imagine Stream Processor Host Processor Network ALU Cluster 0ALU Cluster 1ALU Cluster 2ALU Cluster 3ALU Cluster 4ALU Cluster 5ALU Cluster 6 ALU Cluster 7 SDRAM Streaming Memory System Microcontroller : 2K VLIW Instrs
16
16Hot Chips 2000Imagine Imagine Floorplan 22 million transistors 500 MHz TI GS30KA: –0.15 m L drawn –0. m L eff –CMOS process
17
17Hot Chips 2000Imagine VLSI Implementation: 22M Transistors with 7 grad students Stream architecture reduces VLSI design complexity –Modularity / Replication –Long wire delays converted to explicit communications Exposed to microarchitecture, software Design methodology –Standard ASIC flow with forced placement of datapaths Bitslice Verilog Improved area, delay –Pre-placement wire length estimates Reduce design iterations
18
18Hot Chips 2000Imagine Status Imagine team accomplishments –Cycle-accurate simulator –Software tools –Completed synthesizable Verilog –Arithmetic units implemented in standard cells Industrial partners –Texas Instruments: Fab –Intel Future work –Circuits/Logic: expected completion 9/15/00 –Tapeout: expected Q4/2000
19
19Hot Chips 2000Imagine Summary Key stream architecture features –Stream register organization –Data bandwidth hierarchy Performance density of a special purpose processor –10 GFLOPS sustained on several apps –>2 GFLOPS/W, >5 GOPS/W VLSI Implementation –Validate architectural concepts –Develop experimental prototype
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.