The Imagine Stream Processor
Scott Rixner
Concurrent VLSI Architecture Group
Computer Systems Laboratory, Stanford University, Stanford, CA
February 23, 2000
Outline
The Stanford Concurrent VLSI Architecture Group
Media processing
Technology
– Distributed register files
– Stream organization
The Imagine Stream Processor
– 20 GFLOPS on a 0.7 cm² chip
– Conditional streams
– Memory access scheduling
– Programming tools
The Concurrent VLSI Architecture Group
Architecture and design technology for VLSI
Routing chips
– Torus Routing Chip, Network Design Frame, Reliable Router
– Basis for Intel, Cray/SGI, Mercury, Avici network chips
Parallel computer systems
J-Machine (MDP) led to Cray T3D/T3E
M-Machine (MAP)
– Fast messaging, scalable processing nodes, scalable memory architecture
[Figures: MDP chip, J-Machine, Cray T3D, MAP chip]
Design technology
Off-chip I/O
– Simultaneous bidirectional signaling (1989), now used by Intel and Hitachi
– High-speed signaling: 4 Gb/s in 0.6 µm CMOS, equalization (1995)
On-chip signaling
– Low-voltage on-chip signaling
– Low-skew clock distribution
Synchronization
– Mesochronous, plesiochronous
– Self-timed design
[Figure: 4 Gb/s CMOS I/O waveform, 250 ps/division]
Forces Acting on Media Processor Architecture
Applications: streams of low-precision samples
– video, graphics, audio, DSL modems, cellular base stations
Technology: becoming wire-limited
– power/delay dominated by communication, not arithmetic
– global structures (register files, instruction issue) don't scale
Microarchitecture: ILP has been mined out
– to the point of diminishing returns on squeezing performance from sequential code
– explicit parallelism (data parallelism and thread-level parallelism) required to continue scaling performance
Stereo Depth Extraction
Requirements at 30 fps
– 11 GOPS
Imagine stream processor
– 12.1 GOPS, 4.6 GOPS/W
[Figures: left camera image, right camera image, computed depth map]
Applications
Little locality of reference
– read each pixel once
– often non-unit stride
– but there is producer-consumer locality
Very high arithmetic intensity
– 100s of arithmetic operations per memory reference
Dominated by low-precision (16-bit) integer operations
Stream Processing
[Figure: kernels (convolve for image 0, convolve for image 1, SAD) connected by streams of input and output data, producing a depth map]
Little data reuse (pixels never revisited)
Highly data parallel (output pixels do not depend on other output pixels)
Compute intensive (60 operations per memory reference)
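The block-SAD computation at the core of this pipeline can be sketched as follows. This is a functional illustration only: the window size, image layout, and function names are assumptions, not Imagine's actual kernel code.

```python
def sad(left, right, row, col, disparity, window=3):
    """Sum of absolute differences between a window in the left image
    and the same window shifted by `disparity` in the right image."""
    half = window // 2
    total = 0
    for dr in range(-half, half + 1):
        for dc in range(-half, half + 1):
            l = left[row + dr][col + dc]
            r = right[row + dr][col + dc - disparity]
            total += abs(l - r)
    return total

def best_disparity(left, right, row, col, max_disp):
    """Depth estimate for one pixel: the disparity minimizing SAD."""
    return min(range(max_disp + 1),
               key=lambda d: sad(left, right, row, col, d))
```

Each output pixel depends only on a small window of input pixels and on no other output pixel, which is exactly the data parallelism the slide describes.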
Locality and Concurrency
[Figure: the same convolve/SAD kernel pipeline]
Operations within a kernel operate on local data
Streams expose data parallelism
Kernels can be partitioned across chips to exploit control parallelism
These Applications Demand High Performance Density
GOPS to TOPS of sustained performance
Space, weight, and power are limited
Historically achieved using special-purpose hardware
Building a special processor for each application is
– expensive
– time consuming
– inflexible
Performance of Special-Purpose Processors
Lots (100s) of ALUs
Fed by dedicated wires/memories
Care and Feeding of ALUs
[Figure: an ALU with its supporting registers, instruction cache, IR, and IP, supplying data bandwidth and instruction bandwidth]
The 'feeding' structure dwarfs the ALU
Technology scaling makes communication the scarce resource
0.18 µm: 256 Mb DRAM, 16 64-bit FP processors at 500 MHz
– chip crossing: … mm, 30,000 tracks, 1 clock, repeaters every 3 mm
0.07 µm: 4 Gb DRAM, … 64-bit FP processors at 2.5 GHz
– chip crossing: 25 mm, 120,000 tracks, 16 clocks, repeaters every 0.5 mm
What Does This Say About Architecture?
Tremendous opportunities
– Media problems have lots of parallelism and locality
– VLSI technology enables 100s of ALUs/chip, 1000s soon (in 0.18 µm: 0.1 mm² per integer adder, 0.5 mm² per FP adder)
Challenging problems
– Locality: global structures won't work
– Explicit parallelism: ILP won't keep 100 ALUs busy
– Memory: streaming applications don't cache well
It's time to try some new approaches
Example: Register File Organization
Register files serve two functions:
– Short-term storage for intermediate results
– Communication between multiple function units
Global register files don't scale well as N, the number of ALUs, increases
– Need more registers to hold more results (grows with N)
– Need more ports to connect all of the units (grows with N²)
Register Files Dwarf ALUs
Register File Area
[Figure: register bit cell]
Each bit cell requires:
– 1 word line per port
– 1 bit line per port
so cell area grows as p² for p ports
With R registers in the file, total area grows as p²R
For a central file shared by N ALUs, p ∝ N and R ∝ N, so area grows as N³
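The p²R scaling can be made concrete with a toy area model. The per-ALU port and register counts below are arbitrary assumptions, chosen only to exhibit the N³ vs. N growth; they are not Imagine's actual parameters.

```python
def rf_area(ports, regs, cell=1.0):
    # Each bit cell needs one word line and one bit line per port,
    # so cell area grows as ports**2; total area is ports**2 * regs.
    return cell * ports**2 * regs

def central_area(n_alus, ports_per_alu=3, regs_per_alu=16):
    # One shared file: both ports and registers scale with N,
    # so total area grows as N**3.
    return rf_area(n_alus * ports_per_alu, n_alus * regs_per_alu)

def distributed_area(n_alus, ports_per_alu=3, regs_per_alu=16):
    # One small file per ALU: per-file cost is constant,
    # so total area grows only as N.
    return n_alus * rf_area(ports_per_alu, regs_per_alu)
```

Going from 1 to 48 ALUs multiplies the central file's area by 48³ but the distributed files' area by only 48, which is the gap the area/delay/power comparison on the next slides quantifies.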
SIMD and Distributed Register Files
[Figure: register file organizations — scalar, SIMD, central, distributed (DRF)]

Hierarchical Organizations
[Figure: register file organizations — scalar, SIMD, central, DRF]

Stream Organizations
[Figure: register file organizations — scalar, SIMD, central, DRF]
Organizations
48 ALUs (32-bit), 500 MHz
Stream organization improves on the central organization by 195x in area, 20x in delay, and 430x in power
Performance
[Chart: 16% performance drop (8% with latency constraints) in exchange for a 180x improvement]
Scheduling Distributed Register Files
Distributed register files mean:
– Not all functional units can access all data
– Each functional unit input/output no longer has a dedicated route from/to all register files
Communication Scheduling
Schedule operation on instruction → assign operation to functional unit → schedule communication for operation
When should the route from operation A to operation B be scheduled?
– Can't schedule the route when A is scheduled, since B is not yet assigned to a functional unit
– Can't wait until B is scheduled, since by then other routes may block it
Stubs: Communication Placeholders
Stubs
– ensure that other communications will not block the write of Op1's result
Copy-connected architecture
– ensures that any functional unit that performs Op2 can read Op1's result
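One way to read the stub mechanism: it is a reservation made when the producer is scheduled and resolved once the consumer is placed. The sketch below is a deliberately simplified model; the register-file names and the schedule representation are invented for illustration and do not reflect iscd's internals.

```python
def schedule_with_stub(bus_schedule, cycle, candidate_rfs):
    """Stub: when the producer is scheduled, reserve its result's write
    slot on every register file a future consumer might read from, so no
    later communication can block the write."""
    for rf in candidate_rfs:
        bus_schedule.setdefault(rf, set()).add(cycle)

def resolve_stub(bus_schedule, cycle, candidate_rfs, chosen_rf):
    """Once the consumer is assigned to a functional unit, keep the one
    reservation that is actually needed and release the rest."""
    for rf in candidate_rfs:
        if rf != chosen_rf:
            bus_schedule[rf].discard(cycle)
```

The over-reservation is what lets the scheduler commit the producer early without knowing the consumer's placement, at the cost of temporarily tying up write slots.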
Scheduling With Stubs

The Imagine Stream Processor

Arithmetic Clusters
Bandwidth Hierarchy
VLIW clusters with shared control
… operations per word of memory bandwidth
[Figure: SDRAM → Stream Register File at 2 GB/s, Stream Register File → ALU clusters at 32 GB/s, 544 GB/s within the clusters]
Imagine is a Stream Processor
Instructions are Load, Store, and Operate
– operands are streams
– Send and Receive for multiple-Imagine systems
Operate performs a compound stream operation
– read elements from input streams
– perform a local computation
– append elements to output streams
– repeat until the input stream is consumed
– (e.g., triangle transform)
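The compound-operation loop above can be sketched as a semantic model. This is only an illustration of the Operate semantics (on Imagine the kernel actually runs across the SIMD clusters under VLIW control); the `addsub` kernel mirrors the ADDSUB stream-level example later in the talk.

```python
def operate(kernel, *inputs):
    """Read one element from each input stream, run the local kernel
    computation, append each result to its output stream, and repeat
    until the inputs are consumed. Assumes non-empty input streams."""
    outputs = None
    for elems in zip(*inputs):
        results = kernel(*elems)          # purely local computation
        if outputs is None:
            outputs = tuple([] for _ in results)
        for out, r in zip(outputs, results):
            out.append(r)                 # append to output streams
    return outputs

def addsub(a, b):
    # Example kernel: two input streams in, two output streams out.
    return a + b, a - b
```

Because the kernel sees only the current elements, all intermediate values stay in cluster-local storage; only the stream endpoints touch the SRF.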
Bandwidth Demands of Transformation

Bandwidth Usage

Data Parallelism is Easier than ILP
Conventional Approaches to Data-Dependent Conditional Execution
[Figure: three ways to execute "A; if x>0 then B else C; J; K" — a data-dependent branch; speculation, where a misprediction wastes ~100s of operations of speculative work; and predication (y = (x>0); B if y, C if ~y), which suffers an exponentially decreasing duty factor as conditionals nest]
Zero-Cost Conditionals
Most approaches to conditional operations are costly
– Branching control flow: dead issue slots on mispredicted branches
– Predication (SIMD select, masked vectors): a large fraction of execution 'opportunities' goes idle
Conditional streams
– append an element to an output stream depending on a case variable
[Figure: value stream + case stream {0,1} → output stream]
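Functionally, a conditional stream compacts values by a case bit, sorting one mixed stream into homogeneous ones that can then be processed without predication or speculation. A minimal functional sketch (the real mechanism implements this across the SIMD clusters with inter-cluster communication):

```python
def conditional_split(values, cases):
    """Sort stream elements into two homogeneous output streams by a
    0/1 case bit; each output stream only grows when its case matches."""
    taken = [v for v, c in zip(values, cases) if c]
    rest = [v for v, c in zip(values, cases) if not c]
    return taken, rest
```

Each homogeneous output stream can be handed to a kernel that runs at full duty factor, with no idle predicated slots.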
Modern DRAM
Synchronous
Three-dimensional
– Bank
– Row
– Column
Shared address lines
Shared data lines
Not truly random access
[Figure: concurrent access and pipelined access timing]
DRAM Throughput and Latency
P: bank precharge, A: row activation, C: column access
[Timing diagrams]
Without access scheduling: 56 DRAM cycles
With access scheduling: 19 DRAM cycles
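The gain comes from reordering pending references so that accesses to an already-open row issue before accesses that would force a precharge/activate cycle. The sketch below shows one such policy; the row-hit-first heuristic and the request representation are illustrative assumptions, not Imagine's exact scheduling algorithm.

```python
def schedule(requests):
    """requests: list of (bank, row) pairs in arrival order.
    Returns an issue order under a row-hit-first policy: a pending
    request that hits the currently open row in its bank goes first,
    otherwise fall back to arrival order."""
    pending = list(requests)
    open_rows = {}                       # bank -> currently open row
    order = []
    while pending:
        hit = next((r for r in pending
                    if open_rows.get(r[0]) == r[1]), None)
        req = hit if hit is not None else pending[0]
        pending.remove(req)
        open_rows[req[0]] = req[1]       # this access leaves its row open
        order.append(req)
    return order
```

Interleaved references to two rows of one bank, which would ping-pong in arrival order, get grouped so each row is precharged and activated only once.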
Streaming Memory System
Stream load/store architecture
Memory access scheduling
– Out-of-order access
– High stream throughput
[Figure: address generators issue indices and data through per-bank buffers to DRAM banks 0 through n, with a reorder buffer on the return path]
Streaming Memory Performance
Performance
[Charts: 80-90% improvement; 35% improvement]
Evaluation Tools
Simulators
– Imagine RTL model (Verilog)
– isim: cycle-accurate simulator (C++)
– idebug: stream-level simulator (C++)
Compiler tools
– iscd: kernel scheduler (communication scheduling)
– istream: stream scheduler (SRF/memory management)
– schedviz: visualizer
Stream Programming
Stream-level (runs on the host processor; C++; compiled by istream):

    load (datin1, mem1);             // Memory to SRF
    load (datin2, mem2);
    op (ADDSUB, datin1, datin2, datout1, datout2);
    send (datout1, P2, chan0);       // SRF to Network
    send (datout2, P3, chan1);

Kernel-level (runs on the clusters; C-like syntax; compiled by iscd):

    loop (input_stream0) {
      input_stream0 >> a;
      input_stream1 >> b;
      c = a + b;
      d = a - b;
      output_stream0 << c;
      output_stream1 << d;
    }
Stereo Depth Extractor
Convolutions:
– Load original packed row
– Unpack (8-bit → 16-bit)
– 7x7 convolve
– 3x3 convolve
– Store convolved row
Disparity search:
– Load convolved rows
– Calculate block SADs at different disparities
– Store best disparity values
7x7 Convolution Kernel
[Figure: kernel schedule across the ALUs, communication/scratchpad units, and streams, software-pipelined into stages 0-2]
Performance
[Charts: 16-bit kernels, a floating-point kernel, 16-bit applications, and a floating-point application]
Power
[Chart: GOPS/W]
Implementation
Goals
– < 0.7 cm² chip in 0.18 µm CMOS
– Operates at 500 MHz
Status
– Cycle-accurate C++ simulator
– Behavioral Verilog
– Detailed floorplan with all critical signals planned
– Schematics and layout for key cells
– Timing analysis of key units and communication paths
The Imagine Stream Processor
Imagine Floorplan
[Figure: floorplan — eight arithmetic clusters (0-7, each with cluster control), the SRF and stream buffers with SRF/SB control, microcontroller (UC), address generators (AG), memory system/host interface/network interface (MS/HI/NI), stream/re-order buffers, memory banks MB0-MB3, BitCoder, and network routing; labeled dimensions include 5.9 mm, 5 mm, 2 mm, 1 mm, and 0.6 mm]
Imagine Pipeline
[Figure: pipeline diagram — microcontroller stages F1, F2, D, SEL, R feed the clusters' execute and writeback stages X1-X4 and W, with memory and the SRF coupled to the clusters through the stream buffers]
Communication paths
[Figure: communication paths among the SRF and stream buffers (with SRF/SB control and AG), the clusters, cluster control, host interface, memory system, BitCoder, and network interface routing, annotated on the floorplan and pipeline]
Adder Layout
[Figure: layout of an integer adder with its local register file]
Imagine Team
Professor William J. Dally
– Scott Rixner
– Ghazi Ben Amor
– Ujval Kapasi
– Brucek Khailany
– Peter Mattson
– Jinyung Namkoong
– John Owens
– Brian Towles
– Don Alpert (Intel)
– Chris Buehler (MIT)
– JP Grossman (MIT)
– Brad Johanson
– Dr. Abelardo Lopez-Lagunas
– Ben Mowery
– Manman Ren
Imagine Summary
Imagine operates on streams of records
– simplifies programming
– exposes locality and concurrency
Compound stream operations
– perform a subroutine on each stream element
– reduce global register bandwidth
Bandwidth hierarchy
– use bandwidth where it's inexpensive
– distributed and hierarchical register organization
Conditional stream operations
– sort elements into homogeneous streams
– avoid predication or speculation
Imagine gives high performance with low power and flexible programming
Matches the capabilities of communication-limited technology to the demands of signal and image processing applications
Performance
– compound stream operations realize >10 GOPS on key applications
– can be extended by partitioning an application across several Imagines (TFLOPS on a circuit board)
Power
– three-level register hierarchy gives 2-10 GOPS/W
Flexibility
– programmed in "C"
– streaming model
– conditional stream operations enable applications like sort
MPEG-2 Encoder
Advantages of Implementing MPEG-2 Encode on Imagine
Block matching is well suited to Imagine
– Clusters provide high arithmetic bandwidth
– SRF provides the necessary data locality and bandwidth
– Good performance
Programmable solution
– Simplifies system design
– Allows a tradeoff between block-matcher complexity and bitrate/picture quality
Real time
Power efficient
MPEG-2 Challenges
VLC (Huffman coding)
– Difficult and inefficient to implement on SIMD clusters
– A separate module providing LUT and shift-register capabilities is used instead
Rate control
– Difficult because multiple macroblocks are encoded in parallel
– Must be performed at a coarser granularity (impact on picture quality?)