The Imagine Stream Processor
Scott Rixner
Concurrent VLSI Architecture Group
Computer Systems Laboratory, Stanford University, Stanford, CA 94305
February 23, 2000
Outline
The Stanford Concurrent VLSI Architecture Group
Media processing
Technology
–Distributed register files
–Stream organization
The Imagine Stream Processor
–20 GFLOPS on a 0.7 cm² chip
–Conditional streams
–Memory access scheduling
–Programming tools
The Concurrent VLSI Architecture Group
Architecture and design technology for VLSI
Routing chips
–Torus Routing Chip, Network Design Frame, Reliable Router
–Basis for Intel, Cray/SGI, Mercury, and Avici network chips
Parallel computer systems
J-Machine (MDP) led to the Cray T3D/T3E
M-Machine (MAP)
–Fast messaging, scalable processing nodes, scalable memory architecture
(Pictured: MDP chip, J-Machine, Cray T3D, MAP chip)
Design technology
Off-chip I/O
–Simultaneous bidirectional signaling (1989), now used by Intel and Hitachi
–High-speed signaling: 4 Gb/s in 0.6 µm CMOS with equalization (1995)
On-chip signaling
–Low-voltage on-chip signaling
–Low-skew clock distribution
Synchronization
–Mesochronous, plesiochronous
–Self-timed design
(Figure: 4 Gb/s CMOS I/O waveform, 250 ps/division)
Forces Acting on Media Processor Architecture
Applications: streams of low-precision samples
–video, graphics, audio, DSL modems, cellular base stations
Technology: becoming wire-limited
–power and delay dominated by communication, not arithmetic
–global structures (register files, instruction issue) don't scale
Microarchitecture: ILP has been mined out
–diminishing returns on squeezing performance from sequential code
–explicit parallelism (data parallelism and thread-level parallelism) required to continue scaling performance
Stereo Depth Extraction
640x480 @ 30 fps
Requirements
–11 GOPS
Imagine stream processor
–12.1 GOPS, 4.6 GOPS/W
(Images: left camera image, right camera image, resulting depth map)
Applications
Little locality of reference
–read each pixel once
–often non-unit stride
–but there is producer-consumer locality
Very high arithmetic intensity
–100s of arithmetic operations per memory reference
Dominated by low-precision (16-bit) integer operations
Stream Processing
(Diagram: input data from images 0 and 1 flows through convolve kernels and a SAD kernel to produce the depth map; kernels are connected by streams)
Little data reuse (pixels never revisited)
Highly data parallel (output pixels not dependent on other output pixels)
Compute intensive (60 operations per memory reference)
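The SAD kernel itself is not listed in the deck; purely as an illustration of the C-like kernel style shown later on the stream-programming slide, a simplified block-SAD kernel might look like the sketch below. The stream names row0, row1, sads and the constant BLOCK are hypothetical.

  // Illustrative sketch only, not Imagine's actual SAD kernel.
  loop (row0) {                        // repeat until the input stream is consumed
    int a, b, sad = 0;
    for (int i = 0; i < BLOCK; i++) {  // BLOCK: hypothetical block width
      row0 >> a;                       // read one pixel from each image stream
      row1 >> b;
      sad += (a > b) ? (a - b) : (b - a);   // accumulate the absolute difference
    }
    sads << sad;                       // append the block's SAD to the output stream
  }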
Locality and Concurrency
(Same kernel diagram: images 0 and 1 → convolve kernels → SAD kernel → depth map)
Operations within a kernel operate on local data
Streams expose data parallelism
Kernels can be partitioned across chips to exploit control parallelism
These Applications Demand High Performance Density
GOPS to TOPS of sustained performance
Space, weight, and power are limited
Historically achieved using special-purpose hardware
Building a special processor for each application is
–expensive
–time consuming
–inflexible
Performance of Special-Purpose Processors
Lots (100s) of ALUs, fed by dedicated wires and memories
Care and Feeding of ALUs
(Diagram: each ALU needs data bandwidth from registers and instruction bandwidth from an instruction cache, IR, and IP)
The 'feeding' structure dwarfs the ALU
Technology scaling makes communication the scarce resource
1999: 0.18 µm process, 256 Mb DRAM, 16 64-bit FP processors at 500 MHz
–18 mm die, 30,000 wire tracks, 1 clock to cross the chip, repeaters every 3 mm
2008: 0.07 µm process, 4 Gb DRAM, 256 64-bit FP processors at 2.5 GHz
–25 mm die, 120,000 wire tracks, 16 clocks to cross the chip, repeaters every 0.5 mm
What Does This Say About Architecture?
Tremendous opportunities
–Media problems have lots of parallelism and locality
–VLSI technology enables 100s of ALUs per chip (1000s soon): in 0.18 µm, 0.1 mm² per integer adder and 0.5 mm² per FP adder
Challenging problems
–Locality: global structures won't work
–Explicit parallelism: ILP won't keep 100 ALUs busy
–Memory: streaming applications don't cache well
It's time to try some new approaches
Example: Register File Organization
Register files serve two functions:
–Short-term storage for intermediate results
–Communication between multiple functional units
Global register files don't scale well as N, the number of ALUs, increases
–Need more registers to hold more results (grows with N)
–Need more ports to connect all of the units (grows with N²)
Register Files Dwarf ALUs
Register File Area
Each cell requires:
–1 word line per port
–1 bit line per port
Each cell therefore grows as p² (p = number of ports)
With R registers in the file, Area ∝ p²·R
Since p and R both grow with the number of ALUs N, Area ∝ N³
(Figure: register bit cell)
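As an illustrative calculation only (the 8-ALU baseline is hypothetical, not from the slides): scaling from 8 to 48 ALUs multiplies N by 6, so the area of a single central register file grows by roughly 6³ ≈ 216x. This cubic blow-up is what the distributed and stream organizations on the following slides are designed to avoid.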
SIMD and Distributed Register Files
(Diagram: register file organizations labeled Scalar, SIMD, Central, DRF)
Hierarchical Organizations
(Diagram: register file organizations labeled Scalar, SIMD, Central, DRF)
Stream Organizations
(Diagram: register file organizations labeled Scalar, SIMD, Central, DRF)
Organizations
48 ALUs (32-bit), 500 MHz
Stream organization improves on the central organization by
–Area: 195x
–Delay: 20x
–Power: 430x
Performance
(Chart: 16% performance drop (8% with latency constraints) versus a 180x improvement)
Scheduling Distributed Register Files
Distributed register files mean:
–Not all functional units can access all data
–Each functional unit input/output no longer has a dedicated route from/to all register files
Communication Scheduling
–Schedule the operation into an instruction
–Assign the operation to a functional unit
–Schedule communication for the operation
When to schedule the route from operation A to operation B?
–Can't schedule the route when A is scheduled, since B is not yet assigned to a functional unit
–Can't wait to schedule the route when B is scheduled, since other routes may block it by then
Stubs: Communication Placeholders
Stubs
–ensure that other communications will not block the write of Op1's result
Copy-connected architecture
–ensures that any functional unit that performs Op2 can read Op1's result
Scheduling With Stubs
The Imagine Stream Processor
Arithmetic Clusters
Bandwidth Hierarchy
VLIW clusters with shared control
41.2 32-bit operations per word of memory bandwidth
(Diagram: SDRAM at 2 GB/s → Stream Register File at 32 GB/s → ALU clusters at 544 GB/s)
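Putting those numbers in perspective (simple arithmetic, not stated on the slide): the three levels stand in roughly a 1 : 16 : 272 bandwidth ratio (32/2 and 544/2), so nearly all operand traffic must be satisfied inside the clusters and the SRF rather than from memory. The 41.2 operations per 32-bit memory word is likewise consistent with roughly 20 GOPS of arithmetic against 0.5 G words/s (2 GB/s) of memory bandwidth.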
Imagine is a Stream Processor
Instructions are Load, Store, and Operate
–operands are streams
–Send and Receive for multiple-Imagine systems
Operate performs a compound stream operation
–read elements from input streams
–perform a local computation
–append elements to output streams
–repeat until the input stream is consumed
–(e.g., triangle transform)
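To make "compound stream operation" concrete, here is an illustrative sketch only, in the C-like kernel style of the later stream-programming slide, of a vertex-transform loop. The stream names and the matrix coefficients m00..m23 are hypothetical parameters, not Imagine's actual triangle-transform kernel.

  // Illustrative compound stream operation: transform each input vertex.
  loop (verts_in) {                 // repeat until the input stream is consumed
    float x, y, z;
    verts_in >> x;                  // read one element (a vertex) from the input stream
    verts_in >> y;
    verts_in >> z;
    float tx = m00*x + m01*y + m02*z + m03;   // local computation: 3x4 transform
    float ty = m10*x + m11*y + m12*z + m13;
    float tz = m20*x + m21*y + m22*z + m23;
    verts_out << tx;                // append the transformed vertex to the output stream
    verts_out << ty;
    verts_out << tz;
  }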
Bandwidth Demands of Transformation
Bandwidth Usage
Data Parallelism is Easier than ILP
Conventional Approaches to Data-Dependent Conditional Execution
(Diagrams of three approaches:)
–Data-dependent branch
–Speculation, which pays a speculative loss when the prediction is wrong
–Predication (e.g., y = (x>0); execute under "if y" / "if ~y"), which suffers an exponentially decreasing duty factor
Zero-Cost Conditionals
Most approaches to conditional operations are costly
–Branching control flow: dead issue slots on mispredicted branches
–Predication (SIMD select, masked vectors): a large fraction of execution 'opportunities' goes idle
Conditional streams
–append an element to an output stream depending on a case variable
(Diagram: value stream + case stream {0,1} → output stream)
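The deck does not show the conditional-stream syntax itself; the following is a sketch of the idea only, with a hypothetical cond_out() form standing in for whatever Imagine actually provides.

  // Sketch of the idea only; cond_out(...) is a hypothetical form.
  loop (values) {
    int x, c;
    values >> x;              // read a value from the value stream
    cases  >> c;              // read the matching case variable {0,1}
    cond_out(outputs, x, c);  // append x to the output stream only when c == 1;
                              // when c == 0 nothing is appended, so no issue slots are wasted
  }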
Modern DRAM
Synchronous
Three-dimensional address
–Bank
–Row
–Column
Shared address lines
Shared data lines
Not truly random access
(Timing diagrams: concurrent access, pipelined access)
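To make the three-dimensional addressing concrete, a hypothetical decomposition of a linear address into bank, row, and column fields might look like this (the field widths are illustrative assumptions, not Imagine's memory map):

  // Illustrative only: 8 banks x 8K rows x 512 columns.
  unsigned column =  addr        & 0x1FF;   // low 9 bits select the column
  unsigned row    = (addr >> 9)  & 0x1FFF;  // next 13 bits select the row
  unsigned bank   = (addr >> 22) & 0x7;     // top 3 bits select the bank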
DRAM Throughput and Latency
P: bank precharge, A: row activation, C: column access
(Timing diagrams:)
–Without access scheduling: 56 DRAM cycles
–With access scheduling: 19 DRAM cycles
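As a sketch of the idea behind access scheduling, here is one simple row-hit-first policy; this is an illustration only, not the scheduler actually used in Imagine's memory controller, and the Ref type and bank_is_precharged() helper are hypothetical.

  // Minimal sketch of a row-hit-first access scheduler.
  typedef struct { int bank; int row; int col; } Ref;

  Ref *pick_next(Ref pending[], int n, const int active_row[]) {
    for (int i = 0; i < n; i++)        // 1. prefer column accesses to already-open rows
      if (pending[i].row == active_row[pending[i].bank]) return &pending[i];
    for (int i = 0; i < n; i++)        // 2. otherwise activate a row in a precharged bank
      if (bank_is_precharged(pending[i].bank)) return &pending[i];
    return n > 0 ? &pending[0] : 0;    // 3. otherwise fall back to the oldest reference
  }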
Streaming Memory System
Stream load/store architecture
Memory access scheduling
–Out-of-order access
–High stream throughput
(Diagram: address generators (index and data) → per-bank buffers → DRAM banks 0..n → reorder buffer)
Streaming Memory Performance
Performance
(Chart annotations: 80-90% improvement, 35% improvement)
Evaluation Tools
Simulators
–Imagine RTL model (Verilog)
–isim: cycle-accurate simulator (C++)
–idebug: stream-level simulator (C++)
Compiler tools
–iscd: kernel scheduler (communication scheduling)
–istream: stream scheduler (SRF/memory management)
–schedviz: visualizer
Stream Programming
Stream-level (host processor, C++, compiled by istream):
  load (datin1, mem1);                             // Memory to SRF
  load (datin2, mem2);
  op   (ADDSUB, datin1, datin2, datout1, datout2);
  send (datout1, P2, chan0);                       // SRF to Network
  send (datout2, P3, chan1);
Kernel-level (clusters, C-like syntax, compiled by iscd):
  loop (input_stream0) {
    input_stream0 >> a;
    input_stream1 >> b;
    c = a + b;
    d = a - b;
    output_stream0 << c;
    output_stream1 << d;
  }
Stereo Depth Extractor
Convolutions
–Load original packed row
–Unpack (8-bit → 16-bit)
–7x7 convolve
–3x3 convolve
–Store convolved row
Disparity search
–Load convolved rows
–Calculate block SADs at different disparities
–Store best disparity values
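Purely as an illustration, the convolution phase above could be expressed at the stream level in the style of the stream-programming slide. The kernel names UNPACK, CONV7x7, CONV3x3 and the stream names are hypothetical, and a real implementation would need several rows in flight to cover a 7x7 window.

  load (packed_row, mem_row);            // Memory to SRF: one packed image row
  op   (UNPACK,  packed_row, row16);     // unpack 8-bit samples to 16-bit
  op   (CONV7x7, row16,      conv7);     // 7x7 convolution
  op   (CONV3x3, conv7,      conv3);     // 3x3 convolution
  store(conv3,   mem_conv);              // SRF to memory: convolved row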
7x7 Convolution Kernel
(Schedule visualization: columns for the ALUs, Comm/SP, and streams, software-pipelined across pipeline stages 0, 1, and 2)
Performance
(Chart: 16-bit kernels, floating-point kernel, 16-bit applications, floating-point application)
Power
GOPS/W across the benchmarks: 4.6, 6.9, 4.1, 10.2, 9.6, 2.4, 6.3
Implementation
Goals
–< 0.7 cm² chip in 0.18 µm CMOS
–Operates at 500 MHz
Status
–Cycle-accurate C++ simulator
–Behavioral Verilog
–Detailed floorplan with all critical signals planned
–Schematics and layout for key cells
–Timing analysis of key units and communication paths
The Imagine Stream Processor
Imagine Floorplan
(Floorplan diagram: SRF and stream buffers with SRF/SB control and address generators; eight arithmetic clusters (0-7) with cluster control and the microcontroller; memory system, host interface, and network interface with stream/reorder buffers; memory banks MB0-MB3, BitCoder, and routing around the core; labeled dimensions include 5.9 mm, 5 mm, 2 mm, 1 mm, and 0.6 mm)
Imagine Pipeline
(Pipeline diagram: stages F1, F2, D, SEL, R, X1-X4, and W across the microcontroller and clusters, with memory and the SRF feeding the clusters through stream buffers)
Communication paths
(Diagram: communication paths among the SRF, stream buffers, SRF/SB control, address generators, clusters, cluster control, host interface, memory system, BitCoder, network interface, and control routing, overlaid on the pipeline stages above)
Adder Layout
(Layout plot: integer adder with its local register file, approximately 460 µm x 146.7 µm)
Imagine Team
Professor William J. Dally
–Scott Rixner
–Ghazi Ben Amor
–Ujval Kapasi
–Brucek Khailany
–Peter Mattson
–Jinyung Namkoong
–John Owens
–Brian Towles
–Don Alpert (Intel)
–Chris Buehler (MIT)
–JP Grossman (MIT)
–Brad Johanson
–Dr. Abelardo Lopez-Lagunas
–Ben Mowery
–Manman Ren
Imagine Summary
Imagine operates on streams of records
–simplifies programming
–exposes locality and concurrency
Compound stream operations
–perform a subroutine on each stream element
–reduce global register bandwidth
Bandwidth hierarchy
–use bandwidth where it's inexpensive
–distributed and hierarchical register organization
Conditional stream operations
–sort elements into homogeneous streams
–avoid predication and speculation
Imagine gives high performance with low power and flexible programming
Matches the capabilities of communication-limited technology to the demands of signal and image processing applications
Performance
–compound stream operations realize >10 GOPS on key applications
–can be extended by partitioning an application across several Imagines (TFLOPS on a circuit board)
Power
–three-level register hierarchy gives 2-10 GOPS/W
Flexibility
–programmed in "C"
–streaming model
–conditional stream operations enable applications like sort
MPEG-2 Encoder
Advantages of Implementing MPEG-2 Encoding on Imagine
Block matching is well suited to Imagine
–Clusters provide high arithmetic bandwidth
–SRF provides the necessary data locality and bandwidth
–Good performance
Programmable solution
–Simplifies system design
–Allows a tradeoff between block-matcher complexity and bitrate / picture quality
Real time
Power efficient
MPEG-2 Challenges
VLC (Huffman coding)
–Difficult and inefficient to implement on SIMD clusters
–A separate module providing LUT and shift-register capabilities is used instead
Rate control
–Difficult because multiple macroblocks are encoded in parallel
–Must be performed at a coarser granularity (impact on picture quality?)