1 The Imagine Stream Processor
Concurrent VLSI Architecture Group
Stanford University Computer Systems Laboratory
Stanford, CA 94305
Scott Rixner
February 23, 2000

2 Outline
- The Stanford Concurrent VLSI Architecture Group
- Media processing
- Technology
  - Distributed register files
  - Stream organization
- The Imagine Stream Processor
  - 20 GFLOPS on a 0.7 cm² chip
  - Conditional streams
  - Memory access scheduling
  - Programming tools

3 The Concurrent VLSI Architecture Group
- Architecture and design technology for VLSI
- Routing chips
  - Torus Routing Chip, Network Design Frame, Reliable Router
  - Basis for Intel, Cray/SGI, Mercury, Avici network chips

4 Parallel computer systems
- J-Machine (MDP) led to Cray T3D/T3E
- M-Machine (MAP)
  - Fast messaging, scalable processing nodes, scalable memory architecture
[Figures: MDP chip, J-Machine, Cray T3D, MAP chip]

5 Design technology
- Off-chip I/O
  - Simultaneous bidirectional signaling (1989), now used by Intel and Hitachi
  - High-speed signaling: 4 Gb/s in 0.6 µm CMOS with equalization (1995)
- On-chip signaling
  - Low-voltage on-chip signaling
  - Low-skew clock distribution
- Synchronization
  - Mesochronous, plesiochronous
  - Self-timed design
[Figure: 4 Gb/s CMOS I/O waveform, 250 ps/division]

6 Forces Acting on Media Processor Architecture
- Applications: streams of low-precision samples
  - video, graphics, audio, DSL modems, cellular base stations
- Technology: becoming wire-limited
  - power/delay dominated by communication, not arithmetic
  - global structures (register files, instruction issue) don't scale
- Microarchitecture: ILP has been mined out
  - to the point of diminishing returns on squeezing performance from sequential code
  - explicit parallelism (data parallelism and thread-level parallelism) required to continue scaling performance

7 Stereo Depth Extraction
- 640x480 @ 30 fps
- Requirements: 11 GOPS
- Imagine stream processor: 12.1 GOPS, 4.6 GOPS/W
[Figures: left camera image, right camera image, depth map]

8 Applications
- Little locality of reference
  - read each pixel once
  - often non-unit stride
  - but there is producer-consumer locality
- Very high arithmetic intensity
  - 100s of arithmetic operations per memory reference
- Dominated by low-precision (16-bit) integer operations

9 Stream Processing
[Figure: stream pipeline — Image 0 and Image 1 each pass through a convolve kernel; the SAD kernel consumes both convolved streams and produces the depth map]
- Little data reuse (pixels never revisited)
- Highly data parallel (output pixels not dependent on other output pixels)
- Compute intensive (60 operations per memory reference)
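The SAD (sum-of-absolute-differences) step of the depth extractor can be sketched in a few lines; the window size, disparity range, and function names below are illustrative, not Imagine's actual kernel.

```python
def sad(left, right, row, col, disparity, window=3):
    """Sum of absolute differences between a window in the left image
    and a window shifted left by `disparity` in the right image."""
    total = 0
    for dr in range(window):
        for dc in range(window):
            total += abs(left[row + dr][col + dc] -
                         right[row + dr][col + dc - disparity])
    return total

def best_disparity(left, right, row, col, max_disp, window=3):
    """Pick the disparity whose window match has the smallest SAD."""
    return min(range(max_disp + 1),
               key=lambda d: sad(left, right, row, col, d, window))
```

Every output pixel's best disparity is independent of every other output pixel, which is exactly the data parallelism the slide calls out.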

10 Locality and Concurrency
[Figure: the same convolve/SAD pipeline producing the depth map]
- Operations within a kernel operate on local data
- Streams expose data parallelism
- Kernels can be partitioned across chips to exploit control parallelism

11 These Applications Demand High Performance Density
- GOPS to TOPS of sustained performance
- Space, weight, and power are limited
- Historically achieved using special-purpose hardware
- Building a special processor for each application is
  - expensive
  - time consuming
  - inflexible

12 Performance of Special-Purpose Processors
- Lots (100s) of ALUs
- Fed by dedicated wires/memories

13 Care and Feeding of ALUs
[Figure: datapath showing the registers, instruction cache, IR, and IP needed to supply data and instruction bandwidth — the 'feeding' structure dwarfs the ALU]

14 Technology scaling makes communication the scarce resource
- 1999 (0.18 µm): 256 Mb DRAM, 16 64-bit FP processors at 500 MHz; 18 mm chip, 30,000 wire tracks, 1 clock to cross the chip, repeaters every 3 mm
- 2008 (0.07 µm): 4 Gb DRAM, 256 64-bit FP processors at 2.5 GHz; 25 mm chip, 120,000 wire tracks, 16 clocks to cross the chip, repeaters every 0.5 mm

15 What Does This Say About Architecture?
- Tremendous opportunities
  - Media problems have lots of parallelism and locality
  - VLSI technology enables 100s of ALUs/chip, 1000s soon (in 0.18 µm: 0.1 mm² per integer adder, 0.5 mm² per FP adder)
- Challenging problems
  - Locality: global structures won't work
  - Explicit parallelism: ILP won't keep 100 ALUs busy
  - Memory: streaming applications don't cache well
- It's time to try some new approaches

16 Example: Register File Organization
- Register files serve two functions:
  - Short-term storage for intermediate results
  - Communication between multiple function units
- Global register files don't scale well as N, the number of ALUs, increases
  - Need more registers to hold more results (grows with N)
  - Need more ports to connect all of the units (grows with N²)

17 Register Files Dwarf ALUs

18 Register File Area
- Each cell requires:
  - 1 word line per port
  - 1 bit line per port
- Each cell grows as p²
- R registers in the file
- Area: p²R ∝ N³
[Figure: register bit cell]
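The N³ growth follows directly from the slide: ports p and registers R both scale with the ALU count N, and each bit cell's area grows as p², so total area is p²R ∝ N²·N = N³. A quick numeric check (the per-ALU constants are arbitrary):

```python
def regfile_area(n_alus, ports_per_alu=3, regs_per_alu=16, bits=32):
    """Central register file area in unit bit-cell terms: each cell
    needs one word line and one bit line per port, so a cell's area
    grows as p^2; total area is p^2 * R * bits."""
    p = ports_per_alu * n_alus   # ports grow with N
    r = regs_per_alu * n_alus    # registers grow with N
    return p * p * r * bits

# Doubling the ALU count multiplies area by 2^3 = 8,
# regardless of the constants chosen.
ratio = regfile_area(16) / regfile_area(8)
```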

19 SIMD and Distributed Register Files
[Figure: register file organizations — scalar vs. SIMD, central vs. distributed (DRF)]

20 Hierarchical Organizations
[Figure: register file organizations — scalar vs. SIMD, central vs. distributed (DRF), with a hierarchical level added]

21 Stream Organizations
[Figure: register file organizations — scalar vs. SIMD, central vs. distributed (DRF), with a stream level added]

22 Organizations
- 48 ALUs (32-bit), 500 MHz
- The stream organization improves on the central organization by 195x in area, 20x in delay, and 430x in power

23 Performance
[Chart: 16% performance drop (8% with latency constraints) in exchange for the 180x improvement]

24 Scheduling Distributed Register Files
- Distributed register files mean:
  - Not all functional units can access all data
  - Each functional unit input/output no longer has a dedicated route from/to all register files

25 Communication Scheduling
- Schedule the operation on an instruction
- Assign the operation to a functional unit
- Schedule the communication for the operation
- When should the route from operation A to operation B be scheduled?
  - Can't schedule the route when A is scheduled, since B is not yet assigned to a FU
  - Can't wait until B is scheduled, since other routes may block it by then

26 Stubs: Communication Placeholders
- Stubs ensure that other communications will not block the write of Op1's result
- A copy-connected architecture ensures that any functional unit that performs Op2 can read Op1's result
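A toy model of the stub idea: when Op1's result is scheduled, the consumer's functional unit is still unknown, so the scheduler reserves a placeholder write slot in every candidate register file, then commits one and releases the rest once Op2 is placed. The class and method names are illustrative, not iscd's actual interface.

```python
class StubScheduler:
    """Toy communication scheduler: per-register-file write slots
    indexed by cycle; a stub reserves a slot in every candidate RF
    so that no later route can block the producer's write."""

    def __init__(self, reg_files):
        self.slots = {rf: set() for rf in reg_files}  # rf -> busy cycles

    def place_stub(self, cycle):
        """Reserve the write cycle in every RF whose slot is free."""
        stubs = [rf for rf, busy in self.slots.items() if cycle not in busy]
        for rf in stubs:
            self.slots[rf].add(cycle)
        return stubs

    def commit(self, stubs, chosen_rf, cycle):
        """Once the consumer's FU (hence its RF) is known, keep that
        reservation and release the others."""
        for rf in stubs:
            if rf != chosen_rf:
                self.slots[rf].discard(cycle)
        return cycle in self.slots[chosen_rf]
```

Reserving everywhere and releasing late is conservative, but it resolves the chicken-and-egg problem on the previous slide: the route exists whichever unit Op2 lands on.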

27 Scheduling With Stubs
[Figure: scheduling example using stubs]

28 The Imagine Stream Processor
[Figure: Imagine block diagram]

29 Arithmetic Clusters
[Figure: arithmetic cluster block diagram]

30 Bandwidth Hierarchy
- VLIW clusters with shared control
- 41.2 32-bit operations per word of memory bandwidth
[Figure: SDRAM (2 GB/s) → Stream Register File (32 GB/s) → ALU clusters (544 GB/s)]
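The three bandwidth figures on the slide give the step-up at each level of the hierarchy; a quick calculation of the ratios:

```python
# Bandwidth at each level of Imagine's hierarchy (GB/s), from the slide.
sdram, srf, clusters = 2, 32, 544

srf_ratio = srf / sdram          # SRF supplies 16x off-chip bandwidth
cluster_ratio = clusters / srf   # local RFs supply 17x SRF bandwidth
total_ratio = clusters / sdram   # 272 local words per memory word
```

Each level multiplies the bandwidth of the one below it, which is what lets the clusters sustain tens of operations per word fetched from memory.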

31 Imagine is a Stream Processor
- Instructions are Load, Store, and Operate
  - operands are streams
  - Send and Receive for multiple-Imagine systems
- Operate performs a compound stream operation:
  - read elements from input streams
  - perform a local computation
  - append elements to output streams
  - repeat until the input stream is consumed
  - (e.g., triangle transform)
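The loop a compound stream operation performs — read from inputs, compute locally, append to outputs until the inputs are exhausted — is easy to sketch; this mirrors the add/sub kernel shown later in the talk, not an actual Imagine instruction.

```python
def addsub_kernel(stream0, stream1):
    """Compound stream op: consume two input streams element-wise,
    appending (a + b) and (a - b) to two output streams; only local
    values are touched inside the loop body."""
    out_sum, out_diff = [], []
    for a, b in zip(stream0, stream1):
        out_sum.append(a + b)
        out_diff.append(a - b)
    return out_sum, out_diff

sums, diffs = addsub_kernel([5, 7, 9], [1, 2, 3])
# sums -> [6, 9, 12], diffs -> [4, 5, 6]
```

The whole stream crosses the memory/SRF boundary once per load and store; all intermediate values stay inside the kernel, which is where the bandwidth hierarchy's savings come from.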

32 Bandwidth Demands of Triangle Transformation

33 Bandwidth Usage
[Chart: bandwidth usage]

34 Data Parallelism is easier than ILP

35 Conventional Approaches to Data-Dependent Conditional Execution
[Figure: two conventional schemes for executing B or C depending on x>0 —
- a data-dependent branch, which suffers speculative loss (~100s of operations) on a misprediction
- predication (y = (x>0); B if y, C if ~y), which suffers an exponentially decreasing duty factor as conditionals nest]

36 Zero-Cost Conditionals
- Most approaches to conditional operations are costly
  - Branching control flow: dead issue slots on mispredicted branches
  - Predication (SIMD select, masked vectors): a large fraction of execution 'opportunities' go idle
- Conditional streams
  - append an element to an output stream depending on a case variable
[Figure: value stream plus case stream {0,1} → output stream]
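A conditional stream output appends a value only when its case bit is set, so downstream kernels see a dense, homogeneous stream instead of predicated holes. A minimal sketch (function names are illustrative):

```python
def conditional_output(values, cases):
    """Append values[i] to the output stream only where cases[i] is 1;
    the output is compacted, with no idle slots."""
    return [v for v, c in zip(values, cases) if c == 1]

def conditional_input(stream, cases, default=0):
    """Inverse operation: expand a compact stream back into the
    positions where the case bit is set."""
    it = iter(stream)
    return [next(it) if c == 1 else default for c in cases]
```

Because elements are sorted into streams rather than masked in place, every cluster cycle does useful work — the "zero cost" of the slide title.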

37 Modern DRAM
- Synchronous
- Three-dimensional address structure
  - Bank
  - Row
  - Column
- Shared address lines
- Shared data lines
- Not truly random access
[Figure: concurrent and pipelined access timing]

38 DRAM Throughput and Latency
- P: bank precharge, A: row activation, C: column access
[Figure: the same reference sequence takes 56 DRAM cycles without access scheduling and 19 DRAM cycles with it]

39 Streaming Memory System
- Stream load/store architecture
- Memory access scheduling
  - Out-of-order access
  - High stream throughput
[Figure: address generators (index/data) feed per-bank buffers for DRAM banks 0 through n, with a reorder buffer on the return path]
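The benefit of reordering can be shown with a toy DRAM cost model (the cycle counts here are arbitrary, not the slide's timing): grouping pending accesses by bank and row turns repeated precharge/activate pairs into cheap row hits.

```python
def dram_cycles(accesses, t_pre=3, t_act=3, t_col=1):
    """Cycles for a sequence of (bank, row) accesses with one open row
    per bank: a row miss costs precharge + activate + column access,
    a row hit costs only the column access."""
    open_row = {}          # bank -> currently open row
    cycles = 0
    for bank, row in accesses:
        if open_row.get(bank) == row:
            cycles += t_col                    # row hit
        else:
            cycles += t_pre + t_act + t_col    # row miss
            open_row[bank] = row
    return cycles

def schedule(accesses):
    """Reorder accesses so references to the same bank and row are
    adjacent, maximizing row hits (the sort is stable, so stream
    order is preserved within each row)."""
    return sorted(accesses, key=lambda a: (a[0], a[1]))

# Two interleaved streams hitting alternating rows:
reqs = [(0, 1), (1, 1), (0, 2), (1, 2), (0, 1), (1, 1)]
```

In order, every access misses its row; scheduled, repeated rows become hits, and the reorder buffer restores stream order on the way back.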

40 Streaming Memory Performance

41 Performance
[Chart: 80-90% improvement; 35% improvement]

42 Evaluation Tools
- Simulators
  - Imagine RTL model (Verilog)
  - isim: cycle-accurate simulator (C++)
  - idebug: stream-level simulator (C++)
- Compiler tools
  - iscd: kernel scheduler (communication scheduling)
  - istream: stream scheduler (SRF/memory management)
  - schedviz: visualizer

43 Stream Programming
- Stream-level: runs on the host processor; C++ with the istream library

    load (datin1, mem1);                             // Memory to SRF
    load (datin2, mem2);
    op   (ADDSUB, datin1, datin2, datout1, datout2);
    send (datout1, P2, chan0);                       // SRF to Network
    send (datout2, P3, chan1);

- Kernel-level: runs on the clusters; C-like syntax, compiled by iscd

    loop (input_stream0) {
        input_stream0 >> a;
        input_stream1 >> b;
        c = a + b;
        d = a - b;
        output_stream0 << c;
        output_stream1 << d;
    }

44 Stereo Depth Extractor
- Convolutions:
  - Load original packed row
  - Unpack (8-bit → 16-bit)
  - 7x7 convolve
  - 3x3 convolve
  - Store convolved row
- Disparity search:
  - Load convolved rows
  - Calculate block SADs at different disparities
  - Store best disparity values

45 7x7 Convolution Kernel
[Figure: kernel schedule visualization — ALU, comm/SP, and stream columns across pipeline stages 0, 1, and 2]

46 Performance
[Chart: sustained performance on 16-bit kernels, a floating-point kernel, 16-bit applications, and a floating-point application]

47 Power
[Chart: power efficiency across applications, in GOPS/W: 4.6, 6.9, 4.1, 10.2, 9.6, 2.4, 6.3]

48 Implementation
- Goals
  - < 0.7 cm² chip in 0.18 µm CMOS
  - Operates at 500 MHz
- Status
  - Cycle-accurate C++ simulator
  - Behavioral Verilog
  - Detailed floorplan with all critical signals planned
  - Schematics and layout for key cells
  - Timing analysis of key units and communication paths

49 The Imagine Stream Processor
[Figure: Imagine block diagram]

50 Imagine Floorplan
[Figure: floorplan — SRF, SRF/SB control, address generators, and stream/re-order buffers in the center; arithmetic clusters 0-7 with cluster controllers (CC) around them; microcontroller core, host interface (HI), memory banks MB0-MB3, BitCoder, network interface (NI), and routing at the periphery; dimension callouts of 5.9 mm, 5 mm, 2 mm, 1 mm, and 0.6 mm]

51 Imagine Pipeline
[Figure: pipeline diagram — F1, F2, D, SEL, R, X1...X4, and W stages across the microcontroller and clusters, with MEM, the SRF, and the stream buffers feeding the clusters]

52 Communication paths
[Figure: communication paths among the SRF/SB control, address generators, stream buffers, SRF, cluster control (CC), host interface (HI), memory system, BitCoder, network interface (NI), routing, and clusters, overlaid on the pipeline stages]

53 Adder Layout
[Figure: layout of an integer adder with its local RF, 460 µm x 146.7 µm]

54 Imagine Team
- Professor William J. Dally
- Scott Rixner, Ghazi Ben Amor, Ujval Kapasi, Brucek Khailany, Peter Mattson, Jinyung Namkoong, John Owens, Brian Towles
- Don Alpert (Intel), Chris Buehler (MIT), JP Grossman (MIT), Brad Johanson, Dr. Abelardo Lopez-Lagunas, Ben Mowery, Manman Ren

55 Imagine Summary
- Imagine operates on streams of records
  - simplifies programming
  - exposes locality and concurrency
- Compound stream operations
  - perform a subroutine on each stream element
  - reduce global register bandwidth
- Bandwidth hierarchy
  - use bandwidth where it's inexpensive
  - distributed and hierarchical register organization
- Conditional stream operations
  - sort elements into homogeneous streams
  - avoid predication and speculation

56 Imagine gives high performance with low power and flexible programming
- Matches the capabilities of communication-limited technology to the demands of signal and image processing applications
- Performance
  - compound stream operations realize >10 GOPS on key applications
  - can be extended by partitioning an application across several Imagines (TFLOPS on a circuit board)
- Power
  - three-level register hierarchy gives 2-10 GOPS/W
- Flexibility
  - programmed in "C"
  - streaming model
  - conditional stream operations enable applications like sort

57 MPEG-2 Encoder

58 Advantages of Implementing MPEG-2 Encode on Imagine
- Block matching is well suited to Imagine
  - Clusters provide high arithmetic bandwidth
  - SRF provides the necessary data locality and bandwidth
  - Good performance
- Programmable solution
  - Simplifies system design
  - Allows trading off block-matcher complexity against bitrate and picture quality
- Real time
- Power efficient

59 MPEG-2 Challenges
- VLC (Huffman coding)
  - Difficult and inefficient to implement on SIMD clusters
  - A separate module providing LUT and shift-register capabilities is used instead
- Rate control
  - Difficult because multiple macroblocks are encoded in parallel
  - Must be performed at a coarser granularity (impact on picture quality?)
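Why VLC resists SIMD execution: each symbol's codeword has a different length, so the bit position of every write depends on the running total of all previous lengths — an inherently serial, shift-register-style dependence. A tiny bit-packing sketch using a made-up code table (not the MPEG-2 tables):

```python
# Hypothetical variable-length code table: symbol -> (codeword, bit length).
VLC_TABLE = {0: (0b0, 1), 1: (0b10, 2), 2: (0b110, 3), 3: (0b111, 3)}

def vlc_encode(symbols):
    """Pack variable-length codewords into one bitstream. Each write
    position depends on the running bit count, which is why the loop
    cannot be split across independent SIMD lanes."""
    bits, nbits = 0, 0
    for s in symbols:
        code, length = VLC_TABLE[s]
        bits = (bits << length) | code   # shift-register behavior
        nbits += length
    return bits, nbits

encoded, length = vlc_encode([1, 0, 3])
# 10 | 0 | 111 -> 0b100111, 6 bits
```

The table lookup and shift/merge in the loop are exactly the LUT and shift-register capabilities the dedicated module provides.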

