The Imagine Stream Processor
Scott Rixner
Concurrent VLSI Architecture Group
Computer Systems Laboratory, Stanford University, Stanford, CA 94305
February 23, 2000

Slide 2: Outline
• The Stanford Concurrent VLSI Architecture Group
• Media processing
• Technology
 – Distributed register files
 – Stream organization
• The Imagine Stream Processor
 – 20 GFLOPS on a 0.7 cm² chip
 – Conditional streams
 – Memory access scheduling
 – Programming tools

Slide 3: The Concurrent VLSI Architecture Group
• Architecture and design technology for VLSI
• Routing chips
 – Torus Routing Chip, Network Design Frame, Reliable Router
 – Basis for Intel, Cray/SGI, Mercury, and Avici network chips

Slide 4: Parallel computer systems
• J-Machine (MDP) led to the Cray T3D/T3E
• M-Machine (MAP)
 – Fast messaging, scalable processing nodes, scalable memory architecture
[Images: MDP chip, J-Machine, Cray T3D, MAP chip]

Slide 5: Design technology
• Off-chip I/O
 – Simultaneous bidirectional signaling (1989), now used by Intel and Hitachi
 – High-speed signaling: 4 Gb/s in 0.6 µm CMOS with equalization, 1995
• On-chip signaling
 – Low-voltage on-chip signaling
 – Low-skew clock distribution
• Synchronization
 – Mesochronous, plesiochronous
 – Self-timed design
[Figure: 4 Gb/s CMOS I/O eye diagram, 250 ps/division]

Slide 6: Forces Acting on Media Processor Architecture
• Applications: streams of low-precision samples
 – video, graphics, audio, DSL modems, cellular base stations
• Technology: becoming wire-limited
 – power and delay are dominated by communication, not arithmetic
 – global structures (register files, instruction issue) don't scale
• Microarchitecture: ILP has been mined out
 – to the point of diminishing returns on squeezing performance from sequential code
 – explicit parallelism (data parallelism and thread-level parallelism) is required to continue scaling performance

Slide 7: Stereo Depth Extraction
• 30 fps
• Requirements: 11 GOPS
• Imagine stream processor: 12.1 GOPS, 4.6 GOPS/W
[Figure: left and right camera images in, depth map out]

Slide 8: Applications
• Little locality of reference
 – read each pixel once
 – often non-unit stride
 – but there is producer-consumer locality
• Very high arithmetic intensity
 – 100s of arithmetic operations per memory reference
• Dominated by low-precision (16-bit) integer operations

Slide 9: Stream Processing
[Figure: stream dataflow: Image 0 and Image 1 input data each pass through a convolve kernel and feed the SAD kernel, which produces the depth map. Boxes are kernels; arrows are streams.]
• Little data reuse (pixels are never revisited)
• Highly data parallel (output pixels do not depend on other output pixels)
• Compute intensive (60 operations per memory reference)
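
The dataflow above can be sketched in ordinary C++ to make the pattern concrete. This is a hypothetical illustration of the SAD (sum-of-absolute-differences) computation, not Imagine kernel code: each output element depends only on a small local window of the inputs, so all outputs can be computed in parallel.

    #include <cstddef>
    #include <cstdint>
    #include <cstdlib>
    #include <vector>

    // Sum of absolute differences over a sliding window: the heart of the
    // depth extractor's disparity search. Assumes left.size() >= window
    // and that both inputs have the same length.
    std::vector<int32_t> sad_stream(const std::vector<int16_t>& left,
                                    const std::vector<int16_t>& right,
                                    int window) {
        std::vector<int32_t> out(left.size() - window + 1);
        for (std::size_t i = 0; i < out.size(); ++i) {  // data-parallel loop
            int32_t acc = 0;
            for (int j = 0; j < window; ++j)            // local window only
                acc += std::abs(left[i + j] - right[i + j]);
            out[i] = acc;                               // append to output stream
        }
        return out;
    }

Each output reads only nearby input pixels and performs many adds per pixel touched, exactly the "little data reuse, high arithmetic intensity" profile the slide describes.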

Slide 10: Locality and Concurrency
[Figure: the same convolve/SAD dataflow as slide 9]
• Operations within a kernel operate on local data
• Streams expose data parallelism
• Kernels can be partitioned across chips to exploit control parallelism

Slide 11: These Applications Demand High Performance Density
• GOPS to TOPS of sustained performance
• Space, weight, and power are limited
• Historically achieved using special-purpose hardware
• Building a special processor for each application is
 – expensive
 – time-consuming
 – inflexible

Slide 12: Performance of Special-Purpose Processors
[Figure: a special-purpose die with lots (100s) of ALUs, fed by dedicated wires and memories]

Slide 13: Care and Feeding of ALUs
[Figure: a single ALU surrounded by its instruction-bandwidth structures (IP, IR, instruction cache) and data-bandwidth structures (registers); the 'feeding' structure dwarfs the ALU]

Slide 14: Technology scaling makes communication the scarce resource
[Figure: two chip generations compared]
• 0.18 µm: 256 Mb DRAM, 16 64-bit FP processors at 500 MHz; 30,000 wire tracks; 1 clock to cross the die; repeaters every 3 mm
• 0.07 µm: 4 Gb DRAM, 64-bit FP processors at 2.5 GHz; a 25 mm die with 120,000 wire tracks; 16 clocks to cross the die; repeaters every 0.5 mm

Slide 15: What Does This Say About Architecture?
• Tremendous opportunities
 – Media problems have lots of parallelism and locality
 – VLSI technology enables 100s of ALUs per chip, 1000s soon (in 0.18 µm: 0.1 mm² per integer adder, 0.5 mm² per FP adder)
• Challenging problems
 – Locality: global structures won't work
 – Explicit parallelism: ILP won't keep 100 ALUs busy
 – Memory: streaming applications don't cache well
• It's time to try some new approaches

Slide 16: Example: Register File Organization
• Register files serve two functions:
 – short-term storage for intermediate results
 – communication between multiple function units
• Global register files don't scale well as N, the number of ALUs, increases
 – more registers are needed to hold more results (grows with N)
 – more ports are needed to connect all of the units (grows with N²)

Slide 17: Register Files Dwarf ALUs
[Figure: a datapath in which register file area dominates ALU area]

Slide 18: Register File Area
• Each cell requires:
 – 1 word line per port
 – 1 bit line per port
• Each cell therefore grows as p²
• R registers in the file
• Area: p²R ∝ N³
[Figure: register bit cell]
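
Spelling out the scaling argument on this slide as a worked formula; the proportionalities are the slide's, only the algebra is made explicit:

    % Each ALU contributes a fixed number of ports and a proportional
    % share of live intermediate results:
    %   ports:     p \propto N
    %   registers: R \propto N
    % A bit cell needs one word line and one bit line per port, so each
    % cell is p x p and the whole register file scales as
    \[
      \text{Area} \;\propto\; p^{2} R \;\propto\; N^{2} \cdot N \;=\; N^{3}
    \]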

Slide 19: SIMD and Distributed Register Files
[Figure: register file organizations: scalar vs. SIMD, central vs. distributed (DRF)]

Slide 20: Hierarchical Organizations
[Figure: the scalar/SIMD, central/DRF taxonomy extended with hierarchical organizations]

Slide 21: Stream Organizations
[Figure: the same taxonomy extended with stream organizations]

Slide 22: Organizations
• 48 32-bit ALUs, 500 MHz
• The stream organization improves on the central organization by 195x in area, 20x in delay, and 430x in power

Slide 23: Performance
[Chart annotations: 16% performance drop (8% with latency constraints); 180x improvement]

Slide 24: Scheduling Distributed Register Files
• Distributed register files mean:
 – not all functional units can access all data
 – each functional unit input/output no longer has a dedicated route from/to every register file

Slide 25: Communication Scheduling
• Schedule the operation on an instruction
• Assign the operation to a functional unit
• Schedule communication for the operation
• When should the route from operation A to operation B be scheduled?
 – It can't be scheduled when A is scheduled, since B has not yet been assigned to a functional unit
 – It can't wait until B is scheduled, since by then other routes may block it

Slide 26: Stubs: Communication Placeholders
• Stubs ensure that other communications will not block the write of Op1's result
• A copy-connected architecture ensures that any functional unit that performs Op2 can read Op1's result

Slide 27: Scheduling With Stubs
[Figure: scheduling example using stubs]

Slide 28: The Imagine Stream Processor
[Figure: Imagine block diagram]

Slide 29: Arithmetic Clusters
[Figure: arithmetic cluster organization]

Slide 30: Bandwidth Hierarchy
• VLIW clusters with shared control
• Hundreds of 32-bit operations per word of memory bandwidth
[Figure: SDRAM (2 GB/s) → Stream Register File (32 GB/s) → ALU clusters (544 GB/s)]
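
The bandwidths on the slide imply the ratios directly; this back-of-the-envelope check is derived here, not quoted from the talk:

    % Each level of the hierarchy supplies roughly 16-17x the bandwidth
    % of the level below it:
    \[
      \frac{32\ \mathrm{GB/s}}{2\ \mathrm{GB/s}} = 16, \qquad
      \frac{544\ \mathrm{GB/s}}{32\ \mathrm{GB/s}} = 17
    \]
    % So the clusters can sustain 544/2 = 272 words of operand traffic for
    % every word that crosses the SDRAM interface, which is why hundreds
    % of operations per memory word are available to kernels.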

Slide 31: Imagine is a Stream Processor
• Instructions are Load, Store, and Operate
 – operands are streams
 – Send and Receive are added for multiple-Imagine systems
• Operate performs a compound stream operation
 – read elements from input streams
 – perform a local computation
 – append elements to output streams
 – repeat until the input stream is consumed
 – (e.g., triangle transform)
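
A minimal C++ sketch of the compound-operation loop just described. The names and types are hypothetical; real Imagine kernels are written in the kernel language shown on slide 43 and run across eight clusters:

    #include <queue>

    // Compound stream operation: consume the input stream, run a local
    // computation per element, and append results to the output stream.
    template <typename In, typename Out, typename Fn>
    void compound_op(std::queue<In>& input, std::queue<Out>& output,
                     Fn kernel) {
        while (!input.empty()) {          // repeat until input is consumed
            In element = input.front();   // read an element from the input
            input.pop();
            output.push(kernel(element)); // local compute; append to output
        }
    }

Because the whole subroutine runs per element inside the cluster, intermediate values never touch the global register file or memory, which is what reduces global bandwidth demand.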

Slide 32: Bandwidth Demands of Triangle Transformation
[Figure: bandwidth demands at each level of the hierarchy for the triangle transform]

Slide 33: Bandwidth Usage
[Chart: bandwidth usage]

Slide 34: Data Parallelism is easier than ILP
[Figure: comparison of data parallelism and instruction-level parallelism]

Slide 35: Conventional Approaches to Data-Dependent Conditional Execution
[Figure: control-flow diagrams comparing a data-dependent branch, where a mispredicted branch ("whoops") incurs a speculative loss of ~100s of operations, with predication, y = (x > 0), where operations guarded by "if y" / "if ~y" give an exponentially decreasing duty factor]

Slide 36: Zero-Cost Conditionals
• Most approaches to conditional operations are costly
 – Branching control flow: dead issue slots on mispredicted branches
 – Predication (SIMD select, masked vectors): a large fraction of execution 'opportunities' goes idle
• Conditional streams
 – append an element to an output stream depending on a case variable
[Figure: a value stream and a case stream of {0,1} combine into a compacted output stream]
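
A scalar C++ sketch of the conditional-stream operation described above (hypothetical types; on Imagine the compaction happens with inter-cluster communication rather than a sequential loop):

    #include <cstddef>
    #include <vector>

    // Conditional output stream: a value is appended only when its case
    // bit is set, producing a dense stream of surviving elements instead
    // of predicated-off idle execution slots.
    std::vector<int> conditional_append(const std::vector<int>& values,
                                        const std::vector<bool>& cases) {
        std::vector<int> out;
        for (std::size_t i = 0; i < values.size(); ++i)
            if (cases[i])                 // case stream selects elements
                out.push_back(values[i]);
        return out;
    }

Downstream kernels then run at full duty factor on the compacted stream, which is where the "zero cost" comes from.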

Slide 37: Modern DRAM
• Synchronous
• Three-dimensional address structure
 – Bank
 – Row
 – Column
• Shared address lines
• Shared data lines
• Not truly random access
[Figure: concurrent access and pipelined access timing]
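
To make the "three-dimensional" structure concrete, here is a hypothetical decomposition of a linear address into DRAM coordinates. The field widths are illustrative only, not Imagine's actual memory map:

    #include <cstdint>

    struct DramAddr { unsigned bank, row, column; };

    // Split a word address into (bank, row, column). Accesses that share
    // a bank and row can reuse the open row; accesses to different banks
    // can proceed concurrently; everything shares the address/data pins.
    DramAddr split(uint32_t addr) {
        DramAddr a;
        a.column = addr & 0xFF;        // low 8 bits: column within a row
        a.bank   = (addr >> 8) & 0x3;  // next 2 bits: bank (interleaving)
        a.row    = addr >> 10;         // remaining bits: row
        return a;
    }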

Slide 38: DRAM Throughput and Latency
• P: bank precharge, A: row activation, C: column access
[Figure: the same reference sequence takes 56 DRAM cycles without access scheduling and 19 cycles with it]

Slide 39: Streaming Memory System
• Stream load/store architecture
• Memory access scheduling
 – out-of-order access
 – high stream throughput
[Figure: address generators issue indexed references through per-bank buffers to DRAM banks 0..n, and a reorder buffer restores stream order]
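
A toy sketch of the reordering decision at the core of memory access scheduling. This is a simplified "row-hit-first" policy with hypothetical structures; the real scheduler also weighs precharge, activation, and column timing constraints per bank:

    #include <cstddef>
    #include <deque>
    #include <optional>
    #include <vector>

    struct Request { int bank; int row; int column; };

    // Prefer a pending request that targets the currently open row of its
    // bank (a row hit needs only a column access); otherwise fall back to
    // the oldest request, which pays a precharge plus an activation.
    std::optional<Request> pick_next(std::deque<Request>& pending,
                                     const std::vector<int>& open_row) {
        for (std::size_t i = 0; i < pending.size(); ++i) {
            if (open_row[pending[i].bank] == pending[i].row) {  // row hit
                Request r = pending[i];
                pending.erase(pending.begin() + i);
                return r;
            }
        }
        if (pending.empty()) return std::nullopt;
        Request r = pending.front();   // oldest request: row-miss path
        pending.pop_front();
        return r;
    }

Serving row hits out of order is what collapses the 56-cycle sequence on the previous slide to 19 cycles; the reorder buffer hides the reordering from the stream units.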

Slide 40: Streaming Memory Performance
[Chart: stream throughput of the memory system]

Slide 41: Performance
[Chart annotations: 80-90% improvement; 35% improvement]

Slide 42: Evaluation Tools
• Simulators
 – Imagine RTL model (Verilog)
 – isim: cycle-accurate simulator (C++)
 – idebug: stream-level simulator (C++)
• Compiler tools
 – iscd: kernel scheduler (communication scheduling)
 – istream: stream scheduler (SRF/memory management)
 – schedviz: visualizer

Slide 43: Stream Programming
• Stream-level: runs on the host processor, C++, compiled with istream

    load (datin1, mem1); // Memory to SRF
    load (datin2, mem2);
    op (ADDSUB, datin1, datin2, datout1, datout2);
    send (datout1, P2, chan0); // SRF to Network
    send (datout2, P3, chan1);

• Kernel-level: runs on the clusters, C-like syntax, compiled with iscd

    loop (input_stream0) {
        input_stream0 >> a;
        input_stream1 >> b;
        c = a + b;
        d = a - b;
        output_stream0 << c;
        output_stream1 << d;
    }

Slide 44: Stereo Depth Extractor
• Convolutions: load original packed row → unpack (8-bit → 16-bit) → 7x7 convolve → 3x3 convolve → store convolved row
• Disparity search: load convolved rows → calculate block SADs at different disparities → store best disparity values
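
The unpack step in the pipeline above is simple enough to sketch in C++ (a hypothetical stand-in for the actual kernel): each packed 32-bit word is widened into four 16-bit pixels for the convolutions.

    #include <cstdint>
    #include <vector>

    // Unpack a row of 8-bit pixels (four per 32-bit word) into 16-bit
    // values, the precision the convolution kernels operate on.
    std::vector<int16_t> unpack_row(const std::vector<uint32_t>& packed) {
        std::vector<int16_t> pixels;
        pixels.reserve(packed.size() * 4);
        for (uint32_t word : packed)
            for (int b = 0; b < 4; ++b)   // four pixels per word
                pixels.push_back(int16_t((word >> (8 * b)) & 0xFF));
        return pixels;
    }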

Slide 45: 7x7 Convolution Kernel
[Figure: kernel schedule visualization showing ALU, communication/scratchpad, and stream activity across software pipeline stages 0, 1, and 2]

Slide 46: Performance
[Chart: sustained performance of 16-bit kernels, a floating-point kernel, 16-bit applications, and a floating-point application]

Slide 47: Power
[Chart: GOPS/W by application]

Slide 48: Implementation
• Goals
 – < 0.7 cm² chip in 0.18 µm CMOS
 – operates at 500 MHz
• Status
 – cycle-accurate C++ simulator
 – behavioral Verilog
 – detailed floorplan with all critical signals planned
 – schematics and layout for key cells
 – timing analysis of key units and communication paths

Slide 49: The Imagine Stream Processor
[Figure: annotated chip plot]

Slide 50: Imagine Floorplan
[Figure: floorplan showing the SRF and stream buffers with their controller, address generators, eight arithmetic clusters (0-7) with cluster controllers and microcontroller, the µC core, host interface, memory banks MB0-MB3, BitCoder, network interface, and routing; labeled dimensions include 5.9 mm, 5 mm, 2 mm, 1 mm, and 0.6 mm]

Slide 51: Imagine Pipeline
[Figure: pipeline diagram: microcontroller stages F1, F2, D, SEL, and R feed cluster execution stages X1-X4 with register writeback (W); the clusters connect to memory and the SRF through stream buffers]

Slide 52: Communication paths
[Figure: the floorplan and pipeline overlaid, showing control and data routes among the SRF/stream-buffer controller, address generators, stream buffers, cluster controllers, host interface, memory system, BitCoder, network interface, and the clusters]

Slide 53: Adder Layout
[Figure: integer adder with its local register file; cell dimensions labeled in µm]

Slide 54: Imagine Team
• Professor William J. Dally
 – Scott Rixner
 – Ghazi Ben Amor
 – Ujval Kapasi
 – Brucek Khailany
 – Peter Mattson
 – Jinyung Namkoong
 – John Owens
 – Brian Towles
 – Don Alpert (Intel)
 – Chris Buehler (MIT)
 – JP Grossman (MIT)
 – Brad Johanson
 – Dr. Abelardo Lopez-Lagunas
 – Ben Mowery
 – Manman Ren

Slide 55: Imagine Summary
• Imagine operates on streams of records
 – simplifies programming
 – exposes locality and concurrency
• Compound stream operations
 – perform a subroutine on each stream element
 – reduce global register bandwidth
• Bandwidth hierarchy
 – use bandwidth where it's inexpensive
 – distributed and hierarchical register organization
• Conditional stream operations
 – sort elements into homogeneous streams
 – avoid predication and speculation

Slide 56: Imagine gives high performance with low power and flexible programming
• Matches the capabilities of communication-limited technology to the demands of signal and image processing applications
• Performance
 – compound stream operations realize >10 GOPS on key applications
 – can be extended by partitioning an application across several Imagines (TFLOPS on a circuit board)
• Power
 – the three-level register hierarchy gives 2-10 GOPS/W
• Flexibility
 – programmed in "C"
 – streaming model
 – conditional stream operations enable applications like sort

Slide 57: MPEG-2 Encoder
[Figure: encoder dataflow]

Slide 58: Advantages of Implementing MPEG-2 Encode on Imagine
• Block matching is well suited to Imagine
 – clusters provide high arithmetic bandwidth
 – the SRF provides the necessary data locality and bandwidth
 – good performance
• Programmable solution
 – simplifies system design
 – allows a tradeoff between block-matcher complexity and bitrate or picture quality
• Real time
• Power efficient

Slide 59: MPEG-2 Challenges
• VLC (Huffman coding)
 – difficult and inefficient to implement on SIMD clusters (see the sketch below)
 – a separate module providing LUT and shift-register capabilities is used instead
• Rate control
 – difficult because multiple macroblocks are encoded in parallel
 – must be performed at a coarser granularity (impact on picture quality?)
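
The difficulty with VLC on SIMD clusters is a serial bit-position dependence, which this hypothetical C++ sketch makes visible: where symbol i lands in the bitstream depends on the lengths of all earlier codes, so SIMD lanes cannot emit their codes independently. The table format and encoder are illustrative, not the MPEG-2 tables or Imagine's BitCoder:

    #include <cstdint>
    #include <vector>

    struct Code { uint32_t bits; int length; }; // one VLC table entry

    // Table-driven variable-length encoder. The running bit count
    // `filled` is a serial dependence threaded through every symbol.
    std::vector<uint8_t> vlc_encode(const std::vector<int>& symbols,
                                    const std::vector<Code>& table) {
        std::vector<uint8_t> out;
        uint64_t buffer = 0;
        int filled = 0;                 // bits accumulated so far (serial!)
        for (int s : symbols) {
            buffer = (buffer << table[s].length) | table[s].bits;
            filled += table[s].length;
            while (filled >= 8) {       // flush completed bytes, MSB first
                filled -= 8;
                out.push_back(uint8_t(buffer >> filled));
            }
        }
        if (filled > 0)                 // zero-pad the final partial byte
            out.push_back(uint8_t(buffer << (8 - filled)));
        return out;
    }

This serialization is why the design moves VLC into a separate module with lookup-table and shift-register support rather than running it on the clusters.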