VLSI SP Course 2001 台大電機吳安宇 1 Why Systolic Architecture ? H. T. Kung Carnegie-Mellon University.

Slides:

Advertisements

Similar presentations

Systolic Arrays & Their Applications

Advertisements

Multi-cellular paradigm The molecular level can support self- replication (and self- repair). But we also need cells that can be designed to fit the specific.

EECC756 - Shaaban #1 lec # 1 Spring Systolic Architectures Replace single processor with an array of regular processing elements Orchestrate.

Why Systolic Architecture ?. Motivation & Introduction We need a high-performance, special-purpose computer system to meet specific application. I/O and.

Processor System Architecture

Parallell Processing Systems1 Chapter 4 Vector Processors.

VLSI Communication SystemsRecap VLSI Communication Systems RECAP.

System Development. Numerical Techniques for Matrix Inversion.

1 Introduction to Data Parallel Architectures Sima, Fountain and Kacsuk Chapter 10 CSE462.

Examples of One-Dimensional Systolic Arrays

UNIVERSITY OF MASSACHUSETTS Dept

ACCESS IC LAB Graduate Institute of Electronics Engineering, NTU Why Systolic Architecture ? VLSI Signal Processing 台灣大學電機系吳安宇.

Applications of Systolic Array FTR, IIR filtering, and 1-D convolution. 2-D convolution and correlation. Discrete Furier transform Interpolation 1-D and.

Examples of One- Dimensional Systolic Arrays Motivation & Introduction We need a high-performance, special-purpose computer system to meet specific application.

UNIVERSITY OF MASSACHUSETTS Dept

Digital Kommunikationselektronik TNE027 Lecture 4 1 Finite Impulse Response (FIR) Digital Filters Digital filters are rapidly replacing classic analog.

Reconfigurable Computing S. Reda, Brown University Reconfigurable Computing (EN2911X, Fall07) Lecture 16: Application-Driven Hardware Acceleration (1/4)

Recap – Our First Computer WR System Bus 8 ALU Carry output A B S C OUT F 8 8 To registers’ input/output and clock inputs Sequence of control signal combinations.

Chapter 2: Impact of Machine Architectures What is the Relationship Between Programs, Programming Languages, and Computers.

1 Lecture 24: Parallel Algorithms I Topics: sort and matrix algorithms.

A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications From J. Fowers, G. Brown, P. Cooke, and G. Stitt, University.

Copyright 2008 Koren ECE666/Koren Part.6a.1 Israel Koren Spring 2008 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Digital Computer.

Introduction to Parallel Processing Ch. 12, Pg

Low power and cost effective VLSI design for an MP3 audio decoder using an optimized synthesis- subband approach T.-H. Tsai and Y.-C. Yang Department of.

1 Real time signal processing SYSC5603 (ELG6163) Digital Signal Processing Microprocessors, Software and Applications Miodrag Bolic.

ACCESS IC LAB Graduate Institute of Electronics Engineering, NTU A New Algorithm to Compute the Discrete Cosine Transform VLSI Signal Processing 台灣大學電機系.

Introduction to Convolution circuits synthesis image processing, speech processing, DSP, polynomial multiplication in robot control. convolution.

Computer Organization Computer Organization & Assembly Language: Module 2.

Abdullah Aldahami ( ) Feb26, Introduction 2. Feedback Switch Logic 3. Arithmetic Logic Unit Architecture a.Ripple-Carry Adder b.Kogge-Stone.

Chapter One Introduction to Pipelined Processors.

Constraint Directed CAD Tool For Automatic Latency-optimal Implementation of FPGA-based Systolic Arrays Greg Nash Reconfigurable Technology: FPGAs and.

CHAPTER 12 INTRODUCTION TO PARALLEL PROCESSING CS 147 Guy Wong page

High Performance Scalable Base-4 Fast Fourier Transform Mapping Greg Nash Centar 2003 High Performance Embedded Computing Workshop

Radix-2 2 Based Low Power Reconfigurable FFT Processor Presented by Cheng-Chien Wu, Master Student of CSIE,CCU 1 Author: Gin-Der Wu and Yi-Ming Liu Department.

A Reconfigurable Low-power High-Performance Matrix Multiplier Architecture With Borrow Parallel Counters Counters : Rong Lin SUNY at Geneseo

Computer Organization - 1. INPUT PROCESS OUTPUT List different input devices Compare the use of voice recognition as opposed to the entry of data via.

ACCESS IC LAB Graduate Institute of Electronics Engineering, NTU CORDIC (Coordinate rotation digital computer) Ref: Y. H. Hu, “CORDIC based VLSI architecture.

Chapter 4 MARIE: An Introduction to a Simple Computer.

Spatiotemporal Saliency Map of a Video Sequence in FPGA hardware David Boland Acknowledgements: Professor Peter Cheung Mr Yang Liu.

Copyright © 2008 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Data Manipulation Brookshear, J.G. (2012) Computer Science: an Overview.

A Programmable Single Chip Digital Signal Processing Engine MAPLD 2005 Paul Chiang, MathStar Inc. Pius Ng, Apache Design Solutions.

CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #12 – Systolic.

Recursive Architectures for 2DLNS Multiplication RESEARCH CENTRE FOR INTEGRATED MICROSYSTEMS - UNIVERSITY OF WINDSOR 11 Recursive Architectures for 2DLNS.

MODULE 5 INTEL TODAY WE ARE GOING TO DISCUSS ABOUT, FEATURES OF 8086 LOGICAL PIN DIAGRAM INTERNAL ARCHITECTURE REGISTERS AND FLAGS OPERATING MODES.

1 Basic Processor Architecture. 2 Building Blocks of Processor Systems CPU.

Fast VLSI Implementation of Sorting Algorithm for Standard Median Filters Hyeong-Seok Yu SungKyunKwan Univ. Dept. of ECE, Vada Lab.

ACCESS IC LAB Graduate Institute of Electronics Engineering, NTU Multirate Processing of Digital Signals (II): Short-Length FIR Filter VLSI Signal Processing.

ACCESS IC LAB Graduate Institute of Electronics Engineering, NTU Brief Overview of Residue Number System (RNS) VLSI Signal Processing 台灣大學電機系吳安宇.

EE141 Arithmetic Circuits 1 Chapter 14 Arithmetic Circuits Rev /12/2003 Rev /05/2003.

Chapter 11 System Performance Enhancement. Basic Operation of a Computer l Program is loaded into memory l Instruction is fetched from memory l Operands.

FPGA-Based System Design: Chapter 3 Copyright  2004 Prentice Hall PTR Topics n FPGA fabric architecture concepts.

Reconfigurable Computing - Options in Circuit Design John Morris Chung-Ang University The University of Auckland ‘Iolanthe’ at 13 knots on Cockburn Sound,

Recap – Our First Computer WR System Bus 8 ALU Carry output A B S C OUT F 8 8 To registers’ read/write and clock inputs Sequence of control signal combinations.

Full Adder Truth Table Conjugate Symmetry A B C CARRY SUM

CORDIC (Coordinate rotation digital computer)

Fang Fang James C. Hoe Markus Püschel Smarahara Misra

Hiba Tariq School of Engineering

CORDIC (Coordinate rotation digital computer)

Pipelining and Retiming 1

Architecture & Organization 1

Course Name: Computer Application Topic: Central Processing Unit (CPU)

Examples of One-Dimensional Systolic Arrays

Architecture & Organization 1

Brief Overview of Residue Number System (RNS)

UNIVERSITY OF MASSACHUSETTS Dept

Real time signal processing

UNIVERSITY OF MASSACHUSETTS Dept

Why systolic architectures?

Presentation transcript:

VLSI SP Course 2001 台大電機吳安宇 1 Why Systolic Architecture ? H. T. Kung Carnegie-Mellon University

VLSI SP Course 2001 台大電機吳安宇 2 Motivation & Introduction We need a high-performance, special-purpose computer system to meet specific application. I/O and computation imbalance is a notable problem. The concept of Systolic architecture can map high-level computation into hardware structures. Systolic system works like an automobile assembly line. Systolic system is easy to implement because of it regularity and easy to reconfigure. Systolic architecture can result in cost-effective, high- performance special-purpose systems for a wide range of problems.

VLSI SP Course 2001 台大電機吳安宇 3 Key architectural issues in designing special-purpose systems Simple and regular design Simple, regular design yields cost-effective special systems. Concurrency and communication Design algorithm to support high concurrency and meantime to employ only simple. Balancing computation with I/O A special-purpose system should be match a variety of I/O bandwidth.

VLSI SP Course 2001 台大電機吳安宇 4 The basic principle of systolic architecture Systolic system consists of a set interconnected cells, each capable of performing some simple operation. Systolic approach can speed up a compute-bound computation in a relatively simple and inexpensive manner. A systolic array in particular, is illustrated in next page. (we achieve higher computation throughput without increasing memory bandwidth)

VLSI SP Course 2001 台大電機吳安宇 5 Basic principle of a systolic system

VLSI SP Course 2001 台大電機吳安宇 6 A family of systolic designs for convolution computation Given the sequence of weight {w 1, w 2,..., w k } And the input sequence {x 1, x 2,..., x k }, Compute the result sequence {y 1, y 2,..., y n+1-k } Defined by y i = w 1 x i + w 2 x i w k x i+k-1

VLSI SP Course 2001 台大電機吳安宇 7 Design B1 Broadcast input, move results, weights stay [(Semi-) systolic convolution arrays with global data communication] Previously propose for circuits to implement a pattern matching processor and for circuit to implement polynomial multiplication.

VLSI SP Course 2001 台大電機吳安宇 8 Design B2 Broadcast input, move weights, results stay [(Semi-) systolic convolution arrays with global data communication] The path for moving y i ’s is wider then w i ’s because of y i ’s carry more bits then w i ’s in numerical accuracy. The use of multiplier- accumulators may also help increase precision of the result, since extra bit can be kept in these accumulators with modest cost.

VLSI SP Course 2001 台大電機吳安宇 9 Design F Fan-in results, move inputs, weights stay [(Semi-) systolic convolution arrays with global data communication] When number of cell is large, the adder can be implemented as a pipelined adder tree to avoid large delay. Design of this type using unbounded fan-in.

VLSI SP Course 2001 台大電機吳安宇 10 Design R1 Results stay, inputs and weights move in opposite directions [(Pure-) systolic convolution arrays with global data communication] Design R1 has the advantage that it dose not require a bus, or any other global net-work, for collecting output from cells. The basic ideal of this design has been used to implement a pattern matching chip.

VLSI SP Course 2001 台大電機吳安宇 11 Design R2 Results stay, inputs and weights move in the same direction but at different speeds [(Pure-) systolic convolution arrays with global data communication] Multiplier-accumulator can be used effectively and so can tag bit method to signal the output of each cell. Compared with R1, all cells work all the time when additional register in each cell to hold a w value.

VLSI SP Course 2001 台大電機吳安宇 12 Design W1 Weights stay, inputs and results move in opposite direction [(Pure-) systolic convolution arrays with global data communication] This design is fundamental in the sense that it can be naturally extend to perform recursive filtering. This design suffers the same drawback as R1, only appro- ximately 1/2 cells work at any given time unless two inde- pendent computation are in- terleaved in the same array.

VLSI SP Course 2001 台大電機吳安宇 13 Design W2 Weights stay, inputs and results move in the same direction but at different speeds [(Pure-) systolic convolution arrays with global data communication] This design lose one advantage of W1, the constant response time. This design has been extended to implement 2-D convolution, where high throughputs rather than fast response are of concern.

VLSI SP Course 2001 台大電機吳安宇 14 Remarks Designs Above are all possible systolic designs for the convolution problem. Using a systolic control path, weight can be selected on- the-fly to implement interpolation or adaptive filtering. We need to understand precisely the strengths and drawbacks of each design so that an appropriate design can be selected for a given environment. For improving throughput, it may be worthwhile to implement multiplier and adder separately to allow overlapping of their execution. (Such as next page show) When chip pin is considered, pure-systolic require four ; semi-systolic require three I/O ports.

VLSI SP Course 2001 台大電機吳安宇 15 Overlapping the executions of multiply and add in design W1

VLSI SP Course 2001 台大電機吳安宇 16 Criteria and advantages The design makes multiple use of each input data item Because of this property, systolic systems can achieve high throughputs with modest I/O bandwidths for outside communication. The design uses extensive concurrency Concurrency can be obtained by pipelining the stages involved in the computation of each single result, by multiprocessing many results in parallel, or by both.

VLSI SP Course 2001 台大電機吳安宇 17 On-the-fly least-squares solutions using one and two dimensional systolic array, with p=4.

VLSI SP Course 2001 台大電機吳安宇 18 Criteria and advantages There are only a few types of simple cells To achieve performance goals, a systolic system is likely to use a large number of cells which must be simple and of only a few types to curtail design and implementation cost. Data and control flow are simple and regular Pure systolic system totally avoid long-distance or irregular wires for data communication.

VLSI SP Course 2001 台大電機吳安宇 19 Applications base on systolic array with convolution computation FTR, IIR filtering, and 1-D convolution. 2-D convolution and correlation. Discrete Fourier transform Interpolation 1-D and 2-D median filtering Geometric warping *Signal and image processing :

VLSI SP Course 2001 台大電機吳安宇 20 Applications base on systolic array with convolution computation Matrix-vector multiplication Matrix-matrix multiplication Matrix triangularization (solution of linear systems, matrix inversion) QR decomposition (eigenvalue, least-square computation) Solution of triangular linear systems *Matrix arithmetic :

VLSI SP Course 2001 台大電機吳安宇 21 Applications base on systolic array with convolution computation Data structure Graph algorithm Language recognition Dynamic programming Encoder (polynomial division) Relational data-base operations Non-numeric applications :