Introduction to convolution circuit synthesis. Applications of convolution: image processing, speech processing, DSP, and polynomial multiplication in robot control.


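FIR-filter-like structure. Inputs and outputs are separate and move in synchrony; weights stay fixed in space. [Figure, step 1: cells with weights b4, b3, b2, b1 joined by three adders; input a4; registers 0, 0, 0. Output: a4*b4.]

[Figure, step 2: input a3; registers a4, 0, 0. Outputs so far: a4*b4; a3*b4 + a4*b3.]

[Figure, step 3: input a2; registers a3, a4, 0. Outputs so far: a4*b4; a3*b4 + a4*b3; a4*b2 + a3*b3 + a2*b4.]

[Figure, step 4: input a1; registers a2, a3, a4. Outputs so far: a4*b4; a3*b4 + a4*b3; a4*b2 + a3*b3 + a2*b4; a1*b4 + a2*b3 + a3*b2 + a4*b1.]

[Figure, step 5: input 0; registers a1, a2, a3. Outputs so far: a4*b4; a3*b4 + a4*b3; a4*b2 + a3*b3 + a2*b4; a1*b4 + a2*b3 + a3*b2 + a4*b1; a1*b3 + a2*b2 + a3*b1.]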
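The snapshots of the FIR-filter-like structure can be reproduced with a small software model. The Python sketch below is not from the slides: it assumes the structure is a tapped delay line in which the newest sample multiplies b4, and the numeric values standing in for a1..a4 and b1..b4 are hypothetical placeholders.

```python
def fir_shift_register(samples, weights):
    """Feed one sample per clock into a tapped delay line.

    weights[0] multiplies the newest sample, weights[-1] the oldest.
    """
    delay_line = [0] * len(weights)
    outputs = []
    for sample in samples:
        # Shift: the new sample enters, the oldest value falls off the end.
        delay_line = [sample] + delay_line[:-1]
        # The chain of adders sums all tap products within the same clock.
        outputs.append(sum(w * x for w, x in zip(weights, delay_line)))
    return outputs


# Hypothetical numeric values standing in for a1..a4 and b1..b4.
a1, a2, a3, a4 = 1, 2, 3, 4
b1, b2, b3, b4 = 5, 6, 7, 8

# a4 enters first, followed by a3, a2, a1 and three zeros that flush the line.
stream = [a4, a3, a2, a1, 0, 0, 0]
results = fir_shift_register(stream, [b4, b3, b2, b1])
print(results)
```

The first five entries of `results` correspond to the five output expressions listed in the snapshots; the last two come from flushing the delay line with zeros. Note that all additions happen in one clock, which is the "many levels of logic" problem that motivates adding flip-flops.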

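We insert D flip-flops (Dffs) to avoid many levels of logic. [Figure: weights b4, b3, b2, b1 with three adders; input a4 is broadcast, with a3, a2 waiting. Products: a4*b4; a4*b3; a4*b2; a4*b1.]

[Figure: input a3 is broadcast, with a2, a1 waiting. Sums: a4*b4; a4*b3 + a3*b4; a4*b2 + a3*b3; a4*b1 + a3*b2; a3*b1.]

[Figure: input a2 is broadcast, with a1, 0 waiting. Sums: a4*b4; a4*b3 + a3*b4; a4*b2 + a3*b3 + a2*b4; a4*b1 + a3*b2 + a2*b3; a3*b1 + a2*b2; a2*b1.] The disadvantage of this circuit is broadcasting.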

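We insert more Dffs to avoid broadcasting. [Figure: weights b4, b3, b2, b1 with three adders; inputs a4, a3, a2; first product a4*b.]

[Figure: inputs a3, a2, a1; values a4*b4; a3*b4; a4*b3; 0; a4, 0, 0; 0.] It does not work correctly like this, so we try something new.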

[Figure: weights b4, b3, b2, b1; inputs a3, a2, a1; values a4*b4; a3*b4; a4*b3; 0; a4, 0, 0; 0. Products feeding the first sum and the second sum: a2b4, a1b4, a3b3, a2b3, a1b, a4b2, a3b2, a2b2, a1b, a4b1, a3b1, a2b1.]

FIR-filter-like structure, assuming two delays. [Figure: cells with weights b4, b3, b2, b1 joined by three adders.]


Convolution algorithm: two loops. Patterns of operations on vectors:
1. Vector product (not the dot product): [a0*b0, a1*b1, …, an*bn]
2. Dot product (scalar product) = vector product with accumulation: a0*b0 + a1*b1 + … + an*bn
3. Polynomial multiplication
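The three vector patterns can be written out side by side. This is an illustrative Python sketch, not code from the slides; the function names are hypothetical, and polynomial coefficients are assumed to be listed from the constant term upward.

```python
def elementwise_product(a, b):
    """Pattern 1: vector product, one product per position."""
    return [x * y for x, y in zip(a, b)]


def dot_product(a, b):
    """Pattern 2: the vector product followed by accumulation."""
    return sum(elementwise_product(a, b))


def polynomial_multiply(a, b):
    """Pattern 3: two nested loops; every a_i meets every b_j."""
    result = [0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            result[i + j] += a_i * b_j
    return result


a = [1, 2, 3]
b = [4, 5, 6]
products = elementwise_product(a, b)
total = dot_product(a, b)
poly = polynomial_multiply(a, b)  # (1 + 2x + 3x^2) * (4 + 5x + 6x^2)
print(products, total, poly)
```

Each coefficient of the polynomial product is itself a dot product of one vector with a shifted, reversed copy of the other, which is why the same hardware patterns serve FIR filtering and polynomial multiplication.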

Example 3: FIR Filter or Convolution

Example 3: Convolution. There are many ways to implement convolution using systolic arrays; one of them is shown:
- u(n): the input sequence, entering from the left.
- w(n): the weights, preloaded in the n PEs.
- y(n): the result sequence, entering from the right (initial value 0) and moving at the same speed as u(n).
In this operation each cell's function is:
1. Multiply the input coming from the left by the cell's weight, and pass the received input on to the next cell.
2. Add the product to the value arriving from the right.
[Figure: cells W_0, W_1, W_2, W_3; u_i … u_0 enter from the left; y_i … y_0 enter from the right, initialized to 0. Cell W_i has ports a_in, a_out, b_in, b_out, with $a_{out} = a_{in}$ and $b_{out} = b_{in} + a_{in} \cdot w_i$.]

Convolution (cont.): each cell operation. Systolic array; the input sequence enters from the left. This is just one solution to this problem. Weights are fixed in space; inputs and outputs pass through each cell. [Figure: cells W_0, W_1, W_2, W_3; u_i … u_0 enter from the left; y_i … y_0 enter from the right, initialized to 0. Cell W_i: $a_{out} = a_{in}$, $b_{out} = b_{in} + a_{in} \cdot w_i$.]

Convolution:
- Can be 1D, 2D, 3D, etc.
- Is very important in many applications.
- Can be implemented efficiently in various architectures.
- Is an excellent example for comparing various computer architectures: SIMD, MIMD, CA, pipelined, systolic.

Various possible implementations of the convolution algorithm. Convolution is very important; we use it in several applications. So let us think about all the possible ways to implement it.

Bag of tricks that can be used:
- Preload-repeated-value
- Replace-feedback-with-register
- Internalize-data-flow
- Broadcast-common-input
- Propagate-common-input
- Retime-to-eliminate-broadcasting

How to invent such circuits?
1. Let us learn from existing designs.
2. Let us learn from our own mistakes.
3. Let us check all possibilities of moving every piece of data.

Bogus Attempt at Systolic FIR
  for i = 1 to n in parallel
    for j = 1 to k in place
      y_i += w_j * x_{i+j-1}
- Stage 1: directly from the equation.
- Stage 2: the feedback on y_i from the sequential implementation is replaced with a register.
- Stage 3: the inner loop is realized in place (internal loop in space).
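A plain sequential reading of this loop nest is a useful reference point before any mapping to hardware. The Python sketch below is an assumption-laden illustration, not the slide's circuit: it uses 0-based indices (so `x[i + j]` plays the role of x_{i+j-1}), it computes only the outputs for which all k inputs exist, and the name `fir_sequential` is hypothetical. The local variable `acc` stands in for the register that replaces the feedback on y_i.

```python
def fir_sequential(x, w):
    """Sequential version of: y_i += w_j * x_{i+j-1}."""
    num_outputs = len(x) - len(w) + 1
    y = [0] * num_outputs
    for i in range(num_outputs):   # outer loop ("in parallel" on the slide)
        acc = 0                    # register replacing the feedback on y_i
        for j in range(len(w)):    # inner loop ("in place" on the slide)
            acc += w[j] * x[i + j]
        y[i] = acc
    return y


x = [1, 2, 3, 4, 5]
w = [2, 3, 4]
y = fir_sequential(x, w)
print(y)
```

Every architecture attempt that follows can be checked against this reference: whatever moves or stays in the array, the values leaving it should match `y`.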

Bogus Attempt continued: Outer Loop
  for i = 1 to n in parallel
    for j = 1 to k in place
      y_i += w_j * x_{i+j-1}
Factorize w_j. This could work, but it has broadcast.

Bogus Attempt continued: Outer Loop - 2
  for i = 1 to n in parallel
    for j = 1 to k in place
      y_i += w_j * x_{i+j-1}
Because we do not want broadcast, we retime the signal w; this also requires retiming of x_j.

Bogus Attempt continued: Outer Loop - 2a
  for i = 1 to n in parallel
    for j = 1 to k in place
      y_i += w_j * x_{i+j-1}
Another possibility of retiming.

Bogus Attempt continued: Outer Loop - 3
  for i = 1 to n in parallel
    for j = 1 to k in place
      y_i += w_j * x_{i+j-1}
Yet another approach is to broadcast the common input x_{i-1}.

What did we achieve? We showed several possible beginnings of creating architectures. They were not successful, but they show the principles. We will continue to create architectures; these attempts were not a complete waste of time, because they can be used successfully in similar problems. You have to experiment with ideas!

Attempt at Systolic FIR: now the internal loop is in parallel.

Attempt at Systolic FIR: now the internal loop is in parallel. [Figure: steps 1, 2, 3.]

Outer Loop continuation for FIR filter

Continue: optimize outer loop (preload-repeated-value). Based on the previous slide, we can preload the weights w_i.

Continue: optimize outer loop (broadcast-common-value). This design has broadcast. Some purists say this is not systolic, because a systolic design should have only short wires.

Continue: optimize outer loop (retime-to-eliminate-broadcast). We delay the signals y_i.

The design is no longer intuitive, so we have to explain in detail how it works. [Figure: inputs x1, x2; outputs; y1 = x1*w1.] Was it a good idea to combine the input and output streams to the cells?

Polynomial Multiplication of 1-D convolution problem

Types of systolic structure: the convolution problem (polynomial multiplication as a 1-D convolution problem)
- weights: $\{w_1, w_2, \ldots, w_k\}$
- inputs: $\{x_1, x_2, \ldots, x_n\}$
- results: $\{y_1, y_2, \ldots, y_{n+1-k}\}$
- $y_i = w_1 x_i + w_2 x_{i+1} + \cdots + w_k x_{i+k-1}$ (combining two data streams)
- H. T. Kung's grouping work; assume k = 3

A well-known family of systolic designs for convolution computation. Given the sequence of weights $\{w_1, w_2, \ldots, w_k\}$ and the input sequence $\{x_1, x_2, \ldots, x_n\}$, compute the result sequence $\{y_1, y_2, \ldots, y_{n+1-k}\}$ defined by $y_i = w_1 x_i + w_2 x_{i+1} + \cdots + w_k x_{i+k-1}$.
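As a worked instance of this definition, assume k = 3 weights and n = 5 inputs (sizes chosen only for illustration). Each result combines three consecutive inputs, so there are n + 1 - k = 3 results:

```latex
\begin{align*}
y_1 &= w_1 x_1 + w_2 x_2 + w_3 x_3 \\
y_2 &= w_1 x_2 + w_2 x_3 + w_3 x_4 \\
y_3 &= w_1 x_3 + w_2 x_4 + w_3 x_5
\end{align*}
```

Reading down a column shows what each design must arrange: every weight w_j meets a run of consecutive inputs, and every input x_m meets the weights in descending order as it contributes to later and later results.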

Design B1: broadcast inputs, move results systolically, weights stay (semi-systolic convolution arrays with global data communication).

Design B1: broadcast inputs, move results systolically, weights stay (semi-systolic convolution arrays with global data communication). Previously proposed for circuits to implement a pattern matching processor and for circuits to implement polynomial multiplication.

Types of systolic structure: design B1
- Inputs are broadcast; partial results y_i move systolically and exit the array; weights stay.
- Wider systolic path (the partial results y_i move).
- Cell: $y_{out} = y_{in} + W \cdot x_{in}$.
- [Figure: inputs x_3, x_2, x_1 broadcast to cells W_1, W_2, W_3; results y_3, y_2, y_1 move out.]
- Please analyze this circuit by drawing snapshots, as in an animated movie, of the data at subsequent moments of time.
- Discuss the disadvantages of broadcast.
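The snapshot exercise for design B1 can also be done in software. The following Python sketch is one possible cycle-by-cycle model, under assumptions that are not stated on the slide: cell c holds `w[c]`, a fresh y (zero) enters the leftmost cell every cycle, and the first valid result leaves the rightmost cell once the pipeline is full. The name `design_b1` is hypothetical.

```python
def design_b1(x, w):
    """Broadcast x, move y systolically left to right, weights stay."""
    k = len(w)
    y_regs = [0] * k   # y_regs[c]: partial result latched at the output of cell c
    outputs = []
    for cycle, x_t in enumerate(x):
        # A zero-valued y enters cell 0; every other cell takes its
        # left neighbour's value from the previous cycle.
        y_in = [0] + y_regs[:-1]
        # x_t is broadcast: every cell uses the same input in this cycle.
        y_regs = [y_in[c] + w[c] * x_t for c in range(k)]
        if cycle >= k - 1:   # earlier values have not passed through all cells
            outputs.append(y_regs[-1])
    return outputs


b1_results = design_b1([1, 2, 3, 4, 5], [2, 3, 4])
print(b1_results)
```

In this model the broadcast shows up as the single `x_t` read by all k cells in one cycle; in hardware that is one long wire with fan-out k, which is the disadvantage the slide asks you to discuss.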

Design B2: inputs broadcast, weights move, results stay.

Types of systolic structure: design B2
- Inputs are broadcast; weights move; results stay.
- The w_i circulate.
- Uses multiplier-accumulator hardware.
- Each w_i has a tag bit (it signals the accumulator to output its result).
- Needs a separate bus (or other global network) for collecting output.
- Cell: $y = y + W_{in} \cdot x_{in}$, $W_{out} = W_{in}$.
- [Figure: inputs x_3, x_2, x_1 broadcast to cells holding y_1, y_2, y_3; weights W_2, W_3, W_1 circulate.]

Design B2: broadcast inputs, move weights, results stay (semi-systolic convolution arrays with global data communication). It is semi-systolic because of the broadcast. The path for moving the y_i's (as in B1) is wider than the path for moving the w_i's, because for numerical accuracy the y_i's carry more bits than the w_i's. The use of multiplier-accumulators may also help increase the precision of the result, since extra bits can be kept in these accumulators at modest cost.
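A software model makes the role of the tag bit in design B2 concrete. This Python sketch rests on assumptions of mine rather than on the slide: weights enter the leftmost cell in the order w_1, w_2, ..., w_k and recirculate from the last cell to the first, only w_k carries the tag, and the list `collected` stands in for the separate output bus. The name `design_b2` is hypothetical.

```python
def design_b2(x, w):
    """Broadcast x, circulate tagged weights, accumulate each y in place."""
    k = len(w)
    # Each weight travels with a tag bit; only the last weight is tagged.
    waiting = [(value, j == k - 1) for j, value in enumerate(w)]
    in_cells = [None] * k   # weight currently at each cell (None: not arrived yet)
    acc = [0] * k           # one stationary accumulator per cell
    collected = []          # stands in for the separate output bus
    for x_t in x:           # x_t is broadcast to every cell in this cycle
        # Weights advance one cell; the one leaving the last cell re-enters the first.
        entering = waiting.pop(0) if waiting else in_cells[-1]
        in_cells = [entering] + in_cells[:-1]
        for c, item in enumerate(in_cells):
            if item is None:
                continue
            value, tagged = item
            acc[c] += value * x_t
            if tagged:      # the tag tells the accumulator to output and reset
                collected.append(acc[c])
                acc[c] = 0
    return collected


b2_results = design_b2([1, 2, 3, 4, 5], [2, 3, 4])
print(b2_results)
```

Results are produced by a different cell each cycle, in round-robin order, which is why this design needs a bus or another global network to collect them.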

Design F: inputs move, weights stay, partial results fan in (needs an adder).

Types of systolic structure: design F
- Inputs move; weights stay; partial results fan in.
- Needs an adder.
- Applications: signal processing, pattern matching.
- Cell: $Z_{out} = W \cdot x_{in}$, $x_{out} = x_{in}$.
- [Figure: inputs x_3, x_2, x_1 move through cells W_3, W_2, W_1; the Z_out values fan in to an ADDER that produces the y's.]

Design F: fan-in results, move inputs, weights stay (semi-systolic convolution arrays with global data communication). When the number of cells is large, the adder can be implemented as a pipelined adder tree to avoid a large delay. Designs of this type use unbounded fan-in.
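The fan-in structure of design F can be sketched as follows. This Python model is an illustration under stated assumptions: inputs enter at the cell holding the last weight and move toward the cell holding the first, and the adder tree is modeled as purely combinational (its pipeline registers, and the latency they add, are left out). The names `adder_tree` and `design_f` are hypothetical.

```python
def adder_tree(values):
    """Sum values pairwise, level by level; returns (total, number_of_levels)."""
    level = list(values)
    depth = 0
    while len(level) > 1:
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])   # an odd element passes to the next level
        level = paired
        depth += 1
    return level[0], depth


def design_f(x, w):
    """Inputs move, weights stay, all cell products fan in to one adder."""
    k = len(w)
    x_regs = [0] * k   # x_regs[c]: input currently in the cell holding w[c]
    outputs = []
    for cycle, x_t in enumerate(x):
        # The new input enters at the cell holding w[k-1]; older inputs
        # move one cell toward the cell holding w[0].
        x_regs = x_regs[1:] + [x_t]
        z = [w[c] * x_regs[c] for c in range(k)]   # Z_out of every cell
        total, _ = adder_tree(z)
        if cycle >= k - 1:   # wait until every cell holds a real input
            outputs.append(total)
    return outputs


f_results = design_f([1, 2, 3, 4, 5], [2, 3, 4])
tree_total, tree_depth = adder_tree([1, 2, 3, 4, 5])
print(f_results, tree_total, tree_depth)
```

The depth returned by `adder_tree` grows with the logarithm of the number of cells, which is the delay a pipelined tree would hide behind registers.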

Design R1: inputs and weights move in opposite directions; results stay. It can use a tag bit and needs no bus (the systolic output path is sufficient). Only about one-half of the cells work at any time.

Types of systolic structure: design R1
- Inputs and weights move in opposite directions; results stay.
- Can use a tag bit.
- No bus (the systolic output path is sufficient).
- Only about one-half of the cells work at any time.
- Applications: pattern matching.
- Cell: $y = y + W_{in} \cdot x_{in}$, $x_{out} = x_{in}$, $W_{out} = W_{in}$.
- [Figure: inputs x_1, x_2, x_3 and weights W_1, W_2 move in opposite directions through cells holding y_3, y_2, y_1.]

Design R1: results stay, inputs and weights move in opposite directions (pure-systolic convolution arrays without global data communication). Design R1 has the advantage that it does not require a bus, or any other global network, for collecting output from cells. The basic idea of this design has been used to implement a pattern matching chip.

Design R2: inputs and weights move in the same direction at different speeds; results stay. The x_j's move twice as fast as the w_j's. All cells work at any time. Additional registers are needed (to hold a w value).

Types of systolic structure: design R2
- Inputs and weights move in the same direction at different speeds; results stay.
- The x_j's move twice as fast as the w_j's.
- All cells work at any time.
- Needs additional registers (to hold a w value).
- Applications: pipeline multiplier.
- Cell: $y = y + W_{in} \cdot x_{in}$, $W = W_{in}$, $W_{out} = W$, $x_{out} = x_{in}$.
- [Figure: weights W_1 … W_5 and inputs x_3, x_2, x_1 move through cells holding y_1, y_2, y_3; each cell has an extra register W.]

Design R2: results stay, inputs and weights move in the same direction but at different speeds (pure-systolic convolution arrays without global data communication). A multiplier-accumulator can be used effectively, and so can the tag-bit method for signaling the output of each cell. Compared with R1, all cells work all the time, at the cost of an additional register in each cell to hold a w value.

Design W1: inputs and results move in opposite directions; weights stay. Only about one-half of the cells work at any time. Constant response time.

Types of systolic structure: design W1
- Inputs and results move in opposite directions; weights stay.
- Only about one-half of the cells work at any time.
- Constant response time.
- Applications: polynomial division.
- Cell: $y_{out} = y_{in} + W \cdot x_{in}$, $x_{out} = x_{in}$.
- [Figure: inputs x_1, x_2, x_3 and results y move in opposite directions through cells W_1, W_2, W_3.]

Design W1: weights stay, inputs and results move in opposite directions (pure-systolic convolution arrays without global data communication). This design is fundamental in the sense that it can be naturally extended to perform recursive filtering. It suffers the same drawback as R1: only approximately one-half of the cells work at any given time, unless two independent computations are interleaved in the same array.
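The half-utilization of design W1 is easy to see in a cycle-level model. The Python sketch below is my own reading of the design, with these assumptions labeled: x enters from the left and y from the right, both on every second cycle; the rightmost cell holds the first weight so that each y meets its inputs in order; and a y is complete when it has passed the leftmost cell. The name `design_w1` and the `busy_counts` bookkeeping are hypothetical.

```python
def design_w1(x, w):
    """Weights stay; x moves right and y moves left, one cell per cycle."""
    k = len(w)
    n_out = len(x) - k + 1
    cell_w = list(reversed(w))   # leftmost cell holds w[k-1], rightmost holds w[0]
    x_at = [None] * k            # x value sitting at each cell (None: a bubble)
    y_at = [None] * k            # partial y sitting at each cell (None: a bubble)
    results = []
    busy_counts = []             # number of cells doing a multiply-add per cycle
    for cycle in range(2 * len(x) - 1):
        # A new x enters from the left on every second cycle.
        x_new = x[cycle // 2] if cycle % 2 == 0 else None
        # A new y (zero) enters from the right on every second cycle,
        # timed so that each y meets its first input in the rightmost cell.
        lag = cycle - (k - 1)
        y_new = 0 if lag >= 0 and lag % 2 == 0 and lag // 2 < n_out else None
        x_at = [x_new] + x_at[:-1]   # x moves one cell to the right
        y_at = y_at[1:] + [y_new]    # y moves one cell to the left
        busy = 0
        for c in range(k):
            if x_at[c] is not None and y_at[c] is not None:
                y_at[c] += cell_w[c] * x_at[c]
                busy += 1
        busy_counts.append(busy)
        if y_at[0] is not None:      # the leftmost cell is the last stop for a y
            results.append(y_at[0])
    return results, busy_counts


w1_results, w1_busy = design_w1([1, 2, 3, 4, 5], [2, 3, 4])
print(w1_results, w1_busy)
```

The bubbles between consecutive x's and y's are what leave roughly every other cell idle; interleaving a second, independent computation would fill exactly those slots.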

Overlapping the executions of multiply-and-add in design W1

Design W2: inputs and results move in the same direction at different speeds; weights stay. All cells work (high throughput rather than fast response).

Types of systolic structure: design W2
- Inputs and results move in the same direction at different speeds; weights stay.
- All cells work (high throughput rather than fast response).
- Cell: $y_{out} = y_{in} + W \cdot x_{in}$, $x = x_{in}$, $x_{out} = x$.
- [Figure: inputs x_1 … x_7 and results y_1, y_2, y_3 move through cells holding W_1, W_2, W_3; each cell has an extra register x.]

Design W2: weights stay, inputs and results move in the same direction but at different speeds (pure-systolic convolution arrays without global data communication). This design loses one advantage of W1, the constant response time. It has been extended to implement 2-D convolution, where high throughput rather than fast response is of concern.

Remarks on Linear Arrays
- The designs above cover the possible systolic designs for the convolution problem (some are semi-systolic).
- Using a systolic control path, weights can be selected on the fly to implement interpolation or adaptive filtering.
- We need to understand precisely the strengths and drawbacks of each design so that an appropriate design can be selected for a given environment.
- To improve throughput, it may be worthwhile to implement the multiplier and the adder separately so that their execution can overlap (see the slide on overlapping multiply-and-add in design W1).
- When chip pin count is considered: pure-systolic designs require four I/O ports; semi-systolic designs require three I/O ports.

Retiming of filters

FIR circuit: initial design. [Figure labels: delays; pipelining of x_i.] We insert various numbers of unified delays.

FIR circuit: registers added below the weight multipliers. [Figure labels: notice the changed timing here; we insert delays here.]

FIR Summary: comparison of sequential and systolic

Conclusions on 1D and 1.5D Systolic Arrays
- Systolic arrays are more than processor arrays which execute systolic algorithms.
- A systolic cell takes on one of the following forms:
  1. A special-purpose cell with hardwired functions,
  2. A vector-computer-like cell with instruction decoding and a processing element,
  3. A systolic processor complete with a control unit and a processing unit.
- Smarter processors for SAT, Petrick, etc.

Large Systolic Arrays as general purpose computers

Originally, systolic architectures were motivated by the need for high-performance, special-purpose computational systems that meet the constraints of VLSI. However, it is possible to design systolic systems which have high throughput yet are not constrained to a single VLSI chip.

Problems with systolic array design
1. Hard to design: hard to understand; the low-level realization may be hard to realize.
2. Hard to explain: remote from the algorithm; the function can't readily be deduced from the structure.
3. Hard to verify.

Key architectural issues in designing special-purpose systems
- Simple and regular design: a simple, regular design yields cost-effective special systems.
- Concurrency and communication: design the algorithm to support high concurrency while employing only simple blocks.
- Balancing computation with I/O: a special-purpose system should be a match to a variety of I/O bandwidths.

Two Dimensional Systolic Arrays. In 1978, the first systolic arrays were introduced as a feasible design for special-purpose devices which meet the VLSI constraints. These special-purpose devices were able to perform four types of matrix operations at high processing speeds:
- matrix-vector multiplication,
- matrix-matrix multiplication,
- LU-decomposition of a matrix,
- solution of triangular linear systems.

General Systolic Organization

Example 2: Matrix-Matrix Multiplication. All previously shown tricks can be applied.

Sources: Seth Copen Goldstein, CMU; A. R. Hurson; David E. Culler, UC Berkeley; Syeda Mohsina Afroze and other students of Advanced Logic Synthesis, ECE 572, 1999 and 2000.