ELEC692 VLSI Signal Processing Architecture Lecture 6 Array Processor Structure

Introduction
- Regular and recursive algorithms are common in DSP applications: they repetitively apply an identical series of pre-determined operations, e.g. matrix operations.
- Such algorithms map naturally onto regular architectures built from identical processing elements, exploiting massive parallel processing and intensive pipelining.
- Systems with programmable processors: multiprocessor systems.
- Systems with application-specific processors: array processors.
  - Globally clock-synchronized arrays: systolic arrays.
  - Self-timed, asynchronous data transfer: wavefront arrays.
- Issues:
  - How does the array processor design depend on the algorithm?
  - How is the algorithm best implemented on the array processor?

What is a systolic array?
- A computational network with distributed data storage and a distributed arrangement of processing elements, so that a large number of operations are carried out simultaneously.
- Multiple PEs are used to maximize the processing done per memory access.
- [Figure: with a 10 ns memory access time, a conventional processor achieves 100 MOPS (million operations per second), while an array of PEs fed from the same memory achieves 400 MOPS.]
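
A minimal arithmetic sketch of the throughput argument above: if each word fetched from memory is reused by several PEs before the next access is needed, throughput scales by roughly that reuse factor. The 10 ns access time and 100 MOPS figure come from the slide; the reuse factor of 4 is an assumption chosen to reproduce the 400 MOPS figure.

    # Throughput of a conventional processor vs. a PE array fed from the same memory.
    mem_access_ns = 10                          # memory cycle time from the slide
    conventional_mops = 1e3 / mem_access_ns     # 1 operation per access -> 100 MOPS
    ops_per_access = 4                          # assumed reuse factor: operations per fetched word
    array_mops = ops_per_access * conventional_mops
    print(f"conventional: {conventional_mops:.0f} MOPS, array: {array_mops:.0f} MOPS")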

Characteristics of array processors
- Parallelism: in both data operations and data transfers.
- Locality: connections exist only between directly neighbouring PEs.
- Regularity and modularity: in both the computation and the communication structures.
- Processing elements can be simple (e.g. a single adder/multiplier) or complex.
- Why call it a systolic array? By analogy to the circulatory system with its two phases, systole and diastole: the system is entirely pipelined, with periodic computation and data-transfer cycles, driven by synchronous clocking and control signals.

Array structure examples
[Figure: 1D (linear), 2D (mesh) and 3D arrays of PEs.]

Drawbacks of array processing
- Not all algorithms can be mapped onto a systolic array architecture: only algorithms whose operations and operands are fixed prior to run-time qualify. Adaptive algorithms, in which the particular operations and operands depend on the data being processed, cannot be used.
- The cost in hardware and area is high.
- There is a cost in latency.

Data dependency
- Express the algorithm in terms of its inherent dependencies: single-assignment code with only local data dependencies.
- E.g. matrix multiplication. Ordinary form:

    FOR i := 1 TO n DO
      FOR j := 1 TO n DO
      BEGIN
        c(i,j) := 0;
        FOR k := 1 TO n DO
          c(i,j) := c(i,j) + a(i,k)*b(k,j);
      END;

- Single-assignment form with local dependencies (a Python sketch evaluating it follows this slide):

    FOR i := 1 TO n DO
      FOR j := 1 TO n DO
      BEGIN
        c(i,j,0) := 0;
        FOR k := 1 TO n DO
        BEGIN
          IF i = 1 THEN b(1,j,k) := b_in(k,j) ELSE b(i,j,k) := b(i-1,j,k);
          IF j = 1 THEN a(i,1,k) := a_in(i,k) ELSE a(i,j,k) := a(i,j-1,k);
          c(i,j,k) := c(i,j,k-1) + a(i,j,k)*b(i,j,k);
        END;
        c_out(i,j) := c(i,j,n);
      END;
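
As a check on the single-assignment formulation above, the following Python sketch evaluates the localized recurrences literally (arrays a, b, c indexed by (i, j, k)) and compares the result against an ordinary matrix product. The function name and test matrices are mine, for illustration only.

    import numpy as np

    def single_assignment_matmul(A, B):
        """Evaluate C = A*B via the localized single-assignment recurrences."""
        n = A.shape[0]
        a = np.zeros((n + 1, n + 1, n + 1))   # a(i,j,k): A propagated along j
        b = np.zeros((n + 1, n + 1, n + 1))   # b(i,j,k): B propagated along i
        c = np.zeros((n + 1, n + 1, n + 1))   # c(i,j,k): partial sums along k
        C = np.zeros((n, n))
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                for k in range(1, n + 1):
                    b[i, j, k] = B[k - 1, j - 1] if i == 1 else b[i - 1, j, k]
                    a[i, j, k] = A[i - 1, k - 1] if j == 1 else a[i, j - 1, k]
                    c[i, j, k] = c[i, j, k - 1] + a[i, j, k] * b[i, j, k]
                C[i - 1, j - 1] = c[i, j, n]
        return C

    A = np.arange(9, dtype=float).reshape(3, 3)
    B = np.arange(9, 18, dtype=float).reshape(3, 3)
    assert np.allclose(single_assignment_matmul(A, B), A @ B)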

Data dependency graph (DG)
- A graph that specifies the data dependencies in an algorithm.
- [Figure: DG for the matrix multiplication example.]

DG of matrix-vector multiplication

Systolic array fundamentals
- Systolic architectures are designed by applying linear mapping techniques to a regular dependency graph (DG).
- Regular dependency graph: the presence of an edge in a certain direction at any node of the DG implies the presence of an edge in the same direction at every node of the DG.
- The DG is a pure space representation: no time instance is assigned to any computation.
- Systolic architectures have a space-time representation in which each node is mapped to a certain PE and scheduled at a particular time instance.
- The systolic design methodology maps an N-dimensional DG onto a lower-dimensional systolic architecture.

Example of DG: space representation for the FIR filter y(n) = w0*x(n) + w1*x(n-1) + w2*x(n-2)
[Figure: 2D DG in the (i, j) plane, with i as the time axis and j' = j as the processor axis. Weights w0, w1, w2 propagate along edge (1,0)T, inputs x(0), x(1), ..., x(5) along edge (0,1)T, and results y(0), y(1), ..., y(5) along edge (1,-1)T.]

Linear mapping methods
- Goal: design an application-specific array processor architecture for a given algorithm that satisfies the demands on computational performance and throughput, minimizes hardware cost, and has a regular structure suited to VLSI implementation.
- Design flow: algorithm, dependency graph, signal flow graph, architecture, with assignment of the operations to processors and to points in time (assignment and scheduling).

Estimation of the number of PEs
- The number of PEs nPE is generally smaller than the number of nodes nDG in the dependency graph. Both are important to know, since their ratio specifies how many DG nodes must be mapped onto each PE.
- Let TPE be the processing time of a PE (including the register transfer time). Within TPE each PE carries out nOP/PE operations, so the computational rate of the processor array is RT = nPE * nOP/PE / TPE.
- For a desired throughput RT,target (samples per second) and nOP/SAMPLE operations per sample, the required number of PEs is nPE >= RT,target * nOP/SAMPLE * TPE / nOP/PE.

Estimation of the number of PEs (continued)
- The number of PEs can be further reduced through pipelining within the PEs. With np pipeline stages along the datapath of a PE, the effective PE cycle time becomes T'PE ~ TPE / np.
- Example: samples of a colour TV signal (27 MHz sampling rate) are to be reformatted into 8x8 blocks, and each block is to be multiplied by an 8x8 matrix. Samples and matrix coefficients are 8 bits wide; the accumulated products are 19 bits wide. Each PE contains one multiplier, one adder and a register at its output.

Example of estimation of the number of PEs
- Each 8x8 block requires 2*8^3 operations (8^3 multiplications and 8^3 additions) for 8^2 samples, i.e. nOP/SAMPLE = 2*8 = 16 operations per sample.
- With intensive pipelining inside the PE (assume a logic-stage delay tL = 50 ps), T'PE can be reduced accordingly, lowering the required number of PEs.
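
A small Python sketch of the PE-count estimate for this example, using the bound nPE >= RT,target * nOP/SAMPLE * TPE / nOP/PE from the previous slide. Only the 27 MHz rate, the 2*8^3 operations per 8^2 samples and the 1 MUL + 1 ADD per PE come from the slides; the PE cycle time of 10 ns and the 4 pipeline stages are assumed values for illustration, since the slide's own numbers are not reproduced above.

    import math

    f_sample = 27e6                       # colour TV sampling rate [samples/s]
    n_op_per_sample = 2 * 8**3 / 8**2     # 2*8^3 operations per 8^2 samples = 16 ops/sample
    n_op_per_pe = 2                       # each PE does 1 MUL and 1 ADD per PE cycle

    def pe_count(t_pe):
        """Lower bound on the number of PEs for the target throughput."""
        return math.ceil(f_sample * n_op_per_sample * t_pe / n_op_per_pe)

    print("without PE pipelining:  ", pe_count(10e-9))      # assumed TPE = 10 ns -> 3 PEs
    print("with 4-stage pipelining:", pe_count(10e-9 / 4))  # assumed np = 4     -> 1 PE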

Some definitions
- Projection vector (also called iteration vector) d = (d1, d2)T: two nodes that are displaced by d, or by a multiple of d, are executed by the same processor.
- Scheduling vector sT = (s1, s2): any node with index I is executed at time sTI.
- Processor space vector pT = (p1, p2): any node with index IT = (i, j) is executed by processor pTI.
- Hardware utilization efficiency, HUE = 1/|sTd|. This is because two tasks executed by the same processor are spaced |sTd| time units apart.

Systolic array design methodology
- Systolic architectures are designed by selecting different projection, processor space and scheduling vectors, subject to feasibility constraints.
- The processor space vector and the projection vector must be orthogonal to each other: if points A and B differ by the projection vector, i.e. IA - IB = d, they must be executed by the same processor, so pTIA = pTIB, which requires pTd = 0.
- If A and B are mapped to the same processor, they cannot be executed at the same time, i.e. sTIA != sTIB, which requires sTd != 0.
- Edge mapping: if an edge e exists in the space representation (DG), an edge pTe is introduced in the systolic array with sTe delays on it.
- (A short Python sketch of these mappings follows this slide.)
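
A minimal sketch of the linear mapping machinery defined above: mapping a node index I to processor pTI and time sTI, mapping a DG edge e to a processor displacement pTe carrying sTe delays, and checking the feasibility constraints pTd = 0 and sTd != 0. The function names are mine; the vector values in the example call are those of the B1 FIR design discussed a few slides below.

    import numpy as np

    def check_feasibility(d, p, s):
        """Feasibility of a (projection, processor space, scheduling) vector choice."""
        d, p, s = map(np.asarray, (d, p, s))
        assert p @ d == 0, "processor space vector must be orthogonal to d (pTd = 0)"
        assert s @ d != 0, "nodes on one processor must get distinct times (sTd != 0)"
        return 1 / abs(s @ d)                       # hardware utilization efficiency, 1/|sTd|

    def map_node(I, p, s):
        """A node with index I runs on processor pTI at time sTI."""
        return int(np.asarray(p) @ I), int(np.asarray(s) @ I)

    def map_edge(e, p, s):
        """A DG edge e becomes an array edge pTe carrying sTe delays."""
        return int(np.asarray(p) @ e), int(np.asarray(s) @ e)

    # Example: FIR design B1 with d = (1,0)T, pT = (0,1), sT = (1,0).
    hue = check_feasibility(d=[1, 0], p=[0, 1], s=[1, 0])
    print("HUE =", hue)                             # 1.0
    print(map_edge([1, -1], p=[0, 1], s=[1, 0]))    # result edge -> (-1, 1): moves with 1 delay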

Array architecture design
- Step 1: map the algorithm to a DG, based on the space-time indices in the recursive algorithm. Requires shift-invariance (homogeneity) of the DG; localization of the DG: broadcast vs. transmitted data.
- Step 2: map the DG to a signal flow graph (SFG). Processor assignment: a projection method may be applied (projection vector d). Scheduling: a permissible linear schedule may be applied (scheduling vector s). The schedule must preserve the inter-dependencies, and nodes on an equitemporal hyperplane must not be projected onto the same PE.
- Step 3: map the SFG onto an array processor.

Example: FIR filter
[Figure: DG of a 5-tap FIR filter with coefficients h(0)...h(4), inputs x(0)...x(6) and outputs y(0)...y(6), showing the projection vector d, the scheduling vector s, the resulting delays (D, 2D) on the SFG edges, and the equitemporal hyperplanes of the schedule.]

Space-time transformation
- The space representation (DG) can be transformed into a space-time representation by interpreting one of the spatial dimensions as a temporal dimension.
- For a two-dimensional (2D) DG the transformation is j' = pTI and t' = sTI, i.e. a node with index I is placed at processor coordinate j' and time coordinate t'.
- In the space-time representation the j' axis is the processor axis and t' is the scheduling time instance.

FIR systolic array (design B1)
- The B1 design is obtained by selecting the projection, processor space and scheduling vectors as d = (1, 0)T, pT = (0, 1), sT = (1, 0).
- Any node with index IT = (i, j) is mapped to processor pTI = j; therefore all nodes on a horizontal line are mapped to the same processor.
- Any node with index IT = (i, j) is executed at time sTI = i.
- Since pTd = 0 and sTd = 1, the mapping is feasible and HUE = 1/|sTd| = 1.
- Edge mapping (eT -> pTe, sTe):
  - weight (1, 0) -> 0, 1
  - input (0, 1) -> 1, 0
  - result (1, -1) -> -1, 1
- The weights stay in the PEs (one delay each), the inputs are broadcast to all PEs, and the results move from PE to PE with one delay.

B1 design
[Figures: block diagram and low-level implementation of the B1 array (weights w0, w1, w2 stored in the PEs, input x(n) broadcast, results passed between PEs through delays D), and the space-time representation of the B1 design with processor axis j' = j and time axis t' = i.]

FIR systolic array (design B2)
- The B2 design is obtained by selecting d = (1, -1)T, pT = (1, 1), sT = (1, 0).
- Any node with index IT = (i, j) is mapped to processor pTI = i + j and executed at time sTI = i.
- Since pTd = 0 and sTd = 1, HUE = 1.
- Edge mapping (eT -> pTe, sTe):
  - weight (1, 0) -> 1, 1
  - input (0, 1) -> 1, 0
  - result (1, -1) -> 0, 1
- The weights move instead of the results, as in B1; the inputs are broadcast; each result stays in one PE, where it is accumulated.

B2 design
[Figures: block diagram and low-level implementation of the B2 array (input x(n) broadcast, weights w0, w1, w2 moving through delays D, results y(n) accumulated in place), and the space-time representation of the B2 design with processor axis j' = i + j and time axis t' = i.]

FIR systolic array (design F)
- The F design is obtained by selecting d = (1, 0)T, pT = (0, 1), sT = (1, 1).
- Since pTd = 0 and sTd = 1, HUE = 1.
- Edge mapping (eT -> pTe, sTe):
  - weight (1, 0) -> 0, 1
  - input (0, 1) -> 1, 1
  - result (1, -1) -> -1, 0
- The weights are fixed in space, the input vector moves from left to right with one delay, and the outputs move from right to left with no delay elements (fan-in of the results).

F design
[Figures: block diagram and low-level implementation of the F array (weights w0, w1, w2 fixed in the PEs, input x(n) moving left to right through delays D, results fanned in from right to left without delays), and the space-time representation of the F design with processor axis j' = j and time axis t' = i + j.]
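
The three FIR designs above differ only in the choice of (d, pT, sT). The sketch below tabulates pTe and sTe for the weight, input and result edges under each choice, reproducing the edge-mapping tables of the B1, B2 and F slides; the specific vector values are the assumptions stated on those slides, and the dictionary layout is mine.

    import numpy as np

    # DG edges of the FIR filter: weight, input, result.
    edges = {"weight": np.array([1, 0]),
             "input":  np.array([0, 1]),
             "result": np.array([1, -1])}

    # (d, pT, sT) choices assumed for designs B1, B2 and F.
    designs = {"B1": (np.array([1, 0]),  np.array([0, 1]), np.array([1, 0])),
               "B2": (np.array([1, -1]), np.array([1, 1]), np.array([1, 0])),
               "F":  (np.array([1, 0]),  np.array([0, 1]), np.array([1, 1]))}

    for name, (d, p, s) in designs.items():
        assert p @ d == 0 and s @ d != 0             # feasibility constraints
        print(f"{name}: HUE = {1 / abs(s @ d)}")
        for label, e in edges.items():
            # pTe: processor displacement of the edge, sTe: number of delays on it
            print(f"  {label:6s} e = {tuple(e)} -> pTe = {int(p @ e):2d}, sTe = {int(s @ e)}")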

Selection of the scheduling vector
- Feasible scheduling vectors are found using scheduling inequalities. Once the scheduling vector sT is selected, the projection vector d and the processor space vector pT are chosen so that sTd != 0 and pTd = 0.
- Scheduling inequalities: consider the dependency relation X -> Y, where Ix and Iy are the indices of nodes X and Y. Then SY >= SX + TX, where TX is the time needed to compute node X and SX, SY are the scheduling times of nodes X and Y.

Scheduling equations
- Linear scheduling: SX = sTIx.
- Affine scheduling (a transformation followed by a translation): SX = sTIx + gamma_x.
- Defining the edge from X to Y as eX->Y = Iy - Ix, the scheduling inequality for an edge becomes sTeX->Y + gamma_y - gamma_x >= TX.
- The scheduling vector sT is obtained by solving these inequalities.

Scheduling inequalities
- Two steps to select the scheduling vector:
- 1. Capture all the fundamental edges. The reduced dependence graph (RDG) is used to capture the fundamental edges, and the regular iterative algorithm (RIA) description of the corresponding problem is used to construct the RDG.
- 2. Construct the scheduling inequalities and solve them for a feasible sT.

Regular iterative algorithm (RIA)
- Standard input RIA form: the indices of the inputs are the same in all equations.
- Standard output RIA form: the output indices, i.e. the indices on the left-hand sides, are the same in all equations.
- E.g. for the FIR filter the RIA description is
    W(i+1, j) = W(i, j)
    X(i, j+1) = X(i, j)
    Y(i+1, j-1) = Y(i, j) + W(i+1, j-1) * X(i+1, j-1)
- This cannot be expressed in standard input RIA form, but it can be expressed in standard output RIA form:
    W(i, j) = W(i-1, j)
    X(i, j) = X(i, j-1)
    Y(i, j) = Y(i-1, j+1) + W(i, j) * X(i, j)
- [Figure: reduced DG with nodes w, x, y; edges (1,0) on w and (0,1) on x, (0,0) edges from w and x into y, and edge (1,-1) on y.]
- (A sketch evaluating the standard output form follows this slide.)
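
As a sanity check that the standard output RIA form above really computes an FIR filter, this sketch iterates the three recurrences over a finite index range and compares Y against a direct convolution. The boundary conditions (weights entering at i = 0, inputs entering at j = 0, Y initialized to zero outside the range) are my assumptions, chosen to match the DG on the earlier slides; the function name and test data are illustrative.

    import numpy as np

    def fir_via_ria(x, w):
        """Evaluate the FIR filter via the standard output RIA recurrences."""
        N, L = len(w), len(x)
        W = np.zeros((L, N)); X = np.zeros((L, N)); Y = np.zeros((L, N))
        for i in range(L):
            for j in range(N):
                W[i, j] = w[j] if i == 0 else W[i - 1, j]     # W(i,j) = W(i-1,j), weights enter at i = 0
                X[i, j] = x[i] if j == 0 else X[i, j - 1]     # X(i,j) = X(i,j-1), inputs enter at j = 0
                prev = Y[i - 1, j + 1] if (i >= 1 and j + 1 < N) else 0.0   # assumed zero boundary
                Y[i, j] = prev + W[i, j] * X[i, j]            # Y(i,j) = Y(i-1,j+1) + W(i,j)*X(i,j)
        return Y[:, 0]                                        # y(n) is read out at j = 0

    w = np.array([2.0, -1.0, 0.5])            # w0, w1, w2
    x = np.random.default_rng(0).normal(size=16)
    y_ref = np.convolve(x, w)[:len(x)]        # y(n) = w0 x(n) + w1 x(n-1) + w2 x(n-2)
    assert np.allclose(fir_via_ria(x, w), y_ref)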

Example
- Let Tmult = 5, Tadd = 2, Tcom = 1, and use the RDG of the FIR filter (nodes w, x, y with edges (1,0), (0,1), (1,-1) and the two (0,0) edges into y).
- The 5 edges of the RDG give 5 scheduling inequalities. For linear scheduling, gamma_x = gamma_y = gamma_w = 0, and the inequalities reduce to s1 >= 1, s2 >= 1 and s1 - s2 >= Tmult + Tadd + Tcom = 8.
- One of the solutions is s2 = 1, s1 = 8 + 1 = 9, i.e. sT = (9, 1).
- Select d = (1, -1) such that sTd = 8 != 0, and select pT = (1, 1) so that pTd = 0. Then HUE = 1/|sTd| = 1/8.
- Edge mapping (eT -> pTe, sTe):
  - weight (1, 0) -> 1, 9
  - input (0, 1) -> 1, 1
  - result (1, -1) -> 0, 8
- [Figure: the resulting array, with the corresponding delays (9D on the weight edges, D on the input edges, 8D on the result edges).]
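
A brute-force check of this example: the sketch enumerates small candidate scheduling vectors against the three reduced inequalities stated above (my reconstruction of the slide's inequality set from its stated solution), and confirms that sT = (9, 1) is the smallest feasible schedule and gives HUE = 1/8 with d = (1, -1)T and pT = (1, 1).

    import numpy as np

    T_MULT, T_ADD, T_COM = 5, 2, 1
    bound = T_MULT + T_ADD + T_COM        # 8: delay required along the result edge y -> y

    def feasible(s1, s2):
        """Reduced scheduling inequalities for the FIR RDG (linear scheduling, gammas = 0)."""
        return s1 >= 1 and s2 >= 1 and s1 - s2 >= bound

    # Smallest feasible schedule found by enumeration.
    best = min(((s1, s2) for s1 in range(1, 20) for s2 in range(1, 20) if feasible(s1, s2)),
               key=lambda s: (s[0], s[1]))
    print("sT =", best)                   # (9, 1)

    s = np.array(best)
    d = np.array([1, -1])
    p = np.array([1, 1])
    assert p @ d == 0 and s @ d != 0
    print("HUE =", 1 / abs(s @ d))        # 0.125
    for label, e in [("weight", [1, 0]), ("input", [0, 1]), ("result", [1, -1])]:
        e = np.array(e)
        print(f"{label}: pTe = {int(p @ e)}, sTe = {int(s @ e)}")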

Matrix-matrix multiplication
- The DG is a three-dimensional (3D) space representation; a linear projection is used to design 2D systolic arrays.
- Given two matrices A and B (e.g. of dimension 2x2) and C = AB, the standard output RIA form is
    a(i, j, k) = a(i, j-1, k)
    b(i, j, k) = b(i-1, j, k)
    c(i, j, k) = c(i, j, k-1) + a(i, j, k) * b(i, j, k)

Matrix-matrix multiplication example
- [Figure: RDG of the matrix multiplication example, with nodes a, b, c; edges (0,1,0) on a, (1,0,0) on b and (0,0,1) on c, and (0,0,0) edges from a and b into c.]
- Let Tmult-add = 1 and Tcom = 0. For linear scheduling, gamma_a = gamma_b = gamma_c = 0, and the scheduling inequalities are satisfied by sT = (1, 1, 1).
- Solution 1: project along the k axis, i.e. d = (0, 0, 1)T with the processor space spanned by the i and j axes; then sTd = 1 and HUE = 1.

Solution 1
- Edge mapping (eT -> pTe, sTe):
  - a (0, 1, 0) -> (0, 1), 1
  - b (1, 0, 0) -> (1, 0), 1
  - c (0, 0, 1) -> (0, 0), 1
- Each a value moves along the j direction with one delay, each b value moves along the i direction with one delay, and each c value stays in its PE, where it is accumulated with one delay per step.
- [Figure: the resulting 2-dimensional systolic array (S. Y. Kung), with a entering from the left, b entering from the top and the c results held in the PEs.]
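
A small consistency check of Solution 1, under the assumptions stated above (sT = (1, 1, 1), processor space matrix P = [[1,0,0],[0,1,0]], d = (0, 0, 1)T): map every node (i, j, k) of an n x n x n matrix-multiplication DG to a (processor, time) slot, verify that no two nodes collide on the same PE at the same time instance, and reproduce the edge-mapping table.

    import numpy as np
    from itertools import product

    n = 4
    P = np.array([[1, 0, 0],              # processor space: project the DG along the k axis
                  [0, 1, 0]])
    s = np.array([1, 1, 1])               # scheduling vector
    d = np.array([0, 0, 1])               # projection vector

    assert np.all(P @ d == 0) and s @ d != 0          # feasibility constraints

    # Map every DG node to (PE coordinates, time) and check for collisions.
    seen = set()
    for I in product(range(n), repeat=3):
        I = np.array(I)
        slot = (tuple(int(v) for v in P @ I), int(s @ I))   # (PE, time instance) of this node
        assert slot not in seen, "two nodes scheduled on the same PE at the same time"
        seen.add(slot)

    # Edge mapping of the three fundamental edges.
    for label, e in [("a", [0, 1, 0]), ("b", [1, 0, 0]), ("c", [0, 0, 1])]:
        e = np.array(e)
        print(f"{label}: pTe = {tuple(int(v) for v in P @ e)}, sTe = {int(s @ e)}")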

Solution 2
- Edge mapping (eT -> pTe, sTe):
  - a (0, 1, 0) -> (0, 1), 1
  - b (1, 0, 0) -> (1, 0), 1
  - c (0, 0, 1) -> (1, 1), 1
- Here the c values no longer stay in place: the partial results move diagonally through the array with one delay per step, while a and b move along the j and i directions as in Solution 1.
- [Figure: the resulting 2-dimensional systolic array, with a entering from the left, b entering from the top and the c partial sums flowing diagonally.]