Automatic Generation of Systolic Array Designs For Reconfigurable Computing Greg Nash Engineering of Reconfigurable Systems and Algorithms (ERSA '02) International.

Presentation transcript:

Automatic Generation of Systolic Array Designs For Reconfigurable Computing Greg Nash Engineering of Reconfigurable Systems and Algorithms (ERSA '02) International Multiconference in Computer Science Las Vegas, Nevada, June 24, 2002

FPGA-Based Systolic Computing: Rationale A large number of parallel “systolic” algorithms exist that are well suited for use in real-time applications such as signal/image processing, communications, and adaptive processing Reconfigurable FPGA hardware is the most cost-effective implementation strategy when “programmable” systolic computations are desired But designing systolic arrays is a complex process for which there are no systematic design methodologies or commercial systolic CAD tools Centar’s Symbolic Parallel Algorithm Development Environment (SPADE) allows a designer to automatically explore the design space of different systolic array implementations so that system level tradeoffs can be efficiently analyzed

Parallel Processing with Systolic Arrays Algorithms – linear algebra – graph theory – computational geometry – string matching – sorting/searching – dynamic programming – discrete mathematics – number-theoretic algorithms Applications (real-time/embedded processing) – communications – seismic analysis – signal/image processing – adaptive processing – arithmetic arrays Architecture – simple processing elements – local interconnects – synchronous – fine-grained – pipelined – small local memory – local control – regular arrays Hardware – FPGA/PLD chips – programmable connections – reconfigurable boards – ASICs

Systolic Array: Matrix Multiply [Figure: space-time mapping of matrix multiply; the systolic array is obtained by projecting along the time axis]
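The space-time mapping named on this slide can be illustrated with a small simulation. The sketch below is not taken from the presentation: it assumes an output-stationary N x N array in which the product term for index (i, j, k) is computed at time t = i + j + k in the processing element at position (i, j). The function name `systolic_matmul` and the register names are hypothetical.

```python
def systolic_matmul(A, B):
    """Simulate an N x N output-stationary systolic array computing C = A * B.

    Assumed mapping (illustrative): term A[i][k] * B[k][j] is computed
    in PE (i, j) at time step t = i + j + k.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    a_reg = [[0] * n for _ in range(n)]  # A value currently held by each PE
    b_reg = [[0] * n for _ in range(n)]  # B value currently held by each PE
    for t in range(3 * n - 2):  # last term is computed at t = 3(n - 1)
        # Local transfers: A values move one PE right, B values one PE down.
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Boundary I/O: skewed inputs enter at the left and top edges.
        for i in range(n):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else 0
        for j in range(n):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else 0
        # Every PE multiplies its two operands and accumulates locally.
        for i in range(n):
            for j in range(n):
                C[i][j] += a_reg[i][j] * b_reg[i][j]
    return C
```

Each PE only talks to its left and upper neighbors, and the inputs are skewed so that matching operands meet at the right time, which is the property a space-time mapping is meant to guarantee.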

CAD Tool Goals Create parallel array designs with reduced requirements for mapping expertise/heuristics Tool inputs derived directly from mathematical solutions Heuristics applied only from within a high level language User control of architectural features by addition of constraints, e.g., boundary vs internal I/O Strong emphasis on visual feedback Filtering of output; find delay-optimal solutions with – minimum area – maximum regularity – minimum network bandwidth Ease of use Built-in simulator with practical architectural model VHDL (prototype) translator to facilitate FPGA implementations

High Level Tool Description Flow: Mathematical Algorithm, Input Code, Transformation Search (i,k to S,T), Simulator and Graphical Outputs. Input code:
for i to N do for j to N do
  if j=1 and i>=1 and i<=N then l[i,j]:=a[i,j];
  elif i=1 and j>1 and j<=N then u[i,j]:=a[i,j]/l[i,i]; fi;
  if i>=j and j>1 and i<=N then l[i,j]:=a[i,j]-add(l[i,k]*u[k,j],k=1..j-1) fi;
  if j>i and i>1 and j<=N then u[i,j]:=(a[i,j]-add(l[i,k]*u[k,j],k=1..i-1))/l[i,i] fi;
od od;
S=spatial coordinates T=temporal coordinates M=transformation solution
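For readers without Maple, the input code on this slide can be read as an LU factorization in which l carries the diagonal and u has an implicit unit diagonal (a Crout-style factorization). The following is a minimal sequential Python sketch of that reading, not SPADE output. The function name `crout_lu` is hypothetical, indices are shifted to 0-based, and no pivoting is done, so nonzero pivots are assumed.

```python
from fractions import Fraction


def crout_lu(a):
    """Sequential reading of the slide's recurrences (0-based indices).

    Returns (l, u) with l lower triangular and u unit upper triangular.
    Assumes every pivot l[i][i] is nonzero (no pivoting is performed).
    """
    n = len(a)
    l = [[Fraction(0)] * n for _ in range(n)]
    # The unit diagonal of u is implicit in the slide's code; it is made explicit here.
    u = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            if i >= j:
                # l[i,j] := a[i,j] - add(l[i,k]*u[k,j], k=1..j-1)
                l[i][j] = Fraction(a[i][j]) - sum(l[i][k] * u[k][j] for k in range(j))
            else:
                # u[i,j] := (a[i,j] - add(l[i,k]*u[k,j], k=1..i-1)) / l[i,i]
                u[i][j] = (Fraction(a[i][j]) - sum(l[i][k] * u[k][j] for k in range(i))) / l[i][i]
    return l, u


def matmul(x, y):
    """Plain matrix product, used only to check that l * u reproduces a."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]
```

Exact rational arithmetic (`Fraction`) is used so that the product l * u can be compared with the input without rounding error.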

Systolic Array: CAD Design Issues Algorithm representation Scheduling Reindexing Localization Allocation Constraint introduction Finding optimal solutions Automatic operation Efficient hardware

Symbolic Parallel Algorithm Development Environment (SPADE) Based on DESCARTES (Baltus and Allen, MIT) Finds optimal minimum latency solutions by “exhaustive” search of all possible solutions Schedule vector elements have limited range Spatial relationships between variables described by unimodular matrices Architectural constraints introduced at abstract level to reduce search space SPADE vs DESCARTES – Uses more familiar input language (Maple™) – Symbolic computations in Maple – Structured computations in Fortran – Different search formalism – Added constraints – Built-in simulator

Algorithm Domain Multiple statements of the general form x(A_x I + a_x) = f(y(B_y I + b_y), ...), where A_x, B_y / a_x, b_y are integer matrices/vectors, S is the dimension of the algorithm space, and the dependencies include commutative and associative operators such as min, max, and summation Many “systolic” and fine-grained parallel algorithm examples fall in this category Algorithms and solutions based on use of affine transforms Affine transforms are the most general that preserve regularity The above form can always be systematically transformed to regular arrays with local interconnections

Array Design Solution Strategy Minimum execution time (primary objective function) Minimum area/regularity/bandwidth: secondary objective function Search based approach: for each algorithm variable explore All schedules All allocations All variable-variable indexing relationships

Solution Search Find affine transformations, T, that map algorithm indices to space-time indices T consists of a time component (a schedule vector and a scalar offset) and a spatial component S/s (matrix/vector) SPADE treats the schedule vector, the scalar offset, S and s as “search variables” Matrix S is considered as a single variable and is based on unimodular matrices to force dense mappings and limit the number of possible values The elements of the schedule vector and of s are considered separately as search values Search variables only have a small number of possible values Search contains breadth-first and depth-first components
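The transformation described on this slide can be written out as follows. This is a sketch in assumed notation: the symbols λ (schedule vector) and γ (scalar time offset) are conventional choices and are not confirmed by the transcript, which lost the original glyphs; S and s are the slide's own names.

```latex
\[
T(I) \;=\;
\begin{pmatrix} \lambda^{T} \\ S \end{pmatrix} I
+ \begin{pmatrix} \gamma \\ s \end{pmatrix},
\qquad
t(I) = \lambda^{T} I + \gamma,
\qquad
p(I) = S\,I + s .
\]
% Illustrative instance (assumed, not from the slides): for matrix multiply
% with I = (i, j, k)^T, choosing \lambda = (1, 1, 1)^T, \gamma = 0,
% S = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, s = 0
% gives t = i + j + k and processor position p = (i, j).
```

Here t(I) is the time step and p(I) the processor position assigned to index point I.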

Solution Constraints The space of possible solutions grows exponentially with the number of search variables and possible values Impose abstract architectural constraints to make the search process feasible, e.g., – Causality between dependences: if variable c depends on a, require that a be available before c is computed – I/O at the boundary for variable a imposes a “time align” constraint relating n_a and u_t, where n_a is the normal to the plane of a and u_t is a unit time vector
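The two constraints named on this slide lost their formulas in the transcript. One plausible way to write them is sketched below. Both expressions are assumptions reconstructed from the surrounding wording, and the time function t_x(I) is hypothetical notation for the time at which variable x is available at index I.

```latex
\[
t_c(I) \;>\; t_a(I') \quad \text{whenever } c(I) \text{ depends on } a(I')
\qquad \text{(causality)}
\]
\[
n_a \cdot u_t \;=\; 0
\qquad \text{(time alignment: the plane of } a \text{ is parallel to the time axis)}
\]
```

Under this reading, a time-aligned variable projects along the time axis onto a lower-dimensional set of processing elements, which is what allows its data to enter at an array edge.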

Simulator Architectural model intended to facilitate transition to hardware Assumes local FSMs control data movement and arithmetic Moderate amount of local state data Data transfers – Eight nearest neighbor paths – Simultaneous transfers of data – I/O through edge stacks or data pre-loaded in array Basic array activity sequence – Data transfer – Intermediate calculations (running sums) – Result calculations Visual instrumentation of user selected data and data movement

2-D Mapping Example Algorithm (includes fan-in, fan-out, and single I/O variable) Maple code: for i from 2 to N do a[i]:=add(a[j]+b[i],j=1..i-1); od; Mathematical expression: a_i = sum over j = 1..i-1 of (a_j + b_i), for i = 2..N; converted by the parser to a mathematical equivalent Solutions (N=6) – (a) t_optimal = 3N - 4; I/O at array edge; single optimal solution – (b) t_optimal = 2N - 2; no I/O restrictions; 1 of 2 optimal solutions [Figures: space-time views (axes Space, Time) of variables a, q, and b for solutions (a) and (b)]
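The recurrence in this example is small enough to evaluate directly, which makes the fan-in (each a[i] sums over all earlier a[j]) and fan-out (each a[j] and b[i] feeds several terms) visible. The sketch below is a plain sequential evaluation, not a systolic schedule. The function name `fan_in_example` and the padding convention are hypothetical.

```python
def fan_in_example(a1, b):
    """Evaluate a[i] = sum_{j=1}^{i-1} (a[j] + b[i]) for i = 2..N.

    `a1` is the initial value a[1]; `b` is 1-indexed, so b[0] and b[1]
    are unused padding and N = len(b) - 1. Returns [a[1], ..., a[N]].
    """
    n = len(b) - 1
    a = [None, a1] + [None] * (n - 1)
    for i in range(2, n + 1):
        # Fan-in: a[i] needs every earlier a[j]; fan-out: b[i] is reused i-1 times.
        a[i] = sum(a[j] + b[i] for j in range(1, i))
    return a[1:]
```

For example, with a[1] = 1 and b[2] = b[3] = b[4] = 1 the values are a[2] = 2, a[3] = 5, a[4] = 11.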

Faddeev Algorithm Problem Solution – Place matrices in array – Add linear combination of rows of A to C – Choose W such that WA=C – Solution “appears” in lower right hand quadrant
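The steps above can be sketched in a few lines. The version below assumes the common formulation of the Faddeev algorithm, in which the block matrix [[A, B], [-C, D]] is reduced by Gaussian elimination until the lower-left block is zero, leaving D + C A^{-1} B in the lower-right block. The slide's own equations are not preserved, so the sign convention and the function name `faddeev` are assumptions. All four blocks are taken to be N x N and no pivoting is done.

```python
from fractions import Fraction


def faddeev(A, B, C, D):
    """Return D + C * inverse(A) * B by eliminating the lower-left block.

    Assumes square N x N blocks and nonzero pivots (no pivoting).
    """
    n = len(A)
    # Assemble the 2N x 2N matrix [[A, B], [-C, D]] with exact rationals.
    M = [[Fraction(v) for v in A[i]] + [Fraction(v) for v in B[i]] for i in range(n)]
    M += [[Fraction(-v) for v in C[i]] + [Fraction(v) for v in D[i]] for i in range(n)]
    for j in range(n):  # only the first N columns are eliminated
        pivot = M[j][j]
        for r in range(j + 1, 2 * n):
            factor = M[r][j] / pivot
            # Subtract a multiple of pivot row j from row r (a row combination).
            M[r] = [M[r][c] - factor * M[j][c] for c in range(2 * n)]
    # The result sits in the lower-right N x N quadrant.
    return [row[n:] for row in M[n:]]
```

Choosing B and C as identity matrices and D as zero turns the same procedure into a matrix inversion, which is a convenient sanity check.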

Design Example: Linear Algebra Mathematical Solution Input Code
for j to 2*N do for i to 2*N do
  if i=1 and j>=1 and j<=N then u[i,j] := a[i,j] fi;
  if i=1 and j>=1 and j<=N then b[i,j] := a[i,j+N] fi;
  if j>=i and i>1 and j<=N then u[i,j]:=a[i,j]-add(l[i,k]*u[k,j],k=1..i-1) fi;
  if j>=1 and i>=1 and j<=N and i<=N then b[i,j]:=a[i,j+N]-add(l[i,k]*b[k,j],k=1..i-1); d[i,j] := a[i+N,j+N]-add(l[i+N,k]*b[k,j],k=1..N) fi;
  if j=1 and i>=1 and i<=2*N then l[i,j]:=a[i,j]/u[j,j] fi;
  if i>=j and j>1 and i<=2*N and j<=N then l[i,j]:=(a[i,j]-add(l[i,k]*u[k,j],k=1..j-1))/u[j,j] fi;
od od;
With substitutions [equations not preserved]

Design Example Solutions Mathematical Transformations Comments – Two optimal solutions found – Minimum area, secondary objective function – Latency 5N-2 – Area: ½ N(3N+1) – Linear array of dividers required Space-time View, N=6 (One of Two Solutions) Systolic Array

Single Divider Solutions Replace variable 1/u[j,j] with u_inv[j]: Constrain u_inv[j] to be time aligned (projects to a point)
...
if j<=N and j>=1 then u_inv[j]:=1/u[j,j] fi;
if j=1 and i>=1 and i<=2*N then l[i,j]:=a[i,j]*u_inv[j] fi;
if i>=j and j>1 and i<=2*N and j<=N then l[i,j]:=(a[i,j]-add(l[i,k]*u[k,j],k=1..j-1))*u_inv[j] fi;
Single latency optimal, minimum area solution: 5N-2 latency, 4N^2 area [Figure: systolic array showing variables u_inv, l, u, d, a, b]

Other Code Input Form
for j to 2*N do for i to 2*N do
  if i=1 and j>=1 and j<=N then u[i,j] := A[i,j] fi;
  if j>=i and i>1 and j<=N then u[i,j]:=A[i,j]-add(l[i,k]*u[k,j],k=1..i-1) fi;
  if j>=1 and i>1 and j<=N and i<=N then b[i,j]:=b[i,j]-add(l[i,k]*b[k,j],k=1..i-1) fi;
  if j>=1 and i>=1 and j<=N and i<=N then d[i,j] := d[i,j]-add(l[i+N,k]*b[k,j],k=1..N) fi;
  if j=1 and i>=1 and i<=N then l[i,j]:=A[i,j]/u[j,j] fi;
  if i>=j and j>1 and i<=N and j<=N then l[i,j]:=(A[i,j]-add(l[i,k]*u[k,j],k=1..j-1))/u[j,j] fi;
  if j=1 and i>N and i<=2*N then l[i,j]:=C[i-N,j]/u[j,j] fi;
  if i>N and j>1 and i<=2*N and j<=N then l[i,j]:=(C[i-N,j]-add(l[i,k]*u[k,j],k=1..j-1))/u[j,j] fi;
od od;
With substitutions [equations not preserved] and b/d initialized to B/D

Maximum Regularity Designs (1) – Desire simple interconnection network topology – Avoid time aligned variables (introduces O(N) memory per PE) – Preference for “close” dependency relations between variables

Maximum Regularity Design (2) – One optimal solution found – Minimum area, secondary objective function – Latency 5N-2 – Area: 4N^2 – 2-D array of dividers required – 25% fewer data flow paths – Load/unload cycle required for data Space-Time View Systolic Array

Design Example: 1D DFT Direct Path from Mathematical Input To Array Design Panels: Mathematical Algorithm, Base-4 Transformation, SPADE Input, SPADE Outputs SPADE Input
for j to N/4 do
  for k to N/4 do Y[j,k] := WM[j,k]*add(CM1[j,i]*X[i,k],i=1..b); od;
  for k to R do Z[k,j] := add(CM2[k,i]*Y[j,i],i=1..4*b); od;
od;
Patent Pending

Altera Stratix FPGA: DFT Mapping

Affine Recurrence Equation Types Uniform: x(I) depends on y(I-D) D is integer vector Example (matrix-matrix multiplication) Non-uniform: x(AI+a) depends on y(BI+b) A/B are integer matrices; a/b are vectors
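The matrix-matrix multiplication example named on this slide lost its formula in the transcript. A standard textbook way to write matrix multiply as a uniform recurrence is sketched below. It is offered as an illustration and may differ in detail from the slide's own version.

```latex
\[
\begin{aligned}
c(i,j,k) &= c(i,j,k-1) + a(i,j,k)\, b(i,j,k), \\
a(i,j,k) &= a(i,j-1,k), \\
b(i,j,k) &= b(i-1,j,k),
\end{aligned}
\qquad 1 \le i,j,k \le N .
\]
% Every dependence is a constant integer offset D of the index I = (i,j,k):
% D = (0,0,1) for c, D = (0,1,0) for a, and D = (1,0,0) for b.
```

Because every dependence vector D is constant, all data movement is between neighboring index points, which is what distinguishes the uniform form from the non-uniform form x(AI+a), y(BI+b).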

Lyapunov Matrix Equation Start with abstract problem (e.g., Lyapunov algorithm, solve for X) Convert to mathematical expression Non-uniform recurrence equation in Maple language
for i to N do for j to N do
  x[i,j] := (c[i,j]-add(a[i,k]*x[k,j],k=1..i-1)-add(b[l,j]*x[i,l],l=1..j-1))/(a[i,i]+b[j,j]);
od; od;
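The Maple recurrence on this slide can be checked numerically. The sketch below assumes it solves AX + XB = C with A lower triangular and B upper triangular. That reading is inferred from the fact that only a[i,k] with k <= i and b[l,j] with l <= j appear, and it is not stated in the transcript. The function names are hypothetical, and a[i][i] + b[j][j] is assumed to be nonzero.

```python
from fractions import Fraction


def solve_triangular_lyapunov(a, b, c):
    """Sequential evaluation of the slide's recurrence (0-based indices).

    Assumes a is lower triangular, b is upper triangular, and
    a[i][i] + b[j][j] != 0 for all i, j.
    """
    n = len(c)
    x = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = Fraction(c[i][j])
            acc -= sum(a[i][k] * x[k][j] for k in range(i))  # add(a[i,k]*x[k,j], k=1..i-1)
            acc -= sum(b[l][j] * x[i][l] for l in range(j))  # add(b[l,j]*x[i,l], l=1..j-1)
            x[i][j] = acc / (a[i][i] + b[j][j])
    return x


def lyapunov_lhs(a, b, x):
    """Compute A*X + X*B so the solution can be compared with C."""
    n = len(x)
    return [[sum(a[i][k] * x[k][j] for k in range(n)) +
             sum(x[i][l] * b[l][j] for l in range(n))
             for j in range(n)] for i in range(n)]
```

Each x[i,j] depends on whole partial rows and columns of x, not on a fixed neighbor offset, which is why the recurrence is non-uniform.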

Lyapunov: Uniform Algorithm* Algorithms are difficult to represent in this form, and there is no systematic method for doing so All variables must be heuristically embedded in space-time “Extra” variables are created All variables must exist at all points in the index space *(V. Roychowdhury, PhD thesis, Stanford 1989)

More Information “Constraint Directed CAD Tool For Automatic Latency-Optimal Implementation of 1-D and 2-D Fourier Transforms” (SPIE ITCom 2002, Boston, MA, July 29-August 2, 2002) – Use of constraints to define array designs – 2-D FFT example “Hardware Efficient Base-4 Systolic Architecture for Computing the Discrete Fourier Transform” (2002 IEEE Workshop on Signal Processing Systems, San Diego, CA, October 16-18) – Details of 1D DFT design – Mapping to FPGAs www.centar.net (above papers and extended viewgraphs)