Published by Darlene Reed. Modified over 9 years ago.
1
Constraint Directed CAD Tool For Automatic Latency-optimal Implementation of FPGA-based Systolic Arrays
Greg Nash
Reconfigurable Technology: FPGAs and Reconfigurable Processors for Computing and Communications IV, SPIE ITCom, Boston, MA, July 29, 2002
2
Outline
Introduction to CAD tool, SPADE (symbolic parallel algorithm development environment)
Design examples: matrix Lyapunov equation, discrete Fourier transform (DFT)
Isolating useful designs (Lyapunov)
–Alignment of variables in space-time
–Non-optimal solutions
–“Low” bandwidth designs
–“Regular” designs
Finding optimal solutions (DFT)
–Minimum latency
–Maximum throughput
3
Systolic Array: Matrix Multiply
[Figure: space-time mapping of matrix multiply (variables c, d, e), projected along the time axis to give the systolic array]
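The idea of a space-time mapping can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it takes the multiply to be c = d*e, and it uses a common textbook mapping (processing element (i, j), time step t = i + j + k), which is not necessarily the mapping drawn on the slide. The function names `space_time_map` and `run_array` are hypothetical.

```python
from itertools import product


def space_time_map(n):
    """Map each index point (i, j, k) of c[i][j] += d[i][k] * e[k][j]
    to a processing element (i, j) and a time step t = i + j + k.
    Assumed textbook mapping, not taken from the slide."""
    schedule = {}
    for i, j, k in product(range(n), repeat=3):
        pe, t = (i, j), i + j + k
        # A PE may perform only one multiply-add per time step.
        assert (pe, t) not in schedule
        schedule[(pe, t)] = (i, j, k)
    return schedule


def run_array(d, e):
    """Execute the multiply-adds in time order, as the array would,
    and report the latency in time steps."""
    n = len(d)
    c = [[0] * n for _ in range(n)]
    schedule = space_time_map(n)
    for pe, t in sorted(schedule, key=lambda key: key[1]):
        i, j, k = schedule[(pe, t)]
        c[i][j] += d[i][k] * e[k][j]
    latency = max(t for (_, t) in schedule) + 1
    return c, latency


if __name__ == "__main__":
    print(run_array([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Dropping the time coordinate from each (PE, t) pair is the "project along time axis" step: what remains is the set of processing elements, here an n x n grid.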
4
Parallel Processing With Systolic Arrays
Algorithms
–Linear algebra
–Graph theory
–Computational geometry
–String matching
–Sorting/searching
–Dynamic programming
–Discrete mathematics
–Number-theoretic algorithms
Applications (real-time/embedded processing)
–Communications
–Seismic analysis
–Signal/image processing
–Adaptive processing
–Arithmetic arrays
Architecture
–Simple processing elements
–Local interconnects
–Synchronous
–Fine-grained
–Pipelined
–Small local memory
–Local control
–Regular arrays
Hardware
–FPGA/PLD chips
–Programmable connections
–Reconfigurable boards
–ASICs
5
Altera Stratix FPGA: DFT Mapping
[Figure: systolic DFT array mapped onto an Altera Stratix FPGA]
6
SPADE Operation
Flow: Mathematical Algorithm, Input Code, Transformation Search, Simulator, Graphical Outputs
Transformation search maps i,k to S,T (S=spatial coordinates, T=temporal coordinates, M=transformation solution)
Input code:
  for i to N do
    for j to N do
      if j=1 and i>=1 and i<=N then
        l[i,j]:=a[i,j];
      elif i=1 and j>1 and j<=N then
        u[i,j]:=a[i,j]/l[i,i];
      fi;
      if i>=j and j>1 and i<=N then
        l[i,j]:=a[i,j]-add(l[i,k]*u[k,j],k=1..j-1)
      fi;
      if j>i and i>1 and j<=N then
        u[i,j]:=(a[i,j]-add(l[i,k]*u[k,j],k=1..i-1))/l[i,i]
      fi;
    od
  od;
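The input code on this slide reads as an LU decomposition of a. A minimal Python transcription, with zero-based indices, might look like the sketch below. The function name `lu_decompose` is hypothetical, and setting the diagonal of u to 1 is an assumption: the slide code never writes or reads u[i,i], so a unit diagonal is the convention that makes l*u reproduce a.

```python
def lu_decompose(a):
    """Transcription of the slide's loop nest (zero-based indices).
    Returns (l, u) with l lower triangular and u upper triangular."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    u = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j == 0:
                l[i][j] = a[i][j]                      # first column of l
            elif i == 0:
                u[i][j] = a[i][j] / l[i][i]            # first row of u
            if i >= j and j > 0:
                l[i][j] = a[i][j] - sum(l[i][k] * u[k][j] for k in range(j))
            if j > i and i > 0:
                u[i][j] = (a[i][j]
                           - sum(l[i][k] * u[k][j] for k in range(i))) / l[i][i]
    # Assumption: unit diagonal for u (not set in the slide code).
    for i in range(n):
        u[i][i] = 1.0
    return l, u


if __name__ == "__main__":
    l, u = lu_decompose([[4.0, 3.0], [6.0, 3.0]])
    print(l, u)
```

The sketch does no pivoting, matching the slide code, so it assumes every l[i][i] is nonzero.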
7
Algorithm Domain
Multiple statements of the general form
[Equation not preserved in the transcript]
–Where A_x, B_y / a_x, b_y are integer matrices/vectors, S is the dimension of the algorithm space, and the dependencies include commutative and associative operators such as min and max
8
SPADE Functionality
Scheduling
Reindexing
Localization
Allocation
Constraint introduction
Solutions
–Primary objective function: latency
–Secondary objective functions: area, regularity, bandwidth
Automatic operation
9
“Time-alignment” Constraint
[Figure: space-time mapping and systolic array (N=6) for matrix-matrix multiplication with variables c, d, e]
10
Lyapunov Matrix Equation Example
Abstract problem: find X given A (lower triangular) and B (upper triangular)
Convert to mathematical expression
Non-uniform recurrence equation in the Maple language:
  for i to N do
    for j to N do
      x[i,j] := (c[i,j]-add(a[i,k]*x[k,j],k=1..i-1)-add(b[l,j]*x[i,l],l=1..j-1))/(a[i,i]+b[j,j]);
    od
  od;
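The recurrence can be checked numerically. The Python sketch below transcribes it with zero-based indices and then verifies the result against the matrix equation AX + XB = C. That equation is inferred from the recurrence and the triangular structure of A and B; the slide text does not state it explicitly. The names `solve_lyapunov` and `residual` are hypothetical.

```python
def solve_lyapunov(a, b, c):
    """Solve for x using the slide's recurrence (zero-based indices).
    a is lower triangular, b is upper triangular."""
    n = len(c)
    x = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s_a = sum(a[i][k] * x[k][j] for k in range(i))
            s_b = sum(b[l][j] * x[i][l] for l in range(j))
            x[i][j] = (c[i][j] - s_a - s_b) / (a[i][i] + b[j][j])
    return x


def residual(a, b, c, x):
    """Largest entry of |A X + X B - C| (assumed form of the equation)."""
    n = len(c)
    worst = 0.0
    for i in range(n):
        for j in range(n):
            ax = sum(a[i][k] * x[k][j] for k in range(n))
            xb = sum(x[i][l] * b[l][j] for l in range(n))
            worst = max(worst, abs(ax + xb - c[i][j]))
    return worst


if __name__ == "__main__":
    a = [[2.0, 0.0], [1.0, 3.0]]
    b = [[1.0, 4.0], [0.0, 2.0]]
    c = [[3.0, 5.0], [7.0, 9.0]]
    x = solve_lyapunov(a, b, c)
    print(x, residual(a, b, c, x))
```

Each x[i,j] depends on entries above it in the same column and to its left in the same row, which is the dependency structure the space-time search has to schedule.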
11
Non-latency Optimal Solutions
Two minimum area, latency optimal designs (L=4N-3) found
Four smaller area, non-optimal designs (L=4N-2) found
[Figure: space-time view (N=6)]
12
Minimum Bandwidth Secondary Objective Function
Minimum area secondary objective function, with x, a, and b time aligned
–2 unique designs found
–8 unique data flow paths
–5 different directions
–Some PEs experience 6 different flows of data
Minimum bandwidth secondary objective function
–Single unique minimum area design found
–Variable x placed in “center” of array
[Figure: N=6]
13
Maximum Regularity Secondary Objective Function
Desire simple orthogonal interconnection network topology with minimum number of interconnections
Avoid time aligned variables (introduces O(N) memory per PE)
Preference for “close” dependency relations between variables
Four unique solutions found
[Figure: space-time views (N=6) of variables x, a, b, with a “Reject” label]
14
1D DFT Design Example
Base-4 Transformation
Mathematical derivation (base-4 form)
SPADE input code:
  for j to N/4 do
    for k to N/4 do
      Y[j,k] := WM[j,k]*add(CM1[j,i]*X[i,k],i=1..4);
    od;
    for k to 4 do
      Z[k,j] := add(CM2[k,i]*Y[j,i],i=1..N/4);
    od
  od;
Desired constraints
–Minimize number of multipliers (time-align Y)
–Time-align X, Z at array boundary
–Keep coefficient matrices CM1 and CM2 internal to the array
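The loop nest has a compact matrix reading: Y is the element-wise product of WM with CM1 times X, and Z is CM2 times the transpose of Y. The Python sketch below only checks that structural equivalence. The matrix shapes are read off the loop bounds as transcribed (CM1 is N/4 x 4, X is 4 x N/4, WM and Y are N/4 x N/4, CM2 and Z are 4 x N/4), and the matrices are filled with arbitrary placeholder values, not the actual DFT coefficients. The function names are hypothetical.

```python
import random


def loop_nest(cm1, cm2, wm, x, n):
    """Zero-based transcription of the SPADE input code."""
    q = n // 4
    y = [[0j] * q for _ in range(q)]
    z = [[0j] * q for _ in range(4)]
    for j in range(q):
        for k in range(q):
            y[j][k] = wm[j][k] * sum(cm1[j][i] * x[i][k] for i in range(4))
        for k in range(4):
            # Row j of y was completed just above, so it can be consumed here.
            z[k][j] = sum(cm2[k][i] * y[j][i] for i in range(q))
    return y, z


def matrix_form(cm1, cm2, wm, x, n):
    """Same computation written as Y = WM .* (CM1 X), Z = CM2 Y^t."""
    q = n // 4
    prod = [[sum(cm1[j][i] * x[i][k] for i in range(4)) for k in range(q)]
            for j in range(q)]
    y = [[wm[j][k] * prod[j][k] for k in range(q)] for j in range(q)]
    y_t = [list(col) for col in zip(*y)]
    z = [[sum(cm2[k][i] * y_t[i][j] for i in range(q)) for j in range(q)]
         for k in range(4)]
    return y, z


def placeholder(rows, cols, rng):
    """Arbitrary Gaussian-integer entries; stand-ins, not DFT coefficients."""
    return [[complex(rng.randint(-3, 3), rng.randint(-3, 3))
             for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng, n = random.Random(0), 16
    q = n // 4
    args = (placeholder(q, 4, rng), placeholder(4, q, rng),
            placeholder(q, q, rng), placeholder(4, q, rng), n)
    print(loop_nest(*args) == matrix_form(*args))
```

Because Z[k,j] consumes a whole row of Y as soon as it is produced, the two inner loops can share one outer j loop, which is what the input code does.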
15
Base-4 vs. Previous Systolic Designs
CM1 and CM2 contain only elements from the set {1,-1,-i,i}
CM1 X and CM2 Y^t only involve complex additions
Twiddle factor matrix WM is of dimension N/4
x4 fewer complex multiplies with x2 more complex adders (previous designs require one complex multiply/add per transform point)
Takes advantage of reduced arithmetic with radix-4 butterfly, but transform length not limited to N = r^m
16
1D DFT Systolic Design Result
Maximum regularity secondary objective function
Latency = 3N/4+7
16 designs found
Very irregular space-time mappings
[Figure: systolic array and space-time views (N=64) showing X, Y, Z, CM1, CM2]
17
DFT: Constraints Relaxed
Requires either
–X/CM2 time aligned, Z/CM1 internal
–Z/CM1 time aligned, X/CM2 internal
Minimum area secondary objective designs for 1D DFT
–Latency = N/2 + 8
–Six unique designs
–Block processing time = N/4 + 6
–Structure moderately irregular
[Figure: space-time view (N=64) showing X, Y, Z, CM1, CM2, IM1, IM2]
18
1D DFT: Throughput Vs. Latency
High computational efficiencies inside space-time variable mappings are necessary to achieve the best latencies
High computational efficiency in entire space-time volume is necessary for high throughputs
Designs need to be “stackable” in time
19
Latency and Throughput Optimal Designs
Maximum regularity setting
Two structurally different designs
–X/CM2 time aligned, Z/CM1 internal
–Z/CM1 time aligned, X/CM2 internal
Latency = N/2 + 8
Throughput = N/4 + 1
Very regular structure
[Figure: systolic array (N=64) and space-time view of two DFT iterations (N=64)]
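A small worked calculation shows what stacking buys. The Python sketch below assumes both figures are in clock cycles and that the throughput figure N/4 + 1 is the interval between the starts of successive transforms; the slide does not state the units, so treat both as assumptions. The function name `dft_timing` is hypothetical.

```python
def dft_timing(n, transforms):
    """Return (latency, period, total) in assumed clock cycles for a
    stream of back-to-back transforms of length n (n divisible by 4)."""
    latency = n // 2 + 8        # Latency = N/2 + 8 from the slide
    period = n // 4 + 1         # Throughput = N/4 + 1, read as a start interval
    total = latency + (transforms - 1) * period
    return latency, period, total


if __name__ == "__main__":
    print(dft_timing(64, 1))
    print(dft_timing(64, 2))
```

For N=64 this gives a latency of 40 and a start interval of 17, so two stacked transforms finish in 57 cycles instead of the 80 that two non-overlapped runs would take.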
20
2D NxN DFT Design
N 1D “row” DFTs followed by N “column” DFTs
1D DFT computation by factoring, N = n1 * n2, and doing a 2D n1 x n2 DFT
Uses both of the two optimal systolic designs
–X/CM2 time aligned, Z/CM1 internal
–Z/CM1 time aligned, X/CM2 internal
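The row-column decomposition in the first bullet can be checked with a direct, unoptimized DFT. The Python sketch below is a plain software illustration of that identity, not a model of the systolic designs; the function names are hypothetical.

```python
import cmath


def dft(seq):
    """Direct O(n^2) 1D DFT."""
    n = len(seq)
    return [sum(seq[m] * cmath.exp(-2j * cmath.pi * k * m / n)
                for m in range(n))
            for k in range(n)]


def dft2_row_column(x):
    """2D DFT as 1D DFTs over every row, then over every column."""
    rows = [dft(r) for r in x]
    cols = [dft(list(col)) for col in zip(*rows)]   # cols[v][u]
    return [list(r) for r in zip(*cols)]            # back to [u][v]


def dft2_direct(x):
    """2D DFT evaluated straight from the double-sum definition."""
    n1, n2 = len(x), len(x[0])
    return [[sum(x[a][b] * cmath.exp(-2j * cmath.pi * (u * a / n1 + v * b / n2))
                 for a in range(n1) for b in range(n2))
             for v in range(n2)]
            for u in range(n1)]


if __name__ == "__main__":
    x = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
    rc, direct = dft2_row_column(x), dft2_direct(x)
    print(max(abs(rc[u][v] - direct[u][v])
              for u in range(3) for v in range(3)))
```

The second bullet, computing a 1D DFT through a 2D n1 x n2 DFT, generally needs an extra index mapping or twiddle step between the two passes, which this sketch does not cover.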
21
Systolic vs. “Pipelined” 16x16 DFT
† S. Yu and E. Swartzlander, “A Pipelined Architecture for the Multidimensional DFT,” IEEE Trans. Signal Processing, Vol. 49, No. 9, Sept. 2001.
22
More Information
“Automatic Generation of Systolic Array Designs For Reconfigurable Computing,” Proc. Engineering of Reconfigurable Systems and Algorithms (ERSA '02), International Multiconference in Computer Science, Las Vegas, Nevada, June 24, 2002.
–General description of SPADE
–Faddeev algorithm (find CX+D, given AX=B, X is unknown)
“Hardware Efficient Base-4 Systolic Architecture for Computing the Discrete Fourier Transform,” 2002 IEEE Workshop on Signal Processing Systems, San Diego, CA, October 16-18.
–Details of base-4 DFT designs
–Mapping to FPGAs
www.centar.net (papers and extended viewgraphs)