CprE / ComS 583 – Reconfigurable Computing
Lecture #12 – Systolic Computing
Prof. Joseph Zambreno, Department of Electrical and Computer Engineering, Iowa State University
Recap – Multi-FPGA Systems
Crossbar topology:
- Devices A–D are routing only
- Gives predictable performance
- Potential waste of resources for near-neighbor connections
[Figure: crossbar topology over devices A–D and W–Z]
Recap – Logic Emulation
- Emulation takes a sizable amount of resources
- Compilation time can be large due to FPGA compiles
Recap – Virtual Wires
- Overcome pin limitations by multiplexing multiple logical signals over each physical pin
- Schedule when each communication will take place
Outline
- Recap
- Introduction and Motivation
- Common Systolic Structures
- Algorithmic Mapping
- Mapping Examples
  - Finite impulse response
  - Matrix-vector product
  - Banded matrix-vector product
  - Banded matrix multiplication
Systolic Computing
sys·to·le (sĭs′tə-lē) n. – the rhythmic contraction of the heart, especially of the ventricles, by which blood is driven through the aorta and pulmonary artery after each dilation or diastole. [Greek systolē, from systellein, to contract, from syn- + stellein, to send] – sys·tol·ic (sĭs-tŏl′ĭk) adj.

"Data flows from memory in a rhythmic fashion, passing through many processing elements before it returns to memory." [Kung, 1982]
Systolic Architectures
Goal – a general methodology for mapping computations into hardware (spatial computing) structures
Composition:
- Simple compute cells (e.g., add, sub, max, min)
- Regular interconnect pattern
- Pipelined communication between cells
- I/O at boundaries
[Figure: example compute cells]
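In software terms, one such cell can be sketched as a small state machine with pipeline registers. The struct layout, the names Cell and cell_step, and the choice of a multiply-accumulate as the compute operation are illustrative assumptions, not anything prescribed by the slides:

/* One cell of a linear systolic array: a preloaded coefficient plus
   pipeline registers, so communication with neighbors is latched
   (pipelined) rather than combinational. */
typedef struct {
    double w;      /* value preloaded into the cell (e.g., a filter tap) */
    double x_reg;  /* latched copy of the datum moving through the array */
    double y_reg;  /* latched partial result handed to the next cell     */
} Cell;

/* One clock tick: consume the left neighbor's latched outputs,
   compute, and latch outputs for the right neighbor. */
void cell_step(Cell *c, double x_in, double y_in) {
    c->y_reg = y_in + c->w * x_in;   /* simple compute: multiply-accumulate */
    c->x_reg = x_in;                 /* forward the input unchanged         */
}

An array is then a chain of such cells stepped once per clock, with I/O only at the boundary cells.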
Motivation
- Effectively utilize VLSI
- Reduce the "von Neumann bottleneck"
- Target compute-intensive applications
- Reduce design cost through simplicity and regularity
- Exploit concurrency
- Local communication: short wires (small delay, less area), scalable
Why Study?
- Original motivation – a specialized accelerator for a single application
- The model and goals are a close match to reconfigurable computing:
  - Target algorithms match
  - Well-developed theory, techniques, and solutions
- One big difference – Kung's approach targeted custom silicon (not a reconfigurable fabric)
  - Compute elements needed to be more general
Common Systolic Structures
- One-dimensional linear array [Figure: chain of cells F1 … FN]
- Two-dimensional mesh [Figure: M×N grid of cells F1,1 … FM,N]
Hexagonal Array
- Each cell communicates with its six nearest neighbors
[Figure: hexagonal array of cells F1–F16, alongside an equivalent squared-up representation]
Binary Tree
[Figure: binary tree of cells F1–F15, alongside the equivalent H-tree representation]
Mapping Approach
- Allocate PEs
- Schedule computation:
  - Schedule PEs
  - Schedule data flow
- Optimize
Available transformations:
- Preload repeated values
- Replace feedback loops with registers
- Internalize data flow
- Broadcast common inputs
Example – Finite Impulse Response
A Finite Impulse Response (FIR) filter is a type of digital filter:
- Finite – the response to an impulse eventually settles to zero
- Requires no feedback

for (i = 1; i <= n; i++)
  for (j = 1; j <= k; j++)
    y[i] += w[j] * x[i+j-1];
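For reference, a self-contained version of the loop might look as follows; the array sizes, the sample data, and the zero padding of x past x[n] (so that x[i+j-1] stays in bounds) are assumptions, not part of the slide:

#include <stdio.h>

#define N 8   /* number of outputs     */
#define K 4   /* number of filter taps */

int main(void) {
    /* 1-based indexing as on the slide; index 0 of each array is unused. */
    double w[K + 1] = {0, 0.25, 0.25, 0.25, 0.25};  /* taps                 */
    double x[N + K] = {0};                          /* input, zero-padded   */
    double y[N + 1] = {0};                          /* outputs              */

    for (int i = 1; i <= N; i++) x[i] = (double)i;  /* assumed sample input */

    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= K; j++)
            y[i] += w[j] * x[i + j - 1];

    for (int i = 1; i <= N; i++) printf("y[%d] = %g\n", i, y[i]);
    return 0;
}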
FIR Attempt #1
Parallelize the outer loop:

for (i = 1; i <= n; i++)      // in parallel
  for (j = 1; j <= k; j++)    // sequential
    y[i] += w[j] * x[i+j-1];

[Figure: n parallel cells; cell i consumes w_j and x_{i+j-1} and accumulates y_i]
FIR Attempt #1 (cont.)
Broadcast common inputs (same loop as above):
[Figure: the w_j stream broadcast to all n cells]
FIR Attempt #1 (cont.)
Retime to eliminate the broadcast:
[Figure: w_j pipelined from cell to cell instead of broadcast]
FIR Attempt #1 (cont.)
Broadcast common values:
[Figure: the shared x values broadcast to all cells]
FIR Attempt #2
Parallelize the inner loop:

for (i = 1; i <= n; i++)      // sequential
  for (j = 1; j <= k; j++)    // in parallel
    y[i] += w[j] * x[i+j-1];

[Figure: k parallel cells; cell j consumes w_j and x_{i+j-1} and contributes to y_i]
FIR Attempt #2 (cont.)
Internalize data flow (same loop as above):
[Figure: x values passed from cell to cell instead of supplied externally]
FIR Attempt #2 (cont.)
Allocation and schedule:
[Figure: cycle-by-cycle schedule of the w, x, and y streams through the k cells]
FIR Attempt #2 (cont.)
Preload repeated values:
[Figure: the weights w_1 … w_k preloaded into the k cells]
FIR Attempt #2 (cont.)
Broadcast common values:
[Figure: each x_i broadcast to all k cells while the partial y results shift between them]
FIR Attempt #2 (cont.)
Retime to eliminate the broadcast:
[Figure: the x stream delayed through registers instead of broadcast]
FIR Summary
Sequential:
- Memory bandwidth per output: 2k + 1
- O(k) cycles per output
- O(1) hardware
Systolic:
- Memory bandwidth per output: 2
- O(1) cycles per output
- O(k) hardware
[Figure: chain of k multiply-accumulate cells computing y_i from x_i and w_1 … w_k]
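The systolic column of this comparison can be checked with a small time-stepped model. The sketch below simulates the Attempt #2 array in its "weights preloaded, x broadcast" form (the retimed variant computes the same outputs with x delayed through registers instead of broadcast); the sizes and sample data are the same assumptions as in the sequential example earlier:

#include <stdio.h>

#define N 8   /* number of outputs    */
#define K 4   /* number of taps/cells */

int main(void) {
    double w[K + 1] = {0, 0.25, 0.25, 0.25, 0.25};
    double x[N + K] = {0};          /* zero padding past x[N]          */
    double y[N + 1] = {0};
    double pipe[K + 1] = {0};       /* pipe[j] = partial sum in cell j */

    for (int i = 1; i <= N; i++) x[i] = (double)i;

    /* One iteration per clock tick: x[t] is broadcast to every cell,
       partial sums shift one cell to the right, and a finished output
       leaves the last cell -- O(1) cycles per output on O(k) hardware,
       with one x in and one y out per cycle. */
    for (int t = 1; t <= N + K - 1; t++) {
        for (int j = K; j >= 1; j--) {               /* right-to-left, so   */
            double y_in = (j == 1) ? 0.0 : pipe[j - 1];  /* each cell reads */
            pipe[j] = y_in + w[j] * x[t];            /* its neighbor's old  */
        }                                            /* value               */
        if (t >= K) y[t - K + 1] = pipe[K];          /* y_i exits at i+K-1  */
    }

    for (int i = 1; i <= N; i++) printf("y[%d] = %g\n", i, y[i]);
    return 0;
}

Its output matches the sequential loop exactly.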
Example – Matrix-Vector Product

for (i = 1; i <= n; i++)
  for (j = 1; j <= n; j++)
    y[i] += a[i][j] * x[j];
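The linear-array schedule on the next slide can be rendered directly as a time-stepped model: x[j] reaches cell i at time t = i + j - 1, which is exactly when a[i][j] is fed in from above. The cell state is collapsed into the y array here, and the sizes and sample data are assumptions:

#include <stdio.h>

#define N 4

int main(void) {
    double a[N + 1][N + 1], x[N + 1], y[N + 1] = {0};

    for (int i = 1; i <= N; i++) {              /* assumed sample data */
        x[i] = 1.0;
        for (int j = 1; j <= N; j++) a[i][j] = (double)(i + j);
    }

    /* Time-stepped model of the linear array: at time t, cell i holds
       x[t - i + 1] and multiplies it by the arriving a[i][t - i + 1];
       y[i] accumulates in place and is complete after t = N + i - 1. */
    for (int t = 1; t <= 2 * N - 1; t++)
        for (int i = 1; i <= N; i++) {
            int j = t - i + 1;                  /* which x is in cell i now */
            if (j >= 1 && j <= N)
                y[i] += a[i][j] * x[j];
        }

    for (int i = 1; i <= N; i++) printf("y[%d] = %g\n", i, y[i]);
    return 0;
}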
Matrix-Vector Product (cont.)
[Figure: linear-array schedule – the a_ij enter cell i skewed by one cycle per row (a_11 at t = 1), the x_j flow through the array, and y_i emerges at t = n + i - 1]
Banded Matrix-Vector Product
[Figure: band matrix with p and q diagonals, bandwidth p + q - 1]

for (i = 1; i <= n; i++)
  for (j = 1; j <= p+q-1; j++)
    y[i] += a[i][i+j-q] * x[i+j-q];
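Written out with an explicit clip at the matrix corners (the inner index on the original slide did not survive extraction, so the loop above and the sketch below assume the band in row i spans columns i-q+1 through i+p-1, i.e., q-1 subdiagonals, the main diagonal, and p-1 superdiagonals):

/* Banded matrix-vector product, 1-based full storage assumed for a. */
void banded_matvec(int n, int p, int q,
                   double a[n + 1][n + 1],
                   const double x[n + 1],
                   double y[n + 1]) {
    for (int i = 1; i <= n; i++) {
        y[i] = 0.0;
        for (int j = 1; j <= p + q - 1; j++) {
            int c = i + j - q;          /* column touched by diagonal j */
            if (c >= 1 && c <= n)       /* clip the band at the corners */
                y[i] += a[i][c] * x[c];
        }
    }
}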
Banded Matrix-Vector Product (cont.)
[Figure: two-cell data flow – the a_ij and x_j streams enter on alternating cycles with bubbles between them; each cell has ports a_in, x_in/x_out, and y_in/y_out, and the completed y_i exits the array]
Banded Matrix Multiplication
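The slides give no code for this example, so the plain triple loop below is offered only as a sequential reference for what the hexagonal array on the next slide computes: C = A * B for n-by-n band matrices, where full zero-padded storage makes the dense loop correct (if wasteful; the systolic version touches only the nonzero bands):

/* Sequential reference: C = A * B with 1-based, zero-padded storage. */
void banded_matmul(int n,
                   double a[n + 1][n + 1],
                   double b[n + 1][n + 1],
                   double c[n + 1][n + 1]) {
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++) {
            c[i][j] = 0.0;
            for (int k = 1; k <= n; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
}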
Banded Matrix Multiplication (cont.)
[Figure: hexagonal-array data flow – the a, b, and c bands enter from three directions on a skewed schedule (t = 1 … 7); each cell has ports a_in/a_out, b_in/b_out, and c_in/c_out]
Summary
- Systolic structures are good for computation-bound problems
- The model captures costs in VLSI systems:
  - Minimize the number of memory accesses
  - Emphasize local interconnections (long wires are bad)
- Candidate algorithms:
  - Make multiple uses of each input datum (e.g., n inputs, O(n^3) computations)
  - Concurrency
  - Simple control flow, simple processing elements