CprE / ComS 583 Reconfigurable Computing

CprE / ComS 583 Reconfigurable Computing
Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #12 – Other Spatial Styles

“Desperate Housewives” Night out in Campustown
Quick Points HW #3 coming out today Due Tuesday, October 17 (midnight) Systolic computing structures Systolic mapping Logic partitioning FPGA synthesis Priority: 74 CprE 583 Homework Priority: 1 Breathing, Eating, etc. Priority: 14 “Desperate Housewives” Priority: 6 Night out in Campustown Priority: 45 Other Work … September 28, 2006 CprE 583 – Reconfigurable Computing

CprE 583 – Reconfigurable Computing
Project Proposals Due Sunday, 10/8 at midnight Purpose – to provide a background and overview of the project Goal – allow me to understand what you are intending to do Project topic: Perform an in-depth exploration of some area of reconfigurable computing Whatever topic you choose, you must include a strong experimental element in your project Work in groups of 2+ (3 if very lofty proposal) September 28, 2006 CprE 583 – Reconfigurable Computing

Project Proposals (cont.)
Suggested structure [3-4 pages, IEEE conf. format] Introduction – what is the context for this work? What problem are you trying to address? Why is it interesting/challenging? Prior work – what is the related work? How does your work differ from these? (5-10 references) Approach – how are you going to tackle the problem? What tools and methodologies do you intend on using? What experiments do you intend on running? Expected results – what do you expect the outcome of your project to be? What are the deliverables? How do you intend on presenting your results? Milestones – what is your expected progress schedule? Provide a weekly / bi-weekly basis September 28, 2006 CprE 583 – Reconfigurable Computing

Systolic Architectures
Goal – general methodology for mapping computations into hardware (spatial computing) structures Composition: Simple compute cells (e.g. add, sub, max, min) Regular interconnect pattern Pipelined communication between cells I/O at boundaries Simple processing units to make better use of unfamiliar VLSI technology. Regular and local connections to reduce I/O (pin counts, and a potential bottleneck). Predetermined because this is a static, custom VLSI design Pipelining to increase throughput Can think of it as an assembly line of processing tasks operating on data as it flows by. x + x min x x c September 28, 2006 CprE 583 – Reconfigurable Computing

Example – Finite Impulse Response
A Finite Impulse Response (FIR) filter is a type of digital filter Finite – response to an impulse eventually settles to zero Requires no feedback for (i=1; i<=n; i++) for (j=1; j <=k; j++) y[i] += w[j] * x[i+j-1]; CprE 583 – Reconfigurable Computing September 28, 2006

Finite Impulse Response (cont.)
Sequential Memory bandwidth per output – 2k+1 O(k) cycles per output O(1) hardware Systolic Memory bandwidth per output – 2 O(1) cycles per output O(k) hardware xi x x x x w1 w2 w3 w4 + + + + yi September 28, 2006 CprE 583 – Reconfigurable Computing

Example – Matrix-Vector Product
for (i=1; i<=n; i++) for (j=1; j<=n; j++) y[i] += a[i][j] * x[j]; September 28, 2006 CprE 583 – Reconfigurable Computing

Matrix-Vector Product (cont.)
– t = 3 a31 a22 a13 – – t = 2 a21 a12 – – – t = 1 a11 – – – – x1 x2 x3 x4 xn … y1 t = n Each of the function blocks performs a multiply-addition operation. Data is fed in a time-delayed fashion. This structure can perform a dense matrix-vector product in 2n cycles. y2 t = n+1 y3 t = n+2 y4 t = n+3 CprE 583 – Reconfigurable Computing September 28, 2006

Outline Project Proposals Recap Non-Numeric Systolic Examples Systolic Loop Transformations Data dependencies Iteration spaces Example transformations Reading – Cellular Automata Reading – Bit-Serial Architectures September 28, 2006 CprE 583 – Reconfigurable Computing

Example – Relational Database
Relation is a collection of tuples that all have the same attributes Tuple is a fixed number of objects Represented in a table tuple # Name School Age QB Rating D. Carr Fresno State 27 113.6 1 P. Rivers NC State 24 107.4 2 D. McNabb Syracuse 29 105.3 3 C. Pennington Marshall 30 103.4 4 R. Grossman Florida 26 100.9 CprE 583 – Reconfigurable Computing September 28, 2006

Database Operations Intersection: A ∩ B – all records in both relation A and B Must compare all |A| x |B| tuples Compare via sequence compare Or along row or column to get inclusion bitvector B1 B2 B3 B3 A1 T11 T12 T13 A1 B2 T12 A2 T21 T22 T23 A2 B1 T21 To compute the intersection of two sets of database records (tuples), we must locate those records which are in both sets. For a systolic implementation, records could pass by each other (spatially) for comparison in the same order as a convolution operation. A3 T31 T32 T33 A3 CprE 583 – Reconfigurable Computing September 28, 2006

Database Operations (cont.)
Tuple Comparison Problem – tuples are long, comparison time might limit computation rate Strategy – perform comparison in pipelined manner by fields Stagger fields Arrange to compute field i on cycle after i-1 Cell: tout = tin and ain xnor bin B[j,4] B[j,3] B[j,2] The comparison of record fields can be pipelined so that different cells along an array row compare different fields of the same record. Each cell would pipe the true/false result of its field comparison to the next cell in the row, so that the logical and of all comparisons emerges at the end of the row. Note that we always compare every field because of the dataflow in the systolic architecture. A serial processor can short-circuit some of this work -- once a result turns false, it does not need to evaluate the rest of the field comparisons. B[j,1] True A[i,1] A[i,2] A[i,3] A[i,4] CprE 583 – Reconfigurable Computing September 28, 2006

Database Intersection
True b51 a21 b43 a13 True b42 a22 b34 a14 True b41 a31 b33 a23 T12 True b32 a32 b24 a24 T22 True b31 a41 b23 a23 T21 Note that each column of the array is dedicated to comparing a different kind of field. Inputs streams A & B traverse up & down the array (respectively), with record fields staggered so that the fields belonging to a single record from the A stream and a single record from the B stream are compared by a single row of the array. True b22 a42 b14 a34 True b21 a51 b13 a43 a52 a44 September 28, 2006 CprE 583 – Reconfigurable Computing

Database Intersection (cont.)
FALSE b52 b44 True b51 a21 b43 a13 OR T3 True b42 a22 b34 a14 OR True b41 a31 b33 a23 OR T2 True b32 a32 b24 a24 OR True b31 a41 b23 a23 OR T1 The intersection operation is completed by propagating the logical or of record comparisons to collect identical records True b22 a42 b14 a34 OR True b21 a51 b13 a43 OR a52 a44 September 28, 2006 CprE 583 – Reconfigurable Computing

Database Operations (cont.)
Unique: remove duplicate elements in multirelation A Intersect A with A Union: A U B – one copy of all tuples in A and B Concatenate A and B Use Unique to remove duplicates Projection: collapse A by removing select fields of every tuple Sample fields in A’ Set projection is the process of projecting records into a lower "dimensionality" by removing (ignoring) selected fields from each record of a set. The resulting set may contain duplicates which were originally differentiated by the fields which were removed. Thus set projection can be implemented using an intersection-based systolic array for removing duplicates. CprE 583 – Reconfigurable Computing September 28, 2006

Database Join Join: AJCA,CBB – where columns CA in A intersect columns CB in B, concatenate tuple Ai and Bj Match CA of A with CB of B Keep all Ti,j Filter i,j for which Ti,j = true Construct join from matched pairs Claim: Typically, | AJCA,CBB | << | A | | B | September 28, 2006 CprE 583 – Reconfigurable Computing

Database Summary Input database – O(n) data Operations require O(n2) effort O(n) if sorted first O(n log(n)) to sort Systolic implementation – works on O(n) processing elements in O(n) time Typical database [KunLoh80A]: 1500 bit tuples 10,000 records in a relation ~1 4-LUT per bit-compare ~1600 XC4062 FPGAs ~84 XC4LX200 FPGAs Note that if we sort database records before performing the set operations, most of the operations become O(n) rather than O(n^2). Including the sort makes operations of complexity O(n log(n)). In practice, most database applications do actually sort the data first. This suggests that the systolic implementations aren't work-optimal in some sense. Of course, the sorting operations require non-local connections and data access, so the fact that they require less operations should be weighted against their greater interconnect requirements. FYI: The best sorting algorithms on systolic hardware, sort n numbers on O(n) hardware in O(sqrt(n)) time. CprE 583 – Reconfigurable Computing September 28, 2006

Systolic Loop Transformations
Automatically re-structure code for Parallelism Locality Driven by dependency analysis September 28, 2006 CprE 583 – Reconfigurable Computing

Defining Dependencies
Flow Dependence Anti-Dependence Output Dependence Input Dependence W  R δf R  W δa W  W δo R  R δi true false S1) a = 0; S2) b = a; S3) c = a + d + e; S4) d = b; S5) b = 5+e CprE 583 – Reconfigurable Computing September 28, 2006

Example Dependencies 1 S1) a = 0; S2) b = a; S3) c = a + d + e; S4) d = b; S5) b = 5+e 2 S1 δf S2 due to a S1 δf S3 due to a S2 δf S4 due to b S3 δa S4 due to d S4 δa S5 due to b S2 δo S5 due to b S3 δi S5 due to e 3 4 5 September 28, 2006 CprE 583 – Reconfigurable Computing

Data Dependencies in Loops
Dependence can flow across iterations of the loop Dependence information is annotated with iteration information If dependence is across iterations it is loop carried otherwise loop independent for (i=0; i<n; i++) { A[i] = B[i]; B[i+1] = A[i]; } δf loop carried δf loop independent September 28, 2006 CprE 583 – Reconfigurable Computing

Unroll Loop to Find Dependencies
for (i=0; i<n; i++) { A[i] = B[i]; B[i+1] = A[i]; } δf loop carried δf loop independent A[0] = B[0]; B[1] = A[0]; A[1] = B[1]; B[2] = A[1]; A[2] = B[2]; B[3] = A[2]; ... i = 0 Distance/direction of dependence is also important i = 1 i = 2 September 28, 2006 CprE 583 – Reconfigurable Computing

Thought Exercise Consider the Laplace Transformation: In teams of two, try to determine the flow dependencies, anti dependencies, output dependencies, and input dependencies Use loop unrolling to find dependencies Most dependencies found gets a prize for (i=1; i<N; i++) for (j=1; j<N; j++) c = -4*a[i][j] + a[i-1][j] + a[i+1][j]; c += a[i][j+1] + a[i][j+1] b[i][j] = c; } CprE 583 – Reconfigurable Computing September 28, 2006

Iteration Space Every iteration generates a point in an n-dimensional space, where n is the depth of the loop nest for (i=0; i<n; i++) { ... } [4] [3; 2] for (i=0; i<n; i++) { for (j=0; j<5; j++) { ... } September 28, 2006 CprE 583 – Reconfigurable Computing

Distance Vectors for (i=0; i<n; i++) { A[i] = B[i]; B[i+1] = A[i]; } Distance vector is the difference between the target and source iterations A[0] = B[0]; B[1] = A[0]; A[1] = B[1]; B[2] = A[1]; A[2] = B[2]; B[3] = A[2]; ... i = 0 d = It - Is i = 1 Exactly the distance of the dependence, i.e., i = 2 Is + d = It September 28, 2006 CprE 583 – Reconfigurable Computing

Distance Vectors Example
for (i=0; i<n; i++) { for (j=0; j<m; j++) { A[i,j] = ; = A[i,j]; B[i,j+1] = ; = B[i,j]; C[i+1,j] = ; = C[i,j+1]; } A0,2= =A0,2 B0,3= =B0,2 C1,2= =C0,3 A0,2= =A0,2 B0,3= =B0,2 C1,2= =C0,3 A1,2= =A1,2 B1,3= =B1,2 C2,2= =C1,3 A2,2= =A2,2 B2,3= =B2,2 C3,2= =C2,3 A0,1= =A0,1 B0,2= =B0,1 C1,1= =C0,2 A0,1= =A0,1 B0,2= =B0,1 C1,1= =C0,2 A1,1= =A1,1 B1,2= =B1,1 C2,1= =C1,2 A1,1= =A1,1 B1,2= =B1,1 C2,1= =C1,2 A2,1= =A2,1 B2,2= =B2,1 C3,1= =C2,2 j A0,0= =A0,0 B0,1= =B0,0 C1,0= =C0,1 A0,0= =A0,0 B0,1= =B0,0 C1,0= =C0,1 A1,0= =A1,0 B1,1= =B1,0 C2,0= =C1,1 A2,0= =A2,0 B2,1= =B2,0 C3,0= =C2,1 A2,0= =A2,0 B2,1= =B2,0 C3,0= =C2,1 A yields [0; 0] B yields [0; 1] i C yields [1; -1] September 28, 2006 CprE 583 – Reconfigurable Computing

FIR Distance Vectors Y0= =Y0 =X3 =W3 Y1= =Y1 =X4 =W3 Y2= =Y2 =X5 =W3 Y3= =Y3 =X6 =W3 for (i=0; i<n; i++) for (j=0; j<m; j++) Y[i] = Y[i]+X[i+j]*W[j]; Y0= =Y0 =X2 =W2 Y1= =Y1 =X3 =W2 Y2= =Y2 =X4 =W2 Y3= =Y3 =X5 =W2 Y yields: δa [0; 0] Y yields: δf [0; 1] Y0= =Y0 =X1 =W1 Y1= =Y1 =X2 =W1 Y2= =Y2 =X3 =W1 Y3= =Y3 =X4 =W1 X yields: δi [1; -1] W yields: δi [1; 0] Y0= =Y0 =X0 =W0 Y1= =Y1 =X1 =W0 Y2= =Y2 =X2 =W0 Y3= =Y3 =X3 =W0 CprE 583 – Reconfigurable Computing September 28, 2006

Re-label / Pipeline Variables
Remove anti-dependencies and input dependencies by relabeling or pipelining variables Creates new flow dependencies Removes anti/input dependencies Y W X for (i=0; i<n; i++) { for (j=0; j<m; j++) { Wi[j] = Wi-1[j]; Xi[i+j]=Xi-1[i+j]; Yj[i] = Yj-1[i]+Xi[i+j]*Wi[j]; } CprE 583 – Reconfigurable Computing September 28, 2006

FIR Dependencies Y W X Y20= =Y10 X02= =X-12 W02= =W-12 Y21= =Y11 X13= =X03 W12= =W02 Y22= =Y12 X24= =X14 W22= =W12 Y10= =Y00 X01= =X-11 W01= =W-11 Y11= =Y01 X12= =X02 W11= =W01 Y12= =Y02 X23= =X13 W21= =W11 j Y00= =Y-10 X00= =X-10 W00= =W-10 Y01= =Y-11 X11= =X01 W02= =W00 Y02= =Y-12 X22= =X12 W20= =W10 i September 28, 2006 CprE 583 – Reconfigurable Computing

Transforming to Time and Space
Using data dependencies, find T T defines a mapping of the iteration space into a time component π, and a space component, S T = [π; S] If π·I1 = π·I2, then I1 and I2 execute at the same time π·d – amount of time units to move data items (π·d > 0) Any S can be picked that makes T a bijection See [Mol83A] for more details September 28, 2006 CprE 583 – Reconfigurable Computing

Calculating T for FIR For π = [p1 p2] Since π·d > 0, we see that: p2 != 0 (from Y) p1 != 0 (from W) p1 > p2 (from X) Smallest solution π = [2 1] S can be [1 0], [0 1], [1 1] Y W X September 28, 2006 CprE 583 – Reconfigurable Computing

An Example Transformation
(0,4) (1,4) (2,4) (3,4) Y W X (0,3) (1,3) (2,3) (3,3) Time W (0,2) (1,2) (2,2) (3,2) X Y (0,1) (1,1) (2,1) (3,1) (0,0) (1,0) (2,0) (3,0) Space September 28, 2006 CprE 583 – Reconfigurable Computing

An Example Transformation (cont.)
(0,4) (1,4) (2,4) (3,4) (0,3) (1,3) (2,3) (3,3) Time W (0,2) (1,2) (2,2) (3,2) Y W X Y X (0,1) (1,1) (2,1) (3,1) (0,0) (1,0) (2,0) (3,0) Space CprE 583 – Reconfigurable Computing September 28, 2006

Summary Non-numeric (database ops) example of systolic computing Multiple use of each input data item Concurrency Regular data and control flow Loop transformations Data dependency analysis Restructure code for parallelism, locality September 28, 2006 CprE 583 – Reconfigurable Computing

CprE / ComS 583 Reconfigurable Computing

Similar presentations

Presentation on theme: "CprE / ComS 583 Reconfigurable Computing"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

CprE / ComS 583 Reconfigurable Computing

Similar presentations

Presentation on theme: "CprE / ComS 583 Reconfigurable Computing"— Presentation transcript:

Similar presentations

About project

Feedback