CprE / ComS 583 Reconfigurable Computing


CprE / ComS 583 Reconfigurable Computing
Prof. Joseph Zambreno
Department of Electrical and Computer Engineering, Iowa State University
Lecture #12 – Other Spatial Styles

Quick Points
HW #3 coming out today
- Due Tuesday, October 17 (midnight)
- Topics: systolic computing structures, systolic mapping, logic partitioning, FPGA synthesis
Current priorities (tongue in cheek):
- CprE 583 Homework – priority 74
- Breathing, Eating, etc. – priority 1
- "Desperate Housewives" – priority 14
- Night out in Campustown – priority 6
- Other Work – priority 45
September 28, 2006 – CprE 583 – Reconfigurable Computing

Project Proposals
- Due Sunday, 10/8 at midnight
- Purpose – to provide a background and overview of the project
- Goal – allow me to understand what you are intending to do
- Project topic: perform an in-depth exploration of some area of reconfigurable computing
- Whatever topic you choose, you must include a strong experimental element in your project
- Work in groups of 2+ (3 if the proposal is very lofty)

Project Proposals (cont.)
Suggested structure [3-4 pages, IEEE conf. format]:
- Introduction – what is the context for this work? What problem are you trying to address? Why is it interesting/challenging?
- Prior work – what is the related work? How does your work differ from it? (5-10 references)
- Approach – how are you going to tackle the problem? What tools and methodologies do you intend to use? What experiments do you intend to run?
- Expected results – what do you expect the outcome of your project to be? What are the deliverables? How do you intend to present your results?
- Milestones – what is your expected progress schedule? Provide milestones on a weekly / bi-weekly basis

Systolic Architectures
Goal – a general methodology for mapping computations into hardware (spatial computing) structures
Composition:
- Simple compute cells (e.g. add, sub, max, min)
- Regular interconnect pattern
- Pipelined communication between cells
- I/O at the boundaries
Rationale: simple processing units make better use of unfamiliar VLSI technology; regular, local connections reduce I/O (pin counts, a potential bottleneck); the structure is predetermined because this is a static, custom VLSI design; pipelining increases throughput. Think of it as an assembly line of processing tasks operating on data as it flows by.
[Figure: example compute cells – multiply, add, min, multiply-by-constant c]

Example – Finite Impulse Response
A Finite Impulse Response (FIR) filter is a type of digital filter:
- Finite – the response to an impulse eventually settles to zero
- Requires no feedback

for (i=1; i<=n; i++)
  for (j=1; j<=k; j++)
    y[i] += w[j] * x[i+j-1];

Finite Impulse Response (cont.)
Sequential:
- Memory bandwidth per output – 2k+1
- O(k) cycles per output
- O(1) hardware
Systolic:
- Memory bandwidth per output – 2
- O(1) cycles per output
- O(k) hardware
[Figure: 4-tap systolic FIR – x_i streams through multiplier cells holding w1..w4, whose products feed an adder chain producing y_i]
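The systolic dataflow above can be checked against the slide's nested loop with a small cycle-level simulation in C. This is a sketch, not the lecture's implementation: the tap count, the 0-based indexing (y[i] = Σ_j w[j]·x[i+j]), and the function names are illustrative.

```c
#include <string.h>

#define N 8   /* number of outputs */
#define K 4   /* number of taps */

/* Reference: the slides' nested loop, rewritten 0-based. */
static void fir_ref(const int *x, const int *w, int *y) {
    for (int i = 0; i < N; i++) {
        y[i] = 0;
        for (int j = 0; j < K; j++)
            y[i] += w[j] * x[i + j];
    }
}

/* Systolic-style: x samples shift through K cells holding fixed weights;
 * each cycle the adder chain combines the K products into one output. */
static void fir_systolic(const int *x, const int *w, int *y) {
    int cell[K] = {0};                   /* x values currently in the array */
    for (int t = 0; t < N + K - 1; t++) {
        memmove(cell, cell + 1, (K - 1) * sizeof cell[0]);  /* shift left */
        cell[K - 1] = x[t];              /* new sample enters on the right */
        if (t >= K - 1) {                /* pipeline full: one output/cycle */
            int sum = 0;
            for (int j = 0; j < K; j++)
                sum += w[j] * cell[j];
            y[t - (K - 1)] = sum;
        }
    }
}
```

After a fill latency of K-1 cycles the array delivers one output per cycle, matching the O(1) cycles-per-output claim above.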

Example – Matrix-Vector Product

for (i=1; i<=n; i++)
  for (j=1; j<=n; j++)
    y[i] += a[i][j] * x[j];

Matrix-Vector Product (cont.)
Each function block performs a multiply-add operation. Data is fed in a time-delayed fashion: at t = 1 only a11 has entered, at t = 2 a21 and a12, at t = 3 a31, a22, and a13, and so on. This structure can perform a dense matrix-vector product in 2n cycles, with outputs emerging one per cycle: y1 at t = n, y2 at t = n+1, y3 at t = n+2, y4 at t = n+3.
[Figure: linear array of multiply-add cells; x1..xn stream in while the rows of A enter diagonally staggered]
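The staggered schedule can be simulated directly: cell i sees element a[i][j] at global cycle t = i + j, so cell i is active during cycles i through i+n-1 and the whole product finishes in about 2n cycles. A minimal C sketch (0-based, fixed size; names are illustrative):

```c
#define MVN 4   /* matrix dimension (illustrative) */

/* Reference matrix-vector product (the slide's loop, 0-based). */
static void mv_ref(int a[MVN][MVN], const int *x, int *y) {
    for (int i = 0; i < MVN; i++) {
        y[i] = 0;
        for (int j = 0; j < MVN; j++)
            y[i] += a[i][j] * x[j];
    }
}

/* Systolic-style: cell i accumulates y[i]; row i of A arrives staggered
 * by i cycles, so the whole product completes in 2*MVN - 1 cycles. */
static void mv_systolic(int a[MVN][MVN], const int *x, int *y) {
    for (int i = 0; i < MVN; i++) y[i] = 0;
    for (int t = 0; t < 2 * MVN - 1; t++)    /* global clock */
        for (int i = 0; i < MVN; i++) {      /* each multiply-add cell */
            int j = t - i;                   /* element reaching cell i now */
            if (j >= 0 && j < MVN)
                y[i] += a[i][j] * x[j];
        }
}
```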

Outline
- Project Proposals
- Recap
- Non-Numeric Systolic Examples
- Systolic Loop Transformations
  - Data dependencies
  - Iteration spaces
  - Example transformations
- Reading – Cellular Automata
- Reading – Bit-Serial Architectures

Example – Relational Database
A relation is a collection of tuples that all have the same attributes. A tuple is a fixed number of objects, represented as a row in a table:

tuple #  Name           School        Age  QB Rating
0        D. Carr        Fresno State  27   113.6
1        P. Rivers      NC State      24   107.4
2        D. McNabb      Syracuse      29   105.3
3        C. Pennington  Marshall      30   103.4
4        R. Grossman    Florida       26   100.9

Database Operations
Intersection: A ∩ B – all records in both relation A and B
- Must compare all |A| x |B| tuples
- Compare via sequence compare
- OR along a row or column to get an inclusion bitvector
To compute the intersection of two sets of database records (tuples), we must locate the records that are in both sets. For a systolic implementation, records can pass by each other (spatially) for comparison in the same order as a convolution operation.
[Figure: |A| x |B| grid of comparison results T11..T33 as streams A1..A3 and B1..B3 pass through the array]

Database Operations (cont.)
Tuple comparison:
- Problem – tuples are long, so comparison time might limit the computation rate
- Strategy – perform the comparison in pipelined fashion, field by field
  - Stagger the fields
  - Arrange to compute field i on the cycle after field i-1
- Cell: tout = tin AND (ain XNOR bin)
The comparison of record fields can be pipelined so that different cells along an array row compare different fields of the same record. Each cell pipes the true/false result of its field comparison to the next cell in the row, so the logical AND of all comparisons emerges at the end of the row. Note that we always compare every field because of the dataflow in the systolic architecture; a serial processor can short-circuit some of this work – once a result turns false, it does not need to evaluate the remaining field comparisons.
[Figure: row of comparison cells; fields B[j,1]..B[j,4] and A[i,1]..A[i,4] enter staggered, with True injected at the left end]
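One row of this pipeline is easy to sketch in C. The cell below is the slide's tout = tin AND (ain XNOR bin), with the bitwise XNOR realized as integer equality on whole fields; the function names are illustrative.

```c
/* One comparison cell: AND the incoming match flag with the equality
 * of the current field pair. */
static int cell(int t_in, int a_field, int b_field) {
    return t_in && (a_field == b_field);
}

/* A row of k cells compares two k-field tuples, one field per cell.
 * Every field is always compared -- the dataflow cannot short-circuit. */
static int tuples_equal(const int *a, const int *b, int k) {
    int t = 1;                        /* "True" is injected at the left */
    for (int i = 0; i < k; i++)
        t = cell(t, a[i], b[i]);
    return t;
}
```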

Database Intersection
Each column of the array is dedicated to comparing a different kind of field. Input streams A and B traverse the array upward and downward, respectively, with record fields staggered so that the fields belonging to a single record from the A stream and a single record from the B stream are compared by a single row of the array.
[Figure: snapshot of the array with staggered a and b fields flowing in opposite directions; True flags and partial results T12, T21, T22 are in flight]

Database Intersection (cont.)
The intersection operation is completed by propagating the logical OR of the record comparisons to collect identical records.
[Figure: the same array with an OR chain accumulating the per-row results T1, T2, T3]

Database Operations (cont.)
- Unique: remove duplicate elements in multirelation A
  - Intersect A with A
- Union: A ∪ B – one copy of all tuples in A and B
  - Concatenate A and B
  - Use Unique to remove duplicates
- Projection: collapse A by removing selected fields of every tuple
  - Sample fields in A'
Set projection projects records into a lower "dimensionality" by removing (ignoring) selected fields from each record of a set. The resulting set may contain duplicates that were originally differentiated by the removed fields, so set projection can be implemented using an intersection-based systolic array to remove duplicates.

Database Join
Join: A ⋈(CA,CB) B – where column CA in A matches column CB in B, concatenate tuples Ai and Bj
- Match CA of A with CB of B
- Keep all Ti,j
- Filter for (i,j) with Ti,j = true
- Construct the join from the matched pairs
Claim: typically | A ⋈(CA,CB) B | << | A | · | B |
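The match / filter / construct steps above can be sketched as a plain nested-loop equijoin in C. The relation sizes, two-field tuple layout, and names are illustrative, not from the lecture; a systolic version would compute the T[i][j] comparisons in parallel rather than in this doubly-nested loop.

```c
#define NA 3   /* tuples in relation A */
#define NB 3   /* tuples in relation B */

/* Each tuple is (key, payload). Keep the (i, j) pairs whose keys match
 * (T[i][j] true) and concatenate the tuples; returns the result size. */
static int equijoin(int a[NA][2], int b[NB][2], int out[NA * NB][3]) {
    int n = 0;
    for (int i = 0; i < NA; i++)
        for (int j = 0; j < NB; j++)
            if (a[i][0] == b[j][0]) {      /* match CA of A with CB of B */
                out[n][0] = a[i][0];       /* shared key column */
                out[n][1] = a[i][1];       /* rest of tuple Ai */
                out[n][2] = b[j][1];       /* rest of tuple Bj */
                n++;
            }
    return n;
}
```

The returned count is typically far smaller than NA*NB, which is the claim on the slide.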

Database Summary
- Input database – O(n) data
- Operations require O(n²) effort
  - O(n) if sorted first; sorting takes O(n log n)
- Systolic implementation – O(n) processing elements in O(n) time
- Typical database [KunLoh80A]: 1500-bit tuples, 10,000 records per relation
  - ~1 4-LUT per bit-compare
  - ~1600 XC4062 FPGAs, or ~84 XC4VLX200 FPGAs
If we sort the database records before performing the set operations, most of the operations become O(n) rather than O(n²); including the sort makes them O(n log n). In practice, most database applications do sort the data first, which suggests the systolic implementations are not work-optimal in some sense. Of course, sorting requires non-local connections and data access, so its lower operation count must be weighed against its greater interconnect requirements. FYI: the best systolic sorting algorithms sort n numbers on O(n) hardware in O(√n) time.

Systolic Loop Transformations
Automatically restructure code for:
- Parallelism
- Locality
Driven by dependency analysis

Defining Dependencies
- Flow (true) dependence: W → R (δf)
- Anti-dependence (false): R → W (δa)
- Output dependence: W → W (δo)
- Input dependence: R → R (δi)

S1) a = 0;
S2) b = a;
S3) c = a + d + e;
S4) d = b;
S5) b = 5 + e;

Example Dependencies

S1) a = 0;
S2) b = a;
S3) c = a + d + e;
S4) d = b;
S5) b = 5 + e;

- S1 δf S2 due to a
- S1 δf S3 due to a
- S2 δf S4 due to b
- S3 δa S4 due to d
- S4 δa S5 due to b
- S2 δo S5 due to b
- S3 δi S5 due to e

Data Dependencies in Loops
- A dependence can flow across iterations of the loop
- Dependence information is annotated with iteration information
- If a dependence crosses iterations it is loop carried; otherwise it is loop independent

for (i=0; i<n; i++) {
  A[i] = B[i];      // δf on A: loop independent
  B[i+1] = A[i];    // δf on B: loop carried
}

Unroll Loop to Find Dependencies

for (i=0; i<n; i++) {
  A[i] = B[i];      // δf loop independent
  B[i+1] = A[i];    // δf loop carried
}

Unrolled:
A[0] = B[0];  B[1] = A[0];   // i = 0
A[1] = B[1];  B[2] = A[1];   // i = 1
A[2] = B[2];  B[3] = A[2];   // i = 2
...

The distance/direction of the dependence is also important.

Thought Exercise
Consider the Laplace transformation below. In teams of two, try to determine the flow dependencies, anti-dependencies, output dependencies, and input dependencies. Use loop unrolling to find the dependencies. Most dependencies found gets a prize.

for (i=1; i<N; i++)
  for (j=1; j<N; j++) {
    c = -4*a[i][j] + a[i-1][j] + a[i+1][j];
    c += a[i][j-1] + a[i][j+1];
    b[i][j] = c;
  }

Iteration Space
Every iteration generates a point in an n-dimensional space, where n is the depth of the loop nest.

for (i=0; i<n; i++) {
  ...                      // e.g. iteration i = 4 is the point [4]
}

for (i=0; i<n; i++) {
  for (j=0; j<5; j++) {
    ...                    // e.g. iteration i = 3, j = 2 is the point [3; 2]
  }
}

Distance Vectors

for (i=0; i<n; i++) {
  A[i] = B[i];
  B[i+1] = A[i];
}

A[0] = B[0];  B[1] = A[0];   // i = 0
A[1] = B[1];  B[2] = A[1];   // i = 1
A[2] = B[2];  B[3] = A[2];   // i = 2
...

The distance vector is the difference between the target and source iterations, d = It - Is. It is exactly the distance of the dependence, i.e., Is + d = It.
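The d = It - Is definition can be checked by instrumenting the loop above: record, for each element of B, which iteration wrote it, and subtract when it is read. This is a sketch; the writer bookkeeping and names are illustrative, not part of the lecture's notation.

```c
#define DN 6   /* trip count (illustrative) */

/* dist_B[i]: distance of the loop-carried dependence on B observed at
 * iteration i (0 when the value read was not written by the loop). */
static void measure_distances(int dist_B[DN]) {
    int A[DN], B[DN + 1], writer[DN + 1];
    for (int k = 0; k <= DN; k++) { B[k] = k; writer[k] = -1; }
    for (int i = 0; i < DN; i++) {
        dist_B[i] = (writer[i] >= 0) ? i - writer[i] : 0;  /* d = It - Is */
        A[i] = B[i];          /* read B[i]   -- target of the dependence */
        B[i + 1] = A[i];      /* write B[i+1] -- source of the dependence */
        writer[i + 1] = i;    /* remember the writing iteration Is */
    }
}
```

Every read after the first hits a value written exactly one iteration earlier, so the distance is the constant 1, matching the unrolled trace.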

Distance Vectors Example

for (i=0; i<n; i++) {
  for (j=0; j<m; j++) {
    A[i][j] = ...;   ... = A[i][j];
    B[i][j+1] = ...; ... = B[i][j];
    C[i+1][j] = ...; ... = C[i][j+1];
  }
}

- A yields [0; 0]
- B yields [0; 1]
- C yields [1; -1]
[Figure: iteration-space grid in (i, j) showing each write and the read it reaches]

FIR Distance Vectors

for (i=0; i<n; i++)
  for (j=0; j<m; j++)
    Y[i] = Y[i] + X[i+j]*W[j];

- Y yields: δa [0; 0] and δf [0; 1]
- X yields: δi [1; -1]
- W yields: δi [1; 0]
[Figure: iteration-space grid showing the Y, X, and W accesses in each iteration]

Re-label / Pipeline Variables
Remove anti-dependencies and input dependencies by relabeling or pipelining variables:
- Creates new flow dependencies
- Removes anti/input dependencies

for (i=0; i<n; i++) {
  for (j=0; j<m; j++) {
    W_i[j]   = W_{i-1}[j];
    X_i[i+j] = X_{i-1}[i+j];
    Y_j[i]   = Y_{j-1}[i] + X_i[i+j] * W_i[j];
  }
}

(Subscripts denote the pipelined copies of W, X, and Y.)

FIR Dependencies
[Figure: iteration-space grid for the relabeled FIR, showing the pipelined flow dependencies for Y, X, and W]

Transforming to Time and Space
Using the data dependencies, find T:
- T defines a mapping of the iteration space into a time component π and a space component S: T = [π; S]
- If π·I1 = π·I2, then I1 and I2 execute at the same time
- π·d – the number of time units needed to move a data item (π·d > 0)
- Any S can be picked that makes T a bijection
See [Mol83A] for more details.

Calculating T for FIR
For π = [p1 p2], since π·d > 0 we see that:
- p2 > 0 (from Y)
- p1 > 0 (from W)
- p1 > p2 (from X)
Smallest solution: π = [2 1]. S can be [1 0], [0 1], or [1 1].
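Worked out explicitly from the FIR distance vectors given earlier (d_Y = [0; 1], d_W = [1; 0], d_X = [1; -1]), the π·d > 0 constraints are:

```latex
\pi \cdot d_Y = \begin{bmatrix} p_1 & p_2 \end{bmatrix}
               \begin{bmatrix} 0 \\ 1 \end{bmatrix} = p_2 > 0,
\qquad
\pi \cdot d_W = \begin{bmatrix} p_1 & p_2 \end{bmatrix}
               \begin{bmatrix} 1 \\ 0 \end{bmatrix} = p_1 > 0,
\qquad
\pi \cdot d_X = \begin{bmatrix} p_1 & p_2 \end{bmatrix}
               \begin{bmatrix} 1 \\ -1 \end{bmatrix} = p_1 - p_2 > 0
```

The smallest integer vector satisfying all three is π = [2 1]. Picking, for example, S = [1 0] gives T = [2 1; 1 0] with determinant -1, so T is a bijection on the iteration lattice.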

An Example Transformation
[Figure: iteration space with points (0,0)..(3,4); the Y, W, and X dependence vectors are drawn, and the axes are relabeled Space (horizontal) and Time (vertical) under the transformation]

An Example Transformation (cont.)
[Figure: the transformed iteration space, with the Y, W, and X dependence vectors mapped into the Time/Space coordinates]

Summary
- Non-numeric (database ops) example of systolic computing
  - Multiple uses of each input data item
  - Concurrency
  - Regular data and control flow
- Loop transformations
  - Data dependency analysis
  - Restructure code for parallelism and locality