CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #13 – Other Spatial Styles

Lect-13.2 CprE 583 – Reconfigurable Computing October 2, 2007

Systolic Architectures
Goal – a general methodology for mapping computations into hardware (spatial computing) structures
Composition:
- Simple compute cells (e.g., add, sub, max, min)
- Regular interconnect pattern
- Pipelined communication between cells
- I/O at boundaries

[Figure: example compute cells – multiply, multiply-accumulate, min]

Example – Matrix-Vector Product

for (i=1; i<=n; i++)
  for (j=1; j<=n; j++)
    y[i] += a[i][j] * x[j];
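The loop above is the functional model that any systolic mapping must reproduce. A minimal C sketch (0-based indexing instead of the slide's 1-based, and the 4-column width is just for illustration):

```c
#include <assert.h>

/* Functional model of the slide's loop: y[i] += a[i][j] * x[j].
   A systolic array streams x[j] through a row of multiply-accumulate
   cells, but it must compute exactly this result. */
void matvec(int n, double a[][4], const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++)
            y[i] += a[i][j] * x[j];
    }
}
```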

Matrix-Vector Product (cont.)

[Figure: systolic schedule – the a[i][j] values enter in staggered diagonals (a11 at t=1; a21, a12 at t=2; …), x1 … xn stream across the cells, and the results y1 … yn emerge at t = n+1, n+2, n+3, …]

Banded Matrix-Vector Product

[Figure: n×n matrix with a band of width p+q-1 (p subdiagonals, q superdiagonals)]

for (i=1; i<=n; i++)
  for (j=1; j<=p+q-1; j++)
    y[i] += a[i][j-q-i] * x[j];

Banded Matrix-Vector Product (cont.)

[Figure: linear systolic array of cells with ports x_in/x_out, y_in/y_out, a_in; the band elements a11, a21, a12, a22, a31, a23, a32 enter over t = 1 … 5 while x1, x2, x3 stream through and the y_i accumulate]

Outline
- Recap
- Non-Numeric Systolic Examples
- Systolic Loop Transformations
  - Data dependencies
  - Iteration spaces
  - Example transformations
- Reading – Cellular Automata
- Reading – Bit-Serial Architectures

Example – Relational Database
Relation is a collection of tuples that all have the same attributes
Tuple is a fixed number of objects
Represented in a table:

tuple #  Name          School         Age   QB Rating
0        T. Brady      Michigan       …     …
1        T. Romo       E. Illinois    …     …
2        J. Delhomme   LA-Lafayette   …     …
3        P. Manning    Tennessee      …     …
4        R. Grossman   Florida        27    45.2

Database Operations
Intersection: A ∩ B – all records in both relation A and B
- Must compare all |A| × |B| tuple pairs
- Compare via sequence compare
- OR along each row or column to get an inclusion bitvector

[Figure: |A| × |B| comparison grid – tuple A_i meets tuple B_j at cell T_ij; matches shown at T12 and T21]

Database Operations (cont.)
Tuple Comparison
- Problem – tuples are long; comparison time might limit the computation rate
- Strategy – perform the comparison in pipelined manner, field by field
  - Stagger the fields
  - Arrange to compute field i on the cycle after field i-1
- Cell: t_out = t_in AND (a_in XNOR b_in)

[Figure: a True token enters the pipeline; fields A[i,1] … A[i,4] meet the staggered fields B[j,1] … B[j,4], one field per cycle]
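The cell equation above is easy to model in software. A minimal sketch (the function names are illustrative; the staggered pipeline is modeled as a sequential pass over the fields):

```c
#include <assert.h>
#include <stdbool.h>

/* One comparison cell from this slide: t_out = t_in AND (a_in XNOR b_in).
   The running flag t stays true only while every field pair matches. */
bool compare_cell(bool t_in, int a_in, int b_in) {
    return t_in && (a_in == b_in);   /* field-wide XNOR = equality test */
}

/* Pipelined tuple comparison: feed the fields through the cell one per
   cycle, as the staggered systolic schedule would. */
bool tuples_equal(const int *a, const int *b, int nfields) {
    bool t = true;                   /* the "True" token entering cell 0 */
    for (int f = 0; f < nfields; f++)
        t = compare_cell(t, a[f], b[f]);
    return t;
}
```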

Database Intersection

[Figure: staggered systolic intersection – the fields a_ij of relation A stream against the fields b_ij of relation B through the comparison cells, a True token entering each comparison; match results T12, T21, T22 emerge]

Database Intersection (cont.)

[Figure: the same array with the match tokens combined – the per-tuple results T1, T2, T3 are OR-ed together (seeded with FALSE) to form the inclusion bitvector]

Database Operations (cont.)
Unique: remove duplicate elements in multirelation A
- Intersect A with A
Union: A ∪ B – one copy of all tuples in A and B
- Concatenate A and B
- Use Unique to remove duplicates
Projection: collapse A by removing select fields of every tuple
- Sample fields in A'
- Use Unique to remove duplicates
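Unique as described above reduces to all-pairs tuple comparison, which the systolic array performs on O(n) cells in O(n) time. A sequential C model of the same operation (the 2-field tuple size and names are illustrative, not from the slides):

```c
#include <assert.h>
#include <stdbool.h>

#define NF 2   /* fields per tuple (illustrative size) */

static bool equal(const int *a, const int *b) {
    for (int f = 0; f < NF; f++)
        if (a[f] != b[f]) return false;
    return true;
}

/* Unique: compare each tuple against every tuple already kept
   (the hardware does all pairs concurrently; this model is sequential).
   Compacts the relation in place and returns the new tuple count. */
int unique(int rel[][NF], int n) {
    int out = 0;
    for (int i = 0; i < n; i++) {
        bool dup = false;
        for (int j = 0; j < out; j++)
            if (equal(rel[i], rel[j])) dup = true;
        if (!dup) {
            for (int f = 0; f < NF; f++)
                rel[out][f] = rel[i][f];
            out++;
        }
    }
    return out;
}
```

Union then follows the slide's recipe directly: concatenate A and B into one array and call unique on the result.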

Database Join
Join: A ⋈(C_A, C_B) B – where columns C_A in A match columns C_B in B, concatenate tuples A_i and B_j
- Match C_A of A with C_B of B
- Keep all T_ij
- Filter for the (i, j) pairs for which T_ij = true
- Construct the join from the matched pairs
Claim: typically |A ⋈(C_A, C_B) B| << |A| · |B|

Database Summary
- Input database – O(n) data
- Operations require O(n²) time
  - O(n) if sorted first; O(n log n) to sort
- Systolic implementation – works on O(n) processing elements in O(n) time
- Typical database [KunLoh80A]:
  - 1500-bit tuples
  - 10,000 records in a relation
  - ~1 4-LUT per bit-compare
  - ~1600 XC4062 FPGAs, or ~84 XC4VLX200 FPGAs

Systolic Loop Transformations
Automatically restructure code for:
- Parallelism
- Locality
Driven by dependency analysis

Defining Dependencies
- Flow (true) dependence: write → read, δ^f
- Anti (false) dependence: read → write, δ^a
- Output dependence: write → write, δ^o
- Input dependence: read → read, δ^i

S1) a = 0;
S2) b = a;
S3) c = a + d + e;
S4) d = b;
S5) b = 5 + e;

Example Dependencies

S1) a = 0;
S2) b = a;
S3) c = a + d + e;
S4) d = b;
S5) b = 5 + e;

- S1 δ^f S2 due to a
- S1 δ^f S3 due to a
- S2 δ^f S4 due to b
- S3 δ^a S4 due to d
- S4 δ^a S5 due to b
- S2 δ^o S5 due to b
- S3 δ^i S5 due to e

Data Dependencies in Loops
- Dependence can flow across iterations of the loop
- Dependence information is annotated with iteration information
- If a dependence crosses iterations it is loop-carried; otherwise it is loop-independent

for (i=0; i<n; i++) {
  A[i] = B[i];      // δ^f loop-independent (A[i] written and read in the same iteration)
  B[i+1] = A[i];    // δ^f loop-carried (B[i+1] written here, read in the next iteration)
}

Unroll Loop to Find Dependencies

for (i=0; i<n; i++) {
  A[i] = B[i];
  B[i+1] = A[i];
}

Unrolled:
i = 0:  A[0] = B[0];  B[1] = A[0];
i = 1:  A[1] = B[1];  B[2] = A[1];
i = 2:  A[2] = B[2];  B[3] = A[2];
...

The δ^f from A[i] to its use is loop-independent; the δ^f through B is loop-carried. The distance/direction of a dependence is also important.

Thought Exercise
Consider the Laplace transformation below
- In teams of two, try to determine the flow dependencies, anti-dependencies, output dependencies, and input dependencies
- Use loop unrolling to find dependencies
- Most dependencies found gets a prize

for (i=1; i<N; i++)
  for (j=1; j<N; j++) {
    c = -4*a[i][j] + a[i-1][j] + a[i+1][j];
    c += a[i][j+1] + a[i][j-1];
    b[i][j] = c;
  }

Iteration Space
Every iteration generates a point in an n-dimensional space, where n is the depth of the loop nest

for (i=0; i<n; i++) {        // 1-D iteration space; e.g. iteration [4]
  ...
}

for (i=0; i<n; i++) {        // 2-D iteration space; e.g. iteration [3; 2]
  for (j=0; j<5; j++) {
    ...
  }
}

Distance Vectors

for (i=0; i<n; i++) {
  A[i] = B[i];
  B[i+1] = A[i];
}

i = 0:  A[0] = B[0];  B[1] = A[0];
i = 1:  A[1] = B[1];  B[2] = A[1];
i = 2:  A[2] = B[2];  B[3] = A[2];
...

The distance vector is the difference between the target and source iterations: d = I_t - I_s
Exactly the distance of the dependence, i.e., I_s + d = I_t
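The definition d = I_t - I_s can be checked by brute force, mirroring the unrolling on this slide. A sketch for the loop-carried dependence through B in the code above (write index i+1, read index i; the function name is illustrative):

```c
#include <assert.h>

/* Brute-force the flow-dependence distance for B in the loop
   (A[i] = B[i]; B[i+1] = A[i]): search the unrolled iteration space
   for a pair (i_s, i_t) where the element written at i_s is the
   element read at i_t, then report d = i_t - i_s. */
int b_flow_distance(int n) {
    for (int is = 0; is < n; is++)
        for (int it = 0; it < n; it++)
            if (is + 1 == it)        /* write B[is+1] meets read B[it] */
                return it - is;      /* d = I_t - I_s */
    return -1;                       /* no dependence found */
}
```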

Lect-13.24CprE 583 – Reconfigurable ComputingOctober 2, 2007 A 0,1 = =A 0,1 B 0,2 = =B 0,1 C 1,1 = =C 0,2 A 0,1 = =A 0,1 B 0,2 = =B 0,1 C 1,1 = =C 0,2 for (i=0; i<n; i++) { for (j=0; j<m; j++) { A[i,j] = ; = A[i,j]; B[i,j+1] = ; = B[i,j]; C[i+1,j] = ; = C[i,j+1]; } Distance Vectors Example A 0,2 = =A 0,2 B 0,3 = =B 0,2 C 1,2 = =C 0,3 A 1,2 = =A 1,2 B 1,3 = =B 1,2 C 2,2 = =C 1,3 A 2,2 = =A 2,2 B 2,3 = =B 2,2 C 3,2 = =C 2,3 A 1,1 = =A 1,1 B 1,2 = =B 1,1 C 2,1 = =C 1,2 A 2,1 = =A 2,1 B 2,2 = =B 2,1 C 3,1 = =C 2,2 A 0,0 = =A 0,0 B 0,1 = =B 0,0 C 1,0 = =C 0,1 A 1,0 = =A 1,0 B 1,1 = =B 1,0 C 2,0 = =C 1,1 A 2,0 = =A 2,0 B 2,1 = =B 2,0 C 3,0 = =C 2,1 j i A 0,2 = =A 0,2 B 0,3 = =B 0,2 C 1,2 = =C 0,3 A 1,1 = =A 1,1 B 1,2 = =B 1,1 C 2,1 = =C 1,2 A 0,0 = =A 0,0 B 0,1 = =B 0,0 C 1,0 = =C 0,1 A 2,0 = =A 2,0 B 2,1 = =B 2,0 C 3,0 = =C 2,1 B yields [0; 1] A yields [0; 0] C yields [1; -1]

FIR Distance Vectors

for (i=0; i<n; i++)
  for (j=0; j<m; j++)
    Y[i] = Y[i] + X[i+j]*W[j];

Y yields δ^a [0; 0]
Y yields δ^f [0; 1]
X yields δ^i [1; -1]
W yields δ^i [1; 0]

[Figure: unrolled iteration grid of the FIR accesses Y_i, X_{i+j}, W_j]

Re-label / Pipeline Variables
- Remove anti-dependencies and input dependencies by relabeling or pipelining variables
- Creates new flow dependencies
- Removes anti/input dependencies

for (i=0; i<n; i++) {
  for (j=0; j<m; j++) {
    W_i[j]   = W_{i-1}[j];
    X_i[i+j] = X_{i-1}[i+j];
    Y_j[i]   = Y_{j-1}[i] + X_i[i+j]*W_i[j];
  }
}

[Figure: resulting flow dependence vectors – Y: [0; 1], W: [1; 0], X: [1; -1]]

FIR Dependencies

[Figure: unrolled i×j grid of the relabeled FIR code, showing the pipelined flow dependencies – each Y_j[i] depends on Y_{j-1}[i], each X_i on X_{i-1}, each W_i on W_{i-1}]

Transforming to Time and Space
- Using the data dependencies, find a transformation T
- T defines a mapping of the iteration space into a time component π and a space component S: T = [π; S]
- If π·I_1 = π·I_2, then I_1 and I_2 execute at the same time
- π·d – the number of time units to move a data item (requires π·d > 0)
- Any S can be picked that makes T a bijection
- See [Mol83A] for more details

Calculating T for FIR
For π = [p_1 p_2], the constraint π·d > 0 on each flow dependence gives:
- p_2 > 0 (from Y: [0; 1])
- p_1 > 0 (from W: [1; 0])
- p_1 > p_2 (from X: [1; -1])
Smallest solution: π = [2 1]
S can be [1 0], [0 1], or [1 1]
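The constraints above are just π·d > 0 for each flow dependence vector. A small sketch checking the chosen schedule (the dependence vectors d_Y = [0, 1], d_W = [1, 0], d_X = [1, -1] are those of the relabeled FIR code):

```c
#include <assert.h>

/* Dot product of the schedule vector pi with a dependence distance d.
   pi . d must be strictly positive for every flow dependence: it is the
   number of time steps the corresponding value travels between cells. */
int pi_dot(const int pi[2], const int d[2]) {
    return pi[0]*d[0] + pi[1]*d[1];
}
```

π = [2 1] gives delays 1, 2, and 1 for Y, W, and X, while π = [1 1] fails on X (delay 0), which is why it cannot be the schedule.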

An Example Transformation

[Figure: the 4×5 iteration grid, points (0,0) through (3,4), with the Y, W, X dependence arrows drawn before mapping onto the Time and Space axes]

An Example Transformation (cont.)

[Figure: the same iteration grid mapped onto the Time and Space axes by T, with the X, Y, and W data streams flowing between processing elements]

Summary
- Non-numeric (database ops) example of systolic computing
  - Multiple use of each input data item
  - Concurrency
  - Regular data and control flow
- Loop transformations
  - Data dependency analysis
  - Restructure code for parallelism, locality