1 ECE734 VLSI Arrays for Digital Signal Processing Loop Transformation.



2 ECE734 VLSI Arrays for Digital Signal Processing © by Yu Hen Hu
Representing Nested Loops
An m-level nested loop:
    L_1: DO i_1 = p_1, q_1
    L_2:   DO i_2 = p_2, q_2
             ...
    L_m:     DO i_m = p_m, q_m
               H(i_1, i_2, ..., i_m)
             Enddo
             ...
           Enddo
The loop indices $\{i_k;\ 1 \le k \le m\}$ form an $m \times 1$ index vector $\mathbf{i} = [i_1, i_2, \ldots, i_m]^T$, which corresponds to a lattice point in the m-dimensional index space $I$.
Loop bounds: $\{p_k, q_k\}$.
Loop body: $H(i_1, i_2, \ldots, i_m)$, which is executed on a single processor in a single time unit (t.u.).
The granularity considered here is one loop body.

3 Regular Nested Loops
If the loop bounds are all constants, the index points of the nested loop form a rectangular parallelepiped in the index space: $\{\mathbf{i};\ \mathbf{p} \le \mathbf{i} \le \mathbf{q}\}$.
In the general situation, the loop bounds are linear (affine) functions, with integer coefficients, of the outer loop indices. They can be formulated as two inequalities, $\mathbf{p}_0 \le P\mathbf{i}$ and $Q\mathbf{i} \le \mathbf{q}_0$, where P and Q are lower triangular matrices. If P = Q, it is a regular nested loop.
Examples:
    Do i = 0, 5
      Do j = 3, 7
        a(i,j)=b(i)+c(j)
      Enddo
    Enddo

    Do i = 0, 5
      Do j = 2*i-1, 3*i+2
        a(i,j)=b(i)+c(j)
      Enddo
    Enddo
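The two inequalities can be checked by brute force. The sketch below enumerates the lattice points of both example loops; the helper names (`matvec`, `index_points`) and the search box are assumptions made for illustration, and the matrices P and Q were derived here from the loop bounds rather than taken from the slide.

```python
from itertools import product

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def index_points(P, p0, Q, q0, box):
    """Brute-force the lattice points i with p0 <= P i and Q i <= q0.
    `box` lists a (low, high) search range per index (sketch-only helper)."""
    points = []
    for i in product(*(range(lo, hi + 1) for lo, hi in box)):
        lower_ok = all(a <= b for a, b in zip(p0, matvec(P, i)))
        upper_ok = all(a <= b for a, b in zip(matvec(Q, i), q0))
        if lower_ok and upper_ok:
            points.append(i)
    return points

# Example 1: i = 0..5, j = 3..7 (P = Q = identity, so the nest is regular)
I2 = [[1, 0], [0, 1]]
rect = index_points(I2, [0, 3], I2, [5, 7], [(0, 5), (0, 20)])

# Example 2: i = 0..5, j = 2i-1 .. 3i+2 (P != Q, so the nest is not regular)
P = [[1, 0], [-2, 1]]   # 0 <= i  and  -1 <= j - 2i
Q = [[1, 0], [-3, 1]]   # i <= 5  and  j - 3i <= 2
skew = index_points(P, [0, -1], Q, [5, 2], [(0, 5), (-5, 20)])
```

The first loop yields a 6 by 5 rectangle of index points; the second yields a skewed region whose row length grows with i.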

4 Schedule and Precedence
A schedule $S: \mathbf{i} \to t(\mathbf{i})$ is a mapping from each index point $\mathbf{i}$ in the index space $I$ to a positive integer $t(\mathbf{i})$, which dictates when this iteration is to be executed.
An iteration $H(\mathbf{i})$ will be executed before $H(\mathbf{j})$ if its index vector $\mathbf{i}$ lexicographically precedes index vector $\mathbf{j}$, that is, $\mathbf{i} \prec \mathbf{j}$. This implies that there exists an integer $r$, $1 \le r \le m$, such that $i_k = j_k$ for $k < r$, and $i_r < j_r$.
Example: $[1\ 3\ 4] \prec [2\ 1\ 1]$.
If two iterations have no (inter-iteration) dependence between them, then these two iterations can be executed concurrently.
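As a minimal sketch, lexicographic precedence and the sequential schedule it induces can be written as follows. The function names are hypothetical, and the schedule assumes constant loop bounds.

```python
from itertools import product

def lex_precedes(i, j):
    """Return True if index vector i lexicographically precedes j."""
    for a, b in zip(i, j):
        if a != b:
            return a < b
    return False  # equal vectors: neither precedes the other

def sequential_schedule(bounds):
    """Map each index point to a positive time step t(i), in lexicographic order.
    bounds is a list of constant (p_k, q_k) pairs, outermost loop first."""
    ranges = [range(p, q + 1) for p, q in bounds]
    return {pt: t for t, pt in enumerate(product(*ranges), start=1)}

sched = sequential_schedule([(0, 5), (3, 7)])
```

Here `sched` assigns one time unit per loop body, so the last of the 30 iterations runs at t = 30.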

5 Inter-iteration Dependence
An iteration $H(\mathbf{j})$ is dependent on iteration $H(\mathbf{i})$ if
1. $\mathbf{i} \prec \mathbf{j}$; and
2. $H(\mathbf{j})$ reads from a memory location (including registers) whose value was last written during the execution of iteration $H(\mathbf{i})$.
The corresponding dependence vector $\mathbf{d}$ is defined as $\mathbf{d} = \mathbf{j} - \mathbf{i} \succ \mathbf{0}$.
A matrix D consisting of all dependence vectors of an algorithm is called a dependence matrix.
Observation: if $H(\mathbf{j})$ is dependent on $H(\mathbf{i})$, then $t(\mathbf{i}) < t(\mathbf{j})$. The dependence relation imposes a partial ordering on the execution of the iterative loop nest.
Example:
    Do i=1,4
      Do j=1,4
        a(i,j)=a(i-1,j)+a(i,j-1)
      Enddo
    Enddo
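The definition can be applied mechanically by running the loop nest in sequential order and recording, for every read, which iteration last wrote that cell. This is only an illustrative sketch: `dependence_vectors`, `writes`, and `reads` are hypothetical names, and a single array is assumed so that a cell can be identified by its index tuple.

```python
from itertools import product

def dependence_vectors(bounds, writes, reads):
    """Collect the distinct d = j - i where iteration j reads a cell
    last written by an earlier iteration i (sequential execution order)."""
    last_writer = {}
    deps = set()
    for it in product(*(range(p, q + 1) for p, q in bounds)):
        for cell in reads(it):
            src = last_writer.get(cell)
            if src is not None and src != it:
                deps.add(tuple(a - b for a, b in zip(it, src)))
        for cell in writes(it):
            last_writer[cell] = it
    return deps

# a(i,j) = a(i-1,j) + a(i,j-1), 1 <= i, j <= 4
D = dependence_vectors(
    [(1, 4), (1, 4)],
    writes=lambda it: [it],
    reads=lambda it: [(it[0] - 1, it[1]), (it[0], it[1] - 1)],
)
```

For the example loop this yields the two dependence vectors (1, 0) and (0, 1), which form the columns of the dependence matrix.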

6 Data Dependence (General)
True (data) dependence:
    S1: A := B + C
    S2: D := A + 2
    S3: E := A + 3
S2 and S3 depend on S1, because both read the value of A that S1 writes.
Anti-dependence:
    S1: A := B + C
    S2: B := D + 2
S2 depends on S1 because S2 overwrites B, which S1 must read first; the variable B is assigned a value more than once during execution.
Output dependence:
    S1: A := B + C
    S2: D := A + 2
    S3: A := E + 3
S3 depends on S1 because the same variable A is assigned new values in both statements.
Both anti-dependence and output dependence can be removed using the single-assignment transform, which ensures that each variable is assigned a value only once during the execution of the algorithm.

7 Single Assignment Transformation
Consider the code segment:
    S1: A := B + C
    S2: B := D + 2
Variable B is assigned a new value in addition to its initially assigned value, which causes an anti-dependence.
Solution: variable renaming
    S1: A := B + C
    S2: B1 := D + 2
By introducing a new variable B1, S1 and S2 can be executed in parallel.
When an algorithm is represented in single assignment form, false dependences (anti- and output dependences) are removed at the expense of additional storage.
No specific algorithm is available yet to perform the single assignment transform automatically.

8 Single Assignment Transform
Methods for transforming an algorithm into single assignment format:
- For scalars: introduce new variables (renaming)
- For arrays: introduce additional array indices
Example (array):
    Do j=1,N
      C(i)=C(i)+A(j)*B(j)
    Enddo
becomes
    Do j=1,N
      C(i,j)=C(i,j-1)+A(j)*B(j)
    Enddo
Another example:
    Do i=1,N
      A(i)=B(i)+C(i)
      D(i)=A(i)+A(i+1)
    Enddo
Note that there is an anti-dependence in the loop body: iteration i reads A(i+1), which iteration i+1 overwrites. Introduce a new array A1:
    Do i=1,N
      A1(i)=B(i)+C(i)
      D(i)=A1(i)+A(i+1)
    Enddo
The anti-dependence is then removed.
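The effect of the renaming in the second example can be seen by running the iterations in a different order. The sketch below is illustrative only: the function names, the data values, and N = 4 are arbitrary assumptions, with index 0 of each list left unused so that indices match the 1-based loop.

```python
def original(A, B, C, order):
    """A(i)=B(i)+C(i); D(i)=A(i)+A(i+1), executed in the given iteration order."""
    A = list(A)  # work on a copy, since A is overwritten
    D = {}
    for i in order:
        A[i] = B[i] + C[i]
        D[i] = A[i] + A[i + 1]
    return D

def renamed(A, B, C, order):
    """Same loop after renaming the written array to A1; A is only read."""
    A1, D = {}, {}
    for i in order:
        A1[i] = B[i] + C[i]
        D[i] = A1[i] + A[i + 1]
    return D

A = [0, 10, 20, 30, 40, 50]
B = [0, 1, 2, 3, 4, 0]
C = [0, 5, 6, 7, 8, 0]
forward, backward = [1, 2, 3, 4], [4, 3, 2, 1]

same_after_renaming = renamed(A, B, C, forward) == renamed(A, B, C, backward)
order_sensitive_before = original(A, B, C, forward) != original(A, B, C, backward)
```

With the renamed array the result no longer depends on the iteration order, which is what allows the iterations to run concurrently.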

9 Variable Localization by Duplication
In a parallel, distributed processing system, a variable used by more than one iteration may need to be physically broadcast to multiple processors.
In the loop body of a nested loop algorithm, inter-iteration data broadcasting is needed when an indexed variable has lower dimension (fewer index vector dimensions) than the other variables.
Example:
    c(i,j)=c(i,j-1)+a(i,j)*b(j)
b(j) is used by every i-iteration.
Solution: rename the variable to a new indexed variable with the same index dimensions as the other variables, then duplicate the variable along the newly added index.
    b1(0,j)=b(j)
    b1(i,j)=b1(i-1,j)
    c(i,j)=c(i,j-1)+a(i,j)*b1(i,j)

10 Algorithm Rewrite Example
Matrix-vector product: c = A b
    do i=1,m
      c(i)=0
      do j=1,n
        c(i)=c(i)+a(i,j)*b(j)
      enddo
    enddo
Need:
- Single assignment transform
- Index localization
Transformed formulation (the loop is replaced by the Doall statement):
    c1(i,0)=0;                            $1 \le i \le m$
    b1(0,j)=b(j);                         $1 \le j \le n$
    b1(i,j)=b1(i-1,j);                    $1 \le i \le m$; $1 \le j \le n$
    c1(i,j)=c1(i,j-1)+a(i,j)*b1(i,j);     $1 \le i \le m$; $1 \le j \le n$
    c(i)=c1(i,n);                         $1 \le i \le m$
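A minimal sketch of the transformed formulation, written sequentially for checking purposes only. The function name and the zero-padded, 1-based array layout are assumptions of this sketch; the result for row i is read from column n, the last value taken by j.

```python
def matvec_single_assignment(a, b):
    """Compute c = A b with the localized, single-assignment recurrences.
    a is an m x n list of rows, b has length n; row/column 0 is padding."""
    m, n = len(a), len(b)
    c1 = [[0] * (n + 1) for _ in range(m + 1)]
    b1 = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        b1[0][j] = b[j - 1]                      # b1(0,j) = b(j)
    for i in range(1, m + 1):
        c1[i][0] = 0                             # c1(i,0) = 0
        for j in range(1, n + 1):
            b1[i][j] = b1[i - 1][j]              # pass b along the i direction
            c1[i][j] = c1[i][j - 1] + a[i - 1][j - 1] * b1[i][j]
    return [c1[i][n] for i in range(1, m + 1)]   # final partial sums

result = matvec_single_assignment([[1, 2, 3], [4, 5, 6]], [1, 0, -1])
```

Every element of `c1` and `b1` is written exactly once, and each update uses only neighbouring index points, so the recurrences map directly onto a grid of processing nodes.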

11 Data (Iteration) Dependence Graph
    c1(i,0)=0;                            $1 \le i \le m$
    b1(0,j)=b(j);                         $1 \le j \le n$
    b1(i,j)=b1(i-1,j);                    $1 \le i \le m$; $1 \le j \le n$
    c1(i,j)=c1(i,j-1)+a(i,j)*b1(i,j);     $1 \le i \le m$; $1 \le j \le n$
    c(i)=c1(i,n);                         $1 \le i \le m$
[Figure: iteration dependence graph over axes i and j, with inputs b(1) to b(3) and outputs c(1) to c(4). Each node takes c1(i,j-1) and b1(i-1,j) and produces c1(i,j) and b1(i,j).]
The c1 edges are data dependences. The b1 edges are induced dependences due to the distribution of duplicated data; their direction is flexible.
In an iteration-space data (loop) dependence graph, neither delays nor loops are allowed.
Granularity = loop body.

12 Shift-Invariant Iteration DG
If the dependence structure of each node in the iteration DG remains the same, it is called a shift-invariant DG (SIDG).
In a SIDG, the entire DG can be generated by shifting a single copy of the node dependence structure to every node inside the iteration bounds. Conditional statements are not allowed.
An algorithm formulation that leads to a shift-invariant DG is called a regular iterative algorithm (RIA).
An RIA is a single regular nested loop whose loop index vector $\mathbf{i}$ satisfies $\mathbf{p} \le M\mathbf{i} \le \mathbf{q}$, where $\mathbf{p}$, $\mathbf{q}$ are constant vectors and M is a lower triangular matrix.

13 Parallel Execution by Vectorization (DoAll)
If the last row of the dependence matrix contains all zero entries, the innermost loop can be replaced by a Doall loop so that all of its iterations are executed concurrently.
In the example below, the inner loop (index j) can be executed simultaneously, as there is no dependence between operations for different values of j.
Example:
    Do i=1,4
      Do j=1,4
        a(i,j)=a(i-1,j)+b(i,j)
      Enddo
    Enddo
becomes
    Do i=1,4
      Doall j=1,2,3,4
        a(i,j)=a(i-1,j)+b(i,j)
      Enddoall
    Enddo
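The last-row condition and its consequence can be sketched as follows. The helper names, the boundary row a(0,j), and the values of b are arbitrary assumptions; running the inner loop in a scrambled order is used here as a stand-in for concurrent execution.

```python
def innermost_is_doall(D):
    """D is a dependence matrix given as a list of rows (columns are vectors)."""
    return all(x == 0 for x in D[-1])

def run(inner_order):
    """Execute a(i,j)=a(i-1,j)+b(i,j) with the inner loop in the given order."""
    a = {(0, j): j for j in range(1, 5)}                  # boundary row a(0,j)
    b = {(i, j): 10 * i + j for i in range(1, 5) for j in range(1, 5)}
    for i in range(1, 5):
        for j in inner_order:                             # order must not matter
            a[i, j] = a[i - 1, j] + b[i, j]
    return a

D = [[1],   # row for loop i
     [0]]   # row for loop j: all zeros, so j can become a Doall
order_independent = run([1, 2, 3, 4]) == run([4, 2, 1, 3])
```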

14 DG Analysis
a(i,j)=a(i-1,j)+b(i,j), $1 \le i \le 4$, $1 \le j \le 4$
[Figure: two dependence graphs over axes i and j, one showing the vectorized execution schedule and one showing the sequential execution schedule.]

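15 Levels of Dependence Vectors
For a dependence vector $\mathbf{d} \succ \mathbf{0}$, its level is the position of its first non-zero entry, $\mathrm{level}(\mathbf{d}) = \min\{\ell : d_\ell \ne 0\}$, and loop $L_{\mathrm{level}(\mathbf{d})}$ carries the dependence.
Example:
    Do i=0,3
      Do j=0,3
        Do k=0,3
          a(i,j,k)=a(i,j-1,k-1)+1;
          b(i,j,k)=2*b(i,j,k-1)-1;
          c(i,j,k)=c(i-1,j,k-1)-1;
        Enddo
      Enddo
    Enddo
Dependence matrix (rows ordered i, j, k; one column per statement):
$$D = \begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix}$$
The levels of its dependence vectors are 2, 3, 1. All 3 loops carry dependence.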
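A short sketch of the level computation for this example, taking the level to be the 1-based position of the first non-zero entry. The dependence vectors are read off the three statements, and the function name is hypothetical.

```python
def level(d):
    """1-based position of the first non-zero entry of a dependence vector."""
    for pos, x in enumerate(d, start=1):
        if x != 0:
            return pos
    raise ValueError("a zero vector carries no inter-iteration dependence")

# (i, j, k) dependence vectors of the statements for a, b, and c
deps = [(0, 1, 1), (0, 0, 1), (1, 0, 1)]
levels = [level(d) for d in deps]
carrying_loops = sorted(set(levels))
```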

16 Loop Interchange
Interchange $L_2$ and $L_3$:
    Do i=0,3
      Do k=0,3
        Do j=0,3
          A(i,k,j)=A(i,k-1,j-1)+1;
          B(i,k,j)=2*B(i,k-1,j)-1;
          C(i,k,j)=C(i-1,k-1,j)-1;
        Enddo
      Enddo
    Enddo
where A(i,k,j) = a(i,j,k), B(i,k,j) = b(i,j,k), and C(i,k,j) = c(i,j,k).
New dependence matrix (rows ordered i, k, j; one column per statement):
$$D' = \begin{bmatrix} 0 & 0 & 1 \\ 1 & 1 & 1 \\ 1 & 0 & 0 \end{bmatrix}$$
New levels = 2, 2, 1. No dependence is carried at level 3, so the j-loop can be executed in parallel when i and k are fixed.
To verify, let i = k = 0:
    A(0,0,j)=A(0,-1,j-1)+1
    B(0,0,j)=2*B(0,-1,j)-1
    C(0,0,j)=C(-1,-1,j)-1
All of these operations can be executed simultaneously for different values of j.
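Loop interchange permutes the entries of every dependence vector. The sketch below applies the swap to the vectors of this example and checks the new levels. The helper names are hypothetical, and the legality test (every vector must stay lexicographically positive) is the standard condition rather than something stated on this slide.

```python
def level(d):
    """1-based position of the first non-zero entry."""
    return next(pos for pos, x in enumerate(d, start=1) if x != 0)

def is_lex_positive(d):
    """True if the first non-zero entry of d is positive."""
    for x in d:
        if x != 0:
            return x > 0
    return False

def interchange(deps, p, q):
    """Swap entries p and q (0-based loop positions) in every dependence vector."""
    out = []
    for d in deps:
        d = list(d)
        d[p], d[q] = d[q], d[p]
        out.append(tuple(d))
    return out

deps_ijk = [(0, 1, 1), (0, 0, 1), (1, 0, 1)]     # loop order i, j, k
deps_ikj = interchange(deps_ijk, 1, 2)           # loop order i, k, j
legal = all(is_lex_positive(d) for d in deps_ikj)
new_levels = [level(d) for d in deps_ikj]
```

Since no vector has level 3 after the swap, the innermost loop carries no dependence.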

17 Exploitation of Parallelism
Inner loop parallelism: if the first non-zero element of each dependence vector is above loop level k, then all inner loop nests, starting from level k, can be executed in parallel.
Outer loop parallelism: to execute an outer loop in parallel (where each inner loop nest is executed sequentially), the corresponding dependence matrix must have at least one row containing only zero entries.
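Both conditions can be read directly off the dependence vectors. This is a rough sketch with hypothetical function names; "above level k" is interpreted here as "the level of every vector is smaller than k".

```python
def level(d):
    """1-based position of the first non-zero entry."""
    return next(pos for pos, x in enumerate(d, start=1) if x != 0)

def inner_parallel_start(deps, m):
    """Smallest loop level k such that loops k..m carry no dependence, else None."""
    k = max(level(d) for d in deps) + 1 if deps else 1
    return k if k <= m else None

def zero_rows(deps, m):
    """1-based loop levels whose row of the dependence matrix is all zeros."""
    return [lvl for lvl in range(1, m + 1) if all(d[lvl - 1] == 0 for d in deps)]

after_interchange = [(0, 1, 1), (0, 1, 0), (1, 1, 0)]
before_interchange = [(0, 1, 1), (0, 0, 1), (1, 0, 1)]
k_after = inner_parallel_start(after_interchange, 3)
k_before = inner_parallel_start(before_interchange, 3)
```

For a(i,j)=a(i,j-1)+b(i,j), whose only dependence vector is (0, 1), `zero_rows` reports loop 1, so the outer i-loop can run in parallel while each j-loop runs sequentially.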

18 Uni-Modular Loop Transformation
A square matrix U is uni-modular if
- it contains integer entries, and
- $\det(U) = \pm 1$.
Examples: [example matrices not preserved in the transcript]
Uni-modular index transform: $\mathbf{i} \to U\mathbf{i} = \mathbf{k}$, which shifts and rotates the index vectors.
If used properly, a uni-modular transformation enables more loops to be executed in parallel.
A loop transformation matrix U is valid if, for each $\mathbf{d}$ in D, $U\mathbf{d} \succ \mathbf{0}$.
The dependence matrix of the transformed loop is UD.
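The two checks can be sketched as below. The helper names are hypothetical, and the sample matrices are illustrative choices, not the examples from the slide. This sketch accepts a determinant of +1 or -1, the usual definition of a unimodular matrix; tighten the test to +1 if the stricter condition is intended.

```python
def det(M):
    """Integer determinant by cofactor expansion (fine for small matrices)."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for c, x in enumerate(M[0]):
        minor = [row[:c] + row[c + 1:] for row in M[1:]]
        total += (-1) ** c * x * det(minor)
    return total

def matvec(U, d):
    return tuple(sum(a * b for a, b in zip(row, d)) for row in U)

def is_lex_positive(d):
    for x in d:
        if x != 0:
            return x > 0
    return False

def is_unimodular(U):
    all_int = all(isinstance(x, int) for row in U for x in row)
    return all_int and det(U) in (1, -1)

def is_valid(U, deps):
    """U is valid if every transformed dependence vector U d stays lex-positive."""
    return all(is_lex_positive(matvec(U, d)) for d in deps)

deps = [(0, 1), (1, 0)]
skew = [[1, 1], [0, 1]]        # illustrative skewing matrix
swap = [[0, 1], [1, 0]]        # loop interchange
reverse = [[-1, 0], [0, 1]]    # reversal of the outer loop
```

Skewing and interchange keep both vectors lexicographically positive, while reversing the outer loop maps (1, 0) to (-1, 0) and is therefore not valid for these dependences.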

19 An Example
    for i = 0,3
      for j = 0,3
        A(i,j)=A(i,j-1)+A(i-1,j)
      end
    end
Dependence matrix (columns are the dependence vectors of A(i,j-1) and A(i-1,j)):
$$D = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$
[The uni-modular matrix, the index vector transform, and the transformed indices of A(i-1,j) and A(i,j-1) were given as equations on the slide and are not preserved in the transcript.]
Transformed new formulation:
    for k_1 = 0,6
      for k_2 = max{0,k_1-3}, min{3,k_1}
        A(k_1,k_2)=A(k_1-1,k_2)+A(k_1-1,k_2-1)
      end
    end
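The transformed loop can be checked against the original numerically. The sketch below assumes the index transform k1 = i + j, k2 = j, that is U = [[1, 1], [0, 1]], which is one choice consistent with the transformed loop bounds but is not confirmed by the transcript. Boundary cells outside the index space are read as 1, an arbitrary choice.

```python
def original():
    """Sequential execution of A(i,j) = A(i,j-1) + A(i-1,j), 0 <= i, j <= 3."""
    A = {}
    get = lambda i, j: A.get((i, j), 1)          # boundary cells read as 1
    for i in range(4):
        for j in range(4):
            A[i, j] = get(i, j - 1) + get(i - 1, j)
    return A

def transformed():
    """Wavefront execution; B(k1,k2) holds A(i,j) with k1 = i + j, k2 = j (assumed)."""
    B = {}
    def get(k1, k2):
        i, j = k1 - k2, k2                       # map back to the original indices
        return 1 if i < 0 or j < 0 else B[k1, k2]
    for k1 in range(7):
        wave = range(max(0, k1 - 3), min(3, k1) + 1)
        # Every point of a wave reads only wave k1-1, so the wave is a Doall.
        new = {(k1, k2): get(k1 - 1, k2) + get(k1 - 1, k2 - 1) for k2 in wave}
        B.update(new)
    return {(k1 - k2, k2): v for (k1, k2), v in B.items()}

same = original() == transformed()
```

Each value of k1 selects one anti-diagonal i + j = k1 of the original index space, so the 16 iterations finish in 7 parallel steps instead of 16 sequential ones.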

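20 DGs of Uni-modular Transform
[Figure: dependence graphs before and after the uni-modular transform. The original graph has axes i and j, with wavefronts running from i+j=0 to i+j=6; the transformed graph has axes k_1 and k_2.]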