Theory and Practice of Dependence Testing


1 Theory and Practice of Dependence Testing
- Data and control dependences
- Scalar data dependences: true, anti, and output dependences
- Loop dependences: loop parallelism; dependence distance and direction vectors; loop-carried dependences; loop-independent dependences
- Loop dependence and loop iteration reordering
- Dependence tests

2 Data and Control Dependences
Data dependence: data is produced and consumed in the correct order.
Control dependence: a dependence that arises as a result of control flow.

Data dependence example:
S1  PI = 3.14
S2  R = 5.0
S3  AREA = PI * R ** 2

Control dependence example:
S1  IF (T.EQ.0.0) GOTO S3
S2  A = A / T
S3  CONTINUE

3 Data Dependence Definition  There is a data dependence from statement S 1 to statement S 2 iff 1) both S 1 and S 2 access the same memory location and at least one of them stores into it, and 2) there is a feasible run-time execution path from S 1 to S 2

4 Data Dependence Classification
Dependence relations are similar to hardware hazards (which may cause pipeline stalls):
- True dependences (also called flow dependences), denoted S1 δ S2, are the same as RAW (Read After Write) hazards: S1 X = … / S2 … = X
- Anti dependences, denoted S1 δ⁻¹ S2, are the same as WAR (Write After Read) hazards: S1 … = X / S2 X = …
- Output dependences, denoted S1 δ° S2, are the same as WAW (Write After Write) hazards: S1 X = … / S2 X = …

5 Compiling for Loop Parallelism
Dependence properties are crucial for determining and exploiting loop parallelism.
Bernstein's conditions (iterations I1 and I2 may execute in parallel if):
- Iteration I1 does not write into a location that is read by iteration I2
- Iteration I2 does not write into a location that is read by iteration I1
- Iteration I1 does not write into a location that is written into by iteration I2

6 Fortran’s PARALLEL DO
The PARALLEL DO guarantees that there are no scheduling constraints among its iterations.

OK (iterations are independent):
PARALLEL DO I = 1, N
  A(I) = A(I) + B(I)
ENDDO

Not OK (loop-carried dependences or a reduction):
PARALLEL DO I = 1, N
  A(I+1) = A(I) + B(I)
ENDDO
PARALLEL DO I = 1, N
  A(I-1) = A(I) + B(I)
ENDDO
PARALLEL DO I = 1, N
  S = S + B(I)
ENDDO

7 Loop Dependences
While the PARALLEL DO requires the programmer to assure that the loop iterations can be executed in parallel, an optimizing compiler must analyze loop dependences for automatic loop parallelization and vectorization.
Granularity of parallelism:
- Instruction-level parallelism (ILP), not loop-specific
- Synchronous (e.g. SIMD, also vector and SSE) parallelism is fine grain; the compiler aims to optimize inner loops
- Asynchronous (e.g. MIMD, work sharing) parallelism is coarse grain; the compiler aims to optimize outer loops

8 Synchronous Parallel Machines
SIMD model, simpler and cheaper compared to MIMD. Processors operate in lockstep, executing the same instruction on different portions of the data space. The application should exhibit data parallelism. Control flow requires masking operations (similar to predicated instructions), which can be inefficient.
[Figure: example 4x4 processor mesh]

9 Asynchronous Parallel Computers
- Multiprocessors: shared memory is directly accessible by each PE (processing element)
- Multicomputers: distributed memory requires message passing between PEs
- SMPs: shared memory in small clusters of processors; message passing is needed across clusters
[Figure: PEs on a shared-memory bus vs. distributed memories connected by bus, x-bar, or network]

10 Advanced Compiler Technology
Powerful restructuring compilers transform loops into parallel or vector code. They must analyze data dependences in loop nests besides straight-line code.

Original:
DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I)+A(J,K)*B(K,I)
    ENDDO
  ENDDO
ENDDO

Parallelized:
PARALLEL DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I)+A(J,K)*B(K,I)
    ENDDO
  ENDDO
ENDDO

Vectorized:
DO I = 1, N
  DO J = 1, N, 64
    C(J:J+63,I) = 0.0
    DO K = 1, N
      C(J:J+63,I) = C(J:J+63,I) + A(J:J+63,K)*B(K,I)
    ENDDO
  ENDDO
ENDDO

11 Normalized Iteration Space
In most situations it is preferable to use normalized iteration spaces. For an arbitrary loop with loop index I, loop bounds L and U, and stride S, the normalized iteration number is i = (I-L+S)/S.

DO I = 100, 20, -10
  A(I) = B(100-I) + C(I/5)
ENDDO

is analyzed as the normalized loop, with i = (110-I)/10:

DO i = 1, 9
  A(110-10*i) = B(10*i-10) + C(22-2*i)
ENDDO
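
The normalization formula above can be checked with a short script (a minimal sketch; the function name is illustrative):

```python
def normalized_iteration(I, L, S):
    # Normalized iteration number i = (I - L + S) / S for a loop DO I = L, U, S.
    # The division is exact for any I actually visited by the loop.
    return (I - L + S) // S

# Slide example: DO I = 100, 20, -10 becomes DO i = 1, 9
print([normalized_iteration(I, 100, -10) for I in range(100, 19, -10)])
# prints [1, 2, 3, 4, 5, 6, 7, 8, 9]
```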

12 Dependence in Loops In this example loop statement S 1 “depends on itself” from the previous iteration More precisely, statement S 1 in iteration i has a true dependence with S 1 in iteration i+1 DO I = 1, N S 1 A(I+1) = A(I) + B(I) ENDDO S 1 A(2) = A(1) + B(1) S 1 A(3) = A(2) + B(2) S 1 A(4) = A(3) + B(3) S 1 A(5) = A(4) + B(4) … is executed as

13 Iteration Vector Definition  Given a nest of n loops, the iteration vector i of a particular iteration of the innermost loop is a vector of integers i = (i 1, i 2, …, i n ) where i k represents the iteration number for the loop at nesting level k  The set of all possible iteration vectors is an iteration space

14 Iteration Vector Example The iteration space of the statement at S 1 is {(1,1), (2,1), (2,2), (3,1), (3,2), (3,3)} At iteration i = (2,1) the value of A(1,2) is assigned to A(2,1) DO I = 1, 3 DO J = 1, I S 1 A(I,J) = A(J,I) ENDDO ENDDO (1,1) (2,1)(2,2) (3,2)(3,3)(3,1) I J

15 Iteration Vector Ordering
The iteration vectors are naturally ordered according to a lexicographical order, e.g. iteration (1,2) precedes (2,1) and (2,2) in the example on the previous slide.
Definition: Iteration i precedes iteration j, denoted i < j, iff 1) i[1:n-1] < j[1:n-1], or 2) i[1:n-1] = j[1:n-1] and i_n < j_n
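
Python tuple comparison happens to implement exactly this lexicographic order, so the definition can be sanity-checked directly (a minimal sketch):

```python
def precedes(i, j):
    # Iteration vector i precedes iteration vector j (i < j):
    # Python compares tuples lexicographically, matching the definition.
    return i < j

# (1,2) precedes (2,1) and (2,2), as on the slide
assert precedes((1, 2), (2, 1))
assert precedes((1, 2), (2, 2))
assert not precedes((2, 1), (1, 2))
```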

16 Loop Dependence
Definition: There exists a dependence from S1 to S2 in a loop nest iff there exist two iteration vectors i and j such that 1) i < j and there is a path from S1 to S2, 2) S1 accesses memory location M on iteration i and S2 accesses memory location M on iteration j, and 3) one of these accesses is a write

17 Loop Dependence Example Show that the example loop has no loop dependences Answer: there are no iteration vectors i and j in {(1,1), (2,1), (2,2), (3,1), (3,2), (3,3)} such that i < j and either S 1 in i writes to the same element of A that is read at S 1 in iteration j, or S 1 in iteration i reads an element A that is written in iteration j DO I = 1, 3 DO J = 1, I S 1 A(I,J) = A(J,I) ENDDO ENDDO

18 Loop Reordering Transformations Definitions  A reordering transformation is any program transformation that changes the execution order of the code, without adding or deleting any statement executions  A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence

19 Fundamental Theorem of Dependence Any reordering transformation that preserves every dependence in a program preserves the meaning of that program

20 Valid Transformations
A transformation is said to be valid for the program to which it applies if it preserves all dependences in the program.

Original:
DO I = 1, 3
S1  A(I+1) = B(I)
S2  B(I+1) = A(I)
ENDDO

Valid (statement interchange preserves both loop-carried dependences):
DO I = 1, 3
  B(I+1) = A(I)
  A(I+1) = B(I)
ENDDO

Invalid (loop reversal swaps the source and sink of the carried dependences):
DO I = 3, 1, -1
  A(I+1) = B(I)
  B(I+1) = A(I)
ENDDO

21 Dependences and Loop Transformations Loop dependences are tested before a transformation is applied When a dependence test is inconclusive, dependence must be assumed In this example the value of K is unknown and loop dependence is assumed: DO I = 1, N S 1 A(I+K) = A(I) + B(I) ENDDO

22 Dependence Distance Vector Definition  Suppose that there is a dependence from S 1 on iteration i to S 2 on iteration j ; then the dependence distance vector d(i, j) is defined as d(i, j) = j - i Example: DO I = 1, 3 DO J = 1, I S 1 A(I+1,J) = A(I,J) ENDDO ENDDO True dependence between S 1 and itself on i = (1,1) and j = (2,1): d(i, j) = (1,0) i = (2,1) and j = (3,1): d(i, j) = (1,0) i = (2,2) and j = (3,2): d(i, j) = (1,0)
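
For small loop nests like the example above, the distance vectors can be recovered by brute-force enumeration of the iteration space (an illustrative sketch, not a practical dependence test):

```python
from itertools import product

def distance_vectors():
    # True-dependence distance vectors for the slide's loop nest
    #   DO I = 1, 3 / DO J = 1, I / S1: A(I+1,J) = A(I,J)
    # found by brute force over the triangular iteration space.
    space = [(i, j) for i in range(1, 4) for j in range(1, i + 1)]
    dists = set()
    for iv, jv in product(space, space):
        if iv < jv:                          # iv precedes jv lexicographically
            write = (iv[0] + 1, iv[1])       # element A(I+1,J) written at iv
            read = (jv[0], jv[1])            # element A(I,J) read at jv
            if write == read:
                dists.add((jv[0] - iv[0], jv[1] - iv[1]))
    return dists

print(distance_vectors())  # every dependence has distance (1, 0)
```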

23 Dependence Direction Vector
Definition: Suppose that there is a dependence from S1 on iteration i to S2 on iteration j; then the dependence direction vector D(i, j) is defined as
D(i, j)_k = "<" if d(i, j)_k > 0
D(i, j)_k = "=" if d(i, j)_k = 0
D(i, j)_k = ">" if d(i, j)_k < 0

24 Example
DO I = 1, 3
  DO J = 1, I
S1  A(I+1,J) = A(I,J)
  ENDDO
ENDDO
True dependence between S1 and itself on
i = (1,1) and j = (2,1): d(i, j) = (1,0), D(i, j) = (<, =)
i = (2,1) and j = (3,1): d(i, j) = (1,0), D(i, j) = (<, =)
i = (2,2) and j = (3,2): d(i, j) = (1,0), D(i, j) = (<, =)
[Figure: iteration space (1,1) … (3,3) with arrows from each source to its sink]

25 Data Dependence Graphs
Data dependences arise between statement instances. It is generally infeasible to represent all data dependences that arise in a program. Usually only static data dependences are recorded, using the data dependence direction vector, and depicted in a data dependence graph to compactly represent data dependences in a loop nest.
DO I = 1, N
S1  A(I) = B(I) * 5
S2  C(I+1) = C(I) + A(I)
ENDDO
Static data dependences for accesses to A and C: S1 δ(=) S2 and S2 δ(<) S2
Data dependence graph: edge S1 →(=) S2 and self-edge S2 →(<) S2

26 Loop-Independent Dependences Definition  Statement S 1 has a loop-independent dependence on S 2 iff there exist two iteration vectors i and j such that 1) S 1 refers to memory location M on iteration i, S 2 refers to M on iteration j, and i = j 2) there is a control flow path from S 1 to S 2 within the iteration

27 Loop-Carried Dependences Definitions  Statement S 1 has a loop-carried dependence on S 2 iff there exist two iteration vectors i and j such that S 1 refers to memory location M on iteration i, S 2 refers to M on iteration j, and d(i, j) > 0, (that is, D(i, j) contains a “<” as its leftmost non-“=” component)  A loop-carried dependence from S 1 to S 2 is 1) forward if S 2 appears after S 1 in the loop body 2) backward if S 2 appears before S 1 (or if S 1 =S 2 )  The level of a loop-carried dependence is the index of the leftmost non- “=” of D(i, j) for the dependence

28 Example (1)
DO I = 1, 3
S1  A(I+1) = F(I)
S2  F(I+1) = A(I)
ENDDO
S1 δ(<) S2 and S2 δ(<) S1. The loop-carried dependence S1 δ(<) S2 is forward; the loop-carried dependence S2 δ(<) S1 is backward. All loop-carried dependences are of level 1, because D(i, j) = (<) for every dependence.

29 Example (2)
DO I = 1, 10
  DO J = 1, 10
    DO K = 1, 10
S1    A(I,J,K+1) = A(I,J,K)
    ENDDO
  ENDDO
ENDDO
S1 δ(=,=,<) S1. All loop-carried dependences are of level 3, because D(i, j) = (=, =, <).
Note: level-k dependences are also denoted by Sx δk Sy; S1 δ3 S1 is the alternative notation for this level-3 dependence.
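
The level of a loop-carried dependence, as defined two slides back, is easy to compute from a direction vector (a minimal sketch):

```python
def direction(d):
    # Direction vector component corresponding to one distance component.
    return '<' if d > 0 else ('=' if d == 0 else '>')

def level(D):
    # Level of a loop-carried dependence: 1-based index of the leftmost
    # non-'=' component of the direction vector; None if loop-independent.
    for k, c in enumerate(D, start=1):
        if c != '=':
            return k
    return None

assert direction(1) == '<' and direction(0) == '=' and direction(-1) == '>'
assert level(('=', '=', '<')) == 3   # this slide's level-3 dependence
assert level(('<',)) == 1            # Example (1) on the previous slide
assert level(('=', '=')) is None     # loop-independent
```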

30 Direction Vectors and Reordering Transformations (1) Let T be a transformation on a loop nest that does not rearrange the statements in the body of the innermost loop; then T is valid if, after it is applied, none of the direction vectors for dependences with source and sink in the nest has a leftmost non-“=” component that is “>”

31 Direction Vectors and Reordering Transformations (2)
A reordering transformation preserves all level-k dependences if 1) it preserves the iteration order of the level-k loop, and 2) it does not interchange any loop at level k to a position outside the level-k loop

32 Examples
DO I = 1, 10
S1  F(I+1) = A(I)
S2  A(I+1) = F(I)
ENDDO
Both S1 δ(<) S2 and S2 δ(<) S1 are level-1 dependences, so statement interchange is a valid transform:
DO I = 1, 10
S2  A(I+1) = F(I)
S1  F(I+1) = A(I)
ENDDO

DO I = 1, 10
  DO J = 1, 10
    DO K = 1, 10
S1    A(I+1,J+2,K+3) = A(I,J,K) + B
    ENDDO
  ENDDO
ENDDO
S1 δ(<,<,<) S1 is a level-1 dependence, so interchanging and reversing the inner loops is a valid transform:
DO I = 1, 10
  DO K = 10, 1, -1
    DO J = 1, 10
S1    A(I+1,J+2,K+3) = A(I,J,K) + B
    ENDDO
  ENDDO
ENDDO

33 Direction Vectors and Reordering Transformations (3) A reordering transformation preserves a loop- independent dependence between statements S 1 and S 2 if it does not move statement instances between iterations and preserves the relative order of S 1 and S 2

34 Examples
DO I = 1, 10
S1  A(I) = …
S2  … = A(I)
ENDDO
S1 δ(=) S2 is a loop-independent dependence; it is preserved by loop reversal:
DO I = 10, 1, -1
S1  A(I) = …
S2  … = A(I)
ENDDO

DO I = 1, 10
  DO J = 1, 10
S1    A(I,J) = …
S2    … = A(I,J)
  ENDDO
ENDDO
S1 δ(=,=) S2 is a loop-independent dependence; it is preserved by reversing and interchanging both loops:
DO J = 10, 1, -1
  DO I = 10, 1, -1
S1    A(I,J) = …
S2    … = A(I,J)
  ENDDO
ENDDO

35 Combining Direction Vectors
When a loop nest has multiple different directions at the same loop level k, we sometimes abbreviate the level-k direction vector component with "*".

DO I = 1, 9
S1  A(I) = …
S2  … = A(10-I)
ENDDO
S1 δ(*) S2 combines S1 δ(<) S2, S1 δ(=) S2, and S1 δ(>) S2

DO I = 1, 10
S1  S = S + A(I)
ENDDO
S1 δ(*) S1

DO I = 1, 10
  DO J = 1, 10
S1    A(J) = A(J)+1
  ENDDO
ENDDO
S1 δ(*,=) S1

36 Iteration Vectors
Rendering the dependences of a loop's iteration space on a grid helps to visualize dependences.
DO I = 1, 3
  DO J = 1, I
S1  A(I+1,J) = A(I,J)
  ENDDO
ENDDO
Distance vector is (1,0): S1 δ(<,=) S1

37 Example 1
DO I = 1, 4
  DO J = 1, 4
S1  A(I,J+1) = A(I-1,J)
  ENDDO
ENDDO
Distance vector is (1,1): S1 δ(<,<) S1

38 Example 2
DO I = 1, 4
  DO J = 1, 5-I
S1  A(I+1,J+I-1) = A(I,J+I-1)
  ENDDO
ENDDO
Distance vector is (1,-1): S1 δ(<,>) S1

39 Example 3
DO I = 1, 4
  DO J = 1, 4
S1  B(I) = B(I) + C(I)
S2  A(I+1,J) = A(I,J)
  ENDDO
ENDDO
Distance vectors are (0,1) and (1,0): S1 δ(=,<) S1 and S2 δ(<,=) S2

40 Parallelization
It is valid to convert a sequential loop to a parallel loop if the loop carries no dependence.
DO I = 1, 4
  DO J = 1, 4
S1  A(I,J+1) = A(I,J)
  ENDDO
ENDDO
The only dependence, S1 δ(=,<) S1, is carried by the inner loop, so the outer loop parallelizes:
PARALLEL DO I = 1, 4
  DO J = 1, 4
S1  A(I,J+1) = A(I,J)
  ENDDO
ENDDO

41 Vectorization (1)
A single-statement loop that carries no dependence can be vectorized.
DO I = 1, 4
S1  X(I) = X(I) + C
ENDDO
vectorizes to the Fortran 90 array statement
S1  X(1:4) = X(1:4) + C
[Figure: elementwise vector operation adding C to X(1:4)]

42 Vectorization (2)
This example has a loop-carried dependence, S1 δ(<) S1, so vectorizing it is invalid:
DO I = 1, N
S1  X(I+1) = X(I) + C
ENDDO
S1  X(2:N+1) = X(1:N) + C is NOT a valid Fortran 90 equivalent

43 Vectorization (3)
Because only single-statement loops can be vectorized, loops with multiple statements must be transformed using the loop distribution transformation. This is legal when the loop carries no dependence or has only forward flow dependences.
DO I = 1, N
S1  A(I+1) = B(I) + C
S2  D(I) = A(I) + E
ENDDO
After loop distribution:
DO I = 1, N
S1  A(I+1) = B(I) + C
ENDDO
DO I = 1, N
S2  D(I) = A(I) + E
ENDDO
After vectorization:
S1  A(2:N+1) = B(1:N) + C
S2  D(1:N) = A(1:N) + E
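
The legality of distributing this loop (the flow dependence on A is forward) can be checked numerically; the array sizes and values below are arbitrary stand-ins:

```python
# Hypothetical data standing in for the slide's arrays and scalars.
N = 8
B = [float(v) for v in range(1, N + 1)]   # B(1..N), stored 0-based
C, E = 3.0, 2.0

# Original loop:  S1: A(I+1) = B(I) + C ;  S2: D(I) = A(I) + E
A1 = [0.0] * (N + 2)
D1 = [0.0] * (N + 1)
for i in range(1, N + 1):
    A1[i + 1] = B[i - 1] + C
    D1[i] = A1[i] + E

# After loop distribution, each single-statement loop is vectorizable:
A2 = [0.0] * (N + 2)
D2 = [0.0] * (N + 1)
for i in range(1, N + 1):          # S1: A(2:N+1) = B(1:N) + C
    A2[i + 1] = B[i - 1] + C
for i in range(1, N + 1):          # S2: D(1:N) = A(1:N) + E
    D2[i] = A2[i] + E

assert A1 == A2 and D1 == D2       # same results: the distribution is valid
```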

44 Vectorization (4)
When a loop has backward flow dependences and no loop-independent dependences, interchange the statements to enable loop distribution.
DO I = 1, N
S1  D(I) = A(I) + E
S2  A(I+1) = B(I) + C
ENDDO
After statement interchange and loop distribution:
DO I = 1, N
S2  A(I+1) = B(I) + C
ENDDO
DO I = 1, N
S1  D(I) = A(I) + E
ENDDO
After vectorization:
S2  A(2:N+1) = B(1:N) + C
S1  D(1:N) = A(1:N) + E

45 Preliminary Transformations to Support Dependence Testing
To determine distance and direction vectors, dependence tests require array subscripts to be in a standard form. Preliminary transformations put more subscripts in affine form, where subscripts are linear integer functions of loop induction variables:
- Loop normalization
- Induction-variable substitution
- Constant propagation

46 Loop Normalization
See the slide on normalized iteration space. Dependence tests assume iteration spaces are normalized. Normalization can distort dependence directions, but it is otherwise preferred to support loop analysis.
- Make the lower bound 1 (or 0)
- Make the stride 1

DO I = 1, M
  DO J = I, N
S1  A(J,I) = A(J,I-1) + 5
  ENDDO
ENDDO
with S1 δ(<,=) S1 normalizes to
DO I = 1, M
  DO J = 1, N - I + 1
S1  A(J+I-1,I) = A(J+I-1,I-1) + 5
  ENDDO
ENDDO
with S1 δ(<,>) S1

47 Data Flow Analysis
Data flow analysis with reaching definitions or SSA form supports constant propagation and dead code elimination. These are needed to standardize array subscripts.
K = 2
DO I = 1, 100
S1  A(I+K) = A(I) + 5
ENDDO
becomes, after constant propagation,
DO I = 1, 100
S1  A(I+2) = A(I) + 5
ENDDO

48 Forward Substitution Forward substitution and induction variable substitution (IVS) yield array subscripts that are amenable to dependence analysis DO I = 1, 100 K = I + 2 S 1 A(K) = A(K) + 5 ENDDO forward substitution DO I = 1, 100 S 1 A(I+2) = A(I+2) + 5 ENDDO

49 Induction Variable Substitution
Example loop with IVs:
I = 0
J = 1
while (I<N)
  I = I+1
  … = A[J]
  J = J+2
  K = 2*I
  A[K] = …
endwhile

After IV substitution (IVS) — note the affine indexes:
for i=0 to N-1
  S1: … = A[2*i+1]
  S2: A[2*i+2] = …
endfor

Dependence test: the GCD test solves the dependence equation 2*i_d - 2*i_u = -1; since 2 does not divide 1, there is no data dependence.

After parallelization:
forall (i=0,N-1)
  … = A[2*i+1]
  A[2*i+2] = …
endforall

50 Example (1)
Before and after induction-variable substitution of KI in the inner loop:

INC = 2
KI = 0
DO I = 1, 100
  DO J = 1, 100
    KI = KI + INC
    U(KI) = U(KI) + W(J)
  ENDDO
  S(I) = U(KI)
ENDDO

INC = 2
KI = 0
DO I = 1, 100
  DO J = 1, 100
    U(KI+J*INC) = U(KI+J*INC) + W(J)
  ENDDO
  KI = KI + 100*INC
  S(I) = U(KI)
ENDDO

51 Example (2) Before and after induction-variable substitution of KI in the outer loop INC = 2 KI = 0 DO I = 1, 100 DO J = 1, 100 U(KI+J*INC) = U(KI+J*INC) + W(J) ENDDO KI = KI + 100*INC S(I) = U(KI) ENDDO INC = 2 KI = 0 DO I = 1, 100 DO J = 1, 100 U(KI+(I-1)*100*INC+J*INC) = U(KI+(I-1)*100*INC+J*INC) + W(J) ENDDO S(I) = U(KI+I*100*INC) ENDDO KI = KI + 100*100*INC

52 Example (3)
Before and after constant propagation of KI and INC:

INC = 2
KI = 0
DO I = 1, 100
  DO J = 1, 100
    U(KI+(I-1)*100*INC+J*INC) = U(KI+(I-1)*100*INC+J*INC) + W(J)
  ENDDO
  S(I) = U(KI+I*100*INC)
ENDDO
KI = KI + 100*100*INC

DO I = 1, 100
  DO J = 1, 100
    U(I*200+J*2-200) = U(I*200+J*2-200) + W(J)
  ENDDO
  S(I) = U(I*200)
ENDDO
KI = 20000

All subscripts are now affine.

53 Dependence Testing
After loop normalization, IVS, and constant propagation, array subscripts are standardized. Affine subscripts are important, since most dependence tests require the affine form a1*i1 + a2*i2 + … + an*in + e. Non-linear subscripts contain unknowns (symbolic terms), function values, array values, etc.

54 Dependence Equation (1)
A dependence equation defines the access requirement. For a loop
DO I = 1, N
S1  A(f(I)) = A(g(I))
ENDDO
to prove flow dependence, ask: for which values of α < β is f(α) = g(β)?

DO I = 1, N
S1  A(I+1) = A(I)
ENDDO
α+1 = β has the solution α = β-1

DO I = 1, N
S1  A(2*I+1) = A(2*I)
ENDDO
2*α+1 = 2*β has no solution
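
For bounded loops, these existence questions can be checked by brute force (an illustrative sketch; real compilers solve the equations symbolically):

```python
def flow_dependence_exists(f, g, N):
    # Is there a write iteration a and a later read iteration b (a < b)
    # with f(a) == g(b)?  Brute force over the bounded iteration space.
    return any(f(a) == g(b)
               for a in range(1, N + 1)
               for b in range(a + 1, N + 1))

# A(I+1) = A(I): a+1 == b is solvable, so a flow dependence exists
assert flow_dependence_exists(lambda i: i + 1, lambda i: i, 10)
# A(2*I+1) = A(2*I): 2*a+1 == 2*b pits odd against even, no solution
assert not flow_dependence_exists(lambda i: 2 * i + 1, lambda i: 2 * i, 10)
```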

55 Dependence Equation (2)
A dependence equation defines the access requirement. For a loop
DO I = 1, N
S1  A(f(I)) = A(g(I))
ENDDO
to prove anti dependence, ask: for which values of α > β is f(α) = g(β)?

DO I = 1, N
S1  A(2*I) = A(2*I+2)
ENDDO
2*α = 2*β+2 has the solution α = β+1

DO I = 1, N
S1  A(I+1) = A(I)
ENDDO
α+1 = β with α > β has no solution

56 Dependence Equation (3)
To prove flow dependence, ask: for which values of α < β is f(α) = g(β)?

DO I = 1, N
S1  A(5) = A(6)
ENDDO
5 = 6 has no solution

DO I = 1, N
S1  A(5) = A(I)
ENDDO
5 = β has a solution if 5 ≤ N

DO I = 1, N
  DO J = 1, N
S1  A(I-1) = A(J)
  ENDDO
ENDDO
α1 - 1 = β2 has a solution; note that α = (α1,α2) and β = (β1,β2) are iteration vectors

57 ZIV, SIV, and MIV
Subscript pairs are classified as ZIV (zero index variable), SIV (single index variable), or MIV (multiple index variable).
DO I = …
  DO J = …
    DO K = …
S1    A(5,I+1,J) = A(N,I,K)
    ENDDO
  ENDDO
ENDDO
(5,N) is a ZIV pair; (I+1,I) is an SIV pair; (J,K) is an MIV pair

58 Example
DO I = 1, 4
  DO J = 1, 5-I
S1  A(I+1,J+I-1) = A(I,J+I-1)
  ENDDO
ENDDO
Distance vector is (1,-1): S1 δ(<,>) S1
The first distance vector component is easy to compute (SIV pair), but the second is not so easy (MIV pair).

59 Separability and Coupled Subscripts
When testing multidimensional arrays, subscripts are separable if their indices do not occur in other subscripts. ZIV and SIV subscript pairs are separable; MIV pairs may give rise to coupled subscript groups.
DO I = …
  DO J = …
    DO K = …
S1    A(I,J,J) = A(I,J,K)
    ENDDO
  ENDDO
ENDDO
The SIV pair (I,I) is separable; the MIV pairs in the last two dimensions give coupled indices, because J occurs in both.

60 Dependence Testing Overview 1. Partition subscripts into separable and coupled groups 2. Classify each subscript as ZIV, SIV, or MIV 3. For each separable subscript, apply the applicable single subscript test (ZIV, SIV, or MIV) to prove independence or produce direction vectors for dependence 4. For each coupled group, apply a multiple subscript test 5. If any test yields independence, no further testing needed 6. Otherwise, merge all dependence vectors into one set

61 Dependence Testing is Conservative or Exact
Dependence tests solve dependence equations.
Conservative tests attempt to prove that there is no solution to a dependence equation:
- Absence of a dependence is proven
- Proving dependence is sometimes possible and useful
- But if proofs fail, dependence must be assumed
- When dependence has to be assumed, the translated code is correct but may be suboptimal
Exact tests detect dependences iff they exist.

62 ZIV Test The ZIV test compares two subscripts If the expressions are proved unequal, no dependence can exist DO I = 1, N S 1 A(5) = A(6) ENDDO K = 10 DO I = 1, N S 1 A(5) = A(K) ENDDO K = 10 DO I = 1, N DO J = 1, N S 1 A(I,5) = A(I,K) ENDDO ENDDO
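
The ZIV test and its conservative fallback can be sketched in a few lines (the function name is illustrative; None models a value that constant propagation could not resolve):

```python
def ziv_test(e1, e2):
    # ZIV test on two loop-invariant subscript values.
    # None models a subscript the compiler could not resolve to a constant.
    if e1 is None or e2 is None:
        return "assume dependence"   # conservative: cannot prove inequality
    return "independent" if e1 != e2 else "assume dependence"

assert ziv_test(5, 6) == "independent"        # A(5) = A(6)
K = 10
assert ziv_test(5, K) == "independent"        # A(5) = A(K) after propagating K = 10
assert ziv_test(5, None) == "assume dependence"
```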

63 Strong SIV Test
Requires subscript pairs of the form aI+c1 and aI'+c2 for the def and use, respectively. The dependence equation aI+c1 = aI'+c2 has a solution if the dependence distance d = (c1 - c2)/a is an integer and |d| ≤ U - L, with L and U the loop bounds.
DO I = 1, N
S1  A(I+1) = A(I)
ENDDO
DO I = 1, N
S1  A(2*I+2) = A(2*I)
ENDDO
DO I = 1, N
S1  A(I+N) = A(I)
ENDDO
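
The strong SIV condition translates directly into code (a minimal sketch with illustrative names; all coefficients are assumed to be known integer constants):

```python
def strong_siv(a, c1, c2, L, U):
    # Strong SIV test for the subscript pair a*I + c1 (def) and a*I' + c2 (use).
    # Returns the dependence distance d when a dependence may exist, else None.
    if (c1 - c2) % a != 0:
        return None                   # non-integer distance: independent
    d = (c1 - c2) // a
    return d if abs(d) <= U - L else None

assert strong_siv(1, 1, 0, 1, 100) == 1       # A(I+1) = A(I): distance 1
assert strong_siv(2, 2, 0, 1, 100) == 1       # A(2*I+2) = A(2*I): distance 1
assert strong_siv(1, 200, 0, 1, 100) is None  # distance exceeds the loop range
```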

64 Weak-Zero SIV Test
Requires subscript pairs of the form a1*I+c1 and a2*I'+c2 with either a1 = 0 or a2 = 0. If a2 = 0, the dependence equation I = (c2 - c1)/a1 has a solution if (c2 - c1)/a1 is an integer and L ≤ (c2 - c1)/a1 ≤ U. If a1 = 0, the case is similar.
DO I = 1, N
S1  A(I) = A(1)
ENDDO
DO I = 1, N
S1  A(2*I+2) = A(2)
ENDDO
DO I = 1, N
S1  A(1) = A(I) + A(1)
ENDDO

65 GCD Test
Assumes that f and g are affine:
f(x1,…,xn) = a0 + a1*x1 + … + an*xn
g(y1,…,yn) = b0 + b1*y1 + … + bn*yn
Reordering gives the linear Diophantine equation
a1*x1 - b1*y1 + … + an*xn - bn*yn = b0 - a0
which has a solution iff gcd(a1,…,an,b1,…,bn) divides b0 - a0.
DO I = 1, N
S1  A(2*I+1) = A(2*I)
ENDDO
DO I = 1, N
S1  A(4*I+1) = A(2*I+3)
ENDDO
DO I = 1, 10
  DO J = 1, 10
S1  A(1+2*I+20*J) = A(2+20*I+2*J)
  ENDDO
ENDDO
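
The divisibility condition above is a one-liner with the standard library (a minimal sketch; coefficient lists are assumed non-empty with at least one nonzero entry, as in the slide's examples):

```python
from functools import reduce
from math import gcd

def gcd_test(a, b, a0, b0):
    # GCD test for f(x) = a0 + sum(a_i*x_i) vs g(y) = b0 + sum(b_i*y_i).
    # The Diophantine equation sum(a_i*x_i) - sum(b_i*y_i) = b0 - a0 has an
    # integer solution iff gcd(a1,...,an,b1,...,bn) divides b0 - a0.
    g = reduce(gcd, [abs(c) for c in a + b])
    return (b0 - a0) % g == 0        # True means a dependence is possible

assert not gcd_test([2], [2], 1, 0)          # A(2*I+1) = A(2*I): independent
assert gcd_test([4], [2], 1, 3)              # A(4*I+1) = A(2*I+3): may depend
assert not gcd_test([2, 20], [20, 2], 1, 2)  # A(1+2*I+20*J) = A(2+20*I+2*J)
```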

66 Banerjee Test (1)
f and g are affine:
f(x1,…,xn) = a0 + a1*x1 + … + an*xn
g(y1,…,yn) = b0 + b1*y1 + … + bn*yn
The dependence equation a0 - b0 + a1*x1 - b1*y1 + … + an*xn - bn*yn = 0 has no solution if 0 does not lie between the lower bound L and upper bound U of the equation's LHS.

IF M > 0 THEN
  DO I = 1, 10
S1  A(I) = A(I+M) + B(I)
  ENDDO
The dependence equation α = β + M is rewritten as -M + α - β = 0
Constraints: 1 ≤ M ≤ ∞, 0 ≤ α ≤ 9, 0 ≤ β ≤ 9

67 Banerjee Test (2)
IF M > 0 THEN
  DO I = 1, 10
S1  A(I) = A(I+M) + B(I)
  ENDDO
Test -M + α - β = 0 for (*) dependence, with constraints 1 ≤ M ≤ ∞, 0 ≤ α ≤ 9, 0 ≤ β ≤ 9:
original equation:  L = -M + α - β    U = -M + α - β
eliminate β:        L = -M + α - 9    U = -M + α - 0
eliminate α:        L = -M - 9        U = -M + 9
eliminate M:        L = -∞            U = 8
Possible dependence, because 0 lies within [-∞, 8].

68 Banerjee Test (3)
IF M > 0 THEN
  DO I = 1, 10
S1  A(I) = A(I+M) + B(I)
  ENDDO
Test -M + α - β = 0 for (<) dependence: assume α ≤ β - 1, so the constraints are 1 ≤ M ≤ ∞, 0 ≤ α ≤ β-1 ≤ 8, 1 ≤ α+1 ≤ β ≤ 9:
original equation:  L = -M + α - β    U = -M + α - β
eliminate β:        L = -M + α - 9    U = -M + α - (α+1)
eliminate α:        L = -M - 9        U = -M - 1
eliminate M:        L = -∞            U = -2
No dependence, because 0 does not lie within [-∞, -2].

69 Banerjee Test (4)
IF M > 0 THEN
  DO I = 1, 10
S1  A(I) = A(I+M) + B(I)
  ENDDO
Test -M + α - β = 0 for (>) dependence: assume α ≥ β + 1, so the constraints are 1 ≤ M ≤ ∞, 1 ≤ β+1 ≤ α ≤ 9, 0 ≤ β ≤ α-1 ≤ 8:
original equation:  L = -M + α - β        U = -M + α - β
eliminate β:        L = -M + α - (α-1)    U = -M + α - 0
eliminate α:        L = -M + 1            U = -M + 9
eliminate M:        L = -∞                U = 8
Possible dependence, because 0 lies within [-∞, 8].

70 Banerjee Test (5)
IF M > 0 THEN
  DO I = 1, 10
S1  A(I) = A(I+M) + B(I)
  ENDDO
Test -M + α - β = 0 for (=) dependence: assume α = β, so the constraints are 1 ≤ M ≤ ∞, 0 ≤ α = β ≤ 9:
original equation:  L = -M + α - β    U = -M + α - β
eliminate α, β:     L = -M            U = -M
eliminate M:        L = -∞            U = -1
No dependence, because 0 does not lie within [-∞, -1].

71 Testing for All Direction Vectors
Refine a dependence between a pair of statements by testing for the most general set of direction vectors and, upon failure to prove independence, refine each component recursively:
(*,*,*) refines to (=,*,*), (<,*,*), (>,*,*)
(<,*,*) refines to (<,=,*), (<,<,*), (<,>,*)
(<,=,*) refines to (<,=,=), (<,=,<), (<,=,>)
Branches that answer "yes (may depend)" are refined further; branches that answer "no (indep)" are pruned.
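
The refinement tree above can be sketched with a brute-force "test" standing in for a real dependence test (all names are illustrative; the oracle set of actual directions is hypothetical):

```python
from itertools import product

DIRS = ('<', '=', '>')

def may_depend(vec, actual):
    # Stand-in dependence test: does any full refinement of vec
    # ('*' = any direction) occur in the set of actual directions?
    choices = [DIRS if c == '*' else (c,) for c in vec]
    return any(v in actual for v in product(*choices))

def refine(vec, actual, found):
    # Refine the leftmost '*' of a vector whose test answers "may depend";
    # prune branches proven independent (the tree on this slide).
    if not may_depend(vec, actual):
        return
    if '*' not in vec:
        found.append(vec)
        return
    k = vec.index('*')
    for c in DIRS:
        refine(vec[:k] + (c,) + vec[k + 1:], actual, found)

# Hypothetical dependence whose only direction vector is (<,=,=):
found = []
refine(('*', '*', '*'), {('<', '=', '=')}, found)
assert found == [('<', '=', '=')]
```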

72 Run-Time Dependence Testing
When a dependence equation includes unknowns, we can sometimes find the constraints on these unknowns that rule out a solution, and impose those constraints at run time on multi-version code.
DO I = 1, 10
S1  A(I) = A(I-K) + B(I)
ENDDO
becomes
IF K = 0 OR K > 9 THEN
  PARALLEL DO I = 1, 10
S1    A(I) = A(I-K) + B(I)
  ENDDO
ELSE
  DO I = 1, 10
S1    A(I) = A(I-K) + B(I)
  ENDDO
ENDIF