Enhancing Fine-Grained Parallelism
Chapter 5 of Allen and Kennedy, Optimizing Compilers for Modern Architectures
Fine-Grained Parallelism
Techniques to enhance fine-grained parallelism:
— Loop Interchange
— Scalar Expansion
— Scalar Renaming
— Array Renaming
— Node Splitting
Recall the vectorization procedure:

procedure codegen(R, k, D);
   // R is the region for which we must generate code.
   // k is the minimum nesting level of possible parallel loops.
   // D is the dependence graph among statements in R.
   find the set {S1, S2, ..., Sm} of maximal strongly-connected regions in the
      dependence graph D restricted to R;
   construct Rp from R by reducing each Si to a single node, and compute Dp,
      the dependence graph naturally induced on Rp by D;
   let {p1, p2, ..., pm} be the m nodes of Rp numbered in an order consistent
      with Dp (use topological sort to do the numbering);
   for i = 1 to m do begin
      if pi is cyclic then begin
         generate a level-k DO statement;
         let Di be the dependence graph consisting of all dependence edges in D
            that are at level k+1 or greater and are internal to pi;
         codegen(pi, k+1, Di);   // we can fail here
         generate the level-k ENDDO statement;
      end
      else
         generate a vector statement for pi in r(pi)-k+1 dimensions, where
            r(pi) is the number of loops containing pi;
   end
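A minimal case of that failure (an illustrative example, not from the slides): a single-statement recurrence forms a cyclic node at the innermost level, leaving codegen no choice but a sequential loop.

      DO I = 1, N
S1       A(I+1) = A(I) + B(I)
      ENDDO

S1 has a true dependence on itself carried at level 1, so the strongly-connected region {S1} is cyclic; with no deeper loop to recurse into, no vector statement can be generated for it.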
Can we do better?
— codegen tries to find parallelism using the transformations of loop distribution and statement reordering.
— If we deal with loops containing cyclic dependences early on in the loop nest, we can potentially vectorize more loops.
— Goal in Chapter 5: explore other transformations to exploit parallelism.
Motivational Example

      DO J = 1, M
         DO I = 1, N
            T = 0.0
            DO K = 1, L
               T = T + A(I,K) * B(K,J)
            ENDDO
            C(I,J) = T
         ENDDO
      ENDDO

codegen will not uncover any vector operations here: the scalar T ties all three statements into a single cyclic dependence region at every loop level. However, by scalar expansion, we can get:

      DO J = 1, M
         DO I = 1, N
            T$(I) = 0.0
            DO K = 1, L
               T$(I) = T$(I) + A(I,K) * B(K,J)
            ENDDO
            C(I,J) = T$(I)
         ENDDO
      ENDDO
Motivational Example II
Loop distribution gives us:

      DO J = 1, M
         DO I = 1, N
            T$(I) = 0.0
         ENDDO
         DO I = 1, N
            DO K = 1, L
               T$(I) = T$(I) + A(I,K) * B(K,J)
            ENDDO
         ENDDO
         DO I = 1, N
            C(I,J) = T$(I)
         ENDDO
      ENDDO
Motivational Example III
Finally, interchanging the I and K loops, we get:

      DO J = 1, M
         T$(1:N) = 0.0
         DO K = 1, L
            T$(1:N) = T$(1:N) + A(1:N,K) * B(K,J)
         ENDDO
         C(1:N,J) = T$(1:N)
      ENDDO

A couple of new transformations used:
— Loop interchange
— Scalar expansion
Loop Interchange

      DO I = 1, N
         DO J = 1, M
S           A(I,J+1) = A(I,J) + B
         ENDDO
      ENDDO

DV: (=, <)

Applying loop interchange:

      DO J = 1, M
         DO I = 1, N
S           A(I,J+1) = A(I,J) + B
         ENDDO
      ENDDO

DV: (<, =)

leads to:

      DO J = 1, M
S        A(1:N,J+1) = A(1:N,J) + B
      ENDDO
Loop Interchange
Loop interchange is a reordering transformation. Why?
— Think of statements being parameterized with the corresponding iteration vector.
— Loop interchange merely changes the execution order of these statements.
— It does not create new instances or delete existing instances.

      DO J = 1, M
         DO I = 1, N
S
         ENDDO
      ENDDO

Writing S(j, i) for the instance of S executed when J = j and I = i: if the loops are interchanged, S(2, 1) will execute before S(1, 2), reversing their original order.
Loop Interchange: Safety
Safety: not all loop interchanges are safe.

      DO J = 1, M
         DO I = 1, N
            A(I,J+1) = A(I+1,J) + B
         ENDDO
      ENDDO

Direction vector: (<, >). If we interchange the loops, the vector becomes (>, <), and we violate the dependence: the sink would execute before the source.
Loop Interchange: Safety
A dependence is interchange-preventing with respect to a given pair of loops if interchanging those loops would reorder the endpoints of the dependence.
Loop Interchange: Safety
A dependence is interchange-sensitive if it is carried by the same loop after interchange; that is, an interchange-sensitive dependence moves with its original carrier loop to the new level.

Example: interchange-sensitive?
Example: interchange-insensitive?
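One way to answer those two prompts (illustrative examples, not from the original slides): a dependence with direction vector (<, =) is interchange-sensitive, while one with (<, <) is interchange-insensitive.

      DO I = 1, N
         DO J = 1, M
S1          A(I+1,J) = A(I,J) + B
         ENDDO
      ENDDO

S1's dependence has DV (<, =): it is carried by the I loop at level 1, and after interchange the DV becomes (=, <), still carried by the I loop, now at level 2. The dependence moved with its carrier, so it is interchange-sensitive.

      DO I = 1, N
         DO J = 1, M
S2          C(I+1,J+1) = C(I,J) + B
         ENDDO
      ENDDO

S2's dependence has DV (<, <): before and after interchange it is carried at level 1, i.e., by whichever loop happens to be outermost. It does not move with any particular loop, so it is interchange-insensitive.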
Loop Interchange: Safety
Theorem 5.1: Let D(i, j) be a direction vector for a dependence in a perfect nest of n loops. Then the direction vector for the same dependence after a permutation of the loops in the nest is determined by applying the same permutation to the elements of D(i, j).

The direction matrix for a nest of loops is a matrix in which each row is a direction vector for some dependence between statements contained in the nest, and every such direction vector is represented by a row.
Loop Interchange: Safety

      DO I = 1, N
         DO J = 1, M
            DO K = 1, L
               A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1)
            ENDDO
         ENDDO
      ENDDO

The direction matrix for the loop nest is:

      < < =
      < = >

(the first row from the reference to A(I,J,K), the second from A(I,J+1,K+1)).

Theorem 5.2: A permutation of the loops in a perfect nest is legal if and only if the direction matrix, after the same permutation is applied to its columns, has no ">" direction as the leftmost non-"=" direction in any row.

Follows from Theorem 5.1 and Theorem 2.3.
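As a worked check of Theorem 5.2 on this matrix (not on the original slide): interchanging I and J permutes the columns to give rows (<, <, =) and (=, <, >); every row leads with "<" or "=", so that interchange is legal. Moving K to the outermost position instead gives rows (=, <, <) and (>, <, =); the second row's leftmost non-"=" entry is ">", so that permutation is illegal.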
Loop Interchange: Profitability
Profitability depends on the architecture.

      DO I = 1, N
         DO J = 1, M
            DO K = 1, L
S              A(I+1,J+1,K) = A(I,J,K) + B
            ENDDO
         ENDDO
      ENDDO

For SIMD machines with a large number of functional units:

      DO I = 1, N
S        A(I+1,2:M+1,1:L) = A(I,1:M,1:L) + B
      ENDDO

Not suitable for vector register machines.
Loop Interchange: Profitability
For vector machines, we want to vectorize loops with stride-one memory access. Since Fortran stores arrays in column-major order, A(1,J,K) and A(2,J,K) are adjacent in memory, so it is useful to vectorize the I loop. Thus, transform to:

      DO J = 1, M
         DO K = 1, L
S           A(2:N+1,J+1,K) = A(1:N,J,K) + B
         ENDDO
      ENDDO
Loop Interchange: Profitability
MIMD machines with vector execution units want to cut down synchronization costs. Hence, shift the K loop to the outermost level:

      PARALLEL DO K = 1, L
         DO J = 1, M
            A(2:N+1,J+1,K) = A(1:N,J,K) + B
         ENDDO
      END PARALLEL DO
Scalar Expansion

      DO I = 1, N
S1       T = A(I)
S2       A(I) = B(I)
S3       B(I) = T
      ENDDO

Scalar expansion:

      DO I = 1, N
S1       T$(I) = A(I)
S2       A(I) = B(I)
S3       B(I) = T$(I)
      ENDDO
      T = T$(N)

leads to:

S1    T$(1:N) = A(1:N)
S2    A(1:N) = B(1:N)
S3    B(1:N) = T$(1:N)
      T = T$(N)
Scalar Expansion
However, it is not always profitable. Consider:

      DO I = 1, N
         T = T + A(I) + A(I+1)
         A(I) = T
      ENDDO

Scalar expansion gives us:

      T$(0) = T
      DO I = 1, N
S1       T$(I) = T$(I-1) + A(I) + A(I+1)
S2       A(I) = T$(I)
      ENDDO
      T = T$(N)

The true dependence of S1 on itself (T$(I) uses T$(I-1)) is a recurrence on values, which expansion cannot remove, so the loop still cannot be vectorized.
Scalar Expansion: Safety
Scalar expansion is always safe. When is it profitable?
— Naïve approach: expand all scalars, vectorize, then shrink all unnecessary expansions.
— However, we want to predict when expansion is profitable.
Dependences due to reuse of a memory location vs. reuse of values:
— Dependences due to reuse of values must be preserved.
— Dependences due to reuse of a memory location can be deleted by expansion (see the sketch below).
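A small illustration of the distinction (illustrative code, not from the slides):

      DO I = 1, N
S1       T = A(I) + B(I)
S2       C(I) = T
      ENDDO

The loop-carried antidependence S2 → S1 and the output dependence of S1 on itself exist only because every iteration reuses T's memory location; expanding T to T$(I) deletes them. The loop-independent flow dependence S1 → S2 communicates T's value and must be preserved; after expansion it remains, and the loop vectorizes.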
Scalar Expansion: Drawbacks
Expansion increases memory requirements. Solutions:
— Expand in a single loop.
— Strip mine the loop before expansion (see the sketch below).
— Forward substitution:

      DO I = 1, N
         T = A(I) + A(I+1)
         A(I) = T + B(I)
      ENDDO

becomes

      DO I = 1, N
         A(I) = A(I) + A(I+1) + B(I)
      ENDDO
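A sketch of strip mining before expansion (illustrative, assuming the swap loop from the Scalar Expansion slide and a block size of 64): the expanded temporary then needs only 64 elements rather than N.

      PARAMETER (NB = 64)
      REAL T$(NB)
      DO II = 1, N, NB
         IU = MIN(II+NB-1, N)
         T$(1:IU-II+1) = A(II:IU)
         A(II:IU)      = B(II:IU)
         B(II:IU)      = T$(1:IU-II+1)
      ENDDO
      T = B(N)

(The final assignment reproduces T's live-out value: after the swap, B(N) holds the old A(N).)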
Scalar Renaming

      DO I = 1, 100
S1       T = A(I) + B(I)
S2       C(I) = T + T
S3       T = D(I) - B(I)
S4       A(I+1) = T * T
      ENDDO

Renaming scalar T:

      DO I = 1, 100
S1       T1 = A(I) + B(I)
S2       C(I) = T1 + T1
S3       T2 = D(I) - B(I)
S4       A(I+1) = T2 * T2
      ENDDO
Scalar Renaming
will lead to:

S3    T2$(1:100) = D(1:100) - B(1:100)
S4    A(2:101) = T2$(1:100) * T2$(1:100)
S1    T1$(1:100) = A(1:100) + B(1:100)
S2    C(1:100) = T1$(1:100) + T1$(1:100)
      T = T2$(100)

(codegen places S3 and S4 first because the carried true dependence from S4's write of A(I+1) to S1's read of A(I) must be honored.)
Node Splitting
Sometimes renaming fails:

      DO I = 1, N
S1       A(I) = X(I+1) + X(I)
S2       X(I+1) = B(I) + 32
      ENDDO

The recurrence is kept intact by the renaming algorithm: S1's read of X(I+1) creates an antidependence S1 → S2, and S2's write of X(I+1) creates a carried true dependence S2 → S1, so the two statements form a cycle that renaming cannot break.
Node Splitting

      DO I = 1, N
S1       A(I) = X(I+1) + X(I)
S2       X(I+1) = B(I) + 32
      ENDDO

Break the critical antidependence: make a copy of the node from which the antidependence emanates.

      DO I = 1, N
S1'      X$(I) = X(I+1)
S1       A(I) = X$(I) + X(I)
S2       X(I+1) = B(I) + 32
      ENDDO

The recurrence is broken; this is vectorized to:

      X$(1:N) = X(2:N+1)
      X(2:N+1) = B(1:N) + 32
      A(1:N) = X$(1:N) + X(1:N)
Node Splitting
Determining a minimal set of critical antidependences is NP-complete, so doing a perfect job of node splitting is difficult. Heuristic:
— Select an antidependence.
— Delete it and test whether the dependence graph becomes acyclic.
— If it does, apply node splitting to break that antidependence.