Optimizing Compilers for Modern Architectures: Compiling High Performance Fortran. Allen and Kennedy, Chapter 14.


Optimizing Compilers for Modern Architectures Compiling High Performance Fortran Allen and Kennedy, Chapter 14

Optimizing Compilers for Modern Architectures
Overview
- Motivation for HPF
- Overview of compiling HPF programs
- Basic Loop Compilation for HPF
- Optimizations for compiling HPF
- Results and Summary

Optimizing Compilers for Modern Architectures
Motivation for HPF
Target: a scalable distributed-memory multiprocessor [figure]
- "Message passing" is required to communicate data between processors
- Approach 1: use MPI calls in Fortran/C code

Optimizing Compilers for Modern Architectures
Motivation for HPF: MPI implementation
Consider the following sum reduction:

  PROGRAM SUM
    REAL A(10000)
    READ (9) A
    SUM = 0.0
    DO I = 1, 10000
      SUM = SUM + A(I)
    ENDDO
    PRINT SUM
  END

Hand-coded MPI/SPMD version (100 processors, 100 elements each):

  PROGRAM SUM
    REAL A(100), BUFF(100)
    IF (PID == 0) THEN
      DO IP = 0, 99
        READ (9) BUFF(1:100)
        IF (IP == 0) THEN
          A(1:100) = BUFF(1:100)
        ELSE
          SEND(IP, BUFF, 100)
        ENDIF
      ENDDO
    ELSE
      RECV(0, A, 100)
    ENDIF
    /* Actual sum reduction code here */
    IF (PID == 0) SEND(1, SUM, 1)
    IF (PID > 0) THEN
      RECV(PID-1, T, 1)
      SUM = SUM + T
      IF (PID < 99) THEN
        SEND(PID+1, SUM, 1)
      ELSE
        SEND(0, SUM, 1)
      ENDIF
    ENDIF
    IF (PID == 0) PRINT SUM
  END

Optimizing Compilers for Modern Architectures
Motivation for HPF
Disadvantages of the MPI approach:
- The user has to rewrite the program in SPMD form [Single Program Multiple Data]
- The user has to manage data movement [send & receive], data placement, and synchronization
- Too messy and not easy to master

Optimizing Compilers for Modern Architectures
Motivation for HPF
Approach 2: Use HPF
- HPF is an extended version of Fortran 90
- HPF has Fortran 90 features plus a few directives
Directives:
- Tell how data is laid out in processor memories in the parallel machine configuration, for example
    !HPF$ DISTRIBUTE A(BLOCK)
- Assist in identifying parallelism, for example
    !HPF$ INDEPENDENT
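
A minimal sketch of how the two kinds of directives are used together (not from the original slides; the array X and the loop are invented for illustration):

      REAL X(10000)
!HPF$ DISTRIBUTE X(BLOCK)      ! data layout: contiguous blocks, one block per processor
!HPF$ INDEPENDENT              ! asserts the following loop's iterations are independent
      DO I = 1, 10000
         X(I) = X(I) * 2.0
      ENDDO
      END

DISTRIBUTE describes the data layout; INDEPENDENT asserts that the iterations may run in parallel, which holds here because no iteration reads a value written by another.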

Optimizing Compilers for Modern Architectures
Motivation for HPF
The same sum reduction code:

  PROGRAM SUM
    REAL A(10000)
    READ (9) A
    SUM = 0.0
    DO I = 1, 10000
      SUM = SUM + A(I)
    ENDDO
    PRINT SUM
  END

When written in HPF:

  PROGRAM SUM
    REAL A(10000)
!HPF$ DISTRIBUTE A(BLOCK)
    READ (9) A
    SUM = 0.0
    DO I = 1, 10000
      SUM = SUM + A(I)
    ENDDO
    PRINT SUM
  END

- Minimum modification; easy to write
- Now the compiler has to do more work

Optimizing Compilers for Modern Architectures
Motivation for HPF
Advantages of HPF:
- The user needs only to write some easy directives; there is no need to rewrite the whole program in SPMD form
- The user does not need to manage data movement [send & receive] or synchronization
- Simple and easy to master

Optimizing Compilers for Modern Architectures
Overview
- Motivation for HPF
- Overview of compiling HPF programs
- Basic Loop Compilation for HPF
- Optimizations for compiling HPF
- Results and Summary

Optimizing Compilers for Modern Architectures
HPF Compilation Overview
Running example:

  REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
  DO J = 1, M
    DO I = 2, 10000
S1:   A(I) = B(I-1) + C
    ENDDO
    DO I = 1, 10000
S2:   B(I) = A(I)
    ENDDO
  ENDDO

- Dependence Analysis
  - Used for communication analysis
  - Fact used: no dependence is carried by the I loops

Optimizing Compilers for Modern Architectures
HPF Compilation Overview
Running example:

  REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
  DO J = 1, M
    DO I = 2, 10000
S1:   A(I) = B(I-1) + C
    ENDDO
    DO I = 1, 10000
S2:   B(I) = A(I)
    ENDDO
  ENDDO

- Dependence Analysis
- Distribution Analysis

Optimizing Compilers for Modern Architectures
HPF Compilation Overview
Running example:

  REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
  DO J = 1, M
    DO I = 2, 10000
S1:   A(I) = B(I-1) + C
    ENDDO
    DO I = 1, 10000
S2:   B(I) = A(I)
    ENDDO
  ENDDO

- Dependence Analysis
- Distribution Analysis
- Computation Partitioning: partition so as to distribute the work of the I loops

Optimizing Compilers for Modern Architectures
HPF Compilation Overview
Generated node code for the running example:

  REAL A(1:100), B(0:100)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
  DO J = 1, M
I1: IF (PID /= 100) SEND(PID+1, B(100), 1)
I2: IF (PID /= 0) THEN
      RECV(PID-1, B(0), 1)
      A(1) = B(0) + C
    ENDIF
    DO I = 2, 100
S1:   A(I) = B(I-1) + C
    ENDDO
    DO I = 1, 100
S2:   B(I) = A(I)
    ENDDO
  ENDDO

- Dependence Analysis
- Distribution Analysis
- Computation Partitioning
- Communication Analysis and Placement
  - Communication required for B(0) on each J iteration
  - Shadow region B(0)

Optimizing Compilers for Modern Architectures
HPF Compilation Overview
After moving the S1 loop between the send and the receive:

  REAL A(1:100), B(0:100)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
  DO J = 1, M
I1: IF (PID /= 100) SEND(PID+1, B(100), 1)
    DO I = 2, 100
S1:   A(I) = B(I-1) + C
    ENDDO
I2: IF (PID /= 0) THEN
      RECV(PID-1, B(0), 1)
      A(1) = B(0) + C
    ENDIF
    DO I = 1, 100
S2:   B(I) = A(I)
    ENDDO
  ENDDO

- Dependence Analysis
- Distribution Analysis
- Computation Partitioning
- Communication Analysis and Placement
- Optimization
  - Aggregation
  - Overlap communication and computation
  - Recognition of reductions

Optimizing Compilers for Modern Architectures
Overview
- Motivation for HPF
- Overview of compiling HPF programs
- Basic Loop Compilation for HPF
- Optimizations for compiling HPF
- Results and Summary

Optimizing Compilers for Modern Architectures
Basic Loop Compilation
Distribution propagation and analysis:
- Analyze what distribution holds for a given array at a given point in the program
- Difficult due to (see the sketch below):
  - REALIGN and REDISTRIBUTE directives
  - Distribution of formal parameters inherited from the calling procedure
- Use "Reaching Decompositions" data-flow analysis and its interprocedural version
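
A hypothetical fragment (not from the slides; the array W and the subroutines PHASE1/PHASE2 are invented) showing why the compiler must track which distribution reaches each point:

      REAL W(1000)
!HPF$ DYNAMIC W
!HPF$ DISTRIBUTE W(BLOCK)
      CALL PHASE1(W)            ! inside PHASE1, W's distribution is inherited (BLOCK)
!HPF$ REDISTRIBUTE W(CYCLIC)    ! distribution of W changes at run time
      CALL PHASE2(W)            ! the compiler must determine that CYCLIC reaches here
      END

Reaching-decompositions analysis propagates such facts through the control flow graph and, in its interprocedural version, across call boundaries.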

Optimizing Compilers for Modern Architectures
Basic Loop Compilation
- For simplicity, assume a single distribution holds for an array at all points in a subprogram
- Define the block size of a BLOCK-distributed array
- For example, suppose an array A of size N is block-distributed over p processors
  - Block size: B = CEIL(N/p), so processor q (0 <= q < p) owns A(q*B+1 : MIN((q+1)*B, N))
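
A worked instance, using the sizes assumed throughout these slides (N = 10000 elements over p = 100 processors):

  B = CEIL(N/p) = CEIL(10000/100) = 100
  processor q owns A(100q+1 : 100q+100), e.g. processor 3 owns A(301:400)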

Optimizing Compilers for Modern Architectures
Basic Loop Compilation
Iteration partitioning:
- Dividing work among processors (computation partitioning)
- Determine which iterations of a loop will be executed on which processor
- Owner-computes rule

  REAL A(10000)
!HPF$ DISTRIBUTE A(BLOCK)
  DO I = 1, 10000
    A(I) = A(I) + C
  ENDDO

Iteration I is executed on the owner of A(I). With 100 processors, the first 100 iterations run on processor 0, the next 100 on processor 1, and so on.
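
A minimal sketch (not from the slides) of what the owner-computes rule means operationally before the loop bounds are localized: every processor scans the whole iteration space but performs only the iterations whose left-hand side it owns. PID and the block size of 100 follow the conventions used elsewhere in this deck.

  REAL A(10000)
  DO I = 1, 10000
    IF ((I-1)/100 == PID) THEN    ! integer division: owner of A(I) under blocks of 100
      A(I) = A(I) + C
    ENDIF
  ENDDO

Localizing the bounds, as on the following slides, removes both the guard and the redundant scan.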

Optimizing Compilers for Modern Architectures
Iteration Partitioning
- With multiple statements in a loop, or statements in a recurrence, choose a single partitioning reference
- The processor responsible for performing the computation for iteration I is the owner of the partitioning reference in that iteration
- The set of indices executed on processor p is the set of iterations whose partitioning reference is owned by p
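
The slide's formulas are not preserved in this transcript; a hedged reconstruction for the loop used on the next slides (A(I+1) = B(I) + C, partitioning reference A(I+1), block size 100) is:

  processor for iteration I:       owner(A(I+1)) = FLOOR(I/100)
  indices executed on processor p: { I : FLOOR(I/100) = p } = [100p : 100p+99], clipped to the loop bounds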

Optimizing Compilers for Modern Architectures
Iteration Partitioning
- Have to map the global loop index to a local loop index
- The smallest value in the iteration block assigned to processor p maps to local index 1

  REAL A(10000)
!HPF$ DISTRIBUTE A(BLOCK)
  DO I = 1, N
    A(I+1) = B(I) + C
  ENDDO

Optimizing Compilers for Modern Architectures
Iteration Partitioning

  REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
  DO I = 1, N
    A(I+1) = B(I) + C
  ENDDO

Map the global iteration space I to the local iteration space i as follows: on processor p, i = I - 100p + 1, so local iterations run from 1 to 100.
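
A worked instance (consistent with the block size of 100 used on these slides, not taken verbatim from them), for processor p = 3:

  owns A(301:400); executes global iterations I = 300 .. 399
  local iteration   i = I - 300 + 1 = 1 .. 100
  subscripts        A(I+1) -> A(i),   B(I) -> B(i-1)   (B(0) is the shadow element)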

Optimizing Compilers for Modern Architectures
Iteration Partitioning
Adjust the array subscripts for the local iterations: the global references A(I+1) and B(I) become the local references A(i) and B(i-1).

Optimizing Compilers for Modern Architectures
Iteration Partitioning
For interior processors the code becomes:

  DO i = 1, 100
    A(i) = B(i-1) + C
  ENDDO

Adjust the lower bound of the first processor and the upper bound of the last processor to take care of the boundary conditions:

  lo = 1
  IF (PID==0) lo = 2
  hi = 100
  IF (PID==CEIL((N+1)/100)-1) hi = MOD(N,100) + 1
  DO i = lo, hi
    A(i) = B(i-1) + C
  ENDDO

Optimizing Compilers for Modern Architectures
Communication Generation
For our example:
- No communication is required for iterations in [100p+1 : 100p+99]
- Iterations which require receiving data are [100p : 100p]
- Iterations which require sending data are [100p+100 : 100p+100]
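
A worked check of these ranges, consistent with the block ownership described earlier:

  At I = 100p:      A(I+1) = A(100p+1)   is owned by processor p,
                    but B(I) = B(100p)   is owned by processor p-1  ->  p must receive (into local B(0))
  At I = 100p+100:  the iteration is executed by processor p+1,
                    but B(I) = B(100p+100) is owned by processor p  ->  p must send (its local B(100))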

Optimizing Compilers for Modern Architectures
Communication Generation

  REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK), B(BLOCK)
  ...
  DO I = 1, N
    A(I+1) = B(I) + C
  ENDDO

- Receive required for iterations in [100p : 100p]
- Send required for iterations in [100p+100 : 100p+100]
- No communication required for iterations in [100p+1 : 100p+99]

Optimizing Compilers for Modern Architectures
Communication Generation
After inserting the receive:

  lo = 1
  IF (PID==0) lo = 2
  hi = 100
  IF (PID==CEIL((N+1)/100)-1) hi = MOD(N,100) + 1
  DO i = lo, hi
    IF (i==1 && PID /= 0) RECV (PID-1, B(0), 1)
    A(i) = B(i-1) + C
  ENDDO

The send must happen in the 101st iteration:

  lo = 1
  IF (PID==0) lo = 2
  hi = 100
  lastP = CEIL((N+1)/100) - 1
  IF (PID==lastP) hi = MOD(N,100) + 1
  DO i = lo, hi+1
    IF (i==1 && PID /= 0) RECV (PID-1, B(0), 1)
    IF (i <= hi) THEN
      A(i) = B(i-1) + C
    ENDIF
    IF (i == hi+1 && PID /= lastP) SEND(PID+1, B(100), 1)
  ENDDO

Optimizing Compilers for Modern Architectures
Communication Generation

  lo = 1
  IF (PID==0) lo = 2
  hi = 100
  lastP = CEIL((N+1)/100) - 1
  IF (PID==lastP) hi = MOD(N,100) + 1
  DO i = lo, hi+1
    IF (i==1 && PID /= 0) RECV (PID-1, B(0), 1)
    IF (i <= hi) THEN
      A(i) = B(i-1) + C
    ENDIF
    IF (i == hi+1 && PID /= lastP) SEND(PID+1, B(100), 1)
  ENDDO

Move the SEND outside the loop:

  lo = 1
  IF (PID==0) lo = 2
  hi = 100
  lastP = CEIL((N+1)/100) - 1
  IF (PID==lastP) hi = MOD(N,100) + 1
  IF (PID <= lastP) THEN
    DO i = lo, hi
      IF (i==1 && PID /= 0) RECV (PID-1, B(0), 1)
      A(i) = B(i-1) + C
    ENDDO
    IF (PID /= lastP) SEND(PID+1, B(100), 1)
  ENDIF

Optimizing Compilers for Modern Architectures
Communication Generation

  lo = 1
  IF (PID==0) lo = 2
  hi = 100
  lastP = CEIL((N+1)/100) - 1
  IF (PID==lastP) hi = MOD(N,100) + 1
  IF (PID <= lastP) THEN
    DO i = lo, hi
      IF (i==1 && PID /= 0) RECV (PID-1, B(0), 1)
      A(i) = B(i-1) + C
    ENDDO
    IF (PID /= lastP) SEND(PID+1, B(100), 1)
  ENDIF

Move the receive outside the loop and peel the first iteration:

  lo = 1
  IF (PID==0) lo = 2
  hi = 100
  lastP = CEIL((N+1)/100) - 1
  IF (PID==lastP) hi = MOD(N,100) + 1
  IF (PID <= lastP) THEN
    IF (lo == 1 && PID /= 0) THEN
      RECV (PID-1, B(0), 1)
      A(1) = B(0) + C
    ENDIF
    ! lo = MAX(lo,1+1) == 2
    DO i = 2, hi
      A(i) = B(i-1) + C
    ENDDO
    IF (PID /= lastP) SEND(PID+1, B(100), 1)
  ENDIF

Optimizing Compilers for Modern Architectures
Communication Generation

  lo = 1
  IF (PID==0) lo = 2
  hi = 100
  lastP = CEIL((N+1)/100) - 1
  IF (PID==lastP) hi = MOD(N,100) + 1
  IF (PID <= lastP) THEN
    IF (lo == 1 && PID /= 0) THEN
      RECV (PID-1, B(0), 1)
      A(1) = B(0) + C
    ENDIF
    ! lo = MAX(lo,1+1) == 2
    DO i = 2, hi
      A(i) = B(i-1) + C
    ENDDO
    IF (PID /= lastP) SEND(PID+1, B(100), 1)
  ENDIF

Move the send ahead of the receive, so it is issued as early as possible:

  lo = 1
  IF (PID==0) lo = 2
  hi = 100
  lastP = CEIL((N+1)/100) - 1
  IF (PID==lastP) hi = MOD(N,100) + 1
  IF (PID <= lastP) THEN
    IF (PID /= lastP) SEND(PID+1, B(100), 1)
    IF (lo == 1 && PID /= 0) THEN
      RECV (PID-1, B(0), 1)
      A(1) = B(0) + C
    ENDIF
    DO i = 2, hi
      A(i) = B(i-1) + C
    ENDDO
  ENDIF

Optimizing Compilers for Modern Architectures
Communication Generation
When is such rearrangement legal?
- A receive is a copy from a global to a local location
- A send is a copy from a local to a global location

  IF (PID <= lastP) THEN
S1: IF (lo == 1 && PID /= 0) THEN
      B(0) = Bg(0)                         ! RECV
      A(1) = B(0) + C
    ENDIF
    DO i = 2, hi
      A(i) = B(i-1) + C
    ENDDO
S2: IF (PID /= lastP) Bg(100) = B(100)     ! SEND
  ENDIF

Moving S2 ahead of S1 is legal here because there is no chain of dependences from S1 to S2.

Optimizing Compilers for Modern Architectures
Communication Generation
In contrast:

  REAL A(10000), B(10000)
!HPF$ DISTRIBUTE A(BLOCK)
  ...
  DO I = 1, N
    A(I+1) = A(I) + C
  ENDDO

would be rewritten as:

  IF (PID <= lastP) THEN
S1: IF (lo == 1 && PID /= 0) THEN
      A(0) = Ag(0)                         ! RECV
      A(1) = A(0) + C
    ENDIF
    DO i = 2, hi
      A(i) = A(i-1) + C
    ENDDO
S2: IF (PID /= lastP) Ag(100) = A(100)     ! SEND
  ENDIF

Here the rearrangement would not be correct: the value A(100) sent at S2 is computed by the loop, which in turn uses the value received at S1, so there is a chain of dependences from S1 to S2.

Optimizing Compilers for Modern Architectures
Overview
- Motivation for HPF
- Overview of compiling HPF programs
- Basic Loop Compilation for HPF
- Optimizations for compiling HPF
- Results and Summary

Optimizing Compilers for Modern Architectures
Communication Vectorization

  REAL A(10000,100), B(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*), B(BLOCK,*)
  DO J = 1, M
    DO I = 1, N
      A(I+1,J) = B(I,J) + C
    ENDDO
  ENDDO

Using basic loop compilation gives:

  DO J = 1, M
    lo = 1
    IF (PID==0) lo = 2
    hi = 100
    lastP = CEIL((N+1)/100) - 1
    IF (PID==lastP) hi = MOD(N,100) + 1
    IF (PID <= lastP) THEN
      IF (PID /= lastP) SEND(PID+1, B(100,J), 1)
      IF (lo == 1) THEN
        RECV (PID-1, B(0,J), 1)
        A(1,J) = B(0,J) + C
      ENDIF
      DO i = 2, hi
        A(i,J) = B(i-1,J) + C
      ENDDO
    ENDIF
  ENDDO

Optimizing Compilers for Modern Architectures
Communication Vectorization

  DO J = 1, M
    lo = 1
    IF (PID==0) lo = 2
    hi = 100
    lastP = CEIL((N+1)/100) - 1
    IF (PID==lastP) hi = MOD(N,100) + 1
    IF (PID <= lastP) THEN
      IF (PID /= lastP) SEND(PID+1, B(100,J), 1)
      IF (lo == 1) THEN
        RECV (PID-1, B(0,J), 1)
        A(1,J) = B(0,J) + C
      ENDIF
      DO i = 2, hi
        A(i,J) = B(i-1,J) + C
      ENDDO
    ENDIF
  ENDDO

Distribute the J loop (the loop-invariant bound computations are hoisted first):

  lo = 1
  IF (PID==0) lo = 2
  hi = 100
  lastP = CEIL((N+1)/100) - 1
  IF (PID==lastP) hi = MOD(N,100) + 1
  IF (PID <= lastP) THEN
    DO J = 1, M
      IF (PID /= lastP) SEND(PID+1, B(100,J), 1)
    ENDDO
    DO J = 1, M
      IF (lo == 1) THEN
        RECV (PID-1, B(0,J), 1)
        A(1,J) = B(0,J) + C
      ENDIF
    ENDDO
    DO J = 1, M
      DO i = 2, hi
        A(i,J) = B(i-1,J) + C
      ENDDO
    ENDDO
  ENDIF

Optimizing Compilers for Modern Architectures
Communication Vectorization

  lo = 1
  IF (PID==0) lo = 2
  hi = 100
  lastP = CEIL((N+1)/100) - 1
  IF (PID==lastP) hi = MOD(N,100) + 1
  IF (PID <= lastP) THEN
    DO J = 1, M
      IF (PID /= lastP) SEND(PID+1, B(100,J), 1)
    ENDDO
    DO J = 1, M
      IF (lo == 1) THEN
        RECV (PID-1, B(0,J), 1)
        A(1,J) = B(0,J) + C
      ENDIF
    ENDDO
    DO J = 1, M
      DO i = 2, hi
        A(i,J) = B(i-1,J) + C
      ENDDO
    ENDDO
  ENDIF

Vectorize the communication, exchanging one message of length M instead of M single-element messages:

  lo = 1
  IF (PID==0) lo = 2
  hi = 100
  lastP = CEIL((N+1)/100) - 1
  IF (PID==lastP) hi = MOD(N,100) + 1
  IF (PID <= lastP) THEN
    IF (lo == 1) THEN
      RECV (PID-1, B(0,1:M), M)
      DO J = 1, M
        A(1,J) = B(0,J) + C
      ENDDO
    ENDIF
    DO J = 1, M
      DO i = 2, hi
        A(i,J) = B(i-1,J) + C
      ENDDO
    ENDDO
    IF (PID /= lastP) SEND(PID+1, B(100,1:M), M)
  ENDIF

Optimizing Compilers for Modern Architectures
Communication Vectorization
The same loop in the send/receive copy model:

  DO J = 1, M
    lo = 1
    IF (PID==0) lo = 2
    hi = 100
    lastP = CEIL((N+1)/100) - 1
    IF (PID==lastP) hi = MOD(N,100) + 1
    IF (PID <= lastP) THEN
S1:   IF (PID /= lastP) Bg(100,J) = B(100,J)   ! SEND
      IF (lo == 1) THEN
S2:     B(0,J) = Bg(0,J)                       ! RECV
S3:     A(1,J) = B(0,J) + C
      ENDIF
      DO i = 2, hi
S4:     A(i,J) = B(i-1,J) + C
      ENDDO
    ENDIF
  ENDDO

Communication statements resulting from an inner loop can be vectorized with respect to an outer loop if they are not involved in a recurrence carried by that outer loop.

Optimizing Compilers for Modern Architectures
Communication Vectorization
Consider these two loop nests:

  REAL A(10000,100), B(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*), B(BLOCK,*)
  DO J = 1, M
    DO I = 1, N
      A(I+1,J) = A(I,J) + B(I,J)
    ENDDO
  ENDDO

Can the sends be done before the receives? Can the communication be vectorized?

  REAL A(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*)
  DO J = 1, M
    DO I = 1, N
      A(I+1,J+1) = A(I,J) + C
    ENDDO
  ENDDO

Can the sends be done before the receives? Can the communication be fully vectorized?

Optimizing Compilers for Modern Architectures
Overlapping Communication and Computation

  lo = 1
  IF (PID==0) lo = 2
  hi = 100
  lastP = CEIL((N+1)/100) - 1
  IF (PID==lastP) hi = MOD(N,100) + 1
  IF (PID <= lastP) THEN
S0: IF (PID /= lastP) SEND(PID+1, B(100), 1)
S1: IF (lo == 1 && PID /= 0) THEN
      RECV (PID-1, B(0), 1)
      A(1) = B(0) + C
    ENDIF
L1: DO i = 2, hi
      A(i) = B(i-1) + C
    ENDDO
  ENDIF

Move the main loop L1 between the send and the receive, so the message is in flight while L1 runs:

  lo = 1
  IF (PID==0) lo = 2
  hi = 100
  lastP = CEIL((N+1)/100) - 1
  IF (PID==lastP) hi = MOD(N,100) + 1
  IF (PID <= lastP) THEN
S0: IF (PID /= lastP) SEND(PID+1, B(100), 1)
L1: DO i = 2, hi
      A(i) = B(i-1) + C
    ENDDO
S1: IF (lo == 1 && PID /= 0) THEN
      RECV (PID-1, B(0), 1)
      A(1) = B(0) + C
    ENDIF
  ENDIF

Optimizing Compilers for Modern Architectures
Pipelining

  REAL A(10000,100)
!HPF$ DISTRIBUTE A(BLOCK,*)
  DO J = 1, M
    DO I = 1, N
      A(I+1,J) = A(I,J) + C
    ENDDO
  ENDDO

Initial code generation for the I loop gives:

  lo = 1
  IF (PID==0) lo = 2
  hi = 100
  lastP = CEIL((N+1)/100) - 1
  IF (PID==lastP) hi = MOD(N,100) + 1
  IF (PID <= lastP) THEN
    DO J = 1, M
      IF (lo == 1) THEN
        RECV (PID-1, A(0,J), 1)
        A(1,J) = A(0,J) + C
      ENDIF
      DO i = 2, hi
        A(i,J) = A(i-1,J) + C
      ENDDO
      IF (PID /= lastP) SEND(PID+1, A(100,J), 1)
    ENDDO
  ENDIF

The communication can be vectorized, but doing so gives up the parallelism: with the receives hoisted above the J loop and the sends sunk below it, each processor could not start until its predecessor had finished all M columns.

Optimizing Compilers for Modern Architectures Pipelining Pipelined parallelism with communication

Optimizing Compilers for Modern Architectures Pipelining Pipelined parallelism with communication overhead

Optimizing Compilers for Modern Architectures
Pipelining: Blocking

  lo = 1
  IF (PID==0) lo = 2
  hi = 100
  lastP = CEIL((N+1)/100) - 1
  IF (PID==lastP) hi = MOD(N,100) + 1
  IF (PID <= lastP) THEN
    DO J = 1, M
      IF (lo == 1) THEN
        RECV (PID-1, A(0,J), 1)
        A(1,J) = A(0,J) + C
      ENDIF
      DO i = 2, hi
        A(i,J) = A(i-1,J) + C
      ENDDO
      IF (PID /= lastP) SEND(PID+1, A(100,J), 1)
    ENDDO
  ENDIF

Blocked by a factor K, so that each boundary message carries K columns:

  ...
  IF (PID <= lastP) THEN
    DO J = 1, M, K
      IF (lo == 1) THEN
        RECV (PID-1, A(0,J:J+K-1), K)
        DO j = J, J+K-1
          A(1,j) = A(0,j) + C
        ENDDO
      ENDIF
      DO j = J, J+K-1
        DO i = 2, hi
          A(i,j) = A(i-1,j) + C
        ENDDO
      ENDDO
      IF (PID /= lastP) SEND(PID+1, A(100,J:J+K-1), K)
    ENDDO
  ENDIF
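
A rough illustration of the trade-off (numbers invented for this sketch): with M = 100 columns and a blocking factor K = 10, each processor boundary exchanges 10 messages of 10 elements instead of 100 messages of 1 element, so per-message overhead drops by a factor of 10, while the pipeline delay grows because a processor's successor must now wait for a whole block of K columns, rather than a single column, before it can start.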

Optimizing Compilers for Modern Architectures
Other Optimizations
- Alignment and replication (see the sketch below)
- Identification of common recurrences
- Storage management
  - Minimize the temporary storage used for communication
  - Space taken for temporary storage should be at most equal to the space taken by the arrays themselves
- Interprocedural optimizations
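
A hypothetical HPF fragment (not from the slides; the arrays X and Y are invented) showing the alignment information this optimization reasons about:

      REAL X(1000), Y(1000)
!HPF$ DISTRIBUTE X(BLOCK)
!HPF$ ALIGN Y(I) WITH X(I)     ! co-locate Y(I) with X(I): the update below needs no communication
      DO I = 1, 1000
         X(I) = X(I) + Y(I)
      ENDDO
      END

Replication is expressed with an asterisk in the align-target subscript, which places a copy of the aligned data at every position of that target dimension; the compiler uses these facts to avoid, aggregate, or pre-place communication.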

Optimizing Compilers for Modern Architectures Results

Optimizing Compilers for Modern Architectures
Summary
- HPF is easy to code, but hard to compile
- Steps required to compile HPF programs:
  - Basic loop compilation
    - Communication generation
  - Optimizations
    - Communication vectorization
    - Overlapping communication with computation
    - Pipelining