
Creating Coarse-grained Parallelism for Loop Nests Chapter 6, Sections 6.3 through 6.9 Yaniv Carmeli

Last time … Single-loop methods: Privatization, Loop Distribution, Alignment, Loop Fusion

This time … Perfect loop nests: Loop Interchange, Loop Selection, Loop Reversal, Loop Skewing, Profitability-Based Methods

Imperfectly nested loops: Multilevel Loop Fusion, Parallel Code Generation. Packaging parallelism: Strip Mining, Pipeline Parallelism, Guided Self-Scheduling

Loop Interchange: Reminder
Theorem 5.2: A permutation of the loops in a perfect nest is legal if and only if the direction matrix, after the same permutation is applied to its columns, has no ">" direction as the leftmost non-"=" direction in any row.
Example (direction matrix with rows (<, <, =) and (<, =, >)):
  i j k:  (<, <, =), (<, =, >)   legal
  j i k:  (<, <, =), (=, <, >)   legal
  j k i:  (<, =, <), (=, >, <)   illegal: ">" is the leftmost non-"=" entry of the second row
© H. Kermany & M. Shalem

Loop Interchange
D = (<, =)

DO I = 1, N
  DO J = 1, M
    A(I+1, J) = A(I, J) + B(I, J)
  ENDDO
ENDDO
Vectorization: OK    Parallelization: Problematic

After interchange:

PARALLEL DO J = 1, M
  DO I = 1, N
    A(I+1, J) = A(I, J) + B(I, J)
  ENDDO
END PARALLEL DO
Vectorization: Bad    Parallelization: Good

Loop Interchange (Cont.)
D = (<, <)

DO I = 1, N
  DO J = 1, M
    A(I+1, J+1) = A(I, J) + B(I, J)
  ENDDO
ENDDO

Loop interchange doesn't work here, as both loops carry a dependence! The best we can do:

DO I = 1, N
  PARALLEL DO J = 1, M
    A(I+1, J+1) = A(I, J) + B(I, J)
  END PARALLEL DO
ENDDO

When can a loop be moved to the outermost position in the nest and be guaranteed to be parallel?

Loop Interchange (Cont.)
Theorem: In a perfect nest of loops, a particular loop can be parallelized at the outermost level if and only if the column of the direction matrix for that nest contains only "=" entries.
Proof.
If: a column with only "=" entries represents a loop that carries no dependence and can be interchanged to the outermost position.
Only if: suppose there is a non-"=" entry in that column.
  If it is ">", the loops cannot be interchanged (the dependence would be reversed).
  If it is "<", the loops can be interchanged, but the dependence cannot be shaken off, so it will not allow parallelization anyway.

Loop Interchange (Cont.)
Working with the direction matrix:
1. Move a loop whose column contains only "=" entries into the outermost position and parallelize it; remove its column from the matrix.
2. Move the loop whose column contains the most "<" entries into the next outermost position and sequentialize it; eliminate its column and any rows representing dependences it carries.
3. Repeat from step 1.
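A small sketch of this strategy, working directly on the direction matrix (illustrative only: the routine name, the matrix representation as lists of "<" / "=" / ">" strings, and the return format are not from the slides; legality of moving the chosen sequential loop outward is assumed rather than checked):

    def schedule_loops(direction_matrix, loop_names):
        """Decide, outermost first, whether each loop of a perfect nest
        runs parallel or sequential, following steps 1-3 above."""
        rows = [list(r) for r in direction_matrix]
        remaining = list(range(len(loop_names)))
        order = []
        while remaining:
            # Step 1: a column with only '=' entries carries no dependence.
            all_eq = [c for c in remaining if all(r[c] == '=' for r in rows)]
            if all_eq:
                c = all_eq[0]
                order.append((loop_names[c], 'parallel'))
            else:
                # Step 2: sequentialize the column with the most '<' entries
                # and drop the rows (dependences) that loop now carries.
                c = max(remaining, key=lambda col: sum(r[col] == '<' for r in rows))
                order.append((loop_names[c], 'sequential'))
                rows = [r for r in rows if r[c] != '<']
            remaining.remove(c)
        return order

    # The example on the next slide: rows (<,=,=), (=,=,<), (<,<,<).
    print(schedule_loops([['<', '=', '='],
                          ['=', '=', '<'],
                          ['<', '<', '<']], ['I', 'J', 'K']))
    # [('I', 'sequential'), ('J', 'parallel'), ('K', 'sequential')]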

Loop Interchange (Cont.)
Example (direction matrix rows: (<, =, =), (=, =, <), (<, <, <)):

DO I = 1, N
  DO J = 1, M
    DO K = 1, L
      A(I+1, J, K) = A(I, J, K) + X1
      B(I, J, K+1) = B(I, J, K) + X2
      C(I+1, J+1, K+1) = C(I, J, K) + X3
    ENDDO
  ENDDO
ENDDO

becomes

DO I = 1, N
  PARALLEL DO J = 1, M
    DO K = 1, L
      A(I+1, J, K) = A(I, J, K) + X1
      B(I, J, K+1) = B(I, J, K) + X2
      C(I+1, J+1, K+1) = C(I, J, K) + X3
    ENDDO
  END PARALLEL DO
ENDDO

Loop Selection: Optimal?
Is the approach of selecting the loop with the most "<" directions optimal?
For the direction matrix shown on this slide it results in NO parallelization, while other selections would allow parallelization.
Is it possible to derive a selection heuristic that provides optimal code?

Loop Selection
The problem of loop selection is NP-complete, so loop selection is best done by a heuristic: favor the selection of loops that must be sequentialized before parallelism can be uncovered.

Loop Selection
Goal: generate the most parallelism with adequate granularity. The key is to select the proper loops to run in parallel.
Informal parallel code generation strategy:
1. While there are loops that can be run in parallel, move them to the outermost position and parallelize them.
2. Select a sequential loop, run it sequentially, and find what new parallelism may have been revealed.

Heuristic Loop Selection
Example of the principles involved in heuristic loop selection:

DO I = 2, N
  DO J = 2, M
    DO K = 2, L
      A(I, J, K) = A(I, J-1, K) + A(I-1, J, K-1) + A(I, J+1, K+1) + A(I-1, J, K+1)
    ENDDO
  ENDDO
ENDDO

Direction matrix (one row per dependence): (=, <, =), (<, =, <), (=, <, <), (<, =, >)

The I-loop must be sequentialized because of the fourth dependence; the J-loop must be sequentialized because of the first dependence. That leaves the K-loop parallel:

DO J = 2, M
  DO I = 2, N
    PARALLEL DO K = 2, L
      A(I, J, K) = A(I, J-1, K) + A(I-1, J, K-1) + A(I, J+1, K+1) + A(I-1, J, K+1)
    END PARALLEL DO
  ENDDO
ENDDO

Loop Reversal
Using loop reversal to create coarse-grained parallelism. Consider:

DO I = 2, N+1
  DO J = 2, M+1
    DO K = 1, L
      A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
    ENDDO
  ENDDO
ENDDO

Direction matrix: (=, <, >), (<, =, >). Reversing the K loop turns the ">" entries into "<":

DO I = 2, N+1
  DO J = 2, M+1
    DO K = L, 1, -1
      A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
    ENDDO
  ENDDO
ENDDO

Now K can be interchanged to the outermost position, where it carries all the dependences, and I and J can be parallelized:

DO K = L, 1, -1
  PARALLEL DO I = 2, N+1
    PARALLEL DO J = 2, M+1
      A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
    END PARALLEL DO
  ENDDO
ENDDO

Loop Skewing: Reminder
(Iteration-space diagram for a doubly nested loop with direction matrix rows (<, =) and (=, <): iterations S(1,1) … S(4,4) laid out on an I x J grid.)
Note: there are diagonal lines of parallelism.
© H. Kermany & M. Shalem

Loop Skewing

DO I = 2, N+1
  DO J = 2, M+1
    DO K = 1, L
      A(I, J, K) = A(I, J-1, K) + A(I-1, J, K)
      B(I, J, K+1) = B(I, J, K) + A(I, J, K)
    ENDDO
  ENDDO
ENDDO

Direction matrix: (=, <, =), (<, =, =), (=, =, <), (=, =, =)

Skewing with k = K + I + J yields:

DO I = 2, N+1
  DO J = 2, M+1
    DO k = I+J+1, I+J+L
      A(I, J, k-I-J) = A(I, J-1, k-I-J) + A(I-1, J, k-I-J)
      B(I, J, k-I-J+1) = B(I, J, k-I-J) + A(I, J, k-I-J)
    ENDDO
  ENDDO
ENDDO

Direction matrix after skewing: (=, <, <), (<, =, <), (=, =, <), (=, =, =)

Interchanging k to the outermost position and parallelizing I and J:

DO k = 5, N+M+L+2
  PARALLEL DO I = MAX(2, k-M-L-1), MIN(N+1, k-3)
    PARALLEL DO J = MAX(2, k-I-L), MIN(M+1, k-I-1)
      A(I, J, k-I-J) = A(I, J-1, k-I-J) + A(I-1, J, k-I-J)
      B(I, J, k-I-J+1) = B(I, J, k-I-J) + A(I, J, k-I-J)
    END PARALLEL DO
  ENDDO
ENDDO
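Skewing the inner loop by k = K + I + J acts on each dependence distance vector (dI, dJ, dK) by replacing dK with dI + dJ + dK, which is why, in the skewed matrix above, every row carried by I or J now also has "<" in the k column. A small sketch that applies this to the example's distance vectors (illustrative only; the list representation and function names are not from the slides):

    def skew(distance, skewed, by):
        """Skew one loop index by the sum of other indices:
        d[skewed] += sum of d[i] for i in by."""
        d = list(distance)
        d[skewed] += sum(d[i] for i in by)
        return d

    def directions(distance):
        """Map a distance vector to '<' / '=' / '>' directions."""
        return ['<' if x > 0 else '=' if x == 0 else '>' for x in distance]

    # Distance vectors of the example: the two reads of A, the recurrence
    # on B, and the loop-independent use of A(I, J, K) in the B statement.
    for d in [(0, 1, 0), (1, 0, 0), (0, 0, 1), (0, 0, 0)]:
        print(directions(d), '->', directions(skew(d, 2, [0, 1])))
    # Every dependence carried by I or J is now also carried by k, so after
    # moving k outermost, the I and J loops can run in parallel.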

Loop Skewing - Main Benefits
Eliminates ">" entries in the direction matrix.
Transforms the skewed loop in such a way that, after it is interchanged outward, it carries all dependences formerly carried by the loop with respect to which it is skewed.

Loop Skewing - Drawback
The resulting parallelism is usually unbalanced: the parallel loop executes a variable number of iterations for each value of the outer loop.
As we shall see, this is not really a problem for asynchronous parallelism (unlike vectorization).

Loop Skewing (Cont.)
Updated strategy:
1. Parallelize the outermost loop if possible.
2. Sequentialize at most one outer loop to find parallelism in the next loop.
3. If 1 and 2 fail, try skewing.
4. If 3 fails, sequentialize the loop that can be moved to the outermost position and cover the most other loops.

In practice, these transformations sometimes give much worse execution times than we would have gotten by parallelizing fewer, or different, loops.

Profitability-Based Methods
Use a static performance estimation function. It need not be accurate, just good at selecting the better of two alternatives.
Key considerations: the cost of memory references, and sufficiency of granularity.

Profitability-Based Methods (Cont.)
It is impractical to choose from all possible arrangements. Consider only a subset of the possible code arrangements, based on properties of the cost function; in our case, consider only which loop is innermost.

Profitability-Based Methods (Cont.)
A possible cost evaluation heuristic:
1. Subdivide all the references in the loop body into reference groups. Two references are in the same group if there is a loop-independent dependence between them, or a constant-distance loop-carried dependence between them.

Profitability-Based Methods (Cont.)
2. Determine whether subsequent accesses to the same reference are:
   Loop invariant: cost = 1
   Unit stride: cost = number of iterations / cache line size
   Non-unit stride: cost = number of iterations

Profitability-Based Methods (Cont.)
3. Compute the loop cost of running a candidate loop l innermost: sum, over all reference groups g, the group's cost from step 2 (with l innermost), and multiply by the product of the trip counts of all the other loops:
   LoopCost(l) = [ sum over groups g of Cost(g, l) ] * [ product over loops l' != l of trip(l') ]

Profitability-Based Methods: Example

DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      C(I, J) = C(I, J) + A(I, K) * B(K, J)
    ENDDO
  ENDDO
ENDDO

Profitability-Based Methods: Example

DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      C(I, J) = C(I, J) + A(I, K) * B(K, J)
    ENDDO
  ENDDO
ENDDO

Per-group cost over the innermost loop, and total loop cost (L = cache line size):

Innermost loop | C   | A   | B   | Cost
I (best)       | N/L | N/L | 1   | 2N^3/L + N^2
J (worst)      | N   | 1   | N   | 2N^3 + N^2
K              | 1   | N   | N/L | N^3(1 + 1/L) + N^2
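A rough numeric check of this table (illustrative only: the helper names are made up, N and the cache line size are picked arbitrarily, all trip counts are assumed equal, and the layout is assumed to be Fortran column-major):

    def ref_cost(subscripts, innermost, n, line):
        """Cost of one reference group over the innermost loop, following
        the rules above: loop invariant -> 1, unit stride -> n / line,
        non-unit stride -> n. `subscripts` lists index names with the
        fastest-varying (column-major) subscript first, e.g. ('I', 'J')."""
        if innermost not in subscripts:
            return 1                 # loop invariant
        if subscripts[0] == innermost:
            return n / line          # unit stride: contiguous in memory
        return n                     # non-unit stride

    def loop_cost(groups, innermost, num_loops, n, line):
        """LoopCost(l): sum of group costs times the other trip counts."""
        return sum(ref_cost(g, innermost, n, line) for g in groups) * n ** (num_loops - 1)

    groups = [('I', 'J'), ('I', 'K'), ('K', 'J')]    # C(I,J), A(I,K), B(K,J)
    N, L = 1000, 8
    for inner in ('I', 'J', 'K'):
        print(inner, loop_cost(groups, inner, 3, N, L))
    # I is cheapest (2N^3/L + N^2) and J is most expensive (2N^3 + N^2),
    # matching the table above.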

Profitability-Based Methods: Example
Order the loops from innermost to outermost by increasing loop cost: I, K, J.
We can't always get the desired loop order, since some permutations are illegal; try to find the legal permutation closest to the desired one.

DO J = 1, N
  DO K = 1, N
    DO I = 1, N
      C(I, J) = C(I, J) + A(I, K) * B(K, J)
    ENDDO
  ENDDO
ENDDO

Profitability-Based Methods (Cont.)
Goal: given a desired loop order and a direction matrix for a loop nest, find the legal permutation closest to the desired one.
Method: until there are no more loops, choose, from all the loops that can be interchanged to the outermost remaining position, the one that is outermost in the desired permutation, and drop that loop.
It can be shown that if a legal permutation with the desired innermost loop in the innermost position exists, this algorithm will find such a permutation.
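A sketch of this greedy method (illustrative only; the function name and the representation of the direction matrix and loop orders are not from the slides). A loop may be placed next, outermost of the remaining loops, if no still-unsatisfied dependence row would get ">" as its leftmost non-"=" entry; rows whose chosen column is "<" become satisfied and are dropped:

    def closest_legal_permutation(direction_matrix, loops, desired):
        """Build the permutation outermost first, always taking the loop
        that is outermost in `desired` among the legally placeable ones."""
        col = {name: i for i, name in enumerate(loops)}
        rows = [list(r) for r in direction_matrix]   # still-unsatisfied dependences
        remaining, perm = list(loops), []
        while remaining:
            for cand in sorted(remaining, key=desired.index):
                if all(r[col[cand]] != '>' for r in rows):
                    perm.append(cand)
                    remaining.remove(cand)
                    # Rows now carried by cand ('<') are satisfied.
                    rows = [r for r in rows if r[col[cand]] == '=']
                    break
            else:
                raise ValueError('no legal permutation')
        return perm

    # Matrix multiply: the only dependence is the reduction on C, (=, =, <),
    # so the desired order J, K, I (outermost first) is itself legal.
    print(closest_legal_permutation([['=', '=', '<']], ['I', 'J', 'K'], ['J', 'K', 'I']))
    # ['J', 'K', 'I']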

Profitability-Based Methods (Cont.)

DO J = 1, N
  DO K = 1, N
    DO I = 1, N
      C(I, J) = C(I, J) + A(I, K) * B(K, J)
    ENDDO
  ENDDO
ENDDO

For performance reasons the compiler may mark the inner loop as "not meant for parallelization": running it sequentially exploits locality in memory accesses.

Multilevel Loop Fusion
Commonly used for imperfect loop nests. Applied after maximal loop distribution.

Multilevel Loop Fusion

DO I = 1, N
  DO J = 1, M
    A(I, J+1) = A(I, J) + C
    B(I+1, J) = B(I, J) + D
  ENDDO
ENDDO

After maximal distribution:

DO I = 1, N
  DO J = 1, M
    A(I, J+1) = A(I, J) + C
  ENDDO
ENDDO
DO I = 1, N
  DO J = 1, M
    B(I+1, J) = B(I, J) + D
  ENDDO
ENDDO

Parallelizing each nest:

PARALLEL DO I = 1, N
  DO J = 1, M
    A(I, J+1) = A(I, J) + C
  ENDDO
END PARALLEL DO
PARALLEL DO J = 1, M
  DO I = 1, N
    B(I+1, J) = B(I, J) + D
  ENDDO
END PARALLEL DO

After distribution each nest is better with a different outer loop, so the loops can't be fused!

Multilevel Loop Fusion (Cont.)

DO I = 1, N
  DO J = 1, M
    A(I, J) = A(I, J) + X
    B(I+1, J) = A(I, J) + B(I, J)
    C(I, J+1) = A(I, J) + C(I, J)
    D(I+1, J) = B(I+1, J) + C(I, J) + D(I, J)
  ENDDO
ENDDO

After maximal distribution:

DO I = 1, N
  DO J = 1, M
    A(I, J) = A(I, J) + X
  ENDDO
ENDDO
DO I = 1, N
  DO J = 1, M
    B(I+1, J) = A(I, J) + B(I, J)
  ENDDO
ENDDO
DO I = 1, N
  DO J = 1, M
    C(I, J+1) = A(I, J) + C(I, J)
  ENDDO
ENDDO
DO I = 1, N
  DO J = 1, M
    D(I+1, J) = B(I+1, J) + C(I, J) + D(I, J)
  ENDDO
ENDDO

The A loop can be parallelized on both i and j, the B loop only on j, the C loop only on i, and the D loop only on j. Which loop should be fused into the A loop?

Multilevel Loop Fusion (Cont.)
Fusing the A loop with the B loop (both parallel on j):

PARALLEL DO J = 1, M
  DO I = 1, N
    A(I, J) = A(I, J) + X
    B(I+1, J) = A(I, J) + B(I, J)
  ENDDO
END PARALLEL DO
PARALLEL DO I = 1, N
  DO J = 1, M
    C(I, J+1) = A(I, J) + C(I, J)
  ENDDO
END PARALLEL DO
PARALLEL DO J = 1, M
  DO I = 1, N
    D(I+1, J) = B(I+1, J) + C(I, J) + D(I, J)
  ENDDO
END PARALLEL DO

Result: three parallel nests (AB on j, C on i, D on j) and 2 barriers.

Multilevel Loop Fusion (Cont.)
Fusing the A loop with the C loop (both parallel on i):

PARALLEL DO I = 1, N
  DO J = 1, M
    A(I, J) = A(I, J) + X
    C(I, J+1) = A(I, J) + C(I, J)
  ENDDO
END PARALLEL DO
PARALLEL DO J = 1, M
  DO I = 1, N
    B(I+1, J) = A(I, J) + B(I, J)
  ENDDO
END PARALLEL DO
PARALLEL DO J = 1, M
  DO I = 1, N
    D(I+1, J) = B(I+1, J) + C(I, J) + D(I, J)
  ENDDO
END PARALLEL DO

Now we can also fuse the B and D loops (both parallel on j):

PARALLEL DO I = 1, N
  DO J = 1, M
    A(I, J) = A(I, J) + X
    C(I, J+1) = A(I, J) + C(I, J)
  ENDDO
END PARALLEL DO
PARALLEL DO J = 1, M
  DO I = 1, N
    B(I+1, J) = A(I, J) + B(I, J)
    D(I+1, J) = B(I+1, J) + C(I, J) + D(I, J)
  ENDDO
END PARALLEL DO

Result: two parallel nests (AC on i, BD on j) and only 1 barrier.

Multilevel Loop Fusion (Cont.)
Decision making needs look-ahead.
Strategy: fuse with the loop that cannot be fused with one of its successors.
Rationale: if a loop cannot be fused with its successors, a barrier will be formed after it anyway. In the example, the C loop cannot be fused with its successor D, so a barrier after C is inevitable; fusing A with C adds no barrier that was not already inevitable.

Parallel Code Generation
Code generation scheme, Parallelize(l, D):
1. Try the methods for perfect nests (loop interchange, loop skewing, loop reversal); stop if parallelism is found.
2. If the nest can be distributed: distribute, recurse on each of the distributed nests, and merge the results.
3. Else sequentialize the outer loop, eliminate the dependences it carries, and recurse on each of the loops nested in it.

Parallel Code Generation

procedure Parallelize(l, Dl);
  ParallelizeNest(l, success);   // try the methods for perfect nests
  if not success then begin
    if l can be distributed then begin
      distribute l into loop nests l1, l2, …, ln;
      for i := 1 to n do
        Parallelize(li, Di);
      Merge({l1, l2, …, ln});
    end

Parallel Code Generation (Cont.) else begin // if l cannot be distributed then for each outer loop l 0 nested in l do begin let D 0 be the set of dependences between statements in l 0 less dependences carried by l; Parallelize(l 0,D 0 ); end let S - the set of outer loops and statements loops left in l; If ||S||>1 then Merge(S); end end Parallelize

Parallel Code Generation (Cont.)

DO J = 1, M
  DO I = 1, N
    A(I+1, J+1) = A(I+1, J) + C
    X(I, J) = A(I, J) + C
  ENDDO
ENDDO

Both loops carry a dependence, so loop interchange will not find sufficient parallelism. Try distribution:

DO J = 1, M
  DO I = 1, N
    A(I+1, J+1) = A(I+1, J) + C
  ENDDO
ENDDO
DO J = 1, M
  DO I = 1, N
    X(I, J) = A(I, J) + C
  ENDDO
ENDDO

In the first nest the I loop can be parallelized; in the second nest both loops can be parallelized:

PARALLEL DO I = 1, N
  DO J = 1, M
    A(I+1, J+1) = A(I+1, J) + C
  ENDDO
END PARALLEL DO
PARALLEL DO J = 1, M
  DO I = 1, N   ! Left sequential for the memory hierarchy
    X(I, J) = A(I, J) + C
  ENDDO
END PARALLEL DO

Now try fusing: the first nest has type (I-loop, parallel) and the second has type (J-loop, parallel); different types, so they can't be fused.

Parallel Code Generation (Cont.)

DO I = 1, N
  DO J = 1, M
    A(I, J) = A(I, J) + X
    B(I+1, J) = A(I, J) + B(I, J)
    C(I, J+1) = A(I, J) + C(I, J)
    D(I+1, J) = B(I+1, J) + C(I, J) + D(I, J)
  ENDDO
ENDDO

After distribution and parallelization (the inner I loops are left sequential for the memory hierarchy):

PARALLEL DO J = 1, M
  DO I = 1, N
    A(I, J) = A(I, J) + X
  ENDDO
END PARALLEL DO
PARALLEL DO J = 1, M
  DO I = 1, N
    B(I+1, J) = A(I, J) + B(I, J)
  ENDDO
END PARALLEL DO
PARALLEL DO I = 1, N
  DO J = 1, M
    C(I, J+1) = A(I, J) + C(I, J)
  ENDDO
END PARALLEL DO
PARALLEL DO J = 1, M
  DO I = 1, N
    D(I+1, J) = B(I+1, J) + C(I, J) + D(I, J)
  ENDDO
END PARALLEL DO

The A and B nests have the same type (J loop, parallel), so they are fused, first at the outer level and then completely:

PARALLEL DO J = 1, M
  DO I = 1, N
    A(I, J) = A(I, J) + X
    B(I+1, J) = A(I, J) + B(I, J)
  ENDDO
END PARALLEL DO
PARALLEL DO I = 1, N
  DO J = 1, M
    C(I, J+1) = A(I, J) + C(I, J)
  ENDDO
END PARALLEL DO
PARALLEL DO J = 1, M
  DO I = 1, N
    D(I+1, J) = B(I+1, J) + C(I, J) + D(I, J)
  ENDDO
END PARALLEL DO

Erlebacher

DO J = 1, JMAXD
  DO I = 1, IMAXD
    F(I, J, 1) = F(I, J, 1) * B(1)
DO K = 2, N-1
  DO J = 1, JMAXD
    DO I = 1, IMAXD
      F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
DO J = 1, JMAXD
  DO I = 1, IMAXD
    TOT(I, J) = 0.0
DO J = 1, JMAXD
  DO I = 1, IMAXD
    TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
DO K = 2, N-1
  DO J = 1, JMAXD
    DO I = 1, IMAXD
      TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)

After parallelizing the J loops:

PARALLEL DO J = 1, JMAXD
  DO I = 1, IMAXD
    F(I, J, 1) = F(I, J, 1) * B(1)
DO K = 2, N-1
  PARALLEL DO J = 1, JMAXD
    DO I = 1, IMAXD
      F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
PARALLEL DO J = 1, JMAXD
  DO I = 1, IMAXD
    TOT(I, J) = 0.0
PARALLEL DO J = 1, JMAXD
  DO I = 1, IMAXD
    TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
DO K = 2, N-1
  PARALLEL DO J = 1, JMAXD
    DO I = 1, IMAXD
      TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)

Erlebacher
With the parallel J loop moved to the outermost position:

PARALLEL DO J = 1, JMAXD
  L1: DO I = 1, IMAXD
        F(I, J, 1) = F(I, J, 1) * B(1)
  L2: DO K = 2, N-1
        DO I = 1, IMAXD
          F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
  L3: DO I = 1, IMAXD
        TOT(I, J) = 0.0
  L4: DO I = 1, IMAXD
        TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
  L5: DO K = 2, N-1
        DO I = 1, IMAXD
          TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
END PARALLEL DO

(The slide also shows the graph over L1 … L5 that guides fusion.)

Erlebacher
After fusion:

PARALLEL DO J = 1, JMAXD
  DO I = 1, IMAXD
    F(I, J, 1) = F(I, J, 1) * B(1)
    TOT(I, J) = 0.0
    TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
  ENDDO
  DO K = 2, N-1
    DO I = 1, IMAXD
      F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
      TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
    ENDDO
  ENDDO
END PARALLEL DO

Packaging of Parallelism
There is a trade-off between parallelism and the granularity of synchronization: larger work units mean synchronization is needed less frequently, but at the cost of less parallelism and poorer load balance.

Strip Mining
Converts available parallelism into a form more suitable for the hardware.

DO I = 1, N
  A(I) = A(I) + B(I)
ENDDO

becomes

k = CEIL(N / P)
PARALLEL DO I = 1, N, k
  DO i = I, MIN(I + k - 1, N)
    A(i) = A(i) + B(i)
  ENDDO
END PARALLEL DO

The value of P is unknown until runtime, so strip mining is often handled by special hardware (Convex C2 and C3). Interruptions may be disastrous.

Strip Mining (Cont.)
What if the execution time varies among iterations?

PARALLEL DO I = 1, N
  DO J = 2, I
    A(J, I) = A(J-1, I) * 2.0
  ENDDO
END PARALLEL DO

Solution: a smaller unit size allows a more balanced distribution.
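To see why smaller strips balance better, here is a small sketch (illustrative only; the work model simply counts inner-loop iterations, I - 1 of them for outer iteration I) that deals out contiguous strips of the triangular loop above to P processors round-robin and reports the heaviest processor's share:

    import math

    def max_load(n, p, strip):
        """Assign strips of `strip` consecutive outer iterations to
        processors round-robin; return the largest per-processor work."""
        load = [0] * p
        for s, start in enumerate(range(1, n + 1, strip)):
            load[s % p] += sum(i - 1 for i in range(start, min(start + strip, n + 1)))
        return max(load)

    n, p = 1000, 4
    ideal = (n * (n - 1) // 2) / p          # perfectly even split
    for strip in (math.ceil(n / p), 16, 1):
        print(strip, round(max_load(n, p, strip) / ideal, 3))
    # With one big strip per processor the last processor gets roughly
    # 1.75x the ideal load; strips of 16 or 1 stay close to the ideal.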

Pipeline Parallelism
The Fortran DOACROSS construct pipelines parallel loop iterations with cross-iteration synchronization. Useful where full parallelization is not available; synchronization costs are high.

DOACROSS I = 2, N
  S1: A(I) = B(I) + C(I)
      POST(EV(I))
      IF (I > 2) WAIT(EV(I-1))
  S2: C(I) = A(I-1) + A(I)
ENDDO
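The POST/WAIT pattern uses one event per iteration: iteration I posts its event after S1 and waits on iteration I-1's event before S2, so A(I-1) is never read before it has been produced. A minimal sketch of the same pattern using Python threads and events (illustrative only; a real DOACROSS runs on a fixed set of processors with much cheaper synchronization):

    import threading

    def doacross(a, b, c):
        """Pipeline C(I) = A(I-1) + A(I) behind A(I) = B(I) + C(I)."""
        n = len(a)
        ev = [threading.Event() for _ in range(n)]

        def body(i):
            a[i] = b[i] + c[i]        # S1
            ev[i].set()               # POST(EV(I))
            if i > 1:
                ev[i - 1].wait()      # WAIT(EV(I-1))
            c[i] = a[i - 1] + a[i]    # S2

        threads = [threading.Thread(target=body, args=(i,)) for i in range(1, n)]
        for t in threads: t.start()
        for t in threads: t.join()

    a = [0.0] * 6
    b = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    c = [10.0] * 6
    doacross(a, b, c)
    print(a)
    print(c)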

Scheduling Parallel Work
There is a tension between good load balance and little synchronization.

Scheduling Parallel Work (Cont.)
Bakery-counter scheduling: iterations are handed out one at a time; moderate synchronization overhead.
Parallel execution is slower than serial execution if the scheduling overhead outweighs the gain from dividing the work (the inequality on the slide is stated in terms of N, the number of iterations; B, the time of one iteration; p, the number of processors; and σ0, the constant overhead per processor).

Guided Self-Scheduling
Incorporates some level of static scheduling to guide dynamic self-scheduling: it schedules groups of iterations, going from large to small chunks of work. The number of iterations dispensed at time t is CEIL(R/p), where R is the number of iterations still remaining and p is the number of processors.

Guided Self-Scheduling (Cont.)
GSS with 20 iterations and 4 processors: not completely balanced, and 9 synchronizations are required, versus 20 with the bakery counter.
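The 9 dispatches can be reproduced directly from the CEIL(R/p) rule; a small sketch (illustrative only):

    import math

    def gss_chunks(n, p):
        """Guided self-scheduling: each dispatch hands out CEIL(R/p) of the
        R remaining iterations, so the chunks shrink as the loop drains."""
        remaining, chunks = n, []
        while remaining > 0:
            chunk = math.ceil(remaining / p)
            chunks.append(chunk)
            remaining -= chunk
        return chunks

    print(gss_chunks(20, 4))   # [5, 4, 3, 2, 2, 1, 1, 1, 1]: 9 dispatches
    print(gss_chunks(20, 1))   # [20]: one processor gets everything at once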

Guided Self-Scheduling (Cont.)
In the example, the last 4 allocations are single iterations. Coincidence? No: the last p-1 allocations will always be single iterations.
GSS(2): no block of iterations is smaller than 2. More generally, GSS(k): no block is smaller than k.

Yaniv Carmeli, B.A. in CS. Thanks for your attention!