
Creating Coarse-Grained Parallelism
Chapter 6 of Allen and Kennedy
Dan Guez

Introduction
Previous lectures: fine-grained parallelism
- Superscalar and vector architectures
- Parallelizing inner loops
This lecture: coarse-grained parallelism
- Symmetric multiprocessor (SMP) architecture
- Parallelizing outer loops

SMP Architecture
- Multiple asynchronous processors
- Shared memory
- Synchronization is required for communication between processors
- The barrier is the synchronization mechanism, and it is expensive!

Roadmap
Single-loop methods:
- Privatization
- Alignment
- Loop fusion

Coarse-Grained vs. Fine-Grained

  Fine-Grained         Coarse-Grained
  Scalar expansion     Privatization
  Loop distribution    Alignment, Loop fusion

Roadmap
- Privatization
- Alignment
- Loop fusion

Scalar Expansion - Reminder
Scalar expansion eliminates loop-carried dependences and enables vectorization:

  DO I = 1, N
    T = A(I)
    A(I) = B(I)
    B(I) = T
  ENDDO

After scalar expansion:

  DO I = 1, N
    T(I) = A(I)
    A(I) = B(I)
    B(I) = T(I)
  ENDDO

After vectorization:

  T(1:N) = A(1:N)
  A(1:N) = B(1:N)
  B(1:N) = T(1:N)

Privatization
Privatization eliminates the same loop-carried dependences without introducing an array:

  DO I = 1, N
    T = A(I)
    A(I) = B(I)
    B(I) = T
  ENDDO

After privatization:

  PARALLEL DO I = 1, N
    PRIVATE t
    t = A(I)
    A(I) = B(I)
    B(I) = t
  END PARALLEL DO

Privatization - Definition
A scalar variable x defined within a loop is said to be privatizable with respect to that loop if and only if every path from the beginning of the loop body to a use of x within the loop body must pass through a definition of x before reaching that use.

Privatization - Formal Solution
For each block x in the loop body, define the upward-exposed variables equation:

  up(x) = use(x) ∪ ( (∪_{y ∈ succ(x)} up(y)) − def(x) )

where
- use(x) - the set of all variables used within block x that have no prior definition within the block
- def(x) - the set of variables defined in block x
Solve these equations to a fixed point over the loop body's control-flow graph.

Privatization - Formal Solution (cont.)
The set of privatizable variables is:

  private = ( ∪_{x ∈ B} def(x) ) − up(b0)

where
- B - the collection of loop-body blocks
- b0 - the entry block of the loop body
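
A minimal Python sketch of this solution, assuming each block is given as a pair of (use, def) name sets and succ lists each block's successors inside the loop body; the function name and input shapes are illustrative, not the book's code:

from typing import Dict, List, Set, Tuple

def privatizable(blocks: Dict[int, Tuple[Set[str], Set[str]]],
                 succ: Dict[int, List[int]], b0: int) -> Set[str]:
    # Iterate up(x) = use(x) U ((U_{y in succ(x)} up(y)) - def(x)) to a fixed point.
    up = {x: set(u) for x, (u, _) in blocks.items()}
    changed = True
    while changed:
        changed = False
        for x, (use, defs) in blocks.items():
            below = set().union(*(up[y] for y in succ[x])) if succ[x] else set()
            new = set(use) | (below - set(defs))
            if new != up[x]:
                up[x], changed = new, True
    defined = set().union(*(set(d) for _, d in blocks.values()))
    # Privatizable: defined somewhere in the body but not upward-exposed at entry.
    return defined - up[b0]

# The privatization example, treating the loop body as a single block:
# T = A(I); A(I) = B(I); B(I) = T  ->  use = {A, B}, def = {T, A, B}
# (only the scalar T matters here; A and B are arrays)
print(privatizable({0: ({"A", "B"}, {"T", "A", "B"})}, {0: []}, 0))  # {'T'}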

Privatization - Theorem
A variable x defined in a loop may be made private if and only if the SSA graph for the variable does not have a Φ-node at the entry to the loop.

Roadmap
- Privatization
- Alignment
- Loop fusion

Loop Distribution
Distributing the inner loop using codegen:

  DO I = 1, 100
    DO J = 1, 100
      A(I,J) = B(I,J) + C(I,J)
      D(I,J) = A(I,J-1) * 2
    ENDDO
  ENDDO

After codegen:

  DO I = 1, 100
    DO J = 1, 100
      A(I,J) = B(I,J) + C(I,J)
    ENDDO
    DO J = 1, 100
      D(I,J) = A(I,J-1) * 2
    ENDDO
  ENDDO

Loop Distribution - Vector Architectures
On a vector architecture the distributed inner loops vectorize directly:

  DO I = 1, 100
    DO J = 1, 100
      A(I,J) = B(I,J) + C(I,J)
    ENDDO
    DO J = 1, 100
      D(I,J) = A(I,J-1) * 2
    ENDDO
  ENDDO

becomes

  DO I = 1, 100
    A(I,1:100) = B(I,1:100) + C(I,1:100)
    D(I,1:100) = A(I,0:99) * 2
  ENDDO

Loop Distribution - SMP Architectures
On an SMP, running the distributed J loops in parallel requires a barrier between them in every iteration of I, which is expensive:

  DO I = 1, 100
    DO J = 1, 100
      A(I,J) = B(I,J) + C(I,J)
    ENDDO
    Barrier()
    DO J = 1, 100
      D(I,J) = A(I,J-1) * 2
    ENDDO
  ENDDO

The Solution - Alignment

  DO I = 2, N
  S1  A(I) = B(I) + C(I)
  S2  D(I) = A(I-1) * 2
  ENDDO

[Figure: iteration-space diagram over I = 2..N; each instance of S2 uses the value of A produced by S1 in the previous iteration]

Basic Alignment
Shift S2 by one iteration so that both statements reference A(I) in the same iteration:

  DO I = 2, N
  S1  A(I) = B(I) + C(I)
  S2  D(I) = A(I-1) * 2
  ENDDO

becomes

  DO I = 1, N
  S1  IF (I > 1) A(I) = B(I) + C(I)
  S2  IF (I < N) D(I+1) = A(I) * 2
  ENDDO

Optimized Alignment - Option 1
Replace the guards with index manipulation:

  DO I = 1, N
  S1  IF (I > 1) A(I) = B(I) + C(I)
  S2  IF (I < N) D(I+1) = A(I) * 2
  ENDDO

becomes

  DO I = 1, N-1
    J = I
    IF (I = 1) J = N
    A(J) = B(J) + C(J)
    D(I+1) = A(I) * 2
  ENDDO

Optimized Alignment - Option 2
Peel the boundary iterations instead:

  DO I = 1, N
  S1  IF (I > 1) A(I) = B(I) + C(I)
  S2  IF (I < N) D(I+1) = A(I) * 2
  ENDDO

becomes

  D(2) = A(1) * 2
  DO I = 2, N-1
    A(I) = B(I) + C(I)
    D(I+1) = A(I) * 2
  ENDDO
  A(N) = B(N) + C(N)

Alignment Problems
- Recurrence: impossible to align
- Different dependence distances: alignment fails (alignment conflict)

  DO I = 1, N
    A(I+1) = B(I) + C
    X(I) = A(I+1) + A(I)
  ENDDO

Attempting to align for the distance-1 reference:

  DO I = 0, N
    IF (I > 0) A(I+1) = B(I) + C
    IF (I < N) X(I+1) = A(I+2) + A(I+1)
  ENDDO

The A(I+1) reference is now satisfied in the same iteration, but A(I+2) is computed only in the next iteration, so a carried dependence remains.

Code Replication
Replicating the computation of A(I) resolves the alignment conflict:

  DO I = 1, N
    A(I+1) = B(I) + C
    X(I) = A(I+1) + A(I)
  ENDDO

becomes

  DO I = 1, N
    A(I+1) = B(I) + C
    IF (I = 1) THEN
      t = A(I)
    ELSE
      t = B(I-1) + C
    ENDIF
    X(I) = A(I+1) + t
  ENDDO

Alignment Graph
G = (V, E) - a directed acyclic graph:
- V - the set of loop-body statements, labeled with o(v), the vertex offset
- E - the set of dependences, labeled with d(e), the dependence distance

Alignment Graph - Example

  DO I = 1, N
  S1  A(I+2) = B(I) + C
  S2  X(I+1) = A(I) + D
  S3  Y(I) = A(I+1) + X(I)
  ENDDO

[Figure: alignment graph with vertices S1, S2, S3, all at offset o = 0; edges S1→S2 with d = 2, S1→S3 with d = 1, and S2→S3 with d = 1]

Alignment Goal
The graph G = (V, E) is said to be carry-free if for each edge e = (u, v):

  o(u) + d(e) = o(v)

The alignment procedure takes an alignment graph and generates a carry-free alignment graph.
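
As a quick check of the definition, here is a tiny illustrative Python predicate (names assumed, not from the book) applied to the example graph from the previous slide. Note that no single offset assignment makes that graph carry-free; that is exactly the conflict which replication resolves:

# Check the carry-free condition o(u) + d(e) = o(v) on every edge.
def carry_free(offset, edges):
    return all(offset[u] + d == offset[v] for (u, v, d) in edges)

# The example graph: S1->S2 (d=2), S1->S3 (d=1), S2->S3 (d=1).
edges = [("S1", "S2", 2), ("S1", "S3", 1), ("S2", "S3", 1)]
print(carry_free({"S1": 0, "S2": 0, "S3": 0}, edges))   # False: unaligned
print(carry_free({"S1": 0, "S2": 2, "S3": 1}, edges))   # False: S2->S3 still carried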

Align Procedure
  while V is not empty:
    add an arbitrary vertex v from V to the worklist W
    while W is not empty:
      remove a vertex u from W
      align all vertices adjacent to u, replicating a node if it requires two different alignments
      add the newly aligned nodes to W
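
A Python sketch of this worklist procedure, under simplifying assumptions: the graph arrives as (u, v, d) edge triples, the anchor vertex of each component is chosen deterministically, and conflicts are only reported rather than resolved (the full procedure resolves them by replication, as the next slides show):

from collections import deque

def align(vertices, edges):
    # Assign offsets o(v) so edges become carry-free (o(u) + d = o(v));
    # report edges that end up conflicting and therefore need replication.
    offset, conflicts = {}, set()
    remaining = set(vertices)
    while remaining:
        root = min(remaining)          # arbitrary vertex of a new component
        remaining.remove(root)
        offset[root] = 0
        work = deque([root])
        while work:
            u = work.popleft()
            for (a, b, d) in edges:
                if u not in (a, b):
                    continue
                if a == u and b in remaining:        # place the sink
                    offset[b] = offset[u] + d
                    remaining.remove(b); work.append(b)
                elif b == u and a in remaining:      # place the source
                    offset[a] = offset[u] - d
                    remaining.remove(a); work.append(a)
                elif offset[a] + d != offset[b]:     # both placed: alignment conflict
                    conflicts.add((a, b, d))
    return offset, conflicts

# The example graph from the previous slides:
edges = [("S1", "S2", 2), ("S1", "S3", 1), ("S2", "S3", 1)]
print(align({"S1", "S2", "S3"}, edges))
# ({'S1': 0, 'S2': 2, 'S3': 1}, {('S2', 'S3', 1)}) - one edge stays
# conflicting, so a vertex must be replicated (S1 in the book's example).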

Align Procedure - Example
[Figure: the example graph after alignment; S3 at o = 0, S2 at o = -1, with S1 replicated into two copies (one at o = -3) so that both the d = 2 and d = 1 edges become carry-free]

GenAlign Procedure
Set variables:
- hi - the maximal vertex offset
- lo - the minimal vertex offset
- Ivar - the original iteration variable
- Lvar - the original loop lower bound
- Uvar - the original loop upper bound
Generate the loop statement "DO Ivar = Lvar-hi, Uvar-lo" (statement v runs only while Lvar <= Ivar+o(v) <= Uvar; in the example below, the loop becomes DO I = 1, N+3).

GenAlign Procedure (cont.)
Scan the vertices in topological-sort order. Let v be the current vertex:
- if o(v) = lo, generate "IF (Ivar >= Lvar-o(v)) THEN" followed by the statement of v with Ivar+o(v) substituted for Ivar
- else if o(v) = hi, generate "IF (Ivar <= Uvar-o(v)) THEN" followed by the substituted statement
- else generate "IF (Ivar >= Lvar-o(v) AND Ivar <= Uvar-o(v)) THEN" followed by the substituted statement

GenAlign Procedure (cont.)
If v is a replicated vertex, replace its statement S with:

  THEN t_v = RHS(S) with Ivar+o(v) substituted for Ivar
  ELSE t_v = LHS(S) with Ivar+o(v) substituted for Ivar
  ENDIF

where t_v is a new unique scalar. Then replace the reference at the sink of every dependence from v with t_v.
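
A Python sketch of the guard-generation part of GenAlign, with simplifying assumptions: statements are plain text templates over the iteration variable, substitution is naive string replacement, and replicated vertices are not handled:

def gen_align(vertices, lvar, uvar, ivar="I"):
    # vertices: list of (label, offset, statement_text), already topologically sorted
    hi = max(o for _, o, _ in vertices)   # maximal vertex offset
    lo = min(o for _, o, _ in vertices)   # minimal vertex offset
    lines = [f"DO {ivar} = {lvar}-({hi}), {uvar}-({lo})"]
    for label, o, stmt in vertices:
        body = stmt.replace(ivar, f"{ivar}+({o})")  # Ivar+o(v) for Ivar
        guards = []
        if o != hi:                       # statement starts later than the loop
            guards.append(f"{ivar} >= {lvar}-({o})")
        if o != lo:                       # statement ends earlier than the loop
            guards.append(f"{ivar} <= {uvar}-({o})")
        guard = f"IF ({' AND '.join(guards)}) " if guards else ""
        lines.append(f"  {label} {guard}{body}")
    lines.append("ENDDO")
    return "\n".join(lines)

# The example from the next slide, without the replicated vertex S1':
print(gen_align([("S1", -3, "A(I+2) = B(I) + C"),
                 ("S2", -1, "X(I+1) = A(I) + D"),
                 ("S3",  0, "Y(I) = A(I+1) + X(I)")], "1", "N"))

Run on that input, it emits the same guards as the slide, e.g. IF (I >= 1-(-3)) for S1, that is, I >= 4.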

GenAlign - Example

  DO I = 1, N
  S1  A(I+2) = B(I) + C
  S2  X(I+1) = A(I) + D
  S3  Y(I) = A(I+1) + X(I)
  ENDDO

becomes

  DO I = 1, N+3
  S1   IF (I >= 4) A(I-1) = B(I-3) + C
  S1'  IF (I >= 2 AND I <= N+1) THEN
         t = B(I-1) + C
       ELSE
         t = A(I+1)
       ENDIF
  S2   IF (I >= 2 AND I <= N+1) X(I) = A(I-1) + D
  S3   IF (I <= N) Y(I) = t + X(I)
  ENDDO

Roadmap
- Privatization
- Alignment
- Loop fusion

Loop Fusion - Motivation
Distribution isolates the serial recurrence:

  DO I = 1, N
    A(I) = B(I) + 1
    C(I) = A(I) + C(I-1)
    D(I) = A(I) + X
  ENDDO

After distribution:

  DO I = 1, N        ! parallelizable
    A(I) = B(I) + 1
  ENDDO
  DO I = 1, N        ! serial (recurrence on C)
    C(I) = A(I) + C(I-1)
  ENDDO
  DO I = 1, N        ! parallelizable
    D(I) = A(I) + X
  ENDDO

Loop Fusion - Motivation (cont.)
Fusing the two parallel loops reduces loop overhead and barriers:

  PARALLEL DO I = 1, N
    A(I) = B(I) + 1
  ENDDO
  DO I = 1, N
    C(I) = A(I) + C(I-1)
  ENDDO
  PARALLEL DO I = 1, N
    D(I) = A(I) + X
  ENDDO

After fusion:

  PARALLEL DO I = 1, N
    A(I) = B(I) + 1
    D(I) = A(I) + X
  ENDDO
  DO I = 1, N
    C(I) = A(I) + C(I-1)
  ENDDO

Loop Fusion - Graphical View
[Figure: dependence graph in which loops L1 and L3 fuse into L_{1,3} while L2 remains separate]

Loop Fusion - Safety Constraints
Fusion-preventing dependence constraint: fusion is illegal when the fused loop would have a backward loop-carried dependence:

  PARALLEL DO I = 1, N
    A(I) = B(I) + 1
  ENDDO
  PARALLEL DO I = 1, N
    D(I) = A(I+1) + X
  ENDDO

Illegal fusion:

  PARALLEL DO I = 1, N
    A(I) = B(I) + 1
    D(I) = A(I+1) + X
  ENDDO

Loop Fusion - Safety Constraints (cont.)
Ordering constraint: two loops cannot be fused when a path of loop-independent dependences between them passes through a loop that cannot be fused with them.
[Figure: fusing L1 and L3 into L_{1,3} is illegal because L2 lies on a dependence path between them]

Loop Fusion - Profitability Constraints
- Separation constraint: do not fuse parallel loops with sequential loops.
- Parallelism-inhibiting constraint: do not fuse when the fused loop would have a forward carried dependence:

  PARALLEL DO I = 1, N
    A(I) = B(I) + 1
  ENDDO
  PARALLEL DO I = 1, N
    D(I) = A(I-1) + X
  ENDDO

Unprofitable fusion (the fused loop is no longer parallel):

  PARALLEL DO I = 1, N
    A(I) = B(I) + 1
    D(I) = A(I-1) + X
  ENDDO

Typed Fusion - Definition
A typed-fusion problem is P = (G, T, m, B, t0):
- G = (V, E) - a directed acyclic graph (the dependence graph)
- T - the set of types (parallel, sequential)
- m: V → T - the mapping of types to vertices
- B - the set of bad edges (constraints)
- t0 - the objective type (parallel)

Main Data Structures
- num[n] - the node number of n in the fused graph
- maxBadPrev[n] - the maximal number of a fused node of type t0 that cannot be fused with n; n can only join a node numbered above maxBadPrev[n]

TypedFusion - Example
[Figure: example dependence graph of parallel and sequential loops; its vertices are fused in the following slides]

update_successors(n)
  t = type(n)
  for each edge (n, m):
    if t != t0:
      maxBadPrev[m] = MAX(maxBadPrev[m], maxBadPrev[n])
    else if type(m) != t0 or (n, m) in B:
      maxBadPrev[m] = MAX(maxBadPrev[m], num[n])
    else:
      maxBadPrev[m] = MAX(maxBadPrev[m], maxBadPrev[n])

TypedFusion Procedure
  scan the vertices in topological order
  for each vertex v:
    if v is of type t0:
      fuse v with the first available fused node (a node of type t0 numbered above maxBadPrev[v]), creating a new node if none exists
      update_successors(v)
    else:
      update_successors(v)
      create_new_node(v)
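
A runnable Python sketch of the procedure (illustrative, not the book's code). It assumes the vertices arrive in topological order and that a t0 vertex joins the lowest-numbered existing t0 node whose number exceeds maxBadPrev[v]; a t0 vertex is placed before update_successors runs, since the update reads num[n]:

def typed_fusion(vertices, edges, type_of, bad, t0):
    succ = {v: [m for (n, m) in edges if n == v] for v in vertices}
    max_bad_prev = {v: 0 for v in vertices}
    num = {}          # fused-node number assigned to each vertex
    node_type = {}    # fused-node number -> type
    members = {}      # fused-node number -> vertices fused into it
    next_num = 0

    def update_successors(n):
        for m in succ[n]:
            if type_of[n] != t0:
                max_bad_prev[m] = max(max_bad_prev[m], max_bad_prev[n])
            elif type_of[m] != t0 or (n, m) in bad:
                max_bad_prev[m] = max(max_bad_prev[m], num[n])
            else:
                max_bad_prev[m] = max(max_bad_prev[m], max_bad_prev[n])

    for v in vertices:
        if type_of[v] == t0:
            # fuse v with the first available node, else start a new one
            target = min((k for k, t in node_type.items()
                          if t == t0 and k > max_bad_prev[v]), default=None)
            if target is None:
                next_num += 1
                target = next_num
                node_type[target], members[target] = t0, []
            num[v] = target
            members[target].append(v)
            update_successors(v)
        else:
            update_successors(v)
            next_num += 1          # create_new_node(v)
            num[v] = next_num
            node_type[next_num], members[next_num] = type_of[v], [v]
    return members

# The motivation example: loops 1 and 3 parallel, loop 2 sequential,
# with dependences 1->2 and 1->3; loops 1 and 3 fuse, as in the slides.
print(typed_fusion([1, 2, 3], [(1, 2), (1, 3)],
                   {1: "par", 2: "seq", 3: "par"}, set(), "par"))
# {1: [1, 3], 2: [2]}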

TypedFusion - Example (cont.)
[Figure: the example graph annotated with (maxBadPrev, num) pairs for each vertex, including (0,1), (0,2), (1,3), (1,4), (1,5), and (4,6)]

TypedFusion - Example (cont.)
[Figure: the resulting fused graph for the parallel type, with fused nodes numbered (1) through (6); fused parallel nodes include {1, 3}]

Fusing Sequential Loops
[Figure: applying typed fusion again with the sequential type yields fused nodes {1, 3}, {2, 4, 6}, {5, 8}, and {7}]

Ordered Fusion
- More than two types of loops (e.g., different loop headers)
- Choosing the "right" order of types to fuse is NP-hard
- Solution: assign a priority to each type and fuse in priority order

Type Conflict
[Figure: example of a type conflict, where fusing loops of one type first blocks a fusion of another type]

Ordered Fusion
[Figure: ordered fusion applied to the type-conflict example]

Cohort Fusion
Allows running sequential loops alongside parallel loops.
Settings:
- A single type
- Bad edges:
  - Fusion-preventing edges
  - Parallelism-inhibiting edges
  - Edges mixing parallel and sequential loops

Cohort Fusion - Pros and Cons
Pros:
- Minimal number of barriers
Cons:
- Poor load balancing