1 Parallel Programming using the Iteration Space Visualizer. Yijun Yu and Erik H. D'Hollander, University of Ghent, Belgium


2 Introduction
Overview of the approach: interactive vs automatic
Loop dependence
Iteration Space Dependence Graph (ISDG)
Instrumentation and ISDG construction
Visualization of dependences and transformations
Applications and results
Conclusion and future work

3 Overview of the approach
[Diagram: the program is instrumented and the ISDG is constructed from the resulting trace; the Iteration Space Visualizer displays dependences and transformations interactively, alongside a parallel compiler that performs dependence analysis, dataflow analysis, program transformation and code generation automatically. The visualizer answers the "exact?" and "why?" questions the automatic analysis leaves open.]

4 Introduction (2)
Overview of the approach: interactive vs automatic
Loop dependence
Iteration Space Dependence Graph (ISDG)
Instrumentation and ISDG construction
Visualization of dependences and transformations
Applications and results
Conclusion and future work

5 Loop Dependence
Nested loops are the main focus of parallel programming
Data dependences arise when multiple accesses reach the same memory location and at least one of them is a WRITE
Data dependences are classified as flow (first WRITE, then READ), anti (first READ, then WRITE) or output (WRITE after WRITE)
Loop dependence is the ordering between data-dependent loop iterations
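The three dependence classes can be captured in a tiny decision function. This is an illustrative Python sketch, not part of the ISV tool; the function name and the 'R'/'W' encoding are hypothetical.

```python
def classify(first, second):
    """Dependence type between two accesses to the same location,
    given in execution order; each access is 'R' (read) or 'W' (write)."""
    if first == 'W' and second == 'R':
        return 'flow'    # first WRITE, then READ
    if first == 'R' and second == 'W':
        return 'anti'    # first READ, then WRITE
    if first == 'W' and second == 'W':
        return 'output'  # WRITE after WRITE
    return None          # READ after READ: no dependence

# e.g. a WRITE followed by a READ of the same location is a flow dependence
assert classify('W', 'R') == 'flow'
```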

6 The Iteration Space Dependence Graph (ISDG)
The object to be visualized: ISDG = iteration space + loop dependence
An iteration I = (i_1, ..., i_m) is a point in the m-D iteration space, which is mapped to 3D space
Dependent iterations I and J are linked by an arrow I -> J

7 An example of ISDG

      do i=1,n
        do j=1,n
          do k=1,2
            if(k.eq.1) then
              a(i,j,k)=(a(i-1,j,k)+a(i+1,j,k))/2
            else
              a(i,j,k)=(a(i,j-1,k)+a(i,j+1,k))/2
            endif
          enddo
        enddo
      enddo

[Figure: the 4 x 4 x 2 iteration space plotted on the i, j, k axes, with dependence arrows between the iteration points (1,1,1) through (4,4,2).]

8 Instrumentation and ISDG construction
Program instrumentation records:
  Loop iteration: id + indices
  Array reference: id + name + Read|Write + subscripts
ISDG construction:
  1. Create the iteration points from the indices
  2. Set up a reference list for every accessed location
  3. Mark flow-, anti- and output-dependence arrows
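The construction steps above can be sketched as a single pass over the instrumentation trace. A minimal sketch, assuming the trace is a list of (iteration, 'R'/'W', location) records in execution order; all names are illustrative, and for brevity it marks an arrow from every earlier conflicting reference rather than only the most recent one.

```python
def build_isdg(trace):
    refs = {}     # step 2: reference list per accessed memory location
    arrows = []   # step 3: dependence arrows (type, source, target)
    for it, mode, loc in trace:
        for prev_it, prev_mode in refs.get(loc, []):
            if prev_it == it:
                continue  # same iteration point: no loop dependence
            if prev_mode == 'W' and mode == 'R':
                arrows.append(('flow', prev_it, it))
            elif prev_mode == 'R' and mode == 'W':
                arrows.append(('anti', prev_it, it))
            elif prev_mode == 'W' and mode == 'W':
                arrows.append(('output', prev_it, it))
        refs.setdefault(loc, []).append((it, mode))
    return arrows

# a(1) written in iteration (1,), read in iteration (2,): one flow arrow
trace = [((1,), 'W', 'a1'), ((2,), 'R', 'a1')]
assert build_isdg(trace) == [('flow', (1,), (2,))]
```

Step 1 (creating the iteration points) is implicit here: every iteration id appearing in the trace becomes a point of the graph.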

9 Introduction (3)
Overview of the approach: interactive vs automatic
Loop dependence
Iteration Space Dependence Graph (ISDG)
Instrumentation and ISDG construction
Visualization of dependences and transformations
Applications and results
Conclusion and future work

10 Dependence Visualization
Loop visualization:
  3D view-port of the iteration space
  Graphical operations
Detecting and enhancing parallelism:
  Automatic parallelization
  Maximal parallelism detection
  Parallelization by plane execution

11 Loop Visualization
Visualization of the ISDG: points + arrows + colors + labels + axes
3D view-port of the iteration space:
  =3D: direct display
  >3D: projection (condensed points and arrows)
  <3D: expansion (dummy index dimension)
ISDG operations:
  Graphical operations: rotate, move and animate
  Query dialogs: selection, variable zooming, dependence type filtering, etc.

12 Automatic Parallelization
Sequential execution: traverse the iteration space in lexicographical order and count the iterations T_seq
Parallel execution: traverse the iterations of a marked loop in parallel and count the steps T_par
Report the speedup S_par = T_seq / T_par
Automatic parallelization: test whether the dependence ordering is kept for all combinations of parallelized loops:
  DOALL i1,i2,i3? DOALL i1,i2? DOALL i1,i3? DOALL i2,i3? DOALL i1? DOALL i2? DOALL i3?
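The speedup bookkeeping above reduces to counting iterations versus steps. A minimal sketch, assuming the marked loops are fully parallel DOALLs so that all their iterations complete in one step; function and parameter names are hypothetical.

```python
from math import prod

def speedup(trip_counts, doall_dims):
    """trip_counts: iterations per loop level; doall_dims: set of
    level indices marked DOALL. Returns (T_seq, T_par, S_par)."""
    t_seq = prod(trip_counts)                      # lexicographic traversal
    t_par = prod(n for d, n in enumerate(trip_counts)
                 if d not in doall_dims)           # DOALL levels take one step
    return t_seq, t_par, t_seq / t_par

# the 4 x 4 x 2 example loop with the innermost k loop marked DOALL
assert speedup([4, 4, 2], doall_dims={2}) == (32, 16, 2.0)
```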

13 Maximal Parallelism Detection
Data-flow order:
  An iteration is executed as soon as its data are ready, i.e. after all the iterations it depends on have been carried out
  Iterations with the same delay are executed at the same time, i.e. in parallel
  Dependent iterations are executed sequentially
Count the steps T_df
Minimal execution time = maximal parallelism
Maximal speedup S_max = T_seq / T_df
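The dataflow order is an as-soon-as-possible schedule: each iteration's delay is one more than the largest delay among its predecessors, and T_df is the largest delay, i.e. the longest dependence chain. A sketch under the assumption that dependence arrows always point lexicographically forward, so one pass in lexicographic order suffices; names are illustrative.

```python
def dataflow_time(iterations, arrows):
    """arrows: (source, target) dependence pairs. Returns T_df."""
    preds = {}
    for src, dst in arrows:
        preds.setdefault(dst, []).append(src)
    delay = {}
    for it in sorted(iterations):  # lexicographic order
        delay[it] = 1 + max((delay[p] for p in preds.get(it, [])), default=0)
    return max(delay.values())

# chain (1,) -> (2,) -> (3,) plus an independent point (4,):
# three dataflow steps, so T_df = 3 and S_max = 4/3
its = [(1,), (2,), (3,), (4,)]
assert dataflow_time(its, [((1,), (2,)), ((2,), (3,))]) == 3
```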

14 Plane Parallelization
Define a cutting plane Ax+By+Cz=D by clicking at three points or by giving the parameters A,B,C,D
Plane execution: traverse the planes d_0 <= Ax+By+Cz < d_0+T_d along the normal vector (A,B,C)
Plane parallelization: matching the dataflow execution may enhance the speedup S_plane = T_seq / T_d
Verified by cross-plane dependence checking or by 3D->2D projection checking
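Plane execution bins the iterations by their plane value A*x + B*y + C*z and sweeps the bins along the normal (A,B,C); the sweep is legal when every dependence arrow crosses to a strictly larger plane value, so no plane contains an internal dependence. A hedged sketch of this cross-plane check; names are illustrative.

```python
def plane_schedule(iterations, arrows, normal):
    """Returns (legal, T_d): whether all arrows cross planes forward,
    and the number of planes (i.e. parallel steps) T_d."""
    A, B, C = normal
    key = lambda p: A * p[0] + B * p[1] + C * p[2]
    legal = all(key(src) < key(dst) for src, dst in arrows)
    t_d = len({key(p) for p in iterations})
    return legal, t_d

# a 2 x 2 x 2 space with one flow dependence along i: the plane family
# i = const is legal and yields T_d = 2 parallel steps
its = [(i, j, k) for i in (1, 2) for j in (1, 2) for k in (1, 2)]
assert plane_schedule(its, [((1, 1, 1), (2, 1, 1))], (1, 0, 0)) == (True, 2)
```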

15 Dependence Visualization: procedural summary
[Flowchart: Start -> maximal parallelism detection (S_df) -> automatic parallelization (S_para) -> if S_para = S_df, End; otherwise try plane parallelization (S_plane); if S_plane > S_para, End; otherwise prune false dependences, apply a program transformation, and repeat.]

16 Program Transformations
When S_df > S_para, loop transformations may enhance the parallelism of the target loop
Unimodular loop transformations. Why? They are 3D -> 3D, 1-to-1, etc.
Loop projections and expansions:
  Loop projection: >3D -> 3D
  Loop expansion: <3D -> 3D

17 Unimodular Transformations
[Diagram: the unknown 3x3 transformation matrix is solved for from the normal vector (A,B,C), yielding a legal unimodular matrix.]
Look for a suitable transformation:
  Interactive way
  Automatic way: possible when the array index expressions are linear and all the distance vectors lie in a plane; extract the largest base vectors of the dependence distances and construct the transformation (pseudo-distance matrix approach)
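The legality condition for a candidate matrix can be sketched directly: an integer matrix T with determinant +1 or -1 is a legal unimodular transformation when every transformed dependence distance vector T*d remains lexicographically positive. A minimal sketch (the determinant check is omitted for brevity; names are illustrative).

```python
def lex_positive(v):
    """First nonzero component is positive (lexicographically positive)."""
    for x in v:
        if x != 0:
            return x > 0
    return False  # the zero vector is not lexicographically positive

def matvec(T, d):
    return [sum(T[r][c] * d[c] for c in range(len(d))) for r in range(len(T))]

def is_legal(T, distances):
    return all(lex_positive(matvec(T, d)) for d in distances)

# loop interchange of a doubly nested loop is the unimodular matrix
# [[0,1],[1,0]]; it is illegal for distance (1,-1) but legal for (1,1)
interchange = [[0, 1], [1, 0]]
assert not is_legal(interchange, [(1, -1)])
assert is_legal(interchange, [(1, 1)])
```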

18 Loop Expansion
Non-perfectly vs perfectly nested loops
Statement-level vs iteration-level parallelism
Statement reordering by affine remapping
Loop expansion: use an additional dimension to index the statements in the loop body
Unimodular loop transformations are then still applicable at the statement level

19 Introduction
Overview of the approach: interactive vs automatic
Loop dependence
Iteration Space Dependence Graph (ISDG)
Instrumentation and ISDG construction
Visualization of dependences and transformations
Applications and results
Conclusion and future work

20 Applications and Results
Gauss-Jordan: linear system solver
Lim's example: statement-level parallelism
Cholesky kernel: loop projection
CFD application: unimodular transformation

21 Gauss-Jordan elimination

C The original loop
      do i=1,n
        do j=1,n
          if(i.ne.j) then
            f=a(j,i)/a(i,i)
C$doisv
            do k=i+1,n+1
              a(j,k)=a(j,k)-f*a(i,k)
            enddo
          endif
        enddo
      enddo

C The instrumented loop
      id=0
      do i = 1,n
        do j = 1,n
          if (i.ne.j) then
            write(11,*) id+1," r ","a",2,j,i
            write(11,*) id+1," r ","a",2,i,i
            write(11,*) id+1," w ","f"," 1 0 "
            f=a(j,i)/a(i,i)
            do k = i+1,n+1
              id=id+1
              write(11,*) id,i,j,k
              write(11,*) id," r ","a",2,j,k
              write(11,*) id," r ","f"," 1 0 "
              write(11,*) id," r ","a",2,i,k
              write(11,*) id," w ","a",2,j,k
              a(j,k)=a(j,k)-f*a(i,k)
            enddo
          endif
        enddo
      enddo

22 Gauss-Jordan elimination

23 Lim's Example

C The original program
      do l1=1,n
        do l2=1,n
          a(l1,l2)=a(l1,l2)+b(l1-1,l2)
          b(l1,l2)=a(l1,l2-1)*b(l1,l2)
        enddo
      enddo

C After loop expansion
      do l1=1,n
        do l2=1,n
c$doisv
          do l3=0,1
            if(l3.eq.0) a(l1,l2)=a(l1,l2)+b(l1-1,l2)
            if(l3.eq.1) b(l1,l2)=a(l1,l2-1)*b(l1,l2)
          enddo
        enddo
      enddo

24 Lim's example: unimodular transformation
Original space (axes l1, l2, l3):
  Plane: l1-l2+l3=0; DOALL l3 valid
  Seq. time: 32; dataflow: 7, speedup: 4.57; loop time: 16, speedup: 2.00
Transformed space (axes i1, i2, i3):
  Plane: i1=0; DOALL i1 valid
  Seq. time: 32; dataflow: 7, speedup: 4.57; loop time: 7, speedup: 4.57

25 Lim's example: code generation

C The unimodular transformed code (bounds by Fourier-Motzkin inversion)
      doall i1 = 1-n, n
        do i2 = max(i1,1), min(n,i1+n)
          do i3 = max(-i1+i2,1), min(-i1+i2+1,n)
            l1 = i2
            l2 = i3
            l3 = i1 - i2 + i3
            if (l3.eq.0) a(l1,l2)=a(l1,l2)+b(l1-1,l2)
            if (l3.eq.1) b(l1,l2)=a(l1,l2-1)*b(l1,l2)
          enddo
        enddo
      enddoall

26 Lim's example: code generation with the Omega calculator

symbolic n;
IS1:={[i,j,k]: 1<=i,j<=n && k=0};
IS2:={[i,j,k]: 1<=i,j<=n && k=1};
T1:={[i,j,k]->[i-j+k,i,j]};
T2:={[i,j,k]->[i-j+k,i,j]};
codegen 0 T1:IS1,T2:IS2;

i.e. I' = I - J + K, J' = I, K' = J

27 Lim's example: code generation

C The optimized code from the Omega calculator
      doall p = 1-n, n
        if (p.ge.1) b(p,1) = a(p,0) * b(p,1)
        do l1 = max(p+1,1), min(p+n-1,n)
          a(l1,l1-p)   = a(l1,l1-p) + b(l1-1,l1-p)
          b(l1,l1-p+1) = a(l1,l1-p) * b(l1,l1-p+1)
        enddo
        if (p.le.0) a(p+n,n) = a(p+n,n) + b(p+n-1,n)
      enddoall

28 Cholesky kernel (I,K,J,L): loop fusion

C The fused kernel
      DO 1 I = 0,NRHS
      DO 1 K = 0,2*N+1
      IF (K.LE.N) THEN
        I0 = MIN(M,N-K)
      ELSE
        I0 = MIN(M,2*N-K+1)
      ENDIF
      DO 1 J = 0,I0
C$DOISV
      DO 1 L = 0,NMAT
      IF (K.LE.N) THEN
        IF (J.EQ.0) THEN
8         B(I,L,K)=B(I,L,K)*A(L,0,K)
        ELSE
7         B(I,L,K+J)=B(I,L,K+J)-A(L,-J,K+J)*B(I,L,K)
        ENDIF
      ELSE
        IF (J.EQ.0) THEN
9         B(I,L,K)=B(I,L,K)*A(L,0,K)
        ELSE
6         B(I,L,K-J)=B(I,L,K-J)-A(L,-J,K)*B(I,L,K)
        ENDIF
      ENDIF
1     CONTINUE

C THE ORIGINAL KERNEL
      DO 6 I = 0, NRHS
      DO 7 K = 0, N
      DO 8 L = 0, NMAT
8     B(I,L,K) = B(I,L,K) * A(L,0,K)
      DO 7 J = 1, MIN(M, N-K)
      DO 7 L = 0, NMAT
7     B(I,L,K+J) = B(I,L,K+J) - A(L,-J,K+J) * B(I,L,K)
      DO 6 K = N, 0, -1
      DO 9 L = 0, NMAT
9     B(I,L,K) = B(I,L,K) * A(L,0,K)
      DO 6 J = 1, MIN(M, K)
      DO 6 L = 0, NMAT
6     B(I,L,K-J) = B(I,L,K-J) - A(L,-J,K) * B(I,L,K)

29

30 Cholesky kernel (I,K,J,L), parallelized: the L loop is marked DOALL (loop order (L,I,K,J))

      DO 1 I = 0,NRHS
      DO 1 K = 0,2*N+1
      IF (K.LE.N) THEN
        I0 = MIN(M,N-K)
      ELSE
        I0 = MIN(M,2*N-K+1)
      ENDIF
      DO 1 J = 0,I0
C$DOISV
      DOALL 1 L = 0,NMAT
      IF (K.LE.N) THEN
        IF (J.EQ.0) THEN
8         B(I,L,K)=B(I,L,K)*A(L,0,K)
        ELSE
7         B(I,L,K+J)=B(I,L,K+J)-A(L,-J,K+J)*B(I,L,K)
        ENDIF
      ELSE
        IF (J.EQ.0) THEN
9         B(I,L,K)=B(I,L,K)*A(L,0,K)
        ELSE
6         B(I,L,K-J)=B(I,L,K-J)-A(L,-J,K)*B(I,L,K)
        ENDIF
      ENDIF
1     CONTINUE

31 CFD application
Computational Fluid Dynamics (CFD): Navier-Stokes equations, Successive Over-Relaxation (SOR) kernel
3D loop: difficult to analyze
172 array references per iteration, 33 if-branches per iteration
A unimodular transformation is found!

32 CFD Application
Transformed space (axes I1', I2', I3'; ranges I1' = 6..24, I2' = 1..4, I3' = 1..4):
  Plane: I1'=9, e.g. points (9,1,1), (9,2,1), (9,1,2); DOALL I2',I3'
  Dataflow: 19, speedup: 3.37; loop time: 19, speedup: 3.37
Original space (axes i1, i2, i3; ranges i1, i2, i3 = 1..4):
  Plane: 3*i1+2*i2+i3=9, e.g. points (2,1,1), (1,2,2), (1,1,4)
  Seq. time: 64; dataflow: 19, speedup: 3.37; loop time: 64, speedup: 1.00

33 Conclusion and Future Work
Allows exact visualization of the loops of real programs
Assists in detecting parallel loops
Estimates the maximal speedup using dataflow execution
Assists in finding suitable loop transformations
Future work: seamless integration into PPT (a parallel programming environment)

34 Thanks for your attention! Any questions?