1 Parallel Programming using the Iteration Space Visualizer
Yijun Yu and Erik H. D'Hollander
University of Ghent, Belgium
http://www.elis.rug.ac.be/paris/ppt

2 Introduction
- Overview of the approach: interactive vs automatic
- Loop dependence
- Iteration Space Dependence Graph (ISDG)
- Instrumentation and ISDG construction
- Visualization of …
  - Dependence
  - Transformations
- Applications and Results
- Conclusion and Future work

3 Overview of the approach
[Diagram: two routes from Program to Code Generation.
 Automatic (Parallel Compiler): Dependence Analysis, Dataflow Analysis, Program Transformation. Annotations: "exact?", "why?"
 Interactive (Iteration Space Visualizer): Instrument the program, Construct the ISDG, Visualize dependence, Visualize transformation.]

5 Loop Dependence
- Nested loops are the focus of parallel programming.
- A data dependence arises when the same memory location is accessed more than once and at least one of the accesses is a WRITE.
- Data dependences are classified as flow (first WRITE, then READ), anti (first READ, then WRITE) or output (WRITE after WRITE).
- A loop dependence is the ordering between data-dependent loop iterations.
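
The classification above can be written as a small lookup. This is a minimal sketch; the function name and the "R"/"W" encoding are illustrative choices, not part of the tool.

```python
def classify_dependence(first, second):
    """Classify the dependence between two accesses to the same location.

    `first` and `second` are "R" or "W", given in execution order.
    Returns "flow", "anti", "output", or None for two reads (no dependence).
    """
    kinds = {
        ("W", "R"): "flow",    # first WRITE, then READ
        ("R", "W"): "anti",    # first READ, then WRITE
        ("W", "W"): "output",  # WRITE after WRITE
    }
    return kinds.get((first, second))
```

Two reads of the same location return None because neither access is a WRITE.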

6 The Iteration Space Dependence Graph (ISDG)
- The object to be visualized is the ISDG = Iteration Space + Loop Dependence.
- An iteration I = (i_1, ..., i_m) is a point in the m-dimensional iteration space, which is mapped to the 3D space.
- Dependent iterations I and J are linked by an arrow I → J.

7 An example of ISDG
      do i=1,n
        do j=1,n
          do k=1,2
            if (k.eq.1) then
              a(i,j,k)=(a(i-1,j,k)+a(i+1,j,k))/2
            else
              a(i,j,k)=(a(i,j-1,k)+a(i,j+1,k))/2
            endif
          enddo
        enddo
      enddo
[Figure: ISDG of this loop with axes i, j, k, showing the iteration points (1,1,1) to (4,4,1) and (1,1,2) to (4,4,2).]

8 Instrumentation and the ISDG construction
Program instrumentation
- Loop iteration: id + indices
- Array reference: id + name + Read | Write + subscripts
ISDG construction
1. Create the iteration points from the indices
2. Set up a reference list for every accessed location
3. Mark flow-, anti- and output-dependence arrows
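
The three construction steps can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the trace is assumed to be an in-memory list of (iteration, array, mode, subscripts) records in execution order, and `build_isdg` and `example_trace` are hypothetical names. The example trace replays the loop of slide 7.

```python
from collections import defaultdict

def build_isdg(trace):
    """Return the set of (source, sink, kind) arrows for an access trace.

    `trace` lists (iteration, array, mode, subscripts) in execution order,
    with mode "R" or "W". Accesses within one iteration add no arrow.
    """
    last_write = {}                  # location -> iteration of latest write
    reads_since = defaultdict(list)  # location -> reads since that write
    arrows = set()
    for it, array, mode, subs in trace:
        loc = (array, tuple(subs))
        writer = last_write.get(loc)
        if mode == "R":
            if writer is not None and writer != it:
                arrows.add((writer, it, "flow"))
            reads_since[loc].append(it)
        else:
            if writer is not None and writer != it:
                arrows.add((writer, it, "output"))
            for reader in reads_since[loc]:
                if reader != it:
                    arrows.add((reader, it, "anti"))
            last_write[loc] = it
            reads_since[loc] = []
    return arrows

def example_trace(n):
    """Access trace of the three-deep loop shown on slide 7."""
    trace = []
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            for k in (1, 2):
                it = (i, j, k)
                if k == 1:
                    reads = [(i - 1, j, k), (i + 1, j, k)]
                else:
                    reads = [(i, j - 1, k), (i, j + 1, k)]
                for subs in reads:
                    trace.append((it, "a", "R", subs))
                trace.append((it, "a", "W", (i, j, k)))
    return trace
```

For n = 4 each pair of neighbouring iterations along i (k = 1) or along j (k = 2) is linked by both a flow and an anti arrow, and no output arrows appear because every element is written once.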

10 Dependence Visualization
- Loop visualization
  - 3D view-port of the iteration space
  - Graphical operations
- Detecting and enhancing parallelism
  - Automatic parallelization
  - Maximal parallelism detection
  - Parallelization by plane execution

11 Loop Visualization
- Visualization of the ISDG: Points + Arrows + Colors + Labels + Axes
- 3D view-port of the iteration space (=3D, >3D and <3D)
  - projection (condensed points and arrows)
  - expansion (dummy index dimension)
- ISDG operations
  - Graphical operations: rotate, move and animate
  - Query dialogs: selection, variable zooming, dependence type filtering, etc.

12 Automatic Parallelization
- Sequential execution: traverse the iteration space in lexicographical order and count the iterations, T_seq
- Parallel execution: traverse the iterations of a marked loop in parallel and count the steps, T_par
- Report the speedup S_para = T_seq / T_par
- Automatic parallelization: test whether the dependence ordering is kept for all combinations of loop parallelizations:
  DOALL i1,i2,i3? + DOALL i1,i2? + DOALL i1,i3? + DOALL i2,i3? + DOALL i1? + DOALL i2? + DOALL i3?
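
One way to carry out this test is sketched below. It assumes the usual criterion that a loop may run as DOALL when it carries no dependence, i.e. when no arrow has that loop as the first level at which source and sink differ. The slides do not say how the tool performs the check, so the criterion and all function names here are assumptions.

```python
def carried_level(src, dst):
    """First loop level (0-based) at which two iterations differ."""
    for level, (a, b) in enumerate(zip(src, dst)):
        if a != b:
            return level
    return None  # same iteration: not a loop-carried dependence

def doall_levels(arrows, depth):
    """Loop levels that carry no dependence and may therefore run as DOALL."""
    carried = {carried_level(src, dst) for src, dst in arrows}
    return [level for level in range(depth) if level not in carried]

def loop_speedup(points, parallel_levels):
    """S_para = T_seq / T_par, one step per setting of the sequential loops."""
    sequential = {
        tuple(x for level, x in enumerate(p) if level not in parallel_levels)
        for p in points
    }
    return len(points) / len(sequential)
```

With a 4x4x2 iteration space and only the innermost loop marked, the count gives 32/16 = 2.00, which agrees with the loop speedup reported for Lim's example on slide 24.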

13 Maximal Parallelism Detection
- Data-flow order
  - An iteration is executed as soon as its data are ready, i.e. after all the iterations it depends on have been carried out
  - Iterations with the same delay are executed at the same time, i.e. in parallel
  - Dependent iterations are executed sequentially
- Count the steps, T_df
- Minimal execution time = maximal parallelism
- Maximal speedup S_max = T_seq / T_df
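
The data-flow count can be sketched as a longest-path computation over the ISDG. The sketch assumes that every arrow points from a lexicographically earlier iteration to a later one, which holds when the ISDG is recorded from a sequential run; `dataflow_steps` is an illustrative name.

```python
def dataflow_steps(points, arrows):
    """Return (T_df, step), where step[p] is the earliest step of iteration p.

    `arrows` holds (source, sink) pairs. Visiting the points in sorted
    (lexicographic) order guarantees that every source is scheduled
    before its sinks, given the assumption stated above.
    """
    preds = {p: [] for p in points}
    for src, dst in arrows:
        preds[dst].append(src)
    step = {}
    for p in sorted(points):
        step[p] = 1 + max((step[s] for s in preds[p]), default=0)
    return max(step.values()), step
```

For a 4x4 iteration space in which each point depends on its neighbours (i-1, j) and (i, j-1), this yields T_df = 7 and a maximal speedup of 16/7.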

14 Plane Parallelization
- Define a cutting plane Ax+By+Cz=D
  - by clicking on three points, or
  - by giving the parameters A, B, C, D
- Plane execution: traverse the planes d_0 ≤ Ax+By+Cz < d_0 + T_d along the normal vector (A,B,C)
- Plane parallelization
  - Matching the dataflow execution may enhance the speedup S_plane = T_seq / T_d
  - Verified by cross-plane dependence checking or by 3D->2D projection checking
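
A minimal sketch of plane execution with cross-plane dependence checking follows. It assumes integer plane coefficients and takes T_d as the number of integer plane offsets between the first and the last plane, as the inequality above suggests; `plane_schedule` is a hypothetical name.

```python
def plane_schedule(points, arrows, normal):
    """Return (valid, T_d) for executing the planes normal . x = d in order.

    The schedule is valid when every arrow crosses from a plane with a
    smaller offset d to one with a strictly larger offset.
    """
    def offset(p):
        return sum(a * x for a, x in zip(normal, p))

    valid = all(offset(src) < offset(dst) for src, dst in arrows)
    offsets = [offset(p) for p in points]
    t_d = max(offsets) - min(offsets) + 1
    return valid, t_d
```

For the 4x4 stencil with dependences on (i-1, j) and (i, j-1), the plane i + j = D is valid and needs 7 steps, whereas the plane i = D is rejected because the second dependence stays inside one plane.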

15 Dependence Visualization: procedural summary
[Flowchart: Start; Maximal parallelism detection (S_df); Automatic parallelization (S_para); decision "S_para = S_df?" (Yes: End); Prune false dependences; Plane parallelization (S_plane); decision "S_plane > S_para?" (Yes / No); Program transformation.]

16 Program Transformations
When S_df > S_para, loop transformations may enhance the parallelism of the target loop.
- Unimodular loop transformations. Why? 3D → 3D, 1-to-1, etc.
- Loop projections and expansions
  - Loop projection: >3D → 3D
  - Loop expansion: <3D → 3D

17 Unimodular Transformations
[Figure: a 3x3 transformation matrix filled in step by step: all entries unknown (?), then the first row set to the normal vector (A, B, C), then the remaining entries determined (!). Labels: Unimodular, Legality.]
Look for a suitable transformation
- Interactive way
- Automatic way
  - Possible when the array index expressions are linear and all the distance vectors lie in a plane
  - Extract the largest base vectors of the dependence distances and construct the transformation (pseudo distance matrix approach)
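
The two conditions named on this slide can be checked mechanically. The sketch below assumes the standard definitions: a matrix is unimodular when it is an integer matrix with determinant +1 or -1, and a transformation is legal when every transformed distance vector stays lexicographically positive. The function names are illustrative. The example matrix is the one given by I' = I - J + K, J' = I, K' = J on slide 26; the two distance vectors were derived by hand from the expanded loop on slide 23 and are an assumption of this sketch.

```python
def det3(m):
    """Determinant of a 3x3 integer matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def is_unimodular(m):
    return det3(m) in (1, -1)

def transform(m, v):
    """Matrix-vector product m . v as a tuple."""
    return tuple(sum(a * x for a, x in zip(row, v)) for row in m)

def is_legal(m, distances):
    """True when every transformed distance is lexicographically positive."""
    zero = (0, 0, 0)
    return all(transform(m, d) > zero for d in distances)

# Matrix from slide 26; distance vectors are a hand-derived assumption.
T_LIM = [[1, -1, 1], [1, 0, 0], [0, 1, 0]]
LIM_DISTANCES = [(1, 0, -1), (0, 1, 1)]
```

Both distances map to vectors whose first component is 0, namely (0, 1, 0) and (0, 0, 1), so the new outermost loop carries no dependence.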

18 Loop Expansion
- Non-perfectly vs perfectly nested loops
- Statement-level vs iteration-level parallelism
- Statement reordering, affine remapping
- Loop expansion
  - Use an additional dimension to index the statements in the loop body
  - Unimodular loop transformations are still applicable at the statement level

20 Applications and Results
- Gauss-Jordan: linear system solver
- Lim’s example: statement-level parallelism
- Cholesky kernel: loop projection
- CFD application: unimodular transformation

21 Gauss-Jordan elimination
The original program:
      do i=1,n
        do j=1,n
          if (i.ne.j) then
            f=a(j,i)/a(i,i)
C$doisv
            do k=i+1,n+1
              a(j,k)=a(j,k)-f*a(i,k)
            enddo
          endif
        enddo
      enddo
The instrumented program:
      id=0
      do i = 1,n
        do j = 1,n
          if (i.ne.j) then
            write(11,*) id+1," r ","a",2,j,i
            write(11,*) id+1," r ","a",2,i,i
            write(11,*) id+1," w ","f"," 1 0 "
            f=a(j,i)/a(i,i)
            do k = i+1,n
              id=id+1
              write(11,*) id,i,j,k
              write(11,*) id," r ","a",2,j,k
              write(11,*) id," r ","f"," 1 0 "
              write(11,*) id," r ","a",2,i,k
              write(11,*) id," w ","a",2,j,k
              a(j,k)=a(j,k)-f*a(i,k)
            enddo
          endif
        enddo
      enddo
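
The records written by the instrumented program can be read back as sketched below. The parser assumes that each record arrives as whitespace-separated tokens, that an iteration record has the form `id i j k`, and that a reference record has the form `id r|w name count subscripts`. The exact spacing produced by list-directed output depends on the compiler, and `parse_trace` is a hypothetical helper, not part of the tool.

```python
def parse_trace(lines):
    """Split trace records into iteration points and array references.

    Returns (iterations, references), where iterations maps an id to its
    loop indices and references lists (id, mode, name, subscripts).
    """
    iterations = {}
    references = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        ident = int(tokens[0])
        if tokens[1] in ("r", "w"):
            count = int(tokens[3])  # number of subscripts that follow
            subs = tuple(int(t) for t in tokens[4:4 + count])
            references.append((ident, tokens[1].upper(), tokens[2], subs))
        else:
            iterations[ident] = tuple(int(t) for t in tokens[1:])
    return iterations, references
```

The scalar `f` is recorded with one subscript of value 0, so it is handled like a one-element array.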

22 Gauss-Jordan elimination

23 Lim’s Example
The original program:
      do l1=1,n
        do l2=1,n
          a(l1,l2)=a(l1,l2)+b(l1-1,l2)
          b(l1,l2)=a(l1,l2-1)*b(l1,l2)
        enddo
      enddo
After loop expansion:
      do l1=1,n
        do l2=1,n
c$doisv
          do l3=0,1
            if (l3.eq.0) a(l1,l2)=a(l1,l2)+b(l1-1,l2)
            if (l3.eq.1) b(l1,l2)=a(l1,l2-1)*b(l1,l2)
          enddo
        enddo
      enddo

24 Lim’s example: unimodular transformation
Before the transformation (axes l1, l2, l3):
- Plane: l1 - l2 + l3 = 0
- DOALL l3 valid
- Seq. time: 32
- Dataflow: 7, Speedup: 4.57
- Loop time: 16, Speedup: 2.00
Transformation matrix:
  [ 1 -1  1 ]
  [ 1  0  0 ]
  [ 0  1  0 ]
After the transformation (axes i1, i2, i3):
- Plane: i1 = 0
- DOALL i1 valid
- Seq. time: 32
- Dataflow: 7, Speedup: 4.57
- Loop time: 7, Speedup: 4.57

25 Lim’s example: code generation
C The unimodular transformed code
      doall i1 = 1-n, n
        do i2 = max(i1,1), min(n,i1+n)
          do i3 = max(-i1+i2,1), min(-i1+i2+1,n)
            l1 = i2
            l2 = i3
            l3 = i1 - i2 + i3
            if (l3.eq.0) a(l1,l2)=a(l1,l2)+b(l1-1,l2)
            if (l3.eq.1) b(l1,l2)=a(l1,l2-1)*b(l1,l2)
          enddo
        enddo
      enddoall
Loop bounds (Fourier-Motzkin elimination) from the transformation matrix:
  [ 1 -1  1 ]
  [ 1  0  0 ]
  [ 0  1  0 ]
Original indices (inversion) from the inverse matrix:
  [ 0  1  0 ]
  [ 0  0  1 ]
  [ 1 -1  1 ]
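
The loop bounds of the transformed nest can be checked by enumeration. The sketch below copies the bounds and the index recovery from the code on this slide and compares the visited (l1, l2, l3) triples with those of the expanded loop of slide 23, in which l3 runs over 0 and 1. The function names are illustrative.

```python
from collections import Counter

def original_iterations(n):
    """Iterations of the expanded loop: l1, l2 in 1..n and l3 in {0, 1}."""
    return {
        (l1, l2, l3)
        for l1 in range(1, n + 1)
        for l2 in range(1, n + 1)
        for l3 in (0, 1)
    }

def transformed_iterations(n):
    """(i1, (l1, l2, l3)) pairs visited by the transformed loop nest."""
    visited = []
    for i1 in range(1 - n, n + 1):
        for i2 in range(max(i1, 1), min(n, i1 + n) + 1):
            for i3 in range(max(-i1 + i2, 1), min(-i1 + i2 + 1, n) + 1):
                visited.append((i1, (i2, i3, i1 - i2 + i3)))
    return visited

def largest_plane(n):
    """Largest number of iterations that share one value of i1."""
    counts = Counter(i1 for i1, _ in transformed_iterations(n))
    return max(counts.values())
```

For n = 4 the nest visits each of the 32 original iterations exactly once, and the largest plane holds 7 iterations, which agrees with the loop time of 7 reported on slide 24.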

26 Lim’s example: code generation
Transformation: I' = I - J + K, J' = I, K' = J, i.e. the matrix
  [ 1 -1  1 ]
  [ 1  0  0 ]
  [ 0  1  0 ]
Input to the code generator:
symbolic n;
IS1:={[i,j,k]:1<=i,j<=n && k=0};
IS2:={[i,j,k]:1<=i,j<=n && k=1};
T1:={[i,j,k]->[i-j+k,i,j]};
T2:={[i,j,k]->[i-j+k,i,j]};
codegen 0 T1:IS1,T2:IS2;

27 Lim’s example: code generation
Transformation matrix:
  [ 1 -1  1 ]
  [ 1  0  0 ]
  [ 0  1  0 ]
C the optimized code by Omega calculator
      doall p = 1-n, n
        if (p.ge.1) b(p,1) = a(p,0) * b(p,1)
        do l1 = max(p+1,1), min(p+n-1,n)
          a(l1,l1-p) = a(l1,l1-p)+b(l1-1,l1-p)
          b(l1,l1-p+1) = a(l1,l1-p)*b(l1,l1-p+1)
        enddo
        if (p.le.0) a(p+n,n) = a(p+n,n)+b(p+n-1,n)
      enddoall

28 Cholesky kernel
After loop fusion, loop order (I,K,J,L):
      DO 1 I = 0,NRHS
      DO 1 K = 0,2*N+1
        IF (K.LE.N) THEN
          I0 = MIN(M,N-K)
        ELSE
          I0 = MIN(M,2*N-K+1)
        ENDIF
      DO 1 J = 0,I0
C$DOISV
      DO 1 L = 0,NMAT
        IF (K.LE.N) THEN
          IF (J.EQ.0) THEN
    8       B(I,L,K)=B(I,L,K)*A(L,0,K)
          ELSE
    7       B(I,L,K+J)=B(I,L,K+J)-A(L,-J,K+J)*B(I,L,K)
          ENDIF
        ELSE
          IF (J.EQ.0) THEN
    9       B(I,L,K)=B(I,L,K)*A(L,0,K)
          ELSE
    6       B(I,L,K-J)=B(I,L,K-J)-A(L,-J,K)*B(I,L,K)
          ENDIF
        ENDIF
    1 CONTINUE
C THE ORIGINAL KERNEL
      DO 6 I = 0, NRHS
      DO 7 K = 0, N
      DO 8 L = 0, NMAT
    8 B(I,L,K) = B(I,L,K) * A(L,0,K)
      DO 7 J = 1, MIN (M, N-K)
      DO 7 L = 0, NMAT
    7 B(I,L,K+J) = B(I,L,K+J) - A(L,-J,K+J) * B(I,L,K)
      DO 6 K = N, 0, -1
      DO 9 L = 0, NMAT
    9 B(I,L,K) = B(I,L,K) * A(L,0,K)
      DO 6 J = 1, MIN (M, K)
      DO 6 L = 0, NMAT
    6 B(I,L,K-J) = B(I,L,K-J) - A(L,-J,K) * B(I,L,K)

30 Cholesky kernel
Loop orders shown: (I,K,J,L) and (L,I,K,J)
      DO 1 I = 0,NRHS
      DO 1 K = 0,2*N+1
        IF (K.LE.N) THEN
          I0 = MIN(M,N-K)
        ELSE
          I0 = MIN(M,2*N-K+1)
        ENDIF
      DO 1 J = 0,I0
C$DOISV
      DOALL 1 L = 0,NMAT
        IF (K.LE.N) THEN
          IF (J.EQ.0) THEN
    8       B(I,L,K)=B(I,L,K)*A(L,0,K)
          ELSE
    7       B(I,L,K+J)=B(I,L,K+J)-A(L,-J,K+J)*B(I,L,K)
          ENDIF
        ELSE
          IF (J.EQ.0) THEN
    9       B(I,L,K)=B(I,L,K)*A(L,0,K)
          ELSE
    6       B(I,L,K-J)=B(I,L,K-J)-A(L,-J,K)*B(I,L,K)
          ENDIF
        ENDIF
    1 CONTINUE

31 CFD application
- Computational Fluid Dynamics (CFD)
  - Navier-Stokes equations
  - Successive Over-Relaxation (SOR)
- Kernel 3D loop: difficult to analyze
  - 172 array references/iteration
  - 33 if-branches/iteration
- Unimodular transformation found!

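32 CFD Application
Original iteration space (axes i1, i2, i3):
- Range: i1 = 1,4; i2 = 1,4; i3 = 1,4
- Plane: 3 i1 + 2 i2 + i3 = 9
- Seq. time: 64
- Dataflow: 19, Speedup: 3.37
- Loop time: 64, Speedup: 1.00
- Labelled points: (2,1,1), (1,2,2), (1,1,4)
Transformation matrix:
  [ 3  2  1 ]
  [ 0  1  0 ]
  [ 1  0  0 ]
Transformed iteration space (axes I1', I2', I3'):
- Range: I1' = 6,24; I2' = 1,4; I3' = 1,4
- Plane: i1' = 9
- DOALL i2', i3'
- Seq. time (value missing in the transcript)
- Dataflow: 19, Speedup: 3.37
- Loop time: 19, Speedup: 3.37
- Labelled points: (9,1,1), (9,2,1), (9,1,2)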
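
The figures on this slide can be reproduced with a few lines of arithmetic. The sketch takes the matrix rows (3,2,1), (0,1,0), (1,0,0) and the 4x4x4 range from the slide; the names `T_CFD`, `apply` and `plane_speedup` are illustrative, and the dependences of the CFD kernel themselves are not modelled.

```python
from itertools import product

T_CFD = [[3, 2, 1], [0, 1, 0], [1, 0, 0]]

def apply(m, v):
    """Matrix-vector product m . v as a tuple."""
    return tuple(sum(a * x for a, x in zip(row, v)) for row in m)

def plane_speedup(n):
    """(T_seq, first plane, last plane, T_plane, speedup) for an n^3 box."""
    points = list(product(range(1, n + 1), repeat=3))
    firsts = [apply(T_CFD, p)[0] for p in points]  # I1' = 3 i1 + 2 i2 + i3
    t_seq = len(points)
    t_plane = max(firsts) - min(firsts) + 1
    return t_seq, min(firsts), max(firsts), t_plane, t_seq / t_plane
```

For n = 4 the new outer index runs from 6 to 24, giving 19 planes and a speedup of 64/19, about 3.37. The three labelled points on the plane 3 i1 + 2 i2 + i3 = 9 map to the three labelled points with I1' = 9.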

33 Conclusion and Future work
- Allows the exact visualization of real program loops
- Assistance with detecting parallel loops
- Estimation of the maximal speedup using dataflow execution
- Assistance with finding suitable loop transformations
- Future work: seamless integration into PPT (parallel programming environment)

34 Thanks for your attention! Any questions?
