Published by Melvyn Taylor. Modified over 9 years ago.
1
1 Array Operation Synthesis to Optimize Data Parallel Programs
Speaker: Gwan-Hwan Hwang (黃冠寰), Ph.D.
Associate Professor, Department of Information and Computer Education, National Taiwan Normal University
ghhwang@ice.ntnu.edu.tw
http://www.ice.ntnu.edu.tw/~ghhwang
2
2 Outline of Presentation
Fortran 90 Intrinsic Array Operations
Array Operation Synthesis (AOS)
SYNTOOL
Apply AOS to Shared-Memory Machines
Apply AOS to Distributed-Memory Machines
Conclusion and Related Work
3
3 Intrinsic Array Operations
Provided by modern programming languages, e.g. Fortran 90, High Performance Fortran (HPF), HPF2, Fortran 97, APL, MATLAB, MATHEMATICA, NESL, C*
Engineering and scientific applications
Facilitate compilation analysis for optimization
Support parallel execution and portability
4
4 Intrinsic Array Operations (Cont'd)
Array operations provided by Fortran 90 and HPF.
Examples: CSHIFT, TRANSPOSE, MERGE, EOSHIFT, RESHAPE, SPREAD, section move, WHERE constructs, reductions.
    B=CSHIFT(A,1,1)
    C=TRANSPOSE(B)
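To make the two statements above concrete, here is a minimal sketch that applies them to a small array and prints A, B and C side by side. The array size, the fill values and the program name are assumptions chosen for illustration; they do not come from the slides.

```fortran
program intrinsic_demo
  implicit none
  integer, parameter :: n = 3            ! hypothetical size, for illustration only
  integer :: a(n,n), b(n,n), c(n,n)
  integer :: i

  ! Fill A in column-major order with 1..n*n, so A(i,j) = i + (j-1)*n.
  a = reshape((/ (i, i = 1, n*n) /), (/ n, n /))

  b = cshift(a, 1, 1)   ! circular shift by 1 along dimension 1: B(i,j) = A(i+1,j), wrapping at i = n
  c = transpose(b)      ! C(i,j) = B(j,i)

  ! Print row i of A, B and C on one line.
  do i = 1, n
     print '(3i3,3x,3i3,3x,3i3)', a(i,:), b(i,:), c(i,:)
  end do
end program intrinsic_demo
```

Each intrinsic call here returns a whole array, which is why a straightforward compiler tends to materialize B as a temporary before computing C.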
5
5 Consecutive Array Expressions
Array expression:
    C=EOSHIFT(MERGE(RESHAPE(S,(/N,N/)),A+B,T),1,0,1)
Consecutive array operations:
    FXP=CSHIFT(F1,1,+1)
    FXM=CSHIFT(F1,1,-1)
    FYP=CSHIFT(F1,2,+1)
    FYM=CSHIFT(F1,2,-1)
    FDERIV=ZXP*(FXP-F1)+ZXM*(FXM-F1)+ZYP*(FYP-F1)+ZYM*(FYM-F1)
6
6 Straightforward Compilation
Translate each operation into a parallel loop.
    B=CSHIFT(TRANSPOSE(EOSHIFT(A,1,0,1)),1,1)
EOSHIFT:
    FORALL (I=1:N:1; J=1:N:1)
      IF (1<=I<=N-1) and (1<=J<=N) THEN T1(I,J)=A(I+1,J)
      ELSE T1(I,J)=0
    ENDFORALL
TRANSPOSE:
    FORALL (I=1:N:1; J=1:N:1)
      T2(I,J)=T1(J,I)
    ENDFORALL
CSHIFT:
    FORALL (I=1:N:1; J=1:N:1)
      IF (1<=I<=N-1) and (1<=J<=N) THEN B(I,J)=T2(I+1,J)
      ELSE B(I,J)=T2(I-N+1,J)
    ENDFORALL
7
7 Data Access Functions
1. Model each array operation by a Data Access Function (DAF)
2. Composition of Data Access Functions
8
8 Examples of Data Access Function (1) One Source Array One Data Access Pattern B=TRANSPOSE(A) Data Access Function is B(I,J)=A(J,I)
9
9 Examples of Data Access Function (2)
Multiple Source Arrays, One Data Access Pattern
    R=MERGE(T,F,M)
Data Access Function is R(I,J)=T(I,J) where M(I,J) is true, and R(I,J)=F(I,J) otherwise.
(Figure: Array T, Array F, Array M, Array R)
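The element-wise reading of MERGE can be checked directly against the intrinsic. The sketch below is illustrative only: the array size, the fill values and the mask (true on and above the diagonal) are assumptions, not taken from the slides.

```fortran
program merge_daf
  implicit none
  integer, parameter :: n = 4            ! hypothetical size
  integer :: t(n,n), f(n,n), r(n,n), r_loop(n,n)
  logical :: m(n,n)
  integer :: i, j

  t = 1
  f = 0
  ! Hypothetical mask: true on and above the diagonal.
  do j = 1, n
     do i = 1, n
        m(i,j) = (i <= j)
     end do
  end do

  ! Intrinsic form.
  r = merge(t, f, m)

  ! Element-wise form: every source array is read at the same index (I,J).
  do j = 1, n
     do i = 1, n
        if (m(i,j)) then
           r_loop(i,j) = t(i,j)
        else
           r_loop(i,j) = f(i,j)
        end if
     end do
  end do

  if (all(r == r_loop)) then
     print *, 'element-wise form matches MERGE'
  else
     print *, 'mismatch'
     stop 1
  end if
end program merge_daf
```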
10
10 Examples of Data Access Function (3)
Single Source Array, Multiple Data Access Patterns
    B=CSHIFT(A,1,1)
(Figure: Array A, Array B)
Data Access Function is : a segmentation descriptor
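CSHIFT reads its single source array with two different index patterns, one per segment of the result. The sketch below assumes a two-dimensional A of a small, made-up size and writes the two segments as separate loops, then compares them with the intrinsic.

```fortran
program cshift_daf
  implicit none
  integer, parameter :: n = 5            ! hypothetical size
  integer :: a(n,n), b(n,n), b_loop(n,n)
  integer :: i, j

  a = reshape((/ (i, i = 1, n*n) /), (/ n, n /))

  ! Intrinsic form.
  b = cshift(a, 1, 1)

  ! Segment 1: rows 1..N-1 read the next row, B(I,J) = A(I+1,J).
  do j = 1, n
     do i = 1, n-1
        b_loop(i,j) = a(i+1,j)
     end do
  end do

  ! Segment 2: row N wraps around, B(I,J) = A(I-N+1,J) with I = N.
  do j = 1, n
     b_loop(n,j) = a(1,j)
  end do

  if (all(b == b_loop)) then
     print *, 'two-segment form matches CSHIFT'
  else
     print *, 'mismatch'
     stop 1
  end if
end program cshift_daf
```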
11
11 Classification of Array Operations
Model array operations by Data Access Functions (DAF)
Type 1, Type 2, Type 3, Type 4
12
12 Array Operation Synthesis
Construct the parse tree of the array expression
Represent array operations by mathematical functions (Data Access Functions)
    B=CSHIFT(TRANSPOSE(EOSHIFT(A,1,0,1)),1,1)
(Parse tree: CSHIFT, TRANSPOSE, EOSHIFT)
13
13 Array Operation Synthesis (Cont'd)
Synthesis of two functions
(Diagram labels: EOSHIFT, TRANSPOSE, D1, D2, EOSHIFT+TRANSPOSE, D3, CSHIFT, D4, D5)
14
14 Code Generation for Synthesized Data Access Function
Code generation for D5:
    FORALL (I=1:N:1; J=1:N:1)
      IF (/I,J/,/1:N-1,1:N/) (/J,I+1/,/1:N-1,1:N/) THEN B(I,J)=A(J+1,I+1)
      IF (/I,J/,/1:N-1,1:N/) (/J,I+1/,/N:N,1:N/) THEN B(I,J)=0
      IF (/I,J/,/N:N,1:N/) (/J,I+1/,/1:N-1,1:N/) THEN B(I,J)=A(J+1,I-N+1)
      IF (/I,J/,/N:N,1:N/) (/J,I+1/,/N:N,1:N/) THEN B(I,J)=0
    ENDFORALL
15
15 Code Generation for Synthesized Data Access Function
After Optimization
(Figure labels: D5, D6; axis marks 1, N-1, N and 1, N)
16
16 Optimization
Simplifying the ranges at compilation time instead of at runtime
Optimization process:
Normalize:
Intersection for each dimension:
17
17 Comparison of Array Operation Synthesis and Straightforward Compilation
Code by straightforward compilation:
    FORALL (I=1:N:1; J=1:N:1)
      IF (1<=I<=N-1) and (1<=J<=N) THEN T1(I,J)=A(I+1,J)
      ELSE T1(I,J)=0
    ENDFORALL
    FORALL (I=1:N:1; J=1:N:1)
      T2(I,J)=T1(J,I)
    ENDFORALL
    FORALL (I=1:N:1; J=1:N:1)
      IF (1<=I<=N-1) and (1<=J<=N) THEN B(I,J)=T2(I+1,J)
      ELSE B(I,J)=T2(I-N+1,J)
    ENDFORALL
Code with Array Operation Synthesis:
    FORALL (I=1:N-1:1; J=1:N-1:1)
      B(I,J)=A(J+1,I+1)
    ENDFORALL
    FORALL (I=1:N-1:1; J=N:N:1)
      B(I,J)=0
    ENDFORALL
    FORALL (I=N:N:1; J=1:N-1:1)
      B(I,J)=A(J+1,I-N+1)
    ENDFORALL
    FORALL (I=N:N:1; J=N:N:1)
      B(I,J)=0
    ENDFORALL
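The equivalence of the two versions can be checked on a small array. The sketch below is not the authors' generated code: it evaluates the nested intrinsic expression as the reference and compares it with four synthesized loops, writing the zero-filled last column with J=N:N. The array size and fill values are assumptions, and FORALL needs an HPF or Fortran 95 compiler.

```fortran
program synthesis_check
  implicit none
  integer, parameter :: n = 6            ! hypothetical size
  integer :: a(n,n), b_ref(n,n), b_syn(n,n)
  integer :: i, j

  a = reshape((/ (i, i = 1, n*n) /), (/ n, n /))

  ! Reference: three intrinsic calls, each of which may create a temporary array.
  b_ref = cshift(transpose(eoshift(a, 1, 0, 1)), 1, 1)

  ! Synthesized form: A is read directly and no temporary array is needed.
  forall (i = 1:n-1, j = 1:n-1) b_syn(i,j) = a(j+1, i+1)
  forall (i = 1:n-1, j = n:n)   b_syn(i,j) = 0
  forall (i = n:n,   j = 1:n-1) b_syn(i,j) = a(j+1, i-n+1)
  forall (i = n:n,   j = n:n)   b_syn(i,j) = 0

  if (all(b_ref == b_syn)) then
     print *, 'synthesized loops match the intrinsic expression'
  else
     print *, 'mismatch'
     stop 1
  end if
end program synthesis_check
```

The synthesized loops store each element of B once and never touch T1 or T2, which is where the reduction in loads and stores comes from.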
18
18 Synthesis Anomaly
    REAL A(N),B(N),C(N),T(N,M),D(N,M)
    A=SIN(SQRT(B+0.5)+COS(C))          ! Statement S1
    T=SPREAD(A,DIM=2,NCOPIES=M)+D      ! Statement S2
SPREAD: one-to-many data movement function
Synthesize array expressions S1 and S2 separately:
    FORALL (I=1:N:1)
      A(I)=SIN(SQRT(B(I)+0.5)+COS(C(I)))
    END FORALL
    FORALL (I=1:N:1,J=1:M:1)
      T(I,J)=A(I)+D(I,J)
    END FORALL
  Cost: N SIN, SQRT, COS; 2*N+N*M additions; N+N*M assignments
Synthesize array expressions S1 and S2 collectively:
    FORALL (I=1:N:1,J=1:M:1)
      T(I,J)=SIN(SQRT(B(I)+0.5)+COS(C(I)))+D(I,J)
    END FORALL
  Cost: N*M SIN, SQRT, COS; 3*N*M additions; N*M assignments
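The anomaly can be reproduced with a short program: both forms compute the same T, but the collective form re-evaluates SIN, SQRT and COS for every copy made by SPREAD. The sizes and random inputs below are assumptions for illustration, and the printed call counts are simply N and N*M as stated on the slide.

```fortran
program anomaly_demo
  implicit none
  integer, parameter :: n = 4, m = 3     ! hypothetical sizes
  real :: a(n), b(n), c(n), d(n,m)
  real :: t_ref(n,m), t_sep(n,m), t_col(n,m)
  integer :: i, j

  call random_number(b)
  call random_number(c)
  call random_number(d)

  ! Reference: statements S1 and S2 written with intrinsics.
  a = sin(sqrt(b + 0.5) + cos(c))
  t_ref = spread(a, dim=2, ncopies=m) + d

  ! S1 and S2 synthesized separately: the transcendental functions run N times.
  do i = 1, n
     a(i) = sin(sqrt(b(i) + 0.5) + cos(c(i)))
  end do
  do j = 1, m
     do i = 1, n
        t_sep(i,j) = a(i) + d(i,j)
     end do
  end do

  ! S1 and S2 synthesized collectively: the same functions run N*M times.
  do j = 1, m
     do i = 1, n
        t_col(i,j) = sin(sqrt(b(i) + 0.5) + cos(c(i))) + d(i,j)
     end do
  end do

  print *, 'max |t_sep - t_ref| =', maxval(abs(t_sep - t_ref))
  print *, 'max |t_col - t_ref| =', maxval(abs(t_col - t_ref))
  print *, 'SIN evaluations, separate:  ', n
  print *, 'SIN evaluations, collective:', n*m
end program anomaly_demo
```

Note that the collective loop never stores A, so it is only a valid replacement when A is not used later.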
19
19 Synthesis Anomaly (Cont'd)
We propose a polynomial-time Synthesis Anomaly Prevention algorithm
Loop interchange for more synthesis
20
20 Analysis of Array Operation Synthesis
We prove that Array Operation Synthesis can:
reduce the number of stores,
reduce the number of loads,
and not increase the required computations.
21
21 Advanced Techniques
Optimization for segmentation descriptors with coupled index functions
Synthesis of array reduction and location intrinsic operations
Synthesis of the WHERE construct
22
22 Contributions
The first scheme that can synthesize the following Fortran 90 intrinsic array operations:
Array section movement, SPREAD, TRANSPOSE, EOSHIFT, CSHIFT, MERGE
WHERE construct, array reduction functions (ALL, COUNT, MAXVAL)
Array location functions (MAXLOC, MINLOC)
23
23 SYNTOOL
An implementation of Array Operation Synthesis as a web-based tool
Kernel implemented in C
A web page + CGI program
Performs source-to-source Array Operation Synthesis and returns a Fortran 90 or HPF program
Available on WWW at http://puma.cs.nthu.edu.tw/~project/synth.html
24
24
25
25
26
26 Synthesis on Shared-Memory Systems
SYNTOOL test beds:
Sequent S27 with 10 identical processors
SGI Power Challenge with 10 identical processors
Seven test suites of Fortran 90 are used; the last four program fragments are from real application codes.
(Diagram: four CPUs, each with a cache, connected through a shared bus to main memory)
27
27 Experimental Results on Sequent (N=256)
Code Fragment 1 (CSHIFT, TRANSPOSE, ADDITION, RESHAPE)
Code Fragment 2 (WHERE construct)
28
28 Code Fragment 3 (EOSHIFT,MERGE RESHAPE, ADDITION) Code Fragment 4 (Purdue-set Problem 9) Experimental Results on Sequent (N=256)
29
29 Code Fragment 5 (APULE routine electromagnetic scattering problem) Code Fragment 6 (Sandia Wave) Experimental Results on Sequent (N=256)
30
30 Code Fragment 7 (Linear Equation Solve) Experimental Results on Sequent (N=256)
31
31 Experimental Results on SGI Power Challenge (N=512) Code Fragment 4 (Purdue-set Problem 9) Code Fragment 5 (APULE routine electromagnetic scattering problem)
32
32 Experimental Results on SGI Power Challenge (N=512) Code Fragment 6 (Sandia Wave) Code Fragment 7 (Linear Equation Solve)
33
33
34
34 Synthesis on Distributed-Memory Systems
Test beds:
8-node DEC Alpha Farm with DEC HPF compiler
IBM SP2 with HPF compiler
nCUBE/2 with 16 nodes
(Diagram: CPUs, each with its own memory, connected by an interconnection network)
35
35 HPF Example
    REAL A(N,2*N), B(2*N), C(2*N,2*N)
    REAL D(2*N,N), E(N), F(N), G(N)
    !HPF$ TEMPLATE TEMP(N*N*4,N*N*4)
    !HPF$ ALIGN A(i,j) WITH TEMP(4*i-3,4*j-3)
    !HPF$ ALIGN B(i) WITH TEMP(*,4*i-3)
    !HPF$ ALIGN C(i,j) WITH TEMP(4*j-3,i)
    !HPF$ ALIGN D(i,j) WITH TEMP(4*j-3,4*i-3)
    !HPF$ ALIGN G(i) WITH TEMP(4*i-3,*)
    D=C(:,1:4*N:4)
    E=SUM(A+TRANSPOSE(D),DIM=2)
    F=SUM(SPREAD(B,DIM=2,NCOPIES=N)+D,DIM=1)
    G=B(1:2*N:2)+E+F
S. Chatterjee et al., "Automatic Array Alignment in Data-Parallel Programs," ACM Symposium on Principles of Programming Languages, 1993.
36
36 HPF Example (Cont’d) We execute the codes by DEC HPF Compiler on 8-node DEC Farm with an FDDI network. Loop 100 times. N is set to 128.
37
37 Apply Array Operation Synthesis to Distributed-memory Machines
Owner Computes Rule of HPF
Memory references on distributed-memory machines:
Local references
Remote references (communication)
38
38 Distributed-Memory Synthesis Anomaly
Array Operation Synthesis may either
decrease remote references (GOOD), or
increase remote references (requires more communication time: Synthesis Anomaly)
Example of Distributed-Memory Synthesis Anomaly:
    !HPF$ TEMPLATE TEMP(N,N)
    REAL A(N,N),B(N,N),C(N,N)
    !HPF$ ALIGN A(I,J),B(I,J),C(I,J) WITH TEMP(I,J)
    C=TRANSPOSE(A+B,1)
39
39 Evaluating Array Expression
The optimal solution is NP-hard under the Owner Computes Rule:
To synthesize part of the array operations
To find the data layout of temporary arrays
Besides the optimal solution, we also propose a heuristic algorithm.
40
40 A Heuristic to Reduce Synthesis Anomaly
1. Do Array Operation Synthesis
2. Does synthesis increase communication cost?
3. Roll back temporary arrays
4. Do code generation normally
41
41 (Purdue-set Problem 9) Experimental Results on DEC Workstation Farm (N=128) (APULE routine electromagnetic scattering problem)
42
42 Experimental Results on DEC Workstation Farm (N=128) (Sandia Wave)
43
43 Experimental Results on IBM SP2 (N=512) (Purdue-set Problem 9) (APULE routine electromagnetic scattering problem)
44
44 Experimental Results on IBM SP2 (N=512) (Sandia Wave)
45
45 Experimental Results on nCUBE/2 with 16 processors
46
46 Experimental Results on nCUBE/2 with 16 processors
47
47 Performance of eight suites on nCUBE/2
48
48 Performance of eight suites on nCUBE/2
49
49 Array Operation Synthesis in Distributed-memory Machines Optimal solution is NP-hard A heuristic algorithm for code generation Experimental results show speedups from 1.6 to 10.4 for HPF code fragments on DEC alpha farm and IBM SP2 We demonstrated that it is also profitable in applying AOS to HPF programs
50
50 Conclusion
Array Operation Synthesis can handle RESHAPE, SPREAD, CSHIFT, EOSHIFT, TRANSPOSE, MERGE, section movement, reduction operations, and the WHERE construct.
The measured speedups from real applications vary between 1.21 and 7.55 on Sequent S27 and SGI Power Challenge.
Experimental results show speedups from 1.6 to 10.4 for HPF code fragments from real applications on DEC Alpha Farm and IBM SP2.
51
51 Related Work
To handle PACK, UNPACK and matrix multiplication
Integrating automatic data alignment and AOS
Synthesis for array operation functions which include message-passing code
Applying AOS to a more extensive set of data parallel programs
52
52 Any question? Gwan-Hwan Hwang (黃冠寰) ghhwang@ice.ntnu.edu.tw http://www.ice.ntnu.edu.tw/~ghhwang
53
53 Synthesis of two Data Access Functions
Substitution (a term-rewriting-like method)
Having two Data Access Patterns:
The Synthesized Data Access Pattern is:
where
54
54 Synthesis of two DAFs (Cont'd)
For example,
By the substitution rule
55
55 Synthesis of two DAFs (Cont'd)
For example,
56
56 Reference Location
    !HPF$ ALIGN F(I,J) WITH TEMP(3*I-1,J)
The reference location of F(2*J-1,I) with respect to TEMP is:
    TEMP(3*(2*J-1)-1,I)=TEMP(6*J-4,I)
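The reference location is obtained by substituting the subscript of the array reference into the alignment function. The sketch below checks the arithmetic for the first dimension over J = 1..100; the helper name temp_row and the range of J are assumptions made for illustration.

```fortran
program reference_location
  implicit none
  integer :: j

  ! A reference F(2*J-1, I) uses row subscript 2*J-1.
  ! Under ALIGN F(I,J) WITH TEMP(3*I-1,J) that row maps to TEMP row 3*(2*J-1)-1.
  do j = 1, 100
     if (temp_row(2*j - 1) /= 6*j - 4) then
        print *, 'mismatch at J =', j
        stop 1
     end if
  end do
  print *, 'F(2*J-1,I) is located at TEMP(6*J-4,I) for J = 1..100'

contains

  ! Hypothetical helper: first-dimension alignment function of F, row -> 3*row-1.
  integer function temp_row(row)
    integer, intent(in) :: row
    temp_row = 3*row - 1
  end function temp_row

end program reference_location
```

The second dimension is aligned by the identity, so the column subscript I carries over unchanged.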
57
57 Demonstration Example of Heuristic Algorithm
    !HPF$ TEMPLATE TEMP(300,300)
    REAL A(100,100),B(100,100),C(100,100)
    REAL D(100,100),E(100,100),F(200,100),G(300,100)
    !HPF$ ALIGN A(i,j),B(i,j),C(i,j),D(i,j),E(i,j) WITH TEMP(i,j)
    !HPF$ ALIGN F(i,j) WITH TEMP(3*i-1,j)
    !HPF$ ALIGN G(i,j) WITH TEMP(2*i,j)
    C(1:100,:)=F(1:200:2,:)
    D(1:100,:)=G(1:300:3,:)
    E=CSHIFT(TRANSPOSE(A+B),1,1)*(TRANSPOSE(C)-TRANSPOSE(D))
Do Array Operation Synthesis:
    FORALL (I=1:99,J=1:100)
      E(I,J)=(A(J,I+1)+B(J,I+1))*(F(2*J-1,I)-G(3*J-2,I))
    END FORALL
    FORALL (I=100:100,J=1:100)
      E(I,J)=(A(J,I-99)+B(J,I-99))*(F(2*J-1,I)-G(3*J-2,I))
    END FORALL
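The synthesized loops can be compared with the original three statements on a single processor. The sketch below drops the HPF directives, since they affect data placement rather than results, and fills the inputs with random values; the 1.0e-5 tolerance and the program name are assumptions. FORALL needs an HPF or Fortran 95 compiler.

```fortran
program heuristic_example_check
  implicit none
  real :: a(100,100), b(100,100), c(100,100), d(100,100)
  real :: f(200,100), g(300,100)
  real :: e_ref(100,100), e_syn(100,100)
  integer :: i, j

  call random_number(a)
  call random_number(b)
  call random_number(f)
  call random_number(g)

  ! Original statements, evaluated with the intrinsics.
  c(1:100,:) = f(1:200:2,:)
  d(1:100,:) = g(1:300:3,:)
  e_ref = cshift(transpose(a + b), 1, 1) * (transpose(c) - transpose(d))

  ! Synthesized loops: C and D are bypassed, F and G are read directly.
  forall (i = 1:99, j = 1:100) &
     e_syn(i,j) = (a(j,i+1) + b(j,i+1)) * (f(2*j-1,i) - g(3*j-2,i))
  forall (i = 100:100, j = 1:100) &
     e_syn(i,j) = (a(j,i-99) + b(j,i-99)) * (f(2*j-1,i) - g(3*j-2,i))

  if (maxval(abs(e_ref - e_syn)) <= 1.0e-5) then
     print *, 'synthesized loops match the original statements'
  else
     print *, 'mismatch'
     stop 1
  end if
end program heuristic_example_check
```

The check only covers E; the synthesized loops do not update C and D, so they are a full replacement only if those arrays are not needed afterwards.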
58
58 Heuristic Algorithm (1)
    E(I,J)=(A[J,I+1]+B[J,I+1])*(F[2*J-1,I]-G[3*J-2,I])
(Tree nodes: E[I,J]; A[J,I+1], B[J,I+1]; F[2*J-1,I], G[3*J-2,I])
59
59 Heuristic Algorithm (1)
    E(I,J)=(A[J,I+1]+B[J,I+1])*(F[2*J-1,I]-G[3*J-2,I])
(Tree nodes: E[I,J]; A[J,I+1], B[J,I+1]; F[2*J-1,I], G[3*J-2,I])
(Reference locations: TEMP[I,J], TEMP[J,I+1], TEMP[6*J-4,I]; subtrees SB1, SB2)
60
60 Heuristic Algorithm (2)
Create temporary arrays
For subtree SB1 (communication-free loop):
    !HPF$ ALIGN TA1(I,J) WITH TEMP(J,I+1)
    FORALL (I=1:99,J=1:100)
      TA1(I,J)=A(J,I+1)+B(J,I+1)
    END FORALL
For subtree SB2 (communication-free loop):
    !HPF$ ALIGN TA2(I,J) WITH TEMP(6*J-4,I)
    FORALL (I=1:99,J=1:100)
      TA2(I,J)=F(2*J-1,I)-G(3*J-2,I)
    END FORALL
Combine:
    FORALL (I=1:99,J=1:100)
      E(I,J)=TA1(I,J)*TA2(I,J)
    END FORALL