1 Array Operation Synthesis to Optimize Data Parallel Programs Speaker : Gwan-Hwan Hwang (黃冠寰), Ph.D. Associate Professor Department of Information and Computer Education, National Taiwan Normal University

2 Outline of Presentation Fortran 90 Intrinsic Array Operations Array Operation Synthesis (AOS) SYNTOOL Applying AOS to Shared-Memory Machines Applying AOS to Distributed-Memory Machines Conclusion and Related Work

3 Intrinsic Array Operations Provided by modern programming languages, e.g. Fortran 90, High Performance Fortran (HPF), HPF2, Fortran 95, APL, MATLAB, MATHEMATICA, NESL, C*. Used in engineering and scientific applications. Facilitate compilation analysis for optimization. Support parallel execution and portability.

4 Intrinsic Array Operations (Cont'd) Array operations provided by Fortran 90 and HPF. Examples: CSHIFT, TRANSPOSE, MERGE, EOSHIFT, RESHAPE, SPREAD, section move, WHERE constructs, reductions. B=CSHIFT(A,1,1) C=TRANSPOSE(B)
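The semantics of the two intrinsics in the slide's example can be sketched in NumPy (an illustration, not the authors' tool: `np.roll` with a negated shift mirrors CSHIFT, and `.T` mirrors TRANSPOSE):

```python
import numpy as np

# B = CSHIFT(A,1,1): B(I,J) = A(I+1,J) with circular wrap-around on dim 1.
# C = TRANSPOSE(B):  C(I,J) = B(J,I).
A = np.arange(9).reshape(3, 3)

B = np.roll(A, -1, axis=0)   # CSHIFT moves element I+1 into position I
C = B.T

# Element-wise check of the two data access functions
for i in range(3):
    for j in range(3):
        assert B[i, j] == A[(i + 1) % 3, j]
        assert C[i, j] == B[j, i]
```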

5 Consecutive Array Expressions Array expression: C=EOSHIFT(MERGE(RESHAPE(S,(/N,N/)),A+B,T),1,0,1) Consecutive array operations: FXP=CSHIFT(F1,1,+1) FXM=CSHIFT(F1,1,-1) FYP=CSHIFT(F1,2,+1) FYM=CSHIFT(F1,2,-1) FDERIV=ZXP*(FXP-F1)+ZXM*(FXM-F1)+ZYP*(FYP-F1)+ZYM*(FYM-F1)

6 Straightforward Compilation Translate each operation into a parallel loop:
B=CSHIFT(TRANSPOSE(EOSHIFT(A,1,0,1)),1,1)
EOSHIFT:
FORALL (I=1:N:1; J=1:N:1)
  IF (1<=I<=N-1) AND (1<=J<=N) THEN T1(I,J)=A(I+1,J) ELSE T1(I,J)=0
ENDFORALL
TRANSPOSE:
FORALL (I=1:N:1; J=1:N:1)
  T2(I,J)=T1(J,I)
ENDFORALL
CSHIFT:
FORALL (I=1:N:1; J=1:N:1)
  IF (1<=I<=N-1) AND (1<=J<=N) THEN B(I,J)=T2(I+1,J) ELSE B(I,J)=T2(I-N+1,J)
ENDFORALL
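In Python the straightforward translation might look as follows (a sketch with 0-based indices, not the authors' generated code). Note the two full-size temporaries T1 and T2 and the three separate passes over the data:

```python
import numpy as np

N = 4
A = np.arange(1, N * N + 1, dtype=float).reshape(N, N)

# Pass 1 - EOSHIFT(A,1,0,1): shift dim 1 by one, filling the boundary with 0.
T1 = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        T1[i, j] = A[i + 1, j] if i + 1 < N else 0.0

# Pass 2 - TRANSPOSE.
T2 = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        T2[i, j] = T1[j, i]

# Pass 3 - CSHIFT(...,1,1): circular shift of dim 1 by one.
B = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        B[i, j] = T2[(i + 1) % N, j]
```

Each intrinsic costs a full loop nest plus a temporary array; this is exactly the overhead that array operation synthesis removes.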

7 Data Access Functions 1. Model each array operation by a data access function (DAF). 2. Compose the data access functions.

8 Examples of Data Access Function (1) One source array, one data access pattern: B=TRANSPOSE(A). Data access function: B(I,J)=A(J,I)

9 Examples of Data Access Function (2) Multiple source arrays, one data access pattern: R=MERGE(T,F,M). Data access function: R(I,J)=T(I,J) where M(I,J) is .TRUE., and R(I,J)=F(I,J) otherwise. (Source arrays T, F, M; result array R.)
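MERGE's data access function maps directly onto NumPy's `np.where` (an illustrative sketch, not part of the original slides):

```python
import numpy as np

# MERGE(T, F, M) selects T where the mask M is true, F elsewhere:
#   R(I,J) = T(I,J) if M(I,J), else F(I,J)
T = np.array([[1, 2], [3, 4]])
F = np.array([[9, 8], [7, 6]])
M = np.array([[True, False], [False, True]])

R = np.where(M, T, F)
assert (R == np.array([[1, 8], [7, 4]])).all()
```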

10 Examples of Data Access Function (3) Single source array, multiple data access patterns: B=CSHIFT(A,1,1). Data access function: B(I,J)=A(I+1,J) over (/1:N-1, 1:N/) and B(I,J)=A(I-N+1,J) over (/N:N, 1:N/), where each bracketed index range is a segmentation descriptor.
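The two-segment data access function of CSHIFT can be checked directly (a Python sketch keeping the slide's 1-based index ranges):

```python
import numpy as np

N = 6
A = np.arange(N * N).reshape(N, N)

# B = CSHIFT(A,1,1) as two segments (1-based, matching the slide):
#   B(I,J) = A(I+1,J)    for (I,J) in (/1:N-1, 1:N/)
#   B(I,J) = A(I-N+1,J)  for (I,J) in (/N:N,   1:N/)
B = np.empty_like(A)
for i in range(1, N + 1):
    for j in range(1, N + 1):
        if i <= N - 1:
            B[i - 1, j - 1] = A[i, j - 1]        # A(I+1, J), 0-based
        else:
            B[i - 1, j - 1] = A[i - N, j - 1]    # A(I-N+1, J), wrap-around

# Both segments together reproduce the circular shift.
assert (B == np.roll(A, -1, axis=0)).all()
```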

11 Classification of Array Operations Model array operations by data access functions (DAFs); the operations fall into four types (Type 1, Type 2, Type 3, Type 4).

12 Array Operation Synthesis Construct the parse tree of the array expression. Represent the array operations by mathematical functions (data access functions). B=CSHIFT(TRANSPOSE(EOSHIFT(A,1,0,1)),1,1) (Parse tree: CSHIFT at the root, then TRANSPOSE, then EOSHIFT.)

13 Array Operation Synthesis (Cont'd) Synthesis of two functions: synthesizing D1 (EOSHIFT) with D2 (TRANSPOSE) yields D3 (EOSHIFT+TRANSPOSE); synthesizing D3 with D4 (CSHIFT) yields the final function D5.

14 Code Generation for the Synthesized Data Access Function
Code generated from the synthesized function D5:
FORALL (I=1:N:1; J=1:N:1)
  IF (/I,J/) ∈ (/1:N-1,1:N/) ∧ (/J,I+1/) ∈ (/1:N-1,1:N/) THEN B(I,J)=A(J+1,I+1)
  IF (/I,J/) ∈ (/1:N-1,1:N/) ∧ (/J,I+1/) ∈ (/N:N,1:N/) THEN B(I,J)=0
  IF (/I,J/) ∈ (/N:N,1:N/) ∧ (/J,I-N+1/) ∈ (/1:N-1,1:N/) THEN B(I,J)=A(J+1,I-N+1)
  IF (/I,J/) ∈ (/N:N,1:N/) ∧ (/J,I-N+1/) ∈ (/N:N,1:N/) THEN B(I,J)=0
ENDFORALL
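The synthesized single pass can be cross-checked against the chained intrinsics (a Python sketch with 0-based indices; the four guarded cases mirror the FORALL above):

```python
import numpy as np

N = 5
A = np.random.default_rng(0).random((N, N))

# Synthesized single pass for B = CSHIFT(TRANSPOSE(EOSHIFT(A,1,0,1)),1,1):
# B(I,J) = A(J+1, I+1), with EOSHIFT's zero fill at J = N and CSHIFT's
# wrap-around at I = N (0-based: j == N-1, i == N-1).
B = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        if j == N - 1:                 # EOSHIFT boundary: fill with zero
            B[i, j] = 0.0
        elif i < N - 1:
            B[i, j] = A[j + 1, i + 1]
        else:                          # CSHIFT wrap-around row
            B[i, j] = A[j + 1, i + 1 - N]

# Cross-check against the three chained intrinsic operations.
T1 = np.vstack([A[1:], np.zeros((1, N))])   # EOSHIFT(A,1,0,1)
T2 = T1.T                                   # TRANSPOSE
expected = np.roll(T2, -1, axis=0)          # CSHIFT(...,1,1)
assert np.allclose(B, expected)
```

One pass, no temporaries, same result as the three-pass straightforward compilation.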

15 Code Generation for the Synthesized Data Access Function (Diagram: after optimization, the index ranges 1:N-1 and N:N of D5 are simplified into the refined function D6.)

16 Optimization Simplify the index ranges at compilation time instead of at runtime. Optimization process: normalize the segmentation descriptors, then intersect the ranges for each dimension.

17 Comparison of Array Operation Synthesis and Straightforward Compilation
Code with array operation synthesis:
FORALL (I=1:N-1:1; J=1:N-1:1) B(I,J)=A(J+1,I+1) ENDFORALL
FORALL (I=1:N-1:1; J=N:N:1) B(I,J)=0 ENDFORALL
FORALL (I=N:N:1; J=1:N-1:1) B(I,J)=A(J+1,I-N+1) ENDFORALL
FORALL (I=N:N:1; J=N:N:1) B(I,J)=0 ENDFORALL
Code by straightforward compilation:
FORALL (I=1:N:1; J=1:N:1) IF (1<=I<=N-1) AND (1<=J<=N) THEN T1(I,J)=A(I+1,J) ELSE T1(I,J)=0 ENDFORALL
FORALL (I=1:N:1; J=1:N:1) T2(I,J)=T1(J,I) ENDFORALL
FORALL (I=1:N:1; J=1:N:1) IF (1<=I<=N-1) AND (1<=J<=N) THEN B(I,J)=T2(I+1,J) ELSE B(I,J)=T2(I-N+1,J) ENDFORALL

18 Synthesis Anomaly
REAL A(N),B(N),C(N),T(N,M),D(N,M)
A=SIN(SQRT(B+0.5)+COS(C))          ! Statement S1
T=SPREAD(A,DIM=2,NCOPIES=M)+D      ! Statement S2
SPREAD is a one-to-many data movement function.
Synthesizing S1 and S2 separately:
FORALL (I=1:N:1) A(I)=SIN(SQRT(B(I)+0.5)+COS(C(I))) END FORALL
FORALL (I=1:N:1,J=1:M:1) T(I,J)=A(I)+D(I,J) END FORALL
Cost: N SIN/SQRT/COS evaluations, 2*N+N*M additions, N+N*M assignments.
Synthesizing S1 and S2 collectively:
FORALL (I=1:N:1,J=1:M:1) T(I,J)=SIN(SQRT(B(I)+0.5)+COS(C(I)))+D(I,J) END FORALL
Cost: N*M SIN/SQRT/COS evaluations, 3*N*M additions, N*M assignments.
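The anomaly's cost blow-up can be demonstrated by counting transcendental calls (a Python sketch; the counter wrapper is purely illustrative):

```python
import math

N, M = 100, 8
calls = {"n": 0}

def sin_counted(x):
    calls["n"] += 1
    return math.sin(x)

B = [0.1 * i for i in range(N)]
C = [0.2 * i for i in range(N)]
D = [[1.0] * M for _ in range(N)]

# Separate synthesis: A is materialized once, so SIN is evaluated N times.
calls["n"] = 0
A = [sin_counted(math.sqrt(B[i] + 0.5) + math.cos(C[i])) for i in range(N)]
T_sep = [[A[i] + D[i][j] for j in range(M)] for i in range(N)]
sep_calls = calls["n"]

# Collective synthesis substitutes S1's right-hand side into S2, so the
# transcendental expression is re-evaluated for every (I,J) pair: N*M times.
calls["n"] = 0
T_col = [[sin_counted(math.sqrt(B[i] + 0.5) + math.cos(C[i])) + D[i][j]
          for j in range(M)] for i in range(N)]
col_calls = calls["n"]

assert T_sep == T_col                      # same result...
assert sep_calls == N and col_calls == N * M   # ...but M times the SIN calls
```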

19 Synthesis Anomaly (Cont'd) We propose a polynomial-time Synthesis Anomaly Prevention algorithm. Loop interchange enables more synthesis.

20 Analysis of Array Operation Synthesis We prove that array operation synthesis: reduces the number of stores, reduces the number of loads, and does not increase the required computation.
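For the running CSHIFT/TRANSPOSE/EOSHIFT example, a back-of-envelope count illustrates the claim (a sketch under the assumption that every array element read or written costs one load or store; the counts are ours, not the paper's proof):

```python
N = 64

# Straightforward compilation: three passes, each materializing an N*N array.
stores_straight = 3 * N * N                     # writes to T1, T2 and B
loads_straight = (N - 1) * N + N * N + N * N    # reads of A, then T1, then T2

# Synthesized code: a single pass writing B directly from A; only the
# (N-1)*N elements that are not zero-filled ever load A.
stores_synth = N * N
loads_synth = (N - 1) * N

assert stores_synth < stores_straight
assert loads_synth < loads_straight
```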

21 Advanced Techniques Optimization for Segmentation Descriptors with Coupled Index Functions Synthesis of Array Reduction and Location intrinsic operations Synthesis of WHERE CONSTRUCT

22 Contributions The first scheme that can synthesize the following Fortran 90 intrinsic array operations: array section movement, SPREAD, TRANSPOSE, EOSHIFT, CSHIFT, MERGE, the WHERE construct, array reduction functions (ALL, COUNT, MAXVAL), and array location functions (MAXLOC, MINLOC).

23 SYNTOOL An implementation of array operation synthesis as a web-based tool: the kernel is implemented in C, fronted by a web page and a CGI program. Performs source-to-source array operation synthesis and returns a Fortran 90 or HPF program. Available on WWW at

26 Synthesis on Shared-Memory Systems SYNTOOL test beds: a Sequent S27 with 10 identical processors, and an SGI Power Challenge with 10 identical processors. Seven Fortran 90 test suites are used; the last four program fragments are from real application codes. (Diagram: CPUs, each with a cache, attached to a shared bus and main memory.)

27 Experimental Results on Sequent (N=256) Code Fragment 1 (CSHIFT, TRANSPOSE, ADDITION, RESHAPE) Code Fragment 2 (WHERE construct)

28 Experimental Results on Sequent (N=256) Code Fragment 3 (EOSHIFT, MERGE, RESHAPE, ADDITION) Code Fragment 4 (Purdue-set Problem 9)

29 Experimental Results on Sequent (N=256) Code Fragment 5 (APULE routine, electromagnetic scattering problem) Code Fragment 6 (Sandia Wave)

30 Experimental Results on Sequent (N=256) Code Fragment 7 (Linear Equation Solve)

31 Experimental Results on SGI Power Challenge (N=512) Code Fragment 4 (Purdue-set Problem 9) Code Fragment 5 (APULE routine electromagnetic scattering problem)

32 Experimental Results on SGI Power Challenge (N=512) Code Fragment 6 (Sandia Wave) Code Fragment 7 (Linear Equation Solve)


34 Synthesis on Distributed-Memory Systems Test beds: an 8-node DEC Alpha Farm with the DEC HPF compiler, an IBM SP2 with an HPF compiler, and an nCUBE/2 with 16 nodes. (Diagram: CPU/memory nodes connected by an interconnection network.)

35 HPF Example
REAL A(N,2*N), B(2*N), C(2*N,2*N)
REAL D(2*N,N), E(N), F(N), G(N)
!HPF$ TEMPLATE TEMP(N*N*4,N*N*4)
!HPF$ ALIGN A(i,j) WITH TEMP(4*i-3,4*j-3)
!HPF$ ALIGN B(i) WITH TEMP(*,4*i-3)
!HPF$ ALIGN C(i,j) WITH TEMP(4*j-3,i)
!HPF$ ALIGN D(i,j) WITH TEMP(4*j-3,4*i-3)
!HPF$ ALIGN G(i) WITH TEMP(4*i-3,*)
D=C(:,1:4*N:4)
E=SUM(A+TRANSPOSE(D),DIM=2)
F=SUM(SPREAD(B,DIM=2,NCOPIES=N)+D,DIM=1)
G=B(1:2*N:2)+E+F
S. Chatterjee et al., "Automatic Array Alignment in Data-Parallel Programs," ACM Symposium on Principles of Programming Languages, 1993.

36 HPF Example (Cont'd) We execute the code with the DEC HPF compiler on an 8-node DEC Farm with an FDDI network. The loop runs 100 times; N is set to 128.

37 Applying Array Operation Synthesis to Distributed-Memory Machines The Owner Computes Rule of HPF. Memory references on distributed-memory machines: local references and remote references (communication).

38 Distributed-Memory Synthesis Anomaly Array operation synthesis may either decrease remote references (good) or increase remote references, requiring more communication time; the latter is a synthesis anomaly. Example of a distributed-memory synthesis anomaly:
!HPF$ TEMPLATE TEMP(N,N)
REAL A(N,N),B(N,N),C(N,N)
!HPF$ ALIGN A(I,J),B(I,J),C(I,J) WITH TEMP(I,J)
C=TRANSPOSE(A+B)
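A toy model makes the anomaly concrete (a sketch under assumed block-row distribution of TEMP's first dimension across P processors; the counting scheme is ours, not the paper's cost model):

```python
N, P = 8, 4
block = N // P
owner = lambda i: i // block   # block-row distribution of TEMP's first dim

# Synthesized C = TRANSPOSE(A + B): under the owner-computes rule, owner(i)
# computes C(i, j) but must read A(j, i) and B(j, i), which live on owner(j).
remote = sum(2                     # two remote reads: A(j,i) and B(j,i)
             for i in range(N) for j in range(N)
             if owner(i) != owner(j))

# Unsynthesized: T = A + B is alignment-preserving (communication-free),
# then C = TRANSPOSE(T) needs only one remote read T(j, i) per element.
remote_two_step = sum(1
                      for i in range(N) for j in range(N)
                      if owner(i) != owner(j))

assert remote == 2 * remote_two_step   # synthesis doubled the remote reads
```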

39 Evaluating Array Expressions under the Owner Computes Rule The problem: decide which array operations to synthesize, and find data layouts for the temporary arrays. Finding the optimal solution is NP-hard; besides the optimal solution, we also propose a heuristic algorithm.

40 A Heuristic to Reduce Synthesis Anomaly 1. Do array operation synthesis. 2. Does the synthesis increase the communication cost? If yes, roll back to temporary arrays; if no, do code generation normally.

41 Experimental Results on DEC Workstation Farm (N=128) (Purdue-set Problem 9) (APULE routine, electromagnetic scattering problem)

42 Experimental Results on DEC Workstation Farm (N=128) (Sandia Wave)

43 Experimental Results on IBM SP2 (N=512) (Purdue-set Problem 9) (APULE routine electromagnetic scattering problem)

44 Experimental Results on IBM SP2 (N=512) (Sandia Wave)

45 Experimental Results on nCUBE/2 with 16 processors

46 Experimental Results on nCUBE/2 with 16 processors

47 Performance of eight suites on nCUBE/2

48 Performance of eight suites on nCUBE/2

49 Array Operation Synthesis on Distributed-Memory Machines The optimal solution is NP-hard; we give a heuristic algorithm for code generation. Experimental results show speedups from 1.6 to 10.4 for HPF code fragments on the DEC Alpha Farm and IBM SP2, demonstrating that applying AOS to HPF programs is also profitable.

50 Conclusion Array operation synthesis can handle RESHAPE, SPREAD, CSHIFT, EOSHIFT, TRANSPOSE, MERGE, section movement, reduction operations, and the WHERE construct. The measured speedups from real applications vary between 1.21 and 7.55 on the Sequent S27 and SGI Power Challenge. Experimental results show speedups from 1.6 to 10.4 for HPF code fragments from real applications on the DEC Alpha Farm and IBM SP2.

51 Related Work Handling PACK, UNPACK, and matrix multiplication. Integrating automatic data alignment and AOS. Synthesis of array operation functions that include message-passing code. Applying AOS to a more extensive set of data-parallel programs.

52 Any questions? Gwan-Hwan Hwang (黃冠寰)

53 Synthesis of Two Data Access Functions Substitution (a term-rewriting-like method): given two data access patterns, the synthesized data access pattern is obtained by substituting one pattern into the other. (Formulas omitted in this transcript.)
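Since the slide's formulas are lost in this transcript, the substitution idea can be sketched as composition of index maps (illustrative only; the index functions below are assumptions matching the running EOSHIFT/TRANSPOSE example):

```python
# Two data access functions as index maps (0-based, purely illustrative):
#   T(i, j) = A(g(i, j))   -- an EOSHIFT-like access into A
#   C(i, j) = T(h(i, j))   -- a TRANSPOSE access into T
# Substituting g into h composes them into one map: C(i, j) = A(g(h(i, j))).
g = lambda i, j: (i + 1, j)      # T reads A(i+1, j)
h = lambda i, j: (j, i)          # C reads T(j, i)

synthesized = lambda i, j: g(*h(i, j))   # C reads A(j+1, i) directly

assert synthesized(2, 5) == (6, 2)       # g(h(2,5)) = g(5,2) = (6,2)
```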

54 Synthesis of Two DAFs (Cont'd) For example, by the substitution rule: (formulas omitted in this transcript)

55 Synthesis of Two DAFs (Cont'd) For example: (formulas omitted in this transcript)

56 Reference Location !HPF$ ALIGN F(I,J) WITH TEMP(3*I-1,J) The reference location of F(2*J-1,I) with respect to TEMP is: TEMP(3*(2*J-1)-1,I)=TEMP(6*J-4,I)
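This substitution is plain index arithmetic and can be verified mechanically (an illustrative Python check):

```python
# Reference location of F(2*J-1, I) under ALIGN F(I,J) WITH TEMP(3*I-1, J):
# substitute F's subscripts into the alignment function.
def align_F(i, j):          # F(i, j) lives at TEMP(3*i - 1, j)
    return (3 * i - 1, j)

# 3*(2*J-1) - 1 simplifies to 6*J - 4 for every J.
for J in range(1, 50):
    for I in range(1, 50):
        assert align_F(2 * J - 1, I) == (6 * J - 4, I)
```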

57 Demonstration Example of the Heuristic Algorithm
!HPF$ TEMPLATE TEMP(300,300)
REAL A(100,100),B(100,100),C(100,100)
REAL D(100,100),E(100,100),F(200,100),G(300,100)
!HPF$ ALIGN A(i,j),B(i,j),C(i,j),D(i,j),E(i,j) WITH TEMP(i,j)
!HPF$ ALIGN F(i,j) WITH TEMP(3*i-1,j)
!HPF$ ALIGN G(i,j) WITH TEMP(2*i,j)
C(1:100,:)=F(1:200:2,:)
D(1:100,:)=G(1:300:3,:)
E=CSHIFT(TRANSPOSE(A+B),1,1)*(TRANSPOSE(C)-TRANSPOSE(D))
After array operation synthesis:
FORALL (I=1:99,J=1:100)
  E(I,J)=(A(J,I+1)+B(J,I+1))*(F(2*J-1,I)-G(3*J-2,I))
END FORALL
FORALL (I=100:100,J=1:100)
  E(I,J)=(A(J,I-99)+B(J,I-99))*(F(2*J-1,I)-G(3*J-2,I))
END FORALL

58 Heuristic Algorithm (1) Build the expression tree of E(I,J)=(A(J,I+1)+B(J,I+1))*(F(2*J-1,I)-G(3*J-2,I)): the root E(I,J) with leaves A(J,I+1), B(J,I+1), F(2*J-1,I), and G(3*J-2,I).

59 Heuristic Algorithm (1) Annotate the tree with reference locations: E(I,J) at TEMP(I,J); subtree SB1 (A(J,I+1), B(J,I+1)) at TEMP(J,I+1); subtree SB2 (F(2*J-1,I), G(3*J-2,I)) at TEMP(6*J-4,I).

60 Heuristic Algorithm (2) Create Temporary Arrays
!HPF$ ALIGN TA1(I,J) WITH TEMP(J,I+1)
!HPF$ ALIGN TA2(I,J) WITH TEMP(6*J-4,I)
FORALL (I=1:99,J=1:100)   ! for subtree SB1: communication-free loop
  TA1(I,J)=A(J,I+1)+B(J,I+1)
END FORALL
FORALL (I=1:99,J=1:100)   ! for subtree SB2: communication-free loop
  TA2(I,J)=F(2*J-1,I)-G(3*J-2,I)
END FORALL
FORALL (I=1:99,J=1:100)
  E(I,J)=TA1(I,J)*TA2(I,J)
END FORALL