
Delivering High Performance to Parallel Applications Using Advanced Scheduling. Nikolaos Drosinos, Georgios Goumas, Maria Athanasaki and Nectarios Koziris.

Presentation transcript:

1 Delivering High Performance to Parallel Applications Using Advanced Scheduling
Nikolaos Drosinos, Georgios Goumas, Maria Athanasaki and Nectarios Koziris
National Technical University of Athens, Computing Systems Laboratory
{ndros,goumas,maria,nkoziris}@cslab.ece.ntua.gr
www.cslab.ece.ntua.gr

2 Overview
• Introduction
• Background
• Code Generation
  - Computation/Data Distribution
  - Communication Schemes
  - Summary
• Experimental Results
• Conclusions – Future Work

3 Introduction
Motivation:
• A lot of theoretical work has been done on arbitrary tiling, but there are no actual experimental results!
• There is no complete method to generate code for non-rectangular tiles.

4 Introduction
Contribution:
• Complete end-to-end SPMD code generation method for arbitrarily tiled iteration spaces
• Simulation of blocking and non-blocking communication primitives
• Experimental evaluation of proposed scheduling scheme

5 Overview (outline slide; next section: Background)

6 Background
Algorithmic Model:
FOR j1 = min1 TO max1 DO
  …
  FOR jn = minn TO maxn DO
    Computation(j1, …, jn);
  ENDFOR
  …
ENDFOR
• Perfectly nested loops
• Constant flow data dependencies (D)

7 Background
Tiling:
• Popular loop transformation
• Groups iterations into atomic units
• Enhances locality in uniprocessors
• Enables coarse-grain parallelism in distributed memory systems
• Valid tiling matrix H (legality condition given below)
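The condition itself appears only as an image on the slide and is not preserved in the transcript. The standard legality requirement from the tiling literature, which is presumably what the slide shows, is that no dependence may cross a tile boundary in the negative direction:

H D \ge 0 \ \text{(elementwise)}, \qquad H = P^{-1}

where the columns of D are the constant dependence vectors and P is the matrix whose columns are the tile edge vectors.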

8 Tiling Transformation
Example:
FOR j1 = 0 TO 11 DO
  FOR j2 = 0 TO 8 DO
    A[j1, j2] := A[j1-1, j2] + A[j1-1, j2-1];
  ENDFOR
ENDFOR
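For reference, the dependence matrix D of this example (each column is one of the two flow dependence vectors read by the statement) is

D = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}

and both tiling matrices H on the next two slides satisfy the legality condition H D ≥ 0 for this D.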

9 Rectangular Tiling Transformation
P = \begin{pmatrix} 3 & 0 \\ 0 & 3 \end{pmatrix}, \qquad H = \begin{pmatrix} 1/3 & 0 \\ 0 & 1/3 \end{pmatrix}
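A minimal C sketch of what the slide-8 loop nest looks like after this rectangular 3 × 3 tiling: the outer (tile) loops enumerate tile origins and the inner (element) loops traverse the iterations of one tile. The bounds match the 12 × 9 example space; the boundary guard is added only to keep the sketch runnable.

#include <stdio.h>

#define N1 12   /* j1 ranges over 0..11 */
#define N2 9    /* j2 ranges over 0..8  */
#define T  3    /* tile size in both dimensions, i.e. P = diag(3, 3) */

static double A[N1][N2];

int main(void) {
    /* Tile loops: iterate over tile origins (jj1, jj2). */
    for (int jj1 = 0; jj1 < N1; jj1 += T)
        for (int jj2 = 0; jj2 < N2; jj2 += T)
            /* Element loops: traverse the iterations inside one tile. */
            for (int j1 = jj1; j1 < jj1 + T && j1 < N1; j1++)
                for (int j2 = jj2; j2 < jj2 + T && j2 < N2; j2++)
                    if (j1 >= 1 && j2 >= 1)   /* guard added so reads stay in bounds */
                        A[j1][j2] = A[j1-1][j2] + A[j1-1][j2-1];

    printf("A[%d][%d] = %f\n", N1-1, N2-1, A[N1-1][N2-1]);
    return 0;
}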

10 Non-rectangular Tiling Transformation
P = \begin{pmatrix} 3 & 3 \\ 0 & 3 \end{pmatrix}, \qquad H = \begin{pmatrix} 1/3 & -1/3 \\ 0 & 1/3 \end{pmatrix}
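As a point of standard tiling notation (not spelled out on the slide itself): P holds the tile edge vectors as its columns, H is its inverse, and an iteration point j is mapped to the tile

\text{tile}(j) = \lfloor H\, j \rfloor, \qquad H = P^{-1}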

11 Why Non-rectangular Tiling?
• Reduces communication
• Enables more efficient scheduling schemes
[Figure: in the example, rectangular tiling needs 8 communication points and 6 time steps; non-rectangular tiling needs 6 communication points and 5 time steps]

12 Overview (outline slide; next section: Code Generation – Computation/Data Distribution)

13 Computation Distribution
We map tiles along the longest dimension to the same processor because:
• It reduces the number of processors required
• It simplifies message passing
• It reduces total execution time when overlapping computation with communication
(a minimal mapping sketch follows)
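A minimal sketch of this mapping rule for a 2-D tile space, assuming the first tile coordinate runs along the longest dimension and the remaining coordinate is distributed cyclically over the processors (the cyclic choice is an assumption of this sketch, not something stated on the slide):

#include <stdio.h>

/* Hypothetical illustration: tiles that differ only in their coordinate along
 * the longest dimension (t1 here) get the same owner, so each processor
 * executes its whole chain of tiles sequentially. */
static int tile_owner(int t1, int t2, int num_procs) {
    (void)t1;                 /* the longest dimension does not affect ownership */
    return t2 % num_procs;    /* remaining tile coordinate selects the owner     */
}

int main(void) {
    for (int t2 = 0; t2 < 4; t2++)
        for (int t1 = 0; t1 < 4; t1++)
            printf("tile (%d,%d) -> P%d\n", t1, t2, tile_owner(t1, t2, 3) + 1);
    return 0;
}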

14 Computation Distribution
[Figure: tile-to-processor mapping for processors P1, P2, P3]

15 Data Distribution
• Computer-owns rule: each processor owns the data it computes
• Arbitrary convex iteration space, arbitrary tiling
• Rectangular local iteration and data spaces

16 Data Distribution [figure]

17 Data Distribution [figure]

18 Data Distribution [figure]

19 Overview (outline slide; next section: Code Generation – Communication Schemes)

20 Communication Schemes
With whom do I communicate?
[Figure: processors P1, P2, P3 in the tile space]

21 Communication Schemes
With whom do I communicate?
[Figure: processors P1, P2, P3]

22 Communication Schemes
What do I send?

23 Blocking Scheme
[Figure: tile execution of P1, P2, P3 over the (j1, j2) space; 12 time steps in total]

24 Non-blocking Scheme
[Figure: tile execution of P1, P2, P3 over the (j1, j2) space; 6 time steps in total]

25 Overview (outline slide; next section: Code Generation Summary)

26 Code Generation Summary
Advanced Scheduling = Suitable Tiling + Non-blocking Communication Scheme
[Flow chart: Sequential Code → Dependence Analysis → Tiling Transformation → Sequential Tiled Code → Parallelization (Computation Distribution, Data Distribution, Communication Primitives) → Parallel SPMD Code]

27 Code Summary – Blocking Scheme
[Code listing shown as an image on the slide; a sketch of the scheme follows]
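The listing on this slide is not preserved in the transcript. Below is a minimal, hedged MPI sketch of the blocking scheme it summarizes: a one-dimensional chain of processes in which each tile first receives its input border, then computes, then sends its output border, so communication and computation never overlap. NUM_TILES, BORDER_LEN and compute_tile() are hypothetical placeholders, not names from the paper.

#include <mpi.h>
#include <stdio.h>

#define NUM_TILES  8
#define BORDER_LEN 4

/* Hypothetical per-tile computation: stands in for the real tile body. */
static void compute_tile(int t, const double *in, double *out) {
    for (int i = 0; i < BORDER_LEN; i++)
        out[i] = in[i] + t;               /* placeholder work */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int prev = rank - 1, next = rank + 1;
    double in[BORDER_LEN] = {0}, out[BORDER_LEN];

    for (int t = 0; t < NUM_TILES; t++) {
        /* 1. Blocking receive of the data this tile depends on. */
        if (prev >= 0)
            MPI_Recv(in, BORDER_LEN, MPI_DOUBLE, prev, t,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* 2. Compute the tile. */
        compute_tile(t, in, out);
        /* 3. Blocking send of the border data the next processor needs. */
        if (next < size)
            MPI_Send(out, BORDER_LEN, MPI_DOUBLE, next, t, MPI_COMM_WORLD);
    }

    if (rank == 0) printf("blocking schedule done\n");
    MPI_Finalize();
    return 0;
}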

28 Code Summary – Non-blocking Scheme
[Code listing shown as an image on the slide; a sketch of the scheme follows]
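Likewise a hedged sketch of the non-blocking scheme: while tile t is being computed, the process already receives the data needed by tile t+1 and sends the data produced by tile t-1, using MPI_Irecv/MPI_Isend with double buffering; a prologue and epilogue handle the first receive and last send. All identifiers are again hypothetical placeholders.

#include <mpi.h>
#include <stdio.h>

#define NUM_TILES  8
#define BORDER_LEN 4

/* Hypothetical per-tile computation: stands in for the real tile body. */
static void compute_tile(int t, const double *in, double *out) {
    for (int i = 0; i < BORDER_LEN; i++)
        out[i] = in[i] + t;                    /* placeholder work */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int prev = rank - 1, next = rank + 1;
    double in[2][BORDER_LEN] = {{0}}, out[2][BORDER_LEN] = {{0}};

    /* Prologue: wait for the first tile's input (nothing to overlap yet). */
    if (prev >= 0)
        MPI_Recv(in[0], BORDER_LEN, MPI_DOUBLE, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    for (int t = 0; t < NUM_TILES; t++) {
        int cur = t % 2, nxt = (t + 1) % 2;    /* double buffering */
        MPI_Request reqs[2];
        int nreq = 0;
        /* 1. Start receiving the data needed by the NEXT tile. */
        if (prev >= 0 && t + 1 < NUM_TILES)
            MPI_Irecv(in[nxt], BORDER_LEN, MPI_DOUBLE, prev, t + 1,
                      MPI_COMM_WORLD, &reqs[nreq++]);
        /* 2. Start sending the border data produced by the PREVIOUS tile. */
        if (next < size && t > 0)
            MPI_Isend(out[nxt], BORDER_LEN, MPI_DOUBLE, next, t - 1,
                      MPI_COMM_WORLD, &reqs[nreq++]);
        /* 3. Compute the current tile while both transfers are in flight. */
        compute_tile(t, in[cur], out[cur]);
        /* 4. Complete the transfers before the buffers are reused. */
        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
    }

    /* Epilogue: forward the last tile's border data. */
    if (next < size)
        MPI_Send(out[(NUM_TILES - 1) % 2], BORDER_LEN, MPI_DOUBLE, next,
                 NUM_TILES - 1, MPI_COMM_WORLD);

    if (rank == 0) printf("non-blocking schedule done\n");
    MPI_Finalize();
    return 0;
}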

29 Overview (outline slide; next section: Experimental Results)

30 Experimental Results
• 8-node SMP Linux cluster (800 MHz PIII, 128 MB RAM, kernel 2.4.20)
• MPICH v1.2.5 (--with-device=p4, --with-comm=shared)
• g++ compiler v2.95.4 (-O3)
• FastEthernet interconnection
• 2 micro-kernel benchmarks (3D): Gauss Successive Over-Relaxation (SOR), Texture Smoothing Code (TSC)
• Simulation of communication schemes

31 SOR
• Iteration space: M × N × N
• Dependence matrix: [shown on the slide; not preserved in the transcript]
• Rectangular tiling: [matrix not preserved]
• Non-rectangular tiling: [matrix not preserved]
(a generic kernel sketch follows)
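The matrices on this slide are not reproducible from the transcript. For context only, here is a minimal, self-contained C sketch of a generic Gauss-Seidel-style SOR sweep over an M × N × N iteration space (the outer dimension counts sweeps); the stencil, boundary handling and relaxation factor are illustrative assumptions and need not match the exact kernel used in the paper.

#include <stdio.h>

#define M 4      /* number of sweeps (longest dimension)      */
#define N 6      /* grid size in each of the other dimensions */

static double A[N][N];

int main(void) {
    double omega = 1.5;   /* relaxation factor (illustrative value) */

    /* Arbitrary initialization: hot boundary, cold interior. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = (i == 0 || j == 0 || i == N-1 || j == N-1) ? 1.0 : 0.0;

    /* M x N x N iteration space: t is the sweep index; the in-place update
     * creates the flow dependencies that the tiling has to respect. */
    for (int t = 0; t < M; t++)
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                A[i][j] = (1.0 - omega) * A[i][j]
                        + omega * 0.25 * (A[i-1][j] + A[i+1][j]
                                        + A[i][j-1] + A[i][j+1]);

    printf("A[N/2][N/2] = %f\n", A[N/2][N/2]);
    return 0;
}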

32 SOR [results chart]

33 SOR [results chart]

34 TSC
• Iteration space: T × N × N
• Dependence matrix: [shown on the slide; not preserved in the transcript]
• Rectangular tiling: [matrix not preserved]
• Non-rectangular tiling: [matrix not preserved]

35 TSC [results chart]

36 TSC [results chart]

37 Overview (outline slide; next section: Conclusions – Future Work)

38 Conclusions
• Automatic code generation for arbitrarily tiled spaces can be efficient
• High performance can be achieved by means of:
  - a suitable tiling transformation
  - overlapping computation with communication

39 Future Work
• Application of methodology to imperfectly nested loops and non-constant dependencies
• Investigation of hybrid programming models (MPI + OpenMP)
• Performance evaluation on advanced interconnection networks (SCI, Myrinet)

40 Questions?
http://www.cslab.ece.ntua.gr/~ndros

