
Delivering High Performance to Parallel Applications Using Advanced Scheduling. Nikolaos Drosinos, Georgios Goumas, Maria Athanasaki and Nectarios Koziris.

Presentation transcript:

1 Delivering High Performance to Parallel Applications Using Advanced Scheduling
Nikolaos Drosinos, Georgios Goumas, Maria Athanasaki and Nectarios Koziris
National Technical University of Athens, Computing Systems Laboratory
{ndros,goumas,maria,nkoziris}@cslab.ece.ntua.gr
www.cslab.ece.ntua.gr

2 Overview
• Introduction
• Background
• Code Generation
  - Computation/Data Distribution
  - Communication Schemes
  - Summary
• Experimental Results
• Conclusions – Future Work

3 Introduction
Motivation:
• A lot of theoretical work has been done on arbitrary tiling, but there are no actual experimental results!
• There is no complete method to generate code for non-rectangular tiles.

4 Introduction
Contribution:
• Complete end-to-end SPMD code generation method for arbitrarily tiled iteration spaces
• Simulation of blocking and non-blocking communication primitives
• Experimental evaluation of proposed scheduling scheme

5 Overview (outline slide; next section: Background)

6 Background
Algorithmic Model:
FOR j1 = min1 TO max1 DO
  …
  FOR jn = minn TO maxn DO
    Computation(j1, …, jn);
  ENDFOR
  …
ENDFOR
• Perfectly nested loops
• Constant flow data dependencies (D)

7 Background
Tiling:
• Popular loop transformation
• Groups iterations into atomic units
• Enhances locality in uniprocessors
• Enables coarse-grain parallelism in distributed memory systems
• Valid tiling matrix H (legality condition given below)
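The condition itself appears only as an image on the slide and is not preserved in the transcript. The standard legality requirement from the tiling literature, which is presumably what the slide shows, is that no dependence may cross a tile boundary in the negative direction:

H D \ge 0 \ \text{(elementwise)}, \qquad H = P^{-1}

where the columns of D are the constant dependence vectors and P is the matrix whose columns are the tile edge vectors.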

8 Tiling Transformation
Example:
FOR j1 = 0 TO 11 DO
  FOR j2 = 0 TO 8 DO
    A[j1, j2] := A[j1-1, j2] + A[j1-1, j2-1];
  ENDFOR
ENDFOR
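For reference, the dependence matrix D of this example (each column is one of the two flow dependence vectors read by the statement) is

D = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}

and both tiling matrices H on the next two slides satisfy the legality condition H D ≥ 0 for this D.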

9 Rectangular Tiling Transformation
P = \begin{pmatrix} 3 & 0 \\ 0 & 3 \end{pmatrix}, \qquad H = \begin{pmatrix} 1/3 & 0 \\ 0 & 1/3 \end{pmatrix}
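A minimal C sketch of what the slide-8 loop nest looks like after this rectangular 3 × 3 tiling: the outer (tile) loops enumerate tile origins and the inner (element) loops traverse the iterations of one tile. The bounds match the 12 × 9 example space; the boundary guard is added only to keep the sketch runnable.

#include <stdio.h>

#define N1 12   /* j1 ranges over 0..11 */
#define N2 9    /* j2 ranges over 0..8  */
#define T  3    /* tile size in both dimensions, i.e. P = diag(3, 3) */

static double A[N1][N2];

int main(void) {
    /* Tile loops: iterate over tile origins (jj1, jj2). */
    for (int jj1 = 0; jj1 < N1; jj1 += T)
        for (int jj2 = 0; jj2 < N2; jj2 += T)
            /* Element loops: traverse the iterations inside one tile. */
            for (int j1 = jj1; j1 < jj1 + T && j1 < N1; j1++)
                for (int j2 = jj2; j2 < jj2 + T && j2 < N2; j2++)
                    if (j1 >= 1 && j2 >= 1)   /* guard added so reads stay in bounds */
                        A[j1][j2] = A[j1-1][j2] + A[j1-1][j2-1];

    printf("A[%d][%d] = %f\n", N1-1, N2-1, A[N1-1][N2-1]);
    return 0;
}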

10 Non-rectangular Tiling Transformation
P = \begin{pmatrix} 3 & 3 \\ 0 & 3 \end{pmatrix}, \qquad H = \begin{pmatrix} 1/3 & -1/3 \\ 0 & 1/3 \end{pmatrix}
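As a point of standard tiling notation (not spelled out on the slide itself): P holds the tile edge vectors as its columns, H is its inverse, and an iteration point j is mapped to the tile

\text{tile}(j) = \lfloor H\, j \rfloor, \qquad H = P^{-1}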

11 Why Non-rectangular Tiling?
• Reduces communication
• Enables more efficient scheduling schemes
[Figure: in the example, rectangular tiling needs 8 communication points and 6 time steps; non-rectangular tiling needs 6 communication points and 5 time steps]

12 Overview (outline slide; next section: Code Generation – Computation/Data Distribution)

13 Computation Distribution
We map tiles along the longest dimension to the same processor because:
• It reduces the number of processors required
• It simplifies message passing
• It reduces total execution time when overlapping computation with communication
(a minimal mapping sketch follows)
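A minimal sketch of this mapping rule for a 2-D tile space, assuming the first tile coordinate runs along the longest dimension and the remaining coordinate is distributed cyclically over the processors (the cyclic choice is an assumption of this sketch, not something stated on the slide):

#include <stdio.h>

/* Hypothetical illustration: tiles that differ only in their coordinate along
 * the longest dimension (t1 here) get the same owner, so each processor
 * executes its whole chain of tiles sequentially. */
static int tile_owner(int t1, int t2, int num_procs) {
    (void)t1;                 /* the longest dimension does not affect ownership */
    return t2 % num_procs;    /* remaining tile coordinate selects the owner     */
}

int main(void) {
    for (int t2 = 0; t2 < 4; t2++)
        for (int t1 = 0; t1 < 4; t1++)
            printf("tile (%d,%d) -> P%d\n", t1, t2, tile_owner(t1, t2, 3) + 1);
    return 0;
}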

14 Computation Distribution
[Figure: tile-to-processor mapping for processors P1, P2, P3]

15 Data Distribution
• Computer-owns rule: each processor owns the data it computes
• Arbitrary convex iteration space, arbitrary tiling
• Rectangular local iteration and data spaces

16 Data Distribution [figure]

17 Data Distribution [figure]

18 Data Distribution [figure]

19 Overview (outline slide; next section: Code Generation – Communication Schemes)

20 Communication Schemes
With whom do I communicate?
[Figure: processors P1, P2, P3 in the tile space]

21 Communication Schemes
With whom do I communicate?
[Figure: processors P1, P2, P3]

22 Communication Schemes
What do I send?

23 Blocking Scheme
[Figure: tile execution of P1, P2, P3 over the (j1, j2) space; 12 time steps in total]

24 Non-blocking Scheme
[Figure: tile execution of P1, P2, P3 over the (j1, j2) space; 6 time steps in total]

25 Overview (outline slide; next section: Code Generation Summary)

26 Code Generation Summary
Advanced Scheduling = Suitable Tiling + Non-blocking Communication Scheme
[Flow chart: Sequential Code → Dependence Analysis → Tiling Transformation → Sequential Tiled Code → Parallelization (Computation Distribution, Data Distribution, Communication Primitives) → Parallel SPMD Code]

27 Code Summary – Blocking Scheme
[Code listing shown as an image on the slide; a sketch of the scheme follows]
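The listing on this slide is not preserved in the transcript. Below is a minimal, hedged MPI sketch of the blocking scheme it summarizes: a one-dimensional chain of processes in which each tile first receives its input border, then computes, then sends its output border, so communication and computation never overlap. NUM_TILES, BORDER_LEN and compute_tile() are hypothetical placeholders, not names from the paper.

#include <mpi.h>
#include <stdio.h>

#define NUM_TILES  8
#define BORDER_LEN 4

/* Hypothetical per-tile computation: stands in for the real tile body. */
static void compute_tile(int t, const double *in, double *out) {
    for (int i = 0; i < BORDER_LEN; i++)
        out[i] = in[i] + t;               /* placeholder work */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int prev = rank - 1, next = rank + 1;
    double in[BORDER_LEN] = {0}, out[BORDER_LEN];

    for (int t = 0; t < NUM_TILES; t++) {
        /* 1. Blocking receive of the data this tile depends on. */
        if (prev >= 0)
            MPI_Recv(in, BORDER_LEN, MPI_DOUBLE, prev, t,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* 2. Compute the tile. */
        compute_tile(t, in, out);
        /* 3. Blocking send of the border data the next processor needs. */
        if (next < size)
            MPI_Send(out, BORDER_LEN, MPI_DOUBLE, next, t, MPI_COMM_WORLD);
    }

    if (rank == 0) printf("blocking schedule done\n");
    MPI_Finalize();
    return 0;
}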

28 Code Summary – Non-blocking Scheme
[Code listing shown as an image on the slide; a sketch of the scheme follows]
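Likewise a hedged sketch of the non-blocking scheme: while tile t is being computed, the process already receives the data needed by tile t+1 and sends the data produced by tile t-1, using MPI_Irecv/MPI_Isend with double buffering; a prologue and epilogue handle the first receive and last send. All identifiers are again hypothetical placeholders.

#include <mpi.h>
#include <stdio.h>

#define NUM_TILES  8
#define BORDER_LEN 4

/* Hypothetical per-tile computation: stands in for the real tile body. */
static void compute_tile(int t, const double *in, double *out) {
    for (int i = 0; i < BORDER_LEN; i++)
        out[i] = in[i] + t;                    /* placeholder work */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int prev = rank - 1, next = rank + 1;
    double in[2][BORDER_LEN] = {{0}}, out[2][BORDER_LEN] = {{0}};

    /* Prologue: wait for the first tile's input (nothing to overlap yet). */
    if (prev >= 0)
        MPI_Recv(in[0], BORDER_LEN, MPI_DOUBLE, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    for (int t = 0; t < NUM_TILES; t++) {
        int cur = t % 2, nxt = (t + 1) % 2;    /* double buffering */
        MPI_Request reqs[2];
        int nreq = 0;
        /* 1. Start receiving the data needed by the NEXT tile. */
        if (prev >= 0 && t + 1 < NUM_TILES)
            MPI_Irecv(in[nxt], BORDER_LEN, MPI_DOUBLE, prev, t + 1,
                      MPI_COMM_WORLD, &reqs[nreq++]);
        /* 2. Start sending the border data produced by the PREVIOUS tile. */
        if (next < size && t > 0)
            MPI_Isend(out[nxt], BORDER_LEN, MPI_DOUBLE, next, t - 1,
                      MPI_COMM_WORLD, &reqs[nreq++]);
        /* 3. Compute the current tile while both transfers are in flight. */
        compute_tile(t, in[cur], out[cur]);
        /* 4. Complete the transfers before the buffers are reused. */
        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
    }

    /* Epilogue: forward the last tile's border data. */
    if (next < size)
        MPI_Send(out[(NUM_TILES - 1) % 2], BORDER_LEN, MPI_DOUBLE, next,
                 NUM_TILES - 1, MPI_COMM_WORLD);

    if (rank == 0) printf("non-blocking schedule done\n");
    MPI_Finalize();
    return 0;
}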

29 Overview (outline slide; next section: Experimental Results)

30 Experimental Results
• 8-node SMP Linux cluster (800 MHz PIII, 128 MB RAM, kernel 2.4.20)
• MPICH v1.2.5 (--with-device=p4, --with-comm=shared)
• g++ compiler v2.95.4 (-O3)
• FastEthernet interconnection
• 2 micro-kernel benchmarks (3D): Gauss Successive Over-Relaxation (SOR), Texture Smoothing Code (TSC)
• Simulation of communication schemes

31 SOR
• Iteration space: M × N × N
• Dependence matrix: [shown on the slide; not preserved in the transcript]
• Rectangular tiling: [matrix not preserved]
• Non-rectangular tiling: [matrix not preserved]
(a generic kernel sketch follows)
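The matrices on this slide are not reproducible from the transcript. For context only, here is a minimal, self-contained C sketch of a generic Gauss-Seidel-style SOR sweep over an M × N × N iteration space (the outer dimension counts sweeps); the stencil, boundary handling and relaxation factor are illustrative assumptions and need not match the exact kernel used in the paper.

#include <stdio.h>

#define M 4      /* number of sweeps (longest dimension)      */
#define N 6      /* grid size in each of the other dimensions */

static double A[N][N];

int main(void) {
    double omega = 1.5;   /* relaxation factor (illustrative value) */

    /* Arbitrary initialization: hot boundary, cold interior. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = (i == 0 || j == 0 || i == N-1 || j == N-1) ? 1.0 : 0.0;

    /* M x N x N iteration space: t is the sweep index; the in-place update
     * creates the flow dependencies that the tiling has to respect. */
    for (int t = 0; t < M; t++)
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                A[i][j] = (1.0 - omega) * A[i][j]
                        + omega * 0.25 * (A[i-1][j] + A[i+1][j]
                                        + A[i][j-1] + A[i][j+1]);

    printf("A[N/2][N/2] = %f\n", A[N/2][N/2]);
    return 0;
}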

32 SOR [results chart]

33 SOR [results chart]

34 TSC
• Iteration space: T × N × N
• Dependence matrix: [shown on the slide; not preserved in the transcript]
• Rectangular tiling: [matrix not preserved]
• Non-rectangular tiling: [matrix not preserved]

35 TSC [results chart]

36 TSC [results chart]

37 Overview (outline slide; next section: Conclusions – Future Work)

38 Conclusions
• Automatic code generation for arbitrarily tiled spaces can be efficient
• High performance can be achieved by means of:
  - a suitable tiling transformation
  - overlapping computation with communication

39 Future Work
• Application of methodology to imperfectly nested loops and non-constant dependencies
• Investigation of hybrid programming models (MPI + OpenMP)
• Performance evaluation on advanced interconnection networks (SCI, Myrinet)

40 Questions?
http://www.cslab.ece.ntua.gr/~ndros

