Delivering High Performance to Parallel Applications Using Advanced Scheduling

Nikolaos Drosinos, Georgios Goumas, Maria Athanasaki and Nectarios Koziris
National Technical University of Athens, Computing Systems Laboratory
{ndros,goumas,maria,nkoziris}@cslab.ece.ntua.gr
www.cslab.ece.ntua.gr
Overview
- Introduction
- Background
- Code Generation
  - Computation/Data Distribution
  - Communication Schemes
  - Summary
- Experimental Results
- Conclusions – Future Work
Introduction

Motivation:
- A lot of theoretical work has been done on arbitrary tiling, but there are no actual experimental results.
- There is no complete method to generate code for non-rectangular tiles.
Introduction

Contribution:
- A complete end-to-end SPMD code generation method for arbitrarily tiled iteration spaces
- Simulation of blocking and non-blocking communication primitives
- Experimental evaluation of the proposed scheduling scheme
Background

Algorithmic model: perfectly nested loops with constant flow data dependencies (D):

    FOR j1 = min1 TO max1 DO
      ...
        FOR jn = minn TO maxn DO
          Computation(j1, ..., jn);
        ENDFOR
      ...
    ENDFOR
Background

Tiling:
- Popular loop transformation
- Groups iterations into atomic units
- Enhances locality on uniprocessors
- Enables coarse-grain parallelism on distributed memory systems
- Defined by a tiling matrix H; the validity condition is restated below
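The validity condition for H appeared on the slide as an image. For completeness, the standard condition (due to Irigoin and Triolet) is as follows: with P the tile-shape matrix and H = P^{-1}, iteration j belongs to tile ⌊Hj⌋, and the tiling is valid when no dependence points backwards across a tile boundary:

    \[
      \lfloor H\,j \rfloor = \text{tile of } j, \qquad H = P^{-1}, \qquad
      H\,d \ \ge\ 0 \ \text{(componentwise) for every dependence column } d \text{ of } D .
    \]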
Tiling Transformation

Example:

    FOR j1 = 0 TO 11 DO
      FOR j2 = 0 TO 8 DO
        A[j1, j2] := A[j1-1, j2] + A[j1-1, j2-1];
      ENDFOR
    ENDFOR
Rectangular Tiling Transformation

    P = | 3  0 |        H = P^{-1} = | 1/3   0  |
        | 0  3 |                     |  0   1/3 |
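To make the transformation concrete, here is a minimal C++ sketch of the example loop nest and its 3x3 rectangularly tiled counterpart. The loop bounds start at 1 so the stencil stays in bounds; the array size and boundary handling are illustrative assumptions:

    #include <algorithm>

    int main() {
        const int N1 = 12, N2 = 9;     // iteration space 0..11 x 0..8
        static double A[N1][N2] = {};

        // Untiled nest: A[j1][j2] depends on A[j1-1][j2] and A[j1-1][j2-1].
        for (int j1 = 1; j1 < N1; ++j1)
            for (int j2 = 1; j2 < N2; ++j2)
                A[j1][j2] = A[j1-1][j2] + A[j1-1][j2-1];

        // 3x3 rectangular tiling (P = [3 0; 0 3]): the outer loops walk over
        // tiles, the inner loops over the iterations of one tile. Traversing
        // tiles in lexicographic order respects both dependencies.
        // (The two nests are alternatives; they are shown together only for
        // comparison.)
        const int T = 3;
        for (int t1 = 0; t1 < N1; t1 += T)
            for (int t2 = 0; t2 < N2; t2 += T)
                for (int j1 = std::max(t1, 1); j1 < std::min(t1 + T, N1); ++j1)
                    for (int j2 = std::max(t2, 1); j2 < std::min(t2 + T, N2); ++j2)
                        A[j1][j2] = A[j1-1][j2] + A[j1-1][j2-1];
        return 0;
    }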
Non-rectangular Tiling Transformation

    P = | 3  3 |        H = P^{-1} = | 1/3  -1/3 |
        | 0  3 |                     |  0    1/3 |
Why Non-rectangular Tiling?
- Reduces communication (in the example figure: 6 communication points instead of 8)
- Enables more efficient scheduling schemes (5 time steps instead of 6)
Computation Distribution

We map tiles along the longest dimension to the same processor because this:
- reduces the number of processors required,
- simplifies message passing, and
- reduces total execution time when overlapping computation with communication.
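Under this rule, in a 2D tile space where the second tile coordinate runs along the longest dimension, the owner of a tile is determined by the remaining coordinate alone. A minimal sketch; the cyclic assignment and all names are illustrative assumptions, not the authors' exact mapping:

    // Tiles (t1, t2): every tile sharing t1 (a whole row along the longest
    // dimension, t2) is computed by the same processor.
    int owner_of_tile(int t1, int /*t2: stays local*/, int num_procs) {
        return t1 % num_procs;   // cyclic over the non-longest dimension
    }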
Computation Distribution (figure: rows of tiles assigned to processors P1, P2, P3)
Data Distribution
- Computes-owns rule: each processor owns the data it computes
- Arbitrary convex iteration space, arbitrary tiling
- Rectangular local iteration and data spaces (index translation sketched below)
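Because each processor's share of the (possibly non-rectangular) tiled space is stored in a dense rectangular local array, the SPMD code needs a global-to-local index translation. A minimal 2D sketch, assuming a row-wise distribution and a one-row halo for incoming boundary data; all names and the halo width are illustrative assumptions:

    // Global iteration (j1, j2) -> offset into this processor's local
    // rectangular array, which holds rows [first_row, last_row] plus a
    // halo row mirroring the predecessor's last computed row.
    const int HALO = 1;
    int local_offset(int j1, int j2, int first_row, int row_len) {
        return (j1 - first_row + HALO) * row_len + j2;
    }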
Data Distribution (figures: construction of the rectangular local data spaces; not transcribed)
Communication Schemes (figures, processors P1-P3)
- With whom do I communicate?
- What do I send?
Blocking Scheme (figure: tile space with axes j1, j2 and processors P1-P3; the schedule completes in 12 time steps)
Non-blocking Scheme (figure: same tile space; the schedule completes in 6 time steps)
Code Generation Summary

Advanced Scheduling = Suitable Tiling + Non-blocking Communication Scheme

Pipeline (reconstructed from the slide's diagram):
Sequential Code → Tiling (Dependence Analysis, Tiling Transformation) → Sequential Tiled Code → Parallelization (Computation Distribution, Data Distribution, Communication Primitives) → Parallel SPMD Code
Code Summary – Blocking Scheme (the slide's listing was an image)
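As a substitute, here is a minimal MPI sketch of a blocking scheme of this kind: each processor alternates receive / compute / send, one tile per time step, so communication is never overlapped with computation. The helper compute_tile, the buffers, and the 1D predecessor/successor chain are hypothetical simplifications, not the authors' generated code:

    #include <mpi.h>

    void compute_tile(int t);   // hypothetical: all iterations of local tile t

    void blocking_scheme(int rank, int nprocs, int num_local_tiles,
                         double* recv_buf, double* send_buf, int count) {
        const int TAG = 0;
        MPI_Status status;
        for (int t = 0; t < num_local_tiles; ++t) {
            if (rank > 0)           // wait for boundary data from the predecessor
                MPI_Recv(recv_buf, count, MPI_DOUBLE, rank - 1, TAG,
                         MPI_COMM_WORLD, &status);
            compute_tile(t);
            if (rank < nprocs - 1)  // forward boundary data to the successor
                MPI_Send(send_buf, count, MPI_DOUBLE, rank + 1, TAG,
                         MPI_COMM_WORLD);
        }
    }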
Code Summary – Non-blocking Scheme (the slide's listing was an image)
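A minimal sketch of the non-blocking counterpart under the same hypothetical names: boundary transfers are posted with MPI_Isend/MPI_Irecv and proceed while the current tile is computed, which is what compresses the 12 time steps of the blocking example into 6:

    #include <mpi.h>

    void compute_tile(int t);   // hypothetical, as before

    void non_blocking_scheme(int rank, int nprocs, int num_local_tiles,
                             double* recv_buf, double* send_buf, int count) {
        const int TAG = 0;
        for (int t = 0; t < num_local_tiles; ++t) {
            MPI_Request reqs[2];
            MPI_Status  stats[2];
            int nreq = 0;
            if (rank < nprocs - 1)  // ship the previous tile's boundary results
                MPI_Isend(send_buf, count, MPI_DOUBLE, rank + 1, TAG,
                          MPI_COMM_WORLD, &reqs[nreq++]);
            if (rank > 0)           // prefetch boundary data for the next tile
                MPI_Irecv(recv_buf, count, MPI_DOUBLE, rank - 1, TAG,
                          MPI_COMM_WORLD, &reqs[nreq++]);
            compute_tile(t);        // overlaps with the outstanding transfers
            MPI_Waitall(nreq, reqs, stats);
            // Real generated code double-buffers send_buf/recv_buf so the
            // prefetch cannot clobber data still in use (omitted here).
        }
    }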
Experimental Results

Setup:
- 8-node SMP Linux cluster (800 MHz PIII, 128 MB RAM, kernel 2.4.20)
- MPICH v1.2.5 (--with-device=p4, --with-comm=shared)
- g++ compiler v2.95.4 (-O3)
- FastEthernet interconnection
- 2 micro-kernel benchmarks (3D): Gauss Successive Over-Relaxation (SOR), Texture Smoothing Code (TSC)
- Simulation of communication schemes
SOR
- Iteration space: M × N × N
- Dependence matrix, rectangular tiling, and non-rectangular tiling matrices: shown as images, not transcribed (a generic SOR kernel is sketched below for orientation)
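Since the slide's matrices were lost, the following textbook-style SOR sweep over an M × N × N space is given for orientation only; it is a generic variant with the sweep index unrolled into the iteration space, not necessarily the exact kernel or dependence matrix the authors tiled:

    #include <vector>

    // Generic SOR-style stencil: outer dimension t is the sweep dimension,
    // so iteration (t, i, j) reads values from sweeps t and t-1.
    void sor(std::vector<std::vector<std::vector<double>>>& A,
             int M, int N, double w) {   // w: relaxation factor (assumed)
        for (int t = 1; t < M; ++t)
            for (int i = 1; i < N - 1; ++i)
                for (int j = 1; j < N - 1; ++j)
                    A[t][i][j] = (1 - w) * A[t-1][i][j]
                               + 0.25 * w * (A[t][i-1][j] + A[t][i][j-1]
                                           + A[t-1][i+1][j] + A[t-1][i][j+1]);
    }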
SOR: performance results (figures, not transcribed)
TSC
- Iteration space: T × N × N
- Dependence matrix, rectangular tiling, and non-rectangular tiling matrices: shown as images, not transcribed
TSC: performance results (figures, not transcribed)
Conclusions
- Automatic code generation for arbitrarily tiled iteration spaces can be efficient.
- High performance can be achieved by means of a suitable tiling transformation and by overlapping computation with communication.
Future Work
- Application of the methodology to imperfectly nested loops and non-constant dependencies
- Investigation of hybrid programming models (MPI + OpenMP)
- Performance evaluation on advanced interconnection networks (SCI, Myrinet)
Questions?
http://www.cslab.ece.ntua.gr/~ndros