Advanced Hybrid MPI/OpenMP Parallelization Paradigms for Nested Loop Algorithms onto Clusters of SMPs

Nikolaos Drosinos and Nectarios Koziris
National Technical University of Athens, Computing Systems Laboratory
{ndros,nkoziris}@cslab.ece.ntua.gr
www.cslab.ece.ntua.gr

EuroPVM/MPI 2003, 2/10/2003
Overview
- Introduction
- Pure MPI Model
- Hybrid MPI-OpenMP Models
  - Hyperplane Scheduling
  - Fine-grain Model
  - Coarse-grain Model
- Experimental Results
- Conclusions – Future Work
Introduction
Motivation:
- SMP clusters
- Hybrid programming models
- Mostly fine-grain MPI-OpenMP paradigms
- Mostly DOALL parallelization
Introduction
Contribution:
- 3 programming models for the parallelization of nested loop algorithms:
  - pure MPI
  - fine-grain hybrid MPI-OpenMP
  - coarse-grain hybrid MPI-OpenMP
- Advanced hyperplane scheduling:
  - minimize synchronization need
  - overlap computation with communication
Introduction
Algorithmic Model:

  FOR j_0 = min_0 TO max_0 DO
    ...
    FOR j_{n-1} = min_{n-1} TO max_{n-1} DO
      Computation(j_0, ..., j_{n-1});
    ENDFOR
    ...
  ENDFOR

- Perfectly nested loops
- Constant flow data dependencies
Introduction
Target Architecture: SMP clusters
Pure MPI Model
- Tiling transformation groups iterations into atomic execution units (tiles)
- Pipelined execution
- Overlapping computation with communication
- Makes no distinction between inter-node and intra-node communication
Pure MPI Model
Example:

  FOR j_1 = 0 TO 9 DO
    FOR j_2 = 0 TO 7 DO
      A[j_1, j_2] := A[j_1 - 1, j_2] + A[j_1, j_2 - 1];
    ENDFOR
  ENDFOR
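For reference, a direct sequential C rendition of this example recurrence. The indices are shifted by one and the boundary initialization is an illustrative assumption, so that the j-1 accesses stay inside the array; it is a sketch of the kernel, not code from the paper.

    #include <stdio.h>

    #define N1 10   /* j1 takes 10 values, as in the slide's 0..9 range */
    #define N2 8    /* j2 takes 8 values, as in the slide's 0..7 range  */

    int main(void)
    {
        /* one extra row and column of boundary values so that the
           j-1 accesses of the recurrence stay inside the array */
        double A[N1 + 1][N2 + 1];

        /* illustrative boundary/initial values */
        for (int i = 0; i <= N1; i++)
            for (int j = 0; j <= N2; j++)
                A[i][j] = 1.0;

        /* the 2D recurrence from the slide: constant flow
           dependencies (1,0) and (0,1) */
        for (int j1 = 1; j1 <= N1; j1++)
            for (int j2 = 1; j2 <= N2; j2++)
                A[j1][j2] = A[j1 - 1][j2] + A[j1][j2 - 1];

        printf("A[%d][%d] = %f\n", N1, N2, A[N1][N2]);
        return 0;
    }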
Pure MPI Model
[Figure: 4 MPI nodes mapped onto 2 SMP nodes (NODE0, NODE1) with 2 CPUs each (CPU0, CPU1); two animation steps of the pipelined tile execution]
Pure MPI Model

  tile_0 = nod_0;
  ...
  tile_{n-2} = nod_{n-2};
  FOR tile_{n-1} = 0 TO ... DO
    Pack(snd_buf, tile_{n-1} - 1, nod);
    MPI_Isend(snd_buf, dest(nod));
    MPI_Irecv(recv_buf, src(nod));
    Compute(tile);
    MPI_Waitall;
    Unpack(recv_buf, tile_{n-1} + 1, nod);
  ENDFOR
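For concreteness, a simplified C/MPI sketch of pipelined tile execution for the 2D example above. It assumes a block distribution of j_1 across processes and tiles of width TILE along j_2, and it receives the halo row it needs before computing each tile and sends the produced boundary row afterwards; the paper's scheme additionally overlaps these transfers with the computation of the current tile via MPI_Isend/MPI_Irecv, as in the pseudocode above. All sizes and helper choices are illustrative assumptions.

    #include <mpi.h>
    #include <stdlib.h>

    #define ROWS 64      /* local rows of j1 per process (illustrative) */
    #define N2   8192    /* global extent of j2 (illustrative)          */
    #define TILE 256     /* tile width along j2 (illustrative)          */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int up   = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int down = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        /* local block plus one halo row (row 0) coming from the upper neighbour */
        double (*A)[N2] = malloc((ROWS + 1) * sizeof *A);
        for (int i = 0; i <= ROWS; i++)
            for (int j = 0; j < N2; j++)
                A[i][j] = 1.0;

        for (int t = 0; t < N2 / TILE; t++) {
            int lo = t * TILE, hi = lo + TILE;

            /* receive the halo row for this tile from the upstream process
               (no-op on rank 0, since up == MPI_PROC_NULL) */
            MPI_Recv(&A[0][lo], TILE, MPI_DOUBLE, up, t, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

            /* compute the tile: same recurrence as the sequential example */
            for (int i = 1; i <= ROWS; i++)
                for (int j = lo; j < hi; j++)
                    if (j > 0)
                        A[i][j] = A[i - 1][j] + A[i][j - 1];

            /* forward this tile's last row to the downstream process */
            MPI_Send(&A[ROWS][lo], TILE, MPI_DOUBLE, down, t, MPI_COMM_WORLD);
        }

        free(A);
        MPI_Finalize();
        return 0;
    }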
Hyperplane Scheduling
- Implements coarse-grain parallelism assuming inter-tile data dependencies
- Tiles are organized into data-independent subsets (groups)
- Tiles of the same group can be concurrently executed by multiple threads
- Barrier synchronization between threads
Hyperplane Scheduling
[Figure: tile groups for 2 MPI nodes (NODE0, NODE1) x 2 OpenMP threads per node (CPU0, CPU1); two animation steps]
Hyperplane Scheduling

  #pragma omp parallel
  {
    group_0 = nod_0;
    ...
    group_{n-2} = nod_{n-2};
    tile_0 = nod_0 * m_0 + th_0;
    ...
    tile_{n-2} = nod_{n-2} * m_{n-2} + th_{n-2};
    FOR(group_{n-1}) {
      tile_{n-1} = group_{n-1} - ...;
      if (0 <= tile_{n-1} <= ...)
        compute(tile);
      #pragma omp barrier
    }
  }
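To make the notion of a group concrete, here is a minimal C/OpenMP sketch of hyperplane scheduling over a 2D tile grid: tiles on the same anti-diagonal g = t1 + t2 have no mutual dependencies, so they form one group that threads execute concurrently, with a barrier separating consecutive groups. Note that the slide assigns tiles to threads explicitly through thread ids, whereas this sketch uses an omp for work-sharing loop, whose implicit barrier plays the role of the explicit one; grid sizes and the per-tile work are illustrative assumptions.

    #include <stdio.h>

    #define T1 8
    #define T2 8

    /* stand-in for the real tile computation; depends on the
       left and upper neighbouring tiles, i.e. on the previous group */
    static void compute_tile(int t1, int t2, double done[T1][T2])
    {
        double left = (t2 > 0) ? done[t1][t2 - 1] : 0.0;
        double up   = (t1 > 0) ? done[t1 - 1][t2] : 0.0;
        done[t1][t2] = left + up + 1.0;
    }

    int main(void)
    {
        double done[T1][T2] = {{0.0}};

        #pragma omp parallel
        {
            /* groups are executed in order; within a group, tiles are independent */
            for (int g = 0; g <= (T1 - 1) + (T2 - 1); g++) {
                /* distribute the tiles of group g over the threads; the implicit
                   barrier at the end of "omp for" separates consecutive groups */
                #pragma omp for
                for (int t1 = 0; t1 < T1; t1++) {
                    int t2 = g - t1;
                    if (t2 >= 0 && t2 < T2)
                        compute_tile(t1, t2, done);
                }
            }
        }

        printf("last tile value: %f\n", done[T1 - 1][T2 - 1]);
        return 0;
    }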
Fine-grain Model
- Incremental parallelization of the computationally intensive parts
- Relatively straightforward to derive from pure MPI
- Threads are (re)spawned for each computation phase
- Inter-node communication outside of the multi-threaded part
- Thread synchronization through the implicit barrier of the omp parallel directive
Fine-grain Model

  FOR(group_{n-1}) {
    Pack(snd_buf, tile_{n-1} - 1, nod);
    MPI_Isend(snd_buf, dest(nod));
    MPI_Irecv(recv_buf, src(nod));
    #pragma omp parallel
    {
      thread_id = omp_get_thread_num();
      if (valid(tile, thread_id, group_{n-1}))
        Compute(tile);
    }
    MPI_Waitall;
    Unpack(recv_buf, tile_{n-1} + 1, nod);
  }
Coarse-grain Model
- SPMD paradigm
- Requires more programming effort
- Threads are only spawned once
- Inter-node communication inside the multi-threaded part (requires MPI_THREAD_MULTIPLE; see the initialization sketch below)
- Thread synchronization through an explicit barrier (omp barrier directive)
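A minimal sketch of how a hybrid code would request this thread-support level from the MPI library at start-up, using the MPI-2 MPI_Init_thread call; the check-and-abort policy is an illustrative choice, not taken from the paper.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;

        /* ask the MPI library to allow MPI calls from any thread */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        if (provided < MPI_THREAD_MULTIPLE) {
            fprintf(stderr, "MPI library provides thread level %d only\n", provided);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        /* ... spawn the OpenMP parallel region and run the coarse-grain scheme ... */

        MPI_Finalize();
        return 0;
    }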
Coarse-grain Model

  #pragma omp parallel
  {
    thread_id = omp_get_thread_num();
    FOR(group_{n-1}) {
      #pragma omp master
      {
        Pack(snd_buf, tile_{n-1} - 1, nod);
        MPI_Isend(snd_buf, dest(nod));
        MPI_Irecv(recv_buf, src(nod));
      }
      if (valid(tile, thread_id, group_{n-1}))
        Compute(tile);
      #pragma omp master
      {
        MPI_Waitall;
        Unpack(recv_buf, tile_{n-1} + 1, nod);
      }
      #pragma omp barrier
    }
  }
Summary: Fine-grain vs Coarse-grain

  Fine-grain                                          | Coarse-grain
  ----------------------------------------------------+------------------------------------------------------
  Threads are re-spawned for each computation phase   | Threads are only spawned once
  Inter-node MPI communication outside of the         | Inter-node MPI communication inside the
  multi-threaded region                               | multi-threaded region, assumed by the master thread
  Intra-node synchronization through the implicit     | Intra-node synchronization through an explicit
  barrier of omp parallel                             | OpenMP barrier
Experimental Results
- 8-node SMP Linux cluster (800 MHz PIII, 128 MB RAM, kernel 2.4.20)
- MPICH v1.2.5 (--with-device=ch_p4, --with-comm=shared)
- Intel C++ compiler 7.0 (-O3 -mcpu=pentiumpro -static)
- FastEthernet interconnection
- ADI micro-kernel benchmark (3D)
Alternating Direction Implicit (ADI)
- Unitary data dependencies
- 3D iteration space (X x Y x Z)
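The benchmark code itself is not reproduced on the slides; purely as an illustration of the dependence shape the models target, a 3D loop nest with unit flow dependencies along all three dimensions looks like the following. Array sizes and the update formula are assumptions, not the actual ADI kernel.

    /* Illustrative 3D loop nest with unit flow dependencies (1,0,0), (0,1,0), (0,0,1).
       This is NOT the actual ADI kernel, only a stand-in with the same dependence shape. */
    #include <stdio.h>

    #define X 64
    #define Y 64
    #define Z 128

    static double A[X][Y][Z];

    int main(void)
    {
        /* illustrative initialization */
        for (int i = 0; i < X; i++)
            for (int j = 0; j < Y; j++)
                for (int k = 0; k < Z; k++)
                    A[i][j][k] = 1.0;

        /* each point depends on its predecessor along every dimension */
        for (int i = 1; i < X; i++)
            for (int j = 1; j < Y; j++)
                for (int k = 1; k < Z; k++)
                    A[i][j][k] += A[i - 1][j][k] + A[i][j - 1][k] + A[i][j][k - 1];

        printf("A[%d][%d][%d] = %e\n", X - 1, Y - 1, Z - 1, A[X - 1][Y - 1][Z - 1]);
        return 0;
    }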
Experimental Results – performance charts (figures; titles only):
- ADI – 4 nodes
- ADI – 4 nodes (X < Y and X > Y cases)
- ADI X=512 Y=512 Z=8192 – 4 nodes
- ADI X=128 Y=512 Z=8192 – 4 nodes
- ADI X=512 Y=128 Z=8192 – 4 nodes
- ADI – 2 nodes
- ADI – 2 nodes (X < Y and X > Y cases)
- ADI X=128 Y=512 Z=8192 – 2 nodes
- ADI X=256 Y=512 Z=8192 – 2 nodes
- ADI X=512 Y=512 Z=8192 – 2 nodes
- ADI X=512 Y=256 Z=8192 – 2 nodes
- ADI X=512 Y=128 Z=8192 – 2 nodes
- ADI X=128 Y=512 Z=8192 – 2 nodes (computation vs communication breakdown)
- ADI X=512 Y=128 Z=8192 – 2 nodes (computation vs communication breakdown)
Conclusions
- Nested loop algorithms with arbitrary data dependencies can be adapted to the hybrid parallel programming paradigm
- Hybrid models can be competitive with the pure MPI paradigm
- The coarse-grain hybrid model can be more efficient than the fine-grain one, but is also more complicated
- Programming efficiently in OpenMP is not easier than programming efficiently in MPI
Future Work
- Application of methodology to real applications and benchmarks
- Work balancing for coarse-grain model
- Performance evaluation on advanced interconnection networks (SCI, Myrinet)
- Generalization as compiler technique
Questions?
http://www.cslab.ece.ntua.gr/~ndros