Advanced Hybrid MPI/OpenMP Parallelization Paradigms for Nested Loop Algorithms onto Clusters of SMPs
Nikolaos Drosinos and Nectarios Koziris
National Technical University of Athens, Computing Systems Laboratory
Overview
  Introduction
  Pure MPI Model
  Hybrid MPI-OpenMP Models
    Hyperplane Scheduling
    Fine-grain Model
    Coarse-grain Model
  Experimental Results
  Conclusions – Future Work
Introduction
Motivation:
  SMP clusters
  Hybrid programming models
  Existing approaches: mostly fine-grain MPI-OpenMP paradigms, mostly DOALL parallelization
Introduction
Contribution:
  3 programming models for the parallelization of nested loop algorithms:
    pure MPI
    fine-grain hybrid MPI-OpenMP
    coarse-grain hybrid MPI-OpenMP
  Advanced hyperplane scheduling:
    minimizes synchronization needs
    overlaps computation with communication
Introduction
Algorithmic Model:
  FOR j_0 = min_0 TO max_0 DO
    ...
      FOR j_{n-1} = min_{n-1} TO max_{n-1} DO
        Computation(j_0, ..., j_{n-1});
      ENDFOR
    ...
  ENDFOR
  Perfectly nested loops
  Constant flow data dependencies
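For instance, an n = 2 instance of this model with constant flow dependence vectors (1,0) and (0,1), rendered in C (array A and bounds N0, N1 are illustrative):

  #define N0 10
  #define N1 8
  double A[N0][N1];

  void example(void)
  {
      /* perfectly nested loop; constant flow dependencies (1,0) and (0,1) */
      for (int j0 = 1; j0 < N0; j0++)
          for (int j1 = 1; j1 < N1; j1++)
              A[j0][j1] = A[j0 - 1][j1] + A[j0][j1 - 1];
  }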
Introduction
Target Architecture: SMP clusters
Pure MPI Model
  Tiling transformation groups iterations into atomic execution units (tiles)
  Pipelined execution
  Overlapping of computation with communication
  No distinction between inter-node and intra-node communication
Pure MPI Model
Example:
  FOR j_1 = 0 TO 9 DO
    FOR j_2 = 0 TO 7 DO
      A[j_1, j_2] := A[j_1 - 1, j_2] + A[j_1, j_2 - 1];
    ENDFOR
  ENDFOR
Pure MPI Model
[Figure: pure MPI mapping of the tile space onto 4 MPI nodes, i.e. 2 SMP nodes (NODE0, NODE1) with 2 CPUs each (CPU0, CPU1)]
Pure MPI Model
  tile_0 = nod_0;
  ...
  tile_{n-2} = nod_{n-2};
  FOR tile_{n-1} = 0 TO ... DO
    Pack(snd_buf, tile_{n-1} - 1, nod);
    MPI_Isend(snd_buf, dest(nod));
    MPI_Irecv(recv_buf, src(nod));
    Compute(tile);
    MPI_Waitall;
    Unpack(recv_buf, tile_{n-1} + 1, nod);
  ENDFOR
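A simplified C/MPI sketch of this pipelined execution for the 2D example above: rows are block-distributed over MPI processes and tiles are strips of TILE_W columns. For clarity it uses blocking MPI_Send/MPI_Recv per tile, whereas the model above additionally overlaps the communication of neighbouring tiles with the current computation via MPI_Isend/MPI_Irecv/MPI_Waitall. The array layout, sizes, and 1D distribution are assumptions, not the talk's exact implementation:

  /* Simplified pipelined tile execution of the 2D recurrence
   * A[j1][j2] = A[j1-1][j2] + A[j1][j2-1] (pure MPI, one process per CPU).
   * Rows are block-distributed; tiles are strips of TILE_W columns.        */
  #include <mpi.h>
  #include <stdlib.h>

  #define N1 1024            /* global rows (j1)                 */
  #define N2 1024            /* global columns (j2)              */
  #define TILE_W 64          /* tile width along j2              */

  int main(int argc, char **argv)
  {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      int rows = N1 / size;                       /* local rows (assume divisible)  */
      int up   = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
      int down = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

      /* local block with one halo row on top and one halo column on the left;
       * calloc zeroes the halos, which act as the boundary condition          */
      double *A = calloc((size_t)(rows + 1) * (N2 + 1), sizeof(*A));
      #define IDX(i, j) ((i) * (N2 + 1) + (j))

      for (int t = 0; t < N2 / TILE_W; t++) {          /* pipelined tile loop  */
          int j2lo = 1 + t * TILE_W, j2hi = j2lo + TILE_W;

          /* receive the halo row of this tile from the upstream process */
          MPI_Recv(&A[IDX(0, j2lo)], TILE_W, MPI_DOUBLE, up, t,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);

          for (int i = 1; i <= rows; i++)              /* compute the tile     */
              for (int j = j2lo; j < j2hi; j++)
                  A[IDX(i, j)] = A[IDX(i - 1, j)] + A[IDX(i, j - 1)];

          /* forward the last local row of this tile to the downstream process */
          MPI_Send(&A[IDX(rows, j2lo)], TILE_W, MPI_DOUBLE, down, t, MPI_COMM_WORLD);
      }

      free(A);
      MPI_Finalize();
      return 0;
  }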
Hyperplane Scheduling
  Implements coarse-grain parallelism in the presence of inter-tile data dependencies
  Tiles are organized into data-independent subsets (groups)
  Tiles of the same group can be executed concurrently by multiple threads
  Barrier synchronization between threads separates successive groups
Hyperplane Scheduling
[Figure: hyperplane (group) schedule for 2 MPI nodes x 2 OpenMP threads, on 2 SMP nodes (NODE0, NODE1) with 2 CPUs each (CPU0, CPU1)]
Hyperplane Scheduling
  #pragma omp parallel
  {
    group_0 = nod_0;
    ...
    group_{n-2} = nod_{n-2};
    tile_0 = nod_0 * m_0 + th_0;
    ...
    tile_{n-2} = nod_{n-2} * m_{n-2} + th_{n-2};
    FOR(group_{n-1}) {
      tile_{n-1} = group_{n-1} - ...;
      if(0 <= tile_{n-1} <= ...)
        compute(tile);
      #pragma omp barrier
    }
  }
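A minimal OpenMP-only sketch of this hyperplane schedule for a 2D tile grid, where tile (i, j) depends on tiles (i-1, j) and (i, j-1): all tiles with i + j = g form group g, each thread owns one row of tiles (mirroring tile_0 = nod_0 * m_0 + th_0 above), and an explicit barrier separates groups. NTI, NTJ, and compute_tile are illustrative names, not part of the original code:

  #include <omp.h>
  #include <stdio.h>

  #define NTI 4          /* tile rows: one OpenMP thread per row (assumption) */
  #define NTJ 8          /* tile columns                                      */

  /* placeholder for the real tile computation */
  static void compute_tile(int g, int i, int j) { printf("group %d: tile (%d,%d)\n", g, i, j); }

  void hyperplane_schedule(void)
  {
      #pragma omp parallel num_threads(NTI)
      {
          int i = omp_get_thread_num();              /* this thread's tile row      */
          for (int g = 0; g < NTI + NTJ - 1; g++) {  /* groups executed in sequence */
              int j = g - i;                         /* the row's tile on group g   */
              if (j >= 0 && j < NTJ)                 /* validity check              */
                  compute_tile(g, i, j);
              #pragma omp barrier  /* all tiles of group g finish before group g+1  */
          }
      }
  }

  int main(void) { hyperplane_schedule(); return 0; }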
Fine-grain Model
  Incremental parallelization of the computationally intensive parts
  Relatively straightforward to derive from the pure MPI code
  Threads are (re)spawned at every parallel computation region
  Inter-node communication lies outside of the multi-threaded part
  Thread synchronization through the implicit barrier of the omp parallel directive
Fine-grain Model
  FOR(group_{n-1}) {
    Pack(snd_buf, tile_{n-1} - 1, nod);
    MPI_Isend(snd_buf, dest(nod));
    MPI_Irecv(recv_buf, src(nod));
    #pragma omp parallel
    {
      thread_id = omp_get_thread_num();
      if(valid(tile, thread_id, group_{n-1}))
        Compute(tile);
    }
    MPI_Waitall;
    Unpack(recv_buf, tile_{n-1} + 1, nod);
  }
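A compilable skeleton of the fine-grain scheme with the MPI calls written out in full. The helper routines (pack, unpack, valid_tile, compute_tile, dest_rank, src_rank) are assumed and kept abstract, as in the pseudocode above; they are not part of the original code:

  #include <mpi.h>
  #include <omp.h>

  /* assumed helpers, kept abstract as in the pseudocode above */
  extern void pack(double *buf, int group, int rank);
  extern void unpack(const double *buf, int group, int rank);
  extern int  valid_tile(int group, int thread_id);      /* maps (group, thread) to a tile */
  extern void compute_tile(int group, int thread_id);
  extern int  dest_rank(int rank), src_rank(int rank);

  void fine_grain(int rank, int n_groups, double *snd_buf, double *rcv_buf, int buf_len)
  {
      for (int group = 0; group < n_groups; group++) {
          MPI_Request req[2];

          /* inter-node communication: issued by the single MPI thread, outside OpenMP */
          pack(snd_buf, group - 1, rank);                 /* data produced in the previous group */
          MPI_Isend(snd_buf, buf_len, MPI_DOUBLE, dest_rank(rank), 0, MPI_COMM_WORLD, &req[0]);
          MPI_Irecv(rcv_buf, buf_len, MPI_DOUBLE, src_rank(rank),  0, MPI_COMM_WORLD, &req[1]);

          /* computation: an OpenMP team is (re)spawned for every group */
          #pragma omp parallel
          {
              int tid = omp_get_thread_num();
              if (valid_tile(group, tid))
                  compute_tile(group, tid);
          }   /* implicit barrier of omp parallel synchronizes the threads */

          MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
          unpack(rcv_buf, group + 1, rank);               /* data needed by the next group */
      }
  }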
Coarse-grain Model
  SPMD paradigm
  Requires more programming effort
  Threads are only spawned once
  Inter-node communication inside the multi-threaded part (requires MPI_THREAD_MULTIPLE)
  Thread synchronization through an explicit barrier (omp barrier directive)
Coarse-grain Model
  #pragma omp parallel
  {
    thread_id = omp_get_thread_num();
    FOR(group_{n-1}) {
      #pragma omp master
      {
        Pack(snd_buf, tile_{n-1} - 1, nod);
        MPI_Isend(snd_buf, dest(nod));
        MPI_Irecv(recv_buf, src(nod));
      }
      if(valid(tile, thread_id, group_{n-1}))
        Compute(tile);
      #pragma omp master
      {
        MPI_Waitall;
        Unpack(recv_buf, tile_{n-1} + 1, nod);
      }
      #pragma omp barrier
    }
  }
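A corresponding skeleton of the coarse-grain scheme, using the same assumed helper routines: the OpenMP team is created once, only the master thread issues MPI calls inside the parallel region, and an explicit barrier separates successive groups:

  #include <mpi.h>
  #include <omp.h>

  /* assumed helpers, as in the fine-grain skeleton */
  extern void pack(double *buf, int group, int rank);
  extern void unpack(const double *buf, int group, int rank);
  extern int  valid_tile(int group, int thread_id);
  extern void compute_tile(int group, int thread_id);
  extern int  dest_rank(int rank), src_rank(int rank);

  void coarse_grain(int rank, int n_groups, double *snd_buf, double *rcv_buf, int buf_len)
  {
      /* MPI initialized elsewhere with
       * MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided),
       * as required by the setup described above.                      */
      #pragma omp parallel            /* the team is spawned once for the whole loop */
      {
          int tid = omp_get_thread_num();

          for (int group = 0; group < n_groups; group++) {
              MPI_Request req[2];

              #pragma omp master      /* only the master thread talks to MPI */
              {
                  pack(snd_buf, group - 1, rank);
                  MPI_Isend(snd_buf, buf_len, MPI_DOUBLE, dest_rank(rank), 0, MPI_COMM_WORLD, &req[0]);
                  MPI_Irecv(rcv_buf, buf_len, MPI_DOUBLE, src_rank(rank),  0, MPI_COMM_WORLD, &req[1]);
              }

              if (valid_tile(group, tid))
                  compute_tile(group, tid);

              #pragma omp master
              {
                  MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
                  unpack(rcv_buf, group + 1, rank);
              }
              #pragma omp barrier     /* no thread enters group+1 before communication completes */
          }
      }
  }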
Summary: Fine-grain vs Coarse-grain
  Thread management: fine-grain re-spawns threads at every parallel computation region; coarse-grain spawns threads only once
  Inter-node MPI communication: fine-grain keeps it outside the multi-threaded region; coarse-grain issues it inside the multi-threaded region, assumed by the master thread
  Intra-node synchronization: fine-grain relies on the implicit barrier of omp parallel; coarse-grain uses an explicit OpenMP barrier
Experimental Results
  8-node SMP Linux cluster (800 MHz Pentium III, 128 MB RAM)
  MPICH (--with-device=ch_p4, --with-comm=shared)
  Intel C++ compiler 7.0 (-O3 -mcpu=pentiumpro -static)
  FastEthernet interconnection
  ADI micro-kernel benchmark (3D)
Alternating Direction Implicit (ADI)
  Unitary data dependencies
  3D iteration space (X x Y x Z)
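As a point of reference, a minimal kernel with the same dependence structure (unit flow dependencies in all three dimensions over an X x Y x Z iteration space); this is an illustrative stand-in, not the exact ADI benchmark code:

  /* Illustrative 3D kernel with unit flow dependencies (ADI-like) */
  void adi_like(int X, int Y, int Z, double *A)
  {
      #define A3(i, j, k) A[((size_t)(i) * Y + (j)) * Z + (k)]
      for (int i = 1; i < X; i++)
          for (int j = 1; j < Y; j++)
              for (int k = 1; k < Z; k++)
                  A3(i, j, k) = (A3(i - 1, j, k) + A3(i, j - 1, k) + A3(i, j, k - 1)) / 3.0;
      #undef A3
  }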
[Figures: ADI results, 4 nodes: overall comparison (X < Y and X > Y cases); X=512 Y=512 Z=8192; X=128 Y=512 Z=8192; X=512 Y=128 Z=8192]
[Figures: ADI results, 2 nodes: overall comparison (X < Y and X > Y cases); X=128 Y=512 Z=8192; X=256 Y=512 Z=8192; X=512 Y=512 Z=8192; X=512 Y=256 Z=8192; X=512 Y=128 Z=8192]
[Figures: ADI, 2 nodes, computation vs. communication breakdown: X=128 Y=512 Z=8192; X=512 Y=128 Z=8192]
Conclusions
  Nested loop algorithms with arbitrary data dependencies can be adapted to the hybrid parallel programming paradigm
  Hybrid models can be competitive with the pure MPI paradigm
  The coarse-grain hybrid model can be more efficient than the fine-grain one, but is also more complicated to implement
  Programming efficiently in OpenMP is not easier than programming efficiently in MPI
Future Work
  Application of the methodology to real applications and benchmarks
  Work balancing for the coarse-grain model
  Performance evaluation on advanced interconnection networks (SCI, Myrinet)
  Generalization as a compiler technique
Questions?