Hybrid Parallel Programming


Hybrid Parallel Programming Introduction

Hybrid Systems
Since most computers are multi-core, most clusters have both shared memory and distributed memory.
[Figure: several multi-core computers, each with cores and local memory, connected by an interconnection network.]

Hybrid Parallel Computing
We can use MPI to run processes concurrently on each computer.
We can use OpenMP to run threads concurrently on each core of a computer.
Advantage: we can make use of shared memory where communication is required.
Why? Because inter-computer communication is an order of magnitude slower than synchronization through shared memory.
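
The slides introduce the simple example later with plain MPI_Init; as a hedged sketch of the basic combination, the following minimal program (names chosen here, not from the slides) requests thread support with MPI_Init_thread and opens an OpenMP parallel region inside each MPI process:

  #include <stdio.h>
  #include <mpi.h>
  #include <omp.h>

  int main(int argc, char *argv[])
  {
      int provided, rank, np;

      /* Ask MPI for a threading level that permits OpenMP threads inside
         each process (only the main thread makes MPI calls). */
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &np);

      #pragma omp parallel
      {
          /* Each MPI process runs a team of OpenMP threads on its cores. */
          printf("rank %d of %d, thread %d of %d\n",
                 rank, np, omp_get_thread_num(), omp_get_num_threads());
      }

      MPI_Finalize();
      return 0;
  }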

“More effective forms of parallelism need to be investigated and perhaps utilized within the node for certain applications to maximize efficiency.” [Holt:2011]
MPI implementation designers have recognized this, stating that MPI alone “does not make the most efficient use of the shared resources within the node of a HPC system.” [Dózsa:2010]

Research conducted at the University of Tokyo on three-level hybrid parallel programs running on a large SMP cluster found it inconclusive whether the extra development effort outweighed the perceived benefit [Nakajima:2005].
Hybrid parallel programs using MPI and OpenMP have been developed mainly in the past decade, with mixed results when comparing their performance to MPI-only parallel versions [Henty:2000], [Chow & Hysom:2001].

Matrix Multiplication, C = A * B, where A is an n x l matrix and B is an l x m matrix.
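
As a point of reference for the parallel versions on the following slides, here is a minimal sequential sketch; it assumes the same square N x N float arrays a, b, and c that the later code uses:

  /* Sequential baseline: c[i][j] is the dot product of row i of a and column j of b. */
  for (i = 0; i < N; i++) {
      for (j = 0; j < N; j++) {
          c[i][j] = 0.0;
          for (k = 0; k < N; k++) {
              c[i][j] += a[i][k] * b[k][j];
          }
      }
  }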

Matrix Multiplication Again
Each cell Ci,j can be computed independently of the other elements of the C matrix.
If we are multiplying two matrices that are N x N, then we can use up to N² processors without any communication.
But the computation of Ci,j is a dot product of two arrays; in other words, it is a reduction.

Matrix Multiplication Again
If we have more than N² processors, then we need to divide the work of the reduction among processors.
The reduction requires communication.
Although we can do a reduction using MPI, the communication is much slower than doing a reduction in OpenMP.
However, N usually needs to be quite large to justify parallel computation, and how likely are we to have N³ processors available?
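
To make the contrast concrete, a dot-product reduction inside a node can be expressed with an OpenMP reduction clause, while across MPI processes it needs an explicit collective such as MPI_Reduce. A hedged sketch, with illustrative helper functions not taken from the slides:

  #include <mpi.h>

  /* OpenMP reduction: threads accumulate the dot product through shared memory. */
  float dot_openmp(const float *x, const float *y, int n)
  {
      float dot = 0.0f;
      int k;
      #pragma omp parallel for reduction(+:dot)
      for (k = 0; k < n; k++)
          dot += x[k] * y[k];
      return dot;
  }

  /* MPI reduction: each process holds a slice of the vectors; a collective
     combines the partial sums over the network, which is far more expensive. */
  float dot_mpi(const float *x_local, const float *y_local, int n_local)
  {
      float partial = 0.0f, total = 0.0f;
      int k;
      for (k = 0; k < n_local; k++)
          partial += x_local[k] * y_local[k];
      MPI_Reduce(&partial, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
      return total;    /* meaningful only on rank 0 */
  }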

How is this possible?
OpenMP is supported by the icc and gcc compilers:
  gcc -fopenmp <file.c>
  icc -openmp <file.c>
MPI is a library that is linked with your C program:
  mpicc <file.c>
mpicc uses gcc linked with the appropriate libraries.

How is this possible?
So to use both MPI and OpenMP:
  mpicc -fopenmp <file.c>
mpicc is simply a script.
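
Putting it together, a typical compile-and-run session might look like the following; the executable name and process count are illustrative, and the thread count per process is commonly set with the standard OMP_NUM_THREADS environment variable:

  mpicc -fopenmp hello_hybrid.c -o hello_hybrid
  export OMP_NUM_THREADS=4          # threads per MPI process
  mpiexec -n 3 ./hello_hybrid       # 3 MPI processes, 4 OpenMP threads each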

Simple Example

  int main(int argc, char *argv[]) {
    int i, j, blksz, rank, NP, tid;
    char *usage = "Usage: %s \n";
    FILE *fd;
    char message[80];

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &NP);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    blksz = (int) ceil (((double) N)/NP);

Simple Example

    #pragma omp parallel private (tid, i, j)
    {
      tid = omp_get_thread_num();
      for (i = rank*blksz; i < min((rank + 1) * blksz, N); i++) {   /* loop i parallelized across computers */
        #pragma omp for
        for (j = 0; j < N; j++) {                                   /* loop j parallelized across threads */
          printf ("rank %d, thread %d: executing loop iteration i=%d j=%d\n",
                  rank, tid, i, j);
        }
      }
    }

Simple Example Result

  rank 0, thread 4: executing loop iteration i=0 j=4
  rank 0, thread 2: executing loop iteration i=0 j=2
  rank 0, thread 1: executing loop iteration i=0 j=1
  rank 1, thread 0: executing loop iteration i=2 j=0
  rank 1, thread 4: executing loop iteration i=2 j=4
  rank 1, thread 2: executing loop iteration i=2 j=2
  rank 1, thread 3: executing loop iteration i=2 j=3
  rank 0, thread 0: executing loop iteration i=0 j=0
  rank 1, thread 1: executing loop iteration i=2 j=1
  rank 0, thread 3: executing loop iteration i=0 j=3
  rank 2, thread 2: executing loop iteration i=4 j=2
  rank 2, thread 0: executing loop iteration i=4 j=0
  rank 2, thread 3: executing loop iteration i=4 j=3

Simple Example Result (continued)

  rank 2, thread 4: executing loop iteration i=4 j=4
  rank 2, thread 1: executing loop iteration i=4 j=1
  rank 0, thread 2: executing loop iteration i=1 j=2
  rank 0, thread 4: executing loop iteration i=1 j=4
  rank 0, thread 3: executing loop iteration i=1 j=3
  rank 0, thread 0: executing loop iteration i=1 j=0
  rank 0, thread 1: executing loop iteration i=1 j=1
  rank 1, thread 0: executing loop iteration i=3 j=0
  rank 1, thread 2: executing loop iteration i=3 j=2
  rank 1, thread 3: executing loop iteration i=3 j=3
  rank 1, thread 1: executing loop iteration i=3 j=1
  rank 1, thread 4: executing loop iteration i=3 j=4
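
The output above shows five threads per MPI process; the slides do not show how that count was chosen. As a hedged note, the number of OpenMP threads is normally controlled either from the shell or in the program itself:

  export OMP_NUM_THREADS=5          # set in the shell before running
  omp_set_num_threads(5);           /* or call this before the parallel region */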

Back to Matrix Multiplication

  MPI_Init (&argc, &argv);
  MPI_Comm_size (MPI_COMM_WORLD, &NP);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);
  blksz = (int) ceil (((double) N)/NP);

  MPI_Scatter (a, N*blksz, MPI_FLOAT, a, N*blksz, MPI_FLOAT, 0, MPI_COMM_WORLD);
  MPI_Bcast (b, N*N, MPI_FLOAT, 0, MPI_COMM_WORLD);
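
The slides omit the array declarations and how the result is collected. A minimal sketch of those surrounding pieces, assuming N is a multiple of NP (so blksz*NP == N), statically sized float arrays, and an illustrative c_all array not from the slides:

  float a[N][N], b[N][N], c[N][N];    /* each rank computes its blksz rows into c[0..blksz-1] */
  float c_all[N][N];                  /* full result, assembled on rank 0 */

  /* ... scatter a, broadcast b, compute as on the next slide ... */

  /* Collect each rank's blksz rows of the result back on rank 0. */
  MPI_Gather (c, N*blksz, MPI_FLOAT, c_all, N*blksz, MPI_FLOAT, 0, MPI_COMM_WORLD);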

Back to Matrix Multiplication

  #pragma omp parallel private (tid, i, j, k)
  {
    for (i = 0; i < blksz && rank * blksz < N; i++) {
      #pragma omp for nowait
      for (j = 0; j < N; j++) {
        c[i][j] = 0.0;
        for (k = 0; k < N; k++) {
          c[i][j] += a[i][k] * b[k][j];
        }
      }
    }
  }

Matrix Multiplication Results

  $ diff out MMULT.o5356
  1c1
  < elapsed_time= 1.525183 (seconds)      (sequential execution time)
  ---
  > elapsed_time= 0.659652 (seconds)      (hybrid execution time)
  $ diff out MMULT.o5357
  > elapsed_time= 0.626821 (seconds)      (MPI-only execution time)
  $

Hybrid did not do better than MPI only.

Back to Matrix Multiplication

  #pragma omp parallel private (tid, i, j, k)
  {
    #pragma omp for nowait
    for (i = 0; i < blksz && rank * blksz < N; i++) {
      for (j = 0; j < N; j++) {
        c[i][j] = 0.0;
        for (k = 0; k < N; k++) {
          c[i][j] += a[i][k] * b[k][j];
        }
      }
    }
  }

Perhaps we could do better by parallelizing the i loop with both MPI and OpenMP.
But this i loop is too complicated for OpenMP: its compound exit condition is not in the canonical form that #pragma omp for requires.

Back to Matrix Multiplication

  #pragma omp parallel private (tid, i, j, k)
  {
    #pragma omp for nowait
    for (i = 0; i < blksz; i++) {
      if (rank * blksz < N) {      /* skip blocks that start beyond the last row */
        for (j = 0; j < N; j++) {
          c[i][j] = 0.0;
          for (k = 0; k < N; k++) {
            c[i][j] += a[i][k] * b[k][j];
          }
        }
      }
    }
  }

An if statement can simplify the loop into the canonical form OpenMP can handle.

Matrix Multiplication Results

  $ diff out MMULT.o5356
  1c1
  < elapsed_time= 1.525183 (seconds)      (sequential execution time)
  ---
  > elapsed_time= 0.688119 (seconds)      (hybrid execution time)

Still not better.

The Paraguin compiler can also create hybrid programs. (This work was done by graduate student L. Kyle Holt, 2011.)
Because Paraguin uses mpicc, it will pass the OpenMP pragmas through to the generated code.
This works unless you are trying to parallelize the same loop using both MPI and OpenMP.

Hybrid Matrix Multiplication using Paraguin

  ;
  #pragma paraguin begin_parallel
  #pragma paraguin forall C p i j k \
          0x0 -1  1 0x0 0x0 \
          0x0  1 -1 0x0 0x0
  #pragma paraguin bcast a b
  #pragma paraguin gather 1 C i j k \
          0x0 0x0 0x0  1 \
          0x0 0x0 0x0 -1

We are parallelizing the i loop.

Hybrid Matrix Multiplication using Paraguin

  #pragma omp parallel for private(__guin_p,i,j,k) schedule(static) num_threads(4)
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
      c[i][j] = 0.0;
      for (k = 0; k < N; k++) {
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
      }
    }
  }
  ;
  #pragma paraguin end_parallel

Since the i loop is going to be parallelized using both MPI and OpenMP, the compiler has to modify the omp pragma.

Resulting Matrix Multiplication Parallel Program produced by Paraguin

  if (0 <= __guin_mypid & __guin_mypid <= -1 + 1 * __guin_NP)
  {
    #pragma omp parallel for private ( __guin_p , i , j , k ) schedule ( static ) num_threads ( 4 )
    for (__guin_p = __guin_blksz * __guin_mypid;
         __guin_p <= __suif_min(799, __guin_blksz + -1 + __guin_blksz * __guin_mypid);
         __guin_p++) {
      i = 1 * __guin_p;
      for (j = 0; j <= 799; j++) {
        c[i][j] = 0.0F;
        for (k = 0; k <= 799; k++) {
          c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
      }
    }
  }

Since the i loop has been replaced with a p loop, the pragma has been modified.

Matrix Multiplication (1000x1000 matrices), code generated by Paraguin
[Chart: execution time versus number of processors; the sequential and OpenMP-only versions are unable to scale.]

Matrix Multiplication, code generated by Paraguin
[Chart: execution times for 4, 8, 12, 16, 20, 24, 28, and 32 PEs.]

Hybrid Sobel Edge Detection using Paraguin

  ;
  #pragma paraguin begin_parallel
  #pragma paraguin forall C p x y i j \
          0x0 -1  1 0x0 0x0 0x0 \
          0x0  1 -1 0x0 0x0 0x0
  #pragma paraguin bcast grayImage
  #pragma paraguin bcast w
  #pragma paraguin bcast h

We are parallelizing the x loop.

Hybrid Sobel Edge Detection using Paraguin

  #pragma omp parallel for private(__guin_p,x,y,i,j,sumx,sumy,sum) shared(w,h) num_threads(4)
  for (x = 0; x < N; ++x) {
    for (y = 0; y < N; ++y) {
      sumx = 0;
      sumy = 0;
      // handle image boundaries
      . . .
      edgeImage[x][y] = clamp(sum);
    }
  }
  ;
  #pragma paraguin end_parallel

Resulting Sobel Edge Detection Parallel Program produced by Paraguin

  …
  __guin_blksz = __suif_divceil(999 - 0 + 1, __guin_NP);
  if (0 <= __guin_mypid & __guin_mypid <= -1 + 1 * __guin_NP)
  {
    #pragma omp parallel for private ( __guin_p , x , y , i , j , sumx , sumy , sum ) shared ( w , h ) num_threads ( 4 )
    //Local variable produced by s2c
    __s2c_tmp = __suif_min(999, __guin_blksz + -1 + __guin_blksz * __guin_mypid);
    //omp pragma moved here
    for (__guin_p = __guin_blksz * __guin_mypid; __guin_p <= __s2c_tmp; __guin_p++)

We had to move the omp pragma by hand because of the temporary variable.

Sobel Edge Detection, 2.5 MP grayscale 8-bit image, code generated by Paraguin
[Chart: MPI-only versus hybrid execution times.]

We found that the majority of the time was spent gathering the partial results.
If we remove the gather time from the execution time, we find…
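
The slides do not show how the two phases were timed; one common way to separate them is to bracket the compute and gather phases with MPI_Wtime(), as in this hedged sketch (edgeBlock and edgeImage are illustrative names, not from the slides):

  double t0, t_compute, t_gather;

  t0 = MPI_Wtime();
  /* ... compute this rank's block of the edge image ... */
  t_compute = MPI_Wtime() - t0;

  t0 = MPI_Wtime();
  MPI_Gather (edgeBlock, blksz*N, MPI_UNSIGNED_CHAR,
              edgeImage, blksz*N, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);
  t_gather = MPI_Wtime() - t0;

  if (rank == 0)
      printf ("compute: %f s   gather: %f s\n", t_compute, t_gather);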

Approx. workload time in Sobel Edge Detection of 2.5 MP image
[Chart: execution time with the gather time removed.]

Questions