Hybrid Parallel Programming

Hybrid Parallel Programming Introduction

Hybrid Systems
Since most computers are multi-core, most clusters have both shared memory and distributed memory.
[Figure: several multi-core computers, each with its own cores and memory, connected by an interconnection network]

Hybrid Parallel Computing
We can use MPI to run processes concurrently on each computer, and OpenMP to run threads concurrently on each core of a computer.
Advantage: we can make use of shared memory where communication is required.
Why? Because inter-computer communication is an order of magnitude slower than shared-memory synchronization.
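
A minimal sketch of how the two models coexist in one program (not from the slides; the "Simple Example" later in the deck is the deck's own version). MPI_Init_thread with MPI_THREAD_FUNNELED asks the MPI library for an initialization that tolerates OpenMP threads inside each process:

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char *argv[]) {
        int rank, np, provided;

        /* Request thread-aware MPI: only the main thread will make MPI calls */
        MPI_Init_thread (&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank (MPI_COMM_WORLD, &rank);
        MPI_Comm_size (MPI_COMM_WORLD, &np);

        #pragma omp parallel
        {
            /* One MPI process per computer, one OpenMP thread per core */
            printf ("MPI rank %d of %d, OpenMP thread %d of %d\n",
                    rank, np, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize ();
        return 0;
    }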

“More effective forms of parallelism need to be investigated and perhaps utilized within the node for certain applications to maximize efficiency.” [Holt:2011]
MPI implementation designers have recognized this, stating that MPI alone “does not make the most efficient use of the shared resources within the node of a HPC system.” [Dózsa:2010]

Research conducted at the University of Tokyo on three-level hybrid parallel programs running on a large SMP cluster found it inconclusive whether the extra development effort outweighed the perceived benefit [Nakajima:2005]. Hybrid parallel programs using MPI and OpenMP have been developed mainly in the past decade, with mixed results when their performance is compared to MPI-only versions [Henty:2000], [Chow & Hysom:2001].

Matrix Multiplication, C = A * B where A is an n x l matrix and B is an l x m matrix.
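
For reference, the sequential computation being parallelized is the standard triple loop; a sketch assuming square N x N matrices, matching the code on the later slides:

    /* Sequential reference: c[i][j] is the dot product of row i of a with column j of b */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }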

Matrix Multiplication Again
Each element C(i,j) can be computed independently of the other elements of the C matrix.
If we are multiplying two matrices that are N x N, then we can use up to N^2 processors without any communication.
But the computation of C(i,j) is a dot product of two arrays; in other words, it is a reduction.
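
A dot product maps directly onto OpenMP's reduction clause; a minimal illustrative sketch (not from the slides):

    /* Compute one element of C as a shared-memory reduction */
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (k = 0; k < N; k++)
        sum += a[i][k] * b[k][j];
    c[i][j] = sum;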

Matrix Multiplication Again
If we have more than N^2 processors, then we need to divide the work of each reduction among processors.
The reduction requires communication.
Although we can do a reduction using MPI, the communication is much slower than doing a reduction in OpenMP.
However, N usually needs to be very large to justify parallel computation, and how likely are we to have N^3 processors available?
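
For contrast, the same dot product split across MPI processes needs an explicit collective, and that collective is network communication. A sketch, assuming each rank holds slices local_a and local_b of length local_n (hypothetical names):

    /* Each rank reduces over its own slice ... */
    double partial = 0.0, total;
    int k;
    for (k = 0; k < local_n; k++)
        partial += local_a[k] * local_b[k];
    /* ... then the partial sums are combined across processes (inter-computer traffic) */
    MPI_Reduce (&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);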

How is this possible?
OpenMP is supported by the gcc and icc compilers:
    gcc -fopenmp <file.c>
    icc -openmp <file.c>
MPI is a library that is linked with your C program:
    mpicc <file.c>
mpicc uses gcc linked with the appropriate libraries.

How is this possible?
So to use both MPI and OpenMP:
    mpicc -fopenmp <file.c>
mpicc is simply a script (a wrapper around the underlying compiler).
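
As an illustration of a typical build-and-run sequence (the program name, process count, and thread count here are assumptions, not from the slides):

    mpicc -fopenmp -o hello hello.c    # MPI wrapper script plus the OpenMP flag
    export OMP_NUM_THREADS=4           # OpenMP threads per MPI process
    mpirun -np 3 ./hello               # 3 MPI processes, each with 4 threads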

Simple Example

    int main(int argc, char *argv[]) {
        int i, j, blksz, rank, NP, tid;
        char *usage = "Usage: %s \n";
        FILE *fd;
        char message[80];

        MPI_Init (&argc, &argv);
        MPI_Comm_size (MPI_COMM_WORLD, &NP);    /* NP = number of MPI processes */
        MPI_Comm_rank (MPI_COMM_WORLD, &rank);  /* rank of this process */

        blksz = (int) ceil (((double) N)/NP);   /* block of i values per process */

Simple Example

        #pragma omp parallel private (tid, i, j)
        {
            tid = omp_get_thread_num();
            /* Loop i parallelized across computers: each rank handles its own block of i */
            for (i = rank*blksz; i < min((rank + 1) * blksz, N); i++) {
                /* Loop j parallelized across threads */
                #pragma omp for
                for (j = 0; j < N; j++) {
                    printf ("rank %d, thread %d: executing loop iteration i=%d j=%d\n",
                            rank, tid, i, j);
                }
            }
        }
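
To compile this fragment on its own it would additionally need roughly the following; these lines are assumptions (the slides do not show them), including the value of N and the min() macro:

    #include <stdio.h>
    #include <math.h>      /* for ceil() */
    #include <mpi.h>
    #include <omp.h>

    #define N 5                                  /* assumed; consistent with j = 0..4 in the output */
    #define min(a,b) ((a) < (b) ? (a) : (b))     /* min() is presumably a macro like this */

plus MPI_Finalize() and a return statement at the end of main.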

Simple Example Result

    rank 0, thread 4: executing loop iteration i=0 j=4
    rank 0, thread 2: executing loop iteration i=0 j=2
    rank 0, thread 1: executing loop iteration i=0 j=1
    rank 1, thread 0: executing loop iteration i=2 j=0
    rank 1, thread 4: executing loop iteration i=2 j=4
    rank 1, thread 2: executing loop iteration i=2 j=2
    rank 1, thread 3: executing loop iteration i=2 j=3
    rank 0, thread 0: executing loop iteration i=0 j=0
    rank 1, thread 1: executing loop iteration i=2 j=1
    rank 0, thread 3: executing loop iteration i=0 j=3
    rank 2, thread 2: executing loop iteration i=4 j=2
    rank 2, thread 0: executing loop iteration i=4 j=0
    rank 2, thread 3: executing loop iteration i=4 j=3

Simple Example Result

    rank 2, thread 4: executing loop iteration i=4 j=4
    rank 2, thread 1: executing loop iteration i=4 j=1
    rank 0, thread 2: executing loop iteration i=1 j=2
    rank 0, thread 4: executing loop iteration i=1 j=4
    rank 0, thread 3: executing loop iteration i=1 j=3
    rank 0, thread 0: executing loop iteration i=1 j=0
    rank 0, thread 1: executing loop iteration i=1 j=1
    rank 1, thread 0: executing loop iteration i=3 j=0
    rank 1, thread 2: executing loop iteration i=3 j=2
    rank 1, thread 3: executing loop iteration i=3 j=3
    rank 1, thread 1: executing loop iteration i=3 j=1
    rank 1, thread 4: executing loop iteration i=3 j=4

Back to Matrix Multiplication

    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &NP);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    blksz = (int) ceil (((double) N)/NP);

    /* Distribute blksz rows of A to each process; broadcast all of B to every process */
    MPI_Scatter (a, N*blksz, MPI_FLOAT, a, N*blksz, MPI_FLOAT, 0,
                 MPI_COMM_WORLD);
    MPI_Bcast (b, N*N, MPI_FLOAT, 0, MPI_COMM_WORLD);
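
One caution (not raised on the slides): the MPI standard does not allow the same buffer to be passed as both the send and receive argument of MPI_Scatter on the root process, even though many implementations accept it. A sketch of the usual workaround uses MPI_IN_PLACE on the root:

    if (rank == 0)
        MPI_Scatter (a, N*blksz, MPI_FLOAT, MPI_IN_PLACE, N*blksz, MPI_FLOAT,
                     0, MPI_COMM_WORLD);
    else
        MPI_Scatter (a, N*blksz, MPI_FLOAT, a, N*blksz, MPI_FLOAT,
                     0, MPI_COMM_WORLD);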

Back to Matrix Multiplication

    #pragma omp parallel private (tid, i, j, k)
    {
        /* i loop: this rank's rows (split by MPI); j loop: split among OpenMP threads */
        for (i = 0; i < blksz && rank * blksz < N; i++) {
            #pragma omp for nowait
            for (j = 0; j < N; j++) {
                c[i][j] = 0.0;
                for (k = 0; k < N; k++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
    }
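
After the loop, the computed blocks of C would typically be gathered back to the root, a step not shown on these slides; a sketch (using MPI_IN_PLACE for the root's own block):

    if (rank == 0)
        MPI_Gather (MPI_IN_PLACE, N*blksz, MPI_FLOAT, c, N*blksz, MPI_FLOAT,
                    0, MPI_COMM_WORLD);
    else
        MPI_Gather (c, N*blksz, MPI_FLOAT, c, N*blksz, MPI_FLOAT,
                    0, MPI_COMM_WORLD);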

Matrix Multiplication Results

    $ diff out MMULT.o5356
    1c1
    < elapsed_time= 1.525183 (seconds)    <-- sequential execution time
    ---
    > elapsed_time= 0.659652 (seconds)    <-- hybrid execution time
    $ diff out MMULT.o5357
    > elapsed_time= 0.626821 (seconds)    <-- MPI-only execution time
    $

Hybrid did not do better than MPI-only.

Back to Matrix Multiplication
Perhaps we could do better by parallelizing the i loop with both MPI and OpenMP, instead of only the j loop. But this loop is too complicated for OpenMP: its compound controlling condition is not in the canonical form that #pragma omp for requires.

    #pragma omp parallel private (tid, i, j, k)
    {
        #pragma omp for nowait
        for (i = 0; i < blksz && rank * blksz < N; i++) {  /* not a canonical OpenMP loop */
            for (j = 0; j < N; j++) {
                c[i][j] = 0.0;
                for (k = 0; k < N; k++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
    }

Back to Matrix Multiplication
An if statement can simplify the loop: the loop header is now in canonical form, and the awkward condition moves into the body.

    #pragma omp parallel private (tid, i, j, k)
    {
        #pragma omp for nowait
        for (i = 0; i < blksz; i++) {        /* canonical form: OpenMP can now split the i loop */
            if (rank * blksz < N) {          /* extra condition checked inside the body */
                for (j = 0; j < N; j++) {
                    c[i][j] = 0.0;
                    for (k = 0; k < N; k++) {
                        c[i][j] += a[i][k] * b[k][j];
                    }
                }
            }
        }
    }
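
An alternative sketch (not from the slides) precomputes how many rows this rank actually owns; that gives a simple canonical bound and even permits OpenMP's collapse clause across the i and j loops:

    /* Rows this rank actually owns (the last rank may own fewer than blksz) */
    int myrows = N - rank * blksz;
    if (myrows > blksz) myrows = blksz;
    if (myrows < 0)     myrows = 0;

    #pragma omp parallel for private (j, k) collapse(2)
    for (i = 0; i < myrows; i++)
        for (j = 0; j < N; j++) {
            float sum = 0.0;     /* scalar accumulator keeps the two collapsed loops perfectly nested */
            for (k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }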

Matrix Multiplication Results $ diff out MMULT.o5356 1c1 < elapsed_time= 1.525183 (seconds) --- >elapsed_time= 0.688119 (seconds) Sequential Execution Time Hybrid Execution Time Still not better

Discussion Point Why does the hybrid approach not outperform MPI-only for this problem? For what kinds of problem might a hybrid approach do better?

Questions