Parallel Programming in C with MPI and OpenMP (Michael J. Quinn)

Presentation transcript:

Parallel Programming in C with MPI and OpenMP
Michael J. Quinn
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

Chapter 18: Combining MPI and OpenMP

Outline
- Advantages of using both MPI and OpenMP
- Case study: conjugate gradient method
- Case study: Jacobi method

C+MPI vs. C+MPI+OpenMP
(Figure: side-by-side comparison of the C + MPI model and the C + MPI + OpenMP model.)

Why C + MPI + OpenMP Can Execute Faster
- Lower communication overhead
- More portions of the program may be practical to parallelize
- May allow more overlap of communications with computations
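
These advantages come from running one multithreaded MPI process per multiprocessor node instead of one single-threaded MPI process per CPU. A minimal sketch of that hybrid structure (not code from the book; it assumes MPI_Init_thread with the FUNNELED level, which suffices when only the main thread makes MPI calls):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main (int argc, char *argv[])
{
   int provided, id, p;

   /* Request thread support; FUNNELED: only the main thread calls MPI */
   MPI_Init_thread (&argc, &argv, MPI_THREAD_FUNNELED, &provided);
   MPI_Comm_rank (MPI_COMM_WORLD, &id);
   MPI_Comm_size (MPI_COMM_WORLD, &p);

   /* Each MPI process forks a team of OpenMP threads for its compute phase */
   #pragma omp parallel
   printf ("Process %d of %d, thread %d of %d\n",
           id, p, omp_get_thread_num(), omp_get_num_threads());

   MPI_Finalize();
   return 0;
}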

Case Study: Conjugate Gradient
- Conjugate gradient method solves Ax = b
- In our program we assume A is dense
- Methodology:
  * Start with the MPI program
  * Profile functions to determine where most execution time is spent
  * Tackle the most time-intensive function first
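
The profiling step can use gprof or simply bracket candidate functions with MPI_Wtime. A rough fragment of the manual approach (a sketch, not the book's instrumentation; matrix_vector_product and its arguments anticipate the next two slides):

double start, elapsed, slowest;

start = MPI_Wtime();
matrix_vector_product (id, p, n, a, b, c);
elapsed = MPI_Wtime() - start;

/* The slowest process determines the parallel run time, so report the maximum */
MPI_Reduce (&elapsed, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
if (!id) printf ("matrix_vector_product: %.2f s\n", slowest);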

Result of Profiling MPI Program

Function                  1 CPU     8 CPUs
matrix_vector_product    99.55%     97.49%
dot_product               0.19%      1.06%
cg                        0.25%      1.44%

Clearly our focus needs to be on function matrix_vector_product.

Code for matrix_vector_product

void matrix_vector_product (int id, int p, int n,
                            double **a, double *b, double *c)
{
   int i, j;
   double tmp;   /* Accumulates sum */

   for (i = 0; i < BLOCK_SIZE(id,p,n); i++) {
      tmp = 0.0;
      for (j = 0; j < n; j++)
         tmp += a[i][j] * b[j];
      piece[i] = tmp;
   }
   new_replicate_block_vector (id, p, piece, n, (void *) c, MPI_DOUBLE);
}
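
BLOCK_SIZE is one of the block-decomposition macros used throughout the book. As a reminder of what it computes, a typical formulation looks like the following (an assumption for illustration, not a quote of the book's header):

/* Block decomposition of n items over p processes, for process id
   (sketch of the usual definitions) */
#define BLOCK_LOW(id,p,n)    ((id)*(n)/(p))
#define BLOCK_HIGH(id,p,n)   (BLOCK_LOW((id)+1,p,n) - 1)
#define BLOCK_SIZE(id,p,n)   (BLOCK_HIGH(id,p,n) - BLOCK_LOW(id,p,n) + 1)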

Adding OpenMP Directives
- Want to minimize fork/join overhead by making parallel the outermost possible loop
- Outer loop may be executed in parallel if each thread has a private copy of tmp and j

#pragma omp parallel for private(j,tmp)
for (i = 0; i < BLOCK_SIZE(id,p,n); i++) {
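
Applied to the full loop nest from the previous slide, the change is a single pragma (a sketch; everything except the directive and its comment is the original code):

   /* i is the index of the parallelized loop and is private automatically;
      j and tmp must be listed explicitly */
   #pragma omp parallel for private(j,tmp)
   for (i = 0; i < BLOCK_SIZE(id,p,n); i++) {
      tmp = 0.0;
      for (j = 0; j < n; j++)
         tmp += a[i][j] * b[j];
      piece[i] = tmp;
   }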

User Control of Threads
- Want to give the user the opportunity to specify the number of active threads per process
- Add a call to omp_set_num_threads to function main
- Argument comes from the command line

omp_set_num_threads (atoi(argv[3]));
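
In context, the change to main is one line plus the header that declares omp_set_num_threads. A sketch (not the book's full main; the argument position follows the slide):

#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

int main (int argc, char *argv[])
{
   MPI_Init (&argc, &argv);

   /* argv[3] selects the number of OpenMP threads per MPI process;
      if it is absent, fall back to the OMP_NUM_THREADS default */
   if (argc > 3)
      omp_set_num_threads (atoi(argv[3]));

   /* ... rest of the conjugate gradient driver ... */

   MPI_Finalize();
   return 0;
}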

What Happened?
We transformed a C+MPI program into a C+MPI+OpenMP program by adding only two lines to our program!

Benchmarking
- Target system: a commodity cluster with four dual-processor nodes
- C+MPI program executes on 1, 2, ..., 8 CPUs
- On 1, 2, 3, 4 CPUs, each process runs on a different node, maximizing memory bandwidth per CPU
- C+MPI+OpenMP program executes on 1, 2, 3, 4 processes
- Each process has two threads
- C+MPI+OpenMP program therefore executes on 2, 4, 6, 8 threads

Results of Benchmarking

Analysis of Results
- C+MPI+OpenMP program is slower on 2, 4 CPUs because its threads share memory bandwidth, while the C+MPI processes do not
- C+MPI+OpenMP program is faster on 6, 8 CPUs because it has lower communication cost

Case Study: Jacobi Method
- Begin with the C+MPI program that uses the Jacobi method to solve the steady-state heat distribution problem of Chapter 13
- Program is based on a rowwise block-striped decomposition of the two-dimensional matrix containing the finite difference mesh
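
For reference, each Jacobi sweep replaces every interior mesh point by the average of its four neighbors, and iteration stops when the largest change anywhere drops below a tolerance; this is exactly the computation in the find_steady_state code shown below:

   u^{(k+1)}_{i,j} = \frac{1}{4}\left( u^{(k)}_{i-1,j} + u^{(k)}_{i+1,j} + u^{(k)}_{i,j-1} + u^{(k)}_{i,j+1} \right),
   \qquad
   \max_{i,j}\,\bigl| u^{(k+1)}_{i,j} - u^{(k)}_{i,j} \bigr| \le \text{EPSILON}.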

Methodology
- Profile execution of the C+MPI program
- Focus on adding OpenMP directives to the most compute-intensive function

Result of Profiling

Function              1 CPU     8 CPUs
initialize_mesh        0.01%     0.03%
find_steady_state     98.48%    93.49%
print_solution         1.51%     6.48%

Function find_steady_state (1/2)

its = 0;
for (;;) {
   /* Exchange ghost rows with the neighboring processes */
   if (id > 0)
      MPI_Send (u[1], N, MPI_DOUBLE, id-1, 0, MPI_COMM_WORLD);
   if (id < p-1) {
      MPI_Send (u[my_rows-2], N, MPI_DOUBLE, id+1, 0, MPI_COMM_WORLD);
      MPI_Recv (u[my_rows-1], N, MPI_DOUBLE, id+1, 0, MPI_COMM_WORLD,
                &status);
   }
   if (id > 0)
      MPI_Recv (u[0], N, MPI_DOUBLE, id-1, 0, MPI_COMM_WORLD, &status);

Function find_steady_state (2/2)

   /* Compute the new estimate w and track the largest local change */
   diff = 0.0;
   for (i = 1; i < my_rows-1; i++)
      for (j = 1; j < N-1; j++) {
         w[i][j] = (u[i-1][j] + u[i+1][j] +
                    u[i][j-1] + u[i][j+1]) / 4.0;
         if (fabs(w[i][j] - u[i][j]) > diff)
            diff = fabs(w[i][j] - u[i][j]);
      }

   /* Install the new estimate */
   for (i = 1; i < my_rows-1; i++)
      for (j = 1; j < N-1; j++)
         u[i][j] = w[i][j];

   /* The largest change across all processes decides convergence */
   MPI_Allreduce (&diff, &global_diff, 1, MPI_DOUBLE, MPI_MAX,
                  MPI_COMM_WORLD);
   if (global_diff <= EPSILON) break;
   its++;

Making Function Parallel (1/3)
- Except for two initializations and a return statement, the function is one big for loop
- Cannot execute the for loop in parallel:
  * Not in canonical form
  * Contains a break statement
  * Contains calls to MPI functions
  * Data dependences between iterations

Making Function Parallel (2/3)
- Focus on the first for loop indexed by i
- How to handle multiple threads testing/updating diff?
- Putting the if statement in a critical section would increase overhead and lower speedup
- Instead, create a private variable tdiff
- Each thread tests tdiff against diff before the call to MPI_Allreduce

Modified Function

diff = 0.0;
#pragma omp parallel private (i, j, tdiff)
{
   tdiff = 0.0;
   #pragma omp for
   for (i = 1; i < my_rows-1; i++)
      ...
   #pragma omp for nowait
   for (i = 1; i < my_rows-1; i++)
      ...
   #pragma omp critical
   if (tdiff > diff) diff = tdiff;
}
MPI_Allreduce (&diff, &global_diff, 1, MPI_DOUBLE, MPI_MAX,
               MPI_COMM_WORLD);
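
Filling in the elided loop bodies from the earlier find_steady_state slides gives the following sketch of the parallel region; tdiff replaces diff inside the threaded loop, and the critical section runs once per thread rather than once per mesh point:

   diff = 0.0;
   #pragma omp parallel private (i, j, tdiff)
   {
      tdiff = 0.0;

      /* Each thread updates a block of rows and tracks its own maximum change */
      #pragma omp for
      for (i = 1; i < my_rows-1; i++)
         for (j = 1; j < N-1; j++) {
            w[i][j] = (u[i-1][j] + u[i+1][j] +
                       u[i][j-1] + u[i][j+1]) / 4.0;
            if (fabs(w[i][j] - u[i][j]) > tdiff)
               tdiff = fabs(w[i][j] - u[i][j]);
         }

      /* Copy the new estimate back; nowait lets a thread proceed to the
         critical section without waiting for the others to finish copying */
      #pragma omp for nowait
      for (i = 1; i < my_rows-1; i++)
         for (j = 1; j < N-1; j++)
            u[i][j] = w[i][j];

      /* One update of the shared maximum per thread */
      #pragma omp critical
      if (tdiff > diff) diff = tdiff;
   }
   MPI_Allreduce (&diff, &global_diff, 1, MPI_DOUBLE, MPI_MAX,
                  MPI_COMM_WORLD);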

Making Function Parallel (3/3)
- Focus on the second for loop indexed by i
- It copies elements of w to the corresponding elements of u: no problem with executing it in parallel

Benchmarking
- Target system: a commodity cluster with four dual-processor nodes
- C+MPI program executes on 1, 2, ..., 8 CPUs
- On 1, 2, 3, 4 CPUs, each process runs on a different node, maximizing memory bandwidth per CPU
- C+MPI+OpenMP program executes on 1, 2, 3, 4 processes
- Each process has two threads
- C+MPI+OpenMP program therefore executes on 2, 4, 6, 8 threads

Benchmarking Results

Analysis of Results
- Hybrid C+MPI+OpenMP program is uniformly faster than the C+MPI program
- Computation/communication ratio of the hybrid program is superior
- Number of mesh points per element communicated is twice as high per node for the hybrid program
- Lower communication overhead leads to 19% better speedup on 8 CPUs

Summary
- Many contemporary parallel computers consist of a collection of multiprocessors
- On these systems, the performance of C+MPI+OpenMP programs can exceed that of C+MPI programs
- OpenMP enables us to take advantage of shared memory to reduce communication overhead
- Often, the conversion requires adding relatively few pragmas