MPI and OpenMP (Lecture 25, cs262a)


MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley, November 19, 2016

Message passing vs. Shared memory. Message passing: exchange data explicitly via IPC; application developers define the protocol, the exchange format, the number of participants, and each exchange. Shared memory: allow multiple processes to share data via memory; applications must locate and map shared memory regions to exchange data. (Slide diagram: clients exchanging MSGs via send(msg)/recv(msg) over an IPC channel vs. clients reading and writing a shared memory region.)

Architectures. Orthogonal to programming model: Uniform Memory Access (UMA), e.g., Cray 2; Non-Uniform Memory Access (NUMA), e.g., SGI Altix 3700; massively parallel distributed memory, e.g., Blue Gene/L.

MPI MPI - Message Passing Interface Library standard defined by a committee of vendors, implementers, and parallel programmers Used to create parallel programs based on message passing Portable: one standard, many implementations Available on almost all parallel machines in C and Fortran De facto standard platform for the HPC community

Groups, Communicators, Contexts. Group: a fixed, ordered set of k processes, i.e., 0, 1, ..., k-1. Communicator: specifies the scope of communication, either between processes in a group or between two disjoint groups. Context: partitions the communication space; a message sent in one context cannot be received in another context. This image is captured from: "Writing Message Passing Parallel Programs with MPI", Course Notes, Edinburgh Parallel Computing Centre, The University of Edinburgh.
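As a hedged illustration (mine, not from the slides): a new communicator can be derived from an existing one with MPI_Comm_split. Here processes are grouped by the parity of their world rank, so the run is partitioned into two disjoint sub-communicators.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);

        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* color selects which sub-communicator this process joins;
         * key (here the world rank) orders the ranks within it */
        int color = world_rank % 2;
        MPI_Comm subcomm;
        MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &subcomm);

        int sub_rank, sub_size;
        MPI_Comm_rank(subcomm, &sub_rank);
        MPI_Comm_size(subcomm, &sub_size);
        printf("world rank %d -> rank %d of %d in sub-communicator %d\n",
               world_rank, sub_rank, sub_size, color);

        MPI_Comm_free(&subcomm);
        MPI_Finalize();
        return 0;
    }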

Synchronous vs. Asynchronous Message Passing A synchronous communication is not complete until the message has been received An asynchronous communication completes before the message is received

Communication Modes Synchronous: completes once ack is received by sender Asynchronous: 3 modes Standard send: completes once the message has been sent, which may or may not imply that the message has arrived at its destination Buffered send: completes immediately, if receiver not ready, MPI buffers the message locally Ready send: completes immediately, if the receiver is ready for the message it will get it, otherwise the message is dropped silently
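A minimal sketch of two of these modes (my example, not the slide's): MPI_Ssend is the synchronous send and MPI_Bsend the buffered send; ready mode (MPI_Rsend) is omitted because it is only correct when the matching receive is already posted. Run with at least two processes.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int value = 42;

        if (rank == 0) {
            /* Synchronous send: completes only once the matching receive has started */
            MPI_Ssend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

            /* Buffered send: completes immediately; MPI copies the message into a
             * user-attached buffer (size must include MPI_BSEND_OVERHEAD) */
            int bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
            char *bspace = malloc(bufsize);
            MPI_Buffer_attach(bspace, bufsize);
            MPI_Bsend(&value, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
            MPI_Buffer_detach(&bspace, &bufsize);
            free(bspace);
        } else if (rank == 1) {
            int recvd;
            MPI_Recv(&recvd, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(&recvd, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d twice\n", recvd);
        }
        MPI_Finalize();
        return 0;
    }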

Blocking vs. Non-Blocking. Blocking means the program will not continue until the communication is completed: synchronous communication; barriers (wait for every process in the group to reach a point in execution). Non-blocking means the program will continue without waiting for the communication to be completed.
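A hedged sketch (not from the slides) of the non-blocking style: each process starts an MPI_Isend/MPI_Irecv pair around a ring of its neighbors, could overlap computation with the transfer, and blocks only at MPI_Waitall; MPI_Barrier shows the barrier form of blocking.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int out = rank, in = -1;
        MPI_Request reqs[2];

        /* Non-blocking ring exchange: send to the right neighbor,
         * receive from the left neighbor */
        int right = (rank + 1) % size;
        int left  = (rank - 1 + size) % size;
        MPI_Isend(&out, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&in,  1, MPI_INT, left,  0, MPI_COMM_WORLD, &reqs[1]);

        /* ... useful computation could overlap with communication here ... */

        /* Block only when the results are actually needed */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        printf("rank %d received %d from rank %d\n", rank, in, left);

        /* Barrier: every process waits here until all processes arrive */
        MPI_Barrier(MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }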

MPI library Huge (125 functions) Basic (6 functions)

MPI Basic. Many parallel programs can be written using just these six functions, only two of which are non-trivial: MPI_INIT, MPI_FINALIZE, MPI_COMM_SIZE, MPI_COMM_RANK, MPI_SEND, MPI_RECV

Skeleton MPI Program (C)
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    /* main part of the program */
    /* Use MPI function calls depending on your data
     * partitioning and the parallelization architecture */

    MPI_Finalize();
    return 0;
}

A minimal MPI program (C)
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    printf("Hello, world!\n");
    MPI_Finalize();
    return 0;
}
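To build and run it, the usual workflow (assuming an MPI implementation such as Open MPI or MPICH is installed; the slides do not specify one) is the mpicc wrapper plus mpirun:

    mpicc hello.c -o hello      (compile with the MPI wrapper compiler)
    mpirun -np 4 ./hello        (launch 4 processes; "Hello, world!" prints 4 times)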

A minimal MPI program (C): #include "mpi.h" provides basic MPI definitions and types; MPI_Init starts MPI; MPI_Finalize exits MPI. Notes: non-MPI routines are local; this printf runs on each process. MPI functions return error codes or MPI_SUCCESS.

Error handling By default, an error causes all processes to abort The user can have his/her own error handling routines Some custom error handlers are available for downloading from the net
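As an illustrative sketch (mine, not from the slides), the default abort-on-error behavior can be replaced with the predefined MPI_ERRORS_RETURN handler so the program can inspect error codes itself; here an intentionally out-of-range destination rank triggers the error path.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);

        /* Return error codes instead of aborting all processes */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        int size, x = 0;
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Deliberately invalid: valid destination ranks are 0..size-1 */
        int err = MPI_Send(&x, 1, MPI_INT, size, 0, MPI_COMM_WORLD);
        if (err != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(err, msg, &len);
            fprintf(stderr, "MPI_Send failed: %s\n", msg);
        }

        MPI_Finalize();
        return 0;
    }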

Improved Hello (C)
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);

    /* rank of this process in the communicator */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* size of the group associated with the communicator */
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("I am %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

Improved Hello (C): send and receive
/* Find out rank, size */
int world_rank, world_size;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);

int number;
if (world_rank == 0) {
    number = -1;
    /* args: buffer, number of elements, datatype, rank of destination,
     * tag to identify message, default communicator */
    MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (world_rank == 1) {
    /* args: buffer, count, datatype, rank of source, tag,
     * communicator, status (ignored here) */
    MPI_Recv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    printf("Process 1 received number %d from process 0\n", number);
}

Many other functions… MPI_Bcast: send same piece of data to all processes in the group MPI_Scatter: send different pieces of an array to different processes (i.e., partition an array across processes) From: http://mpitutorial.com/tutorials/mpi-scatter-gather-and-allgather/
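A hedged sketch of both calls (my example, not the tutorial's): the root broadcasts the chunk size to everyone and then scatters one chunk of an array to each process; the send buffer only needs to exist at the root.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Broadcast: the same value ends up on every process */
        int chunk = 4;
        MPI_Bcast(&chunk, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* Scatter: the root partitions an array, each process gets one chunk */
        int *full = NULL;
        if (rank == 0) {
            full = malloc(size * chunk * sizeof(int));
            for (int i = 0; i < size * chunk; i++) full[i] = i;
        }
        int *part = malloc(chunk * sizeof(int));
        MPI_Scatter(full, chunk, MPI_INT, part, chunk, MPI_INT, 0, MPI_COMM_WORLD);

        printf("rank %d got elements starting at %d\n", rank, part[0]);

        free(part);
        if (rank == 0) free(full);
        MPI_Finalize();
        return 0;
    }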

Many other functions… MPI_Gather: take elements from many processes and gathers them to one single process E.g., parallel sorting, searching From: http://mpitutorial.com/tutorials/mpi-scatter-gather-and-allgather/
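A short hedged sketch (mine): every process contributes one value and the root collects them in rank order with MPI_Gather.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each process contributes one value... */
        int mine = rank * rank;

        /* ...and the root collects them all, ordered by rank */
        int *all = NULL;
        if (rank == 0) all = malloc(size * sizeof(int));
        MPI_Gather(&mine, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            for (int i = 0; i < size; i++)
                printf("from rank %d: %d\n", i, all[i]);
            free(all);
        }
        MPI_Finalize();
        return 0;
    }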

Many other functions… MPI_Reduce: takes an array of input elements on each process and returns an array of output elements to the root process given a specified operation MPI_Allreduce: Like MPI_Reduce but distribute results to all processes From: http://mpitutorial.com/tutorials/mpi-scatter-gather-and-allgather/
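A hedged sketch (mine) of both calls: summing one contribution per process with MPI_SUM; after MPI_Reduce only the root holds the total, after MPI_Allreduce every process does.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int mine = rank + 1;
        int total = 0;

        /* Reduce: only the root (rank 0) receives the sum 1 + 2 + ... + size */
        MPI_Reduce(&mine, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("sum at root: %d\n", total);

        /* Allreduce: every process receives the same sum */
        MPI_Allreduce(&mine, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        printf("rank %d sees sum %d\n", rank, total);

        MPI_Finalize();
        return 0;
    }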

MPI Discussion. Gives full control to the programmer: exposes the number of processes; communication is explicit, driven by the program. Assumes long-running processes and homogeneous (same performance) processors. Little support for failures, no straggler mitigation. Summary: achieves high performance by hand-optimizing jobs, but requires experts to do so, and offers little support for fault tolerance.

OpenMP Based on the “Introduction to OpenMP” presentation: (webcourse.cs.technion.ac.il/236370/Winter2009.../OpenMPLecture.ppt)

Motivation Multicore CPUs are everywhere: Servers with over 100 cores today Even smartphone CPUs have 8 cores Multithreading, natural programming model All processors share the same memory Threads in a process see same address space Many shared-memory algorithms developed

But… Multithreading is hard Lots of expertise necessary Deadlocks and race conditions Non-deterministic behavior makes it hard to debug

Example. Parallelize the following code using threads:
for (i=0; i<n; i++) {
    sum = sum + sqrt(sin(data[i]));
}
Why hard? Need a mutex to protect the accesses to sum; different code for the serial and parallel versions; no built-in tuning (# of processors?).
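To make the difficulty concrete, here is a hedged pthreads sketch of the same loop (my illustration, not the slide's; NTHREADS and the data setup are assumptions): the thread count is fixed by hand, the index range is split manually, and a mutex protects the shared sum.

    #include <pthread.h>
    #include <math.h>
    #include <stdio.h>

    #define N        1000000
    #define NTHREADS 4              /* chosen by hand; no built-in tuning */

    static double data[N];          /* assumed input; zero-initialized here */
    static double sum = 0.0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        long id = (long)arg;
        long start = id * N / NTHREADS, end = (id + 1) * N / NTHREADS;
        double local = 0.0;
        for (long i = start; i < end; i++)
            local += sqrt(sin(data[i]));
        pthread_mutex_lock(&lock);  /* protect the shared accumulator */
        sum += local;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        printf("sum = %f\n", sum);
        return 0;
    }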

OpenMP A language extension with constructs for parallel programming: Critical sections, atomic access, private variables, barriers Parallelization is orthogonal to functionality If the compiler does not recognize OpenMP directives, the code remains functional (albeit single-threaded) Industry standard: supported by Intel, Microsoft, IBM, HP
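A minimal sketch of that property (my example; assuming gcc, where the flag is -fopenmp and other compilers use different switches): compiled with the flag, the block runs once per thread; compiled without it, the pragma is ignored and the program is still a correct serial program.

    #include <stdio.h>

    int main(void) {
        /* With OpenMP: executed by every thread in the team.
         * Without OpenMP: the pragma is ignored and this runs once. */
        #pragma omp parallel
        printf("hello from a thread\n");
        return 0;
    }

Build as gcc -fopenmp hello_omp.c for the parallel version, or plain gcc hello_omp.c for the serial one (hello_omp.c is just an example file name).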

OpenMP execution model. Fork and join: the master thread spawns a team of worker threads as needed; each parallel region begins with a FORK from the master thread and ends with a JOIN back into it. (Slide diagram: master thread, FORK, worker threads in parallel regions, JOIN, repeated.)

OpenMP memory model Shared memory model Threads communicate by accessing shared variables The sharing is defined syntactically Any variable that is seen by two or more threads is shared Any variable that is seen by one thread only is private Race conditions possible Use synchronization to protect from conflicts Change how data is stored to minimize the synchronization

OpenMP: Work sharing example answer1 = long_computation_1(); answer2 = long_computation_2(); if (answer1 != answer2) { … } How to parallelize?

OpenMP: Work sharing example
answer1 = long_computation_1();
answer2 = long_computation_2();
if (answer1 != answer2) { … }
How to parallelize? The two computations are independent, so each can go in its own section:
#pragma omp parallel sections
{
    #pragma omp section
    { answer1 = long_computation_1(); }
    #pragma omp section
    { answer2 = long_computation_2(); }
}

OpenMP: Work sharing example Sequential code for (int i=0; i<N; i++) { a[i]=b[i]+c[i]; }

OpenMP: Work sharing example
Sequential code:
for (int i=0; i<N; i++) { a[i]=b[i]+c[i]; }
(Semi) manual parallelization:
#pragma omp parallel
{
    int id = omp_get_thread_num();
    int nt = omp_get_num_threads();
    int i_start = id*N/nt, i_end = (id+1)*N/nt;
    for (int i=i_start; i<i_end; i++) { a[i]=b[i]+c[i]; }
}

OpenMP: Work sharing example
Sequential code:
for (int i=0; i<N; i++) { a[i]=b[i]+c[i]; }
(Semi) manual parallelization: launch nt threads; each thread uses the id and nt variables to operate on a different segment of the arrays.
#pragma omp parallel
{
    int id = omp_get_thread_num();
    int nt = omp_get_num_threads();
    int i_start = id*N/nt, i_end = (id+1)*N/nt;
    for (int i=i_start; i<i_end; i++) { a[i]=b[i]+c[i]; }
}

OpenMP: Work sharing example
Sequential code:
for (int i=0; i<N; i++) { a[i]=b[i]+c[i]; }
(Semi) manual parallelization:
#pragma omp parallel
{
    int id = omp_get_thread_num();
    int nt = omp_get_num_threads();
    int i_start = id*N/nt, i_end = (id+1)*N/nt;
    for (int i=i_start; i<i_end; i++) { a[i]=b[i]+c[i]; }
}
Automatic parallelization of the for loop using #parallel for:
#pragma omp parallel
{
    #pragma omp for schedule(static)
    for (int i=0; i<N; i++) { a[i]=b[i]+c[i]; }
}
The loop must be in canonical form: one signed variable in the loop; initialization: var = init; comparison: var op last, where op is <, >, <=, >=; increment: var++, var--, var += incr, var -= incr.

Challenges of #parallel for Load balancing If all iterations execute at the same speed, the processors are used optimally If some iterations are faster, some processors may get idle, reducing the speedup We don’t always know distribution of work, may need to re-distribute dynamically Granularity Thread creation and synchronization takes time Assigning work to threads on per-iteration resolution may take more time than the execution itself Need to coalesce the work to coarse chunks to overcome the threading overhead Trade-off between load balancing and granularity

Schedule: controlling work distribution. schedule(static [, chunksize]): default; chunks of approximately equal size, one to each thread; if there are more chunks than threads, they are assigned round-robin to the threads. Why might one want to use chunks of different sizes? schedule(dynamic [, chunksize]): threads receive chunk assignments dynamically; default chunk size = 1. schedule(guided [, chunksize]): start with large chunks; threads receive chunks dynamically; chunk size decreases exponentially, down to chunksize.
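A short sketch of the clause in use (my example; process_all and work are hypothetical names): dynamic scheduling with a modest chunk size helps when iteration costs vary, at the price of more scheduling overhead than static.

    void process_all(double *work, int n) {
        /* Hand out chunks of 8 iterations to whichever thread is free next;
         * with schedule(static) each thread would get one fixed contiguous block. */
        #pragma omp parallel for schedule(dynamic, 8)
        for (int i = 0; i < n; i++) {
            work[i] = work[i] * work[i];   /* stand-in for uneven per-item work */
        }
    }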

OpenMP: Data Environment. Shared memory programming model: most variables (including locals) are shared by threads.
{
    int sum = 0;
    #pragma omp parallel for
    for (int i=0; i<N; i++)
        sum += i;
}
Global variables are shared. Some variables can be private: variables inside the statement block; variables in the called functions; variables can be explicitly declared as private.

Overriding storage attributes. private: a copy of the variable is created for each thread; there is no connection between the original variable and the private copies; can achieve the same using variables inside { }.
int i;
#pragma omp parallel for private(i)
for (i=0; i<n; i++) { … }
firstprivate: same, but the initial value of the variable is copied from the main copy. lastprivate: same, but the last value of the variable is copied to the main copy.
int idx=1;
int x = 10;
#pragma omp parallel for \
    firstprivate(x) lastprivate(idx)
for (i=0; i<n; i++) {
    if (data[i] == x) idx = i;
}

Reduction
for (j=0; j<N; j++) {
    sum = sum + a[j]*b[j];
}
How to parallelize this code? sum is not private, but accessing it atomically is too expensive. Instead, have a private copy of sum in each thread, then add them up: use the reduction clause.
#pragma omp parallel for reduction(+: sum)
Any associative operator can be used: +, -, ||, |, *, etc. The private value is initialized automatically (to 0, 1, ~0, …).

#pragma omp reduction
float dot_prod(float* a, float* b, int N) {
    float sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

Conclusions. OpenMP: a framework for code parallelization. Available for C, C++, and Fortran. Based on a standard, with implementations from a wide selection of vendors. Relatively easy to use: write (and debug!) code first, parallelize later; parallelization can be incremental; parallelization can be turned off at runtime or compile time; the code is still correct for a serial machine.