High Performance Parallel Programming


High Performance Parallel Programming
Dirk van der Knijff
Advanced Research Computing, Information Division

Lecture 4: Message Passing Interface 3

So far...
- Messages: source, dest, data, tag, communicator
- Communicators: MPI_COMM_WORLD
- Point-to-point communications
  - different modes: standard, synchronous, buffered, ready
  - blocking vs non-blocking
- Derived datatypes: construct, then commit

Ping-pong exercise: program

/**********************************************************************
 * This file has been written as a sample solution to an exercise in a
 * course given at the Edinburgh Parallel Computing Centre. It is made
 * freely available with the understanding that every copy of this file
 * must include this header and that EPCC takes no responsibility for
 * the use of the enclosed teaching material.
 *
 * Authors:  Joel Malard, Alan Simpson
 * Contact:  epcc-tec@epcc.ed.ac.uk
 * Purpose:  A program to experiment with point-to-point communications.
 * Contents: C source code.
 **********************************************************************/

#include <stdio.h>
#include <mpi.h>

#define proc_A 0
#define proc_B 1
#define ping 101
#define pong 101

float buffer[100000];
long float_size;

void processor_A(void), processor_B(void);

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Type_extent(MPI_FLOAT, &float_size);   /* size of one MPI_FLOAT in bytes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == proc_A)
        processor_A();
    else if (rank == proc_B)
        processor_B();

    MPI_Finalize();
    return 0;
}

void processor_A(void)
{
    int i, length;
    MPI_Status status;
    double start, finish, time;
    extern float buffer[100000];
    extern long float_size;

    printf("Length\tTotal Time\tTransfer Rate\n");

    for (length = 1; length <= 100000; length += 1000) {
        start = MPI_Wtime();
        for (i = 1; i <= 100; i++) {
            MPI_Ssend(buffer, length, MPI_FLOAT, proc_B, ping, MPI_COMM_WORLD);
            MPI_Recv(buffer, length, MPI_FLOAT, proc_B, pong, MPI_COMM_WORLD, &status);
        }
        finish = MPI_Wtime();
        time = finish - start;
        /* time/200: average one-way time over 100 round trips;
           rate: total bytes sent in both directions per second */
        printf("%d\t%f\t%f\n", length, time / 200.,
               (float)(2 * float_size * 100 * length) / time);
    }
}

void processor_B(void)
{
    int i, length;
    MPI_Status status;
    extern float buffer[100000];

    for (length = 1; length <= 100000; length += 1000) {
        for (i = 1; i <= 100; i++) {
            MPI_Recv(buffer, length, MPI_FLOAT, proc_A, ping, MPI_COMM_WORLD, &status);
            MPI_Ssend(buffer, length, MPI_FLOAT, proc_A, pong, MPI_COMM_WORLD);
        }
    }
}

Ping-pong exercise: results [plot not reproduced in the transcript]

Ping-pong exercise: results 2 [plot not reproduced in the transcript]

Running ping-pong
- compile: mpicc ping_pong.c -o ping_pong
- submit:  qsub ping_pong.sh, where ping_pong.sh is:

    #PBS -q exclusive
    #PBS -l nodes=2
    cd <your sub_directory>
    mpirun ping_pong

Collective communication
- Communications involving a group of processes
- Called by all processes in a communicator
  - for sub-groups you need to form a new communicator
- Examples:
  - Barrier synchronisation
  - Broadcast, Scatter, Gather
  - Global sum, global maximum, etc.

Characteristics
- Collective action over a communicator
- All processes must communicate
- Synchronisation may or may not occur
- All collective operations are blocking
- No tags
- Receive buffers must be exactly the right size
- Collective communications and point-to-point communications cannot interfere

MPI_Barrier
- Blocks each calling process until all other members of the communicator have also called it
- Generally used to synchronise between phases of a program
- Only one argument - no data is exchanged
  MPI_Barrier(comm)
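
As a minimal sketch of the usual pattern (not from the slides - the printf calls merely stand in for real per-phase work):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        printf("rank %d: phase one done\n", rank);    /* stands in for real work */
        MPI_Barrier(MPI_COMM_WORLD);                  /* no rank starts phase two early */
        printf("rank %d: phase two starting\n", rank);

        MPI_Finalize();
        return 0;
    }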

Broadcast
- Copies data from a specified root process to all other processes in the communicator
- All processes must specify the same root
- Other arguments are the same as for point-to-point; datatypes and sizes must match
  MPI_Bcast(buffer, count, datatype, root, comm)
- Note: MPI does not support a multicast function
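
For illustration, a hedged fragment (rank 0 as root and the i*i values are invented): every process makes the identical call, and afterwards all of them, root included, hold the same data.

    int table[4];
    int rank;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* only the root needs to fill the buffer before the call */
        for (int i = 0; i < 4; i++) table[i] = i * i;
    }
    /* same call on every rank; afterwards every rank holds root's four ints */
    MPI_Bcast(table, 4, MPI_INT, 0, MPI_COMM_WORLD);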

Scatter, Gather
- Scatter and Gather are inverse operations
- Note that all processes take part - even the root
- Scatter: [diagram - before, the root holds the array a b c d e; after, each of the five processes holds one element]

Gather
- Gather: [diagram - before, each of the five processes holds one element; after, the root holds the array a b c d e]

MPI_Scatter, MPI_Gather
  MPI_Scatter(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)
  MPI_Gather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)
- sendcount in scatter and recvcount in gather refer to the size of each individual message (sendtype = recvtype => sendcount = recvcount)
- total type signatures must match
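
A short sketch of MPI_Scatter (the i*i values and root 0 are made up; assumes <mpi.h> and <stdlib.h> are included); note that sendbuf is only significant on the root:

    int gsize, myrank, recvval;
    int *sendbuf = NULL;

    MPI_Comm_size(MPI_COMM_WORLD, &gsize);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0) {
        sendbuf = (int *)malloc(gsize * sizeof(int));
        for (int i = 0; i < gsize; i++) sendbuf[i] = i * i;
    }
    /* sendcount = recvcount = 1: each rank receives one element of root's array */
    MPI_Scatter(sendbuf, 1, MPI_INT, &recvval, 1, MPI_INT, 0, MPI_COMM_WORLD);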

Example

    MPI_Comm comm;
    int gsize, sendarray[100];
    int root, myrank, *rbuf;
    MPI_Datatype rtype;
    ...
    MPI_Comm_rank(comm, &myrank);
    MPI_Comm_size(comm, &gsize);
    /* one receive "element" on the root is a contiguous block of 100 ints */
    MPI_Type_contiguous(100, MPI_INT, &rtype);
    MPI_Type_commit(&rtype);
    if (myrank == root) {
        rbuf = (int *)malloc(gsize * 100 * sizeof(int));
    }
    MPI_Gather(sendarray, 100, MPI_INT, rbuf, 1, rtype, root, comm);

More routines
  MPI_Allgather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm)
  MPI_Alltoall(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm)
[diagram of the Allgather and Alltoall data movement (items a..y) - not reproduced]
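
A hedged fragment to contrast MPI_Allgather with MPI_Gather (myrank and gsize are assumed to be set up as in the previous example; the value myrank*10 is invented): there is no root, so every process receives the complete vector.

    int myval = myrank * 10;                       /* one contribution per rank */
    int *all = (int *)malloc(gsize * sizeof(int));
    /* every rank ends up with all gsize contributions, stored in rank order */
    MPI_Allgather(&myval, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);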

Vector routines
  MPI_Scatterv(sendbuf, sendcounts, displs, sendtype, recvbuf, recvcount, recvtype, root, comm)
  MPI_Gatherv(sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, recvtype, root, comm)
  MPI_Allgatherv(sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, recvtype, comm)
  MPI_Alltoallv(sendbuf, sendcounts, sdispls, sendtype, recvbuf, recvcounts, rdispls, recvtype, comm)
- Allow sends/receives to involve non-contiguous locations in an array and a different count for each process
- Useful if sending different counts at different times
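
A sketch of MPI_Scatterv with uneven counts (an invented decomposition where rank i receives i+1 elements; myrank, gsize and root are assumed to be set up as before). The counts and displacements arrays on the root describe where each destination's block starts:

    int *sendcounts = NULL, *displs = NULL, *sendbuf = NULL;
    int *recvbuf = (int *)malloc((myrank + 1) * sizeof(int));

    if (myrank == root) {
        sendcounts = (int *)malloc(gsize * sizeof(int));
        displs     = (int *)malloc(gsize * sizeof(int));
        int offset = 0;
        for (int i = 0; i < gsize; i++) {
            sendcounts[i] = i + 1;          /* rank i gets i+1 elements    */
            displs[i]     = offset;         /* where rank i's block starts */
            offset       += sendcounts[i];
        }
        sendbuf = (int *)malloc(offset * sizeof(int));
        for (int i = 0; i < offset; i++) sendbuf[i] = i;   /* example data */
    }
    MPI_Scatterv(sendbuf, sendcounts, displs, MPI_INT,
                 recvbuf, myrank + 1, MPI_INT, root, MPI_COMM_WORLD);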

Global reduction routines
- Used to compute a result which depends on data distributed over a number of processes
- Examples:
  - global sum or product
  - global maximum or minimum
  - global user-defined operation
- The operation should be associative
  - aside: floating-point operations are technically not associative; we usually don't care, but this can affect results in parallel programs

Global reduction (cont.)
  MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm)
- combines count elements from each process's sendbuf using op and leaves the results in recvbuf on process root
- e.g. MPI_Reduce(&s, &r, 2, MPI_INT, MPI_SUM, 1, comm)
[diagram - each process contributes a two-element send buffer; the element-wise sums end up in the receive buffer on process 1]
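
A complete fragment matching the call above (the contributions myrank and 2*myrank are invented, and MPI_COMM_WORLD stands in for comm); only the root, rank 1 here, ends up with the sums:

    int s[2] = { myrank, 2 * myrank };    /* this rank's two contributions */
    int r[2] = { 0, 0 };

    MPI_Reduce(s, r, 2, MPI_INT, MPI_SUM, 1, MPI_COMM_WORLD);
    if (myrank == 1)
        printf("element-wise sums: %d %d\n", r[0], r[1]);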

Reduction operators
  MPI_MAX     Maximum
  MPI_MIN     Minimum
  MPI_SUM     Sum
  MPI_PROD    Product
  MPI_LAND    Logical AND
  MPI_BAND    Bitwise AND
  MPI_LOR     Logical OR
  MPI_BOR     Bitwise OR
  MPI_LXOR    Logical XOR
  MPI_BXOR    Bitwise XOR
  MPI_MAXLOC  Maximum value and location
  MPI_MINLOC  Minimum value and location
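
MPI_MAXLOC and MPI_MINLOC operate on (value, location) pairs; a hedged sketch using the predefined pair type MPI_DOUBLE_INT (local_result is a hypothetical value computed earlier on each rank):

    struct { double val; int rank; } in, out;

    in.val  = local_result;    /* hypothetical per-rank value */
    in.rank = myrank;
    MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);
    /* on rank 0: out.val is the global maximum and out.rank the rank that owned it */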

User-defined operators
- In C the operator is defined as a function of type
  typedef void MPI_User_function(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype);
- In Fortran you must write a function
  function <user_function>(invec(*), inoutvec(*), len, type)
- The function must have the following schema:
  for (i = 1 to len) inoutvec(i) = inoutvec(i) op invec(i)
- Then MPI_Op_create(user_function, commute, op) returns a handle op of type MPI_Op
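
A sketch of the C side (the absolute-maximum operator, the name absmax and the buffers sendbuf/recvbuf are invented; the function only handles MPI_DOUBLE data):

    #include <math.h>

    /* element-wise "largest absolute value"; matches the MPI_User_function type */
    void absmax(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype)
    {
        double *in = (double *)invec, *inout = (double *)inoutvec;
        for (int i = 0; i < *len; i++)
            if (fabs(in[i]) > fabs(inout[i])) inout[i] = in[i];
    }
    ...
    MPI_Op absmax_op;
    MPI_Op_create(absmax, 1, &absmax_op);   /* 1 => the operation commutes */
    MPI_Reduce(sendbuf, recvbuf, n, MPI_DOUBLE, absmax_op, 0, MPI_COMM_WORLD);
    MPI_Op_free(&absmax_op);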

Variants
  MPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm)
- all processes involved receive identical results
  MPI_Reduce_scatter(sendbuf, recvbuf, recvcounts, datatype, op, comm)
- acts as if a reduce was performed and then each process receives recvcounts[myrank] elements of the result
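
A typical MPI_Allreduce use is a global sum that every rank needs, e.g. for a convergence test; a hedged fragment (the per-rank value is invented):

    double local_sum = (double)myrank;   /* stands in for a real partial result */
    double global_sum;

    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    /* no root argument: every rank now holds the same global_sum */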

Reduce-scatter

    int *s, *r;
    int rc[] = { 1, 2, 0, 1, 1 };   /* how many elements of the result each rank keeps */
    int rank, gsize;
    ...
    MPI_Reduce_scatter(s, r, rc, MPI_INT, MPI_SUM, comm);

[diagram of the summed vector being split across the processes according to rc - not reproduced]

Scan
  MPI_Scan(sendbuf, recvbuf, count, datatype, op, comm)
- Performs an inclusive prefix reduction on data across the group: recvbuf on process myrank holds op applied to the sendbuf values of processes 0..myrank
- e.g. MPI_Scan(&s, &r, 5, MPI_INT, MPI_SUM, comm);
[diagram of the running sums - not reproduced]
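
A small sketch of an inclusive prefix sum, one int per rank (the contribution myrank+1 is invented):

    int myval = myrank + 1;    /* rank 0 contributes 1, rank 1 contributes 2, ... */
    int prefix;

    MPI_Scan(&myval, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    /* rank 0 -> 1, rank 1 -> 3, rank 2 -> 6, ...: each rank receives the sum
       over ranks 0..myrank, including its own contribution */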

Further topics
- Error handling
  - errors are handled by an error handler
  - MPI_ERRORS_ARE_FATAL - the default for MPI_COMM_WORLD
  - MPI_ERRORS_RETURN - errors return a code; the state of MPI is undefined after an error
  - MPI_Error_string(errorcode, string, resultlen) converts an error code to a message
- Message probing
  - messages can be probed, both blocking and non-blocking
  - note: after a wildcard probe, a separate wildcard receive may pick up a different message
- Persistent communications
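
As a hedged illustration of the error-handling bullets (buf, count, dest and tag are placeholders; MPI_ERRORS_RETURN is installed here with the MPI-1 call MPI_Errhandler_set):

    /* switch MPI_COMM_WORLD away from the default MPI_ERRORS_ARE_FATAL */
    MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int err = MPI_Send(buf, count, MPI_INT, dest, tag, MPI_COMM_WORLD);
    if (err != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(err, msg, &len);
        fprintf(stderr, "MPI_Send failed: %s\n", msg);
    }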

Assignment 2
- Write a general procedure to multiply two matrices
- Start with http://www.hpc.unimelb.edu.au/cs/assignment2/
  - this is a harness for last year's assignment
  - last year I asked them to optimise first; this year, just parallelise
- Next Tuesday I will discuss strategies
  - that doesn't mean don't start now...
  - ideas are available in various places...

Tomorrow - matrix multiplication