MPI - A User-level Message-passing Interface K. Ramakrishna Shenai

History of MPI
Principal branches of MPI: the MPI 1.2 version and the MPI 2.0 version.
An MPI process is normally written in C (or C++) or Fortran and linked with the MPI libraries.

MPI Basic Functionality
An MPI program consists of a set of processes, each occupying a separate, unshared address space, which communicate data by invoking MPI procedures. MPI-2 adds the ability to create and destroy processes dynamically.

MPI Basic Functionality (contd.)
MPI processes can run the same executable file in different address spaces, or individual processes can be instances of different executables built from different source files and linked together. Different command-line arguments can be passed to different processes.
Rank in the MPI_COMM_WORLD communicator: each process has a unique rank in the range [0 .. p-1].

/* Example program: Hello world */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, partner;
    int namelen;
    char name[MPI_MAX_PROCESSOR_NAME];
    char greeting[sizeof(name) + 100];

    MPI_Init(&argc, &argv);                    /* Initialize MPI */
    MPI_Comm_size(MPI_COMM_WORLD, &size);      /* How many processes are there? */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* Which one am I? */
    MPI_Get_processor_name(name, &namelen);    /* Where am I running? */

    sprintf(greeting, "hello world %d of %d running on %s\n", rank, size, name);
    if (rank == 0) {
        fputs(greeting, stdout);
        for (partner = 1; partner < size; partner++) {
            MPI_Status stat;
            MPI_Recv(greeting, sizeof(greeting), MPI_BYTE, partner, 1,
                     MPI_COMM_WORLD, &stat);
            fputs(greeting, stdout);
        }
    } else {
        MPI_Send(greeting, strlen(greeting) + 1, MPI_BYTE, 0, 1, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    exit(0);
}

Basic Concepts: MPI_Init and MPI_Finalize
In C, MPI_Init should be passed the addresses of argc and argv. This procedure must be called before any other MPI procedure.
MPI_Finalize is the terminating procedure; after it returns, no other MPI procedure may be called.

Basic Concepts: Communicators
Most MPI communication procedures require the caller to specify a communicator argument, which defines the context in which communication takes place. In C, communicators have the type MPI_Comm. The number of processes in a communicator is obtained with MPI_Comm_size, and a process's rank within it with MPI_Comm_rank.
MPI_Get_processor_name is a procedure used to determine which physical processor the current process is running on; it usually returns the system's Internet hostname.

Basic Concepts: MPI_Send and MPI_Recv
These are the point-to-point communication primitives: messages are transmitted directly from the sender's memory to the receiver's memory.
C prototypes:
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
These are mainly blocking communication calls, although in certain cases a non-blocking mode of communication is possible. Other communication modes (buffered, synchronous, and ready) exist.
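As a minimal illustrative sketch (not from the original slides; it assumes the program is run with at least two processes), rank 0 sends a single integer, tagged 99, to rank 1:

/* Illustrative sketch: rank 0 sends one int to rank 1 */
#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[]) {
    int rank, value = 42;               /* arbitrary payload, chosen for illustration */
    MPI_Status stat;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        MPI_Send(&value, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);
    else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &stat);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}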

Basic Concepts
MPI_Status structure: can be examined to determine the source and tag of a received message.
Loosely synchronous operations: all processes engaged in a communication operation must call the corresponding MPI procedure at the same logical point in execution, with respect to all other MPI communication calls.
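As a hedged fragment illustrating the status structure (MPI_ANY_SOURCE, MPI_ANY_TAG, the MPI_SOURCE and MPI_TAG fields, and MPI_Get_count are standard MPI; the buffer and the assumption that some process sends a matching message are purely illustrative):

/* Sketch: receive from any sender, then inspect the status */
char buf[100];
MPI_Status stat;
int nbytes;
MPI_Recv(buf, sizeof(buf), MPI_BYTE, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &stat);
MPI_Get_count(&stat, MPI_BYTE, &nbytes);      /* how many bytes actually arrived */
printf("got %d bytes from rank %d with tag %d\n",
       nbytes, stat.MPI_SOURCE, stat.MPI_TAG);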

Basic Concepts: Balancing Sends and Receives

/* Incorrect: MAY deadlock, depending on buffering */
MPI_Send(out, count, type, partner, tag, comm);
MPI_Recv(in, count, type, partner, tag, comm, &stat);

/* Correct: each send is matched by exactly one receive */
if (rank > partner) {
    MPI_Send(out, count, type, partner, tag, comm);
    MPI_Recv(in, count, type, partner, tag, comm, &stat);
} else if (rank < partner) {
    MPI_Recv(in, count, type, partner, tag, comm, &stat);
    MPI_Send(out, count, type, partner, tag, comm);
} else {                          /* rank == partner: copy locally */
    MPI_Type_size(type, &sz);
    memcpy(in, out, count * sz);  /* contents of stat are not set */
}

/* Also correct: use the single MPI_Sendrecv procedure */
MPI_Sendrecv(out, count, type, partner, tag,
             in, count, type, partner, tag, comm, &stat);

Basic Concepts
MPI data types: the elementary MPI data types in C are MPI_CHAR, MPI_SHORT, MPI_INT, MPI_LONG, MPI_UNSIGNED_CHAR, MPI_UNSIGNED_SHORT, MPI_UNSIGNED, MPI_UNSIGNED_LONG, MPI_FLOAT, MPI_DOUBLE, MPI_LONG_DOUBLE, and MPI_BYTE.
Use of standard libraries: MPI does not alter any standard libraries. Calls to external libraries (especially I/O libraries) should be made from a distinguished process, usually the rank-0 process. LAM and MPICH are both quite flexible when it comes to handling such libraries.
Errors in MPI

Compiling with MPI on the Beowulf
MPICH is installed in a public location such as /usr/local/mpich (referred to below as MPIROOT). All user-oriented utilities are stored in its bin/ directory, header files in include/, and libraries in lib/LINUX/ch_p4/. The bin/ directory contains mpicc, a script for compiling C programs with the appropriate options.
Two ways of compiling a C program to produce the executable:
MPIROOT=/usr/local/mpi
mpicc hello.c -o hello
cc -o hello -I$MPIROOT/include hello.c -L$MPIROOT/lib/LINUX/ch_p4 -lmpich

Running an MPI Program: the mpirun utility
Options:
-np : start the program with the specified number of processes.
-machinefile : explicitly specify the file listing the processors to use.
-nolocal : force the first process to execute on the first processor specified in the machine file; by default, mpirun starts process 0 on the processor that called mpirun.
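For example, a typical invocation might look like the following (the machine-file name hosts and the executable hello are illustrative; the options are the MPICH ones listed above):

mpirun -np 4 -machinefile hosts ./hello

This starts four processes of hello on the machines listed in hosts.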

Parallel Data Structures with MPI
MPI provides a simple memory model over which parallel data structures can be created. Every symbol is actually a parallel object that exists on every process; the symbol can take different values in different processes.
Example implementation: a parallel array, described by the parray structure:

typedef struct {
    int nelem_local;        /* the number of elements in local memory */
    void *data;
    MPI_Comm comm;
    int first_index_local;  /* global index of the first element of the local data
                               (whose local index is 0) */
    int nelem_global;       /* aggregate number of elements */
    size_t elem_size;
    int commrank, commsize;
} parray;

/* Procedure to create a self-consistent parallel array from information
   provided by each process about its own local data */
void pacreate(parray *pa, void *data, int nelem, size_t size, MPI_Comm comm)
{
    int last_index;
    pa->nelem_local = nelem;
    pa->data = data;
    pa->comm = comm;
    pa->elem_size = size;
    /* My first element is the sum of the numbers of elements in lower-ranked procs */
    MPI_Scan(&pa->nelem_local, &last_index, 1, MPI_INT, MPI_SUM, pa->comm);
    pa->first_index_local = last_index - pa->nelem_local;
    MPI_Comm_size(pa->comm, &pa->commsize);
    MPI_Comm_rank(pa->comm, &pa->commrank);
    /* The global element count is the last index on the highest-ranked process
       (commsize - 1); use Bcast to distribute it */
    MPI_Bcast(&last_index, 1, MPI_INT, pa->commsize - 1, pa->comm);
    pa->nelem_global = last_index;
}

Example: A Parallel Array
The parray structure automatically exists on every process once it is defined.
Collective communication:
MPI_Bcast sends the contents of a buffer on one process to every other process in the communicator.
MPI_Scan performs an operation (such as '+') on values supplied by each process, such that the result returned on process i is the result of the operation applied to the values supplied by every process of rank 0 to i.
MPI predefines reduction/scan operations such as MPI_MAX, MPI_SUM, MPI_PROD, and MPI_MAXLOC (which returns an extremum together with its location, e.g. an array index or rank), etc.
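As a hedged sketch of MPI_MAXLOC (the MPI_DOUBLE_INT pair type is standard MPI; local_value and myrank are assumed to have been set elsewhere), each process contributes a value together with its rank, and every process learns the maximum and where it came from:

/* Illustrative MAXLOC reduction: find the largest per-process value and its owner */
struct { double val; int rank; } in, out;
in.val  = local_value;    /* some per-process quantity (assumed defined) */
in.rank = myrank;         /* rank obtained earlier with MPI_Comm_rank */
MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, MPI_COMM_WORLD);
/* out.val is the global maximum, out.rank the rank that supplied it */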

Example: A Parallel Array (contd.)
MPI_Allreduce computes the result of an operation applied to a value supplied by every process; the result is returned to every process.
Other collectives include MPI_Scan, MPI_Gather, and MPI_Scatter. Some collectives are equivalent to a combination of two or more simpler operations; for example, MPI_Allreduce is equivalent to MPI_Reduce followed by MPI_Bcast.

/* Procedure to compute the mean of the elements in a parallel array of doubles */
double parraymean(parray *pa)
{
    double sum, sumall;
    int i;
    /* Compute the local sum of elements */
    sum = 0;
    for (i = 0; i < pa->nelem_local; i++) {
        sum = sum + ((double *)pa->data)[i];
    }
    /* Use Allreduce to get the global sum */
    MPI_Allreduce(&sum, &sumall, 1, MPI_DOUBLE, MPI_SUM, pa->comm);
    return sumall / pa->nelem_global;
}

A One-dimensional Cellular Automaton
One-dimensional cellular automata (CA) are simple dynamic systems where the placement of data is meaningful, requiring point-to-point communication.
A CA with half-width hw is an array of values with an update rule stating that the next value in location i depends only on the previous values in locations (i-hw, ..., i, ..., i+hw). The values of a cellular automaton can be integers, bits, etc. The update rule can be an arbitrary function of the 2*hw+1 input values; special cases include linear functions and functions like "parity", which count the number of values in the input neighborhood.
Structure definition of the one-dimensional cellular automaton:

typedef struct ca_s {
    unsigned char *state;   /* size ncells */
    int A;
    int hw;
    int ncells;
    unsigned char *old;     /* size ncells + 2*hw */
    /* other fields that control updateRule */
} CA_t;

/* Updating a cellular automaton */
void Caiterate(CA_t *ca, int ntimes)
{
    int i;
    for (i = 0; i < ntimes; i++) {
        Cacopystate(ca);
        Caupdate(ca);
    }
}

static void Caupdate(CA_t *ca)
{
    int n = ca->ncells;
    unsigned char *oldcenter = ca->old;
    unsigned char *new = ca->state;
    while (n-- > 0) {
        *new++ = updateRule(ca, oldcenter++);
    }
}

static void Cacopystate(CA_t *ca)
{
    memcpy(ca->old, ca->state, ca->ncells);
    /* Now for periodic boundary conditions */
    memcpy(ca->old - ca->hw, ca->state + (ca->ncells - ca->hw), ca->hw);
    memcpy(ca->old + ca->ncells, ca->state, ca->hw);
}
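The slides name updateRule but do not show it. As a purely hypothetical sketch, a parity-style rule for an arbitrary half-width could look like this; it relies on the old array being padded by hw cells on each side, as Cacopystate arranges:

/* Hypothetical update rule: new cell = parity (sum mod 2) of the 2*hw+1 old neighbours */
static unsigned char updateRule(CA_t *ca, unsigned char *oldcenter)
{
    int j, sum = 0;
    for (j = -ca->hw; j <= ca->hw; j++)
        sum += oldcenter[j];
    return (unsigned char)(sum % 2);
}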

A One-dimensional Cellular Automaton (contd.)
The two operations in Caiterate: Cacopystate copies the contents of the state array into the old array so that new values can be written into state; Caupdate then computes the new state from the values of the old array. The periodic boundary conditions are imposed by padding the old array on both ends with hw values copied from the opposite end of the state array.
[Figure: the old array with hw cells of padding at each end, copied from the opposite ends of the state array; the arrows indicate calls to memcpy.]

Parallelizing the Problem
General guideline: keep as much of the sequential code intact as possible. The order of the elements, and the relationship between elements stored on different processors, is important; boundary data must be exchanged between neighbours.
Modifications: an MPI_Comm element is added to the CA structure, and the boundary calls to memcpy in Cacopystate are replaced by MPI_Sendrecv.

typedef struct ca_s {
    unsigned char *state;   /* size ncells */
    int A;
    int hw;
    int ncells;
    unsigned char *old;     /* size ncells + 2*hw */
    MPI_Comm comm;
    /* other fields that control updateRule */
} CA_t;

Parallelizing the Problem (contd.)

static void Cacopystate(CA_t *ca)
{
    int up_neighbor, down_neighbor, myrank, nproc;
    MPI_Status stat;

    memcpy(ca->old, ca->state, ca->ncells);
    MPI_Comm_rank(ca->comm, &myrank);
    MPI_Comm_size(ca->comm, &nproc);
    up_neighbor = (myrank + 1) % nproc;
    down_neighbor = (myrank + nproc - 1) % nproc;
    /* Send my top hw cells up; receive the lower padding from my down neighbour */
    MPI_Sendrecv(ca->state + (ca->ncells - ca->hw), ca->hw, MPI_BYTE,
                 up_neighbor, CA_UPTAG,
                 ca->old - ca->hw, ca->hw, MPI_BYTE,
                 down_neighbor, CA_UPTAG, ca->comm, &stat);
    /* Send my bottom hw cells down; receive the upper padding from my up neighbour */
    MPI_Sendrecv(ca->state, ca->hw, MPI_BYTE,
                 down_neighbor, CA_DOWNTAG,
                 ca->old + ca->ncells, ca->hw, MPI_BYTE,
                 up_neighbor, CA_DOWNTAG, ca->comm, &stat);
}

[Figure: in the parallel version, when the state of the CA is copied to old, it is padded at both ends with data from the neighbouring processors. Two processors are shown; the diagonal arrows represent data transfers by MPI_Sendrecv, and the dotted arrows represent memcpy.]

Considerations
Domain decomposition and boundaries: build the parallel version of a problem as a collection of near-copies of the sequential implementation, with special boundary conditions derived by communication with neighbours.
Start with a good sequential implementation: fundamental algorithms are basically designed sequentially by default, so the changes required for parallelization should be as small as possible. For the CA, it is a matter of obtaining boundary conditions from the neighbouring processors. Avoid changing the fundamental sequential parts of the code as far as possible.

Considerations (contd.)
Communication should be treated as a separate phase from local computation; it has a major effect on performance.
Performance considerations: communication is slow, and communication latency tends to dominate the timing. Even if the local computation is fast, the eventual performance is reduced by the time spent in communication.

Considerations (contd.)
Time taken to complete one iteration of an N-cell automaton with P processors:

t_step = 2*t_latency + (N/P)*t_update = (N*t_update/P) * (1 + 2*t_latency / ((N/P)*t_update))

The first factor is just the time taken on one processor divided by P, i.e. a perfect speedup result. The last term represents how much worse than perfect the actual implementation is: it is the ratio of the time spent in communication, 2*t_latency, to the time spent in CAupdate, (N/P)*t_update.
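As a purely hypothetical worked example (the numbers are illustrative, not from the slides): with t_latency = 100 us, t_update = 0.1 us, N = 10^6 cells and P = 8 processors, the local work per step is (N/P)*t_update = 125,000 * 0.1 us = 12.5 ms, while the communication term is 2*t_latency = 0.2 ms; the overhead ratio is 0.2/12.5 = 0.016, so each step runs about 1.6% slower than a perfect speedup. Shrinking N/P (more processors on the same problem) makes this ratio, and hence the relative cost of communication, grow.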

MPI Advanced Features: Blocking and Non-blocking Calls and Alternative Sending Modes
Blocking calls always wait for the requested action to complete before returning control to the caller. Non-blocking calls initiate an operation that proceeds in the background; they are used to avoid unnecessary copying between system buffers and user memory.
Send modes: standard-mode sends, buffered-mode sends, synchronous-mode sends.
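As a hedged sketch of the non-blocking interface (MPI_Isend, MPI_Irecv, MPI_Wait and MPI_Request are standard MPI; the buffers in and out, count, partner and tag are assumed to be set up as in the earlier send/receive examples):

/* Illustrative non-blocking exchange: post the receive, start the send,
   do other work, then wait for both to complete */
MPI_Request sreq, rreq;
MPI_Status  stat;
MPI_Irecv(in,  count, MPI_BYTE, partner, tag, MPI_COMM_WORLD, &rreq);
MPI_Isend(out, count, MPI_BYTE, partner, tag, MPI_COMM_WORLD, &sreq);
/* ... useful local computation can overlap the communication here ... */
MPI_Wait(&sreq, &stat);
MPI_Wait(&rreq, &stat);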

MPI Advanced Features: Derived Data Types and Intercommunicators
Derived data types: MPI includes a powerful set of procedures that can be used to refer to non-contiguous data, for example data of heterogeneous type such as an integer followed by a set of floating-point values. Latency is a major problem on Beowulf systems, especially when many small items of data must be transferred; one alternative is to pack multiple logical messages into a single physical message and incur the startup cost only once.
Intercommunicators: ordinarily, communicators are derived by selecting subsets of MPI_COMM_WORLD, the group of all active processes, but sometimes the resulting logical structure is insufficient. It is then desirable to construct a new communicator from the union of two existing communicators; these are called intercommunicators. They are used in applications that rely on functional parallelism.
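A hedged sketch of a derived data type for the heterogeneous case mentioned above, an int followed by floating-point values (MPI_Get_address, MPI_Type_create_struct and MPI_Type_commit are the standard MPI-2 names; the particle struct and its layout are illustrative):

/* Illustrative derived type: one int followed by 4 doubles, sent as a single message */
struct particle { int id; double coords[4]; } p;
MPI_Datatype particle_type;
int          blocklens[2] = { 1, 4 };
MPI_Aint     displs[2], base;
MPI_Datatype types[2]     = { MPI_INT, MPI_DOUBLE };

MPI_Get_address(&p, &base);
MPI_Get_address(&p.id, &displs[0]);
MPI_Get_address(&p.coords[0], &displs[1]);
displs[0] -= base;  displs[1] -= base;      /* displacements relative to the struct */

MPI_Type_create_struct(2, blocklens, displs, types, &particle_type);
MPI_Type_commit(&particle_type);
/* e.g. MPI_Send(&p, 1, particle_type, dest, tag, MPI_COMM_WORLD); */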

MPI-2
Process creation and management: processes can be dynamically created and destroyed.
One-sided communication: allows either the sender or the receiver alone to set the parameters for communication. This reduces complexity and can exploit fast communication mechanisms (see the sketch after this list).
Extended collective operations: new procedures for collective operations, including on intercommunicators.
External interfaces: users and library designers can access the internals of MPI's objects, allowing new functionality to be layered atop MPI's internal data structures.
New language bindings for C++ and Fortran.
I/O: MPI-IO describes the application side of an abstract model of parallel I/O, i.e. how to define requests to an abstract parallel I/O device from a distributed-memory parallel program.
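As a hedged illustration of MPI-2 one-sided communication (MPI_Win_create, MPI_Put, MPI_Win_fence and MPI_Win_free are standard MPI-2 calls; the values and ranks are illustrative, and rank is assumed to come from MPI_Comm_rank with at least two processes running):

/* Illustrative one-sided transfer: rank 0 puts a value directly into rank 1's window */
double buf = 3.14, remote = 0.0;
MPI_Win win;
MPI_Win_create(&remote, sizeof(double), sizeof(double),
               MPI_INFO_NULL, MPI_COMM_WORLD, &win);
MPI_Win_fence(0, win);                       /* open an access epoch */
if (rank == 0)
    MPI_Put(&buf, 1, MPI_DOUBLE, 1 /* target rank */, 0 /* displacement */,
            1, MPI_DOUBLE, win);
MPI_Win_fence(0, win);                       /* complete the transfer */
MPI_Win_free(&win);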