National Center for Supercomputing Applications MPI for better scalability & application performance Byoung-Do Kim, Ph.D. National Center for Supercomputing Applications University of Illinois at Urbana-Champaign Seungdo Hong Dept. of Mechanical Engineering Pusan National University, Pusan, Korea

National Center for Supercomputing Applications Outline MPI basics MPI collective communication MPI datatypes Data parallelism: domain decomposition Algorithm Implementation Examples Conclusion

National Center for Supercomputing Applications MPI Basics MPI_Init starts up the MPI runtime environment at the beginning of a run. MPI_Finalize shuts down the MPI runtime environment at the end of a run. MPI_Comm_size gets the number of processes in a run, Np (typically called just after MPI_Init ). MPI_Comm_rank gets the process ID of the current process, which is between 0 and Np-1 inclusive (typically called just after MPI_Init ).

National Center for Supercomputing Applications MPI example code in Fortran
PROGRAM my_mpi_program
  IMPLICIT NONE
  INCLUDE "mpif.h"
  [other includes]
  INTEGER :: my_rank, num_procs, mpi_error_code
  [other declarations]
  CALL MPI_Init(mpi_error_code)                                  !! Start up MPI
  CALL MPI_Comm_rank(MPI_COMM_WORLD, my_rank, mpi_error_code)
  CALL MPI_Comm_size(MPI_COMM_WORLD, num_procs, mpi_error_code)
  [actual work goes here]
  CALL MPI_Finalize(mpi_error_code)                              !! Shut down MPI
END PROGRAM my_mpi_program

National Center for Supercomputing Applications MPI example code in C
#include <stdio.h>
#include "mpi.h"
[other includes]

int main (int argc, char* argv[])
{ /* main */
  int my_rank, num_procs, mpi_error;
  [other declarations]
  MPI_Init(&argc, &argv);                    /* Start up MPI */
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
  [actual work goes here]
  MPI_Finalize();                            /* Shut down MPI */
  return 0;
} /* main */

National Center for Supercomputing Applications How an MPI Run Works Every process gets a copy of the executable: Single Program, Multiple Data (SPMD). They all start executing it. Each looks at its own rank to determine which part of the problem to work on. Each process works completely independently of the other processes, except when communicating.
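A minimal sketch of this SPMD pattern (the loop bound N and the simple block split are illustrative assumptions, not from the slides): every process runs the same code, and the rank alone decides which iterations it owns.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    int my_rank, num_procs;
    const int N = 1000;                  /* hypothetical total amount of work */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

    /* Every process executes this same program; the rank determines which
       block of iterations this process works on (last rank takes the remainder). */
    int chunk = N / num_procs;
    int start = my_rank * chunk;
    int end   = (my_rank == num_procs - 1) ? N : start + chunk;

    printf("Rank %d of %d works on iterations %d..%d\n",
           my_rank, num_procs, start, end - 1);

    MPI_Finalize();
    return 0;
}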

National Center for Supercomputing Applications Send & Receive MPI_SEND(buf, count, datatype, dest, tag, comm) MPI_RECV(buf, count, datatype, source, tag, comm, status) When MPI sends a message, it doesn’t just send the contents; it also sends an “envelope” describing the contents: buf: initial address of the send (or receive) buffer count: number of entries to send datatype: datatype of each entry source: rank of the sending process dest: rank of the receiving process tag: message ID comm: communicator (e.g., MPI_COMM_WORLD )
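A minimal point-to-point sketch (assumed two-process run; the buffer contents and the tag value 99 are illustrative only): rank 0 sends four doubles to rank 1, whose receive matches the envelope.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    int my_rank;
    double buf[4] = {1.0, 2.0, 3.0, 4.0};   /* send/receive buffer */
    const int tag = 99;                      /* illustrative message ID */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    if (my_rank == 0) {
        /* envelope: 4 doubles, destination rank 1, tag 99, MPI_COMM_WORLD */
        MPI_Send(buf, 4, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    } else if (my_rank == 1) {
        MPI_Status status;
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &status);
        printf("Rank 1 received %g %g %g %g\n", buf[0], buf[1], buf[2], buf[3]);
    }

    MPI_Finalize();
    return 0;
}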

National Center for Supercomputing Applications MPI_SENDRECV MPI_Sendrecv(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status) Useful for communication patterns where each node both sends and receives messages. Executes a blocking send and a blocking receive in one operation. Both halves use the same communicator but have distinct tag arguments.
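A hedged sketch of this pattern, assuming a simple ring shift in MPI_COMM_WORLD: each rank passes its own rank number to its right neighbor and receives from its left neighbor in a single call.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    int my_rank, num_procs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

    int right = (my_rank + 1) % num_procs;              /* destination */
    int left  = (my_rank - 1 + num_procs) % num_procs;  /* source */
    int sendval = my_rank, recvval = -1;
    MPI_Status status;

    /* Blocking send and receive in one call avoids the deadlock that a
       naive MPI_Send/MPI_Recv ordering could cause in a ring. */
    MPI_Sendrecv(&sendval, 1, MPI_INT, right, 0,
                 &recvval, 1, MPI_INT, left,  0,
                 MPI_COMM_WORLD, &status);

    printf("Rank %d received %d from rank %d\n", my_rank, recvval, left);

    MPI_Finalize();
    return 0;
}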

National Center for Supercomputing Applications Collective Communication Broadcast ( MPI_Bcast ) –A single proc sends the same data to every proc Reduction ( MPI_Reduce ) –All the procs contribute data that is combined using a binary operation (min, max, sum, etc.); one proc obtains the final answer Allreduce ( MPI_Allreduce ) –Same as MPI_Reduce, but every proc receives the final answer Gather ( MPI_Gather ) –Collects the data from every proc and stores it on proc root Scatter ( MPI_Scatter ) –Splits the data on proc root into Np segments, one per proc
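A small sketch combining two of these operations (the parameter value and root rank 0 are assumptions): the root broadcasts a problem size, every proc computes a partial sum, and MPI_Reduce combines the partial sums on the root.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    int my_rank, num_procs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

    int n = 0;
    if (my_rank == 0) n = 100;                 /* root sets a run parameter */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* each rank sums its own share of 1..n */
    double partial = 0.0, total = 0.0;
    for (int i = my_rank + 1; i <= n; i += num_procs)
        partial += (double) i;

    /* combine partial sums with the binary operation MPI_SUM; root 0 gets
       the answer (MPI_Allreduce would leave it on every rank instead) */
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (my_rank == 0) printf("Sum of 1..%d = %.0f\n", n, total);

    MPI_Finalize();
    return 0;
}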

National Center for Supercomputing Applications MPI Datatypes

  C         MPI (C)        Fortran 90          MPI (Fortran 90)
  char      MPI_CHAR       CHARACTER           MPI_CHARACTER
  int       MPI_INT        INTEGER             MPI_INTEGER
  float     MPI_FLOAT      REAL                MPI_REAL
  double    MPI_DOUBLE     DOUBLE PRECISION    MPI_DOUBLE_PRECISION

MPI supports several other data types, but most are variations of these, and probably these are all you’ll use.

National Center for Supercomputing Applications Data packaging Use an MPI derived datatype constructor if the data to be transmitted consists of a subset of the entries in an array MPI_Type_contiguous : builds a derived type whose elements are contiguous entries in an array MPI_Type_vector : for equally spaced entries MPI_Type_indexed : for arbitrarily spaced entries of an array

National Center for Supercomputing Applications MPI_Type_vector
MPI_TYPE_VECTOR(count, blocklength, stride, oldtype, newtype)
IN count: number of blocks (int)
IN blocklength: number of elements in each block (int)
IN stride: spacing between the start of each block, measured in elements (int)
IN oldtype: old datatype (handle)
OUT newtype: new datatype (handle)
[Figure: count blocks of blocklength oldtype elements, with block starts separated by stride elements]
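A hedged usage sketch of MPI_Type_vector, assuming we want to send one column of a 4x6 row-major C array (count = 4 blocks, blocklength = 1 element, stride = 6 elements; all names are illustrative):

#include <stdio.h>
#include "mpi.h"

#define NROWS 4
#define NCOLS 6

int main(int argc, char* argv[])
{
    int my_rank;
    double a[NROWS][NCOLS];
    MPI_Datatype column_t;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* one element from each of NROWS rows, NCOLS elements apart: a column */
    MPI_Type_vector(NROWS, 1, NCOLS, MPI_DOUBLE, &column_t);
    MPI_Type_commit(&column_t);

    if (my_rank == 0) {
        for (int i = 0; i < NROWS; i++)
            for (int j = 0; j < NCOLS; j++)
                a[i][j] = 10.0 * i + j;
        /* send column 2: start at &a[0][2], one unit of the derived type */
        MPI_Send(&a[0][2], 1, column_t, 1, 0, MPI_COMM_WORLD);
    } else if (my_rank == 1) {
        double col[NROWS];
        MPI_Recv(col, NROWS, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 got column: %g %g %g %g\n", col[0], col[1], col[2], col[3]);
    }

    MPI_Type_free(&column_t);
    MPI_Finalize();
    return 0;
}

Note that the receiver may describe the same data differently (here as 4 contiguous doubles); only the type signatures of send and receive must match.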

National Center for Supercomputing Applications Virtual Topology MPI_Cart_create(comm_old, ndims, dims, periods, reorder, comm_cart) –Describes a Cartesian structure of arbitrary dimension –Creates a new communicator that contains information on the structure of the Cartesian topology –Returns a handle to the new communicator with the topology information attached MPI_Cart_rank(comm, coords, rank) MPI_Cart_coords(comm, rank, maxdims, coords) MPI_Cart_shift(comm, direction, disp, rank_source, rank_dest)
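A minimal sketch of building a 2-D periodic Cartesian topology and locating the four neighbors (the use of MPI_Dims_create and all variable names are illustrative assumptions):

#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    int my_rank, num_procs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

    int dims[2] = {0, 0};                /* let MPI pick a 2-D factorization */
    int periods[2] = {1, 1};             /* periodic in both directions */
    int coords[2], south, north, west, east;
    MPI_Comm cart_comm;

    MPI_Dims_create(num_procs, 2, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart_comm);

    MPI_Comm_rank(cart_comm, &my_rank);               /* rank may be reordered */
    MPI_Cart_coords(cart_comm, my_rank, 2, coords);
    MPI_Cart_shift(cart_comm, 0, 1, &south, &north);  /* neighbors along dim 0 */
    MPI_Cart_shift(cart_comm, 1, 1, &west, &east);    /* neighbors along dim 1 */

    printf("Rank %d at (%d,%d): N=%d S=%d E=%d W=%d\n",
           my_rank, coords[0], coords[1], north, south, east, west);

    MPI_Comm_free(&cart_comm);
    MPI_Finalize();
    return 0;
}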

National Center for Supercomputing Applications Application: 3-D Heat Conduction Problem Solving the heat conduction equation with TDMA (Tri-Diagonal Matrix Algorithm)
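For reference, a hedged serial sketch of the TDMA (Thomas algorithm) kernel named above; the actual solver used in the application is not shown on the slides, and this version assumes a diagonally dominant system so no pivoting is needed.

/* Thomas algorithm for a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],
   i = 0..n-1 (a[0] and c[n-1] unused).  b and d are overwritten. */
void tdma(int n, const double *a, double *b, const double *c,
          double *d, double *x)
{
    /* forward elimination */
    for (int i = 1; i < n; i++) {
        double m = a[i] / b[i - 1];
        b[i] -= m * c[i - 1];
        d[i] -= m * d[i - 1];
    }
    /* back substitution */
    x[n - 1] = d[n - 1] / b[n - 1];
    for (int i = n - 2; i >= 0; i--)
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
}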

National Center for Supercomputing Applications Domain Decomposition Data parallelization: extensibility, portability Divide the computational domain into sub-domains based on the number of processors Each processor solves the same problem on its own sub-domain, but the boundary-condition data of the overlapping boundary regions must be transferred (a sketch of this exchange follows below) Requires communication between the sub-domains at every time step Major parallelization method in CFD applications To get good scalability, the algorithms must be implemented carefully
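As an illustration of the per-time-step boundary exchange described above, here is a hedged sketch in C for a 1-D (slab) decomposition; the data layout, plane_size, and neighbor ranks (from MPI_Cart_shift, possibly MPI_PROC_NULL at a physical boundary) are assumptions paralleling, not reproducing, the Fortran code on the next slides.

#include "mpi.h"

/* Hypothetical ghost-plane exchange for a 1-D (slab) decomposition.
   u holds (nlocal + 2) planes of plane_size doubles: plane 0 is the
   bottom ghost, planes 1..nlocal are interior, plane nlocal+1 is the
   top ghost.  below/above may be MPI_PROC_NULL at a physical boundary,
   which turns the corresponding transfer into a no-op. */
void exchange_ghosts(double *u, int nlocal, int plane_size,
                     int below, int above, MPI_Comm comm)
{
    MPI_Status status;

    /* send top interior plane up, receive bottom ghost plane from below */
    MPI_Sendrecv(u + nlocal * plane_size,       plane_size, MPI_DOUBLE, above, 0,
                 u,                             plane_size, MPI_DOUBLE, below, 0,
                 comm, &status);

    /* send bottom interior plane down, receive top ghost plane from above */
    MPI_Sendrecv(u + 1 * plane_size,            plane_size, MPI_DOUBLE, below, 1,
                 u + (nlocal + 1) * plane_size, plane_size, MPI_DOUBLE, above, 1,
                 comm, &status);
}

In the Fortran slides that follow, a committed derived datatype (XY_p, YZ_p, XZ_p) would take the place of the plane_size doubles used here.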

National Center for Supercomputing Applications 1-D decomposition
!
! MPI Cartesian coordinate communicator
!
CALL MPI_CART_CREATE (MPI_COMM_WORLD, NDIMS, DIMS, PERIODIC, REORDER, CommZ, ierr)
CALL MPI_COMM_RANK (CommZ, myPE, ierr)
CALL MPI_CART_COORDS (CommZ, myPE, NDIMS, CRDS, ierr)
CALL MPI_CART_SHIFT (CommZ, 0, 1, PEb, PEt, ierr)
!
! MPI datatype creation
!
CALL MPI_TYPE_CONTIGUOUS (Nx*Ny, MPI_DOUBLE_PRECISION, XY_p, ierr)
CALL MPI_TYPE_COMMIT (XY_p, ierr)

National Center for Supercomputing Applications 2-D decomposition
CALL MPI_CART_CREATE (MPI_COMM_WORLD, NDIMS, DIMS, PERIODIC, REORDER, CommXY, ierr)
CALL MPI_COMM_RANK (CommXY, myPE, ierr)
CALL MPI_CART_COORDS (CommXY, myPE, NDIMS, CRDS, ierr)
CALL MPI_CART_SHIFT (CommXY, 1, 1, PEw, PEe, ierr)
CALL MPI_CART_SHIFT (CommXY, 0, 1, PEs, PEn, ierr)
!
! MPI datatype creation
!
CALL MPI_TYPE_VECTOR (cnt_yz, block_yz, strd_yz, MPI_DOUBLE_PRECISION, YZ_p, ierr)
CALL MPI_TYPE_COMMIT (YZ_p, ierr)
CALL MPI_TYPE_VECTOR (cnt_xz, block_xz, strd_xz, MPI_DOUBLE_PRECISION, XZ_p, ierr)
CALL MPI_TYPE_COMMIT (XZ_p, ierr)

National Center for Supercomputing Applications 3-D decomposition
CALL MPI_CART_CREATE (MPI_COMM_WORLD, …, CommXYZ, ierr)
CALL MPI_COMM_RANK (CommXYZ, myPE, ierr)
CALL MPI_CART_COORDS (CommXYZ, myPE, NDIMS, CRDS, ierr)
CALL MPI_CART_SHIFT (CommXYZ, 2, 1, PEw, PEe, ierr)
CALL MPI_CART_SHIFT (CommXYZ, 1, 1, PEs, PEn, ierr)
CALL MPI_CART_SHIFT (CommXYZ, 0, 1, PEb, PEt, ierr)
!
CALL MPI_TYPE_VECTOR (cnt_yz, block_yz, strd_yz, MPI_DOUBLE_PRECISION, YZ_p, ierr)
CALL MPI_TYPE_COMMIT (YZ_p, ierr)
CALL MPI_TYPE_VECTOR (cnt_xz, block_xz, strd_xz, MPI_DOUBLE_PRECISION, XZ_p, ierr)
CALL MPI_TYPE_COMMIT (XZ_p, ierr)
CALL MPI_TYPE_CONTIGUOUS (cnt_xy, MPI_DOUBLE_PRECISION, XY_p, ierr)
CALL MPI_TYPE_COMMIT (XY_p, ierr)

National Center for Supercomputing Applications Scalability: 1-D Good scalability up to a small number of processors (16) After the choke point, communication overhead becomes dominant Performance degrades with a large number of processors

National Center for Supercomputing Applications Scalability: 2-D Strong scalability up to a large number of processors Actual runtime is larger than in the 1-D case for small numbers of processors The sweep direction of the TDMA solver affects parallel performance through communication overhead

National Center for Supercomputing Applications Scalability: 3-D Superior scalability behavior over the other two cases No choke point observed up to 512 processors Communication overhead negligible compared to total runtime

National Center for Supercomputing Applications SpeedUps

National Center for Supercomputing Applications Superlinear Speedup of the 3-D Parallel Case Benefits from the Intel Itanium chip architecture (large L3 cache; floating-point data bypass L1) Small message size per communication due to good scalability

National Center for Supercomputing Applications Conclusion 1-D decomposition is fine for small problem sizes, but develops a communication-overhead problem as the size increases 2-D shows strong scaling behavior, but must be applied carefully because of the influence of the numerical solver's characteristics 3-D demonstrates superior scalability over the other two, and can even show superlinear speedup because of the hardware architecture There is no one-size-fits-all magic solution: to get the best scalability and application performance, the MPI algorithm, the application characteristics, and the hardware architecture must work in harmony