MPI User-defined Datatypes: Techniques for describing non-contiguous and heterogeneous data.


Derived Datatypes The communication mechanisms studied to this point allow send/recv of a contiguous buffer of identical elements of predefined datatypes. Often we want to send non-homogeneous elements (structures) or chunks that are not contiguous in memory. MPI provides derived datatypes for this purpose.

MPI type-definition functions
MPI_Type_contiguous: a replication of a datatype into contiguous locations
MPI_Type_vector: replication of a datatype into locations that consist of equally spaced blocks
MPI_Type_create_hvector: like vector, but the spacing between successive blocks is given in bytes and need not be a multiple of the base type extent
MPI_Type_indexed: non-contiguous data layout where displacements between successive blocks need not be equal
MPI_Type_create_struct: most general – each block may consist of replications of different datatypes
Note: the inconsistent naming convention is unfortunate but carries no deeper meaning; it is a compatibility issue between the old and new versions of MPI.

MPI_Type_contiguous MPI_Type_contiguous (int count, MPI_Datatype oldtype, MPI_Datatype *newtype) –IN count (replication count) –IN oldtype (base data type) –OUT newtype (handle to new data type) Creates a new type which is simply a replication of oldtype into contiguous locations

MPI_Type_contiguous example
/* create a type which describes a line of ghost cells */
/* buf[1..nxl] set to ghost cells */
int nxl;               /* number of ghost cells in the line */
MPI_Datatype ghosts;
MPI_Type_contiguous(nxl, MPI_DOUBLE, &ghosts);
MPI_Type_commit(&ghosts);
MPI_Send(buf, 1, ghosts, dest, tag, MPI_COMM_WORLD);
...
MPI_Type_free(&ghosts);

Typemaps Each MPI derived type can be described with a simple typemap, which specifies –a sequence of primitive types –a sequence of integer displacements Typemap = {(type_0, disp_0), …, (type_{n-1}, disp_{n-1})} –the i-th entry has type type_i and displacement buf + disp_i –the typemap need not be in any particular order –a handle to a derived type can appear in a send or recv operation in place of a predefined datatype (this includes collectives)

Question What is the typemap of MPI_INT, MPI_DOUBLE, etc.? –{(int, 0)} –{(double, 0)} –etc.

Typemaps, cont. Additional definitions –lower_bound(Typemap) = min_j disp_j, j = 0, …, n-1 –upper_bound(Typemap) = max_j (disp_j + sizeof(type_j)) + ε –extent(Typemap) = upper_bound(Typemap) - lower_bound(Typemap) If type_i requires alignment to a byte address that is a multiple of k_i, then ε is the least increment that rounds the extent up to the next multiple of max_i k_i

Question Assume the typemap {(double, 0), (char, 8)}, where doubles have to be strictly aligned at addresses that are multiples of 8. What is the extent of this datatype? Ans: 16. What is the extent of the type {(char, 0), (double, 8)}? Ans: 16. Is {(double, 8), (char, 0)} a valid typemap? Ans: yes – the order of the entries does not matter.

Detour: Type-related functions MPI_Type_get_extent (MPI_Datatype datatype, MPI_Aint *lb, MPI_Aint *extent) –IN datatype (datatype you are querying) –OUT lb (lower bound of datatype) –OUT extent (extent of datatype) Returns the lower bound and extent of datatype. Question: what is upper bound? –lower_bound + extent

MPI_Type_size MPI_Type_size(MPI_Datatype datatype, int *size) –IN datatype (datatype) –OUT size (datatype size) Returns the number of bytes actually occupied by data in the datatype, excluding any gaps (strided areas). Question: what is the size of {(char, 0), (double, 8)}? Ans: 9
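To make the size/extent distinction concrete, here is a minimal sketch (not from the slides) that builds the {(char, 0), (double, 8)} type with MPI_Type_create_struct and queries both values; it should report a size of 9 bytes and an extent of 16 bytes.

/* Sketch: contrast MPI_Type_size with MPI_Type_get_extent for a
   type with the typemap {(char, 0), (double, 8)}. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int          blocklen[2] = {1, 1};
    MPI_Aint     disp[2]     = {0, 8};
    MPI_Datatype types[2]    = {MPI_CHAR, MPI_DOUBLE};
    MPI_Datatype t;
    MPI_Aint     lb, extent;
    int          size;

    MPI_Init(&argc, &argv);
    MPI_Type_create_struct(2, blocklen, disp, types, &t);
    MPI_Type_commit(&t);

    MPI_Type_size(t, &size);               /* bytes of actual data: 1 + 8 = 9  */
    MPI_Type_get_extent(t, &lb, &extent);  /* the type spans [0, 16): extent 16 */
    printf("size=%d lb=%ld extent=%ld\n", size, (long)lb, (long)extent);

    MPI_Type_free(&t);
    MPI_Finalize();
    return 0;
}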

MPI_Type_vector MPI_Type_vector (int count, int blocklength, int stride, MPI_Datatype oldtype, MPI_Datatype *newtype); –IN count (number of blocks) –IN blocklength (number of elements per block) –IN stride (spacing between the start of each block, measured in number of elements) –IN oldtype (base datatype) –OUT newtype (handle to new type) Allows replication of oldtype into locations that consist of equally spaced blocks. Each block consists of the same number of copies of oldtype, with a stride that is a multiple of the extent of oldtype.

MPI_Type_vector, cont. Example: Imagine you have a local 2D array of interior size m x n with ng ghost cells at each edge. If you wish to send the interior (non-ghost-cell) portion of the array, how would you describe the datatype to do this in a single MPI call? Ans:
MPI_Type_vector(n, m, m+2*ng, MPI_DOUBLE, &interior);  /* n rows of m doubles; rows of the full array are m+2*ng wide */
MPI_Type_commit(&interior);
MPI_Send(f, 1, interior, dest, tag, MPI_COMM_WORLD);   /* f should point at the first interior element (e.g. &f[ng][ng]) */

Typemap view Start with Typemap = {(double, 0), (char, 8)}. What is the typemap of newtype after MPI_Type_vector(2, 3, 4, oldtype, &newtype)? Ans: {(double, 0), (char, 8), (double, 16), (char, 24), (double, 32), (char, 40), (double, 64), (char, 72), (double, 80), (char, 88), (double, 96), (char, 104)}

Question Express MPI_Type_contiguous(count, old, &new); as a call to MPI_Type_vector. Ans: –MPI_Type_vector(count, 1, 1, old, &new) –MPI_Type_vector(1, count, num, old, &new) (for any stride num, since there is only one block)

MPI_Type_create_hvector MPI_Type_create_hvector (int count, int blocklength, MPI_Aint stride, MPI_Datatype old, MPI_Datatype *new) –IN count (number of blocks) –IN blocklength (number of elements/block) –IN stride (number of bytes between the start of each block) –IN old (old datatype) –OUT new (new datatype) Same as MPI_Type_vector, except that the stride is given in bytes rather than in elements (the ‘h’ stands for ‘heterogeneous’).

Question What is the MPI_Type_create_hvector equivalent of MPI_Type_vector(2, 3, 4, old, &new), with Typemap = {(double, 0), (char, 8)}? Answer: MPI_Type_create_hvector(2, 3, 4*16, old, &new) – the stride of 4 elements becomes 4 × extent = 64 bytes
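As a small hedged sketch (variable names assumed; old is taken to be the committed {double, char} type above), the 4*16 can be computed portably by querying the extent rather than hard-coding it:

MPI_Aint lb, ext;
MPI_Datatype newtype;
MPI_Type_get_extent(old, &lb, &ext);                    /* ext is 16 bytes for {(double, 0), (char, 8)} */
MPI_Type_create_hvector(2, 3, 4 * ext, old, &newtype);  /* byte stride = 4 elements * extent */
MPI_Type_commit(&newtype);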

Question For the following oldtype (shown as a diagram on the slide): sketch the newtype created by a call to MPI_Type_create_hvector(3, 2, 7, old, &new). Answer: (diagram on the slide)

Example 1 – sending a checkered region Use MPI_Type_vector and MPI_Type_create_hvector together to send the shaded segments of the following memory layout (shown as a diagram on the slide):

Example, cont.
double a[6][5], e[3][3];
MPI_Datatype oneslice, twoslice;
MPI_Aint lb, sz_dbl;
int mype;
MPI_Comm_rank(MPI_COMM_WORLD, &mype);
MPI_Type_get_extent(MPI_DOUBLE, &lb, &sz_dbl);                  /* sz_dbl = extent of one double, in bytes */
MPI_Type_vector(3, 1, 2, MPI_DOUBLE, &oneslice);                /* 3 doubles, every 2nd element of a row */
MPI_Type_create_hvector(3, 1, 10*sz_dbl, oneslice, &twoslice);  /* 3 such rows, every 2nd row (10 doubles apart) */
MPI_Type_commit(&twoslice);
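The slide stops at the commit. A hedged sketch of one way the committed type might then be used (the exact shaded pattern depends on the figure, which is not reproduced here): send the strided 3×3 selection of a in a single message and receive it as 9 contiguous doubles in e, here on the same process.

if (mype == 0) {
    MPI_Sendrecv(&a[0][0], 1, twoslice, 0, 0,    /* send: 9 strided doubles of a   */
                 &e[0][0], 9, MPI_DOUBLE, 0, 0,  /* recv: packed into contiguous e */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
MPI_Type_free(&oneslice);
MPI_Type_free(&twoslice);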

Example 2 – matrix transpose
double a[100][100], b[100][100];
int mype;
MPI_Status status;
MPI_Datatype row, xpose;
MPI_Aint lb, sz_dbl;
MPI_Comm_rank(MPI_COMM_WORLD, &mype);
MPI_Type_get_extent(MPI_DOUBLE, &lb, &sz_dbl);
MPI_Type_vector(100, 1, 100, MPI_DOUBLE, &row);        /* one column of a: 100 doubles with stride 100 */
MPI_Type_create_hvector(100, 1, sz_dbl, row, &xpose);  /* 100 columns, each shifted by one double */
MPI_Type_commit(&xpose);
MPI_Sendrecv(&a[0][0], 1, xpose, mype, 0, &b[0][0], 100*100, MPI_DOUBLE, mype, 0, MPI_COMM_WORLD, &status);

Example 3 -- particles Given the following datatype:
struct Partstruct {
  char class;     /* particle class */
  double d[6];    /* particle x,y,z,u,v,w */
  char b[7];      /* some extra info */
};
We want to send just the locations (x,y,z) in a single message.
struct Partstruct particle[1000];
int dest, tag;
MPI_Datatype locationType;
MPI_Type_create_hvector(1000, 3, sizeof(struct Partstruct), MPI_DOUBLE, &locationType);
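The slide stops at the type construction. A hedged sketch of the commit and send that would follow; note that the send buffer starts at the first d array, and using sizeof(struct Partstruct) as the stride assumes the usual contiguous array-of-struct layout.

MPI_Type_commit(&locationType);
/* block k of locationType then lands on particle[k].d[0..2], i.e. (x,y,z) */
MPI_Send(particle[0].d, 1, locationType, dest, tag, MPI_COMM_WORLD);
MPI_Type_free(&locationType);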

MPI_Type_indexed MPI_Type_indexed (int count, int *array_of_blocklengths, int *array_of_displacements, MPI_Datatype oldtype, MPI_Datatype *newtype); –IN count (number of blocks) –IN array_of_blocklengths (number of elements/block) –IN array_of_displacements (displacement for each block, measured as number of elements) –IN oldtype –OUT newtype Displacements between successive blocks need not be equal. This allows gathering of arbitrary entries from an array and sending them in a single message.

Example Given the following oldtype (shown as a diagram on the slide): sketch the newtype defined by a call to MPI_Type_indexed with count = 3, blocklengths = [2, 3, 1], displacements = [0, 3, 8]. Answer: (diagram on the slide)

Example: upper-triangular transfer (diagram on the slide shows a[0][0], a[0][1], … stored in consecutive memory)

Upper-triangular transfer
double a[100][100];
int disp[100], blocklen[100], i, dest, tag;
MPI_Datatype upper;
/* compute the start and size of each row of the upper triangle */
for (i = 0; i < 100; ++i) {
  disp[i] = 100*i + i;      /* row i starts at the diagonal element a[i][i] */
  blocklen[i] = 100 - i;    /* and runs to the end of the row */
}
MPI_Type_indexed(100, blocklen, disp, MPI_DOUBLE, &upper);
MPI_Type_commit(&upper);
MPI_Send(a, 1, upper, dest, tag, MPI_COMM_WORLD);

MPI_Type_create_struct MPI_Type_create_struct (int count, int *array_of_blocklengths, MPI_Aint *array_of_displacements, MPI_Datatype *array_of_types, MPI_Datatype *newtype); –IN count (number of blocks) –IN array_of_blocklengths (number of elements in each block) –IN array_of_displacements (byte displacement of each block) –IN array_of_types (type of elements in each block) –OUT newtype Most general type constructor. Further generalizes MPI_Type_indexed in that it allows each block to consist of replications of different datatypes. The intent is to allow descriptions of arrays of structures as a single datatype.

Example Given the following oldtype (shown as a diagram on the slide): sketch the newtype created by a call to MPI_Type_create_struct with count = 3, blocklengths = [2, 3, 4], displacements = [0, 7, 16]. Answer: (diagram on the slide)

Example
struct Partstruct {
  char class;
  double d[6];
  char b[7];
};
struct Partstruct particle[1000];
int dest, tag;
MPI_Comm comm;
MPI_Datatype Particletype;
MPI_Datatype type[3] = {MPI_CHAR, MPI_DOUBLE, MPI_CHAR};
int blocklen[3] = {1, 6, 7};
MPI_Aint disp[3] = {0, sizeof(double), 7*sizeof(double)};
MPI_Type_create_struct(3, blocklen, disp, type, &Particletype);
MPI_Type_commit(&Particletype);
MPI_Send(particle, 1000, Particletype, dest, tag, comm);

Alignment Note: this example assumes that a double is double-word aligned. If doubles are single-word aligned, then disp would instead be initialized as {0, sizeof(int), sizeof(int) + 6*sizeof(double)}. MPI_Get_address allows us to write more generally correct code.
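A hedged sketch of that more general version, reusing the names from the previous slide: let MPI_Get_address report the offsets the compiler actually chose rather than guessing the padding.

/* Sketch: compute displacements from the addresses of the fields,
   so the code is correct for whatever padding the compiler inserts. */
MPI_Aint base;
MPI_Get_address(&particle[0],       &base);
MPI_Get_address(&particle[0].class, &disp[0]);   /* overwrite the guessed disp[] values */
MPI_Get_address(&particle[0].d[0],  &disp[1]);
MPI_Get_address(&particle[0].b[0],  &disp[2]);
for (int i = 0; i < 3; ++i)
    disp[i] = MPI_Aint_diff(disp[i], base);      /* MPI-3.1; use disp[i] -= base on older MPIs */
MPI_Type_create_struct(3, blocklen, disp, type, &Particletype);
MPI_Type_commit(&Particletype);

For arrays of structures it is also common to follow this with MPI_Type_create_resized so that the type's extent is guaranteed to match sizeof(struct Partstruct).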

MPI_Type_commit Every datatype constructor returns an uncommitted datatype. Think of the commit process as a compilation of the datatype description into an efficient internal form. You must call MPI_Type_commit(&datatype) before using the type in communication. Once committed, a datatype can be repeatedly reused. If MPI_Type_commit is called more than once on the same datatype, the subsequent calls have no effect.

MPI_Type_free A call to MPI_Type_free(&datatype) frees the datatype and sets the value of datatype to MPI_DATATYPE_NULL. Datatypes that were derived from the freed datatype are unaffected.

MPI_Get_elements MPI_Get_elements (MPI_Status *status, MPI_Datatype datatype, int *count); –IN status (status of receive) –IN datatype (datatype used in the receive) –OUT count (number of primitive elements received)
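A hedged sketch (assuming rank holds the result of MPI_Comm_rank) contrasting MPI_Get_elements with MPI_Get_count: the sender ships three bare doubles, the receiver posts a receive described in pairs of doubles, so the number of whole pairs is undefined while the primitive-element count is 3.

MPI_Datatype pair;
MPI_Status status;
double sbuf[3] = {1.0, 2.0, 3.0}, rbuf[4];
int count, elements;

MPI_Type_contiguous(2, MPI_DOUBLE, &pair);
MPI_Type_commit(&pair);
if (rank == 0) {
    MPI_Send(sbuf, 3, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
} else if (rank == 1) {
    MPI_Recv(rbuf, 2, pair, 0, 0, MPI_COMM_WORLD, &status);
    MPI_Get_count(&status, pair, &count);        /* MPI_UNDEFINED: 1.5 pairs is not a whole number */
    MPI_Get_elements(&status, pair, &elements);  /* 3 primitive doubles were received */
}
MPI_Type_free(&pair);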

MPI_Get_address MPI_Get_address (void *location, MPI_Aint *address); –IN location (location in caller memory) –OUT address (address of location) Question: Why is this necessary for C?

Additional useful functions MPI_Type_create_subarray MPI_Type_create_darray We will study these next week.

Some common applications with more sophisticated parallelization issues

Example: n-body problem

Two-body Gravitational Attraction (diagram on the slide shows masses m_1 and m_2)
F = G m_1 m_2 r / r^3 — F: force between the bodies, G: universal gravitational constant, m_1: mass of the first body, m_2: mass of the second body, r: position vector = (x, y), r: scalar distance
a = F / m — a: acceleration
v = a Δt + v_0 — v: velocity
x = v Δt + x_0 — x: position
This is a completely integrable, non-chaotic system.

Three-body problem (diagram on the slide shows masses m_1, m_2, m_3)
Case for three bodies:
F_1 = G m_1 m_2 r_{1,2}/r^2 + G m_1 m_3 r_{1,3}/r^2
F_2 = G m_2 m_1 r_{2,1}/r^2 + G m_2 m_3 r_{2,3}/r^2
F_3 = G m_3 m_1 r_{3,1}/r^2 + G m_3 m_2 r_{3,2}/r^2
General case for n bodies:
F_n = Σ_k G m_n m_k r_{n,k}/r^2

Schematic numerical solution to the system Begin with n particles with the following properties:
initial positions: [x0_1, x0_2, …, x0_n]
initial velocities: [v0_1, v0_2, …, v0_n]
masses: [m_1, m_2, …, m_n]
Step 1: calculate the acceleration of each particle as: a_n = F_n / m_n = Σ_m G m_m r_{n,m}/r^2
Step 2: calculate the velocity of each particle over interval dt as: v_n = a_n dt + v0_n
Step 3: calculate the new position of each particle over interval dt as: x_n = v0_n dt + x0_n
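A hedged serial sketch in C of one such step, in 2-D, with hypothetical names; the small softening length eps is an addition (not on the slide) that tames the large forces at very small separations mentioned later. This is the loop an MPI version would distribute across processes.

#include <math.h>
#include <stdlib.h>

/* Hypothetical sketch of one Euler step for n bodies in 2-D.
   x, y: positions; vx, vy: velocities; m: masses; G: gravitational
   constant; dt: timestep; eps: softening length. */
void euler_step(int n, double *x, double *y, double *vx, double *vy,
                const double *m, double G, double dt, double eps)
{
    double *ax = calloc(n, sizeof *ax);
    double *ay = calloc(n, sizeof *ay);

    /* Step 1: acceleration of each particle from all others, O(n^2) */
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k) {
            if (k == i) continue;
            double dx = x[k] - x[i], dy = y[k] - y[i];
            double r2 = dx*dx + dy*dy + eps*eps;
            double inv_r3 = 1.0 / (r2 * sqrt(r2));
            ax[i] += G * m[k] * dx * inv_r3;
            ay[i] += G * m[k] * dy * inv_r3;
        }

    for (int i = 0; i < n; ++i) {
        /* Step 3, using the old velocity as on the slide: x = v0*dt + x0 */
        x[i] += vx[i] * dt;
        y[i] += vy[i] * dt;
        /* Step 2: v = a*dt + v0 */
        vx[i] += ax[i] * dt;
        vy[i] += ay[i] * dt;
    }
    free(ax);
    free(ay);
}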

Solving ODEs In practice, numerical techniques for solving ODEs would be a little more sophisticated. For example, to get the velocity we really have to solve: dv_n/dt = a_n. Our discretization was the simplest possible, known as the Euler method: [v_n(t+dt) - v_n(t)]/dt = a_n, so v_n(t+dt) = a_n dt + v_n(t). Runge-Kutta, leapfrog, etc. have better stability properties and are still very simple, but Euler is OK for a first try.

Collapsing galaxy

Parallelization of n-body What are the main issues for performance in general, even for serial code? –The algorithm scales as n^2 –Forces become large at small distances – dynamic timestep adjustment is needed –Others? What are the additional issues for parallel performance? –Load balancing –High communication overhead

Survey of solution techniques
Particle-Particle (PP)
Particle-Mesh (PM)
Particle-Particle/Particle-Mesh (P3M)
Particle Multiple-Mesh (PM2)
Nested Grid Particle-Mesh (NGPM)
Tree-Code (TC) Top Down
Tree-Code (TC) Bottom Up
Fast-Multipole-Method (FMM)
Tree-Code Particle Mesh (TPM)
Self-Consistent Field (SCF)
Symplectic Method

Spatial grid refinement

Example – Spatially uneven grids You know a priori that there will be lots of activity in a particular region, so high accuracy is necessary there (annotated diagram on the slide). Here, the grid spacing dx is a pre-determined function of x.

Sample Application A good representative application for a spatially refined grid is an ocean basin circulation model. A typical ocean basin (e.g. the North Atlantic) has a length scale of O[1000 km]. State-of-the-art models can solve problems on grids of size 10^3 × 10^3 (× 10 in the vertical). This implies a horizontal grid spacing of O[1 km]. Near the coast, horizontal velocities change from 0 to the free-stream value over very small length scales. This is crucial for the energetics of the general simulation and requires high resolution.

Ocean circulation -- temperature

Sea-surface height

Spatially refined grid What are key parallelization issues? –More bookkeeping required in distributing points across proc grid –Smaller dx usually means smaller timestep – load imbalance? –How to handle fine-coarse boundaries? –What if one proc needs both fine and coarse mesh components for good load balancing?

Spatio-temporal grid refinement

In other applications, grid refinement is also necessary for accurate simulation of dynamical “hot zones”. However, the location of these zones may not be known a priori. Furthermore, they will typically change with time throughout the course of the simulation.

Example – stellar explosion In many astrophysical phenomena such as stellar explosions, fluid velocities are extremely high and shock fronts form. To accurately capture the dynamics of the explosion, a very high resolution grid is required at the shock front. This grid must be moved in time to follow the shock.

Stellar explosion

Spatio-temporal refinement What are additional main parallelization issues? –Dynamic load balancing

Neuron firing