
Parallel I/O

2 Common Ways of Doing I/O in Parallel Programs
Sequential I/O:
–All processes send data to rank 0, and rank 0 writes it to the file
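A minimal sketch of this sequential pattern, assuming each rank holds local_count ints in local_buf (hypothetical names, not from the slides): the data are gathered to rank 0, which writes them with ordinary POSIX I/O.

/* Sequential I/O sketch: gather everything to rank 0, which writes the file. */
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

void write_sequential(int *local_buf, int local_count, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    int *all = NULL;
    if (rank == 0)
        all = (int *) malloc((size_t)nprocs * local_count * sizeof(int));

    /* every process sends its data to rank 0 */
    MPI_Gather(local_buf, local_count, MPI_INT,
               all, local_count, MPI_INT, 0, comm);

    if (rank == 0) {                    /* only rank 0 touches the file */
        FILE *fp = fopen("datafile", "wb");
        fwrite(all, sizeof(int), (size_t)nprocs * local_count, fp);
        fclose(fp);
        free(all);
    }
}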

3 Pros and Cons of Sequential I/O
Pros:
–parallel machine may support I/O from only one process (e.g., no common file system)
–some I/O libraries (e.g. HDF-4, NetCDF) not parallel
–resulting single file is handy for ftp, mv
–big blocks improve performance
–short distance from original, serial code
Cons:
–lack of parallelism limits scalability, performance (single node bottleneck)

4 Another Way
Each process writes to a separate file
Pros:
–parallelism, high performance
Cons:
–lots of small files to manage
–difficult to read the data back with a different number of processes
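A sketch of this file-per-process approach, with the same hypothetical local_buf and local_count as above; each rank opens its own file through MPI_COMM_SELF.

/* File-per-process sketch: every rank writes its own file (names illustrative). */
#include <stdio.h>
#include "mpi.h"

void write_file_per_process(int *local_buf, int local_count, MPI_Comm comm)
{
    int rank;
    char filename[64];
    MPI_File fh;

    MPI_Comm_rank(comm, &rank);
    snprintf(filename, sizeof(filename), "datafile.%d", rank);

    /* MPI_COMM_SELF: each process opens and owns its file independently */
    MPI_File_open(MPI_COMM_SELF, filename,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write(fh, local_buf, local_count, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}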

5 What is Parallel I/O?
Multiple processes of a parallel program accessing data (reading or writing) from a common file
[Figure: processes P0, P1, P2, …, P(n-1) all accessing one common FILE]

6 Why Parallel I/O?
Non-parallel I/O is simple but
–Poor performance (single process writes to one file) or
–Awkward and not interoperable with other tools (each process writes a separate file)
Parallel I/O
–Provides high performance
–Can provide a single file that can be used with other tools (such as visualization programs)

7 Why is MPI a Good Setting for Parallel I/O?
Writing is like sending a message and reading is like receiving
Any parallel I/O system will need a mechanism to
–define collective operations (MPI communicators)
–define noncontiguous data layouts in memory and file (MPI datatypes)
–test completion of nonblocking operations (MPI request objects)
i.e., lots of MPI-like machinery

8 MPI-IO Background
Marc Snir et al. (IBM Watson) paper exploring MPI as a context for parallel I/O (1994)
MPI-IO discussion group led by J.-P. Prost (IBM) and Bill Nitzberg (NASA), 1994
MPI-IO group joins the MPI Forum in June 1996
MPI-2 standard released in July 1997
MPI-IO is Chapter 9 of MPI-2

9 Using MPI for Simple I/O
Each process needs to read a chunk of data from a common file
[Figure: processes P0, P1, P2, …, P(n-1) each reading a contiguous chunk of the same FILE]

10 Using Individual File Pointers
#include <stdio.h>
#include "mpi.h"
#define FILESIZE 1000

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_File fh;
    MPI_Status status;
    int bufsize, nints;
    int buf[FILESIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    bufsize = FILESIZE/nprocs;        /* number of bytes each process reads */
    nints = bufsize/sizeof(int);

    MPI_File_open(MPI_COMM_WORLD, "datafile", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_seek(fh, rank * bufsize, MPI_SEEK_SET);
    MPI_File_read(fh, buf, nints, MPI_INT, &status);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}

11 Using Explicit Offsets
#include <stdio.h>
#include "mpi.h"
#define FILESIZE 1000

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_File fh;
    MPI_Status status;
    int bufsize, nints;
    int buf[FILESIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    bufsize = FILESIZE/nprocs;        /* number of bytes each process reads */
    nints = bufsize/sizeof(int);

    MPI_File_open(MPI_COMM_WORLD, "datafile", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    /* explicit offset: no separate seek call */
    MPI_File_read_at(fh, rank*bufsize, buf, nints, MPI_INT, &status);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}

12 Function Details
MPI_File_open(MPI_Comm comm, char *file, int mode, MPI_Info info, MPI_File *fh)
(note: mode = MPI_MODE_RDONLY, MPI_MODE_RDWR, MPI_MODE_WRONLY, MPI_MODE_CREATE, MPI_MODE_EXCL, MPI_MODE_DELETE_ON_CLOSE, MPI_MODE_UNIQUE_OPEN, MPI_MODE_SEQUENTIAL, MPI_MODE_APPEND)
MPI_File_close(MPI_File *fh)
MPI_File_read(MPI_File fh, void *buf, int count, MPI_Datatype type, MPI_Status *status)
MPI_File_read_at(MPI_File fh, MPI_Offset offset, void *buf, int count, MPI_Datatype type, MPI_Status *status)
MPI_File_seek(MPI_File fh, MPI_Offset offset, int whence)
(note: whence = MPI_SEEK_SET, MPI_SEEK_CUR, or MPI_SEEK_END)
MPI_File_write(MPI_File fh, void *buf, int count, MPI_Datatype datatype, MPI_Status *status)
Note: Many other functions to get/set properties (see Gropp et al.)

13 Writing to a File
Use MPI_File_write or MPI_File_write_at
Use MPI_MODE_WRONLY or MPI_MODE_RDWR as the flags to MPI_File_open
If the file doesn't exist already, the flag MPI_MODE_CREATE must also be passed to MPI_File_open
We can pass multiple flags by using bitwise-or '|' in C, or addition '+' in Fortran (see the sketch below)
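A minimal write sketch combining these flags, reusing the rank/buf/nints names from the read examples above; the file name "datafile" is illustrative, not from the slides.

/* Write sketch: each process writes its block of ints at its own byte offset. */
MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, "datafile",
              MPI_MODE_CREATE | MPI_MODE_WRONLY,   /* create the file if it doesn't exist */
              MPI_INFO_NULL, &fh);
MPI_File_write_at(fh, (MPI_Offset)rank * nints * sizeof(int),
                  buf, nints, MPI_INT, MPI_STATUS_IGNORE);
MPI_File_close(&fh);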

14 MPI Datatype Interlude
Datatypes in MPI
–Elementary: MPI_INT, MPI_DOUBLE, etc. (everything we've used to this point)
–Contiguous: next easiest, sequences of elementary types
–Vector: sequences separated by a constant "stride"

15 MPI Datatypes, cont.
–Indexed: more general, does not assume a constant stride
–Struct: general mixed types (like C structs)
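As an illustration of the indexed case, here is a hedged sketch of MPI_Type_indexed; the block lengths and displacements are made up for illustration and are not from the slides.

/* Indexed sketch: 3 blocks of ints at irregular displacements (in units of ints). */
int blocklens[3] = {2, 1, 3};
int displs[3]    = {0, 5, 9};      /* not a constant stride */
MPI_Datatype idx;
MPI_Type_indexed(3, blocklens, displs, MPI_INT, &idx);
MPI_Type_commit(&idx);
/* ... use idx, e.g. as a filetype or in sends/receives ... */
MPI_Type_free(&idx);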

16 Creating simple datatypes
Let's just look at the simplest types: contiguous and vector datatypes.
Contiguous example
–Let's create a new datatype which is two ints side by side. The calling sequence is
MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype);

MPI_Datatype newtype;
MPI_Type_contiguous(2, MPI_INT, &newtype);
MPI_Type_commit(&newtype);   /* required before the type can be used */

17 Creating vector datatype
MPI_Type_vector(int count, int blocklen, int stride,
                MPI_Datatype oldtype, MPI_Datatype *newtype);
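A small usage sketch with illustrative counts (not from the slides): 5 blocks of one int each, separated by a stride of 2 ints, i.e. every other int of a buffer.

/* Vector sketch: pick out every other int. */
MPI_Datatype every_other;
MPI_Type_vector(5, 1, 2, MPI_INT, &every_other);
MPI_Type_commit(&every_other);
/* ... use every_other as a filetype or in sends/receives ... */
MPI_Type_free(&every_other);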

18 Using File Views Processes write to shared file MPI_File_set_view assigns regions of the file to separate processes

19 File Views
Specified by a triplet (displacement, etype, filetype) passed to MPI_File_set_view
–displacement = number of bytes to be skipped from the start of the file
–etype = basic unit of data access (can be any basic or derived datatype)
–filetype = specifies which portion of the file is visible to the process

20 File View Example
MPI_File thefile;
for (i=0; i<BUFSIZE; i++)
    buf[i] = myrank * BUFSIZE + i;
MPI_File_open(MPI_COMM_WORLD, "testfile",
              MPI_MODE_CREATE | MPI_MODE_WRONLY,
              MPI_INFO_NULL, &thefile);
/* displacement is in bytes, so scale by sizeof(int) */
MPI_File_set_view(thefile, myrank * BUFSIZE * sizeof(int),
                  MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
MPI_File_write(thefile, buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE);
MPI_File_close(&thefile);

21 MPI_File_set_view
MPI_File_set_view(MPI_File fh, MPI_Offset disp, MPI_Datatype etype, MPI_Datatype filetype, char *datarep, MPI_Info info)
MPI_File_set_view defines which portion of a file is accessible to a given process
Extremely important function -- make sure that you understand its arguments
When using the previous functions (MPI_File_write, etc.) we did not set a file view but rather used the default one (explained below)
disp: a byte offset to define the starting point
etype: this is the base (think native) type of data in the file (ints, doubles, etc. are common). It is better not to simply use bytes as an etype in heterogeneous environments, e.g. when mapping ints between big- and little-endian architectures
filetype: like etype this is any MPI datatype. It is used to define regions of non-contiguous access in a file.

22 MPI_File_set_view, cont.
For individual file pointers each process can specify its own view
View can be changed any number of times during a program
All file access is done in units of etype!
filetype must be equal to or be derived from etype

23 A Simple File View Example
etype = MPI_INT
filetype = two MPI_INTs followed by a gap of four MPI_INTs
[Figure: starting from the head of the FILE, a displacement in bytes is skipped, then the filetype pattern repeats, "and so on..."]

24 Example: non-contiguous access
MPI_Aint lb, extent;
MPI_Datatype etype, filetype, contig;
MPI_Offset disp;
MPI_File fh;
int buf[1000];

MPI_File_open(MPI_COMM_WORLD, "filename", MPI_MODE_CREATE | MPI_MODE_RDWR,
              MPI_INFO_NULL, &fh);
MPI_Type_contiguous(2, MPI_INT, &contig);
lb = 0;
extent = 6 * sizeof(int);   /* 2 ints of data followed by a gap of 4 ints */
MPI_Type_create_resized(contig, lb, extent, &filetype);
MPI_Type_commit(&filetype);
disp = 5 * sizeof(int);
etype = MPI_INT;
MPI_File_set_view(fh, disp, etype, filetype, "native", MPI_INFO_NULL);
MPI_File_write(fh, buf, 1000, MPI_INT, MPI_STATUS_IGNORE);

25 Ways to Write to a Shared File
–MPI_File_seek (then read/write): like Unix seek
–MPI_File_read_at / MPI_File_write_at: combine seek and I/O for thread safety
–MPI_File_read_shared / MPI_File_write_shared: use a shared file pointer; good when order doesn't matter
–Collective operations
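A minimal C sketch of the shared-file-pointer route from the list above (buffer, count, and file name are illustrative, not from the slides); each call advances the single file pointer shared by all processes that opened the file, so the order of the blocks in the file is unspecified.

/* Shared file pointer sketch */
int buf[1000];
MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, "datafile",
              MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
MPI_File_write_shared(fh, buf, 1000, MPI_INT, MPI_STATUS_IGNORE);
MPI_File_close(&fh);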

26 Collective I/O in MPI
A critical optimization in parallel I/O
Allows communication of the "big picture" to the file system
Framework for two-phase I/O, in which communication precedes I/O (can use MPI machinery)
Basic idea: build large blocks, so that reads/writes in the I/O system will be large
[Figure: many small individual requests vs. one large collective access]

27 Noncontiguous Accesses
Common in parallel applications
Example: distributed arrays stored in files
A big advantage of MPI I/O over Unix I/O is the ability to specify noncontiguous accesses in memory and file within a single function call by using derived datatypes
Allows the implementation to optimize the access
Collective I/O combined with noncontiguous accesses yields the highest performance

28 Collective I/O MPI_File_read_all, MPI_File_read_at_all, etc _all indicates that all processes in the group specified by the communicator passed to MPI_File_open will call this function Each process specifies only its own access information -- the argument list is the same as for the non-collective functions

29 Collective I/O By calling the collective I/O functions, the user allows an implementation to optimize the request based on the combined request of all processes The implementation can merge the requests of different processes and service the merged request efficiently Particularly effective when the accesses of different processes are noncontiguous and interleaved

30 Collective non-contiguous MPI-IO example
#include "mpi.h"
#include <stdlib.h>
#define FILESIZE          /* file size in bytes; choose a multiple of INTS_PER_BLK*nprocs*sizeof(int) */
#define INTS_PER_BLK 16

int main(int argc, char **argv)
{
    int *buf, rank, nprocs, nints, bufsize;
    MPI_File fh;
    MPI_Datatype filetype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    bufsize = FILESIZE/nprocs;
    buf = (int *) malloc(bufsize);
    nints = bufsize/sizeof(int);

    MPI_File_open(MPI_COMM_WORLD, "filename", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    MPI_Type_vector(nints/INTS_PER_BLK, INTS_PER_BLK, INTS_PER_BLK*nprocs,
                    MPI_INT, &filetype);
    MPI_Type_commit(&filetype);
    MPI_File_set_view(fh, INTS_PER_BLK*sizeof(int)*rank, MPI_INT, filetype,
                      "native", MPI_INFO_NULL);
    MPI_File_read_all(fh, buf, nints, MPI_INT, MPI_STATUS_IGNORE);

    MPI_Type_free(&filetype);
    free(buf);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

31 More on MPI_File_read_all
Note that the _all version has the same argument list
Difference is that all processes involved in the MPI_File_open must call the read
Contrast with the non-_all version, where any subset may or may not call it
Allows for many optimizations

32 Array-specific datatypes
We can take this a level higher for the (common) case of distributed arrays
There are two very convenient datatype constructors -- darray and subarray -- that can be used to read/write distributed arrays with less work
These are just ways to create an MPI_Datatype; all the other machinery just described is identical

33 Accessing Arrays Stored in Files
[Figure: an m x n global array (m rows, n columns) stored in a file, block-distributed over a 2 x 3 grid of processes P0-P5 (nproc(1) = 2, nproc(2) = 3), each process labeled with its (row, column) coords]

34 Using the "Distributed Array" (Darray) Datatype
int gsizes[2], distribs[2], dargs[2], psizes[2];

gsizes[0] = m;  /* no. of rows in global array */
gsizes[1] = n;  /* no. of columns in global array */

distribs[0] = MPI_DISTRIBUTE_BLOCK;
distribs[1] = MPI_DISTRIBUTE_BLOCK;

dargs[0] = MPI_DISTRIBUTE_DFLT_DARG;
dargs[1] = MPI_DISTRIBUTE_DFLT_DARG;

psizes[0] = 2;  /* no. of processes in vertical dimension of process grid */
psizes[1] = 3;  /* no. of processes in horizontal dimension of process grid */

35 MPI_Type_create_darray
MPI_Type_create_darray(int size, int rank, int ndims,
                       int array_of_gsizes[], int array_of_distribs[],
                       int array_of_dargs[], int array_of_psizes[],
                       int order, MPI_Datatype oldtype, MPI_Datatype *newtype);
size: number of processes over which the array is distributed
rank: rank of the process whose array is being described
ndims: number of dimensions in the global (and local) array
array_of_gsizes: size of the global array in each dimension
array_of_distribs: way in which the array is distributed in each dimension -- can be block, cyclic, or block-cyclic. Here it is block, so we use MPI_DISTRIBUTE_BLOCK
array_of_dargs: the k in cyclic(k). Not needed for block or cyclic distributions, so we use MPI_DISTRIBUTE_DFLT_DARG
array_of_psizes: number of processes in each dimension
order: row-major (MPI_ORDER_C) or column-major (MPI_ORDER_FORTRAN)

36 Darray Continued
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Type_create_darray(6, rank, 2, gsizes, distribs, dargs, psizes,
                       MPI_ORDER_C, MPI_FLOAT, &filetype);
MPI_Type_commit(&filetype);

MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile",
              MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
MPI_File_set_view(fh, 0, MPI_FLOAT, filetype, "native", MPI_INFO_NULL);

local_array_size = num_local_rows * num_local_cols;
MPI_File_write_all(fh, local_array, local_array_size, MPI_FLOAT, &status);
MPI_File_close(&fh);

37 A Word of Warning about Darray
The darray datatype assumes a very specific definition of data distribution -- the exact definition as in HPF
For example, if the array size is not divisible by the number of processes, darray calculates the block size using a ceiling division (e.g., ceiling(20/6) = 4)
darray assumes a row-major ordering of processes in the logical grid, as assumed by cartesian process topologies in MPI-1
If your application uses a different definition for data distribution or logical-grid ordering, you cannot use darray. Use subarray instead.

38 MPI_Type_create_subarray
MPI_Type_create_subarray(int ndims, int array_of_sizes[],
                         int array_of_subsizes[], int array_of_starts[],
                         int order, MPI_Datatype oldtype, MPI_Datatype *newtype);
ndims: number of dimensions in the array
array_of_sizes: size of the global array in each dimension
array_of_subsizes: size of the subarray in each dimension
array_of_starts: starting coordinates of the subarray in each dimension (zero-based)
order: same as for darray (C or Fortran ordering)

39 Using the Subarray Datatype
gsizes[0] = m;  /* no. of rows in global array */
gsizes[1] = n;  /* no. of columns in global array */

psizes[0] = 2;  /* no. of procs. in vertical dimension */
psizes[1] = 3;  /* no. of procs. in horizontal dimension */

lsizes[0] = m/psizes[0];  /* no. of rows in local array */
lsizes[1] = n/psizes[1];  /* no. of columns in local array */

dims[0] = 2; dims[1] = 3;
periods[0] = periods[1] = 1;

MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm);
MPI_Comm_rank(comm, &rank);
MPI_Cart_coords(comm, rank, 2, coords);

40 Subarray Datatype contd.
/* global indices of the first element of the local array */
start_indices[0] = coords[0] * lsizes[0];
start_indices[1] = coords[1] * lsizes[1];

MPI_Type_create_subarray(2, gsizes, lsizes, start_indices,
                         MPI_ORDER_C, MPI_FLOAT, &filetype);
MPI_Type_commit(&filetype);

MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile",
              MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
MPI_File_set_view(fh, 0, MPI_FLOAT, filetype, "native", MPI_INFO_NULL);

local_array_size = lsizes[0] * lsizes[1];
MPI_File_write_all(fh, local_array, local_array_size, MPI_FLOAT, &status);

41 Local Array with Ghost Area in Memory
[Figure: an allocated array with corners (0,0), (0,107), (107,0), (107,107); the local data occupy the interior from (4,4) to (103,103), surrounded by a ghost area for storing off-process elements]
Use a subarray datatype to describe the noncontiguous layout in memory
Pass this datatype as the argument to MPI_File_write_all

42 Local Array with Ghost Area
memsizes[0] = lsizes[0] + 8;  /* no. of rows in allocated array */
memsizes[1] = lsizes[1] + 8;  /* no. of columns in allocated array */

/* indices of the first element of the local array in the allocated array */
start_indices[0] = start_indices[1] = 4;

MPI_Type_create_subarray(2, memsizes, lsizes, start_indices,
                         MPI_ORDER_C, MPI_FLOAT, &memtype);
MPI_Type_commit(&memtype);

/* create filetype and set the file view exactly as in the subarray example */

MPI_File_write_all(fh, local_array, 1, memtype, &status);

43 Accessing Irregularly Distributed Arrays
[Figure: processes 0, 1, and 2 each hold a data array and a corresponding map array]
The map array describes the location of each element of the data array in the common file

44 Accessing Irregularly Distributed Arrays
integer (kind=MPI_OFFSET_KIND) disp

call MPI_FILE_OPEN(MPI_COMM_WORLD, '/pfs/datafile', &
                   MPI_MODE_CREATE + MPI_MODE_RDWR, &
                   MPI_INFO_NULL, fh, ierr)
call MPI_TYPE_CREATE_INDEXED_BLOCK(bufsize, 1, map, &
                   MPI_DOUBLE_PRECISION, filetype, ierr)
call MPI_TYPE_COMMIT(filetype, ierr)
disp = 0
call MPI_FILE_SET_VIEW(fh, disp, MPI_DOUBLE_PRECISION, &
                   filetype, 'native', MPI_INFO_NULL, ierr)
call MPI_FILE_WRITE_ALL(fh, buf, bufsize, &
                   MPI_DOUBLE_PRECISION, status, ierr)
call MPI_FILE_CLOSE(fh, ierr)

45 Nonblocking I/O
MPI_Request request;
MPI_Status status;

MPI_File_iwrite_at(fh, offset, buf, count, datatype, &request);

for (i=0; i<1000; i++) {
    /* perform computation */
}

MPI_Wait(&request, &status);

46 Split Collective I/O
MPI_File_write_all_begin(fh, buf, count, datatype);

for (i=0; i<1000; i++) {
    /* perform computation */
}

MPI_File_write_all_end(fh, buf, &status);

A restricted form of nonblocking collective I/O
Only one active nonblocking collective operation is allowed at a time on a file handle
Therefore, no request object is necessary

47 Shared File Pointers
#include "mpi.h"   // C++ example

int main(int argc, char *argv[])
{
    int buf[1000];
    MPI::Init();
    MPI::File fh = MPI::File::Open(MPI::COMM_WORLD, "/pfs/datafile",
                                   MPI::MODE_CREATE | MPI::MODE_WRONLY,
                                   MPI::INFO_NULL);
    fh.Write_shared(buf, 1000, MPI::INT);
    fh.Close();
    MPI::Finalize();
    return 0;
}

48 Passing Hints to the Implementation
MPI_Info info;

MPI_Info_create(&info);

/* no. of I/O devices to be used for file striping */
MPI_Info_set(info, "striping_factor", "4");

/* the striping unit in bytes */
MPI_Info_set(info, "striping_unit", "65536");

MPI_File_open(MPI_COMM_WORLD, "/pfs/datafile",
              MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);

MPI_Info_free(&info);

49 Examples of Hints (used in ROMIO)
MPI-2 predefined hints: striping_unit, striping_factor, cb_buffer_size, cb_nodes
New algorithm parameters: ind_rd_buffer_size, ind_wr_buffer_size
Platform-specific hints: start_iodevice, pfs_svr_buf, direct_read, direct_write

50 I/O Consistency Semantics
The consistency semantics specify the results when multiple processes access a common file and one or more processes write to the file
MPI guarantees stronger consistency semantics if the communicator used to open the file accurately specifies all the processes that are accessing the file, and weaker semantics if not
The user can take steps to ensure consistency when MPI does not automatically do so

51 Example 1
File opened with MPI_COMM_WORLD. Each process writes to a separate region of the file and reads back only what it wrote.

Process 0:
MPI_File_open(MPI_COMM_WORLD,…)
MPI_File_write_at(off=0,cnt=100)
MPI_File_read_at(off=0,cnt=100)

Process 1:
MPI_File_open(MPI_COMM_WORLD,…)
MPI_File_write_at(off=100,cnt=100)
MPI_File_read_at(off=100,cnt=100)

MPI guarantees that the data will be read correctly

52 Example 2
Same as Example 1, except that each process wants to read what the other process wrote (overlapping accesses)
In this case, MPI does not guarantee that the data will automatically be read correctly

Process 0:
/* incorrect program */
MPI_File_open(MPI_COMM_WORLD,…)
MPI_File_write_at(off=0,cnt=100)
MPI_Barrier
MPI_File_read_at(off=100,cnt=100)

Process 1:
/* incorrect program */
MPI_File_open(MPI_COMM_WORLD,…)
MPI_File_write_at(off=100,cnt=100)
MPI_Barrier
MPI_File_read_at(off=0,cnt=100)

In the above program, the read on each process is not guaranteed to get the data written by the other process!

53 Example 2 contd. The user must take extra steps to ensure correctness There are three choices: –set atomicity to true –close the file and reopen it –ensure that no write sequence on any process is concurrent with any sequence (read or write) on another process

54 Example 2, Option 1
Set atomicity to true

Process 0:
MPI_File_open(MPI_COMM_WORLD,…)
MPI_File_set_atomicity(fh1,1)
MPI_File_write_at(off=0,cnt=100)
MPI_Barrier
MPI_File_read_at(off=100,cnt=100)

Process 1:
MPI_File_open(MPI_COMM_WORLD,…)
MPI_File_set_atomicity(fh2,1)
MPI_File_write_at(off=100,cnt=100)
MPI_Barrier
MPI_File_read_at(off=0,cnt=100)

55 Example 2, Option 2
Close and reopen the file

Process 0:
MPI_File_open(MPI_COMM_WORLD,…)
MPI_File_write_at(off=0,cnt=100)
MPI_File_close
MPI_Barrier
MPI_File_open(MPI_COMM_WORLD,…)
MPI_File_read_at(off=100,cnt=100)

Process 1:
MPI_File_open(MPI_COMM_WORLD,…)
MPI_File_write_at(off=100,cnt=100)
MPI_File_close
MPI_Barrier
MPI_File_open(MPI_COMM_WORLD,…)
MPI_File_read_at(off=0,cnt=100)

56 Example 2, Option 3 Ensure that no write sequence on any process is concurrent with any sequence (read or write) on another process a sequence is a set of operations between any pair of open, close, or file_sync functions a write sequence is a sequence in which any of the functions is a write operation

57 Example 2, Option 3

Process 0:
MPI_File_open(MPI_COMM_WORLD,…)
MPI_File_write_at(off=0,cnt=100)
MPI_File_sync
MPI_Barrier
MPI_File_sync  /* collective */
MPI_Barrier
MPI_File_sync
MPI_File_read_at(off=100,cnt=100)
MPI_File_close

Process 1:
MPI_File_open(MPI_COMM_WORLD,…)
MPI_File_sync  /* collective */
MPI_Barrier
MPI_File_sync
MPI_File_write_at(off=100,cnt=100)
MPI_File_sync
MPI_Barrier
MPI_File_sync  /* collective */
MPI_File_read_at(off=0,cnt=100)
MPI_File_close

58 Example 3 Same as Example 2, except that each process uses MPI_COMM_SELF when opening the common file The only way to achieve consistency in this case is to ensure that no write sequence on any process is concurrent with any write sequence on any other process.

59 Example 3

Process 0:
MPI_File_open(MPI_COMM_SELF,…)
MPI_File_write_at(off=0,cnt=100)
MPI_File_sync
MPI_Barrier
MPI_File_sync
MPI_File_read_at(off=100,cnt=100)
MPI_File_close

Process 1:
MPI_File_open(MPI_COMM_SELF,…)
MPI_Barrier
MPI_File_sync
MPI_File_write_at(off=100,cnt=100)
MPI_File_sync
MPI_Barrier
MPI_File_read_at(off=0,cnt=100)
MPI_File_close

60 File Interoperability
Users can optionally create files with a portable binary data representation
"datarep" parameter to MPI_File_set_view
–native: default, same as in memory, not portable
–internal: implementation-defined representation providing an implementation-defined level of portability
–external32: a specific representation defined in MPI (basically 32-bit big-endian IEEE format), portable across machines and MPI implementations
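A minimal sketch of selecting the portable representation, assuming fh is an already open file handle; buf and count are illustrative names.

/* Request the portable external32 representation instead of "native" */
MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "external32", MPI_INFO_NULL);
MPI_File_write(fh, buf, count, MPI_INT, MPI_STATUS_IGNORE);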

61 General Guidelines for Achieving High I/O Performance
Buy sufficient I/O hardware for the machine
Use fast file systems, not NFS-mounted home directories
Do not perform I/O from one process only
Make large requests wherever possible
For noncontiguous requests, use derived datatypes and a single collective I/O call

62 Achieving High I/O Performance
Any application has a particular "I/O access pattern" based on its I/O needs
The same access pattern can be presented to the I/O system in different ways, depending on which I/O functions are used and how
We demonstrate how the user's choice of level affects performance

63 Example: Distributed Array Access
Large array distributed among 16 processes (P0-P15); each square represents a subarray in the memory of a single process
[Figure: the corresponding access pattern in the file -- each process's subarray is stored as many small noncontiguous pieces interleaved with those of the other processes]

64 Level-0 Access (C)
Each process makes one independent read request for each row in the local array (as in Unix)

MPI_File_open(..., file, ..., &fh);
for (i=0; i<n_local_rows; i++) {
    MPI_File_seek(fh, ...);
    MPI_File_read(fh, &(A[i][0]), ...);
}
MPI_File_close(&fh);

65 Level-1 Access (C)
Similar to level 0, but each process uses collective I/O functions

MPI_File_open(MPI_COMM_WORLD, file, ..., &fh);
for (i=0; i<n_local_rows; i++) {
    MPI_File_seek(fh, ...);
    MPI_File_read_all(fh, &(A[i][0]), ...);
}
MPI_File_close(&fh);

66 Level-2 Access (C)
Each process creates a derived datatype to describe the noncontiguous access pattern, defines a file view, and calls independent I/O functions

MPI_Type_create_subarray(..., &subarray, ...);
MPI_Type_commit(&subarray);
MPI_File_open(..., file, ..., &fh);
MPI_File_set_view(fh, ..., subarray, ...);
MPI_File_read(fh, A, ...);
MPI_File_close(&fh);

67 Level-3 Access (C)
Similar to level 2, except that each process uses collective I/O functions

MPI_Type_create_subarray(..., &subarray, ...);
MPI_Type_commit(&subarray);
MPI_File_open(MPI_COMM_WORLD, file, ..., &fh);
MPI_File_set_view(fh, ..., subarray, ...);
MPI_File_read_all(fh, A, ...);
MPI_File_close(&fh);

68 Optimizations
Given complete access information, an implementation can perform optimizations such as:
–Data sieving: read large chunks and extract what is really needed
–Collective I/O: merge requests of different processes into larger requests
–Improved prefetching and caching

69 Distributed Array Access: Read Bandwidth
[Chart: read bandwidth for 8, 32, 64, and 256 processes; array size: 512 x 512 x 512]

70 Distributed Array Access: Write Bandwidth
[Chart: write bandwidth for 8, 32, 64, and 256 processes; array size: 512 x 512 x 512]

71 Unstructured Code: Read Bandwidth
[Chart: read bandwidth for 8, 32, 64, and 256 processes]

72 Unstructured Code: Write Bandwidth
[Chart: write bandwidth for 8, 32, 64, and 256 processes]

73 Independent Writes On Paragon Lots of seeks and small writes Time shown = 130 seconds

74 Collective Write On Paragon Computation and communication precede seek and write Time shown = 2.75 seconds

75 Independent Writes with Data Sieving On Paragon Access data in large “blocks” and extract needed data Requires lock, read, modify, write, unlock for writes 4 MB blocks Time = 16 sec.

76 Changing the Block Size Smaller blocks mean less contention, therefore more parallelism 512 KB blocks Time = 10.2 seconds

77 Data Sieving with Small Blocks If the block size is too small, however, the increased parallelism doesn’t make up for the many small writes 64 KB blocks Time = 21.5 seconds

78 Common Errors
Not defining file offsets as MPI_Offset in C and integer (kind=MPI_OFFSET_KIND) in Fortran (or perhaps integer*8 in Fortran 77) -- see the sketch below
In Fortran, passing the offset or displacement directly as a constant (e.g., 0) in the absence of function prototypes (F90 mpi module)
Using the darray datatype for a block distribution other than the one defined for darray (e.g., floor division)
filetype defined using offsets that are not monotonically nondecreasing, e.g., 0, 3, 8, 4, 6 (happens in irregular applications)
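A minimal C sketch of the first point; the 3 GiB-per-rank offset is illustrative only, and would overflow a 32-bit int while MPI_Offset handles it.

/* Offsets must be MPI_Offset (a 64-bit-capable type), not int */
MPI_Offset offset = (MPI_Offset)rank * 1024 * 1024 * 1024 * 3;  /* 3 GiB per rank, illustrative */
MPI_File_read_at(fh, offset, buf, count, MPI_INT, &status);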

79 Summary MPI-IO has many features that can help users achieve high performance The most important of these features are the ability to specify noncontiguous accesses, the collective I/O functions, and the ability to pass hints to the implementation Users must use the above features! In particular, when accesses are noncontiguous, users must create derived datatypes, define file views, and use the collective I/O functions

80 Related technologies
–Hierarchical Data Format (HDF)
–Network Common Data Format (netCDF)