High Performance Computing Course Notes: Parallel I/O
Computer Science, University of Warwick

Aims

- To learn how to achieve high-performance parallel I/O
- To study a concrete implementation (MPI-IO), covering:
  - key concepts, including etypes, displacements and views
  - collective vs. non-collective I/O
  - contiguous vs. non-contiguous I/O
Why are we looking at parallel I/O?

- I/O is a major bottleneck in many parallel applications
- I/O subsystems for parallel machines may be designed for high performance, yet many applications achieve less than 1/10th of the peak I/O bandwidth
- Parallel I/O systems are designed for large data transfers (megabytes of data)
- However, many parallel applications make many smaller I/O requests (less than a kilobyte)
Parallel I/O – version 1.0

Early solutions: all processes send their data to process 0, which then writes to the file.

- Phase 1: all processes send their data (d0, d1, d2, d3) to process 0
- Phase 2: process 0 writes the gathered data to the file

[Figure: data blocks d0–d3 gathered onto process 0, then written out as a single file]
Parallel I/O – version 1.0

Bad things about version 1.0:
1. Single node bottleneck
2. Poor performance
3. Poor scalability
4. Single point of failure

Good things about version 1.0:
1. The parallel machine need only support I/O from one process
2. No specialised I/O library is needed
3. If you are converting from sequential code, this parallel version of the program stays close to the original
4. Results in a single file, which is easy to manage
Parallel I/O – version 2.0

Each process writes to a separate file, so all processes can now write in a single phase.

[Figure: processes write d0–d3 directly to File 1, File 2, File 3 and File 4]

Good things about version 2.0:
1. Now we are doing things in parallel
2. High performance

Bad things about version 2.0:
1. We now have lots of small files to manage
2. How do we read the data back when the number of processes changes?
3. Does not interoperate well with other applications
Parallel I/O – version 3.0

Multiple processes of a parallel program access (read/write) data from a common file, so all processes can write in one phase, to one shared file.

[Figure: processes write d0–d3 into a single common file]

Good things about version 3.0:
- Simultaneous I/O from any number of processes
- Maps well onto collective operations
- Excellent performance and scalability
- Results in a single file, which is easy to manage and interoperates well with other applications

Bad things about version 3.0:
- Requires more complex I/O library support
What is Parallel I/O?

Multiple processes of a parallel program accessing data (reading or writing) from a common file.

[Figure: processes P0, P1, P2, …, P(n-1) all accessing one FILE]
Why Parallel I/O?

Non-parallel I/O:
- Simple
- Poor performance, if a single process is writing to one file
- Hard to interoperate with other applications, if writing to more than one file

Parallel I/O:
- Provides high performance
- Provides a single file with which it is easy to interoperate with other tools (e.g. visualisation systems)
- If you design it right, you can use existing features of parallel libraries, such as collectives and derived datatypes
Why Parallel I/O?

We are going to look at parallel I/O in the context of MPI. Why?
- Because writing is like sending a message and reading is like receiving one
- Because collective-like operations are important in parallel I/O
- Because non-contiguous data layout is important (if we are using a single file), and this is supported by MPI datatypes
- Parallel I/O is now an integral part of MPI-2
Parallel I/O example

Consider a 2D array distributed among 16 processes in a 4x4 block decomposition, with the array stored in row-major order.

[Figure: the array divided into blocks owned by P0–P15, and the corresponding file, in which the block rows are interleaved: pieces of P0 P1 P2 P3 repeat for the first four file rows, then P4 P5 P6 P7, etc. Each process's block is therefore non-contiguous in the file.]
Access pattern 1: MPI_File_seek

Updates the individual file pointer.

    int MPI_File_seek(
        MPI_File mpi_fh,
        MPI_Offset offset,
        int whence
    );

Parameters:
- mpi_fh: [in] file handle (handle)
- offset: [in] file offset (integer)
- whence: [in] update mode (state)

MPI_FILE_SEEK updates the individual file pointer according to whence, which has the following possible values:
- MPI_SEEK_SET: the pointer is set to offset
- MPI_SEEK_CUR: the pointer is set to the current pointer position plus offset
- MPI_SEEK_END: the pointer is set to the end of file plus offset
Access pattern 1: MPI_File_read

Read using the individual file pointer.

    int MPI_File_read(
        MPI_File mpi_fh,
        void *buf,
        int count,
        MPI_Datatype datatype,
        MPI_Status *status
    );

Parameters:
- mpi_fh: [in] file handle (handle)
- buf: [out] initial address of buffer
- count: [in] number of elements in buffer (nonnegative integer)
- datatype: [in] datatype of each buffer element (handle)
- status: [out] status object (Status)
Access pattern 1

We could do a UNIX-style access pattern in MPI-IO: one independent read request is issued for each row in the local array, giving many independent, contiguous requests.

    MPI_File_open(…, "filename", …, &fh);
    for (i = 0; i < n_local_rows; i++) {
        MPI_File_seek(fh, offset, …);
        MPI_File_read(fh, row[i], …);
    }
    MPI_File_close(&fh);

- There are individual file pointers per process, per file handle
- Each process sets the file pointer with a suitable offset
- The data is then read into the local array
- This is not a collective operation (each process reads independently)
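As a concrete illustration, here is a minimal sketch of this pattern for the 16x16 array on 16 processes from the earlier example; the file name, the element type (float) and the 4x4 process grid are assumptions made for the example.

    #include <mpi.h>

    #define GROWS 16   /* rows in the global array */
    #define GCOLS 16   /* columns in the global array */

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* 4x4 process grid; each process owns a 4x4 block. */
        const int lrows = GROWS / 4, lcols = GCOLS / 4;
        const int prow = rank / 4, pcol = rank % 4;
        float local[4][4];

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "array.dat",
                      MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

        /* One independent seek + contiguous read per local row. */
        for (int i = 0; i < lrows; i++) {
            MPI_Offset offset =
                ((MPI_Offset)(prow * lrows + i) * GCOLS + pcol * lcols)
                * sizeof(float);
            MPI_File_seek(fh, offset, MPI_SEEK_SET);
            MPI_File_read(fh, local[i], lcols, MPI_FLOAT,
                          MPI_STATUS_IGNORE);
        }

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }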
Access pattern 2: MPI_File_read_all

Collective read using the individual file pointer.

    int MPI_File_read_all(
        MPI_File mpi_fh,
        void *buf,
        int count,
        MPI_Datatype datatype,
        MPI_Status *status
    );

Parameters:
- mpi_fh: [in] file handle (handle)
- buf: [out] initial address of buffer (choice)
- count: [in] number of elements in buffer (nonnegative integer)
- datatype: [in] datatype of each buffer element (handle)
- status: [out] status object (Status)

MPI_FILE_READ_ALL is a collective version of the blocking MPI_FILE_READ interface.
Access pattern 2

Similar to access pattern 1, but using collectives: all processes that opened the file read the data together (each with its own access information). This gives many collective, contiguous requests.

    MPI_File_open(…, "filename", …, &fh);
    for (i = 0; i < n_local_rows; i++) {
        MPI_File_seek(fh, offset, …);
        MPI_File_read_all(fh, row[i], …);
    }
    MPI_File_close(&fh);

- read_all is a collective version of the read operation, and is blocking
- Each process accesses the file at the same time
- This may be useful, because independent I/O operations convey nothing about what other processes are doing at the same time; a collective call lets the library see the access pattern as a whole
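In the sketch given for pattern 1, only the read call changes. Note that because MPI_File_read_all is collective, every process must execute the same number of iterations; that holds here, since each process owns the same number of rows.

    /* Same loop as before, but the read is now collective. */
    for (int i = 0; i < lrows; i++) {
        MPI_Offset offset =
            ((MPI_Offset)(prow * lrows + i) * GCOLS + pcol * lcols)
            * sizeof(float);
        MPI_File_seek(fh, offset, MPI_SEEK_SET);
        MPI_File_read_all(fh, local[i], lcols, MPI_FLOAT,
                          MPI_STATUS_IGNORE);
    }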
Access pattern 3: Definitions

File
- Ordered collection of typed data items
- MPI supports random or sequential access
- Opened collectively by a group of processes
- All collective I/O calls on the file are done over this group

Displacement
- Absolute byte position relative to the beginning of the file
- Defines the location where a view begins

etype (elementary datatype)
- Unit of data access and positioning
- Can be a predefined or derived datatype
- Offsets are expressed as multiples of etypes
Access pattern 3: Definitions

Filetype
- Basis for partitioning the file among processes; defines a template for accessing the file (based on the etype)

View
- Current set of data visible and accessible from an open file (as an ordered set of etypes)
- Each process has its own view, based on a displacement, an etype and a filetype
- The pattern defined by the filetype is repeated (in units of etypes), beginning at the displacement
Access pattern 3: File Views

A view is specified by a triplet (displacement, etype, filetype) passed to MPI_File_set_view:
- displacement = number of bytes to be skipped from the start of the file
- etype = basic unit of data access (can be any basic or derived datatype)
- filetype = specifies which portion of the file is visible to the process
Access pattern 3: A Simple Noncontiguous File View Example

- etype = MPI_INT
- filetype = two MPI_INTs followed by a gap of four MPI_INTs

[Figure: from the head of the FILE, a displacement in bytes, then the filetype pattern (two ints of data, a four-int hole) tiled repeatedly along the file, and so on.]
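One way to build this filetype is to resize a two-int contiguous type so that its extent covers six ints. This is a sketch, assuming an already-opened file handle fh and a byte displacement disp:

    MPI_Datatype contig2, filetype;

    /* Two consecutive MPI_INTs of data... */
    MPI_Type_contiguous(2, MPI_INT, &contig2);

    /* ...given an extent of six MPI_INTs, so each repetition of the
       pattern leaves a four-int hole in the file. */
    MPI_Type_create_resized(contig2, 0, 6 * sizeof(int), &filetype);
    MPI_Type_commit(&filetype);

    /* Skip disp bytes, then tile the pattern along the file. */
    MPI_File_set_view(fh, disp, MPI_INT, filetype, "native",
                      MPI_INFO_NULL);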
Access pattern 3: How do views relate to multiple processes?

A group of processes uses complementary views to achieve a global data distribution, partitioning the file among the parallel processes.

[Figure: processes 0, 1 and 2 each have a filetype whose data regions interleave; applied after the displacement, the three views tile the file between them.]
MPI_File_set_view

Describes the part of the file accessed by a single MPI process.

    int MPI_File_set_view(
        MPI_File mpi_fh,
        MPI_Offset disp,
        MPI_Datatype etype,
        MPI_Datatype filetype,
        char *datarep,
        MPI_Info info
    );

Parameters:
- mpi_fh: [in] file handle (handle)
- disp: [in] displacement (nonnegative integer)
- etype: [in] elementary datatype (handle)
- filetype: [in] filetype (handle)
- datarep: [in] data representation (string)
- info: [in] info object (handle)
Access pattern 3: File View Example

    MPI_File thefile;
    for (i = 0; i < BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;
    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &thefile);
    MPI_File_set_view(thefile, myrank * BUFSIZE * sizeof(int),
                      MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
    MPI_File_write(thefile, buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&thefile);
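Here each rank's displacement places its view at its own contiguous block, so the ranks write disjoint regions of one file. To read the data back with the same decomposition, each process can reuse an identical view (a sketch, assuming the same myrank and BUFSIZE as above):

    MPI_File_open(MPI_COMM_WORLD, "testfile", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &thefile);
    MPI_File_set_view(thefile, myrank * BUFSIZE * sizeof(int),
                      MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
    MPI_File_read(thefile, buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&thefile);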
MPI_Type_create_subarray

Creates a datatype for a subarray of a regular, multidimensional array.

    int MPI_Type_create_subarray(
        int ndims,
        int array_of_sizes[],
        int array_of_subsizes[],
        int array_of_starts[],
        int order,
        MPI_Datatype oldtype,
        MPI_Datatype *newtype
    );

Parameters:
- ndims: [in] number of array dimensions (positive integer)
- array_of_sizes: [in] number of elements of type oldtype in each dimension of the full array (array of positive integers)
- array_of_subsizes: [in] number of elements of type oldtype in each dimension of the subarray (array of positive integers)
- array_of_starts: [in] starting coordinates of the subarray in each dimension (array of nonnegative integers)
- order: [in] array storage order flag (state)
- oldtype: [in] array element datatype (handle)
- newtype: [out] new datatype (handle)
Using the Subarray Datatype

    gsizes[0] = 16;  /* no. of rows in global array */
    gsizes[1] = 16;  /* no. of columns in global array */

    psizes[0] = 4;   /* no. of procs. in vertical dimension */
    psizes[1] = 4;   /* no. of procs. in horizontal dimension */

    lsizes[0] = 16 / psizes[0];  /* no. of rows in local array */
    lsizes[1] = 16 / psizes[1];  /* no. of columns in local array */

    dims[0] = 4;
    dims[1] = 4;
    periods[0] = periods[1] = 1;

    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm);
    MPI_Comm_rank(comm, &rank);
    MPI_Cart_coords(comm, rank, 2, coords);
Subarray Datatype contd.

    /* global indices of the first element of the local array */
    start_indices[0] = coords[0] * lsizes[0];
    start_indices[1] = coords[1] * lsizes[1];

    MPI_Type_create_subarray(2, gsizes, lsizes, start_indices,
                             MPI_ORDER_C, MPI_FLOAT, &filetype);
    MPI_Type_commit(&filetype);
Access pattern 3

Each process creates a derived datatype to describe the non-contiguous access pattern. We thus have a file view and independent access: a single independent, non-contiguous request.

    MPI_Type_create_subarray(…, &subarray, …);
    MPI_Type_commit(&subarray);
    MPI_File_open(…, "filename", …, &fh);
    MPI_File_set_view(fh, …, subarray, …);
    MPI_File_read(fh, local_array, …);
    MPI_File_close(&fh);

- MPI_Type_create_subarray creates a datatype describing a subarray of a multi-dimensional array
- MPI_Type_commit commits the datatype (this must be done before the datatype is used in communication or I/O)
- The system may compile an internal representation for the datatype at commit time
Access pattern 3 contd.

The same sequence of calls, now focusing on the file access:

    MPI_Type_create_subarray(…, &subarray, …);
    MPI_Type_commit(&subarray);
    MPI_File_open(…, "filename", …, &fh);
    MPI_File_set_view(fh, …, subarray, …);
    MPI_File_read(fh, local_array, …);
    MPI_File_close(&fh);

- The file is opened as before
- set_view then changes the process's view of the data in the file
- set_view is collective, although the reads are still independent
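Putting the subarray slides and this pattern together, a minimal sketch might look like the following; the file name is illustrative, and filetype, lsizes and local_array are as set up on the previous slides.

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "array.dat",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    /* Zero displacement: the subarray's start coordinates already
       encode this process's position within the file. */
    MPI_File_set_view(fh, 0, MPI_FLOAT, filetype, "native",
                      MPI_INFO_NULL);

    /* A single independent, non-contiguous request reads the whole
       local block, even though it is scattered through the file. */
    MPI_File_read(fh, local_array, lsizes[0] * lsizes[1], MPI_FLOAT,
                  MPI_STATUS_IGNORE);

    MPI_File_close(&fh);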
Access pattern 3

Note that here we read the whole sub-array in one request, despite the non-contiguous storage in the file.

[Figure: the 4x4 block decomposition of the array over P0–P15, the corresponding interleaved file layout, and the filetypes for processes 0–3.]

Processes {4,5,6,7}, {8,9,10,11} and {12,13,14,15} will have file views based on the same filetypes, but with different displacements.
Access pattern 4

Each process creates a derived datatype to describe the non-contiguous access pattern. We now have a file view and collective access: a single collective, non-contiguous request.

    MPI_Type_create_subarray(…, &subarray, …);
    MPI_Type_commit(&subarray);
    MPI_File_open(…, "filename", …, &fh);
    MPI_File_set_view(fh, …, subarray, …);
    MPI_File_read_all(fh, local_array, …);
    MPI_File_close(&fh);

- The datatype is created and committed as before
- set_view changes the process's view of the data in the file, and is collective
- The reads are now collective too
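Relative to the pattern-3 sketch above, only the data-transfer call changes; the collective call gives the library the global picture it needs for optimisations such as two-phase I/O, discussed below. Writing has the same shape, via MPI_File_write_all (a sketch, reusing fh, local_array and lsizes from the earlier example):

    /* Collective read of the local block (pattern 4)... */
    MPI_File_read_all(fh, local_array, lsizes[0] * lsizes[1],
                      MPI_FLOAT, MPI_STATUS_IGNORE);

    /* ...and the corresponding collective write, assuming the file
       was opened with MPI_MODE_WRONLY or MPI_MODE_RDWR. */
    MPI_File_write_all(fh, local_array, lsizes[0] * lsizes[1],
                       MPI_FLOAT, MPI_STATUS_IGNORE);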
Access patterns

These access patterns express four different styles of parallel I/O. You should choose your access pattern depending on the application:
- The larger the I/O request, the better the performance
- Collectives are going to do better than individual reads
- Pattern 4 therefore offers (potentially) the best performance
I/O optimization: Data Sieving

- Data sieving is used to combine lots of small accesses into a single larger one
- Remote file systems (parallel or not) tend to have high latencies
- Reducing the number of operations is therefore important
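Data sieving is typically applied automatically by the MPI-IO library, but it can often be steered through hints. The key names below are ROMIO-specific (ROMIO is the MPI-IO implementation shipped with MPICH); other implementations may silently ignore them, so treat this as an assumption-laden sketch:

    MPI_Info info;
    MPI_Info_create(&info);

    /* Ask ROMIO to sieve independent reads, using a 4 MB intermediate
       buffer to coalesce the small requests. */
    MPI_Info_set(info, "romio_ds_read", "enable");
    MPI_Info_set(info, "ind_rd_buffer_size", "4194304");

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "data.bin",
                  MPI_MODE_RDONLY, info, &fh);
    MPI_Info_free(&info);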
I/O optimization: Data Sieving Writes

Using data sieving for writes is more complicated:
- Must read the entire region first
- Then make our changes
- Then write the block back

This requires locking in the file system and can result in false sharing.
I/O optimization: Two-Phase Collective I/O

Problems with independent, noncontiguous access:
- Lots of small accesses
- Independent data sieving reads lots of extra data

Idea: reorganize the access to match the layout on the disks:
- A single process uses data sieving to get data on behalf of many
- This often reduces total I/O through sharing of common blocks
- A second "phase" moves the data to its final destination
I/O optimization: Collective I/O

- Collective I/O is coordinated access to storage by a group of processes
- Collective I/O functions must be called by all processes participating in I/O
- This allows the I/O layers to know more about the access as a whole
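As with data sieving, two-phase collective buffering can usually be tuned through hints. Again, the key names below are ROMIO-specific assumptions rather than portable MPI:

    MPI_Info info;
    MPI_Info_create(&info);

    /* Enable two-phase collective writes, with a 16 MB buffer on each
       of four aggregator processes. */
    MPI_Info_set(info, "romio_cb_write", "enable");
    MPI_Info_set(info, "cb_buffer_size", "16777216");
    MPI_Info_set(info, "cb_nodes", "4");

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);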