1
MPI IO: Parallel Distributed Systems, File Management I/O. Peter Collins, Khadouj Fikry
2
File Management I/O Overview
Problem statement: as data sizes grow, I/O parallelization becomes a necessity to avoid a scalability bottleneck in almost every application. General: file management I/O distributes large data files among multiple computing nodes, where the instructions that operate on them are executed. The goal: use parallelism to increase bandwidth and reduce execution time when dealing with large data sets. We first provide an overview of file management and the solution needed, then focus on MPI-IO. NFS: Network File System. PVFS: Parallel Virtual File System. AFS: Andrew File System. [1] [2]
3
File Management I/O Challenge: I/O can be challenging to implement, coordinate, and optimize, especially when dealing directly with the file system or the network protocol layer. Solution: specialized infrastructures provide an intermediate layer that coordinates data access and maps from the application layer to the I/O layer. Examples: MPI-IO, NFS, PVFS, Hadoop, Parallel HDF5, Parallel netCDF, T3PIO, ... [1]
4
Sequential I/O
(Diagram: Data 0 through Data 3 are funneled through a single process and written to one file; performance bottleneck.) Sequential I/O is very simple, but it is a bottleneck: it limits the performance and scalability of many applications. It does not scale, and performance degrades as the number of processes increases. This is the old-fashioned approach, before parallelism.
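Here is a minimal sketch of this old-fashioned pattern (an illustration, not from the slides; the chunk size and file name are placeholders): every rank sends its data to rank 0, which alone writes the file, so a single process's write bandwidth limits the whole job.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int chunk = 1024;                 /* ints per rank (placeholder) */
        int *mine = malloc(chunk * sizeof(int));
        for (int i = 0; i < chunk; i++) mine[i] = rank;

        /* All data funnels through rank 0: the classic sequential-I/O bottleneck. */
        int *all = NULL;
        if (rank == 0) all = malloc((size_t)nprocs * chunk * sizeof(int));
        MPI_Gather(mine, chunk, MPI_INT, all, chunk, MPI_INT, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            FILE *f = fopen("output.bin", "wb");   /* one process does all the I/O */
            fwrite(all, sizeof(int), (size_t)nprocs * chunk, f);
            fclose(f);
            free(all);
        }

        free(mine);
        MPI_Finalize();
        return 0;
    }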
5
Parallel I/O: Multiple Files
(Diagram: each process writes its own data, Data 0 through Data 3, to its own file, File 1 through File 4.) This is an improvement over sequential I/O: each process writes independently, which provides good I/O bandwidth and parallelism. In reality, I/O servers are shared, so data may still have to be moved between tasks. The drawbacks are a lot of small files to manage and the need to aggregate the results afterwards.
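A similar sketch of the file-per-process approach (again an illustration, with placeholder file names): each rank writes its own file with ordinary C I/O, which parallelizes well but leaves many small files to manage and merge later.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int chunk = 1024;                 /* ints per rank (placeholder) */
        int data[1024];
        for (int i = 0; i < chunk; i++) data[i] = rank;

        /* Each rank opens and writes its own file: good bandwidth, many files. */
        char name[64];
        snprintf(name, sizeof(name), "output.%d.bin", rank);
        FILE *f = fopen(name, "wb");
        fwrite(data, sizeof(int), chunk, f);
        fclose(f);

        MPI_Finalize();
        return 0;
    }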
6
[1]: http://www.mcs.anl.gov/~thakur/papers/mpi-io-noncontig.pdf
About MPI-IO: the objective of MPI-IO is to read from and write to a single file in parallel. It interoperates with the file system to improve I/O performance in distributed-memory applications. Function calls are similar to POSIX commands. It can deliver good performance, especially when dealing with many small, distinct, non-contiguous I/O requests [1]. It is relatively easy to use, since it builds on the existing MPI data structures. It is portable: MPI-IO code can run on any compute node that supports MPI, although the binaries themselves are not portable.
7
Parallel I/O: Single File
(Diagram: multiple processes read and write Data 0 through Data 3 within a single common file.) Higher complexity, but: performance improvement, optimization opportunity, and scalability. Multiple processes participate in reading data from or writing data to a common file in parallel. This improves performance and provides a single file for storage and transfer purposes. At the program level: concurrent reads and writes from multiple processes to a common file. At the system level: a parallel file system and hardware that support such concurrent access.
8
3 Keys to MPI-IO
Positioning: explicit offset (non-contiguous) or implicit file pointer (contiguous)
Coordination: collective or non-collective
Synchronization: blocking (synchronous) or non-blocking (asynchronous)

Function                 Positioning      Coordination     Synchronization
MPI_File_read            Contiguous       Non-Collective   Blocking
MPI_File_read_at         Non-Contiguous   Non-Collective   Blocking
MPI_File_read_all        Contiguous       Collective       Blocking
MPI_File_read_at_all     Non-Contiguous   Collective       Blocking
MPI_File_iread()         Contiguous       Non-Collective   Non-Blocking

The function name encodes the three keys: _at means an explicit offset, _all means collective, and the i prefix means non-blocking. Writing is like sending; reading is like receiving.
9
MPI-IO Components
File management components:
File handler: usually an abstract data type (ADT) used to access the file.
File pointer: the position in the file at which we read and write; managed through the file handler.
File view: defines the portion of the file that is visible to each process. It enables efficient non-contiguous access patterns to the file.
I/O components:
Collective / non-collective I/O. Collective: all processes in the communicator read or write data together and wait for each other, e.g. MPI_File_read_all(). Non-collective: no coordination by the MPI infrastructure, e.g. MPI_File_read().
Contiguous / non-contiguous. Contiguous is the MPI-IO default: the entire file is visible to the process, and data is read or written contiguously starting from the location specified by the read/write call, e.g. MPI_File_read(). Non-contiguous: MPI-IO allows non-contiguous data access for reads and writes with a single I/O function call, e.g. MPI_File_read_at().
Asynchronous / synchronous. Asynchronous calls let you continue computation while the data is transferred in the background, using MPI_Test or MPI_Wait to check whether the transfer has completed; they are non-blocking, e.g. MPI_File_iread(). Synchronous calls such as MPI_File_read() or MPI_File_write() return only once the data in the read or write buffer has been transferred; only then is it safe to reuse the buffer. They are blocking.
Collective I/O is a critical optimization strategy for reading from, and writing to, the parallel file system. The collective read and write calls force all processes in the communicator to read/write data together and to wait for each other. The MPI implementation optimizes the request based on the combined requests of all processes and can merge the requests of different processes to service them efficiently. This is particularly effective when the accesses of different processes are non-contiguous. Non-contiguous access can be in the file, in memory, or both. Each process describes the part of the file for which it is responsible using a file view with offsets; only that part is visible to the process. This provides a very efficient way to perform non-contiguous access, for example distributed arrays stored in files. Sometimes the overhead of collective calls outweighs their benefits, for example small I/O during header reads.
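To make the asynchronous case concrete, here is a minimal sketch (not from the slides; the file name and element count are placeholders) of a non-blocking read with MPI_File_iread followed by MPI_Wait.

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        MPI_File fh;
        MPI_Request request;
        MPI_Status status;
        int buffer[1024];                      /* placeholder element count */

        /* Open an existing file for reading (collective over the communicator). */
        MPI_File_open(MPI_COMM_WORLD, "data.bin", MPI_MODE_RDONLY,
                      MPI_INFO_NULL, &fh);

        /* Start a non-blocking read; the call returns immediately. */
        MPI_File_iread(fh, buffer, 1024, MPI_INT, &request);

        /* ... overlap computation here while the read proceeds ... */

        /* Block until the transfer completes; only now is buffer safe to use. */
        MPI_Wait(&request, &status);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }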
10
File Views: define the portion of the file that is visible to each process.
A file view describes where in memory and in the file the current process can read or write. It enables efficient non-contiguous access patterns to the file. Because the views do not overlap, the file can be accessed in parallel without data being corrupted. Basic / derived: file views can use basic existing MPI datatypes or derived datatypes. (Diagram: processes 0 through n each see their own region of the file through their own file pointer.) File views are a powerful part of MPI I/O: they let a process write to its own part of the file without being able to touch parts it shouldn't. Derived datatypes let you define custom types that describe both the memory layout and the file layout.
11
File View Example
(Diagram: Proc 0 through Proc 3 each access a different part of a repeating filetype across iterations 1, 2, and 3, after an initial displacement.)
The filetype is a group of elementary types (etypes); the etype could be a simple MPI_INT or a complex struct of ints, doubles, and floats.

int MPI_File_set_view(MPI_File fh, MPI_Offset disp, MPI_Datatype etype, MPI_Datatype filetype, char *datarep, MPI_Info info)

Example call:
MPI_File_set_view(fh, 2 * sizeof(int), MPI_INT, filetype, "native", MPI_INFO_NULL);  /* displacement of two etypes, in bytes */

Each process accesses a different part of the filetype, and when the file pointer advances it advances by whole filetypes, so file views automate the movement of the file pointer itself: processes never step on each other's toes and can access data as structs. A view is defined by a displacement, an etype, and a filetype. The etype is the elementary unit the file is made of and can be primitive or derived. The file is partitioned among processes by the filetype, which serves as a template for how each process accesses the file and should be constructed from multiple instances of the etype; every process defines its filetype (and displacement) differently.
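As an illustration of how such a view might be built (a sketch under assumed sizes, not the authors' code): each rank owns every nprocs-th block of two ints, the filetype is a resized contiguous type so that tiling it leaves holes for the other ranks, and the displacement shifts each rank to its first block.

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* The file is divided into blocks of 2 ints handed out round-robin. */
        const int block_len = 2;

        /* Filetype: one block of ints, resized so its extent spans nprocs blocks. */
        MPI_Datatype contig, filetype;
        MPI_Type_contiguous(block_len, MPI_INT, &contig);
        MPI_Type_create_resized(contig, 0,
                                (MPI_Aint)nprocs * block_len * sizeof(int), &filetype);
        MPI_Type_commit(&filetype);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "view_demo.bin",
                      MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

        /* Displacement (in bytes) shifts each rank to its own first block. */
        MPI_Offset disp = (MPI_Offset)rank * block_len * sizeof(int);
        MPI_File_set_view(fh, disp, MPI_INT, filetype, "native", MPI_INFO_NULL);

        /* Write 4 blocks of 2 ints; the view scatters them to this rank's slots. */
        int buf[8];
        for (int i = 0; i < 8; i++) buf[i] = rank;
        MPI_Status status;
        MPI_File_write(fh, buf, 4 * block_len, MPI_INT, &status);

        MPI_Type_free(&contig);
        MPI_Type_free(&filetype);
        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }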
12
Sample MPI-IO Program Sequence
    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char **argv) {
        int rank, rankSize, bufferSize, numberInts;
        int fileSize = 1024 * sizeof(int);          // total file size in bytes (placeholder)
        MPI_File fh;                                // file handler
        MPI_Status status;                          // stores status of each operation

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &rankSize);

        bufferSize = fileSize / rankSize;           // amount of the file each node handles
        numberInts = bufferSize / sizeof(int);      // number of ints that fit in the buffer
        int buffer[numberInts];                     // local buffer used with the file view
        MPI_Offset offset = (MPI_Offset)rank * bufferSize;      // byte displacement for each rank
        for (int i = 0; i < numberInts; i++) buffer[i] = rank;  // fill with this rank's data

        MPI_File_open(MPI_COMM_WORLD, "filename", MPI_MODE_CREATE | MPI_MODE_RDWR,
                      MPI_INFO_NULL, &fh);
        MPI_File_set_view(fh, offset, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
        MPI_File_write(fh, buffer, numberInts, MPI_INT, &status);
        MPI_File_seek(fh, 0, MPI_SEEK_SET);         // rewind to the start of this rank's view
        MPI_File_read(fh, buffer, numberInts, MPI_INT, &status);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }

1 - Set up MPI: MPI initialization; the file handler communicates the file position for each node.
2 - Open file: opens the file in a specific mode. The open is collective over the communicator (issued on all nodes), and the modes are similar to Unix open modes.
3 - Set file views: each process (0, 1, ..., n) sets its own view of the file. Within a file view, the displacement is the number of bytes to skip from the start of the file, the etype is the unit of data access (basic or derived), and the filetype describes which portion of the file is visible to the process.
4 - Read/write: here with an implicit file pointer, blocking, non-collective, Unix-I/O style.
5 - Close and finalize: the close synchronizes the file state and then closes it; file close is also collective.
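As a usage note (not part of the slides), a program like this is typically compiled with an MPI compiler wrapper such as mpicc and launched with mpirun or mpiexec, for example mpirun -np 4 ./mpi_io_demo; the exact commands depend on the MPI installation.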
14
Collective I/O
(Diagram: many small individual requests across all processes are transformed into large collective accesses.) With collective I/O, the many small I/O requests across all processes are merged into larger I/O operations, and all processes' I/O happens together. This is effective when there are many non-contiguous accesses at once: the small non-contiguous requests are gathered and combined so that the file is read or written efficiently. How the requests are merged is determined by the MPI implementation; performance can increase because the file is read or written only once and the data is then distributed to the correct processes. Advantages: the ability to specify non-contiguous access in memory and in the file within a single function call by using derived datatypes. Collective I/O plus non-contiguous access gives the highest performance.
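A rough sketch of collective access (placeholder file name and chunk size, not the authors' code): every rank writes its own chunk of a shared file with the collective MPI_File_write_at_all, letting the MPI implementation merge the requests.

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int chunk = 256;             /* ints per rank (placeholder) */
        int data[256];
        for (int i = 0; i < chunk; i++) data[i] = rank;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "collective.bin",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Explicit byte offset for this rank; the _at_all call is collective,
           so the implementation can combine all ranks' requests into large I/O. */
        MPI_Offset offset = (MPI_Offset)rank * chunk * sizeof(int);
        MPI_Status status;
        MPI_File_write_at_all(fh, offset, data, chunk, MPI_INT, &status);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }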
15
MPI IO Best Uses
Performing operations on a large data file.
High-performance parallel applications that require I/O.
Making many small I/O requests to non-contiguous parts of a file.
Describing non-contiguous file access patterns.
Common Mistakes
Attempting to write to multiple files.
Miscalculating the offset/displacement (see the sketch below).
Making frequent metadata accesses.
Advantages: the ability to specify non-contiguous access in memory and in the file within a single function call by using derived datatypes. Collective I/O plus non-contiguous access gives the highest performance. MPI-IO is precise and provides high performance, with consistency points guided by the user.
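To make the offset pitfall concrete, a small hypothetical example (the file name and count are assumptions): MPI_File_set_view takes its displacement in bytes, while MPI_File_read_at takes its offset in etype units relative to the current view, so mixing the two up misplaces the access.

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int count = 100;                       /* ints per rank (placeholder) */
        int buf[100];

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "offsets.bin", MPI_MODE_RDONLY,
                      MPI_INFO_NULL, &fh);

        /* MPI_File_set_view: the displacement is always in bytes. */
        MPI_Offset disp_bytes = (MPI_Offset)rank * count * sizeof(int);
        MPI_File_set_view(fh, disp_bytes, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);

        /* MPI_File_read_at: the offset is in etype units relative to the view,
           here MPI_INT, so 0 means "the first int of this rank's region".
           Passing disp_bytes again here would be the classic miscalculation. */
        MPI_Status status;
        MPI_File_read_at(fh, 0, buf, count, MPI_INT, &status);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }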
16
Example - Large Distributed Array
(Figure: a large distributed array; each process's sub-array maps to non-contiguous regions of the shared file.) Advantages: the ability to specify non-contiguous access in memory and in the file within a single function call by using derived datatypes. Collective I/O plus non-contiguous access gives the highest performance.
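A hypothetical sketch of this pattern (the array size and file name are assumptions, not the authors' code): a square global array of ints is distributed in row blocks, each rank builds a subarray filetype with MPI_Type_create_subarray, sets its view, and reads its piece with the collective MPI_File_read_all.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Assumed global array size; must be divisible by the number of ranks. */
        const int N = 64;
        int local_rows = N / nprocs;

        /* Describe this rank's block of rows as a subarray of the global array. */
        int sizes[2]    = {N, N};                 /* global array dimensions */
        int subsizes[2] = {local_rows, N};        /* this rank's piece        */
        int starts[2]   = {rank * local_rows, 0}; /* where the piece begins   */

        MPI_Datatype filetype;
        MPI_Type_create_subarray(2, sizes, subsizes, starts,
                                 MPI_ORDER_C, MPI_INT, &filetype);
        MPI_Type_commit(&filetype);

        int *local = malloc((size_t)local_rows * N * sizeof(int));

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "array.bin", MPI_MODE_RDONLY,
                      MPI_INFO_NULL, &fh);

        /* The view exposes only this rank's subarray; the collective read lets
           the MPI implementation merge all ranks' requests (level-3 access). */
        MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);
        MPI_Status status;
        MPI_File_read_all(fh, local, local_rows * N, MPI_INT, &status);

        MPI_File_close(&fh);
        MPI_Type_free(&filetype);
        free(local);
        MPI_Finalize();
        return 0;
    }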
17
MPI IO 4 Levels of Access and Use Cases
Level 0: each process uses MPI_File_read to read a single element of its sub-array at a time, independently (Unix style).
Level 1: each process uses MPI_File_read_all to read its elements collectively, still one contiguous piece at a time.
Level 2: each process creates a derived datatype for its sub-array, creates a file view describing the non-contiguous access, and performs independent I/O.
Level 3: same as level 2, but using collective function calls (as in the subarray sketch above).
Advantages: the ability to specify non-contiguous access in memory and in the file within a single function call by using derived datatypes. Collective I/O plus non-contiguous access (level 3) gives the highest performance.
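For contrast with the level-3 sketch above, a hypothetical level-0 version of the same read (same assumed array size and file name): each rank issues independent, Unix-style requests for its rows one at a time, giving the MPI library no chance to merge them.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int N = 64;                       /* same assumed global size as above */
        int local_rows = N / nprocs;
        int *local = malloc((size_t)local_rows * N * sizeof(int));

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "array.bin", MPI_MODE_RDONLY,
                      MPI_INFO_NULL, &fh);

        /* Level 0: one small independent request per row, explicit byte offsets
           under the default view, no file view and no collective coordination. */
        MPI_Status status;
        for (int r = 0; r < local_rows; r++) {
            MPI_Offset offset = ((MPI_Offset)(rank * local_rows + r) * N) * sizeof(int);
            MPI_File_read_at(fh, offset, &local[r * N], N, MPI_INT, &status);
        }

        MPI_File_close(&fh);
        free(local);
        MPI_Finalize();
        return 0;
    }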
18
MPI I/O Command List
File management: MPI_File_open, MPI_File_close, MPI_File_delete, MPI_File_preallocate, MPI_File_set_size, MPI_File_get_size, MPI_File_sync, MPI_File_set_atomicity, MPI_File_get_atomicity, MPI_File_get_amode, MPI_File_get_group, MPI_File_set_info, MPI_File_get_info, MPI_File_c2f, MPI_File_f2c
File views and positioning: MPI_File_set_view, MPI_File_get_view, MPI_File_get_type_extent, MPI_File_get_byte_offset, MPI_File_get_position, MPI_File_get_position_shared, MPI_File_seek, MPI_File_seek_shared
Error handling: MPI_File_set_errhandler, MPI_File_get_errhandler, MPI_File_create_errhandler, MPI_File_call_errhandler
Blocking reads: MPI_File_read, MPI_File_read_all, MPI_File_read_at, MPI_File_read_at_all, MPI_File_read_shared, MPI_File_read_ordered
Non-blocking and split-collective reads: MPI_File_iread, MPI_File_iread_all, MPI_File_iread_at, MPI_File_iread_at_all, MPI_File_iread_shared, MPI_File_read_all_begin, MPI_File_read_all_end, MPI_File_read_at_all_begin, MPI_File_read_at_all_end, MPI_File_read_ordered_begin, MPI_File_read_ordered_end
Blocking writes: MPI_File_write, MPI_File_write_all, MPI_File_write_at, MPI_File_write_at_all, MPI_File_write_shared, MPI_File_write_ordered
Non-blocking and split-collective writes: MPI_File_iwrite, MPI_File_iwrite_all, MPI_File_iwrite_at, MPI_File_iwrite_at_all, MPI_File_iwrite_shared, MPI_File_write_all_begin, MPI_File_write_all_end, MPI_File_write_at_all_begin, MPI_File_write_at_all_end, MPI_File_write_ordered_begin, MPI_File_write_ordered_end
19
References
Overview of MPI IO: .../cs598-s16/lectures/lecture32.pdf
.../latest/
[1] http://www.mcs.anl.gov/~thakur/papers/mpi-io-noncontig.pdf
.../13601/900558/MPI-IO-Final.pdf/eea9d7d3-4b81-471c-b e35d
.../hpc/MPI+IO
Ref to MPI-IO Commands
Different MPI Performance Levels
Parallel IO, MPI IO, HDF5, T3PIO, and Strategies Wiki Page
21
MPI IO Levels Read/Write Performance