Implementation and Evaluation of MPI Nonblocking Collective I/O
Sangmin Seo, Robert Latham, Junchao Zhang, Pavan Balaji
Argonne National Laboratory
{sseo, robl, jczhang, ...
PPMM 2015, May 4, 2015
File I/O in HPC
File I/O becomes more important as many HPC applications deal with ever-larger datasets
The well-known gap between CPU speed and storage bandwidth calls for new strategies for managing I/O demands
[Figure: CPU performance vs. HDD performance over time, illustrating the growing CPU-storage gap]
MPI I/O
Supports parallel I/O operations
Has been part of the MPI standard since MPI 2.0
Many I/O optimizations have been proposed to improve I/O performance and to help application developers optimize their I/O use cases
– Blocking individual I/O
– Nonblocking individual I/O
– Collective I/O
– Restrictive nonblocking collective I/O
Missing part?
– General nonblocking collective (NBC) I/O
  Proposed for the upcoming MPI 3.1 standard
This paper presents our initial work on the implementation of the MPI NBC I/O operations
Outline
Background and motivation
Nonblocking collective (NBC) I/O operations
Implementation of NBC I/O operations
– Collective I/O in ROMIO
– State machine-based implementation
Evaluation
Conclusions and future work
Split Collective I/O
The current MPI standard provides split collective I/O routines to support NBC I/O
A single collective operation is divided into two parts
– A begin routine and an end routine
– For example, MPI_File_read_all = MPI_File_read_all_begin + MPI_File_read_all_end
At most one split collective operation can be active on each file handle at any time
– The user has to wait until the preceding operation has completed (see the sketch below)
[Figure: two MPI_File_read_all_begin/MPI_File_read_all_end pairs on the same file handle; the second begin must wait until the first end completes]
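For concreteness, a minimal sketch of how the split collective interface is used, assuming a file handle already opened with MPI_File_open; the buffer and count are illustrative:

  #include <mpi.h>

  #define COUNT 1024

  void read_with_split_collective(MPI_File fh)
  {
      int buf[COUNT];
      MPI_Status status;

      /* Begin the collective read: only the begin/end pair exists,
       * no MPI_Request is returned, so MPI_Test cannot be used. */
      MPI_File_read_all_begin(fh, buf, COUNT, MPI_INT);

      /* ... at most one split collective operation may be active on fh,
       * so a second MPI_File_read_all_begin would have to wait ... */

      /* End the collective read; blocks until the data is available. */
      MPI_File_read_all_end(fh, buf, &status);
  }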
Another Limitation of Split Collective I/O
MPI_Request is not used in split collective I/O routines
– MPI_Test cannot be used
– May be difficult to implement efficiently if collective I/O algorithms require more than two steps
Example: ROMIO
– A widely used MPI I/O implementation
– Does not provide a true immediate-return implementation of the split collective I/O routines
– Performs all I/O in the "begin" step and only a small amount of bookkeeping in the "end" step
– Cannot overlap computation and split collective I/O operations
[Figure: MPI_File_read_all_begin followed by computation and MPI_File_read_all_end; can I/O and computation overlap?]
NBC I/O Proposal for the MPI 3.1 Standard
The upcoming MPI 3.1 standard will include immediate (nonblocking) versions of the collective I/O operations for individual file pointers and explicit offsets (a usage sketch follows this slide)
– MPI_File_iread_all(..., MPI_Request *req)
– MPI_File_iwrite_all(..., MPI_Request *req)
– MPI_File_iread_at_all(..., MPI_Request *req)
– MPI_File_iwrite_at_all(..., MPI_Request *req)
These will replace the current split collective I/O routines
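For concreteness, a minimal usage sketch of one of the proposed routines (the explicit-offset collective write), assuming an MPI library that provides the MPI 3.1 routines; the file layout, buffer, and names are illustrative and not from the slides:

  #include <mpi.h>

  #define COUNT 1024

  void write_with_nbc_io(MPI_Comm comm, const char *filename)
  {
      MPI_File fh;
      MPI_Request req;
      int rank, buf[COUNT];

      MPI_Comm_rank(comm, &rank);
      for (int i = 0; i < COUNT; i++) buf[i] = rank;

      MPI_File_open(comm, filename, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                    MPI_INFO_NULL, &fh);

      /* Explicit-offset variant: each rank writes its own contiguous block.
       * The call returns immediately with a request handle. */
      MPI_Offset offset = (MPI_Offset)rank * COUNT * sizeof(int);
      MPI_File_iwrite_at_all(fh, offset, buf, COUNT, MPI_INT, &req);

      /* ... other work can be done here ... */

      MPI_Wait(&req, MPI_STATUS_IGNORE);
      MPI_File_close(&fh);
  }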
Implications for Applications
Provide the benefits of both collective I/O operations and nonblocking operations
– Optimized performance of collective I/O
– Overlapping of I/O and computation (MPI_File_iread_all followed by computation and MPI_Test/MPI_Wait)
Enable different collective I/O operations to be overlapped
– Multiple MPI_File_iread_all posts completed with a single MPI_Waitall
Collective I/O in ROMIO
Implemented using a generalized version of the extended two-phase method
– Any noncontiguous I/O request can be handled
Two-phase I/O method
– Basically splits a collective I/O operation into two phases
  Example of the write operation:
  In the first phase, each process sends its noncontiguous data to the other processes so that each process can rearrange the data for a large contiguous region of the file
  In the second phase, each process writes a large contiguous region of the file with the collected data
– Combines a large number of noncontiguous requests into a small number of contiguous I/O operations
  Can improve performance (see the sketch after the example on the next two slides)
Example: Collective File Write in ROMIO
If we handle the requests of all processes independently
– Each process needs three individual write operations
[Figure: P0 holds data A, B, C to be written to file blocks 1, 4, 7; P1 holds D, E, F for blocks 2, 5, 8; P2 holds G, H, I for blocks 3, 6, 9]
Example: Collective File Write in ROMIO (cont'd)
[Figure: with two-phase I/O, the processes first exchange data: P0 sends B to P1 and C to P2 and receives D from P1 and G from P2; P1 sends D to P0 and F to P2 and receives B from P0 and H from P2; P2 sends G to P0 and H to P1 and receives C from P0 and F from P1. Afterwards P0 buffers A, D, G, P1 buffers B, E, H, and P2 buffers C, F, I]
Each process can then write its buffer of three blocks to a contiguous region of the file
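The sketch below illustrates the two-phase idea for the three-process example above. It is a deliberate simplification, not ROMIO's code: it assumes exactly three processes, equal-sized blocks, and that block j of rank r is destined for file block 3*j + r, and it ignores aggregator selection, file-domain alignment, and data sieving; all identifiers are illustrative.

  #include <mpi.h>

  #define NPROCS 3
  #define BLOCK  256  /* ints per block */

  void two_phase_write(MPI_File fh, MPI_Comm comm,
                       const int local_blocks[NPROCS * BLOCK])
  {
      int rank;
      int contig_buf[NPROCS * BLOCK];
      MPI_Comm_rank(comm, &rank);

      /* Phase 1: communication. Block j of every rank belongs to the file
       * domain owned by rank j, so one all-to-all rearranges the data. */
      MPI_Alltoall(local_blocks, BLOCK, MPI_INT,
                   contig_buf, BLOCK, MPI_INT, comm);

      /* Phase 2: I/O. Each rank now holds one contiguous file region and
       * issues a single large write instead of three small ones. */
      MPI_Offset offset = (MPI_Offset)rank * NPROCS * BLOCK * sizeof(int);
      MPI_File_write_at_all(fh, offset, contig_buf, NPROCS * BLOCK,
                            MPI_INT, MPI_STATUS_IGNORE);
  }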
Implementation of NBC I/O Operations
Use the same general algorithm as the blocking collective I/O operations in ROMIO
Replace all blocking communication and I/O operations with their nonblocking counterparts
– Use request handles to make and keep track of progress
Divide the original routine into separate routines at each point where a blocking operation is replaced
Manage the progress of NBC I/O operations using
– The extended generalized request
– A state machine
Extended Generalized Request
Standard generalized requests (see the sketch below)
– Allow users to add new nonblocking operations to MPI while still using many pieces of the MPI infrastructure, such as request objects and the progress notification routines
– Unable to make progress through the test or wait routines
  Their progress must occur completely outside the underlying MPI implementation (typically via pthreads or signal handlers)
Extended generalized requests
– Add poll and wait routines
– Enable users to use the MPI test and wait routines to check or make progress on user-defined nonblocking operations
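For reference, a minimal sketch of the standard generalized request interface defined by the MPI standard; the callback bodies are placeholders rather than ROMIO code, and they illustrate why some external agent must drive progress and eventually call MPI_Grequest_complete:

  #include <mpi.h>

  static int query_fn(void *extra_state, MPI_Status *status)
  {
      /* Fill in the status once the user-defined operation has completed. */
      MPI_Status_set_elements(status, MPI_BYTE, 0);
      MPI_Status_set_cancelled(status, 0);
      status->MPI_SOURCE = MPI_UNDEFINED;
      status->MPI_TAG = MPI_UNDEFINED;
      return MPI_SUCCESS;
  }

  static int free_fn(void *extra_state)                 { return MPI_SUCCESS; }
  static int cancel_fn(void *extra_state, int complete) { return MPI_SUCCESS; }

  void start_user_operation(MPI_Request *req, void *state)
  {
      /* Create the request. With standard generalized requests,
       * MPI_Test/MPI_Wait only observe completion: some external agent
       * (e.g., a thread) must do the work and later call
       * MPI_Grequest_complete(*req). */
      MPI_Grequest_start(query_fn, free_fn, cancel_fn, state, req);
  }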
Using the Extended Generalized Request
Exploit the extended generalized request to manage the progress of NBC I/O operations

Blocking version:

  MPI_File_write_all(..., MPI_Status *status)
  {
      ...
      MPI_Alltoall(...);
      ...
      ADIO_WriteStrided(...);
      ...
  }

Nonblocking version:

  MPI_File_iwrite_all(..., MPI_Request *req)
  {
      MPI_Request nio_req;
      MPIX_Grequest_class_create(..., iwrite_all_poll_fn, ..., &greq_class);
      MPIX_Grequest_class_allocate(greq_class, nio_status, &nio_req);
      memcpy(req, &nio_req, sizeof(MPI_Request));
      ...
      MPI_Alltoall(...);
      ...
      ADIO_WriteStrided(...);
      ...
      MPI_Grequest_complete(nio_req);
  }
State Machine-Based Implementation (1/3)
Use the same general algorithm as the blocking collective I/O operations in ROMIO
Replace all blocking communication and I/O operations with their nonblocking counterparts
– Use request handles to make and keep track of progress

Before (blocking operations):

  MPI_File_iwrite_all(..., MPI_Request *req)
  {
      ...
      MPI_Alltoall(...);
      ...
      ADIO_WriteStrided(...);
      ...
      MPI_Grequest_complete(nio_req);
  }

After (nonblocking counterparts):

  MPI_File_iwrite_all(..., MPI_Request *req)
  {
      ...
      MPI_Request cur_req;
      ...
      MPI_Ialltoall(..., &cur_req);
      MPI_Wait(&cur_req, &status);
      ...
      ADIO_IwriteStrided(..., &cur_req);
      MPI_Wait(&cur_req, &status);
      ...
      MPI_Grequest_complete(nio_req);
  }
State Machine-Based Implementation (2/3)
Divide the original routine into separate routines at each point where a blocking operation is replaced

Before (single routine with waits):

  MPI_File_iwrite_all(..., MPI_Request *req)
  {
      ...
      MPI_Request cur_req;
      ...
      MPI_Ialltoall(..., &cur_req);
      MPI_Wait(&cur_req, &status);
      ...
      ADIO_IwriteStrided(..., &cur_req);
      MPI_Wait(&cur_req, &status);
      ...
      MPI_Grequest_complete(nio_req);
  }

After (divided into separate routines):

  MPI_File_iwrite_all(..., MPI_Request *req)
  {
      ...
      MPI_Ialltoall(..., &cur_req);
  }

  iwrite_all_fileop(...)
  {
      ...
      ADIO_IwriteStrided(..., &cur_req);
  }

  iwrite_all_fini(...)
  {
      ...
      MPI_Grequest_complete(nio_req);
  }
State Machine-Based Implementation (3/3)
Manage the progress of NBC I/O operations by using a state machine
– Implemented in the poll function of the extended generalized request (a sketch of such a poll function follows this slide)
– Each MPI_Test or MPI_Wait checks whether the current step has completed; if it has, the operation advances to the next state, otherwise it stays in the current state

  MPI_File_iwrite_all(..., MPI_Request *req)
  {
      ...
      MPI_Ialltoall(..., &cur_req);
      state = IWRITE_ALL_STATE_COMM;
  }

  iwrite_all_fileop(...)
  {
      ...
      ADIO_IwriteStrided(..., &cur_req);
      state = IWRITE_ALL_STATE_FILEOP;
  }

  iwrite_all_fini(...)
  {
      ...
      state = IWRITE_ALL_STATE_COMPLETE;
      MPI_Grequest_complete(nio_req);
  }

States: IWRITE_ALL_STATE_COMM -> IWRITE_ALL_STATE_FILEOP -> IWRITE_ALL_STATE_COMPLETE
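The following sketch shows how such a poll function might drive the state machine. It is not the actual ROMIO code: the structure holding the per-operation state, the helper routines, and the poll-function signature used with MPICH's extended generalized requests are assumptions made for illustration.

  #include <mpi.h>

  enum iwrite_all_state {
      IWRITE_ALL_STATE_COMM,
      IWRITE_ALL_STATE_FILEOP,
      IWRITE_ALL_STATE_COMPLETE
  };

  struct iwrite_all_vars {
      enum iwrite_all_state state;  /* current step of the operation */
      MPI_Request cur_req;          /* request of the outstanding step */
      MPI_Request nio_req;          /* generalized request given to the user */
  };

  void iwrite_all_fileop(struct iwrite_all_vars *v); /* starts the file I/O step */
  void iwrite_all_fini(struct iwrite_all_vars *v);   /* completes the operation  */

  /* Invoked when the user calls MPI_Test/MPI_Wait on the generalized request. */
  int iwrite_all_poll_fn(void *extra_state, MPI_Status *status)
  {
      struct iwrite_all_vars *v = extra_state;
      int flag = 0;

      /* Check whether the nonblocking operation of the current step finished. */
      if (v->state != IWRITE_ALL_STATE_COMPLETE)
          MPI_Test(&v->cur_req, &flag, MPI_STATUS_IGNORE);

      if (!flag)
          return MPI_SUCCESS;            /* not yet: stay in the current state */

      switch (v->state) {
      case IWRITE_ALL_STATE_COMM:        /* communication done: start file I/O */
          iwrite_all_fileop(v);          /* sets state = IWRITE_ALL_STATE_FILEOP */
          break;
      case IWRITE_ALL_STATE_FILEOP:      /* file I/O done: finish up */
          iwrite_all_fini(v);            /* sets state = COMPLETE and calls
                                            MPI_Grequest_complete(v->nio_req)  */
          break;
      default:
          break;
      }
      return MPI_SUCCESS;
  }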
Progress of NBC I/O Operations
The NBC I/O routine initiates the I/O operation and returns a request handle, which must be passed to a completion call
– [MPI standard] All nonblocking calls are local and return immediately, irrespective of the status of other processes
Progress of NBC I/O operations?
– Implicit or explicit, depending on the implementation
Our implementation currently requires explicit progress
– The user has to call MPI_Test or MPI_Wait
– Currently a common practice in implementing nonblocking operations
Alternative?
– Exploit progress threads to support asynchronous progress
Evaluation Methodology
Target platform
– Blues cluster at Argonne National Laboratory
  310 compute nodes + GPFS file system
– Each compute node has two Intel Xeon E processors (16 cores)
MPI implementation
– Implemented the NBC I/O routines inside ROMIO
– Integrated into MPICH 3.2a2 or later as MPIX routines
Benchmarks
– The coll_perf benchmark in the ROMIO test suite and its modifications
  To use NBC I/O operations or to overlap collective operations and computation
– A microbenchmark to overlap multiple I/O operations
I/O Bandwidth
The coll_perf benchmark
– Measures the I/O bandwidth for writing and reading a 3D block-distributed array to a file
  Array size used: 2176 x 1152 x 1408 integers (about 14 GB)
– Has a noncontiguous file access pattern
– For NBC I/O, each blocking collective I/O routine is replaced with its corresponding NBC I/O routine followed by MPI_Wait:

    MPI_File_write_all(...);

  becomes

    MPI_Request req;
    MPI_File_iwrite_all(..., &req);
    MPI_Wait(&req, ...);

Measure the I/O bandwidth of blocking collective I/O and NBC I/O
What do we expect?
– Ideally, the NBC I/O routines should incur additional overhead only from extra function calls and memory management
I/O Bandwidth (cont'd)
[Figure: measured write and read bandwidth of blocking collective I/O vs. NBC I/O]
Our NBC I/O implementation does not cause significant overhead!
Overlapping I/O and Computation
Insert some synthetic computation code into coll_perf

Blocking I/O with computation:

  MPI_File_write_all(...);
  Computation();

NBC I/O with computation:

  MPI_File_iwrite_all(..., &req);
  for (...) {
      Small_Computation();
      MPI_Test(&req, &flag, ...);
      if (flag) break;
  }
  Remaining_Computation();
  MPI_Wait(&req, ...);

Why not simply the following?

  MPI_File_iwrite_all(..., &req);
  Computation();
  MPI_Wait(&req, ...);

Because we need to make progress explicitly (by calling MPI_Test periodically)
Overlapping I/O and Computation (cont'd)
84% of write time and 83% of read time are overlapped, respectively.
The entire execution time is reduced by 36% for write and 34% for read.
Overlapping Multiple I/O Operations
Initiate multiple collective I/O operations at a time and wait for the completion of all posted operations (a sketch follows this slide)
Multiple collective I/O operations can be overlapped by using NBC I/O routines!
[Figure: execution time of blocking vs. NBC I/O for multiple operations; reductions of 59% and 13% are observed]
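A minimal sketch of this pattern, assuming several files have already been opened collectively; the number of files, buffers, and offsets are illustrative:

  #include <mpi.h>

  #define NFILES 4
  #define COUNT  (1 << 20)

  void write_many(MPI_File fh[NFILES], MPI_Offset offset, int *buf[NFILES])
  {
      MPI_Request reqs[NFILES];

      /* Post all collective writes at once; each call returns immediately. */
      for (int i = 0; i < NFILES; i++)
          MPI_File_iwrite_at_all(fh[i], offset, buf[i], COUNT, MPI_INT,
                                 &reqs[i]);

      /* Waiting on all requests lets the operations progress in an
       * interleaved fashion instead of one after another. */
      MPI_Waitall(NFILES, reqs, MPI_STATUSES_IGNORE);
  }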
Conclusions and Future Work
MPI NBC I/O operations
– Can take advantage of both nonblocking operations and collective operations
– Will be part of the upcoming MPI 3.1 standard
Initial work on the implementation of MPI NBC I/O operations
– Done in the MPICH MPI library
– Based on the extended two-phase algorithm
– Utilizes a state machine and the extended generalized request
– Performs as well as blocking collective I/O in terms of I/O bandwidth
– Capable of overlapping I/O and other operations
– Can help users try nonblocking collective I/O operations in their applications
Future work
– Asynchronous progress of NBC I/O operations
  To overcome the shortcomings of the explicit progress requirement
– Study with real applications
– Comparison with other approaches
Acknowledgment
This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under Contract DE-AC02-06CH. We gratefully acknowledge the computing resources provided on Blues, a high-performance computing cluster operated by the Laboratory Computing Resource Center at Argonne National Laboratory.
Q&A
Thank you for your attention!
Questions?
Related Work
NBC I/O implementations
– Open MPI I/O library using the libNBC library [Venkatesan et al., EuroMPI '11]
  Leverages the concept of a collective operation schedule in libNBC
  Requires modification of the progress engine of libNBC
– Our implementation
  Exploits a state machine and the extended generalized request
  Does not need to modify the progress engine, provided the extended generalized request interface is available
  We plan to compare the performance and efficiency of the two implementations
Collective I/O research
– The two-phase method and its extensions
  Have been studied by many researchers and are widely used in collective I/O implementations
  Our work is based on [Thakur et al., Frontiers '99]
– View-based collective I/O [Blas et al., CCGrid '08]
– MPI collective I/O implementation as a research platform [Coloma et al., Cluster '06]
– Collective I/O library with POSIX-like interfaces [Yu et al., IPDPS '13]