Implementation and Evaluation of MPI Nonblocking Collective I/O
Sangmin Seo, Robert Latham, Junchao Zhang, Pavan Balaji
Argonne National Laboratory
{sseo, robl, jczhang, ...
PPMM 2015, May 4, 2015
File I/O in HPC
File I/O becomes more important as many HPC applications deal with ever-larger datasets
The well-known gap between CPU speed and storage bandwidth calls for new strategies for managing I/O demands
[Figure: CPU performance vs. HDD performance over time, illustrating the growing CPU-storage gap]
MPI I/O
Supports parallel I/O operations
Has been part of the MPI standard since MPI 2.0
Many I/O optimizations have been proposed to improve I/O performance and to help application developers optimize their I/O use cases
– Blocking individual I/O
– Nonblocking individual I/O
– Collective I/O
– Restrictive nonblocking collective I/O
Missing part?
– General nonblocking collective (NBC) I/O
  Proposed for the upcoming MPI 3.1 standard
This paper presents our initial work on the implementation of the MPI NBC I/O operations
Outline
Background and motivation
Nonblocking collective (NBC) I/O operations
Implementation of NBC I/O operations
– Collective I/O in ROMIO
– State machine-based implementation
Evaluation
Conclusions and future work
Split Collective I/O
The current MPI standard provides split collective I/O routines to support NBC I/O
A single collective operation is divided into two parts
– A begin routine and an end routine
– For example, MPI_File_read_all = MPI_File_read_all_begin + MPI_File_read_all_end
At most one split collective operation can be active on each file handle at any time
– The user has to wait until the preceding operation has completed (see the sketch below)
[Figure: two MPI_File_read_all_begin/MPI_File_read_all_end pairs on the same file handle; the second begin must wait until the first end completes]
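For concreteness, a minimal sketch of how the split collective interface is used, assuming a file handle already opened with MPI_File_open; the buffer and count are illustrative:

  #include <mpi.h>

  #define COUNT 1024

  void read_with_split_collective(MPI_File fh)
  {
      int buf[COUNT];
      MPI_Status status;

      /* Begin the collective read: only the begin/end pair exists,
       * no MPI_Request is returned, so MPI_Test cannot be used. */
      MPI_File_read_all_begin(fh, buf, COUNT, MPI_INT);

      /* ... at most one split collective operation may be active on fh,
       * so a second MPI_File_read_all_begin would have to wait ... */

      /* End the collective read; blocks until the data is available. */
      MPI_File_read_all_end(fh, buf, &status);
  }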
Another Limitation of Split Collective I/O
MPI_Request is not used in split collective I/O routines
– MPI_Test cannot be used
– May be difficult to implement efficiently if collective I/O algorithms require more than two steps
Example: ROMIO
– A widely used MPI I/O implementation
– Does not provide a true immediate-return implementation of the split collective I/O routines
– Performs all I/O in the "begin" step and only a small amount of bookkeeping in the "end" step
– Cannot overlap computation and split collective I/O operations
[Figure: MPI_File_read_all_begin followed by computation and MPI_File_read_all_end; can I/O and computation overlap?]
NBC I/O Proposal for the MPI 3.1 Standard
The upcoming MPI 3.1 standard will include immediate (nonblocking) versions of the collective I/O operations for individual file pointers and explicit offsets (a usage sketch follows this slide)
– MPI_File_iread_all(..., MPI_Request *req)
– MPI_File_iwrite_all(..., MPI_Request *req)
– MPI_File_iread_at_all(..., MPI_Request *req)
– MPI_File_iwrite_at_all(..., MPI_Request *req)
These will replace the current split collective I/O routines
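For concreteness, a minimal usage sketch of one of the proposed routines (the explicit-offset collective write), assuming an MPI library that provides the MPI 3.1 routines; the file layout, buffer, and names are illustrative and not from the slides:

  #include <mpi.h>

  #define COUNT 1024

  void write_with_nbc_io(MPI_Comm comm, const char *filename)
  {
      MPI_File fh;
      MPI_Request req;
      int rank, buf[COUNT];

      MPI_Comm_rank(comm, &rank);
      for (int i = 0; i < COUNT; i++) buf[i] = rank;

      MPI_File_open(comm, filename, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                    MPI_INFO_NULL, &fh);

      /* Explicit-offset variant: each rank writes its own contiguous block.
       * The call returns immediately with a request handle. */
      MPI_Offset offset = (MPI_Offset)rank * COUNT * sizeof(int);
      MPI_File_iwrite_at_all(fh, offset, buf, COUNT, MPI_INT, &req);

      /* ... other work can be done here ... */

      MPI_Wait(&req, MPI_STATUS_IGNORE);
      MPI_File_close(&fh);
  }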
Implications for Applications
Provide the benefits of both collective I/O operations and nonblocking operations
– Optimized performance of collective I/O
– Overlapping of I/O and computation (MPI_File_iread_all followed by computation and MPI_Test/MPI_Wait)
Enable different collective I/O operations to be overlapped
– Multiple MPI_File_iread_all posts completed with a single MPI_Waitall
Collective I/O in ROMIO
Implemented using a generalized version of the extended two-phase method
– Any noncontiguous I/O request can be handled
Two-phase I/O method
– Basically splits a collective I/O operation into two phases
  Example of the write operation:
  In the first phase, each process sends its noncontiguous data to the other processes so that each process can rearrange the data for a large contiguous region of the file
  In the second phase, each process writes a large contiguous region of the file with the collected data
– Combines a large number of noncontiguous requests into a small number of contiguous I/O operations
  Can improve performance (see the sketch after the example on the next two slides)
Example: Collective File Write in ROMIO
If we handle the requests of all processes independently
– Each process needs three individual write operations
[Figure: P0 holds data A, B, C to be written to file blocks 1, 4, 7; P1 holds D, E, F for blocks 2, 5, 8; P2 holds G, H, I for blocks 3, 6, 9]
Example: Collective File Write in ROMIO (cont'd)
[Figure: with two-phase I/O, the processes first exchange data: P0 sends B to P1 and C to P2 and receives D from P1 and G from P2; P1 sends D to P0 and F to P2 and receives B from P0 and H from P2; P2 sends G to P0 and H to P1 and receives C from P0 and F from P1. Afterwards P0 buffers A, D, G, P1 buffers B, E, H, and P2 buffers C, F, I]
Each process can then write its buffer of three blocks to a contiguous region of the file
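The sketch below illustrates the two-phase idea for the three-process example above. It is a deliberate simplification, not ROMIO's code: it assumes exactly three processes, equal-sized blocks, and that block j of rank r is destined for file block 3*j + r, and it ignores aggregator selection, file-domain alignment, and data sieving; all identifiers are illustrative.

  #include <mpi.h>

  #define NPROCS 3
  #define BLOCK  256  /* ints per block */

  void two_phase_write(MPI_File fh, MPI_Comm comm,
                       const int local_blocks[NPROCS * BLOCK])
  {
      int rank;
      int contig_buf[NPROCS * BLOCK];
      MPI_Comm_rank(comm, &rank);

      /* Phase 1: communication. Block j of every rank belongs to the file
       * domain owned by rank j, so one all-to-all rearranges the data. */
      MPI_Alltoall(local_blocks, BLOCK, MPI_INT,
                   contig_buf, BLOCK, MPI_INT, comm);

      /* Phase 2: I/O. Each rank now holds one contiguous file region and
       * issues a single large write instead of three small ones. */
      MPI_Offset offset = (MPI_Offset)rank * NPROCS * BLOCK * sizeof(int);
      MPI_File_write_at_all(fh, offset, contig_buf, NPROCS * BLOCK,
                            MPI_INT, MPI_STATUS_IGNORE);
  }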
Implementation of NBC I/O Operations
Use the same general algorithm as the blocking collective I/O operations in ROMIO
Replace all blocking communication and I/O operations with their nonblocking counterparts
– Use request handles to make and keep track of progress
Divide the original routine into separate routines at each point where a blocking operation is replaced
Manage the progress of NBC I/O operations using
– The extended generalized request
– A state machine
Extended Generalized Request
Standard generalized requests (see the sketch below)
– Allow users to add new nonblocking operations to MPI while still using many pieces of the MPI infrastructure, such as request objects and the progress notification routines
– Unable to make progress through the test or wait routines
  Their progress must occur completely outside the underlying MPI implementation (typically via pthreads or signal handlers)
Extended generalized requests
– Add poll and wait routines
– Enable users to use the MPI test and wait routines to check or make progress on user-defined nonblocking operations
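For reference, a minimal sketch of the standard generalized request interface defined by the MPI standard; the callback bodies are placeholders rather than ROMIO code, and they illustrate why some external agent must drive progress and eventually call MPI_Grequest_complete:

  #include <mpi.h>

  static int query_fn(void *extra_state, MPI_Status *status)
  {
      /* Fill in the status once the user-defined operation has completed. */
      MPI_Status_set_elements(status, MPI_BYTE, 0);
      MPI_Status_set_cancelled(status, 0);
      status->MPI_SOURCE = MPI_UNDEFINED;
      status->MPI_TAG = MPI_UNDEFINED;
      return MPI_SUCCESS;
  }

  static int free_fn(void *extra_state)                 { return MPI_SUCCESS; }
  static int cancel_fn(void *extra_state, int complete) { return MPI_SUCCESS; }

  void start_user_operation(MPI_Request *req, void *state)
  {
      /* Create the request. With standard generalized requests,
       * MPI_Test/MPI_Wait only observe completion: some external agent
       * (e.g., a thread) must do the work and later call
       * MPI_Grequest_complete(*req). */
      MPI_Grequest_start(query_fn, free_fn, cancel_fn, state, req);
  }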
Using the Extended Generalized Request
Exploit the extended generalized request to manage the progress of NBC I/O operations

Blocking version:

  MPI_File_write_all(..., MPI_Status *status)
  {
      ...
      MPI_Alltoall(...);
      ...
      ADIO_WriteStrided(...);
      ...
  }

Nonblocking version:

  MPI_File_iwrite_all(..., MPI_Request *req)
  {
      MPI_Request nio_req;
      MPIX_Grequest_class_create(..., iwrite_all_poll_fn, ..., &greq_class);
      MPIX_Grequest_class_allocate(greq_class, nio_status, &nio_req);
      memcpy(req, &nio_req, sizeof(MPI_Request));
      ...
      MPI_Alltoall(...);
      ...
      ADIO_WriteStrided(...);
      ...
      MPI_Grequest_complete(nio_req);
  }
State Machine-Based Implementation (1/3)
Use the same general algorithm as the blocking collective I/O operations in ROMIO
Replace all blocking communication and I/O operations with their nonblocking counterparts
– Use request handles to make and keep track of progress

Before (blocking operations):

  MPI_File_iwrite_all(..., MPI_Request *req)
  {
      ...
      MPI_Alltoall(...);
      ...
      ADIO_WriteStrided(...);
      ...
      MPI_Grequest_complete(nio_req);
  }

After (nonblocking counterparts):

  MPI_File_iwrite_all(..., MPI_Request *req)
  {
      ...
      MPI_Request cur_req;
      ...
      MPI_Ialltoall(..., &cur_req);
      MPI_Wait(&cur_req, &status);
      ...
      ADIO_IwriteStrided(..., &cur_req);
      MPI_Wait(&cur_req, &status);
      ...
      MPI_Grequest_complete(nio_req);
  }
State Machine-Based Implementation (2/3)
Divide the original routine into separate routines at each point where a blocking operation is replaced

Before (single routine with waits):

  MPI_File_iwrite_all(..., MPI_Request *req)
  {
      ...
      MPI_Request cur_req;
      ...
      MPI_Ialltoall(..., &cur_req);
      MPI_Wait(&cur_req, &status);
      ...
      ADIO_IwriteStrided(..., &cur_req);
      MPI_Wait(&cur_req, &status);
      ...
      MPI_Grequest_complete(nio_req);
  }

After (divided into separate routines):

  MPI_File_iwrite_all(..., MPI_Request *req)
  {
      ...
      MPI_Ialltoall(..., &cur_req);
  }

  iwrite_all_fileop(...)
  {
      ...
      ADIO_IwriteStrided(..., &cur_req);
  }

  iwrite_all_fini(...)
  {
      ...
      MPI_Grequest_complete(nio_req);
  }
State Machine-Based Implementation (3/3)
Manage the progress of NBC I/O operations by using a state machine
– Implemented in the poll function of the extended generalized request (a sketch of such a poll function follows this slide)
– Each MPI_Test or MPI_Wait checks whether the current step has completed; if it has, the operation advances to the next state, otherwise it stays in the current state

  MPI_File_iwrite_all(..., MPI_Request *req)
  {
      ...
      MPI_Ialltoall(..., &cur_req);
      state = IWRITE_ALL_STATE_COMM;
  }

  iwrite_all_fileop(...)
  {
      ...
      ADIO_IwriteStrided(..., &cur_req);
      state = IWRITE_ALL_STATE_FILEOP;
  }

  iwrite_all_fini(...)
  {
      ...
      state = IWRITE_ALL_STATE_COMPLETE;
      MPI_Grequest_complete(nio_req);
  }

States: IWRITE_ALL_STATE_COMM -> IWRITE_ALL_STATE_FILEOP -> IWRITE_ALL_STATE_COMPLETE
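The following sketch shows how such a poll function might drive the state machine. It is not the actual ROMIO code: the structure holding the per-operation state, the helper routines, and the poll-function signature used with MPICH's extended generalized requests are assumptions made for illustration.

  #include <mpi.h>

  enum iwrite_all_state {
      IWRITE_ALL_STATE_COMM,
      IWRITE_ALL_STATE_FILEOP,
      IWRITE_ALL_STATE_COMPLETE
  };

  struct iwrite_all_vars {
      enum iwrite_all_state state;  /* current step of the operation */
      MPI_Request cur_req;          /* request of the outstanding step */
      MPI_Request nio_req;          /* generalized request given to the user */
  };

  void iwrite_all_fileop(struct iwrite_all_vars *v); /* starts the file I/O step */
  void iwrite_all_fini(struct iwrite_all_vars *v);   /* completes the operation  */

  /* Invoked when the user calls MPI_Test/MPI_Wait on the generalized request. */
  int iwrite_all_poll_fn(void *extra_state, MPI_Status *status)
  {
      struct iwrite_all_vars *v = extra_state;
      int flag = 0;

      /* Check whether the nonblocking operation of the current step finished. */
      if (v->state != IWRITE_ALL_STATE_COMPLETE)
          MPI_Test(&v->cur_req, &flag, MPI_STATUS_IGNORE);

      if (!flag)
          return MPI_SUCCESS;            /* not yet: stay in the current state */

      switch (v->state) {
      case IWRITE_ALL_STATE_COMM:        /* communication done: start file I/O */
          iwrite_all_fileop(v);          /* sets state = IWRITE_ALL_STATE_FILEOP */
          break;
      case IWRITE_ALL_STATE_FILEOP:      /* file I/O done: finish up */
          iwrite_all_fini(v);            /* sets state = COMPLETE and calls
                                            MPI_Grequest_complete(v->nio_req)  */
          break;
      default:
          break;
      }
      return MPI_SUCCESS;
  }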
Progress of NBC I/O Operations
The NBC I/O routine initiates the I/O operation and returns a request handle, which must be passed to a completion call
– [MPI standard] All nonblocking calls are local and return immediately, irrespective of the status of other processes
Progress of NBC I/O operations?
– Implicit or explicit, depending on the implementation
Our implementation currently requires explicit progress
– The user has to call MPI_Test or MPI_Wait
– Currently a common practice in implementing nonblocking operations
Alternative?
– Exploit progress threads to support asynchronous progress
Evaluation Methodology
Target platform
– Blues cluster at Argonne National Laboratory
  310 compute nodes + GPFS file system
– Each compute node has two Intel Xeon E processors (16 cores)
MPI implementation
– Implemented the NBC I/O routines inside ROMIO
– Integrated into MPICH 3.2a2 or later as MPIX routines
Benchmarks
– The coll_perf benchmark in the ROMIO test suite and its modifications
  To use NBC I/O operations or to overlap collective operations and computation
– A microbenchmark to overlap multiple I/O operations
I/O Bandwidth
The coll_perf benchmark
– Measures the I/O bandwidth for writing and reading a 3D block-distributed array to a file
  Array size used: 2176 x 1152 x 1408 integers (about 14 GB)
– Has a noncontiguous file access pattern
– For NBC I/O, each blocking collective I/O routine is replaced with its corresponding NBC I/O routine followed by MPI_Wait:

    MPI_File_write_all(...);

  becomes

    MPI_Request req;
    MPI_File_iwrite_all(..., &req);
    MPI_Wait(&req, ...);

Measure the I/O bandwidth of blocking collective I/O and NBC I/O
What do we expect?
– Ideally, the NBC I/O routines should incur additional overhead only from extra function calls and memory management
I/O Bandwidth (cont'd)
[Figure: measured write and read bandwidth of blocking collective I/O vs. NBC I/O]
Our NBC I/O implementation does not cause significant overhead!
Overlapping I/O and Computation
Insert some synthetic computation code into coll_perf

Blocking I/O with computation:

  MPI_File_write_all(...);
  Computation();

NBC I/O with computation:

  MPI_File_iwrite_all(..., &req);
  for (...) {
      Small_Computation();
      MPI_Test(&req, &flag, ...);
      if (flag) break;
  }
  Remaining_Computation();
  MPI_Wait(&req, ...);

Why not simply the following?

  MPI_File_iwrite_all(..., &req);
  Computation();
  MPI_Wait(&req, ...);

Because we need to make progress explicitly (by calling MPI_Test periodically)
Overlapping I/O and Computation (cont'd)
84% of write time and 83% of read time are overlapped, respectively.
The entire execution time is reduced by 36% for write and 34% for read.
Overlapping Multiple I/O Operations
Initiate multiple collective I/O operations at a time and wait for the completion of all posted operations (a sketch follows this slide)
Multiple collective I/O operations can be overlapped by using NBC I/O routines!
[Figure: execution time of blocking vs. NBC I/O for multiple operations; reductions of 59% and 13% are observed]
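A minimal sketch of this pattern, assuming several files have already been opened collectively; the number of files, buffers, and offsets are illustrative:

  #include <mpi.h>

  #define NFILES 4
  #define COUNT  (1 << 20)

  void write_many(MPI_File fh[NFILES], MPI_Offset offset, int *buf[NFILES])
  {
      MPI_Request reqs[NFILES];

      /* Post all collective writes at once; each call returns immediately. */
      for (int i = 0; i < NFILES; i++)
          MPI_File_iwrite_at_all(fh[i], offset, buf[i], COUNT, MPI_INT,
                                 &reqs[i]);

      /* Waiting on all requests lets the operations progress in an
       * interleaved fashion instead of one after another. */
      MPI_Waitall(NFILES, reqs, MPI_STATUSES_IGNORE);
  }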
Conclusions and Future Work
MPI NBC I/O operations
– Can take advantage of both nonblocking operations and collective operations
– Will be part of the upcoming MPI 3.1 standard
Initial work on the implementation of MPI NBC I/O operations
– Done in the MPICH MPI library
– Based on the extended two-phase algorithm
– Utilizes a state machine and the extended generalized request
– Performs as well as blocking collective I/O in terms of I/O bandwidth
– Capable of overlapping I/O and other operations
– Can help users try nonblocking collective I/O operations in their applications
Future work
– Asynchronous progress of NBC I/O operations
  To overcome the shortcomings of the explicit progress requirement
– Study with real applications
– Comparison with other approaches
Acknowledgment
This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under Contract DE-AC02-06CH. We gratefully acknowledge the computing resources provided on Blues, a high-performance computing cluster operated by the Laboratory Computing Resource Center at Argonne National Laboratory.
Q&A
Thank you for your attention!
Questions?
Related Work
NBC I/O implementations
– Open MPI I/O library using the libNBC library [Venkatesan et al., EuroMPI '11]
  Leverages the concept of a collective operation schedule in libNBC
  Requires modification of the progress engine of libNBC
– Our implementation
  Exploits a state machine and the extended generalized request
  Does not need to modify the progress engine, provided the extended generalized request interface is available
  We plan to compare the performance and efficiency of the two implementations
Collective I/O research
– The two-phase method and its extensions
  Have been studied by many researchers and are widely used in collective I/O implementations
  Our work is based on [Thakur et al., Frontiers '99]
– View-based collective I/O [Blas et al., CCGrid '08]
– MPI collective I/O implementation as a research platform [Coloma et al., Cluster '06]
– Collective I/O library with POSIX-like interfaces [Yu et al., IPDPS '13]