Parallel I/O Optimizations
Sources/Credits:
R. Thakur, W. Gropp, E. Lusk. A Case for Using MPI's Derived Datatypes to Improve I/O Performance. Supercomputing 98.
Xiaosong Ma, Marianne Winslett, Jonghyun Lee, and Shengke Yu. Improving MPI IO output performance with active buffering plus threads. In Proceedings of the International Parallel and Distributed Processing Symposium. IEEE Computer Society Press, April 2003.
High Performance with Derived Data Types (Thakur et al.: SC 98)
The potential of parallel file systems is not fully utilized because of applications' I/O access patterns:
a. Applications make many small requests to noncontiguous blocks
b. Most parallel file systems are optimized for access to a single large chunk
Hence the motivation for describing the whole access with derived data types and making a single call
ROMIO (MPICH's I/O) performs 2 optimizations: data sieving and collective I/O
Datatype Constructors in MPI
1. contiguous
2. vector/hvector
3. indexed/hindexed/indexed_block
4. struct
5. subarray
6. darray
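As an illustration of what such a constructor describes, the sketch below flattens a vector-style layout (count blocks of blocklength elements, block starts separated by stride elements) into a list of (byte offset, byte length) pieces. It is a plain Python stand-in, not the MPI API; the function name flatten_vector and the elem_size and base parameters are assumptions made for this example.

```python
def flatten_vector(count, blocklength, stride, elem_size, base=0):
    """Return the (byte_offset, byte_length) pieces of a vector-style layout.

    count       -- number of blocks
    blocklength -- elements per block
    stride      -- elements between the starts of consecutive blocks
    elem_size   -- bytes per element (hypothetical parameter for this sketch)
    base        -- byte offset of the first block
    """
    return [(base + i * stride * elem_size, blocklength * elem_size)
            for i in range(count)]


# One column of a 4 x 6 row-major matrix of 8-byte values:
# 4 blocks of 1 element each, 6 elements apart.
column = flatten_vector(count=4, blocklength=1, stride=6, elem_size=8)
print(column)
```

Each piece is small and the pieces are far apart, which is exactly the access pattern the optimizations on the following slides are meant to handle.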
Different levels of access
Optimizations in ROMIO for derived-datatype noncontiguous access
1. Data sieving
   Make a few large, contiguous requests to the file system even if the user's request consists of several small, noncontiguous pieces.
   Extract (pick out) in memory the data that is really needed.
   This is fine for reads. What about writes?
   Writes need read-modify-write along with locking: read the contiguous region into a buffer, modify the requested pieces, and write the whole region back, holding a lock so that concurrent updates to the gaps are not lost.
   Use a smaller buffer for writing with data sieving than for reading with data sieving. Why?
   The larger the write buffer, the greater the contention among processes for locks.
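A minimal sketch of data sieving, assuming the file is any seekable binary file object and that it already covers the whole requested range. The names sieve_read and sieve_write are hypothetical; this is not ROMIO's code, and the locking step is only marked by a comment.

```python
import io


def sieve_read(f, pieces):
    """Read noncontiguous (offset, length) pieces with one contiguous request."""
    start = min(off for off, _ in pieces)
    end = max(off + n for off, n in pieces)
    f.seek(start)
    buf = f.read(end - start)          # one large request instead of many small ones
    # Pick out in memory only the bytes that were actually asked for.
    return [buf[off - start:off - start + n] for off, n in pieces]


def sieve_write(f, pieces, chunks):
    """Write noncontiguous pieces using read-modify-write on one region."""
    start = min(off for off, _ in pieces)
    end = max(off + n for off, n in pieces)
    # A real implementation would lock the byte range [start, end) here.
    f.seek(start)
    buf = bytearray(f.read(end - start))              # read
    for (off, n), chunk in zip(pieces, chunks):
        buf[off - start:off - start + n] = chunk      # modify (len(chunk) == n)
    f.seek(start)
    f.write(buf)                                      # write back, gaps included


f = io.BytesIO(bytes(range(32)))
print(sieve_read(f, [(2, 2), (10, 3), (20, 1)]))
sieve_write(f, [(2, 2), (10, 3)], [b"AB", b"XYZ"])
print(f.getvalue()[:16])
```

The write path rewrites the gaps between pieces as well, which is why the region must be locked and why a larger write buffer increases lock contention.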
Optimizations in ROMIO for derived-datatype noncontiguous access
1. Data sieving
2. Collective I/O
   During collective I/O functions, the implementation can analyze and merge the requests of different processes.
   The merged request can be large and contiguous even though the individual requests were noncontiguous.
   Perform I/O in 2 phases:
      I/O phase: processes perform I/O for the merged request. Some of the data a process accesses may belong to other processes. If the merged request is not contiguous, use data sieving.
      Communication phase: processes redistribute data to obtain the desired distribution.
   The additional cost of the communication phase can be offset by the performance gain from contiguous access.
   Data sieving and collective I/O also help improve caching and prefetching in the underlying file system.
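The two phases can be simulated in a single Python process. In this sketch the "file" is a bytes object, requests[p] holds the (offset, length) pieces wanted by process p, and the merged range is split evenly into one contiguous file domain per process. The even split and the name two_phase_read are assumptions made for illustration; real implementations choose file domains and exchange data with message passing.

```python
def two_phase_read(file_bytes, requests):
    """Simulate a two-phase collective read; returns the pieces per process."""
    nprocs = len(requests)
    all_pieces = [piece for req in requests for piece in req]
    lo = min(off for off, _ in all_pieces)
    hi = max(off + n for off, n in all_pieces)

    # Split the merged range into one contiguous file domain per process.
    chunk = -(-(hi - lo) // nprocs)                     # ceiling division
    domains = [(lo + p * chunk, min(lo + (p + 1) * chunk, hi))
               for p in range(nprocs)]

    # I/O phase: each process reads its own contiguous domain, which may
    # contain data that belongs to other processes.
    staged = [file_bytes[a:b] for a, b in domains]

    # Communication phase: redistribute so each process gets what it asked for.
    results = []
    for req in requests:
        got = []
        for off, n in req:
            piece = bytearray()
            for (a, b), buf in zip(domains, staged):
                s, e = max(off, a), min(off + n, b)
                if s < e:                               # piece overlaps this domain
                    piece += buf[s - a:e - a]
            got.append(bytes(piece))
        results.append(got)
    return results


data = bytes(range(16))
# P0 and P1 want interleaved 2-byte blocks of the same file.
requests = [[(0, 2), (4, 2), (8, 2), (12, 2)],
            [(2, 2), (6, 2), (10, 2), (14, 2)]]
print(two_phase_read(data, requests))
```

Each simulated process issues one contiguous read of 8 bytes instead of four separate 2-byte reads; the price is the redistribution step.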
Collective I/O Illustration
[Figure: data blocks interleaved between processes P0 and P1]
Active Buffering with Threads (Xiaosong Ma et al.: IPDPS 2003)
The optimizations above are not enough on their own.
Active buffering: use of separate I/O nodes
Threads overlap I/O access with computation
Buffer space is automatically adjusted to the available memory
Original Scheme (Ma: IPDPS 2002)
Hierarchical buffering scheme
Dedicated I/O server nodes
During I/O:
    if (no overflow in compute nodes)
        compute nodes -> local buffers
    else if (no overflow in server nodes)
        compute nodes -> server buffers (using MPI)
    else
        server nodes -> I/O system
During computation:
    Server nodes clear their local buffers by writing them to the I/O system
    Server nodes fetch data from compute nodes (one-sided communication) and write it to the I/O system
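The placement decision made during I/O can be sketched as a small function. This is only one reading of the pseudocode above: "overflow" is modeled as a request that does not fit in the remaining free bytes, and the function name, parameters, and tier labels are all hypothetical.

```python
def place_output(nbytes, compute_free, server_free):
    """Decide where one output request goes.

    Returns (tier, compute_free, server_free) with the free space updated.
    """
    if nbytes <= compute_free:
        # Fits locally: buffer on the compute node and return immediately.
        return "compute buffer", compute_free - nbytes, server_free
    if nbytes <= server_free:
        # Local buffers would overflow: ship the data to a server node.
        return "server buffer", compute_free, server_free - nbytes
    # Both levels are full: the data has to go to the I/O system now.
    return "I/O system", compute_free, server_free


print(place_output(10, compute_free=16, server_free=16))
print(place_output(10, compute_free=6, server_free=16))
print(place_output(10, compute_free=6, server_free=6))
```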
Current Scheme
I/O threads perform collective I/O, overlapped with the main threads' computation and communication
Uses pthreads with kernel-level scheduling
Intercepts ROMIO's I/O calls
Main threads and I/O threads coordinate through a buffer queue
This is the producer-consumer (bounded-buffer) problem
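The coordination between a main thread and an I/O thread can be sketched with Python's standard threading and queue modules standing in for pthreads. The fixed capacity, the None sentinel, and the name run_with_io_thread are assumptions of this sketch; the scheme described above adjusts buffer space to the available memory rather than using a fixed bound.

```python
import queue
import threading


def run_with_io_thread(steps, write, capacity=4):
    """Main thread produces output buffers; an I/O thread consumes them."""
    buffers = queue.Queue(maxsize=capacity)    # bounded buffer queue

    def io_thread():
        while True:
            item = buffers.get()
            if item is None:                   # sentinel: no more output
                break
            write(item)                        # stand-in for the real file write

    worker = threading.Thread(target=io_thread)
    worker.start()
    for step in range(steps):
        data = f"step {step}".encode()         # stand-in for computed results
        buffers.put(data)                      # blocks only when the queue is full
    buffers.put(None)
    worker.join()


written = []
run_with_io_thread(5, written.append, capacity=2)
print(written)
```

The main thread returns to computation as soon as put succeeds, so writing overlaps with computing until the queue fills; with a single consumer the buffers are written in the order they were produced.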
Execution Timeline
Bibliography
Philip H. Carns, Walter B. Ligon III, Robert B. Ross, and Rajeev Thakur. PVFS: A parallel file system for Linux clusters. In Proceedings of the 4th Annual Linux Showcase and Conference, Atlanta, GA, October. USENIX Association.
Jose Aguilar. A graph theoretical model for scheduling simultaneous I/O operations on parallel and distributed environments. Parallel Processing Letters, 12(1), March 2002.
Rajesh Bordawekar. Implementation of collective I/O in the Intel Paragon parallel file system: Initial experiences. In Proceedings of the 11th ACM International Conference on Supercomputing. ACM Press, July 1997.
Peter Brezany, Marianne Winslett, Denis A. Nicole, and Toni Cortes. Parallel I/O and storage technology. In Proceedings of the Seventh International Euro-Par Conference, volume 2150 of Lecture Notes in Computer Science, Manchester, UK, August. Springer-Verlag.
Bradley Broom, Rob Fowler, and Ken Kennedy. KelpIO: A telescope-ready domain-specific I/O library for irregular block-structured applications. In Proceedings of the First IEEE/ACM International Symposium on Cluster Computing and the Grid, Brisbane, Australia, May. IEEE Computer Society Press.
J. Carretero, F. Pérez, P. de Miguel, F. García, and L. Alonso. I/O data mapping in ParFiSys: support for high-performance I/O in parallel and distributed systems. In Euro-Par '96, volume 1123 of Lecture Notes in Computer Science. Springer-Verlag, August 1996.
Ying Chen, Marianne Winslett, Y. Cho, and S. Kuo. Automatic parallel I/O performance optimization using genetic algorithms. In Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing. IEEE Computer Society Press, July 1998.
Ying Chen, Ian Foster, Jarek Nieplocha, and Marianne Winslett. Optimizing collective I/O performance on parallel computers: A multisystem study. In Proceedings of the 11th ACM International Conference on Supercomputing. ACM Press, July 1997.
Avery Ching, Alok Choudhary, Kenin Coloma, Wei-keng Liao, Robert Ross, and William Gropp. Noncontiguous I/O accesses through MPI-IO. In Proceedings of the Third IEEE/ACM International Symposium on Cluster Computing and the Grid, Tokyo, Japan, May. IEEE Computer Society Press.
Phillip M. Dickens and Rajeev Thakur. Evaluation of collective I/O implementations on parallel architectures. Journal of Parallel and Distributed Computing, 61(8), August 2001.
Félix Garcia-Carballeira, Alejandro Calderon, Jesus Carretero, Javier Fernandez, and Jose M. Perez. The design of the Expand parallel file system. The International Journal of High Performance Computing Applications, 17(1):21-38, 2003.
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, Bolton Landing, NY, October. ACM Press.
James V. Huber, Jr., Christopher L. Elford, Daniel A. Reed, Andrew A. Chien, and David S. Blumenthal. PPFS: A high performance portable parallel file system. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 22. IEEE Computer Society Press and Wiley, New York, NY, 2001.
Meenakshi A. Kandaswamy, Mahmut Kandemir, Alok Choudhary, and David Bernholdt. An experimental evaluation of I/O optimizations on different applications. IEEE Transactions on Parallel and Distributed Systems, 13(7), July 2002.
Mahmut Kandemir. Compiler-directed collective I/O. IEEE Transactions on Parallel and Distributed Systems, 12(12), December 2001.
Xiaosong Ma, Marianne Winslett, Jonghyun Lee, and Shengke Yu. Improving MPI IO output performance with active buffering plus threads. In Proceedings of the International Parallel and Distributed Processing Symposium. IEEE Computer Society Press, April 2003.
Tara M. Madhyastha and Daniel A. Reed. Learning to classify parallel input/output access patterns. IEEE Transactions on Parallel and Distributed Systems, 13(8), August 2002.
Ethan L. Miller and Randy H. Katz. RAMA: An easy-to-use, high-performance parallel file system. Parallel Computing, 23(4-5), June 1997.
Bill Nitzberg and Virginia Lo. Collective buffering: Improving parallel I/O performance. In Proceedings of the Sixth IEEE International Symposium on High Performance Distributed Computing, Portland, OR, August. IEEE Computer Society Press. See also later version nitzberg:bcollective.
Huseyin Simitci and Daniel Reed. A comparison of logical and physical parallel I/O patterns. The International Journal of High Performance Computing Applications, 12(3), Fall 1998.
Domenico Talia and Pradip K. Srimani. Parallel data-intensive algorithms and applications. Parallel Computing, 28(5), May 2002.
Len Wisniewski, Brad Smisloff, and Nils Nieuwejaar. Sun MPI I/O: Efficient I/O for parallel applications. In Proceedings of SC99: High Performance Networking and Computing, Portland, OR, November. ACM Press and IEEE Computer Society Press.
K. K. Lee, M. Kallahalla, B. S. Lee, and P. J. Varman. Performance comparison of prefetching and placement policies for parallel I/O. International Journal of Parallel and Distributed Systems and Networks, 5(2):76-84.
M. Kallahalla and P. J. Varman. PC-OPT: Optimal offline prefetching and caching for parallel I/O systems. IEEE Transactions on Computers, 51(11), November 2002.