Parallel I/O Performance Study and Optimizations with HDF5, A Scientific Data Package Christian Chilan, Kent Yang, Albert Cheng, Quincey Koziol, Leon Arber National Center for Supercomputing Applications University of Illinois, Urbana-Champaign
2 Outline HDF5 and PnetCDF performance comparison Performance of collective and independent I/O operations Collective I/O optimizations inside HDF5
3 Introduction HDF5 and netCDF provide data formats and programming interfaces. HDF5: hierarchical file structures; high flexibility for data management. NetCDF: linear data layout; self-descriptive and minimizes overhead. HDF5 and a parallel version of netCDF from ANL/Northwestern U. (PnetCDF) provide support for parallel access on top of MPI-IO.
4 HDF5 and PnetCDF performance comparison Previous study: PnetCDF achieves higher performance than HDF5. Platforms: NCAR Bluesky (IBM Cluster 1600, Power4, AIX 5.1, GPFS); LLNL uP (IBM SP, Power5, AIX 5.3, GPFS). PnetCDF vs. HDF5
5 HDF5 and PnetCDF performance comparison Benchmark is the I/O kernel of FLASH. FLASH I/O generates 3D blocks of size 8x8x8 on Bluesky and 16x16x16 on uP. Each processor handles 80 blocks and writes them into 3 output files. The performance metric given by FLASH I/O is the parallel execution time. The more processors, the larger the amount of data. Collective vs. Independent I/O operations
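The statement that the data volume grows with the processor count can be checked with simple arithmetic. The sketch below is illustrative only: it assumes 8-byte double-precision values and counts a single variable with no guard cells, neither of which is stated on the slide. The function name `flash_bytes` is hypothetical.

```python
def flash_bytes(nprocs, block_edge, blocks_per_proc=80, bytes_per_value=8):
    """Bytes written for one variable across all processors.

    bytes_per_value=8 is an assumption (double precision); the slide gives
    only the block edge length and the number of blocks per processor.
    """
    cells_per_block = block_edge ** 3
    return nprocs * blocks_per_proc * cells_per_block * bytes_per_value


# Per-processor volume for the two block sizes named on the slide.
bluesky_per_proc = flash_bytes(1, 8)    # 8x8x8 blocks
up_per_proc = flash_bytes(1, 16)        # 16x16x16 blocks

# Weak scaling: doubling the processors doubles the total volume.
total_32 = flash_bytes(32, 8)
total_64 = flash_bytes(64, 8)
```

Under these assumptions the total grows linearly with the processor count, which is why the execution times on different processor counts are not directly comparable as a fixed-size workload.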
6 HDF5 and PnetCDF performance comparison [Figure: results on Bluesky and uP]
7 HDF5 and PnetCDF performance comparison [Figure: results on Bluesky and uP]
8 Performance of collective and independent I/O operations Effects of using contiguous and chunked storage layouts. Four test cases: writing a rectangular selection of double-precision floats with different access patterns and I/O types. Data is stored in row-major order. The tests were executed on NCAR Bluesky using HDF5 version
9 Contiguous storage (Case 1) Every processor has a noncontiguous selection. Access requests are interleaved. Write operation with 32 processors; each processor selection has 512K rows and 8 columns (32 MB/proc.). Independent I/O: 1, s. Collective I/O: 4.33 s. [Diagram: each row (Row 1, Row 2, ...) is split among P0, P1, P2, P3]
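To see why interleaved requests benefit from being combined, the following sketch lists the byte extents one process touches in Case 1 and then coalesces the extents of all processes. It is a simplified model, not HDF5 or MPI-IO code: the helper names `row_extents` and `merge` are hypothetical, and a small row count is used so the lists stay short.

```python
ELEM = 8  # bytes per double


def row_extents(rank, nprocs, cols_per_proc, nrows):
    """Yield (offset, length) byte extents for one process in a row-major
    array whose rows are split column-wise among nprocs processes."""
    row_bytes = nprocs * cols_per_proc * ELEM
    piece = cols_per_proc * ELEM
    for r in range(nrows):
        yield (r * row_bytes + rank * piece, piece)


def merge(extents):
    """Coalesce byte extents that touch end to start."""
    merged = []
    for off, length in sorted(extents):
        if merged and merged[-1][0] + merged[-1][1] == off:
            merged[-1] = (merged[-1][0], merged[-1][1] + length)
        else:
            merged.append((off, length))
    return merged


# One process alone: one 64-byte piece per row, 2048 bytes apart.
p1 = list(row_extents(rank=1, nprocs=32, cols_per_proc=8, nrows=4))

# All 32 processes together: the pieces tile the file with no gaps.
everyone = [e for p in range(32) for e in row_extents(p, 32, 8, 4)]
combined = merge(everyone)
```

With the full 512K rows, each process would issue 524,288 separate 64-byte pieces when writing independently, whereas the combined request covers one contiguous region.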
10 Contiguous storage (Case 2) No interleaved requests to combine. Contiguity is already established. Write operation with 32 processors where each processor selection has 16K rows and 256 columns (32 MB/proc.). Independent I/O: 3.64 s. Collective I/O: 3.87 s. [Diagram: selections of P0, P1, P2, P3 in the array and in the file]
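The claim that contiguity is already established in Case 2 can be illustrated by computing each process's file extent. This sketch assumes the array is exactly 256 columns wide, so every process owns whole rows; the slide does not state the array width explicitly. The name `block_extent` is hypothetical.

```python
ELEM = 8  # bytes per double


def block_extent(rank, rows_per_proc=16 * 1024, ncols=256):
    """Single (offset, length) byte extent for a process that owns
    rows_per_proc whole rows of a row-major array (assumed layout)."""
    length = rows_per_proc * ncols * ELEM
    return (rank * length, length)


extents = [block_extent(r) for r in range(32)]
```

Each process already writes one 32 MB extent, so there is nothing for a collective operation to combine, which is consistent with the two timings being close.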
11 Chunked storage (Case 3) The chunk size does not evenly divide the array size. If information about the access pattern is given, MPI-IO can perform efficient independent access. Write operation with 32 processors; each processor selection has 16K rows and 256 columns (32 MB/proc.). Independent I/O: s. Collective I/O: s. [Diagram: dataset with chunk 0 and chunk 1; selections of P0, P1, P2, P3]
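When the chunk size does not evenly divide the array, a single rectangular selection can straddle several chunks, some of them partial edge chunks. The sketch below shows one way to compute which chunks a selection intersects. The function names and the small example dimensions are hypothetical and are not the benchmark's actual sizes.

```python
from itertools import product
from math import ceil


def chunk_grid(dims, chunk_dims):
    """Number of chunks along each dimension; edge chunks may be partial."""
    return tuple(ceil(d / c) for d, c in zip(dims, chunk_dims))


def chunks_touched(start, count, chunk_dims):
    """Chunk coordinates intersected by the selection [start, start + count)."""
    ranges = [
        range(s // c, (s + n - 1) // c + 1)
        for s, n, c in zip(start, count, chunk_dims)
    ]
    return list(product(*ranges))


# A 10x10 array with 4x4 chunks needs a 3x3 chunk grid.
grid = chunk_grid((10, 10), (4, 4))

# Rows 2..5, all columns: spans two chunk rows and three chunk columns.
touched = chunks_touched(start=(2, 0), count=(4, 10), chunk_dims=(4, 4))
```

A selection that would be contiguous in a contiguous layout is split into one piece per touched chunk here, which is the effect on contiguity that the chunked test cases probe.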
12 Chunked storage (Case 4) Array partitioned exactly into two chunks. Write operation with 32 processors; each chunk is selected by 4 processors; each processor selection has 256K rows and 32 columns (64 MB/proc.). Independent I/O: 7.66 s. Collective I/O: 8.68 s. [Diagram: chunk 0 selected by P0 to P3, chunk 1 selected by P4 to P7]
13 Conclusions Performance is quite comparable when the two libraries are used in similar manners. Collective access should be preferred in most situations when using HDF5. Use of chunked storage may affect the contiguity of the data selection.
14 Improvements to collective IO support inside HDF5 Advanced HDF5 feature: non-regular selections. Performance optimizations: chunked storage. Provide several IO options to achieve good collective IO performance. Provide APIs for applications to participate in the optimization process.
15 Improvement 1: non-regular selections Only one HDF5 IO call. Good for collective IO. Any such applications?
16 Performance issues for chunked storage Previous support: collective IO can only be done per chunk. Problem: severe performance penalties with many small chunk IOs.
17 Improvement 2: One linked chunk IO [Diagram: P0 and P1 select data in chunk 1, chunk 2, chunk 3, and chunk 4, combined into one MPI-IO collective view]
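The idea of one linked-chunk IO can be sketched as gathering a process's pieces from every chunk into a single address-sorted list, so that one collective call covers all chunks. This is a conceptual model only: the function names, the chunk addresses, and the piece sizes are hypothetical, and real code would describe the list to MPI-IO with a derived datatype rather than a Python list.

```python
def linked_chunk_view(chunk_addr, pieces):
    """Build one address-sorted list of (file_offset, length) extents.

    chunk_addr: dict chunk_id -> file address of that chunk (made-up values)
    pieces: list of (chunk_id, offset_in_chunk, length) for this process
    """
    view = [(chunk_addr[cid] + off, length) for cid, off, length in pieces]
    view.sort()
    return view


def io_calls(pieces, linked):
    """Collective calls issued: one in total if linked,
    otherwise one per chunk touched."""
    return 1 if linked else len({cid for cid, _, _ in pieces})


# Chunks need not be stored in index order within the file.
chunk_addr = {1: 4096, 2: 1024, 3: 8192, 4: 2048}
pieces = [(1, 0, 256), (2, 128, 256), (3, 0, 256), (4, 0, 128)]

view = linked_chunk_view(chunk_addr, pieces)
calls_linked = io_calls(pieces, linked=True)
calls_per_chunk = io_calls(pieces, linked=False)
```

Sorting by file address matters because chunks may sit anywhere in the file, and a single file view is typically described with nondecreasing offsets.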
18 Improvement 3: Multi-chunk IO optimization Have to keep the option to do collective IO per chunk: collective IO bugs inside different MPI-IO packages; limitation of system memory. Problem: bad performance caused by improper use of collective IO. [Diagram: two layouts of processes P0, P1, P5, P6, P3, P4, P7, P8]
19 Improvement 4 Problem: HDF5 may make the wrong decision about the way to do collective IO. Solution: provide APIs for applications to participate in the decision-making process.
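A per-chunk choice between collective and independent IO might be sketched as follows. The criterion used here, the share of processes that select data in a chunk, and the 0.5 threshold are assumptions for illustration; the slide does not state how the choice is made, and `per_chunk_modes` is a hypothetical name.

```python
def per_chunk_modes(chunk_to_procs, nprocs, min_fraction=0.5):
    """Pick an I/O mode for each chunk.

    chunk_to_procs: dict chunk_id -> set of ranks selecting data in that chunk
    min_fraction: hypothetical threshold, not a value taken from HDF5
    """
    modes = {}
    for cid, ranks in chunk_to_procs.items():
        share = len(ranks) / nprocs
        modes[cid] = "collective" if share >= min_fraction else "independent"
    return modes


# Chunk 0 is shared by half the processes; chunk 1 is touched by only one.
modes = per_chunk_modes({0: {0, 1, 2, 3}, 1: {4}}, nprocs=8)
```

The intent is to avoid paying collective synchronization costs on chunks that only a few processes touch, while still combining requests where many processes share a chunk.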
20 Flow chart of Collective Chunking IO improvements inside HDF5 [Flow chart: Collective chunk mode; decision-making about one linked-chunk IO (optional user input): Yes, one collective IO call for all chunks; No, multi-chunk IO; decision-making about collective IO (optional user input): Yes, collective IO per chunk; No, independent IO]
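The top-level branch of this decision flow, with optional user input taking precedence over the library's own choice, could look roughly like the sketch below. The heuristic (average number of chunks per process) and its cutoff are assumptions made for illustration, and `choose_strategy` is a hypothetical name, not an HDF5 function.

```python
def choose_strategy(chunks_per_proc, user_choice=None, linked_threshold=10):
    """Return 'linked-chunk' or 'multi-chunk'.

    chunks_per_proc: number of chunks each process selects data in
    user_choice: optional override supplied by the application
    linked_threshold: hypothetical cutoff on the average chunks per process
    """
    if user_choice is not None:
        if user_choice not in ("linked-chunk", "multi-chunk"):
            raise ValueError(f"unknown strategy: {user_choice}")
        return user_choice
    avg = sum(chunks_per_proc) / len(chunks_per_proc)
    return "linked-chunk" if avg >= linked_threshold else "multi-chunk"


many_small = choose_strategy([20, 30])
few_large = choose_strategy([1, 2])
forced = choose_strategy([1, 2], user_choice="linked-chunk")
```

Letting the application override the heuristic reflects the point that the library may choose poorly when it lacks knowledge of the access pattern.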
21 Conclusions HDF5 provides collective IO support for non-regular selections. Supporting collective IO for chunked storage is not trivial. Several IO options are provided. Users can participate in the decision-making process that selects among the IO options.
22 Acknowledgments This work is funded by National Science Foundation Teragrid grants, the Department of Energy's ASC Program, the DOE SciDAC program, NCSA, and NASA.