Parallel I/O Performance Study
Christian Chilan, The HDF Group
September 9, 2008, SPEEDUP Workshop - HDF5 Tutorial
Introduction
Parallel I/O performance is affected by the I/O access pattern, the file system, and the MPI communication mode. Determining how these elements interact provides hints for improving performance. This study presents four test cases using h5perf and h5perf_serial. h5perf has been extended to support parallel testing of 2D datasets. h5perf_serial, based on h5perf, allows serial testing of n-dimensional datasets and various file drivers. The tests cover combinations of MPI communication modes and HDF5 storage layouts. Finally, we make recommendations that can improve I/O performance for specific access patterns.
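As background for the test cases below, the following is a minimal sketch of how an HDF5 file is opened for parallel access through the MPI-IO file driver, roughly the setup that a parallel benchmark such as h5perf must perform; the file name and program structure are illustrative, not taken from h5perf itself.

#include <mpi.h>
#include <hdf5.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* File access property list selecting the MPI-IO file driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

    /* All ranks participate in creating the file. */
    hid_t file = H5Fcreate("perf_test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* ... dataset creation, hyperslab selection, and H5Dwrite go here ... */

    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}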
Testing Systems and Configuration
abe: Linux cluster with Intel 64, Lustre file system, MVAPICH2 1.0.2p1 with Intel compiler
cobalt: ccNUMA with Itanium 2, CXFS file system, SGI Message Passing Toolkit 1.16
mercury: Linux cluster with Itanium 2, GPFS file system, MPICH Myrinet 1.2.5..10, GM 2.0.8, Intel 8.0
Processors: 4
Dataset size: 64K × 64K (4 GB)
I/O selection: 64 MB per processor (shape depends on test case)
API: HDF5 1.8.1 (default build options)
Iterations: 3
MPI-IO type: collective / independent
Storage layout: contiguous / chunked (chunk size depends on test case)
HDF5 Storage Layouts: Contiguous
HDF5 assigns a single static, contiguous region of storage to the raw data of the dataset. (Diagram: the dataset maps to one contiguous block of dataset storage.)
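For reference, a contiguous 64K × 64K dataset of 1-byte elements (matching the test configuration) can be created as in the sketch below; contiguous layout is the default, so no layout property needs to be set. The dataset name and the open file handle `file` are illustrative.

/* Sketch: create a 64K x 64K dataset with the default contiguous layout. */
hsize_t dims[2] = {65536, 65536};
hid_t   space   = H5Screate_simple(2, dims, NULL);
hid_t   dset    = H5Dcreate2(file, "data_contig", H5T_NATIVE_UCHAR, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
H5Dclose(dset);
H5Sclose(space);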
HDF5 Storage Layouts: Chunked
HDF5 defines separate regions of storage for the raw data, called chunks. When a file is created in parallel, the chunks are pre-allocated in row-major order. This layout holds only when the file is created and the chunks are pre-allocated; further modification of the file may cause the chunks to be arranged differently. (Diagram: dataset chunks C0-C3 map to consecutive chunk regions C0, C1, C2, C3 in the file.)
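A chunked variant of the same dataset differs only in the dataset creation property list; the sketch below uses the 1K × 1K chunk size from case A, and again the names are illustrative.

/* Sketch: same dataset, but stored in 1K x 1K chunks (case A chunk size). */
hsize_t dims[2]  = {65536, 65536};
hsize_t chunk[2] = {1024, 1024};
hid_t   space = H5Screate_simple(2, dims, NULL);
hid_t   dcpl  = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_chunk(dcpl, 2, chunk);
hid_t   dset  = H5Dcreate2(file, "data_chunked", H5T_NATIVE_UCHAR, space,
                           H5P_DEFAULT, dcpl, H5P_DEFAULT);
H5Dclose(dset);
H5Pclose(dcpl);
H5Sclose(space);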
Test Cases: Case A
Each transfer selection extends over entire columns, with a size of 64K × 1K. If the storage is chunked, the chunk size is 1K × 1K. The selections are interleaved horizontally with respect to the processors. (Diagram: 1K-wide, 64K-tall selections ordered P0 P1 P2 P3 P0 P1 P2 P3 ... across the 64K-wide dataset.)
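A hyperslab selection producing this pattern might look like the sketch below; rank, nprocs, xfer (the transfer index), and filespace (obtained with H5Dget_space on the dataset) are assumed variables, not code from the study itself.

/* Sketch: case A file selection for one 64 MB transfer. The selection is a
 * full-height 64K x 1K column block; successive transfers advance `xfer`,
 * so the blocks of the ranks interleave across the 64K-wide dataset. */
hsize_t start[2] = {0, ((hsize_t)xfer * nprocs + rank) * 1024};
hsize_t count[2] = {65536, 1024};  /* 64K rows x 1K columns */
H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);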
Test Cases: Case B
Each transfer selection spans half of each column, with a size of 32K × 2K. If the storage is chunked, the chunk size is 2K × 2K. The selections are interleaved horizontally with respect to the processors. (Diagram: 2K-wide, 32K-tall selections ordered P0 P1 P2 P3 P0 P1 P2 P3 ... across the 64K-wide dataset.)
Test Cases: Case C
Each transfer selection spans half of each row, with a size of 2K × 32K. If the storage is chunked, the chunk size is 2K × 2K. The rows of the dataset are divided evenly among the processors, so each processor works on a contiguous band of rows. (Diagram: 2K-tall, 32K-wide selections; the 64K rows are split into bands owned by P0, P1, P2, P3.)
Test Cases: Case D
Each transfer selection extends over entire rows, with a size of 1K × 64K. If the storage is chunked, the chunk size is 1K × 1K. The rows of the dataset are divided evenly among the processors, so each processor writes a contiguous band of rows. (Diagram: 1K-tall, 64K-wide selections; the 64K rows are split into bands owned by P0, P1, P2, P3.)
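For contrast with case A, a case D selection for one transfer might look like the sketch below; as before, rank, nprocs, xfer, and filespace are assumed variables, and the contiguous band of rows per rank reflects the reading of this case used in the access-pattern discussion that follows.

/* Sketch: case D file selection for one 64 MB transfer. Each rank owns a
 * contiguous band of 64K/nprocs rows; each transfer writes 1K full rows
 * within that band, so the selected file region is contiguous. */
hsize_t band_rows = 65536 / (hsize_t)nprocs;
hsize_t start[2]  = {(hsize_t)rank * band_rows + (hsize_t)xfer * 1024, 0};
hsize_t count[2]  = {1024, 65536};  /* 1K rows x 64K columns */
H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);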
Access Patterns
Contiguous: each processor accesses a separate region of contiguous storage. An example of this pattern is case D using contiguous storage.
Non-contiguous: separate regions are still assigned to each processor, but those regions contain gaps. Examples of this pattern include case C using contiguous storage, and collective cases C-D using chunked storage.
Access Patterns
Interleaved (or overlapped): each processor writes many portions that are interleaved with those of the other processors. Using contiguous storage with cases A-B generates this pattern; another instance results from using chunked storage with collective cases A-B.
Performance Results and Analysis
The results correspond to the maximum throughput of the Write Open-Close measurements over 3 iterations. Serial throughput is the performance baseline, since our objective is to determine how parallel access can improve performance. Unlike GPFS and CXFS, Lustre does not stripe files by default; to enable parallel access, the directory or file must be striped using the lfs command.
I/O Performance in Lustre (throughput chart)
I/O Performance in Lustre
Striping partitions the file space into stripes and assigns them to several Object Storage Targets (OSTs) in round-robin fashion. Since each OST stores portions of the file that differ from those on the other OSTs, all OSTs can access the file in parallel. The default configuration on abe uses a stripe size of 4 MB and a stripe count of 16. Striping improves performance when the I/O request of each processor spans several stripes (and OSTs) after MPI aggregation, if any. When the processors make small independent I/O requests that are practically contiguous, as in cases A-B using chunked storage, a single OST can provide better performance due to asynchronous operations.
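Stripe settings can also be requested from the application side through MPI-IO hints on the file access property list, as in the sketch below; "striping_factor" and "striping_unit" are ROMIO hint names, and whether they take effect depends on the MPI implementation, the file system, and the file being newly created. The function and file names are illustrative.

#include <mpi.h>
#include <hdf5.h>

/* Sketch: ask MPI-IO for a 16-OST stripe count and 4 MB stripe size when
 * creating a file with the MPI-IO driver (matching abe's default layout). */
hid_t create_striped_file(const char *name, MPI_Comm comm)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "16");     /* stripe count */
    MPI_Info_set(info, "striping_unit", "4194304");  /* 4 MB stripe size */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, info);

    hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    H5Pclose(fapl);
    MPI_Info_free(&info);
    return file;
}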
I/O Performance (throughput charts)
Performance of Serial I/O
Access using contiguous storage shows the steepest performance trend as the cases change from A to D. With chunked storage, the throughput remains almost constant at the upper bound: because chunks are allocated at the time they are written, the access pattern is effectively contiguous regardless of the test case.
Performance of Independent I/O
Processors perform their I/O requests independently of each other. For contiguous storage, performance improves as the tests move from A to D. For chunked storage, throughput is high for the interleaved cases A-B, since the write blocks (chunks) are larger and caching is exploited. For cases C-D, the many write requests (one per chunk) multiply the overhead of unnecessary locking and caching in Lustre and CXFS. Unlike these file systems, GPFS has shown better scalability [1,2].
Performance of Collective I/O
The participating processors coordinate and combine their many requests into fewer I/O operations, reducing latency. Since the file space is evenly divided among the processors, no locking is needed, which reduces overhead [3]. For contiguous storage, performance is high overall, but there is still an increasing trend as the cases change from A to D. For chunked storage, performance is even higher, with only minor variation among the test cases, because several chunks can be written with a single I/O operation.
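In HDF5, the MPI communication mode compared here is selected per transfer through the data transfer property list; a minimal sketch follows, with dset, memspace, filespace, and buf assumed to exist from the earlier sketches.

/* Sketch: choose collective or independent MPI-IO for a write. */
hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);  /* or H5FD_MPIO_INDEPENDENT */
H5Dwrite(dset, H5T_NATIVE_UCHAR, memspace, filespace, dxpl, buf);
H5Pclose(dxpl);

With H5FD_MPIO_COLLECTIVE all ranks must call H5Dwrite so that MPI-IO can aggregate their requests; with H5FD_MPIO_INDEPENDENT each rank issues its requests on its own, as in the independent results above.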
Conclusion
It is important to determine the access pattern by analyzing the I/O requirements of the application and the storage layout. For contiguous access patterns, independent access is preferable because it avoids the unnecessary overhead of collective calls. For non-contiguous patterns, there is little difference between independent and collective access; however, writing many chunks in independent mode may be expensive on Lustre and CXFS if caching is not exploited. For interleaved access patterns, collective mode is usually faster. Across all access patterns, collective mode combined with chunked storage yields the highest average performance.
References
1. J. Borrill, L. Oliker, J. Shalf, and H. Shan. Investigation of Leading HPC I/O Performance Using a Scientific-Application Derived Benchmark. In Proceedings of SC'07: High Performance Networking and Computing, Reno, NV, November 2007.
2. W. Liao, A. Ching, K. Coloma, A. Choudhary, and L. Ward. An Implementation and Evaluation of Client-Side File Caching for MPI-IO. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS 2007), pages 1-10, March 2007.
3. R. Thakur, W. Gropp, and E. Lusk. Data Sieving and Collective I/O in ROMIO. In Proceedings of the 7th Symposium on the Frontiers of Massively Parallel Computation, IEEE Computer Society Press, February 1999.