Pursuing Faster I/O in COSMO POMPA Workshop May 3rd 2010.


2 Recap – why I/O is such a problem
The I/O problem:
– I/O is the limiting factor for scaling of COSMO
    It is a limiting factor for many data-intensive applications
– Speed of I/O subsystems for writing data is not keeping up with increases in speed of compute engines
Idealised 2D grid layout: increasing the number of processors by a factor of 4 leads to each processor having
– one quarter the number of grid points to compute
– one half the number of halo points to communicate
The same amount of total data needs to be output at each time step.
P processors, each with: M x N grid points, 2M+2N halo points
4P processors, each with: (M/2) x (N/2) grid points, M+N halo points
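The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the subdomain size (M = 64, N = 48) and the process count are made-up values, and the halo is counted as 2M+2N exactly as on the slide.

```python
def per_process_load(m, n):
    """Return (grid_points, halo_points) for one process owning an m x n subdomain."""
    grid_points = m * n
    halo_points = 2 * m + 2 * n  # halo count as given on the slide
    return grid_points, halo_points

M, N = 64, 48   # hypothetical subdomain size on P processes
P = 16          # hypothetical process count

base = per_process_load(M, N)            # P processes
quad = per_process_load(M // 2, N // 2)  # 4P processes

grid_ratio = quad[0] / base[0]           # grid points per process shrink to a quarter
halo_ratio = quad[1] / base[1]           # halo points per process shrink only to a half

# Total number of grid points to output is the same in both configurations
total_base = P * base[0]
total_quad = 4 * P * quad[0]

print(grid_ratio, halo_ratio, total_base == total_quad)
```

The per-process compute and halo work both fall, but the total output volume does not change.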

3 I/O reaches a scaling limit
Computation: scales O(P) for P processors
– Minor scaling problem: issues of halo memory bandwidth, vector lengths, efficiency of software pipeline etc.
Communication: scales O(√P) for P processors
– Major scaling problem: the halo region decreases slowly as you increase the number of processors
I/O (mainly “O”): no scaling
– Limiting factor in scaling: the same amount of total data is output at each time step
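A toy cost model makes the consequence of these three scaling behaviours visible. The coefficients below are hypothetical and in arbitrary units, not measurements from COSMO. The only assumptions taken from the slide are that compute time per step falls as 1/P, communication time as 1/√P, and output time does not fall at all.

```python
import math

def step_time(p, t_comp=1.0, t_comm=0.1, t_io=0.05):
    """Modelled time per output step on p processes (hypothetical coefficients)."""
    compute = t_comp / p             # speedup O(P)
    comm = t_comm / math.sqrt(p)     # speedup O(sqrt(P))
    io = t_io                        # no speedup: same total data every step
    return compute + comm + io

def io_fraction(p, t_comp=1.0, t_comm=0.1, t_io=0.05):
    """Share of the modelled step time that is spent on output."""
    return t_io / step_time(p, t_comp, t_comm, t_io)

for p in (16, 64, 256, 1024):
    print(p, round(step_time(p), 4), round(io_fraction(p), 3))
```

In this model the step time can never drop below the constant output term, so the share of time spent on output grows towards 100% as P increases.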

4 Current I/O strategies in COSMO
Two types of output format: GRIB and NetCDF
– GRIB is dominant in operational weather forecasting
– NetCDF is the main format used in climate research
GRIB output has the possibility of using asynchronous I/O processes to improve parallel performance
NetCDF output is always ultimately serialised through process zero of the simulation
In both cases, GRIB and NetCDF, the output uses a multi-level data collection approach

5 Multi-level approach
A 3 x 6 grid of processes, each with 3 atmospheric levels
[Diagram: Collect on atmospheric levels -> Proc 0, Proc 1, Proc 2 -> I/O Proc -> Storage]
Data is sent to the I/O proc level by level
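The two stages can be mimicked in plain Python without MPI. This is an illustrative sketch, not COSMO code: labelled tuples stand in for real subarrays, and the mapping of level k to Proc k is an assumption read off the diagram labels.

```python
NPROC = 18    # 3 x 6 process grid from the slide
NLEVELS = 3   # atmospheric levels held by every process

# local[p][k] is process p's piece of level k (a labelled tuple stands in for a subarray)
local = [[(p, k) for k in range(NLEVELS)] for p in range(NPROC)]

def gather_on_levels(local, nlevels):
    """Stage 1: the process assigned to level k collects every process's piece of level k."""
    return {k: [pieces[k] for pieces in local] for k in range(nlevels)}

def collect_and_store(gathered):
    """Stage 2: the I/O process receives complete levels one at a time, in level order."""
    stored = []
    for k in sorted(gathered):
        stored.append((k, gathered[k]))  # a real I/O process would write this level out
    return stored

gathered = gather_on_levels(local, NLEVELS)
stored = collect_and_store(gathered)
print([(k, len(pieces)) for k, pieces in stored])
```

Every level passes through a single receiving process in stage 2, which is why that stage becomes the serial bottleneck.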

6 Performance limitations and constraints
Both GRIB and NetCDF formats carry out the gather-on-levels stage
For GRIB-based weather simulations the final collect-and-store stage can deploy multiple I/O processes to deal with the data
– Allows improved performance where real storage bandwidth is the bottleneck
– Produces multiple files (one per I/O process) that can easily be concatenated together
Only process 0 can currently act as an I/O proc for the collect-and-store stage with NetCDF
– Serialises the I/O through one compute process

7 Possible strategies for fast NetCDF I/O
1. Use a version of parallel NetCDF to have all compute processes write to disk
   – Eliminates both the gather-on-levels and collect-and-store stages
2. Use a version of parallel NetCDF on the subset of compute processes that are needed for the gather stage
   – Eliminates the collect-and-store stage
3. Use a set of asynchronous processes, as is currently done in the GRIB implementation
   – If more than one asynchronous process is employed, this would require parallel NetCDF or post-processing

8 Full parallel strategy
A simple micro-benchmark of 3D data distributed on a 2D process grid showed reasonable results
This was implemented in the RAPS code and tested with the IPCC benchmark at ~900 cores
– No smoothing operations in this benchmark or in the code
The results were poor
– Much of the I/O in this benchmark is 2D fields
– Not much data is written at each timestep
– The current I/O performance is not bad
– The parallel strategy became dominated by metadata operations
    File writes for 3D fields were reasonably fast (~0.025 s for 50 Mbytes)
    Opening the file took a long time (0.4 to 0.5 seconds)
The strategy may be useful for high-resolution simulations writing large 3D blocks of data
– Originally this strategy was expected to target 2000 x 1000 x 60+ grids
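The quoted timings show how strongly the file open dominates. The short calculation below uses only the numbers on the slide and assumes 1 Mbyte = 10^6 bytes. The break-even size is a derived estimate, not a measured value.

```python
write_bytes = 50e6        # 50 Mbytes per 3D field (assuming 1 Mbyte = 10**6 bytes)
write_time = 0.025        # seconds per 3D field write, from the slide
open_times = (0.4, 0.5)   # seconds to open the file, from the slide

# Effective write bandwidth, roughly 2 GB/s
bandwidth_gb_s = write_bytes / write_time / 1e9

# Share of (open + one write) spent in the open call, roughly 94% to 95%
open_share = [t / (t + write_time) for t in open_times]

# Data volume per open at which write time would equal open time, in Mbytes
break_even_mb = [t * (write_bytes / write_time) / 1e6 for t in open_times]

print(bandwidth_gb_s, open_share, break_even_mb)
```

At this bandwidth a process would need to write on the order of 800 to 1000 Mbytes per open before the write time matched the open time. This is consistent with the slide's point that the strategy suits large 3D blocks.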

9 Slowdown from metadata
The first strategy has problems related to metadata scalability
Most modern high-performance file systems use POSIX I/O to open/close/seek etc.
This is not scalable, as the time for file access operations becomes dominated by the time taken for metadata operations

10 Non-scalable metadata – file open speeds
Opening a file is not a scalable operation on modern parallel file systems
– See graph of two CSCS file systems
There are some mitigation strategies in MPI’s ROMIO/ADIO layer
– “Delayed open” only makes the POSIX open call when actually needed
    For MPI-IO collective operations, only a subset of processes actually write the data
– No mitigation strategies for specific file systems (Lustre and GPFS)
With current file systems using POSIX I/O calls, full parallel I/O is not scalable
We need to pursue the other strategies for COSMO, unless large blocks of data are being written
[Figure: time in seconds to open a file against number of MPI processes involved in the file open]
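The idea behind "delayed open" can be illustrated with a small Python class. This is a conceptual sketch only, not ROMIO/ADIO code (which is written in C inside the MPI library). The class name `DelayedOpenFile`, the file name and the count of eight handles are all hypothetical.

```python
import os
import tempfile

class DelayedOpenFile:
    """Defer the POSIX open() until the first write (illustrative only)."""

    def __init__(self, path):
        self.path = path
        self.fd = None   # no open call, so no metadata operation, has happened yet

    def write(self, data: bytes) -> int:
        if self.fd is None:   # first real access: issue the POSIX open now
            self.fd = os.open(self.path, os.O_WRONLY | os.O_CREAT, 0o644)
        return os.write(self.fd, data)

    def close(self):
        if self.fd is not None:
            os.close(self.fd)
            self.fd = None

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "out.dat")

# Eight hypothetical "processes" all take part in the logical file open
handles = [DelayedOpenFile(path) for _ in range(8)]
opened_before = sum(h.fd is not None for h in handles)

# Only one of them (standing in for an aggregator) actually writes
handles[0].write(b"field data")
opened_after = sum(h.fd is not None for h in handles)

for h in handles:
    h.close()
print(opened_before, opened_after)
```

Handles that never write never reach the file system's metadata server. That is the saving when only a subset of processes performs the writes in a collective operation.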

11 Next steps
We are looking at all 3 strategies for improving NetCDF I/O
We are investigating the current state of metadata accesses in the MPI-IO layer and in file systems in general
– Particularly Lustre and GPFS, but also others (e.g. OrangeFS)
… but for some jobs the individual I/O operations might not be large enough to allow much speedup

