Progress in Storage Efficient Access
PnetCDF and MPI-I/O
SciDAC All Hands Meeting, March 2-3, 2005

Northwestern University
  PIs: Alok Choudhary, Wei-keng Liao
  Graduate Students: Avery Ching, Kenin Coloma, Jianwei Li
ANL Collaborators: Bill Gropp, Rob Ross, Rajeev Thakur, Rob Latham

Outline
Parallel netCDF
  – Building blocks
  – Status report
  – Users and applications
  – Future work
MPI I/O file caching sub-system
  – Enhance client-side file caching for parallel applications
  – Scalable approach for enforcing file consistency and atomicity

Parallel netCDF
Goals
  – Design parallel APIs
  – Keep the same file format
      – Backward compatible, easy to migrate from serial netCDF
      – Similar API names and argument lists, but with parallel semantics
Tasks
  – Built on top of MPI for portability and high performance
      – Takes advantage of existing MPI-IO optimizations (collective I/O, etc.)
  – Additional functionality for sophisticated I/O patterns
      – A new set of flexible APIs incorporates MPI derived datatypes to describe the mapping between memory and file data layouts
  – Support C and Fortran interfaces
  – Support external data representations across platforms
[Figure: system diagram. Labels: compute nodes, Parallel netCDF, ROMIO, ADIO, user space, file system space, switch network, I/O servers.]

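To make "mapping between memory and file data layout" concrete, here is a minimal pure-Python sketch (not PnetCDF code) of what a start/count subarray request means for a row-major array: it lists the contiguous byte runs the request touches. The function name `subarray_runs` and its parameters are hypothetical, and offsets are taken relative to the start of the array's data rather than the start of the file.

```python
from itertools import product


def subarray_runs(shape, start, count, elem_size):
    """Return (byte_offset, byte_length) runs for a subarray of a row-major array.

    shape, start and count are per-dimension tuples. Offsets are relative to
    the beginning of the array's data, not the beginning of the file.
    """
    ndims = len(shape)
    # Row-major strides in elements: the last dimension varies fastest.
    strides = [1] * ndims
    for d in range(ndims - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]

    runs = []
    # Every combination of the outer indices yields one contiguous run
    # along the last dimension.
    outer = [range(start[d], start[d] + count[d]) for d in range(ndims - 1)]
    for idx in product(*outer):
        first = sum(i * s for i, s in zip(idx, strides)) + start[-1]
        runs.append((first * elem_size, count[-1] * elem_size))
    return runs


if __name__ == "__main__":
    # Rows 1-2, columns 2-4 of a 4 x 6 array of 4-byte values.
    print(subarray_runs((4, 6), (1, 2), (2, 3), 4))
```

Even this small request breaks into two separate runs, which is the kind of non-contiguous pattern that derived datatypes and collective I/O are meant to handle in one call.
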
PnetCDF Current Status
High-level APIs (mimicking the serial netCDF API)
  – Fully supported in both C and Fortran
Flexible APIs (extended to utilize MPI derived datatypes)
  – Allow complex memory layouts for mapping between I/O buffer and file space
  – Support varm routines (strided memory layout) ported from serial netCDF
  – Support array shuffles, e.g. transposition
Test suites
  – C and Fortran self-test codes ported from the Unidata netCDF package to validate against single-process results
  – Parallel test codes for both sets of APIs
Latest release is v0.9.4
  – Pre-release v1.0
      – Sync with netCDF v3.6.0 (newest release from Unidata)
      – Parallel API user manual

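The varm routines in serial netCDF take an `imap` vector giving, for each dimension of the variable, the distance in elements between neighbouring values in memory. The sketch below is a pure-Python illustration of that idea only; it does not call netCDF or PnetCDF, and the helper name `gather_with_imap` is hypothetical. It shows how an `imap` can express a transposition.

```python
def gather_with_imap(buffer, count, imap):
    """Read values from a flat memory buffer in file (row-major) order.

    count[d] is the number of elements along file dimension d; imap[d] is the
    distance, in elements, between memory neighbours along that dimension.
    """
    out = []

    def walk(dim, base):
        if dim == len(count):
            out.append(buffer[base])
            return
        for i in range(count[dim]):
            walk(dim + 1, base + i * imap[dim])

    walk(0, 0)
    return out


if __name__ == "__main__":
    # The file variable is 2 x 3, but memory holds its 3 x 2 transpose,
    # stored row-major: [[0, 3], [1, 4], [2, 5]].
    memory = [0, 3, 1, 4, 2, 5]
    # file[i][j] lives at memory[j * 2 + i], hence imap = (1, 2).
    print(gather_with_imap(memory, count=(2, 3), imap=(1, 2)))
```

With `imap = (3, 1)` the same helper describes an ordinary row-major 2 x 3 buffer, so the transposed and untransposed cases differ only in the mapping vector.
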
PnetCDF Users and Applications
FLASH – astrophysical thermonuclear application from the ASCI/Alliances Center at the University of Chicago
ACTM – Atmospheric Chemical Transport Model from LLNL
ROMS – Regional Ocean Model System from the NCSA HDF group
ASPECT – data understanding infrastructure from ORNL
pVTK – parallel visualization toolkit from ORNL
PETSc – Portable, Extensible Toolkit for Scientific Computation from ANL
PRISM – PRogram for Integrated Earth System Modeling

PnetCDF Future Work
Data type conversion for external data representation
  – Reducing intermediate memory copy operations
      – Conversions between int64 and int32, little-endian and big-endian, int and double
  – Data type caching at the PnetCDF level (without repeated decoding)
I/O hints
  – Currently, only MPI hints (MPI file info) are supported
  – Need netCDF-level hints, e.g. patterns of access sequence for multiple arrays
Non-blocking I/O
Large array support (dimensionality > )
More flexible and extendable file format
  – Allow adding new objects dynamically
  – Store arrays of structured data types, such as C structures

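As a small illustration of the conversions listed above, the following pure-Python sketch encodes integers as big-endian 32-bit values (the byte order used by the classic netCDF file format) and byte-swaps an existing buffer. It only shows what the conversions do; it is not how PnetCDF implements or optimizes them, and both function names are hypothetical.

```python
import struct

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def encode_int32_big_endian(values):
    """Pack Python ints as big-endian 32-bit integers, rejecting overflow."""
    for v in values:
        if not INT32_MIN <= v <= INT32_MAX:
            raise OverflowError("value %d does not fit in int32" % v)
    return struct.pack(">%di" % len(values), *values)


def byteswap32(raw):
    """Reverse the byte order of every 4-byte word in a bytes object."""
    if len(raw) % 4:
        raise ValueError("buffer length must be a multiple of 4")
    return b"".join(raw[i:i + 4][::-1] for i in range(0, len(raw), 4))


if __name__ == "__main__":
    little = struct.pack("<2i", 1, -2)   # little-endian in-memory image
    print(byteswap32(little) == encode_int32_big_endian([1, -2]))
```

The range check stands in for the narrowing case (int64 to int32), where a value that does not fit has to be reported rather than silently truncated.
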
Client-side File Caching for MPI I/O
Traditional client-side file caching
  – Treats each client independently, targeting distributed environments
  – Inadequate for parallel environments, where clients are most likely related to each other (e.g. reading/writing shared files)
Collective caching
  – Application processes cooperate with each other to perform data caching and coherence control (leaving the I/O servers out of the task)
[Figure: labels: client processors, local cache buffers, global cache pool, network interconnect, I/O servers.]

Design of Collective Caching
Caching sub-system is implemented in user space
  – Built at the MPI I/O level
      – Portable across different file systems
Distributed management
  – For cache metadata and lock control (vs. centralized)
Two designs:
  – Using an I/O thread
  – Using the MPI remote-memory-access (RMA) facility
[Figure: software layers. Labels: user space, system space, application process, MPI library, MPI I/O, collective caching, client-side file system, network, server-side file system.]
[Figure: file logical partitioning into blocks (block 0, block 1, ...). Distributed cache metadata across processes P0 to P3: P0 holds the status of blocks 0, 4, 8; P1 of blocks 1, 5, 9; P2 of blocks 2, 6, 10; P3 of blocks 3, 7, 11. Global cache pool: local memory pages 1 to 3 on each of P0 to P3.]

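The metadata figure shows block status assigned round-robin: with four processes, P0 holds blocks 0, 4, 8, P1 holds blocks 1, 5, 9, and so on. A minimal sketch of that mapping, assuming a fixed block size (the slides do not state one) and hypothetical function names:

```python
def metadata_owner(block, nprocs):
    """Rank holding the cache status of a file block (round-robin)."""
    return block % nprocs


def blocks_touched(offset, length, block_size):
    """Indices of the file blocks covered by a byte range."""
    if length <= 0:
        return []
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return list(range(first, last + 1))


if __name__ == "__main__":
    block_size = 4096  # assumed value, for illustration only
    for block in blocks_touched(2 * block_size + 10, block_size, block_size):
        print("block", block, "-> status held by P%d" % metadata_owner(block, 4))
```

Because the owner is computed from the block index alone, any process can find where a block's status lives without consulting a central manager.
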
Performance Results 1
IBM SP at SDSC using GPFS
  – System peak performance: 2.1 GB/s for reads, 1 GB/s for writes
Sliding-window benchmark
  – I/O requests are overlapped
  – Can cause cache coherence problems

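The slides do not give the exact access pattern of the sliding-window benchmark. The sketch below assumes one simple form, purely for illustration: rank r accesses the byte range starting at r * stride with a fixed window length, and the stride is smaller than the window, so neighbouring ranks overlap. All names and parameters are hypothetical. It reports which cache blocks are touched by more than one rank, which is where coherence control becomes necessary.

```python
def window_range(rank, window, stride):
    """Half-open byte range [start, end) accessed by a rank (assumed pattern)."""
    start = rank * stride
    return start, start + window


def overlap(a, b):
    """Intersection of two half-open ranges, or None if they are disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None


def shared_blocks(nprocs, window, stride, block_size):
    """Map block index -> ranks, for blocks touched by more than one rank."""
    touched = {}
    for rank in range(nprocs):
        start, end = window_range(rank, window, stride)
        for block in range(start // block_size, (end - 1) // block_size + 1):
            touched.setdefault(block, []).append(rank)
    return {b: ranks for b, ranks in touched.items() if len(ranks) > 1}


if __name__ == "__main__":
    print(overlap(window_range(0, 8, 4), window_range(1, 8, 4)))
    print(shared_blocks(nprocs=3, window=8, stride=4, block_size=4))
```

Note that a block can be shared even when the byte ranges themselves do not overlap, if two ranges fall within the same block.
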
Performance Results 2
BTIO benchmark
  – From the NAS Parallel Benchmarks version 2.4 (NASA Ames Research Center)
  – Block Tri-diagonal array partitioning pattern
  – Uses MPI collective I/O calls
  – I/O requests are not overlapped
FLASH I/O benchmark
  – From U. of Chicago, ASCI Alliances Center
  – Access pattern is non-contiguous both in memory and in file
  – Uses HDF5
  – I/O requests are not overlapped
[Figure: four panels comparing Original and Collective Caching, each plotting I/O bandwidth in MB/s against number of nodes. Panels: FLASH I/O - 8 x 8 x; FLASH I/O - 16 x 16 x; BTIO Benchmark - class A; BTIO Benchmark - class B.]