Parallel I/O Middleware Optimizations and Future Directions
SciDAC SDM Center All Hands Meeting, October 5-7, 2005
Northwestern University
PIs: Alok Choudhary, Wei-keng Liao
Graduate Students: Jianwei Li, Avery Ching, Kenin Coloma
ANL Collaborators: Bill Gropp, Rob Ross, Rajeev Thakur, Rob Latham
Outline
Progress and accomplishments – Wei-keng Liao
– Parallel netCDF
– Client-side file caching in MPI-IO
– Data-type I/O for non-contiguous file access in PVFS
Future research directions – Alok Choudhary
– I/O middleware
– Autonomic and active storage systems
Parallel netCDF
NetCDF defines:
– A set of APIs for file access
– A machine-independent file format
Parallel netCDF work:
– New APIs for parallel access
– Maintains the same file format
Tasks:
– Built on top of MPI for portability and high performance
– Support C and Fortran interfaces
– Support external data representations
[Figure: processes P0-P3 accessing a parallel file system through serial netCDF vs. through parallel netCDF]
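As a concrete illustration, the following is a minimal sketch of the parallel access model, assuming the standard PnetCDF high-level API: all processes open one shared file through MPI and write their own subarrays collectively. The file name, dimensions, and variable are illustrative, and error checking is omitted for brevity.

/* Minimal PnetCDF sketch: each MPI process writes its own row of a
 * 2-D integer array to a shared file using the high-level API. */
#include <mpi.h>
#include <pnetcdf.h>

int main(int argc, char *argv[]) {
    int rank, nprocs, ncid, dimids[2], varid, buf[4] = {0, 1, 2, 3};
    MPI_Offset start[2], count[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* All processes collectively create the file and define its schema */
    ncmpi_create(MPI_COMM_WORLD, "var.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "y", nprocs, &dimids[0]);
    ncmpi_def_dim(ncid, "x", 4, &dimids[1]);
    ncmpi_def_var(ncid, "data", NC_INT, 2, dimids, &varid);
    ncmpi_enddef(ncid);

    /* Each process writes one row (indexed by its rank) collectively */
    start[0] = rank; start[1] = 0;
    count[0] = 1;    count[1] = 4;
    ncmpi_put_vara_int_all(ncid, varid, start, count, buf);

    ncmpi_close(ncid);
    MPI_Finalize();
    return 0;
}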
PnetCDF Current Status
Current version released on July 27, 2005
Supported platforms:
– Linux clusters, IBM SP, SGI Origin, Cray X, NEC SX
Two sets of parallel APIs are complete:
– High-level APIs (mimicking the serial netCDF APIs)
– Flexible APIs (extended to utilize MPI derived datatypes)
Both fully supported in C and Fortran
Support for large files (> 4 GB)
Test suites:
– Self-test codes ported from the Unidata netCDF package to validate against single-process results
– Parallel test codes for both sets of APIs
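The flexible APIs differ from the high-level ones in that the memory buffer is described by an MPI derived datatype, so a non-contiguous user buffer can be written without packing. A small sketch, reusing ncid, varid, start, and count from the previous example with an illustrative strided buffer:

/* Flexible-API sketch: select every other integer of an 8-element
 * buffer with an MPI derived datatype and write the 4 selected
 * values into the same (start, count) region as above. */
int buf2[8] = {0};
MPI_Datatype strided;

MPI_Type_vector(4, 1, 2, MPI_INT, &strided);  /* 4 blocks of 1 int, stride 2 */
MPI_Type_commit(&strided);

ncmpi_put_vara_all(ncid, varid, start, count, buf2, 1, strided);

MPI_Type_free(&strided);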
Illustrative PnetCDF Users
– FLASH – astrophysical thermonuclear application from the ASCI/Alliances Center at the University of Chicago
– ACTM – atmospheric chemical transport model, LLNL
– WRF-ROMS – regional ocean model system I/O module from the Scientific Data Technologies group, NCSA
– ASPECT – data understanding infrastructure, ORNL
– pVTK – parallel visualization toolkit, ORNL
– PETSc – portable, extensible toolkit for scientific computation, ANL
– PRISM – PRogram for Integrated Earth System Modeling, users from C&C Research Laboratories, NEC Europe Ltd.
– ESMF – Earth System Modeling Framework, National Center for Atmospheric Research
– More …
PnetCDF Future Work
– Non-blocking I/O APIs
– Performance improvement for data type conversion
  – Type conversion while packing non-contiguous buffers
– Extending PnetCDF to newer applications, e.g., data analysis and mining
– Collaboration with application users
File Caching in MPI-IO
[Figure: I/O software stack – applications, Parallel netCDF, MPI-IO, PVFS, storage devices]
File Caching for Parallel Apps
Why file caching?
– Improves performance for repeated file access
– Enables a write-behind strategy
  – Accumulates multiple small writes to better utilize network bandwidth
  – May balance the workload for irregular I/O patterns
  – Useful for checkpointing
– Enables data pre-fetching
  – Useful for read-only applications (parallel data mining, visualization)
Why not just use traditional caching strategies?
– Each client caches independently → cache incoherence
– I/O servers are in charge of cache coherence control → potential I/O serialization
– Inadequate for parallel environments where application clients frequently read/write shared files
Caching Sub-system in MPI-IO
Application-aware file caching:
– A user-level implementation in the MPI-IO library
– MPI communicators define the subsets of processes operating on a shared file
– Processes cooperate with each other to perform caching
– Data cached in one client can be directly accessed by another
– Moves cache coherence control from servers to clients
– Distributed coherence control (less overhead)
Supports both collective and independent I/O
[Figure: client processors with local cache buffers forming a global cache pool, connected to I/O servers over the network interconnect]
Design
Cache metadata:
– File-block-based granularity
– Stored cyclically across all processes
Global cache pool:
– Comprises the local memory of all processes
– Single copy of file data to avoid coherence issues
Two implementations:
– Using an I/O thread (POSIX thread)
– Using the MPI remote-memory-access (RMA) facility (see the sketch below)
[Figure: the file is logically partitioned into blocks; per-block status metadata is distributed cyclically across processes P0-P3; each process contributes local memory pages to the global cache pool]
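The slides do not spell out the RMA implementation, so the following is only a minimal sketch, assuming each process exposes its local cache pages through an MPI window so a peer can fetch a cached block with a passive-target MPI_Get. The names cache_pool_init and cache_fetch_block, as well as PAGE_SIZE and NPAGES, are hypothetical; in the real sub-system the (owner, page) location would come from the distributed cache metadata.

/* Sketch of the RMA-based flavor: every process exposes its local cache
 * pages through an MPI window, so a remote process can fetch a cached
 * file block with MPI_Get under a passive-target lock. */
#include <mpi.h>

#define PAGE_SIZE 65536
#define NPAGES    16

static char   *local_pages;   /* this process's contribution to the pool */
static MPI_Win cache_win;     /* window exposing local_pages to all peers */

void cache_pool_init(MPI_Comm comm) {
    MPI_Alloc_mem((MPI_Aint)PAGE_SIZE * NPAGES, MPI_INFO_NULL, &local_pages);
    MPI_Win_create(local_pages, (MPI_Aint)PAGE_SIZE * NPAGES, 1,
                   MPI_INFO_NULL, comm, &cache_win);
}

/* Fetch one cached page (file block) held by process `owner` into dest. */
void cache_fetch_block(int owner, int page, char *dest) {
    MPI_Win_lock(MPI_LOCK_SHARED, owner, 0, cache_win);
    MPI_Get(dest, PAGE_SIZE, MPI_BYTE,
            owner, (MPI_Aint)page * PAGE_SIZE, PAGE_SIZE, MPI_BYTE, cache_win);
    MPI_Win_unlock(owner, cache_win);   /* unlock completes the transfer */
}

void cache_pool_finalize(void) {
    MPI_Win_free(&cache_win);
    MPI_Free_mem(local_pages);
}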
Example Read Operation
[Figure: read path – a metadata lookup in the distributed metadata finds the status of the requested block; the block is locked, then served from the global pool if already cached (e.g., block 3) or fetched into a local page if not yet cached, and finally unlocked]
Future Work
– Data pre-fetching
  – Instructional (through MPI info) and non-instructional (based on sequential access)
– Collective write-behind for data checkpointing
– Stand-alone distributed lock sub-system
  – Using the MPI-2 remote-memory-access facility (a lock sketch follows below)
– Design of new MPI file hints for caching
– Application I/O pattern study
  – Structured/unstructured AMR
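The slides only name the goal of an RMA-based lock sub-system; as a point of reference, here is a minimal sketch of one known MPI-2 technique, a per-process waitlist kept in a window on a home process, in the spirit of the ANL one-sided mutex algorithms. It is an illustrative design, not the sub-system's actual implementation; mutex_acquire, mutex_release, and the window layout are assumptions.

/* The home process owns a window of nprocs bytes, all initialized to 0;
 * the other processes attach to the same window with size 0. */
#include <mpi.h>

void mutex_acquire(MPI_Win win, int home, int rank, int nprocs, MPI_Comm comm)
{
    unsigned char waitlist[nprocs], one = 1;
    int i, someone_has_it = 0;

    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, home, 0, win);
    MPI_Put(&one, 1, MPI_BYTE, home, rank, 1, MPI_BYTE, win);
    /* read everyone else's flag; our own byte is skipped because it was
     * just updated by the Put in this same epoch */
    if (rank > 0)
        MPI_Get(waitlist, rank, MPI_BYTE, home, 0, rank, MPI_BYTE, win);
    if (rank < nprocs - 1)
        MPI_Get(waitlist + rank + 1, nprocs - rank - 1, MPI_BYTE,
                home, rank + 1, nprocs - rank - 1, MPI_BYTE, win);
    MPI_Win_unlock(home, win);

    waitlist[rank] = 0;                 /* our own slot was not read */
    for (i = 0; i < nprocs; i++)
        if (waitlist[i]) { someone_has_it = 1; break; }

    if (someone_has_it) {               /* block until the holder hands over */
        unsigned char token;
        MPI_Recv(&token, 1, MPI_BYTE, MPI_ANY_SOURCE, 0, comm,
                 MPI_STATUS_IGNORE);
    }
}

void mutex_release(MPI_Win win, int home, int rank, int nprocs, MPI_Comm comm)
{
    unsigned char waitlist[nprocs], zero = 0, token = 0;
    int i;

    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, home, 0, win);
    MPI_Put(&zero, 1, MPI_BYTE, home, rank, 1, MPI_BYTE, win);
    if (rank > 0)
        MPI_Get(waitlist, rank, MPI_BYTE, home, 0, rank, MPI_BYTE, win);
    if (rank < nprocs - 1)
        MPI_Get(waitlist + rank + 1, nprocs - rank - 1, MPI_BYTE,
                home, rank + 1, nprocs - rank - 1, MPI_BYTE, win);
    MPI_Win_unlock(home, win);

    waitlist[rank] = 0;
    for (i = 0; i < nprocs; i++)        /* pass the lock to one waiter, if any */
        if (waitlist[i]) {
            MPI_Send(&token, 1, MPI_BYTE, i, 0, comm);
            return;
        }
}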
Data-type I/O in PVFS
[Figure: I/O software stack – applications, Parallel netCDF, MPI-IO, PVFS, storage devices]
Non-contiguous I/O
Four types:
– Contiguous both in memory and in file
– Contiguous in memory, non-contiguous in file
– Non-contiguous in memory, contiguous in file
– Non-contiguous both in memory and in file
Each segment is an I/O request of (offset, length)
[Figure: memory and file layouts for the four access types]
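Today the second case (contiguous memory, non-contiguous file) is typically expressed with an MPI derived datatype as a file view, the same idea that data-type I/O later pushes down to the file system. A minimal sketch follows; the file name, block counts, and stride are illustrative.

/* Sketch of case 2 (contiguous in memory, non-contiguous in file):
 * an MPI_Type_vector file view lets MPI-IO carry the whole strided
 * access as one request instead of many (offset, length) segments. */
#include <mpi.h>

void strided_write(MPI_Comm comm, double *buf) {
    MPI_File     fh;
    MPI_Datatype filetype;
    int          rank;

    MPI_Comm_rank(comm, &rank);

    /* 64 blocks of 16 doubles, separated by a stride of 64 doubles */
    MPI_Type_vector(64, 16, 64, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(comm, "strided.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* each process starts at its own 16-double offset within the
     * 64-double stride (this sketch assumes at most 4 processes) */
    MPI_File_set_view(fh, (MPI_Offset)rank * 16 * sizeof(double),
                      MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    /* one collective call covers all 64 non-contiguous file segments */
    MPI_File_write_all(fh, buf, 64 * 16, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
}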
Implementations
POSIX I/O:
– One call per (offset, length)
– Generates a large number of I/O requests
Data sieving:
– A single (offset, length) covering multiple segments
– Accesses unused data and introduces consistency control overhead
List I/O:
– Single calls handle multiple non-contiguous accesses
– Passes multiple (offset, length) pairs across the network
[Figure: POSIX I/O issues one request per segment to the client-side file system; list I/O sends a single multi-segment request over the network to the server-side file system]
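To make the data-sieving trade-off concrete, here is a minimal sketch of the read side, assuming the segments are already sorted by file offset: one large contiguous read covers the whole extent and the wanted pieces are copied out, so everything between the segments is transferred but unused (writes additionally need read-modify-write under a lock, not shown). The struct and function names are illustrative.

/* Data-sieving read sketch: a single pread() spans from the first to
 * the last segment, then only the requested pieces are copied out. */
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

struct segment { off_t offset; size_t length; char *buf; };

/* Segments must be sorted by offset; returns 0 on success. */
int sieved_read(int fd, struct segment *seg, int nseg) {
    off_t  start  = seg[0].offset;
    size_t extent = seg[nseg - 1].offset + seg[nseg - 1].length - start;

    char *scratch = malloc(extent);
    if (scratch == NULL)
        return -1;

    /* single large request instead of nseg small ones */
    if (pread(fd, scratch, extent, start) != (ssize_t)extent) {
        free(scratch);
        return -1;
    }

    for (int i = 0; i < nseg; i++)      /* copy out only the useful parts */
        memcpy(seg[i].buf, scratch + (seg[i].offset - start), seg[i].length);

    free(scratch);
    return 0;
}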
Data-type I/O
Single requests all the way to the servers
Abandons the offset-length pair representation:
– Borrows the MPI datatype concept to describe non-contiguous access patterns
– New file system data types
– New file system interfaces
An implementation in PVFS:
– Both client and server sides
[Figure: the application process sends a single datatype I/O request through the PVFS client to the PVFS server over the network]
Summary of Accomplishments
High-level I/O:
– Parallel netCDF
Low-level I/O:
– MPI-IO file caching
Parallel file system:
– Data-type I/O in PVFS
[Figure: I/O software stack – Parallel netCDF, MPI-IO, PVFS]
Future Research
Typical Components in I/O Systems
Based on many current applications:
High-level
– E.g., netCDF, HDF, ABC
– Applications use these
Mid-level
– E.g., MPI-IO
– Performance experience
Low-level
– E.g., file systems
– Critical for the performance of the layers above
More access information is lost as more components are used
[Figure: compute nodes running applications, high-level libraries (Parallel netCDF, HDF5, ...), MPI-IO, and the client-side file system, connected over a network to multiple I/O servers; end-to-end performance is critical]
[Figure: the I/O stack of the previous slide (applications, Parallel netCDF/HDF5, MPI-IO, client-side file system, network, I/O servers) annotated layer by layer with the following]
– Collectives, independents; I/O hints: access style (read_once, write_mostly, sequential, random, ...), collective buffering, chunking, striping
– Open modes (O_RDONLY, O_WRONLY, O_SYNC), file status, locking, flushing, cache invalidation
– Machine dependent: data shipping, sparse access, double buffering
– Access based on file blocks or objects; scheduling, aggregation
– Read-ahead, write-behind, metadata management, file striping, security, redundancy
– Save attributes along with data, external data types (byte alignment), data structures (flexible dimensionality), hierarchical data model
– Access patterns: shared files, individual files, data partitioning, checkpointing, data structures, inter-data relationships
– Application-aware caching, pre-fetching, file grouping, "vector of bytes", flexible caching control, object-based data alignment, memory-file layout mapping, more control over hardware
– Shared file descriptors, group locks, flexible locking control, scalable metadata management, zero-copying, QoS
– Active storage: data filtering, object-based/hierarchical storage management, indexing, mining, power management
– Caching, fault tolerance, read-ahead, write-behind, I/O load balancing, wide-area and heterogeneous FS support, thread safety
– Graph-based data model
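As a concrete reference for the hint mechanism named above, the sketch below shows how such hints are passed to MPI-IO today through an MPI_Info object. The keys "access_style", "collective_buffering", "cb_buffer_size", and "striping_factor" are reserved hints in the MPI standard; the values are illustrative and an implementation is free to ignore them.

/* Sketch: passing MPI-IO hints at file-open time via MPI_Info. */
#include <mpi.h>

MPI_File open_with_hints(MPI_Comm comm, char *path) {
    MPI_Info info;
    MPI_File fh;

    MPI_Info_create(&info);
    MPI_Info_set(info, "access_style", "write_mostly,sequential");
    MPI_Info_set(info, "collective_buffering", "true");
    MPI_Info_set(info, "cb_buffer_size", "8388608");   /* 8 MB aggregation */
    MPI_Info_set(info, "striping_factor", "16");       /* 16 I/O servers   */

    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);
    return fh;
}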
Goal: Decouple "What" from "How" and Be Proactive
Current situation: user burdened, ineffective interfaces, non-communicating layers
[Figure: applications (App1-App4) with diverse workloads (streaming, small/large, regular/irregular, local/remote, varying configurations and s/w layers) served by an I/O software optimization layer that understands the applications and applies caching, collective I/O, reorganization, load balancing, and fault tolerance across file systems, data management, datasets, and hierarchical storage, targeting speed, bandwidth, latency, and QoS]
Component Design for I/O
Application-aware:
– Capture the application's file access information
– Relationships between files, objects, and users
Environment-aware:
– Network (reliability, security), storage devices (active disks)
Context-aware:
– Binding data attributes to files, indexing for fast search
High-performance I/O needs support from:
– Languages and compilers
– I/O libraries
– File systems
– Storage devices
Component Interface Design
Informative:
– Should deliver access/storage information top-down and bottom-up
Flexibility:
– Should describe arbitrary data distributions in memory buffers, files, and storage devices
Functionality:
– Asynchronous operations, read-ahead, write-behind, replication
– Provides the ability for additional innovation
Object-based I/O:
– For hardware control (I/O co-processors, active disks, object-based file systems, etc.)
Future Work in MPI-IO
Investigate interface extensions
Client-side caching sub-system:
– Implementations for various I/O strategies: buffering, pre-fetching, replication, migration
– Adaptive caching mechanisms and algorithms for optimizing different access patterns
Distributed mutual-exclusion locking sub-system:
– Shared resources, such as files and memory
– Pipelined locking (overlap lock waiting time with I/O)
Work with HDF5 and Parallel netCDF:
– Design I/O strategies for metadata and data
  – Metadata: small, overlapping, repeated, strong consistency requirements
  – Array data: large, less frequent updates
Future Work in Parallel File Systems
File caching (focus on parallel applications)
File versioning:
– An alternative to file locking
– Reliability and availability aspects
  – Guarantees atomicity in the presence of client or I/O system failures
  – Can enable efficient RAID-type schemes in parallel file systems (because of atomicity)
Dynamic rebalancing of I/O
File list locks:
– Locks on multiple regions in a single request
Active Storage System (reconfigurable system)
[Figure: testbed – an ML310 host and ML310 boards 1-4 connected through a switch to the external network]
Xilinx XC2VP30, Virtex-II Pro family:
– 30,816 logic cells (3,424 CLBs)
– 2 PPC405 embedded cores
– 2,448 Kb of BRAM (136 × 18 Kb blocks)
– 136 dedicated 18x18 multiplier blocks
Software:
– Data mining
– Encryption
– Functions and runtime libraries
– Linux micro-kernel
MineBench – data mining benchmark suite