1 Cplant I/O. Pang Chen, Lee Ward. Sandia National Laboratories, Scalable Computing Systems. Fifth NASA/DOE Joint PC Cluster Computing Conference, October 6-8, 1999.
2 Conceptual Partition Model
3 File I/O Model Support large-scale unstructured grid applications. –Manipulate single file per application, not per processor. Support collective I/O libraries. –Require fast concurrent writes to a single file.
4 Problems Need a file system NOW! Need scalable, parallel I/O. Need file management infrastructure. Need to present the I/O subsystem as a single parallel file system both internally and externally. Need production-quality code.
5 Approaches Provide independent access to file systems on each I/O node. –Can’t stripe across multiple I/O nodes to get better performance. Add a file management layer to “glue” the independent file systems so as to present a single file view. –Require users (both on and off Cplant) to differentiate between this “special” file system and other “normal” file systems. –Lots of special utilities are required. Build our own parallel file system from scratch. –A lot of work just to reinvent the wheel, let alone the right wheel. Port other parallel file systems into Cplant. –Also a lot of work with no immediate payoff.
6 Current Approach Build our I/O partition as a scalable nexus between Cplant and external file systems. +Leverage off existing and future parallel file systems. +Allow immediate payoff with Cplant accessing existing file systems. +Reduce data storage, copies, and management. –Expect lower performance with non-local file systems. –Waste external bandwidth when accessing scratch files.
7 Building the Nexus Semantics –How can and should the compute partition use this service? Architecture –What are the components and the protocols between them? Implementation –What do we have now, and what do we hope to achieve in the future?
8 Compute Partition Semantics POSIX-like. –Allow users to be in a familiar environment. No support for ordered operations (e.g., no O_APPEND). No support for data locking. –Enable fast non-overlapping concurrent writes to a single file. –Prevent a job from slowing down the entire system for others. Additional call to invalidate buffer cache. –Allow file views to synchronize when required.
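The "fast non-overlapping concurrent writes to a single file" point can be illustrated with a small sketch. It is not Cplant code: it assumes a local POSIX file system and uses threads in place of compute nodes, and the function name and parameters are made up for illustration. Each writer uses an explicit offset, so no shared file pointer, append ordering, or lock is involved.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_disjoint(path, num_writers, block_size):
    """Let num_writers threads each fill their own byte range of one file.

    Hypothetical helper: threads stand in for compute nodes, and the
    ranges never overlap, so no locking is needed.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        def writer(rank):
            # One distinct letter per writer so the result is easy to check.
            payload = bytes([ord("A") + rank % 26]) * block_size
            # pwrite takes an explicit offset; the file position is untouched.
            return os.pwrite(fd, payload, rank * block_size)

        with ThreadPoolExecutor(max_workers=num_writers) as pool:
            return list(pool.map(writer, range(num_writers)))
    finally:
        os.close(fd)


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "shared.dat")
    print(write_disjoint(target, 4, 8))
```

Because every writer owns its range, the final file content does not depend on the order in which the writes complete, which is why ordered operations such as O_APPEND are not needed for this access pattern.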
9 Cplant I/O I/O Enterprise Storage Services
10 Architecture I/O nodes present a symmetric view. –Every I/O node behaves the same (except for the cache). –Without any control, a compute node may open a file with one I/O node, and write that file via another I/O node. I/O partition is fault-tolerant and scalable. –Any I/O node can go down without the system losing jobs. –Appropriate number of I/O nodes can be added to scale with the compute partition. I/O partition is the nexus for all file I/O. –It provides our POSIX-like semantics to the compute nodes and accomplishes tasks on their behalf outside the compute partition. Links/protocols to external storage servers are server dependent. –External implementation hidden from the compute partition.
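A toy model of the symmetric view, as a sketch only. The class and function names are hypothetical, and a shared dictionary stands in for the external file server. Every I/O node accepts any file handle, so a write can go through whichever node is reachable, and a node going down does not lose the job.

```python
class IONode:
    """Hypothetical I/O node: a proxy in front of one shared backing store."""

    def __init__(self, name, store, up=True):
        self.name = name
        self.store = store  # dict standing in for the external file server
        self.up = up

    def write(self, handle, offset, data):
        if not self.up:
            raise ConnectionError(self.name)
        buf = self.store.setdefault(handle, bytearray())
        end = offset + len(data)
        if len(buf) < end:
            buf.extend(b"\0" * (end - len(buf)))  # zero-fill any gap
        buf[offset:end] = data
        return len(data)


def write_with_failover(nodes, handle, offset, data):
    """Try each I/O node in turn; all nodes are equivalent, so any will do."""
    for node in nodes:
        try:
            return node.name, node.write(handle, offset, data)
        except ConnectionError:
            continue  # this node is down, try the next one
    raise ConnectionError("no I/O node reachable")


if __name__ == "__main__":
    backing = {}
    cluster = [IONode("io0", backing, up=False), IONode("io1", backing)]
    print(write_with_failover(cluster, "fh1", 4, b"abcd"))
```

The request carries the handle, offset, and data, so repeating it through a different node leaves the file in the same state.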
11 Compute -- I/O node protocol Base protocol is NFS version 2. –Stateless protocols allow us to repair faulty I/O nodes without aborting applications. –Inefficiency/latency between the two partitions is currently moot; the bottleneck is not here. Extensions/modifications: –Larger I/O requests. –Propagation of a call to invalidate cache on the I/O node.
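To see why larger I/O requests matter, a rough count of round trips helps. The NFS version 2 specification (RFC 1094) caps the data in one READ or WRITE at 8192 bytes. The 64 KiB figure below is only an assumed example of a "larger" request, since the slides do not state the size actually used.

```python
NFS2_MAX_DATA = 8192          # bytes per READ/WRITE in NFS version 2 (RFC 1094)
ASSUMED_LARGE_REQUEST = 65536  # hypothetical extended request size


def requests_needed(transfer_bytes, request_bytes):
    """Number of requests needed to move transfer_bytes (ceiling division)."""
    return (transfer_bytes + request_bytes - 1) // request_bytes


if __name__ == "__main__":
    one_mib = 1 << 20
    print(requests_needed(one_mib, NFS2_MAX_DATA))
    print(requests_needed(one_mib, ASSUMED_LARGE_REQUEST))
```

Under these assumptions a 1 MiB transfer drops from 128 requests to 16, so per-request overhead is paid one eighth as often.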
12 Current Implementation Basic implementation of the I/O nodes. Have straight NFS inside Linux with ability to invalidate cache. I/O nodes have no cache. I/O nodes are dumb proxies knowing only about one server. Credentials are rewritten by the I/O nodes and sent to the server as if the requests came from the I/O nodes. I/O nodes are attached via 100BaseT links to Gigabit Ethernet, with an SGI O2K as the (XFS) file server on the other end. Don’t have jumbo packets. Bandwidth is about 30 MB/s with 18 clients driving 3 I/O nodes, each using about 15% of CPU.
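The bandwidth figure can be put in context with some back-of-the-envelope arithmetic. The sketch below assumes one 100BaseT link per I/O node, an even spread of traffic, and 1 MB = 10^6 bytes; none of these is stated on the slide, and the function name is illustrative.

```python
def summarize(total_mb_s, io_nodes, clients, link_mbit_s):
    """Split an aggregate bandwidth figure per I/O node and per client."""
    per_node = total_mb_s / io_nodes
    per_client = total_mb_s / clients
    link_mb_s = link_mbit_s / 8  # raw line rate in MB/s, before protocol overhead
    return {
        "per_node_mb_s": per_node,
        "per_client_mb_s": per_client,
        "link_utilization": per_node / link_mb_s,
    }


if __name__ == "__main__":
    # Figures from the slide: about 30 MB/s, 3 I/O nodes, 18 clients, 100BaseT.
    print(summarize(30, 3, 18, 100))
```

Under those assumptions each I/O node moves about 10 MB/s against a raw line rate of 12.5 MB/s, roughly 80% of the link, which suggests the 100BaseT attachment is already close to saturated.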
13 Current Improvements Put a VFS infrastructure into I/O node daemon. –Allow access to multiple servers. –Allow a Linux /proc interface to tune individual I/O nodes quickly and easily. –Allow vnode identification to associate buffer cache with files. Experiment with a multi-node server (SGI/CXFS).
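One way a VFS-style layer could give an I/O node access to multiple servers is a mount table that routes each path to a backend by longest matching prefix. This is a sketch of that general idea, not the daemon's actual design; the class, the method names, and the server labels are all hypothetical.

```python
class MountTable:
    """Hypothetical mount table: maps path prefixes to backend servers."""

    def __init__(self):
        self._mounts = {}

    def mount(self, prefix, server):
        # Normalize "/scratch/" to "/scratch", but keep the root as "/".
        self._mounts[prefix.rstrip("/") or "/"] = server

    def resolve(self, path):
        """Return (server, path relative to the mount point)."""
        best = None
        for prefix in self._mounts:
            # Match whole path components only, so /scratchy is not /scratch.
            if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
                if best is None or len(prefix) > len(best):
                    best = prefix
        if best is None:
            raise LookupError(path)
        return self._mounts[best], path[len(best):].lstrip("/")


if __name__ == "__main__":
    table = MountTable()
    table.mount("/", "xfs-server")
    table.mount("/scratch", "cxfs-server")
    print(table.resolve("/scratch/run1/out.dat"))
```

With such a table the compute partition keeps seeing a single name space while the I/O node decides which external server handles each request.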
14 Future Improvements Stop retries from going out of network. Put in jumbo packets. Put in read cache. Put in write cache. Port over Portals 3.0. Put in bulk data services. Allow dynamic compute-node-to-I/O-node mapping.
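Dynamic compute-node-to-I/O-node mapping could take many forms; the slides do not describe one. The sketch below assumes the simplest policy, round-robin over whichever I/O nodes are currently up, with hypothetical names throughout.

```python
from collections import Counter


def map_compute_to_io(compute_ids, io_nodes, down=()):
    """Assign each compute node to a live I/O node, round-robin."""
    live = [n for n in io_nodes if n not in down]
    if not live:
        raise RuntimeError("no I/O node available")
    return {c: live[i % len(live)] for i, c in enumerate(compute_ids)}


def load_per_node(mapping):
    """Count how many compute nodes each I/O node serves."""
    return Counter(mapping.values())


if __name__ == "__main__":
    computes = list(range(18))
    ios = ["io0", "io1", "io2"]
    print(load_per_node(map_compute_to_io(computes, ios)))
    print(load_per_node(map_compute_to_io(computes, ios, down={"io1"})))
```

With 18 compute nodes and 3 I/O nodes this gives 6 clients per node; if one I/O node is marked down, recomputing the mapping spreads the same 18 clients over the remaining two, 9 each.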
15 Looking for Collaborations Lee Ward Pang Chen