RFC: Breaking Free from the Collective Requirement for HDF5 Metadata Operations
Parallel I/O
One or more processes accessing data on disk at the same time (usually within an application).
Used on platforms that access data on parallel file systems (Lustre, GPFS, PVFS2):
– Multiple I/O servers
– Data is striped across servers
Several parallel programming models are available:
– MPI is the dominant one
Message Passing Interface
MPI: standardized interface for message passing between processes:
– point-to-point communication
– collective communication
– one-sided communication (Remote Memory Access)
– derived datatypes
– process grouping (MPI Communicator)
– parallel I/O standard (MPI-I/O)
MPI-I/O
Standard interface for doing I/O within MPI.
Relaxed consistency semantics:
– scope
– synchronization
Concept of a file view, built with derived datatypes.
Individual and collective I/O operations.
Blocking and non-blocking I/O operations.
Example
4 processes accessing a 2D matrix stored in row-major format.
Each process can define the portion it will access as its file view (optional).
Access can also be defined at I/O time using derived datatypes (similar to HDF5 file selections).
Individual I/O: each process posts its access to disk independently.
Collective I/O: all processes participate in the I/O operation.
Different algorithms under the hood optimize the I/O access to disk.
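A minimal MPI-I/O sketch of this example, assuming 4 processes that each own one contiguous block of rows; the matrix dimensions, file name "matrix.dat", and decomposition are illustrative assumptions, not part of the RFC:

#include <mpi.h>
#include <stdlib.h>

#define ROWS 8
#define COLS 8

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);   /* assumed to be 4 */

    int local_rows = ROWS / nprocs;           /* assumes nprocs divides ROWS */
    int *local = malloc(local_rows * COLS * sizeof(int));
    for (int i = 0; i < local_rows * COLS; i++) local[i] = rank;

    /* Describe this process's block of the global matrix as a subarray. */
    int gsizes[2] = {ROWS, COLS};
    int lsizes[2] = {local_rows, COLS};
    int starts[2] = {rank * local_rows, 0};
    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "matrix.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* The file view restricts what this process "sees" in the file. */
    MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);

    /* Collective write: all processes participate, letting the MPI-I/O
       layer merge the accesses (e.g. two-phase I/O). */
    MPI_File_write_all(fh, local, local_rows * COLS, MPI_INT,
                       MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(local);
    MPI_Finalize();
    return 0;
}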
Parallel HDF5 (I)
MPI-I/O offers a limited interface to the user:
– A file is seen as a stream of bytes.
– The user is responsible for structuring the data in the file using MPI derived datatypes, which is very difficult for a scientific application programmer.
HDF5 offers a higher level of access through its different objects and I/O modes.
Parallel HDF5 is layered on top of MPI-I/O to hide the awkwardness of using MPI-I/O directly to access structured data.
The user gets the convenient HDF5 interface, while the HDF5 library still has to provide efficient and optimized access to the data on disk.
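For comparison, a minimal sketch of how an application gets the HDF5 interface on top of MPI-I/O, by selecting the MPI-I/O virtual file driver; the file name is an illustrative assumption:

#include <hdf5.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Tell HDF5 to use MPI-I/O underneath. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

    /* File creation is a collective (metadata-modifying) operation. */
    hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* ... collective or independent dataset I/O goes here ... */

    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}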
Parallel HDF5 (II)
Parallel HDF5 still imposes some limitations on the user:
– PHDF5 allows the user to access raw data on disk independently or collectively; however, operations that modify the file metadata are required to be collective.
While this requirement satisfies the application's needs in some cases, in others it adds significant overhead and restrictions on application developers.
A similar requirement is placed internally on space allocation, where processes need to communicate collectively to be able to allocate space in the file.
RFC
In this RFC, we propose several design options to remove the collective requirement for space allocation and for operations that modify the file metadata. Any solution has to achieve the following:
– Allow an independent (but globally visible), scalable "fetch-and-add" operation on the "end of allocated space" (EOA) value in the file, so that new space in the file can be allocated to an individual process without involving all the other processes.
– Allow [the metadata for] an object to be locked, updated, unlocked, and made visible to other processes without requiring a collective operation, to allow changes to the structure of a file.
Use case 1
A parallel application with n processes, where each process creates a dataset in the root group for its private use. This is currently done by having all processes call the dataset creation function n times, despite the fact that each process needs only one dataset.
[Figure: root group "/" containing datasets P1 through P6, one per process.]
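A hedged sketch of use case 1 as it must be written today: every process takes part in all n collective H5Dcreate2 calls, even though it will only ever use its own dataset. Dataset names, sizes, and the file name are illustrative assumptions:

#include <hdf5.h>
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("percase.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    hsize_t dims[1] = {1024};
    hid_t space = H5Screate_simple(1, dims, NULL);

    hid_t my_dset = -1;
    for (int i = 0; i < nprocs; i++) {
        char name[32];
        snprintf(name, sizeof(name), "P%d", i);
        /* Collective: every process must call H5Dcreate2 for every dataset. */
        hid_t dset = H5Dcreate2(file, name, H5T_NATIVE_INT, space,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        if (i == rank)
            my_dset = dset;   /* the only dataset this process will use */
        else
            H5Dclose(dset);
    }

    /* ... independent raw-data I/O on my_dset ... */

    H5Dclose(my_dset);
    H5Sclose(space);
    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}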
Use case 2
A process within a parallel application does independent reads/writes to a dataset while another process modifies the metadata of that dataset, making the former process's view of it possibly obsolete. Currently this cannot happen, because all processes are required to be present at the time of the metadata modification. However, it becomes possible once the collective requirement on metadata modification is removed, and it should be taken into consideration by any design option.
Use cases 3, 4, 5
Use case 3: collective read/write to a dataset, where all processes that opened the file are required to call the function, whether or not they have data to read/write.
Use case 4: two or more processes each create an attribute on their own dataset. In the current implementation, similar to use case 1, the attribute creation function has to be collective, even though the attributes are being created on different datasets.
Use case 5: two or more processes create an attribute on a shared dataset. A solution that makes this operation independent has to take into account that other processes may create an attribute (or any other object) at the same time, and those objects might overlap in the file if the operations are not serialized.
Space Allocation at EOA
This is a one-sided fetch-and-add (FAD) problem. MPI does not provide a FAD operation, but it does provide one-sided RMA operations (Put, Get, Accumulate). We use an algorithm introduced by Bill Gropp in Using MPI-2 to implement an MPI-based FAD operation.
MPI FAD
Create an MPI window on all processes, with the root process exposing a memory buffer where the fetch-and-add takes place. Every process that wants to allocate space from the EOA executes the following code section to get the current EOA and increment it atomically:

MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
MPI_Accumulate(increment_by_x, nlevels, MPI_INT, 0, 0, 1, acc_type, MPI_SUM, win);
MPI_Get(get_array, nlevels, MPI_INT, 0, 0, 1, get_type, win);
MPI_Win_unlock(0, win);

The current EOA for this process is then its increment plus the sum of all the values in get_array (together with whatever it has contributed in earlier calls, which it tracks locally).
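For illustration, a self-contained sketch of the fetch-and-add using the simpler flat layout (one counter slot per process on the root, rather than the tree of nlevels entries suggested by the code above); the variable names and the 1024-byte request size are assumptions:

/* A minimal sketch, assuming a flat layout: the window on rank 0 holds one
   counter slot per process.  Each process adds its increment into its own
   slot and reads everyone else's slots in the same exclusively locked epoch,
   so the Accumulate and the Get never touch the same location.
   Assumes at least 2 processes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Rank 0 exposes the counter array; the others expose nothing. */
    int *winbuf = NULL;
    MPI_Win win;
    if (rank == 0) {
        MPI_Alloc_mem(nprocs * sizeof(int), MPI_INFO_NULL, &winbuf);
        for (int i = 0; i < nprocs; i++) winbuf[i] = 0;
        MPI_Win_create(winbuf, nprocs * sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    } else {
        MPI_Win_create(NULL, 0, sizeof(int), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);
    }

    /* Datatype covering every slot except our own. */
    int *others = malloc((nprocs - 1) * sizeof(int));
    int *blens  = malloc((nprocs - 1) * sizeof(int));
    int *displs = malloc((nprocs - 1) * sizeof(int));
    for (int i = 0, j = 0; i < nprocs; i++) {
        if (i == rank) continue;
        blens[j] = 1; displs[j] = i; j++;
    }
    MPI_Datatype get_type;
    MPI_Type_indexed(nprocs - 1, blens, displs, MPI_INT, &get_type);
    MPI_Type_commit(&get_type);

    int my_total  = 0;      /* total this rank has added so far */
    int increment = 1024;   /* amount of file space being requested */

    /* The fetch-and-add itself: one exclusively locked access epoch. */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
    MPI_Accumulate(&increment, 1, MPI_INT, 0, rank, 1, MPI_INT, MPI_SUM, win);
    MPI_Get(others, nprocs - 1, MPI_INT, 0, 0, 1, get_type, win);
    MPI_Win_unlock(0, win);

    /* Counter value before this add = everyone else's slots plus our own
       previous contributions; our new allocation starts there. */
    int old_eoa = my_total;
    for (int i = 0; i < nprocs - 1; i++) old_eoa += others[i];
    my_total += increment;

    printf("rank %d: allocated file range [%d, %d)\n",
           rank, old_eoa, old_eoa + increment);

    MPI_Type_free(&get_type);
    MPI_Win_free(&win);
    if (rank == 0) MPI_Free_mem(winbuf);
    free(others); free(blens); free(displs);
    MPI_Finalize();
    return 0;
}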
Performance
Some of the disadvantages of this approach:
– It needs a set-aside process; otherwise the root process will not be able to serve the RMA operations promptly.
– We hit performance issues when a large number of processes try to fetch-and-add at the same time repeatedly; however, we do not see that as a common case.
[Figure: fetch-and-add performance plotted against the number of processes (Num_procs).]
Independent Metadata Operations
Three classes of solutions:
– Distributed Lock Manager (DLM)
– Synchronization Step
– Metadata Server (MDS)
  – Enhanced MDS
  – Data Staging
DLM
Use the DLM to serialize access to an HDF5 object by requiring any process that needs access to the object to obtain a lock (shared or exclusive) first.
Possible DLM implementations:
– RMA & point-to-point
– Point-to-point non-blocking operations
RMA & Point-to-Point (I)
Allocate an MPI window at the DLM with a size of n bits, n being the number of processes.
If a process k wants a lock, then using one RMA access epoch it will:
– get the values of the n-1 bits that correspond to the status of the other processes
– switch the kth bit on to indicate that process k has asked for the lock.
If all n-1 bits are zero, it is safe to acquire the lock; otherwise, the process posts an MPI_Recv from MPI_ANY_SOURCE that blocks until some process tells it that it is safe to acquire the lock.
To release the lock, another access epoch is performed at the root process to switch the kth bit off and to determine which process (if any) the lock needs to be sent to.
Shared locks can be implemented by extending this algorithm to use an n x 2 bit field.
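A hedged sketch of the acquire/release protocol for a single object, loosely following the mutex algorithm in Using MPI-2. Assumptions: rank 0 hosts the flag field, one byte per process is used instead of one bit, at least two processes exist, and only exclusive locks are shown; the real DLM would keep one field per HDF5 object and also ship metadata on release, as described on the next slide.

#include <mpi.h>
#include <stdlib.h>

#define HOST     0    /* rank exposing the flag field */
#define LOCK_TAG 42

static int rank, nprocs;
static MPI_Win win;
static unsigned char *winbuf;    /* flag field, meaningful on HOST only */
static unsigned char *others;    /* the n-1 remote flags fetched each epoch */
static MPI_Datatype others_type; /* every flag except our own */

void dlm_init(void)
{
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == HOST) {
        MPI_Alloc_mem(nprocs, MPI_INFO_NULL, &winbuf);
        for (int i = 0; i < nprocs; i++) winbuf[i] = 0;
        MPI_Win_create(winbuf, nprocs, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    } else {
        MPI_Win_create(NULL, 0, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    }

    /* Indexed type over the other processes' flags, so the Get and the Put
       in one epoch never overlap. */
    others = malloc(nprocs - 1);
    int *blens  = malloc((nprocs - 1) * sizeof(int));
    int *displs = malloc((nprocs - 1) * sizeof(int));
    for (int i = 0, j = 0; i < nprocs; i++) {
        if (i == rank) continue;
        blens[j] = 1; displs[j] = i; j++;
    }
    MPI_Type_indexed(nprocs - 1, blens, displs, MPI_BYTE, &others_type);
    MPI_Type_commit(&others_type);
    free(blens); free(displs);
}

void dlm_acquire(void)
{
    unsigned char one = 1;

    /* One access epoch: read the other flags and raise our own. */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, HOST, 0, win);
    MPI_Get(others, nprocs - 1, MPI_BYTE, HOST, 0, 1, others_type, win);
    MPI_Put(&one, 1, MPI_BYTE, HOST, rank, 1, MPI_BYTE, win);
    MPI_Win_unlock(HOST, win);

    for (int i = 0; i < nprocs - 1; i++) {
        if (others[i]) {
            /* Someone else holds or wants the lock: block until the current
               holder hands it to us with a point-to-point message. */
            MPI_Recv(NULL, 0, MPI_BYTE, MPI_ANY_SOURCE, LOCK_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            return;
        }
    }
    /* All other flags were zero: the lock is ours immediately. */
}

void dlm_release(void)
{
    unsigned char zero = 0;

    /* One access epoch: lower our flag and see who is waiting. */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, HOST, 0, win);
    MPI_Get(others, nprocs - 1, MPI_BYTE, HOST, 0, 1, others_type, win);
    MPI_Put(&zero, 1, MPI_BYTE, HOST, rank, 1, MPI_BYTE, win);
    MPI_Win_unlock(HOST, win);

    /* Hand the lock to one waiting process, if any. */
    for (int i = 0, j = 0; i < nprocs; i++) {
        if (i == rank) continue;
        if (others[j++]) {
            MPI_Send(NULL, 0, MPI_BYTE, i, LOCK_TAG, MPI_COMM_WORLD);
            return;
        }
    }
}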
RMA & Point-to-Point (II)
[Figure: timeline of three processes. The first process to request the lock is granted it immediately; the other two request it and wait. Each holder, on release, sends the lock to a waiting process ("Send lock to 0", "Send lock to 1"), which receives it and later releases it in turn.]
DLM
In addition to granting locks, the DLM must also handle metadata changes. The process releasing the lock is responsible for pushing all the changes it made to the metadata either to the DLM or to the process the lock is being sent to. Once the accumulated changes are large enough, a flush to disk can be initiated by the DLM or by any process that holds the lock on an object.
Since there are multiple objects that need to be locked, the DLM needs to maintain a queue (the n-bit field) for each object.
The design also needs to deal with deadlocks.
Synchronization Step
Each process modifies its metadata cache independently and logs the changes in a list. Whenever a collective operation, or a flush, is issued by the user, the processes enter a sync point and resolve any conflicts that appear in the logs.
This approach is meant for applications where the user guarantees that the use of HDF5 metadata is simple enough that the resolve step does not run into unresolvable conflicts.
The metadata operations are not entirely independent, since a sync step is still required.
MDS (I)
A server opens the file and reads/writes the metadata to disk itself, instead of merely managing access to the metadata (as the DLM does). Clients never touch the metadata on disk. A client, for example, sends a request to the MDS to create a dataset; the MDS creates the dataset and returns it to the client.
Locking is still needed when a client wants to do raw-data reads/writes on a mutable object, to make sure that the structure of the object does not change underneath it.
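Purely for illustration, one way the client/MDS exchange could be structured with plain MPI messages; the request/reply structs, tags, and function names are assumptions, not an existing HDF5 API, and error handling and the actual metadata update are omitted:

#include <mpi.h>
#include <string.h>

#define MDS_RANK    0
#define TAG_REQUEST 1
#define TAG_REPLY   2

typedef struct {
    int  op;          /* e.g. 0 = create dataset */
    char name[64];    /* object name */
} mds_request_t;

typedef struct {
    long heap_addr;   /* where the MDS placed the object header */
    int  status;
} mds_reply_t;

/* Client side: ask the MDS to create a dataset and wait for its address. */
mds_reply_t client_create_dataset(const char *name)
{
    mds_request_t req = { .op = 0 };
    strncpy(req.name, name, sizeof(req.name) - 1);

    mds_reply_t rep;
    MPI_Send(&req, sizeof(req), MPI_BYTE, MDS_RANK, TAG_REQUEST, MPI_COMM_WORLD);
    MPI_Recv(&rep, sizeof(rep), MPI_BYTE, MDS_RANK, TAG_REPLY,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return rep;
}

/* Server side: loop servicing requests; only this rank touches metadata. */
void mds_loop(void)
{
    for (;;) {
        mds_request_t req;
        MPI_Status st;
        MPI_Recv(&req, sizeof(req), MPI_BYTE, MPI_ANY_SOURCE, TAG_REQUEST,
                 MPI_COMM_WORLD, &st);

        mds_reply_t rep = { .heap_addr = 0, .status = 0 };
        /* ... update the metadata cache / file here ... */

        MPI_Send(&rep, sizeof(rep), MPI_BYTE, st.MPI_SOURCE, TAG_REPLY,
                 MPI_COMM_WORLD);
    }
}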
MDS (II)
Having one MDS is not scalable, so at some point we will need more than one. To handle hot objects, we can use a distributed hash table (DHT) to replicate objects across the servers. Clients can still cache metadata and bypass the MDS, or keep the communication with the MDS very simple, in certain situations.
Enhanced MDS (I)
Objects in HDF5 will be marked as private or public. Access to the metadata of public objects has to go through the MDS, whereas access to the metadata of private objects can be done locally. If a private object needs to be made public, the process that holds it must transfer "ownership" to the MDS, which includes either pushing all of the object's metadata to the MDS or writing the metadata to disk and informing the MDS about it.
Enhanced MDS (II)
[Figure: public objects are accessed through the MDS, private objects are accessed locally.]
Data Staging
Similar to what ADIOS does: set aside processes to handle all HDF5 operations (metadata and raw data).
Asynchronous I/O operations:
– Must not interfere with the application
Progress Function
An alternative to setting aside processes: a library on top of MPI that the user calls to make progress. The user is responsible for inserting progress-function calls in the application code to allow the designated "MDS" or "DLM" processes to handle outstanding requests.
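A minimal sketch of what such a call might look like on a process that also acts as an MDS or DLM: each call drains whatever requests have arrived since the last one. The function name hdf5_progress, the tag, and the 256-byte buffer are assumptions, reusing the message layout from the MDS sketch above.

#include <mpi.h>

#define TAG_REQUEST 1

void hdf5_progress(void)
{
    int pending = 1;
    while (pending) {
        MPI_Status st;
        MPI_Iprobe(MPI_ANY_SOURCE, TAG_REQUEST, MPI_COMM_WORLD, &pending, &st);
        if (pending) {
            char buf[256];        /* assumes requests fit in 256 bytes */
            int count;
            MPI_Get_count(&st, MPI_BYTE, &count);
            MPI_Recv(buf, count, MPI_BYTE, st.MPI_SOURCE, st.MPI_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* Service the request here (e.g. hand it to the MDS or DLM
               handler), then loop to drain anything else that is queued. */
        }
    }
}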
Recommendation
As mentioned earlier, this work is still in the design phase. While we have a tentative roadmap for how to proceed with this work, things might change for various reasons, such as new input from the HPC community, funding, etc.
We are looking at proceeding with:
– the Gropp FAD approach for the space allocation problem
– the Enhanced MDS approach for the metadata operations, moving later towards Data Staging.
We would like to provide both options to the user, in terms of either setting aside processes or calling progress functions in the application.