RFC: Breaking Free from the Collective Requirement for HDF5 Metadata Operations

Parallel I/O
One or more processes accessing data on disk at the same time (usually within a single application). Used on platforms with parallel file systems (Lustre, GPFS, PVFS2):
– Multiple I/O servers
– Data is striped across the servers
Several parallel programming models are available:
– MPI is the dominant one

Message Passing Interface
MPI: a standardized interface for message passing between processes:
– point-to-point communication
– collective communication
– one-sided communication (Remote Memory Access)
– derived datatypes
– process grouping (MPI communicators)
– a parallel I/O standard (MPI-I/O)

MPI-I/O
– Standard interface for doing I/O
– Relaxed consistency semantics: scope, synchronization
– Concept of a file view, described with derived datatypes
– Individual and collective I/O operations
– Blocking and non-blocking I/O operations
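A minimal sketch of the calls involved, assuming one contiguous block of doubles per process (the file name, decomposition, and function name are illustrative, not taken from the slides):

#include <mpi.h>

/* Each process writes its own contiguous block of 'count' doubles.
 * MPI_File_write_at is the individual variant; MPI_File_write_at_all, used
 * here, is its collective counterpart. */
static void write_block(MPI_Comm comm, const double *buf, int count)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    MPI_File fh;
    MPI_File_open(comm, "data.bin", MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    MPI_Offset offset = (MPI_Offset)rank * count * (MPI_Offset)sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);   /* collective, blocking */

    MPI_File_close(&fh);
}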

Example
– 4 processes accessing a 2D matrix stored in row-major format
– Each process can define the portion it will access as its file view (optional)
– Access can also be described at I/O time using derived datatypes (similar to HDF5 file selections)
– Individual I/O: each process posts its access to disk independently
– Collective I/O: all processes participate in the I/O operation; different algorithms under the hood optimize the access to disk
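A sketch of this decomposition, assuming the global N x N matrix is split into one block of rows per process (the block shape, file name, and function name are assumptions made for illustration):

#include <mpi.h>

static void write_row_block(MPI_Comm comm, const double *block,
                            int N, const char *path)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    int gsizes[2] = {N, N};                        /* global matrix            */
    int lsizes[2] = {N / nprocs, N};               /* this process's block     */
    int starts[2] = {rank * (N / nprocs), 0};      /* block offset in the file */

    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    /* Collective write: all processes participate, letting the MPI-IO layer
     * aggregate and reorder the accesses to disk. */
    MPI_File_write_all(fh, block, lsizes[0] * lsizes[1], MPI_DOUBLE,
                       MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
}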

Parallel HDF5 (I)
MPI-I/O offers a limited interface to the user:
– A file is seen as a stream of bytes.
– The user is responsible for structuring the data in the file using MPI derived datatypes, which is very difficult for a scientific application programmer.
HDF5 offers a higher level of access through its objects and I/O operations. Parallel HDF5 is layered on top of MPI-I/O to hide the awkwardness of using MPI-I/O to access structured data: the user gets the convenient HDF5 interface, while the HDF5 library still has to provide efficient, optimized access to the data on disk.
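A minimal sketch (not taken from the slides) of how Parallel HDF5 sits on MPI-I/O: the file access property list selects the MPI-IO driver, and a transfer property list requests collective raw data I/O. The file name is illustrative.

#include <hdf5.h>
#include <mpi.h>

/* Open a file through the MPI-IO virtual file driver. */
static hid_t open_parallel_file(MPI_Comm comm)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);   /* use the MPI-IO driver */

    hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);
    return file;
}

/* Write a selection collectively; memspace/filespace select each rank's part. */
static herr_t write_collectively(hid_t dset, hid_t memspace, hid_t filespace,
                                 const double *buf)
{
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);  /* collective raw data I/O */

    herr_t ret = H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace,
                          dxpl, buf);
    H5Pclose(dxpl);
    return ret;
}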

Parallel HDF5 (II)
Parallel HDF5 still imposes some limitations on the user:
– PHDF5 allows the user to access raw data on disk individually or collectively; however, operations that modify the file metadata are required to be collective.
While this requirement satisfies application needs in some cases, in others it adds significant overhead and restricts application developers. A similar requirement is placed internally on space allocation, where processes must communicate collectively in order to allocate space in the file.

RFC
In this RFC, we propose several design options that remove the collective requirement for space allocation and for operations that modify the file metadata. Any solution has to achieve the following:
– Allow an independent (but globally visible), scalable "fetch-and-add" operation on the "end of allocated space" (EOA) value in the file, so that new space in the file can be allocated to an individual process without involving all the other processes.
– Allow [the metadata for] an object to be locked, updated, unlocked, and made visible to other processes without requiring a collective operation, so that the structure of a file can be changed.

Use case 1
A parallel application with n processes, where each process creates a dataset in the root group for its private use. This is currently done by having all processes call the dataset creation function n times, even though each process needs only one dataset.
[Diagram: the root group "/" containing one dataset per process, P1 through P6.]
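A minimal sketch of the current collective pattern (the dataset names, shape, and helper function are illustrative only):

#include <hdf5.h>
#include <stdio.h>

/* Every rank must take part in every dataset creation call, even though it
 * only needs the one dataset it will use itself. */
static hid_t create_private_dataset(hid_t file_id, int rank, int nprocs)
{
    hsize_t dims[1] = {1024};
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t my_dset = -1;

    for (int i = 0; i < nprocs; i++) {            /* n collective creations */
        char name[32];
        snprintf(name, sizeof(name), "P%d", i);
        hid_t dset = H5Dcreate2(file_id, name, H5T_NATIVE_DOUBLE, space,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        if (i == rank)
            my_dset = dset;                       /* keep our own dataset  */
        else
            H5Dclose(dset);                       /* close all the others  */
    }
    H5Sclose(space);
    return my_dset;
}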

Use case 2
A process within a parallel application doing independent reads/writes to a dataset while another process modifies the metadata of that dataset, possibly making the first process's access obsolete. Currently this cannot happen, because all processes are required to participate in any metadata modification. It becomes possible, however, once the collective requirement on metadata modification is removed, and must be taken into account by any design option.

Use cases 3, 4, 5
– Collective read/write to a dataset, where all processes that opened the file are required to call the function, whether or not they have data to read or write.
– Two or more processes creating an attribute on their own datasets. In the current implementation, as in use case 1, the attribute creation function must be collective even though the attributes are being created on different datasets.
– Two or more processes creating an attribute on a shared dataset. A solution that makes this operation independent has to account for other processes creating an attribute (or any other object) at the same time; those objects might overlap in the file if the operations are not serialized.

Space Allocation at EOA
A one-sided fetch-and-add problem: MPI does not provide a fetch-and-add (FAD) operation directly, but it does provide one-sided RMA operations (Put, Get, Accumulate). We use an algorithm introduced by Bill Gropp in Using MPI-2 to implement an MPI-based FAD operation.

MPI FAD
Create an MPI window on all processes, with the root process exposing the memory buffer where the fetch-and-add will take place. Every process that wants to allocate space from the EOA executes the following code section to get the current EOA and increment it atomically:

MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);       /* exclusive epoch on the root's window     */
MPI_Accumulate(increment_by_x, nlevels, MPI_INT,
               0, 0, 1, acc_type, MPI_SUM, win);   /* add this process's contribution          */
MPI_Get(get_array, nlevels, MPI_INT,
        0, 0, 1, get_type, win);                   /* read the other processes' contributions  */
MPI_Win_unlock(0, win);                            /* complete the epoch                       */

The current EOA for this process is then its own increment plus the values returned in get_array.
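A minimal setup sketch for that window, ignoring the tree layout of Gropp's scalable counter and simply exposing nlevels integers on the root; the function name and the flat layout are assumptions made for illustration:

#include <mpi.h>

/* Only rank 0 exposes memory; every other rank attaches a zero-size region
 * to the same window so it can target rank 0 with RMA operations. */
static MPI_Win create_eoa_window(MPI_Comm comm, int nlevels, int **counter_out)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    int    *counter = NULL;
    MPI_Win win;
    if (rank == 0) {
        MPI_Alloc_mem(nlevels * sizeof(int), MPI_INFO_NULL, &counter);
        for (int i = 0; i < nlevels; i++)
            counter[i] = 0;                        /* contributions start at zero */
        MPI_Win_create(counter, nlevels * sizeof(int), sizeof(int),
                       MPI_INFO_NULL, comm, &win);
    } else {
        MPI_Win_create(NULL, 0, sizeof(int), MPI_INFO_NULL, comm, &win);
    }
    *counter_out = counter;
    return win;
}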

Performance
Some of the disadvantages of this approach:
– It needs a set-aside process; otherwise the root process will not be able to service the RMA operations promptly.
– We hit performance issues when a large number of processes try to fetch-and-add repeatedly at the same time; however, we do not see that as a common case.
[Table: measurements for varying numbers of processes.]

Independent Metadata Operations
Three classes of solutions:
– Distributed Lock Manager (DLM)
– Synchronization Step
– Metadata Server (MDS)
   – Enhanced MDS
   – Data Staging

DLM
Use the DLM to serialize access to an HDF5 object by requiring any process that needs the object to first obtain a lock (shared/exclusive) on it.
DLM implementations:
– RMA & point-to-point
– Point-to-point non-blocking operations

RMA & Point-to-Point (I)
Allocate an MPI window at the DLM with a size of n bits, n being the number of processes. If a process k wants the lock, then in a single RMA access epoch it will:
– get the values of the n-1 bits that correspond to the status of the other processes
– switch the k-th bit on to indicate that process k has asked for the lock.
If all n-1 bits are zero, it is safe to acquire the lock; otherwise the process posts an MPI_Recv from MPI_ANY_SOURCE that blocks until some other process tells it that it is safe to acquire the lock. To release the lock, another access epoch is performed at the root process to switch the k-th bit off and to determine which process (if any) the lock should be sent to. Shared locks can be implemented by extending this algorithm to an n x 2 bit field.
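A hedged sketch of the lock-request epoch described above, using one byte per process instead of one bit and assuming the waiting-list window lockwin is hosted on rank HOME; HOME, LOCK_TAG, and the function name are illustrative:

#include <mpi.h>

#define HOME     0      /* rank hosting the waiting-list window (assumed) */
#define LOCK_TAG 77     /* tag reserved for lock hand-off messages        */

static void request_lock(MPI_Win lockwin, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    unsigned char flags[nprocs];   /* copy of the other n-1 flags */
    unsigned char one = 1;

    /* One exclusive access epoch: read everyone else's flag and set ours. */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, HOME, 0, lockwin);
    if (rank > 0)
        MPI_Get(flags, rank, MPI_BYTE, HOME, 0, rank, MPI_BYTE, lockwin);
    if (rank < nprocs - 1)
        MPI_Get(flags + rank, nprocs - 1 - rank, MPI_BYTE, HOME,
                rank + 1, nprocs - 1 - rank, MPI_BYTE, lockwin);
    MPI_Put(&one, 1, MPI_BYTE, HOME, rank, 1, MPI_BYTE, lockwin);
    MPI_Win_unlock(HOME, lockwin);

    /* If any other flag is set, some process holds or is waiting for the
     * lock; block until the current holder hands the lock to us. */
    int busy = 0;
    for (int i = 0; i < nprocs - 1; i++)
        busy |= flags[i];
    if (busy)
        MPI_Recv(NULL, 0, MPI_BYTE, MPI_ANY_SOURCE, LOCK_TAG, comm,
                 MPI_STATUS_IGNORE);
}

Releasing the lock follows the same pattern: another exclusive epoch clears the process's own flag, reads the remaining flags, and, if another process is waiting, sends it a LOCK_TAG message.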

RMA & Point-to-Point (II)
[Diagram: three processes request the lock; the first is granted it immediately, the others wait. As each holder releases the lock, it sends the lock to the next waiting process, which receives it and later releases it in turn.]

DLM
In addition to granting locks, the DLM must also handle metadata changes. The process releasing a lock is responsible for pushing all the metadata changes it made to the DLM or to the process the lock is being sent to. Once the accumulated changes are large enough, a flush to disk can be initiated by the DLM or by any process that holds the lock on an object. Since multiple objects need to be locked, the DLM has to maintain a queue (the n-bit field) for each object. Deadlocks also need to be dealt with.

Synchronization Step
Each process modifies its metadata cache independently and logs the changes in a list. Whenever the user issues a collective operation or a flush, the processes enter a sync point and resolve any conflicts that appear in the logs. This approach is meant for applications where the user guarantees that the HDF5 metadata usage is simple enough that the resolve step does not run into unresolvable conflicts. The metadata operations are not entirely independent, since a sync step is still required.
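A purely hypothetical sketch of what a per-process change-log entry might look like; none of these types exist in HDF5 today:

#include <hdf5.h>
#include <stddef.h>

/* One logged modification to a metadata cache entry.  At a sync point the
 * processes would exchange their logs, check for entries whose address
 * ranges conflict, and apply the merged result to the shared file. */
typedef struct md_change {
    haddr_t            addr;        /* file address of the metadata entry */
    size_t             len;         /* length of the modified region      */
    void              *new_image;   /* modified copy held in the cache    */
    struct md_change  *next;        /* next entry in this process's log   */
} md_change_t;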

MDS (I)
A server opens the file and reads/writes the metadata to disk, instead of just managing access to it (as the DLM does). Clients do not touch the metadata on disk. For example, a client sends a request to the MDS to create a dataset; the MDS creates the dataset and returns it to the client. Locking is still needed when a client wants to do raw data reads/writes on a mutable object, to make sure that the structure of the object does not change underneath it.
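A hypothetical sketch of the client side of such a "create dataset" request; the message layout, tags, and MDS_RANK are illustrative assumptions, not an existing protocol:

#include <mpi.h>
#include <hdf5.h>

#define MDS_RANK        0
#define TAG_CREATE_REQ  101
#define TAG_CREATE_REP  102

typedef struct {
    char    name[256];     /* dataset name         */
    int     type_id;       /* encoded datatype     */
    int     ndims;         /* number of dimensions */
    hsize_t dims[8];       /* dimension sizes      */
} create_req_t;

/* Send the creation request to the MDS and wait for the address of the new
 * object header, which the client can then use to open the dataset locally. */
static void mds_create_dataset(MPI_Comm comm, const create_req_t *req,
                               haddr_t *obj_addr)
{
    MPI_Send((void *)req, (int)sizeof(*req), MPI_BYTE,
             MDS_RANK, TAG_CREATE_REQ, comm);
    MPI_Recv(obj_addr, (int)sizeof(*obj_addr), MPI_BYTE,
             MDS_RANK, TAG_CREATE_REP, comm, MPI_STATUS_IGNORE);
}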

MDS (II)
Having a single MDS is not scalable, so at some point we will need more than one. To handle hot objects, we can use a DHT to replicate those objects across the servers. Clients can still cache metadata and, in certain situations, bypass the MDS or keep their communication with it very simple.

Enhanced MDS (I)
Objects in HDF5 are marked as private or public. Access to the metadata of public objects has to go through the MDS, whereas the metadata of private objects can be accessed locally. If a private object needs to be made public, the process that holds it must transfer "ownership" to the MDS, which includes either pushing all of the object's metadata to the MDS or writing the metadata to disk and informing the MDS about it.

Enhanced MDS (II)
[Diagram: public objects are accessed through the MDS; private objects are accessed locally.]

Data Staging
Similar to what ADIOS does: set aside processes to handle all HDF5 operations (metadata and raw data).
Asynchronous I/O operations:
– Must not interfere with the application

Progress Function
An alternative to setting aside processes: a library on top of MPI that the user calls to make progress. The user is responsible for inserting progress calls in the application code so that the designated "MDS" or "DLM" processes can handle outstanding requests.
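A hypothetical usage sketch; hdx_progress() and compute_step() stand for the progress entry point such a library would export and the application's own work routine, and are not existing HDF5 or MPI calls:

/* Application compute loop with periodic progress calls, so that the
 * designated MDS/DLM processes can service outstanding requests. */
for (int step = 0; step < nsteps; step++) {
    compute_step(step);    /* application work (assumed helper) */
    hdx_progress();        /* drive the metadata service        */
}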

Recommendation
As mentioned earlier, this work is still in the design phase. While we have a tentative roadmap for how to proceed, things might change for various reasons, such as new input from the HPC community, funding, etc. We are looking at proceeding with:
– the Gropp FAD approach for the space allocation problem
– the Enhanced MDS approach for the metadata operations, moving later towards Data Staging.
We would like to offer the user both options: setting aside processes, or calling progress functions in the application.