1
High Performance Computing: Concepts, Methods & Means Parallel I/O : File Systems and Libraries
Prof. Thomas Sterling Department of Computer Science Louisiana State University March 29th, 2007
2
Topics Introduction RAID Distributed File Systems (NFS) Parallel File Systems (PVFS2) Parallel I/O Libraries (MPI-IO) Parallel File Formats (HDF5) Additional Parallel File Systems (GPFS) Summary – Materials for Test
3
Topics Introduction RAID Distributed File Systems (NFS) Parallel File Systems (PVFS2) Parallel I/O Libraries (MPI-IO) Parallel File Formats (HDF5) Additional Parallel File Systems (GPFS) Summary – Materials for Test
4
Permanent Storage: Hard Disks Review
Storage capacity: 1 TB per drive Areal density: 132 Gbit/in² (perpendicular recording) Rotational speed: 15,000 RPM Average latency: 2 ms Seek time Track-to-track: 0.2 ms Average: 3.5 ms Full stroke: 6.7 ms Sustained transfer rate: up to 125 MB/s Non-recoverable error rate: 1 in 10^17 Interface bandwidth: Fibre Channel: 400 MB/s Serial Attached SCSI (SAS): 300 MB/s Ultra320 SCSI: 320 MB/s Serial ATA (SATA): 300 MB/s
5
Storage – SATA & Overview - Review
Serial ATA is the newest commodity hard disk interface standard. SATA uses serial buses, as opposed to the parallel buses used by ATA and SCSI. The cables attached to SATA drives are smaller and run faster (around 150 MB/s). The basic disk technologies remain the same across the three buses. The platters in a disk spin at a variety of speeds; the faster the platters spin, the faster data can be read off the disk, and data on the far end of the platter becomes available sooner. Rotational speeds range from 5,400 RPM to 15,000 RPM. The faster the platters rotate, the lower the latency and the higher the bandwidth. (Figure: PATA vs. SATA.)
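The connection between rotational speed and latency is simple arithmetic: average rotational latency is half a revolution, i.e. 0.5 / (RPM/60) seconds. A small illustrative C program (not from the slides) reproduces the 2 ms figure quoted earlier for a 15,000 RPM drive:

#include <stdio.h>

/* Average rotational latency is half a revolution:
   latency = 0.5 / (RPM / 60) seconds. */
int main(void)
{
    double rpm[] = {5400.0, 7200.0, 15000.0};
    for (int i = 0; i < 3; i++) {
        double latency_ms = 0.5 / (rpm[i] / 60.0) * 1000.0;
        printf("%6.0f RPM -> %.2f ms average rotational latency\n",
               rpm[i], latency_ms);
    }
    return 0;
}

For the three speeds above this prints roughly 5.56 ms, 4.17 ms and 2.00 ms, which matches the "Average latency: 2 ms" spec on the previous slide.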
6
I/O Needs on Parallel Computers
High Performance: take advantage of parallel I/O paths (when available); support application-level data access and throughput needs. Data Integrity: sanely deal with hardware and power failures. Single Namespace: all nodes and users “see” the same file systems, with equal access from anywhere on the resource. Ease of Use: where possible, a parallel file system should be accessible in a consistent way, just like a traditional UNIX-style file system. Ohio Supercomputer Center
7
Topics Introduction RAID Distributed File Systems (NFS) Parallel File Systems (PVFS2) Parallel I/O Libraries (MPI-IO) Parallel File Formats (HDF5) Additional Parallel File Systems (GPFS) Summary – Materials for Test
8
Parallel I/O - RAID RAID stands for Redundant Array of Inexpensive Disks and provides a mechanism by which the performance and storage properties of individual disks can be aggregated. A group of disks appears as a single large disk, and the performance of multiple disks is better than that of a single disk. Using multiple disks also allows data to be stored in multiple places, so the system can continue functioning after a disk failure. Both software and hardware RAID solutions are available. Hardware solutions are more expensive, but provide better performance without CPU overhead. Software solutions provide various levels of flexibility but have associated computational overhead.
9
RAID : Key Concepts Variety of RAID allocation schemes :
RAID 0 (disk striping without redundant storage): Data is striped across multiple disks. The result of striping is a logical storage device whose capacity is the capacity of each disk times the number of disks in the array. Both read and write performance are accelerated: consecutive blocks reside on different disks, so interleaving reads between the disks can substantially improve read performance. No fault tolerance. High transfer rates. High request rates.
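To make the striping idea concrete, here is a minimal sketch (illustrative only, not any particular RAID implementation) of how a logical block number maps to a member disk and a per-disk block offset in a RAID 0 array:

#include <stdio.h>

/* RAID 0: logical blocks are striped round-robin across N disks. */
typedef struct { int disk; long offset; } stripe_loc;

static stripe_loc map_block(long logical_block, int num_disks)
{
    stripe_loc loc;
    loc.disk   = (int)(logical_block % num_disks);  /* which member disk */
    loc.offset = logical_block / num_disks;         /* block index on that disk */
    return loc;
}

int main(void)
{
    for (long b = 0; b < 8; b++) {
        stripe_loc loc = map_block(b, 4);
        printf("logical block %ld -> disk %d, block %ld\n", b, loc.disk, loc.offset);
    }
    return 0;
}

Because consecutive logical blocks land on different disks, a large sequential read keeps all member disks busy at once.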
10
RAID : Key Concepts RAID 1 (disk mirroring) and RAID 5:
RAID 1 (disk mirroring): Complete copies of the data are stored in multiple locations. The usable capacity of such a RAID set is half of its raw capacity. Read performance is accelerated and is comparable to RAID 0. Writes are slowed down, as new data needs to be written multiple times. RAID 5: As in RAID 0, data is striped across multiple disks, with parity distributed across the disks. For each stripe of data stored across the drives, a parity checksum is computed and stored on one of the disks (the disk holding the parity rotates from stripe to stripe). Read performance of RAID 5 is somewhat reduced because parity data occupies space on every drive, and write performance lags behind because of the checksum computation.
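The parity checksum mentioned for RAID 5 is typically a bytewise XOR across the data blocks of a stripe. A small illustrative sketch (not a real RAID implementation) shows both how the parity is computed and how a lost block can be rebuilt from the survivors:

#include <stdio.h>
#include <string.h>

#define BLOCK 8   /* tiny block size, for illustration only */

/* Parity block = XOR of all data blocks in the stripe. */
static void compute_parity(unsigned char data[][BLOCK], int ndisks,
                           unsigned char parity[BLOCK])
{
    memset(parity, 0, BLOCK);
    for (int d = 0; d < ndisks; d++)
        for (int i = 0; i < BLOCK; i++)
            parity[i] ^= data[d][i];
}

int main(void)
{
    unsigned char data[3][BLOCK] = {"block0", "block1", "block2"};
    unsigned char parity[BLOCK], rebuilt[BLOCK];

    compute_parity(data, 3, parity);

    /* Simulate losing disk 1: XOR of the surviving blocks with the
       parity reconstructs the missing block. */
    memcpy(rebuilt, parity, BLOCK);
    for (int i = 0; i < BLOCK; i++)
        rebuilt[i] ^= data[0][i] ^ data[2][i];

    printf("rebuilt block 1: %s\n", (char *)rebuilt);
    return 0;
}

This is why every small write in RAID 5 costs extra work: the parity block for the affected stripe must be recomputed and rewritten.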
11
Topics Introduction RAID Distributed File Systems (NFS) Parallel File Systems (PVFS2) Parallel I/O Libraries (MPI-IO) Parallel File Formats (HDF5) Additional Parallel File Systems (GPFS) Summary – Materials for Test
12
Distributed File Systems
A distributed file system is a file system that is stored locally on one system (the server) but is accessible by processes on many systems (clients). Multiple processes access multiple files simultaneously. Other attributes of a DFS may include: Access control lists (ACLs) Client-side file replication Server- and client-side caching Some examples of DFSes: NFS (Sun) AFS (CMU) DCE/DFS (Transarc / IBM) CIFS (Microsoft) Distributed file systems can be used by parallel programs, but they have significant disadvantages: The network bandwidth of the server system is a limiting factor on performance To retain UNIX-style file consistency, the DFS software must implement some form of locking, which has significant performance implications Ohio Supercomputer Center
13
Distributed File System : NFS
Popular means for accessing remote file systems in a local area network. Based on the client-server model, the remote file systems are “mounted” via NFS and accessed through the Linux virtual file system (VFS) layer. NFS clients cache file data, periodically checking with the original file for any changes. The loosely-synchronous model makes for convenient, low-latency access to shared spaces. NFS avoids the common locking systems used to implement POSIX semantics.
14
Why NFS is bad for Parallel I/O
Clients can cache data indiscriminately, and tend to cache in blocks along block boundaries. When nearby regions of a file are written by different processes on different clients, the result is undefined due to the lack of consistency control. Second, all file operations are remote operations; extensive file locking would be required to implement sequential consistency. Communication between client and server typically uses relatively slow communication channels, adding to the performance degradation. The specification itself is inefficient (e.g., a read operation involves two RPC operations: one to look up the file handle and a second to read the file data).
15
Topics Introduction RAID Distributed File Systems (NFS) Parallel File Systems (PVFS2) Parallel I/O Libraries (MPI-IO) Parallel File Formats (HDF5) Additional Parallel File Systems (GPFS) Summary – Materials for Test
16
Parallel File Systems A parallel file system is one in which there are multiple servers as well as clients for a given file system; it is the equivalent of RAID across several file systems. Multiple processes can access the same file simultaneously. Parallel file systems are usually optimized for high performance rather than general-purpose use, common optimizations being: very large block sizes (≥ 64 kB); relatively slow metadata operations (e.g. fstat()) compared to reads and writes; special APIs for direct access. Examples of parallel file systems include: GPFS (IBM) Lustre (Cluster File Systems) PVFS2 (Clemson/ANL) Ohio Supercomputer Center
17
Characteristics of Parallel File Systems
Three key characteristics: various hardware I/O data storage resources; multiple connections between these hardware devices and compute resources; high-performance, concurrent access to these I/O resources. Multiple physical I/O devices and paths ensure sufficient bandwidth for the high performance desired. Parallel I/O systems include both the hardware and a number of layers of software: High-Level I/O Library, Parallel I/O (MPI I/O), Parallel File System, Storage Hardware.
18
Parallel File Systems: Hardware Layer
I/O hardware usually comprises disks, controllers, and interconnects for data movement. Hardware determines the maximum raw bandwidth and the minimum latency of the system. The bisection bandwidth of the underlying transport determines the aggregate bandwidth of the resulting parallel I/O system. At the hardware level, data is accessed at the granularity of blocks, either physical disk blocks or logical blocks spread across multiple physical devices, such as in a RAID array. Parallel file systems manage data on the storage hardware, present this data as a directory hierarchy, and coordinate access to files and directories in a consistent manner. File systems usually provide a UNIX-like interface, allowing users to access contiguous regions of files.
19
Parallel File Systems : Other Layers
Lower-level interfaces may be provided by the file system for higher-performance access. Above the parallel file system are the parallel I/O layers, provided in the form of libraries such as MPI I/O. The parallel I/O layer provides a low-level interface and operations such as collective I/O. Scientific applications work with structured data, for which higher-level APIs written on top of MPI-IO, such as HDF5 or Parallel netCDF, are used. HDF5 and Parallel netCDF allow scientists to represent data sets in terms closer to those used in their applications.
20
PVFS2 PVFS2 designed to provide :
modular networking and storage subsystems, a structured data request format modeled after MPI datatypes, flexible and extensible data distribution models, distributed metadata, tunable consistency semantics, and support for data redundancy. Supports a variety of network technologies including Myrinet, Quadrics, and InfiniBand. Also supports a variety of storage devices including locally attached hardware, SANs and iSCSI. Key abstractions include: Buffered Message Interface (BMI): non-blocking network interface Trove: non-blocking storage interface Flows: mechanism to specify a flow of data between network and storage
21
PVFS2 Software Architecture
Buffered Messaging Interface (BMI): Non-blocking interface that can be used with many high-performance network fabrics; currently TCP/IP and Myrinet (GM) implementations exist. Trove: Non-blocking interface that can be used with a number of underlying storage mechanisms. Trove storage objects consist of a stream of bytes and a keyword/value pair space; keyword/value pairs are convenient for arbitrary metadata storage and directory entries, while the stream of bytes provides ideal storage for file data. (Diagram: PVFS2 client/server architecture, showing Client API and Request Processing over Job Sched, BMI, Flows, Trove and Dist, connected by Network and Disk.)
22
PVFS2 Software Architecture
Flows: Combine the network and storage subsystems by providing a mechanism to describe the flow of data between network and storage. They provide a point for optimizing data movement between a particular network and storage pair to exploit fast paths. The job scheduling layer provides a common interface for interacting with BMI, Flows, and Trove, and checks on their completion. The job scheduler is tightly integrated with a state machine that is used to track operations in progress.
23
The PVFS2 Components The four major components of a PVFS system are:
Metadata Server (mgr) I/O Server (iod) PVFS native API (libpvfs) PVFS Linux kernel support Metadata Server (mgr) : manages all the file metadata for PVFS files, using a daemon which atomically operates on the file metadata. PVFS avoids the pitfalls of many storage area network approaches, which have to implement complex locking schemes to ensure that metadata stays consistent in the face of multiple accesses.
24
The PVFS2 Components I/O daemon:
handles storing and retrieving file data stored on local disks connected to a node, using traditional read(), write(), etc. to access these files. The PVFS native API provides user-space access to the PVFS servers; the library handles the operations necessary to move data between user buffers and PVFS servers. (Figure: metadata access and data access paths between clients and servers.)
25
Parallel File Systems Comparison
26
Comparison of NFS vs. GPFS
File-System Features            NFS                             GPFS
Introduced:                     1985                            1998
Original vendor:                Sun                             IBM
Example at LC:                  /nfs/tmpn                       /p/gx1
Primary role:                   Share files among machines      Fast parallel I/O for large files
Easy to scale?                  No                              Yes
Network needed:                 Any TCP/IP network              Only IBM SP "switch"
Access control method:          UNIX permission bits (CHMOD)
Block size:                     256 byte                        512 Kbyte (White)
Stripe width:                   Depends on RAID                 256 Kbyte
Maximum file size:              2 Gbyte (larger with v3)        26 Gbyte
File consistency:
  ...uses client buffering?     Yes (see diagram)
  ...uses server buffering?
  ...uses locking?                                              Yes (token passing)
  ...lock granularity?                                          Byte range
  ...lock managed by?                                           Requesting compute node
Purged at LC?                   Home, No; Tmp, Yes
Supports file quotas?
27
Topics Introduction RAID Distributed File Systems (NFS) Parallel File Systems (PVFS2) Parallel I/O Libraries (MPI-IO) Parallel File Formats (HDF5) Additional Parallel File Systems (GPFS) Summary – Materials for Test
28
MPI-IO Overview Initially developed as a research project at the IBM T. J. Watson Research Center in 1994 Voted by the MPI Forum to be included in the MPI-2 standard (Chapter 9) Most widespread open-source implementation is ANL's ROMIO, written by Rajeev Thakur Integrates file access with the message passing infrastructure, using similarities between send/receive and file write/read operations Allows MPI datatypes to meaningfully describe data layouts in files instead of dealing with unorganized streams of bytes Provides potential for performance optimizations through the mechanism of “hints”, collective operations on file data, or relaxation of data access atomicity Enables better file portability by offering alternative data representations
29
MPI-IO Features (I) Basic file manipulation (open/close, delete, space preallocation, resize, storage synchronization, etc.) File views (define what part of a file each process can see and how it is interpreted) Processes can view file data independently, with possible overlaps Users may define patterns to describe data distributions both in file and in memory, including non-contiguous layouts Permit skipping over fixed header blocks (“displacements”) Views can be changed by tasks at any time Data access positioning Explicitly specified offsets (suffix “_at”) Independent data access by each task via individual file pointers (no suffix) Coordinated access through a shared file pointer (suffix “_shared”) Access synchronism Blocking Non-blocking (including split-collective operations)
30
MPI-IO Features (II) Access coordination
Non-collective (no additional suffix) Collective (suffix: “_all” for most blocking calls, “_begin” and “_end” for split-collective, or “_ordered” for equivalent of shared pointer access) File interoperability (ensures portability of data representation) Native: for purely homogeneous environments Internal: heterogeneous environments with implementation-defined data representation (subset of “external32”) External32: heterogeneous environments using data representation defined by the MPI-IO standard Optimization hints (the “info” interface) Access style (e.g. read_once, write_once, sequential, random, etc.) Collective buffering components (buffer and block sizes, number of target nodes) Striping unit and factor Chunked I/O specification Preferred I/O devices C, C++ and Fortran bindings
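As a sketch of how the hint mechanism is used in practice, the fragment below attaches hints through an MPI_Info object at open time. The helper name open_with_hints is made up for illustration; the hint keys (access_style, striping_factor, cb_buffer_size) are common ROMIO hints, and hints are purely advisory, so an implementation may ignore any of them.

#include <mpi.h>

/* Pass I/O hints to the MPI-IO layer through an MPI_Info object. */
void open_with_hints(MPI_Comm comm, char *path, MPI_File *fh)
{
    MPI_Info info;
    MPI_Info_create(&info);

    /* Hints are advisory; unknown keys are silently ignored. */
    MPI_Info_set(info, "access_style", "write_once");
    MPI_Info_set(info, "striping_factor", "16");      /* number of I/O servers */
    MPI_Info_set(info, "cb_buffer_size", "4194304");  /* collective buffer: 4 MB */

    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);
    MPI_Info_free(&info);
}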
31
MPI-IO Types Etype (elementary datatype): the unit of data access and positioning; all data accesses are performed in etype units and offsets are measured in etypes Filetype: basis for partitioning the file among processes: a template for accessing the file; may be identical to or derived from the etype Source:
32
MPI-IO File Views A view defines the current set of data visible and accessible from an open file as an ordered set of etypes Each process has its own view of the file, defined by: a displacement, an etype, and a filetype Displacement: an absolute byte position relative to the beginning of file; defines where a view begins
33
MPI-IO: File Open Function: MPI_File_open()
int MPI_File_open(MPI_Comm comm, char *filename, int amode, MPI_Info info, MPI_File *fh);

Description: Opens the file identified by filename on all processes in comm group, using access mode specified in amode. The operation is collective; all participating processes must pass identical values for amode and use the filename referencing the same file. Successful call returns the open file handle in fh, which can be used to subsequently access the file. It is possible to open file independently from other processes by passing MPI_COMM_SELF in comm argument.

#include <mpi.h>
...
MPI_File fh;
int err;

/* create a writable file with default parameters */
err = MPI_File_open(MPI_COMM_WORLD, "/mnt/piofs/testfile",
                    MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
if (err != MPI_SUCCESS) { /* handle error here */ }
34
MPI-IO: File Close Function: MPI_File_close()
int MPI_File_close(MPI_File *fh);

Description: Synchronizes file state (equivalent to an implicit invocation of MPI_File_sync), and then closes the file associated with handle fh. The user must ensure that all outstanding non-blocking requests and split-collective operations associated with handle fh have completed. If the file was opened with access mode MPI_MODE_DELETE_ON_CLOSE, it is deleted from the file system.

#include <mpi.h>
...
MPI_File fh;
int err;

/* open a file storing the handle in fh */
/* perform file access */

err = MPI_File_close(&fh);
if (err != MPI_SUCCESS) { /* handle error here */ }
35
MPI-IO: Set File View Function: MPI_File_set_view()
int MPI_File_set_view(MPI_File fh, MPI_Offset disp, MPI_Datatype etype, MPI_Datatype filetype, char *datarep, MPI_Info info);

Description: Changes the process's view of the data file, setting the start of the view to disp, the type of file data to etype, the distribution of file data to processes to filetype, and the data representation to datarep. Resets the individual and shared file pointers to zero. The call is collective, requiring the values for datarep and etype extents to be identical on all processes. The data representation must be one of: "native", "internal" or "external32".

#include <mpi.h>
...
MPI_File fh;
int err;

/* open file storing the handle in fh */

/* view the file as a stream of integers with no header, using native data representation */
err = MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
if (err != MPI_SUCCESS) { /* handle error */ }
36
MPI-IO: Read File with Explicit Offset
Function: MPI_File_read_at()

int MPI_File_read_at(MPI_File fh, MPI_Offset offs, void *buf, int count, MPI_Datatype type, MPI_Status *status);

Description: Reads count elements of type type from the file represented by fh at offset offs, storing them in the buffer pointed to by buf. Offset offs is expressed in etype units relative to the current view associated with the file handle fh. Successful call returns the amount of data transferred in status.

#include <mpi.h>
...
MPI_File fh;
MPI_Status stat;
int buf[3], err;

/* open file storing the handle in fh */
MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);

/* read the third triad of integers from file */
err = MPI_File_read_at(fh, 6, buf, 3, MPI_INT, &stat);
37
MPI-IO: Write to File with Explicit Offset
Function: MPI_File_write_at()

int MPI_File_write_at(MPI_File fh, MPI_Offset offs, void *buf, int count, MPI_Datatype type, MPI_Status *status);

Description: Writes count elements of type type from buffer buf to the file represented by fh at offset offs. Offset offs is expressed in etype units relative to the current view associated with the file handle fh. Successful call returns the amount of data transferred in status.

#include <mpi.h>
...
MPI_File fh;
MPI_Status stat;
int err;
double dt = 0.001;   /* example timestep value */

/* open file storing the handle in fh */
MPI_File_set_view(fh, 0, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);

/* store timestep as the first item in file */
err = MPI_File_write_at(fh, 0, &dt, 1, MPI_DOUBLE, &stat);
38
MPI-IO: Read File Collectively with Individual File Pointers
Function: MPI_File_read_all()

int MPI_File_read_all(MPI_File fh, void *buf, int count, MPI_Datatype type, MPI_Status *status);

Description: All processes in the communicator group associated with the file handle fh read their respective count elements of type type from the file at the offsets determined by the current values of the file pointers cached on their file handles, storing them in the buffers pointed to by buf. Successful call returns the amount of data transferred in status.

#include <mpi.h>
...
MPI_File fh;
MPI_Status stat;
int buf[20], err;

/* open file storing the handle in fh */
MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);

/* read 20 integers at current file offset in every process */
err = MPI_File_read_all(fh, buf, 20, MPI_INT, &stat);
39
MPI-IO: Write to File Collectively with Individual File Pointers
Function: MPI_File_write_all()

int MPI_File_write_all(MPI_File fh, void *buf, int count, MPI_Datatype type, MPI_Status *status);

Description: All processes in the communicator group associated with the file handle fh write their respective count elements of type type from buffers buf to the file at the offsets determined by the current values of the file pointers cached on their file handles. Successful call returns the amount of data transferred in status.

#include <mpi.h>
...
MPI_File fh;
MPI_Status stat;
double t;
int err, rank;

/* open file storing the handle in fh; compute t */
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

/* interleave time values t from each process at the beginning of file */
MPI_File_set_view(fh, rank*sizeof(t), MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);
err = MPI_File_write_all(fh, &t, 1, MPI_DOUBLE, &stat);
40
MPI-IO: File Seek Function: MPI_File_seek()
int MPI_File_seek(MPI_File fh, MPI_Offset offs, int whence);

Description: Updates the value of the individual file pointer according to whence, which has the following possible values:
MPI_SEEK_SET: the pointer is set to offs
MPI_SEEK_CUR: the pointer is set to the current value plus offs
MPI_SEEK_END: the pointer is set to the end of file plus offs

#include <mpi.h>
...
MPI_File fh;
MPI_Status stat;
double t;
int rank;

/* open file storing the handle in fh; compute t */
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

/* interleave time values t from each process at the beginning of file */
MPI_File_set_view(fh, 0, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);
MPI_File_seek(fh, rank, MPI_SEEK_SET);
MPI_File_write_all(fh, &t, 1, MPI_DOUBLE, &stat);
41
MPI-IO Data Access Classification
Source:
42
Example: Scatter to File
Example created by Jean-Pierre Prost from IBM Corp.
43
Scatter Example Source
#include "mpi.h" static int buf_size = 1024; static int blocklen = 256; static char filename[] = "scatter.out"; main(int argc, char **argv) { char *buf, *p; int myrank, commsize; MPI_Datatype filetype, buftype; int length[3]; MPI_Aint disp[3]; MPI_Datatype type[3]; MPI_File fh; int mode, nbytes; MPI_Offset offset; MPI_Status status; /* initialize MPI */ MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &myrank); MPI_Comm_size(MPI_COMM_WORLD, &commsize); /* initialize buffer */ buf = (char *) malloc(buf_size); memset(( void *)buf, '0' + myrank, buf_size); /* create and commit buftype */ MPI_Type_contiguous(buf_size, MPI_CHAR, &buftype); MPI_Type_commit(&buftype); /* create and commit filetype */ length[0] = 1; length[1] = blocklen; length[2] = 1; disp[0] = 0; disp[1] = blocklen * myrank; disp[2] = blocklen * commsize; type[0] = MPI_LB; type[1] = MPI_CHAR; type[2] = MPI_UB; MPI_Type_struct(3, length, disp, type, &filetype); MPI_Type_commit(&filetype); /* open file */ mode = MPI_MODE_CREATE | MPI_MODE_WRONLY;
44
Scatter Example Source (cont.)
    MPI_File_open(MPI_COMM_WORLD, filename, mode, MPI_INFO_NULL, &fh);

    /* set file view */
    offset = 0;
    MPI_File_set_view(fh, offset, MPI_CHAR, filetype, "native", MPI_INFO_NULL);

    /* write buffer to file */
    MPI_File_write_at_all(fh, offset, (void *)buf, 1, buftype, &status);

    /* print out number of bytes written */
    MPI_Get_elements(&status, MPI_CHAR, &nbytes);
    printf("TASK %d ====== number of bytes written = %d ======\n", myrank, nbytes);

    /* close file */
    MPI_File_close(&fh);

    /* free datatypes */
    MPI_Type_free(&buftype);
    MPI_Type_free(&filetype);

    /* free buffer */
    free(buf);

    /* finalize MPI */
    MPI_Finalize();
}
45
Data Access Optimizations
Data sieving; two-phase I/O (collective read implementation in ROMIO). Source:
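A minimal sketch of the data sieving idea (illustrative only, not ROMIO's actual implementation): instead of issuing one small read per noncontiguous piece, read a single contiguous region that covers all the pieces and copy out the wanted parts in memory. The helper name sieved_read is made up.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical request: 'count' pieces of 'len' bytes each, starting at
   file offsets offs[i]; results are packed contiguously into 'out'. */
void sieved_read(FILE *fp, const long *offs, size_t len, int count, char *out)
{
    long lo = offs[0], hi = offs[0] + (long)len;
    for (int i = 1; i < count; i++) {
        if (offs[i] < lo) lo = offs[i];
        if (offs[i] + (long)len > hi) hi = offs[i] + (long)len;
    }

    /* One large contiguous read covering all requested pieces. */
    char *sieve = malloc((size_t)(hi - lo));
    fseek(fp, lo, SEEK_SET);
    fread(sieve, 1, (size_t)(hi - lo), fp);

    /* Extract only the pieces the caller asked for. */
    for (int i = 0; i < count; i++)
        memcpy(out + (size_t)i * len, sieve + (offs[i] - lo), len);

    free(sieve);
}

The trade-off is clear: one large access replaces many small ones, at the cost of reading (and buffering) data in the "holes" that nobody asked for.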
46
ROMIO Scaling Examples
Bandwidths obtained for 512³ arrays (astrophysics benchmark) on the Argonne IBM SP

Write operations:
Processors    Independent I/O    Collective I/O
16            1.26 MB/s          64.8 MB/s
32            1.25 MB/s          69.5 MB/s
48            1.36 MB/s          70.6 MB/s

Read operations:
Processors    Independent I/O    Collective I/O
16            12.8 MB/s          68.5 MB/s
32            6.46 MB/s          82.6 MB/s
48            5.83 MB/s          88.4 MB/s

Source:
47
Independent vs. Collective Access
(Charts: Individual I/O on IBM SP vs. Collective I/O on IBM SP.) Source:
48
Topics Introduction RAID Distributed File Systems (NFS) Parallel File Systems (PVFS2) Parallel I/O Libraries (MPI-IO) Parallel File Formats (HDF5) Additional Parallel File Systems (GPFS) Summary – Materials for Test
49
Introduction to HDF5 Acronym for Hierarchical Data Format: a portable, freely distributable, and well-supported library, file format, and set of utilities to manipulate it Explicitly designed for use with scientific data and applications The initial HDF version was created at NCSA/University of Illinois at Urbana-Champaign in 1988 The first revision in widespread use was HDF4 Main HDF features include: Versatility: supports different data models and associated metadata Self-describing: allows an application to interpret the structure and contents of a file without any extraneous information Flexibility: permits mixing and grouping various objects together in one file in a user-defined hierarchy Extensibility: accommodates new data models, added both by users and developers Portability: can be shared across different platforms without preprocessing or modifications HDF5 is the most recent incarnation of the format, adding support for new type and data models, parallel I/O, and streaming, and removing a number of existing restrictions (maximum file size, number of objects per file, flexibility of type use, storage management configurability, etc.), as well as improving performance
50
HDF5 File Layout Major object classes: groups and datasets
Namespace resembles a file system directory hierarchy (groups ≡ directories, datasets ≡ files) Alias creation supported through links (both soft and hard) Mounting of sub-hierarchies is possible (Figures: low-level organization vs. user's view.)
51
HDF5 API & Tools Library functionality grouped by function name prefix
H5: general purpose functions H5A: attribute interface H5D: dataset manipulation H5E: error handling H5F: file interface H5G: group creation and access H5I: object identifiers H5P: property lists H5R: references H5S: dataspace definition H5T: datatype manipulation H5Z: inline data filters and compression Command-line utilities h5cc, h5c++, h5fc: C, C++ and Fortran compiler wrappers h5redeploy: updates compiler tools after installation in new location h5ls, h5dump: lists hierarchy and contents of an HDF5 file h5diff: compares two HDF5 files h5repack, h5repart: rearranges or repartitions a file h5toh4, h4toh5: converts between HDF5 and HDF4 formats h5import: imports data into an HDF5 file gif2h5, h52gif: converts image data between gif and HDF5 formats
52
Basic HDF5 Concepts Group Dataset Dataspace Datatype Attribute
Structure containing zero or more HDF5 objects (possibly other groups); provides a mechanism for mapping a name (path) to an object; the “root” group is the logical container of all other objects in a file Dataset A named array of data elements (possibly multi-dimensional); its representation and the way it is stored in the HDF5 file are specified through the associated datatype and dataspace parameters Dataspace Defines the dimensionality of a dataset (rank and dimension sizes); determines the effective subset of data to be stored or retrieved in subsequent file operations (aka selection) Datatype Describes an atomically accessed element of a dataset; permits construction of derived (compound) types, such as arrays, records, enumerations; influences conversion of numeric values between different platforms or implementations Attribute A small, user-defined structure attached to a group, dataset or named datatype, providing additional information
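A small sketch tying these concepts together, using the HDF5 1.6-style C API that the later code slide also uses; the file, group and dataset names are made up for illustration. It creates a file, defines a 2-D dataspace, and creates an integer dataset inside a group.

#include "hdf5.h"

int main(void)
{
    hsize_t dims[2] = {4, 6};                      /* dataspace: rank 2, 4 x 6 */

    hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t group = H5Gcreate(file, "/results", 0);  /* group: a node in the namespace */
    hid_t space = H5Screate_simple(2, dims, NULL); /* dataspace object */

    /* dataset: named array with a datatype (native int) and a dataspace */
    hid_t dset  = H5Dcreate(file, "/results/dset", H5T_NATIVE_INT, space, H5P_DEFAULT);

    H5Dclose(dset);
    H5Sclose(space);
    H5Gclose(group);
    H5Fclose(file);
    return 0;
}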
53
HDF5 Spatial Subset Examples
Source:
54
HDF5 Virtual File Layer Developed to cope with large number of available storage subsystem variations Permits custom file driver implementations and related optimizations Source:
55
Overview of Data Storage Options
Source:
56
Simultaneous Spatial and Type Transformation Example
Source:
57
Simple HDF5 Code Example
/* Writing and reading an existing dataset. */ #include "hdf5.h" #define FILE "dset.h5" int main() { hid_t file_id, dataset_id; /* identifiers */ herr_t status; int i, j, dset_data[4][6]; /* Initialize the dataset. */ for (i = 0; i < 4; i++) for (j = 0; j < 6; j++) dset_data[i][j] = i * 6 + j + 1; /* Open an existing file. */ file_id = H5Fopen(FILE, H5F_ACC_RDWR, H5P_DEFAULT); /* Open an existing dataset. */ dataset_id = H5Dopen(file_id, "/dset"); /* Write the dataset. */ status = H5Dwrite(dataset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, dset_data); status = H5Dread(dataset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, dset_data); /* Close the dataset. */ status = H5Dclose(dataset_id); /* Close the file. */ status = H5Fclose(file_id); }
58
Parallel HDF5 Relies on MPI-IO as the file layer driver
Uses MPI for internal communications Most of the functionality is controlled through property lists (requires minimal HDF5 interface changes) Supports both individual and collective file access Three raw data storage layouts: contiguous, chunked and compact Enables additional optimizations through derived MPI datatypes (esp. for regular collective accesses) Limitations: Chunked storage with overlapping chunks (results are non-deterministic) Read-only compression Writes with variable-length datatypes are not supported
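A brief sketch of how those property lists select the MPI-IO driver and request collective transfers (assumes an MPI-enabled HDF5 build; H5Pset_fapl_mpio and H5Pset_dxpl_mpio are part of the parallel HDF5 C API, while the helper names below are made up):

#include "hdf5.h"
#include <mpi.h>

/* Create an HDF5 file for parallel access over MPI-IO. */
hid_t create_parallel_file(const char *name, MPI_Comm comm)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);   /* use the MPI-IO file driver */

    hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);
    return file;
}

/* Dataset transfer property list requesting collective I/O. */
hid_t collective_xfer_plist(void)
{
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    return dxpl;
}

The returned transfer property list would then be passed to H5Dread/H5Dwrite in place of H5P_DEFAULT, which is what turns individual accesses into collective ones.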
59
Topics Introduction RAID Distributed File Systems (NFS) Parallel File Systems (PVFS2) Parallel I/O Libraries (MPI IO, ROMIO) Parallel File Formats (HDF5..) Additional Parallel File Systems (GPFS) Summary – Materials for Test
60
General Parallel File System (GPFS)
Brief history: Based on the Tiger Shark parallel file system developed at the IBM Almaden Research Center in 1993 for AIX Originally targeted at dedicated video servers The multimedia orientation influenced GPFS command names: they all contain “mm” First commercial release was GPFS V1.1 in 1998 Linux port released in 2001; Linux-AIX interoperability supported since V2.2 in 2004 Highly scalable Distributed metadata management Permits incremental scaling High-performance Large block size with wide striping Parallel access to files from multiple nodes Deep prefetching Adaptable mechanism for recognizing access patterns Multithreaded daemon Highly available and fault tolerant Data protection through journaling, replication, mirroring and shadowing Ability to recover from multiple disk, node and connectivity failures (heartbeat mechanism) Recovery mechanism implemented in all layers
61
GPFS Features (I) Source:
62
GPFS Features (II)
63
GPFS Architecture Source:
64
Components Internal to GPFS Daemon
Configuration Manager (CfgMgr) Selects the node acting as Stripe Group Manager for each file system Checks for the quorum of nodes required for the file system usage to continue Appoints successor node in case of failure Initiates and controls recovery procedure Stripe Group Manager (FSMgr, aka File System Manager) Exactly one per GPFS file system Maintains availability information of disks comprising the file system (physical storage) Processes modifications (disk removals and additions) Repairs file system and coordinates data migration when required Metanode Manages metadata (directory block updates) Its location may change (e.g. a node obtaining access to the file may become the metanode) Token Manager Server Synchronizes concurrent access to files and ensures consistency among caches Manages tokens, or per-object locks Mediates token migration when another node requests a token conflicting with the existing token (token stealing) Always located on the same node as the Stripe Group Manager
65
GPFS Management Functions & Their Dependencies
Source:
66
Components External to GPFS Daemon
Virtual Shared Disk (VSD, aka logical volume) Enables nodes in one SP system partition to share disks with the other nodes in the same system partition VSD node can be a client, a server (owning a number of VSDs, and performing data reads and writes requested by client nodes), or both at the same time Recoverable Virtual Shared Disk (RVSD) Used together with VSD to provide high availability against node failures reported by Group Services Runs recovery scripts and notifies client applications Switch (interconnect) Subsystem Starts the switch daemon, responsible for initializing and monitoring the switch Discovers and reacts to topology changes; reports and services status/error packets Group Services Fault-tolerant, highly available and partition-sensitive service monitoring and coordinating changes related to another subsystem operating in the partition Operates on each node within the partition, plus the control workstation for the partition System Data Repository (SDR) Location where the configuration data are stored
67
Read Operation Flow in GPFS
68
Write Operation Flow in GPFS
69
Token Management in GPFS
First lock request for an object requires a message from node N1 to the token manager Token server grants token to N1 (subsequent lock requests can be granted locally) Node N2 requests token for the same file (lock conflict) Token server detects conflicting lock request and revokes token from N1 If N1 was writing to file, the data is flushed to disk before the revocation is complete Node N2 gets the token from N1
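The sequence above can be summarized in a short sketch; this is only an illustration of token-based lock management in general, not GPFS source code, and the names used are made up:

/* Illustrative sketch of token-based locking (not actual GPFS code). */
#include <stdio.h>

typedef struct { int holder; } token_t;     /* -1 means no current holder */

static void flush_if_dirty(int node)        /* stands in for the write-back step */
{
    printf("node N%d flushes cached data to disk\n", node);
}

void request_token(token_t *t, int node)
{
    if (t->holder < 0) {
        t->holder = node;                    /* first request: granted by token server */
    } else if (t->holder != node) {
        flush_if_dirty(t->holder);           /* revoke from the current holder */
        t->holder = node;                    /* requester now owns the token */
    }
    /* subsequent lock requests by the holder are granted locally, no messages */
}

int main(void)
{
    token_t file_token = { -1 };
    request_token(&file_token, 1);           /* N1 acquires the token */
    request_token(&file_token, 2);           /* conflict: revoke from N1, grant to N2 */
    printf("token now held by N%d\n", file_token.holder);
    return 0;
}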
70
GPFS Write-behind and Prefetch
As soon as the application's write buffer is copied into the local pagepool, the write operation is complete from the client's perspective; the GPFS daemon then schedules a worker thread to finalize the request by issuing I/O calls to the device driver. For reads, GPFS estimates the number of blocks to read ahead based on disk performance and the rate at which the application is reading data; additional prefetch requests are processed asynchronously with the completion of the current read.
71
Some GPFS Cluster Models
Network Shared Disk (NSD) with dedicated server model Direct attached model Mixed (NSD and direct attached) model Joined (AIX and Linux) model
72
Comparison of GPFS to Other File Systems
73
Topics Introduction RAID Distributed File Systems (NFS) Parallel File Systems (PVFS2) Parallel I/O Libraries (MPI IO, ROMIO) Parallel File Formats (HDF5..) Additional Parallel File Systems (GPFS) Summary – Materials for Test
74
Summary – Material for the Test
Need for Parallel I/O (slide 6) RAID concepts (slides 8-10) Distributed File System Concepts NFS (slides 12, 13) Why NFS is bad for parallel I/O (slide 14) Parallel File System Concepts (slides 16-19) PVFS (slides 20-24) MPI-IO concepts & features (slides 29-32) MPI-IO API & functionalities (slides 33-41)