File Consistency in a Parallel Environment
Kenin Coloma

Outline
– Data consistency in parallel file systems
  – Consistency semantics
  – File caching effects
  – Consistency in MPI-IO
– 2-phase collective I/O in ROMIO (a popular MPI-IO implementation)
– Intuitive solutions
– Persistent file domains
  – PFDs: concept
  – PFDs: statically blocked assignment
  – PFDs: statically striped assignment
  – PFDs: dynamic assignment
– Performance comparisons
– Conclusions & future work

Consistency Semantics
POSIX and UNIX sequential consistency:
– Once a write has returned, the resulting file must be visible to all processes.
MPI-IO sequential consistency:
– Once a write has returned, the resulting file must be visible only to processes in the same communicator.
– If the underlying file system does not support POSIX or UNIX consistency semantics, MPI-IO must enforce its sequential consistency semantics itself.
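Where the underlying file system is weaker, the MPI standard's sync-barrier-sync construct is the portable way to make one process's write visible to another process's read. A minimal sketch, assuming a hypothetical file data.out and a 4-integer exchange between ranks 0 and 1:

    #include <mpi.h>

    /* Sync-barrier-sync: the MPI-IO idiom for ordering a conflicting
     * write and read across ranks.  File name is a placeholder. */
    int main(int argc, char **argv)
    {
        int rank, buf[4] = {0, 1, 2, 3};
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_File_open(MPI_COMM_WORLD, "data.out",
                      MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);
        if (rank == 0)
            MPI_File_write_at(fh, 0, buf, 4, MPI_INT, MPI_STATUS_IGNORE);
        MPI_File_sync(fh);           /* push rank 0's write out of its cache */
        MPI_Barrier(MPI_COMM_WORLD); /* order the write before the read */
        MPI_File_sync(fh);           /* drop any stale cached copies */
        if (rank == 1)
            MPI_File_read_at(fh, 0, buf, 4, MPI_INT, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }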

Caching and Consistency
The client-server model for file systems often relies on client-side caching for its performance benefits:
– Client-side caching reduces the amount of data that needs to be transferred from the server.
NFS is one such file system, and it does not enforce POSIX or UNIX consistency semantics.

Caching and Consistency
A simple example using MPI and Unix I/O on NFS with 4 processes:
  Open
  Seek(0 byte_off); Read(16 bytes)
  Barrier
  Seek(rank*4 byte_off); Write(4 bytes)
  Barrier
  Seek(0 byte_off); Read(16 bytes)
  Close
Because the final read on each process can be served from its own client-side file cache, the user buffers end up unequal to the file's actual contents (figure: client-side file caches vs. user buffers on p0-p3).
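The slide's sequence in code, as a sketch (the NFS path is a placeholder):

    #include <mpi.h>
    #include <unistd.h>
    #include <fcntl.h>

    /* Every rank reads the whole 16-byte file, overwrites its own
     * 4-byte slice, then rereads the file.  On NFS the final read may
     * be served from a stale client-side cache. */
    int main(int argc, char **argv)
    {
        int rank;
        char buf[16];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int fd = open("/nfs/shared.dat", O_RDWR);
        lseek(fd, 0, SEEK_SET);
        read(fd, buf, 16);            /* populates the client-side cache */
        MPI_Barrier(MPI_COMM_WORLD);

        lseek(fd, rank * 4, SEEK_SET);
        write(fd, buf, 4);            /* each rank updates its 4-byte slice */
        MPI_Barrier(MPI_COMM_WORLD);

        lseek(fd, 0, SEEK_SET);
        read(fd, buf, 16);            /* may return stale cached data */
        close(fd);
        MPI_Finalize();
        return 0;
    }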

2-phase Collective IO in ROMIO
2-phase I/O, proposed and designed in PASSION (by Prof. Choudhary), is widely used in parallel I/O optimizations; the MPI-IO implementation in ROMIO uses 2-phase collective I/O.
Advantages of collective I/O:
– Awareness of the access patterns (often non-contiguous) of all participating processes
– A means of coordinating participating processes to optimize overall I/O performance

2-phase Collective IO in ROMIO
2-phase I/O consists of:
– A communication phase
– An I/O phase
It reduces the number of I/O calls made to the I/O servers, as well as the number of I/O requests generated at the server, and all the I/O performed is more localized than it would otherwise be.
(Figure: a 2-phase collective write moving data from user buffers through communication buffers to I/O buffers and the file; each process's contiguous region of the file is its file domain, and the union of all requests is the aggregate access region.)
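The mechanics are easiest to see in code. Below is a toy sketch of a 2-phase collective write, not ROMIO's implementation: each rank owns one contiguous file domain, records are routed to their owning aggregator with MPI_Alltoall (communication phase), and each aggregator then issues a single contiguous write (I/O phase). The file name, record size, and the assumption that every rank holds exactly one record per domain are illustrative:

    #include <mpi.h>
    #include <string.h>

    #define R 4  /* record size in bytes (assumption for the sketch) */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        char send[64 * R], domain[64 * R];   /* assumes nprocs <= 64 */
        for (int j = 0; j < nprocs; j++)     /* record destined for domain j */
            memset(send + j * R, 'a' + rank, R);

        /* Phase 1: exchange.  Afterwards domain[] holds this rank's file
         * domain, with sender i's record already at its file order i*R. */
        MPI_Alltoall(send, R, MPI_BYTE, domain, R, MPI_BYTE, MPI_COMM_WORLD);

        /* Phase 2: one contiguous write per aggregator. */
        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "twophase.out",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write_at_all(fh, (MPI_Offset)rank * nprocs * R,
                              domain, nprocs * R, MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }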

2-phase Collective IO in ROMIO
A simple example exhibiting the file consistency problem even with collective I/O in ROMIO, on 4 processes:
  MPI_File_open
  MPI_File_read_all() [whole file]
  MPI_File_write_all() [stripe 1st half]
  MPI_File_read_all() [whole file]
  MPI_File_close
After the final read_all, the user buffers do not equal the file's contents (figure: client-side file caches vs. user buffers on p0-p3).
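For reference, the same pattern against the MPI-IO interface might look like this sketch (hypothetical path and sizes; explicit-offset variants are used so no file views are needed):

    #include <mpi.h>

    /* Collective version of the same pattern; consistency still breaks
     * on NFS because the aggregators read through stale client caches. */
    int main(int argc, char **argv)
    {
        int rank;
        char buf[16];
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_File_open(MPI_COMM_WORLD, "/nfs/shared.dat",
                      MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

        MPI_File_read_at_all(fh, 0, buf, 16, MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_File_write_at_all(fh, rank * 2, buf, 2, MPI_BYTE,
                              MPI_STATUS_IGNORE);   /* stripe the 1st half */
        MPI_File_read_at_all(fh, 0, buf, 16, MPI_BYTE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }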

Intuitive Solutions
The cause: obsolete data cached in the client-side system buffer.
Simple solutions:
– Disable client-side caching
  – entails changes to system configuration
  – loses the performance benefits of caching
– Use file locking
  – can serialize I/O
  – not feasible on large-scale parallel systems
  – effectively disables client-side caching
– Explicitly flush out the cached data, the simplest solution, as done on Cplant:
  – ioctl(fd, BLKFLSBUF)
  – fsync(fd) ensures the writes reside on disk
  – also effectively disables client-side caching

File Locking
File locking can cause I/O serialization even if accesses do not logically overlap. This is evident in collective I/O, where file domains never overlap (figure: p0 and p1 serialized despite disjoint accesses).
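As an illustration, here is a sketch of byte-range locking with fcntl. Locking the whole file (l_len = 0), as below, serializes all writers even when their ranges are disjoint; finer-grained range locks help, but on many NFS configurations taking any lock also disables client-side caching for the file:

    #include <fcntl.h>
    #include <unistd.h>

    /* Write under an exclusive whole-file lock.  Sketch only. */
    int locked_write(int fd, const void *buf, size_t len, off_t off)
    {
        struct flock lk = {0};
        lk.l_type   = F_WRLCK;   /* exclusive write lock */
        lk.l_whence = SEEK_SET;
        lk.l_start  = 0;
        lk.l_len    = 0;         /* 0 = to end of file: locks everything */

        if (fcntl(fd, F_SETLKW, &lk) == -1)  /* block until granted */
            return -1;
        ssize_t n = pwrite(fd, buf, len, off);
        lk.l_type = F_UNLCK;
        fcntl(fd, F_SETLK, &lk);             /* release the lock */
        return n == (ssize_t)len ? 0 : -1;
    }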

fsync and ioctl
On Cplant:
– Flush (ioctl) before every read
– fsync after every write
Performance ramifications:
– Could be invalidating perfectly good cached data
In the earlier example, each Read(16 bytes) is preceded by ioctl(fd, BLKFLSBUF) and each Write(4 bytes) is followed by fsync(fd).
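A sketch of that discipline as a pair of wrappers. Caveat: BLKFLSBUF is a Linux block-device ioctl, and whether it applies to a regular file descriptor as shown here depends on the file system (it did on Cplant's ENFS); treat the code as illustrative:

    #include <sys/ioctl.h>
    #include <linux/fs.h>   /* BLKFLSBUF */
    #include <unistd.h>

    /* Flush before every read: invalidate locally cached pages. */
    ssize_t consistent_read(int fd, void *buf, size_t len, off_t off)
    {
        ioctl(fd, BLKFLSBUF, 0);   /* drop cached data, good or stale */
        return pread(fd, buf, len, off);
    }

    /* fsync after every write: ensure the write resides on disk. */
    ssize_t consistent_write(int fd, const void *buf, size_t len, off_t off)
    {
        ssize_t n = pwrite(fd, buf, len, off);
        fsync(fd);
        return n;
    }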

Persistent File Domains
Similar to the file domain concept in ROMIO's collective I/O routines:
– Enforce MPI-IO consistency semantics while retaining client-side file caching
– Safe concurrent accesses
3 assignment strategies:
– Statically blocked assignment
– Statically striped assignment
– Dynamic (on-the-fly) assignment

Statically Blocked Assignment
– Client-side caches are coherent before starting.
– File domains are kept the same between collective I/O calls.
– Maintains file consistency: each byte can only be accessed by one process.
– Avoids excessive fsync and ioctl calls.
Call sequence:
  MPI_File_open (create file domains; fsync(fd->fd_sys); ioctl(fd->fd_sys, BLKFLSBUF))
  MPI_File_set_size (the file size can be useful in creating file domains)
  MPI_File_read_all
  MPI_File_write_all
  MPI_File_read_all
  MPI_File_close (delete file domains; fsync(fd->fd_sys); ioctl(fd->fd_sys, BLKFLSBUF))
(Figure: compute nodes mapped onto ENFS servers and file domains.)

Statically Blocked Assignment
– Based on an ~equal division of the whole file
– Least complexity and least amount of change to ROMIO
– ADIOI_Calc_aggregator() remains just a calculation, based on:
  – File size
  – Number of processes
A sketch of this calculation follows.
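The file is split into nprocs roughly equal contiguous domains, so the owner of an offset falls out of one division. Names are illustrative, not ROMIO's actual ADIOI_Calc_aggregator() signature:

    /* Which process owns the byte at `offset` under statically blocked
     * assignment?  Sketch only. */
    int calc_aggregator_blocked(long long offset, long long file_size,
                                int nprocs)
    {
        long long domain = (file_size + nprocs - 1) / nprocs; /* ceil div */
        int owner = (int)(offset / domain);
        return owner < nprocs ? owner : nprocs - 1; /* clamp last domain */
    }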

Statically Blocked Assignment
A key structure: ADIOI_Access

    struct {
        ADIO_Offset *offsets;      /* file offsets of the requests       */
        int         *lens;         /* length of each request             */
        MPI_Aint    *mem_ptrs;     /* corresponding locations in memory  */
        int         *file_domains; /* owning file domain of each request */
        int          count;        /* number of requests                 */
    } my_reqs[nprocs],      /* this process's requests, per owner */
      others_reqs[nprocs];  /* other processes' requests, per process */

Statically Blocked Assignment: walkthrough
(Several animation slides step through the same call sequence against the file-domain figure.)
  MPI_File_open
  MPI_File_set_size
  MPI_File_read_all
  MPI_File_close

Statically Blocked Assignment
Drawback:
– File inconsistency arises when there are multiple I/O calls, often to different regions of the file rather than the whole file.
– Consequently, this assignment scheme is inefficient unless each access covers a rather large portion of the file (~3/4 of the file size).
(Figure: user buffers vs. client-side file caches on p0-p3.)

Statically Striped Assignment
– Based on a striping block size parameter passed to ROMIO through the file system hints mechanism
– Somewhat more complex than statically blocked assignment:
  – Processes can “own” multiple file domains
  – More end cases
– ADIOI_Calc_aggregator() is still just a calculation, based on:
  – Striping block size
  – Number of processes
A sketch of this calculation follows.
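Stripes are dealt round-robin to processes, so ownership is a divide and a modulo. Illustrative names only:

    /* Which process owns the byte at `offset` under statically striped
     * assignment?  Sketch only. */
    int calc_aggregator_striped(long long offset, long long stripe_size,
                                int nprocs)
    {
        long long stripe = offset / stripe_size;  /* which stripe */
        return (int)(stripe % nprocs);            /* round-robin owner */
    }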

Statically Striped Assignment: walkthrough
  MPI_File_open
  MPI_File_set_size
  MPI_File_read_all
  MPI_File_close

Statically Striped Assignment
One significant change follows from processes having multiple file domains and the extra communication: mapping communicated data to or from the user buffer now requires one buffer index per file domain (figure: stripes alternating p0/p1, with buf_idx[0] and buf_idx[1] tracking each process's position within its own stripes).

Statically Striped Assignment
– Opportunity to match the stripe size to the access pattern
– Should work particularly well if the aggregate access region of each I/O call is fairly consistent, at roughly nprocs * stripe size
– Becomes less significant if the stripe size is greater than the data sieving buffer (default: 4 MB)
(Figure: user buffers vs. client-side file caches on p0-p3.)

Dynamically Assigned
Static approaches cannot autonomously adapt to the actual file access pattern.
2 approaches:
– Incremental bookkeeping
– Reassignment
Most complex of the three strategies:
– Multiple file domains per process
– With respect to the file layout, file domains are irregular
– A definitive assignment policy must be established
(Figure: file domains p0-p3 as assigned by write_all 1, then reshaped by write_all 2.)

Dynamically Assigned
ADIOI_Calc_aggregator() becomes a search function. Augment the ADIOI_Access structure:

    struct {
        ADIO_Offset *offsets;
        int         *lens;
        int          count;
        /* plus pointers to a search structure (e.g. a B-tree)
           over the dynamically assigned file domains */
    };

A sketch of the lookup follows.
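One way to realize the search, assuming domains are recorded as they are claimed and kept sorted by starting offset; a B-tree would replace the sorted array at scale. Names are illustrative, not ROMIO's:

    typedef struct {
        long long start, end;  /* half-open byte range [start, end) */
        int       owner;       /* process that owns this domain */
    } FileDomain;

    /* Binary search for the domain containing `offset`; returns the
     * owner, or -1 if the range is unclaimed and must be assigned
     * according to the (still to be defined) policy. */
    int calc_aggregator_dynamic(long long offset,
                                const FileDomain *fds, int nfds)
    {
        int lo = 0, hi = nfds - 1;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            if (offset < fds[mid].start)      hi = mid - 1;
            else if (offset >= fds[mid].end)  lo = mid + 1;
            else return fds[mid].owner;       /* offset inside this domain */
        }
        return -1;
    }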

Performance Comparisons
Benchmark:
  MPI_File_open
  MPI_File_set_size()
  Loop (iter):
    MPI_File_read_all
    MPI_File_write_all
  MPI_File_close
Factors:
– Collective buffer size (4 MB)
– Stripe size in the application
– Available cache
– Aggregate access region / file size (static block)
– Number of processes
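A skeleton of that benchmark loop; the file name, file size, per-rank layout, and iteration count below are placeholders, not the study's actual parameters:

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        enum { ITERS = 10 };
        const MPI_Offset FILE_SIZE = 1 << 22;  /* 4 MB, placeholder */
        int rank, nprocs;
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        MPI_Offset chunk = FILE_SIZE / nprocs, off = rank * chunk;
        char *buf = malloc(chunk);

        MPI_File_open(MPI_COMM_WORLD, "bench.dat",
                      MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);
        MPI_File_set_size(fh, FILE_SIZE);
        for (int i = 0; i < ITERS; i++) {   /* timed read/write loop */
            MPI_File_read_at_all(fh, off, buf, (int)chunk, MPI_BYTE,
                                 MPI_STATUS_IGNORE);
            MPI_File_write_at_all(fh, off, buf, (int)chunk, MPI_BYTE,
                                  MPI_STATUS_IGNORE);
        }
        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }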

Conclusions & Future Work
– File consistency can be achieved without locking or any changes to system configuration.
– Except for the statically blocked method, all the methods tested produced similar results.
– The exact conditions under which each solution performs best still need to be determined through further experimentation.
– The dynamic approach to persistent file domains is still unimplemented and under design:
  – Reassignment vs. bookkeeping
  – The specifics of each policy also need to be worked out

Data Sieving in ROMIO
A quick overview of data sieving: it is best suited to small, densely distributed non-contiguous accesses. In the read case, ROMIO reads one large contiguous region spanning all the requests into a data sieve buffer, then copies the requested pieces out to the user buffer (figure: user buffer, data sieve buffer, file).
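A minimal sketch of the read case, assuming the request offsets are sorted; names are illustrative, not ROMIO's internals:

    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Instead of one small read per request, read the whole spanning
     * extent once and memcpy each requested piece out of it. */
    int sieve_read(int fd, char *user_buf,
                   const off_t *offs, const size_t *lens, int count)
    {
        off_t  start  = offs[0];
        size_t extent = offs[count - 1] + lens[count - 1] - start;

        char *sieve = malloc(extent);
        if (!sieve || pread(fd, sieve, extent, start) != (ssize_t)extent) {
            free(sieve);
            return -1;
        }
        size_t pos = 0;
        for (int i = 0; i < count; i++) {   /* extract each piece */
            memcpy(user_buf + pos, sieve + (offs[i] - start), lens[i]);
            pos += lens[i];
        }
        free(sieve);
        return 0;
    }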