Large Scale Parallel I/O with HDF5
Darren Adams, NCSA
Considerations for Parallel File Systems on Distributed HPC Platforms


Survey of Parallel I/O Techniques
- Root-process gather and write: yes, this is still common…
- N-N: write an independent file from each process.
- N-1: write to a single file from all processes.
  – MPI-IO based parallel I/O
  – Higher-level I/O libraries: HDF5, NetCDF, Silo, etc.

Advantages of N-1
- The resulting file can be independent of process count.
- A single file is easier to analyze with other applications.
- More conceptually compatible with the “global” data set that has been computed in a distributed fashion.

Disadvantages of N-1
- Usually results in a write performance penalty vs. N-N.
- Without particular care, can lose information about how the data was distributed when it was created.
- File size can become a problem.
- Some approaches can result in pathologically bad I/O patterns vs. N-N.

A Couple of New Approaches to Optimize Performance
- PLFS – Virtual Parallel Log-Structured File System
  – Sits between the application and the file system and intercepts the application's I/O requests.
  – Optimizes I/O for the underlying file system by aligning stripe boundaries.
- ADIOS – ADaptable IO System
  – Changes the I/O behavior through an XML file that is independent of the application.
  – Allows Fortran READ and WRITE statements to be used in the application.

Other Considerations (besides parallel I/O performance)
- Portability across architectures and operating systems.
- Ability to read and manipulate data with other applications.
- Stability and availability of the software (library) used to create the file format.
- Sharing data with collaborators.

Reasons to Use HDF5
- Provides a rich API that is capable of creating a file system within a file.
- Supports parallel I/O.
- Is highly stable and backward compatible.
- Platform independent.
- Widely used and supported (will likely be around for a long time).
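As a rough illustration of the “file system in a file” idea, the sketch below creates a group, a dataset, and an attribute inside a single HDF5 file. All names and sizes here are made up for the example; this is not the file layout used by the DNS code.

    #include <hdf5.h>

    int main(void) {
        /* Groups act like directories, datasets like files, and attributes
           like small per-object metadata. */
        hid_t file = H5Fcreate("fields.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t grp  = H5Gcreate(file, "/velocity", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        hsize_t dims[3] = {64, 64, 64};
        hid_t space = H5Screate_simple(3, dims, NULL);
        hid_t dset  = H5Dcreate(grp, "u", H5T_NATIVE_DOUBLE, space,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        /* A small attribute describing the dataset. */
        int step = 0;
        hid_t aspace = H5Screate(H5S_SCALAR);
        hid_t attr   = H5Acreate(dset, "time_step", H5T_NATIVE_INT, aspace,
                                 H5P_DEFAULT, H5P_DEFAULT);
        H5Awrite(attr, H5T_NATIVE_INT, &step);

        H5Aclose(attr); H5Sclose(aspace);
        H5Dclose(dset); H5Sclose(space);
        H5Gclose(grp);  H5Fclose(file);
        return 0;
    }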

Reasons Not to Use HDF5 (directly)
- Complicated, general-purpose API.
- No standard is imposed on the data format created.
  – While HDF5 has the capability to create a self-describing, well-formed data format, it does not require these practices.
- More of a general toolkit for building a higher, application-level API (like NetCDF).
- More time and effort is needed to develop I/O routines.

Reasons to Use HDF5 (directly)
- Maximum control over the file format.
- Access to all of the performance settings in the underlying MPI-IO layer.
- Ability to restructure the data layout if needed, to improve performance either for the application writing the data or for visualization and other analysis software.
- Can incrementally expose the portions of the HDF5 API that the application needs, rather than being limited to what a higher-level API exposes (square hole, round peg).

Remarks on (P)HDF5
- HDF5 can be compiled with parallel MPI-IO support.
  – PHDF5 must be linked separately from a serial (non-MPI) HDF5 build.
- PHDF5 opens up an interface to the MPI-IO layer via MPI-IO hints and various internal MPI-specific properties.
- Using HDF5 directly gives full access to all of these settings.
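To make the MPI-IO hook concrete, here is a minimal sketch of creating a parallel HDF5 file, assuming an HDF5 build configured with parallel support. The file name is illustrative; the file access property list is the place where MPI-IO hints and other MPI-specific settings enter.

    #include <mpi.h>
    #include <hdf5.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* Select the MPI-IO file driver on a file access property list. */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

        /* All ranks create/open the file collectively. */
        hid_t file = H5Fcreate("dns_output.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* ... dataset creation and collective writes go here ... */

        H5Fclose(file);
        H5Pclose(fapl);
        MPI_Finalize();
        return 0;
    }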

Data Requirements for DNS Application
- The science code is a CFD solver that uses domain-decomposed MPI processes to parallelize the computation.
- Visualization data
  – The main data products for analysis are 3D scalar and 3D 3-vector field data.
  – Important to have a workable data format that can be read with tools such as VisIt.
- Restart data
  – A state dump used only by the simulation code; can be highly optimized, but should still be portable across architectures.
- Statistics data
  – Smaller data sets and integral quantities.

Parallel I/O Approach for DNS Code
- Chose to use HDF5 directly.
  – Developed a simple I/O library that encapsulates the HDF5 file structure (see the interface sketch below):
    - Exposes a simpler interface to the simulation code.
    - Allows I/O routines to be re-used in analysis programs.
    - The simple library can be used to implement a VisIt database reader.
  – Exposes only a very well maintained and widely supported library dependency to the application.
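A hypothetical interface for such a wrapper library might look like the following. These function names, types, and signatures are illustrative only, not the actual DNS library API; the point is that callers pass only their local block and its placement, while all HDF5 details stay inside the library.

    #include <mpi.h>

    /* Opaque handle that hides the underlying HDF5 identifiers. */
    typedef struct dns_file dns_file;

    dns_file *dns_open(const char *path, MPI_Comm comm, const char *mode);
    int dns_write_field(dns_file *f, const char *name,
                        const double *local_data,
                        const int local_dims[3], const int local_offset[3],
                        const int global_dims[3]);
    int dns_read_field(dns_file *f, const char *name, double *local_data,
                       const int local_dims[3], const int local_offset[3]);
    int dns_close(dns_file *f);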

Parallel I/O Approach for DNS Code
- Need to mitigate the performance degradation of N-1 file I/O.
  – As long as I/O performance is “reasonable”, code execution time will not be significantly impacted; there simply is not enough storage capacity on the file system to allow I/O to dominate. Hmmm, what does “reasonable” mean…
  – Structured, usable files are worth some level of performance sacrifice to researchers.

HDF5 Datasets in Parallel Applications
- Use hyperslabs to define a subset of a global data set for each process (sketched below).
  – In CFD domain decomposition, each MPI rank has data that is physically adjacent to another rank’s.
  – The “global” data set, in this context, is achieved when data from all ranks is “glued” together.
  – This approach creates a single glued-together data set, which can be supplemented by MPI rank info but does not need to be.
- Alternatively, let each process write to a separate data set.
  – Information about how to glue the data together MUST then be provided.
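A minimal sketch of the hyperslab approach, continuing from a file opened with the parallel file access property list shown earlier. The dataset path and the values NX/NY/NZ, nx/ny/nz, x0/y0/z0, and the buffer u_local are placeholders supplied by the domain decomposition.

    /* Each rank writes its sub-block of one "global" 3D dataset. */
    hsize_t global[3] = {NX, NY, NZ};      /* global grid size (assumed) */
    hsize_t local[3]  = {nx, ny, nz};      /* this rank's block size */
    hsize_t offset[3] = {x0, y0, z0};      /* this rank's starting indices */

    hid_t filespace = H5Screate_simple(3, global, NULL);
    hid_t memspace  = H5Screate_simple(3, local, NULL);
    hid_t dset = H5Dcreate(file, "/fields/u", H5T_NATIVE_DOUBLE, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Select this rank's hyperslab of the file dataspace. */
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, local, NULL);

    /* Collective transfer: all ranks contribute to one N-1 write. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, u_local);

    H5Pclose(dxpl); H5Dclose(dset);
    H5Sclose(memspace); H5Sclose(filespace);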

Considerations for the Cray XT System Kraken
- Lustre file system with ~160 OSTs(?).
- Experience shows that collective N-1 I/O can lead to pathological performance degradation.
  – This is often due to network contention introduced when processes overlap the Lustre stripe boundaries; in effect, each process writes to several OSTs rather than striking a balance where all OSTs work in concert.

Lustre and MPI-IO: ADIO to the Rescue
- Recent additions to the MPICH ROMIO layer, upon which most MPI distributions are based, allow the Lustre stripe size and count to be set via the MPI_Info object (example below).
  – Setting these parameters sets the actual Lustre striping parameters on the new file when it is created. The only other ways to do this are to set the striping parameters ahead of time on the I/O directory, or to use the Lustre C library and C-style open calls.
  – Perhaps equally important, these parameters are used by the MPI-IO collective buffering stack. The MPI hint “CO” is introduced to set the client-to-OST ratio.
  – These improvements directly address the shortcomings of N-1 file I/O cited by the developers of PLFS and ADIOS. But do they work?
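A sketch of passing these hints through HDF5, assuming a ROMIO/Cray MPI-IO stack that honors them. The hint values are illustrative, and romio_cb_write is a generic ROMIO collective-buffering hint included here only for completeness; see the man-page excerpts below for the Cray-specific behavior.

    /* Attach Lustre striping hints to the MPI_Info object used by the
       HDF5 file access property list. Values are examples only. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "64");     /* stripe count (OSTs)   */
    MPI_Info_set(info, "striping_unit", "4194304");  /* 4 MiB stripe size     */
    MPI_Info_set(info, "romio_cb_write", "enable");  /* collective buffering  */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);

    /* Striping hints only take effect when the file is created,
       not when an existing file is opened. */
    hid_t file = H5Fcreate("dns_output.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    MPI_Info_free(&info);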

A Note of Caution for Lustre

Lustre MPI-IO Tweaks
From “man mpi” on Kraken:
MPICH_MPIIO_CB_ALIGN – If set to 2, an algorithm is used to divide the I/O workload into Lustre stripe-sized pieces and assign them to collective buffering nodes (aggregators), so that each aggregator always accesses the same set of stripes and no other aggregator accesses those stripes. This is generally the optimal collective buffering mode, as it minimizes Lustre file system extent lock contention and thus reduces system I/O time. However, the overhead associated with dividing the I/O workload can in some cases exceed the time otherwise saved by using this method.

striping_factor – Specifies the number of Lustre file system stripes (stripe count) to assign to the file. This has no effect if the file already exists when the MPI_File_open() call is made; file striping cannot be changed after a file is created. Currently this hint applies only when MPICH_MPIIO_CB_ALIGN is set to 2. Default: the default value for the Lustre file system, or the value for the directory in which the file is created if the lfs setstripe command was used to set the stripe count of the directory to a value other than the system default.

striping_unit – Specifies in bytes the size of the Lustre file system stripes (stripe size) assigned to the file. This has no effect if the file already exists when the MPI_File_open() call is made; file striping cannot be changed after a file is created. Currently this hint applies only when MPICH_MPIIO_CB_ALIGN is set to 2. Default: the default value for the Lustre file system, or the value for the directory in which the file is created if the lfs setstripe command was used to set the stripe size of the directory to a value other than the system default.

Conclusions
- Log file formats may be the best approach to optimizing I/O performance if performance is the only goal.
- More traditional file formats may not reach the performance peaks of a log file format, but they are more usable after the run.
- Recent Lustre ADIO improvements provide tools that mitigate the poor performance of N-1 I/O patterns enough to still justify their use.
- Additional improvements can be made when approaching peta-scale while maintaining a consistent file format:
  – Deploy separate I/O aggregator processes with (perhaps) very large buffer settings.
  – Implement an “N-M” approach where the single file is broken up, but not all the way to N-N per-process files. The internal HDF5 file structure and linking can be used to preserve the “global”, process-independent dataset (see the sketch below).
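One way the N-M linking idea could be realized is with HDF5 external links, so that a small master file presents a single entry point over M sub-files. Everything in this sketch (file names, link names, the dataset path "/u", and the helper function) is hypothetical, not part of the DNS code.

    #include <stdio.h>
    #include <hdf5.h>

    /* Sketch: a master file stitches M sub-files into one logical namespace. */
    void link_subfiles(int M) {
        hid_t master = H5Fcreate("fields_master.h5", H5F_ACC_TRUNC,
                                 H5P_DEFAULT, H5P_DEFAULT);
        char target[64], linkname[64];
        for (int m = 0; m < M; m++) {
            snprintf(target,   sizeof target,   "part_%d.h5", m);
            snprintf(linkname, sizeof linkname, "/u_part_%d",  m);
            /* The external link resolves to dataset "/u" inside part_m.h5,
               so readers open only the master file. */
            H5Lcreate_external(target, "/u", master, linkname,
                               H5P_DEFAULT, H5P_DEFAULT);
        }
        H5Fclose(master);
    }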

Future Work
- Develop routines for complete state dumps for DNS restarts using HDF5 and an N-1 approach.
  – The single file should be readable from different process counts and across platforms.
  – Pull out all the stops to achieve peak write performance.
  – Is a log file format or an I/O imposition layer really needed?
- Document optimal settings for large-scale grids on Kraken at 1024 processes and beyond.

References
- PLFS: A Checkpoint Filesystem for Parallel Applications
- ADIOS (ADaptable IO System)
- Lustre Technical White Paper: Lustre ADIO Collective Write Driver (Driver_Whitepaper_0926.pdf)