Parallel I/O Performance Study and Optimizations with HDF5, A Scientific Data Package
Christian Chilan, Kent Yang, Albert Cheng, Quincey Koziol, Leon Arber
National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign

2 Outline
HDF5 and PnetCDF performance comparison
Performance of collective and independent I/O operations
Collective I/O optimizations inside HDF5

3 Introduction
HDF5 and netCDF provide data formats and programming interfaces.
HDF5:
   Hierarchical file structures
   High flexibility for data management
NetCDF:
   Linear data layout
   Self-descriptive and minimizes overhead
HDF5 and a parallel version of netCDF from ANL/Northwestern U. (PnetCDF) provide support for parallel access on top of MPI-IO.
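For readers who want to see what "parallel access on top of MPI-IO" means in code, the sketch below (not part of the original slides; the file name is a placeholder) opens an HDF5 file through the MPI-IO virtual file driver:

    /* Sketch: create an HDF5 file for parallel access via MPI-IO. */
    #include <mpi.h>
    #include <hdf5.h>

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);

        /* File access property list that routes HDF5 I/O through MPI-IO */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

        /* All ranks create/open the file together */
        hid_t file = H5Fcreate("parallel.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* ... dataset creation and (collective or independent) writes ... */

        H5Fclose(file);
        H5Pclose(fapl);
        MPI_Finalize();
        return 0;
    }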

4 HDF5 and PnetCDF performance comparison
Previous study:
   PnetCDF achieves higher performance than HDF5
Test platforms:
NCAR Bluesky
   IBM Cluster 1600
   AIX 5.1, GPFS
   Power4
LLNL uP
   IBM SP
   AIX 5.3, GPFS
   Power5
(Chart: PnetCDF vs. HDF5)

5 HDF5 and PnetCDF performance comparison
Benchmark is the I/O kernel of FLASH.
FLASH I/O generates 3D blocks of size 8x8x8 on Bluesky and 16x16x16 on uP.
Each processor handles 80 blocks and writes them into 3 output files.
The performance metric given by FLASH I/O is the parallel execution time.
The more processors, the larger the amount of data.
Collective vs. independent I/O operations

6 HDF5 and PnetCDF performance comparison
(Charts: results on Bluesky and uP)

7 HDF5 and PnetCDF performance comparison
(Charts: results on Bluesky and uP)

8 Performance of collective and independent I/O operations
Effects of using contiguous and chunked layout storage.
Four test cases:
   Writing a rectangular selection of double floats.
   Different access patterns and I/O types.
Data is stored in row-major order.
The tests were executed on NCAR Bluesky using HDF5 version
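The collective/independent switch in these tests is made on the dataset transfer property list. A minimal sketch, assuming the dataset, dataspaces, and write buffer have been set up elsewhere and the file was opened with the MPI-IO driver:

    /* Sketch: write a buffer of doubles with either collective or independent
     * MPI-IO.  Only the transfer property list differs between the two modes. */
    #include <hdf5.h>

    static herr_t write_doubles(hid_t dset, hid_t memspace, hid_t filespace,
                                const double *buf, int use_collective)
    {
        hid_t  dxpl = H5Pcreate(H5P_DATASET_XFER);
        herr_t status;

        H5Pset_dxpl_mpio(dxpl, use_collective ? H5FD_MPIO_COLLECTIVE
                                              : H5FD_MPIO_INDEPENDENT);
        status = H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace,
                          dxpl, buf);
        H5Pclose(dxpl);
        return status;
    }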

9 Contiguous storage (Case 1)
Every processor has a noncontiguous selection.
Access requests are interleaved.
Write operation with 32 processors; each processor selection has 512K rows and 8 columns (32 MB/proc.)
   Independent I/O: 1, s.
   Collective I/O: 4.33 s.
(Diagram: within each row, blocks belonging to P0, P1, P2, P3 are interleaved.)
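One way to express the Case 1 pattern, where each processor selects a narrow block of columns that is non-contiguous in the row-major file layout, is a hyperslab selection on the file dataspace. A sketch with placeholder dimensions and a hypothetical helper name:

    /* Sketch: rank r selects all rows of columns
     * [r*ncols_per_proc, (r+1)*ncols_per_proc), which is non-contiguous in a
     * row-major 2-D dataset and interleaved across ranks. */
    #include <hdf5.h>

    static herr_t select_column_block(hid_t filespace, int mpi_rank,
                                      hsize_t nrows, hsize_t ncols_per_proc)
    {
        hsize_t start[2] = { 0, (hsize_t)mpi_rank * ncols_per_proc };
        hsize_t count[2] = { nrows, ncols_per_proc };

        return H5Sselect_hyperslab(filespace, H5S_SELECT_SET,
                                   start, NULL, count, NULL);
    }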

10 Contiguous storage (Case 2)
No interleaved requests to combine.
Contiguity is already established.
Write operation with 32 processors where each processor selection has 16K rows and 256 columns (32 MB/proc.)
   Independent I/O: 3.64 s.
   Collective I/O: 3.87 s.
(Diagram: each processor's selection is a contiguous block of rows, P0 through P3.)
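The Case 2 counterpart is a selection of whole rows per processor, which is already contiguous in the row-major file layout. Again a sketch with placeholder names and dimensions:

    /* Sketch: rank r selects rows [r*rows_per_proc, (r+1)*rows_per_proc)
     * across all columns, a contiguous byte range in a row-major dataset. */
    #include <hdf5.h>

    static herr_t select_row_block(hid_t filespace, int mpi_rank,
                                   hsize_t rows_per_proc, hsize_t ncols)
    {
        hsize_t start[2] = { (hsize_t)mpi_rank * rows_per_proc, 0 };
        hsize_t count[2] = { rows_per_proc, ncols };

        return H5Sselect_hyperslab(filespace, H5S_SELECT_SET,
                                   start, NULL, count, NULL);
    }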

11 Chunked storage (Case 3)
The chunk size does not evenly divide the array size.
If information about the access pattern is given, MPI-IO can perform efficient independent access.
Write operation with 32 processors; each processor selection has 16K rows and 256 columns (32 MB/proc.)
   Independent I/O: s.
   Collective I/O: s.
(Diagram: the dataset is split into chunk 0 and chunk 1, with the selections of P0-P3 laid out across them.)
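Chunked storage is requested at dataset creation time through the dataset creation property list. A sketch with hypothetical names and sizes:

    /* Sketch: create a 2-D chunked dataset of doubles.  The chunk dimensions
     * are set on the dataset creation property list before H5Dcreate2. */
    #include <hdf5.h>

    static hid_t create_chunked_dataset(hid_t file,
                                        hsize_t nrows, hsize_t ncols,
                                        hsize_t chunk_rows, hsize_t chunk_cols)
    {
        hsize_t dims[2]  = { nrows, ncols };
        hsize_t chunk[2] = { chunk_rows, chunk_cols };
        hid_t   space    = H5Screate_simple(2, dims, NULL);
        hid_t   dcpl     = H5Pcreate(H5P_DATASET_CREATE);
        hid_t   dset;

        H5Pset_chunk(dcpl, 2, chunk);   /* storage is split into fixed-size chunks */
        dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, space,
                          H5P_DEFAULT, dcpl, H5P_DEFAULT);

        H5Pclose(dcpl);
        H5Sclose(space);
        return dset;
    }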

12 Chunked storage (Case 4)
The array is partitioned exactly into two chunks.
Write operation with 32 processors; each chunk is selected by 4 processors, and each processor selection has 256K rows and 32 columns (64 MB/proc.)
   Independent I/O: 7.66 s.
   Collective I/O: 8.68 s.
(Diagram: chunk 0 is selected by P0-P3 and chunk 1 by P4-P7.)

13 Conclusions
Performance is quite comparable when the two libraries are used in similar manners.
Collective access should be preferred in most situations when using HDF5.
Use of chunked storage may affect the contiguity of the data selection.

14 Improvements of collective I/O support inside HDF5
Advanced HDF5 feature: non-regular selections
Performance optimizations for chunked storage:
   Provide several I/O options to achieve good collective I/O performance
   Provide APIs for applications to participate in the optimization process

15 Improvement 1: non-regular selections
Only one HDF5 I/O call is needed.
Good for collective I/O.
Are there any such applications?
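A non-regular selection can be built by OR-ing several hyperslabs into one file dataspace, so that a single (collective) H5Dwrite covers the whole pattern. A sketch with a hypothetical helper and block list:

    /* Sketch: select nblocks rectangular 2-D blocks, given by starts[i] and
     * counts[i], as one union selection; the blocks need not form a regular
     * grid. */
    #include <hdf5.h>

    static herr_t select_irregular(hid_t filespace, int nblocks,
                                   const hsize_t starts[][2],
                                   const hsize_t counts[][2])
    {
        herr_t status = H5Sselect_hyperslab(filespace, H5S_SELECT_SET,
                                            starts[0], NULL, counts[0], NULL);

        for (int i = 1; i < nblocks && status >= 0; i++)
            status = H5Sselect_hyperslab(filespace, H5S_SELECT_OR,
                                         starts[i], NULL, counts[i], NULL);
        return status;
    }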

16 Performance issues for chunked storage
Previous support:
   Collective I/O can only be done per chunk
Problem:
   Severe performance penalties with many small chunk I/Os

17 Improvement 2: One linked-chunk I/O
(Diagram: the selections of P0 and P1 across chunks 1-4 are combined into a single MPI-IO collective view.)

18 Improvement 3: Multi-chunk I/O optimization
The option to do collective I/O per chunk has to be kept because of:
   Collective I/O bugs inside different MPI-IO packages
   Limitation of system memory
Problem:
   Bad performance caused by improper use of collective I/O
(Diagram: selections of processes P0-P8 spread across the chunks.)

19 Improvement 4
Problem:
   HDF5 may make the wrong decision about the way to do collective I/O
Solution:
   Provide APIs for applications to participate in the decision-making process
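Hints of this kind live on the dataset transfer property list. The sketch below summarizes the knobs as documented in the H5P reference; exact names, defaults, and threshold semantics should be verified against the HDF5 release in use, and the threshold values shown are arbitrary examples:

    /* Sketch: steer HDF5's collective-chunk decisions from the application.
     * Requires a parallel (MPI-enabled) HDF5 build. */
    #include <hdf5.h>

    static hid_t make_tuned_dxpl(int force_one_link)
    {
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);

        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

        if (force_one_link) {
            /* Force one linked-chunk I/O; H5FD_MPIO_CHUNK_MULTI_IO would
             * instead force collective I/O chunk by chunk. */
            H5Pset_dxpl_mpio_chunk_opt(dxpl, H5FD_MPIO_CHUNK_ONE_IO);
        } else {
            /* Let HDF5 decide, but tune its thresholds: use linked-chunk I/O
             * once a process touches roughly 30 or more chunks ... */
            H5Pset_dxpl_mpio_chunk_opt_num(dxpl, 30);
            /* ... and, per chunk in multi-chunk mode, use collective I/O only
             * when at least 50% of the processes access that chunk. */
            H5Pset_dxpl_mpio_chunk_opt_ratio(dxpl, 50);
        }
        return dxpl;
    }

These knobs map onto the decision points in the flow chart on slide 20.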

20 Flow chart of collective chunking I/O improvements inside HDF5
(Flowchart: in collective chunk mode, optionally guided by user input, HDF5 first decides whether to use one linked-chunk I/O. If yes, it issues one collective I/O call for all chunks; if not, it falls back to multi-chunk I/O and then decides, per chunk, between collective I/O and independent I/O.)

21 Conclusions
HDF5 provides collective I/O support for non-regular selections.
Supporting collective I/O for chunked storage is not trivial; several I/O options are provided.
Users can participate in the decision-making process that selects among the I/O options.

22 Acknowledgments
This work is funded by National Science Foundation TeraGrid grants, the Department of Energy's ASC Program, the DOE SciDAC program, NCSA, and NASA.