
The HDF Group
Parallel HDF5 Design and Programming Model
HDF5 Workshop at PSI, May 30-31, 2012

Advantage of parallel HDF5
Recent success story: a trillion-particle simulation on 120,000 cores wrote a 30 TB file at an average speed of 23 GB/sec, with peaks of 35 GB/sec (out of a 40 GB/sec maximum for the system). Parallel HDF5 rocks (when used properly)!

Outline
- Overview of Parallel HDF5 design
- Parallel environment requirements
- Performance analysis
- Parallel tools
- PHDF5 programming model
- Examples

OVERVIEW OF PARALLEL HDF5 DESIGN

PHDF5 requirements
- PHDF5 should allow multiple processes to perform I/O to an HDF5 file at the same time
  - Single file image for all processes
  - Compare with a one-file-per-process design: expensive post-processing, not usable by a different number of processes, and too many files produced for the file system
- PHDF5 should use a standard parallel I/O interface
- Must be portable to different platforms

PHDF5 requirements (continued)
- Support the Message Passing Interface (MPI) programming model
- PHDF5 files are compatible with serial HDF5 files and shareable between different serial or parallel platforms

Parallel environment requirements
- MPI with POSIX I/O (HDF5 MPI POSIX driver)
  - POSIX-compliant file system
- MPI with MPI-IO (HDF5 MPI I/O driver)
  - MPICH2 ROMIO
  - Vendor's MPI-IO
  - Parallel file system: GPFS (General Parallel File System), Lustre

PHDF5 implementation layers (top to bottom)
- HDF5 application on a compute node
- HDF5 I/O library
- MPI I/O library
- HDF5 file on a parallel file system (switch network / I/O servers)
- Disk architecture and layout of data on disk

PHDF5 CONSISTENCY SEMANTICS

Consistency semantics
Consistency semantics are rules that define the outcome of multiple, possibly concurrent, accesses to an object or data structure by one or more processes in a computer system.

PHDF5 consistency semantics
The PHDF5 library defines a set of consistency semantics to let users know what to expect when processes access data managed by the library: when the changes a process makes become visible to itself (if it tries to read back that data) or to other processes that access the same file with independent or collective I/O operations. The consistency semantics vary depending on the driver used:
- MPI-POSIX
- MPI I/O

HDF5 MPI-POSIX consistency semantics
Same as POSIX semantics.

  Process 0           Process 1
  write()
  MPI_Barrier()       MPI_Barrier()
                      read()

POSIX I/O guarantees that Process 1 will read what Process 0 has written: the atomicity of the read and write calls, together with the synchronization using the barrier, ensures that Process 1 calls the read function after Process 0 has finished with the write function.
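A minimal C sketch of this pattern (not from the slides): plain POSIX I/O plus an MPI barrier; the file name shared.dat and the buffer contents are illustrative, and error checking is omitted.

  /* Process 0 writes, everyone synchronizes, Process 1 reads. */
  #include <mpi.h>
  #include <fcntl.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
      int  rank;
      char buf[8] = "PSI";

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {                       /* Process 0: write() */
          int fd = open("shared.dat", O_CREAT | O_WRONLY, 0644);
          write(fd, buf, sizeof(buf));
          close(fd);
      }

      MPI_Barrier(MPI_COMM_WORLD);           /* synchronize all processes */

      if (rank == 1) {                       /* Process 1: read() */
          int fd = open("shared.dat", O_RDONLY);
          read(fd, buf, sizeof(buf));
          close(fd);
      }

      MPI_Finalize();
      return 0;
  }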

HDF5 MPI-I/O consistency semantics
Same as MPI-I/O semantics. The default MPI-I/O semantics do not guarantee atomicity or the ordering of calls, so problems may occur (although we haven't seen any) when writing/reading HDF5 metadata or raw data.

  Process 0               Process 1
  MPI_File_write_at()
  MPI_Barrier()           MPI_Barrier()
                          MPI_File_read_at()

HDF5 MPI-I/O consistency semantics (continued)
MPI I/O provides atomicity and sync-barrier-sync features to address the issue, and PHDF5 follows MPI I/O:
- H5Fset_mpi_atomicity function to turn on MPI atomicity (see the sketch below)
- H5Fsync function to transfer written data to the storage device (in implementation now)
- We are currently working on a reimplementation of the metadata cache for PHDF5 (metadata server)
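A hedged sketch of the atomicity call (file_id is assumed to be a file opened with the MPI-IO driver; error checking omitted):

  /* Turn on MPI atomic mode for the open parallel file, then use a
     barrier to order one process's write before another's read. */
  hbool_t atomic = 1;                          /* 1 = atomic mode on */
  H5Fset_mpi_atomicity(file_id, atomic);

  /* ... process 0 writes with H5Dwrite() ... */
  MPI_Barrier(MPI_COMM_WORLD);                 /* sync point */
  /* ... process 1 reads with H5Dread() ...  */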

HDF5 MPI-I/O consistency semantics (continued)
For more information see "Enabling a Strict Consistency Semantics Model in Parallel HDF5", linked from the H5Fset_mpi_atomicity reference manual page.

MPI-I/O VS. PHDF5

MPI-IO vs. HDF5
MPI-IO is an input/output API. It treats the data file as a "linear byte stream", and each MPI application needs to provide its own file view and data representations to interpret those bytes.
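For illustration, a minimal MPI-IO sketch (the file name raw.dat and buffer size are illustrative, not from the slides): the application itself must install a file view that tells MPI how to interpret the bytes.

  /* Raw MPI-IO: the program supplies the file view and datatype. */
  MPI_File fh;
  double   buf[100] = {0};
  int      count    = 100;

  MPI_File_open(MPI_COMM_WORLD, "raw.dat",
                MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
  /* interpret the byte stream as doubles starting at offset 0 */
  MPI_File_set_view(fh, 0, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);
  MPI_File_write_all(fh, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);
  MPI_File_close(&fh);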

MPI-IO vs. HDF5 (continued)
With MPI-IO, all data stored are machine dependent except with the "external32" representation:
- external32 is defined in big-endianness
- Little-endian machines have to do data conversion in both read and write operations
- 64-bit-sized datatypes may lose information

MPI-IO vs. HDF5 (continued)
HDF5 is data management software:
- It stores data and metadata according to the HDF5 data format definition
- An HDF5 file is self-describing
- Each machine can store the data in its own native representation for efficient I/O without loss of data precision
- Any necessary data representation conversion is done automatically by the HDF5 library

PERFORMANCE ANALYSIS

Performance analysis
- Some common causes of poor performance
- Possible solutions

My PHDF5 application I/O is slow
- Use larger I/O data sizes
- Independent vs. collective I/O
- Specific I/O system hints
- Increase parallel file system capacity

Write speed vs. block size (chart)

My PHDF5 application I/O is slow
- Use larger I/O data sizes
- Independent vs. collective I/O
- Specific I/O system hints
- Increase parallel file system capacity

Independent vs. collective access
A user reported that independent data transfer mode was much slower than collective data transfer mode. The data array was tall and thin: 230,000 rows by 6 columns.

Collective vs. independent calls
MPI definition of collective calls: all processes of the communicator must participate in calls, in the right order. E.g.:

  Process 1             Process 2
  call A(); call B();   call A(); call B();   **right**
  call A(); call B();   call B(); call A();   **wrong**

Independent means not collective. Collective is not necessarily synchronous.

Debug slow parallel I/O speed (1)
Writing to one dataset:
- using 4 processes == 4 columns
- datatype is 8-byte doubles
- 4 processes, 1000 rows == 4 x 1000 x 8 = 32,000 bytes

  % mpirun -np 4 ./a.out i t        # first run
  Execution time: s.
  % mpirun -np 4 ./a.out i t        # second run, 1000 more rows
  Execution time: s.

A difference of 2 seconds for 1000 more rows = 32,000 bytes: a speed of 16 KB/sec. Way too slow.

Debug slow parallel I/O speed (2)
Build a version of PHDF5 with:

  ./configure --enable-debug --enable-parallel ...

This allows the tracing of MPI-IO calls in the HDF5 library. E.g., to trace MPI_File_read_xx and MPI_File_write_xx calls:

  % setenv H5FD_mpio_Debug "rw"

Debug slow parallel I/O speed (3)

  % setenv H5FD_mpio_Debug 'rw'
  % mpirun -np 4 ./a.out i t 1000    # Indep.; contiguous.
  in H5FD_mpio_write mpi_off=0    size_i=96
  in H5FD_mpio_write mpi_off=2056 size_i=8
  in H5FD_mpio_write mpi_off=2048 size_i=8
  in H5FD_mpio_write mpi_off=2072 size_i=8
  in H5FD_mpio_write mpi_off=2064 size_i=8
  in H5FD_mpio_write mpi_off=2088 size_i=8
  in H5FD_mpio_write mpi_off=2080 size_i=8
  ...

A total of 4000 of these little 8-byte writes == 32,000 bytes.

Independent calls are many and small
Each process writes one element of one row, skips to the next row, writes one element, and so on. Each process issues 230,000 writes of 8 bytes each.

Debug slow parallel I/O speed (4)

  % setenv H5FD_mpio_Debug 'rw'
  % mpirun -np 4 ./a.out i h 1000    # Indep., chunked by column.
  in H5FD_mpio_write mpi_off=0     size_i=96
  in H5FD_mpio_write mpi_off=3688  size_i=8000
  in H5FD_mpio_write mpi_off=11688 size_i=8000
  in H5FD_mpio_write mpi_off=27688 size_i=8000
  in H5FD_mpio_write mpi_off=19688 size_i=8000
  in H5FD_mpio_write mpi_off=96    size_i=40
  in H5FD_mpio_write mpi_off=136   size_i=544
  in H5FD_mpio_write mpi_off=680   size_i=120
  in H5FD_mpio_write mpi_off=800   size_i=272
  ...
  Execution time: s.

Use collective mode or chunked storage
Collective I/O will combine many small independent calls into a few bigger calls. Chunking by columns speeds things up too (see the sketch below).
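A hedged sketch of the chunked-by-column plus collective-transfer approach for the tall, thin array (the dataset name tall_thin and file_id are assumptions, the 1.8 H5Dcreate signature is used, and error checking is omitted):

  /* One column per chunk, and a collective transfer property list. */
  hsize_t dims[2]  = {230000, 6};
  hsize_t chunk[2] = {230000, 1};               /* one column per chunk */

  hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
  H5Pset_chunk(dcpl, 2, chunk);

  hid_t space = H5Screate_simple(2, dims, NULL);
  hid_t dset  = H5Dcreate(file_id, "tall_thin", H5T_NATIVE_DOUBLE, space,
                          H5P_DEFAULT, dcpl, H5P_DEFAULT);

  hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
  H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE); /* combine small writes */

  /* each process then selects its part and calls
     H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, space, dxpl, buf); */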

Collective vs. independent write (chart)

My PHDF5 application I/O is slow
- Use larger I/O data sizes
- Independent vs. collective I/O
- Specific I/O system hints
- Increase parallel file system capacity

GPFS at LLNL
- ASCI Blue machine: 4 nodes, 16 tasks
- Total data size 1024 MB, I/O buffer size 1 MB
- Effect of the I/O hint IBM_largeblock_io on write speed (chart, MB/sec)

My PHDF5 application I/O is slow
If my application's I/O performance is slow, what can I do?
- Use larger I/O data sizes
- Independent vs. collective I/O
- Specific I/O system hints
- Increase parallel file system capacity: add more disks, interconnect, or I/O servers to your hardware setup (expensive!)

PARALLEL TOOLS

Parallel tools
- h5perf: performance measuring tool showing I/O performance for different I/O APIs

h5perf
An I/O performance measurement tool. It tests three file I/O APIs:
- POSIX I/O (open/write/read/close...)
- MPI-IO (MPI_File_{open,write,read,close})
- PHDF5
  - H5Pset_fapl_mpio (using MPI-IO)
  - H5Pset_fapl_mpiposix (using POSIX I/O)
It gives an indication of I/O speed upper limits.

Useful parallel HDF links
- Parallel HDF information site
- Parallel HDF5 tutorial
- HDF help address

HDF5 PROGRAMMING MODEL

How to compile PHDF5 applications
- h5pcc: HDF5 C compiler command (similar to mpicc)
- h5pfc: HDF5 Fortran 90 compiler command (similar to mpif90)
To compile:

  % h5pcc h5prog.c
  % h5pfc h5prog.f90
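A typical build-and-run sequence might look like this (the output name h5prog and the process count are illustrative; h5pcc is assumed to pass compiler flags such as -o through to the underlying compiler):

  % h5pcc -o h5prog h5prog.c
  % mpirun -np 4 ./h5prog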

Programming restrictions
- PHDF5 opens a parallel file with a communicator
  - Returns a file handle
  - Future access to the file is via the file handle
- All processes must participate in collective PHDF5 APIs
- Different files can be opened via different communicators

Collective HDF5 calls
All HDF5 APIs that modify structural metadata are collective:
- File operations: H5Fcreate, H5Fopen, H5Fclose
- Object creation: H5Dcreate, H5Dclose
- Object structure modification (e.g., dataset extent modification): H5Dextend
See the collective calling requirements page (CollectiveCalls.html) in the HDF5 documentation for the full list.

Other HDF5 calls
- Array data transfer can be collective or independent
  - Dataset operations: H5Dwrite, H5Dread
- Collectiveness is indicated by function parameters, not by function names as in the MPI API (see the sketch below)
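A small sketch of this point: the same H5Dwrite call is used for both modes, and only the transfer property list changes (dset_id, mem_space, file_space, and buf are assumed to exist; error checking omitted):

  hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
  H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);   /* or H5FD_MPIO_INDEPENDENT */
  H5Dwrite(dset_id, H5T_NATIVE_DOUBLE, mem_space, file_space, dxpl, buf);
  H5Pclose(dxpl);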

What does PHDF5 support?
After a file is opened by the processes of a communicator:
- All parts of the file are accessible by all processes
- All objects in the file are accessible by all processes
- Multiple processes may write to the same data array
- Each process may write to an individual data array

PHDF5 API languages
- C and Fortran 90/2003 language interfaces
- Platforms supported: most platforms with MPI-IO support, e.g. IBM AIX, Linux clusters, Cray XT

Programming model
HDF5 uses an access template object (property list) to control the file access mechanism. The general model for accessing an HDF5 file in parallel:
- Set up the MPI-IO access template (file access property list)
- Open the file
- Access the data
- Close the file

MY FIRST PARALLEL HDF5 PROGRAM
Moving your sequential application to the HDF5 parallel world

Example of a PHDF5 C program
A parallel HDF5 program has a few extra calls:

  MPI_Init(&argc, &argv);
  1. fapl_id = H5Pcreate(H5P_FILE_ACCESS);
  2. H5Pset_fapl_mpio(fapl_id, comm, info);
  3. file_id = H5Fcreate(FNAME, ..., fapl_id);
  4. space_id = H5Screate_simple(...);
  5. dset_id = H5Dcreate(file_id, DNAME, H5T_NATIVE_INT, space_id, ...);
  6. xf_id = H5Pcreate(H5P_DATASET_XFER);
  7. H5Pset_dxpl_mpio(xf_id, H5FD_MPIO_COLLECTIVE);
  8. status = H5Dwrite(dset_id, H5T_NATIVE_INT, ..., xf_id, ...);
  MPI_Finalize();

EXAMPLE
Writing patterns

Parallel HDF5 tutorial examples
For simple examples of how to write different data patterns, see the Parallel HDF5 tutorial.

Programming model
- Each process defines the memory and file hyperslabs using H5Sselect_hyperslab
- Each process executes a write/read call using the hyperslabs defined; the call is either collective or independent
- The hyperslab start, count, stride, and block parameters define the portion of the dataset to write to:
  - contiguous hyperslab
  - regularly spaced data (column or row)
  - pattern
  - chunks
A minimal code sketch follows.
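A hedged sketch of row-wise decomposition in this model, loosely matching the "four processes writing by rows" example on the next slide (mpi_rank, mpi_size, and file_id are assumed to have been set up earlier, 4 processes are assumed, the 1.8 H5Dcreate signature is used, and error checking is omitted):

  /* Each process selects a block of whole rows and writes collectively. */
  hsize_t dims[2]  = {8, 5};                        /* whole dataset       */
  hsize_t count[2] = {dims[0] / mpi_size, dims[1]}; /* rows per process    */
  hsize_t start[2] = {mpi_rank * count[0], 0};      /* this process's rows */

  hid_t filespace = H5Screate_simple(2, dims, NULL);
  hid_t dset      = H5Dcreate(file_id, "IntArray", H5T_STD_I32BE, filespace,
                              H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
  hid_t memspace  = H5Screate_simple(2, count, NULL);

  H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);

  hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
  H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

  int data[2][5];                                   /* 8 rows / 4 procs    */
  for (int i = 0; i < 2; i++)
      for (int j = 0; j < 5; j++)
          data[i][j] = 10 + mpi_rank;               /* as in SDS_row.h5    */

  H5Dwrite(dset, H5T_NATIVE_INT, memspace, filespace, dxpl, data);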

Four processes writing by rows

  HDF5 "SDS_row.h5" {
  GROUP "/" {
     DATASET "IntArray" {
        DATATYPE  H5T_STD_I32BE
        DATASPACE SIMPLE { ( 8, 5 ) / ( 8, 5 ) }
        DATA {
           10, 10, 10, 10, 10,
           10, 10, 10, 10, 10,
           11, 11, 11, 11, 11,
           11, 11, 11, 11, 11,
           12, 12, 12, 12, 12,
           12, 12, 12, 12, 12,
           13, 13, 13, 13, 13,
           13, 13, 13, 13, 13
        }
     }
  }
  }

Two processes writing by columns

  HDF5 "SDS_col.h5" {
  GROUP "/" {
     DATASET "IntArray" {
        DATATYPE  H5T_STD_I32BE
        DATASPACE SIMPLE { ( 8, 6 ) / ( 8, 6 ) }
        DATA {
           1, 2, 10, 20, 100, 200,
           1, 2, 10, 20, 100, 200,
           1, 2, 10, 20, 100, 200,
           1, 2, 10, 20, 100, 200,
           1, 2, 10, 20, 100, 200,
           1, 2, 10, 20, 100, 200,
           1, 2, 10, 20, 100, 200,
           1, 2, 10, 20, 100, 200
        }
     }
  }
  }

Four processes writing by pattern

  HDF5 "SDS_pat.h5" {
  GROUP "/" {
     DATASET "IntArray" {
        DATATYPE  H5T_STD_I32BE
        DATASPACE SIMPLE { ( 8, 4 ) / ( 8, 4 ) }
        DATA {
           1, 3, 1, 3,
           2, 4, 2, 4,
           1, 3, 1, 3,
           2, 4, 2, 4,
           1, 3, 1, 3,
           2, 4, 2, 4,
           1, 3, 1, 3,
           2, 4, 2, 4
        }
     }
  }
  }

Four processes writing by chunks

  HDF5 "SDS_chnk.h5" {
  GROUP "/" {
     DATASET "IntArray" {
        DATATYPE  H5T_STD_I32BE
        DATASPACE SIMPLE { ( 8, 4 ) / ( 8, 4 ) }
        DATA {
           1, 1, 2, 2,
           1, 1, 2, 2,
           1, 1, 2, 2,
           1, 1, 2, 2,
           3, 3, 4, 4,
           3, 3, 4, 4,
           3, 3, 4, 4,
           3, 3, 4, 4
        }
     }
  }
  }

The HDF Group
Thank You!
Questions?