Introduction to HDF5 Tutorial.

Goals
- Help new users get started with HDF5
- Introduce HDF5 concepts: data and programming models, terminology, major features
- Help everyone avoid HDF5 pitfalls
- Performance tuning is for everyone, not just for experts

HDF = Hierarchical Data Format

HDF5 is the second HDF format:
- Development started in 1996
- First release was in 1998
- Supported by The HDF Group

HDF4 is the first HDF format:
- Originally called HDF
- Development started in 1987
- Still supported by The HDF Group

April 29-30, 2009, HDF5 Technical Consulting Meeting for EMRG Program

HDF5 is like…

HDF5 is designed…
- for high volume and/or complex data
- for every size and type of system (portable)
- for flexible, efficient storage and I/O
- to enable applications to evolve in their use of HDF5 and to accommodate new models
- to support long-term data preservation

HDF5 Technology

HDF5 Data Model
- Defines the “building blocks” for data organization and specification: Files, Groups, Datasets, Attributes, Datatypes, Dataspaces, …

HDF5 Library (C, Fortran 90, C++ APIs, Java, Python, Julia, R)
- High Level Libraries

HDF5 Binary File Format
- Bit-level organization of an HDF5 logical file
- Defined by the HDF5 File Format Specification

Tools for Accessing Data in HDF5 Format
- h5dump, h5repack, HDFView, …

Where to start?

HDF5 Resources
- The HDF Group page: https://www.hdfgroup.org/
- HDF5 home page: https://support.hdfgroup.org/HDF5/ (software source code and binaries, documentation)
- Examples: https://support.hdfgroup.org/HDF5/examples/
- HDF Helpdesk: help@hdfgroup.org
- HDF mailing lists: http://support.hdfgroup.org/services

New Users
- Use Anaconda to install h5py and the HDF5 software
- h5dump: tool to “dump” or display the contents of HDF5 files
- If using other languages, use the scripts h5cc, h5c++, and h5fc to compile applications
- Other tools: h5ls, h5repack, h5copy

h5dump Utility

h5dump [options] [file]
  -H, --header    Display header only; no data
  -d <names>      Display the specified dataset(s)
  -g <names>      Display the specified group(s) and all members
  -p              Display properties

<names> is one or more appropriate object names.

Code: Create a file and a dataset (h5_crtdata.py)

>>> import h5py
>>> file = h5py.File('dset.h5', 'w')
>>> dataset = file.create_dataset("dset", (4, 6), h5py.h5t.STD_I32BE)
>>> ...
>>> dataset[...] = data
>>> file.close()

C code

#include "hdf5.h"
#define FILE "dset.h5"
....
hid_t file_id, dset_id, dspace_id;  /* identifiers */
....
file_id = H5Fcreate(FILE, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
dims[0] = 4;
dims[1] = 6;
dspace_id = H5Screate_simple(2, dims, NULL);
dset_id = H5Dcreate2(file_id, "/dset", H5T_STD_I32BE, dspace_id,
                     H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
H5Dwrite(dset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, dset_data);
H5Dclose(dset_id);
H5Sclose(dspace_id);
H5Fclose(file_id);

Example of h5dump Output

HDF5 "dset.h5" {
GROUP "/" {
   DATASET "dset" {
      DATATYPE { H5T_STD_I32BE }
      DATASPACE { SIMPLE ( 4, 6 ) / ( 4, 6 ) }
      DATA {
         1, 2, 3, 4, 5, 6,
         7, 8, 9, 10, 11, 12,
         13, 14, 15, 16, 17, 18,
         19, 20, 21, 22, 23, 24
      }
   }
}
}

HDF5 Tutorial and Examples
- HDF5 Tutorial: https://support.hdfgroup.org/HDF5/Tutor/
- HDF5 Example Code: https://support.hdfgroup.org/ftp/HDF5/examples/examples-by-api/

HDF5 Data Model

HDF5 File

An HDF5 file is a container that holds data objects, for example a table:

lat | lon | temp
----|-----|-----
 12 |  23 | 3.1
 15 |  24 | 4.2
 17 |  21 | 3.6

and experiment notes:

Serial Number: 99378920
Date: 3/13/09
Configuration: Standard 3

The two primary HDF5 objects are:
- HDF5 Group: a grouping structure containing zero or more HDF5 objects
- HDF5 Dataset: an array of data elements, together with information that describes them

(There are other HDF5 objects that help support Groups and Datasets.)

HDF5 Groups and Links

HDF5 groups and links organize data objects. (Diagram: the root group "/" links to the experiment notes (Serial Number: 99378920, Date: 3/13/09, Configuration: Standard 3) and to groups "Viz" and "SimOut", under which the lat | lon | temp table is stored.)

HDFView – HDF Data Browser in Java https://support.hdfgroup.org/tools/

HDF Compass – HDF Data Browser in Python https://support.hdfgroup.org/projects/compass/

Groups and Links

A group is a container for links of different types. The example h5_links.py creates a file links.h5 with two groups, “A” and “B”. It then creates a one-dimensional dataset “a” in group “A” (in the example it has no data). A hard link “a” to this dataset is added to the root group, along with a soft link “soft” whose value is “/A/a” and a dangling soft link “dangling”. An external link “External” is added to group B; it points to a dataset “dset” in a different file, dset.h5. The dataset can be reached using three paths: /A/a, /a, /soft.

HDF5 Workshop at PSI, May 30-31, 2012

Links

A link is a (Name, Value) pair.

Name:
- UTF-8 string; examples: “A”, “B”, “a”, “dangling”, “soft”
- Unique within a group; “/” is not allowed in names

Depending on the Value, links are called:
- Hard link: value is the object’s address in a file; created automatically when the object is created; can be added to point to an existing object
- Soft link: value is a string, for example “/A/a”; used to create aliases
- External link: value is a pair of strings, for example (“dset.h5”, “/dset”); used to access data in other HDF5 files
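The link types above can be sketched with h5py's high-level API. This is a minimal version of the h5_links.py example described earlier (object names follow that description; this is a sketch, not the original script):

```python
import h5py

# Create links.h5 with groups "A" and "B", a dataset "A/a", and one link
# of each kind in the root group (mirroring the h5_links.py description).
with h5py.File('links.h5', 'w') as f:
    grp_a = f.create_group('A')
    f.create_group('B')
    dset = grp_a.create_dataset('a', (10,), dtype='i4')  # hard link /A/a created automatically
    f['a'] = dset                              # extra hard link to the same object
    f['soft'] = h5py.SoftLink('/A/a')          # alias; the value is just the string "/A/a"
    f['dangling'] = h5py.SoftLink('/nowhere')  # dangling soft link: target need not exist
    f['B/External'] = h5py.ExternalLink('dset.h5', '/dset')  # (file, path) pair

# The same dataset is reachable through three paths: /A/a, /a, /soft
with h5py.File('links.h5', 'r') as f:
    assert f['/A/a'].shape == f['/a'].shape == f['/soft'].shape == (10,)
```

Note that creating the external link does not require dset.h5 to exist; it is resolved only when accessed.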

Datasets

HDF5 Datasets

HDF5 datasets organize and contain “raw data values”. They consist of:
- Data array
- Metadata describing the data array:
  - Datatype
  - Dataspace (shape)
  - Properties (characteristics of the data, e.g., compressed)
  - Attributes (additional, optional information that describes the data)

HDF5 Dataset

Data: an array, i.e., an ordered collection of identically typed data items distinguished by their indices (in the figure, a rank-3 integer array with dimensions 4 x 5 x 7).

Metadata:
- Dataspace: rank and dimensions; spatial information about the dataset
- Datatype: information on how to interpret your data
- Storage properties: how the array is organized (e.g., chunked, compressed)
- Attributes: user-defined metadata, e.g., Time = 32.4, Pressure = 987, Temp = 56 (optional)

HDF5 Dataspaces

An HDF5 dataspace can be one of the following:
- Array (or simple dataspace): multiple elements organized in a multi-dimensional (rectangular) array; the maximum number of elements in each dimension may be fixed or unlimited
- Scalar: a single element in the dataset
- NULL: no elements in the dataset
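In h5py the three dataspace kinds look like this (a sketch; the file and dataset names are illustrative):

```python
import h5py

with h5py.File('spaces.h5', 'w') as f:
    f.create_dataset('simple', (4, 6), dtype='i4')   # simple (array) dataspace
    f.create_dataset('scalar', data=3.14)            # scalar dataspace: one element, shape ()
    f.create_dataset('null', data=h5py.Empty('f'))   # NULL dataspace: no elements at all

with h5py.File('spaces.h5', 'r') as f:
    assert f['simple'].shape == (4, 6)
    assert f['scalar'].shape == ()
    assert f['null'].shape is None   # h5py reports an empty (NULL) dataset's shape as None
```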

HDF5 Dataspaces

Two roles:
- Spatial information (shape) of an array stored in a file: rank and dimensions; a permanent part of the dataset definition. Examples: rank = 2, dimensions = 4x6; rank = 1, dimension = 10.
- Partial I/O: a dataspace describes the application’s data buffer and the data elements participating in I/O.

HDF5 Hyperslabs and Data Selections

A hyperslab is a mechanism to describe elements for partial I/O. Selections are the result of set operations (defined by APIs) on hyperslabs. Definition of a hyperslab (everything is “measured” in number of elements):
- Start: starting location of a hyperslab, e.g., (1,1)
- Stride: number of elements that separate each block, e.g., (3,2)
- Count: number of blocks, e.g., (2,6)
- Block: block size, e.g., (2,1)
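In h5py, hyperslab selections for partial I/O map onto NumPy-style slicing (start:stop:step); the names and sizes below are illustrative:

```python
import numpy as np
import h5py

with h5py.File('sel.h5', 'w') as f:
    dset = f.create_dataset('d', (8, 10), dtype='i4')
    dset[...] = np.arange(80).reshape(8, 10)

    # Read a 2x3 block starting at element (1, 2): start=(1,2), block=(2,3)
    block = dset[1:3, 2:5]
    # Strided selection: stride 3 in the first dimension picks rows 0, 3, 6
    strided = dset[0:8:3, :]

assert block.shape == (2, 3)
assert block[0, 0] == 12          # element at (1, 2) of the 8x10 array
assert strided.shape == (3, 10)
```

Only the selected elements are read from the file, which is the point of partial I/O.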

HDF5 Datatypes

The HDF5 datatype describes how to interpret individual data elements. HDF5 datatypes include:
- integer, float, unsigned, bitfield, …
- user-definable (e.g., 13-bit integer)
- variable-length types (e.g., strings)
- references to objects/dataset regions
- enumerations (names mapped to integers)
- opaque
- compound (similar to C structs)

HDF5 Dataset with Compound Datatype

Dataspace: rank = 2, dimensions = 5 x 3.
Compound datatype with fields: int8, int4, int16, and a 2x3x2 array of float32.
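A compound datatype like this can be sketched in h5py with a NumPy structured dtype. Note that h5py/NumPy have no 4-bit integer, so only the int8 and int16 fields of the slide appear here, and the field names are invented for the example:

```python
import numpy as np
import h5py

# Fields roughly matching the slide: small integers plus a 2x3x2 float32 array
dt = np.dtype([('i8', np.int8), ('i16', np.int16),
               ('v', np.float32, (2, 3, 2))])

with h5py.File('compound.h5', 'w') as f:
    dset = f.create_dataset('comp', (5, 3), dtype=dt)   # rank 2, dimensions 5x3
    rec = np.zeros((), dtype=dt)
    rec['i16'] = 42
    dset[0, 0] = rec

with h5py.File('compound.h5', 'r') as f:
    assert f['comp'][0, 0]['i16'] == 42
    assert f['comp'].dtype['v'].shape == (2, 3, 2)
```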

HDF5 Dataset Storage Properties (layouts)
- Contiguous (default): data elements stored physically adjacent to each other
- Chunked: better access time for subsets; extensible
- Chunked & compressed: improves storage efficiency, transmission speed
- External: allows old data (e.g., an old binary file) to be used with new tools such as h5py

HDF5 Chunking

Chunking is required for several HDF5 features:
- Applying compression and other filters, such as checksum: FLETCHER32, SHUFFLE, SCALEOFFSET, NBIT, GZIP, SZIP (some licensing issues)
- Expanding/shrinking dataset dimensions and adding/“deleting” data

Copyright © 2015 The HDF Group. All rights reserved.
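Both features, filters and resizing, can be sketched in h5py when a dataset is created chunked (file and dataset names are illustrative):

```python
import numpy as np
import h5py

with h5py.File('chunked.h5', 'w') as f:
    # Chunked layout enables the gzip, shuffle, and fletcher32 filters,
    # and an unlimited (None) first dimension.
    dset = f.create_dataset('d', shape=(100, 100), maxshape=(None, 100),
                            chunks=(10, 100), dtype='f4',
                            compression='gzip', compression_opts=6,
                            shuffle=True, fletcher32=True)
    dset[...] = np.ones((100, 100), dtype='f4')
    dset.resize((150, 100))   # expanding works only because the dataset is chunked

with h5py.File('chunked.h5', 'r') as f:
    assert f['d'].chunks == (10, 100)
    assert f['d'].shape == (150, 100)
    assert f['d'].compression == 'gzip'
```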

HDF5 Chunking

Chunking improves partial I/O for big datasets: only the chunks that intersect the selection are involved in I/O (in the pictured example, only two chunks).

HDF5 Chunking Limitations
- Chunk dimensions cannot be bigger than the dataset dimensions, and the number of elements in a chunk is limited to 4G; H5Pset_chunk fails otherwise
- The total size of a chunk is limited to 4GB, where total size = (number of elements) * (size of the datatype); H5Dwrite fails later on if this is exceeded
- When chunks do not evenly divide the dataset, more data is written than the dataset holds; these ghost zones are filled with the fill value unless the fill value is disabled

HDF5 Attributes
- An HDF5 attribute has a name and a value
- Attributes typically contain user metadata
- Attributes may be associated with HDF5 groups, HDF5 datasets, and HDF5 named datatypes
- An attribute’s value is described by a datatype and a shape (dataspace)
- Attributes are analogous to datasets, except they are NOT extensible and they do NOT support compression or partial I/O
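Attaching attributes to a group and a dataset looks like this in h5py (a sketch; the attribute names and values are illustrative):

```python
import h5py

with h5py.File('attrs.h5', 'w') as f:
    dset = f.create_dataset('d', (4,), dtype='i4')
    f.attrs['experiment'] = 'calibration run'   # attribute on the root group
    dset.attrs['units'] = 'kelvin'              # attribute on a dataset
    dset.attrs['scale'] = 1.5                   # value has a datatype and a (scalar) dataspace

with h5py.File('attrs.h5', 'r') as f:
    assert f.attrs['experiment'] == 'calibration run'
    assert f['d'].attrs['units'] == 'kelvin'
```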

HDF5 Abstract Data Model Summary
- The objects in the data model are the “building blocks” for data organization and specification: Files, Groups, Links, Datasets, Datatypes, Dataspaces, Attributes, …
- Projects using HDF5 “map” their data concepts to these HDF5 objects

HDF5 Software

HDF5 Software Layers & Storage
- Tools: h5dump, h5repack, HDFView, …; High Level APIs
- HDF5 Library: language interfaces (C, Fortran, JNI, C++); HDF5 data model objects (Groups, Datasets, Attributes, …); tunable properties (chunk size, I/O driver, …)
- Internals: memory management, datatype conversion, chunked storage, version compatibility, filters, and so on
- Virtual File Layer: POSIX I/O, split files, MPI I/O, custom I/O drivers
- Storage (HDF5 file format): a file on a parallel filesystem, split files, or other storage

HDF5 Programming Model and APIs

Operations Supported by the API
- Create objects (groups, datasets, attributes, complex datatypes, …)
- Assign storage and I/O properties to objects
- Perform complex sub-setting during read/write
- Use a variety of I/O “devices” (parallel, remote, etc.)
- Transform data during I/O
- Make inquiries on file and object structure, content, and properties

General Programming Paradigm
- Properties of the object are optionally defined (creation properties, access properties)
- Object is opened or created
- Object is accessed, possibly many times
- Object is closed

The General HDF5 API
- C, Fortran 90, Java, and C++ bindings (part of the HDF5 distribution); h5py, Julia, R, Ada (community)
- C routines begin with the prefix H5?, where ? is a character corresponding to the type of object the function acts on. Example functions:
  H5D: Dataset interface, e.g., H5Dread
  H5F: File interface, e.g., H5Fopen
  H5S: dataSpace interface, e.g., H5Sclose

Basic Functions
  H5Fcreate (H5Fopen)          create (open) a File
  H5Screate_simple/H5Screate   create a dataSpace
  H5Dcreate (H5Dopen)          create (open) a Dataset
  H5Dread, H5Dwrite            access a Dataset
  H5Dclose                     close a Dataset
  H5Sclose                     close a dataSpace
  H5Fclose                     close a File
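The same open/access/close life cycle can be sketched in h5py, mirroring the C calls above (this reuses the dset.h5 names from the earlier example; a sketch, not part of the original tutorial code):

```python
import numpy as np
import h5py

with h5py.File('dset.h5', 'w') as f:                   # H5Fcreate
    d = f.create_dataset('dset', (4, 6), dtype='>i4')  # H5Screate_simple + H5Dcreate
    d[...] = np.arange(1, 25).reshape(4, 6)            # H5Dwrite
# leaving the with-block closes everything             # H5Dclose / H5Sclose / H5Fclose

with h5py.File('dset.h5', 'r') as f:                   # H5Fopen
    data = f['dset'][...]                              # H5Dopen + H5Dread

assert data.shape == (4, 6) and data[3, 5] == 24
```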

Tuning for Performance

Groups and Links

Use a group’s creation properties to save space and boost performance.

Link storage types:
- Compact (in 1.8.* versions): used with a few members (default: fewer than 8)
- Dense: used with many members (default: more than 16)

Tunable size for the local heap: save space by providing an estimate of the storage required for link names.

Link names can be compressed: useful for many links with similar names (XXX-abc, XXX-d, XXX-efgh, etc.), but requires more time to compress/uncompress the data.

Hints
- Use the latest file format (see the H5Pset_libver_bounds function in the reference manual)
  - Saves space when creating a lot of groups in a file
  - Saves time when accessing many objects (>1000)
- Caution: tools built with HDF5 versions prior to 1.8.0 will not work on files created with this property
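In h5py, this hint corresponds to the libver option when opening a file (a sketch; the file and group names are illustrative):

```python
import h5py

# libver='latest' enables the new (1.8+) object formats, including the
# compact/dense group link storage; such files may be unreadable by
# tools built against HDF5 versions before 1.8.0.
with h5py.File('latest.h5', 'w', libver='latest') as f:
    for i in range(100):
        f.create_group('g%03d' % i)

with h5py.File('latest.h5', 'r') as f:
    assert len(f) == 100
    assert 'g042' in f
```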

Benchmark
- Create a file
- Create up to 10^6 groups, with one dataset in each group
- Compare file sizes and performance of HDF5 1.8.1 using the latest group format with the performance of HDF5 1.8.1 (default, old format) and 1.6.7
- Note: “default” 1.8.1 and 1.6.7 became very slow after 700,000 groups

Time to Open and Read a Dataset

This graph shows primarily that the access time with old-format groups grows almost linearly with the number of groups, while it is nearly constant with the new groups. At the upper end of the test, old groups are 2-3 orders of magnitude slower than new groups. (The plateau: the metadata cache felt enough pressure from cache misses that it resized itself larger; had the test continued, the curve would likely start sloping upward again.)

Time to Close the File

This graph again shows that performance of the new groups is relatively unaffected by the number of groups, though the difference is not as dramatic as for cold-cache access. The “old” group format, with a single huge local heap, is probably the reason: the heap is flushed to the file when the group closes. The “new” group format uses a fractal heap and never gets a single block of heap data that large; smaller heap blocks are evicted from the cache and flushed to disk as no more group entries are added to them. It could be that the new v2 B-tree is much more efficient than the older B*-trees, but that is probably a smaller effect.

File Size

This graph shows the greater space efficiency of the new compact groups. The new indexed groups are also more space efficient, but that does not make as much difference as the compact groups.

Datasets: Chunking and Compression

Chunked Dataset I/O

Data travels between the application memory space and the HDF5 file through the chunk cache and the filter pipeline:
- On write, datatype conversion is performed before the chunk is placed in the cache
- On read, datatype conversion is performed after the chunk’s data is placed in the application buffer
- A chunk is written to the file when it is evicted from the cache
- Compression and other filters are applied when a chunk is evicted from, or brought into, the cache

H5DOwrite_chunk: a 40x speedup in our benchmarks.

Example: How Chunking and Compression Can Kill Performance

Example of a chunking strategy: JPSS uses the granule (satellite scan) size as the chunk size. The dataset “ES_ImaginaryLW” is stored using 15 chunks of 2.9 MB each, with chunk size 4x30x9x717.

Problem

SCRIS_npp_d20140522_t0754579_e0802557_b13293_c20140522142425734814_noaa_pop.h5

DATASET "/All_Data/CrIS-SDR_All/ES_ImaginaryLW" {
   DATATYPE  H5T_IEEE_F32BE
   DATASPACE  SIMPLE { ( 60, 30, 9, 717 ) / ( H5S_UNLIMITED, H5S_UNLIMITED, H5S_UNLIMITED, H5S_UNLIMITED ) }
   STORAGE_LAYOUT {
      CHUNKED ( 4, 30, 9, 717 )
      SIZE 46461600
   }
}

The dataset is read once, by contiguous 1x1x1x717 selections, i.e., 717 elements 16200 times. Time to read the whole dataset:
- Compressed with GZIP level 6: ~345 seconds
- No compression: ~0.1 seconds

Solutions

Performance may depend on combinations of factors:
- I/O access patterns
- Chunk size and layout
- Chunk cache size
- Memory usage, compression, etc.

Pitfall: Chunks Too Small
- The file has too many chunks
- Extra metadata increases the file size
- Extra time to look up each chunk
- More I/O, since each chunk is stored independently

Larger chunks result in fewer chunk lookups, a smaller file size, and fewer I/O operations.

Pitfall: Chunks Too Large
- The entire chunk has to be read and uncompressed before performing any operation; great performance penalty for reading a small subset
- The entire chunk has to be in memory, which may cause the OS to page memory to disk, slowing down the entire system
- Use case: setting the chunk size to the whole dataset size to “enable” compression on a contiguous dataset

HDF5 Raw Data Chunk Cache
- The only raw data cache in HDF5
- The chunk cache is per dataset (1MB by default)
- Improves performance whenever the same chunks are read or written multiple times

HDF5 chunk cache documentation http://www.hdfgroup.org/HDF5/doc/Advanced.html

Chunk Doesn’t Fit into Cache

All data in the chunk is copied to the application buffer before the chunk is discarded. (Diagram: chunks Gran_1 … Gran_15 in the HDF5 file pass through the chunk cache into the application buffer one after another.)

Chunk Fits into Cache

On H5Dread, the chunk stays in the cache until all of its data has been read and copied; it is discarded to bring in a new chunk. (Diagram: chunks Gran_1 … Gran_15 in the HDF5 file are read through the chunk cache into the application buffer.)

Solution (Data Consumers): Increase the Chunk Cache Size

For our example dataset, we increased the chunk cache size to 3MB, big enough to hold one 2.95 MB chunk. Caution: a big chunk cache increases the application memory footprint.
- Compressed with GZIP level 6, 1MB cache (default): ~345 seconds
- Compressed with GZIP level 6, 3MB cache: ~0.37 seconds
- No compression, 1MB (default) or 3MB cache: ~0.09 seconds
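With h5py, the chunk cache can be enlarged via the rdcc_* parameters when opening a file. The 3 MB figure follows the slide's example; the file and dataset names here are illustrative:

```python
import numpy as np
import h5py

with h5py.File('cache.h5', 'w') as f:
    d = f.create_dataset('d', (64, 64), chunks=(16, 64), dtype='f4',
                         compression='gzip')
    d[...] = np.ones((64, 64), dtype='f4')

# rdcc_nbytes sets the raw data chunk cache size (the default is 1 MB);
# rdcc_nslots is the number of hash slots, ideally a prime much larger
# than the number of chunks that fit in the cache.
with h5py.File('cache.h5', 'r', rdcc_nbytes=3 * 1024**2, rdcc_nslots=10007) as f:
    data = f['d'][...]

assert data.shape == (64, 64) and data[0, 0] == 1.0
```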

Solution (Data Consumers): Change the Access Pattern

Keep the default cache size (1MB) and read the example dataset using a selection that corresponds to a whole chunk, 4x30x9x717:
- Compressed with GZIP level 6, selection 1x1x1x717: ~345 seconds
- Compressed with GZIP level 6, selection 4x30x9x717: ~0.36 seconds
- No compression, selection 1x1x1x717: ~0.1 seconds
- No compression, selection 4x30x9x717: ~0.04 seconds

Solution (Data Providers): Change the Chunk Size

Write the original files with a smaller chunk size. We recreated our example dataset using chunk size 1x30x9x717 (~0.74MB), kept the default cache size (1MB), and read by 1x1x1x717 selections 16200 times. Performance improved about 1000 times; compare with the original result:
- Compressed with GZIP level 6, chunk size 4x30x9x717: ~345 seconds
- Compressed with GZIP level 6, chunk size 1x30x9x717: ~0.36 seconds
- No compression, chunk size 4x30x9x717: ~0.04 seconds
- No compression, chunk size 1x30x9x717: ~0.08 seconds

Parallel HDF5

Parallel HDF5
- Allows multiple processes to perform I/O to an HDF5 file
- Supports the Message Passing Interface (MPI) programming model (MPICH, OpenMPI w/ROMIO)
- PHDF5 files are compatible with serial HDF5 files and shareable between different serial or parallel platforms (GPFS, Lustre, PVFS)

PHDF5 Implementation Layers

PHDF5 is built on top of the MPI I/O APIs. An HDF5 application running on the compute nodes calls the HDF5 library, which calls the MPI library; data travels over the switch network and I/O servers to the HDF5 file on a parallel file system (disk architecture and layout of data on disk).

Programming Model
- PHDF5 opens a parallel file with an MPI communicator
  - Returns a file handle; future access to the file is via the file handle
  - All processes must participate in collective PHDF5 APIs
  - Different files can be opened via different communicators
- Calls that modify HDF5 structural metadata are collective: group and dataset creation, attribute creation, dataset extension
- Array data transfer can be collective or independent; collectiveness is indicated by function parameters to H5Dwrite, H5Dread

What Does PHDF5 Support?
- After a file is opened by the processes of a communicator:
  - All parts of the file are accessible by all processes
  - All objects in the file are accessible by all processes
  - Multiple processes may write to the same data array
  - Each process may write to an individual data array
- C and Fortran 90/2003 language interfaces
- Most platforms with MPI-IO are supported, e.g., IBM AIX, Linux clusters, Cray XT, Windows, Mac, Linux

Example of a PHDF5 C Program

A parallel HDF5 program has a few extra calls:

MPI_Init(&argc, &argv);
fapl_id = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl_id, comm, info);
file_id = H5Fcreate(FNAME, …, fapl_id);
space_id = H5Screate_simple(…);
dset_id = H5Dcreate(file_id, DNAME, H5T_NATIVE_INT, space_id, …);
xf_id = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(xf_id, H5FD_MPIO_COLLECTIVE);
status = H5Dwrite(dset_id, H5T_NATIVE_INT, …, xf_id, …);
MPI_Finalize();

Parallel HDF5 tutorial examples: for simple examples of how to write different data patterns, see http://support.hdfgroup.org/HDF5/Tutor/parallel.html

Programming Model
- Each process defines memory and file subsets of the data using H5Sselect_hyperslab
- Each process executes a write/read call using the subsets (hyperslabs) defined, which can be either collective or independent
- The hyperslab parameters define the portion of the dataset to write to: a contiguous hyperslab, regularly spaced data (column or row), a pattern, or blocks

Four processes writing by rows:

HDF5 "SDS_row.h5" {
GROUP "/" {
   DATASET "IntArray" {
      DATATYPE  H5T_STD_I32BE
      DATASPACE  SIMPLE { ( 8, 5 ) / ( 8, 5 ) }
      DATA {
         10, 10, 10, 10, 10,
         11, 11, 11, 11, 11,
         12, 12, 12, 12, 12,
         13, 13, 13, 13, 13,
         13, 13, 13, 13, 13

Two processes writing by columns:

HDF5 "SDS_col.h5" {
GROUP "/" {
   DATASET "IntArray" {
      DATATYPE  H5T_STD_I32BE
      DATASPACE  SIMPLE { ( 8, 6 ) / ( 8, 6 ) }
      DATA {
         1, 2, 10, 20, 100, 200,
         1, 2, 10, 20, 100, 200

Four processes writing by pattern:

HDF5 "SDS_pat.h5" {
GROUP "/" {
   DATASET "IntArray" {
      DATATYPE  H5T_STD_I32BE
      DATASPACE  SIMPLE { ( 8, 4 ) / ( 8, 4 ) }
      DATA {
         1, 3, 1, 3,
         2, 4, 2, 4,
         2, 4, 2, 4

Four processes writing by blocks:

HDF5 "SDS_blk.h5" {
GROUP "/" {
   DATASET "IntArray" {
      DATATYPE  H5T_STD_I32BE
      DATASPACE  SIMPLE { ( 8, 4 ) / ( 8, 4 ) }
      DATA {
         1, 1, 2, 2,
         3, 3, 4, 4,
         3, 3, 4, 4

Complex Data Patterns

HDF5 doesn’t place restrictions on data patterns and data balance. (Figure: three 8x8 arrays of the values 1-64, distributed among processes in different row-major and column-major patterns.)

Thank You!