March 9, 2009, 10th International LCI Conference - HDF5 Tutorial: HDF5 Advanced Topics

Slide 1: HDF5 Advanced Topics

Slide 2: Outline
Part I: Overview of HDF5 datatypes
Part II: Partial I/O in HDF5 (hyperslab selections, dataset region references, chunking and compression)
Part III: Performance issues (how to do it right)
Part IV: Performance benefits of HDF5 version 1.8

Slide 3: Part I. HDF5 Datatypes: a quick overview of the most difficult topics

Slide 4: HDF5 Datatypes
HDF5 has a rich set of pre-defined datatypes and supports the creation of an unlimited variety of complex user-defined datatypes. Datatype definitions are stored in the HDF5 file with the data. Datatype definitions include information such as byte order (endianness), size, and floating-point representation to fully describe how the data is stored and to ensure portability across platforms. Datatype definitions can be shared among objects in an HDF5 file, providing a powerful and efficient mechanism for describing data.

Slide 5: Example
An array of integers is written (H5Dwrite) on an IA32 platform, where the native integer is little-endian and 4 bytes, using the file datatype H5T_STD_I32LE. The same array is then read (H5Dread) on a SPARC64 platform as H5T_NATIVE_INT, where the native integer is big-endian and 8 bytes. The library converts between the file representation (e.g. a little-endian 4-byte integer) and the memory representation on the fly; the same mechanism handles floating-point conversions such as VAX G-floating.

Slide 6: Storing Variable Length Data in HDF5

Slide 7: HDF5 Fixed and Variable Length Array Storage. (Figure: data-over-time diagrams contrasting fixed-length and variable-length array storage.)

Slide 8: Storing Strings in HDF5
As an array of characters (array datatype, or an extra dimension in the dataset): quick access to each character, but extra work to access and interpret each string.
As fixed-length strings: string_id = H5Tcopy(H5T_C_S1); H5Tset_size(string_id, size); space is wasted in shorter strings, but the data can be compressed.
As variable-length strings: string_id = H5Tcopy(H5T_C_S1); H5Tset_size(string_id, H5T_VARIABLE); the same overhead applies as for all VL datatypes, and compression is not applied to the actual string data.

Slide 9: Storing Variable Length Data in HDF5
Each element is represented by a C structure:
typedef struct {
    size_t len;
    void *p;
} hvl_t;
The base type can be any HDF5 datatype: H5Tvlen_create(base_type)

Slide 10: Example
hvl_t data[LENGTH];
for (i = 0; i < LENGTH; i++) {
    data[i].p = malloc((i+1) * sizeof(unsigned int));
    data[i].len = i + 1;
}
tvl = H5Tvlen_create(H5T_NATIVE_UINT);
(Figure: data[0].p points to the first allocated buffer; data[4].len holds the length of the fifth.)

Slide 11: Reading an HDF5 Variable Length Array
hvl_t rdata[LENGTH];
/* Create the memory vlen type */
tvl = H5Tvlen_create(H5T_NATIVE_UINT);
ret = H5Dread(dataset, tvl, H5S_ALL, H5S_ALL, H5P_DEFAULT, rdata);
/* Reclaim the read VL data */
H5Dvlen_reclaim(tvl, H5S_ALL, H5P_DEFAULT, rdata);
On read, the HDF5 Library allocates the memory the data is read into; the application only needs to allocate the array of hvl_t elements (pointers and lengths).

Slide 12: Storing Tables in an HDF5 File

Slide 13: Example
A table with fields a_name (integer), b_name (float), and c_name (double). There are multiple ways to store a table:
A dataset for each field
A dataset with a compound datatype
If all fields have the same type: a 2-dimensional array, or a 1-dimensional array of an array datatype
Choose the layout that achieves your goal. How much overhead will each type of storage create? Do I always read all fields? Do I need to read some fields more often? Do I want to use compression? Do I want to access only some records?

Slide 14: HDF5 Compound Datatypes
Compound types are comparable to C structs. Members can be atomic or compound types, and can be multidimensional. Data can be written and read by a field or a set of fields. Not all data filters can be applied (e.g. shuffling, SZIP).

Slide 15: HDF5 Compound Datatypes: Which APIs to Use?
H5TB APIs: create, read, get info on, and merge tables; add, delete, and append records; insert and delete fields. Limited control over the table's properties (only GZIP compression at level 6, default allocation time for the table, extendible, etc.).
PyTables: based on H5TB; a Python interface with indexing capabilities.
HDF5 APIs: H5Tcreate(H5T_COMPOUND) and H5Tinsert calls to create a compound datatype; H5Dcreate, etc. See the H5Tget_member* functions for discovering the properties of an HDF5 compound datatype.

Slide 16: Creating and Writing a Compound Dataset (h5_compound.c example)
typedef struct s1_t {
    int a;
    float b;
    double c;
} s1_t;
s1_t s1[LENGTH];

Slide 17: Creating and Writing a Compound Dataset
/* Create datatype in memory. */
s1_tid = H5Tcreate(H5T_COMPOUND, sizeof(s1_t));
H5Tinsert(s1_tid, "a_name", HOFFSET(s1_t, a), H5T_NATIVE_INT);
H5Tinsert(s1_tid, "c_name", HOFFSET(s1_t, c), H5T_NATIVE_DOUBLE);
H5Tinsert(s1_tid, "b_name", HOFFSET(s1_t, b), H5T_NATIVE_FLOAT);
Note: use the HOFFSET macro instead of calculating offsets by hand. The order of the H5Tinsert calls is not important when HOFFSET is used.

Slide 18: Creating and Writing a Compound Dataset
/* Create dataset and write data */
dataset = H5Dcreate(file, DATASETNAME, s1_tid, space, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
status = H5Dwrite(dataset, s1_tid, H5S_ALL, H5S_ALL, H5P_DEFAULT, s1);
Note: in this example the memory and file datatypes are the same, and the type is not packed. Use H5Tpack to save space in the file:
status = H5Tpack(s1_tid);
dataset = H5Dcreate(file, DATASETNAME, s1_tid, space, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

Slide 19: File Content with h5dump
HDF5 "SDScompound.h5" {
GROUP "/" {
   DATASET "ArrayOfStructures" {
      DATATYPE H5T_COMPOUND {
         H5T_STD_I32BE "a_name";
         H5T_IEEE_F32BE "b_name";
         H5T_IEEE_F64BE "c_name";
      }
      DATASPACE SIMPLE { ( 10 ) / ( 10 ) }
      DATA {
         { [ 0 ], [ 1 ] },
         { [ 1 ], …

Slide 20: Reading a Compound Dataset
/* Create datatype in memory and read data. */
dataset = H5Dopen(file, DATASETNAME, H5P_DEFAULT);
s2_tid = H5Dget_type(dataset);
mem_tid = H5Tget_native_type(s2_tid, H5T_DIR_ASCEND);
s1 = malloc(H5Tget_size(mem_tid) * number_of_elements);
status = H5Dread(dataset, mem_tid, H5S_ALL, H5S_ALL, H5P_DEFAULT, s1);
Note: we could construct the memory type as we did in the writing example. A general application instead needs to discover the type in the file, find the corresponding memory type, allocate space, and then read.

Slide 21: Reading a Compound Dataset by Fields
typedef struct s2_t {
    double c;
    int a;
} s2_t;
s2_t s2[LENGTH];
…
s2_tid = H5Tcreate(H5T_COMPOUND, sizeof(s2_t));
H5Tinsert(s2_tid, "c_name", HOFFSET(s2_t, c), H5T_NATIVE_DOUBLE);
H5Tinsert(s2_tid, "a_name", HOFFSET(s2_t, a), H5T_NATIVE_INT);
…
status = H5Dread(dataset, s2_tid, H5S_ALL, H5S_ALL, H5P_DEFAULT, s2);

Slide 22: New Way of Creating Datatypes
Another way to create a compound datatype:
#include "H5LTpublic.h"
…
s2_tid = H5LTtext_to_dtype(
    "H5T_COMPOUND {H5T_NATIVE_DOUBLE \"c_name\"; H5T_NATIVE_INT \"a_name\"; }",
    H5LT_DDL);

Slide 23: Need Help with Datatypes?
Check our support web pages:
mples-by-api/api18-c.html
mples-by-api/api16-c.html

Slide 24: Part II. Working with Subsets

Collect data one way: an array of images (3D).

Display data another way: a stitched image (a 2D array).

Sometimes the data is too big to read all at once.

Refer to a region: we need to select and access the same elements of a dataset.

Slide 29: HDF5 Library Features
The HDF5 Library provides the capabilities to:
Describe subsets of data and perform write/read operations on them (hyperslab selections and partial I/O)
Store descriptions of data subsets in a file (object references, region references)
Use an efficient storage mechanism to achieve good performance when writing and reading subsets of data (chunking, compression)

Slide 30: Partial I/O in HDF5

Slide 31: How to Describe a Subset in HDF5?
Before writing or reading a subset of data, one has to describe it to the HDF5 Library. The HDF5 APIs and documentation refer to a subset as a "selection" or "hyperslab selection". If a selection is specified, the HDF5 Library performs I/O only on that selection, not on all elements of the dataset.

Slide 32: Types of Selections in HDF5
Two types of selections:
Hyperslab selection: a regular hyperslab, a simple hyperslab, or the result of set operations on hyperslabs (union, difference, …)
Point selection
Hyperslab selection is especially important for doing parallel I/O in HDF5 (see the Parallel HDF5 Tutorial).

Slide 33: Regular Hyperslab: a collection of regularly spaced, equal-sized blocks.

Slide 34: Simple Hyperslab: a contiguous subset or sub-array.

Slide 35: Hyperslab Selection: the result of a union operation on three simple hyperslabs.

Slide 36: Hyperslab Description
Start: the starting location of the hyperslab, e.g. (1,1)
Stride: the number of elements that separate each block, e.g. (3,2)
Count: the number of blocks, e.g. (2,6)
Block: the block size, e.g. (2,1)
Everything is measured in number of elements.

Slide 37: Simple Hyperslab Description
Two ways to describe a simple hyperslab:
As several blocks: stride (1,1), count (2,6), block (2,1)
As one block: stride (1,1), count (1,1), block (4,6)
There is no performance penalty for one way or the other.

Slide 38: The H5Sselect_hyperslab Function
space_id: identifier of the dataspace
op: selection operator, H5S_SELECT_SET or H5S_SELECT_OR
start: array with the starting coordinates of the hyperslab
stride: array specifying which positions along a dimension to select
count: array specifying how many blocks to select from the dataspace in each dimension
block: array specifying the size of the element block (NULL indicates a block size of a single element in each dimension)

Slide 39: Reading/Writing Selections
Programming model for reading from a dataset in a file:
1. Open the dataset.
2. Get the file dataspace handle of the dataset and specify the subset to read from.
   a. H5Dget_space returns the file dataspace handle; the file dataspace describes the array stored in the file (the number of dimensions and their sizes).
   b. H5Sselect_hyperslab selects the elements of the array that participate in the I/O operation.
3. Allocate a data buffer of an appropriate shape and size.

Slide 40: Reading/Writing Selections
Programming model (continued):
4. Create a memory dataspace and specify the subset to write to.
   a. The memory dataspace describes the data buffer (its rank and dimension sizes).
   b. Use the H5Screate_simple function to create the memory dataspace.
   c. Use H5Sselect_hyperslab to select the elements of the data buffer that participate in the I/O operation.
5. Issue H5Dread or H5Dwrite to move the data between the file and the memory buffer.
6. Close the file dataspace and memory dataspace when done.

Slide 41: Example: Reading Two Rows
Data in the file: a 4x6 matrix. Buffer in memory: a 1-dimensional array of length 14.

Slide 42: Example: Reading Two Rows
start = {1,0}
count = {2,6}
block = {1,1}
stride = {1,1}
filespace = H5Dget_space(dataset);
H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);

Slide 43: Example: Reading Two Rows
start[1] = {1}
count[1] = {12}
dim[1] = {14}
memspace = H5Screate_simple(1, dim, NULL);
H5Sselect_hyperslab(memspace, H5S_SELECT_SET, start, NULL, count, NULL);

Slide 44: Example: Reading Two Rows
H5Dread(…, …, memspace, filespace, …, …);

Slide 45: Things to Remember
The number of elements selected in the file and in the memory buffer must be the same; H5Sget_select_npoints returns the number of selected elements in a hyperslab selection.
HDF5 partial I/O is tuned to move data between selections that have the same dimensionality; avoid choosing subsets with different ranks (as in the example above).
Allocate a buffer of an appropriate size when reading data; use H5Tget_native_type and H5Tget_size to get the correct size of a data element in memory.

Slide 46: HDF5 Region References and Selections

Saving a selected region in a file: we need to select and later access the same elements of a dataset.

Slide 48: Reference Datatype
Reference to an HDF5 object: a pointer to a group or a dataset in a file. The predefined datatype H5T_STD_REF_OBJ describes object references.
Reference to a dataset region (i.e., to a selection): a pointer to a dataspace selection. The predefined datatype H5T_STD_REF_DSETREG describes region references.

Slide 49: Reference to a Dataset Region. (Figure: the file REF_REG.h5 with a root group containing the datasets "Region References" and "Matrix".)

Slide 50: Reference to a Dataset Region: Example
dsetr_id = H5Dcreate(file_id, "REGION_REFERENCES", H5T_STD_REF_DSETREG, …);
H5Sselect_hyperslab(space_id, H5S_SELECT_SET, start, NULL, …);
H5Rcreate(&ref[0], file_id, "MATRIX", H5R_DATASET_REGION, space_id);
H5Dwrite(dsetr_id, H5T_STD_REF_DSETREG, H5S_ALL, H5S_ALL, H5P_DEFAULT, ref);

Slide 51: Reference to a Dataset Region
HDF5 "REF_REG.h5" {
GROUP "/" {
   DATASET "MATRIX" {
      ……
   }
   DATASET "REGION_REFERENCES" {
      DATATYPE H5T_REFERENCE
      DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
      DATA {
         (0): DATASET /MATRIX {(0,3)-(1,5)},
         (1): DATASET /MATRIX {(0,0), (1,6), (0,8)}
      }
   }
}
}

Slide 52: Chunking in HDF5

Slide 53: HDF5 Chunking
Dataset data is divided into equally sized blocks (chunks). Each chunk is stored separately as a contiguous block in the HDF5 file. (Figure: in application memory, the dataset header (datatype, dataspace, attributes) sits in the metadata cache; in the file, a chunk index points to chunks A, B, C, and D stored separately.)

Slide 54: HDF5 Chunking
Chunking is needed for enabling compression and other filters, and for extendible datasets.

Slide 55: HDF5 Chunking
If used appropriately, chunking improves partial I/O for big datasets. (Figure: only two chunks are involved in the I/O.)

Slide 56: HDF5 Chunking
A chunk has the same rank as the dataset. A chunk's dimensions do not need to be factors of the dataset's dimensions.

Slide 57: Creating a Chunked Dataset
1. Create a dataset creation property list.
2. Set the property list to use the chunked storage layout.
3. Create the dataset with the above property list.
dcpl_id = H5Pcreate(H5P_DATASET_CREATE);
rank = 2;
ch_dims[0] = 100;
ch_dims[1] = 100;
H5Pset_chunk(dcpl_id, rank, ch_dims);
dset_id = H5Dcreate(…, dcpl_id);
H5Pclose(dcpl_id);

Slide 58: Writing or Reading a Chunked Dataset
1. The chunking mechanism is transparent to the application.
2. Use the same set of operations as for a contiguous dataset, for example:
H5Dopen(…);
H5Sselect_hyperslab(…);
H5Dread(…);
3. Selections do not need to coincide precisely with chunk boundaries.

Slide 59: HDF5 Filters
HDF5 filters modify data during I/O operations. Available filters:
1. Checksum (H5Pset_fletcher32)
2. Shuffling filter (H5Pset_shuffle)
3. Data transformation (in 1.8.*)
4. Compression: scale+offset (in 1.8.*), N-bit (in 1.8.*), GZIP (deflate) and SZIP (H5Pset_deflate, H5Pset_szip), and user-defined filters (e.g. BZIP2). An example of a user-defined compression filter can be found in the HDF5 documentation.

Slide 60: Creating a Compressed Dataset
1. Create a dataset creation property list.
2. Set the property list to use the chunked storage layout.
3. Set the property list to use filters.
4. Create the dataset with the above property list.
crp_id = H5Pcreate(H5P_DATASET_CREATE);
rank = 2;
ch_dims[0] = 100;
ch_dims[1] = 100;
H5Pset_chunk(crp_id, rank, ch_dims);
H5Pset_deflate(crp_id, 9);
dset_id = H5Dcreate(…, crp_id);
H5Pclose(crp_id);

Slide 61: Writing a Compressed Dataset
The default chunk cache size is 1 MB. Filters, including compression, are applied when a chunk is evicted from the cache. Chunks in the file may have different sizes. (Figure: a chunked dataset passing through the per-dataset chunk cache and the filter pipeline into the file.)

Slide 62: Chunking Basics to Remember
Chunking creates storage overhead in the file. Performance is affected by the chunking and compression parameters and by the chunk cache size (the H5Pset_cache call). Some hints for getting better performance:
Use a chunk size not smaller than the file system block size (e.g. 4 KB).
Use a compression method appropriate for your data.
Avoid selections that do not coincide with chunk boundaries.

Slide 63: Example
Creates a compressed 1000x20 integer dataset in a file:
% h5dump -p -H zip.h5
HDF5 "zip.h5" {
GROUP "/" {
GROUP "Data" {
   DATASET "Compressed_Data" {
      DATATYPE H5T_STD_I32BE
      DATASPACE SIMPLE { ( 1000, 20 ) ……… }
      STORAGE_LAYOUT {
         CHUNKED ( 20, 20 )
         SIZE 5316
      }

Slide 64: Example (continued)
      FILTERS {
         COMPRESSION DEFLATE { LEVEL 6 }
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_IFSET
         VALUE 0
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_INCR
      }

Slide 65: Example (bigger chunk)
Creates the same compressed 1000x20 integer dataset in a file; a better compression ratio is achieved:
% h5dump -p -H zip.h5
HDF5 "zip.h5" {
GROUP "/" {
GROUP "Data" {
   DATASET "Compressed_Data" {
      DATATYPE H5T_STD_I32BE
      DATASPACE SIMPLE { ( 1000, 20 ) ……… }
      STORAGE_LAYOUT {
         CHUNKED ( 200, 20 )
         SIZE 2936
      }

Slide 66: Part III. Performance Issues (How to Do It Right)

Slide 67: Performance of Serial I/O Operations
The next slides show the performance effects of using different access patterns and storage layouts. We use three test cases, each of which writes a selection to an array of characters. Data is stored in row-major order. Tests were executed on a THG Linux x86_64 box using h5perf_serial and HDF5 version 1.8.0.

Slide 68: Serial Benchmarking Tool
The benchmarking tool h5perf_serial is publicly released with HDF5. Features include:
Support for POSIX and HDF5 I/O calls.
Support for datasets and buffers with multiple dimensions.
Entire dataset access using a single or several I/O operations.
Selection of contiguous or chunked storage for HDF5 operations.

Slide 69: Contiguous Storage (Case 1)
A rectangular dataset of size 48K x 48K, with write selections of 512 x 48K. The HDF5 storage layout is contiguous. This is a good I/O pattern for both POSIX and HDF5 because each selection is contiguous. POSIX: 5.19 MB/s; HDF5: 5.36 MB/s.

Slide 70: Contiguous Storage (Case 2)
A rectangular dataset of 48K x 48K, with write selections of 48K x 512. The HDF5 storage layout is contiguous. This is a bad I/O pattern for both POSIX and HDF5 because each selection is noncontiguous. POSIX: 1.24 MB/s; HDF5: 0.05 MB/s.

Slide 71: Chunked Storage
A rectangular dataset of 48K x 48K, with write selections of 48K x 512. The HDF5 storage layout is chunked, and the chunk and selection sizes are equal. This is a bad I/O case for POSIX because the selections are noncontiguous, but a good I/O case for HDF5, since the selections are contiguous thanks to the chunked layout. POSIX: 1.51 MB/s; HDF5: 5.58 MB/s.

Slide 72: Conclusions
Access patterns with many small I/O operations pay the latency and overhead costs many times over. Chunked storage may improve I/O performance by making the data selection contiguous on disk.

Writing a Chunked Dataset
A 1000x100x100 dataset of 4-byte integers with random values, stored in 50x100x100 chunks (20 total); the chunk size is 2 MB. The entire dataset is written using 1x100x100 slices, and the slices are written sequentially.

Test Setup. (Figure: 20 chunks, 1000 slices; the chunk size is 2 MB.)

Test Setup (continued)
Tests were performed with 1 MB and 5 MB chunk cache sizes; the cache size is set with the H5Pset_cache function:
H5Pget_cache(fapl, NULL, &rdcc_nelmts, &rdcc_nbytes, &rdcc_w0);
H5Pset_cache(fapl, 0, rdcc_nelmts, 5*1024*1024, rdcc_w0);
Tests were performed with no compression and with gzip (deflate) compression.

Effect of Chunk Cache Size on Write

No compression:
Cache size      | I/O operations | Total data written | File size
1 MB (default)  | …              | … MB               | 38.15 MB
5 MB            | …              | … MB               | 38.15 MB

Gzip compression:
Cache size      | I/O operations | Total data written | File size
1 MB (default)  | …              | … MB (… MB read)   | … MB
5 MB            | …              | … MB               | … MB

Effect of Chunk Cache Size on Write
With the 1 MB cache size, a chunk will not fit into the cache, so all writes to the dataset must be immediately written to disk. With compression, the entire chunk must be read and rewritten every time a part of it is written to; the data must also be decompressed and recompressed each time, and non-sequential writes could result in a larger file. Without compression, the entire chunk must be written when it is first written to the file, and if the selection were not contiguous on disk, it could require as much as one I/O operation for each element.

Effect of Chunk Cache Size on Write
With the 5 MB cache size, a chunk is written only after it is full. This drastically reduces the number of I/O operations, reduces the amount of data that must be written (and read), and reduces processing time, especially with the compression filter.

Conclusion
It is important to make sure that a chunk will fit into the raw data chunk cache. If you will be writing to multiple chunks at once, you should increase the cache size even more. Try to design chunk dimensions to minimize the number of chunks you write to at once.

Reading a Chunked Dataset
Read the same dataset, again by slices, but with the slices crossing through all the chunks. Two orientations for the read plane: one includes the fastest changing dimension, one does not. Measure the total read operations and the total size read. Chunk sizes of 50x100x100 and 10x100x100; 1 MB cache.

Test Setup. (Figure: chunks and read slices, vertical and horizontal.)

Results
Read slice includes the fastest changing dimension:
Chunk size | Compression | I/O operations | Total data read
50         | Yes         | …              | … MB
10         | Yes         | …              | … MB
50         | No          | …              | … MB
10         | No          | …              | … MB

Results (continued)
Read slice does not include the fastest changing dimension:
Chunk size | Compression | I/O operations | Total data read
50         | Yes         | …              | … MB
10         | Yes         | …              | … MB
50         | No          | …              | … MB
10         | No          | …              | … MB

Effect of Cache Size on Read
When compression is enabled, the library must always read each entire chunk once for each call to H5Dread. When compression is disabled, the library's behavior depends on the cache size relative to the chunk size:
If the chunk fits in the cache, the library reads each entire chunk once for each call to H5Dread.
If the chunk does not fit in the cache, the library reads only the data that is selected: more read operations, especially if the read plane does not include the fastest changing dimension, but less total data read.

Conclusion
In this case, the cache size does not matter when reading if compression is enabled. Without compression, a larger cache may not be beneficial unless the cache is large enough to hold all of the chunks. The optimum cache size depends on the exact shape of the data, as well as on the hardware.

Slide 86: Hints for Chunk Settings
Chunk dimensions should align as closely as possible with the hyperslab dimensions used for read/write.
The chunk cache size (rdcc_nbytes) should be large enough to hold all the chunks in the selection; if this is not possible, it may be best to disable chunk caching altogether (set rdcc_nbytes to 0).
rdcc_nelmts should be a prime number that is at least 10 to 100 times the number of chunks that can fit into rdcc_nbytes.
rdcc_w0 should be set to 1 if chunks that have been fully read/written will never be read/written again.

Slide 87: Part IV. Performance Benefits of HDF5 Version 1.8

Slide 88: What Did We Do in HDF5 1.8?
Extended the file format specification
Reviewed the group implementations
Introduced a new link object
Revamped the metadata cache implementation
Improved the handling of datasets and datatypes
Introduced shared object header messages
Extended error handling
Enhanced backward/forward API and file format compatibility

Slide 89: What Did We Do in HDF5 1.8? And much more good stuff to make HDF5 better and faster.

Slide 90: HDF5 File Format Extension

Slide 91: HDF5 File Format Extension
Why: to address deficiencies of the original file format, reduce space overhead in an HDF5 file, and enable new features.
What: a new routine that instructs the HDF5 library to create all objects using the latest version of the HDF5 file format (compared with the default of using the earliest version in which each object type became available, for example, the array datatype).

Slide 92: HDF5 File Format Extension: Example
/* Use the latest version of the file format for each object created in a file */
fapl_id = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_libver_bounds(fapl_id, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);
fid = H5Fcreate(…, …, …, fapl_id);
or
fid = H5Fopen(…, …, fapl_id);

Slide 93: Group Revisions

Slide 94: Better Large Group Storage
Why: faster, more scalable storage and access for large groups.
What: a new format and method for storing groups with many links.

Slide 95: Informal Benchmark
Create a file and a group in the file, then create up to 10^6 groups with one dataset in each group. Compare the file sizes and performance of HDF5 using the latest group format with the performance of HDF5 using the default (old) format. Note: the default (old) format became very slow after … groups.

Slide 96: Time to Open and Read a Dataset. (Figure: performance chart.)

Slide 97: File Size. (Figure: file size chart.)

Slide 98: Questions?