1
Introduction to HDF5 Tutorial
2
Goals
- Help new users to start with HDF5
- HDF5 concepts: data and programming models, terminology, major features
- Help everyone to avoid HDF5 pitfalls
- Performance tuning is for everyone, not for experts only
3
HDF = Hierarchical Data Format
HDF5 is the second HDF format
- Development started in 1996
- First release was in 1998
- Supported by The HDF Group
HDF4 is the first HDF format
- Originally called HDF
- Development started in 1987
- Still supported by The HDF Group
April 29-30, 2009, HDF5 Technical Consulting Meeting for EMRG Program
4
HDF5 is like…
5
HDF5 is designed …
- for high volume and/or complex data
- for every size and type of system (portable)
- for flexible, efficient storage and I/O
- to enable applications to evolve in their use of HDF5 and to accommodate new models
- to support long-term data preservation
6
HDF5 Technology
- HDF5 Data Model
  - Defines the “building blocks” for data organization and specification
  - Files, Groups, Datasets, Attributes, Datatypes, Dataspaces, …
- HDF5 Library (C, Fortran 90, C++ APIs, Java, Python, Julia, R)
  - High Level Libraries
- HDF5 Binary File Format
  - Bit-level organization of HDF5 logical file
  - Defined by HDF5 File Format Specification
- Tools For Accessing Data in HDF5 Format
  - h5dump, h5repack, HDFView, …
7
Where to start?
8
HDF5 Resources
- The HDF Group Page:
- HDF5 Home Page:
  - Software (source code and binaries)
  - Documentation
  - Examples
- HDF Helpdesk:
- HDF Mailing Lists:
9
New Users
- Use Anaconda to install h5py and HDF5 software
- h5dump: tool to “dump” or display contents of HDF5 files
- If using other languages, use the scripts h5cc, h5c++, and h5fc to compile applications
- Other tools: h5ls, h5repack, h5copy
10
h5dump Utility
h5dump [options] [file]
  -H, --header   Display header only – no data
  -d <names>     Display the specified dataset(s)
  -g <names>     Display the specified group(s) and all members
  -p             Display properties
<names> is one or more appropriate object names.
Other tools: h5ls, h5repack, h5copy
11
Code: Create a file and a dataset (h5_crtdata.py)
>>> import h5py
>>> file = h5py.File('dset.h5','w')
>>> dataset = file.create_dataset("dset", (4, 6), h5py.h5t.STD_I32BE)
>>> ...
>>> dataset[...] = data
>>> file.close()
12
C code
#include "hdf5.h"
#define FILE "dset.h5"
….
hid_t file_id, dset_id, dspace_id; /* identifiers */
….
file_id = H5Fcreate(FILE, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
dims[0] = 4;
dims[1] = 6;
dspace_id = H5Screate_simple(2, dims, NULL);
dset_id = H5Dcreate2(file_id, "/dset", H5T_STD_I32BE, dspace_id,
                     H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
H5Dwrite(dset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, dset_data);
H5Dclose(dset_id);
H5Sclose(dspace_id);
H5Fclose(file_id);
13
Example of h5dump Output
HDF5 "dset.h5" {
GROUP "/" {
   DATASET "dset" {
      DATATYPE { H5T_STD_I32BE }
      DATASPACE { SIMPLE ( 4, 6 ) / ( 4, 6 ) }
      DATA {
         1, 2, 3, 4, 5, 6,
         7, 8, 9, 10, 11, 12,
         13, 14, 15, 16, 17, 18,
         19, 20, 21, 22, 23, 24
      }
   }
}
}
14
HDF5 Tutorial and Examples
HDF5 Tutorial:
HDF5 Example Code:
15
HDF5 Data Model
16
HDF5 File
An HDF5 file is a container that holds data objects.
lat | lon | temp
----|-----|-----
12 | 23 | 3.1
15 | 24 | 4.2
17 | 21 | 3.6
Experiment Notes:
Serial Number:
Date: 3/13/09
Configuration: Standard 3
17
The two primary HDF5 objects are:
- HDF5 Group: a grouping structure containing zero or more HDF5 objects
- HDF5 Dataset: an array of data elements, together with information that describes them
(There are other HDF5 objects that help support Groups and Datasets.)
18
HDF5 Groups and Links
HDF5 groups and links organize data objects.
[Figure: root group “/” with the experiment notes, the groups Viz and SimOut, and the lat/lon/temp table from the previous slide]
19
HDFView – HDF Data Browser in Java
20
HDF Compass – HDF Data Browser in Python
21
Groups and Links
22
Example h5_links.py
A group is a container for links of different types.
[Figure: file links.h5 with the root group “/”, groups “A” and “B”, dataset “a”, and the links “a”, “soft”, “dangling”, and “External”]
Example h5_links.py creates a file links.h5 and two groups, “A” and “B”, in it. It then creates a one-dimensional array “a” in group “A”. (The array is one-dimensional in the example and has no data.) After the dataset was created, a hard link “a” was added to the root group. A soft link “soft” with the value “/A/a” was also added to the root group, along with the dangling soft link “dangling”. An external link “External” was added to group “B”; it points to the dataset “dset” in dset.h5, which is in a different file.
The dataset can be “reached” using three paths: /A/a, /a, and /soft.
HDF5 Workshop at PSI, May 30-31, 2012
23
Links
A link is a (Name, Value) pair.
Name
- UTF-8 string; examples: “A”, “B”, “a”, “dangling”, “soft”
- Unique within a group; “/” is not allowed in names
Depending on the Value, links are called:
- Hard link
  - Value is the object’s address in a file
  - Created automatically when an object is created
  - Can be added to point to an existing object
- Soft link
  - Value is a string, for example, “/A/a”
  - Used to create aliases
- External link
  - Value is a pair of strings, for example, (“dset.h5”, “/dset”)
  - Used to access data in other HDF5 files
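The difference between hard and soft links can be sketched with a small in-memory model. This is only an illustration in plain Python, not the HDF5 library or h5py: the `objects` and `groups` tables, the `resolve` function, and the target path of the dangling link are all hypothetical. External links are left out because they need a second file.

```python
class DanglingLink(Exception):
    """Raised when a path cannot be resolved to an object."""

# Objects live at "addresses". Hard links store an address;
# soft links store a path string that is resolved on every access.
objects = {1: "group /", 2: "group A", 3: "group B", 4: "dataset a"}

# Each group maps link names to ("hard", address) or ("soft", path).
groups = {
    1: {
        "A": ("hard", 2),
        "B": ("hard", 3),
        "a": ("hard", 4),                        # second hard link to the dataset
        "soft": ("soft", "/A/a"),                # alias by path
        "dangling": ("soft", "/no/such/object"), # hypothetical missing target
    },
    2: {"a": ("hard", 4)},
    3: {},
}

def resolve(path, depth=0):
    """Follow an absolute path and return the address it ends at."""
    if depth > 16:                 # guard against soft-link cycles
        raise DanglingLink(path)
    address = 1                    # start at the root group
    for name in path.strip("/").split("/"):
        if not name:
            continue
        links = groups.get(address)
        if links is None or name not in links:
            raise DanglingLink(path)
        kind, value = links[name]
        address = value if kind == "hard" else resolve(value, depth + 1)
    return address

for p in ("/A/a", "/a", "/soft"):
    print(p, "->", objects[resolve(p)])
```

All three paths end at the same address, which is what "the dataset can be reached using three paths" means. Resolving `/dangling` raises `DanglingLink`, because a soft link is just a stored string and nothing guarantees that its target exists.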
24
Datasets
25
HDF5 Datasets
HDF5 datasets organize and contain “raw data values”. They consist of:
- Data array
- Metadata describing the data array
  - Datatype
  - Dataspace (shape)
  - Properties (characteristics of the data, e.g., compressed)
  - Attributes (additional optional information that describes the data)
26
HDF5 Dataset
[Figure: an HDF5 dataset with its data array and metadata. Dataspace: Rank = 3, Dim_1 = 4, Dim_2 = 5, Dim_3 = 7. Datatype: Integer. Properties: Chunked, Compressed. Attributes (optional): Time = 32.4, Pressure = 987, Temp = 56.]
The data array is an ordered collection of identically typed data items distinguished by their indices.
Metadata:
- Dataspace: rank, dimensions; spatial info about the dataset
- Datatype: information on how to interpret your data
- Storage properties: how the array is organized
- Attributes: user-defined metadata (optional)
27
HDF5 Dataspaces
An HDF5 dataspace can be one of the following:
- Array (or simple dataspace)
  - Multiple elements in the dataset organized in a multi-dimensional (rectangular) array
  - Maximum number of elements in each dimension may be fixed or unlimited
- NULL
  - No elements in the dataset
- Scalar
  - Single element in the dataset
28
HDF5 Dataspaces
Two roles:
- Spatial information (shape) of an array stored in a file
  - Rank and dimensions
  - Permanent part of a dataset definition
- Partial I/O: the dataspace describes the application’s data buffer and the data elements participating in I/O
[Figure: two example dataspaces. Rank = 2, Dimensions = 4x6; Rank = 1, Dimension = 10]
29
HDF5 hyperslab and data selections
- Mechanism to describe elements for partial I/O
- Selections are the result of set operations (defined by APIs) on hyperslabs
Definition of a hyperslab (everything is “measured” in number of elements):
- Start: starting location of the hyperslab, e.g., (1,1)
- Stride: number of elements that separate each block, e.g., (3,2)
- Count: number of blocks, e.g., (2,6)
- Block: block size, e.g., (2,1)
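To see which elements the four hyperslab parameters pick out, the selection can be enumerated in plain Python. This is a sketch of the arithmetic only, not a call into the HDF5 library; `hyperslab_indices` is a hypothetical helper, and indices are assumed to be 0-based as in the C API.

```python
from itertools import product

def hyperslab_indices(start, stride, count, block):
    """Enumerate the element coordinates selected by a regular hyperslab."""
    per_dim = []
    for s, st, c, b in zip(start, stride, count, block):
        # In each dimension: `c` blocks of `b` elements,
        # with consecutive block starts `st` elements apart.
        per_dim.append([s + i * st + j for i in range(c) for j in range(b)])
    return list(product(*per_dim))

# Parameters from the slide: start (1,1), stride (3,2), count (2,6), block (2,1)
selected = hyperslab_indices((1, 1), (3, 2), (2, 6), (2, 1))

print(len(selected))                       # number of selected elements
print(sorted({r for r, _ in selected}))    # selected rows
print(sorted({c for _, c in selected}))    # selected columns
```

With the slide's parameters the selection covers rows 1, 2, 4, 5 and every second column from 1 to 11, i.e. (2*2) x (6*1) = 24 elements.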
30
HDF5 Datatypes
The HDF5 datatype describes how to interpret individual data elements. HDF5 datatypes include:
- integer, float, unsigned, bitfield, …
- user-definable (e.g., 13-bit integer)
- variable-length types (e.g., strings)
- references to objects/dataset regions
- enumerations: names mapped to integers
- opaque
- compound (similar to C structs)
31
HDF5 Dataset with Compound Datatype
[Figure: a dataset whose elements have a compound datatype with four members: int8, int4, int16, and a 2x3x2 array of float32. Dataspace: Dimensions = 5 x 3]
32
HDF5 Dataset Storage Properties
- Contiguous (default): data elements stored physically adjacent to each other
- Chunked: better access time for subsets; extensible
- Chunked & Compressed: improves storage efficiency, transmission speed
- External: allows old data (e.g., an old binary file) to be used with new tools such as h5py
33
HDF5 Chunking
Chunking is required for several HDF5 features:
- Applying compression and other filters, like checksum
  - FLETCHER32
  - SHUFFLE
  - SCALEOFFSET
  - NBIT
  - GZIP
  - SZIP (some licensing issues)
- Expanding/shrinking dataset dimensions and adding/“deleting” data
Copyright © 2015 The HDF Group. All rights reserved.
34
HDF5 Chunking
Chunking improves partial I/O for big datasets.
[Figure: only two chunks are involved in I/O]
35
HDF5 Chunking Limitations
- Chunk dimensions cannot be bigger than dataset dimensions
- The number of elements in a chunk is limited to 4G (2^32 - 1 elements)
  - H5Pset_chunk fails otherwise
- The total size of a chunk is limited to 4GB
  - Total size = (number of elements) * (size of the datatype)
  - H5Dwrite fails later on
- More data will be written in this case
  - Ghost zones are filled with the fill value unless the fill value is disabled
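These limits are easy to check before creating a dataset. The helper below is a sketch in plain Python: `check_chunk` is a hypothetical name, and treating both limits as exactly 2**32 - 1 is an assumption about where the "4G" boundary lies.

```python
from math import prod

LIMIT = 2**32 - 1  # assumed exact value of the "4G" / "4GB" limits

def check_chunk(chunk_dims, dataset_dims, itemsize):
    """Return (total_bytes, problems) for a proposed chunk shape."""
    problems = []
    if any(c > d for c, d in zip(chunk_dims, dataset_dims)):
        problems.append("chunk dimension exceeds dataset dimension")
    n_elements = prod(chunk_dims)
    if n_elements > LIMIT:
        problems.append("too many elements in one chunk")
    # Total size = (number of elements) * (size of the datatype)
    total_bytes = n_elements * itemsize
    if total_bytes > LIMIT:
        problems.append("chunk is larger than 4GB")
    return total_bytes, problems

# A 4x30x9x717 chunk of 4-byte floats in a 60x30x9x717 dataset
print(check_chunk((4, 30, 9, 717), (60, 30, 9, 717), 4))

# A 40000x40000 chunk of 4-byte values: element count is fine, byte size is not
print(check_chunk((40000, 40000), (40000, 40000), 4))
```

The second case shows why the two limits are listed separately: a chunk can pass the element-count check and still exceed the byte limit once the datatype size is taken into account.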
36
HDF5 Attributes
- An HDF5 attribute has a name and a value
- Attributes typically contain user metadata
- Attributes may be associated with
  - HDF5 groups
  - HDF5 datasets
  - HDF5 named datatypes
- An attribute’s value is described by a datatype and shape (dataspace)
- Attributes are analogous to datasets except…
  - they are NOT extensible
  - they do NOT support compression or partial I/O
37
HDF5 Abstract Data Model Summary
- The objects in the data model are the “building blocks” for data organization and specification
  - Files, Groups, Links, Datasets, Datatypes, Dataspaces, Attributes, …
- Projects using HDF5 “map” their data concepts to these HDF5 objects
38
HDF5 Software
39
HDF5 Software Layers & Storage
[Figure: HDF5 software layers, from top to bottom]
- Tools and High Level APIs: h5dump tool, h5repack tool, HDFView tool, …
- HDF5 Library API and Language Interfaces: C, Fortran, JNI, C++
- HDF5 Data Model Objects (Groups, Datasets, Attributes, …) and Tunable Properties (Chunk Size, I/O Driver, …)
- Internals: Memory Mgmt, Datatype Conversion, Filters, Chunked Storage, Version Compatibility, and so on…
- Virtual File Layer: Posix I/O, Split Files, MPI I/O, Custom I/O Drivers
- Storage (HDF5 File Format): File, Split Files, File on Parallel Filesystem, Other
40
HDF5 Programming Model and APIs
41
Operations Supported by the API
- Create objects (groups, datasets, attributes, complex data types, …)
- Assign storage and I/O properties to objects
- Perform complex sub-setting during read/write
- Use a variety of I/O “devices” (parallel, remote, etc.)
- Transform data during I/O
- Make inquiries on file and object structure, content, properties
42
General Programming Paradigm
- Properties of the object are optionally defined
  - Creation properties
  - Access properties
- Object is opened or created
- Object is accessed, possibly many times
- Object is closed
43
The General HDF5 API
- C, Fortran 90, Java, and C++ bindings (part of the HDF5 distribution)
- h5py, Julia, R, Ada (community)
- C routines begin with the prefix H5?, where ? is a character corresponding to the type of object the function acts on
Example functions:
- H5D: Dataset interface, e.g., H5Dread
- H5F: File interface, e.g., H5Fopen
- H5S: dataSpace interface, e.g., H5Sclose
44
Basic Functions
H5Fcreate (H5Fopen)           create (open) File
H5Screate_simple/H5Screate    create dataSpace
H5Dcreate (H5Dopen)           create (open) Dataset
H5Dread, H5Dwrite             access Dataset
H5Dclose                      close Dataset
H5Sclose                      close dataSpace
H5Fclose                      close File
45
Tuning for performance
46
Groups and Links
47
Use group’s creation properties to save space and boost performance
Links storage types
- Compact (in 1.8.* versions)
  - Used with a few members (default under 8)
- Dense (default behavior)
  - Used with many (>16) members (default)
Tunable size for a local heap
- Save space by providing an estimate for the size of the storage required for link names
Can be compressed
- Many links with similar names (XXX-abc, XXX-d, XXX-efgh, etc.)
- Requires more time to compress/uncompress data
48
Hints
- Use the latest file format (see the H5Pset_libver_bounds function in the RM)
  - Save space when creating a lot of groups in a file
  - Save time when accessing many objects (>1000)
- Caution: tools built with older HDF5 versions will not work on files created with this property
49
Benchmark
- Create a file
- Create up to 10^6 groups with one dataset in each group
- Compare file sizes and performance of HDF5 using the latest group format with the performance of HDF5 (default, old format) and 1.6.7
- Note: the old-format runs became very slow as the number of groups grew
50
Time to Open and Read a Dataset
This graph shows primarily that the access time with old format groups grows almost linearly with the number of groups, while it is nearly constant with the new groups. At the upper end of the test, old groups are 2-3 orders of magnitude slower than new groups.
Plateau: the metadata cache felt enough pressure from cache misses that it resized the cache larger. I'm guessing that if you were able to keep going, it would start sloping upward again.
51
Time to Close the File
This graph again shows that performance of the new groups is relatively unaffected by the number of groups, though the difference is not as dramatic as cold-cache access. The "old" group format, with a single [huge] local heap, is probably the reason: it's being flushed to the file when the group closes. The "new" group format, which uses a fractal heap, will never get a single block of heap data that's so large; smaller heap blocks will get evicted from the cache and flushed to disk as no more group entries are added to them. It could be that the new v2 B-tree is much more efficient than the older B*-trees, but I'm guessing that's a smaller effect.
52
File Size
This shows the greater space efficiency of the new compact groups. The new indexed groups are also more space efficient, but that does not make as much difference as the compact groups.
53
Datasets
Chunking and compression
54
Chunked Dataset I/O
[Figure: chunks A, B, and C of a chunked dataset moving between the application memory space, datatype (DT) conversion, the chunk cache, the filter pipeline, and the HDF5 file]
- On write, datatype conversion is performed before the chunk is placed in the cache
- On read, datatype conversion is performed after the chunk is placed in the application buffer
- A chunk is written when it is evicted from the cache
- Compression and other filters are applied on eviction or on bringing the chunk into the cache
55
H5DOwrite_chunk 40 times speedup in our benchmarks
56
Example
How chunking and compression can kill performance
57
Example of chunking strategy
- JPSS uses granule (satellite scan) size as chunk size
- “ES_ImaginaryLW” is stored using 15 chunks (2.9 MB) with the size 4x30x9x717
58
Problem
SCRIS_npp_d _t _e _b13293_c _noaa_pop.h5
DATASET "/All_Data/CrIS-SDR_All/ES_ImaginaryLW" {
   DATATYPE H5T_IEEE_F32BE
   DATASPACE SIMPLE { ( 60, 30, 9, 717 ) / ( H5S_UNLIMITED, H5S_UNLIMITED, H5S_UNLIMITED, H5S_UNLIMITED ) }
   STORAGE_LAYOUT {
      CHUNKED ( 4, 30, 9, 717 )
      SIZE
   }
The dataset is read once, by contiguous 1x1x1x717 selections, i.e., 717 elements per read.
The time it takes to read the whole dataset is:
Compressed with GZIP level 6 | No compression
-----------------------------|---------------
~345 seconds | ~0.1 seconds
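A quick count shows why this access pattern is so expensive when chunks are compressed. The arithmetic below uses only the dimensions from the h5dump output; the variable names are illustrative, and the closing remark assumes that a chunk which is not held in the cache must be read and decompressed again for each selection that touches it.

```python
from math import prod

dataset_dims = (60, 30, 9, 717)
chunk_dims = (4, 30, 9, 717)
selection_dims = (1, 1, 1, 717)

# Number of chunks in the dataset and number of read calls needed
n_chunks = prod(d // c for d, c in zip(dataset_dims, chunk_dims))
n_reads = prod(d // s for d, s in zip(dataset_dims, selection_dims))

# How many separate selections fall inside each chunk
reads_per_chunk = n_reads // n_chunks

print(n_chunks, n_reads, reads_per_chunk)
```

If every read has to decompress its whole chunk again, the 15 chunks are decompressed thousands of times in total instead of once each, which is consistent with the large gap between the compressed and uncompressed timings.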
59
Solutions
Performance may depend on combinations of factors:
- I/O access patterns
- Chunk size and layout
- Chunk cache size
- Memory usage, compression, etc.
60
Pitfall – chunk size
Chunks are too small
- File has too many chunks
  - Extra metadata increases file size
  - Extra time to look up each chunk
  - More I/O since each chunk is stored independently
- Larger chunks result in fewer chunk lookups, smaller file size, and fewer I/O operations
61
Pitfall – chunk size
Chunks are too large
- Entire chunk has to be read and uncompressed before performing any operations
  - Great performance penalty for reading a small subset
- Entire chunk has to be in memory and may cause the OS to page memory to disk, slowing down the entire system
- Use case: set the chunk size to be the same as the dataset size to “enable” compression on a contiguous dataset
62
HDF5 raw data chunk cache
- The only raw data cache in HDF5
- Chunk cache is per dataset (1MB default)
- Improves performance whenever the same chunks are read or written multiple times
63
HDF5 chunk cache documentation
64
Chunk doesn’t fit into cache
[Figure: chunks Gran_1, Gran_2, …, Gran_15 in the HDF5 file; Gran_1 passes through the chunk cache into the application buffer]
All data in the chunk is copied to the application buffer before the chunk is discarded.
65
Chunk fits into cache
[Figure: chunks Gran_1, Gran_2, …, Gran_15 in the HDF5 file; H5Dread brings Gran_1 into the chunk cache and copies it to the application buffer]
The chunk stays in the cache until all data is read and copied. It is discarded to bring in a new chunk.
66
Solution (Data Consumers)
Increase chunk cache size
- For our example dataset, we increased the chunk cache size to 3MB, big enough to hold one 2.95 MB chunk
- Caution: a big chunk cache increases the application memory footprint
Compression | Chunk cache | Read time
------------|-------------|----------
GZIP level 6 | 1MB (default) | ~345 seconds
GZIP level 6 | 3MB | ~0.37 seconds
No compression | 1MB (default) or 3MB | ~0.09 seconds
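The 2.95 MB figure and the choice of a 3MB cache follow from the chunk shape and the 4-byte element size of H5T_IEEE_F32BE. The check below is a sketch of that arithmetic only: `chunk_fits` is a hypothetical helper, it takes MB to mean 2**20 bytes, and it ignores any other cache settings the library may have.

```python
from math import prod

MIB = 1024 * 1024

def chunk_fits(chunk_dims, itemsize, cache_bytes):
    """True if one whole chunk fits in a cache of `cache_bytes` bytes."""
    return prod(chunk_dims) * itemsize <= cache_bytes

chunk_dims = (4, 30, 9, 717)
itemsize = 4  # bytes per 32-bit float

chunk_mib = prod(chunk_dims) * itemsize / MIB
print(round(chunk_mib, 2))                          # chunk size in MB
print(chunk_fits(chunk_dims, itemsize, 1 * MIB))    # default 1MB cache
print(chunk_fits(chunk_dims, itemsize, 3 * MIB))    # enlarged 3MB cache
```

The chunk does not fit in the default cache but does fit in the enlarged one, so with 3MB the decompressed chunk can be reused across consecutive selections.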
67
Solution (Data Consumers)
Change access pattern
- Keep the default cache size (1MB)
- We read our example dataset using a selection that corresponds to the whole chunk, 4x30x9x717
Compression | Selection | Read time
------------|-----------|----------
GZIP level 6 | 1x1x1x717 | ~345 seconds
GZIP level 6 | 4x30x9x717 | ~0.36 seconds
No compression | 1x1x1x717 | ~0.1 seconds
No compression | 4x30x9x717 | ~0.04 seconds
68
Solution (Data Providers)
Change chunk size
- Write original files with the smaller chunk size
- We recreated our example dataset using chunk size 1x30x9x717 (~0.74MB)
- We used the default cache size (1MB)
- Read by 1x1x1x717 selections
- Performance improved 1000 times
Compare with the original result:
Compression | Chunk size | Read time
------------|------------|----------
GZIP level 6 | 4x30x9x717 | ~345 seconds
GZIP level 6 | 1x30x9x717 | ~0.36 seconds
No compression | 4x30x9x717 | ~0.04 seconds
No compression | 1x30x9x717 | ~0.08 seconds
69
August 12, 2014 Parallel HDF5
70
Parallel HDF5
- Allows multiple processes to perform I/O to an HDF5 file
- Supports Message Passing Interface (MPI) programming (MPICH, OpenMPI w/ROMIO)
- PHDF5 files are compatible with serial HDF5 files
  - Shareable between different serial or parallel platforms
- Parallel file systems: GPFS, Lustre, PVFS
71
PHDF5 implementation layers
[Figure: PHDF5 implementation layers. HDF5 application on the compute nodes; HDF5 Library; MPI Library; switch network + I/O servers; HDF5 file on a parallel file system; disk architecture and layout of data on disk]
PHDF5 is built on top of MPI I/O APIs.
72
Programming model
- PHDF5 opens a parallel file with an MPI communicator
  - Returns a file handle
  - Future access to the file is via the file handle
- All processes must participate in collective PHDF5 APIs
  - Calls that modify HDF5 structural metadata: group and dataset creation, attribute creation, dataset extensions
- Different files can be opened via different communicators
- Array data transfer can be collective or independent
  - Collectiveness is indicated by function parameters to H5Dwrite, H5Dread
73
What does PHDF5 support?
- After a file is opened by the processes of a communicator
  - All parts of the file are accessible by all processes
  - All objects in the file are accessible by all processes
  - Multiple processes may write to the same data array
  - Each process may write to an individual data array
- C and Fortran (F90, 2003) language interfaces
- Most platforms with MPI-IO are supported, e.g.,
  - IBM AIX
  - Linux clusters
  - Cray XT
  - Windows, Mac, Linux
74
Example of PHDF5 C program
Parallel HDF5 program has extra calls
MPI_Init(&argc, &argv);
fapl_id = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl_id, comm, info);
file_id = H5Fcreate(FNAME,…, fapl_id);
space_id = H5Screate_simple(…);
dset_id = H5Dcreate(file_id, DNAME, H5T_NATIVE_INT, space_id,…);
xf_id = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(xf_id, H5FD_MPIO_COLLECTIVE);
status = H5Dwrite(dset_id, H5T_NATIVE_INT, …, xf_id…);
MPI_Finalize();
75
Parallel HDF5 tutorial examples
For simple examples how to write different data patterns see
76
Programming model
- Each process defines memory and file subsets of data using H5Sselect_hyperslab
- Each process executes a write/read call using the subsets (hyperslabs) defined; the call can be either collective or independent
- The hyperslab parameters define the portion of the dataset to write to
  - Contiguous hyperslab
  - Regularly spaced data (column or row)
  - Pattern
  - Blocks
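For the simplest case, a contiguous hyperslab of rows per process, the start and count each process would pass to its selection can be computed from its rank. The sketch below is plain Python with a loop standing in for the MPI ranks; `row_hyperslab` and the fill value `10 + rank` are hypothetical, and the row count is assumed to divide evenly among the processes.

```python
def row_hyperslab(rank, nprocs, dims):
    """Return (start, count) for the block of rows owned by `rank`."""
    n_rows, n_cols = dims
    rows_per_proc = n_rows // nprocs   # assumes n_rows is divisible by nprocs
    start = (rank * rows_per_proc, 0)
    count = (rows_per_proc, n_cols)
    return start, count

dims = (8, 5)
nprocs = 4
data = [[0] * dims[1] for _ in range(dims[0])]

for rank in range(nprocs):             # stands in for the MPI processes
    (row0, col0), (n_rows, n_cols) = row_hyperslab(rank, nprocs, dims)
    for r in range(row0, row0 + n_rows):
        for c in range(col0, col0 + n_cols):
            data[r][c] = 10 + rank     # hypothetical per-rank fill value

for row in data:
    print(row)
```

Each process ends up owning two full rows, and no two selections overlap, so the writes can be issued collectively or independently without conflicting.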
77
Four processes writing by rows
HDF5 "SDS_row.h5" {
GROUP "/" {
   DATASET "IntArray" {
      DATATYPE H5T_STD_I32BE
      DATASPACE SIMPLE { ( 8, 5 ) / ( 8, 5 ) }
      DATA {
         10, 10, 10, 10, 10,
         11, 11, 11, 11, 11,
         12, 12, 12, 12, 12,
         13, 13, 13, 13, 13,
         13, 13, 13, 13, 13
78
Two processes writing by columns
HDF5 "SDS_col.h5" {
GROUP "/" {
   DATASET "IntArray" {
      DATATYPE H5T_STD_I32BE
      DATASPACE SIMPLE { ( 8, 6 ) / ( 8, 6 ) }
      DATA {
         1, 2, 10, 20, 100, 200,
         1, 2, 10, 20, 100, 200
79
Four processes writing by pattern
HDF5 "SDS_pat.h5" {
GROUP "/" {
   DATASET "IntArray" {
      DATATYPE H5T_STD_I32BE
      DATASPACE SIMPLE { ( 8, 4 ) / ( 8, 4 ) }
      DATA {
         1, 3, 1, 3,
         2, 4, 2, 4,
         2, 4, 2, 4
80
Four processes writing by blocks
HDF5 "SDS_blk.h5" {
GROUP "/" {
   DATASET "IntArray" {
      DATATYPE H5T_STD_I32BE
      DATASPACE SIMPLE { ( 8, 4 ) / ( 8, 4 ) }
      DATA {
         1, 1, 2, 2,
         3, 3, 4, 4,
         3, 3, 4, 4
81
Complex data patterns
HDF5 doesn’t have restrictions on data patterns and data balance.
[Figure: 8 x 8 arrays holding the values 1 to 64, shown in column-by-column and row-by-row order]
82
Thank You!