HDF 1 HDF5 Advanced Topics Object’s Properties Storage Methods and Filters Datatypes HDF and HDF-EOS Workshop VIII October 26, 2004
HDF 2 Topics General Introduction to HDF5 properties HDF5 Dataset properties I/O and Storage Properties (filters) HDF5 File properties I/O and Storage Properties (drivers) Datatypes Compound Variable Length Reference to object and dataset region
HDF 3 General Introduction to HDF5 Properties
HDF 4 Properties Definition Mechanism to control different features of the HDF5 objects –Implemented via H5P Interface (‘Property lists’) –HDF5 Library sets objects’ default features –HDF5 ‘Property lists’ modify default features At object creation time (creation properties) At object access time (access or transfer properties)
HDF 5 Properties Definitions A property list is a list of name-value pairs –Values may be of any datatype A property list is passed as an optional parameter to the HDF5 APIs Property lists are used/ignored by all the layers of the library, as needed
HDF 6 Type of Properties Predefined and User defined property lists Predefined: –File creation –File access –Dataset creation –Dataset access Will cover each of these
HDF 7 Properties (Example) HDF5 File H5Fcreate(…,creation_prop_id,…) Creation properties (how file is created?) –Library’s defaults no user’s block predefined sizes of offsets and addresses of the objects in the file (64-bit for DEC Alpha, 32-bit on Windows) –User’s settings User’s block 32-bit sizes on 64-bit platform Control over B-trees for chunking storage (split factor)
HDF 8 Properties (Example) HDF5 File H5Fcreate(…,access_prop_id) Access properties or drivers (How is file accessed? What is the physical layout on the disk?) –Library defaults STDIO Library (UNIX fwrite, fread) –User-defined MPI I/O for parallel access Family of files (100 GB HDF5 file represented by 50 2 GB UNIX files) Size of the chunk cache
HDF 9 Properties (Example) HDF5 Dataset H5Dcreate(…,creation_prop_id) Creation properties (how dataset is created) –Library’s defaults Storage: Contiguous Compression: None Space is allocated when data is first written No fill value is written –User’s settings Storage: Compact, or chunked, or external Compression Fill value Control over space allocation in the file for raw data –at creation time –at write time
HDF 10 Properties (Example) HDF5 Dataset H5Dwrite (…,access_prop_id) Access (transfer) properties –Library defaults 1MB conversion buffer Error detection on read (if it was set during write) MPI independent I/O for parallel access –User defined MPI collective I/O for parallel access Size of the datatype conversion buffer Control over partial I/O to improve performance
HDF 11 Properties Programming model Use predefined property type –H5P_FILE_CREATE –H5P_FILE_ACCESS –H5P_DATASET_CREATE –H5P_DATASET_ACCESS Create new property instance –H5Pcreate –H5Pcopy –H5*get_access_plist; H5*get_create_plist Modify property (see H5P APIs) Use property to modify object feature Close property when done –H5Pclose
HDF 12 Properties Programming model
General model of usage: get plist, set values, pass to library
    hid_t plist = H5Pcreate(copy)(predefined_plist);
        OR
    hid_t plist = H5Xget_create(access)_plist(…);
    H5Pset_foo(plist, vals);
    H5Xdo_something(Xid, …, plist);
    H5Pclose(plist);
HDF 13 HDF5 Dataset Creation Properties and Predefined Filters
HDF 14 Dataset Creation Properties Storage –Contiguous (default) –Compact –Chunked –External Filters applied to raw data –Compression –Checksum Fill value Space allocation for raw data in the file
HDF 15 Dataset Creation Properties Storage Layouts Storage layout is important for I/O performance and size of the HDF5 files Contiguous (default) Used when data will be written/read at once H5Dcreate(…,H5P_DEFAULT) Compact Used for small datasets (order of O(bytes)) for better I/O Raw data is written/read when the dataset is opened File is less fragmented To create a compact dataset follow the ‘Properties programming model’
HDF 16 Creating Compact Dataset
Create a dataset creation property list
Set property list to use compact storage layout
Create dataset with the above property list
    plist = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_layout(plist, H5D_COMPACT);
    dset_id = H5Dcreate(…, "Compact", …, plist);
    H5Pclose(plist);
HDF 17 Creating chunked Dataset Chunked layout is needed for –Extendible datasets –Compression and other filters –To improve partial I/O for big datasets Better subsetting access time; extendible chunked Only two chunks will be written/read
HDF 18 Creating Chunked Dataset
Create a dataset creation property list
Set property list to use chunked storage layout
Create dataset with the above property list
    plist = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(plist, rank, ch_dims);
    dset_id = H5Dcreate(…, "Chunked", …, plist);
    H5Pclose(plist);
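The partial I/O benefit of chunking can be estimated by counting how many chunks a rectangular selection intersects, since only those chunks need to be read or written. The sketch below is plain C and does not call HDF5; the function name chunks_touched and its parameters are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Count the chunks that a rectangular selection intersects.
 * start[] and count[] describe the selection in elements,
 * chunk[] holds the chunk dimensions (all arrays have `rank` entries). */
size_t chunks_touched(int rank, const size_t *start,
                      const size_t *count, const size_t *chunk)
{
    size_t total = 1;
    for (int d = 0; d < rank; d++) {
        assert(chunk[d] > 0 && count[d] > 0);
        size_t first = start[d] / chunk[d];                  /* first chunk index */
        size_t last  = (start[d] + count[d] - 1) / chunk[d]; /* last chunk index  */
        total *= last - first + 1;
    }
    return total;
}
```

With 100x100 chunks, a 50x100 selection starting at element (0,150) touches two chunks, while a 400x1 column starting at (0,0) touches four.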
HDF 19 Dataset Creation Properties Compression and other I/O Pipeline Filters HDF5 provides a mechanism (“I/O filters”) to manipulate data while transferring it between memory and disk H5Z and H5P interfaces HDF5 predefined filters (H5P interface) –Compression (gzip, szip) –Shuffling and checksum filters User defined filters (H5Z and H5P interfaces) –Example: Bzip2 compression
HDF 20 Compression and other I/O Pipeline Filters (continued) Currently used only with chunked datasets Filters can be combined together –GZIP + shuffle + checksum filters –Checksum filter + user-defined encryption filter Filters are called in the order they are defined on writing and in the reverse order on reading User is responsible for “filters pipeline sanity” –GZIP + SZIP + shuffle doesn’t make sense –Shuffle + SZIP does
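The ordering rule (filters run in definition order on write and in reverse order on read) can be illustrated with a toy pipeline. This is a sketch in plain C, not the HDF5 filter API: filter_fn, add_one, invert_bits, pipeline_write and pipeline_read are all hypothetical names, and the two toy filters were picked only because they do not commute.

```c
#include <assert.h>
#include <stddef.h>

/* A toy filter: transforms buf in place; reverse != 0 undoes the transform. */
typedef void (*filter_fn)(unsigned char *buf, size_t n, int reverse);

void add_one(unsigned char *buf, size_t n, int reverse)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = (unsigned char)(buf[i] + (reverse ? -1 : 1));
}

void invert_bits(unsigned char *buf, size_t n, int reverse)
{
    (void)reverse;                       /* bit inversion is its own inverse */
    for (size_t i = 0; i < n; i++)
        buf[i] ^= 0xFF;
}

/* Writing: apply the filters in the order they were defined. */
void pipeline_write(const filter_fn *filters, size_t nfilters,
                    unsigned char *buf, size_t n)
{
    assert(filters != NULL || nfilters == 0);
    for (size_t f = 0; f < nfilters; f++)
        filters[f](buf, n, 0);
}

/* Reading: undo the filters in the reverse order. */
void pipeline_read(const filter_fn *filters, size_t nfilters,
                   unsigned char *buf, size_t n)
{
    assert(filters != NULL || nfilters == 0);
    for (size_t f = nfilters; f-- > 0; )
        filters[f](buf, n, 1);
}
```

Undoing the two toy filters in the wrong order does not restore the input, which is why the read path has to walk the pipeline backwards.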
HDF 21 Creating compressed Dataset Compression –Improves transmission speed –Improves storage efficiency –Requires chunking –May increase CPU time needed for compression Compressed Memory File
HDF 22 Creating compressed datasets
Create a dataset creation property list
Set chunking (and specify chunk dimensions)
Set compression method
Create dataset with the above property list
    plist = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(plist, ndims, chkdims);
    H5Pset_deflate(plist, level);                  /* GZIP */
        OR
    H5Pset_szip(plist, options_mask, numpixels);   /* SZIP */
    dset_id = H5Dcreate(file_id, "comp-data", H5T_NATIVE_FLOAT, space_id, plist);
HDF 23 Creating external Dataset Dataset’s raw data is stored in an external file Easy to include existing data into HDF5 file Easy to export raw data if application needs it Disadvantage: user has to keep track of additional files to preserve integrity of the HDF5 file Metadata for “A” Dataset “A ” HDF5 file External file Raw data for “A ” Raw data can be stored in external file
HDF 24 Creating External Dataset
Create a dataset creation property list
Set property list to use external storage layout
Create dataset with the above property list
    plist = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_external(plist, "raw_data.ext", offset, size);
    dset_id = H5Dcreate(…, "External", …, plist);
    H5Pclose(plist);
HDF 25 Example of External Files
This example shows how a contiguous, one-dimensional dataset is partitioned into three parts and each of those parts is stored in a segment of an external file.
    plist = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_external(plist, "raw.data", 3000, 1000);
    H5Pset_external(plist, "raw.data", 0, 2500);
    H5Pset_external(plist, "raw.data", 4500, 1500);
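To see where a given byte of the dataset ends up in raw.data, the segment list can be walked in order. The sketch below is plain C and assumes the dataset's bytes fill the segments in the order the segments were added; ext_segment_t and dataset_to_file_offset are hypothetical names, not part of HDF5.

```c
#include <assert.h>
#include <stddef.h>

/* One external segment: where it starts in the external file and its length. */
typedef struct {
    long long file_offset;
    long long size;
} ext_segment_t;

/* Map a byte position within the dataset's raw data to an offset in the
 * external file. Returns -1 if the position lies outside all segments. */
long long dataset_to_file_offset(const ext_segment_t *segs, size_t nsegs,
                                 long long pos)
{
    assert(segs != NULL || nsegs == 0);
    if (pos < 0)
        return -1;
    for (size_t i = 0; i < nsegs; i++) {
        if (pos < segs[i].size)
            return segs[i].file_offset + pos;
        pos -= segs[i].size;             /* skip past this segment */
    }
    return -1;
}
```

For the three segments above (3000/1000, 0/2500, 4500/1500), dataset byte 0 maps to file offset 3000, byte 1000 maps to file offset 0, and byte 3500 maps to file offset 4500; the segments together hold 5000 bytes.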
HDF 26 Checksum Filter HDF5 includes the Fletcher32 checksum algorithm for error detection. It is automatically included in HDF5. To use this filter you must add it to the filter pipeline with H5Pset_filter. Checksum value Memory
HDF 27 Enabling Checksum Filter
Create a dataset creation property list
Set chunking (and specify chunk dimensions)
Add the filter to the pipeline
Create your dataset specifying this property list
Close property list
    plist = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(plist, ndims, chkdims);
    H5Pset_filter(plist, H5Z_FILTER_FLETCHER32, 0, 0, NULL);
    H5Dcreate(…, "Checksum", …, plist);
    H5Pclose(plist);
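For readers unfamiliar with the checksum itself, here is a minimal sketch of a textbook Fletcher-32 over 16-bit little-endian words. It is an illustration of the idea only: the HDF5 library's internal implementation may differ in byte order and other details, and the function name fletcher32 is chosen here for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Textbook Fletcher-32: two running sums modulo 65535 over 16-bit words.
 * Words are assembled little-endian; an odd trailing byte is zero-padded. */
uint32_t fletcher32(const unsigned char *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t sum1 = 0, sum2 = 0;
    for (size_t i = 0; i < len; i += 2) {
        uint32_t word = data[i];
        if (i + 1 < len)
            word |= (uint32_t)data[i + 1] << 8;
        sum1 = (sum1 + word) % 65535u;
        sum2 = (sum2 + sum1) % 65535u;
    }
    return (sum2 << 16) | sum1;
}
```

A checksum stored at write time and recomputed at read time will differ if the bytes in between were altered, which is how the filter detects corruption.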
HDF 28 Shuffling filter Predefined HDF5 filter Not a compression; change of byte order in a stream of data Example – Hexadecimal form –0x01 0x17 0x2B Big-endian machine –0x00 0x00 0x00 0x01 0x00 0x00 0x00 0x17 0x00 0x00 0x00 0x2B Shuffling –0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x17 0x2B
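The byte rearrangement in the example above can be sketched in a few lines of plain C. This is an illustration of the idea, not the library's own filter code; shuffle_bytes and unshuffle_bytes are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Shuffle: gather byte 0 of every element, then byte 1 of every element,
 * and so on. src and dst must be distinct buffers of nelem * elem_size bytes. */
void shuffle_bytes(const unsigned char *src, unsigned char *dst,
                   size_t nelem, size_t elem_size)
{
    assert(src != dst);
    for (size_t b = 0; b < elem_size; b++)
        for (size_t e = 0; e < nelem; e++)
            dst[b * nelem + e] = src[e * elem_size + b];
}

/* Unshuffle: the exact inverse, restoring the original element layout. */
void unshuffle_bytes(const unsigned char *src, unsigned char *dst,
                     size_t nelem, size_t elem_size)
{
    assert(src != dst);
    for (size_t b = 0; b < elem_size; b++)
        for (size_t e = 0; e < nelem; e++)
            dst[e * elem_size + b] = src[b * nelem + e];
}
```

Applied to the three big-endian 4-byte integers 0x01, 0x17 and 0x2B, shuffle_bytes produces nine zero bytes followed by 0x01 0x17 0x2B. The long run of identical bytes is what helps a compressor placed after the shuffle filter.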
HDF 30 Enabling Shuffling Filter
Create a dataset creation property list
Set chunking (and specify chunk dimensions)
Add the shuffle filter to the pipeline
Define compression filter
Create your dataset specifying this property list
Close property list
    plist = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(plist, ndims, chkdims);
    H5Pset_shuffle(plist);
    H5Pset_deflate(plist, level);
    H5Dcreate(…, "BetterComp", …, plist);
    H5Pclose(plist);
HDF 31 Effect of data shuffling (H5Pset_shuffle + H5Pset_deflate)
              File size   Total time   Write time
No shuffle    102.9 MB
Shuffle       67.34 MB
Compression combined with shuffling provides
Better compression ratio
Better I/O performance
Write 4-byte integer dataset 256x256x1024 (256MB) Using chunks of 256x16x1024 (16MB) Values: random integers between 0 and 255
HDF 32 HDF5 Dataset Access (Transfer) Properties
HDF 33 Dataset Access/Transfer Properties Improve performance H5Pset_buffer –Sets the size of the datatype conversion buffer during I/O –Size should be large enough to hold the slice along the slowest changing dimension –Example: Hyperslab 100x200x300, buffer 200x300 H5Pset_hyper_vector_size –Sets the number of hyperslab offset and length pairs –Improves performance for partial I/O
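The sizing rule for the conversion buffer (one slice along the slowest changing dimension) reduces to a small calculation. The sketch below is plain C and assumes C ordering, where the first dimension changes slowest; slice_buffer_bytes and the choice of elem_size are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed to hold one slice along the slowest changing (first)
 * dimension: the product of all remaining dimensions times the element size. */
size_t slice_buffer_bytes(int rank, const size_t *dims, size_t elem_size)
{
    assert(rank >= 1 && elem_size > 0);
    size_t bytes = elem_size;
    for (int d = 1; d < rank; d++)
        bytes *= dims[d];
    return bytes;
}
```

For the 100x200x300 hyperslab above with 4-byte elements, one 200x300 slice needs 240000 bytes. The result is the kind of value that would be passed as the size argument of H5Pset_buffer.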
HDF 34 Dataset Access/Transfer Properties H5Pset_edc_check –For datasets created with error detection filter enabled –Enables error checking during read operation –H5Z_ENABLE_EDC (default) –H5Z_DISABLE_EDC H5Pset_dxpl_mpio –Sets data transfer mode for parallel I/O –H5FD_MPIO_INDEPENDENT (default) –H5FD_MPIO_COLLECTIVE
HDF 35 User-defined Filters
HDF 36 Standard Interface for User-defined Filters H5Zregister : Register filter so that HDF5 knows about it H5Zunregister: Unregister a filter H5Pset_filter: Adds a filter to the filter pipeline H5Pget_filter: Returns information about a filter in the pipeline H5Zfilter_avail: Check if filter is available
HDF 37 File Creation Properties
HDF 38 File Creation Properties H5Pset_userblock –User block stores user-defined information (e.g., ASCII text to describe a file) at the beginning of the file –cat my.txt hdf5.h5 > myhdf5.h5 –Sets the size of the user block –512 bytes, 1024 bytes, 2^N H5Pset_sizes –Sets the byte size of the offsets and lengths used to address objects in the file H5Pset_sym_k –Controls the rank of B-trees for groups –Default is 16 H5Pset_istore_k –Controls the rank of B-trees for chunked datasets –Default is 32
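Because the user block size is limited to 512 bytes, 1024 bytes, and larger powers of two (with zero meaning no user block), it can help to compute the smallest valid size for a given text before creating the file. The sketch below is plain C; is_valid_userblock_size and userblock_size_for are hypothetical helper names.

```c
#include <assert.h>

/* A user block size is valid if it is 0 (no user block) or a power of two
 * that is at least 512 bytes. */
int is_valid_userblock_size(unsigned long long size)
{
    if (size == 0)
        return 1;
    return size >= 512 && (size & (size - 1)) == 0;
}

/* Smallest valid user block size that can hold nbytes of user data. */
unsigned long long userblock_size_for(unsigned long long nbytes)
{
    assert(nbytes <= (1ULL << 62));      /* keeps the doubling from overflowing */
    if (nbytes == 0)
        return 0;
    unsigned long long size = 512;
    while (size < nbytes)
        size *= 2;
    return size;
}
```

A 100-byte description fits in a 512-byte user block, while a 513-byte one needs 1024 bytes; the result is the kind of value that would be passed to H5Pset_userblock.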
HDF 39 File Access Properties
HDF 40 File Access Properties (Performance) H5Pset_cache –Sets metadata cache and raw data chunk cache parameters –Improper size will degrade performance H5Pset_meta_block_size –Reduces the number of small objects in the file –Block of metadata is written in a single I/O operation (default 2K) –VFL driver has to set H5FD_FEAT_AGGREGATE_METADATA H5Pset_sieve_buf_size –Improves partial I/O
HDF 41 File Access Properties (Physical storage and Usage of Low-level I/O Libraries) VFL layer: file drivers Define physical storage of the HDF5 file –Memory driver (HDF5 file in the application’s memory) –Stream driver (HDF5 file written to a socket) –Split(multi) files driver –Family driver Define low level I/O library –MPI I/O driver for parallel access –STDIO vs. SEC2
HDF 42 Files needn’t be files - Virtual File Layer VFL: A public API for writing I/O drivers [Figure: hid_t “File” handle; VFL: Virtual File I/O Layer; I/O drivers (memory, mpio, stdio, network, split, family, SRB); “Storage” (files, memory, network, SRB repository)]
HDF 43 Split Files Allows you to split metadata and data into separate files May reside on different file systems for better I/O Disadvantage: User has to keep track of the files Dataset “A” Dataset “B” Data A Data B Metadata file Raw data file HDF5 file
HDF 44 Creating Split Files
Create a file access property list
Set up file access property list to use split files
Create the file with this property list
Close the property list
    plist = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_split(plist, ".met", H5P_DEFAULT, ".dat", H5P_DEFAULT);
    file = H5Fcreate(H5FILE_NAME, H5F_ACC_TRUNC, H5P_DEFAULT, plist);
    H5Pclose(plist);
HDF 45 File Families Allows you to access files larger than 2GB on file systems that don't support large files Any HDF5 file can be split into a family of files and vice versa A family member size must be a power of two
HDF 46 Creating a File Family
Create a file access property list
Set up file access property list to use file family
Create the file with this property list
    plist = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_family(plist, family_size, H5P_DEFAULT);
    file = H5Fcreate(H5FILE_NAME, H5F_ACC_TRUNC, H5P_DEFAULT, plist);
    H5Pclose(plist);
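How a logical file address maps onto family members can be sketched with integer division. The code below is plain C and only models the arithmetic; family_loc_t, family_locate and family_members_needed are hypothetical names, and the power-of-two check follows the constraint stated on the File Families slide.

```c
#include <assert.h>

/* Location of a logical file address inside a family of member files. */
typedef struct {
    unsigned long long member;   /* index of the member file   */
    unsigned long long offset;   /* byte offset inside that member */
} family_loc_t;

int is_power_of_two(unsigned long long n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Split a logical address into (member index, offset within member). */
family_loc_t family_locate(unsigned long long addr,
                           unsigned long long member_size)
{
    assert(is_power_of_two(member_size));
    family_loc_t loc = { addr / member_size, addr % member_size };
    return loc;
}

/* Number of member files needed to hold total_size bytes. */
unsigned long long family_members_needed(unsigned long long total_size,
                                         unsigned long long member_size)
{
    assert(is_power_of_two(member_size));
    return (total_size + member_size - 1) / member_size;
}
```

With 2 GB members, a 100 GB file needs 50 members, and logical address 5 GB falls 1 GB into the third member (index 2).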
HDF 47 HDF5 Datatypes
HDF 48 Datatypes A datatype is –A classification specifying the interpretation of a data element –Specifies for a given data element the set of possible values it can have the operations that can be performed how the values of that type are stored –May be shared between different datasets in one file
HDF 49 HDF5 datatypes Atomic types –standard integer & float –user-definable scalars (e.g. 13-bit integer) –bitfields –variable length types (e.g. strings) –pointers - references to objects/dataset regions –enumeration - names mapped to integers
HDF 50 General Operations on HDF5 Datatypes Create –H5Tcreate creates a datatype of the H5T_COMPOUND, H5T_OPAQUE, and H5T_ENUM classes Copy –H5Tcopy creates another instance of the datatype; can be applied to any datatype Commit –H5Tcommit creates a datatype object in the HDF5 file; a committed datatype can be shared between different datasets Open –H5Topen opens a datatype stored in the file Close –H5Tclose closes a datatype object
HDF 51 Programming model for HDF5 Datatypes Use predefined HDF5 types –No need to close OR –Create Create a datatype (by copying an existing one or by creating one of the H5T_COMPOUND (ENUM, OPAQUE) classes) Create a datatype by querying the datatype of a dataset –Open committed datatype from the file (Optional) Discover datatype properties (size, precision, members, etc.) Use datatype to create a dataset/attribute, to write/read dataset/attribute, to set fill value (Optional) Save datatype in the file Close
HDF 52 HDF5 Compound Datatypes Compound types –Comparable to C structs –Members can be atomic or compound types –Members can be multidimensional –Can be written/read by a field or set of fields –Not all data filters can be applied (shuffling, SZIP) –H5Tcreate(H5T_COMPOUND), H5Tinsert calls to create a compound datatype –See H5Tget_member* functions for discovering properties of the HDF5 compound datatype
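Since a compound datatype mirrors a C struct, each H5Tinsert call needs a member name, the member's byte offset in the struct, and its type. The sketch below is plain C and gathers the name, offset and size for a made-up struct using offsetof; sensor_t, member_desc_t, sensor_members and members_fit are hypothetical and serve only as an illustration.

```c
#include <assert.h>
#include <stddef.h>

/* A made-up record type standing in for an application struct. */
typedef struct {
    int    id;
    double temperature;
    char   label[8];
} sensor_t;

/* Name, byte offset and size of one struct member. */
typedef struct {
    const char *name;
    size_t      offset;
    size_t      size;
} member_desc_t;

const member_desc_t sensor_members[] = {
    { "id",          offsetof(sensor_t, id),          sizeof(int)     },
    { "temperature", offsetof(sensor_t, temperature), sizeof(double)  },
    { "label",       offsetof(sensor_t, label),       sizeof(char[8]) },
};

/* Check that the members are ordered, do not overlap, and fit in the struct. */
int members_fit(const member_desc_t *m, size_t n, size_t total_size)
{
    assert(m != NULL);
    size_t end = 0;
    for (size_t i = 0; i < n; i++) {
        if (m[i].offset < end)
            return 0;
        end = m[i].offset + m[i].size;
    }
    return end <= total_size;
}
```

The struct size passed to H5Tcreate(H5T_COMPOUND) would be sizeof(sensor_t); because of padding it can be larger than the sum of the member sizes, so offsets should come from offsetof rather than from manual addition.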
HDF 53 HDF5 Fixed and Variable length array storage [Figure: two panels with axes Data and Time]
HDF 54 HDF5 Variable Length Datatypes Programming issues
Each element is represented by C struct
    typedef struct {
        size_t len;
        void *p;
    } hvl_t;
Base type can be any HDF5 type
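The main programming issue with such elements is memory ownership: every element carries its own heap buffer that the application must allocate and later free. The sketch below is plain C with a stand-in struct of the same shape (a length and a pointer); vl_elem_t, vl_fill_ints, vl_total_len and vl_free are hypothetical names and do not call HDF5.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for a variable-length element: a length plus a pointer to data. */
typedef struct {
    size_t len;
    void  *p;
} vl_elem_t;

/* Give one element n ints holding 0..n-1. Returns 0 on success, -1 on failure. */
int vl_fill_ints(vl_elem_t *e, size_t n)
{
    assert(e != NULL);
    e->len = 0;
    e->p = NULL;
    if (n == 0)
        return 0;
    int *data = malloc(n * sizeof *data);
    if (data == NULL)
        return -1;
    for (size_t i = 0; i < n; i++)
        data[i] = (int)i;
    e->len = n;
    e->p = data;
    return 0;
}

/* Total number of base-type values stored across all elements. */
size_t vl_total_len(const vl_elem_t *elems, size_t count)
{
    size_t total = 0;
    for (size_t i = 0; i < count; i++)
        total += elems[i].len;
    return total;
}

/* Release every element's buffer and reset the element. */
void vl_free(vl_elem_t *elems, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        free(elems[i].p);
        elems[i].p = NULL;
        elems[i].len = 0;
    }
}
```

Three elements of lengths 1, 2 and 3 hold six values in total, yet the array of elements itself has a fixed size of three structs, which is what makes ragged data storable in a regular dataset.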
HDF 55 HDF5 Variable Length Datatypes Global heap Dataset with variable length datatype Raw data
HDF 56 HDF Information HDF Information Center – HDF Help address HDF users mailing list