May 30-31, 2012 HDF5 Workshop at PSI
Shared Object Headers
Dana Robinson, The HDF Group
Efficient Use of HDF5 With High Data Rate X-Ray Detectors
Paul Scherrer Institut

Overview
Datasets, committed datatypes, and groups store metadata and attributes as object header messages.
We can save space by storing duplicated object header information once, instead of many times, and maintaining multiple references to it.

[Figure. Before: datasets with identical attributes. After: datasets with a shared attribute.]

Overview
+ Saves space: possibly a lot, but this depends on the application and on what data are stored and how.
- Overhead (lookups, etc.): depends on the heterogeneity of the potentially shared data and on other factors.
- Breaks locality: extra seeks may be needed to retrieve object header data.

Object Header Indexes
Store references to existing sharable object headers.
Sharing is across the entire file.
One per type of object to be shared.
Contain hash values for the shared objects.
Automatically switch from an unsorted list to a B-tree based on size.
Only store large messages. Small messages are stored locally for faster access.

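The idea of storing a message once, finding it again by hash, and leaving small messages local can be illustrated with a toy model. The sketch below is not HDF5 code: `toy_shared_index`, `toy_share`, the fixed capacities, and the FNV-1a hash are all hypothetical choices made for illustration, and the rule that a message is shared when its size is at least `min_mesg_size` is an assumption about the cutoff semantics.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy model of a shared-message index; not the HDF5 implementation. */
#define TOY_MAX_ENTRIES 64   /* arbitrary capacity for the sketch */
#define TOY_MAX_MESG    256  /* arbitrary largest message the sketch stores */

typedef struct {
    uint32_t      hash;      /* hash of the message bytes, used to speed up lookups */
    size_t        size;      /* message length in bytes */
    unsigned      refcount;  /* how many object headers refer to this message */
    unsigned char bytes[TOY_MAX_MESG];
} toy_entry;

typedef struct {
    size_t    min_mesg_size; /* messages smaller than this stay local */
    size_t    count;         /* number of distinct shared messages */
    toy_entry entries[TOY_MAX_ENTRIES];
} toy_shared_index;

/* FNV-1a, chosen only because it is short; the real hash function may differ. */
uint32_t toy_hash(const void *data, size_t size)
{
    const unsigned char *p = data;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < size; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Returns the index entry number if the message is shared,
 * or -1 if the caller should store the message locally. */
int toy_share(toy_shared_index *ix, const void *mesg, size_t size)
{
    if (size < ix->min_mesg_size || size > TOY_MAX_MESG)
        return -1;                       /* too small (or too big for the toy) */

    uint32_t h = toy_hash(mesg, size);
    for (size_t i = 0; i < ix->count; i++) {   /* unsorted list: linear scan */
        toy_entry *e = &ix->entries[i];
        if (e->hash == h && e->size == size && memcmp(e->bytes, mesg, size) == 0) {
            e->refcount++;               /* duplicate: add a reference only */
            return (int)i;
        }
    }

    if (ix->count == TOY_MAX_ENTRIES)
        return -1;                       /* toy index is full */

    toy_entry *e = &ix->entries[ix->count];
    e->hash = h;
    e->size = size;
    e->refcount = 1;
    memcpy(e->bytes, mesg, size);
    return (int)ix->count++;
}
```

Calling `toy_share` repeatedly with the same large message leaves a single entry whose `refcount` grows, while a message below the cutoff is rejected and stays with its object header.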
Shared Message Types
Dataspace
Datatype
Fill Value
Filter Pipeline
Attributes
Most other object header messages are unlikely to be large enough to justify the overhead of sharing them.

Array to B-Tree Transition
[Figure: # entries, starting at 0, with a low mark and a high mark.]
As the number of entries in an index rises above the high mark, the index structure switches from an unsorted list to a B-tree.
When the number of entries drops below a second threshold, the low mark, the index reverts to an unsorted list.
Note that this works like a thermostat: the high and low cutoffs are not the same, which avoids thrashing when the number of entries hovers around a single cutoff point.

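The thermostat behaviour can be sketched as a small state machine. This is a toy model, not HDF5 code: the `toy_index` type and its functions are hypothetical, and the exact boundary conditions (switch to a B-tree when the count exceeds the high mark, revert when it falls below the low mark) are an assumption.

```c
#include <assert.h>

/* Hypothetical toy model of the list <-> B-tree hysteresis; not HDF5 code. */
typedef enum { INDEX_LIST, INDEX_BTREE } index_kind;

typedef struct {
    index_kind kind;      /* current storage form */
    unsigned   entries;   /* number of entries in the index */
    unsigned   max_list;  /* high mark: above this, switch to a B-tree */
    unsigned   min_btree; /* low mark: below this, revert to a list */
} toy_index;

void toy_index_init(toy_index *ix, unsigned max_list, unsigned min_btree)
{
    /* The low mark must not sit above the high mark, or there is no dead band. */
    assert(min_btree <= max_list);
    ix->kind = INDEX_LIST;
    ix->entries = 0;
    ix->max_list = max_list;
    ix->min_btree = min_btree;
}

/* Re-evaluate the storage form after the entry count changes. */
void toy_index_update(toy_index *ix)
{
    if (ix->kind == INDEX_LIST && ix->entries > ix->max_list)
        ix->kind = INDEX_BTREE;
    else if (ix->kind == INDEX_BTREE && ix->entries < ix->min_btree)
        ix->kind = INDEX_LIST;
    /* Between the two marks the index keeps whatever form it already has. */
}

void toy_index_add(toy_index *ix)
{
    ix->entries++;
    toy_index_update(ix);
}

void toy_index_remove(toy_index *ix)
{
    if (ix->entries > 0)
        ix->entries--;
    toy_index_update(ix);
}
```

With marks of 50 and 40, an index that grows to 51 entries becomes a B-tree and stays one while the count moves between 40 and 50; it only reverts once the count reaches 39. A single shared cutoff would flip the structure on every add and remove near that point.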
New API Calls
Set the number of indexes:
    herr_t H5Pset_shared_mesg_nindexes(hid_t plist_id, unsigned nindexes)
Set the properties for each message index:
    herr_t H5Pset_shared_mesg_index(hid_t plist_id, unsigned index_num, unsigned mesg_type_flags, unsigned min_mesg_size)
Set the low and high marks for array->tree transitions:
    herr_t H5Pset_shared_mesg_phase_change(hid_t plist_id, unsigned max_list, unsigned min_btree)

Implementation Notes
Version feature
Requires file format changes.
Files containing shared object headers will not be readable by older versions of HDF5.
Disabled by default.
Users can optionally set the message size cutoff.