Data Container Study: HDF5 in a POSIX File System
or HDF5 C³: Compression, Chunking, Clusters

Aleksandar Jelenak, John Readey, H. Joe Lee, Ted Habermann
The HDF Group
ESIP Winter Meeting, January 8
Hardware

- Open Science Data Cloud Griffin cluster
- Xeon systems with 1-16 cores
- 60 compute nodes
- 10 Gb Ethernet
- Ephemeral local POSIX file system
- Shared persistent storage (Ceph object store with S3 API)
Software

- HDF5 library
- Compression libraries: MAFISC, GZIP, BLOSC
- Operating system: Ubuntu Linux
- Linux development tools
- Any HDF5-supported C compiler
- HDF5 tools: h5dump, h5repack, etc.
- Python 3
- Python packages: h5py, NumPy, ipyparallel, PyTables
Data

- NCEP/DOE Reanalysis II for GSSTF, Daily Grid, v3
  - 0.25×0.25 deg grid, global
  - 7,850 daily files, 120 GB
- NOAA Coral Reef Temperature Anomaly Database (CoRTAD)
  - ~4 km grid, global
  - Weekly time step
  - 8 files, 253 GB
Workflow

Data ingest/preprocessing:
1. Download data as HDF5 files from the archive and transfer to the S3 object store
2. Repack the original file(s) using HDF5 chunking and compression; transfer to the S3 store
3. Collate data from the original files into one file with HDF5 chunking and compression; transfer to the S3 store (see the sketch after this list)
4. Index the data in the file(s) by collecting descriptive statistics (min, max, etc.) for each HDF5 chunk

Data analysis:
1. Launch a number of VMs and connect them into an ipyparallel cluster
2. Distribute the input HDF5 data from the S3 store to the cluster VMs
3. Execute the data analysis task on the cluster VMs
4. Collect the data analysis results from the cluster VMs and prepare the report
5. Shut down the cluster and VMs
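The collation step could look like the following h5py sketch. The file names, the dataset path "SST", the GZIP level, and the chunk shape are illustrative assumptions, not values from the study.

    # Stack daily 720x1440 grids into one time x lat x lon dataset,
    # applying chunking and compression as the data are written.
    import glob
    import h5py

    daily = sorted(glob.glob("GSSTF_NCEP.3.*.he5"))  # hypothetical file names
    with h5py.File("ncep_collated.h5", "w") as out:
        dset = out.create_dataset(
            "SST", shape=(len(daily), 720, 1440), dtype="f4",
            chunks=(25, 20, 20),                       # "data cube" chunk shape
            compression="gzip", compression_opts=6,    # level is an assumption
        )
        for i, name in enumerate(daily):
            with h5py.File(name, "r") as f:
                dset[i] = f["SST"][...]                # hypothetical dataset path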
System Architecture

[Figure: system architecture diagram]
HDF5 Chunks

- Chunking is one of the storage layouts for HDF5 datasets
- The dataset's byte stream is broken up into chunks stored at various locations in the file
- Chunks are of equal size in the dataset's dataspace, but may not be of equal byte size in the file
- HDF5 filtering works on chunks only
- Filters provide compression/decompression, scaling, checksum calculation, etc. (see the sketch after this list)
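A minimal h5py illustration of a chunked dataset with a filter pipeline; the chunk shape and GZIP level here are arbitrary choices for the demo:

    # Chunked storage with three filters: byte shuffle, GZIP compression,
    # and a Fletcher-32 checksum computed per chunk.
    import numpy as np
    import h5py

    with h5py.File("chunked_demo.h5", "w") as f:
        f.create_dataset(
            "demo",
            data=np.random.rand(10, 720, 1440).astype("f4"),
            chunks=(1, 72, 144),   # each chunk holds part of one time step
            shuffle=True,          # byte-shuffle filter (helps compression)
            compression="gzip",
            compression_opts=6,
            fletcher32=True,       # checksum filter
        )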
Findings: Chunking

- Two different chunking algorithms:
  - Unidata's optimal chunking formula for 3D datasets
  - h5py's formula
- Three different chunk sizes chosen for the collated NCEP data set (declared in the sketch below):
  - Synoptic map: 1×72×144
  - Data rod: 7850×1×1
  - Data cube: 25×20×20
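The three shapes, plus h5py's own chunking formula, can be expressed directly at dataset creation time; this sketch only declares the layouts and writes no data:

    import h5py

    with h5py.File("chunk_shapes.h5", "w") as f:
        # chunks=True asks h5py to apply its own chunk-guessing formula.
        f.create_dataset("auto", shape=(7850, 720, 1440), dtype="f4", chunks=True)
        # The three explicit shapes from the study.
        for name, shape in [("map",  (1, 72, 144)),    # synoptic map
                            ("rod",  (7850, 1, 1)),    # data rod (time series)
                            ("cube", (25, 20, 20))]:   # data cube
            f.create_dataset(name, shape=(7850, 720, 1440), dtype="f4",
                             chunks=shape)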
Findings: Chunking

Input: collated NCEP data file, 7850×720×1440, 5 datasets, 121 GB

Outputs:

    Chunk size   Filter         File size change (%)   Runtime (hours)
    1×72×144     GZIP level …   …                      …
    7850×1×1     GZIP level …   …                      …
    25×20×20     GZIP level …   …                      …
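Why chunk shape influences runtime: a read that matches the chunk shape touches few chunks, while a mismatched read must locate and decompress many. A rough way to observe this on the collated file (names as in the earlier collation sketch):

    import time
    import h5py

    with h5py.File("ncep_collated.h5", "r") as f:
        d = f["SST"]
        t0 = time.time()
        _ = d[0, :, :]       # synoptic map: cheap with 1x72x144 chunks
        t1 = time.time()
        _ = d[:, 360, 720]   # data rod: cheap only with 7850x1x1 chunks
        t2 = time.time()
        print(f"map read: {t1 - t0:.2f} s, rod read: {t2 - t1:.2f} s")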
Findings: Compression

- Compression filters: GZIP, SZIP, MAFISC, Blosc
- NCEP data set: 7,850 files
- Chunk size: 45×180

    Filter         Total file size change (%)   Runtime (hours)
    GZIP level …   …                            …
    SZIP           …                            …
    MAFISC         …                            …
    Blosc          …                            …
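A hedged sketch of how these filters can be applied from h5py. GZIP is built into HDF5, SZIP requires an SZIP-enabled HDF5 build, Blosc is available through the hdf5plugin package, and MAFISC needs a separately built HDF5 filter plugin (not shown):

    import numpy as np
    import h5py
    import hdf5plugin   # registers the Blosc filter with HDF5

    data = np.random.rand(720, 1440).astype("f4")
    with h5py.File("filters.h5", "w") as f:
        f.create_dataset("gzip", data=data, chunks=(45, 180),
                         compression="gzip", compression_opts=9)
        f.create_dataset("szip", data=data, chunks=(45, 180),
                         compression="szip")   # requires SZIP-enabled HDF5
        f.create_dataset("blosc", data=data, chunks=(45, 180),
                         **hdf5plugin.Blosc(cname="lz4",
                                            shuffle=hdf5plugin.Blosc.SHUFFLE))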
Data Indexing

- Value range information (min, max) captured for each HDF5 dataset chunk
- These values, plus each chunk's dataspace coordinates, are stored in a PyTables file (a sketch follows this list)
- ~30 minutes to collect index data from the collated NCEP data file
- Work on incorporating this information into processing is ongoing
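A minimal sketch of such a chunk index, assuming the collated file from the earlier sketch and a recent h5py (for Dataset.iter_chunks); the table schema is an assumption, not the study's actual layout:

    import h5py
    import numpy as np
    import tables

    class ChunkStat(tables.IsDescription):
        offset = tables.Int64Col(shape=(3,))  # chunk origin in the dataspace
        vmin   = tables.Float32Col()
        vmax   = tables.Float32Col()

    with h5py.File("ncep_collated.h5", "r") as src, \
         tables.open_file("chunk_index.h5", "w") as idx:
        table = idx.create_table("/", "sst_index", ChunkStat)
        d = src["SST"]
        for sel in d.iter_chunks():          # one slice tuple per chunk
            block = d[sel]
            row = table.row
            row["offset"] = [s.start for s in sel]
            row["vmin"] = np.nanmin(block)
            row["vmax"] = np.nanmax(block)
            row.append()
        table.flush()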
Findings: Parallel

- Load time improved with up to 16 nodes
- Run time improved super-linearly as nodes were added (up to 64)
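The parallel analysis step on the ipyparallel cluster could be organized along these lines; the task (a mean over a block of time steps) and all names are illustrative:

    import ipyparallel as ipp

    def analyze(time_range):
        # Runs on an engine; reads the VM's local copy of the data.
        import h5py
        import numpy as np
        with h5py.File("ncep_collated.h5", "r") as f:
            block = f["SST"][time_range[0]:time_range[1]]
            return float(np.nanmean(block))

    rc = ipp.Client()                  # connect to the running cluster
    view = rc.load_balanced_view()
    step = 7850 // len(rc.ids)         # split the time axis across engines
    ranges = [(i, min(i + step, 7850)) for i in range(0, 7850, step)]
    means = view.map_sync(analyze, ranges)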
Conclusion

- A computing environment where the POSIX file system is not the persistent storage poses unique challenges
- Chunk size does influence runtime
- Compression filter performance: Blosc < GZIP-9 < MAFISC
- Increasing the number of compute nodes reduces the observed differences in runtime