HDF5: A new file format & software for high-performance scientific data management

High performance data requirements
  larger datasets (> terabyte)
  bigger, faster machines and storage systems
  varied architectures and I/O paradigms
  parallel computing environments
  complex subsetting
  complex data

HDF5 based on lessons learned from:
  Existing standards
    –HDF, PDB, AIO, netCDF, MPI-IO and others
  Computer science
  ASCI physics applications and users
  Earth science applications and users
  Other users

… and ASCI Requirements
  Compatibility with vector bundle model
  Collective access
  MPI-IO
  Transform data between memory & storage
  Parallel file systems: PIOFS, HPSS, etc.

Data model
  Datatypes (array elements)
    –integer & float
    –strings & pointers
    –compound (record structures)
  Aggregate object types
    –“dataset:” multidimensional array
    –grouping structure
    –each object has a name & attributes

Basic data object: array of records
  Dimensionality: 5 x 3
  Number type: record of int8, int4, int16, float32

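A minimal sketch of the array-of-records idea, written in Python with only the standard `struct` module rather than the HDF5 library itself. The record layout is an assumption made for illustration: the slide's int4 field is left out because `struct` has no 4-bit integer code, and the packed, little-endian layout is not a statement about HDF5's on-disk format.

```python
import struct

# Assumed record layout: int8, int16, float32, little-endian, no padding.
RECORD = struct.Struct("<bhf")
ROWS, COLS = 5, 3  # dimensionality from the slide: 5 x 3

def pack_array(records):
    """Pack a ROWS x COLS nested list of (int8, int16, float32) tuples, row-major."""
    return b"".join(RECORD.pack(*rec) for row in records for rec in row)

def read_element(buf, i, j):
    """Unpack only the record at position (i, j)."""
    offset = (i * COLS + j) * RECORD.size
    return RECORD.unpack_from(buf, offset)

# Illustrative data: every element of the 5 x 3 array is one record.
records = [[(i, 100 * i + j, 0.5 * j) for j in range(COLS)] for i in range(ROWS)]
buf = pack_array(records)
last = read_element(buf, 4, 2)
```

Because every record has the same size, the element at (i, j) can be located by offset arithmetic alone.
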
Storage Capacity
                                    HDF4                     HDF5
  Store large objects               Limit: 2 gigabytes       no limit
  Store large numbers of objects    Limit: 20,000 objects    no limit

Dataset components
  a multidimensional array of data elements
  header with metadata
    –datatype
    –dataspace
    –attributes
    –storage info
  Figure: Dataset “Fred”, made up of a metadata header and data
    –Datatype: int16
    –Dataspace: rank and dimensions (Dim_1=5, Dim_2=4, Dim_3=2)
    –Attributes: time = 32.4, pressure = 987, temp = 56
    –Storage info: chunked; compressed

Groups
  Group structure for organizing the file
  Every file starts with a root group
  Like directories in file system
  Groups have attributes
  Figure: groups “/”, “/foo”, “/foo/bar”

Special Storage Options
  chunked: Improves subsetting access time
  compressed: Improves storage efficiency
  extendable: Arrays can be extended individually
  split file: Metadata in one file. Raw data in another.
  Figure: Dataset “Fred” split across File A and File B (metadata for Fred, data for Fred)

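One way to see why chunked storage helps subsetting is to count which chunks a requested block overlaps. The sketch below is plain Python, not an HDF5 call; the function name `chunks_touched` and the dataset and chunk sizes are hypothetical.

```python
from itertools import product

def chunks_touched(start, count, chunk_shape):
    """Return grid coordinates of every chunk overlapping a contiguous block.

    start, count: per-dimension offset and extent of the requested block.
    chunk_shape: per-dimension chunk size.
    """
    ranges = []
    for s, c, k in zip(start, count, chunk_shape):
        first = s // k            # chunk holding the first selected index
        last = (s + c - 1) // k   # chunk holding the last selected index
        ranges.append(range(first, last + 1))
    return list(product(*ranges))

# Assumed example: a 1000 x 1000 dataset stored as 100 x 100 chunks,
# from which a 50 x 50 block starting at (120, 480) is read.
touched = chunks_touched(start=(120, 480), count=(50, 50), chunk_shape=(100, 100))
```

In this example the block overlaps 2 of the 100 chunks, so only those two need to be read (and, if compressed, decompressed).
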
The HDF5 Library
  New API and programming model
  Smaller, better, faster
  Able to support parallel I/O better
  OO compatible
  C & Fortran still primary, others considered
  I/O performance emphasized
  Current platforms
    –ASCI: IBM SP2, SGI Origin 2000, Intel Teraflop
    –Solaris, Linux, HPUX, IRIX, NT

Sub-selection Options
  Flexibility in mappings between data in memory and object in file
  Selection regions can be
    –points
    –hyperslabs
    –unions of hyperslabs
  Selection region in memory can be different shape from selection in file
  Supports I/O needs for parallel computation

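The selection kinds listed above can be sketched as plain coordinate lists. This is an illustration in standard Python, not the library's implementation: the start/stride/count/block parameter names follow the convention of the HDF5 hyperslab selection call, while the functions `hyperslab_points` and `union` are hypothetical.

```python
from itertools import product

def hyperslab_points(start, stride, count, block):
    """Enumerate the coordinates of a regular hyperslab.

    Per dimension: `count` blocks of `block` elements each, the blocks
    beginning at start, start + stride, start + 2 * stride, ...
    """
    axes = []
    for s, st, c, b in zip(start, stride, count, block):
        axes.append([s + i * st + j for i in range(c) for j in range(b)])
    return list(product(*axes))

def union(*selections):
    """Union of selections; duplicates dropped, first-appearance order kept."""
    seen = {}
    for sel in selections:
        for p in sel:
            seen.setdefault(p, None)
    return list(seen)

# Two blocks of width 2 along the second axis, on every other row.
a = hyperslab_points(start=(0, 1), stride=(2, 3), count=(2, 2), block=(1, 2))
# A second, overlapping hyperslab: points (0, 0) and (0, 1).
b = hyperslab_points(start=(0, 0), stride=(1, 1), count=(1, 2), block=(1, 1))
both = union(a, b)
```

A point selection is simply an explicit list of such coordinates, so all three kinds reduce to the same representation in this sketch.
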
Mappings between file dataspaces/selections and memory dataspaces/selections.
  (a) A hyperslab from a 2D array to the corner of a smaller 2D array
  (b) A regular series of blocks from a 2D array to a contiguous sequence at a certain offset in a 1D array
  (c) A sequence of points from a 2D array to a sequence of points in a 3D array.
  (d) Union of hyperslabs in file to union of hyperslabs in memory.
  Number of elements must be equal.

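Case (a) can be sketched as follows, with nested Python lists standing in for the file and memory arrays. The helper names (`block_points`, `transfer`) and the array sizes are hypothetical; the one rule taken from the slide is that both selections must contain the same number of elements.

```python
from itertools import product

def block_points(start, count):
    """Coordinates of a contiguous rectangular block, in row-major order."""
    return list(product(*(range(s, s + c) for s, c in zip(start, count))))

def transfer(src, src_sel, dst, dst_sel):
    """Copy src[p] into dst[q], pairing the two selections element by element."""
    if len(src_sel) != len(dst_sel):
        raise ValueError("selections must contain the same number of elements")
    for (si, sj), (di, dj) in zip(src_sel, dst_sel):
        dst[di][dj] = src[si][sj]

# Stand-ins for the "file" array (4 x 6) and a smaller "memory" array (3 x 3).
file_array = [[10 * i + j for j in range(6)] for i in range(4)]
memory = [[0] * 3 for _ in range(3)]

# Case (a): a 2 x 2 hyperslab at (1, 2) in the file lands in the memory corner.
transfer(file_array, block_points((1, 2), (2, 2)),
         memory, block_points((0, 0), (2, 2)))
```

The same `transfer` covers the other cases if the selections are built differently, since it only pairs the two coordinate sequences in order.
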
HDF5 Raw Data Pipeline
  Handles all aspects of data storage and transfer of data between file and application.
  Deals with multiple storage options
    –chunking, compression, number conversion,...
  Optimized performance for common usage
  Hooks for new filters
    –compression schemes, encryption, checksum,...
    –user-specified filters

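The filter-hook idea can be sketched as a list of stages applied in order on write and in reverse on read. This is an assumption-laden illustration in standard Python, not the HDF5 filter interface: the classes `Deflate` and `Crc32` and the functions `write_chunk` and `read_chunk` are hypothetical names.

```python
import struct
import zlib

class Deflate:
    """Compression stage using zlib."""
    def encode(self, data):
        return zlib.compress(data)
    def decode(self, data):
        return zlib.decompress(data)

class Crc32:
    """Checksum stage: appends a CRC-32 on write, verifies and strips it on read."""
    def encode(self, data):
        return data + struct.pack("<I", zlib.crc32(data))
    def decode(self, data):
        payload, (stored,) = data[:-4], struct.unpack("<I", data[-4:])
        if zlib.crc32(payload) != stored:
            raise ValueError("checksum mismatch")
        return payload

def write_chunk(raw, filters):
    """Apply each filter in order on the way to storage."""
    for f in filters:
        raw = f.encode(raw)
    return raw

def read_chunk(stored, filters):
    """Undo the filters in reverse order on the way back to the application."""
    for f in reversed(filters):
        stored = f.decode(stored)
    return stored

pipeline = [Deflate(), Crc32()]     # a user-specified stage would be appended here
chunk = bytes(range(256)) * 16      # 4096 bytes of illustrative raw data
stored = write_chunk(chunk, pipeline)
restored = read_chunk(stored, pipeline)
```

Any object with matching `encode` and `decode` methods can be added to the list, which is the sense in which the pipeline offers hooks for new filters.
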
Performance tuning
  Facilities for performance measurement
    –timing tests in test suite
    –Pablo instrumentation
  Caching
    –app can set cache size for metadata & chunks
  Parallel optimizations
    –efficient metadata management
    –chunking
    –can control placement on physical media

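To illustrate why an application-settable chunk cache size matters, here is a small cache sketch in standard Python. The slide does not say which eviction policy the library uses; least-recently-used is assumed here, and the class `ChunkCache` and its `load_chunk` callback are hypothetical.

```python
from collections import OrderedDict

class ChunkCache:
    """Keeps at most `max_chunks` chunks, evicting the least recently used."""

    def __init__(self, max_chunks, load_chunk):
        self.max_chunks = max_chunks
        self.load_chunk = load_chunk   # called on a miss to fetch the chunk
        self._cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, index):
        if index in self._cache:
            self._cache.move_to_end(index)    # mark as most recently used
            self.hits += 1
            return self._cache[index]
        self.misses += 1
        data = self.load_chunk(index)
        self._cache[index] = data
        if len(self._cache) > self.max_chunks:
            self._cache.popitem(last=False)   # drop the least recently used
        return data

# A stand-in loader; a real one would read the chunk from storage.
cache = ChunkCache(max_chunks=2, load_chunk=lambda idx: f"chunk{idx}")
for idx in [(0, 0), (0, 1), (0, 0), (1, 0), (0, 1)]:
    cache.get(idx)
```

With room for only two chunks, this access pattern gives one hit and four misses; with `max_chunks=3` the last access would also hit, which is the kind of effect the cache-size setting is meant to control.
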
HDF5 and ASCI Applications
  Multi-lab collaboration
    –DOE Tri-lab: Livermore, Sandia, Los Alamos
    –NCSA, Limit Point Systems
  Motivation:
    –Data sharability
    –Application interoperability
  Leverage experiences
    –EXODUS (SNL), SILO & PDB (LLNL)
    –HDF (NCSA), netCDF (UCAR)

ASCI DMF Data Abstraction
  Objectives
    –Sound data model with robust data abstractions
    –Computational mechanics data: meshes & fields
    –Based on mathematical field of fiber bundles
    –Common format allows common tools & sharing
    –Common API shields apps from model complexities
  Figure: software layers, lowest layer first
    –MPI IO (ANL)
    –HDF5 (NCSA)
    –Fiber Bundle Kernel (LLNL)
    –Data Structure Layer (LLNL)
    –Mesh APIs (SNL/LANL)
    –APPLICATION

HDF5 driver projects
  Project: ASCI
    –Application: Computational mechanics
    –Types of data: Fields on meshes: structured, unstructured, hierarchical
  Project: CANIS: UIUC Digital Library Project
    –Application: Concept space analysis of medical abstracts
    –Types of data: Object store for large collection of small objects (noun phrases)
  Project: TRAPPIST: non-destructive testing consortium
    –Application: Non-destructive testing
    –Types of data: NDT experiment data: tomography and radiology
  Project: NASA Earth Observing System
    –Application: Earth Science data management
    –Types of data: Remote sensing: swath, grid and point data