File Formats, Conventions, and Data Level Interoperability
ESDSWG, New Orleans, Oct 20, 2010
Joe Glassy, Chris Lynnes
ESDSWG Tech Infusion

Introduction & overview
Outline of objectives:
– Discuss the role of standard, self-describing “file formats” in data level interoperability
– Summarize common file formats in use, their properties, and benefits (“data life cycle economics”)
– Discuss criteria for choosing a file format and matching it to the needs of consumers and producers
– Discuss the critical role of conventions: any file format needs good recipes to make it interoperable!
– Examples: NASA Measures F/T, SMAP, AIRS, Aura

Role(s) of File Formats in Interoperability
File formats represent versatile “packages” for multi-dimensional science data and metadata.
They offer self-describing “well-known structures” to codify desired, common conventions and practices.
They offer well-documented reference cases to encapsulate specific data models.
Standard file formats dock with format-aware tools to offer users a seamless end-to-end experience and platform portability.
They enhance mission-to-mission continuity.

…investment life-cycle economics…
Why (and how) are file formats important?
Standard formats
– Come with thorough documentation
– Provide good reference implementations
Common formats
– More datasets in a format → more tools that read that format
Canonical structures and names
– → general-purpose handlers for coordinates, etc. → smarter tools

A generic work flow…
Consider user community needs and culture, fit within architecture, and institutional policies and preferences
Choose a standard file format (or sub-variant)
Design a convention-enabled, specific internal layout with metadata interfaces
Prototype: implement in a prototype and evaluate
Implement in production context
Integrate within discovery and catalog environments (catalog interoperability…)

Examples of standard file formats
HDF5: a file format on its own, as well as a broad foundation for others
netCDF v4 (stable at v4.1.1, newest: v4.1.2-beta1)
– v4 Classic (widespread adoption, some limitations…)
– v4 Enhanced (supports groups, user-defined types, variable-length types, and more)
netCDF v3 Classic (legacy+, tools+, but limited)
HDFEOS2, HDFEOS5: EOS Terra, Aqua, Aura…
HDF4: legacy, extensive use by MODIS Terra, Aqua
Many other domain-specific, less generic formats abound… (need transform tools to/from HDF?)

Some selection criteria…
Do the file format’s capabilities support the required functionality?
What is the breadth of acceptance and adoption within the larger community? (And/or does institutional policy dictate a specific format?)
What is the presence and quality of documentation (reference, examples, and especially tutorials), API software, and community support?
What is its contribution to investment and data life-cycle economics?
What is the level of standardization?
How adaptable is the format to widely used conventions like CF 1.x, or other accepted convention(s)?

Internal Layout / Design (once format is chosen & adopted…)
Define & refine the high-level organization/structure: /DATA, /METADATA
Distinguish ‘data’ from ‘metadata’, core structure vs. ‘attributes’
– Dimensions, coordinate variables, projection attributes
– missing_data, _FillValue vs. internal fill value
– Units, gain, offset, min, max, range, etc.
Prototype it!
– Leverage script environments (Python h5py, PyTables, etc.)
– Panoply and HDFView are also quick and useful for prototyping and feedback

Using “Groups”
HDF5 (and NetCDF v4 Enhanced) support full use of groups, e.g. /DATA vs. /METADATA, etc.
Groups are useful for partitioning out functionally related sets of data or attributes; the hierarchical view mimics a file system.
Groups facilitate appropriate information hiding: they highlight needed info and shield the rest (principle of least privilege…).
Well supported by modern tools (Panoply, HDFView, PyTables, h5py) and low-level APIs.

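The /DATA vs. /METADATA layout described above can be prototyped in a few lines with h5py. This is only a minimal sketch under stated assumptions: the file name, the dataset name `ft_status`, the attribute values, and the fill value 255 are all hypothetical and are not taken from any delivered product. The file is created in memory, so nothing is written to disk.

```python
import h5py
import numpy as np


def build_prototype(name="ft_prototype.h5"):
    """Create an in-memory HDF5 file with /DATA and /METADATA groups."""
    # driver="core" with backing_store=False keeps the file in memory only.
    f = h5py.File(name, "w", driver="core", backing_store=False)

    # File-level (global) attributes; the values here are illustrative.
    f.attrs["Conventions"] = "CF-1.4"
    f.attrs["title"] = "Hypothetical daily freeze/thaw prototype"

    data = f.create_group("DATA")
    meta = f.create_group("METADATA")

    # A tiny 2-D grid; 255 is an assumed marker for missing cells.
    values = np.array([[0, 1, 255], [1, 1, 0]], dtype="uint8")
    dset = data.create_dataset("ft_status", data=values, fillvalue=255)
    dset.attrs["long_name"] = "freeze/thaw status (illustrative)"
    dset.attrs["_FillValue"] = np.uint8(255)

    # Group-level attribute standing in for a metadata block.
    meta.attrs["institution"] = "example institution"
    return f


f = build_prototype()
names = []
f.visit(names.append)  # collect every group and dataset path in the file
print(sorted(names))
f.close()
```

Opening a file built this way in HDFView or Panoply is a quick way to get feedback on the layout before committing to it.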
Example(s) of File Formats In Action
HDF5 – NASA Measures
– NASA Measures Freeze/Thaw (soon available at NSIDC)
AQUA AIRS Level 2 (from earlier talk):
– 0/285/AIRS L2.RetStd.v G hdf
Aura TES ( TES-Aura_L3-CH4_r _F01_05.he5 )

Example: NASA Measures Freeze/Thaw, Daily in HDF5 Metadata Block: Attributes
Example: NASA Measures Daily Freeze/Thaw in HDF5 Data Variable (FT_SSMI) and Attributes
Example: NASA Level 2 AIRS (Swath) in HDF4
Example: NetCDF, (tos) Sea surface temperatures collected by PCMDI for use by the IPCC, illustrating CF v1.0 layout
Example: TES (HDFEOS5) illustrating CF v1.0 layout
CF Conventions & file formats: how they contribute to interoperability
CF v1.4.x: the term “CF” is now broader than just climate-forecasting!
Standard Name Table: a step towards wider adoption of names, controlled vocabularies, and units terminology
CF v1.4.x provides tool-makers with helpful “lingua franca” guidance.
Within a file format, adopting conventions like CF promotes common layout, names, and semantics for dataset-to-dataset compatibility, a key to wider data level interoperability.

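One way a tool can use this “lingua franca” is to read the global `Conventions` attribute and decide how to interpret the rest of the file. The sketch below assumes the attribute value follows the usual `CF-1.4` form; the function name `cf_version` and the sample attribute dictionaries are hypothetical, and plain dictionaries stand in for attributes read from a real file.

```python
def cf_version(global_attrs):
    """Return the CF version as (major, minor), or None if not declared."""
    conventions = global_attrs.get("Conventions", "")
    # The attribute may list several conventions separated by spaces or commas.
    for token in conventions.replace(",", " ").split():
        if token.startswith("CF-"):
            major, _, minor = token[3:].partition(".")
            if major.isdigit() and minor.isdigit():
                return (int(major), int(minor))
    return None


# Hypothetical global attributes from two files.
print(cf_version({"Conventions": "CF-1.4", "title": "example"}))
print(cf_version({"title": "no convention declared"}))
```

A reader can then branch on the result, for example applying CF coordinate handling only when a version is declared.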
Attributes vs. Metadata? One man’s ceiling is another man’s floor…
Collection level vs. data set vs. granule level
Structural vs. science-content
Swath vs. grid vs. point
Commonly used attributes:
– Conventions attribute, communicates which convention was used
– Basic globals: title, history, institution, source, references
– Coordinate variables, axis, formula_terms
– units, _FillValue, missing_data, valid_range
– short_name, long_name, other provenance
– (gain, offset / scale_factor, add_offset), etc.

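The scaling and fill attributes in the list above are what let a generic reader turn stored integers back into physical values. A minimal sketch, assuming the CF-style rule unpacked = packed * scale_factor + add_offset and treating `_FillValue` cells as missing; the function name `unpack`, the attribute values, and the sample data are hypothetical, and plain Python lists stand in for real arrays.

```python
def unpack(packed, attrs):
    """Apply scale_factor/add_offset to packed values; fill cells become None."""
    scale = attrs.get("scale_factor", 1.0)   # default: no scaling
    offset = attrs.get("add_offset", 0.0)    # default: no offset
    fill = attrs.get("_FillValue")           # None means no fill value declared
    return [None if v == fill else v * scale + offset for v in packed]


# Hypothetical variable attributes and packed data.
sst_attrs = {"units": "K", "scale_factor": 0.5, "add_offset": 270.0, "_FillValue": -999}
packed = [0, 10, -999, 41]
unpacked = unpack(packed, sst_attrs)
print(unpacked)  # → [270.0, 275.0, None, 290.5]
```

Because the rule lives in the attributes rather than in the reader, the same function works for any variable that follows the convention.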
Challenges? (just a few remain…)
Evolution, bifurcation, and asymmetric support can result in occasional user confusion:
– HDF5 v1.8.x vs. v1.6.x families?
– NetCDF v4 Enhanced vs. NetCDF v4 Classic vs. v3?
– HDFEOS5 vs. HDFEOS2?
Both GUI tool and API support tend to vary by platform (Linux, Mac, Win7) and sub-flavor…
Multi-library dependency stacks beg for a fully bundled, version-matched, end-to-end install package!
The conventions community (CF v1.4.x) and metadata standards communities are also in motion (but that’s good too…)

Resources: URLs
Climate Forecast (CF) Conventions (now at 1.4.x)
HDF
HDFEOS
NetCDF
General: Describing_Formats

Resources: File format related tools
Panoply, HDFView, OpenDAP, IDV, McIDAS
Python: h5py, PyTables
Perl: PDL-IO-HDF5, and Biohdf?
Many others: HEG, MTD, HDFEOS plug-in for HDFView, HDFLook, (ncdump, h5dump, and cousins), GrADS, MATLAB, binary APIs

A provisional DOI, UUID Strategy
What we used for NASA Measures Freeze/Thaw, daily (v2), just delivered:
– DOI: assigned to our reference paper by IEEE Transactions on Geoscience and Remote Sensing
– UUID recipe (Python), with seedString = …:
  import uuid
  file_uuid = uuid.uuid5(namespace, seedString)  # uuid5 takes a namespace UUID first; the namespace used is not given here
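A name-based (version 5) UUID is deterministic: the same namespace and seed string always give the same identifier, which is what makes it usable as a reproducible granule ID. The sketch below uses only the standard library. The choice of `uuid.NAMESPACE_URL`, the helper name `make_granule_uuid`, and the seed string are assumptions for illustration, not the recipe used for the delivered product.

```python
import uuid


def make_granule_uuid(seed_string, namespace=uuid.NAMESPACE_URL):
    """Derive a reproducible version-5 UUID from a seed string."""
    # uuid5 hashes the namespace UUID together with the name (SHA-1 based).
    return uuid.uuid5(namespace, seed_string)


# Hypothetical seed; a real one might combine product, version, and date.
seed = "example.org/freeze-thaw/daily/v2/2010-10-20"
first = make_granule_uuid(seed)
second = make_granule_uuid(seed)

print(first)
print(first == second)   # same seed and namespace give the same UUID
print(first.version)     # version field of a uuid5 result
```

Changing any part of the seed string produces a different UUID, so the seed should be built only from fields that uniquely and stably identify the granule.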