1 Writing NetCDF Files: Formats, Models, Conventions, and Best Practices Russ Rew, UCAR Unidata June 28, 2007.

Slides:



Advertisements
Similar presentations
Recent Work in Progress
Advertisements

Preparing CMOR for CMIP6 and other WCRP Projects
ESCI/CMIP5 Tools - Jeudi 2 octobre CMIP5 Tools Earth System Grid-NetCDF4- CMOR2.0-Gridspec-Hyrax …
The Future of NetCDF Russ Rew UCAR Unidata Program Center Acknowledgments: John Caron, Ed Hartnett, NASA’s Earth Science Technology Office, National Science.
Making earth science data more accessible: experience with chunking and compression Russ Rew January rd Annual AMS Meeting Austin, Texas.
NetCDF An Effective Way to Store and Retrieve Scientific Datasets Jianwei Li 02/11/2002.
NetCDF 3.6: What’s New Russ Rew Unidata Program Center University Corporation for Atmospheric Research
NetCDF Ed Hartnett Unidata/UCAR
Introduction to NetCDF Russ Rew, UCAR Unidata ICTP Advanced School on High Performance and Grid Computing 13 April 2011.
Status of netCDF-3, netCDF-4, and CF Conventions Russ Rew Community Standards for Unstructured Grids Workshop, Boulder
Developing a NetCDF-4 Interface to HDF5 Data
Introduction to NetCDF4 MuQun Yang The HDF Group 11/6/2007HDF and HDF-EOS Workshop XI, Landover, MD.
Developing a NetCDF-4 Interface to HDF5 Data Russ Rew (PI), UCAR Unidata Mike Folk (Co-PI), NCSA/UIUC Ed Hartnett, UCAR Unidata Quincey Kozial, NCSA/UIUC.
1 Russ Rew, Ed Hartnett, John Caron UCAR Unidata Program Center Mike Folk, Robert McGrath, Quincey Kozial NCSA and The HDF Group, Inc. Final Project Review,
1 High level view of HDF5 Data structures and library HDF Summit Boeing Seattle September 19, 2006.
Unidata’s TDS Workshop TDS Overview – Part II October 2012.
HDF5 A new file format & software for high performance scientific data management.
DM_PPT_NP_v01 SESIP_0715_AJ HDF Product Designer Aleksandar Jelenak, H. Joe Lee, Ted Habermann Gerd Heber, John Readey, Joel Plutchak The HDF Group HDF.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
Unidata’s Common Data Model John Caron Unidata/UCAR Nov 2006.
Mid-Course Review: NetCDF in the Current Proposal Period Russ Rew
The HDF Group HDF5 Datasets and I/O Dataset storage and its effect on performance May 30-31, 2012HDF5 Workshop at PSI 1.
N P O E S S I N T E G R A T E D P R O G R A M O F F I C E NPP/ NPOESS Product Data Format Richard E. Ullman NOAA/NESDIS/IPO NASA/GSFC/NPP Algorithm Division.
Accomplishments and Remaining Challenges: THREDDS Data Server and Common Data Model Ethan Davis Unidata Policy Committee Meeting May 2011.
The netCDF-4 data model and format Russ Rew, UCAR Unidata NetCDF Workshop 25 October 2012.
October 15, 2008HDF and HDF-EOS Workshop XII1 What will be new in HDF5?
Integrating netCDF and OPeNDAP (The DrNO Project) Dr. Dennis Heimbigner Unidata Go-ESSP Workshop Seattle, WA, Sept
DAP4 James Gallagher & Ethan Davis OPeNDAP and Unidata.
Advanced Utilities Extending ncgen to support the netCDF-4 Data Model Dr. Dennis Heimbigner Unidata netCDF Workshop August 3-4, 2009.
1 HDF5 Life cycle of data Boeing September 19, 2006.
NetCDF Data Model Issues Russ Rew, UCAR Unidata NetCDF 2010 Workshop
NetCDF file generated from ASDC CERES SSF Subsetter ATMOSPHERIC SCIENCE DATA CENTER Conversion of Archived HDF Satellite Level 2 Swath Data Products to.
The CF Conventions: Options for Sustained Support Involving Unidata Russ Rew Unidata Policy Committee May 12, 2008.
NPOESS Enhanced Description Tool - “ned” Richard E. Ullman NASA/GSFC/NPP NOAA/NESDIS/IPO Data / Information Architecture Algorithm / System Engineering.
CF Metadata Conventions: Governance, Support, and Future Karl E. Taylor Program for Climate Model Diagnosis and Intercomparison Lawrence Livermore National.
The HDF Group Data Interoperability The HDF Group Staff Sep , 2010HDF/HDF-EOS Workshop XIV1.
The HDF Group Introduction to netCDF-4 Elena Pourmal The HDF Group 110/17/2015.
NetCDF-4: Software Implementing an Enhanced Data Model for the Geosciences Russ Rew, Ed Hartnett, and John Caron UCAR Unidata Program, Boulder
NetCDF and Scientific Data Durability Russ Rew, UCAR Unidata ESIP Federation Summer Meeting
00/XXXX 1 Data Processing in PRISM Introduction. COCO (CDMS Overloaded for CF Objects) What is it. Why is COCO written in Python. Implementation Data Operations.
Data File Formats: netCDF by Tom Whittaker University of Wisconsin-Madison SSEC/CIMSS 2009 MUG Meeting June, 2009.
Advances in the NetCDF Data Model, Format, and Software Russ Rew Coauthors: John Caron, Ed Hartnett, Dennis Heimbigner UCAR Unidata December 2010.
Grids and Beyond: netCDF-CF and ISO/OGC Features and Coverages Ethan Davis, John Caron, Ben Domenico UCAR/Unidata AMS IIPS, 23 January 2008.
UC 2006 Tech Session 1 NetCDF in ArcGIS 9.2. UC 2006 Tech Session2 Overview Introduction to Multidimensional DataIntroduction to Multidimensional Data.
Unidata Technologies Relevant to GO-ESSP: An Update Russ Rew
CF 2.0 Coming Soon? (Climate and Forecast Conventions for netCDF) Ethan Davis ESO Developing Standards - ESIP Summer Mtg 14 July 2015.
Developing Conventions for netCDF-4 Russ Rew, UCAR Unidata June 11, 2007 GO-ESSP.
Development of a CF Conventions API Russ Rew GO-ESSP Workshop, LLNL
NetCDF: Data Model, Programming Interfaces, Conventions and Format Adapted from Presentations by Russ Rew Unidata Program Center University Corporation.
Update on Unidata Technologies for Data Access Russ Rew
The HDF Group Introduction to HDF5 Session 7 Datatypes 1 Copyright © 2010 The HDF Group. All Rights Reserved.
NcBrowse: A Graphical netCDF File Browser Donald Denbo NOAA-PMEL/UW-JISAO
Utilities for netCDF-4 Dr. Dennis Heimbigner Unidata Advanced netCDF Workshop July 25, 2011.
The CF Conventions: Governance and Community Issues in Establishing Standards for Representing Climate, Forecast, and Observational Data Russ Rew 1, Bob.
Unidata Infrastructure for Data Services Russ Rew GO-ESSP Workshop, LLNL
Libcf – A CF Convention Library for NetCDF Ed Hartnett Unidata Program Center Boulder Colorado June 11, 2007.
NetCDF Data Model Details Russ Rew, UCAR Unidata NetCDF 2009 Workshop
Copyright © 2010 The HDF Group. All Rights Reserved1 Data Storage and I/O in HDF5.
Other Projects Relevant (and Not So Relevant) to the SODA Ideal: NetCDF, HDF, OLE/COM/DCOM, OpenDoc, Zope Sheila Denn INLS April 16, 2001.
Adding CF Attributes to an HDF5 File
Moving from HDF4 to HDF5/netCDF-4
SRNWP Interoperability Workshop
Plans for an Enhanced NetCDF-4 Interface to HDF5 Data
What is FITS? FITS = Flexible Image Transport System
Unidata & NetCDF BoF Scientific File Formats
Unidata Advanced netCDF Workshop
Status for Endeavor 6: Improved Scientific Data Access Infrastructure
Libcf – A CF Convention Library for NetCDF
GSICS Baseline Review: Product meta-data and structures
NCL variable based on a netCDF variable model
Presentation transcript:

1 Writing NetCDF Files: Formats, Models, Conventions, and Best Practices Russ Rew, UCAR Unidata June 28, 2007

2 Overview Formats, conventions, and models NetCDF-3 limitations NetCDF-4 features: examples and potential uses Compatibility issues Conventions issues Recommendations

3 Data Abstraction Levels: Formats, Conventions, and Models Data Conventions Unidata Obs Unidata Obs netCDF User Guide netCDF User Guide CF-1.0 ARGO Data Models netCDF classic netCDF/CF CDM (netCDF-4) HDF5 Data Formats HDF-EOS netCDF classic HDF5 netCDF-4 BUFR GRIB1 GRIB2 CDL

4 NetCDF Formats bit offset variant 1988 “classic” format CDL (text-based) 2002 NcML (XML-based) 2007 netCDF-4 (HDF5-based)

5 Commitment to Backward Compatibility sacrosanct Because preserving access to archived data for future generations is sacrosanct : l Data access: New netCDF software will provide read and write access to all earlier forms of netCDF data. l APIs and programs: Existing C, Fortran, and Java netCDF programs will be supported by new netCDF software (possibly after recompiling). l Commitment: Future versions of netCDF software will continue to support data access, API, and conventions compatibility.

6 Purpose of Data Conventions To capture meaning in data To make files self-describing To faithfully represent intent of data provider To foster interoperability To add value to formats –Raise level of abstraction (e.g. adding coordinate systems) –Customize format for discipline or community (e.g. climate modeling) Unidata Obs Unidata Obs netCDF User Guide netCDF User Guide CF-1.0 ARGO … …

7 NetCDF conventions Users Guide conventions: –Simple coordinate variables (same name for dimension and variable) –Common attributes: units, long_name, valid_range, scale_factor, add_offset, _FillValue, history, Conventions, … –Not just for earth-science data Followed by lots of community conventions: COARDS, GDT, NCAR-RAF, ARGO, AMBER, PMEL-EPIC, NODC, …, CF Unidata Obs Conventions for netCDF-3 (supported by Java interface) Climate and Forecast conventions (CF) endorsed by Unidata (2005) Unidata committed to development of libcf (2006)

8 CF Conventions (cfconventions.org) Clear, comprehensive, consistent (thanks to Eaton, Gregory, Drach, Taylor, Hankin) standard_name attribute for identifying quantities, comparison of variables from different sources Coordinate systems support Grid cell bounds and measures Acceptance by community: IPCC AR4 archive, … Governance and stewardship: GO-ESSP, BADC, PCMDI, WCRP/WGCM (pending)

9 CF Conventions Issues cf-metadata mailing list cfconventions.org site: documents, forums, wiki, Trac system GO-ESSP annual meetings Recent CF issues and proposed CF extensions –Structured grids, staggered grids, subgrids, curvilinear coordinates (Balaji) –Unstructured grids (Gross) –Forecast time axis (Gregory, Caron) –Means and subgrid variation and anomaly modifier for standard names –Additions needed for observational data –NetCDF-4 issues –Needs for IPCC AR5 model output archives

10 Scientific Data Models Tabular data –Relational model –Tuples, types, queries, operations, normalization, integrity constraints Geographic data –GIS models –Features and coverages, observations and measurements –Adds spatial location to relational model Multidimensional array data –Basis of netCDF, HDF models –Dimensions, variables, attributes Scientific data types –Coordinate systems, groups, types: structures, varlens, enums –N-dimensional grids, in situ point observations, profiles, time series, trajectories, swaths, … netCDF classic CDM (netCDF-4) Relational HDF5GIS

11 NetCDF Data Models “Classic” netCDF model (netCDF-3 and earlier) –Dimensions, Variables, and Attributes –Character arrays and a few numeric types –Simple, flat CDM (netCDF-4 and later) –Dimensions, Variables, Attributes, Groups, Types –Additional primitive types including strings –User-defined types support structures, variable- length values, enumerations –Power of recursive structures: hierarchical groups, nested types

12 Classic NetCDF Data Model Variables and attributes have one of six primitive data types. DataType char byte short int float double Dimension name: String length: int isUnlimited( ) Attribute name: String type: DataType values: 1D array Variable name: String shape: Dimension[ ] type: DataType array: read( ), … File location: Filename create( ), open( ), … A file has named variables, dimensions, and attributes. A variable may also have attributes. Variables may share dimensions, indicating a common grid. One dimension may be of unlimited length.

13 Some Limitations of Classic NetCDF Data Model and Format Little support for data structures, just multidimensional arrays and lists No nested structures or “ragged arrays” Only one shared unlimited dimension for appending new data efficiently Flat name space for dimensions and variables Character arrays rather than strings Small set of numeric types Constraints on sizes of large variables No compression, just packing Schema additions may be very inefficient Big-endian bias may hamper performance on little- endian platforms

14 A file has a top-level unnamed group. Each group may contain one or more named subgroups, user-defined types, variables, dimensions, and attributes. Variables also have attributes. Variables may share dimensions, indicating a common grid. One or more dimensions may be of unlimited length. Dimension name: String length: int isUnlimited( ) Attribute name: String type: DataType values: 1D array Variable name: String shape: Dimension[ ] type: DataType array: read( ), … Group name: String File location: Filename create( ), open( ), … Variables and attributes have one of twelve primitive data types or one of four user-defined types. DataType PrimitiveType char byte short int int64 float double unsigned byte unsigned short unsigned int unsigned int64 string UserDefinedType typename: String Compound VariableLength Enum Opaque NetCDF-4 Data Model

15 NetCDF-4 Format and Data Model Benefits New data model provides: Groups for nested scopes User-defined enumeration types User-defined compound types User-defined variable- length types Multiple unlimited dimensions String type Additional numeric types HDF5-based format provides: Per-variable compression Per-variable multidimensional tiling (chunking) Ample variable sizes Reader-makes-right conversion Efficient dynamic schema additions Parallel I/O

16 Chunking chunked index order Allows efficient access of multidimensional data along multiple axes Compression applies separately to each chunk Can improve I/O performance for very large arrays and for compressed variables Default chunking parameters are based on a size of one in each unlimited dimension

17 NetCDF-4 Data Model Features Examples in “CDL-4” –Groups –Compound types –Enumerations –Variable-length types Not necessarily best practices Other potential known uses Advice on known limitations Potential conventions issues

18 Example Use of Groups Organize data by named property, e.g. region: group Europe { group France { dimensions: time = unlimited, stations = 47; variables: float temperature(time, stations); } group England{ dimensions: time = unlimited, stations = 61; variables: float temperature(time, stations); } group Germany { dimensions: time = unlimited, stations = 53; variables: float temperature(time, stations); } … dimensions: time = unlimited; variables: float average_temperature(time); }

19 Potential Uses for Groups Factoring out common information –Containers for data within regions, ensembles –Model metadata Organizing a large number of variables Providing name spaces for multiple uses of same names for dimensions, variables, attributes Modeling large hierarchies

20 types: compound wind_vector_t { float eastward ; float northward ; } dimensions: lat = 18 ; lon = 36 ; pres = 15 ; time = 4 ; variables: wind_vector_t gwind(time, pres, lat, lon) ; wind:long_name = "geostrophic wind vector" ; wind:standard_name = "geostrophic_wind_vector" ; data: gwind = {1, -2.5}, {-1, 2}, {20, 10}, {1.5, 1.5},...; Example Use of Compound Type Vector quantity, such as wind:

21 types: compound ob_t { int station_id ; double time ; float temperature ; float pressure ; } dimensions: nstations = unlimited ; variables: ob_t obs(nstations) ; data: obs = {42, 0.0, 20.5, 950.0}, … ; Another Compound Type Example Point observations :

22 Potential Uses for Compound Types Representing vector quantities like wind Modeling relational database tuples Representing objects with components Bundling multiple in situ observations together (profiles, soundings) Providing containers for related values of other user- defined types (strings, enums, …) Representing C structures portably CF Conventions issues: –should type definitions or names be in conventions? –should member names be part of convention? –should quantities associated with groups of compound standard names be represented by compound types?

23 Drawbacks with Compound Types Member fields have type and name, but are not netCDF variables Can’t directly assign attributes to compound type members –New proposed convention solves this problem, but requires new user-defined type for each attribute Compound type not as useful for Fortran developers, member values must be accessed individually

24 types: compound wind_vector_t { float eastward ; float northward ; } compound wv_units_t { string eastward ; string northward ; } dimensions: station = 5; variables: wind_vector_t wind(station) ; wv_units_t wind:units = {"m/s", "m/s"} ; wind_vector_t wind:_FillValue = {-9999, -9999} ; data: wind = {1, -2.5}, {-1, 2}, {20, 10},... ; Example Convention for Member Attributes

25 Example Use of Enumerations Named flag values for improving self-description: types: byte enum cloud_t { Clear = 0, Cumulonimbus = 1, Stratus = 2, Stratocumulus = 3, Cumulus = 4, Altostratus = 5, Nimbostratus = 6, Altocumulus = 7, Missing = 127 }; dimensions: time = unlimited; variables: cloud_t primary_cloud(time); cloud_t primary_cloud:_FillValue = Missing; data: primary_cloud = Clear, Stratus, Cumulus, Missing, …;

26 Potential Uses for Enumerations Alternative for using strings with flag_values and flag_meanings attributes for quantities such as soil_type, cloud_type, … Improving self-description while keeping data compact CF Conventions issues: –standardize on enum type definitions and enumeration symbols? –include enum symbol in standard name table? –standardize way to store descriptive string for each enumeration symbol?

27 Example Use of Variable-Length Types In situ observations: types: compound obs_t { // type for a single observation float pressure ; float temperature ; float salinity ; } obs_t some_obs_t(*) ; // type for some observations compound profile_t { // type for a single profile float latitude ; float longitude ; int time ; some_obs_t obs ; } profile_t some_profiles_t(*) ; // type for some profiles compound track_t { // type for a single track string id ; string description ; some_profiles_t profiles; } dimensions: tracks = 42; variables: track_t cruise(tracks); // this cruise has 42 tracks

28 Potential Uses for Variable-Length Type Ragged arrays In situ observational data (profiles, soundings, time series)

29 Notes on netCDF-4 Variable-Length Types Variable length value must be accessed all at once (e.g. whole row of a ragged array) Any base type may be used (including compound types and other variable-length types) No associated shared dimension, unlike multiple unlimited dimensions Due to atomic access, using large base types may not be practical

30 Recommendations and Best Practices …

31 NetCDF Data Models and File Formats 1.Use netCDF-3: classic data model and classic format 2.Use richer netCDF-4 data model and netCDF-4 format and a third less obvious choice: 3.Use classic data model with the netCDF-4 format Data providers writing new netCDF data have two obvious alternatives:

32 Third Choice: “Classic model” netCDF-4 Psuedo format supported by netCDF-4 library with file creation flag Ensures data can be read by netCDF-3 software (relinked to netCDF-4 library) Compatible with current conventions Writers get performance benefits of new format Readers can –access compressed or chunked variables transparently –get performance benefits of reader-makes-right –use HDF5 tools on files

33 NetCDF-4 Format and Data Model Benefits New data model provides: Groups for nested scopes User-defined enumeration types User-defined compound types User-defined variable- length types Multiple unlimited dimensions String type Additional numeric types HDF5-based format provides: Per-variable compression Per-variable multidimensional tiling (chunking) Ample variable sizes Reader-makes-right conversion Efficient dynamic schema additions Parallel I/O

34 Why Not Make Use of NetCDF-4 Data Model Now? C-based netCDF-4 software still only in beta release (depending on HDF5 1.8 release) Few netCDF utilities or applications adapted to full netCDF-4 model yet Development of useful conventions will take experience, time Significant performance improvements available now, without netCDF-4 data model –using classic model with netCDF-4 format

35 When to Use NetCDF-4 Data Model On “greenfield projects” (lacking legacy issues or constraints of prior work) If non-classic primitive types needed –64-bit integers for statistical applications –unsigned bytes, shorts, or ints for wider range –real strings instead of fixed-length char arrays If making data self-descriptive requires new user- defined types –compound –variable-length –enumerations –nested combinations of types If multiple unlimited dimensions needed If groups needed for organizing data in hierarchical name scopes

36 Recommendations for Data Providers Continue using classic data model and format, if suitable Evaluate practicality and benefits of classic model with netCDF-4 format Test and explore uses of extended netCDF-4 data model features Help evolve netCDF-4 conventions and Best Practices based on experience with what works

37 Best Practices: Where to Go From Here We’re updating current netCDF-3 Best Practices document before Workshop in July New “Developing Conventions for NetCDF-4” document is under development Benchmarks may help with guidance on compression, chunking parameters, use of compound types We depend on community experience for distillation into new Best Practices

38 Adoption of NetCDF-4: A Three-Stage Chicken and Egg Problem Data providers –Won’t be first to use features not supported by applications or standardized by conventions Application developers –Won’t expend effort needed to support features not used by data providers and not standardized as published conventions Convention creators –Likely to wait until data providers identify needs for new conventions –Must consider issues application developers will confront to support new conventions

39 Thanks! Questions?

40 netCDF-4 C library Operating system netCDF-3 C library netCDF-3 F77 library JVM netCDF Java library netCDF-3 C++ library netCDF-3 Perl, Python, Ruby, … libraries F77 apps for netCDF-3 F90 library F90 apps for netCDF-3 C apps for netCDF-3 C++ apps for netCDF-3 Perl, Python, Ruby, … apps for netCDF-3 HDF5 C library MPI I/Ozlib, … Java Apps for netCDF netCDF-4 F90 library C apps for netCDF-4 F90 apps for netCDF-4 C apps for HDF5 Java library Java apps for HDF5 F90 library F90 apps for HDF5 Posix I/O