Metadata Standards for Gridded Climate Data in the Earth System Grid
Robert Drach, LLNL/PCMDI
UCRL-PRES
Drach, Sept. 10, 2002
Overview
I. Earth System Grid: Grid Access to Climate Research Data
II. Metadata Standards for Gridded Climate Data
Part I ESG: Grid Access to Climate Research Data
Earth System Grid Overview
- The goal of ESG is to make climate data, particularly climate model data, an easily accessible community resource.
- The project is funded by the SciDAC program (Scientific Discovery through Advanced Computing).
- Enabling researchers to understand and make effective use of very large, distributed climate datasets is critical.
- The broad strategy is to develop a collection of server-side capabilities that minimize the amount of data movement.
- Multiple interfaces to ESG will allow researchers to focus on science rather than on issues of data transfer, format, and dataset manipulation.
- The foundation is Globus Grid technology.
ESG uses Globus Grid technology.
- Globus middleware supports linkage of distributed data archives, supercomputers, workstations, and local disk caches into data/computational grids.
- GridFTP: a high-performance, secure, robust data transfer mechanism (protocol, server, client library).
  - ESG is integrating OpenDAP (DODS protocol) with the GridFTP protocol.
- Single sign-on using the Grid Security Infrastructure
  - Proxy certificates
  - Community Authorization Service (CAS)
- Replica Location Service: manages copying and placement of files in a distributed environment.
  - Logical vs. physical files
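The logical vs. physical file distinction can be illustrated with a minimal sketch. The catalog contents, host names, and the `resolve` helper are hypothetical and chosen only for illustration; this is not the Globus Replica Location Service API.

```python
# Hypothetical replica catalog: one logical file name maps to several
# physical copies. All names and URLs here are made up for illustration.
REPLICA_CATALOG = {
    "lfn://esg/sample/tas_1990.nc": [
        "gsiftp://host-a.example.org/data/tas_1990.nc",
        "gsiftp://host-b.example.org/archive/tas_1990.nc",
    ],
}


def resolve(logical_name, preferred_host=None):
    """Return one physical URL for a logical file name.

    If preferred_host is given and a replica lives on that host, that
    replica is returned; otherwise the first registered replica is used.
    """
    replicas = REPLICA_CATALOG.get(logical_name)
    if not replicas:
        raise KeyError(f"no replica registered for {logical_name}")
    if preferred_host is not None:
        for url in replicas:
            if preferred_host in url:
                return url
    return replicas[0]
```

A client would ask for the logical name only; which physical copy is read becomes a placement decision hidden behind the lookup.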
ESG: U.S. Collaborations & Development
- ORNL: Climate storage & computational resources
- ANL: Computational grids & grid-based applications
- USC/ISI: Computational grids & grid-based applications
- NCAR: Climate change prediction and scenarios
- LBNL: Climate storage facility and access
- LLNL: Model diagnostics & intercomparison
Program for Climate Model Diagnosis and Intercomparison
- Validation and intercomparison of atmospheric general circulation models and coupled ocean-atmosphere models
- Development of analysis software; quality control, archiving, and distribution of model results
  - Climate Data Analysis Tools (CDAT) is a Python-based analysis and visualization system.
- Global warming detection studies
- CMIP (coupled models) and AMIP (atmospheric GCMs) gather model simulation results from thirty modeling groups worldwide.
PCMDI and Model Development
[Diagram labels: Modeling groups; Controlled simulation runs; Simulation data; PCMDI (diagnosis, quality control, data archival); Feedback to modelers; Observations; Data assimilation; Gridded observation data]
ESG-II Architecture
[Diagram labels: Portals; Servers; Middleware]
ESG: Metadata Services
[Diagram; labels recovered by layer]
- ESG clients, API & user interfaces: publishing; search & discovery; administration; browsing & display; analysis & visualization
- High-level metadata services: metadata extraction; metadata display; metadata browsing; metadata query; metadata annotation; metadata validation; metadata aggregation; metadata discovery; metadata & data registration
- Core metadata services: metadata access (update, insert, delete, query); service translation library
- Metadata holdings: Data & Metadata Catalog; Dublin Core database; CF database; mirror; Dublin Core XML files; comments XML files
ESG is leveraging existing software and projects.
- OpenDAP (DODS): Distributed Oceanographic Data System (Unidata)
  - Integration of Globus GridFTP with DODS data access
- THREDDS: THematic Real-time Environmental Distributed Data Services (Unidata)
- LAS: Live Access Server (NOAA Pacific Marine Environmental Laboratory)
  - Works with CDAT, Ferret, GrADS, …
- CDAT: Climate Data Analysis Tools (PCMDI); includes CDMS (Climate Data Management System) and VCDAT visualization
- Community Data Portal project (NCAR)
- NCL (NCAR)
- Globus Grid technology (ANL, ISI): GridFTP, CAS (Community Authorization Service)
CDAT: Example of an ESG GUI Client Access
LAS/CDAT: Example of a Web-based Data Portal
- Technology: Web-based (end user requirements)
  - LAS, DODS, ESG (i.e., Globus), CDAT
- The portal should hide/simplify the Grid for users:
  - Single sign-on
  - Community-based authorization
  - Simplified resource location
  - Remote job submission and management
- Accesses the ESG Grid Testbed
Part II Metadata Standards for Gridded Climate Data
Climate Model Datasets
- Most climate simulation data are in the form of gridded datasets: collections of variables as a function of longitude, latitude, time, and vertical level.
- A dataset is a logical container:
  - A file
  - An aggregation of files
  - A collection of database tables
- Model-generated data
  - Model data
  - Derived data: zonal averages, global averages, virtual variables
- Observational data, including reanalyses
- Attributes in the form of (name, value) pairs, array values
Binary formats
- Binary formats are a suitable basis for storing data, but lack the metadata to support certain application requirements.
- netCDF (UCAR)
  - array data model
  - flexible attribute/value metadata model
  - simple API
- HDF (NCSA, NASA)
  - collection of APIs
  - can be tailored to specific data models, including scientific data sets, satellite data, and point data
Binary formats
- GRIB (WMO, ECMWF, NCEP)
  - mixed sequential/array data model
  - tailored for simulation output; supports common horizontal grid types
  - hardwired metadata model
  - good compression capabilities
  - lacks a standard API
Coordinate Systems
- Self-describing binary formats are flexible, but underconstrain the representation of coordinate systems.
[Diagram: a coordinate system relates index space, variable space, and coordinate space]
- Coordinate variables: Time(i), Latitude(j,k), Longitude(j,k)
- In coordinate space: V = Temperature(Time, Latitude, Longitude)
- In index space: V' = Temperature(i,j,k)
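The relationship between index space and coordinate space can be sketched in a few lines. The array values and the `locate` helper are hypothetical; only the shapes Time(i), Latitude(j,k), Longitude(j,k), and Temperature(i,j,k) come from the slide.

```python
# Coordinate variables, addressed with the same indices as the data.
# All numeric values are made up for illustration.
time = [0.0, 1.0]                                # Time(i)
latitude = [[10.0, 10.5], [20.0, 20.5]]          # Latitude(j,k)
longitude = [[100.0, 110.0], [101.0, 111.0]]     # Longitude(j,k)
temperature = [                                  # Temperature(i,j,k)
    [[280.0, 281.0], [282.0, 283.0]],
    [[284.0, 285.0], [286.0, 287.0]],
]


def locate(i, j, k):
    """Map an index-space point (i,j,k) to (time, lat, lon, value)."""
    return (time[i], latitude[j][k], longitude[j][k], temperature[i][j][k])
```

Nothing in the arrays themselves says which one plays the role of latitude or time; that association is the part a metadata convention has to supply.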
Horizontal Grids
- Curvilinear grid: Los Alamos POP ocean model
  - Temperature(i,j)
  - Latitude(i,j), Longitude(i,j)
  - Lat_bounds(i,j,4), Lon_bounds(i,j,4)
Horizontal Grids
- Reduced grid
  - Temperature(i,j)
  - Latitude(i), Longitude(i,j)
  - Lat_bounds(i,2), Lon_bounds(i,j,4)
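A reduced grid can be pictured as a set of latitude rows with differing numbers of longitude points. The sketch below assumes evenly spaced longitudes per row and a hypothetical row layout; the `flat_index` helper is illustrative and is not taken from any convention.

```python
# Hypothetical reduced grid: fewer longitude points on the high-latitude rows.
latitude = [-60.0, 0.0, 60.0]        # Latitude(i)
points_per_row = [4, 8, 4]           # number of longitudes in row i


def row_longitudes(n):
    """Evenly spaced longitudes for a row with n points (an assumption)."""
    return [360.0 * j / n for j in range(n)]


# Longitude(i,j): a ragged table, one list per latitude row.
longitude = [row_longitudes(n) for n in points_per_row]


def flat_index(i, j):
    """Position of grid point (i,j) if the rows are stored back to back."""
    if not 0 <= i < len(points_per_row):
        raise IndexError("row index out of range")
    if not 0 <= j < points_per_row[i]:
        raise IndexError("longitude index out of range for this row")
    return sum(points_per_row[:i]) + j
```

Because the rows have different lengths, a rectangular Temperature(i,j) array either needs padding or a flattened storage order such as the one `flat_index` computes.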
Horizontal Grids
- General grid: Colorado State geodesic grid
  - Temperature(npts)
  - Latitude(npts), Longitude(npts)
  - Lat_bounds(npts,6), Lon_bounds(npts,6)
Spatial/temporal location
- Applications must be able to recognize the spatial/temporal coordinate axes.
  - Visualization: continental overlays
  - Data: selection by axis type

    import cdms                                       # CDMS module from CDAT
    file = cdms.open('sample.nc')                     # open the dataset
    temperature = file['temperature']                 # look up the variable by name
    data = temperature(latitude=(-45.0, 45.0))        # select by latitude range
Time representation and calendars
- Climate simulations use different types of calendars:
  - 'proleptic' Gregorian
  - Julian
  - Mixed Gregorian/Julian
  - No leap years (noleap)
  - 30-day months
- Climatologies represent multi-year averages.
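To show why the calendar matters, here is a minimal sketch that converts a day offset into a date under two of the model calendars listed above. The function name, the `base_year` parameter, and the calendar labels "noleap" and "360_day" are assumptions made for this example.

```python
# Month lengths for a calendar with no leap years.
NOLEAP_MONTH_LENGTHS = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def date_from_days(days, calendar, base_year=2000):
    """Convert whole days since 1 January of base_year to (year, month, day).

    Supports only the two fixed-length model calendars in this sketch.
    """
    if calendar == "360_day":
        month_lengths = [30] * 12
    elif calendar == "noleap":
        month_lengths = NOLEAP_MONTH_LENGTHS
    else:
        raise ValueError(f"unsupported calendar: {calendar}")

    days_per_year = sum(month_lengths)
    year_offset, day_of_year = divmod(days, days_per_year)

    # Walk through the months until the remaining days fit in one.
    month = 1
    for length in month_lengths:
        if day_of_year < length:
            break
        day_of_year -= length
        month += 1
    return (base_year + year_offset, month, day_of_year + 1)
```

The same offset of 59 days lands on 1 March in the no-leap calendar and on 30 February in the 30-day-month calendar, a date that does not exist in the Gregorian calendar, so the time axis cannot be interpreted without knowing which calendar the model used.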
Metadata conventions
- Several conventions have been developed to augment the netCDF data model.
- They represent a balance between the needs of data producers and data consumers.
- COARDS convention
  - 1D coordinate axes, rectilinear horizontal grids
  - axis identification based on units
  - variables limited to four dimensions
  - ordering of dimensions fixed
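Axis identification based on units can be sketched as a simple lookup on the units string. This is a simplified illustration, not a complete implementation of the COARDS rules: the helper name is hypothetical, only a few common unit spellings are recognized, and vertical axes are left out.

```python
# Unit spellings treated as latitude/longitude in this sketch (lowercased).
LATITUDE_UNITS = {"degrees_north", "degree_north", "degrees_n", "degree_n"}
LONGITUDE_UNITS = {"degrees_east", "degree_east", "degrees_e", "degree_e"}


def axis_type_from_units(units):
    """Guess the axis type from a units string, or return None if unknown."""
    text = units.strip().lower()
    if text in LATITUDE_UNITS:
        return "latitude"
    if text in LONGITUDE_UNITS:
        return "longitude"
    # Time units take the form "<unit> since <reference date>".
    if " since " in text:
        return "time"
    return None
```

For example, `axis_type_from_units("days since 1979-01-01")` classifies the axis as time, while a data variable with units such as "K" is not recognized as a coordinate axis.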
Metadata conventions
- CF (Climate and Forecast) convention
  - based on earlier conventions, COARDS and GDT
  - multidimensional coordinates (auxiliary coordinate variables)
  - simplified axis identification
  - specific representation for several horizontal grid types:
    - rectilinear
    - curvilinear
    - reduced grids
  - variables can have an arbitrary number of dimensions
  - no constraint on ordering of dimensions
  - non-Gregorian calendars
  - standard name table
Comparability of quantities
- The ability to recognize comparable quantities is fundamental to model intercomparison.
- CF defines a schema for standard name tables.
  - An XML representation is used for the table of standard variable names and descriptions.
- The standard_name attribute is optional. There is no restriction on variable names.
- Relationship to ontology development?
[Example table; text recovered from the slide: "Program for Climate Model Diagnosis and Intercomparison"; "Pa"; "Pressure defined at the level of the mean topography within the grid box."; "air_pressure_at_sea_level"]
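How a standard name table supports comparison can be sketched with the standard library XML parser. The element and attribute names in the sample (`standard_name_table`, `entry`, `id`, `canonical_units`, `description`) are an assumed layout for this illustration, the descriptions are placeholders, and both helper functions are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed table layout with placeholder descriptions (illustration only).
SAMPLE_TABLE = """<standard_name_table>
  <entry id="air_pressure_at_sea_level">
    <canonical_units>Pa</canonical_units>
    <description>Placeholder description.</description>
  </entry>
  <entry id="air_temperature">
    <canonical_units>K</canonical_units>
    <description>Placeholder description.</description>
  </entry>
</standard_name_table>"""


def load_standard_names(xml_text):
    """Return a dict mapping each standard name to its canonical units."""
    root = ET.fromstring(xml_text)
    return {
        entry.get("id"): entry.findtext("canonical_units")
        for entry in root.findall("entry")
    }


def comparable(attrs_a, attrs_b, table):
    """True if both variables carry the same standard name from the table.

    attrs_a and attrs_b are dicts of variable attributes; variables without
    a standard_name attribute are never considered comparable.
    """
    name = attrs_a.get("standard_name")
    return name is not None and name in table and name == attrs_b.get("standard_name")
```

Two models may call the same field "psl" and "slp"; with this approach the variable names are ignored and the match is made on the shared standard_name attribute.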
ESG metadata
- ESG has adopted the netCDF data model and the CF convention as standards.
- Other standards and conventions will follow.
- NcML markup language
Aggregation
- CF and NcML apply to data aggregates as well as files.
- Data aggregation: collections of files/datasets are treated as single entities.
  - array model
  - netCDF-like
  - tailored for extraction of 'hyperslabs' of data
- Aspects of aggregation:
  - combining/merging variables
  - joining variables
  - creating new coordinate axes
  - overlaying/adding metadata
  - nesting datasets
Aggregation
- Aggregation maps well to multifile datasets: a multifile dataset can be thought of as 'partitioned' into files.
  - Variables may 'span' multiple files.
  - Usually a dataset is partitioned on the time and/or vertical level axes.
- PCMDI CDAT supports aggregations via the cdscan utility, which uses an XML representation.
- THREDDS/DODS aggregation server ( s/THREDDS/)
[Diagram: a dataset partitioned along the Time, Level, and Variable axes]
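A dataset partitioned on the time axis can be sketched as a lookup from a global time index to a file and a local index within that file. The file names, step counts, and helper functions are hypothetical, and this is not how cdscan or the THREDDS/DODS aggregation server are implemented.

```python
from bisect import bisect_right

# Hypothetical partition: (file name, number of time steps in that file).
FILES = [("tas_1990.nc", 12), ("tas_1991.nc", 12), ("tas_1992.nc", 6)]


def build_offsets(files):
    """Return the starting global time index of each file and the total length."""
    offsets = []
    total = 0
    for _name, n_steps in files:
        offsets.append(total)
        total += n_steps
    return offsets, total


def locate_time_index(global_index, files=FILES):
    """Map a global time index to (file name, index within that file)."""
    offsets, total = build_offsets(files)
    if not 0 <= global_index < total:
        raise IndexError("time index outside the aggregated dataset")
    # The owning file is the last one whose start offset is <= global_index.
    file_number = bisect_right(offsets, global_index) - 1
    return files[file_number][0], global_index - offsets[file_number]
```

With such a mapping, a request for a time range of the aggregated variable can be split into per-file reads while the user sees a single variable that spans all files.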
Summary
- The Earth System Grid project is developing metadata services to support a variety of schemas and conventions.
- The initial focus of ESG is to enable climate researchers to make effective use of distributed, model-generated datasets.
- The netCDF schema and CF convention are the foundation for representation of this data.