Emerging Provenance/Context Content Standard Discussion at Data Stewardship Committee Session at ESIP Federation Meeting January 5, 2012 H. K. “Rama” Ramapriyan and John Moses ESDIS Project NASA Goddard Space Flight Center
Session Objectives
Provide update on status
Identify applicable representation standards
Discuss use cases we currently have
Show sample mapping of use cases to content items in PCCS matrix
Identify any new use cases needed
Discuss next steps
Status (1 of 3)
Received inputs from Ted Habermann et al. (NOAA)
Merged with NASA’s inputs based on the USGCRP Workshop (1998) Report and discussions ( ) with EOS instrument teams (GLAS, HIRDLS) and the TOMS instrument PI
  – Note: the USGCRP workshop, jointly sponsored by NASA and NOAA, identified a number of scenarios (use cases) from which content recommendations were derived
Content matrix was developed and posted on the ESIP Data Stewardship and Preservation Cluster wiki
  – Initial version: March 1, 2011
  – Latest version (incorporates comments from cluster members): June 8, 2011
  – Focused mostly on satellite remote sensing data; need to ensure we cover other types of data (aircraft, in situ)
  – Developed use cases during last ESIP meeting (July 2011)
Status (2 of 3)
NASA Earth Science Data Preservation Content Specification
  – Data System Requirements are included in all satellite mission Level 1 Requirements. The preservation requirement is stated as: The > shall transfer to the > all the information and documentation required for long-term preservation of knowledge about the products resulting from >, as defined in the NASA Earth Science Data Preservation Content Specification document published at and shall baseline to a specific initial version.
  – Document has been prepared and baselined (available at URL above) as of November 23, 2011
Summarizes content of PCCS matrix, including item descriptions and rationale
Current baselined version has been reviewed by a broad set of organizations
  – DAACs, SMAP, ICESat-II, EOS Instrument Teams, MEaSUREs, DAAC User Working Groups, ESDIS, NASA HQ
Accounts for concerns of data providers as well as users
Provides some flexibility, allowing for variations from project to project in approach to meeting the specifications
  – Requires projects to provide a checklist to show how specifications are met
Status (3 of 3)
NASA Earth Science Data Preservation Content Specification
  – An evolution path has been provided for the specification: schedules and versions are intentionally defined to balance flight project needs for early specificity with the deliberative and evolving nature of standards work by TIWG’s Data Stewardship Group, SPG, ESIP, IEEE, ISO
  – Provides opportunity for flight projects to select a version to baseline early, then review later versions (as part of the ESDIS CCB process) and adopt them if feasible
  – Revisions take into account concerns raised by existing projects
  – CCB-baselined version (version 2) has been posted (mid-Nov. 2011)
  – These revisions will be reflected in an updated matrix (ESIP version)
  – Updates scheduled for Spring 2012, 2014 and 2016
Standards that Prescribe and/or Guide Construction of Preservation Content
What are the prominent standards (and groups) that guide or prescribe content which overlaps or contributes to preservation content requirements?
Metadata for interoperability in dataset search and access (ISO TC 211)
  – ISO Collection and Granule Metadata for geographic information (MD package)
  – ISO NOAA-NASA sponsored revision for imagery and gridded data (MI package)
  – ISO Geographic information for feature entity and attribute cataloguing
  – ISO Metadata for services
  – ISO How to describe, evaluate and report data quality
  – ANSI Content and Collection Management Standards and FGDC Digital Geospatial Metadata (ISO NAP) standards feed into ISO
Open Geospatial Consortium (OGC)
  – Sensor Web Enablement (SWE) Sensor Observation Service, and WCS, WFS, WMS serving netCDF and HDF
  – SensorML (instrument specifications)
Consultative Committee for Space Data Systems (CCSDS)
  – Open Archival Information System (OAIS) Reference Model (nomenclature)
  – Satellite data requirements (i.e., Level 0 data content), XML Formatted Data Unit
Library and Publishing applications (e.g., NISO, NARA, Library of Congress)
  – Dublin Core
  – Metadata Encoding and Transmission Standard (METS)
  – PREMIS
ISO Metadata standards supporting documentation content
Columns: Category | Content Item | Path to ISO Standard

Preflight/Pre-Operations Calibration
  Instrument Description
    /gmi:MI_Metadata/gmi:acquisitionInformation/gmi:MI_AcquisitionInformation/gmi:instrument
  Pre-operational Calibration Data

Science Data Products
  Raw Data and Derived Products
  Metadata
    ISO Geographic Information (MD package), ISO Extensions for imagery and gridded data (MI package); ISO Data Quality, ISO Services

Science Data Product Documentation
  Product Team
    /gmi:MI_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:pointOfContact or gmd:credit
  Product Requirements (basis, structure & format)
    /gmi:MI_Metadata/gmi:acquisitionInformation/gmi:MI_AcquisitionInformation/gmi:acquisitionRequirement
    /gmi:MI_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:abstract; purpose; environmentDescription
    /gmi:MI_Metadata/gmd:distributionInfo/gmd:MD_Distribution/gmd:distributor/gmd:MD_Distributor/gmd:distributorFormat
    /gmi:MI_Metadata/gmd:dataQualityInfo/gmd:DQ_DataQuality/gmd:lineage/gmd:LI_Lineage/gmd:processStep/gmi:LE_ProcessStep/gmi:processingInformation/gmi:LE_Processing/gmi:algorithm
    /gmi:MI_Metadata/gmd:dataQualityInfo/gmd:DQ_DataQuality/gmd:lineage/gmd:LI_Lineage/gmd:source/gmd:LI_Source/gmd:sourceStep/gmi:LE_ProcessStep/gmd:description
  Processing and Algorithm Version History
    /gmi:MI_Metadata/gmd:dataQualityInfo/gmd:DQ_DataQuality/gmd:lineage/gmd:LI_Lineage/gmd:processStep
    /gmi:MI_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:resourceMaintenance/gmd:MD_MaintenanceInformation/gmd:maintenanceNote
  Product Generation Algorithm
    /gmi:MI_Metadata/gmd:dataQualityInfo/gmd:DQ_DataQuality/gmd:lineage/gmd:LI_Lineage/gmd:source/gmd:LI_Source/gmd:sourceStep/gmi:LE_ProcessStep/gmd:description
    /gmi:MI_Metadata/gmd:dataQualityInfo/gmd:DQ_DataQuality/gmd:lineage/gmd:LI_Lineage/gmd:processStep/gmi:LE_ProcessStep/gmi:processingInformation/gmi:LE_Processing/gmi:softwareReference and /gmi:algorithm
  Product Quality
    /gmi:MI_Metadata/gmd:dataQualityInfo/gmd:DQ_DataQuality/gmd:report/gmd:DQ_QuantitativeAttributeAccuracy/gmd:evaluationMethodDescription
    /gmi:MI_Metadata/gmd:dataQualityInfo/gmd:DQ_DataQuality/gmd:report/gmd:DQ_QuantitativeAttributeAccuracy/gmd:result/gmi:QE_CoverageResult/gmi:resultContentDescription/gmi:MI_CoverageDescription/gmd:dimension
    /gmi:MI_Metadata/gmd:dataQualityInfo/gmd:DQ_DataQuality/gmd:report/gmi:QE_Usability/gmd:result/gmd:DQ_ConformanceResult
  Product Application

Mission Data Calibration
  Calibration Method
    /gmi:MI_Metadata/gmd:dataQualityInfo/gmd:DQ_DataQuality/gmd:lineage/gmd:LI_Lineage/gmd:source/gmd:LI_Source/gmd:sourceStep/gmi:LE_ProcessStep/gmd:description
  Calibration Data

Product Software
  Software documentation
    /gmi:MI_Metadata/gmd:dataQualityInfo/gmd:DQ_DataQuality/gmd:lineage/gmd:LI_Lineage/gmd:processStep/gmi:LE_ProcessStep/gmi:processingInformation/gmi:LE_Processing/gmi:softwareReference and /gmi:algorithm
  Product software

Algorithm Inputs
  Ancillary data documentation
    /gmi:MI_Metadata/gmd:dataQualityInfo/gmd:DQ_DataQuality/gmd:lineage/gmd:LI_Lineage/gmd:source
    gmd:dimension (CoverageContentTypeCode=auxilliaryData)
  Ancillary data
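The paths listed above are XPath-style locations inside a gmi:MI_Metadata record, so a record can be checked mechanically for the content items it documents. The following is a minimal Python sketch under stated assumptions: the sample record is hypothetical and heavily trimmed, the function name present_items is invented for illustration, and the namespace URIs are the ones conventionally bound to the gmd, gmi and gco prefixes. Only three of the paths are included.

```python
import xml.etree.ElementTree as ET

# Namespace URIs conventionally bound to the gmd, gmi and gco prefixes.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gmi": "http://www.isotc211.org/2005/gmi",
    "gco": "http://www.isotc211.org/2005/gco",
}

# Hypothetical, heavily trimmed record used only for illustration.
SAMPLE = """<gmi:MI_Metadata
    xmlns:gmd="http://www.isotc211.org/2005/gmd"
    xmlns:gmi="http://www.isotc211.org/2005/gmi"
    xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:identificationInfo>
    <gmd:MD_DataIdentification>
      <gmd:abstract>
        <gco:CharacterString>Example abstract</gco:CharacterString>
      </gmd:abstract>
    </gmd:MD_DataIdentification>
  </gmd:identificationInfo>
  <gmi:acquisitionInformation>
    <gmi:MI_AcquisitionInformation>
      <gmi:instrument/>
    </gmi:MI_AcquisitionInformation>
  </gmi:acquisitionInformation>
</gmi:MI_Metadata>"""

# Content item -> path from the slide, written relative to the root element.
PATHS = {
    "Instrument Description":
        "gmi:acquisitionInformation/gmi:MI_AcquisitionInformation/gmi:instrument",
    "Product Requirements (abstract)":
        "gmd:identificationInfo/gmd:MD_DataIdentification/gmd:abstract",
    "Processing and Algorithm Version History":
        "gmd:dataQualityInfo/gmd:DQ_DataQuality/gmd:lineage/gmd:LI_Lineage/gmd:processStep",
}


def present_items(xml_text):
    """Report, per content item, whether its path exists in the record."""
    root = ET.fromstring(xml_text)
    if root.tag != "{%s}MI_Metadata" % NS["gmi"]:
        raise ValueError("expected a gmi:MI_Metadata root element")
    # find() returns None when no element matches the path.
    return {item: root.find(path, NS) is not None for item, path in PATHS.items()}


if __name__ == "__main__":
    for item, found in present_items(SAMPLE).items():
        print(("present  " if found else "missing  ") + item)
```

A check like this only shows that an element exists at the expected location; whether its content is adequate for preservation still needs review against the content item definitions.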
Use Cases - Purpose
Identify what content items are required and why
Mapping of use cases to content items helps with rationale and priority for preserving them
If no use cases are identified for some of the items
  – Look harder for use cases, or
  – Eliminate item from PCCS
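The bookkeeping described here (map use cases to content items, then flag items that no use case supports) can be sketched in a few lines of Python. The use-case-to-item assignments below are hypothetical placeholders chosen only to make the example run; the real assignments come from the PCCS matrix. The function names are likewise invented for illustration.

```python
# Hypothetical assignments for illustration only; the actual mapping of
# use cases to content items is recorded in the PCCS matrix.
USE_CASE_ITEMS = {
    "Ozone Trends and Tropospheric Aerosols": {
        "Calibration Method",
        "Raw Data and Derived Products",
    },
    "Snow Cover Analysis": {
        "Ancillary data and documentation",
        "Raw Data and Derived Products",
    },
}

CONTENT_ITEMS = [
    "Raw Data and Derived Products",
    "Calibration Method",
    "Ancillary data and documentation",
    "Product Application",
]


def items_without_use_case(content_items, use_case_items):
    """Content items not referenced by any use case (candidates for review)."""
    covered = set().union(*use_case_items.values())
    return [item for item in content_items if item not in covered]


def support_counts(content_items, use_case_items):
    """Number of use cases that reference each content item."""
    return {
        item: sum(item in items for items in use_case_items.values())
        for item in content_items
    }


if __name__ == "__main__":
    print("No use case yet:", items_without_use_case(CONTENT_ITEMS, USE_CASE_ITEMS))
    print("Support counts:", support_counts(CONTENT_ITEMS, USE_CASE_ITEMS))
```

Items returned by items_without_use_case are the ones for which the group would either look harder for a use case or consider removal; the counts give one rough input to priority.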
Use Cases – ESIP Summer Meeting (2011)
Creating a data set:
  – Creating a long term trend data set from multiple data sets
A researcher 100 years in the future is examining the historical record.
  – Giving credit to people involved in the data set
  – Describing appropriate uses of a data set
Asserting quality of data set
Intellectual property rights
Policies
Creating citations for published data set
Validation of data
Using a data set:
  – Obtaining data
Discovering data
  – Choosing a data set from multiple similar choices
Assessing data
  – Applications for data
Analysis of new version of data set
  – How do the improvements affect our use of the data?
Sharing data set for collaboration
Comparing multiple data sets
Reproducing a dataset
Verification of an experiment
Use Cases – current list
USGCRP Workshop Report (1998)
  – Ozone Trends and Tropospheric Aerosols
  – Precipitation Trend Analysis
  – Near-Surface Temperature Analysis
  – Ocean Frontal Analysis
  – Snow Cover Analysis
  – Ocean Topography
ESIP Summer Meeting (2011)
  – Long list developed (see next chart)
  – Obtaining Data
    Involves all of the ways users obtain data. The mechanisms considered range from ad hoc methods, such as consulting colleagues at a meeting, finding out about a data set and contacting the scientist who has it, to using one of the major data centers’ systems (or GCMD) to find, assess and obtain relevant data sets.
  – Choosing a data set from multiple similar choices
    A research user needs to pick, from multiple similar data sets, the one that best meets the requirements of the intended application. E.g., a polar bear ecologist choosing a data set on sea ice conditions in a region of Hudson Bay from the multiple data sets listed at NSIDC.
  – Reproducing a dataset
Preservation Content Use Cases
Content items by category:
Preflight/Pre-Operations Calibration
  – Instrument Description
  – Preflight/Pre-operational Calibration Data
Science Data Products
  – Raw Data and Derived Products
  – Metadata
Science Data Product Documentation
  – Product Team
  – Product Requirements (basis, structure & format)
  – Processing and Algorithm Version History
  – Product Generation Algorithm
  – Product Quality
  – Product Application
Mission Data Calibration
  – Calibration Method
  – Calibration Data
Science Data Product Software
  – Science data product generation software and software documentation
Science Data Product Algorithm Inputs
  – Ancillary data and documentation
Science Data Product Validation
  – Datasets and documentation
Science Data Software Tools
  – Software and documentation
Use cases:
Ozone Trends and Tropospheric Aerosols: Intercomparison with new in situ observation instruments suggests previously disregarded satellite measurements were real. Recalculated TOMS measurements indicate earlier ozone depletion over Antarctica.
Precipitation Trend Analysis: Including weather satellite data and accounting for measurement [station location and orbit] disparities to produce a reliable record.
Near-Surface Temperature Analysis: Resolving discrepancy between land-surface weather station measurements and temperatures derived from satellite microwave instruments.
Ocean Frontal Analysis: Detecting seasonal location of physical and biological mixing properties using long time series of satellite-derived sea surface temperature fields from Level 1 data and in situ measurements.
Snow Cover Analysis: Improvements in snow cover extent, seasonal changes and long term trends resulted from the capability to reprocess with a new land-ocean mask and snow cover filters.
Ocean Topography: Radar altimeters map the ocean’s surface topography to find small changes over a long period of time. Results depend on reducing Earth location errors.
Use case: Every example of successful re-analysis of archived data has depended on quality documentation (re: pg 1, GCSR for LTA) Key Items
Category Descriptions (1 of 2)
Columns: 1. Category | 2. Content Item | 3. Definition/Description

Preflight/Pre-Operations Calibration
  Instrument Description: Documentation of instrument/sensor characteristics including pre-flight or pre-operational performance measurements (e.g., spectral response, instrument geometric calibration (geo-location offsets), noise characteristics, etc.).
  Preflight/Pre-operational Calibration Data: Numeric (digital data) files of instrument/sensor characteristics including pre-flight or pre-operational performance measurements (e.g., spectral response, instrument geometric calibration (geo-location offsets), noise characteristics, etc.).

Science Data Products
  Raw Data and Derived Products: Raw data are data values at full resolution as directly measured by a spaceborne, airborne or in situ instrument. Derived products are higher level products (level 1b through 4) where calibration and geo-location transformations have been applied to generate sensor units, and/or algorithms have been applied to generate gridded geophysical parameters.
  Metadata: Information about data to facilitate discovery, search, access, understanding and usage associated with each of the data products.

Science Data Product Documentation
  Product Team: Names of key science team leads and product team members (development, help desk and operations), roles, performing organization, contact information, sponsoring agencies or organizations and comments about the products.
  Product Requirements: Requirements and designs for each science data product, either explicitly or by reference to the requirements/design documents. Product requirements and designs should include content, format, latency, accuracy and quality.
  Processing and Algorithm Version History: For all products held in the archive, documentation of processing history and production version history, indicating which versions were used when, why different versions came about, and what the improvements were from version to version. For all products held in the archive, the versions of source code used to produce the products should be available at the archive.
  Product Generation Algorithm: Detailed discussion of processing algorithms, outputs, error budgets and limitations. Processing algorithms and their theoretical (scientific and mathematical) basis, including a complete description of any sampling or mapping algorithm used in creation of the product, geo-location, radiometric calibration, geophysical parameters, algorithm software documentation, and high-level data flow diagrams. Description of how the algorithm is numerically implemented.
  Product Quality: Description of the impact on product quality of issues with computationally intensive operations (e.g., large matrix inversions, truncation and rounding). Documentation of product quality assessment (methods used, assessment summaries for each version of the datasets). Description of embedded data at the granule level including quality flags, product data uncertainty fields, data issues logs, etc. Relevant test reports, reviews, and appraisals.
  Product Application: Useful references to published articles about the use of the data and user feedback received by the science and instrument teams about the products. Includes reports of any peculiarities or notable features observed in the products.
Category Descriptions (2 of 2)
Columns: 1. Category | 2. Content Item | 3. Definition/Description

Mission Data Calibration
  Calibration Method: The methods used for instrument/sensor radiometric and geometric calibration while in operation (e.g., in orbit). The source code used in applying the calibration algorithms. Documentation of in-line changes to calibration or to instrument or platform operations or conditions that occur throughout the mission.
  Calibration Data: Instrument and platform engineering data collected during operations (e.g., on orbit), including platform and instrument environment, events and maneuvers; attitude and ephemeris; aircraft position; acquisition logs that record data gaps; calibration look-up tables; and any significant external event data that may have impacted the observations.

Science Data Product Software
  Science data product generation software and software documentation: Source code used to generate products at all levels in the science data processing system. Software release notes, including references to versions of operating systems, compilers, and commercial software libraries used in the code. Versions of science data product software should be archived for each major product release. A major product release is characterized by the appearance of peer reviewed publications where reported results are based on the product version.

Science Data Product Algorithm Inputs
  Ancillary data and documentation: Complete information on any ancillary data or other data sets used in generation or calibration of the data set or derived product, either explicitly in data descriptions or by reference to appropriate publications. Ancillary data should be stored with the products unless they are available from another permanent archive facility.

Science Data Product Validation
  Datasets and documentation: Accuracy of products, as measured by validation testing, and compared to accuracy requirements. Description of validation process, including identification of validation data sets, measurement protocols, data collection, analysis and accuracy reporting.

Science Data Software Tools
  Software and documentation: Product access (reader) tools. Software source code that would facilitate use of the calibration data, ancillary data and the data products at all levels. Includes software source code useful for creating programs that will read and display the calibration data, ancillary data and product data and metadata values. Commercial tools should be identified with appropriate references.
European Space Agency Archive Content Descriptions
Columns: NASA ES Category | ESA LTDP Dataset Composition Content Descriptions

Pre-Mission
  – Mission, Sensor/Instrument requirements
  – Sensor/Instrument characteristics
  – Measurements qualification process
  – Calibration data and methods
Data Products
  – Raw data and/or Level 0 equivalent
  – Processed data (e.g., L1, L2 or upper levels)
  – Metadata
Product Documentation
  – Data/Products structures and formats descriptions
  – Processing algorithms and scientific bases
  – Processing evolution (e.g., changes, history)
  – Product qualification process (data & methods)
  – Known errors and/or limits in processing or algorithm’s application
Mission Calibration
  – Sensor/Instrument platform and its performances
  – Instrument/Sensor calibration, tuning (data & methods)
  – Sensor/Instrument evolution
Software
  – Processing algorithms and methods (e.g., technical descriptions, software coding, platform references, etc.)
Algorithm Inputs
  – Supporting information for data processing (e.g., ancillary)
  – Ancillary, Auxiliary support elements & evolution
Validation
  – Validation campaign data and results
Back-up Charts
Content Matrix - Introduction
Using column headings discussed at January 2011 ESIP meeting (mostly)
Each row corresponds to a content item and provides details
Content items are mapped into 8 categories (see next chart)
One or more content items are defined in each of the categories
Column headings
  – Item Number (C.N: category and number within category)
  – Category
  – Content Name
  – Definition / Description
  – Rationale (why content is needed)
  – Criteria (how good content should be)
  – Priority (H, M, L or critical, essential, desirable)
  – Source (who should provide content item)
  – Project phase for capture
  – User community (who would be most likely to need the content item; this column is mostly blank in this version and needs group inputs)
  – Representation (while focus is on “what”, brief comments are included here on whether items are word files, numeric files, pointers, etc.)
  – Distribution restrictions (potential proprietary or ITAR concerns associated with content item)
  – Source identifying item (where content item came from: NASA, NOAA or both)
Categories
1. Preflight/Pre-Operations: Instrument/sensor characteristics including pre-flight/pre-operations performance measurements; calibration method; radiometric and spectral response; noise characteristics; detector offsets
2. Products (Data): Raw instrument data, Level 0 through Level 4 data products and associated metadata
3. Product Documentation: Structure and format with definitions of all parameters and metadata fields; algorithm theoretical basis; processing history and product version history; quality assessment information
4. Mission Calibration: Instrument/sensor calibration method (in operation) and data; calibration software used to generate lookup tables; instrument and platform events and maneuvers
5. Product Software: Product generation software and software documentation
6. Algorithm Input: Any ancillary data or other data sets used in generation or calibration of the data or derived product; ancillary data description and documentation
7. Validation: Record and data sets
8. Software Tools: Product access (reader) tools
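The column headings of the content matrix and the eight numbered categories suggest a simple record structure for one matrix row. The following Python sketch is one possible representation, not the format used by the ESIP matrix itself: the class name, field names and the blank_columns helper are assumptions made for illustration, and priority is restricted to the H/M/L coding for simplicity.

```python
from dataclasses import dataclass, fields

# Assumption: only the H/M/L coding is accepted here; the matrix also
# allows critical/essential/desirable.
PRIORITIES = ("H", "M", "L")


@dataclass
class ContentItem:
    """One row of the content matrix (field names are illustrative)."""
    category_number: int         # C in the "C.N" item number
    number_in_category: int      # N in the "C.N" item number
    category: str
    content_name: str
    definition: str
    rationale: str = ""
    criteria: str = ""
    priority: str = "M"
    source: str = ""
    project_phase: str = ""
    user_community: str = ""
    representation: str = ""
    distribution_restrictions: str = ""
    source_identifying_item: str = ""   # e.g. "NASA", "NOAA" or "both"

    def __post_init__(self):
        if self.priority not in PRIORITIES:
            raise ValueError("priority must be one of %s" % (PRIORITIES,))

    @property
    def item_number(self) -> str:
        """Item number in C.N form, e.g. '1.1'."""
        return f"{self.category_number}.{self.number_in_category}"


def blank_columns(item):
    """Names of text columns still empty, i.e. where group input is needed."""
    return [f.name for f in fields(item) if getattr(item, f.name) == ""]


if __name__ == "__main__":
    row = ContentItem(
        1, 1, "Preflight/Pre-Operations", "Instrument Description",
        "Documentation of instrument/sensor characteristics",
        priority="H", source_identifying_item="both",
    )
    print(row.item_number, "blank:", blank_columns(row))
```

Listing blank columns per row is a quick way to find where the matrix still needs input, for example the user community column.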