1
Michelle Gierach, PO.DAAC Project Scientist
Eric Tauer, PO.DAAC Project System Engineer
2
We saw a theme spanning several 2011 UWG recommendations (6, 14, 19, 20, 23). That theme speaks to a fundamental need: approach and handle datasets with consistency, and accept or deploy them only when it makes sense to do so. This is worth solving. We want to provide the right datasets, and we want users to be able to easily connect with the right datasets. We enthusiastically agree with the UWG recommendations. Therefore, our intent is to capture the lifecycle policy (including how we accept, handle, and characterize datasets) to ensure:
- Consistency in our approach,
- Soundness in our decisions, and
- Availability of descriptive measures to our users.
3
In the next two discussions, we will address those five UWG recommendations (6, 14, 19, 20, 23) via the following talks:
1. The proposed end-to-end lifecycle phases (enabling consistency) and assessment criteria (describing our data to our users) (Eric)
2. The results of the gap analysis and the corresponding dataset assessment (Michelle)
4
Recommendation 6. Carry out the dataset gap analyses and create a reporting structure that categorizes what is available, what could be created, the potential costs involved, estimates of user needs, and other data management factors. This compilation should enable prioritization of efforts that will fill the most significant data voids.

Recommendation 14. There needs to be a clear path for all datasets generated outside of PO.DAAC to be accepted and hosted by the PO.DAAC. The PSTs have a role in determining whether a dataset is valuable and of good quality. The processes and procedures should be published and readily available to potential dataset developers. All datasets should go through the same data acceptance path. A metric based exclusively on the number of peer-reviewed papers using the dataset is NOT recommended.

Recommendation 19. The UWG has previously recommended that the PO.DAAC work on providing climatologies, anomalies, indices, and various dataset statistics for selected datasets. This does not include developing CDRs as part of the core PO.DAAC mission. This recommendation is repeated because it could be partially complementary to the IPCC/CMIP5 efforts; e.g., these climatologists prefer to work with global monthly mean data fields. Contributions of CDR datasets to PO.DAAC from outside research should be considered.

Recommendation 20. Better up-front planning is required if NASA research program results are to be directed toward PO.DAAC. Datasets must meet format and metadata standards, and contribute to the body of core data types. The Dataset Lifecycle Management plans are a framework for these decisions. Software must be designed to integrate with and beneficially augment the PO.DAAC systems. PO.DAAC should not accept orphan datasets or software projects.

Recommendation 23. Guiding users to data: explain and use Dataset Lifecycle Management vocabulary with appropriate linkages; clarify what "Sort by 'Popularity...'" means.
6
The specification and documentation of the Dataset Lifecycle Policy stems from UWG Recommendations 14, 20, and 23:
- "There needs to be a clear path for all datasets generated outside of PO.DAAC to be accepted and hosted by the PO.DAAC"
- "All datasets should go through the same data acceptance path"
- "Better up front planning is required if NASA research program results are to be directed toward PO.DAAC"
- "Dataset Lifecycle Management plans are a framework for these decisions"
- "Explain and use Dataset Lifecycle Management vocabulary with appropriate linkages"
7
Consistency in our approach. Match users to data.
Major goal: better describe our data to better map it to our users.
8
First: Define Lifecycle Phases to control consistency.
9
Dataset lifecycle work is underway both internal and external to PO.DAAC.
Internal:
- Significant research and work performed by Chris Finch (UWG 2010 presentation)
- Work within PO.DAAC to streamline the process; mature teams with a very solid understanding of their roles
- An existing exit-criteria checklist for product release
External:
- A good deal of reference material available via industry efforts and progress
- Models that can be leveraged from implementations at other DAACs, Big Data efforts, and DataONE
Question: Any specific recommendations regarding lifecycle models appropriate to PO.DAAC?
10
Phases* and descriptions:
1. Identify a Dataset of Interest: Controls the identification of a dataset and its submission as a candidate, performing some order of cost/benefit analysis.
2. Green-Light the Dataset: Review the information on a candidate dataset, indicating a go/no-go for inclusion at PO.DAAC.
3. Tailor the Dataset Policy: Set expectations with respect to the policy; identify areas for waiver or non-applicability, if any.
4. Ingest the Dataset: Determine and verify the route to obtain data under this dataset; collect dataset-related metadata.
5. Archive the Dataset: Determine and verify how to process the data; identify reformatting needs, metadata extraction, and the initial validation strategy.
6. Register/Catalog the Dataset: Perform the preparatory steps required to ultimately enroll this dataset into the catalogs, FTP site(s), etc.
7. Distribute the Dataset (was "Integrate"): Identify and complete the work to tie the dataset into search engines, visualization tools, and services.
8. Verify the Dataset: Identify the test approach, plans, and procedures for verifying any and all PO.DAAC manipulation of this dataset's granules. Define and document the level of verification to be performed (rolling up all validation from prior steps).
9. Roll Out the Dataset: Finalize the dataset information page; review the dataset for readiness. Deploy to operations and notify the community of availability.
10. Maintain the Dataset: Controls actions that may be needed to maintain the data in-house over the longer term, including reprocessing, superseding, and versioning.
*Additionally, we include "Retire the Dataset", but these are the primary operational phases.
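As a rough illustration of how the ten operational phases could enforce a single, gated path for every dataset, here is a minimal sketch. This is an assumed encoding, not an existing PO.DAAC tool; the phase names come from the table above, while the enum and the exit-criteria gate are ours.

```python
# Hypothetical sketch: the lifecycle phases as an ordered enum, with a gate that
# only advances a dataset when the current phase's exit criteria are satisfied.
from enum import IntEnum

class LifecyclePhase(IntEnum):
    IDENTIFY = 1        # Identify a Dataset of Interest
    GREEN_LIGHT = 2     # Green-Light the Dataset
    TAILOR_POLICY = 3   # Tailor the Dataset Policy
    INGEST = 4          # Ingest the Dataset
    ARCHIVE = 5         # Archive the Dataset
    REGISTER = 6        # Register/Catalog the Dataset
    DISTRIBUTE = 7      # Distribute the Dataset
    VERIFY = 8          # Verify the Dataset
    ROLLOUT = 9         # Roll Out the Dataset
    MAINTAIN = 10       # Maintain the Dataset

def advance(phase: LifecyclePhase, exit_criteria_met: bool) -> LifecyclePhase:
    """Move to the next phase only when the current phase's exit criteria are met."""
    if not exit_criteria_met:
        return phase                      # checklist still open; stay in this phase
    if phase is LifecyclePhase.MAINTAIN:
        return phase                      # Maintain is the last operational phase
    return LifecyclePhase(phase + 1)
```

The point of the sketch is simply that every dataset walks the same ordered sequence, and cannot skip ahead of an unmet exit-criteria checklist.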
11
[Diagram: The Lifecycle Policy, serving ESDIS goals, user goals, and PO.DAAC dataset goals, controls how we do business (procedures, a consistent approach) and how we describe our data (maturity).]
12
Second: Define measurements related to Maturity Index.
13
We want to quantitatively evaluate our datasets. We don't want to claim datasets are "good" or "bad." NASA and NOAA call their evaluation "maturity."
Question (rhetorical, at this point): What does "maturity" mean to you? Do you prefer it to "Assessment and Characterization"?
14
Over the lifecycle, various data points are collected:
- Decisional (e.g., uniqueness: rare or hard-to-find data)
- Descriptive (e.g., spatial resolution)
Those data points might control decisions or flow (exit criteria) and/or might be used to describe the "maturity" to the user, as in the sketch below. We think "maturity" means a quantified characterization of dataset features; a higher number means more "mature."
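A small sketch of that distinction, purely as an illustration: the field names and example values here are hypothetical, but they show how a collected data point could be flagged as either decisional (gating phase exit) or descriptive (feeding the maturity characterization).

```python
# Hypothetical sketch: a collected data point tagged as decisional or descriptive.
from dataclasses import dataclass

@dataclass
class DataPoint:
    name: str          # e.g. "Uniqueness" or "Spatial Resolution"
    value: object      # whatever was measured or recorded
    decisional: bool   # True: used as exit criteria; False: purely descriptive

collected = [
    DataPoint("Uniqueness", "rare / hard-to-find data", decisional=True),
    DataPoint("Spatial Resolution", "example value", decisional=False),
]

# Decisional points gate phase exit; descriptive points feed the maturity report.
exit_criteria_inputs = [p for p in collected if p.decisional]
maturity_inputs = [p for p in collected if not p.decisional]
```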
15
[Diagram: the lifecycle phases (Identify a Dataset of Interest, Green-Light the Dataset, Tailor the Dataset Policy, Ingest the Dataset, Archive the Dataset, Register/Catalog the Dataset, Distribute the Dataset, Verify the Dataset, Roll Out the Dataset, Maintain the Dataset), with knowledge of maturity increasing through constant collection across the phases.]
17
The creation of a PO.DAAC Dataset Maturity Model stems from UWG Recommendations 6, 14, 20, and 23:
- [Identify the] "potential costs involved, estimates of user needs, and other data management factors"
- "The PSTs have a role in determining whether a dataset is valuable and of good quality. The processes and procedures should be published and readily available to potential dataset developers"
- "A metric exclusively based on the number of peer-reviewed papers using the dataset is NOT recommended."
- "Datasets must meet format and metadata standards"
- "PO.DAAC should not accept orphan datasets"
- "Clarify what Sort by 'Popularity...' means"
18
We adhere to the lifecycle for consistency, but a key outcome of the lifecycle must be maturity measures.
Maturity:
- Community Assessment: 3
- Technical Quality: 4
- Processing: 3
- Provenance: 3
- Documentation: 5
- Adherence to Process Guidelines: 5
- Toolkits: 5
- Relationships: 4
- Specification: 4
- Overarching Maturity Index: 4
19
Beta products are intended to enable users to gain familiarity with the parameters and the data formats.
Provisional products were defined to facilitate data exploration and process studies that do not require rigorous validation. These data are partially validated and improvements are continuing; quality may not be optimal since validation and quality assurance are ongoing.
Validated products are high-quality data that have been fully validated and quality checked, and that are deemed suitable for systematic studies such as climate change, as well as for shorter-term process studies. These are publication-quality data with well-defined uncertainties, but they are also subject to continuing validation, quality assurance, and further improvements in subsequent versions. Users are expected to be familiar with quality summaries of all data before publication of results; when in doubt, contact the appropriate instrument team.
Stage 1 Validation: Product accuracy is estimated using a small number of independent measurements obtained from selected locations and time periods and ground-truth/field program efforts.
Stage 2 Validation: Product accuracy is estimated over a significant set of locations and time periods by comparison with reference in situ or other suitable reference data. Spatial and temporal consistency of the product, and with similar products, has been evaluated over globally representative locations and time periods. Results are published in the peer-reviewed literature.
Stage 3 Validation: Product accuracy has been assessed. Uncertainties in the product and its associated structure are well quantified from comparison with reference in situ or other suitable reference data. Uncertainties are characterized in a statistically robust way over multiple locations and time periods representing global conditions. Spatial and temporal consistency of the product, and with similar products, has been evaluated over globally representative locations and periods. Results are published in the peer-reviewed literature.
Stage 4 Validation: Validation results for Stage 3 are systematically updated when new product versions are released and as the time series expands.
20
Maturity matrix (dimensions: Sensor Use; Algorithm Stability, including ancillary inputs; Metadata & QA; Documentation; Validation; Public Release; Science and Applications):

Level 1: Sensor use: research mission. Algorithm stability: significant changes likely. Metadata & QA: incomplete. Documentation: draft ATBD. Validation: minimal. Public release: limited data availability to develop familiarity. Science and applications: little or none.

Level 2: Sensor use: research mission. Algorithm stability: some changes expected. Metadata & QA: research grade (extensive). Documentation: ATBD version 1+. Validation: uncertainty estimated for select locations or times. Public release: data available but of unknown accuracy; caveats required for data use. Science and applications: limited or ongoing.

Level 3: Sensor use: research mission. Algorithm stability: minimal change expected. Metadata & QA: research grade (extensive); meets international standards. Documentation: public ATBD; peer-reviewed algorithm and product descriptions. Validation: uncertainty estimated over widely distributed times/locations by multiple investigators; differences understood. Public release: data available but of unknown accuracy; caveats required for data use. Science and applications: provisionally used in applications and assessments demonstrating positive value.

Level 4: Sensor use: operational mission. Algorithm stability: minimal change expected. Metadata & QA: stable; allows provenance tracking and reproducibility; meets international standards. Documentation: public ATBD; draft Operational Algorithm Description (OAD); peer-reviewed algorithm and product descriptions. Validation: uncertainty estimated over widely distributed times/locations by multiple investigators; differences understood. Public release: data available but of unknown accuracy; caveats required for data use. Science and applications: provisionally used in applications and assessments demonstrating positive value.

Level 5: Sensor use: all relevant research and operational missions; unified and coherent record demonstrated across different sensors. Algorithm stability: stable and reproducible. Metadata & QA: stable; allows provenance tracking and reproducibility; meets international standards. Documentation: public ATBD, OAD, and validation plan; peer-reviewed algorithm, product, and validation articles. Validation: consistent uncertainties estimated over most environmental conditions by multiple investigators. Public release: multi-mission record is publicly available with associated uncertainty estimate. Science and applications: used in various published applications and assessments by different investigators.

Level 6: Sensor use: all relevant research and operational missions; unified and coherent record over the complete series; record is considered scientifically irrefutable following extensive scrutiny. Algorithm stability: stable and reproducible; homogeneous and published error budget. Metadata & QA: stable; allows provenance tracking and reproducibility; meets international standards. Documentation: product, algorithm, validation, processing, and metadata described in peer-reviewed literature. Validation: observation strategy designed to reveal systematic errors through independent cross-checks, open inspection, and continuous interrogation. Public release: multi-mission record is publicly available from a long-term archive. Science and applications: used in various published applications and assessments by different investigators.

See: ftp://ftp.ncdc.noaa.gov/pub/data/sds/ms-privette-P1.3.conf.header.pdf
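One way such a matrix could be made machine-readable is sketched below. The dictionary structure and lookup function are assumptions of ours; the level and dimension text is taken from the table above (only level 1 is shown here to keep the sketch short).

```python
# Hypothetical sketch: the maturity matrix as per-level, per-dimension text,
# so a dataset record can simply point at its assessed level.
MATURITY_MATRIX = {
    1: {
        "sensor_use": "Research mission",
        "algorithm_stability": "Significant changes likely",
        "metadata_qa": "Incomplete",
        "documentation": "Draft ATBD",
        "validation": "Minimal",
        "public_release": "Limited data availability to develop familiarity",
        "science_applications": "Little or none",
    },
    # levels 2 through 6 would be filled in from the table above
}

def describe(level: int, dimension: str) -> str:
    """Look up the matrix text for a given maturity level and dimension."""
    return MATURITY_MATRIX[level][dimension]
```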
21
- Community Assessment: papers written / number of citations; number of likes; number of downloads/views
- Technical Quality: QQC + latency / gappiness; accuracy; sampling issues? caveats/known issues identified?
- Processing: Has it been manipulated? Cal/Val state? Verification state?
- Provenance: maturity of platform/instrument/sensor; maturity of program; parent datasets identified (if applicable); Is the sensor fully described? Is the context of the reading(s) fully described? State-of-the-art technology?
- Documentation: What is the state of the documentation? Is the documentation captured (archived)?
- Adherence to Process Guidelines: Did it get fast-tracked? Tons of waivers? Were all exit criteria met satisfactorily? Consistent use of units?
- Access: Readily available? Foreign repository? Behind firewalls or open FTP?
- Toolkits: Data visualization routine? Data reader? Verified reader/subroutine?
- Relationships: Sibling/child datasets identified? Motivation/justification identified? Rarity: hard-to-find data? Atypical sensor/resolution/etc.?
- Specification: resolution (spatial/temporal); spatial coverage; start time; end time; data format? exotic structure? sizing/volume expectation?
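To show how sub-criteria like these could roll up into a per-category score, here is a sketch. The category and sub-criteria names come from the list above, but the 1-5 scale and the "fraction of criteria satisfied" rule are illustrative assumptions; the real rubric would be defined in the lifecycle policy.

```python
# Hypothetical sketch: score a criteria category from its sub-criteria answers.
def category_score(answers: dict[str, bool]) -> int:
    """Map the fraction of satisfied sub-criteria onto a 1-5 score."""
    if not answers:
        return 1
    satisfied = sum(answers.values()) / len(answers)
    return max(1, min(5, round(1 + 4 * satisfied)))

documentation = {
    "documentation exists": True,
    "documentation captured (archived)": True,
}
print(category_score(documentation))  # -> 5 when all sub-criteria are satisfied
```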
23
Maturity:
- Community Assessment: 3
- Technical Quality: 4
- Processing: 3
- Provenance: 3
- Documentation: 5
- Adherence to Process Guidelines: 5
- Toolkits: 5
- Relationships: 4
- Specification: 4
- Overarching Maturity Index: 4
Users would be presented with layers of information: scores derived from the various criteria categories, and an ultimate maturity index (a simple mathematical average) from the combined values. Weighting could ultimately be allowed, but at this point it seems it would overcomplicate things.
Question: What does "maturity" mean to you? Do you prefer it to "Assessment and Characterization"? Does this provide better-described datasets and better mapping of data to our users?
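The overarching index described above is just the mean of the category scores. A short sketch follows; the optional weighting hook is an assumption, shown only because the slide notes weighting could eventually be allowed.

```python
# Sketch of the overarching maturity index: a simple mathematical average of the
# category scores, with an optional (currently unused) weighting hook.
scores = {
    "Community Assessment": 3,
    "Technical Quality": 4,
    "Processing": 3,
    "Provenance": 3,
    "Documentation": 5,
    "Adherence to Process Guidelines": 5,
    "Toolkits": 5,
    "Relationships": 4,
    "Specification": 4,
}

def maturity_index(scores: dict[str, int], weights: dict[str, float] | None = None) -> float:
    """Simple average by default; weighted average if weights are supplied."""
    if weights is None:
        return sum(scores.values()) / len(scores)
    total_weight = sum(weights.get(k, 1.0) for k in scores)
    return sum(v * weights.get(k, 1.0) for k, v in scores.items()) / total_weight

print(round(maturity_index(scores)))  # -> 4, matching the example above
```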
24
The lifecycle document, while capturing process, becomes a means to an even greater end. The driving current is consistency, and since our goals hinge on matching users to datasets, the lifecycle becomes the means of ensuring fully characterized datasets. We hope the approach is reasonable, and that we are accurate in our assessment that the policy aspects of the Dataset Lifecycle can and will help ensure conformity to process and consistent availability of maturity data across all PO.DAAC holdings.
Next steps:
- Ultimately identify (and, if necessary, implement) the infrastructure needed to guide us through this lifecycle
- Resolve some key questions, such as: How does the lifecycle morph with respect to different types of datasets (remote datasets? self-generated datasets?)
25
[Diagram: the lifecycle phases repeated, from Identify a Dataset of Interest through Maintain the Dataset.] Michelle's discussion starts here…