Early Experience Prototyping a Science Data Server for Environmental Data
  Deb Agarwal (LBL), Catharine van Ingen (MSFT)
  25 October 2006
Outline
  Water and ecological data archives and other sources
  Typical small group collaboration needs
  Berkeley Water Center and Ameriflux collaboration
  Common problems
Unprecedented Data Availability
Example Carbon-Climate Datasets
  [Figure labels: Soils; Climate; Remote Sensing; Observatory datasets; Spatially continuous datasets]
Ameriflux Collaboration Overview
  149 sites across the Americas
  Each site reports a minimum of 22 common measurements.
  Communal science – each principal investigator acts independently to prepare and publish data.
  Second-level data published to and archived at Oak Ridge.
  Total data reported to date on the order of 150M half-hourly measurements.
  [Figure labels: T AIR; T SOIL; Onset of photosynthesis]
Typical Data Flow Today
  Prior to analysis, data and ancillary data must be assembled, checked, and cleaned
    – Some of this is mundane (e.g. unit conversions)
    – Some requires domain-specific knowledge, including instrumentation or location knowledge
    – Ancillary data is often critical to understanding and using the data
  After all that, data are often misplaced, scattered, and even lost
    – Provenance is in the mind of the beholder
    – Everybody knows yet no one is sure
  [Figure labels: Internet Data Archives; Local Measurements; Large Models; Legacy Sources]
Improved Data Flow
  Local repository for data and ancillary data assembled by a small scientific collaboration from a wide variety of sources
    – A common safe deposit box
    – Versioned and logged to provide basic provenance
  Simple interactions with existing and emerging internet portals for data and ancillary data download, and, over time, upload
    – Simplify data assembly by adding automation for tracking and data conversions
  [Figure labels: Legacy Sources; Internet Data Archives; Local Measurements; Large Models]
Data Curation Today
  Well-curated large government-operated sites
    – Clear protocols for measurement updates, recalibrations, changes
    – Emerging standards or long-standing practices for measurement naming and reported units
    – http://waterdata.usgs.gov/nwis
  Somewhat curated smaller organization sites
    – Best-effort use of common measurement naming and units
    – As data sharing increases, best practices tend to emerge
    – ux/
  Locator catalog sites
    – Helps locate similar data across websites
  Everybody else
    – Naming, units, and recalibrations unclear
    – Moving to an ideal: IL/WRRI/neuse.html
Data Curation Challenges
  Cross-source and over-time rationalization
    – Different naming and units conventions
    – Distinguish derived and non-derived measurements: VPD computed from Rh
  Convert basic measurements to useful inputs for science
    – Algorithms still evolving for smoothing (obviously?) data and gap-filling
    – Archive tends to represent instrumentation; science tends to represent physical system
  Convert from basic science data to useful inputs for public policy
    – $40K acre-foot for Central Valley irrigation water; ~80% of that is energy cost
  [Figure: Average Air Temperature at Two Nearby Sites. Odd microclimate effects or error in time reporting?]
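The VPD example above is a derived measurement: it is computed from air temperature and relative humidity rather than observed directly. A minimal sketch of one common way to do this follows. It assumes the Tetens approximation for saturation vapor pressure, temperature in deg C, and relative humidity in percent; the slides do not say which formula or units the collaboration uses, and the function names are hypothetical.

```python
import math


def saturation_vapor_pressure_kpa(temp_c: float) -> float:
    """Saturation vapor pressure over water in kPa (Tetens approximation, an assumption)."""
    return 0.6108 * math.exp(17.27 * temp_c / (temp_c + 237.3))


def vpd_kpa(temp_c: float, rh_percent: float) -> float:
    """Vapor pressure deficit in kPa from air temperature and relative humidity."""
    if not 0.0 <= rh_percent <= 100.0:
        raise ValueError(f"relative humidity out of range: {rh_percent}")
    return saturation_vapor_pressure_kpa(temp_c) * (1.0 - rh_percent / 100.0)
```

Because the result depends on the formula chosen, a stored VPD value would reasonably carry a "derived" flag and a note on the method, which is the distinction the slide asks for.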
Scientific Data Server Goals
  Act as a local repository for data and metadata assembled by a small group of scientists from a wide variety of sources
    – Simplify provenance by providing a common safe deposit box for assembled data
  Interact simply with existing and emerging internet portals for data and metadata download, and, over time, upload
    – Simplify data assembly by adding automation
    – Simplify name space confusion by adding explicit decode translation
  Support basic analyses across the entire dataset for both data cleaning and science
    – Simplify mundane data handling tasks
    – Simplify quality checking and data selection by enabling data browsing
Scientific Data Server Logical Overview
Data Staging Pipeline
  Data can be downloaded from internet sites regularly
    – Sometimes the only way to detect changed data is to compare with the data already archived
    – The download is relatively cheap; the subsequent staging is expensive
  New or changed data discovered during staging
    – Simple checksum before load
    – Chunk checksum after decode
    – Comparison query if requested
  Decode stage critical to handle the uncontrolled vocabularies
    – Measurement type, location offset, quality indicators, units, derivation methods often encoded in column headers
  Incremental copy moves staged data to one or more sitesets
    – Automated via siteset:site:source mapping
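The two checksum steps above can be sketched as follows. This is an illustration under stated assumptions, not the server's actual implementation: SHA-256 is an arbitrary choice of hash, the chunk size of 48 rows (one day of half-hourly data) is a guess, and all function names are hypothetical.

```python
import hashlib
from typing import List, Sequence, Tuple


def file_checksum(raw: bytes) -> str:
    """Whole-file checksum: a cheap test before load for 'nothing changed'."""
    return hashlib.sha256(raw).hexdigest()


def chunk_checksums(rows: Sequence[Tuple], chunk_size: int = 48) -> List[str]:
    """Checksum decoded rows in fixed-size chunks so a change can be localized."""
    sums = []
    for start in range(0, len(rows), chunk_size):
        digest = hashlib.sha256()
        for row in rows[start:start + chunk_size]:
            digest.update(repr(row).encode("utf-8"))
            digest.update(b"\n")
        sums.append(digest.hexdigest())
    return sums


def changed_chunks(archived: List[str], staged: List[str]) -> List[int]:
    """Indexes of chunks that are new, missing, or different from the archive."""
    longest = max(len(archived), len(staged))
    return [
        i for i in range(longest)
        if i >= len(archived) or i >= len(staged) or archived[i] != staged[i]
    ]
```

If the file checksum matches the archived one, staging can stop early. Otherwise only the chunks reported by `changed_chunks` need the more expensive comparison query.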
Column Decode Today
  [Datumtype][repeat][_offset][_offset][extended datumtype][units]
  Datumtype: the short (<16 characters) name for the data.
    – Example: TA, PREC, or LE.
  Repeat: an optional number indicating that multiple measurements were taken at the same site and offset.
    – Example: TA2.
  [_offset][_offset]: major and minor part of the z offset.
    – Example: SWC_10 (SWC at 10 cm) or TA_10_7 (TA at 10.7 m).
  Extended datumtype: any remaining column text.
    – Example: fir, E, sfc, wangrot, _cum
  Units: measurement units.
    – Example: w/m2, or deg C
  unique column header strings now
    – Roughly 70% of that due to offset or two extended datumtypes
    – Another ~100 arriving now
  Quality and algorithm derivation provenance
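The header pattern above can be decoded with a short parser. The sketch below makes several assumptions that the slides do not confirm: datumtypes come from a known vocabulary (the four listed are only the examples from the slide), units follow the first space in the header, and the offset is returned as a bare number because the header does not say whether it is in cm or m. `DecodedColumn` and `decode_column` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical controlled vocabulary, taken from the slide's examples only.
KNOWN_DATUMTYPES = ("TA", "PREC", "LE", "SWC")

# Everything after the datumtype: repeat, major offset, minor offset, leftover text.
_REST = re.compile(
    r"^(?P<repeat>\d+)?"
    r"(?:_(?P<major>\d+))?"
    r"(?:_(?P<minor>\d+))?"
    r"(?P<extended>.*)$",
    re.DOTALL,
)


@dataclass(frozen=True)
class DecodedColumn:
    datumtype: str
    repeat: Optional[int]
    offset: Optional[float]  # z offset, unit not encoded in the header
    extended: str
    units: str


def decode_column(header: str) -> DecodedColumn:
    """Split one column header into its parts (assumes units follow the first space)."""
    name, _, units = header.strip().partition(" ")
    # Try longer datumtypes first so a short name never shadows a longer one.
    for datumtype in sorted(KNOWN_DATUMTYPES, key=len, reverse=True):
        if name.startswith(datumtype):
            break
    else:
        raise ValueError(f"unknown datumtype in column header: {header!r}")
    match = _REST.match(name[len(datumtype):])
    repeat = int(match["repeat"]) if match["repeat"] else None
    offset = None
    if match["major"] is not None:
        offset = float(f"{match['major']}.{match['minor'] or '0'}")
    return DecodedColumn(datumtype, repeat, offset, match["extended"], units.strip())
```

For example, `decode_column("TA_10_7 deg C")` yields datumtype `TA`, offset `10.7`, and units `deg C`, while `decode_column("PREC_cum")` leaves `_cum` in the extended datumtype field.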
Browsing for Data Availability
  Measuring temperature is easy; deriving ecosystem production problematic
  [Figure: Data Availability by Site]
Browsing for Data Applicability
  Real field data has both short-term gaps and longer-term outages due to instrument outages
    – The utility of the data depends on the nature of the science being performed
    – Browsing data counts can give rapid insight into how the data can be used before more complex analyses are performed
  Data often missing in the winter!
  What's going on at higher latitudes? (It should be getting colder)
  [Figure label: Data Count]
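A data-count browse like the one described above amounts to counting usable values per site and time bucket. The sketch below is a plain-Python illustration, not the server's query: the -9999 missing-value sentinel, the (site, timestamp, value) record shape, and the function name are all assumptions.

```python
import math
from collections import Counter
from datetime import datetime
from typing import Iterable, Optional, Tuple

MISSING = -9999.0  # assumed sentinel for a missing measurement

Record = Tuple[str, datetime, Optional[float]]


def monthly_counts(records: Iterable[Record]) -> Counter:
    """Count non-missing values per (site, year, month)."""
    counts: Counter = Counter()
    for site, timestamp, value in records:
        if value is None or value == MISSING or math.isnan(value):
            continue  # gaps and outages contribute nothing to the count
        counts[(site, timestamp.year, timestamp.month)] += 1
    return counts
```

Months that are absent from the result, or far below the roughly 1,400 to 1,500 half-hourly values a full month would hold, point to the winter gaps and outages the slide mentions.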
Curation Learnings To Date
  Ancillary data is as important as data
    – Comparing sites of like vegetation, climate as important as latitude or other physical quantity
    – Only some are numeric, most are debated, some vary with time
    – Curate the two together
  Controlled vocabularies are hard
    – Humans like making up names and have a hard time remembering 100+ names
    – Assume a decode step in the staging pipeline
  Data analysis and data cleaning are intertwined
    – Data cleaning is always ongoing
    – Some measurements can be used as indicators of quality of other measurements
    – Share the simple tools and visualizations
  The saga continues at and BWC.htm
Acknowledgements
  Berkeley Water Center, University of California, Berkeley, Lawrence Berkeley Laboratory
    – Deb Agarwal, Monte Good, Susan Hubbard, James Hunt, Matt Rodriguez, Yoram Rubin
  Microsoft
    – Jim Gray, Tony Hey, Dan Fay, Stuart Ozer, SQL product team
  Ameriflux Collaboration
    – Dennis Baldocchi, Beverly Law, Gretchen Miller, Tara Stiefl, Mathias Goeckede, Mattias Falk, Tom Boden