Early Experience Prototyping a Science Data Server for Environmental Data Deb Agarwal (LBL) Catharine van Ingen (MSFT) 25 October 2006.

Presentation transcript:

Early Experience Prototyping a Science Data Server for Environmental Data Deb Agarwal (LBL) Catharine van Ingen (MSFT) 25 October 2006

Outline
Water and ecological data archives and other sources
Typical small group collaboration needs
Berkeley Water Center and Ameriflux collaboration
Common problems

Unprecedented Data Availability

Example Carbon-Climate Datasets
[Figure: soils, climate, and remote sensing data, labeled as observatory datasets and spatially continuous datasets]

Ameriflux Collaboration Overview
149 sites across the Americas
Each site reports a minimum of 22 common measurements.
Communal science: each principal investigator acts independently to prepare and publish data.
Second level data published to and archived at Oak Ridge.
Total data reported to date on the order of 150M half-hourly measurements.
[Figure: air temperature (T AIR), soil temperature (T SOIL), and onset of photosynthesis]

Typical Data Flow Today
Prior to analysis, data and ancillary data must be assembled, checked, and cleaned
– Some of this is mundane (e.g. unit conversions)
– Some requires domain-specific knowledge, including instrumentation or location knowledge
– Ancillary data is often critical to understanding and using the data
After all that, data are often misplaced, scattered, and even lost
– Provenance is in the mind of the beholder
– Everybody knows yet no one is sure
[Diagram: Internet Data Archives, Local Measurements, Large Models, Legacy Sources]

Improved Data Flow
Local repository for data and ancillary data assembled by a small scientific collaboration from a wide variety of sources
– A common safe deposit box
– Versioned and logged to provide basic provenance
Simple interactions with existing and emerging internet portals for data and ancillary data download, and, over time, upload
– Simplify data assembly by adding automation for tracking and data conversions
[Diagram: Legacy Sources, Internet Data Archives, Local Measurements, Large Models]

Data Curation Today
Well curated large government operated sites
– Clear protocols for measurement updates, recalibrations, changes
– Emerging standards or long standing practices for measurement naming and reported units
– http://waterdata.usgs.gov/nwis
Somewhat curated smaller organization sites
– Best effort use of common measurement naming and units
– As data sharing increases, best practices tend to emerge
– …ux/
Locator catalog sites
– Helps locate similar data across websites
Everybody else
– Naming, units, and recalibrations unclear
– Moving to an ideal: …IL/WRRI/neuse.html

Data Curation Challenges
Cross source and over time rationalization
– Different naming and units conventions
– Distinguish derived and non-derived measurements: VPD computed from Rh
Convert basic measurements to useful inputs for science
– Algorithms still evolving for smoothing (obviously?) data and gap-filling
– Archive tends to represent instrumentation; science tends to represent physical system
Convert from basic science data to useful inputs for public policy
– $40K acre-foot for Central Valley irrigation water; ~80% of that is energy cost
[Figure: Average Air Temperature at Two Nearby Sites. Odd microclimate effects or error in time reporting?]
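The "VPD computed from Rh" item can be illustrated with a short sketch. The slides do not say which formula the collaboration used; the Tetens approximation for saturation vapor pressure below is a swapped-in, commonly used choice, and the record layout with a `derived_from` field is a hypothetical way of keeping derived and non-derived measurements distinguishable.

```python
import math


def saturation_vapor_pressure_kpa(t_air_c: float) -> float:
    """Saturation vapor pressure in kPa (Tetens approximation, T in deg C)."""
    return 0.6108 * math.exp(17.27 * t_air_c / (t_air_c + 237.3))


def vpd_kpa(t_air_c: float, rh_percent: float) -> float:
    """Vapor pressure deficit in kPa from air temperature and relative humidity."""
    return saturation_vapor_pressure_kpa(t_air_c) * (1.0 - rh_percent / 100.0)


def derived_vpd_record(t_air_c: float, rh_percent: float) -> dict:
    """Hypothetical record layout: the derivation inputs travel with the value."""
    return {
        "datumtype": "VPD",
        "value": vpd_kpa(t_air_c, rh_percent),
        "units": "kPa",
        "derived_from": ("TA", "RH"),  # assumption: names of the source columns
    }


if __name__ == "__main__":
    print(derived_vpd_record(20.0, 50.0))
```

Carrying the inputs alongside the value is one way to let a later query separate measured columns from computed ones.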

Scientific Data Server Goals
Act as a local repository for data and metadata assembled by a small group of scientists from a wide variety of sources
– Simplify provenance by providing a common safe deposit box for assembled data
Interact simply with existing and emerging internet portals for data and metadata download, and, over time, upload
– Simplify data assembly by adding automation
– Simplify name space confusion by adding explicit decode translation
Support basic analyses across the entire dataset for both data cleaning and science
– Simplify mundane data handling tasks
– Simplify quality checking and data selection by enabling data browsing

Scientific Data Server Logical Overview

Data Staging Pipeline
Data can be downloaded from internet sites regularly
– Sometimes the only way to detect changed data is to compare with the data already archived
– The download is relatively cheap; the subsequent staging is expensive
New or changed data discovered during staging
– Simple checksum before load
– Chunk checksum after decode
– Comparison query if requested
Decode stage critical to handle the uncontrolled vocabularies
– Measurement type, location offset, quality indicators, units, derivation methods often encoded in column headers
Incremental copy moves staged data to one or more sitesets
– Automated via siteset:site:source mapping
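The two checksum steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the server's actual implementation: the function names, the use of SHA-256, the `decode` callable, and the default chunk size of 48 rows (one day of half-hourly data) are all assumptions.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Hex digest of a byte string (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def chunk_digests(rows, chunk_size):
    """Map chunk index -> digest of that chunk of decoded rows."""
    return {
        start // chunk_size: checksum(
            "\n".join(rows[start:start + chunk_size]).encode("utf-8")
        )
        for start in range(0, len(rows), chunk_size)
    }


def stage(raw, archived_file_digest, archived_chunks, decode, chunk_size=48):
    """Return indices of chunks that are new or changed.

    Step 1: a whole-file checksum before load skips unchanged downloads.
    Step 2: per-chunk checksums after decode narrow down what changed.
    """
    if checksum(raw) == archived_file_digest:
        return []  # identical download, so skip the expensive staging
    new_chunks = chunk_digests(decode(raw), chunk_size)
    return sorted(
        index for index, digest in new_chunks.items()
        if archived_chunks.get(index) != digest
    )


if __name__ == "__main__":
    def decode(raw_bytes):
        return raw_bytes.decode("utf-8").splitlines()

    old = b"1\n2\n3\n4\n"
    new = b"1\n2\n3\n9\n"
    archived = chunk_digests(decode(old), 2)
    print(stage(new, checksum(old), archived, decode, chunk_size=2))
```

Only the chunks returned by `stage` would need the slower comparison query or the incremental copy.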

Column Decode Today
[Datumtype] [repeat][_offset][_offset][extended datumtype][units]
Datumtype: the short (<16 characters) name for the data.
– Example: TA, PREC, or LE.
Repeat: an optional number indicating that multiple measurements were taken at the same site and offset.
– Example: include TA2.
[_offset][_offset]: major and minor part of the z offset.
– Example: SWC_10 (SWC at 10 cm) or TA_10_7 (TA at 10.7 m).
Extended datumtype: any remaining column text.
– Example: fir, E, sfc, wangrot, _cum
Units: measurement units.
– Example: w/m2, or deg C
unique column header strings now
– Roughly 70% of that due to offset or two extended datumtypes
– Another ~100 arriving now
– Quality and algorithm derivation provenance
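A decoder for the header pattern above might look like the sketch below. It is illustrative only: the list of known datumtypes is limited to names that appear on the slide, matching the datumtype against a known vocabulary is an assumption (without one, trailing extended text such as "fir" cannot be separated from the name), and units are left out because the slide does not show how they are delimited.

```python
import re
from typing import NamedTuple, Optional

# Assumption: a small vocabulary taken from the slide's own examples.
KNOWN_DATUMTYPES = ("PREC", "SWC", "TA", "LE")

# What may follow the datumtype: repeat, major offset, minor offset, extended text.
_REST = re.compile(
    r"^(?P<repeat>\d+)?(?:_(?P<major>\d+))?(?:_(?P<minor>\d+))?(?P<extended>.*)$"
)


class Column(NamedTuple):
    datumtype: str
    repeat: Optional[int]
    offset: Optional[float]  # z offset; its unit (cm or m) depends on the datumtype
    extended: str


def decode_header(header: str) -> Optional[Column]:
    """Split a single-line column header into its parts, or return None if unknown."""
    for name in sorted(KNOWN_DATUMTYPES, key=len, reverse=True):
        if header.startswith(name):
            match = _REST.match(header[len(name):])
            if match is None:
                return None
            repeat = int(match["repeat"]) if match["repeat"] else None
            offset = None
            if match["major"] is not None:
                # TA_10_7 means 10.7: major and minor parts of the offset.
                offset = float(f"{match['major']}.{match['minor'] or 0}")
            return Column(name, repeat, offset, match["extended"])
    return None


if __name__ == "__main__":
    for text in ("TA2", "SWC_10", "TA_10_7", "PREC_cum"):
        print(text, decode_header(text))
```

Headers that return `None` are the ones a curator would need to look at, which fits the slide's point that new header strings keep arriving.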

Browsing for Data Availability
[Figure: Data Availability by Site]
Measuring temperature is easy; deriving ecosystem production is problematic

Browsing for Data Applicability
Real field data has both short term gaps and longer term outages due to instrument outages
– The utility of the data depends on the nature of the science being performed
– Browsing data counts can give rapid insight into how the data can be used before more complex analyses are performed
[Figure: Data Count. Data often missing in the winter! What's going on at higher latitudes? (It should be getting colder)]
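The data-count browsing described here amounts to counting usable values per site and time bucket. The sketch below assumes records arrive as (site, timestamp, value) tuples and that missing values are marked with None or a -9999 sentinel; both the record shape and the sentinel are assumptions, and the site name in the usage example is made up.

```python
from collections import Counter
from datetime import datetime

MISSING = -9999.0  # assumed missing-value sentinel


def monthly_counts(records):
    """Count non-missing values per (site, year, month)."""
    counts = Counter()
    for site, timestamp, value in records:
        if value is None or value == MISSING:
            continue  # gaps and outages do not contribute to the count
        counts[(site, timestamp.year, timestamp.month)] += 1
    return counts


if __name__ == "__main__":
    records = [
        ("SiteA", datetime(2005, 1, 1, 0, 0), MISSING),
        ("SiteA", datetime(2005, 1, 1, 0, 30), -3.2),
        ("SiteA", datetime(2005, 7, 1, 0, 0), 18.4),
        ("SiteA", datetime(2005, 7, 1, 0, 30), 18.1),
    ]
    for key, count in sorted(monthly_counts(records).items()):
        print(key, count)
```

A low winter count relative to summer for the same site is the kind of pattern the slide's figure calls out.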

Curation Learnings To Date
Ancillary data is as important as data
– Comparing sites of like vegetation, climate as important as latitude or other physical quantity
– Only some are numeric, most are debated, some vary with time
– Curate the two together
Controlled vocabularies are hard
– Humans like making up names and have a hard time remembering 100+ names
– Assume a decode step in the staging pipeline
Data analysis and data cleaning are intertwined
– Data cleaning is always on-going
– Some measurements can be used as indicators of quality of other measurements
– Share the simple tools and visualizations
The saga continues at … and …BWC.htm

Acknowledgements
Berkeley Water Center, University of California, Berkeley, Lawrence Berkeley Laboratory: Deb Agarwal, Monte Goode, Susan Hubbard, James Hunt, Matt Rodriguez, Yoram Rubin
Microsoft: Jim Gray, Tony Hey, Dan Fay, Stuart Ozer, SQL product team
Ameriflux Collaboration: Dennis Baldocchi, Beverly Law, Gretchen Miller, Tara Stiefl, Mathias Goeckede, Mattias Falk, Tom Boden