GCE Data Toolbox -- metadata-based tools for automated data processing and analysis
Wade Sheldon
University of Georgia
GCE-LTER
Rationale
  Data processing, quality control, data analysis and metadata generation are traditionally carried out as separate activities, often in different time frames using different technologies
  Problems:
    Metadata may not reflect all processing steps
    Much routine data analysis is done without Q/C or metadata
    No economy of scale, which leads to “one-off” solutions
  Metadata generation should ideally occur throughout the data cycle and “inform” data analysis
Design Goals
  Develop Integrated Storage Standard
    Tabular Data
    QA/QC Information
    Metadata (overall data set & columns/attributes)
  Develop Software to Support Standard
    Code Library/API
    User Interfaces
  Apply Technology to Acquire, Manage, Distribute GCE-LTER Data
  Explore Use as Prototype Technology for Metadata-based Data Processing, Synthesis
Storage Standard
  Developed Using MATLAB®
    Local expertise, large scientific user base
    Cross-platform (Win32, Solaris, *nix, Mac OS X)
    Rapid development environment
    Supports multiple interfaces (interactive command line, batch-mode scripts, GUI, WWW)
    Good interoperability with other technologies (Java, Perl, SQL)
  Defined “GCE Data Structure” Spec. (based on MATLAB/C structures)
    Structure with 17 named fields
    Specific content rules for each field (software validation)
    Combines data, metadata, QA/QC, processing history
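
To make the idea of one structure that carries data, metadata, QA/QC flags and processing history more concrete, here is a minimal Python sketch. It is not the GCE Data Structure itself: the class names, field names and datatype codes are hypothetical, and the validation rules only illustrate what per-field content rules might check.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Column:
    """One attribute (column) together with its own descriptive metadata."""
    name: str
    units: str
    datatype: str   # hypothetical codes: "f" = float, "d" = integer, "s" = string
    values: list
    flags: list = field(default_factory=list)   # one QA/QC flag string per value


@dataclass
class DataSet:
    """Hypothetical container; field names are illustrative, not the GCE spec."""
    title: str
    metadata: dict = field(default_factory=dict)   # data-set-level metadata
    columns: list = field(default_factory=list)
    history: list = field(default_factory=list)    # (timestamp, message) pairs

    def log(self, message):
        """Append a processing-history entry with the current time."""
        stamp = datetime.now().isoformat(timespec="seconds")
        self.history.append((stamp, message))

    def validate(self):
        """Return a list of rule violations; an empty list means valid."""
        problems = []
        if len({len(c.values) for c in self.columns}) > 1:
            problems.append("columns have unequal lengths")
        for c in self.columns:
            if c.flags and len(c.flags) != len(c.values):
                problems.append(f"{c.name}: flag count does not match value count")
            if c.datatype not in {"f", "d", "s"}:
                problems.append(f"{c.name}: unknown datatype {c.datatype!r}")
        return problems


ds = DataSet(title="Example salinity record")
ds.columns.append(Column("Salinity", "PSU", "f", [31.2, 30.8, 29.9], ["", "", ""]))
ds.log("created structure with 1 column")
print(ds.validate())
```

Keeping the flags and the history inside the same object as the values means any function that changes the data can update both in one place, which is the point of an integrated standard.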
Storage Standard GCE Data Structure Specification (v1.1)
Software – GCE Data Toolbox
  Core Function Library
    Create, Validate Structures
    Import Data, Metadata (ASCII, MATLAB, SQL)
    Manipulate Data, Metadata (unit conversions, add/delete/update)
    Export Data, Metadata (various formats)
    Dynamic, Rule-based QA/QC Flagging
    Self-documenting
      Processing Operation Logging (Processing History)
      Transparent Metadata Creation/Updating
      Dynamic (JIT) Metadata Generation for Columns
  Support for Metadata “Templating”
    Application of Boilerplate Metadata based on Parameter Matching
    Supports Rapid Documentation of Routine Data Sources
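
As a rough illustration of rule-based QA/QC flagging combined with processing-history logging, the following Python sketch evaluates a small rule table against a column and records what it did. The rule format, flag codes ("I", "Q") and thresholds are assumptions made for this example; the toolbox's actual rule syntax is not shown in these slides.

```python
import operator

# Hypothetical rules: (column name, comparison, threshold, flag code).
RULES = [
    ("Temp", operator.lt, -2.0, "I"),   # below plausible range: "I" (invalid)
    ("Temp", operator.gt, 40.0, "I"),   # above plausible range: "I" (invalid)
    ("Temp", operator.gt, 35.0, "Q"),   # unusually warm: "Q" (questionable)
]


def apply_rules(data, rules, history):
    """Evaluate every rule against every value; return per-column flag lists."""
    flags = {name: [""] * len(values) for name, values in data.items()}
    for name, compare, threshold, code in rules:
        hits = 0
        for i, value in enumerate(data[name]):
            # Missing values (None) are skipped; each code is added at most once.
            if value is not None and compare(value, threshold) \
                    and code not in flags[name][i]:
                flags[name][i] += code
                hits += 1
        # Log the operation so the flags can be traced back to the rule.
        history.append(
            f"flagged {hits} value(s) in {name} as '{code}' "
            f"({compare.__name__} {threshold})"
        )
    return flags


history = []
data = {"Temp": [12.5, 36.1, 41.0, None, -5.0]}
flags = apply_rules(data, RULES, history)
print(flags["Temp"])
for entry in history:
    print(entry)
```

Because the flags are computed from the rules rather than stored by hand, re-running `apply_rules` after the data or the rules change regenerates them, which is one way to read "dynamic" flagging.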
Software – GCE Data Toolbox
  Support for Analysis
    Descriptive Statistics, Reports
    Visualization, Mapping
  Support for Synthesis
    Composite Data Set Creation
      Multiple Data Set Merge/Concatenation
      Relational Join
      Metadata Content Meshing
    Data Set Summarization
      Statistical Data Reduction/Re-sampling
    Data Set Standardization
      Unit Conversions (automatic, interactive)
      Template-based Semantic Mapping
      Automatic Semantic Mediation (prototype stage)
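
A unit conversion that is driven by column metadata, and that updates that metadata as it runs, might look like the Python sketch below. The conversion table, the dictionary layout of a column and the function name are all hypothetical; only the conversion factors themselves are standard.

```python
# Hypothetical conversion table: (from_units, to_units) -> (multiplier, offset).
CONVERSIONS = {
    ("ft", "m"): (0.3048, 0.0),
    ("degF", "degC"): (5.0 / 9.0, -160.0 / 9.0),   # C = (F - 32) * 5/9
    ("ft3/s", "m3/s"): (0.3048 ** 3, 0.0),
}


def convert_column(column, to_units, history):
    """Return a converted copy of the column with its units metadata updated."""
    key = (column["units"], to_units)
    if key not in CONVERSIONS:
        raise ValueError(f"no conversion defined from {key[0]} to {key[1]}")
    multiplier, offset = CONVERSIONS[key]
    converted = dict(column)
    # Missing values (None) pass through unchanged.
    converted["values"] = [
        None if v is None else v * multiplier + offset for v in column["values"]
    ]
    converted["units"] = to_units
    history.append(f"converted {column['name']} from {key[0]} to {key[1]}")
    return converted


history = []
stage = {"name": "Gage_Height", "units": "ft", "values": [10.0, None, 2.5]}
stage_m = convert_column(stage, "m", history)
print(stage_m["units"], stage_m["values"])
print(history)
```

The source units are read from the metadata rather than passed in by the caller, so the same call works for any column whose units appear in the table.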
Software – User Interfaces
  Unattended Batch Mode Processing
  Interactive Command Line Processing (conventional MATLAB UI)
    Full help text for each function
    Well-defined input/output arguments
  GUI Applications
    Standard Forms, Dialogs, Controls
    No MATLAB Experience Required
  WWW – MATLAB Web Server
    HTML Forms, Querystring Input
    HTML Pages and/or Static File Output
Command-Line Interface
GUI Applications
WWW Interface
Current Applications
  Automated Data Processing
    Direct data import from data logger files, WWW data sources (USGS), SQL queries
    Automatic metadata creation (templates, data mining)
    Rule-based QA/QC flagging
  Data Set Packaging
    Batch processing to create/update data, metadata products
    On-demand generation of data, metadata, stat reports in custom formats (end-user scripts, GUI applications, WWW forms)
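
Template-based metadata creation can be pictured as a lookup that matches imported column names against a stored template and copies the boilerplate across. The Python sketch below assumes a hypothetical template layout and case-insensitive name matching; the template name, column names and descriptions are invented for illustration.

```python
# Hypothetical template store: template name -> column name -> boilerplate metadata.
TEMPLATES = {
    "usgs_streamflow": {
        "Discharge": {"units": "ft3/s", "description": "Mean daily stream discharge"},
        "Gage_Height": {"units": "ft", "description": "Mean daily gage height"},
    }
}


def apply_template(column_names, template):
    """Return (metadata, unmatched) for the given column names.

    Matched names receive a copy of the template entry; unmatched names
    receive placeholder metadata and are reported for manual documentation.
    """
    lookup = {name.lower(): meta for name, meta in template.items()}
    metadata, unmatched = {}, []
    for name in column_names:
        meta = lookup.get(name.lower())
        if meta is None:
            unmatched.append(name)
            metadata[name] = {"units": "unspecified", "description": ""}
        else:
            metadata[name] = dict(meta)   # copy so the template is not modified
    return metadata, unmatched


columns = ["Date", "discharge", "Gage_Height"]
metadata, unmatched = apply_template(columns, TEMPLATES["usgs_streamflow"])
print(metadata["discharge"])
print("needs manual metadata:", unmatched)
```

Reporting the unmatched columns, rather than silently skipping them, keeps routine sources fully automatic while making gaps in a new source visible.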
Current Applications
  Data Exploration/Analysis by PIs
    Descriptive Statistics based on attribute metadata
    Visualization with Interactive Filtering (Frequency Histograms, 2D Plots, Map Plots)
    Data Reduction/Re-sampling to Provide Customized Data at Various “Scales”
      Aggregated Statistics
      Binned Statistics
      Query/Filtering (sub-selection)
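
Binned statistics, one of the data reduction options listed above, can be sketched in a few lines of Python. The function name, the fixed-width binning scheme and the sample depth and salinity values are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean


def binned_stats(x, y, bin_width):
    """Group y by fixed-width bins of x.

    Returns {bin_start: (count, mean, minimum, maximum)}, ordered by bin_start.
    Pairs with a missing x or y (None) are skipped.
    """
    bins = defaultdict(list)
    for xi, yi in zip(x, y):
        if xi is None or yi is None:
            continue
        bin_start = (xi // bin_width) * bin_width   # lower edge of the bin
        bins[bin_start].append(yi)
    return {
        start: (len(vals), mean(vals), min(vals), max(vals))
        for start, vals in sorted(bins.items())
    }


depth = [0.5, 1.2, 1.8, 2.4, 3.9]            # metres (sample values)
salinity = [28.0, 29.0, 31.0, 32.0, 33.0]    # PSU (sample values)
for start, stats in binned_stats(depth, salinity, 2.0).items():
    print(start, stats)
```

Changing `bin_width` is what produces the same data at different "scales": a wider bin gives fewer, more heavily averaged rows.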
Current Applications
  Data Harvesting (GCE)
    USGS Data (WWW real-time, daily, finalized data)
    Campbell Scientific Data Arrays (post-processing triggered after LoggerNet Retrieval)
    Sea-Bird Hydrographic Data
  USGS Data Harvesting Service for HydroDB
    Weekly harvest for 31 stations/7 LTER Sites
    Automatic Resampling, Unit Conversions, Q/C
Availability
  Description, Screen-shots, Fully-functional Toolbox Available on WWW:
  Requires MATLAB 5.3, 6.0, 6.5 (any platform)
  “Public” Version Compiled
  Source Code Requests Considered on Case-by-Case Basis
Future Development Plans
  EML 2.0 Support
  Metadata-mediated Data Set Integration
    Unit conversions
    Re-sampling
  More WWW Interface Development