Dynamic, Rule-based Quality Control Framework for Real-time Sensor Data Wade Sheldon Georgia Coastal Ecosystems LTER University of Georgia.

Slides:



Advertisements
Similar presentations
DIGIDOC A web based tool to Manage Documents. System Overview DigiDoc is a web-based customizable, integrated solution for Business Process Management.
Advertisements

Performance Testing - Kanwalpreet Singh.
Introduction to Maven 2.0 An open source build tool for Enterprise Java projects Mahen Goonewardene.
GCE Data Toolbox for MATLAB Wade Sheldon Georgia Coastal Ecosystems LTER University of Georgia John Chamblee & Richard Cary Coweeta LTER University of.
SOFTWARE PRESENTATION ODMS (OPEN SOURCE DOCUMENT MANAGEMENT SYSTEM)
T-FLEX DOCs PLM, Document and Workflow Management.
DCS Architecture Bob Krzaczek. Key Design Requirement Distilled from the DCS Mission statement and the results of the Conceptual Design Review (June 1999):
Presented by IBM developer Works ibm.com/developerworks/ 2006 January – April © 2006 IBM Corporation. Making the most of Creating Eclipse plug-ins.
Introduction to Structured Query Language (SQL)
Distributed Systems: Client/Server Computing
Jeremy Boyd Director – Mindscape MSDN Regional Director
DASHBOARDS Dashboard provides the managers with exactly the information they need in the correct format at the correct time. BI systems are the foundation.
Web-based Portal for Discovery, Retrieval and Visualization of Earth Science Datasets in Grid Environment Zhenping (Jane) Liu.
Slide 1 of 9 Presenting 24x7 Scheduler The art of computer automation Press PageDown key or click to advance.
SQL Server ® 2008 ® Native Client. Agenda  Introduction to SQL Server Native Client  Building High-Performance Data Access Solutions  Going Beyond.
Professional Informatics & Quality Assurance Software Lifecycle Manager „Tools that are more a help than a hindrance”
Synthesis of Incomplete and Qualified Data using the GCE Data Toolbox Wade Sheldon Georgia Coastal Ecosystems LTER University of Georgia.
Application Extension In this presentation… –About extending your underlying iSeries application. –Using Designer to extend your application. –Introduction.
January, 23, 2006 Ilkay Altintas
6/1/2001 Supplementing Aleph Reports Using The Crystal Reports Web Component Server Presented by Bob Gerrity Head.
A Scalable Application Architecture for composing News Portals on the Internet Serpil TOK, Zeki BAYRAM. Eastern MediterraneanUniversity Famagusta Famagusta.
Copyright 2003 Accenture. All rights reserved. Accenture, its logo, and Accenture Innovation Delivered are trademarks of Accenture. Data Migration in Oracle.
WorkPlace Pro Utilities.
Database Systems Design, Implementation, and Management Coronel | Morris 11e ©2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or.
AL-MAAREFA COLLEGE FOR SCIENCE AND TECHNOLOGY INFO 232: DATABASE SYSTEMS CHAPTER 7 INTRODUCTION TO STRUCTURED QUERY LANGUAGE (SQL) Instructor Ms. Arwa.
SednaSpace A software development platform for all delivers SOA and BPM.
1 Designing a Data Exchange - Best Practices Data Exchange Scenarios –Sender vs. Receiver-initiated exchanges –Node Design Best Practices: –Handling Large.
Max Planck Institute for Psycholinguistics Tool development report H. Brugman MPI Nijmegen.
Data File Access API : Under the Hood Simon Horwith CTO Etrilogy Ltd.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
DBSQL 14-1 Copyright © Genetic Computer School 2009 Chapter 14 Microsoft SQL Server.
Indo-US Workshop, June23-25, 2003 Building Digital Libraries for Communities using Kepler Framework M. Zubair Old Dominion University.
GCE Data Toolbox -- metadata-based tools for automated data processing and analysis Wade Sheldon University of Georgia GCE-LTER.
OOI CI LCA REVIEW August 2010 Ocean Observatories Initiative OOI Cyberinfrastructure Architecture Overview Michael Meisinger Life Cycle Architecture Review.
Chad Berkley NCEAS National Center for Ecological Analysis and Synthesis (NCEAS), University of California Santa Barbara Long Term Ecological Research.
Generic API Test tool By Moshe Sapir Almog Masika.
7 1 Chapter 7 Introduction to Structured Query Language (SQL) Database Systems: Design, Implementation, and Management, Seventh Edition, Rob and Coronel.
Property of Jack Wilson, Cerritos College1 CIS Computer Programming Logic Programming Concepts Overview prepared by Jack Wilson Cerritos College.
_______________________________________________________________CMAQ Libraries and Utilities ___________________________________________________Community.
Editing Building Block (EBB) Validation Tool for FDI and ITS Balance of Payments Working Group 02 April 2012 Unit B4, IT for Statistical Production Georges.
Database Systems Design, Implementation, and Management Coronel | Morris 11e ©2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or.
Strategies for Adding EML Support to the GCE Data Toolbox for Matlab Wade Sheldon Georgia Coastal Ecosystems LTER (WWW: gce-lter.marsci.uga.edu/lter)
Database Systems Design, Implementation, and Management Coronel | Morris 11e ©2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or.
GCE Software Tools for Data Mining, Analysis and Synthesis Wade M. Sheldon Georgia Coastal Ecosystems LTER, University of Georgia, Athens, Georgia Introduction.
Distributed Computing With Triana A Short Course Matthew Shields, Ian Taylor & Ian Wang.
BOĞAZİÇİ UNIVERSITY DEPARTMENT OF MANAGEMENT INFORMATION SYSTEMS MATLAB AS A DATA MINING ENVIRONMENT.
August 2003 At A Glance The IRC is a platform independent, extensible, and adaptive framework that provides robust, interactive, and distributed control.
John Porter Sheng Shan Lu M. Gastil Gastil-Buhl With special thanks to Chau-Chin Lin and Chi-Wen Hsaio.
Survey of Current Practices for Reporting Missing, Qualified Data Wade Sheldon GCE-LTER.
STL CSSE 250 Susan Reeder. What is the STL? Standard Template Library Standard C++ Library is an extensible framework which contains components for Language.
Java Software Solutions Lewis and Loftus Chapter 6 1 Copyright 1997 by John Lewis and William Loftus. All rights reserved. Objects for Organizing Data.
PLANETS, OPF & SCAPE A summary of the tools from these preservation projects, and where their development is heading.
Recent Enhancements to Quality Assurance and Case Management within the Emissions Modeling Framework Alison Eyth, R. Partheepan, Q. He Carolina Environmental.
XML Tools (Chapter 4 of XML Book). What tools are needed for a complete XML application? n Fundamental components n Web infrasructure n XML development.
Satisfying Requirements BPF for DRA shall address: –DAQ Environment (Eclipse RCP): Gumtree ISEE workbench integration; –Design Composing and Configurability,
Chapter – 8 Software Tools.
ECHO Technical Interchange Meeting 2013 Timothy Goff 1 Raytheon EED Program | ECHO Technical Interchange 2013.
Lecture 11 Introduction to R and Accessing USGS Data from Web Services Jeffery S. Horsburgh Hydroinformatics Fall 2013 This work was funded by National.
XML and Distributed Applications By Quddus Chong Presentation for CS551 – Fall 2001.
1 Middle East Users Group 2008 Self-Service Engine & Process Rules Engine Presented by: Ryan Flemming Friday 11th at 9am - 9:45 am.
MongoDB Er. Shiva K. Shrestha ME Computer, NCIT
ECONOMETRICS ii – spring 2018
Populating a Data Warehouse
SharePoint Workflow Master Class
Use of Mathematics using Technology (Maltlab)
Weka Package Weka package is open source data mining software written in Java. Weka can be applied to your dataset from the GUI, the command line or called.
Populating a Data Warehouse
Contents Preface I Introduction Lesson Objectives I-2
Integrated Statistical Production System WITH GSBPM
Presentation transcript:

Dynamic, Rule-based Quality Control Framework for Real-time Sensor Data Wade Sheldon Georgia Coastal Ecosystems LTER University of Georgia

Introduction Quality Control of high volume, real-time data from automated sensors is an emerging challenge  Traditional techniques (plotting, stats) often don’t scale well  Data validation and Q/C can be limiting factor in getting data “online”  Difficulties lead to release delays or posting provisional data Software developed at Georgia Coastal Ecosystems LTER has proven useful for Q/C of real-time data Designed to automate GCE data processing and metadata generation, but very generalized and supports any tabular data Provides dynamic, rule-based Q/C framework for data processing, analysis and synthesis

Framework Components Comprehensive data model  Implemented as hierarchical MATLAB ‘structure’ arrays  Package dataset & attribute metadata, data, Q/C rules, qualifier flags Metadata-based MATLAB software (GCE Data Toolbox)  Automatic (rule-based) and manual assignment of Q/C qualifier flags  Transparent management of flags throughout all data manipulation  Q/C-aware data management and analysis tools  Q/C-aware data integration and synthesis tools Modular implementation supports many scenarios  Interactive (command-line API and GUI forms)  Automated workflows (timed or triggered)  End-to-end (logger-to-scientist) or part of larger workflow  Runs natively on multiple platforms (PC, *nix, MacOS)

GCE Data Toolbox Data Model

Quality Control Rules Basic syntax: [logical expression]=’[flag code]’ Logical Expressions:  Any conditional statement or call to MATLAB function that returns logical array (0 = false, 1 = true)  Dataset columns referenced in statements as:  “x” – alias for current column (e.g. x<0)  “col_[name]” – any dataset column by name (e.g. “col_Depth<0”) Flag Codes:  Alphanumeric character to assign when expression true ( I, q, 9, *)  Codes defined in the dataset metadata (I = invalid value, …) Unlimited rules per attribute, multiple flags per value

Quality Control Rule Examples Numeric Comparisons:  Simple:  x<0=‘I’ (flags negative values)  x 100=‘I’;x 80=‘Q’ (overlapping bounds checks)

Quality Control Rule Examples Numeric Comparisons:  Simple:  x<0=‘I’ (flags negative values)  x 100=‘I’;x 80=‘Q’ (overlapping bounds checks)  Statistical:  x>(mean(x)+3*std(x))=‘Q’;x<(mean(x)-3*std(x))=‘Q’ (flags values more than 3 standard deviations from column mean)

Quality Control Rule Examples Numeric Comparisons:  Simple:  x<0=‘I’ (flags negative values)  x 100=‘I’;x 80=‘Q’ (overlapping bounds checks)  Statistical:  x>(mean(x)+3*std(x))=‘Q’;x<(mean(x)-3*std(x))=‘Q’ (flags values more than 3 standard deviations from column mean)  Multi-column:  col_DOC>col_TOC=‘I’ (in column DOC; flags DOC exceeding TOC)  col_Dry_Weight<(col_Wet_Weight-col_Ash_Weight)*0.90 =’I’ (flags dry weights below 90% wet weight – ash weight)  col_Depth<0=‘I’ (in column Salinity; flags Salinity when Depth < 0)

Quality Control Rule Examples Numeric Comparisons:  Simple:  x<0=‘I’ (flags negative values)  x 100=‘I’;x 80=‘Q’ (overlapping bounds checks)  Statistical:  x>(mean(x)+3*std(x))=‘Q’;x<(mean(x)-3*std(x))=‘Q’ (flags values more than 3 standard deviations from column mean)  Multi-column:  col_DOC>col_TOC=‘I’ (in column DOC; flags DOC exceeding TOC)  col_Dry_Weight<(col_Wet_Weight-col_Ash_Weight)*0.90 =’I’ (flags dry weights below 90% wet weight – ash weight)  col_Depth<0=‘I’ (in column Salinity; flags Salinity when Depth < 0)  Compound (Boolean operators):  col_RH_Percent>100&col_Precip 100% except during significant precipitation events)

Quality Control Rule Examples (cont.) Text Comparisons:  “IS”, “NOT” for string literals, “IN”, “NOT IN” for lists  flag_notinlist(x,’Spartina,Juncus,Zizaniopsis’)=‘Q’

Quality Control Rule Examples (cont.) Text Comparisons:  “IS”, “NOT” for string literals, “IN”, “NOT IN” for lists  flag_notinlist(x,’Spartina,Juncus,Zizaniopsis’)=‘Q’ Algorithmic Criteria (custom functions):  fn(columns,parameters)=‘Q’  Various included Q/C functions  pattern checks, geographic checks, specialized algorithms (O2 saturation, etc)  User-defined functions:  Any MATLAB code or “wrapped” calls to FORTRAN, Java, Python, etc  Unlimited scope

Quality Control Rule Examples (cont.) Text Comparisons:  “IS”, “NOT” for strings, “IN”, “NOT IN” for lists  flag_notinlist(x,’Spartina,Juncus,Zizaniopsis’)=‘Q’ Algorithmic Criteria (custom functions):  fn(parameters)=‘Q’  Various included Q/C functions  pattern checks, geographic checks, specialized algorithms (O2 saturation, etc)  User-defined functions:  Any MATLAB code or “wrapped” calls to FORTRAN, Java, Python, etc  Unlimited scope Full suite of MATLAB numeric analysis capabilities supported, and extensible to use other technology

Q/C Rule Management Rule definitions can be defined in metadata “templates”, automatically applied to attributes when raw data imported Rules can also be created, managed using a GUI form

Q/C Flag Assignment Q/C criteria evaluated to assign/clear flags when:  Metadata template applied or Q/C criteria edited  New data records, columns added  Values edited (GUI) or columns updated (CLI)  Evaluation function (dataflag) invoked directly Flags can also be assigned/cleared manually by:  Clicking/dragging on plots with the mouse  Using a spreadsheet-like grid  Importing from text attributes (e.g. 3 rd party codes)  Propagating flags from source column(s) to dependent column(s) Manual assignment locks flags by inserting “manual” token in criteria, removing “manual” restores automatic evaluation

Q/C-Aware Data Management & Analysis Q/C flags can be visualized in data editor grid and plots Flagged values can be selectively removed from data sets Statistics can be generated with/without flagged values Flags can be instantiated as coded text columns for export Flagged, missing values can be summarized by parameter and date for metadata

Q/C-Aware Data Synthesis Flagged, missing values summarized in re-sampled data (aggregated, binned, date-time resampled), with automatic Q/C rule creation Flags automatically “locked” when merging multiple data sets (i.e. unions) All Q/C operations logged to processing history, reported in metadata to document lineage

Implementation Scenarios End-to-End (logger-to-scientist)  Acquire raw data from logger or file system (standard or custom import filters)  Assign metadata from template or using forms to validate and flag data  Review data and fine-tune flag assignments  Generate distribution files & plots, archive data, index for searching  Desktop data management solution Data Pre-processing  Acquire, validate and flag raw data (on demand or timed/triggered)  Upload processed data files (e.g. csv) or value & flag arrays to RDBMS Workflow Step  Call toolbox functions as part of another workflow process, custom program  Kepler MATLAB actor?

Suitability for Real-Time Sensor Data Good Scalability  Data volumes only limited by computer memory (tested >2 GB data sets)  Multiple instances can be run on high-end, 64bit, clustered workstations  Good flag evaluation performance in use, testing with diverse rule sets Good scope for automation  Timed and triggered workflow implementations easy to deploy Support for multiple I/O formats, transport protocols  Formats: ASCII, MATLAB, SQL, XML (partially implemented)  Transport: local file system, UNC paths, HTTP, FTP, SOAP Already used for real-time GCE data, USGS data harvesting service (LTER HydroDB, CWT)

Concluding Remarks Benefits  Flexible, modular design  No qualifier vocabulary, semantics assumed – many purposes, standards  Many operations on flagged values – supports different strategies for archiving and distributing data at different processing levels Limitations  Requires MATLAB  Rule syntax environment-specific – a more open standard would be ideal  Support for XML metadata immature (but more development planned) More information and downloads at: This work was supported by the National Science Foundation under grant numbers OCE and OCE