Automated QA/QC Technique for Climate Sensor Data
EPSCoR Hawaii HGDR Scientific Data Management Portal Development Team
TOC
QA/QC Requirements
Detecting Outliers
– Types of Outliers
– Detection Methods
– Statistical Correlation Functions
– QuaT Correlational Method
Data Mining for further automation
QA/QC Requirements
Detect abnormal data and outliers
Correct abnormal data and outliers where possible
Find additional properties and correlations among variables
– To catch changes over time
Detecting Outliers
Types of Outliers
– Correctable Outliers
Caused by calibration, sensor cleaning, low battery voltage, erroneous sensor installation, etc.
Outliers caused by these factors can be corrected
– Error Values
Missing or impossible values caused by sensor failure: physical damage or other irreversible effects
This type of outlier cannot be corrected and must be discarded
Detecting Outliers
Detection Methods
1. Normal value range check (single variable)
2. Diurnal pattern check (single variable)
3. Correlational pattern check (multiple variables)
4. Additional methods can be found by data mining
Normal value range check
For example, a humidity reading over 100% does not make sense.
Regional and seasonal factors must also be considered.
Knowledge Required
– Known/valid normal value ranges for all variables
– Subsets of those normal value ranges for different regions or seasons
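A minimal sketch of the range check in Python. The variable names and limits in NORMAL_RANGES are hypothetical placeholders, not values from the project; real limits would come from domain knowledge and could be keyed by region or season as well.

```python
from typing import Optional

# Hypothetical limits for illustration only; real ranges must be supplied
# by domain experts and may differ by region or season.
NORMAL_RANGES = {
    "relative_humidity_pct": (0.0, 100.0),
    "air_temperature_c": (-10.0, 45.0),
}


def in_normal_range(variable: str, value: Optional[float]) -> bool:
    """Return True if value lies inside the configured range for variable."""
    if value is None:
        return False  # a missing reading is treated as a failed check
    low, high = NORMAL_RANGES[variable]
    return low <= value <= high


readings = [55.2, 99.8, 104.3, None]
flags = [in_normal_range("relative_humidity_pct", r) for r in readings]
print(flags)  # [True, True, False, False]
```

A reading that fails this check is only flagged here; whether it is a correctable outlier or an error value has to be decided separately.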
Diurnal pattern check
Radiation should be high during the day and low at night
Knowledge Required
– Known/valid diurnal pattern
– Different diurnal patterns for all variables in different regions or seasons
Challenges
– How to slice time
– What value ranges count as high, average, or low for each variable: is taking the standard deviation enough?
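One possible answer to the two challenges, sketched under stated assumptions: time is sliced by hour of day, and a reading counts as abnormal when it lies more than k standard deviations from the mean for its hour. The function names, the hourly slicing, and k = 3 are illustrative choices, not settled project decisions.

```python
from collections import defaultdict
from statistics import mean, stdev


def hourly_profile(history):
    """Build {hour: (mean, stdev)} from (hour, value) pairs of trusted data."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    # stdev needs at least two samples, so sparser hours are skipped
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}


def breaks_diurnal_pattern(profile, hour, value, k=3.0):
    """Flag a reading more than k standard deviations from its hourly mean."""
    mu, sigma = profile[hour]
    return abs(value - mu) > k * sigma


# Made-up radiation history: four noon readings and four midnight readings.
history = [(12, 800.0), (12, 820.0), (12, 790.0), (12, 810.0),
           (0, 0.0), (0, 1.0), (0, 0.0), (0, 2.0)]
profile = hourly_profile(history)

print(breaks_diurnal_pattern(profile, 0, 600.0))   # high radiation at midnight
print(breaks_diurnal_pattern(profile, 12, 815.0))  # ordinary noon reading
```

A separate profile per region or season would follow the same shape, with the dictionary key extended accordingly.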
Correlational pattern check
For example, radiation and temperature should be correlated
Knowledge Required
– Known correlations between the variables
How can we verify the correlations?
– Correlation functions from statistics will be useful
– A method called QuaT might also be useful for analyzing how similar the trends of two variables are over time
Additional Analyses
4. Data mining may suggest additional methods
– Finding additional correlations
– Value range change over time (global climate change)
Statistical Functions
– Pearson’s Product Moment
– Spearman’s Rank Correlation
– Kendall’s Rank Correlation
Pearson’s Product Moment
Pearson’s is appropriate only for parametric (normally distributed) datasets
– The dataset needs to be tested for normality before it can be analyzed
– Normality test: Shapiro-Wilk normality test
If a dataset is determined to be non-parametric, use either or both of Spearman’s and Kendall’s instead
– Outliers also reduce the reliability of Pearson’s
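To illustrate the sensitivity to outliers, here is a small pure-Python sketch of Pearson's product-moment coefficient applied to made-up radiation and temperature values. The normality test is not implemented here; in practice a statistics library would be used for it (SciPy, for example, provides scipy.stats.shapiro).

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Made-up values with a perfectly linear relationship.
radiation = [100, 200, 300, 400, 500, 600]
temperature = [20.0, 21.0, 22.0, 23.0, 24.0, 25.0]
clean = pearson_r(radiation, temperature)

# Replace the last reading with a single bad value (e.g. a sensor fault).
spiked = temperature[:-1] + [-40.0]
with_outlier = pearson_r(radiation, spiked)

print(clean, with_outlier)
```

With the clean data the coefficient is 1 (up to floating-point rounding); the single bad reading is enough to make it negative, which is why outliers should be handled before Pearson's is trusted.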
Spearman’s & Kendall’s Correlation
If a dataset is not parametric, these correlation functions can be used
Both operate on the ranks of the values rather than the raw values
– Spearman’s: based on the differences between the ranks that paired observations hold in the two variables
– Kendall’s: based on the proportion of pairs of observations that are ordered the same way (concordant) versus the opposite way (discordant) in the two variables
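Both rank correlations can be sketched in a few lines of pure Python. This is an illustration under simple assumptions: Spearman's is computed as Pearson's correlation of the ranks (ties receive their average rank), and Kendall's is the tau-a variant, which does not adjust for ties.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson's product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            score += (s > 0) - (s < 0)  # +1 concordant, -1 discordant, 0 tie
    return score / (n * (n - 1) / 2)


x = [1, 2, 3, 4, 5]
monotone = [1, 4, 9, 16, 1000]  # non-linear but strictly increasing
one_swap = [2, 1, 3, 4, 5]      # one adjacent pair out of order

print(spearman_rho(x, monotone), kendall_tau(x, monotone))
print(spearman_rho(x, one_swap), kendall_tau(x, one_swap))
```

Because only ranks matter, the extreme value 1000 in the monotone series does not reduce either coefficient, in contrast to Pearson's.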
QuaT
An algorithm to determine the similarity of two trend curves
Introduced by A. Okabe and A. Masuyama of the University of Tokyo in “A robust exploratory method for qualitative trend curve analysis”
QuaT - Basic steps of the algorithm
1. Find the peaks and bottoms of the curves being compared
2. Calculate the height of each peak
3. Determine the distinct height (a threshold height) and extract the peaks whose height is greater than or equal to it. In other words, ignore less distinct peaks
4. Compare the extracted peaks and determine whether the peaks of the two variables’ curves occur at the same times and in the same order of magnitude
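The four steps can be sketched roughly as follows. This is a simplified illustration written from the step list alone, not the published algorithm: it assumes that a peak's height is measured from the higher of its two neighbouring bottoms, that the series endpoints count as bottoms, and that each variable gets its own threshold because the units differ. All function names are hypothetical.

```python
def find_peaks(series):
    """Step 1a: indices of strict local maxima (interior points only)."""
    return [i for i in range(1, len(series) - 1)
            if series[i - 1] < series[i] > series[i + 1]]


def find_bottoms(series):
    """Step 1b: indices of strict local minima, plus both endpoints."""
    inner = [i for i in range(1, len(series) - 1)
             if series[i - 1] > series[i] < series[i + 1]]
    return [0] + inner + [len(series) - 1]


def distinct_peaks(series, threshold):
    """Steps 2-3: return (index, height) for peaks at least threshold high."""
    bottoms = find_bottoms(series)
    result = []
    for p in find_peaks(series):
        left = max(b for b in bottoms if b < p)   # nearest bottom before p
        right = min(b for b in bottoms if b > p)  # nearest bottom after p
        height = series[p] - max(series[left], series[right])
        if height >= threshold:
            result.append((p, height))
    return result


def similar_trends(a, b, threshold_a, threshold_b):
    """Step 4: same peak times and same ordering of peak heights."""
    peaks_a = distinct_peaks(a, threshold_a)
    peaks_b = distinct_peaks(b, threshold_b)
    if [i for i, _ in peaks_a] != [i for i, _ in peaks_b]:
        return False
    order_a = sorted(range(len(peaks_a)), key=lambda k: peaks_a[k][1])
    order_b = sorted(range(len(peaks_b)), key=lambda k: peaks_b[k][1])
    return order_a == order_b


# Made-up series sampled at the same time steps.
radiation = [0, 5, 1, 9, 2, 3, 2.5, 0]
temperature = [20, 23, 21, 26, 22, 22.4, 22.2, 20]
shifted = [20, 21, 23, 22, 26, 22, 22, 20]  # same shape, peaks one step later

print(similar_trends(radiation, temperature, 2.0, 1.0))
print(similar_trends(radiation, shifted, 2.0, 1.0))
```

Requiring identical peak indices is a strict reading of "at the same times"; a tolerance window would be a natural relaxation for sensors that respond with a lag.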
Basic Relationships among and between the Variables
Variable categories
– Radiation (short, long, net, PAR)
– Rainfall (humidity, soil moisture)
– Temperature (air, surface, body)
– Wind (speed, direction)

Affecting          | Relationship | Affected             | Specific Variable
Radiation Category | direct       | Temperature Category |
Radiation Category | affect       | Wind Category        |
Radiation Category | inverse      | Rainfall Category    | Soil Moisture
Rainfall Category  | inverse      | Radiation Category   |
Rainfall Category  | inverse      | Temperature Category |
Rainfall Category  | affect       | Wind Category        |
                   | inverse      | Temperature Category |
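The relationship table could drive the correlational check as a simple lookup. The sketch below assumes that "direct" means a positive correlation is expected, "inverse" a negative one, and "affect" no particular sign; the min_strength cutoff is a hypothetical parameter. The final table row, whose affecting category is not given, is left out.

```python
# Expected relationships between variable categories, taken from the table.
EXPECTED = {
    ("radiation", "temperature"): "direct",
    ("radiation", "wind"): "affect",
    ("radiation", "rainfall"): "inverse",
    ("rainfall", "radiation"): "inverse",
    ("rainfall", "temperature"): "inverse",
    ("rainfall", "wind"): "affect",
}


def consistent_with_table(affecting, affected, r, min_strength=0.3):
    """Check an observed correlation coefficient r against the expected sign.

    min_strength is an illustrative cutoff, not a value from the project.
    """
    relation = EXPECTED.get((affecting, affected))
    if relation == "direct":
        return r >= min_strength
    if relation == "inverse":
        return r <= -min_strength
    return True  # "affect" or unlisted pair: no sign expectation to test


print(consistent_with_table("radiation", "temperature", 0.8))
print(consistent_with_table("rainfall", "temperature", 0.5))
```

A coefficient from any of the correlation functions (Pearson's, Spearman's, or Kendall's) could be passed as r, since all three share the same sign convention.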