1
Screening for Abnormal Values in AirBase Datasets
Oliver Kracht and Michel Gerboles
European Commission - Joint Research Centre, I – Ispra (VA)
18th EIONET Workshop on Air Quality Assessment and Management, 24th and 25th October 2013, Dublin - Ireland
2
“Smooth Spatial Attribute Method”
Objectives: Present a prototype screening tool for abnormal values and uncertain classifications of ambient air quality monitoring stations.
Methodology: “Smooth Spatial Attribute Method” (first developed for traffic sensors by Lu et al. 2003 and Shekhar et al. 2003).
Applications: AirBase records of daily PM10 values.
22 February 2019
3
Data availability in Airbase:
Public air quality database system of the European Environment Agency (EEA).
Monitoring data submitted by about 35 participating countries throughout Europe.
140 pollutants, more than stations and time series with hourly and daily data of more than 30 years.
4
Focus of this Exercise:
Records of varying temporal extent from AirBase versions 4 and 7.
Daily PM10 values.
Station type “Background”.
All area types (urban, suburban and rural – to be discussed).
5
“Smooth Spatial Attribute Method”
Proposed for traffic sensors by Lu et al. (2003) and Shekhar et al. (2003).
1st: quantify how the measurement value of a station deviates from the corresponding values observed within its spatio-temporal neighbourhood (the ‘Sx value’).
2nd: compare this Sx deviation to the corresponding Sx deviations observed for the station’s neighbours.
Lu, CH.-T., D. Chen & Y. Kou, 2003: Detecting Spatial Outliers with Multiple Attributes. ICTAI'03, IEEE 2003.
Shekhar, S., CH.-T. Lu & P. Zhang, 2003: A Unified Approach to Detecting Spatial Outliers. GeoInformatica, 7(2),
6
Definition of Neighbourhood in 3 Dimensions:
Spatial domain limited to +/- 1 spherical degree.
Temporal domain limited to +/- 2 days.
Temporal domain is automatically expanded if the initial neighbourhood is too small.
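The neighbourhood definition above can be sketched as follows. This is a minimal Python illustration and not the authors' R code. The record layout, the function name `neighbourhood`, the box-shaped reading of "+/- 1 degree", the exclusion of the central record only, and the `max_days` cap on the expansion are all assumptions made for the example.

```python
from typing import NamedTuple

class Record(NamedTuple):
    station: str
    lat: float    # degrees
    lon: float    # degrees
    day: int      # day index within the series
    value: float

def neighbourhood(records, centre, deg=1.0, days=2, min_points=20, max_days=10):
    """Return (neighbours, half_width) for the record `centre`.

    Neighbours lie within +/- deg degrees and +/- half_width days.
    The temporal half-width starts at `days` and is widened one day at a
    time until `min_points` neighbours are found or `max_days` (an assumed
    cap) is reached.
    """
    half_width = days
    while True:
        found = [
            r for r in records
            if r != centre  # assumption: only the central record is excluded
            and abs(r.lat - centre.lat) <= deg
            and abs(r.lon - centre.lon) <= deg
            and abs(r.day - centre.day) <= half_width
        ]
        if len(found) >= min_points or half_width >= max_days:
            return found, half_width
        half_width += 1

# Hypothetical data: stations A and B are close together, C is too far north.
records = [Record("A", 47.0, 11.0, d, 20.0) for d in range(10)]
records += [Record("B", 47.5, 11.5, d, 22.0) for d in range(10)]
records += [Record("C", 49.5, 11.0, d, 30.0) for d in range(10)]
centre = records[5]  # station A, day 5
found, half_width = neighbourhood(records, centre, min_points=12)
```

With these toy records the initial +/- 2 day window yields too few points, so the window is widened by one day before the requirement is met.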
7
“Smooth Spatial Attribute Method”
Calculation of Sx values (for each individual neighbourhood).
z-transformation of Sx using the mean and standard deviation of Sx (Sxn and sSxn) within a neighbourhood.
Define a reference basis θ (e.g., by applying a KZ filter to the individual zi time series).
Test statistic for abnormal values screening (e.g., threshold value chosen as 1.96).
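The formulas shown on this slide did not survive extraction. The following LaTeX is a hedged reconstruction from the step descriptions, following the general form of the spatial statistic in Shekhar et al. (2003). The symbols x_i (log-transformed value at the central point), N(i) (its spatio-temporal neighbourhood) and w_ij (distance weights) are assumptions introduced for the example.

```latex
% Reconstruction under stated assumptions, not the authors' original notation.
S_x(i) = x_i - \frac{\sum_{j \in N(i)} w_{ij}\, x_j}{\sum_{j \in N(i)} w_{ij}}
\qquad
z_i = \frac{S_x(i) - \overline{S_x}_n}{s_{S_x n}}
\qquad
\text{flag } i \text{ if } \; \lvert z_i - \theta_i \rvert > 1.96
```

Here the mean and standard deviation in the second expression correspond to the slide's Sxn and sSxn, and θ_i is the smoothed reference value at point i.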
8
Example for spatio-temporal outlier screening:
1st step: log transformation of non-Gaussian data.
Remark: AirBase v.4 nomenclature: AT0227A; AirBase v.7 nomenclature: AT30104.
9
Calculate neighbourhood mean.
2nd step: Calculate neighbourhood mean (weighted mean using inverse squared normalized Euclidean distance).
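A minimal Python sketch of such a weighted mean is given below. The slides do not say how the distance is normalized; dividing each axis by its neighbourhood half-width (1 degree, 2 days) is an assumption, as are the function names and the small `eps` guard against zero distances.

```python
import math

def normalised_distance(d_lat, d_lon, d_day, deg=1.0, days=2.0):
    """Euclidean distance after scaling each axis by its half-width (assumed)."""
    return math.sqrt((d_lat / deg) ** 2 + (d_lon / deg) ** 2 + (d_day / days) ** 2)

def weighted_mean(values, distances, eps=1e-6):
    """Mean of `values` weighted by the inverse squared distance."""
    weights = [1.0 / max(d, eps) ** 2 for d in distances]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Two hypothetical neighbours: the closer one dominates the mean.
mean = weighted_mean([2.0, 4.0], [1.0, 2.0])  # weights 1 and 0.25
```

In this toy case the result is (2 + 1) / 1.25 = 2.4, closer to the value of the nearer neighbour.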
10
Calculate Sx within individual neighbourhoods.
3rd step: Calculate Sx within individual neighbourhoods.
11
4th step: For each station, calculate the weighted mean and weighted standard deviation of the Sx values within its neighbourhood (Sxn and sSxn).
12
5th step: Sx values of the central station are Z-normalised (using the Sxn and sSxn of each neighbourhood).
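Steps 3 to 5 can be sketched in Python as below. This is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, the weights are taken as given, and the weighted standard deviation uses the simple (biased) weighted variance, which the slides do not specify.

```python
import math

def sx(value, neighbourhood_mean):
    """Step 3: deviation of a value from its neighbourhood mean."""
    return value - neighbourhood_mean

def weighted_mean(values, weights):
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def weighted_std(values, weights):
    """Step 4: weighted standard deviation (biased form, an assumption)."""
    m = weighted_mean(values, weights)
    var = sum(w * (v - m) ** 2 for w, v in zip(weights, values)) / sum(weights)
    return math.sqrt(var)

def z_score(sx_centre, sx_neighbours, weights):
    """Step 5: Z-normalise the central Sx with the neighbourhood Sxn and sSxn."""
    mean_sx = weighted_mean(sx_neighbours, weights)
    std_sx = weighted_std(sx_neighbours, weights)
    if std_sx == 0.0:
        return None  # no spread in the neighbourhood, z is undefined
    return (sx_centre - mean_sx) / std_sx

# Hypothetical Sx values of four neighbours with equal weights.
z = z_score(0.2, [-0.1, 0.0, 0.1, 0.0], [1.0, 1.0, 1.0, 1.0])
```

Here the neighbours' Sx values average to zero with a standard deviation of about 0.071, so a central Sx of 0.2 gives a z value of roughly 2.83.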
13
6th step: The test for abnormal values searches for zi values exceeding the upper or lower limits around the chosen reference (e.g., θ +/- a predefined threshold of 1.96).
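As a minimal Python sketch of this test, assuming z and the reference θ are available as equally long lists (the function name `flag_abnormal` is hypothetical):

```python
def flag_abnormal(z, theta, threshold=1.96):
    """Return True for each z value lying outside theta +/- threshold."""
    return [abs(zi - ti) > threshold for zi, ti in zip(z, theta)]

# Hypothetical z values and reference values for four days.
z = [0.3, 2.5, -2.2, 1.0]
theta = [0.0, 0.4, 0.0, -1.2]
flags = flag_abnormal(z, theta)
```

Note that the last value is flagged although |z| is only 1.0, because the limits are centred on θ rather than on zero.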
14
Threshold criteria applied in the outlier screening:
Threshold reference θ obtained from low-pass filtering of each station's zi time series.
|zi| exceeding 1.96 not taken into account for computing θ.
Minimum number of data points required within a spatio-temporal neighbourhood (e.g., 20 neighbourhood points).
Minimum number of data points required within a rolling window.
Use a Kolmogorov-Zurbenko filter (with m = 5, k = 3) to obtain a smooth reference θ.
15
Threshold criteria applied in the outlier screening:
… Kolmogorov-Zurbenko filter (with m = 5, k = 3) to obtain a smooth reference θ.
Removes signal components with a periodicity of less than approx. 8.7 days.
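A Kolmogorov-Zurbenko filter is k successive passes of a centred moving average of window length m. The Python sketch below illustrates this together with the exclusion of |zi| > 1.96 described above. It is not the authors' R code: the function names, the use of `None` for missing or excluded values, and the truncated windows at the series ends are assumptions, and the minimum-points requirement for the rolling window is omitted. The commonly quoted rule of thumb for the cut-off period, m times the square root of k, reproduces the 8.7 days stated on the slide.

```python
import math

def moving_average(x, m):
    """Centred moving average of window m, skipping None values."""
    half = m // 2
    out = []
    for i in range(len(x)):
        window = [v for v in x[max(0, i - half): i + half + 1] if v is not None]
        out.append(sum(window) / len(window) if window else None)
    return out

def kz_filter(x, m=5, k=3):
    """Kolmogorov-Zurbenko filter: k iterations of a moving average."""
    y = list(x)
    for _ in range(k):
        y = moving_average(y, m)
    return y

def reference_theta(z, m=5, k=3, threshold=1.96):
    """Smooth reference from z, ignoring values with |z| above the threshold."""
    masked = [v if v is not None and abs(v) <= threshold else None for v in z]
    return kz_filter(masked, m, k)

cutoff_days = 5 * math.sqrt(3)  # rule-of-thumb cut-off period, about 8.66 days

# Hypothetical series with a single spike, which is excluded from the reference.
theta = reference_theta([0.0] * 5 + [5.0] + [0.0] * 5)
```

Because the spike exceeds 1.96 it is masked before filtering, so the reference stays flat and the spike itself remains detectable against it.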
16
Final Example Outcome
Remark: AirBase v.4 nomenclature: AT0227A; AirBase v.7 nomenclature: AT30104.
17
Threshold criteria applied in the outlier screening:
Threshold reference θ obtained from low-pass filtering of each station's zi time series.
|zi| exceeding 1.96 not taken into account for computing θ.
Minimum number of data points required within a spatio-temporal neighbourhood (e.g., 20 neighbourhood points).
Minimum number of data points required within a rolling window.
Use a Kolmogorov-Zurbenko filter (with m = 5, k = 3) to obtain a smooth reference θ.
Non-verifiable
18
Systematic deviation from neighbourhood
19
Automated Data Processing
All code prototyped in the R environment.
Directly coupled to a PostgreSQL database.
20
Inherent challenges in the method:
21
Limited availability of neighbourhood information
Examples: availability of background station records.
Example 1: reasonably distributed spatial neighbourhood.
22
Limited availability of neighbourhood information
Examples: availability of background station records.
Example 2: “asymmetric” spatial neighbourhood.
23
Limited availability of neighbourhood information
Examples: availability of background station records.
Example 2: “asymmetric” spatial neighbourhood.
Consider investigating transboundary datasets.
24
Example 3: changing neighbourhood over time
25
Changing neighbourhood needs to be dynamically accounted for in the automated data processing (-> done).
It may be useful to flag a significant change in the group of stations within a neighbourhood, which could explain the sudden onset of abnormal station values (-> to do).
26
Summary of Outcomes 2006 / 2007 records of AirBase v.4
28
Some more aspects: Reprocessing with longer time series (AirBase v.7)
29
Influence of station-area type selections
Compared selections: urban, suburban and rural; urban and suburban only.
The Sx mean and standard deviation of the neighbourhood change with the selection, causing a change in the normalization and in the reference.
30
Preliminary Results and Conclusions
Processed 2006 / 2007 AirBase records of daily PM10 values for a selection of 8 countries (AT, CZ, DE, ED, FR, GB, IT and NL).
The share of identified abnormal data points typically ranges between 4% and 10% of the records within each individual country dataset.
The share of non-verifiable records typically ranges between 1% and 15% per individual country (limitations of network design, e.g. too few neighbours).
Figures for the share of abnormal data points depend on the parameter values chosen in the screening method.
An absolute definition of abnormal records is not feasible; it depends on the intended objectives for using the method.
31
Preliminary Results and Conclusions
Demonstrated extension to longer time series with AirBase v.7.
We anticipate that the screening tool can be a useful AirBase post-processing tool for:
Modellers
Preparation of data summaries
Spatial and temporal trend analysis
Statistical evaluations
It may also support QA/QC with a short feedback cycle for network operators when implemented in real-time or near-real-time mode.
32
Open Questions
Is there a need to derive a harmonized set of screening tool parameters through collaboration? Or is it better to leave this open to the end-user's choice?
Adjustable parameter settings:
spatial domain: +/- 1 spherical degree
temporal domain: +/- 2 days
temporal domain automatically expanded if the neighbourhood is too small
test statistic: θ +/- predefined threshold of 1.96
threshold reference θ obtained from low-pass filtering of zi time series; |zi| exceeding 1.96 not taken into account for computing θ
minimum number of data points within a spatio-temporal neighbourhood: 20
minimum number of data points required within a rolling window
KZ filter with m = 5, k = 3
33
Open Questions
Identify the circle of interested end-users (EIONET community, modelling community (FAIRMODE), EEA …).
Is there a need to derive a harmonized set of screening tool parameters through collaboration? Or is it better to leave this open to the user's choice?
How to report results (graphs / tables / quantitative point-by-point information / simple flagging / aggregated statistics)?
Structure for future implementations?
34
Thank you for your attention!
35
Comparison with a conventional outlier screening approach
Step 1: calculate neighbourhood standard deviation
36
“Classical approach” Step 2: use the Z-score of log-transformed measurement values as an outlier criterion?
37
Use the Z-score of log-transformed values as an outlier criterion?
The Z-score of log-transformed values does not provide a conclusive outlier criterion for this application.
Spatial and temporal trends cannot be considered in this way.
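For comparison, the conventional criterion can be sketched in Python as follows. This is a minimal illustration under assumptions: the function name is hypothetical and the neighbourhood mean and standard deviation are taken unweighted, which the slides do not specify.

```python
import math
import statistics

def log_z_score(value, neighbour_values):
    """Z-score of log(value) against the log-transformed neighbourhood values."""
    logs = [math.log(v) for v in neighbour_values]
    mean = statistics.fmean(logs)
    std = statistics.stdev(logs)  # sample standard deviation
    return (math.log(value) - mean) / std

# Hypothetical PM10 values: neighbours at 10, 20 and 40, central station at 80.
z_classic = log_z_score(80.0, [10.0, 20.0, 40.0])
```

The score compares the raw level of a station with its neighbours, so a station that is persistently higher than its surroundings scores high on every day, which illustrates why spatial and temporal trends are not accounted for by this criterion.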