Benchmark dataset processing P. Štěpánek, P. Zahradníček Czech Hydrometeorological Institute (CHMI), Regional Office Brno, Czech Republic, COST-ESO601.

Benchmark dataset processing P. Štěpánek, P. Zahradníček Czech Hydrometeorological Institute (CHMI), Regional Office Brno, Czech Republic, COST-ESO601 meeting, Tarragona, 9-11 March 2009 E-mail: petr.stepanek@chmi.cz

Outline Outliers (daily data) Detection (monthly data) + correction (daily data) ( description of methodology ) Benchmark dataset (monthly data)

Processing before any data analysis Software AnClim, ProClimDB

Data Quality Control Finding Outliers Two main approaches: Using limits derived from interquartile ranges (time series) comparing values to values of neighbouring stations (spatial analysis)

Example of outputs for outliers assessment Altitudes and distances of neighbours List of neighbours Neighbour stations values Expected value Suspicious values

Spatial distribution Quality control Run for period 1961-2007, daily data (measured values in observation hours) All stations (200 climatological stations, 800 precipitation stations) All meteorological elements (T, TMA, TMI, TPM, SRA, SCE, SNO, E, RV, H, F) – parameters set individually Historical records will follow now

Air temperature, number of outliers 1961-2007, from 3.431.000 station-days T – air temperature at obs. hour, TMA – daily maximum temp., TMI – daily min. temp., TPM – daily ground minimum temp.

Air temperature, number of outliers 1961-2007, from 3.431.000 station-days Air temperature at obs. hour, AVG – daily average temp.

Air temperature, number of outliers 1961-2007, from 3.431.000 station-days TMA – daily maximum temp., TMI – daily min. temp., TPM – daily ground minimum temp.

Air temperature, number of outliers 1961-2007, Number of outliers per one station (all observation hours, AVG)

Spatial distribution Spatial distribution of precipitation stations period 1961-2007 600 stations mean minimum distance: 7.5 km

Problematic detections (heavy rainfall)

Problematic detections (heavy rainfall), Radar information

Precipitation, number of outliers 1961-2007, from 13.724.000 station-days.

Precipitation, number of outliers 1961-2007, Number of outliers per one station

Conclusions Only combination of several methods for outliers detection leads to satisfying results (“real” outliers detection, supressing fault detection -> Emsemble approach) Parameters (settings) has to be found individually for each meteorological element, maybe also region (terrain complexity) and part of a year (noticeable annual cycle in number of outliers) Similar to homogenization of time series, it is important to use measured value (e.g. from observation hours) - outliers are masked in daily average (and even more in monthly or annual ones) Errors found in all elements and investigated countries (AT, CZ, SK, HU)

Experiences with homogenization in the Czech Republic

Homogenization Change of measuring conditions inhomogeneities

Detection A DetectingInhomogeneities by SNHT (p=0.05, 950 series) Detecting Inhomogeneities by SNHT (p=0.05, 950 series)

most of metadata incomplete we depend upon statistical tests results uncertainty in test results - right inhomogeneity detection is problematic (for smaller amount of change) Assessing Homogeneity - Problems

SOLUTION For each year, group of years and whole series: portion of number of detected inhomogenities in number of all possible (theoretical) detections Proposed solution most of metadata incomplete uncertainty in test results - the right inhomogeneity detection is problematic „Ensemble“ approach - processing of big amount of test results for each individual series To get as many test results for each candidate series as possible

Diagram Days, Months, seasons, year How to increase number of test results

for monthly, daily data (each month individually) weighted /unweighted mean from neighbouring stations criterions used for stations selection (or combination of it): –best correlated / nearest neighbours (correlations – from the first differenced series) –limit correlation, limit distance –limit difference in altitudes neighbouring stations series should be standardized to test series AVG and / or STD (temperature - elevation, precipitation - variance) - missing data are not so big problem then Creating Reference Series

Relative homogeneity testing Available tests: –Alexandersson SNHT –Bivariate test of Maronna and Yohai –Mann – Whitney – Pettit test –t-test –Easterling and Peterson test –Vincent method –…–… 20 year parts of the daily series (40 for monthly series with 10 years overlap), in SNHT splitting into subperiods in position of detected significant changepoint (30-40 years per one inhomogeneity)

Homogeneity assessment Quality control Homogenization Data Analysis Output example: Station Čáslav, 3rd segment, 1911-1950, n=40

Homogeneity assessment Homogeneity assessment, Homogeneity assessment, Output II example: Summed numbers of detections for individual years

Homogeneity assessment combining several outputs (sums of detections in individual years, metadata, graphs of differences/ratios, …)

Adjusting monthly data using reference series based on correlations adjustment: from differences/ratios 20 years before and after a change, monhtly smoothing monthly adjustments (low-pass filter for adjacent values)

Example: Adjusting values - evaluation Example combining series

Iterative homogeneity testing several iteration of testing and results evaluation –several iterations of homogeneity testing and series adjusting (3 iterations should be sufficient) –question of homogeneity of reference series is thus solved: possible inhomogeneities should be eliminated by using averages of several neighbouring stations if this is not true: in next iteration neighbours should be already homogenized

Filling missing values Before homogenization: influence on right inhomogeneity detection After homogenization: more precise - data are not influenced by possible shifts in the series Dependence of tested series on reference series

Homogenization of the series in the Czech Republic

Correlations between tested and reference series Air temperature Graphs - selection Boxplots: - Median - Upper and lower quartiles (for 200 testes series)

Correlations between tested and reference series Precipitation, snow depth, new snow Graphs - selection Boxplots: - Median - Upper and lower quartiles (for 800 testes series)

Correlations between tested and reference series Wind speed Graphs - selection Boxplots: - Median - Upper and lower quartiles (for 200 testes series)

Number of significant inhomogeneities (0.05) detected by used tests (A, B tests, c and d reference series, alltogether) Homogeneity testing results Air temperature

Amount of adjustments, averages of absolute values, T_AVG Homogeneity testing results Air temperature

Homogeneity testing results Precipitation 4 tests, 4 reference series, 12 months + 4 seasons and year Number of detected inhomogeneities (significant)

Amount of change (ratios – standardized to be >1.0 ), precipitation ( reference series calculation based on correlations ) Graphs - adjsust Boxplots: - Median - Upper and lower quartiles (for 589 testes series) Correlation improvement

Homogenization Final remarks, recommendations 1/2 data quality control before homogenization is of very importance (if it is not part of it) Using series of observation hours ( complementarily to daily AVG) is highly recommended (different manifestation of breaks) be aware of annual cycle of inhomogeneities, adjustments, … to know behavior of spatial correlations (of element being processed) to be able to create reference series of sufficient quality …

Homogenization Final remarks, recommendations 2/2 Because of Noise in the time series it makes sense : - „Ensemble“ approach to homogenization (combining information from different statistical tests, time frames, overlapping periods, reference series, meteorological elements, …) - more information for inhomogeneities assessment – higher quality of homogenization in case metadata are incomplete

Software used for data processing LoadData - application for downloading data from central database (e.g. Oracle) ProClimDB software for processing whole dataset (finding outliers, combining series, creating reference series, preparing data for homogeneity testing, extreme value analysis, RCM outputs validation, correction, …) AnClim software for homogeneity testing http://www.climahom.eu

AnClim software

ProcData software ProClimDB software

Testing of benchmark dataset Fully automated (just for one click), but it should not work like this in reality: detection phase can be fully automated, but desicion about breaks to be corrected should be man-made (comparison with metadata, plots of differences – ratios, …)

Fully automatic detection phase in ProClimDB software selecting neighbours for reference series

Fully automatic detection phase in ProClimDB software reference series calculation and launching AnClim

Automation in AnClim, tests can be selected in advance

Test results processed back in ProClimDB

Non-automated desicion about breaks before correction

http://www.climahom.eu

Benchmark dataset processing P. Štěpánek, P. Zahradníček Czech Hydrometeorological Institute (CHMI), Regional Office Brno, Czech Republic, COST-ESO601.

Similar presentations

Presentation on theme: "Benchmark dataset processing P. Štěpánek, P. Zahradníček Czech Hydrometeorological Institute (CHMI), Regional Office Brno, Czech Republic, COST-ESO601."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Benchmark dataset processing P. Štěpánek, P. Zahradníček Czech Hydrometeorological Institute (CHMI), Regional Office Brno, Czech Republic, COST-ESO601.

Similar presentations

Presentation on theme: "Benchmark dataset processing P. Štěpánek, P. Zahradníček Czech Hydrometeorological Institute (CHMI), Regional Office Brno, Czech Republic, COST-ESO601."— Presentation transcript:

Similar presentations

About project

Feedback