Benchmark database inhomogeneous data, surrogate data and synthetic data Victor Venema.


Victor Venema, COST HOME, March 2009, Tarragona, Spain

Content
• Introduction to benchmark dataset
• Some results
• Some questions about exercise
• Question about future work
• Analyse and publish the results

Benchmark dataset
1) Real (inhomogeneous) climate records
   • Most realistic case
   • Investigate whether the various homogenisation algorithms (HA) find the same breaks
2) Synthetic data
   • For example, Gaussian white noise
   • Insert known inhomogeneities
   • Test performance
3) Surrogate data
   • Empirical distribution and correlations
   • Insert known inhomogeneities
   • Compare to synthetic data: test of assumptions

Creation of the benchmark – outline of the talk
1) Start with homogeneous data
2) Multiple surrogate and synthetic realisations
3) Mask surrogate records
4) Add global trend
5) Insert inhomogeneities in station time series
6) Published on the web
7) Homogenised by COST participants and third parties
8) Analyse the results and publish

1) Start with homogeneous data
• Monthly mean temperature and precipitation
• Later also daily data (WG4), maybe other variables (pressure, wind)
• Homogeneous, no missing data
• Longer surrogates are based on multiple copies
• Generated networks are 100 a (years) long

2) Multiple surrogate realisations
• Multiple surrogate realisations
  – Temporal correlations
  – Station cross-correlations
  – Empirical distribution function
• Annual cycle removed before, added at the end
• Number of stations: 5, 9 or 15
• Cross correlation varies as much as possible
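The step "annual cycle removed before, added at the end" can be sketched as follows. This is only an illustration under stated assumptions: the annual cycle is taken to be the mean of each calendar month, the series is assumed to start in January and to contain at least 12 values, and the function names are hypothetical. The slides do not say how the benchmark actually estimated the cycle.

```python
def monthly_climatology(series):
    """Mean value of each calendar month (assumes the series starts in January)."""
    sums = [0.0] * 12
    counts = [0] * 12
    for i, value in enumerate(series):
        sums[i % 12] += value
        counts[i % 12] += 1
    return [s / c for s, c in zip(sums, counts)]


def remove_annual_cycle(series):
    """Return (anomalies, climatology); the climatology is kept to restore the cycle later."""
    clim = monthly_climatology(series)
    anomalies = [value - clim[i % 12] for i, value in enumerate(series)]
    return anomalies, clim


def add_annual_cycle(anomalies, clim):
    """Add the stored monthly climatology back to an anomaly series."""
    return [a + clim[i % 12] for i, a in enumerate(anomalies)]
```

In such a scheme the surrogate generation would operate on the anomalies, and `add_annual_cycle` would be applied to each surrogate realisation at the end.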

5) Insert inhomogeneities in stations
• Independent breaks
• Determined at random for every station and time
• 5 breaks per 100 a
• Monthly slightly different perturbations
• Temperature
  – Additive
  – Size: Gaussian distribution, σ=0.8°C
• Rain
  – Multiplicative
  – Size: Gaussian distribution, μ=1, σ=10%

Example break perturbations station

Example break perturbations network
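A minimal sketch of inserting independent breaks, using the numbers on the slide (5 breaks per 100 a, additive Gaussian with σ=0.8°C for temperature, multiplicative Gaussian with mean 1 and σ=10% for rain). Several points are assumptions, not taken from the slides: every month has the same break probability, each segment after a break gets a freshly drawn perturbation, the segment before the first break is left unperturbed, and the "monthly slightly different perturbations" are omitted. The function names are hypothetical.

```python
import random

# 5 breaks per 100 a of monthly data, as a per-month probability
BREAK_PROB_PER_MONTH = 5 / (100 * 12)


def draw_break_positions(n_months, rng):
    """Draw break positions independently for every month (assumed Bernoulli process)."""
    return [t for t in range(1, n_months) if rng.random() < BREAK_PROB_PER_MONTH]


def insert_breaks(series, rng, kind="temperature"):
    """Return (perturbed series, break positions).

    Assumption: each segment after a break receives an independent
    perturbation; the first segment is left as it is.
    """
    if kind not in ("temperature", "rain"):
        raise ValueError("kind must be 'temperature' or 'rain'")
    positions = draw_break_positions(len(series), rng)
    out = list(series)
    for k, start in enumerate(positions):
        end = positions[k + 1] if k + 1 < len(positions) else len(series)
        if kind == "temperature":
            offset = rng.gauss(0.0, 0.8)      # additive, sigma = 0.8 degC
            for t in range(start, end):
                out[t] = series[t] + offset
        else:
            factor = rng.gauss(1.0, 0.10)     # multiplicative, mean 1, sigma = 10 %
            for t in range(start, end):
                out[t] = series[t] * factor
    return out, positions
```

For example, `insert_breaks(temperature_series, random.Random(0))` returns the perturbed series together with the true break positions, which is the "known truth" the later analysis needs.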

5) Insert inhomogeneities in stations
• Correlated break in network
• One break in 10 % of networks
• In 30 % of the stations simultaneously
• Position random
  – At least 10 % of data points on either side

Example correlated break
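The selection rules for a correlated break can be sketched as below. The slide gives the frequencies (10 % of networks, 30 % of stations) and the position constraint (at least 10 % of the data on either side); it does not give the size of the break, so the sketch only draws the position and the affected stations. The rounding of "30 % of the stations" to a whole number, the `probability` parameter and the function name are assumptions made for illustration.

```python
import math
import random


def draw_correlated_break(n_stations, n_months, rng, probability=0.10):
    """Return (position, affected stations) or None if this network gets no break.

    position is the index of the first month after the break.
    """
    if rng.random() >= probability:
        return None
    # 30 % of the stations, rounded half up with integer arithmetic, at least one
    n_affected = max(1, (30 * n_stations + 50) // 100)
    stations = sorted(rng.sample(range(n_stations), n_affected))
    # keep at least 10 % of the data points on either side of the break
    margin = math.ceil(0.10 * n_months)
    position = rng.randint(margin, n_months - margin)
    return position, stations
```

With the network sizes from the slides (5, 9 or 15 stations) this rounding gives 2, 3 or 5 affected stations.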

5) Insert inhomogeneities in stations
• Outliers
• Size
  – Temperature: 99th percentile
  – Rain: 99.9th percentile
• Frequency
  – 50 % of networks: 1 %
  – 50 % of networks: 3 %

Example outlier perturbations station

Example outliers network
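One possible reading of the outlier rule for temperature is sketched below. The slide only states "99th percentile" and a frequency of 1 % or 3 %; how the percentile is turned into a perturbation is not specified. The sketch assumes that the outlier size is the 99th percentile of the absolute deviations from the series mean, that it is added with a random sign, and that the frequency is the fraction of data points affected. The function name is hypothetical, and rain (99.9th percentile) is not covered.

```python
import random
import statistics


def insert_outliers(series, frequency, rng):
    """Return (series with outliers, outlier positions) for a temperature series.

    Assumption: outlier size = 99th percentile of |value - mean|, random sign.
    """
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must be a fraction between 0 and 1")
    mean = statistics.fmean(series)
    abs_anomalies = [abs(value - mean) for value in series]
    # quantiles(n=100) returns 99 cut points; the last one is the 99th percentile
    size = statistics.quantiles(abs_anomalies, n=100)[98]
    n_outliers = round(frequency * len(series))
    positions = sorted(rng.sample(range(len(series)), n_outliers))
    out = list(series)
    for t in positions:
        out[t] += rng.choice((-1.0, 1.0)) * size
    return out, positions
```

A call such as `insert_outliers(series, 0.01, random.Random(0))` perturbs 1 % of the months; using `0.03` instead corresponds to the second half of the networks.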

5) Insert inhomogeneities in stations
• Local trends (only temperature)
• Linear increase or decrease in one station
• Duration: between 30 and 60 a
• Maximum size: Gaussian distribution, σ=0.8°C
• Frequency: once in 10 % of the stations

Example local trends
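A local trend perturbation with the stated duration (30 to 60 a) and maximum size (Gaussian, σ=0.8°C) could look like the sketch below. It assumes monthly data, a uniformly drawn duration and start, and that the perturbation stays at its final level after the trend ends; the slides do not specify these details. The 10 % station frequency is left to the caller, and the function name is hypothetical.

```python
import random


def local_trend_perturbation(n_months, rng):
    """Return (perturbation, begin index, end index) for one station.

    The perturbation ramps linearly from 0 to max_size between begin and end
    and (assumption) keeps that level afterwards.
    """
    duration = rng.randint(30 * 12, 60 * 12)   # 30 to 60 a, in months
    if duration > n_months:
        raise ValueError("series is shorter than the drawn trend duration")
    begin = rng.randint(0, n_months - duration)
    max_size = rng.gauss(0.0, 0.8)             # sign decides increase or decrease
    perturbation = [0.0] * n_months
    for t in range(begin, n_months):
        fraction = min(1.0, (t - begin + 1) / duration)
        perturbation[t] = fraction * max_size
    return perturbation, begin, begin + duration
```

Adding the returned perturbation to a homogeneous temperature series gives the inhomogeneous version; the begin and end indices correspond to what a participant would report as BEGTR and ENDTR.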

6) Published on the web
• Inhomogeneous data are published on the COST-HOME homepage
• Everyone is welcome to download and homogenize the data
• venema/themes/homogenisation

7) Homogenize by participants
• Return homogenised data
• Should be in COST-HOME file format (next slide)
  – For real data including quality flags
• Return break detection file
  – BREAK
  – OUTLI
  – BEGTR
  – ENDTR
• Multiple breaks at one date possible

Typical errors
• The file format needs to be perfect!
• Forgetting the station-file that describes which stations belong to the homogenised network
• Changing the file names in this station file to homogeneous data files
• (Forgetting to return the files with the quality flags)
• The sizes of the breaks are not in the break file
• Please, keep directory structure of the benchmark like it is, also for partial contributions
  – The only difference is the main directory
• All files are tab-delimited ASCII files

COST-HOME file format – network file

Detected breaks file

COST-HOME file format – monthly data

Contributions

Participant                            Algorithm                Remarks
1. José Guijarro                       Climatol                 6 versions with different settings
2. Péter Domonkos                      CM-D, MASH-D, SNHT-D     3 versions / detection algorithms
3. Michele Brunetti                    Brunetti                 Detection Craddock based; 2 surrogate temp. networks
4. Dubravka Rasol & Olivier Mestre     PRODIGE                  All surrogate temp.; 13 surrogate precip. networks
5. Matthew Menne & Claude Williams     Automated pairwise hom.  2 versions; "all" temp. networks (part of real #3 is missing)
6. Christine Gruber & Ingeborg Auer    HOCLIS                   1 surrogate temp. & 1 surrogate precip.
7. Gregor Vertacnik                    MASH                     All surrogate temp.
8. Petr Stepanek                       AnClim                   1 surrogate temp. & 1 surrogate precip.
9. Lucie Vincent                       Vincent                  1 surrogate temp.
10. Enric Aguilar                      SNHT                     Not in the right format yet

No. homogenised networks – algorithm
Table 1. Number of homogenised networks per algorithm

Homogenisation alg.  All networks  Real netw.  Surrogate netw.  Synthetic netw.
PRODIGE              32            0                            0
Brunetti             2             0           2                0
MASH                 25            0                            0
Vincent              1             0           1                0
HOCLIS               1             0           1                0
AnClim               2             0           2                0
Climatol A
Climatol C
Climatol D
Climatol E
Climatol F
ClimatolG
APHa
APHa
CM-D                 5             0           5                0
MASH-D               5             0           5                0
SNHT-D               5             0           5                0

No. homogenised networks – input data
Table 3. Summary data: Number of homogenised networks per network

Network           No. networks  Temp. netw.  Precip. netw.
All
Real
Surrogate
Surrogate #
Surrogate ~#
Synthetic
Synthetic #11275
Synthetic ~#

Mean no. outliers per station
Table 21. Mean number of outliers per station for every algorithm

Homogenisation alg.  All networks  Real netw.  Surrogate netw.  Synthetic netw.
PRODIGE              0.0           NaN         0.0              NaN
Brunetti             3.4           NaN         3.4              NaN
MASH                 16.1          NaN         16.1             NaN
Vincent              0.0           NaN         0.0              NaN
HOCLIS               6.0           NaN         6.0              NaN
AnClim               5.5           NaN         5.5              NaN
Climatol A
Climatol C
Climatol D
Climatol E
Climatol F
ClimatolG01          4.4           NaN         4.4              NaN
APHa
APHa
CM-D                 15.7          NaN         15.7             NaN
MASH-D               15.5          NaN         15.5             NaN
SNHT-D               15.3          NaN         15.3             NaN

Mean no. breaks per station
Table 22. Mean number of breaks per station for every algorithm

Homogenisation alg.  All networks  Real netw.  Surrogate netw.  Synthetic netw.
PRODIGE              2.7           NaN         2.7              NaN
Brunetti             5.0           NaN         5.0              NaN
MASH                 4.6           NaN         4.6              NaN
Vincent              0.0           NaN         0.0              NaN
HOCLIS               2.8           NaN         2.8              NaN
AnClim               1.2           NaN         1.2              NaN
Climatol A
Climatol C
Climatol D
Climatol E
Climatol F
ClimatolG01          1.2           NaN         1.2              NaN
APHa
APHa
CM-D                 4.6           NaN         4.6              NaN
MASH-D               3.9           NaN         3.9              NaN
SNHT-D               3.2           NaN         3.2              NaN

Homogenising the exercise
• Tab-delimited files: also space-delimited?
  – Mixture of strings and numbers
• Data quality files only for real data section
• Do we want to use the Diurnal Temperature Range (DTR)?
  – Not useful for surrogate and synthetic data!
  – If we do, everyone should do it
• End or begin uncorrected?
  – Compute statistics independent of absolute level?
• Is filling missing values part of the exercise?
• Human quality control or raw algorithm output?
• Homogenise all networks and times, or only the homogenisable ones?

Contributions – who is missing?
(The table of participants, algorithms and remarks from the earlier Contributions slide is shown again.)

Analysing the results
• What measures define a well homogenised dataset?
  – Real data vs. data with known truth
     • Ensemble mean for real data?
  – Breaks
     • Position, hit rate
     • Size distribution
     • Detection probability as function of size
  – Data itself
     • Root mean square error (RMSE)
     • RMSE (without outliers)
     • RMSE (bias corrected)
     • Uncertainty in the network mean trend
• How to study which components are best?
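Some of the candidate measures can be written down in a few lines. The sketch below shows the RMSE against the known truth, a bias-corrected RMSE (the mean difference is removed first, so the measure is independent of the absolute level) and a hit rate for break positions. The hit rate definition, including the tolerance of 12 months, is an assumption for illustration; the slides leave the exact definitions open. All function names are hypothetical.

```python
import math


def rmse(homogenised, truth):
    """Root mean square error of the homogenised series against the known truth."""
    n = len(truth)
    return math.sqrt(sum((h - t) ** 2 for h, t in zip(homogenised, truth)) / n)


def bias_corrected_rmse(homogenised, truth):
    """RMSE after removing the mean difference between the two series."""
    n = len(truth)
    bias = sum(h - t for h, t in zip(homogenised, truth)) / n
    return math.sqrt(sum((h - t - bias) ** 2 for h, t in zip(homogenised, truth)) / n)


def hit_rate(detected, true_breaks, tolerance=12):
    """Fraction of true breaks with a detected break within +/- tolerance months."""
    if not true_breaks:
        return float("nan")
    hits = sum(
        1 for tb in true_breaks
        if any(abs(tb - d) <= tolerance for d in detected)
    )
    return hits / len(true_breaks)
```

A homogenised series that differs from the truth only by a constant offset has a non-zero RMSE but a bias-corrected RMSE of zero, which relates to the question of whether the statistics should be independent of the absolute level.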

Deadline(s)
• Agreed on 09/2009, September this year
• Multiple deadlines
  – For example: synthetic data, real data, surrogate data
  – After deadline the truth can be revealed
  – After deadline the other contributions can be revealed(?)
  – Start earlier analysing the results
  – For example: May, July, September
• Bologna, 25–26 May; EGU, 19–24 April

Articles
• Articles
  – Overview COST Action & benchmark with very basic analysis results
     • Performance difference between synthetic (Gaussian, white noise) and surrogate data
     • How to deal with multiple contributions per algorithm?
     • Do we have references to all algorithms?
  – What should the others be about?
     • Analysing results, which components are best
• Who will organise, coordinate it?
  – Not everyone should do the same analysis
  – How to subdivide the work?
• After deadline: sensitivity analysis