Verification of probabilistic forecasts: comparing proper scoring rules
Thordis L. Thorarinsdottir and Nina Schuhen
11.04.2018
Introduction
- Proper scoring rules measure the accuracy of a forecast by assigning a numerical penalty
- Often used to rank different models or forecasters
- Applicable to both deterministic and probabilistic verification
- Propriety: the expected score is optimized when the forecast equals the true distribution
Real-life forecast scenario
- Which proper scoring rule should I use?
- What if they give conflicting results?
- How should I report results?
- Is my data set sufficient?
- Which score should I use for model parameter optimization?
- In short: how to use proper scores in practice!
Proper scoring rules
- Squared error: $\mathrm{SE}(F, y) = (\mu_F - y)^2$, where $\mu_F$ is the mean of the forecast distribution $F$ and $y$ the observation
- Absolute error: $\mathrm{AE}(F, y) = |\mathrm{med}_F - y|$, where $\mathrm{med}_F$ is the median of $F$
- Ignorance score: $\mathrm{IGN}(F, y) = -\log f(y)$, where $f$ is the predictive density
- Continuous ranked probability score: $\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \left( F(x) - \mathbb{1}\{x \geq y\} \right)^2 \, dx$
All four scores are computed in the sketch below.
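To make the definitions concrete, here is a minimal Python sketch (using numpy and scipy, both assumed available) that evaluates all four scores for a normal predictive distribution and a single observation; the CRPS uses the standard closed-form expression for the normal distribution, and the function name and example values are illustrative rather than taken from the slides.

```python
import numpy as np
from scipy import stats

def scores_normal(mu, sigma, y):
    """SE, AE, IGN and CRPS for a normal predictive distribution N(mu, sigma^2)
    and a verifying observation y (all scores negatively oriented)."""
    se = (mu - y) ** 2                       # squared error vs. the predictive mean
    ae = abs(mu - y)                         # absolute error vs. the predictive median (= mean here)
    ign = -stats.norm.logpdf(y, mu, sigma)   # ignorance (logarithmic) score
    z = (y - mu) / sigma
    crps = sigma * (z * (2 * stats.norm.cdf(z) - 1)      # closed-form CRPS for the normal
                    + 2 * stats.norm.pdf(z) - 1 / np.sqrt(np.pi))
    return se, ae, ign, crps

# example: forecast N(0, 1), observation 0.7
print(scores_normal(0.0, 1.0, 0.7))
```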
Scores behave differently…
Simulation study: concept
- Draw random data from a «true» distribution
- Verifying observations: 1000 data points
- Training data: 300 data points for each observation
- Estimate forecast distributions from the training data (method of moments)
- Make forecasts from the estimated distributions (50 members)
- Evaluate against the observations (see the sketch below)
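A minimal sketch of this setup, assuming, purely for illustration, a standard normal as the «true» distribution (the slide does not specify it here) and a normal forecast fitted by the method of moments; the CRPS of the 50-member ensemble is computed via its kernel representation.

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs, n_train, n_members = 1000, 300, 50

crps_values = []
for _ in range(n_obs):
    # verifying observation and its own training sample from the «true» distribution
    obs = rng.normal(0.0, 1.0)                       # standard normal assumed here
    train = rng.normal(0.0, 1.0, size=n_train)

    # method of moments: match the mean and variance of the training data
    mu_hat, sigma_hat = train.mean(), train.std(ddof=1)

    # 50-member ensemble forecast drawn from the estimated distribution
    ensemble = rng.normal(mu_hat, sigma_hat, size=n_members)

    # ensemble CRPS via the kernel representation: E|X - y| - 0.5 E|X - X'|
    crps = (np.mean(np.abs(ensemble - obs))
            - 0.5 * np.mean(np.abs(ensemble[:, None] - ensemble[None, :])))
    crps_values.append(crps)

print("mean CRPS:", np.mean(crps_values))
```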
Forecasting distributions
- Normal, non-central t, log-normal and Gumbel, plus the true distribution
- Table of expected value and variance for each distribution (parameters fitted by the method of moments, see the sketch below); for the Gumbel, the mean involves $\gamma$, the Euler-Mascheroni constant
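The moment matching itself can be sketched as follows for three of the four forecast distributions (the non-central t is omitted because its moment equations are more involved); parameter names are illustrative, and the mean/variance relations are the standard closed forms for each family.

```python
import numpy as np

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def fit_by_moments(train):
    """Choose parameters so each distribution's mean and variance match the sample."""
    m, s = train.mean(), train.std(ddof=1)

    # Normal(mu, sigma): mean = mu, sd = sigma
    normal = {"mu": m, "sigma": s}

    # Log-normal: mean = exp(mu + sigma^2/2), var = (exp(sigma^2) - 1) * mean^2
    sigma2 = np.log(1.0 + (s / m) ** 2)              # requires a positive sample mean
    lognormal = {"mu": np.log(m) - 0.5 * sigma2, "sigma": np.sqrt(sigma2)}

    # Gumbel: mean = loc + scale * gamma, var = pi^2 * scale^2 / 6
    scale = s * np.sqrt(6.0) / np.pi
    gumbel = {"loc": m - scale * EULER_GAMMA, "scale": scale}

    return {"normal": normal, "lognormal": lognormal, "gumbel": gumbel}
```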
Example forecast: scores vs. observation
- IGN has a different minimum due to the skewness of the Gumbel distribution
- All scores are minimized at the same value => proper
Mean scores and bootstrap intervals
- 1000 forecasts: only IGN shows a large difference between the Gumbel and the other forecasters; the log-normal is best if the truth is unknown
- 10^6 forecasts: the true distribution always has the lowest score, with the same ranking for all scores
(bootstrap intervals can be computed as in the sketch below)
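The error bars can be obtained from a nonparametric bootstrap over the verification cases; a minimal sketch, assuming a vector of case-wise scores such as `crps_values` from the simulation sketch above (the helper name and settings are illustrative):

```python
import numpy as np

def bootstrap_mean_interval(scores, n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of case-wise scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    means = np.array([rng.choice(scores, size=scores.size, replace=True).mean()
                      for _ in range(n_boot)])
    alpha = (1.0 - level) / 2.0
    return np.quantile(means, [alpha, 1.0 - alpha])

# example: 95% interval for the mean CRPS
# lo, hi = bootstrap_mean_interval(crps_values)
```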
PIT histograms (normal sample size)
PIT histograms (huge sample size)
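The PIT value of each case is the predictive CDF evaluated at the observation; a flat histogram indicates calibration. A minimal sketch, again assuming normal predictive distributions and illustrative variable names:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def pit_values(mu, sigma, obs):
    """PIT = predictive CDF at the observation; uniform for a calibrated forecast."""
    return stats.norm.cdf(obs, loc=mu, scale=sigma)

# pit = pit_values(mu_hat_all, sigma_hat_all, obs_all)   # fitted parameters, observations
# plt.hist(pit, bins=10, density=True)
# plt.axhline(1.0, linestyle="--")                        # reference line for uniformity
# plt.show()
```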
Variation: Gumbel true distribution
- 1000 forecasts: the estimated Gumbel has a lower mean score than the true distribution
- 10^6 forecasts: the true distribution always has the lowest score, with the same ranking for all scores
Summary
- For a huge sample size, all proper scores give the same result
- For more realistic sample sizes, they differ widely
- The best model doesn't always get the best score
- AE and CRPS have trouble identifying appropriate distributions
- The ignorance score is sensitive to the shape of the distribution
=> There is no «best» scoring rule
Summary
For robust results:
- Use error bars!
- Use a combination of scores
- CRPS is very useful if the distribution is unknown or cannot be easily specified
Minimum score estimation: CRPS or maximum likelihood?
- No clear answer; it depends on the forecast situation and the model choice (see the sketch below)
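To make the minimum score estimation question concrete, the sketch below fits a normal distribution either by maximum likelihood or by minimizing the mean CRPS over a sample; this is an illustrative comparison under assumed data, not the procedure from the study.

```python
import numpy as np
from scipy import stats, optimize

def crps_normal(mu, sigma, y):
    """Closed-form CRPS of N(mu, sigma^2), vectorised over the observations y."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * stats.norm.cdf(z) - 1)
                    + 2 * stats.norm.pdf(z) - 1 / np.sqrt(np.pi))

def fit_min_crps(y):
    """Minimum CRPS estimation: minimise the mean CRPS over the sample."""
    def objective(theta):
        mu, log_sigma = theta
        return np.mean(crps_normal(mu, np.exp(log_sigma), y))
    res = optimize.minimize(objective, x0=[y.mean(), np.log(y.std())])
    mu, log_sigma = res.x
    return mu, np.exp(log_sigma)

rng = np.random.default_rng(1)
sample = rng.normal(2.0, 1.5, size=300)
print("ML fit:      ", sample.mean(), sample.std(ddof=0))  # normal MLE: mean and 1/n-sd
print("Min-CRPS fit:", fit_min_crps(sample))
```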
Read more in…
Statistical Postprocessing of Ensemble Forecasts
Editors: Stéphane Vannitsem, Daniel S. Wilks, Jakob W. Messner
Elsevier, ISBN 978-0-12-812372-0
Planned publication: September 2018