Evaluating forecasts and models

Evaluating forecasts and models. Chris Ferro, Department of Mathematics, University of Exeter. (45+15 mins) Statistician with an interest in forecasting and applications in weather and climate forecasting. Discuss how to evaluate the performance of forecasting systems and of models more generally. One can assess self-consistency, consistency with theory, computational expense, etc.; we focus on comparing forecasts and simulations to observations of the real world. Workshop on stochastic modelling in GFD, data assimilation and non-equilibrium phenomena (2 November 2015, Imperial College London).

Overview Which measures should we use to rank forecasts/models? 1. Probability forecasts Use proper scoring rules (local or non-local) 2. Model simulations Use fair scoring rules if we want realistic simulations Local/non-local depends on meaning of ‘realistic’ Assume no error in the verifying observations. Consider problem of choosing between two forecasting systems or models: which is better? So we want a scalar measure of performance, but which measures should we use? We describe some principles that allow us to narrow down the choice of measures. For ranking probability forecasts, I’ll explain why we should use proper scores. I’ll also describe the difference between local and non-local scores. For ranking models, I’ll explain why things are less clear cut, but that we might want to use fair scores depending on what we want from our models. We’ll assume throughout that there is no error in the verifying observation. We’re working on how to handle that.

1. Evaluating probability forecasts

What is a probability forecast? A probability forecast is a probability distribution representing our uncertainty about a predictand. By issuing a forecast, we are saying that we expect the outcome to be a random draw from this distribution. The predictand can be multi-dimensional, so it can be a spatial field, a climatological distribution, a time series with a trend or cycle, a description of the evolution of some phenomenon, etc. Can we narrow down the class of performance measures?

Performance measures: a criterion Imagine a long sequence of forecasting problems... Suppose I issue the same forecast, p, on each occasion. (That is, I expect the outcomes to be a sample from p.) Suppose that the actual distribution of outcomes is q. The long-run score should be optimised when p = q. Bröcker and Smith (2007, Wea. Forecasting) You might argue that no-one would really issue the same forecast on every occasion, but we just want a principle that narrows down the class of scores that we consider. Anyway, consider tossing a coin: you’d probably issue the same forecast on each occasion then. You might also worry that this principle says nothing about cases where our forecasts vary, but, again, we just want to narrow down our class of scores.

Which performance measures? Only proper scoring rules satisfy our criterion. A scoring rule s(p,y) is a real-valued function of a probability forecast, p, and an outcome, y. Lower scores indicate better performance. Let S(p,q) = E_{y~q}[s(p,y)], the long-run score when y ~ q. Scoring rule is proper if S(p,q) ≥ S(q,q) for all p and q. Why a scoring rule?
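
A minimal numerical check of the propriety condition, assuming a binary outcome and the quadratic (Brier) score s(p,y) = (p − y)²; the specific values below are illustrative only.

```python
import numpy as np

def expected_brier(p, q):
    """Long-run quadratic score S(p, q) = E_{y~q}[(p - y)^2] for y ~ Bernoulli(q)."""
    return q * (1.0 - p) ** 2 + (1.0 - q) * p ** 2

q = 0.3                                 # actual probability of the event
p_grid = np.linspace(0.0, 1.0, 101)     # candidate forecast probabilities
best_p = p_grid[np.argmin(expected_brier(p_grid, q))]
print(f"expected score is minimised at p = {best_p:.2f} (true q = {q})")
```

The minimum sits at p = q, as propriety requires; repeating the exercise with an improper rule such as the absolute error |p − y| moves the minimum to 0 or 1 rather than q.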

Example: proper/improper scores. Forecasts: p = N(μ, σ²). Outcomes: q = N(0, 1). (Figure: contours of the long-run score under a proper scoring rule and under mean squared error.) The proper score favours good forecasts (μ = 0, σ = 1). Mean squared error favours unbiased (μ = 0) but under-dispersed (σ = 0) forecasts. MSE is a bad score here, but there are also consistent scores for specific features: e.g. the squared error of the mean also favours unbiased forecasts but is insensitive to the spread.
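
A sketch of the same comparison (reading "mean squared error" as the expected squared error of a random draw x ~ p is my assumption): with outcomes y ~ N(0, 1), the expected squared error is μ² + σ² + 1, which is smallest as σ → 0, while the expected log (ignorance) score, ½ log(2πσ²) + (1 + μ²)/(2σ²), is smallest at μ = 0, σ = 1.

```python
import numpy as np

def expected_mse(mu, sigma):
    """E[(x - y)^2] for x ~ N(mu, sigma^2) and y ~ N(0, 1), independent."""
    return mu ** 2 + sigma ** 2 + 1.0

def expected_log_score(mu, sigma):
    """E[-log p(y)] for forecast density p = N(mu, sigma^2) and y ~ N(0, 1)."""
    return 0.5 * np.log(2.0 * np.pi * sigma ** 2) + (1.0 + mu ** 2) / (2.0 * sigma ** 2)

sigmas = np.linspace(0.1, 2.0, 191)
print(f"sigma minimising expected MSE      : {sigmas[np.argmin(expected_mse(0.0, sigmas))]:.2f}")
print(f"sigma minimising expected log score: {sigmas[np.argmin(expected_log_score(0.0, sigmas))]:.2f}")
```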

Example: proper/improper scores. Proper scores can be chosen to focus on only specific features of the forecast distributions, e.g. to evaluate only the mean. The squared error of the mean favours unbiased forecasts (μ = 0) and ignores spread. (Figure: contours of the long-run score under a proper scoring rule and under the squared error of the mean.)

Evaluate only the mean? Evaluating only the mean says little about the realism of forecasted outcomes. Contours of three bivariate forecast distributions. All have the same mean, but only one matches the distribution of the observed outcomes (+).

Examples of proper scoring rules
Logarithmic (ignorance) score: s(p,y) = −log p(y)
Quadratic (proper linear) score: s(p,y) = ∫p(x)² dx − 2p(y)
Continuous ranked probability (energy) score: s(p,y) = 2E_x|x − y| − E_{x,xʹ}|x − xʹ|, where x, xʹ ~ p
Squared error of the mean: s(p,y) = [E(x) − y]², where x ~ p
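
These can be computed directly from their definitions; a hedged sketch for a Normal forecast (μ, σ and y below are arbitrary, and the energy score is approximated by Monte Carlo).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma, y = 0.5, 1.2, 0.0
x  = rng.normal(mu, sigma, size=100_000)    # draws x  ~ p
xp = rng.normal(mu, sigma, size=100_000)    # draws x' ~ p, independent

log_score    = -norm.logpdf(y, loc=mu, scale=sigma)                 # -log p(y)
quad_score   = (1.0 / (2.0 * sigma * np.sqrt(np.pi))                # int p(x)^2 dx for a Normal density
                - 2.0 * norm.pdf(y, loc=mu, scale=sigma))           # ... minus 2 p(y)
energy_score = 2.0 * np.mean(np.abs(x - y)) - np.mean(np.abs(x - xp))
se_mean      = (mu - y) ** 2                                        # [E(x) - y]^2

print(log_score, quad_score, energy_score, se_mean)
```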

Do these favour useful forecasts? For any loss function there is a proper scoring rule that gives better scores to forecasts that yield lower losses. Let L(a,y) be the loss following action a and outcome y. Act to minimise the expected loss calculated from the forecast, p: let a_p minimise E_{y~p}[L(a,y)]. Then s(p,y) = L(a_p,y) is a proper scoring rule, and S(p,q) ≤ S(pʹ,q) iff E_{y~q}[L(a_p,y)] ≤ E_{y~q}[L(a_{pʹ},y)]. Dawid and Musio (2014, Metron) If the action is itself a distribution, i.e. a_p = p, then L(a_p,y) = L(p,y) and L is just the same as the proper score, e.g. L(a_p,y) = −log p(y). So proper scores not only favour the correct forecast, but order all forecasts according to their utility. (What is the relevance of this when forecasts are different on each occasion?)
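
A sketch of this construction with an invented 3×3 loss matrix (the numbers are mine, not those of the decision problem on the later slide): the score of a forecast p at outcome y is the loss of the action that minimises the expected loss under p.

```python
import numpy as np

L = np.array([[0.0, 2.0, 4.0],    # hypothetical losses L[a, y] for
              [1.0, 0.0, 2.0],    # actions a = 0, 1, 2 (rows) and
              [3.0, 1.0, 0.0]])   # outcomes y = 0, 1, 2 (columns)

def loss_based_score(p, y):
    """s(p, y) = L(a_p, y), where a_p minimises the expected loss under p."""
    a_p = np.argmin(L @ p)        # L @ p gives E_{y~p}[L(a, y)] for each action a
    return L[a_p, y]

p = np.array([0.2, 0.5, 0.3])     # an example probability forecast
print([loss_based_score(p, y) for y in range(3)])
```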

Local or non-local scoring rules? A scoring rule s(p,y) is local if it depends on only p(y), the probability forecasted for the observed outcome. Local proper scores favour forecasts with more prob. on the outcome (+). Non-local proper scores favour forecasts with more prob. near the outcome. The only local, proper score is the log score, −log p(y). Suppose the two variables are the amplitude and phase of a cycle.

Local or non-local scoring rules? A scoring rule s(p,y) is local if it depends on only p(y), the probability forecasted for the observed outcome. Local proper scores favour forecasts with more prob. on the outcome (+). Non-local proper scores favour forecasts with more prob. near the outcome. The red distribution forecasts cycles that tend to look similar to the observed outcome. The blue distribution forecasts cycles that tend to look dissimilar to the observed outcome: bias in phase and amplitude.

Local or non-local scoring rules? A scoring rule s(p,y) is local if it depends on only p(y), the probability forecasted for the observed outcome. Local proper scores favour forecasts with more prob. on the outcome (+). Non-local proper scores favour forecasts with more prob. near the outcome. It’s not just that the local score has been unlucky. Suppose we observe many outcomes. The blue distribution has the wrong relationship between the amplitude and phase, while the red distribution has the relationship right, but is just slightly biased. The local score would still favour the blue distribution for all of these outcomes.

Local or non-local scoring rules? Local scores penalise forecasts in which the observed outcome is unlikely, even if similar outcomes are likely. (Evaluation shouldn’t depend on outcomes that didn’t occur: for all we know, they might be impossible.) Non-local scores favour forecasts in which outcomes similar to the observed outcome are likely, even if the observed outcome is unlikely. (Such forecasts can lead to better decisions.)
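
An invented one-dimensional illustration of this distinction: forecast A puts all of its probability on values adjacent to the outcome but none exactly on it, while forecast B puts some probability exactly on the outcome and the rest far away. The (local) log score prefers B; the (non-local) energy score prefers A.

```python
import numpy as np

y = 5.0
values_A, probs_A = np.array([4.0, 6.0]), np.array([0.5, 0.5])   # mass near the outcome
values_B, probs_B = np.array([0.0, 5.0]), np.array([0.8, 0.2])   # some mass on it, rest far away

def log_score(values, probs, y):
    p_y = probs[values == y].sum()             # probability placed exactly on the outcome
    return np.inf if p_y == 0 else -np.log(p_y)

def energy_score(values, probs, y):
    e_xy  = probs @ np.abs(values - y)                                  # E|x - y|
    e_xxp = probs @ np.abs(values[:, None] - values[None, :]) @ probs   # E|x - x'|
    return 2.0 * e_xy - e_xxp

for name, v, pr in [("A", values_A, probs_A), ("B", values_B, probs_B)]:
    print(name, log_score(v, pr, y), energy_score(v, pr, y))
```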

Local or non-local scoring rules?
Loss table L(a,y), actions a = 0, 1, 2 (rows) by outcomes y = 0, 1, 2 (columns):
L(a,y)   y = 0   y = 1   y = 2
a = 0              3       6
a = 1      1               4
a = 2
Forecast 1 = (0, 1, 0); Forecast 2 = (p, 0, 1 − p).
Suppose the outcome is y = 2. Using forecast 1, the loss is 4. Using forecast 2, the loss is 6 if p > 1/2 and 3 if p < 1/2.
Forecast 1 yields the lower loss if p > 1/2, but local scoring rules prefer forecast 2.

Evaluating forecasts: summary Coherent: Proper scoring rules should be used to rank forecasts, otherwise good forecasts will be penalised. Relevant: For any loss function there is a proper scoring rule that will rank forecasts by their long-run losses. Use a local proper scoring rule if we want forecasts to have more probability on the actual outcome. Use a non-local proper scoring rule if we want forecasts to have more probability near the actual outcome.

2. Evaluating model simulations

Good models or good forecasts? Decision support benefits from probability forecasts. Models do not output probabilities, so simulations must be post-processed to produce probability forecasts. Models whose simulations are the most realistic may or may not yield the best probability forecasts. Should we evaluate the quality of probability forecasts (which also evaluates the post-processing scheme) or the realism of model simulations? We’ve already discussed how to evaluate probability forecasts, so now we’ll discuss how to evaluate model simulations.

Ensemble simulations Consider evaluating the realism of a model on the basis of an initial-condition ensemble simulation. We are evaluating all components of the ‘model’: observation, assimilation, ensemble generation, forward simulation, projection between model states and reality. Which measures should we use to rank models?

What is a perfect ensemble? Given imperfect observations, the best ensemble would be generated by sampling initial conditions from those possible states that are consistent with the observations and evolving them with a perfect model (Smith 1996). This perfect ensemble is a sample from a distribution, q. The outcome would also be like a random draw from q. Favour ensembles that appear to be sampled from the same distribution as the outcome. Why didn’t we use this argument to motivate proper scores? (Just use infinite ensemble, or evolve the pdf of initial conditions.)

Performance measures: a criterion Imagine a long sequence of forecasting problems... Suppose that I sample my ensemble from the same distribution, p, on each occasion. (That is, I expect the outcomes to be a sample from p.) Suppose that the actual distribution of outcomes is q. The long-run score should be optimised when p = q.

Which performance measures? Only fair scoring rules satisfy our criterion. A scoring rule s(x,y) is a real-valued function of an ensemble, x, and an outcome, y. Lower scores indicate better performance. Let S(p,q) = E_{x,y}[s(x,y)] when the members of x are i.i.d. with distribution p and y ~ q. The scoring rule is fair if S(p,q) ≥ S(q,q) for all p and q. Ferro (2014, Q. J. Roy. Meteorol. Soc.) Fair scores effectively evaluate the (imperfectly known) distribution from which the ensembles are samples. So examples with contour plots will be the same as before.
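
A minimal numerical illustration of this criterion (the values are mine): for binary outcomes, compare the plain ensemble Brier score (x̄ − y)² with the fair quadratic score (x̄ − y)² − x̄(1 − x̄)/(m − 1) from the next slide. Only the fair version is optimised, in expectation, by sampling the members from the outcome distribution.

```python
import numpy as np
from scipy.stats import binom

def expected_scores(p, q, m):
    """Expected plain and fair quadratic scores for m i.i.d. Bernoulli(p) members
    and outcomes y ~ Bernoulli(q)."""
    k = np.arange(m + 1)
    w = binom.pmf(k, m, p)          # distribution of the number of members equal to 1
    xbar = k / m
    plain = w @ (q * (xbar - 1.0) ** 2 + (1.0 - q) * xbar ** 2)
    fair  = plain - w @ (xbar * (1.0 - xbar)) / (m - 1)
    return plain, fair

q, m = 0.3, 4
p_grid = np.linspace(0.0, 1.0, 1001)
plains, fairs = zip(*(expected_scores(p, q, m) for p in p_grid))
print(f"p minimising plain score: {p_grid[np.argmin(plains)]:.3f}")
print(f"p minimising fair score : {p_grid[np.argmin(fairs)]:.3f}   (true q = {q})")
```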

Examples of fair scoring rules
Ensemble x_1, ..., x_m independent with distribution p
Log score (when p is Normal): see Siegert et al. (2015, arXiv)
Quadratic score (when x and y are binary): s(x,y) = (x̄ − y)² − x̄(1 − x̄)/(m − 1)
Continuous ranked probability (energy) score: s(x,y) = 2∑_i |x_i − y|/m − ∑_{i≠j} |x_i − x_j|/{m(m − 1)}
Squared error of the mean: s(x,y) = (x̄ − y)² − s²/m, where s² is the variance of x
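
Two of these can be transcribed directly into code; a sketch (the ensemble values, and the use of the sample variance for s², are my assumptions).

```python
import numpy as np

def fair_energy_score(x, y):
    """Fair continuous ranked probability (energy) score for ensemble x and outcome y."""
    x = np.asarray(x, dtype=float)
    m = x.size
    term1 = 2.0 * np.sum(np.abs(x - y)) / m
    term2 = np.sum(np.abs(x[:, None] - x[None, :])) / (m * (m - 1))   # i = j terms are zero
    return term1 - term2

def fair_se_mean(x, y):
    """Fair squared error of the ensemble mean: (x_bar - y)^2 - s^2/m."""
    x = np.asarray(x, dtype=float)
    return (x.mean() - y) ** 2 - x.var(ddof=1) / x.size

ensemble = np.array([-0.4, 0.1, 0.7, 1.2, 0.3])
print(fair_energy_score(ensemble, 0.5), fair_se_mean(ensemble, 0.5))
```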

Local or non-local scoring rules? Fair scores effectively evaluate the (imperfectly known) distribution from which the ensemble is sampled. Local fair scoring rules favour ensembles with more chance of having members equal to the outcome. Non-local fair scoring rules favour ensembles with more chance of having members near the outcome. at least to within observational uncertainty if not exactly

Model simulations: summary Coherent: Fair scoring rules should be used to rank models, otherwise realistic models will be penalised. Relevant: Assuming model realism is relevant, e.g. to the quality of probability forecasts and decision support. Use a local fair scoring rule if we want model simulations to have more chance of being equal to reality. Use a non-local fair scoring rule if we want model simulations to have more chance of being near reality.

Discussion How relevant is model realism to forecast quality? Should we prefer local or non-local scoring rules? How should we handle observational uncertainty? How well can we detect differences in high dimensions? Proper/fair scoring rules should be used to rank forecasts/models but they can hide key information (e.g. direction of bias). Can scoring rules be designed to help understand performance or are other methods needed? Choice of local/non-local proper score depends on decision problem, but choice less clear for fair scores in judging model realism.

References
Bröcker J, Smith LA (2007) Scoring probabilistic forecasts: the importance of being proper. Weather and Forecasting, 22, 382–388.
Dawid AP, Musio M (2014) Theory and application of proper scoring rules. Metron, 72, 169–183.
Ferro CAT (2014) Fair scores for ensemble forecasts. Quarterly Journal of the Royal Meteorological Society, 140, 1917–1923.
Siegert S, Ferro CAT, Stephenson DB (2015) Correcting the finite-ensemble bias of the ignorance score. arXiv:1410.8249.
Smith LA (1996) Accountability and error in forecasts. In Proceedings of the 1995 Predictability Seminar, ECMWF.

extra slides...

Measuring performance: scores. Calculate a score for each forecast and then average over the forecasts. Other types of measure (e.g. correlation) are prone to spurious inflation due to trends in the data. Example: naive forecasts achieve a correlation of 0.95. (Figure: observations and naive forecasts.)
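
An illustrative simulation (my own numbers, not the slide's data): give the observations and a naive forecast the same trend but independent year-to-year noise, and the correlation is high even though the forecast contains no information beyond the trend.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(50)
trend = 0.1 * t
observations   = trend + rng.normal(scale=0.35, size=t.size)
naive_forecast = trend + rng.normal(scale=0.35, size=t.size)   # independent of the observations

r = np.corrcoef(observations, naive_forecast)[0, 1]
print(f"correlation of naive forecasts with observations: {r:.2f}")
```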

Proper scores: perfect forecasts Perfect forecast has all probability mass on the outcome. Proper scoring rules give perfect forecasts the best score since s(p,y) = S(p,δ(y)) ≥ S(δ(y),δ(y)) = s(δ(y),y). omit

Calibration and sharpness. Proper scoring rules reward calibration and sharpness. Calibration: outcomes are like draws from the forecasts. Sharpness: forecast distributions are concentrated. The expected score is S(p,q) = S(q,q) + C(p,q), where S(q,q) is strictly concave and C(p,q) ≥ 0. C(p,q) measures calibration since C(p,q) = 0 iff p = q. S(q,q) measures sharpness: by concavity, any mixture of distributions scores worse than the sharpest distribution. Example: for y in {0,1} and p in [0,1] the quadratic score is s(p,y) = (p − y)², for which S(p,q) = q(1 − q) + (p − q)².
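
A quick check of the binary example (a sketch; p, q and the sample size are arbitrary): the long-run quadratic score matches the sharpness term q(1 − q) plus the calibration term (p − q)².

```python
import numpy as np

rng = np.random.default_rng(2)
p, q = 0.7, 0.4
y = rng.binomial(1, q, size=1_000_000)       # outcomes y ~ Bernoulli(q)
lhs = np.mean((p - y) ** 2)                  # long-run quadratic score S(p, q)
rhs = q * (1 - q) + (p - q) ** 2             # sharpness + calibration
print(lhs, rhs)
```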

Characterization: binary case. Let y = 1 if an event occurs, and let y = 0 otherwise. Let s_{i,y} be the (finite) score when i of m ensemble members forecast the event and reality is y. The (negatively oriented) score is fair if (m − i)(s_{i+1,0} − s_{i,0}) = i(s_{i−1,1} − s_{i,1}) for i = 0, 1, ..., m and s_{i+1,0} ≥ s_{i,0} for i = 0, 1, ..., m − 1. Ferro (2014, Q. J. Roy. Meteorol. Soc.) The equality constraints are necessary and are a discrete analogue of a condition by Savage (1971). The inequality constraints are not necessary but are desirable if the score is negatively oriented. If m = 1 then the only fair scores are trivial: the value of the score depends on the observation but not on the ensemble. Ensembles with one incorrect member score the same as perfect ensembles.
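
A sketch verifying that the fair binary quadratic score from the earlier slide, s_{i,y} = (i/m − y)² − (i/m)(1 − i/m)/(m − 1), satisfies these constraints (checked for i = 1, ..., m − 1; at i = 0 and i = m the out-of-range terms are multiplied by zero).

```python
import numpy as np

m = 8
i = np.arange(m + 1)
xbar = i / m

def score(y):
    """s_{i,y} for all i at once."""
    return (xbar - y) ** 2 - xbar * (1.0 - xbar) / (m - 1)

s0, s1 = score(0), score(1)
lhs = (m - i[1:-1]) * (s0[2:] - s0[1:-1])     # (m - i)(s_{i+1,0} - s_{i,0})
rhs = i[1:-1] * (s1[:-2] - s1[1:-1])          # i(s_{i-1,1} - s_{i,1})
print(np.allclose(lhs, rhs), np.all(np.diff(s0) >= 0))
```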