Evaluating forecasts and models


1 Evaluating forecasts and models
Chris Ferro, Department of Mathematics, University of Exeter (45+15 mins). Statistician with an interest in forecasting and applications in weather and climate forecasting. Discuss how to evaluate the performance of forecasting systems and of models more generally. We can assess self-consistency, consistency with theory, computational expense, etc.; here we focus on comparing forecasts and simulations with observations of the real world. Workshop on stochastic modelling in GFD, data assimilation and non-equilibrium phenomena (2 November 2015, Imperial College London)

2 Overview
Which measures should we use to rank forecasts/models?
1. Probability forecasts: use proper scoring rules (local or non-local).
2. Model simulations: use fair scoring rules if we want realistic simulations; local/non-local depends on the meaning of ‘realistic’.
Assume no error in the verifying observations.
Consider the problem of choosing between two forecasting systems or models: which is better? We want a scalar measure of performance, but which measures should we use? We describe some principles that allow us to narrow down the choice of measures. For ranking probability forecasts, I’ll explain why we should use proper scores. I’ll also describe the difference between local and non-local scores. For ranking models, I’ll explain why things are less clear-cut, but that we might want to use fair scores depending on what we want from our models. We’ll assume throughout that there is no error in the verifying observation; we’re working on how to handle that.

3 1. Evaluating probability forecasts

4 What is a probability forecast?
A probability forecast is a probability distribution representing our uncertainty about a predictand. By issuing a forecast, we are saying that we expect the outcome to be a random draw from this distribution. The predictand can be multi-dimensional, so it can be a spatial field, a climatological distribution, a time series with a trend or cycle, a description of the evolution of some phenomenon, etc. Can we narrow down the class of performance measures?

5 Performance measures: a criterion
Imagine a long sequence of forecasting problems... Suppose I issue the same forecast, p, on each occasion. (That is, I expect the outcomes to be a sample from p.) Suppose that the actual distribution of outcomes is q. The long-run score should be optimised when p = q. Bröcker and Smith (2007, Wea. Forecasting) You might argue that no-one would really issue the same forecast on every occasion, but we just want a principle that narrows down the class of scores that we consider. Anyway, consider tossing a coin: you’d probably issue the same forecast on each occasion then. You might also worry that this principle says nothing about cases where our forecasts vary, but, again, we just want to narrow down our class of scores.

6 Which performance measures?
Only proper scoring rules satisfy our criterion. A scoring rule s(p,y) is a real-valued function of a probability forecast, p, and an outcome, y. Lower scores indicate better performance. Let S(p,q) = E_{y~q}[s(p,y)], the long-run score when y ~ q. The scoring rule is proper if S(p,q) ≥ S(q,q) for all p and q. Why a scoring rule?
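A minimal numerical sketch of what propriety means in the binary case (my own illustration, not from the slides): under outcomes y ~ Bernoulli(q), the expected Brier score (p − y)² is minimised by issuing p = q, whereas an improper score such as the absolute error |p − y| rewards hedging towards 0 or 1.

```python
import numpy as np

# Propriety check for a binary outcome: the Brier score s(p, y) = (p - y)^2 is
# proper (its expected value under y ~ Bernoulli(q) is minimised at p = q),
# while the absolute error |p - y| is improper (minimised at p = 0 or p = 1).
q = 0.3                            # actual probability of the event
p_grid = np.linspace(0, 1, 1001)   # candidate probability forecasts

brier = q * (p_grid - 1) ** 2 + (1 - q) * p_grid ** 2  # E_y[(p - y)^2]
abs_err = q * (1 - p_grid) + (1 - q) * p_grid          # E_y[|p - y|]

print("Brier score minimised at p =", p_grid[np.argmin(brier)])       # ~0.3 = q
print("Absolute error minimised at p =", p_grid[np.argmin(abs_err)])  # 0, not q
```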

7 Example: proper/improper scores
Forecasts: p = N(μ, σ²). Outcomes: q = N(0, 1). Contours of the long-run score. The proper score favours good forecasts (μ = 0, σ = 1). Mean squared error favours unbiased (μ = 0) but under-dispersed (σ = 0) forecasts. [Figure panels: proper scoring rule; mean squared error] MSE is a bad score, but there are also consistent scores for certain features, e.g. the squared error of the mean also favours unbiased forecasts but is insensitive to the spread.
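The contoured long-run scores can be reproduced in closed form; a short sketch of my own, assuming 'mean squared error' here means E[(x − y)²] with x drawn from the forecast, and with grid ranges chosen for illustration:

```python
import numpy as np

# Long-run (expected) scores for forecasts N(mu, sigma^2) against outcomes from
# N(0, 1), in closed form:
#   log score: E[-log p(y)] = 0.5*log(2*pi*sigma^2) + (1 + mu^2)/(2*sigma^2)
#   MSE:       E[(x - y)^2] = mu^2 + sigma^2 + 1, with x ~ p and y ~ q independent
mu = np.linspace(-2, 2, 201)
sigma = np.linspace(0.1, 3, 201)
MU, SIG = np.meshgrid(mu, sigma)

log_score = 0.5 * np.log(2 * np.pi * SIG ** 2) + (1 + MU ** 2) / (2 * SIG ** 2)
mse = MU ** 2 + SIG ** 2 + 1

i, j = np.unravel_index(np.argmin(log_score), log_score.shape)
print("log score (proper) best at mu=%.2f, sigma=%.2f" % (MU[i, j], SIG[i, j]))  # ~(0, 1)
i, j = np.unravel_index(np.argmin(mse), mse.shape)
print("MSE best at mu=%.2f, sigma=%.2f" % (MU[i, j], SIG[i, j]))  # (0, 0.1): smallest sigma on the grid
```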

8 Example: proper/improper scores
Proper scores can be chosen to focus on only specific features of the forecast distributions, e.g. to evaluate only the mean. Squared error of the mean favours unbiased forecasts (μ = 0) and ignores spread. [Figure panels: proper scoring rule; squared error of mean]

9 Evaluate only the mean?
Evaluating only the mean says little about the realism of forecasted outcomes. Contours of three bivariate forecast distributions. All have the same mean, but only one matches the distribution of the observed outcomes (+).

10 Examples of proper scoring rules
Logarithmic (ignorance) score: s(p,y) = −log p(y)
Quadratic (proper linear) score: s(p,y) = ∫p(x)² dx − 2p(y)
Continuous ranked probability (energy) score: s(p,y) = 2E_x|x − y| − E_{x,xʹ}|x − xʹ|, where x, xʹ ~ p
Squared error of the mean: s(p,y) = [E(x) − y]², where x ~ p
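The following sketch implements these four scoring rules for the simple case of a Gaussian forecast p = N(μ, σ²) and a scalar outcome y; the energy score is estimated by Monte Carlo, the others are exact for the Gaussian case (the example forecast values are arbitrary).

```python
import numpy as np
from scipy.stats import norm

# Illustrative implementations of the four proper scoring rules on this slide,
# for a Gaussian forecast p = N(mu, sigma^2) and a scalar outcome y.
rng = np.random.default_rng(0)

def log_score(mu, sigma, y):
    return -norm.logpdf(y, mu, sigma)                    # -log p(y)

def quadratic_score(mu, sigma, y):
    # For a Normal density, the integral of p(x)^2 is 1 / (2 * sigma * sqrt(pi)).
    return 1 / (2 * sigma * np.sqrt(np.pi)) - 2 * norm.pdf(y, mu, sigma)

def energy_score(mu, sigma, y, n=100_000):
    x = rng.normal(mu, sigma, n)                         # x ~ p
    x2 = rng.normal(mu, sigma, n)                        # x' ~ p, independent
    return 2 * np.mean(np.abs(x - y)) - np.mean(np.abs(x - x2))

def squared_error_of_mean(mu, sigma, y):
    return (mu - y) ** 2                                 # [E(x) - y]^2

for f in (log_score, quadratic_score, energy_score, squared_error_of_mean):
    print(f.__name__, f(0.5, 1.2, y=0.0))
```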

11 Do these favour useful forecasts?
For any loss function there is a proper scoring rule that gives better scores to forecasts that yield lower losses. Let L(a,y) be the loss following action a and outcome y. Act to minimise the expected loss calculated from the forecast, p: let a_p minimise E_{y~p}[L(a,y)]. Then s(p,y) = L(a_p,y) is a proper scoring rule, and S(p,q) ≤ S(pʹ,q) iff E_{y~q}[L(a_p,y)] ≤ E_{y~q}[L(a_{pʹ},y)]. Dawid and Musio (2014, Metron). If the action is itself a distribution, i.e. a_p = p, then L(a_p,y) = L(p,y) and L is just the same as the proper score, e.g. L(a_p,y) = −log p(y). So proper scores not only favour the correct forecast, but order all forecasts according to their utility. (Relevance of this when forecasts are different on each occasion?)
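As a worked instance of this construction (my own example, not from the slides), take a simple cost-loss problem: protecting against an event costs C, and an unprotected event costs L_event. The Bayes action under forecast p is to protect iff p > C/L_event, and the induced score s(p,y) = L(a_p,y) is proper (though not strictly proper).

```python
import numpy as np

# Sketch of the Dawid-Musio construction for a hypothetical cost-loss decision:
# protecting always costs C; not protecting costs L_event if the event occurs.
C, L_event = 1.0, 4.0

def induced_score(p, y):
    protect = p > C / L_event          # Bayes action a_p under forecast p
    return C if protect else L_event * y

q = 0.4                                # actual event probability
p_grid = np.linspace(0, 1, 101)
expected = [q * induced_score(p, 1) + (1 - q) * induced_score(p, 0) for p in p_grid]

print("expected score at p = q:", q * induced_score(q, 1) + (1 - q) * induced_score(q, 0))
print("minimum expected score: ", min(expected))   # equal: issuing p = q is optimal
```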

12 Local or non-local scoring rules?
A scoring rule s(p,y) is local if it depends on only p(y), the probability forecasted for the observed outcome. Local proper scores favour forecasts with more prob. on the outcome (+). Non-local proper scores favour forecasts with more prob. near the outcome. The only local, proper score is the log score, −log p(y). Suppose the two variables are the amplitude and phase of a cycle.

13 Local or non-local scoring rules?
A scoring rule s(p,y) is local if it depends on only p(y), the probability forecasted for the observed outcome. Local proper scores favour forecasts with more prob. on the outcome (+). Non-local proper scores favour forecasts with more prob. near the outcome. The red distribution forecasts cycles that tend to look similar to the observed outcome. The blue distribution forecasts cycles that tend to look dissimilar to the observed outcome: bias in phase and amplitude.

14 Local or non-local scoring rules?
A scoring rule s(p,y) is local if it depends on only p(y), the probability forecasted for the observed outcome. Local proper scores favour forecasts with more prob. on the outcome (+). Non-local proper scores favour forecasts with more prob. near the outcome. It’s not just that the local score has been unlucky. Suppose we observe many outcomes. The blue distribution has the wrong relationship between the amplitude and phase, while the red distribution has the relationship right, but is just slightly biased. The local score would still favour the blue distribution for all of these outcomes.

15 Local or non-local scoring rules?
Local scores penalise forecasts in which the observed outcome is unlikely, even if similar outcomes are likely. (Evaluation shouldn’t depend on outcomes that didn’t occur: for all we know, they might be impossible.) Non-local scores favour forecasts in which outcomes similar to the observed outcome are likely, even if the observed outcome is unlikely. (Such forecasts can lead to better decisions.)

16 Local or non-local scoring rules?
L(a,y)   y = 0   y = 1   y = 2
a = 0      –       3       6
a = 1      1       –       4
a = 2      –       –       –
Suppose the outcome is y = 2. Using forecast 1, the loss is 4. Using forecast 2, the loss is 6 if p > 1/2 and 3 if p < 1/2. Forecast 1 yields the lower loss if p > 1/2, but local scoring rules prefer forecast 2. Forecast 1 = (0, 1, 0). Forecast 2 = (p, 0, 1 − p)
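A sketch of the decision calculation behind this slide. The loss matrix below reuses the legible entries of the table above but fills the remaining cells with hypothetical values of my own, so it illustrates the mechanics rather than reproducing the slide exactly.

```python
import numpy as np

# Hypothetical loss matrix L[a, y] for actions and outcomes in {0, 1, 2}: the
# entries 3, 6, 1, 4 come from the table above; the rest are assumptions.
L = np.array([[0., 3., 6.],
              [1., 0., 4.],
              [3., 2., 1.]])

def bayes_action(p):
    return int(np.argmin(L @ np.asarray(p)))   # action minimising expected loss under p

y_obs = 2
p_value = 0.7                                  # a value of p above 1/2
f1 = [0.0, 1.0, 0.0]                           # forecast 1
f2 = [p_value, 0.0, 1.0 - p_value]             # forecast 2

for name, f in [("forecast 1", f1), ("forecast 2", f2)]:
    a = bayes_action(f)
    print(name, "-> action", a, ", realised loss", L[a, y_obs],
          ", probability on the outcome", f[y_obs])
# A local score (e.g. the log score) looks only at the probability on y = 2,
# so it prefers forecast 2 even though forecast 1 leads to the lower loss here.
```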

17 Evaluating forecasts: summary
Coherent: Proper scoring rules should be used to rank forecasts, otherwise good forecasts will be penalised. Relevant: For any loss function there is a proper scoring rule that will rank forecasts by their long-run losses. Use a local proper scoring rule if we want forecasts to have more probability on the actual outcome. Use a non-local proper scoring rule if we want forecasts to have more probability near the actual outcome.

18 2. Evaluating model simulations

19 Good models or good forecasts?
Decision support benefits from probability forecasts. Models do not output probabilities, so simulations must be post-processed to produce probability forecasts. Models whose simulations are the most realistic may or may not yield the best probability forecasts. Should we evaluate the quality of probability forecasts (which also evaluates the post-processing scheme) or the realism of model simulations? We’ve already discussed how to evaluate probability forecasts, so now we’ll discuss how to evaluate model simulations.

20 Ensemble simulations
Consider evaluating the realism of a model on the basis of an initial-condition ensemble simulation. We are evaluating all components of the ‘model’: observation, assimilation, ensemble generation, forward simulation, projection between model states and reality. Which measures should we use to rank models?

21 What is a perfect ensemble?
Given imperfect observations, the best ensemble would be generated by sampling initial conditions from those possible states that are consistent with the observations and evolving them with a perfect model (Smith 1996). This perfect ensemble is a sample from a distribution, q. The outcome would also be like a random draw from q. Favour ensembles that appear to be sampled from the same distribution as the outcome. Why didn’t we use this argument to motivate proper scores? (Just use infinite ensemble, or evolve the pdf of initial conditions.)

22 Performance measures: a criterion
Imagine a long sequence of forecasting problems... Suppose that I sample my ensemble from the same distribution, p, on each occasion. (That is, I expect the outcomes to be a sample from p.) Suppose that the actual distribution of outcomes is q. The long-run score should be optimised when p = q.

23 Which performance measures?
Only fair scoring rules satisfy our criterion. A scoring rule s(x,y) is a real-valued function of an ensemble, x, and an outcome, y. Lower scores indicate better performance. Let S(p,q) = E_{x,y}[s(x,y)] when the members of x are i.i.d. with distribution p and y ~ q. The scoring rule is fair if S(p,q) ≥ S(q,q) for all p and q. Ferro (2014, Q. J. Roy. Meteorol. Soc.). Fair scores effectively evaluate the (imperfectly known) distribution from which the ensembles are samples, so the examples with contour plots are the same as before.
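A minimal numerical check of this criterion in the binary case (my own sketch), using the fair quadratic score listed on the next slide: the expected value of the uncorrected ensemble-frequency Brier score is optimised by sampling the ensemble from the wrong distribution, while the fair version is optimised at p = q.

```python
import numpy as np
from scipy.stats import binom

# Exact expected scores for m-member binary ensembles sampled from Bernoulli(p)
# and outcomes y ~ Bernoulli(q), computed by summing over the binomial count.
m, q = 5, 0.3
p_grid = np.linspace(0.01, 0.99, 99)

def naive(xbar, y):
    return (xbar - y) ** 2                                  # uncorrected Brier score

def fair(xbar, y):
    return (xbar - y) ** 2 - xbar * (1 - xbar) / (m - 1)    # fair quadratic score

def expected(score, p):
    i = np.arange(m + 1)
    w = binom.pmf(i, m, p)                 # distribution of the number of 'event' members
    return sum(w[k] * (q * score(i[k] / m, 1) + (1 - q) * score(i[k] / m, 0))
               for k in range(m + 1))

print("naive score best at p =", p_grid[np.argmin([expected(naive, p) for p in p_grid])])  # 0.25, not q
print("fair score best at p = ", p_grid[np.argmin([expected(fair, p) for p in p_grid])])   # 0.30 = q
```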

24 Examples of fair scoring rules
Ensemble x_1, ..., x_m independent with distribution p.
Log score (when p is Normal): Siegert et al. (2015, arXiv)
Quadratic score (when x and y are binary): s(x,y) = (x̄ − y)² − x̄(1 − x̄)/(m − 1)
Continuous ranked probability (energy) score: s(x,y) = 2∑_i|x_i − y|/m − ∑_{i≠j}|x_i − x_j|/{m(m − 1)}
Squared error of the mean: s(x,y) = (x̄ − y)² − s²/m, where s² is the variance of x
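Illustrative implementations of these fair scores, following the formulas above (a sketch: the binary quadratic score expects members and outcome in {0, 1}, and s² is taken as the unbiased sample variance, which is an assumption on my part; the example ensemble values are arbitrary).

```python
import numpy as np

def fair_quadratic(x, y):
    # Binary members and outcome: x_i, y in {0, 1}.
    m, xbar = len(x), np.mean(x)
    return (xbar - y) ** 2 - xbar * (1 - xbar) / (m - 1)

def fair_crps(x, y):
    x = np.asarray(x, dtype=float)
    m = len(x)
    term1 = 2 * np.mean(np.abs(x - y))
    term2 = np.abs(x[:, None] - x[None, :]).sum() / (m * (m - 1))  # sum over i != j
    return term1 - term2

def fair_squared_error_of_mean(x, y):
    m = len(x)
    return (np.mean(x) - y) ** 2 - np.var(x, ddof=1) / m  # s^2: unbiased variance (assumption)

ens = [0.2, -0.5, 1.1, 0.4, -0.1]
print(fair_crps(ens, 0.3), fair_squared_error_of_mean(ens, 0.3))
print(fair_quadratic([1, 0, 0, 1, 0], 1))
```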

25 Local or non-local scoring rules?
Fair scores effectively evaluate the (imperfectly known) distribution from which the ensemble is sampled. Local fair scoring rules favour ensembles with more chance of having members equal to the outcome. Non-local fair scoring rules favour ensembles with more chance of having members near the outcome. at least to within observational uncertainty if not exactly

26 Model simulations: summary
Coherent: Fair scoring rules should be used to rank models, otherwise realistic models will be penalised. Relevant: Assuming model realism is relevant, e.g. to the quality of probability forecasts and decision support. Use a local fair scoring rule if we want model simulations to have more chance of being equal to reality. Use a non-local fair scoring rule if we want model simulations to have more chance of being near reality.

27 Discussion
How relevant is model realism to forecast quality? Should we prefer local or non-local scoring rules? How should we handle observational uncertainty? How well can we detect differences in high dimensions? Proper/fair scoring rules should be used to rank forecasts/models, but they can hide key information (e.g. the direction of a bias). Can scoring rules be designed to help understand performance, or are other methods needed? The choice of local/non-local proper score depends on the decision problem, but the choice is less clear for fair scores when judging model realism.

28 References
Bröcker J, Smith LA (2007) Scoring probabilistic forecasts: the importance of being proper. Weather and Forecasting, 22, 382–388.
Dawid AP, Musio M (2014) Theory and application of proper scoring rules. Metron, 72, 169–183.
Ferro CAT (2014) Fair scores for ensemble forecasts. Quarterly Journal of the Royal Meteorological Society, 140, 1917–1923.
Siegert S, Ferro CAT, Stephenson DB (2015) Correcting the finite-ensemble bias of the ignorance score. arXiv preprint.
Smith LA (1996) Accountability and error in forecasts. In Proceedings of the 1995 Predictability Seminar, ECMWF.

29 extra slides...

30 Measuring performance: scores
Calculate a score for each forecast and then average over the forecasts. Other types of measure (e.g. correlation) are prone to spurious inflation due to trends in the data. Example: naive forecasts achieve correlation 0.95. [Figure legend: Observations; Forecasts]
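A hedged sketch of this point (the numbers are illustrative, not the slide's): when both series share a strong trend, a forecast with no skill at the detrended variability still achieves a correlation near 0.95.

```python
import numpy as np

# Shared trend plus unrelated noise: high correlation despite no skill at anomalies.
rng = np.random.default_rng(2)
t = np.arange(50)
trend = 0.15 * t
obs = trend + rng.normal(0, 0.5, t.size)     # trending observations
naive = trend + rng.normal(0, 0.5, t.size)   # 'forecast' = trend plus independent noise

print("correlation:          ", np.corrcoef(naive, obs)[0, 1])                  # roughly 0.95
print("detrended correlation:", np.corrcoef(naive - trend, obs - trend)[0, 1])  # roughly 0
```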

31 Proper scores: perfect forecasts
Perfect forecast has all probability mass on the outcome. Proper scoring rules give perfect forecasts the best score since s(p,y) = S(p,δ(y)) ≥ S(δ(y),δ(y)) = s(δ(y),y). omit

32 Calibration and sharpness
Proper scoring rules reward calibration and sharpness. Calibration: outcomes are like draws from the forecasts. Sharpness: forecast distributions are concentrated. The expected score is S(p,q) = S(q,q) + C(p,q), where S(q,q) is strictly concave and C(p,q) ≥ 0. C(p,q) measures calibration, since C(p,q) = 0 iff p = q. S(q,q) measures sharpness: by concavity, any mixture of distributions scores worse than the sharpest distribution. Example: for y in {0,1} and p in [0,1] the quadratic score is s(p,y) = (p − y)², for which S(p,q) = q(1 − q) + (p − q)².
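A quick numerical check of this decomposition for the binary quadratic score (a sketch of my own; the probability values are arbitrary):

```python
import numpy as np

# For s(p, y) = (p - y)^2 the expected score splits into S(q, q) = q(1 - q)
# plus C(p, q) = (p - q)^2 >= 0, which vanishes iff p = q.
def expected_score(p, q):
    return q * (p - 1) ** 2 + (1 - q) * p ** 2    # S(p, q) = E_{y~q}[(p - y)^2]

for p, q in [(0.2, 0.2), (0.7, 0.2), (0.5, 0.9)]:
    lhs = expected_score(p, q)
    rhs = q * (1 - q) + (p - q) ** 2
    print(p, q, lhs, rhs, np.isclose(lhs, rhs))   # the decomposition holds exactly
```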

33 Characterization: binary case
Let y = 1 if an event occurs, and let y = 0 otherwise. Let s_{i,y} be the (finite) score when i of m ensemble members forecast the event and reality is y. The (negatively oriented) score is fair if (m − i)(s_{i+1,0} − s_{i,0}) = i(s_{i−1,1} − s_{i,1}) for i = 0, 1, ..., m and s_{i+1,0} ≥ s_{i,0} for i = 0, 1, ..., m − 1. Ferro (2014, Q. J. Roy. Meteorol. Soc.). The equality constraints are necessary and are a discrete analogue of a condition by Savage (1971). The inequality constraints are not necessary but are desirable if the score is negatively oriented. If m = 1 then the only fair scores are trivial: the value of the score depends on the observation but not on the ensemble, so ensembles with one incorrect member score the same as perfect ensembles.
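A small check (my own sketch) that the fair binary quadratic score from slide 24 satisfies the equality constraints, shown for i = 1, ..., m − 1, where every term is defined:

```python
import numpy as np

m = 8

def s(i, y):
    # Fair binary quadratic score with i of m members forecasting the event.
    f = i / m
    return (f - y) ** 2 - f * (1 - f) / (m - 1)

for i in range(1, m):
    lhs = (m - i) * (s(i + 1, 0) - s(i, 0))
    rhs = i * (s(i - 1, 1) - s(i, 1))
    print(i, np.isclose(lhs, rhs))    # True for every i
```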

