1
Verification of Seasonal Forecasts
Simon J. Mason, International Research Institute for Climate and Society, The Earth Institute of Columbia University. Advanced Studies Program (ASP) Summer Colloquium on Forecast Verification in the Atmospheric Sciences and Beyond, Boulder, CO, 06–18 June 2010
2
Introduction
“Are you, Socrates, the one who is called the Expert?”
“Not if it is better,” Socrates replied, “to be called the Idiot.”
“It would be better to be called the Idiot than the Expert Meteorologist!”
(Xenophon, Memorabilia)
How does one evaluate forecasts that are inherently not very good and have a very short history, without looking like an idiot or a cheat?
3
Forecast “goodness”
What makes a good forecast?
- Consistency
- Quality
- Value
(Murphy, A. H., 1993: Wea. Forecasting, 8, 281)
But there are some additional qualitative aspects of forecast goodness …
4
Seasonal forecast formats
Most seasonal forecasts are in one of two classes: a (set of) deterministic forecast(s) – outputs from a dynamical or statistical model – or probabilistic forecasts, as in the map formats on the next two slides.
5
Seasonal forecast formats
(a) Maps showing probabilities of the verification falling within one of multiple categories (by grid)
6
Seasonal forecast formats
(b) Maps showing probabilities of the verification falling within one of multiple categories (by region)
7
What makes a “good” forecast?
Ambiguity
Forecast: USA will do well in the 2010 soccer World Cup
Verification: ?
8
What is the predictand?
What is the seasonal forecast for Timbuktu?
Does the forecast apply to:
- individual stations?
- regions, and if so, of what size?
9
What makes a “good” forecast?
Uncertainty
Forecast: Iceland will not win the 2010 soccer World Cup
Verification: Duh!
10
Seasonal temperature forecasts
The convention is to issue temperature forecasts using categories defined with a 1971–2000 climatology.
[Figure: forecast maps for 2002–2009]
11
What makes a “good” forecast?
Relevance
Forecast: Mario Robbe and Joey ten Berge (the Netherlands) will win the 2007 men’s pairs Darts World Cup
Verification: Who cares?
12
Seasonal rainfall forecasts
Recent flooding in Guatemala and Honduras. Will there be flooding in Tegucigalpa over the next few months?
13
Seasonal rainfall forecasts
For the next 3 months, measure all the rainfall over most of the country; the amount of rainfall we think you will get in 2010 will make it amongst the 10 wettest years we measured between 1971 and 2000 (but …
14
Seasonal rainfall forecasts
the probability is only 45%, so we’re more likely to be wrong than right, and … even if we are right there may be no floods because it may rain frequently but not heavily (and there may be floods even if we’re wrong), and … we might be wrong about the 45%). Consider yourselves forewarned.
15
Seasonal forecasts
Most seasonal forecasts are not “good” because:
- they use outdated, unavailable, and difficult-to-understand references;
- they do not forecast things of much interest;
- the sharpness is poor;
- verification information is very difficult to find (and very difficult to understand if you can find it).
16
Are the forecasts skillful?
Because of the poor predictability, there is an obsession with measuring the “skill” of seasonal forecasts.
[Figure: RPSS of IRI seasonal forecasts (Barnston et al., 2010: J. Appl. Meteor. Climatol., 49, 493)]
17
Skill
Imagine a set of forecasts that indicates probabilities of rainfall (which has a climatological probability of 30%). Suppose that rainfall occurs on 40% of the raised-probability (“green”) forecasts and on 20% of the lowered-probability (“brown”) forecasts. The forecasts correctly indicate times with increased and decreased chances of rainfall, but do so over-confidently. The Brier skill score is −7%.
18
Brier skill score In the previous example, the problem is that the reliability errors (2.5%) more than offset the gain in resolution (1%). Why should this difference be interesting?
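To make the arithmetic concrete, here is a minimal sketch in Python. The issued probabilities of 60% (“green”) and 10% (“brown”) and the equal split between the two forecast types are assumptions, chosen because they are consistent with the numbers quoted above; the decomposition then reproduces the −7% Brier skill score.

```python
import numpy as np

p_clim = 0.3                     # climatological probability of rainfall
p_k    = np.array([0.6, 0.1])    # issued probabilities ("green", "brown") -- assumed
obar_k = np.array([0.4, 0.2])    # observed frequency given each forecast (from the slide)
w_k    = np.array([0.5, 0.5])    # fraction of forecasts of each type -- assumed

reliability = np.sum(w_k * (p_k - obar_k) ** 2)     # 0.025, i.e. 2.5%
resolution  = np.sum(w_k * (obar_k - p_clim) ** 2)  # 0.010, i.e. 1%
uncertainty = p_clim * (1 - p_clim)                 # 0.21

brier = reliability - resolution + uncertainty      # Murphy (1973) decomposition
bss   = 1 - brier / uncertainty                     # climatology as the reference forecast
print(f"BSS = {bss:.1%}")                           # about -7%
```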
19
Skill
The problem with the Brier score (and with the ranked probability scores) is that they try to measure a number of attributes of good probabilistic forecasts at the same time. Most generic scores are difficult to interpret because they measure multiple attributes. “Skill” is a vague attribute – it measures whether one set of forecasts is “better” than another. “Better” in what way(s)? What is the attribute of real interest?
20
Meta-verification
Pertinency: “Any verification scheme should be … chosen to answer the questions of interest.” (Jolliffe and Stephenson, 2003). It is essential to define exactly what the purpose of the verification analysis is, so as to choose an appropriate score.
Intelligibility: the score should have an intuitive interpretation.
Do the forecasts have any potentially usable information?
21
What makes a “good” forecast?
Resolution
Forecast: 2008 will be the warmest year in a century (The Old Farmer’s Almanac, Sep 2007)
Verification: coldest year of the decade
Resolution: was the temperature different when the forecast was for the coldest year in a century?
Resolution is the crucial attribute of a good forecast. If the outcome differs depending on the forecast, then the forecasts have potentially useful information. If the outcome is the same regardless of the forecast, the forecaster can be ignored.
22
Resolution of RCOF Forecasts
October–December Southern Africa rainfall forecasts for 1997–2007, made the preceding August or September; all gridboxes and categories. Each point on the reliability curve is a conditional hit rate. Many forecasts are required.
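As a minimal sketch of how such a curve is built (Python; the forecasts and outcomes are synthetic illustrations, not the RCOF data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500                                          # many forecasts are needed
p_fcst = rng.choice([0.25, 0.33, 0.40], size=n)  # issued probabilities (invented)
obs = (rng.random(n) < p_fcst).astype(int)       # synthetic, perfectly reliable outcomes

for p in np.unique(p_fcst):
    sel = p_fcst == p
    hit_rate = obs[sel].mean()                   # one point on the reliability curve
    print(f"forecast {p:.2f}: conditional hit rate {hit_rate:.2f} (n = {sel.sum()})")
```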
23
What makes a “good” forecast?
Discrimination
Forecast: 2008 will be the warmest year in a century (The Old Farmer’s Almanac, Sep 2007)
Verification: coldest year of the decade
Discrimination: was the forecast different when it was the warmest year of the decade?
Discrimination is an alternative perspective to resolution. If the forecast differs given different outcomes, then the forecasts have potentially useful information. If the forecast is the same regardless of the outcome, the forecaster can be ignored.
24
The classical approach
[Diagram: Forecast 1 + Observation 1 → Score 1; Forecast 2 + Observation 2 → Score 2; Forecast 3 + Observation 3 → Score 3; … Forecast N + Observation N → Score N; the individual scores are combined into a skill score.]
25
An alternative approach
Compare two observations: one observation has a specified characteristic; the other observation lacks this characteristic. Which of these two observations has the characteristic? ⇒ the two-alternative forced choice (2AFC) test
26
Two-Alternative Forced Choice Test
Who said: “I’m the President of the United States and I’m not going to eat any more broccoli”? Ronald Reagan, or George Bush?
27
Two-Alternative Forced Choice Test
In which of these two Januaries did El Niño occur (Niño3.4 index >27°C)? What is the probability of getting the answer correct? 50% (assuming that you do not have inside information about ENSO).
28
Two-Alternative Forced Choice Test
In which of these two Januaries did El Niño occur (Niño3.4 index >27°C)? What is the probability of getting the answer correct now? That depends on whether we can believe the forecasts.
29
Two-Alternative Forced Choice Test
But what if we have more information in our forecasts than just “yes” or “no” warnings of El Niño? In which of these two Januaries did El Niño occur (Niño3.4 index >27°C)? What is the probability of getting the answer correct now? That again depends on whether we can believe the forecasts. Select the forecast with the higher value.
30
Two-Alternative Forced Choice Test
When El Niño occurs, do we forecast warmer temperatures than when it does not occur? The probability that we would forecast higher temperatures for an El Niño than for a non-El Niño is equivalent to a Mann-Whitney U-test. The score is 99%.
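A sketch of that equivalence in Python (the forecast values are invented for illustration, so the result is not the 99% quoted above):

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Forecast temperatures issued ahead of El Nino and non-El Nino Januaries (invented)
fcst_nino  = np.array([27.8, 26.8, 28.1, 27.6])
fcst_other = np.array([26.4, 26.9, 25.8, 26.6, 27.0])

# SciPy >= 1.7 returns the U statistic of the first sample
u, _ = mannwhitneyu(fcst_nino, fcst_other, alternative="greater")
p2afc = u / (len(fcst_nino) * len(fcst_other))  # proportion of correctly ordered pairs

# The same number from explicit pair comparisons (ties count half)
diff = fcst_nino[:, None] - fcst_other[None, :]
direct = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size
assert np.isclose(p2afc, direct)
print(p2afc)                                    # 0.90 for these invented values
```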
31
Two-Alternative Forced Choice Test
The same test can be applied if probabilistic forecasts are issued. What is the probability of getting the answer correct now? That again depends on whether we can believe the forecasts. Select the forecast with the higher probability.
32
Two-Alternative Forced Choice Test
When El Niño occurs, do we forecast El Niño with higher probabilities than when it does not occur? If the forecasts are probabilistic, the test becomes equivalent to calculating the area beneath the ROC curve. The score is 98%.
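Again as a sketch (Python; the probabilities and outcomes are invented, not the data behind the 98%), the ROC area and the explicit 2AFC pair count give the same number:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

p_fcst = np.array([0.8, 0.35, 0.9, 0.3, 0.2, 0.7, 0.1, 0.4])  # forecast P(El Nino), invented
obs    = np.array([1,   1,    1,   0,   0,   1,   0,   0])    # 1 = El Nino occurred

auc = roc_auc_score(obs, p_fcst)    # area beneath the ROC curve

# Identical to the 2AFC pair count (ties scoring half)
pos, neg = p_fcst[obs == 1], p_fcst[obs == 0]
diff = pos[:, None] - neg[None, :]
p2afc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size
assert np.isclose(auc, p2afc)       # 0.9375 for these invented values
```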
33
Relative Operating Characteristics
The most sensible strategy would be to list the years in order of increasing forecast rainfall. If the forecasts are good, the “drought” years should be at the top of the list.
34
Relative Operating Characteristics
For the first guess, treat only the year with the lowest forecast rainfall as a “drought” warning and count the hits and false alarms. Repeat for all forecasts, moving down the list, to trace out the curve.
35
Relative Operating Characteristics
36
Relative Operating Characteristics
The area beneath the curve (0.61 for the below-normal category, and 0.77 for the above-normal) gives us the probability that we will successfully discriminate a “drought” (“wet”) year from a non-drought (“non-wet”) year.
37
Two-Alternative Forced Choice Test
Randomly select a drought year and a non-drought year: there are 5 × 15 = 75 possible pairs. Count the number of yellow years below the orange years – these are the correct 2AFC tests: (number of correct pairs) / (5 × 15) = 0.61.
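A sketch of this counting in Python (the forecast rainfall totals are invented, so the result will differ from the 0.61 obtained from the slide’s data):

```python
import numpy as np

rng = np.random.default_rng(0)
fcst_drought = rng.normal(40.0, 15.0, size=5)   # forecast rainfall, drought years (invented)
fcst_other   = rng.normal(55.0, 15.0, size=15)  # forecast rainfall, other years (invented)

# A pair is correct when the drought year has the lower forecast rainfall
diff = fcst_other[None, :] - fcst_drought[:, None]
n_correct = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)  # ties count half
print(n_correct / (5 * 15))                     # the slide's data give 0.61
```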
38
Two-Alternative Forced Choice Test
But what if the outcome is not a simple “yes” or “no”? The 2AFC test can also be applied to comparisons: does A score higher (or lower) on some characteristic than B?
39
Two-Alternative Forced Choice Test
Who was the highest-paid baseball player in 2009? Alex Rodriguez (NY Yankees, $33 mn) or Manny Ramirez (LA Dodgers, $24 mn)?
40
Two-Alternative Forced Choice Test
In which of these two Januaries did the stronger El Niño occur? What is the probability of getting the answer correct now? That again depends on whether we can believe the forecasts. Select the forecast with the higher value.
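A sketch of this continuous-outcome version in Python (the forecast and observed Niño3.4 values are invented). For every pair of years with different observed values, we check whether the forecasts rank the years the same way as the observations; the resulting score is closely related to Kendall’s tau:

```python
from itertools import combinations
import numpy as np

fcst = np.array([27.1, 26.5, 28.0, 26.2, 27.4])  # forecast Nino3.4 (invented)
obs  = np.array([27.3, 26.4, 27.9, 26.6, 27.0])  # observed Nino3.4 (invented)

score, n_pairs = 0.0, 0
for i, j in combinations(range(len(obs)), 2):
    if obs[i] == obs[j]:
        continue                                 # pairs with equal outcomes are skipped
    n_pairs += 1
    concordant = (fcst[i] - fcst[j]) * (obs[i] - obs[j])
    score += 1.0 if concordant > 0 else (0.5 if concordant == 0 else 0.0)

print(score / n_pairs)   # probability of correctly picking the stronger El Nino
```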
41
Are the forecasts skillful?
Do the forecasts have skill? Do the forecasts have potentially usable information?
[Figures: precipitation and temperature maps]
42
2AFC Tests
If there’s one thing Alan Murphy did abhor,
’Twas an upstart who derived a “new” score.
For, lo and behold! It was always quite old,
For Sir Gilbert had defined it before.
In many forecasting contexts, the 2AFC score is not new but can be reduced to an established statistical test.*
* A version of the Student’s t-test for cases with unequal variance. The 2AFC score is based on this test, but is not equivalent to it.
43
Communication “How often are the forecasts correct?”
44
Verification of individual forecasts
It is perfectly reasonable to ask: “Was last season’s forecast unusually good or bad?”
45
Summary
- Verification procedures that measure individual attributes are recommended; summary measures tend to be too abstract to be easily interpretable.
- If the question “How good are the forecasts?” is vague, then the question “How skillful are the forecasts?” (i.e., “How much better are the forecasts than the reference?”) is equally vague.
- Most naïve users want to know how often the forecasts are correct; prevarication is not an option.
- A good starting point is to address the question “Are the forecasts worth listening to?” by focusing on resolution or discrimination. Only if the answer is “yes” is it worth asking more detailed questions about whether the forecasts can be taken at face value.