1 Forecast verification - did we get it right? Ian Jolliffe Universities of Reading, Southampton, Kent, Aberdeen

Presentation transcript:

1 Forecast verification - did we get it right? Ian Jolliffe Universities of Reading, Southampton, Kent, Aberdeen

2 Outline of talk
Introduction – what, why, how?
Binary forecasts
– Performance measures, ROC curves
– Desirable properties: of forecasts, of performance measures
Other forecasts
– Multi-category, continuous, (probability)
Value

3 Forecasts
Forecasts are made in many disciplines:
– Weather and climate
– Economics
– Sales
– Medical diagnosis

4 Why verify/validate/assess forecasts?
Decisions are based on past data but also on forecasts of data not yet observed.
A look back at the accuracy of forecasts is necessary to determine whether current forecasting methods should be continued, abandoned or modified.

5 Two (very different) recent references
I. T. Jolliffe and D. B. Stephenson (eds.) (2003) Forecast Verification: A Practitioner's Guide in Atmospheric Science. Wiley.
M. S. Pepe (2003) The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford.

6 Horses for courses
Different types of forecast need different methods of verification, for example in the context of weather hazards (TSUNAMI project):
– binary data – damaging frost: yes/no
– categorical – storm damage: slight/moderate/severe
– discrete – how many land-falling hurricanes in a season
– continuous – height of a high tide
– probabilities – of a tornado
Some forecasts (wordy/descriptive) are very difficult to verify at all.

7 Binary forecasts
Such forecasts might be:
– Whether temperature will fall below a threshold, damaging crops or forming ice on roads
– Whether maximum river flow will exceed a threshold, causing floods
– Whether mortality due to extreme heat will exceed some threshold (PHEWE project)
– Whether a tornado will occur in a specified area
The classic Finley tornado example (next 2 slides) illustrates that assessing such forecasts is more subtle than it looks.
There are many possible verification measures – most have some poor properties.

8 Forecasting tornadoes (the Finley data)

                       Tornado observed   Tornado not observed   Total
Tornado forecast              28                   72              100
Tornado not forecast          23                 2680             2703
Total                         51                 2752             2803

9 Tornado forecasts
Correct decisions: 2708/2803 = 96.6%
Correct decisions by a procedure which always forecasts 'No Tornado': 2752/2803 = 98.2%
It's easy to forecast 'No Tornado' and get it right, but more difficult to forecast when tornadoes will occur.
Correct decisions when a tornado is forecast: 28/100 = 28.0%
Correct forecasts of observed tornadoes: 28/51 = 54.5%

10 Forecast/observed contingency table

                       Event observed   Event not observed   Total
Event forecast               a                  b             a + b
Event not forecast           c                  d             c + d
Total                      a + c              b + d              n

11 Some verification measures for (2 x 2) tables

a/(a+c)      Hit rate = true positive fraction = sensitivity
b/(b+d)      False alarm rate = 1 – specificity
b/(a+b)      False alarm ratio = 1 – positive predictive value
d/(c+d)      Negative predictive value
(a+d)/n      Proportion correct (PC)
(a+b)/(a+c)  Bias
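As a rough sketch (not part of the original slides), these measures can be computed directly from the four cell counts; the function below is illustrative and is applied to the Finley tornado counts from the table two slides back.

```python
# Minimal sketch (not from the slides): the (2 x 2) measures above, computed
# from the cell counts a (hits), b (false alarms), c (misses), d (correct rejections).

def binary_verification_measures(a, b, c, d):
    n = a + b + c + d
    return {
        "hit rate (sensitivity)": a / (a + c),
        "false alarm rate (1 - specificity)": b / (b + d),
        "false alarm ratio": b / (a + b),
        "negative predictive value": d / (c + d),
        "proportion correct": (a + d) / n,
        "bias": (a + b) / (a + c),
    }

# Finley tornado forecasts: a = 28, b = 72, c = 23, d = 2680
for name, value in binary_verification_measures(28, 72, 23, 2680).items():
    print(f"{name}: {value:.3f}")
```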

12 Skill scores
A skill score is a verification measure adjusted to show improvement over some unskilful baseline, typically a forecast of 'climatology', a random forecast or a forecast of persistence.
Usually the adjustment gives a value of zero for the baseline and unity for a perfect forecast.
For (2 x 2) tables we know how to calculate 'expected' values in the cells of the table under a null hypothesis of no association (no skill), as for a χ² test.
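As a worked illustration (not on the original slide), for the tornado table above the expected proportion correct under no skill is

E = [(a+b)(a+c) + (c+d)(b+d)]/n² = [100 × 51 + 2703 × 2752]/2803² ≈ 0.947,

slightly below the achieved proportion correct of 0.966; the skill scores on the next slide measure this difference relative to 1 – E.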

13 More (2 x 2) verification measures

(PC – E)/(1 – E), where E is the expected value of PC assuming no skill – the Heidke (1926) skill score = Cohen's kappa (1960), also Doolittle (1885)
a/(a+b+c)              Critical success index (CSI) = threat score
Gilbert's (1884) skill score – a skill score version of CSI
(ad – bc)/(ad + bc)    Yule's Q (1900), a skill score version of the odds ratio ad/bc
a(b+d)/b(a+c); c(b+d)/d(a+c)   Diagnostic likelihood ratios
Note that neither the list of measures nor the list of names is exhaustive – see, for example, J. A. Swets (1986), Psychological Bulletin, 99.
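Continuing the sketch (again not from the slides; the function name is illustrative, and the Gilbert score is coded in its usual form as the skill-score version of CSI), the same Finley counts give:

```python
# Minimal sketch (not from the slides): skill-type measures for the Finley counts.

def skill_measures(a, b, c, d):
    n = a + b + c + d
    pc = (a + d) / n
    e = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2   # expected PC with no skill
    a_random = (a + b) * (a + c) / n                     # expected hits with no skill
    return {
        "Heidke skill score": (pc - e) / (1 - e),
        "critical success index (CSI)": a / (a + b + c),
        "Gilbert skill score": (a - a_random) / (a + b + c - a_random),
        "Yule's Q": (a * d - b * c) / (a * d + b * c),
        "odds ratio": (a * d) / (b * c),
        "diagnostic LR+": a * (b + d) / (b * (a + c)),
        "diagnostic LR-": c * (b + d) / (d * (a + c)),
    }

for name, value in skill_measures(28, 72, 23, 2680).items():
    print(f"{name}: {value:.3f}")
```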

14 The ROC (Relative Operating Characteristic) curve
Plots hit rate (proportion of occurrences of the event that were correctly forecast) against false alarm rate (proportion of non-occurrences that were incorrectly forecast) for different thresholds.
Especially relevant if a number of different thresholds are of interest.
There are a number of verification measures based on ROC curves; the most widely used is probably the area under the curve.
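A minimal sketch (not from the slides) of how the ROC points and the area under the curve might be computed; the probability forecasts and outcomes below are purely hypothetical.

```python
# Minimal sketch (not from the slides): hit rate and false alarm rate at a set of
# thresholds applied to probability forecasts, and the area under the ROC curve.
import numpy as np

def roc_points(prob, obs, thresholds):
    """Return (false alarm rate, hit rate) arrays, one value per threshold."""
    prob, obs = np.asarray(prob), np.asarray(obs)
    far, hr = [], []
    for t in thresholds:
        warn = prob >= t                       # issue a warning if forecast prob >= t
        hr.append(warn[obs == 1].mean())       # hit rate
        far.append(warn[obs == 0].mean())      # false alarm rate
    return np.array(far), np.array(hr)

rng = np.random.default_rng(1)
obs = rng.integers(0, 2, size=2000)                       # hypothetical event occurrences
prob = np.clip(0.4 * obs + 0.6 * rng.random(2000), 0, 1)  # forecasts loosely tied to outcomes
far, hr = roc_points(prob, obs, np.linspace(0, 1, 21))

# Area under the curve by the trapezium rule (far decreases as the threshold rises).
auc = np.sum((far[:-1] - far[1:]) * (hr[:-1] + hr[1:]) / 2)
print(f"area under ROC curve: {auc:.3f}")
```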

15 Desirable properties of measures: hedging and proper scores
'Hedging' is when a forecaster gives a forecast different from his/her true belief because he/she believes that the hedged forecasts will improve the (expected) score on a measure used to verify the forecasts. Clearly hedging is undesirable.
For probability forecasts, a (strictly) proper score is one for which the forecaster (uniquely) maximises the expected score by forecasting his/her true beliefs, so that there is no advantage in hedging.
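One standard example (the Brier score is not named on this slide, so this is an illustration rather than the talk's own choice): the Brier score (q − y)² is negatively oriented, and its expected value is minimised only by forecasting the true event probability, so it is strictly proper and hedging cannot help.

```python
# Illustration (not from the slides): the Brier score (q - y)^2 is strictly proper.
# For an event with true probability p, the expected score p(q-1)^2 + (1-p)q^2 is
# minimised only at q = p, so issuing a hedged forecast q != p cannot improve it.
import numpy as np

p = 0.3                                     # true probability of the event
q = np.linspace(0, 1, 101)                  # candidate forecasts, including hedged ones
expected_score = p * (q - 1) ** 2 + (1 - p) * q ** 2
print(f"expected Brier score minimised at q = {q[np.argmin(expected_score)]:.2f} (true p = {p})")
```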

16

17 Desirable properties of measures: equitability
A score for a probability forecast is equitable if it takes the same expected value (often chosen to be zero) for all unskilful forecasts of the type:
– Forecast the same probability all the time, or
– Choose a probability randomly from some distribution on the range [0, 1].
Equitability is desirable – if two sets of forecasts are made randomly, but with different random mechanisms, one should not score better than the other.
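To illustrate the definition (an example not on the slide): the expected Brier score of an unskilful constant-probability forecast depends on which constant is chosen, so that (strictly proper) score is not equitable in this sense.

```python
# Illustration (not from the slides): unskilful constant-probability forecasts get
# different expected Brier scores, so the Brier score is not equitable as defined above.
base_rate = 0.3                                  # climatological probability of the event

def expected_brier(q, p=base_rate):
    """Expected (q - y)^2 when the event occurs with probability p."""
    return p * (q - 1) ** 2 + (1 - p) * q ** 2

for q in (0.0, 0.3, 0.5, 1.0):                   # four unskilful constant forecasts
    print(f"constant forecast q = {q}: expected Brier score = {expected_brier(q):.3f}")
```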

18 Desirable properties of measures III
There are a number of other desirable properties of measures, both for probability forecasts and other types of forecast, but equitability and propriety are most often cited in the meteorological literature.
Equitability and propriety are incompatible (a new result).

19 Desirable properties (attributes) of forecasts
Reliability: conditionally unbiased – the expected value of the observation, given the forecast, equals the forecast value.
Resolution: the sensitivity of the expected value of the observation to different forecast values (or, more generally, the sensitivity of this conditional distribution as a whole).
Discrimination: the sensitivity of the conditional distribution of forecasts, given observations, to the value of the observation.
Sharpness: measures the spread of the marginal distribution of forecasts. Equivalent to resolution for reliable (perfectly calibrated) forecasts.
Other lists of desirable attributes exist.

20 A reliability diagram
For a probability forecast of an event based on 850 hPa temperature. Lots of grid points, so lots of forecasts (16380).
Plots observed proportion of event occurrence for each forecast probability vs. forecast probability (solid line).
The forecast probability takes only 17 possible values (0, 1/16, 2/16, …, 15/16, 1) because the forecast is based on the proportion of an ensemble of 16 forecasts that predict the event.
Because of the nature of the forecast event, 0 or 1 are forecast most of the time (see inset sharpness diagram).
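A minimal sketch (not from the slides) of the quantities behind such a diagram, using entirely synthetic data for a hypothetical 16-member ensemble; the diagram on the slide itself comes from real 850 hPa temperature forecasts.

```python
# Minimal sketch (not from the slides): observed relative frequency of the event
# for each distinct forecast probability, the pairs plotted in a reliability diagram.
import numpy as np

rng = np.random.default_rng(0)
true_prob = rng.beta(0.4, 0.4, size=16380)               # hypothetical true event probabilities
outcome = rng.random(16380) < true_prob                  # observed occurrence (True/False)
members = rng.random((16380, 16)) < true_prob[:, None]   # 16 hypothetical ensemble members
forecast = members.mean(axis=1)                          # forecast prob in {0, 1/16, ..., 1}

for p in np.unique(forecast):
    use = forecast == p
    print(f"forecast prob {p:.3f}: observed frequency {outcome[use].mean():.3f} (n = {use.sum()})")
```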

21 Weather/climate forecasts vs medical diagnostic tests
Quite different approaches in the two literatures:
– Weather/climate: lots of measures used; literature on properties, but often ignored; inference (tests, confidence intervals, power) seldom considered.
– Medical (Pepe): far fewer measures; little discussion of properties; more inference – confidence intervals, complex models for ROC curves.

22 Multi-category forecasts
These are forecasts of the form:
– Temperature or rainfall 'above', 'below' or 'near' average (a common format for seasonal forecasts)
– 'Very High Risk', 'High Risk', 'Moderate Risk', 'Low Risk' of excess mortality (PHEWE)
Different verification measures are relevant depending on whether the categories are ordered (as here) or unordered.

23 Multi-category forecasts II
As with binary forecasts, there are many possible verification measures.
With K categories, one class of measures assigns a score to each cell in the (K x K) table of forecast/outcome combinations.
Then multiply the proportion of observations in each cell by its score, and sum over cells to get an overall score.
By insisting on certain desirable properties (equitability, symmetry etc.) the number of possible measures is narrowed.

24 Gerrity (and LEPS) scores for 3 ordered-category forecasts with equal probabilities
Two possibilities are Gerrity scores or LEPS (Linear Error in Probability Space).
In the example, LEPS rewards correct extreme forecasts more, and penalises badly wrong forecasts more, than Gerrity (divide the Gerrity (LEPS) entries by 24 (36) to give the same scaling – an expected maximum value of 1).
Score matrix, Gerrity with LEPS in brackets:

  30 (48)    -6 (-6)   -24 (-42)
  -6 (-6)    12 (12)    -6 (-6)
 -24 (-42)   -6 (-6)    30 (48)
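A minimal sketch (not from the slides) applying the cell-by-cell recipe of the previous slide with these two score matrices; the table of forecast/observed counts is hypothetical.

```python
# Minimal sketch (not from the slides): scoring a 3-category forecast with the
# Gerrity and LEPS matrices above, scaled as on the slide so a perfect forecast
# of three equally likely categories scores 1.
import numpy as np

gerrity = np.array([[ 30, -6, -24],
                    [ -6, 12,  -6],
                    [-24, -6,  30]]) / 24.0
leps    = np.array([[ 48, -6, -42],
                    [ -6, 12,  -6],
                    [-42, -6,  48]]) / 36.0

counts = np.array([[50, 30, 10],       # hypothetical forecast (rows) / observed (columns) counts
                   [25, 60, 25],
                   [10, 30, 60]])
proportions = counts / counts.sum()    # proportion of cases in each cell

print("Gerrity score:", float((proportions * gerrity).sum()))
print("LEPS score:   ", float((proportions * leps).sum()))
```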

25 Verification of continuous variables
Suppose we make forecasts f_1, f_2, …, f_n; the corresponding observed data are x_1, x_2, …, x_n. We might assess the forecasts by computing:
– [|f_1 - x_1| + |f_2 - x_2| + … + |f_n - x_n|]/n (mean absolute error)
– [(f_1 - x_1)² + (f_2 - x_2)² + … + (f_n - x_n)²]/n (mean square error) – or take its square root
– Some form of correlation between the f's and x's
Both MSE and correlation can be highly influenced by a few extreme forecasts/observations.
No time here to explore other possibilities.
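A minimal sketch (not from the slides) of these summary measures for hypothetical forecast and observation series.

```python
# Minimal sketch (not from the slides): MAE, MSE, RMSE and correlation for
# forecasts f and observations x (both hypothetical).
import numpy as np

rng = np.random.default_rng(2)
x = 10 + 3 * rng.standard_normal(100)            # hypothetical observations
f = x + rng.standard_normal(100)                 # hypothetical forecasts with some error

mae = np.mean(np.abs(f - x))
mse = np.mean((f - x) ** 2)
rmse = np.sqrt(mse)
corr = np.corrcoef(f, x)[0, 1]
print(f"MAE = {mae:.2f}, MSE = {mse:.2f}, RMSE = {rmse:.2f}, correlation = {corr:.2f}")
```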

26 Skill or value?
Our examples have looked at assessing skill.
Often we really want to assess value.
This needs quantification of the loss/cost of incorrect forecasts in terms of their 'incorrectness'.

27 Value of tornado forecasts
If wrong forecasts of any sort cost $1K, then the forecasting system costs $95K, but the naive 'No Tornado' system costs only $51K.
If a false alarm costs $1K, but a missed tornado costs $10K, then the system costs $302K, but naivety costs $510K.
If a false alarm costs $1K, but a missed tornado costs $1 million, then the system costs $23.07 million, with naivety costing $51 million.
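These costs follow directly from the false-alarm and miss counts in the Finley table; a minimal sketch (not from the slides) of the same arithmetic:

```python
# Minimal sketch (not from the slides): cost of the tornado forecasting system
# versus the naive 'No Tornado' system under the three cost assumptions above.
def total_cost(false_alarms, misses, cost_false_alarm, cost_miss):
    return false_alarms * cost_false_alarm + misses * cost_miss

for cost_fa, cost_miss in [(1, 1), (1, 10), (1, 1000)]:          # costs in $K
    system = total_cost(72, 23, cost_fa, cost_miss)              # Finley forecasts
    naive = total_cost(0, 51, cost_fa, cost_miss)                # always 'No Tornado'
    print(f"false alarm ${cost_fa}K, miss ${cost_miss}K: system ${system}K, naive ${naive}K")
```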

28 Concluding remarks
Forecasts should be verified.
Forecasts are multi-faceted; verification should reflect this.
Interpretation of verification results needs careful thought.
Much more could be said, for example, on inference, wordy forecasts, continuous forecasts, probability forecasts, ROC curves, value, spatial forecasts etc.

29

30 Continuous variables – LEPS scores
Also, with MSE a difference between forecast and observed of, say, 2°C is treated the same way, whether it is:
– a difference between 1°C above and 1°C below the long-term mean, or
– a difference between 3°C above and 5°C above the long-term mean.
It can be argued that the second forecast is better than the first because the forecast and observed values are closer with respect to the probability distribution of temperature.

31 LEPS scores II
LEPS (Linear Error in Probability Space) scores measure distances with respect to position in a probability distribution.
They start from the idea of using |P_f – P_v|, where P_f and P_v are the positions in the cumulative probability distribution of the measured variable for the forecast and observed values, respectively.
This has the effect of down-weighting differences between extreme forecasts and outcomes, e.g. a forecast/outcome pair 3 and 4 standard deviations above the mean is deemed 'closer' than a pair 1 and 2 SDs above the mean. Hence it gives greater credit to 'good' forecasts of extremes.

32 LEPS scores III
The basic measure is normalized and adjusted to ensure:
– the score is doubly equitable
– no 'bending back'
– a simple value for unskilful and for perfect forecasts
We end up with 3(1 – |P_f – P_v| + P_f² – P_f + P_v² – P_v) – 1.
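A minimal sketch (not from the slides) of this basic LEPS measure as a function, with a few illustrative values showing the extra credit given to correct forecasts of extremes.

```python
# Minimal sketch (not from the slides) of the basic LEPS measure above:
# 3(1 - |Pf - Pv| + Pf^2 - Pf + Pv^2 - Pv) - 1, where Pf and Pv are positions of the
# forecast and observed values in the climatological cumulative distribution.
def leps(pf, pv):
    return 3 * (1 - abs(pf - pv) + pf**2 - pf + pv**2 - pv) - 1

print(leps(0.5, 0.5))    # 0.50  – perfect forecast near the centre of the distribution
print(leps(0.99, 0.99))  # 1.94  – perfect forecast of an extreme scores more
print(leps(0.3, 0.7))    # -0.46 – forecast and observation far apart in probability space
```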

33 LEPS scores IV
LEPS scores can be used on both continuous and categorical data.
A skill score taking values between –100 and 100 (or –1 and 1) for a set of forecasts can be constructed based on the LEPS score, but it is not doubly equitable.
Cross-validation (successively leaving out one of the data points and basing the prediction for that point on a rule derived from all the other data points) can be used to reduce the optimistic bias which exists when the same data are used both to construct and to evaluate a rule. It has been used in some applications of LEPS, but is relevant more widely.
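A minimal sketch (not from the slides) of leave-one-out cross-validation as described above; the 'rule' here is just a climatological mean forecast and the data are hypothetical.

```python
# Minimal sketch (not from the slides): leave-one-out cross-validation, where each
# point is predicted from a rule fitted to all the other points.
import numpy as np

rng = np.random.default_rng(3)
y = 20 + 2 * rng.standard_normal(30)                 # hypothetical observations

errors = []
for i in range(len(y)):
    training = np.delete(y, i)                       # leave observation i out
    prediction = training.mean()                     # fit the (mean-forecast) rule without it
    errors.append((prediction - y[i]) ** 2)
print(f"cross-validated MSE = {np.mean(errors):.2f}")
```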