Measuring forecast skill: is it real skill or is it the varying climatology? Tom Hamill, NOAA Earth System Research Lab, Boulder, Colorado



Hypothesis: If the climatological event probability varies among samples, then many verification metrics will credit a forecast with extra skill it doesn't deserve - the extra skill comes from the variations in the climatology.

Example: Brier Skill Score. Brier Score: the mean-squared error of probabilistic forecasts. Brier Skill Score: skill relative to some reference, like climatology; 1.0 = perfect forecast, 0.0 = skill of the reference.
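For reference, the standard definitions behind this slide (the formula images are not reproduced in this transcript): for forecast probabilities p_i, binary outcomes o_i, and n forecast-observation pairs,

BS = (1/n) Σ_i (p_i - o_i)^2
BSS = 1 - BS_forecast / BS_reference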

Overestimating skill: example, 5-mm threshold. Location A: P_f = 0.05, P_clim = 0.05, Obs = 0. Location B: P_f = 0.05, P_clim = 0.25, Obs = 0. Locations A and B combined: why is the skill not 0.48? For more detail, see Hamill and Juras, QJRMS, Oct. 2006.
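A worked version of the arithmetic behind that question, assuming the standard Brier score and skill score definitions above (the numbers are reconstructed from the values on the slide, not copied from the original figure):

Location A: BS_f = (0.05 - 0)^2 = 0.0025; BS_clim = (0.05 - 0)^2 = 0.0025; BSS_A = 1 - 0.0025/0.0025 = 0.0.
Location B: BS_f = (0.05 - 0)^2 = 0.0025; BS_clim = (0.25 - 0)^2 = 0.0625; BSS_B = 1 - 0.0025/0.0625 = 0.96.
Averaging the two locations' skill scores gives (0.0 + 0.96)/2 = 0.48. Pooling the two samples before forming the ratio instead gives BSS = 1 - 0.0025 / ((0.0025 + 0.0625)/2) = 1 - 0.0025/0.0325 ≈ 0.92: the pooled calculation credits the forecast with skill that comes only from the difference between the two climatologies.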

Another example of unexpected skill: two islands, zero meteorologists. Imagine a planet with a global ocean and two isolated islands; weather forecasting other than climatology is impossible for either island.
Island 1: forecast and observed uncorrelated, each ~ N(+α, 1).
Island 2: forecast and observed uncorrelated, each ~ N(-α, 1).
0 ≤ α ≤ 5. Event: observed > 0. Forecasts: random ensemble draws from the island's own climatology.
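A minimal simulation sketch of this setup (hypothetical code, not from the original talk; the function and parameter names are assumptions) shows how pooling the two islands manufactures Brier skill even though each island's forecasts are just draws from its own climatology:

```python
import numpy as np

rng = np.random.default_rng(42)

def pooled_bss(alpha, n=100_000, n_members=20):
    """Pooled Brier skill score for the two-island experiment.

    Each island's "forecasts" are random draws from its own climatology,
    so the per-island skill is zero by construction; pooling the islands
    and scoring against the pooled climatology inflates the score.
    """
    means = np.array([+alpha, -alpha])            # island 1, island 2
    island = rng.integers(0, 2, size=n)           # roughly half the samples per island
    obs = rng.normal(means[island], 1.0)          # observed values
    event = (obs > 0.0).astype(float)             # event: observed > 0

    # forecast probability: fraction of ensemble members, drawn from the
    # island's own climatology, that exceed 0
    members = rng.normal(means[island][:, None], 1.0, size=(n, n_members))
    p_fcst = (members > 0.0).mean(axis=1)
    bs_fcst = np.mean((p_fcst - event) ** 2)      # Brier score of the forecasts

    p_clim = event.mean()                         # pooled climatological frequency
    bs_clim = np.mean((p_clim - event) ** 2)      # Brier score of pooled climatology
    return 1.0 - bs_fcst / bs_clim

for alpha in [0.0, 1.0, 2.0, 3.0]:
    print(f"alpha = {alpha:.1f}:  pooled BSS = {pooled_bss(alpha):.2f}")
```

For α = 0 the pooled score sits near zero (slightly negative because of the finite ensemble size); as α grows it climbs toward 1, even though neither island's forecast contains any information beyond its own climatology.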

Two islands, as α increases… (figure: climatological distributions for Island 1 and Island 2 drifting apart). But still, each island's forecast is no better than a random draw from its climatology; expect no skill.

Consider three metrics: (1) Brier Skill Score, (2) Relative Operating Characteristic, (3) Equitable Threat Score. Each will show this tendency for the score to vary depending on how it is calculated.

Relative Operating Characteristic: standard method of calculation. Populate 2x2 contingency tables, a separate one for each sorted ensemble member. The contingency table for the i-th sorted ensemble member is

                               Event forecast by i-th member?
                                  YES        NO
  Event observed?     YES         a_i        b_i
                      NO          c_i        d_i

with a_i + b_i + c_i + d_i = 1. The hit rate for member i is a_i / (a_i + b_i) and the false alarm rate is c_i / (c_i + d_i). The ROC is a plot of hit rate (y) vs. false alarm rate (x), commonly summarized by the area under the curve (AUC): 1.0 for a perfect forecast, 0.5 for climatology.
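A short Python sketch of that pooled calculation (hypothetical helper names, assuming a threshold-exceedance event as in the two-island example; the cell labels follow the table above):

```python
import numpy as np

def roc_auc(ens, obs, threshold=0.0):
    """ROC hit rates, false alarm rates, and area under the curve (pooled).

    ens -- (n_samples, n_members) ensemble forecasts of the verifying variable
    obs -- (n_samples,) observed values; the event is obs > threshold
    """
    observed = obs > threshold
    ens_sorted = np.sort(ens, axis=1)
    hit_rates, false_alarm_rates = [0.0], [0.0]
    # Loop from the lowest sorted member (strictest criterion: all members
    # exceed the threshold) to the highest (most permissive), so the false
    # alarm rate increases along the curve.
    for i in range(ens_sorted.shape[1]):
        forecast = ens_sorted[:, i] > threshold     # event forecast by member i?
        a = np.mean(forecast & observed)            # forecast yes, observed yes
        b = np.mean(~forecast & observed)           # forecast no,  observed yes
        c = np.mean(forecast & ~observed)           # forecast yes, observed no
        d = np.mean(~forecast & ~observed)          # forecast no,  observed no
        hit_rates.append(a / (a + b))               # hit rate for member i
        false_alarm_rates.append(c / (c + d))       # false alarm rate for member i
    hit_rates.append(1.0)
    false_alarm_rates.append(1.0)
    auc = sum(0.5 * (hit_rates[j] + hit_rates[j - 1])
              * (false_alarm_rates[j] - false_alarm_rates[j - 1])
              for j in range(1, len(hit_rates)))    # trapezoidal area under the curve
    return hit_rates, false_alarm_rates, auc
```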

Relative Operating Characteristic (ROC) skill score

Equitable Threat Score: standard method of calculation.

ETS = (H - H_random) / (H + M + F - H_random),   where H_random = (H + M)(H + F) / (H + M + F + C)

and the cells are taken from a single 2x2 contingency table (this ETS is our test statistic):

                           Observed ≥ T?
                             YES                NO
  Fcst ≥ T?      YES         (H) % hit          (F) % false alarm
                 NO          (M) % miss         (C) % correct no
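A small Python helper implementing that formula (a sketch; the cell names follow the table above and may be counts or fractions of the total sample):

```python
def ets(h, m, f, c):
    """Equitable threat score from one 2x2 contingency table.

    h, m, f, c -- hits, misses, false alarms, and correct negatives.
    """
    total = h + m + f + c
    h_random = (h + m) * (h + f) / total          # hits expected by chance
    return (h - h_random) / (h + m + f - h_random)

# illustrative numbers only: ets(0.20, 0.10, 0.15, 0.55) is about 0.28
```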

Skill with conventional methods of calculation: the reference climatology implicitly becomes the combined distribution N(+α,1) + N(-α,1), not N(+α,1) or N(-α,1) separately.

The new implicit reference climatology

Related problem when the means are the same but the climatological variances differ. Event: v > 2.0. Island 1: f ~ N(0,1), v ~ N(0,1), Corr(f,v) = 0.0. Island 2: f ~ N(0,σ), v ~ N(0,σ), 1 ≤ σ ≤ 3, Corr(f,v) = 0.9. Expectation: positive skill over the two islands, but not a function of σ.

The island with the greater climatological uncertainty of the observed event ends up dominating the calculations.

Are standard methods wrong? Assertion: we have just re-defined climatology; these are the correct scores with reference to that climatology. Response: you can calculate them this way, but you shouldn't.
- You will draw improper inferences due to a "lurking variable"; i.e., the varying climatology should itself be a predictor.
- Discerning real skill, or a real skill difference, gets tougher.
"One method that is sometimes used is to combine all the data into a single 2x2 table … this procedure is legitimate only if the probability p of an occurrence (on the null hypothesis) can be assumed to be the same in all the individual 2x2 tables. Consequently, if p obviously varies from table to table, or we suspect that it may vary, this procedure should not be used." - W. G. Cochran, 1954, discussing ANOVA tests

Solutions? (1) Analyze events where the climatological probabilities are the same at all locations, e.g., terciles.

Solutions, continued. (2) Calculate the metrics separately for points with different climatologies, then form the overall number as a sample-weighted average of the per-climatology results; for the ROC, for example, weight each climatological subset's result by n_s(k)/m, the fraction of the samples it contributes.

Real-world examples: (1) Why so little skill for so much reliability? These reliability diagrams (shown in the original slides) were formed from locations with different climatologies; the day-5 usage distribution is not much different from the climatological usage distribution (solid lines).

Degenerate case: skill might appropriately be 0.0 if all samples with a forecast probability of 0.0 are drawn from a climatology with a probability of 0.0, and all samples with a forecast probability of 1.0 are drawn from a climatology with a probability of 1.0 - the forecasts are then perfectly reliable yet add nothing beyond climatology.

(2) Consider Equitable Threat Scores… (1) The ETS is location-dependent, related to the climatological probability. (2) The average of the ETSs at individual grid points = 0.28. (3) The ETS after the data are lumped into one big table = 0.42.

Equitable Threat Score: alternative method of calculation. Consider the possibility of different regions with different climates. Assume n_c contingency tables, each associated with samples sharing a distinct climatological event frequency; n_s(k) of the m samples were used to populate the k-th table. The ETS is calculated separately for each contingency table, and the alternative, weighted-average ETS is calculated as

ETS_weighted = Σ_k [ n_s(k) / m ] ETS(k),   k = 1, …, n_c
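A runnable sketch of this alternative calculation (hypothetical helper names; the ets function repeats the one sketched earlier so this block stands alone), contrasting one big pooled table with the sample-weighted average over regions:

```python
import numpy as np

def ets(h, m, f, c):
    """Equitable threat score from one 2x2 table (repeated from the earlier sketch)."""
    total = h + m + f + c
    h_random = (h + m) * (h + f) / total
    return (h - h_random) / (h + m + f - h_random)

def ets_weighted(tables):
    """Sample-weighted average of per-climatology ETS values.

    tables -- list of (h, m, f, c) count tuples, one per region with a
    distinct climatological event frequency.
    """
    counts = np.array([sum(t) for t in tables], dtype=float)    # n_s(k)
    weights = counts / counts.sum()                             # n_s(k) / m
    return float(np.sum(weights * np.array([ets(*t) for t in tables])))

# Illustrative numbers only: a wet region and a dry region.
wet, dry = (40, 20, 25, 15), (2, 3, 5, 90)
print(f"ETS, one big table      : {ets(*np.add(wet, dry)):.2f}")   # pooled score is higher
print(f"ETS, weighted by region : {ets_weighted([wet, dry]):.2f}")
```

With these made-up counts the pooled table scores noticeably higher than the weighted average, mirroring the 0.42-versus-0.28 contrast on the earlier slide.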

ETS calculated two ways

Conclusions: Many conventional verification metrics, like the BSS, RPSS, threat scores, ROC, potential economic value, etc., can be overestimated if the climatology varies among samples. This results in false inferences: you think there is skill where there is none. It also complicates the evaluation of model improvements: Model A may be better than Model B, but the difference does not appear as large because both models' skill is inflated. Fixes: (1) consider events where the climatology does not vary, such as the exceedance of a quantile of the climatological distribution; (2) combine after calculating the metrics for distinct climatologies. Please: document your method for calculating a score! Acknowledgements: Matt Briggs, Dan Wilks, Craig Bishop, Beth Ebert, Steve Mullen, Simon Mason, Bob Glahn, Neill Bowler, Ken Mylne, Bill Gallus, Frederic Atger, Francois LaLaurette, Zoltan Toth, Jeff Whitaker.