Common verification methods for ensemble forecasts


Common verification methods for ensemble forecasts
Tom Hamill, NOAA Earth System Research Lab, Physical Sciences Division, Boulder, CO
tom.hamill@noaa.gov

What constitutes a "good" ensemble forecast? Here, the observation lies outside the range of the ensemble, which was sampled from the pdf shown. Is this a sign of a poor ensemble forecast?

Four example cases: the observation falls at rank 1 of 21, rank 14 of 21, rank 5 of 21, and rank 3 of 21 among the sorted ensemble members.

One way of evaluating ensembles: "rank histograms" or "Talagrand diagrams." We need lots of samples from many situations to evaluate the characteristics of the ensemble.
A flat histogram happens when the observed is indistinguishable from any other member of the ensemble; the ensemble is "reliable."
A sloped histogram happens when the observed too commonly is lower than the ensemble members, indicating a bias.
A U-shaped histogram happens when there are either some low and some high biases, or when the ensemble doesn't spread out enough.
ref: Hamill, MWR, March 2001
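As a concrete illustration of the bookkeeping involved, here is a minimal sketch in Python (array names such as ens and obs are hypothetical, not from the talk) that tallies the rank of each observation among its ensemble members; a reliable ensemble should yield a roughly flat histogram.

```python
import numpy as np

def rank_histogram(ens, obs, rng=None):
    """Count the rank of each observation within its ensemble.

    ens : (n_cases, n_members) array of ensemble forecasts
    obs : (n_cases,) array of verifying observations
    Returns an array of length n_members + 1 with the count of each rank.
    Ties are split randomly so identical values (e.g., zero precipitation)
    do not all pile into one rank.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n_cases, n_members = ens.shape
    counts = np.zeros(n_members + 1, dtype=int)
    for e, o in zip(ens, obs):
        rank = np.sum(e < o) + rng.integers(0, np.sum(e == o) + 1)
        counts[rank] += 1
    return counts

# A reliable ensemble (members and observation statistically exchangeable)
# should produce a roughly flat histogram, ~ n_cases / (n_members + 1) per bin.
rng = np.random.default_rng(1)
n_cases, n_members = 5000, 20
center = rng.normal(size=n_cases)                              # case-to-case signal
obs = center + rng.normal(size=n_cases)                        # observations
ens = center[:, None] + rng.normal(size=(n_cases, n_members))  # exchangeable members
print(rank_histogram(ens, obs))
```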

Rank histograms of Z500, T850, T2m (from the 1998 reforecast version of the NCEP GFS). Solid lines indicate ranks after bias correction. Rank histograms are particularly U-shaped for T2m, which is probably the most relevant of the three plotted here.

Rank histograms tell us about reliability - but what else is important? "Sharpness" measures the specificity of the probabilistic forecast. Given two reliable forecast systems, the one producing the sharper forecasts is preferable. But sharpness is not desirable if the forecasts are not reliable; it then implies unrealistic confidence.

"Spread-skill" relationships are important, too. Small-spread ensemble forecasts should have less ensemble-mean error than large-spread forecasts. (Figure: three forecast pdfs of increasing spread; the ensemble-mean error from a sample of each pdf should on average be low, moderate, and large, respectively.)

Spread-skill for 1990's NCEP GFS. At a given grid point, the spread S is assumed to be a random variable with a lognormal distribution, where Sm is the mean spread and σ is its standard deviation. As σ increases, there is a wider range of spreads in the sample, and one would then expect the possibility of a larger spread-skill correlation.
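In practice, a simple way to look for a spread-skill relationship (a binning approach, distinct from the lognormal fit described above) is to group cases by ensemble spread and compare the ensemble-mean error in each group. A minimal sketch, with hypothetical array names:

```python
import numpy as np

def binned_spread_skill(ens, obs, n_bins=5):
    """Bin cases by ensemble spread and compute the ensemble-mean RMSE per bin.

    ens : (n_cases, n_members) ensemble forecasts
    obs : (n_cases,) verifying observations
    Returns a list of (mean spread, ensemble-mean RMSE) per spread bin; for a
    well-calibrated ensemble the RMSE should roughly track the mean spread.
    """
    spread = ens.std(axis=1, ddof=1)            # per-case ensemble standard deviation
    err = ens.mean(axis=1) - obs                # per-case ensemble-mean error
    edges = np.quantile(spread, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.digitize(spread, edges[1:-1]), 0, n_bins - 1)
    results = []
    for b in range(n_bins):
        in_bin = idx == b
        results.append((spread[in_bin].mean(),
                        np.sqrt(np.mean(err[in_bin] ** 2))))
    return results
```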

Spread-skill and precipitation forecasts. True spread-skill relationships are harder to diagnose if the forecast PDF is non-normally distributed, as it typically is for precipitation forecasts. Commonly, the spread is no longer independent of the mean value; it is larger when the forecast amount is larger. Hence, you get an apparent spread-skill relationship, but this may reflect variations in the mean forecast rather than a real spread-skill relationship.

Reliability Diagrams

Reliability Diagrams. The curve tells you what the observed frequency was each time you forecast a given probability. This curve ought to lie along the y = x line. Here it shows that the ensemble-forecast system over-forecasts the probability of light rain. Ref: Wilks text, Statistical Methods in the Atmospheric Sciences

Reliability Diagrams Inset histogram tells you how frequently each probability was issued. Perfectly sharp: frequency of usage populates only 0% and 100%. Ref: Wilks text, Statistical Methods in the Atmospheric Sciences
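Both the curve and the inset histogram come from binning the issued probabilities. A minimal sketch, assuming hypothetical arrays prob (issued probabilities) and obs (0/1 event outcomes):

```python
import numpy as np

def reliability_curve(prob, obs, n_bins=10):
    """Observed relative frequency vs. forecast probability, plus usage counts.

    prob : (n,) issued forecast probabilities in [0, 1]
    obs  : (n,) observed outcomes, 1 if the event occurred, else 0
    Returns (mean forecast probability per bin, observed frequency per bin,
    number of forecasts per bin, i.e. the inset histogram).
    """
    prob = np.asarray(prob, dtype=float)
    obs = np.asarray(obs, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.clip(np.digitize(prob, edges) - 1, 0, n_bins - 1)
    fcst_freq, obs_freq, usage = [], [], []
    for b in range(n_bins):
        mask = bin_idx == b
        usage.append(int(mask.sum()))
        fcst_freq.append(prob[mask].mean() if mask.any() else np.nan)
        obs_freq.append(obs[mask].mean() if mask.any() else np.nan)
    return np.array(fcst_freq), np.array(obs_freq), np.array(usage)
```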

Reliability Diagrams. BSS = Brier Skill Score, i.e., BSS = 1 - BS(forecast) / BS(climatology), where BS(•) is the Brier Score, which you can think of as the squared error of a probabilistic forecast. Perfect: BSS = 1.0. Climatology: BSS = 0.0. Ref: Wilks text, Statistical Methods in the Atmospheric Sciences

Brier Score. Define an event, e.g., precip > 2.5 mm. Let p_i be the forecast probability for the ith forecast case, and let o_i be the observed probability (1 or 0). Then

BS = (1/n) Σ_{i=1..n} (p_i - o_i)²

(so the Brier score is the averaged squared error of the probabilistic forecast).
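A minimal sketch of the Brier score and the corresponding skill score from the reliability-diagram slide (hypothetical names; climatology is taken as the constant sample event frequency):

```python
import numpy as np

def brier_score(prob, obs):
    """Averaged squared error of probabilistic forecasts of a binary event."""
    prob = np.asarray(prob, dtype=float)
    obs = np.asarray(obs, dtype=float)          # 1 if the event occurred, else 0
    return np.mean((prob - obs) ** 2)

def brier_skill_score(prob, obs):
    """BSS = 1 - BS(forecast) / BS(climatology); climatology is taken as the
    constant sample event frequency. Perfect = 1.0, climatology = 0.0."""
    obs = np.asarray(obs, dtype=float)
    clim = np.full_like(obs, obs.mean())
    return 1.0 - brier_score(prob, obs) / brier_score(clim, obs)
```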

Reliability after post-processing Statistical correction of forecasts using a long, stable set of prior forecasts from the same model (like in MOS). Ref: Hamill et al., MWR, Nov 2006

Cumulative Distribution Function (CDF): F_f(x) = Pr {X ≤ x}, where X is the random variable and x is some specified threshold.

Continuous Ranked Probability Score. Let F_i^f(x) be the forecast probability CDF for the ith forecast case, and let F_i^o(x) be the observed probability CDF (a Heaviside function, stepping from 0 to 1 at the observed value). Then

CRPS = (1/n) Σ_{i=1..n} ∫ [F_i^f(x) - F_i^o(x)]² dx

i.e., the average over cases of the integrated squared difference between the forecast and observed CDFs.
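For an ensemble, the forecast CDF can be approximated by the empirical CDF of the members and the integral evaluated numerically. A minimal sketch (hypothetical names; the integration grid is supplied by the user and should span the range of the forecasts and observations):

```python
import numpy as np

def crps_ensemble(ens, obs, grid):
    """Average CRPS of ensemble forecasts, approximating the integral of
    [F_fcst(x) - F_obs(x)]^2 on a user-supplied grid of x values.

    ens  : (n_cases, n_members) ensemble forecasts
    obs  : (n_cases,) verifying observations
    grid : (n_x,) increasing grid spanning the range of forecasts and obs
    """
    dx = np.diff(grid)
    total = 0.0
    for e, o in zip(ens, obs):
        f_cdf = np.mean(e[:, None] <= grid[None, :], axis=0)  # empirical forecast CDF
        o_cdf = (grid >= o).astype(float)                     # Heaviside step at the obs
        sq = (f_cdf - o_cdf) ** 2
        total += np.sum(0.5 * (sq[:-1] + sq[1:]) * dx)        # trapezoidal integration
    return total / len(obs)
```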

Continuous Ranked Probability Skill Score (CRPSS). Like the Brier score, it's common to convert this to a skill score by referencing the CRPS of climatology: CRPSS = 1 - CRPS(forecast) / CRPS(climatology).

Relative Operating Characteristic (ROC)

Method of Calculation of ROC: Parts 1 and 2
(1) For a chosen event threshold T, build a contingency table for each sorted ensemble member: for every case, ask Obs ≥ T? and Fcst ≥ T? (with the forecast taken as that sorted member) and increment the corresponding cell.
(2) Repeat the process for other locations and dates, building up the contingency tables for the sorted members.

Method of Calculation of ROC: Part 3
(3) Get the hit rate (HR) and false alarm rate (FAR) from the contingency table for each sorted ensemble member, where H = hits, M = misses, F = false alarms, C = correct negatives:

HR = H / (H + M)    FAR = F / (F + C)

Sorted member     H       F      M       C      HR     FAR
1              1106       3   5651   73270   0.163   0.000
2              3097     176   3630   73097   0.504   0.002
3              4020     561   2707   72712   0.597   0.007
4              4692    1270   2035   72003   0.697   0.017
5              5297    2655   1430   70618   0.787   0.036
6              6603   44895    124   28378   0.981   0.612

Method of Calculation of ROC: Part 4
(4) Collect the (FAR, HR) pairs, add the trivial (0, 0) and (1, 1) end points, and plot hit rate vs. false alarm rate:
HR  = [0.000, 0.163, 0.504, 0.597, 0.697, 0.787, 0.981, 1.000]
FAR = [0.000, 0.000, 0.002, 0.007, 0.017, 0.036, 0.612, 1.000]
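A minimal sketch of steps (1)-(4) (hypothetical names): each sorted member in turn serves as the yes/no decision criterion, its contingency table is accumulated, and the resulting (FAR, HR) pairs trace out the ROC curve.

```python
import numpy as np

def roc_points(ens, obs, threshold):
    """Hit rate and false alarm rate using each sorted ensemble member in turn
    as the yes/no decision criterion for the event obs >= threshold.

    ens : (n_cases, n_members) ensemble forecasts
    obs : (n_cases,) observations
    Returns (FAR, HR) arrays including the trivial (0, 0) and (1, 1) end points.
    """
    ens_sorted = np.sort(ens, axis=1)
    event = obs >= threshold
    far, hr = [0.0], [0.0]
    # Sorted member 1 (the smallest) is the strictest criterion: "yes" only when
    # every member exceeds the threshold. Larger sorted members are more permissive.
    for k in range(ens_sorted.shape[1]):
        fcst_yes = ens_sorted[:, k] >= threshold
        h = np.sum(fcst_yes & event)       # hits
        m = np.sum(~fcst_yes & event)      # misses
        f = np.sum(fcst_yes & ~event)      # false alarms
        c = np.sum(~fcst_yes & ~event)     # correct negatives
        hr.append(h / (h + m))
        far.append(f / (f + c))
    far.append(1.0)
    hr.append(1.0)
    return np.array(far), np.array(hr)
```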

Economic Value Diagrams. Motivated by the search for a metric that relates ensemble forecast performance to things that customers will actually care about. These diagrams tell you the potential economic value of your ensemble forecast system applied to a particular forecast aspect. A perfect forecast has a value of 1.0; climatology has a value of 0.0. Value differs with the user's cost/loss ratio.

Economic Value: calculation method. Assumes the decision maker alters actions based on weather forecast information.
C = cost of protection
L = Lp + Lu = total cost of a loss, where
  Lp = loss that can be protected against
  Lu = loss that can't be protected against
N = no cost

Economic value, continued. Suppose we have the contingency table of forecast outcomes, [h, m, f, c]. Then we can calculate the expected expense from using the forecast, from using climatology, and from a perfect forecast. Note that value will vary with C, Lp, and Lu; different users with different protection costs may experience a different value from the forecast system.
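A minimal sketch of that calculation under the standard static cost/loss framework (hypothetical function and argument names; h, m, f, c are relative frequencies summing to 1, and 0 < C < Lp is assumed):

```python
def economic_value(h, m, f, c, C, Lp, Lu=0.0):
    """Relative economic value of a forecast system for one user.

    h, m, f, c : relative frequencies of hits, misses, false alarms, and
                 correct negatives (they should sum to 1)
    C  : cost of taking protective action (assumed 0 < C < Lp)
    Lp : loss that protection would have prevented
    Lu : loss incurred whether or not protective action is taken
    Returns (E_climate - E_forecast) / (E_climate - E_perfect):
    1.0 for a perfect forecast, 0.0 for climatology, negative if worse.
    """
    s = h + m                                   # climatological event frequency
    e_forecast = h * (C + Lu) + f * C + m * (Lp + Lu)
    e_climate = min(C + s * Lu, s * (Lp + Lu))  # always protect vs. never protect
    e_perfect = s * (C + Lu)                    # protect only when the event occurs
    return (e_climate - e_forecast) / (e_climate - e_perfect)

# Sweeping C/Lp from near 0 to near 1 for a fixed contingency table traces out one
# value curve; repeating for each sorted member gives the family of curves.
```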

From ROC to Economic Value Value is now seen to be related to FAR and HR, the components of the ROC curve. A (HR, FAR) point on the ROC curve will thus map to a value curve (as a function of C/L)

The red curve is from the ROC data for the member defining the 90th percentile of the ensemble distribution; the green curve is for the 10th percentile. The overall economic value is the maximum over members (use whichever member as the decision threshold provides the best economic value for a given cost/loss ratio).

Forecast skill is often overestimated!
- Suppose you have a sample of forecasts from two islands, and each island has a different climatology.
- Weather forecasts are impossible on both islands; simulate each "forecast" with an ensemble of draws from the local climatology: Island 1: F ~ N(μ, 1); Island 2: F ~ N(-μ, 1).
- Calculate ROCSS, BSS, ETS in the normal way over the pooled sample. One would expect no skill.
- Yet as the climatologies of the two islands begin to differ, the apparent "skill" increases even though the samples are drawn from climatology. These scores falsely attribute differences in the samples' climatologies to skill of the forecast.
- Samples must have the same climatological event frequency to avoid this.
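A minimal sketch of that two-island thought experiment (all names hypothetical), verifying the Brier skill score against the pooled climatology: the apparent skill grows as the two climatologies separate, even though every "forecast" is just a draw from the local climatology.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000                      # forecast cases per island
threshold = 0.0                # event: variable exceeds 0

def pooled_bss(mu, n_members=20):
    """BSS pooled over two islands with climatologies N(+mu, 1) and N(-mu, 1);
    each 'forecast' is an ensemble drawn from that island's own climatology."""
    prob, obs = [], []
    for sign in (+1.0, -1.0):
        truth = sign * mu + rng.normal(size=n)
        ens = sign * mu + rng.normal(size=(n, n_members))
        prob.append(np.mean(ens > threshold, axis=1))   # ensemble relative frequency
        obs.append((truth > threshold).astype(float))
    prob, obs = np.concatenate(prob), np.concatenate(obs)
    bs = np.mean((prob - obs) ** 2)
    bs_clim = np.mean((obs.mean() - obs) ** 2)          # pooled climatological forecast
    return 1.0 - bs / bs_clim

# With mu = 0 the BSS is ~0 (slightly negative from finite ensemble size); as mu
# grows, the pooled "skill" rises even though neither island's forecast has any.
for mu in (0.0, 0.5, 1.0, 2.0):
    print(mu, round(pooled_bss(mu), 3))
```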

Useful References

Good overall references for forecast verification:
(1) Wilks, D. S., 2006: Statistical Methods in the Atmospheric Sciences (2nd Ed.). Academic Press, 627 pp.
(2) Beth Ebert's forecast verification web page, http://tinyurl.com/y97c74

Rank histograms: Hamill, T. M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. Mon. Wea. Rev., 129, 550-560.

Spread-skill relationships: Whitaker, J. S., and A. F. Loughe, 1998: The relationship between ensemble spread and ensemble mean skill. Mon. Wea. Rev., 126, 3292-3302.

Brier score, continuous ranked probability score, reliability diagrams: Wilks text again.

Relative operating characteristic: Harvey, L. O., Jr., and others, 1992: The application of signal detection theory to weather forecasting behavior. Mon. Wea. Rev., 120, 863-883.

Economic value diagrams:
(1) Richardson, D. S., 2000: Skill and relative economic value of the ECMWF ensemble prediction system. Quart. J. Royal Meteor. Soc., 126, 649-667.
(2) Zhu, Y., and others, 2002: The economic value of ensemble-based weather forecasts. Bull. Amer. Meteor. Soc., 83, 73-83.

Overforecasting skill: Hamill, T. M., and J. Juras, 2006: Measuring forecast skill: is it real skill or is it the varying climatology? Quart. J. Royal Meteor. Soc., Jan 2007 issue. http://tinyurl.com/kxtct