Evaluating decadal hindcasts: why and how? Chris Ferro (University of Exeter) T. Fricker, F. Otto, D. Stephenson, E. Suckling CliMathNet Conference (3 July 2013, Exeter, UK)

Evaluating ensemble forecasts Multiple predictions, e.g. model simulations from several initial conditions. Want scores that favour ensembles whose members behave as if they and the observation are drawn from the same probability distribution.

Current practice is unfair Current practice evaluates a proper scoring rule for the empirical distribution function of the ensemble. A scoring rule, s(p,y), for a probability forecast, p, and an observation, y, is proper if (for all p) the expected score, E_y{s(p,y)}, is optimized when y ~ p. Proper scoring rules favour probability forecasts that behave as if the observations are randomly sampled from the forecast distributions.

Examples of proper scoring rules Brier score: s(p,y) = (p – y)² for observation y = 0 or 1, and probability forecast 0 ≤ p ≤ 1. Ensemble Brier score: s(x,y) = (i/n – y)² where i of the n ensemble members predict the event {y = 1}. CRPS: for real y and forecast p(t) = Pr(y ≤ t), with I the indicator function, s(p,y) = ∫ {p(t) – I(y ≤ t)}² dt. Ensemble CRPS: s(x,y) = ∫ {i(t)/n – I(y ≤ t)}² dt, where i(t) of the n ensemble members predict the event {y ≤ t}.
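To make these definitions concrete, here is a minimal Python sketch (my own illustration, not from the talk) evaluating the ensemble Brier score and the ensemble CRPS for a toy ensemble. The CRPS integral is computed through the standard kernel identity for an empirical distribution function, and the threshold defining the binary event is an arbitrary choice.

```python
import numpy as np

def ensemble_brier(x, y, threshold):
    """Ensemble Brier score (i/n - y)^2 for the event {value > threshold},
    where i of the n members forecast the event (threshold is illustrative)."""
    x = np.asarray(x, dtype=float)
    i_over_n = np.mean(x > threshold)     # fraction of members forecasting the event
    occurred = float(y > threshold)       # 1 if the event occurred, else 0
    return (i_over_n - occurred) ** 2

def ensemble_crps(x, y):
    """CRPS of the ensemble's empirical distribution function, via the kernel
    identity (1/n) sum_i |x_i - y| - (1/(2 n^2)) sum_{i,j} |x_i - x_j|."""
    x = np.asarray(x, dtype=float)
    n = x.size
    return (np.mean(np.abs(x - y))
            - np.abs(x[:, None] - x[None, :]).sum() / (2.0 * n ** 2))

# Toy 4-member ensemble and observation (all values made up).
x = np.array([0.2, -0.5, 1.1, 0.4])
print(ensemble_brier(x, y=0.3, threshold=0.0))
print(ensemble_crps(x, y=0.3))
```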

Example: ensemble CRPS Observations y ~ N(0,1) and n ensemble members x_i ~ N(0,σ²) for i = 1,..., n. Plot expected value of the ensemble CRPS against σ. The ensemble CRPS is optimized when the ensemble is underdispersed (σ < 1). [Figure: expected ensemble CRPS against σ for n = 2, 4, 8.]

Fair scoring rules for ensembles Interpret the ensemble as a random sample. Fair scoring rules favour ensembles whose members behave as if they and the observations are sampled from the same distribution. A scoring rule, s(x,y), for an ensemble forecast, x, sampled from p, and an observation, y, is fair if (for all p) the expected score, E_{x,y}{s(x,y)}, is optimized when y ~ p. Fricker, Ferro, Stephenson (2013) Three recommendations for evaluating climate predictions. Meteorological Applications, 20 (open access)

Characterization: binary case Let y = 1 if an event occurs, and let y = 0 otherwise. Let s_{i,y} be the (finite) score when i of n ensemble members forecast the event and the observation is y. The (negatively oriented) score is fair if (n – i)(s_{i+1,0} – s_{i,0}) = i(s_{i-1,1} – s_{i,1}) for i = 0, 1,..., n and s_{i+1,0} ≥ s_{i,0} for i = 0, 1,..., n – 1. Ferro (2013) Fair scores for ensemble forecasts. Submitted.
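As a quick numerical sanity check (my own, not part of the talk), the sketch below verifies that the fair Brier score quoted on the next slide, s_{i,y} = (i/n – y)² – i(n – i)/{n²(n – 1)}, satisfies this characterization for several ensemble sizes.

```python
import numpy as np

def fair_brier_components(n):
    """Tabulate s_{i,y} for the fair Brier score (i/n - y)^2 - i(n-i)/{n^2(n-1)},
    for i = 0,...,n and y = 0, 1."""
    i = np.arange(n + 1)
    penalty = i * (n - i) / (n ** 2 * (n - 1))
    return (i / n) ** 2 - penalty, (i / n - 1) ** 2 - penalty

def is_fair(n):
    """Check the characterization and the monotonicity condition numerically."""
    s0, s1 = fair_brier_components(n)
    eqs = [np.isclose(n * (s0[1] - s0[0]), 0.0),      # i = 0: right-hand side vanishes
           np.isclose(n * (s1[n - 1] - s1[n]), 0.0)]  # i = n: left-hand side vanishes
    eqs += [np.isclose((n - i) * (s0[i + 1] - s0[i]), i * (s1[i - 1] - s1[i]))
            for i in range(1, n)]
    mono = [s0[i + 1] >= s0[i] for i in range(n)]
    return all(eqs) and all(mono)

print([is_fair(n) for n in (2, 4, 8, 16)])   # expect [True, True, True, True]
```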

Examples of fair scoring rules Ensemble Brier score: s(x,y) = (i/n – y)² where i of the n ensemble members predict the event {y = 1}. Fair Brier score: s(x,y) = (i/n – y)² – i(n – i)/{n²(n – 1)}. Ensemble CRPS: s(x,y) = ∫ {i(t)/n – I(y ≤ t)}² dt, where i(t) of the n ensemble members predict the event {y ≤ t}. Fair CRPS: if (x_1,..., x_n) are the n ensemble members, s(x,y) = (1/n) Σ_i |x_i – y| – 1/{2n(n – 1)} Σ_i Σ_j |x_i – x_j|.
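A minimal sketch of these two fair scores in Python (my own illustration; the toy ensemble and event threshold are made up), using the kernel form of the fair CRPS given above:

```python
import numpy as np

def fair_brier(i, n, y):
    """Fair Brier score (i/n - y)^2 - i(n - i)/{n^2 (n - 1)}, where i of the
    n members forecast the event and y is 1 if it occurred, 0 otherwise."""
    return (i / n - y) ** 2 - i * (n - i) / (n ** 2 * (n - 1))

def fair_crps(x, y):
    """Fair CRPS in kernel form:
    (1/n) sum_i |x_i - y| - (1/(2 n (n-1))) sum_{i,j} |x_i - x_j|."""
    x = np.asarray(x, dtype=float)
    n = x.size
    return (np.mean(np.abs(x - y))
            - np.abs(x[:, None] - x[None, :]).sum() / (2.0 * n * (n - 1)))

# Toy 4-member ensemble, observation 0.3, event {value > 0} (all values made up).
x = np.array([0.2, -0.5, 1.1, 0.4])
print(fair_brier(i=int((x > 0).sum()), n=x.size, y=1))
print(fair_crps(x, y=0.3))
```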

Example: ensemble CRPS Observations y ~ N(0,1) and n ensemble members x_i ~ N(0,σ²) for i = 1,..., n. Plot expected value of the fair CRPS against σ. The fair CRPS is always optimized when the ensemble is well dispersed (σ = 1). [Figure: expected unfair and fair CRPS against σ for n = 2, 4, 8; the fair score is minimized at σ = 1 for all n.]
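The behaviour described here is easy to reproduce qualitatively with a short Monte Carlo experiment (my own sketch; the grid of σ values, the ensemble size n = 4 and the number of trials are arbitrary choices). On this grid the unfair score attains its smallest expected value at σ < 1, whereas the fair score is smallest at σ = 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 4, 200_000
y = rng.standard_normal(trials)               # observations ~ N(0, 1)
z = rng.standard_normal((trials, n))          # shared base noise for the ensembles

print(" sigma   unfair CRPS   fair CRPS")
for sigma in (0.4, 0.6, 0.8, 1.0, 1.2):
    x = sigma * z                             # ensemble members ~ N(0, sigma^2)
    mean_abs = np.abs(x - y[:, None]).mean(axis=1)                     # mean_i |x_i - y|
    pair_sum = np.abs(x[:, :, None] - x[:, None, :]).sum(axis=(1, 2))  # sum_{i,j} |x_i - x_j|
    unfair = (mean_abs - pair_sum / (2 * n ** 2)).mean()
    fair = (mean_abs - pair_sum / (2 * n * (n - 1))).mean()
    print(f"  {sigma:.1f}     {unfair:.4f}       {fair:.4f}")
```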

How good are climate predictions? Justify and quantify our judgments about the credibility of climate predictions, i.e. predictions of performance. Extrapolating past performance has little justification. Measure the performance of available experiments and judge whether they are harder or easier problems than the climate predictions of interest. Ensure beliefs agree with these performance bounds. Otto, Ferro, Fricker, Suckling (2013) On judging the credibility of climate predictions. Climatic Change, online (open access)

Summary Use existing data explicitly to justify quantitative predictions of the performance of climate predictions. Evaluate ensemble forecasts (not only probability forecasts) to learn about ensemble prediction system. Use fair scoring rules to favour ensembles whose members behave as if they and the observation are drawn from the same probability distribution.

References Ferro CAT (2013) Fair scores for ensemble forecasts. Submitted. Fricker TE, Ferro CAT, Stephenson DB (2013) Three recommendations for evaluating climate predictions. Meteorological Applications, 20 (open access). Goddard L, and co-authors (2013) A verification framework for interannual-to-decadal predictions experiments. Climate Dynamics, 40. Otto FEL, Ferro CAT, Fricker TE, Suckling EB (2013) On judging the credibility of climate predictions. Climatic Change, online (open access).

Evaluating climate predictions 1. Large trends over the verification period can spuriously inflate the value of some verification measures, e.g. correlation. Scores, which measure the performance of each forecast separately before averaging, are immune to spurious skill. [Figure: two forecast–observation series with correlations 0.06 and 0.84.]
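A small synthetic illustration of this point (my own, not the data behind the slide's figure): adding a common trend to otherwise unrelated forecast and observation series inflates their correlation, while the mean absolute error, which is computed forecast by forecast before averaging, is unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(50)                              # a hypothetical 50-year period

obs_noise = rng.standard_normal(50)            # interannual variability
fcst_noise = rng.standard_normal(50)           # forecasts unrelated to the observations

trend = 0.15 * t                               # common trend added to both series
obs, fcst = obs_noise + trend, fcst_noise + trend

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

def mae(a, b):
    return np.mean(np.abs(a - b))

print("correlation, detrended series:", round(corr(fcst_noise, obs_noise), 2))
print("correlation, with common trend:", round(corr(fcst, obs), 2))
print("mean absolute error, both cases:",
      round(mae(fcst_noise, obs_noise), 2), round(mae(fcst, obs), 2))
```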

Evaluating climate predictions 2. Long-range predictions of short-lived quantities (e.g. daily temperatures) can be well calibrated, and may exhibit resolution. Evaluate predictions for relevant quantities, not only multi-year means.

Evaluating climate predictions 3. Scores should favour ensembles whose members behave as if they and the observation are sampled from the same distribution. ‘Fair’ scores do this; traditional scores do not. [Figure: expected unfair and fair CRPS for ensembles of size n = 2, 4, 8. The unfair continuous ranked probability score is optimized by under-dispersed ensembles of size n.]

Summary Use existing data explicitly to justify quantitative predictions of the performance of climate predictions. Be aware that some measures of performance may be inflated spuriously by climate trends. Consider climate predictions of more decision-relevant quantities, not only multi-year means. Use fair scores to evaluate ensemble forecasts.

Credibility and performance Many factors may influence credibility judgments, but should do so if and only if they affect our expectations about the performance of the predictions. Identify credibility with predicted performance. We must be able to justify and quantify (roughly) our predictions of performance if they are to be useful.

Performance-based arguments Extrapolate past performance on the basis of knowledge of the climate model and the real climate (Parker 2010). Define a reference class of predictions (including the prediction in question) whose performances you cannot reasonably order in advance, measure the performance of some members of the class, and infer the performance of the prediction in question. Popular for weather forecasts (many similar forecasts) but of less use for climate predictions (Frame et al. 2007).

Climate predictions Few past predictions are similar to future predictions, so performance-based arguments are weak for climate. Other data may still be useful: short-range predictions, in-sample hindcasts, imperfect model experiments etc. These data are used by climate scientists, but typically to make qualitative judgments about performance. We propose to use these data explicitly to make quantitative judgments about future performance.

Bounding arguments
1. Form a reference class of predictions that does not contain the prediction in question.
2. Judge if the prediction in question is a harder or easier problem than those in the reference class.
3. Measure the performance of some members of the reference class.
This provides a bound for your expectations about the performance of the prediction in question.

Bounding arguments S = performance of a prediction from reference class C. S′ = performance of the prediction in question, from C′. Let performance be positive with smaller values better. Infer probabilities Pr(S > s) from a sample from class C. If C′ is harder than C then Pr(S′ > s) > Pr(S > s) for all s. If C′ is easier than C then Pr(S′ > s) < Pr(S > s) for all s.
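A sketch of how such a bound might be used in practice (the scores below are hypothetical, not from the hindcast example that follows): estimate Pr(S > s) empirically from the scores measured on the reference class, then read those values as bounds on Pr(S′ > s) once the prediction in question has been judged a harder (or easier) problem.

```python
import numpy as np

# Hypothetical scores (smaller is better) measured for predictions in the
# reference class C, e.g. absolute errors from comparable hindcasts.
reference_scores = np.array([0.12, 0.25, 0.08, 0.31, 0.19, 0.22, 0.15, 0.27])

def exceedance_prob(s, scores=reference_scores):
    """Empirical estimate of Pr(S > s) from the reference-class sample."""
    return float(np.mean(scores > s))

for s in (0.10, 0.20, 0.30):
    p = exceedance_prob(s)
    print(f"Pr(S > {s:.2f}) ~= {p:.2f}; if the prediction in question is a "
          f"harder problem, then Pr(S' > {s:.2f}) exceeds {p:.2f}")
```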

Hindcast example Global mean, annual mean surface air temperature anomalies relative to the mean over the previous 20 years. Initial-condition ensembles of HadCM3 launched every year from 1960 onwards. Measure performance by the absolute errors and consider a lead time of 9 years.
1. Perfect model: predict another HadCM3 member
2. Imperfect model: predict a MIROC5 member
3. Reality: predict HadCRUT4 observations
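A minimal sketch of the verification quantity described above, using a synthetic temperature series in place of the HadCM3 and HadCRUT4 data; the baselining convention and the 0.05 K forecast offset are placeholders, not the study's values.

```python
import numpy as np

rng = np.random.default_rng(2)
years = np.arange(1940, 2011)
# Synthetic global annual mean temperatures: weak trend plus noise (placeholder data).
obs = 0.01 * (years - 1940) + 0.1 * rng.standard_normal(years.size)

def anomaly(series, idx):
    """Value at position idx minus the mean over the previous 20 values."""
    return series[idx] - series[idx - 20:idx].mean()

start_idx = int(np.where(years == 1970)[0][0])   # hindcast launched in 1970
lead = 9
target_idx = start_idx + lead

obs_anom = anomaly(obs, target_idx)
fcst_anom = obs_anom + 0.05          # hypothetical hindcast anomaly, 0.05 K too warm
print(f"lead-{lead} absolute error for the 1970 start: {abs(fcst_anom - obs_anom):.3f} K")
```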

Hindcast example

1. Errors when predicting HadCM3

2. Errors when predicting MIROC5

3. Errors when predicting reality

Recommendations Use existing data explicitly to justify quantitative predictions of the performance of climate predictions. Collect data on more predictions, covering a range of physical processes and conditions, to tighten bounds. Design hindcasts and imperfect model experiments to be as similar as possible to future prediction problems. Train ourselves to be better judges of relative performance, especially to avoid over-confidence.

Future developments Bounding arguments may help us to form fully probabilistic judgments about performance. Let s = (s_1,..., s_n) be a sample from S ~ F(∙|p). Let S′ ~ F(∙|cp) with priors p ~ g(∙) and c ~ h(∙). Then Pr(S′ ≤ s|s) = ∫∫ F(s|cp) h(c) g(p|s) dc dp. Bounding arguments refer to prior beliefs about S′ directly rather than indirectly through beliefs about c.
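One way to read this formula is as a two-stage Monte Carlo average. The sketch below is purely illustrative: it assumes an exponential model for the scores, a conjugate gamma prior for p and a gamma prior for the scaling factor c, none of which come from the talk; it then approximates Pr(S′ ≤ s | s) by drawing p from its posterior, c from its prior, and averaging F(s | cp).

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative assumptions (not from the talk): scores S are Exponential with
# rate p, so F(s|p) = 1 - exp(-p*s); p has a conjugate Gamma(a0, b0) prior; the
# factor c rescales the rate for the prediction in question (c < 1: harder).
a0, b0 = 2.0, 1.0
sample = np.array([0.8, 1.3, 0.6, 1.1, 0.9])    # hypothetical reference scores

# Gamma-Exponential conjugacy gives the posterior g(p | s) in closed form.
a_post, b_post = a0 + sample.size, b0 + sample.sum()

def prob_S_prime_leq(s, draws=100_000):
    """Monte Carlo estimate of Pr(S' <= s | s) = E[F(s | c*p)]."""
    p = rng.gamma(a_post, 1.0 / b_post, size=draws)   # p ~ g(p | s)
    c = rng.gamma(2.0, 0.5, size=draws)               # c ~ h(c), mean 1
    return np.mean(1.0 - np.exp(-c * p * s))          # average of F(s | c*p)

for s in (0.5, 1.0, 2.0):
    print(f"Pr(S' <= {s:.1f} | data) ~= {prob_S_prime_leq(s):.3f}")
```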

Predicting performance We might try to predict performance by forming our own prediction of the predictand. If we incorporate information about the prediction in question then we must already have judged its credibility; if not then we ignore relevant information. Consider predicting a coin toss. Our own prediction is Pr(head) = 0.5. Then our prediction of the performance of another prediction is bound to be Pr(correct) = 0.5 regardless of other information about that prediction.