On judging the credibility of climate predictions Chris Ferro (University of Exeter) Tom Fricker, Fredi Otto, Emma Suckling 12th International Meeting on Statistical Climatology (28 June 2013, Jeju, Korea)

Credibility and performance Many factors may influence credibility judgments, but should do so if and only if they affect our expectations about the performance of the predictions. Identify credibility with predicted performance. We must be able to justify and quantify (roughly) our predictions of performance if they are to be useful.

Performance-based arguments Extrapolate past performance on the basis of knowledge of the climate model and the real climate (Parker 2010). Define a reference class of predictions (including the prediction in question) whose performances you cannot reasonably order in advance, measure the performance of some members of the class, and infer the performance of the prediction in question. Popular for weather forecasts (many similar forecasts) but of less use for climate predictions (Frame et al. 2007).

Climate predictions Few past predictions are similar to future predictions, so performance-based arguments are weak for climate. Other data may still be useful: short-range predictions, in-sample hindcasts, imperfect-model experiments, etc. These data are used by climate scientists, but typically to make qualitative judgments about performance. We propose to use these data explicitly to make quantitative judgments about future performance.

Bounding arguments
1. Form a reference class of predictions that does not contain the prediction in question.
2. Judge if the prediction in question is a harder or easier problem than those in the reference class.
3. Measure the performance of some members of the reference class.
This provides a bound for your expectations about the performance of the prediction in question.

Bounding arguments
S = performance of a prediction from reference class C
S′ = performance of the prediction in question, from class C′
Let performance be positive, with smaller values better. Infer probabilities Pr(S > s) from a sample from class C.
If C′ is harder than C then Pr(S′ > s) > Pr(S > s) for all s.
If C′ is easier than C then Pr(S′ > s) < Pr(S > s) for all s.
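To illustrate how such a bound might be applied, here is a minimal sketch assuming we already have a sample of scores from the reference class; the array `reference_scores` and the threshold values are hypothetical and not taken from the talk.

```python
import numpy as np

# Hypothetical performance scores (smaller is better) measured for members of the
# reference class C, e.g. absolute errors of past predictions.
reference_scores = np.array([0.8, 1.1, 0.6, 1.4, 0.9, 1.2, 0.7, 1.0])

def exceedance_prob(sample, s):
    """Empirical estimate of Pr(S > s) from a sample of reference-class scores."""
    return np.mean(sample > s)

# Evaluate the bound at a few thresholds of interest.
for s in (0.5, 1.0, 1.5):
    p_ref = exceedance_prob(reference_scores, s)
    # If the prediction in question (class C') is judged a harder problem than C,
    # this is a lower bound: Pr(S' > s) > p_ref.
    # If it is judged easier, it is an upper bound: Pr(S' > s) < p_ref.
    print(f"s = {s:.1f}: Pr(S > s) ≈ {p_ref:.2f}  (bound for Pr(S' > s))")
```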

Hindcast example Global mean, annual mean surface air temperature anomalies relative to the mean over the previous 20 years. Initial-condition ensembles of HadCM3 launched every year from 1960. Measure performance by the absolute errors and consider a lead time of 9 years.
1. Perfect model: predict another HadCM3 member
2. Imperfect model: predict a MIROC5 member
3. Reality: predict HadCRUT4 observations
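As a sketch of the performance measure only: the following shows how absolute errors at a fixed lead time could be computed. The array names, shapes and synthetic values are hypothetical stand-ins for the HadCM3 hindcasts and their verification data, and the use of the ensemble-mean error is one plausible reading, not a detail taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hindcast array with shape (start dates, ensemble members, lead years):
# anomalies of global mean, annual mean surface air temperature relative to the mean
# of the previous 20 years. Synthetic numbers stand in for the real HadCM3 output.
n_starts, n_members, n_leads = 40, 10, 10
hindcasts = rng.normal(size=(n_starts, n_members, n_leads))

# Hypothetical verification anomalies aligned with the same start dates and lead years
# (e.g. another HadCM3 member, a MIROC5 member, or HadCRUT4 observations).
verification = rng.normal(size=(n_starts, n_leads))

lead_index = 8  # lead time of 9 years, 0-based index
ensemble_mean = hindcasts[:, :, lead_index].mean(axis=1)
abs_errors = np.abs(ensemble_mean - verification[:, lead_index])

print("Mean absolute error at a 9-year lead:", abs_errors.mean().round(3))
```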

Hindcast example [figure]

1. Errors when predicting HadCM3 [figure]

2. Errors when predicting MIROC5 [figure]

3. Errors when predicting reality [figure]

Recommendations Use existing data explicitly to justify quantitative predictions of the performance of climate predictions. Collect data on more predictions, covering a range of physical processes and conditions, to tighten bounds. Design hindcasts and imperfect model experiments to be as similar as possible to future prediction problems. Train ourselves to be better judges of relative performance, especially to avoid over-confidence.

Evaluating climate predictions 1. Large trends over the verification period can spuriously inflate the value of some verification measures, e.g. correlation. Scores, which measure the performance of each forecast separately before averaging, are immune to this spurious skill. [Figure: correlations of 0.06 and 0.84.]
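A hedged illustration of this point with synthetic data (the numbers below are illustrative and will not reproduce the correlations in the figure): adding a shared trend to otherwise unrelated forecast and observation series inflates the correlation, while the mean absolute error, a score averaged over individual forecasts, is unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
n_years = 50
noise_obs = rng.normal(size=n_years)
noise_fcst = rng.normal(size=n_years)   # forecasts share no interannual signal with obs

trend = 0.1 * np.arange(n_years)        # common warming trend added to both series

# Case 1: no trend, forecasts and observations are independent noise.
obs_flat, fcst_flat = noise_obs, noise_fcst
# Case 2: the same noise plus a shared trend.
obs_trend, fcst_trend = noise_obs + trend, noise_fcst + trend

corr_flat = np.corrcoef(fcst_flat, obs_flat)[0, 1]
corr_trend = np.corrcoef(fcst_trend, obs_trend)[0, 1]
mae_flat = np.mean(np.abs(fcst_flat - obs_flat))
mae_trend = np.mean(np.abs(fcst_trend - obs_trend))

print(f"correlation without trend: {corr_flat:.2f}, with trend: {corr_trend:.2f}")
print(f"MAE without trend: {mae_flat:.2f}, with trend: {mae_trend:.2f}  (identical: the trend cancels)")
```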

Evaluating climate predictions 2. Long-range predictions of short-lived quantities (e.g. daily temperatures) can be well calibrated, and may exhibit resolution. Evaluate predictions for relevant quantities, not only multi-year means.

Evaluating climate predictions 3. Scores should favour ensembles whose members behave as if they and the observation are sampled from the same distribution. ‘Fair’ scores do this; traditional scores do not. [Figure: unfair and fair expected scores for ensembles of size n = 2, 4, 8; the unfair continuous ranked probability score is optimized by under-dispersed ensembles of size n.]

Summary Use existing data explicitly to justify quantitative predictions of the performance of climate predictions. Be aware that some measures of performance may be inflated spuriously by climate trends. Consider climate predictions of more decision-relevant quantities, not only multi-year means. Use fair scores to evaluate ensemble forecasts.

References
Allen M, Frame D, Kettleborough J, Stainforth D (2006) Model error in weather and climate forecasting. In: Predictability of Weather and Climate (eds T Palmer, R Hagedorn), Cambridge University Press.
Ferro CAT (2013) Fair scores for ensemble forecasts. Submitted.
Frame DJ, Faull NE, Joshi MM, Allen MR (2007) Probabilistic climate forecasts and inductive problems. Philos. Trans. R. Soc. A 365.
Fricker TE, Ferro CAT, Stephenson DB (2013) Three recommendations for evaluating climate predictions. Meteorol. Appl., in press.
Goddard L, co-authors (2013) A verification framework for interannual-to-decadal predictions experiments. Clim. Dyn. 40.
Knutti R (2008) Should we believe model predictions of future climate change? Philos. Trans. R. Soc. A 366.
Otto FEL, Ferro CAT, Fricker TE, Suckling EB (2013) On judging the credibility of climate predictions. Clim. Change, in press.
Parker WS (2010) Predicting weather and climate: uncertainty, ensembles and probability. Stud. Hist. Philos. Mod. Phys. 41.
Parker WS (2011) When climate models agree: the significance of robust model predictions. Philos. Sci. 78.
Smith LA (2002) What might we learn from climate forecasts? Proc. Natl. Acad. Sci. 99.
Stainforth DA, Allen MR, Tredger ER, Smith LA (2007) Confidence, uncertainty and decision-support relevance in climate predictions. Philos. Trans. R. Soc. A 365.

Future developments Bounding arguments may help us to form fully probabilistic judgments about performance. Let s = (s_1, ..., s_n) be a sample from S ~ F(·|p). Let S′ ~ F(·|cp) with priors p ~ g(·) and c ~ h(·). Then Pr(S′ ≤ s | s) = ∫∫ F(s|cp) h(c) g(p|s) dc dp. Bounding arguments refer to prior beliefs about S′ directly rather than indirectly through beliefs about c.
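A toy Monte Carlo sketch of this integral, under distributional forms chosen purely for illustration (exponential scores with scale p, lognormal priors for p and c; none of these choices come from the talk). The posterior g(p|s) is handled by importance-weighting prior draws with the likelihood of the observed sample.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Toy model: reference-class scores are Exponential with mean p, so F(s|p) = 1 - exp(-s/p).
# The prediction in question has scores with mean c*p: c > 1 means a harder problem.
sample = np.array([0.9, 1.3, 0.7, 1.6, 1.1])   # hypothetical scores from class C

n_draws = 100_000
p_draws = stats.lognorm.rvs(s=0.5, scale=1.0, size=n_draws, random_state=rng)  # prior g(p)
c_draws = stats.lognorm.rvs(s=0.3, scale=1.2, size=n_draws, random_state=rng)  # prior h(c)

# Importance weights proportional to the likelihood of the sample given p, so that
# weighted averages integrate against the posterior g(p|s).
log_w = np.sum(stats.expon.logpdf(sample[:, None], scale=p_draws), axis=0)
w = np.exp(log_w - log_w.max())
w /= w.sum()

s_star = 1.5
cdf_vals = stats.expon.cdf(s_star, scale=c_draws * p_draws)   # F(s*|cp)
prob = np.sum(w * cdf_vals)                                   # ≈ Pr(S' <= s* | s)
print(f"Pr(S' <= {s_star} | s) ≈ {prob:.2f}")
```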

Fair scores for ensemble forecasts Let s(p, y) be a scoring rule for a probability forecast, p, and observation, y. The rule is proper if its expectation, E_y[s(p, y)], is optimized when y ~ p. No forecasts score better, on average, than the observation’s distribution. Let s(x, y) be a scoring rule for an ensemble forecast, x, sampled randomly from p. The rule is fair if E_{x,y}[s(x, y)] is optimized when y ~ p. No ensembles score better, on average, than those from the observation’s distribution. Fricker et al. (2013), Ferro (2013)

Fair scores: binary characterization Let y = 1 if an event occurs, and let y = 0 otherwise. Let s_{i,y} be the (finite) score when i of n ensemble members forecast the event and the observation is y. The (negatively oriented) score is fair if (n − i)(s_{i+1,0} − s_{i,0}) = i(s_{i−1,1} − s_{i,1}) for i = 0, 1, ..., n, and s_{i+1,0} ≥ s_{i,0} for i = 0, 1, ..., n − 1. Ferro (2013)
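To make the characterization concrete, the sketch below numerically checks both conditions for one candidate score, an ensemble Brier score with a fairness correction, s_{i,y} = (i/n − y)² − i(n − i)/(n²(n − 1)). This particular formula is an assumption taken from the fair-scores literature rather than from the slides; the code simply verifies that it satisfies the stated conditions.

```python
import numpy as np

def fair_binary_score(i, n, y):
    """Candidate fair ensemble Brier score: i of n members forecast the event, y is 0 or 1.
    The form (i/n - y)^2 - i(n - i)/(n^2 (n - 1)) is assumed, not quoted from the talk."""
    return (i / n - y) ** 2 - i * (n - i) / (n ** 2 * (n - 1))

n = 8
s0 = np.array([fair_binary_score(i, n, 0) for i in range(n + 1)])
s1 = np.array([fair_binary_score(i, n, 1) for i in range(n + 1)])

# Condition 1: (n - i)(s_{i+1,0} - s_{i,0}) = i(s_{i-1,1} - s_{i,1}),
# checked for i = 1, ..., n-1 (the endpoint terms are multiplied by zero).
lhs = (n - np.arange(1, n)) * (s0[2:] - s0[1:-1])
rhs = np.arange(1, n) * (s1[:-2] - s1[1:-1])
print("characterization holds:", np.allclose(lhs, rhs))

# Condition 2: s_{i+1,0} >= s_{i,0} for i = 0, ..., n-1 (negatively oriented score).
print("monotonicity holds:", np.all(np.diff(s0) >= -1e-12))
```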

Fair scores: example The (unfair) ensemble version of the continuous ranked probability score is
e(x, y) = ∫ {p_n(t) − I(y ≤ t)}² dt,
where p_n(t) is the proportion of the n ensemble members (x_1, ..., x_n) no larger than t, and where I(A) = 1 if A is true and I(A) = 0 otherwise. A fair version is
e(x, y) = ∫ [ {p_n(t) − I(y ≤ t)}² − p_n(t){1 − p_n(t)}/(n − 1) ] dt.

Fair scores: example Unfair (dashed) and fair (solid) expected scores against σ when y ~ N(0, 1) and x_i ~ N(0, σ²) for i = 1, ..., n. [Figure: curves for n = 2, 4, 8.]
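A sketch that reproduces the spirit of this comparison by Monte Carlo, using the kernel form of the ensemble CRPS, mean |x_i − y| minus a spread term; the only difference between the unfair and fair versions below is whether the spread term is divided by 2n² or by 2n(n − 1). The trial count and σ values are arbitrary choices, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(3)

def ensemble_crps(x, y, fair=False):
    """Ensemble CRPS in kernel form: mean|x_i - y| minus a spread term.
    Unfair version divides the spread sum by 2n^2; fair version by 2n(n-1)."""
    n = len(x)
    term1 = np.mean(np.abs(x - y))
    spread = np.sum(np.abs(x[:, None] - x[None, :]))
    denom = 2 * n * (n - 1) if fair else 2 * n ** 2
    return term1 - spread / denom

def expected_score(n, sigma, n_trials=20000, fair=False):
    scores = np.empty(n_trials)
    for k in range(n_trials):
        y = rng.normal()                      # observation ~ N(0, 1)
        x = rng.normal(scale=sigma, size=n)   # ensemble members ~ N(0, sigma^2)
        scores[k] = ensemble_crps(x, y, fair=fair)
    return scores.mean()

for n in (2, 4, 8):
    for sigma in (0.6, 1.0, 1.4):
        unfair_val = expected_score(n, sigma, fair=False)
        fair_val = expected_score(n, sigma, fair=True)
        print(f"n={n} sigma={sigma:.1f}  unfair={unfair_val:.3f}  fair={fair_val:.3f}")
# The unfair expected score is smallest for an under-dispersed ensemble (sigma < 1),
# while the fair expected score is smallest near sigma = 1, matching the figure's point.
```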

Predicting performance We might try to predict performance by forming our own prediction of the predictand. If we incorporate information about the prediction in question then we must already have judged its credibility; if not then we ignore relevant information. Consider predicting a coin toss. Our own prediction is Pr(head) = 0.5. Then our prediction of the performance of another prediction is bound to be Pr(correct) = 0.5 regardless of other information about that prediction.