Verifying and interpreting ensemble products
What do we mean by “calibration” or “post-processing”?
[Figure: raw and post-processed forecast PDFs compared with the observation; the two panels illustrate the “bias” and the “spread” or “dispersion” corrections; x-axis: Temperature [K], y-axis: Probability]
Post-processing has corrected the “on average” bias as well as the under-representation of the 2nd moment of the empirical forecast PDF (i.e. corrected its “dispersion” or “spread”).
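To make the idea concrete, below is a minimal sketch, not the method behind the figure, of a simple two-moment calibration in Python: remove an estimated mean bias and inflate the ensemble anomalies so the spread better matches typical forecast error. The function name, the bias, and the inflation factor are illustrative assumptions.

```python
import numpy as np

def two_moment_calibration(ensemble, bias, inflation):
    """Shift each member by an estimated mean bias and inflate the spread.

    ensemble  : (n_cases, n_members) raw ensemble forecasts
    bias      : estimated mean(forecast - observation) from a training period
    inflation : factor applied to anomalies about the ensemble mean so the
                spread better matches typical forecast error
    """
    debiased = ensemble - bias
    mean = debiased.mean(axis=1, keepdims=True)
    return mean + inflation * (debiased - mean)

# Toy example (hypothetical numbers): warm bias of 1.5 K, spread inflated by 1.4
rng = np.random.default_rng(0)
raw = rng.normal(loc=289.0, scale=0.8, size=(5, 10))   # Temperature [K]
calibrated = two_moment_calibration(raw, bias=1.5, inflation=1.4)
```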
Specific Benefits of Post-Processing
Improvements in:
- statistical accuracy, bias, and reliability: correcting basic forecast statistics (increasing user “trust”)
- discrimination and sharpness: increasing “information content”; in many cases, gains equivalent to years of NWP model development!
- relatively inexpensive!
Statistical post-processing can improve not only the statistical (unconditional) accuracy of forecasts (as measured by reliability diagrams, rank histograms, etc.), but also factors more related to “conditional” forecast behavior (as measured by, say, RPS, skill-spread relations, etc.). Last point: inexpensive, statistically-derived skill improvements can equate to significant (and expensive) NWP developments.
Benefits of Post-Processing (cont.)
Essential for tailoring to local applications: NWP provides spatially- and temporally-averaged gridded forecast output => applying gridded forecasts to point locations requires location-specific calibration to account for spatial and temporal variability (=> increasing ensemble dispersion).
Raw versus Calibrated PDFs
Blue is the “raw” ensemble; black is the calibrated ensemble; red is the observed value.
Notice the significant change in both the “bias” and the dispersion of the final PDF (also notice the PDF asymmetries).
Verifying ensemble (probabilistic) forecasts
Overview:
- Rank histogram
- Mean square error (MSE)
- Brier score
- Ranked Probability Score (RPS)
- Reliability diagram
- Relative Operating Characteristic (ROC) curve
- Skill score
Example: January T, Before vs. After Calibration
[Figure: January temperature traces before (left) and after (right) calibration; the black curve shows observations, colors are ensemble members]
Note the under-bias of the forecasts in the left-hand plot. In the right-hand plot, the observations are now generally embedded within the ensemble bundle. Comparing the left/right ovals, notice the significant variation in the PDFs, both in their non-Gaussian structure and in the change in ensemble spread.
Rank Histograms: measuring the reliability of an ensemble forecast
A rank histogram tallies, over many cases, the rank of the verifying observation within the sorted ensemble; a flat histogram indicates a statistically consistent ensemble.
You cannot verify an ensemble forecast with a single observation. The more data you have for verification (as is true in general for other statistical measures), the more certain you are. Rare (low-probability) events require more data to verify, as do systems with many ensemble members.
From Barb Brown
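A minimal sketch of how one might tally a rank histogram with NumPy; the function name, array shapes, and tie handling are assumptions, and the data are synthetic.

```python
import numpy as np

def rank_histogram(ensemble, obs):
    """Count the rank of each observation within its sorted ensemble.

    ensemble : array of shape (n_cases, n_members)
    obs      : array of shape (n_cases,)
    Returns counts for the n_members + 1 possible ranks.
    """
    n_cases, n_members = ensemble.shape
    # Rank = number of members below the observation (ties are not treated
    # specially here; real applications often break ties randomly).
    ranks = (ensemble < obs[:, None]).sum(axis=1)
    return np.bincount(ranks, minlength=n_members + 1)

# Toy example: 1000 cases, 10 members drawn from the same distribution as obs,
# so the histogram should come out roughly flat.
rng = np.random.default_rng(0)
ens = rng.normal(size=(1000, 10))
ob = rng.normal(size=1000)
print(rank_histogram(ens, ob))
```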
[Figure slide, from Tom Hamill]
Troubled Rank Histograms
[Figure: two example rank histograms, counts vs. ensemble bin 1-10, illustrating non-uniform shapes]
Slide from Matt Pocernic
Example: July T, Before vs. After Calibration
Before calibration: a U-shaped rank histogram (underdispersive), with a larger under- than over-bias. After calibration: the observation falls within the ensemble bundle (the rank histogram is still not perfectly “uniform” due to the small sample size). After quantile regression, the rank histograms are near-uniform.
NOTE: for our current application, improving the information content of the ensemble spread (i.e. as a representation of potential forecast error) is the most significant gain from our calibration.
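As an illustration only, the sketch below uses statsmodels' QuantReg to derive calibrated forecast quantiles from the ensemble mean. The single-predictor setup, the chosen quantiles, and the synthetic training data are assumptions, not the configuration used in the example above.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical training data: ensemble-mean forecasts and matching observations.
rng = np.random.default_rng(1)
ens_mean_train = rng.normal(15.0, 3.0, size=500)                  # degrees C
obs_train = 0.8 * ens_mean_train + 2.0 + rng.normal(0, 1.5, size=500)

X = sm.add_constant(ens_mean_train)                               # intercept + predictor

# Fit one quantile-regression model per desired quantile of the
# calibrated forecast distribution.
quantiles = [0.1, 0.25, 0.5, 0.75, 0.9]
models = {q: sm.QuantReg(obs_train, X).fit(q=q) for q in quantiles}

# Calibrated quantiles for a new forecast (ensemble mean = 18 C).
x_new = sm.add_constant(np.array([18.0]), has_constant='add')
calibrated = {q: m.predict(x_new)[0] for q, m in models.items()}
print(calibrated)
```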
Continuous scores: MSE
Attribute: measures accuracy.
MSE = (1/n) Σ_i (f_i - o_i)², the average of the squared errors:
- it measures the magnitude of the error, weighted by the squares of the errors;
- it does not indicate the direction of the error.
Quadratic rule, therefore large weight on large errors:
- good if you wish to penalize large errors;
- sensitive to large values (e.g. precipitation) and outliers; sensitive to large variance (high-resolution models); encourages conservative forecasts (e.g. climatology).
=> For an ensemble forecast, use the ensemble mean.
Slide from Barbara Casati
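A minimal sketch of the ensemble-mean MSE in Python; the function name and array shapes are assumptions.

```python
import numpy as np

def ensemble_mean_mse(ensemble, obs):
    """MSE of the ensemble-mean forecast: mean of (f_i - o_i)^2.

    ensemble : (n_cases, n_members), obs : (n_cases,)
    """
    f = ensemble.mean(axis=1)          # reduce the ensemble to its mean
    return np.mean((f - obs) ** 2)
```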
Brier Score
Does the forecast correctly detect temperatures above 18 degrees?
BS = (1/n) Σ_{i=1..n} (y_i - o_i)²
where y_i is the forecast probability of the event, o_i is the observed occurrence (0 or 1), and i indexes the n samples.
=> Note the similarity to MSE.
Slide from Barbara Casati
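A minimal sketch, assuming the forecast probability is taken as the fraction of ensemble members exceeding the 18-degree threshold; the function name and threshold default are illustrative.

```python
import numpy as np

def brier_score(ensemble, obs, threshold=18.0):
    """Brier score for the event 'temperature above `threshold`'.

    ensemble : (n_cases, n_members) forecast temperatures
    obs      : (n_cases,) verifying temperatures
    """
    y = (ensemble > threshold).mean(axis=1)   # forecast probability per case
    o = (obs > threshold).astype(float)       # observed occurrence (0 or 1)
    return np.mean((y - o) ** 2)
```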
Ranked Probability Score (RPS), for multi-categorical or continuous variables: the sum over ordered categories of the squared differences between the forecast and observed cumulative probabilities (its continuous limit is the CRPS).
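A minimal per-forecast RPS sketch; the category probabilities are illustrative, and the unnormalized convention is an assumption (some texts divide by K-1).

```python
import numpy as np

def rps(forecast_probs, obs_category):
    """Ranked Probability Score for one forecast over K ordered categories.

    forecast_probs : length-K array of category probabilities (sums to 1)
    obs_category   : index of the observed category (0..K-1)
    """
    K = len(forecast_probs)
    F = np.cumsum(forecast_probs)                        # forecast CDF
    O = (np.arange(K) >= obs_category).astype(float)     # observed CDF (step)
    return np.sum((F - O) ** 2)

# Example: 3 categories (below / near / above normal), observation in category 2
print(rps(np.array([0.2, 0.3, 0.5]), 2))   # 0.29
```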
Conditional Distributions: conditional histogram and conditional box-plot.
Slide from Barbara Casati
Reliability (or Attributes) Diagram
For each forecast-probability bin, the observed relative frequency is plotted against the mean forecast probability; a reliable forecast lies along the diagonal.
Slide from Matt Pocernic
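A minimal sketch of how the points of a reliability diagram might be computed; the function name, binning scheme, and return values are assumptions.

```python
import numpy as np

def reliability_curve(forecast_probs, outcomes, n_bins=10):
    """Points for a reliability diagram.

    forecast_probs : forecast probabilities in [0, 1]
    outcomes       : observed occurrences (0 or 1)
    Returns mean forecast probability, observed frequency, and count per bin.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(forecast_probs, edges) - 1, 0, n_bins - 1)
    mean_fcst, obs_freq, counts = [], [], []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():                       # skip empty bins
            mean_fcst.append(forecast_probs[mask].mean())
            obs_freq.append(outcomes[mask].mean())
            counts.append(int(mask.sum()))
    return np.array(mean_fcst), np.array(obs_freq), np.array(counts)
```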
Scatter-plot and Contingency Table
Does the forecast correctly detect temperatures above 18 degrees? Does it correctly detect temperatures below 10 degrees?
Slide from Barbara Casati
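A minimal sketch of building the 2x2 contingency table for the "above 18 degrees" event; the function name and threshold default are illustrative.

```python
import numpy as np

def contingency_table(forecast, obs, threshold=18.0):
    """2x2 contingency table for the event 'value above `threshold`'."""
    f = forecast > threshold
    o = obs > threshold
    hits              = int(np.sum(f & o))
    false_alarms      = int(np.sum(f & ~o))
    misses            = int(np.sum(~f & o))
    correct_negatives = int(np.sum(~f & ~o))
    return hits, false_alarms, misses, correct_negatives
```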
Discrimination Plot
[Figure: forecast distributions conditioned on outcome = No and outcome = Yes; a decision threshold separates hits from false alarms]
Slide from Matt Pocernic
Receiver Operating Characteristic (ROC) Curve
Hit rate versus false-alarm rate as the decision (probability) threshold is varied.
Slide from Matt Pocernic
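A minimal sketch of computing ROC points from forecast probabilities; the function name, threshold set, and the guard against empty categories are assumptions.

```python
import numpy as np

def roc_points(forecast_probs, outcomes, thresholds=np.linspace(0.0, 1.0, 11)):
    """Hit rate (POD) and false-alarm rate (POFD) at each probability threshold.

    forecast_probs : forecast probabilities in [0, 1]
    outcomes       : observed occurrences (0 or 1)
    """
    outcomes = outcomes.astype(bool)
    pod, pofd = [], []
    for t in thresholds:
        warn = forecast_probs >= t                    # issue a "yes" forecast
        hits         = np.sum(warn & outcomes)
        misses       = np.sum(~warn & outcomes)
        false_alarms = np.sum(warn & ~outcomes)
        corr_neg     = np.sum(~warn & ~outcomes)
        pod.append(hits / max(hits + misses, 1))
        pofd.append(false_alarms / max(false_alarms + corr_neg, 1))
    return np.array(pofd), np.array(pod)              # plot POD against POFD
```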
Skill Scores
A single value to summarize performance relative to a reference forecast, the best naive guess (e.g. persistence or climatology):
SS = (score_forecast - score_reference) / (score_perfect - score_reference)
A perfect forecast implies that the forecast quantity can be perfectly observed.
Positively oriented: positive is good.
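A minimal sketch of the generic skill-score formula above; the function name and the example numbers are illustrative.

```python
def skill_score(score_forecast, score_reference, score_perfect=0.0):
    """Generic skill score: 1 = perfect skill, 0 = no improvement over reference.

    For negatively oriented scores such as MSE, Brier score, or RPS the
    perfect score is 0, so SS = 1 - score_forecast / score_reference.
    """
    return (score_forecast - score_reference) / (score_perfect - score_reference)

# Example (hypothetical numbers): Brier skill score relative to climatology
print(skill_score(score_forecast=0.08, score_reference=0.12))   # ~0.33
```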
References:
Jolliffe, I.T. and D.B. Stephenson (2003): Forecast Verification: A Practitioner’s Guide in Atmospheric Science. Wiley & Sons, 240 pp.
Wilks, D.S. (2005): Statistical Methods in the Atmospheric Sciences. Academic Press, 467 pp.
Stanski, Burrows, and Wilson (1989): Survey of Common Verification Methods in Meteorology.
http://www.eumetcal.org.uk/eumetcal/verification/www/english/courses/msgcrs/index.htm
http://www.bom.gov.au/bmrc/wefor/staff/eee/verif/verif_web_page.html