Quantile regression as a means of calibrating and verifying a mesoscale NWP ensemble

Tom Hopson¹, Josh Hacker¹, Yubao Liu¹, Gregory Roux¹, Wanli Wu¹, Jason Knievel¹, Tom Warner¹, Scott Swerdlin¹, John Pace², Scott Halvorson²
¹ National Center for Atmospheric Research
² U.S. Army Test and Evaluation Command
Outline
I. Motivation: ensemble forecasting and post-processing
II. E-RTFDDA for Dugway Proving Ground
III. Introduction to quantile regression (QR; Koenker and Bassett, 1978)
IV. Post-processing procedure
V. Verification results
VI. Warning: adding dynamical dispersion to an ensemble can put the utility of the ensemble mean at risk
VII. Conclusions
Goals of an EPS (ensemble prediction system)
- Predict the observed distribution of events and atmospheric states
- Predict the uncertainty in the day's prediction
- Predict the extreme events that are possible on a particular day
- Provide a range of possible scenarios for a particular forecast
More technically, an ensemble offers:
1. Greater accuracy of the ensemble-mean forecast (half the error variance of a single forecast)
2. Likelihood of extremes
3. Non-Gaussian forecast PDFs
4. Ensemble spread as a representation of forecast uncertainty
=> All rely on the forecasts being calibrated.

Further, calibration is essential for tailoring forecasts to a local application: NWP provides spatially and temporally averaged gridded output, and applying gridded forecasts to point locations requires location-specific calibration to account for the local spatial and temporal scales of variability (=> increasing the ensemble dispersion).
Dugway Proving Ground, Utah
Example: raw ensemble exceedance frequencies for temperature thresholds. These include random and systematic differences between members, and are not an actual chance of exceedance unless calibrated.
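To make the caveat concrete: the raw "probability" of exceeding a threshold is just the fraction of members above it. A minimal sketch (Python; the member values below are made up):

```python
import numpy as np

def exceedance_frequency(members, threshold):
    """Fraction of ensemble members exceeding a threshold.

    For a raw (uncalibrated) ensemble this reflects random and systematic
    differences between members, not a true exceedance probability.
    """
    return float(np.mean(np.asarray(members, dtype=float) > threshold))

# Hypothetical 30-member temperature forecast [K]
rng = np.random.default_rng(0)
ens = 288.0 + 2.0 * rng.standard_normal(30)
print(exceedance_frequency(ens, 290.0))
```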
Challenges in probabilistic mesoscale prediction
- Model formulation: bias (marginal and conditional); lack of variability caused by truncation and approximation; non-universality of closure and forcing.
- Initial conditions: small scales are damped in analysis systems, and the model must develop them; perturbation methods designed for medium-range systems may not be appropriate.
- Lateral boundary conditions: after short time periods the lateral boundary conditions can dominate; representing their uncertainty is critical.
- Lower boundary conditions: dominate the boundary-layer response; their uncertainty is difficult to estimate.
RTFDDA and Ensemble-RTFDDA
Liu et al. 2010, AMS Annual Meeting, 14th IOAS-AOLS, Atlanta, GA, January 18-23, 2010. Contact: yliu@ucar.edu
The Ensemble Execution Module
Schematic: each of N RTFDDA members is driven by its own perturbations and observations and produces 36-48 h forecasts; the member forecasts feed post-processing, archiving and verification, and input to decision support tools. (Liu et al. 2010, AMS Annual Meeting, 14th IOAS-AOLS, Atlanta, GA. Contact: yliu@ucar.edu)
Real-time Operational Products for DPG
Operated at the US Army Dugway Proving Ground since September 2007, on three nested domains (D1, D2, D3).
- Surface and cross-section products: mean, spread, exceedance probability (e.g. likelihood of wind speed > 10 m/s), spaghetti plots, mean T and wind, T mean and standard deviation, wind speed, 2-m T, wind roses, …
- Pin-point surface and profile products: mean, spread, exceedance probability, spaghetti, wind roses, histograms, …
Forecast "calibration" or "post-processing"
(Figure: forecast PDFs of flow rate [m³/s] vs. probability, with the observation marked, before and after calibration.)
Post-processing has corrected both the "on average" bias and the under-representation of the 2nd moment of the empirical forecast PDF (i.e., it has corrected the PDF's "dispersion" or "spread").
Our approach, the under-utilized "quantile regression" method, aims for:
- a probability distribution function that "means what it says"
- daily variations in the ensemble dispersion that relate directly to changes in forecast skill => an informative ensemble skill-spread relationship
Example of Quantile Regression (QR)
Our application: fitting temperature quantiles using QR conditioned on:
1. the ranked forecast ensemble
2. the ensemble mean
3. the ensemble median
4. the ensemble standard deviation
5. persistence
Post-processing procedure
Regressor set: 1) reforecast ensemble, 2) ensemble mean, 3) ensemble stdev, 4) persistence, 5) logistic-regression quantile (not shown).
Step 1: Determine the climatological quantiles (from the climatological PDF of temperature).
Step 2: For each quantile, use "forward step-wise cross-validation" to iteratively select the best regressor subset. Selection requirements: a) a QR cost-function minimum, and b) satisfying the binomial distribution at 95% confidence. If the requirements are not met, retain the climatological "prior" quantile.
Step 3: Segregate the forecasts into differing ranges of ensemble dispersion and refit the models (Step 2) uniquely for each range.
Final result: a "sharper" posterior PDF, represented by the interpolated quantiles.
(Figures: T [K] vs. time with forecasts and observations; climatological and forecast PDFs, prior vs. posterior.)
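The core of Step 2 is the quantile-regression fit itself, which minimizes the tilted absolute-value ("pinball") cost function for each quantile separately. A minimal sketch in Python using statsmodels' QuantReg; the synthetic data, the lagged-observation stand-in for persistence, and the regressor choices are illustrative assumptions, and the stepwise selection and binomial confidence test are omitted:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Hypothetical training set: past ensemble forecasts and verifying obs.
n_days, n_mem = 500, 20
raw = 285.0 + 3.0 * rng.standard_normal((n_days, n_mem))    # members [K]
obs = raw.mean(axis=1) + 1.5 * rng.standard_normal(n_days)  # verifying T [K]

# Candidate regressors from the slide: ensemble mean, median, stdev,
# and persistence (here crudely faked as the previous day's observation).
X = sm.add_constant(np.column_stack([
    raw.mean(axis=1),
    np.median(raw, axis=1),
    raw.std(axis=1),
    np.roll(obs, 1),
]))

# Fit one conditional quantile at a time; in the full procedure each
# quantile may retain a different regressor subset.
for q in (0.1, 0.5, 0.9):
    res = sm.QuantReg(obs, X).fit(q=q)
    print(q, np.round(res.params, 3))
```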
Utilizing verification measures in near-real-time …
Measures used:
1. Rank histogram (converted to a scalar measure)
2. Root mean square error (RMSE)
3. Brier score
4. Ranked probability score (RPS)
5. Relative operating characteristic (ROC) curve
6. A new measure of ensemble skill-spread utility
=> These are used for automated calibration-model selection, via a weighted sum of the skill scores of each.
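As an example of the first measure, a rank histogram can be built and reduced to a scalar as below. This is a sketch: the RMS deviation from uniformity is one plausible scalar conversion, since the slides do not specify the one actually used.

```python
import numpy as np

def rank_histogram(ens, obs):
    """Counts of the observation's rank within each ensemble forecast.

    ens: (n_cases, n_members) forecasts; obs: (n_cases,) observations.
    """
    n_mem = ens.shape[1]
    ranks = (ens < obs[:, None]).sum(axis=1)       # rank in 0..n_mem
    return np.bincount(ranks, minlength=n_mem + 1)

def flatness_measure(counts):
    """Relative RMS deviation of the bin counts from a flat histogram
    (0 = perfectly uniform; larger = more biased or mis-dispersed)."""
    expected = counts.sum() / counts.size
    return float(np.sqrt(np.mean((counts - expected) ** 2)) / expected)
```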
Problems with spread-skill correlation …
ECMWF spread-skill correlation (black) << 1; even a "perfect model" (blue) has correlation << 1, and the correlation varies with forecast lead time.
(Figure: spread-skill scatterplots at 1-, 7-, 4-, and 10-day leads. ECMWF r = 0.33, 0.39, 0.36, with one panel's ECMWF value missing; "perfect" r = 0.68, 0.56, 0.53, 0.49.)
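The "perfect model" ceiling is easy to reproduce: even when the observation is drawn from exactly the same distribution as the members, the correlation between daily spread and daily ensemble-mean error stays well below 1. A synthetic sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
n_days, n_mem = 2000, 50

# Perfect-model setup: each day has a true spread sigma, and the obs and
# all members are drawn from the SAME distribution.
sigma = rng.uniform(0.5, 2.5, n_days)
obs = sigma * rng.standard_normal(n_days)
members = sigma[:, None] * rng.standard_normal((n_days, n_mem))

spread = members.std(axis=1)
error = np.abs(members.mean(axis=1) - obs)
print(np.corrcoef(spread, error)[0, 1])   # well below 1 even here
```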
3-hr dewpoint time series, before and after calibration (station DPG S01).
42-hr dewpoint time series, before and after calibration (station DPG S01).
PDFs: raw vs. calibrated
Blue is the "raw" ensemble, black is the calibrated ensemble, and red is the observed value. Notice the significant change in both the "bias" and the dispersion of the final PDF (also notice the PDF asymmetries).
3-hr dewpoint rank histograms (station DPG S01).
42-hr dewpoint rank histograms (station DPG S01).
Skill Scores
- A single value to summarize performance.
- Reference forecast: the best naive guess, e.g. persistence or climatology.
- A perfect forecast implies that the object can be perfectly observed.
- Positively oriented: positive is good.
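In the usual formulation (an assumption here; the slide gives no formula), a skill score normalizes a forecast's score between the reference forecast and a perfect forecast, so it is positively oriented by construction:

```python
def skill_score(score, score_ref, score_perfect=0.0):
    """Generic skill score: 1 for a perfect forecast, 0 for the reference
    forecast (e.g. persistence or climatology), negative when worse than
    the reference. score_perfect defaults to 0 (as for RMSE or CRPS)."""
    return (score - score_ref) / (score_perfect - score_ref)
```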
Skill score verification: RMSE skill score and CRPS skill score.
Reference forecasts: black -- raw ensemble, blue -- persistence.
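The CRPS underlying the CRPS skill score can be estimated directly from an ensemble via the standard kernel form (a sketch; not necessarily the exact estimator used in this study):

```python
import numpy as np

def crps_ensemble(ens, obs):
    """Sample-based CRPS for one forecast: mean |member - obs| minus half
    the mean absolute pairwise difference between members."""
    ens = np.asarray(ens, dtype=float)
    term1 = np.mean(np.abs(ens - obs))
    term2 = 0.5 * np.mean(np.abs(ens[:, None] - ens[None, :]))
    return term1 - term2
```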
Computational Resource Questions
How best to utilize multi-model simulations (forecasts), especially if under-dispersive?
a) Should more dynamical variability be sought? Or
b) Is it better to balance post-processing with multi-model utilization to create a properly dispersive, informative ensemble?
3-hr dewpoint rank histograms (station DPG S01).
RMSE of ensemble members at 3-hr and 42-hr lead times (station DPG S01).
Significant calibration regressors at 3-hr and 42-hr lead times (station DPG S01).
Questions revisited
How best to utilize multi-model simulations (forecasts), especially if under-dispersive?
a) Should more dynamical variability be sought? Or
b) Is it better to balance post-processing with multi-model utilization to create a properly dispersive, informative ensemble?
Warning: adding more models can decrease the utility of the ensemble mean (even if the ensemble is under-dispersive).
Summary
- Quantile regression provides a powerful framework for improving the whole (potentially non-Gaussian) PDF of an ensemble forecast, with different regressors for different quantiles and lead times.
- The framework provides an umbrella for blending multiple statistical correction approaches (logistic regression, etc., not shown) as well as multiple regressors.
- "Step-wise cross-validation"-based calibration also ensures forecast skill no worse than climatology and persistence for a variety of cost functions.
- As shown here, significant improvements were made to the forecast's ability to represent its own potential error (while improving sharpness): a uniform rank histogram and a significant spread-skill relationship (new skill-spread measure).
- Care should be taken before "throwing more models" at an "under-dispersive" forecast problem.
Further questions: hopson@ucar.edu or yliu@ucar.edu
Dugway Proving Ground
Other options …
Assign dispersion bins, then either:
2) average the error values in each bin, then correlate the bin means, or
3) calculate an individual rank histogram for each bin and convert it to a scalar measure.
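A sketch of option 2 under these definitions (Python; binning by spread quantiles is my choice, since the slide does not specify the bin edges):

```python
import numpy as np

def binned_spread_skill(spread, error, n_bins=10):
    """Assign forecasts to dispersion bins, average the error in each bin,
    then correlate bin-mean spread with bin-mean error."""
    edges = np.quantile(spread, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.digitize(spread, edges[1:-1]), 0, n_bins - 1)
    mean_spread = np.array([spread[idx == b].mean() for b in range(n_bins)])
    mean_error = np.array([error[idx == b].mean() for b in range(n_bins)])
    return float(np.corrcoef(mean_spread, mean_error)[0, 1])
```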
Example: French Broad River, before calibration => under-dispersive.
The black curve shows the observations; the colors are the ensemble.
Rank Histogram Comparisons
After quantile regression the rank histogram is more uniform (although now slightly over-dispersive). (Panels: raw full ensemble; after calibration.)
What Nash-Sutcliffe (RMSE) implies about utility
Frequency used for quantile fitting, Method I: best model = 76%, ensemble stdev = 13%, ensemble mean = 0%, ranked ensemble = 6%.
Take-home message: for a "calibrated ensemble", the error variance of the ensemble mean is 1/2 the error variance of any ensemble member (on average), independent of the distribution being sampled.
(Figure: forecast PDF of discharge vs. probability, with the observation marked.)
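This is quick to verify numerically. In the sketch below the skewed gamma draws stand in for "any distribution"; when the obs and members are exchangeable, the ratio converges to (1 + 1/N)/2, i.e. 1/2 for a large ensemble:

```python
import numpy as np

rng = np.random.default_rng(3)
n_days, n_mem = 20000, 25

# Calibrated-ensemble idealization: each day, the obs and all members are
# drawn from the same (here deliberately skewed) distribution.
draws = rng.gamma(2.0, 1.5, size=(n_days, n_mem + 1))
obs, members = draws[:, 0], draws[:, 1:]

err_var_member = np.mean((members[:, 0] - obs) ** 2)   # any single member
err_var_mean = np.mean((members.mean(axis=1) - obs) ** 2)
print(err_var_mean / err_var_member)   # ~ (1 + 1/25) / 2 = 0.52
```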
What Nash-Sutcliffe (RMSE) implies about utility (cont.) -- degradation with increased ensemble size
Sequentially averaging the models (ranked by NS score) and computing the resultant NS score shows:
=> a degradation of NS with increasing ensemble size (with a peak at 2 models)
=> for an equitable multi-model, NS should rise monotonically
=> maybe a smaller subset of models would have more utility? (A contradiction for an under-dispersive ensemble?)
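A sketch of that test (Python; Nash-Sutcliffe written in its standard form, NS = 1 - MSE/Var(obs)):

```python
import numpy as np

def nash_sutcliffe(fcst, obs):
    """NS = 1 - MSE(forecast) / variance(obs); 1 is perfect and 0 matches
    the climatological mean of the observations."""
    return 1.0 - np.mean((fcst - obs) ** 2) / np.var(obs)

def sequential_ns(members, obs):
    """Rank members (columns) by individual NS, then return the NS of the
    running ensemble mean as members are added one at a time."""
    order = np.argsort([-nash_sutcliffe(m, obs) for m in members.T])
    return [nash_sutcliffe(members[:, order[:k]].mean(axis=1), obs)
            for k in range(1, members.shape[1] + 1)]
```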
What Nash-Sutcliffe implies about utility (cont.)
Initial frequency used for quantile fitting (earlier results): best model = 76%, ensemble stdev = 13%, ensemble mean = 0%, ranked ensemble = 6%.
Reduced-set frequency used for quantile fitting (using only the top 1/3 of models to rank and form the ensemble mean): best model = 73%, ensemble stdev = 3%, ensemble mean = 32%, ranked ensemble = 29%.
=> There appear to be significant gains in the utility of the ensemble after "filtering" (except for the drop in stdev) … however, "the proof is in the pudding" …
=> Examine the verification skill measures …
Skill score comparisons between the full and "filtered" ensemble sets
GREEN -- full calibrated multi-model; BLUE -- "filtered" calibrated multi-model; reference -- the uncalibrated set.
Points:
-- quite similar results for a variety of skill scores
-- both approaches give appreciable benefit over the original raw multi-model output
-- however, only in the CRPSS is there improvement of the "filtered" ensemble set over the full set
=> The post-processing method is fairly robust. => More work (more filtering?)!