1 Verification Continued… Holly C. Hartmann Department of Hydrology and Water Resources University of Arizona RFC Verification Workshop, 08/14/2007
2 Agenda
1. Introduction to Verification
   - Applications, Rationale, Basic Concepts
   - Data Visualization and Exploration
   - Deterministic Scalar Measures
2. Categorical Measures – KEVIN WERNER
   - Deterministic Forecasts
   - Ensemble Forecasts
3. Diagnostic Verification
   - Reliability
   - Discrimination
   - Conditioning/Structuring Analyses
4. Lab Session/Group Exercise
   - Developing Verification Strategies
   - Connecting to Forecast Operations and Users
3 Probabilistic Ensemble Forecasts From: California-Nevada River Forecast Center
5 Probabilistic Ensemble Forecasts From: A. Hamlet, University of Washington
8 Talagrand Diagram – Also Called Ranked Histogram
- Identifies systematic flaws of an ensemble prediction system.
- Shows how effectively the ensemble distribution samples the observations.
- Does not indicate that the ensemble will be of practical use.
9 Principle Behind the Talagrand Diagram (Ranked Histogram)
- With only one ensemble member ( | ), all (100%) of the observations should fall "outside" it.
- With two ensemble members, two out of three observations (2/3 = 67%) should fall outside.
- With three ensemble members, two out of four observations (2/4 = 50%) should fall outside.
- In general, 2/(number of members + 1) of the observations should fall outside the ensemble.
Adapted from A. Persson, 2006
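A quick check of the principle above (notation mine, not from the slide): with N ranked members there are N + 1 bins, and if the observation is statistically indistinguishable from the members it is equally likely to land in any bin, so the chance of landing in one of the two outermost bins is

```latex
P(\text{obs outside ensemble}) = \frac{2}{N+1},
\qquad
N=1:\ \tfrac{2}{2}=100\%,\quad
N=2:\ \tfrac{2}{3}\approx 67\%,\quad
N=3:\ \tfrac{2}{4}=50\%
```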
10 Talagrand Diagram Computation Example
Four sample ensemble members (E1-E4) for daily flow forecasts (produced from reforecasts using carryover each year); table columns: YEAR, E1, E2, E3, E4, OBS.
Step 1: Rank the members lowest to highest for each year; four members give 5 bins.
Step 2: Determine which bin the corresponding observation falls into.
Step 3: Tally how many observations fall in each bin.
Step 4: Plot the frequency of observations for each ranked bin (Bin 1 through Bin 5).
12 Talagrand Diagram Computation Example (continued)
Four sample ensemble members (E1-E4) ranked lowest to highest for daily flow (produced from reforecasts using carryover each year); the tally for each bin (Bin 1 through Bin 5) is plotted as a frequency histogram.
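A minimal Python sketch of Steps 1-4, assuming the per-year ensembles and observations are available as arrays; the numbers below are illustrative placeholders, not the slide's table.

```python
import numpy as np

# Illustrative data only: 4 ensemble members (E1-E4) and the observation
# for each reforecast year (the slide's actual table is not reproduced here).
ensembles = np.array([
    [120, 180, 220, 260],
    [ 90, 150, 200, 240],
    [200, 230, 280, 320],
])
obs = np.array([250, 95, 260])

n_members = ensembles.shape[1]
bin_counts = np.zeros(n_members + 1, dtype=int)    # 4 members -> 5 bins

for members, o in zip(ensembles, obs):
    ranked = np.sort(members)                      # Step 1: rank low to high
    bin_idx = np.searchsorted(ranked, o)           # Step 2: bin of the observation
    bin_counts[bin_idx] += 1                       # Step 3: tally

frequency = bin_counts / bin_counts.sum()          # Step 4: frequency per bin
print(frequency)
```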
13 Talagrand Diagram: 25 traces/ensemble, 375 observations
- "U-shaped": observations too often fall outside the ensemble; indicates the ensemble spread is too small.
- "L-shaped": observations too often larger (smaller) than the ensemble; indicates an under- (over-) forecasting bias.
- "N-shaped" (dome-shaped): observations too rarely fall outside the ensemble; indicates the ensemble spread is too big.
- "Flat": observations fall uniformly across the ensemble; indicates an appropriately sized ensemble distribution.
14 Talagrand Diagram Example: Interpretation?
Four sample ensemble members (E1-E4) ranked lowest to highest for daily flow (produced from reforecasts using carryover each year); table columns: YEAR, E1, E2, E3, E4, OBS; tally and frequency by bin (Bin 1 through Bin 5). How would you interpret this histogram?
15 Distributions-Oriented Forecast Evaluation Leads to Diagnostic Verification
It's all about conditional and marginal distributions: P(O|F), P(F|O), P(F), P(O)
Reliability, Discrimination, Sharpness, Uncertainty
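One way to see how these four pieces fit together (the standard factorizations of the joint forecast-observation distribution; this notation is mine and is not spelled out on the slide):

```latex
p(f,o) = p(o \mid f)\,p(f)   % calibration-refinement: reliability and sharpness
p(f,o) = p(f \mid o)\,p(o)   % likelihood-base rate: discrimination and uncertainty
```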
16 Forecast Reliability -- P(O|F)
For a specified forecast condition, what does the distribution of observations look like?
(Plots: forecasted probability vs. relative frequency of observed.)
User perspective: "When you say 20% chance of flood flows, how often do flood flows actually happen?"
User perspective: "When you say 80% chance of flood flows, how often do flood flows actually happen?"
17 Reliability (Attributes) Diagram – Reliability, Sharpness
- Good reliability: points close to the diagonal.
- Sharpness diagram (p(f)): histogram of forecasts in each probability bin; shows the marginal distribution of forecasts.
The reliability diagram is conditioned on the forecasts. That is, given that X was predicted, what was the outcome?
18 Reliability Diagram Example Computation
Table columns: YEAR, E1, E2, E3, E4, OBS.
Step 1: Choose a threshold value to base the probability forecasts on. For simplicity we'll choose the mean forecast over all years and all ensemble members (= 208).
19 Reliability Diagram Example Computation
Step 2: Choose how many forecast probability categories to use (5 here: 0, 0.25, 0.5, 0.75, 1).
Step 3: For each forecast, calculate the forecast probability of being below the threshold value: P(forecast peak < 208).
21 Reliability Diagram Example Computation
Step 4: Group the observations into groups of equal forecast probability (or, more generally, into forecast probability categories).
P(forecast peak < 208) = 0.0: (one observation)
P(forecast peak < 208) = 0.25: ..., 98, 233
P(forecast peak < 208) = 0.5: ..., 301, 245, 248, 227
P(forecast peak < 208) = 0.75: N/A
P(forecast peak < 208) = 1.0: ..., 156, 167
23 Reliability Diagram Example Computation
Step 5: For each group, calculate the relative frequency of observations below the threshold value, 208 cfs.
P(obs peak < 208 given [P(forecast peak < 208) = 0.0]) = 0/1 = 0.0
P(obs peak < 208 given [P(forecast peak < 208) = 0.25]) = 1/3 = 0.33
P(obs peak < 208 given [P(forecast peak < 208) = 0.5]) = 1/5 = 0.2
P(obs peak < 208 given [P(forecast peak < 208) = 0.75]) = 0/0 = N/A
P(obs peak < 208 given [P(forecast peak < 208) = 1.0]) = 3/3 = 1.0
25 Reliability Diagram Example Computation
Step 6: Plot the centroid of each forecast category (just the category points in our case) on the x-axis against the observed frequency within that category on the y-axis. Include the 45-degree diagonal for reference.
26 Reliability Diagram Example Computation
Step 7: Include a sharpness plot showing the number of observation/forecast pairs in each category.
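A minimal Python sketch of Steps 1-7, assuming the same kind of per-year ensemble/observation table; the data are illustrative placeholders, not the slide's values, and the plotting of Steps 6-7 is only described in the comments.

```python
import numpy as np

# Illustrative data only (the slide's table is not reproduced here).
ensembles = np.array([
    [120, 180, 220, 260],
    [ 90, 150, 200, 240],
    [200, 230, 280, 320],
    [100, 130, 160, 190],
])
obs = np.array([250, 95, 260, 150])

threshold = ensembles.mean()                                # Step 1: mean over all years/members
prob_categories = np.array([0.0, 0.25, 0.5, 0.75, 1.0])     # Step 2

# Step 3: forecast probability of being below the threshold for each year
fcst_prob = (ensembles < threshold).mean(axis=1)

# Steps 4-5: group observations by forecast probability category and compute
# the observed relative frequency of the event (obs below threshold) per group
obs_freq, n_pairs = [], []
for p in prob_categories:
    in_cat = np.isclose(fcst_prob, p)
    n_pairs.append(int(in_cat.sum()))
    obs_freq.append((obs[in_cat] < threshold).mean() if in_cat.any() else np.nan)

# Steps 6-7: plot forecast probability vs. observed frequency with the 1:1
# diagonal, plus a sharpness histogram of n_pairs per category.
for p, f, n in zip(prob_categories, obs_freq, n_pairs):
    print(f"P(fcst)={p:.2f}  obs freq={f}  n={n}")
```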
27 Reliability Diagram – Reliability, Sharpness – P(O|F)
- Good reliability: points close to the diagonal.
- Sharpness diagram (p(f)): histogram of forecasts in each probability bin; shows the marginal distribution of forecasts.
- Good resolution: wide range of observed relative frequencies corresponding to the forecast probabilities.
- Skill: related to the Brier Skill Score, in reference to sample climatology (not historical climatology).
The reliability diagram is conditioned on the forecasts. That is, given that X was predicted, what was the outcome?
28 Attributes Diagram – Reliability, Resolution, Skill/No-Skill
- Sample climatology: the overall relative frequency of observations.
- Points closer to the perfect-reliability line than to the no-resolution line: those subsamples of the probabilistic forecast contribute positively to overall skill (as defined by the BSS) in reference to sample climatology.
- No-skill line: halfway between the perfect-reliability line and the no-resolution line, with sample climatology as the reference.
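Stated as formulas (my notation, not spelled out on the slide): with x the forecast probability and ō the sample climatology,

```latex
\text{perfect reliability: } y = x, \qquad
\text{no resolution: } y = \bar{o}, \qquad
\text{no skill: } y = \tfrac{1}{2}\,(x + \bar{o})
```

Points falling between the no-skill line and the perfect-reliability diagonal contribute positively to the Brier skill score computed against sample climatology.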
29 Interpretation of Reliability Diagrams
(Example diagrams, from Wilks, 1995: climatology; minimal resolution; underforecasting; good resolution at the expense of reliability; reliable forecasts of a rare event; small sample size.)
30 Interpretation of Reliability Diagrams
Reliability, P(O|F): Does the frequency of occurrence match your probability statement? Identifies conditional bias.
(Axes: forecasted probability vs. relative frequency of observations; the no-resolution line is shown for reference.)
31 EVS Reliability Diagram Examples (Arkansas-Red Basin, 24-hr flows, lead times 1-14 days)
- 25th percentile observed flows (low flows): sharp forecasts, but low resolution.
- 85th percentile observed flows (high flows): good reliability at shorter lead times; long leads miss high events.
From: J. Brown, EVS Manual
32 Historical seasonal water supply outlooks Colorado River Basin Morrill, Hartmann, and Bales, 2007
33 Reliability: Colorado Basin ESP Seasonal Supply Outlooks
(Panels: Lower Colorado Jan-May (5 mo. lead), Mar-May (3 mo. lead), Apr-May (2 mo. lead); Upper Colorado Jan-July (7 mo. lead), Apr-July (4 mo. lead), June-July (2 mo. lead); forecasts issued Jan 1, Mar 1, Apr 1, Jun 1; axes: forecast probability vs. relative frequency of observations; categories: high 30%, mid 40%, low 30%.)
1) Few high-probability forecasts; good reliability between 10-70% probability; reliability improves.
2) These months show the best reliability; low resolution limits reliability.
3) Reliability decreases for later forecasts as resolution increases; the Upper Colorado is good at the extremes.
Franz, Hartmann, and Sorooshian, 2003
34 Discrimination – P(F|O)
For a specified observation category, what do the forecast distributions look like?
"When dry conditions happen, what do the forecasts usually look like?"
You sure hope that forecasts look different when there's a drought, compared to when there's a flood!
35 Discrimination – P(F|O)
You sure hope that forecasts look different when there's a drought, compared to when there's a flood!
Example: NWS CPC seasonal climate outlooks, sorted into DRY cases (lowest tercile), all forecasts, all lead times.
(Two panels, one showing good discrimination and one showing not much discrimination; axes: forecasted probability vs. relative frequency of the indicated forecast; curves: probability of dry, probability of wet, climatology.)
36 Discrimination: Lower Colorado ESP Supply Outlooks
When unusually low flows happened... P(F | low flows), low < 30th percentile.
(Jan 1 forecast for Jan-May; axes: forecast probability vs. relative frequency of forecasts; curves: high, mid, low.)
There is some discrimination: the early forecasts warned "High flows less likely."
Franz, Hartmann, and Sorooshian (2003)
37 Discrimination: Lower Colorado ESP Supply Outlooks
When unusually low flows happened... P(F | low flows), low < 30th percentile.
(Jan 1 forecast for Jan-May and Apr 1 forecast for Apr-May; axes: forecast probability vs. relative frequency of forecasts; curves: high, mid, low.)
Jan 1: there is some discrimination; the early forecasts warned "High flows less likely."
Apr 1: good discrimination; the forecasts were saying (1) high and mid flows less likely, and (2) low flows more likely.
Franz, Hartmann, and Sorooshian (2003)
38 Discrimination: Colorado Basin ESP Supply Outlooks
For observed flows in the lowest 30% of the historic distribution.
(Panels: Lower Colorado Basin, Jan 1 forecast for Jan-May (5 mo. lead) and Apr 1 forecast for April-May (2 mo. lead); Upper Colorado Basin, Jan 1 forecast for Jan-July (7 mo. lead) and Jun 1 forecast for June-July (2 mo. lead); axes: forecast probability vs. relative frequency of forecasts; categories: high 30%, mid 40%, low 30%.)
1) High flows less likely.
2) No discrimination between mid and low flows.
3) Both the Upper and Lower Colorado show good discrimination for low flows at the 2-month lead time.
Franz, Hartmann, and Sorooshian (2003)
39 Historical seasonal water supply outlooks Colorado River Basin
40 Discrimination: CDF Perspective
The all-observation CDF is plotted and color-coded by tercile. Forecast ensemble members are sorted into three groups according to which tercile their associated observation falls into, and the CDF for each group is plotted in the corresponding color (e.g., high is blue).
Credit: K. Werner
41 Discrimination
In this case, there is relatively good discrimination, since the three conditional forecast CDFs separate from each other.
Credit: K. Werner
42 Discrimination Example Computation
Table columns: YEAR, E1, E2, E3, E4, OBS.
Step 1: Order the observations and divide the ordered list into categories. Here we will use terciles (low ≤ 167, middle 206-245, high ≥ 248).
OBS tercile (by year): Low, Middle, High, Low, Middle, High, Middle, Low
Credit: K. Werner
43 Discrimination Example Computation
Step 2: Group the forecast ensemble members according to the OBS tercile.
Low-OBS forecasts: 42, 74, 82, 90, 114, 277, 351, 356, 98, 170, 204, 205, 94, 135, 156, 158
Credit: K. Werner
44 Discrimination Example Computation
Step 2 (continued): Mid-OBS forecasts: 65, 143, 223, 227, 69, 169, 229, 236, 94, 219, 267, 270, 108, 189, 227, 228
Credit: K. Werner
46 Discrimination Example Computation
Step 2 (continued): Hi-OBS forecasts: 82, 192, 295, 300, 211, 397, 514, 544, 142, 291, 349, 356, 59, 175, 244, 250
Credit: K. Werner
48 Discrimination Example Computation
Step 3: Plot the all-observation CDF, color-coded by tercile (low ≤ 167, middle 206-245, high ≥ 248).
Credit: K. Werner
49 Discrimination Example Computation
Step 4: Add the CDFs of the forecasts conditioned on the observed terciles to the plot.
Low-OBS forecasts: 42, 74, 82, 90, 114, 277, 351, 356, 98, 170, 204, 205, 94, 135, 156, 158
Mid-OBS forecasts: 65, 143, 223, 227, 69, 169, 229, 236, 94, 219, 267, 270, 108, 189, 227, 228
Hi-OBS forecasts: 82, 192, 295, 300, 211, 397, 514, 544, 142, 291, 349, 356, 59, 175, 244, 250
Credit: K. Werner
50 Discrimination Example Computation
Step 5: Discrimination is shown by the degree to which the conditional forecast CDFs are separated from each other. In this case, high forecasts discriminate better than mid and low forecasts.
Credit: K. Werner
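A minimal Python sketch of Steps 1-5, assuming the per-year ensembles and observations are available as arrays; the data below are placeholders, and the tercile cut points are computed from the sample rather than taken from the slide.

```python
import numpy as np

# Illustrative data only (not the slide's table): rows are years,
# columns are the four ensemble members E1-E4.
ensembles = np.array([
    [ 50, 120, 180, 210],
    [ 80, 140, 200, 260],
    [130, 190, 250, 310],
    [ 60, 110, 170, 230],
    [150, 220, 300, 380],
    [ 90, 160, 240, 290],
])
obs = np.array([100, 150, 260, 130, 330, 210])

# Step 1: split the ordered observations into terciles.
lo_cut, hi_cut = np.percentile(obs, [100 / 3, 200 / 3])
tercile = np.where(obs <= lo_cut, "low",
           np.where(obs <= hi_cut, "middle", "high"))

# Step 2: pool ensemble members according to the tercile of their observation.
groups = {t: ensembles[tercile == t].ravel() for t in ("low", "middle", "high")}

# Steps 3-5: empirical CDF of each pooled group; the more the three
# conditional CDFs separate, the better the discrimination.
def ecdf(values):
    x = np.sort(values)
    return x, np.arange(1, len(x) + 1) / len(x)

for name, members in groups.items():
    x, p = ecdf(members)
    print(name, x.tolist(), np.round(p, 2).tolist())
```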
51 Discrimination
How well do April-July volume forecasts discriminate when they are made in Jan, Mar, and May? Poor discrimination in Jan between forecasting high and medium flows; best discrimination in May.
Credit: K. Werner
52 Discrimination
Another way to look at discrimination is to use PDFs instead of CDFs: the more separation between the PDFs, the better the discrimination.
Credit: K. Werner
53 Comparing Deterministic & Probabilistic Forecasts
Deterministic forecasts are traditional in hydrology but sub-optimal for decision making.
A common perspective: "Deterministic model simulations and probabilistic forecasts … are two entirely different types of products. Direct comparison of probabilistic forecasts with deterministic single-valued forecasts is extremely difficult." - Anonymous
54 How can we compare deterministic and probabilistic forecasts?
(Figure contrasts a deterministic forecast with a probabilistic ensemble. Source: XEFS Design Team, 2007.)
Option: Use the ensemble median with standard metrics? No!
55 "Pretend Determinism"
The ensemble mean minimizes error, but doesn't represent the overall behavior.
From: A. Hamlet, University of Washington
56 What's wrong with using 'deterministic' metrics?
(Figure: four forecast PDFs plotted against the observed value.)
Metrics that use only the central tendency of each forecast PDF fail to distinguish between forecasts 1-3, but will identify 4 as inferior. Metrics that reward accuracy but punish spread will rank the forecast skill from 1 to 4.
From: A. Hamlet, University of Washington
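A tiny numeric illustration of the point above (made-up numbers): two ensembles with the same mean but very different spread receive identical scores when only the ensemble mean is verified, so a central-tendency metric cannot tell them apart.

```python
import numpy as np

obs = 100.0
narrow = np.array([98., 99., 100., 101., 102.])   # tight ensemble
wide   = np.array([40., 70., 100., 130., 160.])   # same mean, huge spread

for name, ens in [("narrow", narrow), ("wide", wide)]:
    mean_err = abs(ens.mean() - obs)   # error of the ensemble mean
    spread = ens.std()                 # ignored by mean-only metrics
    print(f"{name}: |mean - obs| = {mean_err:.1f}, spread = {spread:.1f}")
# Both ensembles print a mean error of 0.0, even though they make very
# different statements about uncertainty.
```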
58 Deterministic vs. Probabilistic Forecasts
Jack-knife calibration error gives a PDF of the error distribution, from which any quantiles can be determined for the deterministic forecast.
(Figure labels: PDF, climatology distribution, forecast distribution, tercile boundaries (equal probability), deterministic forecast, observation, flow Q.)
Approach used by Morrill, Hartmann, and Bales (2007).
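A hedged sketch of the kind of conversion the slide describes (attaching an error distribution to a single-valued forecast to derive quantiles); the data are placeholders, and a plain pooled error distribution stands in for the jack-knife calibration, so this is not the exact procedure of Morrill, Hartmann, and Bales (2007).

```python
import numpy as np

# Illustrative hindcast record (placeholder values): past deterministic
# forecasts and the flows that were actually observed.
past_fcst = np.array([210., 180., 250., 300., 150., 220., 270., 190.])
past_obs  = np.array([230., 170., 240., 330., 160., 200., 290., 185.])

# Empirical error distribution from the hindcasts (obs minus forecast);
# the slide's approach derives this by jack-knife calibration.
errors = past_obs - past_fcst

def forecast_quantiles(new_fcst, quantiles=(0.1, 0.33, 0.5, 0.67, 0.9)):
    """Turn a single-valued forecast into flow quantiles by adding the
    empirical error distribution to it."""
    return {q: float(new_fcst + np.quantile(errors, q)) for q in quantiles}

print(forecast_quantiles(240.0))
```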
59 Lab Session -- Group Exercise
Choose a set of forecasts. Develop strategies for verifying these forecasts from two perspectives:
- Users
- Forecasters during operations
Report back to the group. Repeat for a second set of forecasts, if time permits.