Fuzzy verification using the Fractions Skill Score
Marion Mittermaier and Nigel Roberts
Spatial verification methods intercomparison meeting, Boulder, 20.02.07
© Crown copyright
Verification approach
We want to know:
- How the forecast skill varies with neighbourhood size.
- The smallest neighbourhood size that can be used to give sufficiently accurate forecasts.
- Whether higher resolution provides more accurate forecasts on scales of interest (e.g. river catchments).
Approach: compare forecast fractions with fractions from radar over different-sized neighbourhoods (squares for convenience) using GRIDDED data, and use rainfall accumulations to apply temporal smoothing. A sketch of the fraction computation follows this slide.
Reference: Scale-selective verification of rainfall accumulations from high-resolution forecasts of convective events, Roberts and Lean (accepted in MWR, Feb 2007).
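The fraction computation itself is simple; the following is an illustrative sketch (not the authors' code) of turning gridded accumulations into neighbourhood fractions. The threshold, neighbourhood size, and stand-in random fields are assumptions for the example.

```python
# Illustrative sketch: forecast and observed fractions over square
# neighbourhoods from gridded rainfall accumulations.
import numpy as np
from scipy.ndimage import uniform_filter

def neighbourhood_fractions(field, threshold, n):
    """Fraction of grid boxes exceeding `threshold` in each n x n square."""
    binary = (field >= threshold).astype(float)
    # mode='constant' treats points outside the domain as non-events
    return uniform_filter(binary, size=n, mode='constant', cval=0.0)

# Example: 1 mm threshold, 5-grid-length squares, random stand-in fields
forecast = np.random.gamma(0.5, 2.0, size=(100, 100))
radar = np.random.gamma(0.5, 2.0, size=(100, 100))
f_frac = neighbourhood_fractions(forecast, threshold=1.0, n=5)
o_frac = neighbourhood_fractions(radar, threshold=1.0, n=5)
```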
Schematic comparison of fractions
[Figure: two 5x5 grids, observed and forecast; threshold exceeded where squares are blue.]
This would be considered a perfect forecast on the scale of 5x5 grid squares, since the fraction of grid boxes exceeding the threshold is the same in the forecast and the observation.
A score for comparing fractions with fractions
- Brier score for comparing fractions.
- Skill score for fractions/probabilities: the Fractions Skill Score (FSS).
- The denominator in the FSS is the Brier score for the worst possible match-up between forecast and observed grid-box values (see the reconstruction below).
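The slide's equations did not survive extraction; the following is a reconstruction from the slide text and the standard definitions in Roberts and Lean. With $O_j$ and $M_j$ the observed and forecast fractions in neighbourhood $j$ of $N$:

$$\mathrm{FBS} = \frac{1}{N}\sum_{j=1}^{N}\left(O_j - M_j\right)^2$$

$$\mathrm{FBS}_{\mathrm{worst}} = \frac{1}{N}\left[\sum_{j=1}^{N}O_j^2 + \sum_{j=1}^{N}M_j^2\right]$$

$$\mathrm{FSS} = 1 - \frac{\mathrm{FBS}}{\mathrm{FBS}_{\mathrm{worst}}}$$

$\mathrm{FBS}_{\mathrm{worst}}$ is the fractions Brier score obtained when the non-zero forecast and observed fractions never overlap, i.e. the worst possible match-up, so the FSS runs from 0 (no skill) to 1 (perfect).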
Example graph of FSS against neighbourhood size
- Emphasizes which scales have useful skill.
- At the grid scale, the FSS of a random forecast with the same base rate as the observations, f0, is equal to f0.
- The target skill is 0.5 + f0/2, the point at which the skill is closer to perfect than to random; a sketch for locating this scale follows.
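Finding the smallest neighbourhood size that reaches the target skill can be mechanised; this sketch builds on the `neighbourhood_fractions()` helper above, with the candidate sizes an assumption for the example.

```python
# Illustrative sketch: smallest neighbourhood length at which the FSS
# first exceeds the target skill 0.5 + f0/2 (f0 = observed base rate).
import numpy as np

def fss(f_frac, o_frac):
    fbs = np.mean((f_frac - o_frac) ** 2)
    worst = np.mean(f_frac ** 2) + np.mean(o_frac ** 2)
    return 1.0 - fbs / worst if worst > 0 else np.nan

def smallest_skilful_scale(forecast, radar, threshold, sizes=(1, 3, 5, 9, 17)):
    f0 = np.mean(radar >= threshold)    # observed base rate
    target = 0.5 + f0 / 2.0             # halfway between random and perfect
    for n in sizes:
        score = fss(neighbourhood_fractions(forecast, threshold, n),
                    neighbourhood_fractions(radar, threshold, n))
        if score >= target:
            return n, score
    return None, None                   # never reaches the target
```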
Strengths and weaknesses
Strengths:
- Measures skill on fair terms from the model perspective: it gets round the double-penalty problem by sampling around precipitation areas.
- It can be used to determine the scale over which a forecast system has sufficient skill.
- The method is intuitive and can be directly related to the way forecasts are presented, e.g. generating spatial probability forecasts.
- It is particularly useful for high-resolution precipitation forecasts, in which we expect the fine detail to be unpredictable.
- It can be used for single or composite events.
Weaknesses:
1. The spatial skill signal may be swamped by the bias.
2. Sensitivity to small base rates at higher thresholds, i.e. it is threshold-dependent (as is any method using thresholds!).
3. Like any score, it doesn't tell the whole story on its own.
13 May 2005
Hourly accumulations
[Figure: two panels of hourly accumulations; maxima 94 mm (3.7 in) and 78 mm (3.1 in).]
Physical thresholds
Increasing bias for higher thresholds.
[Figure: FSS against neighbourhood size for thresholds of 0.04, 0.08, 0.16, 0.32, 0.64 and 1.28 in (mm equivalents in legend); scale marker at ~60 mi.]
Notes:
- I've added an arbitrary 0.55 line as a reference for "where forecasts become skilful". This is approximate, meant as a guide. The 0.5 line is also shown.
- I've highlighted the 100 km range as a possible limit on the maximum averaging length, again only as an illustration. Someone with local knowledge would need to decide what an acceptable level of skill and maximum averaging length would be.
- NCEP WRF appears to verify worst of all. It also gives a classic example of when smoothing becomes detrimental to skill: there is an absolute maximum averaging length at which skill peaks (would a forecast averaged to this length still be useful, though?). Beyond this, the skill (not to mention the usefulness) decreases. A minimal check for this peak is sketched below.
- NCAR WRF appears to be the most skilful at fairly small scales and for most thresholds; CAPS WRF comes a close second.
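The "absolute maximum averaging length" can be read straight off the FSS curve; a minimal sketch, assuming a hypothetical dict `fss_by_size` mapping neighbourhood lengths to scores built with the functions sketched earlier:

```python
# Illustrative: the averaging length at which skill peaks, beyond which
# further smoothing is detrimental. `fss_by_size` is an assumed
# {neighbourhood length: FSS} mapping.
best_n = max(fss_by_size, key=fss_by_size.get)
```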
Frequency thresholds
Top 25, 5 and 1% of the distribution (including zeros).
[Figure: FSS against neighbourhood size; panel annotations: 0.5-1 mm (0.02-0.04 in), representative of the rain/no-rain boundary, and 4-6 mm (0.16-0.24 in); scale marker at ~60 mi.]
Notes:
- On analysing the results I've concluded that, given the sparseness of data in the domain, I need to re-evaluate the frequency thresholds in the code (they seemed reasonable for the UK!). As the code takes some time to run and I didn't have time to tweak much, I only have these results to show. Clearly 0.001 (a tenth of a percent) could still yield sensible results.
- Notice also that for the NCEP WRF the rain/no-rain threshold has the best skill, whereas for the other two it appears to be worst.
- >75% of the domain is zeros; the top 1% are values of ~5 mm (0.2 in) or more (a large range); 1% of pixels is ~3000 (still a lot).
A sketch of how such percentile thresholds can be derived follows.
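Frequency thresholds amount to percentiles of the rainfall distribution with zeros included, so forecast and radar are compared at equal coverage. An illustrative sketch (the function name and defaults are assumptions):

```python
# Illustrative sketch: derive frequency (percentile) thresholds from the
# observed distribution, zeros included, mirroring the slide's top 25/5/1%.
import numpy as np

def frequency_thresholds(radar, top_fractions=(0.25, 0.05, 0.01)):
    values = radar.ravel()                # include zeros, as on the slide
    return {p: np.quantile(values, 1.0 - p) for p in top_fractions}

# With >75% zeros, the top-25% threshold sits near the rain/no-rain
# boundary; smaller fractions (e.g. 0.001) may still leave thousands
# of wet pixels in a large domain.
```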
1 June 2005
Hourly accumulations
[Figure: two panels of hourly accumulations; maxima 120 mm (4.7 in) and 84 mm (3.3 in).]
Physical thresholds
[Figure: FSS against neighbourhood size for thresholds of 0.04, 0.08, 0.16, 0.32, 0.64 and 1.28 in; scale marker at ~60 mi.]
Notes:
- NCEP WRF is still less skilful than the other two.
- Both NCAR and CAPS WRF struggle to reach acceptable skill, even with considerable smoothing.
Frequency thresholds
[Figure: two panels; 0.5-1 mm (0.02-0.04 in), representative of the rain/no-rain boundary, and 4-6 mm (0.16-0.24 in); scale marker at ~60 mi.]
Issues
The following points have cropped up and are listed here as general issues or ones specific to the FSS. At the very least they require a bit more thought and possibly some extra tests.
- The FSS is currently computationally expensive (run time depends on domain size); one standard speed-up is sketched after this list.
- Results may be domain-size dependent (a larger domain gives scope for larger spatial errors). Other spatial methods may suffer in the same way. (Do we know enough about this?)
- Independence issues (regarding adjacent pixels). This affects all spatially based methods. (Should we be worried?)
- Impact of data sparseness(?) and domain edge effects. A bit of a grey area, but again it may apply more widely.
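On the cost point: a summed-area table (integral image) gives every n x n neighbourhood sum in constant time after one pass over the grid, so run time stops growing with neighbourhood size. This is an illustrative sketch of that standard technique, not the implementation used here; it matches `uniform_filter` with zero-padded edges for odd n.

```python
# Illustrative sketch: neighbourhood fractions via cumulative sums.
import numpy as np

def neighbourhood_fractions_fast(binary, n):
    """Mean of `binary` over n x n squares (n odd), zero-padded at edges."""
    h = n // 2
    padded = np.pad(binary.astype(float), h)   # zeros outside the domain
    s = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    s[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    rows, cols = binary.shape
    # window sum from four summed-area-table corner lookups
    total = (s[n:n + rows, n:n + cols] - s[:rows, n:n + cols]
             - s[n:n + rows, :cols] + s[:rows, :cols])
    return total / (n * n)
```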