Statistical Evaluation of High-resolution WRF Model Forecasts Near the SF Bay Peninsula By Ellen METR 702 Prof. Leonard Sklar Fall 2014 Research Advisor: Prof. Dave Dempsey Funded by the ASERG Grant
What’s Coming? Introduction How do we make forecasts? WRF Research Questions and Hypotheses Methods Results Conclusion
You might have seen this before: Don’t forget your ! National Weather Service
Also: How do they make forecasts? National Weather Service
They partly depend on “Model Output Statistics (MOS)”. For example: National Weather Service v/cgi-bin/mos/getall.pl?sta=KSFO (temperature, wind direction, wind speed). How do models make forecasts?
They solve a set of equations, stepping forward in time by Δt. Different models use different methods to solve these differential equations. history.mcs.st-and.ac.uk/HistTopics/Weather_forecasts.html
To solve the equations, two kinds of information are needed (initialization):
1. Initial conditions: the state of the atmosphere at the grid of points at the starting time.
2. Boundary conditions: the state of the atmosphere at the grid points on the boundary of the forecast domain at forecast times.
Forecast = Initial + Changes. What does this mean?
Boundary condition: to calculate the state on the boundary
Why are we talking about models?
WRF is a model! Evaluating the Weather Research and Forecasting (WRF) Model. What provides the two types of conditions WRF needs to make forecasts in my project?
Two sources of initialization: other forecast models. DIFFERENCES: Different resolutions (distance between grid points): NAM has the higher resolution (shorter distance). Spatially different grids.
Two sources of initialization: other forecast models. SAME: Forecast runs: 4 times/day (we get forecasts for 4 a.m., 10 a.m., 4 p.m., and 10 p.m. Pacific Standard Time). Forecast length: 48 hrs; forecast time step: 6 hrs (forecast times: 6th hour after initial time, 12th, 18th, 24th, 30th, 36th, 42nd, 48th). DIFFERENCES: Different resolutions (distance between grid points): NAM has the higher resolution (shorter distance). Spatially different grids: GFS, 0.5 degree resolution (north-south direction: about 55 km, lower than NAM’s resolution); NAM, 40 km CONUS grid.
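As a rough check of the "0.5 degree is about 55 km" figure, here is a minimal sketch. It assumes one degree of latitude spans roughly 111 km, a rounded value that does not come from the slides, and the function name is only illustrative.

```python
# Assumption: one degree of latitude is roughly 111 km (rounded value, not from the slides).
KM_PER_DEGREE_LATITUDE = 111.0

def north_south_spacing_km(resolution_degrees):
    """Convert a grid resolution in degrees of latitude to an approximate distance in km."""
    return resolution_degrees * KM_PER_DEGREE_LATITUDE

# Spacing of the 0.5 degree grid in the north-south direction.
coarse_spacing_km = north_south_spacing_km(0.5)

# The 40 km grid has the shorter spacing, i.e. the higher resolution.
finer_grid_is_40km = 40.0 < coarse_spacing_km
```

The result lands close to the 55 km quoted on the slide, which is larger than the 40 km spacing of the other grid.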
Longitude-latitude orientation; lower resolution = bigger grid cells. Not longitude-latitude oriented; higher resolution = smaller grid cells.
Domain in my project. Both NAM & GFS: lower resolutions but bigger domains than the one in my project; sources of initialization for WRF.
WRF: makes forecasts inside the domain (model forecasts). Initial and boundary conditions: provided by NAM or GFS (sources of initialization).
With each source of initialization, WRF outputs forecasts (WRF_NAM & WRF_GFS). Forecast runs: 4 times/day; forecast time step: 1 hr; forecast length: 48 hrs; 1 initial data point + 48 forecast data points = 49 data points for each model run.
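The count of 49 data points per run can be sketched as follows. The initialization date and the function name are hypothetical, chosen only for illustration. The 6-hour step is the one quoted for the NAM and GFS forecasts that drive WRF.

```python
from datetime import datetime, timedelta

def valid_times(init_time, length_hours=48, step_hours=1):
    """List the times covered by one model run, including the initial time (hour 0)."""
    return [init_time + timedelta(hours=h)
            for h in range(0, length_hours + 1, step_hours)]

# Hypothetical initialization time, used only for illustration.
init = datetime(2014, 8, 14, 12)

wrf_times = valid_times(init)                   # hourly WRF output: 1 initial + 48 forecasts
driver_times = valid_times(init, step_hours=6)  # 6-hourly NAM/GFS times: 1 initial + 8 forecasts

print(len(wrf_times), len(driver_times))
```

With a 1-hour step the run covers 49 times. With the 6-hour step of the driving models it covers 9.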
Different sources of initialization may lead to different forecast accuracies. As we make forecasts, which one is more reliable for a specific purpose? For my project, the focus is on temperature.
Research Questions and Hypotheses 1. For each forecast hour through 48 hours, is there any significant difference in accuracy between WRF_NAM and WRF_GFS in temperature? Ho: no, there is no significant difference; they are about equally accurate. Ha: yes, there is a significant difference in accuracy between WRF_NAM and WRF_GFS in temperature.
2. If there are significant differences in accuracy (for each forecast hour with a significant difference between WRF_NAM and WRF_GFS), which is less accurate (having larger bias), WRF_NAM or WRF_GFS? Ho: WRF_NAM is less accurate when WRF_NAM and WRF_GFS are significantly different. Ha: WRF_GFS is less accurate.
Methods to Evaluate Model Forecast Accuracy: Accuracy: Compared to……
Observation (station) data obtained: MADIS UrbaNet data in the Northern California area
flow of information
Which statistics will we use to compare “accuracy” between model forecasts based on different initializations? Mean Absolute Error (MAE) between forecasts and observations, with Estimated Standard Deviations of Mean Errors (ESDEV), for each of 49 hours from multiple runs with the same source of initialization (from Aug 14 to Sept 14 and from Nov 2 to Nov 16).
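A minimal sketch of how MAE and ESDEV could be computed for one forecast hour. It assumes ESDEV is the sample standard deviation (n - 1 denominator) of the absolute errors, which is how it enters the standard error formula later in the talk. The temperatures and the function name are hypothetical.

```python
import math

def mae_and_esdev(forecasts, observations):
    """Return (MAE, ESDEV, n) for paired forecast and observed temperatures."""
    abs_errors = [abs(f - o) for f, o in zip(forecasts, observations)]
    n = len(abs_errors)
    if n < 2:
        raise ValueError("need at least two forecast/observation pairs")
    mae = sum(abs_errors) / n
    # Assumption: ESDEV is the sample standard deviation of the absolute errors.
    variance = sum((e - mae) ** 2 for e in abs_errors) / (n - 1)
    return mae, math.sqrt(variance), n

# Hypothetical temperatures (deg C) for one forecast hour, pooled over runs and stations.
forecast_temps = [15.0, 17.0, 20.0, 14.0]
observed_temps = [14.0, 19.0, 20.0, 17.0]

mae, esdev, n = mae_and_esdev(forecast_temps, observed_temps)
print(mae, esdev, n)
```

In the project this would be repeated for each of the 49 hours and for each source of initialization, giving one (MAE, ESDEV, n) triple per hour for WRF_NAM and one for WRF_GFS.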
T-tests on D, where D is the difference between the MAEs of paired WRF temperature forecasts: D = MAE of WRF_GFS – MAE of WRF_NAM.
For Q1: t-tests on D for each of 49 hours. Ho: equally accurate. Ha: significant difference in accuracy.
For Q2: right-tailed t-tests on D for the hours when WRF_NAM & WRF_GFS differ in accuracy. Ho: WRF_NAM is less accurate. Ha: WRF_GFS is less accurate.
Since we have large sample sizes for each forecast hour, the degrees of freedom become very large, and whether or not the variances of MAE are the same has little effect on the standard errors of MAE. In this case, I used Welch’s t-tests for pairs of WRF-ARW_NAM and WRF-ARW_GFS at each forecast hour to answer both of my research questions.
T-test: Sample size (n). True mean of D: $\mu_D = 0$ (Ho: WRF_NAM and WRF_GFS are the same).
Standard error of D (error propagation: simple rule for differences):
$$\mathrm{s.e.}_D = \sqrt{\frac{\mathrm{ESDEV}_{\mathrm{WRF\_GFS}}^{2}}{n_{\mathrm{WRF\_GFS}}} + \frac{\mathrm{ESDEV}_{\mathrm{WRF\_NAM}}^{2}}{n_{\mathrm{WRF\_NAM}}}}$$
Test statistic:
$$t_{\mathrm{data}} = \frac{D - \mu_D}{\mathrm{s.e.}_D} = \frac{D}{\mathrm{s.e.}_D}$$
Degrees of freedom of D (Welch’s t-test):
$$\mathrm{d.f.}_D = \frac{(\mathrm{s.e.}_D)^{4}}{\dfrac{\left(\mathrm{ESDEV}_{\mathrm{WRF\_GFS}}^{2} / n_{\mathrm{WRF\_GFS}}\right)^{2}}{n_{\mathrm{WRF\_GFS}} - 1} + \dfrac{\left(\mathrm{ESDEV}_{\mathrm{WRF\_NAM}}^{2} / n_{\mathrm{WRF\_NAM}}\right)^{2}}{n_{\mathrm{WRF\_NAM}} - 1}}$$
A typical value is 3000.
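The quantities on this slide can be sketched in a few lines. This is an illustration with made-up inputs rather than the project's actual code. The degrees of freedom use the standard Welch-Satterthwaite form, whose numerator is the squared sum of the two variance terms.

```python
import math

def welch_t(mae_gfs, esdev_gfs, n_gfs, mae_nam, esdev_nam, n_nam):
    """Return (D, s.e. of D, t_data, d.f. of D) for one forecast hour."""
    d = mae_gfs - mae_nam                   # D = MAE of WRF_GFS - MAE of WRF_NAM
    var_gfs = esdev_gfs ** 2 / n_gfs        # ESDEV^2 / n for WRF_GFS
    var_nam = esdev_nam ** 2 / n_nam        # ESDEV^2 / n for WRF_NAM
    se = math.sqrt(var_gfs + var_nam)       # standard error of D
    t_data = d / se                         # mu_D = 0 under Ho
    # Welch-Satterthwaite degrees of freedom.
    df = (var_gfs + var_nam) ** 2 / (
        var_gfs ** 2 / (n_gfs - 1) + var_nam ** 2 / (n_nam - 1)
    )
    return d, se, t_data, df

# Hypothetical inputs for a single forecast hour (not results from the project).
d, se, t_data, df = welch_t(2.0, 2.0, 1600, 1.8, 1.5, 1600)
print(d, se, t_data, df)
```

With equal sample sizes the degrees of freedom fall between n - 1 and 2(n - 1), so samples of this size give values in the low thousands.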
5% (0.05) is used for the experimentwise error rate ($\alpha_e$), the risk across all 49 comparisons (n = 49), to calculate the individual comparisonwise risk ($\alpha$): $\alpha = 1 - (1 - \alpha_e)^{1/n} \approx 0.001$ (about 0.1%). Based on the d.f. of D and $\alpha$, t_critical ≈ 3.3 (Sullivan, 2004). What do these data mean to our test?
The d.f. of D ranges from 3000 upward. However, as the degrees of freedom get larger, their effect on t_critical gets smaller, so we choose 3000 as a typical value of the d.f. of D. Running separate individual t-tests at the 95% confidence level for each comparison would overstate the experimentwise error rate: the risk of wrongly rejecting Ho across all 49 comparisons would be bigger than 5%. To avoid this, we use 5% (0.05) for the experimentwise error rate (α e), the risk across all 49 comparisons, to calculate the individual comparisonwise risk (α).
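The comparisonwise risk and the resulting critical value can be checked with a short sketch. It assumes a normal approximation to the t distribution, which is reasonable only because the degrees of freedom are in the thousands. The function names are illustrative.

```python
from statistics import NormalDist

def comparisonwise_alpha(experimentwise_alpha, n_comparisons):
    """alpha = 1 - (1 - alpha_e)^(1/n)"""
    return 1.0 - (1.0 - experimentwise_alpha) ** (1.0 / n_comparisons)

def critical_value(alpha, two_tailed=True):
    """Normal approximation to t_critical (assumes very large degrees of freedom)."""
    tail = alpha / 2.0 if two_tailed else alpha
    return NormalDist().inv_cdf(1.0 - tail)

alpha = comparisonwise_alpha(0.05, 49)
t_crit_two_tailed = critical_value(alpha, two_tailed=True)
t_crit_right_tailed = critical_value(alpha, two_tailed=False)
print(alpha, t_crit_two_tailed, t_crit_right_tailed)
```

The comparisonwise risk comes out near 0.001, and the two-tailed critical value is close to the 3.3 quoted in the talk. The right-tailed value is somewhat smaller, so applying 3.3 to the right-tailed tests errs on the conservative side.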
For Q1: If |t_data| > t_critical at the corresponding hour, reject Ho: WRF_NAM and WRF_GFS differ in accuracy of temperature at that particular forecast hour.
For Q1: t-tests on D for each of 49 hours. D is the difference between the MAEs of paired WRF temperature forecasts: D = MAE of WRF_GFS – MAE of WRF_NAM. Ho: equally accurate. Ha: significant difference in accuracy.
For Q2: If t_data > t_critical at the corresponding hour, reject Ho: WRF_GFS is less accurate at that particular forecast hour. T-tests on D are run for the hours when WRF_NAM & WRF_GFS differ in accuracy.
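The two decision rules (Q1, then Q2 for the hours that pass Q1) can be sketched together. The t_data values below are hypothetical and the function name is illustrative. The critical value is the 3.3 quoted in the talk.

```python
T_CRITICAL = 3.3  # critical value quoted in the talk

def classify_hour(t_data, t_critical=T_CRITICAL):
    """Apply the Q1 rule, then the Q2 rule, to one forecast hour."""
    # Q1 (two-sided): is there a significant difference in accuracy?
    if abs(t_data) <= t_critical:
        return "no significant difference"
    # Q2 (right-tailed): D = MAE of WRF_GFS - MAE of WRF_NAM, so a large
    # positive t_data means WRF_GFS has the larger error.
    if t_data > t_critical:
        return "WRF_GFS less accurate"
    # Significant difference, but the Q2 Ho (WRF_NAM less accurate) is not rejected.
    return "WRF_NAM less accurate"

# Hypothetical t_data values keyed by forecast hour.
example_t = {6: 1.2, 30: -4.1, 46: 5.0}
results = {hour: classify_hour(t) for hour, t in example_t.items()}
print(results)
```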
For Q2: right-tailed t-tests on D for the hours when WRF_NAM & WRF_GFS differ in accuracy. D is the difference between the MAEs of paired WRF temperature forecasts: D = MAE of WRF_GFS – MAE of WRF_NAM. Ho: WRF_NAM is less accurate. Ha: WRF_GFS is less accurate.
What’s the result?
However, problems:
① Missing data from mid-September to early November might affect the statistical results.
② For t-tests, each comparison should be independent of the others, but my forecast data are not independent across the forecast hours (they are time-related). The t-test is not the most appropriate way to compare the accuracy of WRF_GFS and WRF_NAM.
Conclusion: WRF_NAM is probably more reliable for temperature forecasts within 48 hours. For the first 26 forecast hours, WRF_GFS might be as accurate as WRF_NAM. After the 44th hour, WRF_GFS is consistently less accurate. We need to find a better way to compare the accuracy of WRF_GFS & WRF_NAM.
Thank you for listening! Reference: Sullivan, M. (2004). Statistics: Informed Decisions Using Data. Pearson Education Inc., Upper Saddle River, NJ.
To make a forecast, the model solves a set of equations that describe how the state of the atmosphere, represented on a three-dimensional grid of points, changes over time at a series of discrete time steps.
To calculate a forecast, the model requires two kinds of information: (1) the state of the atmosphere at the grid of points at the starting time (i.e., initial conditions); and (2) the state of the atmosphere at grid points on the boundary of the forecast domain at all subsequent forecast times (i.e., boundary conditions).
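To make the roles of the two kinds of information concrete, here is a toy sketch. It is not how WRF works: it uses one dimension, a single variable, and a simple upwind scheme, with a constant positive wind assumed. All values and names are hypothetical. It only illustrates "Forecast = Initial + Changes" with a boundary value supplied from outside at each step.

```python
def step_forward(state, inflow_value, wind, dx, dt):
    """Advance a 1-D temperature field by one time step (upwind advection, wind > 0)."""
    new_state = list(state)
    # Boundary condition: the value at the inflow edge is supplied from outside the domain.
    new_state[0] = inflow_value
    for i in range(1, len(state)):
        # Change over one time step caused by the wind carrying air from the upwind point.
        change = -wind * (state[i] - state[i - 1]) / dx * dt
        # Forecast = Initial + Changes
        new_state[i] = state[i] + change
    return new_state

# Hypothetical temperatures (deg C) on 4 grid points spaced 1000 m apart.
initial_state = [15.0, 16.0, 17.0, 18.0]   # initial condition at the starting time
boundary_values = [14.0, 13.0]             # boundary condition at each later time

forecast = initial_state
for boundary_value in boundary_values:
    forecast = step_forward(forecast, boundary_value, wind=10.0, dx=1000.0, dt=100.0)
print(forecast)
```

With these numbers the wind moves the air exactly one grid spacing per time step, so each step shifts the field by one point and the boundary values gradually enter the domain. This is why a limited-area forecast needs boundary conditions at every forecast time and not just at the start.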
We would like to know, for a specific purpose, which models could produce more accurate predictions than others could.
To evaluate model forecast accuracy, we need to compare model forecasts to observations. My research used surface weather observations from the Meteorological Assimilation Data Ingest System (MADIS) UrbaNet database.