Statistical Evaluation of High-resolution WRF Model Forecasts Near the SF Bay Peninsula By Ellen METR 702 Prof. Leonard Sklar Fall 2014 Research Advisor:


Statistical Evaluation of High-resolution WRF Model Forecasts Near the SF Bay Peninsula By Ellen METR 702 Prof. Leonard Sklar Fall 2014 Research Advisor: Prof. Dave Dempsey Funded by the ASERG Grant

What’s Coming?  Introduction  How do we make forecasts?  WRF  Research Questions and Hypotheses  Methods  Results  Conclusion

You might have seen this before: Don’t forget your ! National Weather Service textField1= &textField2 = #.VIesd0urfRo

Also: How do they make forecasts ? National Weather Service textField1= &textField2 = #.VIesd0urfRo

Partly Depend on: “Model Output Statistics (MOS)” For Example: National Weather Service v/cgi- bin/mos/getall.pl?sta=K SFO Temperature  Wind direction Wind Speed How do models make forecasts ?

They: etc. Δt Solve a set of equations. Different models use different methods to solve these differential equations. history.mcs.st-and.ac.uk/HistTopics/Weather_forecasts.html

To solve the equations, two kinds of information are needed (initialization): 1. Initial Conditions – state of the atmosphere – at the grid of points – at the starting time 2. Boundary Conditions – state of the atmosphere – at the grid of points on the boundary of the forecast domain – at forecast times. Forecast = Initial + Changes. What does this mean?

Boundary condition To calculate the state on the boundary

Why are we talking about models?

WRF is a model! Evaluating the Weather Research and Forecasting (WRF) Model. What provides the two types of conditions WRF needs to make forecasts in my project?

Two sources of initialization -- other forecast models. DIFFERENCES:  Different Resolutions (distance between grid points): NAM has higher resolution (shorter distance)  Spatially Different Grids

Two sources of initialization -- other forecast models. SAME: Forecast Run: 4 times/day (we get forecasts for 4 a.m., 10 a.m., 4 p.m. and 10 p.m. Pacific Standard Time); Forecast Length: 48 hrs; Forecast Time Step: 6 hrs (forecast times: 6th hour after initial time, 12th, 18th, 24th, 30th, 36th, 42nd, 48th). DIFFERENCES:  Different Resolutions (distance between grid points): NAM is higher (shorter distance)  Spatially Different Grids: GFS: 0.5 degree resolution (north-south direction: about 55 km -- lower than NAM’s resolution); NAM: 40 km CONUS

(Figure: two grids on longitude-latitude axes.) Longitude-latitude orientation; lower resolution = bigger grids. Not longitude-latitude orientation; higher resolution = smaller grids.

Domain in My Project. Both NAM & GFS (sources of initialization for WRF) have lower resolutions but bigger domains than the one in my project.

WRF makes forecasts inside the domain (model forecasts). Conditions on the boundary are provided by NAM or GFS (sources of initialization).

WRF outputs with each source of initialization (WRF_NAM & WRF_GFS): Forecast Run: 4 times/day; Forecast Time Step: 1 hr; Forecast Length: 48 hrs; 1 initial value + 48 forecast values = 49 values for each model run

Different sources of initialization may lead to different forecast accuracies. As we make forecasts, which one is more reliable for a specific purpose? For my project, the focus is on temperature.

Research Questions and Hypotheses 1. For each forecast hour through 48 hours, is there any significant difference in temperature accuracy between WRF_NAM and WRF_GFS? Ho: no, there is no significant difference – about equally accurate. Ha: yes, there is a significant difference in accuracy between WRF_NAM and WRF_GFS in temperature.

2. If there are significant differences in accuracy (that is, forecast hours with a significant difference between WRF_NAM and WRF_GFS), which forecasts are less accurate (having larger bias), WRF_NAM or WRF_GFS? Ho: WRF_NAM are less accurate when WRF_NAM and WRF_GFS are significantly different. Ha: WRF_GFS are less accurate.

Methods to Evaluate Model Forecast Accuracy: Accuracy: Compared to……

Observation (Station) Data Obtained: MADIS UrbaNet data in the Northern California area

flow of information

Which statistics will we use to compare “accuracy” between model forecasts based on different initializations? Mean Absolute Error (MAE) between forecasts and observations, with Estimated Standard Deviations of Mean Errors (ESDEV), for each of 49 hours from multiple runs with the same source of initialization (from Aug 14 to Sept 14 and from Nov 2 to Nov 16)
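As a rough illustration of the two statistics named on this slide, the sketch below computes MAE and a spread estimate for one forecast hour. It is only a sketch under assumptions: the slides do not define ESDEV precisely, so here it is assumed to be the sample standard deviation (n - 1 denominator) of the absolute errors, and the function name and temperature values are hypothetical.

```python
import math

def mae_and_esdev(forecasts, observations):
    """Return (MAE, ESDEV, n) for paired forecast/observation temperatures.

    ASSUMPTION: ESDEV is taken as the sample standard deviation of the
    absolute errors; the slides do not spell out its exact definition.
    """
    abs_errors = [abs(f - o) for f, o in zip(forecasts, observations)]
    n = len(abs_errors)
    mae = sum(abs_errors) / n
    variance = sum((e - mae) ** 2 for e in abs_errors) / (n - 1)
    return mae, math.sqrt(variance), n

# Hypothetical temperatures (deg C) at one forecast hour, three stations.
mae, esdev, n = mae_and_esdev([15.0, 17.0, 20.0], [14.0, 19.0, 20.0])
print(mae, esdev, n)
```

In the project this would be repeated for each of the 49 hours and for each source of initialization, pooling all runs and stations that share the same forecast hour.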

T-tests on D. D is the difference between MAE of paired WRF temperature forecasts: D = MAE of WRF_GFS – MAE of WRF_NAM. For Q1: t-tests on D for each of 49 hours. Ho: equally accurate. Ha: significant difference in accuracy. For Q2: right-tailed t-tests on D for the hours when WRF_NAM & WRF_GFS differ in accuracy. Ho: WRF_NAM are less accurate. Ha: WRF_GFS are less accurate. (Figures: distribution P(D) versus D.)

Since we have large sample sizes for each forecast hour, the degrees of freedom become very large, and whether or not the variances of MAE are equal has little effect on the standard errors of MAE. In this case, I used Welch’s t-tests for pairs of WRF-ARW_NAM and WRF-ARW_GFS at each forecast hour to answer both of my research questions.

T-test: Sample size (n). True mean of D: $\mu_D = 0$ (Ho: WRF_NAM and WRF_GFS are the same).
Standard error of D (Error Propagation: Simple Rule for Differences):
$\mathrm{s.e.}(D) = \left[ \frac{\mathrm{ESDEV}_{GFS}^2}{n_{GFS}} + \frac{\mathrm{ESDEV}_{NAM}^2}{n_{NAM}} \right]^{1/2}$
Test statistic:
$t_{data} = \frac{D - \mu_D}{\mathrm{s.e.}(D)} = \frac{D}{\mathrm{s.e.}(D)}$
Degrees of freedom of D (Welch’s t-test):
$\mathrm{d.f.}(D) = \frac{[\mathrm{s.e.}(D)]^4}{\frac{(\mathrm{ESDEV}_{GFS}^2 / n_{GFS})^2}{n_{GFS} - 1} + \frac{(\mathrm{ESDEV}_{NAM}^2 / n_{NAM})^2}{n_{NAM} - 1}}$
Typical value is 3000.
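The quantities on this slide can be sketched in a few lines. This is a minimal sketch, not the project's actual code: the function name and the input numbers are hypothetical, and the degrees of freedom use the standard Welch-Satterthwaite form, in which the numerator is the squared sum of the two variance terms.

```python
import math

def welch_t(mae_gfs, esdev_gfs, n_gfs, mae_nam, esdev_nam, n_nam):
    """Return (D, s.e. of D, t_data, d.f. of D) for one forecast hour."""
    d = mae_gfs - mae_nam                 # D = MAE of WRF_GFS - MAE of WRF_NAM
    v_gfs = esdev_gfs ** 2 / n_gfs        # variance term for WRF_GFS
    v_nam = esdev_nam ** 2 / n_nam        # variance term for WRF_NAM
    se = math.sqrt(v_gfs + v_nam)         # simple rule for differences
    t_data = d / se                       # mu_D = 0 under Ho
    df = (v_gfs + v_nam) ** 2 / (
        v_gfs ** 2 / (n_gfs - 1) + v_nam ** 2 / (n_nam - 1)
    )
    return d, se, t_data, df

# Hypothetical inputs chosen so that d.f. lands near the typical value.
d, se, t_data, df = welch_t(2.0, 1.5, 1600, 1.8, 1.5, 1600)
print(f"D={d:.2f}  s.e.={se:.4f}  t_data={t_data:.2f}  d.f.={df:.0f}")
```

With equal spreads and equal sample sizes, the degrees of freedom reduce to 2(n - 1), which is why samples of roughly 1500 per model give a value near 3000.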

We use 5% (0.05) for the experimentwise error rate ($\alpha_e$), the risk across all 49 comparisons (n = 49), to calculate the individual comparisonwise risk ($\alpha$): $\alpha = 1 - (1 - \alpha_e)^{1/n} \approx 0.001$ (about 0.1%). Based on d.f. of D and α, t_critical ≈ 3.3 (Sullivan, M. (2004) Statistics Informed Decisions Using Data. Pearson Education Inc. Upper Saddle River, NJ). What do these data mean to our test?
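The comparisonwise risk can be checked numerically. The sketch below evaluates the formula on this slide and then approximates the two-tailed critical value with the standard normal distribution, which is an assumption that is reasonable only because the degrees of freedom are in the thousands; the slide's own value comes from a t table. Function names are hypothetical.

```python
from statistics import NormalDist

def comparisonwise_alpha(alpha_e, n_comparisons):
    """alpha = 1 - (1 - alpha_e)^(1/n), as on the slide."""
    return 1.0 - (1.0 - alpha_e) ** (1.0 / n_comparisons)

def two_tailed_critical_z(alpha):
    """ASSUMPTION: normal approximation to t for very large d.f."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

alpha = comparisonwise_alpha(0.05, 49)
z_critical = two_tailed_critical_z(alpha)
print(f"alpha = {alpha:.6f}, approximate critical value = {z_critical:.2f}")
```

The comparisonwise risk comes out at roughly 0.001, and the approximate two-tailed critical value rounds to 3.3.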

The d.f. of D ranges upward from 3000. However, as the degrees of freedom get larger, their effect on t_critical gets smaller, so we choose 3000 as a typical value of the d.f. of D. Separate individual t-tests at the 95% confidence level for each comparison would overstate the experimentwise error rate: the risk of wrongly rejecting Ho somewhere across all 49 comparisons would be bigger than 5%. To avoid this, we use 5% (0.05) for the experimentwise error rate (α_e), the risk across all 49 comparisons, to calculate the individual comparisonwise risk (α).

For Q1: If |t_data| > t_critical at the corresponding hour, reject Ho: WRF_NAM and WRF_GFS are different in accuracy of temperature at that particular forecast hour.

For Q1: t-tests on D for each of 49 hours. D is the difference between MAE of paired WRF temperature forecasts: D = MAE of WRF_GFS – MAE of WRF_NAM. Ho: equally accurate. Ha: significant difference in accuracy. (Figure: distribution P(D) versus D.)

For Q2: If t_data > t_critical at the corresponding hour, reject Ho: WRF_GFS are less accurate at that particular forecast hour. T-tests on D for the hours when WRF_NAM & WRF_GFS differ in accuracy.
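The two decision rules (two-tailed for Q1, right-tailed for Q2) can be sketched as one small function. This is only an illustration: the function name is hypothetical, and t_critical defaults to the rounded value of 3.3 quoted on the earlier slide for both tests.

```python
def evaluate_hour(t_data, t_critical=3.3):
    """Apply the Q1 and Q2 decision rules at one forecast hour.

    Returns (q1_reject, q2_reject):
      q1_reject -- |t_data| > t_critical: the two models differ in accuracy.
      q2_reject -- Q1 rejected and t_data > t_critical: WRF_GFS less accurate.
    """
    q1_reject = abs(t_data) > t_critical
    # Q2 is only tested at hours where Q1 found a significant difference.
    q2_reject = q1_reject and t_data > t_critical
    return q1_reject, q2_reject

# Hypothetical t_data values at three forecast hours.
for t in (1.0, 3.77, -4.0):
    print(t, evaluate_hour(t))
```

Because D = MAE of WRF_GFS – MAE of WRF_NAM, a large positive t_data points to WRF_GFS being less accurate, while a large negative t_data leaves the Q2 null hypothesis (WRF_NAM less accurate) standing.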

For Q2: right-tailed t-tests on D for the hours when WRF_NAM & WRF_GFS differ in accuracy. D is the difference between MAE of paired WRF temperature forecasts: D = MAE of WRF_GFS – MAE of WRF_NAM. Ho: WRF_NAM are less accurate. Ha: WRF_GFS are less accurate. (Figure: distribution P(D) versus D.)

What’s the result?

However, Problems: ① Some data are missing from mid-September to early November, which might affect the statistical results. ② A t-test assumes each comparison is independent of the others, but my forecast data are not independent across forecast hours (they are time-related), so the t-test is not the most proper way to compare the accuracy of WRF_GFS and WRF_NAM.

Conclusion:  WRF_NAM is probably more reliable for temperature forecasts within 48 hours  For the first 26 forecast hours, WRF_GFS might be as accurate as WRF_NAM  After the 44th hour, WRF_GFS is consistently less accurate  We need to find a better way to compare the accuracy of WRF_GFS & WRF_NAM

Thank you for listening ! Reference: Sullivan, M. (2004) Statistics Informed Decisions Using Data. Pearson Education Inc. Upper Saddle River, NJ

To make a forecast, the model solves a set of equations that describe how the state of the atmosphere, represented on a three-dimensional grid of points, changes over time at a series of discrete time steps.

To calculate a forecast, the model requires two kinds of information: (1) the state of the atmosphere at the grid of points at the starting time (i.e., initial conditions); and (2) the state of the atmosphere at grid points on the boundary of the forecast domain at all subsequent forecast times (i.e., boundary conditions).
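The roles of the two kinds of information can be illustrated with a toy model. The sketch below is not WRF and does not solve the atmospheric equations: it assumes a hypothetical one-dimensional temperature field carried along by a constant wind, stepped forward with a first-order upwind scheme, purely to show "Forecast = Initial + Changes" with an externally supplied boundary value. All names and numbers are made up for illustration.

```python
def step_advection(state, boundary_value, courant):
    """Advance a 1-D field by one time step (first-order upwind scheme).

    state          -- values at the grid points at the current time
    boundary_value -- value imposed at the inflow boundary for the new time
    courant        -- wind speed * time step / grid spacing (0 < courant <= 1)
    """
    new_state = [boundary_value]          # boundary condition from outside
    for i in range(1, len(state)):
        change = -courant * (state[i] - state[i - 1])
        new_state.append(state[i] + change)   # forecast = initial + change
    return new_state

def run_forecast(initial_state, boundary_values, courant=0.5):
    """Step forward once per supplied boundary value."""
    state = list(initial_state)           # initial condition
    for boundary_value in boundary_values:
        state = step_advection(state, boundary_value, courant)
    return state

# Uniform 10-degree field; warmer 12-degree air enters at the boundary.
print(run_forecast([10.0, 10.0, 10.0, 10.0], [12.0, 12.0]))
```

In this toy setup the interior points only change once information from the boundary reaches them, which is the reason a limited-area model needs boundary values at every forecast time and not only at the start.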

We would like to know, for a specific purpose, which models could produce more accurate predictions than others could.

To evaluate the model forecast accuracy, we need to compare model forecasts to observations. My research used surface weather observations from the Meteorological Assimilation Data Ingest System (MADIS) UrbaNet database