Statistical Forecasting Jan Verkade November 3, 2016
Statistical Forecasting = forecasting from data
- What does that mean?
- What other types of forecasting do you know?
Regression analysis
- Regression analysis: predicting future values of a variable using information about other variables
- Predictand: the variable that you want to forecast
- Predictor: a variable that you use as input
- What we hope to find is that the different variables do not vary independently (in a statistical sense), but that they tend to vary together
- We assume that the future will behave like the past
Regression models
A predictand may depend on its predictor(s) in varying ways:
- y ~ x
- y ~ a + bx
- y ~ x^2
- …
The linear (regression) model
$$Y_t = b_0 + b_1 X_{1t} + b_2 X_{2t} + \dots + b_k X_{kt}$$
- The prediction for Y is a straight-line function of each of the X variables
- The contributions of the different X variables to the prediction are additive
- Slopes $b_1$, $b_2$, etc.: the coefficients of the variables
- Intercept: $b_0$
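A minimal sketch of how the linear model turns predictor values into a prediction. The function name, the coefficients and the predictor values are hypothetical and chosen only for illustration; they do not come from the course data.

```python
import numpy as np

def linear_prediction(b0, b, X):
    """Return Y_t = b0 + b1*X_1t + ... + bk*X_kt for every time step t.

    X has shape (n_times, k) and b has length k.
    """
    X = np.asarray(X, dtype=float)
    b = np.asarray(b, dtype=float)
    # Each predictor contributes its own term b_i * X_it; the terms are summed.
    return b0 + X @ b

# Hypothetical intercept, slopes and predictor values (illustration only).
b0 = 1.0
b = [0.5, 2.0]
X = [[2.0, 1.0],   # time step 1: X_1 = 2, X_2 = 1
     [4.0, 0.0]]   # time step 2: X_1 = 4, X_2 = 0

y = linear_prediction(b0, b, X)
print(y)
```

Changing one predictor by one unit changes the prediction by exactly its slope, whatever the values of the other predictors; that is what "additive" means here.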
Justification of linear model for regression assumptions
Why should we assume that relationships between variables are linear?
- Because linear relationships are the simplest non-trivial relationships that can be imagined (hence the easiest to work with), and...
- Because the "true" relationships between our variables are often at least approximately linear over the range of values that are of interest to us, and...
- Even if they're not, we can often transform the variables in such a way as to linearize the relationships.
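To illustrate the last point, here is a small sketch of linearizing by transformation. It assumes, purely as an example, an exponential relationship y = 2 * exp(0.5 * x); taking the logarithm of y turns it into a straight line. The data are synthetic and noise-free.

```python
import numpy as np

# Hypothetical data that follow y = 2 * exp(0.5 * x) exactly (no noise).
x = np.linspace(0.0, 4.0, 9)
y = 2.0 * np.exp(0.5 * x)

# On the original scale the relationship is curved. After the transformation
# log(y) = log(2) + 0.5 * x it is linear, so a straight-line fit recovers it.
slope, intercept = np.polyfit(x, np.log(y), deg=1)

print("slope:", slope)
print("multiplier exp(intercept):", np.exp(intercept))
```

Predictions made in the transformed space have to be transformed back (here with `np.exp`) before they are used.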
Fitting a linear model
We fit a linear model through an objective function: minimise the mean squared error (MSE)
Steps:
1. Standardize variables: convert them to units of standard-deviations-from-the-mean
2. Calculate the average product of the standardized values
3. Minimize the mean squared error
4. Substitute, re-arrange and solve for b0 and b1
Fitting a linear model
1. Standardize variables: convert them to units of standard-deviations-from-the-mean
   $$X_t^* = \frac{X_t - \mathrm{mean}(X)}{\mathrm{stdev}(X)}, \qquad Y_t^* = \frac{Y_t - \mathrm{mean}(Y)}{\mathrm{stdev}(Y)}$$
2. Calculate the average product of the standardized values
   $$r_{XY} = \frac{1}{n}\left(X_1^* Y_1^* + X_2^* Y_2^* + \dots + X_n^* Y_n^*\right)$$
3. Minimize the mean squared error
   $$Y_t^* = r_{XY} X_t^*$$
4. Substitute, re-arrange and solve for b0 and b1
   $$\frac{Y_t - \mathrm{mean}(Y)}{\mathrm{stdev}(Y)} = r_{XY}\,\frac{X_t - \mathrm{mean}(X)}{\mathrm{stdev}(X)}$$
   $$Y_t = b_0 + b_1 X_{1t} \quad \text{(with a single predictor, } X_{1t} = X_t\text{)}$$
   $$b_1 = r_{XY}\,\frac{\mathrm{stdev}(Y)}{\mathrm{stdev}(X)}$$
   $$b_0 = \mathrm{mean}(Y) - b_1\,\mathrm{mean}(X)$$
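The four fitting steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the course's reference solution: the function name and the five data points are made up, and the standard deviation is the population version (division by n), so that the average product equals the correlation coefficient.

```python
import numpy as np

def fit_simple_linear_model(x, y):
    """Fit Y = b0 + b1 * X via standardized values and their average product."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Step 1: standardize (np.std defaults to the population form, ddof=0).
    x_star = (x - x.mean()) / x.std()
    y_star = (y - y.mean()) / y.std()

    # Step 2: the average product of standardized values is the correlation.
    r_xy = np.mean(x_star * y_star)

    # Steps 3 and 4: in standardized units the least-squares line is
    # Y* = r_xy * X*; substituting back gives slope and intercept.
    b1 = r_xy * y.std() / x.std()
    b0 = y.mean() - b1 * x.mean()
    return b0, b1, r_xy

# Hypothetical data (illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1, r_xy = fit_simple_linear_model(x, y)

# Cross-check against NumPy's own least-squares fit (slope first).
slope_ref, intercept_ref = np.polyfit(x, y, deg=1)
print(b0, b1, r_xy)
print(intercept_ref, slope_ref)
```

The cross-check with `np.polyfit` plays the same role as verifying a hand calculation against a spreadsheet's built-in regression line.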
Exercise: piezometric head within a levee
Exercise: piezometric head within a levee
[Figure labels: river water level; water pressure sensor]
Exercise: piezometric head within a levee
- Use voorhavendijk.xls
- Explore the data by building a scatter (x,y) plot
- Determine mean and standard deviations
- Determine standardized values; then explore…
  - marginal distributions (ecdf of either variable)
  - joint distribution (scatter plot)
- Determine the coefficient of correlation
- Determine the coefficients of the regression equation
- Verify by using Excel’s built-in function to show regression line
Exercise: piezometric head within a levee Discuss: is the linear model a good model?
Exercise: piezometric head within a levee How to use / interpret the regression line?
Exercise: piezometric head within a levee
- Use voorhavendijk.xls
- Explore the data by building a scatter (x,y) plot
- Determine mean and standard deviations
- Determine standardized values; then explore…
  - marginal distributions (ecdf of either variable)
  - joint distribution (scatter plot)
- Determine the coefficient of correlation
- Determine the coefficients of the regression equation
- Verify by using Excel’s built-in function to show regression line
- Explore the residuals by plotting an empirical cumulative distribution function (ecdf). What is the mean value? How are the residuals distributed?
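The residual step can be sketched as follows. The data here are synthetic stand-ins generated with a fixed random seed, not the contents of voorhavendijk.xls, and the "true" line 1.0 + 0.8 * x is an arbitrary assumption for illustration.

```python
import numpy as np

# Hypothetical stand-in data; the exercise itself uses voorhavendijk.xls.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, size=200)
y = 1.0 + 0.8 * x + rng.normal(0.0, 0.3, size=200)

# Least-squares line (np.polyfit returns the slope first, then the intercept).
b1, b0 = np.polyfit(x, y, deg=1)

# Residual = observed value minus the value on the regression line.
residuals = y - (b0 + b1 * x)

# Empirical cumulative distribution function: sort the residuals and assign
# cumulative probabilities 1/n, 2/n, ..., 1.
sorted_res = np.sort(residuals)
ecdf = np.arange(1, sorted_res.size + 1) / sorted_res.size

print("mean:", residuals.mean())
print("stdev:", residuals.std(ddof=1))
```

Plotting `ecdf` against `sorted_res` gives the ecdf curve. For a least-squares line with an intercept, the mean of the residuals is zero up to rounding error, so the informative part is their spread and shape.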
LM-model: residuals
LM-model: residuals mean: -6.16922e-18 stdev: 0.2790263
Forecasting errors
- Intrinsic risk: signal vs. noise
- Parameter risk: uncertain parameter values
- Model risk: the risk of choosing the wrong model (a linear model vs. a quadratic model, for example)
Confidence Intervals v Prediction Intervals
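As a rough sketch of the difference for a one-predictor linear model: a confidence interval describes the uncertainty of the regression line itself, while a prediction interval also includes the scatter of individual observations around that line. The code below computes only the two standard errors, using the standard textbook formulas. The function name and data are hypothetical, and an actual interval would multiply each standard error by a Student-t quantile with n - 2 degrees of freedom.

```python
import numpy as np

def interval_standard_errors(x, y, x0):
    """Standard errors at predictor value x0 for a fitted straight line.

    Returns (s, se_mean, se_pred):
      s       - residual standard error
      se_mean - standard error of the fitted line (confidence interval)
      se_pred - standard error of a new observation (prediction interval)
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size

    b1, b0 = np.polyfit(x, y, deg=1)
    residuals = y - (b0 + b1 * x)

    # Two parameters (b0, b1) were estimated, hence n - 2 degrees of freedom.
    s = np.sqrt(np.sum(residuals**2) / (n - 2))
    sxx = np.sum((x - x.mean()) ** 2)

    line_term = 1.0 / n + (x0 - x.mean()) ** 2 / sxx
    se_mean = s * np.sqrt(line_term)        # uncertainty of the line only
    se_pred = s * np.sqrt(1.0 + line_term)  # line uncertainty plus noise
    return s, se_mean, se_pred

# Hypothetical data (illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
s, se_mean, se_pred = interval_standard_errors(x, y, x0=6.0)
print(s, se_mean, se_pred)
```

The prediction interval is always the wider of the two, and both widen as x0 moves away from the mean of the data.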
An alternative statistical technique: Quantile Regression
Principles:
- Quantile regression (QR) is a method for describing conditional quantiles
- Rather than minimising the mean squared error (MSE), QR is based on minimising the mean absolute error (MAE)
- This yields not the (conditional) mean but the (conditional) median
- Other quantiles may be derived by giving positive and negative errors different weights
- E.g. weight = 0.1 for positive errors and 0.9 for negative errors yields the 0.1 quantile, with the error defined as observation minus prediction
- Fitting models may be done in transformed space to account for heteroscedasticity
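The weighting idea can be tried out on the simplest possible model, a single constant, before moving to a regression line. The sketch below is not a full quantile-regression fit: the function names are hypothetical, the data are the integers 1 to 11, and the search simply tries each data value as a candidate, since a minimiser of the weighted absolute error can always be found at a data point.

```python
import numpy as np

def weighted_absolute_error(q, y, tau):
    """Mean absolute error with weight tau on positive errors (y > q)
    and weight (1 - tau) on negative errors (y < q); error = y - q."""
    err = np.asarray(y, dtype=float) - q
    return float(np.mean(np.where(err >= 0, tau * err, (tau - 1.0) * err)))

def best_constant(y, tau):
    """Return the data value that minimises the weighted absolute error."""
    y = np.asarray(y, dtype=float)
    losses = [weighted_absolute_error(q, y, tau) for q in y]
    return float(y[int(np.argmin(losses))])

# Hypothetical sample: the integers 1, 2, ..., 11.
y = np.arange(1.0, 12.0)

median_fit = best_constant(y, tau=0.5)  # equal weights: the median
low_fit = best_constant(y, tau=0.1)     # weights 0.1 / 0.9: a low quantile

print(median_fit, low_fit)
```

In a regression setting the constant q is replaced by a line b0 + b1 * x and the same weighted error is minimised over b0 and b1.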
Application in real-time hydrologic forecasting: post-processing Ensemble techniques Post-processing techniques
Application in real-time hydrologic forecasting: post-processing
- Once a record of forecasts is in place,
- this record can be analysed for ‘forecast errors’,
- and these errors can be assumed to occur in future forecasts also
1. Find a relationship between forecast and observations
5 December 2017
2. Apply that relation to new forecasts
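The two steps can be sketched with a plain least-squares line as the "relationship". That choice is an assumption made here for brevity; quantile regression or another technique could take its place. The archive of past forecasts and observations is invented and follows obs = 0.6 + 0.8 * forecast exactly.

```python
import numpy as np

# Hypothetical archive of past forecasts and the matching observations.
past_forecasts = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
past_observations = np.array([1.4, 2.2, 3.0, 3.8, 4.6])

# Step 1: find a relationship between forecast and observation
# (np.polyfit returns the slope first, then the intercept).
b1, b0 = np.polyfit(past_forecasts, past_observations, deg=1)

# Step 2: apply that relationship to a new raw forecast.
new_forecast = 6.0
corrected_forecast = b0 + b1 * new_forecast

print(corrected_forecast)
```

The correction is only as good as the assumption that past forecast errors will recur in future forecasts.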
And here’s your forecast
Famous forecasting quotes
"I have seen the future and it is very much like the present, only longer." --Kehlog Albran, The Profit
A pretty concise description of statistical forecasting:
- We search for statistical properties of a time series that are constant in time (levels, trends, seasonal patterns, correlations and autocorrelations, etc.)
- We then predict that those properties will describe the future as well as the present
Famous forecasting quotes
"Prediction is very difficult, especially if it's about the future." --Niels Bohr, Nobel laureate in Physics
A warning of the importance of validating a forecasting model out-of-sample. It's often easy to find a model that fits the past data well (perhaps too well!), but quite another matter to find a model that correctly identifies those patterns in the past data that will continue to hold in the future.
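A small sketch of what out-of-sample validation looks like in practice. All of it is assumed for illustration: the synthetic data come from a straight line plus noise, the last ten points are held back as "the future", and a degree-6 polynomial stands in for a model that may fit the past too well.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between observations and predictions."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# Hypothetical series: a straight line plus noise, with a fixed seed.
rng = np.random.default_rng(42)
x = np.linspace(0.0, 1.0, 40)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, size=x.size)

# The first 30 points are "the past" used for fitting; the last 10 points
# are held back and play the role of "the future".
x_fit, y_fit = x[:30], y[:30]
x_new, y_new = x[30:], y[30:]

results = {}
for degree in (1, 6):
    coeffs = np.polyfit(x_fit, y_fit, deg=degree)
    results[degree] = {
        "in_sample": mse(y_fit, np.polyval(coeffs, x_fit)),
        "out_of_sample": mse(y_new, np.polyval(coeffs, x_new)),
    }

for degree, scores in results.items():
    print(degree, scores)
```

The more flexible model can never fit the past worse than the straight line, so the in-sample error says little about model quality; the held-back points are what show whether the fitted pattern continues to hold.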