Forecasting Revenue: An Example of Regression Model Building
Setting: A possibly large set of predictor variables is used to predict future quarterly revenues from data collected over the last 7 years.
Goal: Find an equation (model) that explains the variation in Y with a smaller set of predictors that are all related to Y but not too strongly related to each other (multicollinearity). Predict the next 4 quarters of revenues. Your dependent variable will be revenues or seasonally adjusted revenues, depending on whether your data has pronounced seasonality.
Hold out a sample for the validation process later: do not use the last two quarters of data (2 observations) until after you have done the validation process.
Starting point: Examine multicollinearity and poor predictors by checking correlations with a correlation matrix and by generating VIF values. Include revenue (or seasonally adjusted revenue) in the correlation matrix. Eliminate variables with very low correlation with revenue, and eliminate one variable from each pair of highly correlated (above .9) independent variables. This gives you some choice: keep the variables that have better forecasts available or that you believe should, in theory, be most related to revenues. You can swap out highly correlated variables later if you run into validation issues.
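The screening step can be sketched in a few lines of Python. This is only an illustration: the 0.2 cutoff for "very low" correlation is an assumption (only the .9 cutoff is given above), and the series names and values are made up.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def screen_predictors(revenue, predictors, low=0.2, high=0.9):
    """Return (weak, collinear_pairs).

    weak: predictors whose |r| with revenue is below `low` (assumed cutoff).
    collinear_pairs: predictor pairs whose |r| with each other exceeds `high`.
    """
    weak = [name for name, x in predictors.items()
            if abs(pearson(x, revenue)) < low]
    names = list(predictors)
    pairs = [(a, b)
             for i, a in enumerate(names)
             for b in names[i + 1:]
             if abs(pearson(predictors[a], predictors[b])) > high]
    return weak, pairs

# Made-up quarterly data for illustration only.
revenue = [10, 12, 14, 16, 18, 20, 22, 24]
predictors = {
    "gdp": [100, 105, 109, 115, 120, 124, 130, 135],
    "income": [50, 52, 55, 57, 60, 63, 65, 67],
    "noise": [1, -1, -1, 1, 1, -1, -1, 1],
}
weak, pairs = screen_predictors(revenue, predictors)
print("weak predictors:", weak)
print("highly correlated pairs:", pairs)
```

The function only flags candidates. Which member of a flagged pair to drop remains a judgment call, as described above.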
Variance Inflation Factors
Variance Inflation Factor (VIF): A measure of how highly correlated each independent variable is with the other predictors in the model. Used to identify multicollinearity. A value larger than 10 for a predictor implies large inflation of the standard errors of the regression coefficients due to this variable being in the model. Inflated standard errors lead to insignificant t-statistics for regression coefficients and wider confidence intervals.
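For reference, the usual textbook definition (not spelled out on the slide) uses R_j^2, the R-squared obtained by regressing predictor j on all the other predictors. Under that definition, the cutoff of 10 corresponds to R_j^2 = 0.9, and the coefficient's standard error is roughly 3.16 times what it would be if the predictor were uncorrelated with the others:

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
R_j^2 = 0.9 \;\Rightarrow\; \mathrm{VIF}_j = \frac{1}{1 - 0.9} = 10,
\qquad
\sqrt{\mathrm{VIF}_j} = \sqrt{10} \approx 3.16
```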
Run a multiple regression to look at the VIF values (and D-W values). Delete one of the variables with VIF > 10: choose the one with the highest VIF, or another high-VIF variable that may not have forecasts available. There is some flexibility in this step. Repeat until all VIFs are smaller than 10. This yields a reduced set of variables to use in finding an equation with All Possible Regressions.
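A minimal sketch of this elimination loop in Python with NumPy, under two assumptions: VIFs are computed as the diagonal of the inverse of the predictors' correlation matrix (equivalent to the usual definition), and the variable with the highest VIF is always the one dropped. The sketch does not model the option of dropping a different high-VIF variable because it lacks forecasts. The data and the names `x1`, `x2`, `x3` are made up.

```python
import numpy as np

def vif(X):
    """VIF of each column of X (n_obs x k, k >= 2).

    Computed as the diagonal of the inverse correlation matrix.
    """
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def drop_high_vif(X, names, threshold=10.0):
    """Repeatedly drop the highest-VIF column until all VIFs <= threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        values = vif(X)
        worst = int(np.argmax(values))
        if values[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return names, X

# Made-up data: x2 is almost a copy of x1, x3 is unrelated to both.
x1 = np.arange(1.0, 9.0)
x2 = x1 + np.array([0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1])
x3 = np.array([1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0])
X = np.column_stack([x1, x2, x3])

print("initial VIFs:", vif(X))
kept, X_reduced = drop_high_vif(X, ["x1", "x2", "x3"])
print("kept:", kept)
```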
Best Model Process using all the data (no holdouts). Use MegaStat All Possible Regressions to find an equation in which all variables are significant (p-value < .05) and that has a small standard error (large adjusted R-squared). If you have the Cp statistic, it summarizes each possible model, and the “best” model can be selected based on it. Ideally you select the model with the fewest predictors that has Cp ≤ p, where p is the number of parameters in the model (predictors plus the intercept), and has p-values < .05 for all variables.
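To make the all-possible-regressions idea concrete, here is a rough NumPy sketch. It is not MegaStat's implementation. It fits every subset of predictors and reports the standard error, adjusted R-squared, and Mallows' Cp, taking p as the number of parameters including the intercept. Coefficient p-values are omitted to keep the sketch dependency-free, so they would still need checking separately. The data are synthetic.

```python
from itertools import combinations
import numpy as np

def fit_sse(X, y):
    """Least-squares fit with an intercept; returns the sum of squared errors."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def all_possible_regressions(X, y, names):
    """Summarize every non-empty subset of predictors, best adjusted R-squared first."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n, k = X.shape
    sst = float(((y - y.mean()) ** 2).sum())
    mse_full = fit_sse(X, y) / (n - k - 1)  # error variance from the full model
    rows = []
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            p = size + 1  # parameters, including the intercept
            sse = fit_sse(X[:, list(cols)], y)
            rows.append({
                "vars": [names[c] for c in cols],
                "p": p,
                "std_error": (sse / (n - p)) ** 0.5,
                "adj_r2": 1 - (sse / (n - p)) / (sst / (n - 1)),
                "cp": sse / mse_full - (n - 2 * p),
            })
    return sorted(rows, key=lambda r: r["adj_r2"], reverse=True)

# Synthetic data: revenue depends on x1 only; x2 is irrelevant.
x1 = np.arange(1.0, 9.0)
x2 = np.array([1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0])
noise = np.array([0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, -0.2])
y = 2 + 3 * x1 + noise

results = all_possible_regressions(np.column_stack([x1, x2]), y, ["x1", "x2"])
for row in results:
    print(row["vars"], round(row["adj_r2"], 4), round(row["cp"], 2))
```

By construction the full model always has Cp equal to its own p, so Cp is only informative for comparing the smaller models against it.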
No Good Model?
Low R-squared? Outliers? Validation problems (especially low D-W)? If you had “modes” in your data (time periods where the revenue trend was clearly different), you can try adding dummy variables to handle special situations rather than deleting the data. Example: SWA pre/post-merger dummy, where 0 = pre-merger and 1 = post-merger.
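As an illustration of how such a dummy enters the regression, the sketch below builds a 0/1 column and fits it alongside one other predictor. Everything here is hypothetical: the merger quarter, the use of a plain time index as the other predictor, and the noise-free synthetic revenue series chosen so that the fitted coefficients are easy to read.

```python
import numpy as np

quarters = np.arange(1, 13)      # 12 quarters of synthetic history
merger_quarter = 7               # hypothetical: merger takes effect in quarter 7

# 0 = pre-merger, 1 = post-merger
dummy = (quarters >= merger_quarter).astype(float)

# Synthetic revenue: a trend plus a level shift of 5 after the merger.
revenue = 10 + 2 * quarters + 5 * dummy

# Regress revenue on an intercept, the time index, and the dummy.
A = np.column_stack([np.ones(len(quarters)), quarters, dummy])
coef, *_ = np.linalg.lstsq(A, revenue, rcond=None)
print("intercept, trend, merger shift:", coef)
```

The dummy's coefficient estimates the post-merger level shift, so the pre-merger quarters stay in the data instead of being deleted.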
Validating Your Model
Validation with the holdout sample: Forecast the last two known quarters with 95% prediction intervals. Do the actual values fall within the lower and upper prediction limits, implying that the predictions seem reasonable?
If so, use all quarters, redo the equation using the same variables, and forecast the next 4 quarters. Check the assumptions for the validation model.
If not, try an alternative model from the All Possible Regressions options, or see if there is a reason that the last two known quarters are different in some way. Look at the quarterly reports and see if they suggest the use of a dummy variable. Redo the validation process.
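The holdout check can be sketched as follows, using the standard prediction-interval formula for a new observation: predicted value plus or minus t times s times sqrt(1 + x0'(X'X)^-1 x0). The t critical value is passed in by hand (2.447 is the two-sided 95% value for 6 degrees of freedom) so the sketch needs only NumPy. The data, with a single time-index predictor, are synthetic.

```python
import numpy as np

def prediction_intervals(X_train, y_train, X_new, t_crit):
    """Return (predictions, lower limits, upper limits) for the rows of X_new."""
    A = np.column_stack([np.ones(len(y_train)), X_train])
    A_new = np.column_stack([np.ones(len(X_new)), X_new])
    xtx_inv = np.linalg.inv(A.T @ A)
    coef = xtx_inv @ A.T @ y_train
    resid = y_train - A @ coef
    df = len(y_train) - A.shape[1]
    s = np.sqrt(resid @ resid / df)            # standard error of the regression
    pred = A_new @ coef
    # x0' (X'X)^-1 x0 for each new row
    h_new = np.einsum("ij,jk,ik->i", A_new, xtx_inv, A_new)
    half_width = t_crit * s * np.sqrt(1 + h_new)
    return pred, pred - half_width, pred + half_width

# Synthetic data: 10 quarters, the last 2 held out.
t = np.arange(1.0, 11.0)
noise = np.array([0.3, -0.2, 0.1, -0.3, 0.2, -0.1, 0.2, -0.2, 0.1, -0.1])
y = 5 + 2 * t + noise

train, hold = slice(0, 8), slice(8, 10)
# 8 training rows, 2 parameters -> 6 degrees of freedom -> t = 2.447
pred, lower, upper = prediction_intervals(t[train], y[train], t[hold], t_crit=2.447)

inside = (y[hold] >= lower) & (y[hold] <= upper)
for actual, lo, hi, ok in zip(y[hold], lower, upper, inside):
    print(f"actual={actual:.2f}  interval=({lo:.2f}, {hi:.2f})  inside={ok}")
```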
Regression Diagnostics
Model assumptions: Residual plots or other diagnostics can be used to check the assumptions.
-- Plot of residuals versus each variable should be a random cloud. U-shaped (or rainbow): nonlinear relationship.
-- Plot of residuals versus predicted values should be a random cloud. Wedge-shaped: non-constant (increasing) variability.
-- Residuals should be mound-shaped (normal). Use skewness/kurtosis or a normal probability plot to check.
-- Plot of residuals versus time order (time series data) should be a random cloud. If D-W < 1.3, residuals are not independent.
Cook’s D is a check for influential observations that may have large impacts on the equation. Check the data for accuracy where Cook’s D is high.
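For reference, the Durbin-Watson (D-W) statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. A small pure-Python sketch, using the 1.3 cutoff from above and made-up residuals:

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in time order."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

def independence_flag(residuals, cutoff=1.3):
    """True if D-W falls below the cutoff, suggesting non-independent residuals."""
    return durbin_watson(residuals) < cutoff

# Made-up residuals: a slow drift (positively autocorrelated) vs. an alternating pattern.
drifting = [1.0, 0.8, 0.6, 0.3, -0.2, -0.6, -0.9, -1.0]
alternating = [1.0, -1.0, 1.0, -1.0]

print(durbin_watson(drifting), independence_flag(drifting))
print(durbin_watson(alternating), independence_flag(alternating))
```

Values near 2 are consistent with independent residuals, while values well below 2 point to positive autocorrelation.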
Detecting Influential Observations
Studentized Residuals: Residuals divided by their estimated standard errors. Observations shaded dark blue in the output are considered outliers from the equation.
Leverage Values: A measure of how far an observation is from the others in terms of the levels of the independent variables (not the dependent variable). Observations shaded dark blue in the output are considered outliers in the X values.
Cook’s D: A measure of the aggregate impact of each observation on the group of regression coefficients, as well as on the group of fitted values. Values larger than 1 are considered highly influential.
Influential observations may point to quarters worth researching to see whether something special happened that would justify a dummy variable.
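The three measures can be computed directly from the hat matrix. The NumPy sketch below uses the textbook formulas with internally studentized residuals. Whether the software's output uses this variant or the deleted (externally studentized) one is not stated here, so treat that choice as an assumption. The data are synthetic, with one deliberately planted unusual point.

```python
import numpy as np

def influence_measures(X, y):
    """Return (studentized residuals, leverage values, Cook's D) for an OLS fit."""
    A = np.column_stack([np.ones(len(y)), X])
    n, p = A.shape
    H = A @ np.linalg.inv(A.T @ A) @ A.T       # hat matrix
    leverage = np.diag(H)
    resid = y - H @ y
    s2 = resid @ resid / (n - p)               # residual variance estimate
    student = resid / np.sqrt(s2 * (1 - leverage))
    cooks_d = student ** 2 * leverage / (p * (1 - leverage))
    return student, leverage, cooks_d

# Synthetic data: y = 2x, except the last point, which is far out in x and off the line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 10.0])

student, leverage, cooks_d = influence_measures(x, y)
for i in range(len(y)):
    print(i, round(student[i], 2), round(leverage[i], 3), round(cooks_d[i], 3))
```

In this example the last observation has by far the largest leverage and a Cook's D well above 1. That is the kind of point the slide suggests researching.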
The Final Forecasts
When you have validated (holdouts, assumptions) the best model possible, discuss the forecasts. Look at their prediction intervals, and re-seasonalize forecasts that used deseasonalized data. Do the forecasts make sense? You may have actual Q3 revenues to compare your forecast with. Superimpose your forecasts on a time series plot of revenues and ensure that the forecasts seem reasonable. Document all your data and forecast sources. Write a report that documents all aspects of the forecasting process.
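A short sketch of the re-seasonalizing step, assuming multiplicative seasonal indexes (one per quarter, averaging 1.0). If the data were deseasonalized additively, the indexes would be added rather than multiplied. The index values and forecasts below are made up.

```python
def reseasonalize(sa_values, start_quarter, seasonal_index):
    """Multiply seasonally adjusted values by the index of the quarter they fall in.

    sa_values: forecasts (or prediction limits) on the seasonally adjusted scale.
    start_quarter: quarter (1-4) of the first value.
    seasonal_index: dict mapping quarter number to its multiplicative index.
    """
    out = []
    for i, value in enumerate(sa_values):
        quarter = (start_quarter - 1 + i) % 4 + 1
        out.append(value * seasonal_index[quarter])
    return out

# Made-up indexes and seasonally adjusted forecasts starting in Q3.
seasonal_index = {1: 0.9, 2: 1.0, 3: 1.2, 4: 0.9}
sa_forecasts = [100.0, 100.0, 100.0, 100.0]

forecasts = reseasonalize(sa_forecasts, start_quarter=3, seasonal_index=seasonal_index)
print(forecasts)
```

The same function can be applied to the lower and upper prediction limits so that the intervals are reported on the original revenue scale as well.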