Regression Forecasting and Model Building

1 Regression Forecasting and Model Building
Forecasting company revenue with multiple linear regression

2 Forecasting Revenue: An Example of Regression Model Building
Setting: A possibly large set of predictor variables (i.e., revenue drivers) is used to predict future quarterly revenues from data collected over the previous 7 years. Goal: Find an equation (model) that explains variation in Y with a smaller set of predictors that are all related to Y but not too related to each other (multicollinearity), then predict revenues for the next four quarters. Your dependent variable will be revenues, or seasonally adjusted revenues if your data show pronounced seasonality.

3 Forecasting Revenue: An Example of Regression Model Building
When you speculate on predictors, it is not unusual for many of them to be strongly related to each other. This is especially the case when one variable is largely derived from another. Predictors that are too highly correlated produce multicollinearity, where the predictors add essentially no additional information while interfering with each other in fitting the dependent variable. Starting Point: Examine multicollinearity by checking correlations with a correlation matrix and by generating VIF values. This gives you some choice in which variables to keep, favoring those that have better forecasts available or that you believe should, in theory, be most related to revenues.
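The correlation-matrix check can be sketched in a few lines of numpy. This is a minimal sketch with made-up data, where `web_traffic` is deliberately constructed as a near-duplicate of `ad_spend` so a high-correlation pair shows up (all variable names are hypothetical):

```python
import numpy as np

# Hypothetical quarterly data: rows = quarters, columns = candidate predictors.
rng = np.random.default_rng(0)
ad_spend = rng.normal(100, 10, 28)                    # 7 years of quarterly data
web_traffic = ad_spend * 1.5 + rng.normal(0, 2, 28)   # mostly derived from ad_spend
gdp_growth = rng.normal(2, 0.5, 28)

X = np.column_stack([ad_spend, web_traffic, gdp_growth])

# Pairwise correlation matrix of the predictors.
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))

# Flag pairs with |r| > 0.8 as multicollinearity suspects.
high = [(i, j) for i in range(corr.shape[0])
        for j in range(i + 1, corr.shape[1]) if abs(corr[i, j]) > 0.8]
print("Highly correlated pairs (column indices):", high)
```

The 0.8 cutoff is just an illustrative screening threshold; the VIF values on the next slide give the formal check.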

4 Variance Inflation Factors
Variance Inflation Factor (VIF) – Measure of how highly correlated each independent variable is with the other predictors in the model; used to identify multicollinearity. Values larger than 10 for a predictor imply large inflation of the standard errors of the regression coefficients due to that variable being in the model. Inflated standard errors lead to insignificant t-statistics for regression coefficients and wider confidence intervals.
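The VIF for predictor k is 1 / (1 − R²ₖ), where R²ₖ comes from regressing predictor k on all the other predictors. A minimal numpy sketch (hypothetical data, with x2 built as a near-copy of x1 so both show inflated VIFs):

```python
import numpy as np

def vif(X):
    """VIF for each column of X: regress column k on the remaining
    columns (with intercept) and return 1 / (1 - R^2) = SST / SSE."""
    n, p = X.shape
    out = []
    for k in range(p):
        y = X[:, k]
        Z = np.column_stack([np.ones(n), np.delete(X, k, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        out.append(((y - y.mean()) @ (y - y.mean())) / (resid @ resid))
    return np.array(out)

# Illustrative data: x2 is nearly a linear function of x1, so both get large VIFs.
rng = np.random.default_rng(1)
x1 = rng.normal(size=40)
x2 = x1 + rng.normal(scale=0.05, size=40)
x3 = rng.normal(size=40)
X = np.column_stack([x1, x2, x3])
print(np.round(vif(X), 1))   # x1 and x2 well above 10; x3 near 1
```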

5 Forecasting Revenue: An Example of Regression Model Building
Run a multiple regression to look at the VIF values (and the D-W statistic). Delete one of the variables from those with VIF > 10: use the correlation matrix to see which pairs of high-VIF variables are highly correlated, and for each pair drop the one with the highest VIF, or the high-VIF variable that lacks forecasts or has other problems (such as nonlinearity). There is some flexibility in this step, and it may require some investigation. Repeat until all VIFs are smaller than 10. This results in a reduced set of variables to use in finding an equation with All Possible Regressions.
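The deletion loop above can be sketched as follows (numpy only, hypothetical data; in practice you would exercise the judgment described above about which of a correlated pair to keep, rather than dropping mechanically):

```python
import numpy as np

def vif(X):
    # VIF of each column: 1 / (1 - R^2) = SST / SSE from regressing it on the rest.
    n, p = X.shape
    out = []
    for k in range(p):
        y = X[:, k]
        Z = np.column_stack([np.ones(n), np.delete(X, k, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        out.append(((y - y.mean()) @ (y - y.mean())) / (resid @ resid))
    return np.array(out)

def prune_by_vif(X, names, threshold=10.0):
    """Drop the predictor with the largest VIF until all VIFs < threshold."""
    X = np.asarray(X, float).copy()
    names = list(names)
    while X.shape[1] > 1:
        v = vif(X)
        if v.max() < threshold:
            break
        worst = int(np.argmax(v))
        print(f"dropping {names[worst]} (VIF = {v[worst]:.1f})")
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names

# Hypothetical data: x2 is a near-duplicate of x1, so one of them gets dropped.
rng = np.random.default_rng(2)
x1 = rng.normal(size=40)
x2 = x1 + rng.normal(scale=0.05, size=40)
x3 = rng.normal(size=40)
X, kept = prune_by_vif(np.column_stack([x1, x2, x3]), ["x1", "x2", "x3"])
print("kept:", kept)
```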

6 Forecasting Revenue: An Example of Regression Model Building
Best-model process using the data: Use MegaStat All Possible Regressions to find an equation that has the fewest predictors, all significant (p-value < .05), with a small standard error and a large adjusted R-squared. MegaStat orders the models from highest adjusted R-squared and lowest standard error, so look near the top for best-model candidates with all significant p-values and the fewest predictors. You can use the formula =IF(COUNTIF(predictor range,">.05")>0,"","OK") to flag models in which every predictor is significant, then compare the "OK" models on adjusted R-squared and standard error. If MegaStat provides a Cp statistic, it summarizes each possible model, and the "best" model can be selected based on it: ideally, select the model with the fewest predictors p that has Cp ≈ p and has p-values < .05 for all variables.
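MegaStat automates this search; as a sketch of what All Possible Regressions does under the hood, the following enumerates every predictor subset, ranks by adjusted R-squared, and keeps the best model whose coefficients all look significant. The data are made up, and |t| > 2.1 is used as a rough stand-in for p < .05 at these degrees of freedom (plain numpy has no t-distribution p-values):

```python
import numpy as np
from itertools import combinations

def fit_ols(X, y):
    """Returns beta, residual std error s, adjusted R^2, and t-statistics."""
    n, p = X.shape
    Z = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    sse = resid @ resid
    sst = (y - y.mean()) @ (y - y.mean())
    df = n - p - 1
    s = np.sqrt(sse / df)
    adj_r2 = 1 - (sse / df) / (sst / (n - 1))
    cov = s**2 * np.linalg.inv(Z.T @ Z)
    t = beta / np.sqrt(np.diag(cov))
    return beta, s, adj_r2, t

# Hypothetical predictors; y truly depends on x1 and x3 only.
rng = np.random.default_rng(3)
n = 28
Xall = rng.normal(size=(n, 4))
names = ["x1", "x2", "x3", "x4"]
y = 5 + 3 * Xall[:, 0] + 2 * Xall[:, 2] + rng.normal(scale=1.0, size=n)

results = []
for r in range(1, 5):
    for cols in combinations(range(4), r):
        _, s, adj_r2, t = fit_ols(Xall[:, cols], y)
        all_sig = bool(np.all(np.abs(t[1:]) > 2.1))  # rough proxy for p < .05
        results.append((adj_r2, s, cols, all_sig))

results.sort(reverse=True)                 # highest adjusted R^2 first
best = next(r for r in results if r[3])    # best model with all predictors "significant"
print("best model:", [names[c] for c in best[2]],
      f"adj R^2 = {best[0]:.3f}, s = {best[1]:.2f}")
```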

7 Forecasting Revenue: An Example of Regression Model Building
Again you have some flexibility here to choose a set of variables with desirable qualities (e.g. good forecasts available). Minor differences in adjusted R-squared and standard error are not likely to have a significant impact on your forecast results. Keep in mind that p-values only indicate confidence that the slope is not zero: you need only be confident enough, and smaller p-values do not translate into better forecasts. Predictors with p-values that are small but larger than .05 may still be good for your model. If you have to go far down the list, sacrificing adjusted R-squared and standard error, consider using a model with slightly less significant predictors, or swap one or more variables for one of the highly correlated variables you left out previously.

8 Validating Your Model When you forecast with speculative predictors, it is possible that the data coincidentally have a relationship to the dependent variable ("spurious correlation"), especially with small amounts of time series data. To help address this, we use a "hold-out sample" in a validation process to confirm that the relationships actually exist. Validation with a hold-out sample: Run the regression with the selected best model, leaving out the last two quarters of data. Forecast the quarters you held out with 95% prediction intervals. Check the assumptions for the validation model; if they are not satisfied, can you fix the problem, perhaps by transforming the data? Do the actual values fall within the lower and upper prediction limits, implying that the predictions seem reasonable? If not, try an alternative model from the All Possible Regressions output, or see if there is a reason the held-out quarters are different in some way: look at the quarterly reports and see whether they suggest a dummy variable. Then redo the validation process.
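The hold-out computation can be sketched as follows (hypothetical data; the t critical value is hard-coded for df = 23 and should be looked up for your own degrees of freedom):

```python
import numpy as np

# Hypothetical model and data: 28 quarters, two predictors.
rng = np.random.default_rng(4)
n = 28
x = np.column_stack([rng.normal(size=n), rng.normal(size=n)])
y = 10 + 4 * x[:, 0] + 2 * x[:, 1] + rng.normal(scale=1.0, size=n)

# Hold out the last two quarters and fit on the rest.
X_fit, y_fit = x[:-2], y[:-2]
X_hold, y_hold = x[-2:], y[-2:]

Z = np.column_stack([np.ones(len(y_fit)), X_fit])
beta, *_ = np.linalg.lstsq(Z, y_fit, rcond=None)
resid = y_fit - Z @ beta
df = len(y_fit) - Z.shape[1]
s = np.sqrt(resid @ resid / df)
XtX_inv = np.linalg.inv(Z.T @ Z)

t_crit = 2.07   # approx. 97.5th percentile of t with df = 23; look up for your df
checks = []
for x0, actual in zip(X_hold, y_hold):
    z0 = np.concatenate([[1.0], x0])
    pred = z0 @ beta
    se_pred = s * np.sqrt(1 + z0 @ XtX_inv @ z0)   # prediction (not confidence) interval
    lo, hi = pred - t_crit * se_pred, pred + t_crit * se_pred
    checks.append(lo <= actual <= hi)
    print(f"forecast {pred:6.2f}  95% PI [{lo:6.2f}, {hi:6.2f}]  actual {actual:6.2f}")
print("actuals inside their prediction intervals:", checks)
```

Note the extra 1 under the square root: a prediction interval accounts for the noise in a single future observation, not just the uncertainty in the fitted line.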

9 Regression Diagnostics
Model Assumptions: Residual plots or other diagnostics can be used to check the assumptions.
-- Plot of residuals versus each variable should be a random cloud; U-shaped (or rainbow) → nonlinear relationship.
-- Plot of residuals versus predicted values should be a random cloud; wedge-shaped → non-constant (increasing) variability.
-- Residuals should be mound-shaped (normal); use skewness/kurtosis or a normal probability plot to check.
-- Plot of residuals versus time order (time series data) should be a random cloud; if D-W < 1.3, residuals are not independent.
Cook's D is a check for influential observations that may have large impacts on the equation. Check data for accuracy or errors (e.g. typos, wrong units, etc.).
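The D-W statistic is easy to compute directly from the residuals; a value near 2 indicates independent-looking residuals. A sketch with simulated residual series:

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic: sum of squared successive differences
    over the sum of squared residuals. Near 2 looks independent;
    below ~1.3 suggests positive autocorrelation."""
    resid = np.asarray(resid, float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(5)
white = rng.normal(size=200)        # independent residuals
ar = np.zeros(200)                  # positively autocorrelated residuals
for t in range(1, 200):
    ar[t] = 0.8 * ar[t - 1] + rng.normal()

print(round(durbin_watson(white), 2))   # near 2
print(round(durbin_watson(ar), 2))      # well below 1.3
```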

10 Detecting Influential Observations
Studentized Residuals – Residuals divided by their estimated standard errors. Observations in dark blue are considered outliers from the equation. Leverage Values – Measure of how far an observation is from the others in terms of the levels of the independent variables (not the dependent variable). Observations in dark blue are considered to be outliers in the X values. Cook’s D – Measure of aggregate impact of each observation on the group of regression coefficients, as well as the group of fitted values. Values larger than 1 are considered highly influential. Influential observations may suggest quarters to research to see if something special happened that may suggest a dummy variable.
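Leverage values, studentized residuals, and Cook's D all come from the hat matrix H = X(XᵀX)⁻¹Xᵀ. A numpy sketch with one deliberately planted influential observation (simple regression, made-up data):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 30
x = rng.normal(size=n)
y = 2 + 3 * x + rng.normal(scale=0.5, size=n)
x[-1], y[-1] = 6.0, 0.0        # plant one influential observation far from the cloud

Z = np.column_stack([np.ones(n), x])
H = Z @ np.linalg.inv(Z.T @ Z) @ Z.T       # hat matrix
h = np.diag(H)                             # leverage values
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
resid = y - Z @ beta
p = Z.shape[1]
s2 = resid @ resid / (n - p)

stud = resid / np.sqrt(s2 * (1 - h))       # (internally) studentized residuals
cooks_d = stud ** 2 * h / ((1 - h) * p)    # Cook's D: aggregate influence on the fit
print("max Cook's D:", round(cooks_d.max(), 2), "at obs", int(np.argmax(cooks_d)))
```

The planted point has both high leverage (extreme x) and a large studentized residual, so its Cook's D exceeds the rule-of-thumb cutoff of 1.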

11 The Final Forecasts Add the last two quarters back into your data set, re-estimate the equation using the same variables, and forecast the next four quarters. Recheck the assumptions now that you have 2 additional data points. Do the forecasts make sense? Superimpose your forecasts on a time series plot of revenues and ensure that they seem reasonable; if not, try to explain the discrepancy or find the error. Document all your data and forecast sources, and write a report that documents all aspects of the forecasting process.

