Looking Ahead of the Curve: An ARIMA Modeling Approach to Enrollment Forecasting
John G. Zhang, Ph.D.
Harper College
jzhang@harpercollege.edu
Topics
- Why forecast
- How to forecast
- Why ARIMA
- What is ARIMA
- How to ARIMA
- How ARIMA did
- Discussion
47th AIR Annual Forum
Why Forecast
- Queries and reports: what was
- Dashboard: what is
- Forecasts: what will be
- Enrollment forecasts: more valuable for resource planning
How to Forecast
- Naïve forecast: random walk, moving average
- Exponential smoothing
- Markov chain
- Regression
- ARIMA
- Others
- Combining methods
Why ARIMA
- Naïve forecast: best guess if no patterns
- Exponential smoothing: usually designed for one-step-ahead forecasts
- Markov chain: see reference
- Regression: frequently violates the assumption of uncorrelated errors
- ARIMA: worked well, more later
- Others: see reference
- Combining methods: non-directional
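The simpler methods listed above can be sketched in a few lines. This is a minimal illustration only, not the code used for the talk: the function names, the smoothing weight `alpha`, and the enrollment values are all invented for the example.

```python
def random_walk_forecast(series):
    """Naive forecast: the next value equals the last observed value."""
    return series[-1]


def moving_average_forecast(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    recent = series[-window:]
    return sum(recent) / len(recent)


def exponential_smoothing_forecast(series, alpha):
    """Simple exponential smoothing; the final level is the one-step-ahead forecast."""
    level = series[0]
    for value in series[1:]:
        # New level is a weighted mix of the latest value and the old level.
        level = alpha * value + (1 - alpha) * level
    return level


# Hypothetical enrollment counts, invented for illustration only.
enrollment = [100.0, 104.0, 103.0, 108.0]
print(random_walk_forecast(enrollment))
print(moving_average_forecast(enrollment, window=2))
print(exponential_smoothing_forecast(enrollment, alpha=0.5))
```

Each function returns a single one-step-ahead value, which is why these methods serve mainly as a baseline against which a multi-period ARIMA forecast can be judged.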
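What is ARIMA
AutoRegressive Integrated Moving Average. Generally, the model is given by
$$\left(1 - \sum_{i=1}^{p} \phi_i B^i\right)(1 - B)^d (X_t - \theta_0) = \left(1 - \sum_{i=1}^{q} \theta_i B^i\right)\varepsilon_t$$
where
- $X_t$ is the time series value at time $t$
- $\theta_0$ is a constant
- $B$ is the backshift (lag) operator, so that $B^i X_t = X_{t-i}$
- $i$ is the number of lags or spans
- $\varepsilon_t$ is the error term at time $t$
- $\phi_i$ and $\theta_i$ are the AR and MA parameters
- $p$, $d$, and $q$ are the orders of the AR, I, and MA parts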
- If $p = 1$, $d = 0$, $q = 1$, ARMA(1, 1): $(1 - \phi_1 B)(X_t - \theta_0) = (1 - \theta_1 B)\varepsilon_t$
- If $p = 1$, $d = 0$, $\theta_1 = 0$, AR(1) model: $(1 - \phi_1 B)(X_t - \theta_0) = \varepsilon_t$
- If $p = 1$, $\phi_1 = 1$, $d = 0$, $\theta_1 = 0$, random walk: $(1 - B)(X_t - \theta_0) = \varepsilon_t$
- If $\phi_1 = 0$, $d = 0$, $\theta_1 = 0$, constant: $X_t - \theta_0 = \varepsilon_t$
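To see how the special cases above relate to one another, the ARMA(1, 1) form can be expanded into a recursion, X_t = theta0 + phi1 (X_{t-1} - theta0) + e_t - theta1 e_{t-1}, and run on a fixed list of shocks. The sketch below is illustrative only: `arma11_path` is a hypothetical helper, `phi1` and `theta1` stand for the AR and MA parameters, and the starting value is assumed to equal the constant `theta0`.

```python
def arma11_path(shocks, theta0, phi1, theta1):
    """Generate X_t from (1 - phi1*B)(X_t - theta0) = (1 - theta1*B) e_t.

    Assumes the series starts at theta0 and the shock before the first one is 0.
    """
    x_prev = theta0
    e_prev = 0.0
    path = []
    for e in shocks:
        x = theta0 + phi1 * (x_prev - theta0) + e - theta1 * e_prev
        path.append(x)
        x_prev, e_prev = x, e
    return path


shocks = [1.0, 0.0, 0.0]  # one unit shock, then nothing

# AR(1): the shock decays geometrically back toward theta0.
print(arma11_path(shocks, theta0=10.0, phi1=0.5, theta1=0.0))
# Random walk (phi1 = 1): the shock never decays.
print(arma11_path(shocks, theta0=10.0, phi1=1.0, theta1=0.0))
# Constant plus noise (phi1 = 0): the shock affects one period only.
print(arma11_path(shocks, theta0=10.0, phi1=0.0, theta1=0.0))
```

Feeding the same shocks through different parameter settings makes the difference visible: the AR parameter controls how long a shock persists, and setting it to 1 gives the random walk.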
How to ARIMA
- Box and Jenkins (1976) notation: (p, d, q)(P, D, Q)s, where the first triple gives the nonseasonal orders and the second gives the seasonal orders at span s
- Four stages:
  - Identification
  - Estimation
  - Validation
  - Forecasting
How to ARIMA
- SPSS Trends module:
  - Version 12 worked well
  - Versions 13 and 14: algorithms changed
  - Same data, same program, different forecast
- SAS/ETS module: ARIMA procedure
  - More flexible
  - Forecasts consistent
  - Automation possible thanks to macros
Identification
- Series plot
- Autocorrelation plot
- Dickey-Fuller test of the unit root hypothesis
- AR models to compare the log likelihood values for a series and its transformed series
Identification
- Degree of differencing
- Order of AR
- Order of MA
- Seasonality, if any
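The autocorrelation plot used at the identification stage is built from the sample autocorrelation function. The following is a rough sketch of that calculation using the usual textbook estimator; `sample_acf` is a hypothetical helper, not a SAS or SPSS routine, and the series is invented.

```python
def sample_acf(series, max_lag):
    """Sample autocorrelations r_0..r_max_lag of a series (r_0 is always 1)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)  # lag-0 sum of squared deviations
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(max_lag + 1)
    ]


# Hypothetical series with a repeating 4-term pattern (invented values).
series = [10.0, 12.0, 9.0, 5.0] * 5
for lag, r in enumerate(sample_acf(series, max_lag=8)):
    print(lag, round(r, 3))
```

In a series like this one, the autocorrelation peaks again at multiples of the pattern length, which is the kind of signature that suggests a seasonal term. Autocorrelations that die out very slowly are commonly read as a sign that differencing is needed.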
Estimation
- Q statistics
- Goodness-of-fit criteria:
  - Variance estimate
  - Akaike information criterion (AIC)
  - Schwarz Bayesian criterion (SBC)
- Significance of parameters
- Residual analysis
- Mean Absolute Percent Error (MAPE)
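Several of the criteria above are simple functions of the fitted model's log likelihood and forecast errors. The sketch below assumes the standard definitions, AIC = -2 ln L + 2k and SBC = -2 ln L + k ln n, with k estimated parameters and n residuals; the function names and the numbers are invented for illustration.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: smaller is better."""
    return -2.0 * log_likelihood + 2.0 * n_params


def sbc(log_likelihood, n_params, n_obs):
    """Schwarz Bayesian criterion: penalizes extra parameters more as n grows."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


def mape(actual, forecast):
    """Mean absolute percent error, in percent. Actual values must be nonzero."""
    errors = [abs(a - f) / abs(a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical values, invented for illustration only.
print(aic(log_likelihood=-10.0, n_params=2))
print(sbc(log_likelihood=-10.0, n_params=2, n_obs=100))
print(mape(actual=[100.0, 200.0], forecast=[110.0, 190.0]))
```

AIC and SBC are useful for comparing candidate models fitted to the same series, while MAPE measures how far forecasts land from actual values and so speaks to accuracy rather than fit.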
Data
- Time series data
- Date variable: year, quarter, month, week, day, hour, minute, second
- Enrollment data: FTE, headcount, seatcount
- Data points
- Nature of the series determines the forecast
Patterns of Data
- Trend: steady increase or decrease in the values of a time series
- Cycle: long-term patterns of rising and falling data
- Seasonality: regular change in the data values that occurs at the same time in a given period
FTE (figure)
FTE Pattern
- Trend: FTE increased from 1998 to 2006, suggesting the series is nonstationary and differencing is necessary
- Seasonality: FTE is higher in the Fall and Spring and lower in the Summer every year, implying that a seasonal factor should be included in model building
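The effect of differencing on a series with both a trend and a seasonal pattern can be shown on a toy example. This is a sketch under stated assumptions: the series is invented, it is exactly linear plus a repeating pattern, and the span of 3 is chosen only to keep the example short. It is not the FTE data.

```python
def difference(series, lag=1):
    """Return x_t - x_{t-lag}; lag=1 is a regular difference, lag=s a seasonal one."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


# Hypothetical series: linear trend plus a repeating 3-term pattern (invented values).
season = [5.0, 3.0, -8.0]
series = [100.0 + 2.0 * t + season[t % 3] for t in range(12)]

# A seasonal difference removes the repeating pattern and turns the
# linear trend into a constant (slope times span).
seasonal_diff = difference(series, lag=3)
# A regular difference then removes that constant.
both = difference(seasonal_diff, lag=1)

print(seasonal_diff)
print(both)
```

Real enrollment data will not difference down to exact constants, but the same two operations are what the d and D orders in the (p, d, q)(P, D, Q)s notation refer to.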
Autocorrelations and Partial Autocorrelations (ACF and PACF)
Lag   ACF        PACF
 0    1.00000
 1    0.64901    0.64901
 2    0.29267   -0.22210
 3   -0.06855   -0.28449
 4   -0.42111   -0.37073
 5   -0.42944    0.18006
 6   -0.43520   -0.26468
 7   -0.40880   -0.29117
 8   -0.38067   -0.45581
 9   -0.06784    0.72564
10    0.25681    0.06626
11    0.55983    0.26005
12    0.85774    0.18460
13    0.55625   -0.22575
14    0.24975    0.14806
15   -0.06186    0.10247
16   -0.36715    0.16423
17   -0.37708   -0.18254
18   -0.38454    0.15059
19   -0.36197   -0.04279
20   -0.33780    0.11045
21   -0.07144   -0.18268
22    0.20576    0.08106
23    0.46222   -0.06703
Q Statistics
Autocorrelation Check of Residuals
To Lag   Chi-Square   DF   Pr > ChiSq   Autocorrelations
 6          385.69     6   <.0001       0.937  0.874  0.808  0.743  0.727  0.711
12          777.02    12   <.0001       0.709  0.707  0.752  0.799  0.833  0.866
18         1107.12    18   <.0001       0.811  0.755  0.697  0.640  0.624  0.608
24         1436.47    24   <.0001       0.605  0.603  0.640  0.679  0.706  0.732
- Q statistics show that the autocorrelations at the tested lags are highly statistically significant
- Autocorrelations were very high
- Further action needed
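A Q statistic of this kind pools the first m autocorrelations into one chi-square value. The sketch below assumes the Ljung-Box form, Q = n(n + 2) * sum over k of r_k^2 / (n - k); the helper name, the autocorrelations, and the sample size are invented, so the output is not meant to reproduce the table above.

```python
def ljung_box_q(autocorrelations, n_obs):
    """Ljung-Box Q for autocorrelations r_1..r_m computed from n_obs observations."""
    total = 0.0
    for k, r in enumerate(autocorrelations, start=1):
        total += r * r / (n_obs - k)
    return n_obs * (n_obs + 2) * total


# Hypothetical autocorrelations at lags 1 to 6 (invented values).
r = [0.30, 0.20, 0.10, 0.05, 0.02, 0.01]
q = ljung_box_q(r, n_obs=100)
print(round(q, 2))
```

The result is compared with a chi-square distribution. A large Q with a small p-value means the autocorrelations are jointly too large to be white noise, which is the situation the table reports.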
FTE Forecast (figure)
How ARIMA Did
- Accuracy: what matters most
Periods ahead   FTE      HC
 2              0.74%    0.50%
 6              1.43%    1.65%
10              1.40%    2.52%
- Forecast error generally grows further into the future
- Eleanor S. Fox (2005): 1.2% (4), 4.1% (8)
- NCES (2003): 1.9% (2), 3.6% (6)
Discussion
- In theory, other factors can be included along with the time series itself, as in regression:
  - Unemployment rate
  - Consumer Price Index (CPI)
  - High school student population
  - District population
  - Tuition
- Forecasts used for forecasting?
Discussion
- Stationarity and homogeneity
- Scarcity and spuriousness
- Seasonality and outliers
- Raw or cooked data
- Data mining and stepwise
- Fit and accuracy
- Additive or multiplicative (subset/factored)
Discussion
- Science and art
- Objective and subjective
- Quantitative and qualitative
- Over-differencing and over-fitting
- Parsimony and uncertainty
- Simple or complex