Stock Prediction with ARIMA
Kihwan Lee
8/23/2018
Exploratory Data Analysis
Data source: Quandl
  National unemployment rate
  Stock prices: Apple, Google, J.P. Morgan
  Foreign exchange rate: Korea
Correlation and Bollinger Bands
Correlation with the national unemployment rate:
  Apple stock closing price: -0.5 (negative correlation, somewhat noticeable)
  Foreign exchange rate with South Korea: -0.73 (negative correlation, somewhat stronger)
Bollinger Bands provide a relative definition of high and low prices in a market.
  Low band: the price has reached a relative low. A likely time to buy.
  High band: the price has reached a relative high. A likely time to sell.
The stock prices are well bounded by the Bollinger Bands.
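The slide does not say how the bands or the correlations were computed. The sketch below uses the common convention of a simple moving average plus and minus a multiple of the rolling standard deviation; the 20-period window and the factor of 2 are assumptions, not values from the slides. The helper names are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def bollinger_bands(prices, window=20, num_std=2.0):
    """Return (middle, upper, lower) band lists aligned with `prices`.

    Entries are None until a full window of prices is available.
    """
    middle, upper, lower = [], [], []
    for i in range(len(prices)):
        if i + 1 < window:
            middle.append(None)
            upper.append(None)
            lower.append(None)
            continue
        recent = prices[i + 1 - window : i + 1]
        centre = mean(recent)      # simple moving average
        spread = pstdev(recent)    # population standard deviation of the window
        middle.append(centre)
        upper.append(centre + num_std * spread)
        lower.append(centre - num_std * spread)
    return middle, upper, lower


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

A price touching the lower list would be read as a relative low and a price touching the upper list as a relative high, matching the buy and sell reading above.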
What is ARIMA?
ARMA: Auto-Regressive Moving Average model for stationary data.
ARIMA: Auto-Regressive Integrated Moving Average model, a generalization of ARMA.
When the data show evidence of non-stationarity, an initial differencing step is applied, possibly more than once, to eliminate it. An ARMA model is then fitted to the differenced data.
  Auto-Regressive: y(t) = f(y(t-1)); the current value is regressed on its own lagged values.
  Moving Average: the regression error is a linear combination of current and previous error terms.
  Integrated: the data values have been replaced by the differences between consecutive values.
p = order of the auto-regressive model
d = order of differencing
q = order of the moving-average model
(p,d,q) = non-seasonal components
(P,D,Q)s = seasonal components
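Written out in standard textbook form, the three components combine as follows. The symbols are notation introduced here rather than taken from the slides: \phi_i are the auto-regressive coefficients, \theta_j the moving-average coefficients, c a constant, \varepsilon_t the error term, and B the backshift operator with B y_t = y_{t-1}.

```latex
\begin{aligned}
\text{ARMA}(p,q):\quad
  & y_t = c + \sum_{i=1}^{p} \phi_i\, y_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j\, \varepsilon_{t-j} \\
\text{ARIMA}(p,d,q):\quad
  & \Bigl(1 - \sum_{i=1}^{p} \phi_i B^i\Bigr)(1 - B)^d\, y_t
    = c + \Bigl(1 + \sum_{j=1}^{q} \theta_j B^j\Bigr)\varepsilon_t
\end{aligned}
```

With d = 1 the factor (1 - B) y_t is simply y_t - y_{t-1}, so the ARIMA model is an ARMA model fitted to the first differences.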
Akaike Information Criterion vs. Bayesian Information Criterion

The Akaike information criterion (AIC)
  An estimator of the relative quality of statistical models for a given set of data.
  AIC estimates the relative information lost by a given model: the less information a model loses, the higher the quality of that model.
  The model with the lowest AIC is preferred.
  k = number of estimated parameters
  L = maximum value of the likelihood function of the model
  AIC = 2k - 2 ln(L)

The Bayesian information criterion (BIC) or Schwarz criterion (also SBC, SBIC)
  It is independent of the prior.
  It can measure the efficiency of the parameterized model in terms of predicting the data.
  It penalizes the complexity of the model, where complexity refers to the number of parameters in the model.
  The model with the lowest BIC is preferred.
  It is based, in part, on the likelihood function.
  k = number of estimated parameters
  L = maximum value of the likelihood function of the model
  x = observed data
  n = the number of data points in x, the number of observations, or equivalently, the sample size
  BIC = ln(n) k - 2 ln(L)

Dickey-Fuller Test
  Tests the null hypothesis that a unit root is present in an autoregressive model, meaning the data is non-stationary.
  The alternative hypothesis is that the data is stationary.
  A smaller p-value is preferred.
  Rejecting the null hypothesis is the surprising outcome: a small p-value means the data would be surprising if a unit root were present.
  Non-surprising is equivalent to non-stationary; surprising is equivalent to stationary.
  Figure legend: green = recovery with a unit root; red = recovery without a unit root.

Quantile-Quantile Plot
  A 45-degree reference line is plotted. Two sets of data from the same population fall approximately along this reference line.
  The supplied data is plotted against a set of data from the normal distribution.
  Following the reference line means the supplied data follows the normal distribution closely.
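Both criteria can be computed directly from a model's maximized log-likelihood ln(L), the parameter count k, and the sample size n. The sketch below is a minimal illustration; the two candidate models and their numbers are hypothetical and chosen only to exercise the formulas.

```python
from math import log


def aic(num_params, log_likelihood):
    """AIC = 2k - 2 ln(L), where `log_likelihood` is ln(L)."""
    return 2 * num_params - 2 * log_likelihood


def bic(num_obs, num_params, log_likelihood):
    """BIC = ln(n) k - 2 ln(L)."""
    return log(num_obs) * num_params - 2 * log_likelihood


# Hypothetical candidates: the larger model fits slightly better
# (higher ln L) but uses five more parameters.
small = {"k": 3, "lnL": -820.0}
large = {"k": 8, "lnL": -817.0}
n = 120  # hypothetical sample size

for name, m in (("small", small), ("large", large)):
    print(name, aic(m["k"], m["lnL"]), round(bic(n, m["k"], m["lnL"]), 1))
```

In this made-up comparison the larger model gains 3 units of log-likelihood, which lowers the -2 ln(L) term by 6, but pays 10 more in the AIC penalty, so the smaller model has the lower AIC. BIC penalizes the extra parameters more heavily still, because ln(n) exceeds 2 once n is 8 or more.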
Stock Prediction with Auto ARIMA
Apple stock price prediction for 2016 to 2018 based on the historical data
auto_arima function from pyramid.arima
AIC = 1752
p = 2, d = 1, q = 2
Moving average? Cyclical white noise? Good enough?
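A search of this kind might look like the sketch below. The slide names only the function and the resulting order, so every keyword argument here is an assumption, as are the hypothetical wrapper name `fit_auto_arima` and the default horizon. The pyramid package was later renamed pmdarima, so the import tries the newer name first.

```python
def fit_auto_arima(prices, horizon=24):
    """Pick (p, d, q) by stepwise search and forecast `horizon` steps ahead.

    `prices` is a 1-D sequence of floats. The import is done lazily so the
    function can be defined even where the package is not installed.
    """
    try:
        from pmdarima import auto_arima       # current name of the package
    except ImportError:
        from pyramid.arima import auto_arima  # name used on the slide

    model = auto_arima(
        prices,
        seasonal=False,          # assumption: the slide lists only p, d, q
        stepwise=True,           # stepwise search instead of a full grid
        suppress_warnings=True,
        error_action="ignore",   # skip candidate orders that fail to fit
    )
    return {
        "order": model.order,    # selected (p, d, q)
        "aic": model.aic(),
        "forecast": model.predict(n_periods=horizon),
    }
```

The returned order and AIC are the quantities the slide reports; whether this call reproduces (2, 1, 2) and 1752 depends on the data and settings actually used, which the slide does not give.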
Exploration on the original data
Exploration on the logged data
Looking for Seasonality: Plot and Dickey-Fuller Test
Case A: Original data
Case B: Original data - shift-by-1 data
Case C: Original data - shift-by-12 data
Case D: Shift-by-1 - shift-by-12 data

          Lag   P-value   Test statistic   Compared with critical value (CV)
Case A    14    1         3.715            > 10% CV
Case B    10    2e-5      -5.5             < 1% CV
Case C    12    0.18      -2.3             ~> 10% CV
Case D    -     0.071     -2.7             < 10% CV
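The four cases can be built with a single lag-difference helper, sketched below on a synthetic monthly series (a linear trend plus a repeating 12-month pattern, made up for illustration). Case D is read here as the seasonal difference of the first-differenced series; the slide wording admits other readings, so treat that as an assumption.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both values exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Synthetic data: trend of 2 per month plus a pattern that repeats every 12 months.
pattern = [0, 3, 5, 4, 1, -2, -4, -5, -3, 0, 2, 1]
case_a = [2 * t + pattern[t % 12] for t in range(60)]

case_b = difference(case_a, lag=1)    # trend becomes a constant; seasonal pattern remains
case_c = difference(case_a, lag=12)   # seasonal pattern cancels; trend becomes a constant
case_d = difference(case_b, lag=12)   # first difference, then seasonal difference

print(case_c[:3], case_d[:3])
```

On this idealized series the shift-by-12 difference leaves only a constant and the combined difference leaves zeros. Real prices will not cancel so cleanly, which is why each case is then checked with the Dickey-Fuller test.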
Looking for Seasonality: Plot and Dickey-Fuller Test, logged data
Case A: Logged data
Case B: Logged data - shift-by-1 data
Case C: Logged data - shift-by-12 data
Case D: Shift-by-1 - shift-by-12 data

          Lag   P-value   Test statistic   Compared with critical value (CV)
Case A    -     0.978     0.309            > 10% CV
Case B    -     2e-30     -17.3            < 1% CV
Case C    12    0.017     -3.25            < 5% CV
Case D    -     0.012     -3.36            < 5% CV
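To make the "test statistic versus critical value" comparison concrete, the sketch below computes the basic Dickey-Fuller statistic by ordinary least squares. It is a simplification: the slides report a lag column, which suggests the augmented test with lagged difference terms, and that variant is not implemented here. The critical values are approximate large-sample figures for the regression with a constant, and the function name and test series are hypothetical.

```python
import random
from math import sqrt


def dickey_fuller_stat(series):
    """t-statistic of gamma in  dy(t) = alpha + gamma * y(t-1) + e(t).

    Basic Dickey-Fuller regression with a constant and no lagged
    difference terms. A unit root corresponds to gamma = 0.
    """
    x = series[:-1]                                          # y(t-1)
    dy = [b - a for a, b in zip(series[:-1], series[1:])]    # y(t) - y(t-1)
    n = len(x)
    mean_x = sum(x) / n
    mean_dy = sum(dy) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (di - mean_dy) for xi, di in zip(x, dy))
    gamma = sxy / sxx
    alpha = mean_dy - gamma * mean_x
    sse = sum((di - alpha - gamma * xi) ** 2 for xi, di in zip(x, dy))
    std_err = sqrt(sse / (n - 2) / sxx)
    return gamma / std_err


# Approximate large-sample critical values (constant, no trend).
CRITICAL_VALUES = {"1%": -3.43, "5%": -2.86, "10%": -2.57}

rng = random.Random(0)
stationary = [0.0]
for _ in range(300):
    # AR(1) with coefficient 0.2: clearly stationary by construction.
    stationary.append(0.2 * stationary[-1] + rng.gauss(0.0, 1.0))

stat = dickey_fuller_stat(stationary)
print(stat, stat < CRITICAL_VALUES["1%"])
```

A statistic below (more negative than) a critical value rejects the unit-root null at that level, which is how entries such as "< 1% CV" in the tables are read.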
ARIMA Models Parameter Summary

          (p,d,q)    (P,D,Q,s)    AIC    BIC    HQIC   RSS
Case 1    30,0,0     0,0,0,12     1641   1758   1688   516
Case 2    30,0,0     0,0,2,12     1664   1788   1714   517
Case 3    30,0,2     -            1662   1794   -      510
Case 11   30,1,3     0,1,3,12     1547   1683   1601   508
Case 12   0,0,10     8,4,0,12     1365   1428   1391   1027
Case 13   2,2,2      2,2,2,12     1628   1661   -      591
Case 14   0,0,5      25,2,0,12    753    810    776    1452

p = order of the auto-regressive model
d = order of differencing
q = order of the moving-average model
(p,d,q) = non-seasonal components
(P,D,Q)s = seasonal components
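One way a row of this table could be produced is sketched below. The slides do not name the fitting library or say how RSS was computed; this sketch assumes statsmodels' SARIMAX, which reports AIC, BIC, and the Hannan-Quinn criterion (HQIC), and takes RSS as the sum of squared differences between the data and the in-sample fitted values. Both function names are hypothetical.

```python
def residual_sum_of_squares(actual, fitted):
    """RSS = sum of squared differences between actual and fitted values."""
    return sum((a - f) ** 2 for a, f in zip(actual, fitted))


def fit_seasonal_arima(prices, order, seasonal_order):
    """Fit one (p,d,q)(P,D,Q,s) configuration and collect the summary columns.

    Assumes statsmodels is installed; it is imported lazily so the RSS
    helper above stays usable without it.
    """
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    result = SARIMAX(prices, order=order, seasonal_order=seasonal_order).fit(disp=False)
    return {
        "AIC": result.aic,
        "BIC": result.bic,
        "HQIC": result.hqic,
        "RSS": residual_sum_of_squares(prices, result.fittedvalues),
    }
```

Under these assumptions, Case 11 would correspond to `fit_seasonal_arima(prices, (30, 1, 3), (0, 1, 3, 12))`. Information criteria are only comparable between models fitted to the same data with the same differencing, so rows with different d or D should be compared with care.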
Case 1
Case 1
Dynamic forecast: prediction of the 2016 ~ 2018 stock price using historical data up to the end of 2015
One-step-ahead forecast: predicting the next step using true data
Future forecast for 2018 to 2010
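The difference between the two forecast types is easiest to see on a toy model. The sketch below uses a plain AR(1) rule, y(t) = phi * y(t-1), with a hypothetical coefficient and made-up observations; it stands in for the fitted ARIMA model only to show which inputs each forecast is allowed to see.

```python
def one_step_ahead(series, phi, start):
    """Forecast each point from `start` onward using the TRUE previous value."""
    return [phi * series[t - 1] for t in range(start, len(series))]


def dynamic_forecast(series, phi, start):
    """Forecast from `start` onward, feeding each forecast back in as the next input."""
    forecasts = []
    previous = series[start - 1]   # last true value the model is allowed to see
    for _ in range(start, len(series)):
        previous = phi * previous  # later steps depend on earlier forecasts only
        forecasts.append(previous)
    return forecasts


observed = [10.0, 9.0, 9.5, 8.0, 8.5, 7.0]  # made-up values
phi = 0.9                                    # hypothetical AR(1) coefficient

print(one_step_ahead(observed, phi, start=3))
print(dynamic_forecast(observed, phi, start=3))
```

The two forecasts agree at the first step and then diverge: the one-step-ahead version is corrected by each new observation, while the dynamic version accumulates its own errors, which is why dynamic forecasts over 2016 ~ 2018 are the harder test.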
Case 2
Case 3
Case 11
One-step-ahead forecast: predicting the next step using true data
Dynamic forecast: prediction using true data up to a certain point
Future forecast
Case 12
Case 13
Case 14 Best match with random noise
Case 14 Shows surprisingly good match with the actual data for 2 years
Summary
  Performed exploratory data analysis (EDA) on the Apple stock price.
  Demonstrated that the Apple stock price variation stays within the Bollinger Bands, which mark the relative high and low prices.
  Showed periodicity in the Apple stock price history.
  Demonstrated different time-series prediction methods using ARIMA and compared them with the actual data.
  Performed a time-series forecast of the Apple stock price using ARIMA.
To-do
  Understand the working principle of the ARIMA model: derive the governing equations and recognize their limitations.
  Apply the ARIMA model to a wider range of stock prices.
  Examine correlations with a different set of economic indicators, such as the employment rate, inflation rate, and trade deficit, and see whether PCA can be applied.
Backup
Model Selection Criteria
Akaike Information Criterion and Bayesian Information Criterion: the most commonly used criteria