Happiness and Stocks
Ali Javed, Tim Stevens
Department of Computer Science
STAT295– Introduction to Statistical Learning in R
Outline
- Introduction
- Background and Dataset
- Experimental Setup
- Evaluation
- Conclusion
- Applying statistical methods to understand market behavior has been a topic of research for decades.
- Markets depend on a vast number of variables, not all of which have been digitized.
- Sentiment analysis using data from social media websites and the internet is a recent topic of interest among researchers [https://www.marketpsych.com].
Data
From Sep 9, 2008 to Oct 10, 2017
- Hedonometer: daily metric of happiness computed from Twitter data.
  Range: 0-10; Mean: 6.02; Standard Deviation: 6.04
- S&P500 Index: Range: 676 - 2537
- NASDAQCOM: Range: 1268 - 6534
- Both S&P500 and NASDAQCOM show an overall increasing trend over the period.
Proposed Research Problem
- To what extent is there a relationship between "happiness" and stock market prices?
- Which features have the strongest relationships with the S&P500 index?
Features created for the project:
- Happiness Value
- S&P500 Value
- NASDAQCOM Value
- Lag variables
- Change variables
- Direction variables
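The slides do not define the lag, change, and direction variables, so the following is only a minimal sketch of one plausible construction, written in Python rather than R. It assumes the lag variable is the previous day's value, the change is the day-over-day difference, and the direction is 1 when the value rose and 0 otherwise. The function name `build_features` and the sample numbers are hypothetical.

```python
def build_features(values, lag=1):
    """Return per-day dicts with lag, change, and direction features.

    values: daily observations in chronological order.
    lag:    how many days back the lag variable looks (assumed to be 1).
    """
    rows = []
    for t in range(lag, len(values)):
        previous = values[t - lag]
        change = values[t] - previous
        rows.append({
            "value": values[t],
            "lag": previous,                      # value `lag` days earlier
            "change": change,                     # difference from the lagged value
            "direction": 1 if change > 0 else 0,  # 1 = up, 0 = flat or down
        })
    return rows


# Hypothetical index closes, for illustration only (not the project data).
closes = [100.0, 101.5, 101.0, 103.0]
features = build_features(closes)
for row in features:
    print(row)
```

The first `lag` days are dropped because they have no earlier value to compare against. The same function could be applied to the happiness series and to each index.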
Happiness Data
Day        Happiness Value  Change
Monday     6.014            -0.016
Tuesday    6.015             0.000
Wednesday  6.016             6.003
Thursday   6.021             0.006
Friday     6.039             0.022
SP500 and HAPPINESS
SP500 and HAPPINESS
Conclusion: no correlation between happiness and SP500.
- Low Pearson correlation coefficients (< 0.1) for every Happiness and S&P500 variable pair.
- Very poor model performance: 56.7% accuracy for logistic regression on SP500_direction~HAPPINESS with a 75/25 train/test split.
- Naïve guessing of SP500_direction yields 56% accuracy.
- QDA and LDA perform similarly.
- KNN performs worse: 53% accuracy (K = 3 with 10-fold CV).
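The two yardsticks used here, the Pearson correlation coefficient and the naïve-guessing baseline, can be sketched as follows. This is an illustrative Python sketch, not the project's R code. It assumes "naïve guessing" means always predicting the most common direction seen in the training data. The function names and all numbers are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority = max(set(train_labels), key=train_labels.count)
    hits = sum(1 for label in test_labels if label == majority)
    return hits / len(test_labels)


# Made-up numbers for illustration only (not the project data).
happiness = [6.01, 6.03, 5.99, 6.02, 6.00, 6.04]
sp500 = [2500.0, 2490.0, 2510.0, 2505.0, 2495.0, 2500.0]
r = pearson(happiness, sp500)

train_direction = [1, 1, 0, 1, 0, 1, 1, 0]
test_direction = [1, 0, 1, 1]
baseline = majority_baseline_accuracy(train_direction, test_direction)
print(round(r, 3), baseline)
```

A classifier is only informative to the extent that it beats this baseline, which is why 56.7% accuracy against a 56% baseline counts as very poor performance.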
SP500 and HAPPINESS
95% confidence interval of AUC = 0.48 - 0.59
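An AUC of 0.5 corresponds to random guessing, so an interval of 0.48 - 0.59 that contains 0.5 is consistent with a model that has no predictive power. The slides do not say how the interval was computed. The Python sketch below uses a percentile bootstrap as a stand-in, which may differ from the method actually used. The function names, data, and parameter defaults are assumptions for illustration.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


def bootstrap_auc_ci(labels, scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n = len(labels)
    estimates = []
    while len(estimates) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        sample_labels = [labels[i] for i in idx]
        if len(set(sample_labels)) < 2:
            continue  # AUC is undefined without both classes
        estimates.append(auc(sample_labels, [scores[i] for i in idx]))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up labels and predicted probabilities, for illustration only.
labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.5, 0.2, 0.8, 0.7, 0.55, 0.1]
point = auc(labels, scores)
low, high = bootstrap_auc_ci(labels, scores)
print(point, low, high)
```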
KNN
Accuracy estimated using 10-fold cross-validation (CFV with f = 10), repeated 15 times.
Highest accuracy: 53% at K = 3.
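The procedure of choosing K by repeated cross-validation can be sketched as below. This is a minimal Python illustration, assuming "CFV with f=10" means 10-fold cross-validation and a single numeric predictor. The function names are hypothetical and the data is synthetic, so the accuracies it prints say nothing about the project's results.

```python
import random
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to `query` (1-D feature)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


def repeated_cv_accuracy(x, y, k, folds=10, repeats=15, seed=0):
    """Mean accuracy of k-NN over `repeats` rounds of `folds`-fold cross-validation."""
    rng = random.Random(seed)
    n = len(x)
    accuracies = []
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)  # fresh random fold assignment each repeat
        for f in range(folds):
            test_idx = order[f::folds]
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            train_x = [x[i] for i in train_idx]
            train_y = [y[i] for i in train_idx]
            hits = sum(
                1 for i in test_idx
                if knn_predict(train_x, train_y, x[i], k) == y[i]
            )
            accuracies.append(hits / len(test_idx))
    return sum(accuracies) / len(accuracies)


# Synthetic, well-separated data for illustration only (not the project data).
data_rng = random.Random(1)
x = [data_rng.gauss(0.0, 1.0) for _ in range(50)] + [data_rng.gauss(6.0, 1.0) for _ in range(50)]
y = [0] * 50 + [1] * 50

# Compare a few odd values of K and keep the most accurate one.
results = {k: repeated_cv_accuracy(x, y, k) for k in (1, 3, 5, 7)}
best_k = max(results, key=results.get)
print(results, best_k)
```

Odd values of K avoid tied votes in a two-class problem. Repeating the cross-validation averages out the randomness of any single fold assignment.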
SP500 AND NASDAQCOM
SP500 AND NASDAQCOM
- Pearson correlation coefficient: 0.9949.
- SP500_direction~NASDAQCOM with logistic regression: 89.5% accuracy on test data.
- Similar results with LDA and QDA.
- KNN: 86.5% test accuracy with K = 35 and 10-fold CV.
- Not useful for prediction: both values become available at the same time.
- Both are weighted averages of stock prices.
SP500 AND NASDAQCOM
95% confidence interval of AUC = 0.943 - 0.974
KNN
Accuracy estimated using 10-fold cross-validation (CFV with f = 10), repeated 15 times.
Highest accuracy: 86.5% at K = 35.
Hosmer and Lemeshow
- SP500~HAPPINESS: chi-squared = 10.77, p-value = .21
- SP500~NASDAQCOM: chi-squared = 16.293, p-value = .03
- A p-value < .05 implies significant evidence of lack of fit. Higher chi-squared values are worse.
- Why? Our well-fitting model may flounder on more subtle changes in the data (e.g. SP500 increases by 1), which indicates a lack of fit.
- A p-value > .05 is not an indication of a good fit, only an indication that there is no significant evidence of a lack of fit.
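The link between the chi-squared statistic and the p-value can be made concrete with a short Python sketch. It assumes the test used 10 groups and therefore 10 - 2 = 8 degrees of freedom, which is a common default but is not stated in the slides. The function names are hypothetical. For even degrees of freedom the upper-tail probability has a simple closed form, so no statistics library is needed.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-squared variable; valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)^i / i!
        total += term
    return math.exp(-half) * total


def hosmer_lemeshow(groups):
    """Hosmer-Lemeshow statistic from (n, observed_events, mean_predicted_prob) groups."""
    stat = 0.0
    for n, observed, p_bar in groups:
        expected = n * p_bar
        stat += (observed - expected) ** 2 / (expected * (1.0 - p_bar))
    return stat


# Assumed: 10 groups, hence 10 - 2 = 8 degrees of freedom (not stated in the slides).
DF = 8
p_happiness = chi2_sf_even_df(10.77, DF)
p_nasdaq = chi2_sf_even_df(16.293, DF)
print(round(p_happiness, 3), round(p_nasdaq, 3))

# One made-up group: 10 observations, 7 events, mean predicted probability 0.5.
print(hosmer_lemeshow([(10, 7, 0.5)]))
```

Under that assumption the two p-values come out near 0.215 and 0.038, consistent with the reported .21 and .03. Each group contributes the squared gap between observed and expected events, scaled by its variance, so a larger statistic means predicted probabilities that track the observed frequencies less closely.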
For the Future
- Pick a dataset with a higher likelihood of correlation.
- Less feature creation.
- Diversify with more models, such as lasso or random forests.
Conclusion
- No correlation between happiness and stock data.
- The two stock indices are strongly correlated and can reliably predict each other within the same time period.
- Validating with multiple tests and multiple models is important.
- Beating the market is no easy task.