1 Statistical Analysis with Big Data. Dr. Fred Oswald, Rice University. CARMA webcast, November 6, 2015. University of South Florida, Tampa, FL

2 Why should I-O and business researchers be interested in big data? Witness the stories about "big data" in HR and personnel selection that make little to no mention of the DECADES of personnel assessment expertise in HR and I-O psychology…

3 Adding insult to injury…

4 Overview: Run LASSO regression and random forests, two predictive models that are relatively novel to organizational research and useful for big data (and for small data). Both demonstrate a philosophy: fit flexibly to improve prediction, but don't overfit. Four ideas/implications related to predictive models are then discussed.

5 Example 1: LASSO. Many R packages exist to perform 'big data' types of analyses, such as the LASSO (Least Absolute Shrinkage and Selection Operator). LASSO not only estimates regression weights as 'normal' (OLS) regression does; its penalty also shrinks many of those weights to exactly zero. Yeehaw!

6 LASSO. Check out the coefficients across the range of values of the "tuning parameter," lambda: you can see where LASSO regression consistently selects predictive variables and excludes others. Yeehaw!
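A minimal sketch of that coefficient trace in R, using the glmnet package (one of several packages that fit the LASSO); it assumes a numeric predictor matrix x and an outcome vector y are already in hand:

    # LASSO via glmnet: alpha = 1 requests the pure LASSO penalty.
    library(glmnet)
    fit <- glmnet(x, y, alpha = 1)

    # Coefficient trace across the range of lambda; variables whose paths
    # leave zero earliest are the ones LASSO selects most consistently.
    plot(fit, xvar = "lambda", label = TRUE)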

7 LASSO. Let's see what variables tend to predict the job-search behaviors of employed managers (Boudreau, 2001): 12 predictors of job search. [You would use actual data; I simulated 1,000 cases here.]
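Lacking the actual data, here is a hedged stand-in for that simulation: 1,000 cases on 12 predictors, of which only three (given illustrative names matching the next slide; all effect sizes invented) truly relate to job search:

    # Simulate 1,000 cases; only satisfaction, compensation, and gender
    # truly predict job search (the weights below are made up).
    set.seed(2015)
    n <- 1000
    x <- matrix(rnorm(n * 12), n, 12)
    colnames(x) <- c("satisfaction", "compensation", "gender",
                     paste0("other", 1:9))
    x[, "gender"] <- rbinom(n, 1, 0.5)   # 1 = female (illustrative coding)
    y <- -0.5 * x[, "satisfaction"] - 0.3 * x[, "compensation"] +
          0.2 * x[, "gender"] + rnorm(n)

    library(glmnet)
    fit <- glmnet(x, y, alpha = 1)
    coef(fit, s = 0.10)   # weights at one arbitrary lambda; the nine "other"
                          # predictors should sit at exactly zero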

8 [Figure: the trace of all LASSO solutions as lambda varies, running from the null model ("Nothin'") at maximum shrinkage out to OLS at no shrinkage. Considering all other predictors at a given value of lambda, three variables are consistently selected: job satisfaction (less satisfaction → more search), compensation (less $ → more search), and gender (female → more search).]

9 Least Angle Regression = LARS (the graph above is the trace of all LASSO solutions). I think of LARS as Tiptoe Regression™, as it is more cautious than stepwise regression…
Start with all coefficients = 0. The predictor with the highest validity is the first one entered. But now don't step, tiptoe: increase its regression weight from 0 toward its 'normal' (OLS) value until one of the other predictors correlates with the residual (y - ŷ) just as highly. Enter that predictor next. Now move the weights of those two predictors jointly toward their 'normal' regression solution until a third predictor correlates equally with the residual. Enter that one. Keep goin' until all predictors are entered. This method efficiently provides all LASSO solutions!
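The lars package implements this tiptoeing directly; a quick sketch, reusing the simulated x and y from above:

    # LARS: type = "lar" gives the pure least-angle path, while
    # type = "lasso" gives the LASSO modification of that path
    # (i.e., the trace of all LASSO solutions).
    library(lars)
    path <- lars(x, y, type = "lasso")
    plot(path)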

10 LASSO is swell, but also check out… elastic net! When several predictors are correlated, LASSO will tend to select one of them rather than the group. The elastic net will encourage selecting the group of predictors in this case: it encourages parsimony (like LASSO) yet also tries to select groups of related variables when they are predictive (like ridge regression). [Diagram relating the four approaches: OLS regression, ridge regression, LASSO, elastic net.]
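In glmnet, the elastic net is the same call with alpha set between the ridge (alpha = 0) and LASSO (alpha = 1) extremes; a sketch:

    # Elastic net: alpha mixes the ridge and LASSO penalties.
    enet <- glmnet(x, y, alpha = 0.5)   # an even blend; a common starting point
    plot(enet, xvar = "lambda", label = TRUE)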

11 In general: Cross-validation
• As mentioned, other weights will work better across other samples (e.g., unit weights, but… something better?)
• How to find out?
  o Train the model on a given set of data (develop the weights).
  o Test the model on a fresh set of data (apply the weights to new data; how good are the predictions?)
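A minimal holdout sketch of that logic (a single random split rather than k folds; the 80/20 split and the use of plain OLS are arbitrary choices for illustration):

    # Develop weights on training data, then check prediction in fresh data.
    d <- data.frame(y = y, x)
    train <- sample(nrow(d), size = 0.8 * nrow(d))
    m <- lm(y ~ ., data = d[train, ])
    pred <- predict(m, newdata = d[-train, ])
    cor(pred, d$y[-train])   # cross-validity: predicted vs. actual in the test data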

12 LASSO: k-fold cross-validation
o Train the model on a given set of data (develop the weights).
o Test the model on a fresh set of data (apply the weights to new data; how good are the predictions?)
10-fold cross-validation:
o Divide the data into 10 equal parts or "folds" (randomly, or stratified random).
o Develop the model on 9 of the folds; test it on the remaining fold.
o Do this for each fold in turn, so that every case receives a held-out prediction from a model that never saw it.
o Average the prediction error across the 10 folds.
o [demonstrated with LASSO regression in the sketch below]
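glmnet automates exactly this via cv.glmnet (10 folds is its default); a sketch:

    # 10-fold cross-validation across the whole lambda path.
    cvfit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
    plot(cvfit)                # CV error vs. lambda, as in the figure on the next slide
    cvfit$lambda.min           # the "optimal" lambda (lowest CV error)
    cvfit$lambda.1se           # a "simpler" model within one SE of that minimum
    coef(cvfit, s = "lambda.1se")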

13 LASSO k-fold cross-validation. [Figure: cross-validated error plotted against the tuning parameter, with the corresponding # of predictors also shown; annotations flag the optimal solution and a simpler one.]

14 Example 2: Random Forests. First look at trees, used to classify data. [Figure: an example tree predicting task performance scores. Nodes split on cognitive ability, conscientiousness, biodata, teamwork, and openness, with branches for high (> X) vs. low (≤ X); the leaves hold predicted scores (3.2, 3.6, 3.8, 4.1, 4.3, 5.6).]
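A single tree like this can be grown with the rpart package; a sketch on the simulated data from earlier (the slide's predictors, cognitive ability and so on, would replace these):

    # Grow one regression tree: each node splits cases into high (> X)
    # vs. low (<= X) on one variable; the leaves hold predicted scores.
    library(rpart)
    tree <- rpart(y ~ ., data = d, method = "anova")
    plot(tree); text(tree)   # quick base-graphics view of splits and leaves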

15 Example 2: Random Forests. Draw a large number of bootstrapped samples (e.g., k = 500 samples, each drawn with replacement and containing roughly 2/3 of the unique cases). For each sample, build a tree similar to the one just illustrated… but with a catch: at each node, only consider a random subset of the predictors as candidates to split that node (the square root of the number of predictors is a common default). This yields diverse trees: different variables at each node, different cutpoints for each variable.
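The randomForest package follows this recipe; a sketch (one caveat: the square-root default for mtry applies to classification, while for regression the package defaults to p/3, so it is set explicitly here to match the slide):

    library(randomForest)
    p  <- ncol(x)
    rf <- randomForest(y ~ ., data = d,
                       ntree = 500,              # number of bootstrapped trees
                       mtry  = floor(sqrt(p)))   # predictors tried at each split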

16 Example 2 (cont'd): Random Forests. For each tree, look at the "out of bag" (oob) data, i.e., the data that did NOT participate in building that tree. Take those data and run them down the tree to get their predicted scores. Then, for each case, average the predicted scores across the trees that held it out:

case   ŷ Tree 1    ŷ Tree 2    …    ŷ Tree k    average predicted ŷ
1      (in tree)   3.4         …    2.1         3.554
2      4.5         (in tree)   …    …           4.312
…      …           …           …    …           …
N      2.6         (in tree)   …    2.4         2.561
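randomForest computes these out-of-bag averages automatically; a sketch:

    # rf$predicted holds, for each case, the average prediction from only
    # those trees whose bootstrap sample omitted that case (its oob trees).
    head(rf$predicted)
    mean((rf$predicted - d$y)^2)   # oob estimate of mean squared prediction error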

17 Some thoughts while under the hood… 1. Thinking about the increasing amount of data companies have on hand: Some of those data are directly relevant to selection (lots of online applications, screening tests). Other data might be relatively indirect, but an argument can be made for using them in selection (resume text mining). Still other data are indirect and difficult to justify even if predictive (e.g., time to complete an application online). …What do we do in this situation? If Big Data only captures the 3Vs on an ever-expanding hard drive, it is useless. Taylor (2013, HBR Blog): "We can amass all the data in the world, but if it doesn't help to save a life, allocate resources better, fund the organization, or avoid a crisis, what good is it?"

18 Some thoughts while under the hood… 2. Useful 'signals' in data discovered through predictive modeling could be amplified by developing measures that collect more such data (given enough development time, testing time, $…) (Fayyad et al., 1996). (knowledge → new measures → new knowledge)

19 Some thoughts while under the hood… 3. So if predictions become more accurate and robust than ever… will we understand them (theorize about them, legally defend them) any better? (Related idea: why do we have reliability, why not just validity?) Big Data analytics is a form of engineering. Generally, our substantive research focuses on correlations, mean differences, etc., at the factor or scale level, not at the single-item level as big data might. History tells us that item-level analyses can be hard to interpret (e.g., DIF). Interpretable surprises are hard to find.

20 Some thoughts while under the hood… 4. Big Data analytics provides reasons/opportunities to collaborate, if there is a culture for that. [Diagram: Selection Purpose → Available Data → Statistical Modeling → Findings/Discoveries → Future Selection → Future Outcomes, supported by HR Assessment + Analytics + IT + Management + Teams/Employees + …, with ties to other org functions (perhaps also served by Big Data).]

21 Thank you! Fred Oswald foswald@rice.edu

