Random Forests
Feb. 2016, Roger Bohn, Big Data Analytics
Harold Colson on good library data catalogs
- Google Scholar
- Web of Science
- Business Source Complete
- INSPEC
- ACM Digital Library
- IEEE Xplore
- PubMed
See page 2
Random Forests (DMRattle+R)
- Build many decision trees (e.g., 500). For each tree:
  - Select a random subset of the training set (N).
  - Choose a different subset of variables at each node of the decision tree (m << M).
  - Build the tree without pruning (i.e., overfit).
- Classify a new entity using every decision tree:
  - Each tree “votes” for the entity.
  - The decision with the largest number of votes wins!
  - The proportion of votes is the resulting score.
  - The outcome is a pseudo probability: 0 ≤ prob ≤ 1.
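The voting step can be sketched in a few lines of base R. This is only an illustration: the vote matrix below is invented, not produced by a fitted forest, and the names `votes`, `score` and `decision` are hypothetical.

```r
# Hypothetical votes from 5 trees (rows) for 3 new entities (columns).
votes <- matrix(
  c("Yes", "No",  "No",
    "Yes", "No",  "Yes",
    "No",  "No",  "Yes",
    "Yes", "No",  "No",
    "Yes", "Yes", "No"),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("tree", 1:5), paste0("entity", 1:3))
)

# Score: proportion of trees voting "Yes" (a pseudo probability in [0, 1]).
score <- colMeans(votes == "Yes")

# Decision: the class with the most votes (5 trees, so no ties).
decision <- ifelse(score > 0.5, "Yes", "No")

print(score)
print(decision)
```

With an odd number of trees and two classes there can be no tie; with more classes or an even number of trees a tie-breaking rule would be needed.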
RF on weather data
“Model” is 100s of small trees
- Each tree is quick to solve, so the forest is computationally tractable.
- Example model from RF:
    ## Tree 1 Rule 1 Node 30 Decision No
    ##
    ## 1: Evaporation <= 9
    ## 2: Humidity3pm <= 71
    ## 3: Cloud3pm <= 2.5
    ## 4: WindDir9am IN ("NNE")
    ## 5: Sunshine <=
    ## 6: Temp3pm <=
- The final decision (yes/no, or level) is reached just like in a single tree.
Error rates.
Properties of RFs
- Often works better than other methods.
- Runs efficiently on large data sets.
- Can handle hundreds of input variables.
- Gives estimates of variable importance.
- Results are easy to use, but too complex to summarize (“black box”).
- Cross-validation is built in:
  - Use a random set of observations for each tree (drawn with replacement).
  - The omitted observations are the validation set for that tree.
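A minimal base R sketch of the built-in validation idea, assuming a made-up training set of `n = 1000` rows and 500 trees. It only draws the bags and the omitted rows; no trees are fitted, and all names are hypothetical.

```r
set.seed(42)          # arbitrary seed so the sketch is reproducible
n <- 1000             # hypothetical number of training observations
ntree <- 500

# For each tree, draw n row indices with replacement (the "bag").
bags <- lapply(seq_len(ntree), function(i) sample.int(n, n, replace = TRUE))

# Rows never drawn for a tree are omitted from it and can validate that tree.
oob <- lapply(bags, function(b) setdiff(seq_len(n), b))

# Average share of observations left out per tree.
oob_share <- mean(lengths(oob)) / n

# Theoretical share: chance that a given row is missed in n draws.
expected <- (1 - 1 / n)^n   # approaches exp(-1) as n grows

print(c(simulated = oob_share, expected = expected))
```

Each observation is left out of roughly a third of the trees, so every observation gets validated by many trees that never saw it.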
R code
randomForest is one RF program. There are others.

    library(rattle)        # provides the weather data set (via rattle.data in recent versions)
    library(rpart)         # single decision tree
    library(randomForest)  # random forest

    # train: row indices of the training set, assumed to be defined already
    ds   <- weather[train, -c(1:2, 23)]   # training rows; drop columns 1, 2 and 23
    form <- RainTomorrow ~ .
    m.rp <- rpart(form, data=ds)          # single tree, for comparison
    m.rf <- randomForest(form, data=ds, na.action=na.roughfix, importance=TRUE)

Usage of randomForest(), with its defaults:

    randomForest(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500,
                 mtry=if (!is.null(y) && !is.factor(y))
                          max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
                 replace=TRUE, classwt=NULL, cutoff, strata,
                 sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
                 nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
                 maxnodes = NULL,
                 importance=FALSE, localImp=FALSE, nPerm=1,
                 proximity, oob.prox=proximity,
                 norm.votes=TRUE, do.trace=FALSE,
                 keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE,
                 keep.inbag=FALSE, ...)
Mechanics of RFs
- Each model uses a random bag of observations, roughly 70/30 in-bag versus left out (closer to 63/37 when n observations are drawn with replacement).
- Each time a split in a tree is considered, a random selection of m predictors is chosen as candidates from the full set of p predictors.
- The split chooses one of those m predictors, just like a single tree.
- A fresh selection of m predictors is taken at each split.
- Typically we choose m ≈ √p: the number of predictors considered at each split is approximately the square root of the total number of predictors.
- In randomForest the default mtry is floor(sqrt(ncol(x))) for classification and max(floor(ncol(x)/3), 1) for regression.
- If the tree is deep, most of the p variables get considered at least once.
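A small base R sketch of the per-split sampling, assuming a hypothetical p = 20 predictors and a tree with k = 15 splits. The last step treats the draws at different splits as independent, which is an idealization used here only to show why deep trees see most variables.

```r
p <- 20                            # hypothetical number of predictors
m_class <- floor(sqrt(p))          # classification default: 4
m_reg   <- max(floor(p / 3), 1)    # regression default: 6

set.seed(1)                        # arbitrary seed for reproducibility
# Candidates for one split: m predictors drawn without replacement from p.
candidates <- sample.int(p, m_class)

# Chance that a given predictor is a candidate at least once in k splits,
# treating the draws at different splits as independent.
k <- 15                            # hypothetical number of splits in one tree
p_seen <- 1 - (1 - m_class / p)^k

print(candidates)
print(round(p_seen, 4))
```

At each split a given predictor has an m/p chance of being a candidate, so the chance of never being considered shrinks geometrically with the number of splits.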
Mechanics: combining trees
- Grow 500 trees, get 500 models. Check that this is enough: with many variables you may need more trees.
- Final prediction or classification is based on voting.
  - Usually use unweighted voting: all trees count equally.
  - Can weight the votes, e.g. the most successful trees get the highest weights.
- For classification: the majority of trees determines the classification.
- For prediction problems (continuous outcomes): the average prediction of all the trees becomes the RF’s prediction.
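The combining rules can be compared on a toy case in base R. Everything below is invented for illustration: five hypothetical trees, made-up weights standing in for each tree's validation accuracy, and made-up numeric predictions.

```r
# Hypothetical class votes from 5 trees for one new entity.
votes   <- c("Yes", "No", "No", "Yes", "No")
# Hypothetical tree weights, e.g. each tree's validation accuracy.
weights <- c(0.9, 0.6, 0.55, 0.85, 0.5)

# Unweighted voting: count the votes per class.
unweighted <- table(votes)
class_unweighted <- names(which.max(unweighted))

# Weighted voting: sum the weights of the trees voting for each class.
weighted <- tapply(weights, votes, sum)
class_weighted <- names(which.max(weighted))

# Continuous outcome: the forest's prediction is the mean of the trees'.
tree_preds <- c(12.1, 9.8, 11.4, 10.6, 11.1)   # hypothetical predictions
rf_pred <- mean(tree_preds)

print(c(unweighted = class_unweighted, weighted = class_weighted))
print(rf_pred)
```

In this toy case the two voting rules disagree: three of five trees vote "No", but the two "Yes" trees carry more total weight.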
Case study: Comparing methods
From: Matt Taddy, Chicago Booth School, faculty.chicagobooth.edu/matt.taddy/teaching
Single tree result
Other concepts using trees
Generalize: Groups of different models!
- Many models are better than any 1 model.
- Each model is better at classifying some situations.
- “Boosting” algorithms
Comparing algorithms
Columns: Single tree, Random forest, Logistic / regression, LASSO
- Nonlinear relationships? Good; must pre-guess interactions; same
- Explain to audience? Good; good (most audiences)
- Selecting variables (large p)
- Variable importance
- Handle continuous outcomes (predict)
- Handle discrete outcomes (classify)
- Number of OTSUs
Comparing algorithms
Property | Single tree | Random forest | Logistic / regression | LASSO
Nonlinear relationships? | Good | Good | Must pre-guess interactions | Same
Explain to audience? | Very good | Poor | Very good if trained | Medium
Selecting variables (large p) | Decent | Good | Poor | Very good
Variable importance | Weak | Relative importance | Absolute importance | Same
Handle continuous outcomes (predict) | Yes | Yes | Yes | Yes
Handle discrete outcomes (classify) | Yes | Yes | Yes | Yes
Number of OTSUs | Who are we kidding? All have plenty of OTSUs. Hence the importance of validation, then test.