Random Forests Ujjwol Subedi
Introduction
What is a random tree?
◦ It is a tree drawn at random from a set of possible trees, with K randomly chosen features considered at each node.
◦ "At random" means that every tree in the set has an equal chance of being sampled, so the trees follow a uniform distribution.
◦ Random trees can be generated efficiently, and combining large sets of them generally leads to accurate models.
Decision trees
Decision trees are predictive models that use a set of binary rules to calculate a target value.
There are two types of decision trees:
◦ Classification trees predict a categorical target (a class label).
◦ Regression trees predict a continuous target (a numeric value).
Here is a simple example of a decision tree.
Definition
Random forests were first developed by Leo Breiman. A random forest is a group of unpruned classification or regression trees, each grown on a random sample of the training data. It is a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of overcoming the over-fitting problem of individual decision trees. In other words, random forests are an ensemble learning method for classification and regression that constructs many decision trees at training time and outputs the class that is the mode of the classes predicted by the individual trees (classification) or the mean of their predictions (regression).
Random forests do not over-fit as more trees are added, so you can run as many trees as you want. They are fast: on a data set with 50,000 cases and 100 variables, 100 trees were produced in 11 minutes on an 800 MHz machine. For large data sets, the major memory requirement is the storage of the data itself plus three integer arrays with the same dimensions as the data. If proximities are calculated, storage requirements grow as the number of cases times the number of trees.
How does a random forest work?
Each tree is grown as follows:
1. Random record selection: Cases are drawn at random with replacement from the original data, and this bootstrap sample is the training set for growing the tree. Each tree therefore sees roughly 2/3 of the distinct training cases.
2. Random variable selection: Some predictor variables, say m, are selected at random out of all the predictor variables, and the best split on these m is used to split the node.
3. Out-of-bag error: For each tree, the leftover data (about 1/3 of the cases) is used to calculate the misclassification rate, called the out-of-bag (OOB) error rate. The errors from all the trees are aggregated to determine the overall OOB error rate for the classification.
4. Voting: Each tree gives a classification, and we say that the tree “votes” for that class. The forest chooses the classification having the most votes. For example, if 500 trees are grown and 400 of them predict that a particular pixel is forest while 100 predict that it is grass, the predicted output for that pixel is forest.
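The voting step can be illustrated with a minimal sketch. The function name `forest_vote` is hypothetical, and the list of votes simply reproduces the 400 versus 100 pixel example from the text.

```python
from collections import Counter

def forest_vote(tree_predictions):
    """Return the class with the most votes, plus the vote counts."""
    counts = Counter(tree_predictions)
    winner, _ = counts.most_common(1)[0]
    return winner, counts

# 500 trees: 400 vote "forest", 100 vote "grass" (the example from the text).
votes = ["forest"] * 400 + ["grass"] * 100
label, counts = forest_vote(votes)
print(label, dict(counts))
```

In a real forest each entry of `votes` would come from one trained tree's prediction for the same pixel.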
Algorithm
Let the number of training cases be N and the number of variables in the classifier be M.
◦ Choose m, the number of input variables used to determine the decision at a node of the tree; m should be much less than M (m << M).
◦ Choose the training set for this tree by sampling N times with replacement from all N available training cases (a bootstrap sample).
◦ Use the remaining cases to estimate the error of the tree by predicting their classes.
◦ For each node of the tree, randomly choose m variables on which to base the decision at that node, and calculate the best split based on these m variables in the training set.
◦ Each tree is fully grown and not pruned.
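The two sources of randomness in the algorithm (the bootstrap sample per tree and the m candidate variables per node) can be sketched as follows. The helper names and the values N = 1000, M = 16, m = 4 are assumptions chosen for illustration; growing the tree itself is left out.

```python
import random

def bootstrap_sample(n_cases, rng):
    """Draw n_cases indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_cases) if i not in chosen]
    return in_bag, out_of_bag

def candidate_variables(n_variables, m, rng):
    """Pick m distinct variable indices to consider at one node."""
    return rng.sample(range(n_variables), m)

rng = random.Random(0)          # fixed seed so the sketch is repeatable
N, M, m = 1000, 16, 4           # illustrative sizes, not from the slides
in_bag, oob = bootstrap_sample(N, rng)
oob_fraction = len(oob) / N     # cases left over for the error estimate
node_vars = candidate_variables(M, m, rng)
print(f"out-of-bag fraction: {oob_fraction:.2f}, variables at this node: {node_vars}")
```

For large N the out-of-bag fraction approaches (1 - 1/N)^N, which is about 0.37, in line with the "roughly one third left over" rule of thumb.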
Pros and Cons
The advantages of random forests are:
◦ It is one of the most accurate learning algorithms available. For many data sets, it produces a highly accurate classifier.
◦ It runs efficiently on large data sets.
◦ It can handle thousands of input variables without variable deletion.
◦ It gives estimates of which variables are important in the classification.
◦ It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
◦ It computes proximities between pairs of cases that can be used in clustering and locating outliers.
Pros and Cons contd.
The disadvantages are:
◦ Random forests have been observed to over-fit on some data sets with noisy classification or regression tasks.
◦ For data that include categorical variables with different numbers of levels, random forests are biased in favor of the attributes with more levels.
Parameters
When running random forests, a number of parameters need to be specified. The most common are:
◦ The input training data, including the predictor variables.
◦ The number of trees that should be built.
◦ The number of predictor variables considered when creating the binary rule for each split.
◦ Parameters that control the calculation of error and variable-importance information.
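As a rough sketch, these parameters can be grouped into a small settings object. The class and field names below are hypothetical. The defaults for the number of variables per split (the square root of M for classification, M/3 for regression) and the 500 trees are commonly used starting values, not values taken from these slides.

```python
import math
from dataclasses import dataclass

def default_m(n_variables, task):
    """Common starting value for the number of variables tried at each split."""
    if task == "classification":
        return max(1, math.isqrt(n_variables))   # floor(sqrt(M))
    if task == "regression":
        return max(1, n_variables // 3)          # floor(M / 3), at least 1
    raise ValueError(f"unknown task: {task!r}")

@dataclass
class ForestParams:
    n_trees: int = 500                # number of trees to build
    m: int = 1                        # variables considered per split
    compute_oob_error: bool = True    # report the out-of-bag error estimate
    compute_importance: bool = False  # report variable importance measures

params = ForestParams(m=default_m(100, "classification"))
print(params)
```

In practice m is usually tuned, for example by comparing the out-of-bag error for a few values around the default.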
Terminologies related to the random forest algorithm
Bagging (Bootstrap Aggregating)
◦ Bagging generates B new training data sets, each drawn from the original observations by sampling with replacement. B models are then fitted, one per bootstrap sample, and combined by averaging their outputs (for regression) or by voting (for classification).
◦ The training algorithm for random forests applies this general technique to tree learners. Given a training set X = x_1, ..., x_n with responses Y = y_1, ..., y_n, bagging repeatedly (B times) selects a random sample with replacement from the training set and fits a tree to each sample. For b = 1, ..., B:
1. Sample, with replacement, n training examples from X, Y; call these X_b, Y_b.
2. Train a classification or regression tree f_b on X_b, Y_b.
◦ After training, the prediction for an unseen sample x' is made by averaging the predictions of all the individual regression trees on x',
$$\hat{f}(x') = \frac{1}{B} \sum_{b=1}^{B} f_b(x'),$$
or by taking the majority vote in the case of classification trees.
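The bagging procedure for regression can be sketched with a deliberately simple base learner. The one-split "stump" below stands in for a full regression tree and is an assumption made to keep the example short; the function names and the toy step-function data are likewise illustrative.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree on 1-D data; return a predict function."""
    best = None
    for t in sorted(set(xs))[1:]:            # thresholds leaving both sides non-empty
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:                         # all x identical: predict the mean
        overall = mean(ys)
        return lambda x: overall
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm

def bagged_predictor(xs, ys, n_models, rng):
    """Fit n_models stumps on bootstrap samples and average their predictions."""
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]   # sample n cases with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(f(x) for f in models)

# Toy data: a step from 0 to 1 at x = 10.
xs = list(range(20))
ys = [0.0 if x < 10 else 1.0 for x in xs]
predict = bagged_predictor(xs, ys, n_models=25, rng=random.Random(0))
print(predict(2), predict(17))
```

For classification the final line of `bagged_predictor` would take the most common prediction instead of the mean.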
Terminologies contd.
Out-of-bag error rate
◦ As the forest is built on the training data, each tree is tested on the roughly 1/3 of the samples not used in building that tree. This is the out-of-bag error estimate, an internal error estimate of a random forest.
Bootstrap sample
◦ A sample drawn at random, with replacement, from the training data.
Proximities
◦ These are one of the most useful tools in random forests. The proximities form an N x N matrix. After a tree is grown, put all of the data, both training and OOB, down the tree. If cases k and n are in the same terminal node, increase their proximity by one. At the end, normalize the proximities by dividing by the number of trees.
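The proximity calculation can be sketched as below. It assumes the terminal node reached by every case in every tree is already known; the input layout `leaf_ids_per_tree[t][i]` and the tiny three-tree example are made up for illustration.

```python
def proximity_matrix(leaf_ids_per_tree):
    """leaf_ids_per_tree[t][i] is the terminal node reached by case i in tree t."""
    n_trees = len(leaf_ids_per_tree)
    n_cases = len(leaf_ids_per_tree[0])
    prox = [[0.0] * n_cases for _ in range(n_cases)]
    for leaves in leaf_ids_per_tree:
        for k in range(n_cases):
            for n in range(n_cases):
                if leaves[k] == leaves[n]:    # same terminal node in this tree
                    prox[k][n] += 1.0
    # Normalize by the number of trees so every entry lies in [0, 1].
    return [[v / n_trees for v in row] for row in prox]

# Three trees, four cases; values are terminal-node labels within each tree.
leaf_ids = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
]
prox = proximity_matrix(leaf_ids)
for row in prox:
    print([round(v, 2) for v in row])
```

The nested loops make the N x N cost explicit, which is why storing proximities becomes the dominant memory cost on large data sets.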
Missing values
Missing data imputation
Fast way: replace the missing values of a given variable with the median of its non-missing values (or the most frequent value, if the variable is categorical).
Better way (using proximities):
1. Start with the fast way.
2. Get the proximities.
3. Replace the missing values in case i by a weighted average of the non-missing values, with weights proportional to the proximity between case i and the cases with the non-missing values.
Repeat steps 2 and 3 a few times (5 or 6).
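Both imputation routes can be sketched for a single variable. Missing entries are represented as `None`, the function names are hypothetical, and the proximity matrix is supplied by hand here; in the full procedure it would be recomputed from a freshly grown forest on each repetition.

```python
from collections import Counter
from statistics import median

def fast_fill(values, categorical=False):
    """Replace None by the median (or the most frequent value) of the known values."""
    known = [v for v in values if v is not None]
    fill = Counter(known).most_common(1)[0][0] if categorical else median(known)
    return [fill if v is None else v for v in values]

def proximity_fill(values, prox):
    """Replace each None by a proximity-weighted average of the known values."""
    known = [j for j, v in enumerate(values) if v is not None]
    filled = list(values)
    for i, v in enumerate(values):
        if v is None:
            weights = [prox[i][j] for j in known]
            total = sum(weights)
            if total > 0:
                filled[i] = sum(w * values[j] for w, j in zip(weights, known)) / total
            else:
                # No proximity to any known case: fall back to the fast way.
                filled[i] = median(values[j] for j in known)
    return filled

values = [1.0, None, 3.0, 5.0]
prox = [                       # hand-made proximities, for illustration only
    [1.0, 0.5, 0.2, 0.0],
    [0.5, 1.0, 0.5, 0.0],
    [0.2, 0.5, 1.0, 0.3],
    [0.0, 0.0, 0.3, 1.0],
]
print(fast_fill(values))             # median of the known values fills the gap
print(proximity_fill(values, prox))  # nearby cases dominate the weighted average
```

For a categorical variable the weighted average would be replaced by the most frequent value, with frequencies weighted by proximity.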
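Variable importance
A random forest computes two measures of variable importance: a rough-and-ready measure based on the decrease in node impurity (Gini for classification), and a measure based on permuting the values of each variable and recording how much the error increases.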
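The ingredients of the two measures can be sketched as follows: the Gini impurity of a set of labels, and the increase in error after shuffling one variable. The function names, the toy `predict` rule, and the data are assumptions; a real forest would apply the permutation to each tree's out-of-bag cases.

```python
import random
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class proportions; 0 means a pure node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def error_rate(predict, rows, labels):
    """Fraction of rows whose prediction differs from the true label."""
    return sum(1 for r, y in zip(rows, labels) if predict(r) != y) / len(rows)

def permutation_importance(predict, rows, labels, var, rng):
    """Increase in error after shuffling the values of variable `var`."""
    baseline = error_rate(predict, rows, labels)
    column = [r[var] for r in rows]
    rng.shuffle(column)
    permuted = [r[:var] + [c] + r[var + 1:] for r, c in zip(rows, column)]
    return error_rate(predict, permuted, labels) - baseline

# Toy model that only looks at variable 0, so variable 1 should not matter.
predict = lambda row: int(row[0] > 0.5)
rows = [[i / 10, (i * 7 % 10) / 10] for i in range(10)]
labels = [predict(r) for r in rows]
rng = random.Random(0)
imp0 = permutation_importance(predict, rows, labels, 0, rng)
imp1 = permutation_importance(predict, rows, labels, 1, rng)
print(gini_impurity(["a", "a", "b", "b"]), imp0, imp1)
```

Shuffling a variable the model ignores leaves the error unchanged, so its permutation importance is zero.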
Example
This tree advises us, based on weather conditions, whether to play ball.
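A tree of this kind is just a set of nested rules. The sketch below is a hypothetical stand-in, since the slide's figure is not reproduced here: the variables (outlook, humidity, wind) and the humidity threshold of 75 are assumptions modeled on the familiar weather toy data, not values read from the original tree.

```python
def play_ball(outlook, humidity, windy):
    """Hypothetical weather tree: each `if` is one binary rule at a node."""
    if outlook == "overcast":
        return True                 # overcast days: always play
    if outlook == "sunny":
        return humidity <= 75       # assumed threshold, for illustration only
    return not windy                # remaining case (rainy): play unless windy

print(play_ball("sunny", 85, False))    # sunny and humid
print(play_ball("overcast", 90, True))  # overcast
print(play_ball("rainy", 70, True))     # rainy and windy
```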
Example contd.
The random forest takes this notion to the next level by combining decision trees with the notion of an ensemble.
Results and discussion
Here the classification results of J48 and the random forest are compared.
Results and discussion contd.
The table shows the precision, recall, and F-measure of the random forest and J48 on the 20 data sets.
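For reference, the three metrics can be computed as in the sketch below. The function name and the six toy labels are made up for illustration and are unrelated to the 20 data sets in the comparison.

```python
def precision_recall_f(y_true, y_pred, positive):
    """Precision, recall, and F-measure for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure: harmonic mean of precision and recall.
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

y_true = ["yes", "yes", "yes", "no", "no", "no"]
y_pred = ["yes", "yes", "no", "yes", "no", "no"]
precision, recall, f_measure = precision_recall_f(y_true, y_pred, positive="yes")
print(precision, recall, f_measure)
```

With two true positives, one false positive, and one false negative, all three values come out to 2/3.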