Identifying Feature Relevance Using a Random Forest
Jeremy Rogers & Steve Gunn
Overview
- What is a Random Forest?
- Why perform relevance identification?
- Estimating feature importance with a Random Forest
- Node complexity compensation
- Employing feature relevance
- Extension to feature selection
Random Forest
- Combination of base learners using bagging
- Uses CART-based decision trees
Random Forest (cont...)
- Optimises each split using information gain
- Selects the feature for each split at random
- The implicit feature selection of CART is therefore removed
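To make the two preceding slides concrete, here is a minimal sketch of such a forest for binary classification, assuming X and y are NumPy arrays: each tree is grown on a bootstrap sample (bagging) and every split uses one randomly chosen feature, with the threshold optimised by information gain. The function names (build_forest, build_tree, info_gain) are invented for the illustration; this is not the authors' implementation.

```python
import numpy as np

def entropy(y):
    # Shannon entropy of a binary (0/1) label vector.
    p = y.mean()
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def info_gain(y, mask):
    # Information gain of splitting y into y[mask] and y[~mask].
    n, n_l = len(y), int(mask.sum())
    if n_l in (0, n):
        return 0.0
    return entropy(y) - (n_l / n) * entropy(y[mask]) - ((n - n_l) / n) * entropy(y[~mask])

def build_tree(X, y, rng, min_size=2):
    if len(y) < min_size or entropy(y) == 0.0:
        return {"leaf": True, "label": int(y.mean() >= 0.5)}
    f = int(rng.integers(X.shape[1]))          # one feature chosen at random
    best_gain, best_t = 0.0, None
    for t in np.unique(X[:, f])[:-1]:          # candidate thresholds on that feature
        g = info_gain(y, X[:, f] <= t)
        if g > best_gain:
            best_gain, best_t = g, t
    if best_t is None:                         # the random feature gave no useful split
        return {"leaf": True, "label": int(y.mean() >= 0.5)}
    mask = X[:, f] <= best_t
    return {"leaf": False, "feature": f, "threshold": best_t, "gain": best_gain,
            "left": build_tree(X[mask], y[mask], rng, min_size),
            "right": build_tree(X[~mask], y[~mask], rng, min_size)}

def build_forest(X, y, n_trees=100, seed=0):
    # Bagging: each tree sees a bootstrap sample of the training data.
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(len(y), size=len(y))
        forest.append(build_tree(X[idx], y[idx], rng))
    return forest

def predict(tree, x):
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["label"]
```

Because the split feature is drawn at random rather than chosen greedily, every feature is exercised by the forest, which is what makes the per-feature information-gain statistics used later meaningful.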
Feature Relevance: Ranking
- Analyses features individually
- Uses measures of correlation with the target
- A feature is taken to be relevant if P(Y | X_i) ≠ P(Y)
- Assumes no feature interaction
- Fails to identify the relevant features in a parity problem
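The parity failure is easy to demonstrate. In the toy experiment below (variable names made up for the demo), the target is the XOR of two binary features, and a univariate correlation ranking scores both of them no higher than an unrelated noise feature.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, 1000)                 # parity feature 1
x2 = rng.integers(0, 2, 1000)                 # parity feature 2
noise = rng.integers(0, 2, 1000)              # genuinely irrelevant feature
y = np.logical_xor(x1, x2).astype(int)        # parity target

for name, x in [("x1", x1), ("x2", x2), ("noise", noise)]:
    r = np.corrcoef(x, y)[0, 1]
    print(f"{name}: |correlation with target| = {abs(r):.3f}")
# All three values are near zero: a univariate ranking cannot tell the two
# relevant parity features apart from the noise feature.
```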
Feature Relevance: Subset Methods
- Use the implicit feature selection of decision-tree induction
- Wrapper methods
- Subset search methods
- Identifying Markov blankets
- A feature is relevant if P(Y | X_i, S) ≠ P(Y | S) for some subset S of the remaining features
Relevance Identification using Average Information Gain
- Can identify feature interaction
- Reliability depends upon node composition
- Irrelevant features give non-zero relevance
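A sketch of this relevance measure, assuming the hypothetical tree dictionaries produced by the earlier build_forest sketch (each internal node stores the feature it split on and the information gain it achieved): the relevance of a feature is the mean of the IG values recorded at its splits.

```python
import numpy as np

def average_information_gain(forest, n_features):
    # Mean IG achieved by each feature over all splits in the forest.
    sums = np.zeros(n_features)
    counts = np.zeros(n_features)
    def walk(node):
        if node["leaf"]:
            return
        sums[node["feature"]] += node["gain"]
        counts[node["feature"]] += 1
        walk(node["left"])
        walk(node["right"])
    for tree in forest:
        walk(tree)
    return sums / np.maximum(counts, 1)
```

As the slide notes, this raw average still credits irrelevant features with non-zero relevance and treats an IG sample from an easy node the same as one from a hard node, which is what node complexity compensation addresses next.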
Node Complexity Compensation
- Some nodes are easier to split than others
- Requires each IG sample to be weighted by a measure of node complexity
- Data is projected onto a one-dimensional space
- For binary classification, the measure is defined via the arrangement counts on the following slides
Unique & Non-Unique Arrangements
- Some arrangements are reflections of one another (non-unique)
- Some arrangements are symmetrical about their centre (unique)
Node Complexity Compensation (cont…)
- A_u: number of unique arrangements of the n_i examples in a node, i of which are positive
- Counting each arrangement and its reflection once:
  - n_i odd, i odd (OO): A_u = ( C(n_i, i) + C((n_i-1)/2, (i-1)/2) ) / 2
  - n_i odd, i even (OE): A_u = ( C(n_i, i) + C((n_i-1)/2, i/2) ) / 2
  - n_i even, i odd (EO): A_u = C(n_i, i) / 2 (no symmetric arrangements, hence the 0 term)
  - n_i even, i even (EE): A_u = ( C(n_i, i) + C(n_i/2, i/2) ) / 2
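For reference, the unique-arrangement count can be computed directly: each arrangement and its mirror image are counted once, so A_u is the total number of arrangements plus the number of centre-symmetric ones, divided by two (a standard Burnside-type argument). The sketch below reproduces the parity cases above; math.comb supplies the binomial coefficients.

```python
from math import comb

def unique_arrangements(n, i):
    # n examples in the node, i of them positive (binary classification).
    if n % 2 == 0:
        # EE case has symmetric arrangements; EO case has none.
        symmetric = comb(n // 2, i // 2) if i % 2 == 0 else 0
    else:
        # OO and OE cases: the centre element absorbs the odd count, if any.
        symmetric = comb((n - 1) // 2, i // 2)
    return (comb(n, i) + symmetric) // 2

print(unique_arrangements(4, 2))   # 4: ++--, +-+-, +--+, -++- up to reflection
```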
Information Gain Density Functions
- Node complexity compensation improves the measure of average IG
- The effect is visible when examining the IG density functions for each feature
- These are constructed by building a forest and recording the frequencies of the IG values achieved by each feature
Information Gain Density Functions
- RF used to construct 500 trees on an artificial dataset
- IG density functions recorded for each feature
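A sketch of how these density functions might be recorded, again assuming the hypothetical tree dictionaries from the earlier forest sketch: every IG value achieved by each feature is collected and turned into a normalised histogram.

```python
import numpy as np

def ig_density(forest, n_features, bins=50):
    # Empirical IG density per feature: histogram of the IG values recorded
    # at every split made on that feature across the whole forest.
    gains = [[] for _ in range(n_features)]
    def walk(node):
        if node["leaf"]:
            return
        gains[node["feature"]].append(node["gain"])
        walk(node["left"])
        walk(node["right"])
    for tree in forest:
        walk(tree)
    return [np.histogram(g, bins=bins, range=(0.0, 1.0), density=True)[0]
            if g else np.zeros(bins) for g in gains]
```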
Employing Feature Relevance
- Feature selection
- Feature weighting
- Random Forest uses a feature sampling distribution to select each feature
- The distribution can be altered in two ways:
  - Parallel: update during forest construction
  - Two-stage: fix prior to forest construction
Parallel
- Control the update rate using confidence intervals
- Assume the information gain values have a normal distribution
- The statistic then has a Student's t distribution with n-1 degrees of freedom
- Maintain the most uniform distribution within the confidence bounds
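The sketch below is one plausible reading of this parallel update, not the authors' exact rule: a Student-t confidence interval is placed on each feature's mean IG (scipy.stats.t supplies the quantile), and the sampling weights are made as uniform as those intervals allow by clipping a common target value into each interval before renormalising. The function name and the clipping heuristic are assumptions.

```python
import numpy as np
from scipy import stats

def update_sampling_distribution(ig_samples, confidence=0.95):
    """ig_samples: one list of observed IG values per feature."""
    d = len(ig_samples)
    lo = np.zeros(d)                      # default: no evidence, no constraint
    hi = np.full(d, np.inf)
    for f, s in enumerate(ig_samples):
        s = np.asarray(s, dtype=float)
        if len(s) < 2:
            continue
        mean = s.mean()
        sem = s.std(ddof=1) / np.sqrt(len(s))
        # Mean IG assumed normal => studentised mean ~ t with n-1 d.o.f.
        tq = stats.t.ppf(0.5 + confidence / 2.0, df=len(s) - 1)
        lo[f] = max(mean - tq * sem, 0.0)
        hi[f] = mean + tq * sem
    constrained = np.isfinite(hi)
    if not constrained.any():
        return np.full(d, 1.0 / d)        # nothing observed yet: stay uniform
    target = np.mean((lo[constrained] + hi[constrained]) / 2.0)
    weights = np.clip(target, lo, hi)     # as uniform as the bounds allow
    weights = np.maximum(weights, 1e-12)  # keep every feature selectable
    return weights / weights.sum()

# Example: feature 0 has consistently high IG, feature 1 low, feature 2 unseen.
print(update_sampling_distribution([[0.4, 0.5, 0.45], [0.02, 0.01, 0.03], []]))
```

Features with little evidence keep wide intervals and therefore stay close to the common (near-uniform) weight, which is how the confidence level controls the update rate.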
Convergence Rates
- [Table omitted: number of features and average tree size for the WBC, Votes, Ionosphere, Friedman, Pima, Sonar and Simple datasets]
Results
- [Table omitted: results for RF, CI, 2S and CART on the WBC, Sonar, Votes, Pima, Ionosphere, Friedman and Simple datasets]
- …% of the data used for training, 10% for testing
- Forests of 100 trees were tested and averaged over 100 trials
Irrelevant Features
- Average IG is the mean of a non-negative sample
- The expected IG of an irrelevant feature is therefore non-zero
- Performance is degraded when there is a high proportion of irrelevant features
Expected Information Gain
- n_L: number of examples in the left descendant
- i_L: number of positive examples in the left descendant
- [expression for the expected IG omitted]

Expected Information Gain
- Expressed in terms of the number of positive and the number of negative examples in the node
- [expression omitted]
Bounds on Expected Information Gain
- The upper bound can be approximated as [expression omitted]
- The lower bound is given by [expression omitted]
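As a numerical check on the claim that an irrelevant feature has non-zero expected IG, the quantity can be estimated by Monte Carlo: shuffle the class labels along a fictitious one-dimensional projection (an irrelevant feature induces a random arrangement) and record the best split's information gain, using the n_L and i_L counts defined above. This simulation is a stand-in of my own, not the authors' closed-form derivation.

```python
import numpy as np

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def best_split_ig(labels):
    # Best information gain over all split points of a 1-D arrangement.
    n, total_pos = len(labels), int(labels.sum())
    h = binary_entropy(total_pos / n)
    best, i_l = 0.0, 0
    for n_l in range(1, n):                    # n_L examples go to the left descendant
        i_l += labels[n_l - 1]                 # i_L positives in the left descendant
        ig = (h - (n_l / n) * binary_entropy(i_l / n_l)
                - ((n - n_l) / n) * binary_entropy((total_pos - i_l) / (n - n_l)))
        best = max(best, ig)
    return best

def expected_ig_irrelevant(n, i, trials=2000, seed=0):
    # Monte Carlo estimate: an irrelevant feature arranges the i positive and
    # n - i negative examples at random, yet the best split still gains IG.
    rng = np.random.default_rng(seed)
    labels = np.array([1] * i + [0] * (n - i))
    return float(np.mean([best_split_ig(rng.permutation(labels))
                          for _ in range(trials)]))

print(expected_ig_irrelevant(20, 10))   # strictly positive despite irrelevance
```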
Irrelevant Features: Bounds
- 100 trees built on an artificial dataset
- Average IG recorded and the bounds calculated
Friedman
- FS: [selected feature subset omitted]
- CFS: [selected feature subset omitted]

Simple
- FS: [selected feature subset omitted]
- CFS: [selected feature subset omitted]
Results
- [Table omitted: results for CFS, FW, FS and FW & FS on the WBC, Sonar, Votes, Pima, Ionosphere, Friedman and Simple datasets]
- …% of the data used for training, 10% for testing
- Forests of 100 trees were tested and averaged over 100 trials
- 100 trees were constructed for feature evaluation in each trial
Summary
- Node complexity compensation improves the measure of feature relevance by examining node composition
- The feature sampling distribution can be updated using confidence intervals to control the update rate
- Irrelevant features can be removed by calculating their expected performance