Identifying Feature Relevance Using a Random Forest
Jeremy Rogers & Steve Gunn

Overview
- What is a Random Forest?
- Why identify feature relevance?
- Estimating feature importance with a Random Forest
- Node complexity compensation
- Employing feature relevance
- Extension to feature selection

Random Forest
- A combination of base learners using bagging
- Uses CART-based decision trees

Random Forest (cont.)
- Optimises each split using information gain
- Selects a feature at random to perform each split
- The implicit feature selection of CART is thereby removed
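
To make the construction concrete, the following is a minimal Python sketch of the forest variant described above: bagging of CART-style trees in which a single feature is drawn uniformly at random at every node, so that only the split threshold is optimised by information gain. This is not the authors' implementation; all function names and defaults are illustrative.

```python
import numpy as np

def entropy(y):
    """Binary entropy (in bits) of a 0/1 label vector."""
    p = float(np.mean(y))
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def information_gain(y, left_mask):
    """Information gain of splitting y into y[left_mask] and y[~left_mask]."""
    n, n_l = len(y), int(left_mask.sum())
    n_r = n - n_l
    if n_l == 0 or n_r == 0:
        return 0.0
    return (entropy(y)
            - n_l / n * entropy(y[left_mask])
            - n_r / n * entropy(y[~left_mask]))

def grow_tree(X, y, rng, split_log, min_samples=2):
    """Grow one tree; at each node a single feature is chosen at random."""
    if len(y) < min_samples or entropy(y) == 0.0:
        return {"leaf": True, "label": int(round(float(np.mean(y))))}
    f = int(rng.integers(X.shape[1]))        # random feature: CART's implicit feature selection is removed
    best_gain, best_t = 0.0, None
    for t in np.unique(X[:, f])[:-1]:        # candidate thresholds; only the threshold is optimised
        gain = information_gain(y, X[:, f] <= t)
        if gain > best_gain:
            best_gain, best_t = gain, t
    if best_t is None:                       # the drawn feature gives no useful split here
        return {"leaf": True, "label": int(round(float(np.mean(y))))}
    split_log.append((f, best_gain))         # record (feature, IG) for later relevance estimates
    left = X[:, f] <= best_t
    return {"leaf": False, "feature": f, "threshold": best_t,
            "left": grow_tree(X[left], y[left], rng, split_log, min_samples),
            "right": grow_tree(X[~left], y[~left], rng, split_log, min_samples)}

def random_forest(X, y, n_trees=100, seed=0):
    """Bagging: each tree is grown on a bootstrap resample of the training set."""
    rng = np.random.default_rng(seed)
    trees, split_log = [], []
    for _ in range(n_trees):
        idx = rng.integers(len(y), size=len(y))   # sample with replacement
        trees.append(grow_tree(X[idx], y[idx], rng, split_log))
    return trees, split_log
```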

Feature Relevance: Ranking
- Analyse features individually
- Use measures of correlation to the target
- A feature X_i is deemed relevant if P(Y | X_i) ≠ P(Y)
- Assumes no feature interaction
- Fails to identify relevant features in the parity problem (illustrated below)
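
A small numerical illustration (not from the slides) of the parity failure: in a two-feature XOR problem each feature individually has roughly zero correlation with the target, yet the two features jointly determine it exactly, so any univariate ranking scores both as irrelevant.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, size=1000)
x2 = rng.integers(0, 2, size=1000)
y = x1 ^ x2                              # parity (XOR) target

print(np.corrcoef(x1, y)[0, 1])          # approx. 0: x1 looks irrelevant on its own
print(np.corrcoef(x2, y)[0, 1])          # approx. 0: so does x2
print(np.mean(y == (x1 ^ x2)))           # 1.0: together they predict y perfectly
```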

Feature Relevance: Subset Methods
- Use the implicit feature selection of decision tree induction
- Wrapper methods
- Subset search methods
- Identifying Markov blankets
- A feature X_i is deemed relevant if P(Y | X_i, S) ≠ P(Y | S) for some subset S of the remaining features

Relevance Identification Using Average Information Gain
- Can identify feature interaction
- Reliability is dependent upon node composition
- Irrelevant features give non-zero relevance
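
A sketch of the relevance estimate described above, reusing the hypothetical split_log produced by the forest sketch earlier: the relevance of a feature is taken as its mean information gain over all splits in which it was drawn.

```python
import numpy as np

def average_information_gain(split_log, n_features):
    """Mean IG per feature over the recorded (feature, gain) pairs."""
    totals = np.zeros(n_features)
    counts = np.zeros(n_features)
    for feature, gain in split_log:
        totals[feature] += gain
        counts[feature] += 1
    return totals / np.maximum(counts, 1)   # 0 for features never drawn
```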

Node Complexity Compensation
- Some nodes are easier to split than others
- Each sample must be weighted by some measure of node complexity
- Data are projected onto a one-dimensional space
- For binary classification, the complexity is based on the possible arrangements of positive and negative examples at the node (next slides)

Unique & Non-Unique Arrangements
- Some arrangements are reflections of one another (non-unique)
- Some arrangements are symmetrical about their centre (unique)

Node Complexity Compensation (cont.)
The number of unique (symmetric) arrangements A_u depends on the parity of n and i, where C(a, b) denotes the binomial coefficient:

  n      i      A_u
  odd    odd    C((n-1)/2, (i-1)/2)
  odd    even   C((n-1)/2, i/2)
  even   odd    0
  even   even   C(n/2, i/2)

A_u = no. of unique arrangements
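
The counting behind the table can be sketched as follows; this only reproduces the parity cases above, not the exact weight the authors attach to each node.

```python
from math import comb

def unique_arrangements(n, i):
    """Number of arrangements of i positives among n ordered examples that are
    symmetric about the centre (A_u in the table above)."""
    if n % 2 == 0 and i % 2 == 1:
        return 0                      # even n, odd i: no symmetric arrangement exists
    return comb(n // 2, i // 2)       # choose one half; the mirrored half is forced

# The total number of arrangements is comb(n, i); counting each mirror pair once
# gives (comb(n, i) + unique_arrangements(n, i)) / 2 distinct arrangements.
```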

Information Gain Density Functions
- Node complexity compensation improves the measure of average IG
- The effect is visible when examining the IG density functions for each feature
- These are constructed by building a forest and recording the frequencies of the IG values achieved by each feature

Information Gain Density Functions (cont.)
- An RF of 500 trees was constructed on an artificial dataset
- The IG density function was recorded for each feature
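
One way to reproduce such density functions, assuming the split_log from the earlier forest sketch: collect the IG value recorded every time a feature is split on and histogram the values per feature.

```python
import matplotlib.pyplot as plt

def plot_ig_densities(split_log, n_features, bins=50):
    """Histogram the recorded information-gain values for each feature."""
    for f in range(n_features):
        gains = [g for feat, g in split_log if feat == f]
        plt.hist(gains, bins=bins, density=True, alpha=0.5, label=f"feature {f}")
    plt.xlabel("information gain")
    plt.ylabel("density")
    plt.legend()
    plt.show()
```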

Employing Feature Relevance
- Feature selection
- Feature weighting
- The Random Forest uses a feature sampling distribution to select each feature
- The distribution can be altered in two ways:
  - Parallel: updated during forest construction
  - Two-stage: fixed prior to forest construction

Parallel
- Control the update rate using confidence intervals
- Assume the information gain values are normally distributed
- The resulting statistic has a Student's t distribution with n-1 degrees of freedom
- Maintain the most uniform distribution within the confidence bounds
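
A hedged sketch of the parallel scheme: per-feature IG samples are treated as approximately normal, a Student's t confidence interval is formed for each feature's mean IG, and the feature sampling distribution is derived from those intervals. The slide's rule of keeping the most uniform distribution within the confidence bounds is not reproduced exactly; weighting by the lower confidence bound below is only a simple stand-in.

```python
import numpy as np
from scipy import stats

def feature_sampling_distribution(ig_samples, confidence=0.95, floor=1e-3):
    """ig_samples: one sequence of recorded IG values per feature.
    Returns a probability distribution over features."""
    weights = []
    for gains in ig_samples:
        gains = np.asarray(gains, dtype=float)
        if len(gains) < 2:                         # too little evidence: near-neutral weight
            weights.append(floor)
            continue
        mean = gains.mean()
        sem = gains.std(ddof=1) / np.sqrt(len(gains))
        t_crit = stats.t.ppf((1 + confidence) / 2, df=len(gains) - 1)
        weights.append(max(mean - t_crit * sem, floor))   # lower confidence bound on mean IG
    w = np.asarray(weights)
    return w / w.sum()                             # normalise to a sampling distribution
```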

Convergence Rates
Number of features and average tree size for each data set (WBC, Votes, Ionosphere, Friedman, Pima, Sonar, Simple). WBC has 9 features and an average tree size of 58.3; Simple has 9 features and an average tree size of 57.3.

Results
Classification results for RF, CI, 2S, and CART on the WBC, Sonar, Votes, Pima, Ionosphere, Friedman, and Simple data sets. 90% of the data was used for training and 10% for testing; forests of 100 trees were tested and averaged over 100 trials.

Irrelevant Features
- Average IG is the mean of a non-negative sample
- The expected IG of an irrelevant feature is therefore non-zero
- Performance is degraded when there is a high proportion of irrelevant features

Expected Information Gain
n_L = no. of examples in the left descendant
i_L = no. of positive examples in the left descendant
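
A sketch of the quantity this notation points at: for a node with n examples, i of them positive, an irrelevant feature that sends n_L examples to the left child distributes the positives hypergeometrically, and the expected information gain is the average IG over that distribution. Whether the slide additionally averages over n_L is not reproduced here.

```python
from math import comb, log2

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def expected_ig_irrelevant(n, i, n_l):
    """E[IG] over the hypergeometric i_L for a random split sending n_l examples left."""
    assert 0 < n_l < n and 0 <= i <= n
    n_r = n - n_l
    parent = binary_entropy(i / n)
    expected = 0.0
    for i_l in range(max(0, i - n_r), min(i, n_l) + 1):
        prob = comb(i, i_l) * comb(n - i, n_l - i_l) / comb(n, n_l)
        children = (n_l / n) * binary_entropy(i_l / n_l) \
                 + (n_r / n) * binary_entropy((i - i_l) / n_r)
        expected += prob * (parent - children)
    return expected

# e.g. expected_ig_irrelevant(10, 5, 5) > 0: even a purely random split has a
# positive expected IG, which is the bias discussed on the previous slide.
```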

Expected Information Gain (cont.)
[Plot: expected information gain as a function of the number of positive and negative examples in the node.]

Bounds on Expected Information Gain
- The upper bound can be approximated in closed form
- The lower bound is given by an exact expression

Irrelevant Features: Bounds
- 100 trees were built on an artificial dataset
- Average IG was recorded and the bounds calculated

Friedman
[Features selected by FS and by CFS on the Friedman data set.]

Simple
[Features selected by FS and by CFS on the Simple data set.]

Results
Classification results for CFS, FW, FS, and FW & FS on the WBC, Sonar, Votes, Pima, Ionosphere, Friedman, and Simple data sets. 90% of the data was used for training and 10% for testing; forests of 100 trees were tested and averaged over 100 trials, with 100 trees constructed for feature evaluation in each trial.

Summary
- Node complexity compensation improves the measure of feature relevance by examining node composition
- The feature sampling distribution can be updated using confidence intervals to control the update rate
- Irrelevant features can be removed by calculating their expected performance