Feature Selection with Mutual Information and Resampling
M. Verleysen, Université catholique de Louvain (Belgium), Machine Learning Group
http://www.dice.ucl.ac.be/mlg/
Joint work with D. François, F. Rossi and V. Wertz.
High-dimensional data: spectrophotometry
Goal: predict the sugar concentration of an orange juice sample from its light absorption spectrum.
115 samples in dimension 512: even a linear model (more than 512 parameters for only 115 samples) would lead to overfitting!
Material resistance classification
Goal: classify materials into "valid", "non-valid" and "don't know".
Material resistance: feature extraction
Extraction of whatever features you could imagine…

Description                        Feature numbers
Temperature of experiment          1
Original values                    2-24, 97-119
Area under curve                   25, 120
Numerical 1st derivatives          26-47, 121-142
Widths of the curve                48-58, 143-153
5th degree polynomial approx.      59-70, 154-165
Linear approximation               71-74, 166-169
Quadratic approximation            75-80, 170-175
Max. and min. points               81-88, 176-183
Moments                            89-96, 184-191
Why reduce dimensionality?
Theoretically, it is not useful:
– more information means an easier task
– models can ignore irrelevant features (e.g. set their weights to zero)
– models can adjust their metrics
"In theory, practice and theory are the same. But in practice, they're not."
In practice, many inputs mean many parameters and a high-dimensional input space: curse of dimensionality and risk of overfitting!
Reduced set of variables
Initial variables: x1, x2, x3, …, xN
Reduced set, obtained by
– selection: x2, x7, x23, …, xN-4, or
– projection: y1, y2, y3, …, yM (where yi = f(wi, x))
Advantages:
– selection: interpretability, easy algorithms
– projection: potentially more powerful
Feature selection
Initial variables: x1, x2, x3, …, xN
Reduced set by selection: x2, x7, x23, …, xN-4
– based on sound statistical criteria
– makes interpretation easy: x7, x23 are the variables to take into account; the set {x7, x23} is as good as the set {x2, x44, x47} to serve as input to the model.
Feature selection: two ingredients are needed.
Key Element 1: subset relevance assessment – measuring how well a subset of features fits the problem.
Key Element 2: subset search policy – avoiding an exhaustive search over all possible subsets.
Optimal subset search
Which subset of {X1, X2, X3, X4} is most relevant?
Exhaustive search is an NP-hard problem: the number of candidate subsets is exponential in the number of features.
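To make the combinatorial explosion concrete (elementary arithmetic, not taken from the slide), the number of non-empty candidate subsets of N features, applied to the 512-dimensional spectra above, is:

```latex
\sum_{K=1}^{N} \binom{N}{K} = 2^{N} - 1,
\qquad
2^{512} - 1 \approx 1.3 \times 10^{154}\ \text{candidate subsets}.
```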
Option 1: the best subset is… the subset of the best features (naive search, ranking)
Which subset is most relevant?
Hypothesis: the best subset is the set of the K individually most relevant features.
Ranking is usually not optimal
With very correlated features, the top-ranked features are nearly identical: ranking selects redundant features.
Option 2: the best subset is… an approximate solution to the NP-hard problem (iterative heuristics)
Which subset is most relevant?
Hypothesis: the best subset can be constructed iteratively, as in the sketch below.
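As an illustration of such an iterative heuristic, here is a minimal sketch of greedy forward selection in Python. The relevance function (e.g. a mutual information estimate, discussed next) is passed as a parameter; the names and the naive stopping rule are placeholder assumptions, not the authors' exact procedure.

```python
def forward_selection(features, relevance, max_size=None):
    """Greedy forward search: at each step, add the feature that most
    improves the relevance of the currently selected subset.

    features  : iterable of candidate feature indices
    relevance : function mapping a tuple of indices to a score,
                e.g. an estimate of I(X_subset; Y)  (placeholder)
    """
    selected, best_score = [], float("-inf")
    remaining = list(features)
    while remaining and (max_size is None or len(selected) < max_size):
        # score every one-feature extension of the current subset
        scores = {f: relevance(tuple(selected + [f])) for f in remaining}
        best_f = max(scores, key=scores.get)
        if scores[best_f] <= best_score:   # naive stop: no improvement
            break                          # (better criteria discussed below)
        selected.append(best_f)
        remaining.remove(best_f)
        best_score = scores[best_f]
    return selected
```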
About the relevance criterion
The relevance criterion must deal with subsets of variables!
Feature selection: two ingredients are needed (reminder).
Key Element 1: subset relevance assessment – measuring how well a subset of features fits the problem.
Key Element 2: subset search policy – avoiding an exhaustive search over all possible subsets.
Mutual information
Mutual information is
– bounded below by 0
– not bounded above by 1
– bounded above by the (unknown) entropies
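For reference, the standard definition behind these properties (not reproduced on the slide), where H denotes entropy:

```latex
I(X;Y) = \int\!\!\int p_{X,Y}(x,y)\,
         \log \frac{p_{X,Y}(x,y)}{p_X(x)\,p_Y(y)}\;dx\,dy
       = H(X) + H(Y) - H(X,Y),
\qquad
0 \le I(X;Y) \le \min\bigl(H(X),\,H(Y)\bigr).
```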
Mutual information is difficult to estimate:
– the probability density functions are not known
– the integrals cannot be computed exactly
– X can be high-dimensional
Estimation in high dimension
Traditional MI estimators – histograms, kernels (Parzen windows) – are NOT appropriate in high dimension.
Kraskov's estimator (k-NN counts): still not very appropriate, but works better…
Principle: when data are close in the X space, are the corresponding Y close too?
Kraskov's estimator (k-NN counts)
Principle: count the number of neighbours in the X space versus the number of neighbours in the Y space.
[Figure: the same samples plotted in the X space and in the Y space]
– Nearest neighbours in X and Y coincide: high mutual information.
– Nearest neighbours in X and Y do not coincide: low mutual information.
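As a concrete illustration, here is a minimal sketch of Kraskov's estimator (algorithm 1 of Kraskov, Stögbauer and Grassberger, 2004), assuming NumPy and SciPy are available; variable names are mine and this is not necessarily the implementation used by the authors.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mutual_information(x, y, k=6):
    """Kraskov et al. (2004), algorithm 1: MI estimate (in nats) from k-NN counts."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n = len(x)
    joint = np.hstack([x, y])

    # distance (max-norm) to the k-th neighbour in the joint (X, Y) space;
    # k + 1 because the query point itself is returned at distance 0
    eps = cKDTree(joint).query(joint, k=k + 1, p=np.inf)[0][:, -1]

    # for each point, count the neighbours falling strictly inside eps
    # in the X space alone and in the Y space alone
    tree_x, tree_y = cKDTree(x), cKDTree(y)
    nx = np.array([len(tree_x.query_ball_point(x[i], eps[i] - 1e-12, p=np.inf)) - 1
                   for i in range(n)])
    ny = np.array([len(tree_y.query_ball_point(y[i], eps[i] - 1e-12, p=np.inf)) - 1
                   for i in range(n)])

    # Kraskov's formula: psi(k) + psi(n) - < psi(nx + 1) + psi(ny + 1) >
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))
```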
MI estimation
Mutual information estimators require the tuning of a parameter:
– the number of bins in histograms
– the kernel variance in Parzen windows
– k in the k-NN based estimator (Kraskov)
Unfortunately, the MI estimator is not very robust to this parameter…
Robustness of the MI estimator
[Figure: illustration on 100 samples]
Sensitivity to the stopping criterion
Forward search: stop when the MI does not increase anymore.
In theory, is this valid?
Sensitivity to the stopping criterion
Forward search: stop when the MI does not increase anymore. In theory, is this valid?
Answer: NO, because the true mutual information never decreases when a feature X_k is added to the selected subset X_S:
I({X_S, X_k}; Y) ≥ I(X_S; Y).
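A one-line justification of this monotonicity (standard chain rule for mutual information, not shown on the slide):

```latex
I(\{X_S, X_k\}; Y) \;=\; I(X_S; Y) + I(X_k; Y \mid X_S) \;\ge\; I(X_S; Y),
\qquad \text{since } I(X_k; Y \mid X_S) \ge 0 .
```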
Sensitivity to the stopping criterion
Forward search: stop when the MI does not increase anymore.
In theory: NOT OK! In practice: ???
In summary
Two problems:
– the number k of neighbours in the k-NN estimator
– when to stop?
Number of neighbours?
How to select k (the number of neighbours)?
Idea: compare the distributions of the estimated MI between Y and
1. a relevant feature X
2. a non-relevant one X~ (a permuted copy of X – see below).
The best value for k
The optimal value of k is the one that best separates the two distributions (e.g. with a Student-like test).
How to obtain these distributions?
Distribution of MI(Y, X) for the relevant feature:
– use non-overlapping subsets X[i] of the sample
– compute I(X[i], Y) on each subset
Distribution of MI(Y, X~) for the non-relevant feature: eliminate the relation between X and Y.
– How? Permute X into X~
– use non-overlapping subsets X~[j]
– compute I(X~[j], Y) on each subset
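A minimal sketch of this resampling idea, reusing the ksg_mutual_information function above. The two-sample t statistic is only one possible "Student-like" separation measure, and all names, fold counts and candidate values of k are illustrative assumptions rather than the authors' exact procedure.

```python
import numpy as np
from scipy.stats import ttest_ind

def mi_distributions(x, y, k, n_subsets=10, rng=None):
    """Estimated MI on non-overlapping subsets of the data, for the original
    pairing (x, y) and for a permuted copy of x (relation with y destroyed).
    x and y are NumPy arrays with one row per sample."""
    rng = np.random.default_rng(rng)
    folds = np.array_split(rng.permutation(len(x)), n_subsets)
    x_perm = x[rng.permutation(len(x))]          # permuted copy: non-relevant feature
    mi_rel = np.array([ksg_mutual_information(x[f], y[f], k) for f in folds])
    mi_irr = np.array([ksg_mutual_information(x_perm[f], y[f], k) for f in folds])
    return mi_rel, mi_irr

def choose_k(x, y, candidate_ks=range(2, 21)):
    """Pick the k whose MI distributions (relevant vs. permuted feature) are
    best separated, here measured by a two-sample t statistic."""
    scores = {k: ttest_ind(*mi_distributions(x, y, k)).statistic
              for k in candidate_ks}
    return max(scores, key=scores.get)
```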
The stopping criterion
Observed difficulty: the estimated MI depends on the size of the feature set (the estimator's bias depends on the dimension):
the estimate of I({X_S, X_k}; Y) differs from the estimate of I(X_S; Y) even if MI(X_k, Y) = 0.
Avoid comparing MI estimates on subsets of different sizes!
Instead, compare the estimate of I({X_S, X_k}; Y) with the estimate of I({X_S, X~_k}; Y), where X~_k is a permuted copy of X_k: both are computed on subsets of the same size.
The stopping criterion
Accept the candidate feature X_k only if the estimated MI of the extended subset exceeds the 95% percentile of the permutation distribution (the distribution of the estimate obtained with permuted copies X~_k).
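And a minimal sketch of the permutation-based stopping test itself, again reusing ksg_mutual_information; the number of permutations and all names are illustrative assumptions. In the forward search sketched earlier, this test would replace the naive "stop when the MI no longer increases" rule.

```python
import numpy as np

def accept_feature(x_selected, x_candidate, y, k, n_perm=100,
                   percentile=95, rng=None):
    """Accept the candidate feature only if the estimated MI of the extended
    subset exceeds the given percentile of the MI values obtained when the
    candidate is replaced by permuted (hence non-relevant) copies."""
    rng = np.random.default_rng(rng)
    x_ext = np.column_stack([x_selected, x_candidate])
    mi_obs = ksg_mutual_information(x_ext, y, k)

    mi_perm = []
    for _ in range(n_perm):
        x_fake = x_candidate[rng.permutation(len(y))]    # break the link with y
        mi_perm.append(ksg_mutual_information(
            np.column_stack([x_selected, x_fake]), y, k))

    return mi_obs > np.percentile(mi_perm, percentile)
```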
The stopping criterion
100 datasets (MC simulations):

# of features selected       1    2    3    4    5    6
Max. mutual information      7   45   33   14    1    0
Stopping criterion           0    1   12   52   29    6

# of informative features    1    2    3    4    5
Max. mutual information      7   45   33   14    1
Stopping criterion           0    1   16   66   17
"Housing" benchmark
Dataset origin: StatLib library, Carnegie Mellon University.
Concerns housing values in suburbs of Boston.
Attributes:
1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
3. INDUS: proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distances to five Boston employment centres
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per $10,000
11. PTRATIO: pupil-teacher ratio by town
12. B: 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
13. LSTAT: % lower status of the population
14. MEDV: median value of owner-occupied homes in $1000's
The stopping criterion: Housing dataset.
The stopping criterion: Housing dataset
RBFN performances on the test set:
– all features: RMSE = 18.97
– 2 features (max MI): RMSE = 19.39
– selected features: RMSE = 9.48
The stopping criterion: spectral analysis (Nitrogen dataset)
141 IR spectra, 1050 wavelengths; 105 spectra for training, 36 for test.
Functional preprocessing (B-splines).
The stopping criterion: spectral analysis (Nitrogen dataset)
RBFN performances on the test set:
– all features: RMSE = 3.12
– 6 features (max MI): RMSE = 0.78
– selected features: RMSE = 0.66
The stopping criterion: Delve-Census dataset
104 features used; 22784 data points:
– 14540 for test
– 8 × 124 for training (to study variability)
The stopping criterion: Delve-Census dataset.
The stopping criterion: Delve-Census dataset – RMSE on the test set.
Conclusion
Selection of variables by mutual information may improve learning performances and increase interpretability…
… if used in an adequate way!
Reference:
– D. François, F. Rossi, V. Wertz and M. Verleysen, "Resampling methods for parameter-free and robust feature selection with mutual information", Neurocomputing, Volume 70, Issues 7-9, March 2007, Pages 1276-1288.
Thanks to my co-authors for (most part of…) the work!