M. Verleysen UCL 1 Feature Selection with Mutual Information and Resampling M. Verleysen, Université catholique de Louvain (Belgium), Machine Learning Group. Joint work with D. François, F. Rossi and V. Wertz.

M. Verleysen UCL 2 High-dimensional data: Spectrophotometry To predict sugar concentration in an orange juice sample from light absorption spectra. 115 samples in dimension 512. Even a linear model would lead to overfitting!

M. Verleysen UCL 3 Material resistance classification Goal: to classify materials into “valid”, “non-valid” and “don’t know”.

M. Verleysen UCL 4 Material resistance: feature extraction Extraction of whatever features you would imagine … Extracted features (description / feature number): temperature of experiment, original values, area under curve, numerical 1st derivatives, widths of the curve, …th degree polynomial approximation, linear approximation, quadratic approximation, max. and min. points, moments.

M. Verleysen UCL 5 Why reduce dimensionality? Theoretically not useful: –more information means an easier task –models can ignore irrelevant features (e.g. set weights to zero) –models can adjust metrics. « In theory, practice and theory are the same. But in practice, they're not. » In practice, lots of inputs means lots of parameters and a high-dimensional input space → curse of dimensionality and risk of overfitting!

M. Verleysen UCL 6 Reduced set of variables Initial variables: x1, x2, x3, …, xN. Reduced set: –selection: x2, x7, x23, …, xN-4, or –projection: y1, y2, y3, …, yM (where yi = f(wi, x)). Advantages: –selection: interpretability, easy algorithms –projection: potentially more powerful.

M. Verleysen UCL 7 Feature selection Initial variables: x1, x2, x3, …, xN. Reduced set (selection): x2, x7, x23, …, xN-4. –Based on sound statistical criteria. –Makes interpretation easy: x7, x23 are the variables to take into account; the set {x7, x23} is as good as the set {x2, x44, x47} to serve as input to the model.

M. Verleysen UCL 8 Feature selection Two ingredients are needed: Key Element 1: Subset relevance assessment –measuring how well a subset of features fits the problem. Key Element 2: Subset search policy –avoiding trying all possible subsets.

M. Verleysen UCL 9 Feature selection Two ingredients are needed: Key Element 1: Subset relevance assessment –measuring how well a subset of features fits the problem. Key Element 2: Subset search policy –avoiding trying all possible subsets.

M. Verleysen UCL 10 Optimal subset search Which subset of [X1 X2 X3 X4] is most relevant? Exhaustive search is an NP problem: the number of candidate subsets grows as 2^N, exponential in the number of features.

M. Verleysen UCL 11 Option 1: Best subset is … the subset of best features Which subset is most relevant? Hypothesis: the best subset is the set of the K individually most relevant features among [X1 X2 X3 X4] → naive search (ranking).
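A minimal sketch of this naive (ranking) search, for illustration only; the function name rank_features and the generic univariate relevance callable are hypothetical placeholders (any criterion could be plugged in, e.g. the mutual information estimator introduced later in the talk):

```python
import numpy as np

def rank_features(X, y, relevance, n_keep):
    # Naive search (ranking): score each feature individually with a
    # univariate relevance criterion and keep the n_keep best ones.
    # X is assumed to be a NumPy array of shape (n_samples, n_features).
    scores = np.array([relevance(X[:, j], y) for j in range(X.shape[1])])
    return list(np.argsort(scores)[::-1][:n_keep])
```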

M. Verleysen UCL 12 Ranking is usually not optimal Very correlated features: obviously, close features will be selected together!

M. Verleysen UCL 13 Option 2: Best subset is … an approximate solution to the NP problem Which subset is most relevant? Hypothesis: the best subset can be constructed iteratively (iterative heuristics).
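A minimal sketch of such an iterative heuristic, written here as a greedy forward search; the names forward_search and relevance are hypothetical, and relevance is assumed to be any criterion that scores a whole subset of features (the point made on the next slide):

```python
def forward_search(X, y, relevance, n_keep):
    # Greedy forward search: grow the subset one feature at a time, each time
    # adding the feature that most improves the subset criterion.
    # X is assumed to be a NumPy array of shape (n_samples, n_features).
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_keep:
        best = max(remaining, key=lambda j: relevance(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected
```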

M. Verleysen UCL 14 About the relevance criterion The relevance criterion must deal with subsets of variables!

M. Verleysen UCL 15 Feature selection Two ingredients are needed: Key Element 1: Subset relevance assessment –measuring how well a subset of features fits the problem. Key Element 2: Subset search policy –avoiding trying all possible subsets.

M. Verleysen UCL 16 Mutual information Mutual information is –bounded below by 0 –not bounded above by 1 –bounded above by the (unknown) entropies.
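For reference, the standard definition behind these properties (not written out on the slide; the upper bound by the entropies is stated here in its discrete-entropy form):

```latex
I(X;Y) \;=\; \iint p_{X,Y}(x,y)\,
       \log\frac{p_{X,Y}(x,y)}{p_X(x)\,p_Y(y)}\;dx\,dy \;\ge\; 0,
\qquad
I(X;Y) \;=\; H(Y) - H(Y \mid X) \;\le\; \min\{H(X),\,H(Y)\}
\quad\text{(discrete case)},
```

with equality to 0 if and only if X and Y are independent.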

M. Verleysen UCL 17 Mutual information is difficult to estimate –probability density functions are not known –integrals cannot be computed exactly –X can be high-dimensional.

M. Verleysen UCL 18 Estimation in HD Traditional MI estimators –histograms –kernels (Parzen windows) are NOT appropriate in high dimension. Kraskov's estimator (k-NN counts) is still not very appropriate, but works better... Principle: when data are close in the X space, are the corresponding Y close too?

M. Verleysen UCL 19 Kraskov's estimator (k-NN counts) Principle: count the number of neighbors in X versus the number of neighbors in Y.

M. Verleysen UCL 20 Kraskov's estimator (k-NN counts) Principle: count the number of neighbors in X versus the number of neighbors in Y. Nearest neighbors in X and Y coincide: high mutual information. Nearest neighbors in X and Y do not coincide: low mutual information.
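A minimal brute-force sketch of the first Kraskov-Stögbauer-Grassberger estimator, for illustration only; the function name kraskov_mi, the default k and the O(n²) distance computation are choices made here, not taken from the slides:

```python
import numpy as np
from scipy.special import digamma

def kraskov_mi(x, y, k=6):
    # Kraskov et al. (2004) estimator #1 of I(X;Y), based on k-NN counts.
    # Assumes no duplicated sample points.
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n = len(x)
    # Pairwise Chebyshev (max-norm) distances in X, in Y, and in the joint space.
    dx = np.max(np.abs(x[:, None, :] - x[None, :, :]), axis=-1)
    dy = np.max(np.abs(y[:, None, :] - y[None, :, :]), axis=-1)
    dz = np.maximum(dx, dy)
    np.fill_diagonal(dz, np.inf)                 # ignore self-distances
    eps = np.sort(dz, axis=1)[:, k - 1]          # distance to the k-th neighbour in the joint space
    # Count, for each point, the neighbours strictly closer than eps in each marginal space.
    nx = np.sum(dx < eps[:, None], axis=1) - 1   # "-1" removes the point itself
    ny = np.sum(dy < eps[:, None], axis=1) - 1
    mi = digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))
    return max(mi, 0.0)                          # clip small negative estimation noise

```

The two counts implement exactly the principle stated on the slide: if points that are close in X also tend to be close in Y, the marginal neighbour counts stay small and the estimate is large.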

M. Verleysen UCL 21 MI estimation Mutual information estimators require the tuning of a parameter: –bins in histograms –kernel variance in Parzen windows –k in the k-NN based estimator (Kraskov). Unfortunately, the MI estimator is not very robust to this parameter…

M. Verleysen UCL 22 Robustness of MI estimator 100 samples.

M. Verleysen UCL 23 Sensitivity to stopping criterion Forward search: stop when the MI does not increase anymore. In theory: is it valid?

M. Verleysen UCL 24 Sensitivity to stopping criterion Forward search: stop when the MI does not increase anymore. In theory: is it valid? Answer: NO, because the true mutual information can never decrease when a feature is added: I({X1, …, Xk, Xk+1}; Y) ≥ I({X1, …, Xk}; Y).

M. Verleysen UCL 25 Sensitivity to stopping criterion Forward search: stop when the MI does not increase anymore. In theory: NOT OK! In practice: ???

M. Verleysen UCL 26 In summary Two problems: –the number k of neighbors in the k-NN estimator –when to stop?

M. Verleysen UCL 27 Number of neighbors? How to select k (the number of neighbors)?

M. Verleysen UCL 28 Number of neighbors? How to select k (the number of neighbors)? Idea: compare the distributions of the MI between Y and 1. a relevant feature X, 2. a non-relevant one Xπ (a permuted copy of X).

M. Verleysen UCL 29 The best value for k The optimal value of k is the one that best separates the two distributions (e.g. with a Student-like test).

M. Verleysen UCL 30 How to obtain these distributions? Distribution of MI(Y, X): –use non-overlapping subsets X[i] –compute I(X[i], Y). Distribution of MI(Y, Xπ): eliminate the relation between X and Y. –How? Permute X → Xπ –use non-overlapping subsets Xπ[j] –compute I(Xπ[j], Y).
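A minimal sketch of this resampling recipe, reusing the kraskov_mi sketch above; the number of non-overlapping subsets, the candidate values of k and the use of scipy's two-sample t statistic as the "Student-like test" are illustrative assumptions, not specifics taken from the slides:

```python
import numpy as np
from scipy.stats import ttest_ind

def mi_distributions(x, y, k, n_subsets=5, seed=0):
    # Two MI distributions: the relevant feature X vs. a permuted copy of X
    # (relation with Y destroyed), each estimated on non-overlapping subsets.
    rng = np.random.default_rng(seed)
    splits = np.array_split(rng.permutation(len(x)), n_subsets)
    x_perm = rng.permutation(x)                  # permute X -> X_pi
    mi_rel = np.array([kraskov_mi(x[s], y[s], k) for s in splits])
    mi_irr = np.array([kraskov_mi(x_perm[s], y[s], k) for s in splits])
    return mi_rel, mi_irr

def choose_k(x, y, k_values=(2, 4, 6, 8, 12, 16)):
    # Keep the k whose two distributions are best separated (Student-like test).
    scores = {k: ttest_ind(*mi_distributions(x, y, k)).statistic for k in k_values}
    return max(scores, key=scores.get)
```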

M. Verleysen UCL 31 The stopping criterion Observed difficulty: the MI estimation depends on the size of the feature set: the estimated MI may still increase when a feature Xk is added, even if MI(Xk, Y) = 0. Avoid comparing MI estimates on subsets of different sizes! Instead, compare the estimated MI of the candidate subset with the MI obtained, on a subset of the same size, when the relation between the new feature and Y is destroyed by permutation.

M. Verleysen UCL 32 The stopping criterion 95% percentile of the permutation distribution.
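A minimal sketch of how such a permutation threshold could be used inside the forward search; the function name, the number of permutations and the way the candidate feature is permuted are assumptions made here to illustrate the slides, not the authors' exact procedure (see the Neurocomputing paper cited at the end for that):

```python
import numpy as np

def passes_stopping_test(X_subset, x_new, y, mi_estimator, k=6,
                         n_perm=100, alpha=0.95, seed=0):
    # Accept the candidate feature only if the MI of the augmented subset
    # exceeds the 95% percentile of the MI values obtained when the candidate
    # is randomly permuted (same subset size, relation with Y destroyed).
    rng = np.random.default_rng(seed)
    augmented = np.column_stack([X_subset, x_new])
    observed = mi_estimator(augmented, y, k)
    null = [mi_estimator(np.column_stack([X_subset, rng.permutation(x_new)]), y, k)
            for _ in range(n_perm)]
    return observed > np.quantile(null, alpha)
```

In a forward search, the loop stops as soon as the best remaining candidate fails this test.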

M. Verleysen UCL 33 The stopping criterion 100 datasets (MC simulations). Figure: number of selected features (max. mutual information criterion vs. stopping criterion) as a function of the number of informative features.

M. Verleysen UCL 34 "Housing" benchmark Dataset origin: StatLib library, Carnegie Mellon Univ. Dataset origin: StatLib library, Carnegie Mellon Univ. Concerns housing values in suburbs of Boston Concerns housing values in suburbs of Boston Attributes: Attributes: 1.CRIM per capita crime rate by town 2.ZN proportion of residential land zoned for lots over 25,000 sq.ft. 3.INDUS proportion of non-retail business acres per town 4.CHAS Charles River dummy variable (= 1 if tract bounds river; otherw. 0) 5.NOX nitric oxides concentration (parts per 10 million) 6.RM average number of rooms per dwelling 7.AGE proportion of owner-occupied units built prior to DIS weighted distances to five Boston employment centres 9.RAD index of accessibility to radial highways 10.TAX full-value property-tax rate per $10, PTRATIO pupil-teacher ratio by town 12.B 1000(Bk )^2 where Bk is the proportion of blacks by town 13.LSTAT % lower status of the population 14.MEDV Median value of owner-occupied homes in $1000's

M. Verleysen UCL 35 The stopping criterion Housing dataset.

M. Verleysen UCL 36 The stopping criterion Housing dataset. RBFN performance on the test set: –all features: RMSE = … –… features (max MI): RMSE = … –selected features: RMSE = 9.48.

M. Verleysen UCL 37 The stopping criterion Spectral analysis (Nitrogen dataset): 141 IR spectra, 1050 wavelengths; 105 spectra for training, 36 for test; functional preprocessing (B-splines).

M. Verleysen UCL 38 The stopping criterion Spectral analysis (Nitrogen dataset). RBFN performance on the test set: –all features: RMSE = … –… features (max MI): RMSE = … –selected features: RMSE = 0.66.

M. Verleysen UCL 39 The stopping criterion Delve-Census dataset: 104 features used. Data: –14540 for test –8 x 124 for training (to study variability).

M. Verleysen UCL 40 The stopping criterion Delve-Census dataset.

M. Verleysen UCL 41 The stopping criterion Delve-Census dataset: RMSE on the test set.

M. Verleysen UCL 42 Conclusion Selection of variables by mutual information may improve learning performance and increase interpretability… …if used in an adequate way! Reference: –D. François, F. Rossi, V. Wertz and M. Verleysen, "Resampling methods for parameter-free and robust feature selection with mutual information", Neurocomputing, Volume 70, Issues 7-9, March 2007. Thanks to my co-authors for (most part of…) the work!