Introduction to variable selection I
Qi Yu

Problems due to poor variable selection:
- The input dimension is too large, and the curse of dimensionality problem may arise;
- A poor model may be built when additional unrelated inputs are included or not enough relevant inputs are used;
- Complex models that contain too many inputs are more difficult to understand.

Two broad classes of variable selection methods: filter and wrapper
The filter method is a pre-processing step which is independent of the learning algorithm. The input subset is chosen by an evaluation criterion which measures the relation of each subset of input variables with the output.
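As an illustrative sketch (not from the slides), a simple filter criterion ranks each input by the absolute value of its Pearson correlation with the output; the function name and the choice of criterion are assumptions for illustration only:

```python
import numpy as np

def correlation_filter(X, y, k):
    """Rank features by |Pearson correlation| with the output and keep the top k.

    A filter-style criterion: it looks only at the data, independently of any
    learning algorithm that will be trained afterwards.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Pearson correlation of each column of X with y
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12
    )
    return np.argsort(-np.abs(corr))[:k]  # indices of the k highest-scoring inputs
```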

Two broad classes of variable selection methods: filter and wrapper
In the wrapper method, the learning model is used as part of the evaluation function and also to induce the final learning model. The parameters of the model are optimized by minimizing some cost function. Finally, the set of inputs can be selected using LOO, bootstrap or other re-sampling techniques.
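A minimal wrapper-style sketch, assuming scikit-learn is available; the greedy forward search and the choice of logistic regression as the wrapped learner are illustrative assumptions, not part of the slides:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def wrapper_forward_search(X, y, max_features):
    """Greedy forward wrapper: each candidate subset is scored by
    cross-validating the learner itself on that subset."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        scores = {
            j: cross_val_score(LogisticRegression(max_iter=1000),
                               X[:, selected + [j]], y, cv=5).mean()
            for j in remaining
        }
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:
            break  # no candidate improves the cross-validation estimate
        best_score = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected
```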

Comparison of filter and wrapper:
- The wrapper method tries to solve the real problem, hence the criterion of interest can really be optimized; but it is potentially very time consuming, since it typically needs to evaluate a cross-validation scheme at every iteration.
- The filter method is much faster, but it does not incorporate learning.

Embedded methods
In contrast to filter and wrapper approaches, in embedded methods the feature selection part cannot be separated from the learning part. The structure of the class of functions under consideration plays a crucial role. Existing embedded methods are reviewed based on a unifying mathematical framework.

Embedded methods
- Forward-Backward Methods
- Optimization of scaling factors
- Sparsity term

Forward-Backward Methods
- Forward selection methods: these methods start with one or a few features, selected according to method-specific selection criteria. More features are iteratively added until a stopping criterion is met.
- Backward elimination methods: methods of this type start with all features and iteratively remove one feature, or bunches of features, at a time.
- Nested methods: during an iteration, features can be added as well as removed from the selected set.

Forward selection
- Forward selection with least squares
- Grafting
- Decision trees

Forward selection with least squares
1. Start with S = ∅ and residual r = Y.
2. Find the component i ∉ S for which the residual norm ‖Y − P_{S∪{i}} Y‖ is minimal, where P_S denotes the least-squares projection onto the span of the variables indexed by S.
3. Add i to S.
4. Recompute the residuals r = Y − P_S Y.
5. Stop, or go back to step 2.
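A minimal sketch of this greedy procedure in Python (assuming NumPy; stopping after a fixed number of selected variables is an assumption for illustration):

```python
import numpy as np

def forward_selection_ls(X, Y, n_features):
    """Greedy forward selection with least squares.

    At each step, add the variable whose inclusion minimizes the norm of the
    residual Y - P_S Y, where P_S projects Y onto the span of the selected columns.
    """
    S = []
    for _ in range(n_features):
        best_i, best_norm = None, np.inf
        for i in range(X.shape[1]):
            if i in S:
                continue
            Xs = X[:, S + [i]]
            # Least-squares fit = projection of Y onto the span of the chosen columns
            coef, *_ = np.linalg.lstsq(Xs, Y, rcond=None)
            norm = np.linalg.norm(Y - Xs @ coef)
            if norm < best_norm:
                best_i, best_norm = i, norm
        S.append(best_i)
    return S
```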

Grafting
For a fixed regularization parameter λ, Perkins suggested minimizing a function of the form
    C(w) = Σ_k L(f_w(x_k), y_k) + λ Σ_j |w_j|
over the set of parameters w which defines f_w. To solve this in a forward way, in every iteration the working set of parameters is extended by one, and the newly obtained objective function is minimized over the enlarged working set. The selection criterion for a new parameter is the magnitude of the gradient of the objective with respect to it: the candidate with the largest gradient magnitude is added, provided that magnitude exceeds λ.
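A rough sketch of a grafting-style loop, assuming a logistic loss with labels in {−1, +1} and SciPy for the inner optimization; the particular loss, optimizer and function names are assumptions, not from the slides:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def grafting_logistic(X, y, lam, max_features):
    """Forward (grafting-style) selection for an l1-penalized logistic loss."""
    n, d = X.shape
    w = np.zeros(d)
    active = []

    def data_loss_and_grad(w_full):
        margins = y * (X @ w_full)
        loss = np.mean(np.logaddexp(0.0, -margins))               # mean logistic loss
        grad = -(X * (y * expit(-margins))[:, None]).mean(axis=0)
        return loss, grad

    while len(active) < max_features:
        _, grad = data_loss_and_grad(w)
        inactive = [j for j in range(d) if j not in active]
        j_best = max(inactive, key=lambda j: abs(grad[j]))
        if abs(grad[j_best]) <= lam:
            break  # the l1 penalty would keep this weight at zero: stop
        active.append(j_best)

        def objective(wa):                                        # re-optimize over the working set
            w_full = np.zeros(d)
            w_full[active] = wa
            loss, _ = data_loss_and_grad(w_full)
            return loss + lam * np.abs(wa).sum()

        w[active] = minimize(objective, w[active], method="Powell").x
    return w, active
```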

Decision trees
Decision trees are built iteratively by splitting the data depending on the value of a specific feature. A widely used criterion for the importance of a feature is the mutual information between feature X_i and the outputs Y:
    MI(X_i, Y) = H(Y) − H(Y | X_i),
where H is the entropy and H(Y | X_i) is the conditional entropy of the outputs given the feature.
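A small sketch of this criterion for a discrete feature and discrete labels (continuous features would first be discretized, e.g. by thresholding at a candidate split point); the function name is an assumption:

```python
import numpy as np

def mutual_information(x, y):
    """MI(X_i, Y) = H(Y) - H(Y | X_i) for a discrete feature x and labels y."""
    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    h_y_given_x = 0.0
    for v in np.unique(x):
        mask = (x == v)
        h_y_given_x += mask.mean() * entropy(y[mask])  # weighted entropy of each branch
    return entropy(y) - h_y_given_x
```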

Backward Elimination
Recursive Feature Elimination (RFE), given that one wishes to employ only r of the input dimensions in the final decision rule, attempts to find the best subset of size r by a kind of greedy backward selection. Algorithm of RFE in the linear case:
1: repeat
2:   Find w and b by training a linear SVM.
3:   Remove the feature with the smallest value of w_i².
4: until r features remain.
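A minimal sketch of this loop, assuming scikit-learn's linear SVM and a binary classification problem:

```python
import numpy as np
from sklearn.svm import SVC

def rfe_linear_svm(X, y, n_keep):
    """Recursive Feature Elimination with a linear SVM (binary labels assumed)."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        clf = SVC(kernel="linear").fit(X[:, remaining], y)
        w = clf.coef_.ravel()            # weight vector of the trained linear SVM
        worst = int(np.argmin(w ** 2))   # feature with the smallest squared weight
        del remaining[worst]
    return remaining
```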

Embedded methods
- Forward-Backward Methods
- Optimization of scaling factors
- Sparsity term

Optimization of scaling factors
- Scaling Factors for SVM
- Automatic Relevance Determination
- Variable Scaling: Extension to Maximum Entropy Discrimination

Scaling Factors for SVM
Feature selection is performed by scaling the input variables by a vector σ; larger values of σ_i indicate more useful features. The problem is then one of choosing the best kernel of the form
    k_σ(x, x') = k(σ ∗ x, σ ∗ x'),
where ∗ denotes element-wise multiplication. We wish to find the optimal scaling parameters σ, which can be optimized using many criteria, e.g. gradient descent on the R²W² bound, the span bound, or a validation error.
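A sketch of such a scaled kernel with an RBF base kernel, assuming NumPy; the base kernel and parameter names are illustrative assumptions:

```python
import numpy as np

def scaled_rbf_kernel(X1, X2, sigma, gamma=1.0):
    """k_sigma(x, x') = k(sigma * x, sigma * x') with an RBF base kernel k.

    sigma is a per-feature scaling vector; entries near zero effectively switch
    the corresponding features off. In practice sigma would be tuned by gradient
    descent on a bound or on validation error.
    """
    Xs1 = X1 * sigma                     # element-wise scaling of every input
    Xs2 = X2 * sigma
    sq_dists = (
        (Xs1 ** 2).sum(axis=1)[:, None]
        + (Xs2 ** 2).sum(axis=1)[None, :]
        - 2.0 * Xs1 @ Xs2.T
    )
    return np.exp(-gamma * sq_dists)
```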

Optimization of scaling factors
- Scaling Factors for SVM
- Automatic Relevance Determination
- Variable Scaling: Extension to Maximum Entropy Discrimination

Automatic Relevance Determination
In a probabilistic framework, a model of the likelihood of the data P(y|w) is chosen, as well as a prior on the weight vector P(w). To predict the output at a test point x, the average of f_w(x) over the posterior distribution P(w|y) is computed; in practice this is often approximated by predicting with f_{w_MAP}, where w_MAP is the Maximum a Posteriori (MAP) estimate of the parameters:
    w_MAP = argmax_w P(w|y) = argmax_w P(y|w) P(w).
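As a hedged illustration, assuming scikit-learn is available, an ARD-style linear regression is provided by ARDRegression; the synthetic data below is purely illustrative:

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only inputs 0 and 3 are relevant to the output
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

ard = ARDRegression().fit(X, y)
print(np.round(ard.coef_, 3))  # weights of the irrelevant inputs are driven toward zero
```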

Variable Scaling: Extension to Maximum Entropy Discrimination
The Maximum Entropy Discrimination (MED) framework is a probabilistic model in which one does not learn point values of the model parameters, but distributions over them. Feature selection can easily be integrated into this framework: for this purpose, one has to specify a prior probability p_0 that a feature is active.

Variable Scaling: Extension to Maximum Entropy Discrimination
If w_i is the weight associated with a given feature in a linear model, its expectation under the MED distribution is modified (shrunk) according to the prior p_0 that the feature is active. This has the effect of discarding the components with very small expected weights: the algorithm ignores features whose weights are smaller than a threshold.

Sparsity term
In the case of linear models, indicator variables are not necessary, as feature selection can be enforced on the parameters of the model directly. This is generally done by adding a sparsity term to the objective function that the model minimizes.
- Feature Selection as an Optimization Problem
- Concave Minimization

Feature Selection as an Optimization Problem
Most linear models that we consider can be understood as the result of the following minimization:
    min_w Σ_k L(f_w(x_k), y_k) + λ Ω(w),
where L(f_w(x_k), y_k) measures the loss of the function f_w on the training point (x_k, y_k) and Ω(w) is a regularization (sparsity) term.

Feature Selection as an Optimization Problem
Examples of empirical losses are:
1. l1 hinge loss: L(f(x), y) = max(0, 1 − y f(x))
2. l2 loss: L(f(x), y) = (y − f(x))²
3. Logistic loss: L(f(x), y) = log(1 + exp(−y f(x)))
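The same three losses written out in code, assuming labels y ∈ {−1, +1} for the hinge and logistic losses:

```python
import numpy as np

# f_x is the real-valued output f(x) of the model; y is the target.
def hinge_loss(f_x, y):        # l1 hinge loss, y in {-1, +1}
    return np.maximum(0.0, 1.0 - y * f_x)

def squared_loss(f_x, y):      # l2 loss
    return (y - f_x) ** 2

def logistic_loss(f_x, y):     # logistic loss, y in {-1, +1}
    return np.log1p(np.exp(-y * f_x))
```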

Concave Minimization
In the case of linear models, feature selection can be understood as the optimization problem of minimizing the number of non-zero weights, i.e. the l0 norm ‖w‖_0, while fitting the data. Since this is hard to optimize directly, concave approximations are used. For example, Bradley proposed to approximate the l0 norm by the function Σ_i (1 − exp(−α |w_i|)) for some α > 0. Weston et al. use a slightly different function: they replace the l0 norm by Σ_i log(ε + |w_i|).
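A tiny sketch of the two concave surrogates; the particular constants α and ε are illustrative assumptions:

```python
import numpy as np

def l0_exp_surrogate(w, alpha=5.0):
    """Bradley-style concave surrogate: sum_i (1 - exp(-alpha * |w_i|))."""
    return np.sum(1.0 - np.exp(-alpha * np.abs(w)))

def l0_log_surrogate(w, eps=1e-3):
    """Weston-style surrogate: sum_i log(eps + |w_i|)."""
    return np.sum(np.log(eps + np.abs(w)))
```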

Summary of embedded methods
Embedded methods are built upon the concept of scaling factors. We discussed embedded methods according to how they approximate the proposed optimization problems:
- Explicit removal or addition of features: the scaling factors are optimized over the discrete set {0, 1}^n in a greedy iteration;
- Optimization of scaling factors over the compact interval [0, 1]^n; and
- Linear approaches that directly enforce sparsity of the model parameters.

Thank you!