Lecture 4: Embedded methods

Lecture 4: Embedded methods. Isabelle Guyon, isabelle@clopinet.com

Filters, Wrappers, and Embedded methods
[Diagram: three pipelines.]
Filter: all features -> filter -> feature subset -> predictor.
Wrapper: all features -> wrapper, which trains the predictor on multiple candidate feature subsets -> predictor.
Embedded method: all features -> predictor that selects its own feature subset as part of training.

Filters
Methods:
- Criterion: measure feature / feature-subset "relevance"
- Search: usually order features (individual feature ranking or nested subsets of features)
- Assessment: use statistical tests
Results:
- (Relatively) robust against overfitting
- May fail to select the most "useful" features

Wrappers
Methods:
- Criterion: a risk functional
- Search: search the space of feature subsets
- Assessment: use cross-validation
Results:
- Can in principle find the most "useful" features, but
- Are prone to overfitting

Embedded Methods
Methods:
- Criterion: a risk functional
- Search: search guided by the learning process
- Assessment: use cross-validation
Results:
- Similar to wrappers, but
- Less computationally expensive
- Less prone to overfitting

Three "Ingredients"
Filters, wrappers, and embedded methods all combine the same three ingredients, but make different choices for each:
- Assessment: statistical tests, cross-validation, performance bounds.
- Criterion: single-feature relevance, relevance in context, feature-subset relevance for a learning machine.
- Search: single-feature ranking; nested subsets (forward selection / backward elimination); heuristic or stochastic search; exhaustive search.

Forward Selection (wrapper)
[Diagram: search tree starting from the empty set; n candidate features at the first step, then n-1, n-2, ..., 1, with alternative paths explored.]
Also referred to as SFS: Sequential Forward Selection. (A code sketch follows below.)
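
A minimal SFS wrapper sketch in Python, assuming scikit-learn is available; the cross-validated ridge scorer and the fixed subset size are illustrative choices, not part of the lecture.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def sfs_wrapper(X, y, n_keep, cv=5):
    """Sequential Forward Selection: greedily add the feature whose
    inclusion yields the best cross-validated score of the predictor."""
    remaining = list(range(X.shape[1]))
    selected = []
    while len(selected) < n_keep:
        scores = {
            j: cross_val_score(Ridge(), X[:, selected + [j]], y, cv=cv).mean()
            for j in remaining
        }
        best = max(scores, key=scores.get)   # best one-feature extension
        selected.append(best)
        remaining.remove(best)
    return selected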

Forward Selection (embedded)
[Diagram: the same n, n-1, n-2, ..., 1 progression, but without branching.]
Guided search: we do not consider alternative paths.

Forward Selection with GS
Stoppiglia, 2002. Gram-Schmidt orthogonalization.
- Select a first feature X_ν(1) with maximum cosine with the target: cos(x_i, y) = x_i . y / (||x_i|| ||y||).
- For each remaining feature X_i: project X_i and the target Y onto the null space of the features already selected, and compute the cosine of X_i with the target in that projection.
- Select the feature X_ν(k) with maximum cosine with the target in the projection.
An embedded method for the linear least-squares predictor.
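
A sketch of Gram-Schmidt forward selection for a linear least-squares predictor, following the steps above; the function name, the fixed number of selected features, and the toy data are my own illustrative choices.

import numpy as np

def gram_schmidt_forward_selection(X, y, n_selected):
    """Rank features by their cosine with the target, orthogonalizing the
    remaining features and the target against each selected feature."""
    X = X.astype(float).copy()
    y = y.astype(float).copy()
    remaining = list(range(X.shape[1]))
    selected = []
    for _ in range(n_selected):
        # Cosine of each remaining feature with the (projected) target.
        cosines = {
            j: abs(X[:, j] @ y) / (np.linalg.norm(X[:, j]) * np.linalg.norm(y) + 1e-12)
            for j in remaining
        }
        best = max(cosines, key=cosines.get)
        selected.append(best)
        remaining.remove(best)
        # Project the target and the remaining features onto the
        # orthogonal complement of the selected feature.
        u = X[:, best] / (np.linalg.norm(X[:, best]) + 1e-12)
        y = y - (u @ y) * u
        for j in remaining:
            X[:, j] = X[:, j] - (u @ X[:, j]) * u
    return selected

# Toy usage: 100 samples, 10 features, only features 0 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.normal(size=100)
print(gram_schmidt_forward_selection(X, y, n_selected=2))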

Forward Selection w. Trees
Tree classifiers, like CART (Breiman, 1984) or C4.5 (Quinlan, 1993).
[Diagram: starting from all the data, choose f1, then choose f2, ...]
At each step, choose the feature that "reduces entropy" most. Work towards "node purity".
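
The "reduces entropy most" criterion can be made concrete with the information gain of a single binary split; a small sketch, where the threshold handling is my own simplification of what CART/C4.5 actually do.

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature_values, labels, threshold):
    """Entropy reduction obtained by splitting on feature <= threshold."""
    left = labels[feature_values <= threshold]
    right = labels[feature_values > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - child

# At each node the tree greedily picks the (feature, threshold) pair with
# the largest information gain, i.e. the split that most increases purity.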

Backward Elimination (wrapper)
Also referred to as SBS: Sequential Backward Selection.
[Diagram: search tree starting from all n features; features are removed one at a time, n, n-1, n-2, ..., 1, with alternative paths explored.]

Backward Elimination (embedded)
[Diagram: the same n, n-1, n-2, ..., 1 progression, but without branching; the search is guided by the learning process.]

Backward Elimination: RFE
RFE-SVM, Guyon, Weston, et al., 2002.
- Start with all the features.
- Train a learning machine f on the current subset of features by minimizing a risk functional J[f].
- For each (remaining) feature X_i, estimate, without retraining f, the change in J[f] resulting from the removal of X_i.
- Remove the feature X_ν(k) whose removal improves or least degrades J.
An embedded method for SVMs, kernel methods, and neural nets.
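
A minimal RFE sketch for a linear SVM, assuming scikit-learn's LinearSVC and binary labels; eliminating one feature per round and ranking by squared weight follow the usual RFE recipe, while the names and the C value are illustrative.

import numpy as np
from sklearn.svm import LinearSVC

def rfe_linear_svm(X, y, n_keep, C=1.0):
    """For a linear model, the change in J from dropping feature i is
    approximated by w_i**2, so the smallest |w_i| is removed each round."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        clf = LinearSVC(C=C).fit(X[:, remaining], y)
        w = clf.coef_.ravel()
        worst = remaining[int(np.argmin(w ** 2))]  # ranking criterion w_i^2
        remaining.remove(worst)
    return remaining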

Scaling Factors
Idea: transform a discrete space into a continuous space, s = [s1, s2, s3, s4].
- Discrete indicators of feature presence: s_i ∈ {0, 1}.
- Continuous scaling factors: s_i ∈ ℝ.
Now we can do gradient descent!

Formalism (chap. 5)
(Next few slides: André Elisseeff.)
Definition: an embedded feature selection method is a machine learning algorithm that returns a model using a limited number of features.
[Diagram: training set -> learning algorithm -> output.]

Parameterization
Consider the following set of functions f(α, σ ∘ x), parameterized by α and by σ ∈ {0,1}^n, where σ represents the use (σ_i = 1) or rejection (σ_i = 0) of feature i.
[Diagram: e.g. σ_1 = 1, σ_3 = 0.]

Example: Kernel methods
f(α, σ ∘ x) = Σ_i α_i k(σ ∘ x_i, σ ∘ x)
[Diagram: data matrix X of m examples x_i with n features, coefficient vector α, and selection vector σ = (1, 1, 1, …, 0, 0, 0, …).]

Feature selection as an optimization problem
Find α and σ that minimize a risk functional:
R[f] = ∫ L(f(α, σ ∘ x), y) dP(x, y),
where L is the loss function and P(x, y) is the unknown data distribution.
Problem: we do not know P(x, y)… all we have are training examples (x1, y1), (x2, y2), …, (xm, ym).

Approximations of R[f]
- Empirical risk: R_train[f] = (1/m) Σ_{i=1..m} L(f(x_i; w), y_i).
- Guaranteed risk: with probability (1 − δ), R[f] ≤ R_gua[f], where R_gua[f] = R_train[f] + ε(δ, C).
- Structural risk minimization: S_k = { w : ||w||^2 < w_k^2 }, w_1 < w_2 < … < w_k; minimize R_train[f] subject to ||w||^2 < w_k^2.
- Regularized risk: R_reg[f, γ] = R_train[f] + γ ||w||^2.
Speaker notes: If you only minimize the training error you run into the overfitting problem: you will not generalize well. In the risk-minimization framework we address the classification or regression problem directly. The PAC framework is quite general (even estimating a sample mean fits it). δ and ε are positive values: δ is given, the risk of being wrong that we are willing to accept; ε is a function of the complexity (capacity) of the family of functions, of the number of training examples, and of δ. The VC dimension is one such capacity measure: the largest number of examples one can shatter (assign to labels in all possible ways). A problem is PAC-learnable if there exists a learning algorithm that attains the (ε, δ) condition for given values of ε and δ; it is efficiently PAC-learnable if such an algorithm runs in time polynomial in 1/ε and 1/δ. The sample complexity is the smallest training-set size that makes the problem PAC-learnable. The two parts of the objective correspond to: loss -> noise model; regularizer -> uncertainty about the model.

Carrying out the optimization
How to minimize the risk with respect to both α and σ? Most approaches nest the two problems: for each candidate σ, minimize the risk estimate with respect to α, and search over σ on the outside.
This optimization is often done by relaxing the constraint σ ∈ {0,1}^n to σ ∈ [0,1]^n.

Add/Remove features 1
Many learning algorithms are cast as the minimization of some regularized functional G(α, σ) = empirical error + regularization (capacity control).
What does G(σ) become if one feature is removed? Sometimes G can only increase… (e.g. SVM).

Add/Remove features 2
It can be shown (under some conditions) that the removal of one feature will induce a change in G proportional to the gradient of f with respect to the i-th feature at the training points x_k.
Examples: linear SVM; RFE (with Ω(α) = Ω(w) = Σ_i w_i^2).

Add/Remove features - RFE
Recursive Feature Elimination:
- Minimize the estimate of R(α, σ) with respect to α.
- Minimize the estimate of R(α, σ) with respect to σ, under the constraint that only a limited number of features may be selected.

Add/Remove feature summary
Many algorithms can be turned into embedded methods for feature selection by the following approach:
- Choose an objective function that measures how well the model returned by the algorithm performs.
- "Differentiate" this objective function (or perform a sensitivity analysis) with respect to the σ parameter, i.e. ask how its value changes when one feature is removed and the algorithm is rerun.
- Select the features whose removal (resp. addition) induces the desired change in the objective function (i.e. minimize an error estimate, maximize alignment with the target, etc.).
What makes this an embedded method is the use of the structure of the learning algorithm to compute the gradient and to search for / weight the relevant features.

Gradient descent 1
How to minimize the risk with respect to α and σ? Most approaches nest the two minimizations, as before. Would it make sense to perform just a gradient step on σ here too? That is, a gradient step in [0,1]^n.

Gradient descent 2
Advantages of this approach:
- It can be applied to non-linear systems (e.g. SVM with Gaussian kernels).
- It can mix the search for features with the search for an optimal regularization parameter and/or other kernel parameters.
Drawback:
- Heavy computations; we are back to the usual issues of gradient-based learning algorithms (early stopping, initialization, etc.).

Gradient descent summary
Many algorithms can be turned into embedded methods for feature selection by the following approach (a sketch follows below):
- Choose an objective function that measures how well the model returned by the algorithm performs.
- Differentiate this objective function with respect to the σ parameter.
- Perform a gradient descent on σ. At each iteration, rerun the initial learning algorithm to compute its solution on the newly scaled feature space.
- Stop when there are no more changes (or use early stopping, etc.).
- Threshold the σ values to get the list of features and retrain the algorithm on that subset of features.
The difference from the add/remove approach is the search strategy. It still uses the inner structure of the learning model, but it scales features rather than selecting them.
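
A sketch of gradient descent on feature-scaling factors, assuming a ridge-regression inner solver and a validation-error objective; the alternating scheme, step size, iteration count, and threshold are illustrative choices rather than the exact procedure from the lecture. In practice one would add a penalty on s or use early stopping to drive useless scalings toward zero.

import numpy as np

def scaled_ridge_feature_selection(X_tr, y_tr, X_va, y_va,
                                   lam=1.0, lr=0.1, n_iter=200, thresh=0.2):
    n = X_tr.shape[1]
    s = np.ones(n)                              # continuous scaling factors
    for _ in range(n_iter):
        Xs = X_tr * s                           # rescaled feature space
        # Inner problem: ridge-regression solution w for the current s.
        w = np.linalg.solve(Xs.T @ Xs + lam * np.eye(n), Xs.T @ y_tr)
        # Gradient of the validation squared error with respect to s,
        # holding w fixed: d/ds_j mean_i (w . (s * x_i) - y_i)^2.
        r_va = (X_va * s) @ w - y_va
        grad_s = (2.0 / len(y_va)) * (X_va * w).T @ r_va
        s = np.clip(s - lr * grad_s, 0.0, 1.0)  # gradient step in [0,1]^n
    selected = np.flatnonzero(s > thresh)       # threshold, then retrain
    return selected, s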

Changing the structure
- Shrinkage (weight decay, ridge regression, SVM): S_k = { w : ||w||_2 < w_k }, w_1 < w_2 < … < w_k; equivalently γ_1 > γ_2 > γ_3 > … > γ_k (γ is the ridge).
- Feature selection (0-norm SVM): S_k = { w : ||w||_0 < σ_k }, σ_1 < σ_2 < … < σ_k (σ_k is the number of features).
- Feature selection (lasso regression, 1-norm SVM): S_k = { w : ||w||_1 < β_k }, β_1 < β_2 < … < β_k.
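
A small illustration of the 1-norm as an embedded feature selector, assuming scikit-learn's Lasso; the regularization strength alpha plays the role of the bound β_k on ||w||_1, and the toy data are my own.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 5] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)   # weights driven exactly to zero drop features
print(selected)                          # typically only the informative columns survive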

The l0 SVM
- Replace the regularizer ||w||^2 by the l0 norm Σ_i 1[w_i ≠ 0].
- Further replace the l0 norm by Σ_i log(ε + |w_i|).
- This boils down to a multiplicative update algorithm (a sketch follows below).
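
A sketch of the multiplicative-update scheme in the spirit of the zero-norm approximation (Weston et al.), assuming scikit-learn's LinearSVC as the inner solver and binary labels; the renormalization, iteration count, and tolerance are my own stabilizing choices.

import numpy as np
from sklearn.svm import LinearSVC

def l0_svm_multiplicative(X, y, C=1.0, n_iter=20, tol=1e-6):
    z = np.ones(X.shape[1])                  # per-feature scaling factors
    for _ in range(n_iter):
        clf = LinearSVC(C=C).fit(X * z, y)   # train on rescaled features
        w = clf.coef_.ravel()
        z = z * np.abs(w)                    # multiplicative update
        z = z / (z.max() + 1e-12)            # renormalize for numerical stability
    return np.flatnonzero(z > tol)           # surviving (selected) features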

The l1 SVM
The version of the SVM where ||w||^2 is replaced by the l1 norm Σ_i |w_i| can be considered an embedded method:
- Only a limited number of weights will be non-zero (the l1 norm tends to remove redundant features).
- This differs from the regular SVM, where redundant features are all included (non-zero weights).

Mechanical interpretation
[Figure: contours of J in the (w1, w2) plane around w*.]
- Ridge regression: J = λ ||w||_2^2 + ||w − w*||^2, equivalently J = ||w||_2^2 + (1/λ) ||w − w*||^2.
- Lasso: J = ||w||_1 + (1/λ) ||w − w*||^2.
A short derivation of the resulting solutions follows below.
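
To see why the 1-norm yields exact zeros while the ridge only shrinks, here is a one-dimensional working of the two objectives above (my own derivation sketch, not from the slides):

Ridge: J(w) = w^2 + (1/λ)(w − w*)^2. Setting J'(w) = 2w + (2/λ)(w − w*) = 0 gives
    w = w* / (1 + λ),
which shrinks w* but never reaches zero exactly.

Lasso: J(w) = |w| + (1/λ)(w − w*)^2. The subgradient condition gives the soft-thresholding solution
    w = sign(w*) · max(|w*| − λ/2, 0),
so any component with |w*| ≤ λ/2 is set exactly to zero: the 1-norm performs feature selection.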

Embedded method - summary
Embedded methods are a good inspiration for designing new feature selection techniques for your own algorithms:
- Find a functional that represents your prior knowledge about what a good model is.
- Add the scaling weights s into the functional and make sure it is either differentiable or amenable to an efficient sensitivity analysis.
- Optimize alternately with respect to α and s.
- Use early stopping (validation set) or your own stopping criterion to stop and select the subset of features.
Embedded methods are therefore not too far from wrapper techniques and can be extended to multiclass problems, regression, etc.

Book of the NIPS 2003 challenge: Feature Extraction, Foundations and Applications, I. Guyon et al., Eds., Springer, 2006. http://clopinet.com/fextract-book