Feature Selection for SVMs
J. Weston et al., NIPS 2000
오장민 (2000/01/04)
Second reference: Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999
Abstract
A method of feature selection for Support Vector Machines (SVMs)
An efficient wrapper method
Superior to some standard feature selection algorithms
1. Introduction
Importance of feature selection in supervised learning: generalization performance, running-time requirements, and interpretational issues
SVMs: state of the art in classification and regression
Objective: select a subset of features while preserving or improving the discriminative ability of a classifier
2. The Feature Selection problem
Definition: find the m << n features with the smallest expected generalization error, which is unknown and must be estimated.
Given a fixed set of functions $y = f(x, \alpha)$, find a preprocessing of the data $x \mapsto (x * \sigma)$, $\sigma \in \{0,1\}^n$, and the parameters $\alpha$ minimizing
$$ \tau(\sigma, \alpha) = \int V(y, f((x * \sigma), \alpha)) \, dF(x, y) $$
subject to $\|\sigma\|_0 = m$, where $x * \sigma$ denotes the elementwise product, $F(x, y)$ is unknown, $V(\cdot, \cdot)$ is a loss function, and $\|\cdot\|_0$ counts nonzero elements.
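A minimal sketch of this problem statement, assuming scikit-learn is available: the binary mask sigma is applied elementwise to the inputs, and the unknown expected error under F(x, y) is replaced by a cross-validation estimate. The random dataset and the m = 2 mask below are purely illustrative.

```python
# Sketch: mask features with sigma and estimate the generalization error by CV.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def estimated_error(X, y, sigma):
    """Estimate the error of an SVM trained on the masked data x * sigma."""
    X_masked = X * sigma            # elementwise product: keeps features where sigma_i = 1
    scores = cross_val_score(SVC(kernel="linear"), X_masked, y, cv=5)
    return 1.0 - scores.mean()      # CV error as a stand-in for the unknown expected loss

# Illustrative data: keep only the first m = 2 of 20 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=100))
sigma = np.zeros(20)
sigma[:2] = 1                       # ||sigma||_0 = m = 2
print(estimated_error(X, y, sigma))
```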
Four issues in feature selection
Starting point: forward / backward / middle
Search organization: exhaustive search (2^N) / heuristic search
Evaluation strategy: filter (independent of any learning algorithm) / wrapper (uses the bias of a particular induction algorithm)
Stopping criterion
from Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999
Feature filters
Forward selection: only additions to the subset
Backward elimination: only deletions from the subset
Example: FOCUS [Almuallim and Dietterich 92]
Finds the minimum combination of features that divides the training data into pure classes
Selects the feature minimizing entropy, i.e. the most discriminating feature (one whose value differs between positive and negative examples)
from Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999
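A toy sketch of the FOCUS idea just described, under the assumption of discrete features and an n small enough for exhaustive search; the function name and the XOR example are illustrative, not code from the FOCUS paper or Hall's thesis.

```python
# Toy FOCUS sketch: exhaustively look for the smallest feature subset under which
# no two examples with identical feature values carry different labels.
from itertools import combinations

def focus(X, y):
    n = len(X[0])
    for size in range(1, n + 1):                      # smallest subsets first
        for subset in combinations(range(n), size):
            seen = {}
            consistent = True
            for row, label in zip(X, y):
                key = tuple(row[i] for i in subset)
                if seen.setdefault(key, label) != label:
                    consistent = False                # same values, different labels
                    break
            if consistent:
                return subset
    return tuple(range(n))

# Example with binary features: only features 0 and 1 determine the label (XOR).
X = [[0, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 0]]
y = [0, 1, 1, 0]
print(focus(X, y))   # -> (0, 1)
```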
Wrappers
Use an induction algorithm to estimate the merit of feature subsets
Take the inductive bias of the induction algorithm into account
Better results but slower (the induction algorithm is called repeatedly)
from Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999
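A sketch of a generic wrapper as characterized above: greedy forward selection that scores each candidate subset by cross-validated accuracy of the induction algorithm itself (scikit-learn's SVC is assumed here as a stand-in learner). This illustrates why wrappers are accurate but slow; it is not the method proposed in the paper.

```python
# Generic wrapper sketch: greedy forward selection driven by CV accuracy.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def forward_selection(X, y, estimator, max_features):
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        # Score every candidate subset by calling the induction algorithm (slow part).
        scores = [(cross_val_score(estimator, X[:, selected + [j]], y, cv=5).mean(), j)
                  for j in remaining]
        score, j = max(scores)
        if score <= best_score:        # stopping criterion: no improvement
            break
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected

# Usage sketch (X, y as in any binary classification task):
# features = forward_selection(X, y, SVC(kernel="linear"), max_features=10)
```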
Correlation-based Feature Selection
Relevant: a feature is relevant if its values vary systematically with category membership
Redundant: a feature is redundant if one or more of the other features are correlated with it
A good feature subset is one that contains features highly correlated with the class, yet uncorrelated with each other
from Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999
Pearson Correlation coefficient
$$ r_{zc} = \frac{k\,\bar{r}_{zi}}{\sqrt{k + k(k-1)\,\bar{r}_{ii}}} $$
z: outside variable (the class), c: composite (the feature subset), k: number of components, $\bar{r}_{zi}$: average correlation between the components and the outside variable, $\bar{r}_{ii}$: average inter-correlation between the components
Higher $\bar{r}_{zi}$ → higher $r_{zc}$; lower $\bar{r}_{ii}$ → higher $r_{zc}$; larger k → higher $r_{zc}$
from Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999
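A minimal sketch of the merit formula above, estimating $\bar{r}_{zi}$ and $\bar{r}_{ii}$ from data with absolute Pearson correlations; the helper name and the synthetic example are illustrative assumptions, not code from Hall's thesis.

```python
# CFS-style merit: k * mean|r_zi| / sqrt(k + k(k-1) * mean|r_ii|) for a feature subset.
import numpy as np

def cfs_merit(X, y, subset):
    Xs = X[:, subset]
    k = Xs.shape[1]
    r_zi = np.mean([abs(np.corrcoef(Xs[:, i], y)[0, 1]) for i in range(k)])
    if k == 1:
        return r_zi
    r_ii = np.mean([abs(np.corrcoef(Xs[:, i], Xs[:, j])[0, 1])
                    for i in range(k) for j in range(i + 1, k)])
    return k * r_zi / np.sqrt(k + k * (k - 1) * r_ii)

# Example: two informative-but-redundant features vs. one informative + one noise feature.
rng = np.random.default_rng(1)
y = rng.choice([-1.0, 1.0], size=200)
X = np.column_stack([y + rng.normal(size=200),
                     y + rng.normal(size=200),
                     rng.normal(size=200)])
print(cfs_merit(X, y, [0]), cfs_merit(X, y, [0, 1]), cfs_merit(X, y, [0, 2]))
```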
2. The Feature Selection problem (cont'd)
Goal: take advantage of the performance increase of wrapper methods whilst avoiding their computational complexity.
3. Support Vector Learning
Idea: map x into a high-dimensional space $\Phi(x)$ and construct an optimal hyperplane in that space.
The mapping $\Phi(\cdot)$ is performed implicitly by a kernel function $K(\cdot, \cdot)$.
Decision function given by the SVM: $f(x) = \mathrm{sign}\big(\sum_i \alpha_i^0 y_i K(x_i, x) + b\big)$
The optimal hyperplane is the large-margin classifier (training reduces to a QP problem).
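A hedged sketch of the decision function above, assuming the dual coefficients, labels, support vectors, bias, and an RBF kernel are given; the names are illustrative. A fitted scikit-learn SVC exposes comparable quantities as dual_coef_, support_vectors_, and intercept_.

```python
# Evaluate f(x) = sign(sum_i alpha_i * y_i * K(x_i, x) + b) for a kernel SVM.
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def svm_decision(x, support_vectors, alphas, labels, bias, kernel=rbf_kernel):
    value = sum(a * yi * kernel(xi, x)
                for a, yi, xi in zip(alphas, labels, support_vectors))
    return np.sign(value + bias)
```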
3. Support Vector Learning
Theorem 1. If the images $\Phi(x_1), \dots, \Phi(x_l)$ of training data of size $l$ belong to a sphere of size $R$ and are separable with the corresponding margin $M$, then the expectation of the error probability has the bound
$$ E\,P_{err} \le \frac{1}{l} E\left\{ \frac{R^2}{M^2} \right\}, $$
where the expectation is taken over sets of training data of size $l$.
Theorem 1 justifies that the performance depends on both $R$ and $M$, where $R$ is controlled by the mapping function $\Phi(\cdot)$.
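A rough illustration of the two quantities in the bound for a linear kernel, assuming scikit-learn: the margin M = 1/||w|| of a (nearly) hard-margin SVM, and R approximated by the radius of a sphere centred at the data mean, which upper-bounds the minimal enclosing radius. This only sketches the bound's ingredients; it is not the paper's computation.

```python
# Crude R^2 / M^2 estimate for a linear-kernel SVM.
import numpy as np
from sklearn.svm import SVC

def radius_margin_ratio(X, y, C=1e3):
    clf = SVC(kernel="linear", C=C).fit(X, y)            # large C: nearly hard margin
    w = clf.coef_.ravel()
    margin = 1.0 / np.linalg.norm(w)                      # M for the canonical hyperplane
    radius = np.max(np.linalg.norm(X - X.mean(axis=0), axis=1))  # crude R (centroid sphere)
    return (radius / margin) ** 2                         # R^2 / M^2

# Smaller R^2 / M^2 suggests a smaller bound on the expected error (Theorem 1).
```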
4. Feature Selection for SVMs
Recall the objective: minimize $\tau(\sigma, \alpha)$ via $\sigma$ and $\alpha$.
Enlarge the set of functions by defining them on the masked inputs $x * \sigma$, i.e. the kernel is computed on masked data, $K_\sigma(x, x') = K(x * \sigma, x' * \sigma)$.
4. Feature Selection for SVMs
Minimize $R^2 W^2(\alpha^0, \sigma)$ over $\sigma$, where $W^2(\alpha^0, \sigma) = \sum_i \alpha_i^0$ is the inverse squared margin obtained from the SVM dual solution ($1/M^2 = \|w\|^2 = \sum_i \alpha_i^0$ for the maximal-margin hyperplane) and $R^2(\sigma)$ is the squared radius of the smallest sphere enclosing the mapped training points (the radius/margin bound from Vapnik's statistical learning theory).
4. Feature Selection for SVMs
Finding the minimum of $R^2 W^2$ over $\sigma$ requires searching over all possible subsets of n features, which is a combinatorial problem.
Approximate algorithm: approximate the binary-valued vector $\sigma \in \{0,1\}^n$ with a real-valued vector $\sigma \in \mathbb{R}^n$, and optimize $R^2 W^2$ by gradient descent (a simplified sketch follows below).
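A simplified sketch of this relaxation, assuming scikit-learn and a linear kernel: scale each feature by sigma_i, take W^2 = ||w||^2 from the fitted SVM and a crude centroid-based R^2, and update sigma with finite-difference gradient steps. The paper computes the gradient analytically; the finite differences, step size, and final top-m selection below are illustrative assumptions, not the authors' exact procedure.

```python
# Relaxed feature selection sketch: gradient descent on sigma to shrink R^2 * W^2.
import numpy as np
from sklearn.svm import SVC

def r2w2(X, y, sigma, C=1e3):
    Xs = X * sigma                                        # elementwise feature scaling
    clf = SVC(kernel="linear", C=C).fit(Xs, y)
    w2 = np.sum(clf.coef_ ** 2)                           # W^2 = ||w||^2
    r2 = np.max(np.sum((Xs - Xs.mean(axis=0)) ** 2, axis=1))  # crude R^2 (centroid sphere)
    return r2 * w2

def select_features(X, y, m, steps=30, lr=0.1, eps=1e-3):
    n = X.shape[1]
    sigma = np.ones(n)
    for _ in range(steps):
        base = r2w2(X, y, sigma)
        grad = np.array([(r2w2(X, y, sigma + eps * np.eye(n)[i]) - base) / eps
                         for i in range(n)])              # finite-difference gradient
        sigma = np.clip(sigma - lr * grad / (np.abs(grad).max() + 1e-12), 0.0, 1.0)
    return np.argsort(sigma)[-m:]                         # keep the m largest-scaled features

# Usage sketch: selected = select_features(X, y, m=2)
```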
4. Feature Selection for SVMs
Summary: find the minimum of $\tau(\sigma, \alpha)$ by minimizing the approximate integer program.
For a large enough penalty, as $p \to 0$, only m elements of $\sigma$ will be nonzero, approximating the original optimization problem $\tau(\sigma, \alpha)$.
5. Experiments (Toy data)
Comparison: standard SVMs, the proposed feature selection algorithm, and three classical filter methods (Pearson correlation coefficients, Fisher criterion score, Kolmogorov-Smirnov test).
Setup: linear SVM for the linear problem, second-order polynomial kernel for the nonlinear problem; the 2 best features were selected for comparison.
Fisher criterion score:
$$ F(r) = \frac{\left| \mu_r^{+} - \mu_r^{-} \right|}{\sigma_r^{+2} + \sigma_r^{-2}} $$
where $\mu_r^{\pm}$ is the mean value of the r-th feature in the positive and negative classes and $\sigma_r^{\pm 2}$ the corresponding variance.
Kolmogorov-Smirnov test:
$$ KS(r) = \sqrt{\ell}\, \sup_{\alpha} \left| \hat{P}\{f_r \le \alpha\} - \hat{P}\{f_r \le \alpha,\ y = 1\} \right| $$
where $f_r$ denotes the r-th feature from each training example, and $\hat{P}$ is the corresponding empirical distribution.
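A sketch of per-feature filter scores in the spirit of these definitions, assuming NumPy, SciPy, and labels in {+1, -1}: the Fisher criterion from class-conditional means and variances, and a Kolmogorov-Smirnov statistic via scipy.stats.ks_2samp, which compares the two class-conditional samples rather than the marginal used in the definition above.

```python
# Per-feature filter scores: Fisher criterion and a two-sample KS statistic.
import numpy as np
from scipy.stats import ks_2samp

def fisher_score(X, y):
    pos, neg = X[y == 1], X[y == -1]
    return np.abs(pos.mean(0) - neg.mean(0)) / (pos.var(0) + neg.var(0) + 1e-12)

def ks_score(X, y):
    return np.array([ks_2samp(X[y == 1, r], X[y == -1, r]).statistic
                     for r in range(X.shape[1])])

# Rank features by either score and keep the top-m columns.
```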
5. Experiments (Toy data)
Data
Linear data: 6 of 202 features were relevant. The probabilities of y = 1 and y = -1 are equal. With probability 0.7, $\{x_1, x_2, x_3\}$ were drawn as $x_i = y\,N(i, 1)$ and $\{x_4, x_5, x_6\}$ as $x_i = y\,N(i-3, 1)$; otherwise (probability 0.3) $x_i = N(0, 1)$. The remaining features are noise, $x_i = N(0, 20)$.
Nonlinear data: 2 of 52 features were relevant. If y = -1, $\{x_1, x_2\}$ are drawn from $N(\mu_1, \Sigma)$ or $N(\mu_2, \Sigma)$ with equal probability, with $\Sigma = I$. If y = 1, $\{x_1, x_2\}$ are drawn from two other normal distributions with equal probability. The remaining features are noise, $x_i = N(0, 20)$.
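A sketch that generates the linear toy data as described above, assuming NumPy. Whether N(0, 20) denotes a standard deviation or a variance of 20 is ambiguous in the slide; a standard deviation of 20 is assumed below.

```python
# Linear toy data: 202 features, 6 relevant, labels y in {+1, -1} with equal probability.
import numpy as np

def linear_toy_data(n_samples, n_features=202, seed=0):
    rng = np.random.default_rng(seed)
    y = rng.choice([-1.0, 1.0], size=n_samples)
    X = rng.normal(0.0, 20.0, size=(n_samples, n_features))       # noise features
    for i in range(1, 7):
        mean = i if i <= 3 else i - 3
        relevant = y * rng.normal(mean, 1.0, size=n_samples)      # x_i = y * N(mean, 1)
        keep = rng.random(n_samples) < 0.7                        # with probability 0.7
        X[:, i - 1] = np.where(keep, relevant, rng.normal(0.0, 1.0, size=n_samples))
    return X, y

X, y = linear_toy_data(100)
```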
5. Experiments (Toy data)
[Figure] (a) Linear problem and (b) nonlinear problem, both with many irrelevant features. The x-axis is the number of training points, and the y-axis the test error as a fraction of test points.
5. Experiments (Real-life Data)
Comparison
(1) Minimizing $R^2 W^2$
(2) Rank-ordering features according to how much they change the decision boundary, and removing those that cause the least change
(3) Fisher criterion score
5. Experiments (Real-life Data)
Data
Face detection: training set 2,429/13,229, test set 104/2M, dimension 1,740
Pedestrian detection: training set 924/10,044, test set 124/800K, dimension 1,326
Cancer morphology classification: training set 38, test set 34, dimension 7,129; 1 error using all genes; 0 errors for (1) using 20 genes, 5 errors for (2), 3 errors for (3)
5. Experiments (Real-life Data)
[Figure] (a) ROC curves for face detection: the top curves are for 725 features and the bottom ones for 120 features. (b) ROC curves for pedestrian detection using all features and 120 features. Solid: all features; solid with a circle: (1); dotted and dashed: (2); dotted: (3).
6. Conclusion
Introduced a wrapper method to perform feature selection for SVMs
Computationally feasible for high-dimensional datasets compared to existing wrapper methods
Superior to some filter methods