
Feature Selection for SVMs, J. Weston et al., NIPS 2000. Presented by 오장민 (2000/01/04). Second reference: Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999.

Abstract A method of feature selection for Support Vector Machines (SVMs). An efficient wrapper method. Superior to some standard feature selection algorithms.

1. Introduction Importance of feature selection in supervised learning: generalization performance, running time requirements, and interpretational issues. SVM: state of the art in classification and regression. Objective: to select a subset of features while preserving or improving the discriminative ability of a classifier.

2. The Feature Selection Problem Definition: find the m << n features with the smallest expected generalization error, which is unknown and must be estimated. Given a fixed set of functions y = f(x, α), find a preprocessing of the data x ↦ (x ∗ σ), σ ∈ {0,1}^n, and the parameters α minimizing τ(σ, α) = ∫ V(y, f((x ∗ σ), α)) dF(x, y), subject to ||σ||_0 = m, where x ∗ σ denotes the elementwise product, F(x, y) is unknown, V(·, ·) is a loss function, and ||·||_0 counts the nonzero elements of σ.

Four Issues in Feature Selection Starting point: forward / backward / middle. Search organization: exhaustive search (2^N subsets) or heuristic search. Evaluation strategy: filter (independent of any learning algorithm) or wrapper (uses the bias of a particular induction algorithm). Stopping criterion. (from Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999)

Feature filters Forward selection: only addition to the subset. Backward elimination: only deletion from the subset. Example: FOCUS [Almuallim and Dietterich 92] finds the minimum combination of features that divides the training data into pure classes, favoring features that minimize entropy; the most discriminating feature is one whose value differs between positive and negative examples. A minimal sketch of this exhaustive search is shown below. (from Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999)
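The following is a minimal Python sketch of the FOCUS-style exhaustive search, not the original implementation; the function name `focus` and the toy data are hypothetical, and binary or discrete feature values are assumed.

```python
# Minimal sketch of the FOCUS idea: exhaustively search feature subsets of
# increasing size until one divides the training data into pure classes
# (i.e., no two examples with the same projection have different labels).
from itertools import combinations

def focus(X, y):
    """X: list of feature tuples, y: class labels.
    Returns the smallest consistent subset of feature indices, or None."""
    n = len(X[0])
    for size in range(1, n + 1):                      # smallest subsets first
        for subset in combinations(range(n), size):
            seen = {}
            consistent = True
            for xi, yi in zip(X, y):
                key = tuple(xi[j] for j in subset)    # projection onto the subset
                if seen.setdefault(key, yi) != yi:    # same projection, different class
                    consistent = False
                    break
            if consistent:
                return subset
    return None

# Example: feature 0 alone determines the class, so FOCUS returns (0,).
X = [(0, 1, 0), (0, 0, 1), (1, 1, 1), (1, 0, 0)]
y = [0, 0, 1, 1]
print(focus(X, y))   # -> (0,)
```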

Feature wrappers Use an induction algorithm to estimate the merit of feature subsets, thereby taking the inductive bias of that algorithm into account. Better results than filters, but slower, because the induction algorithm is called repeatedly; see the sketch below. (from Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999)
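The sketch below illustrates wrapper-style evaluation under assumed settings: the helper `wrapper_merit`, the choice of a linear SVM as the induction algorithm, 5-fold cross-validation, and greedy forward search are all illustrative choices, not the thesis's exact procedure.

```python
# Wrapper evaluation sketch: the induction algorithm itself scores each
# candidate feature subset via cross-validated accuracy.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def wrapper_merit(X, y, subset, cv=5):
    """Merit of a feature subset = mean CV accuracy of an SVM trained on it."""
    clf = SVC(kernel="linear", C=1.0)
    return cross_val_score(clf, X[:, list(subset)], y, cv=cv).mean()

def forward_select(X, y, m):
    """Greedy forward selection driven by the wrapper merit
    (one of many possible search organizations; see the four issues above)."""
    selected, remaining = [], set(range(X.shape[1]))
    while len(selected) < m:
        best = max(remaining, key=lambda j: wrapper_merit(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected
```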

Correlation-based Feature Selection Relevant: a feature is relevant if its values vary systematically with category membership. Redundant: a feature is redundant if one or more of the other features are correlated with it. A good feature subset is one that contains features highly correlated with the class, yet uncorrelated with each other. (from Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999)

Pearson Correlation Coefficient For a composite c of k components, the correlation with an outside variable z is r_zc = k · r̄_zi / sqrt(k + k(k−1) · r̄_ii), where z is the outside variable (the class), c is the composite, k is the number of components, r̄_zi is the average correlation between the components and the outside variable, and r̄_ii is the average inter-correlation between components. Consequences: higher r̄_zi gives higher r_zc; lower r̄_ii gives higher r_zc; larger k gives higher r_zc. (from Mark A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, 1999)
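A minimal sketch of this composite-correlation merit, assuming numeric features and a numeric class label and using absolute Pearson correlations; the helper name `cfs_merit` is hypothetical.

```python
# CFS-style merit: k * mean(feature-class correlation) /
# sqrt(k + k*(k-1) * mean(feature-feature correlation)).
import numpy as np

def cfs_merit(X, y, subset):
    """Merit of a feature subset (list of column indices) of X w.r.t. class y."""
    k = len(subset)
    r_zi = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_zi
    r_ii = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * r_zi / np.sqrt(k + k * (k - 1) * r_ii)
```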

2. The Feature Selection Problem (cont'd) Goal: take advantage of the performance increase of wrapper methods whilst avoiding their computational complexity.

3. Support Vector Learning Idea: map x into a high-dimensional space Φ(x) and construct an optimal hyperplane in this space. The mapping Φ(·) is performed implicitly through a kernel function K(x, x') = ⟨Φ(x), Φ(x')⟩. The decision function given by the SVM is f(x) = sign(Σ_i α_i⁰ y_i K(x_i, x) + b). The optimal hyperplane is the large-margin classifier; finding it reduces to a QP problem.
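As an illustration of the decision function, here is a small sketch with an RBF kernel; the coefficients α_i⁰ and b are assumed to come from an already-solved QP, and all names are hypothetical.

```python
# Evaluate f(x) = sign(sum_i alpha_i * y_i * K(x_i, x) + b) for a kernel SVM.
import numpy as np

def rbf_kernel(u, v, gamma=0.5):
    """Gaussian RBF kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    return np.exp(-gamma * np.sum((u - v) ** 2))

def svm_decision(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    """Sign of the kernel expansion over the support vectors."""
    s = sum(a * y_i * kernel(sv, x)
            for a, y_i, sv in zip(alphas, labels, support_vectors))
    return np.sign(s + b)
```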

3. Support Vector Learning (cont'd) Theorem 1. If the images Φ(x_1), …, Φ(x_l) of training data of size l belong to a sphere of radius R and are separable with margin M, then the expectation of the error probability has the bound E[P_err] ≤ (1/l) E[R²/M²], where the expectation is taken over sets of training data of size l. Theorem 1 justifies that performance depends on both R and M, and R is controlled by the mapping function Φ(·).

4. Feature Selection for SVMs Recall the objective: minimize τ(σ, α) over σ and α. Enlarge the set of functions from f(x, α) to f((x ∗ σ), α), i.e., compute the kernel on the scaled inputs, K_σ(x, x') = K(x ∗ σ, x' ∗ σ).

4. Feature Selection for SVMs Minimize R²W² over σ (following Vapnik's statistical learning theory), where W²(α⁰, σ) = Σ_{i,j} α_i⁰ α_j⁰ y_i y_j K_σ(x_i, x_j) = 1/M² for the maximal-margin solution α⁰, and R²(σ) is the squared radius of the smallest sphere enclosing the mapped training points, R²(σ) = max_β [Σ_i β_i K_σ(x_i, x_i) − Σ_{i,j} β_i β_j K_σ(x_i, x_j)] subject to Σ_i β_i = 1, β_i ≥ 0.

4. Feature Selection for SVMs Finding the minimum of R²W² over σ ∈ {0,1}^n requires searching over all possible subsets of n features, which is a combinatorial problem. Approximate algorithm: approximate the binary-valued vector σ ∈ {0,1}^n with a real-valued vector σ, and optimize R²W² by gradient descent.

4. Feature Selection for SVMs Summary: find the minimum of τ(σ, α) by minimizing an approximation of the integer programming problem; for a large enough penalty, as p → 0 only m elements of σ remain nonzero, approximating the original optimization problem τ(σ, α). A sketch of the resulting gradient-descent procedure follows.
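A hedged sketch of the approximate procedure, not the authors' implementation: it relaxes σ to real values, scales the inputs elementwise, and follows a finite-difference gradient of an R²W²-style criterion. As simplifications, R² is approximated by the squared radius of the scaled data around its mean in input space and W² by ||w||² of a linear SVM; all function names are assumptions.

```python
# Gradient-based feature scaling sketch for the R^2 W^2 criterion.
import numpy as np
from sklearn.svm import SVC

def r2w2(X, y, sigma, C=1e3):
    """Approximate R^2 * W^2 for inputs scaled elementwise by sigma."""
    Xs = X * sigma                                   # x * sigma (elementwise)
    clf = SVC(kernel="linear", C=C).fit(Xs, y)
    w = clf.coef_.ravel()
    W2 = float(w @ w)                                # ||w||^2 = 1/M^2
    R2 = float(np.max(np.sum((Xs - Xs.mean(0)) ** 2, axis=1)))
    return R2 * W2

def select_features(X, y, m, steps=50, lr=0.05, eps=1e-3):
    """Relax sigma to real values, descend on r2w2, keep the m largest scales."""
    sigma = np.ones(X.shape[1])
    for _ in range(steps):
        base = r2w2(X, y, sigma)
        grad = np.zeros_like(sigma)
        for j in range(len(sigma)):                  # finite-difference gradient
            s = sigma.copy(); s[j] += eps
            grad[j] = (r2w2(X, y, s) - base) / eps
        grad /= (np.linalg.norm(grad) + 1e-12)       # normalized step for stability
        sigma = np.clip(sigma - lr * grad, 0.0, None)
    return np.argsort(sigma)[-m:]                    # indices of the m largest scales
```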

5. Experiments (Toy Data) Comparison: standard SVMs, the proposed feature selection algorithm, and three classical filter methods (Pearson correlation coefficients, Fisher criterion score, Kolmogorov-Smirnov test). Description: a linear SVM for the linear problem and a second-order polynomial kernel for the nonlinear problem; the 2 best features were selected for comparison.

Fisher criterion score: the difference between the mean values of the r-th feature in the positive and negative classes, normalized by the within-class variances. Kolmogorov-Smirnov test: the maximum difference between the empirical distribution functions of the r-th feature in the two classes, where f_r denotes the r-th feature from each training example and P̂ is the corresponding empirical distribution.
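For concreteness, a sketch of the two filter scores for a single feature column, using standard textbook forms; the paper's exact normalization may differ slightly, and the helper names are hypothetical.

```python
# Per-feature filter scores for a binary problem with labels in {-1, +1}.
import numpy as np
from scipy.stats import ks_2samp

def fisher_score(f, y):
    """|mean difference| between classes, normalized by within-class variances."""
    pos, neg = f[y == 1], f[y == -1]
    return abs(pos.mean() - neg.mean()) / (pos.var() + neg.var() + 1e-12)

def ks_score(f, y):
    """Max gap between the empirical CDFs of the feature in the two classes."""
    return ks_2samp(f[y == 1], f[y == -1]).statistic
```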

5. Experiments (Toy Data) Data. Linear problem: 6 of 202 features were relevant; P(y = 1) = P(y = −1) = 0.5; {x_1, x_2, x_3} were drawn as x_i = y·N(i, 1) with probability 0.7 and as x_i = N(0, 1) with probability 0.3; {x_4, x_5, x_6} were drawn as x_i = y·N(i − 3, 1) with probability 0.7 and as x_i = N(0, 1) with probability 0.3; the remaining features are noise, x_i = N(0, 20). Nonlinear problem: 2 of 52 features were relevant; if y = −1, {x_1, x_2} are drawn from N(μ_1, Σ) or N(μ_2, Σ) with equal probability, Σ = I; if y = 1, {x_1, x_2} are drawn from two other normal distributions; the remaining features are noise, x_i = N(0, 20).
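A sketch that generates the linear toy problem as described above; the function name, the independent per-feature coin flips, and treating 20 as a standard deviation are assumptions based on this slide's wording.

```python
# Generate the linear toy data: 202 features, 6 relevant, the rest noise.
import numpy as np

def linear_toy_data(n_samples, n_features=202, rng=np.random.default_rng(0)):
    y = rng.choice([-1, 1], size=n_samples)                   # P(y=1) = P(y=-1) = 0.5
    X = rng.normal(0.0, 20.0, size=(n_samples, n_features))   # noise features N(0, 20)
    for s in range(n_samples):
        for i in range(1, 4):      # x1..x3: y*N(i, 1) w.p. 0.7, else N(0, 1)
            X[s, i - 1] = y[s] * rng.normal(i, 1) if rng.random() < 0.7 else rng.normal(0, 1)
        for i in range(4, 7):      # x4..x6: y*N(i-3, 1) w.p. 0.7, else N(0, 1)
            X[s, i - 1] = y[s] * rng.normal(i - 3, 1) if rng.random() < 0.7 else rng.normal(0, 1)
    return X, y
```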

5. Experiments (Toy Data) Figure: (a) the linear problem and (b) the nonlinear problem, both with many irrelevant features. The x-axis is the number of training points, and the y-axis is the test error as a fraction of test points.

5. Experiments (Real-life Data) Comparison: (1) minimizing R²W²; (2) rank-ordering features according to how much they change the decision boundary and removing those that cause the least change; (3) Fisher criterion score.

5. Experiments (Real-life Data) Data. Face detection: training set 2,429/13,229, test set 104/2M, dimension 1,740. Pedestrian detection: training set 924/10,044, test set 124/800K, dimension 1,326. Cancer morphology classification: training set 38, test set 34, dimension 7,129; 1 error using all genes; 0 errors for (1) using 20 genes, 5 errors for (2), 3 errors for (3).

5. Experiments (Real-life Data) Figure: (a) ROC curves for face detection, the top curves using 725 features and the bottom ones 120 features; (b) ROC curves for pedestrian detection using all features and 120 features. Solid: all features; solid with a circle: (1); dotted and dashed: (2); dotted: (3).

6. Conclusion Introduced a wrapper method to perform feature selection for SVMs. Computationally feasible for high-dimensional datasets compared to existing wrapper methods. Superior to some filter methods.