
Feature Selection in Nonlinear Kernel Classification
Olvi Mangasarian & Edward Wild
University of Wisconsin Madison
Workshop on Optimization-Based Data Mining Techniques with Applications
IEEE International Conference on Data Mining
Omaha, Nebraska, October 28, 2007

Example
(Figure: two classes of points plotted against the features $x_1$ and $x_2$.)
- The data is nonlinearly separable: in general, nonlinear kernels use both $x_1$ and $x_2$
- The best linear classifier that uses only one feature selects the feature $x_1$
- However, the data is nonlinearly separable using only the feature $x_2$
- Feature selection in nonlinear classification is important

Outline
- Minimize the number of input space features selected by a nonlinear kernel classifier
- Start with a standard 1-norm nonlinear support vector machine (SVM)
- Add a 0-1 diagonal matrix to suppress or keep features
- This leads to a nonlinear mixed-integer program
- Introduce an algorithm to obtain a good local solution to the resulting mixed-integer program
- Evaluate the algorithm on two public datasets from the UCI repository and on synthetic NDCC data

Support Vector Machines
(Figure: points of the two classes separated by the nonlinear surface $K(x', A')u = \gamma$, with bounding surfaces $K(x', A')u = \gamma + 1$ and $K(x', A')u = \gamma - 1$.)
- $x \in R^n$
- The SVM is defined by the parameters $u$ and the threshold $\gamma$ of the nonlinear surface $K(x', A')u = \gamma$
- $A$ contains all data points: the rows labeled $+$ form $A_+$ and the rows labeled $-$ form $A_-$
- $e$ is a vector of ones
- Bounding constraints: $K(A_+, A')u \ge e\gamma + e$ and $K(A_-, A')u \le e\gamma - e$
- The slack variable $y \ge 0$ allows points to be on the wrong side of the bounding surfaces
- Minimize $e's$ ($= \|u\|_1$ at the solution) to reduce overfitting
- Minimize $e'y$ (the hinge loss, i.e. the plus function $\max\{\cdot, 0\}$) to fit the data
- Linear kernel: $(K(A, B))_{ij} = (AB)_{ij} = A_i B_{\cdot j} = K(A_i, B_{\cdot j})$
- Gaussian kernel with parameter $\mu$: $(K(A, B))_{ij} = \exp(-\mu \, \|A_i' - B_{\cdot j}\|^2)$
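
To make the formulation concrete, here is a minimal sketch (not the authors' code) that poses the 1-norm nonlinear SVM above as a linear program and solves it with SciPy. The names u, gamma, y, s and the Gaussian parameter mu follow the slide's notation; the explicit trade-off weight nu on the hinge term and the LP layout are my own assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def gaussian_kernel(A, B, mu):
    """(K(A, B'))_ij = exp(-mu * ||A_i - B_j||^2), with data points as rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * sq)

def svm_1norm(A, d, nu=1.0, mu=1.0):
    """Solve  min  nu*e'y + e's
       s.t.  D(K(A, A')u - e*gamma) + y >= e,  y >= 0,  -s <= u <= s."""
    m = A.shape[0]
    K = gaussian_kernel(A, A, mu)
    I, Z = np.eye(m), np.zeros((m, m))
    # Decision vector x = [u (m), gamma (1), y (m), s (m)].
    c = np.concatenate([np.zeros(m), [0.0], nu * np.ones(m), np.ones(m)])
    # Classification constraints:  -D K u + d*gamma - y <= -e.
    A1 = np.hstack([-(d[:, None] * K), d[:, None], -I, Z])
    # Absolute-value constraints:  u - s <= 0  and  -u - s <= 0, so e's = ||u||_1.
    A2 = np.hstack([I, np.zeros((m, 1)), Z, -I])
    A3 = np.hstack([-I, np.zeros((m, 1)), Z, -I])
    A_ub = np.vstack([A1, A2, A3])
    b_ub = np.concatenate([-np.ones(m), np.zeros(2 * m)])
    bounds = [(None, None)] * (m + 1) + [(0, None)] * (2 * m)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    u, gamma = res.x[:m], res.x[m]
    return u, gamma

def classify(x, A, u, gamma, mu=1.0):
    """Class of a new point x is the sign of K(x', A')u - gamma."""
    return np.sign(gaussian_kernel(np.atleast_2d(x), A, mu) @ u - gamma)
```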

Reduced Feature SVM
- Start with the full SVM: all features are present in the kernel matrix $K(A, A')$
- Replace $A$ with $AE$, where $E$ is a diagonal $n \times n$ matrix with $E_{ii} \in \{1, 0\}$, $i = 1, \dots, n$
- If $E_{ii}$ is 0, the $i$-th feature is removed
- To suppress features, add the number of features present ($e'Ee$) to the objective with weight $\sigma \ge 0$
- As $\sigma$ is increased, more features will be removed from the classifier
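
A small illustration (my own, not from the paper) of how the diagonal matrix E acts inside a Gaussian kernel: zeroing a diagonal entry removes that feature from every kernel evaluation, and e'Ee simply counts the features kept.

```python
import numpy as np

def masked_gaussian_kernel(A, B, E_diag, mu):
    """Gaussian kernel on AE and BE: distances use only the features with E_ii = 1."""
    AE, BE = A * E_diag, B * E_diag           # right-multiplication by diag(E_diag)
    sq = ((AE[:, None, :] - BE[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * sq)

A = np.random.rand(5, 4)                      # 5 points with 4 features
E_diag = np.array([1, 0, 1, 0])               # keep features 1 and 3, drop 2 and 4
K = masked_gaussian_kernel(A, A, E_diag, mu=0.5)   # this is K(AE, (AE)')
features_present = int(E_diag.sum())          # e'Ee, the term weighted by sigma
```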

Reduced Feature SVM (RFSVM)
1) Initialize the diagonal matrix $E$ randomly.
2) For the fixed 0-1 values $E$, solve the SVM linear program to obtain $(u, \gamma, y, s)$.
3) Fix $(u, \gamma, s)$ and sweep through $E$ repeatedly as follows: for each component of $E$, replace 1 by 0 (and conversely) provided the change decreases the overall objective function by more than tol.
4) Go to (3) if a change was made in the last sweep; otherwise continue to (5).
5) Solve the SVM linear program with the new matrix $E$. If the objective decrease is less than tol, stop; otherwise go to (3).
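
The sketch below shows step 3, the integer sweep, in my own notation. For fixed (u, gamma, s) the objective can be evaluated in closed form, since the optimal slack is y = max(e - D(K(AE, (AE)')u - e*gamma), 0); steps 2 and 5 would solve the 1-norm SVM linear program sketched earlier with A replaced by AE. The weight names nu and sigma are assumptions.

```python
import numpy as np

def masked_gaussian_kernel(A, B, E_diag, mu):
    AE, BE = A * E_diag, B * E_diag
    sq = ((AE[:, None, :] - BE[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * sq)

def rfsvm_objective(A, d, u, gamma, s_norm, E_diag, nu, sigma, mu):
    """nu * hinge loss + ||u||_1 + sigma * (number of features kept)."""
    K = masked_gaussian_kernel(A, A, E_diag, mu)
    y = np.maximum(1.0 - d * (K @ u - gamma), 0.0)      # optimal slack for fixed (u, gamma, E)
    return nu * y.sum() + s_norm + sigma * E_diag.sum()

def sweep_features(A, d, u, gamma, s_norm, E_diag, nu, sigma, mu, tol=1e-6):
    """Step 3: flip single entries of diag(E) while the objective drops by more than tol."""
    E_diag = E_diag.copy()                              # 0/1 numpy array of length n
    best = rfsvm_objective(A, d, u, gamma, s_norm, E_diag, nu, sigma, mu)
    changed = True
    while changed:                                      # step 4: repeat sweeps until no change
        changed = False
        for i in range(len(E_diag)):
            E_diag[i] = 1 - E_diag[i]                   # tentatively flip feature i
            trial = rfsvm_objective(A, d, u, gamma, s_norm, E_diag, nu, sigma, mu)
            if trial < best - tol:
                best, changed = trial, True             # keep the flip
            else:
                E_diag[i] = 1 - E_diag[i]               # revert the flip
    return E_diag, best
```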

RFSVM Convergence (for tol = 0)
- The objective function value converges
  - Each step decreases the objective
  - The objective is bounded below by 0
- The limit of the objective function value is attained at any accumulation point of the sequence of iterates
- Such an accumulation point is a "local minimum solution"
  - The continuous variables are optimal for the fixed integer variables
  - Changing any single integer variable will not decrease the objective

Experimental Results
- Classification accuracy versus number of features used
- Compare our RFSVM to Relief and RFE (Recursive Feature Elimination)
- Results given on two public datasets from the UCI repository
- Ability of RFSVM to handle problems with up to 1000 features tested on synthetic NDCC datasets
- Feature selection parameter set to $\sigma = 1$

Relief and RFE
- Relief (Kira and Rendell, 1992)
  - Filter method: feature selection is a preprocessing procedure
  - Features are selected as relevant if they tend to have different feature values for points in different classes
- RFE (Recursive Feature Elimination; Guyon, Weston, Barnhill, and Vapnik, 2002)
  - Wrapper method: feature selection is based on classification
  - Features are selected as relevant if removing them causes a large change in the margin of an SVM
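
For orientation only, here is a generic illustration of the RFE wrapper idea using scikit-learn on synthetic data; this is an assumption of mine and not the comparison code used in these experiments.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)        # only features 0 and 3 are relevant

# Repeatedly refit a linear SVM and discard the least important features.
selector = RFE(SVC(kernel="linear"), n_features_to_select=2, step=1)
selector.fit(X, y)
print(selector.support_)                          # boolean mask of the selected features
```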

Ionosphere Dataset: 351 points in $R^{34}$
(Figure: cross-validation accuracy versus number of features used, with baselines for the nonlinear SVM with no feature selection and the linear 1-norm SVM.)
- Even for feature selection parameter $\sigma = 0$, some features may be removed when removing them decreases the hinge loss
- Accuracy decreases slightly until about 10 features remain, and then decreases more sharply as they are removed
- If the appropriate value of $\sigma$ is selected, RFSVM can obtain higher accuracy using fewer features than SVM1

Normally Distributed Clusters on Cubes (NDCC) Dataset (Thompson, 2006)
- Points are generated from normal distributions centered at vertices of 1-norm cubes
- The dataset is not linearly separable
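
The following is a rough, heavily simplified sketch of the NDCC idea and not Thompson's actual generator: normal clusters are centered at the vertices of a 1-norm cube, and the alternating class assignment over the vertices is an assumption chosen so that no single hyperplane separates the two classes.

```python
import numpy as np

def ndcc_like(n_points, n_dims, radius=4.0, noise=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Vertices of a 1-norm cube of the given radius: +/- radius * e_i.
    centers = np.vstack([radius * np.eye(n_dims), -radius * np.eye(n_dims)])
    labels = np.array([1, -1] * n_dims)           # assumed alternating class assignment
    idx = rng.integers(len(centers), size=n_points)
    X = centers[idx] + noise * rng.standard_normal((n_points, n_dims))
    return X, labels[idx]

X, d = ndcc_like(200, 20)                         # 200 points, 20 features, labels in {+1, -1}
```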

RFSVM vs. SVM without Feature Selection (NKSVM1) on NDCC Data
(Two figures: average accuracy of NKSVM1 and RFSVM on 1000 test points, one for data with 100 true features and 1000 irrelevant features, and one for data with 20 true features and varying numbers of irrelevant features.)
- Each point is the average test set correctness over 10 datasets with 200 training, 200 tuning, and 1000 testing points
- When 480 irrelevant features are added, the accuracy of RFSVM is 45% higher than that of NKSVM1

Conclusion
- New rigorous formulation with a precise objective for feature selection in nonlinear SVM classifiers
- Obtain a local solution to the resulting mixed-integer program
  - Alternate between a linear program to compute the continuous variables and successive sweeps to update the integer variables
- Efficiently learns accurate nonlinear classifiers with reduced numbers of features
- Handles problems with 1000 features, 900 of which are irrelevant

Questions?
- Websites with links to papers and talks
- NDCC generator

Running Time on the Ionosphere Dataset
- Averages 5.7 sweeps through the integer variables
- Averages 3.4 linear programs
- 75% of the time is consumed in objective function evaluations
- 15% of the time is consumed in solving linear programs
- The complete experiment (1960 runs) took 1 hour
  - 3 GHz Pentium 4
  - Written in MATLAB
  - CPLEX 9.0 used to solve the linear programs
  - Gaussian kernel written in C

Sonar Dataset: 208 points in $R^{60}$
(Figure: cross-validation accuracy versus number of features used.)

Related Work
- Approaches that use specialized kernels
  - Weston, Mukherjee, Chapelle, Pontil, Poggio, and Vapnik, 2000: structural risk minimization
  - Gold, Holub, and Sollich, 2005: Bayesian interpretation
  - Zhang, 2006: smoothing spline ANOVA kernels
- Margin-based approach
  - Fröhlich and Zell, 2004: remove features if there is little change to the margin when they are removed
- Other approaches that combine feature selection with basis reduction
  - Bi, Bennett, Embrechts, Breneman, and Song, 2003
  - Avidan, 2004

Future Work
- Datasets with more features
- Reduce the number of objective function evaluations
  - Limit the number of integer cycles
  - Other ways to update the integer variables
- Application to regression problems
- Automatic choice of the feature selection parameter $\sigma$

Algorithm
- A global solution to the nonlinear mixed-integer program cannot be found efficiently: it would require solving $2^n$ linear programs
- For fixed values of the integer diagonal matrix, the problem is reduced to an ordinary SVM linear program
- Solution strategy: alternate optimization of the continuous and integer variables
  - For fixed values of $E$, solve a linear program for $(u, \gamma, y, s)$
  - For fixed values of $(u, \gamma, s)$, sweep through the components of $E$ and make updates which decrease the objective function
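
To make the $2^n$ point concrete, a global approach would have to enumerate every 0-1 diagonal matrix E, as in the sketch below; solve_svm_lp is a hypothetical placeholder for one SVM linear-program solve (for instance the 1-norm SVM sketched earlier with A replaced by AE).

```python
from itertools import product

def best_feature_subset(n_features, solve_svm_lp):
    """Brute force over all 2^n 0-1 masks; solve_svm_lp(mask) returns the SVM LP objective."""
    best_mask, best_obj = None, float("inf")
    for mask in product([0, 1], repeat=n_features):   # 2**n_features iterations
        obj = solve_svm_lp(mask)                       # one linear program per mask
        if obj < best_obj:
            best_mask, best_obj = mask, obj
    return best_mask, best_obj
```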

Notation
- Data points are represented as rows of an $m \times n$ matrix $A$
- Data labels of +1 or -1 are given as elements of an $m \times m$ diagonal matrix $D$
- Example (XOR): 4 points in $R^2$
  - Points (0, 1) and (1, 0) have label +1
  - Points (0, 0) and (1, 1) have label -1
- Kernel $K(A, B): R^{m \times n} \times R^{n \times k} \to R^{m \times k}$
  - Linear kernel: $(K(A, B))_{ij} = (AB)_{ij} = A_i B_{\cdot j} = K(A_i, B_{\cdot j})$
  - Gaussian kernel with parameter $\mu$: $(K(A, B))_{ij} = \exp(-\mu \, \|A_i' - B_{\cdot j}\|^2)$
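
A short sketch of this notation (my own code, not the authors'): the XOR example written as the matrix A and the diagonal label matrix D, together with the two kernels exactly as defined above.

```python
import numpy as np

A = np.array([[0, 1], [1, 0], [0, 0], [1, 1]], dtype=float)   # XOR: 4 points in R^2 (rows)
d = np.array([1.0, 1.0, -1.0, -1.0])                          # labels of the 4 points
D = np.diag(d)                                                # m x m diagonal label matrix

def linear_kernel(A, B):
    """(K(A, B))_ij = A_i B_.j, i.e. the matrix product AB (B is n x k)."""
    return A @ B

def gaussian_kernel(A, B, mu):
    """(K(A, B))_ij = exp(-mu * ||A_i' - B_.j||^2), with B given as an n x k matrix."""
    sq = ((A[:, None, :] - B.T[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * sq)

K_lin = linear_kernel(A, A.T)            # K(A, A') is 4 x 4
K_rbf = gaussian_kernel(A, A.T, mu=1.0)  # Gaussian K(A, A'), also 4 x 4
```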

Methodology
- UCI datasets
  - To reduce running time, 1/11 of each dataset was used as a tuning set to select the SVM tradeoff parameter and the Gaussian kernel parameter $\mu$
  - The remaining 10/11 was used for 10-fold cross validation
  - The procedure was repeated 5 times for each dataset with a different random choice of tuning set each time
- NDCC
  - Generate multiple datasets with 200 training, 200 tuning, and 1000 testing points
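
A sketch of the UCI protocol as read from this slide; the helper name fit_and_score and the split layout are assumptions, not the authors' scripts, and the 5 repetitions with different random tuning sets would simply wrap this function.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

def uci_protocol(X, y, fit_and_score, seed=0):
    """fit_and_score(X_tr, y_tr, X_te, y_te, X_tune, y_tune) -> accuracy (placeholder)."""
    # Hold out 1/11 of the data as the tuning set.
    X_rest, X_tune, y_rest, y_tune = train_test_split(
        X, y, test_size=1 / 11, random_state=seed)
    # 10-fold cross validation on the remaining 10/11.
    scores = []
    for tr, te in KFold(n_splits=10, shuffle=True, random_state=seed).split(X_rest):
        scores.append(fit_and_score(X_rest[tr], y_rest[tr],
                                    X_rest[te], y_rest[te], X_tune, y_tune))
    return np.mean(scores)
```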