Model Selection via Bilevel Optimization Kristin P. Bennett, Jing Hu, Xiaoyun Ji, Gautam Kunapuli and Jong-Shi Pang Department of Mathematical Sciences Rensselaer Polytechnic Institute Troy, NY

Convex Machine Learning Convex optimization approaches to machine learning have been a major obsession of the field for the last ten years. But are the problems really convex?

Outline
- The myth of convex machine learning
- Bilevel programming for model selection
- Regression
- Classification
- Extensions to other machine learning tasks
- Discussion

Modeler’s Choices Data → Function → Loss/Regularization → Optimization Algorithm → w. CONVEX!

Many Hidden Choices
- Data: variable selection, scaling, feature construction, missing data, outlier removal
- Function family: linear, kernel (introduces kernel parameters)
- Optimization model: loss function, regularization, parameters/constraints

Data → Function → Loss/Regularization → Cross-Validation Strategy → Generalization Error: NONCONVEX. Cross-validation over C, ε and the data [X, y] wraps the optimization algorithm that produces w.

How does the modeler make choices?
- Best training set error
- Experience/policy
- Estimate of generalization error: cross-validation, bounds
- Optimize the generalization error estimate: fiddle around, grid search, gradient methods, bilevel programming

Splitting Data for T-fold CV

CV via Grid Search
For every (C, ε):
- For every validation set, solve the model on the corresponding training set and estimate the loss on the validation set
- Estimate the generalization error for (C, ε)
Return the best values of (C, ε) and make the final model using them.
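A minimal sketch of this grid-search procedure in Python, assuming scikit-learn's SVR and KFold (the talk's experiments use their own solvers); the grid values, linear kernel, and fold count are illustrative.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVR

def grid_search_cv(X, y, C_grid, eps_grid, n_folds=3):
    best_C, best_eps, best_err = None, None, np.inf
    folds = list(KFold(n_splits=n_folds, shuffle=True, random_state=0).split(X))
    for C in C_grid:
        for eps in eps_grid:
            fold_errs = []
            for train_idx, valid_idx in folds:
                # solve the model on the corresponding training set
                model = SVR(kernel="linear", C=C, epsilon=eps)
                model.fit(X[train_idx], y[train_idx])
                # validation loss: mean absolute deviation on the held-out fold
                fold_errs.append(np.mean(np.abs(model.predict(X[valid_idx]) - y[valid_idx])))
            gen_err = np.mean(fold_errs)  # estimated generalization error for (C, eps)
            if gen_err < best_err:
                best_C, best_eps, best_err = C, eps, gen_err
    # make the final model on all the data using the selected (C, eps)
    return SVR(kernel="linear", C=best_C, epsilon=best_eps).fit(X, y)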

Bilevel Program for T folds
Prior approaches: Golub et al., 1979, Generalized Cross-Validation for one parameter in ridge regression.
CV as a continuous optimization problem: T inner-level training problems and an outer-level validation problem.
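One natural reading of the bilevel program, with bias terms omitted; $N_t$ denotes the $t$-th validation fold and $\bar{N}_t$ the corresponding training set:

\min_{C, \varepsilon \ge 0, \, w^1, \ldots, w^T} \; \frac{1}{T} \sum_{t=1}^{T} \frac{1}{|N_t|} \sum_{i \in N_t} \bigl| x_i^\top w^t - y_i \bigr|
\text{s.t.} \quad w^t \in \arg\min_w \; \tfrac{1}{2} \|w\|^2 + C \sum_{j \in \bar{N}_t} \max\bigl( |x_j^\top w - y_j| - \varepsilon, \, 0 \bigr), \quad t = 1, \ldots, T.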

Benefit: More Design Variables Add a feature box constraint, $-\bar{w} \le w \le \bar{w}$, in the inner-level problems; the bound vector $\bar{w}$ becomes an additional outer-level design variable.

ε-insensitive Loss Function
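Written out, the loss applied to a residual $\xi = y - x^\top w$ is

|\xi|_\varepsilon = \max\bigl( |\xi| - \varepsilon, \, 0 \bigr),

so residuals inside the tube of width ε incur no penalty.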

Inner-level Problem for t-th Fold
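A sketch of the t-th inner-level training problem, combining the ε-insensitive loss with the feature box constraint introduced above (bias omitted; $\bar{N}_t$ is the training set for fold t):

\min_w \; \tfrac{1}{2} \|w\|^2 + C \sum_{j \in \bar{N}_t} \max\bigl( |x_j^\top w - y_j| - \varepsilon, \, 0 \bigr) \quad \text{s.t.} \quad -\bar{w} \le w \le \bar{w}.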

Optimality (KKT) conditions for fixed outer-level hyperparameters

Key Transformation The KKT conditions for the inner-level training problems are necessary and sufficient. Replace the lower-level problems by their KKT conditions: the problem becomes a Mathematical Program with Equilibrium Constraints (MPEC).

Bilevel Problem as MPEC Replace T inner-level problems with corresponding optimality conditions

MPEC to NLP via Inexact Cross-Validation Relax the "hard" equilibrium constraints to "soft" inexact constraints, where tol is a user-defined tolerance.
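Schematically, each complementarity pair arising from the KKT conditions is relaxed as

0 \le a \perp b \ge 0 \quad \longrightarrow \quad a \ge 0, \; b \ge 0, \; a^\top b \le \mathrm{tol},

which is what makes the cross-validation "inexact".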

Solvers
Strategy: proof of concept using general-purpose nonlinear solvers from NEOS on the NLP.
- FILTER, SNOPT: sequential quadratic programming (SQP) methods; FILTER results are almost always better.
- Many possible alternatives: integer programming, branch and bound, Lagrangian relaxations.

Computational Experiments: Data
- Synthetic: (5, 10, 15)-D data with Gaussian and Laplacian noise and (3, 7, 10) relevant features. NLP: 3-fold CV. Results: 30 to 90 training points, 1000 test points, 10 trials.
- QSAR/Drug design: 4 datasets, 600+ dimensions reduced to the 25 top principal components. NLP: 5-fold CV. Results: 40 to 100 training points, rest test, 20 trials.

Cross-validation Methods Compared
- Unconstrained Grid: try 3 values each for C and ε
- Constrained Grid: try 3 values each for C and ε, and {0, 1} for each component of the box bound $\bar{w}$
- Bilevel/FILTER: nonlinear program solved using an off-the-shelf SQP algorithm via NEOS

15-D Data: Objective Value

15-D Data: Computational Time

15-D Data: TEST MAD

QSAR Data: Objective Value

QSAR Data: Computation Time

QSAR Data: TEST MAD

Classification Cross-Validation Given sample data from two classes, find a classification function that minimizes an out-of-sample estimate of the classification error.

Lower Level: SVM Define parallel planes; minimize points on the wrong side; maximize the margin of separation.

Lower-Level Loss Function: Hinge Loss Measures the distance of points that violate the appropriate hyperplane constraints.
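With the sign convention $w^\top x = b \pm 1$ for the parallel planes, this is the usual hinge loss for a labeled point $(x_j, y_j)$:

\max\bigl( 1 - y_j ( w^\top x_j - b ), \, 0 \bigr),

which is zero exactly when the point lies on the correct side of its plane.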

Lower-Level Problem: SVC with Box Constraint
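A sketch of the lower-level soft-margin SVC with the same box constraint on w as in the regression case (sign conventions as above):

\min_{w, b, \xi} \; \tfrac{1}{2} \|w\|^2 + C \sum_{j \in \bar{N}_t} \xi_j \quad \text{s.t.} \quad y_j ( w^\top x_j - b ) \ge 1 - \xi_j, \quad \xi_j \ge 0, \quad -\bar{w} \le w \le \bar{w}.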

Inner-level KKT Conditions

Outer-level Loss Functions
- Misclassification Minimization (MM) loss: the loss function used in classical CV. Loss = 1 if the validation point is misclassified, 0 otherwise (computed using the step function).
- Hinge Loss (HL): both inner and outer levels use the same loss function. Loss = the distance by which the validation point violates its hyperplane constraint (computed using the max function).

Hinge Loss is a Convex Approximation of the Misclassification Minimization Loss

Hinge Loss Bilevel Program (BilevelHL) Replace the max in the outer-level objective with convex constraints; replace the inner-level problems with their KKT conditions.

Hinge Loss MPEC

Misclassification Minimization Bilevel Program (BilevelMM) Misclassifications are counted using the step function, defined componentwise for an n-vector.
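One natural reading of that componentwise definition, for an n-vector r:

(r_*)_i = 1 \text{ if } r_i > 0, \qquad (r_*)_i = 0 \text{ if } r_i \le 0, \qquad i = 1, \ldots, n.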

The Step Function Mangasarian (1994) showed that the step function can be characterized as the solution of a linear program, so that any solution of that LP recovers the componentwise step of its argument.

Misclassifications in the Validation Set A validation point is misclassified when the sign of $y_i(w^\top x_i - b)$ is negative, i.e., when $y_i(w^\top x_i - b) < 0$. This can be recast for all validation points (within the t-th fold) using the step function.

Misclassification Minimization Bilevel Program (revisited)
- Inner-level problems to determine misclassified validation points
- Inner-level training problems
- Outer-level average misclassification minimization

Misclassification Minimization MPEC

Inexact Cross-Validation NLP Both the BilevelHL and BilevelMM MPECs are transformed to NLPs by relaxing the equilibrium constraints (inexact CV), then solved using FILTER on NEOS. These are compared with classical cross-validation: unconstrained and constrained grid search.

Experiments: Data Sets 3-fold cross-validation for model selection; results averaged over 20 train/test splits.

Computational Time

Training CV Error

Testing Error

Number of Variables

Progress
- Cross-validation is a bilevel problem solvable by continuous optimization methods
- An off-the-shelf NLP algorithm (FILTER) solved both classification and regression
- Bilevel optimization is extendable to many machine learning problems

Extending the Bilevel Approach to Other Machine Learning Problems: kernel classification/regression, variable selection/scaling, multi-task learning, semi-supervised learning, generative methods.

Semi-supervised Learning Have labeled data and unlabeled data. Treat the missing labels as design variables in the outer level; the lower-level problems are still convex.

Semi-supervised Regression
- Outer level minimizes error on the labeled data to find the optimal parameters and labels
- ε-insensitive loss on labeled data in the inner level
- ε-insensitive loss on unlabeled data in the inner level
- Inner-level regularization
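A rough sketch of this structure, with L the labeled and U the unlabeled index set, label variables $\hat{y}_k$ chosen in the outer level, and a separate trade-off parameter $\hat{C}$ for the unlabeled term assumed for illustration (bias and fold indices omitted):

\min_{C, \hat{C}, \varepsilon, \hat{y}} \; \sum_{i \in L} \bigl| x_i^\top w - y_i \bigr|
\text{s.t.} \quad w \in \arg\min_w \; \tfrac{1}{2} \|w\|^2 + C \sum_{i \in L} \max\bigl( |x_i^\top w - y_i| - \varepsilon, \, 0 \bigr) + \hat{C} \sum_{k \in U} \max\bigl( |x_k^\top w - \hat{y}_k| - \varepsilon, \, 0 \bigr).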

Discussion New capacity offers new possibilities:
- Outer-level objectives?
- Inner-level problems? Classification, ranking, semi-supervised learning, missing values, kernel selection, variable selection, ...
- Need special-purpose algorithms for greater efficiency, scalability, and robustness
This work was supported by Office of Naval Research Grant N

Experiments: Bilevel CV Procedure
- Run BilevelMM/BilevelHL to compute the optimal parameters
- Drop descriptors whose components of the box bound $\bar{w}$ are small
- Create a model on all training data using the optimal parameters
- Compute test error on the hold-out set

Experiments: Grid Search CV Procedure
- Unconstrained Grid: try 6 values for C on a log10 scale
- Constrained Grid: try 6 values for C and {0, 1} for each component of the box bound $\bar{w}$ (perform RFE if necessary)
- Create a model on all training data using the optimal grid point
- Compute test error on the hold-out set

Extending the Bilevel Approach to Other Machine Learning Problems: kernel classification/regression, different regularizations (L1, elastic nets), enhanced feature selection, multi-task learning, semi-supervised learning, generative methods.

Enhanced Feature Selection Assume at most k descriptors are allowed. Introduce an outer-level cardinality constraint (counting the non-zero elements of $\bar{w}$), rewrite the constraint using the step-function characterization above, and obtain additional conditions for the outer-level problem.

Kernel Bilevel Discussion
Pros: performs model selection in feature space; performs feature selection in input space.
Cons: highly nonlinear model; difficult to solve.

Kernel Classification (MPEC form)

Is it okay to do 3-fold cross-validation?

Applying the “kernel trick” Drop the box constraint, eliminate w from the optimality conditions, and replace the resulting inner products with an appropriate kernel.
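Schematically, using the standard SVC representation of w in terms of the training points (details of the paper's derivation may differ):

w = \sum_j y_j \alpha_j x_j \quad \Longrightarrow \quad x_i^\top x_j \; \text{is replaced by} \; k(x_i, x_j).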

Feature Selection with Kernels Parameterize the kernel with scaling parameters such that, when a parameter is zero, the n-th descriptor vanishes from the kernel: linear kernel, polynomial kernel, Gaussian kernel.
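For instance, with nonnegative scaling parameters $\lambda_n$, one per descriptor, a common parameterization (setting $\lambda_n = 0$ removes descriptor n) is:

k(x, z) = \sum_n \lambda_n x_n z_n \quad \text{(linear)}
k(x, z) = \bigl( \sum_n \lambda_n x_n z_n + c \bigr)^d \quad \text{(polynomial)}
k(x, z) = \exp\bigl( -\sum_n \lambda_n (x_n - z_n)^2 \bigr) \quad \text{(Gaussian)}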

Kernel Regression (Bilevel form)