1
Model Selection via Bilevel Optimization
Kristin P. Bennett, Jing Hu, Xiaoyun Ji, Gautam Kunapuli and Jong-Shi Pang
Department of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY
2
Convex Machine Learning
Convex optimization approaches to machine learning have been a major obsession of the field for the last ten years. But are the problems really convex?
3
Outline
- The myth of convex machine learning
- Bilevel programming model selection
  - Regression
  - Classification
- Extensions to other machine learning tasks
- Discussion
4
Modeler's Choices
Data → Function (Loss/Regularization) → Optimization Algorithm → w. CONVEX!
5
Many Hidden Choices
- Data: variable selection, scaling, feature construction, missing data, outlier removal
- Function family: linear, kernel (introduces kernel parameters)
- Optimization model: loss function, regularization, parameters/constraints
6
Cross-Validation Strategy
Data → Function (Loss/Regularization) → Optimization Algorithm → w, wrapped in a cross-validation loop over C, ε and the data splits [X, y], yielding a generalization error estimate. NONCONVEX.
7
How does the modeler make choices?
- Best training-set error
- Experience/policy
- Estimate of generalization error: cross-validation, bounds
- Optimize the generalization error estimate: fiddle around, grid search, gradient methods, bilevel programming
8
Splitting Data for T-fold CV
9
CV via Grid Search
For every (C, ε):
  For every validation set, solve the model on the corresponding training set and estimate the loss on that validation set.
  Estimate the generalization error for (C, ε).
Return the best values of (C, ε).
Make the final model using the best (C, ε). (A code sketch of this procedure follows.)
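A minimal sketch of this grid-search procedure, assuming scikit-learn's SVR as a stand-in for the talk's linear ε-insensitive regression model (the grids, fold count, and error metric are illustrative assumptions, not values from the slides):

```python
# Grid-search cross-validation sketch for epsilon-SVR (illustrative, not the authors' code).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error

def grid_search_cv(X, y, C_grid, eps_grid, n_folds=3, seed=0):
    """Return the (C, eps) pair with the lowest cross-validated MAD and the final model."""
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    best = (None, None, np.inf)
    for C in C_grid:
        for eps in eps_grid:
            errors = []
            for train_idx, val_idx in folds.split(X):
                # Train on the fold's training portion, score on its validation portion.
                model = SVR(kernel="linear", C=C, epsilon=eps)
                model.fit(X[train_idx], y[train_idx])
                errors.append(mean_absolute_error(y[val_idx], model.predict(X[val_idx])))
            cv_error = np.mean(errors)  # estimate of generalization error for (C, eps)
            if cv_error < best[2]:
                best = (C, eps, cv_error)
    C_best, eps_best, _ = best
    # Refit on all training data with the selected hyperparameters.
    final_model = SVR(kernel="linear", C=C_best, epsilon=eps_best).fit(X, y)
    return C_best, eps_best, final_model
```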
10
Bilevel Program for T Folds
- Prior approaches: Golub et al., 1979, Generalized Cross-Validation for one parameter in ridge regression
- CV as a continuous optimization problem:
  - T inner-level training problems
  - One outer-level validation problem
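A minimal sketch of this bilevel program in generic notation (the symbols λ, Ω_t, and the losses are assumptions, not the slide's own): the outer level chooses the hyperparameters to minimize average validation loss, subject to each of the T inner problems training on its own fold.

```latex
% Generic bilevel cross-validation program (notation assumed, not from the slide)
\begin{aligned}
\min_{\lambda,\; w^1,\dots,w^T} \quad
  & \frac{1}{T}\sum_{t=1}^{T} \frac{1}{|\Omega_t|}\sum_{i \in \Omega_t}
    \ell_{\mathrm{val}}\!\bigl(y_i,\, f(x_i; w^t)\bigr) \\
\text{s.t.} \quad
  & w^t \in \arg\min_{w}\;
    \Bigl\{ \sum_{j \in \overline{\Omega}_t}
    \ell_{\mathrm{trn}}\!\bigl(y_j,\, f(x_j; w)\bigr) + R(w;\lambda) \Bigr\},
    \quad t = 1,\dots,T,
\end{aligned}
```

where Ω_t is the t-th validation fold and its complement is the corresponding training set.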
11
Benefit: More Design Variables
Add a feature box constraint, -w̄ ≤ w ≤ w̄, in the inner-level problems, with the bound vector w̄ chosen at the outer level.
12
ε-insensitive Loss Function
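The slide's figure is not reproduced here; for reference, the standard ε-insensitive loss on a residual r = f(x) − y is

```latex
% Standard epsilon-insensitive loss (the slide's own figure is not reproduced)
L_\varepsilon(r) \;=\; \max\bigl(|r| - \varepsilon,\; 0\bigr)
  \;=\; \begin{cases} 0, & |r| \le \varepsilon,\\ |r| - \varepsilon, & |r| > \varepsilon. \end{cases}
```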
13
Inner-level Problem for the t-th Fold
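A sketch, in assumed notation, of the box-constrained ε-insensitive training problem solved within fold t, combining the loss above with the feature box constraint (the exact weighting in the slides may differ):

```latex
% Sketch of the fold-t training problem (notation and weighting assumed)
w^t \;\in\; \arg\min_{-\overline{w}\,\le\, w\,\le\, \overline{w}}
  \;\; C \sum_{j \in \overline{\Omega}_t}
  \max\bigl(|x_j^{\top} w - y_j| - \varepsilon,\; 0\bigr)
  \;+\; \tfrac{1}{2}\,\|w\|_2^2
```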
14
Optimality (KKT) conditions for fixed C, ε, and w̄
15
Key Transformation
- The KKT conditions for the inner-level training problems are necessary and sufficient.
- Replace the lower-level problems by their KKT conditions.
- The problem becomes a Mathematical Program with Equilibrium Constraints (MPEC).
16
Bilevel Problem as MPEC
Replace the T inner-level problems with their corresponding optimality conditions.
17
MPEC to NLP via Inexact Cross-Validation
Relax the "hard" equilibrium constraints to "soft" inexact constraints, where tol is a user-defined tolerance. (A sketch of the relaxation follows.)
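A minimal sketch of this relaxation, assuming the usual MPEC treatment of complementarity (the slide's exact constraint names are not reproduced): each hard complementarity pair from the KKT conditions is replaced by a bounded bilinear product.

```latex
% Hard complementarity constraint (from the KKT conditions)
0 \;\le\; a \;\perp\; b \;\ge\; 0
\qquad\Longrightarrow\qquad
% Soft, "inexact" relaxation used to obtain an NLP
a \ge 0, \quad b \ge 0, \quad a^{\top} b \;\le\; \mathrm{tol}
```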
18
Solvers
- Strategy: proof of concept using general-purpose nonlinear solvers from NEOS on the NLP
- FILTER, SNOPT: sequential quadratic programming methods; FILTER results were almost always better
- Many possible alternatives: integer programming, branch and bound, Lagrangian relaxations
19
Computational Experiments: Data
- Synthetic: (5, 10, 15)-dimensional data with Gaussian and Laplacian noise and (3, 7, 10) relevant features. NLP: 3-fold CV. Results: 30 to 90 training points, 1000 test points, 10 trials.
- QSAR/Drug design: 4 datasets, 600+ dimensions reduced to the 25 top principal components. NLP: 5-fold CV. Results: 40 to 100 training points, rest test, 20 trials.
20
Cross-Validation Methods Compared
- Unconstrained grid: try 3 values each for C and ε
- Constrained grid: try 3 values each for C and ε, and {0, 1} for each component of w̄
- Bilevel/FILTER: nonlinear program solved using an off-the-shelf SQP algorithm via NEOS
21
15-D Data: Objective Value
22
15-D Data: Computational Time
23
15-D Data: TEST MAD
24
QSAR Data: Objective Value
25
QSAR Data: Computation Time
26
QSAR Data: TEST MAD
27
Classification Cross-Validation
Given sample data from two classes, find a classification function that minimizes an out-of-sample estimate of the classification error.
28
Lower Level: SVM
- Define parallel planes
- Penalize points on the wrong side
- Maximize the margin of separation
29
Lower-Level Loss Function: Hinge Loss
Measures the distance of points that violate the appropriate hyperplane constraints (see the formula below).
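The slide's own formula is not reproduced; the standard hinge loss for a labeled point (x, y) with y ∈ {±1} and classifier f(x) = w⊤x − b is

```latex
% Standard hinge loss (slide formula not reproduced; notation assumed)
L_{\mathrm{hinge}}\bigl(y, f(x)\bigr) \;=\; \max\bigl(1 - y\,(w^{\top}x - b),\; 0\bigr)
```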
30
Lower-Level Problem: SVC with Box Constraint
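A sketch, in assumed notation, of the box-constrained soft-margin SVC training problem for fold t, combining the hinge loss above with the feature box constraint used in the regression case (the slide's exact weighting may differ):

```latex
% Sketch of the fold-t SVC training problem with a feature box (notation assumed)
(w^t, b^t) \;\in\; \arg\min_{-\overline{w}\,\le\, w\,\le\,\overline{w},\; b}
  \;\; C \sum_{j \in \overline{\Omega}_t}
  \max\bigl(1 - y_j\,(w^{\top}x_j - b),\; 0\bigr)
  \;+\; \tfrac{1}{2}\,\|w\|_2^2
```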
31
Inner-level KKT Conditions
32
Outer-Level Loss Functions
- Misclassification minimization loss (MM): the loss function used in classical CV. Loss = 1 if the validation point is misclassified, 0 otherwise (computed using the step function).
- Hinge loss (HL): both inner and outer levels use the same loss function. Loss = distance from the appropriate hyperplane (computed using the max function).
33
Hinge Loss is a Convex Approximation of the Misclassification Minimization Loss
34
Hinge Loss Bilevel Program (BilevelHL)
- Replace the max in the outer-level objective with convex constraints
- Replace the inner-level problems with their KKT conditions
35
Hinge Loss MPEC
36
Misclassification Minimization Bilevel Program (BilevelMM)
Misclassifications are counted using the step function, defined componentwise for an n-vector as shown below.
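The slide's definition is not reproduced; the standard componentwise step function of an n-vector r, which is assumed to be the one intended, is

```latex
% Componentwise step function of an n-vector r (standard definition, assumed)
(r_{*})_{i} \;=\; \begin{cases} 1, & r_{i} > 0,\\ 0, & r_{i} \le 0, \end{cases}
\qquad i = 1,\dots,n
```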
37
The Step Function
Mangasarian (1994) showed that the step function can be characterized as the solution of a linear program, and that any solution to this LP yields the desired step vector.
38
Misclassifications in the Validation Set
A validation point is misclassified when the sign of y_i (w⊤x_i − b) is negative, i.e., y_i (w⊤x_i − b) < 0. This can be recast for all validation points (within the t-th fold) using the step function, as sketched below.
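A sketch of that recasting in assumed notation: the number of misclassified validation points in fold t can be written by applying the step function to the negated margins.

```latex
% Counting fold-t validation misclassifications with the step function (notation assumed)
\sum_{i \in \Omega_t}
  \Bigl( -\,y_i\,\bigl((w^t)^{\top}x_i - b^t\bigr) \Bigr)_{*}
```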
39
Misclassification Minimization Bilevel Program (revisited)
- Inner-level problems to determine misclassified validation points
- Inner-level training problems
- Outer-level average misclassification minimization
40
Misclassification Minimization MPEC
41
Inexact Cross-Validation NLP
- Both the BilevelHL and BilevelMM MPECs are transformed to NLPs by relaxing the equilibrium constraints (inexact CV)
- Solved using FILTER on NEOS
- Compared with classical cross-validation: unconstrained and constrained grid search
42
Experiments: Data Sets
- 3-fold cross-validation for model selection
- Results averaged over 20 train/test splits
43
Computational Time
44
Training CV Error
45
Testing Error
46
Number of Variables
47
Progress
- Cross-validation is a bilevel problem solvable by continuous optimization methods
- An off-the-shelf NLP algorithm (FILTER) solved both classification and regression
- Bilevel optimization is extendable to many machine learning problems
48
Extending the Bilevel Approach to Other Machine Learning Problems
- Kernel classification/regression
- Variable selection/scaling
- Multi-task learning
- Semi-supervised learning
- Generative methods
49
Semi-supervised Learning
- Have labeled data and unlabeled data
- Treat the missing labels as design variables in the outer level
- The lower-level problems are still convex
50
Semi-supervised Regression
- The outer level minimizes the error on labeled data to find the optimal parameters and labels
- ε-insensitive loss on labeled data in the inner level
- ε-insensitive loss on unlabeled data in the inner level
- Inner-level regularization
51
Discussion
This new capability offers new possibilities:
- Outer-level objectives?
- Inner-level problems? Classification, ranking, semi-supervised learning, missing values, kernel selection, variable selection, ...
- Need special-purpose algorithms for greater efficiency, scalability, and robustness
This work was supported by Office of Naval Research Grant N00014-06-1-0014.
52
Experiments: Bilevel CV Procedure
- Run BilevelMM/BilevelHL to compute the optimal parameters
- Drop descriptors with small corresponding components of w̄
- Create a model on all training data using the optimal parameters
- Compute the test error on the hold-out set
53
Experiments: Grid-Search CV Procedure
- Unconstrained grid: try 6 values for C on a log10 scale
- Constrained grid: try 6 values for C and {0, 1} for each component of w̄ (perform RFE if necessary)
- Create a model on all training data using the optimal grid point
- Compute the test error on the hold-out set
54
Extending the Bilevel Approach to Other Machine Learning Problems
- Kernel classification/regression
- Different regularizations (L1, elastic nets)
- Enhanced feature selection
- Multi-task learning
- Semi-supervised learning
- Generative methods
55
Enhanced Feature Selection
- Assume at most a fixed number of descriptors is allowed
- Introduce an outer-level constraint counting the non-zero elements of w̄
- Rewrite the constraint and obtain additional conditions in the outer-level problem (one way to do this is sketched below)
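A minimal sketch, under the assumption that the intended rewriting uses the step function introduced earlier (the slide's own formulas are not reproduced): with n_max the allowed number of descriptors and w̄ ≥ 0,

```latex
% Assumed rewriting of the descriptor-count constraint via the step function
\|\overline{w}\|_{0} \;\le\; n_{\max}
\qquad\Longleftrightarrow\qquad
\mathbf{1}^{\top}\,(\overline{w})_{*} \;\le\; n_{\max}
```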
56
Kernel Bilevel Discussion
Pros:
- Performs model selection in feature space
- Performs feature selection in input space
Cons:
- Highly nonlinear model
- Difficult to solve
57
Kernel Classification (MPEC form)
58
Is it okay to do 3 folds?
59
Applying the "Kernel Trick"
- Drop the box constraint on w
- Eliminate w from the optimality conditions
- Replace inner products of the data with an appropriate kernel (see the sketch below)
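A sketch of the standard substitution being referred to, in assumed notation (the slide's own symbols are not reproduced): once w is eliminated via its dual representation, every inner product of data points is replaced by a kernel evaluation.

```latex
% Standard kernel substitution (notation assumed)
w \;=\; \sum_{j} \alpha_j\, y_j\, x_j
\quad\Longrightarrow\quad
w^{\top} x_i \;=\; \sum_{j} \alpha_j\, y_j\, x_j^{\top} x_i
\;\;\longrightarrow\;\;
\sum_{j} \alpha_j\, y_j\, k(x_j, x_i)
```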
60
Feature Selection with Kernels
Parameterize the kernel with a vector of feature weights such that, when a weight is zero, the corresponding descriptor vanishes from the kernel. Examples: linear kernel, polynomial kernel, Gaussian kernel (see the sketch below).
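A sketch of such a parameterization with an assumed weight vector p ≥ 0 (the slide's own parameterization is not reproduced): setting p_n = 0 removes the n-th descriptor from each kernel.

```latex
% Feature-weighted kernels (assumed parameterization; p_n = 0 drops descriptor n)
k_{\mathrm{lin}}(x, z)  \;=\; \sum_{n} p_n\, x_n z_n, \qquad
k_{\mathrm{poly}}(x, z) \;=\; \Bigl(\sum_{n} p_n\, x_n z_n + c\Bigr)^{d}, \qquad
k_{\mathrm{gauss}}(x, z) \;=\; \exp\!\Bigl(-\sum_{n} p_n\,(x_n - z_n)^{2}\Bigr)
```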
61
Kernel Regression (Bilevel form)