Parameter tuning based on response surface models: an update on work in progress. EARG, Feb 27th, 2008. Presenter: Frank Hutter.

Motivation
- Parameter tuning is important.
- Recent approaches (ParamILS, racing, CALIBRA) "only" return the best parameter configuration.
  - Extra information would be nice, e.g.:
    - The most important parameter is X.
    - The effects of parameters X and Y are largely independent.
    - For parameter X, options 1 and 2 are bad, 3 is best, 4 is decent.
  - ANOVA is one tool for this, but it has limitations (e.g., discretization of parameters, linear model).

More motivation
- Support the actual design process by providing feedback about parameters.
  - E.g., parameter X should always be i (code gets simpler!).
- Predictive models of runtime are widely applicable.
  - Predictions can be updated based on new information (such as "the algorithm has been running unsuccessfully for X seconds").
  - (True) portfolios of algorithms.
- Once we can learn a function f: Θ → runtime, learning a function g: Θ × X → runtime should be a simple extension (X = instance characteristics; Lin learns h: X → runtime).

The problem setting
- For now: static algorithm configuration, i.e., find the best fixed parameter setting across instances.
  - But as mentioned above, this approach extends to PIAC (per-instance algorithm configuration).
- Randomized algorithms: variance for a single instance (runtime distributions).
- High inter-instance variance in hardness.
- We focus on minimizing runtime.
  - But the approach also applies to other objectives.
  - (Special treatment of censoring and of the cost of gathering a data point is then simply not necessary.)
- We focus on optimizing averages across instances.
  - Generalization to other objectives may not be straightforward.

Learning a predictive model
- Supervised learning problem, regression.
  - Given training data (x_1, o_1), ..., (x_n, o_n), learn a function f such that f(x_i) ≈ o_i.
- What is a data point x_i?
  - 1) Predictive model of average cost:
    - Average over how many instances/runs?
    - Not too many data points, but each one is very costly.
    - Doesn't have to be average cost, could be anything.
  - 2) Predictive model of single costs; get the average cost by aggregation:
    - Have to deal with tens of thousands of data points.
    - If predictions are Gaussian, the aggregates are Gaussian (means and variances add; see the sketch below).
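Below is a minimal NumPy sketch (not part of the original slides) of option 2's aggregation step: combining independent Gaussian per-run predictions into a Gaussian prediction of the average cost. The function name and example numbers are illustrative.

```python
import numpy as np

def aggregate_gaussian_predictions(means, variances):
    """Combine independent Gaussian predictions of single-run costs into a
    Gaussian prediction of their average cost: for the mean of n independent
    Gaussians, means average and variances add (then scale by 1/n^2)."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    n = len(means)
    agg_mean = means.mean()
    agg_var = variances.sum() / n ** 2
    return agg_mean, agg_var

# Example: per-run predictions for one configuration on 4 instances
mu, var = aggregate_gaussian_predictions([2.0, 3.5, 1.2, 4.1], [0.3, 0.5, 0.2, 0.4])
```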

Desired properties of the model
- 1) Discrete and continuous inputs.
  - Parameters are discrete/continuous.
  - Instance features are (so far) all continuous.
- 2) Censoring.
  - When a run times out, we only have a lower bound on its true runtime.
- 3) Scalability: tens of thousands of points.
- 4) Explicit predictive uncertainties.
- 5) Accuracy of predictions.
- Considered models:
  - Linear regression (basis functions? especially for discrete inputs).
  - Regression trees (no uncertainty estimates).
  - Gaussian processes (4 & 5 OK, 1 done, 2 almost done, hopefully 3).
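For reference, a minimal sketch of standard GP regression predictions, which give the explicit predictive uncertainties asked for in point 4. This is the textbook formulation (Rasmussen & Williams), with the kernel passed in as a generic callable; it is not the talk's actual implementation.

```python
import numpy as np

def gp_predict(X, y, X_star, kernel, noise_var):
    """GP regression predictive mean and variance at test points X_star."""
    K = kernel(X, X) + noise_var * np.eye(len(X))
    K_star = kernel(X, X_star)                      # n x m cross-covariances
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K^{-1} y
    mean = K_star.T @ alpha
    v = np.linalg.solve(L, K_star)
    var = kernel(X_star, X_star).diagonal() - np.sum(v ** 2, axis=0)
    return mean, var

# Toy usage with a simple squared-exponential kernel
sq_exp = lambda A, B: np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 2.0, 1.5])
mu, var = gp_predict(X, y, np.array([[1.5]]), sq_exp, noise_var=0.1)
```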

Coming up
- 1) Implemented: model average runtimes, optimize based on that model.
  - Censoring "almost" integrated.
- 2) Further TODOs:
  - Active learning criterion under noise.
  - Scaling: Bayesian committee machine.

Active learning for function optimization
- EGO [Jones, Schonlau & Welch, 1998].
- Assumes deterministic functions.
  - Here: averages over 100 instances.
- Start with a Latin hypercube design.
  - Run the algorithm, get (x_i, o_i) pairs.
- While not terminated (the loop is sketched below):
  - Fit the model (kernel parameter optimization, all continuous).
  - Find the best point to sample (optimization in the space of parameter configurations).
  - Run the algorithm at that point, add the new (x, y) pair.
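A hypothetical skeleton of the EGO loop just described, with the algorithm runs, model fitting, and expected-improvement scoring supplied as callables by the caller. Plain random sampling stands in here for the Latin hypercube design and for the inner optimization over configurations.

```python
import numpy as np

def ego(run_algorithm, fit_model, expected_improvement, bounds,
        n_init=10, n_iter=20, n_candidates=5000, seed=None):
    """EGO skeleton: run_algorithm(x) returns the observed cost (e.g. mean
    runtime over 100 instances), fit_model(X, y) returns a fitted model,
    expected_improvement(model, X_cand, f_min) scores candidate points."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    dim = len(lo)
    X = rng.uniform(lo, hi, size=(n_init, dim))        # initial design
    y = np.array([run_algorithm(x) for x in X])
    for _ in range(n_iter):
        model = fit_model(X, y)                         # kernel parameter optimization inside
        cand = rng.uniform(lo, hi, size=(n_candidates, dim))
        ei = expected_improvement(model, cand, f_min=y.min())
        x_next = cand[np.argmax(ei)]                    # best point to sample
        X = np.vstack([X, x_next])
        y = np.append(y, run_algorithm(x_next))
    best = np.argmin(y)
    return X[best], y[best]
```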

Active learning criterion
- EGO uses maximum expected improvement:
  - EI(x) = ∫ p(y | μ_x, σ_x²) · max(0, f_min − y) dy
  - Easy to evaluate (can be solved in closed form; see the sketch below).
- Problem in EGO: sometimes it is not the actual runtime y that is modeled, but a transformation, e.g. log(y).
  - Expected improvement then needs to be adapted:
  - EI(x) = ∫ p(y | μ_x, σ_x²) · max(0, f_min − exp(y)) dy
  - Easy to evaluate (can still be solved in closed form).
- Take into account the cost of a sample:
  - EI(x) = ∫ p(y | μ_x, σ_x²) · (1/exp(y)) · max(0, f_min − exp(y)) dy
  - Easy to evaluate (can still be solved in closed form).
  - Not implemented yet (the others are implemented).
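A sketch of the closed form for the first (untransformed) criterion, for minimization; the log-transformed and cost-weighted variants also admit closed forms but are not shown here.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_min):
    """Closed-form expected improvement for minimization:
    EI(x) = (f_min - mu) * Phi(z) + sigma * phi(z),  z = (f_min - mu) / sigma."""
    sigma = np.maximum(sigma, 1e-12)       # guard against zero predictive std
    z = (f_min - mu) / sigma
    return (f_min - mu) * norm.cdf(z) + sigma * norm.pdf(z)
```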

Which kernel to use?
- Kernel: distance measure between two data points.
  - Low distance → high correlation.
- Squared exponential, Matern, etc.:
  - SE: k(x, x') = σ_s · exp(−Σ_i l_i · (x_i − x_i')²)
- For discrete parameters: a new Hamming-distance kernel:
  - k(x, x') = σ_s · exp(−Σ_i l_i · [x_i ≠ x_i'])
  - Positive definite by reduction to string kernels.
- "Automatic relevance determination":
  - One length-scale parameter l_i per dimension.
  - Many kernel parameters lead to:
    - Problems with overfitting.
    - Very long runtimes for kernel parameter optimization.
    - For CPLEX: 60 extra parameters, about 15 h for a single kernel parameter optimization using DIRECT, without any improvement.
- Thus: no per-dimension length-scale parameters. Only two parameters: the noise σ_n and the overall variability of the signal σ_s. (Both kernels are sketched below.)
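Hedged sketches of the two kernels above. Following the decision to drop per-dimension length scales, both use a single shared scale (default 1) plus the signal variability σ_s; the noise σ_n enters the GP separately.

```python
import numpy as np

def se_kernel(X1, X2, sigma_s=1.0, ell=1.0):
    """Squared-exponential kernel with a single shared scale:
    k(x, x') = sigma_s * exp(-ell * sum_i (x_i - x_i')^2)."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return sigma_s * np.exp(-ell * sq_dists)

def hamming_kernel(X1, X2, sigma_s=1.0, ell=1.0):
    """Hamming-distance kernel for discrete inputs:
    k(x, x') = sigma_s * exp(-ell * sum_i [x_i != x_i'])."""
    mismatches = (X1[:, None, :] != X2[None, :, :]).sum(axis=-1)
    return sigma_s * np.exp(-ell * mismatches)
```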

How to optimize kernel parameters?
- Objective:
  - Standard: maximizing the marginal likelihood.
    - Doesn't work under censoring.
  - Alternative: maximizing the likelihood of unseen data using cross-validation (sketched below).
    - Efficient when not too many folds k are used:
      - Marginal likelihood requires inversion of an N × N matrix.
      - Cross-validation with k = 2 requires inversions of two N/2 × N/2 matrices. In practice still quite a bit slower (some algebra tricks may help).
- Algorithm:
  - Using DIRECT (DIviding RECTangles), a global sampling-based method (does not scale to high dimensions).
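A sketch of the cross-validation objective with k = 2 folds, reusing the gp_predict and kernel sketches above; censoring is not handled here, and the maximization of this objective with DIRECT is not shown.

```python
import numpy as np
from scipy.stats import norm

def cv_log_likelihood(params, X, y, kernel_family, k=2):
    """Cross-validated predictive log-likelihood of held-out data, to be
    maximized over the kernel parameters (here the two kept in the talk:
    sigma_s and the noise variance). Plain, uncensored version."""
    sigma_s, noise_var = params
    kernel = lambda A, B: kernel_family(A, B, sigma_s)
    idx = np.random.permutation(len(X))
    total = 0.0
    for held_out in np.array_split(idx, k):
        train = np.setdiff1d(np.arange(len(X)), held_out)
        mu, var = gp_predict(X[train], y[train], X[held_out], kernel, noise_var)
        total += norm.logpdf(y[held_out], mu,
                             np.sqrt(np.maximum(var, 0.0) + noise_var)).sum()
    return total
```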

How to optimize expected improvement?
- Currently only 3 algorithms to be tuned:
  - SAPS (4 continuous parameters).
  - SPEAR (26 parameters, about half of them discrete).
    - For now, the continuous ones are discretized.
  - CPLEX (60 parameters, 50 of them discrete).
    - For now, the continuous ones are discretized.
- Purely continuous / purely discrete optimization:
  - DIRECT / multiple-restart local search (sketched below).
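A sketch of the multiple-restart local search for the purely discrete case; the neighbourhood and random-configuration generators are hypothetical callables supplied by the caller.

```python
import numpy as np

def multirestart_local_search(ei_func, neighbours, random_config, n_restarts=10):
    """Maximize expected improvement over discrete configurations:
    ei_func(config) scores a configuration, neighbours(config) yields its
    one-exchange neighbours, random_config() samples a restart point."""
    best_config, best_ei = None, -np.inf
    for _ in range(n_restarts):
        config = random_config()
        ei = ei_func(config)
        improved = True
        while improved:                      # first-improvement hill climbing
            improved = False
            for nb in neighbours(config):
                nb_ei = ei_func(nb)
                if nb_ei > ei:
                    config, ei, improved = nb, nb_ei, True
                    break
        if ei > best_ei:
            best_config, best_ei = config, ei
    return best_config, best_ei
```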

TODO: integrate censoring
- Learning with censored data is almost done.
  - (Needs solid testing since it'll be central later.)
- Active selection of the censoring threshold?
  - Something simple might suffice, such as picking the cutoff equal to the predicted runtime or to the best runtime so far.
    - The integration bounds in expected improvement would change, but nothing else.
- Runtime:
  - With censoring, about 3 times slower than without (Newton's method takes about 3 steps).
  - "Good" scaling:
    - 42 points: 19 seconds; 402 points: 143 seconds.
    - Maybe Newton does not need as many steps with more points.
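For intuition, a one-step imputation in the spirit of Schmee & Hahn (one of the approaches compared in the plots below): a right-censored observation is replaced by the mean of its predictive Normal truncated at the censoring threshold. This is only a sketch of the idea, not the integrated implementation.

```python
import numpy as np
from scipy.stats import norm

def schmee_hahn_imputation(mu, sigma, censor_threshold):
    """Impute a right-censored runtime (observed only as > threshold) by the
    conditional mean E[Y | Y > threshold] for Y ~ N(mu, sigma^2):
    mu + sigma * phi(z) / (1 - Phi(z)),  z = (threshold - mu) / sigma."""
    z = (censor_threshold - mu) / sigma
    return mu + sigma * norm.pdf(z) / norm.sf(z)
```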

Treat as “completed at threshold”, 4s

Don’t use censored data, 4s

Laplace approximation to posterior, 10s

Schmee & Hahn, 21 iterations, 36s

Anecdotal: Lin’s original implementation of Schmee & Hahn, on my machine – beware of normpdf

A counterintuitive example from practice (same hyperparameters in same rows)

Preliminary results and demo
- Experiments with a noise-free kernel:
  - Great cross-validation results for SPEAR & CPLEX.
  - Poor cross-validation results for SAPS.
- Explanation:
  - Even when averaging over 100 instances, the response is NOT noise-free.
  - SAPS is continuous:
    - Can pick configurations arbitrarily close to each other.
    - If the results differ substantially, the SE kernel must have huge variance → very poor results.
  - A Matern kernel works better for SAPS.

TODOs
- Finish censoring.
- Consider noise (present even when averaging over instances); change the active learning criterion.
- Scaling.
- Efficiency of implementation: reusing work for multiple predictions.

TODO: Active learning under noise [Williams, Santner, and Notz, 2000]
- Very heavy on notation, but there is good stuff in there.
- 1) Actively choose a parameter setting.
  - The best setting so far is not known → f_min is now a random variable.
    - Take joint samples of performance from the predictive distributions for all settings tried so far, take the min of those samples, and compute expected improvement as if that min were the deterministic f_min.
    - Average the expected improvements computed for 100 independent samples (sketched below).
- 2) Actively choose an instance to run for that parameter setting: minimizing posterior variance.
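A sketch of the sampling-based criterion in step 1, with hypothetical callables: gp_sample_joint() draws one joint predictive sample at all previously tried settings, and ei_closed_form is the deterministic-f_min EI from the earlier sketch.

```python
import numpy as np

def noisy_expected_improvement(gp_sample_joint, ei_closed_form,
                               mu_x, sigma_x, n_samples=100):
    """Average EI over joint samples of the incumbent's performance:
    each sample's minimum plays the role of a deterministic f_min."""
    eis = []
    for _ in range(n_samples):
        sample = gp_sample_joint()          # joint sample at all tried settings
        f_min = sample.min()
        eis.append(ei_closed_form(mu_x, sigma_x, f_min))
    return np.mean(eis)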

TODO: scaling
- Bayesian committee machine:
  - More or less a mixture of GPs, each of them on a small subset of the data (cluster the data ahead of time).
  - A fairly straightforward wrapper around GP code (or really any code that provides Gaussian predictions); see the combination sketch below.
  - Maximizing cross-validated performance is easy.
  - In principle we could update by just updating one component at a time.
    - But in practice, once we re-optimize hyperparameters we're changing every component anyway.
    - Likewise, we can do rank-1 updates for the basic GPs, but a single matrix inversion is really not the expensive part (rather, the 1000s of matrix inversions for kernel parameter optimization).
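A sketch of the Bayesian committee machine combination rule (Tresp, 2000) for merging the Gaussian predictions of M experts at a test point; this covers only the "wrapper" part, not the clustering or the per-expert fitting.

```python
import numpy as np

def bcm_combine(means, variances, prior_var):
    """BCM combination of M experts' Gaussian predictions at one test point:
    precision = sum_i 1/var_i - (M - 1)/prior_var,
    mean      = combined_var * sum_i mu_i/var_i."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    M = len(means)
    precision = (1.0 / variances).sum() - (M - 1) / prior_var
    var = 1.0 / precision
    mean = var * (means / variances).sum()
    return mean, var
```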

Future work
- We can get main effects and interaction effects, much like in ANOVA.
  - The integrals seem to be solvable in closed form.
- We can get plots of predicted mean and variance as one parameter is varied, marginalized over all others (a Monte-Carlo approximation is sketched below).
  - Similarly as two or three are varied.
  - This allows for plots of interactions.
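A Monte-Carlo stand-in for those marginalization integrals (the closed-form versions mentioned above are not shown): the predicted mean as one parameter is varied, averaged over a sample of configurations for the remaining parameters. gp_mean and configs are hypothetical inputs.

```python
import numpy as np

def marginal_effect(gp_mean, configs, dim, values):
    """Approximate main effect of parameter `dim`: for each value, fix that
    parameter in every sampled configuration and average the model's
    predicted mean over the remaining (marginalized) parameters."""
    effects = []
    for v in values:
        modified = configs.copy()
        modified[:, dim] = v
        effects.append(gp_mean(modified).mean())
    return np.array(effects)
```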