
1 Parameter tuning based on response surface models: an update on work in progress. EARG, Feb 27th, 2008. Presenter: Frank Hutter

2 Motivation
Parameter tuning is important.
Recent approaches (ParamILS, racing, CALIBRA) "only" return the best parameter configuration.
- Extra information would be nice, e.g.:
  - The most important parameter is X
  - The effect of parameters X and Y is largely independent
  - For parameter X, options 1 and 2 are bad, 3 is best, 4 is decent
- ANOVA is one tool for that, but it has limitations (e.g. discretization of parameters, linear model).

3 More motivation
Support the actual design process by providing feedback about parameters.
- E.g. parameter X should always be i (the code gets simpler!).
Predictive models of runtime are widely applicable.
- Predictions can be updated based on new information (such as "the algorithm has been running unsuccessfully for X seconds").
- (True) portfolios of algorithms.
Once we can learn a function f: Θ → runtime, learning a function g: Θ × X → runtime should be a simple extension (X = instance characteristics; Lin learns h: X → runtime).

4 The problem setting
For now: static algorithm configuration, i.e. find the best fixed parameter setting across instances.
- But, as mentioned above, this approach extends to PIAC (per-instance algorithm configuration).
Randomized algorithms: variance even for a single instance (runtime distributions).
High inter-instance variance in hardness.
We focus on minimizing runtime.
- But the approach also applies to other objectives.
- (The special treatment of censoring and of the cost of gathering a data point is then simply unnecessary.)
We focus on optimizing averages across instances.
- Generalization to other objectives may not be straightforward.

5 Learning a predictive model
Supervised learning problem (regression).
- Given training data (x_1, o_1), ..., (x_n, o_n), learn a function f such that f(x_i) ≈ o_i.
What is a data point x_i?
1) Predictive model of average cost
- Average over how many instances/runs?
- Not too many data points, but each one is very costly.
- Doesn't have to be average cost; could be anything.
2) Predictive model of single costs; get the average cost by aggregation (see the sketch below).
- Have to deal with tens of thousands of data points.
- If the predictions are Gaussian, the aggregates are Gaussian (means and variances add).
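A minimal sketch of the aggregation in option 2, assuming independent Gaussian per-run predictions (the function name and the example numbers are mine, not from the talk): the predicted average cost is again Gaussian, with the mean of the per-run means and the summed variances scaled by 1/k².

```python
import numpy as np

def aggregate_gaussian_predictions(means, variances):
    """Combine independent Gaussian per-run predictions N(mu_i, var_i)
    into the Gaussian distribution of their average cost.

    Assumes independence across runs, so the variances add before the
    1/k**2 scaling of the sample mean."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    k = len(means)
    agg_mean = means.mean()
    agg_var = variances.sum() / k**2
    return agg_mean, agg_var

# Example: predictions for three runs of one configuration
mu, var = aggregate_gaussian_predictions([2.0, 3.5, 2.5], [0.4, 0.6, 0.5])
print(mu, var)  # ~2.67, ~0.167
```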

6 Desired properties of the model
1) Discrete and continuous inputs
- Parameters are discrete/continuous.
- Instance features are (so far) all continuous.
2) Censoring
- When a run times out, we only have a lower bound on its true runtime.
3) Scalability: tens of thousands of points
4) Explicit predictive uncertainties
5) Accuracy of predictions
Considered models:
- Linear regression (basis functions? especially for discrete inputs)
- Regression trees (no uncertainty estimates)
- Gaussian processes (4 & 5 OK, 1 done, 2 almost done, hopefully 3)

7 Coming up
1) Implemented: model average runtimes, optimize based on that model
- Censoring "almost" integrated
2) Further TODOs:
- Active learning criterion under noise
- Scaling: Bayesian committee machine

8 Active learning for function optimization
EGO [Jones, Schonlau & Welch, 1998]
- Assumes deterministic functions
  - Here: averages over 100 instances
- Start with a Latin hypercube design
  - Run the algorithm, get (x_i, o_i) pairs
- While not terminated:
  - Fit the model (kernel parameter optimization, all continuous)
  - Find the best point to sample (optimization in the space of parameter configurations)
  - Run the algorithm at that point, add the new (x, y) pair
(A sketch of this loop follows below.)
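To make the loop concrete, here is a minimal Python sketch. The `run_algorithm` and `fit_gp` callables and all names are placeholder assumptions (the talk's implementation is not shown); SciPy's `qmc` module provides the Latin hypercube design, and a crude random-candidate search stands in for the DIRECT / local-search optimization of slide 10.

```python
import numpy as np
from scipy.stats import norm, qmc


def expected_improvement(mu, sigma, f_min):
    """Closed-form EI for Gaussian predictions (minimization)."""
    z = (f_min - mu) / sigma
    return (f_min - mu) * norm.cdf(z) + sigma * norm.pdf(z)


def ego_loop(run_algorithm, fit_gp, bounds, n_init=10, n_iter=50):
    """Sketch of the EGO loop: Latin hypercube design, then iteratively
    fit a GP and run the algorithm where expected improvement is maximal."""
    lows, highs = zip(*bounds)
    sampler = qmc.LatinHypercube(d=len(bounds))
    X = qmc.scale(sampler.random(n=n_init), lows, highs)
    y = np.array([run_algorithm(x) for x in X])

    for _ in range(n_iter):
        model = fit_gp(X, y)                        # kernel parameter optimization happens inside
        cand = qmc.scale(sampler.random(n=1000), lows, highs)
        mu, var = model.predict(cand)               # Gaussian predictions
        ei = expected_improvement(mu, np.sqrt(var), y.min())
        x_next = cand[np.argmax(ei)]                # best point to sample
        X = np.vstack([X, x_next])
        y = np.append(y, run_algorithm(x_next))     # run the algorithm there
    return X[np.argmin(y)], y.min()
```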

9 Active learning criterion
EGO uses maximum expected improvement:
- EI(x) = ∫ p(y | μ_x, σ²_x) max(0, f_min − y) dy
  - Easy to evaluate (can be solved in closed form)
Problem in EGO: sometimes not the actual runtime y is modeled, but a transformation of it, e.g. log(y).
Expected improvement then needs to be adapted:
- EI(x) = ∫ p(y | μ_x, σ²_x) max(0, f_min − exp(y)) dy
  - Easy to evaluate (can still be solved in closed form)
Taking into account the cost of a sample:
- EI(x) = ∫ p(y | μ_x, σ²_x) (1/exp(y)) max(0, f_min − exp(y)) dy
  - Easy to evaluate (can still be solved in closed form)
  - Not implemented yet (the others are implemented)
(Closed-form expressions for these criteria are sketched below.)
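The three criteria above admit closed forms. The expressions below are my own working-out of the integrals, assuming the model predicts log-runtime in the second and third case; they are a sketch, not the talk's code.

```python
import numpy as np
from scipy.stats import norm


def ei_gaussian(mu, sigma, f_min):
    """EI(x) = ∫ p(y | mu, sigma^2) max(0, f_min - y) dy, closed form."""
    z = (f_min - mu) / sigma
    return (f_min - mu) * norm.cdf(z) + sigma * norm.pdf(z)


def ei_log_model(mu, sigma, f_min):
    """EI when the model predicts y = log(runtime):
    EI(x) = ∫ p(y | mu, sigma^2) max(0, f_min - exp(y)) dy."""
    u = (np.log(f_min) - mu) / sigma
    return f_min * norm.cdf(u) - np.exp(mu + 0.5 * sigma**2) * norm.cdf(u - sigma)


def ei_per_cost(mu, sigma, f_min):
    """Cost-weighted variant:
    EI(x) = ∫ p(y | mu, sigma^2) (1/exp(y)) max(0, f_min - exp(y)) dy."""
    u = (np.log(f_min) - mu) / sigma
    return f_min * np.exp(-mu + 0.5 * sigma**2) * norm.cdf(u + sigma) - norm.cdf(u)
```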

10 How to optimize expected improvement?
Currently only 3 algorithms to be tuned:
- SAPS (4 continuous params)
- SPEAR (26 parameters, about half of them discrete)
  - For now, the continuous ones are discretized
- CPLEX (60 params, 50 of them discrete)
  - For now, the continuous ones are discretized
Purely continuous / purely discrete optimization:
- DIRECT / multiple-restart local search (a sketch of the discrete case follows below)
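For the discretized configuration spaces, a multiple-restart one-exchange local search over an EI surrogate might look like the following sketch. The `ei` callable and the parameter domains are hypothetical; this is not the talk's implementation.

```python
import random


def local_search_ei(ei, domains, n_restarts=10, rng=random):
    """Multiple-restart one-exchange local search that maximizes an
    expected-improvement function `ei` over discrete configurations.
    `domains` maps each parameter name to its list of allowed values."""
    best_conf, best_val = None, float("-inf")
    for _ in range(n_restarts):
        conf = {p: rng.choice(vals) for p, vals in domains.items()}  # random restart
        val = ei(conf)
        improved = True
        while improved:
            improved = False
            for p, vals in domains.items():          # try changing one parameter at a time
                for v in vals:
                    if v == conf[p]:
                        continue
                    cand = dict(conf, **{p: v})
                    cand_val = ei(cand)
                    if cand_val > val:
                        conf, val, improved = cand, cand_val, True
        if val > best_val:
            best_conf, best_val = conf, val
    return best_conf, best_val
```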

11 GPs: which kernel to use?
Kernel: distance measure between two data points.
- Low distance → high correlation
Squared exponential, Matern, etc.:
- SE: k(x, x') = σ_s exp(−Σ_i l_i (x_i − x_i')²)
For discrete parameters: a new Hamming distance kernel
- k(x, x') = σ_s exp(−Σ_i l_i [x_i ≠ x_i'])
- Positive definite by reduction to string kernels
"Automatic relevance determination"
- One length scale parameter l_i per dimension
- Many kernel parameters lead to:
  - Problems with overfitting
  - Very long runtimes for kernel parameter optimization
  - For CPLEX: 60 extra parameters, about 15h for a single kernel parameter optimization using DIRECT, without any improvement
Thus: no length scale parameters. Only two parameters: the noise σ_n, and the overall variability of the signal, σ_s.
(A sketch of the two kernels follows below.)
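A small sketch of the two kernels with the per-dimension length scales dropped, as decided above (only σ_s appears here; the noise σ_n would enter on the diagonal of the kernel matrix). Function names and the example configurations are mine.

```python
import numpy as np


def se_kernel(x1, x2, sigma_s=1.0):
    """Squared-exponential kernel without per-dimension length scales:
    k(x, x') = sigma_s * exp(-sum_i (x_i - x_i')^2)."""
    d = np.sum((np.asarray(x1, float) - np.asarray(x2, float)) ** 2)
    return sigma_s * np.exp(-d)


def hamming_kernel(x1, x2, sigma_s=1.0):
    """Hamming-distance kernel for discrete parameters:
    k(x, x') = sigma_s * exp(-sum_i [x_i != x_i'])."""
    d = sum(a != b for a, b in zip(x1, x2))
    return sigma_s * np.exp(-d)


# Example: kernel matrix for a few (hypothetical) discrete configurations
configs = [("bfs", 1, "on"), ("dfs", 1, "on"), ("dfs", 2, "off")]
K = np.array([[hamming_kernel(a, b) for b in configs] for a in configs])
```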

12 Continuing from last week
... where were we?
- Start with a Latin hypercube design
  - Run the algorithm, get (x_i, o_i) pairs
- While not terminated:
  - Fit the model (kernel parameter optimization, all continuous)
    - Haven't covered this yet; coming up
    - Censoring will come in here
  - Find the best point to sample (optimization in the space of parameter configurations)
    - Covered last week
  - Run the algorithm at that point, add the new (x, y) pair

13 How to optimize kernel parameters?
Objective
- Standard: maximize the marginal likelihood p(o) = ∫ p(o|f) p(f) df
  - Doesn't work under censoring
- Alternative: maximize the likelihood of unseen data using cross-validation, p(o_test | μ_test, σ_test)
  - Efficient when not too many folds k are used:
    - The marginal likelihood requires inversion of an N×N matrix
    - Cross-validation with k=2 requires inversions of two N/2 × N/2 matrices; in practice faster for large N
Algorithm
- Using DIRECT (DIviding RECTangles), a global sampling-based method (does not scale to high dimensions)
(A sketch of the cross-validated scoring is given below.)
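A sketch of the cross-validated scoring of one kernel-parameter setting, assuming a hypothetical `fit_gp(X_train, y_train, theta)` that returns a model with Gaussian predictions; this stands in for the censored-data version used in the talk.

```python
import numpy as np
from scipy.stats import norm


def cv_log_likelihood(theta, X, y, fit_gp, k=2, rng=None):
    """Score kernel parameters `theta` by the cross-validated log-likelihood
    of held-out observations under the GP's Gaussian predictions."""
    rng = np.random.default_rng(0) if rng is None else rng
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    total = 0.0
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit_gp(X[train], y[train], theta)
        mu, var = model.predict(X[test])
        total += norm.logpdf(y[test], loc=mu, scale=np.sqrt(var)).sum()
    return total

# The kernel parameters are then chosen by maximizing this score,
# e.g. with a global optimizer such as DIRECT over (sigma_n, sigma_s).
```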

14 Censoring complicates predictions
p(f_{1:N} | o_{1:N}) ∝ p(f_{1:N}) × p(o_{1:N} | f_{1:N}), both Gaussian.
- For a censored data point o_i, p(o_i | f_i) = Φ((o_i − μ_i)/σ_i), not Gaussian at all.
- But the product p(f_{1:N} | o_{1:N}) ∝ p(f_{1:N}) × p(o_{1:N} | f_{1:N}) is closer to Gaussian.
- Laplace approximation: find the mode of p(f_{1:N} | o_{1:N}), use the Hessian at that point as a second-order approximation of the precision matrix.
- Finding the mode: gradient- and Hessian-based numerical optimization in N dimensions, where N = number of data points.
  - Without censoring there is a closed form, but it is still O(N³).
How to score a kernel parameter configuration?
- Cross-validated likelihood of unseen test data under the predictive distribution.
- I.e. for each fold, learn a model under censoring and predict the unseen validation data.
(A sketch of the Laplace mode-finding step follows below.)
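A rough sketch of the mode-finding step of the Laplace approximation under censoring. The censored term is written as a normal-CDF survival term, which is my own sign convention and may differ from the slide's Φ((o_i − μ_i)/σ_i), and a generic quasi-Newton optimizer stands in for the gradient- and Hessian-based optimization mentioned above.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def laplace_mode(K, o, censored, sigma_n):
    """Find the mode of p(f | o) ∝ N(f; 0, K) * prod_i p(o_i | f_i), where an
    uncensored o_i contributes a Gaussian term and a censored o_i contributes
    a normal-CDF term (the run only gives a lower bound o_i on its runtime).
    The Hessian at this mode would give the precision matrix of the Gaussian
    approximation. (A sketch, not the talk's implementation.)"""
    K_inv = np.linalg.inv(K)

    def neg_log_posterior(f):
        prior = 0.5 * f @ K_inv @ f
        unc = ~censored
        lik = -norm.logpdf(o[unc], loc=f[unc], scale=sigma_n).sum()
        lik -= norm.logcdf((f[censored] - o[censored]) / sigma_n).sum()
        return prior + lik

    res = minimize(neg_log_posterior, x0=o.copy(), method="BFGS")
    return res.x

# Usage sketch: f_mode = laplace_mode(K, observations, censored_mask, 0.1)
```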

15 Don’t use censored data, 4s

16 Treat as “completed at threshold”, 4s

17 Laplace approximation to posterior, 10s

18 Schmee & Hahn, 21 iterations, 36s

19 Anecdotal: Lin’s original implementation of Schmee & Hahn, on my machine – beware of normpdf

20 A counterintuitive example from practice (same hyperparameters in same rows)

21 TODO: Active learning under noise
[Williams, Santner, and Notz, 2000]
- Very heavy on notation, but there is good stuff in there.
1) Actively choose a parameter setting
- The best setting so far is not known → f_min is now a random variable.
- Take joint samples f_{1:N}^{(i)} of performance from the predictive distribution over all settings tried so far (sample from our Gaussian approximation to p(f_{1:N} | o_{1:N})).
  - Take the min of each sample and compute expected improvement as if that min were the deterministic f_min.
  - Average the expected improvements computed for 100 independent samples.
  - Efficiency: the most costly part of evaluating expected improvement at a parameter configuration is the probabilistic prediction with the GP; even with many samples we only need to predict once.
2) Actively choose an instance to run for that parameter setting: minimize posterior variance.
(A sketch of the sampling-based EI follows below.)
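A sketch of the sampling scheme in step 1, assuming we already have the Gaussian approximation over the settings tried so far and the candidate's GP prediction (mu_x, sigma_x); all names are placeholders.

```python
import numpy as np
from scipy.stats import norm


def noisy_ei(mu_x, sigma_x, mean_1n, cov_1n, n_samples=100, rng=None):
    """Expected improvement when the incumbent is uncertain: draw joint
    samples of the performance of all settings tried so far from
    N(mean_1n, cov_1n), take the minimum of each sample as f_min, compute
    closed-form EI for the candidate's prediction (mu_x, sigma_x), and
    average over the samples."""
    rng = np.random.default_rng(0) if rng is None else rng
    samples = rng.multivariate_normal(mean_1n, cov_1n, size=n_samples)
    f_mins = samples.min(axis=1)                   # one incumbent per joint sample
    z = (f_mins - mu_x) / sigma_x
    ei = (f_mins - mu_x) * norm.cdf(z) + sigma_x * norm.pdf(z)
    return ei.mean()
```

Note that the candidate's GP prediction is computed once; only the cheap closed-form EI is re-evaluated per sample, matching the efficiency remark above.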

22 TODO: Integrating expected cost into the AL criterion
The EI criterion discussed last time that takes into account the cost of a sample:
- EI(x) = ∫ p(y | μ_x, σ²_x) (1/exp(y)) max(0, f_min − exp(y)) dy
  - Easy to evaluate (can still be solved in closed form)
- The above approach for noisy functions re-uses EI for deterministic functions, so it could use this.
Open question: should the cost be taken into account when selecting an instance for that parameter setting?
Another open question: how to select the censoring threshold?
- Something simple might suffice, such as picking a cutoff equal to the predicted runtime or to the best runtime so far.
  - The integration bounds in expected improvement would change, but nothing else.

23 TODO: scaling
Bayesian committee machine
- More or less a mixture of GPs, each on a small subset of the data (cluster the data ahead of time).
- Fairly straightforward wrapper around GP code (actually around any code that provides Gaussian predictions).
- Maximizing cross-validated performance is easy.
- In principle we could update by just updating one component at a time.
  - But in practice, once we re-optimize kernel parameters we're changing every component anyway.
  - Likewise, we can do rank-1 updates for the basic GPs, but a single matrix inversion is really not the expensive part (rather the 1000s of matrix inversions for kernel parameter optimization).
(A sketch of the BCM prediction combination follows below.)
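A sketch of how a Bayesian committee machine might combine the Gaussian predictions of its GP modules at a test configuration, using Tresp's standard combination rule (an assumption on my part; the talk does not spell out the formula).

```python
import numpy as np


def bcm_predict(means, variances, prior_var):
    """Bayesian committee machine combination of Gaussian predictions from
    M GP modules trained on disjoint subsets of the data.
    Combined precision:  1/var = sum_i 1/var_i - (M - 1)/prior_var
    Combined mean:       mu    = var * sum_i mu_i/var_i"""
    means = np.asarray(means, float)
    variances = np.asarray(variances, float)
    m = len(means)
    precision = np.sum(1.0 / variances) - (m - 1) / prior_var
    var = 1.0 / precision
    mu = var * np.sum(means / variances)
    return mu, var

# Example: combining three modules' predictions at one test configuration
mu, var = bcm_predict([2.1, 2.4, 1.9], [0.3, 0.5, 0.4], prior_var=1.0)
```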

24 Preliminary results and demo
Experiments with a noise-free kernel
- Great cross-validation results for SPEAR & CPLEX
- Poor cross-validation results for SAPS
Explanation
- Even when averaging 100 instances, the response is NOT noise-free.
- SAPS is continuous:
  - We can pick configurations arbitrarily close to each other.
  - If their results differ substantially, the SE kernel must have huge variance → very poor results.
- The Matern kernel works better for SAPS.

25 Future work (figures from the EGO paper)
We can get main effects and interaction effects, much like in ANOVA.
- The integrals seem to be solvable in closed form.
We can get plots of predicted mean and variance as one parameter is varied, marginalized over all others.
- Similarly as two or three are varied.
- This allows for plots of interactions.

