Gaussian Process Optimization in the Bandit Setting: No Regret & Experimental Design. Niranjan Srinivas and Andreas Krause (Caltech), Sham Kakade (Wharton), Matthias Seeger (Saarland).
Multi-armed bandits. At each time $t$, pick an arm $i$ and receive an independent payoff $f_t$ with mean $\mu_i$. Classic model for the exploration-exploitation tradeoff; extensively studied (Robbins ’52, Gittins ’79). Typically assumes each arm is tried multiple times. Goal: minimize regret. [Figure: $K$ arms with mean payoffs $\mu_1, \mu_2, \mu_3, \dots, \mu_K$]
Infinite-armed bandits. In many applications the number of arms is huge (sponsored search, sensor selection), so we cannot try each arm even once. Assumptions on the payoff function $f$ are essential. [Figure: arms with payoffs $p_1, p_2, p_3, \dots, p_k, \dots, p_\infty$]
Optimizing Noisy, Unknown Functions. Given: a set of possible inputs $D$ and black-box access to an unknown function $f$. Want: an adaptive choice of inputs $x_1, \dots, x_T$ from $D$ maximizing $\sum_{t=1}^{T} f(x_t)$. Many applications: robotic control [Lizotte et al. ’07], sponsored search [Pande & Olston ’07], clinical trials, … Sampling is expensive. Algorithms are evaluated using the regret $R_T = \sum_{t=1}^{T} \bigl(f(x^*) - f(x_t)\bigr)$, where $x^* = \arg\max_{x \in D} f(x)$. Goal: minimize $R_T$.
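When $f$ is known on a finite input set, as in a simulation, the cumulative regret $R_T = \sum_{t=1}^{T} (f(x^*) - f(x_t))$ can be computed directly. The following is a minimal sketch under that assumption; the function name `cumulative_regret` and the toy payoff values are illustrative and do not come from the slides.

```python
import numpy as np

def cumulative_regret(f_values, chosen):
    """Running regret R_1, ..., R_T for a finite input set D.

    f_values[k] is the (noise-free) payoff of input k; chosen[t] is the
    index picked in round t. Both are hypothetical simulation inputs.
    """
    f_values = np.asarray(f_values, dtype=float)
    chosen = np.asarray(chosen, dtype=int)
    # Instantaneous regret is the gap to the best input in D.
    instantaneous = f_values.max() - f_values[chosen]
    return np.cumsum(instantaneous)

# Toy example: three inputs, four rounds (numbers are made up).
f_values = np.array([0.2, 0.9, 0.5])
regret = cumulative_regret(f_values, [0, 2, 1, 1])
```

Once the policy settles on the best input the regret curve goes flat, which is the behaviour a no-regret algorithm should approach.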
Running example: Noisy Search. How do we find the hottest point in a building? Many noisy sensors are available, but sampling is expensive. $D$: set of sensors; $f(x_i)$: temperature at the sensor $x_i$ chosen at step $i$. Observe $y_i = f(x_i) + \epsilon_i$. Goal: find $x^* = \arg\max_{x \in D} f(x)$ with a minimal number of queries.
Relating to us: Active learning for PMF. A bandit setting for movie recommendation. Task: recommend movies for a new user. $M$-armed bandit: each movie item is an arm. For a new user $i$, at each round $t$: pick a movie $j$ and observe a rating $X_{ij}$. Goal: maximize the cumulative reward, i.e., the sum of the ratings of all recommended movies. Model: PMF, $X = UV + E$, where $U$ is an $N \times K$ matrix, $V$ is a $K \times M$ matrix, and $E$ is an $N \times M$ matrix of zero-mean normally distributed noise. Assume the movie features $V$ are fully observed; the user feature vector $U_i$ is unknown at first. $X_i(j) = U_i V_j + \epsilon$ (regard the $i$th row vector of $X$ as a function $X_i$), so $X_i(\cdot)$ is a random linear function.
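Because $V$ is observed and each rating is linear in the unknown user vector, a Gaussian prior on $U_i$ gives a closed-form Gaussian posterior after every rating. The sketch below assumes a prior $U_i \sim N(0, \texttt{prior\_var} \cdot I)$ and a known noise variance; these choices, and the names `user_posterior` and `predict_rating`, are assumptions for illustration and are not specified on the slide.

```python
import numpy as np

def user_posterior(V, items, ratings, noise_var=0.25, prior_var=1.0):
    """Gaussian posterior over one user's K-dimensional feature vector U_i.

    V: K x M movie-feature matrix (assumed known); items: indices of the
    movies rated so far; ratings: the observed X_ij for those movies.
    """
    V = np.asarray(V, dtype=float)
    items = np.asarray(items, dtype=int)
    ratings = np.asarray(ratings, dtype=float)
    Vs = V[:, items]                                   # K x n rated-movie features
    precision = np.eye(V.shape[0]) / prior_var + Vs @ Vs.T / noise_var
    cov = np.linalg.inv(precision)
    mean = cov @ (Vs @ ratings) / noise_var
    return mean, cov

def predict_rating(V, j, mean, cov):
    """Posterior mean and variance of the noise-free rating U_i . V_j."""
    v = np.asarray(V, dtype=float)[:, j]
    return float(v @ mean), float(v @ cov @ v)

# Toy check on synthetic data (all numbers are made up).
rng = np.random.default_rng(0)
V = rng.normal(size=(3, 50))
u_true = np.array([1.0, -0.5, 2.0])
rated = np.arange(40)
ratings = u_true @ V[:, rated] + rng.normal(scale=0.5, size=rated.size)
mean, cov = user_posterior(V, rated, ratings)
pred_mean, pred_var = predict_rating(V, 45, mean, cov)
```

The predictive mean and variance returned by `predict_rating` are the quantities an uncertainty-based or UCB-style movie selection rule would need.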
Key insight: Exploit correlation. Sampling $f(x)$ at one point $x$ yields information about $f(x')$ for points $x'$ near $x$. In this paper: model correlation using a Gaussian process (GP) prior for $f$. [Figure: temperature is spatially correlated]
Gaussian Processes to model payoff $f$. A Gaussian process (GP) is a normal distribution over functions: finite marginals are multivariate Gaussians. Closed-form formulae for the Bayesian posterior update exist. Parameterized by a covariance function $K(x, x') = \mathrm{Cov}(f(x), f(x'))$. [Figure: normal distribution (1-D Gaussian); multivariate normal (n-D Gaussian); Gaussian process (∞-D Gaussian)]
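The closed-form posterior update can be sketched in a few lines. This example assumes 1-D inputs, a zero prior mean, Gaussian observation noise with variance `noise_var`, and the squared exponential kernel $\exp(-(x - x')^2 / h^2)$; the function names and the toy observations are illustrative only.

```python
import numpy as np

def sq_exp_kernel(a, b, h=0.3):
    """K(x, x') = exp(-(x - x')^2 / h^2) for 1-D inputs."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return np.exp(-((a - b) ** 2) / h**2)

def gp_posterior(x_obs, y_obs, x_query, h=0.3, noise_var=0.01):
    """Posterior mean and covariance of f at x_query given noisy observations."""
    y_obs = np.asarray(y_obs, dtype=float)
    K = sq_exp_kernel(x_obs, x_obs, h) + noise_var * np.eye(len(y_obs))
    k_star = sq_exp_kernel(x_obs, x_query, h)          # n_obs x n_query
    # mean = k*^T (K + s^2 I)^{-1} y ; cov = K** - k*^T (K + s^2 I)^{-1} k*
    mean = k_star.T @ np.linalg.solve(K, y_obs)
    cov = sq_exp_kernel(x_query, x_query, h) - k_star.T @ np.linalg.solve(K, k_star)
    return mean, cov

# Two made-up observations; query at an observed point, between them, and far away.
x_obs = np.array([0.2, 0.8])
y_obs = np.array([1.0, -1.0])
x_query = np.array([0.2, 0.5, 5.0])
mean, cov = gp_posterior(x_obs, y_obs, x_query)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
```

The posterior standard deviation shrinks near the observations and returns to the prior value far from them, which is the correlation structure the bandit algorithms exploit.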
Thinking about GPs. The kernel function $K(x, x')$ specifies the covariance and encodes smoothness assumptions. [Figure: $f(x)$ against $x$, with the marginal density $P(f(x))$]
Example of GPs. Squared exponential kernel $K(x, x') = \exp\bigl(-(x - x')^2 / h^2\bigr)$. [Figure: kernel value against distance $|x - x'|$, and samples from $P(f)$, for bandwidths $h = 0.1$ and $h = 0.3$]
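Samples like those in the figure can be drawn by factorizing the kernel matrix on a grid. This is a sketch assuming a zero-mean prior on $[0, 1]$; the grid size, the jitter term added for numerical stability, and the `roughness` helper are illustrative assumptions.

```python
import numpy as np

def sq_exp_kernel(a, b, h):
    """K(x, x') = exp(-(x - x')^2 / h^2) for 1-D inputs."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return np.exp(-((a - b) ** 2) / h**2)

def sample_gp_prior(grid, h, n_samples, seed=0, jitter=1e-6):
    """Draw functions f ~ GP(0, K) evaluated on `grid`; one sample per row."""
    rng = np.random.default_rng(seed)
    # Jitter keeps the nearly singular kernel matrix positive definite.
    K = sq_exp_kernel(grid, grid, h) + jitter * np.eye(len(grid))
    L = np.linalg.cholesky(K)                          # K = L L^T
    z = rng.standard_normal((len(grid), n_samples))
    return (L @ z).T

def roughness(samples):
    """Mean squared difference between neighbouring grid values."""
    return float(np.mean(np.diff(samples, axis=1) ** 2))

grid = np.linspace(0.0, 1.0, 100)
wiggly = sample_gp_prior(grid, h=0.1, n_samples=20)
smooth = sample_gp_prior(grid, h=0.3, n_samples=20)
```

Comparing `roughness(wiggly)` with `roughness(smooth)` gives a crude numerical version of what the plots show: a smaller bandwidth yields functions that vary faster.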
Gaussian process optimization [e.g., Jones et al. ’98]. Goal: adaptively pick inputs $x_1, x_2, \dots$ such that $\frac{1}{T}\sum_{t=1}^{T} \bigl(f(x^*) - f(x_t)\bigr) \to 0$, where $x^*$ is a maximizer of $f$. Key question: how should we pick samples? So far, only heuristics: Expected Improvement [Močkus et al. ’78], Most Probable Improvement [Močkus ’89]. Used successfully in machine learning [Ginsbourger et al. ’08, Jones ’01, Lizotte et al. ’07]. No theoretical guarantees on their regret! [Figure: $f(x)$ against $x$]
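Simple algorithm for GP optimization. In each round $t$: pick $x_t = \arg\max_{x \in D} \mu_{t-1}(x)$; observe $y_t = f(x_t) + \epsilon_t$; use Bayes’ rule to get the posterior mean $\mu_t(x)$. Can get stuck in local maxima! [Figure: $f(x)$ against $x$]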
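Uncertainty sampling. Pick $x_t = \arg\max_{x \in D} \sigma_{t-1}^2(x)$, the input with the largest posterior variance. That is equivalent to (greedily) maximizing information gain, a popular objective in Bayesian experimental design (where the goal is pure exploration of $f$). But it wastes samples by exploring $f$ everywhere! [Figure: $f(x)$ against $x$]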
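The equivalence with greedy information gain can be checked numerically. For a GP with Gaussian noise of variance $\sigma^2$, observing a point with posterior variance $\sigma_{t-1}^2(x)$ adds $\tfrac{1}{2}\log(1 + \sigma^{-2}\sigma_{t-1}^2(x))$ to the information gain, and the total for a set $A$ is $\tfrac{1}{2}\log\det(I + \sigma^{-2}K_A)$. The sketch below assumes a 1-D grid and the squared exponential kernel; the function names are illustrative.

```python
import numpy as np

def sq_exp_kernel(a, b, h=0.3):
    """K(x, x') = exp(-(x - x')^2 / h^2) for 1-D inputs."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return np.exp(-((a - b) ** 2) / h**2)

def uncertainty_sampling(grid, n_rounds, h=0.3, noise_var=0.01):
    """Repeatedly pick the grid point with the largest posterior variance."""
    K_full = sq_exp_kernel(grid, grid, h)
    var = np.diag(K_full).copy()                       # prior variance
    chosen, gains = [], []
    for _ in range(n_rounds):
        i = int(np.argmax(var))
        # Marginal information gain of observing y at grid[i].
        gains.append(0.5 * np.log1p(var[i] / noise_var))
        chosen.append(i)
        K_A = K_full[np.ix_(chosen, chosen)] + noise_var * np.eye(len(chosen))
        k_A = K_full[chosen, :]                        # |A| x n cross-covariances
        var = np.diag(K_full) - np.sum(k_A * np.linalg.solve(K_A, k_A), axis=0)
    return chosen, np.array(gains)

def information_gain(grid, chosen, h=0.3, noise_var=0.01):
    """I(f; y_A) = 0.5 * log det(I + K_A / noise_var)."""
    K_A = sq_exp_kernel(grid[chosen], grid[chosen], h)
    _, logdet = np.linalg.slogdet(np.eye(len(chosen)) + K_A / noise_var)
    return 0.5 * logdet

grid = np.linspace(0.0, 1.0, 50)
chosen, gains = uncertainty_sampling(grid, n_rounds=10)
total = information_gain(grid, chosen)
```

The per-round gains sum to the log-determinant formula, and they shrink from round to round because the largest remaining variance can only decrease.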
Avoiding unnecessary samples. Key insight: we never need to sample where the Upper Confidence Bound (UCB) is below the best lower bound! [Figure: $f(x)$ against $x$, with the best lower bound marked]
Upper Confidence Bound (UCB) Algorithm. Pick the input that maximizes the Upper Confidence Bound (UCB): $x_t = \arg\max_{x \in D} \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x)$. Naturally trades off exploration and exploitation; no samples wasted. Regret bounds exist for the classic setting [Auer ’02] and for linear $f$ [Dani et al. ’07], but none in the GP optimization setting, where UCB is a popular heuristic. How should we choose $\beta_t$? Need theory! [Figure: $f(x)$ against $x$]
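A minimal sketch of the UCB rule on a finite 1-D grid, choosing $x_t$ to maximize $\mu_{t-1}(x) + \beta_t^{1/2}\sigma_{t-1}(x)$. The schedule $\beta_t = 2\log(|D|\, t^2 \pi^2 / (6\delta))$ is one choice of the $\Theta(\log t)$ form, recalled from the accompanying paper's finite-$D$ analysis; treat it, the squared exponential kernel, the test function `f`, and all parameter values as assumptions for illustration.

```python
import numpy as np

def sq_exp_kernel(a, b, h=0.3):
    """K(x, x') = exp(-(x - x')^2 / h^2) for 1-D inputs."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return np.exp(-((a - b) ** 2) / h**2)

def gp_ucb(f, grid, n_rounds, h=0.3, noise_var=0.01, delta=0.1, seed=0):
    """Run the UCB rule on `grid`; returns the indices chosen in each round."""
    rng = np.random.default_rng(seed)
    K_full = sq_exp_kernel(grid, grid, h)
    mu = np.zeros(len(grid))                           # prior mean
    var = np.diag(K_full).copy()                       # prior variance
    chosen, ys = [], []
    for t in range(1, n_rounds + 1):
        # Assumed schedule; grows like log t.
        beta_t = 2.0 * np.log(len(grid) * t**2 * np.pi**2 / (6.0 * delta))
        i = int(np.argmax(mu + np.sqrt(beta_t * var)))
        y = f(grid[i]) + rng.normal(0.0, np.sqrt(noise_var))
        chosen.append(i)
        ys.append(y)
        # Closed-form GP posterior on the whole grid.
        K_A = K_full[np.ix_(chosen, chosen)] + noise_var * np.eye(len(chosen))
        k_A = K_full[chosen, :]
        mu = k_A.T @ np.linalg.solve(K_A, np.array(ys))
        var = np.diag(K_full) - np.sum(k_A * np.linalg.solve(K_A, k_A), axis=0)
        var = np.clip(var, 0.0, None)                  # guard against round-off
    return np.array(chosen)

def f(x):
    """Made-up payoff with a single peak at x = 0.7."""
    return np.exp(-((x - 0.7) ** 2) / 0.05)

grid = np.linspace(0.0, 1.0, 100)
chosen = gp_ucb(f, grid, n_rounds=60)
```

Early rounds spread out because the variance term dominates; later rounds concentrate near the peak once the confidence bands elsewhere fall below the best lower bound.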
How well does UCB work? Intuitively, performance should depend on how “learnable” the function is. The quicker the confidence bands collapse, the easier $f$ is to learn. Key idea: the rate of collapse corresponds to the growth of information gain. [Figure: an “easy” function (bandwidth $h = 0.3$) and a “hard” function (bandwidth $h = 0.1$)]
Learnability and information gain. We show that regret bounds depend on how quickly we can gain information. Mathematically: $\gamma_T = \max_{A \subseteq D,\, |A| \le T} I(f; y_A)$. This establishes a novel connection between GP optimization and Bayesian experimental design.
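Performance of optimistic sampling. Theorem: if we choose $\beta_t = \Theta(\log t)$, then with high probability $R_T = O^*\bigl(\sqrt{T \gamma_T}\bigr)$, where $O^*$ hides logarithmic factors. Here $\gamma_T = \max_{A \subseteq D,\, |A| \le T} I(f; y_A)$ is the maximal information gain due to sampling. The slower $\gamma_T$ grows, the easier $f$ is to learn. Key question: how quickly does $\gamma_T$ grow?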
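A rough sketch of why the information gain enters the regret bound, following the usual UCB argument. The notation $r_t$ for instantaneous regret and the constant $C_1$ are recalled from the accompanying paper rather than shown on the slide, so treat them as assumptions; the argument also assumes $K(x, x) \le 1$ and a nondecreasing $\beta_t$.

```latex
% Assume that, with high probability, for all x in D and all t:
%   |f(x) - \mu_{t-1}(x)| \le \beta_t^{1/2} \sigma_{t-1}(x).
\begin{align*}
r_t &= f(x^*) - f(x_t)
     \le \mu_{t-1}(x^*) + \beta_t^{1/2}\sigma_{t-1}(x^*) - f(x_t) \\
    &\le \mu_{t-1}(x_t) + \beta_t^{1/2}\sigma_{t-1}(x_t) - f(x_t)
     \le 2\,\beta_t^{1/2}\,\sigma_{t-1}(x_t), \\
I(f; y_{1:T}) &= \tfrac{1}{2}\sum_{t=1}^{T}
     \log\bigl(1 + \sigma^{-2}\sigma_{t-1}^{2}(x_t)\bigr) \le \gamma_T, \\
\sum_{t=1}^{T} r_t^{2} &\le 4\,\beta_T \sum_{t=1}^{T} \sigma_{t-1}^{2}(x_t)
     \le C_1\,\beta_T\,\gamma_T,
     \qquad C_1 = \frac{8}{\log(1 + \sigma^{-2})}, \\
R_T &= \sum_{t=1}^{T} r_t \le \sqrt{T \sum_{t=1}^{T} r_t^{2}}
     \le \sqrt{C_1\,T\,\beta_T\,\gamma_T}.
\end{align*}
```

The second inequality in the first chain uses the fact that $x_t$ maximizes the upper confidence bound, and the last line is the Cauchy-Schwarz inequality.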
Learnability and information gain. Information gain exhibits diminishing returns (submodularity) [Krause & Guestrin ’05]. Our bounds depend on the “rate” of diminishment. [Figure panels: “Little diminishing returns”; “Returns diminish fast”]
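The rate of diminishing returns can be illustrated by tracking the information gain of greedy max-variance sampling for two bandwidths. Greedy selection is used here only as a stand-in for the maximum over all sets of size $T$ (by submodularity it is a constant-factor approximation); the grid, noise variance, and function name are assumptions for illustration.

```python
import numpy as np

def greedy_information_gain(grid, n_rounds, h, noise_var=0.01):
    """Cumulative information gain of greedy max-variance sampling.

    Uses the squared exponential kernel exp(-(x - x')^2 / h^2) on a 1-D grid.
    Entry T-1 of the result approximates gamma_T from below.
    """
    d = grid[:, None] - grid[None, :]
    K = np.exp(-(d**2) / h**2)
    var = np.diag(K).copy()
    chosen, gains = [], []
    for _ in range(n_rounds):
        i = int(np.argmax(var))
        chosen.append(i)
        # Marginal gain of one more noisy observation at grid[i].
        gains.append(0.5 * np.log1p(var[i] / noise_var))
        K_A = K[np.ix_(chosen, chosen)] + noise_var * np.eye(len(chosen))
        k_A = K[chosen, :]
        var = np.diag(K) - np.sum(k_A * np.linalg.solve(K_A, k_A), axis=0)
        var = np.clip(var, 0.0, None)                  # guard against round-off
    return np.cumsum(gains)

grid = np.linspace(0.0, 1.0, 200)
gain_smooth = greedy_information_gain(grid, 30, h=0.3)  # "easy" function class
gain_rough = greedy_information_gain(grid, 30, h=0.1)   # "hard" function class
```

Both curves are concave in the number of samples, but the curve for the larger bandwidth flattens sooner, which matches the intuition that smoother functions are easier to learn.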
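Dealing with high dimensions. Theorem: for various popular kernels on $D \subset \mathbb{R}^d$, we have: linear: $\gamma_T = O(d \log T)$; squared-exponential: $\gamma_T = O\bigl((\log T)^{d+1}\bigr)$; Matérn with $\nu > 1$: $\gamma_T = O\bigl(T^{d(d+1)/(2\nu + d(d+1))} \log T\bigr)$. Smoothness of $f$ helps battle the curse of dimensionality! Our bounds rely on the submodularity of the information gain $I(f; y_A)$.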
What if $f$ is not from a GP? In practice, $f$ may not be Gaussian. Theorem: let $f$ lie in the RKHS of kernel $K$ with $\|f\|_K^2 \le B$, and let the noise be bounded almost surely by $\sigma$. Choose $\beta_t = 2B + 300\,\gamma_t \log^3(t/\delta)$. Then with high probability, $R_T = O^*\bigl(\sqrt{T}\,(B\sqrt{\gamma_T} + \gamma_T)\bigr)$. This frees us from knowing the “true prior”. Intuitively, the bound depends on the “complexity” of the function through its RKHS norm.
Experiments: UCB vs. heuristics. Temperature data: 46 sensors deployed at Intel Research, Berkeley; data collected for 5 days (1 sample/minute); we want to adaptively find the highest temperature as quickly as possible. Traffic data: speed data from 357 sensors deployed along highway I-880 South; collected 6am-11am for one month; we want to find the most congested (lowest-speed) area as quickly as possible.
Comparison: UCB vs. heuristics. GP-UCB compares favorably with existing heuristics.
Assumptions on $f$. Linear? [Dani et al. ’07]: fast convergence, but a strong assumption. Lipschitz-continuous (bounded slope) [Kleinberg ’08]: very flexible, but
Conclusions. First theoretical guarantees and convergence rates for GP optimization. Both the true-prior and the agnostic case are covered. Performance depends on “learnability”, captured by the maximal information gain. Connects GP bandit optimization and experimental design! Performance on real data is comparable to other heuristics.