Gaussian Process Optimization in the Bandit Setting: No Regret & Experimental Design. Niranjan Srinivas (Caltech), Andreas Krause (Caltech), Sham Kakade (Wharton), Matthias Seeger (Saarland).

Presentation transcript:

Slide 1: Gaussian Process Optimization in the Bandit Setting: No Regret & Experimental Design. Niranjan Srinivas (Caltech), Andreas Krause (Caltech), Sham Kakade (Wharton), Matthias Seeger (Saarland). "theory and practice collide"

Slide 2: Multi-armed bandits. At each time t, pick arm i and get an independent payoff $f_t$ with mean $\mu_i$. The classic model for the exploration-exploitation tradeoff, extensively studied (Robbins '52, Gittins '79). Typically assumes each arm is tried multiple times. Goal: minimize regret. [Figure: K arms with unknown means $\mu_1, \mu_2, \mu_3, \ldots, \mu_K$]
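To make the classic setting concrete, here is a minimal simulation sketch; the arm means, noise level, and the simple epsilon-greedy rule are illustrative choices, not the algorithm analyzed in this talk.

```python
import numpy as np

# Minimal sketch of the classic K-armed bandit setting (illustrative only).
rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.9, 0.4])        # unknown arm means mu_1..mu_K
K, T, eps = len(mu), 1000, 0.1

counts, means = np.zeros(K), np.zeros(K)
regret = 0.0
for t in range(T):
    if rng.random() < eps:                  # explore a random arm
        i = int(rng.integers(K))
    else:                                   # exploit the current estimate
        i = int(np.argmax(means))
    y = mu[i] + rng.normal(0, 0.1)          # noisy payoff f_t with mean mu_i
    counts[i] += 1
    means[i] += (y - means[i]) / counts[i]  # running average of rewards
    regret += mu.max() - mu[i]              # instantaneous regret

print(f"average regret after T={T} rounds: {regret / T:.3f}")
```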

Slide 3: Infinite-armed bandits. In many applications, the number of arms is huge (sponsored search, sensor selection): we cannot try each arm even once. Assumptions on the payoff function f are essential. [Figure: an infinite sequence of arms with unknown mean payoffs]

Slide 4: Optimizing noisy, unknown functions. Given: a set of possible inputs D and black-box access to an unknown function f. Want: an adaptive choice of inputs $x_1, x_2, \ldots$ from D maximizing $\sum_t f(x_t)$. Many applications: robotic control [Lizotte et al. '07], sponsored search [Pande & Olston '07], clinical trials, ... Sampling is expensive, so algorithms are evaluated using the regret $r_t = \max_{x \in D} f(x) - f(x_t)$. Goal: minimize the cumulative regret $R_T = \sum_{t=1}^{T} r_t$, i.e., achieve $R_T / T \to 0$.

Slide 5: Running example: noisy search. How do we find the hottest point in a building? Many noisy sensors are available, but sampling is expensive. D: set of sensors; $f(x)$: temperature at the sensor $x$ chosen at step t. Observe $y_t = f(x_t) + \epsilon_t$. Goal: find $x^* = \arg\max_{x \in D} f(x)$ with a minimal number of queries.

Slide 6: Relating to us: active learning for PMF. A bandit setting for movie recommendation. Task: recommend movies to a new user. M-armed bandit: each movie (item) is an arm. For a new user i: at each round t, pick a movie j and observe a rating $X_{ij}$. Goal: maximize the cumulative reward, i.e., the sum of the ratings of all recommended movies. Model: PMF, $X = UV + E$, where U is an N×K matrix, V is a K×M matrix, and E is an N×M matrix of zero-mean Gaussian noise. Assume the movie features V are fully observed; the user features $U_i$ are unknown at first. $X_i(j) = U_i V_j + \varepsilon$ (regard the i-th row of X as a function $X_i$), so $X_i(\cdot)$ is a random linear function; a small simulation follows below.
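A hypothetical simulation of the PMF observation model above; the dimensions, noise level, and random features are made up purely to illustrate how a rating $X_i(j) = U_i V_j + \varepsilon$ is generated.

```python
import numpy as np

# Hypothetical instance of the PMF observation model X = UV + E.
rng = np.random.default_rng(1)
N, M, K = 50, 200, 5                  # users, movies, latent dimensions
U = rng.normal(size=(N, K))           # user features (unknown in the bandit)
V = rng.normal(size=(K, M))           # movie features (assumed observed)

i = 0                                 # a new user
j = int(rng.integers(M))              # pick a movie (arm) at some round t
rating = U[i] @ V[:, j] + rng.normal(0, 0.1)   # X_i(j) = U_i V_j + eps
print(f"observed rating of movie {j} for user {i}: {rating:.2f}")
```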

Slide 7: Key insight: exploit correlation. Sampling f(x) at one point x yields information about f(x') for points x' near x. In this paper: model correlation using a Gaussian process (GP) prior for f. [Figure: temperature is spatially correlated]

Slide 8: Gaussian processes to model the payoff f. A Gaussian process (GP) is a normal distribution over functions; its finite marginals are multivariate Gaussians, and closed-form formulae for the Bayesian posterior update exist (a sketch follows below). Parameterized by the covariance function $K(x, x') = \mathrm{Cov}(f(x), f(x'))$. [Figure: normal distribution (1-D Gaussian), multivariate normal (n-D Gaussian), Gaussian process (∞-D Gaussian)]
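The closed-form posterior update mentioned above fits in a few lines of NumPy. A minimal sketch, assuming a squared-exponential kernel and Gaussian observation noise; the helper name gp_posterior and its default parameters are our own choices.

```python
import numpy as np

def k_se(a, b, h=0.3):
    """Squared-exponential kernel K(x, x') = exp(-(x - x')^2 / h^2)."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / h ** 2)

def gp_posterior(x_train, y_train, x_test, noise_var=0.01, h=0.3):
    """Posterior mean and variance of f at x_test given noisy observations."""
    K = k_se(x_train, x_train, h) + noise_var * np.eye(len(x_train))
    K_s = k_se(x_train, x_test, h)
    K_ss = k_se(x_test, x_test, h)
    L = np.linalg.cholesky(K)                      # stable solve via Cholesky
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mu = K_s.T @ alpha                             # posterior mean
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - np.sum(v ** 2, axis=0)   # posterior variance
    return mu, var
```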

Slide 9: Thinking about GPs. The kernel function K(x, x') specifies the covariance and encodes smoothness assumptions. [Figure: a sample f(x) over inputs x, with the marginal distribution P(f(x)) at a fixed x]

Slide 10: Example of GPs. Squared-exponential kernel: $K(x, x') = \exp(-(x - x')^2 / h^2)$. [Figure: kernel value against distance |x - x'| and samples from P(f), for bandwidths h = 0.1 and h = 0.3]
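Samples like those pictured on this slide can be drawn directly from the finite marginals of the prior; the grid size and the number of sample paths below are arbitrary.

```python
import numpy as np

# Draw sample paths from the GP prior P(f) with the squared-exponential
# kernel, mirroring the two bandwidths on the slide.
rng = np.random.default_rng(2)
x = np.linspace(0, 1, 200)
for h in (0.1, 0.3):
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / h ** 2)
    K += 1e-8 * np.eye(len(x))          # jitter for numerical stability
    f = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
    print(f"h={h}: drew {f.shape[0]} sample paths on {len(x)} grid points")
```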

Slide 11: Gaussian process optimization [e.g., Jones et al. '98]. Goal: adaptively pick inputs $x_1, x_2, \ldots$ such that the average regret vanishes, $\frac{1}{T}\sum_{t=1}^{T}\big(\max_{x \in D} f(x) - f(x_t)\big) \to 0$. Key question: how should we pick samples? So far, only heuristics: Expected Improvement [Močkus et al. '78], Most Probable Improvement [Močkus '89]. Used successfully in machine learning [Ginsbourger et al. '08, Jones '01, Lizotte et al. '07], but with no theoretical guarantees on their regret!

Slide 12: Simple algorithm for GP optimization. In each round t: pick $x_t = \arg\max_{x \in D} \mu_{t-1}(x)$; observe $y_t = f(x_t) + \epsilon_t$; use Bayes' rule to get the posterior mean $\mu_t$. Purely exploiting the posterior mean can get stuck in local maxima! (See the sketch below.)
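The pure-exploitation rule as code, reusing the gp_posterior helper from the earlier sketch; x_obs, y_obs, and x_grid are assumed names for the observations so far and the candidate inputs.

```python
# Pure exploitation: always query where the posterior mean is highest.
mu, var = gp_posterior(x_obs, y_obs, x_grid)
x_next = x_grid[np.argmax(mu)]   # greedy on the mean; prone to local maxima
```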

Slide 13: Uncertainty sampling. Pick $x_t = \arg\max_{x \in D} \sigma^2_{t-1}(x)$. This is equivalent to (greedily) maximizing the information gain, a popular objective in Bayesian experimental design (where the goal is pure exploration of f). But it wastes samples by exploring f everywhere! (Sketch below.)
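Uncertainty sampling differs only in the acquisition rule (same assumed names as in the previous snippet):

```python
# Uncertainty sampling: always query where the posterior is most uncertain.
mu, var = gp_posterior(x_obs, y_obs, x_grid)
x_next = x_grid[np.argmax(var)]  # pure exploration; never exploits the mean
```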

Slide 14: Avoiding unnecessary samples. Key insight: we never need to sample where the upper confidence bound (UCB) is below the best lower bound! [Figure: confidence bands around f(x), with the best lower bound marked]

Slide 15: Upper Confidence Bound (UCB) algorithm. Pick the input that maximizes the upper confidence bound: $x_t = \arg\max_{x \in D}\ \mu_{t-1}(x) + \sqrt{\beta_t}\,\sigma_{t-1}(x)$. This naturally trades off exploration and exploitation, and no samples are wasted. Regret bounds exist in the classic setting [Auer '02] and for linear f [Dani et al. '07], but none in the GP optimization setting, where UCB is a popular heuristic! How should we choose $\beta_t$? We need theory. (A minimal implementation sketch follows.)
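A minimal GP-UCB loop over a discrete grid, again reusing the gp_posterior helper from the earlier sketch. The test function, noise level, and the simple $2\log t$ schedule for $\beta_t$ are illustrative stand-ins, not the constants prescribed by the theory.

```python
import numpy as np

rng = np.random.default_rng(3)
x_grid = np.linspace(0, 1, 200)
f_true = np.sin(6 * x_grid) + 0.5 * x_grid     # hypothetical unknown payoff
noise = 0.1

idx = [100]                                    # one seed observation
y_obs = [f_true[100] + rng.normal(0, noise)]

for t in range(2, 31):
    mu, var = gp_posterior(x_grid[idx], np.array(y_obs), x_grid,
                           noise_var=noise ** 2)
    beta_t = 2.0 * np.log(t)                   # Theta(log t) schedule
    ucb = mu + np.sqrt(beta_t * np.maximum(var, 0))
    i = int(np.argmax(ucb))                    # pick the UCB maximizer
    idx.append(i)
    y_obs.append(f_true[i] + rng.normal(0, noise))

best = x_grid[idx[int(np.argmax(y_obs))]]
print(f"best sampled input so far: x = {best:.3f}")
```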

Slide 16: How well does UCB work? Intuitively, performance should depend on how "learnable" the function is: the quicker the confidence bands collapse, the easier f is to learn. Key idea: the rate of collapse is governed by the growth of the information gain. [Figure: an "easy" function (bandwidth h = 0.3) vs. a "hard" one (bandwidth h = 0.1)]

Slide 17: Learnability and information gain. We show that regret bounds depend on how quickly we can gain information. Mathematically, the maximal information gain after T rounds is $\gamma_T = \max_{A \subseteq D,\ |A| \le T} I(y_A; f)$, where $I(y_A; f) = \tfrac{1}{2} \log\left|I + \sigma^{-2} K_A\right|$. This establishes a novel connection between GP optimization and Bayesian experimental design.
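The information gain of a sampled set A has the closed form above. A sketch with an illustrative kernel and noise level:

```python
import numpy as np

def info_gain(x_a, noise_var=0.01, h=0.3):
    """I(y_A; f) = (1/2) log det(I + sigma^-2 K_A) for a set of inputs x_a."""
    K_a = np.exp(-(x_a[:, None] - x_a[None, :]) ** 2 / h ** 2)
    _, logdet = np.linalg.slogdet(np.eye(len(x_a)) + K_a / noise_var)
    return 0.5 * logdet

x_a = np.linspace(0, 1, 20)        # 20 evenly spaced sampling points
print(f"I(y_A; f) = {info_gain(x_a):.2f} nats")
```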

Slide 18: Performance of optimistic sampling. Theorem: if we choose $\beta_t = \Theta(\log t)$, then with high probability, $R_T = \mathcal{O}^*\big(\sqrt{T\,\beta_T\,\gamma_T}\big)$. Hereby $\gamma_T = \max_{|A| \le T} I(y_A; f)$ is the maximal information gain due to sampling! The slower $\gamma_T$ grows, the easier f is to learn. Key question: how quickly does $\gamma_T$ grow?

Slide 19: Learnability and information gain. Information gain exhibits diminishing returns (submodularity) [Krause & Guestrin '05]. Our bounds depend on the "rate" of diminishment. [Figure: little diminishing returns vs. returns that diminish fast]

Slide 20: Dealing with high dimensions. Theorem: for various popular kernels, we have: Linear: $\gamma_T = \mathcal{O}(d \log T)$; Squared-exponential: $\gamma_T = \mathcal{O}\big((\log T)^{d+1}\big)$; Matérn with $\nu > 1$: $\gamma_T = \mathcal{O}\big(T^{d(d+1)/(2\nu + d(d+1))} \log T\big)$. Smoothness of f helps battle the curse of dimensionality! Our bounds rely on the submodularity of the information gain.

Slide 21: What if f is not from a GP? In practice, f may not be Gaussian. Theorem: let f lie in the RKHS of kernel K with $\|f\|_K \le B$, and let the noise be bounded almost surely by $\sigma$. Choose $\beta_t = \mathcal{O}\big(B^2 + \gamma_t \log^3 t\big)$. Then, with high probability, $R_T = \mathcal{O}^*\big(\sqrt{T}\,(B\sqrt{\gamma_T} + \gamma_T)\big)$. This frees us from knowing the "true prior"; intuitively, the bound depends on the "complexity" of the function through its RKHS norm.

Slide 22: Experiments: UCB vs. heuristics. Temperature data: 46 sensors deployed at Intel Research, Berkeley; data collected for 5 days (1 sample per minute); goal: adaptively find the highest temperature as quickly as possible. Traffic data: speed data from 357 sensors deployed along highway I-880 South; collected during 6am-11am for one month; goal: find the most congested (lowest-speed) area as quickly as possible.

Slide 23: Comparison: UCB vs. heuristics. GP-UCB compares favorably with existing heuristics.

Slide 24: Assumptions on f. Linear? [Dani et al. '07]: fast convergence, but a strong assumption. Lipschitz-continuous (bounded slope)? [Kleinberg '08]: very flexible, but with slow convergence rates.

Slide 25: Conclusions. First theoretical guarantees and convergence rates for GP optimization; both the true-prior and the agnostic case are covered. Performance depends on "learnability", captured by the maximal information gain. Connects GP bandit optimization and experimental design! Performance on real data is comparable to other heuristics.