Taming the monster: A fast and simple algorithm for contextual bandits PRESENTED BY Satyen Kale Joint work with Alekh Agarwal, Daniel Hsu, John Langford, Lihong Li and Rob Schapire
Learning to interact: example #1 Loop: 1. Patient arrives with symptoms, medical history, genome, … 2. Physician prescribes treatment. 3. Patient’s health responds (e.g., improves, worsens). Goal: prescribe treatments that yield good health outcomes.
Learning to interact: example #2 Loop: 1. User visits website with profile, browsing history, … 2. Website operator chooses content/ads to display. 3. User reacts to content/ads (e.g., click, “like”). Goal: choose content/ads that yield desired user behavior.
Contextual bandit setting (i.i.d. version). Set X of contexts/features and K possible actions. For t = 1,2,…,T:
0. Nature draws (x_t, r_t) from distribution D over X × [0,1]^K.
1. Observe context x_t. [e.g., user profile, browsing history]
2. Choose action a_t ∈ [K]. [e.g., content/ad to display]
3. Collect reward r_t(a_t). [e.g., indicator of click or positive feedback]
Goal: algorithm for choosing actions a_t that yield high reward. Contextual setting: use features x_t to choose good actions a_t. Bandit setting: r_t(a) for a ≠ a_t is not observed. Exploration vs. exploitation.
A simple setting that captures a large class of interactive learning problems is the contextual bandit problem. Here we consider the i.i.d. version of the problem, where in each round t, … GOAL: choose actions that yield high reward over the T rounds. MUST USE CONTEXT: no single action is good in all situations. BANDIT PROBLEM (or PARTIAL LABEL PROBLEM): we don’t see rewards for actions we don’t take. Need exploration (take actions just to learn about them) balanced with exploitation (take actions known to be good).
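To make the interaction loop concrete, here is a minimal simulation sketch in Python; `draw_context_rewards` and `learner` are hypothetical placeholders for nature's distribution D and for whichever contextual bandit algorithm is being run (they are not part of the talk).

```python
# Minimal sketch of the i.i.d. contextual bandit protocol above (assumed names).

def run_protocol(draw_context_rewards, learner, T):
    total_reward = 0.0
    for t in range(T):
        x_t, r_t = draw_context_rewards()    # nature draws (x_t, r_t) ~ D; r_t is a length-K reward vector
        a_t = learner.choose_action(x_t)     # learner sees only the context x_t
        total_reward += r_t[a_t]             # bandit feedback: only r_t(a_t) is observed
        learner.update(x_t, a_t, r_t[a_t])   # rewards r_t(a) for a != a_t stay hidden
    return total_reward
```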
Learning objective and difficulties. No single action is good in all situations – need to exploit context. Policy class Π: set of functions (“policies”) from X → [K] (e.g., advice of experts, linear classifiers, neural networks). Regret (i.e., relative performance) with respect to policy class Π: … a strong benchmark if Π contains a policy with high reward. Difficulties: feedback on the chosen action only informs us about a subset of policies; explicit bookkeeping is computationally infeasible when Π is large.
Classical multi-armed bandit: compete against the best single arm --- not good enough in applications. INSTEAD: want to compete against the best policy in some rich policy class. [Define policy: function mapping contexts to actions.] [Define regret: difference between the total reward collected by the best policy (i.e., the solution to the OFF-LINE FULL-INFO problem) and the total reward collected by the learner.] WHY IS THIS HARD? 1. Feedback for the chosen action only informs us about a subset of policies. 2. Explicit bookkeeping is infeasible when the policy space is very large.
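For reference, the regret benchmark described verbally above (the best policy's total reward minus the learner's) can be written as:

```latex
\mathrm{Regret}(T) \;=\; \max_{\pi \in \Pi} \sum_{t=1}^{T} r_t\bigl(\pi(x_t)\bigr) \;-\; \sum_{t=1}^{T} r_t(a_t).
```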
Arg max oracle (AMO). Given fully-labeled data (x_1, r_1),…,(x_t, r_t), the AMO returns arg max_{π ∈ Π} Σ_{τ ≤ t} r_τ(π(x_τ)), i.e., the policy in Π with maximum total reward on that data. Abstraction for efficient search of the policy class Π. In practice: implement using standard heuristics (e.g., convex relaxations, backprop) for cost-sensitive multiclass learning.
The AMO is an abstraction for efficient search of a policy class. GIVEN: a fully-labeled data set, return the policy in the policy class that has maximum total reward, i.e., solve the OFF-LINE FULL-INFO problem. Generally computationally hard, but in practice we have effective heuristics for very rich policy classes. Still, it is not clear how to use this, because we only have PARTIAL FEEDBACK.
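A toy sketch of what the AMO computes, assuming a small, explicitly enumerable policy class so that brute-force search can stand in for the cost-sensitive learning heuristics mentioned above; `policies` and `labeled_data` are illustrative names.

```python
# AMO sketch: given fully-labeled data [(x_1, r_1), ..., (x_t, r_t)], where each
# r is a length-K reward vector, return the policy with maximum total reward.

def argmax_oracle(policies, labeled_data):
    def total_reward(pi):
        return sum(r[pi(x)] for (x, r) in labeled_data)
    return max(policies, key=total_reward)  # enumeration stands in for a real learner
```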
Our results. New fast and simple algorithm for contextual bandits. Optimal regret bound (up to log factors): Õ(√(KT log|Π|)). Only Õ(√(KT / log|Π|)) total calls to the arg max oracle (AMO) over T rounds, i.e., vanishing amortized AMO calls per round.
Comparison to previous work: [Thompson’33]: no general analysis. [ACBFS’02]: Exp4 algorithm; optimal regret, but enumerates policies. [LZ’07]: ε-greedy variant; suboptimal regret, one AMO call per round. [DHKKLRZ’11]: “monster paper”; optimal regret, O(T^5 K^4) AMO calls per round. Note: Exp4 also works in the adversarial setting.
New algorithm: Statistical performance: achieves the statistically optimal regret bound. Computational benchmark: “oracle complexity” --- how many times the algorithm has to call the AMO. Computational performance: sublinear in the number of rounds (i.e., vanishing per-round complexity). Previous algorithms were either statistically suboptimal or computationally more complex --- it is challenging to achieve both simultaneously. NOTE: we crucially rely on the i.i.d. assumption, whereas Exp4 works in the adversarial setting.
Rest of this talk: action distributions and reward estimates via inverse probability weights [oldies but goodies]; algorithm for finding policy distributions that balance exploration/exploitation [new]; warm-start / epoch trick [new].
We want to compete against a policy class, so we’ll hedge using a policy distribution. First, some basic techniques that show why learning a policy distribution is tricky. Then we describe the new algorithm for finding a policy distribution that balances explore/exploit. We conclude with a brief word about the warm-start / epoch trick.
Basic algorithm structure (same as Exp4). Start with initial distribution Q_1 over policies Π. For t = 1,2,…,T:
0. Nature draws (x_t, r_t) from distribution D over X × [0,1]^K.
1. Observe context x_t.
2a. Compute distribution p_t over actions {1,2,…,K} (based on Q_t and x_t).
2b. Draw action a_t from p_t.
3. Collect reward r_t(a_t).
4. Compute new distribution Q_{t+1} over policies Π.
Maintain a policy distribution Q --- we need to do this efficiently, so we’ll make sure it’s sparse. After seeing context x_t, we need to pick action a_t: take a “smoothed” projection of Q_t to get p_t, then randomly pick a_t according to p_t. After collecting the reward, update the policy distribution from Q_t to Q_{t+1}.
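Step 2a is the “smoothed” projection mentioned in the notes. A sketch, assuming Q_t is stored as a sparse dict mapping policies (functions x -> action) to weights summing to 1, with some smoothing parameter `mu` (the specific choice of mu is the paper's and is not reproduced here):

```python
import numpy as np

def action_distribution(Q, x, K, mu):
    # Project the policy distribution Q onto actions for context x, then smooth:
    # every action keeps probability at least mu, guaranteeing some exploration.
    p = np.full(K, mu)
    for pi, w in Q.items():
        p[pi(x)] += (1.0 - K * mu) * w
    return p

# Usage: a_t = np.random.choice(K, p=action_distribution(Q_t, x_t, K, mu))
```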
Inverse probability weighting (old trick). Importance-weighted estimate of reward from round t: r̂_t(a) = r_t(a_t) · 1{a = a_t} / p_t(a_t). Unbiased, and has range & variance bounded by 1/p_t(a). Can estimate the total reward and regret of any policy.
An old trick for unbiased estimates of rewards for all actions --- including actions you didn’t take. Estimate zero for actions you didn’t take; for the action you take, scale up its reward by the inverse of the probability of taking that action. Upshot: we can estimate the total reward of any policy π through time t --- use this with the AMO! So where does p_t come from?
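In code (illustrative names), the estimate is zero for unplayed actions and the observed reward scaled by 1/p_t(a_t) for the played one; summing these estimates over rounds gives an estimated total reward for any policy, which is exactly the fully-labeled input an AMO call can consume.

```python
import numpy as np

def ipw_estimate(K, a_t, observed_reward, p_t):
    # Unbiased importance-weighted reward vector for round t:
    # E[r_hat_t(a)] = r_t(a) for every action a, including unplayed ones.
    r_hat = np.zeros(K)
    r_hat[a_t] = observed_reward / p_t[a_t]
    return r_hat

def estimated_policy_reward(pi, history):
    # history: list of (x_t, a_t, observed_reward, p_t(a_t)) tuples.
    return sum(r / p for (x, a, r, p) in history if pi(x) == a)
```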
Constructing policy distributions. Optimization problem (OP): find a policy distribution Q such that: low estimated regret (LR) – “exploitation”; low estimation variance (LV) – “exploration”. Theorem: If we obtain policy distributions Q_t by solving (OP), then with high probability, the regret after T rounds is at most Õ(√(KT log|Π|)).
We will repeatedly call the AMO to construct the policy distribution --- much as boosting repeatedly calls a weak learner. Before the details, we describe an optimization (feasibility) problem (over the space of policy distributions!) whose solutions give: low estimated regret (LR) --- for exploitation; low estimation variance (LV) --- for exploration. The theorem says: distributions that satisfy (LR) and (LV) achieve the optimal explore/exploit trade-off.
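Schematically (my paraphrase, with placeholder thresholds b_LR and b_LV standing in for the constants and smoothing parameter chosen in the paper), the two constraints have the following shape, where Reg-hat(π) is the IPW-estimated regret of π and Q^μ(a|x) = (1 − Kμ) Σ_{π: π(x)=a} Q(π) + μ is the smoothed action probability induced by Q:

```latex
\text{(LR)}\quad \sum_{\pi \in \Pi} Q(\pi)\, \widehat{\mathrm{Reg}}(\pi) \;\le\; b_{\mathrm{LR}},
\qquad
\text{(LV)}\quad \widehat{\mathbb{E}}_{x}\!\left[\frac{1}{Q^{\mu}\bigl(\pi(x) \mid x\bigr)}\right] \;\le\; 2K + \frac{\widehat{\mathrm{Reg}}(\pi)}{b_{\mathrm{LV}}} \quad \text{for all } \pi \in \Pi.
```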
Feasibility. Feasibility of (OP): implied by a minimax argument. Monster solution [DHKKLRZ’11]: solves a variant of (OP) with the ellipsoid algorithm, where Separation Oracle = AMO + perceptron + ellipsoid.
Coordinate descent algorithm. Claim: the constraints can be checked by making one AMO call per iteration.
INPUT: initial weights Q.
LOOP:
  IF (LR) is violated, THEN replace Q by cQ.
  IF there is a policy π causing (LV) to be violated, THEN UPDATE Q(π) = Q(π) + α.
  ELSE RETURN Q.
Above, both 0 < c < 1 and α have closed-form expressions. (Technical detail: we actually optimize over sub-distributions Q that may sum to less than 1.)
We use coordinate descent to iteratively construct a policy distribution Q. In each iteration, we add at most one new policy to the support of Q. [Technical detail: optimize over sub-distributions.] Repeat: if (LR) is violated, rescale Q so it’s satisfied; if some policy is causing (LV) to be violated, increase its weight. Each iteration requires just one AMO call.
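A sketch of the loop in Python, under simplifying assumptions: an explicit finite policy class (so plain enumeration stands in for the AMO call that finds an (LV)-violating policy), a non-empty logged history of (x, a, r, p) tuples with the support of Q contained in `policies`, the schematic constraint thresholds b_lr and b_lv from the previous block, and the rescaling factor c and step size alpha passed in as fixed parameters rather than computed from their closed forms.

```python
import numpy as np

def ipw_reward(pi, history):
    # IPW estimate of a policy's cumulative reward from bandit feedback.
    return sum(r / p for (x, a, r, p) in history if pi(x) == a)

def smoothed_prob(Q, x, a, K, mu):
    # Probability that the smoothed projection of Q plays action a on context x.
    return mu + (1.0 - K * mu) * sum(w for pi, w in Q.items() if pi(x) == a)

def coordinate_descent(Q, policies, history, K, mu, b_lr, b_lv, c, alpha):
    best = max(ipw_reward(pi, history) for pi in policies)          # one AMO call
    reg = {pi: best - ipw_reward(pi, history) for pi in policies}   # estimated regrets

    def variance(pi):  # empirical variance proxy for pi's IPW estimates under Q
        return np.mean([1.0 / smoothed_prob(Q, x, pi(x), K, mu)
                        for (x, _, _, _) in history])

    while True:
        # (LR): if the mixture's estimated regret is too large, rescale Q down.
        if sum(w * reg[pi] for pi, w in Q.items()) > b_lr:
            Q = {pi: c * w for pi, w in Q.items()}                  # 0 < c < 1
        # (LV): look for a policy whose variance constraint is violated;
        # finding it is what the per-iteration AMO call does in the real algorithm.
        pi_bad = max(policies, key=lambda pi: variance(pi) - reg[pi] / b_lv)
        if variance(pi_bad) > 2 * K + reg[pi_bad] / b_lv:
            Q[pi_bad] = Q.get(pi_bad, 0.0) + alpha                  # add weight to the violator
        else:
            return Q   # both (LR) and (LV) hold: Q is feasible for (OP)
```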
Iteration bound for coordinate descent. The number of coordinate descent steps is bounded; this also gives a bound on the sparsity of Q. Analysis via a potential function argument.
Warm-start. If we warm-start coordinate descent (initialize with Q_t to get Q_{t+1}), then the total number of coordinate descent iterations needed over all T rounds is sublinear in T. Caveat: we still need one AMO call per round just to check whether (OP) is already solved.
Epoch trick. Regret analysis: Q_t has low instantaneous expected regret (crucially relying on the i.i.d. assumption). Therefore the same Q_t can be used for O(t) more rounds! Epoching: split the T rounds into epochs and solve (OP) only once per epoch. Doubling: only update on rounds 2^1, 2^2, 2^3, 2^4, …; a total of O(log T) updates, so the overall number of AMO calls is unchanged (up to log factors). Squares: only update on rounds 1^2, 2^2, 3^2, 4^2, …; a total of O(T^{1/2}) updates, with the amortized number of AMO calls per update remaining small.
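A tiny illustration of the two update schedules (these helpers only list the update rounds; the names are mine):

```python
def doubling_rounds(T):
    # Update on rounds 2, 4, 8, ...: O(log T) solves of (OP) over T rounds.
    return [2 ** i for i in range(1, T.bit_length() + 1) if 2 ** i <= T]

def square_rounds(T):
    # Update on rounds 1, 4, 9, ...: O(sqrt(T)) solves of (OP) over T rounds.
    return [i * i for i in range(1, int(T ** 0.5) + 1)]

# doubling_rounds(100) -> [2, 4, 8, 16, 32, 64]
# square_rounds(20)    -> [1, 4, 9, 16]
```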
Experiments. Bandit problem derived from a classification task (RCV1). Reporting progressive validation loss. “Online Cover” = variant with a stateful AMO.

Algorithm        Loss    Time (seconds)
Epsilon-greedy   0.095   22
Bagging          0.059   339
Linear UCB       0.128   212000
Online Cover     0.053   17
[Supervised]     0.051   6.9