Taming the monster: A fast and simple algorithm for contextual bandits PRESENTED BY Satyen Kale Joint work with Alekh Agarwal, Daniel Hsu, John Langford, Lihong Li and Rob Schapire
Learning to interact: example #1 Loop: 1. Patient arrives with symptoms, medical history, genome, … 2. Physician prescribes treatment. 3. Patient’s health responds (e.g., improves, worsens). Goal: prescribe treatments that yield good health outcomes.
Learning to interact: example #2 Loop: 1. User visits website with profile, browsing history, … 2. Website operator chooses content/ads to display. 3. User reacts to content/ads (e.g., click, “like”). Goal: choose content/ads that yield desired user behavior.
Contextual bandit setting (i.i.d. version). Set X of contexts/features and K possible actions. For t = 1,2,…,T:
0. Nature draws (x_t, r_t) from distribution D over X × [0,1]^K.
1. Observe context x_t. [e.g., user profile, browsing history]
2. Choose action a_t ∈ [K]. [e.g., content/ad to display]
3. Collect reward r_t(a_t). [e.g., indicator of click or positive feedback]
Goal: algorithm for choosing actions a_t that yield high reward. Contextual setting: use features x_t to choose good actions a_t. Bandit setting: r_t(a) for a ≠ a_t is not observed. Exploration vs. exploitation.
A simple setting that captures a large class of interactive learning problems is the contextual bandit problem. Here we consider the i.i.d. version of the problem, where in each round t, … GOAL: choose actions that yield high reward over the T rounds. MUST USE CONTEXT: no single action is good in all situations. BANDIT PROBLEM (or PARTIAL LABEL PROBLEM): we don’t see rewards for actions we don’t take. Need exploration (take actions just to learn about them) balanced with exploitation (take actions known to be good).
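To make the interaction loop concrete, here is a minimal simulation sketch in Python; `draw_context_rewards` and `learner` are hypothetical placeholders for nature's distribution D and for whichever contextual bandit algorithm is being run (they are not part of the talk).

```python
# Minimal sketch of the i.i.d. contextual bandit protocol above (assumed names).

def run_protocol(draw_context_rewards, learner, T):
    total_reward = 0.0
    for t in range(T):
        x_t, r_t = draw_context_rewards()    # nature draws (x_t, r_t) ~ D; r_t is a length-K reward vector
        a_t = learner.choose_action(x_t)     # learner sees only the context x_t
        total_reward += r_t[a_t]             # bandit feedback: only r_t(a_t) is observed
        learner.update(x_t, a_t, r_t[a_t])   # rewards r_t(a) for a != a_t stay hidden
    return total_reward
```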
Learning objective and difficulties. No single action is good in all situations – need to exploit context. Policy class Π: set of functions (“policies”) from X → [K] (e.g., advice of experts, linear classifiers, neural networks). Regret (i.e., relative performance) with respect to policy class Π: … a strong benchmark if Π contains a policy with high reward. Difficulties: feedback on the chosen action only informs us about a subset of policies; explicit bookkeeping is computationally infeasible when Π is large.
Classical multi-armed bandit: compete against the best single arm --- not good enough in applications. INSTEAD: want to compete against the best policy in some rich policy class. [Define policy: function mapping contexts to actions.] [Define regret: difference between the total reward collected by the best policy (i.e., the solution to the OFF-LINE FULL-INFO problem) and the total reward collected by the learner.] WHY IS THIS HARD? 1. Feedback for the chosen action only informs us about a subset of policies. 2. Explicit bookkeeping is infeasible when the policy space is very large.
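For reference, the regret benchmark described verbally above (the best policy's total reward minus the learner's) can be written as:

```latex
\mathrm{Regret}(T) \;=\; \max_{\pi \in \Pi} \sum_{t=1}^{T} r_t\bigl(\pi(x_t)\bigr) \;-\; \sum_{t=1}^{T} r_t(a_t).
```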
Arg max oracle (AMO). Given fully-labeled data (x_1, r_1),…,(x_t, r_t), the AMO returns arg max_{π ∈ Π} Σ_{τ ≤ t} r_τ(π(x_τ)), i.e., the policy in Π with maximum total reward on that data. Abstraction for efficient search of the policy class Π. In practice: implement using standard heuristics (e.g., convex relaxations, backprop) for cost-sensitive multiclass learning.
The AMO is an abstraction for efficient search of a policy class. GIVEN: a fully-labeled data set, return the policy in the policy class that has maximum total reward, i.e., solve the OFF-LINE FULL-INFO problem. Generally computationally hard, but in practice we have effective heuristics for very rich policy classes. Still, it is not clear how to use this, because we only have PARTIAL FEEDBACK.
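A toy sketch of what the AMO computes, assuming a small, explicitly enumerable policy class so that brute-force search can stand in for the cost-sensitive learning heuristics mentioned above; `policies` and `labeled_data` are illustrative names.

```python
# AMO sketch: given fully-labeled data [(x_1, r_1), ..., (x_t, r_t)], where each
# r is a length-K reward vector, return the policy with maximum total reward.

def argmax_oracle(policies, labeled_data):
    def total_reward(pi):
        return sum(r[pi(x)] for (x, r) in labeled_data)
    return max(policies, key=total_reward)  # enumeration stands in for a real learner
```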
Our results. New fast and simple algorithm for contextual bandits. Optimal regret bound (up to log factors): Õ(√(KT log|Π|)). Only Õ(√(KT / log|Π|)) total calls to the arg max oracle (AMO) over T rounds, i.e., vanishing amortized AMO calls per round.
Comparison to previous work: [Thompson’33]: no general analysis. [ACBFS’02]: Exp4 algorithm; optimal regret, but enumerates policies. [LZ’07]: ε-greedy variant; suboptimal regret, one AMO call per round. [DHKKLRZ’11]: “monster paper”; optimal regret, O(T^5 K^4) AMO calls per round. Note: Exp4 also works in the adversarial setting.
New algorithm: Statistical performance: achieves the statistically optimal regret bound. Computational benchmark: “oracle complexity” --- how many times the algorithm has to call the AMO. Computational performance: sublinear in the number of rounds (i.e., vanishing per-round complexity). Previous algorithms were either statistically suboptimal or computationally more complex --- it is challenging to achieve both simultaneously. NOTE: we crucially rely on the i.i.d. assumption, whereas Exp4 works in the adversarial setting.
Rest of this talk: action distributions and reward estimates via inverse probability weights [oldies but goodies]; algorithm for finding policy distributions that balance exploration/exploitation [new]; warm-start / epoch trick [new].
We want to compete against a policy class, so we’ll hedge using a policy distribution. First, some basic techniques that show why learning a policy distribution is tricky. Then we describe the new algorithm for finding a policy distribution that balances explore/exploit. We conclude with a brief word about the warm-start / epoch trick.
Basic algorithm structure (same as Exp4). Start with initial distribution Q_1 over policies Π. For t = 1,2,…,T:
0. Nature draws (x_t, r_t) from distribution D over X × [0,1]^K.
1. Observe context x_t.
2a. Compute distribution p_t over actions {1,2,…,K} (based on Q_t and x_t).
2b. Draw action a_t from p_t.
3. Collect reward r_t(a_t).
4. Compute new distribution Q_{t+1} over policies Π.
Maintain a policy distribution Q --- we need to do this efficiently, so we’ll make sure it’s sparse. After seeing context x_t, we need to pick action a_t: take a “smoothed” projection of Q_t to get p_t, then randomly pick a_t according to p_t. After collecting the reward, update the policy distribution from Q_t to Q_{t+1}.
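Step 2a is the “smoothed” projection mentioned in the notes. A sketch, assuming Q_t is stored as a sparse dict mapping policies (functions x -> action) to weights summing to 1, with some smoothing parameter `mu` (the specific choice of mu is the paper's and is not reproduced here):

```python
import numpy as np

def action_distribution(Q, x, K, mu):
    # Project the policy distribution Q onto actions for context x, then smooth:
    # every action keeps probability at least mu, guaranteeing some exploration.
    p = np.full(K, mu)
    for pi, w in Q.items():
        p[pi(x)] += (1.0 - K * mu) * w
    return p

# Usage: a_t = np.random.choice(K, p=action_distribution(Q_t, x_t, K, mu))
```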
Inverse probability weighting (old trick). Importance-weighted estimate of reward from round t: r̂_t(a) = r_t(a_t) · 1{a = a_t} / p_t(a_t). Unbiased, and has range & variance bounded by 1/p_t(a). Can estimate the total reward and regret of any policy.
An old trick for unbiased estimates of rewards for all actions --- including actions you didn’t take. Estimate zero for actions you didn’t take; for the action you take, scale up its reward by the inverse of the probability of taking that action. Upshot: we can estimate the total reward of any policy π through time t --- use this with the AMO! So where does p_t come from?
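In code (illustrative names), the estimate is zero for unplayed actions and the observed reward scaled by 1/p_t(a_t) for the played one; summing these estimates over rounds gives an estimated total reward for any policy, which is exactly the fully-labeled input an AMO call can consume.

```python
import numpy as np

def ipw_estimate(K, a_t, observed_reward, p_t):
    # Unbiased importance-weighted reward vector for round t:
    # E[r_hat_t(a)] = r_t(a) for every action a, including unplayed ones.
    r_hat = np.zeros(K)
    r_hat[a_t] = observed_reward / p_t[a_t]
    return r_hat

def estimated_policy_reward(pi, history):
    # history: list of (x_t, a_t, observed_reward, p_t(a_t)) tuples.
    return sum(r / p for (x, a, r, p) in history if pi(x) == a)
```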
Constructing policy distributions. Optimization problem (OP): find a policy distribution Q such that: low estimated regret (LR) – “exploitation”; low estimation variance (LV) – “exploration”. Theorem: If we obtain policy distributions Q_t by solving (OP), then with high probability, the regret after T rounds is at most Õ(√(KT log|Π|)).
We will repeatedly call the AMO to construct the policy distribution --- much as boosting repeatedly calls a weak learner. Before the details, we describe an optimization (feasibility) problem (over the space of policy distributions!) whose solutions give: low estimated regret (LR) --- for exploitation; low estimation variance (LV) --- for exploration. The theorem says: distributions that satisfy (LR) and (LV) achieve the optimal explore/exploit trade-off.
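Schematically (my paraphrase, with placeholder thresholds b_LR and b_LV standing in for the constants and smoothing parameter chosen in the paper), the two constraints have the following shape, where Reg-hat(π) is the IPW-estimated regret of π and Q^μ(a|x) = (1 − Kμ) Σ_{π: π(x)=a} Q(π) + μ is the smoothed action probability induced by Q:

```latex
\text{(LR)}\quad \sum_{\pi \in \Pi} Q(\pi)\, \widehat{\mathrm{Reg}}(\pi) \;\le\; b_{\mathrm{LR}},
\qquad
\text{(LV)}\quad \widehat{\mathbb{E}}_{x}\!\left[\frac{1}{Q^{\mu}\bigl(\pi(x) \mid x\bigr)}\right] \;\le\; 2K + \frac{\widehat{\mathrm{Reg}}(\pi)}{b_{\mathrm{LV}}} \quad \text{for all } \pi \in \Pi.
```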
Feasibility. Feasibility of (OP): implied by a minimax argument. Monster solution [DHKKLRZ’11]: solves a variant of (OP) with the ellipsoid algorithm, where Separation Oracle = AMO + perceptron + ellipsoid.
Coordinate descent algorithm. Claim: the constraints can be checked by making one AMO call per iteration.
INPUT: initial weights Q.
LOOP:
  IF (LR) is violated, THEN replace Q by cQ.
  IF there is a policy π causing (LV) to be violated, THEN UPDATE Q(π) = Q(π) + α.
  ELSE RETURN Q.
Above, both 0 < c < 1 and α have closed-form expressions. (Technical detail: we actually optimize over sub-distributions Q that may sum to less than 1.)
We use coordinate descent to iteratively construct a policy distribution Q. In each iteration, we add at most one new policy to the support of Q. [Technical detail: optimize over sub-distributions.] Repeat: if (LR) is violated, rescale Q so it’s satisfied; if some policy is causing (LV) to be violated, increase its weight. Each iteration requires just one AMO call.
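A sketch of the loop in Python, under simplifying assumptions: an explicit finite policy class (so plain enumeration stands in for the AMO call that finds an (LV)-violating policy), a non-empty logged history of (x, a, r, p) tuples with the support of Q contained in `policies`, the schematic constraint thresholds b_lr and b_lv from the previous block, and the rescaling factor c and step size alpha passed in as fixed parameters rather than computed from their closed forms.

```python
import numpy as np

def ipw_reward(pi, history):
    # IPW estimate of a policy's cumulative reward from bandit feedback.
    return sum(r / p for (x, a, r, p) in history if pi(x) == a)

def smoothed_prob(Q, x, a, K, mu):
    # Probability that the smoothed projection of Q plays action a on context x.
    return mu + (1.0 - K * mu) * sum(w for pi, w in Q.items() if pi(x) == a)

def coordinate_descent(Q, policies, history, K, mu, b_lr, b_lv, c, alpha):
    best = max(ipw_reward(pi, history) for pi in policies)          # one AMO call
    reg = {pi: best - ipw_reward(pi, history) for pi in policies}   # estimated regrets

    def variance(pi):  # empirical variance proxy for pi's IPW estimates under Q
        return np.mean([1.0 / smoothed_prob(Q, x, pi(x), K, mu)
                        for (x, _, _, _) in history])

    while True:
        # (LR): if the mixture's estimated regret is too large, rescale Q down.
        if sum(w * reg[pi] for pi, w in Q.items()) > b_lr:
            Q = {pi: c * w for pi, w in Q.items()}                  # 0 < c < 1
        # (LV): look for a policy whose variance constraint is violated;
        # finding it is what the per-iteration AMO call does in the real algorithm.
        pi_bad = max(policies, key=lambda pi: variance(pi) - reg[pi] / b_lv)
        if variance(pi_bad) > 2 * K + reg[pi_bad] / b_lv:
            Q[pi_bad] = Q.get(pi_bad, 0.0) + alpha                  # add weight to the violator
        else:
            return Q   # both (LR) and (LV) hold: Q is feasible for (OP)
```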
Iteration bound for coordinate descent. The number of coordinate descent steps is bounded; this also gives a bound on the sparsity of Q. Analysis via a potential function argument.
Warm-start. If we warm-start coordinate descent (initialize with Q_t to get Q_{t+1}), then the total number of coordinate descent iterations needed over all T rounds is sublinear in T. Caveat: we still need one AMO call per round just to check whether (OP) is already solved.
Epoch trick. Regret analysis: Q_t has low instantaneous expected regret (crucially relying on the i.i.d. assumption). Therefore the same Q_t can be used for O(t) more rounds! Epoching: split the T rounds into epochs and solve (OP) only once per epoch. Doubling: only update on rounds 2^1, 2^2, 2^3, 2^4, …; a total of O(log T) updates, so the overall number of AMO calls is unchanged (up to log factors). Squares: only update on rounds 1^2, 2^2, 3^2, 4^2, …; a total of O(T^{1/2}) updates, with the amortized number of AMO calls per update remaining small.
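A tiny illustration of the two update schedules (these helpers only list the update rounds; the names are mine):

```python
def doubling_rounds(T):
    # Update on rounds 2, 4, 8, ...: O(log T) solves of (OP) over T rounds.
    return [2 ** i for i in range(1, T.bit_length() + 1) if 2 ** i <= T]

def square_rounds(T):
    # Update on rounds 1, 4, 9, ...: O(sqrt(T)) solves of (OP) over T rounds.
    return [i * i for i in range(1, int(T ** 0.5) + 1)]

# doubling_rounds(100) -> [2, 4, 8, 16, 32, 64]
# square_rounds(20)    -> [1, 4, 9, 16]
```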
Experiments. Bandit problem derived from a classification task (RCV1). Reporting progressive validation loss. “Online Cover” = variant with a stateful AMO.

Algorithm        Loss    Time (seconds)
Epsilon-greedy   0.095   22
Bagging          0.059   339
Linear UCB       0.128   212000
Online Cover     0.053   17
[Supervised]     0.051   6.9