Published by Aliyah Dolphin. Modified over 9 years ago.
1
Taming the monster: A fast and simple algorithm for contextual bandits
Presented by Satyen Kale. Joint work with Alekh Agarwal, Daniel Hsu, John Langford, Lihong Li, and Rob Schapire.
2
Learning to interact: example #1
Loop:
1. Patient arrives with symptoms, medical history, genome, …
2. Physician prescribes treatment.
3. Patient’s health responds (e.g., improves, worsens).
Goal: prescribe treatments that yield good health outcomes.
3
Learning to interact: example #2
Loop:
1. User visits website with profile, browsing history, …
2. Website operator chooses content/ads to display.
3. User reacts to content/ads (e.g., click, “like”).
Goal: choose content/ads that yield desired user behavior.
4
Contextual bandit setting (i.i.d. version)
Set X of contexts/features and K possible actions.
For t = 1, 2, …, T:
0. Nature draws (x_t, r_t) from distribution D over X × [0,1]^K.
1. Observe context x_t. [e.g., user profile, browsing history]
2. Choose action a_t ∈ [K]. [e.g., content/ad to display]
3. Collect reward r_t(a_t). [e.g., indicator of click or positive feedback]
Goal: algorithm for choosing actions a_t that yield high reward.
Contextual setting: use features x_t to choose good actions a_t.
Bandit setting: r_t(a) for a ≠ a_t is not observed. Exploration vs. exploitation.

A simple setting that captures a large class of interactive learning problems is the contextual bandit problem. Here we consider the i.i.d. version of the problem, where in each round t, …
GOAL: choose actions that yield high reward over the T rounds.
MUST USE CONTEXT: no single action is good in all situations.
BANDIT PROBLEM (or PARTIAL LABEL PROBLEM): we don’t see rewards for actions we don’t take. We need exploration (taking actions just to learn about them) balanced with exploitation (taking actions known to be good).
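The interaction protocol above can be sketched as a small simulation. This is a minimal sketch, not code from the talk: the function names, the two-action toy environment, and the two hand-written action rules are all assumptions made for illustration.

```python
import random

def run_bandit(T, draw_context, draw_rewards, choose_action, seed=0):
    """Run T rounds of the i.i.d. contextual bandit protocol.

    The learner sees the context and the reward of the chosen action only;
    the full reward vector r stays hidden inside this loop.
    """
    rng = random.Random(seed)
    total = 0.0
    history = []  # (context, action, observed reward) triples
    for _ in range(T):
        x = draw_context(rng)               # step 0/1: nature draws, learner observes x_t
        r = draw_rewards(x, rng)            # step 0: full reward vector (not revealed)
        a = choose_action(x, history, rng)  # step 2: learner picks a_t
        reward = r[a]                       # step 3: only r_t(a_t) is observed
        history.append((x, a, reward))
        total += reward
    return total, history

# Toy problem (hypothetical): K = 2 actions, context x in {0, 1};
# the action equal to the context earns reward 1, the other earns 0.
def draw_context(rng):
    return rng.randint(0, 1)

def draw_rewards(x, rng):
    return [1.0 if a == x else 0.0 for a in range(2)]

def follow_context(x, history, rng):
    return x  # uses the context

def always_zero(x, history, rng):
    return 0  # ignores the context, like a single fixed arm

total_ctx, history = run_bandit(50, draw_context, draw_rewards, follow_context)
total_fixed, _ = run_bandit(50, draw_context, draw_rewards, always_zero)
```

With the same seed both runs see the same contexts, so the comparison isolates the effect of using the context: the fixed-action rule can never collect more reward than the rule that reads x_t.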
5
Learning objective and difficulties
No single action is good in all situations – need to exploit context.
Policy class Π: set of functions (“policies”) from X to [K] (e.g., advice of experts, linear classifiers, neural networks).
Regret (i.e., relative performance) to policy class Π: … a strong benchmark if Π contains a policy with high reward.
Difficulties: feedback on action only informs about subset of policies; explicit bookkeeping is computationally infeasible when Π is large.

Classical multi-armed bandit: compete against the best single arm, which is not useful in applications. INSTEAD: we want to compete against the best policy in some rich policy class.
[Define policy: function mapping contexts to actions]
[Define regret: difference between the total reward collected by the best policy (i.e., the solution to the OFF-LINE FULL-INFO problem) and the total reward collected by the learner]
WHY IS THIS HARD?
1. Feedback for the chosen action only informs us about a subset of policies;
2. Explicit bookkeeping is infeasible when the policy space is very large.
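The verbal definition of regret in the notes above can be written out as a formula. This is a reconstruction from that wording, using the symbols of the setting slide (x_t, r_t, a_t, Π), and is not copied from the original slide:

```latex
\[
\mathrm{Regret}_T(\Pi)
  \;=\; \max_{\pi \in \Pi} \sum_{t=1}^{T} r_t\bigl(\pi(x_t)\bigr)
  \;-\; \sum_{t=1}^{T} r_t(a_t)
\]
```

The first term is the total reward of the best policy in hindsight; the second is the total reward the learner actually collected.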
6
Arg max oracle (AMO)
Given fully-labeled data (x_1, r_1), …, (x_t, r_t), AMO returns the policy in Π with maximum total reward on that data.
Abstraction for efficient search of policy class Π.
In practice: implement using standard heuristics (e.g., convex relaxation, backprop) for cost-sensitive multiclass learning algorithms.

AMO is an abstraction for efficient search of a policy class. GIVEN: a fully-labeled data set, return the policy in the policy class that has maximum total reward, i.e., solve the OFF-LINE FULL-INFO problem. This is generally computationally hard, but in practice we have effective heuristics for very rich policy classes. Still, it is not clear how to use this because we only have PARTIAL FEEDBACK.
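For a tiny, explicitly listed policy class, the oracle can be written by brute force. This is a minimal sketch for intuition only: the talk treats AMO as a black box implemented with learning heuristics, and the function name, toy policies, and data below are assumptions.

```python
from typing import Callable, List, Sequence, Tuple

Policy = Callable[[float], int]  # maps a context to an action index

def argmax_oracle(policies: Sequence[Policy],
                  data: Sequence[Tuple[float, List[float]]]) -> Policy:
    """Return the policy with the largest total reward on fully-labeled data.

    Each data point is (context, reward vector over all K actions).
    Enumeration is only feasible for small policy classes.
    """
    def total_reward(pi: Policy) -> float:
        return sum(r[pi(x)] for x, r in data)
    return max(policies, key=total_reward)

# Toy example (hypothetical): contexts are floats, K = 2 actions.
policies = [
    lambda x: 0,             # always action 0
    lambda x: 1,             # always action 1
    lambda x: int(x > 0.5),  # threshold on the context
]
data = [(0.2, [1.0, 0.0]), (0.9, [0.0, 1.0]), (0.7, [0.1, 0.8])]
best = argmax_oracle(policies, data)
```

On this data the threshold policy collects total reward 2.8, against 1.1 and 1.8 for the two constant policies, so it is the one returned.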
7
Our results
New fast and simple algorithm for contextual bandits
Optimal regret bound (up to log factors): Õ(√(KT log|Π|)).
Amortized Õ(√(K / (T log|Π|))) calls to argmax oracle (AMO) per round.
Comparison to previous work:
[Thompson’33]: no general analysis.
[ACBFS’02]: Exp4 algorithm; optimal regret, enumerates policies.
[LZ’07]: ε-greedy variant; suboptimal regret, one AMO call/round.
[DHKKLRZ’11]: “monster paper”; optimal regret, O(T^5 K^4) AMO calls/round.
Note: Exp4 also works in adversarial setting.

New algorithm:
- Statistical performance: achieves the statistically optimal regret bound.
- Computation benchmark: “oracle complexity”, i.e., how many times it has to call AMO.
- Computational performance: sublinear in the number of rounds (i.e., vanishing per-round complexity).
Previous algorithms: either statistically suboptimal or computationally more complex; it is challenging to achieve both simultaneously.
NOTE: we crucially rely on the IID assumption, whereas EXP4 works in the adversarial setting.
8
Rest of this talk
Action distributions, reward estimates via inverse probability weights [oldies but goodies]
Algorithm for finding policy distributions that balance exploration/exploitation [New]
Warm-start / epoch trick [New]

We want to compete against a policy class, so we’ll hedge using a policy distribution. First, some basic techniques that show why learning a policy distribution is tricky. Then we describe the new algorithm for finding a policy distribution that balances explore/exploit. We conclude with a brief word about the warm start / epoch trick.
9
Basic algorithm structure (same as Exp4)
Start with initial distribution Q_1 over policies Π.
For t = 1, 2, …, T:
0. Nature draws (x_t, r_t) from distribution D over X × [0,1]^K.
1. Observe context x_t.
2a. Compute distribution p_t over actions {1, 2, …, K} (based on Q_t and x_t).
2b. Draw action a_t from p_t.
3. Collect reward r_t(a_t).
4. Compute new distribution Q_{t+1} over policies Π.

Maintain a policy distribution Q. We need to do this efficiently, so we’ll make sure it’s sparse. After seeing context x_t, we need to pick action a_t: take a “smoothed” projection of Q_t to get p_t, then randomly pick a_t according to p_t. After collecting the reward, update the policy distribution from Q_t to Q_{t+1}.
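Steps 2a and 2b can be sketched as follows. The slide only says that p_t is a “smoothed” projection of Q_t; the particular smoothing used here (mixing with the uniform distribution so every action gets probability at least mu) is an assumption, as are the function names and the parameter mu.

```python
import random

def action_distribution(Q, x, K, mu):
    """Project a sparse policy distribution Q onto the K actions for context x.

    Q is a list of (policy, weight) pairs whose weights sum to 1.
    The result gives every action probability at least mu (needs 0 <= mu <= 1/K).
    """
    if not 0.0 <= mu <= 1.0 / K:
        raise ValueError("mu must lie in [0, 1/K]")
    p = [mu] * K  # uniform floor: the smoothing part
    for policy, weight in Q:
        # Each policy votes for one action, scaled so that p still sums to 1.
        p[policy(x)] += (1.0 - K * mu) * weight
    return p

def draw_action(p, rng):
    """Step 2b: sample an action index according to p."""
    return rng.choices(range(len(p)), weights=p, k=1)[0]

# Hypothetical example: two policies in the support of Q, K = 3 actions.
Q = [(lambda x: 0, 0.75), (lambda x: 1, 0.25)]
p = action_distribution(Q, x=None, K=3, mu=0.1)
a = draw_action(p, random.Random(0))
```

Because only the policies in the support of Q are visited, the cost of step 2a scales with the sparsity of Q rather than with the size of Π.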
10
Inverse probability weighting (old trick)
Importance-weighted estimate of reward from round t: r̂_t(a) = r_t(a_t) · 1{a = a_t} / p_t(a_t).
Unbiased, and has range & variance bounded by 1/p_t(a).
Can estimate total reward and regret of any policy: R̂_t(π) = Σ_{τ=1..t} r̂_τ(π(x_τ)), and estimated regret = max_{π' ∈ Π} R̂_t(π') − R̂_t(π).

Old trick for unbiased estimates of rewards for all actions, including actions you didn’t take.
- Estimate zero for actions you didn’t take.
- For the action you take, scale up its reward by the inverse of the probability of taking that action.
- Upshot: can estimate the total reward of any policy π through time t. Use this with AMO!
So where does p_t come from?
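The two rules in the notes above (zero for untaken actions, reward divided by its probability for the taken one) fit in a few lines. This is a minimal sketch with hypothetical names and toy numbers; the loop at the end checks unbiasedness by averaging the estimate over every possible choice of a_t, weighted by p_t.

```python
def ipw_estimate(K, a_t, observed_reward, p_t):
    """Importance-weighted reward estimates for all K actions from one round.

    Only the chosen action a_t gets a nonzero estimate: its observed reward
    divided by the probability with which it was chosen.
    """
    estimate = [0.0] * K
    estimate[a_t] = observed_reward / p_t[a_t]
    return estimate

# Hypothetical round: true rewards r (hidden in practice) and action distribution p.
r = [0.2, 0.9, 0.5]
p = [0.5, 0.3, 0.2]

# Exact expectation of the estimate over a_t ~ p.
expected = [0.0] * 3
for a in range(3):
    est = ipw_estimate(3, a, r[a], p)
    for b in range(3):
        expected[b] += p[a] * est[b]
```

The expectation recovers r for every action, including those rarely taken, but the estimate for action a can be as large as 1/p_t(a), which is why the action distribution must not put too little mass on any action.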
11
Constructing policy distributions
Optimization problem (OP): Find policy distribution Q such that:
Low estimated regret (LR) – “exploitation”
Low estimation variance (LV) – “exploration”
Theorem: If we obtain policy distributions Q_t via solving (OP), then with high probability, regret after T rounds is at most Õ(√(KT log|Π|)).

We will repeatedly call AMO to construct the policy distribution, somewhat like how boosting uses a weak learner (WL). Before the details: we describe an optimization (feasibility) problem (over the space of policy distributions!) whose solutions give a good exploration/exploitation balance.
Low estimated regret bound (LR): for exploitation.
Low estimation variance (LV): for exploration.
The theorem says: distributions that satisfy (LR) and (LV) have an optimal explore/exploit trade-off.
12
Feasibility
Feasibility of (OP): implied by minimax argument.
Monster solution [DHKKLRZ’11]: solves variant of (OP) with ellipsoid algorithm, where Separation Oracle = AMO + perceptron + ellipsoid.
13
Coordinate descent algorithm
Claim: Can check for a policy violating (LV) by making one AMO call per iteration.
INPUT: Initial weights Q.
LOOP:
  IF (LR) is violated, THEN replace Q by cQ.
  IF there is a policy π causing (LV) to be violated, THEN UPDATE Q(π) = Q(π) + α.
  ELSE RETURN Q.
Above, both 0 < c < 1 and α have closed form expressions.
(Technical detail: actually optimize over sub-distributions Q that may sum to < 1.)

We use coordinate descent to iteratively construct a policy distribution Q. In each iteration, we add at most one new policy to the support of Q. [Technical detail: optimize over subdistributions]
Repeat: If (LR) is violated, re-scale so it’s satisfied. If some policy is causing (LV) to be violated, increase its weight.
Each iteration requires just one AMO call.
14
Iteration bound for coordinate descent
# steps of coordinate descent = Õ(√(Kt / log|Π|)).
Also gives bound on sparsity of Q.
Analysis via a potential function argument.
15
Warm-start
If we warm-start coordinate descent (initialize with Q_t to get Q_{t+1}), then we only need Õ(√(KT / log|Π|)) coordinate descent iterations over all T rounds.
Caveat: need one AMO call/round to even check if (OP) is solved.
16
Epoch trick
Regret analysis: Q_t has low instantaneous expected regret (crucially relying on i.i.d. assumption). Therefore the same Q_t can be used for O(t) more rounds!
Epoching: Split T rounds into epochs, solve (OP) once per epoch.
Doubling: only update on rounds 2^1, 2^2, 2^3, 2^4, … Total of O(log T) updates, so overall # AMO calls unchanged (up to log factors).
Squares: only update on rounds 1^2, 2^2, 3^2, 4^2, … Total of O(T^(1/2)) updates, each requiring Õ(√(K / log|Π|)) AMO calls, on average.
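The update counts for the two schedules can be checked directly. This is a small sketch with hypothetical function names; it lists only the rounds on which (OP) would be re-solved and does not run the algorithm itself.

```python
def doubling_rounds(T):
    """Rounds 2^1, 2^2, 2^3, ... that do not exceed T."""
    rounds, t = [], 2
    while t <= T:
        rounds.append(t)
        t *= 2
    return rounds

def square_rounds(T):
    """Rounds 1^2, 2^2, 3^2, ... that do not exceed T."""
    rounds, m = [], 1
    while m * m <= T:
        rounds.append(m * m)
        m += 1
    return rounds

T = 1000
n_doubling = len(doubling_rounds(T))  # grows like log2(T)
n_squares = len(square_rounds(T))     # grows like sqrt(T)
```

For T = 1000 the doubling schedule updates on 9 rounds (2 through 512) and the squares schedule on 31 rounds (1 through 961), matching the O(log T) and O(T^(1/2)) counts on the slide.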
17
Experiments
Bandit problem derived from classification task (RCV1). Reporting progressive validation loss.

Algorithm         Loss     Time (seconds)
Epsilon-greedy    0.095    22
Bagging           0.059    339
Linear UCB        0.128    212000
“Online Cover”    0.053    17
[Supervised]      0.051    6.9

“Online Cover” = variant with stateful AMO.