1
The K-armed Dueling Bandits Problem
Yisong Yue, Josef Broder, Robert Kleinberg, Thorsten Joachims
Cornell University
COLT 2009
2
Multi-armed Bandits
K bandits (arms / actions / strategies)
Each time step, algorithm chooses bandit based on prior actions and feedback.
Only observe feedback from actions taken.
Partial Feedback Online Learning Problem
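For concreteness, a minimal sketch (not from the slides) of this partial-feedback protocol; the ε-greedy choice rule, the Bernoulli rewards, and all constants are purely illustrative stand-ins.

```python
import random

# Partial feedback: only the reward of the arm actually pulled is observed.
K = 5
true_rates = [random.random() for _ in range(K)]      # unknown to the algorithm
counts, means = [0] * K, [0.0] * K

for t in range(1000):
    if random.random() < 0.1:                         # occasionally explore
        arm = random.randrange(K)
    else:                                              # otherwise exploit the best estimate
        arm = max(range(K), key=lambda a: means[a])
    reward = 1.0 if random.random() < true_rates[arm] else 0.0   # feedback for this arm only
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean update
```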
3
Optimizing Retrieval Functions
Interactively learn the best retrieval function
Search users provide implicit feedback
E.g., what they click on
Seems like a natural bandit problem:
Each retrieval function is a bandit
Feedback is clicks
Find the best w.r.t. clickthrough rate
Assumes clicks → explicit absolute feedback!
4
What Results do Users View/Click?
[Joachims et al., TOIS 2007]
5
Team-Game Interleaving
(u=thorsten, q=“svm”)

f1(u,q) → r1:
1. Kernel Machines
2. Support Vector Machine
3. An Introduction to Support Vector Machines
4. Archives of SUPPORT-VECTOR-MACHINES
5. SVM-Light Support Vector Machine

f2(u,q) → r2:
1. Kernel Machines
2. SVM-Light Support Vector Machine
3. Support Vector Machine and Kernel ... References
4. Lucent Technologies: SVM demo applet
5. Royal Holloway Support Vector Machine

Interleaving(r1, r2):
1. Kernel Machines (T2)
2. Support Vector Machine (T1)
3. SVM-Light Support Vector Machine (T2)
4. An Introduction to Support Vector Machines (T1)
5. Support Vector Machine and Kernel ... References (T2)
6. Archives of SUPPORT-VECTOR-MACHINES ... (T1)
7. Lucent Technologies: SVM demo applet (T2)

Mix the results of f1 and f2; the resulting relative feedback is more reliable than absolute click feedback.

This is the evaluation method: get the ranking from the learned retrieval function and from the standard retrieval function (e.g., Google), and combine the two rankings into one in a “fair and unbiased” way. This means that at each position in the combined ranking, the number of links from “learned” equals the number of links from “Google”, plus or minus one. So if users have no preference between the two ranking functions, they will click on links from either one with 50/50 chance. We then evaluate whether users click on links from one ranking function significantly more often.

In the example, the lowest click in the combined ranking is at position 7. Due to the “fair” merging, the user has seen the top 4 from both rankings. Tracing back where the clicked links came from, 3 were in the top 4 from “learned” but only 1 was in the top 4 from “Google”, so “learned” wins on this query. Note that this is a blind test: users do not know which retrieval function a link came from; in particular, the same abstract generator is used for both.

Interpretation: (r1 > r2) ↔ clicks(T1) > clicks(T2)

[Radlinski, Kurup, Joachims; CIKM 2008]
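A rough sketch of how such an interleaved comparison can be implemented (a team-draft-style mix in the spirit of the Radlinski, Kurup, Joachims reference; the function names and the simple click-attribution rule are illustrative, not the authors' exact procedure):

```python
import random

def team_draft_interleave(r1, r2, k=10):
    """The two rankings take turns contributing their highest-ranked result that
    is not already in the combined list; ties in turn order are broken by a coin
    flip so that neither ranking is favored."""
    combined, team = [], []          # team[i] says which ranking filled position i
    n1 = n2 = 0                      # picks made so far by each ranking
    while len(combined) < k:
        remaining1 = [d for d in r1 if d not in combined]
        remaining2 = [d for d in r2 if d not in combined]
        if not remaining1 and not remaining2:
            break
        pick1 = bool(remaining1) and (
            not remaining2 or n1 < n2 or (n1 == n2 and random.random() < 0.5))
        if pick1:
            combined.append(remaining1[0])
            team.append(1)
            n1 += 1
        else:
            combined.append(remaining2[0])
            team.append(2)
            n2 += 1
    return combined, team

def duel_outcome(team, clicked_positions):
    """Credit each click (0-based position) to the ranking that contributed it;
    the ranking with more clicks wins this duel (0 means a tie)."""
    c1 = sum(1 for i in clicked_positions if team[i] == 1)
    c2 = sum(1 for i in clicked_positions if team[i] == 2)
    return 1 if c1 > c2 else (2 if c2 > c1 else 0)
```

In the dueling-bandits terminology used next, duel_outcome returning 1 corresponds to observing r1 > r2 on that query.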
6
Dueling Bandits Problem
Given K bandits b1, …, bK
Each iteration: compare (duel) two bandits
E.g., interleaving two retrieval functions
Comparison is noisy:
Each comparison result is independent
Comparison probabilities initially unknown
Comparison probabilities fixed over time
Total preference ordering, initially unknown
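A tiny simulator of this comparison model (illustrative only): eps is assumed to be a fixed antisymmetric matrix of the unknown gaps εij.

```python
import random

def duel(i, j, eps):
    """One noisy comparison: bandit i beats bandit j with probability 1/2 + eps[i][j].
    The probabilities are fixed over time, with eps[j][i] = -eps[i][j]."""
    return random.random() < 0.5 + eps[i][j]
```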
7
Dueling Bandits Problem
Want to find best (or good) bandit
Similar to finding the max w/ noisy comparisons
Ours is a regret minimization setting
Choose pair (bt, bt’) to minimize regret:
(% of users who prefer the best bandit over the chosen ones)
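Written out explicitly (the formula itself was a graphic on the original slide; this reconstruction follows the paper's definition), with b* the best bandit, (bt, bt’) the pair dueled at step t, and ε(b*, b) = P(b* ≻ b) − ½:

\[
  R_T \;=\; \sum_{t=1}^{T} \Big( \epsilon(b^*, b_t) + \epsilon(b^*, b_t') \Big).
\]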
8
Related Work
Existing MAB settings [Lai & Robbins, ’85] [Auer et al., ’02]
PAC setting [Even-Dar et al., ’06]
Partial monitoring problem [Cesa-Bianchi et al., ’06]
Computing with noisy comparisons: finding max [Feige et al., ’97]; binary search [Karp & Kleinberg, ’07] [Ben-Or & Hassidim, ’08]
Continuous-armed Dueling Bandits Problem [Yue & Joachims, ’09]
9
Assumptions
P(bi > bj) = ½ + εij (distinguishability)
Strong Stochastic Transitivity: for three bandits bi > bj > bk, εik ≥ max{εij, εjk}
Monotonicity property
Stochastic Triangle Inequality: for three bandits bi > bj > bk, εik ≤ εij + εjk
Diminishing returns property
Satisfied by many standard models
E.g., Logistic / Bradley-Terry
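As a concrete instance (a sketch, not on the slide): in the logistic / Bradley-Terry model each bandit bi has a utility μi, and

\[
  P(b_i \succ b_j) \;=\; \frac{1}{1 + e^{-(\mu_i - \mu_j)}},
  \qquad
  \epsilon_{ij} \;=\; P(b_i \succ b_j) - \tfrac{1}{2},
\]

and these εij satisfy both strong stochastic transitivity and the stochastic triangle inequality, as the slide notes.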
10
Naïve Approach
In the deterministic case, O(K) comparisons to find the max
Extend to the noisy case: repeatedly compare until confident one is better
Problem: comparing two awful (but similar) bandits
Waste comparisons to see which awful bandit is better
Incur high regret for each comparison
Also applies to elimination tournaments
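A sketch of that "repeatedly compare until confident" subroutine (illustrative only; the confidence radius and δ are placeholders). When the two bandits are nearly identical, εij is tiny, the loop runs for a very long time, and every one of those comparisons incurs regret, which is exactly the problem noted above.

```python
import math

def compare_until_confident(i, j, duel, delta=0.05, max_t=10**6):
    """Duel i against j until a Hoeffding-style confidence interval around the
    empirical win rate excludes 1/2; duel(a, b) -> True iff a wins."""
    wins = 0
    for t in range(1, max_t + 1):
        wins += duel(i, j)
        p_hat = wins / t
        radius = math.sqrt(math.log(2 * max_t / delta) / (2 * t))
        if p_hat - radius > 0.5:
            return i              # confident that i is the better bandit
        if p_hat + radius < 0.5:
            return j
    return None                   # never separated (e.g., two nearly identical bandits)

# Usage with the simulator sketched earlier:
#   compare_until_confident(0, 1, lambda a, b: duel(a, b, eps))
```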
11
Interleaved Filter
Choose candidate bandit at random
12
Interleaved Filter
Choose candidate bandit at random
Make noisy comparisons (Bernoulli trial) against all other bandits simultaneously
Maintain mean and confidence interval for each pair of bandits being compared
13
Interleaved Filter
Choose candidate bandit at random
Make noisy comparisons (Bernoulli trial) against all other bandits simultaneously
Maintain mean and confidence interval for each pair of bandits being compared
…until another bandit is better
With confidence 1 – δ
14
Interleaved Filter
Choose candidate bandit at random
Make noisy comparisons (Bernoulli trial) against all other bandits simultaneously
Maintain mean and confidence interval for each pair of bandits being compared
…until another bandit is better
With confidence 1 – δ
Repeat process with new candidate
Remove all empirically worse bandits
15
Interleaved Filter
Choose candidate bandit at random
Make noisy comparisons (Bernoulli trial) against all other bandits simultaneously
Maintain mean and confidence interval for each pair of bandits being compared
…until another bandit is better
With confidence 1 – δ
Repeat process with new candidate
Remove all empirically worse bandits
Continue until 1 candidate left
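Putting the steps together, a compact sketch of the procedure (an illustrative reconstruction, not the authors' reference implementation); duel(i, j) is any noisy comparison routine returning True when i wins, and δ = K⁻²T⁻¹ follows the analysis on the next slides.

```python
import math
import random

def interleaved_filter(K, T, duel):
    """Sketch of Interleaved Filter: keep a candidate, duel it against all remaining
    bandits in parallel, prune confidently-worse bandits, and switch candidates when
    some bandit confidently beats the current one."""
    delta = 1.0 / (T * K * K)
    candidate = random.randrange(K)
    W = [b for b in range(K) if b != candidate]
    wins = {b: 0 for b in W}   # candidate's wins against b
    n = {b: 0 for b in W}      # comparisons played against b
    t = 0
    while W and t < T:
        for b in list(W):                      # one comparison per remaining bandit
            wins[b] += duel(candidate, b)
            n[b] += 1
            t += 1
        conf = {b: math.sqrt(math.log(1.0 / delta) / n[b]) for b in W}
        # prune bandits the candidate beats with confidence 1 - delta
        W = [b for b in W if wins[b] / n[b] - conf[b] <= 0.5]
        # has some bandit beaten the candidate with confidence 1 - delta?
        better = [b for b in W if wins[b] / n[b] + conf[b] < 0.5]
        if better:
            new_candidate = better[0]
            # also drop every bandit that is empirically worse than the old candidate
            W = [b for b in W if b != new_candidate and wins[b] / n[b] <= 0.5]
            candidate = new_candidate
            wins = {b: 0 for b in W}
            n = {b: 0 for b in W}
    return candidate   # play (candidate, candidate) for any remaining time steps
```

Choosing δ = K⁻²T⁻¹ is what lets a union bound over all matches give the 1 − 1/T success probability quoted later.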
16
Regret Analysis
Define a round to be all the time steps for a particular candidate bandit
Round halts when 1 – δ confidence is met
Treat candidate bandits as a random walk
Takes log K rounds to reach best bandit
Define a match to be all the comparisons between two bandits in a round
O(K) total matches in expectation
Constant fraction of bandits removed each round
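One way to make the last two bullets concrete (a back-of-the-envelope sketch, assuming some constant fraction c of the remaining bandits is eliminated in every round): the candidate plays at most one match per remaining bandit per round, so the total number of matches is dominated by a geometric series,

\[
  \sum_{j \ge 0} (1-c)^{j} K \;=\; \frac{K}{c} \;=\; O(K).
\]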
17
Regret Analysis
O(K) total matches
Each match incurs regret O((1/ε12) log(1/δ))
Depends on δ = K⁻²T⁻¹, so log(1/δ) = O(log T)
Finds the best bandit w.p. 1 − 1/T
Expected regret: O((K/ε12) log T), where ε12 is the gap between the best and second-best bandit
18
Lower Bound Example
Order bandits b1 > b2 > … > bK
P(b > b’) = ½ + ε for every such pair
Each match takes Ω((1/ε²) log T) comparisons
Pay Θ(ε) regret for each comparison
Accumulated regret over all matches is at least Ω((K/ε) log T)
19
Moving Forward
Extensions:
Drifting user interests
Changing user population / document collection
Contextualization / side information
Cost-sensitive active learning w/ relative feedback
Live user studies:
Interactive experiments on search service
Sheds insight; guides future design
20
Extra Slides
21
Per-Match Regret
Number of comparisons in match bi vs bj: O(log(1/δ) / max{ε1i, εij}²)
ε1i > εij : round ends before concluding bi > bj
ε1i < εij : conclude bi > bj before round ends, remove bj
Pay ε1i + ε1j regret for each comparison
By triangle inequality, ε1i + ε1j ≤ 2·max{ε1i, εij}
Thus by stochastic transitivity, accumulated regret per match is O((1/ε12) log(1/δ)) = O((1/ε12) log T)
22
Number of Rounds
Assume all superior bandits have equal probability of defeating the candidate
Worst-case scenario under transitivity
Model this as a random walk:
rj transitions to each ri (i < j) with equal probability
Compute total number of steps before reaching r1 (i.e., r*)
Can show O(log K) w.h.p. using Chernoff bound
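A short side calculation (not on the slide, but consistent with the uniform-jump model above): write Tj for the number of rounds remaining when the current candidate is the j-th best bandit; then

\[
  \mathbb{E}[T_j] \;=\; 1 + \frac{1}{\,j-1\,}\sum_{i=1}^{j-1}\mathbb{E}[T_i]
  \quad\Longrightarrow\quad
  \mathbb{E}[T_j] \;=\; H_{j-1} \;=\; O(\log K),
\]

where H_n is the n-th harmonic number; the Chernoff bound mentioned on the slide upgrades this expectation to a high-probability O(log K) guarantee.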
23
Total Matches Played
O(K) matches played in each round
Naïve analysis yields O(K log K) total
However, all empirically worse bandits are also removed at the end of each round
Will not participate in future rounds
Assume the worst case that inferior bandits have a ½ chance of being empirically worse
Can show w.h.p. that O(K) total matches are played over the O(log K) rounds
24
Removing Inferior Bandits
At the conclusion of each round, remove any empirically worse bandits
Intuition:
High confidence that the winner is better than the incumbent candidate
Empirically worse bandits cannot be “much better” than the incumbent candidate
Can show via a Hoeffding bound that the winner is also better than the empirically worse bandits, with high confidence
Preserves the overall 1 − 1/T confidence that we will find the best bandit