The K-armed Dueling Bandits Problem

Presentation transcript:

The K-armed Dueling Bandits Problem. COLT 2009. Yisong Yue, Josef Broder, Robert Kleinberg, Thorsten Joachims (Cornell University).

Multi-armed Bandits
- K bandits (arms / actions / strategies).
- At each time step, the algorithm chooses a bandit based on prior actions and feedback.
- Only the feedback from actions actually taken is observed.
- A partial-feedback online learning problem.
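To make the partial-feedback protocol concrete, here is a minimal Python sketch of the interaction loop. The ε-greedy choice rule, the `pull` interface, and all names are illustrative assumptions, not something stated on the slides.

```python
import random

def run_bandit(pull, k, horizon, epsilon=0.1):
    """Generic K-armed bandit loop: at each time step choose one arm and
    observe only that arm's reward. `pull(arm)` returns a stochastic reward.
    The epsilon-greedy rule below is just one illustrative choice strategy."""
    counts = [0] * k
    means = [0.0] * k
    for t in range(horizon):
        if t < k or random.random() < epsilon:
            arm = t % k if t < k else random.randrange(k)   # explore
        else:
            arm = max(range(k), key=lambda a: means[a])     # exploit
        reward = pull(arm)             # partial feedback: only the chosen arm
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]   # running average
    return means
```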

Optimizing Retrieval Functions
- Goal: interactively learn the best retrieval function.
- Search users provide implicit feedback, e.g., what they click on.
- This seems like a natural bandit problem: each retrieval function is a bandit, the feedback is clicks, and we want to find the best function w.r.t. clickthrough rate.
- But this assumes that clicks give explicit, absolute feedback!

What Results do Users View/Click? [Joachims et al., TOIS 2007]

Team-Game Interleaving (u = thorsten, q = “svm”)

f1(u,q) → r1:
1. Kernel Machines (http://svm.first.gmd.de/)
2. Support Vector Machine (http://jbolivar.freeservers.com/)
3. An Introduction to Support Vector Machines (http://www.support-vector.net/)
4. Archives of SUPPORT-VECTOR-MACHINES ... (http://www.jiscmail.ac.uk/lists/SUPPORT...)
5. SVM-Light Support Vector Machine (http://ais.gmd.de/~thorsten/svm light/)

f2(u,q) → r2:
1. Kernel Machines (http://svm.first.gmd.de/)
2. SVM-Light Support Vector Machine (http://ais.gmd.de/~thorsten/svm light/)
3. Support Vector Machine and Kernel ... References (http://svm.research.bell-labs.com/SVMrefs.html)
4. Lucent Technologies: SVM demo applet (http://svm.research.bell-labs.com/SVT/SVMsvt.html)
5. Royal Holloway Support Vector Machine (http://svm.dcs.rhbnc.ac.uk)

Interleaving(r1, r2):
1. Kernel Machines [T2] (http://svm.first.gmd.de/)
2. Support Vector Machine [T1] (http://jbolivar.freeservers.com/)
3. SVM-Light Support Vector Machine [T2] (http://ais.gmd.de/~thorsten/svm light/)
4. An Introduction to Support Vector Machines [T1] (http://www.support-vector.net/)
5. Support Vector Machine and Kernel ... References [T2] (http://svm.research.bell-labs.com/SVMrefs.html)
6. Archives of SUPPORT-VECTOR-MACHINES ... [T1] (http://www.jiscmail.ac.uk/lists/SUPPORT...)
7. Lucent Technologies: SVM demo applet [T2] (http://svm.research.bell-labs.com/SVT/SVMsvt.html)

Mix the results of f1 and f2; the relative feedback is more reliable.

This is the evaluation method. Get the ranking from the learned retrieval function and from the standard retrieval function (e.g., Google), and combine both rankings into one combined ranking in a “fair and unbiased” way. This means that at each position in the combined ranking, the number of links from “learned” equals the number of links from “google” plus or minus one. So, if users have no preference between the ranking functions, with 50/50 chance they will click on links from either one. We then evaluate whether users click on links from one ranking function significantly more often. In the example, the lowest click in the combined ranking is at position 7. Due to the fair merging, the user has seen the top 4 from both rankings. Tracing back where the clicked links came from, 3 links were in the top 4 from “learned”, but only one in the top 4 from “google”; so “learned” wins on this query. Note that this is a blind test: users do not know which retrieval function a link came from. In particular, the same abstract generator is used.

Interpretation: (r1 > r2) ↔ clicks(T1) > clicks(T2). [Radlinski, Kurup, Joachims; CIKM 2008]
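A minimal Python sketch of a team-draft style interleaving and its click-based evaluation, as described above. This is a simplified variant for illustration; the function names, depth parameter, and tie-breaking are assumptions, and the authoritative algorithm is the one in [Radlinski, Kurup, Joachims; CIKM 2008].

```python
import random

def team_draft_interleave(r1, r2, depth=10):
    """Merge two rankings so that every prefix of the combined list contains
    an (approximately) equal number of documents from each team."""
    r1, r2 = list(r1), list(r2)
    combined, team_of = [], {}
    while len(combined) < depth and (r1 or r2):
        order = [("T1", r1), ("T2", r2)]
        random.shuffle(order)               # coin flip: who drafts first
        for team, ranking in order:
            while ranking and ranking[0] in team_of:
                ranking.pop(0)              # skip docs already placed
            if ranking and len(combined) < depth:
                doc = ranking.pop(0)
                combined.append(doc)
                team_of[doc] = team
    return combined, team_of

def interleaving_winner(team_of, clicked_docs):
    """Credit each click to the team that contributed the clicked document."""
    t1 = sum(1 for d in clicked_docs if team_of.get(d) == "T1")
    t2 = sum(1 for d in clicked_docs if team_of.get(d) == "T2")
    return "r1" if t1 > t2 else ("r2" if t2 > t1 else "tie")
```

Repeating this over many queries and users yields the noisy pairwise comparison between r1 and r2 that serves as feedback in the dueling bandits setting below.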

Dueling Bandits Problem
- Given K bandits b1, …, bK.
- Each iteration: compare (duel) two bandits, e.g., by interleaving two retrieval functions.
- Comparisons are noisy, and each comparison result is independent.
- The comparison probabilities are initially unknown but fixed over time.
- There is a total preference ordering over the bandits, initially unknown.
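As a concrete illustration of this setup, here is a small Python simulator sketch. The Bradley-Terry construction (one of the standard models mentioned on the Assumptions slide) and the function names are illustrative assumptions.

```python
import random

def bradley_terry_matrix(utilities):
    """Comparison probabilities P[i][j] = P(b_i beats b_j) built from positive
    per-bandit utilities; such models satisfy the assumptions used later."""
    k = len(utilities)
    return [[utilities[i] / (utilities[i] + utilities[j]) if i != j else 0.5
             for j in range(k)]
            for i in range(k)]

def duel(P, i, j):
    """One noisy, independent comparison: True if bandit i beats bandit j."""
    return random.random() < P[i][j]
```

The probabilities are fixed but unknown to the learner, which only observes the outcomes of the duels it requests.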

Dueling Bandits Problem
- We want to find the best (or a good) bandit, similar to finding the max with noisy comparisons.
- Ours is a regret-minimization setting.
- Each iteration, choose a pair (bt, bt’) to minimize the regret: the fraction of users who prefer the best bandit over the chosen ones (the slide’s formula is reconstructed below).
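The regret expression on this slide did not survive the transcript. A reconstruction consistent with the slide’s description, writing ε(bi, bj) = P(bi > bj) − ½ and b* for the best bandit, is:

```latex
R_T \;=\; \sum_{t=1}^{T} \Big( \epsilon(b^{*}, b_t) \;+\; \epsilon(b^{*}, b_t') \Big)
```

i.e., the accumulated (scaled) fraction of users who would have preferred the best bandit over the two bandits actually dueled at each step.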

Related Work
- Existing MAB settings:
  - Stochastic setting [Lai & Robbins, ’85] [Auer et al., ’02]
  - PAC setting [Even-Dar et al., ’06]
  - Partial monitoring problem [Cesa-Bianchi et al., ’06]
- Computing with noisy comparisons:
  - Finding the max [Feige et al., ’97]
  - Binary search [Karp & Kleinberg, ’07] [Ben-Or & Hassidim, ’08]
- Continuous-armed Dueling Bandits Problem [Yue & Joachims, ’09]

Assumptions
- P(bi > bj) = ½ + εij (distinguishability).
- Strong stochastic transitivity: a monotonicity property for three bandits bi > bj > bk.
- Stochastic triangle inequality: a diminishing-returns property.
- These assumptions are satisfied by many standard models, e.g., Logistic / Bradley-Terry. (The precise forms are given below.)
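The precise forms of the two properties, reconstructed from the definitions above (the slide’s formulas were images): for bandits ordered bi > bj > bk,

```latex
\underbrace{\epsilon_{ik} \;\ge\; \max\{\epsilon_{ij},\, \epsilon_{jk}\}}_{\text{strong stochastic transitivity}}
\qquad\text{and}\qquad
\underbrace{\epsilon_{ik} \;\le\; \epsilon_{ij} + \epsilon_{jk}}_{\text{stochastic triangle inequality}}
```

where εij = P(bi > bj) − ½ as above.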

Naïve Approach
- In the deterministic case, O(K) comparisons suffice to find the max.
- Extension to the noisy case: repeatedly compare two bandits until confident that one is better.
- Problem: when comparing two awful (but similar) bandits, we waste comparisons deciding which awful bandit is better and incur high regret for each comparison.
- The same problem also applies to elimination tournaments.

Interleaved Filter
- Choose a candidate bandit at random.
- Make noisy comparisons (Bernoulli trials) against all other bandits simultaneously, maintaining a mean and confidence interval for each pair of bandits being compared…
- …until another bandit is better with confidence 1 – δ.
- Repeat the process with the new candidate, removing all empirically worse bandits.
- Continue until one candidate is left.
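A minimal Python sketch of the procedure just described. The `duel(a, b)` interface (True when a wins one comparison), the exact confidence radius, and the tie-handling are illustrative assumptions rather than the paper’s precise specification.

```python
import math
import random

def interleaved_filter(bandits, duel, horizon):
    """Sketch of the Interleaved Filter strategy described above.
    `bandits` is a list of arm ids, `duel(a, b)` returns True when arm a
    wins one noisy comparison against arm b, and `horizon` is T."""
    delta = 1.0 / (len(bandits) ** 2 * horizon)       # delta = 1/(K^2 T), as in the analysis
    candidate = random.choice(bandits)
    remaining = [b for b in bandits if b != candidate]

    while remaining:
        wins = {b: 0 for b in remaining}              # times the candidate beat b
        plays = {b: 0 for b in remaining}
        challenger = None

        while challenger is None and remaining:
            for b in list(remaining):                 # compare against all others in parallel
                plays[b] += 1
                wins[b] += 1 if duel(candidate, b) else 0
                p_hat = wins[b] / plays[b]            # empirical P(candidate beats b)
                radius = math.sqrt(math.log(1.0 / delta) / plays[b])
                if p_hat - radius > 0.5:              # confident the candidate beats b: drop b
                    remaining.remove(b)
                elif p_hat + radius < 0.5:            # confident b beats the candidate
                    challenger = b
                    break

        if challenger is None:                        # candidate survived against everyone left
            return candidate

        # Drop bandits that are empirically worse than the defeated candidate,
        # then start a new round with the challenger as the candidate.
        remaining = [b for b in remaining
                     if b != challenger
                     and not (plays[b] > 0 and wins[b] / plays[b] > 0.5)]
        candidate = challenger

    return candidate
```

In a simulation, `duel` could be a closure around the preference-matrix sampler sketched earlier; the slides that follow bound the regret such a procedure accumulates.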

Regret Analysis
- Define a round to be all the time steps for a particular candidate bandit; a round halts when the 1 – δ confidence level is met.
- Treating the sequence of candidate bandits as a random walk, it takes O(log K) rounds to reach the best bandit.
- Define a match to be all the comparisons between two bandits in a round; there are O(K) total matches in expectation, since a constant fraction of the bandits is removed each round.

Regret Analysis
- O(K) total matches.
- The regret each match incurs depends on the confidence parameter δ = 1/(K²T).
- The algorithm finds the best bandit with probability 1 – 1/T.
- Expected regret: the bound is reconstructed below.
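The expected-regret bound on this slide was an image. The bound the analysis is building toward, with ε1,2 denoting the gap between the best and second-best bandit, is:

```latex
\mathbb{E}[R_T] \;=\; O\!\left(\frac{K}{\epsilon_{1,2}}\,\log T\right)
```

which is the main result of the COLT 2009 paper cited in the title slide.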

Lower Bound Example
- Order the bandits b1 > b2 > … > bK, with P(b > b’) = ½ + ε.
- Each match requires a minimum number of comparisons to distinguish the gap ε (see the reconstruction below).
- We pay Θ(ε) regret for each comparison.
- The accumulated regret over all matches is therefore bounded from below as shown beneath this list.
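The two quantities missing from this slide can be reconstructed as follows, as a sketch consistent with the matching upper bound: distinguishing a gap of ε with the required confidence takes on the order of

```latex
\Omega\!\left(\frac{1}{\epsilon^{2}}\,\log T\right)\ \text{comparisons per match},
\qquad\text{so over } \Theta(K) \text{ matches the accumulated regret is at least}\qquad
\Omega\!\left(\frac{K}{\epsilon}\,\log T\right).
```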

Moving Forward
- Extensions:
  - Drifting user interests
  - Changing user population / document collection
  - Contextualization / side information
  - Cost-sensitive active learning with relative feedback
- Live user studies:
  - Interactive experiments on a search service
  - Sheds insight; guides future design

Extra Slides

Per-Match Regret
- Number of comparisons in the match bi vs bj (bi being the current candidate):
  - If ε1i > εij: the round ends before we conclude bi > bj.
  - If ε1i < εij: we conclude bi > bj before the round ends and remove bj.
- We pay ε1i + ε1j regret for each comparison; by the triangle inequality, ε1i + ε1j ≤ 2·max{ε1i, εij}.
- Thus, by stochastic transitivity, the accumulated regret is as reconstructed below.
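A hedged reconstruction of the missing per-match quantities, using the confidence parameter δ = 1/(K²T) from the earlier slides: the match bi vs bj lasts

```latex
O\!\left(\frac{\log(1/\delta)}{\max\{\epsilon_{1i},\,\epsilon_{ij}\}^{2}}\right)\ \text{comparisons},
\quad\text{so its regret is}\quad
O\!\left(\frac{\epsilon_{1i}+\epsilon_{1j}}{\max\{\epsilon_{1i},\,\epsilon_{ij}\}^{2}}\,\log\frac{1}{\delta}\right)
\;=\; O\!\left(\frac{\log(1/\delta)}{\max\{\epsilon_{1i},\,\epsilon_{ij}\}}\right)
\;\le\; O\!\left(\frac{1}{\epsilon_{1,2}}\,\log (KT)\right),
```

using the slide’s bound ε1i + ε1j ≤ 2·max{ε1i, εij} and the fact that max{ε1i, εij} ≥ ε1,2 under strong stochastic transitivity. Summed over the O(K) matches, this gives the O((K/ε1,2) log T) expected regret quoted earlier.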

Number of Rounds
- Assume all superior bandits have an equal probability of defeating the candidate (the worst-case scenario under transitivity).
- Model this as a random walk: rj transitions to each ri (i < j) with equal probability.
- Compute the total number of steps before reaching r1 (i.e., r*).
- One can show this is O(log K) with high probability using a Chernoff bound.

Total Matches Played
- O(K) matches are played in each round, so a naïve analysis yields O(K log K) matches in total.
- However, all empirically worse bandits are also removed at the end of each round and will not participate in future rounds.
- Assume the worst case, in which inferior bandits have a ½ chance of being empirically worse.
- One can then show with high probability that only O(K) total matches are played over the O(log K) rounds.

Removing Inferior Bandits
- At the conclusion of each round, remove any empirically worse bandits.
- Intuition: we have high confidence that the winner is better than the incumbent candidate, and empirically worse bandits cannot be “much better” than the incumbent candidate.
- One can show via a Hoeffding bound that the winner is also better than the empirically worse bandits, with high confidence.
- This preserves the overall 1 – 1/T confidence that we will find the best bandit.