Multi-armed Bandit Problems with Dependent Arms Sandeep Pandey (spandey@cs.cmu.edu) Deepayan Chakrabarti (deepay@yahoo-inc.com) Deepak Agarwal (dagarwal@yahoo-inc.com)
Background: Bandits
Bandit “arms” with unknown reward probabilities μ1, μ2, μ3
Pull arms sequentially so as to maximize the total expected reward
Examples: show ads on a webpage to maximize clicks; product recommendation to maximize sales
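A minimal sketch of one such sequential policy (UCB1, a standard bandit algorithm) on Bernoulli arms; the arm probabilities, horizon, and function name below are illustrative assumptions, not part of the talk:

```python
import math
import random

def ucb1(mu, horizon=10000, seed=0):
    """Run UCB1 on arms with (unknown) Bernoulli reward probabilities `mu`;
    returns the total reward collected over the horizon."""
    rng = random.Random(seed)
    k = len(mu)
    pulls = [0] * k       # number of times each arm was pulled
    successes = [0] * k   # number of rewards observed per arm
    total = 0
    for t in range(1, horizon + 1):
        if t <= k:        # pull every arm once first
            arm = t - 1
        else:             # then maximize empirical mean + confidence bonus
            arm = max(range(k), key=lambda i:
                      successes[i] / pulls[i] + math.sqrt(2 * math.log(t) / pulls[i]))
        reward = 1 if rng.random() < mu[arm] else 0
        pulls[arm] += 1
        successes[arm] += reward
        total += reward
    return total

# Illustrative reward probabilities (e.g., click-through rates of three ads).
print(ucb1([0.30, 0.28, 1e-6]))
```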
Dependent Arms
Reward probabilities μi are generally assumed to be independent of each other
What if they are dependent? E.g., ads on similar topics, using similar text/phrases, should have similar rewards
Example ads: “Skiing, snowboarding” (μ1 = 0.3), “Skiing, snowshoes” (μ2 = 0.28), “Snowshoe rental” (μ3 = 0.31), “Get Vonage!” (μ4 = 10⁻⁶)
Dependent Arms
Reward probabilities μi are generally assumed to be independent of each other
What if they are dependent? E.g., ads on similar topics, using similar text/phrases, should have similar rewards
A click on one ad suggests that other “similar” ads may generate clicks as well
Can we increase the total reward using this dependency?
Cluster Model of Dependence
Arms are partitioned into clusters (e.g., arms 1 and 2 in cluster 1; arms 3 and 4 in cluster 2)
Successes si ~ Bin(ni, μi), where ni is the number of pulls of arm i
μi ~ f(π[i]), where f is some known distribution and π[i] is the (unknown) cluster-specific parameter of arm i’s cluster
No dependence across clusters
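A small sketch of this generative model, under the assumption (for illustration only) that f is a Beta distribution and π[i] is its (α, β) parameter pair; all numbers, names, and cluster assignments below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical cluster parameters pi_j: each is an (alpha, beta) pair of a Beta
# distribution playing the role of the known family f.
cluster_params = {1: (3.0, 7.0), 2: (1.0, 99.0)}   # cluster 1 "good", cluster 2 "bad"
arm_to_cluster = {0: 1, 1: 1, 2: 2, 3: 2}          # arms 1,2 in cluster 1; arms 3,4 in cluster 2

# Draw mu_i ~ f(pi[i]) for each arm: dependence enters only through the shared cluster parameter.
mu = {i: rng.beta(*cluster_params[c]) for i, c in arm_to_cluster.items()}

# Given n_i pulls of arm i, successes s_i ~ Bin(n_i, mu_i).
n = {0: 50, 1: 10, 2: 40, 3: 5}                    # illustrative pull counts
s = {i: rng.binomial(n[i], mu[i]) for i in mu}
print(mu, s)
```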
Cluster Model of Dependence
Arms 1, 2: μi ~ f(π1); arms 3, 4: μi ~ f(π2)
Total reward:
Discounted: ∑_{t=0}^{∞} α^t · E[R(t)], where α is the discounting factor
Undiscounted: ∑_{t=0}^{T} E[R(t)]
Discounted Reward
The optimal policy can be computed using per-cluster MDPs only.
[Figure: a separate belief-state MDP per cluster, e.g., “Pull Arm 1” transitions in the MDP for cluster 1 (arms 1, 2) and “Pull Arm 3” transitions in the MDP for cluster 2 (arms 3, 4)]
Optimal Policy: compute an (“index”, arm) pair for each cluster; pick the cluster with the largest index, and pull the corresponding arm
Discounted Reward
The optimal policy can be computed using per-cluster MDPs only.
[Figure: per-cluster belief-state MDPs, as on the previous slide]
Optimal Policy: compute an (“index”, arm) pair for each cluster; pick the cluster with the largest index, and pull the corresponding arm
This reduces the problem to smaller state spaces
Reduces to Gittins’ Theorem [1979] for independent bandits
Approximation bounds on the index for k-step lookahead
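A rough sketch of a k-step lookahead value for one cluster, under a strong simplifying assumption: all arms in the cluster share a single Beta-distributed reward probability, so the cluster’s belief state is just a Beta(a, b) posterior and the “which arm” part of the (“index”, arm) pair is trivial. This only illustrates the lookahead recursion and the cluster-index comparison, not the paper’s exact index computation; all beliefs and the lookahead depth are made up:

```python
from functools import lru_cache

def lookahead_index(a, b, k, alpha=0.9):
    """k-step lookahead estimate of the discounted value of a cluster whose
    belief state is Beta(a, b); alpha is the discount factor."""
    @lru_cache(maxsize=None)
    def value(a, b, steps):
        p = a / (a + b)                                  # posterior mean reward probability
        if steps == 0:
            return p / (1.0 - alpha)                     # tail: keep pulling at the current mean
        cont = (p * value(a + 1, b, steps - 1)           # belief after a success
                + (1 - p) * value(a, b + 1, steps - 1))  # belief after a failure
        return p + alpha * cont
    return value(a, b, k)

# Policy skeleton: compute a value per cluster and pull from the largest.
clusters = {"c1": (5.0, 10.0), "c2": (1.0, 30.0)}        # illustrative Beta beliefs per cluster
best = max(clusters, key=lambda c: lookahead_index(*clusters[c], k=6))
print(best)
```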
Cluster Model of Dependence
Arms 1, 2: μi ~ f(π1); arms 3, 4: μi ~ f(π2)
Total reward:
Discounted: ∑_{t=0}^{∞} α^t · E[R(t)], where α is the discounting factor
Undiscounted: ∑_{t=0}^{T} E[R(t)]
Undiscounted Reward
All arms in a cluster are similar, so they can be grouped into one hypothetical “cluster arm”
[Figure: arms 1, 2 grouped into “cluster arm” 1; arms 3, 4 grouped into “cluster arm” 2]
Undiscounted Reward: Two-Level Policy
In each iteration:
Pick a “cluster arm” using a traditional bandit policy
Pick an arm within that cluster using a traditional bandit policy
Each “cluster arm” must have some estimated reward probability
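A minimal sketch of the Two-Level Policy, using UCB1 as the “traditional bandit policy” at both levels and (anticipating the next slides) the MEAN estimate as the cluster arm’s reward statistic; the arm probabilities and cluster assignments are illustrative assumptions:

```python
import math, random

def ucb_pick(pulls, succ, t):
    """Standard UCB1 choice over parallel lists of pull and success counts."""
    for i, n in enumerate(pulls):
        if n == 0:
            return i
    return max(range(len(pulls)), key=lambda i:
               succ[i] / pulls[i] + math.sqrt(2 * math.log(t) / pulls[i]))

def two_level_policy(mu, clusters, horizon=20000, seed=0):
    """Two-Level Policy sketch: UCB1 over "cluster arms", then UCB1 over the
    arms inside the chosen cluster. The cluster arm's reward estimate is the
    MEAN scheme (pooled successes / pooled pulls)."""
    rng = random.Random(seed)
    c_pulls = [0] * len(clusters); c_succ = [0] * len(clusters)
    a_pulls = {i: 0 for i in range(len(mu))}; a_succ = {i: 0 for i in range(len(mu))}
    total = 0
    for t in range(1, horizon + 1):
        c = ucb_pick(c_pulls, c_succ, t)                 # level 1: pick a cluster arm
        arms = clusters[c]
        j = arms[ucb_pick([a_pulls[i] for i in arms],    # level 2: pick an arm inside it
                          [a_succ[i] for i in arms], t)]
        r = 1 if rng.random() < mu[j] else 0
        c_pulls[c] += 1; c_succ[c] += r                  # MEAN: pooled cluster statistics
        a_pulls[j] += 1; a_succ[j] += r
        total += r
    return total

# Illustrative instance: 4 arms in 2 clusters (all numbers are made up).
print(two_level_policy(mu=[0.30, 0.28, 0.05, 0.04], clusters=[[0, 1], [2, 3]]))
```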
Issues What is the reward probability of a “cluster arm”? How do cluster characteristics affect performance?
Reward probability of a “cluster arm”
What is the reward probability r of a “cluster arm”?
MEAN: r = ∑i si / ∑i ni, i.e., the average success rate summed over all arms in the cluster [Kocsis+/2006, Pandey+/2007]
Initially, r ≈ μavg, the average μ of the arms in the cluster
Finally, r ≈ μmax, the maximum μ among the arms in the cluster
Hence there is “drift” in the reward probability of the “cluster arm”
Reward probability drift causes problems
[Figure: cluster 2 is the optimal cluster, containing the best (optimal) arm with reward probability μopt]
Because of drift, non-optimal clusters might temporarily look better, so the optimal arm is explored only O(log T) times
Reward probability of a “cluster arm”
What is the reward probability r of a “cluster arm”?
MEAN: r = ∑i si / ∑i ni
MAX: r = maxi E[μi]
PMAX: r = E[ maxi μi ]
(sums and maxima are taken over all arms i in the cluster)
Both MAX and PMAX aim to estimate μmax and thus reduce drift
Reward probability of a “cluster arm”
MEAN: r = ∑i si / ∑i ni
MAX: r = maxi E[μi]
PMAX: r = E[ maxi μi ]
Both MAX and PMAX aim to estimate μmax and thus reduce drift, but the schemes trade off bias in estimating μmax against the variance of the estimator: MEAN has high bias but low variance, while PMAX is unbiased but has high variance
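A sketch of the three estimates for a single cluster, assuming (for illustration) independent Beta(1, 1)-prior posteriors per arm and a Monte Carlo approximation of the expectation in PMAX; the counts in the example are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def cluster_arm_estimates(s, n, prior=(1.0, 1.0), samples=10000):
    """MEAN, MAX, and PMAX estimates of a cluster arm's reward probability from
    per-arm successes s[i] and pulls n[i], using independent Beta posteriors
    per arm (a simplification for illustration)."""
    s, n = np.asarray(s, float), np.asarray(n, float)
    a, b = prior[0] + s, prior[1] + (n - s)          # per-arm Beta posteriors
    mean_est = s.sum() / n.sum()                     # MEAN: pooled success rate
    max_est = (a / (a + b)).max()                    # MAX:  max of posterior means
    draws = rng.beta(a, b, size=(samples, len(s)))   # PMAX: Monte Carlo estimate of
    pmax_est = draws.max(axis=1).mean()              #       E[ max_i mu_i ]
    return mean_est, max_est, pmax_est

# Illustrative cluster: one well-explored good arm, one barely-explored arm.
print(cluster_arm_estimates(s=[30, 1], n=[100, 5]))
```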
Comparison of schemes
[Figure: reward over time for the schemes; 10 clusters, 11.3 arms/cluster]
MAX performs best
Issues What is the reward probability of a “cluster arm”? How do cluster characteristics affect performance?
Effects of cluster characteristics
We analytically study the effects of cluster characteristics on the “crossover time”
Crossover time: the time when the expected reward probability of the optimal cluster becomes the highest among all “cluster arms”
Effects of cluster characteristics
Crossover time Tc for MEAN depends on:
Cluster separation Δ = μopt − (max μ outside the optimal cluster): as Δ increases, Tc decreases
Cluster size Aopt: as Aopt increases, Tc increases
Cohesiveness of the optimal cluster, 1 − avg(μopt − μi): as cohesiveness increases, Tc decreases
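An illustrative simulation of the crossover time for MEAN, simplified so the drift is easy to see: clusters are explored round-robin, UCB1 runs inside each cluster, and Tc is taken as the last step at which the optimal cluster’s pooled success rate is not the highest. This is not the paper’s analytical setting; all reward probabilities are made up, and the larger-Δ instance should yield a smaller Tc:

```python
import math, random

def crossover_time(mu, clusters, opt_cluster, horizon=5000, seed=0):
    """Empirical crossover time Tc for the MEAN scheme: clusters are explored
    round-robin, UCB1 is run inside each cluster, and Tc is the last time at
    which the optimal cluster's pooled success rate is NOT the highest."""
    rng = random.Random(seed)
    pulls = {i: 0 for i in range(len(mu))}
    succ = {i: 0 for i in range(len(mu))}
    tc = 0
    for t in range(1, horizon + 1):
        arms = clusters[t % len(clusters)]              # round-robin over clusters
        unpulled = [i for i in arms if pulls[i] == 0]
        if unpulled:
            j = unpulled[0]
        else:                                           # UCB1 inside the cluster
            j = max(arms, key=lambda i:
                    succ[i] / pulls[i] + math.sqrt(2 * math.log(t) / pulls[i]))
        r = 1 if rng.random() < mu[j] else 0
        pulls[j] += 1; succ[j] += r
        means = [sum(succ[i] for i in a) / max(1, sum(pulls[i] for i in a))
                 for a in clusters]                     # MEAN estimate per cluster arm
        if max(range(len(clusters)), key=lambda c: means[c]) != opt_cluster:
            tc = t                                      # optimal cluster not yet on top
    return tc

# Larger separation Delta (all numbers illustrative) should give a smaller Tc.
print(crossover_time([0.5, 0.45, 0.2, 0.18], [[0, 1], [2, 3]], opt_cluster=0))
print(crossover_time([0.5, 0.45, 0.4, 0.38], [[0, 1], [2, 3]], opt_cluster=0))
```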
Experiments (effect of separation)
[Figure: reward over time; circles = MEAN, triangles = MAX, squares = independent bandits]
As Δ increases, Tc decreases and the reward is higher
Experiments (effect of size)
As Aopt increases, Tc increases and the reward is lower
Experiments (effect of cohesiveness)
As cohesiveness increases, Tc decreases and the reward is higher
Related Work
Typical multi-armed bandit problems: do not consider dependencies, and typically involve very few arms
Bandits with side information: cannot handle dependencies among arms
Active learning: emphasis is on the number of examples required to achieve a given prediction accuracy
Conclusions
We analyze bandits where dependencies are encapsulated within clusters
Discounted reward: the optimal policy is an index scheme on the clusters
Undiscounted reward: Two-Level Policy with the MEAN, MAX, and PMAX estimators
Analysis of the effect of cluster characteristics on performance, for MEAN
Discounted Reward
Create a belief-state MDP: each state contains the estimated reward probabilities for all arms
[Figure: from belief state (x1, x2, x3, x4), pulling arm 1 leads to different belief states depending on success or failure; the belief changes for both arms 1 and 2]
Solve for the optimal policy
Background: Bandits
Bandit “arms” with unknown payoff probabilities p1, p2, p3
Regret = optimal payoff − actual payoff
Reward probability of a “cluster arm”
What is the reward probability of a “cluster arm”?
Eventually, every “cluster arm” must converge to the reward probability μmax of the most rewarding arm within that cluster, since a bandit policy is used within each cluster
However, “drift” causes problems
Experiments Simulation based on one week’s worth of data from a large-scale ad-matching application 10 clusters, with 11.3 arms/cluster on average
Comparison of schemes
[Figure: reward over time for the schemes]
Setup: 10 clusters, 11.3 arms/cluster; cluster separation Δ = 0.08; cluster size Aopt = 31; cohesiveness = 0.75
MAX performs best
Reward probability drift causes problems
[Figure: cluster 2 is the optimal cluster, containing the best (optimal) arm with reward probability μopt]
Intuitively, to reduce regret, we must quickly converge to the optimal “cluster arm”, and then to the best arm within that cluster