Machine Learning & Data Mining (CS/CNS/EE 155)
Lecture 17: The Multi-Armed Bandit Problem
Announcements
– Tuesday's lecture will be the Course Review.
– The final should only take 4-5 hours to complete; we give you a 48-hour window for flexibility.
– Homework 2 is graded (we graded pretty leniently). Approximate grade breakdown: 64: A, 61: A-, 58: B+, 53: B, 50: B-, 47: C+, 42: C, 39: C-.
– Homework 3 will be graded soon.
Today
– The Multi-Armed Bandit Problem, and its extensions.
– There will be an advanced topics course on this material next year.
Recap: Supervised Learning
– Training Data: $S = \{(x_i, y_i)\}_{i=1}^N$
– Model Class: e.g., linear models $f(x \mid w, b) = w^\top x + b$
– Loss Function: e.g., squared loss $L(a, b) = (a - b)^2$
– Learning Objective: the optimization problem $\min_{w,b} \sum_{i=1}^{N} L\big(y_i, f(x_i \mid w, b)\big)$
But Labels are Expensive!
[Figure: example inputs x with labels y such as "Crystal", "Needle", "Empty", "Sports", "World News", "Science"]
– Inputs x: cheap and abundant!
– Labels y: expensive and scarce! (require human annotation or running an experiment)
Solution? Let's grab some labels!
– Label images
– Annotate webpages
– Rate movies
– Run experiments
– Etc.
How should we choose which examples to label?
Interactive Machine Learning
Start with unlabeled data. Then loop:
– select $x_i$
– receive feedback/label $y_i$
Questions: How do we measure cost? How do we define the goal?
Crowdsourcing
[Figure: a loop that repeatedly moves examples from an unlabeled pool into an initially empty labeled set by querying an annotator, e.g., for the label "Mushroom"]
Aside: Active Learning
[Figure: the same loop, but the learner chooses which unlabeled example to query next]
Goal: maximize accuracy with minimal labeling cost.
Passive Learning
[Figure: the same loop, but examples are chosen at random rather than by the learner]
Comparison with Passive Learning
– Conventional supervised learning is considered "passive" learning: the unlabeled training set is sampled according to the test distribution, so we label it at random.
– This is very expensive!
Aside: Active Learning
– Cost: uniform (e.g., each label costs $0.10)
– Goal: maximize accuracy of the trained model
– Key idea: the learner controls the distribution of labeled training data
Problems with Crowdsourcing
– Crowdsourcing assumes you can label by proxy, e.g., have someone else label objects in images.
– But sometimes you can't!
  – Personalized recommender systems: need to ask the user whether content is interesting.
  – Personalized medicine: need to try the treatment on the patient.
– These settings require feedback from the actual target domain.
Personalized Labels
[Figure: a real system repeatedly chooses an item (e.g., a Sports article) to show the end user, receives the user's feedback as the label, and adds it to an initially empty labeled set]
What is the cost here?
The Multi-Armed Bandit Problem
Formal Definition
– K actions/classes; each action k has an average reward $\mu_k$, unknown to us. Assume WLOG that $\mu_1$ is the largest.
– For t = 1…T:
  – Algorithm chooses action a(t)
  – Receives random reward y(t) with expectation $\mu_{a(t)}$
– Goal: minimize $T\mu_1 - \left(\mu_{a(1)} + \mu_{a(2)} + \dots + \mu_{a(T)}\right)$, i.e., the gap between the expected reward with perfect information ($T\mu_1$) and the expected reward of the algorithm.
– Basic setting: K classes, no features; the algorithm simultaneously predicts and receives labels.
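To make the protocol concrete, here is a minimal simulation sketch in Python (my own illustration, not from the lecture; the Bernoulli reward model and all names are assumptions):

```python
import numpy as np

class BernoulliBandit:
    """K-armed bandit: arm k pays 1 with (unknown) probability mu[k], else 0."""
    def __init__(self, mu, seed=0):
        self.mu = np.asarray(mu)
        self.rng = np.random.default_rng(seed)

    def pull(self, k):
        # Random reward y(t) with expectation mu[k].
        return float(self.rng.random() < self.mu[k])

def run(policy, bandit, T):
    """Play T rounds; return the regret T*mu_1 - sum_t mu_{a(t)}.

    `policy` is assumed to expose select(t) -> arm and update(arm, reward).
    """
    mu_star = bandit.mu.max()
    regret = 0.0
    for t in range(1, T + 1):
        k = policy.select(t)   # algorithm chooses action a(t)...
        y = bandit.pull(k)     # ...and only then observes its random reward
        policy.update(k, y)
        regret += mu_star - bandit.mu[k]
    return regret
```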
Interactive Personalization (5 Classes, No Features)
[Figure sequence: the system repeatedly recommends an article from one of five classes (e.g., Sports, Politics, World, Economy), observes whether the user likes it, and updates a per-class table of "# Shown" and "Average Likes"; the running count of likes grows from 0 to 2 over the first few rounds, and so on…]
What Should the Algorithm Recommend?
[Figure: per-class "# Shown" and "Average Likes" statistics after 24 total likes, annotated with which class to Exploit, which to Explore, and which is truly Best (among Politics, Economy, Celebrity)]
– How do we optimally balance the explore/exploit tradeoff?
– This tradeoff is characterized by the Multi-Armed Bandit Problem.
Regret
$$R(T) = T\mu_1 - \sum_{t=1}^{T} \mu_{a(t)}$$
– Regret is the opportunity cost of not knowing preferences in advance ($T$ is the time horizon).
– An algorithm is "no-regret" if $R(T)/T \to 0$; efficiency is measured by the rate of convergence.
Recap: The Multi-Armed Bandit Problem
– K actions/classes; each action has an average reward $\mu_k$, all unknown to us. Assume WLOG that $\mu_1$ is the largest.
– For t = 1…T: the algorithm chooses action a(t) and receives random reward y(t) with expectation $\mu_{a(t)}$.
– Goal: minimize the regret $T\mu_1 - (\mu_{a(1)} + \mu_{a(2)} + \dots + \mu_{a(T)})$.
– Basic setting: K classes, no features; the algorithm simultaneously predicts and receives labels.
The Motivating Problem
– A slot machine is a "one-armed bandit"; each arm has a different (unknown) payoff.
– Goal: minimize the regret incurred from pulling suboptimal arms.
Implications of Regret
– If R(T) grows linearly in T, then $R(T)/T \to c$ for some constant $c > 0$, i.e., we converge to predicting something suboptimal.
– If R(T) is sub-linear in T, then $R(T)/T \to 0$, i.e., we converge to predicting the optimal action.
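A quick numeric illustration of the contrast (a sketch; the constants 0.1 and 5 are made up):

```python
# Average regret R(T)/T for a linear vs. a sub-linear regret curve.
for T in (100, 10_000, 1_000_000):
    linear = 0.1 * T           # e.g., always playing a slightly suboptimal arm
    sublinear = 5 * T ** 0.5   # e.g., an O(sqrt(T))-regret algorithm
    print(T, linear / T, sublinear / T)
# linear/T stays at 0.1 forever; sublinear/T shrinks: 0.5, 0.05, 0.005.
```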
Experimental Design
– How should we split trials to collect information?
– Static experimental design (standard practice): the sequence of trials is pre-planned, e.g., Treatment, Placebo, Treatment, Placebo, Treatment, …
Sequential Experimental Design
– Adapt each experiment based on the outcomes so far, e.g., Treatment, Placebo, Treatment, …
Sequential Experimental Design Matters
Sequential Experimental Design
– The (basic) MAB problem models sequential experimental design!
– Each treatment has a hidden expected value; we need to run trials to gather information ("exploration").
– In hindsight, we should always have used the treatment with the highest expected value.
– Regret = the opportunity cost of exploration.
Online Advertising
– The largest use-case of multi-armed bandit problems.
The UCB1 Algorithm
Confidence Intervals
– Maintain a confidence interval for each action, often derived using Chernoff-Hoeffding bounds.
[Figure: per-action "# Shown" and "Average Likes" with intervals such as [0.1, 0.3] and [0.25, 0.55]; the interval for a never-shown action is undefined]
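For rewards in [0, 1], a Hoeffding-style interval can be computed as below (a sketch; the failure probability `delta` is an assumed parameter, not something the slide specifies):

```python
import math

def hoeffding_interval(mean, n, delta=0.05):
    """Interval containing the true mean w.p. >= 1 - delta, for rewards in [0, 1]."""
    if n == 0:
        return (0.0, 1.0)  # never shown: the interval is vacuous/undefined
    # P(|mean - mu| >= eps) <= 2 exp(-2 n eps^2); solve 2 exp(-2 n eps^2) = delta.
    half_width = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return (max(0.0, mean - half_width), min(1.0, mean + half_width))
```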
UCB1 Confidence Interval
$$UCB_k(t) = \hat{\mu}_k + \sqrt{\frac{2\ln t}{n_k}}$$
– $\hat{\mu}_k$: expected reward of action k, estimated from data (the "Average Likes" column).
– $t$: total iterations so far (70 in the example figure).
– $n_k$: #times action k was chosen (the "# Shown" column).
The UCB1 Algorithm
– At each iteration, play the arm with the highest upper confidence bound:
$$a(t) = \operatorname{argmax}_k \left( \hat{\mu}_k + \sqrt{\frac{2\ln t}{n_k}} \right)$$
Balancing Explore/Exploit
– "Optimism in the face of uncertainty": in $\hat{\mu}_k + \sqrt{2\ln t / n_k}$, the first term ($\hat{\mu}_k$) is the exploitation term and the second is the exploration term.
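A compact UCB1 implementation sketch, compatible with the `run` harness above (class and method names are my own):

```python
import math

class UCB1:
    def __init__(self, K):
        self.K = K
        self.n = [0] * K        # n_k: #times each arm was chosen
        self.mean = [0.0] * K   # mu_hat_k: empirical mean reward of arm k

    def select(self, t):
        # Play each arm once first (its confidence bound is otherwise undefined).
        for k in range(self.K):
            if self.n[k] == 0:
                return k
        # Then play the arm with the highest upper confidence bound.
        return max(range(self.K),
                   key=lambda k: self.mean[k] + math.sqrt(2 * math.log(t) / self.n[k]))

    def update(self, k, y):
        # Incremental update of the empirical mean for the chosen arm.
        self.n[k] += 1
        self.mean[k] += (y - self.mean[k]) / self.n[k]
```

For example, `run(UCB1(5), BernoulliBandit([0.2, 0.3, 0.5, 0.4, 0.1]), T=10_000)` should accumulate regret that grows only logarithmically in T.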
Analysis (Intuition)
With high probability (see the proof of Theorem 1 in [Auer et al., 2002]):
– The value of the best arm, $\mu_1$, is at most the upper confidence bound of the best arm, which in turn is at most the upper confidence bound of the arm UCB1 chooses at time t+1.
– The true value of the chosen arm is greater than its lower confidence bound.
– Together, these bound the regret at time t+1 by the width of the chosen arm's confidence interval.
[Figure: per-arm confidence intervals shrinking over time, shown at several iteration counts including 2,000 and 5,000]
How Often Sub-Optimal Arms Get Played
– An arm $k$ never gets selected (with high probability) once its upper confidence bound drops below the best arm's value: $\hat{\mu}_k + \sqrt{2\ln t / n_k} < \mu_1$.
– The number of times a sub-optimal arm $k$ is selected satisfies $n_k(T) \le \frac{8\ln T}{\Delta_k^2} + O(1)$, where $\Delta_k = \mu_1 - \mu_k$; prove using Hoeffding's Inequality (Theorem 1 in [Auer et al., 2002]).
– The $\ln T$ bound grows slowly with time, while the exploration term shrinks quickly with the number of trials.
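A condensed LaTeX sketch of the Hoeffding step (my own rendering of the standard argument, not copied from the slide):

```latex
% Hoeffding with eps = sqrt(2 ln t / n_k):
P\big(|\hat{\mu}_k - \mu_k| \ge \sqrt{\tfrac{2\ln t}{n_k}}\big)
  \le 2e^{-2 n_k \cdot (2\ln t / n_k)} = 2t^{-4}.
% On the event that every confidence interval holds, arm k can only be chosen while
\mu_k + 2\sqrt{\tfrac{2\ln t}{n_k}} \ge \mu_1
\quad\Longleftrightarrow\quad
n_k \le \frac{8\ln t}{\Delta_k^2}, \qquad \Delta_k = \mu_1 - \mu_k.
```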
Regret Guarantee
– With high probability, UCB1 accumulates regret at most
$$R(T) = O\!\left(\frac{K \ln T}{\epsilon}\right)$$
where K is the #actions, T is the time horizon, and $\epsilon = \mu_1 - \mu_2$ is the gap between the best and 2nd-best actions (Theorem 1 in [Auer et al., 2002]).
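Plugging illustrative numbers into the bound (constants dropped; purely a sanity check, not from the lecture):

```python
import math

K, eps = 5, 0.1                          # 5 actions, gap of 0.1
for T in (10_000, 1_000_000):
    bound = K * math.log(T) / eps        # O(K ln T / eps), constants dropped
    print(T, bound, bound / T)
# The bound grows only logarithmically in T, so bound / T -> 0.
```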
Recap: MAB & UCB1
– Interactive setting: the algorithm receives the reward/label while making predictions.
– Must balance explore/exploit.
– Sub-linear regret is good: average regret converges to 0.
Extensions
– Contextual Bandits: features of the environment
– Dependent-Arms Bandits: features of the actions/classes
– Dueling Bandits
– Combinatorial Bandits
– General Reinforcement Learning
Contextual Bandits
– K actions/classes; each action's reward depends on the context x: $\mu_k(x)$.
– For t = 1…T:
  – Algorithm receives context $x_t$
  – Algorithm chooses action a(t)
  – Receives random reward y(t) with expectation $\mu_{a(t)}(x_t)$
– Goal: minimize regret. The best class depends on the features; this is "bandit multiclass prediction".
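One simple (if naive) baseline for this setting is epsilon-greedy with per-context statistics; a tabular sketch assuming contexts are discrete (all names are illustrative, and this is not the lecture's algorithm):

```python
import random
from collections import defaultdict

class EpsilonGreedyContextual:
    """Tabular contextual bandit: keeps a reward estimate per (context, action)."""
    def __init__(self, K, epsilon=0.1):
        self.K, self.epsilon = K, epsilon
        self.n = defaultdict(int)       # (x, k) -> pull count
        self.mean = defaultdict(float)  # (x, k) -> empirical mean reward

    def select(self, x):
        if random.random() < self.epsilon:  # explore uniformly at random
            return random.randrange(self.K)
        # Exploit: best empirical arm for this particular context.
        return max(range(self.K), key=lambda k: self.mean[(x, k)])

    def update(self, x, k, y):
        self.n[(x, k)] += 1
        self.mean[(x, k)] += (y - self.mean[(x, k)]) / self.n[(x, k)]
```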
Linear Bandits
– K actions/classes; each action k has features $x_k$, and the reward function is linear: $\mu(x) = w^\top x$.
– For t = 1…T:
  – Algorithm chooses action a(t)
  – Receives random reward y(t) with expectation $\mu_{a(t)} = w^\top x_{a(t)}$
– Goal: regret scaling independent of K. Because of the linear dependence between arms, a label for one action shares information with the other actions.
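A LinUCB-style sketch for this setting (ridge-regression estimate of w plus an optimism bonus, following the general LinUCB recipe rather than anything stated on the slide; `alpha`, `lam`, and all names are assumptions):

```python
import numpy as np

class LinUCB:
    """Linear bandit: UCB on mu_k = w^T x_k via a shared ridge-regression model."""
    def __init__(self, arm_features, alpha=1.0, lam=1.0):
        self.X = np.asarray(arm_features)   # K x d matrix of arm features
        d = self.X.shape[1]
        self.alpha = alpha
        self.A = lam * np.eye(d)            # regularized Gram matrix
        self.b = np.zeros(d)

    def select(self, t):
        A_inv = np.linalg.inv(self.A)
        w_hat = A_inv @ self.b              # ridge estimate of w
        # Optimistic score: estimated reward + uncertainty in direction x_k.
        bonus = np.sqrt(np.einsum("kd,de,ke->k", self.X, A_inv, self.X))
        return int(np.argmax(self.X @ w_hat + self.alpha * bonus))

    def update(self, k, y):
        x = self.X[k]
        self.A += np.outer(x, x)
        self.b += y * x
```

The key design point: all arms share one model of w, so every observed reward shrinks the uncertainty for every arm, which is how the regret can avoid scaling with K.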
Example
– Treatment of spinal cord injury patients (studied by Joel Burdick's group; images from Yanan Sui).
– This is a multi-armed bandit problem with thousands of arms.
– The UCB1 regret bound scales with the #arms; we want a regret bound that scales independently of #arms, e.g., linearly in the dimensionality of the features x describing the arms.
Dueling Bandits
– K actions/classes with a pairwise preference model $P(a_k \succ a_{k'})$.
– For t = 1…T:
  – Algorithm chooses a pair of actions a(t) & b(t)
  – Receives random reward y(t) with expectation $P(a(t) \succ b(t))$
– Goal: low regret despite only pairwise feedback; we can only measure pairwise preferences.
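The feedback model itself is easy to simulate; a minimal sketch of the environment side (the preference matrix P and all names are illustrative assumptions):

```python
import numpy as np

class DuelingBandit:
    """Environment that only reveals noisy pairwise comparisons."""
    def __init__(self, P, seed=0):
        self.P = np.asarray(P)   # P[i, j] = probability arm i beats arm j
        self.rng = np.random.default_rng(seed)

    def duel(self, i, j):
        # Returns 1 if arm i wins this comparison, else 0.
        return int(self.rng.random() < self.P[i, j])

# e.g., 3 arms where arm 0 is best:
# P = [[0.5, 0.7, 0.8],
#      [0.3, 0.5, 0.6],
#      [0.2, 0.4, 0.5]]
```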
Example in Sensory Testing
– (Hypothetical) taste experiment: Pepsi vs. a competing drink, in a natural usage context.
– Experiment 1: absolute metrics (how many cans of each drink does each subject consume?).
[Figure: per-subject can counts such as 3 cans, 2 cans, 1 can, 5 cans, 3 cans; totals of 8 cans vs. 9 cans; one subject was very thirsty!]
Example in Sensory Testing
– Same (hypothetical) taste experiment.
– Experiment 2: relative metrics (which drink does each subject prefer?).
[Figure: all 6 subjects prefer Pepsi]
– Pairwise preferences are far less noisy than absolute consumption counts.
Example Revisited
– Treatment of spinal cord injury patients (studied by Joel Burdick's group; images from Yanan Sui).
– Patients cannot reliably rate individual treatments, but they can reliably compare pairs of treatments.
– This is a dueling bandits problem!
Combinatorial Bandits
– Sometimes, actions must be selected from a combinatorial action space, e.g., shortest-path problems with unknown costs on the edges (aka routing under uncertainty).
– If you knew all the parameters of the model, this would be a standard optimization problem.
General Reinforcement Learning
– The bandit setting assumes actions do not affect the state of the world, e.g., the sequence of experiments (Treatment, Placebo, Treatment, …) does not affect the distribution of future trials.
– In general reinforcement learning, actions do affect the world.
Markov Decision Process
– M states, K actions; the reward $\mu(s, a)$ depends on the state.
– For t = 1…T:
  – Algorithm (approximately) observes the current state $s_t$, which depends on the previous state & action taken
  – Algorithm chooses action a(t)
  – Receives random reward y(t) with expectation $\mu(s_t, a(t))$
– Example: personalized tutoring [Emma Brunskill et al.]
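The MDP interaction loop differs from the bandit loop only in that the environment carries state. A minimal sketch, using Q-learning as the update rule (which goes beyond what the slide specifies; the `env` interface and all constants are my own assumptions):

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """env is assumed to expose reset() -> s, step(s, a) -> (s', y, done),
    and actions(s) -> list of available actions."""
    Q = defaultdict(float)  # (state, action) -> estimated long-term value
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            acts = env.actions(s)
            a = random.choice(acts) if random.random() < epsilon \
                else max(acts, key=lambda a: Q[(s, a)])
            s2, y, done = env.step(s, a)
            # Unlike the bandit update, the target includes the value of s',
            # because today's action changes tomorrow's state.
            target = y if done else y + gamma * max(Q[(s2, a2)] for a2 in env.actions(s2))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```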
Summary
– Interactive machine learning: the multi-armed bandit problem, the basic UCB1 result, and a survey of extensions.
– There will be an Advanced Topics in ML course next year.
– Next lecture: course review. Bring your questions!