Machine Learning & Data Mining (CS/CNS/EE 155)
Lecture 17: The Multi-Armed Bandit Problem
Announcements
– Tuesday's lecture will be the Course Review.
– The final should only take 4-5 hours to complete; we give you a 48-hour window for flexibility.
– Homework 2 is graded (we graded pretty leniently). Approximate grade breakdown: 64: A, 61: A-, 58: B+, 53: B, 50: B-, 47: C+, 42: C, 39: C-.
– Homework 3 will be graded soon.
Today
– The Multi-Armed Bandit Problem, and its extensions.
– There will be an advanced topics course on this material next year.
Recap: Supervised Learning
– Training Data: $S = \{(x_i, y_i)\}_{i=1}^N$
– Model Class: e.g., linear models $f(x \mid w, b) = w^\top x + b$
– Loss Function: e.g., squared loss $L(a, b) = (a - b)^2$
– Learning Objective: the optimization problem $\min_{w,b} \sum_{i=1}^{N} L\big(y_i, f(x_i \mid w, b)\big)$
But Labels are Expensive!
[Figure: example inputs x with labels y such as "Crystal", "Needle", "Empty", "Sports", "World News", "Science"]
– Inputs x: cheap and abundant!
– Labels y: expensive and scarce! (require human annotation or running an experiment)
Solution? Let's grab some labels!
– Label images
– Annotate webpages
– Rate movies
– Run experiments
– Etc.
How should we choose which examples to label?
Interactive Machine Learning
Start with unlabeled data. Then loop:
– select $x_i$
– receive feedback/label $y_i$
Questions: How do we measure cost? How do we define the goal?
Crowdsourcing
[Figure: a loop that repeatedly moves examples from an unlabeled pool into an initially empty labeled set by querying an annotator, e.g., for the label "Mushroom"]
Aside: Active Learning
[Figure: the same loop, but the learner chooses which unlabeled example to query next]
Goal: maximize accuracy with minimal labeling cost.
Passive Learning
[Figure: the same loop, but examples are chosen at random rather than by the learner]
Comparison with Passive Learning
– Conventional supervised learning is considered "passive" learning: the unlabeled training set is sampled according to the test distribution, so we label it at random.
– This is very expensive!
Aside: Active Learning
– Cost: uniform (e.g., each label costs $0.10)
– Goal: maximize accuracy of the trained model
– Key idea: the learner controls the distribution of labeled training data
Problems with Crowdsourcing
– Crowdsourcing assumes you can label by proxy, e.g., have someone else label objects in images.
– But sometimes you can't!
  – Personalized recommender systems: need to ask the user whether content is interesting.
  – Personalized medicine: need to try the treatment on the patient.
– These settings require feedback from the actual target domain.
Personalized Labels
[Figure: a real system repeatedly chooses an item (e.g., a Sports article) to show the end user, receives the user's feedback as the label, and adds it to an initially empty labeled set]
What is the cost here?
The Multi-Armed Bandit Problem
Formal Definition
– K actions/classes; each action k has an average reward $\mu_k$, unknown to us. Assume WLOG that $\mu_1$ is the largest.
– For t = 1…T:
  – Algorithm chooses action a(t)
  – Receives random reward y(t) with expectation $\mu_{a(t)}$
– Goal: minimize $T\mu_1 - \left(\mu_{a(1)} + \mu_{a(2)} + \dots + \mu_{a(T)}\right)$, i.e., the gap between the expected reward with perfect information ($T\mu_1$) and the expected reward of the algorithm.
– Basic setting: K classes, no features; the algorithm simultaneously predicts and receives labels.
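To make the protocol concrete, here is a minimal simulation sketch in Python (my own illustration, not from the lecture; the Bernoulli reward model and all names are assumptions):

```python
import numpy as np

class BernoulliBandit:
    """K-armed bandit: arm k pays 1 with (unknown) probability mu[k], else 0."""
    def __init__(self, mu, seed=0):
        self.mu = np.asarray(mu)
        self.rng = np.random.default_rng(seed)

    def pull(self, k):
        # Random reward y(t) with expectation mu[k].
        return float(self.rng.random() < self.mu[k])

def run(policy, bandit, T):
    """Play T rounds; return the regret T*mu_1 - sum_t mu_{a(t)}.

    `policy` is assumed to expose select(t) -> arm and update(arm, reward).
    """
    mu_star = bandit.mu.max()
    regret = 0.0
    for t in range(1, T + 1):
        k = policy.select(t)   # algorithm chooses action a(t)...
        y = bandit.pull(k)     # ...and only then observes its random reward
        policy.update(k, y)
        regret += mu_star - bandit.mu[k]
    return regret
```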
Interactive Personalization (5 Classes, No Features)
[Figure sequence: the system repeatedly recommends an article from one of five classes (e.g., Sports, Politics, World, Economy), observes whether the user likes it, and updates a per-class table of "# Shown" and "Average Likes"; the running count of likes grows from 0 to 2 over the first few rounds, and so on…]
What Should the Algorithm Recommend?
[Figure: per-class "# Shown" and "Average Likes" statistics after 24 total likes, annotated with which class to Exploit, which to Explore, and which is truly Best (among Politics, Economy, Celebrity)]
– How do we optimally balance the explore/exploit tradeoff?
– This tradeoff is characterized by the Multi-Armed Bandit Problem.
Regret
$$R(T) = T\mu_1 - \sum_{t=1}^{T} \mu_{a(t)}$$
– Regret is the opportunity cost of not knowing preferences in advance ($T$ is the time horizon).
– An algorithm is "no-regret" if $R(T)/T \to 0$; efficiency is measured by the rate of convergence.
Recap: The Multi-Armed Bandit Problem
– K actions/classes; each action has an average reward $\mu_k$, all unknown to us. Assume WLOG that $\mu_1$ is the largest.
– For t = 1…T: the algorithm chooses action a(t) and receives random reward y(t) with expectation $\mu_{a(t)}$.
– Goal: minimize the regret $T\mu_1 - (\mu_{a(1)} + \mu_{a(2)} + \dots + \mu_{a(T)})$.
– Basic setting: K classes, no features; the algorithm simultaneously predicts and receives labels.
The Motivating Problem
– A slot machine is a "one-armed bandit"; each arm has a different (unknown) payoff.
– Goal: minimize the regret incurred from pulling suboptimal arms.
Implications of Regret
– If R(T) grows linearly in T, then $R(T)/T \to c$ for some constant $c > 0$, i.e., we converge to predicting something suboptimal.
– If R(T) is sub-linear in T, then $R(T)/T \to 0$, i.e., we converge to predicting the optimal action.
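A quick numeric illustration of the contrast (a sketch; the constants 0.1 and 5 are made up):

```python
# Average regret R(T)/T for a linear vs. a sub-linear regret curve.
for T in (100, 10_000, 1_000_000):
    linear = 0.1 * T           # e.g., always playing a slightly suboptimal arm
    sublinear = 5 * T ** 0.5   # e.g., an O(sqrt(T))-regret algorithm
    print(T, linear / T, sublinear / T)
# linear/T stays at 0.1 forever; sublinear/T shrinks: 0.5, 0.05, 0.005.
```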
Experimental Design
– How should we split trials to collect information?
– Static experimental design (standard practice): the sequence of trials is pre-planned, e.g., Treatment, Placebo, Treatment, Placebo, Treatment, …
Sequential Experimental Design
– Adapt each experiment based on the outcomes so far, e.g., Treatment, Placebo, Treatment, …
Sequential Experimental Design Matters
Sequential Experimental Design
– The (basic) MAB problem models sequential experimental design!
– Each treatment has a hidden expected value; we need to run trials to gather information ("exploration").
– In hindsight, we should always have used the treatment with the highest expected value.
– Regret = the opportunity cost of exploration.
Online Advertising
– The largest use-case of multi-armed bandit problems.
The UCB1 Algorithm
Confidence Intervals
– Maintain a confidence interval for each action, often derived using Chernoff-Hoeffding bounds.
[Figure: per-action "# Shown" and "Average Likes" with intervals such as [0.1, 0.3] and [0.25, 0.55]; the interval for a never-shown action is undefined]
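For rewards in [0, 1], a Hoeffding-style interval can be computed as below (a sketch; the failure probability `delta` is an assumed parameter, not something the slide specifies):

```python
import math

def hoeffding_interval(mean, n, delta=0.05):
    """Interval containing the true mean w.p. >= 1 - delta, for rewards in [0, 1]."""
    if n == 0:
        return (0.0, 1.0)  # never shown: the interval is vacuous/undefined
    # P(|mean - mu| >= eps) <= 2 exp(-2 n eps^2); solve 2 exp(-2 n eps^2) = delta.
    half_width = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return (max(0.0, mean - half_width), min(1.0, mean + half_width))
```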
UCB1 Confidence Interval
$$UCB_k(t) = \hat{\mu}_k + \sqrt{\frac{2\ln t}{n_k}}$$
– $\hat{\mu}_k$: expected reward of action k, estimated from data (the "Average Likes" column).
– $t$: total iterations so far (70 in the example figure).
– $n_k$: #times action k was chosen (the "# Shown" column).
The UCB1 Algorithm
– At each iteration, play the arm with the highest upper confidence bound:
$$a(t) = \operatorname{argmax}_k \left( \hat{\mu}_k + \sqrt{\frac{2\ln t}{n_k}} \right)$$
Balancing Explore/Exploit
– "Optimism in the face of uncertainty": in $\hat{\mu}_k + \sqrt{2\ln t / n_k}$, the first term ($\hat{\mu}_k$) is the exploitation term and the second is the exploration term.
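A compact UCB1 implementation sketch, compatible with the `run` harness above (class and method names are my own):

```python
import math

class UCB1:
    def __init__(self, K):
        self.K = K
        self.n = [0] * K        # n_k: #times each arm was chosen
        self.mean = [0.0] * K   # mu_hat_k: empirical mean reward of arm k

    def select(self, t):
        # Play each arm once first (its confidence bound is otherwise undefined).
        for k in range(self.K):
            if self.n[k] == 0:
                return k
        # Then play the arm with the highest upper confidence bound.
        return max(range(self.K),
                   key=lambda k: self.mean[k] + math.sqrt(2 * math.log(t) / self.n[k]))

    def update(self, k, y):
        # Incremental update of the empirical mean for the chosen arm.
        self.n[k] += 1
        self.mean[k] += (y - self.mean[k]) / self.n[k]
```

For example, `run(UCB1(5), BernoulliBandit([0.2, 0.3, 0.5, 0.4, 0.1]), T=10_000)` should accumulate regret that grows only logarithmically in T.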
Analysis (Intuition)
With high probability (see the proof of Theorem 1 in [Auer et al., 2002]):
– The value of the best arm, $\mu_1$, is at most the upper confidence bound of the best arm, which in turn is at most the upper confidence bound of the arm UCB1 chooses at time t+1.
– The true value of the chosen arm is greater than its lower confidence bound.
– Together, these bound the regret at time t+1 by the width of the chosen arm's confidence interval.
[Figure: per-arm confidence intervals shrinking over time, shown at several iteration counts including 2,000 and 5,000]
How Often Sub-Optimal Arms Get Played
– An arm $k$ never gets selected (with high probability) once its upper confidence bound drops below the best arm's value: $\hat{\mu}_k + \sqrt{2\ln t / n_k} < \mu_1$.
– The number of times a sub-optimal arm $k$ is selected satisfies $n_k(T) \le \frac{8\ln T}{\Delta_k^2} + O(1)$, where $\Delta_k = \mu_1 - \mu_k$; prove using Hoeffding's Inequality (Theorem 1 in [Auer et al., 2002]).
– The $\ln T$ bound grows slowly with time, while the exploration term shrinks quickly with the number of trials.
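A condensed LaTeX sketch of the Hoeffding step (my own rendering of the standard argument, not copied from the slide):

```latex
% Hoeffding with eps = sqrt(2 ln t / n_k):
P\big(|\hat{\mu}_k - \mu_k| \ge \sqrt{\tfrac{2\ln t}{n_k}}\big)
  \le 2e^{-2 n_k \cdot (2\ln t / n_k)} = 2t^{-4}.
% On the event that every confidence interval holds, arm k can only be chosen while
\mu_k + 2\sqrt{\tfrac{2\ln t}{n_k}} \ge \mu_1
\quad\Longleftrightarrow\quad
n_k \le \frac{8\ln t}{\Delta_k^2}, \qquad \Delta_k = \mu_1 - \mu_k.
```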
Regret Guarantee
– With high probability, UCB1 accumulates regret at most
$$R(T) = O\!\left(\frac{K \ln T}{\epsilon}\right)$$
where K is the #actions, T is the time horizon, and $\epsilon = \mu_1 - \mu_2$ is the gap between the best and 2nd-best actions (Theorem 1 in [Auer et al., 2002]).
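Plugging illustrative numbers into the bound (constants dropped; purely a sanity check, not from the lecture):

```python
import math

K, eps = 5, 0.1                          # 5 actions, gap of 0.1
for T in (10_000, 1_000_000):
    bound = K * math.log(T) / eps        # O(K ln T / eps), constants dropped
    print(T, bound, bound / T)
# The bound grows only logarithmically in T, so bound / T -> 0.
```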
Recap: MAB & UCB1
– Interactive setting: the algorithm receives the reward/label while making predictions.
– Must balance explore/exploit.
– Sub-linear regret is good: average regret converges to 0.
Extensions
– Contextual Bandits: features of the environment
– Dependent-Arms Bandits: features of the actions/classes
– Dueling Bandits
– Combinatorial Bandits
– General Reinforcement Learning
Contextual Bandits
– K actions/classes; each action's reward depends on the context x: $\mu_k(x)$.
– For t = 1…T:
  – Algorithm receives context $x_t$
  – Algorithm chooses action a(t)
  – Receives random reward y(t) with expectation $\mu_{a(t)}(x_t)$
– Goal: minimize regret. The best class depends on the features; this is "bandit multiclass prediction".
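One simple (if naive) baseline for this setting is epsilon-greedy with per-context statistics; a tabular sketch assuming contexts are discrete (all names are illustrative, and this is not the lecture's algorithm):

```python
import random
from collections import defaultdict

class EpsilonGreedyContextual:
    """Tabular contextual bandit: keeps a reward estimate per (context, action)."""
    def __init__(self, K, epsilon=0.1):
        self.K, self.epsilon = K, epsilon
        self.n = defaultdict(int)       # (x, k) -> pull count
        self.mean = defaultdict(float)  # (x, k) -> empirical mean reward

    def select(self, x):
        if random.random() < self.epsilon:  # explore uniformly at random
            return random.randrange(self.K)
        # Exploit: best empirical arm for this particular context.
        return max(range(self.K), key=lambda k: self.mean[(x, k)])

    def update(self, x, k, y):
        self.n[(x, k)] += 1
        self.mean[(x, k)] += (y - self.mean[(x, k)]) / self.n[(x, k)]
```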
Linear Bandits
– K actions/classes; each action k has features $x_k$, and the reward function is linear: $\mu(x) = w^\top x$.
– For t = 1…T:
  – Algorithm chooses action a(t)
  – Receives random reward y(t) with expectation $\mu_{a(t)} = w^\top x_{a(t)}$
– Goal: regret scaling independent of K. Because of the linear dependence between arms, a label for one action shares information with the other actions.
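A LinUCB-style sketch for this setting (ridge-regression estimate of w plus an optimism bonus, following the general LinUCB recipe rather than anything stated on the slide; `alpha`, `lam`, and all names are assumptions):

```python
import numpy as np

class LinUCB:
    """Linear bandit: UCB on mu_k = w^T x_k via a shared ridge-regression model."""
    def __init__(self, arm_features, alpha=1.0, lam=1.0):
        self.X = np.asarray(arm_features)   # K x d matrix of arm features
        d = self.X.shape[1]
        self.alpha = alpha
        self.A = lam * np.eye(d)            # regularized Gram matrix
        self.b = np.zeros(d)

    def select(self, t):
        A_inv = np.linalg.inv(self.A)
        w_hat = A_inv @ self.b              # ridge estimate of w
        # Optimistic score: estimated reward + uncertainty in direction x_k.
        bonus = np.sqrt(np.einsum("kd,de,ke->k", self.X, A_inv, self.X))
        return int(np.argmax(self.X @ w_hat + self.alpha * bonus))

    def update(self, k, y):
        x = self.X[k]
        self.A += np.outer(x, x)
        self.b += y * x
```

The key design point: all arms share one model of w, so every observed reward shrinks the uncertainty for every arm, which is how the regret can avoid scaling with K.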
Example
– Treatment of spinal cord injury patients (studied by Joel Burdick's group; images from Yanan Sui).
– This is a multi-armed bandit problem with thousands of arms.
– The UCB1 regret bound scales with the #arms; we want a regret bound that scales independently of #arms, e.g., linearly in the dimensionality of the features x describing the arms.
Dueling Bandits
– K actions/classes with a pairwise preference model $P(a_k \succ a_{k'})$.
– For t = 1…T:
  – Algorithm chooses a pair of actions a(t) & b(t)
  – Receives random reward y(t) with expectation $P(a(t) \succ b(t))$
– Goal: low regret despite only pairwise feedback; we can only measure pairwise preferences.
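The feedback model itself is easy to simulate; a minimal sketch of the environment side (the preference matrix P and all names are illustrative assumptions):

```python
import numpy as np

class DuelingBandit:
    """Environment that only reveals noisy pairwise comparisons."""
    def __init__(self, P, seed=0):
        self.P = np.asarray(P)   # P[i, j] = probability arm i beats arm j
        self.rng = np.random.default_rng(seed)

    def duel(self, i, j):
        # Returns 1 if arm i wins this comparison, else 0.
        return int(self.rng.random() < self.P[i, j])

# e.g., 3 arms where arm 0 is best:
# P = [[0.5, 0.7, 0.8],
#      [0.3, 0.5, 0.6],
#      [0.2, 0.4, 0.5]]
```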
Example in Sensory Testing
– (Hypothetical) taste experiment: Pepsi vs. a competing drink, in a natural usage context.
– Experiment 1: absolute metrics (how many cans of each drink does each subject consume?).
[Figure: per-subject can counts such as 3 cans, 2 cans, 1 can, 5 cans, 3 cans; totals of 8 cans vs. 9 cans; one subject was very thirsty!]
Example in Sensory Testing
– Same (hypothetical) taste experiment.
– Experiment 2: relative metrics (which drink does each subject prefer?).
[Figure: all 6 subjects prefer Pepsi]
– Pairwise preferences are far less noisy than absolute consumption counts.
Example Revisited
– Treatment of spinal cord injury patients (studied by Joel Burdick's group; images from Yanan Sui).
– Patients cannot reliably rate individual treatments, but they can reliably compare pairs of treatments.
– This is a dueling bandits problem!
Combinatorial Bandits
– Sometimes, actions must be selected from a combinatorial action space, e.g., shortest-path problems with unknown costs on the edges (aka routing under uncertainty).
– If you knew all the parameters of the model, this would be a standard optimization problem.
General Reinforcement Learning
– The bandit setting assumes actions do not affect the state of the world, e.g., the sequence of experiments (Treatment, Placebo, Treatment, …) does not affect the distribution of future trials.
– In general reinforcement learning, actions do affect the world.
Markov Decision Process
– M states, K actions; the reward $\mu(s, a)$ depends on the state.
– For t = 1…T:
  – Algorithm (approximately) observes the current state $s_t$, which depends on the previous state & action taken
  – Algorithm chooses action a(t)
  – Receives random reward y(t) with expectation $\mu(s_t, a(t))$
– Example: personalized tutoring [Emma Brunskill et al.]
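The MDP interaction loop differs from the bandit loop only in that the environment carries state. A minimal sketch, using Q-learning as the update rule (which goes beyond what the slide specifies; the `env` interface and all constants are my own assumptions):

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """env is assumed to expose reset() -> s, step(s, a) -> (s', y, done),
    and actions(s) -> list of available actions."""
    Q = defaultdict(float)  # (state, action) -> estimated long-term value
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            acts = env.actions(s)
            a = random.choice(acts) if random.random() < epsilon \
                else max(acts, key=lambda a: Q[(s, a)])
            s2, y, done = env.step(s, a)
            # Unlike the bandit update, the target includes the value of s',
            # because today's action changes tomorrow's state.
            target = y if done else y + gamma * max(Q[(s2, a2)] for a2 in env.actions(s2))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```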
Summary
– Interactive machine learning: the multi-armed bandit problem, the basic UCB1 result, and a survey of extensions.
– There will be an Advanced Topics in ML course next year.
– Next lecture: course review. Bring your questions!