COMP 2208 Dr. Long Tran-Thanh University of Southampton Bandits

Decision making
The agent loop: Perception (categorise inputs, update the belief model), Decision making (update the decision-making policy), and Behaviour, all interacting with the Environment.

Sequential decision making
The same Environment – Perception – Decision making – Behaviour/Action loop, but now we make decisions repeatedly: one decision per round, under uncertainty (the outcome is not known in advance and is noisy).

The story of unlucky bandits

…who decided to outsource the robberies
The Good, The Bad, and The Ugly (and lazy).
What we don't know: the expected reward of each bandit.

The Good, the Bad, and the Ugly
Objective: maximise the total expected reward via repeated robbing.
At each round, we can only hire one guy.
The reward per round of each bandit is noisy (i.e., it is a random variable), and we don't know the expected values.
Question: whom should we hire at each round?

How to estimate the expected rewards?
Exploration: try each bandit repeatedly and estimate its expected reward by the average of the observed rewards (e.g., a sample average of 29.5 after several pulls).
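A minimal sketch of this sample-average estimation, assuming a hypothetical bandit whose rewards are Gaussian with a made-up mean of 30 (not a value from the slides):

```python
# Estimate an expected reward by the running sample average of noisy observations.
import random

random.seed(0)

true_mean = 30.0          # hypothetical expected reward, hidden in practice
estimate, n = 0.0, 0

for _ in range(1000):
    reward = random.gauss(true_mean, 5.0)   # noisy observed reward
    n += 1
    estimate += (reward - estimate) / n     # incremental average update

print(f"estimated expected reward: {estimate:.2f}")  # should be close to 30.0
```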

Time limits
There is a limit on the number of rounds, so we want to find the best guy as soon as possible.
Exploitation: always choose the one with the highest current average reward.

Exploration versus exploitation
Too much focus on exploration: Pro: we can accurately estimate each guy's expected performance. Con: it takes too much time.
Too much focus on exploitation: Pro: we focus on maximising the expected total reward. Con: we might miss the chance to identify the truly best guy.

Exploration versus exploitation
Key challenge: how to efficiently balance exploration and exploitation (the exploration vs. exploitation dilemma).
Next: the multi-armed bandit model.

One-armed bandit: the reward value is drawn from an unknown distribution.

The multi-armed bandit (MAB) model
There are multiple arms. At each time step (round) we choose one arm to pull and receive a reward drawn from that arm's unknown distribution.
Objective: maximise the expected total reward.
Exploration: we want to learn each arm's expected reward value.
Exploitation: we want to maximise the sum of the rewards received.
MAB is the simplest model that captures the exploration vs. exploitation dilemma.
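A minimal sketch of the MAB interaction loop. The arm means, reward noise, and the uniformly random placeholder policy are all assumptions made for illustration:

```python
# MAB interaction loop: pick an arm each round, receive a noisy reward.
import random

random.seed(1)
true_means = [1.0, 2.5, 1.8]      # hidden from the decision maker (illustration only)
T = 100                           # number of rounds

def pull(arm):
    """Draw a noisy reward from the unknown distribution of `arm`."""
    return random.gauss(true_means[arm], 1.0)

total_reward = 0.0
for t in range(T):
    arm = random.randrange(len(true_means))   # placeholder policy: pick an arm at random
    total_reward += pull(arm)

print(f"total reward over {T} rounds: {total_reward:.1f}")
```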

MAB: a sequential decision making framework
The same Environment – Perception – Decision making – Behaviour/Action loop: an arm corresponds to an option, and pulling an arm corresponds to choosing that option.

How to solve MAB problems?
We don't know the expected reward values at the beginning, but we can learn them through exploration. However, this means we have to pull arms that are not optimal (in order to learn that they are not optimal). Hence we cannot achieve the optimal solution, which would be to pull the optimal arm (the one with the highest expected reward value) all the time. Our goal: design algorithms that get as close to the optimum as possible (i.e., a good approximation).

The epsilon-first approach
Suppose we know the number of rounds beforehand: we can pull the arms T times.
Epsilon-first:
Choose an epsilon value, 0 < epsilon < 1 (typically between 0.05 and 0.2).
In the first epsilon*T rounds, we only explore: we pull all the arms in a round-robin manner.
After the first epsilon*T rounds, we choose the arm with the highest average reward value and pull only this arm for the remaining (1-epsilon)*T rounds.
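A minimal epsilon-first sketch. The arm means, noise model, and the chosen T and epsilon are assumptions for illustration only:

```python
# Epsilon-first: explore round-robin for epsilon*T rounds, then commit to the best average.
import random

random.seed(2)
true_means = [1.0, 2.5, 1.8]              # hidden expected rewards (illustration only)
K, T, epsilon = len(true_means), 1000, 0.1

pull = lambda arm: random.gauss(true_means[arm], 1.0)

counts, sums, total = [0] * K, [0.0] * K, 0.0

# Exploration phase: first epsilon*T rounds, round-robin over the arms.
explore_rounds = int(epsilon * T)
for t in range(explore_rounds):
    arm = t % K
    r = pull(arm)
    counts[arm] += 1; sums[arm] += r; total += r

# Exploitation phase: commit to the arm with the highest average reward.
best = max(range(K), key=lambda a: sums[a] / counts[a])
for t in range(T - explore_rounds):
    total += pull(best)

print(f"committed to arm {best}, total reward {total:.1f}")
```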

The epsilon-greedy approach
Epsilon-greedy:
Choose an epsilon value, 0 < epsilon < 1 (typically between 0.05 and 0.2).
At each round, pull the arm with the current best average reward value with probability (1-epsilon), or pull an arbitrary other arm with probability epsilon.
Repeat this for every round.
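A minimal epsilon-greedy sketch on the same hypothetical Gaussian arms. For simplicity this version explores uniformly over all arms (rather than excluding the current best) and pulls each arm once at the start to initialise the averages:

```python
# Epsilon-greedy: exploit the best current average w.p. 1-epsilon, explore w.p. epsilon.
import random

random.seed(3)
true_means = [1.0, 2.5, 1.8]              # hidden expected rewards (illustration only)
K, T, epsilon = len(true_means), 1000, 0.1

pull = lambda arm: random.gauss(true_means[arm], 1.0)

counts, averages, total = [0] * K, [0.0] * K, 0.0

for t in range(T):
    if t < K:
        arm = t                                              # pull each arm once to initialise
    elif random.random() < epsilon:
        arm = random.randrange(K)                            # explore: pick an arm at random
    else:
        arm = max(range(K), key=lambda a: averages[a])       # exploit the current best average
    r = pull(arm)
    counts[arm] += 1
    averages[arm] += (r - averages[arm]) / counts[arm]       # incremental average update
    total += r

print(f"final averages: {[round(a, 2) for a in averages]}, total reward {total:.1f}")
```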

Epsilon-first vs. epsilon-greedy
Epsilon-first: explicitly separates exploration from exploitation. Observation: typically very good when T is small. Drawback: we need to know T in advance.
Epsilon-greedy: exploration and exploitation are interleaved. Observation: typically efficient when T is sufficiently large. Drawback: slow convergence at the beginning (especially with a small epsilon).
A further drawback: both approaches are sensitive to the chosen value of epsilon.

Other algorithms
UCB (Upper Confidence Bound): combines exploration and exploitation within each single round in a very clever way.
More advanced approaches:
Thompson sampling: maintain a belief distribution over the true expected reward of each arm, updated with Bayes' theorem. At each round, randomly sample from each of these beliefs and choose the arm with the highest sample.
…and many others.
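A minimal UCB1-style sketch (the standard "average plus exploration bonus" rule, not code from the lecture), again on hypothetical Gaussian arms:

```python
# UCB1: each round, pull the arm with the highest average + sqrt(2*ln(t)/n) bonus.
import math
import random

random.seed(4)
true_means = [1.0, 2.5, 1.8]      # hidden expected rewards (illustration only)
K, T = len(true_means), 1000

pull = lambda arm: random.gauss(true_means[arm], 1.0)

counts, averages, total = [0] * K, [0.0] * K, 0.0

for t in range(1, T + 1):
    if t <= K:
        arm = t - 1                                   # pull each arm once first
    else:
        # Upper confidence bound: sample average + exploration bonus.
        ucb = [averages[a] + math.sqrt(2 * math.log(t) / counts[a]) for a in range(K)]
        arm = max(range(K), key=lambda a: ucb[a])
    r = pull(arm)
    counts[arm] += 1
    averages[arm] += (r - averages[arm]) / counts[arm]
    total += r

print(f"pull counts: {counts}, total reward {total:.1f}")
```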

Application 1: Disaster response
The ORCHID project: Prof. Nick Jennings, Dr. Sarvapali (Gopal) Ramchurn.

Application 1: Disaster response
Decentralised coordination of UAVs.
Australian Centre for Field Robotics: builds the UAVs and implements the software.
University of Southampton: develops algorithms for decentralised coordination (max-sum).
However, no uncertainty is included in this approach.

Application 1: Disaster response
In reality we have noisy observations and unknown outcomes.
Multi-armed bandit model: each area to observe = an arm; the information value of observing an area = the reward.
A bandit algorithm combined with a decentralised task allocation protocol works well.

Application 2: Crowdsourcing systems
Idea: combine machine intelligence with the mass of human intelligence.
Examples: reCAPTCHA (Luis von Ahn); used at Oxford, Harvard, Microsoft, etc.

Application 2: Crowdsourcing systems
Expert crowdsourcing: an employer has a set of translation tasks and assigns them to candidate workers.
The employer pays a worker-specific cost (e.g., $5/hr) and receives a utility value from each worker (e.g., hr/doc).
Objective: maximise the total number of translated documents, limited by a budget B = $500.
Trade-off: cheap but low-utility workers vs. high-utility but expensive workers.

Application 2: Crowdsourcing systems
Crowdsourced text corrector example: given the sentence "Markov Decision Processes are one of the most wide used frameworks to formulate probabilistic planning problems.", a worker is asked to find the error ("wide used"), propose a fix ("widely used" or "wide-used"), and verify it (YES/NO).

Common problem: task allocation in crowdsourcing systems.
We can tackle this with a multi-armed bandit model: a task assignment (which task, to whom) = pulling an arm; the quality of the outcome of the assignment = the reward value.

Some extensions of the multi-armed bandit model: best-arm bandits, dueling bandits, budget-limited bandits.

Best-arm bandits
We aim to identify the best arm. Pure learning: only exploration, no exploitation.
There are some nice algorithms with good theoretical and practical performance (see e.g. Sebastien Bubeck's work).
Applications: e.g., in optimisation, as a search technique when the underlying structure of the value surface is unknown or very difficult to exploit.
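A minimal best-arm identification sketch: pure exploration with a fixed pull budget spread uniformly over the arms, then report the arm with the highest sample average. This is a deliberately simplified stand-in for the more refined algorithms mentioned above, on hypothetical Bernoulli arms:

```python
# Best-arm identification by uniform exploration over a fixed budget of pulls.
import random

random.seed(5)
true_means = [0.2, 0.5, 0.45, 0.3]     # hidden Bernoulli means (illustration only)
K, budget = len(true_means), 2000

pull = lambda arm: 1.0 if random.random() < true_means[arm] else 0.0

counts, sums = [0] * K, [0.0] * K

for t in range(budget):
    arm = t % K                         # uniform allocation of the exploration budget
    sums[arm] += pull(arm)
    counts[arm] += 1

best = max(range(K), key=lambda a: sums[a] / counts[a])
print(f"identified arm {best} as best (true best is arm {true_means.index(max(true_means))})")
```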

Dueling bandits
We choose two arms at each round and only observe which one is better (but not their reward values).
There are some nice algorithms (Microsoft Research, Yisong Yue, etc.).
Applications: internet search ranking, A/B testing.

Budget-limited bandits
We have to pay a cost to pull an arm, and we have a total budget limit (e.g., battery, time, or a financial budget).
There are some nice algorithms (L. Tran-Thanh, Microsoft Research, A. Badanidiyuru, etc.).
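A minimal budget-limited sketch. It assumes each pull of arm i has a known cost and the total spend cannot exceed a budget B; after a small exploration phase it greedily pulls the arm with the best estimated reward-per-cost ratio. The means, costs, and budget are made up, and the rule itself is a simplified stand-in for the algorithms cited above:

```python
# Budget-limited bandit: explore briefly, then pull the best estimated reward/cost arm.
import random

random.seed(6)
true_means = [1.0, 2.5, 1.8]        # hidden expected rewards (illustration only)
costs = [1.0, 3.0, 1.5]             # known pull costs (illustration only)
K, B = len(true_means), 200.0

pull = lambda arm: random.gauss(true_means[arm], 1.0)

counts, sums, spent, total = [0] * K, [0.0] * K, 0.0, 0.0

# Exploration: pull each arm a few times while the budget allows.
for arm in range(K):
    for _ in range(5):
        if spent + costs[arm] > B:
            break
        r = pull(arm)
        sums[arm] += r; counts[arm] += 1; spent += costs[arm]; total += r

# Exploitation: repeatedly pull the arm with the best estimated reward/cost ratio.
while True:
    ratios = [(sums[a] / counts[a]) / costs[a] for a in range(K)]
    arm = max(range(K), key=lambda a: ratios[a])
    if spent + costs[arm] > B:
        break
    r = pull(arm)
    sums[arm] += r; counts[arm] += 1; spent += costs[arm]; total += r

print(f"spent {spent:.1f} of budget {B}, total reward {total:.1f}")
```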

Summary: multi-armed bandits
A very simple model for sequential decision making.
Main challenge: exploration vs. exploitation.
Simple algorithms: epsilon-first, epsilon-greedy.
Many application domains: crowdsourcing, coordination, optimisation, internet search.