COMP 2208: Bandits
Dr. Long Tran-Thanh, University of Southampton
Decision making
[Figure: decision-making loop. Labels: Environment, Perception, Decision making (categorize inputs, update belief model, update decision making policy), Behaviour]
Sequential decision making
[Figure: loop between Environment, Perception, Decision making, and Behaviour/Action]
- Repeatedly making decisions: 1 decision per round
- Uncertainty: the outcome is not known in advance and is noisy
The story of unlucky bandits
…who decided to outsource the robberies
- The candidates: The Good, The Bad, The Ugly (and lazy)
- But what we don’t know: their expected rewards (also unknown)
The Good, the Bad, and the Ugly
- Objective: maximise the total expected reward via repeated robbing
- At each round, we can only hire 1 guy
- The reward value per round of each bandit is noisy (i.e., it is a random variable), and we don’t know the expected values
- Question: whom should we hire at each round?
How to estimate the expected rewards?
- Exploration: …
- Average: … … 29.5
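The averaging idea above can be sketched as a running sample mean that is updated after every observed reward. This is a minimal illustration only: the class name `RunningAverage` and the reward values in the usage lines are made up for this example and do not come from the slides.

```python
class RunningAverage:
    """Incrementally tracks the sample mean of the rewards observed so far."""

    def __init__(self):
        self.count = 0   # number of rewards seen
        self.mean = 0.0  # current estimate of the expected reward

    def update(self, reward):
        """Fold one new reward into the estimate and return the new mean."""
        self.count += 1
        # Incremental form of the sample mean: no need to store past rewards.
        self.mean += (reward - self.mean) / self.count
        return self.mean


# Hypothetical rewards from hiring the same guy three times.
estimate = RunningAverage()
for observed_reward in [10.0, 20.0, 30.0]:
    estimate.update(observed_reward)
print(estimate.mean)
```

The more rounds we spend on one guy, the closer this average tends to get to his true expected reward, which is exactly why exploration costs rounds.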
Time limits
- There is a threshold on the number of rounds
- Find the best guy as soon as possible
- Exploitation: always choose the one with the highest current average
Exploration versus exploitation
- Too much focus on exploration:
  - Pro: can accurately estimate each guy’s expected performance
  - Con: will take too much time
- Too much focus on exploitation:
  - Pro: focus on maximising the expected total reward
  - Con: might miss the chance to identify the true best guy
Exploration versus exploitation
- Key challenge: how to efficiently balance exploration and exploitation
- This is the dilemma of exploration vs. exploitation
- Next: introduction of the multi-armed bandit model
One-armed bandit
- The reward value is drawn from an unknown distribution
The multi-armed bandit (MAB) model
- There are multiple arms
- At each time step (round):
  - We choose 1 arm to pull
  - We receive a reward, drawn from an unknown distribution of that arm
- Objective: maximise the expected total reward
- Exploration: we want to learn each arm’s expected reward value
- Exploitation: we want to maximise the sum of the rewards
- MAB is the simplest model that captures the dilemma of exploration vs. exploitation
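To experiment with the model, it helps to have a simulated environment. The sketch below is one possible toy simulator, not part of the lecture: the class name `GaussianBandit`, the Gaussian noise, and the example means are all assumptions chosen for illustration. The learner only ever calls `pull`; the true means stay hidden inside the environment.

```python
import random


class GaussianBandit:
    """Toy MAB environment: each arm returns its mean plus Gaussian noise.

    The Gaussian reward distribution is an assumption for this sketch; the
    MAB model itself only requires some unknown distribution per arm.
    """

    def __init__(self, means, noise_std=1.0, seed=None):
        self.means = list(means)        # true expected rewards (hidden from the learner)
        self.noise_std = noise_std
        self.rng = random.Random(seed)  # private RNG so runs can be reproduced

    @property
    def n_arms(self):
        return len(self.means)

    def pull(self, arm):
        """Pull one arm and return a noisy reward."""
        return self.rng.gauss(self.means[arm], self.noise_std)


# Usage: three hypothetical arms; the learner sees only the sampled rewards.
bandit = GaussianBandit([1.0, 2.0, 5.0], seed=42)
rewards = [bandit.pull(2) for _ in range(5)]
print(rewards)
```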
MAB: a sequential decision making framework
[Figure: loop between Environment, Perception, Decision making, and Behaviour/Action]
- Arm = option
- Pulling an arm = choosing that option
How to solve MAB problems?
- We don’t have knowledge about the expected reward values at the beginning
- But we can learn these values through exploration
- However, this means that we have to pull arms that are not optimal (to learn that they are not optimal)
- So we cannot achieve the optimal solution, which would be to pull the optimal arm (the one with the highest expected reward value) all the time
- Our goal: design algorithms that are as close to the optimum as possible (= good approximation)
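One common way to make "close to the optimum" precise is the expected regret: the gap between always pulling the optimal arm and what the algorithm actually collects. The notation below is introduced here and is not from the slides: \mu_i is the expected reward of arm i, a_t is the arm pulled at round t, and T is the number of rounds.

```latex
R_T \;=\; T\,\mu^{*} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{a_t}\right],
\qquad \mu^{*} = \max_{i} \mu_i
```

An algorithm is a good approximation in this sense if R_T grows slowly with T, since every pull of a non-optimal arm adds a positive term to the regret.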
The Epsilon-first approach
- Suppose we know the number of rounds beforehand: we can pull the arms T times
- Epsilon-first:
  - Choose an epsilon value: 0 < epsilon < 1 (typically between 0.05 and 0.2)
  - In the first epsilon*T rounds, we only do exploration: we pull all the arms in a round-robin manner
  - After the first epsilon*T rounds, we choose the arm with the highest average reward value
  - We only pull this arm for the remaining (1-epsilon)*T rounds
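The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: `make_pull` is a hypothetical Gaussian simulator added so the example runs on its own, and the arm means are made up. Rounding epsilon*T down to an integer is also a choice made here.

```python
import random


def make_pull(means, noise_std=1.0, seed=0):
    """Hypothetical simulator: returns pull(arm) giving a noisy Gaussian reward."""
    rng = random.Random(seed)
    return lambda arm: rng.gauss(means[arm], noise_std)


def epsilon_first(pull, n_arms, T, epsilon=0.1):
    """Explore round-robin for epsilon*T rounds, then commit to the best average."""
    explore_rounds = int(epsilon * T)  # assumption: round down
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    total = 0.0

    # Exploration phase: pull the arms in a round-robin manner.
    for t in range(explore_rounds):
        arm = t % n_arms
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
        total += reward

    # Pick the arm with the highest average; unexplored arms rank last.
    averages = [sums[a] / counts[a] if counts[a] else float("-inf")
                for a in range(n_arms)]
    best = max(range(n_arms), key=lambda a: averages[a])

    # Exploitation phase: pull only the chosen arm for the remaining rounds.
    for _ in range(T - explore_rounds):
        total += pull(best)
    return best, total


best_arm, total_reward = epsilon_first(make_pull([1.0, 2.0, 5.0]), n_arms=3, T=1000)
print(best_arm, total_reward)
```

Note that the choice of arm is never revisited after the exploration phase, so a poor estimate at round epsilon*T is paid for in every remaining round.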
The Epsilon-greedy approach
- Epsilon-greedy:
  - Choose an epsilon value: 0 < epsilon < 1 (typically between 0.05 and 0.2)
  - At each round, with probability (1-epsilon), we pull the arm with the current best average reward value
  - Otherwise, with probability epsilon, we pull an arbitrary arm (other than the current best)
  - We repeat this for each round
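A minimal sketch of the rule above, under a few assumptions that are not in the slides: each arm is pulled once at the start so that every average is defined, the exploratory arm is chosen uniformly at random among the non-best arms, and `make_pull` is a hypothetical Gaussian simulator with made-up means.

```python
import random


def make_pull(means, noise_std=1.0, seed=0):
    """Hypothetical simulator: returns pull(arm) giving a noisy Gaussian reward."""
    rng = random.Random(seed)
    return lambda arm: rng.gauss(means[arm], noise_std)


def epsilon_greedy(pull, n_arms, T, epsilon=0.1, rng=None):
    """Run epsilon-greedy for T rounds; returns (pull counts, estimates, total reward)."""
    rng = rng or random.Random()
    counts = [0] * n_arms
    means = [0.0] * n_arms
    total = 0.0
    for t in range(T):
        if t < n_arms:
            arm = t  # assumption: one initial pull per arm
        else:
            best = max(range(n_arms), key=lambda a: means[a])
            if n_arms > 1 and rng.random() < epsilon:
                # Explore: any arm other than the current best.
                arm = rng.choice([a for a in range(n_arms) if a != best])
            else:
                # Exploit: the arm with the current best average.
                arm = best
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
        total += reward
    return counts, means, total


counts, estimates, total_reward = epsilon_greedy(
    make_pull([1.0, 2.0, 5.0]), n_arms=3, T=2000, rng=random.Random(1))
print(counts)
```

Unlike epsilon-first, this does not need T in advance, but it keeps spending roughly an epsilon fraction of all rounds on non-best arms no matter how long it runs.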
Epsilon-first vs. epsilon-greedy
Epsilon-first:
- Explicitly separates exploration from exploitation
- Observation: typically very good when T is small
- Drawback: we need to know T in advance
- Drawback 2: sensitive to the value of epsilon
Epsilon-greedy:
- Exploration and exploitation are interleaved
- Observation: typically efficient when T is sufficiently large
- Drawback: slow convergence at the beginning (especially with small epsilon)
Other algorithms
- UCB (for upper confidence bound): combines exploration and exploitation within each single round in a very clever way
- More advanced approaches:
  - Thompson sampling: maintain a belief distribution about the true expected reward of each arm, using Bayes’ Theorem. Randomly sample from each of these beliefs and choose the arm with the highest sample. We repeat this at each round
  - … and many others
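Both ideas can be sketched in a few lines. The slides do not give formulas, so the details below are assumptions: the UCB variant shown is the standard UCB1 index (average plus sqrt(2 ln t / n)), and the Thompson sampling variant assumes 0/1 (Bernoulli) rewards with a uniform Beta(1, 1) prior on each arm. `make_bernoulli_pull` is a hypothetical simulator with made-up success probabilities.

```python
import math
import random


def make_bernoulli_pull(probs, seed=0):
    """Hypothetical simulator: pull(arm) returns 1 with probability probs[arm], else 0."""
    rng = random.Random(seed)
    return lambda arm: 1 if rng.random() < probs[arm] else 0


def ucb1(pull, n_arms, T):
    """UCB1 for rewards in [0, 1]: pull the arm with the highest optimistic estimate."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, T + 1):
        if t <= n_arms:
            arm = t - 1  # pull every arm once so each index is defined
        else:
            # Average (exploitation) plus a bonus that shrinks as an arm is pulled more.
            arm = max(range(n_arms),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts


def thompson_bernoulli(pull, n_arms, T, rng=None):
    """Thompson sampling with a Beta belief per arm (assumes 0/1 rewards)."""
    rng = rng or random.Random()
    alpha = [1] * n_arms  # 1 + number of observed successes
    beta = [1] * n_arms   # 1 + number of observed failures
    counts = [0] * n_arms
    for _ in range(T):
        # Sample one plausible expected reward per arm from its current belief.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        if pull(arm):
            alpha[arm] += 1  # Bayes update after a success
        else:
            beta[arm] += 1   # Bayes update after a failure
        counts[arm] += 1
    return counts


probs = [0.2, 0.5, 0.8]
print(ucb1(make_bernoulli_pull(probs, seed=1), n_arms=3, T=2000))
print(thompson_bernoulli(make_bernoulli_pull(probs, seed=1), 3, 2000, rng=random.Random(2)))
```

In both cases there is no epsilon to tune: arms that look bad are still tried occasionally, but less and less often as the evidence against them accumulates.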
Application 1: Disaster response
- ORCHID (Prof Nick Jennings, Dr. Sarvapali (Gopal) Ramchurn)
Application 1: Disaster response
- Decentralised coordination of UAVs
- Australian Centre for Field Robotics:
  - Build UAVs
  - Implement software
- University of Southampton:
  - Develop algorithms for decentralised coordination (max-sum)
  - No uncertainty included
Application 1: Disaster response
- Noisy observations, unknown outcomes
- Multi-armed bandit model:
  - Each area of observation = arm
  - Information value of observing an area = reward
- Bandit algorithm + some decentralised task allocation protocol = things work fine
Application 2: Crowdsourcing systems
- Idea: combine machine intelligence with the mass of human intelligence
- reCAPTCHA (Luis von Ahn)
- Oxford, Harvard, Microsoft, etc.
Application 2: Crowdsourcing systems
Expert crowdsourcing systems:
- Set of translation tasks
- Employer assigns translation tasks to candidate workers
- Pays a (worker-specific) cost (e.g., $5/hr)
- Receives a utility value from each worker (e.g., hr/doc)
- Objective: maximise the total number of translated docs
- Limited by budget B = $500
- Trade-off: cheap but low-utility workers vs. high-utility but expensive workers
Application 2: Crowdsourcing systems
Crowdsourced text corrector:
- Example sentence: "Markov Decision Processes are one of the most wide used frameworks to formulate probabilistic planning problems."
- Find: wide used
- Fix: widely used, wide-used
- Verify: YES / NO
Common problem: task allocation in crowdsourcing systems
- Tackle this problem with a multi-armed bandit model:
  - Task assignment (to whom, which task) = pulling an arm
  - Quality of the outcome of the assignment = reward value
Some extensions of the multi-armed bandit
- Best-arm bandit
- Dueling bandits
- Budget-limited bandits
Best-arm bandit
- We aim to identify the best arm
- Pure learning: only exploration, no exploitation
- Some nice algorithms with good theoretical and practical performance (see Sébastien Bubeck)
- Applications: e.g., in optimisation, as a new search technique where the underlying structure of the value surface is not known or is very difficult
Dueling bandits
- We choose 2 arms at each round
- We only see which one is better (but not their reward values)
- Some nice algorithms (Microsoft Research, Yisong Yue, etc.)
- Application: internet search ranking
- Application 2: A/B testing
Budget-limited bandits
- We have to pay a cost to pull an arm
- We have a total budget limit
- Some nice algorithms (L. Tran-Thanh, Microsoft Research, A. Badanidiyuru, etc.)
- Examples of budgets: battery, time, financial budget
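To make the setting concrete, here is a simplified sketch in the spirit of epsilon-first: spend a fraction of the budget on round-robin exploration, then repeatedly pull the affordable arm with the best estimated reward per unit of cost. This is an illustration only and should not be read as any of the cited algorithms. It assumes that pulling costs are known and positive, and `make_pull`, the costs, and the means are made up.

```python
import random


def make_pull(means, noise_std=1.0, seed=0):
    """Hypothetical simulator: returns pull(arm) giving a noisy Gaussian reward."""
    rng = random.Random(seed)
    return lambda arm: rng.gauss(means[arm], noise_std)


def budgeted_epsilon_first(pull, costs, budget, epsilon=0.1):
    """Explore with epsilon*budget, then exploit the best reward-per-cost arm."""
    if any(c <= 0 for c in costs):
        raise ValueError("all pulling costs must be positive")
    n_arms = len(costs)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    total_reward = 0.0

    # Exploration: round-robin until the next pull would exceed the exploration budget.
    explore_budget = epsilon * budget
    spent = 0.0
    arm = 0
    while spent + costs[arm] <= explore_budget:
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward
        spent += costs[arm]
        arm = (arm + 1) % n_arms
    remaining = budget - spent

    def density(a):
        """Estimated reward per unit of cost; unexplored arms rank last."""
        if counts[a] == 0:
            return float("-inf")
        return (sums[a] / counts[a]) / costs[a]

    # Exploitation: keep pulling the best affordable arm until nothing is affordable.
    while True:
        affordable = [a for a in range(n_arms) if costs[a] <= remaining]
        if not affordable:
            break
        best = max(affordable, key=density)
        total_reward += pull(best)
        remaining -= costs[best]
    return total_reward, remaining


# Usage with a budget of 500, echoing the B = $500 crowdsourcing example.
reward, left = budgeted_epsilon_first(
    make_pull([1.0, 6.0, 4.0]), costs=[1.0, 2.0, 4.0], budget=500.0)
print(reward, left)
```

The key difference from the standard model is that the best arm is no longer simply the one with the highest expected reward: a cheaper arm with a lower reward can be the better use of the budget.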
Summary: multi-armed bandits
- Very simple model for sequential decision making
- Main challenge: exploration vs. exploitation
- Algorithms: epsilon-first, epsilon-greedy
- Many application domains: crowdsourcing, coordination, optimisation, internet search