1
Basics of Multi-armed Bandit Problems
Zhu Han, Department of Electrical and Computer Engineering, University of Houston, TX, USA, Sep. 2016
2
Overview: Introduction, Basic Classification, Bounds, Algorithms, Variants, One Example
3
A slot machine with K arms available
At each step of a repeated game, the forecaster pulls an arm I_t and gets some bounded reward X_t associated with it. He only observes the reward X_t corresponding to the arm he chose; he does not observe the reward he would have got had he chosen a different arm. Thus, at round t ≥ 2, he can only base his decision on the past observations. His aim is to maximize the sum of the obtained rewards (in expectation or with high probability).
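As a minimal sketch of this repeated game (the Bernoulli arms and the placeholder policy are illustrative assumptions, not specified in the slides):

import random

# Illustrative setup: K arms with unknown Bernoulli reward probabilities.
K = 5
true_means = [random.random() for _ in range(K)]  # hidden from the forecaster

def pull(arm):
    """The slot machine returns a bounded reward in {0, 1} for the chosen arm."""
    return 1.0 if random.random() < true_means[arm] else 0.0

n = 1000
history = []            # past observations (I_t, X_t) the forecaster may use
total_reward = 0.0
for t in range(1, n + 1):
    # Placeholder policy: a real bandit algorithm would choose I_t based on `history`.
    chosen_arm = random.randrange(K)
    reward = pull(chosen_arm)   # only the reward of the chosen arm is observed
    history.append((chosen_arm, reward))
    total_reward += reward

print("cumulative reward:", total_reward)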
4
Regret after n plays I_1, …, I_n
In the regret, the first term is the cumulative reward you would have collected had you always played the single best arm, and the second term is the cumulative reward of the arms you actually played. Two notions are used: the expected regret and the pseudo-regret, with pseudo-regret ≤ expected regret (written out below).
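Written out in the notation of the Bubeck and Cesa-Bianchi survey cited at the end, the three quantities are:

% Regret after n plays I_1, ..., I_n against the best single arm in hindsight
R_n = \max_{i=1,\dots,K} \sum_{t=1}^{n} X_{i,t} \;-\; \sum_{t=1}^{n} X_{I_t,t}

% Expected regret and pseudo-regret
\mathbb{E}[R_n] = \mathbb{E}\!\left[\max_{i=1,\dots,K} \sum_{t=1}^{n} X_{i,t} - \sum_{t=1}^{n} X_{I_t,t}\right],
\qquad
\bar{R}_n = \max_{i=1,\dots,K} \mathbb{E}\!\left[\sum_{t=1}^{n} X_{i,t} - \sum_{t=1}^{n} X_{I_t,t}\right]

% Since a maximum of expectations is at most the expectation of the maximum:
\bar{R}_n \le \mathbb{E}[R_n]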
5
Classification 1: Stochastic Bandit Problem
6
Classification 2: Adversarial Bandit Problem
Randomization: Shuffle cards every round
7
Classification 3: Markovian Bandit Problem
8
Tail bounds: Chernoff–Hoeffding bound, Bernstein inequality
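In their standard forms (for i.i.d. X_1, …, X_n with mean μ; rewards in [0, 1] for Hoeffding, and |X_i − μ| ≤ b with variance σ² for Bernstein), the two bounds read:

% Chernoff–Hoeffding bound (X_i \in [0,1], mean \mu)
\mathbb{P}\!\left( \frac{1}{n}\sum_{i=1}^{n} X_i - \mu \ge \varepsilon \right)
  \le \exp\!\left( -2 n \varepsilon^2 \right)

% Bernstein's inequality (|X_i - \mu| \le b, \operatorname{Var}(X_i) = \sigma^2)
\mathbb{P}\!\left( \frac{1}{n}\sum_{i=1}^{n} X_i - \mu \ge \varepsilon \right)
  \le \exp\!\left( - \frac{n \varepsilon^2}{2\sigma^2 + \tfrac{2}{3} b \varepsilon} \right)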
9
Overview: Introduction, Basic Classification, Bounds, Algorithms, Variants, One Example
10
Stochastic Bandits: Some Definitions
Moment condition for a convex function ψ; the Legendre–Fenchel transform of ψ; the (α, ψ)-UCB algorithm, whose index combines an exploitation term (the empirical mean) with an exploration term.
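As a concrete instance, here is a minimal Python sketch of the classical UCB1 index, a special case of the (α, ψ)-UCB family; the exploration constant, the Bernoulli arms, and the pull function are illustrative assumptions, not taken from the slides:

import math
import random

def ucb1(pull, K, n):
    """UCB1: play each arm once, then play the arm maximizing
    empirical mean + sqrt(2 ln t / count) (exploitation + exploration)."""
    counts = [0] * K          # number of times each arm was played
    sums = [0.0] * K          # cumulative reward per arm
    for t in range(1, n + 1):
        if t <= K:
            arm = t - 1       # initialization: play every arm once
        else:
            arm = max(
                range(K),
                key=lambda i: sums[i] / counts[i]            # exploitation term
                + math.sqrt(2.0 * math.log(t) / counts[i]),  # exploration bonus
            )
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts, sums

# Usage with Bernoulli arms (illustrative assumption):
means = [0.2, 0.5, 0.8]
counts, sums = ucb1(lambda a: float(random.random() < means[a]), K=3, n=5000)
print(counts)  # the best arm should have been played most often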
11
Stochastic Bandits Upper bound and lower bound
12
Stochastic Bandits: Variants. Second-order bounds (KL-UCB algorithm)
Distribution-free bounds (MOSS, improved UCB); high-probability bounds; ε-greedy: first pick a parameter 0 < ε < 1, then at each step play the arm with the highest empirical mean reward with probability 1−ε and play a random arm with probability ε (sketched below); Thompson sampling; heavy-tailed distributions.
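A minimal sketch of the ε-greedy rule just described, with a fixed ε and the same assumed reward oracle as in the earlier sketches:

import random

def epsilon_greedy(pull, K, n, eps=0.1):
    """With probability 1 - eps play the arm with the highest empirical mean,
    with probability eps play a uniformly random arm."""
    counts = [0] * K
    means = [0.0] * K
    for t in range(n):
        if t < K:
            arm = t                      # play each arm once to initialize the means
        elif random.random() < eps:
            arm = random.randrange(K)    # explore
        else:
            arm = max(range(K), key=lambda i: means[i])  # exploit
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean update
    return means, counts

Some variants decay ε over time so that exploration vanishes as the empirical means become reliable.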
13
Adversarial Bandits Algorithm
14
Adversarial Bandits: Upper and Lower Bounds
15
Adversarial Bandits Variants High probability bound
Log-free upper bound; adaptive bound; alternative feedback structures
16
Overview: Introduction, Basic Classification, Bounds, Algorithms, Variants, One Example
17
Contextual Bandits: Scenarios, Side Information (Context), Model
In personalized news article recommendation, the task is to select, from a pool of candidates, a news article to display whenever a new user visits a website. The articles correspond to arms, and a reward is obtained whenever the user clicks on the selected article. Side information (context): for the user this may include historical activities, demographic information, and geolocation; for the articles, we may have content information and categories. Model: the world announces some context information x (think of this as a high-dimensional bit vector if that helps); a policy chooses arm a from 1 of k arms (i.e., 1 of k ads); the world reveals the reward r_a of the chosen arm (i.e., whether the ad is clicked on). Expert case.
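The interaction protocol of this model can be summarized in a short Python sketch; the context generator, placeholder policy, and click model below are illustrative assumptions, not part of the slides:

import random

K = 4                       # number of candidate articles (arms)
D = 16                      # dimension of the context bit vector

def get_context():
    """World announces context x, e.g. user history, demographics, geolocation."""
    return [random.randint(0, 1) for _ in range(D)]

def policy(x):
    """A policy maps the context to one of the K arms (here: a random placeholder)."""
    return random.randrange(K)

def get_reward(x, a):
    """World reveals the reward of the chosen arm only, e.g. 1 if the user clicks."""
    return float(random.random() < 0.1)

for t in range(100):
    x = get_context()       # 1. world announces the context
    a = policy(x)           # 2. policy chooses an arm (article) to display
    r = get_reward(x, a)    # 3. only the chosen arm's reward (click or not) is observed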
18
Linear Bandits. The loss at each step is some function defined on K, and the task is to pick an arm as close as possible to the minimum of the loss function at hand. Online linear optimization: the forecaster chooses x_t ∈ K; simultaneously, the adversary chooses ℓ_t from some fixed and known subset L. John's theorem, the Exp2 (Expanded Exp) algorithm, and the Online Mirror Descent (OMD) algorithm.
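In this formulation the pseudo-regret compares the forecaster's cumulative linear loss with the best fixed point of K in hindsight:

% Online linear optimization: x_t \in K chosen by the forecaster,
% \ell_t \in L chosen simultaneously by the adversary
\bar{R}_n = \mathbb{E}\!\left[ \sum_{t=1}^{n} \ell_t^{\top} x_t \right]
  \;-\; \min_{x \in K} \, \mathbb{E}\!\left[ \sum_{t=1}^{n} \ell_t^{\top} x \right]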
19
Nonlinear Bandits. Losses are nonlinear functions of the arms. Algorithms:
Two-point vs. one-point bandit feedback; stochastic bandits.
20
Other Variants Markov Decision Process, restless and sleeping bandits
Pure exploration problems; dueling bandits; discovery with probabilistic expert advice; many-armed bandits; truthful bandits.
21
Overview: Introduction, Basic Classification, Bounds, Algorithms, Variants, One Example
22
Max Effort Cover (MEC). Sequential Learning for Passive Monitoring of Multichannel Wireless Networks, Thanh Le, MS thesis, 2013. Objective: find the best set of assignments (sniffer to channel) to capture the activity of users with the highest probability.
23
Greedy algorithm: problem formulation, optimal vs. greedy assignment.
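As an illustration only (the coverage model and all names below are assumptions, not taken from the MEC work), a greedy sniffer-to-channel assignment that adds the largest number of newly covered users at each step could be sketched as:

def greedy_assignment(sniffers, channels, coverage):
    """Greedily assign each sniffer to a channel, at each step picking the
    (sniffer, channel) pair that adds the most newly covered users.
    `coverage[(s, c)]` is the set of users sniffer s can capture on channel c."""
    assignment = {}
    covered = set()
    for _ in sniffers:
        best, best_gain = None, -1
        for s in sniffers:
            if s in assignment:
                continue
            for c in channels:
                gain = len(coverage.get((s, c), set()) - covered)
                if gain > best_gain:
                    best_gain, best = gain, (s, c)
        if best is None:
            break
        s, c = best
        assignment[s] = c
        covered |= coverage.get((s, c), set())
    return assignment, covered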
24
Multi-agent idea: correlation-exploiting algorithms.
Advantage: highly accurate information about the channel. Drawback: computational complexity.
25
Domino effect: Reward seen by agents
26
Domino effect: Reward seen by agents
27
Domino effect: Reward seen by agents
28
ε-Greedy-Agent-approx
29
Simulation results: small regret.
30
Conclusion: Multi-armed bandit problem
A gambler at a row of slot machines has to decide which machines to play, how many times to play each machine, and in which order to play them. Three categories: stochastic, adversarial, and Markovian. Many algorithms with regret bounds. Many variants for a variety of applications. Cognitive energy harvesting? Powercast and USRP testbed; which category and which algorithm to use; what is unique about such a scenario; how to set up a reasonable test.
31
Reference
Gilles Stoltz, "Introduction to stochastic and adversarial multi-armed bandit problems, with some recent results of the French guys," slides.
Sébastien Bubeck and Nicolò Cesa-Bianchi, "Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems," Foundations and Trends® in Machine Learning, vol. 5, no. 1, pp. 1-122, Dec. 2012.
Nicolò Cesa-Bianchi and Gábor Lugosi, "Prediction, Learning, and Games," Cambridge University Press, 2006.
Rong Zheng, Thanh Le, and Zhu Han, "Approximate Online Learning Algorithms for Optimal Monitoring in Multi-channel Wireless Networks," IEEE Transactions on Wireless Communications, vol. 13, no. 2, February 2014.
Rong Zheng, Thanh Le, and Zhu Han, "Approximate Online Learning for Passive Monitoring of Multi-channel Wireless Networks," The 32nd IEEE International Conference on Computer Communications (INFOCOM), Turin, Italy, April 2013.
32
Thank you