Basics of Multi-armed Bandit Problems


1 Basics of Multi-armed Bandit Problems
Zhu Han, Department of Electrical and Computer Engineering, University of Houston, TX, USA, Sep. 2016

2 Overview: Introduction, Basic Classification, Bounds, Algorithms, Variants, One Example

3 A slot machine with K arms available
At each step of a repeated game, the forecaster pulls an arm I_t and receives a bounded reward X_t associated with it. He only observes the reward X_t corresponding to the arm he chose; he does not observe the reward he would have received had he chosen a different arm. Thus, at each round t ≥ 2, he can base his decision only on past observations. His aim is to maximize the sum of the obtained rewards (in expectation or with high probability).
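A minimal sketch of this interaction protocol (the Bernoulli reward model and the uniformly random baseline policy are illustrative assumptions, not part of the slides):

```python
import random

def play_bandit(n_rounds, arm_means, choose_arm):
    """Simulate the repeated game: at each round the forecaster picks an arm
    and observes only the reward of that arm (bandit feedback)."""
    history = []            # past observations (arm, reward) available to the forecaster
    total_reward = 0.0
    for t in range(1, n_rounds + 1):
        arm = choose_arm(t, history)                       # decision based only on the past
        reward = float(random.random() < arm_means[arm])   # bounded reward in {0, 1}
        history.append((arm, reward))
        total_reward += reward
    return total_reward

# Example: a forecaster that picks arms uniformly at random (a deliberately naive baseline).
means = [0.2, 0.5, 0.8]
print(play_bandit(1000, means, lambda t, h: random.randrange(len(means))))
```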

4 Regret after n plays I_1, ..., I_n
Regret after n plays:
\[ R_n = \max_{i=1,\ldots,K} \sum_{t=1}^{n} X_{i,t} - \sum_{t=1}^{n} X_{I_t,t} \]
The first term is the reward you would have collected by playing the single best arm throughout; the second term is the reward of the arms you actually played.
Expected regret: \(\mathbb{E}[R_n]\). Pseudo-regret:
\[ \bar{R}_n = \max_{i=1,\ldots,K} \mathbb{E}\!\left[ \sum_{t=1}^{n} X_{i,t} - \sum_{t=1}^{n} X_{I_t,t} \right] \]
Since a maximum of expectations is at most the expectation of the maximum, pseudo-regret ≤ expected regret.

5 Classification 1: Stochastic Bandit Problem

6 Classification 2: Adversarial Bandit Problem
Randomization: Shuffle cards every round

7 Classification 3: Markovian Bandit Problem

8 Tail bounds: Chernoff-Hoeffding bound; Bernstein inequality
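For reference, standard statements of these two bounds for i.i.d. rewards X_1, ..., X_n in [0,1] with mean μ (and, for Bernstein, variance at most σ²); the exact constants below follow the usual textbook forms and are stated here as an assumption:

```latex
% Chernoff-Hoeffding bound: the empirical mean concentrates exponentially fast.
\[
  \Pr\!\left(\frac{1}{n}\sum_{t=1}^{n} X_t - \mu \ge \varepsilon\right)
  \le \exp\!\left(-2 n \varepsilon^2\right).
\]

% Bernstein inequality: if additionally Var(X_t) <= sigma^2,
% the bound improves when the variance is small.
\[
  \Pr\!\left(\frac{1}{n}\sum_{t=1}^{n} X_t - \mu \ge \varepsilon\right)
  \le \exp\!\left(-\frac{n \varepsilon^2}{2\sigma^2 + \tfrac{2}{3}\varepsilon}\right).
\]
```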

9 Overview: Introduction, Basic Classification, Bounds, Algorithms, Variants, One Example

10 Stochastic Bandits: Some Definitions
Moment condition for a convex function ψ; Legendre-Fenchel transform; the (α,ψ)-UCB algorithm, which trades off exploitation against exploration.
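The slide refers to the (α,ψ)-UCB family; as a concrete, hedged illustration, here is the classical UCB1 index (the Hoeffding-based special case), with illustrative Bernoulli rewards:

```python
import math
import random

def ucb1(n_rounds, arm_means):
    """UCB1: pull every arm once, then always pull the arm maximizing
    (empirical mean) + sqrt(2 * ln t / n_i)."""
    k = len(arm_means)
    counts = [0] * k                        # n_i: number of pulls of arm i
    sums = [0.0] * k                        # cumulative reward of arm i
    total = 0.0
    for t in range(1, n_rounds + 1):
        if t <= k:
            arm = t - 1                     # initialization: play each arm once
        else:
            arm = max(range(k),
                      key=lambda i: sums[i] / counts[i]                        # exploitation
                                    + math.sqrt(2 * math.log(t) / counts[i]))  # exploration
        reward = float(random.random() < arm_means[arm])   # illustrative Bernoulli reward
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total

print(ucb1(10000, [0.2, 0.5, 0.8]))
```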

11 Stochastic Bandits: Upper and lower bounds

12 Stochastic Bandits: Variants
Second-order bounds (KL-UCB algorithm)
Distribution-free bounds (MOSS, improved UCB)
High-probability bounds
ε-greedy: first, pick a parameter 0 < ε < 1; then, at each step, greedily play the arm with the highest empirical mean reward with probability 1−ε, and play a uniformly random arm with probability ε (see the sketch below)
Thompson sampling
Heavy-tailed distributions
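A minimal sketch of the ε-greedy rule just described (the Bernoulli reward model is an illustrative assumption):

```python
import random

def epsilon_greedy(n_rounds, arm_means, eps=0.1):
    """epsilon-greedy: with probability 1 - eps play the arm with the highest
    empirical mean reward, with probability eps play a uniformly random arm."""
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    total = 0.0
    for _ in range(n_rounds):
        untried = [i for i in range(k) if counts[i] == 0]
        if untried:
            arm = untried[0]                               # make sure every arm is tried once
        elif random.random() < eps:
            arm = random.randrange(k)                      # explore
        else:
            arm = max(range(k), key=lambda i: sums[i] / counts[i])   # exploit
        reward = float(random.random() < arm_means[arm])   # illustrative Bernoulli reward
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total

print(epsilon_greedy(10000, [0.2, 0.5, 0.8], eps=0.05))
```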

13 Adversarial Bandits Algorithm

14 Adversarial Bandits: Upper and lower bounds

15 Adversarial Bandits: Variants
High-probability bounds
Log-free upper bounds
Adaptive bounds
Alternative feedback structures

16 Overview: Introduction, Basic Classification, Bounds, Algorithms, Variants, One Example

17 Contextual Bandits
Scenario: in personalized news article recommendation, the task is to select, from a pool of candidates, a news article to display whenever a new user visits a website. The articles correspond to arms, and a reward is obtained whenever the user clicks on the selected article.
Side information (context): for the user, this may include historical activities, demographic information, and geolocation; for the articles, we may have content information and categories.
Model: the world announces some context information x (think of this as a high-dimensional bit vector if that helps); a policy chooses arm a from 1 of k arms (i.e., 1 of k ads); the world reveals the reward r_a of the chosen arm (i.e., whether the ad is clicked on). A sketch of this loop follows below.
Expert case
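A minimal sketch of this context/arm/reward loop (the hidden reward model and the placeholder policy are illustrative assumptions, not an algorithm from the slides):

```python
import random

def run_contextual_bandit(n_rounds, n_arms, dim, policy):
    """Interaction loop: the world announces a context x, the policy chooses an arm,
    and only the reward of the chosen arm is revealed (bandit feedback)."""
    # Hypothetical world: each arm has hidden weights; the click probability is the
    # normalized match between the context and those weights (purely illustrative).
    hidden = [[random.random() for _ in range(dim)] for _ in range(n_arms)]
    total = 0.0
    for t in range(n_rounds):
        x = [random.randint(0, 1) for _ in range(dim)]     # context: a high-dimensional bit vector
        arm = policy(x, n_arms)                             # policy picks 1 of k arms
        score = sum(w * xi for w, xi in zip(hidden[arm], x)) / dim
        reward = float(random.random() < score)             # reward = whether the "ad" is clicked
        total += reward
    return total

# Placeholder policy that ignores the context; a real contextual policy (e.g. LinUCB) would use x.
print(run_contextual_bandit(1000, n_arms=5, dim=20, policy=lambda x, k: random.randrange(k)))
```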

18 Linear Bandits
The loss at each step is a function defined on the arm set K, and the task is to pick an arm as close as possible to the minimizer of the loss function at hand.
Online linear optimization: the forecaster chooses x_t ∈ K; simultaneously, the adversary chooses a loss vector ℓ_t from some fixed and known subset L.
John's theorem; Exp2 (Expanded Exp) algorithm; Online Mirror Descent (OMD) algorithm (a sketch of the OMD update follows below).
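As a simplified illustration of the OMD update mentioned above, here is mirror descent with the negative-entropy regularizer on the probability simplex in the full-information setting (the bandit version would additionally replace the losses with importance-weighted estimates; the step size and random losses are illustrative assumptions):

```python
import math
import random

def omd_entropic(loss_vectors, eta=0.1):
    """Online Mirror Descent with the negative-entropy regularizer on the simplex:
    the mirror update reduces to a multiplicative-weights step  w_i <- w_i * exp(-eta * l_i)."""
    k = len(loss_vectors[0])
    w = [1.0 / k] * k                        # start at the uniform distribution
    total_loss = 0.0
    for loss in loss_vectors:                # full information: the whole loss vector is revealed
        total_loss += sum(wi * li for wi, li in zip(w, loss))
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, loss)]
        z = sum(w)
        w = [wi / z for wi in w]             # normalization = Bregman projection onto the simplex
    return total_loss

# Illustrative run on random losses in [0, 1].
losses = [[random.random() for _ in range(5)] for _ in range(1000)]
print(omd_entropic(losses, eta=0.05))
```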

19 Nonlinear Bandits
Losses are nonlinear functions of the arms.
Algorithms: two-point vs. one-point bandit feedback; stochastic bandits. The gradient estimators usually behind these two feedback models are sketched below.
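For orientation, the standard single-point and two-point gradient estimators (standard constructions, not taken from the slides): with u drawn uniformly from the unit sphere in R^d and a small smoothing radius δ,

```latex
% One-point estimator: one function evaluation per round, unbiased for a smoothed loss.
\[
  \hat{g}_t \;=\; \frac{d}{\delta}\, f_t(x_t + \delta u)\, u ,
  \qquad u \sim \mathrm{Unif}(\mathbb{S}^{d-1}).
\]

% Two-point estimator: two evaluations per round, much lower variance.
\[
  \hat{g}_t \;=\; \frac{d}{2\delta}\,\bigl(f_t(x_t + \delta u) - f_t(x_t - \delta u)\bigr)\, u .
\]
```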

20 Other Variants
Markov Decision Processes, restless and sleeping bandits
Pure exploration problems
Dueling bandits
Discovery with probabilistic expert advice
Many-armed bandits
Truthful bandits

21 Overview: Introduction, Basic Classification, Bounds, Algorithms, Variants, One Example

22 Max Effort Cover (MEC)
Sequential Learning for Passive Monitoring of Multichannel Wireless Networks, Thanh Le, MS thesis, 2013.
Objective: find the best set of assignments (sniffer to channel) to capture the activity of users with the highest probability.

23 Greedy algorithm (figure comparing the problem instance, the optimal assignment, and the greedy assignment)
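A hedged sketch of the greedy assignment idea for the objective on the previous slide (the coverage sets, activity probabilities, and function names are illustrative assumptions, not the exact algorithm from the cited thesis):

```python
def greedy_assignment(sniffers, channels, covers, user_prob):
    """Greedy sniffer-to-channel assignment: repeatedly give an unassigned sniffer the
    channel whose newly covered users add the most expected activity.
    covers[(s, c)] is the set of users sniffer s can observe on channel c;
    user_prob[u] is the probability that user u is active (illustrative model)."""
    assignment = {}
    covered = set()
    for _ in sniffers:
        best, best_gain = None, -1.0
        for s in sniffers:
            if s in assignment:
                continue
            for c in channels:
                gain = sum(user_prob[u] for u in covers.get((s, c), set()) - covered)
                if gain > best_gain:
                    best, best_gain = (s, c), gain
        if best is None:
            break
        s, c = best
        assignment[s] = c
        covered |= covers.get((s, c), set())
    return assignment

# Tiny illustrative instance: 2 sniffers, 2 channels.
covers = {(0, 'ch1'): {'u1', 'u2'}, (0, 'ch2'): {'u3'},
          (1, 'ch1'): {'u2'}, (1, 'ch2'): {'u3', 'u4'}}
probs = {'u1': 0.9, 'u2': 0.5, 'u3': 0.7, 'u4': 0.2}
print(greedy_assignment([0, 1], ['ch1', 'ch2'], covers, probs))
```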

24 Multi-agent idea
Correlation-exploiting algorithms:
Advantage: highly accurate information about the channel.
Drawback: computational complexity.

25 Domino effect: Reward seen by agents

26 Domino effect: Reward seen by agents

27 Domino effect: Reward seen by agents

28 ε-Greedy-Agent-approx

29 Simulation results: small regret

30 Conclusion
Multi-armed bandit problem: a gambler at a row of slot machines has to decide which machines to play, how many times to play each machine, and in which order to play them.
Three categories: stochastic, adversarial, and Markovian.
Many algorithms with regret bounds.
Many variants for a variety of applications.
Cognitive energy harvesting? Powercast and USRP testbed: which category and which algorithm? What is unique about such a scenario? How to set up a reasonable test?

31 References
Gilles Stoltz, "Introduction to stochastic and adversarial multi-armed bandit problems, with some recent results of the French guys," slides.
Sébastien Bubeck and Nicolò Cesa-Bianchi, "Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems," Foundations and Trends in Machine Learning, vol. 5, no. 1, pp. 1-122, Dec. 2012.
Nicolò Cesa-Bianchi and Gábor Lugosi, "Prediction, Learning, and Games," Cambridge University Press, 2006.
Rong Zheng, Thanh Le, and Zhu Han, "Approximate Online Learning Algorithms for Optimal Monitoring in Multi-channel Wireless Networks," IEEE Transactions on Wireless Communications, vol. 13, no. 2, February 2014.
Rong Zheng, Thanh Le, and Zhu Han, "Approximate Online Learning for Passive Monitoring of Multi-channel Wireless Networks," The 32nd IEEE International Conference on Computer Communications (INFOCOM), Turin, Italy, April 2013.

32 Thank you

