Basics of Multi-armed Bandit Problems


1 Basics of Multi-armed Bandit Problems
Zhu Han, Department of Electrical and Computer Engineering, University of Houston, TX, USA, Sep. 2016

2 Overview: Introduction, Basic Classification, Bounds, Algorithms, Variants, One Example

3 A slot machine with K arms available
At each step of a repeated game, the forecaster pulls an arm I_t and receives a bounded reward X_t associated with it. He only observes the reward X_t corresponding to the arm he chose; he does not observe the reward he would have received had he chosen a different arm. Thus, at each round t ≥ 2, he can base his decision only on past observations. His aim is to maximize the sum of the obtained rewards (in expectation or with high probability).
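A minimal sketch of this interaction protocol (the Bernoulli reward model and the uniformly random baseline policy are illustrative assumptions, not part of the slides):

```python
import random

def play_bandit(n_rounds, arm_means, choose_arm):
    """Simulate the repeated game: at each round the forecaster picks an arm
    and observes only the reward of that arm (bandit feedback)."""
    history = []            # past observations (arm, reward) available to the forecaster
    total_reward = 0.0
    for t in range(1, n_rounds + 1):
        arm = choose_arm(t, history)                       # decision based only on the past
        reward = float(random.random() < arm_means[arm])   # bounded reward in {0, 1}
        history.append((arm, reward))
        total_reward += reward
    return total_reward

# Example: a forecaster that picks arms uniformly at random (a deliberately naive baseline).
means = [0.2, 0.5, 0.8]
print(play_bandit(1000, means, lambda t, h: random.randrange(len(means))))
```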

4 Regret after n plays I_1, ..., I_n
Regret after n plays:
\[ R_n = \max_{i=1,\ldots,K} \sum_{t=1}^{n} X_{i,t} - \sum_{t=1}^{n} X_{I_t,t} \]
The first term is the reward you would have collected by playing the single best arm throughout; the second term is the reward of the arms you actually played.
Expected regret: \(\mathbb{E}[R_n]\). Pseudo-regret:
\[ \bar{R}_n = \max_{i=1,\ldots,K} \mathbb{E}\!\left[ \sum_{t=1}^{n} X_{i,t} - \sum_{t=1}^{n} X_{I_t,t} \right] \]
Since a maximum of expectations is at most the expectation of the maximum, pseudo-regret ≤ expected regret.

5 Classification 1: Stochastic Bandit Problem

6 Classification 2: Adversarial Bandit Problem
Randomization: Shuffle cards every round

7 Classification 3: Markovian Bandit Problem

8 Tail bounds: Chernoff-Hoeffding bound; Bernstein inequality
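For reference, standard statements of these two bounds for i.i.d. rewards X_1, ..., X_n in [0,1] with mean μ (and, for Bernstein, variance at most σ²); the exact constants below follow the usual textbook forms and are stated here as an assumption:

```latex
% Chernoff-Hoeffding bound: the empirical mean concentrates exponentially fast.
\[
  \Pr\!\left(\frac{1}{n}\sum_{t=1}^{n} X_t - \mu \ge \varepsilon\right)
  \le \exp\!\left(-2 n \varepsilon^2\right).
\]

% Bernstein inequality: if additionally Var(X_t) <= sigma^2,
% the bound improves when the variance is small.
\[
  \Pr\!\left(\frac{1}{n}\sum_{t=1}^{n} X_t - \mu \ge \varepsilon\right)
  \le \exp\!\left(-\frac{n \varepsilon^2}{2\sigma^2 + \tfrac{2}{3}\varepsilon}\right).
\]
```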

9 Overview: Introduction, Basic Classification, Bounds, Algorithms, Variants, One Example

10 Stochastic Bandits: Some Definitions
Moment condition for a convex function ψ; Legendre-Fenchel transform; the (α,ψ)-UCB algorithm, which trades off exploitation against exploration.
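The slide refers to the (α,ψ)-UCB family; as a concrete, hedged illustration, here is the classical UCB1 index (the Hoeffding-based special case), with illustrative Bernoulli rewards:

```python
import math
import random

def ucb1(n_rounds, arm_means):
    """UCB1: pull every arm once, then always pull the arm maximizing
    (empirical mean) + sqrt(2 * ln t / n_i)."""
    k = len(arm_means)
    counts = [0] * k                        # n_i: number of pulls of arm i
    sums = [0.0] * k                        # cumulative reward of arm i
    total = 0.0
    for t in range(1, n_rounds + 1):
        if t <= k:
            arm = t - 1                     # initialization: play each arm once
        else:
            arm = max(range(k),
                      key=lambda i: sums[i] / counts[i]                        # exploitation
                                    + math.sqrt(2 * math.log(t) / counts[i]))  # exploration
        reward = float(random.random() < arm_means[arm])   # illustrative Bernoulli reward
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total

print(ucb1(10000, [0.2, 0.5, 0.8]))
```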

11 Stochastic Bandits: Upper and lower bounds

12 Stochastic Bandits: Variants
Second-order bounds (KL-UCB algorithm)
Distribution-free bounds (MOSS, improved UCB)
High-probability bounds
ε-greedy: first, pick a parameter 0 < ε < 1; then, at each step, greedily play the arm with the highest empirical mean reward with probability 1−ε, and play a uniformly random arm with probability ε (see the sketch below)
Thompson sampling
Heavy-tailed distributions
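A minimal sketch of the ε-greedy rule just described (the Bernoulli reward model is an illustrative assumption):

```python
import random

def epsilon_greedy(n_rounds, arm_means, eps=0.1):
    """epsilon-greedy: with probability 1 - eps play the arm with the highest
    empirical mean reward, with probability eps play a uniformly random arm."""
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    total = 0.0
    for _ in range(n_rounds):
        untried = [i for i in range(k) if counts[i] == 0]
        if untried:
            arm = untried[0]                               # make sure every arm is tried once
        elif random.random() < eps:
            arm = random.randrange(k)                      # explore
        else:
            arm = max(range(k), key=lambda i: sums[i] / counts[i])   # exploit
        reward = float(random.random() < arm_means[arm])   # illustrative Bernoulli reward
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total

print(epsilon_greedy(10000, [0.2, 0.5, 0.8], eps=0.05))
```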

13 Adversarial Bandits Algorithm

14 Adversarial Bandits: Upper and lower bounds

15 Adversarial Bandits: Variants
High-probability bounds
Log-free upper bounds
Adaptive bounds
Alternative feedback structures

16 Overview: Introduction, Basic Classification, Bounds, Algorithms, Variants, One Example

17 Contextual Bandits
Scenario: in personalized news article recommendation, the task is to select, from a pool of candidates, a news article to display whenever a new user visits a website. The articles correspond to arms, and a reward is obtained whenever the user clicks on the selected article.
Side information (context): for the user, this may include historical activities, demographic information, and geolocation; for the articles, we may have content information and categories.
Model: the world announces some context information x (think of this as a high-dimensional bit vector if that helps); a policy chooses arm a from 1 of k arms (i.e., 1 of k ads); the world reveals the reward r_a of the chosen arm (i.e., whether the ad is clicked on). A sketch of this loop follows below.
Expert case
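A minimal sketch of this context/arm/reward loop (the hidden reward model and the placeholder policy are illustrative assumptions, not an algorithm from the slides):

```python
import random

def run_contextual_bandit(n_rounds, n_arms, dim, policy):
    """Interaction loop: the world announces a context x, the policy chooses an arm,
    and only the reward of the chosen arm is revealed (bandit feedback)."""
    # Hypothetical world: each arm has hidden weights; the click probability is the
    # normalized match between the context and those weights (purely illustrative).
    hidden = [[random.random() for _ in range(dim)] for _ in range(n_arms)]
    total = 0.0
    for t in range(n_rounds):
        x = [random.randint(0, 1) for _ in range(dim)]     # context: a high-dimensional bit vector
        arm = policy(x, n_arms)                             # policy picks 1 of k arms
        score = sum(w * xi for w, xi in zip(hidden[arm], x)) / dim
        reward = float(random.random() < score)             # reward = whether the "ad" is clicked
        total += reward
    return total

# Placeholder policy that ignores the context; a real contextual policy (e.g. LinUCB) would use x.
print(run_contextual_bandit(1000, n_arms=5, dim=20, policy=lambda x, k: random.randrange(k)))
```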

18 Linear Bandits
The loss at each step is a function defined on the arm set K, and the task is to pick an arm as close as possible to the minimizer of the loss function at hand.
Online linear optimization: the forecaster chooses x_t ∈ K; simultaneously, the adversary chooses a loss vector ℓ_t from some fixed and known subset L.
John's theorem; Exp2 (Expanded Exp) algorithm; Online Mirror Descent (OMD) algorithm (a sketch of the OMD update follows below).
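As a simplified illustration of the OMD update mentioned above, here is mirror descent with the negative-entropy regularizer on the probability simplex in the full-information setting (the bandit version would additionally replace the losses with importance-weighted estimates; the step size and random losses are illustrative assumptions):

```python
import math
import random

def omd_entropic(loss_vectors, eta=0.1):
    """Online Mirror Descent with the negative-entropy regularizer on the simplex:
    the mirror update reduces to a multiplicative-weights step  w_i <- w_i * exp(-eta * l_i)."""
    k = len(loss_vectors[0])
    w = [1.0 / k] * k                        # start at the uniform distribution
    total_loss = 0.0
    for loss in loss_vectors:                # full information: the whole loss vector is revealed
        total_loss += sum(wi * li for wi, li in zip(w, loss))
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, loss)]
        z = sum(w)
        w = [wi / z for wi in w]             # normalization = Bregman projection onto the simplex
    return total_loss

# Illustrative run on random losses in [0, 1].
losses = [[random.random() for _ in range(5)] for _ in range(1000)]
print(omd_entropic(losses, eta=0.05))
```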

19 Nonlinear Bandits
Losses are nonlinear functions of the arms.
Algorithms: two-point vs. one-point bandit feedback; stochastic bandits. The gradient estimators usually behind these two feedback models are sketched below.
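For orientation, the standard single-point and two-point gradient estimators (standard constructions, not taken from the slides): with u drawn uniformly from the unit sphere in R^d and a small smoothing radius δ,

```latex
% One-point estimator: one function evaluation per round, unbiased for a smoothed loss.
\[
  \hat{g}_t \;=\; \frac{d}{\delta}\, f_t(x_t + \delta u)\, u ,
  \qquad u \sim \mathrm{Unif}(\mathbb{S}^{d-1}).
\]

% Two-point estimator: two evaluations per round, much lower variance.
\[
  \hat{g}_t \;=\; \frac{d}{2\delta}\,\bigl(f_t(x_t + \delta u) - f_t(x_t - \delta u)\bigr)\, u .
\]
```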

20 Other Variants
Markov Decision Processes, restless and sleeping bandits
Pure exploration problems
Dueling bandits
Discovery with probabilistic expert advice
Many-armed bandits
Truthful bandits

21 Overview: Introduction, Basic Classification, Bounds, Algorithms, Variants, One Example

22 Max Effort Cover (MEC)
Sequential Learning for Passive Monitoring of Multichannel Wireless Networks, Thanh Le, MS thesis, 2013.
Objective: find the best set of assignments (sniffer to channel) to capture the activity of users with the highest probability.

23 Greedy algorithm (figure comparing the problem instance, the optimal assignment, and the greedy assignment)
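A hedged sketch of the greedy assignment idea for the objective on the previous slide (the coverage sets, activity probabilities, and function names are illustrative assumptions, not the exact algorithm from the cited thesis):

```python
def greedy_assignment(sniffers, channels, covers, user_prob):
    """Greedy sniffer-to-channel assignment: repeatedly give an unassigned sniffer the
    channel whose newly covered users add the most expected activity.
    covers[(s, c)] is the set of users sniffer s can observe on channel c;
    user_prob[u] is the probability that user u is active (illustrative model)."""
    assignment = {}
    covered = set()
    for _ in sniffers:
        best, best_gain = None, -1.0
        for s in sniffers:
            if s in assignment:
                continue
            for c in channels:
                gain = sum(user_prob[u] for u in covers.get((s, c), set()) - covered)
                if gain > best_gain:
                    best, best_gain = (s, c), gain
        if best is None:
            break
        s, c = best
        assignment[s] = c
        covered |= covers.get((s, c), set())
    return assignment

# Tiny illustrative instance: 2 sniffers, 2 channels.
covers = {(0, 'ch1'): {'u1', 'u2'}, (0, 'ch2'): {'u3'},
          (1, 'ch1'): {'u2'}, (1, 'ch2'): {'u3', 'u4'}}
probs = {'u1': 0.9, 'u2': 0.5, 'u3': 0.7, 'u4': 0.2}
print(greedy_assignment([0, 1], ['ch1', 'ch2'], covers, probs))
```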

24 Multi-agent idea
Correlation-exploiting algorithms:
Advantage: highly accurate information about the channel.
Drawback: computational complexity.

25 Domino effect: Reward seen by agents

26 Domino effect: Reward seen by agents

27 Domino effect: Reward seen by agents

28 ε-Greedy-Agent-approx

29 Simulation results: small regret

30 Conclusion
Multi-armed bandit problem: a gambler at a row of slot machines has to decide which machines to play, how many times to play each machine, and in which order to play them.
Three categories: stochastic, adversarial, and Markovian.
Many algorithms with regret bounds.
Many variants for a variety of applications.
Cognitive energy harvesting? Powercast and USRP testbed: which category and which algorithm? What is unique about such a scenario? How to set up a reasonable test?

31 References
Gilles Stoltz, "Introduction to stochastic and adversarial multi-armed bandit problems, with some recent results of the French guys," slides.
Sébastien Bubeck and Nicolò Cesa-Bianchi, "Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems," Foundations and Trends in Machine Learning, vol. 5, no. 1, pp. 1-122, Dec. 2012.
Nicolò Cesa-Bianchi and Gábor Lugosi, "Prediction, Learning, and Games," Cambridge University Press, 2006.
Rong Zheng, Thanh Le, and Zhu Han, "Approximate Online Learning Algorithms for Optimal Monitoring in Multi-channel Wireless Networks," IEEE Transactions on Wireless Communications, vol. 13, no. 2, February 2014.
Rong Zheng, Thanh Le, and Zhu Han, "Approximate Online Learning for Passive Monitoring of Multi-channel Wireless Networks," The 32nd IEEE International Conference on Computer Communications (INFOCOM), Turin, Italy, April 2013.

32 Thank you

