Basics of Multi-armed Bandit Problems

Basics of Multi-armed Bandit Problems Zhu Han Department of Electrical and Computer Engineering University of Houston, TX, USA Sep. 2016

Overview: Introduction, Basic Classification, Bounds, Algorithms, Variants, One Example

A slot machine with K arms is available. At each step t of a repeated game, the forecaster pulls an arm I_t and receives a bounded reward X_{I_t,t} associated with it. He only observes the reward of the arm he chose; he does not observe the reward he would have received had he chosen a different arm. Thus, from round t ≥ 2 on, he can only base his decision on past observations. His aim is to maximize the sum of the obtained rewards (in expectation or with high probability).
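This protocol is easy to make concrete. A minimal Python sketch, assuming Bernoulli reward distributions and a placeholder uniform policy (both are illustrative assumptions; the slides specify neither):

```python
import random

K = 3
true_means = [0.2, 0.5, 0.7]   # unknown to the forecaster
n = 1000

rewards = []
for t in range(n):
    arm = random.randrange(K)                          # placeholder policy: uniform play
    x = 1.0 if random.random() < true_means[arm] else 0.0
    rewards.append(x)                                  # only the chosen arm's reward is seen

print("cumulative reward:", sum(rewards))
```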

Regret after n plays I_1, …, I_n:

R_n = max_{i=1,…,K} Σ_{t=1}^n X_{i,t} − Σ_{t=1}^n X_{I_t,t}

The first term is the cumulative reward of always playing the single best arm in hindsight; the second term is the cumulative reward of your actual plays.

Expected regret: E[R_n]
Pseudo-regret: R̄_n = max_{i=1,…,K} E[ Σ_{t=1}^n X_{i,t} − Σ_{t=1}^n X_{I_t,t} ]

Since a maximum of expectations is at most the expectation of the maximum, Pseudo-regret ≤ Expected regret.

Classification 1: Stochastic Bandit Problem. Each arm i is associated with a fixed, unknown distribution ν_i, and the rewards X_{i,t} are drawn i.i.d. from ν_i.

Classification 2: Adversarial Bandit Problem. Rewards are chosen by an adversary rather than drawn from fixed distributions, so the forecaster must randomize. Intuition for the randomization: shuffle the cards every round, so the adversary cannot anticipate the next pull.

Classification 3: Markovian Bandit Problem. Each arm is a Markov process whose state (and hence reward) evolves when the arm is played; in the classical discounted setting the Gittins index characterizes the optimal policy.

Tail bounds: the Chernoff–Hoeffding bound (for sums of bounded i.i.d. variables) and the Bernstein inequality (which sharpens it when the variance is small).
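The formulas themselves did not survive the transcript; in their standard forms, as used in the Bubeck and Cesa-Bianchi survey cited at the end, they read:

```latex
% Chernoff--Hoeffding: X_1, ..., X_n i.i.d. in [0,1] with mean \mu
\Pr\Big( \tfrac{1}{n}\sum_{t=1}^{n} X_t \ge \mu + \varepsilon \Big) \le e^{-2 n \varepsilon^2}

% Bernstein: additionally Var(X_t) \le \sigma^2 and |X_t - \mu| \le 1
\Pr\Big( \sum_{t=1}^{n} (X_t - \mu) \ge u \Big) \le \exp\Big( -\frac{u^2}{2\,(n\sigma^2 + u/3)} \Big)
```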

Overview: Introduction, Basic Classification, Bounds, Algorithms, Variants, One Example

Stochastic Bandits: Some Definitions. Assume a moment condition on the rewards through a convex function ψ: ln E[e^{λ(X−EX)}] ≤ ψ(λ) and ln E[e^{λ(EX−X)}] ≤ ψ(λ) for all λ ≥ 0. Its Legendre–Fenchel transform is ψ*(ε) = sup_λ (λε − ψ(λ)). The (α,ψ)-UCB algorithm plays the arm maximizing the empirical mean (exploitation) plus a confidence bonus derived from ψ* (exploration).
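For rewards in [0,1], taking ψ(λ) = λ²/8 gives ψ*(ε) = 2ε² and recovers the familiar UCB1-style index. A minimal sketch, again assuming Bernoulli rewards for the simulation:

```python
import math, random

def ucb1(true_means, n, alpha=2.0):
    """UCB sketch: empirical mean plus a log-based exploration bonus."""
    K = len(true_means)
    counts = [0] * K
    sums = [0.0] * K
    for t in range(1, n + 1):
        if t <= K:                               # play each arm once to initialize
            arm = t - 1
        else:
            arm = max(range(K),
                      key=lambda i: sums[i] / counts[i]
                      + math.sqrt(alpha * math.log(t) / counts[i]))
        x = 1.0 if random.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += x
    return counts                                # suboptimal arms get O(log n) pulls

print(ucb1([0.2, 0.5, 0.7], 5000))
```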

Stochastic Bandits: Upper and lower bounds. For bounded rewards, (α,ψ)-UCB with α > 2 guarantees pseudo-regret of order Σ_{i: Δ_i > 0} (ln n)/Δ_i, where Δ_i is the gap between the mean of the optimal arm and that of arm i. The Lai–Robbins lower bound shows this logarithmic growth is unavoidable: any uniformly good strategy incurs Ω(ln n) regret, with constants governed by Kullback–Leibler divergences between the arm distributions.

Stochastic Bandits: Variants
- Second-order bounds (KL-UCB algorithm)
- Distribution-free bounds (MOSS, improved UCB)
- High-probability bounds
- ε-greedy: first, pick a parameter 0 < ε < 1; then, at each step, greedily play the arm with the highest empirical mean reward with probability 1−ε, and play a random arm with probability ε
- Thompson sampling (a sketch follows this list)
- Heavy-tailed reward distributions
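Of these, Thompson sampling is short enough to sketch in full; the Beta-Bernoulli conjugate model below is an assumption for illustration:

```python
import random

def thompson(true_means, n):
    """Thompson sampling for Bernoulli arms with Beta(1,1) priors."""
    K = len(true_means)
    a = [1] * K                 # successes + 1
    b = [1] * K                 # failures + 1
    for _ in range(n):
        # Sample a mean from each posterior and play the argmax.
        arm = max(range(K), key=lambda i: random.betavariate(a[i], b[i]))
        if random.random() < true_means[arm]:
            a[arm] += 1
        else:
            b[arm] += 1
    return a, b                 # posterior counts concentrate on the best arm

print(thompson([0.2, 0.5, 0.7], 5000))
```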

Adversarial Bandits: Algorithm. The reference algorithm is Exp3 (exponential weights for exploration and exploitation): maintain a weight per arm, sample from the induced distribution mixed with uniform exploration, and update with importance-weighted reward estimates; see the sketch below.
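A minimal sketch of Exp3 with the standard importance-weighted update (the exploration rate γ and the toy reward function are illustrative choices, not values from the slides):

```python
import math, random

def exp3(reward_fn, K, n, gamma=0.1):
    """Exp3 sketch: only the chosen arm's reward is observed each round."""
    w = [1.0] * K
    for t in range(n):
        total = sum(w)
        p = [(1 - gamma) * wi / total + gamma / K for wi in w]
        arm = random.choices(range(K), weights=p)[0]
        x = reward_fn(arm, t)                    # reward in [0, 1]
        w[arm] *= math.exp(gamma * (x / p[arm]) / K)   # importance-weighted estimate
        m = max(w)
        w = [wi / m for wi in w]                 # rescale to avoid overflow
    return w

# Toy run: three Bernoulli arms; the heaviest weight should land on arm 2.
print(exp3(lambda a, t: float(random.random() < [0.2, 0.5, 0.7][a]), 3, 5000))
```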

Adversarial Bandits: Upper and lower bounds. Exp3 achieves pseudo-regret O(√(n K ln K)); the minimax lower bound is Ω(√(n K)), so Exp3 is optimal up to the √(ln K) factor (which the INF algorithm removes).

Adversarial Bandits: Variants
- High-probability bounds
- Log-free upper bounds
- Adaptive bounds
- Alternative feedback structures

Overview: Introduction, Basic Classification, Bounds, Algorithms, Variants, One Example

Contextual Bandits
Scenario: in personalized news article recommendation, the task is to select, from a pool of candidates, a news article to display whenever a new user visits a website. The articles correspond to arms, and a reward is obtained whenever the user clicks on the selected article.
Side information (context): for the user, this may include historical activities, demographic information, and geolocation; for the articles, content information and categories.
Model: the world announces some context information x (think of this as a high-dimensional bit vector if that helps). A policy chooses arm a from one of k arms (i.e., one of k ads). The world reveals the reward r_a of the chosen arm (i.e., whether the ad is clicked on).
Expert case: compete against a finite set of policies (experts) mapping contexts to arms.
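One widely used algorithm for exactly this news-recommendation setting is LinUCB with disjoint linear models (not named on the slide, so an assumption here); a minimal sketch using plain matrix inverses rather than anything optimized:

```python
import numpy as np

class LinUCB:
    """Disjoint-model LinUCB sketch: one ridge regression per arm."""
    def __init__(self, K, d, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(K)]    # ridge Gram matrix per arm
        self.b = [np.zeros(d) for _ in range(K)]  # reward-weighted context sums

    def choose(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            Ainv = np.linalg.inv(A)
            theta = Ainv @ b                      # ridge estimate for this arm
            scores.append(theta @ x + self.alpha * np.sqrt(x @ Ainv @ x))
        return int(np.argmax(scores))             # mean estimate + confidence width

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```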

Linear Bandits. The arm set is a compact set K ⊂ R^d; the loss at each step is a linear function on K, and the task is to pick a point as close as possible to the minimizer of the loss function at hand. Online linear optimization: the forecaster chooses x_t ∈ K; simultaneously, the adversary chooses a loss vector ℓ_t from some fixed and known subset L, and the forecaster suffers ℓ_t · x_t. Tools: John's theorem (for building exploration distributions), the Exp2 (expanded Exp) algorithm, and the Online Mirror Descent (OMD) algorithm.
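With the negative-entropy mirror map on the probability simplex, one OMD step reduces to an exponentiated-gradient update followed by renormalization; a minimal sketch of the full-information step (the bandit version would replace the gradient by an importance-weighted estimate):

```python
import math

def omd_entropy_step(x, loss_grad, eta):
    """One OMD step with the negative-entropy mirror map on the simplex."""
    w = [xi * math.exp(-eta * g) for xi, g in zip(x, loss_grad)]
    s = sum(w)
    return [wi / s for wi in w]   # projection back to the simplex = renormalization

# Example: a constant gradient favoring coordinate 0 shifts mass toward it.
print(omd_entropy_step([1/3, 1/3, 1/3], [0.1, 0.5, 0.9], eta=1.0))
```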

Nonlinear Bandits. Losses are nonlinear (e.g., convex) functions of the arms. Algorithms differ by feedback model: two-point bandit feedback (the loss may be queried at two points per round) versus one-point feedback (a single evaluation per round, which forces gradient estimation from one sample); the stochastic setting is treated separately.

Other Variants
- Markov decision processes; restless and sleeping bandits
- Pure exploration problems
- Dueling bandits
- Discovery with probabilistic expert advice
- Many-armed bandits
- Truthful bandits

Overview: Introduction, Basic Classification, Bounds, Algorithms, Variants, One Example

Max Effort Cover (MEC). From "Sequential Learning for Passive Monitoring of Multichannel Wireless Networks," Thanh Le, MS thesis, 2013. Objective: find the best set of assignments (sniffer to channel) that captures the activity of users with the highest probability.

Greedy algorithm. The optimal assignment problem is combinatorial; a greedy heuristic that repeatedly adds the assignment with the largest marginal coverage gain is compared against the optimal solution (a sketch follows).
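The exact formulation is not in the transcript; as a rough illustration, treating sniffer-to-channel assignment as weighted maximum coverage, the greedy rule adds the pair with the largest marginal gain. All names below are hypothetical:

```python
def greedy_mec(sniffers, channels, coverage, weight):
    """Hypothetical sketch: assign each sniffer to one channel, greedily
    maximizing the marginal weight of newly covered users.
    coverage[s][c] = set of users sniffer s hears on channel c;
    weight[u] = activity probability of user u."""
    covered, assignment = set(), {}
    unassigned = set(sniffers)
    while unassigned:
        # Pick the (sniffer, channel) pair with the largest marginal gain.
        s, c = max(((s, c) for s in unassigned for c in channels),
                   key=lambda sc: sum(weight[u]
                                      for u in coverage[sc[0]][sc[1]] - covered))
        assignment[s] = c
        covered |= coverage[s][c]
        unassigned.remove(s)
    return assignment
```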

Multi-agent idea: correlation-exploiting algorithms. Advantage: highly accurate information about the channel. Drawback: computational complexity.

Domino effect: Reward seen by agents

ε-Greedy-Agent-approx

Simulation results: the proposed algorithm achieves small regret.

Conclusion
Multi-armed bandit problem: a gambler at a row of slot machines has to decide which machines to play, how many times to play each machine, and in which order to play them.
Three categories: stochastic, adversarial, and Markovian.
Many algorithms come with regret bounds, and many variants exist for a variety of applications.
Open direction, cognitive energy harvesting (Powercast and USRP testbed): which category and which algorithm fit? What is unique about such a scenario? How should a reasonable test be set up?

References
Gilles Stoltz, "Introduction to stochastic and adversarial multi-armed bandit problems, with some recent results of the French guys," slides.
Sébastien Bubeck and Nicolò Cesa-Bianchi, "Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems," Foundations and Trends in Machine Learning, vol. 5, no. 1, pp. 1-122, Dec. 2012.
Nicolò Cesa-Bianchi and Gábor Lugosi, Prediction, Learning, and Games, Cambridge University Press, 2006.
Rong Zheng, Thanh Le, and Zhu Han, "Approximate Online Learning Algorithms for Optimal Monitoring in Multi-channel Wireless Networks," IEEE Transactions on Wireless Communications, vol. 13, no. 2, pp. 1023-1033, Feb. 2014.
Rong Zheng, Thanh Le, and Zhu Han, "Approximate Online Learning for Passive Monitoring of Multi-channel Wireless Networks," IEEE INFOCOM, Turin, Italy, April 2013.

Thank you