Thompson Sampling for learning in online decision making


Thompson Sampling for learning in online decision making. Shipra Agrawal, IEOR and Data Science Institute, Columbia University.

Learning from sequential interactions: a tradeoff between information and revenue, between learning and optimization, between exploration and exploitation.

Managing the exploration-exploitation tradeoff: the multi-armed bandit problem (Thompson 1933; Robbins 1952). Multiple rigged slot machines in a casino: which one to put money on? Try each one out. An old, crisp, and clean formulation of the problem in statistics: when to stop trying (exploration) and start playing (exploitation)?

Stochastic multi-armed bandit problem. Online decisions: at every time step $t=1,\dots,T$, pull one arm out of $N$ arms. Stochastic feedback: for each arm $i$, the reward is generated i.i.d. from a fixed but unknown distribution with support $[0,1]$ and mean $\mu_i$. Bandit feedback: only the reward of the pulled arm can be observed.

Measure of performance. Maximize the expected reward in time $T$: $E\big[\sum_{t=1}^{T} \mu_{i_t}\big]$. Equivalently, minimize the expected regret in time $T$. The optimal arm is the arm with expected reward $\mu^* = \max_j \mu_j$, and the expected regret for playing arm $i$ is $\Delta_i = \mu^* - \mu_i$. The expected regret at any time $T$ is $\mathrm{Regret}(T) = \sum_i \Delta_i\, E[k_i(T)]$, where $k_i(T)$ is the number of times arm $i$ is pulled up to time $T$. Anytime algorithm: the time horizon $T$ is not known.
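The regret definition above translates directly into simulation. Below is a minimal sketch (assumptions of this example: Bernoulli rewards with illustrative means and a placeholder uniform-random policy) of how $\mathrm{Regret}(T)=\sum_i \Delta_i E[k_i(T)]$ can be estimated from pull counts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative Bernoulli bandit: the true means are unknown to the learner.
mu = np.array([0.3, 0.5, 0.45])
mu_star = mu.max()
T = 10_000

pull_counts = np.zeros(len(mu))
for t in range(T):
    i = rng.integers(len(mu))        # placeholder policy: pull a uniformly random arm
    reward = rng.random() < mu[i]    # Bernoulli(mu_i) reward (bandit feedback)
    pull_counts[i] += 1

# Regret(T) = sum_i Delta_i * E[k_i(T)], estimated from the realized pull counts.
delta = mu_star - mu
print("estimated regret:", np.dot(delta, pull_counts))
```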

Summary of the talk: an introduction to Thompson Sampling as a very general algorithmic methodology for MAB, and provably near-optimal adaptations for linear contextual bandits, assortment selection, and reinforcement learning. The focus is on the algorithmic methodology.

References. Papers this talk derives from: Provably optimal [A. and Goyal, COLT 2012, AISTATS 2013, JACM 2017]. Handling personalization, contexts, and large numbers of products [A. and Goyal, ICML 2013], [Kocák, Valko, Munos, A., AAAI 2014]. Assortment selection and consumer choice behavior [A., Avadhanula, Goyal, Zeevi, EC 2016, COLT 2017]. Reinforcement learning [A. and Jia, NIPS 2017]. (Many) other related research articles [Li et al. 2011, Kaufmann et al. 2012, Russo and Van Roy 2013, 2014, 2016, Osband and Van Roy 2016, ...]. A central piece of my investigation into this problem has been the analysis of a very old and popular Bayesian heuristic called Thompson Sampling. Despite being easy to implement and having good empirical performance, this heuristic did not have a rigorous theoretical understanding. Our work showed that this simple and popular heuristic achieves the optimal regret bounds possible for every instance of the multi-armed bandit problem. Of course, when this heuristic was proposed, today's concerns of personalization and millions of products were not considered; several of my works focus on extending and analyzing this heuristic to incorporate these important aspects. In addition to the extensions mentioned so far, one exciting direction I am working on is handling exploration-exploitation in reinforcement learning, a more complex and general model of computers learning from their own past mistakes, which is a popular model for learning to play games.

The need for exploration. Two arms, black and red, give random rewards with unknown means $\mu_1 = 1.1$ and $\mu_2 = 1$. The optimal expected reward in $T$ time steps is $1.1 \times T$. Exploit-only strategy: use the current best estimate (MLE/empirical mean) of the unknown means to pick arms. A few initial trials can mislead the algorithm into playing the red action forever: 1.1, 1, 0.2, 1, 1, 1, 1, 1, 1, 1, ... The expected regret in $T$ steps is then close to $0.1 \times T$.
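To make this failure mode concrete, here is a small simulation sketch. Assumptions of this example: unit-variance Gaussian rewards with the means 1.1 and 1.0 from the slide, and a greedy rule that pulls each arm once and then always plays the current empirical-mean maximizer.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.1, 1.0])          # the first arm is better, but only slightly
T = 10_000

counts = np.zeros(2)
sums = np.zeros(2)
# Pull each arm once, then always exploit the current empirical mean (no exploration).
for i in range(2):
    counts[i] += 1
    sums[i] += rng.normal(mu[i], 1.0)
for t in range(T - 2):
    i = int(np.argmax(sums / counts))
    counts[i] += 1
    sums[i] += rng.normal(mu[i], 1.0)

# If the first pull of the better arm was unlucky, the greedy rule can play
# the worse arm essentially forever, giving regret that grows linearly in T.
print("pull counts:", counts, "regret ~", np.dot(mu.max() - mu, counts))
```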

Exploration-exploitation tradeoff. Exploitation: play the empirical mean reward maximizer. Exploration: play less explored actions to ensure the empirical estimates converge.

Thompson Sampling [Thompson, 1933]. A natural and efficient heuristic: maintain a belief about the effectiveness (mean reward) of each arm; observe feedback and update the belief of the pulled arm in a Bayesian manner; pull each arm with the posterior probability of it being the best arm. This is NOT the same as choosing the arm that is most likely to be best. For example, if a customer likes a product, update the belief to be more favorable for that product.

Bernoulli rewards, Beta priors. $\mathrm{Beta}(1,1)$ is the uniform distribution. A $\mathrm{Beta}(\alpha,\beta)$ prior gives posterior $\mathrm{Beta}(\alpha+1,\beta)$ if you observe 1, and $\mathrm{Beta}(\alpha,\beta+1)$ if you observe 0. Start with a $\mathrm{Beta}(1,1)$ prior belief for every arm. In round $t$: for every arm $i$, sample $\theta_{i,t}$ independently from the posterior $\mathrm{Beta}(S_{i,t}+1, F_{i,t}+1)$, where $S_{i,t}$ and $F_{i,t}$ are the successes and failures observed so far for arm $i$; play arm $i_t = \arg\max_i \theta_{i,t}$; observe the reward and update the Beta posterior for arm $i_t$.
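A minimal sketch of the Beta-Bernoulli Thompson Sampling loop described above (the function name and the particular arm means are illustrative assumptions).

```python
import numpy as np

def thompson_sampling_bernoulli(mu, T, seed=0):
    """Beta-Bernoulli Thompson Sampling; `mu` holds the (unknown) true arm means."""
    rng = np.random.default_rng(seed)
    n_arms = len(mu)
    successes = np.zeros(n_arms)   # S_{i,t}
    failures = np.zeros(n_arms)    # F_{i,t}
    for t in range(T):
        # Sample theta_i from the Beta(S_i + 1, F_i + 1) posterior of each arm.
        theta = rng.beta(successes + 1, failures + 1)
        i = int(np.argmax(theta))              # play the arm with the largest sample
        reward = rng.random() < mu[i]          # Bernoulli(mu_i) reward
        successes[i] += reward
        failures[i] += 1 - reward
    return successes, failures

S, F = thompson_sampling_bernoulli(mu=[0.3, 0.5, 0.45], T=5000)
print("pulls per arm:", S + F)
```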

Thompson Sampling with Gaussian priors. Suppose the reward for arm $i$ is i.i.d. $N(\mu_i, 1)$, with starting prior $N(0,1)$. A Gaussian prior and Gaussian likelihood give a Gaussian posterior $N\big(\hat\mu_{i,t}, \frac{1}{n_{i,t}+1}\big)$, where $\hat\mu_{i,t}$ is the empirical mean of the $n_{i,t}$ observations for arm $i$. Algorithm: sample $\tilde\mu_i$ from the posterior $N\big(\hat\mu_{i,t}, \frac{1}{n_{i,t}+1}\big)$ for each arm $i$; play arm $i_t = \arg\max_i \tilde\mu_i$; observe the reward and update the empirical mean for arm $i_t$.
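The Gaussian-prior variant looks almost identical in code. This is a sketch under the slide's assumptions (unit-variance rewards, $N(0,1)$ starting prior), with illustrative arm means.

```python
import numpy as np

def thompson_sampling_gaussian(mu, T, seed=0):
    """Thompson Sampling with N(0,1) priors and unit-variance Gaussian rewards."""
    rng = np.random.default_rng(seed)
    n_arms = len(mu)
    counts = np.zeros(n_arms)      # n_{i,t}
    means = np.zeros(n_arms)       # empirical means mu_hat_{i,t}
    for t in range(T):
        # Posterior of arm i is N(mu_hat_i, 1 / (n_i + 1)); draw one sample per arm.
        theta = rng.normal(means, 1.0 / np.sqrt(counts + 1))
        i = int(np.argmax(theta))
        reward = rng.normal(mu[i], 1.0)
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]   # running empirical mean
    return counts

print("pulls per arm:", thompson_sampling_gaussian(mu=[1.1, 1.0], T=5000))
```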

Thompson Sampling [Thompson 1933]. With more trials, the posteriors concentrate on the true parameters; the mode approaches the MLE, which enables exploitation. Fewer trials mean more uncertainty in the estimates; the spread (variance) captures this uncertainty, which enables exploration by giving the benefit of the doubt to arms that are less explored. In fact, it gives the OPTIMAL benefit of the doubt [A. and Goyal, COLT 2012, AISTATS 2013, JACM 2017]. A sample from the posterior is used as an estimate of the unknown parameters to make decisions.

Our results [A. and Goyal, COLT 2012, AISTATS 2013, JACM 2017]. Optimal instance-dependent bounds for Bernoulli rewards: $\mathrm{Regret}(T) \le (1+\epsilon)\,\ln(T) \sum_i \frac{\Delta_i}{\mathrm{KL}(\mu_i \,\|\, \mu^*)} + O\big(\frac{N}{\epsilon^2}\big)$, which matches the asymptotic lower bound [Lai and Robbins 1985]; the UCB algorithm achieves this only after careful tuning [Kaufmann et al. 2012]. Arbitrary bounded reward distributions: $\mathrm{Regret}(T) \le O\big(\ln(T) \sum_i \frac{1}{\Delta_i}\big)$, matching the best available bound for UCB for general reward distributions. Instance-independent bound: $\mathrm{Regret}(T) \le O(\sqrt{NT \ln T})$, against a lower bound of $\Omega(\sqrt{NT})$. Prior and likelihood mismatch is allowed, with a simple extension to arbitrary distributions with $[0,1]$ support.

Further challenges. Scalability and contextual response: TS for linear contextual bandits [A. and Goyal, ICML 2013; Kocák, Valko, Munos, A., AAAI 2014]. Externality in an assortment of arms: TS for dynamic assortment selection in the MNL choice model [Avadhanula, A., Goyal, Zeevi, COLT 2017]. State-dependent response: learning MDPs (reinforcement learning) [A. and Jia, NIPS 2017]. Other: budget/supply constraints and nonlinear utilities [EC 2014, SODA 2015, NIPS 2016, COLT 2016]; delayed anonymous feedback [ICML 2018].

Further challenges #1: scalability and context. Large numbers of products and customer types, and external conditions: can we utilize similarity? Content-based recommendation: customers and products are described by their features, and similar features mean similar preferences; parametric models map customer and product features to customer preferences. Contextual bandits: exploration-exploitation to learn the parametric models, with a large number of arms that are nevertheless related. Customization ("Recommended just for you"): the response depends not just on the assortment but on the customer profile (context); a customer's preference for a product depends on the (customer, product) features and unknown parameters.

Linear contextual bandits. $N$ arms, with possibly very large $N$, and a $d$-dimensional context (feature vector) $x_{i,t}$ for every arm $i$ and time $t$. Linear parametric model with unknown parameter $\theta$: the expected reward for arm $i$ at time $t$ is $x_{i,t} \cdot \theta$. The algorithm picks $x_t \in \{x_{1,t}, \dots, x_{N,t}\}$ and observes $r_t = x_t \cdot \theta + \eta_t$. The optimal arm depends on the context: $x_t^* = \arg\max_{x_{i,t}} x_{i,t} \cdot \theta$. Goal: minimize $\mathrm{Regret}(T) = \sum_t \big(x_t^* \cdot \theta - x_t \cdot \theta\big)$. Arms are products, and at every time $t$ a new user arrives; the optimal arm at any time is the arm with the best features at that time.

Thompson Sampling for linear contextual bandits. The (regularized) least squares solution $\hat\theta_t$ of the set of $t-1$ equations $x_s \cdot \theta = r_s$, $s = 1, \dots, t-1$, is $\hat\theta_t = B_t^{-1}\big(\sum_{s=1}^{t-1} x_s r_s\big)$, where $B_t = I + \sum_{s=1}^{t-1} x_s x_s^{\top}$, and $B_t^{-1}$ is the covariance matrix of this estimator. Gaussian prior and Gaussian likelihood give a Gaussian posterior: with a $N(0, I)$ starting prior on $\theta$ and reward distribution $N(\theta^{\top} x_{i,t}, 1)$ given $\theta$ and $x_{i,t}$, the posterior on $\theta$ at time $t$ is $N(\hat\theta_t, B_t^{-1})$. This observation ties least squares estimation nicely to TS.

Thompson Sampling for linear contextual bandits [A. and Goyal, ICML 2013]. Algorithm, at step $t$: sample $\tilde\theta_t$ from $N(\hat\theta_t, v^2 B_t^{-1})$; pull the arm with feature $x_t = \arg\max_{x_{i,t}} x_{i,t} \cdot \tilde\theta_t$. This algorithm can be applied for any likelihood, with starting prior $N(0, v^2 I)$, using a slightly scaled variance.
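A sketch of this linear TS loop, under illustrative assumptions that are not part of the original analysis: simulated Gaussian contexts, a known noise level, and a brute-force argmax over the $N$ candidate feature vectors.

```python
import numpy as np

def linear_ts(contexts, theta_true, v=1.0, noise=0.1, seed=0):
    """Thompson Sampling for linear contextual bandits (illustrative sketch).

    `contexts` is a (T, N, d) array: N candidate feature vectors per round."""
    rng = np.random.default_rng(seed)
    T, N, d = contexts.shape
    B = np.eye(d)                       # B_t = I + sum_s x_s x_s^T
    f = np.zeros(d)                     # sum_s x_s r_s
    regret = 0.0
    for t in range(T):
        theta_hat = np.linalg.solve(B, f)               # regularized least squares
        B_inv = np.linalg.inv(B)
        theta_tilde = rng.multivariate_normal(theta_hat, v**2 * B_inv)
        scores = contexts[t] @ theta_tilde
        i = int(np.argmax(scores))                      # arm with the best sampled score
        x = contexts[t, i]
        r = x @ theta_true + rng.normal(0.0, noise)     # observe noisy linear reward
        B += np.outer(x, x)
        f += x * r
        regret += np.max(contexts[t] @ theta_true) - x @ theta_true
    return regret

rng = np.random.default_rng(1)
ctx = rng.normal(size=(2000, 10, 5))
print("regret:", linear_ts(ctx, theta_true=rng.normal(size=5)))
```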

Our results [A. and Goyal, ICML 2013]. With probability $1-\delta$, $\mathrm{Regret}(T) \le \tilde O(d^{3/2}\sqrt{T})$, hiding logarithmic factors in $T$ and $1/\delta$. This holds for any likelihood and unknown prior, assuming only bounded or sub-Gaussian noise, and has no dependence on the number of arms. The lower bound is $\Omega(d\sqrt{T})$. For UCB, the best bound is $\tilde O(d\sqrt{T})$ [Dani et al. 2008, Abbasi-Yadkori et al. 2011]; the best earlier bound for a polynomial-time algorithm was $\tilde O(d^{3/2}\sqrt{T})$ [Dani et al. 2008].

Further challenges #2: assortment selection as a multi-armed bandit. Arms are products; limited display space allows only $k$ products at a time. Challenge: the customer response to one product is influenced by the other products in the assortment, so arms are no longer independent (think of Amazon product recommendations). With a combinatorially large number of product combinations, we cannot learn about every combination; how do we learn about individual products?

Assortment optimization with the multinomial logit (MNL) choice model: the probability of choosing product $i$ from assortment $S$ is $p_i(S) = \frac{e^{v_i}}{1+\sum_{j \in S} e^{v_j}}$ (independence of irrelevant alternatives). The expected revenue on offering assortment $S$ is $R(S, \mathbf{v}) = \sum_{i \in S} r_i\, p_i(S)$, and the optimal assortment is $S^* = \arg\max_{|S| \le k} R(S, \mathbf{v})$. More generally, for product $i$ with feature vector $x_{i,t}$ in assortment $S$: $p_{i,t}(S) = \frac{e^{\theta^{\top} x_{i,t}}}{1+\sum_{j \in S} e^{\theta^{\top} x_{j,t}}}$, so the log-ratio of choice probabilities is linear in the features.
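For small $N$, the MNL choice probabilities, expected revenue, and optimal assortment can be computed directly. The sketch below does this by brute force with illustrative parameters; this is only a sanity-check implementation, not the efficient assortment-optimization routines used in practice.

```python
import numpy as np
from itertools import combinations

def choice_probs(S, v):
    """MNL choice probabilities p_i(S) = e^{v_i} / (1 + sum_{j in S} e^{v_j})."""
    denom = 1.0 + np.sum(np.exp(v[S]))
    return np.exp(v[S]) / denom          # probability of choosing each product in S

def expected_revenue(S, v, r):
    """R(S, v) = sum_{i in S} r_i p_i(S)."""
    return float(np.dot(r[S], choice_probs(S, v)))

def best_assortment(v, r, k):
    """Brute-force argmax over assortments of size <= k (fine for small N)."""
    N = len(v)
    best, best_rev = (), 0.0
    for size in range(1, k + 1):
        for S in combinations(range(N), size):
            rev = expected_revenue(list(S), v, r)
            if rev > best_rev:
                best, best_rev = S, rev
    return best, best_rev

v = np.array([0.5, 0.2, 0.9, -0.3])      # illustrative MNL parameters
r = np.array([1.0, 1.5, 0.8, 2.0])       # per-product revenues
print(best_assortment(v, r, k=2))
```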

Online assortment selection. $N$ products with unknown parameters $v_1, v_2, \dots, v_N$. At every step $t$, recommend an assortment $S_t$ of size at most $K$, observe the customer choice $i_t$ and revenue $r_{i_t}$ (with expected value $R(S_t, \mathbf{v})$), and update the parameter estimates. Goal: optimize the total expected revenue $E\big[\sum_t r_{i_t}\big]$, or minimize the regret compared to the optimal assortment $S^* = \arg\max_S \sum_i r_i\, p_i(S) = \arg\max_S R(S, \mathbf{v})$.

Thompson Sampling structure. Initialization: specify a belief on the parameters $\mathbf{v}$, i.e., a prior distribution $P(\mathbf{v})$. Step $t$: sample $\tilde{\mathbf{v}}$ from the posterior belief $P(\mathbf{v} \mid H_{t-1})$; offer the optimal assortment consistent with this belief, $\tilde S = \arg\max_S R(S, \tilde{\mathbf{v}})$; observe purchase/no purchase $r_t$ and update the belief, $P(\mathbf{v} \mid H_t) \propto P(r_t \mid \mathbf{v}) \times P(\mathbf{v} \mid H_{t-1})$.

Main challenge and techniques. Censored feedback: the feedback for product $i$ is affected by the other products in the assortment; we obtain an unbiased estimate of $v_i / v_0$ by repeatedly offering the same assortment until a no-purchase occurs. What is a good structure for the prior/posterior? Correlated priors and boosted sampling. The posterior is difficult to update exactly, so we use approximate Gaussian posterior updates.

TS-based algorithm. Initialize $\hat\mu_i = 1$, $\hat\sigma_i^2 = 1$ for $i = 1, \dots, N$, and epoch $\ell = 1$. (Correlated sampling) Sample $K$ values $\theta^{(k)} \sim N(0,1)$ and set $\tilde v_i = \max_k \big(\hat\mu_i + \theta^{(k)} \hat\sigma_i\big)$. Offer $S_\ell = \arg\max_S R(S, \tilde{\mathbf{v}})$ repeatedly until no purchase occurs, tracking $n_{i,\ell}$, the number of times product $i$ is chosen in the epoch. For $i \in S_\ell$, update $\hat\mu_i$ to the mean of $n_{i,\ell}$ over the $L_i$ epochs in which product $i$ has been offered, and $\hat\sigma_i^2$ to the corresponding variance-based estimate; then repeat. That question is the focus of this talk: specifically, can we design a policy that suggests $S_t$ as a function of the history, in order to learn $\mathbf{v}$ and maximize the cumulative revenue over the selling period?
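Below is a simplified sketch of this epoch-based scheme, using the weight parameterization $p_i(S) = v_i/(1+\sum_{j \in S} v_j)$ with the no-purchase weight fixed at 1 and a brute-force assortment optimizer. The correlated sampling and the "offer until no purchase" loop follow the slide; the exact form of the variance update here is an illustrative approximation, not the paper's precise prescription.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
N, K, EPOCHS = 8, 3, 200
v_true = rng.uniform(0.1, 0.9, size=N)     # unknown preference weights, v_0 = 1
r = rng.uniform(0.5, 1.5, size=N)          # per-product revenues

def revenue(S, v):
    """Expected revenue R(S, v) under MNL weights v (no-purchase weight 1)."""
    S = list(S)
    return float(np.dot(r[S], v[S]) / (1.0 + v[S].sum()))

def best_assortment(v):
    """Brute-force argmax over assortments of size <= K."""
    return max((S for k in range(1, K + 1) for S in combinations(range(N), k)),
               key=lambda S: revenue(S, v))

obs = [[] for _ in range(N)]               # per-product counts n_{i,l} across past epochs
mu_hat, sigma_hat = np.ones(N), np.ones(N)
for _ in range(EPOCHS):
    # Correlated sampling: K shared N(0,1) samples, optimistic max for every product.
    theta = rng.standard_normal(K)
    v_tilde = np.maximum.reduce([mu_hat + t * sigma_hat for t in theta])
    S = best_assortment(np.clip(v_tilde, 1e-3, None))
    counts = np.zeros(N)
    while True:                            # offer S repeatedly until a no-purchase occurs
        w = np.concatenate(([1.0], v_true[list(S)]))
        choice = rng.choice(len(w), p=w / w.sum())
        if choice == 0:
            break
        counts[list(S)[choice - 1]] += 1
    for i in S:                            # approximate Gaussian posterior update per product
        obs[i].append(counts[i])
        mu_hat[i] = np.mean(obs[i])
        # Illustrative variance choice: shrink with the number of epochs i was offered.
        sigma_hat[i] = np.sqrt((np.var(obs[i]) + 1.0) / len(obs[i]))

print("estimated v:", np.round(mu_hat, 2))
print("true v:     ", np.round(v_true, 2))
```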

Results: dynamic assortment selection [A., Avadhanula, Goyal, Zeevi 2016, 2017]. Assumption: $v_i \le v_0$ (no purchase is more likely than any single-product purchase). Upper bound: for any $\mathbf{v}$, $\mathrm{Reg}(T, \mathbf{v}) \le O(\sqrt{NT \log T})$. Lower bound [Chen and Wang 2017]: $\mathrm{Reg}(T, \mathbf{v}) \ge \Omega(\sqrt{NT})$.

Different versions of TS (N=1000, K=10, random instances)

Homepage optimization at Flipkart: an assortment of widgets. For the home page team, the other teams are essentially black boxes, and its role is to pick the right subset of widgets. The requests can often be competing, and the screen space is very limited.

Substitution effects and demand cannibalization: similar recommendations. For example, two different teams can push similar products, such as "offers for you" and "discounts for you". If we don't account for substitution, we might end up wrongly estimating the attractiveness of these banners.

Numerical simulation

Numerical simulations (assortment optimization)

Further challenges #3: state-dependent response, i.e., reinforcement learning. In rounds $t = 1, \dots, T$: observe the state, take an action, and observe the response (a reward and a new state). Markov model: the response to an action depends on the state of the system. Applications: autonomous vehicle control, robot navigation, personalized medical treatments, inventory management, dynamic pricing, and more.

The reinforcement learning problem. The system dynamics are given by an unknown MDP $(S, A, P, r, s_0)$. In rounds $t = 1, \dots, T$: observe the state $s_t$, take an action $a_t$, observe a reward $r_t \in [0,1]$ with $E[r_t] = r(s_t, a_t)$, and observe the transition to the next state $s_{t+1}$, which occurs with probability $P_{s_t, a_t}(s_{t+1})$. (Slide figure: an example MDP with 6 states and 2 actions.)
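A tabular MDP of this kind is easy to represent explicitly. The sketch below uses arrays $P[s,a,\cdot]$ and $r[s,a]$ with the slide's sizes (6 states, 2 actions) but randomly generated, purely illustrative numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tabular MDP with 6 states and 2 actions (sizes as in the slide; numbers illustrative).
S, A = 6, 2
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
r = rng.uniform(0.0, 1.0, size=(S, A))       # expected reward r(s, a) in [0, 1]

def step(s, a):
    """One round of interaction: sample a reward and the next state from the MDP."""
    reward = rng.random() < r[s, a]           # e.g. a Bernoulli reward with mean r(s, a)
    s_next = rng.choice(S, p=P[s, a])
    return int(reward), s_next

s = 0
for t in range(5):
    a = rng.integers(A)                       # placeholder policy
    reward, s = step(s, a)
    print(f"t={t} action={a} reward={reward} next_state={s}")
```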

The reinforcement learning problem. Solution concept: an optimal policy, i.e., which action to take in which state, $\pi_t : S \to A$, taking action $\pi_t(s)$ in state $s$. The reward distributions $r_{s,a}$, the transition function $P_{s,a}$, and hence the optimal policy $\pi^*$ are all unknown; the algorithm must learn to solve the MDP from observations over $T$ rounds. Goal: maximize the total reward over $T$ rounds, i.e., minimize the regret in $T$ rounds, the difference between the total reward from playing the optimal policy for all rounds and the algorithm's total reward.

The need for exploration. There is uncertainty in rewards and state transitions: the reward distributions and transition probabilities are unknown. Exploration-exploitation: explore actions, states, and policies to learn the reward distributions and the transition model; exploit the (seemingly) best policy. (Slide figure: a two-state MDP with states S1 and S2, where one action gives a small immediate reward and the other occasionally transitions to S2, where rewards are much larger.) Here is a simple example to illustrate the problem. The game starts in state s1, with two actions available. Action a1 can take the process to state s2 with some probability, where the rewards are very high, so it is clearly the better action; however, its expected immediate reward is poor. The optimal policy is therefore to take a1 in both s1 and s2. What would Q-learning do? Initially, say all Q-values are initialized to 0: Q(s1,a1)=0, Q(s1,a2)=0, Q(s2,a1)=0; the algorithm does not know that state s2 is so good, or even that there is a way to get from s1 to s2. Suppose the algorithm tries a1 first, but in these few trials it does not observe a transition to s2; the Q-value of s2 never gets updated, and Q(s1,a1) is updated again and again to -1. The algorithm therefore tries a2, observes a +1 reward, and updates Q(s1,a2) to +1. It then chooses a2 again, and the Q-value is updated toward $1 + \gamma + \gamma^2 + \dots = \frac{1}{1-\gamma}$. Since a1 is not played and s2 is not visited, their Q-values never get updated. This example may seem extreme, but the situation can easily occur even in more reasonable-looking examples: before enough samples are obtained, less explored states or actions may appear bad due to empirical errors; if these estimates are used greedily to decide actions, the algorithm inevitably picks suboptimal actions, and since the estimates are updated only when a state is visited or an action is played, the estimates for unexplored states and actions never improve.

Posterior sampling. Assume for simplicity that the reward distribution is known; the algorithm then needs to learn the unknown transition probability vector $P_{s,a} = (P_{s,a}(1), \dots, P_{s,a}(S))$ for every $(s,a)$, i.e., $S^2 A$ parameters. In any state $s_t = s$ with action $a_t = a$, the observed next state $s_{t+1}$ is the outcome of a Multinoulli (categorical) trial with probability vector $P_{s,a}$.

Posterior sampling with Dirichlet priors. Given a prior $\mathrm{Dirichlet}(\alpha_1, \alpha_2, \dots, \alpha_S)$ on $P_{s,a}$, after a Multinoulli trial with outcome (new state) $i$, the Bayesian posterior on $P_{s,a}$ is $\mathrm{Dirichlet}(\alpha_1, \dots, \alpha_i + 1, \dots, \alpha_S)$. After $n_{s,a} = \alpha_1 + \dots + \alpha_S$ observations for a state-action pair $(s,a)$, the posterior mean vector is the empirical mean, $\hat P_{s,a}(i) = \frac{\alpha_i}{\alpha_1 + \dots + \alpha_S} = \frac{\alpha_i}{n_{s,a}}$, with variance bounded by $\frac{1}{n_{s,a}}$. With more trials of $(s,a)$, the posterior mean concentrates around the true probability.
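The Dirichlet update is a one-line counter increment. Here is a sketch for a single $(s,a)$ pair with an illustrative true transition row.

```python
import numpy as np

rng = np.random.default_rng(0)
S = 4
p_true = np.array([0.1, 0.2, 0.3, 0.4])    # unknown transition row P_{s,a}

alpha = np.ones(S)                          # Dirichlet(1, ..., 1) prior on P_{s,a}
for t in range(1000):
    i = rng.choice(S, p=p_true)             # observe the next state (a Multinoulli trial)
    alpha[i] += 1                           # Bayesian update: add 1 to the observed coordinate

n = alpha.sum()                             # includes the S pseudo-counts from the prior
print("posterior mean:", np.round(alpha / n, 3))        # concentrates around p_true
print("one posterior sample:", np.round(rng.dirichlet(alpha), 3))
```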

Posterior Sampling for RL (Thompson Sampling). Learning: maintain a Dirichlet posterior on $P_{s,a}$ for every $(s,a)$, starting with an uninformative prior such as $\mathrm{Dirichlet}(1,1,\dots,1)$; after round $t$, on observing the outcome $s_{t+1}$, update the posterior for state $s_t$ and action $a_t$. To decide an action: sample $\tilde P_{s,a}$ for every $(s,a)$, compute the optimal policy $\tilde\pi$ for the sampled MDP $(S, A, \tilde P, r, s_0)$, and choose $a_t = \tilde\pi(s_t)$. Exploration-exploitation: exploitation, because with more observations the Dirichlet posterior concentrates and $\tilde P_{s,a}$ approaches the empirical mean $\hat P_{s,a}$; exploration, because the anti-concentration of the Dirichlet ensures exploration of states, actions, and policies that are less explored.
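A compact sketch of this posterior-sampling loop, with simplifying assumptions: rewards are known, the sampled MDP is solved by discounted value iteration (a stand-in for the exact planning step on the slide), and the posterior is resampled every round; all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 6, 2, 0.95
P_true = rng.dirichlet(np.ones(S), size=(S, A))     # unknown dynamics
r = rng.uniform(0.0, 1.0, size=(S, A))              # known rewards (simplifying assumption)

def solve_mdp(P, r, iters=200):
    """Value iteration on the sampled MDP; returns a greedy policy."""
    V = np.zeros(S)
    for _ in range(iters):
        Q = r + gamma * P @ V           # Q[s, a] = r(s,a) + gamma * sum_s' P[s,a,s'] V(s')
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

alpha = np.ones((S, A, S))              # Dirichlet(1,...,1) prior for every (s, a)
s = 0
for t in range(2000):
    # Posterior sampling: draw one MDP from the posterior and act greedily w.r.t. it.
    P_sample = np.array([[rng.dirichlet(alpha[s_, a]) for a in range(A)] for s_ in range(S)])
    pi = solve_mdp(P_sample, r)
    a = pi[s]
    s_next = rng.choice(S, p=P_true[s, a])
    alpha[s, a, s_next] += 1            # Bayesian update for the visited (s, a) pair
    s = s_next

print("visit counts per (s, a):", alpha.sum(axis=2) - S)
```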

Main result [A. and Jia, NIPS 2017]. An algorithm based on posterior sampling with a high-probability, near-optimal worst-case regret upper bound. $\mathrm{Regret}(M, T) = T \lambda^* - \sum_{t=1}^{T} r(s_t, a_t)$, where $\lambda^*$ is the optimal infinite-horizon average reward (gain), achieved by the best single policy $\pi^*$. Theorem: for any communicating MDP $M$ with (unknown) diameter $D$, and for $T \ge S^5 A$, with high probability, $\mathrm{Regret}(M, T) \le \tilde O(D\sqrt{SAT})$. This improved the state-of-the-art bounds and matched the lower bound. It is the first worst-case/minimax regret analysis of posterior sampling for RL (earlier analyses were in the known-prior Bayesian regret setting [Osband and Van Roy, 2013, 2016, 2017; Abbasi-Yadkori and Szepesvari, 2014]), and the first algorithm to achieve an $\tilde O(D\sqrt{SAT})$ bound in the non-episodic setting; in the episodic setting, $\tilde O(\sqrt{HSAT})$ is known for episodes of length $H$ [Azar, Osband, Munos, 2017].

Some further directions: large-scale structured MDPs with a small parameter space; scalable learning algorithms for inventory management problems; learning to price in the presence of state-dependent demand; transition probabilities that depend on a time-varying context in addition to the state; logistic normal priors.

Other related work on Thompson Sampling. Known likelihood: exponential families (with Jeffreys prior) [Korda et al. 2013]. Known prior (Bayesian regret): near-optimal regret bounds for any prior [Russo and Van Roy 2013, 2014; Bubeck and Liu 2013]. Extensions to many variations of MAB: side information, delayed feedback, sleeping bandits, sparse bandits, spectral bandits.