1
Thompson Sampling for learning in online decision making
Shipra Agrawal, IEOR and Data Science Institute, Columbia University
2
Learning from sequential interactions
Tradeoff between information and revenue; learning and optimization; exploration and exploitation.
3
Managing the exploration-exploitation tradeoff
The multi-armed bandit problem (Thompson 1933; Robbins 1952). Multiple rigged slot machines in a casino: which one to put money on? Try each one out. An old, crisp, and clean formulation of the problem in statistics: WHEN TO STOP TRYING (EXPLORATION) AND START PLAYING (EXPLOITATION)?
4
Stochastic multi-armed bandit problem
Online decisions: at every time step $t = 1, \ldots, T$, pull one arm out of $N$ arms. Stochastic feedback: for each arm $i$, the reward is generated i.i.d. from a fixed but unknown distribution with support $[0,1]$ and mean $\mu_i$. Bandit feedback: only the reward of the pulled arm can be observed.
5
Measure of Performance
Maximize expected reward in time $T$: $E[\sum_{t=1}^{T} \mu_{i_t}]$. Equivalently, minimize expected regret in time $T$. The optimal arm is the arm with expected reward $\mu^* = \max_i \mu_i$; the expected regret for playing arm $i$ is $\Delta_i = \mu^* - \mu_i$. The expected regret at any time $T$ is $\text{Regret}(T) = E[\sum_t \Delta_{i_t}] = \sum_{i=1}^{N} \Delta_i\, E[k_i(T)]$, where $k_i(T)$ is the number of plays of arm $i$ up to time $T$. Anytime algorithm: the time horizon $T$ is not known.
6
Summary of the talk: an introduction to Thompson Sampling as a very general algorithmic methodology for MAB, and provably near-optimal adaptations for linear contextual bandits, assortment selection, and reinforcement learning. The focus is on algorithmic methodology.
7
References: papers from which this talk derives. Provably optimal:
[A. and Goyal, COLT 2012, AISTATS 2013, JACM 2017]. Handling personalization, contexts, and large numbers of products: [A. and Goyal, ICML 2013], [Kocák, Valko, Munos, A., AAAI 2014]. Assortment selection (consumer choice behavior): [A., Avadhanula, Goyal, Zeevi, EC 2016, COLT 2017]. Reinforcement learning: [A. and Jia, NIPS 2017]. (Many) other related research articles: [Li et al. 2011, Kaufmann et al. 2012, Russo and Van Roy 2013, 2014, 2016, Osband and Van Roy 2016, ...]. A central piece of my investigation into this problem has been the analysis of a very old and popular Bayesian heuristic called Thompson Sampling. Despite being easy to implement and having good empirical performance, this heuristic lacked a rigorous theoretical understanding; our work showed that this simple and popular heuristic achieves the optimal regret bounds possible for every instance of the multi-armed bandit problem. Of course, when this heuristic was proposed, today's concerns of personalization and millions of products were not considered; several of my works focus on extensions and analysis of this heuristic that incorporate these important aspects. In addition to the extensions mentioned so far, one exciting direction I am working on is handling exploration-exploitation in reinforcement learning, a more complex and general model of computers learning from their own past mistakes, which is a popular model for learning to play games.
8
The need for exploration
Two arms, black and red, with random rewards and unknown means $\mu_{\text{black}} = 1.1$, $\mu_{\text{red}} = 1$. The optimal expected reward in $T$ time steps is $1.1 \times T$. Exploit-only strategy: use the current best estimate (MLE/empirical mean) of the unknown means to pick arms. An initial few trials can mislead it into playing the red arm forever: 1.1, 1, 0.2, 1, 1, 1, 1, 1, 1, 1, ... The expected regret in $T$ steps is then close to $0.1 \times T$.
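As an illustration of the point above, here is a small simulated sketch (illustrative Gaussian noise model, not from the slides) of the exploit-only strategy locking onto the suboptimal arm:

```python
import numpy as np

# A minimal sketch of the exploit-only failure mode: two arms with true means
# 1.1 (black) and 1.0 (red); greedy play of the empirical-mean maximizer can
# lock onto the suboptimal red arm after a few unlucky initial draws.
rng = np.random.default_rng(0)
true_means = np.array([1.1, 1.0])      # arm 0 = black (optimal), arm 1 = red
T = 10_000
counts = np.zeros(2)
sums = np.zeros(2)

# Pull each arm once to initialize the empirical means.
for i in range(2):
    sums[i] += true_means[i] + rng.normal(0, 0.5)
    counts[i] += 1

expected_reward = 0.0
for t in range(T):
    arm = int(np.argmax(sums / counts))            # exploit only: no exploration
    reward = true_means[arm] + rng.normal(0, 0.5)  # noisy observed reward
    sums[arm] += reward
    counts[arm] += 1
    expected_reward += true_means[arm]

print("expected regret:", 1.1 * T - expected_reward)
```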
9
Exploration-Exploitation tradeoff
Exploitation: play the empirical-mean reward maximizer. Exploration: play less explored actions to ensure the empirical estimates converge.
10
Thompson Sampling [Thompson, 1933]
A natural and efficient heuristic: maintain a belief about the effectiveness (mean reward) of each arm; observe feedback and update the belief of the pulled arm $i$ in a Bayesian manner; pull an arm according to its posterior probability of being the best arm. This is NOT the same as always choosing the arm that is most likely to be best. For example, if a customer likes the product, update the belief to be more favorable for that product.
11
Bernoulli rewards, Beta priors
The uniform distribution is $\text{Beta}(1,1)$. A $\text{Beta}(\alpha,\beta)$ prior gives posterior $\text{Beta}(\alpha+1,\beta)$ if you observe 1, and $\text{Beta}(\alpha,\beta+1)$ if you observe 0. Start with a $\text{Beta}(1,1)$ prior belief for every arm. In round $t$: for every arm $i$, sample $\theta_{i,t}$ independently from the posterior $\text{Beta}(S_{i,t}+1, F_{i,t}+1)$, where $S_{i,t}$ and $F_{i,t}$ are the numbers of successes and failures observed so far for arm $i$; play arm $i_t = \arg\max_i \theta_{i,t}$; observe the reward and update the Beta posterior for arm $i_t$.
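A minimal Python sketch of this Beta-Bernoulli Thompson Sampling loop, with an illustrative (hypothetical) set of arm means:

```python
import numpy as np

# A minimal sketch of Beta-Bernoulli Thompson Sampling, as described above.
# The bandit environment (true_means) is illustrative, not from the slides.
def thompson_sampling_bernoulli(true_means, T, seed=0):
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    successes = np.zeros(n_arms)   # S_{i,t}
    failures = np.zeros(n_arms)    # F_{i,t}
    for t in range(T):
        # Sample theta_i from Beta(S_i + 1, F_i + 1) for every arm (Beta(1,1) prior).
        theta = rng.beta(successes + 1, failures + 1)
        arm = int(np.argmax(theta))              # play the arm with the largest sample
        reward = rng.random() < true_means[arm]  # Bernoulli reward
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures

# Example usage with three hypothetical arms.
S, F = thompson_sampling_bernoulli([0.3, 0.5, 0.7], T=10_000)
print("plays per arm:", (S + F).astype(int))
```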
12
Thompson Sampling with Gaussian Priors
Suppose the reward for arm $i$ is i.i.d. $N(\mu_i, 1)$, with starting prior $N(0,1)$. Gaussian prior and Gaussian likelihood give a Gaussian posterior $N(\hat\mu_{i,t}, \frac{1}{n_{i,t}+1})$, where $\hat\mu_{i,t}$ is the empirical mean of the $n_{i,t}$ observations for arm $i$. Algorithm: sample $\theta_i$ from the posterior $N(\hat\mu_{i,t}, \frac{1}{n_{i,t}+1})$ for each arm $i$; play arm $i_t = \arg\max_i \theta_i$; observe the reward and update the empirical mean for arm $i_t$.
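A similar minimal sketch for the Gaussian-prior variant, again with illustrative arm means:

```python
import numpy as np

# A minimal sketch of Thompson Sampling with Gaussian priors, as described above.
# Rewards are N(mu_i, 1); the environment parameters are illustrative.
def thompson_sampling_gaussian(true_means, T, seed=0):
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    n = np.zeros(n_arms)        # n_{i,t}: number of pulls of arm i
    mean = np.zeros(n_arms)     # empirical mean of arm i (prior mean 0)
    for t in range(T):
        # Posterior for arm i is N(mean_i, 1 / (n_i + 1)); draw one sample per arm.
        theta = rng.normal(mean, 1.0 / np.sqrt(n + 1))
        arm = int(np.argmax(theta))
        reward = rng.normal(true_means[arm], 1.0)
        # Incremental update of the empirical mean for the pulled arm.
        n[arm] += 1
        mean[arm] += (reward - mean[arm]) / n[arm]
    return n, mean

pulls, est = thompson_sampling_gaussian([0.2, 0.5, 0.9], T=10_000)
print("pulls:", pulls.astype(int), "estimates:", np.round(est, 2))
```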
13
Thompson Sampling [Thompson 1933]
With more trials, the posteriors concentrate on the true parameters; the mode is the MLE, which enables exploitation. Fewer trials mean more uncertainty in the estimates; the spread/variance captures this uncertainty, which enables exploration. This gives the benefit of the doubt to arms that are less explored, and in fact the OPTIMAL benefit of the doubt [A. and Goyal, COLT 2012, AISTATS 2013, JACM 2017]. A sample from the posterior is used as an estimate of the unknown parameters to make decisions.
14
Our Results [A. Goyal COLT 2012, AISTATS 2013, JACM 2017]
Optimal instance-dependent bound for Bernoulli rewards: $\text{Regret}(T) \le (1+\epsilon)\,\ln T \sum_i \frac{\Delta_i}{KL(\mu_i, \mu^*)} + O\big(\frac{N}{\epsilon^2}\big)$. This matches the asymptotic lower bound [Lai and Robbins 1985]; UCB achieves it only after careful tuning [Kaufmann et al. 2012]. Arbitrary bounded reward distributions: $\text{Regret}(T) \le O\big(\ln T \sum_i \frac{1}{\Delta_i}\big)$, matching the best available bounds for UCB for general reward distributions. Instance-independent bound: $\text{Regret}(T) \le O(\sqrt{NT \ln T})$; lower bound $\Omega(\sqrt{NT})$. Prior and likelihood mismatch is allowed, and there is a simple extension to arbitrary distributions with $[0,1]$ support.
15
Further challenges Scalability and contextual response
TS for linear contextual bandits [A. and Goyal, ICML 2013; Kocák, Valko, Munos, A., AAAI 2014]. Externality in an assortment of arms: TS for dynamic assortment selection in the MNL choice model [Avadhanula, A., Goyal, Zeevi, COLT 2017]. State-dependent response: learning MDPs (reinforcement learning) [A. and Jia, NIPS 2017]. Other: budget/supply constraints and nonlinear utilities [EC 2014, SODA 2015, NIPS 2016, COLT 2016]; delayed anonymous feedback [ICML 2018].
16
Further challenges #1 Scalability and context
A large number of products and customer types, and external conditions: can we utilize similarity? Content-based recommendation: customers and products are described by their features; similar features mean similar preferences; parametric models map customer and product features to customer preferences. Contextual bandits: exploration-exploitation to learn the parametric models; a large number of arms, but related arms. Customization ("Recommended just for you"): the response depends not just on the assortment but on the customer profile (context); customer preference for a product depends on the (customer, product) features and unknown parameters.
17
Linear Contextual Bandits
$N$ arms, possibly with very large $N$, and a $d$-dimensional context (feature vector) $x_{i,t}$ for every arm $i$ and time $t$. Linear parametric model with unknown parameter $\mu$: the expected reward of arm $i$ at time $t$ is $x_{i,t} \cdot \mu$. The algorithm picks $x_t \in \{x_{1,t}, \ldots, x_{N,t}\}$ and observes $r_t = x_t \cdot \mu + \epsilon_t$. The optimal arm depends on the context: $x_t^* = \arg\max_{x_{i,t}} x_{i,t} \cdot \mu$. Goal: minimize regret, $\text{Regret}(T) = \sum_t (x_t^* \cdot \mu - x_t \cdot \mu)$. Arms are products, and at every time $t$ a new user arrives; the optimal arm at any time is the arm with the best features at that time.
18
Thompson Sampling for linear contextual bandits
Least squares solution $\hat\mu_t$ of the set of $t-1$ equations $x_s \cdot \mu = r_s$, $s = 1, \ldots, t-1$: $\hat\mu_t = B_t^{-1}\big(\sum_{s=1}^{t-1} x_s r_s\big)$, where $B_t = I + \sum_{s=1}^{t-1} x_s x_s'$, and $B_t^{-1}$ is the covariance matrix of this estimator. Gaussian prior, Gaussian likelihood, Gaussian posterior: with starting prior $N(0, I)$ on $\mu$ and reward distribution $N(\mu \cdot x_{i,t}, 1)$ given $\mu$ and $x_{i,t}$, the posterior on $\mu$ at time $t$ is $N(\hat\mu_t, B_t^{-1})$. The observation that ties least squares estimation nicely to TS: the posterior mean is exactly this least squares estimate, and the posterior covariance is $B_t^{-1}$.
19
Thompson Sampling for linear contextual bandits [A. Goyal ICML 2013]
Algorithm: at step $t$, sample $\tilde\mu_t$ from $N(\hat\mu_t, v^2 B_t^{-1})$ and pull the arm with feature $x_t = \arg\max_{x_{i,t}} x_{i,t} \cdot \tilde\mu_t$. Apply this algorithm for any likelihood, with starting prior $N(0, v^2 I)$! Use a slightly scaled variance.
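A minimal sketch of this sampling rule (the environment and class names are illustrative assumptions, not the authors' code):

```python
import numpy as np

# A minimal sketch of Thompson Sampling for linear contextual bandits:
# sample from the posterior N(mu_hat_t, v^2 * B_t^{-1}), act greedily on the sample.
class LinearTS:
    def __init__(self, d, v=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.v = v
        self.B = np.eye(d)        # B_t = I + sum_s x_s x_s'
        self.f = np.zeros(d)      # sum_s x_s r_s

    def select(self, contexts):
        """contexts: array of shape (n_arms, d); returns index of the chosen arm."""
        B_inv = np.linalg.inv(self.B)
        mu_hat = B_inv @ self.f
        mu_tilde = self.rng.multivariate_normal(mu_hat, self.v**2 * B_inv)
        return int(np.argmax(contexts @ mu_tilde))

    def update(self, x, r):
        """Rank-one update after observing reward r for the played context x."""
        self.B += np.outer(x, x)
        self.f += r * x

# Hypothetical usage: 5 arms, 3-dimensional contexts, unknown parameter mu_star.
rng = np.random.default_rng(1)
mu_star = np.array([0.5, -0.2, 0.3])
alg = LinearTS(d=3)
for t in range(1000):
    X = rng.normal(size=(5, 3))
    i = alg.select(X)
    r = X[i] @ mu_star + rng.normal(0, 0.1)
    alg.update(X[i], r)
```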
20
Our Results [A. Goyal ICML 2013]
With probability $1-\delta$, $\text{Regret}(T) \le \tilde{O}(d^{3/2}\sqrt{T})$ (up to logarithmic factors in $T$ and $1/\delta$). Any likelihood and an unknown prior are allowed; only bounded or sub-Gaussian noise is assumed, and there is no dependence on the number of arms. Lower bound: $\Omega(d\sqrt{T})$. For UCB, the best bound is $O(d\sqrt{T})$ [Dani et al. 2008, Abbasi-Yadkori et al. 2011]; the best earlier bound for a polynomial-time algorithm was $O(d^{3/2}\sqrt{T})$ [Dani et al.].
21
Further challenges #2 Assortment selection as multi-armed bandit
Arms are products, and display space is limited: K products at a time. Challenge: the customer response to one product is influenced by the other products in the assortment, so arms are no longer independent (Amazon product example). There is a combinatorially large number of product combinations, so we cannot learn about every combination; how do we learn about individual products?
24
Assortment optimization
Multinomial logit (MNL) choice model: the probability of choosing product $i$ from assortment $S$ is $p_i(S) = \frac{v_i}{v_0 + \sum_{j \in S} v_j}$ (independence of irrelevant alternatives). The expected revenue on offering assortment $S$ is $R(S, \mathbf{v}) = \sum_{i \in S} p_i(S)\, r_i$, and the optimal assortment is $S^* = \arg\max_{|S| \le K} R(S, \mathbf{v})$. More generally, for product $i$ (feature vector $x_{i,t}$) in assortment $S$, the probability of choosing $i$ is $p_{i,t}(S) = \frac{e^{\theta' x_{i,t}}}{1 + \sum_{j \in S} e^{\theta' x_{j,t}}}$: the log ratio is linear in the features.
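A small sketch of the MNL quantities above, computing choice probabilities and the expected revenue $R(S, \mathbf{v})$ for a hypothetical instance:

```python
# A small sketch of the MNL choice model above. v[0] is the no-purchase weight v_0;
# the instance (weights, revenues) is illustrative.
def mnl_choice_probs(S, v):
    """Probability of choosing each product in S (and of no purchase) under MNL."""
    denom = v[0] + sum(v[i] for i in S)
    return {i: v[i] / denom for i in S}, v[0] / denom

def expected_revenue(S, v, r):
    """R(S, v) = sum_{i in S} p_i(S) * r_i."""
    probs, _ = mnl_choice_probs(S, v)
    return sum(probs[i] * r[i] for i in S)

# Hypothetical instance: 4 products, assortment of size 2.
v = {0: 1.0, 1: 0.8, 2: 0.5, 3: 0.9, 4: 0.3}   # preference weights
r = {1: 1.0, 2: 2.0, 3: 1.5, 4: 3.0}           # per-product revenues
print(expected_revenue({1, 3}, v, r))
```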
25
Online assortment selection
$N$ products with unknown parameters $v_1, v_2, \ldots, v_N$. At every step $t$, recommend an assortment $S_t$ of size at most $K$, observe the customer choice $c_t$ and revenue $r_{c_t}$ (expected value $R(S_t, \mathbf{v})$), and update the parameter estimates. Goal: maximize total expected revenue $E[\sum_t r_{c_t}]$, or minimize regret compared to the optimal assortment $S^* = \arg\max_S \sum_{i \in S} p_i(S)\, r_i = \arg\max_S R(S, \mathbf{v})$.
26
Thompson Sampling structure
Initialization: specify a belief on the parameters $\mathbf{v}$, i.e., a prior distribution $P(\mathbf{v})$ on $\mathbf{v}$. Step $t$: sample $\tilde{\mathbf{v}}$ from the posterior belief $P(\mathbf{v} \mid H_{t-1})$; offer the optimal assortment consistent with the sampled belief, $S_t = \arg\max_S R(S, \tilde{\mathbf{v}})$; observe the purchase/no-purchase outcome $c_t$ and update the belief, $P(\mathbf{v} \mid H_t) \propto P(c_t \mid \mathbf{v}) \times P(\mathbf{v} \mid H_{t-1})$.
27
Main challenge and techniques
Censored feedback: the feedback for product $i$ is affected by the other products in the assortment; an unbiased estimate of $v_i / v_0$ is obtained by offering the same assortment repeatedly until a no-purchase occurs. What is a good structure for the prior/posterior? Correlated priors and boosted sampling. The posterior is difficult to update exactly, so approximate Gaussian posterior updates are used.
28
TS-based algorithm. Initialize $\mu_i = 1$, $\sigma_i^2 = 1$ for $i = 1, \ldots, N$; epoch $\ell = 1$. (Correlated sampling) Sample $K$ values $\theta_k \sim N(0,1)$ and set $\tilde v_i = \max_k \{\mu_i + \theta_k \sigma_i\}$. Offer $S^* = \arg\max_S R(S, \tilde{\mathbf{v}})$ repeatedly until a no-purchase occurs, and track $n_{i,\ell}$, the number of times product $i$ is chosen in epoch $\ell$. For $i \in S^*$, update $\mu_i = \frac{\sum_\ell n_{i,\ell}}{L_i}$ and $\sigma_i^2 = \frac{\text{Var}_\ell(n_{i,\ell})}{L_i}$, where $L_i$ is the number of epochs in which product $i$ has been offered; repeat. This question is the focus of this talk: specifically, can we design a policy that suggests $S_t$ as a function of the history, so as to learn $\mathbf{v}$ and maximize cumulative revenue over the selling period?
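A rough Python sketch of this epoch-based procedure, under assumed environment details (true weights, revenues) and with a greedy top-K surrogate in place of the exact MNL assortment optimization; it illustrates the structure rather than the authors' implementation:

```python
import numpy as np

# A rough sketch of the epoch-based TS algorithm above: correlated Gaussian sampling,
# repeated offering until a no-purchase, and approximate Gaussian posterior updates
# of the MNL weights v_i (with v_0 = 1). The environment below is illustrative.
rng = np.random.default_rng(0)
N, K = 20, 4
v_true = rng.uniform(0.1, 1.0, N)      # unknown MNL weights
r = rng.uniform(0.5, 2.0, N)           # per-product revenues
mu = np.ones(N)                        # posterior means of v_i
sig2 = np.ones(N)                      # posterior variances
counts = np.zeros(N)                   # sum of n_{i,l} over epochs
sq = np.zeros(N)                       # sum of n_{i,l}^2 over epochs
epochs = np.zeros(N)                   # L_i: epochs in which product i was offered

for epoch in range(500):
    # Correlated sampling: K shared Gaussian samples; take the max per product.
    theta = rng.standard_normal(K)
    v_tilde = np.max(mu[:, None] + np.sqrt(sig2)[:, None] * theta[None, :], axis=1)
    # Greedy surrogate for argmax_S R(S, v_tilde): top-K products by v_tilde * r.
    # (The exact MNL assortment optimization is a separate subroutine, omitted here.)
    S = np.argsort(-v_tilde * r)[:K]
    # Offer S repeatedly until a no-purchase occurs; count choices per product.
    n_epoch = np.zeros(N)
    while True:
        probs = np.concatenate(([1.0], v_true[S])) / (1.0 + v_true[S].sum())
        choice = rng.choice(K + 1, p=probs)
        if choice == 0:                              # no purchase ends the epoch
            break
        n_epoch[S[choice - 1]] += 1
    # Approximate Gaussian posterior update for the offered products.
    counts[S] += n_epoch[S]
    sq[S] += n_epoch[S] ** 2
    epochs[S] += 1
    mu[S] = counts[S] / epochs[S]                    # mean of n_{i,l} over epochs
    var_n = np.maximum(sq[S] / epochs[S] - mu[S] ** 2, 0.0)
    sig2[S] = var_n / epochs[S] + 1e-6               # variance of the mean estimate
```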
29
Results: Dynamic assortment selection [A., Avadhanula, Goyal, Zeevi 2016, 2017]. Assumption: $v_i \le v_0$ (no purchase is more likely than any single product purchase). Upper bound: for any $\mathbf{v}$, $\text{Reg}(T, \mathbf{v}) \le O(\sqrt{NT \log T})$. Lower bound [Chen and Wang 2017]: $\text{Reg}(T, \mathbf{v}) \ge \Omega(\sqrt{NT})$.
30
Different versions of TS (N=1000, K=10, random instances)
31
Homepage Optimization at Flipkart
Assortment of widgets. For the home page team, the other teams are pretty much black boxes, and its role is to pick the right subset of widgets; often the requests compete and the screen space is very limited.
32
Substitution effects: demand cannibalization from similar recommendations.
For example, two different teams can push similar products, like "Offers for you" and "Discounts for you". If we don't account for substitution, we might end up wrongly estimating the attractiveness of these banners.
33
Numerical simulation
34
Numerical simulations
(assortment optimization)
35
Further challenges #3 State dependent response: Reinforcement learning
Rounds $t = 1, \ldots, T$: observe the state, take an action, and observe the response (a reward and a new state). Markov model: the response to an action depends on the state of the system. Applications: autonomous vehicle control, robot navigation, personalized medical treatments, inventory management, dynamic pricing, ...
36
The reinforcement learning problem
System dynamics are given by an unknown MDP $(S, A, P, r, s_0)$. Rounds $t = 1, \ldots, T$: in round $t$, observe state $s_t$, take action $a_t$, observe reward $r_t \in [0,1]$ with $E[r_t] = r(s_t, a_t)$, and observe the transition to the next state $s_{t+1}$, which occurs with probability $P(s_{t+1} \mid s_t, a_t)$. (Figure: an example MDP with 6 states and 2 actions, with transition probabilities and rewards marked on the edges.)
37
The reinforcement learning problem
Solution concept: an optimal policy, i.e., which action to take in which state, $\pi_t : S \to A$, taking action $\pi_t(s)$ in state $s$. The reward distributions $r_{s,a}$ and the transition function $P_{s,a}$ are unknown, hence the optimal policy $\pi^*$ is unknown. Learn to solve the MDP from observations over $T$ rounds. Goal: maximize the total reward over $T$ rounds, i.e., minimize the regret in $T$ rounds: the difference between the total reward of playing the optimal policy for all rounds and the algorithm's total reward.
38
The need for exploration
Uncertainty in rewards and state transitions: unknown reward distributions and unknown transition probabilities. Exploration-exploitation: explore actions, states, and policies to learn the reward distributions and the transition model; exploit the (seemingly) best policy. (Figure: a two-state example MDP with transition probabilities and rewards, e.g., 0.9, $100, on the edges.) Here is a simple example to illustrate the problem. The game starts in state $s_1$, and there are two actions. Action 1 can take the process to state $s_2$ with some probability, where the rewards are very high, so it is clearly the better action; however, its expected immediate reward is poor. The optimal policy is therefore to take action 1 in states $s_1$ and $s_2$. What would Q-learning do? Initially, say all Q-values are initialized to 0 by the algorithm: $Q(s_1,a_1)=0$, $Q(s_1,a_2)=0$, $Q(s_2,a_1)=0$. The algorithm does not know that state $s_2$ is so good, or even that there is a way to go from $s_1$ to $s_2$. Suppose the algorithm tries the first action initially, but in these few trials it does not observe a transition to $s_2$; then the Q-value of $s_2$ never gets updated, and $Q(s_1,a_1)$ is updated again and again toward $-1$. The algorithm therefore tries $a_2$, observes a $+1$ reward, and updates its Q-value to $+1$; so it chooses $a_2$ again, and the Q-value is updated toward $1 + \gamma + \gamma^2 + \cdots = \frac{1}{1-\gamma}$. Since $a_1$ is not played and $s_2$ is not visited, their Q-values never get updated. This example may seem extreme, but the situation can easily occur even in more reasonable-looking examples: before enough samples are obtained, less explored states or actions might appear bad due to empirical errors; if these estimates are used greedily to decide actions, the algorithm will inevitably pick suboptimal actions, and since the estimates are updated only when a state is visited or an action is played, the estimates for unexplored states and actions never improve.
39
Posterior Sampling. Assume for simplicity: known reward distribution.
We need to learn the unknown transition probability vector $P_{s,a} = (P_{s,a}(1), \ldots, P_{s,a}(S))$ for all $s,a$: $S^2 A$ parameters. In any state $s_t = s$ with action $a_t = a$, the observed new state $s_{t+1}$ is the outcome of a multinoulli (multivariate Bernoulli) trial with probability vector $P_{s,a}$.
40
Posterior Sampling with Dirichlet priors
Given a prior $\text{Dirichlet}(\alpha_1, \alpha_2, \ldots, \alpha_S)$ on $P_{s,a}$, after a multinoulli trial with outcome (new state) $i$, the Bayesian posterior on $P_{s,a}$ is $\text{Dirichlet}(\alpha_1, \ldots, \alpha_i + 1, \ldots, \alpha_S)$. After $n_{s,a} = \alpha_1 + \cdots + \alpha_S$ observations for a state-action pair $s,a$, the posterior mean vector is the empirical mean, $\hat{P}_{s,a}(i) = \frac{\alpha_i}{\alpha_1 + \cdots + \alpha_S} = \frac{\alpha_i}{n_{s,a}}$, with variance bounded by $\frac{1}{n_{s,a}}$. With more trials of $s,a$, the posterior mean concentrates around the true probability vector.
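A tiny sketch of this Dirichlet update and posterior sampling for a single state-action pair:

```python
import numpy as np

# A tiny sketch of the Dirichlet update above: one Dirichlet(alpha) belief per
# state-action pair over next states; observing next state i increments alpha[i].
rng = np.random.default_rng(0)
n_states = 6
alpha = np.ones(n_states)          # Dirichlet(1, ..., 1) prior on P_{s,a}

# Observe a transition from (s, a) to next state 3.
alpha[3] += 1

posterior_mean = alpha / alpha.sum()       # empirical mean (with prior pseudo-counts)
sampled_P_sa = rng.dirichlet(alpha)        # one posterior sample of P_{s,a}
print(np.round(posterior_mean, 3), np.round(sampled_P_sa, 3))
```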
41
Posterior Sampling for RL (Thompson Sampling)
Learning: maintain a Dirichlet posterior on $P_{s,a}$ for every $s,a$, starting with an uninformative prior, e.g., $\text{Dirichlet}(1,1,\ldots,1)$; after round $t$, on observing the outcome $s_{t+1}$, update the posterior for state $s_t$ and action $a_t$. To decide an action: sample $\tilde{P}_{s,a}$ for every $s,a$, compute the optimal policy $\tilde\pi$ for the sampled MDP $(S, A, \tilde{P}, r, s_0)$, and choose $a_t = \tilde\pi(s_t)$. Exploration-exploitation: exploitation, because with more observations the Dirichlet posterior concentrates and $\tilde{P}_{s,a}$ approaches the empirical mean $\hat{P}_{s,a}$; exploration, because anti-concentration of the Dirichlet ensures exploration of less explored states, actions, and policies.
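A rough sketch of the resulting posterior-sampling loop; the episodic replanning schedule and the finite-horizon value iteration used to solve the sampled MDP are simplifying assumptions for illustration, not the exact procedure analyzed in the paper:

```python
import numpy as np

# A rough sketch of posterior sampling for RL with Dirichlet posteriors over
# transitions and known rewards. The environment is illustrative.
rng = np.random.default_rng(0)
S, A, H, episodes = 6, 2, 20, 200
P_true = rng.dirichlet(np.ones(S), size=(S, A))   # unknown true transitions
R = rng.uniform(0, 1, size=(S, A))                # known expected rewards
alpha = np.ones((S, A, S))                        # Dirichlet(1,...,1) priors

def solve_mdp(P, R, H):
    """Finite-horizon value iteration; returns a greedy policy per state."""
    V = np.zeros(S)
    for _ in range(H):
        Q = R + P @ V              # Q[s, a] = R[s, a] + sum_s' P[s, a, s'] * V[s']
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

s = 0
for ep in range(episodes):
    # Sample one transition vector per (s, a) from its Dirichlet posterior.
    P_sample = np.array([[rng.dirichlet(alpha[s_, a]) for a in range(A)] for s_ in range(S)])
    policy = solve_mdp(P_sample, R, H)             # plan in the sampled MDP
    for t in range(H):                             # act with this policy for a while
        a = policy[s]
        s_next = rng.choice(S, p=P_true[s, a])
        alpha[s, a, s_next] += 1                   # Dirichlet posterior update
        s = s_next
```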
42
Main result [A., Jia NIPS 2017]
An algorithm based on posterior sampling, with a high-probability near-optimal worst-case regret upper bound. $\text{Regret}(T, M) = T\rho^* - \sum_{t=1}^{T} r(s_t, a_t)$, where $\rho^*$ is the optimal infinite-horizon average reward (gain), achieved by the best single policy $\pi^*$. Theorem: for any communicating MDP $M$ with (unknown) diameter $D$, and for $T \ge S^5 A$, with high probability $\text{Regret}(T, M) \le \tilde{O}(D\sqrt{SAT})$. This improved the state-of-the-art bounds and matched the lower bound. It is the first worst-case/minimax regret analysis of posterior sampling for RL (analyzed earlier in the known-prior, Bayesian-regret setting [Osband and Van Roy 2013, 2016, 2017; Abbasi-Yadkori and Szepesvari 2014]), and the first algorithm to achieve an $\tilde{O}(D\sqrt{SAT})$ bound in the non-episodic setting; in the episodic setting, $\tilde{O}(\sqrt{HSAT})$ with episodes of length $H$ [Azar, Osband, Munos 2017].
43
Some further directions
Large-scale structured MDPs with a small parameter space. Scalable learning algorithms for inventory management problems. Learning to price in the presence of state-dependent demand. Transition probabilities that depend on a time-varying context in addition to the state. Logistic normal priors.
44
Other related work on Thompson Sampling
Known likelihood: exponential families (with Jeffreys prior) [Korda et al. 2013]. Known prior (Bayesian regret): near-optimal regret bounds for any prior [Russo and Van Roy 2013, 2014; Bubeck and Liu 2013]. Extensions for many variations of MAB: side information, delayed feedback, sleeping bandits, sparse bandits, spectral bandits.