
1 Reinforcement Learning Evaluative Feedback and Bandit Problems Subramanian Ramamoorthy School of Informatics 20 January 2012

2 Recap: What is Reinforcement Learning? An approach to Artificial Intelligence. Learning from interaction; goal-oriented learning; learning about, from, and while interacting with an external environment. Learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. Can be thought of as a stochastic optimization over time.

3 Recap: The Setup for RL The agent is temporally situated, with continual learning and planning. The objective is to affect the environment – actions and states. The environment is uncertain and stochastic. [Diagram: the agent sends actions to the environment and receives states and rewards in return.]

4 Recap: Key Features of RL The learner is not told which actions to take. Trial-and-error search. Possibility of delayed reward – sacrifice short-term gains for greater long-term gains. The need to explore and exploit. Consider the whole problem of a goal-directed agent interacting with an uncertain environment.

5 Multi-arm Bandits (MAB) N possible actions. You can play for some period of time and you want to maximize reward (expected utility). Which is the best arm/machine? DEMO

6 Real-Life Version Choose the best content to display to the next visitor of your commercial website. Content options = slot machines; reward = user's response (e.g., click on an ad). Also, clinical trials: arm = treatment, reward = patient cured. Simplifying assumption: no context (no visitor profiles). In practice, we want to solve contextual bandit problems, but that is for later discussion.

7 What is the Choice?

8 n-Armed Bandit Problem Choose repeatedly from one of n actions; each choice is called a play. After each play a_t, you get a reward r_t, where E[r_t | a_t = a] = Q*(a). These Q*(a) are the unknown action values; the distribution of r_t depends only on a_t. The objective is to maximize the reward in the long term, e.g., over 1000 plays. To solve the n-armed bandit problem, you must explore a variety of actions and exploit the best of them.
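
As a concrete illustration of this setup (not from the original slides), here is a minimal Python sketch of an n-armed bandit whose true action values Q*(a) are fixed but hidden from the player; the class name, the Gaussian reward model and all parameters are illustrative assumptions.

    import numpy as np

    class NArmedBandit:
        """Minimal n-armed testbed: each arm a has a fixed true value q_star[a];
        pulling it returns a noisy reward whose mean is q_star[a]."""
        def __init__(self, n_arms=10, reward_std=1.0, rng=None):
            self.rng = rng or np.random.default_rng()
            self.q_star = self.rng.normal(0.0, 1.0, size=n_arms)  # unknown action values Q*(a)
            self.reward_std = reward_std

        def pull(self, a):
            # The reward distribution depends only on the chosen action a
            return self.rng.normal(self.q_star[a], self.reward_std)

        def optimal_arm(self):
            return int(np.argmax(self.q_star))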

9 Exploration/Exploitation Dilemma Suppose you form action-value estimates Q_t(a) ≈ Q*(a). The greedy action at time t is a_t* = argmax_a Q_t(a). You can’t exploit all the time; you can’t explore all the time. You can never stop exploring; but you could reduce exploring. Why?

10 Action-Value Methods Methods that adapt action-value estimates and nothing else, e.g.: suppose by the t-th play, action a had been chosen k_a times, producing rewards r_1, r_2, …, r_{k_a}; then the “sample average” estimate is Q_t(a) = (r_1 + r_2 + … + r_{k_a}) / k_a.
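
A one-function sketch of the sample-average estimate described above (function and variable names are my own, not from the slides):

    def sample_average(rewards_for_a):
        # Q_t(a) = (r_1 + ... + r_{k_a}) / k_a over the k_a rewards received from arm a
        return sum(rewards_for_a) / len(rewards_for_a) if rewards_for_a else 0.0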

11 Remark The simple greedy action selection strategy is a_t = a_t* = argmax_a Q_t(a). Why might this be insufficient? You are estimating, online, from a few samples. How will this behave? DEMO

12 ε-Greedy Action Selection Greedy action selection: a_t = a_t* = argmax_a Q_t(a). ε-Greedy: with probability 1 − ε take the greedy action a_t*, with probability ε take a random action – the simplest way to balance exploration and exploitation.
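
A minimal sketch of ε-greedy selection, assuming Q is an array of current action-value estimates and rng is a numpy random generator (the names are illustrative):

    import numpy as np

    def epsilon_greedy(Q, epsilon, rng):
        # With probability epsilon explore (random arm); otherwise exploit (greedy arm)
        if rng.random() < epsilon:
            return int(rng.integers(len(Q)))
        return int(np.argmax(Q))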

13 Worked Example: 10-Armed Testbed n = 10 possible actions. Each true value Q*(a) is chosen from a normal distribution N(0, 1); each reward r_t is also normal, N(Q*(a_t), 1). 1000 plays; repeat the whole thing 2000 times and average the results.
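
A hedged sketch of the testbed experiment as described on this slide (2000 independent problems, 1000 plays each, sample-average estimates with ε-greedy selection; all names and the seed are my own choices):

    import numpy as np

    def run_testbed(epsilon, n_runs=2000, n_plays=1000, n_arms=10, seed=0):
        """Average reward per play, averaged over many independent bandit problems."""
        rng = np.random.default_rng(seed)
        avg_reward = np.zeros(n_plays)
        for _ in range(n_runs):
            q_star = rng.normal(0.0, 1.0, n_arms)   # true values, drawn fresh per problem
            Q = np.zeros(n_arms)                    # sample-average estimates
            counts = np.zeros(n_arms)
            for t in range(n_plays):
                a = int(rng.integers(n_arms)) if rng.random() < epsilon else int(np.argmax(Q))
                r = rng.normal(q_star[a], 1.0)
                counts[a] += 1
                Q[a] += (r - Q[a]) / counts[a]      # incremental sample average (slide 16)
                avg_reward[t] += r
        return avg_reward / n_runs

    # e.g. compare epsilon = 0 (greedy), 0.01 and 0.1, as in the plot on the next slide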

14 ε-Greedy Methods on the 10-Armed Testbed

15 Softmax Action Selection Softmax action selection methods grade action probabilities by estimated values. The most common softmax uses a Gibbs, or Boltzmann, distribution: choose action a on play t with probability e^{Q_t(a)/τ} / Σ_{b=1}^{n} e^{Q_t(b)/τ}, where τ is the “temperature”.
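
A sketch of Gibbs/Boltzmann (softmax) selection under the formula above; the temperature parameter name and the numerical-stability shift are my own choices:

    import numpy as np

    def softmax_action(Q, temperature, rng):
        # P(a) = exp(Q[a]/tau) / sum_b exp(Q[b]/tau); high tau -> near-uniform, low tau -> near-greedy
        prefs = np.asarray(Q, dtype=float) / temperature
        prefs -= prefs.max()                 # subtract the max for numerical stability
        probs = np.exp(prefs)
        probs /= probs.sum()
        return int(rng.choice(len(Q), p=probs))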

16 Incremental Implementation Sample average estimation method: the average of the first k rewards is (dropping the dependence on a) Q_k = (r_1 + r_2 + … + r_k) / k. How to do this incrementally (without storing all the rewards)? We could keep a running sum and count, or, equivalently, update Q_{k+1} = Q_k + (1/(k+1)) [r_{k+1} – Q_k]. This is of the common form NewEstimate = OldEstimate + StepSize [Target – OldEstimate].
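
The incremental update as a one-liner, assuming k counts how many rewards this arm has produced so far, including the new one:

    def incremental_update(old_estimate, reward, k):
        # NewEstimate = OldEstimate + StepSize * (Target - OldEstimate), with StepSize = 1/k
        return old_estimate + (reward - old_estimate) / k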

17 Tracking a Nonstationary Problem Choosing Q_k to be a sample average is appropriate in a stationary problem, i.e., when none of the Q*(a) change over time. But not in a nonstationary problem. Better in the nonstationary case is a constant step size: Q_{k+1} = Q_k + α [r_{k+1} – Q_k] for constant α, 0 < α ≤ 1 – an exponential, recency-weighted average.
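
The constant step-size variant, which weights old rewards by powers of (1 − α) and so tracks a nonstationary problem (α = 0.1 is just an illustrative default):

    def recency_weighted_update(old_estimate, reward, alpha=0.1):
        # Q <- Q + alpha * (r - Q): an exponential, recency-weighted average of past rewards
        return old_estimate + alpha * (reward - old_estimate)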

18 Optimistic Initial Values All methods so far depend on the initial estimates Q_0(a), i.e., they are biased. Encourage exploration: initialize the action values optimistically, e.g., on the 10-armed testbed, use Q_0(a) = 5 for all a.
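
A sketch of the optimistic start on the 10-armed testbed, assuming (as in the textbook example) that the true values are roughly N(0, 1), so 5 is wildly optimistic:

    import numpy as np

    # Every arm initially looks better than it can possibly be, so even a purely greedy
    # learner is pushed to try each arm before its estimate decays to a realistic value.
    Q = np.full(10, 5.0)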

19 Beyond Counting…

20 An Interpretation of MAB Type Problems Related to ‘rewards’

21 MAB is a Special Case of Online Learning

22 How to Evaluate Online Alg.: Regret After you have played for T rounds, you experience a regret: R_T = [reward sum of optimal strategy] – [sum of actual collected rewards]. (The randomness comes from the draw of rewards and from the player’s strategy.) If the average regret per round goes to zero with probability 1, asymptotically, we say the strategy has the no-regret property – roughly, it is guaranteed to converge to an optimal strategy. ε-greedy is sub-optimal (so has some regret). Why?
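
A sketch of regret measured against the best fixed arm, assuming the true values q_star are known to the evaluator (they never are to the player); names are my own:

    import numpy as np

    def regret(q_star, rewards_collected):
        # [expected reward sum of always playing the best arm] - [sum of rewards actually collected]
        T = len(rewards_collected)
        return T * float(np.max(q_star)) - float(np.sum(rewards_collected))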

23 Interval Estimation Attribute to each arm an “optimistic initial estimate” within a certain confidence interval. Greedily choose the arm with the highest optimistic mean (upper bound of the confidence interval). An infrequently observed arm will have an over-valued reward mean, leading to exploration; frequent usage pushes the optimistic estimate towards the true value.

24 Interval Estimation Procedure Associate to each arm a 100(1 − α)% upper band on the reward mean. Assume, e.g., rewards are normally distributed. The arm is observed n times to yield an empirical mean and standard deviation; the α-upper bound then adds a quantile of the normal cumulative distribution function, scaled by std/√n, to the empirical mean. If α is carefully controlled, this could be made a zero-regret strategy – but in general we don’t know how to choose it.
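
A sketch of the upper band under the normal assumption; the z-quantile interface is my own simplification (z ≈ 1.96 corresponds to α = 0.05):

    import math

    def upper_bound_normal(mean, std, n, z=1.96):
        # 100(1 - alpha)% upper bound on the reward mean: empirical mean + z * std / sqrt(n);
        # interval estimation greedily plays the arm whose upper bound is highest.
        return mean + z * std / math.sqrt(n)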

25 Variant: UCB Strategy Again, based on the notion of an upper confidence bound, but more generally applicable. Algorithm: – Play each arm once – At time t > K, play the arm i_t maximizing x̄_i + √(2 ln t / n_i), where x̄_i is the empirical mean reward of arm i and n_i is the number of times it has been played so far.
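
A sketch of the UCB1 rule stated above, assuming every arm has already been played once (so all counts are at least 1); the function name is my own:

    import math

    def ucb1_arm(means, counts, t):
        # Pick argmax_i  mean_i + sqrt(2 * ln(t) / n_i): the exploration bonus shrinks as an arm is sampled more
        scores = [m + math.sqrt(2.0 * math.log(t) / n) for m, n in zip(means, counts)]
        return max(range(len(scores)), key=scores.__getitem__)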

26 UCB Strategy

27 Reminder: Chernoff-Hoeffding Bound
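
For reference, the standard form of the bound (for i.i.d. X_1, …, X_n taking values in [0, 1] with mean μ), written in LaTeX:

    P\left( \frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu + a \right) \le e^{-2na^2},
    \qquad
    P\left( \frac{1}{n}\sum_{i=1}^{n} X_i \le \mu - a \right) \le e^{-2na^2}.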

28 UCB Strategy – Behaviour We will not try to prove the following result, but I quote it to tell you why UCB may be a desirable strategy – the expected regret is bounded, growing only logarithmically in the number of plays. K = number of arms.

29 Variation on SoftMax: It is possible to drive regret down by annealing the temperature τ. Exp3: Exponential-weight algorithm for exploration and exploitation. The probability of choosing arm k at time t is p_k(t) = (1 − γ) · w_k(t) / Σ_j w_j(t) + γ/K, where γ is a user-defined open parameter and the weights w_k are updated multiplicatively from importance-weighted rewards.
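
A sketch of the two Exp3 steps, following the usual exponential-weight formulation; the function names are mine and rewards are assumed to be scaled to [0, 1]:

    import numpy as np

    def exp3_probabilities(weights, gamma):
        # P(k) = (1 - gamma) * w_k / sum_j w_j + gamma / K
        w = np.asarray(weights, dtype=float)
        return (1.0 - gamma) * w / w.sum() + gamma / len(w)

    def exp3_update(weights, chosen, reward, probs, gamma):
        # Importance-weighted update: only the pulled arm's weight changes
        x_hat = reward / probs[chosen]
        weights[chosen] *= np.exp(gamma * x_hat / len(weights))
        return weights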

30 The Gittins Index Each arm delivers reward with some probability. This probability may change through time, but only when the arm is pulled. The goal is to maximize discounted rewards – the future is discounted by an exponential discount factor β. The structure of the problem is such that all you need to do is compute an “index” for each arm and play the one with the highest index. The index is of the form given below.
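
The index formula in its usual form (a hedged reconstruction, with β the discount factor and τ a stopping time): the index of arm i is the best achievable ratio of expected discounted reward to expected discounted time,

    \nu_i \;=\; \sup_{\tau > 0} \frac{\mathbb{E}\left[ \sum_{t=0}^{\tau - 1} \beta^{t} \, r_i(t) \right]}{\mathbb{E}\left[ \sum_{t=0}^{\tau - 1} \beta^{t} \right]}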

31 Gittins Index – Intuition Proving optimality isn’t within our scope, but it is based on the notion of a stopping time: the point where you should ‘terminate’ a bandit. Nice property: the Gittins index for any given bandit is independent of the expected outcome of all other bandits – Once you have a good arm, keep playing until there is a better one – If you add/remove machines, the computation doesn’t really change. BUT: – hard to compute, even when you know the distributions – exploration issues; an arm’s index isn’t updated unless the arm is used (restless bandits?)

32 Numerous Applications!

33 Extending the MAB Model In this lecture, we are in a single casino and the only decision is to pull from a set of n arms – except perhaps in the very last slides, exactly one state! Next: what if there is more than one state? In this state space, what is the effect of the distribution of payout changing based on how you pull arms? What happens if you only obtain a net reward corresponding to a long sequence of arm pulls (at the end)?

34 Acknowledgements Many slides are adapted from web resources associated with Sutton and Barto’s Reinforcement Learning book.

