
1 Reinforcement Learning Evaluative Feedback and Bandit Problems Subramanian Ramamoorthy School of Informatics 20 January 2012

2 Recap: What is Reinforcement Learning? An approach to Artificial Intelligence. Learning from interaction; goal-oriented learning; learning about, from, and while interacting with an external environment. Learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. Can be thought of as a stochastic optimization over time.

3 Recap: The Setup for RL The agent is temporally situated, with continual learning and planning. The objective is to affect the environment – actions and states. The environment is uncertain and stochastic. [Diagram: the agent sends actions to the environment and receives states and rewards in return.]

4 Recap: Key Features of RL The learner is not told which actions to take. Trial-and-error search. Possibility of delayed reward – sacrifice short-term gains for greater long-term gains. The need to explore and exploit. Consider the whole problem of a goal-directed agent interacting with an uncertain environment.

5 Multi-arm Bandits (MAB) N possible actions. You can play for some period of time and you want to maximize reward (expected utility). Which is the best arm/machine? DEMO

6 Real-Life Version Choose the best content to display to the next visitor of your commercial website. Content options = slot machines; reward = user's response (e.g., click on an ad). Also, clinical trials: arm = treatment, reward = patient cured. Simplifying assumption: no context (no visitor profiles). In practice, we want to solve contextual bandit problems, but that is for later discussion.

7 What is the Choice?

8 n-Armed Bandit Problem Choose repeatedly from one of n actions; each choice is called a play. After each play a_t, you get a reward r_t, where E[r_t | a_t = a] = Q*(a). These Q*(a) are the unknown action values; the distribution of r_t depends only on a_t. The objective is to maximize the reward in the long term, e.g., over 1000 plays. To solve the n-armed bandit problem, you must explore a variety of actions and exploit the best of them.
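
As a concrete illustration of this setup (not from the original slides), here is a minimal Python sketch of an n-armed bandit whose true action values Q*(a) are fixed but hidden from the player; the class name, the Gaussian reward model and all parameters are illustrative assumptions.

    import numpy as np

    class NArmedBandit:
        """Minimal n-armed testbed: each arm a has a fixed true value q_star[a];
        pulling it returns a noisy reward whose mean is q_star[a]."""
        def __init__(self, n_arms=10, reward_std=1.0, rng=None):
            self.rng = rng or np.random.default_rng()
            self.q_star = self.rng.normal(0.0, 1.0, size=n_arms)  # unknown action values Q*(a)
            self.reward_std = reward_std

        def pull(self, a):
            # The reward distribution depends only on the chosen action a
            return self.rng.normal(self.q_star[a], self.reward_std)

        def optimal_arm(self):
            return int(np.argmax(self.q_star))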

9 Exploration/Exploitation Dilemma Suppose you form action-value estimates Q_t(a) ≈ Q*(a). The greedy action at time t is a_t* = argmax_a Q_t(a). You can’t exploit all the time; you can’t explore all the time. You can never stop exploring; but you could reduce exploring. Why?

10 Action-Value Methods Methods that adapt action-value estimates and nothing else, e.g.: suppose by the t-th play, action a had been chosen k_a times, producing rewards r_1, r_2, …, r_{k_a}; then the “sample average” estimate is Q_t(a) = (r_1 + r_2 + … + r_{k_a}) / k_a.
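
A one-function sketch of the sample-average estimate described above (function and variable names are my own, not from the slides):

    def sample_average(rewards_for_a):
        # Q_t(a) = (r_1 + ... + r_{k_a}) / k_a over the k_a rewards received from arm a
        return sum(rewards_for_a) / len(rewards_for_a) if rewards_for_a else 0.0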

11 Remark The simple greedy action selection strategy is a_t = a_t* = argmax_a Q_t(a). Why might this be insufficient? You are estimating, online, from a few samples. How will this behave? DEMO

12 ε-Greedy Action Selection Greedy action selection: a_t = a_t* = argmax_a Q_t(a). ε-Greedy: with probability 1 − ε take the greedy action a_t*, with probability ε take a random action – the simplest way to balance exploration and exploitation.
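
A minimal sketch of ε-greedy selection, assuming Q is an array of current action-value estimates and rng is a numpy random generator (the names are illustrative):

    import numpy as np

    def epsilon_greedy(Q, epsilon, rng):
        # With probability epsilon explore (random arm); otherwise exploit (greedy arm)
        if rng.random() < epsilon:
            return int(rng.integers(len(Q)))
        return int(np.argmax(Q))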

13 Worked Example: 10-Armed Testbed n = 10 possible actions. Each true value Q*(a) is chosen from a normal distribution N(0, 1); each reward r_t is also normal, N(Q*(a_t), 1). 1000 plays; repeat the whole thing 2000 times and average the results.
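
A hedged sketch of the testbed experiment as described on this slide (2000 independent problems, 1000 plays each, sample-average estimates with ε-greedy selection; all names and the seed are my own choices):

    import numpy as np

    def run_testbed(epsilon, n_runs=2000, n_plays=1000, n_arms=10, seed=0):
        """Average reward per play, averaged over many independent bandit problems."""
        rng = np.random.default_rng(seed)
        avg_reward = np.zeros(n_plays)
        for _ in range(n_runs):
            q_star = rng.normal(0.0, 1.0, n_arms)   # true values, drawn fresh per problem
            Q = np.zeros(n_arms)                    # sample-average estimates
            counts = np.zeros(n_arms)
            for t in range(n_plays):
                a = int(rng.integers(n_arms)) if rng.random() < epsilon else int(np.argmax(Q))
                r = rng.normal(q_star[a], 1.0)
                counts[a] += 1
                Q[a] += (r - Q[a]) / counts[a]      # incremental sample average (slide 16)
                avg_reward[t] += r
        return avg_reward / n_runs

    # e.g. compare epsilon = 0 (greedy), 0.01 and 0.1, as in the plot on the next slide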

14 ε-Greedy Methods on the 10-Armed Testbed

15 Softmax Action Selection Softmax action selection methods grade action probabilities by estimated values. The most common softmax uses a Gibbs, or Boltzmann, distribution: choose action a on play t with probability e^{Q_t(a)/τ} / Σ_{b=1}^{n} e^{Q_t(b)/τ}, where τ is the “temperature”.
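
A sketch of Gibbs/Boltzmann (softmax) selection under the formula above; the temperature parameter name and the numerical-stability shift are my own choices:

    import numpy as np

    def softmax_action(Q, temperature, rng):
        # P(a) = exp(Q[a]/tau) / sum_b exp(Q[b]/tau); high tau -> near-uniform, low tau -> near-greedy
        prefs = np.asarray(Q, dtype=float) / temperature
        prefs -= prefs.max()                 # subtract the max for numerical stability
        probs = np.exp(prefs)
        probs /= probs.sum()
        return int(rng.choice(len(Q), p=probs))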

16 Incremental Implementation Sample average estimation method: the average of the first k rewards is (dropping the dependence on a) Q_k = (r_1 + r_2 + … + r_k) / k. How to do this incrementally (without storing all the rewards)? We could keep a running sum and count, or, equivalently, update Q_{k+1} = Q_k + (1/(k+1)) [r_{k+1} – Q_k]. This is of the common form NewEstimate = OldEstimate + StepSize [Target – OldEstimate].
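
The incremental update as a one-liner, assuming k counts how many rewards this arm has produced so far, including the new one:

    def incremental_update(old_estimate, reward, k):
        # NewEstimate = OldEstimate + StepSize * (Target - OldEstimate), with StepSize = 1/k
        return old_estimate + (reward - old_estimate) / k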

17 Tracking a Nonstationary Problem Choosing Q_k to be a sample average is appropriate in a stationary problem, i.e., when none of the Q*(a) change over time. But not in a nonstationary problem. Better in the nonstationary case is a constant step size: Q_{k+1} = Q_k + α [r_{k+1} – Q_k] for constant α, 0 < α ≤ 1 – an exponential, recency-weighted average.
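
The constant step-size variant, which weights old rewards by powers of (1 − α) and so tracks a nonstationary problem (α = 0.1 is just an illustrative default):

    def recency_weighted_update(old_estimate, reward, alpha=0.1):
        # Q <- Q + alpha * (r - Q): an exponential, recency-weighted average of past rewards
        return old_estimate + alpha * (reward - old_estimate)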

18 Optimistic Initial Values All methods so far depend on the initial estimates Q_0(a), i.e., they are biased. Encourage exploration: initialize the action values optimistically, e.g., on the 10-armed testbed, use Q_0(a) = 5 for all a.
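
A sketch of the optimistic start on the 10-armed testbed, assuming (as in the textbook example) that the true values are roughly N(0, 1), so 5 is wildly optimistic:

    import numpy as np

    # Every arm initially looks better than it can possibly be, so even a purely greedy
    # learner is pushed to try each arm before its estimate decays to a realistic value.
    Q = np.full(10, 5.0)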

19 Beyond Counting…

20 An Interpretation of MAB Type Problems Related to ‘rewards’

21 MAB is a Special Case of Online Learning

22 How to Evaluate Online Alg.: Regret After you have played for T rounds, you experience a regret: R_T = [reward sum of optimal strategy] – [sum of actual collected rewards]. (The randomness comes from the draw of rewards and from the player’s strategy.) If the average regret per round goes to zero with probability 1, asymptotically, we say the strategy has the no-regret property – roughly, it is guaranteed to converge to an optimal strategy. ε-greedy is sub-optimal (so has some regret). Why?
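
A sketch of regret measured against the best fixed arm, assuming the true values q_star are known to the evaluator (they never are to the player); names are my own:

    import numpy as np

    def regret(q_star, rewards_collected):
        # [expected reward sum of always playing the best arm] - [sum of rewards actually collected]
        T = len(rewards_collected)
        return T * float(np.max(q_star)) - float(np.sum(rewards_collected))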

23 Interval Estimation Attribute to each arm an “optimistic initial estimate” within a certain confidence interval. Greedily choose the arm with the highest optimistic mean (upper bound of the confidence interval). An infrequently observed arm will have an over-valued reward mean, leading to exploration; frequent usage pushes the optimistic estimate towards the true value.

24 Interval Estimation Procedure Associate to each arm a 100(1 − α)% upper band on the reward mean. Assume, e.g., rewards are normally distributed. The arm is observed n times to yield an empirical mean and standard deviation; the α-upper bound then adds a quantile of the normal cumulative distribution function, scaled by std/√n, to the empirical mean. If α is carefully controlled, this could be made a zero-regret strategy – but in general we don’t know how to choose it.
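
A sketch of the upper band under the normal assumption; the z-quantile interface is my own simplification (z ≈ 1.96 corresponds to α = 0.05):

    import math

    def upper_bound_normal(mean, std, n, z=1.96):
        # 100(1 - alpha)% upper bound on the reward mean: empirical mean + z * std / sqrt(n);
        # interval estimation greedily plays the arm whose upper bound is highest.
        return mean + z * std / math.sqrt(n)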

25 Variant: UCB Strategy Again, based on the notion of an upper confidence bound, but more generally applicable. Algorithm: – Play each arm once – At time t > K, play the arm i_t maximizing x̄_i + √(2 ln t / n_i), where x̄_i is the empirical mean reward of arm i and n_i is the number of times it has been played so far.
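
A sketch of the UCB1 rule stated above, assuming every arm has already been played once (so all counts are at least 1); the function name is my own:

    import math

    def ucb1_arm(means, counts, t):
        # Pick argmax_i  mean_i + sqrt(2 * ln(t) / n_i): the exploration bonus shrinks as an arm is sampled more
        scores = [m + math.sqrt(2.0 * math.log(t) / n) for m, n in zip(means, counts)]
        return max(range(len(scores)), key=scores.__getitem__)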

26 UCB Strategy

27 Reminder: Chernoff-Hoeffding Bound
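
For reference, the standard form of the bound (for i.i.d. X_1, …, X_n taking values in [0, 1] with mean μ), written in LaTeX:

    P\left( \frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu + a \right) \le e^{-2na^2},
    \qquad
    P\left( \frac{1}{n}\sum_{i=1}^{n} X_i \le \mu - a \right) \le e^{-2na^2}.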

28 UCB Strategy – Behaviour We will not try to prove the following result, but I quote it to tell you why UCB may be a desirable strategy – the expected regret is bounded, growing only logarithmically in the number of plays. K = number of arms.

29 Variation on SoftMax: It is possible to drive regret down by annealing the temperature τ. Exp3: Exponential-weight algorithm for exploration and exploitation. The probability of choosing arm k at time t is p_k(t) = (1 − γ) · w_k(t) / Σ_j w_j(t) + γ/K, where γ is a user-defined open parameter and the weights w_k are updated multiplicatively from importance-weighted rewards.
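
A sketch of the two Exp3 steps, following the usual exponential-weight formulation; the function names are mine and rewards are assumed to be scaled to [0, 1]:

    import numpy as np

    def exp3_probabilities(weights, gamma):
        # P(k) = (1 - gamma) * w_k / sum_j w_j + gamma / K
        w = np.asarray(weights, dtype=float)
        return (1.0 - gamma) * w / w.sum() + gamma / len(w)

    def exp3_update(weights, chosen, reward, probs, gamma):
        # Importance-weighted update: only the pulled arm's weight changes
        x_hat = reward / probs[chosen]
        weights[chosen] *= np.exp(gamma * x_hat / len(weights))
        return weights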

30 The Gittins Index Each arm delivers reward with some probability. This probability may change through time, but only when the arm is pulled. The goal is to maximize discounted rewards – the future is discounted by an exponential discount factor β. The structure of the problem is such that all you need to do is compute an “index” for each arm and play the one with the highest index. The index is of the form given below.
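
The index formula in its usual form (a hedged reconstruction, with β the discount factor and τ a stopping time): the index of arm i is the best achievable ratio of expected discounted reward to expected discounted time,

    \nu_i \;=\; \sup_{\tau > 0} \frac{\mathbb{E}\left[ \sum_{t=0}^{\tau - 1} \beta^{t} \, r_i(t) \right]}{\mathbb{E}\left[ \sum_{t=0}^{\tau - 1} \beta^{t} \right]}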

31 Gittins Index – Intuition Proving optimality isn’t within our scope, but it is based on the notion of a stopping time: the point where you should ‘terminate’ a bandit. Nice property: the Gittins index for any given bandit is independent of the expected outcome of all other bandits – Once you have a good arm, keep playing until there is a better one – If you add/remove machines, the computation doesn’t really change. BUT: – hard to compute, even when you know the distributions – exploration issues; an arm’s index isn’t updated unless the arm is used (restless bandits?)

32 Numerous Applications!

33 Extending the MAB Model In this lecture, we are in a single casino and the only decision is to pull from a set of n arms – except perhaps in the very last slides, exactly one state! Next: what if there is more than one state? In this state space, what is the effect of the distribution of payout changing based on how you pull arms? What happens if you only obtain a net reward corresponding to a long sequence of arm pulls (at the end)?

34 Acknowledgements Many slides are adapted from web resources associated with Sutton and Barto’s Reinforcement Learning book.

