
1 Reinforcement learning. This material is taken mostly from Dayan and Abbott, ch. 9. Reinforcement learning differs from supervised learning in that there is no all-knowing teacher; the reinforcement signal carries less information. The central problem is temporal credit assignment.

2 Example: spatial learning is impaired by blockade of NMDA receptors (Morris, 1989). Figure: the Morris water maze, in which a rat swims to find a hidden platform.

3

4

5 Solving this problem comprises two separate tasks: 1. predicting reward and 2. choosing the correct action, or, equivalently, 1. policy evaluation (the critic) and 2. policy improvement (the actor).

6 Classical vs. instrumental conditioning. For classical conditioning, think of Pavlov's dog. In instrumental (operant) conditioning the animal is rewarded for “correct” actions and not rewarded, or even punished, for incorrect ones; what the animal does (its policy) matters.

7 Predicting reward – the Rescorla-Wagner rule. Notation: u – stimulus, r – reward, v – expected reward, w – weight (filter). The prediction is v = w u, and the weight is updated with the delta rule w → w + ε δ u, with prediction error δ = r − v. For more than one stimulus, u and w become vectors and v = w · u.

8 Figure panels: learning (r = 1), extinction (r = 0), and random reward.
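A minimal Octave/MATLAB simulation of the Rescorla-Wagner rule in the spirit of these panels; the learning rate, trial counts, and random-reward probability are assumed illustrative values, not taken from the slides.

% Rescorla-Wagner rule: v = w*u, delta = r - v, w -> w + eps*delta*u
eps_lr  = 0.05;                  % learning rate (assumed)
nTrials = 300;
u = 1;                           % stimulus present on every trial
w = 0;                           % initial weight
v = zeros(1, nTrials);           % record of the prediction
for t = 1:nTrials
    if t <= 100
        r = 1;                   % acquisition: reward always delivered
    elseif t <= 200
        r = 0;                   % extinction: reward withheld
    else
        r = double(rand < 0.5);  % random (partial) reward
    end
    v(t)  = w * u;               % predicted reward
    delta = r - v(t);            % prediction error
    w     = w + eps_lr * delta * u;
end
plot(v); xlabel('trial'); ylabel('predicted reward v');

The prediction rises toward 1 during acquisition, decays back to 0 during extinction, and hovers around the reward probability under random reward.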

9 Predicting future reward: temporal difference (TD) learning. In more realistic conditions, especially in operant conditioning, the actual reward may come some time after the signal that predicts it. What we might care about is not the immediate reward at this time point, but the total reward predicted given the choice made at this time. How can we estimate the total reward? The total average future reward at time t is R(t) = ⟨ Σ_{τ=0}^{T−t} r(t+τ) ⟩. Assume that we estimate this with a linear estimator: v(t) = Σ_{τ=0}^{t} w(τ) u(t−τ).

10 Use the δ rule at time t: w(τ) → w(τ) + ε δ(t) u(t−τ), where δ(t) is the difference between the actual future rewards and the prediction of these rewards: δ(t) = Σ_{τ≥0} r(t+τ) − v(t). But we do not know the future rewards Σ_{τ≥0} r(t+τ). Instead we can approximate this sum by r(t) + v(t+1).

11 Which gives us: δ(t) = r(t) + v(t+1) − v(t)   (1). The temporal difference learning rule then becomes: w(τ) → w(τ) + ε [r(t) + v(t+1) − v(t)] u(t−τ)   (2).
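A sketch of rule (2) with a tapped delay-line stimulus representation u(t − τ); the trial length, stimulus and reward times, and learning rate below are assumed for illustration.

% TD(0) learning of future reward with a delayed reward
T       = 250;                   % time steps per trial
nTrials = 1000;
eps_lr  = 0.1;
tStim   = 50;                    % stimulus presented at this time step
tReward = 200;                   % reward delivered at this later time step
w = zeros(1, T);                 % one weight per delay tau
for trial = 1:nTrials
    u = zeros(1, T); u(tStim) = 1;      % stimulus over time
    r = zeros(1, T); r(tReward) = 1;    % reward over time
    v = zeros(1, T+1);                  % v(T+1) stays 0
    for t = 1:T
        v(t) = w(1:t) * u(t:-1:1)';     % v(t) = sum_tau w(tau) u(t-tau)
    end
    for t = 1:T
        delta  = r(t) + v(t+1) - v(t);                  % TD error, eq. (1)
        w(1:t) = w(1:t) + eps_lr * delta * u(t:-1:1);   % update, eq. (2)
    end
end
plot(v(1:T)); xlabel('time step'); ylabel('predicted future reward v(t)');

Over trials the prediction comes to span the interval between stimulus and reward, and the prediction error δ shifts from the time of the reward back to the time of the stimulus.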

12 Dopamine and predicted reward. Activity of VTA dopaminergic neurons in a monkey. A: top – before learning, bottom – after learning. B: after learning; top – with reward, bottom – no reward.

13 Generalizations of TD(0): 1. u can be a vector u, so w is also a vector; this handles more complex or multiple possible stimuli. 2. A discount (decay) term γ: δ = r_a(u) + γ v(u′) − v(u), where u is the current location and u′ is the location moved to after action a. This has the effect of putting a stronger emphasis on rewards that take fewer steps to reach.
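A line-by-line illustration of the discounted TD error for a single transition; γ and the example values below are assumed, not taken from the slides.

% Discounted TD error for one transition from location u to u' under action a
gamma   = 0.9;                   % discount factor, 0 < gamma <= 1 (assumed)
v_u     = 0.5;                   % current estimate of v(u)  (example value)
v_unext = 1.0;                   % current estimate of v(u') (example value)
r_a     = 0;                     % immediate reward for taking action a at u
delta   = r_a + gamma * v_unext - v_u;   % = 0.4; distant rewards count for less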

14 Until now we have seen how to predict a reward. We still need to see how we make decisions about which path to take, or what policy to use. The bee foraging example: the bee chooses between blue and yellow flowers, each flower type yielding a different reward, drawn from P(r_b) and P(r_y).

15 Learn “action values” m_b and m_y (the actor); these will determine which choice to make. Assume r_b = 1, r_y = 2; what is the best choice we can make? The average reward is ⟨r⟩ = P(b) r_b + P(y) r_y, where P(b) and P(y) are the probabilities of choosing blue and yellow. What will maximize this reward?

16 Learn “action values” m_b and m_y; these will determine which choice to make. Use the softmax: P(b) = exp(β m_b) / [exp(β m_b) + exp(β m_y)], and similarly for P(y). This is a stochastic choice; β is a variability parameter. A good choice for the “action values” is to set them to the mean rewards, m_b = ⟨r_b⟩ and m_y = ⟨r_y⟩. This is also called the “indirect actor”.

17 How good is this choice? Assume β = 1, r_b = 1, r_y = 2; what is the average reward?
>> rb=1; ry=2;
>> pb=exp(rb)/(exp(rb)+exp(ry))
pb = 0.2689
>> py=exp(ry)/(exp(rb)+exp(ry))
py = 0.7311
>> r_av=rb*pb+ry*py
r_av = 1.7311

18 This choice can be learned using a delta rule on the chosen action’s value, m_b → m_b + ε (r_b − m_b) when blue is chosen (and similarly for yellow). Figure panels: β = 1 and β = 50, with r_b = 1, r_y = 2 for t < 100 and r_b = 2, r_y = 1 for t > 100.
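A sketch of this indirect actor with softmax choice; the learning rate is an assumed value, and the reward switch at t = 100 follows the slide.

% Indirect actor for the bee foraging task
beta    = 1;                     % try 1 or 50 as in the figure
eps_lr  = 0.1;                   % learning rate (assumed)
T       = 200;
mb = 0; my = 0;                  % action values
choices = zeros(1, T);
for t = 1:T
    if t <= 100
        rb = 1; ry = 2;          % before the switch
    else
        rb = 2; ry = 1;          % after the switch
    end
    pb = exp(beta*mb) / (exp(beta*mb) + exp(beta*my));   % softmax choice probability
    if rand < pb
        mb = mb + eps_lr * (rb - mb);    % chose blue: move m_b toward r_b
        choices(t) = 1;
    else
        my = my + eps_lr * (ry - my);    % chose yellow: move m_y toward r_y
        choices(t) = 0;
    end
end
plot(cumsum(choices) ./ (1:T)); xlabel('trial'); ylabel('fraction of blue choices');

With β = 1 the bee keeps exploring and adapts after the switch; with β = 50 it exploits almost deterministically and tends to be slow to discover that the rewards have changed.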

19 Another option, the “direct actor”, is to set the action values so as to maximize the expected reward ⟨r⟩ = P(b)⟨r_b⟩ + P(y)⟨r_y⟩. This can be done by stochastic gradient ascent on ⟨r⟩, so that generally, for action value m_x given that action a was taken and reward r received, m_x → m_x + ε (r − r_0)(δ_{x,a} − P(x)), where δ_{x,a} = 1 if x = a and 0 otherwise. A good choice for r_0 is the mean of r_x over all possible choices. (See the D&A book, pg. 344.)
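A corresponding sketch of the direct actor on the same two-flower task; the learning rate and the running-average estimate of r_0 are assumed choices.

% Direct actor: stochastic gradient ascent on the expected reward
beta    = 1;
eps_lr  = 0.1;                    % learning rate (assumed)
T       = 500;
rb = 1; ry = 2;                   % mean rewards of the two flowers
m  = [0 0];                       % m(1) = m_b, m(2) = m_y
r0 = 0;                           % running estimate of the mean reward (baseline)
for t = 1:T
    p = exp(beta*m) / sum(exp(beta*m));   % softmax choice probabilities
    if rand < p(1)
        a = 1; r = rb;            % chose blue
    else
        a = 2; r = ry;            % chose yellow
    end
    chosen = [0 0]; chosen(a) = 1;
    m  = m + eps_lr * (r - r0) * (chosen - p);   % direct-actor update
    r0 = r0 + eps_lr * (r - r0);                 % track the mean reward
end
disp(m); disp(p);                 % p(2) should end up close to 1 (yellow pays more)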

20 The maze task and sequential action choice. Policy evaluation: start from an initial random policy and use TD learning to evaluate v(u) at each maze location u. What would the values be for an ideal policy?

21 Policy improvement. Using the direct actor, learn to improve the policy, with the TD error δ = r_a(u) + v(u′) − v(u) playing the role of r − r_0. At A, after a left turn: m_L(A) → m_L(A) + ε δ (1 − P(L;A)) and m_R(A) → m_R(A) − ε δ P(R;A); similarly for a right turn. Note – policy improvement and policy evaluation are best carried out sequentially: evaluate, improve, evaluate, improve, ...

22 Policy evaluation for the initial random policy gives V(A) = 1.75, V(B) = 2.5, V(C) = 1.
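A compact actor-critic sketch for this maze. The terminal rewards (5 or 0 when leaving from B, 2 or 0 when leaving from C) are an assumption chosen to be consistent with the random-policy values above, and the learning rates and β are illustrative.

% Actor-critic on the two-level maze. States: 1 = A, 2 = B, 3 = C.
rewardB = [5 0];                  % assumed rewards for [left right] at B
rewardC = [2 0];                  % assumed rewards for [left right] at C
v = zeros(1, 3);                  % critic: value of each state
m = zeros(3, 2);                  % actor: action values, rows = states, cols = [left right]
eps_c = 0.2; eps_a = 0.2; beta = 1;
for trial = 1:2000
    u = 1;                                            % every trial starts at A
    while true
        p = exp(beta*m(u,:)) / sum(exp(beta*m(u,:))); % softmax over left/right
        a = 1 + (rand >= p(1));                       % 1 = left, 2 = right
        if u == 1
            uNext = 1 + a; r = 0;          vNext = v(uNext);   % A -> B or C, no reward
        elseif u == 2
            uNext = 0;     r = rewardB(a); vNext = 0;          % leave the maze from B
        else
            uNext = 0;     r = rewardC(a); vNext = 0;          % leave the maze from C
        end
        delta = r + vNext - v(u);                        % TD error
        v(u)  = v(u) + eps_c * delta;                    % critic: policy evaluation
        onehot = [0 0]; onehot(a) = 1;
        m(u,:) = m(u,:) + eps_a * delta * (onehot - p);  % actor: policy improvement
        if uNext == 0, break; end
        u = uNext;
    end
end
disp(v)   % approaches [5 5 2] as the policy improves (vs. 1.75, 2.5, 1 for the random policy)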

23

24 Reinforcement learning - summary

