
1 Neural Reinforcement Learning Peter Dayan Gatsby Computational Neuroscience Unit thanks to Yael Niv for some slides

2 Marrian Conditioning
prediction: of important events; control: in the light of those predictions
Ethology – optimality – appropriateness
Psychology – classical/operant conditioning
Computation – dynamic programming – Kalman filtering
Algorithm – TD/delta rules – simple weights
Neurobiology – neuromodulators; midbrain; sub-cortical; cortical structures

3

4 Plan
– prediction; classical conditioning: Rescorla-Wagner/TD learning; dopamine
– action selection; instrumental conditioning: direct/indirect immediate actors; trajectory optimization; multiple mechanisms; vigour
– Bayesian decision theory and computational psychiatry

5 Animals learn predictions (Ivan Pavlov): a Conditioned Stimulus (CS) comes to predict the Unconditioned Stimulus (US), and the Unconditioned Response (reflex) becomes a Conditioned Response (reflex).

6 Animals learn predictions Ivan Pavlov very general across species, stimuli, behaviors

7 But do they really? temporal contiguity is not enough - need contingency 1. Rescorla’s control P(food | light) > P(food | no light)

8 But do they really? contingency is not enough either… need surprise 2. Kamin’s blocking

9 But do they really? seems like stimuli compete for learning 3. Reynolds’ overshadowing

10 Theories of prediction learning: Goals – Explain how the CS acquires “value” – When (under what conditions) does this happen? – Basic phenomena: gradual learning and extinction curves – More elaborate behavioral phenomena – (Neural data). P.S. Why are we looking at old-fashioned Pavlovian conditioning? It is the perfect uncontaminated test case for examining prediction learning on its own.

11 Rescorla & Wagner (1972) – error-driven learning: the change in value is proportional to the difference between the actual and the predicted outcome. Assumptions: 1. learning is driven by error (formalizes the notion of surprise); 2. summation of predictors is linear. A simple model, but very powerful! Explains gradual acquisition & extinction, blocking, overshadowing, conditioned inhibition, and more; predicted overexpectation. Note: the US acts as a “special stimulus”.
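
The slide's equation is an image that does not survive in this transcript; a standard way to write the Rescorla-Wagner rule (notation varies across textbooks; here η is a learning rate and λ the magnitude of the US) is

\[ \Delta V_i \;=\; \eta \,\Big(\lambda \;-\; \sum_{j \in \text{CSs present}} V_j\Big), \]

so every stimulus present on a trial changes its strength in proportion to a shared prediction error, which is what yields blocking, overshadowing and overexpectation.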

12 Rescorla-Wagner learning – how does this explain acquisition and extinction? What would V look like with 50% reinforcement? e.g. 1 1 0 1 0 0 1 1 1 0 0 – what would V be on average after learning? – what would the error term be on average after learning?
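
A minimal sketch of the 50% reinforcement question (single CS; the learning rate, trial count and variable names are illustrative, not from the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.1            # learning rate (illustrative)
V = 0.0              # associative strength of the single CS
values, errors = [], []

for trial in range(2000):
    r = rng.integers(0, 2)   # 50% reinforcement: r is 0 or 1
    delta = r - V            # Rescorla-Wagner prediction error
    V += eta * delta         # error-driven update
    values.append(V)
    errors.append(delta)

# After learning, V hovers around 0.5 (the expected reward), while the
# per-trial error is roughly +0.5 or -0.5 and averages out to ~0.
print(np.mean(values[1000:]), np.mean(errors[1000:]))
```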

13 Rescorla-Wagner learning – how is the prediction on trial t influenced by rewards at times t-1, t-2, …? Recent rewards weigh more heavily. Why is this sensible? Learning rate = forgetting rate! The R-W rule estimates expected reward using a weighted average of past rewards.
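
Unrolling the update makes the point explicit: with a single CS and V_{t+1} = V_t + η(r_t − V_t),

\[ V_{t+1} \;=\; \eta \sum_{k=0}^{t-1} (1-\eta)^k\, r_{t-k} \;+\; (1-\eta)^{t}\, V_1, \]

an exponentially recency-weighted average of past rewards in which the same η governs both learning and forgetting.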

14 Summary so far
– predictions are useful for behavior
– animals (and people) learn predictions (Pavlovian conditioning = prediction learning)
– prediction learning can be explained by an error-correcting learning rule (Rescorla-Wagner): predictions are learned from experiencing the world and comparing predictions to reality
– Marr’s levels: [figure]

15 But: second order conditioning – animals learn that a predictor of a predictor is also a predictor of reward! ⇒ not interested solely in predicting immediate reward. Phase 1, phase 2, test: what do you think will happen? What would Rescorla-Wagner learning predict here?

16 Let's start over: this time from the top. Marr’s 3 levels – the problem: optimal prediction of future reward – want to predict the expected sum of future reward in a trial/episode (N.B. here t indexes time within a trial) – what’s the obvious prediction error? what’s the obvious problem with this?

17 Let's start over: this time from the top. Marr’s 3 levels – the problem: optimal prediction of future reward – want to predict the expected sum of future reward in a trial/episode – Bellman eqn for policy evaluation
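
The equations on this slide are images; written out, the target and the Bellman equation for policy evaluation (within an undiscounted trial, t indexing time in the trial) are

\[ V(t) \;=\; \mathbb{E}\Big[\sum_{\tau \ge t} r_\tau\Big] \;=\; \mathbb{E}[r_t] \;+\; V(t+1). \]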

18 Let's start over: this time from the top. Marr’s 3 levels – the problem: optimal prediction of future reward – the algorithm: temporal difference learning – temporal difference prediction error δt; compare to the Rescorla-Wagner error
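
The temporal-difference prediction error referred to here is

\[ \delta_t \;=\; r_t + V(t+1) - V(t), \qquad V(t) \leftarrow V(t) + \eta\,\delta_t, \]

which reduces to the Rescorla-Wagner error when there is no next-step prediction. Below is a minimal within-trial TD(0) sketch; the cue and reward times, learning rate and trial counts are illustrative choices, not values from the talk. After training, the positive error sits at cue onset rather than at reward delivery, and omitting the reward produces a negative error at the expected reward time, mirroring the dopamine recordings on the next slides.

```python
import numpy as np

T, eta = 20, 0.2
cue_t, reward_t = 5, 15               # illustrative times within the trial
V = np.zeros(T + 1)                   # V[T] = 0: the trial is over

def run_trial(rewarded=True):
    """One TD(0) pass over the trial; returns the prediction-error trace."""
    r = np.zeros(T)
    if rewarded:
        r[reward_t] = 1.0
    delta = np.zeros(T)
    for t in range(T):
        # the cue arrives unpredictably, so pre-cue states keep value 0
        v_next = V[t + 1] if t + 1 >= cue_t else 0.0
        v_now = V[t] if t >= cue_t else 0.0
        delta[t] = r[t] + v_next - v_now
        if t >= cue_t:
            V[t] += eta * delta[t]
    return delta

for _ in range(500):
    run_trial(rewarded=True)

print(run_trial(rewarded=True)[[cue_t - 1, reward_t]])   # ~[1, 0]: error at cue onset, none at reward
print(run_trial(rewarded=False)[[cue_t - 1, reward_t]])  # ~[1, -1]: dip when the reward is omitted
```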

19 Dopamine

20 Rewards rather than Punishments – dopamine cells in VTA/SNc (Schultz et al.) – [figure: TD error V(t) and recordings in three conditions: no prediction; prediction, reward; prediction, no reward]

21 Risk Experiment
– trial timeline: < 1 sec, 0.5 sec, “You won 40 cents”, 5 sec ISI; 2-5 sec ITI
– 19 subjects (dropped 3 non-learners, N=16)
– 3T scanner, TR=2 sec, interleaved
– 234 trials: 130 choice, 104 single stimulus, randomly ordered and counterbalanced
– 5 stimuli: 40¢, 20¢, 0/40¢, 0¢

22 Neural results: Prediction Errors what would a prediction error look like (in BOLD)?

23 Neural results: prediction errors in NAcc – unbiased anatomical ROI in nucleus accumbens (marked per subject; thanks to Laura deSouza) – raw BOLD (avg over all subjects) – can actually decide between different neuroeconomic models of risk

24 Plan
– prediction; classical conditioning: Rescorla-Wagner/TD learning; dopamine
– action selection; instrumental conditioning: direct/indirect immediate actors; trajectory optimization; multiple mechanisms; vigour
– Bayesian decision theory and computational psychiatry

25 Action Selection
– evolutionary specification
– immediate reinforcement: leg flexion; Thorndike puzzle box; pigeon; rat; human matching
– delayed reinforcement: these tasks; mazes; chess

26

27 Immediate Reinforcement – stochastic policy, based on action values:
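
The policy formula on this slide is an image; a common concrete choice is a softmax over the action values, sketched below (the inverse temperature β and the example values are mine):

```python
import numpy as np

def softmax_policy(m, beta=1.0):
    """Probability of each action under a softmax over action values m."""
    z = np.exp(beta * (m - np.max(m)))   # subtract the max for numerical stability
    return z / z.sum()

m = np.array([0.2, 0.8])                 # e.g. learned values of two levers
print(softmax_policy(m, beta=3.0))       # the higher-valued lever is chosen more often
```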

28 Indirect Actor – use the RW rule to learn action values; (the reward contingencies) switch every 100 trials

29 Direct Actor

30 Direct Actor
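
The update rules on these two slides are images; in the same spirit, here is a hedged REINFORCE-style sketch of a direct actor that nudges action propensities along the log-policy gradient, weighted by reward relative to a running baseline (the exact form on the slides may differ; the reward probabilities and learning rates below are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
m = np.zeros(2)                 # action propensities (the actor's parameters)
rbar, eta, nu = 0.0, 0.1, 0.1   # reward baseline, actor and baseline learning rates
p_reward = [0.3, 0.7]           # hypothetical reward probabilities of the two actions

for trial in range(5000):
    p = np.exp(m) / np.exp(m).sum()          # softmax policy from propensities
    a = rng.choice(2, p=p)
    r = float(rng.random() < p_reward[a])
    # direct-actor update: log-policy gradient weighted by how much better
    # this reward was than the running average
    grad = -p
    grad[a] += 1.0
    m += eta * (r - rbar) * grad
    rbar += nu * (r - rbar)                  # track average reward as a baseline

print(np.exp(m) / np.exp(m).sum())           # the policy now favours the richer action
```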

31 Action at a (Temporal) Distance – learning an appropriate action at S1: depends on the actions at S2 and S3; gains no immediate feedback – idea: use prediction as surrogate feedback [figure: states S1, S2, S3 with terminal rewards]

32 Direct Action Propensities – start with a policy; evaluate it; improve it: thus choose L more frequently than R [figure: S1, S2, S3]
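
Combining the two previous ideas, a small actor-critic sketch of slides 31-32: the critic's TD error serves as surrogate feedback for the choice made at S1. The terminal rewards are placeholders (the slide's numbers are garbled in this transcript); with these placeholders the actor ends up choosing L more often than R at S1, as the slide states.

```python
import numpy as np

rng = np.random.default_rng(2)
# placeholder terminal rewards for choosing L (0) or R (1) at S2 and S3
leaf_reward = {("S2", 0): 4.0, ("S2", 1): 0.0, ("S3", 0): 2.0, ("S3", 1): 2.0}
V = {"S1": 0.0, "S2": 0.0, "S3": 0.0}          # critic: state values
m = {s: np.zeros(2) for s in V}                # actor: action propensities
eta_v, eta_m = 0.1, 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for episode in range(5000):
    # choice at S1: no immediate reward, so the critic's value of the
    # successor state provides the surrogate feedback (the TD error)
    a1 = rng.choice(2, p=softmax(m["S1"]))
    s2 = "S2" if a1 == 0 else "S3"
    delta1 = 0.0 + V[s2] - V["S1"]
    V["S1"] += eta_v * delta1
    g1 = -softmax(m["S1"]); g1[a1] += 1.0
    m["S1"] += eta_m * delta1 * g1
    # choice at S2/S3: the terminal reward arrives here
    a2 = rng.choice(2, p=softmax(m[s2]))
    delta2 = leaf_reward[(s2, a2)] - V[s2]
    V[s2] += eta_v * delta2
    g2 = -softmax(m[s2]); g2[a2] += 1.0
    m[s2] += eta_m * delta2 * g2

print({s: np.round(softmax(m[s]), 2) for s in m})
print({s: round(V[s], 2) for s in V})
# with these placeholder rewards the actor comes to prefer L at S1 and at S2,
# and V("S1") climbs toward the best reachable reward
```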

33 Policy – spiraling links between striatum and dopamine system: ventral to dorsal – if the value is too pessimistic, the action is better than average [figure: S1, S2, S3; vmPFC, OFC/dACC, dPFC, SMA, Mx]

34 Variants: SARSA Morris et al, 2006

35 Variants: Q learning Roesch et al, 2007
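
In standard notation, the two variants differ only in which next-step value enters the prediction error:

\[ \text{SARSA:}\quad \delta_t = r_t + Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t), \qquad \text{Q-learning:}\quad \delta_t = r_t + \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t), \]

i.e. the value of the action actually chosen versus the best available action; the Morris et al. and Roesch et al. recordings on the two slides above are interpreted as probing exactly this difference in the dopamine signal.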

36 Summary
– prediction learning: Bellman evaluation
– actor-critic: asynchronous policy iteration
– indirect method (Q learning): asynchronous value iteration

37 Three Decision Makers: tree search; position evaluation; situation memory

38 Tree-Search/Model-Based System
– Tolmanian forward model; forwards/backwards tree search
– motivationally flexible
– OFC; dlPFC; dorsomedial striatum; BLA?
– statistically efficient, computationally catastrophic

39 Habit/Model-Free System
– works by minimizing inconsistency: model-free, cached
– dopamine; CEN? dorsolateral striatum
– motivationally insensitive
– statistically inefficient, computationally congenial

40 Need a Canary... if a → c and c → £££, then do more of a or b? – MB: b – MF: a (or even no effect) [figure: a, b, c]

41 Behaviour – action values depend on both systems: expect that the balance will vary by subject (but be fixed within a subject)
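
One standard way to write the dependence on both systems, as in hybrid model-based/model-free accounts (notation mine), is

\[ Q(s,a) \;=\; w\,Q_{\mathrm{MB}}(s,a) \;+\; (1-w)\,Q_{\mathrm{MF}}(s,a), \]

with the weight w varying across subjects but treated as fixed within a subject.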

42 Neural Prediction Errors (1 → 2) – note that MB RL does not use this prediction error – training signal? – ventral striatum (anatomical definition)

43 Pavlovian Control Keay & Bandler, 2001

44 Vigour – two components to choice: what (lever pressing, direction to run, meal to choose) and when/how fast/how vigorously – free operant tasks – real-valued DP

45 The model – in each state (S0, S1, S2) choose an (action, latency) pair, e.g. (LP, τ1) or (LP, τ2), from {LP, NP, Other}; each choice incurs costs (a unit cost plus a vigour cost) and may yield rewards (the goal, with probability PR and utility UR) – how fast? the latency τ is part of the choice

46 The model (continued) – goal: choose actions and latencies to maximize the average rate of return (rewards minus costs per unit time) – average-reward RL (ARL)

47 Average Reward RL (extension of Schwartz 1993) – compute differential values of actions: QL,τ(x) = Rewards – Costs + Future Returns, the differential value of taking action L with latency τ in state x – ρ = average rewards minus costs, per unit time – steady-state behavior (not learning dynamics)
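
The slide's expression is an image; one way to unpack "Rewards − Costs + Future Returns" in the average-reward setting (following the structure of the Niv et al. model; the exact cost terms on the slide may differ) is

\[ Q_{L,\tau}(x) \;=\; \underbrace{P_R\,U_R}_{\text{expected reward}} \;-\; \underbrace{\Big(C_u + \frac{C_v}{\tau}\Big)}_{\text{unit + vigour cost}} \;-\; \underbrace{\rho\,\tau}_{\text{opportunity cost of time}} \;+\; V(x'), \]

where V(x') is the differential value of the successor state.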

48 Average Reward Cost/Benefit Tradeoffs – 1. Which action to take? Choose the action with the largest expected reward minus cost. – 2. How fast to perform it? Going slow is less costly (vigour cost), but it delays (all) rewards; the net rate of rewards is the cost of delay (the opportunity cost of time), so choose the rate that balances vigour and opportunity costs. – Explains faster (irrelevant) actions under hunger, etc.; masochism.
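
Under the sketch above, only the vigour cost and the opportunity cost depend on the latency, so setting the derivative with respect to τ to zero gives

\[ \frac{C_v}{\tau^2} - \rho = 0 \;\;\Rightarrow\;\; \tau^{*} = \sqrt{C_v/\rho}, \]

i.e. the optimal latency shortens as the net reward rate ρ rises, which is the energizing effect of motivation shown a few slides later.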

49 Optimal response rates – experimental data vs. model simulation (Niv, Dayan, Joel, unpublished) [figures: latency to 1st nose poke vs. seconds since reinforcement]

50 Optimal response rates – model simulation vs. experimental data (Herrnstein 1961; pigeons A and B): % responses on lever/key A vs. % reinforcements on lever/key A, close to perfect matching – more: # responses, interval length, amount of reward, ratio vs. interval, breaking point, temporal structure, etc.

51 Effects of motivation (in the model) – RR25 schedule: mean latency of LP and Other under low vs. high utility – energizing effect

52 Relation to Dopamine – phasic dopamine firing = reward prediction error – what about tonic dopamine?

53 Tonic dopamine hypothesis (Satoh and Kimura 2003; Ljungberg, Apicella and Schultz 1992) [figures: firing rate; reaction time] – …also explains effects of phasic dopamine on response times

54 Plan
– prediction; classical conditioning: Rescorla-Wagner/TD learning; dopamine
– action selection; instrumental conditioning: direct/indirect immediate actors; trajectory optimization; multiple mechanisms; vigour
– Bayesian decision theory and computational psychiatry

55 The Computational Level – choice: the action that maximises the expected utility over the states [figure labels: sensory cortex, PFC; striatum, (pre-)motor cortex, PFC; hypothalamus, VTA/SNc, OFC]
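
Written out (notation mine), the computational-level choice is

\[ a^{*} \;=\; \arg\max_{a} \sum_{s} p(s \mid \text{observations})\; U(s, a), \]

maximising expected utility over the posterior distribution on states.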

56 How Does the Brain Fail to Work?
– wrong problem: priors; likelihoods; utilities
– right problem, wrong inference: e.g., early/over-habitization; over-reliance on Pavlovian mechanisms
– right problem, right inference, wrong environment: learned helplessness; discounting from inconsistency
– not completely independent: e.g., miscalibration

57 Conditioning
prediction: of important events; control: in the light of those predictions
Ethology – optimality – appropriateness
Psychology – classical/operant conditioning
Computation – dynamic programming – Kalman filtering
Algorithm – TD/delta rules – simple weights
Neurobiology – neuromodulators; amygdala; OFC; nucleus accumbens; dorsal striatum

