Neural Reinforcement Learning. Peter Dayan, Gatsby Computational Neuroscience Unit. Thanks to Yael Niv for some slides.

Marrian Conditioning
prediction: of important events; control: in the light of those predictions
Ethology – optimality – appropriateness
Psychology – classical/operant conditioning
Computation – dynamic programming – Kalman filtering
Algorithm – TD/delta rules – simple weights
Neurobiology – neuromodulators; midbrain; sub-cortical and cortical structures

Plan
prediction; classical conditioning – Rescorla-Wagner/TD learning – dopamine
action selection; instrumental conditioning – direct/indirect immediate actors – trajectory optimization – multiple mechanisms – vigour
Bayesian decision theory and computational psychiatry

Animals learn predictions (Ivan Pavlov). [Figure: the conditioned stimulus and unconditioned stimulus, with the unconditioned response (reflex) and conditioned response (reflex).]

Animals learn predictions (Ivan Pavlov): very general across species, stimuli, and behaviors.

But do they really? Temporal contiguity is not enough; we need contingency. 1. Rescorla's control: P(food | light) > P(food | no light).

But do they really? Contingency is not enough either; we need surprise. 2. Kamin's blocking.

But do they really? It seems that stimuli compete for learning. 3. Reynolds' overshadowing.

Theories of prediction learning: goals
Explain how the CS acquires "value": when (under what conditions) does this happen?
Basic phenomena: gradual learning and extinction curves; more elaborate behavioral phenomena; (neural data).
P.S. Why are we looking at old-fashioned Pavlovian conditioning? Because it is the perfect uncontaminated test case for examining prediction learning on its own.

Rescorla & Wagner (1972): error-driven learning. The change in value is proportional to the difference between the actual and the predicted outcome.
Assumptions: 1. learning is driven by error (this formalizes the notion of surprise); 2. summation of predictors is linear.
A simple model, but very powerful: it explains gradual acquisition and extinction, blocking, overshadowing, conditioned inhibition, and more, and it predicted overexpectation. Note: the US acts as a "special stimulus".
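In standard notation (the choice of symbols is mine, not the slide's), with $V_i$ the associative strength of conditioned stimulus $i$, $\lambda$ the strength of the US on the trial, and $\alpha$ a learning rate, the rule reads

$$\Delta V_i = \alpha \Big(\lambda - \sum_{j \in \text{CSs present}} V_j\Big),$$

so learning stops once the summed prediction matches the outcome.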

Rescorla-Wagner learning
How does this explain acquisition and extinction? What would V look like with 50% reinforcement? E.g., what would V be on average after learning, and what would the error term be on average after learning?
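A minimal simulation sketch of the partial-reinforcement question (the parameter values and trial count are illustrative assumptions, not taken from the slides):

    import numpy as np

    # Rescorla-Wagner under 50% partial reinforcement (illustrative parameters)
    alpha, lam = 0.1, 1.0            # learning rate, US magnitude
    V = 0.0
    rng = np.random.default_rng(0)

    for trial in range(500):
        outcome = lam if rng.random() < 0.5 else 0.0   # reinforced on half the trials
        error = outcome - V                            # prediction error
        V += alpha * error                             # delta-rule update

    print(V)   # fluctuates around 0.5 * lam; the error averages roughly zero

So after learning, V hovers around half the US magnitude and the average prediction error is approximately zero, which answers the two questions above.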

Rescorla-Wagner learning
How is the prediction on trial t influenced by the rewards at times t-1, t-2, ...? The R-W rule estimates the expected reward using a weighted average of past rewards, in which recent rewards weigh more heavily. Why is this sensible? Learning rate = forgetting rate!
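Unrolling the update makes the weighting explicit (with learning rate $\epsilon$ and reward $r_t$ on trial $t$; the symbols are my shorthand):

$$V_t \approx \epsilon \sum_{k \ge 1} (1-\epsilon)^{k-1}\, r_{t-k},$$

so each additional trial into the past is discounted by a further factor of $(1-\epsilon)$: the same parameter sets how fast new information is learned and how fast old information is forgotten.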

Summary so far
Predictions are useful for behavior. Animals (and people) learn predictions (Pavlovian conditioning = prediction learning). Prediction learning can be explained by an error-correcting learning rule (Rescorla-Wagner): predictions are learned from experiencing the world and comparing predictions to reality. [Marr's levels recap.]

But: second-order conditioning
Animals learn that a predictor of a predictor is also a predictor of reward! So they are not interested solely in predicting immediate reward. [Figure: phase 1; phase 2; test: ?] What do you think will happen? What would Rescorla-Wagner learning predict here?

Let's start over, this time from the top. Marr's 3 levels. The problem: optimal prediction of future reward. We want to predict the expected sum of future reward in a trial/episode (N.B. here t indexes time within a trial). What's the obvious prediction error? What's the obvious problem with this?

Let's start over, this time from the top. Marr's 3 levels. The problem: optimal prediction of future reward. We want to predict the expected sum of future reward in a trial/episode; this gives the Bellman equation for policy evaluation.
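In symbols (with t indexing time within a trial and no discounting within the episode; the notation is assumed rather than copied from the slide):

$$V_t = \mathbb{E}\Big[\sum_{\tau \ge t} r_\tau\Big] = \mathbb{E}[r_t] + V_{t+1},$$

where the second equality is the Bellman equation for policy evaluation: the value now is the expected immediate reward plus the value of the next time step.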

Let's start over, this time from the top. Marr's 3 levels. The problem: optimal prediction of future reward. The algorithm: temporal difference learning, based on the temporal difference prediction error δ_t; compare this to the Rescorla-Wagner error.
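A minimal tabular TD(0) sketch of that update (the trial length, learning rate and single end-of-trial reward are illustrative assumptions):

    import numpy as np

    # TD(0) prediction within a trial: the trial starts at t=0, reward arrives at the end.
    T, alpha = 10, 0.1
    V = np.zeros(T + 1)               # V[T] stays 0: the trial is over
    r = np.zeros(T)
    r[T - 1] = 1.0                    # single reward at the last time step

    for episode in range(200):
        for t in range(T):
            delta = r[t] + V[t + 1] - V[t]   # temporal difference prediction error
            V[t] += alpha * delta            # move the prediction toward its target

    # During learning the prediction error propagates backwards from the time of
    # reward toward earlier time steps; this is the pattern compared with the
    # dopamine recordings on the following slides.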

Dopamine

Rewards rather than Punishments
[Figure: dopamine cells in VTA/SNc (Schultz et al.) compared with the TD error and value V(t), in three conditions: no prediction, reward; prediction, reward; prediction, no reward.]

Risk Experiment
[Trial schematic: stimulus < 1 sec; 0.5 sec; 5 sec ISI; feedback "You won 40 cents"; 2-5 sec ITI.]
19 subjects (3 non-learners dropped, N = 16); 3T scanner, TR = 2 sec, interleaved; 234 trials (130 choice, 104 single stimulus), randomly ordered and counterbalanced; 5 stimuli: 40¢, 20¢, 0/40¢, 0¢.

Neural results: prediction errors. What would a prediction error look like (in BOLD)?

Neural results: prediction errors in NAc
Unbiased anatomical ROI in the nucleus accumbens, marked per subject (thanks to Laura deSouza). The raw BOLD signal (averaged over all subjects) can actually decide between different neuroeconomic models of risk.

Plan
prediction; classical conditioning – Rescorla-Wagner/TD learning – dopamine
action selection; instrumental conditioning – direct/indirect immediate actors – trajectory optimization – multiple mechanisms – vigour
Bayesian decision theory and computational psychiatry

Action Selection
Evolutionary specification
Immediate reinforcement: – leg flexion – Thorndike puzzle box – pigeon; rat; human matching
Delayed reinforcement: – these tasks – mazes – chess

Immediate Reinforcement: a stochastic policy based on action values.

Indirect Actor: use the RW rule; switch every 100 trials.
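A sketch of the indirect actor: action values learned with the RW/delta rule and converted into a stochastic (softmax) policy (the two-armed bandit, the inverse temperature and the reward probabilities are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(1)
    alpha, beta = 0.1, 2.0                 # learning rate, inverse temperature
    m = np.zeros(2)                        # action values for two levers
    p_reward = np.array([0.8, 0.2])        # illustrative reward probabilities

    for trial in range(1000):
        if trial > 0 and trial % 100 == 0:
            p_reward = p_reward[::-1]      # contingencies switch every 100 trials
        pi = np.exp(beta * m) / np.exp(beta * m).sum()   # softmax policy
        a = rng.choice(2, p=pi)
        r = float(rng.random() < p_reward[a])
        m[a] += alpha * (r - m[a])         # RW/delta-rule update of the chosen value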

Direct Actor

Direct Actor
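By contrast, a direct actor adjusts action propensities themselves, using reward relative to a running average as the reinforcement signal (a sketch in the spirit of the direct actor; the baseline update and parameter values are assumptions):

    import numpy as np

    rng = np.random.default_rng(2)
    alpha, beta = 0.05, 1.0
    m = np.zeros(2)                        # action propensities (not value estimates)
    r_bar = 0.0                            # running average reward, used as a baseline
    p_reward = np.array([0.8, 0.2])        # illustrative reward probabilities

    for trial in range(1000):
        pi = np.exp(beta * m) / np.exp(beta * m).sum()
        a = rng.choice(2, p=pi)
        r = float(rng.random() < p_reward[a])
        # raise the propensity of the chosen action and lower the others,
        # in proportion to how much better than average the outcome was
        for b in range(2):
            m[b] += alpha * (r - r_bar) * ((b == a) - pi[b])
        r_bar += 0.05 * (r - r_bar)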

Action at a (Temporal) Distance
Learning an appropriate action at S1 – depends on the actions at S2 and S3 – gains no immediate feedback. Idea: use the prediction as surrogate feedback. [Figure: maze with states S1, S2, S3.]

Direct Action Propensities
[Figure: maze with states S1, S2, S3.] Start with a policy; evaluate it; improve it: thus choose L more frequently than R.
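Putting the critic and the actor together gives a sketch of this evaluate-and-improve loop for a small maze: the critic's TD error stands in for the missing immediate feedback at S1 (the state and transition layout follows the slides, but the reward magnitudes and parameters are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(3)
    alpha, beta = 0.1, 1.0
    V = np.zeros(3)                        # critic: values of S1, S2, S3
    m = np.zeros((3, 2))                   # actor: propensities for L/R in each state
    # illustrative terminal rewards: (second-stage state, action) -> reward
    rewards = {(1, 0): 0.0, (1, 1): 5.0, (2, 0): 0.0, (2, 1): 2.0}

    def softmax(x):
        e = np.exp(beta * (x - x.max()))
        return e / e.sum()

    for trial in range(2000):
        # at S1: choosing L (0) leads to S2, R (1) leads to S3, with no reward yet
        a1 = rng.choice(2, p=softmax(m[0]))
        s = 1 + a1
        delta = 0.0 + V[s] - V[0]          # TD error: surrogate feedback for a1
        m[0, a1] += alpha * delta          # improve the policy at S1
        V[0] += alpha * delta              # evaluate S1
        # at S2/S3: the chosen action yields the terminal reward
        a2 = rng.choice(2, p=softmax(m[s]))
        delta = rewards[(s, a2)] - V[s]
        m[s, a2] += alpha * delta
        V[s] += alpha * delta

    # Because the left branch offers the larger best reward in this toy setup,
    # m[0, 0] grows and L ends up chosen more frequently than R, as on the slide.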

Policy
Spiraling links between the striatum and the dopamine system run from ventral to dorsal. If the value is too pessimistic, the action just taken is better than average (a positive prediction error), so its propensity should increase. [Figure: S1, S2, S3; vmPFC, OFC/dACC, dPFC, SMA, Mx.]

Variants: SARSA (Morris et al., 2006)

Variants: Q-learning (Roesch et al., 2007)
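For reference, the two variants differ only in the target of the prediction error (standard textbook forms with discount factor γ; symbols are mine):

SARSA (on-policy): $\delta_t = r_t + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$
Q-learning (off-policy): $\delta_t = r_t + \gamma\, \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)$

Morris et al. (2006) read dopamine responses as closer to the SARSA form; Roesch et al. (2007) read them as closer to the Q-learning form.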

Summary
prediction learning – Bellman evaluation
actor-critic – asynchronous policy iteration
indirect method (Q-learning) – asynchronous value iteration

Three Decision Makers: tree search; position evaluation; situation memory.

Tree-Search/Model-Based System
Tolmanian forward model; forwards/backwards tree search; motivationally flexible; statistically efficient; computationally catastrophic. Substrates: OFC; dlPFC; dorsomedial striatum; BLA?

Habit/Model-Free System
Works by minimizing inconsistency: model-free, cached; motivationally insensitive; statistically inefficient; computationally congenial. Substrates: dopamine; CEN?; dorsolateral striatum.

Need a Canary...
If a → c and c → £££, then do more of a or b? – MB: b – MF: a (or even no effect). [Figure: a, b, c.]

Behaviour
Action values depend on both systems; we expect the relative weighting to vary by subject (but to be fixed within a subject).
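One conventional way to write this down (a sketch in the spirit of such analyses; the weighting parameter w is an assumption of the sketch, to be estimated per subject):

$$Q(s, a) = w\, Q_{\mathrm{MB}}(s, a) + (1 - w)\, Q_{\mathrm{MF}}(s, a),$$

with choices made by a softmax over the combined Q; w is then expected to differ across subjects while staying fixed for each one.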

Neural Prediction Errors (1 → 2)
Note that MB RL does not use this prediction error – a training signal? [Figure: R ventral striatum (anatomical definition).]

Pavlovian Control Keay & Bandler, 2001

Vigour
Two components to choice: – what: lever pressing; direction to run; meal to choose – when/how fast/how vigorously. Free operant tasks; real-valued DP.

The model
[Figure: at each state the agent chooses (action, τ) pairs, e.g. (LP, τ1) or (LP, τ2); each choice incurs costs (a unit cost plus a vigour cost for acting quickly) and may yield rewards; states S0, S1, S2; actions LP, NP, Other; reward with probability P_R and utility U_R at the goal; "how fast" is the latency τ.]

The model
[Figure: as above, choosing (action, τ) over states S0, S1, S2.] Goal: choose actions and latencies to maximize the average rate of return (rewards minus costs per unit time) – average reward RL (ARL).

Average Reward RL: compute differential values of actions
The differential value of taking action L with latency τ when in state x is Q_{L,τ}(x) = Rewards – Costs + Future Returns, where ρ = average rewards minus costs per unit time. This describes steady-state behavior (not learning dynamics), and is an extension of Schwartz (1993).
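A hedged sketch of that quantity under the cost structure in the model figure (the specific terms are assumptions of this sketch rather than a quotation of the slides):

$$Q_{L,\tau}(x) = \underbrace{P_R U_R}_{\text{expected reward}} - \underbrace{\Big(C_u + \frac{C_v}{\tau}\Big)}_{\text{unit + vigour cost}} - \underbrace{\tau \rho}_{\text{opportunity cost of time}} + \underbrace{V(x')}_{\text{future returns}},$$

where ρ is the average net reward per unit time, so waiting longer is charged at the going rate of reward.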

Average Reward Cost/Benefit Tradeoffs
1. Which action to take? Choose the action with the largest expected reward minus cost.
2. How fast to perform it? Acting slowly is less costly (vigour cost), but acting slowly delays all rewards, so the net rate of rewards sets the cost of delay (the opportunity cost of time). Choose the rate that balances vigour and opportunity costs. This explains faster (even irrelevant) actions under hunger, etc.; masochism.
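With the terms sketched above, balancing the vigour cost against the opportunity cost gives the optimal latency directly (a derivation under those assumed costs, not a quote from the slides):

$$\frac{\partial Q_{L,\tau}}{\partial \tau} = \frac{C_v}{\tau^2} - \rho = 0 \;\Rightarrow\; \tau^* = \sqrt{C_v / \rho},$$

so anything that raises the average reward rate ρ (e.g. higher utility of food when hungry) shortens all latencies – the energizing effect shown on the motivation slide below.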

Optimal response rates
[Figure: experimental data (Niv, Dayan, Joel, unpublished) and model simulation; latency to first nose poke as a function of seconds since reinforcement.]

Optimal response rates
[Figure: model simulation (% responses on lever A vs. % reinforcements on lever A) shows near-perfect matching, alongside experimental data from Herrnstein (1961): pigeons A and B, % responses on key A vs. % reinforcements on key A.]
More: number of responses; interval length; amount of reward; ratio vs. interval schedules; breaking point; temporal structure; etc.

Effects of motivation (in the model)
[Figure: RR25 schedule; mean latency of LP and Other responses under low vs. high utility; the energizing effect.]

Relation to Dopamine
What about tonic dopamine? Phasic dopamine firing = reward prediction error. [Figure: more/less.]

Tonic dopamine hypothesis
[Figure: reaction time and firing rate data from Satoh and Kimura (2003) and Ljungberg, Apicella and Schultz (1992).] ...also explains effects of phasic dopamine on response times.

Plan
prediction; classical conditioning – Rescorla-Wagner/TD learning – dopamine
action selection; instrumental conditioning – direct/indirect immediate actors – trajectory optimization – multiple mechanisms – vigour
Bayesian decision theory and computational psychiatry

The Computational Level
Choice: the action that maximises the expected utility over the states. [Figure: sensory cortex, PFC; striatum, (pre-)motor cortex, PFC; hypothalamus, VTA/SNc, OFC.]
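In the standard Bayesian decision-theoretic form (notation mine), with states s inferred from sensory data and U(s, a) the utility of acting in a state:

$$a^* = \arg\max_a \sum_s P(s \mid \text{data})\; U(s, a),$$

so the three ingredients are the inferred states, their utilities, and the action chosen to maximise expected utility.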

How Does the Brain Fail to Work?
wrong problem – priors; likelihoods; utilities
right problem, wrong inference – e.g., early/over-habitization – over-reliance on Pavlovian mechanisms
right problem, right inference, wrong environment – learned helplessness – discounting from inconsistency
These are not completely independent: e.g., miscalibration.

Conditioning
prediction: of important events; control: in the light of those predictions
Ethology – optimality – appropriateness
Psychology – classical/operant conditioning
Computation – dynamic programming – Kalman filtering
Algorithm – TD/delta rules – simple weights
Neurobiology – neuromodulators; amygdala; OFC; nucleus accumbens; dorsal striatum