Neuroimaging of associative learning


Neuroimaging of associative learning
John O’Doherty
Functional Imaging Lab, Wellcome Department of Imaging Neuroscience, Institute of Neurology, Queen Square, London
Collaborators on this project: Peter Dayan, Ray Dolan, Karl Friston, Hugo Critchley, Ralf Deichmann
Acknowledgements to: Eric Featherstone, Peter Aston

Classical conditioning (CS → UCS)
1/ How Pavlovian value predictions are learned
2/ Where this is implemented in the human brain
3/ Extending the approach to instrumental conditioning

How value predictions are learned (CS → UCS)
Learning is mediated by a prediction error (Rescorla and Wagner, 1972; Pearce & Hall, 1980):
δ = (r − v)
where r = reward received on a given trial (the UCS), and v = expected reward, i.e. the value of the CS.
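The delta rule above can be sketched in a few lines. This is an illustrative sketch only: the function name and the learning rate α are assumptions, not values from the talk.

```python
# Minimal sketch of the Rescorla-Wagner delta rule for a single CS.
# The function name and learning rate alpha are illustrative assumptions.
def rescorla_wagner(rewards, alpha=0.1):
    """Track the CS value v across trials via delta = r - v."""
    v = 0.0
    values = []
    for r in rewards:
        delta = r - v          # prediction error, as on the slide
        v += alpha * delta     # delta-rule update of the CS value
        values.append(v)
    return values

# With reward on every trial, v climbs from 0 toward r = 1:
vals = rescorla_wagner([1.0] * 50, alpha=0.1)
```

As v approaches the delivered reward, δ (and hence further learning) approaches zero, which is why acquisition curves are negatively accelerated.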

Temporal difference learning (TD learning)
Differs from previous trial-based theories: predictions are learned about the total future reward available within a trial, for each time t at which a CS is present (Sutton and Barto, 1989; Montague et al., 1996; Schultz, Dayan and Montague, 1997).
[Figure: within-trial timeline, time steps 0-5, with CS and CS2 onsets marked.]

Temporal difference learning model
[Figure: model responses to CS and UCS before learning, during learning, and after unexpected omission of reward.]

Reward Learning: Pavlovian Appetitive Conditioning
Single-unit recordings from dopamine neurons revealed that these neurons produce responses consistent with TD learning (Schultz, 1998):
(1) Transferring their responses from the time of presentation of the reward to the time of presentation of the CS during learning.
(2) Decreasing firing below baseline at the time the reward was expected, following omission of an expected reward.
(3) Responding at the time of the reward following unexpected delivery of reward.
(From Schultz, Dayan and Montague, 1997)
Can we find evidence of a temporal difference prediction error in the human brain during appetitive conditioning?
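A minimal tabular TD(0) simulation reproduces property (1): after learning, the TD error δ(t) = r(t) + V(t+1) − V(t) moves from the time of the reward to the time of the (unpredicted) CS onset. The time grid, trial count and learning rate below are illustrative assumptions, not values from the talk.

```python
# Tabular TD(0) sketch of within-trial learning (all parameters assumed):
# 6 time steps per trial, CS at t=0, reward delivered at t=3.
def td_conditioning(n_trials=200, n_steps=6, reward_time=3, alpha=0.2):
    """Return the learned step values V and the TD errors of the last trial."""
    V = [0.0] * (n_steps + 1)       # V[n_steps] = 0: nothing follows the trial
    deltas = []
    for _ in range(n_trials):
        # CS onset is unpredictable, so the pre-CS prediction is 0 and the
        # TD error at CS onset equals the learned CS value:
        deltas = [V[0] - 0.0]
        for t in range(n_steps):
            r = 1.0 if t == reward_time else 0.0
            delta = r + V[t + 1] - V[t]   # TD prediction error
            V[t] += alpha * delta
            deltas.append(delta)
    return V, deltas

V, deltas = td_conditioning()
# After learning: large delta at CS onset (deltas[0]); near-zero delta at the
# now-predicted reward time (deltas[1 + 3]).
```

Omitting the reward on a probe trial would yield a negative δ at the expected reward time, mirroring property (2).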

Reward circuits in the brain

Experimental Set-Up
- Scanning conducted at 2 Tesla (Siemens)
- Taste delivered using an electronic syringe pump positioned outside the scanner room
- On-line measurement of pupillary responses
- 13 subjects participated (9 found the taste pleasant at the end of scanning)

Experimental Design
Trial types (CS at 0s, UCS at 3s, trial length 6s):
- CS+ → taste reward (0.5ml of 1M glucose)
- CS+omit → expected taste reward omitted
- CSneut → neutral taste
- CS- → no taste
- CS-unexpreward → unexpected taste reward (0.5ml of 1M glucose)
Ratio of regular to ‘surprise’ trials: 4:1

Statistical Analysis
TD-related δ responses across the experiment (CS at 0s, UCS at 3s, trial length 6s):
- δ at time of CS: CS+ early vs. CS+ late trials
- δ at time of UCS: CS+, CS+omit and CS-unexpreward trials

Results: Discriminatory pupillary responses (CS+ vs. CSneut). *Significant at p<0.05.

Results: Areas showing signed TD-related PE responses
Random effects; p<0.001, p<0.05 corrected for small volume (ventral striatum). Different learning rates fit some regions better than others (α=0.2 vs. α=0.7).
From O’Doherty, Dayan, Friston, Critchley and Dolan, Neuron, 2003.

Results
[Figure: % signal change time courses, model vs. real, for early CS+ (trials 1-10), mid CS+ (trials 11-40), late CS+ (trials 41-80), CS+omit and CS-unexpreward trials.]

Interim Conclusions (1)
Responses in parts of the human ventral striatum and orbitofrontal cortex can be described by a theoretical learning model: temporal difference learning.
On the basis of evidence from non-human primates, a likely source of the TD-learning-related activity in these regions is the modulatory influence exerted by the phasic responses of dopamine neurons.

Reward Learning: Instrumental Conditioning
- Response-Outcome
- Stimulus-Response
- Stimulus-Outcome (Pavlovian)

Model of instrumental conditioning: the actor-critic (actor-critic TD(0) learning)
Follow policy π: in state u, take action a with probability P[a; u] (a function of the value of that action), moving from state u to u’.
Critic: v(u) = w·u, the value of the state (averaged over all possible actions), with
δ = r_a(u) + v(u’) − v(u)
Update the weights: w(u) ← w(u) + εδ, where ε = learning rate.
Actor: update the policy π by changing the value of state-action pairs Q for the chosen action a, as well as for all other actions a’:
Q(a’, u) ← Q(a’, u) + ε(δ_aa’ − P[a’; u])δ
where δ_aa’ = 1 if a’ = a and 0 otherwise, and P[a’; u] = probability of taking action a’ in state u.
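The critic and actor updates above can be sketched for a single state with two actions. This is a hedged illustration: the two-armed bandit task, the softmax form chosen for P[a; u], and all parameter values are assumptions, not from the talk.

```python
import math
import random

# Single-state actor-critic sketch following the slide's update rules.
# The bandit task, softmax policy and parameter values are assumptions.
def actor_critic(reward_probs, n_trials=2000, eps=0.1, seed=0):
    """Critic learns v from delta = r - v; actor shifts preferences Q by
    eps * (delta_aa' - P[a']) * delta, as in the slide's update rules."""
    rng = random.Random(seed)
    n = len(reward_probs)
    v = 0.0                          # critic: value of the (single) state
    Q = [0.0] * n                    # actor: state-action preferences
    for _ in range(n_trials):
        m = max(Q)                   # softmax policy P[a] (one common choice)
        e = [math.exp(q - m) for q in Q]
        P = [x / sum(e) for x in e]
        a = rng.choices(range(n), weights=P)[0]
        r = 1.0 if rng.random() < reward_probs[a] else 0.0
        delta = r - v                # single state: no successor-state value
        v += eps * delta             # critic update: w <- w + eps * delta
        for ap in range(n):          # actor update for chosen and other actions
            Q[ap] += eps * ((1.0 if ap == a else 0.0) - P[ap]) * delta
    return v, Q

v, Q = actor_critic([0.6, 0.3])
# The actor comes to prefer the 60%-rewarded action; v tracks mean reward.
```

The same δ drives both updates, which is what motivates looking for a shared prediction-error signal across Pavlovian (critic-only) and instrumental (actor plus critic) conditioning.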

Putative anatomical substrates of the actor-critic
[Figure: VTA/substantia nigra projections to dorsal vs. ventral striatum.]
Ventral striatum = critic; dorsal striatum = actor (Montague et al., 1996). Houk et al. (1995) suggest a matrix/striosome distinction.

Experimental Design
Two trial types, 80 trials of each:
- High valence: choose between a stimulus with 60% probability of fruit juice and one with 30% probability of fruit juice
- Low valence: choose between a stimulus with 60% probability of neutral taste and one with 30% probability of neutral taste

Experimental Design
The design is split into two sessions, Pavlovian and instrumental (each ~15 minutes in duration); the order of presentation was counterbalanced across subjects.
Two different fruit juices were used as the reward: peach juice and blackcurrant juice. To control for habituation in the pleasantness of the juices over the course of the experiment, a different juice was used in the Pavlovian and instrumental tasks for each subject; again, this was counterbalanced across subjects.
Instrumental responses from one subject were used as a ‘yoke’ for the Pavlovian contingencies of another subject.

Behavioral results: instrumental choices. *Significant at p<0.05.

Results: Ventral Striatum
[Figure: Pavlovian conditioning (y=+8, y=+14) and instrumental conditioning, each thresholded at p<0.01 and p<0.001.]

Results: Ventral Striatum
[Figure: conjunction of instrumental and Pavlovian conditioning (y=+8, y=+10).]

Results: Dorsal Striatum
[Figure: instrumental conditioning vs. Pavlovian conditioning at y=+20, thresholded at p<0.01 and p<0.001.]

Results: Dorsal Striatum
Instrumental minus Pavlovian conditioning (α=0.2; y=+20; p<0.01, p<0.001).
[Figure: effect sizes for the PE at the CS and at the UCS, in the instrumental and Pavlovian tasks.]

Conclusions (1)
- A temporal difference prediction error signal is present in a part of the human brain (the ventral striatum) during appetitive conditioning.
- A putative neural substrate of the reward-related prediction error signal in the striatum is the phasic activity of afferent dopamine neurons.
- A TD-related response is present in the ventral striatum during instrumental as well as Pavlovian conditioning.
- The dorsal striatum shows significantly enhanced responses during instrumental relative to Pavlovian conditioning.

Conclusions (2)
This suggests that an actor-critic-like process is implemented in the human striatum:
- The ventral striatum may correspond to the critic, involved in forming predictions of future reward.
- The dorsal striatum may correspond to the instrumental actor, and may mediate stimulus-response learning.
More generally, this demonstrates the application of event-related fMRI to testing constrained computational models of human brain function.