Marrian analysis of conditioning. Prediction: of important events; control: in the light of those predictions.
- Ethology: optimality; appropriateness
- Psychology: classical/operant conditioning
- Computation: dynamic programming; Kalman filtering
- Algorithm: TD/delta rules; simple weights
- Neurobiology: neuromodulators; midbrain; sub-cortical and cortical structures
Plan:
- ‘simple’ learning: Rescorla-Wagner; Pearce-Hall; contexts and extinction
- temporal difference learning and dopamine
- action learning: model-free; model-based; vigour
Rescorla & Wagner (1972), error-driven learning: the change in value is proportional to the difference between the actual and the predicted outcome. Assumptions: learning is driven by error (this formalizes the notion of surprise); summation of predictors is linear. A simple model, but very powerful! It explains gradual acquisition and extinction, blocking, overshadowing, conditioned inhibition, and more; it predicted overexpectation. Note: the US acts as a “special stimulus”.
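A minimal numerical sketch of the delta rule and of blocking (the stimulus labels, learning rate and reward size are arbitrary choices, not from the slides):

```python
import numpy as np

def rescorla_wagner(trials, n_stimuli, alpha=0.3):
    """Delta rule: each present stimulus changes by alpha * (outcome - summed prediction)."""
    V = np.zeros(n_stimuli)
    for present, r in trials:
        prediction = V[present].sum()     # linear summation of the predictors present
        delta = r - prediction            # error term: the formalized "surprise"
        V[present] += alpha * delta       # only present stimuli are updated
    return V

# Blocking: pretrain A -> reward, then train the compound AB -> the same reward.
A, B = 0, 1
phase1 = [([A], 1.0)] * 50        # A alone comes to fully predict the reward
phase2 = [([A, B], 1.0)] * 50     # AB -> reward: no error left, so B learns nothing
V = rescorla_wagner(phase1 + phase2, n_stimuli=2)
print(V.round(2))                 # approximately [1.0, 0.0]: A blocks learning about B
```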
Rescorla-Wagner learning: what about 50% reinforcement? Note that extinction is not really like this: the model misses savings.
Rescorla-Wagner learning: what is the prediction on trial t as a function of the rewards on trials t-1, t-2, …? The R-W rule estimates expected reward using a weighted average of past rewards, with recent rewards weighing more heavily; the learning rate doubles as a forgetting rate!
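To make the weighted-average point concrete, a small sketch (single always-present stimulus, arbitrary learning rate, simulated 50% reinforcement): iterating the delta rule is exactly an exponentially weighted average of past rewards, so the learning rate is also the forgetting rate.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1
rewards = rng.binomial(1, 0.5, size=200).astype(float)   # e.g. 50% reinforcement

# Iterate the delta rule for a single, always-present stimulus (V starts at 0).
V = 0.0
for r in rewards:
    V += alpha * (r - V)

# Closed form of the same recursion: exponentially decaying weights on past rewards,
# so recent rewards weigh more heavily and alpha sets how fast old ones are forgotten.
t = len(rewards)
weights = alpha * (1 - alpha) ** np.arange(t - 1, -1, -1)  # weight on the reward of trial i
V_closed = np.sum(weights * rewards)

print(round(V, 4), round(V_closed, 4))   # identical
```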
Kalman filter: a Markov random walk (or OU process) on the weights; no punctate changes; additive model of combination; forward inference.
Kalman posterior [figure]
Assumed-density Kalman filtering relates to Rescorla-Wagner (error correction) and to competitive allocation of learning (Pearce-Hall, Mackintosh).
Blocking. Forward blocking: error correction. Backward blocking: negative off-diagonal terms in the posterior covariance.
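A minimal Kalman-filter sketch of backward blocking (the observation noise, drift variance and trial counts are arbitrary): training the AB compound leaves a negative off-diagonal term in the posterior covariance, so later evidence about A alone revises the estimate for B downwards; forward blocking still comes from error correction on the mean.

```python
import numpy as np

def kalman_step(w, S, x, r, obs_var=0.1, drift=0.01):
    """One trial of Kalman filtering for additive associative weights.
    Generative model: the weights w follow a random walk, and r = x.w + noise."""
    S = S + drift * np.eye(len(w))            # random-walk drift inflates uncertainty
    delta = r - x @ w                         # prediction error
    gain = S @ x / (x @ S @ x + obs_var)      # per-stimulus learning rates
    w = w + gain * delta
    S = S - np.outer(gain, x) @ S             # posterior covariance update
    return w, S

w, S = np.zeros(2), np.eye(2)
A = np.array([1.0, 0.0])
AB = np.array([1.0, 1.0])

# Backward blocking: phase 1 trains the compound AB -> reward, phase 2 trains A alone.
for _ in range(30):
    w, S = kalman_step(w, S, AB, 1.0)
print("after AB:", w.round(2), " off-diagonal:", round(S[0, 1], 3))  # negative
for _ in range(30):
    w, S = kalman_step(w, S, A, 1.0)
print("after A :", w.round(2))   # w_B has been revised downwards: backward blocking
```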
Plan:
- ‘simple’ learning: Rescorla-Wagner; Pearce-Hall; contexts and extinction
- temporal difference learning and dopamine
- action learning: model-free; model-based; vigour
reinstatement [figure: Acquisition, then Extinction (no shock), then Test] (slides from Yael Niv)
extinction ≠ unlearning [figure: Acquisition, Extinction (no shock), Test; Storsve, McNally & Richardson, 2012]. Other evidence that extinction is not unlearning: spontaneous recovery, reinstatement. (slides from Yael Niv)
learning causal structure (Gershman & Niv): there are many options for what the animal can be learning; maybe by modifying the learned association we can get insight into what was actually learned.
conditioning as clustering: Dirichlet process mixtures (Gershman & Niv; Daw & Courville; Redish). Within each cluster: “learning as usual” (Rescorla-Wagner, RL, etc.).
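A rough sketch of the clustering idea (a simplified local-MAP caricature, not the actual Gershman & Niv model; the concentration parameter, likelihood widths and learning rate are placeholders): each trial is assigned to the latent cause that best explains its outcome, a new cause is created when none does, and Rescorla-Wagner-style "learning as usual" runs within the chosen cluster.

```python
import numpy as np

def gaussian_loglik(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def latent_cause_rw(outcomes, alpha_crp=1.0, lr=0.2, sigma=0.3):
    """Greedy CRP-style assignment of trials to latent causes,
    with a Rescorla-Wagner update inside the chosen cause only."""
    values, counts, assignments = [], [], []
    for r in outcomes:
        # score each existing cause (popularity prior x likelihood) and a brand-new one
        scores = [np.log(n) + gaussian_loglik(r, v, sigma)
                  for v, n in zip(values, counts)]
        scores.append(np.log(alpha_crp) + gaussian_loglik(r, 0.0, 1.0))  # new cause, vague prior
        k = int(np.argmax(scores))
        if k == len(values):                      # structural learning: create a new state
            values.append(0.0)
            counts.append(0)
        values[k] += lr * (r - values[k])         # "learning as usual" within the cluster
        counts[k] += 1
        assignments.append(k)
    return values, assignments

# Acquisition (shock = 1) followed by abrupt extinction (no shock = 0):
values, z = latent_cause_rw([1.0] * 20 + [0.0] * 20)
print([round(v, 2) for v in values])   # one cause near 1, a second near 0
print(z)                               # the abrupt change spawns a new cause for extinction
```

Under this caricature, extinction gets its own state rather than unlearning the acquisition association, which is the logic behind the gradual-extinction manipulation below.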
associative learning versus state learning (Gershman & Niv): these two processes compete to relieve the “explanatory tension” in the animal’s internal model; structural learning creates a new state.
how to erase a fear memory. Hypothesis: prediction errors (dissimilar data) lead to new states. What if we make extinction a bit more similar to acquisition? (slides from Yael Niv)
gradual extinction (Gershman, Jones, Norman, Monfils & Niv, under review) [design: acquisition, then gradual extinction vs. regular extinction vs. gradual reverse; test one day later (reinstatement) or 30 days later (spontaneous recovery)]
gradual extinction (Gershman, Jones, Norman, Monfils & Niv, under review): only the gradual extinction group shows no reinstatement.
Plan:
- ‘simple’ learning: Rescorla-Wagner; Pearce-Hall; contexts and extinction
- temporal difference learning and dopamine
- action learning: model-free; model-based; vigour
But: second-order conditioning. Phase 1: CS1 → US; phase 2: CS2 → CS1; test: CS2 → ? Animals learn that a predictor of a predictor is also a predictor of reward! They are not interested solely in predicting immediate reward.
Let’s start over, this time from the top of Marr’s 3 levels. The problem: optimal prediction of future reward. What’s the obvious prediction error? But we want to predict the expected sum of future reward in a trial/episode: V(t) = E[r(t) + r(t+1) + … + r(T)] (N.B. here t indexes time within a trial).
Let’s start over, this time from the top of Marr’s 3 levels. The problem: optimal prediction of future reward, i.e. the expected sum of future reward in a trial/episode. The Bellman equation for policy evaluation: V(t) = E[r(t)] + V(t+1).
Let’s start over, this time from the top of Marr’s 3 levels. The problem: optimal prediction of future reward. The algorithm: temporal difference learning, with the temporal difference prediction error δ(t) = r(t) + V(t+1) − V(t); compare this to the Rescorla-Wagner error δ = r − V.
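A tabular TD(0) sketch of this algorithm (trial length, learning rate and reward timing are arbitrary): after training, the prediction error is near zero during a predicted, rewarded trial, appears at the unexpected cue, and goes negative when the predicted reward is omitted; this is the pattern compared to dopamine on the next slides.

```python
import numpy as np

T = 10                      # within-trial time steps: cue at t = 0, reward at t = T-1
alpha = 0.1
V = np.zeros(T + 1)         # V[T] = 0: nothing is predicted beyond the trial

def run_trial(V, rewarded=True, learn=True):
    """One trial of TD(0); returns the prediction error at every time step."""
    deltas = np.zeros(T)
    for t in range(T):
        r = 1.0 if (rewarded and t == T - 1) else 0.0
        deltas[t] = r + V[t + 1] - V[t]          # delta_t = r_t + V(t+1) - V(t)
        if learn:
            V[t] += alpha * deltas[t]
    return deltas

for _ in range(1000):
    run_trial(V)

# The cue arrives unpredictably, so the pre-cue prediction is ~0 and the error at
# cue onset is just V[0]: a positive burst to the learned predictor.
print("error at cue onset:", round(V[0], 2))
print("rewarded trial:    ", run_trial(V, learn=False).round(2))                  # ~0 throughout
print("reward omitted:    ", run_trial(V, rewarded=False, learn=False).round(2))  # dip at t = T-1
```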
Dopamine
dopamine and prediction error [figure: TD error and dopamine responses in three cases: no prediction, reward; prediction, reward; prediction, no reward]
Risk experiment. 5 stimuli: 40¢, 20¢, 0/40¢, 0¢. Trial timeline: stimulus (< 1 sec), 2-5 sec delay, outcome (“You won 40 cents”, 0.5 sec), 5 sec ISI, 2-5 sec ITI. 19 subjects (dropped 3 non-learners, N = 16); 3T scanner, TR = 2 sec, interleaved; 234 trials (130 choice, 104 single stimulus), randomly ordered and counterbalanced.
Neural results: Prediction Errors what would a prediction error look like (in BOLD)?
Neural results: prediction errors in the nucleus accumbens. Raw BOLD (averaged over all subjects) in an unbiased anatomical ROI in the nucleus accumbens (marked per subject; thanks to Laura deSouza). Can actually decide between different neuroeconomic models of risk.
Plan:
- ‘simple’ learning: Rescorla-Wagner; Pearce-Hall; contexts and extinction
- temporal difference learning and dopamine
- action learning: model-free; model-based; vigour
Action selection. Evolutionary specification. Immediate reinforcement: leg flexion; Thorndike’s puzzle box; pigeon, rat and human matching. Delayed reinforcement: these tasks; mazes; chess.
Pavlovian Control Keay & Bandler, 2001
Immediate reinforcement: a stochastic policy based on action values, e.g. a softmax p(a) ∝ exp(β m(a)).
Direct Actor
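A sketch of a direct actor in the immediate-reinforcement setting (a two-armed bandit with made-up reward probabilities; the learning rates and baseline are arbitrary): action propensities define a softmax policy, and the chosen propensity is nudged by how much the outcome beat the running average reward.

```python
import numpy as np

rng = np.random.default_rng(1)
p_reward = np.array([0.8, 0.3])     # made-up reward probabilities for actions L, R
m = np.zeros(2)                     # action propensities
r_bar = 0.0                         # running average reward (baseline)
eps, eta = 0.05, 0.01               # actor and baseline learning rates

def softmax(m):
    e = np.exp(m - m.max())
    return e / e.sum()

for _ in range(5000):
    p = softmax(m)                              # stochastic policy from action values
    a = rng.choice(2, p=p)
    r = float(rng.random() < p_reward[a])
    chosen = np.eye(2)[a]
    m += eps * (r - r_bar) * (chosen - p)       # push the chosen action up if r beats the baseline
    r_bar += eta * (r - r_bar)

print(softmax(m).round(2))   # the policy ends up favouring the richer action (L)
```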
Action at a (temporal) distance [figure: states S1, S2, S3 with rewards 4, 2, 2 at the leaves]. Learning an appropriate action at S1 depends on the actions at S2 and S3, and gains no immediate feedback. Idea: use prediction as surrogate feedback.
Direct action propensities: start with a policy; evaluate it [figure: values at S1, S2, S3]; improve it [figure: +1 / -1 at S1]; thus choose L more frequently than R.
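An actor-critic sketch of “prediction as surrogate feedback” (the layout loosely follows the S1/S2/S3 maze above, but the rewards used here, 4 or 0 at S2 and 2 or 0 at S3, and the learning rates are illustrative assumptions): the critic’s TD error trains both the values and the action propensities, so the choice at S1 gets credit even though its payoff arrives only two steps later.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed maze: from S1, action 0 (L) -> S2 and action 1 (R) -> S3; the second-level
# actions lead straight to terminal rewards (illustrative numbers).
transitions = {("S1", 0): ("S2", 0.0), ("S1", 1): ("S3", 0.0),
               ("S2", 0): (None, 4.0), ("S2", 1): (None, 0.0),
               ("S3", 0): (None, 2.0), ("S3", 1): (None, 0.0)}

V = {s: 0.0 for s in ("S1", "S2", "S3")}            # critic: state values
M = {s: np.zeros(2) for s in ("S1", "S2", "S3")}    # actor: action propensities
alpha, eps = 0.1, 0.1

def softmax(m):
    e = np.exp(m - m.max())
    return e / e.sum()

for _ in range(3000):
    s = "S1"
    while s is not None:
        a = rng.choice(2, p=softmax(M[s]))
        s_next, r = transitions[(s, a)]
        v_next = V[s_next] if s_next is not None else 0.0
        delta = r + v_next - V[s]      # TD error: the surrogate feedback
        V[s] += alpha * delta          # critic update
        M[s][a] += eps * delta         # actor update at the current state
        s = s_next

print({s: round(v, 1) for s, v in V.items()})   # V(S2) > V(S3), learned by prediction alone
print(softmax(M["S1"]).round(2))                # so S1 comes to choose L (towards S2)
```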
Policy [figure: values at S1, S2, S3]: the value is too pessimistic when an action is better than average. Spiralling links between the striatum and the dopamine system run from ventral to dorsal (vmPFC, OFC/dACC, dPFC, SMA, motor cortex).
Plan:
- ‘simple’ learning: Rescorla-Wagner; Pearce-Hall; contexts and extinction
- temporal difference learning and dopamine
- action learning: model-free; model-based; vigour
Tree-search/model-based system: Tolmanian forward model; forwards/backwards tree search; motivationally flexible; OFC, dlPFC, dorsomedial striatum, BLA?; statistically efficient but computationally catastrophic.
Or more formally… (Daw & Niv) [figure: two-step tree over S1, S2, S3 under Hunger vs. Thirst (NB: trained hungry); the caching (habitual) system stores values such as Q(H; S1, L) = 4, Q(H; S1, R) = 3, Q(H; S2, L) = 4, Q(H; S3, L) = 2, Q(H; S3, R) = 3, acquired with simple learning rules; the forward model (goal-directed) system represents the transitions and outcome utilities and performs online planning (MCTS)]. But how to choose when thirsty?
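A toy contrast between the two systems (the tree, the cached values and the hunger/thirst utilities are illustrative stand-ins, not the figure’s numbers): the forward model re-plans with the current motivational state’s utilities, whereas the cached values can only replay what was learned while hungry.

```python
# Assumed two-step tree: at S1 choose L -> S2 or R -> S3; second-level actions lead
# to outcomes whose utility depends on motivational state (illustrative numbers).
tree = {"S1": {"L": "S2", "R": "S3"},
        "S2": {"L": "food", "R": "nothing"},
        "S3": {"L": "water", "R": "nothing"}}
utility = {"hungry":  {"food": 4.0, "water": 1.0, "nothing": 0.0},
           "thirsty": {"food": 1.0, "water": 4.0, "nothing": 0.0}}

def plan(node, motivation):
    """Model-based (goal-directed) evaluation: search the tree with current utilities."""
    if node not in tree:                                 # leaf = outcome
        return utility[motivation][node]
    return max(plan(nxt, motivation) for nxt in tree[node].values())

def mb_choice(state, motivation):
    return max(tree[state], key=lambda a: plan(tree[state][a], motivation))

# Model-free: Q-values cached while training *hungry* (e.g. by TD learning).
Q_cached = {("S1", "L"): 4.0, ("S1", "R"): 1.0}
def mf_choice(state):
    return max(("L", "R"), key=lambda a: Q_cached[(state, a)])

print("model-based, thirsty:", mb_choice("S1", "thirsty"))   # R: re-plans towards water
print("model-free,  thirsty:", mf_choice("S1"))              # L: still runs the old habit
```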
Human canary: if a → c and c → £££, then should you do more of a or of b? MB: b; MF: a (or even no effect).
Behaviour: action values depend on both systems, e.g. a weighted combination Q = w·Q_MB + (1 − w)·Q_MF; expect that the weight w will vary by subject (but be fixed within subject).
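A sketch of one standard way to write “depends on both systems” (the mixture-weight form used in Daw-style two-step analyses; the particular Q-values, w values and inverse temperature here are placeholders): choices follow a softmax over a per-subject weighted combination of the two systems’ values.

```python
import numpy as np

def choice_probabilities(q_mb, q_mf, w, beta):
    """Softmax over a fixed per-subject mixture of model-based and model-free values."""
    q = w * np.asarray(q_mb) + (1 - w) * np.asarray(q_mf)
    e = np.exp(beta * (q - q.max()))
    return e / e.sum()

q_mb, q_mf = [1.0, 2.0], [2.0, 1.0]      # the two systems disagree about actions a, b
for w in (0.0, 0.5, 1.0):                # w varies across subjects but is fixed within
    print(w, choice_probabilities(q_mb, q_mf, w, beta=3.0).round(2))
```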
Neural prediction errors (12): right ventral striatum (anatomical definition). Note that MB RL does not use this prediction error; is it a training signal?
Plan:
- ‘simple’ learning: Rescorla-Wagner; Pearce-Hall; contexts and extinction
- temporal difference learning and dopamine
- action learning: model-free; model-based; vigour
Vigour. Two components to choice: what (lever pressing; direction to run; meal to choose) and when / how fast / how vigorously (free-operant tasks); a real-valued dynamic programming problem.
The model [figure: states S0, S1, S2; actions LP, NP, Other; each choice is a pair (action, latency), e.g. (LP, τ1) or (LP, τ2), with associated costs (vigour cost, unit cost) and rewards (UR, PR); how fast to respond?]
The model. Goal: choose actions and latencies to maximize the average rate of return (rewards minus costs per unit time): average-reward RL.
Average-reward RL: compute differential values of actions. The differential value of taking action L with latency τ in state x is Q_{L,τ}(x) = Rewards − Costs + expected future returns, measured relative to ρ, the average rate of rewards minus costs per unit time (an extension of Schwartz, 1993). The model has few parameters (basically the cost constants and the reward utility), but we will not try to fit any of these; we just look at principles and steady-state behaviour (not learning dynamics).
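A sketch of the latency trade-off this objective implies (assuming a vigour cost of the form Cv/τ; the cost constant and rates are made-up): responding faster incurs a larger vigour cost, while responding slower forgoes reward at the average net rate ρ, an opportunity cost of roughly ρτ, so the optimal latency falls as ρ rises; this is the energizing effect shown on the next slide.

```python
import numpy as np

def optimal_latency(c_v, rho, taus=np.linspace(0.05, 5.0, 2000)):
    """Minimise the tau-dependent part of the cost: vigour cost c_v / tau
    plus the opportunity cost rho * tau of delaying everything else."""
    return taus[np.argmin(c_v / taus + rho * taus)]

c_v = 1.0                                    # made-up vigour-cost constant
for rho in (0.1, 0.5, 2.0):                  # average net reward rate of the environment
    tau = optimal_latency(c_v, rho)
    print(f"rho = {rho}: tau* = {tau:.2f}  (analytic sqrt(c_v/rho) = {np.sqrt(c_v / rho):.2f})")
# Higher rho (e.g. higher motivation / more tonic dopamine) -> shorter latencies.
```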
Effects of motivation (in the model) [figure: RR25 schedule; mean latencies of LP and Other responses under low vs. high utility: the energizing effect]
Relation to dopamine: phasic dopamine firing = reward prediction error. What about tonic dopamine? [figure: more vs. less tonic dopamine]
Tonic dopamine hypothesis [figures: reaction time and firing rate; Satoh & Kimura, 2003; Ljungberg, Apicella & Schultz, 1992]. …also explains the effects of phasic dopamine on response times.
Conditioning, summary. Prediction: of important events; control: in the light of those predictions.
- Ethology: optimality; appropriateness
- Psychology: classical/operant conditioning
- Computation: dynamic programming; Kalman filtering
- Algorithm: TD/delta rules; simple weights
- Neurobiology: neuromodulators; amygdala; OFC; nucleus accumbens; dorsal striatum