Summary of part I: prediction and RL

Presentation transcript:

Summary of part I: prediction and RL
– Prediction is important for action selection
– The problem: prediction of future reward
– The algorithm: temporal difference learning
– Neural implementation: dopamine-dependent learning in BG
– A precise computational model of learning allows one to look in the brain for "hidden variables" postulated by the model
– Precise (normative!) theory for generation of dopamine firing patterns
– Explains anticipatory dopaminergic responding, second-order conditioning
– Compelling account for the role of dopamine in classical conditioning: prediction error acts as the signal driving learning in prediction areas

Prediction error hypothesis of dopamine
[figure: model prediction error vs. measured firing rate, Bayer & Glimcher (2005)]
at end of trial: δ_t = r_t − V_t (just like Rescorla-Wagner)
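As a minimal sketch of the update this slide summarises (Python; the reward probability, learning rate, and trial count are illustrative, not taken from the slides): the end-of-trial prediction error δ_t = r_t − V_t drives a Rescorla-Wagner-style change in the value prediction.

```python
import numpy as np

def rw_update(V, r, alpha=0.1):
    """One end-of-trial Rescorla-Wagner / TD-at-outcome update.

    V     : current prediction of the trial's reward
    r     : reward actually delivered at the end of the trial
    alpha : learning rate (illustrative value)
    Returns the prediction error delta and the updated prediction.
    """
    delta = r - V            # prediction error, delta_t = r_t - V_t
    V_new = V + alpha * delta
    return delta, V_new

# Example: rewards delivered on 75% of trials; V converges toward 0.75.
rng = np.random.default_rng(0)
V = 0.0
for _ in range(500):
    r = float(rng.random() < 0.75)
    delta, V = rw_update(V, r)
print(round(V, 2))   # ~0.75
```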

Global plan
– Reinforcement learning I: prediction; classical conditioning; dopamine
– Reinforcement learning II: dynamic programming; action selection; Pavlovian misbehaviour; vigour
Chapter 9 of Theoretical Neuroscience

Action Selection
– Evolutionary specification
– Immediate reinforcement: leg flexion; Thorndike puzzle box; pigeon, rat, human matching
– Delayed reinforcement: these tasks; mazes; chess
Bandler; Blanchard

Immediate Reinforcement
stochastic policy, based on action values:

Indirect Actor
– use the RW rule:
– switch every 100 trials
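A hedged sketch of the indirect actor as I read it from these slides: action values are learned with the RW rule, and the stochastic policy of the previous slide (a softmax over action values) turns them into choice probabilities. The reward probabilities, learning rate, and inverse temperature below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(q, beta=3.0):
    """Stochastic policy based on action values (inverse temperature beta)."""
    z = np.exp(beta * (q - q.max()))
    return z / z.sum()

# Two-armed bandit whose reward probabilities switch every 100 trials.
p_reward = np.array([0.8, 0.2])
q = np.zeros(2)            # action values, learned with the RW rule
alpha = 0.1                # learning rate (illustrative)

for t in range(400):
    if t > 0 and t % 100 == 0:
        p_reward = p_reward[::-1]          # contingencies switch
    probs = softmax(q)
    a = rng.choice(2, p=probs)             # sample an action
    r = float(rng.random() < p_reward[a])  # binary reward
    q[a] += alpha * (r - q[a])             # RW update of the chosen value
```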

Direct Actor
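The Direct Actor slide is essentially equations in the original deck; as a hedged stand-in, here is a REINFORCE-style sketch in which the policy parameters (action propensities) are adjusted directly by reward relative to a running baseline, with no stored action values. All constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

m = np.zeros(2)          # action propensities (policy parameters)
baseline = 0.0           # running reward baseline (reduces variance)
eta, kappa = 0.1, 0.05   # step sizes (illustrative)
p_reward = np.array([0.8, 0.2])

for t in range(400):
    probs = np.exp(m - m.max()); probs /= probs.sum()   # softmax policy
    a = rng.choice(2, p=probs)
    r = float(rng.random() < p_reward[a])
    # Direct actor: move the chosen action's propensity up or down
    # according to (reward - baseline); no action values are stored.
    grad = -probs
    grad[a] += 1.0                       # d log pi(a) / dm
    m += eta * (r - baseline) * grad
    baseline += kappa * (r - baseline)   # track the average reward
```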


Could we Tell?
correlate past rewards and actions with present choice
– indirect actor (separate clocks):
– direct actor (single clock):

Matching: Concurrent VI-VI Lau, Glimcher, Corrado, Sugrue, Newsome

Matching
– income, not return
– approximately exponential in r
– alternation: choice kernel

Action at a (Temporal) Distance
learning an appropriate action at x=1:
– depends on the actions at x=2 and x=3
– gains no immediate feedback
idea: use prediction as surrogate feedback

Action Selection
– start with a policy:
– evaluate it:
– improve it:
thus choose R more frequently than L or C

Policy
– value is too pessimistic
– action is better than average

Actor/critic
[diagram: policy parameters m_1 … m_n]
dopamine signals to both motivational & motor striatum appear, surprisingly, the same
suggestion: training both values & policies
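A minimal actor-critic sketch of the "training both values & policies" suggestion, on a toy one-state choice between L, C and R (the payoffs are invented for illustration; the maze in the slides' diagram is not reproduced here). A single TD error updates both the critic's value and the actor's propensities, like a prediction-error signal broadcast to both.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative layout (not from the slides): at x=1 the actions L, C, R
# yield rewards 1, 0 and 2 respectively and end the trial.
rewards = {"L": 1.0, "C": 0.0, "R": 2.0}
actions = ["L", "C", "R"]

V1 = 0.0               # critic: value of state x=1
m = np.zeros(3)        # actor: propensities for L, C, R at x=1
alpha, eta = 0.1, 0.1  # learning rates (illustrative)

for trial in range(2000):
    probs = np.exp(m - m.max()); probs /= probs.sum()
    a = rng.choice(3, p=probs)
    r = rewards[actions[a]]
    delta = r - V1             # one prediction error ...
    V1 += alpha * delta        # ... trains the critic (values)
    grad = -probs; grad[a] += 1.0
    m += eta * delta * grad    # ... and the actor (policy)
# After learning, R is chosen far more often than L or C.
```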

Formally: Dynamic Programming

Variants: SARSA Morris et al, 2006

Variants: Q learning Roesch et al, 2007
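Since the slides only name the two variants (SARSA, Morris et al. 2006; Q-learning, Roesch et al. 2007), here is a hedged side-by-side sketch of the two update rules; the only difference is the bootstrap target. The step sizes and the toy table are illustrative.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: the target bootstraps from the action actually taken next."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: the target bootstraps from the best available next action."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

# Tiny illustration on a 2-state, 2-action table.
Q = np.zeros((2, 2))
sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0)
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
```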

Summary
– prediction learning: Bellman evaluation
– actor-critic: asynchronous policy iteration
– indirect method (Q-learning): asynchronous value iteration

Impulsivity & Hyperbolic Discounting
– humans (and animals) show impulsivity in: diets; addiction; spending, …
– intertemporal conflict between short- and long-term choices
– often explained via hyperbolic discount functions
– alternative is a Pavlovian imperative to an immediate reinforcer
– framing, trolley dilemmas, etc.
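A small illustration of the intertemporal conflict mentioned above, assuming the common hyperbolic form V = r/(1 + k·d) with an illustrative k: the smaller-sooner option wins when it is imminent, but the preference reverses once a common front-end delay is added, which is the time-inconsistency behind impulsive choice.

```python
def hyperbolic_value(reward, delay, k=1.0):
    """Hyperbolic discounting: V = r / (1 + k * delay); k is illustrative."""
    return reward / (1.0 + k * delay)

# Smaller-sooner (5 units) vs larger-later (10 units, 5 time steps later).
for front_delay in (0, 10):
    v_soon = hyperbolic_value(5, front_delay)
    v_late = hyperbolic_value(10, front_delay + 5)
    choice = "smaller-sooner" if v_soon > v_late else "larger-later"
    print(front_delay, round(v_soon, 2), round(v_late, 2), choice)
# Preference reverses as the common front-end delay grows: impulsive now,
# patient when everything is far away.
```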

Direct/Indirect Pathways
– direct (D1): GO; learn from DA increase
– indirect (D2): noGO; learn from DA decrease
– hyperdirect (STN): delay actions given strongly attractive choices
Frank

DARPP-32: D1 effect DRD2: D2 effect

Three Decision Makers
– tree search
– position evaluation
– situation memory

Multiple Systems in RL
– model-based RL: build a forward model of the task and outcomes; search in the forward model (online DP); optimal use of information, but computationally ruinous
– cache-based RL: learn Q values, which summarize future worth; computationally trivial, but bootstrap-based and so statistically inefficient
– learn both; select according to uncertainty
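A hedged sketch contrasting the two systems on a one-step toy problem (the transition and reward tables are invented): the model-based value is recomputed by searching a forward model whenever it is needed, while the cache-based value is a stored Q number that only improves by bootstrapping from experience.

```python
import numpy as np

# Toy task (illustrative): action a in state s leads to outcome s' with
# probability T[s, a, s'], and outcomes pay R[s'].
T = np.array([[[0.7, 0.3], [0.2, 0.8]]])   # one start state, two actions
R = np.array([1.0, 4.0])                   # outcome values

# Model-based: search the forward model every time a value is needed.
def mb_value(s, a):
    return float(T[s, a] @ R)              # one-step lookahead (online DP)

# Cache-based (model-free): keep a table and bootstrap it from experience.
Q = np.zeros((1, 2))
def mf_update(s, a, r, alpha=0.1):
    Q[s, a] += alpha * (r - Q[s, a])       # cheap, but statistically inefficient

print(mb_value(0, 0), mb_value(0, 1))      # 1.9 3.4 -- available immediately
```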

Animal Canary
OFC; dlPFC; dorsomedial striatum; BLA? — vs. dorsolateral striatum, amygdala

Two Systems:

Behavioural Effects

Effects of Learning
distributional value iteration (Bayesian Q-learning); fixed additional uncertainty per step

One Outcome shallow tree implies goal-directed control wins

Human Canary...
if a → c and c → £££, then do more of a or b?
– MB: b
– MF: a (or even no effect)

Behaviour
action values depend on both systems; expect that the balance will vary by subject (but be fixed)

Neural Prediction Errors (1 → 2)
note that MB RL does not use this prediction error – training signal?
right ventral striatum (anatomical definition)

Neural Prediction Errors (1)
right nucleus accumbens; behaviour 1–2, not 1

Pavlovian Control

The 6-State Task Huys et al, 2012

Evaluation

Results
– full lookahead:
– unbiased termination:
– value-based pruning:
– pigeon effect:
the full model is the best model

Parameter Values
BDI (mean 3.7; range 0–15): more reliance on pruning means more 'prone to depression'

Direct Evidence of Pruning
no weird loss aversion: +140 vs −X

Vigour
Two components to choice:
– what: lever pressing; direction to run; meal to choose
– when / how fast / how vigorous
free operant tasks; real-valued DP

The model
[diagram: states S_0, S_1, S_2; choose (action, τ), e.g. (LP, τ_1) or (LP, τ_2); each choice incurs costs (unit cost, vigour cost) and yields rewards (U_R, P_R); available actions: LP, NP, Other; "how fast" = latency τ; goal]

The model
[diagram: as above — states S_0, S_1, S_2; choose (action, τ) with its costs and rewards]
Goal: choose actions and latencies to maximize the average rate of return (rewards minus costs, per unit time) — Average Reward RL

Average Reward RL: compute differential values of actions (extension of Schwartz 1993)
– differential value of taking action L with latency τ when in state x: Q_{L,τ}(x) = Rewards – Costs + Future Returns
– ρ = average rewards minus costs, per unit time
– steady-state behavior (not learning dynamics)
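A hedged reading of the differential value in code: rewards minus a fixed unit cost, a vigour cost that grows as the latency τ shrinks, and an opportunity-cost term ρτ for the time spent, plus the successor state's value. The cost form and the numbers are my illustration of the model described on these slides, not quoted from them.

```python
def differential_q(reward, c_unit, c_vigour, tau, rho, v_next):
    """Differential value of taking an action with latency tau in state x.

    reward   : expected immediate reward (e.g. probability * utility)
    c_unit   : fixed cost of emitting the action
    c_vigour : vigour cost constant; acting faster (smaller tau) costs c_vigour/tau
    rho      : average net reward per unit time (the opportunity cost of time)
    v_next   : differential value of the successor state
    """
    return reward - c_unit - c_vigour / tau - rho * tau + v_next

# Illustrative numbers: pressing with latency 2 s vs 0.5 s.
print(differential_q(1.0, 0.1, 0.4, tau=2.0, rho=0.3, v_next=0.0))   # 0.1
print(differential_q(1.0, 0.1, 0.4, tau=0.5, rho=0.3, v_next=0.0))   # -0.05
```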

Average Reward: Cost/Benefit Tradeoffs
1. Which action to take? → choose the action with the largest expected reward minus cost
2. How fast to perform it? slow → less costly (vigour cost), but slow → delays (all) rewards; the net rate of rewards is the cost of delay (opportunity cost of time) → choose the rate that balances vigour and opportunity costs
– explains faster (irrelevant) actions under hunger, etc.; masochism
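A worked version of the vigour/opportunity-cost balance, under the same assumed cost form as the sketch above (vigour cost C_v/τ, opportunity cost ρτ); this is my reading of the slide, not a formula quoted from it.

```latex
% Cost of choosing latency \tau: vigour cost plus opportunity cost of time
\[
  K(\tau) = \frac{C_v}{\tau} + \rho\,\tau
\]
% Setting the derivative to zero balances the two costs:
\[
  \frac{dK}{d\tau} = -\frac{C_v}{\tau^2} + \rho = 0
  \quad\Longrightarrow\quad
  \tau^{*} = \sqrt{\frac{C_v}{\rho}}
\]
% A higher average reward rate \rho (e.g. under hunger) shortens the optimal
% latency: responding speeds up, even for irrelevant actions.
```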

Optimal response rates
[figure: experimental data (Niv, Dayan, Joel, unpublished) and model simulation — latency to 1st nose poke vs. seconds since reinforcement]

Optimal response rates
[figure: model simulation (% responses on lever A vs. % reinforcements on lever A, perfect matching) and experimental data (Herrnstein 1961; pigeons A and B, % responses on key A vs. % reinforcements on key A)]
More: # responses; interval length; amount of reward; ratio vs. interval; breaking point; temporal structure; etc.

Effects of motivation (in the model)
[figure: RR25 schedule; mean latency for LP and Other under low vs. high utility — energizing effect]

Effects of motivation (in the model)
[figure: RR25 schedule, U_R 50%; response rate per minute vs. seconds from reinforcement — 1. energizing effect (mean latency for LP and Other, low vs. high utility); 2. directing effect]

Relation to Dopamine
What about tonic dopamine? Phasic dopamine firing = reward prediction error

Tonic dopamine = average reward rate
(NB: the phasic signal is the RPE for choice/value learning)
[figure: Aberman and Salamone 1999 — # LPs in 30 minutes, control vs. DA-depleted; model simulation reproduces the pattern]
1. explains pharmacological manipulations
2. dopamine control of vigour through BG pathways
– eating-time confound
– context/state dependence (motivation & drugs?)
– less switching = perseveration

Tonic dopamine hypothesis
[figures: Satoh and Kimura 2003 (reaction time); Ljungberg, Apicella and Schultz 1992 (firing rate)]
…also explains effects of phasic dopamine on response times

Sensory Decisions as Optimal Stopping consider listening to: decision: choose, or sample

Optimal Stopping

Key Quantities

Optimal Stopping

equivalent of state u=1 is …, and of states u=2,3 is …

Transition Probabilities
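The optimal-stopping slides are mostly figures in the original; as a hedged sketch of the underlying computation, here is value iteration over a discretised belief about which of two stimuli produced the sound: at each belief one either commits (and earns the probability of being correct) or pays a sampling cost for one more observation. The likelihoods, sampling cost, and grid are illustrative.

```python
import numpy as np

# Discretised belief that stimulus A (rather than B) generated the sound.
p_grid = np.linspace(0.001, 0.999, 199)
c_sample = 0.01     # cost of listening to one more sample (illustrative)
q = 0.65            # P(observation favours A | A true) -- illustrative

def posterior(p, y):
    """Bayesian update of the belief after observing y in {1 (pro-A), 0 (pro-B)}."""
    like_a = q if y == 1 else 1 - q
    like_b = (1 - q) if y == 1 else q
    return p * like_a / (p * like_a + (1 - p) * like_b)

V = np.maximum(p_grid, 1 - p_grid)          # start: value of choosing now
for _ in range(200):                        # value iteration over the belief state
    V_interp = lambda p: np.interp(p, p_grid, V)
    V_new = np.empty_like(V)
    for i, p in enumerate(p_grid):
        choose_now = max(p, 1 - p)          # expected accuracy if we commit
        p_y1 = p * q + (1 - p) * (1 - q)    # predictive prob. of a pro-A sample
        sample = -c_sample + p_y1 * V_interp(posterior(p, 1)) \
                           + (1 - p_y1) * V_interp(posterior(p, 0))
        V_new[i] = max(choose_now, sample)
    V = V_new
# Beliefs where sampling still beats choosing form the "keep listening" region;
# its edges are the decision thresholds of the optimal stopping rule.
```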

Computational Neuromodulation
– dopamine: phasic – prediction error for reward; tonic – average reward (vigour)
– serotonin: phasic – prediction error for punishment?
– acetylcholine: expected uncertainty?
– norepinephrine: unexpected uncertainty; neural interrupt?

Conditioning
– Ethology: optimality; appropriateness
– Psychology: classical/operant conditioning
– Computation: dynamic programming; Kalman filtering
– Algorithm: TD/delta rules; simple weights
– Neurobiology: neuromodulators; amygdala; OFC; nucleus accumbens; dorsal striatum
prediction: of important events
control: in the light of those predictions