Episodic Control: Singular Recall and Optimal Actions
Peter Dayan, Nathaniel Daw, Máté Lengyel, Yael Niv

Presentation transcript:

Episodic Control: Singular Recall and Optimal Actions
Peter Dayan, Nathaniel Daw, Máté Lengyel, Yael Niv

Two Decision Makers
- tree search
- position evaluation

Three Decision Makers
- tree search
- position evaluation
- situation memory: whole, bound episodes

Goal-Directed/Habitual/Episodic Control
- why have more than one system?
  – statistical versus computational noise
  – DMS/PFC vs DLS/DA
- why have more than two systems?
  – statistical versus computational noise
- (why have more than three systems?)
- when is episodic control a good idea?
- is the MTL involved?

Three Decision Makers
- tree search:
  – model-based reinforcement learning (PFC; DMS)
- position evaluation:
  – model-free reinforcement learning (DA; DLS)
  – δ(t) = r(t) + γV(t+1) - V(t) (sketched below)
- Pavlovian control
  – evolutionary preprogramming
  – misbehaviour
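As a rough illustration of the cached position-evaluation system, here is a minimal TD(0) sketch driven by the prediction error on the slide. The state labels, learning rate and discount are illustrative assumptions, not values from the talk.

```python
# Minimal TD(0) sketch for the cached (habitual) controller; state labels,
# learning rate alpha and discount gamma are illustrative assumptions.
from collections import defaultdict

def td_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """One TD(0) step: move V[s] toward r + gamma * V[s_next]."""
    delta = r + gamma * V[s_next] - V[s]   # prediction error, delta(t) on the slide
    V[s] += alpha * delta
    return delta

V = defaultdict(float)                      # cached state values start at 0
delta = td_update(V, s="S1", r=0.0, s_next="S2")
```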

Reinforcement Learning
- forward model (goal-directed): acquire recursively
- caching (habitual): acquire with simple learning rules, δ(t) = r(t) + γV(t+1) - V(t)
- [Figure: decision tree over states S1, S2, S3 with left/right (L/R) actions leading to outcomes (e.g. cheese) whose utilities differ under Hunger and Thirst; cached values, NB trained hungry: Q(H; S1,L)=4, Q(H; S1,R)=3, Q(H; S2,L)=4, Q(H; S2,R)=0, Q(H; S3,L)=2, Q(H; S3,R)=3]
- the two controllers are contrasted in the sketch below
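A sketch of how the two controllers differ on this toy tree. The tree layout and the cached Q-values follow the slide; the outcome names and their utilities under hunger are assumptions, chosen only to be consistent with those cached values.

```python
# Forward model (goal-directed): transition model plus motivation-dependent
# utilities, evaluated recursively at decision time.
transitions = {("S1", "L"): "S2", ("S1", "R"): "S3"}
outcomes = {("S2", "L"): "cheese", ("S2", "R"): "nothing",
            ("S3", "L"): "water", ("S3", "R"): "food"}       # assumed labels
utility = {"hungry": {"cheese": 4, "nothing": 0, "water": 2, "food": 3}}

def tree_value(state, action, motivation):
    """Evaluate an action by unrolling the forward model down to its outcomes."""
    if (state, action) in outcomes:                           # leaf: outcome utility
        return utility[motivation][outcomes[(state, action)]]
    next_state = transitions[(state, action)]
    return max(tree_value(next_state, a, motivation) for a in ("L", "R"))

# Caching (habitual): values learned while hungry, retrieved by simple lookup.
Q_cached = {("S1", "L"): 4, ("S1", "R"): 3,
            ("S2", "L"): 4, ("S2", "R"): 0,
            ("S3", "L"): 2, ("S3", "R"): 3}

# When tested hungry the two agree; revalue the outcomes (e.g. after satiety)
# and only the forward model changes its answer immediately.
assert tree_value("S1", "L", "hungry") == Q_cached[("S1", "L")] == 4
```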

Learning
- uncertainty-sensitive learning for both systems:
  – model-based (propagate uncertainty): data efficient, computationally ruinous
  – model-free (Bayesian Q-learning): data inefficient, computationally trivial
  – uncertainty-sensitive control migrates from actions to habits (see the arbitration sketch below)
- Daw, Niv, Dayan
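A minimal sketch of the kind of uncertainty-based arbitration the slide describes, assuming each controller reports a mean and a variance per action; the Gaussian-style summaries and the example numbers are simplifying assumptions, not the full Daw, Niv & Dayan model.

```python
# Per action, trust whichever system is currently less uncertain, then act
# greedily on the arbitrated values.
def arbitrate(model_based, model_free):
    """Each argument maps action -> (mean, variance); return the chosen action."""
    values = {}
    for action in model_based:
        mb_mean, mb_var = model_based[action]
        mf_mean, mf_var = model_free[action]
        values[action] = mb_mean if mb_var < mf_var else mf_mean
    return max(values, key=values.get)

# Early in training the tree-search estimates are tighter, so goal-directed
# control dominates; with experience the cached (Bayesian Q-learning) variances
# shrink and control migrates to the habit system.
early_choice = arbitrate({"L": (4.0, 0.5), "R": (3.0, 0.5)},
                         {"L": (1.0, 5.0), "R": (2.5, 5.0)})
late_choice = arbitrate({"L": (4.0, 2.0), "R": (3.0, 2.0)},
                        {"L": (4.1, 0.3), "R": (2.9, 0.3)})
```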

One Outcome
- shallow tree implies goal-directed control wins
- [Figure: uncertainty-sensitive learning simulation (Daw, Niv, Dayan)]

One Outcome
- [Figure: uncertainty-sensitive learning simulation, continued (Daw, Niv, Dayan)]

Actions and Habits
- model-based system is Tolmanian
- evidence from Killcross et al:
  – prelimbic lesions: instant devaluation insensitivity
  – infralimbic lesions: permanent devaluation sensitivity
- evidence from Balleine et al:
  – goal-directed control: PFC; dorsomedial thalamus
  – habitual control: dorsolateral striatum; dopamine
- both systems learn; compete for control
- arbitration: ACC; ACh?

But...
- top-down
  – hugely inefficient to do semantic control given little data → a different way of using singular experience
- bottom-up
  – why store episodes? → use for control
- situation memory for Deep Blue

The Third Way
- simple domain
- model-based control:
  – build a tree
  – evaluate states
  – count cost of uncertainty
- episodic control (see the sketch after this slide):
  – store conjunction of states, actions, rewards
  – if reward > expectation, store all actions in the whole episode (Düzel)
  – choose rewarded action; else random
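A minimal sketch of the episodic controller as described on this slide: store whole episodes, and when an episode's return beats expectation, bind every action in it to the states visited; at choice time reuse the stored action if one exists, otherwise act at random. The running-average "expectation" is an assumption about a detail the slide leaves open.

```python
import random

class EpisodicController:
    def __init__(self, actions):
        self.actions = list(actions)
        self.best_action = {}          # state -> action from the best episode so far
        self.best_return = {}          # state -> return of that episode
        self.expected_return = 0.0     # running estimate of a typical episode

    def act(self, state):
        """Replay the remembered action for this state, else choose at random."""
        return self.best_action.get(state, random.choice(self.actions))

    def store_episode(self, episode, total_reward, lr=0.1):
        """episode: list of (state, action) pairs from one completed trial."""
        if total_reward > self.expected_return:           # better than expected:
            for state, action in episode:                  # store the whole episode
                if total_reward >= self.best_return.get(state, float("-inf")):
                    self.best_action[state] = action
                    self.best_return[state] = total_reward
        self.expected_return += lr * (total_reward - self.expected_return)

controller = EpisodicController(actions=["L", "R"])
controller.store_episode([("S1", "L"), ("S2", "L")], total_reward=4.0)
action = controller.act("S1")                              # -> "L"
```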

Semantic Controller
- [Figure: semantic (model-based) controller at T=0]

Semantic Controller
- [Figure: semantic (model-based) controller at T=1 and T=100]

Episodic Controller
- [Figure: episodic controller at T=0; best reward]

Episodic Controller
- [Figure: episodic controller at T=1 and T=100; best reward]

Performance
- episodic advantage for early trials
- lasts longer for more complex environments
- can’t compute statistics/semantic information

Hippocampal/Striatal Interactions
- Packard & McGaugh ’96: inactivate dorsal HC or dorsolateral caudate 8 or 16 days into training
- [Figure: number of animals expressing place vs. action strategies on test days 8 and 16, with caudate (CN) or hippocampus (HC) inactivation]

Hippocampal/Striatal Interactions
- Doeller, King & Burgess, 2008 (+ D&B 2008)

Hippocampal/Striatal Interactions
- Poldrack et al: feedback condition, event-related analysis
- [Figure: MTL and caudate activation]

Hippocampal/Striatal Interactions
- simultaneous learning
  – but HC can overshadow striatum (unlike actions vs habits)
- competitive interaction?
  – contribute according to activation strength
  – but vmPFC covaries with covariance
- content:
  – specific – space
  – generic – weather

Discussion
- multiple memory systems and multiple control systems
- episodic memory for prospective control
- transition to PFC? striatum?
- uncertainty-based arbitration
- memory-based forward model?
  – but episodic statistics are poor?
- Tolmanian test? overshadowing/blocking
- representational effects of HC (Knowlton, Gluck et al)