Reinforcement Learning and Human Behavior
Based on the paper by Hanan Shteingart and Yonatan Loewenstein
MTAT.03.292 Seminar in Computational Neuroscience
Presented by Zurab Bzhalava

Introduction
Operant learning
The dominant computational approach to modeling operant learning is model-free RL
Human behavior is far more complex
Remaining challenges

Reinforcement Learning
RL: a class of learning problems in which an agent interacts with an unfamiliar, dynamic, and stochastic environment
Goal: learn a policy that maximizes some measure of long-term reward

Markov Decision Process
A (finite) set of states S
A (finite) set of actions A
Transition model: T(s, a, s') = P(s' | s, a)
Reward function: R(s)
Discount factor: γ ∈ [0, 1]
Policy π
Optimal policy π*

Markov Decision Process
Bellman equation: V*(s) = R(s) + γ max_a Σ_s' T(s, a, s') V*(s')
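
A minimal value-iteration sketch (not from the original slides) showing the Bellman backup on a toy MDP built from the components listed above; the states, actions, transition table, and rewards here are hypothetical.

```python
# Value iteration on a toy MDP; states, actions, T, and R are illustrative only.
states = ["s0", "s1", "s2"]
actions = ["left", "right"]
gamma = 0.9  # discount factor

# Transition model T(s, a, s') = P(s' | s, a)
T = {
    ("s0", "left"):  {"s0": 1.0},
    ("s0", "right"): {"s1": 0.8, "s0": 0.2},
    ("s1", "left"):  {"s0": 1.0},
    ("s1", "right"): {"s2": 0.8, "s1": 0.2},
    ("s2", "left"):  {"s1": 1.0},
    ("s2", "right"): {"s2": 1.0},
}
R = {"s0": 0.0, "s1": 0.0, "s2": 1.0}  # reward function R(s)

V = {s: 0.0 for s in states}
for _ in range(100):
    # Bellman backup: V(s) <- R(s) + gamma * max_a sum_s' T(s,a,s') V(s')
    V = {
        s: R[s] + gamma * max(
            sum(p * V[s2] for s2, p in T[(s, a)].items()) for a in actions
        )
        for s in states
    }

# Optimal policy: act greedily with respect to the converged values
policy = {
    s: max(actions, key=lambda a: sum(p * V[s2] for s2, p in T[(s, a)].items()))
    for s in states
}
print(V)
print(policy)
```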

Biological Algorithms: Behavioral Control
Evaluate the world quickly
Choose appropriate behavior based on those valuations

Midbrain Dopamine Neurons
Central role in guiding our behavior and thoughts
Valuation of our world
–Value of money
–Value of other human beings
Major role in decision-making
Reward-dependent learning
Malfunction implicated in mental illness and neurological disorders: Parkinson's disease, schizophrenia

Reinforcement signals define an agent's goals
1. The organism is in state X and receives reward information.
2. The organism queries the stored value of state X.
3. The organism updates the stored value of state X based on the current reward information.
4. The organism selects an action based on its stored policy (see the sketch below).
5. The organism transitions to state Y and receives reward information.
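
A hedged Python sketch of this loop, assuming a tabular value store and a softmax policy over stored action values; the environment interface (env.reset, env.step, env.actions) is a hypothetical placeholder, not something defined in the slides.

```python
import math
import random
from collections import defaultdict

def run_episode(env, alpha=0.1, gamma=0.9, temperature=1.0, n_steps=100):
    Q = defaultdict(float)                 # stored values, indexed by (state, action)
    state = env.reset()                    # 1. the organism is in state X
    for _ in range(n_steps):
        # 4. select an action from the stored (softmax) policy
        prefs = [math.exp(Q[(state, a)] / temperature) for a in env.actions]
        action = random.choices(env.actions, weights=prefs)[0]
        # 5. transition to state Y and receive reward information
        next_state, reward = env.step(action)
        # 2.-3. query the stored value of state X and update it from the reward
        best_next = max(Q[(next_state, a)] for a in env.actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state
    return Q
```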

The Reward-Prediction Error Hypothesis
The difference between the experienced and predicted "reward" of an event
Phasic activity changes of ventral tegmental area neurons encode a 'prediction error about summed future reward'

The prediction-error signal is encoded in dopamine neuron firing.
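
In standard temporal-difference terms (a conventional formulation; the slides do not spell out the formula), the prediction error at time t compares the reward received plus the discounted value of the new state to the value predicted for the current state, and that error drives the value update:

```latex
\delta_t = r_t + \gamma\, V(s_{t+1}) - V(s_t),
\qquad
V(s_t) \leftarrow V(s_t) + \alpha\, \delta_t .
```

On this reading, a positive δ (outcome better than predicted) corresponds to a phasic increase in dopamine firing, and a negative δ to a dip below baseline.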

Value binding

Human Reward Responses
Orbitofrontal cortex (OFC)
Amygdala (Amyg)
Nucleus accumbens
Sublenticular extended amygdala
Hypothalamus (Hyp)
Ventral tegmental area (VTA)

Model-Based RL vs. Model-Free RL
Goal-directed vs. habitual behaviors
Implemented by two anatomically distinct systems (subject of debate)
Some findings suggest:
–Medial striatum is more engaged during planning
–Lateral striatum is more engaged during choices in extensively trained tasks

Model-Based RL vs. Model-Free RL
[Figure panels: (b) model-free RL; (c) model-based RL]
Human subjects exhibited a mixture of both effects (a sketch contrasting the two update rules follows below).
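
A hedged Python sketch (not taken from the paper) contrasting the two modes on a single observed transition: the model-free learner caches action values directly from experience, while the model-based learner updates a world model and derives values by planning over it. All names and structures here are illustrative.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9
actions = ["a0", "a1"]

Q_mf = defaultdict(float)                            # model-free: cached action values
T_counts = defaultdict(lambda: defaultdict(int))     # model-based: learned transition counts
R_est = defaultdict(float)                           # model-based: learned reward estimates

def model_free_update(s, a, r, s_next):
    """Cache values directly from experience (habit-like)."""
    target = r + GAMMA * max(Q_mf[(s_next, b)] for b in actions)
    Q_mf[(s, a)] += ALPHA * (target - Q_mf[(s, a)])

def model_based_update(s, a, r, s_next):
    """Update the world model itself; values are recomputed by planning."""
    T_counts[(s, a)][s_next] += 1
    R_est[(s, a)] += ALPHA * (r - R_est[(s, a)])

def model_based_value(s, a, depth=3):
    """Plan by unrolling the learned model (goal-directed)."""
    counts = T_counts[(s, a)]
    total = sum(counts.values())
    if depth == 0 or total == 0:
        return R_est[(s, a)]
    expected_future = sum(
        (n / total) * max(model_based_value(s2, b, depth - 1) for b in actions)
        for s2, n in counts.items()
    )
    return R_est[(s, a)] + GAMMA * expected_future
```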

Challenges in Relating Human Behavior to RL Algorithms
Humans tend to alternate rather than repeat an action after receiving a positively surprising payoff
There is tremendous heterogeneity in reports on human operant learning
Probability matching or not?

Heterogeneity in the world model

Learning the world model

Questions?

Reference List
Shteingart, H., & Loewenstein, Y. (2014). Reinforcement learning and human behavior. Current Opinion in Neurobiology.
Doll, B. B., Simon, D. A., & Daw, N. D. (2012). The ubiquity of model-based reinforcement learning. Current Opinion in Neurobiology.
Montague, P. R., Hyman, S. E., & Cohen, J. D. (2004). Computational roles for dopamine in behavioral control. Nature.