CS 182/CogSci110/Ling109 Spring 2008 Reinforcement Learning: Details and Biology 4/3/2008 Srini Narayanan – ICSI and UC Berkeley

Lecture Outline  Reinforcement Learning:  Temporal Difference: TD Learning, Q-Learning  Demos (MDP, Q-Learning)  Animal Learning and Biology  Neuro-modulators and temporal difference  Discounting  Exploration and Exploitation  Neuroeconomics: Intro

Demo of MDP solution

Example: Bellman Updates
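For reference, the Bellman update behind these examples, written out in the standard value-iteration form (the exact reward convention in the slide's figure may differ slightly):

V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \, V_k(s') \right]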

Example: Value Iteration  Information propagates outward from terminal states and eventually all states have correct value estimates
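A minimal value-iteration sketch showing how correct values propagate outward from the terminal state, one sweep at a time. The tiny chain MDP, its reward function, and every name below (states, T, R, gamma) are illustrative assumptions, not the MDP from the demo.

```python
# Value iteration on a toy 4-state chain MDP (illustrative only).
# States 0..3; entering state 3 (terminal) yields reward +1; gamma discounts.
gamma = 0.9
states = [0, 1, 2, 3]
actions = ["left", "right"]

def T(s, a):
    """Deterministic transition: step left or right along the chain; state 3 absorbs."""
    if s == 3:
        return s
    return min(s + 1, 3) if a == "right" else max(s - 1, 0)

def R(s, a, s2):
    """Reward +1 for entering the terminal state, 0 otherwise."""
    return 1.0 if (s != 3 and s2 == 3) else 0.0

V = {s: 0.0 for s in states}
for sweep in range(20):
    # Synchronous Bellman update over all states; each sweep pushes correct
    # values one step further out from the terminal state.
    V = {s: max(R(s, a, T(s, a)) + gamma * V[T(s, a)] for a in actions)
         for s in states}

print(V)  # roughly {0: 0.81, 1: 0.9, 2: 1.0, 3: 0.0}
```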

Full Estimation (Dynamic Programming)  [backup diagram]

Simple Monte Carlo  [backup diagram]

Combining DP and MC  [backup diagram, highlighting the prediction error]

Model-Free Learning  Big idea: why bother learning T?  Update each time we experience a transition  Frequent outcomes will contribute more updates (over time)  Temporal difference learning (TD)  Policy still fixed!  Move values toward value of whatever successor occurs
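A sketch of the TD(0) update this slide describes, for evaluating a fixed policy. The learning rate, the (s, r, s') sample format, and the names below are assumptions made for illustration.

```python
# TD(0) policy evaluation (illustrative sketch).
# Each observed transition nudges V(s) toward the sample r + gamma * V(s');
# no transition model T is ever learned or used.
from collections import defaultdict

gamma, alpha = 0.9, 0.1
V = defaultdict(float)            # value estimate per state, default 0

def td_update(s, r, s2, terminal=False):
    sample = r + (0.0 if terminal else gamma * V[s2])
    V[s] += alpha * (sample - V[s])   # move V(s) a fraction alpha toward the sample
```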

Q-Learning  Learn Q*(s,a) values  Receive a sample (s, a, s’, r)  Consider your old estimate: Q(s, a)  Consider your new sample estimate: sample = r + γ max_a’ Q(s’, a’)  Nudge the old estimate towards the new sample: Q(s, a) ← (1 − α) Q(s, a) + α · sample
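The same update in code form. This is the standard tabular Q-learning rule; the environment loop, the action set, and the names below are illustrative assumptions.

```python
# Q-learning update (standard tabular form, illustrative sketch).
# On each sample (s, a, r, s'), nudge Q(s, a) toward r + gamma * max_a' Q(s', a').
from collections import defaultdict

gamma, alpha = 0.9, 0.1
Q = defaultdict(float)                     # keyed by (state, action) pairs
ACTIONS = ["up", "down", "left", "right"]  # hypothetical action set

def q_update(s, a, r, s2, terminal=False):
    best_next = 0.0 if terminal else max(Q[(s2, a2)] for a2 in ACTIONS)
    sample = r + gamma * best_next                 # new sample estimate
    Q[(s, a)] += alpha * (sample - Q[(s, a)])      # nudge old estimate toward it
```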

Any problems with this?  No guarantee you will explore the state space  The value of unexplored states is never computed  A fundamental problem in RL and in biology  How do we address this problem?  AI solutions include ε-greedy and softmax  Evidence from neuroscience (next lecture)
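A sketch of the softmax (Boltzmann) remedy mentioned above: higher-valued actions are chosen more often, but every action keeps nonzero probability, so unexplored states still get visited. The temperature parameter and all names are assumptions for illustration.

```python
# Softmax (Boltzmann) action selection (illustrative sketch).
import math, random

def softmax_action(Q, s, actions, temperature=1.0):
    prefs = [Q[(s, a)] / temperature for a in actions]
    m = max(prefs)                                   # subtract max for numerical stability
    weights = [math.exp(p - m) for p in prefs]
    probs = [w / sum(weights) for w in weights]
    return random.choices(actions, weights=probs, k=1)[0]
```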

Exploration / Exploitation  Several schemes for forcing exploration  Simplest: random actions (ε-greedy)  Every time step, flip a coin  With probability ε, act randomly  With probability 1 − ε, act according to current policy (best Q value, for instance)  Problems with random actions?  You do explore the space, but keep thrashing around once learning is done  One solution: lower ε over time  Another solution: exploration functions
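An ε-greedy selection sketch with the "lower ε over time" fix applied as a simple multiplicative decay. The decay constants and names here are illustrative assumptions.

```python
# Epsilon-greedy action selection with decaying epsilon (illustrative sketch).
import random

def epsilon_greedy(Q, s, actions, epsilon):
    if random.random() < epsilon:
        return random.choice(actions)               # explore: uniform random action
    return max(actions, key=lambda a: Q[(s, a)])    # exploit: best current Q-value

epsilon = 1.0
for episode in range(1000):
    # ... run one episode, choosing actions via epsilon_greedy(Q, s, actions, epsilon) ...
    epsilon = max(0.05, epsilon * 0.995)            # less thrashing as learning converges
```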

Q-Learning

 Q-learning produces a table of Q-values, one entry per (state, action) pair:

Q-Learning features  On-line, incremental  Bootstrapping (like DP, unlike MC)  Model-free  Converges to an optimal policy (Watkins 1989)  On average when α is held small  With probability 1 when α is high at the beginning and decays toward the end (e.g., α_k = 1/k)
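The usual way to state the learning-rate condition is the standard stochastic-approximation (Robbins-Monro) requirement, which the schedule α_k = 1/k satisfies:

\sum_{k} \alpha_k = \infty \quad\text{and}\quad \sum_{k} \alpha_k^2 < \infty \qquad (\text{e.g., } \alpha_k = 1/k)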

Reinforcement Learning  Basic idea:  Receive feedback in the form of rewards  Agent’s utility is defined by the reward function  Must learn to act so as to maximize expected utility  Change the rewards, change the behavior DEMO

Demo of Q-Learning  Demo: arm control  Parameters:  learning rate (α)  discount factor (high values weight future rewards)  exploration rate (should decrease with time)  MDP  Reward = number of pixels moved to the right per iteration  Actions: arm up and down (yellow line), hand up and down (red line)

Helicopter Control (Andrew Ng)

Lecture Outline  Reinforcement Learning:  Temporal Difference: TD Learning, Q-Learning  Demos (MDP, Q-Learning)  Animal Learning and Biology  Neuro-modulators and temporal difference  Discounting  Exploration and Exploitation  Neuroeconomics: Intro

Example: Animal Learning  RL studied experimentally for more than 60 years in psychology  Rewards: food, pain, hunger, drugs, etc.  Conditioning  Mechanisms and sophistication debated  More recently, neuroscience has provided data on  Biological reality of prediction error and TD (and Q) learning  Utility structure and reward discounting  Exploration vs. exploitation

Dopamine levels track prediction error  Panels: unpredicted reward (unlearned / no stimulus), predicted reward (learned task), omitted reward (probe trial)  (Montague et al. 1996; Wolfram Schultz lab)
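The link to TD learning: the phasic dopamine response tracks the prediction error

\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)

Unpredicted reward: δ > 0 at reward delivery. Fully predicted reward: δ ≈ 0 at delivery, with the response transferred to the earlier predictive cue. Omitted reward: δ < 0 at the time the reward was expected.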

Dopamine and prediction error  Rats were trained on a maze; a sucrose solution at the end was the reward  (Figure: photosensors; dopamine antagonists)

RL Model Behavior

Human learning

Reward prediction in humans  Dopamine neurons in the VTA  fMRI study: changes in the BOLD signal  (Decision Lab, Stanford University)

Reward prediction in humans  Explicit losses (punishment) seem to engage a different circuit than the positive signal  Changes are modulated by the probability of reward  (Decision Lab, Stanford University)

Dopamine neurons and their role

Hyperbolic discounting (Ainslie 1992)  Short-term rewards are treated differently from long-term rewards  Used in many animal discounting models  Has been used to explain procrastination and addiction  Behavior changes as rewards become imminent
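For comparison, the two standard discount curves (symbols follow the usual convention: A is the reward amount, D the delay, k the discount rate):

Exponential: V = A e^{-kD} \qquad Hyperbolic: V = \frac{A}{1 + kD}

Exponential curves for a smaller-sooner and a larger-later reward never cross, but hyperbolic curves can, which is what produces the preference reversals as rewards become imminent.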

McClure & Cohen fMRI experiments

Different circuits for immediate and delayed rewards?

Immediate and Long-term rewards

Basic conclusion of the McClure & Cohen experiments  Two critical predictions:  Choices that include a reward today will preferentially engage limbic structures relative to choices that do not include a reward today  Trials in which the later reward is selected will be associated with relatively higher levels of lateral prefrontal activation, reflecting the ability of this system to value greater rewards even when they are delayed  Hyperbolic discounting may reflect a tension between limbic and more prefrontal structures…  As in the grasshopper and the ant (from Aesop)  Lots of implications for marketing, education…  Twist: more recent results suggest that the same systems may be involved at different activity levels for immediate and delayed rewards (Kable & Glimcher 2007, Nature Neuroscience)  Either case provides unambiguous evidence that subjective value is explicitly represented in neural activity.

Exploration vs. Exploitation  Fundamental issue in adapting to a complex (changing) world  A complex biological issue; multiple factors may play a role  Consistently implicates neuro-modulatory systems thought to be involved in assessing reward and uncertainty (DA, NE, ACh)  The midbrain dopamine system has been linked to reward prediction errors  The locus coeruleus (LC) noradrenergic system has been proposed to govern the balance between exploration and exploitation in response to reward history (Aston-Jones & Cohen 2005)  The basal forebrain cholinergic system, together with the noradrenergic system, has been proposed to monitor uncertainty, signalling expected and unexpected forms respectively, which in turn might be used to promote exploitation or exploration (Yu & Dayan 2005)

Discounting and exploration  Aston-Jones, G. & Cohen, J. D. (2005) An integrative theory of locus coeruleus–norepinephrine function: adaptive gain and optimal performance. Annu. Rev. Neurosci. 28, 403–450.

Toward a biological model  McClure et al., Philosophical Transactions of the Royal Society, 2007

The Ultimatum Game: Human utility Sanfey, A.G. et al. (2003) The neural basis of economic decision making in the Ultimatum Game. Science

Summary  Biological evidence for  Prediction error and TD learning  Discounting  Hyperbolic  Two systems?  Exploitation and exploration  LC/NE phasic and tonic modes  Social features cue the relationship between discounting, utility, and explore/exploit

Areas that are probably directly involved in RL  Basal Ganglia  Striatum (Ventral/Dorsal), Putamen, Substantia Nigra  Midbrain (VTA) and brainstem/hypothalamus (NC)  Amygdala  Orbito-Frontal Cortex  Cingulate Circuit (ACC)  Cerebellum  PFC  Insula

Neuroeconomics: Current topics  How (and where) are value and probability combined in the brain to provide a utility signal? What are the dynamics of this computation?  What neural systems track classically defined forms of expected and discounted utility? Under what conditions do these computations break down?  How is negative utility signaled? Is there a negative utility prediction signal comparable to the one for positive utility?  How are rewards of different types mapped onto a common neural currency like utility?  How do systems that seem to be focused on immediate decisions and actions interact with systems involved in longer-term planning (e.g. making a career decision)?  For example, does an unmet need generate a tonic and progressively increasing signal (i.e. a mounting ‘drive’), or does it manifest as a recurring episodic/phasic signal with increasing amplitude?  What are the connections between utility and ethics? Social issues.

Reinforcement Learning: What you should know  Basics  Utilities, preferences, conditioning  Algorithms  MDP formulation, Bellman's equation  Basic learning formulation, temporal difference, Q-learning  Biology  Role of neuromodulators  Dopamine's role  Short- vs. long-term rewards, hyperbolic discounting  Exploration vs. exploitation  Neuroeconomics: the basic idea and questions  What you might wonder  Role of reinforcement learning in language learning  Role of rewards and utility maximization in ethics, boredom…  Role of neuro-modulation in cognition and behavior…