Unconditioned stimulus (food) causes unconditioned response (saliva). Conditioned stimulus (bell) causes conditioned response (saliva).

Rescorla-Wagner rule: v = wu, with u the stimulus (0 or 1), w the weight, and v the predicted reward. Adapt w to minimize the quadratic error ⟨(r − v)²⟩; stochastic gradient descent gives the delta rule w → w + εδu with δ = r − v.
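A minimal sketch of this delta-rule update for a single stimulus; the learning rate and trial counts are assumed values:

```python
import numpy as np

def rescorla_wagner(rewards, stimuli, epsilon=0.1):
    """Delta rule: v = w*u, then w <- w + epsilon * (r - v) * u."""
    w = 0.0
    trace = []
    for r, u in zip(rewards, stimuli):
        v = w * u                 # predicted reward
        delta = r - v             # prediction error
        w += epsilon * delta * u
        trace.append(w)
    return np.array(trace)

# Stimulus present and rewarded on every trial: w climbs towards 1.
print(rescorla_wagner(np.ones(40), np.ones(40))[[0, 9, 39]].round(3))
```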

The Rescorla-Wagner rule for multiple inputs (v = w · u) predicts various conditioning phenomena:
–Blocking: a previously learned association s1 → r prevents learning of the association s2 → r, because s1 already predicts the reward and the error δ stays near zero
–Inhibition: s2 acquires a negative weight, reducing the prediction when combined with any predicting stimulus
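A sketch of blocking with the vector form of the rule; the two training phases, learning rate, and trial counts are illustrative assumptions:

```python
import numpy as np

def rw_update(w, u, r, epsilon=0.1):
    """Vector Rescorla-Wagner step: v = w.u, w <- w + epsilon * (r - v) * u."""
    delta = r - w @ u
    return w + epsilon * delta * u

w = np.zeros(2)
# Phase 1: s1 alone is paired with reward -> w[0] approaches 1.
for _ in range(100):
    w = rw_update(w, np.array([1.0, 0.0]), r=1.0)
# Phase 2: the compound s1+s2 gets the same reward -> delta is ~0, so w[1] stays near 0 (blocking).
for _ in range(100):
    w = rw_update(w, np.array([1.0, 1.0]), r=1.0)
print(w.round(3))
```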

Temporal difference learning: interpret v(t) as the total expected future reward from time t onward; v(t) is predicted from the stimuli presented in the past.

After learning, δ(t) = 0 implies:
–v(0) equals the sum of expected future rewards
–v(t) constant ⇒ expected reward r(t) = 0
–v(t) decreasing ⇒ positive expected reward

Explanation of Fig. 9.2: since the stimulus is present only at t = 0 (u(t) = δ_{t,0}), Eq. 9.6 reduces to v(t) = w(t), and Eq. 9.7 becomes Δw(t) = ε δ(t). Thus Δv(t) = ε (r(t) + v(t+1) − v(t)). With the reward delivered only at t = T (r(t) = δ_{t,T}):
–Step 1: the only change is v(T) → v(T) + ε
–Step 2: v(T−1) and v(T) change
–etc., the update propagating backwards step by step
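A sketch of this tabular TD learning for the single-trial paradigm (stimulus at t = 0, unit reward at t = T); T, ε, and the number of trials are assumed values:

```python
import numpy as np

def td_trials(T=20, n_trials=500, epsilon=0.2):
    """Tabular TD(0) for one trial type: v(t) = w(t), reward only at t = T."""
    v = np.zeros(T + 2)          # v(T+1) = 0 marks the end of the trial
    r = np.zeros(T + 1)
    r[T] = 1.0                   # reward delivered only at the final time step
    for _ in range(n_trials):
        for t in range(T + 1):
            delta = r[t] + v[t + 1] - v[t]   # TD error
            v[t] += epsilon * delta
    return v[:T + 1]

# The value (total expected future reward) approaches 1 for every t <= T,
# spreading backwards from t = T over successive trials.
print(td_trials().round(2))
```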

Dopamine: the monkey releases one key and presses another after the stimulus in order to receive a reward. A: VTA cells respond to the reward in early trials and to the stimulus in late trials, similar to the prediction error δ in the TD rule (cf. Fig. 9.2).

Dopamine neurons encode the reward prediction error δ. B: withholding the reward reduces neural firing, in agreement with the δ interpretation (δ < 0).

Static action choice: rewards result directly from actions. Bees visit flowers whose colour (blue or yellow) predicts the reward (sugar).
–m are action values that encode the expected reward per flower type; actions are chosen by a softmax, P(a) ∝ exp(β m_a), where β implements the exploration/exploitation trade-off (see the sketch below)
–P are the action probabilities
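A minimal sketch of the softmax choice rule; the action values and β settings are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_choice(m, beta):
    """Pick an action with probability proportional to exp(beta * m_a)."""
    p = np.exp(beta * m)
    p /= p.sum()
    return rng.choice(len(m), p=p), p

m = np.array([1.0, 2.0])          # action values for (blue, yellow)
for beta in (0.0, 1.0, 5.0):
    _, p = softmax_choice(m, beta)
    print(beta, p.round(3))       # higher beta -> more deterministic choice of the better flower
```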

The indirect actor model: learn the average nectar volume for each flower type and act accordingly, implemented by online learning. When visiting a blue flower, update m_b ← m_b + ε(r_b − m_b) and leave the yellow estimate m_y unchanged (and vice versa). Fig: r_b = 1, r_y = 2 for t = 1..100 and reversed for t = 101..200. A: m_y, m_b; B-D: cumulative reward for low β (B) and high β (C, D).
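A sketch of the indirect actor in this two-flower setting, with the reward reversal at trial 100 as in the figure; ε and the β values are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def indirect_actor(n_trials=200, epsilon=0.1, beta=1.0):
    """Online action values m for (blue, yellow); softmax choice; rewards reverse at t = 100."""
    m = np.zeros(2)
    total_reward = 0.0
    for t in range(n_trials):
        r = (1.0, 2.0) if t < 100 else (2.0, 1.0)   # nectar volumes (blue, yellow)
        p = np.exp(beta * m)
        p /= p.sum()
        a = rng.choice(2, p=p)
        m[a] += epsilon * (r[a] - m[a])             # update only the flower that was visited
        total_reward += r[a]
    return m.round(2), total_reward

print(indirect_actor(beta=1.0))
print(indirect_actor(beta=5.0))   # high beta exploits more but adapts more slowly after the reversal
```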

Bumble bees show risk aversion:
–blue: r = 2 for every flower; yellow: r = 6 for 1/3 of the flowers (same mean reward). When the contingencies are switched at t = 15, the bees adapt quickly
–A: average of 5 bees
–B: a concave subjective utility function, with m(2) > 2/3 m(0) + 1/3 m(6), favours risk avoidance
–C: model prediction

Direct actor (policy gradient)

Direct actor: adjust the action parameters m directly by stochastic gradient ascent on the expected reward. Fig: two sessions as in Fig. 9.4, one showing good and one showing bad behaviour. Problem: when the m values grow large, the softmax saturates and prevents exploration.
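A sketch of a direct actor on the same two-flower task, using the standard softmax policy-gradient (REINFORCE-style) update with a running average reward as baseline; the learning rates and the baseline scheme are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def direct_actor(n_trials=200, epsilon=0.1, beta=1.0):
    """Softmax policy with parameters m, adjusted by stochastic gradient ascent on reward."""
    m = np.zeros(2)               # action parameters for (blue, yellow)
    r_bar = 0.0                   # running average reward used as a baseline
    for t in range(n_trials):
        rewards = (1.0, 2.0) if t < 100 else (2.0, 1.0)
        p = np.exp(beta * m)
        p /= p.sum()
        a = rng.choice(2, p=p)
        r = rewards[a]
        g = -p                    # score function: grad log P(a) is proportional to one_hot(a) - p
        g[a] += 1.0
        m += epsilon * (r - r_bar) * g
        r_bar += 0.05 * (r - r_bar)
    return m.round(2)

print(direct_actor())
```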

Sequential action choice / delayed reward: the reward is obtained only after a sequence of actions.
–The rat moves through the maze without backtracking; after collecting a reward it is removed from the maze and restarted
–Delayed reward problem: the choice at A yields no direct reward

Sequential action choice / delayed reward. Policy iteration (see also Kaelbling 3.2.2), loop:
–policy evaluation: compute the value V_π of the current policy π by running Bellman backups until convergence
–policy improvement: improve π greedily with respect to V_π

Sequential action choice / delayed reward. Actor-critic (see also Kaelbling 4.1), loop:
–critic: use TD learning to evaluate V(state) under the current policy
–actor: improve the policy p(state) using the critic's values
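A minimal actor-critic sketch on a hypothetical two-step maze (A branches to B or C, whose arms end in terminal rewards); the reward layout 5/0 and 0/2 and all learning parameters are illustrative assumptions, not values taken from the figure:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical maze: at A choose left (to B) or right (to C); at B or C choose an
# arm and collect a terminal reward.
REWARDS = {"B": (5.0, 0.0), "C": (0.0, 2.0)}

def actor_critic(n_trials=2000, eps_v=0.1, eps_m=0.1, beta=1.0):
    v = {"A": 0.0, "B": 0.0, "C": 0.0}               # critic: state values
    m = {s: np.zeros(2) for s in ("A", "B", "C")}    # actor: action preferences

    def choose(s):
        p = np.exp(beta * m[s])
        p /= p.sum()
        return rng.choice(2, p=p), p

    for _ in range(n_trials):
        # Step 1: A -> B or C, no immediate reward.
        a, p = choose("A")
        s_next = "B" if a == 0 else "C"
        delta = 0.0 + v[s_next] - v["A"]             # critic: TD error r + V(s') - V(s)
        v["A"] += eps_v * delta
        g = -p
        g[a] += 1.0
        m["A"] += eps_m * delta * g                  # actor: move preferences along delta

        # Step 2: B or C -> terminal arm with reward.
        a, p = choose(s_next)
        r = REWARDS[s_next][a]
        delta = r - v[s_next]
        v[s_next] += eps_v * delta
        g = -p
        g[a] += 1.0
        m[s_next] += eps_m * delta * g
    return v

# v('B') approaches 5 under the learned policy; learning at C is slower because
# C is visited less and less often once the actor prefers B.
print({s: round(x, 2) for s, x in actor_critic().items()})
```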

Policy evaluation: the policy chooses left/right at random at each turn. Evaluation is implemented as TD learning (with w = v): v(u) ← v(u) + ε δ.

Policy improvement: base the action choice on the expected future reward of each action relative to the expected reward of the current state. Example at state A: compare the values of moving to B versus C. Use ε-greedy or softmax action selection for exploration (see the sketch below).
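A minimal ε-greedy selector over such action values; the example values and ε are placeholders:

```python
import numpy as np

rng = np.random.default_rng(4)

def epsilon_greedy(action_values, eps=0.1):
    """With probability eps pick a random action, otherwise the highest-valued one."""
    if rng.random() < eps:
        return int(rng.integers(len(action_values)))
    return int(np.argmax(action_values))

# e.g. at state A: expected future reward of going towards B vs. towards C
print(epsilon_greedy(np.array([2.5, 1.0])))
```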

Policy improvement changes the policy, so the policy must be re-evaluated; alternating complete evaluation and improvement steps has a proven convergence guarantee. Interleaving policy improvement and policy evaluation is called actor-critic. Fig: actor-critic learning of the maze; note that learning at C is slow, since C is rarely visited once the policy favours B.

Generalizations. Discounted reward: the TD error becomes δ(t) = r(t) + γ v(t+1) − v(t). TD(λ): apply the TD update not only to the value of the current state but also to recently visited past states, weighted by eligibility traces. TD(0) is the plain TD rule; TD(1) updates all past states of the episode.
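A sketch of TD(λ) with accumulating eligibility traces over one episode of (state, reward, next state) steps; γ, λ, ε, and the example episode are assumptions:

```python
import numpy as np

def td_lambda_episode(v, transitions, gamma=0.9, lam=0.8, epsilon=0.1):
    """One TD(lambda) episode. `transitions` is a list of (state, reward, next_state)
    tuples, with next_state = None at the end; `v` maps states to values."""
    e = {s: 0.0 for s in v}                      # eligibility traces
    for s, r, s_next in transitions:
        v_next = 0.0 if s_next is None else v[s_next]
        delta = r + gamma * v_next - v[s]        # TD error
        e[s] += 1.0                              # current state becomes eligible
        for u in v:                              # update all recently visited states
            v[u] += epsilon * delta * e[u]
            e[u] *= gamma * lam                  # traces decay; lam = 0 recovers plain TD
    return v

values = {"A": 0.0, "B": 0.0, "C": 0.0}
episode = [("A", 0.0, "B"), ("B", 5.0, None)]
print(td_lambda_episode(values, episode))
```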

Water maze: the state is represented by place-cell activity (Foster et al., Eq. 1); the actor chooses among 8 movement directions. Critic and actor both operate on this place-cell representation (Foster et al., Eqs. 3-10).

Comparing rats and model. Left: average performance of 12 rats, four trials per day. The RL model predicts initial learning well, but not the rats' fast adaptation when the task changes.

Markov decision process: state transitions P(u'|u, a), with absorbing (terminal) states. Find the policy M that maximizes the expected total future reward. Solution: solve the Bellman equation for the state values.
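A sketch of evaluating a fixed (random) policy by solving the Bellman equation v = r + Pv as a linear system, using the illustrative maze rewards assumed earlier:

```python
import numpy as np

# Illustrative maze under the random policy: non-terminal states A, B, C.
# r[u] = expected immediate reward, P[u, u'] = transition probabilities under the policy.
r = np.array([0.0, 2.5, 1.0])            # A, B, C (terminal arms absorbed into r)
P = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])

# Bellman equation v = r + P v  <=>  (I - P) v = r
v = np.linalg.solve(np.eye(3) - P, r)
print(dict(zip("ABC", v.round(2))))      # {'A': 1.75, 'B': 2.5, 'C': 1.0} for this layout
```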

Policy iteration is policy evaluation + policy improvement. Evaluation step: find the value of a policy M by solving the Bellman equation; RL evaluates the right-hand side stochastically with the TD update v(u) ← v(u) + ε δ(t).

Improvement step: in each state, maximize the bracketed expression (expected immediate reward plus expected value of the successor state) with respect to a; this requires knowledge of P(u'|u, a). The earlier improvement formula can be derived as a stochastic version of this step.
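A compact tabular policy-iteration sketch for a generic finite MDP; the array layout P[a, u, u'], R[u, a], the discount γ, and the toy example are assumptions:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.95):
    """Tabular policy iteration. P[a, u, u']: transition probabilities; R[u, a]: expected reward."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
        P_pi = P[policy, np.arange(n_states)]
        r_pi = R[np.arange(n_states), policy]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to the evaluated values.
        q = R.T + gamma * P @ v                  # q[a, u]
        new_policy = q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, v.round(2)
        policy = new_policy

# Toy example: 2 states, 2 actions; action 1 moves to (or stays in) the rewarding state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
R = np.array([[0.0, 0.0],
              [0.0, 1.0]])
print(policy_iteration(P, R))
```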