Unconditioned stimulus (food) causes unconditioned response (saliva); conditioned stimulus (bell) causes conditioned response (saliva).

Rescorla-Wagner rule: the predicted reward is v = wu, with u the stimulus (0 or 1) and w the weight. Adapt w to minimize the quadratic error <(r - v)^2>, which gives the delta rule w -> w + epsilon*delta*u with prediction error delta = r - v.
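
A minimal sketch of this update (the delta rule), assuming a binary stimulus, a scalar reward, and an illustrative learning rate epsilon:

```python
import numpy as np

def rescorla_wagner(stimuli, rewards, epsilon=0.1):
    """Delta-rule sketch: v = w*u, w <- w + epsilon * (r - v) * u."""
    w = 0.0
    history = []
    for u, r in zip(stimuli, rewards):
        v = w * u                 # predicted reward
        delta = r - v             # prediction error
        w += epsilon * delta * u  # update only when the stimulus is present
        history.append(w)
    return np.array(history)

# Acquisition: stimulus always paired with reward; w converges toward 1.
ws = rescorla_wagner(stimuli=[1] * 50, rewards=[1] * 50)
print(round(ws[-1], 3))
```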

The Rescorla-Wagner rule with multiple inputs, v = w·u, can predict various conditioning phenomena. Blocking: a previously learned association s1 -> r prevents learning of the association s2 -> r, because s1 already predicts the reward and the error delta stays near zero. Inhibition: s2 acquires a negative weight and reduces the prediction when combined with any predicting stimulus.
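
A hedged extension of the same sketch to two stimuli (v = w·u) that reproduces blocking; the trial counts and learning rate are illustrative:

```python
import numpy as np

def rw_multi(trials, epsilon=0.1, n_stimuli=2):
    """Vector delta rule: v = w.u, w <- w + epsilon * (r - v) * u."""
    w = np.zeros(n_stimuli)
    for u, r in trials:
        u = np.asarray(u, dtype=float)
        delta = r - w @ u
        w += epsilon * delta * u
    return w

# Phase 1: s1 alone -> reward.  Phase 2: compound s1 + s2 -> reward.
phase1 = [((1, 0), 1.0)] * 100
phase2 = [((1, 1), 1.0)] * 100
w = rw_multi(phase1 + phase2)
print(np.round(w, 2))  # w[0] ~ 1, w[1] ~ 0: learning about s2 is blocked
```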

Temporal difference learning: interpret v(t) as the total expected future reward from time t onward; v(t) is predicted from the stimulus history up to time t. Learning uses the temporal difference error delta(t) = r(t) + v(t+1) - v(t).

After learning, delta(t) = 0 implies: v(t=0) equals the sum of the expected future rewards; where v(t) is constant, the expected reward r(t) = 0; where v(t) is decreasing, the expected reward is positive (since at convergence r(t) = v(t) - v(t+1)).

Explanation of fig. 9.2: since u(t) = delta(t,0) (a stimulus at t = 0 only), Eq. 9.6 reduces to v(t) = w(t), and Eq. 9.7 becomes Delta w(t) = epsilon * delta(t). Thus Delta v(t) = epsilon * (r(t) + v(t+1) - v(t)). With r(t) = delta(t,T) (a single reward at t = T): step 1, the only change is v(T) -> v(T) + epsilon; step 2, v(T-1) and v(T) change; and so on, with the value and the prediction error propagating backwards towards the stimulus over trials.
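
A sketch of this tabular special case (stimulus at t = 0, unit reward at t = T, so v(t) = w(t)), showing the backward propagation over trials; T, epsilon, and the trial count are illustrative:

```python
import numpy as np

def td_trials(T=10, n_trials=200, epsilon=0.3):
    """Tabular TD: delta(t) = r(t) + v(t+1) - v(t), v(t) <- v(t) + eps*delta(t)."""
    v = np.zeros(T + 2)            # v(T+1) = 0 beyond the end of the trial
    r = np.zeros(T + 1)
    r[T] = 1.0                     # single reward at t = T
    for _ in range(n_trials):
        for t in range(T + 1):
            delta = r[t] + v[t + 1] - v[t]
            v[t] += epsilon * delta
    return v[:T + 1]

# After the first trial only v(T) has moved; after many trials v ~ 1 everywhere.
print(np.round(td_trials(n_trials=1), 2))
print(np.round(td_trials(), 2))
```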

Dopamine: a monkey must release a button and press another one after the stimulus to receive a reward. Panel A: VTA cells respond to the reward in early trials and to the stimulus in late trials, similar to delta in the TD rule of fig. 9.2.

Dopamine: dopamine neurons encode the reward prediction error delta. Panel B: withholding the reward reduces neural firing at the expected reward time, in agreement with the delta interpretation.

Static action choice: rewards result from actions. Bees visit flowers whose colors (blue, yellow) predict reward (sugar). The action values m_b and m_y encode the expected reward for each flower; actions are chosen with a softmax, P(blue) proportional to exp(beta * m_b), where beta implements the exploration-exploitation trade-off.
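
A sketch of softmax action selection over the two action values, assuming the exponential softmax form for the choice probabilities; names and parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def choose_flower(m, beta):
    """Softmax choice: P(a) proportional to exp(beta * m[a])."""
    logits = beta * np.asarray(m, dtype=float)
    p = np.exp(logits - logits.max())   # subtract max for numerical stability
    p /= p.sum()
    return rng.choice(len(m), p=p), p

m = {"blue": 0.5, "yellow": 1.5}
action, probs = choose_flower(list(m.values()), beta=1.0)
print(list(m)[action], np.round(probs, 2))
```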

The indirect actor model: learn the average nectar volume for each flower and act accordingly, implemented by on-line learning. When visiting a blue flower, update m_b -> m_b + epsilon * (r_b - m_b) and leave the yellow estimate unchanged (and vice versa). Fig: r_b = 1, r_y = 2 for t = 1..100 and reversed for t = 101..200. Panel A: m_y and m_b; panels B-D: cumulative reward for low beta (B) and high beta (C, D).
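
A sketch of the indirect actor's on-line updates combined with softmax choice, using the session layout described above; epsilon, beta, and the random seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax_prob(m, beta):
    z = np.exp(beta * (m - m.max()))
    return z / z.sum()

def indirect_actor(n_trials=200, epsilon=0.1, beta=1.0):
    m = np.zeros(2)                        # action values: [blue, yellow]
    total_reward = 0.0
    for t in range(n_trials):
        r = (1.0, 2.0) if t < n_trials // 2 else (2.0, 1.0)   # reversal halfway
        a = rng.choice(2, p=softmax_prob(m, beta))
        m[a] += epsilon * (r[a] - m[a])    # update only the visited flower
        total_reward += r[a]
    return m, total_reward

m, total = indirect_actor()
print(np.round(m, 2), total)
```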

Bumble bees: blue flowers give r = 2 for all flowers; yellow flowers give r = 6 for 1/3 of the flowers (same mean reward, higher variance). When the colors are switched at t = 15, the bees adapt quickly.

Bumble bees: model the action values as m = <f(r)> with a concave subjective utility f, so that m_b = f(2) is larger than m_y = (1/3) f(6); the bees behave risk-aversely.

Direct actor (policy gradient)

Direct actor: learn the action values m directly by stochastic gradient ascent on the expected reward. Fig: two sessions as in fig. 9.4, with good and bad behaviour. Problem: once the m values grow large, the softmax saturates and exploration is suppressed.
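
A sketch of one standard form of the direct-actor update for a softmax policy (a REINFORCE-style gradient with a running reward baseline r_bar); the slide's exact formula was an image, so treat this as an assumption consistent with the text:

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(m, beta=1.0):
    z = np.exp(beta * (m - m.max()))
    return z / z.sum()

def direct_actor(n_trials=200, epsilon=0.1, beta=1.0):
    """Stochastic gradient ascent on the expected reward (softmax policy)."""
    m = np.zeros(2)            # policy parameters for [blue, yellow]
    r_bar = 0.0                # running reward baseline
    true_r = np.array([1.0, 2.0])
    for _ in range(n_trials):
        p = softmax(m, beta)
        a = rng.choice(2, p=p)
        r = true_r[a]
        one_hot = np.eye(2)[a]
        m += epsilon * (r - r_bar) * (one_hot - p)   # chosen m up, others down
        r_bar += 0.05 * (r - r_bar)                  # slowly track the mean reward
    return m

print(np.round(direct_actor(), 2))
```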

Sequential action choice: the reward is obtained only after a sequence of actions, which raises the credit assignment problem: which earlier actions deserve credit for the eventual reward?

Sequential action choice via policy iteration: –Critic: use TD learning to evaluate v(state) under the current policy. –Actor: improve the policy m(state) using the critic's evaluation.

Policy evaluation: the initial policy is a random left/right choice at each turn. Evaluation is implemented with the TD rule: v(u) -> v(u) + epsilon * delta, with delta = r + v(u') - v(u) for the observed transition u -> u'.
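
A sketch of TD(0) evaluation of the random policy on a small binary maze; the maze layout and the rewards (0/5 on one branch, 0/2 on the other) are illustrative stand-ins for the figure:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative binary maze: A -> {B, C}; B and C each lead to two terminal arms.
next_states = {"A": ["B", "C"], "B": [None, None], "C": [None, None]}
rewards     = {"A": [0.0, 0.0], "B": [0.0, 5.0],   "C": [0.0, 2.0]}

def td_policy_evaluation(n_trials=5000, epsilon=0.05):
    v = {s: 0.0 for s in next_states}
    for _ in range(n_trials):
        u = "A"
        while u is not None:
            a = rng.integers(2)                      # random left/right policy
            u_next, r = next_states[u][a], rewards[u][a]
            v_next = v[u_next] if u_next is not None else 0.0
            delta = r + v_next - v[u]                # TD error
            v[u] += epsilon * delta
            u = u_next
    return v

print({s: round(x, 2) for s, x in td_policy_evaluation().items()})
```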

Policy improvement: can be understood as the direct-actor (policy gradient) rule, where the term r_a - r_bar is replaced by the TD error delta = r + v(u') - v(u), and the action values m become state dependent, m_a(u). Example: the current state is A.

Policy improvement changes the policy, so for provable convergence the new policy must be re-evaluated. Interleaving policy improvement and policy evaluation is called the actor-critic algorithm. Fig: actor-critic learning of the maze; note that learning at state C is slow.
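
A sketch of the interleaved actor-critic loop on the same illustrative maze: the critic performs TD evaluation and the actor nudges the state-dependent action values m_a(u) with the same TD error; parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

next_states = {"A": ["B", "C"], "B": [None, None], "C": [None, None]}
rewards     = {"A": [0.0, 0.0], "B": [0.0, 5.0],   "C": [0.0, 2.0]}

def softmax(m, beta=1.0):
    z = np.exp(beta * (m - m.max()))
    return z / z.sum()

def actor_critic(n_trials=3000, eps_v=0.1, eps_m=0.1, beta=1.0):
    v = {s: 0.0 for s in next_states}
    m = {s: np.zeros(2) for s in next_states}        # state-dependent action values
    for _ in range(n_trials):
        u = "A"
        while u is not None:
            p = softmax(m[u], beta)
            a = rng.choice(2, p=p)
            u_next, r = next_states[u][a], rewards[u][a]
            v_next = v[u_next] if u_next is not None else 0.0
            delta = r + v_next - v[u]                # shared TD error
            v[u] += eps_v * delta                    # critic: policy evaluation
            m[u] += eps_m * delta * (np.eye(2)[a] - p)   # actor: policy improvement
            u = u_next
    return v, m

v, m = actor_critic()
print({s: round(x, 2) for s, x in v.items()})
print({s: np.round(mu, 2) for s, mu in m.items()})
```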

Generalizations. Discounted reward: the value becomes v(t) = <sum_tau gamma^tau r(t + tau)> and the TD error changes to delta(t) = r(t) + gamma * v(t+1) - v(t). TD(lambda): apply the TD update not only to the value of the current state but also to recently visited past states, weighted by an eligibility trace. TD(0) is the ordinary TD rule; TD(1) updates all past states of the trial.
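
A sketch of TD(lambda) with discounting and an accumulating eligibility trace on a toy chain of states; gamma, lambda, and the chain itself are illustrative:

```python
import numpy as np

def td_lambda_chain(n_states=5, n_episodes=500,
                    epsilon=0.1, gamma=0.9, lam=0.8):
    """TD(lambda) on a chain 0 -> 1 -> ... -> terminal, reward 1 at the end."""
    v = np.zeros(n_states)
    for _ in range(n_episodes):
        e = np.zeros(n_states)                  # eligibility traces
        for s in range(n_states):
            terminal = (s == n_states - 1)
            r = 1.0 if terminal else 0.0
            v_next = 0.0 if terminal else v[s + 1]
            delta = r + gamma * v_next - v[s]   # discounted TD error
            e[s] += 1.0                         # mark the visited state
            v += epsilon * delta * e            # update all recently visited states
            e *= gamma * lam                    # decay traces; lam=0 gives plain TD
    return v

print(np.round(td_lambda_chain(), 2))
```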

Water maze: the state u is represented by 493 place cells, and there are 8 possible actions (movement directions). Actor-critic rules: the critic learns the value as a weighted sum of place-cell activities, and the actor learns state-dependent action values for the 8 directions, both driven by the TD error.
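
A sketch of the critic half only: the value is a weighted sum of (assumed Gaussian) place-cell activities and the TD error updates the weights; the place-field layout and parameters are illustrative, not those of the original model:

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative place-cell basis: Gaussian tuning curves tiling the pool.
n_cells = 493
centers = rng.uniform(-1.0, 1.0, size=(n_cells, 2))
sigma = 0.16
w = np.zeros(n_cells)                         # critic weights: v(u) = w . f(u)

def place_activity(pos):
    d2 = np.sum((centers - pos) ** 2, axis=1)
    return np.exp(-d2 / (2 * sigma ** 2))

def critic_update(pos, r, next_pos, epsilon=0.05, gamma=0.98, terminal=False):
    """TD update on the weights: w <- w + eps * delta * f(u)."""
    f = place_activity(pos)
    v_next = 0.0 if terminal else w @ place_activity(next_pos)
    delta = r + gamma * v_next - w @ f
    w[:] = w + epsilon * delta * f
    return delta

# One illustrative step of the rat moving a little to the right, no reward yet.
print(round(critic_update(np.array([0.0, 0.0]), 0.0, np.array([0.05, 0.0])), 3))
```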

Comparing rats and the model: RL predicts the initial learning well, but not the rats' rapid adjustment when the task is changed (e.g., when the platform is moved to a new location).

Markov decision process: state transitions P(u'|u, a); absorbing states end the trial. Goal: find the policy M that maximizes the total expected future reward. Solution: solve the Bellman equation v*(u) = max_a [ <r_a(u)> + sum_{u'} P(u'|u, a) v*(u') ].
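
A sketch of solving the Bellman equation by value iteration for a small MDP with given P(u'|u, a) and mean rewards; the two-state MDP and the discount factor gamma (added here for convergence) are illustrative:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Solve v(u) = max_a [ R[u,a] + gamma * sum_u' P[a,u,u'] v(u') ]."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    while True:
        q = R + gamma * np.einsum("aij,j->ia", P, v)   # q[u, a]
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)             # optimal values and policy M
        v = v_new

# Illustrative 2-state, 2-action MDP.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # action 0: rows are P[u' | u]
              [[0.5, 0.5], [0.6, 0.4]]])   # action 1
R = np.array([[1.0, 0.0],                  # R[u, a] = expected immediate reward
              [0.0, 2.0]])
v_star, policy = value_iteration(P, R)
print(np.round(v_star, 2), policy)
```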

Policy iteration is policy evaluation + policy improvement. Evaluation step: find the value of a policy M by solving v(u) = <r_{M(u)}(u)> + sum_{u'} P(u'|u, M(u)) v(u'). RL evaluates the right-hand side stochastically from samples: v(u) -> v(u) + epsilon * delta(t).

Improvement step: maximize { <r_a(u)> + sum_{u'} P(u'|u, a) v(u') } with respect to a. This requires knowledge of P(u'|u, a); the earlier actor-critic formula can be derived as a stochastic version of this step.
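
A sketch of exact policy iteration (evaluation by solving the linear system, then greedy improvement), of which the sampled TD/actor-critic updates above are the stochastic, model-free version; the MDP is the same illustrative one as in the value-iteration sketch:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate exact policy evaluation and greedy policy improvement."""
    n_actions, n_states, _ = P.shape
    M = np.zeros(n_states, dtype=int)          # initial policy
    while True:
        # Evaluation: solve v = R_M + gamma * P_M v  (a linear system).
        P_M = P[M, np.arange(n_states), :]     # P[u' | u, M(u)]
        R_M = R[np.arange(n_states), M]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_M, R_M)
        # Improvement: maximize { R[u,a] + gamma * sum_u' P(u'|u,a) v(u') } over a.
        q = R + gamma * np.einsum("aij,j->ia", P, v)
        M_new = q.argmax(axis=1)
        if np.array_equal(M_new, M):
            return v, M
        M = M_new

# Same illustrative MDP as in the value-iteration sketch.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
v, M = policy_iteration(P, R)
print(np.round(v, 2), M)
```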