Reinforcement learning
This is mostly taken from Dayan and Abbott, ch. 9. Reinforcement learning differs from supervised learning in that there is no all-knowing teacher; the reinforcement signal carries less information. The central problem is temporal credit assignment.

Example: spatial learning is impaired by block of NMDA receptors (Morris, 1989).
[Figure: Morris water maze, with the rat and the hidden platform labeled.]

Solving this problem comprises two separate tasks:
1. Predicting reward
2. Choosing the correct action
or, equivalently:
1. Policy evaluation (the critic)
2. Policy improvement (the actor)

Classical vs. instrumental conditioning
Classical conditioning: think of Pavlov's dog. In instrumental conditioning the animal is rewarded for “correct” actions and not rewarded, or even punished, for incorrect ones. In instrumental (operant) conditioning, what the animal does (its policy) matters.

Predicting reward: the Rescorla-Wagner rule
Notation: u is the stimulus, r the reward, v the expected reward, and w the weight (filter). The prediction is v = w u, and w is updated with the delta rule
w → w + ε δ u,  with  δ = r − v.
For more than one stimulus, u and w become vectors and the prediction is v = w · u.

[Figure: simulated weight w across trials under the Rescorla-Wagner rule for three cases: learning (r = 1), extinction (r = 0), and random reward.]
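The figure above can be reproduced with a short simulation. Below is a minimal Python sketch of the Rescorla-Wagner rule as written above; the learning rate ε = 0.1, the trial counts, and the 50% reward probability in the random-reward case are illustrative choices, not values from the slides.

import numpy as np

def rescorla_wagner(rewards, w0=0.0, eps=0.1):
    # Rescorla-Wagner rule for a single stimulus that is present (u = 1) on every trial.
    w, ws = w0, []
    for r in rewards:
        v = w                 # prediction v = w * u with u = 1
        delta = r - v         # prediction error
        w = w + eps * delta   # delta rule: w -> w + eps * delta * u
        ws.append(w)
    return np.array(ws)

rng = np.random.default_rng(0)
learning   = rescorla_wagner(np.ones(100))                           # acquisition: r = 1 on every trial
extinction = rescorla_wagner(np.zeros(100), w0=learning[-1])         # extinction: r = 0, starting from the learned w
random_rew = rescorla_wagner((rng.random(200) < 0.5).astype(float))  # reward on roughly half the trials

print(f"after learning w = {learning[-1]:.2f}, after extinction w = {extinction[-1]:.2f}, "
      f"random reward w = {random_rew[-1]:.2f}")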

Predicting future reward: temporal difference learning
In more realistic conditions, especially in operant conditioning, the actual reward may come some time after the signal for the reward. What we care about is then not the immediate reward at this time point, but rather the total reward predicted given the choice made at this time. How can we estimate the total reward? The total average future reward at time t is
R(t) = ⟨ Σ_{τ=0}^{T−t} r(t+τ) ⟩.
Assume that we estimate this with a linear estimator:
v(t) = Σ_{τ=0}^{t} w(τ) u(t−τ).

Use the δ rule at time t:
w(τ) → w(τ) + ε δ(t) u(t−τ),
where δ(t) is the difference between the actual future rewards and the prediction of these rewards:
δ(t) = Σ_{τ=0}^{T−t} r(t+τ) − v(t).
But we do not know the future rewards Σ_{τ=0}^{T−t} r(t+τ). Instead we can approximate this sum by:
r(t) + v(t+1).

Which gives us:
δ(t) = r(t) + v(t+1) − v(t).    (1)
The temporal difference learning rule then becomes:
w(τ) → w(τ) + ε δ(t) u(t−τ).    (2)
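As an illustration, here is a minimal Python sketch of equations (1) and (2) for a conditioning trial in which a stimulus is followed by a reward several time steps later; the trial length, stimulus and reward times, and learning rate are illustrative choices, not values from the slides.

import numpy as np

T, t_stim, t_rew, eps = 25, 5, 15, 0.2

u = np.zeros(T); u[t_stim] = 1.0   # stimulus
r = np.zeros(T); r[t_rew] = 1.0    # reward
w = np.zeros(T)                    # weights w(tau)

def value(w, u, t):
    # v(t) = sum_{tau=0}^{t} w(tau) * u(t - tau)
    taus = np.arange(t + 1)
    return np.dot(w[taus], u[t - taus])

for trial in range(200):
    v = np.array([value(w, u, t) for t in range(T)])
    v_next = np.append(v[1:], 0.0)               # v(t+1), taken as 0 after the trial ends
    delta = r + v_next - v                       # eq. (1): TD error
    for t in range(T):
        taus = np.arange(t + 1)
        w[taus] += eps * delta[t] * u[t - taus]  # eq. (2): weight update

v = np.array([value(w, u, t) for t in range(T)])
print(np.round(v, 2))  # after learning, v rises at the stimulus time and predicts the upcoming reward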

Dopamine and predicted reward
[Figure: activity of VTA dopaminergic neurons in a monkey. A: top, before learning; bottom, after learning. B: after learning; top, with reward; bottom, no reward.]

Generalization of TD(0):
1. u can be a vector u, so w is also a vector. This allows for more complex, or multiple possible, stimuli.
2. A decay (discount) term γ, with 0 < γ ≤ 1. The TD error becomes
δ = r + γ v(u′) − v(u),
where u is the current location and u′ is the location moved to after action a. This has the effect of putting a stronger emphasis on rewards that take fewer steps to reach.
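As a quick numerical illustration of the discount term (the values γ = 0.9 and reward = 5 below are arbitrary choices), the discounted value of a reward shrinks with the number of steps needed to reach it:

gamma, reward = 0.9, 5.0
# the discounted value of a reward obtained after n steps is gamma**n * reward
for n in (1, 3, 10):
    print(n, gamma**n * reward)   # fewer steps -> larger discounted value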

Until now we have seen how to predict a reward. We still need to see how we make decisions about which path to take, that is, what policy to use. Consider the bee foraging example: a bee chooses between blue and yellow flowers, with a different reward for each flower, drawn from P(r_b) and P(r_y).

Learn “action values” m_b and m_y (the actor); these will determine which choice to make. Assume r_b = 1 and r_y = 2. What is the best choice we can make? The average reward is
⟨r⟩ = p_b r_b + p_y r_y,
where p_b and p_y are the probabilities of choosing blue and yellow. What will maximize this reward? (Always choosing yellow, p_y = 1, giving ⟨r⟩ = 2.)

Learn “action values” m_b and m_y; these will determine which choice to make. Use softmax:
p_b = exp(β m_b) / (exp(β m_b) + exp(β m_y)),  p_y = 1 − p_b.
This is a stochastic choice; β is a variability parameter (larger β means a more deterministic choice). A good choice for the “action values” is to set them to the mean rewards:
m_b = ⟨r_b⟩,  m_y = ⟨r_y⟩.
This is also called the “indirect actor”.

How good is this choice? Assume β = 1, r_b = 1, r_y = 2; what is the average reward?
>> rb=1; ry=2;
>> pb=exp(rb)/(exp(rb)+exp(ry))
pb =
    0.2689
>> py=exp(ry)/(exp(rb)+exp(ry))
py =
    0.7311
>> r_av=rb*pb+ry*py
r_av =
    1.7311

This choice can be learned using a delta rule on the action value of the chosen flower (move it toward the reward just received), as sketched in the code below.
[Figure: simulated choices for β = 1 and β = 50; for t < 100, r_b = 1 and r_y = 2; for t > 100, r_b = 2 and r_y = 1.]
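A minimal Python sketch of this indirect actor, assuming the delta rule m_x → m_x + ε (r_x − m_x) for the chosen flower and the reward switch at t = 100 shown in the figure; ε = 0.1 and the trial count are illustrative choices.

import numpy as np

rng = np.random.default_rng(1)
eps, beta, T = 0.1, 1.0, 200
m = {"b": 0.0, "y": 0.0}   # action values (indirect actor)
choices = []

for t in range(T):
    r_true = {"b": 1.0, "y": 2.0} if t < 100 else {"b": 2.0, "y": 1.0}
    # softmax choice between the blue and yellow flowers
    p_b = np.exp(beta * m["b"]) / (np.exp(beta * m["b"]) + np.exp(beta * m["y"]))
    x = "b" if rng.random() < p_b else "y"
    # delta rule on the chosen action value: m_x -> m_x + eps * (r_x - m_x)
    m[x] += eps * (r_true[x] - m[x])
    choices.append(x)

print("m_b = %.2f, m_y = %.2f" % (m["b"], m["y"]))
print("fraction yellow in the last 50 trials:", np.mean([c == "y" for c in choices[-50:]]))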

Another option, the “direct actor”, is to set the action values directly so as to maximize the expected reward
⟨r⟩ = p_b ⟨r_b⟩ + p_y ⟨r_y⟩.
This can be done by stochastic gradient ascent on ⟨r⟩. For example, when the blue flower is chosen:
m_b → m_b + ε (1 − p_b)(r_b − r_0),  m_y → m_y − ε p_y (r_b − r_0).
So that generally, for action value m_x given that action a was chosen:
m_x → m_x + ε (δ_{a,x} − p_x)(r_a − r_0),
where δ_{a,x} = 1 if x = a and 0 otherwise. A good choice for r_0 is the mean of r_x over all possible choices. (See the D&A book, pg. 344.)
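For comparison with the indirect actor above, here is a minimal Python sketch of this direct-actor update, using a running average of received rewards as r_0; ε, β, the trial count, and the same switching reward schedule as above are illustrative choices.

import numpy as np

rng = np.random.default_rng(2)
eps, beta, T = 0.1, 1.0, 200
m = np.zeros(2)    # m[0] = m_b, m[1] = m_y
r_bar = 0.0        # running estimate of the mean reward, used as r_0

for t in range(T):
    r_true = np.array([1.0, 2.0]) if t < 100 else np.array([2.0, 1.0])
    p = np.exp(beta * m) / np.sum(np.exp(beta * m))  # softmax over the two flowers
    a = rng.choice(2, p=p)                           # sample an action
    r = r_true[a]
    one_hot = np.eye(2)[a]
    m += eps * (one_hot - p) * (r - r_bar)           # m_x -> m_x + eps * (delta_{a,x} - p_x) * (r - r_0)
    r_bar += 0.1 * (r - r_bar)                       # update the reward baseline

print("m =", np.round(m, 2), " p =", np.round(p, 2))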

The maze task and sequential action choice
[Figure: maze with entry point A, junctions B and C, and rewards at the exits.]
Policy evaluation: start from an initial random policy and evaluate v at each location. What would it be for an ideal policy?

Policy improvement
Using the direct actor, learn to improve the policy. Note: policy improvement and policy evaluation are best carried out sequentially: evaluate, improve, evaluate, improve, ...
At A (where the immediate reward is 0), the TD error is
δ = v(B) − v(A) for a left turn, and
δ = v(C) − v(A) for a right turn,
and δ takes the place of r − r_0 in the direct-actor update of the action values at A.

Policy evaluation with the initial random policy gives v(A) = 1.75, v(B) = 2.5, v(C) = 1.
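To tie policy evaluation and policy improvement together, here is a minimal Python sketch of an actor-critic on a maze of this shape. The reward layout (5 and 0 at the two exits from B, 2 and 0 at the two exits from C) is an assumption taken from Dayan and Abbott's example, and is consistent with the values quoted above under a uniform random policy; the learning rates, β, and the trial count are illustrative.

import numpy as np

rng = np.random.default_rng(3)

# Assumed maze: from A, left -> B, right -> C; from B and C the maze exits with the listed rewards.
# next_state[s][a] gives (next location, or None if the maze is exited, and the immediate reward).
next_state = {
    "A": {"L": ("B", 0.0), "R": ("C", 0.0)},
    "B": {"L": (None, 0.0), "R": (None, 5.0)},
    "C": {"L": (None, 2.0), "R": (None, 0.0)},
}

v = {s: 0.0 for s in next_state}                    # critic: value of each location
m = {s: {"L": 0.0, "R": 0.0} for s in next_state}   # actor: action values
eps_v, eps_m, beta = 0.2, 0.2, 1.0

def policy(s):
    z = np.array([beta * m[s]["L"], beta * m[s]["R"]])
    p = np.exp(z - z.max()); p /= p.sum()            # softmax over left/right
    return {"L": p[0], "R": p[1]}

for trial in range(2000):
    s = "A"
    while s is not None:
        p = policy(s)
        a = "L" if rng.random() < p["L"] else "R"
        s_next, r = next_state[s][a]
        v_next = v[s_next] if s_next is not None else 0.0
        delta = r + v_next - v[s]                    # TD error
        v[s] += eps_v * delta                        # critic: policy evaluation
        for b in ("L", "R"):                         # actor: policy improvement (direct actor)
            m[s][b] += eps_m * delta * ((1.0 if b == a else 0.0) - p[b])
        s = s_next

print("values:", {s: round(v[s], 2) for s in v})
print("policy:", {s: {a: round(q, 2) for a, q in policy(s).items()} for s in m})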

Reinforcement learning - summary