CS 182/CogSci110/Ling109 Spring 2008 Reinforcement Learning: Algorithms 4/1/2008 Srini Narayanan – ICSI and UC Berkeley

Lecture Outline  Introduction  Basic Concepts  Expectation, Utility, MEU  Neural correlates of reward based learning  Utility theory from economics  Preferences, Utilities.  Reinforcement Learning: AI approach  The problem  Computing total expected value with discounting  Q-values, Bellman’s equation  TD-Learning

Reinforcement Learning  Basic idea:  Receive feedback in the form of rewards  Agent’s utility is defined by the reward function  Must learn to act so as to maximize expected utility  Change the rewards, change the behavior DEMO

Elements of RL  Transition model: how actions influence states  Reward R: the immediate value of a state-action transition  Policy π: maps states to actions  [Figure: the agent-environment loop; the agent sends an action to the environment, which returns a new state and a reward]

Markov Decision Processes  Markov decision processes (MDPs)  A set of states s ∈ S  A model T(s,a,s’) = P(s’ | s,a)  Probability that action a in state s leads to s’  A reward function R(s, a, s’) (sometimes just R(s) for leaving a state or R(s’) for entering one)  A start state (or distribution)  Maybe a terminal state  MDPs are the simplest case of reinforcement learning  In general reinforcement learning, we don’t know the model or the reward function
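
To make the later algorithm sketches concrete, here is one possible way (a hypothetical sketch, not from the slides) to store a tiny MDP's model T and reward function R as Python dictionaries. The two states, two actions, and all numbers below are made up purely for illustration.

# Hypothetical toy MDP used by the sketches that follow.
STATES = ["A", "B"]
ACTIONS = ["stay", "go"]

# T[(s, a)] lists (next_state, probability) pairs, i.e. T(s, a, s') = P(s' | s, a)
T = {
    ("A", "stay"): [("A", 1.0)],
    ("A", "go"):   [("B", 0.8), ("A", 0.2)],
    ("B", "stay"): [("B", 1.0)],
    ("B", "go"):   [("A", 1.0)],
}

# R[(s, a, s_next)] is the reward received on that transition
R = {
    ("A", "stay", "A"): 0.0,
    ("A", "go", "B"):   1.0,
    ("A", "go", "A"):   0.0,
    ("B", "stay", "B"): 0.5,
    ("B", "go", "A"):   0.0,
}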

Elements of RL  r(state, action): the immediate reward values  [Grid-world figure showing the immediate reward for each state-action pair; G marks the goal state]

Reward Sequences  In order to formalize optimality of a policy, we need to understand utilities of reward sequences  Typically we consider stationary preferences: if I prefer one state sequence starting today, I would prefer the same sequence starting tomorrow  Theorem: there are only two ways to define stationary utilities  Additive utility: U([s0, s1, s2, …]) = R(s0) + R(s1) + R(s2) + …  Discounted utility: U([s0, s1, s2, …]) = R(s0) + γ R(s1) + γ² R(s2) + …

Elements of RL  Value function: maps states to state values  Discount factor γ ∈ [0, 1) (here 0.9)  Vπ(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + …  [Grid-world figure showing the r(state, action) immediate reward values and the corresponding V*(state) values; G marks the goal state]

RL task (restated)  Execute actions in the environment, observe the results  Learn an action policy π : state → action that maximizes the expected discounted reward E[ r(t) + γ r(t+1) + γ² r(t+2) + … ] from any starting state in S

Hyperbolic discounting (Ainslie 1992)  Short-term rewards are treated differently from long-term rewards  Used in many animal discounting models  Has been used to explain procrastination and addiction  Evidence from neuroscience (next lecture)

MDP Solutions  In deterministic single-agent search, we want an optimal sequence of actions from the start to a goal  In an MDP we want an optimal policy π(s)  A policy gives an action for each state  An optimal policy maximizes expected utility (i.e. expected rewards) if followed  [Figure: the optimal grid-world policy when R(s, a, s’) is a fixed constant for all non-terminal states s]

Example Optimal Policies  [Figure: four grid-world optimal policies, for R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, and R(s) = -2.0]

Utility of a State  Define the utility of a state under a policy: Vπ(s) = expected total (discounted) rewards starting in s and following π  Recursive definition (one-step look-ahead): Vπ(s) = Σ_s’ T(s, π(s), s’) [ R(s, π(s), s’) + γ Vπ(s’) ]  Computing this is also called policy evaluation
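
A rough sketch of policy evaluation, assuming the dictionary-based toy MDP above; the function name, the default gamma, and the fixed iteration count are illustrative choices, not the slides' notation. It simply applies the one-step look-ahead repeatedly until the values settle.

def evaluate_policy(STATES, T, R, policy, gamma=0.9, iterations=100):
    # policy maps each state to the action it prescribes
    V = {s: 0.0 for s in STATES}                 # start with all-zero values
    for _ in range(iterations):
        new_V = {}
        for s in STATES:
            a = policy[s]
            # V_pi(s) = sum_s' T(s, a, s') [ R(s, a, s') + gamma * V_pi(s') ]
            new_V[s] = sum(p * (R[(s, a, s2)] + gamma * V[s2])
                           for s2, p in T[(s, a)])
        V = new_V
    return V

# e.g. evaluate_policy(STATES, T, R, {"A": "go", "B": "stay"})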

Bellman’s Equation for Selecting Actions  The definition of utility leads to a simple relationship amongst optimal utility values: optimal rewards = maximize over the first action and then follow the optimal policy  Formally, Bellman’s Equation: V*(s) = max_a Σ_s’ T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]  (“That’s my equation!”)

Q-values  The expected utility of taking a particular action a in a particular state s (the Q-value of the pair (s, a))  [Grid-world figure comparing the r(state, action) immediate reward values, the Q(state, action) values, and the V*(state) values; G marks the goal state]

Representation  Explicit: a table of Q-values, one per (state, action) pair, e.g. Q(state 2, MoveLeft) = 81, Q(state 2, MoveRight) = 100, …  Implicit: a weighted linear function / neural network with classical weight updating

Q-Functions  A table of values for each action: a Q-value is the value of a (state, action) pair under a policy π  Qπ(s, a) = the utility of starting in state s, taking action a, and then following π thereafter

The Bellman Equations  The definition of utility leads to a simple relationship amongst optimal utility values: optimal rewards = maximize over the first action and then follow the optimal policy  Formally: V*(s) = max_a Q*(s, a), where Q*(s, a) = Σ_s’ T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]

Optimal Utilities  Goal: calculate the optimal utility of each state V*(s) = expected (discounted) rewards with optimal actions  Why: Given optimal utilities, MEU tells us the optimal policy

MDP solution methods  If we know T(s, a, s’) and R(s, a, s’), then we can solve the MDP to find the optimal policy in a number of ways (dynamic programming / iterative estimation methods)  Value iteration: assume 0 initial values for each state and repeatedly update them using the Bellman equation, then read off the best actions  Policy iteration: evaluate the given policy (find V(s) for that policy), then improve it using Bellman updates, repeating until there is no further improvement in the policy

Value Iteration  Idea:  Start with bad guesses at all utility values (e.g. V_0(s) = 0)  Update all values simultaneously using the Bellman equation (called a value update or Bellman update): V_{k+1}(s) = max_a Σ_s’ T(s, a, s’) [ R(s, a, s’) + γ V_k(s’) ]  Repeat until convergence  Theorem: will converge to unique optimal values  Basic idea: bad guesses get refined towards optimal values  The policy may converge long before the values do
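
A minimal value-iteration sketch under the same assumptions (the hypothetical dictionary-based MDP; names and defaults are illustrative). Each sweep applies the Bellman update to every state and stops once the largest change falls below a tolerance.

def value_iteration(STATES, ACTIONS, T, R, gamma=0.9, tolerance=1e-6):
    V = {s: 0.0 for s in STATES}                 # bad initial guesses: all zeros
    while True:
        # Bellman update: V(s) <- max_a sum_s' T(s,a,s') [ R(s,a,s') + gamma V(s') ]
        new_V = {s: max(sum(p * (R[(s, a, s2)] + gamma * V[s2])
                            for s2, p in T[(s, a)])
                        for a in ACTIONS)
                 for s in STATES}
        if max(abs(new_V[s] - V[s]) for s in STATES) < tolerance:
            return new_V                         # values have (numerically) converged
        V = new_V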

Reinforcement Learning  Reinforcement learning:  We have an MDP:  A set of states s ∈ S  A set of actions (per state) A  A model T(s,a,s’)  A reward function R(s,a,s’)  We are looking for a policy π(s)  But we don’t know T or R  I.e. we don’t know which states are good or what the actions do  We must actually try out actions and states in order to learn

Example: Animal Learning  RL studied experimentally for more than 60 years in psychology  Rewards: food, pain, hunger, drugs, etc.  Mechanisms and sophistication debated  Example: foraging  Bees learn near-optimal foraging plan in field of artificial flowers with controlled nectar supplies  Bees have a direct neural connection from nectar intake measurement to motor planning area

Reinforcement Learning  The target function is π : state → action  However…  We have no training examples of the form <state, action>  Training examples are of the form <<state, action>, new-state, reward>

Passive Learning  Simplified task  You don’t know the transitions T(s,a,s’)  You don’t know the rewards R(s,a,s’)  You are given a policy π(s)  Goal: learn the state values (and maybe the model)  In this case:  No choice about what actions to take  Just execute the policy and learn from experience

Example: Direct Estimation (Simple Monte Carlo)  Episodes (γ = 1, R = -1 per step):  Episode 1: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right -1, (4,3) exit +100 (done)  Episode 2: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (4,2) exit -100 (done)  Averaging the observed returns: U(1,1) ≈ (…) / 2 = -7 and U(3,3) ≈ (…) / 3 = 31.3
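
A minimal sketch of direct estimation from episodes: for every visit to a state, record the discounted return that followed, then average. The episode format (a list of (state, reward) steps per episode) and the function name are assumptions made for illustration, not the slides' representation.

from collections import defaultdict

def direct_estimation(episodes, gamma=1.0):
    totals, counts = defaultdict(float), defaultdict(int)
    for episode in episodes:
        # Walk the episode backwards so G is the return following each step:
        # G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        for state, G in returns:
            totals[state] += G
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}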

Full Estimation (Dynamic Programming)  [Backup diagram: a full look-ahead tree over actions and successor states, down to terminal states T]

Simple Monte Carlo  [Backup diagram: follow one sampled trajectory all the way to a terminal state T and back up the actual return]

Combining DP and MC  [Backup diagram: sample a single step, then bootstrap from the estimated value of the sampled successor instead of continuing to a terminal state T]

Reinforcement Learning  The target function is π : state → action  However…  We have no training examples of the form <state, action>  Training examples are of the form <<state, action>, new-state, reward>

Model-Free Learning  Big idea: why bother learning T?  Update each time we experience a transition  Frequent outcomes will contribute more updates (over time)  Temporal difference learning (TD)  Policy still fixed!  Move values toward the value of whatever successor actually occurs  [Backup diagram: state s, chosen action a, and the observed successor s’]
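
A minimal sketch of the TD(0) value update for a fixed policy: after observing one transition from s to s' with reward r, move V(s) a fraction alpha of the way toward the sample r + gamma * V(s'). The dictionary-based V and the default alpha and gamma are illustrative assumptions.

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    sample = r + gamma * V.get(s_next, 0.0)      # one-step bootstrapped target
    V[s] = V.get(s, 0.0) + alpha * (sample - V.get(s, 0.0))
    return V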

TD Learning features  On-line, incremental  Bootstrapping (like DP, unlike MC)  Model free  Converges, for any fixed policy, to the correct value of each state under that policy:  on average when alpha is small  with probability 1 when alpha starts high and decays over time (e.g. alpha = 1/k)

Problems with TD Value Learning  TD value learning is model-free for policy evaluation  However, if we want to turn our value estimates into a policy, we’re sunk: picking the best action requires a one-step look-ahead, which needs T and R  Idea: learn state-action values (Q-values) directly  This makes action selection model-free too!  [Backup diagram: state s, chosen action a, and the observed successor s’]

Q-Learning  Learn Q*(s,a) values  Receive a sample (s, a, s’, r)  Consider your old estimate: Q(s, a)  Consider your new sample estimate: r + γ max_a’ Q(s’, a’)  Nudge the old estimate towards the new sample: Q(s, a) ← (1 − α) Q(s, a) + α [ r + γ max_a’ Q(s’, a’) ]
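
A minimal sketch of that update in code. The Q-table is assumed to be a dictionary keyed by (state, action), `actions` lists the actions available in s', and unseen entries default to 0; names and defaults are illustrative, not the slides' code.

def q_update(Q, s, a, s_next, r, actions, alpha=0.1, gamma=0.9):
    old = Q.get((s, a), 0.0)                              # old estimate
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    sample = r + gamma * best_next                        # new sample estimate
    Q[(s, a)] = (1 - alpha) * old + alpha * sample        # nudge toward the sample
    return Q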

Any problems with this?  What if the starting policy doesn’t let you explore the state space?  T(s,a,s’) is unknown and never estimated  The value of unexplored states is never computed  How do we address this problem?  It is a fundamental problem in RL and in biology  AI solutions include ε-greedy and softmax action selection  Evidence from neuroscience (next lecture)

Exploration / Exploitation  Several schemes for forcing exploration  Simplest: random actions (ε-greedy)  Every time step, flip a coin  With probability ε, act randomly  With probability 1 − ε, act according to the current policy (for instance, the action with the best Q-value)  Problems with random actions?  You do explore the space, but you keep thrashing around once learning is done  One solution: lower ε over time  Another solution: exploration functions
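
A minimal ε-greedy sketch under the same dictionary-based Q-table assumption; in practice ε would be lowered over time, as the slide notes.

import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                      # explore: random action
    return max(actions, key=lambda a: Q.get((s, a), 0.0))  # exploit: best current Q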

Q-Learning

Q-Learning features  On-line, incremental  Bootstrapping (like DP, unlike MC)  Model free  Converges to an optimal policy:  on average when alpha is small  with probability 1 when alpha starts high and decays over time (e.g. alpha = 1/k)

Reinforcement Learning  Basic idea:  Receive feedback in the form of rewards  Agent’s utility is defined by the reward function  Must learn to act so as to maximize expected utility  Change the rewards, change the behavior  Examples:  Learning your way around, reward for reaching the destination.  Playing a game, reward at the end for winning / losing  Vacuuming a house, reward for each piece of dirt picked up  Automated taxi, reward for each passenger delivered DEMO

Demo of Q-Learning  Demo: arm control  Parameters:  α (learning rate)  γ (discount factor; high values favor future rewards)  ε (exploration; should decrease with time)  MDP:  Reward = number of pixels moved to the right / iteration number  Actions: arm up and down (yellow line), hand up and down (red line)

Exploration Functions  When to explore  Random actions: explore a fixed amount  Better idea: explore areas whose badness is not (yet) established  Exploration function: takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u, n) = u + k/n (the exact form is not important)
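
A minimal sketch of action selection through an exploration function, assuming the f(u, n) = u + k/n form above (with n + 1 in the denominator so untried actions do not divide by zero). N counts how often each (state, action) pair has been tried; all names are illustrative.

def explore_greedy(Q, N, s, actions, k=1.0):
    def f(u, n):
        # Optimistic utility: value estimate plus a bonus for rarely tried actions
        return u + k / (n + 1.0)
    return max(actions, key=lambda a: f(Q.get((s, a), 0.0), N.get((s, a), 0)))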