CS 188: Artificial Intelligence Fall 2009 Lecture 10: MDPs 9/29/2009 Dan Klein – UC Berkeley Many slides over the course adapted from either Stuart Russell or Andrew Moore
Announcements P2: Due Wednesday P3: MDPs and Reinforcement Learning is up! W2: Out late this week
Recap: MDPs Markov decision processes: States S Actions A Transitions P(s’|s,a) (or T(s,a,s’)) Rewards R(s,a,s’) (and discount γ) Start state s_0 Quantities: Policy = map of states to actions Episode = one run of an MDP Utility = sum of discounted rewards Values = expected future utility from a state Q-Values = expected future utility from a q-state [DEMO – MDP Quantities]
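As a concrete illustration (not from the slides), a small MDP can be written down directly in Python; the states, actions, probabilities, and rewards below are made up for the sketch:

# A tiny made-up MDP. T[(s, a)] lists (next_state, probability) pairs,
# R maps (s, a, s') triples to rewards, and gamma is the discount.
gamma = 0.9
states = ["cool", "warm"]
actions = ["slow", "fast"]

T = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "slow"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "fast"): [("warm", 1.0)],
}

R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "warm"): 2.0,
    ("warm", "slow", "cool"): 1.0,
    ("warm", "slow", "warm"): 1.0,
    ("warm", "fast", "warm"): -10.0,
}

Later sketches in these notes assume these structures (or ones shaped like them).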
Recap: Optimal Utilities The utility of a state s: V*(s) = expected utility starting in s and acting optimally The utility of a q-state (s,a): Q*(s,a) = expected utility starting in s, taking action a, and thereafter acting optimally The optimal policy: π*(s) = optimal action from state s (In the lookahead tree: s is a state, (s,a) is a q-state, (s,a,s’) is a transition)
Recap: Bellman Equations Definition of utility leads to a simple one-step lookahead relationship amongst optimal utility values: Total optimal rewards = maximize over choice of (first action plus optimal future) Formally: V*(s) = max_a Q*(s,a), where Q*(s,a) = Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V*(s’) ]
Practice: Computing Actions Which action should we choose from state s: Given optimal values V? π*(s) = argmax_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V*(s’) ] Given optimal q-values Q? π*(s) = argmax_a Q*(s,a) Lesson: actions are easier to select from Q’s! [DEMO – MDP action selection]
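A small sketch (not from the slides) of the contrast: selecting an action from values needs the model T and R for a one-step lookahead, while selecting from q-values is a bare argmax. The function and variable names are assumptions, matching the made-up MDP above.

def action_from_values(s, actions, T, R, V, gamma):
    # One-step lookahead: requires the transition model and rewards.
    def q(a):
        return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
    return max(actions, key=q)

def action_from_qvalues(s, actions, Q):
    # No model needed: just compare stored Q-values.
    return max(actions, key=lambda a: Q[(s, a)])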
Value Estimates Calculate estimates V_k*(s) Not the optimal value of s! The optimal value considering only the next k time steps (k rewards) As k → ∞, it approaches the optimal value Almost solution: recursion (i.e. expectimax) Correct solution: dynamic programming [DEMO – V_k]
Memoized Recursion? Recurrences (basically truncated expectimax): V_0*(s) = 0 V_k*(s) = max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V_{k-1}*(s’) ] Cache all function call results so you never repeat work What happened to the evaluation function?
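A sketch of that memoized recursion (assuming the made-up structures above); lru_cache does the caching so repeated (state, depth) calls are not recomputed:

from functools import lru_cache

def make_depth_limited_values(states, actions, T, R, gamma):
    @lru_cache(maxsize=None)
    def V(s, k):
        # Truncated expectimax: with 0 steps left, the value is 0.
        if k == 0:
            return 0.0
        return max(
            sum(p * (R[(s, a, s2)] + gamma * V(s2, k - 1)) for s2, p in T[(s, a)])
            for a in actions
        )
    return V

# e.g. make_depth_limited_values(states, actions, T, R, gamma)("cool", 10)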
Value Iteration Problems with the recursive computation: Have to keep all the V_k*(s) around all the time Don’t know which depth π_k(s) to ask for when planning Solution: value iteration Calculate values for all states, bottom-up Keep increasing k until convergence
Value Iteration Idea: Start with V_0*(s) = 0, which we know is right (why?) Given V_i*, calculate the values for all states for depth i+1: V_{i+1}*(s) = max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V_i*(s’) ] Throw out old vector V_i* Repeat until convergence This is called a value update or Bellman update Theorem: will converge to unique optimal values Basic idea: approximations get refined towards optimal values Policy may converge long before values do
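A minimal value iteration sketch (again assuming the made-up T, R, gamma, states, actions from above); it stops when the largest change across states falls below a tolerance:

def value_iteration(states, actions, T, R, gamma, tol=1e-6):
    V = {s: 0.0 for s in states}  # V_0 = 0 everywhere
    while True:
        new_V = {
            s: max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
                for a in actions
            )
            for s in states
        }
        # Max-norm of the change; stop once it is tiny.
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V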
Convergence* Define the max-norm: ||V|| = max_s |V(s)| Theorem: for any two approximations U and V, ||U_{i+1} − V_{i+1}|| ≤ γ ||U_i − V_i|| I.e. any two distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true utilities, and value iteration converges to a unique, stable, optimal solution Theorem: if ||U_{i+1} − U_i|| < ε(1−γ)/γ, then ||U_{i+1} − U|| < ε I.e. once the change in our approximation is small, it must also be close to correct
Utilities for a Fixed Policy Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy Define the utility of a state s, under a fixed policy π: V^π(s) = expected total discounted rewards (return) starting in s and following π Recursive relation (one-step lookahead / Bellman equation): V^π(s) = Σ_s’ T(s,π(s),s’) [ R(s,π(s),s’) + γ V^π(s’) ] [DEMO – Right-Only Policy]
Policy Evaluation How do we calculate the V’s for a fixed policy π? Idea one: turn the recursive equation into updates: V_{i+1}^π(s) = Σ_s’ T(s,π(s),s’) [ R(s,π(s),s’) + γ V_i^π(s’) ] Idea two: it’s just a linear system, solve with Matlab (or whatever)
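Both ideas can be sketched in a few lines (assuming the made-up structures above, with pi a dict mapping each state to its fixed action); numpy's linear solver stands in for the "Matlab (or whatever)" option:

import numpy as np

def evaluate_policy_iteratively(states, T, R, gamma, pi, tol=1e-6):
    # Idea one: repeat the simplified Bellman update until it stops changing.
    V = {s: 0.0 for s in states}
    while True:
        new_V = {
            s: sum(p * (R[(s, pi[s], s2)] + gamma * V[s2]) for s2, p in T[(s, pi[s])])
            for s in states
        }
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V

def evaluate_policy_linear(states, T, R, gamma, pi):
    # Idea two: V^pi = R_pi + gamma * T_pi * V^pi is linear, so solve (I - gamma T_pi) V = R_pi.
    idx = {s: i for i, s in enumerate(states)}
    A = np.eye(len(states))
    b = np.zeros(len(states))
    for s in states:
        for s2, p in T[(s, pi[s])]:
            A[idx[s], idx[s2]] -= gamma * p
            b[idx[s]] += p * R[(s, pi[s], s2)]
    v = np.linalg.solve(A, b)
    return {s: v[idx[s]] for s in states}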
Policy Iteration Alternative approach: Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence Step 2: Policy improvement: update policy using one-step lookahead with resulting converged (but not optimal!) utilities as future values Repeat steps until policy converges This is policy iteration It’s still optimal! Can converge faster under some conditions
Policy Iteration Policy evaluation: with fixed current policy π, find values with simplified Bellman updates: V_{i+1}^π(s) = Σ_s’ T(s,π(s),s’) [ R(s,π(s),s’) + γ V_i^π(s’) ] Iterate until values converge Policy improvement: with fixed utilities, find the best action according to one-step lookahead: π_{new}(s) = argmax_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V^π(s’) ]
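Putting the two steps together gives a short policy iteration sketch (assuming the made-up structures above and the evaluate_policy_iteratively helper sketched earlier):

def policy_iteration(states, actions, T, R, gamma):
    pi = {s: actions[0] for s in states}  # arbitrary initial policy
    while True:
        # Step 1: evaluate the current (fixed, possibly non-optimal) policy.
        V = evaluate_policy_iteratively(states, T, R, gamma, pi)
        # Step 2: improve greedily with one-step lookahead on those values.
        new_pi = {
            s: max(
                actions,
                key=lambda a: sum(
                    p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)]
                ),
            )
            for s in states
        }
        if new_pi == pi:  # policy stopped changing, so we are done
            return pi, V
        pi = new_pi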
Comparison Both compute the same thing (optimal values for all states) In value iteration: Every pass (or “backup”) updates both utilities (explicitly, based on current utilities) and policy (implicitly, based on current utilities) Tracking the policy isn’t necessary; we take the max In policy iteration: Several passes to update utilities with fixed policy After policy is evaluated, a new policy is chosen Together, these are dynamic programming for MDPs
Asynchronous Value Iteration* In value iteration, we update every state in each iteration Actually, any sequence of Bellman updates will converge if every state is visited infinitely often In fact, we can update the policy as seldom or often as we like, and we will still converge Idea: Update states whose value we expect to change: If |V_{i+1}(s) − V_i(s)| is large, then update the predecessors of s
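A sketch of the simplest asynchronous variant (assuming the made-up structures above): update one randomly chosen state at a time, in place; picking states at random visits every state infinitely often in the limit, which is what the convergence condition asks for.

import random

def async_value_iteration(states, actions, T, R, gamma, updates=10000):
    V = {s: 0.0 for s in states}
    for _ in range(updates):
        s = random.choice(states)  # any schedule that keeps hitting every state works
        V[s] = max(
            sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
            for a in actions
        )
    return V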
Reinforcement Learning Reinforcement learning: Still have an MDP: A set of states s ∈ S A set of actions (per state) A A model T(s,a,s’) A reward function R(s,a,s’) Still looking for a policy π(s) New twist: don’t know T or R I.e. don’t know which states are good or what the actions do Must actually try actions and states out to learn [DEMO]
Example: Animal Learning RL studied experimentally for more than 60 years in psychology Rewards: food, pain, hunger, drugs, etc. Mechanisms and sophistication debated Example: foraging Bees learn near-optimal foraging plan in field of artificial flowers with controlled nectar supplies Bees have a direct neural connection from nectar intake measurement to motor planning area
Example: Backgammon Reward only for win / loss in terminal states, zero otherwise TD-Gammon learns a function approximation to V(s) using a neural network Combined with depth 3 search, one of the top 3 players in the world You could imagine training Pacman this way… … but it’s tricky! (It’s also P3)
Passive Learning Simplified task You don’t know the transitions T(s,a,s’) You don’t know the rewards R(s,a,s’) You are given a policy π(s) Goal: learn the state values (and maybe the model) I.e., policy evaluation In this case: Learner “along for the ride” No choice about what actions to take Just execute the policy and learn from experience We’ll get to the active case soon This is NOT offline planning!
Example: Direct Estimation Episodes (γ = 1, with a reward of -1 per step shown next to each transition): Episode 1: (1,1) up -1, (1,2) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right -1, (4,3) exit +100, (done) Episode 2: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (4,2) exit -100, (done) Estimates from averaging observed returns: V(1,1) ≈ (92 + -106) / 2 = -7 V(3,3) ≈ (97 + 99 + -102) / 3 = 31.3 [DEMO – Optimal Policy]
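A sketch of direct estimation in code (the episode format, lists of (state, reward) pairs per step, is my own choice, not from the slides): average the observed discounted return following every visit to each state.

from collections import defaultdict

def direct_estimate(episodes, gamma=1.0):
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        # Returns from each step, computed by summing rewards backwards.
        G = 0.0
        returns = []
        for _, reward in reversed(episode):
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()
        for (state, _), G in zip(episode, returns):
            totals[state] += G
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

Feeding in the two episodes above as (state, reward) pairs should reproduce the averages on the slide, e.g. V(3,3) ≈ 31.3.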
Model-Based Learning Idea: Learn the model empirically (rather than values) Solve the MDP as if the learned model were correct Better than direct estimation? Empirical model learning Simplest case: Count outcomes for each s,a Normalize to give estimate of T(s,a,s’) Discover R(s,a,s’) the first time we experience (s,a,s’) More complex learners are possible (e.g. if we know that all squares have related action outcomes, e.g. “stationary noise”)
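A sketch of that simplest empirical model learner (the (s, a, r, s’) transition format and names are my own, not from the slides): count outcomes per (s, a), normalize to get the estimate T_hat, and record R_hat the first time each (s, a, s’) is seen.

from collections import defaultdict

def learn_model(transitions):
    counts = defaultdict(lambda: defaultdict(int))
    R_hat = {}
    for s, a, r, s2 in transitions:
        counts[(s, a)][s2] += 1
        R_hat.setdefault((s, a, s2), r)  # first experience of (s, a, s') fixes the reward
    T_hat = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        T_hat[(s, a)] = [(s2, n / total) for s2, n in outcomes.items()]
    return T_hat, R_hat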
Example: Model-Based Learning Episodes (γ = 1): Episode 1: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right -1, (4,3) exit +100, (done) Episode 2: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (4,2) exit -100, (done) Estimated model: T((3,3), right, (4,3)) = 1 / 3 T((3,3), right, (3,2)) = 2 / 3
Recap: Model-Based Policy Evaluation Simplified Bellman updates to calculate V for a fixed policy π: V_{i+1}^π(s) ← Σ_s’ T(s,π(s),s’) [ R(s,π(s),s’) + γ V_i^π(s’) ] New V is expected one-step lookahead using current V Unfortunately, need T and R
Sample Avg to Replace Expectation? Who needs T and R? Approximate the expectation with samples (drawn from T!): sample_1 = R(s,π(s),s_1’) + γ V_i^π(s_1’) sample_2 = R(s,π(s),s_2’) + γ V_i^π(s_2’) … V_{i+1}^π(s) ← (1/k) Σ_i sample_i
Model-Free Learning Big idea: why bother learning T? Update V each time we experience a transition Frequent outcomes will contribute more updates (over time) Temporal difference learning (TD) Policy still fixed! Move values toward value of whatever successor occurs: running average! sample = R(s,π(s),s’) + γ V^π(s’) V^π(s) ← (1−α) V^π(s) + α · sample
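The whole TD update fits in a couple of lines (a sketch; the function and argument names are my own, with V a dict of value estimates and alpha the learning rate):

def td_update(V, s, r, s_next, alpha, gamma):
    sample = r + gamma * V[s_next]              # value of what actually happened
    V[s] = (1 - alpha) * V[s] + alpha * sample  # running (exponential) average
    return V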
Example: TD Policy Evaluation Take γ = 1, α = 0.5 Episode 1: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right -1, (4,3) exit +100, (done) Episode 2: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (4,2) exit -100, (done)
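Assuming all value estimates start at 0 (an assumption; the slide does not say), the first update of Episode 1 works out as: for the transition (1,1) up → (1,2) with reward -1, sample = -1 + γ·V(1,2) = -1 + 0 = -1, so V(1,1) ← (1−0.5)·0 + 0.5·(−1) = −0.5. Each later step similarly nudges the visited state toward −0.5, and the final exit (treating the terminal value as 0) gives V(4,3) ← 0.5·0 + 0.5·100 = 50; the +100 only propagates back to earlier states on later passes through them.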
Problems with TD Value Learning TD value learning is model-free for policy evaluation (passive learning) However, if we want to turn our value estimates into a policy, we’re sunk: π(s) = argmax_a Q(s,a), where Q(s,a) = Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V(s’) ] still needs T and R Idea: learn Q-values directly Makes action selection model-free too!