
Q

The policy iteration alg.

Function: policy_iteration
Input: MDP M = ⟨S, A, T, R⟩, discount γ
Output: optimal policy π*; optimal value function V*
Initialization: choose π_0 arbitrarily
Repeat {
  V_i = eval_policy(M, π_i, γ)              // from Bellman eqn
  π_{i+1} = local_update_policy(π_i, V_i)
} Until (π_{i+1} == π_i)

Function: π' = local_update_policy(π, V)
for i = 1..|S| {
  π'(s_i) = argmax_{a ∈ A}( sum_j( T(s_i, a, s_j) * V(s_j) ) )
}
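For concreteness, here is a minimal Python sketch of the same loop, assuming a tabular MDP where T is a dict mapping (state, action) to a dict of {next_state: probability}, R is a dict of per-state rewards, and S and A are lists; the helper names mirror the pseudocode but are otherwise illustrative.

def eval_policy(S, T, R, policy, gamma, tol=1e-8):
    """Iterative policy evaluation: solve V(s) = R(s) + gamma * sum_s' T(s, pi(s), s') * V(s')."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            v_new = R[s] + gamma * sum(p * V[s2] for s2, p in T[(s, policy[s])].items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

def local_update_policy(S, A, T, V):
    """Greedy one-step improvement: pi'(s) = argmax_a sum_s' T(s, a, s') * V(s')."""
    return {s: max(A, key=lambda a: sum(p * V[s2] for s2, p in T[(s, a)].items()))
            for s in S}

def policy_iteration(S, A, T, R, gamma):
    policy = {s: A[0] for s in S}                 # choose pi_0 arbitrarily
    while True:
        V = eval_policy(S, T, R, policy, gamma)
        new_policy = local_update_policy(S, A, T, V)
        if new_policy == policy:                  # converged: pi_{i+1} == pi_i
            return new_policy, V
        policy = new_policy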

Q: A key operation

Critical step in policy iteration:
  π'(s_i) = argmax_{a ∈ A}( sum_j( T(s_i, a, s_j) * V(s_j) ) )
Asks: “What happens if I ignore π for just one step, do a instead, and then resume doing π thereafter?”
Alternatively: regardless of the current π, what is the best action a I could pick for the next timestep (greedily)?

Q: A key operation

This is such a commonly used operation that it gets a special name.
Definition: the Q function is the one-step-lookahead value of taking action a in state s:
  Q(s, a) = R(s) + γ * sum_s'( T(s, a, s') * V(s') )
Policy iteration says: “Figure out Q, act greedily according to Q, then update Q and repeat, until you can’t do any better...”

What to do with Q

Can think of Q as a big table: one entry for each state/action pair.
“If I’m in state s and take action a, this is my expected discounted reward...”
A “one-step” exploration: “In state s, if I deviate from my policy π for one timestep, then keep doing π, is my life better or worse?”
Can get V and π back from Q:
  V(s) = max_a( Q(s, a) )
  π(s) = argmax_a( Q(s, a) )
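As a small sketch of that bookkeeping (assuming Q is stored as a Python dict keyed by (state, action); the function names are illustrative):

def greedy_value(Q, s, A):
    # V(s) = max_a Q(s, a)
    return max(Q[(s, a)] for a in A)

def greedy_action(Q, s, A):
    # pi(s) = argmax_a Q(s, a)
    return max(A, key=lambda a: Q[(s, a)])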

Policy iteration, restated

Function: policy_iteration
Input: MDP M = ⟨S, A, T, R⟩, discount γ
Output: optimal policy π*; optimal value function V*
Initialization: choose π_0 arbitrarily
Repeat {
  Q_i = eval_policy(M, π_i, γ)              // from Bellman eqn
  π_{i+1} = local_update_policy(π_i, Q_i)
} Until (π_{i+1} == π_i)

Function: π' = local_update_policy(π, Q)
for i = 1..|S| {
  π'(s_i) = argmax_{a ∈ A}( Q(s_i, a) )
}
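Relative to the earlier sketch, only the evaluation output and the improvement step change; a possible Q-based version, reusing the hypothetical eval_policy from above:

def eval_policy_q(S, A, T, R, policy, gamma, tol=1e-8):
    """Q^pi(s, a) = R(s) + gamma * sum_s' T(s, a, s') * V^pi(s')."""
    V = eval_policy(S, T, R, policy, gamma, tol)   # reuses the earlier sketch
    return {(s, a): R[s] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
            for s in S for a in A}

def local_update_policy_q(S, A, Q):
    """Greedy improvement read straight off the Q table."""
    return {s: max(A, key=lambda a: Q[(s, a)]) for s in S}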

Learning with Q

Q and the notion of policy evaluation give us a nice way to do actual learning: use a Q table to represent the policy, and update Q through experience. Every time you see a (s, a, r, s') tuple, update Q.

Learning with Q

Each example (s, a, r, s') is a sample from T(s, a, s') and from R. With enough samples, you can get a good idea of how the world works, where the reward is, etc. Note: you never actually learn T or R; Q encodes everything you need to know about the world.

The Q-learning algorithm

Algorithm: Q_learn
Inputs: state space S; action space A;
        discount γ (0 <= γ < 1); learning rate α (0 <= α < 1)
Outputs: Q
Repeat {
  s = get_current_world_state()
  a = pick_next_action(Q, s)
  (r, s') = act_in_world(a)
  Q(s, a) = Q(s, a) + α*( r + γ*max_a'( Q(s', a') ) − Q(s, a) )
} Until (bored)
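A minimal tabular sketch of this algorithm, under some assumptions the slide leaves open: the environment is a hypothetical object with reset() and step(a) → (reward, next_state, done) methods, pick_next_action is taken to be ε-greedy (one common choice, not specified here), “Until (bored)” becomes a fixed episode budget, and an optional Q argument lets learning resume from an existing table.

import random

def q_learn(env, S, A, gamma, alpha, episodes, epsilon=0.1, Q=None):
    """Tabular Q-learning; the episode budget stands in for 'Until (bored)'."""
    if Q is None:
        Q = {(s, a): 0.0 for s in S for a in A}
    for _ in range(episodes):
        s = env.reset()                              # hypothetical: start a new episode
        done = False
        while not done:
            # pick_next_action: epsilon-greedy with respect to the current Q table
            if random.random() < epsilon:
                a = random.choice(A)
            else:
                a = max(A, key=lambda act: Q[(s, act)])
            r, s2, done = env.step(a)                # hypothetical: act_in_world
            target = r + gamma * max(Q[(s2, act)] for act in A)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q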

Q-learning in action

15x15 maze world; R(goal) = 1, R(other) = 0; γ = 0.9, α = 0.65.
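Running the sketch above on such a maze might look like this, where MazeEnv, env.states, and env.actions are assumed wrappers rather than anything defined in the slides:

# Hypothetical wrapper for the 15x15 maze; states/actions/reset/step are assumed.
env = MazeEnv(width=15, height=15)                   # R(goal) = 1, R(other) = 0
Q = q_learn(env, S=env.states, A=env.actions,
            gamma=0.9, alpha=0.65, episodes=400)
policy = {s: max(env.actions, key=lambda a: Q[(s, a)]) for s in env.states}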

Q-learning in action

[Figures: the learned maze policy, shown initially and after 20, 30, 100, 150, 200, 250, 300, 350, and 400 episodes of learning.]

Well, it looks good anyway

But are we sure it’s actually learning? How do we measure whether it’s actually getting any better at the task (finding the goal state)?
Every 10 episodes, “freeze” the policy (turn off learning).
Measure the average time to goal from a number of starting states.
Average over a number of test episodes to iron out noise.
Plot the learning curve: # episodes of learning vs. average performance (a sketch of this protocol follows below).
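A hedged sketch of that measurement protocol, built on the q_learn sketch above; reset_to and start_states (a list of test start states) are assumptions, not part of the slides:

def avg_steps_to_goal(env, Q, A, start_states, max_steps=1000):
    """Run the frozen greedy policy (no learning, no exploration); average path length to goal."""
    total = 0
    for s0 in start_states:
        s, steps, done = env.reset_to(s0), 0, False  # hypothetical reset_to(state)
        while not done and steps < max_steps:
            a = max(A, key=lambda act: Q[(s, act)])  # greedy; learning is switched off
            _, s, done = env.step(a)
            steps += 1
        total += steps
    return total / len(start_states)

# Alternate 10 learning episodes with one frozen evaluation to trace the curve.
Q, curve = None, []
for block in range(40):
    Q = q_learn(env, S=env.states, A=env.actions,
                gamma=0.9, alpha=0.65, episodes=10, Q=Q)
    curve.append((10 * (block + 1),
                  avg_steps_to_goal(env, Q, env.actions, start_states)))

Carrying the Q table across blocks is what makes each evaluation reflect cumulative learning; the resulting (episodes, average-steps) pairs are what a learning curve like the one on the next slide plots.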

Learning performance

Notes on learning performance

After 400 learning episodes, performance still hasn’t asymptoted. Note: that’s ~700,000 steps of experience! Q-learning is really, really slow, and the same (sadly) holds for many RL methods. Fixing this is a good research topic... ;-)

Why does this work?

There are multiple ways to think about it. The (more nearly) intuitive one: look at the key update step in the Q-learning algorithm:
  Q(s, a) ← Q(s, a) + α*( r + γ*max_a'( Q(s', a') ) − Q(s, a) )
          = (1 − α)*Q(s, a) + α*( r + γ*max_a'( Q(s', a') ) )
I.e., a weighted average between the current Q(s, a) and the sampled target r + γ*max_a'( Q(s', a') ).
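A worked one-step example with made-up numbers (α = 0.65, γ = 0.9, current estimate Q(s,a) = 2.0, observed reward r = 0, max_a' Q(s',a') = 3.0) showing that the two forms of the update agree:

alpha, gamma = 0.65, 0.9
q_sa, r, max_q_next = 2.0, 0.0, 3.0                  # made-up current estimate and sample
target = r + gamma * max_q_next                      # 2.7
q_new = (1 - alpha) * q_sa + alpha * target          # 0.35*2.0 + 0.65*2.7 = 2.455
assert abs(q_new - (q_sa + alpha * (target - q_sa))) < 1e-12   # same as the slide's update form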

Why does this work?

Still... why should that weighted average be the right thing? Compare the update equation with the Bellman equation for Q:
  Update:  Q(s, a) ← (1 − α)*Q(s, a) + α*( r + γ*max_a'( Q(s', a') ) )
  Bellman: Q(s, a) = R(s) + γ*sum_s'( T(s, a, s') * max_a'( Q(s', a') ) )
The update is a sampled version of the Bellman equation: each (s, a, r, s') experience replaces the expectation over s' with a single observed next state, and the learning rate α averages those samples over time.
