CS 541: Artificial Intelligence Lecture XI: Reinforcement Learning Slides Credit: Peter Norvig and Sebastian Thrun.

Schedule
 TA: Vishakha Sharma
 Week 1 (08/30/2011): Introduction & Intelligent Agent. Reading: Ch 1 & 2. Homework: N/A
 Week 2 (09/06/2011): Search: search strategy and heuristic search. Reading: Ch 3 & 4s. Homework: HW1 (Search)
 Week 3 (09/13/2011): Search: Constraint Satisfaction & Adversarial Search. Reading: Ch 4s, 5 & 6. Homework: Teaming due
 Week 4 (09/20/2011): Logic: Logic Agent & First Order Logic. Reading: Ch 7 & 8s. Homework: HW1 due, Midterm Project (Game)
 Week 5 (09/27/2011): Logic: Inference on First Order Logic. Reading: Ch 8s & 9
 Week 6 (10/04/2011): Uncertainty and Bayesian Network. Reading: Ch 13 & 14s. Homework: HW2 (Logic)
 10/11/2011: No class (functions as a Monday)
 Week 7 (10/18/2011): Midterm Presentation. Midterm Project due
 Week 8 (10/25/2011): Inference in Bayesian Network. Reading: Ch 14s. Homework: HW2 due, HW3 (Probabilistic Reasoning)
 Week 9 (11/01/2011): Probabilistic Reasoning over Time
 Week 10 (11/08/2011): TA Session: Review of HW1 & 2. Homework: HW3 due, HW4 (MDP)
 Week 11 (11/15/2011): Supervised Machine Learning
 Week 12 (11/22/2011): Markov Decision Process
 Week 13 (11/29/2011): Reinforcement Learning. Reading: Ch 21. Homework: HW4 due
 Week 14 (12/06/2011): Final Project Presentation. Final Project due

Recap of Lecture X  Planning under uncertainty  Planning with Markov Decision Processes  Utilities and Value Iteration  Partially Observable Markov Decision Processes

Review of MDPs

Review: Markov Decision Processes  An MDP is defined by:  a set of states s ∈ S  a set of actions a ∈ A  a transition function P(s′ | s, a), a.k.a. T(s, a, s′)  a reward function R(s, a, s′), or R(s, a), or R(s′)  Problem:  Find a policy π(s) to maximize the expected sum of rewards ∑_t R(s_t, π(s_t), s_{t+1})

Value Iteration  Idea:  Start with U_0(s) = 0  Given U_i(s), calculate the Bellman update: U_{i+1}(s) ← max_a ∑_{s′} P(s′ | s, a) [ R(s, a, s′) + γ U_i(s′) ]  Repeat until convergence  Theorem: converges to the unique optimal utilities  But: requires a model (R and P)
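To make the update concrete, here is a minimal value-iteration sketch in Python. The two-state MDP, its action names, and the convergence tolerance are invented purely for illustration; only the update rule itself comes from the slide above.

```python
# Minimal value-iteration sketch (illustrative toy MDP, not from the lecture).
# Transition model: P[s][a] = list of (probability, next_state).
# Reward model:     R[(s, a, s2)] = immediate reward, default 0.
P = {
    "A": {"stay": [(1.0, "A")], "go": [(0.8, "B"), (0.2, "A")]},
    "B": {"stay": [(1.0, "B")], "go": [(1.0, "A")]},
}
R = {("A", "go", "B"): 1.0, ("B", "stay", "B"): 0.5}
gamma = 0.9

def value_iteration(P, R, gamma, tol=1e-6):
    U = {s: 0.0 for s in P}                       # U_0(s) = 0
    while True:
        U_new = {}
        for s in P:
            # Bellman update:
            # U_{i+1}(s) = max_a sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * U_i(s')]
            U_new[s] = max(
                sum(p * (R.get((s, a, s2), 0.0) + gamma * U[s2]) for p, s2 in outcomes)
                for a, outcomes in P[s].items()
            )
        if max(abs(U_new[s] - U[s]) for s in P) < tol:   # repeat until convergence
            return U_new
        U = U_new

print(value_iteration(P, R, gamma))
```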

What is reinforcement learning?

Reinforcement Learning  What if the state transition function P and the reward function R are unknown?

Example: Backgammon  Reward only for win / loss in terminal states, zero otherwise  TD-Gammon (Tesauro, 1992) learns a function approximation to U(s) using a neural network  Combined with depth-3 search, it ranked among the top 3 players in the world: good at positional judgment, but it makes mistakes in the endgame, beyond depth 3

Example: Animal Learning  RL has been studied experimentally for more than 60 years in psychology  Rewards: food, pain, hunger, drugs, etc.  Mechanisms and sophistication debated  Example: foraging  Bees learn a near-optimal foraging plan in a field of artificial flowers with controlled nectar supplies  Bees have a direct neural connection from the nectar intake measurement to the motor planning area

What to Learn  If you do not know P and R, what you learn depends on what you want to do:  Utility-based agent: learn U(s)  Q-learning agent: learn the action-utility function Q(s, a)  Reflex agent: learn the policy π(s)  ... and on how patient/reckless you are:  Passive: use a fixed policy π  Active: modify the policy as you go

Passive Temporal-Difference Learning

Passive Temporal-Difference Learning  Follow a fixed policy π and, after each observed transition from s to s′, update the utility estimate: U^π(s) ← U^π(s) + α [ R(s) + γ U^π(s′) − U^π(s) ]  α is the learning rate; each update nudges U^π(s) toward the observed sample R(s) + γ U^π(s′)
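As a concrete illustration of this update, here is a small passive TD(0) sketch. The three-state chain, its rewards, the fixed policy, and the 80%-success transition model are all invented for the example; they are not the environment used on the slides.

```python
import random

# Passive TD(0) sketch: follow a fixed policy and learn U from observed
# transitions. The tiny chain world below is a made-up illustration.
STATES = ["s0", "s1", "s2"]                 # s2 is terminal
REWARD = {"s0": -0.04, "s1": -0.04, "s2": 1.0}
POLICY = {"s0": "right", "s1": "right"}     # the fixed policy being evaluated

def step(s, a):
    """Hypothetical stochastic dynamics: the intended move succeeds 80% of the time."""
    if a == "right" and random.random() < 0.8:
        return STATES[STATES.index(s) + 1]
    return s

def passive_td(episodes=5000, alpha=0.05, gamma=0.9):
    U = {s: 0.0 for s in STATES}
    for _ in range(episodes):
        s = "s0"
        while True:
            if s == "s2":                   # terminal: its utility is just its reward
                U[s] = REWARD[s]
                break
            s_next = step(s, POLICY[s])
            # TD update: U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s))
            U[s] += alpha * (REWARD[s] + gamma * U[s_next] - U[s])
            s = s_next
    return U

print(passive_td())
```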

Example  (a worked TD-update example, stepped through over several slides; the figures are not included in this transcript)

Sample results

Problems in passive learning

Problem  The problem: the algorithm computes the value of one specific policy  It may never try the best moves  It may never visit some states  It may waste time computing unneeded utilities

Quiz  Is there a fix?  Can we explore different policies (active reinforcement learning), yet still learn the optimal policy?

Active reinforcement learning

Greedy TD-Agent  Learn U with TD-learning  After each change to U, solve the MDP to compute the new optimal policy π(s)  Continue learning with the new policy

Greedy Agent Results  Converges to the wrong policy  Too much exploitation, not enough exploration  Fix: assign a higher U(s) to unexplored states
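One standard way to implement "higher U(s) for unexplored states" is an exploration function that is optimistic about rarely tried choices. The sketch below is only illustrative: the values R_PLUS and N_E are arbitrary, and the helper names are mine, not the lecture's.

```python
# Exploration-function sketch: be optimistic about (state, action) pairs tried
# fewer than N_E times. R_PLUS is an optimistic guess at the best achievable utility.
R_PLUS = 2.0
N_E = 5

def exploration_value(u, n):
    """Return an optimistic value for under-explored choices, otherwise the estimate u."""
    return R_PLUS if n < N_E else u

def choose_action(U_est, visit_count, s, actions):
    """Pick the action with the best exploration-adjusted estimate."""
    return max(actions,
               key=lambda a: exploration_value(U_est.get((s, a), 0.0),
                                               visit_count.get((s, a), 0)))
```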

Exploratory Agent  Converges to the correct policy

Alternative: Q-Learning  Instead of learning U(s), learn Q(s, a): the utility of taking action a in state s  With utilities, you need a model P to act: π(s) = argmax_a ∑_{s′} P(s′ | s, a) U(s′)  With Q-values, no model is needed: π(s) = argmax_a Q(s, a)  Compute the Q-values with a TD/exploration algorithm

Q-Learning  After each transition from s to s′ via action a, with observed reward r, update: Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]  (the derivation and worked examples span several slides; the figures are not included in this transcript)

Q-Learning  Q-learning produces a table of Q-values, one entry per (state, action) pair
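Here is a compact tabular Q-learning sketch with ε-greedy exploration, matching the update above. The environment interface (reset / actions / step) and the little chain world are assumptions made so the example runs; they are not the environment from the lecture.

```python
import random
from collections import defaultdict

class ChainEnv:
    """Made-up chain world: states 0..4, reach state 4 for a +1 reward."""
    def reset(self):
        return 0
    def actions(self, s):
        return ["left", "right"]
    def step(self, s, a):
        s2 = max(0, s - 1) if a == "left" else min(4, s + 1)
        done = (s2 == 4)
        return s2, (1.0 if done else -0.04), done

def q_learning(env, episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                       # Q[(s, a)], defaults to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            acts = env.actions(s)
            if random.random() < epsilon:        # explore
                a = random.choice(acts)
            else:                                # exploit the current Q-table
                a = max(acts, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(s, a)
            best_next = 0.0 if done else max(Q[(s2, a_)] for a_ in env.actions(s2))
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

Q = q_learning(ChainEnv())
print(max(["left", "right"], key=lambda a: Q[(0, a)]))   # should print "right"
```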

Q-Learning Properties  Amazing result: Q-learning converges to the optimal policy  if you explore enough  and make the learning rate small enough  ... but don't decrease it too quickly!  Basically, it doesn't matter how you select actions (!)  Neat property: off-policy learning, i.e. learn the optimal policy without following it (some caveats; see SARSA)

Practical Q-Learning  In realistic situations, we cannot possibly learn about every single state!  Too many states to visit them all in training  Too many states to hold the Q-table in memory  Instead, we want to generalize:  Learn about a small number of training states from experience  Generalize that experience to new, similar states  This is a fundamental idea in machine learning, and we'll see it over and over again

Generalization

Generalization Example  Let's say we discover through experience that this state (pictured on the slide) is bad  In naïve Q-learning, we know nothing about this other, similar state or its Q-states  Or even this one!

Feature-Based Representations  Solution: describe a state using a vector of features  Features are functions from states to real numbers (often 0/1) that capture important properties of the state  Example features:  Distance to closest ghost  Distance to closest dot  Number of ghosts  1 / (distance to closest dot)²  Is Pacman in a tunnel? (0/1)  ... etc.  We can also describe a Q-state (s, a) with features (e.g. "this action moves closer to food")

Linear Feature Functions  Using a feature representation, we can write a Q function (or utility function) for any state as a weighted sum of a few features: Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + … + w_n f_n(s, a)  Advantage: our experience is summed up in a few powerful numbers  Disadvantage: states may share features but be very different in value!

Function Approximation  Q-learning with linear Q-functions: after each transition, compute the error difference = [ r + γ max_{a′} Q(s′, a′) ] − Q(s, a) and update each weight w_i ← w_i + α · difference · f_i(s, a)  Intuitive interpretation:  Adjust the weights of the active features  E.g. if something unexpectedly bad happens, disprefer all states with that state's features (prudence or superstition?)
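A sketch of the linear-function-approximation update described above. The feature extractor, the dictionary-based state representation, and all names are illustrative assumptions; in a real agent the features would be the domain-specific ones from the previous slide.

```python
# Approximate Q-learning sketch: Q(s, a) = sum_i w_i * f_i(s, a).
def features(s, a):
    """Made-up feature extractor; s is assumed to be a dict of raw measurements.
    Real features could also depend on the action a (e.g. 'moves closer to food')."""
    return {
        "bias": 1.0,
        "dist_to_food": s.get("dist_to_food", 0.0),
        "num_ghosts_near": s.get("num_ghosts_near", 0.0),
    }

def q_value(w, s, a):
    return sum(w.get(name, 0.0) * value for name, value in features(s, a).items())

def update_weights(w, s, a, r, s_next, next_actions, alpha=0.01, gamma=0.9):
    """One approximate Q-learning step: w_i <- w_i + alpha * difference * f_i(s, a)."""
    best_next = max((q_value(w, s_next, a2) for a2 in next_actions), default=0.0)
    difference = (r + gamma * best_next) - q_value(w, s, a)      # TD error
    for name, value in features(s, a).items():
        w[name] = w.get(name, 0.0) + alpha * difference * value
    return w

# Example step (all numbers invented):
w = {}
s = {"dist_to_food": 2.0, "num_ghosts_near": 1.0}
s_next = {"dist_to_food": 1.0, "num_ghosts_near": 0.0}
w = update_weights(w, s, "north", r=-1.0, s_next=s_next, next_actions=["north", "south"])
print(w)
```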

Summary: MDPs and RL  Things we know how to do, and the techniques for doing them:  If we know the MDP (model-based DP: value iteration, policy iteration):  Compute U*, Q*, π* exactly  Evaluate a fixed policy π  If we don't know the MDP:  We can estimate the MDP from experience, then solve it (model-based RL)  We can estimate U for a fixed policy π (model-free RL: value learning)  We can estimate Q*(s, a) for the optimal policy while executing an exploration policy (model-free RL: Q-learning)