Presentation transcript:

Possible actions: up, down, right, left. Rewards: −0.04 in each non-terminal state. The environment is fully observable (i.e., the agent always knows where it is). MDP = "Markov Decision Process". [Figure: the 4×3 grid world, with start state s0 and terminal states +1 and −1.] Actions are stochastic: in the book example, the probability of moving in the desired direction is 0.8, and the probability of moving at a right angle to the desired direction is 0.1 to each side (0.2 in total).
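To make the setup concrete, here is a minimal Python sketch of the same MDP (not from the slides): the standard 4×3 grid world with a blocked cell, terminal states worth +1 and −1, a living reward of −0.04, and the 0.8/0.1/0.1 stochastic action model. The coordinates, wall position, and helper names are assumptions chosen for illustration.

# Minimal sketch of the 4x3 grid-world MDP (assumed layout; (col, row), origin at bottom-left).
ACTIONS = {"up": (0, 1), "down": (0, -1), "right": (1, 0), "left": (-1, 0)}
PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                 "left": ("up", "down"), "right": ("up", "down")}
WALL = (1, 1)                                   # blocked cell
TERMINALS = {(3, 2): 1.0, (3, 1): -1.0}         # terminal states and their rewards
STATES = [(c, r) for c in range(4) for r in range(3) if (c, r) != WALL]

def reward(s):
    """R(s): +1 / -1 in the terminal states, -0.04 in every other state."""
    return TERMINALS.get(s, -0.04)

def move(s, a):
    """Deterministic effect of action a in state s; bumping into the wall or the edge leaves the agent in place."""
    c, r = s
    dc, dr = ACTIONS[a]
    nxt = (c + dc, r + dr)
    if nxt == WALL or not (0 <= nxt[0] < 4 and 0 <= nxt[1] < 3):
        return s
    return nxt

def transition(s, a):
    """T(s, a, .) as a list of (probability, next_state) pairs: 0.8 as intended, 0.1 to each perpendicular direction."""
    if s in TERMINALS:
        return [(1.0, s)]                        # terminal states absorb
    return [(0.8, move(s, a)),
            (0.1, move(s, PERPENDICULAR[a][0])),
            (0.1, move(s, PERPENDICULAR[a][1]))]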

Utility of a state sequence = discounted sum of rewards.
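The formula itself appeared only as a figure on the slide; written out in the standard way, with discount factor γ between 0 and 1:

U([s_0, s_1, s_2, \ldots]) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots = \sum_{t=0}^{\infty} \gamma^t R(s_t)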

Policy: a function π that maps states to actions, π: S → A. Optimal policy π*: the policy with the highest expected utility.

Utility of state s given policy π, where s_t is the state reached after starting in s and executing π for t steps.
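The defining equation was shown only as an image on the slide; the standard form (an expectation, because the transitions are stochastic) is:

U^{\pi}(s) = E\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t) \right]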

Define: U(s) ≡ U^{π*}(s), the utility of state s under the optimal policy.

Suppose that the agent, in s at time t and following π*, moves to s′ at time t+1. We can write U(s) in terms of U(s′): this is Bellman's equation. Bellman's equation yields a set of simultaneous equations that can be solved (given certain assumptions) to find the utilities.
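The equation itself was shown as an image; with reward R(s), transition model T(s, a, s′), and discount γ, Bellman's equation reads:

U(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, U(s')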

State utilities. [Figure: the grid world annotated with the computed utility of each state.]

How to learn an optimal policy? Value iteration: calculate the utility of each state, then use the state utilities to select the optimal action in each state. But this requires knowing R(s) and T(s, a, s′), and in most problems the agent doesn't have this knowledge.
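For completeness, a small value-iteration sketch in Python, written against the grid-world helpers assumed in the earlier sketch (STATES, ACTIONS, TERMINALS, reward, transition); it is an illustration, not the slides' own code.

def value_iteration(gamma=1.0, eps=1e-6):
    """Repeatedly apply the Bellman update until the utilities stop changing."""
    U = {s: 0.0 for s in STATES}
    while True:
        delta, new_U = 0.0, {}
        for s in STATES:
            if s in TERMINALS:
                new_U[s] = reward(s)             # a terminal's utility is just its reward
                continue
            best = max(sum(p * U[s2] for p, s2 in transition(s, a)) for a in ACTIONS)
            new_U[s] = reward(s) + gamma * best  # Bellman update
            delta = max(delta, abs(new_U[s] - U[s]))
        U = new_U
        if delta < eps:
            return U

def greedy_policy(U):
    """Pick, in each non-terminal state, the action with the best expected utility."""
    return {s: max(ACTIONS, key=lambda a: sum(p * U[s2] for p, s2 in transition(s, a)))
            for s in STATES if s not in TERMINALS}

Running greedy_policy(value_iteration()) recovers an optimal policy for the grid world, which is the point of the slide: once R(s) and T(s, a, s′) are known, the utilities and the policy follow directly.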

Reinforcement Learning. The agent has no teacher (in contrast to supervised training of a neural network) and no prior knowledge of the reward function or the state transition function. "Imagine playing a new game whose rules you don't know; after a hundred or so moves, your opponent announces 'you lose'. This is reinforcement learning in a nutshell." (Textbook) Question: how best to explore "on-line" (while receiving rewards and punishments)? This is analogous to the multi-armed bandit problem mentioned earlier.

Q-learning: don't learn utilities! Instead, learn a "value" function Q: S × A → ℝ, where Q(s, a) is the estimated value of taking action a in state s; for the best action a from state s, Q(s, a) equals U(s). This is a "model-free" method. If we knew Q(s, a) for each state/action pair, we could simply choose, in each state, the action that maximizes Q.
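Written out (these standard relations are implicit on the slide rather than stated there): the utility of a state is the value of its best action, and Q satisfies a Bellman-style equation of its own:

U(s) = \max_{a} Q(s, a), \qquad Q(s, a) = R(s) + \gamma \sum_{s'} T(s, a, s')\, \max_{a'} Q(s', a')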

How to learn Q (simplified from Figure 21.8). Assume T: S × A → S is a deterministic state transition function. Assume γ = α = 1.

How to learn Q (simplified from Figure 21.8):

Q-Learn() {
    Q(s, a) = 0 for all s, a;          // initialize the Q-matrix to all zeros
    s = s0;
    while s is not a terminal state {
        choose action a;               // many different ways to do this
        s' = T(s, a);
        Q(s, a) = R(s) + max_a' Q(s', a');
        s = s';
    }
}
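Below is a runnable Python version of the same idea, again using the assumed grid-world helpers from the earlier sketch (STATES, ACTIONS, TERMINALS, reward, move). The exploration strategy (uniformly random actions) and the line that credits a terminal state's +1/−1 reward are my own choices; the slide leaves both open.

import random

def q_learn(episodes=500, start=(0, 0)):
    """Tabular Q-learning with the slide's simplified update
    Q(s, a) = R(s) + max_a' Q(s', a'), for deterministic transitions and gamma = 1."""
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(episodes):
        s = start
        while s not in TERMINALS:
            a = random.choice(list(ACTIONS))     # one simple way to choose actions
            s2 = move(s, a)                      # deterministic T(s, a)
            if s2 in TERMINALS:
                # Credit the terminal's +1 / -1 reward (an assumption; the slide glosses over this).
                Q[(s, a)] = reward(s) + reward(s2)
            else:
                Q[(s, a)] = reward(s) + max(Q[(s2, a2)] for a2 in ACTIONS)
            s = s2
    return Q

After training, the greedy action in state s is the a that maximizes Q[(s, a)], matching the earlier slide: if we know Q, we simply choose the action that maximizes it.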

Pathfinder demo

How to do HW problem 4. [Figure: the grid world for the homework problem, with start state s0.]