Markov Decision Processes: Value Iteration
Pieter Abbeel, UC Berkeley EECS
Markov Decision Process

[Drawing from Sutton and Barto, Reinforcement Learning: An Introduction, 1998]

Assumption: the agent gets to observe the state.
Markov Decision Process (S, A, T, R, H)

Given:
- S: set of states
- A: set of actions
- T: S × A × S × {0, 1, …, H} → [0, 1], T_t(s, a, s') = P(s_{t+1} = s' | s_t = s, a_t = a)
- R: S × A × S × {0, 1, …, H} → ℝ, R_t(s, a, s') = reward for (s_{t+1} = s', s_t = s, a_t = a)
- H: horizon over which the agent will act

Goal: Find π: S × {0, 1, …, H} → A that maximizes the expected sum of rewards, i.e.,

π* = argmax_π E[ Σ_{t=0}^{H} R_t(s_t, a_t, s_{t+1}) | π ]
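As a minimal sketch, the (S, A, T, R, H) tuple might be represented in Python as below; the field names and the toy two-state example are illustrative assumptions (the slides do not prescribe a data layout), and T and R are taken as time-invariant for simplicity:

```python
from dataclasses import dataclass
from typing import Callable, Hashable, List

@dataclass
class MDP:
    states: List[Hashable]   # S: set of states
    actions: List[Hashable]  # A: set of actions
    T: Callable[[Hashable, Hashable, Hashable], float]  # T(s, a, s') = P(s'|s, a)
    R: Callable[[Hashable, Hashable, Hashable], float]  # R(s, a, s') = transition reward
    H: int                   # horizon

# Toy two-state example (purely illustrative): "stay" keeps the state,
# "flip" switches it; reward 1 for landing in s1.
toy = MDP(
    states=["s0", "s1"],
    actions=["stay", "flip"],
    T=lambda s, a, sp: float(sp == s) if a == "stay" else float(sp != s),
    R=lambda s, a, sp: 1.0 if sp == "s1" else 0.0,
    H=5,
)
```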
Examples

MDP (S, A, T, R, H), goal: maximize expected sum of rewards

- Cleaning robot
- Walking robot
- Pole balancing
- Games: Tetris, backgammon
- Server management
- Shortest path problems
- Model for animals, people
Canonical Example: Grid World

- The agent lives in a grid
- Walls block the agent's path
- The agent's actions do not always go as planned:
  - 80% of the time, the action North takes the agent North (if there is no wall there)
  - 10% of the time, North takes the agent West; 10% East
  - If there is a wall in the direction the agent would have been taken, the agent stays put
- Big rewards come at the end
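A sketch of this 80/10/10 noise model in code; the grid layout, coordinate convention, and helper names are assumptions for illustration, not from the slides:

```python
# Each intended action succeeds with prob. 0.8 and slips to each
# perpendicular direction with prob. 0.1.
NOISE = {  # intended action -> [(actual direction, probability)]
    "N": [("N", 0.8), ("W", 0.1), ("E", 0.1)],
    "S": [("S", 0.8), ("E", 0.1), ("W", 0.1)],
    "E": [("E", 0.8), ("N", 0.1), ("S", 0.1)],
    "W": [("W", 0.8), ("S", 0.1), ("N", 0.1)],
}
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

def transition(s, a, walls, rows, cols):
    """Return {s': P(s' | s, a)} under the 80/10/10 noise model."""
    dist = {}
    for actual, p in NOISE[a]:
        dr, dc = MOVES[actual]
        sp = (s[0] + dr, s[1] + dc)
        # If the move would leave the grid or hit a wall, the agent stays put.
        if not (0 <= sp[0] < rows and 0 <= sp[1] < cols) or sp in walls:
            sp = s
        dist[sp] = dist.get(sp, 0.0) + p
    return dist
```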
Grid Futures

[Figure: deterministic grid world (each action leads to a single successor state) vs. stochastic grid world (each action yields a distribution over successor states).]
Solving MDPs

In an MDP, we want an optimal policy π*: S × {0, 1, …, H} → A
- A policy gives an action for each state, for each time
- An optimal policy maximizes the expected sum of rewards

Contrast: in a deterministic problem, we want an optimal plan, i.e., a sequence of actions from start to goal.

[Figure: policy execution over timesteps t = 0, 1, …, 5 = H]
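For contrast, the two objects might look like this in code; the state and action names are placeholders:

```python
from typing import Dict, Hashable, List, Tuple

Policy = Dict[Tuple[Hashable, int], Hashable]  # (state, time) -> action
Plan = List[Hashable]                          # fixed action sequence a_0, ..., a_{H-1}

# A policy reacts to whatever state the agent lands in at each time;
# a plan is executed open-loop, regardless of where the agent ends up.
policy: Policy = {("s0", 0): "flip", ("s1", 0): "stay"}
plan: Plan = ["flip", "stay", "stay"]
```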
Value Iteration

Idea: V*_i(s) = the expected sum of rewards accumulated when starting from state s and acting optimally for a horizon of i steps.

Algorithm:
- Start with V*_0(s) = 0 for all s.
- For i = 1, …, H: given V*_{i-1}, calculate for all states s ∈ S:

  V*_i(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + V*_{i-1}(s') ]

This is called a value update or Bellman update/back-up.
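A minimal sketch of this loop in Python, under the same assumptions as the MDP sketch above (time-invariant T and R, states enumerable):

```python
def value_iteration(states, actions, T, R, H):
    """Finite-horizon value iteration; returns V_H*(s) for every state s."""
    V = {s: 0.0 for s in states}          # V_0*(s) = 0 for all s
    for i in range(1, H + 1):             # for i = 1, ..., H
        V_new = {}
        for s in states:
            # Bellman update:
            # V_i*(s) = max_a sum_{s'} T(s,a,s') [R(s,a,s') + V_{i-1}*(s')]
            V_new[s] = max(
                sum(T(s, a, sp) * (R(s, a, sp) + V[sp]) for sp in states)
                for a in actions
            )
        V = V_new
    return V
```

Run on the toy MDP sketched earlier, `value_iteration(toy.states, toy.actions, toy.T, toy.R, toy.H)` returns 5.0 for both states (one unit of reward per step for H = 5 steps). Each sweep costs O(|S|²·|A|) with this dense enumeration of successor states.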
Example
Example: Value Iteration

Information propagates outward from terminal states, and eventually all states have correct value estimates.

[Figure: grid-world value estimates after iterations V_2 and V_3]
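A tiny self-contained demo of this propagation, on an assumed five-state chain with a single rewarding transition into the rightmost state; after each sweep, the nonzero values reach one state further from the reward:

```python
# Deterministic 5-state chain; reward 1 only for entering state 4.
states = list(range(5))
actions = ["left", "right"]

def T(s, a, sp):
    target = min(s + 1, 4) if a == "right" else max(s - 1, 0)
    return 1.0 if sp == target else 0.0

def R(s, a, sp):
    return 1.0 if (sp == 4 and s != 4) else 0.0

V = {s: 0.0 for s in states}
for i in range(1, 4):
    V = {s: max(sum(T(s, a, sp) * (R(s, a, sp) + V[sp]) for sp in states)
                for a in actions) for s in states}
    print(f"V_{i}:", V)
# V_1 is nonzero only at state 3; V_2 extends to state 2; and so on.
```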
Practice: Computing Actions

Which action should we choose from state s, given optimal values V*?

π*(s) = argmax_a Σ_{s'} T(s, a, s') [ R(s, a, s') + V*(s') ]

= the greedy action with respect to V*
= action choice with one-step lookahead w.r.t. V*
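The same one-step lookahead as a small Python helper, reusing the `states`, `actions`, `T`, `R`, and `V` conventions from the sketches above (assumed, not prescribed by the slides):

```python
def greedy_action(s, states, actions, T, R, V):
    """pi(s) = argmax_a sum_{s'} T(s,a,s') [R(s,a,s') + V(s')]."""
    return max(
        actions,
        key=lambda a: sum(T(s, a, sp) * (R(s, a, sp) + V[sp]) for sp in states),
    )
```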
Today and forthcoming lectures

Optimal control: provides a general computational approach to tackle control problems.

- Dynamic programming / value iteration
  - Discrete state spaces (done!)
  - Discretization of continuous state spaces
  - Linear systems: LQR
  - Extensions to nonlinear settings:
    - Local linearization
    - Differential dynamic programming
- Optimal control through nonlinear optimization
  - Open-loop
  - Model predictive control
- Examples