1 Markov Decision Processes: Basic Concepts. Alan Fern.


Some AI Planning Problems
[Images: Fire & Rescue Response Planning; Solitaire; Real-Time Strategy Games; Helicopter Control; Legged Robot Control; Network Security/Control]

3 Some AI Planning Problems
- Health Care
  - Personalized treatment planning
  - Hospital Logistics/Scheduling
- Transportation
  - Autonomous Vehicles
  - Supply Chain Logistics
  - Air traffic control
- Assistive Technologies
  - Dialog Management
  - Automated assistants for elderly/disabled
  - Household robots
- Sustainability
  - Smart grid
  - Forest fire management
- …

4 Common Elements
- We have a system that changes state over time
- We can (partially) control the system's state transitions by taking actions
- The problem gives an objective that specifies which states (or state sequences) are more/less preferred
- Problem: at each moment we must select an action to optimize the overall (long-term) objective
  - Produce the most preferred state sequences

5 Observe-Act Loop of AI Planning Agent
[Diagram: the agent (????) receives observations of the state of the world/system and sends actions to the world/system.]
Goal: maximize expected reward over lifetime

6 Stochastic/Probabilistic Planning: Markov Decision Process (MDP) Model
[Diagram: the agent (????) observes the world state and chooses an action from a finite set; the world is modeled as a Markov Decision Process.]
Goal: maximize expected reward over lifetime

7 Example MDP
[Diagram: a dice game played by the agent (????).]
- The state describes all visible info about the game
- The actions are the different choices of dice to roll (or the selection of a category to score)
Goal: maximize score at end of game

8 Markov Decision Processes
- An MDP has four components, S, A, R, T:
  - finite state set S
  - finite action set A
  - transition function T(s,a,s’) = Pr(s’ | s,a)
    - Probability of going to state s’ after taking action a in state s
  - bounded, real-valued reward function R(s,a)
    - Immediate reward we get for being in state s and taking action a
- Roughly speaking, the objective is to select actions in order to maximize total reward over time
  - For example, in a goal-based domain R(s,a) may equal 1 for goal states and 0 for all others (or -1 reward for non-goal states)
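The four components above can be written down directly as data. The following is a minimal sketch, assuming a made-up two-state problem; the state names, action names, and probabilities are illustrative and do not come from the slides.

```python
# Hypothetical two-state MDP; all names and numbers are illustrative only.
S = ["start", "goal"]          # finite state set S
A = ["stay", "move"]           # finite action set A

# T[(s, a)] maps each next state s' to Pr(s' | s, a).
T = {
    ("start", "stay"): {"start": 1.0},
    ("start", "move"): {"start": 0.2, "goal": 0.8},
    ("goal", "stay"): {"goal": 1.0},
    ("goal", "move"): {"start": 0.5, "goal": 0.5},
}


def R(s, a):
    """Goal-based reward: 1 when in the goal state, 0 otherwise."""
    return 1.0 if s == "goal" else 0.0


def check_mdp(S, A, T):
    """Verify that every T(s, a, .) is a probability distribution over S."""
    for s in S:
        for a in A:
            dist = T[(s, a)]
            assert all(s2 in S and p >= 0.0 for s2, p in dist.items())
            assert abs(sum(dist.values()) - 1.0) < 1e-9
    return True
```

Storing T as a dictionary of distributions keeps the table sparse: only next states with non-zero probability need to be listed.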

[Diagram: actions available from a state: Roll(die1), Roll(die1,die2), Roll(die1,die2,die3), ...]

[Diagram: probabilistic state transition: from a state, Roll(die1,die2) leads to one of the outcomes 1,1; 1,2; …; 6,6.]
Reward: we only get reward for “category selection” actions. The reward equals the points gained.

11 What is a solution to an MDP?
MDP Planning Problem:
  Input: an MDP (S,A,R,T)
  Output: ????
- Should the solution to an MDP be just a sequence of actions such as (a1,a2,a3,…)?
  - Consider a single-player card game like Blackjack/Solitaire.
- No! In general an action sequence is not sufficient
  - Actions have stochastic effects, so the state we end up in is uncertain
  - This means that we might end up in states where the remainder of the action sequence doesn’t apply or is a bad choice
- A solution should tell us what the best action is for any possible situation/state that might arise

12 Policies (“plans” for MDPs)
- For this class we will assume that we are given a finite planning horizon H
  - I.e., we are told how many actions we will be allowed to take
- A solution to an MDP is a policy that says what to do at any moment
- Policies are functions from states and times to actions
  - π : S × T → A, where T is the set of non-negative integers (here T denotes time, not the transition function)
  - π(s,t) tells us what action to take at state s when there are t stages-to-go
  - A policy that does not depend on t is called stationary; otherwise it is called non-stationary
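To make the stationary/non-stationary distinction concrete, here is a small sketch in which a policy is simply a function of (s, t). The states, actions, and the helper `is_stationary` are hypothetical names chosen for illustration.

```python
# Hypothetical states and actions, for illustration only.
def stationary_policy(s, t):
    """Ignores t: the same action is chosen at s for any stages-to-go."""
    return "move" if s == "start" else "stay"


def nonstationary_policy(s, t):
    """Depends on t: behaves differently when only one stage remains."""
    if s == "start":
        return "stay" if t <= 1 else "move"
    return "stay"


def is_stationary(policy, states, horizon):
    """Check whether policy(s, t) is the same for every t = 1..horizon."""
    return all(
        len({policy(s, t) for t in range(1, horizon + 1)}) == 1
        for s in states
    )
```

With a finite horizon, the best action can change as the number of remaining stages shrinks, which is why the general definition takes t as an argument.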

13 What is a solution to an MDP?
MDP Planning Problem:
  Input: an MDP (S,A,R,T)
  Output: a policy such that ????
- We don’t want to output just any policy
- We want to output a “good” policy
  - One that accumulates lots of reward

14 Value of a Policy


16 What is a solution to an MDP?

17 Computational Problems

18 Computational Problems
- Dynamic programming techniques can be used for both policy evaluation and optimization
  - Polynomial time in # of states and actions
- Is polytime in # of states and actions good?
- Not when these numbers are enormous!
  - As is the case for most realistic applications
  - Consider Klondike Solitaire, Computer Network Control, etc.
Enter Monte-Carlo Planning
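The dynamic programming equations themselves did not survive in this transcript, so the following is only a sketch of the usual finite-horizon recursion, under the assumption that the value with 0 actions left is 0 and that the value with k actions left is R(s,a) plus the expected value with k-1 actions left. The toy MDP and all function names are illustrative.

```python
def backup(s, a, T, R, V_prev):
    """One-step lookahead: R(s,a) + sum over s' of Pr(s'|s,a) * V_prev(s')."""
    return R(s, a) + sum(p * V_prev[s2] for s2, p in T[(s, a)].items())


def evaluate_policy(S, T, R, policy, H):
    """V[k][s]: expected total reward of `policy` from s with k actions left."""
    V = [{s: 0.0 for s in S}]                 # assumption: V[0] = 0
    for k in range(1, H + 1):
        V.append({s: backup(s, policy(s, k), T, R, V[k - 1]) for s in S})
    return V


def value_iteration(S, A, T, R, H):
    """Optimal k-stage values and a (possibly non-stationary) optimal policy."""
    V = [{s: 0.0 for s in S}]
    pi = {}
    for k in range(1, H + 1):
        Vk = {}
        for s in S:
            q = {a: backup(s, a, T, R, V[k - 1]) for a in A}
            best = max(q, key=q.get)          # ties go to the first action in A
            Vk[s], pi[(s, k)] = q[best], best
        V.append(Vk)
    return V, pi


# Hypothetical two-state MDP used only to exercise the functions.
S = ["start", "goal"]
A = ["stay", "move"]
T = {
    ("start", "stay"): {"start": 1.0},
    ("start", "move"): {"start": 0.2, "goal": 0.8},
    ("goal", "stay"): {"goal": 1.0},
    ("goal", "move"): {"start": 0.5, "goal": 0.5},
}


def R(s, a):
    return 1.0 if s == "goal" else 0.0


V_opt, pi_opt = value_iteration(S, A, T, R, H=2)
V_move = evaluate_policy(S, T, R, lambda s, t: "move", H=2)
```

Each sweep touches every (s, a, s’) triple once, so the total cost is on the order of H·|S|²·|A|: polynomial, but hopeless when |S| is astronomically large.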

19 Approaches for Large Worlds: Monte-Carlo Planning
- Often a simulator of a planning domain is available or can be learned from data
[Images: Klondike Solitaire; Fire & Emergency Response]

20 Large Worlds: Monte-Carlo Approach
- Often a simulator of a planning domain is available or can be learned from data
- Monte-Carlo Planning: compute a good policy for an MDP by interacting with an MDP simulator
[Diagram: the planner sends an action to the world simulator (standing in for the real world) and receives a state + reward.]

21 Example Domains with Simulators
- Traffic simulators
- Robotics simulators
- Military campaign simulators
- Computer network simulators
- Emergency planning simulators
  - large-scale disaster and municipal
- Sports domains
- Board games / Video games
  - Go / RTS
In many cases Monte-Carlo techniques yield state-of-the-art performance.

22 MDP: Simulation-Based Representation
- A simulation-based representation gives S, A, R, T, I:
  - finite state set S (|S| is generally very large)
  - finite action set A (|A| = m, assumed to be of reasonable size)
  - Stochastic, real-valued, bounded reward function R(s,a) = r
    - Stochastically returns a reward r given input s and a
  - Stochastic transition function T(s,a) = s’ (i.e. a simulator)
    - Stochastically returns a state s’ given input s and a
    - Probability of returning s’ is dictated by Pr(s’ | s,a) of the MDP
  - Stochastic initial state function I
    - Stochastically returns a state according to an initial state distribution
These stochastic functions can be implemented in any language!
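As one illustration of such an implementation, the sketch below wraps R, T, and I as methods of a small class. The chain-walk domain, the class name `ChainSimulator`, and the `slip` parameter are assumptions made up for this example, not something defined in the slides.

```python
import random


class ChainSimulator:
    """Hypothetical toy simulator: states are integers 0..n-1, the goal is n-1."""

    def __init__(self, n=5, slip=0.2, seed=None):
        self.n = n
        self.slip = slip
        self.rng = random.Random(seed)
        self.actions = ["left", "right"]      # finite action set A

    def I(self):
        """Initial state function: sample a start state (uniform over non-goal states)."""
        return self.rng.randrange(self.n - 1)

    def R(self, s, a):
        """Reward function: 1 in the goal state, 0 elsewhere (deterministic here)."""
        return 1.0 if s == self.n - 1 else 0.0

    def T(self, s, a):
        """Transition function: sample s' ~ Pr(. | s, a); the move reverses with prob. slip."""
        step = 1 if a == "right" else -1
        if self.rng.random() < self.slip:
            step = -step
        return min(self.n - 1, max(0, s + step))
```

Note that the state set is never enumerated: the simulator only needs to produce one sampled successor at a time, which is what makes this representation usable when |S| is very large.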

23 Computational Problems

24 Trajectories
- We can use the simulator to observe trajectories of any policy π from any state s
- Let Traj(s, π, h) be a stochastic function that returns a length-h trajectory of π starting at s
- Traj(s, π, h):
    s_0 = s
    For i = 1 to h-1:
      s_i = T(s_{i-1}, π(s_{i-1}))
    Return s_0, s_1, …, s_{h-1}
- The total reward of a trajectory is the sum of the rewards collected along it
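The Traj procedure translates almost line for line into code. This is a minimal sketch assuming a stationary policy π(s) and simulator functions T(s, a) and R(s, a) passed in as plain callables; the function names are chosen here for illustration.

```python
def traj(s, policy, h, T):
    """Return the length-h trajectory [s_0, ..., s_{h-1}] of `policy` from s."""
    states = [s]                          # s_0 = s
    for _ in range(1, h):                 # for i = 1 to h-1
        s = T(s, policy(s))               # s_i = T(s_{i-1}, pi(s_{i-1}))
        states.append(s)
    return states


def total_reward(states, policy, R):
    """Sum the rewards R(s_i, pi(s_i)) over the states of a trajectory."""
    return sum(R(s, policy(s)) for s in states)
```

Because T is stochastic in general, two calls to `traj` with the same arguments may return different trajectories.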

25 Policy Evaluation

26 Sampling-Error Bound
- Approximation due to sampling
- Note that the r_i are samples of the random variable R(Traj(s, π, h))
- We can apply the additive Chernoff bound, which bounds the difference between an expectation and an empirical average
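The estimator being bounded here can be sketched as follows: draw w trajectories, record each total reward r_i, and average. This assumes a stationary policy and simulator callables T(s, a) and R(s, a); the noisy-walk simulator at the bottom is a hypothetical stand-in used only to run the code.

```python
import random


def sample_return(s, policy, h, T, R):
    """Total reward of one length-h trajectory s_0, ..., s_{h-1} of `policy` from s."""
    total = 0.0
    for i in range(h):
        a = policy(s)
        total += R(s, a)
        if i < h - 1:                     # no transition needed after s_{h-1}
            s = T(s, a)
    return total


def mc_policy_value(s, policy, h, w, T, R):
    """Average of w sampled returns r_1, ..., r_w."""
    returns = [sample_return(s, policy, h, T, R) for _ in range(w)]
    return sum(returns) / w


# Hypothetical simulator: a noisy walk on the integers, reward 1 at states >= 3.
rng = random.Random(0)


def T(s, a):
    return s + (1 if rng.random() < 0.8 else 0)


def R(s, a):
    return 1.0 if s >= 3 else 0.0


estimate = mc_policy_value(0, lambda s: "advance", h=10, w=1000, T=T, R=R)
```

The cost is w·h simulator calls and does not depend on the number of states.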

27 Aside: Additive Chernoff Bound
- Let X be a random variable with maximum absolute value Z, and let x_i, i = 1, …, w, be i.i.d. samples of X
- The Chernoff bound gives a bound on the probability that the average of the x_i is far from E[X]
- Let {x_i | i = 1, …, w} be i.i.d. samples of random variable X; then with probability at least … we have that …, equivalently …
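The formulas on this slide were lost in the transcript. For reference, one standard form of the additive bound is Hoeffding's inequality for samples confined to [-Z, Z]; the symbols ε (error tolerance) and δ (failure probability) are introduced here as assumptions, and the constants on the original slide may differ.

```latex
\[
\Pr\!\left( \left| E[X] - \frac{1}{w}\sum_{i=1}^{w} x_i \right| \ge \epsilon \right)
\;\le\; 2\exp\!\left( -\frac{w\,\epsilon^{2}}{2Z^{2}} \right)
\]
Equivalently, with probability at least $1-\delta$,
\[
\left| E[X] - \frac{1}{w}\sum_{i=1}^{w} x_i \right|
\;\le\; Z\sqrt{\frac{2}{w}\,\ln\frac{2}{\delta}} .
\]
```

The second line follows from the first by setting the right-hand side equal to δ and solving for ε.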

28 Aside: Coin Flip Example
- Suppose we have a coin with probability of heads equal to p.
- Let X be a random variable where X = 1 if the coin flip gives heads and 0 otherwise (so Z from the bound is 1).
- E[X] = 1*p + 0*(1-p) = p
- After flipping the coin w times we can estimate the heads probability by the average of the x_i.
- The Chernoff bound tells us that this estimate converges exponentially fast to the true mean (coin bias) p.
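The coin example is easy to simulate. The sketch below estimates p from w flips and prints the estimate next to a confidence radius. The radius uses the Hoeffding form Z·sqrt((2/w)·ln(2/δ)), which is an assumption about the constants since the slide's own formula is not preserved; p = 0.3 and δ = 0.01 are arbitrary illustrative choices.

```python
import math
import random


def hoeffding_radius(w, delta, Z=1.0):
    """Error radius that holds with probability >= 1 - delta (Hoeffding form, assumed)."""
    return Z * math.sqrt((2.0 / w) * math.log(2.0 / delta))


def estimate_heads_prob(p, w, rng):
    """Flip a coin with bias p a total of w times and return the fraction of heads."""
    return sum(1 if rng.random() < p else 0 for _ in range(w)) / w


rng = random.Random(0)
p = 0.3                                   # hypothetical true coin bias
for w in (100, 10_000):
    est = estimate_heads_prob(p, w, rng)
    print(w, round(est, 4), round(hoeffding_radius(w, delta=0.01), 4))
```

Multiplying w by 4 halves the radius, so each extra digit of accuracy costs roughly 100 times more samples, while the failure probability for a fixed radius shrinks exponentially in w.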

29 Sampling Error Bound
- Approximation due to sampling
- We get that, with probability at least …
- Can increase w to get an arbitrarily good approximation.

Two Player MDP (aka Markov Games)
- So far we have only discussed single-player MDPs/games
- Your labs and competition will be 2-player zero-sum games (zero sum means the sum of the players' rewards is zero)
- We assume players take turns (non-simultaneous moves)
[Diagram: Player 1 and Player 2 each send an action to the Markov game and each receive a state/reward.]
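A turn-taking zero-sum game can be simulated with the same kind of step function as a single-player MDP, except that the state records whose move it is and the two rewards always sum to zero. The sketch below uses a made-up stone-taking game (take 1 or 2 stones; whoever takes the last stone wins) purely as an illustration; it is not a game from the course.

```python
def actions(state):
    """Legal moves: take 1 or 2 stones, but no more than remain."""
    stones, _ = state
    return [k for k in (1, 2) if k <= stones]


def step(state, action):
    """Apply the mover's action; return (next_state, reward_p1, reward_p2)."""
    stones, player = state
    stones -= action
    if stones == 0:                       # the mover took the last stone and wins
        r1 = 1.0 if player == 1 else -1.0
    else:
        r1 = 0.0
    return (stones, 3 - player), r1, -r1  # zero-sum: rewards always sum to 0


def play(policy1, policy2, stones=7):
    """Alternate moves until no stones remain; return Player 1's total reward."""
    state, total1 = (stones, 1), 0.0
    while state[0] > 0:
        policy = policy1 if state[1] == 1 else policy2
        state, r1, _ = step(state, policy(state))
        total1 += r1
    return total1
```

Because the game is zero-sum, tracking Player 1's reward is enough: Player 2's total is its negation.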

31 Simulators for 2-Player Games

32 Finite Horizon Value of Game

33 Summary
- Markov Decision Processes (MDPs) are common models for sequential planning problems
- The solution to an MDP is a policy
- The goodness of a policy is measured by its value function
  - Expected total reward over H steps
- Monte Carlo Planning (MCP) is used for enormous MDPs for which we have a simulator
- Evaluating a policy via MCP is very easy