1 Markov Decision Processes Basics Concepts Alan Fern.

1 Markov Decision Processes Basics Concepts Alan Fern

Some AI Planning Problems Fire & Rescue Response Planning Solitaire Real-Time Strategy Games Helicopter Control Legged Robot ControlNetwork Security/Control

3 Some AI Planning Problems  Health Care  Personalized treatment planning  Hospital Logistics/Scheduling  Transportation  Autonomous Vehicles  Supply Chain Logistics  Air traffic control  Assistive Technologies  Dialog Management  Automated assistants for elderly/disabled  Household robots  Sustainability  Smart grid  Forest fire management …..

4 Common Elements  We have a systems that changes state over time  Can (partially) control the system state transitions by taking actions  Problem gives an objective that specifies which states (or state sequences) are more/less preferred  Problem: At each moment must select an action to optimize the overall (long-term) objective  Produce most preferred state sequences

5 Observations Actions ???? world/ system State of world/system Observe-Act Loop of AI Planning Agent action Goal maximize expected reward over lifetime

6 World State Action from finite set ???? Stochastic/Probabilistic Planning: Markov Decision Process (MDP) Model Goal maximize expected reward over lifetime Markov Decision Process

7 State describes all visible info about dice Action are the different choices of dice to roll ???? Example MDP Goal maximize score or score more than opponent

8 Markov Decision Processes  An MDP has four components: S, A, R, T:  finite state set S (|S| = n)  finite action set A (|A| = m)  transition function T(s,a,s’) = Pr(s’ | s,a)  Probability of going to state s’ after taking action a in state s  bounded, real-valued reward function R(s,a)  Immediate reward we get for being in state s and taking action a  Roughly speaking the objective is to select actions in order to maximize total reward  For example in a goal-based domain R(s,a) may equal 1 for goal states and 0 for all others (or -1 reward for non-goal states)

action = roll the two 5’s … 1,1 1,2 6,6 Probabilistic state transition State

10 What is a solution to an MDP? MDP Planning Problem: Input: an MDP (S,A,R,T) Output: ????  Should the solution to an MDP be just a sequence of actions such as ( a1,a2,a3, ….) ?  Consider a single player card game like Blackjack/Solitaire.  No! In general an action sequence is not sufficient  Actions have stochastic effects, so the state we end up in is uncertain  This means that we might end up in states where the remainder of the action sequence doesn’t apply or is a bad choice  A solution should tell us what the best action is for any possible situation/state that might arise

11 Policies (“plans” for MDPs)  A solution to an MDP is a policy  Two types of policies: nonstationary and stationary  Nonstationary policies are used when we are given a finite planning horizon H  I.e. we are told how many actions we will be allowed to take  Nonstationary policies are functions from states and times to actions  π :S x T → A, where T is the non-negative integers  π (s,t) tells us what action to take at state s when there are t stages-to-go (note that we are using the convention that t represents stages/decisions to go, rather than the time step)

12 Policies (“plans” for MDPs)  What if we want to continue taking actions indefinately?  Use stationary policies  A Stationary policy is a mapping from states to actions  π: S → A  π (s) is action to do at state s (regardless of time)  specifies a continuously reactive controller  Note that both nonstationary and stationary policies assume or have these properties:  full observability of the state  history-independence  deterministic action choice

13 What is a solution to an MDP? MDP Planning Problem: Input: an MDP (S,A,R,T) Output: a policy such that ????  We don’t want to output just any policy  We want to output a “good” policy  One that accumulates lots of reward

14 Value of a Policy  How good is a policy π ?  How do we measure reward “accumulated” by π?  Value function V: S →ℝ associates value with each state (or each state and time for non-stationary π)  V π (s) denotes value of policy at state s  Depends on immediate reward, but also what you achieve subsequently by following π  An optimal policy is one that is no worse than any other policy at any state  The goal of MDP planning is to compute an optimal policy

15 What is a solution to an MDP? MDP Planning Problem: Input: an MDP (S,A,R,T) Output: a policy that achieves an “optimal value”  This depends on how we define the value of a policy  There are several choices and the solution algorithms depend on the choice  We will consider finite-horizon value

16 Finite-Horizon Value Functions  We first consider maximizing expected total reward over a finite horizon  Assumes the agent has H time steps to live  To act optimally, should the agent use a stationary or non-stationary policy?  I.e. Should the action it takes depend on absolute time?  Put another way:  If you had only one week to live would you act the same way as if you had fifty years to live?

17 Finite Horizon Problems  Value (utility) depends on stage-to-go  hence use a nonstationary policy  is k-stage-to-go value function for π  expected total reward for executing π starting in s for k time steps  Here R t and s t are random variables denoting the reward received and state at time-step t when starting in s  These are random variables since the world is stochastic

19 Computational Problems  There are two problems that are of typical interest  Policy evaluation:  Given an MDP and a policy π  Compute finite-horizon value function for any k and s  Policy optimization:  Given an MDP and a horizon H  Compute the optimal finite-horizon policy  How many finite horizon policies are there?  |A| Hn  So can’t just enumerate policies for efficient optimization

20 Computational Problems  Dynamic programming techniques can be used for both policy evaluation and optimization  Polynomial time in # of states and actions  http://web.engr.oregonstate.edu/~afern/classes/cs533/ http://web.engr.oregonstate.edu/~afern/classes/cs533/  Is polytime in # of states and actions good?  Not when these numbers are enormous!  As is the case for most realistic applications  Consider Klondike Solitaire, Computer Network Control, etc Enters Monte-Carlo Planning

21 Approaches for Large Worlds: Monte-Carlo Planning  Often a simulator of a planning domain is available or can be learned from data 21 Klondike Solitaire Fire & Emergency Response

22 Large Worlds: Monte-Carlo Approach  Often a simulator of a planning domain is available or can be learned from data  Monte-Carlo Planning: compute a good policy for an MDP by interacting with an MDP simulator 22 World Simulator Real World action State + reward

23 Example Domains with Simulators  Traffic simulators  Robotics simulators  Military campaign simulators  Computer network simulators  Emergency planning simulators  large-scale disaster and municipal  Sports domains  Board games / Video games  Go / RTS In many cases Monte-Carlo techniques yield state-of-the-art performance. Even in domains where model-based planners are applicable.

24 MDP: Simulation-Based Representation  A simulation-based representation gives: S, A, R, T, I:  finite state set S (|S|=n and is generally very large)  finite action set A (|A|=m and will assume is of reasonable size)  Stochastic, real-valued, bounded reward function R(s,a) = r  Stochastically returns a reward r given input s and a  Stochastic transition function T(s,a) = s’ (i.e. a simulator)  Stochastically returns a state s’ given input s and a  Probability of returning s’ is dictated by Pr(s’ | s,a) of MDP  Stochastic initial state function I.  Stochastically returns a state according to an initial state distribution These stochastic functions can be implemented in any language!

25 Computational Problems

26 Trajectories  We can use the simulator to observe trajectories of any policy π from any state s:  Let Traj(s, π, h) be a stochastic function that returns a length h trajectory of π starting at s.  Traj(s, π, h)  s 0 = s  For i = 1 to h-1  s i = T(s i-1, π(s i-1 ))  Return s 0, s 1, …, s h-1  The total reward of a trajectory is given by

27 Policy Evaluation

28 Sampling-Error Bound approximation due to sampling Note that the r i are samples of random variable R(Traj(s, π, h)) We can apply the additive Chernoff bound which bounds the difference between an expectation and an emprical average

29 Aside: Additive Chernoff Bound Let X be a random variable with maximum absolute value Z. An let x i i=1,…,w be i.i.d. samples of X The Chernoff bound gives a bound on the probability that the average of the x i are far from E[X] Let {x i | i=1,…, w} be i.i.d. samples of random variable X, then with probability at least we have that, equivalently

30 Aside: Coin Flip Example Suppose we have a coin with probability of heads equal to p. Let X be a random variable where X=1 if the coin flip gives heads and zero otherwise. (so Z from bound is 1) E[X] = 1*p + 0*(1-p) = p After flipping a coin w times we can estimate the heads prob. by average of x i. The Chernoff bound tells us that this estimate converges exponentially fast to the true mean (coin bias) p.

31 Sampling Error Bound approximation due to sampling We get that, with probability at least

Two Player MDP (aka Markov Games) action State/ reward action State/ reward So far we have only discussed single-player MDPs/games Your labs and competition will be 2-player zero-sum games (zero sum means sum of player rewards is zero) We assume players take turns (non-simultaneous moves) Player 1 Player 2 Markov Game

33 Simulators for 2-Player Games

34 Finite Horizon Value of Game

1 Markov Decision Processes Basics Concepts Alan Fern.

Similar presentations

Presentation on theme: "1 Markov Decision Processes Basics Concepts Alan Fern."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

1 Markov Decision Processes Basics Concepts Alan Fern.

Similar presentations

Presentation on theme: "1 Markov Decision Processes Basics Concepts Alan Fern."— Presentation transcript:

Similar presentations

About project

Feedback