Markov Decision Process (MDP)
- S: a set of states
- A: a set of actions
- Pr(s'|s,a): transition model (aka M^a_{s,s'})
- C(s,a,s'): cost model
- G: set of goals
- s_0: start state
- γ: discount factor
- R(s,a,s'): reward model

Value function: expected long-term reward from the state
Q values: expected long-term reward of doing a in s
V(s) = max_a Q(s,a)
Greedy policy w.r.t. a value function
Value of a policy
Optimal value function
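The components above map directly onto a small data structure. Below is a minimal Python sketch, assuming a dictionary-based representation; all names (MDP, q_value, greedy_action, etc.) are illustrative and not from the slides.

```python
# Minimal MDP container, Q-values, and a greedy policy, following the slide's definitions.
# Illustrative sketch; names are assumptions, not taken from the slides.
from dataclasses import dataclass

@dataclass
class MDP:
    states: set
    actions: dict              # state -> list of applicable actions
    transitions: dict          # (s, a) -> {s': Pr(s'|s, a)}
    reward: dict               # (s, a, s') -> R(s, a, s')
    gamma: float = 0.95        # discount factor

def q_value(mdp, V, s, a):
    """Expected long-term reward of doing a in s, given value estimates V."""
    return sum(p * (mdp.reward.get((s, a, s2), 0.0) + mdp.gamma * V[s2])
               for s2, p in mdp.transitions[(s, a)].items())

def greedy_action(mdp, V, s):
    """Greedy policy w.r.t. the value function V: pick argmax_a Q(s, a)."""
    return max(mdp.actions[s], key=lambda a: q_value(mdp, V, s, a))
```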

Examples of MDPs
- Goal-directed, indefinite-horizon, cost-minimization MDP: most often studied in the planning community
- Infinite-horizon, discounted-reward-maximization MDP: most often studied in reinforcement learning
- Goal-directed, finite-horizon, probability-maximization MDP: also studied in the planning community
- Oversubscription planning (non-absorbing goals, reward-maximization MDP): a relatively recent model

SSP: Stochastic Shortest Path Problem (an MDP with initial and goal states)
- General MDPs don't have a notion of an "initial" and a "goal" state (a process orientation instead of a "task" orientation)
  – Goals are, in effect, modeled by reward functions, which allows pretty expressive goals (in theory)
  – Normal MDP algorithms don't use initial-state information (since the policy is supposed to cover the entire search space anyway); one could consider "envelope extension" methods: compute a "deterministic" plan (which gives the policy for some of the states), then extend the policy to other states that are likely to be reached during execution
  – RTDP methods
- SSPs are a special case of MDPs where
  – (a) the initial state is given
  – (b) there are absorbing goal states
  – (c) actions have costs, and all states have zero rewards
- A proper policy for an SSP is a policy that is guaranteed to ultimately put the agent in one of the absorbing goal states
- For an SSP, it is worth finding a partial policy that only covers the "relevant" states (states reachable from the initial state, on the way to the goal, under some optimal policy)
  – Value/Policy Iteration don't consider the notion of relevance
  – Consider "heuristic state search" algorithms: the heuristic can be seen as an "estimate" of the value of a state

Bellman Equations for Cost-Minimization MDP (absorbing goals) [also called Stochastic Shortest Path]
- Define J*(s) (the optimal cost) as the minimum expected cost to reach a goal from this state.
- J* should satisfy the following equation:
  J*(s) = 0  if s ∈ G
  J*(s) = min_a Q*(s,a) = min_a Σ_{s'} Pr(s'|s,a) [ C(s,a,s') + J*(s') ]  otherwise
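A single Bellman backup for this cost-minimization equation can be written directly from the definition. The sketch below is illustrative (function and argument names are mine, not the slides'), assuming the same dictionary-based model as the earlier snippet, with a cost table in place of rewards.

```python
# One Bellman backup for the cost-minimization (SSP) equation above.
# transitions[(s, a)] maps s' -> Pr(s'|s, a); cost[(s, a, s2)] is C(s, a, s').
# Illustrative sketch; names are assumptions, not taken from the slides.

def bellman_backup_cost(s, actions, transitions, cost, J, goals):
    """Return (updated J*(s) estimate, greedy action) from current cost-to-go estimates J."""
    if s in goals:
        return 0.0, None                      # absorbing goal states cost nothing
    best_q, best_a = float("inf"), None
    for a in actions[s]:
        q = sum(p * (cost[(s, a, s2)] + J[s2])
                for s2, p in transitions[(s, a)].items())
        if q < best_q:
            best_q, best_a = q, a
    return best_q, best_a
```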

Bellman Equations for Infinite-Horizon Discounted-Reward-Maximization MDP
- Define V*(s) (the optimal value) as the maximum expected discounted reward from this state.
- V* should satisfy the following equation:
  V*(s) = max_a Q*(s,a) = max_a Σ_{s'} Pr(s'|s,a) [ R(s,a,s') + γ V*(s') ]
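Value iteration simply applies this backup to every state until the values stop changing. Below is a minimal sketch under the same assumed dictionary-based representation (all names are illustrative, not from the slides).

```python
# Value iteration for the discounted reward-maximization Bellman equation.
# states: iterable of states; actions[s]: applicable actions (assumed non-empty);
# transitions[(s, a)]: {s': Pr(s'|s,a)}; reward[(s, a, s')]: R(s,a,s'); gamma: discount.
# Illustrative sketch; names are assumptions, not taken from the slides.

def value_iteration(states, actions, transitions, reward, gamma, eps=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q_best = max(
                sum(p * (reward[(s, a, s2)] + gamma * V[s2])
                    for s2, p in transitions[(s, a)].items())
                for a in actions[s]
            )
            delta = max(delta, abs(q_best - V[s]))
            V[s] = q_best
        if delta < eps:        # stop once no state's value changes by more than eps
            return V
```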

Bellman Equations for Probability-Maximization MDP
- Define P*(s,t) (the optimal probability) as the maximum probability of reaching a goal from this state within t timesteps.
- P* should satisfy the following equation:
  P*(s,t) = 1  if s ∈ G
  P*(s,t) = max_a Σ_{s'} Pr(s'|s,a) P*(s', t-1)  otherwise, with P*(s,0) = 0 for s ∉ G
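The corresponding finite-horizon dynamic program just iterates this backup from t = 0 up to the horizon. The sketch below reads t as "timesteps to go" and uses the same assumed dictionary-based model; names are illustrative, not from the slides.

```python
# Finite-horizon dynamic programming for probability maximization,
# reading t as "timesteps to go". Illustrative sketch; names are not from the slides.
# Non-goal states are assumed to have at least one applicable action.

def prob_of_reaching_goal(states, actions, transitions, goals, horizon):
    P = {s: (1.0 if s in goals else 0.0) for s in states}   # t = 0
    for _ in range(horizon):                                 # t = 1 .. horizon
        P_next = {}
        for s in states:
            if s in goals:
                P_next[s] = 1.0
            else:
                P_next[s] = max(
                    sum(p * P[s2] for s2, p in transitions[(s, a)].items())
                    for a in actions[s]
                )
        P = P_next
    return P     # P[s] approximates P*(s, horizon)
```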

Modeling Soft-Goal Problems as Deterministic MDPs
- Consider the net-benefit problem, where actions have costs, goals have utilities, and we want a plan with the highest net benefit. How do we model this as an MDP?
  – (wrong idea): Make every state in which any subset of goals holds into a sink state, with reward equal to the cumulative sum of the utilities of those goals. Problem: what if achieving g1 & g2 necessarily leads you through a state where g1 is already true?
  – (correct version): Add a new fluent called "done" and a dummy action called Done-Deal. Done-Deal is applicable in any state and asserts the fluent "done". All "done" states are sink states, and the reward of each is the sum of the utilities of the goals that hold in that state.
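As a concrete reading of the "correct version", the sketch below adds a hypothetical Done-Deal action: applicable everywhere, it leads to an absorbing "done" copy of the state and collects the utilities of whatever goals currently hold. The names and the state representation are assumptions for illustration, not from the slides.

```python
# Sketch of the "Done-Deal" compilation for net-benefit planning.
# A state is a frozenset of fluents; goal_utility maps goal fluent -> utility.
# Illustrative only; names and representation are not from the slides.

def apply_done_deal(state, goal_utility):
    """Return the absorbing 'done' state and the reward collected on entering it."""
    done_state = frozenset(state | {"done"})          # assert the 'done' fluent
    reward = sum(u for g, u in goal_utility.items() if g in state)
    return done_state, reward
```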

Ideas for Efficient Algorithms
- Use heuristic search (and reachability information)
  – LAO*, RTDP
- Use execution and/or simulation
  – "Actual execution": reinforcement learning (a main motivation for RL is to "learn" the model)
  – "Simulation": simulate the given model to sample possible futures; policy rollout, hindsight optimization, etc. (see the rollout sketch below)
- Use "factored" representations
  – Factored representations for actions, reward functions, values, and policies
  – Directly manipulate factored representations during the Bellman update
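As one concrete instance of the "simulation" idea, a Monte Carlo policy rollout estimates Q(s,a) by sampling futures from the model under a base policy. The sketch below is illustrative; the sampling helper and all names are assumptions, not from the slides.

```python
import random

# Monte Carlo policy rollout: estimate Q(s, a) by simulating the given model.
# transitions[(s, a)] is {s': Pr(s'|s,a)}; reward[(s, a, s')] is R(s,a,s').
# base_policy(s) returns the action the rollout policy would take in s.
# Illustrative sketch; names are assumptions, not taken from the slides.

def sample_next(transitions, s, a):
    succs, probs = zip(*transitions[(s, a)].items())
    return random.choices(succs, weights=probs)[0]

def rollout_q(s, a, transitions, reward, base_policy, gamma, depth=50, n_samples=100):
    total = 0.0
    for _ in range(n_samples):
        state, action, ret, discount = s, a, 0.0, 1.0
        for _ in range(depth):
            nxt = sample_next(transitions, state, action)
            ret += discount * reward[(state, action, nxt)]
            discount *= gamma
            state, action = nxt, base_policy(nxt)
        total += ret
    return total / n_samples      # average discounted return over sampled futures
```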

Heuristic Search vs. Dynamic Programming (Value/Policy Iteration)
- VI and PI approaches use the dynamic programming update: set the value of a state in terms of the maximum expected value achievable by doing actions from that state.
- They do the update for every state in the state space
  – Wasteful if we know the initial state(s) the agent is starting from
- Heuristic search (e.g., A*/AO*) explores only the part of the state space that is actually reachable from the initial state
- Even within the reachable space, heuristic search can avoid visiting many of the states
  – depending on the quality of the heuristic used
- But what is the heuristic?
  – An admissible heuristic is a lower bound on the cost to reach a goal from any given state
  – i.e., it is a lower bound on J*!

Real-Time Dynamic Programming [Barto, Bradtke, Singh '95]
- Trial: simulate the greedy policy starting from the start state; perform a Bellman backup on each visited state
- RTDP: repeat trials until the cost function converges
- RTDP was originally introduced for reinforcement learning
  – For RL, instead of "simulate" you "execute"
  – You also have to do "exploration" in addition to "exploitation": with probability p follow the greedy policy, with probability 1-p pick a random action
- What if we simulate the action's effects with noise (rather than exactly w.r.t. its transition probabilities)?
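A single RTDP trial can be sketched as follows: start at s_0, repeatedly back up the visited state, take the greedy action under the current cost-to-go estimates, and stop at a goal (or a depth cutoff). The sketch reuses the hypothetical bellman_backup_cost and sample_next helpers from the earlier snippets; all names are illustrative, not from the slides.

```python
# One RTDP trial for the cost-minimization (SSP) setting, following the slide.
# J is the current cost-to-go estimate (ideally initialised with an admissible heuristic).
# Reuses the hypothetical bellman_backup_cost and sample_next helpers sketched earlier.

def rtdp_trial(s0, goals, actions, transitions, cost, J, max_depth=1000):
    s = s0
    for _ in range(max_depth):
        if s in goals:
            break
        # Bellman backup on the visited state; also gives the greedy action under current J
        J[s], a_greedy = bellman_backup_cost(s, actions, transitions, cost, J, goals)
        # Simulate the greedy action to sample the next state
        s = sample_next(transitions, s, a_greedy)
    return J

def rtdp(s0, goals, actions, transitions, cost, J, n_trials=1000):
    # Repeat trials; in practice one would stop once J has converged at s0.
    for _ in range(n_trials):
        rtdp_trial(s0, goals, actions, transitions, cost, J)
    return J
```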

RTDP Trial (figure): from s_0, compute Q_{n+1}(s_0,a) for each action a1, a2, a3 using the current estimates J_n of the successor states; J_{n+1}(s_0) = min_a Q_{n+1}(s_0,a), and the greedy action here is a2; the simulated trajectory continues toward the goal.
Note that the value function is being updated at each level. How about waiting until you hit the goal and then updating everyone?

Greedy "On-Policy" RTDP without Execution
- Using the current utility values, select the action with the highest expected utility (the greedy action) at each state, until you reach a terminating state.
- Update the values along this path.
- Loop back until the values stabilize.

Comments
- Properties
  – If all states are visited infinitely often, then J_n → J*
  – Only relevant states will be considered; a state is relevant if the optimal policy could visit it
  – Notice the emphasis on "optimal policy": just because a rough neighborhood surrounds the National Mall doesn't mean you will need to know what to do in that neighborhood
- Advantages
  – Anytime: more probable states are explored quickly
- Disadvantages
  – Complete convergence is slow!
  – No termination condition
- Do we care about complete convergence? Think Capt. Sullenberger.

Labeled RTDP [Bonet & Geffner '03]
- Initialise J_0 with an admissible heuristic ⇒ J_n monotonically increases
- Label a state as solved if J_n for that state has converged
- Backpropagate the 'solved' labeling
- Stop trials when they reach any solved state
- Terminate when s_0 is solved
(Figures: if the best action from s leads to the goal G and the alternative actions have high Q-costs, J(s) won't change, so s can be labeled solved; when s reaches G through t, both s and t get solved together.)
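One way to read the 'solved' labeling is a residual check over the greedy-policy graph: a state is labeled solved once every state reachable from it under the current greedy policy has a Bellman residual below some epsilon. The sketch below is a simplified reading of that idea, not Bonet & Geffner's exact CheckSolved procedure, and it reuses the hypothetical bellman_backup_cost helper from the earlier snippet.

```python
# Simplified 'solved' check for Labeled RTDP: label s solved once every state
# reachable from s under the current greedy policy has Bellman residual < eps.
# A simplified reading, not the exact procedure of [Bonet & Geffner'03];
# reuses the hypothetical bellman_backup_cost helper sketched earlier.

def check_solved(s, goals, actions, transitions, cost, J, solved, eps=1e-4):
    open_list, seen, all_converged = [s], {s}, True
    while open_list:
        u = open_list.pop()
        if u in goals or u in solved:
            continue
        new_value, a_greedy = bellman_backup_cost(u, actions, transitions, cost, J, goals)
        if abs(new_value - J[u]) > eps:
            all_converged = False
            J[u] = new_value          # still make the backup useful
            continue                  # don't expand below an unconverged state
        for s2 in transitions[(u, a_greedy)]:
            if s2 not in seen:
                seen.add(s2)
                open_list.append(s2)
    if all_converged:
        solved.update(seen)           # label the explored greedy envelope as solved
    return all_converged
```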

Properties
- Admissible J_0 ⇒ optimal J*
- Heuristic-guided: explores a subset of the reachable state space
- Anytime: focuses attention on more probable states
- Fast convergence: focuses attention on unconverged states
- Terminates in finite time

Recent Advances: Focused RTDP [Smith & Simmons '06]
- Similar to Bounded RTDP, except with a more sophisticated definition of priority that combines the gap and the probability of reaching the state
- Adaptively increases the maximum trial length

Recent Advances: Learning DFS [Bonet & Geffner '06]
- An Iterative Deepening A* equivalent for MDPs
- Finds strongly connected components to check whether a state is solved

Other Advances
- Ordering the Bellman backups to maximise information flow [Wingate & Seppi '05] [Dai & Hansen '07]
- Partitioning the state space and combining value iterations from different partitions [Wingate & Seppi '05] [Dai & Goldsmith '07]
- External-memory version of value iteration [Edelkamp, Jabbar & Bonet '07]
- …