Optimal Policies for POMDP Presented by Alp Sardağ.

As Much Reward As Possible? A purely greedy agent maximizes only the immediate reward; what we care about is the reward accumulated over many decisions.

For how long does the agent make decisions? Finite horizon: a fixed number of decision steps. Infinite horizon (with a discount factor): the values will converge, and it is a good model when the number of decision steps is not given in advance.
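For reference, the objective maximized in each setting can be written as follows (standard definitions, not taken from the slides):

V^\pi = \mathbb{E}\left[\sum_{t=0}^{k-1} r_t\right]   (finite horizon, k decision steps)
V^\pi = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right],  0 < \gamma < 1   (infinite horizon, discounted)

The discount factor makes the infinite sum converge, which is why the values converge in the infinite-horizon case.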

Policy: a general plan. Deterministic: one action for each state. Stochastic: a probability distribution over the set of actions. Stationary: can be applied at any time. Non-stationary: depends on the time step. Memoryless: does not use the history.

Finite Horizon: the agent has to make k decisions, so the optimal policy is non-stationary (it depends on how many steps remain).

Infinite Horizon: we do not need a different policy for each time step. With a discount factor 0 < γ < 1, the infinite horizon helps us find a stationary policy: instead of a non-stationary policy π = {δ_0, δ_1, ..., δ_t}, we can apply the same decision rule at every step, π = {δ, δ, ..., δ}.

MDP: the finite-horizon problem is solved with dynamic programming. The infinite-horizon problem gives |S| equations in |S| unknowns and can be solved with linear programming.
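The standard MDP equations behind that statement are the following (my notation, not copied from the slides):

V_t(s) = \max_a \Big[ R(s,a) + \sum_{s'} T(s' \mid s, a)\, V_{t-1}(s') \Big]   (finite horizon, dynamic-programming backup)
V(s) = \max_a \Big[ R(s,a) + \gamma \sum_{s'} T(s' \mid s, a)\, V(s') \Big]   (infinite horizon: |S| equations in the |S| unknowns V(s))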

MDP: actions may be stochastic, so you do not know in advance which state you will end up in; what an MDP does not deal with is uncertainty in the observations of that state.

POMDP Model: a finite set of states, a finite set of actions, transition probabilities (as in an MDP), an observation model, and reinforcement (rewards).

POMDP Model: the immediate reward for performing action a in state i.

POMDP Model: a belief state π is a probability distribution over states, π = {π_0, π_1, ..., π_{|S|}}. The drawback is that a world model is needed to compute the next belief state. From Bayes' rule:
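The update that Bayes' rule gives is the standard belief-update formula (the slide's own equation image is not reproduced here):

\pi'(s') = \Pr(s' \mid a, o, \pi) = \frac{O(o \mid s', a) \sum_{s} T(s' \mid s, a)\, \pi(s)}{\Pr(o \mid a, \pi)},
\qquad \Pr(o \mid a, \pi) = \sum_{s'} O(o \mid s', a) \sum_{s} T(s' \mid s, a)\, \pi(s)

where T is the transition model and O the observation model; this is why the world model is needed.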

POMDP Model: control dynamics for a POMDP. A state estimator (SE) updates the belief state from the previous belief, the last action, and the current observation; the policy then maps the belief state to the next action.

Policies for POMDPs: there are infinitely many belief states, so representing value functions as tables is infeasible. For horizon length 1 the value is just the best expected immediate reward. The agent has no control over which observation it receives (something not found in MDPs), so it must weight the value over all possible observations.

Value functions for POMDPs: the formula is complex; however, if the value function is piecewise linear (a way of representing a value function over a continuous space), it can be written as:
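In the standard piecewise-linear-and-convex form (my rendering; the slide's equation image is missing), the value function is the upper envelope of a finite set of α vectors:

V_t(\pi) = \max_{\alpha \in \Gamma_t} \alpha \cdot \pi = \max_{\alpha \in \Gamma_t} \sum_{s} \alpha(s)\, \pi(s)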

Value functions for POMDPs

Value Functions for POMDPs: given V_{t-1}, V_t can be calculated. Keep the action that gives rise to each specific α vector. To find the optimal policy at a belief state, just maximize over all α vectors and take the action associated with the winning vector.
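The backup that takes V_{t-1} to V_t has the standard form below (again my rendering, not the slide's image; γ = 1 in the undiscounted finite-horizon case):

V_t(\pi) = \max_{a} \Big[ \sum_{s} \pi(s)\, r(s, a) + \gamma \sum_{o} \Pr(o \mid a, \pi)\, V_{t-1}(\pi^{a,o}) \Big]

where π^{a,o} is the belief reached from π by taking action a and observing o (the Bayes update above), and the maximizing a is the action attached to the resulting α vector.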

Geometric Interpretation of the VF. Belief simplex, 2-dimensional case: with two states, the belief simplex is the line segment of beliefs parameterized by π(s_1), and the value function is the upper surface of the lines defined by the α vectors.

Geometric Interpretation of the VF, 3-dimensional case: with three states, the belief simplex is a triangle, and the value function is the upper surface of the planes defined by the α vectors.

Alternate VF Interpretation: if the initial belief state is given, a decision tree can enumerate every possible policy for the k-horizon problem.

Alternate VF Interpretation: the number of nodes per tree and the number of possible trees (|A| possible actions for each node) grow very quickly; see the counts sketched below. If we could somehow generate only the useful trees, the complexity would be greatly reduced. Previously, to create the entire value function we would have to generate an α vector for every belief π, which is far too many for the algorithm to work.
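The counts behind that statement, for a k-horizon tree with observation set Ω (standard policy-tree counting, not copied from the slide):

\text{nodes per tree} = \sum_{i=0}^{k-1} |\Omega|^i = \frac{|\Omega|^k - 1}{|\Omega| - 1},
\qquad \text{possible trees} = |A|^{\text{nodes per tree}}

For example, with |A| = 3, |Ω| = 2 and k = 4 (a four-step tiger problem) there are 15 nodes per tree and already 3^15 ≈ 1.4 × 10^7 possible trees.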

POMDP Solutions. For the finite horizon: iterate over time steps, computing V_t from V_{t-1}, and retain all intermediate solutions. For finitely transient policies, the same idea applies to the infinite horizon: iterate until the optimal value functions for two consecutive time steps are the same. Once the infinite-horizon solution is found, all intermediate results can be discarded.

POMDP Solutions: given V_{t-1}, V_t can be calculated for one belief π from the previous formula, but this gives no knowledge of the region over which that α vector is optimal (Sondik). There are too many beliefs π at which to construct the entire VF. One possible solution: choose random belief points; if the number of points is large enough, hopefully no true vector is missed. But how many points should be chosen? There is no guarantee (see the sketch below). Instead, optimal policies are found by developing a systematic algorithm that explores the entire continuous space of beliefs.
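To see why randomly chosen belief points come with no guarantee, here is a small sketch (my own illustration in Python with numpy; the α vectors are made up) that records which vectors a random-point scheme would actually discover:

import numpy as np

def sample_beliefs(num_states, num_points, rng):
    # Uniform samples from the belief simplex via a flat Dirichlet.
    return rng.dirichlet(np.ones(num_states), size=num_points)

def vectors_found_by_sampling(alpha_vectors, num_points, seed=0):
    """Indices of the alpha vectors that are maximal at at least one of the
    randomly sampled belief points; a useful vector that never wins at a
    sampled point would simply be missed by this naive scheme."""
    rng = np.random.default_rng(seed)
    alphas = np.asarray(alpha_vectors, dtype=float)   # shape (M, |S|)
    beliefs = sample_beliefs(alphas.shape[1], num_points, rng)
    values = beliefs @ alphas.T                       # shape (num_points, M)
    return sorted(set(np.argmax(values, axis=1).tolist()))

# Made-up vectors over a 2-state belief space; the middle one is optimal
# only on a narrow region of beliefs and is easy to miss with few samples.
alphas = [[1.0, 0.0], [0.56, 0.56], [0.0, 1.0]]
print(vectors_found_by_sampling(alphas, num_points=5))
print(vectors_found_by_sampling(alphas, num_points=500))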

Tiger Problem. Actions: open the left door, open the right door, or listen; listening is not perfectly accurate. States: s_0, tiger on the left; s_1, tiger on the right. Rewards: +10 for opening the correct door, -100 for opening the wrong door, -1 for listening. Initially the belief is uniform: π = (0.5, 0.5).
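A small worked sketch of these numbers (my own code; the 0.85 listening accuracy is the value from the standard Kaelbling, Littman and Cassandra formulation and is an assumption here, since the slide only says listening is inaccurate):

# Minimal tiger-problem arithmetic: one-step expected rewards and one belief update.
P_CORRECT_HEARING = 0.85                      # assumed listening accuracy
R_LISTEN, R_CORRECT, R_WRONG = -1.0, 10.0, -100.0

def expected_immediate_reward(belief):
    """Expected one-step reward of each action at belief = (P(tiger left), P(tiger right))."""
    p_left, p_right = belief
    return {
        "open-left": p_left * R_WRONG + p_right * R_CORRECT,
        "open-right": p_left * R_CORRECT + p_right * R_WRONG,
        "listen": R_LISTEN,
    }

def belief_after_hearing_left(belief):
    """Bayes update of the belief after listening and hearing the tiger on the left."""
    p_left, p_right = belief
    num_left = P_CORRECT_HEARING * p_left
    num_right = (1.0 - P_CORRECT_HEARING) * p_right
    total = num_left + num_right
    return (num_left / total, num_right / total)

b0 = (0.5, 0.5)
print(expected_immediate_reward(b0))   # with these rewards, listening beats opening either door
print(belief_after_hearing_left(b0))   # belief shifts toward "tiger on the left"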

Tiger Problem

First action, intuitively: at the initial belief, opening a door has a strongly negative expected value (about -55) while listening costs only -1, so for horizon length 1 the optimal action is to listen.

Tiger Problem For Horizon length 2:

Tiger Problem: for horizon length 4 the solution has some nice features: belief states are transformed, for the same action and observation, to a single belief state, and the observations that are made precisely define the nodes of the graph that will be traversed.

Infinite Horizon: the finite-horizon solution is cumbersome; there is a different policy for the same belief point at each time step, and a different set of vectors for each time step. Adding a discount factor to the tiger problem, after the 56th step the underlying vectors of consecutive steps are only slightly different:

Infinite Horizon for the Tiger Problem: in this way the finite-horizon algorithms can be used for infinite-horizon problems. The advantage of the infinite horizon is that only the last (converged) policy needs to be kept.

Policy Graphs: a way to encode the policy without keeping the α vectors, so no dot products are needed during execution; edges lead from a beginning state (node) to an end state (node).
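A minimal sketch of what executing such a graph looks like (illustrative only; the node structure and the "listen until two consistent observations" controller below are my own plausible example for the tiger problem, not the graph from the slides):

from dataclasses import dataclass

@dataclass
class Node:
    action: str
    successors: dict  # observation -> name of the next node

# A plausible converged controller: keep listening until two consistent
# observations have been heard, then open the opposite door and start over.
graph = {
    "start": Node("listen", {"hear-left": "heard-left", "hear-right": "heard-right"}),
    "heard-left": Node("listen", {"hear-left": "open-right!", "hear-right": "start"}),
    "heard-right": Node("listen", {"hear-left": "start", "hear-right": "open-left!"}),
    "open-right!": Node("open-right", {"hear-left": "start", "hear-right": "start"}),
    "open-left!": Node("open-left", {"hear-left": "start", "hear-right": "start"}),
}

def run(graph, start, observations):
    """Execute the controller: emit an action, consume an observation, move on."""
    node = start
    for obs in observations:
        print(graph[node].action)
        node = graph[node].successors[obs]

run(graph, "start", ["hear-left", "hear-left", "hear-right"])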

Finite Transience: all the belief states within a particular partition element are transformed to the same other element for a particular action and observation. For policies that are not finitely transient, exactly optimal policy graphs cannot be constructed.

Overview of Algorithms: all are performed iteratively, and all try to find the set of vectors that defines both the value function and the optimal policy at each time step. They fall into two separate classes: (1) given V_{t-1}, generate a superset of V_t and reduce that set until the optimal V_t is found (Monahan and Eagle); (2) given V_{t-1}, construct subsets of the optimal V_t, and let these subsets grow until the optimal V_t is found.

Monahan's Algorithm: easy to implement, but do not expect it to solve anything but the smallest of problems. It provides the background for understanding the other algorithms.

Monahan Enumeration Phase: generate all vectors. The number of generated vectors is |A|·M^|Ω|, where M is the number of vectors from the previous step and Ω is the set of observations.

Monahan Reduction Phase: all vectors could be kept, maximizing over all of them each time, but that is a lot of excess baggage, and the number of vectors at the next step would be even larger. A linear program is used to trim away the useless vectors.

Monahan Reduction Phase: for a vector to be useful, there must be at least one belief point at which it gives a larger value than every other vector:
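A sketch of that test in code (my own illustration using scipy's linprog, not Monahan's original LP formulation): maximize the margin d by which a candidate vector beats every other vector at some belief; if no positive margin exists, the vector is useless.

import numpy as np
from scipy.optimize import linprog

def is_useful(alpha, others, eps=1e-9):
    """Keep alpha iff some belief pi gives it a strictly larger value than every
    other vector. Variables are [pi_1..pi_n, d]; maximize d subject to
    pi.(alpha - alpha') >= d for all alpha', with pi on the probability simplex."""
    n = len(alpha)
    if not others:
        return True
    c = np.zeros(n + 1)
    c[-1] = -1.0                                                   # maximize d == minimize -d
    A_ub = np.array([np.append(o - alpha, 1.0) for o in others])   # pi.(o - alpha) + d <= 0
    b_ub = np.zeros(len(others))
    A_eq = np.append(np.ones(n), 0.0).reshape(1, -1)               # sum(pi) == 1
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return bool(res.success and res.x[-1] > eps)

def prune(vectors):
    """Discard every vector that is not maximal at any belief point."""
    vectors = [np.asarray(v, dtype=float) for v in vectors]
    return [v for i, v in enumerate(vectors)
            if is_useful(v, vectors[:i] + vectors[i + 1:])]

# The middle vector is dominated everywhere and gets trimmed away.
print(prune([[1.0, 0.0], [0.4, 0.4], [0.0, 1.0]]))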

Monahan Algorithm

Monahan’s LP Complication

Future Work Eagle’s Variant of Monahan’s Algorithm. Sondik’s One-Pass Algorithm. Cheng’s Relaxed Region Algorithm. Cheng’s Linear Support Algorithm.