1 Policies for POMDPs Minqing Hu

2 Background on Solving POMDPs
An MDP policy is a mapping from states to actions.
A POMDP policy is a mapping from probability distributions (over states) to actions.
– belief state: a probability distribution over states
– belief space: the space of all such distributions; it is continuous, hence infinite

3 Policies in MDPs
k-horizon value function: V_k(s_i) = max_a [ r(s_i, a) + Σ_j P(s_j | s_i, a) V_{k-1}(s_j) ]
The optimal policy δ* is the one where, for all states s_i and all other policies δ, V_{δ*}(s_i) ≥ V_δ(s_i).
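A minimal sketch of finite k-horizon value iteration for an MDP, assuming rewards and transitions are stored as NumPy arrays r[s, a] and P[a, s, s2] (array names are illustrative, not from the slides):

```python
import numpy as np

def mdp_value_iteration(r, P, k):
    """Finite k-horizon value iteration for a (fully observable) MDP.

    r: rewards, shape (|S|, |A|), r[s, a] = immediate reward
    P: transitions, shape (|A|, |S|, |S|), P[a, s, s2] = Pr(s2 | s, a)
    Returns the horizon-k values V and the greedy policy delta_t for each horizon.
    """
    n_states, n_actions = r.shape
    V = np.zeros(n_states)                      # horizon-0 values (no steps left)
    policy = []
    for _ in range(k):
        # Q[s, a] = r(s, a) + sum_s2 P(s2 | s, a) * V_{t-1}(s2)
        Q = r + np.stack([P[a] @ V for a in range(n_actions)], axis=1)
        policy.append(Q.argmax(axis=1))         # delta_t: best action in each state
        V = Q.max(axis=1)                       # V_t(s) = max_a Q[s, a]
    return V, policy
```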

4 Finite k-horizon POMDPs
A POMDP is specified by:
– transition probabilities P(s_j | s_i, a)
– observation probabilities P(z | a, s_j): the probability of observing z after taking action a and ending in state s_j
– immediate rewards r(s_i, a): the immediate reward of performing action a in state s_i
Objective: to find an optimal policy for the finite k-horizon POMDP, δ* = (δ_1, δ_2, …, δ_k)

5 A two-state POMDP
We can represent the belief state with a single number p, so the entire space of belief states can be represented as a line segment.
[Figure: belief space for a 2-state POMDP]

6 Belief state updating
There is a finite number of possible next belief states, given a belief state, because there are
– a finite number of actions
– a finite number of observations
b' = T(b | a, z): given a and z, b' is fully determined.
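A minimal sketch of the belief update T(b | a, z), assuming the model is stored as NumPy arrays P[a, s, s2] and O[a, s2, z] (illustrative names):

```python
import numpy as np

def belief_update(b, a, z, P, O):
    """Compute b' = T(b | a, z).

    b: belief vector over states, shape (|S|,), entries sum to 1
    P: transitions, P[a, s, s2] = Pr(s2 | s, a)
    O: observations, O[a, s2, z] = Pr(z | a, s2)
    """
    # Unnormalized: b'(s2) is proportional to Pr(z | a, s2) * sum_s Pr(s2 | s, a) * b(s)
    unnormalized = O[a, :, z] * (b @ P[a])
    pr_z = unnormalized.sum()          # Pr(z | b, a), the normalizing constant
    return unnormalized / pr_z
```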

7 The process of maintaining the belief state is Markovian: the next belief state depends only on the current belief state (and the current action and observation).
We are now back to solving an MDP policy problem, with some adaptations.

8 Value function over belief space
The belief space is continuous, so the value function could be an arbitrary function:
– b: a belief state
– V(b): its value
Problem: how can we easily represent this value function?

9 Fortunately, the finite-horizon value function is piecewise linear and convex (PWLC) for every horizon length.
[Figure: a sample PWLC function]

10 A piecewise linear function consists of linear (hyper-plane) segments:
– linear function: V(b) = Σ_i b(s_i) α(s_i) = b · α
– k-th linear segment: b · α^k
– the α-vector: α^k = (α^k(s_1), …, α^k(s_|S|))
– each linear segment, or hyper-plane, can be represented with its α-vector
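A minimal sketch of storing a PWLC value function as a set of α-vectors and evaluating it at a belief point (array names are illustrative):

```python
import numpy as np

def pwlc_value(b, alpha_vectors):
    """Evaluate a PWLC value function at belief b.

    b: belief vector, shape (|S|,)
    alpha_vectors: array of shape (K, |S|), one alpha-vector per linear segment
    Returns V(b) = max_k b . alpha^k and the index of the maximizing segment.
    """
    values = alpha_vectors @ b       # dot product of each alpha-vector with b
    k = int(values.argmax())
    return float(values[k]), k
```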

11 The value function V(b) = max_k b · α^k is a convex function.

12 The 1-horizon POMDP problem
– a single action a to execute
– starting belief state b
– ending belief state b': b' = T(b | a, z)
– immediate rewards r(s_i, a)
– terminating rewards for each state s_i, and the corresponding expected terminating reward in b' (the belief-weighted sum of the state terminating rewards)

13 The value function for t = 1, and the optimal policy for t = 1.
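One common way to write the horizon-1 value function and its policy, ignoring terminating rewards (as in the example on slide 21):

```latex
V_1(b) \;=\; \max_{a} \sum_{i} b(s_i)\, r(s_i, a),
\qquad
\delta_1(b) \;=\; \operatorname*{arg\,max}_{a} \sum_{i} b(s_i)\, r(s_i, a)
```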

14 General k-horizon value function
Use the same strategy as in the 1-horizon case. Assume that we have the optimal value function at t - 1, V*_{t-1}(·).
The value function has the same basic form as in an MDP, but is written in terms of (see the formula below)
– the current belief state
– the possible observations
– the transformed belief state
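In this notation the recursion is usually written as follows (undiscounted, to match the finite-horizon setting of these slides):

```latex
V_t^*(b) \;=\; \max_{a} \Big[ \sum_{i} b(s_i)\, r(s_i, a)
  \;+\; \sum_{z} P(z \mid b, a)\; V_{t-1}^*\big(T(b \mid a, z)\big) \Big]
```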

15 Value function: is it piecewise linear and convex?

16 Inductive proof
Base case: the horizon-1 value function is a maximum of functions that are linear in b, so it is PWLC.
Inductive hypothesis: V_{t-1} is PWLC, i.e. V_{t-1}(b') = max_k b' · α^k_{t-1}
– transformed to a sum over states: V_{t-1}(b') = max_k Σ_j b'(s_j) α^k_{t-1}(s_j)
– substitute the transformed belief state b' = T(b | a, z) into this expression

17 Inductive proof (contd.)
– write the value function at step t using the recursive definition
– collect terms to form a new α-vector at step t
– the value function at step t is again a maximum of linear functions, hence PWLC (a sketch of the key step follows)
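A sketch of the key step under the undiscounted finite-horizon setting above: when b' = T(b | a, z) is substituted into the inductive hypothesis, the normalizer P(z | b, a) cancels, and for each action a and each choice of one step t-1 α-vector per observation (index k_z for observation z) one obtains a new α-vector:

```latex
\alpha^{a,(k_z)}_{t}(s_i) \;=\; r(s_i, a)
  \;+\; \sum_{z} \sum_{j} P(s_j \mid s_i, a)\, P(z \mid a, s_j)\, \alpha^{k_z}_{t-1}(s_j),
\qquad
V_t(b) \;=\; \max_{a,\,(k_z)} \sum_{i} b(s_i)\, \alpha^{a,(k_z)}_{t}(s_i)
```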

18 Geometric interpretation of the value function, |S| = 2.
[Figure: a sample value function for |S| = 2]

19 For |S| = 3 the segments are hyper-planes, giving a finite number of regions over the belief simplex.
[Figure: a sample value function for |S| = 3]

20 POMDP value iteration example
A 2-horizon problem. Assume the POMDP has
– two states, s1 and s2
– two actions, a1 and a2
– three observations, z1, z2 and z3

21 Given belief state b = [0.25, 0.75] and terminating reward = 0.
[Figure: horizon 1 value function]

22 In the blue region the best strategy is a1; in the green region a2 is the best strategy.
[Figure: horizon 1 value function]

23 The 2-horizon value function
Construct the horizon 2 value function from the horizon 1 value function, in three steps (a code sketch of the full backup follows):
– how to compute the value of a belief state for a given action and observation
– how to compute the value of a belief state given only an action
– how to compute the actual value of a belief state
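A minimal sketch of one such backup in terms of α-vectors: for every action, enumerate one horizon 1 α-vector per observation. This is plain enumeration rather than the graphical construction shown on the following slides, and the array names (r, P, O) are assumptions of this sketch:

```python
import itertools
import numpy as np

def pomdp_backup(alphas_prev, r, P, O):
    """One exact value-iteration backup: horizon t-1 alpha-vectors -> horizon t.

    alphas_prev: list of alpha-vectors, each of shape (|S|,)
    r: rewards, r[s, a]
    P: transitions, P[a, s, s2] = Pr(s2 | s, a)
    O: observations, O[a, s2, z] = Pr(z | a, s2)
    Returns a list of (alpha, action) pairs for horizon t (not pruned).
    """
    n_states, n_actions = r.shape
    n_obs = O.shape[2]
    new_alphas = []
    for a in range(n_actions):
        # g[z][k][s] = sum_s2 P(s2 | s, a) * Pr(z | a, s2) * alpha_k(s2)
        g = [[P[a] @ (O[a, :, z] * alpha) for alpha in alphas_prev]
             for z in range(n_obs)]
        # choose one previous alpha-vector (index k_z) for every observation z
        for choice in itertools.product(range(len(alphas_prev)), repeat=n_obs):
            alpha = r[:, a] + sum(g[z][choice[z]] for z in range(n_obs))
            new_alphas.append((alpha, a))
    return new_alphas

def best_value(b, new_alphas):
    """V_t(b) and the best first action at belief b."""
    value, action = max((float(b @ alpha), a) for alpha, a in new_alphas)
    return value, action
```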

24 Step 1: a restricted problem. Given a belief state b, what is the value of doing action a1 first and receiving observation z1?
The value of a belief state for horizon 2 is the value of the immediate action plus the value of the next action.

25 b' = T(b | a1, z1)
[Figure: the value of a fixed action and observation, showing the immediate reward function and the horizon 1 value function]

26 S(a, z) is a function which directly gives the value of each belief state after the action a1 is taken and observation z1 is seen.
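One common way to define this function so that it stays piecewise linear in b is to fold in the observation probability, which cancels the normalizer of the belief update:

```latex
S(a, z)(b) \;=\; P(z \mid b, a)\; V_1\big(T(b \mid a, z)\big)
  \;=\; \max_{k} \sum_{i} b(s_i) \sum_{j} P(s_j \mid s_i, a)\, P(z \mid a, s_j)\, \alpha^{k}_{1}(s_j)
```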

27 Value function of horizon 2, for the fixed action and observation: the immediate rewards plus S(a, z).
Step 1: done.

28 Step 2: how to compute the value of a belief state given only the action.
[Figure: transformed value function]

29 So what is the horizon 2 value of a belief state, given a particular action a1? It depends on:
– the value of doing action a1
– what action we do next, which depends on the observation received after action a1

30 …plus the immediate reward of doing action a1 in b (a formula is sketched below).
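Putting slides 29 and 30 together, and using the definition of S(a, z) sketched above, the horizon 2 value of b for the fixed first action a1 can be written as:

```latex
V_2^{a1}(b) \;=\; \sum_{i} b(s_i)\, r(s_i, a1) \;+\; \sum_{z} S(a1, z)(b)
```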

31 [Figures: the transformed value functions for all observations, and the belief point b in the transformed value function partitions]

32 Partition for action a1

33 The figure lets us easily see what the best strategies are after doing action a1.
The value of the belief point b at horizon 2 = the immediate reward from doing action a1 + the values of the functions S(a1,z1), S(a1,z2) and S(a1,z3) at belief point b.

34 Each line segment is constructed by adding the immediate reward line segment to the line segments for each future strategy.
[Figure: horizon 2 value function and partition for action a1]

35 Repeat the process for action a2.
[Figure: value function and partition for action a2]

36 Step 3: the best horizon 2 policy, obtained by combining the a1 and a2 value functions and keeping the best at each belief point (see the pruning sketch below).
[Figures: combined a1 and a2 value functions; value function for horizon 2]
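A minimal sketch of combining the per-action α-vector sets and discarding vectors that are nowhere best. For simplicity it checks dominance at sampled belief points rather than with the exact linear-program test; the names and the sampling approach are assumptions of this sketch:

```python
import numpy as np

def combine_and_prune(alpha_sets, n_samples=1000, seed=0):
    """Union the per-action alpha-vector sets and keep only vectors that are
    best at some sampled belief point (approximate pruning).

    alpha_sets: dict mapping action -> list of alpha-vectors, each shape (|S|,)
    Returns a list of (alpha, action) pairs forming the combined value function.
    """
    pairs = [(alpha, a) for a, alphas in alpha_sets.items() for alpha in alphas]
    n_states = len(pairs[0][0])
    rng = np.random.default_rng(seed)
    # Sample belief points uniformly from the simplex (Dirichlet(1, ..., 1)).
    beliefs = rng.dirichlet(np.ones(n_states), size=n_samples)
    values = beliefs @ np.stack([alpha for alpha, _ in pairs]).T  # (n_samples, K)
    winners = set(values.argmax(axis=1).tolist())
    return [pairs[k] for k in sorted(winners)]
```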

37 Repeat the process to obtain the value functions of the 3-horizon, …, and k-horizon POMDP.

38 An alternative interpretation of the value function: a decision (policy) tree
– nodes represent an action decision
– branches represent the observation made
Too many trees to be generated! (a quick count follows)
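As a rough count: a complete k-horizon policy tree has (|Z|^k - 1)/(|Z| - 1) nodes, each labeled with one of |A| actions, so the number of distinct trees is |A| raised to that power. For the example above (|A| = 2, |Z| = 3):

```latex
\#\text{trees at horizon } k \;=\; |A|^{\,(|Z|^{k} - 1)/(|Z| - 1)},
\qquad
k = 2:\; 2^{(9-1)/2} = 2^{4} = 16,
\qquad
k = 3:\; 2^{(27-1)/2} = 2^{13} = 8192
```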