Meeting 3: POMDP (Partially Observable MDP). Presenters: 阮鶴鳴, 李運寰 (Computer Science, senior year). Advisor: Prof. 李琳山

References: "Planning and Acting in Partially Observable Stochastic Domains", Leslie Pack Kaelbling, Michael L. Littman, Anthony R. Cassandra; Artificial Intelligence, 1998. "Spoken Dialogue Management Using Probabilistic Reasoning", Nicholas Roy, Joelle Pineau, and Sebastian Thrun; ACL 2000.

MDP (Markov Decision Process) An MDP model contains:
–A set of states S
–A set of actions A
–A state transition description T (deterministic or stochastic)
–A reward function R(s, a)

MDP For MDPs we can compute the optimal policy π and use it to act by simply executing π(s) for the current state s. What happens if the agent is no longer able to determine the state it is currently in with complete reliability?
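
A minimal sketch of how π could be computed for a tabular MDP, assuming arrays T[a, s, s'] and R[s, a] (these names and the discount value are illustrative, not from the slides):

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, eps=1e-6):
    """T[a, s, s']: transition probabilities, R[s, a]: rewards."""
    n_actions, n_states, _ = T.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_{s'} T(s, a, s') V(s')
        Q = R + gamma * np.einsum('ast,t->sa', T, V)
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < eps:
            break
        V = V_new
    policy = Q.argmax(axis=1)   # pi(s): act by executing policy[s]
    return V, policy
```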

POMDP A POMDP model contains:
–A set of states S
–A set of actions A
–A state transition description T
–A reward function R(s, a)
–A finite set of observations Ω
–An observation function O: S × A → Π(Ω), where O(s', a, o) is the probability of observing o after taking action a and landing in state s'
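
For concreteness, such a model can be held in a small container like the following (a hypothetical layout, not something defined in the paper); the later sketches assume it:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class POMDP:
    T: np.ndarray   # T[a, s, s'] = P(s' | s, a)
    R: np.ndarray   # R[s, a]     = expected immediate reward
    O: np.ndarray   # O[a, s', o] = P(o | a, s')
    gamma: float    # discount factor
```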

POMDP Problem 1. Belief state
–First approach: choose the most probable state of the world, given past experience; the informational properties of the observations are not represented explicitly.
–Second approach: maintain a probability distribution over the states of the world.

An example: a four-state hallway, written left to right, in which the third state is a goal the agent can recognize from its observation. Actions: EAST and WEST
–Each succeeds with probability 0.9; when it fails, the movement is in the opposite direction. If no movement is possible in a particular direction, the agent remains in the same location.
–Initial belief: [0.33, 0.33, 0, 0.33]
–After taking one EAST movement: [0.1, 0.45, 0, 0.45]
–After taking another EAST movement: [0.1, 0.164, 0, 0.736]
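
A small sketch of the belief update that produces these numbers, under the reading above (four states in a row, the third state is the uniquely observable goal, and each update conditions on not having observed the goal); the state layout is inferred from the belief vectors, so treat it as illustrative:

```python
import numpy as np

def east_update(b, p_succ=0.9):
    """Belief update for EAST in the 4-state hallway, conditioned on not seeing the goal (index 2)."""
    n = len(b)
    T = np.zeros((n, n))
    for s in range(n):
        east = min(s + 1, n - 1)   # wall on the right: stay put
        west = max(s - 1, 0)       # wall on the left: stay put
        T[s, east] += p_succ       # intended move succeeds
        T[s, west] += 1 - p_succ   # failure moves in the opposite direction
    b_pred = b @ T                 # predict: sum_s b(s) T(s, EAST, s')
    b_pred[2] = 0.0                # observation says we are not at the goal
    return b_pred / b_pred.sum()   # renormalize

b = np.array([1/3, 1/3, 0.0, 1/3])
b = east_update(b)   # -> [0.1, 0.45, 0, 0.45]
b = east_update(b)   # -> [0.1, 0.164, 0, 0.736] (approximately)
```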

POMDP Problem 2. Finding an optimal policy:
–Maps belief states to actions

Policy Tree A tree of depth t that specifies a complete t-step policy.
–Nodes: actions; the top node determines the first action to be taken.
–Edges: the resulting observations

Sample Policy Tree (figure)

Policy Tree Value evaluation:
–V_p(s) is the t-step value of starting from state s and executing policy tree p: V_p(s) = R(s, a(p)) + γ Σ_s' T(s, a(p), s') Σ_o O(s', a(p), o) V_o(p)(s'), where a(p) is the action at the root of p and o(p) is the subtree followed after observing o.

Policy Tree Value evaluation:
–Expected value of executing policy tree p from belief state b: V_p(b) = Σ_s b(s) V_p(s) = b · α_p, where α_p = (V_p(s_1), …, V_p(s_n)).
–Expected value when different policy trees may be executed from different initial belief states: V_t(b) = max over t-step policy trees p of b · α_p.

Policy Tree Value evaluation:
–V_t with only two states (figure: each policy tree contributes a line over the one-dimensional belief simplex, and V_t is their piecewise-linear, convex upper surface).

Policy Tree Value evaluation:
–V_t with three states (figure: the same construction over the two-dimensional belief simplex, where each policy tree contributes a plane).
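
A sketch of this evaluation in code, assuming the hypothetical POMDP container sketched earlier and representing a policy tree as a root action plus one subtree per observation (all names are illustrative):

```python
import numpy as np

class PolicyTree:
    def __init__(self, action, subtrees=None):
        self.action = action          # action at the root node
        self.subtrees = subtrees      # dict: observation -> PolicyTree, or None for a 1-step tree

def alpha_vector(p, pomdp):
    """alpha_p(s) = V_p(s): expected value of executing tree p from each state."""
    a = p.action
    alpha = pomdp.R[:, a].astype(float).copy()
    if p.subtrees:
        for o, sub in p.subtrees.items():
            alpha_sub = alpha_vector(sub, pomdp)
            # + gamma * sum_{s'} T(s, a, s') O(s', a, o) V_subtree(s')
            alpha += pomdp.gamma * pomdp.T[a] @ (pomdp.O[a][:, o] * alpha_sub)
    return alpha

def value(b, trees, pomdp):
    """V_t(b) = max_p b . alpha_p  (the piecewise-linear, convex upper surface)."""
    return max(float(b @ alpha_vector(p, pomdp)) for p in trees)
```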

Infinite Horizon Three algorithms to compute V:
–Naive approach
–Improved approach: keep only useful policy trees
–Witness algorithm

Infinite Horizon Naive approach:
–Run value iteration to a horizon t large enough that the t-step value function is within ε (a small number) of the optimal infinite-horizon value.
–A t-step policy tree contains (|Ω|^t − 1) / (|Ω| − 1) nodes.
–Each node can be labeled with any of |A| possible actions.
–Total number of distinct t-step policy trees: |A|^((|Ω|^t − 1) / (|Ω| − 1)).
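
A quick, purely illustrative calculation of how fast this count grows:

```python
def num_policy_trees(n_actions, n_obs, t):
    nodes = (n_obs ** t - 1) // (n_obs - 1)   # nodes in a complete |Omega|-ary tree of depth t
    return n_actions ** nodes

# e.g. 3 actions and 2 observations (as in the tiger problem), horizon 4:
print(num_policy_trees(3, 2, 4))   # 3**15 = 14,348,907 policy trees
```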

Infinite Horizon Improved approach, choosing useful policy trees:
–V_{t-1}, the set of useful (t − 1)-step policy trees, can be used to construct a superset V_t^+ of the useful t-step policy trees.
–There are |A| |V_{t-1}|^|Ω| elements in V_t^+; the dominated trees are then pruned.
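
A sketch of constructing V_t^+ from V_{t-1} as alpha vectors, reusing the hypothetical POMDP container from above: one candidate per choice of root action and per assignment of a (t − 1)-step tree to each observation.

```python
from itertools import product
import numpy as np

def exhaustive_backup(alphas_prev, pomdp):
    """Build the |A| * |V_{t-1}|^|Omega| candidate alpha vectors of V_t^+."""
    n_actions, _, _ = pomdp.T.shape
    n_obs = pomdp.O.shape[2]
    candidates = []
    for a in range(n_actions):
        # g[o][i](s) = sum_{s'} T(s, a, s') O(s', a, o) alpha_i(s')
        g = [[pomdp.T[a] @ (pomdp.O[a][:, o] * alpha) for alpha in alphas_prev]
             for o in range(n_obs)]
        # choose one (t-1)-step tree for each observation
        for choice in product(range(len(alphas_prev)), repeat=n_obs):
            alpha = pomdp.R[:, a] + pomdp.gamma * sum(g[o][i] for o, i in enumerate(choice))
            candidates.append(alpha)
    return candidates
```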

Infinite Horizon Improved approach, choosing useful policy trees (figure).

Infinite Horizon Witness algorithm: (figure)

Infinite Horizon Witness algorithm:
–Q_t^a is the set of t-step policy trees that have action a at their root.
–Q_t^a(b) = max over p in Q_t^a of b · α_p is its value function.
–And V_t(b) = max over actions a of Q_t^a(b).

Infinite Horizon Witness algorithm:
–Finding a witness: at each iteration we ask, is there some belief state b for which the true value Q_t^a(b), computed by one-step lookahead using V_{t-1}, is different from the estimated value computed using the set U? Such a belief state is a witness to the fact that U is not yet complete.

Infinite Horizon Witness algorithm:
–Finding a witness: now we can state the witness theorem [25]. The true Q-function Q_t^a differs from the approximate Q-function represented by U if and only if there is some tree p in U, some observation o, and some (t − 1)-step tree p' in V_{t-1} for which there is some belief state b such that p_new, the tree obtained from p by replacing its o-subtree with p', satisfies V_p_new(b) > V_p̃(b) for every p̃ in U.

Infinite Horizon Witness algorithm:
–Finding a witness: (figure)

Infinite Horizon Witness algorithm:
–Finding a witness: the linear program used to find witness points (figure). The LP searches for a belief state b at which a candidate tree p_new improves on every tree currently in U by some margin δ > 0.
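
A sketch of such a linear program using scipy.optimize.linprog (the variable layout and tolerance are illustrative): maximize the margin δ by which a candidate vector α_new beats every vector currently in U at some belief b.

```python
import numpy as np
from scipy.optimize import linprog

def find_witness(alpha_new, alphas_U):
    """Return a belief state b where alpha_new strictly beats every vector in U, or None."""
    n = len(alpha_new)
    # variables: x = [b_1, ..., b_n, delta]; maximize delta  <=>  minimize -delta
    c = np.zeros(n + 1); c[-1] = -1.0
    # for every alpha in U:  b . alpha + delta <= b . alpha_new
    A_ub = np.array([np.append(alpha - alpha_new, 1.0) for alpha in alphas_U])
    b_ub = np.zeros(len(alphas_U))
    # b must be a probability distribution over states
    A_eq = np.array([np.append(np.ones(n), 0.0)])
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    if res.success and -res.fun > 1e-9:   # delta > 0: witness point found
        return res.x[:n]
    return None
```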

Infinite Horizon Witness algorithm:
–Complete value iteration:
  - An agenda containing, initially, any single policy tree.
  - A set U that will contain the desired (useful) policy trees.
  - Each tree p_new taken from the agenda is checked to see whether it is an improvement over the policy trees in U.
–1. If no witness point is discovered, that policy tree is removed from the agenda. When the agenda is empty, the algorithm terminates.
–2. If a witness point is discovered, the best policy tree for that point is calculated and added to U, and all policy trees that differ from the current policy tree in a single subtree are added to the agenda.
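
Putting the pieces together, a rough sketch of this agenda loop for a single Q_t^a, reusing the hypothetical find_witness helper sketched above; the "best tree at a witness point" step is simplified to choosing, for each observation, the best (t − 1)-step subtree at that belief.

```python
import numpy as np

def witness_q_a(a, alphas_prev, pomdp):
    """Rough sketch: a candidate 'tree' is a choice of (t-1)-step subtree index
    for each observation, with action a fixed at the root."""
    n_states, _ = pomdp.R.shape
    n_obs = pomdp.O.shape[2]

    def backprojection(o, alpha):       # g_{a,o}(s) = sum_{s'} T(s,a,s') O(s',a,o) alpha(s')
        return pomdp.T[a] @ (pomdp.O[a][:, o] * alpha)

    def alpha_of(choice):               # alpha vector of the candidate tree
        return pomdp.R[:, a] + pomdp.gamma * sum(backprojection(o, alphas_prev[i])
                                                 for o, i in enumerate(choice))

    agenda = [tuple(0 for _ in range(n_obs))]   # any single policy tree
    U = []                                      # list of (choice, alpha) pairs
    while agenda:
        choice = agenda.pop()
        if not U:
            b = np.ones(n_states) / n_states    # first tree: every belief is a witness
        else:
            b = find_witness(alpha_of(choice), [alpha for _, alpha in U])
        if b is None:
            continue                            # 1. no witness point: drop the tree
        # 2. best tree at the witness point: pick the best subtree for each observation
        best = tuple(int(np.argmax([b @ backprojection(o, al) for al in alphas_prev]))
                     for o in range(n_obs))
        U.append((best, alpha_of(best)))
        for o in range(n_obs):                  # add all single-subtree variations
            for i in range(len(alphas_prev)):
                agenda.append(best[:o] + (i,) + best[o + 1:])
    return [alpha for _, alpha in U]
```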

Infinite Horizon Witness algorithm:
–Complexity: no more than |V_t| witness points are discovered (each adds a tree to the set of useful policy trees), so only |Ω| |V_{t-1}| |V_t| trees can ever be added to the agenda (in addition to the one tree in the initial agenda). Each of the linear programs either removes a policy tree from the agenda (this happens at most |Ω| |V_{t-1}| |V_t| + 1 times) or discovers a witness point (this happens at most |V_t| times).

Tiger Problem Two doors:
–Behind one door is a tiger.
–Behind the other door is a large reward.
Two states:
–s_l when the tiger is behind the left door, s_r when it is behind the right door.
Three actions:
–left, right, and listen.
Rewards:
–The reward for opening the correct door is +10, the penalty for opening the door with the tiger behind it is -100, and the cost of listen is -1.
Observations:
–Hear the tiger on the left (T_l) or hear the tiger on the right (T_r).
–In state s_l, the listen action results in observation T_l with probability 0.85 and observation T_r with probability 0.15; conversely for world state s_r.
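
The tiger problem written out with the hypothetical POMDP container from earlier, plus the belief update after hearing the tiger on the left. The discount value and the convention that opening a door resets the problem to a 50/50 belief follow the usual formulation and are assumptions here, not stated on the slides.

```python
import numpy as np

S_L, S_R = 0, 1                    # states: tiger-left, tiger-right
A_LEFT, A_RIGHT, A_LISTEN = 0, 1, 2
T_L, T_R = 0, 1                    # observations: hear-left, hear-right

T = np.empty((3, 2, 2))
T[A_LISTEN] = np.eye(2)                          # listening does not move the tiger
T[A_LEFT] = T[A_RIGHT] = np.full((2, 2), 0.5)    # opening a door resets the problem

R = np.array([[-100.0,  +10.0, -1.0],            # R[s, a]: tiger-left row
              [ +10.0, -100.0, -1.0]])           #          tiger-right row

O = np.empty((3, 2, 2))
O[A_LISTEN] = np.array([[0.85, 0.15],            # in s_l: hear T_l with 0.85
                        [0.15, 0.85]])           # in s_r: hear T_r with 0.85
O[A_LEFT] = O[A_RIGHT] = np.full((2, 2), 0.5)    # opening a door tells you nothing

tiger = POMDP(T=T, R=R, O=O, gamma=0.95)

def belief_update(b, a, o, m):
    """b'(s') proportional to O(s', a, o) * sum_s T(s, a, s') b(s)."""
    b_new = m.O[a][:, o] * (b @ m.T[a])
    return b_new / b_new.sum()

b = np.array([0.5, 0.5])
b = belief_update(b, A_LISTEN, T_L, tiger)   # -> [0.85, 0.15]
b = belief_update(b, A_LISTEN, T_L, tiger)   # -> [0.9698, 0.0302] (approximately)
```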

Tiger Problem (figure)

Decreasing the listening reliability from 0.85 down to 0.65: (figure: with less reliable listening, the agent must listen more times before it is confident enough to open a door)

The End