An Introduction to PO-MDP Presented by Alp Sardağ

MDP  Components: –State –Action –Transition –Reinforcement  Problem: –choose the action that makes the right tradeoffs between the immediate rewards and the future gains, to yield the best possible solution  Solution: –Policy: value function

Definitions
– Horizon length
– Value iteration:
  – Temporal-difference learning:
    Q(x,a) ← Q(x,a) + α (r + γ max_b Q(y,b) − Q(x,a))
    where α is the learning rate and γ is the discount rate.
– Adding partial observability (PO) to a CO-MDP is not trivial:
  – Value iteration requires complete observability of the state.
  – Partial observability clouds the current state.
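As a concrete illustration of the TD update above, here is a minimal tabular Q-learning sketch; the state and action counts and the values of α and γ are assumptions chosen for illustration, not taken from the slides.

```python
import numpy as np

# Illustrative sizes and hyperparameters (assumptions, not from the slides).
n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.95   # learning rate and discount rate

def td_update(Q, x, a, r, y):
    """One Q-learning (temporal-difference) update for the transition (x, a, r, y)."""
    Q[x, a] += alpha * (r + gamma * Q[y].max() - Q[x, a])
    return Q

# Example: from state 0 we took action 1, received reward 1.0, and landed in state 2.
Q = td_update(Q, x=0, a=1, r=1.0, y=2)
```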

PO-MDP
– Components:
  – States
  – Actions
  – Transitions
  – Reinforcement
  – Observations

Mapping in CO-MDP & PO-MDP
– In CO-MDPs, the mapping is from states to actions.
– In PO-MDPs, the mapping is from probability distributions (over states) to actions.

VI in CO-MDP & PO-MDP
– In a CO-MDP:
  – Track the current state.
  – Update it after each action.
– In a PO-MDP:
  – Maintain a probability distribution over states.
  – Perform an action, make an observation, then update the distribution.

Belief State and Space
– Belief state: a probability distribution over states.
– Belief space: the entire space of such probability distributions.
– Example:
  – Assume a two-state PO-MDP.
  – P(s1) = p and P(s2) = 1 − p, so the belief space is a line segment.
  – The line becomes a hyperplane in higher dimensions.

Belief Transform
– Assumptions:
  – Finite set of actions.
  – Finite set of observations.
– Next belief state = T(cbf, a, o), where cbf is the current belief state, a the action, and o the observation.
– Hence there are only finitely many possible next belief states.
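A minimal sketch of the belief transform T(cbf, a, o) via Bayes' rule; the way the transition and observation models are stored (T[a] as an |S|×|S| matrix, O[a] as an |S|×|Z| matrix) and the toy two-state numbers are assumptions made purely for illustration.

```python
import numpy as np

def belief_transform(cbf, a, o, T, O):
    """Next belief state T(cbf, a, o) from current belief cbf, action a, observation o.

    Assumed model layout: T[a][s, s'] = P(s' | s, a) and O[a][s', z] = P(z | s', a).
    """
    predicted = cbf @ T[a]                  # P(s' | cbf, a): predict the next state
    unnormalized = O[a][:, o] * predicted   # weight by the observation likelihood P(o | s', a)
    return unnormalized / unnormalized.sum()

# Toy two-state model (made up for illustration):
T = {0: np.array([[0.9, 0.1], [0.2, 0.8]])}
O = {0: np.array([[0.7, 0.3], [0.1, 0.9]])}
b = np.array([0.5, 0.5])
print(belief_transform(b, a=0, o=1, T=T, O=O))   # next belief over (s1, s2)
```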

PO-MDP into Continuous CO-MDP
– The belief process is Markovian: the next belief state depends only on
  – the current belief state,
  – the current action,
  – the observation.
– A discrete PO-MDP problem can therefore be converted into a continuous-space CO-MDP problem, where the continuous space is the belief space.

Problem
– Value iteration must now run over a continuous state space.
– There is no nice tabular representation as before.

PWLC
– Restrictions on the form of the solutions to the continuous-space CO-MDP:
  – The finite-horizon value function is piecewise linear and convex (PWLC) for every horizon length.
  – The value of a belief point is simply the dot product of two vectors: the belief and the vector of the linear segment it falls under.
– Goal: for each iteration of value iteration, find a finite number of linear segments that make up the value function.

Steps in VI
– Represent the value function for each horizon as a set of vectors.
  – This overcomes the problem of representing a value function over a continuous space.
– To evaluate a belief state, find the vector that has the largest dot product with it (see the sketch below).
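A minimal sketch of that evaluation step, assuming the value function is stored simply as a list of NumPy vectors (the example vectors are made up):

```python
import numpy as np

def value(b, vectors):
    """Value of belief b under a value function represented as a set of vectors."""
    return max(float(np.dot(v, b)) for v in vectors)

def best_vector(b, vectors):
    """The vector with the largest dot product with b; the action attached to it is the one to take."""
    return max(vectors, key=lambda v: float(np.dot(v, b)))

vectors = [np.array([1.0, 0.2]), np.array([0.3, 1.1])]
print(value(np.array([0.5, 0.5]), vectors))   # 0.7
```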

PO-MDP Value Iteration Example
– Assumptions:
  – Two states
  – Two actions
  – Three observations
– Example with horizon length 1, belief b = [0.25 0.75], and immediate rewards R(s1,a1) = 1, R(s2,a1) = 0, R(s1,a2) = 0, R(s2,a2) = 1.5:
  V(a1,b) = 0.25x1 + 0.75x0 = 0.25
  V(a2,b) = 0.25x0 + 0.75x1.5 = 1.125
  so a2 is the best action for this belief; over the whole belief space there is one region where a1 is best and another where a2 is best.
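The same horizon-1 arithmetic in code, with the belief and reward values reconstructed from the numbers on the slide:

```python
import numpy as np

b = np.array([0.25, 0.75])            # P(s1), P(s2)
R = {'a1': np.array([1.0, 0.0]),      # R(s1,a1), R(s2,a1)
     'a2': np.array([0.0, 1.5])}      # R(s1,a2), R(s2,a2)

V = {a: float(np.dot(r, b)) for a, r in R.items()}
print(V)                              # {'a1': 0.25, 'a2': 1.125}
print(max(V, key=V.get))              # a2 is best for this belief
```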

PO-MDP Value Iteration Example
– The value of a belief state for horizon length 2, given b, a1, and z1, is:
  – the value of the immediate action plus the value of the next action;
  – that is, find the best achievable value for the belief state that results from the initial belief state b when we perform action a1 and observe z1.

– Find the value for all belief points given this fixed action and observation.
– The transformed value function is also PWLC.

PO-MDP Value Iteration Example
– How do we compute the value of a belief state given only the action?
– The horizon-2 value of the belief state, given
  – the best value for each observation: z1: 0.7, z2: 0.8, z3: 1.2, and
  – P(z1 | b,a1) = 0.6, P(z2 | b,a1) = 0.25, P(z3 | b,a1) = 0.15,
  is the probability-weighted sum 0.6x0.7 + 0.25x0.8 + 0.15x1.2 = 0.8.
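A quick check of that weighted sum (P(z3 | b,a1) is taken to be 0.15 so that the three observation probabilities sum to 1):

```python
best_vals = {'z1': 0.7, 'z2': 0.8, 'z3': 1.2}    # best value after each observation
probs     = {'z1': 0.6, 'z2': 0.25, 'z3': 0.15}  # P(z | b, a1)

V_b_a1 = sum(probs[z] * best_vals[z] for z in best_vals)
print(round(V_b_a1, 3))   # 0.8
```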

Transformed Value Functions
– Each of these transformed functions partitions the belief space differently.
– The best next action to perform depends upon the initial belief state and the observation.

Best Value for Belief States
– The value of every single belief point is the sum of:
  – the immediate reward, and
  – the line segments from the S() functions for each observation's future strategy.
– Since adding lines gives you lines, the result is again linear (see the sketch below).
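A sketch of how one such linear segment (alpha-vector) can be built for a fixed action and a chosen future strategy per observation. The model layout matches the earlier belief-transform sketch and, like the default discount factor, is an assumption for illustration.

```python
import numpy as np

def build_vector(a, future, R, T, O, gamma=1.0):
    """Build one new linear segment (alpha-vector) for action a.

    future[z] is the alpha-vector chosen as the future strategy after observing z
    (one S() segment per observation).  Assumed layout: R[a] is a length-|S| reward
    vector, T[a][s, s'] = P(s' | s, a), O[a][s', z] = P(z | s', a).
    """
    alpha_new = np.array(R[a], dtype=float)            # immediate reward term
    for z in range(O[a].shape[1]):
        # expected future value through observation z, as a function of the current state
        alpha_new += gamma * T[a] @ (O[a][:, z] * future[z])
    return alpha_new
```

Enumerating every combination of action and per-observation choice of future vector, then pruning the dominated segments, is the core of exact POMDP value-iteration algorithms.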

Best Strategy for Any Belief Point
– All the useful future strategies are easy to pick out.

Value Function and Partition
– For the specific action a1, the value function and corresponding partitions:

Value Function and Partition
– For the specific action a2, the value function and corresponding partitions:

Which Action to Choose?
– Put the value functions for each action together to see where each action gives the highest value (see the sketch below).
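Putting the actions' value functions together then amounts to a max over all vectors while remembering which action each one belongs to; a minimal sketch, assuming the vectors are grouped per action in a dictionary:

```python
import numpy as np

def best_action(b, vector_sets):
    """vector_sets maps each action to its list of alpha-vectors; return the action
    whose best vector gives the highest value at belief b."""
    return max(vector_sets,
               key=lambda a: max(float(np.dot(v, b)) for v in vector_sets[a]))
```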

Compact Horizon 2 Value Function

Value Function for Action a1 with a Horizon of 3

Value Function for Action a2 with a Horizon of 3

Value Function for Both Actions with a Horizon of 3

Value Function for Horizon of 3