Learning and Planning for POMDPs Eyal Even-Dar, Tel-Aviv University Sham Kakade, University of Pennsylvania Yishay Mansour, Tel-Aviv University

Talk Outline
Bounded Rationality and Partially Observable MDPs
Mathematical Model of POMDPs
Learning in POMDPs
– Planning in POMDPs
– Tracking in POMDPs

Bounded Rationality
Rationality:
– Players with unlimited computational power
Bounded rationality:
– Computational limitations
– Finite automata
Challenge: play optimally against a finite automaton
– The size of the automaton is unknown

Bounded Rationality and RL
Model:
– Perform an action
– See an observation
– Receive either immediate or delayed rewards
This is a POMDP
– The unknown size is a serious challenge

Classical Reinforcement Learning
Agent – Environment Interaction (figure): the agent sends an action to the environment; the environment returns the next state and a reward.

Reinforcement Learning – Goal
Maximize the return.
Discounted return: ∑_{t=1}^∞ γ^t r_t, with 0 < γ < 1
Undiscounted return: (1/T) ∑_{t=1}^T r_t

Markov Decision Process
(figure: states s1, s2, s3 with transitions)
S – the states
A – the actions
P_sa(·) – next-state distribution
R(s,a) – reward distribution, e.g., E[R(s3,a)] = 10

Reinforcement Learning Model – Policy
Policy Π:
– A mapping from states to distributions over actions
Optimal policy Π*:
– Attains the optimal return from any start state
Theorem: there exists a stationary deterministic optimal policy.

Planning and Learning in MDPs
Planning:
– Input: a complete model
– Output: an optimal policy Π*
Learning:
– Interaction with the environment
– Achieve a near-optimal return
For MDPs both planning and learning can be done efficiently (see the sketch below)
– Polynomial in the number of states
– Representation in tabular form
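A minimal value-iteration sketch in Python of the tabular planning step (not from the talk; the array shapes, discount factor, and stopping rule are illustrative assumptions):

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Tabular MDP planning.

    P: (A, S, S) array, P[a, s, s2] = Pr[next state s2 | state s, action a].
    R: (S, A) array of expected immediate rewards.
    Returns the optimal values and a greedy stationary deterministic policy,
    as guaranteed by the theorem on the previous slide.
    """
    A, S, _ = P.shape
    V = np.zeros(S)
    while True:
        # Q[s, a] = R(s, a) + gamma * E[V(next state) | s, a]
        Q = R + gamma * np.einsum("ask,k->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```

Each iteration costs O(|S|^2 |A|), and the number of iterations needed for a fixed accuracy is polynomial in the problem parameters, which is what "efficient planning" means here.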

Partially Observable Agent – Environment Interaction
(figure): the agent sends an action; the environment returns a signal (observation) correlated with the state, together with a reward.

Partially Observable Markov Decision Process
(figure: states s1, s2, s3 with transitions and observation probabilities such as .1 / .8 / .1)
S – the states
A – the actions
P_sa(·) – next-state distribution
R(s,a) – reward distribution, e.g., E[R(s3,a)] = 10
O – the observations
O(s,a) – observation distribution

Partial Observability – Problems in Planning
The optimal policy is not stationary; furthermore, it is history dependent.
Example: (figure)

Partial Observability – Complexity
Hardness results [LGM01, L95]:
Policy            | Horizon    | Approximation | Complexity
stationary        | finite     | ε-additive    | NP-complete
history dependent | finite     | ε-additive    | PSPACE-complete
stationary        | discounted | ε-additive    | NP-complete

Learning in POMDPs – Difficulties
Suppose an agent knows its state initially; can it keep track of its state?
– Easy given a completely accurate model
– Inaccurate model: our new tracking result
How can the agent return to the same state?
What is the meaning of very long histories?
– Do we really need to keep all the history?

Planning in POMDPs – The Belief-State Algorithm
A Bayesian setting: a prior over the initial state
Each action and observation defines a posterior
– Belief state: a distribution over states
View the possible belief states as "states"
– Infinite number of states
Also assumes a "perfect model"
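A minimal sketch of the Bayesian belief update behind the belief-state view (array shapes and names are my own assumptions, not from the talk):

```python
import numpy as np

def belief_update(b, a, o, P, Obs):
    """One step of belief tracking in a POMDP.

    b:   (S,) current belief, a distribution over states.
    a:   index of the action taken.
    o:   index of the observation received.
    P:   (A, S, S) transitions, P[a, s, s2] = Pr[s2 | s, a].
    Obs: (A, S, O) observation model, Obs[a, s2, o] = Pr[o | s2, a].
    Returns the posterior b'(s2) proportional to Obs[a, s2, o] * sum_s b(s) P[a, s, s2].
    """
    predicted = b @ P[a]               # predictive distribution over next states
    unnorm = predicted * Obs[a][:, o]  # weight by the observation likelihood
    return unnorm / unnorm.sum()       # normalize (assumes Pr[o | b, a] > 0)
```

Planning then treats each reachable belief vector as a state of a continuous-state MDP, which is where the "infinite number of states" difficulty comes from.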

Learning in POMDPs – Popular Methods
Policy gradient methods:
– Find a locally optimal policy in a restricted class of policies (parameterized policies)
– Need to assume a reset to the start state!
– Cannot guarantee asymptotic results
– [Peshkin et al., Baxter & Bartlett, …]

Learning in POMDPs – Trajectory Trees [KMN]
– Assume a generative model (a strong RESET procedure)
– Find a "near best" policy in a restricted class of policies: finite-horizon policies, parameterized policies

Trajectory Tree [KMN]
(figure: a tree rooted at s0, branching on actions a1, a2 and observations o1–o4)

Our Setting
Return: the average-reward criterion
One long trajectory
– No RESET
– Connected environment (unichain POMDP)
Goal: achieve the optimal return (average reward) with probability 1

Homing Strategies – POMDPs
A homing strategy is a strategy that identifies the state
– It knows how to return "home"
This enables an "approximate reset" during a long trajectory

Homing Strategies
Learning finite automata [Rivest & Schapire]
– Use a homing sequence to identify the state: the homing sequence is exact, but it can lead to many states
– Use the finite-automata learning algorithm of [Angluin 87]
Diversity-based learning [Rivest & Schapire]
– Similar to our setting; major difference: deterministic transitions

Homing Strategies – POMDPs
Definition: H is an (ε, K)-homing strategy if for every two belief states x1 and x2, after K steps of following H, the expected belief states b1 and b2 are within ε distance.
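For an open-loop homing sequence (a fixed action sequence), the expected belief marginalizes over observations and reduces to pure transition propagation, so the definition can be checked directly. A small illustrative sketch under those assumptions, measuring distance in L1 (both choices are mine, not the talk's):

```python
import numpy as np
from itertools import combinations

def expected_belief(b, action_seq, P):
    """Expected belief after following a fixed action sequence.

    For an open-loop strategy, averaging the Bayesian posterior over
    observation sequences collapses (law of total probability) to the
    transition-only propagation, so no observation model is needed.
    b: (S,) starting belief; P: (A, S, S) transition matrices.
    """
    for a in action_seq:
        b = b @ P[a]
    return b

def homing_epsilon(action_seq, P):
    """Smallest eps such that action_seq is an (eps, len(action_seq))-homing
    strategy, with distance measured in L1 over point (deterministic)
    starting beliefs; by linearity this also bounds all belief pairs."""
    S = P.shape[1]
    finals = [expected_belief(np.eye(S)[s], action_seq, P) for s in range(S)]
    return max((np.abs(f1 - f2).sum() for f1, f2 in combinations(finals, 2)),
               default=0.0)
```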

Homing Strategies – Random Walk
If the POMDP is strongly connected, then the random-walk Markov chain is irreducible
Following the random walk therefore assures convergence to its steady state

Homing Strategies – Random Walk
What if the Markov chain is periodic (e.g., a cycle)?
Use a "stay" action to overcome periodicity problems

Homing Strategies – Amplification
Claim: if H is an (ε, K)-homing sequence, then repeating H T times is an (ε^T, KT)-homing sequence
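A sketch of why repetition amplifies, which I am adding under the assumption that distance is measured in total variation: the map B_H taking a belief to the expected belief after K steps of H is linear, so the homing property bounds its Dobrushin contraction coefficient, and such coefficients multiply under composition.

```latex
% (e_s denotes the point belief concentrated on state s)
\[
\delta(B_H) \;:=\; \max_{s, s'} \bigl\| e_s B_H - e_{s'} B_H \bigr\|_{TV}
\;\le\; \varepsilon,
\qquad
\delta\bigl(B_H^{T}\bigr) \;\le\; \delta(B_H)^{T} \;\le\; \varepsilon^{T},
\]
% so after T repetitions (KT steps) any two belief states are, in
% expectation, within \varepsilon^T of each other.
```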

Reinforcement Learning with Homing
Usually algorithms must balance exploration and exploitation
Here they must balance exploration, exploitation, and homing
Homing is performed during both exploration and exploitation

Policy Testing Algorithm
Theorem: for any connected POMDP, the policy-testing algorithm obtains the optimal average reward with probability 1
After T time steps it competes with policies of horizon log log T

Policy Testing
Enumerate the policies
– Gradually increase the horizon
Run in phases:
– Test policy π_k: average over several runs, resetting (via homing) between runs
– Run the best policy so far: ensures a good average return; again, reset between runs
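A compact sketch of this phase structure (the callables and their names are hypothetical, introduced only for illustration):

```python
def policy_testing(policies, test_return, run_exploitation):
    """Alternate between testing the next candidate policy and running
    the best policy found so far.

    policies:                enumeration of candidate policies, with horizons
                             that grow slowly along the enumeration.
    test_return(pi, k):      estimated average return of pi from several runs,
                             with an approximate (homing) reset between runs.
    run_exploitation(pi, k): runs pi long enough to keep the overall average
                             return high, again with homing resets between runs.
    """
    best, best_score = None, float("-inf")
    for k, pi in enumerate(policies):
        score = test_return(pi, k)          # exploration: test policy pi_k
        if score > best_score:
            best, best_score = pi, score
        run_exploitation(best, k)           # exploitation: run the best so far
```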

Model-Based Algorithm
Theorem: for any connected POMDP, the model-based algorithm obtains the optimal average reward with probability 1
After T time steps it competes with policies of horizon log T

Model-Based Algorithm
For t = 1 to ∞:
– Exploration: for K1(t) times, run random actions for t steps to build an empirical model, then use the homing sequence to approximately reset
– Compute the optimal policy on the empirical model
– Exploitation: for K2(t) times, run the empirical optimal policy for t steps, then use the homing sequence to approximately reset
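A sketch of this loop in Python; every callable below is a placeholder I introduce for illustration, not a name from the paper:

```python
from itertools import count

def model_based(random_walk, run_policy, home, build_model, plan, K1, K2):
    """Model-based learning with homing-based approximate resets.

    random_walk(t):      run t uniformly random actions, return the trace.
    run_policy(pi, t):   run policy pi for t steps.
    home():              execute the homing sequence (approximate reset).
    build_model(traces): fit an empirical POMDP model from the traces.
    plan(model, t):      compute a near-optimal t-horizon policy on the model.
    """
    traces = []
    for t in count(1):                       # horizons grow without bound
        for _ in range(K1(t)):               # exploration phase
            traces.append(random_walk(t))
            home()
        policy = plan(build_model(traces), t)
        for _ in range(K2(t)):               # exploitation phase
            run_policy(policy, t)
            home()
```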

Model-Based Algorithm
(figure: the empirical model as a tree of action/observation sequences rooted at s0)

Model-Based Algorithm – Computing the Optimal Policy
Bounding the error in the model:
– Significant nodes: sampling, approximate reset
– Insignificant nodes
Compute an ε-optimal t-horizon policy at each step

Model-Based Algorithm – Convergence w.p. 1 (Proof Idea)
– At any stage, K1(t) is large enough that we compute an ε_t-optimal t-horizon policy
– K2(t) is large enough that the influence of all earlier phases is bounded by ε_t
– For a large enough horizon, the influence of the homing sequence is also bounded

Model-Based Algorithm – Convergence Rate
The model-based algorithm produces an ε-optimal policy with probability 1 - δ in time polynomial in 1/ε, |A|, |O|, log(1/δ), and the homing-sequence length, and exponential in the horizon time of the optimal policy
Note: the algorithm does not depend on |S|

Planning in POMDPs
Unfortunately, not today …
Basic results:
– Tight connections with multiplicity automata, a well-established theory starting in the 60's
– Rank of the Hankel matrix: similar to PSRs; always at most the number of states
– Planning algorithm: exponential in the rank of the Hankel matrix
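For context, a standard multiplicity-automata fact (my addition, not stated on the slide): for a function f mapping strings to reals, the Hankel matrix H_f is indexed by pairs of strings, and

```latex
\[
H_f[x, y] \;=\; f(x \cdot y).
\]
% Carlyle--Paz / Fliess: the minimal multiplicity automaton computing f
% has exactly \mathrm{rank}(H_f) states, so the rank serves as the model
% "dimension", much as in PSRs.
```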

Tracking in POMDPs
The belief-state algorithm assumes perfect tracking
– Easy with a perfect model
– With an imperfect model, tracking can be impossible (for example, when there is no observable signal)
New results:
– "Informative observables" imply efficient tracking
– Towards a spectrum of "partial" observability …