Reinforcement Learning in POMDPs Without Resets

Presentation transcript:

Reinforcement Learning in POMDPs Without Resets. E. Even-Dar, S. M. Kakade & Y. Mansour, IJCAI, August 2005. Presented by Lihan He, Machine Learning Reading Group, Duke University, 07/29/2005.

During reinforcement learning in a POMDP, the agent is usually reset to the same situation at the beginning of each attempt. This guarantees that the agent starts from the same point, so the comparison of rewards across attempts is fair. This paper gives an approach to approximate resets for settings in which the agent cannot be exactly reset. The authors prove that with this approximate reset, or homing strategy, the agent moves toward a reset within a given tolerance in expectation.

Outline: POMDP, policy, and horizon length; Reinforcement learning; Homing strategies; Two algorithms of reinforcement learning with homing: (1) model-free, (2) model-based; Conclusion.

POMDP. POMDP = HMM + controllable actions. A POMDP model is defined by the tuple < S, A, T, R, Ω, O >. An example: Hallway2, a navigation problem. 89 states: 4 orientations in 22 rooms, plus a goal state. 17 observations: all combinations of walls, plus a 'star'. 5 actions: stay in place, move forward, turn right, turn left, turn around. The state is hidden because the agent cannot determine its current state from the current observation alone (wall / no wall in each of the 4 orientations).
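To make the tuple concrete, here is a minimal sketch of the POMDP ingredients and the belief update an agent would maintain; the container layout and names are illustrative assumptions, not taken from the paper or the Hallway2 benchmark files.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Minimal container for the POMDP tuple <S, A, T, R, Omega, O> described above.
# Names and structure are illustrative only.
@dataclass
class POMDP:
    states: List[str]                                   # S
    actions: List[str]                                  # A
    observations: List[str]                             # Omega
    T: Dict[Tuple[str, str], Dict[str, float]]          # T[(s, a)][s'] = P(s' | s, a)
    R: Dict[Tuple[str, str], float]                     # R[(s, a)] = immediate reward
    O: Dict[Tuple[str, str], Dict[str, float]]          # O[(a, s')][o] = P(o | a, s')

def update_belief(pomdp: POMDP, belief: Dict[str, float], a: str, o: str) -> Dict[str, float]:
    """Bayes-filter belief update: b'(s') is proportional to O(o | a, s') * sum_s T(s' | s, a) b(s)."""
    new_b = {}
    for s2 in pomdp.states:
        pred = sum(pomdp.T[(s, a)].get(s2, 0.0) * belief.get(s, 0.0) for s in pomdp.states)
        new_b[s2] = pomdp.O[(a, s2)].get(o, 0.0) * pred
    z = sum(new_b.values())
    return {s: p / z for s, p in new_b.items()} if z > 0 else new_b
```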

POMDP policy. A policy π is a mapping from a belief state b to an action a: it tells the agent which action to take given its estimated belief state. T-horizon optimal policy: the algorithm looks only T steps ahead to maximize the expected reward value V (the immediate reward is also a function of the action a). T = 1: consider only the immediate reward. T = ∞: consider all the discounted future reward. [Figure: reward value of the optimal policy vs. horizon length T.]
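The T-horizon objective described above can be written compactly as follows; the notation V_T^π and r_t is assumed for this restatement, not copied from the slides.

```latex
% T-horizon value of a policy \pi from belief state b, and the T-horizon optimal policy:
\[
V_T^{\pi}(b) \;=\; \mathbb{E}\!\left[\sum_{t=1}^{T} r_t \,\middle|\, b_1 = b,\ \pi\right],
\qquad
\pi_T^{*} \;=\; \arg\max_{\pi} V_T^{\pi}(b).
\]
```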

Reinforcement Learning. How can the agent obtain an optimal policy if it does not know the model parameters (the state transition probabilities T(s,a,s') and the observation function O(a,s',o)), or even the structure of the model? Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment. Model-based: first estimate the model parameters by exploring the environment, then derive a policy from this model; during exploration the agent continuously improves both the model and the policy. Model-free: discard the model entirely and find a policy directly by exploring the environment; typically the algorithm searches the space of behaviors for the best performance.

Reinforcement Learning: reset. To compare performance during the trial-and-error process, the agent usually resets itself to the same initial situation (belief state) before each try; in this way the comparison is fair. [Figure: several trials sharing the same starting point.] Resetting is usually done by 'offline' simulation.

Reinforcement Learning: without reset. Assume a realistic situation in which an agent starts in an unknown environment and must follow one continuous, uninterrupted chain of experience, with no access to 'resets' or 'offline' simulation. The paper presents an algorithm built on an approximate reset strategy, or homing strategy: a series of actions that achieves an approximate reset. A homing strategy exists in every POMDP. By performing the homing strategy, the agent approximately resets its current belief state to within ε of the (unknown) initial belief state. The algorithm balances exploration and exploitation; while homing, the agent is neither exploring nor exploiting.

Homing Strategies. Definition 1: H is an (ε, k)-approximate reset (or homing) strategy if for every two belief states b1 and b2 we have ||H_E(b1) − H_E(b2)||_1 ≤ ε, where H(b) is the (random) belief state reached after the k homing actions of H starting from b, and H_E(b) = E[H(b)] is its expectation. The definition states that H approximately resets the belief state, but the approximation quality could be poor; the next lemma shows how to amplify the accuracy.
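The definition can be checked empirically by Monte Carlo. The sketch below assumes a helper `sample_step(b, a, rng)` that samples an observation and returns the Bayes-updated belief as a probability vector; that helper and the function names are assumptions, not part of the paper.

```python
import numpy as np

def expected_homing_belief(b0, homing_actions, sample_step, n_samples=10_000, rng=None):
    """Monte Carlo estimate of H_E(b0): the expected belief after running the homing
    action sequence from belief b0."""
    rng = rng or np.random.default_rng(0)
    acc = np.zeros(len(b0))
    for _ in range(n_samples):
        b = np.asarray(b0, dtype=float)
        for a in homing_actions:
            b = sample_step(b, a, rng)   # assumed helper: sample obs, return updated belief
        acc += b
    return acc / n_samples

def homing_epsilon(b1, b2, homing_actions, sample_step):
    """Empirical epsilon for the (epsilon, k)-approximate reset condition on one belief pair."""
    h1 = expected_homing_belief(b1, homing_actions, sample_step)
    h2 = expected_homing_belief(b2, homing_actions, sample_step)
    return np.abs(h1 - h2).sum()   # L1 distance; should be <= epsilon for all pairs
```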

Homing Strategies. Lemma 1 (accuracy amplification): if H is an (ε, k)-approximate reset, then H^ℓ, which consecutively implements H for ℓ times, is an (ε^ℓ, kℓ)-approximate reset. Lemma 2 (existence of a homing strategy): for every POMDP, the random walk strategy (including the 'stay' action) constitutes an (ε, k)-approximate reset for some k ≥ 1 and 0 < ε < 1/2. Assumption: the POMDP is connected, i.e., for any states s, s' there exists a strategy that reaches s' with positive probability starting from s. The random walk must include the 'stay' action to avoid getting trapped in a loop. According to these two lemmas, every POMDP admits at least the random walk as an approximate reset of any desired accuracy.
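Lemma 2's random-walk homing strategy is simple to sketch. The `env.step` interface and the action list (with an explicit 'stay' entry) are assumptions for illustration.

```python
import random

def random_walk_homing(env, actions, k, rng=None):
    """Execute k uniformly random actions; the action set is assumed to include 'stay'.
    Per Lemma 2 this acts as an (epsilon, k)-approximate reset for a connected POMDP.
    Rewards earned while homing are simply discarded: the agent is neither exploring
    nor exploiting during these steps."""
    rng = rng or random.Random(0)
    for _ in range(k):
        a = rng.choice(actions)   # uniform random action, e.g. 'stay', 'forward', ...
        env.step(a)               # assumed environment interface
```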

Reinforcement Learning with Homing. Algorithm 1: model-free algorithm.

Input: H, a (1/2, K_H)-approximate reset strategy, e.g. the random walk.
for t = 1 to ∞ do
    // Exploration in phase t
    foreach policy π in Π_t do
        for i = 1 to k1_t do
            Run π for t steps;
            Repeatedly run H for log(1/ε_t) times;    // homing
        end
        Let v_π be the average return of π from these k1_t trials;
    end
    Let π̂_t be the policy with the largest average return v_π;
    // Exploitation in phase t
    for i = 1 to k2_t do
        Run π̂_t for t steps;
        Repeatedly run H for log(1/ε_t) times;        // homing
    end
end

Notation: Π_t is the set of all possible t-horizon policies, t the horizon length, k1_t the number of exploration trials, π̂_t the (empirically) optimal policy, and k2_t the number of exploitation trials.

Reinforcement Learning with Homing. Algorithm 1: model-free algorithm. What is a policy in this model-free POMDP? Definition: a history h is a finite sequence of actions, rewards, and observations, i.e., h = {(a1, r1, o1), …, (at, rt, ot)}. A policy is defined as a mapping from histories to actions. Because of the approximate reset, there is no relationship between the t-th and the (t+1)-th iteration. The algorithm is very inefficient, since it tests all possible policies, and is impossible to implement in practice. The exploration count k1_t and the exploitation count k2_t are chosen large enough to guarantee convergence of the estimated average reward.
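To summarize the control flow, here is a rough sketch of Algorithm 1's phase structure; this is not the paper's exact schedule, and the policy enumeration, the trial-count schedules k1 and k2, the tolerance schedule eps, and the environment interface are all assumed placeholders.

```python
import math

def model_free_with_homing(env, enumerate_policies, run_policy, homing,
                           n_phases, k1, k2, eps, K_H):
    """Sketch of Algorithm 1's phase structure. Assumed interfaces:
    enumerate_policies(t) yields all t-horizon policies, run_policy(env, pi, t) returns
    the total reward of one t-step run, homing(env, K_H) performs one pass of the
    approximate reset, and k1, k2, eps are functions of the phase index t."""
    for t in range(1, n_phases + 1):
        # Exploration: estimate the average return of every candidate t-horizon policy.
        returns = {}
        for pi in enumerate_policies(t):
            total = 0.0
            for _ in range(k1(t)):
                total += run_policy(env, pi, t)
                for _ in range(max(1, math.ceil(math.log(1.0 / eps(t), 2)))):
                    homing(env, K_H)          # approximate reset between trials
            returns[pi] = total / k1(t)
        best = max(returns, key=returns.get)  # empirically best t-horizon policy
        # Exploitation: run the best policy, still homing between runs.
        for _ in range(k2(t)):
            run_policy(env, best, t)
            for _ in range(max(1, math.ceil(math.log(1.0 / eps(t), 2)))):
                homing(env, K_H)
```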

Reinforcement Learning with Homing. Algorithm 2: model-based algorithm.

Input: H, a (1/2, K_H)-approximate reset strategy, e.g. the random walk. Let L = |A|·|O|.
for t = 1 to ∞ do
    // Exploration in phase t
    for k1_t times do
        Run RANDOM WALK for t+1 steps;
        Repeatedly run H for log(L^t/ε_t) times;    // homing
    end
    // Model update in phase t
    foreach history h in H_t do
        if h has been observed, update the empirical model estimates for h;
    end
    Compute π̂_t using the estimated history-based model;
    // Exploitation in phase t
    for k2_t times do
        Run π̂_t for t steps;
        Repeatedly run H for log(L^t/ε_t) times;    // homing
    end
end

Notation: H_t is the set of all possible histories of length at most t.

Reinforcement Learning with Homing. Algorithm 2: model-based algorithm. Notation: h followed by (a, o) denotes the history h extended by one action-observation pair. A POMDP is equivalent to an MDP whose states are the histories, so the policy π̂_t can be computed from the estimated history-based model. Again, there is no relationship between the t-th and the (t+1)-th iteration. Instead of trying every policy, as Algorithm 1 does, Algorithm 2 uses the (sparse) estimated model parameters to compute the policy.
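One way to picture the "histories as MDP states" view is the count-based bookkeeping below; the data structures and helper names are illustrative assumptions, not the paper's construction.

```python
from collections import defaultdict

# Treat each finite history (a1,o1,...,at,ot) as a state of an MDP and estimate the
# "transition" Pr(o | h, a) and the expected reward from counts gathered during the
# random-walk exploration phases.
obs_counts = defaultdict(lambda: defaultdict(int))   # (h, a) -> o -> count
reward_sums = defaultdict(float)                     # (h, a) -> summed reward
visit_counts = defaultdict(int)                      # (h, a) -> count

def record(h, a, o, r):
    obs_counts[(h, a)][o] += 1
    reward_sums[(h, a)] += r
    visit_counts[(h, a)] += 1

def p_obs(h, a, o):
    n = visit_counts[(h, a)]
    return obs_counts[(h, a)][o] / n if n else 0.0

def expected_reward(h, a):
    n = visit_counts[(h, a)]
    return reward_sums[(h, a)] / n if n else 0.0

def next_history(h, a, o):
    # The successor "state" of history h under (a, o) is just the extended history;
    # histories are represented as tuples of (action, observation) pairs.
    return h + ((a, o),)
```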

Conclusion. The authors give an approach to approximate resets for the lifelong-learning setting, in which the agent is never allowed to be reset. A model-free algorithm and a model-based algorithm are proposed; the model-free algorithm is inefficient.

References: Eyal Even-Dar, Sham M. Kakade, and Yishay Mansour, "Reinforcement Learning in POMDPs without Resets", 19th IJCAI, 2005. Mance E. Harmon and Stephanie S. Harmon, "Reinforcement Learning: A Tutorial". Website about reinforcement learning: http://www-anw.cs.umass.edu/rlr/