1
Reinforcement Learning in POMDPs Without Resets
E. Even-Dar, S. M. Kakade & Y. Mansour, IJCAI, August 2005. Presented by Lihan He, Machine Learning Reading Group, Duke University, 07/29/2005
2
During reinforcement learning in a POMDP, we usually reset the agent to the same situation at the beginning of each attempt. This guarantees that the agent starts from the same point, so the comparison of rewards is fair. This paper gives an approach to approximate reset for the situation where the agent cannot be exactly reset. The authors prove that with this approximate reset, or homing strategy, the agent moves toward a reset within a given tolerance in expectation.
3
Outline
POMDP, policy, and horizon length; reinforcement learning
Homing strategies
Two algorithms of reinforcement learning with homing: (1) model-free, (2) model-based
Conclusion
4
POMDP POMDP = HMM + controllable actions.
A POMDP model is defined by the tuple < S, A, T, R, Ω, O >. An example: Hallway2, a navigation problem. 89 states: 4 orientations in each of 22 rooms, plus a goal. 17 observations: all combinations of walls (2^4 = 16), plus 'star'. 5 actions: stay in place, move forward, turn right, turn left, turn around. The state is hidden since the agent cannot determine its current state from the current observation alone (wall / no wall in the 4 orientations).
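To make the tuple concrete, here is a minimal Python sketch of a POMDP container with a belief update; the class, its interface, and the tiny two-state example are illustrative assumptions, not the paper's or Hallway2's actual model.

```python
import numpy as np

class POMDP:
    """Minimal POMDP container: states S, actions A, transitions T,
    rewards R, observation set Omega, and observation function O."""
    def __init__(self, n_states, n_actions, n_obs, T, R, O):
        self.n_states = n_states      # |S|
        self.n_actions = n_actions    # |A|
        self.n_obs = n_obs            # |Omega|
        self.T = T                    # T[a][s, s'] = P(s' | s, a)
        self.R = R                    # R[s, a]     = expected reward
        self.O = O                    # O[a][s', o] = P(o | a, s')

    def update_belief(self, b, a, o):
        """Bayes update of the belief state after taking action a and seeing o."""
        b_pred = b @ self.T[a]                 # predicted state distribution
        b_new = b_pred * self.O[a][:, o]       # weight by observation likelihood
        return b_new / b_new.sum()             # renormalize

# Tiny hypothetical 2-state, 2-action, 2-observation example
T = [np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([[0.5, 0.5], [0.5, 0.5]])]
O = [np.array([[0.8, 0.2], [0.3, 0.7]])] * 2
R = np.array([[1.0, 0.0], [0.0, 1.0]])
pomdp = POMDP(2, 2, 2, T, R, O)
b = pomdp.update_belief(np.array([0.5, 0.5]), a=0, o=1)
```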
5
POMDP policy. A policy π is a mapping from belief states b to actions a; it tells the agent which action to take given its estimated belief state. T-horizon optimal policy: the agent looks only T steps ahead to maximize the expected reward value V, i.e., the expected sum of the rewards over the next T steps (the immediate reward is also a function of the action a). T = 1: consider only the immediate reward. T = infinity: consider all discounted future rewards. [Figure: horizon length vs. reward value for the optimal policy.]
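Written out, the T-horizon objective described above is (standard notation, not copied from the slides):

```latex
% Expected T-step return of policy \pi starting from belief state b
V_T^{\pi}(b) = \mathbb{E}\!\left[ \sum_{t=1}^{T} r_t \,\middle|\, b_1 = b,\ \pi \right],
\qquad
\pi_T^{*} = \arg\max_{\pi} V_T^{\pi}(b).
```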
6
Reinforcement Learning
How can the agent obtain an optimal policy if it does not know the model parameters (the state transition probabilities T(s,a,s') and the observation function O(a,s',o)), or does not even know the structure of the model? Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment. Model-based: first estimate the model parameters by exploring the environment, then derive a policy from this model; during exploration, the agent continuously improves both the model and the policy. Model-free: discard the model entirely and find a policy directly by exploring the environment; typically the algorithm searches the space of behaviors for the best performance.
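As an illustration of the model-free idea (searching directly in the space of behaviors), here is a self-contained Python sketch with a hypothetical environment interface; the model-based counterpart appears later with Algorithm 2.

```python
import random

def rollout(step_fn, policy, horizon):
    """Run one episode of `horizon` steps and return the total reward.
    step_fn(action) -> (observation, reward) is a hypothetical environment interface."""
    total, obs = 0.0, None
    for _ in range(horizon):
        obs, r = step_fn(policy(obs))
        total += r
    return total

def model_free_search(step_fn, candidate_policies, horizon, trials=20):
    """Model-free flavor: ignore the model and score each candidate behavior
    by its empirical average return, keeping the best one."""
    scores = {name: sum(rollout(step_fn, p, horizon) for _ in range(trials)) / trials
              for name, p in candidate_policies.items()}
    return max(scores, key=scores.get)

# Toy two-action environment where action 1 pays off more often (hypothetical).
def toy_step(action):
    return None, (1.0 if random.random() < 0.3 + 0.4 * action else 0.0)

best = model_free_search(toy_step, {"always0": lambda o: 0, "always1": lambda o: 1}, horizon=5)
```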
7
Reinforcement Learning : reset
To compare performance during the trial-and-error process, the agent usually resets itself to the same initial situation (belief state) before each try, so that the comparison is fair. [Figure: several trials sharing the same starting point.] This is usually done by "offline" simulation.
8
Reinforcement Learning : without reset
Assume a realistic situation in which an agent starts in an unknown environment and must follow one continuous, uninterrupted chain of experience, with no access to 'resets' or 'offline' simulation. The paper presents an algorithm built on an approximate reset strategy, or homing strategy: a series of actions that achieves an approximate reset. A homing strategy exists in every POMDP. By performing the homing strategy, the agent approximately resets its current belief state to be ε-close to the (unknown) initial belief state. The algorithm balances exploration and exploitation; while homing, the agent is neither exploring nor exploiting.
9
Homing Strategies. Definition 1: H is an (ε, k)-approximate reset (or homing) strategy if for every two belief states b1 and b2 we have ||H_E(b1) - H_E(b2)||_1 ≤ ε, where H(b) is the (random) belief state reached after the k homing actions of H are executed starting from b, and H_E(b) = E[H(b)] is its expectation. This definition states that H approximately resets b, but the approximation quality could be poor; the next lemma shows how to amplify the accuracy.
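For a small POMDP whose matrices are known, the condition in Definition 1 can be checked numerically. The sketch below enumerates all observation branches exactly (exponential in the length of H, so only for tiny models); the function names and this exhaustive check are my own illustration, not something the paper requires.

```python
import numpy as np

def expected_belief_after_homing(b, homing_actions, T, O):
    """H_E(b): expected belief after executing the k homing actions of H from belief b.
    T[a][s, s'] = P(s'|s, a); O[a][s', o] = P(o|a, s').
    Averages the Bayes-updated belief over the observation distribution, which is
    exactly the expectation of the random belief H(b)."""
    beliefs = [(b, 1.0)]                         # list of (belief, probability) pairs
    for a in homing_actions:
        new = []
        for bel, p in beliefs:
            pred = bel @ T[a]                    # predicted state distribution
            for o in range(O[a].shape[1]):
                w = pred * O[a][:, o]
                po = w.sum()                     # probability of observing o
                if po > 0:
                    new.append((w / po, p * po))
        beliefs = new
    return sum(p * bel for bel, p in beliefs)

def is_approximate_reset(homing_actions, T, O, beliefs_to_test, eps):
    """Check ||H_E(b1) - H_E(b2)||_1 <= eps over the supplied test beliefs."""
    hs = [expected_belief_after_homing(b, homing_actions, T, O) for b in beliefs_to_test]
    return all(np.abs(h1 - h2).sum() <= eps for h1 in hs for h2 in hs)
```

With the toy T and O from the earlier POMDP sketch, one could call `is_approximate_reset([0, 1], T, O, [np.array([1.0, 0.0]), np.array([0.0, 1.0])], eps=0.5)` to test a two-action homing sequence.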
10
Homing Strategies. Lemma 1 (accuracy amplification): Suppose that H is an (ε, k)-approximate reset; then H^m is an (ε^m, km)-approximate reset, where H^m consecutively implements H m times. Lemma 2 (existence of a homing strategy): For every POMDP, the random-walk strategy (including the 'stay' action) constitutes an (ε, k)-approximate reset for some k ≥ 1 and 0 < ε < 1/2. Assumption: the POMDP is connected, i.e., for any states s, s' there exists a strategy that reaches s' with positive probability starting from s. The random walk must contain the 'stay' action to avoid being trapped in a loop. According to these two lemmas, for any (connected) POMDP we can use at least the random walk to achieve an approximate reset of any desired accuracy.
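A tiny sketch of the amplification in Lemma 1: repeating an (ε, k)-approximate reset m times drives the accuracy to ε^m, so to reach a target accuracy ε' it suffices to take m with ε^m ≤ ε'. The helper below is illustrative.

```python
import math

def repetitions_for_accuracy(eps_base, eps_target):
    """Number of consecutive runs of H needed so that eps_base**m <= eps_target
    (accuracy amplification as in Lemma 1, assuming 0 < eps_base < 1)."""
    return max(1, math.ceil(math.log(eps_target) / math.log(eps_base)))

# Example: a (1/2, K_H)-approximate reset repeated about log2(1/eps_t) times
# becomes an (eps_t, ...)-approximate reset, which is why the algorithms below
# run H "log(1/eps_t) times" after each trial.
m = repetitions_for_accuracy(0.5, 0.01)   # -> 7, since 0.5**7 ≈ 0.0078 <= 0.01
```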
11
Reinforcement Learning with Homing
Algorithm 1: model-free algorithm
Input: H, a (1/2, K_H)-approximate reset strategy (e.g., a random walk); Π_t, the set of all possible t-horizon policies.
for t = 1 to ∞ do
    // Exploration in phase t
    foreach policy π in Π_t do
        for i = 1 to k1t do                          // k1t = number of exploration trials
            Run π for t steps;                       // t = horizon length
            Repeatedly run H for log(1/ε_t) times;   // homing
        end
        Let v_π be the average return of π from these k1t trials;
    end
    Let π̂_t be the policy with the largest v_π;      // empirically optimal policy
    // Exploitation in phase t
    for i = 1 to k2t do                              // k2t = number of exploitation trials
        Run π̂_t for t steps;
        Repeatedly run H for log(1/ε_t) times;       // homing
    end
end
12
Reinforcement Learning with Homing
Algorithm 1: model-free algorithm. What is a policy in this model-free POMDP? Definition: a history h is a sequence of actions, rewards and observations of some finite length, i.e., h = {(a1, r1, o1), ..., (at, rt, ot)}. A policy is a mapping from histories to actions. Because of the approximate reset, there is no dependence between the t-th and the (t+1)-th phase. The algorithm is very inefficient, since it tests all possible policies, and is impossible to implement in practice. The numbers of exploration trials k1t and exploitation trials k2t are chosen so that enough runs are made to guarantee convergence of the estimated average reward.
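A hedged Python sketch of Algorithm 1 as reconstructed above; the environment interface, the policy enumeration `policies_for_horizon`, the homing callable, and the phase schedule (k1, k2, ε_t) are placeholders, not the paper's exact quantities.

```python
import math

def run_policy(env_step, policy, horizon):
    """Run a t-horizon policy once; here a policy maps the history so far to an action."""
    history, total = [], 0.0
    for _ in range(horizon):
        a = policy(tuple(history))
        o, r = env_step(a)                             # hypothetical environment interface
        history.append((a, r, o))
        total += r
    return total

def model_free_with_homing(env_step, homing, policies_for_horizon,
                           n_phases, k1, k2, eps_for_phase):
    """Sketch of Algorithm 1: in phase t, explore every candidate t-horizon policy,
    run the homing strategy after each trial to approximately reset the belief,
    then exploit the empirically best policy."""
    for t in range(1, n_phases + 1):
        reps = max(1, math.ceil(math.log(1.0 / eps_for_phase(t), 2)))
        scores = {}
        candidates = dict(policies_for_horizon(t))     # {name: policy}, assumed finite
        for name, pi in candidates.items():            # exploration in phase t
            returns = []
            for _ in range(k1(t)):
                returns.append(run_policy(env_step, pi, t))
                for _ in range(reps):
                    homing(env_step)                   # approximate reset
            scores[name] = sum(returns) / len(returns)
        best = candidates[max(scores, key=scores.get)]
        for _ in range(k2(t)):                         # exploitation in phase t
            run_policy(env_step, best, t)
            for _ in range(reps):
                homing(env_step)
```

Running this literally is infeasible, since the number of history-based t-horizon policies explodes with t, which is exactly the inefficiency noted on this slide.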
13
Reinforcement Learning with Homing
Algorithm 2: model-based algorithm
Input: H, a (1/2, K_H)-approximate reset strategy (e.g., a random walk); let L = |A|·|O|.
for t = 1 to ∞ do
    // Exploration in phase t
    for k1t times do
        Run RANDOM WALK for t+1 steps;
        Repeatedly run H for log(Lt/ε_t) times;    // homing
    end
    // Model update in phase t
    foreach history h in H_t do                    // H_t = the set of all possible histories
        if h appears in the exploration data then
            update the empirical model estimates for h;
    end
    Compute π̂_t using the updated model estimates;
    // Exploitation in phase t
    for k2t times do
        Run π̂_t for t steps;
        Repeatedly run H;                          // homing
    end
end
14
Reinforcement Learning with Homing
Algorithm 2: model-based algorithm. (h, a, o) denotes the history h followed by the action-observation pair (a, o). A POMDP is equivalent to an MDP whose states are histories, so we can compute the policy π̂_t from the estimated history-based model. Again, there is no dependence between the t-th and the (t+1)-th phase. Instead of trying every policy as in Algorithm 1, Algorithm 2 estimates sparse model parameters and computes the policy from them.
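A sketch of the "histories as MDP states" view: build empirical next-observation and reward statistics for each observed history from the random-walk data, then run a finite-horizon recursion over history-states. The data layout and estimator below are my own illustration, not the paper's exact construction.

```python
from collections import defaultdict

def estimate_history_model(episodes):
    """episodes: list of trajectories [(a1, r1, o1), (a2, r2, o2), ...] collected
    during exploration. Returns the empirical next-observation distribution
    P(o | h, a) and average reward R(h, a), with a history h represented as a
    tuple of (a, o) pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    rewards = defaultdict(list)
    for ep in episodes:
        h = ()
        for a, r, o in ep:
            counts[(h, a)][o] += 1
            rewards[(h, a)].append(r)
            h = h + ((a, o),)                 # the history h followed by (a, o)
    P = {ha: {o: c / sum(cs.values()) for o, c in cs.items()} for ha, cs in counts.items()}
    R = {ha: sum(rs) / len(rs) for ha, rs in rewards.items()}
    return P, R

def value(h, steps, actions, P, R):
    """Finite-horizon value of the history-state h in the estimated history MDP
    (histories play the role of MDP states, as noted above)."""
    if steps == 0:
        return 0.0
    best = 0.0
    for a in actions:
        if (h, a) not in P:
            continue                          # no data for this history-action pair
        q = R[(h, a)] + sum(p * value(h + ((a, o),), steps - 1, actions, P, R)
                            for o, p in P[(h, a)].items())
        best = max(best, q)
    return best
```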
15
Conclusion. The authors give an approach to approximate reset for the setting where the agent is never allowed to be reset during lifelong learning. A model-free algorithm and a model-based algorithm are proposed; the model-free algorithm is inefficient.
16
References
Eyal Even-Dar, Sham M. Kakade, Yishay Mansour, "Reinforcement Learning in POMDPs without Resets", 19th IJCAI, 2005.
Mance E. Harmon, Stephanie S. Harmon, "Reinforcement Learning: A Tutorial".
Website about reinforcement learning: