Published by Bertha Simon. Modified over 9 years ago.
1
General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning Duke University Machine Learning Group Discussion Leader: Kai Ni June 17, 2005
2
Outline: Reinforcement Learning; Explicit Explore or Exploit (E³) algorithm; Implicit Explore or Exploit (R-Max) algorithm; Conclusions
3
What is Reinforcement Learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment. Two strategies for solving reinforcement-learning problems: –Search in the space of behaviors to find the best performance; –Use statistical techniques and dynamic programming methods to estimate the utility of taking actions in states of the world.
4
Reinforcement Learning Model. Formally, the model consists of a discrete set of environment states, S; a discrete set of agent actions, A; and a set of scalar reinforcement signals, typically {0,1} or the real numbers. Figure 1: The standard reinforcement-learning model.
5
Example Dialogue The environment is non-deterministic but stationary.
6
Some Measurements. Models of optimal behavior: – Finite-horizon; – Infinite-horizon; – Average-reward. Learning performance: convergence rate and speed of convergence.
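The formulas for the three models of optimal behavior did not survive in the slide text. Assuming the slide follows the standard definitions in the Kaelbling, Littman and Moore survey listed in the references, they would read roughly as follows, where r_t is the reward at step t, h is the horizon, and γ is the discount factor (all symbol names are assumptions, not taken from the slide):

```latex
\begin{align*}
\text{Finite-horizon:} \quad & E\!\left[\sum_{t=0}^{h} r_t\right] \\
\text{Infinite-horizon (discounted):} \quad & E\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right], \qquad 0 \le \gamma < 1 \\
\text{Average-reward:} \quad & \lim_{h \to \infty} E\!\left[\frac{1}{h} \sum_{t=0}^{h} r_t\right]
\end{align*}
```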
7
Exploitation versus Exploration. One major difference between reinforcement learning and supervised learning is that a reinforcement learner must explicitly explore its environment. The simplest traditional reinforcement-learning problem is the K-armed bandit problem: K gambling machines, h pulls. How do you decide which machine to pull?
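One common heuristic for the bandit question is epsilon-greedy: pull a random machine with small probability, otherwise pull the machine with the best observed average. The slide does not prescribe this strategy; the sketch below is only an illustration, and the function name, the Bernoulli payoffs, and the parameter values are all assumptions.

```python
import random

def epsilon_greedy_bandit(arm_probs, pulls, epsilon=0.1, seed=0):
    """Play a K-armed Bernoulli bandit for `pulls` pulls using epsilon-greedy.

    arm_probs: hypothetical success probability of each machine
               (hidden from the agent, used only to simulate payoffs).
    Returns (total_reward, pull_counts).
    """
    rng = random.Random(seed)
    k = len(arm_probs)
    counts = [0] * k      # number of pulls per machine
    means = [0.0] * k     # running average payoff per machine
    total = 0
    for _ in range(pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                 # explore: random machine
        else:
            arm = max(range(k), key=means.__getitem__)  # exploit: best average so far
        reward = 1 if rng.random() < arm_probs[arm] else 0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
        total += reward
    return total, counts

total, counts = epsilon_greedy_bandit([0.2, 0.5, 0.8], pulls=1000)
```

Setting `epsilon=0` gives pure exploitation, which can lock onto a poor machine; that failure mode is what motivates the explicit exploration discussed in the rest of the talk.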
8
Markov Decision Process Model. The MDP is defined by the tuple ⟨S, A, T, R⟩: S is a finite set of states of the world; A is a finite set of actions; T: S × A → Π(S) is the state-transition function, giving the probability that an action changes the world state from one state to another, T(s, a, s’); R: S × A → ℝ is the reward function, giving the reward for the agent in a given world state after performing an action, R(s, a). The agent does not know the parameters of this process.
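As a rough illustration of this definition, a finite MDP can be stored as two lookup tables keyed by (state, action). The class name, field layout, and the two-state toy example below are assumptions made for the sketch, not something taken from the slides.

```python
import random
from dataclasses import dataclass

@dataclass
class MDP:
    states: list
    actions: list
    T: dict  # T[(s, a)] -> {next_state: probability}
    R: dict  # R[(s, a)] -> scalar reward

    def validate(self):
        """Check that every T(s, a, .) is a probability distribution."""
        for s in self.states:
            for a in self.actions:
                total = sum(self.T[(s, a)].values())
                if abs(total - 1.0) > 1e-9:
                    raise ValueError(f"T({s}, {a}, .) sums to {total}")

    def step(self, s, a, rng):
        """Sample a next state from T(s, a, .) and return it with R(s, a)."""
        dist = self.T[(s, a)]
        next_states = list(dist)
        weights = [dist[t] for t in next_states]
        s_next = rng.choices(next_states, weights=weights)[0]
        return s_next, self.R[(s, a)]

# Hypothetical two-state example: staying in "s1" pays 1, everything else pays 0.
toy = MDP(
    states=["s0", "s1"],
    actions=["stay", "move"],
    T={
        ("s0", "stay"): {"s0": 1.0},
        ("s0", "move"): {"s0": 0.2, "s1": 0.8},
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "move"): {"s0": 0.8, "s1": 0.2},
    },
    R={("s0", "stay"): 0.0, ("s0", "move"): 0.0,
       ("s1", "stay"): 1.0, ("s1", "move"): 0.0},
)
toy.validate()
```

In the learning setting the agent only ever sees the outputs of `step`; the tables `T` and `R` themselves are hidden from it.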
9
Near-Optimal Learning in Polynomial Time. We call the value of the lower bound on T given above the ε-horizon time for the discounted MDP M.
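The lemma that gives this lower bound is missing from the slide text. Assuming it is the bound stated by Kearns and Singh, T ≥ (1/(1−γ)) · log(R_max / (ε(1−γ))) with a natural logarithm, the horizon time can be computed as in the sketch below. The function name and argument names are hypothetical.

```python
import math

def epsilon_horizon_time(gamma, r_max, epsilon):
    """Smallest integer T satisfying the assumed Kearns-Singh bound
    T >= (1 / (1 - gamma)) * ln(r_max / (epsilon * (1 - gamma)))."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    if r_max <= 0 or epsilon <= 0:
        raise ValueError("r_max and epsilon must be positive")
    bound = math.log(r_max / (epsilon * (1.0 - gamma))) / (1.0 - gamma)
    return max(1, math.ceil(bound))

T = epsilon_horizon_time(gamma=0.9, r_max=1.0, epsilon=0.1)
```

The bound grows roughly like 1/(1−γ), so a discount factor close to 1 makes the effective horizon, and therefore the learning time, much longer.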
10
Proof of the Lemma. The lower bound follows from the definitions, since all expected payoffs are nonnegative. For the upper bound, fix any infinite path p, and let R_1, R_2, ... be the expected payoffs along this path.
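The rest of the argument is cut off in the slide text. A sketch of how the upper bound is usually completed, assuming the discounted return of the path is written as R_1 + γR_2 + γ²R_3 + ... and every payoff is at most R_max (both are assumptions about notation the slide no longer shows):

```latex
\begin{align*}
\sum_{k=T+1}^{\infty} \gamma^{k-1} R_k
  &\le R_{\max} \sum_{k=T+1}^{\infty} \gamma^{k-1}
   = \frac{R_{\max}\,\gamma^{T}}{1-\gamma}, \\
\frac{R_{\max}\,\gamma^{T}}{1-\gamma} \le \epsilon
  &\iff T \ge \frac{\ln\!\big(R_{\max} / (\epsilon(1-\gamma))\big)}{\ln(1/\gamma)}, \\
\ln(1/\gamma) \ge 1-\gamma
  &\implies T \ge \frac{1}{1-\gamma}\,\ln\!\frac{R_{\max}}{\epsilon(1-\gamma)}
   \ \text{ is sufficient.}
\end{align*}
```

So truncating any path after T steps loses at most ε of discounted return, and the same holds in expectation over paths.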
11
The Explicit Explore or Exploit (E³) Algorithm. Model-based – Maintain a model for the transition probabilities and the expected payoffs for some subset of the states of the unknown MDP M. Balanced wandering – Take an arbitrary action from an “unknown state”; enough visits to a state make it a “known state”. Known-state MDP M_S – Induced on the set of currently known states S; all of the unknown states are represented by a single additional, absorbing state s_0.
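The known-state MDP can be sketched as a small transformation of the empirical model: probability mass that leaves the known set is redirected to the absorbing state. The function name, the dictionary layout, the label "s0", and the zero payoff at the absorbing state are assumptions made for this illustration.

```python
def induce_known_state_mdp(known, actions, T_hat, R_hat,
                           absorbing="s0", absorbing_reward=0.0):
    """Build (T_S, R_S) over the known states plus one absorbing state.

    T_hat[(s, a)] -> {next_state: estimated probability}
    R_hat[(s, a)] -> estimated payoff
    """
    T_s, R_s = {}, {}
    for s in known:
        for a in actions:
            dist = {}
            for t, p in T_hat[(s, a)].items():
                # Any transition out of the known set is folded into s0.
                target = t if t in known else absorbing
                dist[target] = dist.get(target, 0.0) + p
            T_s[(s, a)] = dist
            R_s[(s, a)] = R_hat[(s, a)]
    for a in actions:
        T_s[(absorbing, a)] = {absorbing: 1.0}  # s0 loops back to itself
        R_s[(absorbing, a)] = absorbing_reward
    return T_s, R_s

# Hypothetical estimates: states "C" and "D" are still unknown.
T_hat = {("A", "x"): {"B": 0.5, "C": 0.3, "D": 0.2}, ("B", "x"): {"A": 1.0}}
R_hat = {("A", "x"): 0.4, ("B", "x"): 1.0}
T_s, R_s = induce_known_state_mdp({"A", "B"}, ["x"], T_hat, R_hat)
```

Raising `absorbing_reward` and lowering the other payoffs turns the same construction into a model whose optimal policy heads for the unknown states, which is one way to read the exploration variant used by the algorithm.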
12
Initialization – The set S of known states is empty. Balanced wandering – Any time the current state is not in S, the algorithm performs balanced wandering. Discovery of new known states – Any time a state i has been visited m_known times, it enters the known set S. Off-line optimization – Upon reaching a known state i in S, the algorithm performs the two off-line optimal policy computations on M_S and M’_S: – Attempted exploitation: if the resulting exploitation policy achieves a return from i in M_S that is at least the required threshold, the algorithm executes that policy for the next T steps. – Attempted exploration: otherwise, the algorithm executes the exploration policy derived from M’_S for T steps of exploration.
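The bookkeeping behind the first three steps can be sketched as follows. The class and method names are hypothetical, and the least-tried action rule is the usual reading of balanced wandering in Kearns and Singh rather than something the slide spells out. The off-line optimization step is not covered here.

```python
from collections import defaultdict

class BalancedWanderer:
    """Tracks visits and decides when a state enters the known set S."""

    def __init__(self, actions, m_known):
        self.actions = list(actions)
        self.m_known = m_known
        # action_counts[state][action] = times the action was tried in that state
        self.action_counts = defaultdict(lambda: {a: 0 for a in self.actions})
        self.known = set()  # initialization: the known set starts empty

    def is_known(self, state):
        return state in self.known

    def choose_action(self, state):
        """Balanced wandering: pick the action tried least often in this state."""
        counts = self.action_counts[state]
        return min(self.actions, key=counts.__getitem__)

    def record_visit(self, state, action):
        """Count the visit; promote the state after m_known visits."""
        counts = self.action_counts[state]
        counts[action] += 1
        if sum(counts.values()) >= self.m_known:
            self.known.add(state)

wanderer = BalancedWanderer(actions=["a", "b"], m_known=4)
for _ in range(3):
    wanderer.record_visit("s", wanderer.choose_action("s"))
known_after_three = wanderer.is_known("s")
wanderer.record_visit("s", wanderer.choose_action("s"))
```

Spreading the visits evenly over the actions means that by the time a state becomes known, every action in it has been sampled about m_known / |A| times, which is what makes the estimated transition probabilities reliable.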
13
Explore or Exploit Lemma
14
R-Max – the implicit explore or exploit algorithm. In the spirit of the E³ algorithm, R-Max is a general polynomial-time algorithm for near-optimal reinforcement learning. The agent does not know whether its behavior is exploitation or exploration. However, it knows that it will either optimize or learn efficiently. R-Max is described in the context of a stochastic game (SG), which also considers the actions of the adversary. (Maybe useful for the moving-target problem?)
15
SG and MDP. An MDP is an SG in which the adversary has a single action at each state.
                 SG                    MDP
State            G_i                   S_i
Action           (a, a’)               a
Transition       P_M(s, t, a, a’)      P_M(s, t, a)
Unknown state    G_0                   S_0
Reward           Matrix on each G_i    R(s, a)
16
Initialization – Construct a model M’ consisting of N+1 stage games, {G_0, G_1, …, G_N}. G_0 is an additional fictitious game. Initialize all game matrices to have (R_max, 0) in all entries. Initialize P_M(G_i, G_0, a, a’) = 1 for all i and all actions a, a’. Compute and act – Compute an optimal T-step policy for the current state, and execute it for T steps or until a new entry becomes known. Observe and update: – Update the reward for (a, a’) in the state G_i. – Update the set of states reached by playing (a, a’) in G_i. – If the record of states reached from this entry contains the required number of elements, mark this entry as KNOWN and update the transition matrix for this entry.
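The loop above can be sketched for the single-agent MDP case, where the adversary has only one action. This is a simplified illustration under stated assumptions, not the authors' implementation: the class name, the label "G0", the threshold parameter `k_known`, deterministic rewards per entry, and replanning after every step (rather than committing to a T-step policy) are all choices made for the sketch.

```python
from collections import defaultdict

class RMaxAgent:
    def __init__(self, states, actions, r_max, horizon, k_known):
        self.states, self.actions = list(states), list(actions)
        self.r_max, self.horizon, self.k_known = r_max, horizon, k_known
        self.fictitious = "G0"                                   # extra fictitious state
        self.next_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': n}
        self.reward = {}                                         # observed reward per (s, a)
        self.known = set()                                       # entries marked KNOWN

    def _model(self, s, a):
        """Optimistic model: unknown entries pay r_max and lead to G0."""
        if s == self.fictitious or (s, a) not in self.known:
            return self.r_max, {self.fictitious: 1.0}
        counts = self.next_counts[(s, a)]
        n = sum(counts.values())
        return self.reward[(s, a)], {t: c / n for t, c in counts.items()}

    def plan(self, state):
        """Finite-horizon value iteration on the model (horizon >= 1);
        returns the first action of an optimal horizon-step policy."""
        all_states = self.states + [self.fictitious]
        value = {s: 0.0 for s in all_states}
        best = {}
        for _ in range(self.horizon):
            new_value = {}
            for s in all_states:
                q = {}
                for a in self.actions:
                    r, dist = self._model(s, a)
                    q[a] = r + sum(p * value[t] for t, p in dist.items())
                best[s] = max(self.actions, key=q.get)
                new_value[s] = q[best[s]]
            value = new_value
        return best[state]

    def update(self, s, a, reward, s_next):
        """Observe and update; mark the entry KNOWN after k_known samples."""
        if (s, a) in self.known:
            return
        self.reward[(s, a)] = reward
        self.next_counts[(s, a)][s_next] += 1
        if sum(self.next_counts[(s, a)].values()) >= self.k_known:
            self.known.add((s, a))
```

The agent never decides between exploring and exploiting. Because unknown entries look maximally rewarding in the model, the planned policy either collects near-optimal reward in the known part or walks into an unknown entry and learns something, which is the implicit trade-off the slides describe.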
17
Conclusion. The authors describe R-Max, a simple RL algorithm that converges to near-optimal reward in polynomial time. R-Max is an optimistic model-based algorithm in the spirit of the E³ algorithm. However, unlike E³, R-Max makes an implicit trade-off between exploration and exploitation.
18
Related to our work. This paper focuses on proving that such an algorithm exists and on discussing its optimality and convergence, while the detailed MDP solution is not addressed. We may be able to use our POMDP solver in this framework to extend it. This algorithm does not require a random walk to learn the environment in advance, which may be interesting for our robot navigation problem.
19
References. R. Brafman and M. Tennenholtz, “R-MAX – A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning”, Journal of Machine Learning Research, 2002. M. Kearns and S. Singh, “Near-Optimal Reinforcement Learning in Polynomial Time”, ICML 1998. L. P. Kaelbling, M. L. Littman and A. W. Moore, “Reinforcement Learning: A Survey”, Journal of Artificial Intelligence Research, 1996.