1 Reinforcement Learning in MDPs by Least-Squares Policy Iteration
Presented by Lihan He, Machine Learning Reading Group, Duke University, 09/16/2005

2 Outline
MDP and Q-function
Value function approximation
LSPI: Least-Squares Policy Iteration
Proto-value functions
RPI: Representation Policy Iteration

3 Markov Decision Process (MDP)
An MDP is a model M = <S, A, T, R> with:
a set of environment states S,
a set of actions A,
a transition function T: S × A × S → [0,1], with T(s,a,s') = P(s'|s,a),
a reward function R: S × A → ℝ.
A policy is a function π: S → A.
The value function (expected cumulative reward) Vπ: S → ℝ satisfies the Bellman equation:
Vπ(s) = R(s, π(s)) + γ Σ_s' P(s'|s, π(s)) Vπ(s')
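As a concrete illustration of the Bellman equation above (not part of the original slides), here is a minimal policy-evaluation sketch for a hypothetical two-state, two-action MDP; all transition probabilities and rewards are made up for the example.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (numbers invented for illustration).
# P[s, a, s'] = transition probability, R[s, a] = reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9
policy = np.array([0, 1])        # pi(s0) = a0, pi(s1) = a1

# Iterate V(s) <- R(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) V(s')
V = np.zeros(2)
for _ in range(1000):
    V_new = np.array([R[s, policy[s]] + gamma * P[s, policy[s]] @ V
                      for s in range(2)])
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
print(V)   # fixed point of the Bellman equation for this policy
```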

4 Markov Decision Process (MDP)
An example of a grid-world environment. [Figure: the grid world with goal reward +1, its optimal policy, and the corresponding value function.]

5 State-action value function Q
The state-action value function Qπ(s,a) of any policy π is defined over all possible combinations of states and actions and indicates the expected, discounted, total reward when taking action a in state s and following policy π thereafter:
Qπ(s,a) = R(s,a) + γ Σ_s' P(s'|s,a) Qπ(s', π(s'))
Given policy π, for each state-action pair we have a Qπ(s,a) value. In matrix format, the above Bellman equation becomes
Qπ = R + γ P Ππ Qπ
Qπ, R: vectors of size |S||A|
P: stochastic matrix of size |S||A| × |S|, with P((s,a), s') = P(s'|s,a)
Ππ: matrix of size |S| × |S||A| that selects the action prescribed by π, i.e. Ππ(s', (s',a')) = 1 if a' = π(s') and 0 otherwise.
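To make the matrix form concrete, this sketch (mine, reusing the same made-up two-state model) solves Qπ = R + γ P Ππ Qπ exactly by stacking state-action pairs into vectors of size |S||A|.

```python
import numpy as np

nS, nA, gamma = 2, 2, 0.9
# Hypothetical model: P[s, a, s'], R[s, a] (same toy numbers as before).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
policy = np.array([0, 1])

# Build the |S||A| x |S| transition matrix and the |S| x |S||A| policy matrix.
P_sa = P.reshape(nS * nA, nS)                 # row (s,a) -> P(s'|s,a)
Pi = np.zeros((nS, nS * nA))
for s in range(nS):
    Pi[s, s * nA + policy[s]] = 1.0           # selects the pair (s', pi(s'))

R_vec = R.reshape(nS * nA)
# Q = R + gamma * P * Pi * Q  =>  (I - gamma * P * Pi) Q = R
Q = np.linalg.solve(np.eye(nS * nA) - gamma * P_sa @ Pi, R_vec)
print(Q.reshape(nS, nA))                      # Q-value for every (s, a) pair
```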

6 How does policy iteration work?
Q = R +  P Q Value Function Q Policy  Policy improvement Value Evaluation Model For model-free reinforcement learning, we don’t have model P.

7 Value function approximation
Let Q̂π be an approximation to Qπ with free parameters w:
Q̂π(s,a; w) = Σ_{j=1..k} φj(s,a) wj, i.e. Q̂π = Φ w,
where Φ is the |S||A| × k matrix whose rows are the feature vectors φ(s,a)ᵀ. That is, Q-values are approximated by a linear parametric combination of k basis functions. The basis functions are fixed, but arbitrarily selected (non-linear) functions of s and a. Note that Qπ is a vector of size |S||A|. If k = |S||A| and the bases are linearly independent, we can find w such that Q̂π = Qπ exactly. In general k ≪ |S||A|: we use a linear combination of only a few bases to approximate the value function Qπ. Solving for w lets us evaluate Q̂π and obtain an updated policy.
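As a sketch of what Q̂π = Φ w means in code: each state-action pair gets a k-dimensional feature vector (a row of Φ), and the approximate Q-value is a dot product with w. The feature map and weight values below are invented purely for illustration.

```python
import numpy as np

nS, nA, k = 2, 2, 3

def phi(s, a):
    """Hypothetical feature map phi(s, a) with k = 3 components."""
    return np.array([1.0, float(s), float(s) * (a == 1)])

# Phi stacks phi(s, a) for every state-action pair: shape |S||A| x k.
Phi = np.array([phi(s, a) for s in range(nS) for a in range(nA)])
w = np.array([0.5, -0.2, 1.0])           # some weight vector (illustrative)

Q_hat = Phi @ w                          # approximate Q, one entry per (s, a)
print(Q_hat.reshape(nS, nA))
```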

8 Value function approximation
Examples of basis functions:
Polynomials, using the indicator function I(a = ai) to decouple actions so that each action gets its own parameters
Radial basis functions (RBFs)
Proto-value functions
Other manually designed bases based on the specific problem
A minimal sketch of the first two families follows.
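The sketch below builds the polynomial and RBF families for a one-dimensional (chain-like) state space; the polynomial degree, RBF centers, and widths are arbitrary illustrative choices. The action indicator gives each action its own copy of the state features.

```python
import numpy as np

def poly_basis(s, a, num_actions=2, degree=2):
    """Polynomial state features [1, s, s^2, ...], decoupled per action by the
    indicator I(a == a_i): only the block belonging to action a is nonzero."""
    state_feats = np.array([float(s) ** d for d in range(degree + 1)])
    feats = np.zeros((num_actions, degree + 1))
    feats[a] = state_feats
    return feats.ravel()                  # length num_actions * (degree + 1)

def rbf_basis(s, a, centers=(0.0, 5.0, 10.0), width=2.0, num_actions=2):
    """Gaussian radial basis functions over the state, decoupled per action."""
    state_feats = np.exp(-(s - np.asarray(centers)) ** 2 / (2.0 * width ** 2))
    feats = np.zeros((num_actions, len(centers)))
    feats[a] = state_feats
    return feats.ravel()

print(poly_basis(3, 1))   # [0, 0, 0, 1, 3, 9]: features sit in the a = 1 block
print(rbf_basis(3, 0))    # RBF activations in the a = 0 block, zeros elsewhere
```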

9 Value function approximation
Least-squares fixed-point approximation. Let Q̂π = Φ w. We require that applying the Bellman backup and projecting the result back onto the space spanned by Φ leaves Q̂π unchanged. Using Qπ = R + γ P Ππ Qπ, and remembering that Q̂π is the projection of the backed-up values onto the Φ space, projection theory gives
Φ w = Φ (ΦᵀΦ)⁻¹ Φᵀ (R + γ P Ππ Φ w)
and finally we get
Φᵀ(Φ − γ P Ππ Φ) w = Φᵀ R, i.e. w = (Φᵀ(Φ − γ P Ππ Φ))⁻¹ Φᵀ R
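A sketch of the model-based version of this solution on the same made-up two-state MDP: build Φ, solve the fixed-point system for w, and compare Φ w with the exact Qπ. Since k = |S||A| here and the bases are independent, the two coincide, matching the remark on the previous slide.

```python
import numpy as np

nS, nA, gamma = 2, 2, 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
policy = np.array([0, 1])

def phi(s, a):
    # Hypothetical k = 4 features: a per-action bias and the state index.
    f = np.zeros(4)
    f[2 * a:2 * a + 2] = [1.0, float(s)]
    return f

Phi = np.array([phi(s, a) for s in range(nS) for a in range(nA)])   # |S||A| x k
P_sa = P.reshape(nS * nA, nS)
Pi = np.zeros((nS, nS * nA))
Pi[np.arange(nS), np.arange(nS) * nA + policy] = 1.0

# Least-squares fixed point:  Phi^T (Phi - gamma P Pi Phi) w = Phi^T R
A = Phi.T @ (Phi - gamma * P_sa @ Pi @ Phi)
b = Phi.T @ R.reshape(-1)
w = np.linalg.solve(A, b)

Q_exact = np.linalg.solve(np.eye(nS * nA) - gamma * P_sa @ Pi, R.reshape(-1))
print(Phi @ w)       # approximate Q-values
print(Q_exact)       # exact Q-values for comparison
```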

10 Least-Squares Policy Iteration
Solving for the parameter w is equivalent to solving the linear system A w = b, where
A = Φᵀ(Φ − γ P Ππ Φ) = Σ_{s,a,s'} P(s'|s,a) φ(s,a) (φ(s,a) − γ φ(s', π(s')))ᵀ
b = Φᵀ R = Σ_{s,a,s'} P(s'|s,a) φ(s,a) R(s,a)
That is, A is the sum of many small matrices and b is the sum of many vectors, weighted by the transition probabilities.

11 Least-Squares Policy Iteration
If we sample data from the underlying transition probabilities, with samples D = {(si, ai, ri, s'i)}, then A and b can be learned in block form as
A ≈ Σ_i φ(si,ai) (φ(si,ai) − γ φ(s'i, π(s'i)))ᵀ, b ≈ Σ_i φ(si,ai) ri
or in real time, one sample at a time:
A ← A + φ(st,at) (φ(st,at) − γ φ(s't, π(s't)))ᵀ, b ← b + φ(st,at) rt
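A sketch of the sample-based estimate (the LSTDQ step of LSPI): given samples (s, a, r, s'), accumulate A and b one sample at a time and solve for w. The feature map, sample values, and fixed policy below are hypothetical.

```python
import numpy as np

def lstdq(samples, phi, k, gamma, policy):
    """Estimate A and b from samples (s, a, r, s_next), then solve A w = b."""
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))       # next action chosen by pi
        A += np.outer(f, f - gamma * f_next)       # one real-time update per sample
        b += r * f
    return np.linalg.lstsq(A, b, rcond=None)[0]    # least squares in case A is singular

# Tiny demo with a made-up feature map, samples, and a fixed policy.
def phi(s, a, num_actions=2):
    f = np.zeros(2 * num_actions)
    f[2 * a:2 * a + 2] = [1.0, float(s)]
    return f

samples = [(0, 0, 1.0, 0), (0, 1, 0.0, 1), (1, 0, 0.0, 1), (1, 1, 2.0, 0)]
w = lstdq(samples, phi, k=4, gamma=0.9, policy=lambda s: 0)
print(w)
```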

12 Least-Squares Policy Iteration
Input: D, k, φ, γ, ε, π0(w0)
π' ← π0
repeat
  π ← π'
  w ← LSTDQ(D, k, φ, γ, π)        % value function update; could use the real-time update instead
  π'(s) ← argmax_a φ(s,a)ᵀ w      % policy update
until the distance between successive weight vectors is below ε
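Putting the pieces together, a self-contained sketch of the LSPI loop: alternate the LSTDQ estimate with greedy policy extraction until the weight vector stops changing. Function names, the feature map, and the toy samples are mine, not from the slides.

```python
import numpy as np

def lspi(samples, phi, k, actions, gamma=0.9, eps=1e-6, max_iters=50):
    """LSPI sketch: alternate LSTDQ (value update) and greedy policy extraction."""
    w = np.zeros(k)
    for _ in range(max_iters):
        policy = lambda s, w=w: max(actions, key=lambda a: phi(s, a) @ w)  # greedy in Q_hat
        A = np.zeros((k, k))
        b = np.zeros(k)
        for s, a, r, s_next in samples:                                    # LSTDQ over the batch
            f = phi(s, a)
            A += np.outer(f, f - gamma * phi(s_next, policy(s_next)))
            b += r * f
        w_new = np.linalg.lstsq(A, b, rcond=None)[0]
        if np.linalg.norm(w_new - w) < eps:                                # convergence test
            return w_new
        w = w_new
    return w

def phi(s, a, num_actions=2):
    f = np.zeros(2 * num_actions)
    f[2 * a:2 * a + 2] = [1.0, float(s)]
    return f

samples = [(0, 0, 1.0, 0), (0, 1, 0.0, 1), (1, 0, 0.0, 0), (1, 1, 2.0, 1)]
w = lspi(samples, phi, k=4, actions=[0, 1])
print(w)          # greedy policy is then argmax_a phi(s, a) @ w
```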

13 Proto-Value Functions: how to choose basis functions in the LSPI algorithm?
Proto-value functions are good bases for value function approximation:
There is no need to design bases manually; the data tell us what the corresponding proto-value functions are.
They are generated from the topology of the underlying state space, without estimating the underlying state transition probabilities.
They capture the intrinsic smoothness constraints that true value functions have.

14 Proto-Value Functions
1. Graph representation of the underlying state transitions. [Figure: an undirected graph over states s1 through s11.]
2. Adjacency matrix A. [Figure: the corresponding 11 × 11 0/1 adjacency matrix.]

15 Proto-Value Functions
3. Combinatorial Laplacian L: L = T − A,
where T is the diagonal matrix whose entries are the row sums of the adjacency matrix A.
4. Proto-value functions: the eigenvectors of the combinatorial Laplacian L. Each eigenvector provides one basis φj(s); combined with the indicator function for action a, we get φj(s,a) = φj(s) I(a = ai).
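A sketch of steps 1 through 4 for a small, hypothetical chain-shaped state graph: adjacency matrix, combinatorial Laplacian, lowest-order eigenvectors as proto-value functions, and the action indicator to form φj(s,a). Sizes and the graph shape are illustrative.

```python
import numpy as np

n_states, n_actions, k = 10, 2, 4

# 1-2. Adjacency matrix of a hypothetical chain-shaped state graph: s_i ~ s_{i+1}.
A = np.zeros((n_states, n_states))
for i in range(n_states - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0

# 3. Combinatorial Laplacian L = T - A, T = diagonal of row sums of A.
T = np.diag(A.sum(axis=1))
L = T - A

# 4. Proto-value functions: the k lowest-order eigenvectors of L.
eigvals, eigvecs = np.linalg.eigh(L)           # eigh: L is symmetric
proto = eigvecs[:, np.argsort(eigvals)[:k]]    # columns = proto-value functions

def phi(s, a):
    """Combine proto-value functions with the action indicator I(a = a_i)."""
    f = np.zeros(n_actions * k)
    f[a * k:(a + 1) * k] = proto[s]
    return f

print(phi(3, 1))   # the k proto-value features of state 3, placed in the a = 1 block
```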

16 Example of proto-value functions
Grid world with 1260 states. [Figures: the grid-world layout with goal G, the adjacency matrix, and a zoomed-in view of the adjacency matrix.]

17 Proto-value functions: Low-order eigenvectors as basis functions

18 Optimal value function
Value function approximation using 10 proto-value functions as bases

19 Representation Policy Iteration (offline)
Input: D, k, γ, ε, π0(w0)
1. Construct basis functions:
Use the sample set D to learn a graph that encodes the underlying state-space topology.
Compute the k lowest-order eigenvectors of the combinatorial Laplacian on the graph.
The basis functions φ(s,a) are produced by combining the k proto-value functions with the indicator function of action a.
2. π' ← π0
3. repeat
  π ← π'
  w ← LSTDQ(D, k, φ, γ, π)        % value function update
  π'(s) ← argmax_a φ(s,a)ᵀ w      % policy update
until the distance between successive weight vectors is below ε
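A compact, self-contained sketch of this offline recipe under a made-up sample format: learn the state graph from the observed transitions, derive proto-value features, then run the LSPI loop on those features. All names and numbers are illustrative.

```python
import numpy as np

def rpi_offline(samples, n_states, n_actions, k, gamma=0.9, eps=1e-6, max_iters=50):
    # 1. Basis construction: graph from observed transitions -> Laplacian -> eigenvectors.
    A = np.zeros((n_states, n_states))
    for s, a, r, s_next in samples:
        A[s, s_next] = A[s_next, s] = 1.0                  # undirected state graph
    L = np.diag(A.sum(axis=1)) - A
    _, vecs = np.linalg.eigh(L)
    proto = vecs[:, :k]                                    # k lowest-order eigenvectors

    def phi(s, a):
        f = np.zeros(n_actions * k)
        f[a * k:(a + 1) * k] = proto[s]
        return f

    # 2-3. LSPI on the proto-value features.
    w = np.zeros(n_actions * k)
    for _ in range(max_iters):
        greedy = lambda s, w=w: max(range(n_actions), key=lambda a: phi(s, a) @ w)
        M = np.zeros((n_actions * k, n_actions * k))
        b = np.zeros(n_actions * k)
        for s, a, r, s_next in samples:
            f = phi(s, a)
            M += np.outer(f, f - gamma * phi(s_next, greedy(s_next)))
            b += r * f
        w_new = np.linalg.lstsq(M, b, rcond=None)[0]
        if np.linalg.norm(w_new - w) < eps:
            break
        w = w_new
    return w, phi

samples = [(0, 1, 0.0, 1), (1, 1, 0.0, 2), (2, 1, 1.0, 3), (3, 0, 0.0, 2)]
w, phi = rpi_offline(samples, n_states=4, n_actions=2, k=2)
print([max(range(2), key=lambda a: phi(s, a) @ w) for s in range(4)])   # greedy policy
```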

20 Representation Policy Iteration (online)
Input: D0, k, γ, ε, π0(w0)
1. Initialization: using the offline algorithm with D0, k, γ, ε, π0, learn policy π(0) and its weight vector w(0).
2. π' ← π(0)
3. repeat
(a) π(t) ← π'
(b) Execute π(t) to collect new data D(t) = {st, at, rt, s't}.
(c) If the new samples D(t) change the topology of the graph G, compute a new set of basis functions.
(d) w(t) ← LSTDQ(D, k, φ, γ, π(t))     % value function update
(e) π'(s) ← argmax_a φ(s,a)ᵀ w(t)      % policy update
until the distance between successive weight vectors is below ε

21 Example: Chain MDP, rewards +1 at states 10 and 41, 0 otherwise.
Optimal policy: 1-9 and 26-41: Right; 10-25 and 42-50: Left.
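A sketch of how this chain MDP could be constructed and sampled. The slide itself only specifies the rewards; the 0.9/0.1 action-noise dynamics assumed below follow the standard chain-walk benchmark of Lagoudakis & Parr.

```python
import numpy as np

# Chain MDP sketch: 50 states, reward +1 in states 10 and 41, 0 elsewhere.
# Assumed dynamics: each action succeeds with probability 0.9 and moves the
# agent the opposite way with probability 0.1 (chain-walk benchmark assumption).
n_states = 50
LEFT, RIGHT = 0, 1

P = np.zeros((n_states, 2, n_states))
for s in range(n_states):
    left, right = max(s - 1, 0), min(s + 1, n_states - 1)
    P[s, LEFT, left] += 0.9
    P[s, LEFT, right] += 0.1
    P[s, RIGHT, right] += 0.9
    P[s, RIGHT, left] += 0.1

R = np.zeros((n_states, 2))
R[9, :] = R[40, :] = 1.0      # slide numbers states 1..50; indices are 0-based here

# Samples for LSPI/RPI can then be drawn by a random walk on this model:
rng = np.random.default_rng(0)
s = rng.integers(n_states)
samples = []
for _ in range(5000):
    a = rng.integers(2)
    s_next = rng.choice(n_states, p=P[s, a])
    samples.append((s, a, R[s, a], s_next))
    s = s_next
```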

22 Example: Chain MDP, with 20 basis functions used. [Figure: the value function and its approximation at each iteration.]

23 Performance comparison
Example: Chain MDP. [Figures: policy L1 error with respect to the optimal policy, and number of steps to convergence.]

24 References
M. Lagoudakis & R. Parr, "Least-Squares Policy Iteration." Journal of Machine Learning Research 4 (2003). -- Gives the LSPI algorithm for reinforcement learning.
S. Mahadevan, "Proto-Value Functions: Developmental Reinforcement Learning." Proceedings of ICML 2005. -- How to build basis functions for the LSPI algorithm.
C. Kwok & D. Fox, "Reinforcement Learning for Sensing Strategies." Proceedings of IROS 2004. -- An application of LSPI.

