Reinforcement Learning in MDPs by Least-Squares Policy Iteration
Presented by Lihan He Machine Learning Reading Group Duke University 09/16/2005
Outline
MDP and Q-function
Value function approximation
LSPI: Least-Squares Policy Iteration
Proto-value functions
RPI: Representation Policy Iteration
Markov Decision Process (MDP)
An MDP is a model M = ⟨S, A, T, R⟩ with
a set of environment states S,
a set of actions A,
a transition function T: S × A × S → [0,1], T(s,a,s') = P(s'|s,a),
a reward function R: S × A → ℝ.
A policy is a function π: S → A.
The value function (expected cumulative reward) V^π: S → ℝ satisfies the Bellman equation:
V^π(s) = R(s, π(s)) + γ Σ_{s'} P(s'|s, π(s)) V^π(s')
Markov Decision Process (MDP)
An example grid-world environment. [Figure: the optimal policy and the value function; values (0.9, 0.8, 0.7, 0.6, 0.5) decrease with distance from the +1 goal state.]
State-action value function Q
The state-action value function Q^π(s,a) of a policy π is defined over all possible combinations of states and actions, and indicates the expected, discounted, total reward when taking action a in state s and following policy π thereafter:
Q^π(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) Q^π(s', π(s'))
Given a policy π, each state-action pair has a value Q^π(s,a). In matrix form, the Bellman equation above becomes
Q^π = R + γ P Π_π Q^π
where Q^π and R are vectors of size |S||A|; P is the stochastic matrix of size |S||A| × |S| with P((s,a), s') = P(s'|s,a); and Π_π is the |S| × |S||A| matrix that selects, in each next state s', the pair (s', π(s')).
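Since Q^π appears linearly, the matrix Bellman equation can be solved exactly for small MDPs. A minimal sketch (the 2-state, 2-action MDP and all names here are hypothetical, chosen only for illustration):

import numpy as np

# Hypothetical 2-state, 2-action MDP; the numbers are illustrative only.
S, A, gamma = 2, 2, 0.9
# P has one row per (s,a) pair (index s*A + a) and one column per next state s'.
P = np.array([[0.8, 0.2],   # (s=0, a=0)
              [0.1, 0.9],   # (s=0, a=1)
              [0.5, 0.5],   # (s=1, a=0)
              [0.0, 1.0]])  # (s=1, a=1)
R = np.array([0.0, 0.0, 0.0, 1.0])   # reward for each (s,a) pair
pi = np.array([1, 1])                # policy: action taken in each state

# Pi_pi[s', (s',a')] = 1 iff a' = pi(s'), so P @ Pi_pi maps Q to E[Q(s', pi(s'))].
Pi_pi = np.zeros((S, S * A))
for s in range(S):
    Pi_pi[s, s * A + pi[s]] = 1.0

# Rearranged Bellman equation: (I - gamma * P @ Pi_pi) Q = R.
Q = np.linalg.solve(np.eye(S * A) - gamma * P @ Pi_pi, R)
print(Q.reshape(S, A))   # one row per state, one column per action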
How does policy iteration work?
Policy iteration alternates two steps over the model:
Value evaluation: given the current policy π, solve Q^π = R + γ P Π_π Q^π for its value function.
Policy improvement: derive a better policy from the value function, π'(s) = argmax_a Q^π(s,a).
For model-free reinforcement learning, we do not have the model P.
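A sketch of this model-based loop, under the same illustrative conventions as above (the function name and representation are ours, not from the slides):

import numpy as np

def policy_iteration(P, R, S, A, gamma=0.9):
    # P: (S*A, S) transition matrix, R: (S*A,) reward vector.
    pi = np.zeros(S, dtype=int)          # start from an arbitrary policy
    while True:
        # Value evaluation: solve Q^pi = R + gamma * P @ Pi_pi @ Q^pi exactly.
        Pi_pi = np.zeros((S, S * A))
        for s in range(S):
            Pi_pi[s, s * A + pi[s]] = 1.0
        Q = np.linalg.solve(np.eye(S * A) - gamma * P @ Pi_pi, R)
        # Policy improvement: act greedily with respect to Q.
        pi_new = Q.reshape(S, A).argmax(axis=1)
        if np.array_equal(pi_new, pi):   # policy stable => done
            return pi, Q
        pi = pi_new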
Value function approximation
Let Q̂^π be an approximation to Q^π with free parameters w:
Q̂^π(s,a; w) = Σ_{j=1}^{k} φ_j(s,a) w_j
i.e., Q values are approximated by a linear parametric combination of k basis functions. The basis functions are fixed, but arbitrarily selected (possibly non-linear) functions of s and a.
Note that Q^π is a vector of size |S||A|. If k = |S||A| and the bases are independent, we can find w such that Q̂^π = Q^π exactly. In general k << |S||A|: we use a linear combination of only a few bases to approximate the value function Q^π.
Solving for w evaluates Q̂^π, from which we get the updated policy.
Value function approximation
Examples of basis functions:
Polynomials; use an indicator function I(a = a_i) to decouple the actions, so that each action gets its own parameters (a sketch of this construction follows below).
Radial basis functions (RBFs)
Proto-value functions
Other manually designed bases for specific problems
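A minimal sketch of the polynomial basis with the action indicator, assuming a scalar state and a small discrete action set (the function name and defaults are hypothetical):

import numpy as np

def poly_basis(s, a, n_actions=2, degree=3):
    # Polynomial features of the state: 1, s, s^2, ..., s^degree.
    poly = np.array([float(s) ** j for j in range(degree + 1)])
    # The indicator I(a = a_i) places the features in action a's own block,
    # so each action gets its own slice of the weight vector w.
    phi = np.zeros((degree + 1) * n_actions)
    phi[a * (degree + 1):(a + 1) * (degree + 1)] = poly
    return phi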
Value function approximation
Least-Squares Fixed-Point Approximation
Let Q̂^π = Φ w^π, where Φ is the |S||A| × k matrix whose rows are the feature vectors φ(s,a)ᵀ.
Using Q^π = R + γ P Π_π Q^π, and remembering that Q̂^π is the projection of Q^π onto the column space of Φ, the projection theorem finally gives
w^π = (Φᵀ(Φ − γ P Π_π Φ))⁻¹ Φᵀ R
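For intuition, a sketch of this closed-form solution when the model matrices are available (names follow the earlier sketches; note that sample-based LSPI never forms P explicitly):

import numpy as np

def lsq_fixed_point(Phi, P, Pi_pi, R, gamma=0.9):
    # Phi: (S*A, k) basis matrix; P: (S*A, S); Pi_pi: (S, S*A); R: (S*A,).
    A_mat = Phi.T @ (Phi - gamma * P @ Pi_pi @ Phi)   # k x k
    b = Phi.T @ R                                     # length k
    return np.linalg.solve(A_mat, b)                  # w^pi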
Least-Squares Policy Iteration
Solving for the parameter vector w^π is equivalent to solving the linear system A w^π = b, where
A = Φᵀ(Φ − γ P Π_π Φ) = Σ_{s,a,s'} P(s'|s,a) φ(s,a) (φ(s,a) − γ φ(s', π(s')))ᵀ
b = Φᵀ R = Σ_{s,a} φ(s,a) R(s,a)
so A is a sum of many rank-one matrices, and b is a sum of many vectors, each weighted by the corresponding transition probability.
Least-Squares Policy Iteration
If we sample data from the underlying transition probability, obtaining samples D = {(s_i, a_i, r_i, s'_i), i = 1, ..., L}, then A and b can be learned in block form as
Â = (1/L) Σ_{i=1}^{L} φ(s_i, a_i) (φ(s_i, a_i) − γ φ(s'_i, π(s'_i)))ᵀ
b̂ = (1/L) Σ_{i=1}^{L} φ(s_i, a_i) r_i
or in real time, one sample at a time:
Â_{t+1} = Â_t + φ(s_t, a_t) (φ(s_t, a_t) − γ φ(s'_t, π(s'_t)))ᵀ
b̂_{t+1} = b̂_t + φ(s_t, a_t) r_t
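This sample-based estimation fits in a few lines; a sketch assuming samples are (s, a, r, s') tuples and phi and pi are callables as in the earlier sketches:

import numpy as np

def lstdq(samples, phi, pi, k, gamma=0.9):
    # Accumulate A-hat and b-hat from samples; no model P is needed.
    A_hat = np.zeros((k, k))
    b_hat = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        # One rank-one term per sample (the real-time update above):
        A_hat += np.outer(f, f - gamma * phi(s_next, pi(s_next)))
        b_hat += f * r
    # With few samples A_hat can be singular; a small ridge term helps.
    return np.linalg.solve(A_hat + 1e-8 * np.eye(k), b_hat)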
Least-Squares Policy Iteration
Input: D, k, φ, γ, ε, π_0(w_0)
π' ← π_0
repeat
    π ← π'
    w ← solve Â w = b̂, with Â, b̂ learned from D under π    % value function update; could use the real-time update
    π'(s) ← argmax_a φ(s,a)ᵀ w                               % policy update
until the change in w between iterations is smaller than ε
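Putting the pieces together, a sketch of the full loop (reusing the lstdq sketch above; this is our illustration, not the paper's reference implementation):

import numpy as np

def lspi(samples, phi, k, n_actions, gamma=0.9, eps=1e-6, max_iter=50):
    w = np.zeros(k)
    for _ in range(max_iter):
        # Greedy policy implied by the current weights.
        pi = lambda s, w=w: max(range(n_actions), key=lambda a: phi(s, a) @ w)
        w_new = lstdq(samples, phi, pi, k, gamma)   # value function update
        if np.linalg.norm(w_new - w) < eps:         # converged
            return w_new
        w = w_new                                   # next iteration's policy
    return w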
Proto-Value Functions: how should we choose the basis functions in the LSPI algorithm?
Proto-value functions are good bases for value function approximation:
There is no need to design the bases manually; the data tell us what the corresponding proto-value functions are.
They are generated from the topology of the underlying state space, without estimating the underlying state transition probabilities.
They capture the intrinsic smoothness constraints that true value functions have.
Proto-Value Functions
1. Graph representation of the underlying state transitions: one node per state, with an edge between states connected by a single transition. [Figure: an 11-state graph, s1 through s11.]
2. Adjacency matrix A: A(i,j) = 1 if states s_i and s_j are neighbors in the graph, 0 otherwise. [Figure: the 11 × 11 adjacency matrix of the graph.]
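A sketch of steps 1-2 built directly from sampled transitions (the function name is ours):

import numpy as np

def adjacency_from_samples(samples, n_states):
    # Connect two states whenever a transition between them was observed;
    # actions and rewards are not needed for the topology.
    A = np.zeros((n_states, n_states))
    for s, _a, _r, s_next in samples:
        if s != s_next:
            A[s, s_next] = 1.0
            A[s_next, s] = 1.0   # symmetrize: keep topology, not direction
    return A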
Proto-Value Functions
3. Combinatorial Laplacian L: L = T − A, where T is the diagonal matrix whose entries are the row sums of the adjacency matrix A.
4. Proto-value functions: the eigenvectors of the combinatorial Laplacian L. Each eigenvector provides one basis φ_j(s); combined with the indicator function for action a, we get φ_j(s,a). (See the sketch below.)
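A sketch of steps 3-4 (the helper name is ours; np.linalg.eigh returns eigenvalues in ascending order, so the first k columns are the lowest-order eigenvectors):

import numpy as np

def proto_value_functions(A, k):
    T = np.diag(A.sum(axis=1))            # degree matrix: row sums of A
    L = T - A                             # combinatorial Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)  # eigh: L is symmetric
    return eigvecs[:, :k]                 # column j is the basis phi_j(s)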
Example of proto-value functions
Grid world with 1260 states and goal state G. [Figure: the grid world and its adjacency matrix, with a zoomed-in view.]
Proto-value functions: Low-order eigenvectors as basis functions
[Figure: the optimal value function, and the value function approximation using 10 proto-value functions as bases.]
Representation Policy Iteration (offline)
Input: D, k, γ, ε, π_0(w_0)
1. Construct the basis functions:
    Use the sample set D to learn a graph that encodes the underlying state-space topology.
    Compute the lowest-order k eigenvectors of the combinatorial Laplacian on the graph.
    The basis functions φ(s,a) are produced by combining the k proto-value functions with the indicator function of action a.
2. π' ← π_0
3. repeat
    π ← π'
    w ← solve Â w = b̂, with Â, b̂ learned from D under π    % value function update
    π'(s) ← argmax_a φ(s,a)ᵀ w                               % policy update
   until the change in w between iterations is smaller than ε
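Composing the earlier sketches gives a compact (and, again, purely illustrative) version of the offline algorithm:

import numpy as np

def rpi_offline(samples, n_states, n_actions, k, gamma=0.9, eps=1e-6):
    # Step 1: derive bases from the data's graph (sketches defined above).
    A = adjacency_from_samples(samples, n_states)
    pvf = proto_value_functions(A, k)            # (n_states, k)

    def phi(s, a):
        # Combine phi_j(s) with the indicator of action a: one block per action.
        f = np.zeros(k * n_actions)
        f[a * k:(a + 1) * k] = pvf[s]
        return f

    # Steps 2-3: run LSPI with the learned bases.
    return lspi(samples, phi, k * n_actions, n_actions, gamma, eps)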
Representation Policy Iteration (online)
Input: D_0, k, γ, ε, π_0(w_0)
1. Initialization: using the offline algorithm with D_0, k, γ, ε, π_0, learn the policy π^(0) and its weights w^(0).
2. π' ← π^(0)
3. repeat
    (a) π^(t) ← π'
    (b) execute π^(t) to get new data D^(t) = {(s_t, a_t, r_t, s'_t)}
    (c) if the new samples D^(t) change the topology of the graph G, compute a new set of basis functions
    (d) w ← updated solution of Â w = b̂                      % value function update
    (e) π'(s) ← argmax_a φ(s,a)ᵀ w                            % policy update
   until the change in w between iterations is smaller than ε
Example: Chain MDP, with rewards of +1 at states 10 and 41 and 0 elsewhere.
Optimal policy: in states 1-9 and 26-41, go Right; in all other states, go Left.
Example: Chain MDP, 20 bases used. [Figure: the value function and its approximation at each iteration.]
Performance comparison
Example: Chain MDP. [Figure: L1 error of the learned policy's value function with respect to the optimal policy, and the number of steps to convergence.]
References:
M. Lagoudakis & R. Parr. Least-Squares Policy Iteration. Journal of Machine Learning Research 4 (2003). -- Gives the LSPI algorithm for reinforcement learning.
S. Mahadevan. Proto-Value Functions: Developmental Reinforcement Learning. Proceedings of ICML 2005. -- How to build basis functions for the LSPI algorithm.
C. Kwok & D. Fox. Reinforcement Learning for Sensing Strategies. Proceedings of IROS 2004. -- An application of LSPI.