An Analysis of Linear Models, Linear Value-Function Approximation, and Feature Selection for Reinforcement Learning
Ronald Parr, Lihong Li, Gavin Taylor, Christopher Painter-Wakefield, and Michael L. Littman
A Walk Through Our Paper
– Features: $\Phi$
– Training data: (s,r,s’), (s,r,s’), (s,r,s’), …
– Linear model: $P_\Phi$ (k × k), $R_\Phi$ (k × 1). Project dynamics into feature space (minimizing $L_2$ error in predicted next features)
– Linear value function: $V = \Phi w$
– Solve for exact value function given $P_\Phi$, $R_\Phi$
– Solve for linear fixed point using linear TD, LSTD, etc.
– $BE(\Phi w) = \Delta_R + \gamma \Delta_\Phi w$: Bellman error of linear fixed point solution = reward error + per-feature error
– Insight into feature selection!
[Figure: three plots of Total Bellman Error, Reward Error, and Feature Error versus number of basis functions]
Outline Terminology/notation review Linear model, linear fixed point equivalence Bellman error as function of model error Feature selection insights Experimental results
Basic Terminology
Markov Reward Process (MRP)*
– States: $S = [s_1 \ldots s_n]$
– Reward: $R : S \to \mathbb{R}$
– Transition matrix: $P[i,j] = P(s_j \mid s_i)$
– Discount: $0 \le \gamma < 1$
– Value: $V(s)$ = expected, discounted value of state $s$
True value function: $V^* = (I - \gamma P)^{-1} R$
*Ask about MDPs later.
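As a quick numerical illustration of the closed form for the true value function, the sketch below builds a small MRP and solves for its values. The five-state random-walk chain, the helper name `chain_mrp`, and the discount of 0.9 are assumptions made for this example; they are not taken from the slides.

```python
import numpy as np

def chain_mrp(n=5):
    """Hypothetical random-walk chain: step left/right w.p. 0.5, reward 1 in the last state."""
    P = np.zeros((n, n))
    for i in range(n):
        P[i, max(i - 1, 0)] += 0.5       # left step (stays put at the left end)
        P[i, min(i + 1, n - 1)] += 0.5   # right step (stays put at the right end)
    R = np.zeros(n)
    R[-1] = 1.0
    return P, R

gamma = 0.9                              # assumed discount, 0 <= gamma < 1
P, R = chain_mrp()

# V* = (I - gamma P)^{-1} R, computed with a linear solve rather than an explicit inverse
V_star = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V_star)
```

Row i of `P` holds the distribution over next states from state i, which is the convention the closed form needs.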
Linear Value Function Approximation
– $|S|$ is typically quite large
– Pick linearly independent features $\Phi = (\phi_1 \ldots \phi_k)$ (basis functions)
– Desire weights $w = (w_1 \ldots w_k)$ s.t. $\Phi w \approx V^*$
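Bellman Operator
– Used in, e.g., value iteration: $V_{i+1} = TV_i = R + \gamma P V_i$
– Defines fixed point: $V^* = TV^*$
– Bellman error (residual): $BE(V) = TV - V$
– BE bounds actual error: $\|V^* - V\|_\infty \le \|BE(V)\|_\infty / (1 - \gamma)$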
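The sketch below applies the Bellman operator $TV = R + \gamma PV$ in a short value-iteration loop and compares the actual max-norm error with the standard bound $\|BE(V)\|_\infty / (1 - \gamma)$. The random six-state MRP, the seed, and the 50 sweeps are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 6, 0.9                       # assumed problem size and discount

P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)       # make each row a probability distribution
R = rng.random(n)

def T(V):
    """Bellman operator for an MRP: TV = R + gamma * P V."""
    return R + gamma * P @ V

V = np.zeros(n)
for _ in range(50):                     # value iteration: V <- TV
    V = T(V)

V_star = np.linalg.solve(np.eye(n) - gamma * P, R)
bellman_error = T(V) - V                # BE(V) = TV - V
actual_error = np.max(np.abs(V_star - V))
bound = np.max(np.abs(bellman_error)) / (1 - gamma)
print(actual_error, bound)
```

The bound holds for any `V`, not only for value-iteration iterates, so it can be evaluated without knowing `V_star`.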
Linear Fixed Point
– $\Pi V$ = weights of the projection of $V$ into span($\Phi$)
– LSTD, linear TD, etc. solve for the linear fixed point: $w = \Pi(R + \gamma P \Phi w)$
[Figure: diagram labeled span($\Phi$)]
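Linear Model Approximation
– Linearly independent features $\Phi = (\phi_1 \ldots \phi_k)$ (n × k)
– Want $R_\Phi$ = reward model (k × 1) w/ smallest $L_2$ error: $R_\Phi = (\Phi^\top \Phi)^{-1} \Phi^\top R$
– Want $P_\Phi$ = feature × feature model (k × k) w/ smallest $L_2$ error: $P_\Phi = (\Phi^\top \Phi)^{-1} \Phi^\top P \Phi$
– $P\Phi$ = expected next feature values (n × k)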
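Both model components are ordinary least-squares fits, which the following sketch makes concrete: regress the reward on the features to get the reward model, and regress the expected next feature values on the features to get the feature-to-feature model. The random MRP, the Gaussian feature matrix, and the names `R_phi` and `P_phi` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 3                                   # assumed number of states and features

P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)             # row-stochastic transition matrix
R = rng.random(n)
Phi = rng.standard_normal((n, k))             # hypothetical feature matrix (n x k)

# Reward model: minimize ||Phi r - R||_2 over r (k-vector)
R_phi = np.linalg.lstsq(Phi, R, rcond=None)[0]

# Feature model: minimize ||Phi M - P Phi||_F over M (k x k),
# where P @ Phi holds the expected next feature values
P_phi = np.linalg.lstsq(Phi, P @ Phi, rcond=None)[0]

print(R_phi.shape, P_phi.shape)
```

With linearly independent features these fits coincide with the normal-equation forms $(\Phi^\top\Phi)^{-1}\Phi^\top R$ and $(\Phi^\top\Phi)^{-1}\Phi^\top P\Phi$.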
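Value Function of the Linear Model
– Value function is in span($\Phi$)
– Can express value functions as $\Phi w$
– If $V$ is bounded, then: $w = (I - \gamma P_\Phi)^{-1} R_\Phi$, with $(I - \gamma P_\Phi)^{-1}$ (k × k) and $R_\Phi$ (k × 1)
– Note similarity to conventional solution: $V^* = (I - \gamma P)^{-1} R$, with $V^*$ (n × 1) and $(I - \gamma P)^{-1}$ (n × n)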
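Linear Model, Linear Fixed Point Equivalence
Theorem: For features $\Phi$, the linear model’s exact value function and the linear fixed point solution are identical.
Proof sketch:
– Definition of linear fixed point solution: $w = \Pi(R + \gamma P \Phi w)$, with $\Pi = (\Phi^\top \Phi)^{-1} \Phi^\top$
– Approximate model appears in linear fixed point definition: $w = \Pi R + \gamma \Pi P \Phi w = R_\Phi + \gamma P_\Phi w$
– Hence $w = (I - \gamma P_\Phi)^{-1} R_\Phi$, which is the linear model’s exact value function
Note: Preliminary observations along these lines by Boyan [99]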
Linear Model, Linear Fixed Point Equivalence
– Given: linearly independent features $\Phi = (\phi_1 \ldots \phi_k)$
– Training data: (s,r,s’), (s,r,s’), …
– Linear model: $P_\Phi$ (k × k), $R_\Phi$ (k × 1). Project dynamics into feature space (minimizing $L_2$ error in predicted next features), then solve for the exact value function given $P_\Phi$, $R_\Phi$
– Linear fixed point: solve using linear TD, LSTD, etc.
– Both routes give the same linear value function $V = \Phi w$
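The equivalence can be checked on sampled transitions. The sketch below fits the linear model by least squares and solves it exactly, then computes LSTD weights from the same samples and compares the two. The random MRP, the sample count, and all variable names are assumptions; LSTD is written here in its basic batch form, $A = F^\top(F - \gamma F')$, $b = F^\top r$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m, gamma = 10, 3, 200, 0.9          # states, features, samples, discount (all assumed)

P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
R = rng.random(n)
Phi = rng.standard_normal((n, k))         # hypothetical feature matrix

# Training data: sampled transitions (s, r, s')
s = rng.integers(0, n, size=m)
s_next = np.array([rng.choice(n, p=P[i]) for i in s])
r = R[s]
F, F_next = Phi[s], Phi[s_next]           # features of s and of s'

# Route 1: least-squares linear model, then its exact value function
P_phi = np.linalg.lstsq(F, F_next, rcond=None)[0]
R_phi = np.linalg.lstsq(F, r, rcond=None)[0]
w_model = np.linalg.solve(np.eye(k) - gamma * P_phi, R_phi)

# Route 2: batch LSTD on the same samples
A = F.T @ (F - gamma * F_next)
b = F.T @ r
w_lstd = np.linalg.solve(A, b)

print(np.allclose(w_model, w_lstd))
```

The agreement is algebraic rather than statistical: multiplying the model equation through by $F^\top F$ yields the LSTD system, so the two weight vectors match for any sample set on which both systems are solvable.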
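Model Error
– Linearly independent features $\Phi = (\phi_1 \ldots \phi_k)$
– Error in reward: $\Delta_R = R - \Phi R_\Phi$ (n × 1)
– Error in predicted next features (per-feature error): $\Delta_\Phi = P\Phi - \Phi P_\Phi$ (n × k)
– $P\Phi$ = expected next feature values; $\Phi P_\Phi$ = predicted next feature values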
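Bellman Error
Theorem: $BE(\Phi w) = \Delta_R + \gamma \Delta_\Phi w$
– $BE(\Phi w)$: Bellman error of linear fixed point solution
– $\Delta_R = R - \Phi R_\Phi$: reward error
– $\Delta_\Phi = P\Phi - \Phi P_\Phi$: per-feature error
Punch line: Bellman error decomposes into a function of model errors!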
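A numerical check of the decomposition, writing the reward error as $\Delta_R = R - \Phi R_\Phi$ and the per-feature error as $\Delta_\Phi = P\Phi - \Phi P_\Phi$. This is a minimal sketch on an assumed random MRP with Gaussian features; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, gamma = 8, 3, 0.9                       # assumed sizes and discount

P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
R = rng.random(n)
Phi = rng.standard_normal((n, k))             # hypothetical features

proj = np.linalg.pinv(Phi)                    # (Phi^T Phi)^{-1} Phi^T for full-column-rank Phi
R_phi = proj @ R
P_phi = proj @ P @ Phi
w = np.linalg.solve(np.eye(k) - gamma * P_phi, R_phi)   # linear fixed point weights

delta_R = R - Phi @ R_phi                     # reward error (n,)
delta_Phi = P @ Phi - Phi @ P_phi             # per-feature error (n x k)

bellman_error = R + gamma * P @ Phi @ w - Phi @ w        # T(Phi w) - Phi w
print(np.allclose(bellman_error, delta_R + gamma * delta_Phi @ w))
```

The identity follows by substituting $\Phi w = \Phi R_\Phi + \gamma \Phi P_\Phi w$ into $R + \gamma P \Phi w - \Phi w$ and grouping terms.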
Insights into Feature Selection I
– Features should model the reward.
– The reward itself is a useful feature!
Insights into Feature Selection II
When $\Delta_\Phi = 0$ (features “predict themselves”):
– $BE = \Delta_R$, no dependence on $\gamma$
– Value function approximation and feature selection reduce to regression problems on $R$
– $\Phi w = (I - \gamma P)^{-1} \Phi R_\Phi$: approximate reward, true $P$!
Achieving Zero Feature Error ($\Delta_\Phi = 0$)
When are features sufficient for 0 error in expected next feature values?
– Rank($\Phi$) = $|S|$
– $\Phi$ is composed of eigenvectors of $P$
– $\Phi$ spans an invariant subspace of $P$
Invariant subspace: $P\Phi = \Phi\Lambda$ for some k × k matrix $\Lambda$
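The eigenvector case can be verified directly. The sketch below uses an assumed symmetric random-walk chain (so the eigenvectors are real), takes three eigenvectors of $P$ as features, and checks that the per-feature error vanishes, that the Bellman error equals the reward error, and that the approximate values equal the true-$P$ solution for the approximate reward. The chain, the reward, and the choice of three features are illustrative assumptions.

```python
import numpy as np

n, k, gamma = 6, 3, 0.9                       # assumed sizes and discount

# Symmetric random-walk chain: step left/right w.p. 0.5, stay put at the ends
P = np.zeros((n, n))
for i in range(n):
    P[i, max(i - 1, 0)] += 0.5
    P[i, min(i + 1, n - 1)] += 0.5
R = np.arange(n, dtype=float)                 # hypothetical reward

_, eigvecs = np.linalg.eigh(P)                # P is symmetric, so eigh applies
Phi = eigvecs[:, -k:]                         # eigenvectors for the k largest eigenvalues

proj = np.linalg.pinv(Phi)
R_phi, P_phi = proj @ R, proj @ P @ Phi
w = np.linalg.solve(np.eye(k) - gamma * P_phi, R_phi)

delta_Phi = P @ Phi - Phi @ P_phi             # per-feature error
delta_R = R - Phi @ R_phi                     # reward error
bellman_error = R + gamma * P @ Phi @ w - Phi @ w

print(np.abs(delta_Phi).max())
```

Here `P_phi` is (numerically) the diagonal matrix of the selected eigenvalues, which plays the role of $\Lambda$ in the invariant-subspace condition.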
Insight into Adding Features
Methods for adding features
– Add the Bellman error as a feature (BEBF) [Wu & Givan, 2004; Sanner & Boutilier, 2005; Keller et al., 2006; Parr et al., 2007]
– Add the model errors ($\Delta_\Phi$, $\Delta_R$) as features (MEBF)
– Add $P^k R$ for increasing $k$ (Krylov basis) [Petrik, 2007]
Theorem: BEBF, MEBF, and the Krylov basis are equivalent when initialized with $\Phi = \{\}$
Note: Special thanks to Marek Petrik for demonstrating Krylov = BEBF
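The BEBF and Krylov constructions can be compared numerically. The sketch below grows a BEBF basis from the empty feature set (where the value estimate is 0, so the first Bellman error is $R$), builds the Krylov basis $R, PR, P^2R$, and tests whether the two span the same subspace. MEBF is not covered here. The random MRP, the helper names, and the span test via least-squares residuals are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, gamma = 8, 3, 0.9                       # assumed sizes and discount

P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
R = rng.random(n)

def fixed_point_bellman_error(Phi):
    """Bellman error of the linear fixed point solution for features Phi."""
    proj = np.linalg.pinv(Phi)
    P_phi, R_phi = proj @ P @ Phi, proj @ R
    w = np.linalg.solve(np.eye(Phi.shape[1]) - gamma * P_phi, R_phi)
    return R + gamma * P @ Phi @ w - Phi @ w

# BEBF: repeatedly add the current Bellman error as a new feature
basis = [R / np.linalg.norm(R)]
while len(basis) < k:
    be = fixed_point_bellman_error(np.column_stack(basis))
    basis.append(be / np.linalg.norm(be))     # rescaling does not change the span
bebf = np.column_stack(basis)

# Krylov basis: R, P R, ..., P^(k-1) R
krylov = np.column_stack([np.linalg.matrix_power(P, i) @ R for i in range(k)])

def in_span(A, B, tol=1e-8):
    """True if every column of B lies (numerically) in the column span of A."""
    coeffs = np.linalg.lstsq(A, B, rcond=None)[0]
    return np.abs(A @ coeffs - B).max() < tol

same_span = in_span(bebf, krylov) and in_span(krylov, bebf)
print(same_span)
```

Because $R$ is the first basis vector, the reward error of the BEBF feature set is zero from the first step onward.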
Insight into Proto Value Functions
– Proto value functions (PVFs) compute eigenvectors of a modified adjacency graph (Laplacian) [Mahadevan & Maggioni]
– Adjacency graph = approximate $P$
– PVFs ~ eigenvectors of $P$
– ∴ PVFs = approximation to subspace-invariant features (empirically, closeness of this approximation varies)
Note: Similar observations were made by Petrik, who considered a version of PVFs that used $P$ instead of the Laplacian.
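How close PVFs come to subspace invariance depends on how well the adjacency graph reflects $P$. The sketch below illustrates this on an assumed ten-state chain: it takes the three smoothest eigenvectors of the combinatorial Laplacian $L = D - A$ and measures the per-feature error under an unbiased walk and under a walk biased to the right. The use of the combinatorial (rather than a normalized) Laplacian, the chain, and the bias of 0.7 are assumptions of this example, not details from the slides.

```python
import numpy as np

n, k = 10, 3                                  # assumed chain length and number of PVFs

def chain(p_right):
    """Chain transition matrix: step right w.p. p_right, else left (stay put at the ends)."""
    P = np.zeros((n, n))
    for i in range(n):
        P[i, min(i + 1, n - 1)] += p_right
        P[i, max(i - 1, 0)] += 1.0 - p_right
    return P

# Adjacency graph of the chain and its combinatorial Laplacian
A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
L = np.diag(A.sum(axis=1)) - A
eigvals, eigvecs = np.linalg.eigh(L)          # eigenvalues in ascending order
pvfs = eigvecs[:, :k]                         # the k "smoothest" eigenvectors

def feature_error(P, Phi):
    """Frobenius norm of the per-feature error P Phi - Phi P_Phi."""
    P_phi = np.linalg.pinv(Phi) @ P @ Phi
    return np.linalg.norm(P @ Phi - Phi @ P_phi)

err_symmetric = feature_error(chain(0.5), pvfs)
err_biased = feature_error(chain(0.7), pvfs)
print(err_symmetric, err_biased)
```

For the unbiased walk the transition matrix equals $I - L/2$, so the Laplacian eigenvectors are exact eigenvectors of $P$; the biased walk shares the same graph but not the same eigenvectors, so the feature error is nonzero.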
Experimental Results
Four algorithms
– PVFs (in order of “smoothness”)
– PVF-MP (matching pursuits w/ PVF dictionary)
– eig-MP (matching pursuits w/ eigenvectors of $P$)
– BEBF (AKA MEBF, Krylov basis)
Measured (in $L_2$) as a function of number of basis functions added
– Total Bellman error
– Reward error $\Delta_R$
– Total feature error $\Delta_\Phi w$
Three problems
– Chain [Lagoudakis & Parr] (talk, paper, poster)
– Two Room [Mahadevan & Maggioni] (paper, poster)
– Blackjack [Sutton & Barto] (paper, poster)
Chain Results
[Figure: Total Bellman Error, Reward Error, and Feature Error versus number of basis functions for PVF, PVF-MP, Eig-MP, and BEBF on the 50-state chain from Lagoudakis & Parr]
Ask about blackjack or the two-room domain, or come to our poster!
Conclusions From Experiments
– eig-MP will always have $\Delta_\Phi = 0$
– PVFs sometimes approximate subspace invariance (potentially useful because of stability issues w/ eig-MP)
– PVF-MP dominates PVF because PVF ignores $R$
– BEBF will always have $\Delta_R = 0$
– BEBF has a more steady/predictable reduction in BE
– Don’t ignore $R$!
Ground Covered
– Features: $\Phi$
– Training data: (s,r,s’), (s,r,s’), (s,r,s’), …
– Linear model: $P_\Phi$ (k × k), $R_\Phi$ (k × 1). Project dynamics into feature space (minimizing $L_2$ error in predicted next features)
– Linear value function: $V = \Phi w$
– Solve for exact value function given $P_\Phi$, $R_\Phi$
– Solve for linear fixed point using linear TD, LSTD, etc.
– $BE(\Phi w) = \Delta_R + \gamma \Delta_\Phi w$: Bellman error of linear fixed point solution = reward error + per-feature error
– Insight into feature selection!
[Figure: three plots of Total Bellman Error, Reward Error, and Feature Error versus number of basis functions]
Thank you! Also, special thanks to Jeff Johns, Sridhar Mahadevan, and Marek Petrik