
An Analysis of Linear Models, Linear Value-Function Approximation, and Feature Selection for Reinforcement Learning Ronald Parr, Lihong Li, Gavin Taylor, Christopher Painter-Wakefield, and Michael L. Littman

A Walk Through Our Paper
Features: Φ. Training data: (s,r,s'), (s,r,s'), (s,r,s'), …
– Project the dynamics into feature space (minimizing the L2 error in the predicted next features) to obtain a linear model P_Φ (k × k), R_Φ (k × 1), then solve for the exact value function given P_Φ, R_Φ.
– Or solve for the linear fixed point directly using linear TD, LSTD, etc.
Both routes yield a linear value function V = Φw, and the Bellman error of the linear fixed point solution decomposes into a reward error plus a per-feature error, giving insight into feature selection!
[Plots: Total Bellman Error, Reward Error, and Feature Error vs. number of basis functions]

Outline
– Terminology/notation review
– Linear model, linear fixed point equivalence
– Bellman error as a function of model error
– Feature selection insights
– Experimental results

Basic Terminology
Markov Reward Process (MRP)*
– States: S = {s_1 … s_n}
– Reward: R : S → ℝ
– Transition matrix: P[i, j] = P(s_i | s_j)
– Discount: 0 ≤ γ < 1
– Value: V(s) = expected, discounted value of state s
True value function: V* = (I − γP)⁻¹ R
*Ask about MDPs later.
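
For a small MRP with known P and R, the true value function can be computed directly from this formula. A minimal numpy sketch with an illustrative 3-state MRP (the matrix and rewards below are made up for the example, and a row-stochastic P is assumed so that V* = (I − γP)⁻¹R applies):

```python
import numpy as np

# A toy 3-state Markov reward process (illustrative values, not from the talk).
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.1, 0.0, 0.9]])   # P[i, j] = probability of moving from state i to state j
R = np.array([0.0, 1.0, 5.0])     # reward for each state
gamma = 0.95                      # discount factor, 0 <= gamma < 1

# True value function: V* = (I - gamma * P)^(-1) R
V_star = np.linalg.solve(np.eye(3) - gamma * P, R)
print(V_star)
```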

Linear Value Function Approximation
– |S| is typically quite large
– Pick linearly independent features Φ = (φ_1 … φ_k) (basis functions)
– Desire weights w = (w_1 … w_k) such that V(s) ≈ Σᵢ wᵢ φᵢ(s), i.e., V ≈ Φw

Bellman Operator
Used in, e.g., value iteration: TV = R + γPV
Defines the fixed point: V* = TV*
Bellman error (residual): BE(V) = TV − V
BE bounds the actual error: ‖V* − V‖∞ ≤ ‖TV − V‖∞ / (1 − γ)
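
A short sketch of the operator and the max-norm bound, reusing the same illustrative 3-state MRP (row-stochastic P assumed):

```python
import numpy as np

# Bellman operator for an MRP: (T V)(s) = R(s) + gamma * E[V(s')]
def bellman_operator(V, P, R, gamma):
    return R + gamma * P @ V

P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.1, 0.0, 0.9]])
R = np.array([0.0, 1.0, 5.0])
gamma = 0.95

V = np.zeros(3)                                        # some candidate value function
V_star = np.linalg.solve(np.eye(3) - gamma * P, R)     # true value function

bellman_error = bellman_operator(V, P, R, gamma) - V   # residual T V - V
# Max-norm bound: ||V* - V||_inf <= ||T V - V||_inf / (1 - gamma)
assert np.max(np.abs(V_star - V)) <= np.max(np.abs(bellman_error)) / (1 - gamma) + 1e-9
```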

Linear Fixed Point
Π_Φ V = the projection of V into span(Φ); w gives the weights of that projection.
LSTD, linear TD, etc. solve for the linear fixed point: Φw_Φ = Π_Φ(R + γPΦw_Φ)
[Diagram: the Bellman backup projected back onto the plane span(Φ)]
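
A sample-based LSTD sketch for the linear fixed point; the feature map and transitions below are illustrative placeholders, not from the talk:

```python
import numpy as np

# LSTD from sampled transitions (s, r, s'): solve A w = b with
#   A = sum_t phi(s_t) (phi(s_t) - gamma * phi(s'_t))^T,   b = sum_t phi(s_t) * r_t
def lstd(samples, phi, gamma, k):
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, r, s_next in samples:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)
        b += f * r
    return np.linalg.solve(A, b)   # weights of the linear fixed point

# Example usage: bias + state-index features on a 2-state toy problem.
phi = lambda s: np.array([1.0, float(s)])
samples = [(0, 0.0, 1), (1, 1.0, 0), (0, 0.0, 0), (1, 1.0, 1)]
w = lstd(samples, phi, gamma=0.9, k=2)
```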

Outline
– Terminology/notation review
– Linear model, linear fixed point equivalence
– Bellman error as a function of model error
– Feature selection insights
– Experimental results

Linear Model Approximation
Linearly independent features Φ = (φ_1 … φ_k)  (n × k)
Want R_Φ = reward model (k × 1) with the smallest L2 error: R_Φ = argmin_r ‖Φr − R‖_2
Want P_Φ = feature-to-feature model (k × k) with the smallest L2 error in the expected next feature values PΦ (n × k): P_Φ = argmin_M ‖ΦM − PΦ‖_2
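
A minimal sketch of these two least-squares projections, assuming the full P and R are available (a sample-based version would replace PΦ with observed next-feature values):

```python
import numpy as np

# Least-squares projections of the model into feature space (model-based sketch;
# Phi must have linearly independent columns).
def linear_model(Phi, P, R):
    R_Phi = np.linalg.lstsq(Phi, R, rcond=None)[0]          # k x 1: min || Phi R_Phi - R ||_2
    P_Phi = np.linalg.lstsq(Phi, P @ Phi, rcond=None)[0]    # k x k: min || Phi P_Phi - P Phi ||_2
    return P_Phi, R_Phi

P = np.array([[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.1, 0.0, 0.9]])
R = np.array([0.0, 1.0, 5.0])
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # n x k: bias + state index (illustrative)
P_Phi, R_Phi = linear_model(Phi, P, R)
```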

Value Function of the Linear Model
The value function of the approximate linear model lies in span(Φ), so it can be expressed as Φw.
If V is bounded, then w = (I − γP_Φ)⁻¹ R_Φ   ((k × k) and (k × 1) quantities).
Note the similarity to the conventional solution V* = (I − γP)⁻¹ R   ((n × n) and (n × 1) quantities).

Linear Model, Linear Fixed Point Equivalence
Theorem: For features Φ, the linear model's exact value function and the linear fixed point solution are identical.
Proof sketch: start from the definition of the linear fixed point solution,
w_Φ = (ΦᵀΦ − γΦᵀPΦ)⁻¹ ΦᵀR = (I − γ(ΦᵀΦ)⁻¹ΦᵀPΦ)⁻¹ (ΦᵀΦ)⁻¹ΦᵀR = (I − γP_Φ)⁻¹ R_Φ.
The approximate model appears in the linear fixed point definition!
Note: Preliminary observations along these lines were made by Boyan [1999].
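
A small numerical check of the equivalence, using the same illustrative MRP and a made-up bias-plus-state-index feature set:

```python
import numpy as np

P = np.array([[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.1, 0.0, 0.9]])
R = np.array([0.0, 1.0, 5.0])
gamma = 0.95
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # illustrative features

# Route 1: exact value function of the projected linear model.
R_Phi = np.linalg.lstsq(Phi, R, rcond=None)[0]
P_Phi = np.linalg.lstsq(Phi, P @ Phi, rcond=None)[0]
w_model = np.linalg.solve(np.eye(2) - gamma * P_Phi, R_Phi)

# Route 2: linear fixed point (LSTD computed with the full model).
w_lstd = np.linalg.solve(Phi.T @ (Phi - gamma * P @ Phi), Phi.T @ R)

assert np.allclose(Phi @ w_model, Phi @ w_lstd)   # identical value functions
```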

Linear Model, Linear Fixed Point Equivalence
Given: linearly independent features Φ = (φ_1 … φ_k) and training data (s,r,s'), (s,r,s'), …
– Project the dynamics into feature space (minimizing the L2 error in the predicted next features) to obtain the linear model P_Φ (k × k), R_Φ (k × 1), then solve for the exact value function given P_Φ, R_Φ.
– Or solve for the linear fixed point using linear TD, LSTD, etc.
Either way, the result is the same linear value function V = Φw.

Outline
– Terminology/notation review
– Linear model, linear fixed point equivalence
– Bellman error as a function of model error
– Feature selection insights
– Experimental results

Model Error
Linearly independent features Φ = (φ_1 … φ_k)
Error in reward (n × 1): Δ_R = R − ΦR_Φ
Error in predicted next features, i.e., the per-feature error (n × k): Δ_Φ = PΦ − ΦP_Φ  (expected next feature values PΦ minus predicted next feature values ΦP_Φ)
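
A sketch computing both error terms for an illustrative feature matrix and MRP:

```python
import numpy as np

# Model errors for a given feature matrix (illustrative 3-state MRP).
P = np.array([[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.1, 0.0, 0.9]])
R = np.array([0.0, 1.0, 5.0])
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])    # n x k feature matrix

R_Phi = np.linalg.lstsq(Phi, R, rcond=None)[0]          # k x 1 approximate reward model
P_Phi = np.linalg.lstsq(Phi, P @ Phi, rcond=None)[0]    # k x k approximate feature model

delta_R   = R - Phi @ R_Phi          # reward error, n x 1
delta_Phi = P @ Phi - Phi @ P_Phi    # per-feature error, n x k
```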

Bellman Error
Theorem: the Bellman error of the linear fixed point solution is
BE(Φw_Φ) = Δ_R + γΔ_Φ w_Φ
(the reward error plus the per-feature errors weighted by w_Φ).
Punch line: the Bellman error decomposes into a function of the model errors!
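
The decomposition can be verified numerically; a sketch with the same illustrative MRP and features (the assert holds up to floating-point precision):

```python
import numpy as np

# Check BE = delta_R + gamma * delta_Phi @ w at the linear fixed point.
P = np.array([[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.1, 0.0, 0.9]])
R = np.array([0.0, 1.0, 5.0])
gamma = 0.95
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])

w = np.linalg.solve(Phi.T @ (Phi - gamma * P @ Phi), Phi.T @ R)   # linear fixed point weights
V = Phi @ w
bellman_error = R + gamma * P @ V - V                             # T(V) - V

R_Phi = np.linalg.lstsq(Phi, R, rcond=None)[0]
P_Phi = np.linalg.lstsq(Phi, P @ Phi, rcond=None)[0]
delta_R   = R - Phi @ R_Phi
delta_Phi = P @ Phi - Phi @ P_Phi

assert np.allclose(bellman_error, delta_R + gamma * delta_Phi @ w)
```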

Outline
– Terminology/notation review
– Linear model, linear fixed point equivalence
– Bellman error as a function of model error
– Feature selection insights
– Experimental results

Insights into Feature Selection I
– Features should model the reward.
– The reward itself is a useful feature!

Insights into Feature Selection II
When Δ_Φ = 0 (the features "predict themselves"):
– BE = Δ_R, with no dependence on γ
– Value function approximation and feature selection reduce to regression problems on R
– Φw_Φ = (I − γP)⁻¹ ΦR_Φ  (approximate reward, true P!)

Achieving Zero Feature Error (Δ_Φ = 0)
When are the features sufficient for zero error in the expected next feature values?
– Rank(Φ) = |S|
– Φ is composed of eigenvectors of P
– Φ spans an invariant subspace of P
Invariant subspace: PΦ = ΦW for some (k × k) matrix W, i.e., P maps span(Φ) into span(Φ).
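
A sketch of an invariance check: the feature error is zero exactly when PΦ stays inside span(Φ). The symmetric random-walk P below is an illustrative choice whose eigenvectors are real:

```python
import numpy as np

# Feature error norm: 0  <=>  span(Phi) is an invariant subspace of P.
def feature_error_norm(P, Phi):
    W = np.linalg.lstsq(Phi, P @ Phi, rcond=None)[0]   # best k x k map in feature space
    return np.linalg.norm(P @ Phi - Phi @ W)

P = np.array([[0.5, 0.5, 0.0],      # symmetric random walk, so eigenvectors are real
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])
_, eigvecs = np.linalg.eigh(P)

print(feature_error_norm(P, eigvecs[:, :2]))     # ~0: eigenvectors span an invariant subspace
print(feature_error_norm(P, np.eye(3)[:, :2]))   # > 0: indicator features for two states
```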

Insight into Adding Features
Methods for adding features:
– Add the Bellman error as a feature (BEBF) [Wu & Givan, 2004; Sanner & Boutilier, 2005; Keller et al., 2006; Parr et al., 2007]
– Add the model errors (Δ_Φ, Δ_R) as features (MEBF)
– Add P^k R for increasing k (Krylov basis) [Petrik, 2007]
Theorem: BEBF, MEBF, and the Krylov basis are equivalent when initialized with Φ = {}.
Note: Special thanks to Marek Petrik for demonstrating Krylov = BEBF.
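
A model-based BEBF sketch (assuming access to P and R for clarity; in practice the Bellman error would be estimated from samples). Started from an empty basis, it generates the Krylov features R, PR, P²R, … up to span:

```python
import numpy as np

# BEBF: repeatedly add the Bellman error of the current linear fixed point as a new feature.
def bebf_basis(P, R, gamma, num_features):
    n = len(R)
    Phi = np.zeros((n, 0))
    for _ in range(num_features):
        if Phi.shape[1] == 0:
            V = np.zeros(n)                       # empty basis: value estimate is 0
        else:
            w = np.linalg.solve(Phi.T @ (Phi - gamma * P @ Phi), Phi.T @ R)
            V = Phi @ w                           # current linear fixed point
        be = R + gamma * P @ V - V                # Bellman error of the current solution
        Phi = np.column_stack([Phi, be])          # add it as the next feature
    return Phi

P = np.array([[0.5, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 0.5, 0.5]])
R = np.array([0.0, 1.0, 5.0])
Phi = bebf_basis(P, R, gamma=0.9, num_features=2)
```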

Insight into Proto-Value Functions
– Proto-value functions (PVFs) are eigenvectors of a modified adjacency graph (the graph Laplacian) [Mahadevan & Maggioni]
– Adjacency graph ≈ approximate P
– PVFs ≈ eigenvectors of P
– ∴ PVFs are an approximation to subspace-invariant features (empirically, the closeness of this approximation varies)
Note: Similar observations were made by Petrik, who considered a version of PVFs that used P instead of the Laplacian.
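
A sketch of PVF-style feature construction from an adjacency matrix; the normalized Laplacian used here is one common variant, and the chain graph below is illustrative, not the authors' code:

```python
import numpy as np

# PVF-style features: the smoothest eigenvectors of the normalized graph Laplacian
# built from an adjacency/weight matrix W over the states.
def pvf_features(W, k):
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(d)) - D_inv_sqrt @ W @ D_inv_sqrt    # normalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)                # eigenvalues in ascending order
    return eigvecs[:, :k]                               # k smoothest eigenvectors as features

# Adjacency matrix of a 4-state chain graph (illustrative).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Phi_pvf = pvf_features(W, k=2)
```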

Outline
– Terminology/notation review
– Linear model, linear fixed point equivalence
– Bellman error as a function of model error
– Feature selection insights
– Experimental results

Experimental Results
Four algorithms (in order of "smoothness"):
– PVFs
– PVF-MP (matching pursuit w/ PVF dictionary)
– eig-MP (matching pursuit w/ eigenvectors of P)
– BEBF (a.k.a. MEBF, Krylov basis)
Measured (in L2) as a function of the number of basis functions added:
– Total Bellman error
– Reward error Δ_R
– Total feature error Δ_Φ w_Φ
Three problems:
– Chain [Lagoudakis & Parr] (talk, paper, poster)
– Two Room [Mahadevan & Maggioni] (paper, poster)
– Blackjack [Sutton & Barto] (paper, poster)

Chain Results
[Plots: Total Bellman Error, Reward Error, and Feature Error vs. number of basis functions for PVF, PVF-MP, eig-MP, and BEBF on the 50-state chain from Lagoudakis & Parr]
Ask about blackjack or the two-room domain, or come to our poster!

Conclusions From Experiments
– eig-MP will always have Δ_Φ = 0
– PVFs sometimes approximate subspace invariance (potentially useful because of stability issues w/ eig-MP)
– PVF-MP dominates PVF because PVF ignores R
– BEBF will always have Δ_R = 0
– BEBF has a more steady/predictable reduction in BE
Don't ignore R!

Ground Covered
Features Φ and training data (s,r,s'), (s,r,s'), (s,r,s'), …
– Projecting the dynamics into feature space (minimizing the L2 error in the predicted next features) gives the linear model P_Φ (k × k), R_Φ (k × 1); solving that model exactly yields the same linear value function V = Φw as solving for the linear fixed point with linear TD, LSTD, etc.
– The Bellman error of the linear fixed point solution decomposes into the reward error plus the per-feature error, giving insight into feature selection!
[Plots: Total Bellman Error, Reward Error, and Feature Error vs. number of basis functions]

Thank you! Also, special thanks to Jeff Johns, Sridhar Mahadevan, and Marek Petrik