Advanced MDP Topics Ron Parr Duke University

Value Function Approximation
Why?
– Duality between value functions and policies
– Softens the problems
– State spaces are too big
  - Many problems have continuous variables
  - "Factored" (symbolic) representations don't always save us
How?
– Can tie in to the vast body of machine learning methods
  - Pattern matching (neural networks)
  - Approximation methods

Implementing VFA
– Can't represent V as a big vector
– Use a (parametric) function approximator
  - Neural network
  - Linear regression (least squares)
  - Nearest neighbor (with interpolation)
– (Typically) sample a subset of the states
– Use function approximation to "generalize"

Basic Value Function Approximation
Idea: consider a restricted class of value functions
– Alternate value iteration with supervised learning (function approximation)
(Figure: starting from V0, apply a VI backup on a subset of states, fit the function approximator, and repeat, possibly resampling states; does this converge to V*?)

VFA Outline
1. Initialize V_0(s, w_0), n = 1
2. Select some states s_0 … s_i
3. For each s_j, compute a target value by backing up V_n (one step of value iteration)
4. Compute V_{n+1}(s, w_{n+1}) by training w on these targets
5. n := n + 1
6. Unless ||V_{n+1} − V_n|| < ε, goto 2
If the supervised learning error is "small", then V_final is "close" to V*.
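As a concrete illustration of the outline above, here is a minimal Python sketch of fitted value iteration. The finite sampled state set, the model arrays R and P, the fit/predict regressor interface, and all names are illustrative assumptions, not part of the slides.

```python
# Minimal fitted value iteration sketch (illustrative assumptions throughout):
#   R[s][a]           - immediate reward for action a in state s
#   P[s][a]           - list of (prob, s_next) pairs
#   fit(states, y)    - trains a regressor on targets y, returns weights w
#   predict(w, states)- returns predicted values for 'states' under weights w
import numpy as np

def fitted_value_iteration(states, actions, R, P, fit, predict,
                           gamma=0.9, tol=1e-3, max_iters=100):
    w = fit(states, np.zeros(len(states)))              # step 1: initialize V_0
    for _ in range(max_iters):
        v_old = np.asarray(predict(w, states))
        targets = []
        for s in states:                                 # steps 2-3: backed-up targets
            backups = [R[s][a] + gamma * sum(p * predict(w, [s2])[0]
                                             for p, s2 in P[s][a])
                       for a in actions]
            targets.append(max(backups))
        w = fit(states, np.array(targets))               # step 4: supervised learning
        if np.max(np.abs(np.asarray(predict(w, states)) - v_old)) < tol:  # step 6
            return w
    return w
```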

Stability Problem
Problem: most VFA methods are unstable
Example (Bertsekas & Tsitsiklis 1996): two states s1, s2; no rewards, γ = 0.9, so V* = 0

Least Squares Approximation
Restrict V to linear functions: find θ such that V(s1) = θ, V(s2) = 2θ
Counterintuitive result: if we do a least squares fit of the backed-up values, θ_{t+1} = 1.08 θ_t
(Figure: V(x) plotted over states s1, s2)

Unbounded Growth of V
(Figure: V(x) over states s1, s2 at iterations 1, 2, …, n; the fitted value function grows without bound)
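The growth can be reproduced numerically. A minimal sketch, assuming the usual dynamics of this counterexample (s1 transitions to s2, s2 self-loops, zero rewards) and features φ(s1) = 1, φ(s2) = 2 so that V_θ(s) = θ·φ(s); these details are my reading of the example, not stated explicitly on the slide.

```python
# Two-state divergence sketch: exact least squares fit of theta * phi to the
# Bellman-backed-up values grows theta by 1.08 per step even though V* = 0.
import numpy as np

gamma = 0.9
phi = np.array([1.0, 2.0])                     # features of s1 and s2
theta = 1.0                                    # any nonzero starting weight

for k in range(10):
    v = theta * phi                            # current approximate values
    targets = gamma * np.array([v[1], v[1]])   # backups: both states lead to s2, no reward
    theta = (phi @ targets) / (phi @ phi)      # closed-form least squares fit
    print(k, theta)
```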

What Went Wrong?
– VI reduces error in the maximum norm
– Least squares (= projection) is non-expansive in L2
  - May increase maximum-norm distance
  - Grows max-norm error at a faster rate than VI shrinks it
– And we didn't even use sampling!
– Bad news for neural networks…
– Success depends on
  - sampling distribution
  - pairing approximator and problem

Success Stories - Linear TD [Tsitsiklis & Van Roy 96, Bradtke & Barto 96]
– Start with a set of basis functions
– Restrict V to the linear space spanned by the bases
– Sample states from the current policy
– N.B. linear is still expressive due to basis functions
(Figure: within the space of true value functions, a restricted linear subspace; each step is a VI backup followed by the projection Π back onto the subspace)
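For concreteness, a minimal sketch of linear TD(0) policy evaluation with basis functions, V_w(s) = w·φ(s). The env_reset/env_step interface and the parameter values are illustrative assumptions, not part of the slides.

```python
# Linear TD(0) policy evaluation sketch (assumed environment interface):
#   env_reset()  -> initial state
#   env_step(s)  -> (reward, next_state, done) under the fixed policy
#   phi(s)       -> length-num_features basis-function vector
import numpy as np

def linear_td0(env_reset, env_step, phi, num_features,
               episodes=200, alpha=0.05, gamma=0.9):
    w = np.zeros(num_features)
    for _ in range(episodes):
        s, done = env_reset(), False
        while not done:
            r, s_next, done = env_step(s)                 # follow the fixed policy
            target = r if done else r + gamma * (w @ phi(s_next))
            w += alpha * (target - w @ phi(s)) * phi(s)   # TD(0) update along phi(s)
            s = s_next
    return w
```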

Linear TD Formal Properties
– Use to evaluate policies only
– Converges w.p. 1
– Error measured w.r.t. the stationary distribution
  - Frequently visited states have low error
  - Infrequent states can have high error

Linear TD Methods
Applications
– Inventory control: Van Roy et al.
– Packet routing: Marbach et al.
– Used by Morgan Stanley to value options
– Natural idea: use for policy iteration
  - No guarantees
  - Can produce bad policies for trivial problems [Koller & Parr 99]
  - Modified for better PI: LSPI [Lagoudakis & Parr 01]
– Can be done symbolically [Koller & Parr 00]
Issues
– Selection of basis functions
– Mixing rate of process - affects k, speed
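As one way to realize the least-squares flavor of these methods, here is a minimal LSTD-style sketch for evaluating a fixed policy with a linear value function (LSPI iterates the analogous construction for Q-functions). The (s, r, s_next, done) sample format and the small ridge term are my own assumptions for illustration.

```python
# LSTD policy evaluation sketch: solve A w = b from sampled transitions, where
#   A approximates Phi^T (Phi - gamma P Phi)  and  b approximates Phi^T R.
import numpy as np

def lstd(samples, phi, k, gamma=0.9, ridge=1e-6):
    """samples: iterable of (s, r, s_next, done); phi(s) -> length-k feature vector."""
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, r, s_next, done in samples:
        f = phi(s)
        f_next = np.zeros(k) if done else phi(s_next)
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A + ridge * np.eye(k), b)   # ridge term guards against singular A
```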

Success Story: Averagers [Gordon 95, and others…]
– Pick a set Y = y_1 … y_i of representative states
– Perform VI on Y
– For x not in Y, V(x) = Σ_j β_j(x) V(y_j), with β_j(x) ≥ 0 and Σ_j β_j(x) ≤ 1
– Averagers are non-expansions in max norm
– Converge to within a 1/(1−γ) factor of "best"

Interpretation of Averagers
(Figure: state x is a weighted combination of representative states y1, y2, y3 with weights β1, β2, β3)

Interpretation of Averagers II
Averagers interpolate:
(Figure: x lies inside a grid cell whose vertices y1, y2, y3, y4 belong to Y; grid vertices = Y)
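A minimal one-dimensional sketch of this picture; the grid, the linear interpolation weights, and the function names are illustrative assumptions. The value at an off-grid state is a convex combination of values at neighbouring representative states, which is why the backup is a non-expansion in max norm.

```python
# 1-D grid averager sketch: beta weights are nonnegative and sum to 1, so
# interpolation cannot amplify max-norm differences between value functions.
import numpy as np

def averager_weights(x, grid):
    """Return interpolation weights beta_j(x) over the representative states 'grid'."""
    beta = np.zeros(len(grid))
    i = np.searchsorted(grid, x)
    if i == 0:                       # clamp below the grid
        beta[0] = 1.0
    elif i == len(grid):             # clamp above the grid
        beta[-1] = 1.0
    else:                            # linear interpolation between neighbours
        t = (x - grid[i - 1]) / (grid[i] - grid[i - 1])
        beta[i - 1], beta[i] = 1.0 - t, t
    return beta

def interpolate_value(x, grid, v_grid):
    return averager_weights(x, grid) @ v_grid

# Example: interpolate_value(0.25, np.array([0.0, 1.0, 2.0]), np.array([0.0, 4.0, 8.0])) -> 1.0
```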

General VFA Issues
What's the best we can hope for?
– We'd like to get the approximate value function close to V*
– How does this relate to the quality of the resulting greedy policy?
In practice:
– We are quite happy if we can prove stability
– Obtaining good results often involves an iterative process of tweaking the approximator, measuring empirical performance, and repeating…

Why I'm Still Excited About VFA
– Symbolic methods often fail
  - Stochasticity increases branching factor
  - Many trivial problems have no exploitable structure
– "Bad" value functions can have good performance
– We can bound the "badness" of value functions
  - By simulation
  - Symbolically in some cases [Koller & Parr 00; Guestrin, Koller & Parr 01; Dean & Kim 01]
– Basis function selection can be systematized

Hierarchical Abstraction
– Reduce the problem into simpler subproblems
– Chain primitive actions into macro-actions
– Lots of results that mirror classical results
  - Improvements depend on user-provided decompositions
  - Macro-actions are great if you start with good macros
– See Dean & Lin; Parr & Russell; Precup, Sutton & Singh; Schmidhuber & Wiering; Hauskrecht et al.; etc.