
1 Advanced MDP Topics
Ron Parr, Duke University

2 Value Function Approximation
Why?
– Duality between value functions and policies
– Softens the problems
– State spaces are too big
  - Many problems have continuous variables
  - “Factored” (symbolic) representations don’t always save us
How?
– Can tie in to a vast body of machine learning methods
  - Pattern matching (neural networks)
  - Approximation methods

3 Implementing VFA
Can’t represent V as a big vector
Use a (parametric) function approximator
– Neural network
– Linear regression (least squares)
– Nearest neighbor (with interpolation)
(Typically) sample a subset of the states
Use function approximation to “generalize”
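A minimal sketch of the “linear regression (least squares)” option: fit weights w so that V(s) ≈ w·φ(s) on a sampled subset of states, then use the fitted function to generalize to unsampled states. The feature map, the sampled states, and their target values below are hypothetical placeholders, not from the slides.

```python
import numpy as np

def features(s):
    # Hypothetical feature map phi(s); chosen per problem in practice.
    return np.array([1.0, s, s ** 2])

# Hypothetical sampled subset of states and target values to fit.
sampled_states = [0.0, 1.0, 2.0, 3.0]
targets = np.array([0.0, 0.9, 1.6, 2.1])

Phi = np.stack([features(s) for s in sampled_states])  # one row per sampled state
w, *_ = np.linalg.lstsq(Phi, targets, rcond=None)      # least-squares fit of the weights

def V(s):
    # The fitted approximator generalizes to states that were never sampled.
    return features(s) @ w

print(V(1.5))  # value estimate at an unsampled state
```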

4 Basic Value Function Approximation
Idea: Consider a restricted class of value functions
Alternate value iteration with supervised learning
[Diagram: starting from V0, alternating VI and FA steps, hopefully approaching V*; applied over a subset of states, possibly resampled]

5 VFA Outline
1. Initialize V_0(s, w_0), n = 1
2. Select some states s_0 … s_i
3. For each s_j, compute a Bellman backup target using V_{n-1}
4. Compute V_n(s, w_n) by training w_n on the pairs (s_j, target_j)
5. If ||V_n - V_{n-1}|| < ε, stop
6. Otherwise n := n + 1; goto 2
If the supervised learning error is “small”, then V_final is “close” to V*.
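A sketch of the outline above as fitted value iteration, assuming a small tabular MDP (transition matrices P[a], reward matrix R) and a linear approximator trained by least squares; for simplicity every row of the hypothetical feature matrix Phi is treated as a “sampled” state. All of these inputs stand in for whatever approximator and sampling scheme a real application would use.

```python
import numpy as np

def fitted_value_iteration(P, R, Phi, gamma=0.9, eps=1e-4, max_iter=1000):
    """Sketch of the outline above.  P[a] is an |S| x |S| transition matrix,
    R is an |S| x |A| reward matrix, and Phi has one row of features per
    sampled state; all are hypothetical stand-ins for a real problem."""
    n_actions = R.shape[1]
    w = np.zeros(Phi.shape[1])                     # step 1: V_0(s, w_0)
    for _ in range(max_iter):
        v = Phi @ w                                # V_{n-1} at the sampled states
        # steps 2-3: Bellman backup target at each sampled state s_j
        q = R + gamma * np.stack([P[a] @ v for a in range(n_actions)], axis=1)
        targets = q.max(axis=1)
        # step 4: train w_n on the pairs (s_j, target_j) by least squares
        w_new, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
        # step 5: stop when successive value functions agree
        if np.max(np.abs(Phi @ w_new - v)) < eps:
            return w_new
        w = w_new                                  # step 6: otherwise continue
    return w
```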

6 Stability Problem
Problem: Most VFA methods are unstable
No rewards, γ = 0.9: V* = 0
Example: Bertsekas & Tsitsiklis 1996
[Diagram: two-state chain; s1 transitions to s2, and s2 loops back to itself]

7 Least Squares Approximation
Restrict V to linear functions: find θ s.t. V(s1) = θ, V(s2) = 2θ
Counterintuitive result: if we do a least squares fit of the Bellman backups of V, then θ_{t+1} = 1.08 θ_t
[Figure: V(x) plotted over the states s1, s2]
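The 1.08 growth factor can be checked numerically. A small sketch, assuming the two-state dynamics above (s1 → s2, s2 → s2, zero rewards, γ = 0.9) and the features φ(s1) = 1, φ(s2) = 2 implied by V(s1) = θ, V(s2) = 2θ: each round does a Bellman backup followed by a least-squares refit of θ.

```python
import numpy as np

gamma = 0.9
Phi = np.array([[1.0],          # phi(s1): V(s1) = theta
                [2.0]])         # phi(s2): V(s2) = 2 * theta
P = np.array([[0.0, 1.0],       # s1 -> s2
              [0.0, 1.0]])      # s2 -> s2 (self-loop)
theta = np.array([1.0])         # any nonzero starting weight

for t in range(5):
    v = Phi @ theta                               # current value estimates
    backups = gamma * (P @ v)                     # Bellman backups (rewards are all 0)
    theta_next, *_ = np.linalg.lstsq(Phi, backups, rcond=None)
    print(f"theta_{t + 1} = {theta_next[0]:.4f}  (ratio {theta_next[0] / theta[0]:.2f})")
    theta = theta_next                            # grows by a factor of 1.08 each round
```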

8 Unbounded Growth of V
[Figure: V(x) over the state space S after iterations 1, 2, …, n; the values grow without bound]

9 What Went Wrong?
VI reduces error in the maximum norm
Least squares (= projection) is non-expansive in L2
– It may increase maximum norm distance
– It grows max norm error at a faster rate than VI shrinks it
And we didn’t even use sampling!
Bad news for neural networks…
Success depends on
– the sampling distribution
– pairing the approximator and the problem
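In symbols (spelling out the norms the slide refers to; this derivation is not on the slide itself):

```latex
% Value iteration (the Bellman backup T) is a gamma-contraction in the max norm:
\| T V - T V' \|_\infty \;\le\; \gamma \, \| V - V' \|_\infty .
% Least-squares fitting is an orthogonal projection \Pi, non-expansive only in L_2:
\| \Pi V - \Pi V' \|_2 \;\le\; \| V - V' \|_2 .
% So the combined update \Pi T need not contract in \| \cdot \|_\infty,
% and its max-norm error can grow from one iteration to the next, as in the example above.
```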

10 Success Stories - Linear TD [Tsitsiklis & Van Roy 96, Bradtke & Barto 96]
Start with a set of basis functions
Restrict V to the linear space spanned by the bases
Sample states from the current policy
N.B. linear is still expressive due to the basis functions
[Diagram: the space of true value functions and the restricted linear space spanned by the bases, with Π = projection and VI steps]
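A minimal sketch of linear TD(0) policy evaluation consistent with this setup: V is restricted to span of the basis functions, and states are sampled by following the policy being evaluated. The environment, policy, basis functions, start state, and step size (env_step, policy, phi, s0, alpha) are hypothetical inputs, and episode handling is omitted.

```python
import numpy as np

def linear_td0(env_step, policy, phi, s0, gamma=0.9, alpha=0.05, n_steps=10000):
    """Sketch of linear TD(0) policy evaluation: V(s) ~ phi(s) . w, with states
    sampled by following the policy being evaluated.  env_step(s, a) -> (s', r),
    policy(s) -> a, and phi(s) -> feature vector are hypothetical callables."""
    s = s0
    w = np.zeros_like(phi(s), dtype=float)
    for _ in range(n_steps):
        a = policy(s)
        s_next, r = env_step(s, a)
        td_error = r + gamma * phi(s_next) @ w - phi(s) @ w
        w = w + alpha * td_error * phi(s)   # nudge V toward the sampled one-step backup
        s = s_next
    return w
```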

11 Linear TD Formal Properties Use to evaluate policies only Converges w.p. 1 Error measured w.r.t. stationary distribution Frequently visited states have low error Infrequent states can have high error

12 Linear TD Methods
Applications
– Inventory control: Van Roy et al.
– Packet routing: Marbach et al.
– Used by Morgan Stanley to value options
– Natural idea: use for policy iteration
  - No guarantees
  - Can produce bad policies for trivial problems [Koller & Parr 99]
  - Modified for better PI: LSPI [Lagoudakis & Parr 01]
Can be done symbolically [Koller & Parr 00]
Issues
– Selection of basis functions
– Mixing rate of the process - affects k, speed
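The slide cites LSPI [Lagoudakis & Parr 01]; as a rough illustration of the least-squares machinery it builds on, here is an LSTDQ-style sketch that fits Q_π(s, a) ≈ w·φ(s, a) from a batch of samples. The sample format, feature map, and policy callables are assumptions, and a real implementation would add regularization and wrap the LSPI policy-improvement loop around this fit.

```python
import numpy as np

def lstdq(samples, phi, policy, gamma=0.9):
    """LSTDQ-style fit of Q_pi(s, a) ~ phi(s, a) . w from a batch of
    (s, a, r, s_next) samples.  phi(s, a) and policy(s) are hypothetical
    callables; LSPI would alternate this fit with greedy policy improvement."""
    k = phi(*samples[0][:2]).shape[0]
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    # Use lstsq rather than solve() so a rank-deficient A in a toy run is tolerated.
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w
```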

13 Success Story: Averagers [Gordon 95, and others…]
Pick a set Y = y_1 … y_i of representative states
Perform VI on Y
For x not in Y, V(x) = Σ_j β_j(x) V(y_j), a fixed weighted average of the representative values
Averagers are non-expansions in max norm
Converge to within a 1/(1-γ) factor of “best”
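A sketch of averager-based value iteration under these definitions: VI is run only on the representative states Y, and the value of any successor state x outside Y is read off as β(x)·V_Y. The step, reward, and beta functions are hypothetical problem-specific inputs, and the dynamics are taken to be deterministic for brevity.

```python
import numpy as np

def averager_vi(Y, actions, step, reward, beta, gamma=0.9, n_iter=500):
    """Sketch of VI restricted to representative states Y with an averager:
    the value of an arbitrary state x is beta(x) . V_Y, where beta(x) are
    fixed nonnegative weights over Y summing to at most 1.  step(y, a),
    reward(y, a), and beta(x) are hypothetical problem-specific functions."""
    V_Y = np.zeros(len(Y))
    for _ in range(n_iter):
        # Synchronous backup at every representative state.
        V_Y = np.array([
            max(reward(y, a) + gamma * beta(step(y, a)) @ V_Y for a in actions)
            for y in Y
        ])
    return V_Y
```

Because the averager is a max-norm non-expansion, composing it with the Bellman backup keeps the update a contraction, which is why this loop is stable where the least-squares example above is not.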

14 Interpretation of Averagers
[Diagram: state x expressed as a weighted combination of representative states y1, y2, y3 with weights β1, β2, β3]

15 Interpretation of Averagers II
Averagers interpolate:
[Diagram: state x inside a grid cell with vertices y1, y2, y3, y4; grid vertices = Y]

16 General VFA Issues
What’s the best we can hope for?
– We’d like to get the approximate value function close to V*
– How does this relate to the performance of the resulting policy?
In practice:
– We are quite happy if we can prove stability
– Obtaining good results often involves an iterative process of tweaking the approximator, measuring empirical performance, and repeating…
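One standard way to make the first two questions precise (a well-known bound, not stated on the slide): if the approximation is within ε of V* in max norm, the greedy policy it induces loses at most 2γε/(1-γ) in value.

```latex
\|V - V^*\|_\infty \le \varepsilon
\;\Longrightarrow\;
\|V^* - V^{\pi_V}\|_\infty \;\le\; \frac{2\gamma\varepsilon}{1-\gamma},
\qquad \pi_V \text{ greedy with respect to } V .
```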

17 Why I’m Still Excited About VFA
Symbolic methods often fail
– Stochasticity increases the branching factor
– Many trivial problems have no exploitable structure
“Bad” value functions can have good performance
We can bound the “badness” of value functions
– By simulation
– Symbolically in some cases [Koller & Parr 00; Guestrin, Koller & Parr 01; Dean & Kim 01]
Basis function selection can be systematized

18 Hierarchical Abstraction
Reduce the problem into simpler subproblems
Chain primitive actions into macro-actions
Lots of results that mirror classical results
– Improvements dependent on user-provided decompositions
– Macro-actions are great if you start with good macros
See Dean & Lin; Parr & Russell; Precup, Sutton & Singh; Schmidhuber & Wiering; Hauskrecht et al.; etc.

