1 Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning
Greg Grudic, University of Colorado at Boulder, grudic@cs.colorado.edu
Lyle Ungar, University of Pennsylvania, ungar@cis.upenn.edu
2 Reinforcement Learning (MDP)
Policy
Reinforcement feedback (environment)
Goal: modify the policy to maximize reward
State-action value function
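For reference, a standard definition of the state-action value function under a discounted, infinite-horizon criterion (the discounted form and the symbols below are assumptions made here; the slide does not state the reward criterion):

```latex
Q^{\pi}(s,a) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \;\middle|\; s_{0}=s,\; a_{0}=a \right],
\qquad 0 \le \gamma < 1 .
```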
3 Policy Gradient Formulation
Policy parameterized by a parameter vector θ
– Searching θ space implies searching policy space
Performance function ρ(θ) implicitly depends on θ
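As a concrete illustration of a policy parameterized by θ, a minimal sketch; the softmax-over-linear-features form is a hypothetical choice made here, not something stated on the slide:

```python
import numpy as np

def softmax_policy(theta, features):
    """pi(s, a; theta) over discrete actions, where `features` is a
    hypothetical (num_actions, num_params) feature matrix for state s."""
    prefs = features @ theta          # action preferences, linear in theta
    prefs -= prefs.max()              # shift for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

# Changing theta changes the action probabilities, so searching theta
# space is searching the space of stochastic policies.
theta = np.zeros(4)
print(softmax_policy(theta, np.eye(4)))   # uniform over 4 actions
```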
4 RL Policy Gradient Learning
Update equation for the parameters: θ ← θ + α ∂ρ/∂θ,
where ∂ρ/∂θ is the performance gradient and α is a small positive step size.
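A minimal sketch of this update in code; the estimator producing grad_rho is left abstract here (it is whichever of the PG models on the later slides is used):

```python
import numpy as np

def policy_gradient_step(theta, grad_rho, alpha=0.01):
    """One gradient-ascent update: theta <- theta + alpha * d(rho)/d(theta),
    with alpha a small positive step size."""
    return theta + alpha * np.asarray(grad_rho)
```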
5 Why Policy Gradient RL?
Computation linear in the number of parameters
– avoids blow-up from discretization
Generalization in state space is implicitly defined by the parametric representation
6 Estimating the Performance Gradient
REINFORCE (Williams 1992) gives an unbiased estimate of ∂ρ/∂θ
– HOWEVER: slow convergence; the estimate has high variance
GOAL: find PG algorithms with low-variance estimates of ∂ρ/∂θ
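A minimal sketch of a REINFORCE-style estimate for an episodic task, assuming a hypothetical helper grad_log_pi(s, a) that returns the gradient of log π(s, a; θ) with respect to θ; the Monte Carlo returns are what make the estimate unbiased but high-variance:

```python
import numpy as np

def reinforce_gradient(episode, grad_log_pi, gamma=1.0):
    """episode: list of (state, action, reward) tuples from one rollout."""
    # returns-to-go G_t, computed backwards through the episode
    returns, G = [], 0.0
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    # sum_t grad log pi(a_t | s_t; theta) * G_t  (unbiased, high variance)
    grad = np.zeros_like(grad_log_pi(*episode[0][:2]), dtype=float)
    for (s, a, _), G_t in zip(episode, returns):
        grad += grad_log_pi(s, a) * G_t
    return grad
```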
7 Performance Gradient Formulation
∂ρ/∂θ = Σ_s d^π(s) Σ_a (∂π(s,a)/∂θ) [Q^π(s,a) + b(s)]
where d^π(s) is the state distribution under policy π, and b(s) is arbitrary because Σ_a ∂π(s,a)/∂θ = 0.
[Sutton, McAllester, Singh, Mansour, 2000] and [Konda and Tsitsiklis, 2000]
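A tabular sketch of this formulation, with hypothetical arrays d_pi (state distribution, shape (S,)), dpi_dtheta (∂π/∂θ, shape (S, A, P)), Q (shape (S, A)), and an optional per-state term b (shape (S,)); because the derivatives of π sum to zero over actions, adding b(s) leaves the exact gradient unchanged:

```python
import numpy as np

def performance_gradient(d_pi, dpi_dtheta, Q, b=None):
    """d(rho)/d(theta) = sum_s d_pi(s) sum_a dpi(s,a)/dtheta [Q(s,a) + b(s)]."""
    if b is not None:
        Q = Q + b[:, None]                    # arbitrary per-state offset
    # contract over states and actions, leaving the parameter dimension
    return np.einsum('s,sap,sa->p', d_pi, dpi_dtheta, Q)
```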
8 Two Open Questions for Improving Convergence of PG Estimates
How should observations of Q^π(s,a) be used to reduce the variance in estimates of the performance gradient?
Does there exist a bias term b(s) that reduces the variance in estimating the performance gradient?
9 Assumptions
Observations of Q^π(s,a) are independently distributed (MDP); the gradient estimates below are formed after N such observations.
10 PG Model 1: Direct Q estimates
For N observations, the observed values of Q^π(s,a) are substituted directly into the performance gradient formulation.
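A sketch of how I read PG Model 1: average the N observed values of Q^π for each state-action pair and substitute the sample means directly into the gradient sum; the (s, a, q) observation format and the tabular arrays are assumptions made for illustration:

```python
import numpy as np

def direct_q_gradient(observations, d_pi, dpi_dtheta, num_states, num_actions):
    """observations: list of (state, action, observed_q) triples."""
    sums = np.zeros((num_states, num_actions))
    counts = np.zeros((num_states, num_actions))
    for s, a, q in observations:
        sums[s, a] += q
        counts[s, a] += 1
    # sample-mean estimate of Q(s,a); zero where a pair was never observed
    Q_hat = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    return np.einsum('s,sap,sa->p', d_pi, dpi_dtheta, Q_hat)
```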
11 PG Model 2: PIFA (Policy Iteration with Function Approximation) [Sutton, McAllester, Singh, Mansour, 2000]
An approximation of Q^π(s,a) is chosen using the N observations and substituted into the performance gradient formulation.
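A sketch of PG Model 2 as described in Sutton, McAllester, Singh, Mansour (2000): fit a compatible linear function approximation f_w(s,a) = wᵀ ∇log π(s,a;θ) to the N observed Q values by least squares and use it in place of Q^π; the tabular arrays pi (S, A) and grad_log_pi (S, A, P) are hypothetical stand-ins:

```python
import numpy as np

def pifa_gradient(observations, d_pi, pi, grad_log_pi):
    """observations: list of (state, action, observed_q) triples."""
    # least-squares fit of f_w(s,a) = w . grad log pi(s,a; theta)
    X = np.array([grad_log_pi[s, a] for s, a, _ in observations])
    y = np.array([q for _, _, q in observations])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    f_w = grad_log_pi @ w                        # (S, A) approximation of Q
    dpi_dtheta = pi[:, :, None] * grad_log_pi    # dpi/dtheta = pi * grad log pi
    return np.einsum('s,sap,sa->p', d_pi, dpi_dtheta, f_w)
```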
12 PG Model 3: Non-Zero Bias
For N observations, the bias term b(s) is set to the average of the Q estimates in state s.
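A sketch of PG Model 3 as I read the slide: set b(s) to the average of the Q estimates observed in state s and apply it as a per-state offset inside the gradient sum; whether the paper adds or subtracts the term, and whether the average is over observations or over actions, is not recoverable from the slide, so the choices below are assumptions:

```python
import numpy as np

def biased_gradient(observations, d_pi, dpi_dtheta, num_states, num_actions):
    """observations: list of (state, action, observed_q) triples."""
    sums = np.zeros((num_states, num_actions))
    counts = np.zeros((num_states, num_actions))
    for s, a, q in observations:
        sums[s, a] += q
        counts[s, a] += 1
    Q_hat = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    # b(s): average of all Q observations made in state s
    obs_per_state = counts.sum(axis=1)
    b = np.divide(sums.sum(axis=1), obs_per_state,
                  out=np.zeros(num_states), where=obs_per_state > 0)
    return np.einsum('s,sap,sa->p', d_pi, dpi_dtheta, Q_hat - b[:, None])
```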
13 Theoretical Results
15 Experimental Result 1: Convergence of Three Algorithms
16 Experimental Result 2
17 Experimental Result 3
18 Conclusion
Implementation of PG algorithms significantly affects convergence.
Linear basis function representations of Q can substantially degrade convergence.
Appropriately chosen bias terms can improve convergence.