Policy Gradient in Continuous Time
Rémi Munos, JMLR 2006
Presented by Hui Li
Duke University Machine Learning Group
May 30, 2007
Outline
Introduction
Discretized Stochastic Processes Approximation
Model-free Reinforcement Learning (RL) Algorithm
Example Results
Introduction of the Problem
Consider an optimal control problem with continuous state x_t ∈ R^d and control u_t ∈ U.
System dynamics (deterministic process): dx_t/dt = f(x_t, u_t).
Objective: find an optimal control (u_t) that maximizes the objective function, the terminal reward J(x; (u_t)) = r(x_T) at a fixed horizon T.
Introduction of the Problem
Consider a class of policies parameterized by α ∈ R^m, with u_t = π_α(t, x_t).
Find the parameter α that maximizes the performance measure J(α).
The standard approach is the gradient ascent method, α ← α + η ∇_α J(α); computing ∇_α J(α) is the object of the paper.
Introduction of the Problem
How to compute ∇_α J(α)?
Finite-difference method: perturb each component of α and re-simulate; this requires a large number of trajectories to compute the gradient of the performance measure (see the sketch below).
Pathwise estimation of the gradient: compute the gradient using one trajectory only.
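For contrast, a minimal sketch of the finite-difference approach; J is assumed to be a routine (not given in the slides) that simulates one full trajectory under π_α and returns r(x_T).

```python
import numpy as np

def finite_difference_gradient(J, alpha, eps=1e-4):
    """Forward finite-difference estimate of grad_alpha J(alpha).

    J(alpha) is assumed to simulate one full trajectory and return the
    performance measure r(x_T). This needs m + 1 trajectories for an
    m-dimensional parameter (2m for central differences), which is the
    cost the pathwise estimate avoids.
    """
    base = J(alpha)
    grad = np.zeros_like(alpha, dtype=float)
    for i in range(alpha.size):
        alpha_pert = alpha.astype(float).copy()
        alpha_pert[i] += eps               # perturb one parameter at a time
        grad[i] = (J(alpha_pert) - base) / eps
    return grad
```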
Introduction of the Problem
Pathwise estimation of the gradient.
Define z_t = ∇_α x_t, the sensitivity of the state to the policy parameter, and write F_α(t, x) = f(x, π_α(t, x)) for the closed-loop drift.
Dynamics of z_t: dz_t/dt = ∇_x F_α(t, x_t) z_t + ∇_α F_α(t, x_t), with z_0 = 0.
Gradient: ∇_α J(α) = ∇_x r(x_T) z_T, where ∇_x r is known (the reward function is given) but z_T is unknown.
In reinforcement learning, the dynamics f is unknown. How to approximate z_t?
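To fix ideas, a minimal model-based sketch of the pathwise gradient: it assumes the closed-loop drift F_α and its derivatives are available, which is exactly what the RL setting does not provide. Function names and signatures here are mine, not the paper's.

```python
import numpy as np

def pathwise_gradient_model_based(F, grad_x_F, grad_alpha_F, grad_r,
                                  x0, alpha, T, dt):
    """Euler integration of the state x_t and sensitivity z_t = grad_alpha x_t.

    F(t, x, alpha)            -> closed-loop drift, shape (d,)
    grad_x_F(t, x, alpha)     -> Jacobian w.r.t. x, shape (d, d)
    grad_alpha_F(t, x, alpha) -> Jacobian w.r.t. alpha, shape (d, m)
    grad_r(x)                 -> gradient of the terminal reward, shape (d,)
    Returns grad_alpha J = grad_r(x_T) @ z_T, shape (m,).
    """
    x = np.asarray(x0, dtype=float).copy()
    z = np.zeros((x.size, alpha.size))        # z_0 = 0
    for k in range(int(round(T / dt))):
        t = k * dt
        dz = grad_x_F(t, x, alpha) @ z + grad_alpha_F(t, x, alpha)
        x = x + F(t, x, alpha) * dt           # state dynamics
        z = z + dz * dt                       # sensitivity dynamics
    return grad_r(x) @ z
```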
Discretized Stochastic Processes Approximation
A General Convergence Result (Theorem 3): if the average jump of a discrete-time stochastic process matches the drift of a deterministic continuous process up to o(δt), and the second-order variation of the jumps vanishes fast enough, then the discrete process converges to the continuous process as the time step δt → 0.
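Informally, and only as a paraphrase of the kind of conditions such a result relies on (not the exact statement of Theorem 3): if the discrete process (X_t) with time step δt satisfies

```latex
\mathbb{E}\big[\Delta X_t \mid X_t = x\big] = f(x)\,\delta t + o(\delta t),
\qquad
\mathbb{E}\big[\|\Delta X_t\|^2 \mid X_t = x\big] = o(\delta t),
```

then sup over t ≤ T of ||X_t − x_t|| tends to 0 in probability as δt → 0, where x_t is the solution of dx/dt = f(x) started from the same initial state.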
Discretization of the State
Use a stochastic policy π_α(u | t, x) over the finite action set U, applied at discrete times t = 0, δt, 2δt, ...
Stochastic discrete state process: initialization X_0 = x; at each step an action u_t is drawn from the policy and the state jumps by ΔX_t = X_{t+δt} − X_t, obtained by following the dynamics with action u_t during [t, t+δt] (see the reconstruction below).
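Written out (a reconstruction consistent with the slide text; the notation is mine), the discrete state process is:

```latex
X_0 = x, \qquad u_t \sim \pi_\alpha(\,\cdot \mid t, X_t),
\qquad
X_{t+\delta t} = X_t + \Delta X_t,
\quad
\Delta X_t = \int_t^{t+\delta t} f(X_s, u_t)\, ds
           = f(X_t, u_t)\,\delta t + o(\delta t).
```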
Proof of Proposition 5: from Taylor's formula, the average jump of the discrete state process is E[ΔX_t | X_t] = Σ_{u∈U} π_α(u | t, X_t) f(X_t, u) δt + o(δt), i.e., it matches the drift of the limiting continuous process. Directly applying Theorem 3, Proposition 5 is proved.
Discretization of the State Gradient
Stochastic discrete state gradient process Z_t, approximating z_t = ∇_α x_t.
Initialization: Z_0 = 0, with the jump in Z_t built from the observed state jump ΔX_t, the known policy gradient ∇_α log π_α, and a term involving ∇_x f(X_t, u_t) Z_t (see the reconstruction below).
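One consistent way to write this process (a reconstruction: its average jump reproduces the drift of z_t, but the paper's exact definition may differ in its details):

```latex
Z_0 = 0,
\qquad
Z_{t+\delta t} = Z_t
  + \nabla_x f(X_t, u_t)\, Z_t\, \delta t
  + \Delta X_t \,\big(\nabla_\alpha \log \pi_\alpha(u_t \mid t, X_t)
      + Z_t^{\top}\, \nabla_x \log \pi_\alpha(u_t \mid t, X_t)\big)^{\top}.
```

Taking the expectation over u_t recovers the drift of z_t for the policy-averaged dynamics, which is what Proposition 6 needs; the ∇_x f(X_t, u_t) Z_t term is the only part that cannot be read off the trajectory and must be approximated.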
Proof of Proposition 6: since the average jump of the discrete gradient process matches the drift of the continuous gradient z_t up to o(δt), directly applying Theorem 3, Proposition 6 is proved.
Model-free Reinforcement Learning Algorithm
In this stochastic approximation of Z_t, the state jump ΔX_t is observed and the policy term ∇_α log π_α(u_t | t, X_t) is given (the policy is known); we only need to approximate the term ∇_x f(X_t, u_t) Z_t.
Least-Squares Approximation of ∇_x f(X_t, u_t) Z_t
Define the set of past discrete times s, with t − c ≤ s ≤ t, at which the action u_t has been taken.
From Taylor's formula, for all such discrete times s, the observed jump satisfies ΔX_s ≈ [f(X_t, u_t) + ∇_x f(X_t, u_t)(X_s − X_t)] δt.
We deduce a linear relation between the observed jumps and the unknown matrix ∇_x f(X_t, u_t).
We may derive an approximation of ∇_x f(X_t, u_t) by solving the least-squares problem of fitting the observed jumps ΔX_s with this linear model; the fit is centered using the average values of X_s and ΔX_s over the selected past times. The resulting estimate is then applied to Z_t.
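A minimal sketch of this least-squares step, assuming the trajectory stores, for each action, the past states X_s and observed jumps ΔX_s; the variable names and centering details are mine, not necessarily the paper's exact estimator.

```python
import numpy as np

def estimate_grad_f_times_Z(past_X, past_dX, Z, dt):
    """Estimate grad_x f(X_t, u_t) @ Z_t from past jumps under the same action.

    past_X  : (n, d) states X_s at past times s where action u_t was taken
    past_dX : (n, d) observed jumps dX_s = X_{s+dt} - X_s
    Z       : (d, m) current gradient estimate Z_t
    dt      : time step
    """
    # Taylor model: dX_s / dt ~= f(X_t, u_t) + grad_x f(X_t, u_t) (X_s - X_t).
    # Centering both sides removes the unknown constant f(X_t, u_t).
    Xc = past_X - past_X.mean(axis=0)
    dXc = (past_dX - past_dX.mean(axis=0)) / dt
    # Solve Xc @ A.T ~= dXc in least squares, so A ~= grad_x f(X_t, u_t).
    A_T, *_ = np.linalg.lstsq(Xc, dXc, rcond=None)
    return A_T.T @ Z                      # (d, m) estimate of grad_x f Z_t
```

With fewer than d + 1 distinct past states the regression is under-determined, so in practice this term is only added once enough jumps with the current action have been observed.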
Algorithm: simulate one trajectory under the stochastic policy; at each discrete step update X_t and Z_t as above, using the least-squares estimate for the ∇_x f term; return ∇_x r(X_T) Z_T as the estimate of ∇_α J(α), and take a gradient ascent step in α.
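Putting the pieces together, a sketch of the one-trajectory estimator. The interfaces (env_reset, env_step, policy_probs, grad_log_policy, grad_x_log_policy, grad_r) are hypothetical, the Z update follows the reconstruction above, and estimate_grad_f_times_Z is the function from the previous sketch.

```python
import numpy as np

def policy_gradient_one_trajectory(env_reset, env_step, n_actions,
                                   policy_probs, grad_log_policy,
                                   grad_x_log_policy, grad_r, alpha, T, dt):
    """Model-free pathwise estimate of grad_alpha J(alpha) from one trajectory.

    Hypothetical interfaces:
      env_reset() -> x0 (d,); env_step(x, u, dt) -> next state (black box)
      policy_probs(alpha, t, x) -> probabilities over the n_actions actions
      grad_log_policy(alpha, t, x, u)   -> grad_alpha log pi, shape (m,)
      grad_x_log_policy(alpha, t, x, u) -> grad_x log pi, shape (d,)
      grad_r(x) -> gradient of the known terminal reward, shape (d,)
    """
    x = env_reset()
    d, m = x.size, alpha.size
    Z = np.zeros((d, m))                      # Z_0 = 0, estimate of z_t
    past = {u: [] for u in range(n_actions)}  # per-action memory of (X_s, dX_s)

    for k in range(int(round(T / dt))):
        t = k * dt
        u = np.random.choice(n_actions, p=policy_probs(alpha, t, x))
        x_next = env_step(x, u, dt)           # observed; dynamics f unknown
        dx = x_next - x                       # observed state jump

        # Likelihood-ratio part of the jump in Z (observed quantities only).
        step = np.outer(dx, grad_log_policy(alpha, t, x, u)
                            + Z.T @ grad_x_log_policy(alpha, t, x, u))
        # grad_x f(X_t, u) Z_t dt part, via least squares on past same-action jumps.
        if len(past[u]) > d:
            past_X = np.array([s for s, _ in past[u]])
            past_dX = np.array([j for _, j in past[u]])
            step = step + estimate_grad_f_times_Z(past_X, past_dX, Z, dt) * dt
        Z = Z + step

        past[u].append((x.copy(), dx))
        x = x_next

    return grad_r(x) @ Z                      # estimate of grad_alpha J(alpha)
```

The returned estimate is then used for the gradient ascent step on α.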
Experimental Results
Six continuous state variables: (x0, y0) hand position; (x, y) mass position; (vx, vy) mass velocity.
Four control actions: U = {(1,0), (0,1), (-1,0), (0,-1)}.
Goal: reach a target (xG, yG) with the mass at a specified time T.
Terminal reward function r(x_T), defined on the final state.
The system dynamics: the action moves the hand, and the mass position and velocity evolve deterministically, coupled to the hand position.
Consider a Boltzmann-like stochastic policy, where the probability of choosing each action u ∈ U is proportional to the exponential of a parameterized score depending on (t, x) and α.
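A minimal sketch of such a Boltzmann-like policy and its score-function gradient; the feature map φ_u(t, x) (one feature vector per action) and the linear parameterization are assumptions, not necessarily the paper's exact parameterization.

```python
import numpy as np

# The four hand-movement actions from the slide.
ACTIONS = np.array([(1, 0), (0, 1), (-1, 0), (0, -1)], dtype=float)

def boltzmann_probs(alpha, phis):
    """Boltzmann-like policy: P(u) proportional to exp(alpha . phi_u).

    phis is an (n_actions, k) array of feature vectors, one per action,
    computed from (t, x); alpha is the (k,) parameter vector (assumed form).
    """
    scores = phis @ alpha
    scores = scores - scores.max()        # numerical stability
    p = np.exp(scores)
    return p / p.sum()

def grad_log_boltzmann(alpha, phis, u):
    """grad_alpha log pi_alpha(u): phi_u minus the probability-weighted
    average feature vector."""
    p = boltzmann_probs(alpha, phis)
    return phis[u] - p @ phis
```

Because log π_α is available in closed form, ∇_α log π_α (and, when the features depend on x, ∇_x log π_α) can be supplied directly to the gradient estimator above.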
Conclusion
Described a reinforcement learning method for approximating the gradient of a continuous-time deterministic control problem with respect to the control parameters.
Used a stochastic policy to approximate the continuous system by a consistent stochastic discrete process.