Policy Gradient in Continuous Time


1 Policy Gradient in Continuous Time
by Remi Munos, JMLR 2006
Presented by Hui Li, Duke University Machine Learning Group, May 30, 2007

2 Outline
Introduction
Discretized Stochastic Processes Approximation
Model-free Reinforcement Learning (RL) Algorithm
Example Results

3 Introduction to the Problem
Consider an optimal control problem with continuous state. The system dynamics form a deterministic process over a continuous state, driven by the control u_t. Objective: find an optimal control (u_t) that maximizes the objective functional; a sketch of the formulation follows below.
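A minimal LaTeX sketch of the formulation, assuming the terminal-reward objective used later in the example (the symbols f, r, T are introduced here):

\begin{align*}
  \frac{dx_t}{dt} &= f(x_t, u_t), \qquad x_t \in \mathbb{R}^d,\; u_t \in U \quad \text{(deterministic dynamics, continuous state)}\\
  J\big(x; (u_t)_{t \ge 0}\big) &= r(x_T) \quad \text{(objective functional: terminal reward at the fixed time } T\text{)}
\end{align*}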

4 Introduction to the Problem
Consider a class of parameterized policies \pi_\alpha with parameter \alpha. Find the parameter \alpha that maximizes the performance measure J(\alpha). The standard approach is gradient ascent; computing the gradient of J with respect to \alpha is the object of the paper (see the sketch below).
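A sketch of the parameterized performance measure and the gradient-ascent update, assuming a deterministic parameterized control u_t = \pi_\alpha(t, x_t) in the continuous-time problem (the step size \eta is introduced here for illustration):

\begin{align*}
  J(\alpha) &:= J\big(x;\, u_t = \pi_\alpha(t, x_t)\big),\\
  \alpha &\leftarrow \alpha + \eta\, \nabla_\alpha J(\alpha) \quad \text{(gradient ascent)}.
\end{align*}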

5 Introduction to the Problem
How to compute \nabla_\alpha J(\alpha)?
Finite-difference method: perturb each component of \alpha and rerun the system. This requires a large number of trajectories to estimate the gradient of the performance measure.
Pathwise estimation of the gradient: compute the gradient using one trajectory only.
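For comparison, the generic finite-difference estimate (e_i the i-th basis vector, \varepsilon a small perturbation; the standard formula, not specific to the paper):

\[
  \frac{\partial J}{\partial \alpha_i}(\alpha) \;\approx\; \frac{J(\alpha + \varepsilon e_i) - J(\alpha)}{\varepsilon}, \qquad i = 1, \dots, p,
\]

so a full gradient estimate needs at least p + 1 trajectories, and more when the evaluations are noisy.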

6 Introduction to the Problem
Pathwise estimation of the gradient: define z_t as the gradient of the state with respect to the policy parameter. The dynamics of z_t contain two terms: one involving the state derivative of the system dynamics (unknown), and one involving the parameter derivative through the policy (known). In reinforcement learning the model f is unknown, so the first term is unknown. How can z_t then be approximated? (A sketch of the equations follows below.)
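A reconstruction of the pathwise-derivative equations via the chain rule, writing f_\alpha(t, x) := f(x, \pi_\alpha(t, x)) for the closed-loop drift (this composite notation is introduced here; the paper's notation may differ):

\begin{align*}
  z_t &:= \nabla_\alpha x_t, \qquad z_0 = 0,\\
  \frac{dz_t}{dt} &= \nabla_x f_\alpha(t, x_t)\, z_t \;+\; \nabla_\alpha f_\alpha(t, x_t),\\
  \nabla_\alpha J(\alpha) &= \nabla_x r(x_T)\, z_T.
\end{align*}

In the model-free setting f, and hence \nabla_x f_\alpha, is unknown, while the dependence on \alpha enters through the known policy; the rest of the talk is about approximating z_t without a model.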

7 Discretized Stochastic Processes Approximation
A general convergence result (Theorem 3): roughly, if the average jump of a discrete-time stochastic process matches the drift of an ordinary differential equation up to higher-order terms, and the jumps vanish as the time step goes to zero, then the discrete process converges to the solution of the ODE. This is the tool used in the proofs of Propositions 5 and 6 below.
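An informal sketch of the kind of condition involved, as suggested by how the result is applied in the proofs below (not a verbatim statement of Theorem 3): for a discrete process with step \delta t and increments \Delta x_n,

\[
  \mathbb{E}\big[\Delta x_n \mid x_n\big] = f(x_n)\,\delta t + o(\delta t)
  \qquad \text{and} \qquad
  \mathbb{E}\big[\|\Delta x_n\|^2 \mid x_n\big] = o(\delta t)
\]

imply that the interpolated process converges to the solution of dx/dt = f(x) as \delta t \to 0.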

8 Discretization of the state
Replace the deterministic control with a stochastic policy: at each discrete time step, an action is drawn from \pi_\alpha given the current state. This defines a stochastic discrete state process, specified by an initialization and a jump in state at each step; a sketch follows below.
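A sketch of the discretized process, with step \delta t and discrete times t_n = n\,\delta t (the exact definition in the paper may include higher-order terms):

\begin{align*}
  x_0 &= x \quad \text{(initialization)},\\
  u_n &\sim \pi_\alpha(\cdot \mid t_n, x_n) \quad \text{(stochastic policy)},\\
  x_{n+1} &= x_n + \delta t\, f(x_n, u_n) \quad \text{(jump in state)}.
\end{align*}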

9 Proof of Proposition 5: from Taylor's formula, the average jump of the discrete state process equals the drift of the (policy-averaged) continuous dynamics up to higher-order terms. Directly applying Theorem 3, Proposition 5 is proved.
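The average-jump computation plausibly takes the following form (a reconstruction consistent with the discretization sketched above, not the slide's exact equation):

\[
  \mathbb{E}\big[x_{n+1} - x_n \mid x_n\big]
  = \delta t \sum_{u \in U} \pi_\alpha(u \mid t_n, x_n)\, f(x_n, u) + o(\delta t),
\]

i.e. the drift of the policy-averaged continuous dynamics, so Theorem 3 applies.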

10 Discretization of the state gradient
In the same way, define a stochastic discrete state-gradient process z_n that mimics the continuous gradient dynamics z_t, specified by an initialization and a jump at each step; a schematic form is sketched below.
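One plausible schematic of the discrete gradient process, using the likelihood-ratio identity to replace the policy-derivative term by the observed state increment times \nabla_\alpha \log \pi_\alpha (a reconstruction under these assumptions, not the paper's exact update):

\begin{align*}
  z_0 &= 0,\\
  z_{n+1} &= z_n + \delta t\, \nabla_x f(x_n, u_n)\, z_n
           \;+\; (x_{n+1} - x_n)\, \nabla_\alpha \log \pi_\alpha(u_n \mid t_n, x_n)^{\top}.
\end{align*}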

11 Proof of Proposition 6: since the average jump of the discrete gradient process matches the drift of the continuous gradient dynamics z_t up to higher-order terms, directly applying Theorem 3 proves Proposition 6.

12 Model-free Reinforcement Learning Algorithm
Consider the stochastic approximation of z_t constructed above. In it, the state increment x_{n+1} - x_n is observed along the trajectory, and the policy term \nabla_\alpha \log \pi_\alpha(u_n \mid t_n, x_n) is given (the policy is known); the only remaining quantity to approximate is the model-dependent term involving \nabla_x f.

13 Least-Squares Approximation of the Unknown Term
Define the set of past discrete times s, with t - c \le s \le t, at which the action u_t has been taken. From Taylor's formula, for each such discrete time s, the observed state increment at time s is related to the state difference x_s - x_t through the unknown derivative \nabla_x f(x_t, u_t). We deduce the least-squares problem on the next slide.

14 where bars denote the average value over the set of past times just defined. We may derive an approximation of the unknown term by solving a least-squares problem over these past increments; the resulting estimate is then substituted into the update of z_n. A sketch of the regression follows below.
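A reconstruction of the regression being solved (notation introduced here: \Delta x_s := x_{s+1} - x_s, and bars denote averages over the selected past times; the paper's exact weighting may differ). From the Taylor expansion, \Delta x_s - \overline{\Delta x} \approx \delta t\, \nabla_x f(x_t, u_t)\,(x_s - \bar{x}), so the unknown matrix can be estimated by

\[
  \widehat{\nabla_x f}(x_t, u_t)
  \;\in\; \arg\min_{A} \sum_{s} \big\| \Delta x_s - \overline{\Delta x} - \delta t\, A\,(x_s - \bar{x}) \big\|^2,
\]

and then used in the update of z_n in place of the true \nabla_x f(x_n, u_n).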

15 Algorithm
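The algorithm box itself is not reproduced here; below is a schematic Python sketch of the gradient-estimation loop as described on the preceding slides. The environment interface env_step, the policy interface sample(t, x), the window length, and the helper names are assumptions of this sketch, not the paper's pseudocode.

import numpy as np

def estimate_gradient(policy, env_step, x0, T, dt, window=20):
    """Schematic pathwise gradient estimate over one trajectory.

    policy: object with n_params and sample(t, x) -> (u_index, grad_log_pi),
            where grad_log_pi = d log pi_alpha(u | t, x) / d alpha  (shape [p]).
    env_step: black-box simulator, env_step(x, u_index, dt) -> next state.
    Returns the final state x_T and the matrix z (shape [d, p]); the gradient
    estimate of r(x_T) w.r.t. alpha is grad_r(x_T) @ z.
    """
    n_steps = int(T / dt)
    x = np.asarray(x0, dtype=float)
    d, p = x.size, policy.n_params
    z = np.zeros((d, p))               # discrete gradient process z_n
    history = []                       # (action, x_n, dx_n) for the least-squares step

    for n in range(n_steps):
        t = n * dt
        u, grad_log_pi = policy.sample(t, x)
        x_next = env_step(x, u, dt)
        dx = x_next - x

        # Least-squares estimate of dt * grad_x f(x, u) from recent past steps
        # at which the same action u was taken (regression on state differences).
        past = [(xs, dxs) for (us, xs, dxs) in history[-window:] if us == u]
        A = np.zeros((d, d))
        if len(past) >= d:
            X = np.stack([xs - x for xs, _ in past])      # [m, d]
            Y = np.stack([dxs - dx for _, dxs in past])   # [m, d]
            # Solve Y ≈ X @ B in the least-squares sense; A = B.T maps columns.
            A = np.linalg.lstsq(X, Y, rcond=None)[0].T

        # Likelihood-ratio form of the z update (schematic, see slide 10).
        z = z + A @ z + np.outer(dx, grad_log_pi)

        history.append((u, x.copy(), dx.copy()))
        x = x_next

    return x, z

The returned z is contracted with \nabla_x r(x_T) to give the gradient estimate, which would then drive the gradient-ascent update of \alpha.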

16 Experimental Results
Six continuous state variables: x0, y0 (hand position); x, y (mass position); vx, vy (mass velocity).
Four control actions: U = {(1,0), (0,1), (-1,0), (0,-1)}.
Goal: reach a target (xG, yG) with the mass at a specified time T. The performance is measured by a terminal reward function.

17 The system dynamics: [equations omitted]. Consider a Boltzmann-like stochastic policy, in which the probability of each discrete action is proportional to the exponential of a parameterized score of the current state (see the sketch below).
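A generic sketch of such a softmax (Boltzmann) policy over the four discrete actions, compatible with the sample(t, x) interface assumed in the algorithm sketch above; the feature map and parameter layout are illustrative assumptions, not the parameterization used in the paper.

import numpy as np

class BoltzmannPolicy:
    """Softmax policy: pi_alpha(u | x) proportional to exp(alpha_u . phi(x) / tau)."""

    def __init__(self, n_actions, features, n_features, temperature=1.0):
        self.features = features                        # callable: (t, x) -> phi, shape [n_features]
        self.alpha = np.zeros((n_actions, n_features))  # one weight vector per action
        self.temperature = temperature
        self.n_params = self.alpha.size

    def sample(self, t, x, rng=np.random):
        phi = self.features(t, x)
        scores = self.alpha @ phi / self.temperature
        scores -= scores.max()                          # numerical stability
        p = np.exp(scores)
        p /= p.sum()
        u = int(rng.choice(len(p), p=p))
        # d log pi_alpha(u | x) / d alpha_a = (1{a == u} - p_a) * phi / tau
        grad = -np.outer(p, phi) / self.temperature
        grad[u] += phi / self.temperature
        return u, grad.ravel()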

18

19 Conclusion
Described a reinforcement learning method for approximating the gradient of a continuous-time deterministic control problem with respect to the control parameters.
Used a stochastic policy to approximate the continuous system by a consistent stochastic discrete process.

