Policy Gradient in Continuous Time
by Remi Munos, JMLR 2006. Presented by Hui Li, Duke University Machine Learning Group, May 30, 2007.
Outline
- Introduction
- Discretized Stochastic Processes Approximation
- Model-free Reinforcement Learning (RL) Algorithm
- Example Results
Introduction of the Problem
Consider an optimal control problem with continuous state. Control: u_t ∈ U; state: x_t ∈ R^d, continuous. System dynamics (a deterministic process): dx_t/dt = f(x_t, u_t). Objective: find an optimal control (u_t) that maximizes the functional J(x; (u_t)). Objective function: J measures the reward obtained along the trajectory up to a fixed horizon T (in the example at the end, a terminal reward received at time T).
Introduction of the Problem
Consider a class of parameterized policies π_α with parameter α ∈ R^n, so that u_t = π_α(x_t, t). Find the parameter α that maximizes the performance measure J(α). The standard approach is the gradient ascent method, α ← α + η ∇_α J(α); computing the gradient ∇_α J(α) is the object of the paper.
Introduction of the Problem
How to compute the gradient ∇_α J(α)?
- Finite-difference method: perturb each component of α and re-simulate. This method requires a large number of trajectories to compute the gradient of the performance measure (see the sketch below).
- Pathwise estimation of the gradient: compute the gradient using one trajectory only.
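For concreteness, here is a minimal sketch of the finite-difference approach (not code from the paper): the toy dynamics f(x, u) = -x + u, the linear policy, the terminal reward, and all step sizes are illustrative assumptions. Estimating the gradient by central differences requires two re-simulated trajectories per parameter component, which is the drawback noted above.

    import numpy as np

    def simulate(alpha, x0=0.0, T=1.0, dt=0.01):
        """Euler-integrate the toy deterministic system dx/dt = f(x, u) = -x + u
        under the assumed linear feedback policy u = alpha[0]*x + alpha[1]."""
        x = x0
        for _ in range(int(T / dt)):
            u = alpha[0] * x + alpha[1]
            x = x + dt * (-x + u)
        return x

    def performance(alpha, target=1.0):
        """Terminal-reward objective J(alpha) = R(x_T) = -(x_T - target)^2."""
        return -(simulate(alpha) - target) ** 2

    def finite_difference_gradient(alpha, eps=1e-4):
        """Central finite differences: 2 * len(alpha) trajectories per estimate."""
        alpha = np.asarray(alpha, dtype=float)
        grad = np.zeros_like(alpha)
        for i in range(alpha.size):
            e = np.zeros_like(alpha)
            e[i] = eps
            grad[i] = (performance(alpha + e) - performance(alpha - e)) / (2 * eps)
        return grad

    # One gradient-ascent step on the policy parameters.
    alpha = np.array([0.0, 0.5])
    alpha = alpha + 0.1 * finite_difference_gradient(alpha)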
Introduction of the Problem
Pathwise estimation of the gradient. Define z_t := ∇_α x_t, the gradient of the state with respect to the policy parameter. Dynamics of z_t: obtained by differentiating the state dynamics, it combines derivatives of f (unknown) with derivatives of the policy π_α (known), and the gradient of the performance measure then follows from z_t along the trajectory. In reinforcement learning, f is unknown; how can z_t be approximated?
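To make the pathwise idea concrete, a minimal model-based sketch (not from the paper; the scalar dynamics f(x, u) = -x + u, the policy π_α(x) = α·x, and the terminal reward are illustrative assumptions): the sensitivity z_t is integrated alongside x_t using the known derivatives of f, and the gradient of the terminal reward is read off as R'(x_T)·z_T. In the model-free setting these derivatives of f are unavailable, which is exactly the difficulty addressed by the discretized stochastic processes below.

    def pathwise_gradient(alpha, x0=1.0, T=1.0, dt=0.01, target=1.0):
        """Integrate x_t and its sensitivity z_t = dx_t/dalpha along ONE trajectory.
        Assumed model: f(x, u) = -x + u, policy u = alpha * x, reward R(x) = -(x - target)^2."""
        x, z = x0, 0.0
        for _ in range(int(T / dt)):
            u = alpha * x
            # Chain rule: dz/dt = f_x * z + f_u * (pi_x * z + pi_alpha)
            f_x, f_u = -1.0, 1.0        # derivatives of f: known only if the model is known
            pi_x, pi_a = alpha, x       # derivatives of the policy: always known
            z = z + dt * (f_x * z + f_u * (pi_x * z + pi_a))
            x = x + dt * (-x + u)
        return -2.0 * (x - target) * z  # dJ/dalpha = R'(x_T) * z_T

    print(pathwise_gradient(alpha=0.5))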
Discretized Stochastic Processes Approximation
A General Convergence Result (Theorem 3): if the average jump of a discrete-time stochastic process is consistent with the drift of an ordinary differential equation up to o(δt) terms (with suitably bounded jumps), then as the time step δt → 0 the process converges to the solution of that ODE.
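A small numerical illustration of this kind of result (an illustrative toy example, not the paper's): a random-action Euler process whose average jump matches the drift of dx/dt = -x approaches the ODE solution x0·e^{-T} as δt shrinks.

    import numpy as np

    rng = np.random.default_rng(0)

    def discrete_process(x0=1.0, T=1.0, dt=0.01):
        """Random-action Euler process X_{t+dt} = X_t + dt*(-X_t + u_t), with u_t = +/-1
        equally likely. Its average jump is dt*(-X_t), the drift of dx/dt = -x."""
        X = x0
        for _ in range(int(T / dt)):
            u = rng.choice([-1.0, 1.0])
            X = X + dt * (-X + u)
        return X

    # Mean gap to the ODE solution exp(-T) shrinks as dt -> 0.
    for dt in (0.1, 0.01, 0.001):
        errs = [abs(discrete_process(dt=dt) - np.exp(-1.0)) for _ in range(200)]
        print(dt, float(np.mean(errs)))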
Discretization of the state
Stochastic policy: at each discrete time t = k·δt, the action u_t is drawn from the distribution π_α(u | X_t, t). Stochastic discrete state process: initialization X_0 = x; jump in state X_{t+δt} = X_t + δt·f(X_t, u_t) + o(δt), the state following the dynamics with action u_t held fixed over the interval.
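A minimal sketch of such a discretized stochastic process (illustrative assumptions only: a two-action set, a softmax-style stochastic policy, and the toy dynamics f(x, u) = -x + u; none of these are the paper's choices):

    import numpy as np

    rng = np.random.default_rng(0)
    ACTIONS = [-1.0, +1.0]                      # assumed finite action set U

    def policy_probs(alpha, x):
        """Assumed stochastic policy pi_alpha(u | x): softmax over action scores."""
        scores = np.array([alpha * u * x for u in ACTIONS])
        p = np.exp(scores - scores.max())
        return p / p.sum()

    def simulate_discrete_process(alpha, x0=1.0, T=1.0, dt=0.01):
        """Stochastic discrete state process: X_0 = x0 and, at each discrete time,
        u_t ~ pi_alpha(. | X_t), then X_{t+dt} = X_t + dt * f(X_t, u_t)."""
        X = x0
        for _ in range(int(T / dt)):
            u = rng.choice(ACTIONS, p=policy_probs(alpha, X))
            X = X + dt * (-X + u)               # assumed dynamics f(x, u) = -x + u
        return X

    print(simulate_discrete_process(alpha=0.5))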
Proof of Proposition 5: from Taylor's formula, the average jump of the discrete state process matches the drift of the continuous dynamics up to o(δt). Directly applying Theorem 3, Proposition 5 is proved.
Discretization of the state gradient
Stochastic discrete state gradient process. Initialization: Z_0 = ∇_α x_0 = 0. The jump rule for Z parallels the continuous-time dynamics of z_t, with its terms built from the state process and the (known) policy gradient.
Proof of Proposition 6: since the average jump of the gradient process matches the dynamics of z_t up to o(δt), directly applying Theorem 3, Proposition 6 is proved.
Model-free Reinforcement Learning Algorithm
In this stochastic approximation, the state jumps X_{t+δt} − X_t are observed and the policy π_α (hence its gradient ∇_α π_α) is given; we only need to approximate the remaining term, which involves the unknown dynamics f.
Least-Squares Approximation of the Unknown Term
Define the set of past discrete times s, with t − c ≤ s ≤ t, at which the current action u_t was taken. From Taylor's formula, for each such discrete time s, the observed state jump X_{s+δt} − X_s equals δt·f(X_s, u_s) up to o(δt) terms. We deduce an approximate linear relation between the observed jumps and the unknown term.
We may derive an approximation of the unknown term by solving a least-squares problem over these past observations; the resulting estimate is expressed in terms of average values of the observed jumps over that set of past times.
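As a rough, simplified sketch of this idea (not the paper's exact estimator): assuming the estimate for each action is simply the average of the recent observed increments taken under that action, the bookkeeping looks as follows.

    import numpy as np
    from collections import defaultdict

    def estimate_drifts(states, actions, dt, window=50):
        """Per-action drift estimates from one observed trajectory.
        states: list of X_t (length N+1); actions: list of u_t (length N).
        Returns, for each action, the average of the most recent increments
        (X_{s+dt} - X_s) / dt observed at times s when that action was taken."""
        recent = defaultdict(list)
        for s, u in enumerate(actions):
            recent[u].append((states[s + 1] - states[s]) / dt)
            if len(recent[u]) > window:         # keep only a sliding window of past times
                recent[u].pop(0)
        return {u: float(np.mean(v)) for u, v in recent.items()}

    # Usage with a short illustrative trajectory (numbers are placeholders, not data).
    xs = [0.0, 0.01, 0.03, 0.02, 0.04]
    us = [+1, +1, -1, +1]
    print(estimate_drifts(xs, us, dt=0.01))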
Algorithm
Experimental Results
Six continuous state variables:
- x0, y0: hand position
- x, y: mass position
- vx, vy: mass velocity
Four control actions: U = {(1,0), (0,1), (-1,0), (0,-1)}
Goal: reach a target (xG, yG) with the mass at a specified time T.
Terminal reward function: measures how close the mass is to the target at time T.
The system dynamics: the control drives the hand position, while the mass position and velocity evolve under dynamics coupling the mass to the hand. Consider a Boltzmann-like stochastic policy, in which each action u ∈ U is selected with probability proportional to the exponential of a parameterized score.
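A minimal sketch of sampling from such a Boltzmann-like policy over U (the linear score function, its features, and the temperature are illustrative assumptions, not the parameterization used in the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    U = [(1, 0), (0, 1), (-1, 0), (0, -1)]      # the four control actions

    def boltzmann_policy(alpha, state, temperature=1.0):
        """Boltzmann-like stochastic policy: P(u | state) proportional to
        exp(score(state, u) / temperature), with an assumed linear score."""
        scores = np.array([np.dot(alpha, np.concatenate((state, np.array(u, float))))
                           for u in U])
        p = np.exp((scores - scores.max()) / temperature)
        return p / p.sum()

    def sample_action(alpha, state):
        probs = boltzmann_policy(alpha, state)
        return U[rng.choice(len(U), p=probs)]

    # Example: state = (x0, y0, x, y, vx, vy), so alpha has dimension 6 + 2 here.
    state = np.zeros(6)
    alpha = np.zeros(8)
    print(sample_action(alpha, state))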
Conclusion
- Described a reinforcement learning method for approximating the gradient of a continuous-time deterministic problem with respect to the control parameters.
- Used a stochastic policy to approximate the continuous system by a consistent stochastic discrete process.