Policy Gradient in Continuous Time
by Remi Munos, JMLR 2006. Presented by Hui Li, Duke University Machine Learning Group, May 30, 2007.
Outline
- Introduction
- Discretized Stochastic Processes Approximation
- Model-free Reinforcement Learning (RL) Algorithm
- Example Results
Introduction of the Problem
Consider an optimal control problem with continuous state. Control: u_t ∈ U; state: x_t ∈ R^d, continuous. System dynamics (a deterministic process): dx_t/dt = f(x_t, u_t). Objective: find an optimal control (u_t) that maximizes the functional J(x; (u_t)). Objective function: J measures the reward obtained along the trajectory up to a fixed horizon T (in the example at the end, a terminal reward received at time T).
Introduction of the Problem
Consider a class of parameterized policies π_α with parameter α ∈ R^n, so that u_t = π_α(x_t, t). Find the parameter α that maximizes the performance measure J(α). The standard approach is the gradient ascent method, α ← α + η ∇_α J(α); computing the gradient ∇_α J(α) is the object of the paper.
Introduction of the Problem
How to compute the gradient ∇_α J(α)?
- Finite-difference method: perturb each component of α and re-simulate. This method requires a large number of trajectories to compute the gradient of the performance measure (see the sketch below).
- Pathwise estimation of the gradient: compute the gradient using one trajectory only.
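For concreteness, here is a minimal sketch of the finite-difference approach (not code from the paper): the toy dynamics f(x, u) = -x + u, the linear policy, the terminal reward, and all step sizes are illustrative assumptions. Estimating the gradient by central differences requires two re-simulated trajectories per parameter component, which is the drawback noted above.

    import numpy as np

    def simulate(alpha, x0=0.0, T=1.0, dt=0.01):
        """Euler-integrate the toy deterministic system dx/dt = f(x, u) = -x + u
        under the assumed linear feedback policy u = alpha[0]*x + alpha[1]."""
        x = x0
        for _ in range(int(T / dt)):
            u = alpha[0] * x + alpha[1]
            x = x + dt * (-x + u)
        return x

    def performance(alpha, target=1.0):
        """Terminal-reward objective J(alpha) = R(x_T) = -(x_T - target)^2."""
        return -(simulate(alpha) - target) ** 2

    def finite_difference_gradient(alpha, eps=1e-4):
        """Central finite differences: 2 * len(alpha) trajectories per estimate."""
        alpha = np.asarray(alpha, dtype=float)
        grad = np.zeros_like(alpha)
        for i in range(alpha.size):
            e = np.zeros_like(alpha)
            e[i] = eps
            grad[i] = (performance(alpha + e) - performance(alpha - e)) / (2 * eps)
        return grad

    # One gradient-ascent step on the policy parameters.
    alpha = np.array([0.0, 0.5])
    alpha = alpha + 0.1 * finite_difference_gradient(alpha)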
Introduction of the Problem
Pathwise estimation of the gradient. Define z_t := ∇_α x_t, the gradient of the state with respect to the policy parameter. Dynamics of z_t: obtained by differentiating the state dynamics, it combines derivatives of f (unknown) with derivatives of the policy π_α (known), and the gradient of the performance measure then follows from z_t along the trajectory. In reinforcement learning, f is unknown; how can z_t be approximated?
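To make the pathwise idea concrete, a minimal model-based sketch (not from the paper; the scalar dynamics f(x, u) = -x + u, the policy π_α(x) = α·x, and the terminal reward are illustrative assumptions): the sensitivity z_t is integrated alongside x_t using the known derivatives of f, and the gradient of the terminal reward is read off as R'(x_T)·z_T. In the model-free setting these derivatives of f are unavailable, which is exactly the difficulty addressed by the discretized stochastic processes below.

    def pathwise_gradient(alpha, x0=1.0, T=1.0, dt=0.01, target=1.0):
        """Integrate x_t and its sensitivity z_t = dx_t/dalpha along ONE trajectory.
        Assumed model: f(x, u) = -x + u, policy u = alpha * x, reward R(x) = -(x - target)^2."""
        x, z = x0, 0.0
        for _ in range(int(T / dt)):
            u = alpha * x
            # Chain rule: dz/dt = f_x * z + f_u * (pi_x * z + pi_alpha)
            f_x, f_u = -1.0, 1.0        # derivatives of f: known only if the model is known
            pi_x, pi_a = alpha, x       # derivatives of the policy: always known
            z = z + dt * (f_x * z + f_u * (pi_x * z + pi_a))
            x = x + dt * (-x + u)
        return -2.0 * (x - target) * z  # dJ/dalpha = R'(x_T) * z_T

    print(pathwise_gradient(alpha=0.5))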
Discretized Stochastic Processes Approximation
A General Convergence Result (Theorem 3): if the average jump of a discrete-time stochastic process is consistent with the drift of an ordinary differential equation up to o(δt) terms (with suitably bounded jumps), then as the time step δt → 0 the process converges to the solution of that ODE.
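A small numerical illustration of this kind of result (an illustrative toy example, not the paper's): a random-action Euler process whose average jump matches the drift of dx/dt = -x approaches the ODE solution x0·e^{-T} as δt shrinks.

    import numpy as np

    rng = np.random.default_rng(0)

    def discrete_process(x0=1.0, T=1.0, dt=0.01):
        """Random-action Euler process X_{t+dt} = X_t + dt*(-X_t + u_t), with u_t = +/-1
        equally likely. Its average jump is dt*(-X_t), the drift of dx/dt = -x."""
        X = x0
        for _ in range(int(T / dt)):
            u = rng.choice([-1.0, 1.0])
            X = X + dt * (-X + u)
        return X

    # Mean gap to the ODE solution exp(-T) shrinks as dt -> 0.
    for dt in (0.1, 0.01, 0.001):
        errs = [abs(discrete_process(dt=dt) - np.exp(-1.0)) for _ in range(200)]
        print(dt, float(np.mean(errs)))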
Discretization of the state
Stochastic policy: at each discrete time t = k·δt, the action u_t is drawn from the distribution π_α(u | X_t, t). Stochastic discrete state process: initialization X_0 = x; jump in state X_{t+δt} = X_t + δt·f(X_t, u_t) + o(δt), the state following the dynamics with action u_t held fixed over the interval.
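A minimal sketch of such a discretized stochastic process (illustrative assumptions only: a two-action set, a softmax-style stochastic policy, and the toy dynamics f(x, u) = -x + u; none of these are the paper's choices):

    import numpy as np

    rng = np.random.default_rng(0)
    ACTIONS = [-1.0, +1.0]                      # assumed finite action set U

    def policy_probs(alpha, x):
        """Assumed stochastic policy pi_alpha(u | x): softmax over action scores."""
        scores = np.array([alpha * u * x for u in ACTIONS])
        p = np.exp(scores - scores.max())
        return p / p.sum()

    def simulate_discrete_process(alpha, x0=1.0, T=1.0, dt=0.01):
        """Stochastic discrete state process: X_0 = x0 and, at each discrete time,
        u_t ~ pi_alpha(. | X_t), then X_{t+dt} = X_t + dt * f(X_t, u_t)."""
        X = x0
        for _ in range(int(T / dt)):
            u = rng.choice(ACTIONS, p=policy_probs(alpha, X))
            X = X + dt * (-X + u)               # assumed dynamics f(x, u) = -x + u
        return X

    print(simulate_discrete_process(alpha=0.5))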
Proof of Proposition 5: from Taylor's formula, the average jump of the discrete state process matches the drift of the continuous dynamics up to o(δt). Directly applying Theorem 3, Proposition 5 is proved.
Discretization of the state gradient
Stochastic discrete state gradient process. Initialization: Z_0 = ∇_α x_0 = 0. The jump rule for Z parallels the continuous-time dynamics of z_t, with its terms built from the state process and the (known) policy gradient.
Proof of Proposition 6: since the average jump of the gradient process matches the dynamics of z_t up to o(δt), directly applying Theorem 3, Proposition 6 is proved.
Model-free Reinforcement Learning Algorithm
In this stochastic approximation, the state jumps X_{t+δt} − X_t are observed and the policy π_α (hence its gradient ∇_α π_α) is given; we only need to approximate the remaining term, which involves the unknown dynamics f.
Least-Squares Approximation of the Unknown Term
Define the set of past discrete times s, with t − c ≤ s ≤ t, at which the current action u_t was taken. From Taylor's formula, for each such discrete time s, the observed state jump X_{s+δt} − X_s equals δt·f(X_s, u_s) up to o(δt) terms. We deduce an approximate linear relation between the observed jumps and the unknown term.
We may derive an approximation of the unknown term by solving a least-squares problem over these past observations; the resulting estimate is expressed in terms of average values of the observed jumps over that set of past times.
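As a rough, simplified sketch of this idea (not the paper's exact estimator): assuming the estimate for each action is simply the average of the recent observed increments taken under that action, the bookkeeping looks as follows.

    import numpy as np
    from collections import defaultdict

    def estimate_drifts(states, actions, dt, window=50):
        """Per-action drift estimates from one observed trajectory.
        states: list of X_t (length N+1); actions: list of u_t (length N).
        Returns, for each action, the average of the most recent increments
        (X_{s+dt} - X_s) / dt observed at times s when that action was taken."""
        recent = defaultdict(list)
        for s, u in enumerate(actions):
            recent[u].append((states[s + 1] - states[s]) / dt)
            if len(recent[u]) > window:         # keep only a sliding window of past times
                recent[u].pop(0)
        return {u: float(np.mean(v)) for u, v in recent.items()}

    # Usage with a short illustrative trajectory (numbers are placeholders, not data).
    xs = [0.0, 0.01, 0.03, 0.02, 0.04]
    us = [+1, +1, -1, +1]
    print(estimate_drifts(xs, us, dt=0.01))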
Algorithm
Experimental Results
Six continuous state variables:
- x0, y0: hand position
- x, y: mass position
- vx, vy: mass velocity
Four control actions: U = {(1,0), (0,1), (-1,0), (0,-1)}
Goal: reach a target (xG, yG) with the mass at a specified time T.
Terminal reward function: measures how close the mass is to the target at time T.
The system dynamics: the control drives the hand position, while the mass position and velocity evolve under dynamics coupling the mass to the hand. Consider a Boltzmann-like stochastic policy, in which each action u ∈ U is selected with probability proportional to the exponential of a parameterized score.
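A minimal sketch of sampling from such a Boltzmann-like policy over U (the linear score function, its features, and the temperature are illustrative assumptions, not the parameterization used in the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    U = [(1, 0), (0, 1), (-1, 0), (0, -1)]      # the four control actions

    def boltzmann_policy(alpha, state, temperature=1.0):
        """Boltzmann-like stochastic policy: P(u | state) proportional to
        exp(score(state, u) / temperature), with an assumed linear score."""
        scores = np.array([np.dot(alpha, np.concatenate((state, np.array(u, float))))
                           for u in U])
        p = np.exp((scores - scores.max()) / temperature)
        return p / p.sum()

    def sample_action(alpha, state):
        probs = boltzmann_policy(alpha, state)
        return U[rng.choice(len(U), p=probs)]

    # Example: state = (x0, y0, x, y, vx, vy), so alpha has dimension 6 + 2 here.
    state = np.zeros(6)
    alpha = np.zeros(8)
    print(sample_action(alpha, state))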
Conclusion
- Described a reinforcement learning method for approximating the gradient of a continuous-time deterministic problem with respect to the control parameters.
- Used a stochastic policy to approximate the continuous system by a consistent stochastic discrete process.