Policy Gradient in Continuous Time

Policy Gradient in Continuous Time, by Rémi Munos (JMLR 2006). Presented by Hui Li, Duke University Machine Learning Group, May 30, 2007.

Outline
- Introduction
- Discretized Stochastic Processes Approximation
- Model-free Reinforcement Learning (RL) Algorithm
- Example and Results

Introduction of the Problem
Consider an optimal control problem with continuous state x_t and control u_t.
- System dynamics: a deterministic process with continuous state, dx_t/dt = f(x_t, u_t).
- Objective: find a control (u_t) that maximizes the objective functional, here the terminal reward r(x_T) at a fixed horizon T.
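
A compact statement of the setup (a sketch consistent with the slide; the symbol names are the usual ones but should be treated as assumptions here):

\frac{dx_t}{dt} = f(x_t, u_t), \qquad u_t \in U, \qquad x_0 \text{ given}
\max_{(u_t)_{t \in [0,T]}} \; J\big(x_0; (u_t)\big) = r(x_T)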

Introduction of the Problem
Consider a class of parameterized policies π_α with parameter α. The goal is to find the parameter α that maximizes the performance measure J(α). The standard approach is to use gradient ascent; computing the gradient of J with respect to α is the object of the paper.
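
The gradient-ascent update the slide refers to takes the standard form (η is a step size; this notation is an assumption):

\alpha \leftarrow \alpha + \eta \, \nabla_\alpha J(\alpha)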

Introduction of the Problem
How can ∇_α J(α) be computed?
- Finite-difference method: perturb each component of α and re-simulate (a sketch follows this slide). This method requires a large number of trajectories to compute the gradient of the performance measure.
- Pathwise estimation of the gradient: compute the gradient using one trajectory only.
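
A minimal sketch of the finite-difference baseline, assuming a rollout routine J(alpha) that simulates one trajectory under policy parameters alpha and returns the performance measure (the function name and the central-difference scheme are illustrative assumptions):

import numpy as np

def finite_difference_gradient(J, alpha, eps=1e-4):
    """Estimate grad J(alpha) by central differences.

    Needs 2 * len(alpha) rollouts, which is why a pathwise estimate
    from a single trajectory is preferable.
    """
    alpha = np.asarray(alpha, dtype=float)
    grad = np.zeros_like(alpha)
    for i in range(alpha.size):
        e = np.zeros_like(alpha)
        e[i] = eps
        grad[i] = (J(alpha + e) - J(alpha - e)) / (2 * eps)
    return grad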

Introduction of the Problem
Pathwise estimation of the gradient. Define z_t = ∇_α x_t, the sensitivity of the state with respect to the policy parameter. The dynamics of z_t are obtained by differentiating the state dynamics; the gradient of the performance measure then combines z_T with the gradient of the terminal reward, which is known, whereas the dynamics terms driving z_t are unknown. In reinforcement learning, f is unknown, so how can z_t be approximated?
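
One way to write the quantities the slide refers to (a reconstruction from the problem setup; the paper's exact form may differ in how the policy's state dependence is expanded):

z_t := \nabla_\alpha x_t, \qquad
\frac{dz_t}{dt} = \nabla_x f(x_t, u_t)\, z_t + \nabla_u f(x_t, u_t)\, \nabla_\alpha u_t, \qquad z_0 = 0
\nabla_\alpha J(\alpha) = \nabla_x r(x_T)\, z_T

The reward gradient \nabla_x r is known, but the Jacobians \nabla_x f and \nabla_u f are not available in the model-free setting; this is what the rest of the talk addresses.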

Discretized Stochastic Processes Approximation
A general convergence result (Theorem 3): roughly, if the average jump of a discrete-time stochastic process matches the drift of a deterministic differential equation up to higher-order terms in the time step, and the jump sizes shrink with the time step, then the process converges to the solution of that equation as the step size goes to zero.

Discretization of the State
Use a stochastic policy π_α(u | x) and discretize time with step δ. This defines a stochastic discrete state process: initialization at the starting state x_0; at each step, draw an action from the stochastic policy and jump in state according to the system dynamics over one step, x_{n+1} = x_n + δ f(x_n, u_n).
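
A minimal simulation sketch of this discretized stochastic process, assuming a dynamics function f(x, u) and a stochastic policy policy(x) that samples an action (both hypothetical placeholders; in the model-free setting the increment δ f(x_n, u_n) is observed from the real system rather than computed):

import numpy as np

def simulate_discrete_process(f, policy, x0, T, delta):
    """Stochastic discrete state process x_{n+1} = x_n + delta * f(x_n, u_n),
    with the action u_n drawn from the stochastic policy at each step."""
    x = np.asarray(x0, dtype=float)
    trajectory, actions = [x.copy()], []
    for _ in range(int(T / delta)):
        u = policy(x)                          # sample u_n ~ pi_alpha(. | x_n)
        x = x + delta * np.asarray(f(x, u))    # jump in state
        trajectory.append(x.copy())
        actions.append(u)
    return trajectory, actions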

Proof of Proposition 5: From Taylor's formula, expand the state over one time step. The average jump, taking the expectation over the action drawn from the stochastic policy, matches the drift of the continuous dynamics up to higher-order terms in δ. Directly applying Theorem 3, Proposition 5 is proved.
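
A sketch of the average-jump computation the proof relies on (a reconstruction; the exact remainder terms are as in the paper):

\mathbb{E}\big[x_{n+1} - x_n \mid x_n\big]
  = \delta \sum_{u \in U} \pi_\alpha(u \mid x_n)\, f(x_n, u) + o(\delta)

so the discrete process tracks the deterministic trajectory driven by the policy-averaged vector field \bar f(x) = \mathbb{E}_{u \sim \pi_\alpha(\cdot \mid x)}[f(x, u)].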

Discretization of the State Gradient
Analogously, define a stochastic discrete state gradient process z_n approximating z_t: initialization z_0 = 0; at each step, z_n is updated using the observed state increments together with the gradient of the stochastic policy with respect to the parameter α.

Proof of Proposition 6: Since the average jump of the discrete gradient process matches the dynamics of z_t up to higher-order terms in δ, directly applying Theorem 3, Proposition 6 is proved.

Model-free Reinforcement Learning Algorithm
In this stochastic approximation, the state increments are observed and the policy (hence its gradient with respect to α) is given; the only quantity left to approximate is the term that depends on the unknown dynamics f.

Least-Squares Approximation
Define the set of past discrete times s, with t − c ≤ s ≤ t, at which the action u_t has been taken. From Taylor's formula, for every such discrete time s, the observed state increment equals the dynamics under that action times the time step, up to higher-order terms. We deduce a linear relation in the unknown dynamics term.

We may then derive an approximation of this dynamics term by solving a least-squares problem over that window, and substitute the estimate into the gradient process update. Here the bar denotes the average value of the corresponding quantity over the window.
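
A generic sketch of such a least-squares fit, assuming we regress the observed state increments on a first-order Taylor model over the window of past times at which the same action was used (variable names and the exact regression targets are assumptions; the paper's formulation may group terms differently):

import numpy as np

def ls_taylor_fit(xs, jumps, delta):
    """First-order least-squares fit of observed jumps.

    Assumed model: jumps[k] / delta  ~=  f0 + G @ (xs[k] - x_bar),
    where f0 estimates the drift at the average state x_bar and G its Jacobian in x.
    """
    X = np.asarray(xs, dtype=float)                    # states over the window (same action)
    Y = np.asarray(jumps, dtype=float) / delta         # per-unit-time increments
    x_bar = X.mean(axis=0)
    A = np.hstack([np.ones((len(X), 1)), X - x_bar])   # regressors [1, x - x_bar]
    coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)
    f0 = coeffs[0]         # estimated drift f(x_bar, u)
    G = coeffs[1:].T       # estimated Jacobian of f with respect to x
    return f0, G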

Algorithm

Experimental Results
Six continuous state variables:
- x0, y0: hand position
- x, y: mass position
- vx, vy: mass velocity
Four control actions: U = {(1,0), (0,1), (-1,0), (0,-1)}
Goal: reach a target (xG, yG) with the mass at a specified time T; the terminal reward function measures how close the mass is to the target at time T (a code skeleton of this setup follows).
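
A skeleton of this setup (the terminal reward written here, the negative squared distance of the mass to the goal, is an illustrative assumption rather than the exact function used in the paper):

import numpy as np

# State vector: [x0, y0, x, y, vx, vy] -- hand position, mass position, mass velocity.
ACTIONS = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]  # U

def terminal_reward(state, goal):
    """Illustrative terminal reward: negative squared distance of the mass to the goal."""
    x, y = state[2], state[3]
    xG, yG = goal
    return -((x - xG) ** 2 + (y - yG) ** 2)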

The system dynamics: the control moves the hand, and the mass attached to it follows spring dynamics toward the hand. Consider a Boltzmann-like stochastic policy, where the probability of choosing each action is proportional to the exponential of a parameterized score of the current state (see the sketch below).
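
A minimal sketch of a Boltzmann-like policy over the four actions, assuming a linear score alpha[u] . phi(x) per action with some feature map phi (the feature map and parameterization are assumptions; the paper specifies its own form):

import numpy as np

ACTIONS = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]  # U, as on the slide

def boltzmann_policy(alpha, features, state, temperature=1.0):
    """Sample an action with probability proportional to exp(alpha[u] . phi(state) / T)."""
    phi = features(state)
    scores = np.array([np.dot(alpha[u], phi) for u in range(len(ACTIONS))]) / temperature
    probs = np.exp(scores - scores.max())      # numerically stable softmax
    probs /= probs.sum()
    u = np.random.choice(len(ACTIONS), p=probs)
    return ACTIONS[u], probs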

Conclusion
- Described a reinforcement learning method for approximating the gradient, with respect to the control parameters, of the performance measure of a continuous-time deterministic control problem.
- Used a stochastic policy to approximate the continuous system by a consistent stochastic discrete process.