Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning. Greg Grudic, University of Colorado.

Presentation transcript:

1 Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning. Greg Grudic, University of Colorado at Boulder, and Lyle Ungar, University of Pennsylvania.

2 Reinforcement Learning (MDP). An agent acts according to a policy and receives reinforcement feedback (reward) from the environment. Goal: modify the policy to maximize reward. The state-action value function Q^π(s,a) gives the value of taking action a in state s under the current policy.
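For reference, a standard way to write the quantities the slide names, assuming a discounted-reward MDP (the talk could equally use the average-reward setting):

```latex
% Performance of a policy pi and its state-action value function
% (discounted formulation; a common convention, not necessarily the
% one used in the talk).
\rho(\pi) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\Big|\, \pi\Big],
\qquad
Q^{\pi}(s,a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\Big|\, s_{0}=s,\ a_{0}=a,\ \pi\Big].
```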

3 Policy Gradient Formulation. The policy π(s,a;θ) is parameterized by a vector θ. Searching θ-space implies searching policy space. The performance function ρ implicitly depends on θ: ρ = ρ(θ).

4 RL Policy Gradient Learning. Update equation for the parameters: θ_{k+1} = θ_k + α ∇_θ ρ(θ_k), where ∇_θ ρ is the performance gradient and α is a small positive step size.
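A minimal sketch of the learning loop this slide describes; `estimate_gradient` stands in for whichever gradient estimator (PG Models 1-3 below) is used, and all names here are illustrative rather than taken from the talk:

```python
def policy_gradient_learning(theta, estimate_gradient, alpha=0.01, n_iters=1000):
    """Generic policy-gradient ascent: repeatedly nudge the policy
    parameters theta along an estimate of the performance gradient."""
    for _ in range(n_iters):
        grad = estimate_gradient(theta)   # any of PG Models 1-3 below
        theta = theta + alpha * grad      # alpha: small positive step size
    return theta
```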

5 Why Policy Gradient RL? Computation is linear in the number of parameters, which avoids the blow-up caused by discretizing the state space. Generalization in state space is implicitly defined by the parametric representation.

6 Estimating the Performance Gradient. REINFORCE (Williams, 1992) gives an unbiased estimate of ∇_θ ρ. HOWEVER: it converges slowly and has high variance. GOAL: find PG algorithms with low-variance estimates of ∇_θ ρ.
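A minimal sketch of a REINFORCE-style Monte Carlo estimator of ∇_θ ρ, assuming a linear-softmax policy and episodic returns; these modeling choices are illustrative and not taken from the talk:

```python
import numpy as np

def softmax_policy(theta, phi):
    """Illustrative linear-softmax policy: phi has one feature row per
    action, theta is the parameter vector; returns pi(. | s; theta)."""
    logits = phi @ theta
    logits -= logits.max()                    # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def reinforce_gradient(episodes, theta):
    """Unbiased but high-variance Monte Carlo estimate of grad rho.
    `episodes` is a list of trajectories; each trajectory is a list of
    (phi, action, return_from_t) tuples."""
    grad = np.zeros_like(theta)
    n = 0
    for trajectory in episodes:
        for phi, a, G in trajectory:
            p = softmax_policy(theta, phi)
            grad_log_pi = phi[a] - p @ phi    # grad of log pi(a|s) for softmax
            grad += grad_log_pi * G
            n += 1
    return grad / max(n, 1)
```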

7 Performance Gradient Formulation. ∇_θ ρ = Σ_s d^π(s) Σ_a ∇_θ π(s,a;θ) [Q^π(s,a) − b(s)], where d^π(s) is the state distribution under π and b(s) is an arbitrary baseline (bias) term [Sutton, McAllester, Singh, Mansour, 2000] and [Konda and Tsitsiklis, 2000].
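The reason an arbitrary b(s) can be subtracted without biasing the gradient (a standard identity, spelled out here because the transcript omits the slide's equations):

```latex
\sum_{a} \nabla_{\theta}\pi(s,a;\theta)\, b(s)
  \;=\; b(s)\, \nabla_{\theta}\!\sum_{a}\pi(s,a;\theta)
  \;=\; b(s)\, \nabla_{\theta} 1 \;=\; 0 .
```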

8 Two Open Questions for Improving Convergence of PG Estimates. How should observations of Q^π(s,a) be used to reduce the variance in estimates of the performance gradient? Does there exist a bias term b(s) that reduces the variance in estimating the performance gradient?

9 Assumptions. Observations of Q^π(s,a) are independently distributed (MDP assumption); after N such observations, the estimators below are formed from the sample of observed state-action values.

10 PG Model 1: Direct Q Estimates. For N observations, the observed values of Q^π(s,a) are substituted directly into the performance gradient formulation.
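A hedged sketch of what such an estimator could look like, reusing the illustrative softmax conventions from the REINFORCE sketch above; this is not the talk's exact estimator:

```python
def pg_model1_direct_q(samples, theta):
    """PG Model 1 (sketch): plug the raw observed Q values directly
    into the performance-gradient formula.  `samples` is a list of
    (phi, action, q_hat) tuples."""
    grad = np.zeros_like(theta)
    for phi, a, q_hat in samples:
        p = softmax_policy(theta, phi)
        grad += (phi[a] - p @ phi) * q_hat    # grad log pi times observed Q
    return grad / max(len(samples), 1)
```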

11 PG Model 2: Policy Iteration with Function Approximation (PIFA) [Sutton, McAllester, Singh, Mansour, 2000]. A function approximation of Q^π is chosen (fit) using the N observations of Q^π(s,a), and the fitted values replace Q^π in the performance gradient formulation.
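A sketch of a PIFA-style estimator under the same illustrative conventions: fit a linear approximator to the observed Q values (here using the compatible features ∇_θ log π of Sutton et al. 2000, an assumption since the talk's basis functions are not shown), then use the fitted values in the gradient:

```python
def pg_model2_pifa(samples, theta):
    """PG Model 2 (sketch): least-squares fit of a linear function
    approximator f_w to the observed Q values, using compatible
    features grad log pi; the fitted f_w replaces Q in the gradient."""
    feats, targets = [], []
    for phi, a, q_hat in samples:
        p = softmax_policy(theta, phi)
        feats.append(phi[a] - p @ phi)         # compatible feature vector
        targets.append(q_hat)
    X, y = np.asarray(feats), np.asarray(targets)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)  # f_w(s,a) = x . w
    return X.T @ (X @ w) / max(len(X), 1)      # average of x * f_w(s,a)
```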

12 PG Model 3: Non-Zero Bias. For N observations, a non-zero bias term b(s) is subtracted from each Q observation, where b(s) is the average of the Q estimates observed in state s.
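A sketch of the biased estimator under the same illustrative conventions; the per-state bias b(s) is the average of the Q estimates observed in that state, and the state key used for grouping is an assumption:

```python
from collections import defaultdict

def pg_model3_bias(samples, theta):
    """PG Model 3 (sketch): subtract a non-zero bias b(s), the average
    observed Q estimate in state s, before forming the gradient.
    `samples` is a list of (state_key, phi, action, q_hat) tuples."""
    sums, counts = defaultdict(float), defaultdict(int)
    for s_key, _, _, q_hat in samples:
        sums[s_key] += q_hat
        counts[s_key] += 1
    grad = np.zeros_like(theta)
    for s_key, phi, a, q_hat in samples:
        b = sums[s_key] / counts[s_key]        # b(s): average Q estimate in s
        p = softmax_policy(theta, phi)
        grad += (phi[a] - p @ phi) * (q_hat - b)
    return grad / max(len(samples), 1)
```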

13 Theoretical Results

15 Experimental Result 1: Convergence of Three Algorithms

16 Experimental Result 2

17 Experimental Result 3

18 Conclusion. Implementation of PG algorithms significantly affects convergence. Linear basis function representations of Q can substantially degrade convergence. Appropriately chosen bias terms can improve convergence.