PEGASUS: A policy search method for large MDPs and POMDPs
Andrew Ng, Michael Jordan
Presented by: Geoff Levine

Motivation
For large, complicated domains, estimating value functions or Q-functions can take a long time. However, there often exist policies far simpler than the optimal one that perform nearly as well.
- This suggests searching directly through a policy space.

Preliminaries
An MDP is a tuple M = (S, D, A, {P_sa(.)}, γ, R):
- S: set of states
- D: initial-state distribution
- A: set of actions
- P_sa(.): S -> [0,1]: transition probabilities for taking action a in state s
- γ: discount factor
- R: deterministic reward (a function of state)
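As a concrete sketch only, the ingredients above can be bundled into a small container; every name here is illustrative rather than from the paper:

```python
from dataclasses import dataclass
from typing import Callable, Sequence
import numpy as np

@dataclass
class MDP:
    """Bundle of the MDP ingredients M = (S, D, A, {P_sa}, gamma, R)."""
    sample_initial_state: Callable[[np.random.Generator], np.ndarray]  # draw s0 ~ D
    actions: Sequence                                                  # A
    sample_next_state: Callable                                        # draw s' ~ P_sa(.)
    reward: Callable[[np.ndarray], float]                              # deterministic R(s)
    gamma: float                                                       # discount factor
```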

Policies
A policy is a map π: S -> A. Its value function V^π: S -> R satisfies
V^π(s) = R(s) + γ E_{s' ~ P_{s,π(s)}}[V^π(s')].
For convenience, also define V(π) = E_{s_0 ~ D}[V^π(s_0)].

Application Domain
Helicopter flight (hovering in place):
- 12-dimensional continuous state space ([0,1]^12): (x, y, z, pitch, roll, yaw, x', y', z', pitch', roll', yaw')
- 4-dimensional continuous action space ([0,1]^4): front/back cyclic pitch control, left/right cyclic pitch control, main rotor pitch control, tail rotor pitch control
- Timesteps correspond to 1/50th of a second
- γ = 0.9995
- R(s) = -(a(x - x*)^2 + b(y - y*)^2 + c(z - z*)^2 + (yaw - yaw*)^2)
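The quadratic hover penalty above is simple to compute directly; a small sketch follows, where the weights a, b, c and the target values are placeholders rather than the paper's actual coefficients:

```python
def hover_reward(s, target, a=1.0, b=1.0, c=1.0):
    """R(s) = -(a(x-x*)^2 + b(y-y*)^2 + c(z-z*)^2 + (yaw-yaw*)^2).
    s and target are dicts with keys 'x', 'y', 'z', 'yaw'; the weights
    here are illustrative defaults, not values from the paper."""
    return -(a * (s["x"] - target["x"]) ** 2
             + b * (s["y"] - target["y"]) ** 2
             + c * (s["z"] - target["z"]) ** 2
             + (s["yaw"] - target["yaw"]) ** 2)
```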

Helicopter (image slide)

Transformation of MDPs
Given M = (S, D, A, {P_sa(.)}, γ, R), we construct M' = (S', D', A, {P'_sa(.)}, γ, R'), an MDP with deterministic state transitions.
Intuition: instead of rolling the dice each time we move from state to state, we roll all the dice we will ever need ahead of time and store their results as part of the state.

Parcheesi (image slide)

Deterministic Simulative Model
Assume we have a deterministic functional representation of the MDP's transitions:
- g: S x A x [0,1]^{d_p} -> S such that if p is distributed uniformly in [0,1]^{d_p}, then Pr_p[g(s, a, p) = s'] = P_sa(s').
- This is more powerful than a generative model.
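For a finite state set, such a g can be built with the inverse-CDF trick; the sketch below assumes d_p = 1 and a hypothetical table transition_probs of next-state distributions:

```python
import numpy as np

def g(s, a, p, transition_probs):
    """Deterministic simulative model for a finite-state MDP.
    transition_probs[(s, a)] is a probability vector over next states and
    p is one uniform [0,1] number supplied by the caller. If p ~ Uniform[0,1],
    the output is distributed as P_sa(.); for a fixed p it is a deterministic
    function of (s, a)."""
    cdf = np.cumsum(transition_probs[(s, a)])
    return int(np.searchsorted(cdf, p))
```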

Transformations of MDPs
- S' = S x [0,1]^∞
- D': draws t = (s, p_1, p_2, p_3, ...) with s ~ D and the p_i drawn i.i.d. from Uniform[0,1]
- P'_ta(t') = 1 if g(s, a, p_1) = s', and 0 otherwise (taking d_p = 1), where t = (s, p_1, p_2, p_3, ...) and t' = (s', p_2, p_3, ...)
- R'(t) = R(s)

Policies
Given a policy space Π for S, consider a corresponding policy space Π' for S' such that for every π in Π there is a π' in Π' with π'((s, p_1, p_2, p_3, ...)) = π(s) for all s in S and all p_1, p_2, ....
As the transition probabilities and rewards are equivalent in the transformed MDP:
V_M^π(s) = E_{p ~ Unif[0,1]^∞}[V_{M'}^{π'}((s, p))]
V_M(π) = V_{M'}(π')

Policy Search
V_M^π(s_0) = R(s_0) + γ E_{s' ~ P_{s_0, π(s_0)}}[V^π(s')]
V_{M'}^{π'}((s_0, p_1, p_2, ...)) = R(s_0) + γ R(s_1) + γ^2 R(s_2) + ..., where s_1 = g(s_0, π'(s_0), p_1) and s_2 = g(s_1, π'(s_1), p_2).
As V_M(π) = V_{M'}(π'), we can estimate V_M(π) = E_{t_0 ~ D'}[V_{M'}^{π'}(t_0)].

PEGASUS: Policy Evaluation-of-Goodness And Search Using Scenarios
Draw a sample of m initial states (scenarios) {s_0^(1), s_0^(2), s_0^(3), ..., s_0^(m)} i.i.d. from D'.
Estimate V_{M'}(π') by the sample average (1/m) Σ_{i=1..m} V_{M'}^{π'}(s_0^(i)).
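A minimal sketch of this scenario-based estimator, assuming a deterministic simulative model g(s, a, p) like the one above; all names are illustrative, and each scenario stores only the H pre-drawn uniforms it will need (taking d_p = 1 per step):

```python
import numpy as np

def draw_scenarios(sample_initial_state, m, horizon, rng):
    """Draw m scenarios once; the same scenarios are reused for every
    policy that is evaluated."""
    return [(sample_initial_state(rng), rng.uniform(size=horizon))
            for _ in range(m)]

def pegasus_value_estimate(policy, g, reward, scenarios, gamma, horizon):
    """Average the deterministic, truncated returns over the fixed scenarios.
    Because the randomness is pre-drawn, this is a deterministic function
    of the policy."""
    total = 0.0
    for s0, ps in scenarios:
        s, ret, discount = s0, 0.0, 1.0
        for t in range(horizon):
            ret += discount * reward(s)
            s = g(s, policy(s), ps[t])
            discount *= gamma
        total += ret
    return total / len(scenarios)
```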

PEGASUS
Given the scenarios {s_0^(1), s_0^(2), s_0^(3), ..., s_0^(m)}, the estimate is a deterministic function of the policy.
The sum is infinite, but it can be truncated after H_ε = log_γ(ε(1-γ)/(2R_max)) steps, introducing at most ε/2 error. Truncation also lets us store our "dice rolls" in finite space.
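For instance, with the helicopter's γ = 0.9995 this horizon formula can be evaluated directly; the ε and R_max below are illustrative choices, not values from the paper:

```python
import math

def truncation_horizon(gamma, eps, r_max):
    # H_eps = log_gamma(eps * (1 - gamma) / (2 * r_max))
    return math.log(eps * (1 - gamma) / (2 * r_max)) / math.log(gamma)

print(truncation_horizon(gamma=0.9995, eps=1.0, r_max=100.0))
# ~25,800 steps, i.e. several minutes of simulated flight at 50 Hz
```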

PEGASUS
Given the deterministic function V_{M'}(π), we can use an optimization technique to find argmax_π V_{M'}(π).
- If working in a continuous, smooth, differentiable domain, we can use gradient ascent.
- If R is discontinuous, we may need "continuation" methods to smooth it out.
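As a sketch of one such optimizer, here is a simple random-perturbation hill climb over policy parameters (the slides mention both gradient ascent and random-walk searches; this shows only the latter, with hypothetical names):

```python
import numpy as np

def random_walk_search(theta0, objective, n_iters=1000, step=0.01, seed=0):
    """Hill-climb the deterministic PEGASUS estimate over policy parameters.
    objective(theta) must evaluate the same fixed scenarios on every call,
    so an observed improvement is real and not simulation noise."""
    rng = np.random.default_rng(seed)
    theta, best = np.asarray(theta0, dtype=float), objective(theta0)
    for _ in range(n_iters):
        candidate = theta + step * rng.standard_normal(theta.shape)
        value = objective(candidate)
        if value > best:
            theta, best = candidate, value
    return theta, best
```

Here objective would wrap the pegasus_value_estimate sketch above, with the scenarios, model, reward, discount, and horizon held fixed.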

Results
- On a 5x5 gridworld POMDP, PEGASUS discovers a near-optimal policy from very few scenarios (~5).
- On a continuous state/action bicycle-riding problem, the results are near optimal and far better than earlier reward-shaping methods.

Helicopter Hovering
The policy is represented by a hand-crafted neural network, and PEGASUS is used to search through the set of possible network weights.
- Both gradient ascent and random-walk searches were tried.

Neural Network Structure (diagram)
(x, y, z) = (forward, sideways, down)
a_1 = front/back cyclic pitch control, a_2 = left/right cyclic pitch control, a_3 = main rotor pitch control, a_4 = tail rotor pitch control

Results
- The controller was able to keep the helicopter stable on its maiden flight (hovering video).
- The neural network was modified to fly competition-class maneuvers (triangle video).
- Finally, hovering upside down was accomplished.

Pseudo-Dimension
Let H be a set of functions X -> R. H shatters x_1, x_2, ..., x_d ∈ X if there exists a sequence of real numbers t_1, t_2, ..., t_d such that {(h(x_1) - t_1, h(x_2) - t_2, ..., h(x_d) - t_d) | h ∈ H} intersects all 2^d orthants of R^d.
The pseudo-dimension of H, dim_P(H), is the size of the largest set shattered by H.

Lipschitz Continuity
A function f is Lipschitz continuous with Lipschitz bound B if ||f(x) - f(y)|| <= B ||x - y|| (with respect to the Euclidean norm on both domain and range).

Realizable Dynamics in an MDP
Let S = [0,1]^{d_s} and g: S x A x [0,1]^{d_p} -> S be given. Define F_i as the set of coordinate functions
F_i = {F_i^a : S x [0,1]^{d_p} -> [0,1] | a in A}, where F_i^a(s, p_1, ..., p_{d_p}) = I_i(g(s, a, p_1, ..., p_{d_p}))
and I_i(x) returns the i-th coordinate of x.

PEGASUS Theoretical Result
Let S = [0,1]^{d_s}, a policy class Π, and a model g: S x A x [0,1]^{d_p} -> S be given. Let F be the family of realizable dynamics in the MDP and F_i the resulting families of coordinate functions. For all i, let dim_P(F_i) <= d, let the F_i be uniformly Lipschitz continuous with bound B, and let the reward function R be Lipschitz continuous with bound B_R. Then, if the number of scenarios m exceeds a bound polynomial in the relevant quantities (the explicit bound is omitted here), with probability at least 1 - δ the PEGASUS estimate V'(π) will be uniformly close to the actual value: |V'(π) - V(π)| <= ε for all π in Π.

Proof (1)
Think of the reward at step i as a random variable:
V^π(s_0^(1)) = R(s_0^(1)) + γ R(s_1^(1)) + γ^2 R(s_2^(1)) + ...
V^π(s_0^(2)) = R(s_0^(2)) + γ R(s_1^(2)) + γ^2 R(s_2^(2)) + ...
V^π(s_0^(3)) = R(s_0^(3)) + γ R(s_1^(3)) + γ^2 R(s_2^(3)) + ...
By bounding properties of each R(s_i^(j)), we can prove uniform convergence for V(π).

Proof (2)
Calling on work by Haussler, we show that if the pseudo-dimension of each F_i satisfies dim_P(F_i) <= d, we can "nearly" represent our world-dynamics functions F_i^a by a smaller, finite set of functions (the explicit size bound is omitted here).

Proof (3)
Similarly, if the F_i uniformly have Lipschitz bound B and the reward function R has Lipschitz bound B_R, we can "nearly" represent the function mapping scenarios to i-th step rewards by a finite set of functions (the explicit size bound is omitted here).

Proof (4)
A result by Haussler then shows that, with probability 1 - δ, the i-th step reward will be ε-close to its mean provided we select a sufficiently large number of scenarios (the explicit bound is omitted here).

Proof (5)
Strengthening the bound to account for all H_ε per-step rewards and employing the union bound, we find that a bounded number of scenarios is sufficient (again, the explicit bound is omitted here).

Critique
- Success is limited to a very small, fairly linear control problem with a high-frequency controller.
- A lot of human bias is incorporated into the system: restrictions and linear regression for model identification, and a hand-designed neural-network structure for each task.
- PAC-learning guarantees are still out of reach, and there are no theoretical bounds on the final policy.

Bibliography
1. Haussler, D. Chapter on the PAC learning model and decision-theoretic generalizations, with applications to neural nets. In Mathematical Perspectives on Neural Networks, Lawrence Erlbaum Associates, 1995; also in Information and Computation, Vol. 100, September 1992.
2. Ng, A. Y., and Jordan, M. I. PEGASUS: A policy search method for large MDPs and POMDPs. In Uncertainty in Artificial Intelligence, Sixteenth Conference, 2000.
3. Ng, A. Y., Kim, H. J., Jordan, M. I., and Sastry, S. Autonomous helicopter flight via reinforcement learning. In Advances in Neural Information Processing Systems.
4. Ng, A. Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., Berger, E., and Liang, E. Inverted autonomous helicopter flight via reinforcement learning. In International Symposium on Experimental Robotics, 2004.

Application: Helicopter Flight
PEGASUS has been used to derive policies for hovering in place, later generalized to handle slow-motion maneuvers and upside-down hovering. A GPS system relays state information (position and velocity) to an off-board computer, which computes a 4-dimensional action.

Model Identification
Construction of an MDP representation of the world dynamics. The transition dynamics are learned from several minutes of data from human-piloted flight:
- Fit using linear regression
- Forced to respect innate properties of the domain (gravity, symmetry)
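As a rough sketch of such a fit, here is an ordinary least-squares estimate of one-step linear dynamics from logged flight data; the paper's actual model imposes extra structure (e.g. gravity and symmetry constraints) that is not shown, and all names are illustrative:

```python
import numpy as np

def fit_linear_dynamics(states, actions):
    """Least-squares fit of s_{t+1} ~ A s_t + B a_t + c from logged data.
    states: (T+1, d_s) array of observed states; actions: (T, d_a) array."""
    X = np.hstack([states[:-1], actions, np.ones((len(actions), 1))])
    Y = states[1:]
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)   # W stacks [A; B; c]
    d_s, d_a = states.shape[1], actions.shape[1]
    A, B, c = W[:d_s].T, W[d_s:d_s + d_a].T, W[-1]
    return A, B, c
```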