PEGASUS: A policy search method for large MDP’s and POMDP’s Andrew Ng, Michael Jordan Presented by: Geoff Levine.

Motivation
For large, complicated domains, estimating value functions or Q-functions can take a long time. However, there often exist policies far simpler than the optimal one that perform nearly as well.
– We can therefore search directly through a policy space.

Preliminaries
MDP: $M = (S, D, A, \{P_{sa}(\cdot)\}, \gamma, R)$
– $S$: set of states
– $D$: initial state distribution
– $A$: set of actions
– $P_{sa}(\cdot) : S \to [0,1]$: transition probabilities
– $\gamma$: discount factor
– $R$: deterministic rewards (a function of state)

Policies
Policy $\pi : S \to A$
Value function $V^{\pi} : S \to \mathbb{R}$
$V^{\pi}(s) = R(s) + \gamma \, E_{s' \sim P_{s\pi(s)}}[V^{\pi}(s')]$
For convenience, also define: $V(\pi) = E_{s_0 \sim D}[V^{\pi}(s_0)]$
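The recursive definition of the value function can be checked numerically on a tiny example. The following is a minimal sketch in Python, assuming a hypothetical two-state, two-action MDP (the numbers in `R`, `P`, and `D` are invented for illustration and do not come from the talk). It iterates the value equation for a fixed policy until it converges, then averages over the initial state distribution.

```python
# Hypothetical two-state, two-action MDP, used only for illustration.
GAMMA = 0.9
R = [0.0, 1.0]                 # R(s): deterministic reward per state
D = [0.5, 0.5]                 # initial state distribution
# P[s][a][s_next]: probability of moving from s to s_next under action a
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]

def evaluate_policy(policy, sweeps=2000):
    """Iterate V(s) = R(s) + gamma * E[V(s')] for a fixed policy (list of actions)."""
    V = [0.0, 0.0]
    for _ in range(sweeps):
        V = [
            R[s] + GAMMA * sum(P[s][policy[s]][s2] * V[s2] for s2 in range(2))
            for s in range(2)
        ]
    return V

def policy_value(policy):
    """V(pi): the expectation of V^pi(s0) with s0 drawn from D."""
    V = evaluate_policy(policy)
    return sum(D[s] * V[s] for s in range(2))

print(evaluate_policy([1, 1]), policy_value([1, 1]))
```

Under the policy that always takes action 1, state 1 is absorbing with reward 1, so its value converges to $1/(1-\gamma) = 10$. This exact computation is feasible only because the toy state space is tiny, which is the limitation that motivates policy search.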

Application Domain
Helicopter flight (hovering in place)
– 12-dimensional continuous state space $[0,1]^{12}$: $(x, y, z, \text{pitch}, \text{roll}, \text{yaw}, x', y', z', \text{pitch}', \text{roll}', \text{yaw}')$
– 4-dimensional continuous action space $[0,1]^4$: (front/back cyclic pitch control, left/right cyclic pitch control, main rotor pitch control, tail rotor pitch control)
– Timesteps correspond to 1/50th of a second
– $\gamma = 0.9995$
– $R(s) = -(a(x - x^*)^2 + b(y - y^*)^2 + c(z - z^*)^2 + (\text{yaw} - \text{yaw}^*)^2)$

[Figure: the helicopter]

Transformation of MDP’s
Given $M = (S, D, A, \{P_{sa}(\cdot)\}, \gamma, R)$, we construct $M' = (S', D', A, \{P'_{sa}(\cdot)\}, \gamma, R')$, an MDP with deterministic state transitions.
Intuition: instead of rolling the dice each time we move from state to state, we roll all the dice we need ahead of time and store the results as part of the state.

[Figure: Parcheesi]

Deterministic Simulative Model
Assume we have a deterministic functional representation of the MDP’s transitions:
– $g : S \times A \times [0,1]^{d_p} \to S$ such that if $p$ is distributed uniformly in $[0,1]^{d_p}$, then $\Pr_p[g(s, a, p) = s'] = P_{sa}(s')$.
– This is more powerful than a generative model: the caller supplies the random numbers, so they can be fixed and reused.
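For a finite state space, one way to obtain such a function $g$ is inverse-CDF sampling with $d_p = 1$. The sketch below assumes a hypothetical transition table `P` (values invented for illustration); the helper `empirical_frequency` is likewise only an illustrative check that uniform $p$ reproduces $P_{sa}(s')$.

```python
import random

# Hypothetical transition table P[s][a][s_next]; values invented for illustration.
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]

def g(s, a, p):
    """Deterministic simulative model: next state for (s, a) given p in [0, 1)."""
    cumulative = 0.0
    for s_next, prob in enumerate(P[s][a]):
        cumulative += prob
        if p < cumulative:
            return s_next
    return len(P[s][a]) - 1  # guard against floating-point round-off near p = 1

def empirical_frequency(s, a, s_next, n, seed=0):
    """Fraction of n uniform draws of p for which g(s, a, p) equals s_next."""
    rng = random.Random(seed)
    hits = sum(g(s, a, rng.random()) == s_next for _ in range(n))
    return hits / n

print(g(0, 0, 0.1), g(0, 0, 0.85))
print(empirical_frequency(0, 0, 1, 20000))
```

All randomness enters through the argument `p`, so calling `g` twice with the same arguments always gives the same next state. A plain generative model, which draws its own random numbers internally, offers no such control.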

Transformations of MDP’s
– $S' = S \times [0,1]^{\infty}$
– $D'$: draw $(s, p_1, p_2, p_3, \ldots)$ such that $s \sim D$ and the $p_i$ are drawn i.i.d. from Uniform$[0,1]$
– $P'_{ta}(t') = 1$ if $g(s, a, p_1) = s'$, and $0$ otherwise (taking $d_p = 1$)
– $R'(t) = R(s)$
where $t = (s, p_1, p_2, p_3, \ldots)$ and $t' = (s', p_2, p_3, \ldots)$
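The transformed transition can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the dynamics `g` are a hypothetical toy model, the infinite sequence of dice is replaced by a finite tuple, and `lift_policy`, `step_prime`, and `rollout` are names introduced here.

```python
def g(s, a, p):
    """Hypothetical toy dynamics: the move by a succeeds with probability 0.7."""
    return s + a if p < 0.7 else s

def lift_policy(policy):
    """Policy on S' that ignores the stored dice and acts on the underlying state s."""
    return lambda t: policy(t[0])

def step_prime(t, a):
    """Deterministic transition of M': consume p1, keep the remaining dice."""
    s, p1, rest = t[0], t[1], t[2:]
    return (g(s, a, p1),) + rest

def rollout(t, policy_prime):
    """Follow policy_prime until the stored dice run out; return the visited states."""
    states = [t[0]]
    while len(t) > 1:
        t = step_prime(t, policy_prime(t))
        states.append(t[0])
    return states

t0 = (0, 0.5, 0.9, 0.1)   # initial state s = 0 plus three pre-rolled dice
print(rollout(t0, lift_policy(lambda s: 1)))
```

Because every die is fixed inside `t0`, repeating the rollout gives the same trajectory each time. The only randomness left in $M'$ is the draw of the initial state from $D'$.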

Policies
Given a policy space $\Pi$ for $S$, consider a corresponding policy space $\Pi'$ for $S'$ such that
– $\forall \pi \in \Pi, \ \exists \pi' \in \Pi', \ \forall s \in S, \ \forall p_1, p_2, \ldots$: $\pi'((s, p_1, p_2, p_3, \ldots)) = \pi(s)$
As the transition probabilities and rewards are equivalent in the transformed MDP:
$V_M^{\pi}(s) = E_{p \sim \text{Unif}[0,1]^{\infty}}[V_{M'}^{\pi'}((s, p))]$
$V_M(\pi) = V_{M'}(\pi')$

Policy Search
$V_M^{\pi}(s_0) = R(s_0) + \gamma \, E_{s' \sim P_{s_0 \pi(s_0)}}[V_M^{\pi}(s')]$
$V_{M'}^{\pi'}((s_0, p_1, p_2, \ldots)) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots$
– $s_1 = g(s_0, \pi(s_0), p_1)$, $s_2 = g(s_1, \pi(s_1), p_2)$, and so on
As $V_M(\pi) = V_{M'}(\pi')$, we can estimate $V_M(\pi) = E_{t_0 \sim D'}[V_{M'}^{\pi'}(t_0)]$.

PEGASUS
Policy Evaluation-of-Goodness And Search Using Scenarios
Draw a sample of $m$ initial states (scenarios) $\{s_0^{(1)}, s_0^{(2)}, s_0^{(3)}, \ldots, s_0^{(m)}\}$ i.i.d. from $D'$.
Estimate $\hat{V}_M(\pi) = \frac{1}{m} \sum_{i=1}^{m} V_{M'}^{\pi'}(s_0^{(i)})$

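PEGASUS
Given $\{s_0^{(1)}, s_0^{(2)}, s_0^{(3)}, \ldots, s_0^{(m)}\}$, the estimate $\hat{V}_M(\pi)$, the average of $V_{M'}^{\pi'}(s_0^{(i)})$ over the $m$ scenarios, is a deterministic function of $\pi$.
The sum defining each $V_{M'}^{\pi'}(s_0^{(i)})$ is infinite, but it can be truncated after $H_\varepsilon = \log_\gamma(\varepsilon(1-\gamma)/(2R_{\max}))$ steps, introducing at most $\varepsilon/2$ error.
Truncation also lets us store our “dice rolls” in finite space.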
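The truncation horizon and the scenario-based estimate can be sketched as follows. This is a minimal illustration, assuming a hypothetical two-state MDP (the tables `R` and `P`, the uniform initial distribution, and the fixed policy `POLICY` are invented for the example) and rounding $H_\varepsilon$ up to an integer.

```python
import math
import random

GAMMA = 0.9
R = [0.0, 1.0]                                  # hypothetical rewards, R_max = 1
P = [                                           # hypothetical P[s][a][s_next]
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]
POLICY = [1, 1]                                 # pi(s): always take action 1

def horizon(gamma, eps, r_max):
    """H_eps = log_gamma(eps * (1 - gamma) / (2 * R_max)), rounded up."""
    return math.ceil(math.log(eps * (1 - gamma) / (2 * r_max)) / math.log(gamma))

def g(s, a, p):
    """Deterministic simulative model via inverse-CDF sampling (d_p = 1)."""
    cumulative = 0.0
    for s_next, prob in enumerate(P[s][a]):
        cumulative += prob
        if p < cumulative:
            return s_next
    return len(P[s][a]) - 1

def draw_scenarios(m, h, rng):
    """Each scenario is an initial state (uniform over {0, 1}) plus h stored dice."""
    return [(0 if rng.random() < 0.5 else 1, [rng.random() for _ in range(h)])
            for _ in range(m)]

def pegasus_estimate(policy, scenarios):
    """Average truncated discounted return over the fixed scenarios."""
    total = 0.0
    for s0, dice in scenarios:
        s, discount = s0, 1.0
        for p in dice:
            total += discount * R[s]
            s = g(s, policy[s], p)
            discount *= GAMMA
    return total / len(scenarios)

H = horizon(GAMMA, eps=0.1, r_max=1.0)
scenarios = draw_scenarios(2000, H, random.Random(0))
estimate = pegasus_estimate(POLICY, scenarios)
print(H, estimate)
```

With $\gamma = 0.9$, $\varepsilon = 0.1$, and $R_{\max} = 1$ the horizon is 51 steps. Once `scenarios` is drawn, `pegasus_estimate` contains no randomness, so evaluating the same policy twice returns exactly the same number.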

PEGASUS
Given the deterministic estimate $\hat{V}(\pi)$ of $V_{M'}(\pi')$, we can use an optimization technique to find $\arg\max_{\pi} \hat{V}(\pi)$.
– If working in a continuous, smooth, differentiable domain, we can use gradient ascent.
– If $R$ is discontinuous, we may need to use “continuation” methods to smooth it out.
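To illustrate the search step, the sketch below runs gradient ascent on a scenario-based estimate for a hypothetical one-dimensional control problem. Everything in it is an assumption made for illustration: the dynamics `g`, the reward $-s^2$, the linear policy $a = -\theta s$, the fixed horizon of 30 steps, and the use of a central finite-difference gradient in place of an analytic one.

```python
import random

GAMMA = 0.9
HORIZON = 30          # fixed truncation horizon, chosen for the example

def g(s, a, p):
    """Hypothetical smooth dynamics: move by a, plus noise derived from p."""
    return s + a + 0.2 * (p - 0.5)

def reward(s):
    """Hypothetical reward: penalize squared distance from the origin."""
    return -s * s

def draw_scenarios(m, rng):
    """Each scenario is an initial state in [-1, 1] plus HORIZON stored dice."""
    return [(rng.uniform(-1.0, 1.0), [rng.random() for _ in range(HORIZON)])
            for _ in range(m)]

def v_hat(theta, scenarios):
    """Deterministic estimate of the value of the policy a = -theta * s."""
    total = 0.0
    for s0, dice in scenarios:
        s, discount = s0, 1.0
        for p in dice:
            total += discount * reward(s)
            s = g(s, -theta * s, p)
            discount *= GAMMA
    return total / len(scenarios)

def gradient_ascent(theta, scenarios, lr=0.1, steps=200, h=1e-4):
    """Ascend v_hat using a central finite-difference gradient."""
    for _ in range(steps):
        grad = (v_hat(theta + h, scenarios) - v_hat(theta - h, scenarios)) / (2 * h)
        theta += lr * grad
    return theta

scenarios = draw_scenarios(50, random.Random(0))
theta_star = gradient_ascent(0.2, scenarios)
print(theta_star, v_hat(theta_star, scenarios))
```

Because the scenarios are held fixed, `v_hat` is an ordinary deterministic function of `theta`, so the finite-difference gradient is not corrupted by sampling noise. In this toy problem the search settles near $\theta = 1$, the gain that cancels the current state in one step.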

Results
On a 5x5 gridworld POMDP, PEGASUS discovers a near-optimal policy with very few scenarios (~5).
On a continuous state/action bicycle-riding problem, results are near optimal and far better than earlier reward-shaping methods.

Helicopter Hovering
The policy is represented by a hand-crafted neural network. PEGASUS is used to search through the set of possible network weights.
– Tried both gradient ascent and random-walk searches

Neural Network Structure
$(x, y, z)$ = (forward, sideways, down)
– $a_1$ = front/back cyclic pitch control
– $a_2$ = left/right cyclic pitch control
– $a_3$ = main rotor pitch control
– $a_4$ = tail rotor pitch control

Results
Able to keep the helicopter stable on its maiden flight (video: Hovering).
Neural network modified to fly competition-class maneuvers (video: Triangle).
Finally, hovering upside down was accomplished.
Videos: http://ai.stanford.edu/~ang/rl-videos/helicopter/

Pseudo-Dimension
Let $H$ be a set of functions $X \to \mathbb{R}$.
$H$ shatters $x_1, x_2, \ldots, x_d \in X$ if there exists a sequence of real numbers $t_1, t_2, \ldots, t_d$ such that $\{(h(x_1) - t_1, h(x_2) - t_2, \ldots, h(x_d) - t_d) \mid h \in H\}$ intersects all $2^d$ orthants of $\mathbb{R}^d$.
The pseudo-dimension of $H$, $\dim_P(H)$, is the size of the largest set shattered by $H$.

Lipschitz Continuity
A function $f$ is Lipschitz continuous with Lipschitz bound $B$ if $\|f(x) - f(y)\| \le B \|x - y\|$ for all $x, y$ (with respect to the Euclidean norm on range and domain).

Realizable Dynamics in an MDP
Let $S = [0,1]^{d_S}$ and $g : S \times A \times [0,1]^{d_p} \to S$ be given.
We can define $F_i$ as the set of functions $\{F_i^a : S \times [0,1]^{d_p} \to [0,1], \ F_i^a(s, p_1, \ldots, p_{d_p}) = I_i(g(s, a, p_1, \ldots, p_{d_p})) \mid a \in A\}$, where $I_i(x)$ returns the $i$th coordinate of $x$.

PEGASUS Theoretical Result
Let $S = [0,1]^{d_S}$, a policy class $\Pi$, and a model $g : S \times A \times [0,1]^{d_p} \to S$ be given. $F$ is the family of realizable dynamics in the MDP and $F_i$ the resulting family of coordinate functions.
For all $i$, let $\dim_P(F_i) \le d$, and let $F_i$ be uniformly Lipschitz continuous with bound $B$. The reward function $R$ is Lipschitz continuous with bound $B_R$.
Then if the number of scenarios $m$ is large enough [bound not recovered from the slide], with probability at least $1 - \delta$ the PEGASUS estimate $\hat{V}(\pi)$ will be uniformly close to the actual value: $|\hat{V}(\pi) - V(\pi)| \le \varepsilon$ for all $\pi \in \Pi$.

Proof (1)
Think of the reward at step $i$ as a random variable:
$V^{\pi}(s_0^{(1)}) = R(s_0^{(1)}) + \gamma R(s_1^{(1)}) + \gamma^2 R(s_2^{(1)}) + \cdots$
$V^{\pi}(s_0^{(2)}) = R(s_0^{(2)}) + \gamma R(s_1^{(2)}) + \gamma^2 R(s_2^{(2)}) + \cdots$
$V^{\pi}(s_0^{(3)}) = R(s_0^{(3)}) + \gamma R(s_1^{(3)}) + \gamma^2 R(s_2^{(3)}) + \cdots$
By bounding properties of each $R(s_i^{(j)})$, we can prove uniform convergence for $V(\pi)$.

Proof (2)
Calling on work by Haussler, we show that if the pseudo-dimension of each $F_i$ satisfies $\dim_P(F_i) \le d$, we can “nearly” represent our world dynamics functions $F_i^a$ by a smaller set of functions of size [expression not recovered from the slide].

Proof (3)
Similarly, if $F_i$ uniformly has Lipschitz bound $B$ and the reward function $R$ has Lipschitz bound $B_R$, we can “nearly” represent a function mapping from scenarios to $i$th-step rewards by a set of size [expression not recovered from the slide].

Proof (4)
A result by Haussler then shows that, with probability $1 - \delta$, our $i$th-step reward will be $\varepsilon$-close to the mean if we select a number of scenarios bounded by [expression not recovered from the slide].

Proof (5)
Strengthening the bound to account for all $H_\varepsilon$ rewards and employing the union bound, we find that a number of scenarios bounded by [expression not recovered from the slide] is sufficient.

Critique
Success is limited to a very small, fairly linear control problem with a high-frequency controller.
Lots of human bias is incorporated into the system:
– Restrictions/linear regression for model identification
– Structure of the neural net for each of the tasks
PAC learning guarantees are still out of reach.
No theoretical bounds on the final policy.

Bibliography
1. Haussler, D. Chapter on PAC learning model, and decision-theoretic generalizations, with applications to neural nets. From Mathematical Perspectives on Neural Networks, Lawrence Erlbaum Associates, 1995; Information and Computation, Vol. 100, September 1992, pp. 78-150.
2. Ng, A. Y., Jordan, M. I. PEGASUS: A policy search method for large MDP’s and POMDP’s. In Uncertainty in Artificial Intelligence, Sixteenth Conference, 2000.
3. Ng, A. Y., Kim, H. J., Jordan, M. I., and Sastry, S. Autonomous helicopter flight via reinforcement learning. Advances in Neural Information Processing Systems 16, 2004.
4. Ng, A. Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., Berger, E., and Liang, E. Inverted autonomous helicopter flight via reinforcement learning. In International Symposium on Experimental Robotics, 2004.

Application – Helicopter Flight
PEGASUS has been used to derive policies for hovering in place, later generalized to handle slow-motion maneuvers and upside-down hovering.
A GPS system relays state information (position and velocity) to an off-board computer, which calculates a 4-dimensional action.

Model Identification
Construction of an MDP representation of the world dynamics.
Transition dynamics learned from several minutes of data based on human flight:
– Fit using linear regression
– Forced to respect innate properties of the domain (gravity, symmetry)
