1
Reinforcement Learning with Partially Known World Dynamics
Christian R. Shelton Stanford University
2
Reinforcement Learning
[Diagram: the agent-environment loop, with environment state and dynamics, agent actions, rewards, and goal]
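A minimal sketch of the interaction loop this slide depicts, assuming generic env and agent objects with reset/step/act methods (these interfaces are illustrative, not part of the presentation):

```python
# Minimal sketch of the agent-environment loop: the environment's dynamics
# advance its state, the agent observes and acts, and rewards guide it toward
# the goal. The env/agent interfaces here are assumptions for illustration.

def run_episode(env, agent, horizon=100):
    """Collect one trajectory of (observation, action, reward) tuples."""
    trajectory = []
    obs = env.reset()                          # initial observation of the state
    for _ in range(horizon):
        action = agent.act(obs)                # agent's policy picks an action
        obs, reward, done = env.step(action)   # dynamics produce next state and reward
        trajectory.append((obs, action, reward))
        if done:                               # episode ends when the goal is reached
            break
    return trajectory
```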
3
Motivation
Reinforcement learning promises great things:
  Automatic task optimization
  Without any prior information about the world
Reinforcement learning is hard:
  Optimization goal could be arbitrary
  Every new situation might be different
Modify the problem slightly:
  Keep the basic, general, flexible framework
  Allow the specification of domain knowledge
  Don't require full specification of the problem (planning)
4
Our Approach: Partial World Modeling
  Keep the partial observability
  Allow conditional dynamics
Example:
  Known dynamics: sensor models, motion models, etc.
  Unknown dynamics: enemy movements, maps, etc.
  Flexible barrier between the two (sketched below)
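As a rough illustration of the flexible barrier, assume a toy domain in which sensor and motion models are written down by the designer while anything enemy- or map-related is left unmodeled and only ever seen in sampled trajectories. All names here are hypothetical:

```python
# Hypothetical split between known and unknown dynamics in a toy robot domain.
import random

def known_motion_model(position, action):
    """Known dynamics: designer-specified motion model."""
    return position + (1 if action == "forward" else -1)

def known_sensor_model(position):
    """Known dynamics: designer-specified noisy sensor model."""
    return position + random.choice([-1, 0, 1])

# Unknown dynamics (enemy movements, the map) are given no model here; the
# learning algorithm must cope with them using sampled experience alone.
```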
5
Partially Known Markov Decision Process (PKMDP)
[Diagram: unknown-dynamics state chain s0, s1, s2 alongside a known-dynamics state chain x0, x1, x2]
6
Partially Known Markov Decision Process (PKMDP)
[Diagram: interface layer z0, y1, z1, y2, z2 connecting the unknown dynamics to the known-dynamics chain x0, x1, x2]
7
Partially Known Markov Decision Process (PKMDP)
[Diagram: observations o0, o1, o2 added to the model]
8
Partially Known Markov Decision Process (PKMDP)
[Diagram: actions a0, a1, a2 added to the model]
9
Partially Known Markov Decision Process (PKMDP)
[Complete DBN: y0, z0, y1, z1, y2, z2 (unknown dynamics); x0, x1, x2 (known, unobserved); observations o0, o1, o2 and actions a0, a1, a2 (observed)]
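One plausible way to represent the factorization in code, purely as a sketch with assumed names (the presentation does not prescribe a data structure): the known dynamics and observation model are supplied as functions usable for exact inference, while the unknown-dynamics variables y and interface variables z appear only in recorded data.

```python
# Sketch of a PKMDP container. Variable naming follows the diagram
# (x: known-dynamics state, y: unknown-dynamics state, z: interface,
# o: observation, a: action); the structure itself is an assumption.

from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class PKMDP:
    # Known, designer-specified parts of the model (usable for exact DBN
    # inference), e.g. conditionals such as P(x' | x, z, a) and P(o | x).
    known_transition: Callable
    observation_model: Callable
    # The dynamics of y and z are unknown and carry no model here.

@dataclass
class Trajectory:
    observations: Sequence  # o_0, o_1, ...
    actions: Sequence       # a_0, a_1, ...
    ys: Sequence            # recorded values of the unknown-dynamics variables y
    zs: Sequence            # recorded values of the interface variables z
```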
10
Algorithm Outline
Input:
  Set of trajectories (o, a, y, z)
  Set of policies
Output:
  Policy that maximizes expected return
Method:
  Construct non-parametric model of return
  Maximize with respect to policy (sketched below)
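A schematic of this outline, simplified by assuming the policy set is enumerated explicitly (the next slide actually maximizes over a parameterized policy class with conjugate gradient); estimated_return stands in for the return estimator described on the following slides:

```python
# Estimate-then-maximize skeleton: score each candidate policy with an
# off-policy estimate of its expected return built from stored trajectories,
# and keep the best one.

def choose_policy(trajectories, policies, estimated_return):
    """Return the candidate policy with the highest estimated expected return."""
    best_policy, best_value = None, float("-inf")
    for policy in policies:
        value = estimated_return(policy, trajectories)
        if value > best_value:
            best_policy, best_value = policy, value
    return best_policy
```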
11
Algorithm Details
Unknown dynamics:
  Use experience
  Importance sampling (sketched below)
Known dynamics:
  Use model
  DBN inference
  Exact calculation: lower variance
Maximize using conjugate gradient:
  A policy-search method, but not policy gradient
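A minimal sketch of the importance-sampling side, assuming a normalized likelihood-ratio estimator over the recorded action choices and a prob(action, observation) interface on policies; the exact DBN terms for the known dynamics are omitted here:

```python
# Importance sampling for the unknown dynamics: trajectories collected under a
# behavior policy are re-weighted by the ratio of action probabilities under
# the policy being evaluated. This is the generic normalized estimator; the
# full estimator also folds in exact computations over the known dynamics.

def importance_weight(trajectory, eval_policy, behavior_policy):
    """Likelihood ratio of the recorded action choices under the two policies."""
    weight = 1.0
    for obs, action in zip(trajectory.observations, trajectory.actions):
        weight *= eval_policy.prob(action, obs) / behavior_policy.prob(action, obs)
    return weight

def estimate_return(trajectories, returns, eval_policy, behavior_policy):
    """Normalized importance-sampling estimate of the expected return."""
    weights = [importance_weight(t, eval_policy, behavior_policy) for t in trajectories]
    total = sum(weights)
    return sum(w * r for w, r in zip(weights, returns)) / total if total > 0 else 0.0
```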
12
Total Estimate
For each sample, K and V involve reasoning in the DBN:
[DBN over z0, x0, y0, o1, a1, z1, x1, y1, o2, a2, z2, x2, y2]
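Only the per-sample structure is sketched here: dbn_query is a hypothetical inference routine standing in for the exact DBN computations, and the actual definitions of K and V, and how they combine into the total estimate, follow the paper rather than being reconstructed below.

```python
# Per-sample terms of the total estimate: for each stored trajectory, the
# quantities the slide calls K and V are obtained by inference in the DBN over
# the known-dynamics variables x, conditioned on the recorded o, a, y, z.
# dbn_query is a hypothetical routine; K and V are not defined here.

def per_sample_terms(trajectories, policy, dbn_query):
    """Collect the per-sample (K, V) pairs; their combination into the final
    estimate is not reproduced in this sketch."""
    return [(dbn_query(traj, policy, quantity="K"),
             dbn_query(traj, policy, quantity="V"))
            for traj in trajectories]
```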
13
Load-Unload Example
26 states, 14 observations, 4 actions
Three versions:
1. No world knowledge
2. Memory dynamics known
3. End-point & memory dynamics known
14
Clogged Pipe Example
144 states, 12 observations, 8 actions
Three versions:
1. Memory only
2. Known cart control
3. Incoming flow unknown
15
Conclusion
Advantages:
  Uses samples for estimation of unknown dynamics
  Uses exact dynamics when known
  Allows natural specification of domain knowledge
Current work:
  Improving the gradient-ascent planner
  Using structure within known dynamics
  Removing the requirement that the interface be observable