Reinforcement Learning with Partially Known World Dynamics
Christian R. Shelton
Stanford University
Reinforcement Learning
[Figure: the reinforcement-learning setting, with labels Environment (state, dynamics), Agent (actions, goal), and rewards]
Motivation
- Reinforcement learning promises great things
  - Automatic task optimization
  - Without any prior information about the world
- Reinforcement learning is hard
  - The optimization goal could be arbitrary
  - Every new situation might be different
- Modify the problem slightly
  - Keep the basic, general, flexible framework
  - Allow the specification of domain knowledge
  - Don't require full specification of the problem (planning)
Our Approach
- Partial world modeling
  - Keep the partial observability
  - Allow conditional dynamics
- Known dynamics: sensor models, motion models, etc.
- Unknown dynamics: enemy movements, maps, etc.
- Flexible barrier between the two
Partially Known Markov Decision Process (PKMDP)
[DBN figure, built up over several slides:
 - unknown-dynamics state chain y0, y1, y2, ... (hidden; drawn first as s0, s1, s2, ...)
 - interface variables z0, z1, z2, ... connecting the unknown and known parts
 - known-dynamics state chain x0, x1, x2, ... (known model, unobserved)
 - observations o0, o1, o2, ... and actions a0, a1, a2, ... (observed)]
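One way to write the split that the figure suggests (the slides do not give the exact parent sets of each variable, so the conditioning below is an assumption): the trajectory distribution factors into an unknown-dynamics part over the y's and z's and a known-dynamics part over the x's and o's, which interact only through the interface z,

  p(y_{0:T}, z_{0:T}, x_{0:T}, o_{0:T} \mid a_{0:T})
    = \underbrace{p(y_{0:T}, z_{0:T} \mid a_{0:T})}_{\text{unknown, estimated from samples}}
      \cdot \underbrace{p(x_{0:T}, o_{0:T} \mid z_{0:T}, a_{0:T})}_{\text{known, handled exactly}}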
Algorithm Outline
Input:
- Set of trajectories (o, a, y, z)
- Set of policies
Output:
- Policy that maximizes expected return
Method:
- Construct a non-parametric model of the return
- Maximize with respect to the policy
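For reference, the generic building block behind "construct a non-parametric model of the return" is the standard normalized importance-sampling estimate of a policy's return from trajectories gathered under other policies; this is a textbook form, not necessarily the talk's exact estimator (the later "Total Estimate" slide replaces the raw weight and return with DBN-based quantities K and V):

  \hat{V}(\pi) = \frac{\sum_{i=1}^{N} w_i(\pi)\, R_i}{\sum_{i=1}^{N} w_i(\pi)},
  \qquad
  w_i(\pi) = \prod_{t} \frac{\pi(a_t^i \mid h_t^i)}{\pi_i(a_t^i \mid h_t^i)},

where R_i is the observed return of trajectory i, h_t^i is its observable history at time t, and \pi_i is the policy that generated it (the environment terms cancel in the ratio, which is why only policy probabilities appear).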
Algorithm Details
- Unknown dynamics
  - Use experience
  - Importance sampling
- Known dynamics
  - Use the model
  - DBN inference
  - Exact calculation: lower variance
- Maximize using conjugate gradient
  - A policy search method, but not policy gradient
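A minimal sketch of the sampled half of this recipe, assuming a tabular softmax policy over discrete observations and trajectories stored as (observation, action, reward) triples; all names here are hypothetical, and the known-dynamics DBN terms are omitted. It estimates the return of a candidate policy by normalized importance sampling and hands the negated estimate to SciPy's conjugate-gradient optimizer, matching the "policy search, not policy gradient" point above.

import numpy as np
from scipy.optimize import minimize

def action_logprob(theta, obs, act, n_obs, n_act):
    """Log pi_theta(act | obs) for a tabular softmax policy."""
    logits = theta.reshape(n_obs, n_act)[obs]
    return logits[act] - np.logaddexp.reduce(logits)

def neg_estimated_return(theta, trajectories, behavior_thetas, n_obs, n_act):
    """Negative normalized importance-sampling estimate of expected return.

    trajectories[i] is a list of (obs, act, reward) triples generated by the
    policy with parameters behavior_thetas[i].
    """
    weights, returns = [], []
    for traj, beta in zip(trajectories, behavior_thetas):
        log_w, ret = 0.0, 0.0
        for obs, act, rew in traj:
            # Importance weight: product of policy-probability ratios.
            log_w += (action_logprob(theta, obs, act, n_obs, n_act)
                      - action_logprob(beta, obs, act, n_obs, n_act))
            ret += rew
        weights.append(np.exp(log_w))
        returns.append(ret)
    weights = np.asarray(weights)
    return -np.dot(weights, returns) / weights.sum()

def improve_policy(theta0, trajectories, behavior_thetas, n_obs, n_act):
    """Policy search: conjugate-gradient ascent on the estimated return."""
    res = minimize(neg_estimated_return, theta0, method="CG",
                   args=(trajectories, behavior_thetas, n_obs, n_act))
    return res.x

In a full implementation the per-trajectory weight and value would come from inference in the known-dynamics DBN (the "exact calculation: lower variance" point) rather than from raw returns as above.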
Total Estimate
For each sample, K and V involve reasoning in the DBN:
[DBN over the variables z_t, x_t, y_t, o_t, a_t for t = 0, 1, 2]
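The slides do not spell out how K and V are combined; one natural reading (an assumption, not stated on the slide) is a normalized-weight form in which K_i plays the role of the importance weight and V_i the role of the per-sample value, both obtained by inference in the DBN above rather than read directly off the data:

  \hat{V}(\pi) = \frac{\sum_i K_i(\pi)\, V_i(\pi)}{\sum_i K_i(\pi)}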
Load-Unload Example
- 26 states, 14 observations, 4 actions
- Three versions:
  1. No world knowledge
  2. Memory dynamics known
  3. End-point & memory dynamics known
Clogged Pipe Example
- 144 states, 12 observations, 8 actions
- Three versions:
  1. Memory only
  2. Known cart control
  3. Incoming flow unknown
Conclusion
Advantages:
- Uses samples to estimate the unknown dynamics
- Uses the exact dynamics when they are known
- Allows natural specification of domain knowledge
Current work:
- Improving the gradient-ascent planner
- Exploiting structure within the known dynamics
- Removing the requirement that the interface be observable