1
Reinforcement Learning with Partially Known World Dynamics
Christian R. Shelton Stanford University
2
Reinforcement Learning
[Diagram: the agent-environment loop, with environment state and dynamics, agent actions, rewards, and goal]
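A minimal sketch of the interaction loop this slide depicts, assuming generic env and agent objects with reset/step/act methods (these interfaces are illustrative, not part of the presentation):

```python
# Minimal sketch of the agent-environment loop: the environment's dynamics
# advance its state, the agent observes and acts, and rewards guide it toward
# the goal. The env/agent interfaces here are assumptions for illustration.

def run_episode(env, agent, horizon=100):
    """Collect one trajectory of (observation, action, reward) tuples."""
    trajectory = []
    obs = env.reset()                          # initial observation of the state
    for _ in range(horizon):
        action = agent.act(obs)                # agent's policy picks an action
        obs, reward, done = env.step(action)   # dynamics produce next state and reward
        trajectory.append((obs, action, reward))
        if done:                               # episode ends when the goal is reached
            break
    return trajectory
```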
3
Motivation
Reinforcement learning promises great things:
  Automatic task optimization
  Without any prior information about the world
Reinforcement learning is hard:
  Optimization goal could be arbitrary
  Every new situation might be different
Modify the problem slightly:
  Keep the basic, general, flexible framework
  Allow the specification of domain knowledge
  Don't require full specification of the problem (planning)
4
Our Approach: Partial World Modeling
  Keep the partial observability
  Allow conditional dynamics
Example:
  Known dynamics: sensor models, motion models, etc.
  Unknown dynamics: enemy movements, maps, etc.
  Flexible barrier between the two (sketched below)
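As a rough illustration of the flexible barrier, assume a toy domain in which sensor and motion models are written down by the designer while anything enemy- or map-related is left unmodeled and only ever seen in sampled trajectories. All names here are hypothetical:

```python
# Hypothetical split between known and unknown dynamics in a toy robot domain.
import random

def known_motion_model(position, action):
    """Known dynamics: designer-specified motion model."""
    return position + (1 if action == "forward" else -1)

def known_sensor_model(position):
    """Known dynamics: designer-specified noisy sensor model."""
    return position + random.choice([-1, 0, 1])

# Unknown dynamics (enemy movements, the map) are given no model here; the
# learning algorithm must cope with them using sampled experience alone.
```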
5
Partially Known Markov Decision Process (PKMDP)
[Diagram: unknown-dynamics state chain s0, s1, s2 alongside a known-dynamics state chain x0, x1, x2]
6
Partially Known Markov Decision Process (PKMDP)
[Diagram: interface layer z0, y1, z1, y2, z2 connecting the unknown dynamics to the known-dynamics chain x0, x1, x2]
7
Partially Known Markov Decision Process (PKMDP)
[Diagram: observations o0, o1, o2 added to the model]
8
Partially Known Markov Decision Process (PKMDP)
[Diagram: actions a0, a1, a2 added to the model]
9
Partially Known Markov Decision Process (PKMDP)
[Complete DBN: y0, z0, y1, z1, y2, z2 (unknown dynamics); x0, x1, x2 (known, unobserved); observations o0, o1, o2 and actions a0, a1, a2 (observed)]
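One plausible way to represent the factorization in code, purely as a sketch with assumed names (the presentation does not prescribe a data structure): the known dynamics and observation model are supplied as functions usable for exact inference, while the unknown-dynamics variables y and interface variables z appear only in recorded data.

```python
# Sketch of a PKMDP container. Variable naming follows the diagram
# (x: known-dynamics state, y: unknown-dynamics state, z: interface,
# o: observation, a: action); the structure itself is an assumption.

from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class PKMDP:
    # Known, designer-specified parts of the model (usable for exact DBN
    # inference), e.g. conditionals such as P(x' | x, z, a) and P(o | x).
    known_transition: Callable
    observation_model: Callable
    # The dynamics of y and z are unknown and carry no model here.

@dataclass
class Trajectory:
    observations: Sequence  # o_0, o_1, ...
    actions: Sequence       # a_0, a_1, ...
    ys: Sequence            # recorded values of the unknown-dynamics variables y
    zs: Sequence            # recorded values of the interface variables z
```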
10
Algorithm Outline
Input:
  Set of trajectories (o, a, y, z)
  Set of policies
Output:
  Policy that maximizes expected return
Method:
  Construct non-parametric model of return
  Maximize with respect to policy (sketched below)
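A schematic of this outline, simplified by assuming the policy set is enumerated explicitly (the next slide actually maximizes over a parameterized policy class with conjugate gradient); estimated_return stands in for the return estimator described on the following slides:

```python
# Estimate-then-maximize skeleton: score each candidate policy with an
# off-policy estimate of its expected return built from stored trajectories,
# and keep the best one.

def choose_policy(trajectories, policies, estimated_return):
    """Return the candidate policy with the highest estimated expected return."""
    best_policy, best_value = None, float("-inf")
    for policy in policies:
        value = estimated_return(policy, trajectories)
        if value > best_value:
            best_policy, best_value = policy, value
    return best_policy
```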
11
Algorithm Details
Unknown dynamics:
  Use experience
  Importance sampling (sketched below)
Known dynamics:
  Use model
  DBN inference
  Exact calculation: lower variance
Maximize using conjugate gradient:
  A policy-search method, but not policy gradient
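A minimal sketch of the importance-sampling side, assuming a normalized likelihood-ratio estimator over the recorded action choices and a prob(action, observation) interface on policies; the exact DBN terms for the known dynamics are omitted here:

```python
# Importance sampling for the unknown dynamics: trajectories collected under a
# behavior policy are re-weighted by the ratio of action probabilities under
# the policy being evaluated. This is the generic normalized estimator; the
# full estimator also folds in exact computations over the known dynamics.

def importance_weight(trajectory, eval_policy, behavior_policy):
    """Likelihood ratio of the recorded action choices under the two policies."""
    weight = 1.0
    for obs, action in zip(trajectory.observations, trajectory.actions):
        weight *= eval_policy.prob(action, obs) / behavior_policy.prob(action, obs)
    return weight

def estimate_return(trajectories, returns, eval_policy, behavior_policy):
    """Normalized importance-sampling estimate of the expected return."""
    weights = [importance_weight(t, eval_policy, behavior_policy) for t in trajectories]
    total = sum(weights)
    return sum(w * r for w, r in zip(weights, returns)) / total if total > 0 else 0.0
```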
12
Total Estimate
For each sample, K and V involve reasoning in the DBN:
[DBN over z0, x0, y0, o1, a1, z1, x1, y1, o2, a2, z2, x2, y2]
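Only the per-sample structure is sketched here: dbn_query is a hypothetical inference routine standing in for the exact DBN computations, and the actual definitions of K and V, and how they combine into the total estimate, follow the paper rather than being reconstructed below.

```python
# Per-sample terms of the total estimate: for each stored trajectory, the
# quantities the slide calls K and V are obtained by inference in the DBN over
# the known-dynamics variables x, conditioned on the recorded o, a, y, z.
# dbn_query is a hypothetical routine; K and V are not defined here.

def per_sample_terms(trajectories, policy, dbn_query):
    """Collect the per-sample (K, V) pairs; their combination into the final
    estimate is not reproduced in this sketch."""
    return [(dbn_query(traj, policy, quantity="K"),
             dbn_query(traj, policy, quantity="V"))
            for traj in trajectories]
```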
13
Load-Unload Example
26 states, 14 observations, 4 actions
Three versions:
1. No world knowledge
2. Memory dynamics known
3. End-point & memory dynamics known
14
Clogged Pipe Example
144 states, 12 observations, 8 actions
Three versions:
1. Memory only
2. Known cart control
3. Incoming flow unknown
15
Conclusion
Advantages:
  Uses samples for estimation of unknown dynamics
  Uses exact dynamics when known
  Allows natural specification of domain knowledge
Current work:
  Improving the gradient-ascent planner
  Using structure within known dynamics
  Removing the requirement that the interface be observable