Hierarchical Reinforcement Learning Using Graphical Models
Victoria Manfredi and Sridhar Mahadevan
University of Massachusetts Amherst, Department of Computer Science
Rich Representations for Reinforcement Learning, ICML'05 Workshop
August 7, 2005
Introduction
Abstraction is necessary to scale reinforcement learning, motivating hierarchical RL. We want to learn the abstractions automatically.
Other approaches:
- Find subgoals: McGovern & Barto '01; Simsek & Barto '04; Simsek, Wolfe, & Barto '05; Mannor et al. '04
- Build a policy hierarchy: Hengst '02
- Potentially proto-value functions: Mahadevan '05
Our approach: learn an initial policy hierarchy using a graphical model framework, then learn how to use those policies with reinforcement learning and reward.
Related to imitation: Price & Boutilier '03; Abbeel & Ng '04
Outline
- Dynamic Abstraction Networks
- Approach
- Experiments
- Results
- Summary
- Future Work
Dynamic Abstraction Network (DAN)
[Figure: a two-time-slice dynamic Bayesian network with a policy hierarchy (nodes P1, P0), a state hierarchy (nodes S1, S0), termination nodes (F1, F0), and an observation node at each slice (t = 1, t = 2), annotated with the example labels "Attend ICML'05", "Register", and "Conference Center Bonn".]
This is just one realization of a DAN; others are possible.
Related models: HHMM (Fine, Singer, & Tishby '98); AHMM (Bui, Venkatesh, & West '02); DAN (Manfredi & Mahadevan '05).
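As an illustration only, here is a minimal Python sketch of how one such realization's two-slice structure could be written down as an edge list. The node names follow the figure, but the exact edge set is an assumption made for this sketch, not the authors' published DAN specification.

```python
# Hypothetical sketch of one possible two-level DAN realization as a
# dynamic Bayesian network edge list.  Node names (P1, P0, S1, S0, F1,
# F0, Obs) follow the figure; the precise edges are an assumption.
intra_slice_edges = [
    ("P1", "P0"),                  # abstract policy selects lower-level policy
    ("S1", "S0"),                  # abstract state constrains lower-level state
    ("P0", "Obs"), ("S0", "Obs"),  # observation depends on policy and state
    ("P0", "F0"), ("S0", "F0"),    # lower-level termination
    ("P1", "F1"), ("S1", "F1"), ("F0", "F1"),  # abstract-level termination
]
inter_slice_edges = [              # edges from slice t to slice t+1
    ("P1", "P1"), ("F1", "P1"),    # abstract policy persists unless F1 fires
    ("P0", "P0"), ("F0", "P0"),    # lower-level policy persists unless F0 fires
    ("S1", "S1"), ("S0", "S0"),    # state dynamics across slices
]
```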
Approach
Phase 1 (extract abstractions):
- An expert hand-codes skills
- Observe trajectories generated with those skills
- Learn a DAN from the trajectories using EM
Phase 2 (policy improvement):
- Learn how to use the extracted policies with reinforcement learning, e.g., SMDP Q-learning
Design choices: Are the variables discrete or continuous? How many state values? How many levels?
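A hedged skeleton of the two-phase procedure, with the three steps passed in as callables; these names are placeholders for the steps named on the slide, not the authors' code.

```python
def two_phase_hrl(observe_trajectories, learn_dan_with_em, improve_policy):
    """Hypothetical skeleton of the two-phase method.

    observe_trajectories : returns mentor demonstrations as (flat state, action) pairs
    learn_dan_with_em    : fits DAN parameters to the trajectories via EM
    improve_policy       : e.g. SMDP Q-learning over the learned DAN policies
    """
    trajectories = observe_trajectories()   # Phase 1: watch the expert's skills
    dan = learn_dan_with_em(trajectories)   # Phase 1: extract abstractions
    policy = improve_policy(dan)            # Phase 2: policy improvement
    return dan, policy
```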
DANs vs MAXQ/HAMs
What the designer must specify in each framework:
DANs:
- # of levels in the state/policy hierarchies
- # of values for each (abstract) state/policy node
- Training sequences of (flat state, action) pairs
MAXQ [Dietterich '00]:
- # of levels and # of tasks at each level
- Connections between levels
- Initiation set for each task
- Termination set for each task
HAMs [Parr & Russell '98]:
- # of levels
- A hierarchy of stochastic finite state machines with explicit action, call, choice, and stop states
DANs infer the rest from the training sequences.
Why Graphical Models?
Advantages:
- Joint learning of multiple policy/state abstractions
- Handles continuous and hidden domains
- The full machinery of inference can be used
Disadvantages:
- Parameter learning with hidden variables is expensive
- Expectation-Maximization can get stuck in local maxima
Domain: Dietterich's Taxi (2000)
States:
- Taxi Location (TL): 25 values
- Passenger Location (PL): 5 values
- Passenger Destination (PD): 5 values
Actions: North, South, East, West, Pickup, Putdown
Hand-coded policies: GotoRed, GotoGreen, GotoYellow, GotoBlue, Pickup, Putdown
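For concreteness, a small sketch of one possible flat-state encoding for this domain. The indexing scheme and names below are assumptions for illustration, not taken from the slides.

```python
# Hypothetical flat-state indexing for Dietterich's Taxi domain:
# 25 taxi locations x 5 passenger locations x 5 destinations = 500 flat states.
ACTIONS = ["North", "South", "East", "West", "Pickup", "Putdown"]
HAND_CODED_POLICIES = ["GotoRed", "GotoGreen", "GotoYellow", "GotoBlue",
                       "Pickup", "Putdown"]

def flat_state(tl: int, pl: int, pd: int) -> int:
    """Map (taxi location 0-24, passenger location 0-4, destination 0-4)
    to a single index in [0, 500)."""
    assert 0 <= tl < 25 and 0 <= pl < 5 and 0 <= pd < 5
    return (tl * 5 + pl) * 5 + pd
```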
Experiments
[Figure: the Taxi DAN, a two-slice DBN with policy nodes Π1, Π0, an action node, state nodes S1, S0, termination nodes F1, F0, and observed TL, PL, PD nodes.]
Phase 1:
- |S1| = 5, |S0| = 25, |Π1| = 6, |Π0| =
- Training sequences from an SMDP Q-learner: {TL, PL, PD, A}_1, ..., {TL, PL, PD, A}_n
- Learned with the Bayes Net Toolbox (Murphy '01)
Phase 2 (SMDP Q-learning):
- Choose abstract policy π1 ε-greedily
- Compute the most likely abstract state s0 given TL, PL, PD
- Select the action using Pr(Π0 | Π1 = π1, S0 = s0)
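A minimal sketch of what one Phase-2 decision step could look like, assuming a hypothetical dan object exposing the learned conditional probability tables; the attribute names p_s0_given_obs and p_pi0 are invented for this sketch and are not the authors' interface.

```python
import numpy as np

def phase2_step(Q, flat_state, dan, rng, epsilon=0.1):
    """Hypothetical Phase-2 decision: epsilon-greedy over abstract DAN
    policies, then condition on the DAN to pick a lower-level choice.

    Q          : array (n_flat_states, n_abstract_policies) of SMDP Q-values
    flat_state : index encoding the observed (TL, PL, PD)
    dan.p_s0_given_obs : array (n_flat_states, |S0|), posterior over abstract states
    dan.p_pi0          : array (|Pi1|, |S0|, |Pi0|), Pr(Pi0 | Pi1, S0)
    """
    # Choose abstract policy pi1 epsilon-greedily from the SMDP Q-values.
    if rng.random() < epsilon:
        pi1 = int(rng.integers(Q.shape[1]))
    else:
        pi1 = int(np.argmax(Q[flat_state]))
    # Most likely abstract state s0 given the observed TL, PL, PD.
    s0 = int(np.argmax(dan.p_s0_given_obs[flat_state]))
    # Lower-level choice from Pr(Pi0 | Pi1 = pi1, S0 = s0).
    pi0 = int(np.argmax(dan.p_pi0[pi1, s0]))
    return pi1, s0, pi0
```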
Policy Improvement
The policy learned over the DAN policies performs well.
Each plot is an average over 10 RL runs and 1 EM run.
Policy Recognition
[Figure: example trajectory; legend entries include Initial, Passenger Loc, Passenger Dest, Policy 1, Policy 6, PU, PD.]
The DAN can (sometimes!) recognize a specific sequence of actions as composing a single policy.
Summary
A two-phase method for automating hierarchical RL using graphical models.
Advantages:
- Limited information needed (# of levels, # of values)
- Permits continuous and partially observable states/actions
Disadvantages:
- EM is expensive
- Requires a mentor
- The learned abstractions can be hard to decipher (local maxima?)
Future Work
- Approximate inference in DANs
  - Saria & Mahadevan '04: Rao-Blackwellized particle filtering for multi-agent AHMMs
  - Johns & Mahadevan '05: variational inference for AHMMs
- Take advantage of the ability to do inference in the hierarchical RL phase
- Incorporate reward in the DAN
Thank You
Questions?
Abstract State Transitions: S0
- Regardless of which abstract P0 policy is being executed, abstract S0 states self-transition with high probability.
- Depending on the abstract P0 policy, they may instead transition to one of a few other abstract S0 states.
- The same holds for abstract S1 states under abstract P1 policies.
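One way to check this observation on a learned model, as a sketch: given a learned CPT Pr(S0' | Π0, S0) stored as a NumPy array, the self-transition probabilities are the diagonals of each policy's transition matrix. The array name and shape below are assumptions, not the authors' data format.

```python
import numpy as np

def self_transition_mass(p_s0_next):
    """For a hypothetical learned CPT of shape (|Pi0|, |S0|, |S0|) giving
    Pr(S0' | Pi0, S0), return each abstract state's self-transition
    probability under each abstract policy."""
    return np.diagonal(p_s0_next, axis1=1, axis2=2)  # shape (|Pi0|, |S0|)
```

High values along these diagonals would correspond to the self-transition behaviour described above.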
State Abstractions
The abstract state to which the agent is most likely to transition is a consequence, in part, of the learned state abstractions.
Semi-MDP Q-learning
Q(s, o) ← Q(s, o) + α [ r + γ max_{o' ∈ O} Q(s', o') − Q(s, o) ]
- Q(s, o): activity value for state s and activity o
- α: learning rate
- γ: discount rate raised to the number of time steps o took
- r: accumulated discounted reward since o began
- s': state in which o terminated
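A minimal runnable sketch of this update for tabular Q-values stored in a dictionary; the function and variable names are mine, not from the slides.

```python
from collections import defaultdict

def smdp_q_update(Q, s, o, r, gamma_k, s_next, options, alpha=0.1):
    """One SMDP Q-learning update following the slide's definitions.

    Q       : defaultdict(float) mapping (state, option) -> activity value
    r       : accumulated discounted reward since option o began
    gamma_k : discount factor already raised to the option's duration
    options : options available in s_next
    """
    best_next = max((Q[(s_next, o2)] for o2 in options), default=0.0)
    Q[(s, o)] += alpha * (r + gamma_k * best_next - Q[(s, o)])

# Example usage (hypothetical values):
# Q = defaultdict(float)
# smdp_q_update(Q, s=0, o="GotoRed", r=-4.1, gamma_k=0.9**5, s_next=7,
#               options=["GotoRed", "GotoBlue", "Pickup"])
```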
Abstract State S1 Transitions
[Figure: abstract state S1 transitions under abstract policy P1.]
Expectation-Maximization (EM)
Used when there are hidden variables and unknown parameters.
- E(xpectation)-step: assume the parameters are known and compute the conditional expected values of the hidden variables.
- M(aximization)-step: assume the hidden variables are observed and compute the maximizing (argmax) parameters.
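To make the E-step/M-step pattern concrete, here is a tiny self-contained example on a two-component Bernoulli mixture. This illustrates EM in general; it is not the DAN learning code.

```python
import numpy as np

def em_bernoulli_mixture(x, n_iter=50):
    """EM for a mixture of two biased coins: the hidden variable is which
    coin produced each 0/1 observation in x; the unknown parameters are
    the mixing weight and the two head probabilities."""
    rng = np.random.default_rng(0)
    pi = 0.5                                # mixing weight of component 1
    p = rng.uniform(0.25, 0.75, size=2)     # head probability of each coin
    for _ in range(n_iter):
        # E-step: parameters fixed, compute expected component memberships.
        lik1 = pi * p[0]**x * (1 - p[0])**(1 - x)
        lik2 = (1 - pi) * p[1]**x * (1 - p[1])**(1 - x)
        resp = lik1 / (lik1 + lik2)
        # M-step: memberships fixed, compute the maximizing parameters.
        pi = resp.mean()
        p[0] = (resp * x).sum() / resp.sum()
        p[1] = ((1 - resp) * x).sum() / (1 - resp).sum()
    return pi, p

# Example usage:
# x = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 1])
# print(em_bernoulli_mixture(x))
```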
Abstract State S0 Transitions
[Figure: abstract state S0 transitions under abstract policy P0.]