ECE 517: Reinforcement Learning in Artificial Intelligence
Lecture 20: Approximate & Neuro Dynamic Programming, Policy Gradient Methods
Dr. Itamar Arel
College of Engineering, Department of Electrical Engineering and Computer Science
The University of Tennessee
Fall 2010
November 15, 2010

Introduction
We will discuss different methods that heuristically approximate the dynamic programming problem
- Approximate Dynamic Programming (ADP)
- Direct Neuro-Dynamic Programming
Assumptions:
- DP methods assume fully observed systems
- ADP methods rely on no model and (usually) assume partial observability (POMDPs)
Relationship to classical control theories
- Optimal control: in the linear case, this is a solved problem. It estimates the state vector using Kalman filter methodologies
- Adaptive control responds to the question: what can we do when the dynamics of the system are unknown?
  - We have a model of the plant but lack its parameters (values)
  - Often focuses on stability properties rather than performance

Relationship to classical control theories (cont.)
Robust control
- Attempt to find a controller design that guarantees stability, i.e. that the plant will not "blow up"
- Regardless of what the unknown parameter values are
- e.g. Lyapunov-based analysis (e.g. queueing systems)
Adaptive control
- Attempt to adapt the controller in real time, based on real-time observations of how the plant actually behaves
- ADP may be viewed as an adaptive control framework
- "Neural observers": used to predict the next set of observations, based on which the controller acts

Core principles of ADP
The following are three general principles that are at the core of ADP
- Value approximation: instead of solving for V(s) exactly, we can use a universal approximation function V(s,W) ≈ V(s)
- Alternate starting points: instead of always starting from the Bellman equation directly, we can start from related recurrence equations
- Hybrid design: combining multiple ADP systems in more complex hybrid designs
  - Usually in order to scale better
  - Mixture of continuous and discrete variables
  - Multiple spatio-temporal scales
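
The value-approximation principle can be illustrated with a minimal sketch. It assumes a linear approximator V(s,W) = W · phi(s) trained with TD(0) on a small random-walk chain. The task, the feature map `phi`, and all parameter values are hypothetical choices made for illustration and are not taken from the lecture.

```python
import random

def phi(s):
    # Hypothetical feature vector: a bias term plus the raw state index.
    return [1.0, float(s)]

def v_hat(s, w):
    # V(s, W): linear approximation of the value function, dot(W, phi(s)).
    return sum(wi * xi for wi, xi in zip(w, phi(s)))

def td0_linear(n_states=5, episodes=20000, alpha=0.002, gamma=1.0, seed=0):
    """Semi-gradient TD(0) on a random walk over states 0..n_states-1.

    Stepping off the right end gives reward 1, off the left end reward 0.
    """
    rng = random.Random(seed)
    w = [0.0, 0.0]
    for _ in range(episodes):
        s = n_states // 2  # every episode starts in the middle state
        while True:
            s_next = s + rng.choice((-1, 1))
            done = s_next < 0 or s_next >= n_states
            r = 1.0 if s_next >= n_states else 0.0
            # Bootstrapped target; a terminal state has value 0.
            target = r if done else r + gamma * v_hat(s_next, w)
            delta = target - v_hat(s, w)
            # For a linear approximator the gradient w.r.t. W is phi(s).
            w = [wi + alpha * delta * xi for wi, xi in zip(w, phi(s))]
            if done:
                break
            s = s_next
    return w

if __name__ == "__main__":
    w = td0_linear()
    print([round(v_hat(s, w), 2) for s in range(5)])
```

For this chain the exact values are (s+1)/6, which the two weights can represent, so the learned estimates should settle near them. A neural network would replace `phi` and the dot product when the value function is not linear in hand-picked features.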

Direct Neuro-Dynamic Programming (Direct NDP)
Motivation
- The intuitive appeal of Reinforcement Learning, in particular the actor/critic design
- The power of the calculus of variations, used in the form of backpropagation to solve optimal control problems
- Can inherently deal with POMDPs (using RNNs, for example)
The method is considered "direct" in that
- It does not have an explicit state representation
- Temporal progression: everything is a function of time and not of state/action sets
It is also model-free, as it does not assume a model or attempt to directly estimate model dynamics/structure

Direct NDP Architecture
- Critic Network: estimates the future reward-to-go J (i.e. the value function)
- Action Network: adjusts the action to minimize the difference between the estimated J and the ultimate objective U_c
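
The two training objectives can be written out as follows. This is a sketch in the form commonly used for direct NDP, not a formula taken from these slides: the discount factor $\gamma$, the learning rate $\eta$, and the weight symbols $W_c$, $W_a$ are assumed notation.

```latex
% Critic: make J(t) consistent with the previous estimate minus the reward
e_c(t) = \gamma J(t) - \bigl[ J(t-1) - r(t) \bigr], \qquad
E_c(t) = \tfrac{1}{2}\, e_c^2(t)

% Action network: drive the estimated J toward the ultimate objective U_c
e_a(t) = J(t) - U_c(t), \qquad
E_a(t) = \tfrac{1}{2}\, e_a^2(t)

% Both networks are trained by gradient descent on their own error
W_c \leftarrow W_c - \eta \, \frac{\partial E_c(t)}{\partial W_c}, \qquad
W_a \leftarrow W_a - \eta \, \frac{\partial E_a(t)}{\partial W_a}
```

The gradient for the action network is obtained by backpropagating through the critic, since J depends on the action u(t).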

Direct NDP vs. Classic RL
[Figure: block diagram of the direct NDP agent. Blocks: Action Network, Critic Network, Environment. Signals: State, Reward, Action, u(t), J(t), J(t-1)-r(t), U_c(t).]

Inverted Helicopter Flight (Ng / Stanford, 2004)

Solving POMDPs with RNNs
Case study: framework for obtaining an optimal policy in model-free POMDPs using Recurrent Neural Networks
- Uses an NDP version of Q-Learning
- TRTRL is employed (an efficient version of RTRL)
- Goal: investigate a scenario in which two states have the same observation (yet different optimal actions)
- Method: RNNs in a TD framework (more later)
- Model is unknown!

Direct NDP architecture using RNNs
[Figure: block diagram. Blocks: Environment, RNN (Q-Learning approximation), Softmax, TD. Signals: observation O_t, action a_t, reward r_t, estimate of Q(s_t, a_t), final action.]
Method is good for small action sets. Q: why?
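
The softmax stage in this architecture can be sketched as below. The sketch assumes a Boltzmann distribution over one Q estimate per candidate action. The function names and the `temperature` parameter are hypothetical, and the Q values in the usage lines are placeholders for what the RNN would output.

```python
import math
import random

def softmax_probs(q_values, temperature=1.0):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def select_action(q_values, temperature=1.0, rng=random):
    # Sample an action index with probability given by the softmax of its Q.
    probs = softmax_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

if __name__ == "__main__":
    q_estimates = [0.1, 0.5, 0.2]  # placeholder: one estimate per action
    print(softmax_probs(q_estimates))
    print(select_action(q_estimates, rng=random.Random(0)))
```

Selection needs a Q estimate for every candidate action at each time step, so the cost of choosing an action grows with the size of the action set.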

Simulation results: 10 neurons

Training Robots (1st-gen AIBOs) to walk (faster)
- 1st-generation AIBOs were used (internal CPU)
- Fundamental motor capabilities were prescribed
  - e.g. apply torque to a given joint, turn in a given direction
  - In other words, a finite action set
  - Observations were limited to distance (a.k.a. radar view)
- The goal was to cross the field in a short time
  - Reward grew increasingly negative as time progressed
  - Large positive reward when the goal was met
- Multiple robots were trained to observe variability in the learning process

The general RL approach revisited
RL will solve all of your problems, but ...
- We need lots of experience to train from
- Taking random actions can be dangerous
- It can take a long time to learn
- Not all problems fit into the NDP framework
An alternative approach to RL is to reward whole policies, rather than individual actions
- Run the whole policy, then receive a single reward
- Reward measures success of the entire policy
If there are a small number of policies, we can exhaustively try them all
- However, this is not possible in most interesting problems

Policy Gradient Methods
Assume that our policy, $\pi$, has a set of n real-valued parameters, $\theta = \{\theta_1, \theta_2, \theta_3, \ldots, \theta_n\}$
- Running the policy with a particular $\theta$ results in a reward, $r$
- Estimate the reward gradient, $\partial r / \partial \theta_i$, for each $\theta_i$
- Update each parameter along the gradient: $\theta_i \leftarrow \theta_i + \alpha \, \partial r / \partial \theta_i$
  - $\alpha$ is another learning rate
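
A minimal sketch of this idea follows. It assumes the gradient-ascent update theta_i <- theta_i + alpha * (dr/dtheta_i), with the gradient estimated by central finite differences. The reward function `toy_reward`, its peak location, and the values of `alpha` and `eps` are hypothetical stand-ins for actually running a policy.

```python
def estimate_gradient(reward_fn, theta, eps=1e-2):
    # Central finite differences: perturb one parameter at a time.
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        up[i] += eps
        down = list(theta)
        down[i] -= eps
        grad.append((reward_fn(up) - reward_fn(down)) / (2.0 * eps))
    return grad

def policy_gradient_step(reward_fn, theta, alpha=0.1, eps=1e-2):
    # Gradient ascent on the reward: theta_i <- theta_i + alpha * dr/dtheta_i.
    grad = estimate_gradient(reward_fn, theta, eps)
    return [t + alpha * g for t, g in zip(theta, grad)]

def toy_reward(theta):
    # Hypothetical smooth reward with a single peak at theta = (1, -2).
    return -((theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2)

if __name__ == "__main__":
    theta = [0.0, 0.0]
    for _ in range(100):
        theta = policy_gradient_step(toy_reward, theta)
    print(theta)
```

Each gradient estimate here costs 2n policy runs. On a real robot every run is noisy and expensive, which is one reason to estimate the gradient from a batch of random perturbations instead.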

Policy Gradient Methods (cont.)
This results in hill-climbing in policy space
- So, it's subject to all the problems of hill-climbing
- But, we can also use tricks from search theory, like random restarts and momentum terms
This is a good approach if you have a parameterized policy
- Let's assume we have a "reasonable" starting policy
- Typically faster than value-based methods
- "Safe" exploration, if you have a good policy
- Learns locally-best parameters for that policy

An Example: Learning to Walk
RoboCup 4-legged league
- Walking quickly is a big advantage
- Until recently, this was tuned manually
Robots have a parameterized gait controller
- 12 parameters
- Controls step length, height, etc.
Robots walk across the soccer field and are timed
- Reward is a function of the time taken
- They know when to stop (distance measure)
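
An Example: Learning to Walk (cont.)
Basic idea
1. Pick an initial $\theta = \{\theta_1, \theta_2, \ldots, \theta_{12}\}$
2. Generate N testing parameter settings by perturbing $\theta$: $\theta^j = \{\theta_1 + \delta_1, \theta_2 + \delta_2, \ldots, \theta_{12} + \delta_{12}\}$, $\delta_i \in \{-\epsilon, 0, +\epsilon\}$
3. Test each setting, and observe rewards: $\theta^j \rightarrow r_j$
4. For each $\theta_i$, calculate $\theta_i^+$, $\theta_i^0$, $\theta_i^-$ and set $\theta'_i$ by moving $\theta_i$ toward the perturbation with the highest average reward
   - e.g. $\theta_i^-$ is the average reward over the settings $n$ in which $\theta_i^n = \theta_i - \epsilon$
5. Set $\theta \leftarrow \theta'$, and go to 2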
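
The loop above can be sketched as follows. The update in step 4 is an assumption: each parameter moves by a fixed `step` toward whichever of its three perturbation groups (+eps, 0, -eps) had the highest average reward. The reward function `toy_gait_reward`, the target gait `TARGET`, and all numeric settings are hypothetical stand-ins for timing a real robot.

```python
import random

TARGET = [1.0] * 12  # hypothetical "ideal" gait parameters for the toy reward

def toy_gait_reward(theta):
    # Stand-in for timing a walk: higher is better, peak at TARGET.
    return -sum((t - g) ** 2 for t, g in zip(theta, TARGET))

def gait_search_step(reward_fn, theta, n_settings=30, eps=0.05, step=0.05,
                     rng=random):
    # Steps 2-3: sample perturbed settings and evaluate each one.
    trials = []
    for _ in range(n_settings):
        deltas = [rng.choice((-eps, 0.0, eps)) for _ in theta]
        candidate = [t + d for t, d in zip(theta, deltas)]
        trials.append((deltas, reward_fn(candidate)))
    # Step 4: per parameter, average the rewards of the +, 0 and - groups.
    new_theta = []
    for i, t in enumerate(theta):
        groups = {-1: [], 0: [], 1: []}
        for deltas, r in trials:
            sign = (deltas[i] > 0) - (deltas[i] < 0)
            groups[sign].append(r)
        averages = {k: sum(v) / len(v) for k, v in groups.items() if v}
        best = max(averages, key=averages.get)
        # Assumed update rule: fixed-size move toward the best group.
        new_theta.append(t + best * step)
    return new_theta

if __name__ == "__main__":
    rng = random.Random(0)
    theta = [0.0] * 12  # step 1: initial parameters
    for _ in range(60):  # step 5: repeat with the updated parameters
        theta = gait_search_step(toy_gait_reward, theta, rng=rng)
    print(round(toy_gait_reward(theta), 3))
```

All 12 parameters are perturbed at once, so one batch of N trials yields a direction for every parameter, unlike perturbing one parameter at a time.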

An Example: Learning to Walk (cont.)
[Figure: two panels, Initial and Final]
Q: Can we translate the policy gradient approach into a direct-policy, actor/critic Neuro-Dynamic Programming system?

Value Function or Policy Gradient?
When should I use policy gradient?
- When there's a parameterized policy
- When there's a high-dimensional state space
- When we expect the gradient to be smooth
- Typically on episodic tasks (e.g. AIBO walking)
When should I use a value-based method?
- When there is no parameterized policy
- When we have no idea how to solve the problem (i.e. no known structure)

Direct NDP with RNNs: Backpropagation through a model
- RNNs have memory and can create temporal context
- Applies to both actor and critic
- Much harder to train (time and logic/memory resources)
  - e.g. RTRL issues

Consolidated Actor-Critic Model (Z. Liu, I. Arel, 2007)
- A single network (FF or RNN) is sufficient for both actor and critic functions
- Two-pass (TD-style) for both action and value estimate corrections
- Training via standard techniques, e.g. BP