Reinforcement Learning 1 COMP538 Reinforcement Learning: Recent Development. Group 7: Chan Ka Ki, Fung On Tik Andy, Li Yuk Hin. Instructor: Nevin L. Zhang
Reinforcement Learning 2 Outline: Introduction; Solving Methods; Main Consideration: Exploration vs. Exploitation (Directed / Undirected Exploration), Function Approximation; Planning and Learning: Direct RL vs. Indirect RL, Dyna-Q and Prioritized Sweeping; Conclusion on recent development
Reinforcement Learning 3 Introduction Agent interacts with environment Goal-directed learning from interaction Environment Action a AI Agent s(t) Reward r s(t + 1)
Reinforcement Learning 4 Key Features The agent is NOT told which actions to take, but learns by itself, by trial and error, from experience. Explore and exploit: Exploitation = the agent takes the best action based on its current knowledge; Exploration = the agent deliberately takes an action that is NOT currently the best, in order to gain more knowledge
Reinforcement Learning 5 Elements of RL Policy: what to do Reward: what is good Value: what is good because it predicts reward Model: what follows what
Reinforcement Learning 6 Dynamic Programming Model-based: computes optimal policies given a perfect model of the environment as a Markov decision process (MDP). Bootstrap: updates estimates based in part on other learned estimates, without waiting for a final outcome
Reinforcement Learning 7 Dynamic Programming
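To make the dynamic-programming idea concrete, here is a minimal value-iteration sketch in Python. It assumes a small tabular MDP given as P[s][a] = list of (probability, next_state, reward) tuples; that representation and the function name are illustrative assumptions, not anything from the slides.

def value_iteration(P, gamma=0.9, theta=1e-6):
    # P[s][a] is assumed to be a list of (probability, next_state, reward) tuples.
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            # Bootstrap: back up from the current estimates of successor states.
            q_values = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                        for a in range(len(P[s]))]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V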
Reinforcement Learning 8 Monte Carlo Model-free; does NOT bootstrap; the entire episode is included in each update; only one choice at each state (unlike DP). The time required to estimate one state does not depend on the total number of states
Reinforcement Learning 9 Monte Carlo (backup diagram: each backup follows a complete sampled episode down to a terminal state T)
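A minimal sketch of first-visit Monte Carlo prediction, illustrating that each estimate averages returns from complete episodes. The trajectory format (a list of (state, reward) pairs per episode, with the reward following the state) is an assumption made for this example.

from collections import defaultdict

def mc_prediction(episodes, gamma=1.0):
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        returns_from_t = []
        for state, reward in reversed(episode):   # accumulate returns backwards
            G = reward + gamma * G
            returns_from_t.append((state, G))
        returns_from_t.reverse()                  # back to time order
        seen = set()
        for state, G in returns_from_t:
            if state not in seen:                 # first visit to this state only
                seen.add(state)
                returns[state].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}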
Reinforcement Learning 10 Temporal Difference Model-free Bootstrap Partial episode included
Reinforcement Learning 11 Temporal Difference (backup diagram: each backup uses a single sampled step and bootstraps from the next state's estimate)
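For contrast, a TD(0) prediction sketch: it bootstraps from the current estimate of the next state instead of waiting for the episode to finish. env and policy are hypothetical stand-ins (env.reset() -> s, env.step(a) -> (s', r, done)), not part of the slides.

from collections import defaultdict

def td0_prediction(env, policy, n_episodes=100, alpha=0.1, gamma=0.9):
    V = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r + gamma * V[s_next] * (not done)   # one-step bootstrapped target
            V[s] += alpha * (target - V[s])               # a partial episode is enough
            s = s_next
    return V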
Reinforcement Learning 12 Example: Driving home
Reinforcement Learning 13 Driving home Changes recommended by Monte Carlo methods Changes recommended by TD methods
Reinforcement Learning 14 N-step TD Prediction MC and TD are extreme cases!
Reinforcement Learning 15 Averaging N-step Returns n-step methods were introduced to help understand TD(λ). Idea: back up an average of several returns, e.g. half of the 2-step return and half of the 4-step return. This is called a complex backup: draw each component and label it with the weight for that component
Reinforcement Learning 16 Forward View of TD(λ) TD(λ) is a method for averaging all n-step backups, weighting the n-step backup by λ^(n−1) (time since visitation). The λ-return: R_t^λ = (1 − λ) Σ_{n=1..∞} λ^(n−1) R_t^(n). Backup using the λ-return
Reinforcement Learning 17 Forward View of TD(λ) Look forward from each state to determine its update from future states and rewards: ΔV_t(s_t) = α [R_t^λ − V_t(s_t)]
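A small sketch of the forward view for a finished episode. It uses the fact that the λ-return above satisfies the backward recursion G_t = r_{t+1} + γ[(1−λ)V(s_{t+1}) + λG_{t+1}], which is equivalent to the weighted average of n-step returns. The data layout (parallel lists of visited states and the rewards that followed them, and V as a dict of current state-value estimates) is an assumption for this example.

def lambda_returns(states, rewards, V, gamma=0.9, lam=0.8):
    T = len(rewards)
    G = [0.0] * T
    G_next = 0.0                       # the return beyond the terminal state is zero
    for t in reversed(range(T)):
        v_next = V.get(states[t + 1], 0.0) if t + 1 < T else 0.0
        G[t] = rewards[t] + gamma * ((1.0 - lam) * v_next + lam * G_next)
        G_next = G[t]
    return G

These targets would then drive the forward-view update ΔV(s_t) = α(G_t − V(s_t)).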
Reinforcement Learning 18 Backward View of TD(λ) The forward view was for theory; the backward view is for mechanism. New variable called the eligibility trace. On each step, decay all traces by γλ and increment the trace for the current state by 1. Accumulating trace: e_t(s) = γλ e_{t−1}(s) + 1 if s = s_t, and e_t(s) = γλ e_{t−1}(s) otherwise
Reinforcement Learning 19 Backward View Shout δ_t backwards over time. The strength of your voice decreases with temporal distance by γλ
Reinforcement Learning 20 Forward View = Backward View The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating
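A backward-view TD(λ) sketch with accumulating eligibility traces, matching the mechanism described above. env and policy are hypothetical stand-ins as in the earlier sketches.

from collections import defaultdict

def td_lambda(env, policy, n_episodes=100, alpha=0.1, gamma=0.9, lam=0.8):
    V = defaultdict(float)
    for _ in range(n_episodes):
        e = defaultdict(float)                        # eligibility traces, reset each episode
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            delta = r + gamma * V[s_next] * (not done) - V[s]
            e[s] += 1.0                               # accumulating trace for the current state
            for state in list(e):
                V[state] += alpha * delta * e[state]  # "shout" delta back along the traces
                e[state] *= gamma * lam               # decay all traces by gamma*lambda
            s = s_next
    return V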
Adaptive Exploration in Reinforcement Learning Relu Patrascu Department of Systems Design Engineering University of Waterloo Waterloo, Ontario, Canada Deborah Stacey Dept. of Computing and Information Science University of Guelph Ontario, Canada
Reinforcement Learning 22 Objectives Explains the trade-off between exploitation and exploration. Introduces two categories of exploration methods: Undirected Exploration (ε-greedy exploration) and Directed Exploration (counter-based exploration, past-success directed exploration). Function approximation: backpropagation algorithm and Fuzzy ARTMAP
Reinforcement Learning 23 Introduction Main problem: how to make the learning process adapt to a non-stationary environment? Sub-problems: how to balance exploitation and exploration when the environment changes? How can the function approximators adapt to the environment?
Reinforcement Learning 24 Exploitation and Exploration Exploit or explore? To maximize reward, a learner must exploit the knowledge it already has, yet exploring an action with a small immediate reward may yield more reward in the long run. An example: choosing a job. Suppose you work at a small company with a $25,000 salary, and you have another offer from a large enterprise that starts at only $12,000. Keeping the job at the small company guarantees a stable income; working at the enterprise may offer more opportunities for promotion, which increases income in the long run
Reinforcement Learning 25 Undirected Exploration Not biased: purely random. E.g. ε-greedy exploration: when it explores, it chooses uniformly among all actions, so it is as likely to choose the worst-appearing action as it is to choose the next-to-best
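A minimal ε-greedy action-selection sketch; Q is assumed to map (state, action) pairs to estimated values.

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        # Undirected exploration: every action is equally likely,
        # including the worst-appearing one.
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))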
Reinforcement Learning 26 Directed Exploration Memorizes exploration-specific knowledge; biased by some features of the learning process. E.g. counter-based techniques favor actions resulting in a transition to a state that has not been visited frequently. The main idea is to encourage the learner to explore parts of the state space that have not been sampled often, or that have not been sampled recently, as sketched below
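A sketch of counter-based (directed) exploration: action choice is biased toward state-action pairs that have been visited least often. The bonus weight beta and the count table are illustrative assumptions, not the specific rule from the paper.

def counter_based_action(Q, counts, state, actions, beta=1.0):
    def score(a):
        n = counts.get((state, a), 0)
        return Q.get((state, a), 0.0) + beta / (1 + n)   # visit-count exploration bonus
    a = max(actions, key=score)
    counts[(state, a)] = counts.get((state, a), 0) + 1   # remember the visit
    return a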
Reinforcement Learning 27 Past-success Directed Exploration Based on ε-greedy exploration, with the bias adapted to the environment through the learning process: increase the exploitation rate if the agent receives reward at an increasing rate, and increase the exploration rate when it stops receiving reward. Uses an average discounted reward, which reflects the amount and frequency of received immediate rewards: the further back in time a reward was received, the less effect it has on the average
Reinforcement Learning 28 Past-Success Directed Exploration The average discounted reward is defined as r̄_t = Σ_{k=0..t} γ^(t−k) r_k, where γ ∈ (0,1] is the discount factor and r_t is the reward received at time t. This average is then used to adapt ε in the ε-greedy algorithm
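A sketch of past-success directed exploration: maintain the discounted average of received rewards and shift ε toward exploitation while that average is rising, and toward exploration when it falls. The exact adaptation rule and step sizes here are illustrative assumptions, not the paper's formula.

class PastSuccessEpsilon:
    def __init__(self, gamma=0.95, step=0.01, eps_min=0.01, eps_max=0.5):
        self.gamma, self.step = gamma, step
        self.eps_min, self.eps_max = eps_min, eps_max
        self.epsilon = eps_max
        self.avg_reward = 0.0

    def update(self, r):
        prev = self.avg_reward
        self.avg_reward = self.gamma * self.avg_reward + r               # discounted average
        if self.avg_reward > prev:
            self.epsilon = max(self.eps_min, self.epsilon - self.step)   # exploit more
        else:
            self.epsilon = min(self.eps_max, self.epsilon + self.step)   # explore more
        return self.epsilon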
Reinforcement Learning 29 Gradient Descent Method Why use a gradient descent method? RL applications typically use a table to store the value function, and a large number of states makes this practically impossible. Solution: use a function approximator to predict the values. The error backpropagation algorithm, however, suffers from catastrophic interference: it cannot learn incrementally in a non-stationary environment, because acquiring new knowledge makes it forget much of its previous knowledge
Reinforcement Learning 30 Gradient Descent Method
Initialize w arbitrarily and e = 0
Repeat (for each episode):
    Initialize s
    Pass s through each network and obtain Q_a for all a
    a ← argmax_a Q_a; with probability ε: a ← a random action ∈ A(s)
    Repeat (for each step of episode):
        e ← γλ e;  e_a ← e_a + ∇_w Q_a
        Take action a, observe reward r and next state s'
        δ ← r − Q_a
        Pass s' through each network and obtain Q'_a for all a
        a' ← argmax_a Q'_a; with probability ε: a' ← a random action ∈ A(s')
        δ ← δ + γ Q'_{a'}
        w ← w + α δ e
        a ← a'
    until s' is terminal
where a' ← argmax_a Q'_a means a' is set to the action for which Q' is highest; α is a constant step-size parameter (the learning rate); ∇_w Q_a is the partial derivative of Q_a with respect to the weights w; γ is the discount factor; e is the vector of eligibility traces; λ ∈ (0,1] is the eligibility-trace parameter
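A runnable sketch of the loop above, with a linear approximator over a feature vector phi(s) standing in for the backpropagation network (a simplifying assumption); env, phi, n_actions and n_features are hypothetical names introduced for this example.

import numpy as np

def q_lambda_linear(env, phi, n_actions, n_features, n_episodes=200,
                    alpha=0.01, gamma=0.9, lam=0.8, epsilon=0.1):
    w = np.zeros((n_actions, n_features))        # one weight vector per action
    for _ in range(n_episodes):
        e = np.zeros_like(w)                     # eligibility traces, reset per episode
        s = env.reset()
        x = phi(s)
        a = np.random.randint(n_actions) if np.random.rand() < epsilon else int((w @ x).argmax())
        done = False
        while not done:
            e *= gamma * lam                     # decay all traces
            e[a] += x                            # gradient of Q(s,a) w.r.t. w[a] is phi(s)
            s_next, r, done = env.step(a)
            delta = r - float(w[a] @ x)
            if not done:
                x_next = phi(s_next)
                q_next = w @ x_next
                a_next = (np.random.randint(n_actions) if np.random.rand() < epsilon
                          else int(q_next.argmax()))
                delta += gamma * float(q_next[a_next])
                x, a = x_next, a_next
            w += alpha * delta * e               # gradient-descent TD update
    return w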
Reinforcement Learning 31 Fuzzy ARTMAP ARTMAP = Adaptive Resonance Theory mapping between an input vector and an output pattern; a neural network specifically designed to deal with the stability/plasticity dilemma. This dilemma means a neural network cannot learn new information without damaging what was learned previously, similar to catastrophic interference
Reinforcement Learning 32 Experiments Gridworld with a non-stationary environment. The learning agent can move up, down, left or right. Two gates: it must pass through one of them to get from the start state to the goal state. For the first 1000 episodes, gate 1 is open and gate 2 is closed; for the remaining episodes, gate 1 is closed and gate 2 is open. This tests how well the algorithms adapt to the changed environment
Reinforcement Learning 33 Results Backpropagation algorithm, after the 1000th episode: the average discounted reward drops rapidly and monotonically, and the agent surges to maximum exploitation. Fuzzy ARTMAP, after the 1000th episode: the reward drops for a few episodes and then returns to high values, with a temporary surge in exploration
34 Planning and Learning Objectives: use of environment models; integration of planning and learning methods
35 Models Model: anything the agent can use to predict how the environment will respond to its actions. Distribution model: a description of all possible outcomes and their probabilities. Sample model: produces sample experiences, e.g. a simulation model or a set of data. Both types of models can be used to produce simulated experience. Sample models are often much easier to obtain
Reinforcement Learning 36 Planning Planning: any computational process that uses a model to create or improve a policy. We take the following view: all state-space planning methods involve computing value functions, either explicitly or implicitly, and they all apply backups to simulated experience. (Diagram: model → simulated experience → backups → values → policy)
Reinforcement Learning 37 Learning, Planning, and Acting Two uses of real experience: model learning: to improve the model direct RL: to directly improve the value function and policy Improving value function and/or policy via a model is sometimes called indirect RL or model- based RL. Here, we call it planning.
Reinforcement Learning 38 Direct vs. Indirect RL Indirect methods: make fuller use of experience: get better policy with fewer environment interactions Direct methods simpler not affected by bad models But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel
Reinforcement Learning 39 The Dyna-Q Architecture (Sutton 1990)
Reinforcement Learning 40 The Dyna-Q Architecture (Sutton 1990) Dyna uses experience to build the model (R, T), uses experience to adjust the policy, and uses the model to adjust the policy. For each interaction with the environment (experiencing): 1. use the experience to adjust the policy: Q(s,a) ← Q(s,a) + α[r + γ max_{a'} Q(s',a') − Q(s,a)]; 2. use the experience to update the model (T, R): Model(s,a) ← (s', r); 3. use the model to simulate experience and adjust the policy: s ← Rand(s), a ← Rand(a), (s', r) ← Model(s,a), Q(s,a) ← Q(s,a) + α[r + γ max_{a'} Q(s',a') − Q(s,a)]
Reinforcement Learning 41 The Dyna-Q Algorithm direct RL model learning planning
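A tabular Dyna-Q sketch combining the three steps above: direct RL from real experience, model learning, and N planning backups from the learned model per real step. env is a hypothetical environment with reset()/step(), and the deterministic model table is the standard tabular Dyna-Q assumption.

import random
from collections import defaultdict

def dyna_q(env, n_actions, n_episodes=50, N=5, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)
    model = {}                                   # (s, a) -> (r, s', done)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action from the current Q
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda b: Q[(s, b)])
            s_next, r, done = env.step(a)
            # 1. direct RL: one-step Q-learning update from real experience
            target = r + gamma * max(Q[(s_next, b)] for b in range(n_actions)) * (not done)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # 2. model learning: remember the observed transition
            model[(s, a)] = (r, s_next, done)
            # 3. planning: N simulated backups from previously observed (s, a) pairs
            for _ in range(N):
                (ps, pa), (pr, pnext, pdone) = random.choice(list(model.items()))
                ptarget = pr + gamma * max(Q[(pnext, b)] for b in range(n_actions)) * (not pdone)
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s_next
    return Q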
Reinforcement Learning 42 Dyna-Q Snapshots: Midway in 2nd Episode
Reinforcement Learning 43 Dyna-Q Properties The Dyna algorithm requires about N times the computation of Q-learning per instance, but this is typically vastly less than the computation required by a naïve model-based method. N can be chosen based on the relative speed of computation and of taking actions. What if the environment changes? It may change to become harder or easier
Reinforcement Learning 44 Blocking Maze The changed environment is harder
Reinforcement Learning 45 Shortcut Maze The changed environment is easier
Reinforcement Learning 46 What is Dyna-Q+? It uses an “exploration bonus”: it keeps track of the time since each state-action pair was tried for real, and an extra reward is added for transitions caused by state-action pairs, related to how long ago they were tried: the longer unvisited, the more reward for visiting (see the sketch below). In effect the agent “plans” how to visit long-unvisited states
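A small sketch of the Dyna-Q+ exploration bonus used during planning backups: the reward from the model is augmented by a bonus that grows with the time τ since the pair was last tried for real. The weight kappa and the square-root form follow the standard Dyna-Q+ description; the bookkeeping of τ is left to the caller and is an assumption of this example.

import math

def bonus_reward(r, tau, kappa=0.001):
    # The longer (s, a) has gone untried (tau time steps), the larger the bonus.
    return r + kappa * math.sqrt(tau)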
Reinforcement Learning 47 Prioritized Sweeping The selection of state-action pairs to update from the model is no longer random. Instead, additional information is stored in the model in order to make an appropriate choice: store the change of each state value, ΔV(s), and use it to set the priority of the predecessors of s according to their transition probabilities T(s,a,s'). (Diagram: the predecessors of a changed state are placed in a queue ordered by priority, from high to low)
Reinforcement Learning 48 Prioritized Sweeping
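A prioritized-sweeping planning sketch: backups are popped from a priority queue ordered by the magnitude of the expected value change, and each backup pushes the updated state's predecessors. The data structures (model table, predecessor lists, queue entries) are illustrative assumptions.

import heapq

def prioritized_sweeping_step(Q, model, predecessors, pqueue, n_actions,
                              n_backups=5, alpha=0.1, gamma=0.95, theta=1e-4):
    for _ in range(n_backups):
        if not pqueue:
            break
        _, (s, a) = heapq.heappop(pqueue)        # highest-priority pair first
        r, s_next = model[(s, a)]
        target = r + gamma * max(Q[(s_next, b)] for b in range(n_actions))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        # Push predecessors whose estimated value change exceeds the threshold.
        for sp, ap, rp in predecessors.get(s, []):
            p = abs(rp + gamma * max(Q[(s, b)] for b in range(n_actions)) - Q[(sp, ap)])
            if p > theta:
                heapq.heappush(pqueue, (-p, (sp, ap)))   # negate: heapq is a min-heap
    return Q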
Reinforcement Learning 49 Prioritized Sweeping vs. Dyna-Q Both use N=5 backups per environmental interaction
Reinforcement Learning 50 Full and Sample (One-Step) Backups
Reinforcement Learning 51 Summary Emphasized close relationship between planning and learning Important distinction between distribution models and sample models Looked at some ways to integrate planning and learning synergy among planning, acting, model learning
52 RL Recent Development: Problem Modeling Depending on whether the model of the environment is known and whether the state is fully observable: model known + completely observable → MDP; model known + partially observable → Partially Observable MDP; model unknown + completely observable → traditional RL; model unknown + partially observable → hidden-state RL
Reinforcement Learning 53 Research topics Exploration-Exploitation tradeoff Problem of delayed reward (credit assignment) Input generalization Function Approximator Multi-Agent Reinforcement Learning Global goal vs Local goal Achieve several goals in parallel Agent cooperation and communication
Reinforcement Learning 54 RL Application TD-Gammon (Tesauro 1992, 1994, 1995): 30 pieces and 24 locations imply an enormous number of configurations, with an effective branching factor of about 400. Uses the TD(λ) algorithm with a multi-layer neural network; plays near the level of the world's strongest grandmasters
Reinforcement Learning 55 RL Application Elevator Dispatching Crites and Barto 1996
Reinforcement Learning 56 RL Application Elevator Dispatching: conservatively about 10^22 states. 18 hall call buttons: 2^18 combinations; positions and directions of cars: 18^4 (rounding to the nearest floor); motion states of cars (accelerating, moving, decelerating, stopped, loading, turning): 6; 40 car buttons: 2^40. 18 discretized real numbers are available giving the elapsed time since each hall button was pushed. The set of passengers riding each car and their destinations is observable only through the car buttons
Reinforcement Learning 57 RL Application Dynamic Channel Allocation (Singh and Bertsekas 1997); Job-Shop Scheduling (Zhang and Dietterich 1995, 1996)
Reinforcement Learning 58 Q & A