Reinforcement Learning 1 COMP538 Reinforcement Learning Recent Development. Group 7: Chan Ka Ki, Fung On Tik Andy, Li Yuk Hin. Instructor: Nevin L. Zhang

Reinforcement Learning 2 Outline: Introduction; three solving methods; main considerations: exploration vs. exploitation, directed/undirected exploration, function approximation; planning and learning: direct RL vs. indirect RL, Dyna-Q and prioritized sweeping; conclusion on recent development

Reinforcement Learning 3 Introduction Agent interacts with environment; goal-directed learning from interaction. [Diagram: the AI agent in state s(t) sends action a to the environment, which returns reward r and next state s(t + 1)]

Reinforcement Learning 4 Key Features The agent is NOT told which actions to take, but learns by itself, by trial and error, from experience. It must explore and exploit: exploitation = the agent takes the best action based on its current knowledge; exploration = the agent deliberately takes an action that is not currently the best in order to gain more knowledge.

Reinforcement Learning 5 Elements of RL Policy: what to do Reward: what is good Value: what is good because it predicts reward Model: what follows what

Reinforcement Learning 6 Dynamic Programming Model-based: computes optimal policies given a perfect model of the environment as a Markov decision process (MDP). Bootstraps: updates estimates based in part on other learned estimates, without waiting for a final outcome.
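Below is a minimal value-iteration sketch in Python illustrating the DP idea of bootstrapping from a perfect model; the transition structure P[s][a] (a list of (probability, next_state, reward) tuples), the discount gamma, and the threshold theta are illustrative assumptions, not anything specified on the slide.

```python
# Minimal value-iteration sketch for a finite MDP (illustrative, not code from the slides).
# Assumes a distribution model: P[s][a] = list of (prob, next_state, reward) tuples.

def value_iteration(P, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in P}                       # initial value estimates
    while True:
        delta = 0.0
        for s in P:
            q_values = [
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            ]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))  # track the largest change
            V[s] = best                           # bootstrap: update from other estimates
        if delta < theta:                         # stop once values have converged
            break
    # greedy policy with respect to the converged values
    policy = {
        s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
        for s in P
    }
    return V, policy
```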

Reinforcement Learning 7 Dynamic Programming

Reinforcement Learning 8 Monte Carlo Model-free. Does NOT bootstrap: the entire episode is used. Only one choice is backed up at each state (unlike DP, which branches over all actions). The time required to estimate one state does not depend on the total number of states.
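A first-visit Monte Carlo prediction sketch, assuming episodes are given as lists of (state, reward) pairs generated by some fixed policy (a hypothetical format); it averages complete sampled returns and needs no model of the environment.

```python
from collections import defaultdict

# First-visit Monte Carlo prediction (illustrative sketch, not from the slides).
# Each episode is a list of (state, reward) pairs; the reward in a pair is the
# reward received after leaving that state. Updates wait for the episode to end.

def mc_prediction(episodes, gamma=0.9):
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)
    for episode in episodes:
        first_visit = {}                          # first-visit index of each state
        for t, (s, _) in enumerate(episode):
            if s not in first_visit:
                first_visit[s] = t
        # work backwards to accumulate the return from each time step
        G = 0.0
        returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, r = episode[t]
            G = r + gamma * G
            returns[t] = G
        for s, t in first_visit.items():
            returns_sum[s] += returns[t]
            returns_count[s] += 1
            V[s] = returns_sum[s] / returns_count[s]   # average of sampled returns
    return V
```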

Reinforcement Learning 9 Monte Carlo [Backup diagram: Monte Carlo backs up each state along the full sampled episode, all the way to a terminal state]

Reinforcement Learning 10 Temporal Difference Model-free Bootstrap Partial episode included
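A tabular TD(0) update in Python, sketched under the assumption that the value function is stored in a dict; unlike Monte Carlo it bootstraps from the estimate of the next state instead of waiting for the end of the episode.

```python
# TD(0) prediction sketch (illustrative, not from the slides).

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular TD(0) update after observing the transition (s, r, s_next)."""
    td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)  # bootstrap from V(s')
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error
```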

Reinforcement Learning 11 Temporal Difference [Backup diagram: TD backs up each state from only the next state and reward, one step ahead]

Reinforcement Learning 12 Example: Driving home

Reinforcement Learning 13 Driving home [Figure: changes recommended by Monte Carlo methods vs. changes recommended by TD methods]

Reinforcement Learning 14 N-step TD Prediction MC and TD are extreme cases!

Reinforcement Learning 15 Averaging N-step Returns n-step methods were introduced to help with understanding TD(λ). Idea: back up an average of several n-step returns, e.g. half of the 2-step return and half of the 4-step return. This is called a complex backup: draw each component and label it with its weight.

Reinforcement Learning 16 Forward View of TD(λ) TD(λ) is a method for averaging all n-step backups, with the n-step return weighted proportionally to λ^(n-1) (and the weights normalized by a factor 1-λ). The resulting target is the λ-return, and the backup is made toward the λ-return.
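For reference, the n-step return and the λ-return behind the forward view, written out in standard Sutton & Barto notation (the slide's equation images are not reproduced here):

```latex
% n-step return: n real rewards, then bootstrap from the estimated value
G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^{n} V(s_{t+n})

% lambda-return: weighted average of all n-step returns
G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}
```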

Reinforcement Learning 17 Forward View of TD(λ) Look forward from each state to determine its update from future states and rewards.

Reinforcement Learning 18 Backward View of TD(λ) The forward view was for theory; the backward view is for mechanism. A new variable called the eligibility trace is introduced: on each step, decay all traces by γλ and increment the trace for the current state by 1 (an accumulating trace).
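A tabular backward-view TD(λ) step with accumulating traces, as a sketch of the mechanism just described; the dict-based storage and parameter values are assumptions for illustration.

```python
# Tabular TD(lambda), backward view with accumulating eligibility traces
# (illustrative sketch, not code from the slides).

def td_lambda_step(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.8):
    """One backward-view update after observing the transition (s, r, s_next)."""
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)   # TD error
    e[s] = e.get(s, 0.0) + 1.0                               # accumulate trace for current state
    for state in list(e):
        V[state] = V.get(state, 0.0) + alpha * delta * e[state]  # credit by eligibility
        e[state] *= gamma * lam                                   # decay all traces
    return delta
```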

Reinforcement Learning 19 Backward View Shout the TD error δ_t backwards over time; the strength of your voice decreases with temporal distance by a factor of γλ.

Reinforcement Learning 20 Forward View = Backward View The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating.

Adaptive Exploration in Reinforcement Learning. Relu Patrascu, Department of Systems Design Engineering, University of Waterloo, Waterloo, Ontario, Canada; Deborah Stacey, Dept. of Computing and Information Science, University of Guelph, Ontario, Canada

Reinforcement Learning 22 Objectives Explain the trade-off between exploitation and exploration. Introduce two categories of exploration methods: undirected exploration (ε-greedy exploration) and directed exploration (counter-based exploration, past-success directed exploration). Function approximation: the backpropagation algorithm and Fuzzy ARTMAP.

Reinforcement Learning 23 Introduction Main problem: how to make the learning process adapt to a non-stationary environment? Sub-problems: how to balance exploitation and exploration when the environment changes? How can the function approximators adapt to the environment?

Reinforcement Learning 24 Exploitation and Exploration Exploit or explore? To maximize reward, a learner must exploit the knowledge it already has, but it may also explore an action with small immediate reward that could yield more reward in the long run. An example: choosing a job. Suppose you are working at a small company with a $25,000 salary, and you have another offer from a large enterprise that starts at only $12,000. Keeping the job at the small company guarantees a stable income; working at the enterprise may offer more opportunities for promotion, which would increase your income in the long run.

Reinforcement Learning 25 Undirected Exploration Not biased: exploration is purely random. E.g. ε-greedy exploration: when it explores, it chooses equally among all actions, so it is as likely to choose the worst-appearing action as the next-to-best one.
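A minimal ε-greedy selector, sketched with an assumed dict-of-(state, action) Q-table; when it explores it picks uniformly at random, so the worst-looking action is chosen as often as the next-to-best one.

```python
import random

# epsilon-greedy action selection (undirected exploration); illustrative sketch.

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                             # explore: uniform over all actions
    return max(actions, key=lambda a: Q.get((state, a), 0.0))     # exploit: greedy action
```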

Reinforcement Learning 26 Directed Exploration Memorizes exploration-specific knowledge; exploration is biased by some features of the learning process. E.g. counter-based techniques favor actions that lead to states that have not been visited frequently. The main idea is to encourage the learner to explore parts of the state space that have not been sampled often, or that have not been sampled recently.
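A possible counter-based selector; the additive bonus beta / (1 + visit count) is one illustrative choice of count-based bias, not the specific rule used in the paper.

```python
# Counter-based (directed) exploration; illustrative sketch with a hypothetical bonus term.
# Actions leading into rarely visited (state, action) pairs are favoured by a bonus
# that shrinks as the visit count grows.

def counter_based_action(Q, counts, state, actions, beta=1.0):
    def score(a):
        bonus = beta / (1.0 + counts.get((state, a), 0))   # fewer visits -> larger bonus
        return Q.get((state, a), 0.0) + bonus
    a = max(actions, key=score)
    counts[(state, a)] = counts.get((state, a), 0) + 1     # remember the choice
    return a
```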

Reinforcement Learning 27 Past-success Directed Exploration Based on ε-greedy exploration, with a bias adapted to the environment from the learning process: increase the exploitation rate if the agent receives reward at an increasing rate; increase the exploration rate when it stops receiving reward. Uses the average discounted reward, which reflects the amount and frequency of received immediate rewards; the further back in time a reward was received, the less effect it has on the average.

Reinforcement Learning 28 Past-Success Directed Exploration The average discounted reward is defined as a sum of past rewards, each discounted by γ for every time step since it was received, where γ ∈ (0,1] is the discount factor and r_t is the reward received at time t. This quantity is then applied in the ε-greedy algorithm to adjust the exploration rate.
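One way the past-success idea could be wired up, sketched from the slide's description rather than the paper's exact rule: maintain a discounted running average of rewards and nudge ε down while the average rises, up when it stalls. The step size and ε bounds are assumed values.

```python
# Past-success directed exploration (sketch based on the slide's description).
# new_avg = gamma * avg + reward is the recursive form of a discounted sum of past rewards.

def update_exploration(avg_reward, reward, epsilon, gamma=0.9,
                       step=0.01, eps_min=0.01, eps_max=0.5):
    new_avg = gamma * avg_reward + reward          # discounted running average of rewards
    if new_avg > avg_reward:
        epsilon = max(eps_min, epsilon - step)     # reward increasing: exploit more
    else:
        epsilon = min(eps_max, epsilon + step)     # reward flat or falling: explore more
    return new_avg, epsilon
```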

Reinforcement Learning 29 Gradient Descent Method Why use a gradient descent method? RL applications often use a table to store the value function, but a large number of states makes this practically impossible. Solution: use a function approximator to predict the value. The error backpropagation algorithm suffers from catastrophic interference: it cannot learn incrementally in a non-stationary environment, because as it acquires new knowledge it forgets much of its previous knowledge.

Reinforcement Learning 30 Gradient Descent Method
  Initialize w arbitrarily and e = 0
  Repeat (for each episode):
    Initialize s
    Pass s through each network and obtain Q_a for every action a
    a ← arg max_a Q_a
    With probability ε: a ← a random action ∈ A(s)
    Repeat (for each step of episode):
      e ← γλ e
      e_a ← e_a + ∇_w Q_a
      Take action a, observe reward r and next state s'
      δ ← r − Q_a
      Pass s' through each network and obtain Q'_a for every action a
      a' ← arg max_a Q'_a
      With probability ε: a' ← a random action ∈ A(s')
      δ ← δ + γ Q'_a'
      w ← w + α δ e
      a ← a'
    until s' is terminal
  where a' ← arg max_a Q'_a means a' is set to the action for which the expression is maximal (the highest Q'), α is a constant step-size parameter (the learning rate), ∇_w Q_a is the partial derivative of Q_a with respect to the weights w, γ is the discount factor, e is the vector of eligibility traces, and λ ∈ (0, 1] is the eligibility trace parameter.

Reinforcement Learning 31 Fuzzy ARTMAP ARTMAP – Adaptive Resonance Theory mapping between an input vector and an output pattern; a neural network specifically designed to deal with the stability/plasticity dilemma. This dilemma means a neural network is not able to learn new information without damaging what was learned previously, similar to catastrophic interference.

Reinforcement Learning 32 Experiments Gridworld with a non-stationary environment. The learning agent can move up, down, left or right. There are two gates, and the agent must pass through one of them to get from the start state to the goal state. For the first 1000 episodes, gate 1 is open and gate 2 is closed; for the remaining episodes, gate 1 is closed and gate 2 is open. This tests how well the algorithm adapts to the changed environment.

Reinforcement Learning 33 Results Backpropagation algorithm: after the 1000th episode, the average discounted reward drops rapidly and monotonically, and the agent surges to maximum exploitation. Fuzzy ARTMAP: after the 1000th episode, the reward drops for a few episodes and then goes back to high values, with a temporary surge in exploration.

34 Planning and Learning Objectives: use of environment models; integration of planning and learning methods.

35 Models Model: anything the agent can use to predict how the environment will respond to its actions. Distribution model: a description of all possibilities and their probabilities. Sample model: produces sample experiences, e.g., a simulation model or a set of data. Both types of models can be used to produce simulated experience; often sample models are much easier to obtain.

Reinforcement Learning 36 Planning Planning: any computational process that uses a model to create or improve a policy. We take the following view: all state-space planning methods involve computing value functions, either explicitly or implicitly, and they all apply backups to simulated experience. [Diagram: model → simulated experience → backups → values → policy]

Reinforcement Learning 37 Learning, Planning, and Acting Two uses of real experience: model learning (to improve the model) and direct RL (to directly improve the value function and policy). Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL; here, we call it planning.

Reinforcement Learning 38 Direct vs. Indirect RL Indirect methods make fuller use of experience: they get a better policy with fewer environment interactions. Direct methods are simpler and are not affected by bad models. But the two are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel.

Reinforcement Learning 39 The Dyna-Q Architecture (Sutton 1990)

Reinforcement Learning 40 The Dyna-Q Architecture (Sutton 1990) Dyna uses experience to build a model (T, R), uses experience directly to adjust the policy, and uses the model to adjust the policy. For each interaction with the environment:
  1. use the experience to adjust the policy (direct RL): Q(s,a) ← Q(s,a) + α[ r + γ max_a' Q(s', a') − Q(s,a) ]
  2. use the experience to update the model (T, R): Model(s,a) ← (s', r)
  3. use the model to simulate experience and adjust the policy (planning): pick s ← a random previously observed state and a ← a random action previously taken in s; (s', r) ← Model(s, a); Q(s,a) ← Q(s,a) + α[ r + γ max_a' Q(s', a') − Q(s,a) ]
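A tabular Dyna-Q step mirroring the three numbered steps above; env.step(state, action) is a hypothetical environment interface, and the deterministic dict-based model is an assumption for illustration.

```python
import random

# Dyna-Q sketch: one real step of experience followed by n_planning simulated steps.

def dyna_q_step(Q, model, env, state, actions, alpha=0.1, gamma=0.95,
                epsilon=0.1, n_planning=5):
    # 1. act and use the real experience to adjust the policy (direct RL)
    if random.random() < epsilon:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda x: Q.get((state, x), 0.0))
    s_next, r = env.step(state, a)                      # hypothetical environment interface
    best_next = max(Q.get((s_next, x), 0.0) for x in actions)
    Q[(state, a)] = Q.get((state, a), 0.0) + alpha * (
        r + gamma * best_next - Q.get((state, a), 0.0))

    # 2. use the experience to update the model: Model(s, a) = (s', r)
    model[(state, a)] = (s_next, r)

    # 3. planning: replay simulated experience drawn at random from the model
    for _ in range(n_planning):
        (s, act), (s2, rew) = random.choice(list(model.items()))
        best = max(Q.get((s2, x), 0.0) for x in actions)
        Q[(s, act)] = Q.get((s, act), 0.0) + alpha * (
            rew + gamma * best - Q.get((s, act), 0.0))

    return s_next
```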

Reinforcement Learning 41 The Dyna-Q Algorithm [Algorithm box with its steps labelled: direct RL, model learning, planning]

Reinforcement Learning 42 Dyna-Q Snapshots: Midway in 2nd Episode

Reinforcement Learning 43 Dyna-Q Properties The Dyna algorithm requires about N times the computation of Q-learning per instance, but this is typically vastly less than that of a naïve model-based method. N can be determined by the relative speed of computation versus taking actions. What if the environment changes? It may change to become harder, or to become easier.

Reinforcement Learning 44 Blocking Maze The changed environment is harder

Reinforcement Learning 45 Shortcut Maze The changed environment is easier

Reinforcement Learning 46 What is Dyna-Q+? It uses an "exploration bonus": it keeps track of the time since each state-action pair was tried for real, and an extra reward is added for transitions caused by state-action pairs in proportion to how long ago they were tried; the longer a pair has gone unvisited, the more reward for visiting it. The agent thereby "plans" how to visit long-unvisited states.
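A sketch of how the exploration bonus can be applied during planning; the bonus form kappa * sqrt(tau) follows Sutton & Barto's Dyna-Q+, with kappa an assumed small constant and last_tried a hypothetical record of when each pair was last tried for real.

```python
import math

# Dyna-Q+ exploration bonus (sketch): during planning, add kappa * sqrt(tau) to the
# modelled reward, where tau is the number of time steps since (s, a) was last tried
# for real, so long-unvisited pairs look attractive to the planner.

def bonus_reward(modelled_reward, last_tried, current_time, s, a, kappa=0.001):
    tau = current_time - last_tried.get((s, a), 0)
    return modelled_reward + kappa * math.sqrt(tau)
```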

Reinforcement Learning 47 Prioritized Sweeping Updates from the model are no longer chosen at random. Instead, store additional information in order to make an appropriate choice of which updates to perform: store the change in each state value ΔV(s), and use it to set the priority of the predecessors of s according to their transition probabilities T(s, a, s'). [Figure: predecessor state-action pairs queued by priority, from high to low]
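A prioritized-sweeping planning loop sketch using a heap keyed by the magnitude of the expected value change; the predecessors map (pairs with a recorded transition into a state), the threshold theta, and the queue entry format (-priority, (state, action)) are illustrative assumptions.

```python
import heapq

# Prioritized sweeping sketch (illustrative): instead of planning from random
# state-action pairs, pop the pair whose backup is expected to change values most,
# then push its predecessors if their values are now expected to change.

def prioritized_sweeping(Q, model, predecessors, pqueue, actions,
                         alpha=0.1, gamma=0.95, theta=1e-4, n_planning=5):
    for _ in range(n_planning):
        if not pqueue:
            break
        _, (s, a) = heapq.heappop(pqueue)                 # highest-priority pair first
        s_next, r = model[(s, a)]
        best_next = max(Q.get((s_next, x), 0.0) for x in actions)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
            r + gamma * best_next - Q.get((s, a), 0.0))
        # queue predecessors whose estimated values are now expected to change
        for (s_pred, a_pred) in predecessors.get(s, []):
            _, r_pred = model[(s_pred, a_pred)]
            best = max(Q.get((s, x), 0.0) for x in actions)
            priority = abs(r_pred + gamma * best - Q.get((s_pred, a_pred), 0.0))
            if priority > theta:
                heapq.heappush(pqueue, (-priority, (s_pred, a_pred)))
    return Q
```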

Reinforcement Learning 48 Prioritized Sweeping

Reinforcement Learning 49 Prioritized Sweeping vs. Dyna-Q Both use N=5 backups per environmental interaction

Reinforcement Learning 50 Full and Sample (One-Step) Backups

Reinforcement Learning 51 Summary Emphasized the close relationship between planning and learning, and the important distinction between distribution models and sample models. Looked at some ways to integrate planning and learning: synergy among planning, acting, and model learning.

52 RL Recent Development: Problem Modeling

Model of environment    Completely Observable    Partially Observable
Known                   MDP                      Partially Observable MDP (POMDP)
Unknown                 Traditional RL           Hidden State RL

Reinforcement Learning 53 Research topics Exploration-exploitation tradeoff; the problem of delayed reward (credit assignment); input generalization (function approximators); multi-agent reinforcement learning: global goal vs. local goal, achieving several goals in parallel, agent cooperation and communication.

Reinforcement Learning 54 RL Application TD-Gammon (Tesauro 1992, 1994, 1995): 30 pieces and 24 locations imply an enormous number of configurations, with an effective branching factor of 400. Uses the TD(λ) algorithm with a multi-layer neural network, and plays near the level of the world's strongest grandmasters.

Reinforcement Learning 55 RL Application Elevator Dispatching Crites and Barto 1996

Reinforcement Learning 56 RL Application Elevator Dispatching: conservatively about 10^22 states. 18 hall call buttons: 2^18 combinations; positions and directions of cars: 18^4 (rounding to the nearest floor); motion states of cars (accelerating, moving, decelerating, stopped, loading, turning): 6; 40 car buttons: 2^40; 18 discretized real numbers giving the elapsed time since each hall button was pushed; the set of passengers riding each car and their destinations is observable only through the car buttons.

Reinforcement Learning 57 RL Application Dynamic Channel Allocation (Singh and Bertsekas 1997); Job-Shop Scheduling (Zhang and Dietterich 1995, 1996)

Reinforcement Learning 58 Q & A