
REINFORCEMENT LEARNING LEARNING TO PERFORM BEST ACTIONS BY REWARDS Tayfun Gürel

ROADMAP: Introduction to the problem, Models of Optimal Behavior, An Immediate Reward Example and Solution, Markov Decision Process, Model Free Methods, Model Based Methods, Partially Observable Markov Decision Process, Applications, Conclusion

RL offers a way of programming agents by reward and punishment without specifying how the task is to be achieved (Kaelbling, 1996). It is based on trial-and-error interactions with the environment, and it names a set of problems rather than a set of techniques.

The standard reinforcement-learning model (figure: agent-environment loop). Labels: i = input, r = reward, s = state, a = action.

The reinforcement learning model consists of: a discrete set of environment states S; a discrete set of agent actions A; and a set of scalar reinforcement signals, typically {0, 1} or the real numbers (different from supervised learning).

An example dialog for the agent-environment relationship: Environment: You are in state 65. You have 4 possible actions. Agent: I'll take action 2. Environment: You received a reinforcement of 7 units. You are now in state 15. You have 2 possible actions. Agent: I'll take action 1. Environment: You received a reinforcement of -4 units. You are now in state 65. You have 4 possible actions. Agent: I'll take action 2. Environment: You received a reinforcement of 5 units. You are now in state 44. You have 5 possible actions. ...

Some Examples. Bioreactor: actions: stirring rate, temperature control; states: sensory readings related to chemicals; reward: instantaneous production rate of the target chemical. Recycling robot: actions: search for a can, wait, or recharge; states: low battery, high battery; reward: positive for having a can, negative for running out of battery.

ROADMAP: Introduction to the problem, Models of Optimal Behavior, An Immediate Reward Example and Solution, Markov Decision Process, Model Free Methods, Model Based Methods, Partially Observable Markov Decision Process, Applications, Conclusion

Models of Optimal Behavior: the agent tries to maximize one of the following criteria: the finite-horizon model, the infinite-horizon discounted model, or the average-reward model (see the formulas below).
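In standard notation, where r_t is the reward received at step t, γ ∈ (0, 1) is a discount factor, and h is the horizon, the three criteria are:

```latex
% finite-horizon model: expected reward over the next h steps
E\left[\sum_{t=0}^{h} r_t\right]

% infinite-horizon discounted model: future rewards discounted by \gamma
E\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]

% average-reward model: long-run average reward
\lim_{h \to \infty} E\left[\frac{1}{h}\sum_{t=0}^{h} r_t\right]
```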

ROADMAP: Introduction to the problem, Models of Optimal Behavior, An Immediate Reward Example and Solution, Markov Decision Process, Model Free Methods, Model Based Methods, Partially Observable Markov Decision Process, Applications, Conclusion

k-armed bandit: k gambling machines; h pulls are allowed. The machines are not equivalent: the agent tries to learn their payoff probabilities, facing a tradeoff between exploitation and exploration.

Solution Strategy 1: Dynamic Programming Approach. Maintain a belief state (n_1, w_1, ..., n_k, w_k), where n_i is the number of pulls of arm i so far and w_i the number of payoffs received from it. V*(n_1, w_1, ..., n_k, w_k) is the expected payoff remaining, and ρ_i is the probability that pulling arm i pays off.

Dynamic Programming Approach: if Σ_i n_i = h, then V*(n_1, w_1, ..., n_k, w_k) = 0, because there are no remaining pulls. All other values can then be computed recursively: V*(n_1, w_1, ..., n_k, w_k) = max_i [ ρ_i (1 + V*(..., n_i + 1, w_i + 1, ...)) + (1 − ρ_i) V*(..., n_i + 1, w_i, ...) ], updating the payoff probabilities ρ_i after each action.
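A minimal Python sketch of this recursion, assuming Bernoulli arms with a uniform prior so that ρ_i = (w_i + 1) / (n_i + 2); the function and variable names are illustrative, not from the original slides.

```python
from functools import lru_cache

def bandit_value(n, w, h):
    """Expected remaining payoff V*(n_1, w_1, ..., n_k, w_k) with h total pulls.

    n[i] = pulls of arm i so far, w[i] = payoffs observed from arm i.
    Assumes Bernoulli arms with a uniform prior, so the posterior payoff
    probability of arm i is (w[i] + 1) / (n[i] + 2).
    """
    @lru_cache(maxsize=None)
    def V(n_, w_):
        if sum(n_) >= h:                      # no pulls remaining
            return 0.0
        best = 0.0
        for i in range(len(n_)):
            rho = (w_[i] + 1) / (n_[i] + 2)
            n_next = n_[:i] + (n_[i] + 1,) + n_[i + 1:]
            w_win = w_[:i] + (w_[i] + 1,) + w_[i + 1:]
            # pull arm i: payoff with probability rho, then act optimally
            value = rho * (1 + V(n_next, w_win)) + (1 - rho) * V(n_next, w_)
            best = max(best, value)
        return best

    return V(tuple(n), tuple(w))

# e.g. two untried arms and a horizon of three pulls
print(bandit_value([0, 0], [0, 0], h=3))
```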

ROADMAP: Introduction to the problem, Models of Optimal Behavior, An Immediate Reward Example and Solution, Markov Decision Process, Model Free Methods, Model Based Methods, Partially Observable Markov Decision Process, Applications, Conclusion

MARKOV DECISION PROCESS. The k-armed bandit gives immediate reward; what about DELAYED REWARD? Characteristics of an MDP: a set of states S; a set of actions A; a reward function R: S × A → ℝ; a state transition function T: S × A → Π(S), where T(s, a, s') is the probability of a transition from s to s' using action a.

MDP EXAMPLE (figures: transition function; states and rewards). Bellman equation and greedy policy selection: see below.
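The Bellman optimality equation, together with greedy policy selection, in standard form:

```latex
V^{*}(s) = \max_{a}\Big[\, R(s,a) + \gamma \sum_{s'} T(s,a,s')\, V^{*}(s') \,\Big]
\qquad
\pi^{*}(s) = \arg\max_{a}\Big[\, R(s,a) + \gamma \sum_{s'} T(s,a,s')\, V^{*}(s') \,\Big]
```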

Value Iteration Algorithm. An alternative form of the iteration (Singh, 1993) is important for model-free learning. Stop the iteration when no V(s) changes by more than ε; the value of the resulting greedy policy then differs from the optimal value function by at most 2εγ / (1 − γ) (Williams & Baird, 1993b).
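A minimal Python sketch of value iteration over an explicit model, using the ε stopping test above; the MDP encoding (dictionaries T and R) is an illustrative assumption.

```python
def value_iteration(S, A, T, R, gamma=0.9, eps=1e-6):
    """Value iteration for a finite MDP.

    T[s][a] is a list of (next_state, probability) pairs and R[s][a] is
    the expected immediate reward.  Iterates until no state value changes
    by more than eps, then extracts the greedy policy.
    """
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            best = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a])
                       for a in A)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            break
    # greedy policy selection from the converged values
    pi = {s: max(A, key=lambda a: R[s][a] +
                 gamma * sum(p * V[s2] for s2, p in T[s][a]))
          for s in S}
    return V, pi
```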

Policy Iteration Algorithm. Policies typically converge faster than values. Why the faster convergence?
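A matching Python sketch of policy iteration, alternating iterative policy evaluation with greedy improvement, under the same illustrative MDP encoding as the value-iteration sketch above.

```python
def policy_iteration(S, A, T, R, gamma=0.9, eval_eps=1e-6):
    """Policy iteration for a finite MDP (same T/R encoding as above)."""
    pi = {s: A[0] for s in S}                 # arbitrary initial policy
    V = {s: 0.0 for s in S}
    while True:
        # policy evaluation: iterate V under the fixed policy pi
        while True:
            delta = 0.0
            for s in S:
                v = R[s][pi[s]] + gamma * sum(p * V[s2] for s2, p in T[s][pi[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < eval_eps:
                break
        # policy improvement: act greedily with respect to V
        stable = True
        for s in S:
            best = max(A, key=lambda a: R[s][a] +
                       gamma * sum(p * V[s2] for s2, p in T[s][a]))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:
            return V, pi
```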

POLICY ITERATION ON GRID WORLD (a sequence of figure-only slides showing successive iterations of the policy on a grid world).

MDP Graphical Representation (figure). The edge labels β, α stand for transition probabilities T(s, action, s'). Note the similarity to HMMs.

ROADMAP: Introduction to the problem, Models of Optimal Behavior, An Immediate Reward Example and Solution, Markov Decision Process, Model Free Methods, Model Based Methods, Partially Observable Markov Decision Process, Applications, Conclusion

Model Free Methods. Models of the environment: T: S × A → Π(S) and R: S × A → ℝ. Do we know them? Do we have to know them? Three approaches: Monte Carlo methods, the Adaptive Heuristic Critic, and Q-learning.

Monte Carlo Methods. Idea: hold statistics about the rewards observed from each state and take their average; that average is V(s). Based only on experience. Assumes episodic tasks (experience is divided into episodes, and all episodes terminate regardless of the actions selected). Incremental in an episode-by-episode sense, not a step-by-step sense.
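A minimal Python sketch of first-visit Monte Carlo evaluation of V(s) from complete episodes; the episode representation (a list of (state, reward) pairs per episode, with the reward recorded at each step) is an illustrative assumption.

```python
from collections import defaultdict

def mc_evaluate(episodes, gamma=1.0):
    """First-visit Monte Carlo estimate of V(s) from complete episodes.

    episodes: list of episodes, each a list of (state, reward) pairs
    generated by following some fixed policy until termination.
    V(s) is the average of the returns observed after the first visit
    to s in each episode.
    """
    returns = defaultdict(list)
    for episode in episodes:
        # return following each time step, computed backwards
        G, rets = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            rets[t] = G
        # record the return after the *first* visit to each state
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns[s].append(rets[t])
    return {s: sum(g) / len(g) for s, g in returns.items()}
```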

Problem: unvisited state-action pairs (the problem of maintaining exploration). For every pair (s, a), make sure that P((s, a) is selected as the starting state and action) > 0 (the assumption of exploring starts).

Monte Carlo Control. How to select policies (figure; similar to policy evaluation).

ADAPTIVE HEURISTIC CRITIC & TD(λ). How the AHC learns: the TD(0) algorithm (figure: the AHC architecture and the TD algorithm).
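For reference, the standard TD(0) update of the state-value estimate after a transition from s to s' with reward r is:

```latex
V(s) \leftarrow V(s) + \alpha \big[\, r + \gamma V(s') - V(s) \,\big]
```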

Q-LEARNING. Q-values in value iteration: Q(s, a) = R(s, a) + γ Σ_{s'} T(s, a, s') max_{a'} Q(s', a'). But we don't know T and R. Instead, use the update Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ] after each observed transition (s, a, r, s'). If α is decayed properly, the Q-values will converge (Singh, 1994).
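A minimal tabular Q-learning sketch in Python with ε-greedy exploration; the environment interface (env.reset(), env.step()) and the hyperparameters are illustrative assumptions, not part of the original slides.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning.

    env is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done).
    """
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection keeps exploration alive
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(a)
            # the Q-learning update: no model of T or R is needed
            target = r + gamma * max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```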

Q-LEARNING, CRITIQUE: simpler than AHC learning; exploration sensitive; the analogue of value iteration in MDPs; the most popular model-free learning algorithm.

ROADMAP: Introduction to the problem, Models of Optimal Behavior, An Immediate Reward Example and Solution, Markov Decision Process, Model Free Methods, Model Based Methods, Partially Observable Markov Decision Process, Applications, Conclusion

Model Based Methods. Model-free methods do not learn the model parameters, which is an inefficient use of data. Instead, learn the model.

Certainty Equivalent Methods: first learn the models of the environment by keeping statistics, then learn the actions to take. Objections: the division between the learning phase and the acting phase is arbitrary; initial data gathering is a problem (how to choose an exploration strategy without knowing the model?); and what about changes in the environment? It is better to learn the model and to act simultaneously.

DYNA. After taking action a, moving from s to s' with reward r: 1. Update the transition model T and the reward function R. 2. Update the Q-value for (s, a). 3. Do k more updates on randomly chosen (s, a) pairs.
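A Dyna-Q style sketch of one such step, kept deliberately simple (a deterministic memorized model rather than full transition statistics); the function signature and hyperparameters are illustrative assumptions.

```python
import random
from collections import defaultdict

def dyna_q_step(Q, model, s, a, r, s2, actions, k=10, alpha=0.1, gamma=0.9):
    """One Dyna-style step after observing the real transition (s, a, r, s2).

    Q is a defaultdict(float) over (state, action) pairs; model maps
    (state, action) -> (reward, next_state) for transitions seen so far.
    """
    # 1. update the (here deterministic) model of T and R
    model[(s, a)] = (r, s2)
    # 2. direct Q update from the real experience
    target = r + gamma * max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    # 3. k extra updates on randomly chosen remembered (s, a) pairs
    for _ in range(k):
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        target = pr + gamma * max(Q[(ps2, b)] for b in actions)
        Q[(ps, pa)] += alpha * (target - Q[(ps, pa)])

# usage (illustrative):
# Q, model = defaultdict(float), {}
# dyna_q_step(Q, model, s='s1', a='go', r=1.0, s2='s2', actions=['go', 'stay'])
```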

ROADMAP: Introduction to the problem, Models of Optimal Behavior, An Immediate Reward Example and Solution, Markov Decision Process, Model Free Methods, Model Based Methods, Partially Observable Markov Decision Process, Applications, Conclusion

POMDPs. What if the state information (from sensors) is noisy? That is mostly the case! MDP techniques are then suboptimal: two halls that look alike to the sensors are not the same state.

POMDPs – A Solution Strategy. SE: a belief state estimator (can be based on an HMM). π: the policy, computed with MDP techniques over the belief state.
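For reference, the belief-state update that SE computes after taking action a and receiving observation o has the standard form (O is the observation model):

```latex
b'(s') = \frac{O(s', a, o)\,\sum_{s} T(s, a, s')\, b(s)}{\Pr(o \mid a, b)}
```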

ROADMAP: Introduction to the problem, Models of Optimal Behavior, An Immediate Reward Example and Solution, Markov Decision Process, Model Free Methods, Model Based Methods, Generalization, Partially Observable Markov Decision Process, Applications, Conclusion

APPLICATIONS: juggling robot, dynamic programming (Schaal & Atkeson, 1994); box-pushing robot, Q-learning (Mahadevan & Connell, 1991a); disk-collecting robot, Q-learning (Mataric, 1994).

ROADMAP: Introduction to the problem, Models of Optimal Behavior, An Immediate Reward Example and Solution, Markov Decision Process, Model Free Methods, Model Based Methods, Generalization, Partially Observable Markov Decision Process, Applications, Conclusion

Conclusion. RL is not supervised learning: it is planning rather than classification. Performance is poor on large problems, so new methods are needed (e.g., shaping, imitation, reflexes). RL/MDPs extend HMMs. How?

MDP as a graph (figure). Is it possible to represent it as an HMM?

Relation to HMMs. The recycling robot example revisited as an HMM-style problem (figure: a dynamic network with nodes Battery_{t-1}, Battery_t, Battery_{t+1} and Action_{t-1}, Action_t, Action_{t+1}; Battery = {low, high}, Action = {wait, search, recharge}). It is not representable as an HMM.

HMMs vs MDPs. Once we have the MDP representation of a problem, we can do inference just as with an HMM by converting it to a probabilistic automaton; the reverse is not possible, because HMMs lack actions and rewards. Use HMMs to do probabilistic reasoning over time; use MDPs/RL to optimize behavior.