Reinforcement Learning
Gabriella Kókai, Lehrstuhl für Informatik 2: Machine Learning

Literature
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto

Context
➔ Introduction
The Learning Task
Q Learning
Nondeterministic Rewards and Actions
Summary

Introduction
Situation: A robot or agent has a set of sensors to observe the state of its environment and a set of actions it can perform to alter this state. The agent knows only the current state and the actions from which it can choose.
Learning strategy: A reward function assigns a numerical value to each distinct action the agent may take from each distinct state. The task of the robot is to perform sequences of actions, observe their consequences, and learn a control policy that chooses, from any initial state, the actions that maximise the reward accumulated over time.
Problem areas: learning to control a mobile robot, learning to optimise operations in a factory, or learning to play board games; for example, teaching a robot to dock onto its battery charger whenever the battery level runs low.

Context
Introduction
➔ The Learning Task
    Classification of the Problem
    The Markov Decision Process (MDP)
    Goal of the Learning
    Example
Q Learning
Nondeterministic Rewards and Actions
Summary

Classification of the Problem
An agent interacts with its environment. The agent exists in an environment described by some set of possible states S and can perform any of a set of possible actions A. Each time it performs an action a_t in state s_t, it receives a real-valued reward r_t that indicates the immediate value of this state-action transition. This produces a sequence of states s_i, actions a_i, and immediate rewards r_i. The agent's task is to learn a control policy π: S → A that maximizes the expected sum of the immediate rewards and the future rewards, exponentially discounted by their delay.
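A minimal sketch of this interaction loop; the `env` interface (`reset`, `step`) and `choose_action` policy are illustrative stand-ins, not part of the lecture:

```python
# Generic agent-environment interaction loop as described on this slide.
def run_agent(env, choose_action, gamma=0.9, steps=100):
    """Interact for a fixed number of steps and return the discounted return."""
    s = env.reset()                      # initial state s_0
    discounted_return = 0.0
    for t in range(steps):
        a = choose_action(s)             # policy pi: S -> A
        s_next, r = env.step(a)          # environment yields reward r_t and next state
        discounted_return += (gamma ** t) * r
        s = s_next
    return discounted_return
```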

Classification of the Problem (2)
Consider specific settings:
The actions have deterministic or nondeterministic outcomes.
The agent has or does not have prior knowledge about the effects of its actions on the environment (the reward function r and the resulting states s).
Is the agent trained by an expert?
Differences to other function approximation tasks:
Delayed rewards: The trainer provides only a sequence of immediate reward values as the agent executes its sequence of actions => temporal credit assignment: determining which of the actions in the sequence are to be credited with producing the eventual rewards.
Exploration: Which experimentation strategy produces the most effective learning? Should the agent explore unknown states and actions, or exploit the states and actions it has already learned to yield high reward?
Partially observable state: In practical situations sensors provide only partial information.
Life-long learning: The robot learns several tasks within the same environment using the same sensors => previously obtained experience can be reused.

The Markov Decision Process (MDP)
The process is deterministic. The agent can perceive a set of distinct states S in its environment and has a set of actions A it is allowed to perform.
At each discrete time step t, the agent senses the current state s_t, chooses a current action a_t and performs it. The environment responds by giving the agent a reward r_t = r(s_t, a_t) and by producing the succeeding state s_{t+1} = δ(s_t, a_t).
In an MDP, δ(s_t, a_t) and r(s_t, a_t) depend only on the current state and action, not on earlier ones. They are part of the environment and are not necessarily known to the agent. S and A are finite.
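In symbols (a reconstruction of the dropped notation, not a verbatim copy of the slide):

```latex
s_{t+1} = \delta(s_t, a_t), \qquad r_t = r(s_t, a_t)
```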

Goal of the Learning
Goal: Learn a policy π: S → A for selecting the next action a_t based on the currently observed state s_t, i.e. π(s_t) = a_t.
Approach: require the policy that produces the greatest possible cumulative reward V^π(s).
Precisely, the agent's learning task is to learn a policy π that maximizes V^π(s) for all states s. Call this the optimal policy, denoted π*.
Simplify the notation: write V*(s) for the maximum discounted cumulative reward that the agent can obtain starting from s.
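The formulas dropped from this slide are reconstructed below in the standard notation (γ with 0 ≤ γ < 1 is the discount factor):

```latex
V^{\pi}(s_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots
            = \sum_{i=0}^{\infty} \gamma^i\, r_{t+i}
```

```latex
\pi^* \equiv \operatorname*{argmax}_{\pi}\, V^{\pi}(s)\ \ (\forall s),
\qquad V^*(s) \equiv V^{\pi^*}(s)
```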

Example
[Figure: a grid world showing the immediate reward values r(s,a), the state values V*(s), the Q(s,a) values, and one optimal policy]
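The concrete numbers are not recoverable from this transcript. Assuming the slide shows the standard grid-world example from Mitchell (Ch. 13), with reward 100 for entering the absorbing goal state G, reward 0 otherwise, and γ = 0.9, the values follow this pattern:

```latex
V^*(s_{\text{1 move from }G}) = 100, \quad
V^*(s_{\text{2 moves}}) = 0.9 \cdot 100 = 90, \quad
V^*(s_{\text{3 moves}}) = 0.9^2 \cdot 100 = 81
```

For instance, an action a that moves from s into a state with V* = 100 (without entering G) has Q(s,a) = 0 + 0.9 · 100 = 90.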

Context
Introduction
The Learning Task
➔ Q Learning
    The Q Function
    An Algorithm for Learning Q
    An Illustrative Example
    Convergence
    Experimentation Strategies
    Updating Sequence
Nondeterministic Rewards and Actions
Summary

The Q Function
Problem: There is no training data of the form ⟨s, π(s)⟩; instead, the only training information available to the learner is the sequence of immediate rewards r(s_i, a_i). The agent can learn a numerical evaluation function such as V*: the agent prefers state s1 to state s2 whenever V*(s1) > V*(s2). But how can the agent choose among actions?
Solution: The optimal action in a state s is the action a that maximizes the sum of the immediate reward r(s,a) and the value V* of the immediate successor state, discounted by γ (see below).
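Expressed as a formula (a reconstruction of the dropped equation):

```latex
\pi^*(s) = \operatorname*{argmax}_{a}\ \bigl[\, r(s,a) + \gamma\, V^*(\delta(s,a)) \,\bigr]
```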

The Q Function (2)
Problem: Using this rule would require perfect knowledge of the immediate reward function r and the state-transition function δ.
Solution: Define Q(s,a) as the sum of the reward received immediately upon executing action a from state s, plus the value gained by following the optimal policy thereafter.
Advantage of using Q instead: the agent is able to select optimal actions even when it has no knowledge of the functions r and δ; it only needs to pick the action that maximizes Q(s,a).
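The definition of Q and the resulting action-selection rule, again reconstructed in the usual notation:

```latex
Q(s,a) \equiv r(s,a) + \gamma\, V^*(\delta(s,a)),
\qquad
\pi^*(s) = \operatorname*{argmax}_{a}\ Q(s,a)
```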

An Algorithm for Learning Q
How can Q be learned? Noting that V*(s) = max_{a'} Q(s,a'), a transformation gives a recursive definition of Q (Watkins 1989).
Let Q̂ refer to the learner's estimate, or hypothesis, of the actual Q function. It is represented by a large table with a separate entry for each state-action pair; the table can initially be filled with random values. The agent acts as before and, in addition, updates the table entry Q̂(s,a) after each transition. Q learning propagates the estimates backward from the new state to the old one.
Episode: During each episode, the agent begins at some randomly chosen state and is allowed to execute actions until it reaches the absorbing goal state. When it does, the episode ends and the agent is transported to a new, randomly chosen, initial state for the next episode.
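The recursive definition and the resulting training rule, reconstructed in the usual notation (s' denotes the observed successor state):

```latex
Q(s,a) = r(s,a) + \gamma \max_{a'} Q(\delta(s,a),\, a'),
\qquad
\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s',\, a')
```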

An Algorithm for Learning Q (2)
Algorithm:
For each (s, a) pair, initialise the table entry Q̂(s, a) to zero.
Observe the current state s.
Do forever:
    Select an action a and execute it.
    Receive the immediate reward r.
    Observe the new state s'.
    Update the table entry: Q̂(s, a) ← r + γ max_{a'} Q̂(s', a').
    s ← s'.
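A runnable sketch of this tabular algorithm for a deterministic MDP; the tiny two-state chain, goal reward 100 and γ = 0.9 are illustrative assumptions, not data from the slides:

```python
import random
from collections import defaultdict

GAMMA = 0.9
ACTIONS = ["left", "right"]

# Deterministic transition function delta(s, a) -> s'; "G" is the absorbing goal.
DELTA = {("s1", "right"): "s2", ("s2", "right"): "G",
         ("s2", "left"): "s1", ("s1", "left"): "s1"}

def reward(s, a):
    """Immediate reward r(s, a): 100 for entering the goal, 0 otherwise (assumed)."""
    return 100 if DELTA.get((s, a)) == "G" else 0

Q = defaultdict(float)                     # Q-hat table, initialised to zero

for episode in range(1000):
    s = random.choice(["s1", "s2"])        # random non-goal start state
    while s != "G":
        a = random.choice(ACTIONS)         # purely random exploration for simplicity
        s_next = DELTA[(s, a)]
        r = reward(s, a)
        # Q-learning update: Q(s,a) <- r + gamma * max_a' Q(s', a')
        Q[(s, a)] = r + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS)
        s = s_next

print(Q[("s1", "right")])                  # approaches gamma * 100 = 90
```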

An Algorithm for Learning Q (3)
Two general properties of this Q learning algorithm hold for any deterministic MDP in which the rewards are non-negative, assuming the Q̂ values are initialised to zero:
The Q̂ values never decrease during training: (∀ s, a, n) Q̂_{n+1}(s,a) ≥ Q̂_n(s,a).
Every Q̂ value remains in the interval between zero and its true Q value: (∀ s, a, n) 0 ≤ Q̂_n(s,a) ≤ Q(s,a).

An Illustrative Example
The diagram on the left shows the initial state of the robot and several relevant Q̂ values in its initial hypothesis. When the robot executes an action it receives immediate reward r = 0 and transitions to the neighbouring state. It then updates its estimate Q̂ for the executed state-action pair based on its Q̂ estimates for this new state (see the worked update below).
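The dropped computation is presumably the one from Mitchell's grid-world figure, where the robot moves right from s1 to s2, γ = 0.9, and the current estimates for s2 are 63, 81 and 100; under that assumption it reads:

```latex
\hat{Q}(s_1, a_{\text{right}}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a')
= 0 + 0.9 \cdot \max\{63,\, 81,\, 100\} = 90
```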

Convergence
Will the algorithm converge to Q? YES, but under some conditions:
The system is a deterministic MDP.
The immediate reward values are bounded: |r(s,a)| < c for all (s,a).
The agent selects actions in such a fashion that it visits every possible state-action pair infinitely often.
Theorem: Consider a Q learning agent in a deterministic MDP with bounded rewards. The agent uses the training rule above, initialises its table Q̂ to arbitrary finite values, and uses a discount factor γ such that 0 ≤ γ < 1. Let Q̂_n denote the agent's hypothesis following the nth update. If each state-action pair is visited infinitely often, then Q̂_n(s,a) converges to Q(s,a) as n → ∞, for all s and a.
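The core of the convergence argument (following Mitchell, Ch. 13) is that the largest error in the table shrinks by a factor of γ over every interval during which each state-action pair is updated at least once:

```latex
\Delta_n \equiv \max_{s,a}\, \bigl|\hat{Q}_n(s,a) - Q(s,a)\bigr|,
\qquad \Delta_{n+1} \le \gamma\, \Delta_n
```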

Experimentation Strategies
Question: How are actions chosen during training?
One option: the agent selects, in state s, the action a that maximizes Q̂(s, a).
Disadvantage: the agent will prefer actions found to have high Q̂ values early in training and will fail to explore other actions that might have even higher values.
Alternative: use a probabilistic approach to selecting actions. The probability P(a_i | s) of selecting action a_i, given that the agent is in state s, grows with Q̂(s, a_i); k > 0 is a constant that determines how strongly the selection favours actions with high Q̂ values (see the formula below).
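The dropped selection rule is presumably the exponentially weighted choice used by Mitchell:

```latex
P(a_i \mid s) = \frac{k^{\hat{Q}(s, a_i)}}{\sum_{j} k^{\hat{Q}(s, a_j)}}, \qquad k > 0
```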

Updating Sequence
Possibilities to improve the convergence:
If all Q̂ values are initialised to 0, then after the first full episode only one entry in the agent's table will have changed: the entry corresponding to the final transition into the goal state. The values then propagate backward over later episodes.
First strategy: train on the same state-action transitions, but in reverse chronological order for each episode, i.e. apply the same update rule to each transition, but perform the updates in reverse order => convergence in fewer iterations, but higher memory usage, since the whole episode must be stored (see the sketch below).
Second strategy: store past state-action transitions together with the observed rewards and retrain on them periodically.
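A sketch of the first strategy; it assumes a Q table (dict), a discount factor and an action list such as those in the earlier Q-learning sketch:

```python
def replay_episode_backward(Q, episode, gamma, actions):
    """episode: list of (s, a, r, s_next) tuples in the order they occurred."""
    for s, a, r, s_next in reversed(episode):
        # Same update rule as before, applied from the goal backwards, so each
        # update already sees the improved estimate of its successor state.
        Q[(s, a)] = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
```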

Context
Introduction
The Learning Task
Q Learning
➔ Nondeterministic Rewards and Actions
Summary

Nondeterministic Rewards and Actions
In such cases δ(s,a) and r(s,a) can be viewed as first producing a probability distribution over outcomes based on s and a, and then drawing an outcome at random according to this distribution.
The change in the Q algorithm: first, V^π is redefined as the expected value of the discounted cumulative reward; Q and its recursive definition are generalised accordingly (see below).
The training rule is modified so that it takes a decaying weighted average of the current Q̂ value and the revised estimate, where visits_n(s,a) is the total number of visits of this state-action pair up to and including the nth iteration.
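The dropped formulas are presumably the standard nondeterministic generalisation (as in Mitchell, Ch. 13); a reconstruction:

```latex
Q(s,a) \equiv E\bigl[r(s,a)\bigr]
  + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q(s', a')
```

```latex
\hat{Q}_n(s,a) \leftarrow (1 - \alpha_n)\,\hat{Q}_{n-1}(s,a)
  + \alpha_n \bigl[\, r + \gamma \max_{a'} \hat{Q}_{n-1}(s', a') \,\bigr],
\qquad \alpha_n = \frac{1}{1 + \mathit{visits}_n(s,a)}
```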

Context
Introduction
The Learning Task
Q Learning
Nondeterministic Rewards and Actions
➔ Summary

Summary
Reinforcement learning addresses the problem of learning control strategies for autonomous agents.
Training information is available in the form of a real-valued reward signal given for each state-action transition.
The goal of the agent is to learn an action policy that maximizes the total reward received from any starting state.
Q learning addresses a restricted class of such problems, namely Markov decision processes.
The Q function Q(s,a) is defined as the maximum of the expected, discounted, cumulative reward that the agent can achieve by applying action a in state s.
Q̂ is represented by a lookup table with a distinct entry for each ⟨s, a⟩ pair.
Convergence can be shown for both deterministic and nondeterministic MDPs.