Reinforcement Learning (II.) Exercise Solutions. Ata Kaban, School of Computer Science, University of Birmingham.

Exercise: The diagram below depicts an MDP model of a fierce battle.

You can move between two locations, L1 and L2, one of which is closer to the adversary. If you attack from the closer location:
– you have a higher chance (90%) of succeeding (compared with only 70% from the farther location);
– however, you could also be detected (with an 80% chance) and killed (the chance of being detected at the farther location is 50%).
You can only be detected if you stay in the same location. You need to come up with an action plan for this situation.

The arrows represent the possible actions:
– 'move' (M) is a deterministic action;
– 'attack' (A) and 'stay' (S) are stochastic.
For the stochastic actions, the probabilities of transitioning to the next state are indicated on the arrows. All rewards are 0, except in the terminal states, where your success is represented by a reward of +50 and your adversary's success by a reward of −50 for you. Employing a discount factor of 0.9, compute an optimal policy (action plan).
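To fix notation for the computation below, the model described above can be summarised as follows. This is a sketch based on the prose; the exact arrows are on the diagram, and it is assumed here that L1 is the closer location and that a failed attack (or an undetected stay) leaves you in your current location:
– States: L1 (closer to the adversary), L2 (farther), plus two terminal states: 'win' (reward +50) and 'lose' (reward −50).
– 'move' (M): switches deterministically between L1 and L2, reward 0.
– 'attack' (A): from L1 you win with probability 0.9, otherwise you remain in L1; from L2 you win with probability 0.7, otherwise you remain in L2.
– 'stay' (S): from L1 you are detected and lose with probability 0.8, otherwise you remain in L1; from L2 you are detected and lose with probability 0.5, otherwise you remain in L2.
– Discount factor: 0.9.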

Solution. We need to compute the action values for all states and actions. Denote by Q(s, a) the value of taking action a in state s, and by V(s) = max_a Q(s, a) the corresponding state value. In value iteration, we start with the initial estimates Q(s, a) = 0 (for all non-terminal states). Then we update all action values according to the update rule:

Q(s, a) ← Σ_{s'} P(s' | s, a) [ R(s') + γ · V(s') ]

where P(s' | s, a) is the probability of landing in state s' after taking action a in state s, R(s') is the reward received on entering s' (zero except for the terminal states), γ = 0.9 is the discount factor, and V(s') = max_{a'} Q(s', a'), taken to be 0 for the terminal states since the episode ends there.
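As an illustration of this update rule, here is a minimal Python sketch of the Q-value iteration. It uses the transition structure assumed above (L1 closer, a failed attack leaves you in place); since the original diagram is not reproduced, treat the transition table as an assumption rather than as the definitive model, and note that the state and action names are chosen for readability.

GAMMA = 0.9

# transitions[state][action] = list of (probability, next_state, reward)
# 'win' and 'lose' are absorbing terminal states; their rewards (+50 / -50)
# are received on entering them.
transitions = {
    "L1": {  # assumed to be the location closer to the adversary
        "move":   [(1.0, "L2", 0)],
        "attack": [(0.9, "win", 50), (0.1, "L1", 0)],   # assumption: failed attack leaves you in L1
        "stay":   [(0.8, "lose", -50), (0.2, "L1", 0)],
    },
    "L2": {  # the farther location
        "move":   [(1.0, "L1", 0)],
        "attack": [(0.7, "win", 50), (0.3, "L2", 0)],   # assumption: failed attack leaves you in L2
        "stay":   [(0.5, "lose", -50), (0.5, "L2", 0)],
    },
}

# Initial estimates: Q(s, a) = 0 for all non-terminal states.
Q = {s: {a: 0.0 for a in acts} for s, acts in transitions.items()}

def state_value(q_table, s):
    # V(s) = max_a Q(s, a); terminal states have value 0 (the episode ends there).
    return max(q_table[s].values()) if s in q_table else 0.0

# Update the whole Q table only after each sweep over all state-action pairs,
# and stop once the values barely change between successive iterations.
for iteration in range(1, 101):
    new_Q = {
        s: {a: sum(p * (r + GAMMA * state_value(Q, s2)) for p, s2, r in outcomes)
            for a, outcomes in acts.items()}
        for s, acts in transitions.items()
    }
    delta = max(abs(new_Q[s][a] - Q[s][a]) for s in Q for a in Q[s])
    Q = new_Q
    if delta < 1e-6:
        break

for s, action_values in Q.items():
    best = max(action_values, key=action_values.get)
    print(s, {a: round(v, 2) for a, v in action_values.items()}, "-> best action:", best)

Under these assumed dynamics the iteration converges with 'attack' as the greedy action in both locations, which matches the conclusion reached below.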

Here the Q table is updated only after each full iteration (having swept through all state-action pairs). In the first iteration of the algorithm we compute the values of the 'attack' and 'stay' actions in both locations; the values for the 'move' action stay the same (at 0). After this iteration, the values of the two states are V(s) = max_a Q(s, a), and in both states they correspond to the action of 'attacking'.
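For concreteness, the first sweep under the transition structure assumed above would give (the numbers are illustrative, since they depend on details shown only on the original diagram):

Q(L1, A) = 0.9·(50 + 0.9·0) + 0.1·(0 + 0.9·0) = 45
Q(L1, S) = 0.8·(−50) + 0.2·(0 + 0.9·0) = −40
Q(L1, M) = 1.0·(0 + 0.9·0) = 0
Q(L2, A) = 0.7·50 + 0.3·0 = 35,  Q(L2, S) = 0.5·(−50) + 0.5·0 = −25,  Q(L2, M) = 0

so V(L1) = 45 and V(L2) = 35, both attained by 'attack'.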

The next iteration repeats the same updates, but now the non-zero values of L1 and L2 feed in as successor values. The new V-values (obtained by computing the max over the actions) again correspond to the 'attack' action in both states.
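Continuing the illustrative computation under the same assumptions, the second sweep for the closer location would give

Q(L1, A) = 0.9·(50 + 0.9·0) + 0.1·(0 + 0.9·45) = 45 + 4.05 = 49.05
Q(L1, M) = 1.0·(0 + 0.9·35) = 31.5
Q(L1, S) = 0.8·(−50) + 0.2·(0 + 0.9·45) = −40 + 8.1 = −31.9

and analogously Q(L2, A) = 44.45, Q(L2, M) = 40.5, Q(L2, S) = −9.25, so the maximum in both states is still achieved by 'attack'.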

This process can continue until the values no longer change much between successive iterations. From what we can see at this point, the best action plan appears to be to attack all the time.
Note:
– Designing the parameter setting for a situation according to its conditions is up to the human, not the machine.
– In this exercise all the parameters were given, but in your potential future real applications, in order to use RL successfully and make a robot learn to do what you want it to, you need to come up with appropriate reward values.