1 Reinforcement Learning (RL)

2 Introduction Reinforcement learning involves an agent that solves the problem at hand by interacting with its environment. The agent learns which action to take in which situation from the signals it receives from the environment in response to the actions it takes in specific situations. By interacting with its environment and observing these response signals, the agent seeks to achieve its goal. This interaction (i.e., experience) gives the agent accumulated knowledge of whether, or to what extent, the environment favors the actions it takes.

3 Reinforcement Learning Two actors: –the learner, called the agent –the environment, the domain that provides feedback to the agent so it can optimize its actions. The agent learns from interaction with an unknown environment. The agent's interaction consists of the responses the environment emits to the actions it takes at every time instant.
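
To make this interaction loop concrete, here is a minimal Python sketch (not part of the original slides); the class names, the action and state labels, and the reward scheme are purely illustrative assumptions.

```python
import random

class ToyEnvironment:
    """Hypothetical environment: answers every action with a next state and a reward."""
    def step(self, action):
        next_state = random.choice(["good", "bad"])      # the new situation
        reward = 1.0 if next_state == "good" else -1.0   # the response signal
        return next_state, reward

class ToyAgent:
    """Hypothetical agent: for now it simply picks actions at random (no learning yet)."""
    def select_action(self, state):
        return random.choice(["left", "right"])

env, agent = ToyEnvironment(), ToyAgent()
state = "start"
for t in range(5):                               # a few time instants of interaction
    action = agent.select_action(state)          # agent acts in the current situation
    state, reward = env.step(action)             # environment responds
    print(f"t={t}: action={action!r} -> state={state!r}, reward={reward}")
```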

4 Elements of RL –Policy, or rule set governing agent behavior –Response or reward function –Value function –Model of the environment

5 Elements of RL...(2) Policy: the agent's way of behaving at a given time. A little more formally: a mapping from perceived states to actions. Analogy with psychology: stimulus-response rules. Response or reward function: maps each perceived state (or state-action pair) of the environment to a single number, the reward, indicating the immediate desirability of that state.

6 Elements of RL...(3) Value functions are functions of states (V^π(s)) or of state-action pairs (Q^π(s,a)) that give the agent a basis for deciding how good it is to be in a state, or to select an action in a specific state, respectively, under a specific policy π.
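
As a small illustration (an assumption of this writeup rather than the slides' own example), tabular state values V(s) and action values Q(s,a) can be stored as plain lookup tables keyed by state or by state-action pair:

```python
# Hypothetical tabular value functions for a tiny two-state, two-action problem.
V = {"A": 0.5, "B": 0.8}                            # V(s): how good it is to be in state s
Q = {("A", "left"): 0.4, ("A", "right"): 0.6,       # Q(s, a): how good it is to take
     ("B", "left"): 0.7, ("B", "right"): 0.9}       # action a in state s

best_action_in_A = max(["left", "right"], key=lambda a: Q[("A", a)])
print(best_action_in_A)  # "right": the higher-valued action in state A
```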

7 Elements of RL...(4) A model of the environment is a representation of the environment's behavior. Using a model of the environment it operates in, the agent can estimate the resulting next state and next reward. This information can in turn be used for planning; that is, the agent may build a structure of decisions by considering possible future situations before they are actually experienced.
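
A minimal sketch of such planning, under assumed dynamics: the `model` function, its deterministic transition and reward, and the value table below are hypothetical, used only to show how a model lets the agent evaluate a move before experiencing it.

```python
def model(state, action):
    """Hypothetical model: predicts the next state and reward without acting for real."""
    next_state = state + action          # assumed deterministic dynamics (states are integers)
    reward = -abs(next_state)            # assumed reward: positions near 0 are preferable
    return next_state, reward

def plan_one_step(state, actions, V):
    """Pick the action whose predicted next state and reward look best (one-step planning)."""
    def backed_up_value(action):
        next_state, reward = model(state, action)
        return reward + V.get(next_state, 0.0)
    return max(actions, key=backed_up_value)

V = {0: 1.0, 1: 0.2, -1: 0.2}            # hypothetical current state-value estimates
print(plan_one_step(1, actions=[-1, 0, +1], V=V))   # -1: the move predicted to reach state 0
```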

8 Examples of RL: Tavla The way we learn how to play tavla (backgammon): –The first times we play, we just apply the rules of the game and observe how the game proceeds (repetitions, or episodes, of the game). –As we start to associate situations, and the actions we take in them, with a good or bad end of the game, we implicitly assign values to those situations and actions according to how favorable or unfavorable they are. –Once we are experienced enough (i.e., after sufficiently many games), we start planning and building strategies.

9 Examples of RL: A Weekly Routine You are a soccer player getting ready for your soccer game. –Following your last class that day, you head for the locker room. At the desk, you ask the receptionist for a locker key. You get the key, write your name, leave your student ID, and go to the locker whose number matches the one on the key. You try the key and, hopefully, open the locker door. –You change your clothes and put on your soccer shorts and t-shirt. You hang up your regular outfit and put your sports bag into the locker. You place the key in a secure pocket. –etc. –As you can see, these routine activities are composed of many conditional behaviors and intermingled subgoal and goal relationships. –For more examples you may check page 6 of reference [1].

10 An Extended Example: Tic-Tac-Toe The goal in tic-tac-toe is to –place three of your marks in a row horizontally, vertically, or diagonally before your opponent does. The game is a draw if no player has three in a row when the board is full.

11 Tic-Tac-Toe...(2) How to define the states of the RL problem for this game: –States: each configuration of the board is a state. –Each state where three of your marks are in a row is a goal state. –Each state where three of the opponent's marks are in a row is a state you cannot win from. Figure 1: Two winning situations for the diamond player

12 Tic-Tac-Toe...(3) How to set up an RL problem for this game: –Each state is assigned a number representing the value of that state. –Assume a table of numbers, each representing the value of one state, where the number is an estimate of the probability of winning from that state. –This table is the learned state-value function. –One state is better than another if its current estimated probability of winning is higher.

13 Tic-Tac-Toe...(4) Initialization of state values: –We assign 1 to our goal states, since the probability of winning from them is 1 (i.e., at these states we have already won). –Similarly, 0 is assigned to the opponent's goal states, since we cannot win from these states (i.e., we have already lost the game). –All other states may initially be assigned 0.5, meaning we are equally likely to win or lose from them.
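
A minimal sketch of this initialization (not from the slides): a board position is represented as a 9-character string read left to right, top to bottom, with `X` as our mark and `O` as the opponent's; the function names are illustrative assumptions.

```python
# Rows, columns and diagonals of the 3x3 board, with cells indexed 0..8.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),    # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),    # columns
         (0, 4, 8), (2, 4, 6)]               # diagonals

def three_in_a_row(board, mark):
    """True if `mark` occupies all three cells of some line."""
    return any(all(board[i] == mark for i in line) for line in LINES)

def initial_value(board):
    """Initial estimate of the probability of winning from this board position."""
    if three_in_a_row(board, "X"):
        return 1.0    # our goal state: we have already won
    if three_in_a_row(board, "O"):
        return 0.0    # opponent's goal state: we cannot win from here
    return 0.5        # all other states: win or lose assumed equally probable

print(initial_value("XXX OO   "))   # 1.0
print(initial_value("XO X  O  "))   # 0.5
```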

14 Tic-Tac-Toe...(5) How to decide which state to move to / what action to take? –Whatever state we are at, we would like to approach one of our goal states while keeping away from the opponent's goal states. Why? –The goal states are the ones with the highest values; hence, from a state neighboring a goal state we should select the move that leads to the goal state. –In the same fashion, from any state we should select the move that takes us to the accessible state with the highest value. Why? The answer to both questions is that we want to move to, or be at, an accessible state with the highest possible value, since being at a higher-valued state makes it more probable that we will win the game.

15 Tic-Tac-Toe...(6) How to decide which state to move to?... (cont'd) –A move toward the accessible state (or states) with the highest value is called a greedy move. –If, in contrast, we choose to move to a state with an intermediate (i.e., not the highest) value, we are making an effort to find a state whose value may later prove higher than that of the currently best state; such a state would otherwise never be revealed, because its current estimate is low, reflecting outcomes that are less likely to happen. This type of move is called an exploratory move.
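
One common way to mix the two kinds of moves is an ε-greedy rule: with a small probability ε make an exploratory move, otherwise make the greedy move. The following sketch is an illustration with hypothetical state names and value estimates, not the slides' own procedure.

```python
import random

def choose_next_state(accessible_states, V, epsilon=0.1):
    """Exploratory move with probability epsilon, greedy move otherwise."""
    if random.random() < epsilon:
        return random.choice(accessible_states)                  # exploratory move
    return max(accessible_states, key=lambda s: V.get(s, 0.5))   # greedy move

V = {"s1": 0.5, "s2": 0.9, "s3": 0.2}                # hypothetical current value estimates
print(choose_next_state(["s1", "s2", "s3"], V))      # usually "s2", occasionally a random state
```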

16 Tic-Tac-Toe...(7) How to update state values?... (cont'd) As we move to the accessible state (or one of the states) with the highest possible value, we update the value of the source state (i.e., the state we departed) so that it moves closer to the value of the destination state (i.e., the state we arrived at), typically using the following learning rule: V(s) ← V(s) + α [ V(s′) − V(s) ], where s and s′ are the source and destination states, respectively, V(·) is the value function, and α is the learning-rate parameter that specifies how much the current update contributes to the state value. This learning rule is the temporal-difference learning method.
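
The same rule as a minimal Python sketch (an illustration, with a hypothetical two-state value table):

```python
def td_update(V, s, s_next, alpha=0.1):
    """Move V(s) a step of size alpha toward V(s'):  V(s) <- V(s) + alpha * (V(s') - V(s))."""
    V[s] += alpha * (V[s_next] - V[s])
    return V

V = {"source": 0.5, "destination": 1.0}   # e.g. the destination is one of our goal states
td_update(V, "source", "destination")
print(V["source"])                        # 0.55: the source state now looks slightly better
```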

17 References [1] Sutton, R. S. and Barto, A. G., Reinforcement Learning: An Introduction, MIT Press, 1998.