Worksheet I. Exercise Solutions Ata Kaban School of Computer Science University of Birmingham

Worked exercises on Sequence Models
(i) In a casino, two differently loaded but identical-looking dice are thrown in repeated runs. The frequencies of the numbers observed in 40 rounds of play are as follows:
Die 1, [Nr, Frequency]: [1,5], [2,3], [3,10], [4,1], [5,10], [6,11]
Die 2, [Nr, Frequency]: [1,10], [2,11], [3,4], [4,10], [5,3], [6,2]
Characterise each die by the random sequence model that generated its observations. That is, estimate the parameters of the random sequence model for both dice.
ANSWER: The maximum-likelihood estimate of each probability is the relative frequency, count/40:
Die 1, [Nr, P_1(Nr)]: [1,0.125], [2,0.075], [3,0.250], [4,0.025], [5,0.250], [6,0.275]
Die 2, [Nr, P_2(Nr)]: [1,0.250], [2,0.275], [3,0.100], [4,0.250], [5,0.075], [6,0.050]
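For concreteness, here is a minimal Python sketch (my addition, not part of the original worksheet) that computes these maximum-likelihood estimates from the observed counts:

```python
# Maximum-likelihood parameter estimates for a random (i.i.d.) sequence model:
# each probability is simply the relative frequency of the corresponding face.

counts_die1 = {1: 5, 2: 3, 3: 10, 4: 1, 5: 10, 6: 11}
counts_die2 = {1: 10, 2: 11, 3: 4, 4: 10, 5: 3, 6: 2}

def ml_estimate(counts):
    total = sum(counts.values())
    return {face: n / total for face, n in counts.items()}

P1 = ml_estimate(counts_die1)  # {1: 0.125, 2: 0.075, 3: 0.25, 4: 0.025, 5: 0.25, 6: 0.275}
P2 = ml_estimate(counts_die2)  # {1: 0.25, 2: 0.275, 3: 0.1, 4: 0.25, 5: 0.075, 6: 0.05}
print(P1)
print(P2)
```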

(ii) Some time later, one of the dice has disappeared. You (as the casino owner) need to find out which one. The remaining die is now thrown 40 times, and these are the observed counts: [1,8], [2,12], [3,6], [4,9], [5,4], [6,1]. Use Bayes' rule to decide the identity of the remaining die.
ANSWER: Since we have a random sequence model (i.i.d. data) D, the probability of D under each model is the product of the per-symbol probabilities raised to the observed counts:
P_1(D) = P_1(1)^8 · P_1(2)^12 · P_1(3)^6 · P_1(4)^9 · P_1(5)^4 · P_1(6)^1
P_2(D) = P_2(1)^8 · P_2(2)^12 · P_2(3)^6 · P_2(4)^9 · P_2(5)^4 · P_2(6)^1
Since there is no prior knowledge about either die, we use a flat prior, i.e. 0.5 for both hypotheses, so the posterior is proportional to the likelihood. Because P_1(D) < P_2(D) and the prior is the same for both hypotheses, we conclude that the remaining die is die no. 2.
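A small Python check (mine, not from the slides) that carries out this comparison in log space, using the parameter estimates from part (i) and a flat prior:

```python
import math

# Parameter estimates from part (i)
P1 = {1: 0.125, 2: 0.075, 3: 0.250, 4: 0.025, 5: 0.250, 6: 0.275}
P2 = {1: 0.250, 2: 0.275, 3: 0.100, 4: 0.250, 5: 0.075, 6: 0.050}
# Counts observed from the remaining die
counts = {1: 8, 2: 12, 3: 6, 4: 9, 5: 4, 6: 1}

def log_likelihood(P, counts):
    # log P(D | model) for an i.i.d. (random sequence) model
    return sum(n * math.log(P[face]) for face, n in counts.items())

log_prior = math.log(0.5)                        # flat prior over the two hypotheses
score1 = log_prior + log_likelihood(P1, counts)  # log unnormalised posterior, die 1
score2 = log_prior + log_likelihood(P2, counts)  # log unnormalised posterior, die 2
print(score1, score2)  # score2 is larger, so the remaining die is die no. 2
```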

Seq Models - Exercise 1
Sequences:
(s1): A B B A B A A A B A A B B B
(s2): B B B B B A A A A A B B B B
Models:
(M1): a random sequence model with parameters P(A)=0.4, P(B)=0.6
(M2): a first-order Markov model with initial probabilities 0.5 for both symbols and the following transition matrix: P(A|A)=0.6, P(B|A)=0.4, P(A|B)=0.1, P(B|B)=0.9.
Which sequence (s1, s2) comes from which model (M1, M2)?

Answer
Intuitively: s2 contains more state repetitions, which is evidence that the Markov structure of M2 is more likely than the random structure of M1; s1 looks more random, so it is more likely to have been generated by M1.
Formally (using natural logarithms; the comparison does not depend on the base):
log P(s1|M1) = 7*log(0.4) + 7*log(0.6) ≈ -9.99
log P(s1|M2) = log(0.5) + 3*log(0.6) + 4*log(0.4) + 3*log(0.1) + 3*log(0.9) ≈ -13.11
The former is larger, so s1 is more likely to have been generated by M1. Similarly, for s2 we get:
log P(s2|M1) = 5*log(0.4) + 9*log(0.6) ≈ -9.18
log P(s2|M2) = log(0.5) + 4*log(0.6) + log(0.4) + log(0.1) + 7*log(0.9) ≈ -6.69
The latter is larger, so s2 is more likely to have been generated by M2.
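A short Python sketch (not from the original slides) that evaluates both log-likelihoods directly from the sequences rather than from hand-counted symbols and transitions:

```python
import math

s1 = "ABBABAAABAABBB"
s2 = "BBBBBAAAAABBBB"

# M1: random (i.i.d.) sequence model
P_M1 = {"A": 0.4, "B": 0.6}
# M2: first-order Markov model
init_M2 = {"A": 0.5, "B": 0.5}
trans_M2 = {("A", "A"): 0.6, ("A", "B"): 0.4, ("B", "A"): 0.1, ("B", "B"): 0.9}

def loglik_M1(seq):
    return sum(math.log(P_M1[c]) for c in seq)

def loglik_M2(seq):
    ll = math.log(init_M2[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        ll += math.log(trans_M2[(prev, cur)])
    return ll

for name, seq in [("s1", s1), ("s2", s2)]:
    print(name, round(loglik_M1(seq), 2), round(loglik_M2(seq), 2))
# s1: -9.99 under M1, -13.11 under M2  -> s1 fits M1 better
# s2: -9.18 under M1,  -6.69 under M2  -> s2 fits M2 better
```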

RL - Exercise 2a). The figure below depicts a 4-state grid world in which state 2 represents the 'gold'. (The figure itself is not reproduced in this transcript; the immediate reward values it shows appear in the worked circuits below.) Using those immediate reward values and employing the Q-learning algorithm, perform anti-clockwise circuits over the four states, updating the state-action (Q) table. Note: the Q-table entries are updated after each move, and the table is shown after each complete circuit.

Solution
Initialise each entry of the table of Q values, Q(s,a), to zero.
Iterate: whenever action a is taken in state s, leading to state s' with immediate reward r, update
Q(s,a) ← r + γ·max_{a'} Q(s',a')
Here γ = 0.9 (the discount factor used in the worked circuits below), and we write →s' for the action that moves the agent to state s'.

First circuit:
Q(3,→4) = r + γ·max{Q(4,→2), Q(4,→3)} = -2 + 0.9·max{0,0} = -2
Q(4,→2) = r + γ·max{Q(2,→1), Q(2,→4)} = 50 + 0.9·max{0,0} = 50
Q(2,→1) = r + γ·max{Q(1,→2), Q(1,→3)} = -10 + 0.9·max{0,0} = -10
Q(1,→3) = r + γ·max{Q(3,→1), Q(3,→4)} = -2 + 0.9·max{0,-2} = -2
Q(3,→4) = r + γ·max{Q(4,→3), Q(4,→2)} = -2 + 0.9·max{0,50} = 43

Second circuit:
Q(4,→2) = r + γ·max{Q(2,→4), Q(2,→1)} = 50 + 0.9·max{0,-10} = 50
Q(2,→1) = r + γ·max{Q(1,→2), Q(1,→3)} = -10 + 0.9·max{0,-2} = -10
Q(1,→3) = r + γ·max{Q(3,→1), Q(3,→4)} = -2 + 0.9·max{0,43} = 36.7
Q(3,→4) = r + γ·max{Q(4,→3), Q(4,→2)} = -2 + 0.9·max{0,50} = 43

Third circuit:
Q(4,→2) = r + γ·max{Q(2,→4), Q(2,→1)} = 50 + 0.9·max{0,-10} = 50
Q(2,→1) = r + γ·max{Q(1,→2), Q(1,→3)} = -10 + 0.9·max{0,36.7} = 23.03
Q(1,→3) = r + γ·max{Q(3,→1), Q(3,→4)} = -2 + 0.9·max{0,43} = 36.7
Q(3,→4) = r + γ·max{Q(4,→3), Q(4,→2)} = -2 + 0.9·max{0,50} = 43

Fourth circuit:
Q(4,→2) = r + γ·max{Q(2,→4), Q(2,→1)} = 50 + 0.9·max{0,23.03} = 70.73
Q(2,→1) = r + γ·max{Q(1,→2), Q(1,→3)} = -10 + 0.9·max{0,36.7} = 23.03
Q(1,→3) = r + γ·max{Q(3,→1), Q(3,→4)} = -2 + 0.9·max{0,43} = 36.7
Q(3,→4) = r + γ·max{Q(4,→3), Q(4,→2)} = -2 + 0.9·max{0,70.73} = 61.66
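The following Python sketch (my addition, not from the slides) reproduces these updates. Since the original figure is missing, the circuit 3→4→2→1→3, the state adjacency, the discount γ = 0.9 and the immediate rewards r(3→4) = -2, r(4→2) = 50, r(2→1) = -10, r(1→3) = -2 are assumptions read off from the worked values above:

```python
gamma = 0.9
# Two actions per state, identified by the successor state they lead to.
succ = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
# Immediate reward for each transition taken on the anti-clockwise circuit.
reward = {(3, 4): -2, (4, 2): 50, (2, 1): -10, (1, 3): -2}

# Q[(s, s_next)] = value of taking, in s, the action that leads to s_next.
Q = {(s, sp): 0.0 for s in succ for sp in succ[s]}

# Update order as in the slides: the first circuit revisits state 3 at the end,
# later circuits contain one update per state.
circuits = [
    [(3, 4), (4, 2), (2, 1), (1, 3), (3, 4)],
    [(4, 2), (2, 1), (1, 3), (3, 4)],
    [(4, 2), (2, 1), (1, 3), (3, 4)],
    [(4, 2), (2, 1), (1, 3), (3, 4)],
]

for i, circuit in enumerate(circuits, start=1):
    for s, sp in circuit:
        # Deterministic Q-learning update: Q(s,a) <- r + gamma * max_a' Q(s',a')
        Q[(s, sp)] = reward[(s, sp)] + gamma * max(Q[(sp, x)] for x in succ[sp])
    print(f"after circuit {i}:", {k: round(v, 2) for k, v in Q.items() if k in reward})
# After the fourth circuit: Q(4,->2)=70.73, Q(2,->1)=23.03, Q(1,->3)=36.7,
# Q(3,->4)=61.65 (the slides show 61.66 because they round 70.727 to 70.73 first).
```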

Exercise 2b). In some RL problems, rewards are positive for goals and are either negative or zero the rest of the time. Are the signs of these rewards important, or only the intervals between them? Prove, using the standard discounted return
R_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + … ,
that adding a constant C to all the elementary rewards adds a constant, K, to the values of all the states, and thus does not affect the relative values of any states under any policies. What is K in terms of C and γ?

Solution
Add a constant C to all elementary rewards and compute the new return.
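The algebra on the original slide is not reproduced in this transcript; a standard reconstruction of the derivation is:

```latex
\begin{align*}
R_t' &= \sum_{k=0}^{\infty} \gamma^k \,(r_{t+k+1} + C)
      = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1} + C \sum_{k=0}^{\infty} \gamma^k
      = R_t + \frac{C}{1-\gamma} \\[4pt]
V'_\pi(s) &= \mathbb{E}_\pi\!\left[\,R_t' \mid s_t = s\,\right]
           = V_\pi(s) + \frac{C}{1-\gamma}
           \quad\Longrightarrow\quad K = \frac{C}{1-\gamma}
\end{align*}
```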

Thus K = C/(1-γ): every state value is shifted by the same constant, so only the intervals between rewards are important, not their absolute values.

Exercise 2c). Imagine you are designing a robot to escape from a maze. You decide to give it a reward of +1 for escaping from the maze and a reward of zero at all other times. Since the task seems to break down naturally into episodes (successive runs through the maze), you decide to treat it as an episodic task, where the goal is to maximise the expected total reward:

R_t = r_{t+1} + r_{t+2} + r_{t+3} + … + r_T
After running the learning agent for a while, you find that it is showing no signs of improvement in escaping from the maze. What is going wrong? Have you effectively communicated to the agent what you want it to achieve?

Solution
Imagine the following episode: the agent is still in the maze (NE = not escaped) at times t, t+1, t+2, t+3, t+4 and escapes (E) at time t+5.
With the chosen rewards, R_t = 1 for this episode, and R_t = 1 for any episode that eventually ends in escape, however long it takes. No reward is being given for escaping in the minimum number of steps, so the agent has no incentive to improve.

Possible solution: reward with -1 each move that lands in a not-escaped (NE) state and with 0 (or +1) for reaching the escaped state.
For the episode above (NE at t, …, t+4 and E at t+5), the rewards are r_{t+1} = … = r_{t+4} = -1 and r_{t+5} = 0, so R_t = -4. In general, if the agent makes k moves that still leave it inside the maze, the cumulative reward is -k. We want to find a policy that maximises R_t; the best policy would make R_t = 0 (escape at the next time step), so the agent is now rewarded for escaping quickly.
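A tiny Python illustration (my addition) contrasting the two reward schemes for episodes of different lengths; it shows why the first scheme gives the agent no reason to prefer short episodes:

```python
def return_scheme_a(episode_length):
    # +1 on escaping, 0 otherwise (undiscounted, episodic)
    rewards = [0] * (episode_length - 1) + [1]
    return sum(rewards)

def return_scheme_b(episode_length):
    # -1 for every move that stays inside the maze, 0 on escaping
    rewards = [-1] * (episode_length - 1) + [0]
    return sum(rewards)

for steps in (2, 5, 50):
    print(steps, return_scheme_a(steps), return_scheme_b(steps))
# 2   1   -1
# 5   1   -4
# 50  1  -49
# Scheme A: the return is 1 no matter how long the escape takes.
# Scheme B: longer episodes get lower returns, so faster escapes are preferred.
```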

Optional material: Convergence proof of Q-learning
Recall the deterministic Q-learning update: Q̂(s,a) ← r + γ·max_{a'} Q̂(s',a').
Sketch of proof: Consider the case of a deterministic world, where each (s,a) is visited infinitely often. Define a full interval as an interval during which each (s,a) is visited. Show that during any such interval, the absolute value of the largest error in the Q table is reduced by a factor of γ. Consequently, as γ < 1, after infinitely many updates the largest error converges to zero.

Solution
Let Q̂_n be the table of Q-value estimates after n updates, and let e_n be the maximum error in this table:
e_n = max over (s,a) of |Q̂_n(s,a) - Q(s,a)|
What is the maximum error after the (n+1)-th update?
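The chain of inequalities from the original slide is not reproduced in the transcript; a standard reconstruction of the argument (for a deterministic world, where the update of entry (s,a) uses successor state s') is:

```latex
\begin{align*}
\left|\hat{Q}_{n+1}(s,a) - Q(s,a)\right|
  &= \left|\bigl(r + \gamma \max_{a'} \hat{Q}_n(s',a')\bigr)
         - \bigl(r + \gamma \max_{a'} Q(s',a')\bigr)\right| \\
  &= \gamma \left|\max_{a'} \hat{Q}_n(s',a') - \max_{a'} Q(s',a')\right| \\
  &\le \gamma \max_{a'} \left|\hat{Q}_n(s',a') - Q(s',a')\right| \\
  &\le \gamma \, e_n
\end{align*}
```

Hence, after a full interval in which every (s,a) has been updated at least once, the maximum error satisfies e_new ≤ γ·e_n, and since γ < 1 the error converges to zero.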

Obs. No assumption was made about the action sequence! Thus, Q-learning can learn the Q function (and hence the optimal policy) while training from actions chosen at random, as long as the resulting training sequence visits every (state, action) pair infinitely often.