Assignment 1 Solutions

Problem 1
States: tuples giving the grid cell of detective D1, detective D2, and the criminal C.
Actions: a single MDP controls both detectives, so each action is a joint move for the pair.
[Slide figure: a 3×3 grid of cells (0)–(8), row-major, with D1 in cell 0, the criminal C in cell 2, and D2 in cell 3.]

Problem 1 contd.
Transitions: explained by example. For the "stay where you are" action, the state transitions will be:
– 0.8 for staying where you are
– 0.05 for north
– 0.05 for east
– 0.05 for south
– 0.05 for west
(The slide groups the resulting next-state probabilities as 0.9, 0.05, and 0.05.)
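A minimal sketch of one way to encode this noisy move model for a single detective. It assumes row-major numbering of the 3×3 grid (cell 0 in the top-left) and that moves which would leave the grid leave the detective in place; the function name and structure are illustrative, not taken from the assignment handout.

```python
# Sketch only: per-detective move noise, assuming off-grid moves collapse to "stay".
MOVES = {"stay": (0, 0), "north": (-1, 0), "south": (1, 0),
         "east": (0, 1), "west": (0, -1)}

def move_distribution(cell, intended):
    """Return {next_cell: probability} for one detective in cells 0..8."""
    row, col = divmod(cell, 3)
    dist = {}
    for move, (dr, dc) in MOVES.items():
        p = 0.8 if move == intended else 0.05
        r, c = row + dr, col + dc
        if not (0 <= r < 3 and 0 <= c < 3):   # off-grid moves leave the detective in place
            r, c = row, col
        nxt = 3 * r + c
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist

# Intending to stay in the top-left corner (cell 0): north and west are blocked,
# so the distribution is {0: 0.9, 1: 0.05, 3: 0.05}.
print(move_distribution(0, "stay"))
```

Under this (assumed) blocked-move interpretation, the 0.9 / 0.05 / 0.05 grouping on the slide falls out naturally for a corner cell such as cell 0.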

Problem 1 contd.
Goal states: states where at least one detective has the same position as the criminal.
– Ex: (1,2,1), (5,1,1), etc.
The reward function will vary from person to person, but one possible reward function is:
– R(goal state) = 100
– R(goal state, *) = 0
– R(non-goal state, *) = -2
– Example: R([1,2,1]) = 100; R([1,2,1], *) = 0; R([1,2,3], *) = -2
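A hedged sketch of this example reward function. It assumes a state is the tuple (d1, d2, c) of cell indices for detective 1, detective 2, and the criminal, an ordering consistent with the examples above; the value of R(s) for non-goal states is not given on the slide and is taken as 0 here.

```python
def is_goal(state):
    """Goal: at least one detective occupies the criminal's cell."""
    d1, d2, c = state
    return d1 == c or d2 == c

def reward(state, action=None):
    """R(s) when called without an action, R(s, a) otherwise, per the slide's example."""
    if action is None:
        return 100 if is_goal(state) else 0   # R(s) for non-goal states: assumed 0
    return 0 if is_goal(state) else -2        # step cost while the criminal is still free

# The slide's examples:
assert reward((1, 2, 1)) == 100
assert reward((1, 2, 1), "*") == 0
assert reward((1, 2, 3), "*") == -2
```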

Problem 2
Implement value iteration and provide policies given only the start state.
– For example, for (a) the start state is the one shown in the figure below. The best action needs to be provided for T = 1; with the above reward function, the action is … .
– At T = 2, best action for …; …; goal state (any action is fine).
– At T = 3, best action for … (goal state); …; … .
[Slide figure: 3×3 grid of cells (0)–(8) with both detectives (D1, D2) in cell 0 and the criminal C in cell 2.]
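A minimal finite-horizon value-iteration sketch for Problem 2. It is written against placeholder interfaces (states as a list of state tuples, actions(s), transition(s, a) returning {s': p}, reward(s, a), is_goal(s)) rather than any code supplied with the assignment; it indexes values by T = number of steps remaining, matching the T = 1, 2, 3 policies above, and uses no discounting.

```python
def value_iteration(states, actions, transition, reward, is_goal, horizon):
    """Return V[t][s] and policy[t][s] for t = 1..horizon, where t = steps to go."""
    V = {0: {s: 0.0 for s in states}}          # no steps left: no further reward
    policy = {}
    for t in range(1, horizon + 1):
        V[t], policy[t] = {}, {}
        for s in states:
            if is_goal(s):                     # absorbing goal state: any action is fine
                V[t][s], policy[t][s] = 0.0, None
                continue
            best_a, best_q = None, float("-inf")
            for a in actions(s):
                # Expected immediate reward plus value-to-go under the noisy transitions.
                q = reward(s, a) + sum(p * V[t - 1][s2]
                                       for s2, p in transition(s, a).items())
                if q > best_q:
                    best_a, best_q = a, q
            V[t][s], policy[t][s] = best_q, best_a
    return V, policy
```

With interfaces like these, policy[T][start_state] would give the action to report for each horizon T.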

Problem 3
Calculate all paths (for the criminal) of length 5. Find the average number of moves used by the detectives to catch the thief over the paths enumerated above. In the above MDP, the average was 2.4.
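One possible sketch of the enumeration step, assuming the criminal moves one cell (or stays) per step on the same 3×3 grid; legal_moves, the start cell in the comments, and the moves_to_catch simulator referenced there are illustrative placeholders, not part of the handout.

```python
def legal_moves(cell):
    """Cells reachable from `cell` in one step on the 3x3 grid (including staying put)."""
    row, col = divmod(cell, 3)
    reachable = []
    for dr, dc in [(0, 0), (-1, 0), (1, 0), (0, 1), (0, -1)]:
        r, c = row + dr, col + dc
        if 0 <= r < 3 and 0 <= c < 3:
            reachable.append(3 * r + c)
    return reachable

def criminal_paths(start, length):
    """Enumerate every criminal path with `length` moves starting from `start`."""
    if length == 0:
        return [[start]]
    return [[start] + rest
            for nxt in legal_moves(start)
            for rest in criminal_paths(nxt, length - 1)]

# paths = criminal_paths(2, 5)                    # all length-5 paths from cell 2
# catches = [moves_to_catch(p) for p in paths]    # simulate the detectives' policy per path
# average = sum(catches) / len(catches)           # reported as 2.4 on the slide
```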

Problem 4
It is not possible to define the reward (to accommodate the rule on T = 4) given the above state space. The state space needs to be modified to include time. Without the additional state feature for time, the problem does not have the Markov property.
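A brief sketch of the state augmentation this slide calls for: fold the time step into the state so a time-dependent rule can be expressed as a Markov reward. The exact T = 4 rule is not restated in the transcript, so time_dependent_reward below only marks where it would go; both function names are hypothetical.

```python
def augment(state, t):
    """Original state (d1, d2, c) extended with the current time step t."""
    return (*state, t)

def time_dependent_reward(aug_state, action):
    d1, d2, c, t = aug_state
    base = 0 if (d1 == c or d2 == c) else -2   # step reward from Problem 1
    # The assignment's rule about T = 4 would modify `base` here as a function of t;
    # without t in the state, no reward over (d1, d2, c) alone can express it.
    return base
```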