COSC 4368 Group Project Spring 2019: Learning Paths from Feedback Using Reinforcement Learning for a Transportation World

PD-World
Goal: Transport blocks from the pickup cells to the dropoff cells!
Pickup cells: (1,1), (3,3), (5,5)
Dropoff cells: (5,1), (5,3), (2,5)
Initial state: the agent is in cell (1,5) and each pickup cell contains 5 blocks.
Terminal state: the dropoff cells contain 5 blocks each.
(The slide shows the 5x5 grid of cells (1,1) through (5,5).)

Operators of the PD-World (there are six of them):
North, South, East, West are applicable in every state and move the agent to the adjacent cell in that direction; moves that would leave the grid are not allowed.
Pickup is only applicable if the agent is in a pickup cell that contains at least one block and the agent does not already carry a block.
Dropoff is only applicable if the agent is in a dropoff cell that contains fewer than 5 blocks and the agent carries a block.
Initial state of the PD-World: each pickup cell contains 5 blocks, the dropoff cells contain 0 blocks, and the agent always starts in position (1,5).
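To make the operator definitions concrete, here is a minimal Python sketch of an applicability check (the aplop function named in the implementation steps later in this transcript). The state layout (i, j, x, a, b, c, d, e, f), the operator names, and the convention that north decreases i while east increases j are assumptions made for illustration; the cell coordinates follow the PD-World slide above.

# Hypothetical applicability check for the six PD-World operators.
# Assumed state layout: (i, j, x, a, b, c, d, e, f) -- see the State Space slide below.
PICKUP_IDX  = {(1, 1): 3, (3, 3): 4, (5, 5): 5}   # pickup cell -> index of its block count in the state tuple
DROPOFF_IDX = {(5, 1): 6, (5, 3): 7, (2, 5): 8}   # dropoff cell -> index of its block count (PD-World slide)

def aplop(state):
    """Return the set of operators applicable in the given full world state."""
    i, j, x = state[0], state[1], state[2]
    ops = set()
    if i > 1: ops.add('north')   # assumed convention: north decreases i
    if i < 5: ops.add('south')
    if j < 5: ops.add('east')    # assumed convention: east increases j
    if j > 1: ops.add('west')
    if x == 0 and (i, j) in PICKUP_IDX and state[PICKUP_IDX[(i, j)]] > 0:
        ops.add('pickup')
    if x == 1 and (i, j) in DROPOFF_IDX and state[DROPOFF_IDX[(i, j)]] < 5:
        ops.add('dropoff')
    return ops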

Rewards in the PD-World
Picking up a block in a pickup cell: +13
Dropping off a block in a dropoff cell: +13
Applying north, south, east, or west: -1
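A correspondingly minimal sketch of this reward scheme as a Python function; the operator names are the same hypothetical identifiers used in the aplop sketch above.

def reward(op):
    """Reward obtained by applying operator op (PD-World reward scheme above)."""
    return 13 if op in ('pickup', 'dropoff') else -1   # all four moves cost -1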

Policies
PRandom: If pickup or dropoff is applicable, choose that operator; otherwise, choose an applicable operator randomly.
PExploit: If pickup or dropoff is applicable, choose that operator; otherwise, with probability 0.80 apply the applicable operator with the highest q-value (break ties by rolling a dice among operators with the same utility), and with probability 0.20 choose a different applicable operator randomly.
PGreedy: If pickup or dropoff is applicable, choose that operator; otherwise, apply the applicable operator with the highest q-value (break ties by rolling a dice among operators with the same utility).
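One possible Python reading of the three policies, assuming a Q-table q implemented as a collections.defaultdict(float) keyed by (RL-state, operator) pairs, plus the hypothetical aplop helper sketched above; world_state is the full world state used for applicability checks and rl_s the reduced state used for q-value lookups. This is a sketch of one interpretation, not the required implementation.

import random

def choose_operator(policy, q, world_state, rl_s):
    """Select an operator according to PRandom, PExploit, or PGreedy (sketch)."""
    ops = aplop(world_state)
    # All three policies apply pickup/dropoff immediately whenever one of them is applicable.
    for forced in ('pickup', 'dropoff'):
        if forced in ops:
            return forced
    if policy == 'PRandom':
        return random.choice(sorted(ops))
    best_q = max(q[(rl_s, op)] for op in ops)
    best_ops = [op for op in ops if q[(rl_s, op)] == best_q]
    if policy == 'PGreedy' or random.random() < 0.80 or len(best_ops) == len(ops):
        return random.choice(best_ops)            # highest q-value, ties broken randomly
    return random.choice([op for op in ops if op not in best_ops])   # PExploit: 20% exploration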

Performance Measures
Bank account of the agent (the total reward collected).
Number of operators applied to reach a terminal state from the initial state; note that this can happen multiple times in a single experiment!

State Space of the PD-World
The actual state space of the PD-World is as follows: (i, j, x, a, b, c, d, e, f), where
(i,j) is the position of the agent,
x is 1 if the agent carries a block and 0 if not,
(a,b,c,d,e,f) are the numbers of blocks in cells (1,1), (3,3), (5,5), (5,1), (5,3), and (4,5), respectively.
Initial state: (1,5,0,5,5,5,0,0,0)
Terminal states: (*,*,0,0,0,0,5,5,5)
Remark: The actual reinforcement learning approach will likely use a simplified state space that aggregates multiple states of the actual state space into a single state of the reinforcement learning state space.

Mapping State Spaces to RL State Spaces
Most worlds have enormously large or even non-finite state spaces. Moreover, the speed at which Q/TD learning learns degrades as the size of the state space grows. Consequently, smaller state spaces are used as RL state spaces; the original state space is rarely used as the RL state space.
World State Space → (state space reduction) → RL State Space

Recommended Reinforcement Learning State Space
In this approach, reinforcement learning states have the form (i,j,x), where
(i,j) is the position of the agent,
x is 1 if the agent carries a block; otherwise, 0.
That is, the state space has only 50 states.
Discussion: The algorithm initially learns paths between pickup cells and dropoff cells (different paths for x=1 and for x=0).
Minor complication: The q-values along those paths will decrease as soon as the particular pickup cell runs out of blocks or the particular dropoff cell cannot store any further blocks, since it is then no longer attractive to visit these locations.
Suggestion: Use this reinforcement learning state space for this project and no other space!
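With the hypothetical full-state layout used in the earlier sketches, the reduction to this RL state space is a one-liner:

def rl_state(state):
    """Map a full world state (i, j, x, a, ..., f) to the recommended RL state (i, j, x)."""
    return (state[0], state[1], state[2])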

Alternative Reinforcement Learning State Space 1
Reinforcement learning states have the form (i,j,x,s,t,u), where
(i,j) is the position of the agent,
x is 1 if the agent carries a block; otherwise, 0,
s, t, u are boolean variables whose meaning depends on whether the agent carries a block or not:
Case 1: x=0 (the agent does not carry a block): s is 1 if cell (1,1) contains at least one block; t is 1 if cell (3,3) contains at least one block; u is 1 if cell (5,5) contains at least one block.
Case 2: x=1 (the agent carries a block): s is 1 if cell (5,1) contains less than 5 blocks; t is 1 if cell (5,3) contains less than 5 blocks; u is 1 if cell (4,5) contains less than 5 blocks.
There are 400 states in total in reinforcement learning state space 1.
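A sketch of the reduction for this alternative space, again assuming the hypothetical full-state layout used above; the block counts d, e, f refer to the dropoff cells listed on this slide.

def rl_state_alt(state):
    """Map a full world state to the alternative RL state (i, j, x, s, t, u)."""
    i, j, x, a, b, c, d, e, f = state
    if x == 0:   # s, t, u: do the pickup cells still hold blocks?
        s, t, u = int(a > 0), int(b > 0), int(c > 0)
    else:        # s, t, u: can the dropoff cells still accept blocks?
        s, t, u = int(d < 5), int(e < 5), int(f < 5)
    return (i, j, x, s, t, u)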

Analysis of Attractive Paths
See also:
http://horstmann.com/gridworld/gridworld-manual.html
http://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_td.html

TD-Q-Learning for the PD-World (Remark: this is the Q-learning approach you must use!)
Goal: Measure the utility of applying action a in state s, denoted by Q(a,s). The following update formula is used every time the agent reaches state s' from s by applying action a:
Q(a,s) ← (1-α)*Q(a,s) + α*[R(s',a,s) + γ*max_a' Q(a',s')]
α is the learning rate; γ is the discount factor.
a' has to be an operator applicable in s'; e.g., pickup and dropoff are not applicable in a pickup/dropoff cell if it is empty/full!
R(s',a,s) is the reward of reaching s' from s by applying a; e.g., -1 for moving and +13 for picking up or dropping off a block in the PD-World.
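A minimal Python sketch of this update rule, reusing the hypothetical q dictionary and aplop helper from above; ALPHA and GAMMA stand for α and γ, and the values chosen here are purely illustrative, not prescribed by the slides.

ALPHA = 0.3   # learning rate alpha (illustrative value)
GAMMA = 0.5   # discount factor gamma (illustrative value)

def q_learning_update(q, s, a, r, s_next, applicable_next):
    """Q(a,s) <- (1-alpha)*Q(a,s) + alpha*[r + gamma*max_a' Q(a',s_next)], a' applicable in s_next."""
    best_next = max(q[(s_next, op)] for op in applicable_next)
    q[(s, a)] = (1 - ALPHA) * q[(s, a)] + ALPHA * (r + GAMMA * best_next)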

SARSA
Approach: SARSA selects, using the policy π, the action a' to be applied to s' and then updates the q-values as follows:
Q(a,s) ← Q(a,s) + α*[R(s) + γ*Q(a',s') - Q(a,s)]
SARSA vs. Q-Learning: SARSA uses the action actually taken for the update and is therefore more realistic, as it evaluates the employed policy; however, it has problems with convergence. Q-Learning is an off-policy learning algorithm geared towards optimal behavior, although this might not be realistic to accomplish in practice, since most applications need policies that allow for some exploration.
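The corresponding SARSA update in the same sketch notation; r is the observed reward for the transition (written R(s) on the slide), and a_next is the action actually selected by the policy in s_next.

def sarsa_update(q, s, a, r, s_next, a_next):
    """Q(a,s) <- Q(a,s) + alpha*[r + gamma*Q(a',s') - Q(a,s)]."""
    q[(s, a)] += ALPHA * (r + GAMMA * q[(s_next, a_next)] - q[(s, a)])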

4368 Project in a Nutshell
The RL system is determined by several design choices: the policy, the RL state space, the learning rate α, the discount rate γ, and the utility update (Q-Learning or SARSA).
Question: What design leads to the best RL-system performance?

Suggested Implementation Steps
Write a function aplop: (i,j,x,a,b,c,d,e,f) → 2^{n,s,e,w,p,d} that returns the set of applicable operators in state (i,j,x,a,b,c,d,e,f).
Write a function apply: (i,j,x,a,b,c,d,e,f) × {n,s,e,w,p,d} → (i',j',x',a',b',c',d',e',f') that applies an operator to a state (a sketch follows below).
Implement the q-table data structure.
Implement the SARSA/Q-Learning q-table update.
Implement the 3 policies.
Write functions that let an agent act according to a policy for n steps and that also compute the performance variables.
Develop visualization functions for q-tables.
Develop a visualization function for the evolution of the PD-World.
Develop visualization functions for attractive paths.
Develop functions to run experiments 1-5.
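To illustrate the apply and act-for-n-steps items on this list, here is a sketch of a transition function and of a Q-learning driver that tracks the two performance measures. It reuses the hypothetical helpers from the earlier sketches (aplop, reward, rl_state, choose_operator, q_learning_update, PICKUP_IDX, DROPOFF_IDX). The handling of terminal states (dropping the future-value term and restarting the episode) is one reasonable reading of the project description, not a prescribed detail.

from collections import defaultdict

MOVES = {'north': (-1, 0), 'south': (1, 0), 'east': (0, 1), 'west': (0, -1)}

def apply_op(state, op):
    """Return the successor state reached by applying an applicable operator op."""
    s = list(state)
    if op in MOVES:
        di, dj = MOVES[op]
        s[0] += di
        s[1] += dj
    elif op == 'pickup':
        s[2] = 1
        s[PICKUP_IDX[(s[0], s[1])]] -= 1
    else:  # 'dropoff'
        s[2] = 0
        s[DROPOFF_IDX[(s[0], s[1])]] += 1
    return tuple(s)

def run_q_learning(n, policy, initial_state, terminal_test):
    """Act for n steps under the given policy with Q-learning updates.
    Returns the q-table, the agent's bank account, and the number of completed episodes."""
    q = defaultdict(float)
    world = initial_state
    bank, completions = 0, 0
    for _ in range(n):
        s = rl_state(world)
        a = choose_operator(policy, q, world, s)
        world_next = apply_op(world, a)
        r = reward(a)
        bank += r
        if terminal_test(world_next):
            q[(s, a)] = (1 - ALPHA) * q[(s, a)] + ALPHA * r   # terminal successor: no future term
            completions += 1
            world = initial_state    # a terminal state can be reached several times per experiment
        else:
            q_learning_update(q, s, a, r, rl_state(world_next), aplop(world_next))
            world = world_next
    return q, bank, completions

# Hypothetical usage: 6000 steps of Q-learning under PRandom.
# initial = (1, 5, 0, 5, 5, 5, 0, 0, 0)
# done = lambda w: w[6] == w[7] == w[8] == 5
# q, bank, episodes = run_q_learning(6000, 'PRandom', initial, done)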

SARSA Pseudo-Code (the slide shows the standard loop over state s, action a, and successor state s')
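A minimal Python rendering of that loop, reusing the imports and hypothetical helpers sketched above; the only difference from the Q-learning driver is that the action a' actually chosen by the policy in the successor state is fed back into the update.

def run_sarsa(n, policy, initial_state, terminal_test):
    """Act for n steps under the given policy with on-policy SARSA updates (sketch)."""
    q = defaultdict(float)
    world = initial_state
    s = rl_state(world)
    a = choose_operator(policy, q, world, s)
    for _ in range(n):
        world_next = apply_op(world, a)
        r = reward(a)
        if terminal_test(world_next):
            q[(s, a)] += ALPHA * (r - q[(s, a)])   # terminal successor: no future term
            world = initial_state                  # restart the episode
            s = rl_state(world)
            a = choose_operator(policy, q, world, s)
            continue
        s_next = rl_state(world_next)
        a_next = choose_operator(policy, q, world_next, s_next)   # on-policy choice of a'
        sarsa_update(q, s, a, r, s_next, a_next)
        world, s, a = world_next, s_next, a_next
    return q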