EXPLORING MARKOV DECISION PROCESS VIOLATIONS IN REINFORCEMENT LEARNING
Jordan Fryer, University of Portland
Working with Peter Heeman

OUTLINE
- Background: Reinforcement Learning (RL)
  - RL and symbolic reasoning to learn a system dialogue policy
- Background: Markov Decision Processes (MDP)
- The Problem
  - Attempting to find absolute convergence
  - Simplification process
  - Evaluation tools
- Discussion

BACKGROUND: REINFORCEMENT LEARNING
- Inputs
  - States: how the agent represents the environment at a certain time
  - Actions: how the agent interacts with the environment
  - Cost function: a probabilistic mapping of a state-action pair to a value; most of the cost may be assigned at the terminal state
  - Simulated user: so the system can try out different dialogue behaviors
- Output: an optimal policy, a mapping of each state to an action
- How it learns: iteratively evaluate the current policy and explore alternatives, then update the policy

BACKGROUND: REINFORCEMENT LEARNING
- Keep track of a Q score for each state-action pair: the cost to get to the end from that state when following that action
- For each dialogue simulation, take the final cost and propagate it back over the state-action pairs in the run (see the sketch below)
- [Diagram: a run through states S1, S2, S3, S4, S5 via actions a1, a2, a3, a4; Utt: 1, SQ: 10, Total: 14; Q values 11, 12, 13, 14 on the successive state-action pairs]
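A minimal sketch of the propagation step just described, assuming a hypothetical log format in which each simulated dialogue is stored as its ordered list of state-action pairs plus the run's total cost; the learning rate alpha is also an assumption, not taken from the slides.

```python
# Hedged sketch: after each simulated dialogue, fold the run's final cost back
# into the Q estimate of every state-action pair visited in that run.
from collections import defaultdict

def propagate_final_cost(runs, alpha=0.1):
    """runs: iterable of (pairs, total_cost), where pairs is the ordered list of
    (state, action) visited in one dialogue and total_cost is the cost of the run."""
    q = defaultdict(float)                    # Q score per (state, action) pair
    for pairs, total_cost in runs:
        for state, action in pairs:
            # move the estimate toward the observed cost of this run;
            # the actual system may instead use the cost-to-go from this pair
            q[(state, action)] += alpha * (total_cost - q[(state, action)])
    return q
```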

BACKGROUND: MARKOV DECISION PROCESSES
- RL is guaranteed to converge for Markov Decision Processes
- Only the current state is used to decide what action to do next
- System + user + environment must satisfy (an empirical check is sketched below):
  Pr{ s_{t+1} = s' | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0 } = Pr{ s_{t+1} = s' | s_t, a_t }
  Pr{ r_{t+1} = r | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0 } = Pr{ r_{t+1} = r | s_t, a_t }
- How detailed should states be?
  - Too detailed: learning becomes brute force and the state space explodes
  - Too vague: the MDP assumptions are violated
- RL learns a solution very fast due to the "merging" of states
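One way to check these two conditions empirically (my own sketch, not the project's tool, and the episode format is an assumption): estimate the next-state distribution from logged runs conditioned on (s, a) alone and conditioned on one extra step of history, then compare the two.

```python
# Hedged sketch: estimate Pr{s' | s, a} with and without one extra step of history.
# Large disagreement between the two estimates suggests the Markov property fails.
from collections import Counter, defaultdict

def transition_estimates(episodes):
    """episodes: list of runs, each an ordered list of (state, action, next_state)."""
    markov = defaultdict(Counter)       # counts conditioned on (s, a) only
    history = defaultdict(Counter)      # counts also conditioned on the previous (s, a)
    for episode in episodes:
        prev = None
        for state, action, next_state in episode:
            markov[(state, action)][next_state] += 1
            if prev is not None:
                history[(prev, state, action)][next_state] += 1
            prev = (state, action)
    return markov, history
```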

WHY RL FOR DIALOGUE?
- There is a delayed cost to dialogue: the correctness of a dialogue is not really known until the dialogue has ended and the task has been performed
- Modeling after humans isn't always correct: there are many things a computer can do that a human cannot, and many things a human can do that a computer cannot
- It is hard to handcraft a policy

THE PROBLEM: FINDING ABSOLUTE CONVERGENCE
- RL is guaranteed to converge for MDPs in the limit
- How do we know if we have an MDP violation?
- How long do we have to wait for convergence?
- How do we measure convergence?
- We will use Q-learning with ε-greedy exploration (20%)
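For reference, a generic Q-learning loop with 20% ε-greedy exploration, written for cost minimization to match the cost function used here; the environment interface (reset/step) is a placeholder, not the project's simulator.

```python
# Hedged sketch of Q-learning with epsilon-greedy exploration (epsilon = 0.2).
# Costs are minimized rather than rewards maximized, to match the slides.
import random
from collections import defaultdict

def q_learning(env, actions, episodes=10000, alpha=0.1, gamma=1.0, epsilon=0.2):
    q = defaultdict(float)
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            if random.random() < epsilon:                        # explore 20% of the time
                action = random.choice(actions)
            else:                                                # otherwise act greedily
                action = min(actions, key=lambda a: q[(state, a)])
            next_state, cost, done = env.step(action)
            best_next = 0.0 if done else min(q[(next_state, a)] for a in actions)
            # standard Q-learning backup, with min because we minimize cost
            q[(state, action)] += alpha * (cost + gamma * best_next - q[(state, action)])
            state = next_state
    return q
```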

DOMAIN
- Ran on a toy car domain (from the CS550 course)
- Database of 2000 cars (differing in color, year, model, ...)
- The user has one of the 2000 cars in mind; the system asks questions and reports the list of cars that match the user's car
- State:
  - 11 questions: Boolean (asked or not)
  - carBucket: number (bucketized number of cars)
  - Done: Boolean (reported cars or not)
- Cost function: 1 cost per utterance, 5 cost per extra reported car
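A rough encoding of the state and cost function just listed; only the structure (11 asked/not-asked flags, a bucketized car count, a done flag, 1 cost per utterance, 5 per extra reported car) comes from the slide, and the names are my own.

```python
# Hedged sketch of the car-domain state and cost function described above.
from dataclasses import dataclass

NUM_QUESTIONS = 11

@dataclass(frozen=True)
class CarState:
    asked: tuple          # 11 booleans: which questions have been asked
    car_bucket: int       # bucketized count of cars still matching the answers
    done: bool            # whether the matching cars have been reported

def utterance_cost():
    return 1              # every system utterance costs 1

def report_cost(num_matching_cars):
    # 5 cost per extra reported car beyond the user's single intended car
    return 5 * max(0, num_matching_cars - 1)
```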

FULL VERSION OF THE PROBLEM
- Every 1000 epochs (100 dialogue runs), test the current policy
- Could not get it to converge

SIMPLIFIED VERSION OF THE PROBLEM
- Let's simplify the problem (a common CS approach)
  - Removed some attributes, reduced buckets from 4 to 2, forced output and exit when only one car is left
  - Reduced the number of state-action pairs
  - Were able to use the exact user distribution for testing
- Tool to examine the degree of convergence (see the sketch below)
  - Compare results from multiple policies (existing tool)
  - Keep track of the minimum testing score seen while training a policy
  - Calculate the percentage of test sessions that are at the minimum testing score
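A small sketch of the convergence measure described in the last two bullets, under an assumed data layout: the test scores gathered at each training checkpoint are kept as one list per checkpoint.

```python
# Hedged sketch of the degree-of-convergence measure described above.
def convergence_stats(test_scores_per_checkpoint):
    """test_scores_per_checkpoint: list of lists; each inner list holds the
    scores of the test sessions run at one training checkpoint."""
    best_so_far = float("inf")
    stats = []
    for scores in test_scores_per_checkpoint:
        best_so_far = min(best_so_far, min(scores))           # minimum testing score seen so far
        at_min = sum(1 for s in scores if s == best_so_far)   # sessions at the minimum
        stats.append((best_so_far, at_min / len(scores)))     # (min score, fraction at min)
    return stats
```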

[Results table; column headers: E #, AVE, MIN, C %, SA (likely epoch number, average score, minimum score, percent converged, and state-action pairs seen). The numeric entries did not survive the transcript.]

[A second results table with the same columns: E #, AVE, MIN, C %, SA. The numeric entries did not survive the transcript.]

SIMPLIFIED VERSION OF THE PROBLEM
- Bugs found:
  - Would converge and then go out of convergence
  - Alpha rounding errors
  - Would find a minimum score within the first 10 epochs that it could never find again
  - States not yet seen in training but seen in testing must choose the same action during the test session
- Got convergence: convergence is achieved when all SA pairs have been explored

A BIT MORE COMPLEXITY
- Make the domain more complex: added back all attributes, 2 buckets, no exit constraint
- Absolute convergence was reached before all SA pairs had been seen
- [Results table; column headers: E #, AVE, MIN, C %, SA. The numeric entries did not survive the transcript.]

QTRAIN AND QTEST
- Qtrain: the Q values of an SA pair used by RL in training, obtained by following the policy and exploring
  - Should converge to Q* (the values for the optimal policy), but we never know what Q* is
- Our group has also been using Qtest: the Q values of an SA pair, determined through testing the optimal policy
- Qtest and Qtrain should both converge to Q*, and so should converge to the same value (a comparison sketch follows)
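A sketch of how Qtrain and Qtest might be compared in practice (my own formulation, assuming both are dicts keyed by (state, action)): report what fraction of shared pairs agree within a tolerance, and which pair disagrees most.

```python
# Hedged sketch: compare Qtrain and Qtest over the state-action pairs both have seen.
def q_agreement(q_train, q_test, tol=0.01):
    shared = set(q_train) & set(q_test)
    if not shared:
        return 0.0, None
    gaps = {sa: abs(q_train[sa] - q_test[sa]) for sa in shared}
    agree = sum(1 for g in gaps.values() if g <= tol)
    worst = max(gaps, key=gaps.get)            # pair with the largest disagreement
    return agree / len(shared), worst
```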

QTRAIN AND QTEST
- Comparing the two helps to show absolute convergence

NOW THE FULL VERSION
- Moved up to 4 buckets: no absolute convergence
- Noticed a difference between Qtest and Qtrain
- Can further analyze Qtest: different states "merge", and if the process is an MDP, the path taken should not matter (see the sketch below)
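A sketch of the "path should not matter" check, under an assumed observation format: group the test-time costs-to-go observed for each state-action pair by the history that led to the state; if different histories give clearly different averages, the merged states are behaving differently, which points to an MDP violation.

```python
# Hedged sketch: does the path taken to a state change its observed cost-to-go?
from collections import defaultdict

def cost_to_go_by_path(observations, gap_threshold=1.0):
    """observations: list of (history, state, action, cost_to_go) from test runs,
    where history identifies the path taken to reach the state."""
    by_path = defaultdict(lambda: defaultdict(list))
    for history, state, action, cost_to_go in observations:
        by_path[(state, action)][history].append(cost_to_go)
    suspicious = {}
    for sa, groups in by_path.items():
        means = {h: sum(v) / len(v) for h, v in groups.items()}
        # gap_threshold is an arbitrary illustration value, not from the slides
        if len(means) > 1 and max(means.values()) - min(means.values()) > gap_threshold:
            suspicious[sa] = means      # paths disagree: possible MDP violation
    return suspicious
```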

VISUALIZING MDP VIOLATION
[Diagram involving the states CYM5, CYM15, CDYM1, CDYM5, CDYM15 and the action AskDoors; the figure itself is not recoverable from the transcript.]

MDP VIOLATION
- Why is there an MDP violation? CarBuckets caused states to be treated as equal when clearly they are not.
- How can the MDP violation be removed? Keep a more accurate history.
  - Not always possible: the state space explodes; just keeping track of the order in which the questions are asked leads to ~40 million states (see the arithmetic below)
- Sutton and Barto admit that most problems are not perfect MDPs, but that RL can deal with it
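For scale, one plausible reading of the "~40 million" figure (my inference; the slide does not show the arithmetic) is the number of possible orderings of the 11 questions, 11! ≈ 39.9 million:

```python
# Hedged arithmetic: 11 factorial, one way the "~40 million states" could arise
# if every distinct order of asking the 11 questions becomes a distinct state.
import math
print(math.factorial(11))   # 39916800, roughly 40 million
```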

DISCUSSION
- Does having an MDP violation hurt you?
  - Despite non-convergence, the 4-bucket version did better than the 2-bucket version
  - RL can deal with some MDP violation (car gas-mileage analogy)

DISCUSSION
- Does this mean we don't care about MDP violations?
  - Rueckert (REU, last year) removed an MDP violation that did not increase the state space and dramatically improved the policy learned
- One should be aware of any MDP violations
  - Our tool can find them (or some of them)
  - Major MDP violations need to be fixed

ACKNOWLEDGEMENTS AND QUESTIONS
Thanks to:
- My fellow interns Pat Dickerson & Kim Basney
- Peter Heeman
- Rebecca Lunsford
- Andrew Rueckert
- Ethan Selfridge
- Everyone at OGI
Questions?