RL via Practice and Critique Advice
Kshitij Judah, Saikat Roy, Alan Fern and Tom Dietterich

Presentation transcript:

PROBLEM: RL takes a long time to learn a good policy.

[Diagram: the agent acts in the environment (state, action, reward) while a teacher observes its behavior and provides advice.]

RESEARCH QUESTION: Can we make RL perform better with some outside help, such as critique/advice from a teacher, and if so, how?

DESIDERATA:
- Non-technical users as teachers
- Natural interaction methods

PRIOR APPROACHES:
- High-level rules as advice for RL:
  - Programming-language constructs (Maclin and Shavlik 1996) and rules about action and utility preferences (Maclin et al. 2005)
  - Logical rules derived from a constrained natural language (Kuhlmann et al. 2004)
- Learning by Demonstration (LBD):
  - The user provides full demonstrations of a task that the agent can learn from (Billard et al. 2008).
  - More recent work (Coates, Abbeel, and Ng 2008) adds model learning to improve on the demonstrations but does not allow users to provide feedback.
  - Argall, Browning, and Veloso (2007; 2008) combine LBD with human critiques of behavior (similar to our work here), but there is no autonomous practice.
- Real-time feedback from the user:
  - The TAMER framework (Knox and Stone 2009) uses a form of supervised learning to predict the user's reward signal and then selects actions that maximize the predicted reward.
  - Thomaz and Breazeal (2008) instead combine the end-user reward signal with the environmental reward and use Q-learning.

OUR APPROACH:
[Diagram: the agent alternates between Get Critique (from the teacher, via the advice interface, yielding critique data C) and Get Experience (practice in the simulator, yielding trajectory data T), interleaved with Learn and Act steps.]
- Critique data C: states in which the teacher has labeled agent actions as good or bad.
- Trajectory data T: episodes generated by the agent while practicing in the simulator.
- Features:
  - Allows both feedback and guidance advice
  - Allows practice
  - A novel approach to learning from critique advice and practice
[Screenshots: the advice interface and the simulator.]

Problem: Given data sets T and C, how can we update the agent's policy so as to maximize its reward?
- Approach: optimization using gradient ascent.
- Design questions: How do we pick the value of lambda? What are the forms of the utility term U and the likelihood term L? (See the equation sketch below.)
[Diagram: Choose -> Simulator -> Estimate Utility.]

Learning from critique data:
- O(s): the set of all optimal actions in state s. Any action not in O(s) is suboptimal; all actions within O(s) are equally good.
- Learning goal: find a probabilistic policy that has a high probability of returning an action in O(s) when applied to s. It is not important which action is selected, as long as the probability of selecting an action in O(s) is high.
- We call this problem Any Label Learning (ALL).
- ALL likelihood: the probability, taken over the critiqued states, that the policy selects an action in O(s) (see the equation sketch below).
- The Multi-Label Learning problem (Tsoumakas and Katakis 2007) differs in that its goal is to learn a classifier that outputs all of the labels in each set and no others.
- Ideal Teacher vs. Reality: an ideal teacher does not exist!
- Key idea: define a user model that induces a distribution over ALL problems.
- User model: a distribution over the sets O(s) given the critique data; we assume independence among different states.
- We introduce two noise parameters and one bias parameter (the probability that an unlabeled action is in O(s)).
- Expected ALL likelihood: the expectation of the ALL likelihood under the user model.
- The expected likelihood has a closed form (see the equation sketch below).

Our Domain: RTS tactical micro-management
- 5 friendly footmen versus 5 enemy footmen (Wargus AI).
- Difficulty: fast pace, with multiple units acting in parallel.
- Our setup: end-users are provided with an interface that allows them to watch a battle and pause at any moment. The user can then scroll back and forth within the episode and mark any possible action of any agent as good or bad.
- The available actions for each military unit are to attack any of the units on the map (enemy or friendly), giving a total of 9 actions per unit.
- Two battle maps were used, which differed only in the initial placement of the units.
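The rendered equations from the poster are not reproduced in the transcript. The forms below are a plausible reconstruction from the surrounding definitions, not the authors' exact notation; the symbols \theta (policy parameters), \lambda as the trade-off weight in a combined objective, and P(a \in O(s) \mid C) (the user model's marginal probability that action a is optimal in state s) are assumptions introduced here for illustration.

\max_\theta \; J(\theta) \;=\; \hat{U}(\theta, T) \;+\; \lambda \, \log L(\theta, C)

\text{ALL likelihood:}\qquad L(\theta, C) \;=\; \prod_{s \in C} \sum_{a \in O(s)} \pi_\theta(a \mid s)

\text{Expected ALL likelihood:}\qquad \mathbb{E}\big[L(\theta, C)\big] \;=\; \prod_{s \in C} \mathbb{E}\Big[\sum_{a \in O(s)} \pi_\theta(a \mid s)\Big] \;=\; \prod_{s \in C} \sum_{a} P\big(a \in O(s) \mid C\big)\, \pi_\theta(a \mid s)

Under the independence assumption the expectation moves inside the product, and within each state the expected sum reduces to a weighted sum over all actions, where each action's weight is its probability of being optimal under the user model (a function of the two noise parameters for labeled actions and the bias parameter for unlabeled actions). This is one way the closed form mentioned above can arise.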
Evaluated 3 learning systems:
- Pure RL: practice only
- Pure Supervised: critique only
- Combined System: critique + practice

Offline evaluation:
- Goal: test learning capability with varying amounts of critique and practice data.
- Two users per map.
- For each user, the critique data was divided into 4 equal-sized segments, creating four data sets per user containing 25%, 50%, 75%, and 100% of that user's critique data.
- The combined system was given each of these data sets and allowed to practice for 100 episodes.
[Figures: results for Map 1 and Map 2; screenshot of the advice interface.]

User study:
- The user study involved 10 end-users: 6 with CS backgrounds and 4 without.
- For each user, the study consisted of teaching both the pure supervised and the combined system, each on a different map, for a fixed amount of time (Supervised: 30 minutes; Combined: 60 minutes).
- These results show that the users were able to significantly outperform pure RL using both the supervised and the combined system.
- The end-users had slightly greater success with the pure supervised system than with the combined system:
  - A large delay was experienced while waiting for the practice stages to end, which users found frustrating.
  - The policy returned by practice was sometimes poor and ignored the advice.
- Lesson learned: such behavior is detrimental to the user experience and to overall performance.
- Future work:
  - A better-behaved combined system
  - Studies in which users are not captive during the practice stages
[Chart: fraction of positive, negative, and mixed advice under the supervised and combined systems.]
- Positive (or negative) advice is where the user only gives feedback on the action taken by the agent.
- Mixed advice is where the user not only gives feedback on the agent's action but also suggests alternative actions.

Utility estimation:
- We use likelihood weighting to estimate the utility U(π, T) of a policy π from off-policy trajectories T (Peshkin and Shelton 2002). Each trajectory's return is reweighted by the ratio of its probability under the current policy to its probability under the parameters of the policy that generated it, which yields an unbiased utility estimate (sketched below).
- The gradient of this estimate has a compact closed form.
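The estimator and its gradient are likewise missing from the transcript. A standard likelihood-weighting (importance-sampling) form consistent with the description and with Peshkin and Shelton (2002) is sketched below; \theta' (the parameters of the policy that generated the trajectories), R(\tau) (the total reward of trajectory \tau), and the weight w_\theta are notation introduced here, not taken from the original.

\hat{U}(\theta, T) \;=\; \frac{1}{|T|} \sum_{\tau \in T} R(\tau)\, w_\theta(\tau),
\qquad
w_\theta(\tau) \;=\; \prod_{t} \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)}

The environment's transition probabilities cancel in the ratio, so the weight depends only on the two policies. Differentiating term by term gives a compact closed-form gradient of the kind referred to above:

\nabla_\theta \hat{U}(\theta, T) \;=\; \frac{1}{|T|} \sum_{\tau \in T} R(\tau)\, w_\theta(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)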