Design Principles for Creating Human-Shapable Agents
W. Bradley Knox, Ian Fasel, and Peter Stone
The University of Texas at Austin, Department of Computer Sciences

Transferring human knowledge through natural forms of communication
Potential benefits over purely autonomous learners:
– Decrease sample complexity
– Learn in the absence of a reward function
– Allow lay users to teach agents the policies that they prefer (no programming!)
– Learn in more complex domains

Shaping
Def. – creating a desired behavior by reinforcing successive approximations of that behavior
(LOOK magazine, 1952)

The Shaping Scenario (in this context)
A human trainer observes an agent and manually delivers reinforcement (a scalar value), signaling approval or disapproval.
E.g., training a dog with treats, as in the previous picture.

The Shaping Problem (for computational agents)
Within a sequential decision-making task, how can an agent harness state descriptions and occasional scalar human reinforcement signals to learn a good task policy?

Previous work on human-shapable agents
– Clicker training for entertainment agents (Blumberg et al., 2002; Kaplan et al., 2002)
– Sophie's World (Thomaz & Breazeal, 2006): RL with reward = environmental (MDP) reward + human reinforcement
– Social software agent Cobot in LambdaMOO (Isbell et al., 2006): RL with reward = human reinforcement

MDP reward vs. human reinforcement
MDP reward (within reinforcement learning):
– Key problem: credit assignment from sparse rewards
Reinforcement from a human trainer:
– Trainer has long-term impact in mind
– Reinforcement falls within a small temporal window of the targeted behavior
– The credit assignment problem is largely removed

Teaching an Agent Manually via Evaluative Reinforcement (TAMER)
TAMER approach:
– Learn a model Ĥ of the human trainer's reinforcement
– Directly exploit the model to determine the policy
If greedy: a = argmax_a Ĥ(s, a)
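A minimal sketch of that greedy exploitation, assuming a linear model over state-action features; the function and variable names here are illustrative, not taken from the actual TAMER implementation:

```python
import numpy as np

def greedy_action(weights, state, actions, featurize):
    # Predict human reinforcement H_hat(s, a) = w . phi(s, a) for every
    # candidate action, then act greedily with respect to that prediction.
    scores = [np.dot(weights, featurize(state, a)) for a in actions]
    return actions[int(np.argmax(scores))]
```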

Teaching an Agent Manually via Evaluative Reinforcement (TAMER)
Learning from targeted human reinforcement is a supervised learning problem, not a reinforcement learning problem.

Teaching an Agent Manually via Evaluative Reinforcement (TAMER)

The Shaped Agent's Perspective
Each time step, the agent:
– receives a state description
– might receive a scalar human reinforcement signal
– chooses an action
– does not receive an environmental reward signal (if learning purely from shaping)
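A rough per-step loop capturing this perspective. The environment interface (env.reset, env.step, env.poll_human_reinforcement) is hypothetical, invented for illustration; the loop reuses greedy_action from the sketch above and tamer_update from the sketch that follows:

```python
def run_shaping_episode(env, weights, actions, featurize):
    # The agent never reads an environmental reward; its only learning
    # signal is the trainer's occasional scalar reinforcement.
    state = env.reset()
    done, last_phi = False, None
    while not done:
        h = env.poll_human_reinforcement()   # scalar approval/disapproval, or None
        if h is not None and last_phi is not None:
            weights = tamer_update(weights, last_phi, h)   # credit the recent action
        action = greedy_action(weights, state, actions, featurize)
        last_phi = featurize(state, action)
        state, done = env.step(action)
    return weights
```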

Tetris
– Drop blocks to make solid horizontal lines, which then disappear
– |state space| > …
– Challenging but slow
– 21 features extracted from (s, a)
TAMER model:
– Linear model over features
– Gradient descent updates
– Greedy action selection
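A sketch of the incremental gradient-descent update such a linear model might use; the step size and names are assumptions, and the 21 Tetris features themselves are not reproduced here:

```python
import numpy as np

def tamer_update(weights, phi, h, step_size=0.001):
    # One gradient-descent step on squared error toward the trainer's signal:
    #   loss = 0.5 * (h - w . phi)^2   =>   w <- w + step_size * (h - w . phi) * phi
    error = h - np.dot(weights, phi)
    return weights + step_size * error * phi
```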

TAMER in action: Tetris
– Before training
– Training
– After training

TAMER Results: Tetris (9 subjects)

TAMER Results: Mountain Car (19 subjects)

Conjectures on how to create an agent that can be interactively shaped by a human trainer
1. For many tasks, greedily exploiting the human trainer's reinforcement function yields a good policy.
2. Modeling a human trainer's reinforcement is a supervised learning problem (not RL).
3. Exploration can be driven by negative reinforcement alone.
4. Credit assignment to a dense state-action history should …
5. A human trainer's reinforcement function is not static.
6. Human reinforcement is a function of states and actions.
7. In an MDP, human reinforcement should be treated differently from environmental reward.
8. Human trainers reinforce predicted action as well as recent action.

the end.

Mountain Car
– Drive back and forth, gaining enough momentum to get to the goal on top of the hill
– Continuous state space: velocity and position
– Simple but rapid actions
Feature extraction:
– 2D Gaussian RBFs over the velocity and position of the car
– One "grid" of RBFs per action
TAMER model:
– Linear model over RBF features
– Gradient descent updates
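A rough illustration of that featurization. The grid size, RBF width, and the use of the standard Mountain Car state ranges are assumptions for the sketch, not the values used in the experiments:

```python
import numpy as np

def rbf_features(position, velocity, action, n_actions=3, grid=8, width=0.5):
    # One grid x grid sheet of 2-D Gaussian RBFs over (position, velocity),
    # copied once per action so that each action gets its own set of weights.
    pos_centers = np.linspace(-1.2, 0.6, grid)    # standard Mountain Car position range
    vel_centers = np.linspace(-0.07, 0.07, grid)  # standard Mountain Car velocity range
    px, vx = np.meshgrid(pos_centers, vel_centers)
    centers = np.stack([px.ravel(), vx.ravel()], axis=1)
    scale = np.array([0.6 - (-1.2), 0.07 - (-0.07)])      # normalize each dimension
    state = np.array([position, velocity])
    d2 = np.sum(((state - centers) / scale) ** 2, axis=1)
    sheet = np.exp(-d2 / (2 * width ** 2))
    phi = np.zeros(n_actions * sheet.size)
    phi[action * sheet.size:(action + 1) * sheet.size] = sheet
    return phi
```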

TAMER in action: Mountain Car
– Before training
– Training
– After training

TAMER Results: Mountain Car (19 subjects)

HOW TO: Convert a basic TD-learning agent into a TAMER agent (without temporal credit assignment)
1. The underlying function approximator must be a Q-function (for state-action values).
2. Set the discount factor (gamma) to 0.
3. Make action selection fully greedy.
4. Replace the environmental reward with human reinforcement.
5. If no human input is received, make no update.
6. Remove any eligibility traces (you can just set the parameter lambda to 0).
7. Consider lowering alpha to 0.01 or less.
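The net effect of that checklist on a linear TD/Q update, sketched with assumed names; as expected, it collapses to the same supervised step as the TAMER update shown earlier:

```python
import numpy as np

def converted_td_update(weights, phi, h, alpha=0.01):
    # Original linear Q-learning step:
    #   w += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)) * phi
    # After the checklist (gamma = 0, lambda = 0, h replaces r, skip when no input),
    # the bootstrapped term vanishes and the target is just the human signal.
    if h is None:
        return weights
    error = h - np.dot(weights, phi)
    return weights + alpha * error * phi
```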

HOW TO: Convert a TD-learning agent into a TAMER agent (cont.)
With credit assignment (for more frequent time steps):
1. Save (features, human reinforcement) for each time step in a window reaching from 0.2 seconds to about 0.8 seconds before the reinforcement signal.
2. Define a probability distribution function over the window (a uniform distribution is probably fine).
3. The credit for each state-action pair is the integral of the pdf from the time of the next most recent time step to the time step for that pair.
– For the update, both the reward prediction (in place of the state-action-value prediction) used to calculate the error and the gradient for any one weight use the weighted sum, for each action, of the features in the window, where the weights are the "credit" calculated in the previous step.
– Time measurements used for credit assignment should be in real time, not simulation time.
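A sketch of that credit assignment under a uniform distribution. The 0.2 to 0.8 second window comes from the slide; the data structure, the handling of the oldest saved step, and all names are assumptions:

```python
def credited_features(history, reinf_time, window=(0.2, 0.8)):
    # history: list of (wall_clock_time, phi) pairs for recent time steps, oldest
    # first, where phi is a feature vector (e.g., a NumPy array). Returns the
    # credit-weighted sum of features for one reinforcement received at reinf_time.
    start, end = reinf_time - window[1], reinf_time - window[0]
    length = end - start                      # uniform pdf has density 1 / length here
    total = None
    for i, (t, phi) in enumerate(history):
        t_prev = history[i - 1][0] if i > 0 else start
        # credit = integral of the pdf from the previous time step to this one
        lo, hi = max(t_prev, start), min(t, end)
        credit = max(0.0, hi - lo) / length
        total = credit * phi if total is None else total + credit * phi
    return total
```

The resulting credit-weighted vector then plays the role of phi in both the error calculation and the gradient of the update.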