Behaviorist Psychology (R+, R−, P+, P−): B. F. Skinner’s operant conditioning.

Presentation transcript:

Behaviorist Psychology (R+, R−, P+, P−): B. F. Skinner’s operant conditioning

Behaviorist Psychology (R+, R−, P+, P−): B. F. Skinner’s operant conditioning / behaviorist shaping
Reward schedule:
– Frequency
– Randomized

In topology, two continuous functions from one topological space to another are called homotopic if one can be “continuously deformed” into the other, such a deformation being called a homotopy between the two functions. What does shaping mean for computational reinforcement learning? T. Erez and W. D. Smart, 2008

According to Erez and Smart, there are (at least) six ways of thinking about shaping from a homotopic standpoint. What can you come up with?

Modify the:
– Reward: successive approximations, subgoaling
– Dynamics: physical properties of the environment or of the agent (changing the length of the pole, changing the amount of noise)
– Internal parameters: simulated annealing, a schedule for the learning rate, complexification in NEAT
– Initial state: learning from easy missions
– Action space: infants lock their arms; an abstracted (simpler) action space
– Extending the time horizon: POMDPs, decreasing the discount factor, the pole-balancing time horizon
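Several of these dimensions reduce to scheduling a task parameter from an easy setting toward the target setting. A minimal sketch of that idea follows; the environment constructor, training function, and parameter values are illustrative assumptions, not something from the slides:

```python
# Hypothetical curriculum: train on progressively harder task variants.
# `make_env` and `train_agent` are placeholders for your environment
# constructor and learning loop.

def shaping_curriculum(make_env, train_agent, episodes_per_stage=500):
    # Stage 0 is the easiest variant; the last stage is the target task.
    stages = [
        {"pole_length": 1.5, "noise": 0.00},   # long pole, no noise: easy
        {"pole_length": 1.0, "noise": 0.01},
        {"pole_length": 0.5, "noise": 0.05},   # target dynamics
    ]
    agent = None
    for params in stages:
        env = make_env(**params)
        # Keep training the same agent as the task gets harder.
        agent = train_agent(env, agent=agent, episodes=episodes_per_stage)
    return agent
```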

The infant development timeline and its application to robot shaping (J. Law et al., 2011). Bio-inspired vs. bio-mimicking vs. computational understanding.

Potential-based reward shaping (Sam Devlin)
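The core idea: add a shaping term F(s, s') = γΦ(s') − Φ(s) to the environment reward, which leaves the optimal policy unchanged (Ng, Harada, and Russell, 1999). Below is a minimal tabular sketch, assuming a simple env.reset()/env.step()/env.actions interface and a user-supplied potential function; both interfaces are assumptions for illustration:

```python
import random
from collections import defaultdict

def shaped_q_learning(env, potential, episodes=1000,
                      alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning where the shaping term
    F(s, s') = gamma * potential(s') - potential(s)
    is added to the environment reward."""
    Q = defaultdict(float)

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda x: Q[(s, x)])

            s2, r, done = env.step(a)

            # Potential-based shaping; Phi(terminal) is taken to be 0
            # so that the optimal policy is preserved (Ng et al., 1999).
            phi_next = 0.0 if done else potential(s2)
            shaped_r = r + gamma * phi_next - potential(s)

            target = shaped_r
            if not done:
                target += gamma * max(Q[(s2, x)] for x in env.actions)

            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```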

Other recent advances:
– Harutyunyan, A., Devlin, S., Vrancx, P., & Nowé, A. Expressing Arbitrary Reward Functions as Potential-Based Advice. AAAI-15.
– Brys, T., Harutyunyan, A., Taylor, M. E., & Nowé, A. Policy Transfer using Reward Shaping. AAMAS-15.
– Harutyunyan, A., Brys, T., Vrancx, P., & Nowé, A. Multi-Scale Reward Shaping via an Off-Policy Ensemble. AAMAS-15.
– Brys, T., Nowé, A., Kudenko, D., & Taylor, M. E. Combining Multiple Correlated Reward and Shaping Signals by Measuring Confidence. AAAI-14.

Learning from feedback (interactive shaping): TAMER (Knox and Stone, K-CAP 2009)
Key insight: the trainer evaluates behavior using a model of its long-term quality.

Learning from feedback (interactive shaping): TAMER (Knox and Stone, K-CAP 2009)
– Learn a model Ĥ of human reinforcement
– Directly exploit the model to determine the action
– If greedy: a = argmax_a Ĥ(s, a)
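A minimal sketch of that loop, assuming a discrete action set and a generic incremental regressor for Ĥ (the class, feature encoding, and use of scikit-learn are illustrative choices, not Knox and Stone's implementation):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

class TamerStyleAgent:
    """Learns a model H_hat(s, a) of human reinforcement and acts greedily
    on it. A sketch of the TAMER idea, not the original implementation."""

    def __init__(self, n_actions, n_state_features):
        self.n_actions = n_actions
        self.model = SGDRegressor(learning_rate="constant", eta0=0.01)
        # Initialize the regressor with a dummy update so predict() works.
        dummy = np.zeros((1, n_state_features + n_actions))
        self.model.partial_fit(dummy, [0.0])

    def _features(self, state, action):
        one_hot = np.zeros(self.n_actions)
        one_hot[action] = 1.0
        return np.concatenate([state, one_hot]).reshape(1, -1)

    def act(self, state):
        # Greedy with respect to predicted human reinforcement.
        preds = [self.model.predict(self._features(state, a))[0]
                 for a in range(self.n_actions)]
        return int(np.argmax(preds))

    def update(self, state, action, human_feedback):
        # Treat the human signal as the regression target for (s, a).
        self.model.partial_fit(self._features(state, action), [human_feedback])
```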

Learning from feedback (interactive shaping): TAMER
Or, could try to maximize both R and H…. How would you combine them?
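One simple possibility, sketched below under the assumption that Q and H_hat are callables over state-action pairs: bias action selection by a weighted sum of the environment value estimate and the predicted human reinforcement. The weighting scheme is an illustration, not a prescription from the slides.

```python
def combined_greedy_action(actions, Q, H_hat, state, weight=1.0):
    """Pick the action maximizing the environment value estimate Q(s, a)
    plus a weighted prediction of human reinforcement H_hat(s, a).
    The weight can be annealed toward 0 to hand control back to the
    MDP reward as the agent improves."""
    return max(actions, key=lambda a: Q(state, a) + weight * H_hat(state, a))
```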

An a priori comparison
– Interface: the LfD interface may be familiar to video game players; the LfF interface is simpler and task-independent; a demonstration points more specifically to the correct action
– Task expertise: for LfF, it is easier to judge than to control; it is easier for the human to increase expertise while training with LfD
– Cognitive load: less for LfF

reward/tamer/tamer-in-action/

Other recent work: Guangliang Li, Hayley Hung, Shimon Whiteson, and W. Bradley Knox. Using Informative Behavior to Increase Engagement in the TAMER Framework. AAMAS 2013.

Bayesian Inference Approach
– TAMER and COBOT treat feedback as numeric rewards
– Here, feedback is categorical
– Use a Bayesian approach: find the maximum a posteriori (MAP) estimate of the target behavior

Goal
– The human can give positive or negative feedback
– The agent tries to learn a policy λ* that maps observations to actions
– For now: think contextual bandit

Example: Dog Training
– Teach the dog to sit & shake
– λ*: a mapping from observations to actions
– Feedback: {“Bad Dog”, “Good Boy”}
– “Sit” → sit, “Shake” → shake

History in Dog Training
Feedback history h:
– Observation: “sit”, Action: …, Feedback: “Bad Dog”
– Observation: “sit”, Action: …, Feedback: “Good Boy”
– …
Does it really make sense to assign numeric rewards to these?

Bayesian Framework
– The trainer desires a policy λ*
– h_t is the training history at time t
– Find the MAP hypothesis of λ*:  λ̂ = argmax_λ p(h_t | λ) p(λ), where p(h_t | λ) is the model of the training process and p(λ) is the prior distribution over policies

Learning from Feedback
– λ* is a mapping from observations to actions
– At time t:
  – the agent observes o_t
  – the agent takes action a_t
  – the trainer gives feedback l_t

– l⁺ is positive feedback (reward)
– l⁰ is no feedback (neutral)
– l⁻ is negative feedback (punishment)
– p_{o,a}: number of positive feedbacks for action a, observation o
– n_{o,a}: number of negative feedbacks for action a, observation o
– μ_{o,a}: number of neutral feedbacks for action a, observation o

Assumed trainer behavior
– Decide whether the action is correct: does λ*(o) = a? The trainer makes an error with probability ε
– Decide whether to give feedback: μ⁺ and μ⁻ are the probabilities of giving neutral feedback
  – If the trainer thinks the action is correct, give positive feedback with probability 1 − μ⁺
  – If the trainer thinks the action is incorrect, give negative feedback with probability 1 − μ⁻
– These probabilities could depend on the trainer

Feedback Probabilities
Under the assumed trainer behavior, the probability of feedback l_t at time t is:
– if λ(o_t) = a_t:  p(l⁺) = (1 − ε)(1 − μ⁺),  p(l⁻) = ε(1 − μ⁻),  p(l⁰) = (1 − ε)μ⁺ + εμ⁻
– if λ(o_t) ≠ a_t:  p(l⁺) = ε(1 − μ⁺),  p(l⁻) = (1 − ε)(1 − μ⁻),  p(l⁰) = εμ⁺ + (1 − ε)μ⁻
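A direct transcription of that model as a likelihood function; this is a sketch in which the policy is assumed to be a dict from observations to actions and the parameter names follow the slides:

```python
def feedback_likelihood(label, obs, action, policy, eps, mu_plus, mu_minus):
    """p(l | o, a, lambda) under the assumed trainer model: the trainer
    judges correctness with error rate eps, then gives explicit feedback
    with probability 1 - mu_plus (if judged correct) or 1 - mu_minus
    (if judged incorrect), and otherwise stays neutral."""
    correct = (policy[obs] == action)
    p_judged_correct = (1 - eps) if correct else eps

    if label == "+":
        return p_judged_correct * (1 - mu_plus)
    if label == "-":
        return (1 - p_judged_correct) * (1 - mu_minus)
    if label == "0":
        return p_judged_correct * mu_plus + (1 - p_judged_correct) * mu_minus
    raise ValueError("label must be '+', '-', or '0'")
```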

Fixed Parameter (Bayesian) Learning Algorithm
– Assume μ⁺ = μ⁻ and fixed, so neutral feedback does not affect the inference
– The inference becomes: λ̂(o) = argmax_a (p_{o,a} − n_{o,a})

Bayesian Algorithm
– Initially, assume all policies are equally probable, so the maximum likelihood hypothesis is the maximum a posteriori hypothesis
– Given the training history up to time t, the equation on the previous slide follows from the general statement p(h_t | λ) = ∏_{i ≤ t} p(l_i | o_i, a_i, λ), because:
  – the action depends only on the current observation
  – the error rate ε is between 0 and 0.5, so the ε-dependent factors cancel out of the argmax
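A sketch of that simplified inference with explicit counts, assuming discrete observations and actions (the class and data structures are illustrative assumptions):

```python
from collections import defaultdict

class FixedParamFeedbackLearner:
    """Fixed-parameter case (mu+ = mu-): neutral feedback is ignored and the
    MAP policy at each observation is the action with the most net positive
    feedback, argmax_a (p[o,a] - n[o,a])."""

    def __init__(self):
        self.pos = defaultdict(int)   # p[o, a]
        self.neg = defaultdict(int)   # n[o, a]

    def record(self, obs, action, label):
        if label == "+":
            self.pos[(obs, action)] += 1
        elif label == "-":
            self.neg[(obs, action)] += 1
        # "0" (neutral) does not affect the inference in this case.

    def map_action(self, obs, actions):
        return max(actions,
                   key=lambda a: self.pos[(obs, a)] - self.neg[(obs, a)])
```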

Inferring Neutral
– Try to learn μ⁺ and μ⁻; don't assume they are equal
– Many trainers don't use punishment, so neutral feedback could mean punishment
– Some don't use reward, so neutral feedback could mean reward

EM step
– λ_i is the i-th estimate of the maximum likelihood hypothesis; each EM iteration produces λ_{i+1} from λ_i
– This can (eventually) be simplified to an expression with two coefficients:
  – α has to do with the value of neutral feedback (relative to |β|)
  – β is negative when neutral feedback implies punishment and positive when it implies reward
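To give the flavour of the inference, here is a simplified alternating-maximization (hard-EM style) sketch: re-estimate μ⁺/μ⁻ from the current policy guess, then re-fit the policy under the feedback model. This is only an illustration of the idea, not the paper's EM derivation; the function names, initialization, and the way the rates are re-estimated are all assumptions.

```python
import math
from collections import defaultdict

def infer_policy_and_rates(history, observations, actions, eps=0.2, iters=10):
    """history: list of (obs, action, label) with label in {'+', '-', '0'}.
    Alternates between estimating mu+/mu- from the current policy guess
    and re-estimating the policy given those rates."""

    def likelihood(label, correct, mu_p, mu_m):
        pc = (1 - eps) if correct else eps      # P(trainer judges "correct")
        if label == "+":
            return pc * (1 - mu_p)
        if label == "-":
            return (1 - pc) * (1 - mu_m)
        return pc * mu_p + (1 - pc) * mu_m      # neutral

    # Initialize from the fixed-parameter estimate (net positive feedback).
    net = defaultdict(int)
    for o, a, l in history:
        net[(o, a)] += {"+": 1, "-": -1, "0": 0}[l]
    policy = {o: max(actions, key=lambda a: net[(o, a)]) for o in observations}
    mu_p, mu_m = 0.5, 0.5

    for _ in range(iters):
        # Step 1: how often does the trainer stay neutral after actions the
        # current policy guess marks correct vs. incorrect?
        neutral_corr = total_corr = neutral_wrong = total_wrong = 0.0
        for o, a, l in history:
            if policy[o] == a:
                total_corr += 1
                neutral_corr += (l == "0")
            else:
                total_wrong += 1
                neutral_wrong += (l == "0")
        mu_p = neutral_corr / max(total_corr, 1.0)
        mu_m = neutral_wrong / max(total_wrong, 1.0)

        # Step 2: re-estimate the policy given mu+/mu-.
        for o in observations:
            policy[o] = max(
                actions,
                key=lambda cand: sum(
                    math.log(likelihood(l, cand == a, mu_p, mu_m) + 1e-12)
                    for (obs, a, l) in history if obs == o))
    return policy, mu_p, mu_m
```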

User Study

Comparisons
– Sim-TAMER: numerical reward function; zero (neutral) feedback ignored; no delay assumed
– Sim-COBOT: similar to Sim-TAMER, but does not ignore zero rewards

Categorical Feedback outperforms Numeric Feedback

Leveraging Neutral Improves Performance

More recent work
– Sequential tasks
– Learning language simultaneously
– How do humans create a sequence of tasks?

Poll: What’s next?