
Presentation transcript:

The ideals of science vs. the reality of science
Ideal: the pursuit of verifiable answers. Reality: highly cited papers for your c.v.
Ideal: the validation of our results by reproduction. Reality: convincing referees who did not see your code or data.
Ideal: an altruistic, collective enterprise. Reality: a race to outrun your colleagues in front of the giant bear of grant funding.
Credit: Fernando Pérez, PyCon 2014

Transfer Learning in RL Matthew E. Taylor and Peter Stone. Transfer Learning for Reinforcement Learning Domains: A Survey. Journal of Machine Learning Research, 10(1):1633–1685, 2009.

Inter-Task Transfer
Learning tabula rasa can be unnecessarily slow.
Humans can use past information – e.g., soccer with different numbers of players.
Agents can leverage learned knowledge in novel tasks – biasing the learner with it acts as a speedup method.

Primary Questions
Source task: (S_source, A_source). Target task: (S_target, A_target).
Is it possible to transfer learned knowledge?
Is it possible to transfer without providing a task mapping?
We only consider reinforcement learning tasks; there is already lots of work in supervised settings.

Value Function Transfer
ρ( Q_S(S_S, A_S) ) = Q_T(S_T, A_T)
The action-value function is transferred. The transfer functional ρ is task-dependent: it relies on inter-task mappings, because Q_S is not defined on S_T and A_T.
[Diagram: source-task and target-task agent–environment loops (state, action, reward), with Q_S : S_S × A_S → ℝ and Q_T : S_T × A_T → ℝ]
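As a concrete illustration of ρ, here is a minimal sketch assuming tabular Q-functions and hypothetical mapping helpers chi_x and chi_A from target to source (defined on the Inter-Task Mappings slide below); the actual Keepaway experiments transfer CMAC weights rather than table entries.

```python
# Minimal sketch of value function transfer for tabular Q-functions.
# chi_x (target state -> source state) and chi_A (target action -> source
# action) are assumed inter-task mapping helpers; this is an illustration,
# not the CMAC-weight transfer used in the Keepaway experiments.
from collections import defaultdict

def transfer_q_values(q_source, target_states, target_actions, chi_x, chi_A):
    """Initialize a target-task Q-table from a learned source-task Q-table."""
    q_target = defaultdict(float)
    for s_t in target_states:
        for a_t in target_actions:
            s_s = chi_x(s_t)          # corresponding source-task state
            a_s = chi_A(a_t)          # corresponding source-task action
            q_target[(s_t, a_t)] = q_source.get((s_s, a_s), 0.0)
    return q_target
```

The target learner then starts its TD updates from q_target instead of from an uninformed initialization.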

Enabling Transfer
[Diagram: the source-task and target-task agent–environment loops from the previous slide, with Q_S : S_S × A_S → ℝ and Q_T : S_T × A_T → ℝ]

Inter-Task Mappings
[Diagram: correspondence between the target task and the source task]

Inter-Task Mappings
χ_X : s_target → s_source – given a state variable in the target task (some x from s = x_1, x_2, …, x_n), return the corresponding state variable in the source task.
χ_A : a_target → a_source – similar, but for actions.
Intuitive mappings exist in some domains (oracle); the mappings are used to construct the transfer functional.
[Diagram: target task ⟨x_1 … x_n⟩, {a_1 … a_m} (S_target, A_target) mapped by χ_X and χ_A to source task ⟨x_1 … x_k⟩, {a_1 … a_j} (S_source, A_source)]
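As an illustration (not from the slides), hand-coded inter-task mappings over a factored state can be written as simple lookup tables; every variable and action name below is a hypothetical placeholder.

```python
# Sketch of hand-coded inter-task mappings chi_X / chi_A as lookup tables.
# All variable and action names are hypothetical placeholders.
CHI_X = {  # target state variable -> corresponding source state variable
    "x1": "x1",
    "x2": "x2",
    "x3": "x2",  # a novel target variable reuses the most similar source variable
}
CHI_A = {  # target action -> corresponding source action
    "a1": "a1",
    "a2": "a2",
    "a3": "a2",
}

def map_state(target_state: dict) -> dict:
    """Project a target-task state (variable name -> value) onto source-task variables.

    If several target variables map to the same source variable, the last one
    seen wins; a real mapping would resolve such collisions per domain.
    """
    source_state = {}
    for name, value in target_state.items():
        source_state[CHI_X[name]] = value
    return source_state

def map_action(target_action: str) -> str:
    return CHI_A[target_action]
```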

Q-value Reuse
Could be considered a type of reward shaping: directly use the source task's Q-values (expected rewards) to bias the learner in the target task.
It is not function-approximator specific, and no initialization step is needed between learning the two tasks.
Drawbacks include increased lookup time and larger memory requirements.
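One way to realize this, sketched below under the assumption of tabular Q-functions and the inter-task mappings introduced above, is to treat the mapped source Q-value as a fixed baseline to which a correction term learned in the target task is added; the extra table lookup on every evaluation is where the time and memory drawbacks come from.

```python
# Sketch of Q-value reuse: the target agent's value for (s, a) is the mapped
# source-task Q-value plus a correction term learned in the target task.
# q_source / q_new are dict-like Q-tables; chi_x / chi_A are the target->source
# mapping helpers. All interfaces here are assumptions for illustration.

def q_target(state, action, q_source, q_new, chi_x, chi_A):
    reused = q_source.get((chi_x(state), chi_A(action)), 0.0)  # frozen source estimate
    learned = q_new.get((state, action), 0.0)                  # updated by TD in the target task
    return reused + learned
```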

Example
[Images: cancer treatment and a castle attack]

Lazaric: Transfer can
– reduce the need for instances in the target task
– reduce the need for domain expert knowledge
What can change between tasks:
– state space, state features
– actions (A), transitions (T), rewards (R), goal state
– learning method

Transfer Evaluation Metrics
Set a threshold performance (here 8.5) that the majority of agents can achieve with learning.
[Plot: performance vs. training time for the target task without transfer, the target task with transfer, and target + source time with transfer, with the threshold marked]
Two distinct scenarios:
1. Target Time Metric: successful if the target-task learning time is reduced. The source task's "sunk cost" is ignored and the source task(s) are independently useful. AI goal: effectively utilize past knowledge; only the target matters.
2. Total Time Metric: successful if the total (source + target) time is reduced. The source task(s) are not independently useful. Engineering goal: minimize total training.

The previous metric measures "learning speed improvement." Transfer can also give:
– asymptotic improvement (better final performance)
– jumpstart improvement (better initial performance)
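As an illustration of how these metrics might be computed from logged learning curves (this code is not from the talk; the array layout is an assumption):

```python
import numpy as np

def time_to_threshold(performance, threshold):
    """Index of the first evaluation at which the learning curve reaches the threshold.

    `performance` is a 1-D array of per-evaluation scores; returns None if never reached.
    """
    hits = np.nonzero(performance >= threshold)[0]
    return int(hits[0]) if hits.size else None

def jumpstart(perf_transfer, perf_no_transfer):
    """Difference in initial performance (first evaluation) due to transfer."""
    return perf_transfer[0] - perf_no_transfer[0]

def asymptotic_gain(perf_transfer, perf_no_transfer, tail=10):
    """Difference in final performance, averaged over the last `tail` evaluations."""
    return perf_transfer[-tail:].mean() - perf_no_transfer[-tail:].mean()

# Total time metric: the source training time counts too, e.g.
# total_time = source_training_time + time_to_threshold(target_curve, threshold)
```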

Value Function Transfer: Time to Threshold in 4 vs. 3
[Bar chart: target-task time and total (source + target) time with transfer, compared to learning 4 vs. 3 without transfer]

Results: Scaling Up to 5 vs. 4
Transfer reaches the threshold in 47% of the no-transfer time; all comparisons are statistically significant relative to no transfer.

Problem Statement
Humans can select a training sequence, resulting in faster training / better performance.
This is a meta-planning problem for agent learning:
– Is the final task known or unknown?
– Can the sequencing problem itself be posed as an MDP?

TL vs. multi-task vs. lifelong learning vs. generalization vs. concept drift
Examples: Learning from Easy Missions; changing the length of the cart-pole.

Keepaway [Stone, Sutton, and Kuhlmann 2005]
3 vs. 2: Goal: maintain possession of the ball. 5 agents, 3 (stochastic) actions, 13 (noisy & continuous) state variables. The keeper with the ball may hold it or pass to either teammate; both takers move towards the player with the ball.
4 vs. 3: 7 agents, 4 actions, 19 state variables.
[Diagram: keepers K1–K3 and takers T1–T2 on the field]

Learning Keepaway
Sarsa update – CMAC, RBF, and neural network function approximation have all been successful.
Q^π(s,a): predicted number of steps the episode will last – reward = +1 for every timestep.
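For reference, a minimal sketch of the tabular Sarsa update used here; the Keepaway agents actually use CMAC/RBF/neural-network approximation over continuous state variables, and the step size below is a placeholder.

```python
# Minimal tabular Sarsa update; Keepaway uses linear function approximation
# (CMAC/RBF tilings) over 13 continuous state variables rather than a table.
ALPHA, GAMMA = 0.1, 1.0  # placeholder step size; episodic task, so gamma = 1 is usable

def sarsa_update(q, s, a, reward, s_next, a_next, done):
    """One on-policy TD(0) update: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))."""
    q_sa = q.get((s, a), 0.0)
    target = reward if done else reward + GAMMA * q.get((s_next, a_next), 0.0)
    q[(s, a)] = q_sa + ALPHA * (target - q_sa)
```

With reward +1 per timestep and γ = 1, Q(s,a) estimates the number of steps remaining in the episode, as the slide states.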

ρ's Effect on CMACs
[Diagram: the 4 vs. 3 and 3 vs. 2 CMAC function approximators]
For each weight in the 4 vs. 3 function approximator, use the inter-task mapping to find the corresponding 3 vs. 2 weight.

Keepaway Hand-coded χ_A
Hold (4v3) → Hold (3v2)
Pass1 (4v3) → Pass1 (3v2)
Pass2 (4v3) → Pass2 (3v2)
Pass3 (4v3) → Pass2 (3v2)
Actions in 4 vs. 3 have "similar" actions in 3 vs. 2.
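This hand-coded action mapping is small enough to write out directly; a sketch (the string identifiers are illustrative):

```python
# Hand-coded chi_A for Keepaway: 4 vs. 3 actions -> "similar" 3 vs. 2 actions.
CHI_A_4V3_TO_3V2 = {
    "Hold":  "Hold",
    "Pass1": "Pass1",
    "Pass2": "Pass2",
    "Pass3": "Pass2",  # the novel third pass maps onto the most similar source action
}
```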

Value Function Transfer: Time to Threshold in 4 vs. 3
[Bar chart: target-task time and total (source + target) time with transfer, compared to learning 4 vs. 3 without transfer]

Example Transfer Domains Series of mazes with different goals [Fernandez and Veloso, 2006] Mazes with different structures [Konidaris and Barto, 2007]

Example Transfer Domains Series of mazes with different goals [Fernandez and Veloso, 2006] Mazes with different structures [Konidaris and Barto, 2007] Keepaway with different numbers of players [Taylor and Stone, 2005] Keepaway to Breakaway [Torrey et al, 2005]

Example Transfer Domains
Series of mazes with different goals [Fernandez and Veloso, 2006]
Mazes with different structures [Konidaris and Barto, 2007]
Keepaway with different numbers of players [Taylor and Stone, 2005]
Keepaway to Breakaway [Torrey et al., 2005]
All of these tasks are drawn from the same domain (task: an MDP; domain: a setting for semantically similar tasks).
What about cross-domain transfer? The source task could be much simpler, showing that source and target can be far less similar.

Source Task: Ringworld
Ringworld – Goal: avoid being tagged. 2 agents, 3 actions, 7 state variables. Fully observable, discrete state space (Q-table with ~8,100 (s,a) pairs), stochastic actions. The opponent moves directly towards the player; the player may stay or run towards a pre-defined location.
Target: 3 vs. 2 Keepaway – Goal: maintain possession of the ball. 5 agents, 3 actions, 13 state variables. Partially observable, continuous state space, stochastic actions.
[Diagram: keepers K1–K3 and takers T1–T2]

Source Task: Knight's Joust
Knight's Joust – Goal: travel from the start to the goal line. 2 agents, 3 actions, 3 state variables. Fully observable, discrete state space (Q-table with ~600 (s,a) pairs), deterministic actions. The opponent moves directly towards the player; the player may move North, or take a knight's jump to either side.
Target: 3 vs. 2 Keepaway – Goal: maintain possession of the ball. 5 agents, 3 actions, 13 state variables. Partially observable, continuous state space, stochastic actions.
[Diagram: keepers K1–K3 and takers T1–T2]

Rule Transfer Overview
1. Learn a policy (π : S → A) in the source task – TD, policy search, model-based, etc.
2. Learn a decision list, D_source, summarizing π.
3. Translate D_source → D_target (applies to the target task) – state variables and actions can differ in the two tasks.
4. Use D_target to learn a policy in the target task.
This allows for different learning methods and function approximators in the source and target tasks.
[Flowchart: Learn π → Learn D_source → Translate(D_source) → D_target → Use D_target]

Rule Transfer Details (step 1: Learn π)
[Flowchart with "Learn π" highlighted; diagram of the source-task agent–environment loop (state, action, reward)]
In this work we use Sarsa: Q : S × A → return. Other learning methods are possible.

Rule Transfer Details (step 2: Learn D_source)
[Flowchart with "Learn D_source" highlighted; diagram of the agent–environment loop feeding a table of recorded (state, action) pairs]
Use the learned policy to record (state, action) pairs, then use JRip (RIPPER in Weka) to learn a decision list, e.g.:
IF s_1 > 5 → a_1
ELSEIF s_1 < 3 → a_2
ELSEIF s_3 > 7 → a_1
…
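A sketch of this step: the slides use JRip (RIPPER) in Weka, so the scikit-learn decision tree below is only a stand-in rule learner, and the environment/policy interfaces are hypothetical.

```python
# Record (state, action) pairs from the learned source policy, then fit a
# rule-like classifier. The talk uses JRip (RIPPER) in Weka; a shallow decision
# tree is used here only as a readily available stand-in.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def record_pairs(env, policy, n_episodes=100):
    states, actions = [], []
    for _ in range(n_episodes):
        s, done = env.reset(), False           # hypothetical env API
        while not done:
            a = policy(s)                      # greedy action from the learned source policy
            states.append(s)
            actions.append(a)
            s, _, done = env.step(a)           # hypothetical: (next_state, reward, done)
    return np.array(states), np.array(actions)

def learn_decision_rules(states, actions):
    # A shallow tree yields compact, human-readable rules approximating the policy.
    return DecisionTreeClassifier(max_depth=4).fit(states, actions)
```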

Rule Transfer Details (step 3: Translate D_source → D_target)
[Flowchart with "Translate" highlighted]
Inter-task mappings:
χ_X : s_target → s_source – given a state variable in the target task (some x from s = x_1, x_2, …, x_n), return the corresponding state variable in the source task.
χ_A : a_target → a_source – similar, but for actions.
[Diagram: rule → rule′ translated via χ_X and χ_A]

Rule Transfer Details (step 3 continued: translation example)
Action correspondence χ_A: Stay ↔ Hold Ball, Run Near ↔ Pass to K_2, Run Far ↔ Pass to K_3.
State-variable correspondence χ_X: dist(Player, Opponent) ↔ dist(K_1, T_1), …
Example translation: IF dist(Player, Opponent) > 4 → Stay becomes IF dist(K_1, T_1) > 4 → Hold Ball.
[Diagram: keepers K1–K3 and takers T1–T2]
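A sketch of how a rule might be translated using the correspondences above; the rule representation and helper names are invented for illustration (the talk's rules come from a RIPPER decision list).

```python
# Translate a source-task rule into target-task terms by renaming state
# variables and actions. The tuple-based rule representation is invented
# purely for illustration.
STATE_MAP = {"dist(Player, Opponent)": "dist(K1, T1)"}                      # source var -> target var
ACTION_MAP = {"Stay": "HoldBall", "RunNear": "PassToK2", "RunFar": "PassToK3"}

def translate_rule(rule):
    """rule = (variable, comparator, threshold, action), e.g. ('dist(Player, Opponent)', '>', 4, 'Stay')."""
    var, op, thresh, action = rule
    return STATE_MAP.get(var, var), op, thresh, ACTION_MAP.get(action, action)

print(translate_rule(("dist(Player, Opponent)", ">", 4, "Stay")))
# -> ('dist(K1, T1)', '>', 4, 'HoldBall')
```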

Rule Transfer Details (step 4: Use D_target)
[Flowchart with "Use D_target" highlighted]
There are many possible ways to use D_target: Value Bonus, Extra Action, Extra Variable. We assume a TD learner in the target task; this should generalize to other learning methods.
Example: evaluate the agent's 3 actions in state s = (s_1, s_2): Q(s_1, s_2, a_1) = 5, Q(s_1, s_2, a_2) = 3, Q(s_1, s_2, a_3) = 4, and the rules recommend D_target(s) = a_2.
– Value Bonus: add a bonus to the Q-value of the recommended action (a form of shaping), e.g. raising the recommended action's value from 3 to 9.
– Extra Action: add a fourth action a_4 that executes the rules' recommendation (here, take action a_2), e.g. Q(s_1, s_2, a_4) = 7; initially force the agent to select it.
– Extra Variable: add the recommendation as an extra state variable, evaluating Q(s_1, s_2, s_3, a_i).
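A sketch of the Value Bonus variant, the simplest of the three; the bonus size and the q/rules interfaces are assumptions rather than values from the talk.

```python
# Sketch of the "Value Bonus" way of using the transferred decision list:
# when selecting actions, add a fixed bonus to the Q-value of the action the
# translated rules recommend. BONUS is a placeholder hyperparameter.
BONUS = 5.0

def select_action(q, state, actions, d_target):
    """Greedy selection over Q-values shaped by the rules' recommendation."""
    recommended = d_target(state)  # action suggested by the translated decision list
    def shaped(a):
        return q.get((state, a), 0.0) + (BONUS if a == recommended else 0.0)
    return max(actions, key=shaped)
```

The Extra Action variant would instead append a pseudo-action that executes d_target(state) when chosen, and the Extra Variable variant would append d_target(state) to the state before the Q lookup.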

Comparison of Rule Transfer Methods
[Learning curves: without transfer, only following the rules, and the Extra Variable, Extra Action, and Value Bonus methods, using rules from 5 hours of source-task training]

Results: Extra Action
[Learning curves: without transfer vs. Extra Action transfer from Ringworld (20,000 source episodes) and from Knight's Joust (50,000 source episodes)]