POMDPs: 5 Reward Shaping: 4 Intrinsic RL: 4 Function Approximation: 3

Evaluation Metrics
Asymptotic improvement
Jumpstart improvement
Speed improvement
– Total reward
– Slope of line
– Time to threshold
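A minimal sketch of how these metrics might be computed from two recorded learning curves; the function name, the per-episode-return representation, and the 10% tail window are illustrative assumptions:

```python
import numpy as np

def transfer_metrics(no_transfer, with_transfer, threshold):
    """Compare two learning curves (1-D arrays of average return per episode)."""
    # Jumpstart: initial performance gap before target-task learning.
    jumpstart = with_transfer[0] - no_transfer[0]

    # Asymptotic improvement: gap in final performance, approximated here
    # by the mean of the last 10% of episodes.
    tail = max(1, len(no_transfer) // 10)
    asymptotic = with_transfer[-tail:].mean() - no_transfer[-tail:].mean()

    # Total reward: area under each learning curve.
    total_reward = with_transfer.sum() - no_transfer.sum()

    # Time to threshold: first episode at which a curve reaches the threshold.
    def time_to(curve):
        hits = np.nonzero(curve >= threshold)[0]
        return int(hits[0]) if hits.size else None

    return {"jumpstart": float(jumpstart),
            "asymptotic": float(asymptotic),
            "total_reward": float(total_reward),
            "time_to_threshold": (time_to(no_transfer), time_to(with_transfer))}
```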

Time to Threshold
(Plot: learning curves for Target: no Transfer, Target: with Transfer, and Target + Source: with Transfer)
Two distinct scenarios:
1. Target Time Metric: successful if target-task learning time is reduced
– The "sunk cost" of source-task training is ignored
– Source task(s) are independently useful
– Goal: effectively utilize past knowledge; only the target task matters
2. Total Time Metric: successful if total (source + target) training time is reduced
– Source task(s) are not useful on their own
– Goal: minimize total training time

Keepaway [Stone, Sutton, and Kuhlmann 2005]
(Diagram: keepers K1, K2, K3 and takers T1, T2)
Goal: maintain possession of the ball
3 vs. 2: 5 agents, 3 (stochastic) actions, 13 (noisy, continuous) state variables
– The keeper with the ball may hold it or pass to either teammate
– Both takers move towards the player with the ball
4 vs. 3: 7 agents, 4 actions, 19 state variables

Learning Keepaway
Sarsa update
– CMAC, RBF, and neural network function approximation all successful
Qπ(s,a): predicted number of steps the episode will last
– Reward = +1 for every timestep
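A minimal sketch of the Sarsa update described here, using a dictionary-backed tabular Q in place of the CMAC/RBF/neural-network approximators; the step size and discount are assumed values:

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 1.0           # assumed step size; undiscounted episodic return
Q = defaultdict(float)            # Q[(state, action)] ~ predicted remaining episode length

def sarsa_update(s, a, r, s_next, a_next, done):
    """One Sarsa step: Q(s,a) += alpha * [r + gamma * Q(s',a') - Q(s,a)].

    With reward = +1 per timestep, Q(s,a) comes to predict how many more
    steps the Keepaway episode will last.
    """
    target = r if done else r + GAMMA * Q[(s_next, a_next)]
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```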

ρ's Effect on CMACs
(Diagram: 3 vs. 2 and 4 vs. 3 CMACs)
For each weight in the 4 vs. 3 function approximator:
o Use the inter-task mapping to find the corresponding 3 vs. 2 weight
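A minimal sketch of the per-weight copy described above; the flat weight vectors and the `weight_mapping` dictionary (target index → source index, derived from the inter-task mappings) are illustrative stand-ins for the actual tiled CMACs:

```python
import numpy as np

def transfer_cmac_weights(w_3v2, n_weights_4v3, weight_mapping):
    """Initialize 4 vs. 3 weights from learned 3 vs. 2 weights.

    weight_mapping: dict {4 vs. 3 weight index -> corresponding 3 vs. 2 index}.
    Weights with no 3 vs. 2 counterpart are left at zero.
    """
    w_4v3 = np.zeros(n_weights_4v3)
    for i_target, i_source in weight_mapping.items():
        w_4v3[i_target] = w_3v2[i_source]   # copy the corresponding source weight
    return w_4v3
```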

Keepaway: Hand-coded χ_A
Hold 4v3 → Hold 3v2
Pass1 4v3 → Pass1 3v2
Pass2 4v3 → Pass2 3v2
Pass3 4v3 → Pass2 3v2
Actions in 4 vs. 3 have "similar" actions in 3 vs. 2

Value Function Transfer
ρ(Q_S(S_S, A_S)) = Q_T(S_T, A_T)
– The action-value function is transferred
– ρ is task-dependent: it relies on the inter-task mappings
– Q_S is not defined on S_T and A_T
(Diagram: source-task agent–environment loop with Q_S : S_S × A_S → ℜ, and target-task loop with Q_T : S_T × A_T → ℜ)
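One way to read ρ is that the target-task value of (s_T, a_T) is obtained by mapping the pair back into the source task and querying Q_S; a minimal sketch under that reading, where `q_source`, `chi_x`, and `chi_A` are assumed callables:

```python
def make_transferred_q(q_source, chi_x, chi_A):
    """Apply rho: build Q_T from Q_S and the inter-task mappings.

    q_source: callable Q_S(s_source, a_source) -> value
    chi_x:    maps a target-task state to a source-task state
    chi_A:    maps a target-task action to a source-task action
    """
    def q_target(s_target, a_target):
        # Q_S is not defined on S_T x A_T, so translate first, then evaluate.
        return q_source(chi_x(s_target), chi_A(a_target))
    return q_target
```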

Value Function Transfer: Time to Threshold in 4 vs. 3
(Chart: time to threshold with and without transfer, showing target-task time and total time)

For a similar target task, the transferred knowledge can significantly improve performance. But how do we define a "similar" task more specifically?
– Same state-action space?
– Similar objectives?

Effects of Task Similarity
– Is transfer beneficial for a given pair of tasks?
– Can we avoid negative transfer?
– Can we reduce the total time metric?
(Spectrum: source identical to target → transfer is trivial; source unrelated to target → transfer is impossible)

Example Transfer Domains
– Series of mazes with different goals [Fernandez and Veloso, 2006]
– Mazes with different structures [Konidaris and Barto, 2007]
– Keepaway with different numbers of players [Taylor and Stone, 2005]
– Keepaway to Breakaway [Torrey et al., 2005]
All of these tasks are drawn from the same domain
o Task: an MDP
o Domain: setting for semantically similar tasks
What about cross-domain transfer?
o The source task could be much simpler
o Show that source and target can be less similar

Source Task: Ringworld
Ringworld
– Goal: avoid being tagged
– 2 agents, 3 actions, 7 state variables
– Fully observable, discrete state space (Q-table with ~8,100 (s,a) pairs)
– Stochastic actions
– Opponent moves directly towards the player
– Player may stay or run towards a pre-defined location
3 vs. 2 Keepaway
– Goal: maintain possession of the ball
– 5 agents, 3 actions, 13 state variables
– Partially observable, continuous state space
– Stochastic actions

Source Task: Knight's Joust
Knight's Joust
– Goal: travel from the start to the goal line
– 2 agents, 3 actions, 3 state variables
– Fully observable, discrete state space (Q-table with ~600 (s,a) pairs)
– Deterministic actions
– Opponent moves directly towards the player
– Player may move North, or take a knight's jump to either side
3 vs. 2 Keepaway
– Goal: maintain possession of the ball
– 5 agents, 3 actions, 13 state variables
– Partially observable, continuous state space
– Stochastic actions

Rule Transfer Overview
1. Learn a policy (π : S → A) in the source task
– TD, policy search, model-based, etc.
2. Learn a decision list, D_source, summarizing π
3. Translate D_source → D_target (applies to the target task)
– State variables and actions can differ between the two tasks
4. Use D_target to learn a policy in the target task
This allows different learning methods and function approximators in the source and target tasks.
(Pipeline: Learn π → Learn D_source → Translate D_source → D_target → Use D_target)

Rule Transfer Details: Learn π
(Diagram: source-task agent–environment loop)
In this work we use Sarsa
o Q : S × A → Return
o Other learning methods are possible

Rule Transfer Details: Learn D_source
– Use the learned policy to record (state, action) pairs
– Use JRip (RIPPER in Weka) to learn a decision list, e.g.:
IF s1 ≥ 5 → a1
ELSEIF s1 < 3 → a2
ELSEIF s3 > 7 → a1
…
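A minimal sketch of this step: roll out the learned policy to record (state, action) pairs, then fit an interpretable classifier to summarize it. A scikit-learn decision tree stands in for JRip/RIPPER here, and `env` and `policy` are hypothetical stand-ins for the source task and learned policy:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

def summarize_policy(env, policy, n_episodes=100, max_depth=3):
    """Record (state, action) pairs from the learned policy and distill rules."""
    states, actions = [], []
    for _ in range(n_episodes):
        s, done = env.reset(), False          # assumed environment interface
        while not done:
            a = policy(s)                     # greedy action from the learned policy
            states.append(s)
            actions.append(a)
            s, done = env.step(a)             # assumed to return (next_state, done)
    # A shallow tree plays the role of the RIPPER decision list: each
    # root-to-leaf path reads as an IF ... THEN action rule.
    d_source = DecisionTreeClassifier(max_depth=max_depth)
    d_source.fit(np.array(states), np.array(actions))
    print(export_text(d_source))              # human-readable rules
    return d_source
```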

Rule Transfer Details: Translate D_source → D_target
Inter-task mappings:
χ_x : s_target → s_source
– Given a state variable in the target task (some x from s = x1, x2, …, xn)
– Return the corresponding state variable in the source task
χ_A : a_target → a_source
– Similar, but for actions
(Diagram: rule → rule′, translated via χ_x and χ_A)
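A minimal sketch of rule translation, using the Ringworld-to-Keepaway correspondences from the next slide. Since χ_x and χ_A map target to source, translating a source rule applies their inverses; the dictionaries and the tuple rule representation are illustrative assumptions:

```python
# chi_x and chi_A as defined above: target -> source.
chi_x = {"dist(K1,T1)": "dist(Player,Opponent)"}
chi_A = {"Hold Ball": "Stay", "Pass to K2": "Run Near", "Pass to K3": "Run Far"}

# Rule translation goes source -> target, so invert the mappings.
inv_x = {src: tgt for tgt, src in chi_x.items()}
inv_A = {src: tgt for tgt, src in chi_A.items()}

def translate_rule(rule):
    """Rewrite a source-task rule (variable, operator, threshold, action) for the target task."""
    var, op, threshold, action = rule
    return (inv_x[var], op, threshold, inv_A[action])

# 'IF dist(Player,Opponent) > 4 -> Stay' becomes 'IF dist(K1,T1) > 4 -> Hold Ball'
print(translate_rule(("dist(Player,Opponent)", ">", 4, "Stay")))
```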

Rule Transfer Details: Translation Example
Action correspondences (χ_A): Stay ↔ Hold Ball, Run Near ↔ Pass to K2, Run Far ↔ Pass to K3
State-variable correspondences (χ_x): dist(Player, Opponent) ↔ dist(K1, T1), …
IF dist(Player, Opponent) > 4 → Stay becomes IF dist(K1, T1) > 4 → Hold Ball

Rule Transfer Details: Use D_target
Many possible ways to use D_target:
o Value Bonus (shaping)
o Extra Action (initially force the agent to select it)
o Extra Variable
Assuming a TD learner in the target task; this should generalize to other learning methods.
Example: the agent evaluates its 3 actions in state s = s1, s2, and D_target(s) = a2:
– Without rules: Q(s1, s2, a1) = 5, Q(s1, s2, a2) = 3, Q(s1, s2, a3) = 4
– Value Bonus: the recommended action's value is boosted, e.g. Q(s1, s2, a2) = 9
– Extra Action: an added action a4 means "take action a2", e.g. Q(s1, s2, a4) = 7
– Extra Variable: D_target(s) is appended to the state, so the agent evaluates Q(s1, s2, s3, a1) = 5, Q(s1, s2, s3, a2) = 3, Q(s1, s2, s3, a3) = 4
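A minimal sketch of the Value Bonus option: when choosing an action in the target task, add a shaping bonus to whichever action D_target recommends; the bonus size and the dictionary Q representation are assumptions:

```python
BONUS = 6.0   # assumed shaping bonus for the rule-recommended action

def select_action(q_values, s, d_target, actions):
    """Greedy action selection from Q-values plus a bonus on D_target's suggestion.

    q_values: dict {(state, action): value} learned in the target task.
    d_target: translated decision list mapping a state to a recommended action.
    """
    recommended = d_target(s)
    scores = {a: q_values.get((s, a), 0.0) + (BONUS if a == recommended else 0.0)
              for a in actions}
    return max(scores, key=scores.get)
```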

Comparison of Rule Transfer Methods
(Chart: learning curves for Without Transfer, Only Follow Rules, Extra Variable, Extra Action, and Value Bonus; rules learned from 5 hours of source-task training)

Inter-domain Transfer: Averaged Results
(Chart: episode duration in simulator seconds vs. training time in simulator hours)
Ringworld source training: 20,000 episodes (~1 minute of wall-clock time)
Success: four types of transfer improvement!

Future Work
– Theoretical guarantees / bounds
– Avoiding negative transfer
– Curriculum learning
– Autonomously selecting inter-task mappings
– Leveraging supervised learning techniques
– Simulation to physical robots
– Humans?