Presentation on theme: "POMDPs: 5 Reward Shaping: 4 Intrinsic RL: 4 Function Approximation: 3." — Presentation transcript:

1 POMDPs: 5 Reward Shaping: 4 Intrinsic RL: 4 Function Approximation: 3

2 https://www.youtube.com/watch?v=ek0FrCaogcs

3 Evaluation Metrics
Asymptotic improvement
Jumpstart improvement
Speed improvement
– Total reward
– Slope of line
– Time to threshold
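
As a rough illustration (not part of the slides), most of these metrics can be computed from recorded learning curves. A minimal sketch in Python, assuming each curve is an array of per-episode performance values; the function and key names are illustrative:

```python
import numpy as np

def transfer_metrics(no_transfer, with_transfer, threshold):
    """Compare two learning curves (per-episode performance, e.g. episode duration)."""
    no_transfer = np.asarray(no_transfer, dtype=float)
    with_transfer = np.asarray(with_transfer, dtype=float)

    def time_to_threshold(curve):
        # First episode at which performance reaches the threshold (None if never)
        hits = np.nonzero(curve >= threshold)[0]
        return int(hits[0]) if hits.size else None

    return {
        # Jumpstart: initial performance gain before any target-task learning
        "jumpstart": with_transfer[0] - no_transfer[0],
        # Asymptotic improvement: difference in final performance
        "asymptotic": with_transfer[-1] - no_transfer[-1],
        # Total reward: difference in area under the learning curves
        "total_reward_gain": with_transfer.sum() - no_transfer.sum(),
        # Time to threshold, with and without transfer
        "time_to_threshold": (time_to_threshold(with_transfer),
                              time_to_threshold(no_transfer)),
    }
```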

4 Two distinct scenarios (time-to-threshold plot: Target: no Transfer / Target: with Transfer / Target + Source: with Transfer)
1. Target Time Metric: successful if target-task learning time is reduced
– "Sunk cost" is ignored: source task(s) are independently useful, so we only care about the target and effectively utilizing past knowledge
2. Total Time Metric: successful if total (source + target) time is reduced
– Source task(s) are not useful on their own; minimize total training

5 Keepaway [Stone, Sutton, and Kuhlmann 2005]
Goal: maintain possession of the ball
3 vs. 2 (diagram: keepers K1–K3, takers T1–T2): 5 agents, 3 (stochastic) actions, 13 (noisy & continuous) state variables
The keeper with the ball may hold the ball or pass to either teammate; both takers move towards the player with the ball
4 vs. 3: 7 agents, 4 actions, 19 state variables

6 Learning Keepaway
Sarsa update
– CMAC, RBF, and neural network function approximation successful
Qπ(s,a): predicted number of steps the episode will last
– Reward = +1 for every timestep
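
A minimal sketch of a Sarsa(0) learner with linear function approximation, as a stand-in for the CMAC/RBF/neural-network approximators mentioned on the slide. The class and parameter names are illustrative, features are assumed to arrive as NumPy arrays, and Keepaway's feature extraction is assumed to exist elsewhere:

```python
import numpy as np

class LinearSarsa:
    """Sarsa(0) with a linear function approximator: one weight vector per action."""

    def __init__(self, n_features, n_actions, alpha=0.1, gamma=1.0, epsilon=0.05):
        self.w = np.zeros((n_actions, n_features))
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def q(self, features, action):
        # Q(s, a) is a linear combination of the state features
        return float(self.w[action] @ features)

    def select_action(self, features):
        # Epsilon-greedy over the approximate action values
        if np.random.rand() < self.epsilon:
            return np.random.randint(len(self.w))
        return int(np.argmax([self.q(features, a) for a in range(len(self.w))]))

    def update(self, feats, action, reward, next_feats, next_action, done):
        # Sarsa target: reward plus the value of the action actually taken next.
        # With reward = +1 per timestep and gamma = 1, Q predicts remaining episode length.
        target = reward if done else reward + self.gamma * self.q(next_feats, next_action)
        td_error = target - self.q(feats, action)
        self.w[action] += self.alpha * td_error * feats
```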

7 ρ's Effect on CMACs (diagrams of the 4 vs. 3 and 3 vs. 2 CMACs)
For each weight in the 4 vs. 3 function approximator:
o Use the inter-task mapping to find the corresponding 3 vs. 2 weight

8 Keepaway Hand-coded χ_A
Hold 4v3 → Hold 3v2
Pass1 4v3 → Pass1 3v2
Pass2 4v3 → Pass2 3v2
Pass3 4v3 → Pass2 3v2
Actions in 4 vs. 3 have "similar" actions in 3 vs. 2
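
In code, a hand-coded mapping like χ_A is just a lookup table. A minimal sketch with illustrative string names (not the authors' implementation):

```python
# Hand-coded chi_A: each 4 vs. 3 action maps to a "similar" 3 vs. 2 action
chi_A_4v3_to_3v2 = {
    "Hold":  "Hold",
    "Pass1": "Pass1",
    "Pass2": "Pass2",
    "Pass3": "Pass2",  # no third teammate in 3 vs. 2, so Pass3 reuses Pass2
}
```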

9 Value Function Transfer
ρ( Q_S(S_S, A_S) ) = Q_T(S_T, A_T)
The action-value function is transferred; ρ is task-dependent and relies on inter-task mappings, since Q_S is not defined on S_T and A_T
Q_S : S_S × A_S → ℜ, Q_T : S_T × A_T → ℜ
(Diagram: separate agent–environment loops for the source task, with State_S / Action_S / Reward_S, and the target task, with State_T / Action_T / Reward_T)
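
One way ρ could be instantiated for linear, CMAC-style value functions is to initialize each target weight from its mapped source weight. A minimal sketch, assuming for simplicity one weight per (state variable, action) pair rather than a full tile coding, with χ_x and χ_A given as index lookups; all names are illustrative, not the authors' implementation:

```python
import numpy as np

def value_function_transfer(w_source, chi_x, chi_A, n_target_vars, n_target_actions):
    """rho: initialize target-task weights Q_T from learned source-task weights Q_S.

    w_source[i, a] -- weight for source state variable i and source action a
    chi_x[j] = i   -- target state variable j corresponds to source variable i
    chi_A[b] = a   -- target action b corresponds to source action a
    """
    w_target = np.zeros((n_target_vars, n_target_actions))
    for j in range(n_target_vars):
        for b in range(n_target_actions):
            # Each target weight starts from the weight of its mapped source entry
            w_target[j, b] = w_source[chi_x[j], chi_A[b]]
    return w_target
```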

10 Value Function Transfer: Time to threshold in 4 vs. 3 (learning curves comparing no transfer against transfer, annotated with target-task time and total time)

11 For a similar target task, the transferred knowledge can significantly improve performance. But how do we define a "similar" task more specifically?
– Same state-action space
– Similar objectives

12 Effects of Task Similarity
Spectrum: Source identical to Target (transfer trivial) … Source unrelated to Target (transfer impossible)
– Is transfer beneficial for a given pair of tasks?
– Can we avoid negative transfer?
– Can we reduce the total time metric?

13 Example Transfer Domains
Series of mazes with different goals [Fernandez and Veloso, 2006]
Mazes with different structures [Konidaris and Barto, 2007]

14 Example Transfer Domains
Series of mazes with different goals [Fernandez and Veloso, 2006]
Mazes with different structures [Konidaris and Barto, 2007]
Keepaway with different numbers of players [Taylor and Stone, 2005]
Keepaway to Breakaway [Torrey et al., 2005]

15 Example Transfer Domains
Series of mazes with different goals [Fernandez and Veloso, 2006]
Mazes with different structures [Konidaris and Barto, 2007]
Keepaway with different numbers of players [Taylor and Stone, 2005]
Keepaway to Breakaway [Torrey et al., 2005]
All tasks are drawn from the same domain
o Task: an MDP
o Domain: a setting for semantically similar tasks
o What about cross-domain transfer?
o The source task could be much simpler
o Show that source and target can be less similar

16 Source Task: Ringworld
Ringworld – Goal: avoid being tagged
2 agents, 3 actions, 7 state variables
Fully observable, discrete state space (Q-table with ~8,100 (s,a) pairs), stochastic actions
The opponent moves directly towards the player; the player may stay or run towards a pre-defined location
3 vs. 2 Keepaway (diagram: keepers K1–K3, takers T1–T2) – Goal: maintain possession of the ball
5 agents, 3 actions, 13 state variables
Partially observable, continuous state space, stochastic actions

17 Source Task: Knight's Joust
Knight's Joust – Goal: travel from the start to the goal line
2 agents, 3 actions, 3 state variables
Fully observable, discrete state space (Q-table with ~600 (s,a) pairs), deterministic actions
The opponent moves directly towards the player; the player may move North or take a knight's jump to either side
3 vs. 2 Keepaway – Goal: maintain possession of the ball
5 agents, 3 actions, 13 state variables
Partially observable, continuous state space, stochastic actions

18 Rule Transfer Overview
1. Learn a policy (π : S → A) in the source task – TD, policy search, model-based, etc.
2. Learn a decision list, D_source, summarizing π
3. Translate D_source → D_target (applies to the target task) – state variables and actions can differ in the two tasks
4. Use D_target to learn a policy in the target task
Allows for different learning methods and function approximators in the source and target tasks
(Flow diagram: Learn π → Learn D_source → Translate (D_source) → D_target → Use D_target)

19 Rule Transfer Details
(Flow diagram: Learn π → Learn D_source → Translate (D_source) → D_target → Use D_target; source-task agent–environment loop with state, action, reward)
In this work we use Sarsa
o Q : S × A → Return
o Other learning methods are possible

20 Rule Transfer Details
(Flow diagram: Learn π → Learn D_source → Translate (D_source) → D_target → Use D_target)
Use the learned policy to record (state, action) pairs
Use JRip (RIPPER in Weka) to learn a decision list from the recorded pairs
Example decision list:
IF s1 ≥ 5 → a1
ELSEIF s1 < 3 → a2
ELSEIF s3 > 7 → a1
…
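
A decision list like the one above can be represented as an ordered list of (variable, comparison, threshold, action) rules. A minimal sketch with illustrative names (this is not the Weka/JRip rule format):

```python
import operator

# Ordered rules: the first matching condition determines the action
d_source = [
    ("s1", operator.ge, 5, "a1"),
    ("s1", operator.lt, 3, "a2"),
    ("s3", operator.gt, 7, "a1"),
]

def apply_decision_list(rules, state, default_action=None):
    """Return the action of the first rule whose condition holds in `state`."""
    for var, cmp_op, threshold, action in rules:
        if cmp_op(state[var], threshold):
            return action
    return default_action

# Example: a state with s1 = 2 falls through to the second rule
print(apply_decision_list(d_source, {"s1": 2, "s3": 4}))  # a2
```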

21 Rule Transfer Details
(Flow diagram: Learn π → Learn D_source → Translate (D_source) → D_target → Use D_target)
Inter-task mappings:
χ_x : s_target → s_source
– Given a state variable in the target task (some x from s = x1, x2, … xn), return the corresponding state variable in the source task
χ_A : a_target → a_source
– Similar, but for actions
(rule → rule′ via χ_x and χ_A)
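
Given χ_x and χ_A as dictionaries, translating a rule just renames its state variable and action. A minimal sketch building on the decision-list representation sketched above; since the slides define χ from target to source, this sketch inverts the mappings and assumes they are invertible, and all names are illustrative:

```python
import operator

def invert(mapping):
    """Invert a chi mapping (target -> source); assumes picking one target
    preimage per source item is acceptable."""
    return {src: tgt for tgt, src in mapping.items()}

def translate(d_source, chi_x, chi_A):
    """Rewrite source-task rules (variable, comparison, threshold, action)
    in target-task terms using the inverted inter-task mappings."""
    x_inv, a_inv = invert(chi_x), invert(chi_A)
    return [(x_inv[var], cmp_op, threshold, a_inv[action])
            for var, cmp_op, threshold, action in d_source]

# Keepaway example from the next slide; chi maps target items to source items
chi_x = {"dist(K1,T1)": "dist(Player,Opponent)"}
chi_A = {"Hold Ball": "Stay", "Pass to K2": "Run Near", "Pass to K3": "Run Far"}

d_source = [("dist(Player,Opponent)", operator.gt, 4, "Stay")]
print(translate(d_source, chi_x, chi_A))
# translated rule: IF dist(K1,T1) > 4 -> Hold Ball
```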

22 Rule Transfer Details
(Flow diagram: Learn π → Learn D_source → Translate (D_source) → D_target → Use D_target; 3 vs. 2 Keepaway diagram with keepers K1–K3 and takers T1–T2)
Action correspondences (χ_A): Stay ↔ Hold Ball, Run Near ↔ Pass to K2, Run Far ↔ Pass to K3
State-variable correspondences (χ_x): dist(Player, Opponent) ↔ dist(K1, T1), …
Example translation: IF dist(Player, Opponent) > 4 → Stay becomes IF dist(K1, T1) > 4 → Hold Ball

23 Rule Transfer Details
(Flow diagram: Learn π → Learn D_source → Translate (D_source) → D_target → Use D_target)
Many possible ways to use D_target:
o Value Bonus
o Extra Action
o Extra Variable
Assuming a TD learner in the target task; should generalize to other learning methods
Running example: evaluate the agent's 3 actions in state s = s1, s2: Q(s1, s2, a1) = 5, Q(s1, s2, a2) = 3, Q(s1, s2, a3) = 4, and the rules recommend D_target(s) = a2
o Value Bonus (shaping): add a bonus (+8) to the value of the recommended action a2
o Extra Action (initially force the agent to select it): add a pseudo-action that follows the rules, e.g. Q(s1, s2, a4) = 7 (take action a2)
o Extra Variable: add the recommendation to the state, e.g. Q(s1, s2, s3, a1) = 5, … or Q(s1, s2, a2, a1) = 5, Q(s1, s2, a2, a2) = 9, Q(s1, s2, a2, a3) = 4
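
A minimal sketch of two of these options, Value Bonus and Extra Action, at action-selection time. It assumes Q-values are available as a dictionary and that the rules' recommendation comes from D_target; the +8 bonus mirrors the slide's example, and the function names are illustrative:

```python
def select_with_value_bonus(q_values, recommended_action, bonus=8.0):
    """Value Bonus (shaping): boost the Q-value of the action the rules recommend."""
    shaped = {a: q + (bonus if a == recommended_action else 0.0)
              for a, q in q_values.items()}
    return max(shaped, key=shaped.get)

def select_with_extra_action(q_values, recommended_action, q_extra, force=False):
    """Extra Action: a pseudo-action meaning 'follow the rules', with its own value.

    If the pseudo-action wins (or is forced early in learning), the agent
    executes the rules' recommended action in the environment.
    """
    if force or q_extra >= max(q_values.values()):
        return recommended_action
    return max(q_values, key=q_values.get)

# Slide example: Q(a1)=5, Q(a2)=3, Q(a3)=4, and the rules recommend a2
q = {"a1": 5.0, "a2": 3.0, "a3": 4.0}
print(select_with_value_bonus(q, "a2"))                # a2 (3 + 8 beats 5)
print(select_with_extra_action(q, "a2", q_extra=7.0))  # a2 (pseudo-action wins)
```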

24 Comparison of Rule Transfer Methods (learning curves: Without Transfer, Only Follow Rules, Value Bonus, Extra Action, Extra Variable; rules from 5 hours of training)

25 Inter-domain Transfer: Averaged Results (learning curves: episode duration in simulator seconds vs. training time in simulator hours)
Ringworld: 20,000 episodes (~1 minute of wall-clock time)
Success: four types of transfer improvement!

26 Future Work
Theoretical Guarantees / Bounds
Avoiding Negative Transfer
Curriculum Learning
Autonomously selecting inter-task mappings
Leverage supervised learning techniques
Simulation to Physical Robots
Humans?

