1
Top-level learning
Example: Pass selection using TPOT-RL
2
Overview
Team with 3 players (Goalie, Midfield, Forward) + 1 enemy goal
- actions
- state generalization
- Q-Table: action selection, reduction, generation (value function learning by rewards)
3
actions
Scenario (state s): 3 players G, M, F; the opponents are invisible; E = enemy goal
3 actions (a) for the midfielder M: PassG (pass to the goalie), PassF (pass to the forward), KickE (kick towards the enemy goal)
Which action should M take?
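A minimal sketch of how this scenario could be encoded; the dictionary layout, player coordinates, and constant names are illustrative assumptions, not taken from the original implementation:

```python
# Illustrative encoding of the scenario (coordinates are assumptions).
ACTIONS = ("PassG", "PassF", "KickE")   # the three actions available to M

# state s: positions of the own players G, M, F (opponents are invisible to M)
state = {
    "G": (-40.0, 0.0),   # own goalie
    "M": (0.0, 0.0),     # midfielder, the decision maker
    "F": (30.0, 10.0),   # forward
}
```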
4
action-dependent features
e : S x A -> U, e(s,a) ... pass evaluation function (decision tree, heuristic, ...)
Example: e1 = 0.9 (PassG), e2 = 0.7 (PassF), e3 = 0.2 (KickE)
This measures only short-term efficiency; is that sufficient for achieving a goal?
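A hedged sketch of such an evaluation function; the slides leave its form open (decision tree, heuristic, ...), so this simple distance heuristic and the coordinates it assumes only illustrate the interface e : S x A -> U:

```python
import math

ENEMY_GOAL = (52.5, 0.0)   # assumed position of the enemy goal E


def evaluate(state: dict, action: str) -> float:
    """Heuristic e(s, a): estimated short-term success of `action`, in [0, 1]."""
    target = {
        "PassG": state["G"],
        "PassF": state["F"],
        "KickE": ENEMY_GOAL,
    }[action]
    mx, my = state["M"]
    distance = math.hypot(target[0] - mx, target[1] - my)
    # assumption: shorter passes/kicks succeed more often
    return max(0.0, 1.0 - distance / 60.0)


state = {"G": (-40.0, 0.0), "M": (0.0, 0.0), "F": (30.0, 10.0)}
print({a: round(evaluate(state, a), 2) for a in ("PassG", "PassF", "KickE")})
```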
5
state generalization
Create a small feature vector describing the situation: f : S -> V
f(s) = v = (e1, e2, e3) = (0.9, 0.7, 0.2)
6
state generalization II
Discretize the values, e.g. e_i >= 0.7 ... pass successful (v_i = True), e_i < 0.7 ... pass missed (v_i = False)
v = (T, T, F)
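The discretization step could look like this (the 0.7 threshold is the slide's number; the function name is mine):

```python
THRESHOLD = 0.7   # e_i >= 0.7 -> pass expected to succeed (True)


def generalize(evaluations: tuple) -> tuple:
    """f : S -> V, applied to the raw evaluations (e1, e2, e3)."""
    return tuple(e >= THRESHOLD for e in evaluations)


# e1 (PassG) = 0.9, e2 (PassF) = 0.7, e3 (KickE) = 0.2, as on the slide
v = generalize((0.9, 0.7, 0.2))
print(v)   # (True, True, False)
```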
7
state generalization III
Original state space: ca. 10^198 states (?)
Reduced state space: 2^3 * 3 = 24
("real" robosoccer: 2^(11+8) * 11 = ca. 5.7 million)
v = (T, T, F)
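A quick check of the slide's arithmetic (the exponents and factors are taken directly from the slide):

```python
# Toy example: 2 feature values, 3 features, 3 actions.
print(2 ** 3 * 3)          # 24

# "Real" robosoccer, numbers as on the slide: 2^(11+8) * 11 table entries.
print(2 ** (11 + 8) * 11)  # 5767168, i.e. ca. 5.7 million
```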
8
action selection
Which action is optimal for scoring a goal in the long term?
Assume a "wise" Q-Table: the maximal Q-value indicates the best action!
9
Q-Table action selection
Q-Table for player M (rows: state v = (v1, v2, v3), columns: actions a):

v1 v2 v3 | PassG PassF KickE
F  F  F  |   0     2     2
...      |  ...   ...   ...
T  T  F  |   4    12     8
...      |  ...   ...   ...
T  T  T  |   4    10    20

Current state v = (T, T, F): take the action with the maximal Q-value, here PassF with Q = 12 (Q_max = 100).
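A sketch of greedy action selection from the full table; the dictionary layout and function name are mine, the values are the slide's example entries (rows not shown on the slide are omitted):

```python
# Full Q-table for player M: one row per feature vector v = (v1, v2, v3).
q_table = {
    (False, False, False): {"PassG": 0, "PassF": 2, "KickE": 2},
    (True,  True,  False): {"PassG": 4, "PassF": 12, "KickE": 8},
    (True,  True,  True):  {"PassG": 4, "PassF": 10, "KickE": 20},
}


def select_action(v: tuple) -> str:
    """Greedy selection: take the action with the maximal Q-value in state v."""
    row = q_table[v]
    return max(row, key=row.get)


print(select_action((True, True, False)))   # PassF (Q = 12)
```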
10
Q-Table action selection II
Q-Table for player M:

v1 v2 v3 | PassG PassF KickE
F  F  F  |   0     2     2
...      |  ...   ...   ...
T  T  F  |   4    12     8
...      |  ...   ...   ...
T  T  T  |   8    10    20

Problem: size of the Q-Table * team size * formations
e.g. "real" robosoccer: #Q-values = 2^(11+8) * (11+8) * 11 * 9 = ca. 10^9
Therefore ASSUME independence: Q(v, a_i) depends only on v_i
11
Q-Table reduction
e.g. the Q-values for PassF depend "only" on the feature value for PassF: e2 = 0.7, i.e. v2 = T
(in the full Q-Table for player M above, only the PassF column and the value of v2 are needed for PassF)
12
Q-Table reduction II
Q-Table for player M, split into one small table per feature:

v1 | PassG PassF KickE     v2 | PassG PassF KickE     v3 | PassG PassF KickE
F  |  ...   ...   ...      F  |   0     2     2       F  |  ...   ...   ...
T  |  ...   ...   ...      T  |   4    12     8       T  |  ...   ...   ...

e.g. "real" robosoccer: #Q-values = 2 * (11+8) * (11+8) * 11 * 9 = 71478
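Under the independence assumption a single value per (feature value, action) pair suffices; a sketch of such a reduced lookup (only the PassF values are given on the slides, the other entries are placeholders of my own):

```python
# One tiny table per action, indexed only by the feature that belongs to it.
reduced_q = {
    "PassG": {False: 0.0, True: 4.0},    # indexed by v1 (placeholder values)
    "PassF": {False: 2.0, True: 12.0},   # indexed by v2 (values from the slide)
    "KickE": {False: 2.0, True: 8.0},    # indexed by v3 (placeholder values)
}
FEATURE_OF = {"PassG": 0, "PassF": 1, "KickE": 2}


def q_value(action: str, v: tuple) -> float:
    """Q(v, a_i) under the independence assumption: look up only v_i."""
    return reduced_q[action][v[FEATURE_OF[action]]]


print(q_value("PassF", (True, True, False)))   # 12.0
```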
13
Q-Table reduction III
Further reduction is possible with an action filter B(s): no Q-values are stored for useless actions
(e.g. in front of the enemy goal there is no Q-value for passing back to the own goalie).
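The filter B(s) can be a simple predicate applied before the Q-lookup; a sketch in which the "near the enemy goal" test and its threshold are assumptions:

```python
def action_filter(state: dict, actions: tuple) -> list:
    """B(s): keep only the actions that are not obviously useless in state s."""
    mx, _ = state["M"]
    near_enemy_goal = mx > 35.0          # assumed threshold on M's x-position
    useful = []
    for a in actions:
        if near_enemy_goal and a == "PassG":
            continue                     # no pass back to the own goalie here
        useful.append(a)
    return useful


state = {"G": (-40.0, 0.0), "M": (40.0, 0.0), "F": (45.0, 5.0)}
print(action_filter(state, ("PassG", "PassF", "KickE")))   # ['PassF', 'KickE']
```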
14
generating a Q-Table
Value function learning by rewards:
t = 7: M passes to F (PassF)
t = 15: F kicks towards the goal (KickE)
t = 17: goal scored!! Reward +100
15
generating a Q-Table II
Every agent remembers its last action.
t = 7: M passes to F (PassF)
t = 15: F kicks towards the goal (KickE)
t = 17: goal scored!! -> Q = 100
Rewards are discounted by the time t* that has passed since the agent's own action:
for F (KickE, t* = 2): r = Q = 100
for M (PassF, t* = 10): r = Q / (k * t*) = 100 / (0.5 * 10) = 20
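A sketch of this reward computation with the slide's numbers (k = 0.5, Q = 100); the function name is mine, and whether F's reward is computed by the same formula or simply set to Q is my reading of the slide, though with these numbers both give 100:

```python
def reward(q_goal: float, t_goal: int, t_action: int, k: float = 0.5) -> float:
    """Reward for an agent whose last action was at t_action, goal at t_goal."""
    t_star = t_goal - t_action           # time elapsed since the agent's action
    return q_goal / (k * t_star)


print(reward(100, t_goal=17, t_action=7))    # 20.0  -> M (PassF at t = 7)
print(reward(100, t_goal=17, t_action=15))   # 100.0 -> F (KickE at t = 15)
```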
16
generating a Q-Table III
Update the Q-Tables (here for M, action PassF, reward r = 20, learning rate α = 0.1):
Q(v,a) = Q(v,a) + α * (r - Q(v,a)) = 12 + 0.1 * (20 - 12) = 12.8
The entry for (v2 = T, PassF) changes from 12 to 12.8:

v2 | PassG PassF KickE
F  |   0     2     2
T  |   4   12.8    8
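The update itself is a one-liner; a sketch with the slide's numbers (α = 0.1 is inferred from the example arithmetic):

```python
def update(q_old: float, r: float, alpha: float = 0.1) -> float:
    """Q(v,a) <- Q(v,a) + alpha * (r - Q(v,a))."""
    return q_old + alpha * (r - q_old)


# Entry for (v2 = True, PassF): old value 12, reward 20 -> new value 12.8
print(update(12.0, 20.0))   # 12.8
```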
17
future value
PassF is sensible in itself, but it is very sensible if F can then kick at the goal. Therefore also take the value of the successor state-action pair into account:
Q(v,a) = Q(v,a) + α * (r + γ * Q(v_f, a_f) - Q(v,a))
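A sketch of this update with the future value included; γ = 0.9 and the successor value Q(v_f, a_f) = 20 (F's KickE entry from the earlier table) are illustrative choices of mine:

```python
def update_with_future_value(q_old: float, r: float, q_next: float,
                             alpha: float = 0.1, gamma: float = 0.9) -> float:
    """Q(v,a) <- Q(v,a) + alpha * (r + gamma * Q(v_f, a_f) - Q(v,a))."""
    return q_old + alpha * (r + gamma * q_next - q_old)


# M's PassF entry gains extra value because F's follow-up KickE is valuable.
print(update_with_future_value(q_old=12.0, r=20.0, q_next=20.0))   # 14.6
```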