1
Distributed Q Learning
Lars Blackmore and Steve Block
2
Contents
- What is Q-learning?
  - The MDP framework
  - Q-learning itself
- Distributed Q-learning: how should learning be shared?
- Sharing Q-values: why is it interesting?
  - Simple averaging (and why it falls short)
  - Expertness-based distributed Q-learning
  - Expertness with specialised agents
3
Markov Decision Processes
The MDP framework:
- States S
- Actions A
- Rewards R(s,a)
- Transition function T(s,a,s')
Goal: find the optimal policy π*(s) that maximises lifetime reward
[Figure: gridworld with goal state G; reaching G earns reward 100, all other moves earn 0]
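As a concrete illustration (not part of the original slides), here is a minimal sketch of these MDP ingredients for a small deterministic gridworld in Python. The 3x3 grid size, the cell coordinates, and all names are assumptions; only the reward of 100 at the goal G comes from the slide.

ACTIONS = ["up", "down", "left", "right"]
GRID_SIZE = 3                # assumed size; the slide does not specify one
GOAL = (0, 2)                # goal cell G; stepping onto it earns reward 100

def transition(state, action):
    """T(s,a) -> s': move one cell; bumping into a wall leaves the agent in place."""
    r, c = state
    moves = {"up": (r - 1, c), "down": (r + 1, c),
             "left": (r, c - 1), "right": (r, c + 1)}
    nr, nc = moves[action]
    in_bounds = 0 <= nr < GRID_SIZE and 0 <= nc < GRID_SIZE
    return (nr, nc) if in_bounds else state

def reward(state, action):
    """R(s,a): 100 for reaching the goal, 0 otherwise (as in the slide's figure)."""
    return 100 if transition(state, action) == GOAL else 0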
4
Reinforcement Learning
We want to find π* through experience:
- Reinforcement learning
- Intuitively similar to human/animal learning
- Use some policy for motion
- Converge to the optimal policy π*
An algorithm for reinforcement learning…
5
Q-Learning
Define Q*(s,a): “the total reward if the agent is in state s, takes action a, then acts optimally at all subsequent time steps”
Optimal policy: π*(s) = argmax_a Q*(s,a)
Q(s,a) is an estimate of Q*(s,a)
Q-learning motion policy: π(s) = argmax_a Q(s,a)
Update Q recursively:
  Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') - Q(s,a) ]
6
Q-learning
Update Q recursively:
[Figure: the agent takes action a from state s to reach state s', the goal G, receiving reward r = 100]
7
Q-learning
Update Q recursively:
[Figure: from state s', the available actions have Q(s',a_1') = 25 and Q(s',a_2') = 50; the transition into s' earned r = 100]
8
Q-Learning
Define Q*(s,a): “the total reward if the agent is in state s, takes action a, then acts optimally at all subsequent time steps”
Optimal policy: π*(s) = argmax_a Q*(s,a)
Q(s,a) is an estimate of Q*(s,a)
Q-learning motion policy: π(s) = argmax_a Q(s,a)
Update Q recursively:
  Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') - Q(s,a) ]
Optimality theorem: “if each (s,a) pair is updated an infinite number of times, Q converges to Q* with probability 1”
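A rough Python sketch of this tabular update and the greedy motion policy follows. The learning rate alpha, discount gamma, and epsilon-greedy exploration are standard assumptions, not values taken from the slides.

import random
from collections import defaultdict

Q = defaultdict(float)            # Q[(s, a)] -> current estimate of Q*(s, a)
alpha, gamma, epsilon = 0.5, 0.9, 0.1   # assumed hyperparameters

def motion_policy(s, actions):
    """pi(s) = argmax_a Q(s,a), with occasional random exploration."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_update(s, a, r, s_next, actions):
    """Recursive update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])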
9
Distributed Q-Learning
Problem formulation. An example situation: mobile robots.
Learning framework:
- Individual learning for t_i trials
- Each trial starts from a random state and ends when the robot reaches the goal
- Then all robots switch to cooperative learning
10
Distributed Q-learning
How should information be shared? Three fundamentally different approaches:
- Expanding the state space
- Sharing experiences
- Sharing Q-values
Methods for sharing state space and experience are straightforward, and showed some improvement in testing.
The best method for sharing Q-values is not obvious; this area offers the greatest challenge and the greatest potential for cunning.
11
Sharing Q-values
First approach: simple averaging. Each agent replaces its Q-table with the average over all agents:
  Q_i(s,a) ← (1/n) Σ_j Q_j(s,a)
This was shown to yield some improvement. What are some of the problems?
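A minimal sketch of the simple-averaging sharing step, assuming each agent keeps its Q-table as a Python dict keyed by (state, action) pairs (that representation is an assumption, not from the slides):

def share_by_averaging(q_tables):
    """Simple averaging: each Q(s,a) is replaced by the mean over all agents.
    After sharing, every agent holds an identical table (and hence an identical policy)."""
    keys = set().union(*q_tables)     # every (s,a) pair any agent has visited
    averaged = {k: sum(q.get(k, 0.0) for q in q_tables) / len(q_tables) for k in keys}
    return [dict(averaged) for _ in q_tables]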
12
Problems with Simple Averaging
- All agents have the same Q table after sharing, and hence the same policy
  - Different policies allow agents to explore the state space differently
- Convergence rate may be reduced
13
Problems with Simple Averaging
Convergence rate may be reduced. Without co-operation:

  Trial #   Q(s,a), Agent 1   Q(s,a), Agent 2
  0         0                 0
  1         10                0
  2
  3

[Figure: gridworld with goal G]
14
Problems with Simple Averaging
Convergence rate may be reduced. With simple averaging:

  Trial #   Q(s,a), Agent 1   Q(s,a), Agent 2
  0         0                 0
  1         5                 5
  2         7.5               7.5
  3         8.725             8.725
  …         …                 …
  10

[Figure: gridworld with goal G]
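The arithmetic behind this slow-down can be reproduced with a tiny sketch. This is illustrative only: it assumes one (s,a) pair whose learned value is 10, a learning rate of 1, and that only agent 1 reaches the goal each trial before the tables are averaged; it mirrors the shape of the table rather than reproducing the slide's exact numbers.

q1 = q2 = 0.0
for trial in range(1, 11):
    q1 = 10.0                     # agent 1 reaches the goal and learns the value
    q1 = q2 = (q1 + q2) / 2.0     # simple averaging drags agent 1 back toward agent 2
    print(trial, round(q1, 3))    # 5.0, 7.5, 8.75, ... creeps toward 10

Without sharing, agent 1 would have held the correct value after a single trial; with averaging, both agents approach it only asymptotically.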
15
Problems with Simple Averaging
- All agents have the same Q table after sharing, and hence the same policy
  - Different policies allow agents to explore the state space differently
- Convergence rate may be reduced
  - Highly problem specific
  - Slows adaptation in a dynamic environment
- Overall performance is highly problem specific
16
Expertness
Idea: value more highly the knowledge of agents who are ‘experts’ (expertness-based cooperative Q-learning).
New Q-sharing equation:
  Q_i(s,a) ← Σ_j W_ij Q_j(s,a)
Agent i assigns an importance weight W_ij to the Q data held by agent j. These weights are based on the agents’ relative expertness values e_i and e_j.
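A sketch of this weighted sharing step in Python, under the same dict-based Q-table assumption as before; W is a row-stochastic matrix of importance weights, with W[i][j] the weight agent i gives to agent j:

def share_weighted(q_tables, W):
    """Expertness-based sharing: Q_i_new(s,a) = sum_j W[i][j] * Q_j_old(s,a)."""
    keys = set().union(*q_tables)
    n = len(q_tables)
    return [{k: sum(W[i][j] * q_tables[j].get(k, 0.0) for j in range(n)) for k in keys}
            for i in range(n)]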
17
Expertness Measures
We need to define the expertness of agent i, based on the rewards agent i has received.
Various definitions:
- Algebraic sum
- Abs
- Positive
- Negative
Each has a different interpretation.
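One common reading of these four measures is sketched below from an agent's reinforcement history; the exact definitions follow the underlying paper, so treat this as an assumption-laden illustration rather than the authoritative formulas.

def expertness(rewards, measure="abs"):
    """Expertness of an agent computed from the list of rewards it has received."""
    if measure == "algebraic":            # algebraic sum: rewards minus punishments
        return sum(rewards)
    if measure == "abs":                  # total magnitude of all reinforcement
        return sum(abs(r) for r in rewards)
    if measure == "positive":             # rewards only
        return sum(r for r in rewards if r > 0)
    if measure == "negative":             # magnitude of punishments only
        return sum(-r for r in rewards if r < 0)
    raise ValueError(f"unknown measure: {measure}")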
18
Weighting Strategies
How do we derive the weights W_ij from the expertness values? Two alternative strategies (sketched below):
- ‘Learn from all’
- ‘Learn from experts’
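A hedged sketch of one plausible implementation of the two strategies: ‘learn from all’ weights every agent in proportion to its expertness, while ‘learn from experts’ gives weight only to agents more expert than agent i. The exact normalisation and the fixed impressibility constant alpha here are assumptions; the paper's formulas may differ in detail.

def weights(expertnesses, i, strategy="experts"):
    """Compute the weight row W_i* for agent i from the expertness values."""
    e, n = expertnesses, len(expertnesses)
    if strategy == "all":
        # 'Learn from all': weight each agent in proportion to its expertness.
        total = sum(e) or 1.0
        return [e[j] / total for j in range(n)]
    # 'Learn from experts': only agents more expert than i receive weight,
    # in proportion to how much more expert they are; the rest stays on i itself.
    surplus = [max(e[j] - e[i], 0.0) for j in range(n)]
    total = sum(surplus)
    if total == 0.0:
        return [1.0 if j == i else 0.0 for j in range(n)]   # i is already the expert
    alpha = 0.5                                # assumed 'impressibility' of agent i
    w = [alpha * surplus[j] / total for j in range(n)]
    w[i] += 1.0 - alpha
    return w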
19
Experimental Setup
Mobile robots in a hunter-prey scenario.
Individual learning phase, under two conditions:
1. All robots carry out the same number of trials
2. Robots carry out different numbers of trials
Then cooperative learning.
20
Results
Compare cooperative learning to individual learning:
- Agents with equal experience
- Agents with different experience
- Different weight-assigning mechanisms
- Different expertness measures
21
Results
25
Conclusion
- Without expertness measures, cooperation is detrimental
  - Simple averaging shows a decrease in performance
- Expertness-based cooperative learning is shown to be superior to individual learning
  - Only true when agents have significantly different expertness values (necessary but not sufficient)
- The expertness measures Abs, P and N show the best performance
  - Of these three, Abs provides the best compromise
- The ‘Learning from Experts’ weighting strategy is shown to be superior to ‘Learning from All’
26
What about this situation?
Both agents have accumulated the same rewards and punishments. Which is the more expert?
[Figure: gridworld with several reward locations marked R]
27
Specialised Agents
- An agent may have explored one area a lot but another area very little
  - The agent is an expert in one area but not in another
- Idea: specialised agents
  - Agents can be experts in certain areas of the world
  - A learnt policy is more valuable if the agent is more expert in that particular area (see the sketch below)
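A sketch of how expertness could be made regional: expertness is accumulated per region of the state space rather than globally, so that sharing can weight agent j's Q-values more heavily in the regions where j is the expert. The history format, the region mapping, and the use of |r| as the measure are illustrative assumptions, not the paper's exact scheme.

from collections import defaultdict

def regional_expertness(history, region_of, measure=abs):
    """Per-region expertness: accumulate measure(r) over the experiences an agent
    collected in each region. `history` is a list of (state, reward) pairs and
    `region_of` maps a state to a region id."""
    e = defaultdict(float)
    for state, r in history:
        e[region_of(state)] += measure(r)
    return e

Sharing then applies the expertness-weighted combination region by region, so each agent defers to whichever agent knows that part of the world best.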
28
Results
29
Conclusion
- Expertness-based cooperative learning can be detrimental
  - Demonstrated when considering global expertness
- The improvement due to cooperation increases with the resolution at which expertness is assessed
- The correct choice of expertness measure is crucial
  - The test case highlights the robustness of Abs to the problem-specific nature of reinforcement signals
30
More material
- Transmission noise
- Spaccy robot
- Penalties associated with sharing
31
Speaker notes:
- How should information be shared? Open question, put to the audience; fill in gaps in their answers with examples such as scouts etc.
- BE CLEAR ABOUT WHETHER STATE IS EXPANDED IN SUBSEQUENT PAPERS
- Initial testing: state-space and experience sharing improve as expected, but at the cost of an exponential state space and high communication cost respectively; sharing Q-values did improve, but we will talk about this later.
- Sharing Q-values is the most interesting case because the knowledge of the agent is embodied in the Q-values, so we are directly sharing knowledge. Sharing the state space just gives more information at each step, which is clearly going to improve things, and sharing experiences just increases the number of trials. Q-values are where cunning is needed.
32
Possible questions
- Optimality and convergence of sharing
- How often is Q shared?