Cooperative Q-Learning
Lars Blackmore and Steve Block

References:
– Ahmadabadi, M.N.; Asadpour, M. "Expertness Based Cooperative Q-learning." IEEE Transactions on Systems, Man and Cybernetics, Part B, Volume 32, Issue 1, Feb. 2002, Pages 66–76.
– Eshgh, S.M.; Ahmadabadi, M.N. "An Extension of Weighted Strategy Sharing in Cooperative Q-Learning for Specialized Agents." Proceedings of the 9th International Conference on Neural Information Processing, Volume 1.
– Tan, M. "Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents." Proceedings of the 10th International Conference on Machine Learning, 1993.

Overview
Single agent reinforcement learning
– Markov Decision Processes
– Q-learning
Cooperative Q-learning
– Sharing state, sharing experiences and sharing policy
Sharing policy through Q-values
– Simple averaging
Expertness based distributed Q-learning
– Expertness measures and weighting strategies
– Experimental results
Expertness with specialised agents
– Scope of specialisation
– Experimental results

Markov Decision Processes
Goal: find the optimal policy π*(s) that maximises lifetime reward
Framework:
– States S
– Actions A
– Rewards R(s,a)
– Probabilistic transition function T(s,a,s')
[Figure: grid world with goal state G]
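The components above can be collected into a small container type. The following Python sketch is illustrative only; the slides do not specify an implementation, and the discount factor gamma is an assumed addition.

```python
from dataclasses import dataclass
from typing import Callable, List

State = int
Action = int

@dataclass
class MDP:
    states: List[State]                                  # S
    actions: List[Action]                                # A
    reward: Callable[[State, Action], float]             # R(s, a)
    transition: Callable[[State, Action, State], float]  # T(s, a, s'): P(s' | s, a)
    gamma: float = 0.9                                    # discount factor (assumed; not on the slide)
```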

Reinforcement Learning
Want to find π* through experience:
– Reinforcement learning
– Intuitive approach; similar to human and animal learning
– Use some policy π for motion
– Converge to the optimal policy π*
An algorithm for reinforcement learning…

Q-Learning
Define Q*(s,a):
– "Total reward if the agent is in state s, takes action a, then acts optimally at all subsequent time steps"
Optimal policy: π*(s) = argmax_a Q*(s,a)
Q(s,a) is an estimate of Q*(s,a)
Q-learning motion policy: π(s) = argmax_a Q(s,a)
Update Q recursively:
Q(s,a) ← (1−α) Q(s,a) + α [ r + γ max_a' Q(s',a') ]

Q-learning
Update Q recursively:
[Figure: agent in state s takes action a and transitions to state s']

Q-learning
Update Q recursively:
[Figure: worked example with r = 100, Q(s',a1') = 25, Q(s',a2') = 50]

Q-Learning
Define Q*(s,a):
– "Total reward if the agent is in state s, takes action a, then acts optimally at all subsequent time steps"
Optimal policy: π*(s) = argmax_a Q*(s,a)
Q(s,a) is an estimate of Q*(s,a)
Q-learning motion policy: π(s) = argmax_a Q(s,a)
Update Q recursively:
Q(s,a) ← (1−α) Q(s,a) + α [ r + γ max_a' Q(s',a') ]
Optimality theorem:
– "If each (s,a) pair is updated an infinite number of times, Q converges to Q* with probability 1"
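A minimal tabular sketch of the update and motion policy described above. The epsilon-greedy exploration scheme and the values of alpha and gamma are assumptions; the slides do not fix them.

```python
import random
from collections import defaultdict

class QLearner:
    """Tabular Q-learning agent; Q[(s, a)] is the estimate of Q*(s, a)."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.Q = defaultdict(float)

    def policy(self, s):
        # Epsilon-greedy version of pi(s) = argmax_a Q(s, a); a purely greedy
        # policy would never explore, so some exploration scheme is assumed.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(s, a)])

    def update(self, s, a, r, s_next):
        # Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * [r + gamma * max_a' Q(s', a')]
        best_next = max(self.Q[(s_next, a2)] for a2 in self.actions)
        self.Q[(s, a)] = (1 - self.alpha) * self.Q[(s, a)] + \
                         self.alpha * (r + self.gamma * best_next)
```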

Distributed Q-Learning
Problem formulation
An example situation:
– Mobile robots
Learning framework:
– Individual learning for t_i trials
– Each trial starts from a random state and ends when the robot reaches the goal
– Next, all robots switch to cooperative learning

Distributed Q-learning
How should information be shared?
Three fundamentally different approaches:
– Expanding the state space
– Sharing experiences
– Sharing Q-values
Methods for sharing state space and experience are straightforward
– These showed some improvement in testing
The best method for sharing Q-values is not obvious
– This area offers the greatest challenge and the greatest potential for innovation

Sharing Q-values
An obvious approach: simple averaging
This was shown to yield some improvement
What are some of the problems?
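A sketch of what simple averaging amounts to, assuming each agent's Q table is stored as a dict keyed by (state, action); treating missing entries as zero is an assumption, not something the slides specify.

```python
def simple_average(q_tables):
    """q_tables: one dict per agent mapping (state, action) -> Q value.
    Returns the shared tables: every agent gets the element-wise mean."""
    keys = set().union(*q_tables)
    avg = {k: sum(q.get(k, 0.0) for q in q_tables) / len(q_tables) for k in keys}
    # After sharing, all agents hold identical copies of the averaged table.
    return [dict(avg) for _ in q_tables]
```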

Problems with Simple Averaging
All agents have the same Q table after sharing, and hence the same policy:
– Different policies allow agents to explore the state space differently
Convergence rate may be reduced

Problems with Simple Averaging
Convergence rate may be reduced
– Without co-operation:
[Table: Q(s,a) per trial for Agent 1 and Agent 2; figure: grid world with goal G]

Problems with Simple Averaging
Convergence rate may be reduced
– With simple averaging:
[Table: Q(s,a) per trial for Agent 1 and Agent 2 under simple averaging; figure: grid world with goal G]

Problems with Simple Averaging
All agents have the same Q table after sharing, and hence the same policy:
– Different policies allow agents to explore the state space differently
Convergence rate may be reduced
– Highly problem specific
– Slows adaptation in a dynamic environment
Overall performance is task specific

Expertness
Idea: value more highly the knowledge of agents who are 'experts'
– Expertness based cooperative Q-learning
New Q-sharing equation:
Q_i_new(s,a) = Σ_j W_ij · Q_j(s,a)
Agent i assigns an importance weight W_ij to the Q data held by agent j
These weights are based on the agents' relative expertness values e_i and e_j
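The sharing step written out as code, using the same dict-based Q-table representation as in the earlier sketches; `share_q_values` and the weight-matrix layout are hypothetical names introduced here, not taken from the paper.

```python
def share_q_values(q_tables, W):
    """Weighted sharing step: Q_i_new(s,a) = sum_j W[i][j] * Q_j(s,a).
    q_tables: one dict per agent; W: n x n weight matrix, row i summing to 1."""
    n = len(q_tables)
    keys = set().union(*q_tables)
    return [
        {k: sum(W[i][j] * q_tables[j].get(k, 0.0) for j in range(n)) for k in keys}
        for i in range(n)
    ]
```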

Expertness Measures
Need to define the expertness of agent i
– Based on the reinforcement signals agent i has received
Various definitions:
– Algebraic sum
– Absolute value
– Positive
– Negative
Different interpretations
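A sketch of the four measures named above, computed from an agent's history of reinforcement signals; the exact definitions are one reading of the measures and should be treated as assumptions rather than the paper's formulas.

```python
def expertness(rewards, measure="Abs"):
    """rewards: the sequence of reinforcement signals an agent has received."""
    if measure == "Sum":   # algebraic sum: rewards and punishments cancel out
        return sum(rewards)
    if measure == "Abs":   # absolute value: total magnitude of all signals
        return sum(abs(r) for r in rewards)
    if measure == "P":     # positive: rewards only
        return sum(r for r in rewards if r > 0)
    if measure == "N":     # negative: magnitude of punishments only
        return sum(-r for r in rewards if r < 0)
    raise ValueError(f"unknown measure: {measure}")
```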

Weighting Strategies
How do we come up with weights based on the expertness values?
Alternative strategies:
– 'Learn from all'
– 'Learn from experts'
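Hedged sketches of the two strategies. 'Learn from all' follows the standard weighted-strategy-sharing form (weights proportional to expertness); 'learn from experts' is one plausible reading in which an agent only weights agents more expert than itself, proportional to the expertness gap. The paper's exact formulas may differ.

```python
def learn_from_all(e):
    """W_ij = e_j / sum_k e_k: every agent weights every table by relative expertness.
    e: list of expertness values (assumed non-negative and not all zero)."""
    total = sum(e)
    n = len(e)
    return [[e[j] / total for j in range(n)] for _ in range(n)]

def learn_from_experts(e):
    """Each agent takes weight only from agents more expert than itself,
    proportional to the expertness gap; otherwise it keeps its own table.
    This is a plausible reconstruction, not necessarily the paper's formula."""
    n = len(e)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        gaps = {j: e[j] - e[i] for j in range(n) if e[j] > e[i]}
        total_gap = sum(gaps.values())
        if total_gap == 0:        # no agent is more expert: keep own Q table
            W[i][i] = 1.0
        else:
            for j, gap in gaps.items():
                W[i][j] = gap / total_gap
    return W
```

Either function produces a weight matrix that can be passed to the share_q_values sketch shown earlier.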

Experimental Setup
Mobile robots in a hunter-prey scenario
Individual learning phase (two variants):
1. All robots carry out the same number of trials
2. Robots carry out different numbers of trials
Followed by cooperative learning
Parameters to investigate:
– Cooperative learning vs individual learning
– Similar vs different initial expertise levels
– Different expertness measures
– Different weight assigning mechanisms

Results

Conclusions
Without expertness measures, cooperation is detrimental
– Simple averaging shows a decrease in performance
Expertness based cooperative learning is shown to be superior to individual learning
– Only true when agents have significantly different expertness values (necessary but not sufficient)
Expertness measures Abs, P and N show the best performance
– Of these three, Abs provides the best compromise
The 'Learning from Experts' weighting strategy is shown to be superior to 'Learning from All'

What about this situation?
Both agents have accumulated the same rewards and punishments
Which is the most expert?
[Figure: grid world with reward locations R]

Specialised Agents
An agent may have explored one area a lot but another area very little
– The agent is an expert in one area but not in another
Idea: specialised agents
– Agents can be experts in certain areas of the world
– A learnt policy is more valuable if the agent is more expert in that particular area

Specialised Agents
Scope of specialisation:
– Global
– Local
– State
[Figure: grid world with reward locations R]

Experimental Setup
Mobile robots in a grid world
The world is approximately segmented into three regions by obstacles
– One goal per region
Same individual learning followed by cooperative learning as before
[Figure: grid world with reward locations R]

Results

Conclusions
Expertness based cooperative learning without specialised agents can improve performance but can also be detrimental
Cooperative learning with specialised agents greatly improved performance
The correct choice of expertness measure is crucial
– The test case highlights the robustness of Abs to the problem-specific nature of reinforcement signals

Future Directions
Communication errors
– Very sensitive to corrupted information
– Can we distinguish bad advice?
What are the costs associated with cooperative Q-learning?
– Computation
– Communication
How can the scope of specialisation be selected?
– Dynamic scoping
Dynamic environments
– Discounted expertness

Any questions …