Distributed Q Learning
Lars Blackmore and Steve Block

Contents
What is Q-learning?
– MDP framework
– Q-learning per se
Distributed Q-learning: how should information be shared?
Sharing Q-values: why is it interesting?
– Simple averaging (no good)
– Expertness based distributed Q-learning
– Expertness with specialised agents (optional)

Markov Decision Processes
Framework: MDP
– States S
– Actions A
– Rewards R(s,a)
– Transition function T(s,a,s')
Goal: find the optimal policy π*(s) that maximises lifetime reward
[Figure: grid world with goal state G]
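As a concrete illustration of this framework, here is a minimal MDP sketch in plain Python. The 4x4 grid, the deterministic moves and the reward of 100 at the goal are illustrative assumptions, not details given on the slides (which only show a grid world with a goal G and, later, r = 100).

```python
# Minimal sketch of the MDP components above, written as plain Python.
# The 4x4 grid, deterministic moves and the reward of 100 at the goal are
# illustrative assumptions; the slides only show a grid world with goal G.
GRID = 4
STATES = [(x, y) for x in range(GRID) for y in range(GRID)]
ACTIONS = ["up", "down", "left", "right"]
GOAL = (GRID - 1, GRID - 1)

def T(s, a):
    """Transition function T(s, a) -> s' (deterministic here)."""
    x, y = s
    dx, dy = {"up": (0, 1), "down": (0, -1),
              "left": (-1, 0), "right": (1, 0)}[a]
    nx, ny = x + dx, y + dy
    # A move off the grid leaves the state unchanged.
    return (nx, ny) if 0 <= nx < GRID and 0 <= ny < GRID else s

def R(s, a):
    """Reward R(s, a): 100 for stepping onto the goal, 0 otherwise."""
    return 100 if T(s, a) == GOAL else 0
```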

Reinforcement Learning
Want to find π* through experience
– Reinforcement learning
– Intuitively similar to human/animal learning
– Use some policy π for motion
– Converge to the optimal policy π*
An algorithm for reinforcement learning…

Q-Learning
Define Q*(s,a):
– "Total reward if agent is in state s, takes action a, then acts optimally at all subsequent time steps"
Optimal policy: π*(s) = argmax_a Q*(s,a)
Q(s,a) is an estimate of Q*(s,a)
Q-learning motion policy: π(s) = argmax_a Q(s,a)
Update Q recursively:
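The update equation on this slide did not survive transcription. The standard one-step Q-learning update, which is presumably what the slide showed (α is the learning rate, γ the discount factor), is:

$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\Bigl[r + \gamma \max_{a'} Q(s',a')\Bigr]$$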

Q-learning
Update Q recursively:
[Figure: agent takes action a from state s to state s', receiving reward r = 100 next to the goal G]

Q-learning
Update Q recursively:
[Figure: same transition with r = 100, and Q(s',a1') = 25, Q(s',a2') = 50 at the successor state]
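Using the numbers shown on this slide, and assuming for illustration a learning rate α = 1 and a discount factor γ = 0.9 (neither value is given in the transcript), the update of Q(s,a) would be:

$$Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s',a') = 100 + 0.9 \times \max(25,\,50) = 145$$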

Q-Learning
Define Q*(s,a):
– "Total reward if agent is in state s, takes action a, then acts optimally at all subsequent time steps"
Optimal policy: π*(s) = argmax_a Q*(s,a)
Q(s,a) is an estimate of Q*(s,a)
Q-learning motion policy: π(s) = argmax_a Q(s,a)
Update Q recursively (as above)
Optimality theorem:
– "If each (s,a) pair is updated an infinite number of times, Q converges to Q* with probability 1"

Distributed Q-Learning
Problem formulation
An example situation:
– Mobile robots
Learning framework:
– Individual learning for t_i trials
– Each trial starts from a random state and ends when the robot reaches the goal
– Next, all robots switch to cooperative learning
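A hypothetical sketch of this learning framework: each agent runs its own number of individual trials before any cooperation. The ε-greedy exploration policy, the parameter values and the function names are assumptions made for illustration; the environment functions can be the grid-world sketch shown earlier.

```python
import random

ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1  # illustrative values, not from the slides

def run_trial(Q, states, actions, T, R, goal):
    """One individual-learning trial: random start state, ends at the goal."""
    s = random.choice([st for st in states if st != goal])
    while s != goal:
        # Epsilon-greedy action selection (an assumption; the slides do not
        # say which exploration policy the robots use).
        if random.random() < EPSILON:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s2, r = T(s, a), R(s, a)
        target = r + GAMMA * max(Q[(s2, act)] for act in actions)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2

def individual_phase(q_tables, trials_per_agent, states, actions, T, R, goal):
    """Each agent learns alone for its own number of trials t_i."""
    for Q, n in zip(q_tables, trials_per_agent):
        for _ in range(n):
            run_trial(Q, states, actions, T, R, goal)
    # After this phase, the agents switch to cooperative learning (later slides).
```

For example, with the grid-world sketch above one could use q_tables = [{(s, a): 0.0 for s in STATES for a in ACTIONS} for _ in range(3)] and trials_per_agent = [20, 20, 20].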

Distributed Q-learning
How should information be shared? Three fundamentally different approaches:
– Expanding the state space
– Sharing experiences
– Sharing Q-values
Methods for sharing state space and experience are straightforward
– These showed some improvement in testing
The best method for sharing Q-values is not obvious
– This area offers the greatest challenge and the greatest potential for cunning

Sharing Q-values
First approach: simple averaging
This was shown to yield some improvement
What are some of the problems?
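A minimal sketch of what simple averaging means here, assuming each agent's knowledge is a Q-table stored as a dict keyed by (state, action); the function name is hypothetical.

```python
def simple_average(q_tables):
    """Simple averaging: replace every agent's Q(s, a) with the mean over
    all agents. q_tables is a list of dicts keyed by (state, action)."""
    n = len(q_tables)
    averaged = {k: sum(q[k] for q in q_tables) / n for k in q_tables[0]}
    # Every agent leaves with an identical copy of the averaged table,
    # which is exactly the first problem raised on the next slide.
    return [dict(averaged) for _ in q_tables]
```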

Problems with Simple Averaging
All agents have the same Q table after sharing, and hence the same policy
– Different policies allow agents to explore the state space differently
Convergence rate may be reduced

Problems with Simple Averaging
Convergence rate may be reduced
– Without co-operation
[Table: Q(s,a) per trial for Agent 1 and Agent 2]

Problems with Simple Averaging
Convergence rate may be reduced
– With simple averaging
[Table: Q(s,a) per trial for Agent 1 and Agent 2]

Problems with Simple Averaging
All agents have the same Q table after sharing, and hence the same policy
– Different policies allow agents to explore the state space differently
Convergence rate may be reduced
– Highly problem specific
– Slows adaptation in a dynamic environment
Overall performance is highly problem specific

Expertness
Idea: value more highly the knowledge of agents who are 'experts'
– Expertness based cooperative Q-learning
New Q-sharing equation (see below):
Agent i assigns an importance weight W_ij to the Q data held by agent j
These weights are based on the agents' relative expertness values e_i and e_j
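The Q-sharing equation on this slide is not in the transcript. In expertness based cooperative Q-learning the shared table is a weighted combination of the agents' tables, so the slide presumably showed something of this form (a sketch, not necessarily the exact notation used):

$$Q_i^{\text{new}}(s,a) = \sum_{j} W_{ij}\, Q_j^{\text{old}}(s,a), \qquad \sum_{j} W_{ij} = 1$$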

Expertness Measures
Need to define the expertness of agent i
– Based on the rewards agent i has received
Various definitions (sketched below):
– Algebraic sum
– Abs
– Positive
– Negative
Different interpretations
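The formal definitions did not transcribe. One plausible formalisation, based on the usual reading of the four names, with r_i(t) denoting the reward agent i received at step t (an assumption, not the slide's exact notation):

$$e_i^{\mathrm{Alg}} = \sum_t r_i(t), \quad e_i^{\mathrm{Abs}} = \sum_t \bigl|r_i(t)\bigr|, \quad e_i^{\mathrm{P}} = \sum_t \max\bigl(r_i(t),\,0\bigr), \quad e_i^{\mathrm{N}} = \sum_t \bigl|\min\bigl(r_i(t),\,0\bigr)\bigr|$$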

Weighting Strategies
How do we come up with weights based on the expertness values?
Alternative strategies (sketched below):
– 'Learn from all'
– 'Learn from experts'
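The weight formulas on this slide did not transcribe. One plausible formalisation, not necessarily the authors' exact definitions: under 'learn from all', every agent's table is weighted in proportion to its expertness; under 'learn from experts', agent i keeps most of its own table and accepts contributions only from agents more expert than itself (the impressibility factor α_i ∈ [0,1] is an assumed parameter).

$$\text{Learn from all:}\quad W_{ij} = \frac{e_j}{\sum_k e_k}$$

$$\text{Learn from experts:}\quad W_{ii} = 1 - \alpha_i, \qquad W_{ij} = \alpha_i \,\frac{\max\bigl(e_j - e_i,\,0\bigr)}{\sum_{k \neq i} \max\bigl(e_k - e_i,\,0\bigr)} \quad (j \neq i)$$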

Experimental Setup
Mobile robots in a hunter-prey scenario
Individual learning phase, two cases:
1. All robots carry out the same number of trials
2. Robots carry out different numbers of trials
Then cooperative learning

Results
Compare cooperative learning to individual learning
– Agents with equal experience
– Agents with different experience
Different weight-assigning mechanisms
Different expertness measures

Results

Conclusion
Without expertness measures, cooperation is detrimental
– Simple averaging shows a decrease in performance
Expertness based cooperative learning is shown to be superior to individual learning
– Only true when agents have significantly different expertness values (necessary but not sufficient)
Expertness measures Abs, P and N show the best performance
– Of these three, Abs provides the best compromise
'Learning from Experts' weighting strategy shown to be superior to 'Learning from All'

What about this situation?
Both agents have accumulated the same rewards and punishments
Which is the most expert?
[Figure: two grid worlds with reward states marked R]

Specialised Agents
An agent may have explored one area a lot but another area very little
– The agent is an expert in one area but not in another
Idea: specialised agents
– Agents can be experts in certain areas of the world
– Learnt policy is more valuable if an agent is more expert in that particular area
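A sketch of how per-state ('local') expertness might drive the sharing, in contrast to the single global expertness value used earlier. The Abs-style measure, the per-state granularity and the fall-back to a plain average are assumptions for illustration, not the paper's exact scheme; as written, all agents still end up with the same table, so this only captures the local-weighting idea.

```python
def local_expertness(reward_log):
    """Per-state expertness: sum of |reward| an agent collected in each state.

    reward_log: list of (state, reward) pairs experienced by one agent.
    Using |reward| (Abs-style) and per-state granularity are illustrative
    assumptions; the paper assesses expertness over regions of the world.
    """
    e = {}
    for s, r in reward_log:
        e[s] = e.get(s, 0.0) + abs(r)
    return e

def share_locally(q_tables, expertness_per_agent):
    """Weight each agent's Q(s, a) by that agent's expertness *at state s*,
    so an agent dominates the shared values only where it is the local expert."""
    result = []
    for _ in q_tables:
        new_q = {}
        for (s, a) in q_tables[0]:
            w = [e.get(s, 0.0) for e in expertness_per_agent]
            if sum(w) == 0.0:
                w = [1.0] * len(q_tables)  # nobody is expert here: plain average
            new_q[(s, a)] = sum(wi * q[(s, a)] for wi, q in zip(w, q_tables)) / sum(w)
        result.append(new_q)
    return result
```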

Results

Conclusion
Expertness based cooperative learning can be detrimental
– Demonstrated when considering global expertness
Improvement due to cooperation increases with the resolution over which expertness is assessed
Correct choice of expertness measure is crucial
– Test case highlights the robustness of Abs to the problem-specific nature of reinforcement signals

More material
– Transmission noise
– Spaccy robot
– Penalties associated with sharing

Speaker notes:
– How should information be shared? Open question put to the audience; fill in gaps in their answers with examples such as scouts etc.
– BE CLEAR ABOUT WHETHER STATE IS EXPANDED IN SUBSEQUENT PAPERS
– Initial testing: state-space and experience sharing improve as expected, but at exponential space and high communication cost respectively; sharing Q-values did improve, but will talk about this later
– Sharing Q-values is most interesting because the knowledge of the agent is embodied in the Q-values, so we are directly sharing knowledge. Sharing state space is just giving more info at each step, which is clearly going to improve things, and sharing experiences is just increasing the number of trials. Q-values is where cunning is needed.

Possible questions
– Optimality and convergence of sharing
– How often is Q shared?