
September 15. Multiagent learning using a variable learning rate. Igor Kiselev, University of Waterloo. Based on M. Bowling and M. Veloso, Artificial Intelligence, Vol. 136, 2002.

University of Waterloo Page 2 Agenda
- Introduction: motivation to multi-agent learning
- MDP framework
- Stochastic game framework
- Reinforcement learning: single-agent, multi-agent
- Related work
- Multiagent learning with a variable learning rate
- Theoretical analysis of the replicator dynamics
- WoLF Incremental Gradient Ascent algorithm
- WoLF Policy Hill Climbing algorithm
- Results
- Concluding remarks

September 15. Introduction: Motivation to multi-agent learning

University of Waterloo Page 4 MAL is a Challenging and Interesting Task. The research goal is to enable an agent to learn effectively how to act (cooperate, compete) in the presence of other learning agents in complex domains. Equipping a MAS with learning capabilities permits the agents to deal with large, open, dynamic, and unpredictable environments. Multi-agent learning (MAL) is a challenging problem for developing intelligent systems: multiagent environments are non-stationary, violating the traditional assumption underlying single-agent learning.

University of Waterloo Page 5 Reinforcement Learning Papers: Statistics Google Scholar

University of Waterloo Page 6 Various Approaches to Learning / Related Work Y. Shoham et al., 2003

September 15. Preliminaries: MDP and Stochastic Game Frameworks

University of Waterloo Page 8 Single-agent Reinforcement Learning. Independent learners act while ignoring the existence of others, in a stationary environment. They learn a policy that maximizes individual utility (by trial and error): they perform their actions, obtain a reward, and update their Q-values without regard to the actions performed by others. (Agent-environment loop: policy, actions, world/state, observations and sensations, rewards. R. S. Sutton, 1997)

University of Waterloo Page 9 Markov Decision Processes / MDP Framework (T. M. Mitchell, 1997). The environment is modeled as an MDP, defined by (S, A, R, T):
S – finite set of states of the environment
A(s) – set of actions possible in state s ∈ S
T: S×A → P(S) – transition function from state-action pairs to probability distributions over states
R(s,s',a) – expected reward on the transition from s to s'
P(s,s',a) – probability of the transition from s to s'
γ – discount rate for delayed reward
At each discrete time t = 0, 1, 2, ... the agent observes state s_t ∈ S, chooses action a_t ∈ A(s_t), receives immediate reward r_{t+1}, and the state changes to s_{t+1}, producing the trajectory s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, ...
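As a concrete illustration of this tuple only, here is a minimal Python sketch that stores the MDP components keyed the way the slide defines them; the names `MDP`, `transition`, `reward`, and the toy example are ours, not from the slides or the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State, Action = str, str

@dataclass
class MDP:
    states: List[State]                                   # S
    actions: Dict[State, List[Action]]                    # A(s)
    transition: Dict[Tuple[State, State, Action], float]  # P(s, s', a)
    reward: Dict[Tuple[State, State, Action], float]      # R(s, s', a)
    gamma: float                                          # discount rate

# A two-state toy example: from 'idle', the action 'work' usually reaches 'done'.
toy = MDP(
    states=["idle", "done"],
    actions={"idle": ["work", "wait"], "done": []},
    transition={("idle", "done", "work"): 0.9,
                ("idle", "idle", "work"): 0.1,
                ("idle", "idle", "wait"): 1.0},
    reward={("idle", "done", "work"): 1.0,
            ("idle", "idle", "work"): 0.0,
            ("idle", "idle", "wait"): 0.0},
    gamma=0.9,
)
```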

University of Waterloo Page 10 Agent's learning task – find an optimal action-selection policy. Execute actions in the environment, observe the results, and learn to construct an optimal action-selection policy that maximizes the agent's performance: the long-term total discounted reward. Find a policy π: s ∈ S → a ∈ A(s) that maximizes the value (expected future reward) of each state s and each state-action pair (s,a):
V^π(s) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s_t = s, π }
Q^π(s,a) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s_t = s, a_t = a, π }
(T. M. Mitchell, 1997)
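For reference, these definitions satisfy the standard Bellman recursions (a well-known identity not shown on the slide), written here in the slide's P(s,s',a) and R(s,s',a) notation:

```latex
V^{\pi}(s) = \sum_{a \in A(s)} \pi(s,a) \sum_{s'} P(s,s',a)\,\bigl[ R(s,s',a) + \gamma\, V^{\pi}(s') \bigr],
\qquad
Q^{\pi}(s,a) = \sum_{s'} P(s,s',a)\,\bigl[ R(s,s',a) + \gamma\, V^{\pi}(s') \bigr].
```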

University of Waterloo Page 11 Agent's Learning Strategy – the Q-Learning method. Q-function: iterative approximation of the Q-values with learning rate β, 0 ≤ β < 1, via the update Q(s,a) ← (1 − β) Q(s,a) + β ( r + γ max_{a'} Q(s',a') ). Q-Learning incremental process:
1. Observe the current state s.
2. Select an action a with probability based on the employed selection policy.
3. Observe the new state s′.
4. Receive a reward r from the environment.
5. Update the corresponding Q-value for action a and state s.
6. Terminate the current trial if the new state s′ satisfies a terminal condition; otherwise let s′ → s and go back to step 1.
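A minimal sketch of one such step in Python, assuming a table `Q` indexed by (state, action) pairs and an ε-greedy selection policy; the function names and the defaultdict representation are ours, for illustration only.

```python
import random
from collections import defaultdict

Q = defaultdict(float)  # Q[(s, a)], implicitly 0.0 for unseen pairs

def select_action(s, actions, epsilon=0.1):
    """Epsilon-greedy selection policy over the current Q-values."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_update(s, a, r, s_next, next_actions, beta=0.1, gamma=0.9):
    """Q(s,a) <- (1 - beta) Q(s,a) + beta (r + gamma max_a' Q(s',a'))."""
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] = (1 - beta) * Q[(s, a)] + beta * (r + gamma * best_next)
```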

University of Waterloo Page 12 Multi-agent Framework. Learning in a multi-agent setting: all agents learn simultaneously, so the environment is not stationary (the other agents are evolving); this is the problem of a “moving target”.

University of Waterloo Page 13 Stochastic Game Framework for addressing MAL. From the perspective of sequential decision making: Markov decision processes have one decision maker and multiple states; repeated games have multiple decision makers and one state; stochastic games (Markov games) extend MDPs to multiple decision makers and multiple states.

University of Waterloo Page 14 Stochastic Game / Notation. S: set of states (n-agent stage games). R_i(s,a): reward to player i in state s under joint action a. T(s,a,s′): probability of transition from s to state s′ under joint action a. (Figure: stage-game payoff matrices R_1(s,a), R_2(s,a), ... over joint actions a_1, a_2, linked by the transitions T(s,a,s′).) From a dynamic programming approach: Q_i(s,a) is the long-run payoff to player i from s under joint action a, assuming equilibrium play thereafter.
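One common way to write out this dynamic-programming relation (the Nash-Q style formulation from the related work, restated here in the slide's notation as an illustration) is:

```latex
Q_i(s,a) = R_i(s,a) + \gamma \sum_{s'} T(s,a,s')\, V_i(s'),
```

where V_i(s') is player i's payoff in a Nash equilibrium of the stage game whose payoff matrices are Q_1(s',·), ..., Q_n(s',·).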

September 15September 15 Approach Multiagent learning using a variable learning rate

University of Waterloo Page 16 Evaluation criteria for multi-agent learning. Using convergence to a Nash equilibrium as the criterion is problematic:
- Terminating criterion: equilibrium identifies conditions under which learning can or should stop (it is easier to play in equilibrium than to continue computation).
- A Nash equilibrium strategy has no “prescriptive force”: it says nothing about play prior to termination.
- There may be multiple potential equilibria.
- The opponent may not wish to play an equilibrium.
- Calculating a Nash equilibrium can be intractable for large games.
New criteria: rationality and convergence in self-play. Converge to a stationary policy, not necessarily a Nash equilibrium; learning only terminates once a best response to the play of the other agents is found; in self-play, learning only terminates in a stationary Nash equilibrium.

University of Waterloo Page 17 Contributions and Assumptions. Contributions: a criterion for multi-agent learning algorithms; a simple Q-learning algorithm that can play mixed strategies; the WoLF PHC (Win or Learn Fast Policy Hill Climbing) algorithm. Assumptions: the algorithm achieves both properties given that the game is two-player, two-action; players can observe each other's mixed strategies (not just the played actions); and players can use infinitesimally small step sizes.

University of Waterloo Page 18 Opponent Modeling or Joint-Action Learners C. Claus, C. Boutilier, 1998

University of Waterloo Page 19 Joint-Action Learners Method. Maintains an explicit model of the opponents for each state. Q-values are maintained for all possible joint actions at a given state. The key assumption is that the opponent is stationary; thus the model of the opponent is simply the frequencies of the actions it has played in the past. Probability of the opponent playing action a −i : Pr(a −i ) = C(a −i ) / n(s), where C(a −i ) is the number of times the opponent has played action a −i and n(s) is the number of times state s has been visited.
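A minimal sketch of such a frequency model in Python; the class and function names (`OpponentModel`, `expected_value`) and the joint-action Q-table layout are illustrative assumptions, not the slides' notation.

```python
from collections import defaultdict

class OpponentModel:
    """Per-state frequency model of a (presumed stationary) opponent."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # counts[s][a_opp] = C(a_opp)
        self.visits = defaultdict(int)                        # visits[s] = n(s)

    def observe(self, s, a_opp):
        self.counts[s][a_opp] += 1
        self.visits[s] += 1

    def prob(self, s, a_opp):
        """Pr(a_opp | s) = C(a_opp) / n(s); 0.0 before any observation."""
        if self.visits[s] == 0:
            return 0.0
        return self.counts[s][a_opp] / self.visits[s]

def expected_value(model, Q, s, a_own, opp_actions):
    """Expected value of own action a_own, averaging the joint-action
    Q-values Q[(s, a_own, a_opp)] under the opponent model."""
    return sum(model.prob(s, a_opp) * Q[(s, a_own, a_opp)] for a_opp in opp_actions)
```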

University of Waterloo Page 20 Opponent modeling FP-Q learning algorithm

University of Waterloo Page 21 WoLF Principles. The idea is to use two different strategy update steps: one for winning and another for losing situations. “Win or Learn Fast”: the agent reduces its learning rate when it is performing well and increases it when it is doing badly. This improves the convergence of IGA and of policy hill-climbing. To distinguish between the two situations, the player keeps track of two policies: it is winning if the expected utility of its actual policy is greater than the expected utility of the equilibrium (or average) policy. If winning, the smaller of the two strategy update steps is chosen, as formalized below.
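Written out (a restatement of the bullet above, with π the current policy, π̄ the average or equilibrium policy, and δ_w < δ_l the two step sizes):

```latex
\text{winning} \iff \sum_{a} \pi(s,a)\, Q(s,a) \;>\; \sum_{a} \bar{\pi}(s,a)\, Q(s,a),
\qquad
\delta = \begin{cases} \delta_w & \text{if winning (learn slowly)} \\ \delta_l & \text{if losing (learn fast)} \end{cases}
```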

University of Waterloo Page 22 Incremental Gradient Ascent Learners (IGA). IGA incrementally climbs in the mixed-strategy space for two-player, two-action general-sum games; it guarantees convergence either to a Nash equilibrium or to an average payoff that is sustained by some Nash equilibrium. WoLF IGA, based on the WoLF principle, guarantees convergence to a Nash equilibrium for all two-player, two-action general-sum games.
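As a sketch of the rule (hedged to the two-player, two-action setting the slide describes), the row player's mixed strategy α, its probability of playing the first action, is updated by a gradient step whose learning rate switches with the WoLF test; here η is an infinitesimal step size and α^e the row player's strategy in some Nash equilibrium, with the result projected back onto [0,1]:

```latex
\alpha_{k+1} = \alpha_k + \eta\, \ell_k \,\frac{\partial V_r(\alpha_k,\beta_k)}{\partial \alpha},
\qquad
\ell_k = \begin{cases} \ell_{\min} & \text{if } V_r(\alpha_k,\beta_k) > V_r(\alpha^{e},\beta_k) \ \ (\text{winning}) \\ \ell_{\max} & \text{otherwise} \ \ (\text{losing}) \end{cases}
```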

University of Waterloo Page 23 Information passing in the PHC algorithm

University of Waterloo Page 24 A Simple Q-Learner that plays mixed strategies (Policy Hill Climbing, PHC). The mixed strategy is updated by giving more weight to the action that Q-learning currently believes is the best. Properties: it guarantees rationality against stationary opponents, but it does not converge in self-play.
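A sketch of that hill-climbing policy update in Python, assuming `pi_s` and `q_s` are dictionaries over the actions available in the current state and `delta` is the step size; this follows the constrained-step form described in the paper, but the helper name `phc_policy_update` is ours.

```python
def phc_policy_update(pi_s, q_s, delta):
    """Move the mixed strategy pi_s a step of size delta toward the greedy action of q_s.

    pi_s: dict action -> probability (sums to 1)
    q_s:  dict action -> Q-value for the current state
    """
    actions = list(pi_s)
    best = max(actions, key=lambda a: q_s[a])
    n = len(actions)
    removed = 0.0
    for a in actions:
        if a != best:
            # Take at most delta/(n-1) from each non-greedy action, never going below 0.
            step = min(pi_s[a], delta / (n - 1))
            pi_s[a] -= step
            removed += step
    pi_s[best] += removed  # the greedy action absorbs the removed probability mass
    return pi_s

# Example: with two actions, the greedy action 'H' gains 0.2 probability.
print(phc_policy_update({"H": 0.5, "T": 0.5}, {"H": 1.0, "T": -1.0}, delta=0.2))
```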

University of Waterloo Page 25 WoLF Policy Hill Climbing algorithm. Additionally maintains the average policy. Determination of “W” (win) and “L” (lose): by comparing the expected value of the current policy to that of the average policy; the agent only needs to see its own payoff. The probability of playing each action is then adjusted with the hill-climbing step, using the smaller step δ_w when winning and the larger step δ_l when losing. Converges for two-player, two-action stochastic games in self-play.
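A minimal sketch, in Python, of the two pieces WoLF-PHC adds on top of the PHC step shown earlier; the names `update_average_policy`, `wolf_delta`, and `visit_count` are illustrative, not the paper's notation.

```python
def update_average_policy(avg_pi_s, pi_s, visit_count):
    """Incrementally track the average policy for the current state.

    visit_count: number of times this state has been visited (including now).
    """
    for a in pi_s:
        avg_pi_s[a] += (pi_s[a] - avg_pi_s[a]) / visit_count
    return avg_pi_s

def wolf_delta(pi_s, avg_pi_s, q_s, delta_w, delta_l):
    """Win or Learn Fast: use the small step only when the current policy beats the average."""
    current = sum(pi_s[a] * q_s[a] for a in pi_s)
    average = sum(avg_pi_s[a] * q_s[a] for a in pi_s)
    return delta_w if current > average else delta_l

# Usage: pick the step size, then apply the PHC update sketched earlier with it.
pi_s, avg_pi_s = {"H": 0.9, "T": 0.1}, {"H": 0.5, "T": 0.5}
q_s = {"H": 0.2, "T": 0.4}
delta = wolf_delta(pi_s, avg_pi_s, q_s, delta_w=0.01, delta_l=0.04)
```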

September 15. Theoretical analysis: Analysis of the replicator dynamics

University of Waterloo Page 27 Replicator Dynamics – Simplification Case. Best-response dynamics for rock-paper-scissors: a circular shift from one agent's policy to the other's. (Plot: average reward.)

University of Waterloo Page 28 A winning strategy against PHC. If winning: play probability 1 on the current preferred action, in order to maximize rewards while winning. If losing: play a deceiving policy until we are ready to take advantage of the opponent again. (Plot axes: probability we play heads vs. probability the opponent plays heads.)

University of Waterloo Page 29 Ideally we'd like to see this: (trajectory plot with winning and losing phases)

University of Waterloo Page 30 Ideally we'd like to see this: (trajectory plot with winning and losing phases)

University of Waterloo Page 31 Convergence dynamics of strategies. Iterated gradient ascent again performs a myopic adaptation to the other players' current strategies: it either converges to a Nash fixed point on the boundary (with at least one pure strategy) or produces limit cycles. The idea is to vary the learning rate so as to remain optimal while satisfying both properties.

September 15. Results

University of Waterloo Page 33 Experimental testbeds. Matrix games: matching pennies, three-player matching pennies, rock-paper-scissors. Stochastic games: gridworld, soccer.
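For concreteness, matching pennies is the standard zero-sum matrix game below (payoffs listed as row, column); its only Nash equilibrium is the mixed strategy (1/2, 1/2) for both players, which is why convergence of the learned strategies is the interesting quantity in these experiments.

```latex
\begin{array}{c|cc}
 & H & T \\ \hline
H & 1,\,-1 & -1,\,1 \\
T & -1,\,1 & 1,\,-1
\end{array}
```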

University of Waterloo Page 34 Matching pennies

University of Waterloo Page 35 Rock-paper-scissors: PHC

University of Waterloo Page 36 Rock-paper-scissors: WoLF PHC

University of Waterloo Page 37 Summary and Conclusion. A criterion for multi-agent learning algorithms: rationality and convergence. A simple Q-learning algorithm that can play mixed strategies. The WoLF PHC (Win or Learn Fast Policy Hill Climbing) algorithm, which satisfies rationality and convergence.

University of Waterloo Page 38 Disadvantages. The analysis covers only two-player, two-action games: pseudoconvergence. No guarantee of avoiding exploitation, i.e., that the learner cannot be deceptively exploited by another agent: Chang and Kaelbling (2001) demonstrated that the best-response learner PHC (Bowling & Veloso, 2002) could be exploited by a particular dynamic strategy.

University of Waterloo Page 39 Pseudoconvergence

University of Waterloo Page 40 Future Work by the Authors. Exploring learning outside of self-play: whether WoLF techniques can be exploited by a malicious (not rational) “learner”. Scaling to large problems: combining single-agent scaling solutions (function approximators and parameterized policies) with the concepts of a variable learning rate and WoLF. Online learning. Other algorithms by the authors: GIGA-WoLF, for normal-form games.

University of Waterloo Page 41 Discussion / Open Questions
- Investigating other evaluation criteria: no-regret criteria, negative non-convergence regret (NNR), fast reaction (tracking) [Jensen], performance (maximum time for reaching a desired performance level).
- Incorporating more algorithms into testing: deeper comparison with both simpler and more complex algorithms (e.g. AWESOME [Conitzer and Sandholm 2003]).
- Classification of situations (games) by the values of the delta and alpha variables: what values are good in what situations?
- Extending the work to more players.
- Online learning and the exploration policy in stochastic games (trade-off).
- Currently the formalism is presented in a two-dimensional state space: is a (geometrical?) extension of the formal model possible?
- What makes Minimax-Q irrational?
- Application of WoLF to multi-agent evolutionary algorithms (e.g. to control the mutation rate) or to the learning of neural networks (e.g. to determine a winner neuron)?
- Connection with control theory and the learning of Complex Adaptive Systems: manifold-adaptive learning?

September 15. Questions. Thank you.