
Nash Q-Learning for General-Sum Stochastic Games (Hu & Wellman). March 6th, 2006, CS286r. Presented by Ilan Lobel.

Outline
– Stochastic Games and Markov Perfect Equilibria
– Bellman's Operator as a Contraction Mapping
– Stochastic Approximation of a Contraction Mapping
– Application to Zero-Sum Markov Games
– Minimax-Q Learning
– Theory of Nash-Q Learning
– Empirical Testing of Nash-Q Learning

How do we model games that evolve over time? Stochastic Games!
Current Game = State
Ingredients:
– Agents (N)
– States (S)
– Payoffs (R)
– Transition Probabilities (P)
– Discount Factor (δ)
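As a quick reference for this notation, here is a minimal sketch of how the ingredients could be collected in code; the class and field names are my own illustration, not part of the talk or the paper:

    from dataclasses import dataclass
    from typing import Dict, List, Tuple

    JointAction = Tuple[int, ...]   # one action index per agent

    @dataclass
    class StochasticGame:
        """Illustrative container for a finite discounted stochastic game."""
        n_agents: int                                            # N
        n_states: int                                            # |S|
        n_actions: List[int]                                     # action-set size per agent
        rewards: Dict[Tuple[int, JointAction], List[float]]      # R: (state, joint action) -> payoff per agent
        transitions: Dict[Tuple[int, JointAction], List[float]]  # P: (state, joint action) -> distribution over next states
        delta: float                                             # discount factor δ, 0 < δ < 1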

Example of a Stochastic Game
[Figure: stage-game payoff matrices, with row actions A, B and column actions C, D (and C, D, E), shown on the slide; the matrices are not fully recoverable from the transcript.]
– Move with 30% probability when (B,D) is played.
– Move with 50% probability when (A,C) or (A,D) is played.
– δ = 0.9

Markov Game is a Generalization of…
– Repeated Games: add states.
– MDPs: add agents.

Markov Perfect Equilibrium (MPE)
A strategy maps states into randomized actions: π_i : S → Δ(A).
No agent has an incentive to unilaterally change her policy.
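The equilibrium condition itself was displayed as a formula; a standard way to state it (my notation, not copied from the slide) is:

    v_i(s; \pi_i^{*}, \pi_{-i}^{*}) \;\ge\; v_i(s; \pi_i, \pi_{-i}^{*})
    \quad \text{for every agent } i,\ \text{every state } s,\ \text{and every policy } \pi_i,

where v_i(s; π) is agent i's expected discounted payoff from state s under the joint policy π.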

Cons & Pros of MPEs
Cons:
– Can't implement everything described by the Folk Theorems (i.e., no trigger strategies).
Pros:
– MPEs always exist in finite Markov Games (Fink, 1964).
– Easier to "search for".

Learning in Stochastic Games
Learning is especially important in Markov Games because MPEs are hard to compute.
Do we know:
– Our own payoffs?
– Others' rewards?
– Transition probabilities?
– Others' strategies?

Learning in Stochastic Games
Adapted from Reinforcement Learning:
– Minimax-Q Learning (zero-sum games)
– Nash-Q Learning
– CE-Q Learning

Zero-Sum Stochastic Games
Nice properties:
– All equilibria have the same value.
– Any equilibrium strategy of player 1 against any equilibrium strategy of player 2 produces an MPE.
– There is a Bellman-type equation.

Bellman's Equation in DP
Bellman Operator: T
Bellman's Equation, rewritten as a fixed-point equation in T:
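The two formulas were images on the slide; a standard form consistent with the notation of the talk (my reconstruction) is:

    (T V)(s) \;=\; \max_{a} \Bigl[ r(s, a) + \delta \sum_{s'} P(s' \mid s, a)\, V(s') \Bigr],
    \qquad
    V^{*} \;=\; T V^{*}.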

Contraction Mapping
The Bellman Operator has the contraction property (the inequality was shown as a formula on the slide).
Bellman's Equation is a direct consequence of the contraction.
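A standard statement of the property, in the sup norm and with discount factor δ (my notation, not from the slide):

    \| T V - T V' \|_{\infty} \;\le\; \delta\, \| V - V' \|_{\infty} \quad \text{for all } V, V'.

By the Banach fixed-point theorem, T therefore has a unique fixed point, which is exactly Bellman's Equation V* = T V*.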

The Shapley Operator for Zero-Sum Stochastic Games
The Shapley Operator is a contraction mapping (Shapley, 1953).
Hence, it also has a fixed point, which corresponds to an MPE.
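The operator itself appeared as a formula; one standard way to write it for a two-player zero-sum game (my reconstruction) is:

    (T V)(s) \;=\; \mathrm{val}\Bigl[\, r(s, a^{1}, a^{2}) + \delta \sum_{s'} P(s' \mid s, a^{1}, a^{2})\, V(s') \,\Bigr]_{a^{1} \in A^{1},\, a^{2} \in A^{2}},

where val[·] denotes the minimax value of the matrix game whose (a¹, a²) entry is the bracketed expression.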

Value Iteration for Zero-Sum Stochastic Games
Repeatedly applying the Shapley Operator converges to its fixed point, as a direct consequence of the contraction.
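A minimal Python sketch of this iteration, under my own naming conventions; `matrix_game_value` stands in for any routine returning the minimax value of a matrix game (one possible implementation is sketched later, with the slide on the minimax operator):

    import numpy as np

    def shapley_value_iteration(R, P, delta, matrix_game_value, n_iter=1000):
        """Value iteration with the Shapley operator for a two-player
        zero-sum stochastic game.

        R[s]          : (m x n) payoff matrix for player 1 in state s
        P[s][a1][a2]  : probability vector over next states
        delta         : discount factor
        """
        n_states = len(R)
        V = np.zeros(n_states)
        for _ in range(n_iter):
            V_new = np.empty(n_states)
            for s in range(n_states):
                m, n = R[s].shape
                # Auxiliary matrix game: current reward + discounted continuation value.
                M = np.array([[R[s][a1, a2] + delta * np.dot(P[s][a1][a2], V)
                               for a2 in range(n)] for a1 in range(m)])
                V_new[s] = matrix_game_value(M)
            V = V_new
        return V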

Q-Learning
Another consequence of a contraction mapping:
– Q-Learning converges!
Q-Learning can be described as an approximation of value iteration:
– Value iteration with noise.
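For reference, the standard single-agent Q-learning update (the formula is not in the transcript; this is the usual textbook form) is:

    Q_{k+1}(s_k, a_k) \;=\; (1 - \alpha_k)\, Q_k(s_k, a_k) \;+\; \alpha_k \Bigl[ r_k + \delta \max_{a'} Q_k(s_{k+1}, a') \Bigr].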

Q-Learning Convergence
Q-Learning is called a Stochastic Iterative Approximation of Bellman's operator:
– Learning rate of 1/t.
– Noise is zero-mean and has bounded variance.
It converges if all state-action pairs are visited infinitely often. (Neuro-Dynamic Programming, Bertsekas & Tsitsiklis)

Minimax-Q Learning Algorithm for Zero-Sum Stochastic Games
Initialize Q_0(s, a^1, a^2) for all states and actions.
Update rule: see the reconstruction below.
Player 1 then chooses action u^1 in the next stage s_{k+1}.
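The update rule itself was a formula on the slide; a standard statement of the minimax-Q update (my reconstruction, following Littman) is:

    Q_{k+1}(s_k, a^{1}_k, a^{2}_k) \;=\; (1 - \alpha_k)\, Q_k(s_k, a^{1}_k, a^{2}_k)
      \;+\; \alpha_k \Bigl[ r^{1}_k + \delta \max_{\pi \in \Delta(A^{1})} \min_{a^{2}} \sum_{a^{1}} \pi(a^{1})\, Q_k(s_{k+1}, a^{1}, a^{2}) \Bigr].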

Minimax-Q Learning
It is a Stochastic Iterative Approximation of the Shapley Operator.
It converges to a Nash Equilibrium if all state-action-action triplets are visited infinitely often. (Littman, 1996)

Can we extend it to General-Sum Stochastic Games?
Yes & No.
Nash-Q Learning is such an extension.
However, it has much worse computational and theoretical properties.

Nash-Q Learning Algorithm
Initialize Q_0^j(s, a^1, a^2) for all states and actions, and for every agent j.
– You must simulate everyone's Q-factors.
Update rule: see the reconstruction below.
Choose the randomized action generated by the Nash operator.
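Again the update rule was displayed as a formula; in the spirit of Hu & Wellman's notation it takes roughly the following form (my reconstruction):

    Q^{j}_{k+1}(s_k, a^{1}_k, \dots, a^{n}_k) \;=\; (1 - \alpha_k)\, Q^{j}_k(s_k, a^{1}_k, \dots, a^{n}_k)
      \;+\; \alpha_k \Bigl[ r^{j}_k + \delta\, \mathrm{Nash}_j\bigl( Q^{1}_k(s_{k+1}), \dots, Q^{n}_k(s_{k+1}) \bigr) \Bigr],

where Nash_j(·) is agent j's payoff in a selected Nash equilibrium of the stage game whose payoffs are the current Q-factors at state s_{k+1}.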

The Nash Operator and the Principle of Optimality
The Nash Operator finds the Nash equilibrium of a stage game.
Find the Nash equilibrium of the stage game whose payoffs are the Q-factors: the current reward plus the payoffs for the rest of the Markov Game.

The Nash Operator
Unknown complexity even for 2 players.
In comparison, the minimax operator can be solved in polynomial time (there is a linear programming formulation).
For convergence, all players must break ties in favor of the same Nash Equilibrium.
Why not go model-based if computation is so expensive?
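To make the comparison concrete, here is a small sketch (my own, using scipy) of the linear program that computes the minimax value of a zero-sum matrix game; it can serve as the `matrix_game_value` routine assumed in the value-iteration sketch above:

    import numpy as np
    from scipy.optimize import linprog

    def matrix_game_value(M):
        """Minimax value of the zero-sum matrix game M (row player maximizes).

        Solves  max_{x in simplex} min_j x^T M[:, j]  as the LP
        maximize v  subject to  M^T x >= v * 1,  sum(x) = 1,  x >= 0.
        """
        m, n = M.shape
        c = np.zeros(m + 1)
        c[-1] = -1.0                                   # linprog minimizes, so minimize -v
        A_ub = np.hstack([-M.T, np.ones((n, 1))])      # v - x^T M[:, j] <= 0 for every column j
        b_ub = np.zeros(n)
        A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])   # probabilities sum to one
        b_eq = np.array([1.0])
        bounds = [(0, None)] * m + [(None, None)]      # x >= 0, v unrestricted
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
        return res.x[-1]

For example, matrix_game_value(np.array([[1.0, -1.0], [-1.0, 1.0]])) returns 0, the value of matching pennies.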

Convergence Results
– If every stage game encountered during learning has a global optimum, Nash-Q converges.
– If every stage game encountered during learning has a saddle point, Nash-Q converges.
Both of these are VERY strong assumptions.

Convergence Result Analysis
– The global optimum assumption implies full cooperation between agents.
– The saddle point assumption implies no cooperation between agents.
Are these equivalent to DP Q-Learning and minimax-Q Learning, respectively?

Empirical Testing: The Grid-world
[Figure: Grid World 1, with some of its Nash equilibria drawn on the grid; the figure itself is not recoverable from the transcript.]

Empirical Testing: Nash Equilibria
[Figure: Grid World 2, showing all of its Nash equilibria, annotated with the labels (97%) and (3%); the figure itself is not recoverable from the transcript.]

Empirical Performance
In very small and simple games, Nash-Q Learning often converged even though the theory did not predict it would.
In particular, Nash-Q did better than expected when all Nash Equilibria have the same value.

Conclusions
Nash-Q is a nice step forward:
– It can be used for any Markov Game.
– It uses the Principle of Optimality in a smart way.
But there is still a long way to go:
– Convergence results are weak.
– There are no computational complexity results.