Slide 1: Title
Reinforcement Learning Presentation
Markov Games as a Framework for Multi-agent Reinforcement Learning (Michael L. Littman)
Presented by Jinzhong Niu, March 30, 2004
Slide 2: Overview
- MDPs can describe only single-agent environments; a new mathematical framework is needed to support multi-agent reinforcement learning: Markov games.
- A single step in this direction is explored here: two-player zero-sum Markov games.
Slide 3: Definitions
- Markov decision process (MDP)
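The definition itself was a figure on the slide; the standard formulation used in the paper is

$$ \text{MDP} = \langle S, A, T, R \rangle, \qquad T : S \times A \to PD(S), \qquad R : S \times A \to \mathbb{R}, $$

where the agent seeks to maximize the expected discounted sum of rewards under a discount factor $\gamma \in [0, 1)$.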
Slide 4: Definitions (cont.)
- Markov game (MG)
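This definition was also a figure; in the paper's notation, a Markov game adds an action set per agent:

$$ \text{MG} = \langle S, A_1, \ldots, A_k, T, R_1, \ldots, R_k \rangle, \qquad T : S \times A_1 \times \cdots \times A_k \to PD(S), $$

with one reward function $R_i : S \times A_1 \times \cdots \times A_k \to \mathbb{R}$ per agent.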
Slide 5: Definitions (cont.)
- Two-player zero-sum Markov game (2P-MG)
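In the two-player zero-sum case the paper writes $A$ for the agent's action set and $O$ for the opponent's, and a single reward function suffices because the payoffs sum to zero:

$$ R_{\text{agent}}(s, a, o) = -R_{\text{opponent}}(s, a, o) = R(s, a, o), $$

so the agent maximizes $R$ while the opponent minimizes it.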
Slide 6: Is the 2P-MG Expressive Enough?
- Yes, although the zero-sum restriction precludes cooperation.
- It generalizes MDPs (when |O| = 1): the opponent's single constant behavior may be viewed as part of the environment.
- It generalizes matrix games (when |S| = 1): the environment holds no information, and rewards are decided entirely by the actions.
Slide 7: Matrix Games
- Example: "rock, paper, scissors"
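With actions ordered (rock, paper, scissors) for both players, the row player's payoff matrix is

$$ R = \begin{pmatrix} 0 & -1 & 1 \\ 1 & 0 & -1 \\ -1 & 1 & 0 \end{pmatrix}, $$

where each entry is the row player's reward and the column player receives its negative.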
Slide 8: What exactly does "optimality" mean?
- MDP: a stationary, deterministic, and undominated optimal policy always exists.
- MG: a policy's performance depends on the opponent's policy, so policies cannot be evaluated without context.
- Game theory offers a new definition of optimality: the policy that performs best in its worst case, compared with the alternatives.
- At least one optimal policy exists, and it may or may not be deterministic, because the agent is uncertain of its opponent's move (in rock-paper-scissors, for example, any deterministic choice can be exploited).
Slide 9: Finding the Optimal Policy - Matrix Games
- The agent's minimum expected reward should be as large as possible.
- Let V denote that minimum value; then consider how to maximize it.
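In symbols, the maximin criterion is

$$ V = \max_{\pi \in PD(A)} \min_{o \in O} \sum_{a \in A} R(a, o)\, \pi_a, $$

which can be solved by linear programming. A minimal sketch in Python (my illustration using scipy; the paper does not prescribe a particular LP solver), applied to the rock-paper-scissors matrix above:

```python
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(R):
    """Return (maximin policy, value V) for a payoff matrix R[a, o]."""
    n_a, n_o = R.shape
    # Variables x = (pi_1, ..., pi_{n_a}, V); maximize V by minimizing -V.
    c = np.zeros(n_a + 1)
    c[-1] = -1.0
    # For every opponent action o: V - sum_a pi_a * R[a, o] <= 0.
    A_ub = np.hstack([-R.T, np.ones((n_o, 1))])
    # The policy must be a probability distribution.
    A_eq = np.hstack([np.ones((1, n_a)), np.zeros((1, 1))])
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n_o),
                  A_eq=A_eq, b_eq=np.array([1.0]),
                  bounds=[(0, 1)] * n_a + [(None, None)])
    return res.x[:n_a], res.x[-1]

# Rock-paper-scissors from the previous slide: the optimal policy is uniform
# and the game value is 0.
R = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])
pi, V = solve_matrix_game(R)
print(pi, V)   # ~ [1/3 1/3 1/3], ~ 0.0
```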
Slide 10: Finding the Optimal Policy - MDPs
- Value of a state
- Quality of a state-action pair
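The slide's formulas were figures; in the paper's notation they are

$$ V(s) = \max_{a \in A} Q(s, a), \qquad Q(s, a) = R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V(s'). $$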
Slide 11: Finding the Optimal Policy - 2P-MGs
- Value of a state
- Quality of a state-action-opponent-action triple
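The corresponding formulas from the paper replace the max over actions with a maximin over mixed policies:

$$ V(s) = \max_{\pi \in PD(A)} \min_{o \in O} \sum_{a \in A} Q(s, a, o)\, \pi_a, $$
$$ Q(s, a, o) = R(s, a, o) + \gamma \sum_{s' \in S} T(s, a, o, s')\, V(s'). $$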
Slide 12: Learning Optimal Policies
- Q-learning
- Minimax-Q learning
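The two update rules differ only in how the value of the next state is computed:

$$ \text{Q-learning:} \quad Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') \right], $$
$$ \text{minimax-Q:} \quad Q(s, a, o) \leftarrow (1 - \alpha)\, Q(s, a, o) + \alpha \left[ r + \gamma V(s') \right], $$

where $V(s')$ is the maximin value of the matrix game $Q(s', \cdot, \cdot)$.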
Slide 13: The Minimax-Q Algorithm
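The algorithm itself was shown as a figure; below is a minimal Python sketch of one learning step, assuming the LP solver from the matrix-game slide is available as solve_matrix_game (function and variable names are mine, not the paper's):

```python
import numpy as np

def minimax_q_step(Q, V, pi, s, a, o, r, s_next, alpha, gamma=0.9):
    """One minimax-Q update after observing (s, a, o, r, s_next).

    Q has shape [n_states, n_actions, n_opponent_actions]; V and pi hold the
    current state values and mixed policies. solve_matrix_game is assumed to
    return (maximin policy, value) for a payoff matrix, e.g. the LP sketch above.
    """
    # Temporal-difference update, with V(s') in place of max_a' Q(s', a').
    Q[s, a, o] = (1 - alpha) * Q[s, a, o] + alpha * (r + gamma * V[s_next])
    # Re-solve the matrix game at s to refresh the policy and state value.
    pi[s], V[s] = solve_matrix_game(Q[s])
```

In the paper, alpha decays toward zero over training, and actions are drawn from pi[s] with some exploration probability.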
Slide 14: Experiment - The Problem
- Soccer
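The domain appeared as a figure. As described in the paper, soccer is played on a 4x5 grid by two players with actions N, S, E, W, and stand; both choose simultaneously, the moves are executed in random order, and moving into the opponent's square fails and hands over the ball. A minimal sketch of the movement rule (names and structure are my own):

```python
# Grid actions as (row, col) offsets; "stand" leaves the player in place.
ACTIONS = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1), "stand": (0, 0)}
ROWS, COLS = 4, 5

def move(pos, action, other_pos):
    """Return the new position; off-board and collision moves leave it unchanged."""
    dr, dc = ACTIONS[action]
    r, c = pos[0] + dr, pos[1] + dc
    if not (0 <= r < ROWS and 0 <= c < COLS) or (r, c) == other_pos:
        return pos
    return (r, c)
```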
Slide 15: Experiment - Training
- Four agents trained through 10^6 steps:
- minimax-Q learning vs. a random opponent (MR) and vs. itself (MM)
- Q-learning vs. a random opponent (QR) and vs. itself (QQ)
Slide 16: Experiment - Testing
- Test 1: QR > MR?
- Test 2: QR << QQ?
- Test 3: QR, QQ - 100% losers?
Slide 17: Contributions
- A solution to two-player zero-sum Markov games via a modified Q-learning method in which minimax takes the place of max.
- Minimax can also be used in single-agent environments to avoid risky behavior.
Slide 18: Future Work
- Possible performance improvements for minimax-Q: the linear program solved at every step incurs substantial computational cost.
- Iterative methods could yield approximate minimax solutions much faster, which may be accurate enough in practice (see the sketch after this slide).
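As one concrete possibility (my illustration; the slide does not name a specific method), a multiplicative-weights scheme approximates the maximin policy without solving an LP:

```python
import numpy as np

def approx_matrix_game(R, iters=2000, eta=0.05):
    """Approximate the maximin policy of R[a, o] by multiplicative weights.

    Each round the opponent best-responds to the current mixed policy, and the
    row player's action weights are nudged toward higher payoffs; the averaged
    policy converges to an approximate maximin solution.
    """
    n_a = R.shape[0]
    w = np.ones(n_a)
    avg = np.zeros(n_a)
    for _ in range(iters):
        pi = w / w.sum()
        o = np.argmin(pi @ R)           # opponent's best response
        w *= np.exp(eta * R[:, o])      # reweight toward better actions
        w /= w.max()                    # keep weights numerically bounded
        avg += pi
    pi = avg / iters
    return pi, float((pi @ R).min())
```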
Slide 19: Discussion
- The paper states that training was not sufficient for MR and MM to attain the optimal policy. How soon would they be able to do so?
- It is claimed that MR and MM should break even against even the strongest opponent. Why?
- After training and before testing, the agents' policies are fixed. What if learning were left enabled instead? We could then examine how the agents adapt over the long run, e.g., how their winning rates change.
- What is a "slow enough exponentially weighted average"?