Exponential Moving Average Q-Learning Algorithm, by Mostafa D. Awheda and Howard M. Schwartz. Presented at the 2013 IEEE Symposium Series on Computational Intelligence.


Exponential Moving Average Q-Learning Algorithm. By Mostafa D. Awheda and Howard M. Schwartz. Presented at the 2013 IEEE Symposium Series on Computational Intelligence, 16-19 April 2013, Grand Copthorne Waterfront Hotel, Singapore.

Outline: Objective. The matrix games. The stochastic games. The multi-agent learning algorithms: PGA-APP (Policy Gradient Ascent with Approximate Policy Prediction), WPL (Weighted Policy Learner), and WoLF-PHC (Win-or-Learn-Fast Policy Hill-Climbing). The EMA Q-learning algorithm. Simulation results.

Objective: To develop a multi-agent learning algorithm that converges to the Nash equilibrium in a wide variety of games, and whose convergence performance is better than that of the state-of-the-art multi-agent reinforcement learning (MARL) algorithms.

The Matrix Games. Matching pennies game: Nash equilibrium (0.5, 0.5). Dilemma game: Nash equilibrium (0, 1).
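The payoff tables on this slide are images and are not reproduced in the transcript. As a quick reference, the sketch below uses the standard matching-pennies payoffs (the exact entries used on the slide are an assumption) and checks that each player is indifferent between its actions at the mixed equilibrium (0.5, 0.5):

```python
import numpy as np

# Standard zero-sum matching-pennies payoffs for the row player
# (the column player receives the negative of each entry).
A = np.array([[ 1, -1],
              [-1,  1]])

# At the mixed Nash equilibrium (0.5, 0.5) every pure action earns the
# same expected payoff, so neither player can gain by deviating.
mix = np.array([0.5, 0.5])
print(A @ mix)        # -> [0. 0.]: row player's payoff for each row vs a mixing opponent
print(-(A.T @ mix))   # -> [0. 0.]: column player's payoff for each column vs a mixing opponent
```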

More Matrix Games. Shapley's game: Nash equilibrium (1/3, 1/3, 1/3). Biased game: Nash equilibria (0.15, 0.85) and (0.85, 0.15).

More Matrix Games. Three-player matching pennies game: Player 1 gets a reward of one for matching Player 2; Player 2 gets a reward of one for matching Player 3; Player 3 gets a reward of one for not matching Player 1; otherwise, the players get zero reward.
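Encoded directly from the rules on this slide (assuming heads and tails are coded as 0 and 1; the function name is purely illustrative), the joint reward rule is:

```python
def three_player_matching_pennies(a1, a2, a3):
    """Rewards for the three-player matching pennies game (actions are 0 or 1).

    Player 1 is rewarded for matching Player 2, Player 2 for matching
    Player 3, and Player 3 for NOT matching Player 1; otherwise zero.
    """
    r1 = 1 if a1 == a2 else 0
    r2 = 1 if a2 == a3 else 0
    r3 = 1 if a3 != a1 else 0
    return r1, r2, r3

print(three_player_matching_pennies(0, 0, 1))  # -> (1, 0, 1)
```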

The Stochastic Games.

Note that we are only concerned with the first move of both players from the initial state. Therefore, the figures shown later represent the probabilities of the players' actions at the initial state.


Gradient Ascent Methods. Gradient Ascent with Policy Prediction (Zhang and Lesser, 2010).

The Policy Gradient Ascent with Approximate Policy Prediction (PGA-APP) algorithm combines the basic gradient-ascent algorithm with the idea of policy prediction. The key idea is that a player adjusts its strategy in response to the forecasted strategies of the other players, instead of their current ones. PGA-APP only requires an agent to observe its own reward when choosing a given action. It uses Q-learning to learn the expected value of each action in each state, and uses these values to estimate the partial derivative of the expected payoff with respect to the current strategy.

Policy Gradient Ascent with Approximate Policy Prediction (PGA-APP) Algorithm. The slide presents the PGA-APP update equations, in which one term represents the estimated policy gradient.
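The update equations referred to above are slide images. The following is only a sketch of a PGA-APP-style step, assuming the commonly cited form of the algorithm: the gradient of the expected value with respect to each action probability is estimated from the learned Q-values, a prediction term proportional to gamma*|delta|*pi(s,a) is subtracted, and the result is projected back onto the probability simplex. All names, and the crude clipping-based projection, are illustrative rather than the paper's exact listing.

```python
import numpy as np

def pga_app_policy_update(pi, q, eta=0.01, gamma=3.0):
    """One PGA-APP-style policy step for a single state.

    pi    : current mixed strategy over actions (1-D array summing to 1)
    q     : learned Q-values for the same actions
    eta   : policy learning rate
    gamma : policy-prediction weight
    """
    value = pi @ q                       # expected value of the current strategy
    delta_hat = np.where(pi < 1.0,
                         (q - value) / (1.0 - pi + 1e-8),  # gradient estimate
                         q - np.partition(q, -2)[-2])      # degenerate pure-strategy case
    delta = delta_hat - gamma * np.abs(delta_hat) * pi     # policy-prediction term
    pi = pi + eta * delta
    # crude projection back to a valid probability distribution
    pi = np.clip(pi, 1e-6, 1.0)
    return pi / pi.sum()
```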

Weighted Policy Learner (WPL), Abdallah and Lesser (2008). The WPL algorithm was shown to converge to Nash equilibrium in benchmark two-player, two-action games as well as in some larger games. In contrast, the algorithm we propose uses the exponential moving average mechanism as the basis for updating the policy of the learning agent. Our proposed algorithm also enables agents to converge to Nash equilibrium in a variety of matrix and stochastic games, and it outperforms state-of-the-art MARL algorithms in terms of both convergence to Nash equilibrium and speed of convergence.

Weighted Policy Learner (WPL) Algorithm.
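The WPL update itself appears on the slide as an image. As a hedged sketch of the usual description of WPL (names illustrative, projection simplified): each action's estimated gradient is weighted by pi(s,a) when the gradient is negative and by (1 - pi(s,a)) when it is positive, which slows the policy down near the boundary of the simplex.

```python
import numpy as np

def wpl_policy_update(pi, q, eta=0.01):
    """One WPL-style policy step for a single state.

    The gradient of each action is estimated as Q(s,a) minus the expected
    value of the current strategy; negative gradients are weighted by
    pi(s,a) and positive gradients by (1 - pi(s,a)).
    """
    grad = q - pi @ q
    weight = np.where(grad > 0, 1.0 - pi, pi)
    pi = pi + eta * grad * weight
    # simple projection back to a valid probability distribution
    pi = np.clip(pi, 1e-6, 1.0)
    return pi / pi.sum()
```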

Win-or-Learn-Fast Policy Hill-Climbing (WoLF-PHC). This is a gradient-based learning method with a variable learning rate that depends on whether the agent is winning or losing.

Win-or-Learn-Fast Policy Hill-Climbing (WoLF-PHC) Algorithm.
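The listing on this slide is an image. The sketch below follows the widely known description of WoLF-PHC (Bowling and Veloso): hill-climb towards the greedy action of the Q-table, taking a small step delta_win when the current strategy is doing at least as well as the running average strategy ("winning") and a larger step delta_lose otherwise. The function signature and the step bookkeeping are assumptions for illustration, not the exact listing from the slide.

```python
import numpy as np

def wolf_phc_policy_update(pi, pi_avg, q, delta_win=0.01, delta_lose=0.04):
    """One WoLF-PHC-style policy step for a single state.

    pi      : current mixed strategy
    pi_avg  : running average of past strategies for the same state
    q       : learned Q-values for the actions
    The agent is 'winning' if its current strategy earns at least the
    expected value of its average strategy, and then it learns slowly.
    """
    winning = pi @ q >= pi_avg @ q
    delta = delta_win if winning else delta_lose

    n = len(pi)
    greedy = np.argmax(q)
    step = np.full(n, -delta / (n - 1))   # move probability away from the other actions
    step[greedy] = delta                  # ...and towards the greedy action
    pi = np.clip(pi + step, 0.0, 1.0)
    return pi / pi.sum()
```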

Exponential Moving Average Q-learning (EMA Q-learning) Algorithm.

Exponential Moving Average Q-learning (EMA Q-learning) Algorithm. The proposed EMA Q-learning algorithm uses the exponential moving average mechanism as the basis for updating the policy of the learning agent. The algorithm enables the learning agents to converge to Nash equilibrium in a variety of matrix and stochastic games, and it outperforms state-of-the-art MARL algorithms in terms of both convergence to Nash equilibrium and speed of convergence.

Exponential Moving Average Q-learning (EMA Q-learning) Algorithm (cont.).
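The EMA Q-learning update equations on these two slides are images and are not reproduced in the transcript. Purely as an illustration of the kind of update described (a constant-gain exponential moving average applied to the policy, on top of a standard Q-learning value update), here is a rough sketch; the choice of the greedy action of the Q-table as the EMA target, the constant gain written as eta*k, and all names are assumptions and not the authors' exact algorithm.

```python
import numpy as np

def ema_q_learning_step(q, pi, s, a, r, s_next, theta=0.1, disc=0.95, eta=0.01, k=3.0):
    """One illustrative EMA-Q-learning-style step for a tabular agent.

    q[s]  : Q-values of state s; pi[s] : mixed strategy at state s.
    theta : Q-learning rate; disc : discount factor;
    eta*k : constant gain of the exponential moving average policy update.
    """
    # standard Q-learning value update
    q[s][a] += theta * (r + disc * np.max(q[s_next]) - q[s][a])

    # EMA policy update: move the strategy a fixed fraction towards the greedy action
    target = np.zeros_like(pi[s])
    target[np.argmax(q[s])] = 1.0
    pi[s] = (1.0 - eta * k) * pi[s] + eta * k * target
    return q, pi
```

With a form like this, the policy moves a fixed fraction eta*k of the way towards its target at every step, which is how the constant gain k comes to control the convergence speed discussed later in the grid-game results.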

EMA Q-Learning Algorithm: matching pennies game, Nash equilibrium (0.5, 0.5).

EMA Q-Learning. When the optimal policy π*(s,a) is a pure strategy, the EMA mechanism in the EMA Q-learning algorithm forces the strategy iterate π(s,a) of the learning agent to converge to the optimal policy π*(s,a). On the other hand, when the optimal policy π*(s,a) is a mixed strategy, the EMA mechanism makes the strategy iterate π(s,a) oscillate around the optimal policy π*(s,a); as time approaches infinity, the strategy iterate π(s,a) converges to π*(s,a).
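To make the pure-strategy claim above concrete, consider the simplified case in which the EMA update repeatedly mixes the current strategy with the same fixed target π* using a constant step ηk:

```latex
% EMA recursion towards a fixed target policy \pi^* with constant step \eta k
\[
  \pi_{t+1}(s,a) = (1-\eta k)\,\pi_t(s,a) + \eta k\,\pi^*(s,a)
  \quad\Longrightarrow\quad
  \pi_t(s,a) - \pi^*(s,a) = (1-\eta k)^{t}\bigl(\pi_0(s,a) - \pi^*(s,a)\bigr).
\]
```

For 0 < ηk < 1 the error decays geometrically, so the iterate settles on the target; when the target itself keeps changing (the mixed-strategy case), the same contraction keeps the iterate close to, and oscillating around, π*(s,a), consistent with the slide above.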

Simulation and Results: Shapley's game, Nash equilibrium (1/3, 1/3, 1/3).

Shapley's Game.

Shapley's Game.

Shapley's Game Results. These figures show the probability of choosing Player 1's actions in Shapley's game while learning with the EMA Q-learning, PGA-APP, and WPL algorithms. As the figures show, the players' strategies successfully converge to the Nash equilibrium of this game under all three algorithms. Notes: (1) The GIGA-WoLF algorithm does not converge to the Nash equilibrium in Shapley's game. (2) The WoLF-IGA algorithm does not converge to the Nash equilibrium in Shapley's game. (3) IGA and GIGA were proved to converge in games with pure Nash equilibria; however, both algorithms fail to converge in games with mixed Nash equilibria.

Simulation and Results: dilemma game, Nash equilibrium (0, 1).

Results of the Dilemma Game. These figures show the probability of choosing the second action for both players in the dilemma game when learning with the EMA Q-learning, PGA-APP, and WPL algorithms. As the figures show, the players' strategies successfully converge to the Nash equilibrium of this game under all three algorithms.

Simulation and Results: three-player matching pennies game, Nash equilibrium (0.5, 0.5).

Three-Player Matching Pennies Game. These figures show the probability of choosing the first action for the three players in the three-player matching pennies game while learning with the EMA Q-learning, PGA-APP, and WPL algorithms. As the figures show, the players' strategies successfully converge to the Nash equilibrium of this game under all three algorithms. It is worth noting that the WPL algorithm successfully converges to the Nash equilibrium in the three-player matching pennies game, although it was reported to diverge in this game in (Zhang & Lesser, 2010).

Simulation and Results: biased game, Nash equilibria (0.15, 0.85) and (0.85, 0.15).

Biased Game Results. These figures show the probability of choosing the first action for both players in the biased game while learning with the EMA Q-learning, PGA-APP, and WPL algorithms. Recall that the biased game has two mixed Nash equilibria, (0.15, 0.85) and (0.85, 0.15). The figures show that both the PGA-APP and WPL algorithms fail to converge to a Nash equilibrium in this game; only the EMA Q-learning algorithm succeeds in converging to a Nash equilibrium.

Simulation and Results.

Simulation and Results.

Results of Grid Game 1. Fig. (a) shows the probability of selecting the action North by Player 1 at the initial state when learning with the EMA Q-learning algorithm for different values of the constant gain k. It shows that Player 1's speed of convergence to the optimal action (North) increases as the value of k increases: the probability of selecting North requires about 80 episodes to converge to one when k = 5 and about 320 episodes when k = 3, whereas for k = 1 many more episodes are still required. Fig. (b) and Fig. (c) show the probabilities of taking the action North at the initial state by both players when learning with the EMA Q-learning, PGA-APP, and WPL algorithms. They show that these probabilities converge to the Nash equilibria (converge to one) under the EMA Q-learning algorithm, whereas the PGA-APP and WPL algorithms fail to make the players' strategies converge to the Nash equilibria. These figures show that the EMA Q-learning algorithm outperforms the PGA-APP and WPL algorithms in terms of convergence to Nash equilibria, and that it can converge within a small number of episodes by adjusting the value of the constant gain k. This gives the EMA Q-learning algorithm an empirical advantage over the PGA-APP and WPL algorithms.

Simulation and Results.

Results of Grid Game 2. Fig. (a) shows the probability of selecting the action North by Player 1 when learning with the EMA Q-learning, PGA-APP, and WPL algorithms. It illustrates that this probability successfully converges to one (the Nash equilibrium) when Player 1 learns with the EMA Q-learning algorithm, whereas the PGA-APP and WPL algorithms fail to make Player 1 choose the action North with a probability of one. Fig. (b) shows the probability of selecting the action West by Player 2 when learning with the same three algorithms. As can be seen, this probability successfully converges to one (the Nash equilibrium) when Player 2 learns with the EMA Q-learning algorithm, while the PGA-APP and WPL algorithms fail to make Player 2 choose the action West with a probability of one. These figures show that the EMA Q-learning algorithm outperforms the PGA-APP and WPL algorithms in terms of convergence to Nash equilibria, which gives it an empirical advantage over those algorithms.

Conclusion. A new multi-agent policy iteration learning algorithm is proposed in this work. It uses the exponential moving average (EMA) estimation technique to update the players' strategies, and it is applied to a variety of matrix and stochastic games. The results show that the EMA Q-learning algorithm outperforms the PGA-APP and WPL algorithms in terms of convergence to Nash equilibria.