R-Max, by R. Brafman and M. Tennenholtz. Presented by Daniel Rasmussen.


• Exploration versus exploitation
  ▪ Should we follow our best known path for a guaranteed reward, or take an unknown path for a chance at a greater reward?
• Explicit policies do not work well in multiagent settings
• R-Max uses an implicit exploration/exploitation policy
  ▪ How does it do this?
  ▪ Why should we believe it works?

• Background
• R-Max algorithm
• Proof of near-optimality
• Variations on the algorithm
• Conclusions

• A stochastic game is a set of (normal-form) games
• Each game has a set of possible actions and rewards associated with those actions, as usual
• We add a probabilistic transition function that maps the current game and the players' actions onto the next game
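A minimal sketch of how such a model might be represented in code (the class names and the step interface are illustrative assumptions, not the paper's notation):

```python
import random
from dataclasses import dataclass

@dataclass
class StageGame:
    """One normal-form stage game of the stochastic game."""
    rewards: dict      # (agent action, adversary action) -> (agent reward, adversary reward)
    transitions: dict  # (agent action, adversary action) -> {next game id: probability}

@dataclass
class StochasticGame:
    games: dict        # game id -> StageGame
    current: int = 0   # id of the stage game currently being played

    def step(self, a, o, rng=None):
        """Play joint action (a, o): return the reward pair and sample the next game."""
        rng = rng or random.Random()
        game = self.games[self.current]
        reward = game.rewards[(a, o)]
        dist = game.transitions[(a, o)]
        self.current = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        return reward, self.current
```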

[Figure: an example stochastic game consisting of six 2×2 stage games, each shown as a matrix of (agent reward, adversary reward) pairs]

• History: the set of all past states, the actions taken in those states, and the rewards received (plus the current state)
  ▪ An element of (S × A² × R)^t × S
• Policy: a mapping from the set of histories to the set of possible actions in the current state
  ▪ P: (S × A² × R)^t × S → A, for all t ≥ 0
• Value of a policy: the expected average reward if we follow the policy for T steps
  ▪ Denoted U(s, p_a, p_o, T)
  ▪ U(p_a) = min_{p_o} min_s lim_{T→∞} U(s, p_a, p_o, T)
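For a fixed opponent policy p_o and start state s, U(s, p_a, p_o, T) can be estimated empirically by averaging rollouts; U(p_a) then takes the worst case over opponents and start states. A rough Monte Carlo sketch reusing the StochasticGame class sketched above (function and parameter names are assumptions):

```python
import random

def average_reward(env, policy_a, policy_o, s0, T, rng=None, episodes=200):
    """Monte Carlo estimate of U(s0, p_a, p_o, T): the expected average reward
    over T steps when the agent follows policy_a and the adversary follows
    policy_o, starting from stage game s0."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(episodes):
        env.current = s0
        history, ep_reward = [], 0.0
        for _ in range(T):
            g = env.current
            a = policy_a(history, g)   # policies map (history, current game) -> action
            o = policy_o(history, g)
            (r_agent, _), _ = env.step(a, o, rng)
            history.append((g, a, o, r_agent))
            ep_reward += r_agent
        total += ep_reward / T
    return total / episodes
```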

• ε-return mixing time: how many steps we need to take before the expected average reward is within ε of the stable average
  ▪ The minimum T such that U(s, p_a, p_o, T) > U(p_a) − ε for all s and p_o
• Optimal policy: the policy with the highest U(p_a)
  ▪ We will always parameterize this in ε and T (i.e. the policy with the highest U(p_a) whose ε-return mixing time is T)

• The agent always knows what state it is in
• After each stage the agent knows what actions were taken and what rewards were received
• We know the maximum possible reward, R_max
• We know the ε-return mixing time of any policy

• When faced with the choice between a known and unknown reward, always try the unknown reward

• Maintain an internal model of the stochastic game
• Calculate an optimal policy according to the model and carry it out
• Update the model based on observations
• Calculate a new optimal policy and repeat

• Input:
  ▪ N: number of games
  ▪ k: number of actions in each game
  ▪ ε: the error bound
  ▪ δ: the probability of failure
  ▪ R_max: the maximum reward value
  ▪ T: the ε-return mixing time of an optimal policy

• Initializing the internal model:
  ▪ Create states {G_1, ..., G_N} to represent the stage games in the stochastic game
  ▪ Create a fictitious game G_0
  ▪ Initialize all rewards to (R_max, 0)
  ▪ Set all transition functions to point to G_0
  ▪ Associate a boolean known/unknown variable with each entry in each game, initialized to unknown
  ▪ Associate a list of states reached with each entry, which is initially empty
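A hedged sketch of this initialization, using the inputs listed on the previous slide (Entry, InternalModel, and init_model are illustrative names, not from the paper):

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    """One entry (joint action) of a stage game in the internal model."""
    reward: tuple                                    # (agent reward, adversary reward)
    known: bool = False
    next_counts: dict = field(default_factory=dict)  # observed next game -> visit count
    count: int = 0                                   # total visits to this entry

@dataclass
class InternalModel:
    n_games: int
    agent_actions: list
    adversary_actions: list
    entries: dict = field(default_factory=dict)      # (game, a, o) -> Entry
    transition: dict = field(default_factory=dict)   # (game, a, o) -> {next game: prob}

def init_model(N, agent_actions, adversary_actions, r_max):
    """Initial R-Max model: real games G_1..G_N plus the fictitious game G_0.
    Every entry starts unknown, optimistically pays (R_max, 0), and
    transitions deterministically to G_0."""
    model = InternalModel(N, agent_actions, adversary_actions)
    for g in range(N + 1):                           # index 0 is the fictitious G_0
        for a in agent_actions:
            for o in adversary_actions:
                model.entries[(g, a, o)] = Entry(reward=(r_max, 0.0))
                model.transition[(g, a, o)] = {0: 1.0}
    return model
```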

[Figure: the same six stage games as before, now labeled G_1 through G_6]

[Figure: the initial internal model. Every entry has reward (R_max, 0), is marked unknown with an empty list of reached states, and its transition points to the fictitious game G_0]

• Repeat:
  ▪ Compute an optimal policy for T steps based on the current internal model
  ▪ Execute that policy for T steps
  ▪ After each step:
    - If an entry was visited for the first time, update its rewards based on the observations
    - Update the list of states reached from that entry
    - If the list of states reached now contains c + 1 elements:
      - Mark that entry as known
      - Update the transition function
      - Compute a new policy
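A hedged sketch of this loop, continuing the InternalModel/Entry sketch above. The plan helper does finite-horizon planning on the model; as a simplification it picks a pure maximin action at each stage game rather than solving each matrix game for an optimal mixed policy as the paper does. adversary stands for the opponent's policy, which we only observe:

```python
def plan(model, T):
    """Finite-horizon planning on the internal model (pure-maximin simplification)."""
    games = range(model.n_games + 1)          # includes the fictitious G_0
    V = {g: 0.0 for g in games}               # value with 0 steps to go
    policy = {}                               # (steps to go, game) -> agent action
    for t in range(1, T + 1):
        V_next = {}
        for g in games:
            best_a, best_val = None, float("-inf")
            for a in model.agent_actions:
                worst = min(
                    model.entries[(g, a, o)].reward[0]
                    + sum(p * V[g2] for g2, p in model.transition[(g, a, o)].items())
                    for o in model.adversary_actions
                )
                if worst > best_val:
                    best_a, best_val = a, worst
            V_next[g] = best_val
            policy[(t, g)] = best_a
        V = V_next
    return policy

def rmax_iteration(model, env, adversary, T, c, rng):
    """One pass of the loop above: plan for T steps on the current model, execute,
    and update the model from what was observed. An entry is marked known once
    it has been sampled c + 1 times."""
    policy = plan(model, T)
    for steps_to_go in range(T, 0, -1):
        g = env.current
        a = policy[(steps_to_go, g)]
        o = adversary(g)                       # the opponent's choice, only observed
        (r_a, r_o), g_next = env.step(a, o, rng)
        entry = model.entries[(g, a, o)]
        if entry.count == 0:                   # first visit: record the observed payoffs
            entry.reward = (r_a, r_o)
        entry.next_counts[g_next] = entry.next_counts.get(g_next, 0) + 1
        entry.count += 1
        if not entry.known and entry.count >= c + 1:
            entry.known = True                 # enough samples to trust the estimates
            model.transition[(g, a, o)] = {
                g2: n / entry.count for g2, n in entry.next_counts.items()
            }
            policy = plan(model, T)            # replan with the updated model
```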

[Figure: after the first step, the entry played in G_1 shows its observed reward (2, 2) and its list of reached states is {(G_4, 1)}; every other entry is still unknown with reward (R_max, 0)]

[Figure: the entry played in G_4 now shows its observed reward (3, 3) with reached-state list {(G_1, 1)}]

[Figure: the G_1 entry has been visited a second time; its reached-state list grows to {(G_4, 1), (G_2, 1)}]

[Figure: with c + 1 samples (here 2), the G_1 entry is marked known and its transition function is updated to the empirical estimate: probability 0.5 to G_4 and 0.5 to G_2]

• Goal: prove that if the agent follows the R-Max algorithm for T steps, the average reward will be within ε of the optimum
• Outline of proof:
  ▪ After T steps the agent will have either obtained near-optimum average reward or learned something new
  ▪ There are a polynomial number of parameters to learn, therefore the agent can completely learn its model in polynomial time
  ▪ The adversary can block learning, but if it does so then the agent will obtain near-optimum reward
  ▪ Either way the agent wins!

• Lemma 1: if the agent's model is a sufficiently close approximation of the true game, then an optimal policy in the model will be near-optimal in the game
  ▪ We guarantee that the model is sufficiently close by waiting until we have c + 1 samples before marking an entry known

• Lemma 2: the difference between the expected reward based on the model and the actual reward is less than the exploration probability times R_max
  ▪ |V_model − V_actual| < α · R_max, where α is the probability of reaching an unknown entry
  ▪ So a large |V_model − V_actual| implies a large exploration probability
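Rearranged (a hedged restatement of the slide's bound, writing α for the exploration probability, i.e. the probability that a T-step run reaches an unknown entry):

```latex
\lvert V_{\text{model}} - V_{\text{actual}} \rvert \;<\; \alpha \, R_{\max}
\qquad\Longrightarrow\qquad
\alpha \;>\; \frac{\lvert V_{\text{model}} - V_{\text{actual}} \rvert}{R_{\max}}
```

So whenever the model's value estimate is far from the true value, the current policy must be reaching unknown entries with correspondingly high probability, which is the implicit exploration that the next slide combines with Lemma 1.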

• Combining Lemma 1 and 2:
  ▪ In unknown states the agent has an unrealistically high expectation of reward (R_max), so Lemma 2 tells us that the probability of exploration is high
  ▪ In known states, Lemma 1 tells us that we will obtain near-optimal reward
  ▪ We will always be in either a known or an unknown state, therefore we will always either explore with high probability or obtain near-optimal reward
  ▪ If we explore for long enough then (almost) all states will be known, and we are guaranteed near-optimal reward

• Remove the assumption that we know the ε-return mixing time of an optimal policy, T
• Remove the assumption that we know R_max
• A simplified version for repeated games

• R-Max is a model-based reinforcement learning algorithm that is guaranteed to converge to near-optimal average reward in polynomial time
• It works on zero-sum stochastic games, MDPs, and repeated games
• The authors provide a formal justification for the optimism-under-uncertainty heuristic
• They guarantee that the agent either obtains near-optimal reward or learns efficiently

• For complicated games, N, k, and T are all likely to be large, so even polynomial time may not be computationally feasible in practice
• The analysis does not consider how the agent's behaviour might affect the adversary's