MEAN FIELD FOR MARKOV DECISION PROCESSES: FROM DISCRETE TO CONTINUOUS OPTIMIZATION
Jean-Yves Le Boudec, Nicolas Gast, Bruno Gaujal
July 26, 2011



Contents
1. Mean Field Interaction Model
2. Mean Field Interaction Model with Central Control
3. Convergence and Asymptotically Optimal Policy

1. MEAN FIELD INTERACTION MODEL

Mean Field Interaction Model
- Time is discrete
- N objects, N large
- Object n has state X_n^N(t)
- (X_1^N(t), …, X_N^N(t)) is Markov
- Objects are observable only through their state
- "Occupancy measure" M^N(t) = distribution of object states at time t
- Example [Khouzani 2010]: M^N(t) = (S(t), I(t), R(t), D(t)) with S(t) + I(t) + R(t) + D(t) = 1, where S(t) = proportion of nodes in state 'S'
[Figure: state diagram with states S, I, R, D and transition rates βI, α, b, q]

Mean Field Interaction Model (continued)
- Same setting as above: N objects with states X_n^N(t), occupancy measure M^N(t)
- Theorem [Gast (2011)]: M^N(t) is Markov
- Such models are called "Mean Field Interaction Models" in the Performance Evaluation community [McDonald (2007), Benaïm and Le Boudec (2008)]

Intensity I(N)
- I(N) = expected number of transitions per object per time unit
- A mean field limit occurs when we rescale time by I(N), i.e. we consider X^N(t/I(N))
- I(N) = O(1): mean field limit is in discrete time [Le Boudec et al. (2007)]
- I(N) = O(1/N): mean field limit is in continuous time [Benaïm and Le Boudec (2008)]
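In LaTeX notation, the rescaling statement can be restated as follows (my restatement of the bullets above; the precise convergence claim is given later in the deck):

\[
  M^N\!\big(\lfloor t / I(N) \rfloor\big) \;\xrightarrow[N \to \infty]{}\; m(t),
\]

where the limit m(t) evolves in discrete time when I(N) = O(1) and in continuous time (an ODE for a finite state space) when I(N) = O(1/N).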

Virus Infection [Khouzani 2010]
- N nodes, homogeneous, pairwise meetings
- One interaction per time slot, I(N) = 1/N; mean field limit is an ODE
- Occupancy measure is M(t) = (S(t), I(t), R(t), D(t)) with S(t) + I(t) + R(t) + D(t) = 1, where S(t) = proportion of nodes in state 'S'
[Figures: state diagram with states S, I, R, D and rates βI, α, b, q; sample trajectories of (S+R, I) for N = 100 with q = b = 0.1, β = 0.6, compared with the mean field limit, for α = 0.1 and α = 0.7; annotation marks dead nodes]
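The slides do not spell out the ODE here, so the sketch below is only a minimal illustration. The drift terms are an assumption read off the state diagram (S→I at rate βSI, S→R at rate bS, I→R at rate qI, I→D at rate αI), not necessarily the exact Khouzani (2010) model.

# Minimal sketch: Euler integration of an assumed mean field ODE for the
# S/I/R/D example. The drift terms are an illustrative assumption, not the
# exact model from the talk.

def mean_field_ode(m, alpha, beta=0.6, b=0.1, q=0.1):
    """Drift f(m) of the occupancy measure m = (S, I, R, D) (assumed form)."""
    S, I, R, D = m
    dS = -beta * S * I - b * S
    dI = beta * S * I - q * I - alpha * I
    dR = b * S + q * I
    dD = alpha * I
    return (dS, dI, dR, dD)

def integrate(m0, alpha, T=50.0, dt=0.01):
    """Euler scheme for dm/dt = f(m) on [0, T]."""
    m, t = list(m0), 0.0
    while t < T:
        dm = mean_field_ode(m, alpha)
        m = [x + dt * dx for x, dx in zip(m, dm)]
        t += dt
    return m

# Example: start with 20% infected and compare the two values of alpha
# used in the figure.
print(integrate((0.8, 0.2, 0.0, 0.0), alpha=0.1))
print(integrate((0.8, 0.2, 0.0, 0.0), alpha=0.7))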

The Mean Field Limit
- Under very general conditions (given later), the occupancy measure converges, in law, to a deterministic process m(t), called the mean field limit
- Finite state space => the limit is described by an ODE
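Written out, the finite-state-space case gives an ODE of the generic form below (my notation, consistent with the drift construction in [Benaïm and Le Boudec (2008)]; a sketch of the shape, not the exact formula from the talk):

\[
  \frac{dm}{dt}(t) = f\big(m(t)\big), \qquad
  f(m) = \lim_{N\to\infty} \frac{1}{I(N)}\,
         \mathbb{E}\big[\, M^N(k+1) - M^N(k) \;\big|\; M^N(k) = m \,\big],
\]

i.e. the drift f is the expected change of the occupancy measure per rescaled time unit.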

Sufficient Conditions for Convergence
- [Kurtz 1970]; see also [Bordenave et al. 2008], [Graham 2000]
- Sufficient condition verifiable by inspection
- Example: I(N) = 1/N; second moment of the number of objects affected in one time slot is o(N)
- Similar result when the mean field limit is in discrete time [Le Boudec et al. 2007]
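As a worked restatement of the example condition (with W^N(t) denoting the number of objects whose state changes in time slot t; this notation is mine, not from the slides):

\[
  \sup_{t} \; \mathbb{E}\big[ W^N(t)^2 \big] = o(N) \quad \text{as } N \to \infty,
\]

which holds, for instance, when at most a bounded number of objects change state in any single slot.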

The Importance of Being Spatial
- Mobile node state = (c, t): c = 1 … 16 (position), t ∈ R+ (age of gossip)
- Time is continuous, I(N) = 1
- Occupancy measure is F_c(z, t) = proportion of nodes that are at location c and have age ≤ z at time t
- [Age of Gossip, Chaintreau et al. (2009)]
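In symbols (my notation; the slide defines F_c in words):

\[
  F_c(z, t) = \frac{1}{N}\,\#\{\, n : \text{node } n \text{ is at location } c \text{ at time } t
  \text{ and its age of gossip is} \le z \,\}.
\]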

2. MEAN FIELD INTERACTION MODEL WITH CENTRAL CONTROL

Markov Decision Process
- Central controller
- Action space A (metric, compact)
- Running reward depends on state and action
- Goal: maximize expected reward over horizon T
- Policy π selects an action at every time slot
- The optimal policy can be assumed Markovian: (X_1^N(t), …, X_N^N(t)) -> action
- The controller observes only object states => π depends on M^N(t) only
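In symbols, the control problem can be restated as follows (my formalization of the bullets above; the reward notation r and the I(N) scaling of the per-slot reward, which makes the sum comparable to an integral in the fluid regime, are assumptions for illustration):

\[
  V_\pi^N(m) = \mathbb{E}\Big[ \sum_{k=0}^{\lfloor T / I(N)\rfloor}
     r\big(M^N(k), \pi_k(M^N(k))\big)\, I(N) \;\Big|\; M^N(0) = m \Big],
  \qquad
  V_*^N(m) = \sup_\pi V_\pi^N(m).
\]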

Example
[Figure: plots for θ = 0.65, θ = 0.68, θ = 0.8]

Optimal Control
- Optimal control problem: find a policy π that achieves (or approaches) the supremum V_*^N(m) = sup_π V_π^N(m), where V_π^N(m) is the expected reward of policy π and m is the initial condition of the occupancy measure
- Can be found by iterative methods
- But: state space explosion (for m)
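For reference, the "iterative methods" are the standard finite-horizon dynamic programming recursion, sketched below in my notation (a generic shape, not the exact equation from the talk); the state space explosion comes from having to evaluate it for every reachable occupancy measure m:

\[
  V_k(m) = \sup_{a \in A} \Big\{ r(m, a)\,I(N)
    + \mathbb{E}\big[ V_{k+1}\big(M^N(k+1)\big) \;\big|\; M^N(k) = m,\; a \big] \Big\},
  \qquad V_{\lfloor T/I(N)\rfloor + 1} \equiv 0.
\]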

Can We Replace the MDP by the Mean Field Limit?
- Assume the mean field model converges to the fluid limit for every action (e.g. the mean and standard deviation of the number of transitions per time slot are O(1))
- Can we replace the MDP by optimal control of the mean field limit?

Controlled ODE
- Mean field limit is an ODE
- Control = action function α(t)
- Example: α(t) = 1 if t > t_0, else α(t) = 0
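A sketch of the controlled fluid limit with this bang-bang control, reusing the assumed mean_field_ode drift from the earlier S/I/R/D sketch (again, the drift terms are illustrative assumptions, not the exact model):

def bang_bang(t, t0):
    """Example control from the slide: alpha(t) = 1 if t > t0, else 0."""
    return 1.0 if t > t0 else 0.0

def integrate_controlled(m0, t0, T=50.0, dt=0.01):
    """Euler scheme for dm/dt = f(m, alpha(t)) with the bang-bang control."""
    m, t = list(m0), 0.0
    while t < T:
        dm = mean_field_ode(m, alpha=bang_bang(t, t0))  # drift defined in the earlier sketch
        m = [x + dt * dx for x, dx in zip(m, dm)]
        t += dt
    return m

# Example: switch the control on at t0 = 5.6.
print(integrate_controlled((0.8, 0.2, 0.0, 0.0), t0=5.6))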

Optimal Control for the Fluid Limit
- Optimal function α(t)
- Can be obtained with Pontryagin's maximum principle or the Hamilton-Jacobi-Bellman equation
[Figure: trajectories for switching times t_0 = 1, t_0 = 5.6, t_0 = 25]
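For reference, the HJB equation mentioned above takes the following generic form for the fluid problem (my notation, with running reward r and drift f; a sketch, not the exact equation from the talk):

\[
  \frac{\partial v}{\partial t}(t, m)
  + \sup_{a \in A} \Big\{ r(m, a) + \nabla_m v(t, m) \cdot f(m, a) \Big\} = 0,
  \qquad v(T, m) = 0,
\]

while Pontryagin's maximum principle characterizes the optimal α(t) through the associated adjoint (costate) equations.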

3. CONVERGENCE AND ASYMPTOTICALLY OPTIMAL POLICY

Convergence Theorem
Theorem [Gast 2011]: Under reasonable regularity and scaling assumptions,
V_*^N(m) -> v_*(m) as N -> ∞,
where V_*^N(m) is the optimal value for the system with N objects (MDP) and v_*(m) is the optimal value for the fluid limit.

Convergence Theorem (continued)
Theorem [Gast 2011]: Under reasonable regularity and scaling assumptions, V_*^N(m) -> v_*(m).
- Does this give us an asymptotically optimal policy?
- The optimal policy of the system with N objects may not converge.

Asymptotically Optimal Policy
- Let α* be an optimal policy (control function) for the mean field limit
- Define the following control for the system with N objects: at time slot k, pick the same action as the optimal fluid control would take at time t = k I(N)
- This defines a time-dependent policy; let V^N_{α*}(m) be its value function when applied to the system with N objects
Theorem [Gast 2011]: V^N_{α*}(m) - V_*^N(m) -> 0 as N -> ∞, i.e. the value of this policy approaches the optimal value for the system with N objects (MDP).
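A minimal sketch of this construction in code, assuming we are given a fluid-optimal control alpha_star(t) and the intensity I(N) (function and parameter names are mine, for illustration):

def make_policy_for_N(alpha_star, intensity):
    """Wrap a fluid-optimal control alpha_star(t) into a time-dependent policy
    for the system with N objects: at slot k, play the action the fluid
    control would take at rescaled time t = k * I(N)."""
    def policy(k, occupancy_measure):
        # The occupancy measure is ignored: the policy is open-loop in m,
        # time-dependent only through the rescaled time k * I(N).
        return alpha_star(k * intensity)
    return policy

# Example with the bang-bang fluid control from the earlier sketch,
# N = 100 and I(N) = 1/N:
N = 100
policy_N = make_policy_for_N(lambda t: bang_bang(t, t0=5.6), intensity=1.0 / N)
print(policy_N(1000, occupancy_measure=None))  # action taken at slot 1000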


Conclusions
- Optimal control on the mean field limit is justified
- A practical, asymptotically optimal policy can be derived

Questions? [Khouzani 2010]