Solving Markov Decision Processes using Policy Iteration: a complexity analysis. Romain Hollanders, UCLouvain. Joint work with Balázs Gerencsér, Jean-Charles Delvenne and Raphaël Jungers. Seminar at Loria – Inria, Nancy, February 2015
Policy Iteration to solve Markov Decision Processes. Two powerful tools for the analysis: Acyclic Unique Sink Orientations and Order-Regular matrices
How much will we pay from a given starting state? Three criteria: the total-cost criterion (cost vector, finite horizon), the average-cost criterion, and the discounted-cost criterion (with a discount factor)
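As a sketch with assumed notation (the slide's formulas are not reproduced here; take $c_t$ as the cost incurred at step $t$, $T$ the horizon and $\gamma \in (0,1)$ the discount factor), the three criteria read:

```latex
J_{\mathrm{total}} = \mathbb{E}\Big[\sum_{t=0}^{T} c_t\Big], \qquad
J_{\mathrm{avg}}   = \lim_{T\to\infty} \frac{1}{T}\,\mathbb{E}\Big[\sum_{t=0}^{T-1} c_t\Big], \qquad
J_{\mathrm{disc}}  = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t c_t\Big].
```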
Markov chains
Markov Decision Processes: a Markov chain has one action per state; an MDP has several actions per state in general
Each action has a cost and transition probabilities. Goal: find the optimal policy. Evaluate a policy using an objective function: total-cost, average-cost or discounted-cost. Proposition: an optimal policy always exists, which is what we aim for!
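Under the discounted criterion, evaluating a fixed policy reduces to a linear system (a standard formulation with assumed notation: $c^\pi$ and $P^\pi$ are the cost vector and transition matrix induced by policy $\pi$):

```latex
v^\pi = c^\pi + \gamma P^\pi v^\pi
\quad\Longrightarrow\quad
v^\pi = (I - \gamma P^\pi)^{-1} c^\pi .
```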
How do we solve a Markov Decision Process? Policy Iteration
POLICY ITERATION
POLICY ITERATION
0. Choose an initial policy.
while the policy changes:
  1. Evaluate the current policy.
  2. Improve: take the best action in each state according to the evaluation.
end while
Stop! We found the optimal policy.
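The loop above can be sketched on a toy discounted MDP (a minimal sketch: the two-state MDP, its costs and the discount factor are invented for illustration, not taken from the talk):

```python
import numpy as np

GAMMA = 0.9
# P[a][s] = distribution over next states, C[a][s] = immediate cost of action a in state s
P = np.array([[[0.8, 0.2], [0.3, 0.7]],    # action 0
              [[0.1, 0.9], [0.9, 0.1]]])   # action 1
C = np.array([[2.0, 1.0],
              [0.5, 3.0]])

def evaluate(policy):
    """Step 1: solve the linear system v = c_pi + GAMMA * P_pi v."""
    n = P.shape[2]
    P_pi = np.array([P[policy[s], s] for s in range(n)])
    c_pi = np.array([C[policy[s], s] for s in range(n)])
    return np.linalg.solve(np.eye(n) - GAMMA * P_pi, c_pi)

def improve(v):
    """Step 2: pick the best (cheapest one-step lookahead) action in each state."""
    q = C + GAMMA * P @ v          # q[a, s] = cost of playing a in s, then following v
    return np.argmin(q, axis=0)

def policy_iteration(policy):
    while True:
        v = evaluate(policy)
        new = improve(v)
        if np.array_equal(new, policy):   # no improvement anywhere: optimal
            return policy, v
        policy = new

pol, v = policy_iteration(np.array([0, 0]))
print(pol, v)
```

At termination the returned value vector satisfies the Bellman optimality equation, which is exactly the stopping test of the algorithm.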
From Markov Decision Processes (one player) to Turn-Based Stochastic Games (two players)
STRATEGY ITERATION: minimizer versus maximizer. Find the best response of one player using POLICY ITERATION against the other player's fixed strategy, then swap roles. Repeat until nothing changes.
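A minimal sketch of this alternation on a toy turn-based game (the two-state deterministic game, its costs, discount factor and state ownership are all invented for illustration):

```python
import numpy as np

GAMMA = 0.5
MIN, MAX = 0, 1
OWNER = [MIN, MAX]                  # state 0 belongs to the minimizer, state 1 to the maximizer
# ACTIONS[s] = list of (cost, next_state); transitions kept deterministic for brevity
ACTIONS = [[(1.0, 0), (3.0, 1)],
           [(0.0, 0), (2.0, 1)]]

def evaluate(policy):
    """Solve v = c + GAMMA * P v for the chain induced by a joint policy."""
    n = len(ACTIONS)
    P = np.zeros((n, n)); c = np.zeros(n)
    for s, a in enumerate(policy):
        cost, nxt = ACTIONS[s][a]
        c[s] = cost; P[s, nxt] = 1.0
    return np.linalg.solve(np.eye(n) - GAMMA * P, c)

def best_response(policy, player):
    """POLICY ITERATION for `player`, with the opponent's choices frozen."""
    policy = list(policy)
    while True:
        v = evaluate(policy)
        pick = min if player == MIN else max
        new = list(policy)
        for s in range(len(ACTIONS)):
            if OWNER[s] == player:
                new[s] = pick(range(len(ACTIONS[s])),
                              key=lambda a: ACTIONS[s][a][0] + GAMMA * v[ACTIONS[s][a][1]])
        if new == policy:
            return policy
        policy = new

def strategy_iteration():
    policy = [0, 0]
    while True:
        nxt = best_response(best_response(policy, MIN), MAX)
        if nxt == policy:               # neither player wants to deviate
            return policy, evaluate(policy)
        policy = nxt

pol, v = strategy_iteration()
print(pol, v)   # equilibrium values for this toy game: v = [2, 4]
```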
What is the complexity of Policy Iteration?
Total-cost, average-cost and discounted-cost criteria: Exponential [Friedmann ‘09, Fearnley ‘10, H. et al. ‘12]
Exponential in general! But…
Fearnley’s example is pathological
Discounted-cost criterion with a fixed discount rate: Polynomial [Ye ‘10, Hansen et al. ‘11, Scherrer ‘13]. Deterministic MDPs: Polynomial for a close variant [Post & Ye ‘12, Scherrer ‘13]. MDPs with only positive costs: ???
Let us find upper bounds for the general case!
Acyclic Unique Sink Orientation: every subcube has a unique sink, and the orientation is acyclic. Let us find the sink with POLICY ITERATION.
Let us find the sink with POLICY ITERATION: start from an initial policy; at each step, jump along the set of dimensions of the improvement edges. Here: convergence in 5 vertex evaluations; the sequence of visited vertices is the PI-sequence.
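The jump rule can be sketched on a hand-made AUSO of the 2-cube (a toy orientation invented for illustration; the encoding of the orientation by the set of outgoing edge dimensions at each vertex is an assumption of this sketch):

```python
from itertools import product

# Outgoing edge dimensions at each vertex of the square; (0, 0) is the global sink.
OUT = {(0, 0): set(),      # sink: no outgoing edge
       (1, 0): {0},        # edge toward (0, 0)
       (0, 1): {0, 1},     # source: edges toward (1, 1) and (0, 0)
       (1, 1): {1}}        # edge toward (1, 0)

def unique_sink_in_every_face():
    """Check the USO property: every face (subcube) has exactly one sink."""
    n = 2
    for fixed in product((0, 1, None), repeat=n):      # None marks a free coordinate
        face = [v for v in OUT
                if all(f is None or v[i] == f for i, f in enumerate(fixed))]
        free = {i for i, f in enumerate(fixed) if f is None}
        if sum(1 for v in face if not (OUT[v] & free)) != 1:
            return False
    return True

def policy_iteration(v):
    """Jump to the antipodal vertex of the subcube spanned by the outgoing
    (improvement) dimensions; repeat until the sink is reached."""
    seq = [v]
    while OUT[seq[-1]]:
        w = list(seq[-1])
        for i in OUT[seq[-1]]:
            w[i] ^= 1                 # flip every improving dimension at once
        seq.append(tuple(w))
    return seq

assert unique_sink_in_every_face()
print(policy_iteration((0, 1)))       # the PI-sequence: [(0, 1), (1, 0), (0, 0)]
```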
Two properties to derive an upper bound: 1. there exists a path connecting the policies of the PI-sequence; 2. …
A new upper bound: the connecting path cannot visit more than the total number of policies, therefore we cannot have too many large improvement sets in a PI-sequence.
Can we do even better?
The matrix is “Order-Regular”
How large are the largest Order-Regular matrices that we can build?
The answer of exhaustive search: ?? Conjecture (Hansen & Zwick, 2012): the answer is given by the Fibonacci numbers, whose growth rate is the golden ratio.
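For context on the conjectured growth, a quick numerical sanity check that consecutive Fibonacci ratios approach the golden ratio (plain arithmetic, not tied to the talk's data):

```python
def fib(n):
    """n-th Fibonacci number, iteratively (fib(0) = 0, fib(1) = 1)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

phi = (1 + 5 ** 0.5) / 2          # the golden ratio
print(fib(10))                    # 55
print(fib(30) / fib(29))          # close to phi: about 1.6180
```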
The answer of exhaustive search. Theorem (H. et al., 2014): exact values for small sizes (proof: a “smart” exhaustive search)
How large are the largest Order-Regular matrices that we can build?
A constructive approach
Iterate and build matrices of size
Can we do better?
Yes! We can build matrices of size
So, what do we know about Order-Regular matrices? Order-Regular matrix versus Acyclic Unique Sink Orientation
Let’s recap’ !
PART 1: Policy Iteration for Markov Decision Processes. Efficient in practice but not in the worst case. PART 2: The Acyclic Unique Sink Orientations point of view. Leads to a new upper bound. PART 3: Order-Regular matrices, towards new bounds. The Fibonacci conjecture fails.