
1 Solving Markov Decision Processes using Policy Iteration: a complexity analysis
Romain Hollanders, UCLouvain
Joint work with Balázs Gerencsér, Jean-Charles Delvenne and Raphaël Jungers
Seminar at Loria – Inria, Nancy, February 2015

2 Policy Iteration to solve Markov Decision Processes. Two powerful tools for the analysis: Acyclic Unique Sink Orientations and Order-Regular matrices.

11 Starting from a given state: how much will we pay?

12–14 How much will we pay, starting from a given state? Three criteria, for a cost vector $c$, transition matrix $P$ and horizon $T$:

Total-cost criterion: $y = \lim_{T \to \infty} \sum_{t=0}^{T-1} P^t c$

Average-cost criterion: $y = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} P^t c$

Discounted-cost criterion: $y = \lim_{T \to \infty} \sum_{t=0}^{T-1} \gamma^t P^t c$, with discount factor $0 \le \gamma < 1$
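To make the three criteria concrete, here is a minimal numerical sketch; the chain, the costs and the horizon are made-up assumptions for illustration, not numbers from the talk:

```python
import numpy as np

# Hypothetical 2-state Markov chain: P[s, s'] = transition probability,
# c[s] = cost paid when visiting state s.
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])
c = np.array([1.0, 3.0])

T, gamma = 10_000, 0.9          # long finite horizon to approximate the limits
total = np.zeros(2)             # sum_{t<T} P^t c   (diverges here: costs > 0)
discounted = np.zeros(2)        # sum_{t<T} gamma^t P^t c
Pt = np.eye(2)                  # running power P^t
for t in range(T):
    total += Pt @ c
    discounted += gamma**t * Pt @ c
    Pt = Pt @ P

average = total / T             # (1/T) sum_{t<T} P^t c, per starting state
print(average, discounted)      # expected cost from each starting state
```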

15 Markov chains

16 Markov Decision Processes: a Markov chain has one action per state; an MDP offers several actions per state in general.

18 In each state we choose an action; each action has a cost and transition probabilities. A policy is evaluated using an objective function: total-cost, average-cost or discounted-cost. Goal: find the optimal policy. Proposition: an optimal policy always exists; that is what we aim for!
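One possible encoding of such an MDP as arrays, a sketch under assumed conventions (the numbers are illustrative only):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[s, a, s'] is the probability of
# reaching state s' when playing action a in state s; c[s, a] is the
# immediate cost of playing a in s.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
c = np.array([[1.0, 4.0],
              [2.0, 0.5]])
assert np.allclose(P.sum(axis=2), 1.0)  # each (s, a) row is a distribution
```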

19 How do we solve a Markov Decision Process? Policy Iteration.

20 Policy Iteration

21–23 Policy Iteration:
0. Choose an initial policy.
While the policy keeps changing:
1. Evaluate the current policy (compute its value in every state).
2. Improve: in each state, switch to the best action according to the computed values.
When nothing changes anymore: stop! We found the optimal policy.
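A minimal sketch of the loop above for the discounted-cost criterion, reusing the P, c encoding from the earlier sketch (the function and variable names are my own, not the speaker's):

```python
import numpy as np

def policy_iteration(P, c, gamma=0.9):
    """Policy Iteration for a discounted-cost MDP given by P[s,a,s'], c[s,a]."""
    n_states = P.shape[0]
    policy = np.zeros(n_states, dtype=int)            # 0. arbitrary initial policy
    while True:
        # 1. Evaluate: solve (I - gamma * P_pi) y = c_pi for the current policy
        P_pi = P[np.arange(n_states), policy]
        c_pi = c[np.arange(n_states), policy]
        y = np.linalg.solve(np.eye(n_states) - gamma * P_pi, c_pi)
        # 2. Improve: in each state, take the action minimizing the cost-to-go
        q = c + gamma * P @ y                         # shape (n_states, n_actions)
        new_policy = q.argmin(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, y                          # nothing changed: optimal
        policy = new_policy

# e.g. with the toy MDP above: policy, y = policy_iteration(P, c, gamma=0.9)
```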

24 Markov Decision Processes

25 From one player to two: Markov Decision Processes have one player; Turn-Based Stochastic Games have two players.

27 Strategy Iteration: minimizer versus maximizer

28–30 Strategy Iteration (minimizer versus maximizer):
Fix the maximizer's strategy and find the minimizer's best response against it using Policy Iteration.
Then fix the minimizer's strategy and find the maximizer's best response against it using Policy Iteration.
Repeat until nothing changes.
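A minimal sketch of this alternation, reusing policy_iteration from the sketch above; the INF-masking encoding and all names are assumptions of mine. Fixing one player's actions turns the game into an MDP for the other player:

```python
import numpy as np

INF = 1e18  # large cost used to forbid an action in the masked MDP

def best_response(P, c, owner, fixed, player, gamma=0.9):
    """Best response of `player` when the other player's strategy is `fixed`.

    owner[s] is "min" or "max"; in states owned by the opponent, every action
    except fixed[s] gets cost INF, so Policy Iteration never keeps it.
    The maximizer is handled by negating the costs and still minimizing.
    """
    c_masked = np.array(c if player == "min" else -c, dtype=float)
    for s in range(len(owner)):
        if owner[s] != player:                       # opponent's state: freeze it
            forbidden = np.ones(c.shape[1], dtype=bool)
            forbidden[fixed[s]] = False
            c_masked[s, forbidden] = INF
    policy, _ = policy_iteration(P, c_masked, gamma)
    return policy

def strategy_iteration(P, c, owner, gamma=0.9):
    sigma = np.zeros(P.shape[0], dtype=int)          # maximizer's strategy
    max_states = [s for s in range(len(owner)) if owner[s] == "max"]
    while True:
        tau = best_response(P, c, owner, sigma, "min", gamma)   # minimizer replies
        new_sigma = best_response(P, c, owner, tau, "max", gamma)  # maximizer replies
        if all(new_sigma[s] == sigma[s] for s in max_states):
            return new_sigma, tau                    # nothing changes: done
        sigma = new_sigma
```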

31 What is the complexity of Policy Iteration?

32 Total-cost, average-cost and discounted-cost criteria: exponential [Friedmann '09, Fearnley '10]

34 Total-cost, average-cost and discounted-cost criteria: exponential [Friedmann '09, Fearnley '10, H. et al. '12]

35 Exponential in general! But…

36 Fearnley’s example is pathological

37
Discounted-cost criterion with a fixed discount rate: polynomial [Ye '10, Hansen et al. '11, Scherrer '13]
Deterministic MDPs: polynomial for a close variant [Post & Ye '12, Scherrer '13]
MDPs with only positive costs: ???

38 Let us find upper bounds for the general case!

44 Acyclic Unique Sink Orientation: every subcube has a unique sink, and the orientation is acyclic. Let us find the sink with Policy Iteration.

45 Let us find the sink with Policy Iteration: start from an initial policy.

46 At each step, consider the set of dimensions of the improvement edges at the current policy, and flip them all.

49 Convergence in 5 vertex evaluations; the sequence of policies visited is the PI-sequence.
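A toy sketch of this "flip all improvement dimensions" rule on a hand-built AUSO of the 2-cube; the orientation and all names are my own illustration, not the example from the slides:

```python
def pi_on_auso(outmap, start):
    """Run the switch-all rule: flip every improving coordinate at once."""
    v, seq = start, [start]
    while outmap[v]:                      # the sink has no improvement edges
        v = tuple(b ^ (i in outmap[v]) for i, b in enumerate(v))
        seq.append(v)
    return seq                            # the PI-sequence of visited vertices

# Hand-checked AUSO of the 2-cube, given by its improvement ("outgoing")
# dimensions per vertex; the oriented edges are (0,1)->(1,1), (1,1)->(1,0),
# (1,0)->(0,0) and (0,1)->(0,0), so the unique sink is (0,0).
auso = {(0, 0): set(), (0, 1): {0, 1}, (1, 0): {0}, (1, 1): {1}}
print(pi_on_auso(auso, start=(0, 1)))     # [(0, 1), (1, 0), (0, 0)]: sink found
```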

50 Two properties to derive an upper bound

51 Two properties to derive an upper bound:
1. There exists a path connecting the policies of the PI-sequence.
2. …

52 A new upper bound. The trivial bound is the total number of policies. We prove that a PI-sequence cannot contain too many large improvement sets, and therefore obtain a stronger bound.

53 Can we do even better?

54 The matrix collecting the policies of the PI-sequence is "Order-Regular".

71 How large are the largest Order-Regular matrices that we can build?

72 The answer of exhaustive search? Conjecture (Hansen & Zwick, 2012): the maximal number of rows of an Order-Regular matrix is given by a Fibonacci number, and hence grows like a power of the golden ratio.
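For reference (the slide's own formulas were lost in extraction), the Fibonacci numbers and the golden ratio it invokes are:

```latex
\[
  F_1 = F_2 = 1, \qquad F_n = F_{n-1} + F_{n-2} \ (n \ge 3),
  \qquad
  \varphi = \frac{1 + \sqrt{5}}{2} \approx 1.618,
  \qquad
  F_n = \Theta(\varphi^{\,n}).
\]
```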

73 The answer of exhaustive search. Theorem (H. et al., 2014): the exact maximum is established for small sizes. (Proof: a "smart" exhaustive search.)

74 How large are the largest Order-Regular matrices that we can build?

75 A constructive approach

79 Iterate and build matrices of size …

80 Can we do better?

81 Yes! We can build matrices of size …

82 So, what do we know about Order-Regular matrices? Order-Regular matrix versus Acyclic Unique Sink Orientation.

83 Let's recap!

84 Part 1: Policy Iteration for Markov Decision Processes. Efficient in practice but not in the worst case.
Part 2: The Acyclic Unique Sink Orientations point of view. Leads to a new upper bound.
Part 3: Order-Regular matrices, towards new bounds. The Fibonacci conjecture fails.

85 Solving Markov Decision Processes using Policy Iteration: a complexity analysis
Romain Hollanders, UCLouvain
Joint work with Balázs Gerencsér, Jean-Charles Delvenne and Raphaël Jungers
Seminar at Loria – Inria, Nancy, February 2015

