Solving Markov Decision Processes using Policy Iteration: a complexity analysis. Romain Hollanders, UCLouvain. Joint work with Balázs Gerencsér, Jean-Charles Delvenne and Raphaël Jungers. Seminar at Loria – Inria, Nancy, February 2015
Policy Iteration to solve Markov Decision Processes. Two powerful tools for the analysis: Acyclic Unique Sink Orientations and Order-Regular matrices
How much will we pay from a given starting state? Three criteria: the total-cost criterion (cost vector, finite horizon), the average-cost criterion, and the discounted-cost criterion (with a discount factor)
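As a sketch with assumed notation (the slide's formulas are not reproduced here; take $c_t$ as the cost incurred at step $t$, $T$ the horizon and $\gamma \in (0,1)$ the discount factor), the three criteria read:

```latex
J_{\mathrm{total}} = \mathbb{E}\Big[\sum_{t=0}^{T} c_t\Big], \qquad
J_{\mathrm{avg}}   = \lim_{T\to\infty} \frac{1}{T}\,\mathbb{E}\Big[\sum_{t=0}^{T-1} c_t\Big], \qquad
J_{\mathrm{disc}}  = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t c_t\Big].
```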
Markov chains
Markov Decision Processes: a Markov chain has one action per state; an MDP has several actions per state in general
Each action has a cost and transition probabilities. Goal: find the optimal policy. Evaluate a policy using an objective function: total-cost, average-cost or discounted-cost. Proposition: an optimal policy always exists, which is what we aim for!
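Under the discounted criterion, evaluating a fixed policy reduces to a linear system (a standard formulation with assumed notation: $c^\pi$ and $P^\pi$ are the cost vector and transition matrix induced by policy $\pi$):

```latex
v^\pi = c^\pi + \gamma P^\pi v^\pi
\quad\Longrightarrow\quad
v^\pi = (I - \gamma P^\pi)^{-1} c^\pi .
```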
How do we solve a Markov Decision Process? Policy Iteration
POLICY ITERATION
POLICY ITERATION
0. Choose an initial policy.
while the policy changes:
  1. Evaluate the current policy.
  2. Improve: take the best action in each state according to the evaluation.
end while
Stop! We found the optimal policy.
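The loop above can be sketched on a toy discounted MDP (a minimal sketch: the two-state MDP, its costs and the discount factor are invented for illustration, not taken from the talk):

```python
import numpy as np

GAMMA = 0.9
# P[a][s] = distribution over next states, C[a][s] = immediate cost of action a in state s
P = np.array([[[0.8, 0.2], [0.3, 0.7]],    # action 0
              [[0.1, 0.9], [0.9, 0.1]]])   # action 1
C = np.array([[2.0, 1.0],
              [0.5, 3.0]])

def evaluate(policy):
    """Step 1: solve the linear system v = c_pi + GAMMA * P_pi v."""
    n = P.shape[2]
    P_pi = np.array([P[policy[s], s] for s in range(n)])
    c_pi = np.array([C[policy[s], s] for s in range(n)])
    return np.linalg.solve(np.eye(n) - GAMMA * P_pi, c_pi)

def improve(v):
    """Step 2: pick the best (cheapest one-step lookahead) action in each state."""
    q = C + GAMMA * P @ v          # q[a, s] = cost of playing a in s, then following v
    return np.argmin(q, axis=0)

def policy_iteration(policy):
    while True:
        v = evaluate(policy)
        new = improve(v)
        if np.array_equal(new, policy):   # no improvement anywhere: optimal
            return policy, v
        policy = new

pol, v = policy_iteration(np.array([0, 0]))
print(pol, v)
```

At termination the returned value vector satisfies the Bellman optimality equation, which is exactly the stopping test of the algorithm.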
From Markov Decision Processes (one player) to Turn-Based Stochastic Games (two players)
STRATEGY ITERATION: minimizer versus maximizer. Find the best response of one player using POLICY ITERATION against the other player's fixed strategy, then swap roles. Repeat until nothing changes.
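A minimal sketch of this alternation on a toy turn-based game (the two-state deterministic game, its costs, discount factor and state ownership are all invented for illustration):

```python
import numpy as np

GAMMA = 0.5
MIN, MAX = 0, 1
OWNER = [MIN, MAX]                  # state 0 belongs to the minimizer, state 1 to the maximizer
# ACTIONS[s] = list of (cost, next_state); transitions kept deterministic for brevity
ACTIONS = [[(1.0, 0), (3.0, 1)],
           [(0.0, 0), (2.0, 1)]]

def evaluate(policy):
    """Solve v = c + GAMMA * P v for the chain induced by a joint policy."""
    n = len(ACTIONS)
    P = np.zeros((n, n)); c = np.zeros(n)
    for s, a in enumerate(policy):
        cost, nxt = ACTIONS[s][a]
        c[s] = cost; P[s, nxt] = 1.0
    return np.linalg.solve(np.eye(n) - GAMMA * P, c)

def best_response(policy, player):
    """POLICY ITERATION for `player`, with the opponent's choices frozen."""
    policy = list(policy)
    while True:
        v = evaluate(policy)
        pick = min if player == MIN else max
        new = list(policy)
        for s in range(len(ACTIONS)):
            if OWNER[s] == player:
                new[s] = pick(range(len(ACTIONS[s])),
                              key=lambda a: ACTIONS[s][a][0] + GAMMA * v[ACTIONS[s][a][1]])
        if new == policy:
            return policy
        policy = new

def strategy_iteration():
    policy = [0, 0]
    while True:
        nxt = best_response(best_response(policy, MIN), MAX)
        if nxt == policy:               # neither player wants to deviate
            return policy, evaluate(policy)
        policy = nxt

pol, v = strategy_iteration()
print(pol, v)   # equilibrium values for this toy game: v = [2, 4]
```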
What is the complexity of Policy Iteration?
Total-cost, average-cost and discounted-cost criteria: Exponential [Friedmann ‘09, Fearnley ‘10, H. et al. ‘12]
Exponential in general! But…
Fearnley’s example is pathological
Discounted-cost criterion with a fixed discount rate: Polynomial [Ye ‘10, Hansen et al. ‘11, Scherrer ‘13]. Deterministic MDPs: Polynomial for a close variant [Post & Ye ‘12, Scherrer ‘13]. MDPs with only positive costs: ???
Let us find upper bounds for the general case!
Acyclic Unique Sink Orientation: every subcube has a unique sink, and the orientation is acyclic. Let us find the sink with POLICY ITERATION.
Let us find the sink with POLICY ITERATION: start from an initial policy; at each step, jump along the set of dimensions of the improvement edges. Here: convergence in 5 vertex evaluations; the sequence of visited vertices is the PI-sequence.
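The jump rule can be sketched on a hand-made AUSO of the 2-cube (a toy orientation invented for illustration; the encoding of the orientation by the set of outgoing edge dimensions at each vertex is an assumption of this sketch):

```python
from itertools import product

# Outgoing edge dimensions at each vertex of the square; (0, 0) is the global sink.
OUT = {(0, 0): set(),      # sink: no outgoing edge
       (1, 0): {0},        # edge toward (0, 0)
       (0, 1): {0, 1},     # source: edges toward (1, 1) and (0, 0)
       (1, 1): {1}}        # edge toward (1, 0)

def unique_sink_in_every_face():
    """Check the USO property: every face (subcube) has exactly one sink."""
    n = 2
    for fixed in product((0, 1, None), repeat=n):      # None marks a free coordinate
        face = [v for v in OUT
                if all(f is None or v[i] == f for i, f in enumerate(fixed))]
        free = {i for i, f in enumerate(fixed) if f is None}
        if sum(1 for v in face if not (OUT[v] & free)) != 1:
            return False
    return True

def policy_iteration(v):
    """Jump to the antipodal vertex of the subcube spanned by the outgoing
    (improvement) dimensions; repeat until the sink is reached."""
    seq = [v]
    while OUT[seq[-1]]:
        w = list(seq[-1])
        for i in OUT[seq[-1]]:
            w[i] ^= 1                 # flip every improving dimension at once
        seq.append(tuple(w))
    return seq

assert unique_sink_in_every_face()
print(policy_iteration((0, 1)))       # the PI-sequence: [(0, 1), (1, 0), (0, 0)]
```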
Two properties to derive an upper bound: 1. there exists a path connecting the policies of the PI-sequence; 2. …
A new upper bound: the connecting path cannot visit more than the total number of policies, therefore we cannot have too many large improvement sets in a PI-sequence.
Can we do even better?
The matrix is “Order-Regular”
How large are the largest Order-Regular matrices that we can build?
The answer of exhaustive search: ?? Conjecture (Hansen & Zwick, 2012): the answer is given by the Fibonacci numbers, whose growth rate is the golden ratio.
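For context on the conjectured growth, a quick numerical sanity check that consecutive Fibonacci ratios approach the golden ratio (plain arithmetic, not tied to the talk's data):

```python
def fib(n):
    """n-th Fibonacci number, iteratively (fib(0) = 0, fib(1) = 1)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

phi = (1 + 5 ** 0.5) / 2          # the golden ratio
print(fib(10))                    # 55
print(fib(30) / fib(29))          # close to phi: about 1.6180
```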
The answer of exhaustive search. Theorem (H. et al., 2014): exact values for small sizes (proof: a “smart” exhaustive search)
How large are the largest Order-Regular matrices that we can build?
A constructive approach
Iterate and build matrices of size
Can we do better?
Yes! We can build matrices of size
So, what do we know about Order-Regular matrices? Order-Regular matrix versus Acyclic Unique Sink Orientation
Let’s recap’ !
PART 1: Policy Iteration for Markov Decision Processes. Efficient in practice but not in the worst case. PART 2: The Acyclic Unique Sink Orientations point of view. Leads to a new upper bound. PART 3: Order-Regular matrices, towards new bounds. The Fibonacci conjecture fails.