A complexity analysis Solving Markov Decision Processes using Policy Iteration Romain Hollanders, UCLouvain Joint work with: Balázs Gerencsér, Jean-Charles Delvenne and Raphaël Jungers Seminar at Loria – Inria, Nancy, February 2015
Policy Iteration to solve Markov Decision Processes. Two powerful tools for the analysis: Acyclic Unique Sink Orientations and Order-Regular matrices.
How much will we pay, from a given starting state? Three criteria, with cost vector $c$ and horizon $T$:

Total-cost criterion: $v(s) = \mathbb{E}\big[\sum_{t=0}^{T-1} c(s_t)\big]$, with $s_0 = s$.
Average-cost criterion: $v(s) = \lim_{T\to\infty} \frac{1}{T}\,\mathbb{E}\big[\sum_{t=0}^{T-1} c(s_t)\big]$.
Discounted-cost criterion: $v(s) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t c(s_t)\big]$, with discount factor $0 < \gamma < 1$.
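The three criteria can be computed numerically. A minimal sketch, assuming a made-up 2-state Markov chain with transition matrix `P` and cost vector `c` (all numbers illustrative):

```python
import numpy as np

# Illustrative (made-up) 2-state Markov chain
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])   # P[i, j] = probability of going from state i to j
c = np.array([1.0, 3.0])     # c[i] = cost incurred in state i

def total_cost(P, c, T):
    """Expected total cost over a finite horizon T, one entry per starting state."""
    v = np.zeros(len(c))
    for _ in range(T):
        v = c + P @ v        # accumulate one more step of expected cost
    return v

def average_cost(P, c, T=10_000):
    """Long-run expected cost per step, approximated over a long horizon."""
    return total_cost(P, c, T) / T

def discounted_cost(P, c, gamma=0.9):
    """Expected discounted cost: the exact solution of v = c + gamma * P @ v."""
    return np.linalg.solve(np.eye(len(c)) - gamma * P, c)
```

For this chain the stationary distribution is (0.8, 0.2), so the average cost tends to 0.8·1 + 0.2·3 = 1.4 regardless of the starting state, while the total and discounted costs do depend on it.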
Markov chains
Markov Decision Processes: a Markov chain has one action per state; an MDP offers several actions per state in general.
Each action in a state comes with a cost and transition probabilities. Goal: find the optimal policy. A policy is evaluated using an objective function: total-cost, average-cost, or discounted-cost. Proposition: an optimal policy always exists.
How do we solve a Markov Decision Process? Policy Iteration.
Policy Iteration
Policy Iteration
0. Choose an initial policy.
while the policy keeps changing:
1. Evaluate the current policy.
2. Improve: switch, in each state, to the best action according to the evaluation.
end while
When no state can be improved, stop! We found the optimal policy.
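As a concrete illustration, here is a minimal Policy Iteration for the discounted-cost criterion; the 2-state, 2-action MDP (`P`, `c`, `gamma`) is a made-up instance, not one from the talk:

```python
import numpy as np

gamma = 0.9
# P[a][i, j]: probability of going from state i to state j under action a
P = [np.array([[0.8, 0.2], [0.3, 0.7]]),
     np.array([[0.1, 0.9], [0.9, 0.1]])]
# c[a][i]: cost of playing action a in state i
c = [np.array([2.0, 1.0]), np.array([0.5, 3.0])]

def evaluate(policy):
    """Step 1: solve v = c_pi + gamma * P_pi @ v for the current policy."""
    n = len(policy)
    P_pi = np.array([P[policy[i]][i] for i in range(n)])
    c_pi = np.array([c[policy[i]][i] for i in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * P_pi, c_pi)

def improve(v):
    """Step 2: take the cheapest action in each state according to v."""
    q = np.array([ca + gamma * Pa @ v for Pa, ca in zip(P, c)])  # q[a, i]
    return q.argmin(axis=0)

def policy_iteration(policy):
    while True:
        v = evaluate(policy)
        new_policy = improve(v)
        if np.array_equal(new_policy, policy):  # no improvement: optimal
            return policy, v
        policy = new_policy
```

`policy_iteration(np.array([0, 0]))` returns the optimal policy with its value vector; termination is guaranteed because each improvement strictly lowers the value and there are finitely many policies.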
Markov Decision Processes are one-player games; Turn-Based Stochastic Games generalize them to two players.
Strategy Iteration: minimizer versus maximizer. Find the minimizer's best response against the maximizer's current strategy using Policy Iteration; then find the maximizer's best response against the minimizer's current strategy using Policy Iteration. Repeat until nothing changes.
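The alternation can be sketched as follows, assuming a made-up two-state discounted game in which the minimizer owns state 0 and the maximizer owns state 1 (all numbers illustrative); each best response is a Policy Iteration restricted to the active player's states:

```python
import numpy as np

gamma = 0.9
owner = [0, 1]  # owner[i]: 0 = minimizer, 1 = maximizer controls state i
# Action 0 = "stay", action 1 = "move to the other state" (deterministic)
P = [np.eye(2),
     np.array([[0.0, 1.0], [1.0, 0.0]])]
c = [np.array([1.0, 4.0]),   # cost of staying, per state
     np.array([2.0, 1.0])]   # cost of moving, per state

def evaluate(strategy):
    """Discounted value of a joint strategy: v = c_s + gamma * P_s @ v."""
    n = len(strategy)
    P_s = np.array([P[strategy[i]][i] for i in range(n)])
    c_s = np.array([c[strategy[i]][i] for i in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * P_s, c_s)

def best_response(strategy, player):
    """Policy Iteration over the states owned by `player`, opponent fixed."""
    strategy = list(strategy)
    while True:
        v = evaluate(strategy)
        q = np.array([ca + gamma * Pa @ v for Pa, ca in zip(P, c)])  # q[a, i]
        new = list(strategy)
        for i in range(len(strategy)):
            if owner[i] == player:  # minimizer picks argmin, maximizer argmax
                new[i] = int(q[:, i].argmin() if player == 0 else q[:, i].argmax())
        if new == strategy:
            return strategy
        strategy = new

def strategy_iteration(strategy):
    while True:
        after_min = best_response(strategy, player=0)
        after_max = best_response(after_min, player=1)
        if after_max == strategy:   # repeat until nothing changes
            return strategy
        strategy = after_max
```

On this toy instance both players end up staying put: the equilibrium strategy is [0, 0] with values (10, 40).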
What is the complexity of Policy Iteration?
Policy Iteration is exponential for the total-cost, average-cost, and discounted-cost criteria [Friedmann '09, Fearnley '10, H. et al. '12].
Exponential in general! But Fearnley's example is pathological.
Some special cases behave better:
Discounted-cost criterion with a fixed discount rate: polynomial [Ye '10, Hansen et al. '11, Scherrer '13].
Deterministic MDPs: polynomial for a close variant [Post & Ye '12, Scherrer '13].
MDPs with only positive costs: ???
Let us find upper bounds for the general case!
Acyclic Unique Sink Orientation: an orientation of the edges of a hypercube in which every subcube has a unique sink and the orientation is acyclic. Let us find the sink with Policy Iteration.
Start from an initial policy, a vertex of the cube. At each step, consider the set of dimensions of the improvement edges at the current vertex: Policy Iteration jumps to the antipodal vertex of the subcube spanned by those dimensions. In the example, Policy Iteration converges in 5 vertex evaluations; the sequence of visited vertices is the PI-sequence.
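This jump rule can be sketched on the policy cube of a small MDP. Below, a made-up 2-state, 2-action discounted instance: each policy is a vertex of the square, and with two actions per state, following the improvement edge in dimension i simply flips bit i. Policy Iteration flips all improving dimensions at once, producing the PI-sequence:

```python
import numpy as np

gamma = 0.9
# Made-up 2-state, 2-action discounted MDP; its policies form a 2-cube
P = [np.array([[0.8, 0.2], [0.3, 0.7]]),
     np.array([[0.1, 0.9], [0.9, 0.1]])]
c = [np.array([2.0, 1.0]), np.array([0.5, 3.0])]

def evaluate(policy):
    """Value of a policy: solve v = c_pi + gamma * P_pi @ v."""
    n = len(policy)
    P_pi = np.array([P[policy[i]][i] for i in range(n)])
    c_pi = np.array([c[policy[i]][i] for i in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * P_pi, c_pi)

def improving_dims(policy):
    """Dimensions (states) whose improvement edge leaves the current vertex."""
    v = evaluate(policy)
    q = np.array([ca + gamma * Pa @ v for Pa, ca in zip(P, c)])  # q[a, i]
    return {i for i in range(len(policy))
            if q[:, i].min() < q[policy[i], i] - 1e-12}

def pi_sequence(policy):
    """Vertices visited by Policy Iteration until it reaches the sink."""
    seq = [tuple(policy)]
    while True:
        dims = improving_dims(seq[-1])
        if not dims:
            return seq              # no improvement edge: this is the sink
        nxt = list(seq[-1])
        for i in dims:
            nxt[i] = 1 - nxt[i]     # jump to the antipodal vertex of the
        seq.append(tuple(nxt))      # subcube spanned by the improving dims
```

On this instance the PI-sequence has length 2, well within the trivial bound of 4 policies on the 2-cube.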
Two properties to derive an upper bound
1. There exists a path connecting the policies of the PI-sequence.
2. …
A new upper bound. The total number of policies is finite, and we prove that a PI-sequence cannot contain too many policies with large improvement sets. Therefore the length of any PI-sequence, and hence the number of iterations of Policy Iteration, is bounded.
Can we do even better?
The matrix built from the PI-sequence is "Order-Regular".
How large are the largest Order-Regular matrices that we can build?
The answer of exhaustive search? Conjecture (Hansen & Zwick, 2012): the size of the largest Order-Regular matrices is given by a Fibonacci number, so it grows like a power of the golden ratio.
The answer of exhaustive search. Theorem (H. et al., 2014): the conjectured sizes are confirmed for small dimensions. (Proof: a "smart" exhaustive search.)
How large are the largest Order-Regular matrices that we can build?
A constructive approach
Iterate and build matrices of size …
Can we do better?
Yes! We can build matrices of size …
So, what do we know about Order-Regular matrices? Order-Regular matrices versus Acyclic Unique Sink Orientations.
Let's recap!
Part 1: Policy Iteration for Markov Decision Processes. Efficient in practice but not in the worst case.
Part 2: The Acyclic Unique Sink Orientations point of view. Leads to a new upper bound.
Part 3: Order-Regular matrices, towards new bounds. The Fibonacci conjecture fails.