
1 Solving Markov Decision Processes using Policy Iteration: a complexity analysis
Romain Hollanders, UCLouvain
Joint work with Balázs Gerencsér, Jean-Charles Delvenne and Raphaël Jungers
Seminar at Loria – Inria, Nancy, February 2015

2 Policy Iteration to solve Markov Decision Processes. Two powerful tools for the analysis: Acyclic Unique Sink Orientations and Order-Regular matrices.

11 Starting from a given state: how much will we pay?

12–14 How much will we pay, starting from a given state? Three criteria, for a cost vector $c$, transition matrix $P$ and horizon $T$:

Total-cost criterion: $y = \lim_{T \to \infty} \sum_{t=0}^{T-1} P^t c$

Average-cost criterion: $y = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} P^t c$

Discounted-cost criterion: $y = \lim_{T \to \infty} \sum_{t=0}^{T-1} \gamma^t P^t c$, with discount factor $0 \le \gamma < 1$
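To make the three criteria concrete, here is a minimal numerical sketch; the chain, the costs and the horizon are made-up assumptions for illustration, not numbers from the talk:

```python
import numpy as np

# Hypothetical 2-state Markov chain: P[s, s'] = transition probability,
# c[s] = cost paid when visiting state s.
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])
c = np.array([1.0, 3.0])

T, gamma = 10_000, 0.9          # long finite horizon to approximate the limits
total = np.zeros(2)             # sum_{t<T} P^t c   (diverges here: costs > 0)
discounted = np.zeros(2)        # sum_{t<T} gamma^t P^t c
Pt = np.eye(2)                  # running power P^t
for t in range(T):
    total += Pt @ c
    discounted += gamma**t * Pt @ c
    Pt = Pt @ P

average = total / T             # (1/T) sum_{t<T} P^t c, per starting state
print(average, discounted)      # expected cost from each starting state
```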

15 Markov chains

16 Markov Decision Processes: a Markov chain has one action per state; an MDP offers several actions per state in general.

18 In each state we choose an action; each action has a cost and transition probabilities. A policy is evaluated using an objective function: total-cost, average-cost or discounted-cost. Goal: find the optimal policy. Proposition: an optimal policy always exists; that is what we aim for!
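One possible encoding of such an MDP as arrays, a sketch under assumed conventions (the numbers are illustrative only):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[s, a, s'] is the probability of
# reaching state s' when playing action a in state s; c[s, a] is the
# immediate cost of playing a in s.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
c = np.array([[1.0, 4.0],
              [2.0, 0.5]])
assert np.allclose(P.sum(axis=2), 1.0)  # each (s, a) row is a distribution
```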

19 How do we solve a Markov Decision Process? Policy Iteration.

20 Policy Iteration

21–23 Policy Iteration:
0. Choose an initial policy.
While the policy keeps changing:
1. Evaluate the current policy (compute its value in every state).
2. Improve: in each state, switch to the best action according to the computed values.
When nothing changes anymore: stop! We found the optimal policy.
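A minimal sketch of the loop above for the discounted-cost criterion, reusing the P, c encoding from the earlier sketch (the function and variable names are my own, not the speaker's):

```python
import numpy as np

def policy_iteration(P, c, gamma=0.9):
    """Policy Iteration for a discounted-cost MDP given by P[s,a,s'], c[s,a]."""
    n_states = P.shape[0]
    policy = np.zeros(n_states, dtype=int)            # 0. arbitrary initial policy
    while True:
        # 1. Evaluate: solve (I - gamma * P_pi) y = c_pi for the current policy
        P_pi = P[np.arange(n_states), policy]
        c_pi = c[np.arange(n_states), policy]
        y = np.linalg.solve(np.eye(n_states) - gamma * P_pi, c_pi)
        # 2. Improve: in each state, take the action minimizing the cost-to-go
        q = c + gamma * P @ y                         # shape (n_states, n_actions)
        new_policy = q.argmin(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, y                          # nothing changed: optimal
        policy = new_policy

# e.g. with the toy MDP above: policy, y = policy_iteration(P, c, gamma=0.9)
```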

24 Markov Decision Processes

25 From one player to two: Markov Decision Processes have one player; Turn-Based Stochastic Games have two players.

27 Strategy Iteration: minimizer versus maximizer

28–30 Strategy Iteration (minimizer versus maximizer):
Fix the maximizer's strategy and find the minimizer's best response against it using Policy Iteration.
Then fix the minimizer's strategy and find the maximizer's best response against it using Policy Iteration.
Repeat until nothing changes.
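A minimal sketch of this alternation, reusing policy_iteration from the sketch above; the INF-masking encoding and all names are assumptions of mine. Fixing one player's actions turns the game into an MDP for the other player:

```python
import numpy as np

INF = 1e18  # large cost used to forbid an action in the masked MDP

def best_response(P, c, owner, fixed, player, gamma=0.9):
    """Best response of `player` when the other player's strategy is `fixed`.

    owner[s] is "min" or "max"; in states owned by the opponent, every action
    except fixed[s] gets cost INF, so Policy Iteration never keeps it.
    The maximizer is handled by negating the costs and still minimizing.
    """
    c_masked = np.array(c if player == "min" else -c, dtype=float)
    for s in range(len(owner)):
        if owner[s] != player:                       # opponent's state: freeze it
            forbidden = np.ones(c.shape[1], dtype=bool)
            forbidden[fixed[s]] = False
            c_masked[s, forbidden] = INF
    policy, _ = policy_iteration(P, c_masked, gamma)
    return policy

def strategy_iteration(P, c, owner, gamma=0.9):
    sigma = np.zeros(P.shape[0], dtype=int)          # maximizer's strategy
    max_states = [s for s in range(len(owner)) if owner[s] == "max"]
    while True:
        tau = best_response(P, c, owner, sigma, "min", gamma)   # minimizer replies
        new_sigma = best_response(P, c, owner, tau, "max", gamma)  # maximizer replies
        if all(new_sigma[s] == sigma[s] for s in max_states):
            return new_sigma, tau                    # nothing changes: done
        sigma = new_sigma
```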

31 What is the complexity of Policy Iteration?

32 Total-cost, average-cost and discounted-cost criteria: exponential [Friedmann '09, Fearnley '10]

34 Total-cost, average-cost and discounted-cost criteria: exponential [Friedmann '09, Fearnley '10, H. et al. '12]

35 Exponential in general! But…

36 Fearnley’s example is pathological

37
Discounted-cost criterion with a fixed discount rate: polynomial [Ye '10, Hansen et al. '11, Scherrer '13]
Deterministic MDPs: polynomial for a close variant [Post & Ye '12, Scherrer '13]
MDPs with only positive costs: ???

38 Let us find upper bounds for the general case!

44 Acyclic Unique Sink Orientation: every subcube has a unique sink, and the orientation is acyclic. Let us find the sink with Policy Iteration.

45 Let us find the sink with Policy Iteration: start from an initial policy.

46 At each step, consider the set of dimensions of the improvement edges at the current policy, and flip them all.

49 Convergence in 5 vertex evaluations; the sequence of policies visited is the PI-sequence.
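A toy sketch of this "flip all improvement dimensions" rule on a hand-built AUSO of the 2-cube; the orientation and all names are my own illustration, not the example from the slides:

```python
def pi_on_auso(outmap, start):
    """Run the switch-all rule: flip every improving coordinate at once."""
    v, seq = start, [start]
    while outmap[v]:                      # the sink has no improvement edges
        v = tuple(b ^ (i in outmap[v]) for i, b in enumerate(v))
        seq.append(v)
    return seq                            # the PI-sequence of visited vertices

# Hand-checked AUSO of the 2-cube, given by its improvement ("outgoing")
# dimensions per vertex; the oriented edges are (0,1)->(1,1), (1,1)->(1,0),
# (1,0)->(0,0) and (0,1)->(0,0), so the unique sink is (0,0).
auso = {(0, 0): set(), (0, 1): {0, 1}, (1, 0): {0}, (1, 1): {1}}
print(pi_on_auso(auso, start=(0, 1)))     # [(0, 1), (1, 0), (0, 0)]: sink found
```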

50 Two properties to derive an upper bound

51 Two properties to derive an upper bound:
1. There exists a path connecting the policies of the PI-sequence.
2. …

52 A new upper bound. The trivial bound is the total number of policies. We prove that a PI-sequence cannot contain too many large improvement sets, and therefore obtain a stronger bound.

53 Can we do even better?

54 The matrix collecting the policies of the PI-sequence is "Order-Regular".

71 How large are the largest Order-Regular matrices that we can build?

72 The answer of exhaustive search? Conjecture (Hansen & Zwick, 2012): the maximal number of rows of an Order-Regular matrix is given by a Fibonacci number, and hence grows like a power of the golden ratio.
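For reference (the slide's own formulas were lost in extraction), the Fibonacci numbers and the golden ratio it invokes are:

```latex
\[
  F_1 = F_2 = 1, \qquad F_n = F_{n-1} + F_{n-2} \ (n \ge 3),
  \qquad
  \varphi = \frac{1 + \sqrt{5}}{2} \approx 1.618,
  \qquad
  F_n = \Theta(\varphi^{\,n}).
\]
```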

73 The answer of exhaustive search. Theorem (H. et al., 2014): the exact maximum is established for small sizes. (Proof: a "smart" exhaustive search.)

74 How large are the largest Order-Regular matrices that we can build?

75 A constructive approach

79 Iterate and build matrices of size …

80 Can we do better?

81 Yes! We can build matrices of size …

82 So, what do we know about Order-Regular matrices? Order-Regular matrix versus Acyclic Unique Sink Orientation.

83 Let's recap!

84 Part 1: Policy Iteration for Markov Decision Processes. Efficient in practice but not in the worst case.
Part 2: The Acyclic Unique Sink Orientations point of view. Leads to a new upper bound.
Part 3: Order-Regular matrices, towards new bounds. The Fibonacci conjecture fails.

85 Solving Markov Decision Processes using Policy Iteration: a complexity analysis
Romain Hollanders, UCLouvain
Joint work with Balázs Gerencsér, Jean-Charles Delvenne and Raphaël Jungers
Seminar at Loria – Inria, Nancy, February 2015

