About the latest complexity bounds for Policy Iteration
Romain Hollanders, UCLouvain
Joint work with: Balázs Gerencsér, Jean-Charles Delvenne and Raphaël Jungers
Benelux Meeting in Systems and Control 2015

Policy Iteration to solve Markov Decision Processes
Order-Regular matrices: a powerful tool for the analysis

From a given starting state, how much will we pay in the long run? The ingredients: a cost vector and a discount factor.

Markov chains
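As a point of reference (standard notation, not shown explicitly in the slides): for a Markov chain with transition matrix $P$, cost vector $c$ and discount factor $\gamma \in [0,1)$, the expected discounted cost from each starting state is
$$ v \;=\; \sum_{t \ge 0} \gamma^t P^t c \;=\; (I - \gamma P)^{-1} c, $$
which is exactly "how much we pay in the long run".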

Markov Decision Processes: a Markov chain has one action per state; in general, an MDP allows several actions per state.

In each state we choose an action; each action comes with a cost and transition probabilities. Goal: find the optimal policy. The value of a policy = the long-term cost of the corresponding Markov chain. Proposition: an optimal policy always exists, which is exactly what we aim for!
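One standard way to make the proposition precise (the notation is ours, not taken from the slides): a policy $\pi$ picks one action per state and hence induces a Markov chain with transition matrix $P^\pi$ and cost vector $c^\pi$; its value is $v^\pi = (I - \gamma P^\pi)^{-1} c^\pi$, and the proposition states that some policy $\pi^*$ satisfies $v^{\pi^*} \le v^\pi$ componentwise for every policy $\pi$.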

How do we solve a Markov Decision Process? Policy Iteration.

POLICY ITERATION

0. Choose an initial policy.
while the policy keeps changing:
   1. Evaluate: compute the value of the current policy.
   2. Improve: take, in each state, the best action based on that value.
end while
Stop! We found the optimal policy.
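To make the loop above concrete, here is a minimal Python sketch of Policy Iteration for a discounted-cost MDP; the data layout and variable names are our own illustration, not taken from the talk.

```python
import numpy as np

def policy_iteration(P, c, gamma):
    """Policy Iteration for a discounted-cost MDP.

    P[a] is the n x n transition matrix of action a,
    c[a] is the length-n cost vector of action a,
    gamma is the discount factor in [0, 1).
    Returns a cost-minimizing policy and its value vector.
    """
    n_actions, n_states = len(P), P[0].shape[0]
    policy = np.zeros(n_states, dtype=int)          # 0. choose an initial policy

    while True:
        # 1. Evaluate: v = (I - gamma * P_pi)^{-1} c_pi for the current policy
        P_pi = np.array([P[policy[s]][s] for s in range(n_states)])
        c_pi = np.array([c[policy[s]][s] for s in range(n_states)])
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, c_pi)

        # 2. Improve: in each state, switch to the best action given v
        q = np.array([c[a] + gamma * P[a] @ v for a in range(n_actions)])
        new_policy = q.argmin(axis=0)

        if np.array_equal(new_policy, policy):      # no state can improve:
            return policy, v                        # the policy is optimal
        policy = new_policy
```

The evaluation step solves the same linear system as the value formula given earlier, and the improvement step implements "take the best action in each state based on the current value".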

Bad news: Policy Iteration has exponential complexity, at least in general [Fearnley 2010, Friedmann 2009, H. et al. 2012]. But we still aim for upper bounds…

Policy Iteration needs at most k^n iterations, the total number of policies for n states and k actions per state (no policy is ever visited twice).

Policy Iteration needs at most O(k^n / n) iterations [Mansour & Singh 1999].

Policy Iteration needs at most … iterations [H. et al. 2014], a bound that is not possible to improve using "standard" tools.

Can we do even better?

The matrix is “Order-Regular”

How large are the largest Order-Regular matrices that we can build?

The answer of exhaustive search: ?? Conjecture (Hansen & Zwick, 2012): an Order-Regular matrix with n columns has at most a Fibonacci number of rows, so the largest ones grow roughly like a power of the golden ratio.
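For context (a standard fact, not from the slides): Fibonacci numbers satisfy $F_n \approx \varphi^n / \sqrt{5}$, where $\varphi = (1+\sqrt{5})/2 \approx 1.618$ is the golden ratio, so a Fibonacci-type bound on Order-Regular matrices would translate into a bound growing roughly like $1.618^n$ on the number of Policy Iteration steps.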

The answer of exhaustive search. Theorem (H. et al., 2014): the conjecture holds for the first few values of n. (Proof: a "smart" exhaustive search.)

How large are the largest Order-Regular matrices that we can build?

A constructive approach

Iterate and build matrices of size …

Can we do better?

Yes! We can build matrices of size …

So, what do we know about Order-Regular matrices?

These are currently the best bounds for MDPs.

For papers, slides, and much more: perso.uclouvain.be/romain.hollanders/
