
Slide 1: Introduction to Planning Under Uncertainty
Michael L. Littman
Rutgers University, Department of Computer Science
Rutgers Laboratory for Real-Life Reinforcement Learning (RL³)

Slide 2: Planning?
Selecting – explicit decision making
and Executing – influences an environment
a Sequence of Actions – outcome depends on multiple steps
to Accomplish some Objective – performance measured

Slide 3: Uncertainty?
Four principal types:
– outcome uncertainty: don't know which outcome will occur (ex.: MDPs)
– effect uncertainty: don't know the possible outcomes (ex.: RL)
– state uncertainty: don't know the current state (ex.: POMDPs)
– agent uncertainty: don't know what other agents will do (ex.: games)
Classical planning: no uncertainty.
Real-world problems: one or more of the above.

Slide 4: A Model-Based Viewpoint
Types of uncertainty can be combined:
– outcome: MDPs
– + state: POMDPs
– + agent: partially observable SGs
– + effect: RL in partially observable SGs
What uncertainty does your problem have? What model most directly captures it? What algorithms are most appropriate?

Slide 5: Example Planning Problem
[Figure]

Slide 6: MDPs
Markov decision process:
– finite set of states s ∈ S
– finite set of actions a ∈ A
– transition probabilities T(s,a,s')
– rewards (or costs) R(s,a)
Objective: select actions to maximize total expected reward (often discounted by γ).

Slide 7: About MDPs
Control problems: outcome uncertainty.
Specified via a set of matrices.
Complexity: P-complete.
Bellman equations:
  V(s) = max_a [ R(s,a) + γ Σ_{s'} T(s,a,s') V(s') ]
  Q(s,a) = R(s,a) + γ Σ_{s'} T(s,a,s') max_{a'} Q(s',a')
Algorithms:
– value iteration (workhorse), V = F(V')
– policy iteration (often fast, loose analysis)
– modified policy iteration (compromise)
– linear programming…
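As a concrete illustration of value iteration as the workhorse, a minimal sketch in Python (the array layout for T and R and the stopping tolerance are assumptions, not from the talk):

import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    """T: (S, A, S) transition probabilities, R: (S, A) rewards."""
    S, A, _ = T.shape
    V = np.zeros(S)
    while True:
        # Bellman backup: Q(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') V(s')
        Q = R + gamma * T @ V              # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # values and a greedy policy
        V = V_new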

Slide 8: LP for MDP
minimize Σ_s V_s
subject to: V_s ≥ R(s,a) + γ Σ_{s'} T(s,a,s') V_{s'}  for all s, a
Polynomial time, worst case (the only algorithm known to be).
Key to extensions (constraints, approximation).
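A sketch of the same LP assembled for a generic solver (illustrative only; using scipy.optimize.linprog is an assumption, not the talk's method):

import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(T, R, gamma=0.95):
    """T: (S, A, S) transitions, R: (S, A) rewards; returns optimal V."""
    S, A, _ = T.shape
    c = np.ones(S)                          # minimize sum_s V_s
    A_ub, b_ub = [], []
    for s in range(S):
        for a in range(A):
            # V_s >= R(s,a) + gamma * sum_s' T(s,a,s') V_s'
            row = gamma * T[s, a]
            row[s] -= 1.0
            A_ub.append(row)                # row . V <= -R(s,a)
            b_ub.append(-R[s, a])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * S, method="highs")
    return res.x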

Slide 9: Propositional Representations
Instead of big matrices, more AI-like operators.
States: assignments to propositional variables.
Transitions:
– dynamic Bayesian networks
– tree-based representations
– circuits
– STRIPS operators extended for probabilities
– PPDDL 1.0…

Slide 10: PPDDL 1.0 Example
(:action drive
  :parameters (?from - location ?to - location)
  :precondition (and (at-vehicle ?from)
                     (road ?from ?to)
                     (not (has-flat-tire)))
  :effect (probabilistic
            .85 (and (at-vehicle ?to)
                     (not (at-vehicle ?from))
                     (decrease reward 1))
            .15 (and (has-flat-tire)
                     (decrease reward 1))))

Slide 11: Planning Ahead
(:action pick-up-tire
  :parameters (?loc - location)
  :precondition (and (at-vehicle ?loc)
                     (spare-at ?loc)
                     (not (vehicle-has-spare)))
  :effect (and (vehicle-has-spare)
               (decrease reward 1)))

(:action change-tire
  :precondition (has-flat-tire)
  :effect (and (when (vehicle-has-spare)
                 (and (not (has-flat-tire))
                      (not (vehicle-has-spare))
                      (decrease reward 1)))
               (when (not (vehicle-has-spare))
                 (and (not (has-flat-tire))
                      (decrease reward 100)))))

Slide 12: Implications
PPDDL 1.0: first probabilistic planning competition, part of IPC-4 at ICAPS-04 (Vancouver in June). [PLUG]
Worst-case complexity (propositional):
– polynomial horizon: PSPACE-complete
– infinite horizon: EXP-complete
– known representations are interconvertible
Worst-case complexity (relational): not sure, probably worse.

Slide 13: Propositional MDP Algorithms
Many have been proposed:
– variations of POP search
– VI: V(s) = max_a [ R(s,a) + γ Σ_{s'} T(s,a,s') V'(s') ]
  – with structured V'(s')
  – with parameterized V'(s') = w · φ(s')
– conversion to probabilistic SAT
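A sketch of one VI backup with a parameterized value function V'(s') = w · φ(s') (illustrative only; the feature matrix phi and the least-squares refit are assumptions, not from the slides):

import numpy as np

def fitted_vi_step(T, R, phi, w, gamma=0.95):
    """One approximate VI backup with linear V'(s') = w . phi(s').
    T: (S, A, S) transitions, R: (S, A) rewards, phi: (S, k) features."""
    V_prime = phi @ w                     # current approximate values
    Q = R + gamma * T @ V_prime           # Bellman backups, shape (S, A)
    targets = Q.max(axis=1)               # backed-up values at each state
    # Refit w by least squares so that phi @ w approximates the targets.
    w_new, *_ = np.linalg.lstsq(phi, targets, rcond=None)
    return w_new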

Slide 14: Reinforcement Learning
Add effect uncertainty to outcome uncertainty: actions described by unknown probabilities.
Given the opportunity to interact with the environment:
  s, a, r, s, a, r, s, a, r, …
Want to find a policy that maximizes the sum of the r's.
Don't miss out too badly while learning: exploration vs. exploitation.
Real-Life Reinforcement Learning Symposium: Fall 2004 in DC. [PLUG #2]

Slide 15: Algorithmic Approaches
Model-free (Q-learning et al.):
– iteratively estimate V, Q
– methods typically converge in the limit
Model-based:
– use experience to estimate T, R
– becomes a planning problem
– can be made a polynomial-time approximation
These methods can interact badly with function approximation for V or Q.
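A minimal tabular Q-learning sketch of the model-free approach (the ε-greedy exploration, step size, and the env.reset()/env.step() interface are illustrative assumptions):

import numpy as np

def q_learning(env, S, A, episodes=500, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Q-learning; env is assumed to expose reset() -> s and
    step(a) -> (s', r, done), with integer states and actions."""
    Q = np.zeros((S, A))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection (exploration vs. exploitation)
            a = int(rng.integers(A)) if rng.random() < eps else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # update toward r + gamma * max_a' Q(s', a')
            target = r + gamma * (0.0 if done else Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q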

Slide 16: Planning and Learning

Slide 17: Polynomial Time RL
Let M be a Markov decision process over N states.
Let P(T, ε, M) be the set of all policies that get within ε of their true return in the first T steps, and let opt(P(T, ε, M)) be the optimal asymptotic expected undiscounted return achievable in P(T, ε, M).
There exists an algorithm A, taking inputs ε, δ, N, T, and opt(P(T, ε, M)), such that after a total number of actions and computation time bounded by a polynomial in 1/ε, 1/δ, N, T, and R_max, with probability at least 1 − δ, the total undiscounted return will be at least opt(P(T, ε, M)) − ε.

Slide 18: Explicit Explore or Exploit (E³)
(Initialization) Initially, the set S of known states is empty.
(Balanced Wandering) Any time the current state is not in S, the algorithm performs balanced wandering.
(Discovery of New Known States) Any time a state i has been visited m_known times during balanced wandering, it enters the known set S and no longer participates in balanced wandering.
(Off-line Optimizations) Compute optimal policies for M_r (maximize reward, avoiding unknown states) and M_d (minimize steps to an unknown state). Execute the policy for M_r if it is within ε/2 of optimal; otherwise the policy for M_d is likely to quickly discover a state outside S.
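A schematic Python sketch of this control loop (pseudocode-level; the helpers balanced_wander, solve_exploit_mdp, solve_explore_mdp and the threshold m_known are hypothetical stand-ins, not the paper's API):

def e3_step(state, known, counts, m_known, eps, opt_return):
    """One decision of a schematic E^3 agent (hypothetical helpers)."""
    if state not in known:
        # Balanced wandering: take the action tried least often here.
        action = balanced_wander(state, counts)
        counts[state][action] += 1
        if min(counts[state].values()) >= m_known:
            known.add(state)              # promote to "known" (simplified test)
        return action
    # Off-line optimization on the empirical model of the known states.
    exploit_policy, exploit_value = solve_exploit_mdp(known)
    if exploit_value >= opt_return - eps / 2:
        return exploit_policy(state)      # exploit: already near-optimal
    explore_policy, _ = solve_explore_mdp(known)
    return explore_policy(state)          # explore: head for unknown states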

Slide 19: POMDPs
Partially observable Markov decision process: an MDP plus
– finite set of observations z ∈ Z
– observation probabilities O(s,z)
The decision maker sees observations, not states.
Outcome uncertainty + state uncertainty.
The information state is Markov.

Slide 20: Sondik's Observation
b is the belief state or information state (a vector).
V(b) = max_a [ R(b,a) + γ Σ_{b'} T(b,a,b') V'(b') ]
V(b) = max_a [ R(a)·b + γ Σ_z Pr(z | b,a) V'(b'_{a,z}) ]
Closure property: if V' is a piecewise-linear and convex function of b', then V is a piecewise-linear and convex function of b!
The maximum value is achieved by a choice of vector for each b' (a function of z). VI approach.
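A minimal belief-update sketch (the Bayes filter that produces b' from b, a, z; the array shapes are illustrative assumptions):

import numpy as np

def belief_update(b, a, z, T, O):
    """b: (S,) belief, T: (S, A, S) transitions, O: (S, Z) observation
    probabilities O(s', z). Returns (b', Pr(z | b, a))."""
    # Predict: push the belief through the transition model for action a.
    predicted = b @ T[:, a, :]            # shape (S,)
    # Correct: weight by the probability of observing z in each state.
    unnorm = predicted * O[:, z]
    prob_z = unnorm.sum()                 # Pr(z | b, a)
    return unnorm / prob_z, prob_z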

Slide 21: POMDP Value Functions
Value function (finite horizon): finite representation.
Piecewise-linear and convex (Sondik 71).
[Figure: value functions for actions a and b plotted against Pr(i_2); animation by Thrun]

Slide 22: Algorithms
If V' has n vectors, V has at most |A| n^|Z|. But most are dominated.
Witness algorithm: each needed combination is testified to by some b. Search for b's at which V(b) does not equal the one-step lookahead value, and add the corresponding combinations.
Point-based approximations: quit early; bound the complexity of V.
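A sketch of one point-based backup at a single belief b (PBVI-style; the array layout for T, O, R and the vector arithmetic are illustrative assumptions):

import numpy as np

def point_based_backup(b, alphas, T, O, R, gamma=0.95):
    """One point-based backup at belief b.
    alphas: list of (S,) alpha-vectors for V'; T: (S, A, S); O: (S, Z) with
    O[s', z] = Pr(z | s'); R: (S, A). Returns the new alpha-vector at b."""
    S, A, _ = T.shape
    Z = O.shape[1]
    best_vec, best_val = None, -np.inf
    for a in range(A):
        g_a = R[:, a].astype(float)
        for z in range(Z):
            # Back-project each alpha-vector through the (a, z) pair.
            cands = [gamma * (T[:, a, :] * O[:, z]) @ alpha for alpha in alphas]
            # Keep the candidate that looks best at this particular belief b.
            g_a += max(cands, key=lambda g: g @ b)
        if g_a @ b > best_val:
            best_vec, best_val = g_a, g_a @ b
    return best_vec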

Slide 23: POMDPs
Complexity:
– exact, polynomial horizon: PSPACE-complete
– exact, infinite horizon: incomputable!
– discounted approximation: finite
– single step of exact VI: if polynomial, then RP = NP
Witness, incremental pruning: polynomial if the intermediate Q functions stay small.
Pet peeve: it's not about the number of states!

Slide 24: RL in POMDPs
If we add effect uncertainty, harder still. Active approaches:
– model-based: EM, PSRs, instance-based
– memoryless: multistep updates are key (TD(λ > 0)); can do well
– policy search: don't bother with values; smart generate-and-test

Slide 25: Stochastic Games
Stochastic (or Markov) games: an MDP plus
– a set of n players
– finite sets of actions a_i ∈ A_i for the players
– transition probabilities T(s, a_1, …, a_n, s')
– rewards (or costs) R_i(s, a_1, …, a_n)
Objective: select actions to maximize total expected reward. But how to handle the other players?

Slide 26: Dealing with Agent Uncertainty
– Assume the agent follows a fixed strategy: can model as an MDP.
– Assume the agent is selected from a fixed population: can model as a POMDP.
– Paranoid: assume the agent minimizes reward: can model as a zero-sum game.
– Find an "equilibrium" (no incentive to change): game-theoretic approach.

Slide 27: Example: Grid Game 3
Players A and B; actions U, D, R, L, X.
No move on collision. Semiwalls (50%).
Rewards: −1 per step, −10 for a collision, +100 for the goal, 0 if back to the initial configuration.
Both players can reach the goal.
[Figure: grid game board]

Slide 28: Complexity, Algorithms
– Zero-sum, one state: equivalent to LP.
– Zero-sum, multistate: approximate via VI, but optimal values can be irrational (not exact).
– General-sum, one state (terminal): open.
– General-sum, one state (repeated): polynomial time using threats.
– General-sum, multistate: only approximations.
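For the zero-sum, one-state case, the equivalence to LP can be sketched directly: compute a maximin mixed strategy for the row player (illustrative only; the use of scipy's generic LP solver is an assumption, not the talk's method):

import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(M):
    """Maximin strategy for the row player of a zero-sum matrix game M
    (M[i, j] = row player's payoff). Returns (mixed strategy, game value)."""
    m, n = M.shape
    # Variables: x_1..x_m (mixed strategy) and v (the guaranteed value).
    c = np.zeros(m + 1)
    c[-1] = -1.0                          # maximize v  <=>  minimize -v
    # For every opponent column j: v - sum_i x_i M[i, j] <= 0
    A_ub = np.hstack([-M.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Probabilities sum to one.
    A_eq = np.array([[1.0] * m + [0.0]])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:m], res.x[-1]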

Slide 29: Beyond
RL in stochastic games:
– convergent, polynomial-time algorithms for the zero-sum case
– the objective is still debated for the general-sum case
Partially observable stochastic games: oy. Modeling, complexity, and approximation work. Relevant but tough.
RL in partially observable SGs: well, that's life, isn't it?

Slide 30: Discussion
Complexity tends to rise when handling more uncertainty.
Use a model that is appropriate for your problem: not too rich!
Very active area of research in machine learning, AI, and planning.
Let's be precise about what we are solving!

