
Planning Under Uncertainty

573 Core Topics
Agency, Problem Spaces, Search, Knowledge Representation, Reinforcement Learning, Inference, Planning, Supervised Learning; Logic-Based and Probabilistic approaches.

Administrivia
Reading for today's class: Ch 17 through 17.3
Reading for Thursday: Ch 21 (reinforcement learning)
Problem set: extension until Monday 11/10, 8am; one additional problem
Tues 11/11: no class
Thurs 11/13: midterm (in class, closed book, may bring one sheet of 8.5x11" paper)

Semantics
Syntax: a description of the legal arrangements of symbols (these define "sentences").
Semantics: what the arrangement of symbols means in the world.
(Diagram: sentences in the representation map, via semantics, to models in the world; inference operates on sentences.)

Propositional Logic: SEMANTICS
"Interpretation" (or "possible world"): an assignment of T or F to each variable; each connective is then assigned T or F accordingly. Think of a connective as a function mapping the truth values of P and Q to T or F. Specifically, truth tables:
  P   Q   P ∧ Q
  T   T   T
  T   F   F
  F   T   F
  F   F   F
Does P ∧ Q |= P ∨ Q ?

First Order Logic Syntax more complex Semantics more complex Specifically, the mappings are more complex (And the range of the mappings is more complex)

Models Depiction of one possible “real-world” model

Interpretations = Mappings: syntactic tokens → model elements
Depiction of one possible interpretation, assuming
  Constants: Richard, John
  Functions: Leg(p, l)
  Relations: On(x, y), King(p)

Interpretations = Mappings: syntactic tokens → model elements
Another interpretation, same assumptions
  Constants: Richard, John
  Functions: Leg(p, l)
  Relations: On(x, y), King(p)

Satisfiability, Validity, & Entailment
S is valid if it is true in all interpretations.
S is satisfiable if it is true in some interpretation.
S is unsatisfiable if it is false in all interpretations.
S1 entails S2 (S1 |= S2) if, in every interpretation where S1 is true, S2 is also true.

Previously, in 573…
Bayes rules! Simple proof from the definition of conditional probability:
  P(A | B) = P(B | A) P(A) / P(B)

Joint Distribution
All you need to know: it can answer any question, via inference by enumeration.
But… exponential in both time and space.
Solution: exploit conditional independence.
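
To make "inference by enumeration" concrete, here is a minimal Python sketch (not from the slides; the two-variable joint over made-up variables Rain and WetGrass is purely illustrative). It answers P(query | evidence) by summing matching rows of the explicit joint table, which is exactly why time and space blow up exponentially with the number of variables.

# Minimal sketch of inference by enumeration over an explicit joint distribution.
# The joint table below (over made-up variables Rain and WetGrass) is illustrative only.
from itertools import product

variables = ["Rain", "WetGrass"]
joint = {                      # P(Rain, WetGrass); entries sum to 1
    (True,  True):  0.20, (True,  False): 0.05,
    (False, True):  0.10, (False, False): 0.65,
}

def prob(query, evidence):
    """P(query | evidence); both are dicts {variable: value}."""
    def total(constraints):
        # Sum the joint over every world consistent with the given constraints.
        s = 0.0
        for values in product([True, False], repeat=len(variables)):
            world = dict(zip(variables, values))
            if all(world[v] == val for v, val in constraints.items()):
                s += joint[values]
        return s
    return total({**evidence, **query}) / total(evidence)

print(prob({"Rain": True}, {"WetGrass": True}))   # ≈ 0.667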

Sample Bayes Net
Nodes: Earthquake, Burglary, Radio, Alarm, Nbr1Calls, Nbr2Calls
CPTs include Pr(B=t), Pr(B=f), and Pr(A | E, B):
  e, b:    0.9  (0.1)
  e, ¬b:   0.2  (0.8)
  ¬e, b:   0.85 (0.15)
  ¬e, ¬b:  0.01 (0.99)

Given its Markov Blanket, X is Independent of All Other Nodes
MB(X) = Par(X) ∪ Childs(X) ∪ Par(Childs(X))
(Example nodes: Burglary, Earthquake, Alarm Sounded)

Inference in BNs
We generally want to compute Pr(X), or Pr(X | E) where E is (conjunctive) evidence.
The graphical independence representation yields efficient inference, organized by network shape.
Two simple algorithms: variable elimination (VE) and junction trees.
Approximate: Markov chain sampling.

Learning
Parameter estimation: Maximum Likelihood (ML), Maximum A Posteriori (MAP), Bayesian
Learning parameters for a Bayesian network
Learning structure of Bayesian networks
Naïve Bayes models
Hidden variables (later)

Parameter Estimation Summary
                               Prior     Hypothesis
Maximum Likelihood Estimate    Uniform   The most likely
Maximum A Posteriori Estimate  Any       The most likely
Bayesian Estimate              Any       Weighted combination

Parameter Estimation and Bayesian Networks
Data (columns E, B, R, A, J, M):
  T F T T F T
  F F F F F T
  F T F T T T
  F F F T T T
  F T F F F F
  ...
P(A | E, B) = ?   P(A | E, ¬B) = ?   P(A | ¬E, B) = ?   P(A | ¬E, ¬B) = ?
Prior Beta(2,3); prior + data = Beta(3,4)
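
As a hedged illustration of parameter estimation by counting (the boolean rows below are made-up stand-ins for the data table on the slide): the maximum-likelihood estimate of each CPT entry is a ratio of counts, and a Laplace/Beta prior simply adds pseudo-counts.

# Sketch: estimating P(A | E, B) from complete data by counting.
# The boolean rows are illustrative stand-ins for the table on the slide.
data = [  # each row: (E, B, R, A, J, M)
    (True,  False, True,  True,  False, True),
    (False, False, False, False, False, True),
    (False, True,  False, True,  True,  True),
    (False, False, False, True,  True,  True),
    (False, True,  False, False, False, False),
]

def estimate_p_a_given_eb(e, b, alpha=1):
    """P(A=true | E=e, B=b) with 'alpha' pseudo-counts (alpha=0 gives the ML estimate)."""
    matching = [row for row in data if row[0] == e and row[1] == b]
    a_true = sum(row[3] for row in matching)
    # Laplace smoothing: behaves as if we had seen 'alpha' extra true and false examples.
    return (a_true + alpha) / (len(matching) + 2 * alpha)

for e in (True, False):
    for b in (True, False):
        print(f"P(A | E={e}, B={b}) =", round(estimate_p_a_given_eb(e, b), 3))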

Structure Learning as Search
Local search:
  1. Start with some network structure
  2. Try to make a change (add, delete, or reverse an edge)
  3. See if the new network is any better
What should the initial state be? A uniform prior over random networks? Based on prior knowledge? An empty network?
How do we evaluate networks?

Naïve Bayes
(Network: a Class Value node with features F1, F2, F3, …, FN-1, FN as children.)
Assume that features are conditionally independent given the class variable.
Works well in practice.
Forces probabilities towards 0 and 1.

Tree Augmented Naïve Bayes (TAN) [Friedman, Geiger & Goldszmidt 1997]
(Network: a Class Value node plus features F1, F2, F3, …, FN-1, FN, with additional dependencies among the features.)
Models a limited set of dependencies.
Guaranteed to find the best structure; runs in polynomial time.

Naïve Bayes for Text
P(spam | w1 … wn)
Independence assumption?
(Network: a Spam? node with word nodes apple, dictator, Nigeria, … as children.)

Naïve Bayes for Text
Modeled as generating a bag of words for a document in a given category by repeatedly sampling with replacement from a vocabulary V = {w1, w2, … wm} based on the probabilities P(wj | ci).
Smooth probability estimates with Laplace m-estimates, assuming a uniform distribution over all words (p = 1/|V|) and m = |V|.
Equivalent to a virtual sample of seeing each word in each category exactly once.

Text Naïve Bayes Algorithm (Train)
Let V be the vocabulary of all words in the documents in D
For each category ci ∈ C:
  Let Di be the subset of documents in D in category ci
  P(ci) = |Di| / |D|
  Let Ti be the concatenation of all the documents in Di
  Let ni be the total number of word occurrences in Ti
  For each word wj ∈ V:
    Let nij be the number of occurrences of wj in Ti
    Let P(wj | ci) = (nij + 1) / (ni + |V|)

Text Naïve Bayes Algorithm (Test)
Given a test document X
Let n be the number of word occurrences in X
Return the category argmax_{ci ∈ C} P(ci) ∏_{i=1..n} P(ai | ci), where ai is the word occurring in the ith position in X
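
The train and test slides above translate almost directly into code. The sketch below is mine, not the course's reference implementation; the tiny spam/ham corpus is made up, but the estimates follow the slide's P(wj | ci) = (nij + 1) / (ni + |V|) and the classifier returns argmax over P(ci) ∏ P(ai | ci), computed in log space (see the underflow slide below).

import math
from collections import Counter

def train_nb(docs):
    """docs: list of (category, list_of_words). Returns priors, word probabilities, vocabulary."""
    vocab = {w for _, words in docs for w in words}
    categories = {c for c, _ in docs}
    priors, word_probs = {}, {}
    for c in categories:
        docs_c = [words for cat, words in docs if cat == c]
        priors[c] = len(docs_c) / len(docs)                      # P(c_i) = |D_i| / |D|
        counts = Counter(w for words in docs_c for w in words)   # n_ij
        n_c = sum(counts.values())                               # n_i
        word_probs[c] = {w: (counts[w] + 1) / (n_c + len(vocab)) for w in vocab}
    return priors, word_probs, vocab

def classify_nb(words, priors, word_probs, vocab):
    """argmax_c log P(c) + sum_i log P(a_i | c), ignoring out-of-vocabulary words."""
    def log_posterior(c):
        return math.log(priors[c]) + sum(
            math.log(word_probs[c][w]) for w in words if w in vocab)
    return max(priors, key=log_posterior)

# Tiny made-up corpus:
docs = [("spam", ["win", "money", "now"]),
        ("spam", ["money", "offer"]),
        ("ham",  ["meeting", "tomorrow", "now"])]
model = train_nb(docs)
print(classify_nb(["money", "now"], *model))   # -> "spam"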

Naïve Bayes Time Complexity
Training time: O(|D| Ld + |C| |V|), where Ld is the average length of a document in D.
  Assumes V and all Di, ni, and nij are pre-computed in O(|D| Ld) time during one pass through all of the data.
  Generally just O(|D| Ld), since usually |C| |V| < |D| Ld.
Test time: O(|C| Lt), where Lt is the average length of a test document.
Very efficient overall; linearly proportional to the time needed to just read in all the data.

Easy to Implement But… If you do… it probably won't work…

Probabilities: Important Detail!
P(spam | E1 … En) = ∏_i P(spam | Ei)   Any more potential problems here?
We are multiplying lots of small numbers: danger of underflow! (e.g. 7E-18)
Solution? Use logs and add!
  p1 * p2 = e^{log(p1) + log(p2)}
Always keep in log form.
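
A tiny, self-contained illustration of the underflow problem and the log-space fix (the probabilities are arbitrary numbers chosen to trigger underflow):

import math

probs = [1e-3] * 120                  # many small factors, e.g. per-word likelihoods

naive = 1.0
for p in probs:
    naive *= p                        # underflows to 0.0 in double precision

log_score = sum(math.log(p) for p in probs)   # stays finite

print(naive)       # 0.0  (underflow)
print(log_score)   # ≈ -828.9; compare classes by their log scores directly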

Naïve Bayes Posterior Probabilities
Classification results of naïve Bayes (i.e. the class with maximum posterior probability) are usually fairly accurate (?!?!?).
However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability estimates are not accurate: output probabilities are generally very close to 0 or 1.

573 Core Topics
Agency, Problem Spaces, Search, Knowledge Representation, Reinforcement Learning, Inference, Planning, Supervised Learning; Logic-Based and Probabilistic approaches.

Planning
(Agent/environment diagram: the agent receives percepts from the environment, asks "What action next?", and sends back actions. Setting for planning under uncertainty: static, fully observable, stochastic, instantaneous actions, full and perfect percepts.)

Models of Planning

Observability \ Uncertainty   Deterministic   Disjunctive   Probabilistic
Complete observation          Classical       Contingent    MDP
Partial observation           ???             Contingent    POMDP
No observation                ???             Conformant    POMDP

Awkward
Never any definition of what an MDP is!
Missed the probabilistic STRIPS definition; should have done that before the 2-DBN.
All the assumptions about a Markov model could be simplified and eliminated – took too long!

Defn: Markov Model
Q: set of states
π: init prob distribution
A: transition probability distribution

E.g. Predict Web Behavior
Q: set of states (pages)
π: init prob distribution (likelihood of site entry point)
A: transition probability distribution (user navigation model)
When will a visitor leave the site?

E.g. Predict Robot's Behavior
Q: set of states
π: init prob distribution
A: transition probability distribution
Will it attack Dieter?

Probability Distribution, A
Forward causality: the probability of s_t does not depend directly on values of future states.
The probability of the new state could depend on the history of states visited: Pr(s_t | s_{t-1}, s_{t-2}, …, s_0)
Markovian assumption: Pr(s_t | s_{t-1}, s_{t-2}, …, s_0) = Pr(s_t | s_{t-1})
Stationary model assumption: Pr(s_t | s_{t-1}) = Pr(s_k | s_{k-1}) for all k.

What is the cost of representing a stationary probability distribution? Can we do something better?

Representing A
Q: set of states; π: init prob distribution; A: transition probabilities. How can we represent these?
(Matrix picture: rows and columns indexed by states s0 … s6; entry p12 is the probability of transitioning from s1 to s2. What must each row sum to?)
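
One concrete answer, as a sketch: a stationary transition model over n states is an n × n row-stochastic matrix, so its cost grows as |Q|². The states and numbers below are made up.

import numpy as np

states = ["s0", "s1", "s2"]
# A[i, j] = Pr(next = states[j] | current = states[i]); each row sums to 1.
A = np.array([[0.7, 0.2, 0.1],
              [0.0, 0.5, 0.5],
              [0.3, 0.3, 0.4]])
assert np.allclose(A.sum(axis=1), 1.0)     # row-stochastic check

pi = np.array([1.0, 0.0, 0.0])             # initial distribution

def step(dist):
    """Distribution over states after one transition."""
    return dist @ A

print(step(pi))                            # [0.7, 0.2, 0.1]
print(A.size, "entries for", len(states), "states")   # storage grows as |Q|^2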

Factoring Q
Represent Q simply as a set of states? Is there internal structure?
Consider a robot domain: what is the state space?

A Factored Domain
Six Boolean variables: has_user_coffee (huc), has_robot_coffee (hrc), robot_is_wet (w), has_robot_umbrella (u), raining (r), robot_in_office (o)
How many states? 2^6 = 64

Representing  Compactly Q: set of states  init prob distribution How represent this efficiently? s0s0 s1s1 s2s2 s3s3 s4s4 s5s5 s6s6 r u hrc w With a Bayes net (of course!)

Representing A Compactly
Q: set of states; π: init prob distribution; A: transition probabilities.
How big is the matrix version of A? 64 × 64 = 4096 entries.

2-Dynamic Bayesian Network
(Two-slice diagram: variables huc, hrc, w, u, r, o at time T, and again at time T+1, with arcs from slice T into slice T+1.)
Total values required to represent the transition probability table = 36. Vs. how many required for a complete state-to-state probability table?

Dynamic Bayesian Network
Also known as a Factored Markov Model. Defined formally as:
* a set of random vars
* a BN for the initial state
* a 2-layer DBN for transitions
(Diagram: initial-state BN over huc, hrc, w, u, r, o at time 0; 2-DBN between slices T and T+1.)

STRIPS Action Schemata
Instead of defining ground actions pickup-A and pickup-B and …, define a schema:

(:operator pick-up
  :parameters ((block ?ob1))
  :precondition (and (clear ?ob1) (on-table ?ob1) (arm-empty))
  :effect (and (not (clear ?ob1)) (not (on-table ?ob1))
               (not (arm-empty)) (holding ?ob1)))

Note: STRIPS doesn't allow derived effects; you must be complete!
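
For intuition, a ground STRIPS action can be viewed as precondition, add, and delete sets of propositions. The sketch below is my own hypothetical Python encoding of pick-up applied to block A, mirroring the schema above; it is not PDDL tooling, just an illustration of applicability and state progression.

# Hypothetical encoding of the ground action pick-up(A) as sets of propositions.
pick_up_A = {
    "pre":  {"clear_A", "on_table_A", "arm_empty"},
    "add":  {"holding_A"},
    "del":  {"clear_A", "on_table_A", "arm_empty"},
}

def applicable(state, action):
    return action["pre"] <= state            # all preconditions hold in the state

def apply_action(state, action):
    assert applicable(state, action)
    return (state - action["del"]) | action["add"]

state = {"clear_A", "on_table_A", "arm_empty", "on_table_B", "clear_B"}
print(apply_action(state, pick_up_A))
# {'holding_A', 'on_table_B', 'clear_B'}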

Deterministic "STRIPS"?
pick_up_A
  preconditions: arm_empty, clear_A, on_table_A
  effects: - arm_empty, - on_table_A, + holding_A

Probabilistic STRIPS?
pick_up_A
  preconditions: arm_empty, clear_A, on_table_A
  with probability 0.9: - arm_empty, - on_table_A, + holding_A

Observability Full Observability Partial Observability No Observability

Reward/cost
Each action has an associated cost. The agent may accrue rewards at different stages. A reward may depend on:
  the current state
  the (current state, action) pair
  the (current state, action, next state) triplet
Additivity assumption: costs and rewards are additive.
Reward accumulated = R(s0) + R(s1) + R(s2) + …

Horizon
Finite: plan for t stages. Reward = R(s0) + R(s1) + R(s2) + … + R(st)
Infinite: the agent never dies. The reward R(s0) + R(s1) + R(s2) + … could be unbounded.
  Discounted reward: R(s0) + γR(s1) + γ²R(s2) + …
  Average reward: lim_{n→∞} (1/n) Σ_i R(s_i)

Goal for an MDP Find a policy which: maximizes expected discounted reward over an infinite horizon for a fully observable Markov decision process. Why shouldn’t the planner find a plan?? What is a policy??

Optimal value of a state
Define V*(s), the "value of a state", as the maximum expected discounted reward achievable from this state.
Value of the state if we force it to do action a right now, but let it act optimally later:
  Q*(a, s) = R(s) + c(a) + γ Σ_{s'∈S} Pr(s' | a, s) V*(s')
V* should satisfy the following equation:
  V*(s) = max_{a∈A} Q*(a, s) = R(s) + max_{a∈A} { c(a) + γ Σ_{s'∈S} Pr(s' | a, s) V*(s') }

Value iteration
Assign an arbitrary assignment of values to each state (or use an admissible heuristic).
Iterate over the set of states, and in each iteration improve the value function as follows:
  V_{t+1}(s) = R(s) + max_{a∈A} { c(a) + γ Σ_{s'∈S} Pr(s' | a, s) V_t(s') }    ("Bellman backup")
Stop the iteration appropriately. V_t approaches V* as t increases.
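
Here is a direct, hedged transcription of the update above into Python over a tiny made-up MDP (two states s0/s1, two actions a0/a1, invented rewards, action costs, transition probabilities, and γ = 0.9); it stops using the residue test described on the stopping-condition slide below.

GAMMA = 0.9
S = ["s0", "s1"]
A = ["a0", "a1"]
R = {"s0": 0.0, "s1": 1.0}                    # state rewards (made up)
c = {"a0": -0.1, "a1": -0.3}                  # action costs (made up)
# P[(s, a)] = {s': Pr(s' | a, s)}
P = {("s0", "a0"): {"s0": 0.8, "s1": 0.2},
     ("s0", "a1"): {"s0": 0.1, "s1": 0.9},
     ("s1", "a0"): {"s1": 1.0},
     ("s1", "a1"): {"s0": 0.5, "s1": 0.5}}

def q(s, a, V):
    # Q(a, s) = R(s) + c(a) + gamma * sum_s' Pr(s'|a,s) V(s')
    return R[s] + c[a] + GAMMA * sum(p * V[sp] for sp, p in P[(s, a)].items())

def value_iteration(epsilon=1e-6):
    V = {s: 0.0 for s in S}                                   # arbitrary initialization
    while True:
        V_new = {s: max(q(s, a, V) for a in A) for s in S}    # Bellman backup at every state
        if max(abs(V_new[s] - V[s]) for s in S) < epsilon:    # residue test
            return V_new
        V = V_new

V_star = value_iteration()
policy = {s: max(A, key=lambda a: q(s, a, V_star)) for s in S}
print(V_star, policy)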

Max Bellman Backup
(Diagram: from state s, each action a1, a2, a3 reaches successor states with values V_n; each action gives a backed-up value Q_{n+1}(s, a), and V_{n+1}(s) is the max over them.)

Stopping Condition
ε-convergence: a value function is ε-optimal if the error (residue) at every state is less than ε.
  Residue(s) = |V_{t+1}(s) − V_t(s)|
Stop when max_{s∈S} Residue(s) < ε.

Complexity of value iteration
One iteration takes O(|S|² |A|) time.
Number of iterations required: poly(|S|, |A|, 1/(1−γ)).
Overall, the algorithm is polynomial in the size of the state space, and thus exponential in the number of state variables.

Computation of optimal policy
Given the value function V*(s), for each state do a Bellman backup; the action that maximises the inner term is the optimal action.
The optimal policy is stationary (time-independent), which is intuitive for the infinite-horizon case.

Policy evaluation
Given a policy Π: S → A, find the value of each state when using this policy:
  V^Π(s) = R(s) + c(Π(s)) + γ Σ_{s'∈S} Pr(s' | Π(s), s) V^Π(s')
This is a system of linear equations involving |S| variables.
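
Since V^Π is defined by |S| linear equations, it can be computed with a single linear solve. A small numpy sketch reusing the same made-up two-state MDP quantities as the value-iteration sketch above (the fixed policy chosen here is arbitrary):

import numpy as np

GAMMA = 0.9
S = ["s0", "s1"]
R = {"s0": 0.0, "s1": 1.0}
c = {"a0": -0.1}
P = {("s0", "a0"): {"s0": 0.8, "s1": 0.2},
     ("s1", "a0"): {"s1": 1.0}}
policy = {"s0": "a0", "s1": "a0"}             # the fixed policy Pi to evaluate

# V = b + GAMMA * T V   =>   (I - GAMMA * T) V = b
idx = {s: i for i, s in enumerate(S)}
T = np.zeros((len(S), len(S)))
b = np.zeros(len(S))
for s in S:
    a = policy[s]
    b[idx[s]] = R[s] + c[a]
    for sp, p in P[(s, a)].items():
        T[idx[s], idx[sp]] = p

V_pi = np.linalg.solve(np.eye(len(S)) - GAMMA * T, b)
print(dict(zip(S, V_pi)))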

Bellman's principle of optimality
A policy Π is optimal if V^Π(s) ≥ V^Π'(s) for all policies Π' and all states s ∈ S.
Rather than finding the optimal value function, we can try to find the optimal policy directly, by doing a policy-space search.

Policy iteration
Start with any policy (Π_0). Iterate:
  Policy evaluation: for each state find V^{Π_i}(s).
  Policy improvement: for each state s, find the action a* that maximises Q^{Π_i}(a, s).
    If Q^{Π_i}(a*, s) > V^{Π_i}(s), let Π_{i+1}(s) = a*; else let Π_{i+1}(s) = Π_i(s).
Stop when Π_{i+1} = Π_i.
Converges in fewer iterations than value iteration, but the policy evaluation step is more expensive.
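
Putting the two steps together, here is a compact policy-iteration sketch over the same made-up two-state MDP (exact evaluation by linear solve, greedy improvement, stop when the policy no longer changes); a small numerical tolerance is added to the improvement test to avoid floating-point flip-flopping.

import numpy as np

GAMMA = 0.9
S = ["s0", "s1"]; A = ["a0", "a1"]
R = {"s0": 0.0, "s1": 1.0}; c = {"a0": -0.1, "a1": -0.3}
P = {("s0", "a0"): {"s0": 0.8, "s1": 0.2}, ("s0", "a1"): {"s0": 0.1, "s1": 0.9},
     ("s1", "a0"): {"s1": 1.0},            ("s1", "a1"): {"s0": 0.5, "s1": 0.5}}
idx = {s: i for i, s in enumerate(S)}

def evaluate(policy):
    """Exact policy evaluation: solve (I - GAMMA*T) V = b."""
    T, b = np.zeros((len(S), len(S))), np.zeros(len(S))
    for s in S:
        a = policy[s]
        b[idx[s]] = R[s] + c[a]
        for sp, p in P[(s, a)].items():
            T[idx[s], idx[sp]] += p
    V = np.linalg.solve(np.eye(len(S)) - GAMMA * T, b)
    return {s: V[idx[s]] for s in S}

def q(s, a, V):
    return R[s] + c[a] + GAMMA * sum(p * V[sp] for sp, p in P[(s, a)].items())

def policy_iteration():
    policy = {s: A[0] for s in S}                        # start with any policy
    while True:
        V = evaluate(policy)                             # policy evaluation
        best = {s: max(A, key=lambda a: q(s, a, V)) for s in S}      # improvement
        new_policy = {s: best[s] if q(s, best[s], V) > V[s] + 1e-9 else policy[s]
                      for s in S}
        if new_policy == policy:                         # stop when Pi_{i+1} = Pi_i
            return policy, V
        policy = new_policy

print(policy_iteration())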

Modified Policy Iteration
Rather than evaluating the actual value of the policy by solving a system of equations, approximate it by running value iteration with the policy held fixed.

RTDP iteration
Start with the initial belief and initialize the value of each belief to its heuristic value.
For the current belief:
  Save the action that minimises the current state value in the current policy.
  Update the value of the belief through a Bellman backup.
  Apply the chosen action, then randomly pick an observation and move to the next belief assuming that observation.
Repeat until the goal is achieved.
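
The slide describes RTDP over beliefs; the sketch below shows the same trial-based idea in the simpler fully observable, goal-directed (cost-minimisation) setting, with a made-up three-state problem and a trivial admissible heuristic of 0. Greedy action selection, a Bellman backup at each visited state, and outcome sampling are the essential ingredients.

import random

S = ["start", "mid", "goal"]
A = ["go", "wait"]
cost = {"go": 1.0, "wait": 0.5}                         # made-up action costs
P = {("start", "go"):   {"mid": 0.9, "start": 0.1},
     ("start", "wait"): {"start": 1.0},
     ("mid", "go"):     {"goal": 0.8, "mid": 0.2},
     ("mid", "wait"):   {"mid": 1.0}}

def heuristic(s):
    return 0.0                                          # admissible: never overestimates cost-to-go

V = {s: heuristic(s) for s in S}
policy = {}

def q(s, a):
    return cost[a] + sum(p * V[sp] for sp, p in P[(s, a)].items())

def rtdp_trial():
    s = "start"
    while s != "goal":
        a = min(A, key=lambda act: q(s, act))           # greedy (min expected cost) action
        policy[s] = a                                   # save it in the current policy
        V[s] = q(s, a)                                  # Bellman backup at the visited state
        nxt, probs = zip(*P[(s, a)].items())
        s = random.choices(nxt, weights=probs)[0]       # sample an outcome and move on

for _ in range(200):                                    # repeat trials; V converges on relevant states
    rtdp_trial()
print(V, policy)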

Fast RTDP convergence What are the advantages of RTDP? What are the disadvantages of RTDP? How to speed up RTDP?

Other speedups Heuristics Aggregations Reachability Analysis

Going beyond full observability In execution phase, we are uncertain where we are, but we have some idea of where we can be. A belief state = ?

Models of Planning

Observability \ Uncertainty   Deterministic   Disjunctive   Probabilistic
Complete observation          Classical       Contingent    MDP
Partial observation           ???             Contingent    POMDP
No observation                ???             Conformant    POMDP

Speedups Reachability Analysis More informed heuristic

Mathematical modelling Search space : finite/infinite state/belief space. Belief state = some idea of where we are Initial state/belief. Actions Action transitions (state to state / belief to belief) Action costs Feedback : Zero/Partial/Total

Algorithms for search A* : works for sequential solutions. AO* : works for acyclic solutions. LAO* : works for cyclic solutions. RTDP : works for cyclic solutions.

Full Observability
Modelled as MDPs (also called fully observable MDPs).
Output: policy (State → Action)
Bellman equation: V*(s) = max_{a∈A(s)} [ c(a) + Σ_{s'∈S} P(s' | s, a) V*(s') ]

Partial Observability
Modelled as POMDPs (partially observable MDPs); also called probabilistic contingent planning.
Belief = probability distribution over states. What is the size of the belief space?
Output: policy (discretized belief → action)
Bellman equation: V*(b) = max_{a∈A(b)} [ c(a) + Σ_{o∈O} P(b, a, o) V*(b_a^o) ]
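
The successor belief b_a^o in the Bellman equation comes from a Bayes-filter update: b_a^o(s') ∝ Pr(o | s', a) Σ_s Pr(s' | s, a) b(s), and the normaliser is exactly the weight P(b, a, o). A small sketch with made-up transition and observation numbers:

S = ["s0", "s1"]
# T[(s, a)] = {s': Pr(s' | s, a)};  O[(s', a)] = {o: Pr(o | s', a)}  (made-up numbers)
T = {("s0", "a"): {"s0": 0.6, "s1": 0.4},
     ("s1", "a"): {"s0": 0.1, "s1": 0.9}}
O = {("s0", "a"): {"obs0": 0.8, "obs1": 0.2},
     ("s1", "a"): {"obs0": 0.3, "obs1": 0.7}}

def belief_update(b, a, o):
    """b_a^o(s') proportional to Pr(o | s', a) * sum_s Pr(s' | s, a) * b(s)."""
    unnorm = {sp: O[(sp, a)][o] * sum(T[(s, a)].get(sp, 0.0) * b[s] for s in S)
              for sp in S}
    z = sum(unnorm.values())             # = Pr(o | b, a), the weight P(b, a, o) above
    return {sp: v / z for sp, v in unnorm.items()}

b = {"s0": 0.5, "s1": 0.5}
print(belief_update(b, "a", "obs1"))     # {'s0': ~0.133, 's1': ~0.867}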

No observability Deterministic search in the belief space. Output ?