
CS 416 Artificial Intelligence, Lecture 21: Making Complex Decisions (Chapter 17)

Markov decision processes (MDPs)
- Initial state: S0
- Transition model: T(s, a, s'). This is where the Markov property applies: the next state depends only on the current state and the action taken, and transitions may be uncertain.
- Reward function: R(s), defined for each state
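To make these pieces concrete, here is a minimal Python sketch of an MDP in this form, using the textbook's 4x3 grid world as the running example. All names, the grid layout, and the 0.8/0.1/0.1 slip probabilities below are illustrative assumptions, not code from the lecture.

```python
# A minimal sketch of the ingredients named on the slide, using the
# textbook's 4x3 grid world. Everything here is an illustrative assumption.

GAMMA = 1.0          # discount factor (the gamma used on a later slide)
STEP_REWARD = -0.04  # R(s) for every non-terminal state

# States are (col, row) cells; (2, 2) is the obstacle.
states = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != (2, 2)]
terminals = {(4, 3): +1.0, (4, 2): -1.0}
actions = {'Up': (0, 1), 'Down': (0, -1), 'Left': (-1, 0), 'Right': (1, 0)}

def R(s):
    """Reward function R(s): terminal payoff or the per-step cost."""
    return terminals.get(s, STEP_REWARD)

def move(s, delta):
    """One deterministic step; bounce back if it leaves the grid or hits the obstacle."""
    s2 = (s[0] + delta[0], s[1] + delta[1])
    return s2 if s2 in states else s

def T(s, a):
    """Transition model T(s, a, s'), returned as a dict {s': probability}.

    The intended direction succeeds with probability 0.8; the agent slips
    to each perpendicular direction with probability 0.1.
    """
    if s in terminals:
        return {s: 1.0}
    dx, dy = actions[a]
    dist = {}
    for p, delta in [(0.8, (dx, dy)), (0.1, (-dy, dx)), (0.1, (dy, -dx))]:
        s2 = move(s, delta)
        dist[s2] = dist.get(s2, 0.0) + p
    return dist
```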

Building an optimal policy: value iteration
- Calculate the utility of each state
- Use the state utilities to select an optimal action in each state
- The policy is then simple: go to the state with the best utility
- For this to work the state utilities must be accurate, and an iterative process assigns them their correct values

Iterative solution of the Bellman equations
- Start with arbitrary initial values for the state utilities
- Update the utility of each state as a function of its neighbors' utilities
- Repeat this process until an equilibrium is reached
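A minimal sketch of this loop, building on the MDP sketch above; the zero initialization and the stopping threshold eps are assumed details.

```python
# Value iteration: start from arbitrary (here, zero) utilities and repeatedly
# apply the Bellman update
#     U(s) <- R(s) + gamma * max_a  sum_{s'} T(s, a, s') * U(s')
# until the largest change falls below a small threshold (an equilibrium).

def value_iteration(eps=1e-6):
    U = {s: 0.0 for s in states}
    while True:
        U_new, delta = {}, 0.0
        for s in states:
            if s in terminals:
                U_new[s] = R(s)   # a terminal's utility is just its reward
            else:
                best = max(sum(p * U[s2] for s2, p in T(s, a).items())
                           for a in actions)
                U_new[s] = R(s) + GAMMA * best
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps:
            return U
```

With γ = 1 and the -0.04 step cost, this should reproduce the pattern noted on the next slide: utilities rising toward the +1 goal.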

Example
Let γ = 1 and R(s) = -0.04 for non-terminal states. Notice that the utilities are higher near the goal, reflecting the fewer -0.04 step penalties accumulated in the sum.

Building a policy
How might we acquire and store a solution? Is this a search problem? (Isn't everything?)
- Avoid local minima
- Avoid dead ends
- Avoid needless repetition
Key observation: if the number of states is small, consider evaluating states rather than evaluating action sequences.

Policy iteration
Imagine someone gave you a policy. How good is it? Assume we know γ and R. Do we eyeball it? Try a few paths and see how they work out? Let's be more precise...

Policy iteration: checking a policy
Just for kicks, let's compute a utility U_i(s) for each state, at this particular iteration i of the policy, according to the Bellman equation.

Policy iteration: checking a policy
But we don't know U_i(s'). No problem:
- there are n Bellman equations in n unknowns
- the equations are linear (in value iteration, the equations had the non-linear "max" term)
- we can therefore solve for the n unknowns in O(n^3) time using standard linear algebra methods
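In symbols (standard form, with π_i denoting the policy being checked), the linear system is

\[
U_i(s) \;=\; R(s) \;+\; \gamma \sum_{s'} T\bigl(s, \pi_i(s), s'\bigr)\, U_i(s')
\]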

Policy iteration: improving the policy
Now we know U_i(s) for all s. For each s, compute the action that maximizes the expected utility of the next state; this is the best action for s. If it differs from the action the current policy prescribes, update the policy.
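Putting the evaluation and improvement steps together, a sketch over the grid MDP defined earlier; the use of numpy, the all-'Up' starting policy, and the tie-breaking tolerance are assumptions.

```python
# Policy iteration sketch (illustrative, not the lecture's code).
# Evaluation: solve the n linear Bellman equations with numpy, O(n^3).
# Improvement: in each state pick the action with the best expected
# utility; stop when no state's action changes.
import numpy as np

def expected_utility(s, a, U):
    return sum(p * U[s2] for s2, p in T(s, a).items())

def evaluate(policy):
    """Solve U(s) = R(s) + gamma * sum_s' T(s, policy(s), s') U(s') exactly."""
    idx = {s: i for i, s in enumerate(states)}
    A = np.eye(len(states))
    b = np.array([R(s) for s in states])
    for s in states:
        if s in terminals:
            continue                       # keeps U(terminal) = R(terminal)
        for s2, p in T(s, policy[s]).items():
            A[idx[s], idx[s2]] -= GAMMA * p
    U_vec = np.linalg.solve(A, b)
    return {s: U_vec[idx[s]] for s in states}

def policy_iteration():
    policy = {s: 'Up' for s in states if s not in terminals}
    while True:
        U = evaluate(policy)
        changed = False
        for s in policy:
            best = max(actions, key=lambda a: expected_utility(s, a, U))
            if expected_utility(s, best, U) > expected_utility(s, policy[s], U) + 1e-9:
                policy[s], changed = best, True
        if not changed:
            return policy, U
```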

Policy iteration
- Often the most efficient approach
- Requires small state spaces to be tractable: the evaluation step is O(n^3)
- Approximations are possible (see the sketch below):
  - rather than solving for U exactly, approximate it with a speedy iterative technique
  - explore (update the policy of) only a subset of the total state space, and don't bother updating the parts you believe are bad
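A sketch of the first kind of approximation above: a few sweeps of the max-free Bellman update for a fixed policy instead of an exact linear solve (often called modified policy iteration). The sweep count k is an assumed parameter.

```python
# Approximate policy evaluation: run k sweeps of the simplified Bellman
# update for a fixed policy instead of solving the linear system exactly.
def evaluate_approx(policy, U, k=20):
    for _ in range(k):
        U = {s: R(s) if s in terminals
                else R(s) + GAMMA * expected_utility(s, policy[s], U)
             for s in states}
    return U
```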

Can MDPs be used in real situations? Remember our assumptions:
- we know what state we are in, s
- we know the reward at s
- we know the available actions, a
- we know the transition model, T(s, a, s')

Is life fully observable? We don't always know what state we are in. Frequently, the environment is only partially observable:
- the agent cannot look up its action π(s)
- the agent cannot calculate utilities
We can build a model of this state uncertainty; the result is called a partially observable MDP (POMDP).

Our robot problem as a POMDP
No knowledge of state: the robot has no idea what state it is in. What's a good policy? The "Drunken Hoo" strategy.

Observation model
To help model uncertainty, the observation model O(s, o) specifies the probability of perceiving observation o when in state s. In our sensorless example, O returns the empty observation with probability 1 in every state.

Belief state
To help model uncertainty, a belief state b is the probability distribution over states: b(s) is the probability of being in state s. In our example the initial belief is b = (1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 0, 0), uniform over the nine non-terminal states. b(s) is updated with each new action and observation, and the constant α normalizes the update so that b sums to 1.0.
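The update alluded to here, with α as the normalizing constant, is the standard Bayes filter over states (written in the slides' notation):

\[
b'(s') \;=\; \alpha\, O(s', o) \sum_{s} T(s, a, s')\, b(s)
\]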

Insight about POMDPs
Beliefs matter more than reality: the optimal action depends on the agent's belief state, not on its actual state. The optimal policy π*(b) maps belief states to actions. Think about "The Matrix".

POMDP agent
A POMDP agent iterates through the following steps (see the sketch below):
- Given the current belief state b, execute the action a = π*(b)
- Receive the observation o
- Use a and o to update the belief state
- Repeat
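A sketch of this loop in Python, reusing T and states from the grid MDP above. The observation model O, the belief policy pi_star, and the execute/sense interfaces are hypothetical placeholders; only the belief update itself is spelled out.

```python
def update_belief(b, a, o, O):
    """Bayes filter: b'(s') = alpha * O(s', o) * sum_s T(s, a, s') * b(s)."""
    predicted = {s2: 0.0 for s2 in states}
    for s in states:
        for s2, p in T(s, a).items():
            predicted[s2] += p * b[s]
    unnormalized = {s2: O(s2, o) * predicted[s2] for s2 in states}
    alpha = sum(unnormalized.values())
    return {s2: v / alpha for s2, v in unnormalized.items()}

def pomdp_agent(b, pi_star, O, execute, sense):
    """The loop from the slide; execute/sense stand in for the environment."""
    while True:
        a = pi_star(b)                   # act on the current belief
        execute(a)
        o = sense()                      # receive an observation
        b = update_belief(b, a, o, O)    # fold a and o into the belief
```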

Mapping the POMDP onto a belief-state MDP
What is the probability that the agent transitions from one belief state to another after action a? Used directly, the belief-update equation would require actually executing the action to obtain the new observation. Instead, we use conditional probabilities to construct b', summing over all the states the agent might reach.

Predicting a future observation
The probability of perceiving o, given that we start in belief state b and execute action a, is obtained by summing over the set of states s' that might be reached.
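In symbols (standard form; the slide's own equation appeared in a figure):

\[
P(o \mid a, b) \;=\; \sum_{s'} O(s', o) \sum_{s} T(s, a, s')\, b(s)
\]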

Predicting the new belief state
Previously we predicted the observation; now we predict the new belief state. τ(b, a, b') is the probability of reaching belief state b' from b given action a. This is a transition model over belief states.
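Summing over the possible observations gives the belief-state transition model (standard form; the slide's equation appeared in a figure):

\[
\tau(b, a, b') \;=\; P(b' \mid b, a) \;=\; \sum_{o} P(b' \mid o, a, b)\, P(o \mid a, b)
\]

where P(b' | o, a, b) is 1 if applying the belief update to b, a, and o produces b', and 0 otherwise.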

Computing rewards for belief states
We saw that R(s) was required for the MDP. What is the analogue R(b) for a belief state? Call it ρ(b).
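The natural (and standard) choice is the expected reward under the belief:

\[
\rho(b) \;=\; \sum_{s} b(s)\, R(s)
\]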

Pulling it together
We have defined an observable MDP over belief states to model this POMDP: τ(b, a, b') and ρ(b) replace T(s, a, s') and R(s). The optimal policy π*(b) of this belief-state MDP is also an optimal policy for the original POMDP.

An important distinction
The "state" is continuous in this representation. The belief state of the 4x3 world is a vector of 11 numbers between 0 and 1 (one cell is an obstacle, leaving 11 usable cells), whereas the state in our earlier problems was a discrete cell ID. We cannot reuse the exact value/policy iteration algorithms because "summing" over states is no longer possible. There are ways to make them work, though.

Truth in advertising
Finding optimal POMDP strategies is slow: it is intractable even for problems with only a few dozen states.