Chapter 17 – Making Complex Decisions

Presentation transcript:

Chapter 17 – Making Complex Decisions March 30, 2004

17.1 Sequential Decision Problems
Sequential decision problems vs. episodic ones
Fully observable environment
Stochastic actions
Figure 17.1

Markov Decision Process (MDP)
S0: the initial state.
T(s, a, s'): the transition model, a 3-D table of probabilities; transitions are Markovian.
R(s): the reward associated with state s (the short-term reward for being in s).
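As a rough sketch, these components might be held in a small Python structure. The dictionary layout, the terminals entry, and γ = 0.9 are illustrative assumptions; the states, transitions, and rewards anticipate the example slide a little further below.

```python
# A minimal sketch of one way to hold the MDP components in Python.
# The keys, the "terminals" entry, and gamma = 0.9 are illustrative
# assumptions; states, transitions and rewards follow the A/B/C example
# that appears a few slides below.
mdp = {
    "states": ["A", "B", "C"],
    "actions": ["move"],
    "terminals": {"C"},              # assumed: reaching C ends the episode
    # T[(s, a)] maps each successor s' to P(s' | s, a); taken together,
    # these entries form the 3-D table T(s, a, s') described on the slide.
    "T": {
        ("A", "move"): {"B": 1.0},
        ("B", "move"): {"C": 1.0},
    },
    # R[s] is the short-term reward received for being in state s.
    "R": {"A": -0.2, "B": -0.2, "C": 1.0},
    "gamma": 0.9,                    # discount factor (illustrative value)
}
```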

Policy, π
π(s): the action recommended in state s; a policy is the solution to the MDP.
π*(s): the optimal policy, the one that yields the maximum expected utility (MEU) of the environment histories it generates.
Figure 17.2

Utility Function (long-term reward for a state sequence)
Additive: Uh([s0, s1, s2, ...]) = R(s0) + R(s1) + R(s2) + ...
Discounted: Uh([s0, s1, s2, ...]) = R(s0) + γ R(s1) + γ² R(s2) + ...
The discount factor γ (gamma) satisfies 0 ≤ γ ≤ 1.

Discounted Rewards Are Better
With additive rewards, infinite state sequences can have utility +∞ or −∞. Three ways to keep utilities well defined:
With rewards bounded by Rmax and γ < 1, the discounted utility of any infinite sequence is finite (the bound is worked out below).
Guarantee that the agent eventually reaches a terminal state, so sequences are effectively finite.
Use the average reward per step as the basis for comparing policies.
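For the first option, a short worked bound (assuming |R(s)| ≤ Rmax for every state and 0 ≤ γ < 1):

```latex
U_h([s_0, s_1, s_2, \ldots])
  = \sum_{t=0}^{\infty} \gamma^{t} R(s_t)
  \le \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
  = \frac{R_{\max}}{1-\gamma}
```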

17.2 Value Iteration
MEU principle: π*(s) = argmax_a ∑_s' T(s, a, s') U(s')
Bellman equation: U(s) = R(s) + γ max_a ∑_s' T(s, a, s') U(s')
Bellman update: U_{i+1}(s) = R(s) + γ max_a ∑_s' T(s, a, s') U_i(s')
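As a sketch of how the Bellman update turns into an algorithm, assuming the mdp dictionary from the earlier sketch (with its terminals set and 0 < γ < 1); epsilon and the stopping test are conventional choices, not given on the slide.

```python
# Value iteration built directly from the Bellman update above.
# Assumes T[(s, a)] is defined for every non-terminal state s and action a.
def value_iteration(mdp, epsilon=1e-4):
    S, A = mdp["states"], mdp["actions"]
    T, R, gamma = mdp["T"], mdp["R"], mdp["gamma"]
    terminals = mdp["terminals"]
    U = {s: 0.0 for s in S}                        # U_0(s) = 0 everywhere
    while True:
        U_next, delta = {}, 0.0
        for s in S:
            if s in terminals:
                U_next[s] = R[s]                   # no further reward after a terminal
            else:
                # U_{i+1}(s) = R(s) + gamma * max_a sum_{s'} T(s, a, s') * U_i(s')
                best = max(
                    sum(p * U[s2] for s2, p in T[(s, a)].items()) for a in A
                )
                U_next[s] = R[s] + gamma * best
            delta = max(delta, abs(U_next[s] - U[s]))
        U = U_next
        if delta < epsilon * (1 - gamma) / gamma:  # standard stopping test for gamma < 1
            return U
```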

Example
S0 = A
T(A, move, B) = 1. T(B, move, C) = 1.
R(A) = −0.2, R(B) = −0.2, R(C) = 1
Figure 17.4
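Feeding this chain into the value-iteration sketch above (still assuming γ = 0.9 and a terminal C, neither of which the slide specifies) gives utilities that are easy to check by hand:

```python
# Running the earlier sketch on this slide's chain A -> B -> C.
U = value_iteration(mdp)
# U["C"] = 1.0
# U["B"] = R(B) + 0.9 * U(C) = -0.2 + 0.90 = 0.70
# U["A"] = R(A) + 0.9 * U(B) = -0.2 + 0.63 = 0.43
```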

Value Iteration
Convergence guaranteed! Unique solution guaranteed! (For γ < 1 the Bellman update is a contraction, so repeated updates converge to the unique fixed point, which satisfies the Bellman equations.)

17.3 Policy Iteration
Figure 17.7
Policy evaluation: given a policy πi, calculate the utility of each state under πi.
Policy improvement: use a one-step look-ahead to calculate a better policy, πi+1.

Bellman Equation (Standard Policy Iteration)
Old version: U(s) = R(s) + γ max_a ∑_s' T(s, a, s') U(s')
New version: U_i(s) = R(s) + γ ∑_s' T(s, πi(s), s') U_i(s')
Without the max, these are linear equations: with n states there are n equations in n unknowns, which can be solved exactly in O(n³) time.
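One way to see the O(n³) figure is to hand the n linear equations to a solver. A sketch, again using the earlier mdp dictionary; the function name and the numpy-based formulation are my own choices.

```python
import numpy as np

# Exact policy evaluation: solve U(s) = R(s) + gamma * sum_{s'} T(s, pi(s), s') * U(s')
# as the linear system (I - gamma * T_pi) U = R. pi maps each non-terminal state
# to an action.
def policy_evaluation_exact(pi, mdp):
    S = mdp["states"]
    T, R, gamma = mdp["T"], mdp["R"], mdp["gamma"]
    terminals = mdp["terminals"]
    idx = {s: i for i, s in enumerate(S)}
    A_mat = np.eye(len(S))
    b = np.array([R[s] for s in S], dtype=float)
    for s in S:
        if s in terminals:
            continue                                  # row stays U(s) = R(s)
        for s2, p in T[(s, pi[s])].items():
            A_mat[idx[s], idx[s2]] -= gamma * p       # move gamma*T*U to the left side
    U = np.linalg.solve(A_mat, b)                     # O(n^3) in general
    return {s: U[idx[s]] for s in S}
```

For example, `policy_evaluation_exact({"A": "move", "B": "move"}, mdp)` reproduces the utilities computed earlier, since that chain has only one action per state.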

Bellman Update (Modified Policy Iteration)
Old method: U_{i+1}(s) = R(s) + γ max_a ∑_s' T(s, a, s') U_i(s')
New method: U_{i+1}(s) = R(s) + γ ∑_s' T(s, πi(s), s') U_i(s')
Run k of these simplified updates to get an estimate of the utilities. This is called modified policy iteration.
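Combining approximate evaluation with one-step improvement gives the following sketch of modified policy iteration; k = 20, the helper q, and the initial policy are illustrative choices, not values from the slide.

```python
# Modified policy iteration: k simplified Bellman updates approximate the
# utilities of the current policy, then a one-step look-ahead improves it.
def policy_iteration(mdp, k=20):
    S, A = mdp["states"], mdp["actions"]
    T, R, gamma = mdp["T"], mdp["R"], mdp["gamma"]
    terminals = mdp["terminals"]

    def q(s, a, U):
        # expected utility of the successors of doing action a in state s
        return sum(p * U[s2] for s2, p in T[(s, a)].items())

    pi = {s: A[0] for s in S if s not in terminals}   # arbitrary initial policy
    U = {s: 0.0 for s in S}
    while True:
        for _ in range(k):                            # approximate policy evaluation
            U = {s: R[s] if s in terminals
                 else R[s] + gamma * q(s, pi[s], U) for s in S}
        unchanged = True                              # policy improvement step
        for s in pi:
            best = max(A, key=lambda a: q(s, a, U))
            if q(s, best, U) > q(s, pi[s], U):
                pi[s], unchanged = best, False
        if unchanged:
            return pi, U
```

Swapping the k-update loop for the exact linear-system evaluation sketched above gives standard policy iteration.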

17.4 Partially Observable MDPs (POMDPs)
Observation model O(s, o): the probability of perceiving observation o in state s.
Belief state b: a probability distribution over the possible states, e.g. <0.5, 0.5, 0> in a three-state environment. A mechanism is needed to update b after each action and observation.
The optimal action depends only on the agent's current belief state!

Algorithm
Given b, execute the action a = π*(b).
Receive observation o.
Update b using a and o.
Repeat.
Since b is a point in a continuous, multi-dimensional space, solving a POMDP amounts to solving an MDP over belief states.
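A sketch of the "Update b" step, assuming discrete states, a transition model stored as T[(s, a)] like the earlier sketch, and an observation model stored as O[(s, o)] = probability of perceiving o in state s (that dictionary layout is my own choice).

```python
# Belief update (filtering): b'(s') is proportional to
#   O(s', o) * sum_s T(s, a, s') * b(s).
def update_belief(b, a, o, T, O, states):
    b_new = {}
    for s2 in states:
        # prediction step: probability of landing in s2 after doing a
        pred = sum(T.get((s, a), {}).get(s2, 0.0) * b[s] for s in states)
        # observation step: weight by how likely o is in s2
        b_new[s2] = O[(s2, o)] * pred
    total = sum(b_new.values())               # normalization constant (alpha)
    return {s: p / total for s, p in b_new.items()}
```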