Utilities and MDP: A Lesson in Multiagent Systems. Based on Jose Vidal's book Fundamentals of Multiagent Systems. Henry Hexmoor, SIUC.

Utility: Preferences are recorded as a utility function u_i : S → ℝ, where S is the set of observable states in the world, u_i is agent i's utility function, and ℝ is the set of real numbers. The utility function orders the states of the world.
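A minimal Python sketch of how a utility function over states induces such an ordering (the states and values here are hypothetical):

    # Hypothetical utility function u_i : S -> R, stored as a dict.
    u_i = {"sunny": 10.0, "rainy": 2.5, "snowy": 7.0}

    # The utility values order the states of the world, best first.
    ordered_states = sorted(u_i, key=u_i.get, reverse=True)
    print(ordered_states)  # ['sunny', 'snowy', 'rainy']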

Properties of Utilities
– Reflexive: u_i(s) ≥ u_i(s).
– Transitive: if u_i(a) ≥ u_i(b) and u_i(b) ≥ u_i(c), then u_i(a) ≥ u_i(c).
– Comparable: for all a, b, either u_i(a) ≥ u_i(b) or u_i(b) ≥ u_i(a).

Selfish agents: A rational agent is one that wants to maximize its utility but intends no harm. [Diagram: the set of agents contains rational agents, which divide into selfish agents and rational, non-selfish agents.]

Utility is not money: while utility represents an agent's preferences, it is not necessarily equated with money. In fact, the utility of money has been found to be roughly logarithmic.

Marginal Utility: Marginal utility is the additional utility gained from the next event. Example: an A grade brings less marginal utility to a straight-A student than to a B student.

Transition Function: The transition function is written T(s, a, s') and is defined as the probability of reaching state s' from state s by taking action a.
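One concrete way to represent T is a table keyed by state–action pairs; a minimal Python sketch with made-up states, actions, and probabilities:

    # Hypothetical transition function T(s, a, s'): probability of reaching s'
    # from s when taking action a.  Probabilities for each (s, a) sum to 1.
    T = {
        ("s1", "a1"): {"s1": 0.2, "s2": 0.8},
        ("s1", "a2"): {"s1": 1.0},
        ("s2", "a1"): {"s1": 0.5, "s2": 0.5},
        ("s2", "a2"): {"s2": 1.0},
    }

    def transition(s, a, s_next):
        """Return T(s, a, s'), or 0 if s' is unreachable from (s, a)."""
        return T.get((s, a), {}).get(s_next, 0.0)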

Expected Utility: The expected utility of taking action a in state s is the sum of the products of the probability of reaching s' from s with action a and the utility of that final state: E[u | s, a] = Σ_{s' ∈ S} T(s, a, s') u(s'), where S is the set of all possible states.
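A minimal Python sketch of this sum, reusing the dictionary representation above (values are made up):

    # E[u | s, a] = sum over s' of T(s, a, s') * u(s')
    def expected_utility(s, a, T, u):
        return sum(p * u[s_next] for s_next, p in T[(s, a)].items())

    T = {("s1", "a1"): {"s1": 0.2, "s2": 0.8}, ("s1", "a2"): {"s1": 1.0}}
    u = {"s1": 0.0, "s2": 5.0}
    print(expected_utility("s1", "a1", T, u))  # 0.2*0.0 + 0.8*5.0 = 4.0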

Value of Information: The value of learning that the current state is t and not s is the expected utility of the best action under the updated, new information minus the expected utility, evaluated in t, of the action that was best under the old belief: max_a E[u | t, a] − E[u | t, a_s], where a_s = argmax_a E[u | s, a].
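A minimal Python sketch of this difference, assuming the expected_utility function and dictionary representation sketched above (all names and structures hypothetical):

    def expected_utility(s, a, T, u):
        return sum(p * u[s_next] for s_next, p in T[(s, a)].items())

    def value_of_information(s, t, actions, T, u):
        # Action chosen under the old belief that the state is s.
        a_old = max(actions, key=lambda a: expected_utility(s, a, T, u))
        # Action chosen knowing the state is really t.
        a_new = max(actions, key=lambda a: expected_utility(t, a, T, u))
        # Gain from acting on the new information, evaluated in state t.
        return expected_utility(t, a_new, T, u) - expected_utility(t, a_old, T, u)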

Markov Decision Processes (MDP): [Figure: graphical representation of a sample Markov decision process, with values for the transition and reward functions; the start state is s1.]

Reward Function: The reward function r(s) is written r : S → ℝ.

Deterministic vs. Nondeterministic: In a deterministic world, actions have predictable effects: for each state and action there is exactly one successor with T = 1, and T = 0 for all others. In a nondeterministic world, the same action can lead to different successor states, so T gives a probability distribution over successors.

Policy: A policy is a behavior of the agent that maps states to actions; it is written π : S → A.

Optimal Policy: An optimal policy is a policy that maximizes expected utility; it is written π*.

Discounted Rewards: Discounted rewards smoothly reduce the impact of rewards that are farther off in the future: a reward received t steps ahead is weighted by γ^t, where γ ∈ [0, 1) is the discount factor.
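A minimal Python sketch of the discounted return of a reward sequence (the rewards and discount factor are made up):

    # Discounted return: sum over t of gamma**t * r_t.
    def discounted_return(rewards, gamma=0.9):
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    print(discounted_return([1.0, 1.0, 1.0]))   # 1 + 0.9 + 0.81 ≈ 2.71
    print(discounted_return([0.0, 0.0, 10.0]))  # far-off reward shrinks to ≈ 8.1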

Bellman Equation: u(s) = r(s) + γ max_a Σ_{s'} T(s, a, s') u(s'), where r(s) represents the immediate reward and the Σ_{s'} T(s, a, s') u(s') term represents the future, discounted rewards.

Brute-Force Solution: Write n Bellman equations, one for each of the n states, and solve the system. The system is non-linear because of the max_a operator, so it cannot be solved by standard linear methods.

Value Iteration Solution: Set the values of u(s) to random numbers, then repeatedly apply the Bellman update equation. Stop when the maximum utility change across all states in an iteration, δ = max_s |u_{t+1}(s) − u_t(s)|, falls below a small threshold.

Value Iteration Algorithm
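The slide's pseudocode is not reproduced here; the following is a minimal Python sketch of value iteration under the definitions above, with a made-up MDP, discount factor, and stopping threshold:

    def value_iteration(states, actions, T, r, gamma=0.9, epsilon=1e-4):
        """Repeat the Bellman update until the maximum utility change
        across states (delta) drops below epsilon."""
        u = {s: 0.0 for s in states}  # arbitrary initial values
        while True:
            u_new = {}
            for s in states:
                best = max(
                    sum(p * u[sp] for sp, p in T.get((s, a), {}).items())
                    for a in actions
                )
                u_new[s] = r[s] + gamma * best
            delta = max(abs(u_new[s] - u[s]) for s in states)
            u = u_new
            if delta < epsilon:
                return u

    # Tiny made-up MDP.
    states, actions = ["s1", "s2"], ["a1", "a2"]
    T = {
        ("s1", "a1"): {"s1": 0.2, "s2": 0.8}, ("s1", "a2"): {"s1": 1.0},
        ("s2", "a1"): {"s1": 0.5, "s2": 0.5}, ("s2", "a2"): {"s2": 1.0},
    }
    r = {"s1": 0.0, "s2": 1.0}
    print(value_iteration(states, actions, T, r))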

In the slide's worked example, the algorithm stops after t = 4 iterations.

From One-Agent MDPs to Multiagent MDPs: The simplest option is to let one agent change while the other agents remain stationary. A better approach is to make the action a vector of size n that shows each agent's action, where n is the number of agents. Rewards can be:
– doled out equally among the agents, or
– proportional to each agent's contribution.
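A minimal Python sketch of joint actions as vectors and the two reward-splitting schemes (agents, actions, and numbers are made up):

    from itertools import product

    # Joint action = a tuple of length n, one entry per agent (n = 2 here).
    agent_actions = [["up", "down"], ["left", "right"]]
    joint_actions = list(product(*agent_actions))
    # [('up', 'left'), ('up', 'right'), ('down', 'left'), ('down', 'right')]

    def split_equally(reward, n):
        return [reward / n] * n

    def split_by_contribution(reward, contributions):
        total = sum(contributions)
        return [reward * c / total for c in contributions]

    print(split_equally(10.0, 2))               # [5.0, 5.0]
    print(split_by_contribution(10.0, [3, 1]))  # [7.5, 2.5]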

Belief State and Observation Model: With noisy sensing and a world the agent cannot observe directly, the agent maintains a belief state b(s), a probability distribution over states. The observation model O(s, o) gives the probability of observing o while being in state s. After taking action a and observing o, the belief is updated as b'(s') = α O(s', o) Σ_s T(s, a, s') b(s), where α is a normalization constant.
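A minimal Python sketch of this belief update, with hypothetical states, transition model, and observation model:

    # b'(s') = alpha * O(s', o) * sum over s of T(s, a, s') * b(s)
    def update_belief(b, a, o, states, T, O):
        b_new = {
            sp: O[(sp, o)] * sum(T.get((s, a), {}).get(sp, 0.0) * b[s] for s in states)
            for sp in states
        }
        alpha = 1.0 / sum(b_new.values())  # normalization constant
        return {sp: alpha * v for sp, v in b_new.items()}

    states = ["s1", "s2"]
    T = {("s1", "go"): {"s1": 0.3, "s2": 0.7}, ("s2", "go"): {"s2": 1.0}}
    O = {("s1", "beep"): 0.9, ("s2", "beep"): 0.2}
    b = {"s1": 0.5, "s2": 0.5}
    print(update_belief(b, "go", "beep", states, T, O))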

Partially Observable MDP (POMDP): A POMDP can be treated as an MDP over belief states: the probability of moving from belief b to belief b' under action a is the total probability of the observations for which updating b yields b' (and 0 otherwise), and the new reward function is the expected reward under the belief, ρ(b) = Σ_s b(s) r(s). Solving POMDPs is hard.