Making Simple Decisions


Making Simple Decisions

Outline: Combining Beliefs & Desires; Basis of Utility Theory; Utility Functions; Multi-attribute Utility Functions; Decision Networks; Value of Information.

Combining Beliefs & Desires (1) A rational decision is based on both beliefs and desires, in settings where uncertainty and conflicting goals exist. A utility function assigns a single number to each state to express its desirability; combining utilities with outcome probabilities gives an expected utility for each action.

Combining Beliefs & Desires (2) Notation: U(S): utility of state S; S: a snapshot of the world; A: an action of the agent; Result_i(A): the i-th possible outcome state of doing A; E: the available evidence; Do(A): the proposition that A is executed in the current state.

Combining Beliefs & Desires (3) Expected utility and the principle of Maximum Expected Utility (MEU): choose the action that maximizes the agent's expected utility.
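The expected-utility formula on this slide appeared as an image; a standard reconstruction in AIMA's notation is:

    EU(A | E) = \sum_i P(Result_i(A) | Do(A), E) \, U(Result_i(A)),
    MEU: choose the action A that maximizes EU(A | E).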

Basis of Utility Theory Notation: Lottery (L): a decision-making scenario in which different outcomes are determined by chance. A ≻ B: A is preferred to B; A ∼ B: indifference between A and B; A ≿ B: B is not preferred to A. A lottery with multiple outcomes is written L = [p_1, S_1; ...; p_n, S_n].

Basis of Utility Theory(2) Constraints Orderability Transitivity Continuity

Basis of Utility Theory(3) Constraints (cont.) Substitutability Monotonicity Decomposability

Basis of Utility Theory (4) Utility principle: U(A) > U(B) exactly when A ≻ B, and U(A) = U(B) exactly when A ∼ B. Maximum Expected Utility principle: a rational agent chooses the action with the highest expected utility. Utility represents what the agent's actions are trying to achieve and can be constructed by observing the agent's preferences. It is not unique: any positive linear transformation of a utility function represents the same preferences.

Utility Functions (1) Utility maps states to real numbers. Assessment approach: compare A to a standard lottery L_p that yields the best possible prize u_⊤ with probability p and the worst possible catastrophe u_⊥ with probability 1 - p, and adjust p until A ∼ L_p. e.g., $30 ∼ [0.9, continue as normal; 0.1, death].
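On the normalized scale introduced on the next slide (u_⊤ = 1, u_⊥ = 0), the indifference point of this assessment directly gives the utility; a standard statement is:

    A ∼ [p, u_⊤; (1 - p), u_⊥]  implies  U(A) = p.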

Utility Functions (2) Utility scales: a utility function is unchanged, as a description of preferences, by a positive linear transform. Normalized utility: u_⊤ = 1.0 for the best possible outcome, u_⊥ = 0.0 for the worst. Micromort: a one-millionth chance of death (used to analyze Russian roulette, insurance, etc.). QALY: quality-adjusted life year.

Utility Functions (3) The utility of money. TV game example: take $1,000,000 for sure, or gamble at probability 0.5 for $2,500,000. Empirical data on the utility of money: Grayson (1960).

Utility Functions (4) Money does NOT behave as a utility function. Given a lottery L, an agent that prefers a sure payment equal to the lottery's expected monetary value is risk-averse; an agent that prefers the lottery is risk-seeking. In reality the true probabilities are not easy to estimate, so we may estimate the utility function by learning.
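A standard way to state these risk attitudes, with S_EMV(L) the state of holding the expected monetary value of L for certain:

    risk-averse:  U(L) < U(S_EMV(L)),        risk-seeking:  U(L) > U(S_EMV(L)).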

Outline: Combining Beliefs & Desires; Basis of Utility Theory; Utility Functions; Multi-attribute Utility Functions; Decision Networks; Value of Information; Expert Systems.

Multi-attribute Utility Functions (1) Multi-Attribute Utility Theory (MAUT): outcomes are characterized by two or more attributes. e.g., siting a new airport: disruption by construction, cost of land, noise, ... Approach: identify regularities in the agent's preference behavior.

Multi-attribute Utility Functions (2) Notation: attributes X = X_1, ..., X_n; attribute value vector x = <x_1, ..., x_n>; utility function U(x_1, ..., x_n).

Multi-attribute Utility Functions (3) Dominance, certain case (strict dominance): e.g., airport site S1 costs less, generates less noise, and is safer than S2, so S1 strictly dominates S2. Dominance can also be identified in the uncertain case, from the distributions over outcomes.

Multi-attribute Utility Functions (4) Dominance (cont.) Stochastic dominance: strict dominance rarely occurs in real-world problems, but stochastic dominance is more common. e.g., if the cost of S1 is uniformly distributed between $2.8 billion and $4.8 billion, and the cost of S2 between $3 billion and $5.2 billion, then S1 stochastically dominates S2 on cost.
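A standard formal definition (stated for an attribute where higher values are preferred; for cost, apply it to the negated attribute): if actions A1 and A2 induce distributions p_1 and p_2 over attribute X, then

    A1 stochastically dominates A2 on X  iff  \forall t: \int_{-\infty}^{t} p_1(x) dx \le \int_{-\infty}^{t} p_2(x) dx.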

Multi-attribute Utility Functions. (5) Dominance(cont.)

Multi-attribute Utility Functions (6) Preferences without Uncertainty: preferences between concrete outcome values. Preference structure: X1 and X2 are preferentially independent of X3 iff the preference between <x1, x2, x3> and <x1', x2', x3> does not depend on the particular value x3. e.g., airport site <Noise, Cost, Safety>: <20,000 suffer, $4.6 billion, 0.06 deaths/mpm> vs. <70,000 suffer, $4.2 billion, 0.06 deaths/mpm>.

Multi-attribute Utility Functions (7) Preferences without Uncertainty (cont.) Mutual preferential independence (MPI): every pair of attributes is preferentially independent of its complement. e.g., airport site <Noise, Cost, Safety>: Noise & Cost are P.I. of Safety, Noise & Safety are P.I. of Cost, and Cost & Safety are P.I. of Noise, so <Noise, Cost, Safety> exhibits MPI. MPI constrains the agent's preference behavior, as shown below.
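The standard consequence of MPI that the slide alludes to: the agent's preferences can be described by an additive value function,

    V(x_1, ..., x_n) = \sum_i V_i(x_i).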

Multi-attribute Utility Functions (8) Preferences with Uncertainty: preferences between lotteries over attribute values. Utility independence (U.I.): a set of attributes X is utility-independent of a set Y iff preferences between lotteries over the attributes in X do not depend on the particular values of the attributes in Y. Mutual utility independence (MUI): every subset of attributes is U.I. of the remaining attributes. MUI implies that the agent's behavior can be described by a multiplicative utility function, sketched below.
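For three mutually utility-independent attributes, the multiplicative utility function has the standard form (U_i are single-attribute utilities, k_i are constants):

    U = k_1 U_1 + k_2 U_2 + k_3 U_3 + k_1 k_2 U_1 U_2 + k_2 k_3 U_2 U_3 + k_3 k_1 U_3 U_1 + k_1 k_2 k_3 U_1 U_2 U_3.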

Value of Information (1) Idea: compute the expected gain in value from acquiring each possible piece of evidence. Example: buying oil-drilling rights. There are three blocks A, B, and C; exactly one contains oil, worth k dollars; the prior probability is 1/3 for each block (mutually exclusive). The current price of each block is k/3. A consultant offers an accurate survey of block A. What is a fair price for the survey?

Value of Information (2) Solution: the expected value of information equals the expected value of the best action given the information, minus the expected value of the best action without the information. The survey reports "oil in A" with probability 1/3 or "no oil in A" with probability 2/3. With probability 1/3, A has oil: buy A for k/3 for a profit of k - k/3 = 2k/3. With probability 2/3, A has no oil: buy B or C, each of which now contains the oil with probability 1/2, so the expected value is k/2 and the expected profit is k/2 - k/3 = k/6.
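Putting the two cases together, and noting that without the survey the expected profit of buying any block at its fair price k/3 is zero:

    expected profit with the survey = (1/3)(2k/3) + (2/3)(k/6) = k/3,

so the fair price for the survey is k/3.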

General Formula Notation: current evidence E; current best action α; possible action outcomes Result_i(A) = S_i; potential new evidence E_j. The value of perfect information (VPI) is defined below.
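The VPI formula referred to above, in AIMA's standard notation (α_{e_jk} is the best action once E_j = e_jk is known):

    VPI_E(E_j) = ( \sum_k P(E_j = e_{jk} | E) \, EU(\alpha_{e_{jk}} | E, E_j = e_{jk}) ) - EU(\alpha | E).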

Properties of VPI Nonnegative Nonadditive Order-Independent
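Stated as formulas (standard forms):

    Nonnegative:        \forall j, E:  VPI_E(E_j) \ge 0
    Nonadditive:        VPI_E(E_j, E_k) \ne VPI_E(E_j) + VPI_E(E_k)   (in general)
    Order-independent:  VPI_E(E_j, E_k) = VPI_E(E_j) + VPI_{E, E_j}(E_k) = VPI_E(E_k) + VPI_{E, E_k}(E_j).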

Three generic cases for VoI a) Choice is obvious, information worth little b) Choice is nonobvious, information worth a lot c) Choice is nonobvious, information worth little

Summary: Combining Beliefs & Desires; Basis of Utility Theory; Utility Functions; Multi-attribute Utility Functions; Decision Networks; Value of Information.

MAKING COMPLEX DECISIONS

Outline. MDPs (Markov Decision Processes): sequential decision problems; value iteration & policy iteration. POMDPs: partially observable MDPs; decision-theoretic agents. Game Theory: decisions with multiple agents (game theory); mechanism design.

Sequential decision problems An example

Sequential decision problems Game rules: the 4 x 3 environment shown. Beginning in the start state, the agent chooses an action at each time step; the game ends in a goal state, marked +1 or -1. Actions: {Up, Down, Left, Right}. The environment is fully observable. The terminal states have rewards +1 and -1, respectively; all other states have a reward of -0.04.

Sequential decision problems Each action achieves the intended effect with probability 0.8, but the rest of the time, the action moves the agent at right angles to the intended direction. If the agent bumps into a wall, it stays in the same square.

Sequential decision problems Transition model A specification of the outcome probabilities for each action in each possible state Environment history a sequence of states Utility of an environment history the sum of the rewards (positive or negative) received

Sequential decision problems Definition of MDP. Markov Decision Process: the specification of a sequential decision problem for a fully observable environment with a Markovian transition model and additive rewards. An MDP is defined by an initial state S0, a transition model T(s, a, s'), and a reward function R(s).

Sequential decision problems Policy (denoted by π): a solution that specifies what the agent should do in any state the agent might reach; π(s) is the action recommended by the policy for state s. Optimal policy (denoted by π*): a policy that yields the highest expected utility.

Sequential decision problems An optimal policy for the world of Figure 17.1

Sequential decision problems A finite horizon: there is a fixed time N after which nothing matters (the game is over). The optimal policy for a finite horizon is nonstationary (the optimal action in a given state can change over time), which makes the problem more complex. An infinite horizon: there is no fixed time N. The optimal policy for an infinite horizon is stationary, which is simpler.

Sequential decision problems Calculating the utility of state sequences. Additive rewards: the utility of a state sequence is the sum of the rewards received. Discounted rewards: the utility of a state sequence is the discounted sum of the rewards, where the discount factor γ is a number between 0 and 1.
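Written out, the two definitions are:

    Additive:    U_h([s_0, s_1, s_2, ...]) = R(s_0) + R(s_1) + R(s_2) + ...
    Discounted:  U_h([s_0, s_1, s_2, ...]) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + ...,   0 \le \gamma \le 1.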

Sequential decision problems Infinite horizons. Definition: a policy that is guaranteed to reach a terminal state is called a proper policy; with a proper policy we may use γ = 1. Another possibility is to compare infinite sequences in terms of the average reward obtained per time step.

Sequential decision problems How to choose between policies: the value of a policy is the expected sum of discounted rewards obtained, where the expectation is taken over all possible state sequences that could occur, given that the policy is executed. An optimal policy satisfies the condition below.
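In symbols (standard form): the value of a policy π started in state s, and an optimal policy, are

    U^\pi(s) = E[ \sum_{t=0}^{\infty} \gamma^t R(S_t) ]  with S_0 = s,        \pi^* = argmax_\pi U^\pi(s).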

Outline. MDPs (Markov Decision Processes): sequential decision problems; value iteration & policy iteration. POMDPs: partially observable MDPs; decision-theoretic agents. Game Theory: decisions with multiple agents (game theory); mechanism design.

Value iteration The basic idea is to calculate the utility of each state and then use the state utilities to select an optimal action in each state. Definition: utilities of states (given a specific policy π): let S_t be the state the agent is in after executing π for t steps (note that S_t is a random variable). Note the difference between the short-term reward R(s) and the long-term utility U(s).

Value iteration The utilities for the 4 x 3 world

Value iteration Choose the action that maximizes the expected utility of the subsequent state. The utility of a state is given by the Bellman equation, below.
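The Bellman equation referred to above:

    U(s) = R(s) + \gamma \max_a \sum_{s'} P(s' | s, a) \, U(s').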

Value iteration Let us look at one of the Bellman equations for the 4 x 3 world. The equation for the state (1,1) is

    U(1,1) = -0.04 + γ max[ 0.8 U(1,2) + 0.1 U(2,1) + 0.1 U(1,1),    (Up)
                            0.9 U(1,1) + 0.1 U(1,2),                 (Left)
                            0.9 U(1,1) + 0.1 U(2,1),                 (Down)
                            0.8 U(2,1) + 0.1 U(1,2) + 0.1 U(1,1) ]   (Right)

Value iteration The value iteration algorithm repeatedly applies a Bellman update, U_{i+1}(s) ← R(s) + γ max_a Σ_{s'} P(s' | s, a) U_i(s'), to every state until the utilities converge. The VALUE-ITERATION algorithm is as follows.

Value iteration
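A minimal Python sketch of value iteration on the 4 x 3 world described above. The grid encoding, parameter names, and stopping test are illustrative assumptions rather than the book's reference pseudocode:

    # Value iteration for the 4x3 grid world (illustrative sketch).
    # States are (col, row) pairs; (2, 2) is the wall; (4, 3) and (4, 2) are the terminals.

    GAMMA = 1.0          # discount factor (terminal states make every policy proper)
    STEP_REWARD = -0.04  # reward in every non-terminal state
    TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
    WALL = (2, 2)
    STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != WALL]
    ACTIONS = {'Up': (0, 1), 'Down': (0, -1), 'Left': (-1, 0), 'Right': (1, 0)}
    PERPENDICULAR = {'Up': ('Left', 'Right'), 'Down': ('Left', 'Right'),
                     'Left': ('Up', 'Down'), 'Right': ('Up', 'Down')}

    def move(state, action):
        """Deterministic move; bumping into the wall or the grid edge leaves the state unchanged."""
        c, r = state
        dc, dr = ACTIONS[action]
        nxt = (c + dc, r + dr)
        if nxt == WALL or not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3):
            return state
        return nxt

    def transitions(state, action):
        """Outcome distribution: 0.8 intended direction, 0.1 to each perpendicular direction."""
        side1, side2 = PERPENDICULAR[action]
        return [(0.8, move(state, action)), (0.1, move(state, side1)), (0.1, move(state, side2))]

    def reward(state):
        return TERMINALS.get(state, STEP_REWARD)

    def value_iteration(epsilon=1e-6):
        """Apply Bellman updates until the largest change in utility is below epsilon."""
        U = {s: 0.0 for s in STATES}
        while True:
            U_new = {}
            for s in STATES:
                if s in TERMINALS:
                    U_new[s] = reward(s)   # a terminal state's utility is just its reward
                else:
                    U_new[s] = reward(s) + GAMMA * max(
                        sum(p * U[s2] for p, s2 in transitions(s, a)) for a in ACTIONS)
            delta = max(abs(U_new[s] - U[s]) for s in STATES)
            U = U_new
            if delta < epsilon:
                return U

    def best_policy(U):
        """One-step look-ahead: pick the action maximizing expected utility of the next state."""
        return {s: max(ACTIONS, key=lambda a: sum(p * U[s2] for p, s2 in transitions(s, a)))
                for s in STATES if s not in TERMINALS}

    if __name__ == '__main__':
        U = value_iteration()
        print(round(U[(1, 1)], 3))   # close to 0.705, the value shown for (1,1) in the utilities figure
        print(best_policy(U))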

Convergence of Value iteration Starting with initial values of zero, the utilities evolve as shown in Figure 17.5(a)

Value iteration Two important properties of contractions: a contraction function has only one fixed point; and when the function is applied to any argument, the result is closer to the fixed point. Let U_i denote the vector of utilities for all the states at the i-th iteration. Then the Bellman update can be written as U_{i+1} ← B U_i, where B is the Bellman update operator.

Value iteration Use the max norm, which measures the length of a vector by the length of its biggest component. Let U_i and U_i' be any two utility vectors. Then the Bellman update is a contraction, as shown below.
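Written out: the max norm and the contraction property of the Bellman update B are

    ||U|| = \max_s |U(s)|,        ||B U_i - B U_i'|| \le \gamma ||U_i - U_i'||.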

Value iteration The number of value iterations N required to guarantee an error of at most ε = c · R_max, for different values of c (Figure 17.5(b)).
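The bound behind this figure, in its standard form: if |R(s)| \le R_max for all s, then running

    N = \lceil \log( 2 R_{max} / (\epsilon (1 - \gamma)) ) / \log(1/\gamma) \rceil

iterations guarantees that the error ||U_N - U|| is at most ε.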

Policy iteration The policy iteration algorithm alternates the following two steps, beginning from some initial policy π0 : Policy evaluation: given a policy πi, calculate Ui = Uπi, the utility of each state if πi were to be executed. Policy improvement: Calculate a new MEU policy πi+1, using one-step look-ahead based on Ui (as in Equation (17.4)).

Policy iteration For n states, we have n linear equations with n unknowns, which can be solved exactly in time O(n^3) by standard linear algebra methods. For large state spaces, O(n^3) time might be prohibitive. Modified policy iteration instead runs some number of simplified Bellman updates; the simplified update for this process is given below.
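Written out: with the policy fixed, the Bellman equations become linear, and modified policy iteration replaces the exact solve with a few simplified updates:

    Policy evaluation:   U_i(s) = R(s) + \gamma \sum_{s'} P(s' | s, \pi_i(s)) \, U_i(s')
    Simplified update:   U_{i+1}(s) \leftarrow R(s) + \gamma \sum_{s'} P(s' | s, \pi_i(s)) \, U_i(s').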

Policy iteration Example: see the figure. Suppose π_i is the policy shown in the figure.

Policy iteration

Policy iteration In fact, on each iteration, we can pick any subset of states and apply either kind of updating (policy improvement or simplified value iteration) to that subset. This very general algorithm is called asynchronous policy iteration.
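A compact Python sketch of policy iteration for a generic MDP given as plain dictionaries. The representation, the number of simplified evaluation steps, and the tiny two-state MDP at the end are illustrative assumptions, not material from the slides:

    # Policy iteration sketch. T[s][a] is a list of (probability, next_state) pairs; R[s] is the reward.

    def policy_evaluation(pi, U, T, R, gamma, k=50):
        """Approximate U^pi with k simplified Bellman updates (as in modified policy iteration)."""
        for _ in range(k):
            U = {s: R[s] + gamma * sum(p * U[s2] for p, s2 in T[s][pi[s]]) for s in T}
        return U

    def expected_utility(a, s, U, T):
        return sum(p * U[s2] for p, s2 in T[s][a])

    def policy_iteration(T, R, gamma=0.9):
        U = {s: 0.0 for s in T}
        pi = {s: next(iter(T[s])) for s in T}          # arbitrary initial policy
        while True:
            U = policy_evaluation(pi, U, T, R, gamma)
            unchanged = True
            for s in T:                                # policy improvement by one-step look-ahead
                best = max(T[s], key=lambda a: expected_utility(a, s, U, T))
                if expected_utility(best, s, U, T) > expected_utility(pi[s], s, U, T):
                    pi[s], unchanged = best, False
            if unchanged:
                return pi, U

    # Made-up two-state example: in state 'a', action 'go' usually reaches the rewarding state 'b'.
    T = {'a': {'stay': [(1.0, 'a')], 'go': [(0.9, 'b'), (0.1, 'a')]},
         'b': {'stay': [(1.0, 'b')], 'go': [(0.9, 'a'), (0.1, 'b')]}}
    R = {'a': 0.0, 'b': 1.0}
    print(policy_iteration(T, R))    # expected policy: {'a': 'go', 'b': 'stay'}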

Outline. MDPs (Markov Decision Processes): sequential decision problems; value iteration & policy iteration. POMDPs: partially observable MDPs; decision-theoretic agents. Game Theory: decisions with multiple agents (game theory); mechanism design.

Partially observable MDPs When the environment is only partially observable, the MDP becomes a partially observable MDP (POMDP, pronounced "pom-dee-pee"). A POMDP has the elements of an MDP (transition model, reward function) plus an observation model. A POMDP is defined by: an initial state S0 (the true state is itself unknown); a transition model P(s' | s, a); a reward function R(s); a sensor model P(e | s).

Partially observable MDPs an example for POMDPs

Partially observable MDPs How do we calculate the belief state? b'(s') = α P(e | s') Σ_s P(s' | s, a) b(s), where α is a normalizing constant. Suppose the agent moves Left and its sensor reports one adjacent wall; then it is quite likely (though not guaranteed, because both the motion and the sensor are noisy) that the agent is now in (3,1).

Partially observable MDPs Decision cycle of a POMDP agent: (1) given the current belief state b, execute the action a = π*(b); (2) receive observation e; (3) set the current belief state to FORWARD(b, a, e) and repeat.
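A small Python sketch of the belief update b' = FORWARD(b, a, e) used in this cycle; the dictionary-based model interface is an assumption for illustration:

    # Belief-state update: b'(s') = alpha * P(e | s') * sum_s P(s' | s, a) b(s)   (illustrative sketch)

    def forward(b, a, e, transition, sensor):
        """b: dict state -> probability; transition(s, a): list of (prob, next_state) pairs;
        sensor(e, s): probability of observing e in state s. Returns the updated belief."""
        predicted = {}
        for s, p_s in b.items():                       # prediction step: push b through the action
            for p, s2 in transition(s, a):
                predicted[s2] = predicted.get(s2, 0.0) + p * p_s
        updated = {s2: sensor(e, s2) * p for s2, p in predicted.items()}   # weight by the evidence
        alpha = sum(updated.values())                  # normalizing constant
        return {s2: p / alpha for s2, p in updated.items()} if alpha > 0 else updated

The inner loop is the summation over s in the update rule, and the division by alpha is the normalization.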

The probability of e Given that action a was performed starting in belief state b, the probability of perceiving e is P(e | a, b) = Σ_{s'} P(e | s') Σ_s P(s' | s, a) b(s).

The probability of e (cont.) The transition model between belief states is P(b' | b, a) = Σ_e P(b' | e, a, b) P(e | a, b), where P(b' | e, a, b) = 1 if b' = FORWARD(b, a, e) and 0 otherwise. Reward function: ρ(b) = Σ_s b(s) R(s). Together these define an observable MDP on the space of belief states. Solving a POMDP on a physical state space can thus be reduced to solving an MDP on the corresponding belief-state space.

Value Iteration for POMDPs Let α_p(s) denote the utility of executing a fixed conditional plan p starting in physical state s. The expected utility of executing p in a belief state, and the utility of a belief state under the optimal policy (the utility of the best conditional plan for that belief state), are given below; value iteration for POMDPs operates on these quantities.
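Written out in standard notation: if α_p(s) is the utility of executing plan p from state s, then

    U_p(b) = \sum_s b(s) \, \alpha_p(s) = b \cdot \alpha_p,        U(b) = \max_p b \cdot \alpha_p,

and value iteration for POMDPs works by generating the α-vectors of successively deeper conditional plans and pruning dominated ones.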

Outline. MDPs (Markov Decision Processes): sequential decision problems; value iteration & policy iteration. POMDPs: partially observable MDPs; decision-theoretic agents. Game Theory: decisions with multiple agents (game theory); mechanism design.

Decision-theoretic Agents Basic elements of the approach to agent design: a dynamic decision network (DDN = DBN + utility and decision nodes); a filtering algorithm to keep the belief state updated; and decision making by looking ahead over possible action sequences. A dynamic decision network is structured as follows:

Decision-theoretic Agents Dynamic decision network (DDN = DBN + utility): transition model P(X_{t+1} | X_t, A_t); sensor model P(E_t | X_t); rewards R_t for each state; a utility U over the states at the end of the look-ahead horizon.

Decision-theoretic Agents Part of the look-ahead solution of the DDN

Outline. MDPs (Markov Decision Processes): sequential decision problems; value iteration & policy iteration. POMDPs: partially observable MDPs; decision-theoretic agents. Game Theory: decisions with multiple agents (game theory); mechanism design.

Game Theory Components of a game in game theory: players (e.g., Alice and Bob, or O and E); actions (e.g., one, testify); a payoff function. The payoff matrix for two-finger Morra:

                 O: one             O: two
    E: one       E = +2, O = -2     E = -3, O = +3
    E: two       E = -3, O = +3     E = +4, O = -4

Game Theory Agent design: analyze the agent's decisions and compute the expected utility for each decision, under the assumption that the other agents are acting optimally according to game theory. Mechanism design: when an environment is inhabited by many agents, it might be possible to define the rules of the environment so that the collective good of all agents is maximized when each agent adopts the game-theoretic solution that maximizes its own utility. Example: routing Internet traffic.

Game Theory Strategy of a player: a pure strategy (deterministic policy) or a mixed strategy (randomized policy). Strategy profile: an assignment of a strategy to each player. Solution: a strategy profile in which each player adopts a rational strategy.

Game Theory Game theory describes rational behavior for agents in situations where multiple agents interact simultaneously. Solutions of games are Nash equilibria - strategy profiles in which no agent has an incentive to deviate from the specified strategy.

Prisoner's Dilemma payoff matrix:

                     Alice: testify      Alice: refuse
    Bob: testify     A = -5,  B = -5     A = -10, B = 0
    Bob: refuse      A = 0,   B = -10    A = -1,  B = -1

Suppose Bob testifies: if Alice testifies she gets 5 years, if she refuses she gets 10 years. Suppose Bob refuses: if Alice testifies she gets 0 years, if she refuses she gets 1 year.

Dominant strategy: for player p, a strategy s strongly dominates strategy s' if the outcome for s is better for p than the outcome for s', for every choice of strategies by the other players. Strategy s weakly dominates s' if s is better than s' on at least one strategy profile and no worse on any other. Pareto optimal: an outcome is Pareto optimal if there is no other outcome that all players would prefer.

Equilibrium Alice's reasoning: Bob's dominant strategy is "testify", so Alice also testifies, and both get five years; this outcome is a dominant strategy equilibrium. John Nash proved that every finite game has at least one equilibrium in mixed strategies, now known as a Nash equilibrium. A dominant strategy equilibrium is a Nash equilibrium, but a Nash equilibrium need not be a dominant strategy equilibrium.
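A small Python sketch that checks for dominant strategies and pure-strategy Nash equilibria in the prisoner's dilemma payoff matrix above; the data structures and function names are illustrative assumptions:

    # Dominant strategies and pure-strategy Nash equilibria for the prisoner's dilemma (sketch).

    ACTIONS = ['testify', 'refuse']
    # payoff[(alice_action, bob_action)] = (alice_payoff, bob_payoff), matching the matrix above
    PAYOFF = {('testify', 'testify'): (-5, -5),  ('testify', 'refuse'): (0, -10),
              ('refuse', 'testify'): (-10, 0),   ('refuse', 'refuse'): (-1, -1)}

    def best_responses(player):
        """For each fixed opponent action, the action(s) maximizing this player's payoff."""
        br = {}
        for other in ACTIONS:
            def value(a):
                profile = (a, other) if player == 0 else (other, a)
                return PAYOFF[profile][player]
            best = max(value(a) for a in ACTIONS)
            br[other] = {a for a in ACTIONS if value(a) == best}
        return br

    alice_br, bob_br = best_responses(0), best_responses(1)

    # An action is dominant if it is a best response to every opponent action.
    for name, br in (('Alice', alice_br), ('Bob', bob_br)):
        dominant = set.intersection(*br.values())
        print(name, 'dominant strategy:', dominant or 'none')

    # A profile is a pure Nash equilibrium if each action is a best response to the other.
    nash = [(a, b) for a in ACTIONS for b in ACTIONS if a in alice_br[b] and b in bob_br[a]]
    print('Pure-strategy Nash equilibria:', nash)   # expected: [('testify', 'testify')]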

Mechanism Design Mechanism design can be used to set the rules by which agents will interact, in order to maximize some global utility through the operation of individually rational agents. Sometimes, mechanisms exist that achieve this goal without requiring each agent to consider the choices made by other agents.

The End of Talk