CS 188: Artificial Intelligence, Fall 2008
Lecture 8: MEU / Utilities
9/23/2008
Dan Klein – UC Berkeley
Many slides over the course adapted from either Stuart Russell or Andrew Moore

Announcements

Maximum Expected Utility
- Principle of maximum expected utility: a rational agent should choose the action that maximizes its expected utility, given its knowledge
- Questions:
  - Why not do minimax (worst-case) planning in practice?
  - If every agent is maximizing its EU, why do we need expectations?
  - How do we know that there are utility functions that "work" with expectations (i.e., where expected values can be meaningfully compared)?

Multi-Agent Utilities
- Similar to minimax:
  - Utilities are now tuples
  - Each player maximizes their own entry at each node
  - Values are propagated (backed up) from children to parents
  - Can give rise to cooperation and competition dynamically…
- (Figure: example game tree with utility triples such as 1,6,6; 7,1,2; 6,1,2; 7,4,1; 5,1,7; 1,5,2; 7,7,1; 5,1,5)

Expectimax Quantities

Expectimax Pruning?

Expectimax Evaluation
- For minimax search, terminal utilities are insensitive to monotonic transformations
  - We just want better states to have higher evaluations (get the ordering right)
- For expectimax, we need the magnitudes to be meaningful as well (see the sketch below)
  - E.g., we must know whether a 50% / 50% lottery between A and B is better than a 100% chance of C
  - "100 or -10 vs. 0" is different from "10 or -100 vs. 0"
- How do we know such utilities even exist?
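
A minimal sketch of this point (my own illustration, not from the slides): applying a strictly increasing transformation to terminal values never changes the minimax choice, but it can flip the expectimax choice. The toy tree, the 50/50 chance nodes, and the particular transform below are all assumptions made for the example.

    import math

    # Toy example: one max node choosing between
    #   option 0: a 50/50 chance node over terminal values 100 and -10 (risky)
    #   option 1: a sure terminal value of 0 (safe)
    # f is a strictly increasing (monotonic) transform of terminal values.

    def minimax_choice(options):
        return max(range(len(options)), key=lambda i: min(options[i]))

    def expectimax_choice(options):
        return max(range(len(options)), key=lambda i: sum(options[i]) / len(options[i]))

    options = [(100, -10), (0, 0)]
    f = lambda x: -math.exp(-x / 10.0)                 # assumed monotonic transform
    t_options = [tuple(f(x) for x in o) for o in options]

    print(minimax_choice(options), minimax_choice(t_options))      # 1 1: same choice
    print(expectimax_choice(options), expectimax_choice(t_options))  # 0 then 1: choice flips

Here minimax picks the safe option before and after the transform, while expectimax switches from the risky option (average 45 vs. 0) to the safe one once the transform compresses large gains relative to losses.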

Preferences
- An agent chooses among:
  - Prizes: A, B, etc.
  - Lotteries: situations with uncertain prizes, e.g., L = [p, A; (1 − p), B]
- Notation:
  - A ≻ B: the agent prefers A to B
  - A ∼ B: the agent is indifferent between A and B
  - A ≿ B: the agent prefers A to B or is indifferent between them

Rational Preferences
- We want some constraints on preferences before we call them rational
- For example: an agent with intransitive preferences can be induced to give away all of its money (a "money pump"):
  - If B ≻ C, then an agent holding C would pay (say) 1 cent to get B
  - If A ≻ B, then an agent holding B would pay (say) 1 cent to get A
  - If C ≻ A, then an agent holding A would pay (say) 1 cent to get C
  - …and the cycle repeats forever

Rational Preferences
- Preferences of a rational agent must obey constraints
- The axioms of rationality (the von Neumann–Morgenstern axioms): orderability, transitivity, continuity, substitutability, monotonicity, and decomposability
- Theorem: rational preferences imply behavior describable as maximization of expected utility

MEU Principle
- Theorem [Ramsey, 1931; von Neumann & Morgenstern, 1944]:
  - Given any preferences satisfying these constraints, there exists a real-valued function U such that
    U(A) ≥ U(B)  ⇔  A ≿ B
    U([p1, S1; …; pn, Sn]) = Σi pi U(Si)
- Maximum expected utility (MEU) principle:
  - Choose the action that maximizes expected utility (a minimal sketch follows)
  - Note: an agent can be entirely rational (consistent with MEU) without ever representing or manipulating utilities and probabilities
  - E.g., a lookup table for perfect tic-tac-toe, or a reflex vacuum cleaner
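
A minimal sketch of MEU action selection (my own illustration, not course code): given outcome probabilities P(outcome | action) and a utility function over outcomes, pick the action with the highest expected utility. The outcome names, probabilities, and utilities below are made-up illustrative numbers.

    # Minimal MEU action selection (illustrative numbers, not from the lecture).

    def expected_utility(action, outcome_probs, U):
        # outcome_probs[action] maps each outcome to P(outcome | action)
        return sum(p * U[outcome] for outcome, p in outcome_probs[action].items())

    def meu_action(actions, outcome_probs, U):
        return max(actions, key=lambda a: expected_utility(a, outcome_probs, U))

    U = {"win": 1.0, "draw": 0.5, "lose": 0.0}              # assumed utilities
    outcome_probs = {
        "safe":  {"win": 0.2, "draw": 0.7, "lose": 0.1},    # assumed outcome model
        "risky": {"win": 0.5, "draw": 0.0, "lose": 0.5},
    }
    print(meu_action(["safe", "risky"], outcome_probs, U))  # -> "safe" (EU 0.55 vs 0.50)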

Human Utilities
- Utilities map states to real numbers. Which numbers?
- Standard approach to assessment of human utilities:
  - Compare a state A to a standard lottery L_p between
    - the "best possible prize" u+ with probability p
    - the "worst possible catastrophe" u− with probability 1 − p
  - Adjust the lottery probability p until A ∼ L_p
  - The resulting p is a utility in [0, 1]

Utility Scales
- Normalized utilities: u+ = 1.0, u− = 0.0
- Micromorts: one-millionth chance of death; useful for paying to reduce product risks, etc.
- QALYs: quality-adjusted life years; useful for medical decisions involving substantial risk
- Note: behavior is invariant under positive linear transformations of the utility function
- With deterministic prizes only (no lottery choices), only ordinal utility can be determined, i.e., a total order on prizes

Money
- Money does not behave as a utility function
- Given a lottery L:
  - Define its expected monetary value EMV(L)
  - Usually U(L) < U(EMV(L)), i.e., people are risk-averse
- Utility curve: for what probability p am I indifferent between
  - a sure prize x, and
  - a lottery [p, $M; (1 − p), $0] for large M?
- (Figure: typical empirical data, extrapolated with risk-prone behavior)

Example: Insurance
- Consider the lottery [0.5, $1000; 0.5, $0]
  - What is its expected monetary value? ($500)
  - What is its certainty equivalent?
    - The monetary value acceptable in lieu of the lottery
    - About $400 for most people
  - The difference of $100 is the insurance premium
- There's an insurance industry because people will pay to reduce their risk (see the sketch below)
- If everyone were risk-prone, no insurance would be needed!
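
A small numerical sketch (my own illustration, not from the slides): with a concave utility of money, the certainty equivalent of the [0.5, $1000; 0.5, $0] lottery comes out below its $500 EMV, and the gap is the premium a risk-averse person would tolerate. The square-root utility is an assumption chosen only to make the curve concave; real certainty equivalents (like the $400 figure above) depend on the person.

    import math

    # Illustrative concave utility; any concave U gives certainty equivalent < EMV.
    U = math.sqrt                  # assumed utility of money
    U_inv = lambda u: u ** 2       # inverse of the assumed utility

    lottery = [(0.5, 1000.0), (0.5, 0.0)]

    emv = sum(p * x for p, x in lottery)            # expected monetary value: 500.0
    eu = sum(p * U(x) for p, x in lottery)          # expected utility of the lottery
    certainty_equivalent = U_inv(eu)                # sure amount with the same utility: 250.0
    premium = emv - certainty_equivalent            # what you'd pay to remove the risk

    print(emv, certainty_equivalent, premium)       # 500.0 250.0 250.0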

Example: Human Rationality?
- Famous example of Allais (1953)
  - A: [0.8, $4k; 0.2, $0]
  - B: [1.0, $3k; 0.0, $0]
  - C: [0.2, $4k; 0.8, $0]
  - D: [0.25, $3k; 0.75, $0]
- Most people prefer B ≻ A and C ≻ D
- But if U($0) = 0, then
  - B ≻ A ⇒ U($3k) > 0.8 U($4k)
  - C ≻ D ⇒ 0.2 U($4k) > 0.25 U($3k) ⇒ 0.8 U($4k) > U($3k)
  - These two conclusions contradict each other, so no single utility function is consistent with both preferences (derivation below)
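
For completeness, the contradiction written out (standard algebra with U($0) = 0, not additional slide content):

    \[
    B \succ A \;\Rightarrow\; U(\$3k) > 0.8\,U(\$4k) + 0.2\,U(\$0) = 0.8\,U(\$4k)
    \]
    \[
    C \succ D \;\Rightarrow\; 0.2\,U(\$4k) > 0.25\,U(\$3k) \;\Rightarrow\; 0.8\,U(\$4k) > U(\$3k)
    \]

These cannot both hold, so the two majority preferences are jointly inconsistent with expected-utility maximization.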

Reinforcement Learning
- [DEMOS]
- Basic idea:
  - Receive feedback in the form of rewards
  - The agent's utility is defined by the reward function
  - Must learn to act so as to maximize expected rewards
  - Change the rewards, and you change the learned behavior
- Examples:
  - Playing a game: reward at the end for winning / losing
  - Vacuuming a house: reward for each piece of dirt picked up
  - Automated taxi: reward for each passenger delivered

Markov Decision Processes
- An MDP is defined by:
  - A set of states s ∈ S
  - A set of actions a ∈ A
  - A transition function T(s, a, s')
    - The probability that taking a in s leads to s', i.e., P(s' | s, a)
    - Also called the model
  - A reward function R(s, a, s')
    - Sometimes just R(s) or R(s')
  - A start state (or distribution)
  - Possibly one or more terminal states
- MDPs are a family of non-deterministic search problems (a minimal representation is sketched below)
- Reinforcement learning: MDPs where we don't know the transition or reward functions
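
As a minimal sketch of this definition (my own illustration, not the course's project code), an MDP can be packaged as a small container holding S, A, T, R, a discount factor, and a start state. The field names and the choice of a dataclass are assumptions made for readability.

    from dataclasses import dataclass
    from typing import Callable, FrozenSet, Hashable, List, Tuple

    State = Hashable
    Action = Hashable

    @dataclass
    class MDP:
        states: List[State]
        actions: Callable[[State], List[Action]]              # legal actions in a state
        transitions: Callable[[State, Action], List[Tuple[State, float]]]  # [(s', T(s,a,s')), ...]
        reward: Callable[[State, Action, State], float]       # R(s, a, s')
        gamma: float = 1.0                                     # discount factor
        start: State = None
        terminals: FrozenSet[State] = frozenset()

        def is_terminal(self, s: State) -> bool:
            return s in self.terminals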

Solving MDPs
- In a deterministic single-agent search problem, we want an optimal plan, or sequence of actions, from the start to a goal
- In an MDP, we want an optimal policy π*(s)
  - A policy gives an action for each state
  - An optimal policy maximizes expected utility if followed
  - Defines a reflex agent
- (Figure: optimal policy when R(s, a, s') = … for all non-terminals s)

Example Optimal Policies
- (Figure: gridworld optimal policies for different living rewards, e.g., R(s) = -2.0, R(s) = -0.4, R(s) = -0.03)

Example: High-Low
- Three card types: 2, 3, 4
- Infinite deck, twice as many 2's
- Start with a 3 showing
- After each card, you say "high" or "low", and a new card is flipped
  - If you're right, you win the points shown on the new card
  - Ties are no-ops
  - If you're wrong, the game ends
- Differences from expectimax:
  - #1: you get rewards as you go
  - #2: you might play forever!

High-Low
- States: 2, 3, 4, done
- Actions: High, Low
- Model T(s, a, s'):
  - P(s' = done | 4, High) = 3/4
  - P(s' = 2 | 4, High) = 0
  - P(s' = 3 | 4, High) = 0
  - P(s' = 4 | 4, High) = 1/4
  - P(s' = done | 4, Low) = 0
  - P(s' = 2 | 4, Low) = 1/2
  - P(s' = 3 | 4, Low) = 1/4
  - P(s' = 4 | 4, Low) = 1/4
  - …
- Rewards R(s, a, s'):
  - The number shown on s' if s ≠ s'
  - 0 otherwise
- Start: 3
- Note: we could choose actions with search. How? (one way to encode the model is sketched below)
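
A sketch of how this model could be written down in code (my own encoding, shown standalone rather than plugged into the container above). Card probabilities follow the "twice as many 2's" deck: P(2) = 1/2, P(3) = P(4) = 1/4.

    CARD_PROB = {2: 0.5, 3: 0.25, 4: 0.25}   # twice as many 2's
    ACTIONS = ("High", "Low")

    def transitions(s, a):
        """Return (s', probability, reward) triples for state s and action a."""
        if s == "done":
            return []
        out = {}
        for c, p in CARD_PROB.items():
            if c == s:                          # tie: no-op, keep playing, no reward
                nxt, r = c, 0
            elif (c > s) == (a == "High"):      # guessed correctly: win points shown on new card
                nxt, r = c, c
            else:                               # guessed wrong: game ends
                nxt, r = "done", 0
            out[(nxt, r)] = out.get((nxt, r), 0.0) + p
        return [(nxt, p, r) for (nxt, r), p in out.items()]

    # Matches the slide's numbers, e.g. from state 4:
    #   High: P(done) = 3/4 (drew a 2 or a 3), P(4) = 1/4 (tie)
    #   Low:  P(2) = 1/2, P(3) = 1/4, P(4) = 1/4
    print(transitions(4, "High"))
    print(transitions(4, "Low"))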

Example: High-Low
- (Figure: expectimax-style search tree rooted at state 3, branching on High / Low, with chance-node branches labeled T = 0.5, R = 2; T = 0.25, R = 3; T = 0, R = 4; T = 0.25, R = 0)

MDP Search Trees
- Each MDP state gives an expectimax-like search tree:
  - s is a state
  - (s, a) is a q-state
  - (s, a, s') is called a transition, with probability T(s, a, s') = P(s' | s, a) and reward R(s, a, s')

Utilities of Sequences
- In order to formalize optimality of a policy, we need to understand utilities of sequences of rewards
- Typically we consider stationary preferences (written out below)
- Theorem: there are only two ways to define stationary utilities
  - Additive utility
  - Discounted utility
- Assumption for these slides: the reward depends only on the state!
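
The standard forms these bullets refer to, reconstructed in the usual notation (the slide's formula images did not survive the transcript):

    Stationarity:
    \[
    [r, r_0, r_1, r_2, \ldots] \succ [r, r_0', r_1', r_2', \ldots]
    \;\Leftrightarrow\;
    [r_0, r_1, r_2, \ldots] \succ [r_0', r_1', r_2', \ldots]
    \]
    Additive utility:
    \[
    U([r_0, r_1, r_2, \ldots]) = r_0 + r_1 + r_2 + \cdots
    \]
    Discounted utility:
    \[
    U([r_0, r_1, r_2, \ldots]) = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots
    \]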

Infinite Utilities?!
- Problem: infinite sequences can have infinite rewards
- Solutions:
  - Finite horizon:
    - Terminate after a fixed T steps
    - Gives a nonstationary policy (π depends on the time left)
  - Absorbing state(s): guarantee that for every policy, the agent will eventually "die" (like "done" for High-Low)
  - Discounting: use 0 < γ < 1 (see the bound below)
    - A smaller γ means a smaller "horizon" – shorter-term focus
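
Why discounting fixes the problem (a standard bound, not additional slide content): if every per-step reward satisfies |r_t| ≤ R_max, the discounted sum is finite:

    \[
    \left| \sum_{t=0}^{\infty} \gamma^t r_t \right|
    \;\le\; \sum_{t=0}^{\infty} \gamma^t R_{\max}
    \;=\; \frac{R_{\max}}{1 - \gamma}
    \qquad (0 < \gamma < 1)
    \]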

Discounting
- Typically we discount rewards by γ < 1 each time step
  - Sooner rewards have higher utility than later rewards
  - Also helps the algorithms converge

Optimal Utilities
- Fundamental operation: compute the optimal utilities of states s
- Define the utility of a state s:
  - V*(s) = expected return starting in s and acting optimally
- Define the utility of a q-state (s, a):
  - Q*(s, a) = expected return starting in s, taking action a, and thereafter acting optimally
- Define the optimal policy:
  - π*(s) = optimal action from state s

The Bellman Equations
- The definition of utility leads to a simple relationship amongst optimal utility values:
  - Optimal rewards = maximize over the first action and then follow the optimal policy
- Formally (the equations are written out below):
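
The equations referenced here, in their standard form (reconstructed; the slide's equation image did not survive the transcript), using the transition model T, reward R, and discount γ defined earlier:

    \[
    V^*(s) \;=\; \max_a Q^*(s, a)
    \]
    \[
    Q^*(s, a) \;=\; \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma\, V^*(s') \right]
    \]
    \[
    V^*(s) \;=\; \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma\, V^*(s') \right]
    \]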

Solving MDPs
- We want to find the optimal policy π*
- Proposal 1: modified expectimax search over the MDP search tree (max nodes over actions, expectation nodes over transitions)

MDP Search Trees?
- Problems:
  - This tree is usually infinite (why?)
  - The same states appear over and over (why?)
  - There's actually one tree per state (why?)
- Ideas:
  - Compute to a finite depth (like expectimax)
  - Consider returns from sequences of increasing length
  - Cache values so we don't repeat work

Value Estimates
- Calculate estimates V_k*(s)
  - Not the optimal value of s!
  - The optimal value considering only the next k time steps (k rewards)
  - As k → ∞, it approaches the optimal value
- Why:
  - If we are discounting, distant rewards become negligible
  - If terminal states are reachable from everywhere, the fraction of episodes that never end becomes negligible
  - Otherwise, we can get infinite expected utility, and then this approach actually won't work

Memoized Recursion?
- Recurrences (the depth-limited form of the Bellman equations):
  - V_0*(s) = 0
  - V_k*(s) = max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ V_{k−1}*(s') ]
- Cache all function call results so you never repeat work (see the sketch below)
- What happened to the evaluation function?
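
A minimal self-contained sketch of this memoized recursion on the High-Low model from earlier (my own code, not the course's; the discount γ = 0.9 is an arbitrary choice):

    from functools import lru_cache

    CARD_PROB = {2: 0.5, 3: 0.25, 4: 0.25}       # twice as many 2's
    ACTIONS = ("High", "Low")
    GAMMA = 0.9                                   # assumed discount factor

    def transitions(s, a):
        """(s', probability, reward) triples for saying a with card s showing."""
        out = []
        for c, p in CARD_PROB.items():
            if c == s:
                out.append((c, p, 0))                     # tie: no-op
            elif (c > s) == (a == "High"):
                out.append((c, p, c))                     # correct guess: win c points
            else:
                out.append(("done", p, 0))                # wrong guess: game over
        return out

    @lru_cache(maxsize=None)
    def V(s, k):
        """Optimal value of s considering only the next k steps (memoized recursion)."""
        if k == 0 or s == "done":
            return 0.0
        return max(
            sum(p * (r + GAMMA * V(nxt, k - 1)) for nxt, p, r in transitions(s, a))
            for a in ACTIONS
        )

    for k in (1, 2, 5, 20, 100):
        print(k, round(V(3, k), 3))               # V_k(3) approaches V*(3) as k grows

With lru_cache, each (state, depth) pair is evaluated once, which is exactly the "cache values so we don't repeat work" idea from the previous slide.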

Mixed Layer Types
- E.g., Backgammon
- Expectiminimax:
  - The environment is an extra "player" that moves after each agent
  - Chance nodes take expectations; otherwise it is like minimax (see the sketch below)
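
A compact sketch of expectiminimax over an explicit game tree (my own illustration; the node encoding with ("max" | "min" | "chance", children) tuples and the example values are assumptions, not the course's representation):

    # Nodes: ("max", [children]), ("min", [children]),
    #        ("chance", [(prob, child), ...]), or a numeric leaf value.

    def expectiminimax(node):
        if isinstance(node, (int, float)):        # terminal utility
            return node
        kind, children = node
        if kind == "max":
            return max(expectiminimax(c) for c in children)
        if kind == "min":
            return min(expectiminimax(c) for c in children)
        if kind == "chance":
            return sum(p * expectiminimax(c) for p, c in children)
        raise ValueError(f"unknown node type: {kind}")

    # Max move, then a die-like chance node, then the opponent's min move:
    tree = ("max", [
        ("chance", [(0.5, ("min", [3, 12])), (0.5, ("min", [2, 4]))]),   # expected value 2.5
        ("chance", [(0.5, ("min", [5, 8])),  (0.5, ("min", [1, 9]))]),   # expected value 3.0
    ])
    print(expectiminimax(tree))   # -> 3.0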

Stochastic Two-Player
- Dice rolls increase b: 21 possible rolls with 2 dice
- Backgammon has about 20 legal moves
- Depth 4 = 20 × (21 × 20)³ ≈ 1.5 × 10⁹
- As depth increases, the probability of reaching a given node shrinks
  - So the value of lookahead is diminished
  - So limiting depth is less damaging
  - But pruning is less possible…
- TD-Gammon uses depth-2 search + a very good evaluation function + reinforcement learning: world-champion level play