11/19: Connection between MC/HMM and MDP/POMDP; utility in terms of the value of the vantage point.

Choose between two lotteries
Lottery A – 80% chance of $4K; Lottery B – 100% chance of $3K.
Lottery C – 20% chance of $4K; Lottery D – 25% chance of $3K.
Most people prefer B to A, which implies 0.8 U($4K) < U($3K); yet they also prefer C to D, which implies 0.2 U($4K) > 0.25 U($3K), i.e., 0.8 U($4K) > U($3K). The two preferences are inconsistent: people are risk-averse with high-probability events but are willing to take risks with unlikely payoffs (see 16.3 in R&N).
Standard notation for a lottery: [p, A; (1-p), B] means that with probability p you get prize A, and with probability (1-p) you get prize B.

Money  Utility The previous slide on two lotteries shows that not only is money not utility, but the money  Utility conversion can be inconsistent

Expected Monetary Value and Certainty Amount
Consider a lottery: if a coin comes up heads you get $1000, and if it comes up tails you get $0.
– The EMV (expected monetary value) of the lottery is $500.
Suppose you have the option of taking part in this lottery, and I want to see how much money I need to give you up front so that you will give up the option.
– Apparently, on average, people seem to want about $400 to give up this lottery (obviously, this is an average; your mileage may vary).
– This is called the certainty amount.
The difference between the EMV and the certainty amount (here, $100) is called the insurance premium.
– To see why this makes sense, suppose the lottery were: with some small probability you lose your house to fire, and otherwise nothing happens.
– You buy insurance, in essence, to avoid taking part in this lottery.
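A minimal numeric sketch of these definitions. The power-law utility U(x) = x^0.8 is an assumption chosen only so the numbers land near the ~$400 figure above; it is not from the lecture.

```python
# Sketch: expected monetary value (EMV), certainty amount, insurance premium.
lottery = [(0.5, 1000.0), (0.5, 0.0)]    # fair coin: heads -> $1000, tails -> $0

emv = sum(p * x for p, x in lottery)     # expected monetary value = $500

def U(x):                                # assumed concave (risk-averse) utility
    return x ** 0.8

def U_inverse(u):                        # inverse of the assumed utility
    return u ** (1 / 0.8)

expected_utility = sum(p * U(x) for p, x in lottery)
certainty_amount = U_inverse(expected_utility)   # cash amount with the same utility (~$420)
insurance_premium = emv - certainty_amount       # what you "pay" to avoid the risk

print(f"EMV={emv:.0f}  certainty amount={certainty_amount:.0f}  premium={insurance_premium:.0f}")
```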

What is a solution to an MDP?
The solution should tell us the optimal action to do in each state (called a "policy").
– A policy is a function from states to actions (* see the finite-horizon case below *).
– It is not a sequence of actions anymore; this is needed because of the non-deterministic actions.
– If there are |S| states and |A| actions that we can do at each state, then there are |A|^|S| policies (see the sketch below).
How do we get the best policy?
– Pick the policy that gives the maximal expected reward.
– For each policy: simulate the policy (take the actions it suggests) to get behavior traces, evaluate the behavior traces, and take the average value of the traces.
We will concentrate on infinite-horizon problems (infinite horizon doesn't necessarily mean that all behavior traces are infinite; they could be finite and end in a sink state).
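To make these objects concrete, here is a minimal sketch of an MDP represented as plain Python dictionaries; the toy two-state domain and all the numbers in it are made up for illustration, and later sketches in this transcript reuse these same names. It also enumerates the |A|^|S| deterministic policies mentioned above.

```python
from itertools import product

# Toy MDP (made-up numbers): states, actions, transition model M[a][s][s'], state rewards R[s].
S = ["s1", "s2"]
A = ["left", "right"]
M = {                                    # M[a][s] is a dict mapping s' -> Pr(s' | s, a)
    "left":  {"s1": {"s1": 0.9, "s2": 0.1}, "s2": {"s1": 0.8, "s2": 0.2}},
    "right": {"s1": {"s1": 0.2, "s2": 0.8}, "s2": {"s1": 0.1, "s2": 0.9}},
}
R = {"s1": 0.0, "s2": 1.0}               # reward attached to each state

# A (stationary) policy is just a mapping from states to actions.
policies = [dict(zip(S, choice)) for choice in product(A, repeat=len(S))]
print(len(policies), "policies =", len(A), "**", len(S))   # |A|^|S| = 4 here
```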

Horizon & Policy
We said a policy is a function from states to actions... but we sort of lied. The best policy is non-stationary, i.e., it depends on how long the agent has left to "live", which is called the "horizon".
– More generally, a policy is a mapping from (state, remaining horizon) to actions.
– So, if we have a horizon of k, then we will have k policies.
– If the horizon is infinite, then the policies must all be the same (so the infinite-horizon case is easy!).
"If you are twenty and not a liberal, you are heartless; if you are sixty and not a conservative, you are mindless." --Churchill

Horizon & Policy
How long should the behavior traces be?
– Each trace is no longer than k (the finite-horizon case). The policy will be horizon-dependent: the optimal action depends not just on what state you are in, but on how far away your horizon is. E.g., financial portfolio advice for yuppies vs. retirees.
– No limit on the size of the trace (the infinite-horizon case). The policy is not horizon-dependent.
We will concentrate on infinite-horizon problems (infinite horizon doesn't necessarily mean that all behavior traces are infinite; they could be finite and end in a sink state).
"If you are twenty and not a liberal, you are heartless; if you are sixty and not a conservative, you are mindless." --Churchill

How to handle unbounded state sequences?
If we don't have a horizon, then we can have potentially infinitely long state sequences. Three ways to handle them:
1. Use the discounted reward model (the i-th state in the sequence contributes only γ^i R(s_i)).
2. Assume that the policy is proper (i.e., each sequence terminates in an absorbing state with non-zero probability).
3. Consider the "average reward per step".

How to evaluate a policy?
Step 1: Define the utility of a sequence of states in terms of their rewards.
– Assume "stationarity" of preferences: if you prefer future f1 to f2 starting tomorrow, you should prefer them the same way even if they start today.
– Then there are only two reasonable ways to define the utility of a sequence of states:
  U(s_1, s_2, ..., s_n) = Σ_i R(s_i)
  U(s_1, s_2, ..., s_n) = Σ_i γ^i R(s_i)   (0 ≤ γ ≤ 1)
  With discounting, the maximum utility is bounded from above by R_max / (1 - γ).
Step 2: The utility of a policy π is the expected utility of the behaviors exhibited by an agent following it: E[ Σ_{t=0}^∞ γ^t R(s_t) | π ].
Step 3: The optimal policy π* is the one that maximizes this expectation: π* = argmax_π E[ Σ_{t=0}^∞ γ^t R(s_t) | π ].
– Since there are only |A|^|S| different policies, you can evaluate them all in finite time (haa haa..).
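The "simulate and average" recipe can be written directly as Monte Carlo policy evaluation. This is only a sketch: the trace count, the truncation horizon, and γ are assumptions, and it reuses the toy S, A, M, R dictionaries sketched earlier.

```python
import random

def sample_next(s, a, M):
    """Sample a successor state s' from the transition model M[a][s]."""
    r, cum = random.random(), 0.0
    for s_next, p in M[a][s].items():
        cum += p
        if r <= cum:
            return s_next
    return s_next                          # guard against floating-point rounding

def evaluate_policy(pi, s0, M, R, gamma=0.9, n_traces=2000, horizon=200):
    """Estimate U^pi(s0) = E[ sum_t gamma^t R(s_t) | pi, s_0 = s0 ] by sampling traces."""
    total = 0.0
    for _ in range(n_traces):
        s, discount, trace_value = s0, 1.0, 0.0
        for _ in range(horizon):           # truncate: gamma^200 is negligible
            trace_value += discount * R[s]
            s = sample_next(s, pi[s], M)
            discount *= gamma
        total += trace_value
    return total / n_traces                # average value of the behavior traces

print(evaluate_policy({"s1": "right", "s2": "right"}, "s1", M, R))
```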

Utility of a State
The (long-term) utility of a state s with respect to a policy π is the expected value of all state sequences starting with s:
– U^π(s) = E[ Σ_{t=0}^∞ γ^t R(s_t) | π, s_0 = s ]
The true utility of a state s is just its utility w.r.t. the optimal policy: U(s) = U^{π*}(s).
Thus, U and π* are closely related:
– π*(s) = argmax_a Σ_{s'} M^a_{ss'} U(s')
As are the utilities of neighboring states:
– U(s) = R(s) + γ max_a Σ_{s'} M^a_{ss'} U(s')   (the Bellman equation)

(Value) What about the deterministic case? Then U(s_i) is determined by the shortest path from s_i to the goal ("sequence of states" = "behavior").

Bellman Equations as a Basis for Computing the Optimal Policy
Qn: Is there a simpler way than having to evaluate |A|^|S| policies?
– Yes... The optimal value and the optimal policy are related by the Bellman equations:
  U(s) = R(s) + γ max_a Σ_{s'} M^a_{ss'} U(s')
  π*(s) = argmax_a Σ_{s'} M^a_{ss'} U(s')
The equations can be solved exactly through
– "value iteration" (iteratively compute U and then compute π*),
– "policy iteration" (iterate over policies),
– or solved approximately through "real-time dynamic programming".
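A minimal value-iteration sketch over the same assumed dictionary representation as before (γ and the threshold ε are assumptions). The max-norm termination test here anticipates the "Terminating Value Iteration" slide below.

```python
def value_iteration(S, A, M, R, gamma=0.9, epsilon=1e-6):
    """Iteratively solve U(s) = R(s) + gamma * max_a sum_{s'} M[a][s][s'] * U(s')."""
    U = {s: 0.0 for s in S}
    while True:
        U_new = {}
        for s in S:
            best = max(sum(p * U[s2] for s2, p in M[a][s].items()) for a in A)
            U_new[s] = R[s] + gamma * best
        if max(abs(U_new[s] - U[s]) for s in S) < epsilon:   # max-norm termination test
            return U_new
        U = U_new

U_star = value_iteration(S, A, M, R)    # reuses the toy S, A, M, R sketched earlier
```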

[Figure: a grid-world MDP where each action moves in the intended direction with probability 0.8 and in a perpendicular direction with probability 0.1 each.] The value-iteration update: U(i) = R(i) + γ max_a Σ_j M^a_{ij} U(j).

Value Iteration Demo (mdp/vi.html)
Things to note:
– The way the values change (states far from absorbing states may first decrease and then increase their values).
– The difference in convergence speed between the policy and the values.

Why do the values come down first? Why do some states reach their optimal value faster? Updates can be done synchronously OR asynchronously; convergence is guaranteed as long as each state is updated infinitely often.

Terminating Value Iteration
The basic idea is to terminate value iteration when the values have "converged" (i.e., are not changing much from iteration to iteration).
– Set a threshold ε and stop when the change across two consecutive iterations is less than ε.
– There is a minor wrinkle: the value function is a vector, so we bound the maximum change allowed in any of its dimensions between two successive iterations by ε.
– The max norm ||.|| of a vector is the maximal value among all its dimensions. We are basically terminating when ||U_i - U_{i+1}|| < ε.

Policies converge earlier than values
There are a finite number of policies but an infinite number of value functions, so entire regions of the value-vector space are mapped to a single policy; hence policies may converge faster than values. This suggests searching in the space of policies.
Given a utility vector U_i, we can compute the greedy policy π_{U_i}. The policy loss of π_{U_i} is ||U^{π_{U_i}} - U*|| (the max-norm difference of two vectors is the maximum amount by which they differ on any dimension).
[Figure: the value space V(S_1) x V(S_2) for an MDP with 2 states and 2 actions, partitioned into regions P_1...P_4 that each map to one policy, with U* marked.]
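A sketch of reading off the greedy policy π_U from a utility vector U, continuing the dictionary representation assumed earlier.

```python
def greedy_policy(U, S, A, M):
    """pi_U(s) = argmax_a sum_{s'} M[a][s][s'] * U(s')."""
    return {s: max(A, key=lambda a: sum(p * U[s2] for s2, p in M[a][s].items()))
            for s in S}

pi_star = greedy_policy(U_star, S, A, M)   # greedy w.r.t. the value-iteration result
# The policy loss of pi_U is max_s |U^{pi_U}(s) - U*(s)|; U^{pi_U} can be computed
# exactly with the linear solve sketched after the next slide.
```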

Policy evaluation: n linear equations with n unknowns. We can either solve the linear equations exactly, or solve them approximately by running value iteration a few times (the update won't have the "max" operation, since the policy fixes the action in each state).
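A sketch of exact policy evaluation by solving the n linear equations U^π(s) = R(s) + γ Σ_{s'} M^{π(s)}_{ss'} U^π(s') with NumPy, over the same assumed dictionary representation.

```python
import numpy as np

def evaluate_policy_exactly(pi, S, M, R, gamma=0.9):
    """Solve (I - gamma * M_pi) U = R, where M_pi[i][j] = Pr(s_j | s_i, pi(s_i))."""
    n = len(S)
    index = {s: i for i, s in enumerate(S)}
    M_pi = np.zeros((n, n))
    r = np.zeros(n)
    for s in S:
        r[index[s]] = R[s]
        for s2, p in M[pi[s]][s].items():
            M_pi[index[s], index[s2]] = p
    U = np.linalg.solve(np.eye(n) - gamma * M_pi, r)
    return {s: U[index[s]] for s in S}

print(evaluate_policy_exactly({"s1": "right", "s2": "right"}, S, M, R))
```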

Thanks and Giving
Suppose you randomly reshuffled the world, and you have 100 people on your street (randomly sampled from the entire world). On your street, there will be 5 people from the US. Suppose they are a family. This family:
– will own 2 of the 8 cars on the entire street,
– will own 60% of the wealth of the whole street,
– and of the 100 people on the street, you (and you alone) will have had a college education.
...and of your neighbors:
– Nearly half (50) of your neighbors would suffer from malnutrition.
– About 13 of the people would be chronically hungry.
– One in 12 of the children on your street would die of some mostly preventable disease by the age of 5: from measles, malaria, or diarrhea. One in 12.
"If we came face to face with these inequities every day, I believe we would already be doing something more about them." --William H. Gates (5/2003) (on Bill Moyers' NOW program)
11/21 "It's the mark of a truly educated man to be deeply moved by statistics." -Oscar Wilde

Bellman equations when actions have costs
The model discussed in class ignores action costs and only considers state rewards.
– C(s,a) is the cost of doing action a in state s.
– Assume costs are just negative rewards; the Bellman equation then becomes
  U(s) = R(s) + γ max_a [ -C(s,a) + Σ_{s'} M^a_{ss'} U(s') ]
– Notice that the only difference is that -C(s,a) is now inside the maximization.
With this model, we can talk about "partial satisfaction" planning problems, where actions have costs, goals have utilities, and the optimal plan may not satisfy all the goals.
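A one-function sketch of that modified backup, with the cost term -C(s,a) inside the maximization. The cost table C, keyed by (state, action) pairs, is an assumed addition to the earlier toy representation.

```python
def backup_with_costs(s, U, A, M, R, C, gamma=0.9):
    """U(s) = R(s) + gamma * max_a [ -C[(s, a)] + sum_{s'} M[a][s][s'] * U(s') ]."""
    best = max(-C[(s, a)] + sum(p * U[s2] for s2, p in M[a][s].items()) for a in A)
    return R[s] + gamma * best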

Incomplete Observability (the dreaded POMDPs)
To model partial observability, all we need to do is to look at the MDP in the space of belief states (belief states are fully observable even when the world states are not).
– The policy maps belief states to actions. E.g., belief state = {s_1: 0.3, s_2: 0.4, s_4: 0.3}.
In practice, this causes (humongous) problems:
– The space of belief states is "continuous" (even if the underlying world is discrete and finite). {GET IT? GET IT??}
– Even approximate policies are hard to find (PSPACE-hard). Problems with a few dozen world states are hard to solve currently.
– "Depth-limited" exploration (such as that done in adversarial games) is the only option...
[Figure: belief states change as we take actions, e.g., after 5 LEFTs, 5 UPs, or 5 RIGHTs.]

Real-Time Dynamic Programming
Value and policy iteration are the bedrock methods for solving MDPs, and both give optimality guarantees.
– Both of them tend to be very inefficient for large (several-thousand-state) MDPs (polynomial in |S|).
Many ideas are used to improve efficiency while giving up optimality guarantees:
– E.g., consider only the part of the policy for the more likely states (the envelope-extension method).
– Interleave "search" and "execution" (Real-Time Dynamic Programming, RTDP):
  Do a limited-depth analysis based on reachability to find the value of a state (and thereby the best action to do, which is the action that sends you to the best value).
  The values of the leaf nodes are set to their immediate rewards R(s), or alternatively to some admissible estimate of the value function (h*).
  If all the leaf nodes are terminal nodes, then the backed-up value is the true optimal value; otherwise, it is an approximation.
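A compact sketch of the RTDP loop described above: interleave greedy action selection with Bellman backups along the states actually visited, initializing unseen values with a heuristic h. The parameter names, γ, and the trial structure are assumptions, and the toy dictionary representation from earlier is reused.

```python
import random

def rtdp(s0, goal_states, S, A, M, R, h, gamma=0.95, n_trials=100):
    """Real-Time Dynamic Programming: back up values only along sampled greedy trials.
    Assumes the goal is reachable under the greedy policy (a proper policy)."""
    U = {s: h(s) for s in S}                      # admissible (optimistic) initial values
    def q(s, a):
        return R[s] + gamma * sum(p * U[s2] for s2, p in M[a][s].items())
    for _ in range(n_trials):
        s = s0
        while s not in goal_states:
            a = max(A, key=lambda act: q(s, act))  # greedy action w.r.t. current U
            U[s] = q(s, a)                         # Bellman backup at the visited state
            r, cum = random.random(), 0.0          # sample s' from the transition model
            for s2, p in M[a][s].items():
                cum += p
                if r <= cum:
                    break
            s = s2
    return U
```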

MDPs and Deterministic Search
What special case of MDP does the problem-solving agent's search correspond to?
– Actions are deterministic; the goal states are all equally valued and are all sink states.
Is it worth solving such a problem using MDPs?
– Constructing the optimal policy is overkill: the policy, in effect, gives us the optimal path from every state to the goal state(s).
– The value function, or its approximations, on the other hand, is useful. How? As heuristics for the problem-solving agent's search.
This shows an interesting connection between dynamic programming and the "state search" paradigm:
– DP solves many related problems on the way to solving the one problem we want.
– State search tries to solve just the problem we want.
– We can use DP to find heuristics to run state search.

RTA* (RTDP with deterministic actions and leaves evaluated by f(.))
[Figure: a small search tree over states S, n, m, k, and G, annotated with g, h, and f values at each node.]
– Grow the tree to depth d.
– Apply the f-evaluation to the leaf nodes.
– Propagate the f-values up to the parent nodes: f(parent) = min(f(children)).
RTA* is a special case of RTDP:
– RTA* is useful for acting in deterministic, dynamic worlds,
– while RTDP is useful for acting in stochastic, dynamic worlds.

What if you see this as a game?
The expected-value computation is fine if you are maximizing "expected" return.
– What if you are risk-averse (and think "nature" is out to get you)? Then V_2 = min(V_3, V_4).
– If you are a perpetual optimist, then V_2 = max(V_3, V_4).
If you have deterministic actions, then RTDP becomes RTA* (if you use h(.) to evaluate the leaves).

Von Neumann (the Min-Max theorem), Claude Shannon (finite look-ahead), Chaturanga, India (~550 AD) (proto-chess), John McCarthy (α-β pruning), Donald Knuth (α-β analysis).

What if you see this as a game? (Review)
The expected-value computation is fine if you are maximizing "expected" return.
– What if you are risk-averse (and think "nature" is out to get you)? Then V_2 = min(V_3, V_4).
– If you are a perpetual optimist, then V_2 = max(V_3, V_4).

Game Playing (Adversarial Search)
Perfect play
– Do minimax on the complete game tree.
Alpha-beta pruning (a neat idea that is the bane of many a CSE471 student).
Resource limits
– Do limited-depth lookahead.
– Apply evaluation functions at the leaf nodes.
– Do minimax.
Miscellaneous
– Games of chance.
– Status of computer games.

Fun to try and find analogies between this and environment properties…

[Figure: an alpha-beta example tree annotated with node bounds (e.g., <= 2, <= 14, <= 5) and a cutoff.]
Whenever a node gets its "true" value, its parent's bound gets updated. When all children of a node have been evaluated (or a cutoff occurs below that node), the current bound of that node is its true value.
Two types of cutoffs:
– If a min node n has bound <= j and some max ancestor of n has bound >= k, then a cutoff occurs at n as soon as j <= k.
– Symmetrically, if a max node n has bound >= k and some min ancestor m of n has bound <= j, then a cutoff occurs at n as soon as j <= k.
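A minimal depth-limited minimax sketch with the alpha-beta cutoffs just described. The game interface (the successors, is_terminal, and evaluate functions) is an assumed abstraction, not from the lecture.

```python
def alpha_beta(state, depth, alpha, beta, maximizing, successors, is_terminal, evaluate):
    """Depth-limited minimax with alpha-beta pruning.
    alpha = best value guaranteed to MAX so far; beta = best (lowest) guaranteed to MIN."""
    if depth == 0 or is_terminal(state):
        return evaluate(state)
    if maximizing:
        value = float("-inf")
        for child in successors(state):
            value = max(value, alpha_beta(child, depth - 1, alpha, beta, False,
                                          successors, is_terminal, evaluate))
            alpha = max(alpha, value)
            if alpha >= beta:        # a MIN ancestor will never let play reach here
                break                # beta cutoff
        return value
    else:
        value = float("inf")
        for child in successors(state):
            value = min(value, alpha_beta(child, depth - 1, alpha, beta, True,
                                          successors, is_terminal, evaluate))
            beta = min(beta, value)
            if beta <= alpha:        # a MAX ancestor already has something better
                break                # alpha cutoff
        return value
```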

11/26 Agenda: Adversarial Search (30min) Learning & Inductive Learning (45min)

Another alpha-beta example Project 2 assigned

Click for an animation of alpha-beta search in action on Tic-Tac-Toe (order nodes in terms of their static evaluation values).

Searching Tic-Tac-Toe using Minimax
A game is considered solved if it can be shown that the MAX player has a winning (or at least non-losing) strategy. This means that the backed-up value at the root of the full minimax tree is positive (or at least non-negative).

Evaluation Functions: Tic-Tac-Toe
– If a win for Max: +infinity
– If a loss for Max: -infinity
– If a draw for Max: 0
– Else: (# rows/cols/diagonals open for Max) - (# rows/cols/diagonals open for Min)
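A sketch of that evaluation function, assuming a board encoded as a 3x3 list of lists containing 'X' (Max), 'O' (Min), or None; the encoding is an assumption made for this example.

```python
def lines(board):
    """All 8 rows, columns, and diagonals of a 3x3 board."""
    rows = [board[i] for i in range(3)]
    cols = [[board[i][j] for i in range(3)] for j in range(3)]
    diags = [[board[i][i] for i in range(3)], [board[i][2 - i] for i in range(3)]]
    return rows + cols + diags

def evaluate(board):
    """+inf if Max has won, -inf if Max has lost, 0 for a draw;
    otherwise (#lines still open for Max) - (#lines still open for Min)."""
    ls = lines(board)
    if ['X', 'X', 'X'] in ls:
        return float("inf")
    if ['O', 'O', 'O'] in ls:
        return float("-inf")
    if all(cell is not None for row in board for cell in row):
        return 0                                   # board full, no winner: draw
    open_for_max = sum(1 for line in ls if 'O' not in line)
    open_for_min = sum(1 for line in ls if 'X' not in line)
    return open_for_max - open_for_min
```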

What depth should we go to?
– The deeper the better (but why?).
Should we go to uniform depth?
– Go deeper in branches where the game is in flux (the backed-up values are changing fast); this is called "quiescence" search.
Can we avoid the horizon effect?

Why is “deeper” better? Possible reasons –Taking mins/maxes of the evaluation values of the leaf nodes improves their collective accuracy –Going deeper makes the agent notice “traps” thus significantly improving the evaluation accuracy All evaluation functions first check for termination states before computing the non-terminal evaluation

(just as human weight lifters refuse to compete against cranes)

End of Gametrees

(so is MDP policy)

Multi-player Games
Everyone maximizes their own utility.
– How does this compare to 2-player games? (There, Max's utility is the negative of Min's.)

Expecti-Max
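The slide gives only the title. As a hedged sketch (not from the lecture): expectimax is depth-limited minimax with MIN nodes replaced by chance nodes that average their children's values by probability, matching the "expected value" view of nature from the earlier slides. The node interface below is an assumption.

```python
def expectimax(state, depth, node_type, successors, is_terminal, evaluate):
    """node_type is 'max' or 'chance'. successors(state) is assumed to yield plain child
    states at max nodes and (probability, child) pairs at chance nodes."""
    if depth == 0 or is_terminal(state):
        return evaluate(state)
    if node_type == "max":
        return max(expectimax(child, depth - 1, "chance", successors, is_terminal, evaluate)
                   for child in successors(state))
    # chance node: probability-weighted average instead of a min
    return sum(p * expectimax(child, depth - 1, "max", successors, is_terminal, evaluate)
               for p, child in successors(state))
```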