Decision Making under Uncertainty
MURI Meeting, June 2001
Daphne Koller, Stanford University

Research Themes
- Decision making in high-dimensional spaces
  - Feature selection [Guestrin, Ormoneit]
  - Factored models [Guestrin, Parr, K.]
- Decision making in multi-agent settings
  - Inferring preferences from behavior [Chajewska, Ormoneit, K.]
  - Strategic interactions [Milch, K.]
- Hybrid (discrete/continuous) models [Lerner, Parr, K.]
- Reasoning in complex multi-entity domains [Getoor, Segal, Taskar, K.]
- Learning probabilistic models from data [Tong, K.]

Motivation
- A complex battlespace, composed of multiple entities, moving across space and time
- Very large state space: positions of all units, intentions of enemy units, weather & terrain, ...
- Many agents (units) making parallel decisions, trying to coordinate
[Figure: battlespace snapshots at "this time" and "next time"]

The MDP Framework
[Figure: actor-environment loop; the actor sends actions to a complex environment and receives back a state and a reward/cost]
- State space: S
- Action space: A
- Actions stochastically influence the next state: transition model P(s' | s, a)
- States are associated with momentary rewards; rewards accumulate over time
- Task: find a policy π: S → A that maximizes the expected discounted reward E[Σ_t γ^t R(s_t)]
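The slide describes the standard MDP setup rather than a specific algorithm; as a concrete illustration, here is a small tabular value-iteration sketch on a randomly generated toy MDP. The problem data, sizes, and discount factor below are assumptions for illustration, not from the talk.

```python
# A minimal sketch of the MDP framework on a toy problem: tabular value
# iteration for  V(s) = max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) V(s') ].
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.95
rng = np.random.default_rng(0)

# P[a, s, s'] = P(s' | s, a); R[s, a] = momentary reward.
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

V = np.zeros(n_states)
for _ in range(1000):
    # Q[s, a] = R(s, a) + gamma * E[V(s') | s, a]
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)          # greedy policy pi(s) = argmax_a Q(s, a)
print("V* =", V.round(3), "pi* =", policy)
```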

Policies & Value Functions
- Suppose an expert told you the "value" of each state: V(s) is the value of acting optimally starting at s
[Figure: two states S1 (V(S1) = 10) and S2 (V(S2) = 5), reached by Action 1 and Action 2 with probabilities 0.5, 0.7, and 0.3]
- If V is optimal, then it is optimal to act greedily with respect to V: pick the action with the highest expected value, where the expectation is over next-state values
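A minimal sketch of "acting greedily with respect to V". The value numbers echo the slide's example; the transition probabilities and zero rewards are assumptions filled in for illustration.

```python
# Choose the action with the highest one-step backup
#   R(s, a) + gamma * sum_s' P(s' | s, a) V(s').
import numpy as np

def greedy_action(s, V, P, R, gamma=0.95):
    """P[a, s, s'] = P(s' | s, a); R[s, a] = immediate reward; V[s'] = value."""
    backups = R[s, :] + gamma * P[:, s, :] @ V
    return int(np.argmax(backups))

V = np.array([10.0, 5.0])                    # V(S1) = 10, V(S2) = 5, as on the slide
P = np.array([[[0.5, 0.5], [0.5, 0.5]],      # Action 1 (assumed transition rows)
              [[0.7, 0.3], [0.7, 0.3]]])     # Action 2
R = np.zeros((2, 2))                         # no immediate rewards in this toy example
print(greedy_action(0, V, P, R))             # prints 1: 0.7*10 + 0.3*5 > 0.5*10 + 0.5*5
```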

Large state spaces
- There are several approaches for computing the optimal value function:
  - Policy iteration: an iterative bootstrap algorithm
  - Linear programming
  - Reinforcement learning, for cases where the process dynamics are unknown or available only via simulation
- Problem: in most real-world problems, the state space is very large: exponential in the number of features used to describe it
- The value function has a value for each state: impractical to represent or compute in most cases

Feature Selection in MDPs
One approach:
- Select features X of the underlying state H
- Solve the problem as if it were an MDP over the features
[Figure: dynamic model with hidden state H → H' and observed features X, X']
How bad is this approximation?

Theorem: Near-Optimal Policy
Theorem: The loss of acting according to the greedy memory-less policy over the observable variables is bounded by a factor of the mutual information I(H, H' | X') between H and H' given X'.
[Figure: hidden state H → H' with observed features X, X']
For normally distributed H' and linear features (X = W^T H), I(H, H' | X') is minimized if W spans the first k principal components of H.

Learning to Ride a Bicycle
Task proposed by [Randløv and Alstrøm '98]: learn to ride from an initial state to a goal, controlling handlebar torque and center of mass.
- [Randløv and Alstrøm '98]: discretized 6 degrees of freedom, used a neural network to represent the policy; distance to reach the goal: 7 km on average, 1.7 km in the best case
- PEGASUS [Ng and Jordan '00]: used 15 features and linear sigmoid policies; worst-case distance 1.07 km
[Figure: start and goal, 1 km apart]

Feature Selection Algorithm
- Use the same 15 features and policy representation
- Simulate the system using a do-nothing policy
- Run PCA on the points in the sample trajectories
- Apply the PEGASUS algorithm using only the first k principal components as features
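A minimal sketch of the PCA step in this pipeline, using numpy only. The trajectory data here is random placeholder data, and the feature count (15) and the choice of k are illustrative assumptions; the PEGASUS step itself is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
trajectories = rng.normal(size=(5000, 15))   # placeholder: states visited under a do-nothing policy

k = 4                                        # number of principal components to keep
X = trajectories - trajectories.mean(axis=0) # center the data
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:k].T                                 # columns span the first k principal components

# Project every simulated state onto the k-dimensional feature space; the
# projected features would then be handed to PEGASUS in place of the raw state.
features = X @ W
print(features.shape)                        # (5000, 4)
```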

Summary
- Can get excellent performance in a sequential decision process using few features
- The feature selection algorithm tells us which features of the state are most important for decision making, and which features of the recent past are worth remembering
- Can allow us to deal effectively with high-dimensional spaces

Factored MDPs
[Figure: dynamic Bayesian network over variables X, Y, Z at time t and X', Y', Z' at time t+1, with sub-reward nodes]
- Total reward is the sum of sub-rewards: R = R1 + R2
- Actions only make local changes to the transition model
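A rough sketch of what a factored representation looks like in code. The variables, parent sets, and probabilities are made up for illustration (and the action is omitted for brevity); the point is that each next-step variable depends only on a few parents and the reward decomposes into local terms.

```python
import itertools

# Per-variable transition model: P(X'=1 | parents), indexed by parent values.
parents = {"X": ("X",), "Y": ("X", "Y"), "Z": ("Y", "Z")}
prob_on = {
    "X": {(0,): 0.1, (1,): 0.9},
    "Y": {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.4, (1, 1): 0.9},
    "Z": {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.8},
}

def transition_prob(state, next_state):
    """P(s' | s) as a product of per-variable conditional probabilities."""
    p = 1.0
    for var in ("X", "Y", "Z"):
        pa = tuple(state[v] for v in parents[var])
        p_on = prob_on[var][pa]
        p *= p_on if next_state[var] == 1 else 1.0 - p_on
    return p

def reward(state):
    """Total reward as a sum of sub-rewards R = R1 + R2."""
    r1 = 1.0 if state["X"] == 1 else 0.0                      # R1 depends only on X
    r2 = 2.0 if state["Y"] == 1 and state["Z"] == 1 else 0.0  # R2 depends only on (Y, Z)
    return r1 + r2

s = {"X": 1, "Y": 0, "Z": 1}
total = sum(transition_prob(s, dict(zip("XYZ", bits)))
            for bits in itertools.product((0, 1), repeat=3))
print(round(total, 6), reward(s))   # 1.0: the per-variable factors define a valid distribution
```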

Decomposable Value Functions
Linear combination of restricted-domain functions [Tsitsiklis & Van Roy '96] [Koller & Parr '99, '00]:
  V(s) ≈ w1 h1(s) + ... + wk hk(s)
Each hi is the status of some small part(s) of a complex system, e.g. the status of a machine or the inventory of a store.
The k basis functions over the 2^n states form the basis matrix
  A = [ h1(s1) h2(s1) ... ; h1(s2) h2(s2) ... ; ... ]   (2^n rows, k columns)
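A small sketch (illustrative only, not the talk's construction) of a decomposable value function V(s) ≈ Σ_i w_i h_i(s): the "system" is three machines, each up or down, with one indicator basis function per machine plus a constant term.

```python
import itertools
import numpy as np

n = 3
states = list(itertools.product((0, 1), repeat=n))      # 2^n states

def h(i, s):
    """Basis function i: indicator of machine i being up, plus a constant bias term."""
    return 1.0 if i == n else float(s[i])

k = n + 1
A = np.array([[h(i, s) for i in range(k)] for s in states])   # 2^n x k basis matrix

# Suppose we were handed some target value function over all 2^n states (here
# synthetic); projecting it onto span(A) gives the compact weight vector w.
V_target = np.array([sum(s) + 0.3 * s[0] * s[1] for s in states])
w, *_ = np.linalg.lstsq(A, V_target, rcond=None)
print("w =", w.round(3))
print("max approximation error:", np.abs(A @ w - V_target).max().round(3))
```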

Approach I: Policy Iteration
Exact policy iteration:
- Guess V0
- πt = greedy(Vt)
- Vt+1 = value of acting on πt
Approximate policy iteration:
- Guess w0
- πt = greedy(A wt)
- A wt+1 ≈ value of acting on πt
Approximating the value of acting on π, which exactly involves a (2^n x 2^n) system and a (2^n x 1) vector, uses an innovative approach based on linear programming.
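A minimal approximate-policy-iteration sketch on a small, fully enumerated MDP (illustrative data). It keeps the value function in the form V ≈ A @ w throughout; note that the projection step here is plain least squares, not the max-norm projection via linear programming that the talk uses, and no factored structure is exploited.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 8, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))   # P[a, s, s']
R = rng.uniform(size=(n_states, n_actions))
A = np.column_stack([np.ones(n_states), rng.normal(size=(n_states, 3))])  # basis matrix

w = np.zeros(A.shape[1])
for _ in range(50):
    # Greedy policy with respect to the current approximate value A @ w.
    Q = R + gamma * np.einsum("ast,t->sa", P, A @ w)
    pi = Q.argmax(axis=1)
    # Exact value of acting on pi: solve (I - gamma * P_pi) V = R_pi ...
    P_pi = P[pi, np.arange(n_states), :]
    R_pi = R[np.arange(n_states), pi]
    V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    # ... then project it back onto span(A) (least squares here, as a stand-in).
    w_new, *_ = np.linalg.lstsq(A, V_pi, rcond=None)
    if np.allclose(w_new, w, atol=1e-10):
        break
    w = w_new
print("greedy policy:", pi)
```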

Network Management Problem
- Computers connected in a network
- Each computer can fail with some probability
- If a computer fails, it increases the probability that its neighbors will fail
- At every time step, the sys-admin must decide which computer to fix
[Figure: network topologies with a server node: bidirectional ring, ring and star, star, 3 legs, ring of rings]

Experimental Results
[Plots: running time and error]
- Runs in time O(n^2), not O((2^n)^3)
- Error remains bounded

Approach II: Linear Programming
Find the optimal value function by linear programming:
  minimize Σ_s V(s)
  subject to V(s) ≥ R(s, a) + γ Σ_s' P(s' | s, a) V(s')   for all s, a
Exponential number of variables (one per state)! Exponential number of constraints (one per state-action pair)!
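A sketch of this exact LP on a tiny random MDP, using scipy.optimize.linprog. The data is illustrative; in the real setting both the variables and the constraints would be exponential in the number of state variables, which is exactly the problem the next slide addresses.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))   # P[a, s, s']
R = rng.uniform(size=(n_states, n_actions))

# Variables: V(s) for every state. Objective: minimize sum_s V(s).
c = np.ones(n_states)

# Constraints V(s) >= R(s,a) + gamma * sum_s' P(s'|s,a) V(s') for all s, a,
# rewritten as -(I - gamma * P_a) V <= -R(., a) for linprog's A_ub form.
A_ub = np.vstack([-(np.eye(n_states) - gamma * P[a]) for a in range(n_actions)])
b_ub = np.concatenate([-R[:, a] for a in range(n_actions)])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * n_states)
print("V* =", res.x.round(3))
```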

Approximate LP formulation
Approximate the LP solution using the value function approximation V(s) ≈ Σ_i w_i h_i(s), i.e. optimize over the weights w rather than over V directly. Error analysis by [De Farias & Van Roy, '01].
- Small number of variables
- Still an exponential number of constraints
- Factored MDPs allow for a compact, closed-form representation of the constraints!
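Continuing the exact-LP sketch above: substituting V ≈ A @ w leaves only k weight variables. The constraints are still enumerated explicitly here (one per state-action pair), so this does not show the factored, closed-form constraint representation the slide refers to; the data and basis are again illustrative.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
n_states, n_actions, gamma, k = 4, 2, 0.9, 2
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.uniform(size=(n_states, n_actions))
A = np.column_stack([np.ones(n_states), rng.normal(size=n_states)])   # 2 basis functions

# minimize sum_s (A @ w)(s)  s.t.  (A @ w)(s) >= R(s,a) + gamma * E[(A @ w)(s') | s, a]
c = A.sum(axis=0)
A_ub = np.vstack([-(A - gamma * P[a] @ A) for a in range(n_actions)])
b_ub = np.concatenate([-R[:, a] for a in range(n_actions)])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * k)
print("w =", res.x.round(3), "   V~ =", (A @ res.x).round(3))
```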

Approx LP vs. Policy Iteration
[Plots: running time and value of the final policy]
Note: state space sizes up to 2^32!!

Summary
- Factored representation of system dynamics allows representation of very complex systems
- We use value functions that approximate the value as a sum of values of system components
  - System components can overlap
  - A very natural approximation that people also use
- Allows very efficient algorithms for sequential decision making in structured complex systems
- New: collaborative multi-agent decision making, with state space size 4.3x10^28 (30 agents in parallel)!!

Research Themes
- Decision making in high-dimensional spaces
  - Feature selection [Guestrin, Ormoneit]
  - Factored models [Guestrin, Parr, K.]
- Decision making in multi-agent settings
  - Inferring preferences from behavior [Chajewska, Ormoneit, K.]
  - Strategic interactions [Milch, K.]
- Hybrid (discrete/continuous) models [Lerner, Parr, K.]
- Reasoning in complex multi-entity domains [Getoor, Segal, Taskar, K.]
- Learning probabilistic models from data [Tong, K.]

Motivation
- The enemy is also acting in the battlespace
- He is also a rational agent, making decisions to optimize his goals
- To act optimally in the presence of another intelligent agent, we need to figure out what he wants
- We address two issues:
  - Figuring out utility functions, both ours and the enemy's
  - Acting optimally in the context of strategic interaction

Example Decision Task
[Influence diagram for a prenatal testing decision, with nodes such as Mother's age, Down's syndrome, Test, Test result, Knowledge, Abortion, Loss of fetus, Miscarriage, Future pregnancy, and Utility]
One of the main advantages of probabilistic models is the ability to make decisions under uncertainty. To do that, we extend Bayesian networks to influence diagrams, which qualitatively have a similar structure:
- Chance nodes: as in a Bayesian network
- Decision nodes: parents are observed prior to the decision
- Utility nodes: deterministic, real-valued function of their parents
To quantify the model, we need to specify probabilities and utilities. The former are easy: obtained from an expert or learned from data. The latter are much harder: they are different for each individual patient.

Utility as a Random Variable
Main idea: express uncertainty over the user's utilities as a probability distribution p(U).
[Plot: a distribution over the utility space, with axes Utility(o1) and Utility(o2)]

Incorporating Information
- We start with a prior over utilities
- As we observe behavior or ask questions, we obtain constraints
- We condition the distribution on the constraints to obtain an informed posterior
- The posterior cannot be represented in closed form: use MCMC sampling to generate "prototypical" utilities
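An illustrative sketch of conditioning a utility prior on observed constraints. The talk uses MCMC sampling; for brevity this stand-in uses simple rejection sampling against a made-up Gaussian prior over two outcome utilities.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_posterior(n_samples, constraint):
    """Draw prior samples U = (U(o1), U(o2)) and keep those satisfying the constraint."""
    kept = []
    while len(kept) < n_samples:
        u = rng.normal(loc=0.5, scale=0.2, size=2)       # prior over (U(o1), U(o2))
        if constraint(u):
            kept.append(u)
    return np.array(kept)

# Observed behavior implies, say, that the user prefers outcome o1 to o2.
posterior = sample_posterior(1000, lambda u: u[0] >= u[1])
print("E[U | constraint] ~", posterior.mean(axis=0).round(3))
```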

Partial Utility Elicitation
[Flowchart:]
1. Compute the optimal policy based on the current p(U)
2. Is the expected regret low enough? If yes, stop
3. If no, ask the question with the highest value of information
4. Condition p(U) on the answer and repeat

Experimental Results I
[Plot: predicted and actual regret (y-axis: regret, x-axis: number of questions)]

Experimental Results II
Target loss = 0.01; averaged over 15 runs, for three questionnaires Q1, Q2, Q3.
- Number of questions asked: avg 2.7-4.8, best 1, worst 24, std dev 2.3-4.3
- Utility loss at end: avg 0.003-0.005, worst 0.05, std dev 0.01
- Distance from indifference point: avg 0.11-0.20, best 0.07, worst 0.26 (other reported values: 0.09, 0.22, 0.15)

Inferring Utilities [Ng & Russell, 2000]
Given a probability distribution over events and an observed decision sequence, compute utility values for the outcomes.
[Figure: decision tree; the agent's decision (d1 or d2) is followed by nature's move (with probabilities p1/1-p1 or p2/1-p2), leading to outcomes o1, o2, o3, o4]
Knowledge about another agent's utility function:
- gives us insight about the agent
- allows us to predict his future actions
- enables us to optimize our own actions in non-cooperative situations
Assume the observed agent is rational: he acts to maximize expected utility.

Example: Online Bookseller
[Game tree with decisions "Sign up for e-mail", "Offer discount", "Buy", and a chance node "Enjoy"; the leaves are l1, ..., l12]
Moves:
- Bookseller: strategic player
- Customer: oblivious player
- Nature (distribution known to both players)
Leaf utilities decompose additively, e.g.:
  U(l1) = u(enjoy) + u(discount-price) + u(e-mail) + u(bargain)
  U(l5) = u(hate) + u(full-price) + u(e-mail)

Partial Strategy
[Annotated game tree, with values propagated up from the leaf utilities U(l1), ..., U(l6):
- Enjoy (chance) nodes: U(E1) = U(l1) P(l1) + U(l2) P(l2); U(E2) = U(l4) P(l4) + U(l5) P(l5)
- Buy nodes: U(B1) = U(E1), or the backed-up values U(B1) = max(U(E1), U(l3)) and U(B2) = max(U(E2), U(l6))
- Offer discount node: the observed choice implies U(E1) ≥ U(l3)]

Behavior Implies Bounds
Observed behavior implies inequalities of the form U(O1) ≥ U(O2); for example, the observed "Offer discount" choice implies U(E1) ≥ U(l3).
[Annotated game tree: lower and upper bounds on node values are propagated up from the leaf utilities U(l1), ..., U(l6); leaves are sums of utility components U(l) = Σ_i u_i, chance (Enjoy) nodes are expectations U(E1) = Σ_c p(c) U(c), and unobserved decision (Buy) nodes are bounded using min and max over the possible choices]
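A sketch of turning observed choices into linear constraints on the unknown utilities, in the spirit of this slide. The numbers are made up, and the game-tree propagation is simplified to "the agent chose lottery p over lottery q, so p·u ≥ q·u", one inequality per observed decision.

```python
import numpy as np
from scipy.optimize import linprog

# Unknown utilities over 4 outcomes: u = (u1, u2, u3, u4), assumed in [0, 1].
observed = [
    (np.array([0.7, 0.3, 0.0, 0.0]), np.array([0.0, 0.0, 0.5, 0.5])),
    (np.array([0.0, 0.6, 0.4, 0.0]), np.array([0.2, 0.0, 0.0, 0.8])),
]

# linprog form: A_ub @ u <= b_ub, i.e. (q - p) . u <= 0 for each observed choice.
A_ub = np.array([q - p for p, q in observed])
b_ub = np.zeros(len(observed))

# Any feasible point is a utility function consistent with the behavior; here we
# just pick one (maximizing u1, say) inside the convex feasible region.
res = linprog(-np.array([1.0, 0.0, 0.0, 0.0]), A_ub=A_ub, b_ub=b_ub,
              bounds=[(0.0, 1.0)] * 4)
print("a consistent utility vector:", res.x.round(3))
```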

Constraints in the Utility Space
[Figure: a convex region in the space of utilities, with axes u_o and u_o']
- Our linear constraints form a convex region which contains all consistent utility functions
- Which one should we choose?
- [Ng & Russell, 2000] propose heuristics for selecting "natural" utility functions

Feasible Region
[Plots: projection of the feasible region onto the u1-u2 plane, after 1 observation and after 17 observations]

Predicting Using Learned Utility
[Plot: distance vs. number of observations, for predicting the utility function and for predicting the strategy]

Strategizing Based on Learned Utility
[Plot: actual utility obtained vs. utility obtained by following the optimal strategy, as a function of the number of observations]

Summary
- Utilities can be treated as a "random variable"
- A distribution over utilities can be learned from a population
- Observations of behavior and/or answers to questions "narrow down" the distribution
- This approach can be used:
  - to facilitate utility elicitation in a cooperative setting
  - to determine another agent's utility and act accordingly

Road Example
[Multi-agent influence diagram: for each plot iW/iE along the road (i = 1, 2, 3; W = west, E = east) there is a chance node Suitability, a decision node Building, and a utility node Util]

Compactness
Assume all variables have three values and each decision node observes three variables, so the number of information sets per agent is 3^3 = 27.
- Size of MAID: n chance nodes of "size" 3, n decision nodes of "size" 27·3
- Size of game tree: 2n splits, each over three values
- Size of normal (matrix) form: n players, each with 3^27 pure strategies
Resulting sizes: 54n vs. 3^(2n) vs. (3^27)^n.
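As a quick sanity check, the snippet below just evaluates the formulas quoted on this slide for two values of n; the game-tree counts it produces (6561 for n = 4, about 1.47x10^38 for n = 40) match the numbers on the later "Road" experiment slide.

```python
# Evaluate the representation-size formulas from this slide.
for n in (4, 40):
    maid_size = 54 * n            # linear in n, as claimed for the MAID
    tree_size = 3 ** (2 * n)      # 2n splits, 3 values each
    print(f"n={n}: MAID ~{maid_size} numbers, game tree = 3^{2*n} ~ {float(tree_size):.3g} nodes")
# n=4  -> game tree = 6561 nodes
# n=40 -> game tree ~ 1.47e+38 nodes
```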

Decision Making: Single Agent
[Influence diagram with chance nodes such as Burglary, Earthquake, Alarm, PhoneCall, Newscast, Recovery, Goods, and Sale, a decision node Go Home, and utility nodes such as Sale and Meeting]
- Need to choose d ∈ Val(D) for every information set u ∈ Val(Parents(D)), e.g. home/stay for every value of PhoneCall
- Compute the expectation of the utility nodes in the distribution conditioned on d, u
- For each u, choose the d that maximizes expected utility
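A tiny single-agent sketch of this procedure. The structure and all numbers are made up for illustration (one chance node Burglary, one observed parent PhoneCall, a binary decision, and an additive utility); the point is choosing a decision d for every value u of the observed parent by maximizing expected utility under P(· | u).

```python
# P(burglary) and P(call | burglary) are illustrative parameters.
P_burglary = {True: 0.1, False: 0.9}
P_call_given_burglary = {True: {True: 0.8, False: 0.2},
                         False: {True: 0.1, False: 0.9}}

def utility(burglary, decision):
    meeting = 0.0 if decision == "home" else 5.0                # value of staying at the meeting
    goods = -20.0 if (burglary and decision == "stay") else 0.0  # loss if burgled while away
    return meeting + goods

decision_rule = {}
for call in (True, False):                                       # every information set u
    # P(burglary | call) by Bayes' rule over the two burglary values.
    joint = {b: P_burglary[b] * P_call_given_burglary[b][call] for b in (True, False)}
    z = sum(joint.values())
    posterior = {b: joint[b] / z for b in joint}
    # Expected utility of each decision d, conditioned on (d, u).
    eu = {d: sum(posterior[b] * utility(b, d) for b in (True, False))
          for d in ("home", "stay")}
    decision_rule[call] = max(eu, key=eu.get)

print(decision_rule)    # {True: 'home', False: 'stay'}
```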

Strategic Relevance
Question: what do we need to know in order to compute the utility-maximizing strategy at D?
- Need to compute the expected utility for decisions d ∈ Val(D) given information u ∈ Val(Parents(D))
- Intuitively, D relies on D' if we need to know the decision rule at D' in order to optimize the decision rule at D
We define a relevance graph with a node for each decision and an edge from D to D' if D relies on D'.
[Figure: a two-node relevance graph, D → D']
We provide a sound & complete procedure for determining strategic relevance using only the graph structure; the relevance graph can be built in quadratic time.

Examples I: Information
[Four small MAIDs over decisions D, D' and a utility U, illustrating the cases: perfect information, "perfect enough" information, simultaneous moves, and "don't care"]

Examples II: Card Game
[MAID: a chance node Deal feeding the decisions Bet1 and Bet2, which feed the utility U]
- Bet2 relies on Bet1 even though Bet2 observes Bet1
- Bet1 can depend on Deal, and Deal influences U
- We therefore need a probability model of Bet1 to derive the posterior on Deal and compute the expectation over U
- A decision D can rely on D' even if D' is observed at D!

Solving Games
Nash equilibrium: ascribes a strategy to every agent in the game; the rational game-theoretic solution concept.
Structured algorithm for computing an equilibrium:
1. Find a minimal set of decisions that rely only on each other
2. Find an equilibrium for the subgame over these decisions
3. Fix their strategy to be the selected equilibrium, and repeat
[Figure: the road example's relevance graph over decisions 1W, 1E, 2W, 2E, 3W, 3E being progressively reduced]
Theorem: the result is an equilibrium for the whole game.
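A sketch of the solution order only (not the equilibrium computation itself), assuming networkx is available. The relevance graph below is a made-up example with generic decision names, and solve_subgame is a stub standing in for the per-component equilibrium solver.

```python
import networkx as nx

# Relevance graph: an edge D -> D' means "D relies on D'".
relevance = nx.DiGraph()
relevance.add_edges_from([("D1", "D2"), ("D2", "D1"),   # D1 and D2 rely on each other
                          ("D3", "D1"), ("D4", "D2")])  # D3, D4 rely on that pair

def solve_subgame(decisions, fixed):
    """Stub: compute an equilibrium over `decisions`, given already-fixed rules."""
    return {d: f"rule({d})" for d in decisions}

# Each strongly connected component is a minimal set of decisions that rely only
# on each other (once later components are fixed); process sink components first.
condensation = nx.condensation(relevance)
fixed = {}
for comp in reversed(list(nx.topological_sort(condensation))):
    decisions = condensation.nodes[comp]["members"]
    fixed.update(solve_subgame(decisions, fixed))
print(fixed)
```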

Experiment: "Road" Example
Reminder, for n = 4: tree size 6561 nodes, matrix size 4.7x10^27.
For n = 40: tree size 1.47x10^38 nodes.

Summary
- Multi-agent influence diagrams (MAIDs): a compact, intuitive language for multi-agent interactions, like Bayesian networks for the multi-agent setting
- MAIDs elucidate important qualitative structure: how different decisions interact
- Can exploit this structure to find strategies efficiently, sometimes exponentially faster than existing algorithms

Conclusions & Future Directions
Goal: deal with complex decision problems involving multiple agents moving across space & time.
Progress:
- Substantial scaling up of the decision problems solved in the single-agent case
- New ideas for dealing with multiple agents
Some current directions:
- Multi-agent MDPs
- Object-relational decision-making problems
- Domains where things evolve at different time scales

Related Publications (2001)
- "Feature Selection for Reinforcement Learning", C. Guestrin and D. Ormoneit. Submitted.
- "Max-norm Projections for Factored MDPs", C. Guestrin, D. Koller, and R. Parr. To appear, IJCAI 2001.
- "Cooperative Multiagent Planning with Factored MDPs", C. Guestrin, D. Koller, and R. Parr. Submitted.
- "Learning an Agent's Utility Function by Observing Behavior", U. Chajewska, D. Koller, and D. Ormoneit. To appear, ICML 2001.
- "Multi-Agent Influence Diagrams for Representing and Solving Games", D. Koller and B. Milch. To appear, IJCAI 2001.
Plus: two papers on hybrid models, three papers on object-relational models, and two papers on learning Bayesian networks from data.