A Computational Framework for Interactive Decision Making with Applications in Strategic Tasks Thank you for the introduction and good morning to all of you. I would like to begin by thanking Jan Glascher for inviting me to give a talk at the Cross-modal Learning Center. Today, I am going to talk about interesting studies of human strategic intent and how we may computationally model the behavioral data collected in these studies.

Prashant Doshi (PI), Professor of Computer Science. My name is Prashant Doshi and I am on the faculty of the University of Georgia in Athens, a great town to visit with all its southern charm.

If my talk piques your interest, and you would like to read more, then please visit my lab web pages for all the relevant papers. http://thinc.cs.uga.edu

Computational modeling of strategic reasoning in Trust games (Ray et al.08, Hula et al.15, …) Computational modeling of strategic reasoning in Centipede games (Goodie et al.09, Qu et al.10, …) In this broad area, I have worked on two projects, one of which studied human recursive thinking while engaged in strategic games, and the other of which is studying human probability judgment in a UAV operation theater. In this talk, I will focus on the first one.

(I think what you think that I think...) Human strategic reasoning is generally hobbled by low levels of recursive thinking (Stahl&Wilson95, Hedden&Zhang02, Camerer et al.04, Ficici&Pfeffer08) The general literature is quite pessimistic about the recursive thinking levels shown by humans while playing strategic games. For example, Camerer asserted that human recursive reasoning in general does not go beyond a single level.

You are Player I and II is human. Will you move or stay? [Slide figure: game tree showing the player to move (I or II) and payoffs for I and II at the leaves: 3, 1, 2, 4.] Take a look at this example sequential game from our experiment, which is similar to the games studied by Hedden & Zhang. There are two players, I and II, who move in turns, and each player chooses between moving and staying. If you were player I at this point and II is human, what would you do? The rational action, which involves thinking about what II would do, whose action in turn depends on what I would do, is to stay. An attentive participant would have to think two levels deep to arrive at the rational action.

Less than 40% of the sample population performed the rational action! Thinking about how others think (...) is hard in general contexts. This could be due to several reasons, including cognitive overload, and ongoing research is exploring the causes.

[Slide figure: game tree with Move/Stay choices and the player to move (I or II); payoffs for I at the leaves are 0.6, 0.4, 0.2, 0.8; the payoff for II is 1 minus the decimal shown.] Now, let's play this game. The difference between the two games is in the payoffs, which are fixed-sum in this game. So this game is strictly competitive, and the participants have to reason with fewer numbers, making it simpler. Again, you are player I and II is human. What would you do?

About 70% of the sample population performed the rational action in this simpler and strictly competitive game

Simplicity, competitiveness and embedding the task in intuitive representations seem to facilitate reasoning about others' reasoning (Flobbe et al.08, Meijering et al.11, Goodie et al.12) Evidence is emerging that the context matters significantly.

Myopic opponents default to staying (level 0) while predictive opponents think about the other player's decision (level 1) 3-stage game In our psychological study, we observed these data on the fixed-sum game played repeatedly with differing payoffs. In these charts, a myopic opponent type does not think about player I's action while a predictive opponent type does. This chart gives the achievement score, which is the proportion of games in which the participant at the first state performed the rational action given the opponent type. Clearly, participants are thinking at two levels of reasoning and having difficulty thinking at one level in these games. In this chart, I show the proportion of participants who expect the opponent to think at level one in both categories. Finally, this chart shows the amount of inconsistency between the participants' predictions of the other's actions and their own choices, which is a measure of rationality error. There are enough rationality errors that we cannot ignore them.

Trust game (10 rounds): the Investor invests an amount I, which is tripled to 3 × I for the Trustee; the Trustee repays a fraction f of it, i.e., f · (3 × I).
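
A worked example of a single round, with illustrative numbers that are not from the study: if the Investor invests I = 10 units, the Trustee receives 3 × I = 30 units; if the Trustee then repays a fraction f = 0.5, the Investor gets back f · (3 × I) = 15 units, for a net gain of 5, while the Trustee keeps the remaining 15. This exchange repeats for 10 rounds.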

Cooperation by the Trustee and high investments in the Impersonal setting; reduced investments in the Personal setting.

Can we computationally model these strategic behaviors using generative process models? We are especially interested in process models rather than simply statistical curve fitting.

Modeling the average behavior on the Centipede game Let's begin by learning gamma – the rate of learning the opponent type – from the participants' expectations of others' actions, which sounds reasonable, and of course lambda – the rationality error – from the participants' actions. (Use participants' predictions of the other's action to learn γ and participants' actions to learn λ.) Here's how the model performs in comparison to the participants' data. Unfortunately, it does pretty badly for the predictive opponent type. On the other hand, it fits the data on participants' expectations quite well, as we should expect. So, what gives?

Learning the parameters from a different modality results in improved modeling We looked deeper and realized that the expectations data wasn't very reliable about their thinking. In particular, participants' actions were not consistent with their expectations, more so for the predictive opponent type. This could be a data collection issue – possibly inattentiveness when entering these data. Anyway, so this time we learned both gamma and lambda from the participants' choices, which were incentivized and possibly more reflective of their actual thinking. We additionally noticed that rationality errors were dropping as more games were played – a sign that participants were attending more to the tasks – so we let lambda vary linearly across the games and learned the intercept and slope. Here are the results – I would say that this is a very good fit of the data.

Generative process model to classify the types of investors and trustees (Ray et al.09) Strategy (ToM) levels Uncooperative or cooperative trustee

Uncooperative / Cooperative Trustee. 48 students at CalTech played the Impersonal task and 54 students played the Personal task.

A generated game between Investor at level 1 and uncooperative Trustee also at level 1 (both with uniform priors)

Computational Framework for the Modeling

What is a mental behavioral model? I will begin by asking two simple questions. Here's the first one. A behavioral model is a hypothesis of how the other agent is acting or could be acting.

How large is the behavioral model space? Here’s the second question:

How large is the behavioral model space? General definition: a mapping from the agent's history of observations to its actions. In order to get a handle on this question, let's start by defining a model. Most AI and game theory textbooks give this general definition of a model.

How large is the behavioral model space? 2^H → Δ(A_j) Mathematically, we may define the model as a function. Even if the agent has just two observations, the space of models is uncountably infinite. Uncountably infinite

How large is the behavioral model space? Let's assume computable models. By assuming computable models, we can get the space down to countable models. Barring domain-specific information or data, we cannot whittle this space down any further. Interestingly, this means that a large portion of the model space is not computable! Countable. A very large portion of the model space is not computable!

Daniel Dennett, Philosopher and Cognitive Scientist. Anybody recognize him? He is Daniel Dennett, a philosopher and cognitive scientist at Tufts U. and well known for his theory of mental content, which defines 3 levels of abstraction. One of these is the intentional stance, according to which an agent ascribes beliefs, preferences and intent to others in order to understand their behavior. It's analogous to the theory of mind in cognitive psychology. Intentional stance: ascribe beliefs, preferences and intent to explain others' actions (analogous to theory of mind - ToM)

Dennett’s Intentional stance A (Bayesian) Theory of Mind Ascribe beliefs, preferences and intent to explain others’ actions A (Bayesian) Theory of Mind Adults and children ascribe mental states such as beliefs to others to explain their observed actions Anybody recognize him? He is Daniel Dennett, a philosopher and cognitive scientist at Tufts U. and well known for his theory of mental content, which defines 3 levels of abstraction. One of these is the intentional stance according to which an agent ascribes beliefs, preferences and intent to others in order to understand their behavior. It’s analogous to the theory of mind in cognitive psychology.

Organize the mental models Intentional models Subintentional models Dennett’s intentional stance allows a taxonomical classification of the models into those that are intentional and those that are not, which we call subintentional.

Organize the mental models Intentional models E.g., POMDP: θ_j = ⟨ b_j, A_j, T_j, Ω_j, O_j, R_j, OC_j ⟩, BDI Subintentional models Frame Some of the examples of intentional models include a POMDP, BDI and ToM-based models. Let's pay special attention to a POMDP, which is a tuple of the agent's belief, its capabilities, and its preferences. For convenience, we can collect all the elements except for the belief into a frame. Because intentional models include beliefs, which could themselves be over others' models, using these models may give rise to recursive modeling, where an agent has some distribution over others' models, each of which in turn has a distribution, and so on. (may give rise to recursive modeling)

Organize the mental models Intentional models E.g., POMDP: θ_j = ⟨ b_j, A_j, T_j, Ω_j, O_j, R_j, OC_j ⟩, BDI Subintentional models E.g., Δ(A_j), finite state controller, plan Frame Subintentional models, on the other hand, directly give the actions. Examples include a static distribution over the actions, a finite state controller or a plan.
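
To make the taxonomy concrete, here is a minimal sketch in Python of how the two kinds of models might be represented; it is not code from our implementation, and all class and field names are hypothetical.

    from dataclasses import dataclass
    from typing import Callable, Dict

    @dataclass
    class Frame:
        """Everything in a POMDP model of agent j except its belief."""
        actions: tuple             # A_j
        transition: Callable       # T_j: (s, a_j) -> distribution over next states
        observations: tuple        # Omega_j
        observation_fn: Callable   # O_j: (s', a_j) -> distribution over observations
        reward: Callable           # R_j: (s, a_j) -> real-valued reward
        optimality_criterion: str  # OC_j, e.g., "discounted_sum"

    @dataclass
    class IntentionalModel:
        belief: Dict    # b_j: distribution over states (and possibly over models of i)
        frame: Frame    # solving the model may require recursive modeling

    @dataclass
    class SubintentionalModel:
        action_dist: Dict  # e.g., a static distribution over A_j;
                           # a finite state controller or a plan are alternatives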

Sequential decision making in single agent settings POMDP framework (Sondik&Smallwood,73)
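
At its core, a POMDP agent maintains a belief over states and updates it after each action and observation. Here is a minimal sketch of the standard belief update (generic textbook form, not tied to any particular solver):

    def pomdp_belief_update(belief, a, o, T, O, states):
        """Compute b'(s') proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s)."""
        b_next = {}
        for s_next in states:
            b_next[s_next] = O(o, s_next, a) * sum(T(s_next, s, a) * belief[s] for s in states)
        norm = sum(b_next.values())
        return {s: p / norm for s, p in b_next.items()}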

Generalize POMDPs to multiagent settings POMDPs allow principled planning under uncertainty and adequately model many problems. Computationally complex!

Joint planning (Decentralized POMDP): useful for a team of multiple agents, each controlled by its own local policy. Individual (subjective) planning (Interactive POMDP): useful in both non-cooperative and cooperative contexts. Game-theoretic equilibrium: useful for analysis but not control of a multiagent system; non-unique and incomplete.

Sequential decision making in multi-agent settings Interactive POMDP framework (Gmytrasiewicz&Doshi,05) In multiagent settings, the I-POMDP framework expands the state space to include models of the other agent. An agent in the framework updates beliefs not only over the state but the models as well. If the models are intentional, this may lead to recursive modeling as illustrated here. I-POMDP is an important framework for sequential decision making in non-cooperative settings. Include models of the other agent in the state space Update beliefs over the physical state and models
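
As a rough sketch of what expanding the state space means computationally, the belief now ranges over pairs of a physical state and a model of the other agent. The sketch below is heavily simplified and hypothetical: it assumes a finite set of candidate models, treats the models as static (the full I-POMDP update would also transition each model as agent j updates its own belief), and assumes each model can be queried for its predicted action distribution.

    def interactive_belief_update(belief, a_i, o_i, T, O_i, predict, states, models):
        """belief: dict (state, model_j) -> probability.
        predict(m_j) returns m_j's predicted distribution over j's actions."""
        b_next = {}
        for s_next in states:
            for m_j in models:
                mass = 0.0
                for (s, m), p in belief.items():
                    if m != m_j:          # simplification: models do not change
                        continue
                    for a_j, pa in predict(m_j).items():
                        mass += pa * T(s_next, s, a_i, a_j) * p
                b_next[(s_next, m_j)] = O_i(o_i, s_next, a_i) * mass
        norm = sum(b_next.values())
        return {k: v / norm for k, v in b_next.items()}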

How do the player’s beliefs change with rounds? Investor’s level 1 belief (uncoop tru. coop tru) Trustee’s level 2 belief

Several methods for solving a problem modeled as an I-POMDP exactly and approximately to obtain normative behavior (Doshi 12)

Finite model space grows as the interaction progresses An interesting observation is that even if your space of models is finite, it could grow as the interaction progresses.

Growth in the model space Other agent may receive any one of |Ω_j| observations Here's why it will grow. The other agent in the setting may receive any one of a number of observations. Because of this, it will update its belief, giving rise to a new intentional model. If at time step 0 it has |M_j| models, it will have |M_j||Ω_j| models in the next time step, and so on. Analogously for subintentional models such as finite state controllers. |M_j| → |M_j||Ω_j| → |M_j||Ω_j|² → ... → |M_j||Ω_j|^t (time steps 1, 2, ..., t)
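
A small numeric illustration of this growth (the counts are hypothetical):

    # If j starts with |M_j| = 5 candidate models and can receive |Omega_j| = 3
    # observations, each model branches into 3 updated models per step, so after
    # t steps there are |M_j| * |Omega_j|**t models.
    M0, n_obs = 5, 3
    for t in range(5):
        print(t, M0 * n_obs ** t)   # prints 5, 15, 45, 135, 405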

Growth in the model space The bottom line is that the growth in the model space is exponential with time. Exponential

General model space is large and grows exponentially as the interaction progresses So, what have we learned so far? We have learned that the general model space is very large and that it grows exponentially as the interaction progresses.

It would be great if we could compress this space! The large size of the space often motivates ad hoc ways of reducing it, driven by domain constraints and data. However, it would be great if we could find general ways of compressing this space. Such compression techniques may come at no expense in value to the modeler, or the modeler may be fine with some loss in value for greater compression. These are lossless and lossy methods, respectively. Lossless: no loss in value to the modeler. Lossy: flexible loss in value for greater compression.

General and domain-independent approach for compression Establish equivalence relations that partition the model space and retain representative models from each equivalence class

Approach #1: Behavioral equivalence (Rathanasabapathy et al.06, Pynadath&Marsella07) The first approach I will discuss considers intentional models to be equivalent if their complete solutions are identical. This is referred to as behavioral equivalence and pertains to intentional models. Intentional models whose complete solutions are identical are considered equivalent

Approach #1: Behavioral equivalence Consider this space of intentional models. Here, vertical lines denote intentional models, say POMDPs with differing beliefs. The beliefs are shown on top. Each separately shaded region denotes a policy, which is a solution to the POMDP. BE clusters intentional models within each shaded region together and selects a representative denoted by the red vertical line. Consequently, we have compressed the space from 10 models to just 3 models! If a belief is maintained over the model space, the beliefs in each region must be summed as well. We call this set of models behaviorally minimal. Behaviorally minimal set of models
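
In implementation terms, BE-based compression is a grouping of models by their complete solutions. A minimal sketch, assuming each model can be solved to a policy object that supports equality and hashing; the function names are hypothetical:

    from collections import defaultdict

    def behaviorally_minimal(models, solve, belief_over_models):
        """Keep one representative per class of models with identical policies,
        summing the belief mass of the models merged into each class."""
        classes = defaultdict(list)
        for m in models:
            classes[solve(m)].append(m)      # key: the model's complete policy
        representatives, new_belief = [], {}
        for _policy, members in classes.items():
            rep = members[0]
            representatives.append(rep)
            new_belief[rep] = sum(belief_over_models[m] for m in members)
        return representatives, new_belief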

Approach #1: Behavioral equivalence Lossless Works when intentional models have differing frames Importantly, BE based partitioning is lossless to the modeler such as the agent in the I-POMDP framework. Furthermore, this approach applies even when the intentional models have differing frames (so they could be different POMDPs).

Approach #1: Behavioral equivalence Impact on dt-planning in multiagent settings [Charts: Multiagent tiger and Multiagent MM domains.] Let's take a quick look at the impact of utilizing BE in decision-theoretic planning in multiagent settings as formalized by the I-POMDP framework. Notice the speedup we get in obtaining the same average reward compared to the exact approach without compression. This speedup is due not only to a reduced model space in the first time step but also to the minimal model space growing less. A similar speedup is seen for the other toy problem domain, Multiagent MM.

Approach #1: Behavioral equivalence Utilize model solutions (policy trees) for mitigating model growth. Model reps that are not BE may become BE from the next step onwards. Preemptively identify such models and do not update all of them.

Modeling Strategic Human Intent I would like to take this opportunity to also present some results on computationally modeling data on human strategic behavior. This should not take more than 10 minutes.

Can we adapt I-POMDPs to computationally model the strategic behaviors in Centipede and Trust games? We are especially interested in process models rather than simply statistical curve fitting.

Yes! Using a parameterized Interactive POMDP framework Recall that I had introduced the I-POMDP framework previously.

Modeling behavior in the Centipede game

Learning is slow and partial Notice that the achievement score increases as more games are played, indicating learning of the opponent models. Next, I am going to make a couple of observations that will play a key role in how we formulate our model for the behavioral data. My first observation is that all the scores we saw previously were increasing as more games were being played. The participants could observe the action that the other player took if they chose to move at the end of each game. So, learning is evident and this stimulus could be playing a role in the participant's learning process. However, the learning is slow and partial – not everybody has figured out the opponent type. We replace the I-POMDP's Bayesian belief update with a Bayesian update that underweights evidence. The underweighting is parameterized by gamma. Replace I-POMDP's normative Bayesian belief update with Bayesian learning that underweights evidence, parameterized by γ
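
One simple way to realize such underweighting (this is only a sketch of the idea, not necessarily the exact parameterization used in our model) is to exponentiate the likelihood by γ in [0, 1] before normalizing:

    def underweighted_belief_update(prior, likelihood, gamma):
        """prior: dict opponent_model -> probability.
        likelihood: dict opponent_model -> P(observed action | model).
        gamma = 1 recovers the normative Bayes update; gamma = 0 ignores the evidence."""
        post = {m: prior[m] * (likelihood[m] ** gamma) for m in prior}
        norm = sum(post.values())
        return {m: p / norm for m, p in post.items()}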

Errors appear to reduce with time Notice the presence of rationality errors in the participants' choices (the action is inconsistent with the prediction). Next, I showed a chart with rationality errors that cannot be ignored. In order to model this, we substituted the expected utility maximization in I-POMDPs with a quantal response model. This model is parameterized by lambda. So, we have a two-parameter model. Replace I-POMDP's normative expected utility maximization with a quantal response model that selects actions proportional to their utilities, parameterized by λ
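
The quantal response (logit choice) rule is standard; a minimal sketch, where larger λ approaches strict expected-utility maximization and λ = 0 yields uniformly random choice:

    import math

    def quantal_response(utilities, lam):
        """utilities: dict action -> expected utility.
        Returns P(a) proportional to exp(lam * U(a))."""
        weights = {a: math.exp(lam * u) for a, u in utilities.items()}
        norm = sum(weights.values())
        return {a: w / norm for a, w in weights.items()}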

Underweighting evidence during learning and quantal response for choice have prior psychological support

Use participants’ predictions of other’s action to learn  and participants’ actions to learn  Let’s begin, by learning gamma – the rate of learning opponent type -- from the participants’ expectations of others actions, which sounds reasonable, and of course lambda – the rationality error – from the participants’ actions. Here’s how the model performs in comparison to the participants’ data. Unfortunately, it does pretty bad for the predictive opponent type. On the other hand, it fits the data on participants’ expectations quite well, as we should expect. So, what gives?

Use participants’ actions to learn both  and  Let  vary linearly We looked deeper, and realized that the expectations data wasn’t very reliable about their thinking. In particular, participants’ actions were not consistent with their expectations more so for the predictive opponent type. This could be a data collection issue – possibly inattentiveness when entering this data. Anyway, so this time we learned both gamma and lambda from the participants’ choices which were incentivized and possibly more reflective of their actual thinking. We additionally noticed that rationality errors were dropping as more games were played – a sign that participants’ were attending more to the tasks – so we let lambda vary linearly across the games and learned the intercept and slope. Here are the results – I would say that this is a very good fit of the data.

Modeling behavior in the Trust game

The Investor uses the quantal response model for deciding on his own choices. The Investor attributes the quantal response model to the Trustee for deciding on her choices.

The Investor uses the regular (normative) Bayes rule to update his belief over the Trustee's type. A generated game by this I-POMDP-based process model.

Insights revealed by process modeling: Much evidence that participants did not make rote use of backward induction (BI), and instead engaged in recursive thinking. Rationality errors cannot be ignored when modeling human decision making and they may vary. Evidence that participants could be attributing surprising observations of others' actions to their rationality errors. Strategic investors give more, while strategic trustees coax more from the investor by projecting cooperation (reputation forming). Ok, time for some insights from this study and the computational process modeling.

Piotr Gmytrasiewicz, Faculty, Univ. of Illinois at Chicago. Yifeng Zeng, Faculty, Teesside Univ., expert on graphical models. Yingke Chen, postdoctoral associate. Xia Qu, Ph.D. 2014. Muthu Chandrasekaran, doctoral candidate. Ekhlas Sonu. The research discussed in the first part of the talk is the outcome of a close collaboration with Yifeng Zeng, who recently joined the faculty of Teesside Univ., UK, his students, Yingke Chen and Hua Mao, and my student, Muthu C.

Adam Goodie, Professor of Psychology, UGA. Xia Qu, doctoral student. Roi Ceren. Matthew Meisel. This line of research is an ongoing collaboration with Adam Goodie, a cognitive psychologist at UGA, and his student, Matthew Meisel, and my students, Xia Qu and Roi Ceren.

HRI Application: Robot learning from a human teacher using I-POMDPs (Woodward&Wood,10). Advances learning from demonstration and learning from reinforcement.

State. Observations = {words, signals, gestures, …}. Actions = {interruption, clarification, correction}.

The robot benefits from recursively modeling the teacher as it can estimate the teacher's intent: interrupt the teacher to inform him that concept X is clear; clarify whether the teacher intends to teach concept Y; communicate that the robot believes that the teacher believes that the robot was asking about concept X, but it actually asked about concept Y.

Thank you for your time

Absolute continuity condition (ACC) Let’s look at an important condition on the agent’s initial belief. I have seldom found mention of this condition in the PAIR literature.

ACC: Subjective distribution over histories vs. true distribution over histories. We need two distributions in order to understand ACC. The first one pertains to how the agent thinks the interaction could evolve: which histories are possible given its own policy or plan and its belief over the other agent's models or plans. The second one pertains to how the interaction could actually evolve – in other words, the set of histories possible given the policies or plans of both agents. ACC states that the subjective distribution should assign a non-zero probability to each history possible in the true distribution.
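
Stated compactly, with Pr_i denoting the agent's subjective distribution over histories and Pr the true distribution: for every history h, Pr(h) > 0 implies Pr_i(h) > 0. That is, the true distribution must be absolutely continuous with respect to the subjective one, which is where the condition gets its name.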

ACC is a sufficient and necessary condition for Bayesian update of belief over models So, why is ACC important. It represents a sufficient and necessary condition for a successful Bayesian belief update over models.

How do we satisfy ACC? Cautious beliefs (full prior). How should an agent satisfy ACC given that it doesn't know the true distribution (it doesn't know the actual plan or policy of the other agent)? One way is to start with a full prior, which would then assign a non-zero probability to the true model of the other agent. Subsequently, the agent's belief has a grain of truth, and this is sufficient for satisfying ACC though it's not necessary. See me after the talk if you are interested in knowing why the grain of truth is not necessary. Grain of truth assumption: a prior with a grain of truth is sufficient but not necessary.

Approach #2: ε-Behavioral equivalence (Zeng et al.11,12) Redefine BE Is there a notion of approximate BE? The difficulty is in comparing two policies side by side and being able to state that these two policies are approximately similar. A difference in even a single action between two models could make a world of difference to the modeler. So, is there a way? There is no clear way, but we can make partial inroads. Let's start by redefining BE.

Approach #2: Revisit BE (Zeng et al.11,12) Intentional models whose partial depth-d solutions are identical and whose vectors of updated beliefs at the leaves of the partial trees are identical are considered equivalent. Let's define two intentional models to be BE if their partial solutions (policy trees) up to depth d are equal and the two vectors of updated beliefs at the leaves of the partial trees are equal. This works because beliefs are a sufficient statistic: if the beliefs are the same, the future behavior is the same. This redefined BE is also lossless provided the frames are the same. Furthermore, this redefinition is sufficient but not necessary – it may miss out on some BE models and create a finer partition than necessary. I would be glad to explain why offline. Lossless if frames are identical. Sufficient but not necessary.

Approach #2: (ε,d)-Behavioral equivalence Two models are (ε,d)-BE if their partial depth-d solutions are identical and the vectors of updated beliefs at the leaves of the partial trees differ by at most ε. This redefinition paves the way for defining an approximate BE, which we refer to as (ε,d)-BE. We utilize the same definition as before except that now we let the vectors of beliefs at the leaves of the two policy tree solutions diverge by at most epsilon and consider such models to be equivalent. Here's an illustration: these are policy tree solutions of two intentional models in the multiagent tiger problem. Notice that the depth-1 trees are identical and the beliefs at the leaves of the two trees diverge by at most an epsilon of 0.33. Therefore, the two intentional models that produced these trees are (0.33,1)-BE. Obviously, this approximate BE leads to a lossy compression. Models are (0.33,1)-BE. Lossy
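
In code, the check might look like the sketch below, assuming each model's solution exposes its depth-d policy tree and the belief vectors at the tree's leaves; the divergence measure (here, L1 distance) and all names are assumptions:

    def epsilon_d_equivalent(model_a, model_b, d, eps, partial_solution):
        """Two models are (eps, d)-BE if their depth-d policy trees are identical and
        corresponding leaf beliefs differ by at most eps."""
        tree_a, leaves_a = partial_solution(model_a, d)
        tree_b, leaves_b = partial_solution(model_b, d)
        if tree_a != tree_b:
            return False
        return all(
            sum(abs(x - y) for x, y in zip(ba, bb)) <= eps
            for ba, bb in zip(leaves_a, leaves_b)
        )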

Approach #2: ε-Behavioral equivalence Lemma (Boyen&Koller98): KL divergence between two distributions in a discrete Markov stochastic process reduces or remains the same after a transition, with the mixing rate acting as a discount factor. Let's note that the previous definition has two parameters, epsilon and d. Consider this lemma from Boyen and Koller that extends a well known result for Markov stochastic processes. It says that the KL divergence between two posterior distributions remains the same or reduces by a discount factor called the mixing rate after a transition. The mixing rate is precisely what its name implies – how much two posteriors overlap or mix after a single transition. Note that this is a property of the problem and it may be precomputed from the problem definition. The mixing rate represents the minimal amount by which the posterior distributions agree with each other after one transition. Property of a problem and may be pre-computed.

Approach #2: ε-Behavioral equivalence Given the mixing rate and a bound, ε, on the divergence between two belief vectors, the lemma allows computing the depth, d, at which the bound is reached. This lemma is very useful to us because we can use it to derive the depth of the partial trees to compare, d, given epsilon, because the belief update is a stochastic process. Given this d, we compare two policy tree solutions up to this depth for equality only. Compare two solutions up to depth d for equality.
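
Reading the lemma as a per-step contraction of the divergence, one way to compute d is sketched below. This interprets the mixing rate F so that the divergence shrinks by a factor of (1 - F) per transition (consistent with the boundary cases discussed next), with the initial divergence D0 between the belief vectors as an input; it is a sketch of the reasoning, not the exact derivation from the paper.

    import math

    def depth_for_bound(D0, F, eps):
        """Smallest d such that (1 - F)**d * D0 <= eps, for a mixing rate 0 < F < 1.
        F = 1 would give d = 1, and F = 0 would never reach the bound (handled on
        the next slide by setting d to the horizon)."""
        if D0 <= eps:
            return 0
        return math.ceil(math.log(eps / D0) / math.log(1.0 - F))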

Approach #2: ε-Behavioral equivalence Impact on dt-planning in multiagent settings Discount factor F = 0.5 [Charts: Multiagent Concert domain.] Let's see the impact of this approximate version of BE on dt-planning as performed using I-POMDPs. Here DMU represents exact BE, and it's clear that ε-BE is significantly more efficient in obtaining the same reward. Of course, we are matching the reward by planning over a longer horizon in the case of ε-BE. Notice that ε-BE when ε is zero produces more equivalence classes, which validates that the redefinition creates a finer partition than is necessary. However, increasing ε reduces the number of equivalence classes. Here's another result, this time on a more realistic and large problem domain. On a UAV reconnaissance problem in a 5x5 grid, it allows the solution to scale to a 10-step lookahead in 20 minutes.

Approach #2: ε-Behavioral equivalence What is the value of d when some problems exhibit F with a value of 0 or 1? In addition to the constraint that frames be the same, we need to be aware of another piece of fine print. We may encounter problems with the mixing rate being 0 or 1. If it is 1, we set d to 1. If it is 0, we set d to the horizon and compare complete trees. F=1 implies that the KL divergence is 0 after one step: set d = 1. F=0 implies that the KL divergence does not reduce: arbitrarily set d to the horizon.

Approach #3: Action equivalence (Zeng et al.09,12) Intentional or subintentional models whose predictions at time step t (action distributions) are identical are considered equivalent at t. Let's continue in the direction of reducing how much of the solution we need to compare for deeming two models to be equivalent. In this approach, we will compare, at each time step, the action predicted by the models at that time step. If the action is identical, the two models are action equivalent at that time step. Let me note here that this equivalence applies to both intentional and subintentional models.

Approach #3: Action equivalence Let me illustrate AE using model solutions from the multiagent tiger problem. Here, we have 4 models (with these initial beliefs) whose policy trees are merged bottom up, indicating that we have compressed the model space using BE. Now at each time step, AE requires that we group models with the same action. It produces these equivalence classes. An important change is that each equivalence class now makes a probabilistic transition to another class at the next time step. Also, notice that after the grouping, the prediction is that the other agent can open the right door (OR), then listen, and then OR again. However, this wasn't possible previously. Consequently, AE introduces an approximation.
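
A sketch of the per-time-step grouping, assuming each model (intentional or subintentional) can be queried for its predicted action at step t; the names are hypothetical:

    from collections import defaultdict

    def action_equivalence_classes(models, predicted_action, t):
        """Group models by the action they predict at time step t; the number of
        classes is bounded by the number of distinct actions."""
        classes = defaultdict(list)
        for m in models:
            classes[predicted_action(m, t)].append(m)
        return dict(classes)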

Approach #3: Action equivalence Lossy Works when intentional models have differing frames AE is lossy and works when intentional models may have different frames as well.

Approach #3: Action equivalence Impact on dt-planning in multiagent settings Multiagent tiger Let's see the impact of AE on dt-planning in comparison to the equivalences we have seen so far. Interestingly, AE is able to reach the reward levels of exact BE, but that is likely due to being able to plan for longer horizons given the improved efficiency. An appealing benefit of AE is that using it always bounds the model space at each time step t to the largest number of distinct actions. AE bounds the model space at each time step to the number of distinct actions.

Open questions Fortunately, there are several open questions and this direction is very much a work in progress.

N > 2 agents Under what conditions could equivalent models belonging to different agents be grouped together into an equivalence class? One condition could be agent anonymity. It doesn’t matter which other agent performed the particular action.

Can we avoid solving models by using heuristics for identifying approximately equivalent models? A disadvantage of these compression techniques is that the models need to be solved. If the models are POMDPs, the solution complexity could get prohibitive. So, a direction of investigation with immediate benefit would be to find heuristics for identifying approximately equivalent models without solving the models, such as by comparing initial beliefs in the case of intentional models or distributions over future paths of interactions for subintentional models.