Presentation transcript:

Sequential Decision Making in Repeated Coalition Formation under Uncertainty

Georgios Chalkiadakis (School of Electronics and Computer Science, University of Southampton, Southampton, United Kingdom)
Craig Boutilier (Department of Computer Science, University of Toronto, Toronto, Canada)

Coalition Formation Reasoning under Type Uncertainty

[Figure: agents p1-p3, c1-c3, e1-e3 partitioned into a coalition structure CS = ⟨C0, C1, C2⟩; one agent wonders: "I believe that some guys are better than my current partners... but is there any possible coalition that can guarantee me a higher payoff share?"]

 Beliefs are over types; types reflect capabilities (private information).
 Agents have to: decide who to join, decide how to act, and decide how to share the coalitional value / utility.
 Example: within the structure CS = ⟨C0, C1, C2⟩, coalition C0 = {p1, c2, e1} chooses its part of the action vector a = ⟨a_C0, a_C1, a_C2⟩; the resulting coalitional value, e.g. u(C0 | a_C0) = 30, is then allocated among C0's members.

Type Uncertainty: It Matters!

 Beyond the choice of coalition structure CS = ⟨C0, C1, C2⟩, there is action-related uncertainty, action outcomes are stochastic, and no superadditivity assumptions are made.
 Agents have their own beliefs about the types (capabilities) of others.
 Type uncertainty therefore translates into value uncertainty: according to agent i, what is the value (quality) of a given coalition?

A Bayesian Coalition Formation Model

 N agents; each agent i has a type t ∈ T_i, and the set of type profiles is T = T_1 × ... × T_N.
 For any coalition C, agent i has beliefs about the types of C's members.
 Coalitional actions (i.e., choices of task) for C: A_C.
 An action's outcome s ∈ S depends stochastically on the actual members' types, with probability P(s | a_C, t_C).
 Each outcome s results in some reward R(s).
 Consequently, each agent i has a (possibly different) estimate of the value of any coalition C.

Optimal Repeated Coalition Formation

 A belief-state MDP formulation addresses the induced exploration-exploitation problem: its equations account for the sequential value of coalitional agreements.
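To make the model above concrete, here is a minimal Python sketch, not taken from the paper, of the Bayesian type-belief update and a myopic coalition-value estimate. It assumes discrete type sets, independent beliefs over members, and a known outcome model; all names (joint_type_prior, outcome_prob, reward, and so on) are illustrative.

from itertools import product

def joint_type_prior(member_beliefs):
    """Beliefs over a coalition's joint type vector, assuming independent
    beliefs per member. member_beliefs: list of dicts {type: probability}."""
    joint = {}
    for combo in product(*[list(b.items()) for b in member_beliefs]):
        types = tuple(t for t, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        joint[types] = prob
    return joint

def update_beliefs(joint, action, outcome, outcome_prob):
    """Bayesian update: P(types | s, a) is proportional to P(s | a, types) * P(types)."""
    posterior = {t: p * outcome_prob(outcome, action, t) for t, p in joint.items()}
    z = sum(posterior.values())
    return {t: p / z for t, p in posterior.items()} if z > 0 else joint

def expected_coalition_value(joint, actions, outcomes, outcome_prob, reward):
    """Myopic value estimate of a coalition: the best coalitional action's
    expected immediate reward under the agent's current type beliefs."""
    best = float("-inf")
    for a in actions:
        ev = sum(p * sum(outcome_prob(s, a, t) * reward(s) for s in outcomes)
                 for t, p in joint.items())
        best = max(best, ev)
    return best

if __name__ == "__main__":
    # Toy setting: two members, each either "good" or "bad" at the task.
    beliefs = [{"good": 0.6, "bad": 0.4}, {"good": 0.3, "bad": 0.7}]
    outcomes = ["success", "failure"]
    actions = ["do_task"]

    def outcome_prob(s, a, types):
        p_success = 0.9 if all(t == "good" for t in types) else 0.4
        return p_success if s == "success" else 1.0 - p_success

    def reward(s):
        return 30.0 if s == "success" else 0.0

    joint = joint_type_prior(beliefs)
    print(expected_coalition_value(joint, actions, outcomes, outcome_prob, reward))
    # After observing a success, beliefs shift towards "good" type vectors.
    print(update_beliefs(joint, "do_task", "success", outcome_prob))

The approximation algorithms listed next differ in how far beyond this myopic estimate they look when valuing a coalitional agreement.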
Approximation Algorithms

 One-step lookahead (OSLA): performs a one-step lookahead in belief space; it takes into account the long-term impact of a coalitional agreement (i.e., the value of information), through belief-state updating and the incorporation of the belief-state value into the calculations.
 VPI exploration: estimates the Value of Perfect Information regarding coalitional agreements.
 VPI-over-OSLA: combines VPI with OSLA.
 Maximum a Posteriori (MAP): uses the most likely type vector given the beliefs.
 Myopic: calculates expectations disregarding the sequential value of formation decisions; it only takes into account the immediate reward from forming a coalition and executing an action.

Example experiment: "The Good, the Bad, and the Ugly". Performance is reported as discounted accumulated rewards and as the total actual rewards gathered during the "Big Crime" phase.

VPI is a winner!

 VPI balances the expected gain against the expected cost of executing a suboptimal action:
 - It uses the current model to myopically evaluate the actions' expected utility (EU).
 - It assumes an action results in perfect information regarding its Q-value; this perfect information has non-zero value only if it results in a change of policy.
 - EVPI is calculated and accounted for during action selection: act greedily with respect to EU + EVPI (a minimal sketch appears at the end of this transcript).
 The approach is (a) Bayesian and yet (b) efficient: it uses a myopic evaluation of actions, but boosts their desirability with EVPI estimates.
 It consistently outperforms the other approximation algorithms.
 It scales to dozens or hundreds of agents, unlike the lookahead approaches.

Ongoing and Future Work

 We have also applied VPI in a disaster management setting.
 We investigate overlapping coalition formation models.
 We have recast RL algorithms and sequential decision-making ideas within a computational trust framework (we beat the winner of the international ART competition!). Paper in this AAMAS: W.T.L. Teacy, Georgios Chalkiadakis, A. Rogers and N.R. Jennings, "Sequential Decision Making with Untrustworthy Service Providers".
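To illustrate the VPI computation described under "VPI is a winner!" above, here is a minimal Python sketch of myopic EVPI-based action selection. It is an assumption-laden illustration rather than the authors' implementation: action values are represented by samples (for instance, drawn by sampling type vectors from the current beliefs), and all names (q_samples, vpi_select, and so on) are hypothetical.

def expected_value(samples):
    """Sample-based expected utility of an action."""
    return sum(samples) / len(samples)

def evpi(action, q_samples):
    """Expected Value of Perfect Information about `action`'s true value.

    q_samples maps each candidate action to a list of sampled values.
    Perfect information has non-zero value only where it would change
    the greedy policy."""
    means = {a: expected_value(s) for a, s in q_samples.items()}
    best = max(means, key=means.get)
    runner_up = max((v for a, v in means.items() if a != best),
                    default=means[best])
    gains = []
    for q in q_samples[action]:
        if action == best:
            # Learning that the presumed-best action is actually worse than
            # the runner-up would make us switch to the runner-up.
            gains.append(max(runner_up - q, 0.0))
        else:
            # Learning that this action actually beats the presumed-best one
            # would make us switch to it.
            gains.append(max(q - means[best], 0.0))
    return sum(gains) / len(gains)

def vpi_select(q_samples):
    """Act greedily with respect to EU + EVPI over candidate agreements."""
    means = {a: expected_value(s) for a, s in q_samples.items()}
    return max(q_samples, key=lambda a: means[a] + evpi(a, q_samples))

if __name__ == "__main__":
    # Two hypothetical coalitional agreements: the second has a slightly
    # lower mean value but a much higher value of information.
    samples = {"join_C0": [28.0, 31.0, 30.0], "join_C1": [25.0, 40.0, 20.0]}
    print(vpi_select(samples))  # -> "join_C1"

The key design point mirrors the description above: perfect information about an action's value contributes nothing unless it could flip the greedy choice, so exploration is directed only at agreements whose resolution might actually change the policy.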