Modified MDPs for Concurrent Execution
AnYuan Guo, Victor Lesser
University of Massachusetts

Concurrent Execution
A set of tasks where each task is relatively easy to solve on its own, but when executed concurrently, new interactions arise that complicate the execution of the composite task.
- A single agent executing multiple tasks in parallel (example: an office robot)
- Multiple agents acting in parallel (a team)

Cross-Product MDP
The problem of concurrent execution can be solved optimally by solving the cross-product MDP formed from the separate processes.
Problem: exponential blow-up in the size of the joint state and action spaces.
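
As a rough illustration (not from the slides; the component sizes below are hypothetical), a minimal sketch of how quickly the cross-product MDP grows when independent tasks are combined:

```python
from itertools import product

# Hypothetical component MDPs, each small on its own.
component_states = [range(25), range(25), range(25)]   # e.g., three 5x5 grid tasks
component_actions = [range(8), range(8), range(8)]     # 8 moves per task

# Cross-product MDP: a joint state is a tuple of component states,
# a joint action is a tuple of component actions.
joint_states = list(product(*component_states))
joint_actions = list(product(*component_actions))

print(len(joint_states))   # 25^3 = 15625 joint states
print(len(joint_actions))  # 8^3  = 512 joint actions
# The joint transition table has |S|^2 * |A| entries: about 1.25e11 even for
# this tiny example, which is the exponential blow-up the slide refers to.
```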

Related Work
- Deterministic planning: situation calculus [Reiter96]; extending STRIPS [Boutilier97, Knoblock94]
- Termination schemes for temporally extended actions [Rohanimanesh03]
- Planning in the cross-product MDP [Singh98]
- Learning: W-learning [Humphrys96], MAXQ [Dietterich00]

The Goal
Break apart the interactions and encapsulate their effects within each agent's own model, so that the tasks can again be solved independently.

Algorithm Summary
1) Define the types of events and interactions of interest
2) Summarize the other agent's effect on this agent as a statistic: how often the constraining event occurs
3) Modify this agent's own model to reflect that statistic
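
A structural sketch of this loop (my own formulation, not the paper's code; `event_frequency`, `modify_model`, and the other parameter names are hypothetical placeholders supplied by the caller):

```python
def solve_concurrent(mdps, constraints, order,
                     event_frequency, modify_model, solve_mdp):
    """Solve a set of one-way constrained MDPs independently, in dependency order.

    mdps            -- dict: task name -> task model
    constraints     -- dict: task name -> list of (source task, event) pairs;
                       assumed to form a DAG (see the Assumptions slide)
    order           -- task names in an order consistent with that DAG
    event_frequency -- callable(model, policy, event) -> frequency of the event
    modify_model    -- callable(model, event, frequency) -> modified model
    solve_mdp       -- any single-MDP solver, e.g. value iteration
    """
    policies = {}
    for task in order:
        model = mdps[task]
        for source, event in constraints.get(task, []):
            # Summarize the constraining task by how often its event occurs
            # under the policy that task will actually execute.
            freq = event_frequency(mdps[source], policies[source], event)
            # Fold that statistic into this task's own transition model.
            model = modify_model(model, event, freq)
        # Solve the (modified) task on its own.
        policies[task] = solve_mdp(model)
    return policies
```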

Events in an MDP
- State-based events (the agent enters s5)
- Action-based events (the agent moves north one step)
- State-action-based events (the agent moves north one step from s4)
Events in MDP 1 can affect events in MDP 2, giving 3 x 3 = 9 types of interactions.
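
To make the count concrete, a tiny illustration (not from the slides) enumerating the interaction types as ordered pairs of event kinds:

```python
from itertools import product

event_kinds = ("state", "action", "state-action")
# An interaction pairs an event kind in MDP 1 with the event kind it affects in MDP 2.
interactions = list(product(event_kinds, event_kinds))
print(len(interactions))  # 9
for source, target in interactions:
    print(f"{source} event in MDP 1 -> {target} event in MDP 2")
```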

Assumptions
- The list of possible interactions between the MDPs is given
- The constraints are one-way only: the effects do not propagate back to the originator of the constraint

Directed Acyclic Constraints
Constraints between a set of events that form a directed acyclic graph.

Event Frequency & MDP Modification
1) Calculate the frequency of the constraining event
2) Modify the MDP to account for it

Calculating State Visitation Frequency
Given a policy, solve a system of simultaneous linear equations for the visitation frequency of each state, subject to a normalization constraint.
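
The equations on the original slide were not transcribed; as a hedged reconstruction, the standard balance equations for the visitation frequency F(s) of the Markov chain induced by a fixed policy \(\pi\) are (the slide's exact formulation, e.g. any initial-state term, may differ):

```latex
\[
  F(s') \;=\; \sum_{s \in S} F(s)\, P\bigl(s' \mid s, \pi(s)\bigr)
  \qquad \text{for all } s' \in S,
\]
\[
  \text{under the constraint that} \qquad \sum_{s \in S} F(s) \;=\; 1 .
\]
```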

Calculating Action Frequencies
Given a policy, the action frequency F(a) is the sum of the visitation frequencies of all the states in which action a is executed.
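
The formula itself did not survive transcription; a reconstruction consistent with the slide's wording, for a deterministic policy \(\pi\), is:

```latex
\[
  F(a) \;=\; \sum_{\substack{s \in S \\ \pi(s) = a}} F(s).
\]
```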

Calculating State-Action Frequencies
Now both the action and the state at which it is executed matter. The definition also generalizes to a set of states and actions.
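
Only the piecewise labels ("if", "otherwise") of the slide's equation survive in the transcript; a hedged reconstruction consistent with them, for a deterministic policy \(\pi\), is:

```latex
\[
  F(s, a) \;=\;
  \begin{cases}
    F(s) & \text{if } \pi(s) = a, \\[2pt]
    0    & \text{otherwise,}
  \end{cases}
  \qquad
  F(\mathcal{S}', \mathcal{A}') \;=\; \sum_{s \in \mathcal{S}'} \sum_{a \in \mathcal{A}'} F(s, a)
\]
```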

Account for the Effects of Constraints
- Modify the model: modify the transition probability table
- Intuition: other agents can change the dynamics of my environment
[Figure: example with two agents, A1 and A2]

Account for State-Based Events
A constraint from another task can affect the current task's ability to enter certain states.
A slice of the transition probability table under action a1 (rows: from-state, columns: to-state):
           to s1          to s2          to s3
from s1    P(s1,a1,s1)    P(s1,a1,s2)    P(s1,a1,s3)
from s2    P(s2,a1,s1)    P(s2,a1,s2)    P(s2,a1,s3)
from s3    P(s3,a1,s1)    P(s3,a1,s2)    P(s3,a1,s3)
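
A hedged sketch of one way to fold the event frequency into the table for a state-based constraint (my own formulation, not necessarily the paper's exact update): with probability equal to the event frequency, probability mass that would enter the blocked state stays in the current state instead.

```python
import numpy as np

def block_state(P, blocked, freq):
    """Modify a transition table P[s, a, s'] for a state-based constraint.

    blocked -- index of the state the other task makes unavailable
    freq    -- frequency with which the constraining event occurs (0..1)
    Assumption (mine): when the event is active, mass headed into `blocked`
    is redirected back to the current state.
    """
    P_blocked = P.copy()
    mass = P_blocked[:, :, blocked].copy()    # probability of entering the blocked state
    P_blocked[:, :, blocked] = 0.0
    for s in range(P.shape[0]):
        P_blocked[s, :, s] += mass[s, :]      # redirect that mass to "stay put"
    # Mix original and blocked dynamics by how often the event occurs.
    return (1.0 - freq) * P + freq * P_blocked
```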

Account for Action-Based Events
A constraint from another task can affect the current task's ability to carry out certain actions.
Transition probability table for the affected action a1 (rows: from-state, columns: to-state):
           to s1          to s2          to s3
from s1    P(s1,a1,s1)    P(s1,a1,s2)    P(s1,a1,s3)
from s2    P(s2,a1,s1)    P(s2,a1,s2)    P(s2,a1,s3)
from s3    P(s3,a1,s1)    P(s3,a1,s2)    P(s3,a1,s3)
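
For an action-based constraint, a hedged sketch in the same spirit (my own formulation): when the other task's event is active, the affected action has no effect and the agent stays where it is.

```python
import numpy as np

def block_action(P, blocked_action, freq):
    """Modify P[s, a, s'] so that `blocked_action` fails with probability `freq`."""
    S = P.shape[0]
    noop = np.eye(S)                          # "stay put" dynamics
    P_new = P.copy()
    P_new[:, blocked_action, :] = (
        (1.0 - freq) * P[:, blocked_action, :] + freq * noop
    )
    return P_new
```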

Account for State-Action-Based Events
A constraint from another task can affect the current task's ability to carry out certain actions at certain states.
Transition probability table for the affected action a1 (rows: from-state, columns: to-state):
           to s1          to s2          to s3
from s1    P(s1,a1,s1)    P(s1,a1,s2)    P(s1,a1,s3)
from s2    P(s2,a1,s1)    P(s2,a1,s2)    P(s2,a1,s3)
from s3    P(s3,a1,s1)    P(s3,a1,s2)    P(s3,a1,s3)
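
Combining the two previous sketches, the state-action case touches only the affected row (again my own hedged formulation):

```python
import numpy as np

def block_state_action(P, blocked_state, blocked_action, freq):
    """The affected action fails, with probability `freq`, only from one state."""
    S = P.shape[0]
    noop = np.zeros(S)
    noop[blocked_state] = 1.0                 # stay put when the event is active
    P_new = P.copy()
    P_new[blocked_state, blocked_action, :] = (
        (1.0 - freq) * P[blocked_state, blocked_action, :] + freq * noop
    )
    return P_new
```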

Experiments
The mountain climbing scenario:
- States: the location of the agent
- Actions: move up, down, left, right, or any of the 4 diagonal steps (8 total)
- Transitions: probability 0.05 of slipping to an adjacent state rather than the intended one
- Rewards: -1 per move (-3 for a diagonal move), 100 for the goal
- Constraint: agent 1 taking the "up" action prevents agent 2 from doing so
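
To connect the scenario to the earlier sketches, a small illustration with hypothetical numbers (the grid size, placeholder dynamics, and the 30% frequency are mine; `block_action` is the sketch from the action-based-events slide):

```python
import numpy as np

# The 8 moves of the mountain-climbing scenario (names are mine).
ACTIONS = ["up", "down", "left", "right",
           "up-left", "up-right", "down-left", "down-right"]
UP = ACTIONS.index("up")

# Placeholder transition table for agent 2 (assume a 25-state grid built elsewhere).
S, A = 25, len(ACTIONS)
rng = np.random.default_rng(0)
P2 = rng.random((S, A, S))
P2 /= P2.sum(axis=2, keepdims=True)           # rows sum to 1

# Hypothetical summary of agent 1: under its policy it takes "up" 30% of the time.
freq_up_agent1 = 0.30

# Apply the action-based modification before solving agent 2's task on its own.
P2_modified = block_action(P2, UP, freq_up_agent1)
```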

Results: Policies
[Figures: policies when the tasks are executed independently vs. policies when executed concurrently, after applying the algorithm]

Results
[Plot: average value of the policy as a function of the size of the state space]

Improvements
- Explore different ways to modify the MDP (e.g., shrink the action set)
- Relax the directed-acyclic constraint restriction (take an iterative approach)
- Show that it is optimal for summaries that consist of a single random variable

New Directions
Different types of summaries:
- steady-state behavior (current work)
- multi-state summaries
- summaries with temporal information
Dynamic task arrival/departure:
- given some model of arrival
- without a model (learning)
Positive interactions (e.g., enable)

The End