Team Exploration vs Exploitation with Finite Budgets
David Castañón
Boston University, Center for Information and Systems Engineering

Introduction
Exploration vs exploitation is a classic tradeoff in decision problems under uncertainty
- Exploit available information vs. improve information
- Numerous applications in finance, adaptive control, machine learning, …
Interested in paradigms for teams of agents
- Search for and exploit information with limited resources
- Task partitioning, motivated by applications (e.g. surveillance)
Objective: techniques for team control of activities
- Improve coordination among team members
- Allow for variations on human roles in the team
- Understand aspects of task partitioning for mixed human/automata teams

Experimental Platform
Multiple robots search for and perform tasks
- Can provide varying levels of operator control: human-automata teams
- Control the information displayed and the risk to each operator using video
Possible model for interesting problems
[Images: Boeing and BU robot platforms]

Illustration: Dynamic Search/Classify Problems
Multiple agents with different fields of regard
Multiple sites to search for potential objects using noisy measurements, then classify them
Agents have search and exploitation modes, and a finite budget

Related Work
Quickest detection problem
- Noisy measurements of alternative hypotheses
- Trade off decision accuracy versus the time (cost) needed to make a decision
Result: the optimal strategy is to decide as soon as the log-likelihood ratio leaves a threshold interval (see the sketch below)
- Sequential Probability Ratio Test (SPRT)
- Log-likelihood ratio → drift-diffusion model (Cohen-Holmes, …)
- Correlates well with human decision strategies on sequential repeated tasks
Limitation: single sensing mode, single site
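To make the threshold rule concrete, here is a minimal SPRT sketch for Bernoulli observations; the success probabilities p0 and p1 and the error targets alpha and beta are illustrative assumptions, not values from the talk.

```python
import math
import random

def sprt(observations, p1=0.7, p0=0.3, alpha=0.05, beta=0.05):
    """Sequential Probability Ratio Test between H1 (success prob p1)
    and H0 (success prob p0) for a stream of Bernoulli observations."""
    upper = math.log((1 - beta) / alpha)  # cross above -> accept H1
    lower = math.log(beta / (1 - alpha))  # cross below -> accept H0
    llr = 0.0
    for t, y in enumerate(observations, start=1):
        # accumulate the log-likelihood ratio one observation at a time
        llr += math.log((p1 if y else 1 - p1) / (p0 if y else 1 - p0))
        if llr >= upper:
            return "H1", t
        if llr <= lower:
            return "H0", t
    return "undecided", len(observations)

# Example: data generated under H1; the test usually stops well early
obs = [random.random() < 0.7 for _ in range(200)]
print(sprt(obs))
```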

Multi-Armed Bandits
Robbins (1950s), Gittins (1970s), many others
Independent machines
- Each machine has an individual state with random dynamics
- State evolves when the machine is played, stationary otherwise
- Random payoff depending on the state of the machine
Objective: infinite-horizon sum of discounted rewards
- Repeated decisions among finite alternatives
Result: the optimal policy is based on Gittins indices, computed independently for each machine from its current state (an index-policy sketch follows)
- Play the machine with the largest current Gittins index next
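Gittins indices are expensive to compute in general, so the sketch below uses the simpler UCB1 index as a stand-in to illustrate the "play the largest index" structure; the arm probabilities and horizon are illustrative.

```python
import math
import random

def ucb1(arms, horizon, rng=random.Random(0)):
    """Index policy illustration: at each step, play the arm with the
    largest index (here UCB1, a tractable stand-in for Gittins indices)."""
    n = [0] * len(arms)       # plays per arm
    mean = [0.0] * len(arms)  # empirical mean reward per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= len(arms):
            a = t - 1  # play each arm once to initialize its index
        else:
            a = max(range(len(arms)),
                    key=lambda i: mean[i] + math.sqrt(2 * math.log(t) / n[i]))
        r = float(rng.random() < arms[a])  # Bernoulli reward
        n[a] += 1
        mean[a] += (r - mean[a]) / n[a]    # incremental mean update
        total += r
    return total

print(ucb1(arms=[0.2, 0.5, 0.8], horizon=5000))
```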

Human Exploration/Exploitation in Multi-Armed Bandit Problems
Multi-armed bandit paradigm
- Complex choice task with a simple structure to the normative solution
- Can model human choice with heuristics/simple strategies that approximate Gittins indices and include parameters for human variability
Daw et al. (05, 06): time-varying environments; propose alternative models for exploration/exploitation using soft-max and other random decision rules
Steyvers et al. (09), Zhang et al. (09): finite-horizon total-reward paradigm, with models using a latent variable to encourage exploration vs exploitation
Yu & Dayan (05), Aston-Jones & Cohen (05), others: propose mechanisms underlying brain activity in aspects of exploration vs exploitation
Cohen et al. (07): highlight limitations of the multi-armed bandit paradigm and discuss directions for future work

Limits of the Bandit Paradigm
Stationary environments
- Tasks do not evolve unless acted on
- Likelihood of success at a task plus follow-on tasks is time-invariant
- Set of possible tasks is time-invariant
Single action per time step
- Not geared to team activities
Infinite-horizon objective
- Unbounded resources
Single type of action per bandit
- Cannot vary the choice of action
→ Interested in other paradigms that remove these limitations

A Different Paradigm: Spatial Resource Allocation
Multiple agents with finite resources
- Multiple actions per agent
- Different action types per agent/location
- Similar to classical search theory
- Focus on the effectiveness of collective team actions
Problem: M team members, N locations, and actions x_ij from member j to location i (a formulation in this spirit is sketched below)
Solution mechanism: pricing of resources + duality
- Gives an allocation, but no tradeoff between exploration and exploitation
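The transcript drops the slide's optimization problem; a plausible static allocation consistent with the description, with assumed rewards r_ij, costs c_ij, and budgets R_j, is:

```latex
\max_{x \ge 0} \; \sum_{i=1}^{N} \sum_{j=1}^{M} r_{ij}\, x_{ij}
\qquad \text{s.t.} \qquad
\sum_{i=1}^{N} c_{ij}\, x_{ij} \le R_j, \quad j = 1, \dots, M
```

Dualizing the budget constraints with prices λ_j decouples the problem across locations, which is the pricing mechanism referred to above.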

Extension: Dynamics
Allow learning of information based on outcomes of actions
- Actions on cell i from agent j at time t, x_ij(t), yield information about the content of the cell
- Different actions → different quality of information, different effort
- Bayesian integration of information from multiple actions and times (a one-step update is sketched below)
- Exploration: inexpensive actions to identify valuable locations (e.g. detect activity)
- Exploitation: expensive actions to collect value (e.g. identify objects, etc.)
Allow dynamic arrivals and departures of objects in cells
Problem: a normative solution of such problems requires full stochastic dynamic programming
- Very large information space: probability measures on the product space of possible cell contents per time
- Combinatorial explosion in the number of control actions
→ Complex problem to solve, even for simple versions
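A minimal sketch of the per-cell Bayesian measurement update described above; the two-hypothesis cell model and the likelihood table are illustrative assumptions.

```python
def bayes_update(belief, likelihood, y):
    """Update a discrete belief over cell content x given observation y.

    belief:     dict x -> P(x)           (prior information state)
    likelihood: dict (y, x) -> p(y | x)  (sensor model of the chosen action)
    """
    posterior = {x: likelihood[(y, x)] * p for x, p in belief.items()}
    z = sum(posterior.values())  # normalizing constant
    return {x: p / z for x, p in posterior.items()}

# Example: cell is 'empty' or 'object'; a cheap search mode, 80% accurate
belief = {"empty": 0.9, "object": 0.1}
sensor = {("hit", "object"): 0.8, ("miss", "object"): 0.2,
          ("hit", "empty"): 0.2, ("miss", "empty"): 0.8}
belief = bayes_update(belief, sensor, "hit")
print(belief)  # posterior after one noisy 'hit'
```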

Problem Setup
N cells, T decision stages
- State of cell i at stage t, x_i(t), can evolve according to a Markov chain (arrivals, departures), independently across cells
M agents, each capable of allocating resource R_j(t) at each stage t by choosing actions (modes) to act on cells
- Decisions: u_ijm(t) = 1 if agent j uses action m on cell i at stage t, 0 otherwise
- c_ijm are the resources needed for an action, giving the constraint Σ_{i,m} c_ijm u_ijm(t) ≤ R_j(t)
- Some actions m are exploratory, others exploit
Action m on cell i at stage t yields a noisy measurement y_ijm(t) of the current cell content
- Evolves the information state: π_i(t) is the probability distribution over the content of cell i at stage t
- Conditionally independent likelihoods p(y_ijm(t) | x_i(t))
Additional decision at each stage t: identify
- For each cell, can identify the content correctly and get a (content-dependent) reward
Objective: collect total cumulative reward given the resources available at each time

Constraint Relaxation
As stated previously, the optimal solution requires feedback strategies based on the product space of information states across all cells
- Recall: the state of cell i, x_i(t), evolves according to a Markov chain (arrivals, departures), independently across cells
- Can be solved conceptually, but hard to do except for a very small number of cells
Reformulate the problem: relax the resource constraints at future times
- Replace sample-path constraints with average constraints (see below)
- Can be optimistic: will allocate more resources than available on difficult sample paths, balanced by allocating fewer resources on easy ones
- The resulting problem has special structure that allows an easier solution
- The solution provides optimistic bounds on the performance of the original problem
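In symbols, using the notation from the problem setup, the relaxation replaces the per-sample-path budget constraint with its expectation (a sketch, matching the reconstruction above):

```latex
\underbrace{\sum_{i,m} c_{ijm}\, u_{ijm}(t) \le R_j(t) \;\; \text{a.s.}}_{\text{sample-path constraint}}
\quad \longrightarrow \quad
\underbrace{\mathbb{E}\Big[\sum_{i,m} c_{ijm}\, u_{ijm}(t)\Big] \le R_j(t)}_{\text{average (relaxed) constraint}}
```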

Results: Duality
Theorem:
- Given resource prices λ_j(t) for the agents at each stage, the stochastic optimization problem decouples into N independent cell problems
- The optimal solution can be found in terms of feedback strategies that use only the current information state of a cell to select actions for that cell
- The overall information state is the product of the marginal information states of the cells
Implication: efficient solution algorithms (a generic price-update sketch follows)
- Merge pricing approaches from resource allocation with single-cell subproblem solutions that use stochastic dynamic optimization
- Reduce the joint N-cell optimization problem to N decoupled problems, coordinated by prices
- Replaces combinatorial optimization across cells by pricing mechanisms
May provide tractable models for human choice in resource allocation and optional stopping
- Replace detailed enumeration of outcomes with price estimates
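One standard way to compute such coordinating prices is a projected-subgradient update on the dual: solve each cell subproblem at the current prices, then raise the price where the implied resource demand exceeds the budget. The sketch below is a generic template under that assumption, not the talk's specific algorithm; `solve_cell` is a hypothetical placeholder for the single-cell stochastic dynamic program.

```python
def price_coordination(cells, budget, solve_cell, steps=100, step0=1.0):
    """Dual (price) coordination sketch: adjust a single resource price
    so that the cells' combined demand meets the budget.

    solve_cell(cell, price) -> resource demand of that cell's optimal
    single-cell policy at the given price (placeholder for the cell DP).
    """
    price = 0.0
    for k in range(1, steps + 1):
        demand = sum(solve_cell(c, price) for c in cells)
        # subgradient of the dual is (demand - budget); keep price >= 0
        price = max(0.0, price + (step0 / k) * (demand - budget))
    return price

# Toy example: each cell demands more resource when the price is low
cells = [2.0, 1.0, 0.5]  # per-cell "value" parameters (illustrative)
demand_model = lambda v, lam: max(0.0, v - lam)
print(price_coordination(cells, budget=1.5, solve_cell=demand_model))
```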

Feasible Decisions: Model-Predictive Control
Strategies designed using the relaxed constraints may run out of resources in future stages
- The approximate dynamic programming technique requires on-line computation of decisions instead of off-line computation of strategies
Restore feasibility by re-solving with a receding horizon given the most recent information (a generic loop is sketched below)
- Adds robustness to model errors through replanning
[Diagram: a resource price update block coordinates Cell 1 … Cell N subproblems, exchanging prices and resource utilization]
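A generic receding-horizon loop of the kind described; `plan` and `advance` are hypothetical placeholders for the relaxed-constraint planner and the true system, and only the first planned action is executed before replanning.

```python
def receding_horizon(state, resources, plan, advance, horizon, stages):
    """Model-predictive control loop: re-solve the relaxed problem at
    every stage from the latest information, execute only the first action."""
    for t in range(stages):
        # plan(state, resources, horizon) -> sequence of planned actions
        actions = plan(state, resources, horizon)
        first = actions[0]                    # commit only to stage t
        state, spent = advance(state, first)  # observe outcome, pay cost
        resources -= spent                    # remaining budget is exact,
                                              # so feasibility is restored
    return state

# Toy usage with stub planner/dynamics
plan = lambda s, r, h: [min(1.0, r)] * h  # spend 1 unit while budget lasts
advance = lambda s, a: (s + a, a)         # state accumulates applied effort
print(receding_horizon(0.0, 5.0, plan, advance, horizon=3, stages=6))
```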

Team Computation
Interesting problem: agents negotiate based on local problems to agree on prices
- Concave maximization problem, but non-differentiable → challenging to establish converging algorithms
- Can truncate negotiation with heuristics → current approach
- Topic of a current PhD effort
Mixed-initiative variations
- Human team members as equals: control a subset of resources, negotiate on prices with automata
- Human team members as leaders: select actions for their own resources; automata select complementary actions for the others
- Humans as controllers: impose constraints (e.g. cell responsibility allocations) on the automata
Algorithms have been developed to implement the above variations and explore potential experiments

Experiment 1
Team of 3 agents, 100 cells, partial overlap in coverage
3 levels of resources per agent
3 sensing modes per agent
- Search, low-res ID, high-res ID
3 object types, 1 of them high-valued
- Vary the relative error for missing a high-value object
- Potential initial arrival per cell, no departures
Discrete-valued observations
- Tentative classifications
Strategy parameters
- Horizon, truncation strategy

Experiment 2: Adaptive Radar Search

Problem definition studied by Hero et al. (07) using stochastic dynamic programming
Large number of cells, few of them occupied
- Want to find the occupied cells
- Cell i state: I_i = 1 if occupied, 0 otherwise
- Prior probability that I_i = 1 is given
Allocating energy to cells controls the signal-to-noise ratio of the measurement
Two-stage problem: initial search, followed by refined location of occupied cells (a toy sketch follows)
- Adapt second-stage allocations based on the first-stage search
- Limited energy budget
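A minimal two-stage sketch of this adapt-on-search idea; the SNR model, detection thresholds, and the even energy split between stages are illustrative assumptions, not the allocation scheme of Hero et al. (07).

```python
import random

def two_stage_search(priors, budget, snr_per_unit=2.0, rng=random.Random(1)):
    """Stage 1: spread half the energy uniformly to flag likely-occupied
    cells; stage 2: concentrate the rest on the flagged cells."""
    n = len(priors)
    truth = [rng.random() < p for p in priors]  # hidden occupancy

    def measure(occupied, energy):
        # toy sensor: more energy -> less noise around the true state
        noise = rng.gauss(0.0, 1.0 / (1.0 + snr_per_unit * energy))
        return (1.0 if occupied else 0.0) + noise

    # Stage 1: uniform low-energy scan over all cells
    e1 = (budget / 2) / n
    flagged = [i for i in range(n) if measure(truth[i], e1) > 0.5]

    # Stage 2: refine only the flagged cells with the remaining energy
    e2 = (budget / 2) / max(1, len(flagged))
    found = [i for i in flagged if measure(truth[i], e2) > 0.5]
    return found, [i for i in range(n) if truth[i]]

found, actual = two_stage_search([0.05] * 100, budget=50.0)
print("declared:", found, "actual:", actual)
```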

Algorithms and Results
Equal allocation among cells
- Motivated by standard operations in wide-area Synthetic Aperture Radar (SAR)
Model-predictive control using relaxed resource constraints
- Two-stage allocation: a low-energy image to detect pixels of interest (high-intensity reflections), followed by higher energy in the pixels of interest
Results similar to prior work by Hero et al. (2007), with a simpler algorithm that is 2 orders of magnitude faster

Paradigm Extension: Mobile Agents
Viewable sites depend on agent positions
- Slower time-scale control
- Focus on trajectory selection and mode
- Sequencing of sites is critical to set up future sites
Mobile agents: trajectory and focus-of-attention control
- Models settings where electronic steering is not feasible
- Sequence-dependent setup costs for activities
Additional uncertainty: risk of travel
- Visiting a site accomplishes a task that gains task value
- Traversing between sites can result in vehicle failure and loss

Experimental Platform for Research
Multiple robots search for and perform tasks
- Can provide varying levels of operator control: human-automata teams
- Control the information displayed and the risk to each operator using video

Future Activities
Evaluate algorithms on experiments with dynamic arrivals/departures
Develop algorithms for motion-constrained mobile agents
Implement experiments involving tasks with performance uncertainty in the robot test facility
- Vary tempo, size, uncertainty, information
Implement autonomous team control algorithms to interact with humans in alternative roles
- Supervisory control, team partners, others
Extend existing algorithms to different classes of tasks
- Area search, task discovery, risk to platforms
Collaborate with the MURI team to design and analyze experiments involving alternative structures for human-automata teams