Team Exploration vs. Exploitation with Finite Budgets
David Castañón
Boston University, Center for Information and Systems Engineering
Introduction

Exploration vs. exploitation is a classic tradeoff in decision problems under uncertainty
- Exploit available information vs. improve that information
- Numerous applications in finance, adaptive control, machine learning, ...

Interested in paradigms for teams of agents
- Search for and exploit information with limited resources
- Task partitioning, motivated by applications (e.g., surveillance)

Objective: techniques for team control of activities
- Improve coordination among team members
- Allow for variations in human roles within the team
- Understand aspects of task partitioning for mixed human/automata teams
Experimental Platform

Multiple robots search for and perform tasks
- Can provide varying levels of operator control: human-automata teams
- Control the information displayed and the risk to each operator using video

Possible model for interesting problems
Illustration: Dynamic Search/Classify Problems

Multiple agents with different fields of regard

Multiple sites to search for potential objects using noisy measurements, then classify them

Agents have search modes/exploitation modes and a finite budget
Related Work

Quickest detection problem
- Noisy measurements of alternative hypotheses
- Trade off decision accuracy versus the time (cost) needed to decide

Result: the optimal strategy is to decide as soon as the log-likelihood ratio leaves a threshold interval
- Sequential Probability Ratio Test (SPRT)
- Log-likelihood-ratio drift-diffusion model (Cohen-Holmes, ...)
- Correlates well with human decision strategies on sequential repeated tasks

Single sensing mode, single site
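As a concrete illustration of the threshold rule described above, here is a minimal SPRT sketch; the densities and the thresholds are placeholders for illustration, not values from the talk.

```python
def sprt(samples, logpdf_h1, logpdf_h0, lower, upper):
    """Sequential Probability Ratio Test: accumulate the log-likelihood
    ratio and decide as soon as it leaves the interval (lower, upper)."""
    llr = 0.0
    for t, y in enumerate(samples, start=1):
        llr += logpdf_h1(y) - logpdf_h0(y)
        if llr >= upper:
            return "H1", t          # enough evidence for H1
        if llr <= lower:
            return "H0", t          # enough evidence for H0
    return "undecided", len(samples)

# Example: unit-variance Gaussian observations, mean 1 under H1, mean 0
# under H0 (normalization constants cancel in the ratio).
h1 = lambda y: -0.5 * (y - 1.0) ** 2
h0 = lambda y: -0.5 * y ** 2
decision, n_used = sprt([0.9, 1.2, 0.8, 1.1], h1, h0, lower=-2.2, upper=2.2)
```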
Multi-Armed Bandits

Robbins (1950s), Gittins (1970s), many others

Independent machines
- Each machine has an individual state with random dynamics
- The state evolves when the machine is played, and is stationary otherwise
- Random payoff depending on the state of the machine

Objective: infinite-horizon sum of discounted rewards
- Repeated decisions among finitely many alternatives

Result: the optimal policy is based on Gittins indices, computed independently for each machine from its current state
- Select the machine with the largest current Gittins index to play next
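To make the index policy concrete, the sketch below approximates the Gittins index of a Bernoulli arm via the retirement ("calibration") characterization: the index is the per-step retirement reward at which playing on and retiring break even. The discount factor, recursion depth, and Beta prior are illustrative assumptions, not parameters from the talk.

```python
from functools import lru_cache

def gittins_bernoulli(s, f, beta=0.9, depth=40, tol=1e-3):
    """Approximate Gittins index of a Bernoulli arm whose success
    probability has a Beta(s+1, f+1) posterior. Bisect on the retirement
    reward M; the value of playing is computed by a depth-truncated
    recursion over the posterior lattice."""
    def play_value(M):
        @lru_cache(maxsize=None)
        def V(a, b, d):
            p = (a + 1) / (a + b + 2)           # posterior mean success prob.
            retire = M / (1 - beta)
            if d == 0:                          # crude tail: retire or play forever
                return max(retire, p / (1 - beta))
            play = p * (1 + beta * V(a + 1, b, d - 1)) \
                 + (1 - p) * beta * V(a, b + 1, d - 1)
            return max(retire, play)
        p = (s + 1) / (s + f + 2)
        return p * (1 + beta * V(s + 1, f, depth)) \
             + (1 - p) * beta * V(s, f + 1, depth)

    lo, hi = 0.0, 1.0                           # index lies in [0, 1] for 0/1 rewards
    while hi - lo > tol:
        M = 0.5 * (lo + hi)
        if play_value(M) > M / (1 - beta):      # playing still beats retiring
            lo = M
        else:
            hi = M
    return 0.5 * (lo + hi)

# Index policy: at every stage, play the arm with the largest current index.
```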
Human Exploration/Exploitation in Multi-Armed Bandit Problems

Multi-armed bandit paradigm
- Complex choice task with a simply structured normative solution
- Human choice can be modeled with heuristics/simple strategies that approximate Gittins indices and include parameters for human variability

Daw et al. (05, 06): time-varying environments; propose alternative models for exploration/exploitation using soft-max and other random decision rules

Steyvers et al. (09), Zhang et al. (09): finite-horizon total-reward paradigm, with models using a latent variable to encourage exploration vs. exploitation

Yu-Dayan (05), Aston-Jones & Cohen (05), others: propose mechanisms underlying brain activity in aspects of exploration vs. exploitation

Cohen et al. (07): highlight limitations of the multi-armed bandit paradigm and discuss directions for future work
Limits of the Bandit Paradigm

Stationary environments
- Tasks do not evolve unless acted on
- The likelihood of success at a task, plus the follow-on task, is time-invariant
- The set of possible tasks is time-invariant

Single action per time step
- Not geared to team activities

Infinite-horizon objective
- Assumes unbounded resources

Single type of action per bandit
- Cannot vary the choice of action

Interested in other paradigms that remove these limitations
A Different Paradigm: Spatial Resource Allocation

Multiple agents with finite resources
- Multiple actions per agent
- Different action types per agent/location
- Similar to classical search theory
- Focus on the effectiveness of collective team actions

Problem: M team members, N locations, actions x_ij from member j to location i (a stand-in formulation is sketched below)

Solution mechanism: pricing of resources + duality
- Allocation, but no tradeoff between exploration and exploitation
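The optimization model on the original slide is not recoverable from the text, so here is a plausible stand-in: a linear allocation problem with per-agent budgets. The dual variables on the budget rows play the role of the resource prices. All data (values v, budgets R, effort bounds) are made up for illustration.

```python
import numpy as np
from scipy.optimize import linprog

M, N = 3, 10                                    # agents, locations (illustrative)
rng = np.random.default_rng(0)
v = rng.uniform(size=(N, M))                    # value of effort x_ij
R = np.full(M, 4.0)                             # per-agent resource budgets

# maximize sum_ij v_ij * x_ij   s.t.   sum_i x_ij <= R_j,   0 <= x_ij <= 1
c = -v.ravel()                                  # linprog minimizes, so negate
A = np.zeros((M, N * M))
for j in range(M):
    A[j, j::M] = 1.0                            # row j sums agent j's total effort
res = linprog(c, A_ub=A, b_ub=R, bounds=(0, 1), method="highs")

x = res.x.reshape(N, M)                         # optimal allocation
# HiGHS reports nonpositive multipliers for <= rows of a minimization,
# so negate to get nonnegative resource prices.
prices = -res.ineqlin.marginals
```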
Extension: Dynamics

Allow learning of information based on the outcomes of actions
- Actions on cell i from agent j at time t, x_ij(t), yield information about the content of the cell
- Different actions give different quality of information and require different effort
- Bayesian integration of information across multiple actions and times
- Exploration: inexpensive actions to identify valuable locations (e.g., detect activity)
- Exploitation: expensive actions to collect value (e.g., identify objects, etc.)

Allow dynamic arrivals and departures of objects in cells

Problem: a normative solution of such problems requires full stochastic dynamic programming
- Very large information space: probability measures on the product space of possible cell contents at each time
- Combinatorial explosion in the number of control actions

Complex problem to solve, even for simple versions
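A minimal sketch of the Bayesian integration step: with conditionally independent measurements, each action multiplies the cell's information state by the likelihood of the observed outcome and renormalizes. The two-mode likelihood table is a made-up example of a cheap exploratory mode versus an expensive, more informative one.

```python
import numpy as np

def bayes_update(belief, likelihood):
    """One measurement update of a cell's information state.
    belief: probability vector over possible cell contents.
    likelihood: p(observed y | content), one entry per content."""
    post = belief * likelihood
    return post / post.sum()

# Contents: (empty, low-value object, high-value object); uniform prior.
belief = np.array([1 / 3, 1 / 3, 1 / 3])
cheap_search = np.array([0.3, 0.6, 0.7])     # p(detect | content): weakly informative
costly_id = np.array([0.05, 0.9, 0.1])       # p(ID reads "low-value" | content)
belief = bayes_update(belief, cheap_search)  # exploration: did we detect activity?
belief = bayes_update(belief, costly_id)     # exploitation: what is it?
```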
Problem Setup

N cells, T decision stages
- The state of cell i at stage t, x_i(t), can evolve according to a Markov chain (arrivals, departures), independently across cells

M agents, each able to allocate resource R_j(t) at each stage t by choosing actions (modes) to act on cells
- Decisions: u_ijm(t) = 1 if agent j uses action m on cell i at stage t, 0 otherwise
- c_ijm are the resources needed for each action, giving the budget constraint Σ_{i,m} c_ijm · u_ijm(t) ≤ R_j(t)
- Some actions m are exploratory, others exploit

Action m on cell i at stage t yields a noisy measurement y_ijm(t) of the current cell content
- Evolves the information state: π_i(t) is the probability distribution over the content of cell i at stage t
- Conditionally independent likelihoods p(y_ijm(t) | x_i(t))

Additional decision at each stage: identify
- For each cell, the content can be identified; identifying it correctly earns a (content-dependent) reward
- Objective: collect maximum total cumulative reward given the resources available at each time
Constraint Relaxation

As stated, the optimal solution requires feedback strategies based on the product space of information states across all cells
- Can be solved conceptually, but hard to do except for a very small number of cells

Reformulate the problem: relax the resource constraints at future times
- Replace sample-path constraints with average (expected) constraints
- Can be optimistic: will allocate more resources than available on difficult problems, balanced by allocating fewer resources on easy problems
- The resulting problem has special structure that allows an easier solution
- The solution provides optimistic bounds on the performance of the original problem
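In the notation of the problem setup, the relaxation can be written as follows (a sketch; the talk gives the idea in words only):

```latex
% Hard constraint: the budget must hold on every sample path,
\sum_{i,m} c_{ijm}\, u_{ijm}(t) \;\le\; R_j(t)
  \quad \text{for all } j,\, t \text{ (almost surely)}.
% Relaxed constraint: future budgets need only hold in expectation,
\mathbb{E}\Big[ \sum_{i,m} c_{ijm}\, u_{ijm}(t) \Big] \;\le\; R_j(t)
  \quad \text{for all } j,\, t > t_0 .
```

Averaging over sample paths is what lets resources flow from easy realizations to hard ones, which is why the resulting bound is optimistic.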
Results: Duality

Theorem:
- Given resource prices λ_j(t) for the agents at each stage, the stochastic optimization problem decouples into N independent cell problems
- The optimal solution can be found in terms of feedback strategies that use only the current information state of a cell to select actions for that cell
- The overall information state is the product of the marginal information states of the cells

Implication: efficient solution algorithms
- Merge pricing approaches from resource allocation with single-cell subproblem solutions that use stochastic dynamic optimization
- Reduce the joint N-cell optimization problem to N decoupled problems, coordinated by prices
- Replace combinatorial optimization across cells with pricing mechanisms

May provide tractable models for human choice in resource allocation and optional stopping
- Replace detailed enumeration of outcomes with price estimates
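A sketch of the price coordination this enables: given prices, each cell subproblem is solved on its own (here an opaque callable), and the prices are adjusted by a projected subgradient step on the dual. The solve_cell interface, step sizes, and iteration count are illustrative assumptions.

```python
import numpy as np

def coordinate_by_prices(solve_cell, n_cells, R, iters=100, step0=1.0):
    """Dual decomposition sketch. solve_cell(i, prices) returns the
    expected resource usage (an array shaped like R: agents x stages) of
    cell i's optimal single-cell strategy under the given prices. The
    dual function is minimized over nonnegative prices by projected
    subgradient steps: its subgradient at the current prices is
    R - total usage, so prices rise where the cells collectively
    over-consume the budget and fall otherwise."""
    prices = np.zeros_like(R, dtype=float)
    for k in range(1, iters + 1):
        usage = sum(solve_cell(i, prices) for i in range(n_cells))
        prices = np.maximum(0.0, prices + (step0 / k) * (usage - R))
    return prices
```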
Feasible Decisions: Model-Predictive Control

Strategies designed using relaxed constraints may run out of resources in future stages
- This approximate dynamic programming technique requires on-line computation of decisions instead of off-line computation of strategies

Restore feasibility by re-solving with a receding horizon given the most recent information
- Adds robustness to model errors through replanning

[Block diagram: a resource price update block coordinates the Cell 1, ..., Cell N subproblems, sending prices down and receiving resource utilization back]
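The replanning loop itself is simple; a skeleton under assumed interfaces (plan, apply_first, and observe are placeholders for the relaxed-constraint solver, the actuation step, and the measurement update):

```python
def receding_horizon(plan, apply_first, observe, info_state, T):
    """Model-predictive control skeleton: at every stage, re-solve the
    relaxed problem over the remaining horizon, commit only the
    first-stage decisions (chosen to be feasible), then fold the new
    measurements into the information state and re-plan."""
    for t in range(T):
        decisions = plan(info_state, horizon=T - t)  # relaxed-constraint solve
        outcomes = apply_first(decisions)            # execute stage-t actions only
        info_state = observe(info_state, outcomes)   # Bayesian update
    return info_state
```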
Team Computation

Interesting problem: agents negotiate based on local problems to agree on prices
- Concave maximization problem, but non-differentiable, which makes convergent algorithms challenging to establish
- Can truncate negotiation with heuristics (the current approach)
- Topic of a current PhD effort

Mixed-initiative variations
- Human team members as equals: control a subset of resources, negotiate on prices with the automata
- Human team members as leaders: select actions for their own resources; the automata select complementary actions for the others
- Humans as controllers: impose constraints (e.g., cell responsibility allocations) on the automata

Algorithms have been developed to implement the above variations and explore potential experiments
Experiment 1

Team of 3 agents, 100 cells, partial overlap in coverage

3 levels of resources per agent

3 sensing modes per agent
- Search, low-resolution ID, high-resolution ID

3 object types, 1 of them high-valued
- Vary the relative error for missing a high-value object
- Potential initial arrival per cell, no departures

Discrete-valued observations
- Tentative classifications

Strategy parameters
- Horizon, truncation strategy
Experiment 2: Adaptive Radar Search
Problem definition studied by Hero et al. (07) using stochastic dynamic programming

Large number of cells, few of them occupied
- Want to find the occupied cells
- Cell i state: I_i = 1 if occupied, 0 otherwise
- Prior probability that I_i = 1 is given

Allocating energy to cells controls the signal-to-noise ratio of the measurement

Two-stage problem: an initial search, followed by refined location of the occupied cells
- Adapt second-stage allocations based on the first-stage search
- Limited energy budget
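A toy version of the two-stage allocation (not the Hero et al. model: the measurement model, 50/50 budget split, and proportional second-stage rule here are simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, E, amp = 500, 0.02, 1000.0, 1.0      # cells, prior, energy budget, amplitude

I = rng.random(N) < p                      # hidden occupancy indicators

def measure(energy):
    """Noisy return whose SNR is set by the allocated energy
    (unit-variance noise per cell)."""
    return np.sqrt(energy) * amp * I + rng.standard_normal(N)

# Stage 1: spend half the budget uniformly, then form posteriors from the
# Gaussian log-likelihood ratio of occupied vs. empty.
e1 = np.full(N, 0.5 * E / N)
y1 = measure(e1)
llr = np.sqrt(e1) * amp * y1 - 0.5 * e1 * amp ** 2
post = 1.0 / (1.0 + (1 - p) / p * np.exp(-llr))

# Stage 2: focus the remaining energy on likely-occupied cells,
# in proportion to the stage-1 posterior.
e2 = 0.5 * E * post / post.sum()
y2 = measure(e2)
```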
Algorithms and Results

Equal allocation among cells
- Motivated by standard operations in wide-area Synthetic Aperture Radar (SAR)

Model-predictive control using relaxed resource constraints
- Two-stage allocation: a low-energy image to detect pixels of interest (high-intensity reflections), followed by higher energy on the pixels of interest

Results similar to prior work by Hero et al. (2007), with a simpler algorithm that is 2 orders of magnitude faster
Paradigm Extension: Mobile Agents

Viewable sites depend on agent positions
- Slower time-scale control
- Focus on trajectory selection and mode
- Sequencing of sites is critical to set up future sites

Mobile agents: trajectory and focus-of-attention control
- Models settings where electronic steering is not feasible
- Sequence-dependent setup cost for activities

Additional uncertainty: risk of travel
- Visiting a site accomplishes a task and gains that task's value
- Traversing between sites can result in vehicle failure and loss
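The travel-risk model suggests a simple expected-value computation for a candidate route; a sketch under the assumption that a task's value is collected on arrival and each leg is survived independently:

```python
def expected_route_value(route, value, survive):
    """Expected total value of visiting sites in order, when each leg may
    destroy the vehicle. value[s] is the reward at site s; survive[(a, b)]
    is the probability of surviving the leg from a to b."""
    total, alive, prev = 0.0, 1.0, None
    for s in route:
        if prev is not None:
            alive *= survive[(prev, s)]   # must survive the leg to collect
        total += alive * value[s]
        prev = s
    return total

# e.g. expected_route_value(["A", "B"], {"A": 2.0, "B": 5.0}, {("A", "B"): 0.9})
```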
Experimental Platform for Research

Multiple robots search for and perform tasks
- Can provide varying levels of operator control: human-automata teams
- Control the information displayed and the risk to each operator using video
Future Activities

Evaluate algorithms on experiments with dynamic arrivals/departures

Develop algorithms for motion-constrained mobile agents

Implement experiments involving tasks with performance uncertainty in the robot test facility
- Vary tempo, size, uncertainty, information

Implement autonomous team control algorithms that interact with humans in alternative roles
- Supervisory control, team partners, others

Extend existing algorithms to different classes of tasks
- Area search, task discovery, risk to platforms

Collaborate with the MURI team to design and analyze experiments involving alternative structures for human-automata teams