Multi-Player Pursuit Evasion Games, Learning, and Sensor Webs. Shankar Sastry, University of California, Berkeley. ATO Novel Approaches to Information Assurance and Control Theory Workshop, February 5th, 2003.
1 UC Berkeley Pursuit-Evasion Game (PEG) Setup
2 UAV-UGV Coordinated PEG. Unknown terrain: not accurately mapped; obstacles (moving & stationary). Pursuers: cooperative teams of UAVs & UGVs; each UAV visually scans a limited region; UAVs are rotorcraft-based; networked communications. Evaders: intelligent UGVs; move between & under obstacles; actively avoid detection. Objective: capture evader(s) in minimum time. [Figure labels: Evader, Pursuers, Obstacles]
3 Cooperative Observation Problem. Difficulty: suppose that at each instant in time the locations of all evaders are given. What is the optimal placement of pursuers so that the maximum number of evaders is visible? -> NP-hard. Reducible to the Vertex Cover Problem with G=(V,E). [Figure labels: Obstacles, Evaders, Pursuers]
4 A Two-step Solution: Exploration then Pursuit. Exploration followed by pursuit is not efficient. Sensors are imprecise. Worst-case assumptions on the trajectories of the evaders lead to very conservative results. [Figure labels: Exploration, Pursuit]
5 Probabilistic Framework Use a probabilistic framework to combine exploration and pursuit-evasion games. Non-determinism comes from Poorly mapped terrain Noise and uncertainty in the sensors Probabilistic models for the motion of the evaders and the UAVs
6 Pursuit-Evasion Game Experimental Settings Multiple pursuers attempt capture of the evader(s) Pursuers can only move to adjacent empty cells Pursuers have perfect knowledge of current location Sensor model: false positives and negatives for evader detection Evader moves randomly to adjacent cells Unknown number of multiple evaders Sensor model for detection & tracking of targets Supervisory UAVs fly over obstacles & evaders, but cannot capture -> heterogeneous team Safety study of control policies
7 Map Building: Map of Evaders Measurement step + y(t) ={v(t),e(t),o(t)} Sensor model Prediction step Evader motion model At each t, the probability of evader state being x given the measurement histories is recursively updated.
8 Pursuit-Evasion Game Experiment Setup. Ground Command Post; Waypoint Commands; Current position, vehicle status; Pursuer: UAV; Evader: UGV; Evader location detected by vision system; Pursuer: UGVs
9 Hierarchy in Berkeley Platform: Uncertainty pervades every layer! [Block diagram. Blocks: Strategy Planner, Map Builder, Communications Network, Tactical Planner & Regulation (tactical planner, trajectory planner, regulation), Vehicle-level sensor fusion, UAV dynamics, UGV dynamics, Terrain, Targets, Exogenous disturbance. Sensors: INS, GPS, ultrasonic altimeter, vision, actuator encoders. Signals: actuator positions, inertial positions, height over terrain, linear acceleration, angular velocity, obstacles detected, targets detected, state of agents, agents positions, desired agents actions, position of targets, position of obstacles, positions of agents, control signals.]
10 Representing and Managing Uncertainty Uncertainty is introduced in various channels –Sensing -> unable to determine the current state of world –Prediction -> unable to infer the future state of world –Actuation -> unable to make the desired action to properly affect the state of world Different types of uncertainty can be addressed by different approaches –Nondeterministic uncertainty : Robust Control –Probabilistic uncertainty : (Partially Observable) Markov Decision Processes –Adversarial uncertainty : Game Theory POMGAME
11 Markov Games Framework for sequential multiagent interaction in a Markov environment
12 Policy for Markov Games. The policy of agent i at time t is a mapping from the current state to a probability distribution over its action set. Agent i wants to maximize –the expected infinite sum of discounted rewards that the agent will gain by executing the optimal policy starting from that state –Performance measure: $E\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]$, where $\gamma$ is the discount factor and $r_t$ is the reward received at time t. Every discounted Markov game has at least one stationary optimal policy, but not necessarily a deterministic one. Special case: Markov decision processes (MDP) –Can be solved by dynamic programming
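The remark that an MDP can be solved by dynamic programming can be illustrated with value iteration. This is a minimal sketch, not the talk's implementation: the two-state example, the data layout `P[a][s][s2]` and `R[s][a]`, and the function name `value_iteration` are assumptions made purely for illustration.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-9, max_iter=10_000):
    """Solve a small finite MDP by value iteration.

    P[a][s][s2] is the probability of moving from s to s2 under action a.
    R[s][a] is the expected immediate reward for taking a in s.
    Returns (V, policy) with one entry per state.
    """
    n_states = len(R)
    n_actions = len(R[0])
    V = [0.0] * n_states

    def q_value(s, a, values):
        # One-step lookahead: immediate reward plus discounted expected value.
        return R[s][a] + gamma * sum(
            P[a][s][s2] * values[s2] for s2 in range(n_states)
        )

    for _ in range(max_iter):
        V_new = [
            max(q_value(s, a, V) for a in range(n_actions))
            for s in range(n_states)
        ]
        delta = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if delta < tol:
            break

    policy = [
        max(range(n_actions), key=lambda a: q_value(s, a, V))
        for s in range(n_states)
    ]
    return V, policy


# Hypothetical two-state example: action 0 stays, action 1 switches state.
# Being in state 1 pays reward 1 per step; state 0 pays nothing.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0: stay
    [[0.0, 1.0], [1.0, 0.0]],  # action 1: switch
]
R = [[0.0, 0.0], [1.0, 1.0]]
V, policy = value_iteration(P, R, gamma=0.9)
```

In this toy problem the computed policy switches out of state 0 and stays in state 1, and the values approach 9 and 10, which matches the closed form 1/(1 - 0.9) for the rewarding state.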
13 Partial Observation Markov Games (POMGame)
14 Policy for POMGames The agent i wants to receive at least Poorly understood: analysis exists only for very specially structured games, such as games with complete information on one side. Special case: partially observable Markov decision processes (POMDP)
15 Acting under Partial Observations. Memory-free policies (mappings from observation to action, or to a probability distribution over the action set) are not satisfactory. In order to behave truly effectively we need to use memory of previous actions and observations to disambiguate the current state. The state estimate, or belief state –Posterior probability distribution over states = the likelihood that the world is actually in state x at time t, given the agent’s past experience (i.e., action and observation histories). A priori human input on the initial state of the world
16 Updating Belief State –Can be updated recursively using the estimated world model and Bayes’ rule. New info on the state of world New info on prediction
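The recursive update (a measurement step using the sensor model, followed by a prediction step using the evader motion model) can be sketched as a small grid Bayes filter. This is only an illustration under stated assumptions: a single evader, no obstacles, independent per-cell detections with false-positive rate `p` and false-negative rate `q`, and an evader that moves uniformly at random to an adjacent cell or stays put. The function names are hypothetical.

```python
def neighbors(cell, rows, cols):
    """Cells the evader can occupy next: stay put or move to a 4-neighbor."""
    r, c = cell
    steps = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
    return [
        (r + dr, c + dc)
        for dr, dc in steps
        if 0 <= r + dr < rows and 0 <= c + dc < cols
    ]


def measurement_update(belief, readings, p, q):
    """Bayes' rule: weight each candidate evader cell by the reading likelihood.

    belief:   dict mapping every grid cell to its prior probability.
    readings: dict mapping each *observed* cell to True (detection) or False.
    p, q:     assumed false-positive and false-negative rates of the sensor.
    """
    posterior = {}
    for x, prior in belief.items():
        likelihood = 1.0
        for cell, detected in readings.items():
            if cell == x:
                likelihood *= (1.0 - q) if detected else q
            else:
                likelihood *= p if detected else (1.0 - p)
        posterior[x] = prior * likelihood
    total = sum(posterior.values())  # assumed nonzero (0 < p, q < 1)
    return {x: v / total for x, v in posterior.items()}


def prediction_step(belief, rows, cols):
    """Push the belief through the assumed random-walk evader motion model."""
    predicted = {x: 0.0 for x in belief}
    for x, prob in belief.items():
        nbrs = neighbors(x, rows, cols)
        for y in nbrs:
            predicted[y] += prob / len(nbrs)
    return predicted


# Uniform prior on a 3x3 grid, then one detection reported at the center cell.
rows, cols = 3, 3
prior = {(r, c): 1.0 / 9 for r in range(rows) for c in range(cols)}
posterior = measurement_update(prior, {(1, 1): True}, p=0.1, q=0.2)
predicted = prediction_step(posterior, rows, cols)
```

With these numbers the posterior mass on the center cell rises from 1/9 to 0.5, and the prediction step then spreads that mass over the center and its four neighbors.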
17 Pursuit-Evasion Games Consider approach in Hespanha, Kim and Sastry –Multiple pursuers catching one single evader –Pursuers can only move to adjacent empty cells –Pursuers have perfect knowledge of current location –Sensor model: false positives (p) and negatives (q) for evader detection –Evader moves randomly to adjacent cells Extensions in Rashid and Kim –Multiple evaders: assuming each one is recognized individually –Supervisory agents: can “fly” over obstacles and evaders, cannot capture –Sensor model for obstacle detection as well
18 Problem Formulation
19 Performance measure : capture time Optimal policy minimizes the cost Optimal Pursuit Policy
20 Optimal Pursuit Policy. Cost-to-go for policy $g$, when the pursuers start with $Y_t = Y$ and a conditional distribution for the state $x(t)$. Cost of policy $g$.
21 Persistent pursuit policies Optimization using dynamic programming is computationally intensive. Persistent pursuit policy g
22 Persistent pursuit policies Persistent pursuit policy g with a period T
23 Pursuit Policies. Greedy Policy –The pursuer moves to the cell with the highest probability of containing an evader at the next instant –The strategic planner assigns more importance to local or immediate considerations –u(v): list of cells that are reachable from the pursuer’s current position v in a single time step.
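The greedy rule can be sketched in a few lines. This is a hedged illustration, not the planner used in the experiments: `reachable_cells` stands in for u(v) under the assumption of 4-connected moves plus staying in place, `evader_map` is assumed to hold the predicted probability of an evader in each cell at the next instant, and ties are broken by list order.

```python
def reachable_cells(v, rows, cols, occupied=frozenset()):
    """Stand-in for u(v): the current cell plus adjacent empty cells."""
    r, c = v
    candidates = [(r, c), (r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]
    return [
        cell
        for cell in candidates
        if 0 <= cell[0] < rows
        and 0 <= cell[1] < cols
        and (cell == v or cell not in occupied)
    ]


def greedy_move(v, evader_map, rows, cols, occupied=frozenset()):
    """Pick the reachable cell with the highest predicted evader probability."""
    return max(
        reachable_cells(v, rows, cols, occupied),
        key=lambda cell: evader_map.get(cell, 0.0),
    )


# Hypothetical predicted evader map on a 3x3 grid (unlisted cells count as 0).
evader_map = {(1, 2): 0.6, (0, 1): 0.3, (1, 1): 0.1}
move_free = greedy_move((1, 1), evader_map, 3, 3)
move_blocked = greedy_move((1, 1), evader_map, 3, 3, occupied={(1, 2)})
```

Because the rule only looks one step ahead within u(v), a high-probability region several cells away has no influence on the move, which is the locality the slide refers to.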
24 Persistent Pursuit Policies for unconstrained motion Theorem 1, for unconstrained motion The greedy policy is persistent. ->The probability of the capture time being finite is equal to one ->The expected value of the capture time is finite
25 Persistent Pursuit Policies for constrained motion Assumptions 1.For any 2.Theorem 2, for constrained motion There is an admissible pursuit policy that is persistent on the average with period
26 Experimental Results: Pursuit Evasion Games with 4UGVs (Spring’ 01)
27 Experimental Results: Pursuit Evasion Games with 4UGVs and 1 UAV (Spring’ 01)
28 Pursuit-Evasion Game Experiment. PEG with four UGVs; Global-Max pursuit policy; simulated camera view (radius 7.5 m with 50-degree conic view); pursuer speed 0.3 m/s; evader speed 0.5 m/s
30 Experimental Results: Evaluation of Policies for different visibility. Capture time of the greedy and global-max policies for different regions of visibility of the pursuers: 3 pursuers with trapezoidal or omni-directional view; randomly moving evader. The global-max policy performs better than greedy, since the greedy policy selects movements based only on local considerations. Both policies perform better with the trapezoidal view, since the camera rotates fast enough to compensate for the narrow field of view.
31 Experimental Results: Evader’s Speed vs. Intelligence Having a more intelligent evader increases the capture time Harder to capture an intelligent evader at a higher speed The capture time of a fast random evader is shorter than that of a slower random evader, when the speed of evader is only slightly higher than that of pursuers. Capture time for different speeds and levels of intelligence of the evader 3 Pursuers with trapezoidal view & global maximum policy Max speed of pursuers: 0.3 m/s
32 Game-theoretic Policy Search Paradigm. Solving very small games with partial information, or games with full information, is sometimes computationally tractable. Many interesting games, including pursuit-evasion, are large games with partial information, and finding optimal solutions is well outside the capability of current algorithms. An approximate solution is not necessarily bad. There might be simple policies with satisfactory performance -> Choose a good policy from a restricted class of policies! We can find approximately optimal solutions from restricted classes, using sparse sampling and a provably convergent policy search algorithm
33 Constructing A Policy Class Given a mission with specific goals, we –decompose the problem in terms of the functions that need to be achieved for success and the means that are available –analyze how a human team would solve the problem –determine a list of important factors that complicate task performance such as safety or physical constraints Maximize aerial coverage, Stay within a communications range, Penalize actions that lead an agent to a danger zone, Maximize the explored region, Minimize fuel usage, …
34 Policy Representation. Quantify the above features and define a feature vector that consists of the estimates of the above quantities for each action, given the agents’ history. Estimate the ‘goodness’ of each action by combining the feature vector with a weighting vector that is to be learned. Choose an action that maximizes this estimate, or choose a randomized action according to a distribution whose parameter sets the degree of exploration.
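One common way to realize such a representation is a linear score over the features combined with a Boltzmann (softmax) distribution. The sketch below assumes that form; the linear score, the `temperature` parameter standing in for the degree of exploration, the feature values, and all function names are assumptions for illustration rather than the definitions used in the talk.

```python
import math
import random


def action_scores(features, w):
    """Linear 'goodness' estimate: dot product of weights and action features."""
    return {
        action: sum(wi * fi for wi, fi in zip(w, phi))
        for action, phi in features.items()
    }


def greedy_action(features, w):
    """Deterministic choice: the action with the highest score."""
    scores = action_scores(features, w)
    return max(scores, key=scores.get)


def softmax_distribution(features, w, temperature=1.0):
    """Boltzmann distribution over actions; higher temperature explores more."""
    scores = action_scores(features, w)
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {a: math.exp((s - top) / temperature) for a, s in scores.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}


def sample_action(features, w, temperature=1.0, rng=random):
    """Randomized choice drawn from the Boltzmann distribution."""
    dist = softmax_distribution(features, w, temperature)
    actions = list(dist)
    return rng.choices(actions, weights=[dist[a] for a in actions], k=1)[0]


# Hypothetical features per action: (coverage gain, fuel saving).
features = {"north": [1.0, 0.0], "south": [0.0, 1.0]}
w = [2.0, 1.0]  # weighting vector; in practice this is what gets learned
dist = softmax_distribution(features, w, temperature=1.0)
cold = softmax_distribution(features, w, temperature=0.01)
```

As the temperature shrinks, the randomized policy concentrates on the maximizing action, so the deterministic rule is the low-temperature limit of the stochastic one.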
35 Policy Search Paradigm Searching for optimal policies is very difficult, even though there might be simple policies with satisfactory performances. Choose a good policy from a restricted class of policies ! Policy Search Problem
36 PEGASUS (Ng & Jordan, 2000). Given a POMDP $M$ and assuming a deterministic simulator, we can construct an equivalent POMDP $M'$ with deterministic transitions. For each policy $\pi \in \Pi$ for $M$ we can construct an equivalent policy $\pi' \in \Pi'$ for $M'$ such that they have the same value function, i.e. $V_M(\pi) = V_{M'}(\pi')$. It suffices for us to find a good policy for the transformed POMDP $M'$. The value function can be approximated by a deterministic function, and $m_s$ samples are taken and reused to compute the value function for each candidate policy. -> Then we can use standard optimization techniques to search for an approximately optimal policy.
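The key idea, drawing the random numbers once and reusing them for every candidate policy so that the value estimate becomes a deterministic function of the policy, can be sketched on a toy problem. Everything specific here is an assumption for illustration: the one-dimensional dynamics, the scalar policy parameter `theta`, the reward, and the grid search standing in for a standard optimizer. It is not the pursuit-evasion simulator from the talk.

```python
import random


def draw_scenarios(m_s, horizon, seed=0):
    """Pre-draw m_s noise sequences; these are reused for every policy."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(horizon)] for _ in range(m_s)]


def rollout_return(theta, noise, x0=5.0, gamma=0.95):
    """Deterministic simulator: all randomness comes from the given noise.

    Toy dynamics: x is a distance to a target, the policy closes a fraction
    theta of it each step, and uniform noise in [-0.5, 0.5) perturbs the result.
    The reward at each step is -|x|.
    """
    x, total, discount = x0, 0.0, 1.0
    for u in noise:
        x = (1.0 - theta) * x + (u - 0.5)
        total += discount * (-abs(x))
        discount *= gamma
    return total


def estimate_value(theta, scenarios):
    """Average discounted return over the fixed scenarios."""
    return sum(rollout_return(theta, noise) for noise in scenarios) / len(scenarios)


scenarios = draw_scenarios(m_s=50, horizon=20)
candidates = [i / 10 for i in range(11)]  # restricted policy class
best_theta = max(candidates, key=lambda th: estimate_value(th, scenarios))
```

Because `estimate_value` returns the same number every time it is called with the same `theta`, comparisons between candidate policies are not blurred by fresh sampling noise, which is what allows an ordinary optimizer to be used for the search.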
37 Performance Guarantee & Scalability Theorem We are guaranteed to have a policy with the value close enough to the optimal value in the class
38 Acting under Partial Observations Computing the value function is very difficult under partial observations. Naïve approaches for dealing with partial observations: –State-free deterministic policy : mapping from observation to action Ignores partial observability (i.e., treat observations as if they were the states of the environment) Finding an optimal mapping is NP-hard. Even the best policy can have very poor performance or can cause a trap. – State-free stochastic policy : mapping from observation to probability distribution over action Finding an optimal mapping is still NP-hard. Agents still cannot learn from the reward or penalty received in the past.
39 Example: Abstraction of Pursuit-Evasion Game. Consider a partial-observation stochastic pursuit-evasion game in a 2-D grid world, between (heterogeneous) teams of $n_e$ evaders and $n_p$ pursuers. At each time t, each evader and pursuer –takes an observation over its visibility region from its current location –updates its belief state –chooses an action from its action set. Goal: capture of the evader (for the pursuers), or survival (for the evader)
40 Example: Policy Feature. Maximize collective aerial coverage -> maximize the distance between agents, measured at the location each pursuer would reach by taking the candidate action. Try to visit an unexplored region with a high probability of detecting an evader, using the position reached by the action that maximizes the evader map value along the frontier
41 Example: Policy Feature (Continued). Prioritize actions that are more compatible with the dynamics of the agents. Policy representation
42 Benchmarking Experiments. Performance of two pursuit policies compared in terms of capture time.
Experiment 1: two pursuers against an evader who moves greedily with respect to the pursuers’ locations.
Experiment 2: when the position of the evader at each step was detected by the sensor network with only 10% accuracy, two optimized pursuers took 24.1 steps, while the one-step greedy pursuers took over 146 steps on average, to capture the evader in a 30 by 30 grid.
Grid size | 1-Greedy pursuers | Optimized pursuers
10 by 10 | (7.3, 4.8) | (5.1, 2.7)
20 by 20 | (42.3, 19.2) | (12.3, 4.3)
43 The State-of-the-Art
44 What pursuers really see
45 Tiny OS (TOS) Jason Hill, Robert Szewczyk, Alec Woo, David Culler TinyOS Ad hoc networking
46 Smart Dust, Dot Motes, MICA Motes Dot motes, MICA motes and smart dust
47 1. Field of wireless sensor nodes Ad hoc, rather than engineered placement At least two potential modes of observation –Acoustic, magnetic, RF
48 2. Subset of more powerful assets Gateway nodes with pan-tilt camera –Limited instantaneous field of view
49 3. Set of objects moving through
50 4. Track a distinguished object
51 Sensor net increases visibility
52 What a sensor network can do for PEG
Potential issues in current PEG
–Cameras have small range
–GPS jamming, unbounded error of INS, noisy ultrasonic sensors
–Communication among pursuers may be difficult over a large area
–Unmanned vehicles are expensive: it is unrealistic to employ a large number of unmanned vehicles to cover a large region to be monitored
–A smart evader is difficult to catch
Benefits from sensor network
–Large sensing coverage
–A location-aware sensor network provides pursuers with additional position information
–The network can relay information among pursuers
–A sensor network is cheap and can reduce the number of pursuers without compromising capture time
–Significantly reduces exploration of the environment
–A wide, distributed network is more difficult to compromise
Overall performance can be dramatically increased by lowering capture time, increasing fault tolerance, and making the pursuer team resilient to security attacks
53 Control Signals to pursuer GPS vision Tactical Planner & Regulation Vehicle-level sensor fusion Strategy Planner Map Builder Pursuers’ communication infrastructure Nest Sensorweb Physical Platform Sensor information layer Single vehicle estimation and control layer Vehicles coordination layer Used
54 Pursuit Evasion Games using sensor webs. Self-organization of motes into a sensorweb –Creation of a communication infrastructure –Self-localization –Synchronization. Tracking of evaders by the pursuers’ team –Evaders’ position and velocity estimation by the sensor network –Communication of the sensors’ estimates to ground pursuers. Design of a pursuit strategy –Coordination of ground & aerial pursuers. Network maintenance. Robustness. Security
55 Closed-loop at many levels Within a node –Algorithms adapt to available energy, physical measurements, network condition Across the network –discovery and routing, transmission protocols are energy aware and depend on application requirements Within the middleware components –synchronization, scheduling, localization On the vehicle –direction, stability, probabilistic map building Among the vehicles –competitive, hidden Markov decision processes Used
56 Coordinated pursuit strategy. Estimation of number of evaders –Disambiguation of multiple signal traces. Estimation of capture time: several possibilities –Every pursuer gets the closest evader –Pursuers relay partial info about evaders to the base station –The base station estimates time-to-capture and assigns evaders to pursuers –Pursuers communicate with each other locally and implement a distributed pursuit strategy. Vision-based tracking –Pursuers switch to vision-based tracking when an evader is within camera range
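The simplest of these options, every pursuer taking the closest evader, can be sketched as follows. This is a rough illustration under stated assumptions: positions are grid coordinates, time-to-capture is approximated as Manhattan distance divided by pursuer speed with the evader treated as stationary, and the names `time_to_capture` and `assign_closest` are hypothetical. Note that two pursuers may end up chasing the same evader under this rule.

```python
def time_to_capture(pursuer_pos, evader_pos, speed):
    """Crude estimate: Manhattan distance over pursuer speed (evader static)."""
    distance = abs(pursuer_pos[0] - evader_pos[0]) + abs(pursuer_pos[1] - evader_pos[1])
    return distance / speed


def assign_closest(pursuers, evaders, speed=0.3):
    """Map each pursuer name to the evader with the smallest estimated time."""
    return {
        name: min(
            evaders,
            key=lambda e: time_to_capture(pos, evaders[e], speed),
        )
        for name, pos in pursuers.items()
    }


# Hypothetical positions in meters on a planar grid.
pursuers = {"ugv1": (0, 0), "ugv2": (10, 10)}
evaders = {"e1": (1, 1), "e2": (9, 8)}
assignment = assign_closest(pursuers, evaders)
```

A base station with estimates for all pursuer-evader pairs could instead solve a one-to-one assignment, at the cost of the extra communication that the relay-based options imply.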
57 Pursuit Evasion Games specifications The goal is to minimize the time necessary to catch the evaders, i.e. having a ground pursuer within a certain distance from an evader Other possible performance metrics to optimize for (minimize) are: –Total energy spent –Given a number of evaders, minimize number of pursuers needed with respect to a constant average time to capture –Degradation of performance (average capture time) in view of: Percentage of corrupted nodes Percentage of failing nodes Smart evaders
58 Simulation of Multi-Target Tracker Using Sensor Webs
59 Target Tracking Using Berkeley Motes Sensor Web
60 Many interesting problems arise from this setup. Targeting of the cameras so as to have objects of interest in the field of view. Collaboration between the field of nodes and the platform to perform ranging and localization to create a coordinate system. Building of routing structures between field nodes and higher-level resources. Targeting of high-level assets. Sensors guide video assets in real time. Video assets refine sensor-based estimates. Network resources focused on the region of importance
61 Abstraction of Sensorwebs. Properties of general sensor nodes are described by –sensing range, confidence in the sensed data –memory, computation capability –clock skew –communication range, bandwidth, time delay, transmission loss –broadcasting methods (periodic or event-based) –and more… To apply sensor nodes in the experiments with the BEAR platform, introduce super-nodes (or gateways), which can –gather information from sub-nodes (filtering or fusion of the data from sub-nodes for partial map building) –communicate with UAVs/UGVs
62 Sensor Webs for Pursuit Evasion Games. A distributed network of sensor motes is dropped by an MAV and is used by the UAVs/UGVs to localize and chase the evaders. Variations: pursuers have access to one set of sensor motes and evaders have access to other sensor motes. Other variations: attack of the sensor webs of pursuer and evader during the game for deception and counter-intelligence. Bake-off against vision-based Pursuit Evasion Games. Mobile macro-motes for dynamic networking for the pursuit-evasion games.