Presentation transcript: Security in Multiagent Systems by Policy Randomization

Slide 1: Security in Multiagent Systems by Policy Randomization
Praveen Paruchuri, Milind Tambe, Fernando Ordonez (University of Southern California); Sarit Kraus (Bar-Ilan University, Israel, and University of Maryland, College Park)

Slide 2: Motivation: The Prediction Game
A UAV (Unmanned Aerial Vehicle) flies between four regions (Region 1, Region 2, Region 3, Region 4). Can you predict its flight pattern?
- Pattern 1: 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, ...
- Pattern 2: 1, 4, 3, 1, 1, 4, 2, 4, 2, 3, 4, 3, ... (generated by a 4-sided die)
Could you predict Pattern 2 even if given 100 of its numbers? Randomization decreases predictability and thereby increases security.

Slide 3: Problem Definition
Problem: increase security by decreasing predictability for an agent or agent team acting in uncertain, adversarial environments.
- Even if the policy is given to the adversary, it remains secure.
- Efficient algorithms for the reward/randomness tradeoff.
Assumptions about the agent/agent team:
- The adversary is unobservable; the adversary's actions, capabilities, and payoffs are unknown.
Assumptions about the adversary:
- Knows the agents' plan/policy.
- Exploits action predictability.
- Can see the agent's state (or belief state).

Slide 4: Solution Technique
Technique developed: intentional policy randomization in an MDP/POMDP framework for sequential decision making (MDP = Markov Decision Process; POMDP = Partially Observable MDP).
Increasing security means solving a multi-criteria problem for the agents:
- Maximize action unpredictability (policy randomization).
- Maintain reward above a threshold (quality constraints).

Slide 5: Domains
Scheduled activities at airports, such as security checks and refueling:
- Observable by anyone; randomizing the schedules is helpful.
A UAV or UAV team patrolling a humanitarian mission:
- The adversary tries to disrupt the mission: disrupt food supplies, harm refugees, shoot down UAVs, etc.
- Randomize the UAV patrol policy.

Slide 6: My Contributions
Two main contributions:
- Single-agent case:
  - Formulate as a nonlinear program with an entropy-based objective.
  - Convert it to a linear program, BRLP (Binary search for Randomization LP).
  - Randomize single-agent policies while keeping reward above a threshold.
- Multi-agent case: RDR (Rolling Down Randomization).
  - Randomized policies for decentralized POMDPs.
  - Threshold on team reward.

Slide 7: MDP-Based Single-Agent Case
An MDP is a tuple <S, A, P, R>:
- S: set of states
- A: set of actions
- P: transition function
- R: reward function
Basic terms used:
- x(s,a): the expected number of times action a is taken in state s.
- The policy is a function of these MDP flows x (see the reconstruction below).
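The slide's formula image is not part of the transcript; the standard relation between the flows x(s,a) and the stochastic policy, together with the usual flow-conservation constraint (its exact form in the talk, e.g. finite-horizon vs. discounted, is an assumption here), is:

\[
\pi(s,a) = \frac{x(s,a)}{\sum_{a'} x(s,a')},
\qquad
\sum_{a} x(j,a) - \sum_{s,a} p(j \mid s,a)\, x(s,a) = \alpha_j \;\; \forall j,
\qquad x(s,a) \ge 0,
\]

where alpha_j is the initial flow into state j (matching the notation of slide 8).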

Slide 8: Entropy: A Measure of Randomness
Randomness, or information content, is measured by entropy (Shannon, 1948).
Entropy for an MDP:
- Additive entropy: add the entropies of each state (π is a function of the flows x).
- Weighted entropy: weigh each state's entropy by its contribution to the total flow, where alpha_j is the initial flow of the system.
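The two formula images are missing from the transcript; a reconstruction consistent with the slide text (the exact normalization of the weighted version is an assumption) is:

\[
H_A(x) = -\sum_{s}\sum_{a} \pi(s,a)\,\log \pi(s,a),
\qquad
H_W(x) = -\sum_{s} \frac{\sum_{a} x(s,a)}{\sum_{j} \alpha_j} \sum_{a} \pi(s,a)\,\log \pi(s,a).
\]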

Slide 9: Tradeoff: Reward vs. Entropy
Nonlinear program: maximize entropy subject to the reward staying above a threshold.
- The objective (entropy) is nonlinear.
BRLP (Binary search for Randomization LP):
- A linear program.
- No explicit entropy calculation; entropy is induced as a function of the flows.
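A sketch of that nonlinear program, written with the flow constraints from slide 7 and an illustrative reward-threshold symbol E_min (the slide's own formulation is not in the transcript):

\[
\max_{x \ge 0} \; H_W(x)
\quad \text{s.t.} \quad
\sum_{a} x(j,a) - \sum_{s,a} p(j \mid s,a)\, x(s,a) = \alpha_j \;\; \forall j,
\qquad
\sum_{s,a} R(s,a)\, x(s,a) \ge E_{\min}.
\]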

Slide 10: BRLP
- Inputs: a high-entropy policy and a target reward (n% of the maximum reward).
- Polynomial-time convergence.
- Monotonicity: entropy decreases, or stays constant, as the required reward increases.
- Control is through a single parameter (beta); the input can be any high-entropy policy.
- One such input is the uniform policy: equal probability for all actions in every state.

Slide 11: LP for Binary Search
The policy is written as a function of beta and the input high-entropy flows; for each beta a linear program is solved (the LP itself appeared as a figure on this slide).
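A minimal sketch of the binary search from slides 10-12, assuming a discounted occupation-measure (flow) LP and the usual reading of BRLP in which a constraint x(s,a) >= beta * x_hat(s,a) ties the solution to the high-entropy input flows x_hat; the variable names (P, R, alpha, gamma, x_hat) and the use of scipy.optimize.linprog are illustrative assumptions, not taken from the slides.

import numpy as np
from scipy.optimize import linprog

def solve_lp(P, R, alpha, gamma, beta, x_hat):
    """BRLP inner LP for a fixed beta: maximize expected reward over occupation
    measures x(s,a) subject to flow conservation and x >= beta * x_hat.
    P has shape (S, A, S), R has shape (S, A), alpha has shape (S,)."""
    S, A = R.shape
    n = S * A
    A_eq = np.zeros((S, n))
    for j in range(S):                      # flow conservation for each state j:
        for s in range(S):                  #   sum_a x(j,a) - gamma * sum_{s,a} P[s,a,j] x(s,a) = alpha[j]
            for a in range(A):
                A_eq[j, s * A + a] = (1.0 if s == j else 0.0) - gamma * P[s, a, j]
    c = -R.reshape(n)                       # linprog minimizes, so negate the reward
    lb = beta * x_hat.reshape(n)            # enforce x(s,a) >= beta * x_hat(s,a)
    bounds = [(lb[i], None) for i in range(n)]
    res = linprog(c, A_eq=A_eq, b_eq=alpha, bounds=bounds, method="highs")
    return res.x.reshape(S, A), -res.fun    # flows and achieved expected reward

def brlp(P, R, alpha, gamma, x_hat, target_reward, tol=1e-4):
    """Binary search on beta in [0, 1]: beta = 0 allows the deterministic
    reward-maximizing policy, beta = 1 forces the high-entropy input x_hat
    (assumed to be a feasible flow, so every beta in [0, 1] is feasible)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        beta = (lo + hi) / 2.0
        _, reward = solve_lp(P, R, alpha, gamma, beta, x_hat)
        if reward >= target_reward:
            lo = beta                       # target met: push toward more randomization
        else:
            hi = beta                       # too much reward lost: back off
    x, reward = solve_lp(P, R, alpha, gamma, lo, x_hat)
    totals = x.sum(axis=1, keepdims=True)
    policy = np.divide(x, np.where(totals > 0, totals, 1.0))  # pi(a|s) = x(s,a) / sum_a' x(s,a')
    return policy, reward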

Slide 12: BRLP in Action
(Figure: the binary search over beta on [0, 1]. beta = 1 gives the maximum-entropy policy, beta = 0 the deterministic maximum-reward policy; the search, shown starting at beta = .5, converges on the beta whose reward matches the target reward.)

Slide 13: Results (Averaged over 10 MDPs)
- Highest entropy: the Expected Entropy method, with a 10% average gain over BRLP.
- Fastest: BRLP, with a 7-fold average speedup over the Expected Entropy method.

Slide 14: Multi-Agent Case: Problem
Maximize entropy for an agent team subject to a reward threshold.
For the agent team:
- A decentralized POMDP framework is used.
- The agents know the initial joint belief state.
- No communication is possible between the agents.
For the adversary:
- Knows the agents' policy.
- Exploits action predictability.
- Can calculate the agents' belief state.

Slide 15: RDR: Rolling Down Randomization
Inputs:
- The best (local or global) deterministic joint policy.
- The percentage of reward loss allowed.
- The d parameter: sets the number of turns each agent gets.
  - Example: d = .5 => number of steps = 1/d = 2, so each agent gets one turn in the 2-agent case.
  - Each step is a single-agent MDP problem.
On agent 1's turn:
- Fix the policies of the other agents (agent 2).
- Find a randomized policy that maximizes the joint entropy (w1 * Entropy(agent 1) + w2 * Entropy(agent 2)) while keeping the joint reward above the threshold.
(A sketch of the outer loop follows.)
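A minimal sketch, in Python, of the RDR outer loop as just described. The single-agent, entropy-maximizing subproblem (solved with the other agents' policies fixed) is passed in as a function because its details are not in the transcript; every name here is illustrative.

def rdr(joint_policy, agents, max_reward, loss_fraction, d, solve_subproblem):
    """Rolling Down Randomization, outer loop only (a sketch).

    Starts from the best deterministic joint policy and lowers the joint
    reward threshold by d * (allowed loss) at every step, while the agents
    take turns re-randomizing their own policy with the others held fixed.
    solve_subproblem(agent, joint_policy, threshold) stands in for the
    single-agent entropy-maximizing solve (e.g., a BRLP-style program on
    that agent's induced MDP)."""
    steps = int(round(1.0 / d))                # d = .5 -> 2 steps (one turn each for 2 agents)
    allowed_loss = loss_fraction * max_reward  # e.g., 20% of the maximum joint reward
    threshold = max_reward
    for step in range(steps):
        agent = agents[step % len(agents)]     # agents take turns
        threshold -= d * allowed_loss          # roll the threshold down: 90%, then 80%, ...
        joint_policy[agent] = solve_subproblem(agent, joint_policy, threshold)
    return joint_policy

With d = .5 and a 20% allowed loss this reproduces the schedule on the next slide: agent 1 randomizes under a 90% threshold, then agent 2 under an 80% threshold.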

Slide 16: RDR with d = .5 (example)
(Figure: the joint-reward threshold rolls down from the maximum reward to 80% of the maximum reward in two steps.)
- Agent 1's turn: maximize joint entropy with joint reward > 90%; reward = 90% after this step.
- Agent 2's turn: maximize joint entropy with joint reward > 80%.

Slide 17: Experimental Results: Reward Threshold vs. Weighted Entropy (averaged over 10 instances)

Slide 18: Summary
- Intentional randomization is the main focus.
- Single-agent case: the BRLP algorithm was introduced.
- Multi-agent case: the RDR algorithm was introduced.
- Both solve a multi-criteria problem that maximizes entropy while maintaining reward above a threshold.

Slide 19: Thank You
Any comments or questions?


Slide 21: Difference Between Safety and Security
- Security: the ability of a system to deal with threats that are intentionally caused by other intelligent agents and/or systems.
- Safety: a system's ability to deal with any other threats to its goals.

Slide 22: Probing Results: Single-Agent Case

Slide 23: Probing Results: Multi-Agent Case

Slide 24: Define POMDP

Slide 25: Define Distributed POMDP
A Dec-POMDP is a tuple <S, A, P, Ω, O, R>, where:
- S: set of states.
- A: joint action set.
- P: transition function.
- Ω: set of joint observations.
- O: observation function, the probability of a joint observation given the current state and the previous joint action; the agents' observations are independent of each other.
- R: immediate joint reward.
A DEC-MDP is a DEC-POMDP with the restriction that at each time step the agents' observations together uniquely determine the state.

Slide 26: Counterexample: Entropy
Suppose the adversary shoots down the UAV by targeting its most probable action; the probability of that action is the hit rate. Assume the UAV has 3 actions and consider two possible probability distributions (entropies in log base 2):
- H(1/2, 1/2, 0) = 1, hit rate = 1/2.
- H(1/2 - delta, 1/4 + delta, 1/4) ≈ 3/2, hit rate = 1/2 - delta.
The second distribution has higher entropy but a lower hit rate.
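A quick check of these numbers in Python (delta is chosen arbitrarily here):

import math

def entropy(p):
    # Shannon entropy in bits, skipping zero-probability entries.
    return -sum(q * math.log2(q) for q in p if q > 0)

delta = 0.01
for dist in ([0.5, 0.5, 0.0], [0.5 - delta, 0.25 + delta, 0.25]):
    print(dist, "entropy =", round(entropy(dist), 3), "bits, hit rate =", max(dist))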

Slide 27: d-Parameter and Comments on Results
Effect of the d parameter (averaged over 10 instances). RDR average runtime in seconds, with entropy in parentheses, for T = 2:

Reward threshold   d = 1        d = .5       d = .25      d = .125
90%                .67 (.59)    1.73 (.74)   3.47 (.75)   7.07 (.75)
50%                .67 (1.53)   1.47 (2.52)  3.4 (2.62)   7.47 (2.66)

Conclusions:
- Greater tolerance of reward loss => higher entropy.
- Reaching maximum entropy is tougher than in the single-agent case.
- Lower miscoordination cost implies higher entropy.
- A d parameter of .5 is good for practical purposes.

Slide 28: Example Where the Uniform Policy Is Not Best

Slide 29: Entropies
- For the uniform policy: 1 + 1/2 * 1 + 2 * 1/4 * 1 + 4 * 1/8 * 1 = 2.5.
- For a policy that is initially deterministic and then uniform: 0 + 1 * 1 + 2 * 1/2 * 1 + 4 * 1/4 * 1 = 3.
Hence, uniform policies need not always be optimal.
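A one-line check of the two sums (each term reads as a flow weight times a per-state entropy of 1 bit; the example MDP itself is the slide-28 figure, which is not in the transcript):

uniform_policy   = 1*1 + (1/2)*1 + 2*(1/4)*1 + 4*(1/8)*1   # 2.5
det_then_uniform = 0*1 + 1*1 + 2*(1/2)*1 + 4*(1/4)*1       # 3.0
print(uniform_policy, det_then_uniform)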

