Privacy-preserving Reinforcement Learning


1 Privacy-preserving Reinforcement Learning
Jun Sakuma, Shigenobu Kobayashi (Tokyo Institute of Technology); Rebecca N. Wright (Rutgers University)

2 Motivating application: Load balancing
[Diagram: two factories, each with an order stream, production, and shipment; jobs are redirected to the other factory when heavily loaded.]
Load balancing among competing factories:
- Factories obtain a reward by processing a job, but suffer a large penalty if an overflow happens.
- A factory may need to redirect jobs to the other factory when it is heavily loaded.
- When should factories redirect jobs to the other factory?

3 Motivating application: Load balancing
[Diagram: two competing factories; redirection when heavily loaded.]
If two factories are competing:
- The frequency of orders and the speed of production are private (private model).
- The backlog is private (private state observation).
- The profit is private (private reward).
Privacy-preserving reinforcement learning: states, actions, and rewards are never shared, but the learned policy is shared in the end.

4 Definition of Privacy: Partitioned-by-time model
- Agents share the state space, the action space, and the reward function.
- Agents cannot interact with the environment simultaneously.
[Diagram: Alice and Bob take turns interacting with the environment, each observing state s_t and reward r_t and taking action a_t; Alice's (s_t, a_t, r_t) for t = 0 to T1, Bob's for T1 to T1+T2, Alice's again for T1+T2 to T1+T2+T3.]

5 Definition of Privacy: Partitioned-by-observation model
- State spaces and action spaces are mutually exclusive between agents.
- Agents interact with the environment simultaneously.
[Diagram: from t = 0, Alice observes state s_t^A and reward r_t^A and takes action a_t^A, while Bob observes s_t^B and r_t^B and takes a_t^B; Alice's perception is (s_1^A, a_1^A, r_1^A), ..., (s_t^A, a_t^A, r_t^A) and Bob's is (s_1^B, a_1^B, r_1^B), ..., (s_t^B, a_t^B, r_t^B).]

6 Are existing RLs privacy-preserving?
Existing approaches:
- Centralized RL (CRL): a leader agent collects all agents' information and learns.
- Distributed RL (DRL) [Schneider99][Ng05]: each distributed agent shares partial observations and learns.
- Independent DRL (IDRL): each agent learns independently, with no sharing.

         | Optimality | Privacy
  CRL    | optimal    | disclosed
  DRL    | medium     | partly disclosed
  IDRL   | bad        | preserved
  PPRL   | optimal    | preserved

Target: achieve privacy preservation without sacrificing optimality.

7 Privacy-preserving Reinforcement Learning
Algorithm: tabular SARSA learning with epsilon-greedy action selection.
Overview:
(Step 1) Initialization of Q-values (building block 1: homomorphic cryptosystem)
(Step 2) Observation from the environment
(Step 3) Private action selection (building block 2: random shares; building block 3: private comparison by secure function evaluation)
(Step 4) Private update of Q-values; go to Step 2

8 Building block: Homomorphic public-key cryptosystem
- A pair of public and secret keys (pk, sk).
- Encryption: c = e_pk(m; r), where m is an integer and r is a random integer.
- Decryption: m = d_sk(c).
Homomorphic properties:
- Addition of ciphertexts: e_pk(m1 + m2; r1 + r2) = e_pk(m1; r1) · e_pk(m2; r2)
- Scalar multiplication: e_pk(k·m; k·r) = e_pk(m; r)^k
The Paillier cryptosystem [Pai99] is additively homomorphic.
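To make the homomorphic properties concrete, here is a toy Paillier-style sketch in Python. The key size, the helper names (paillier_keygen, enc, dec), and the parameter choices are illustrative and not from the slides; note also that in Paillier the randomness composes multiplicatively (r1·r2) rather than additively as in the slide's generic notation. A real deployment would use a vetted library with 2048-bit keys.

```python
import random
from math import gcd

def paillier_keygen(p=293, q=433):
    """Toy Paillier key generation (tiny primes, for illustration only)."""
    n = p * q
    nsq = n * n
    lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)    # lambda = lcm(p-1, q-1)
    g = n + 1                                        # a standard simple choice of g
    # mu = (L(g^lambda mod n^2))^-1 mod n, with L(u) = (u - 1) // n
    mu = pow((pow(g, lam, nsq) - 1) // n, -1, n)
    return (n, g), (lam, mu)

def enc(pk, m):
    """c = e_pk(m; r) = g^m * r^n mod n^2 for a random r coprime to n."""
    n, g = pk
    nsq = n * n
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, nsq) * pow(r, n, nsq)) % nsq

def dec(pk, sk, c):
    """m = d_sk(c) = L(c^lambda mod n^2) * mu mod n."""
    n, _ = pk
    lam, mu = sk
    nsq = n * n
    return (pow(c, lam, nsq) - 1) // n * mu % n

pk, sk = paillier_keygen()
c1, c2 = enc(pk, 7), enc(pk, 35)
n2 = pk[0] ** 2
assert dec(pk, sk, c1 * c2 % n2) == 42      # e(m1) * e(m2) decrypts to m1 + m2
assert dec(pk, sk, pow(c1, 3, n2)) == 21    # e(m)^k decrypts to k * m
```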

9 Building block: Random shares
Secret: x; public modulus: N.
(a, b) are random shares of x when a and b are distributed uniformly at random subject to a + b = x mod N; Alice holds a and Bob holds b.

10 Building block: Random shares
Secret: x = 6; public modulus: N = 23. Alice's random share: a = 15; Bob's random share: b = 14.
(a, b) are random shares of x when a and b are distributed uniformly at random subject to a + b = x mod N.
Example: a = 15 and b = 14 give 6 = 15 + 14 (= 29) mod 23.
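A minimal Python sketch of this splitting; the helper names (split_share, reconstruct) are illustrative.

```python
import random

def split_share(x, N):
    """Split secret x into random shares a, b with a + b = x (mod N)."""
    a = random.randrange(N)     # Alice's share: uniformly random
    b = (x - a) % N             # Bob's share completes the sum
    return a, b

def reconstruct(a, b, N):
    return (a + b) % N

a, b = split_share(6, 23)           # e.g. a = 15, b = 14, as in the example above
assert reconstruct(a, b, 23) == 6   # either share alone is just a uniform random value
```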

11 Building block: Private comparison
Secure function evaluation (SFE) [Yao86] allows two parties to evaluate a specified function f of their private inputs; after the SFE, nothing about the inputs is revealed beyond each party's own output.
Private comparison: Alice's private input is x, Bob's private input is y; the output is 0 if x > y, and 1 otherwise. In the example on the slide the output is 0.
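The following stand-in shows only the ideal functionality that the Yao-style comparison computes; it is not the cryptographic protocol itself. In this talk the private inputs are random shares of two Q-values (my reading of slides 17-21), so the circuit reconstructs them modulo the public N before comparing.

```python
def compare_ideal(x, y):
    """Output of the private comparison: 0 if x > y, else 1."""
    return 0 if x > y else 1

def compare_shared(a1, b1, a2, b2, N):
    """Compare two values given as random shares (a_i from Alice, b_i from Bob).

    In the real protocol this is evaluated inside a garbled circuit [Yao86],
    so neither the reconstructed values nor the other party's shares leak.
    """
    return compare_ideal((a1 + b1) % N, (a2 + b2) % N)
```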

12 Privacy-preserving Reinforcement Learning
Protocol for the partitioned-by-time model:
(Step 1) Initialization of Q-values (building block 1: homomorphic cryptosystem)
(Step 2) Observation from the environment
(Step 3) Private action selection (building block 2: random shares; building block 3: private comparison by secure function evaluation)
(Step 4) Private update of Q-values; go to Step 2

13-15 Step 1: Initialization of Q-values
- Alice: learn Q-values Q(s,a) from t = 0 to T1.
- Alice: generate a key pair (pk, sk).
- Alice: compute c(s,a) = enc_pk(Q(s,a)) for all (s,a) and send the encrypted table to Bob.
[Diagram: Alice interacts with the environment during t = 0 to T1; her Q-table over states s1, s2 and actions a1, a2 is encrypted entrywise into c(s,a), and the encrypted table is transferred to Bob.]
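As a small sketch of Step 1, Alice encrypts her learned Q-table entrywise with enc/pk from the Paillier sketch above. The assumption (not shown on the slide) is that the Q-values have already been rounded to non-negative integers that fit the plaintext space.

```python
# Alice's learned Q-table, encoded as non-negative integers (assumed encoding).
Q = {("s1", "a1"): 12, ("s1", "a2"): 7,
     ("s2", "a1"): 3,  ("s2", "a2"): 25}

# c(s, a) = enc_pk(Q(s, a)); only this encrypted table is sent to Bob.
c = {sa: enc(pk, q) for sa, q in Q.items()}
```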

16 Privacy-preserving Reinforcement Learning
The protocol overview:
(Step 1) Initialization of Q-values (building block 1: homomorphic cryptosystem)
(Step 2) Observation from the environment
(Step 3) Private action selection (building block 2: random shares; building block 3: private comparison by secure function evaluation)
(Step 4) Private update of Q-values; go to Step 2

17-21 Step 2-3: Private Action Selection (greedy)
- Bob: observe state s_t, reward r_t.
- Bob: for all a, compute random shares of Q(s_t, a) and send Alice's shares to her (a sketch follows below; the exact construction via the encrypted Q-values is given on slide 35).
- Bob and Alice: run private comparison on the random shares to learn the greedy action a_t, which Bob then takes.
[Diagram: Bob holds the encrypted Q-table c(s,a); the row for s_t is split into random shares r_A(s_t, a) held by Alice and r_B(s_t, a) held by Bob; private comparison of the shares selects the greedy action a_t.]
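A sketch of the greedy selection over random shares, reusing compare_shared from the private-comparison sketch above. Alice holds shares rA, Bob holds shares rB, with rA[a] + rB[a] = Q(s_t, a) mod N; in the real protocol each pairwise comparison is a secure function evaluation, so only the identity of the greedy action is revealed. The concrete numbers below are illustrative.

```python
def greedy_action(actions, rA, rB, N):
    """Return argmax_a Q(s_t, a) using only share-level comparisons."""
    best = actions[0]
    for a in actions[1:]:
        # 0 means Q(s_t, best) > Q(s_t, a); 1 means the challenger wins (or ties)
        if compare_shared(rA[best], rB[best], rA[a], rB[a], N) == 1:
            best = a
    return best

# Usage with illustrative shares of Q(s1, a1) = 12 and Q(s1, a2) = 7:
N = pk[0]                                            # plaintext modulus from the Paillier sketch
rA = {"a1": 100, "a2": 200}                          # Alice's shares
rB = {"a1": (12 - 100) % N, "a2": (7 - 200) % N}     # Bob's complementary shares
assert greedy_action(["a1", "a2"], rA, rB, N) == "a1"
```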

22 Privacy-preserving Reinforcement Learning
The protocol overview:
(Step 1) Initialization of Q-values (building block 1: homomorphic cryptosystem)
(Step 2) Observation from the environment
(Step 3) Private action selection (building block 2: random shares; building block 3: private comparison by secure function evaluation)
(Step 4) Private update of Q-values; go to Step 2

23-24 Step 4: Private Update of Q-values
- After the greedy action selection, Bob observes (r_t, s_{t+1}).
- How can Bob update the encrypted Q-value c(s_t, a_t) from (s_t, a_t, r_t, s_{t+1})? The regular SARSA update cannot be applied directly, because Bob holds only the encrypted Q-values.
[Diagram: Bob takes the greedy action a_t, observes reward r_t and next state s_{t+1}, and must update the encrypted Q-table c(s,a).]

25-29 Step 4: Private Update of Q-values
[Equation slides; surviving fragments: scaling factors ×K and ×L, "Encryption", "public", "Bob holds". Conclusion (slide 29): Bob can update c(s,a) without knowledge of Q(s,a)!]
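Since the update equations on slides 25-29 survive only as fragments, here is a hedged sketch of what a homomorphic SARSA-style update can look like, reusing enc/dec and pk/sk from the Paillier sketch above. The assumptions are mine, not the slides': the learning rate and discount are rationals alpha = A/K and gamma = G/L, Q-values are stored as non-negative integers, and the result comes out at scale K·L, which the real protocol has to keep track of (presumably what the ×K, ×L factors on these slides refer to).

```python
A, K = 1, 10     # alpha = A/K = 0.1   (assumed values)
G, L = 9, 10     # gamma = G/L = 0.9   (assumed values)

def private_sarsa_update(c_sa, c_next, r):
    """Bob combines ciphertexts only; the plaintext result, at scale K*L, is
       K*L*Q_new = (K-A)*L*Q(s,a) + A*G*Q(s',a') + A*L*r."""
    nsq = pk[0] ** 2
    return (pow(c_sa, (K - A) * L, nsq)      # (1 - alpha) * Q(s,a)
            * pow(c_next, A * G, nsq)        # alpha * gamma * Q(s',a')
            * enc(pk, A * L * r)) % nsq      # alpha * r

# Usage: Bob holds encryptions of Q(s,a) = 20 and Q(s',a') = 30 and observes r = 5.
c_sa, c_next = enc(pk, 20), enc(pk, 30)
c_new = private_sarsa_update(c_sa, c_next, r=5)
assert dec(pk, sk, c_new) == 2120            # i.e. 21.2 at scale K*L = 100
```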

30 Privacy-preserving Reinforcement Learning
The protocol overview: (Step 1) Initialization of Q-values; (Step 2) Observation from the environment; (Step 3) Private action selection; (Step 4) Private update of Q-values; go to Step 2.
Not mentioned in this talk, but the partitioned-by-observation model, epsilon-greedy action selection, and Q-learning can be treated in a similar manner.

31 Experiment: Load balancing among factories
[Diagram: jobs are assigned to each factory w.p. p_in and processed w.p. p_out; the example shows backlogs s_A = 5, s_B = 2 with a_A = "redirect" and a_B = "no redirect".]
Setting:
- State space: s_A, s_B ∈ {0, 1, ..., 5}
- Action space: a_A, a_B ∈ {redirect, no redirect}
- Reward: r_A = 50 - (s_A)^2 (cost for backlog); r_A := r_A - 2 on a redirection (cost for redirection); r_A = 0 on an overflow
- Reward r_B is set similarly; system reward r_t = r_t^A + r_t^B
Compared learners: regular RL / PPRL, DRL (rewards are shared), IDRL (no sharing)
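As a small sketch of the per-factory reward described on this slide; treating "overflow" as the backlog exceeding its maximum of 5 is my reading of the setup.

```python
def factory_reward(backlog, redirected, overflow):
    """Per-factory reward: 50 - backlog^2, minus 2 if a job was redirected,
       and 0 if an overflow occurred."""
    if overflow:
        return 0
    return 50 - backlog ** 2 - (2 if redirected else 0)

# System reward is the sum over both factories: r_t = r_t^A + r_t^B
r_system = factory_reward(5, True, False) + factory_reward(2, False, False)   # 23 + 46 = 69
```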

32 Experiment: Load balancing among factories
[Same load-balancing setting as slide 31.]
Comparison (* computation time in seconds, Java Fairplay, 1.2 GHz Core Solo):

         | detail                  | comp (sec)* | profit | privacy
  CRL    | all information shared  | 5.11        | 90.0   | all disclosed
  DRL    | rewards are shared      | 5.24        | 87.4   | partially disclosed
  IDRL   | no sharing              | 5.81        | 84.2   | perfect
  PPRL   | security protocol       | 8.85×10^5   |        |
  RL/SFE | SFE [Yao86]             | >7.00×10^7  |        |

33 Summary
Reinforcement learning from private observations:
- Achieves the same optimality as regular RL.
- Privacy preservation is guaranteed theoretically.
- Computational load is higher than regular RL, but the protocol works efficiently on a 36-state / 4-action problem.
Future work:
- Scalability
- Treatment of agents with competing reward functions
- Game-theoretic analysis

34 Thank you!

35 Step 2-3: Private Action Selection (greedy)
- Bob: observe state s_t, reward r_t.
- Bob: for all a, pick a random share r_B(s_t, a), compute c'(s_t, a) = c(s_t, a) · enc_pk(-r_B(s_t, a)), and send the c'(s_t, a) to Alice.
- Alice: for all a, compute her share r_A(s_t, a) = dec_sk(c'(s_t, a)).
- Bob and Alice: run private comparison on the random shares to learn the greedy action a_t.
[Diagram: Bob's encrypted Q-table c(s,a); the shares r_A(s_1, a) held by Alice and r_B(s_1, a) held by Bob; private comparison selects a_t.]
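A sketch of the share construction on this slide: Bob picks a random share r_B(s_t, a), uses the additive homomorphism to form c'(s_t, a) = c(s_t, a) · enc_pk(-r_B(s_t, a)), and Alice decrypts her share. Here enc/dec and pk/sk come from the Paillier sketch earlier in this transcript, and taking the share modulus to be the plaintext modulus n is an assumption.

```python
import random

def bob_make_share(c_sa):
    """Bob: pick r_B and form c'(s,a), which encrypts Q(s,a) - r_B (mod n)."""
    n = pk[0]
    r_B = random.randrange(n)
    c_prime = c_sa * enc(pk, (-r_B) % n) % (n * n)
    return r_B, c_prime                  # Bob keeps r_B, sends c_prime to Alice

def alice_make_share(c_prime):
    """Alice: r_A = dec_sk(c'(s,a)) = Q(s,a) - r_B (mod n)."""
    return dec(pk, sk, c_prime)

# Sanity check: the shares reconstruct Q(s,a), yet neither party sees it directly.
r_B, c_prime = bob_make_share(enc(pk, 12))
r_A = alice_make_share(c_prime)
assert (r_A + r_B) % pk[0] == 12
```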

36 Distributed Reinforcement Learning
[Diagram: Alice and Bob each interact with the environment, observing (s_A, r_A) and (s_B, r_B) and taking actions a_A and a_B.]
- Distributed value functions [Schneider99]: manage a huge state-action space and suppress memory consumption.
- Policy gradient approaches [Peshkin00][Moallemi03][Bagnell05]: limit the communication.
DRL learns good, but sub-optimal, policies with minimal or limited sharing of the agents' perceptions.

