Privacy-preserving Reinforcement Learning Tokyo Inst. of Tech. Jun Sakuma, Shigenobu Kobayashi Rutgers Univ. Rebecca N. Wright
Motivating application: Load balancing Order Shipment Order Shipment Production Production Redirection when heavily loaded A load balancing among competing factories Factories obtain a reward by processing a job, but suffer a large penalty if overflow happens Factories may need to redirect jobs to the other factory when heavily loaded When should factories redirect jobs to the other factory? Jun Sakuma
Motivating application: Load balancing Order Shipment Order Shipment Production Production Redirection when heavily loaded If two factories are competing… The frequency of orders and the speed of production is private (private model) The backlog is private (private state observation) The profit is private (private reward) Privacy-preserving Reinforcement Learning States, actions, and rewards are not shared But the learned policy is shared in the end Jun Sakuma
Definition of Privacy Environment Alice Bob Partitioned-by-time model Agents share the state space, the action space and the reward function Agents cannot interact with the environment simultaneously Environment state st, reward rt state st, reward rt action at action at Alice Bob Alice’s (st,at,rt) Bob’s (st,at,rt) Alice’s (st,at,rt) … t=0 t=T1 t=T1+T2 t=T1+T2+T3 Jun Sakuma
Definition of Privacy Environment Alice Bob Partitioned-by-observation model State spaces and action spaces are mutually exclusive between agents Agents interact with the environment simultaneously Environment state stA, reward rtA state stB, reward rtB Alice action atA action atB Bob Alice’s perception (sA1,aA1,rA1) … (sAt,aAt,rAt) Bob’s perception (sB1,aB1,rB1) … (sBt,aBt,rBt) t=0 Jun Sakuma
Are existing RLs privacy-preserving? Centralized RL (CRL) Distributed RL (DRL), [Schneider99][Ng05] environment environment Agent Agent Agent Agent Agent Agent Agent Agent Each distributed agent shares partial observation and learns Leader Agent Leader agent learns Independent DRL (IDRL) Optimality Privacy CRL optimal disclosed DRL medium partly disclosed IDRL bad Preserved PPRL preserved environment Agent Agent Agent Agent Each agent learns independently Target: achievement of privacy preservation without sacrificing the optimality Jun Sakuma
Privacy-preserving Reinforcement Learning Algorithm Tabular SARSA learning with epsilon-greedy action selection Overview: (Step 1) Initialization of Q-vales Building block 1: Homomorphic cryptosystem (Step 2) Observation from the environment (Step 3) Private Action selection Building block 2: Random shares Building block 3: Private comparison by Secure Function Evaluation (Step 4) Private update of Q-values Go to step 2 Jun Sakuma
Building block: Homomorphic public-key cryptosystem A pair of public and secret key (pk, sk) Encryption: c = epk(m; r), m is an integer, r is a random integer Decryption: m=dsk(c) Homomorphic public-key cryptosystem Addition of cipher epk(m1+m2; r1+r2) = epk(m1; r)・epk(m2; r) Multiplication of cipher epk(km; kr) = epk(m; r)k Paillier cryptosystem[Pai99] is homomorphic Jun Sakuma 8 8
Building block: Random shares Secret: x public: N Alice Bob Random share a Random share b (a, b) are random shares when a and b distributes uniform randomly with satisfying a+b = x mod N Jun Sakuma
Building block: Random shares Secret: x=6 public: N=23 Alice Bob Random share a=15 Random share b=14 (a, b) are random shares when a and b distributes uniform randomly with satisfying a+b = x mod N Example a=15 and b=14 6= 15 + 14 (=29) mod 23 Jun Sakuma
Building block: Private comparison Secure Function Evaluation [Yao86] allows parties to evaluate a specified function f by taking their private input after the SFE, their inputs and outputs are not revealed Private input: x Private input: y Private comparison Output: If x>y, 0. Else 1. Output: 0 Jun Sakuma
Privacy-preserving Reinforcement Learning Protocol for partitioned-by-time model (Step 1) Initialization of Q-vales Building block 1: Homomorphic cryptosystem (Step 2) Observation from the environment (Step 3) Private Action selection Building block 2: Random shares Building block 3: Private comparison by Secure Function Evaluation (Step 4) Private update of Q-values Go to step 2 Jun Sakuma
Step 1: Initialization of Q-vales Alice: Learn Q-values Q(s,a) from t=0 to T1 Alice: Generate a pair of keys (pk, sk) Alice: Compute c(s,a) = encpk(Q(s,a)) send them to Bob Environment state st, reward rt action at Alice Bob Alice’s (st,at,rt) … t=0 t=T1 t=T1+T2 t=T1+T2+T3 a1 a2 s1 Q(s1,a1) Q(s1,a2) s2 Q(s2,a1) Q(s2,a2) Q-values Jun Sakuma
Step 1: Initialization of Q-vales Alice: Learn Q-values Q(s,a) from t=0 to T1 Alice: Generate a pair of keys (pk, sk) Alice: Compute c(s,a) = encpk(Q(s,a)) send them to Bob Environment state st, reward rt action at Alice Bob Alice’s (st,at,rt) … t=0 t=T1 t=T1+T2 t=T1+T2+T3 a1 a2 s1 Q(s1,a1) Q(s1,a2) s2 Q(s2,a1) Q(s2,a2) a1 a2 s1 c(s1,a1) c (s1,a2) s2 c (s2,a1) c (s2,a2) Q-values Encrypted Q-values Jun Sakuma
Step 1: Initialization of Q-vales Alice: Learn Q-values Q(s,a) from t=0 to T1 Alice: Generate a pair of keys (pk, sk) Alice: Compute c(s,a) = encpk(Q(s,a)) send them to Bob Environment state st, reward rt action at Alice Bob Alice’s (st,at,rt) … t=0 t=T1 t=T1+T2 t=T1+T2+T3 a1 a2 s1 Q(s1,a1) Q(s1,a2) s2 Q(s2,a1) Q(s2,a2) a1 a2 s1 c(s1,a1) c (s1,a2) s2 c (s2,a1) c (s2,a2) a1 a2 s1 c(s1,a1) c (s1,a2) s2 c (s2,a1) c (s2,a2) Q-values Encrypted Q-values Jun Sakuma
Privacy-preserving Reinforcement Learning The protocol overview (Step 1) Initialization of Q-vales Building block 1: Homomorphic cryptosystem (Step 2) Observation from the environment (Step 3) Private Action selection Building block 2: Random shares Building block 3: Private comparison by Secure Function Evaluation (Step 4) Private update of Q-values Go to step 2 Jun Sakuma
Step 2-3: Private Action Selection (greedy) Bob: Observe state st, reward rt Bob: For all a, compute random shares of Q(st, a) and send them to Alice Bob and Alice: Run private comparison of random shares to learn greedy action at Environment state st Alice Bob Alice’s (st,at,rt) Bob’s (st,at,rt) … t=0 t=T1 t=T1+T2 t=T1+T2+T3 a1 a2 s1 c(s1,a1) c (s1,a2) s2 c (s2,a1) c (s2,a2) Jun Sakuma
Step 2-3: Private Action Selection (greedy) Bob: Observe state st, reward rt Bob: For all a, compute random shares of Q(st, a) and send them to Alice Bob and Alice: Run private comparison of random shares to learn greedy action at Environment state st Alice Bob Alice’s (st,at,rt) Bob’s (st,at,rt) … t=0 t=T1 t=T1+T2 t=T1+T2+T3 a1 a2 s1 c(s1,a1) c (s1,a2) s2 c (s2,a1) c (s2,a2) Jun Sakuma
Step 2-3: Private Action Selection (greedy) Bob: Observe state st, reward rt Bob: For all a, compute random shares of Q(st, a) and send them to Alice Bob and Alice: Run private comparison of random shares to learn greedy action at Environment state st Alice Bob Alice’s (st,at,rt) Bob’s (st,at,rt) … t=0 t=T1 t=T1+T2 t=T1+T2+T3 a1 a2 s1 c(s1,a1) c (s1,a2) s2 c (s2,a1) c (s2,a2) Split Q values as random shares a1 a2 s1 rA(s1,a1) rA (s1,a2) a1 a2 s1 rB(s1,a1) rB(s1,a2) Jun Sakuma
Step 2-3: Private Action Selection (greedy) Bob: Observe state st, reward rt Bob: For all a, compute random shares of Q(st, a) and send them to Alice Bob and Alice: Run private comparison of random shares to learn greedy action at Environment state st Alice Bob Alice’s (st,at,rt) Bob’s (st,at,rt) … t=0 t=T1 t=T1+T2 t=T1+T2+T3 a1 a2 s1 c(s1,a1) c (s1,a2) s2 c (s2,a1) c (s2,a2) Private comparison a1 a2 s1 rA(s1,a1) rA (s1,a2) a1 a2 s1 rB(s1,a1) rB(s1,a2) Jun Sakuma
Step 2-3: Private Action Selection (greedy) Bob: Observe state state st, reward rt Bob: For all a, compute random shares of Q(st, a) and send them to Alice Bob and Alice: Run private comparison of random shares to learn greedy action at Environment state st Alice action at Bob Alice’s (st,at,rt) Bob’s (st,at,rt) … t=0 t=T1 t=T1+T2 t=T1+T2+T3 a1 a2 s1 c(s1,a1) c (s1,a2) s2 c (s2,a1) c (s2,a2) > Private comparison a1 a2 s1 rA(s1,a1) rA (s1,a2) a1 a2 s1 rB(s1,a1) rB(s1,a2) Jun Sakuma
Privacy-preserving Reinforcement Learning The protocol overview (Step 1) Initialization of Q-vales Building block 1: Homomorphic cryptosystem (Step 2) Observation from the environment (Step 3) Private Action selection Building block 2: Random shares Building block 3: Private comparison by Secure Function Evaluation (Step 4) Private update of Q-values Go to step 2 Jun Sakuma
Step 3: Private Update of Q-values After greedy action selection, Bob observes (rt, st+1) How can Bob update encrypted Q-values c(st,at) from (st, at, rt, st+1) ? Environment reward rt , state st+1 action at Alice Bob Taken by Bob (greedy) Regular update by SARSA a1 a2 s1 c(s1,a1) c (s1,a2) s2 c (s2,a1) c (s2,a2) Observed Encrypted Q-values Jun Sakuma
Step 3: Private Update of Q-values After greedy action selection, Bob observes (rt, st+1) How can Bob update encrypted Q-values c(st,at) from (st, at, rt, st+1) ? Environment reward rt , state st+1 action at Alice Bob Taken by Bob (greedy) Regular update by SARSA a1 a2 s1 c(s1,a1) c (s1,a2) s2 c (s2,a1) c (s2,a2) Observed Can Bob update encrypted Q-values? Encrypted Q-values ? ? Jun Sakuma
Step 3: Private Update of Q-values Jun Sakuma
Step 3: Private Update of Q-values ×K, ×L Jun Sakuma
Step 3: Private Update of Q-values ×K, ×L Encryption Jun Sakuma
Step 3: Private Update of Q-values public Bob holds Jun Sakuma
Step 3: Private Update of Q-values ×K, ×L Encryption Bob can update c(s,a) without knowledge of Q(s,a)! Jun Sakuma
Privacy-preserving Reinforcement Learning The protocol overview (Step 1) Initialization of Q-vales (Step 2) Observation from the environment (Step 3) Private Action selection (Step 4) Private update of Q-values Go to step 2 Not mentioned in this talk, but… Partitioned-by-Observation model Epsilon-greedy action selection Q-learning can be treated in a similar manner Jun Sakuma
Experiment: Load balancing among factories Job is assigned w.p. pin aB=“no redirect” SA=5 SB=2 aA=“redirect” Job is processed w.p. pout Setting State space: SA,SB∊{0,1,…,5} Action space: AA,AB∊{redirect, no redirect} Reward: Cost for backlog : rA= 50-(sA)2 Cost for redirection: rA = rA -2 Cost for overflow: rA=0 Reward rB is set similarly System reward: rt= rtA + rtB Regular RL/PPRL DRL(rewards are shared) IDRL(no sharing) Jun Sakuma
Experiment: Load balancing among factories Job is assigned w.p. pin Comparison aB=“no redirect” SA=5 SB=2 aA=“redirect” Job is processed w.p. pout * Java 1.5.0+Fairplay, 1.2 GHz Core solo detail comp(sec)* profit privacy CRL All information shared 5.11 90.0 Disclosed all DRL Rewards are shared 5.24 87.4 Partially disclosed IDRL No sharing 5.81 84.2 Perfect PPRL Security protocol 8.85*105 RL/SFE SFE[Yao86] >7.00*107 Jun Sakuma
Summary Reinforcement Learning from private observations Achieve optimality as regular RL does Privacy preservation is guaranteed theoretically Computational load is higher than regular RL, but works efficiently with 36 state/4 action problem Future works Scalability Treatment of agents with competing reward functions Game theoretic analysis Jun Sakuma
Thank you!
Step 2-3: Private Action Selection (greedy) Bob: Observe state state st, reward rt Bob: For all a, compute random shares of c’(st, a) = c(st, a)・encpk(-rB(st, a)) and send them Alice For all a, compute rA(st, a)=decsk(c’(st, a)) Bob and Alice: Run private comparison of random shares to learn greedy action at Environment state st, reward rt Alice action at Bob Alice’s (st,at,rt) Bob’s (st,at,rt) … t=0 t=T1 t=T1+T2 t=T1+T2+T3 a1 a2 s1 rA(s1,a1) rA (s1,a2) a1 a2 s1 rB(s1,a1) rB(s1,a2) a1 a2 s1 c(s1,a1) c (s1,a2) s2 c (s2,a1) c (s2,a2) Private comparison decrypt c’(st, a) = c(st, a)・encpk(-rB(st, a)) a1 a2 s1 c’(s1 a1) c (s1,a2) a1 a2 s1 c’(s1 a1) c (s1,a2)
Distributed Reinforcement Learning Environment state sA, reward rA state sB, reward rB action aA action aB Alice Bob (sA, rA , aA) (sB, rB , aB) Distributed Value Function [Schneider99] Manage huge state-action space Suppress the memory consumption Policy gradient approach [Peshkin00][Moallemi03][Bagnell05] Limit the communication DRL learns good, but sub-optimal, policies with minimal or limited sharing of agents’ perceptions