Privacy-preserving Reinforcement Learning


1 Privacy-preserving Reinforcement Learning
Jun Sakuma, Shigenobu Kobayashi (Tokyo Institute of Technology); Rebecca N. Wright (Rutgers University)

2 Motivating application: Load balancing
[Diagram: two factories, each with an order stream, production, and shipment; jobs are redirected to the other factory when heavily loaded.]
Load balancing among competing factories:
- Factories obtain a reward by processing a job, but suffer a large penalty if an overflow happens.
- A factory may need to redirect jobs to the other factory when it is heavily loaded.
- When should factories redirect jobs to the other factory?

3 Motivating application: Load balancing
[Diagram: two competing factories; redirection when heavily loaded.]
If two factories are competing:
- The frequency of orders and the speed of production are private (private model).
- The backlog is private (private state observation).
- The profit is private (private reward).
Privacy-preserving reinforcement learning: states, actions, and rewards are never shared, but the learned policy is shared in the end.

4 Definition of Privacy: Partitioned-by-time model
- Agents share the state space, the action space, and the reward function.
- Agents cannot interact with the environment simultaneously.
[Diagram: Alice and Bob take turns interacting with the environment, each observing state s_t and reward r_t and taking action a_t; Alice's (s_t, a_t, r_t) for t = 0 to T1, Bob's for T1 to T1+T2, Alice's again for T1+T2 to T1+T2+T3.]

5 Definition of Privacy: Partitioned-by-observation model
- State spaces and action spaces are mutually exclusive between agents.
- Agents interact with the environment simultaneously.
[Diagram: from t = 0, Alice observes state s_t^A and reward r_t^A and takes action a_t^A, while Bob observes s_t^B and r_t^B and takes a_t^B; Alice's perception is (s_1^A, a_1^A, r_1^A), ..., (s_t^A, a_t^A, r_t^A) and Bob's is (s_1^B, a_1^B, r_1^B), ..., (s_t^B, a_t^B, r_t^B).]

6 Are existing RLs privacy-preserving?
Existing approaches:
- Centralized RL (CRL): a leader agent collects all agents' information and learns.
- Distributed RL (DRL) [Schneider99][Ng05]: each distributed agent shares partial observations and learns.
- Independent DRL (IDRL): each agent learns independently, with no sharing.

         | Optimality | Privacy
  CRL    | optimal    | disclosed
  DRL    | medium     | partly disclosed
  IDRL   | bad        | preserved
  PPRL   | optimal    | preserved

Target: achieve privacy preservation without sacrificing optimality.

7 Privacy-preserving Reinforcement Learning
Algorithm: tabular SARSA learning with epsilon-greedy action selection.
Overview:
(Step 1) Initialization of Q-values (building block 1: homomorphic cryptosystem)
(Step 2) Observation from the environment
(Step 3) Private action selection (building block 2: random shares; building block 3: private comparison by secure function evaluation)
(Step 4) Private update of Q-values; go to Step 2

8 Building block: Homomorphic public-key cryptosystem
- A pair of public and secret keys (pk, sk).
- Encryption: c = e_pk(m; r), where m is an integer and r is a random integer.
- Decryption: m = d_sk(c).
Homomorphic properties:
- Addition of ciphertexts: e_pk(m1 + m2; r1 + r2) = e_pk(m1; r1) · e_pk(m2; r2)
- Scalar multiplication: e_pk(k·m; k·r) = e_pk(m; r)^k
The Paillier cryptosystem [Pai99] is additively homomorphic.
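To make the homomorphic properties concrete, here is a toy Paillier-style sketch in Python. The key size, the helper names (paillier_keygen, enc, dec), and the parameter choices are illustrative and not from the slides; note also that in Paillier the randomness composes multiplicatively (r1·r2) rather than additively as in the slide's generic notation. A real deployment would use a vetted library with 2048-bit keys.

```python
import random
from math import gcd

def paillier_keygen(p=293, q=433):
    """Toy Paillier key generation (tiny primes, for illustration only)."""
    n = p * q
    nsq = n * n
    lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)    # lambda = lcm(p-1, q-1)
    g = n + 1                                        # a standard simple choice of g
    # mu = (L(g^lambda mod n^2))^-1 mod n, with L(u) = (u - 1) // n
    mu = pow((pow(g, lam, nsq) - 1) // n, -1, n)
    return (n, g), (lam, mu)

def enc(pk, m):
    """c = e_pk(m; r) = g^m * r^n mod n^2 for a random r coprime to n."""
    n, g = pk
    nsq = n * n
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, nsq) * pow(r, n, nsq)) % nsq

def dec(pk, sk, c):
    """m = d_sk(c) = L(c^lambda mod n^2) * mu mod n."""
    n, _ = pk
    lam, mu = sk
    nsq = n * n
    return (pow(c, lam, nsq) - 1) // n * mu % n

pk, sk = paillier_keygen()
c1, c2 = enc(pk, 7), enc(pk, 35)
n2 = pk[0] ** 2
assert dec(pk, sk, c1 * c2 % n2) == 42      # e(m1) * e(m2) decrypts to m1 + m2
assert dec(pk, sk, pow(c1, 3, n2)) == 21    # e(m)^k decrypts to k * m
```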

9 Building block: Random shares
Secret: x; public modulus: N.
(a, b) are random shares of x when a and b are distributed uniformly at random subject to a + b = x mod N; Alice holds a and Bob holds b.

10 Building block: Random shares
Secret: x = 6; public modulus: N = 23. Alice's random share: a = 15; Bob's random share: b = 14.
(a, b) are random shares of x when a and b are distributed uniformly at random subject to a + b = x mod N.
Example: a = 15 and b = 14 give 6 = 15 + 14 (= 29) mod 23.
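A minimal Python sketch of this splitting; the helper names (split_share, reconstruct) are illustrative.

```python
import random

def split_share(x, N):
    """Split secret x into random shares a, b with a + b = x (mod N)."""
    a = random.randrange(N)     # Alice's share: uniformly random
    b = (x - a) % N             # Bob's share completes the sum
    return a, b

def reconstruct(a, b, N):
    return (a + b) % N

a, b = split_share(6, 23)           # e.g. a = 15, b = 14, as in the example above
assert reconstruct(a, b, 23) == 6   # either share alone is just a uniform random value
```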

11 Building block: Private comparison
Secure function evaluation (SFE) [Yao86] allows two parties to evaluate a specified function f of their private inputs; after the SFE, nothing about the inputs is revealed beyond each party's own output.
Private comparison: Alice's private input is x, Bob's private input is y; the output is 0 if x > y, and 1 otherwise. In the example on the slide the output is 0.
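The following stand-in shows only the ideal functionality that the Yao-style comparison computes; it is not the cryptographic protocol itself. In this talk the private inputs are random shares of two Q-values (my reading of slides 17-21), so the circuit reconstructs them modulo the public N before comparing.

```python
def compare_ideal(x, y):
    """Output of the private comparison: 0 if x > y, else 1."""
    return 0 if x > y else 1

def compare_shared(a1, b1, a2, b2, N):
    """Compare two values given as random shares (a_i from Alice, b_i from Bob).

    In the real protocol this is evaluated inside a garbled circuit [Yao86],
    so neither the reconstructed values nor the other party's shares leak.
    """
    return compare_ideal((a1 + b1) % N, (a2 + b2) % N)
```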

12 Privacy-preserving Reinforcement Learning
Protocol for the partitioned-by-time model:
(Step 1) Initialization of Q-values (building block 1: homomorphic cryptosystem)
(Step 2) Observation from the environment
(Step 3) Private action selection (building block 2: random shares; building block 3: private comparison by secure function evaluation)
(Step 4) Private update of Q-values; go to Step 2

13-15 Step 1: Initialization of Q-values
- Alice: learn Q-values Q(s,a) from t = 0 to T1.
- Alice: generate a key pair (pk, sk).
- Alice: compute c(s,a) = enc_pk(Q(s,a)) for all (s,a) and send the encrypted table to Bob.
[Diagram: Alice interacts with the environment during t = 0 to T1; her Q-table over states s1, s2 and actions a1, a2 is encrypted entrywise into c(s,a), and the encrypted table is transferred to Bob.]
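As a small sketch of Step 1, Alice encrypts her learned Q-table entrywise with enc/pk from the Paillier sketch above. The assumption (not shown on the slide) is that the Q-values have already been rounded to non-negative integers that fit the plaintext space.

```python
# Alice's learned Q-table, encoded as non-negative integers (assumed encoding).
Q = {("s1", "a1"): 12, ("s1", "a2"): 7,
     ("s2", "a1"): 3,  ("s2", "a2"): 25}

# c(s, a) = enc_pk(Q(s, a)); only this encrypted table is sent to Bob.
c = {sa: enc(pk, q) for sa, q in Q.items()}
```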

16 Privacy-preserving Reinforcement Learning
The protocol overview:
(Step 1) Initialization of Q-values (building block 1: homomorphic cryptosystem)
(Step 2) Observation from the environment
(Step 3) Private action selection (building block 2: random shares; building block 3: private comparison by secure function evaluation)
(Step 4) Private update of Q-values; go to Step 2

17-21 Step 2-3: Private Action Selection (greedy)
- Bob: observe state s_t, reward r_t.
- Bob: for all a, compute random shares of Q(s_t, a) and send Alice's shares to her (a sketch follows below; the exact construction via the encrypted Q-values is given on slide 35).
- Bob and Alice: run private comparison on the random shares to learn the greedy action a_t, which Bob then takes.
[Diagram: Bob holds the encrypted Q-table c(s,a); the row for s_t is split into random shares r_A(s_t, a) held by Alice and r_B(s_t, a) held by Bob; private comparison of the shares selects the greedy action a_t.]
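A sketch of the greedy selection over random shares, reusing compare_shared from the private-comparison sketch above. Alice holds shares rA, Bob holds shares rB, with rA[a] + rB[a] = Q(s_t, a) mod N; in the real protocol each pairwise comparison is a secure function evaluation, so only the identity of the greedy action is revealed. The concrete numbers below are illustrative.

```python
def greedy_action(actions, rA, rB, N):
    """Return argmax_a Q(s_t, a) using only share-level comparisons."""
    best = actions[0]
    for a in actions[1:]:
        # 0 means Q(s_t, best) > Q(s_t, a); 1 means the challenger wins (or ties)
        if compare_shared(rA[best], rB[best], rA[a], rB[a], N) == 1:
            best = a
    return best

# Usage with illustrative shares of Q(s1, a1) = 12 and Q(s1, a2) = 7:
N = pk[0]                                            # plaintext modulus from the Paillier sketch
rA = {"a1": 100, "a2": 200}                          # Alice's shares
rB = {"a1": (12 - 100) % N, "a2": (7 - 200) % N}     # Bob's complementary shares
assert greedy_action(["a1", "a2"], rA, rB, N) == "a1"
```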

22 Privacy-preserving Reinforcement Learning
The protocol overview:
(Step 1) Initialization of Q-values (building block 1: homomorphic cryptosystem)
(Step 2) Observation from the environment
(Step 3) Private action selection (building block 2: random shares; building block 3: private comparison by secure function evaluation)
(Step 4) Private update of Q-values; go to Step 2

23-24 Step 4: Private Update of Q-values
- After the greedy action selection, Bob observes (r_t, s_{t+1}).
- How can Bob update the encrypted Q-value c(s_t, a_t) from (s_t, a_t, r_t, s_{t+1})? The regular SARSA update cannot be applied directly, because Bob holds only the encrypted Q-values.
[Diagram: Bob takes the greedy action a_t, observes reward r_t and next state s_{t+1}, and must update the encrypted Q-table c(s,a).]

25-29 Step 4: Private Update of Q-values
[Equation slides; surviving fragments: scaling factors ×K and ×L, "Encryption", "public", "Bob holds". Conclusion (slide 29): Bob can update c(s,a) without knowledge of Q(s,a)!]
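Since the update equations on slides 25-29 survive only as fragments, here is a hedged sketch of what a homomorphic SARSA-style update can look like, reusing enc/dec and pk/sk from the Paillier sketch above. The assumptions are mine, not the slides': the learning rate and discount are rationals alpha = A/K and gamma = G/L, Q-values are stored as non-negative integers, and the result comes out at scale K·L, which the real protocol has to keep track of (presumably what the ×K, ×L factors on these slides refer to).

```python
A, K = 1, 10     # alpha = A/K = 0.1   (assumed values)
G, L = 9, 10     # gamma = G/L = 0.9   (assumed values)

def private_sarsa_update(c_sa, c_next, r):
    """Bob combines ciphertexts only; the plaintext result, at scale K*L, is
       K*L*Q_new = (K-A)*L*Q(s,a) + A*G*Q(s',a') + A*L*r."""
    nsq = pk[0] ** 2
    return (pow(c_sa, (K - A) * L, nsq)      # (1 - alpha) * Q(s,a)
            * pow(c_next, A * G, nsq)        # alpha * gamma * Q(s',a')
            * enc(pk, A * L * r)) % nsq      # alpha * r

# Usage: Bob holds encryptions of Q(s,a) = 20 and Q(s',a') = 30 and observes r = 5.
c_sa, c_next = enc(pk, 20), enc(pk, 30)
c_new = private_sarsa_update(c_sa, c_next, r=5)
assert dec(pk, sk, c_new) == 2120            # i.e. 21.2 at scale K*L = 100
```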

30 Privacy-preserving Reinforcement Learning
The protocol overview: (Step 1) Initialization of Q-values; (Step 2) Observation from the environment; (Step 3) Private action selection; (Step 4) Private update of Q-values; go to Step 2.
Not mentioned in this talk, but the partitioned-by-observation model, epsilon-greedy action selection, and Q-learning can be treated in a similar manner.

31 Experiment: Load balancing among factories
[Diagram: jobs are assigned to each factory w.p. p_in and processed w.p. p_out; the example shows backlogs s_A = 5, s_B = 2 with a_A = "redirect" and a_B = "no redirect".]
Setting:
- State space: s_A, s_B ∈ {0, 1, ..., 5}
- Action space: a_A, a_B ∈ {redirect, no redirect}
- Reward: r_A = 50 - (s_A)^2 (cost for backlog); r_A := r_A - 2 on a redirection (cost for redirection); r_A = 0 on an overflow
- Reward r_B is set similarly; system reward r_t = r_t^A + r_t^B
Compared learners: regular RL / PPRL, DRL (rewards are shared), IDRL (no sharing)
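As a small sketch of the per-factory reward described on this slide; treating "overflow" as the backlog exceeding its maximum of 5 is my reading of the setup.

```python
def factory_reward(backlog, redirected, overflow):
    """Per-factory reward: 50 - backlog^2, minus 2 if a job was redirected,
       and 0 if an overflow occurred."""
    if overflow:
        return 0
    return 50 - backlog ** 2 - (2 if redirected else 0)

# System reward is the sum over both factories: r_t = r_t^A + r_t^B
r_system = factory_reward(5, True, False) + factory_reward(2, False, False)   # 23 + 46 = 69
```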

32 Experiment: Load balancing among factories
[Same load-balancing setting as slide 31.]
Comparison (* computation time in seconds, Java Fairplay, 1.2 GHz Core Solo):

         | detail                  | comp (sec)* | profit | privacy
  CRL    | all information shared  | 5.11        | 90.0   | all disclosed
  DRL    | rewards are shared      | 5.24        | 87.4   | partially disclosed
  IDRL   | no sharing              | 5.81        | 84.2   | perfect
  PPRL   | security protocol       | 8.85×10^5   |        |
  RL/SFE | SFE [Yao86]             | >7.00×10^7  |        |

33 Summary
Reinforcement learning from private observations:
- Achieves the same optimality as regular RL.
- Privacy preservation is guaranteed theoretically.
- Computational load is higher than regular RL, but the protocol works efficiently on a 36-state / 4-action problem.
Future work:
- Scalability
- Treatment of agents with competing reward functions
- Game-theoretic analysis

34 Thank you!

35 Step 2-3: Private Action Selection (greedy)
- Bob: observe state s_t, reward r_t.
- Bob: for all a, pick a random share r_B(s_t, a), compute c'(s_t, a) = c(s_t, a) · enc_pk(-r_B(s_t, a)), and send the c'(s_t, a) to Alice.
- Alice: for all a, compute her share r_A(s_t, a) = dec_sk(c'(s_t, a)).
- Bob and Alice: run private comparison on the random shares to learn the greedy action a_t.
[Diagram: Bob's encrypted Q-table c(s,a); the shares r_A(s_1, a) held by Alice and r_B(s_1, a) held by Bob; private comparison selects a_t.]
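A sketch of the share construction on this slide: Bob picks a random share r_B(s_t, a), uses the additive homomorphism to form c'(s_t, a) = c(s_t, a) · enc_pk(-r_B(s_t, a)), and Alice decrypts her share. Here enc/dec and pk/sk come from the Paillier sketch earlier in this transcript, and taking the share modulus to be the plaintext modulus n is an assumption.

```python
import random

def bob_make_share(c_sa):
    """Bob: pick r_B and form c'(s,a), which encrypts Q(s,a) - r_B (mod n)."""
    n = pk[0]
    r_B = random.randrange(n)
    c_prime = c_sa * enc(pk, (-r_B) % n) % (n * n)
    return r_B, c_prime                  # Bob keeps r_B, sends c_prime to Alice

def alice_make_share(c_prime):
    """Alice: r_A = dec_sk(c'(s,a)) = Q(s,a) - r_B (mod n)."""
    return dec(pk, sk, c_prime)

# Sanity check: the shares reconstruct Q(s,a), yet neither party sees it directly.
r_B, c_prime = bob_make_share(enc(pk, 12))
r_A = alice_make_share(c_prime)
assert (r_A + r_B) % pk[0] == 12
```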

36 Distributed Reinforcement Learning
[Diagram: Alice and Bob each interact with the environment, observing (s_A, r_A) and (s_B, r_B) and taking actions a_A and a_B.]
- Distributed value functions [Schneider99]: manage a huge state-action space and suppress memory consumption.
- Policy gradient approaches [Peshkin00][Moallemi03][Bagnell05]: limit the communication.
DRL learns good, but sub-optimal, policies with minimal or limited sharing of the agents' perceptions.

