Privacy-preserving Reinforcement Learning

Presentation transcript:

Privacy-preserving Reinforcement Learning
Jun Sakuma, Shigenobu Kobayashi (Tokyo Inst. of Tech.); Rebecca N. Wright (Rutgers Univ.)

Motivating application: Load balancing
(Figure: two factories, each with its own order, production, and shipment flow; jobs are redirected to the other factory when one is heavily loaded.)
Load balancing among competing factories:
- Factories obtain a reward by processing a job, but suffer a large penalty if an overflow happens.
- Factories may need to redirect jobs to the other factory when heavily loaded.
- When should a factory redirect jobs to the other factory?

Motivating application: Load balancing (continued)
If the two factories are competing:
- The frequency of orders and the speed of production are private (private model).
- The backlog is private (private state observation).
- The profit is private (private reward).
Privacy-preserving reinforcement learning: states, actions, and rewards are not shared, but the learned policy is shared in the end.

Definition of privacy: partitioned-by-time model
- The agents share the state space, the action space, and the reward function.
- The agents cannot interact with the environment simultaneously.
(Figure: Alice and Bob take turns interacting with the environment, each observing state s_t and reward r_t and issuing action a_t; Alice holds the triples (s_t, a_t, r_t) for t = 0 to T1, Bob for T1 to T1+T2, Alice again for T1+T2 to T1+T2+T3, and so on.)

Definition of privacy: partitioned-by-observation model
- The state spaces and action spaces of the agents are mutually exclusive.
- The agents interact with the environment simultaneously.
(Figure: at each step Alice observes state s^A_t and reward r^A_t and issues action a^A_t, while Bob observes s^B_t and r^B_t and issues a^B_t; Alice's perception is the sequence (s^A_1, a^A_1, r^A_1), ..., (s^A_t, a^A_t, r^A_t), and Bob's is (s^B_1, a^B_1, r^B_1), ..., (s^B_t, a^B_t, r^B_t).)

Are existing RL methods privacy-preserving?
- Centralized RL (CRL): a leader agent collects all information and learns.
- Distributed RL (DRL) [Schneider99][Ng05]: each distributed agent shares partial observations and learns.
- Independent DRL (IDRL): each agent learns independently.

         Optimality   Privacy
  CRL    optimal      disclosed
  DRL    medium       partly disclosed
  IDRL   bad          preserved
  PPRL   optimal      preserved

Target: achieve privacy preservation without sacrificing optimality.

Privacy-preserving Reinforcement Learning: Algorithm
Tabular SARSA learning with epsilon-greedy action selection. Overview:
(Step 1) Initialization of Q-values (building block 1: homomorphic cryptosystem)
(Step 2) Observation from the environment
(Step 3) Private action selection (building block 2: random shares; building block 3: private comparison by secure function evaluation)
(Step 4) Private update of Q-values; go to step 2.
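For reference, the learner underneath the protocol is ordinary tabular SARSA with epsilon-greedy action selection. The Python sketch below shows that non-private loop, with comments marking which protocol step each part corresponds to; the env object and its reset/step/actions interface are hypothetical placeholders, not something defined on the slides.

import random
from collections import defaultdict

def sarsa(env, alpha=0.1, gamma=0.95, epsilon=0.1, episodes=500):
    Q = defaultdict(float)                       # Step 1: the Q-table (kept encrypted in PPRL)

    def select_action(s):                        # Step 3: done by private comparison in PPRL
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()                          # Step 2: observation from the environment
        a = select_action(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)        # Step 2: observation from the environment
            a_next = select_action(s_next)
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # Step 4: homomorphic update in PPRL
            s, a = s_next, a_next
    return Q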

Building block: Homomorphic public-key cryptosystem
- A key pair (pk, sk) of public and secret keys.
- Encryption: c = e_pk(m; r), where m is an integer plaintext and r is a random integer.
- Decryption: m = d_sk(c).
A public-key cryptosystem is additively homomorphic when:
- Addition of ciphertexts: e_pk(m1 + m2; r1 + r2) = e_pk(m1; r1) * e_pk(m2; r2)
- Multiplication by a known constant: e_pk(k*m; k*r) = e_pk(m; r)^k
The Paillier cryptosystem [Pai99] is homomorphic.
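The two identities above are exactly what the protocol exploits. Below is a minimal, insecure toy Paillier sketch in Python (tiny hard-coded primes, purely for illustration; a real deployment would use a vetted library and large keys) that checks both identities. Later sketches in this transcript reuse its encrypt and decrypt.

import math, random

p, q = 1009, 3643                    # toy primes; a real key uses primes of ~1024 bits
n, n2 = p * q, (p * q) ** 2
g = n + 1                            # standard Paillier generator choice
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
mu = pow(lam, -1, n)                 # valid because g = n + 1

def encrypt(m, r=None):
    while r is None or math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m % n, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

c1, c2 = encrypt(7), encrypt(35)
assert decrypt((c1 * c2) % n2) == 42      # e(m1) * e(m2) decrypts to m1 + m2
assert decrypt(pow(c1, 6, n2)) == 42      # e(m)^k decrypts to k * m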

Building block: Random shares
Secret: x; public modulus: N. Alice holds random share a and Bob holds random share b. (a, b) are random shares of x when a and b are distributed uniformly at random subject to a + b = x mod N.
Example: x = 6, N = 23, a = 15, b = 14; indeed 15 + 14 = 29 = 6 mod 23.
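A minimal Python sketch of this splitting, with illustrative function names (split, reconstruct); it reproduces the x = 6, N = 23 example on the slide.

import random

def split(x, N):
    a = random.randrange(N)          # Alice's share: uniform over Z_N
    b = (x - a) % N                  # Bob's share: whatever makes a + b = x (mod N)
    return a, b

def reconstruct(a, b, N):
    return (a + b) % N

a, b = split(6, 23)                  # e.g. a = 15, b = 14 reproduces the slide's example
assert reconstruct(a, b, 23) == 6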

Building block: Private comparison
Secure function evaluation (SFE) [Yao86] lets two parties evaluate a specified function f on their private inputs; after the SFE, nothing about the inputs is revealed beyond the output. Private comparison is the special case where Alice's private input is x, Bob's is y, and the output is 0 if x > y and 1 otherwise.
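For intuition, here is the ideal functionality that the secure comparison realizes, written as a trusted-party stand-in in Python. This is not a secure implementation; in the protocol it is evaluated with Yao-style garbled circuits so that only the output bit (or, for action selection, only the winning index) is revealed.

def ideal_private_comparison(x, y):
    # Only this single bit is revealed; neither party learns the other's input.
    return 0 if x > y else 1

def ideal_greedy_from_shares(alice_shares, bob_shares, N):
    # In the protocol each compared value is first reconstructed from its two
    # random shares inside the evaluated circuit: value = (a + b) mod N.
    values = [(a + b) % N for a, b in zip(alice_shares, bob_shares)]
    return max(range(len(values)), key=lambda i: values[i])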

Privacy-preserving Reinforcement Learning: Protocol for the partitioned-by-time model
(Step 1) Initialization of Q-values (building block 1: homomorphic cryptosystem)
(Step 2) Observation from the environment
(Step 3) Private action selection (building block 2: random shares; building block 3: private comparison by SFE)
(Step 4) Private update of Q-values; go to step 2.

Step 1: Initialization of Q-values
- Alice learns Q-values Q(s, a) by interacting with the environment from t = 0 to T1.
- Alice generates a key pair (pk, sk).
- Alice computes c(s, a) = enc_pk(Q(s, a)) for every entry and sends the encrypted Q-values to Bob.
(Figure: Alice's Q-table over states s1, s2 and actions a1, a2 is encrypted entrywise; Bob receives the table of ciphertexts c(s, a).)
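A sketch of Step 1 using the toy Paillier functions from the earlier sketch. The SCALE constant is an illustrative fixed-point encoding assumption (Paillier plaintexts are integers mod n), not a value taken from the slides.

SCALE = 100                                    # illustrative fixed-point precision

def encrypt_q_table(Q):
    """Q maps (state, action) -> float; returns a map (state, action) -> ciphertext."""
    return {sa: encrypt(int(round(q * SCALE))) for sa, q in Q.items()}

Q = {("s1", "a1"): 1.25, ("s1", "a2"): 0.50,
     ("s2", "a1"): 0.75, ("s2", "a2"): 2.00}
c = encrypt_q_table(Q)                         # Bob stores c; Alice keeps the secret key sk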

Step 2-3: Private action selection (greedy)
- Bob: observe state s_t and reward r_t.
- Bob: for all a, compute random shares of Q(s_t, a) and send Alice her shares rA(s_t, a), keeping the shares rB(s_t, a).
- Bob and Alice: run the private comparison on the random shares to learn the greedy action a_t.
(Figure: the encrypted row c(s_t, .) of the Q-table is split into Alice's share table rA and Bob's share table rB; the private comparison picks the action with the larger Q-value.)

Step 4: Private update of Q-values
After the greedy action selection, Bob observes the reward r_t and the next state s_{t+1}. How can Bob apply the regular SARSA update to the encrypted Q-value c(s_t, a_t) from (s_t, a_t, r_t, s_{t+1})?
(Slide equations: the Q-values and the update coefficients are scaled by public integer constants K and L and then encrypted, so the SARSA update becomes an integer-linear combination of ciphertexts.)
Since that combination involves only quantities Bob knows in the clear (r_t and the public constants) or holds as ciphertexts, the homomorphic properties above let Bob update c(s, a) without any knowledge of Q(s, a).
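Building on the summary above, the sketch below reconstructs one plausible form of the homomorphic SARSA update, reusing the toy Paillier functions and the fixed-point idea; K, L, ALPHA, GAMMA and the exact scaling are illustrative assumptions, not the paper's precise encoding.

K, L = 100, 100                                    # illustrative fixed-point scales
ALPHA, GAMMA = 0.1, 0.9                            # learning rate and discount factor

def homomorphic_sarsa_update(c, s, a, r, s_next, a_next):
    """Bob updates the ciphertext c[(s, a)] in place; all plaintext Q-values stay hidden."""
    w_keep   = round((1 - ALPHA) * L)              # weight on the old c(s, a)
    w_reward = round(ALPHA * L)                    # weight on Bob's known reward r
    w_next   = round(ALPHA * GAMMA * L)            # weight on c(s_next, a_next)
    c_new = (pow(c[(s, a)], w_keep, n2)
             * encrypt(w_reward * int(round(r * K)))
             * pow(c[(s_next, a_next)], w_next, n2)) % n2
    # The plaintext of c_new is L * K * [(1-alpha)*Q(s,a) + alpha*(r + gamma*Q(s',a'))];
    # the actual protocol keeps track of (and rescales) this growing factor.
    c[(s, a)] = c_new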

Privacy-preserving Reinforcement Learning: Protocol overview (recap)
(Step 1) Initialization of Q-values
(Step 2) Observation from the environment
(Step 3) Private action selection
(Step 4) Private update of Q-values; go to step 2.
Not covered in this talk, but handled in a similar manner: the partitioned-by-observation model, epsilon-greedy action selection, and Q-learning.

Experiment: Load balancing among factories
(Figure: jobs arrive at each factory with probability p_in and are processed with probability p_out; in the illustrated state Alice's backlog is s^A = 5 and she chooses a^A = "redirect", while Bob's backlog is s^B = 2 and he chooses a^B = "no redirect".)
Setting:
- State space: s^A, s^B ∈ {0, 1, ..., 5} (the backlog).
- Action space: a^A, a^B ∈ {redirect, no redirect}.
- Reward: cost for backlog, r^A = 50 - (s^A)^2; cost for redirection, r^A is reduced by 2; cost for overflow, r^A = 0. The reward r^B is set similarly.
- System reward: r_t = r^A_t + r^B_t.
Compared methods: regular RL / PPRL, DRL (rewards are shared), and IDRL (no sharing).
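The reward model on this slide, written out in Python for one factory; the function names and the overflow flag are illustrative, but the arithmetic follows the slide.

def factory_reward(backlog, redirected, overflow):
    if overflow:
        return 0                    # an overflow wipes out the reward
    r = 50 - backlog ** 2           # larger backlogs earn less
    if redirected:
        r -= 2                      # small cost for redirecting a job
    return r

def system_reward(r_a, r_b):
    return r_a + r_b                # the learner optimizes the joint reward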

Experiment: Load balancing among factories (comparison)
Computation time measured with Java 1.5.0 + Fairplay on a 1.2 GHz Core Solo.

  Method   Detail                   Comp. (sec)    Profit   Privacy
  CRL      All information shared   5.11           90.0     Disclosed all
  DRL      Rewards are shared       5.24           87.4     Partially disclosed
  IDRL     No sharing               5.81           84.2     Perfect
  PPRL     Security protocol        8.85 * 10^5
  RL/SFE   SFE [Yao86]              > 7.00 * 10^7

Summary
Reinforcement learning from private observations:
- Achieves the same optimality as regular RL.
- Privacy preservation is guaranteed theoretically.
- The computational load is higher than regular RL, but the protocol runs efficiently on a 36-state / 4-action problem.
Future work: scalability, treatment of agents with competing reward functions, and game-theoretic analysis.

Thank you!

Step 2-3: Private action selection (greedy), in detail
- Bob: observe state s_t and reward r_t.
- Bob: for all a, pick a random share rB(s_t, a), compute c'(s_t, a) = c(s_t, a) * enc_pk(-rB(s_t, a)), and send the c'(s_t, a) to Alice.
- Alice: for all a, compute her share rA(s_t, a) = dec_sk(c'(s_t, a)).
- Bob and Alice: run the private comparison on the random shares to learn the greedy action a_t.
(Figure: the encrypted Q-table row is masked by Bob and decrypted by Alice into her share table rA, while Bob keeps his share table rB; the two share tables feed the private comparison.)
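A sketch of this masking trick using the toy Paillier functions from earlier; bob_mask and alice_unmask are illustrative names. Bob subtracts his random share inside the ciphertext, so decryption gives Alice only her share rA(s, a), never the Q-value itself.

import random

def bob_mask(c, s, actions):
    """Bob: mask each encrypted Q-value with a fresh random share rB(s, a)."""
    rB, masked = {}, {}
    for a in actions:
        rB[a] = random.randrange(n)                        # Bob's share
        masked[a] = (c[(s, a)] * encrypt(-rB[a])) % n2     # c'(s, a), sent to Alice
    return rB, masked

def alice_unmask(masked):
    """Alice: decrypt c'(s, a) to obtain her shares rA(s, a)."""
    return {a: decrypt(ca) for a, ca in masked.items()}

# Alice and Bob then run the private comparison on (rA[a] + rB[a]) mod n,
# which equals the (fixed-point) Q-value, to learn only the greedy action.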

Distributed Reinforcement Learning
(Figure: Alice and Bob each observe their own state and reward, issue their own actions, and keep their own perception (s, r, a).)
- Distributed value functions [Schneider99]: manage a huge state-action space and suppress memory consumption.
- Policy gradient approaches [Peshkin00][Moallemi03][Bagnell05]: limit the communication.
DRL learns good, but sub-optimal, policies with minimal or limited sharing of the agents' perceptions.