Privacy-preserving Reinforcement Learning

Presentation transcript:

Privacy-preserving Reinforcement Learning
Jun Sakuma, Shigenobu Kobayashi (Tokyo Inst. of Tech.); Rebecca N. Wright (Rutgers Univ.)

Motivating application: Load balancing
(Figure: two factories, each with its own order, production, and shipment flow; jobs are redirected to the other factory when one is heavily loaded.)
Load balancing among competing factories:
- Factories obtain a reward by processing a job, but suffer a large penalty if an overflow happens.
- Factories may need to redirect jobs to the other factory when heavily loaded.
- When should a factory redirect jobs to the other factory?

Motivating application: Load balancing (continued)
If the two factories are competing:
- The frequency of orders and the speed of production are private (private model).
- The backlog is private (private state observation).
- The profit is private (private reward).
Privacy-preserving reinforcement learning: states, actions, and rewards are not shared, but the learned policy is shared in the end.

Definition of privacy: partitioned-by-time model
- The agents share the state space, the action space, and the reward function.
- The agents cannot interact with the environment simultaneously.
(Figure: Alice and Bob take turns interacting with the environment, each observing state s_t and reward r_t and issuing action a_t; Alice holds the triples (s_t, a_t, r_t) for t = 0 to T1, Bob for T1 to T1+T2, Alice again for T1+T2 to T1+T2+T3, and so on.)

Definition of privacy: partitioned-by-observation model
- The state spaces and action spaces of the agents are mutually exclusive.
- The agents interact with the environment simultaneously.
(Figure: at each step Alice observes state s^A_t and reward r^A_t and issues action a^A_t, while Bob observes s^B_t and r^B_t and issues a^B_t; Alice's perception is the sequence (s^A_1, a^A_1, r^A_1), ..., (s^A_t, a^A_t, r^A_t), and Bob's is (s^B_1, a^B_1, r^B_1), ..., (s^B_t, a^B_t, r^B_t).)

Are existing RL methods privacy-preserving?
- Centralized RL (CRL): a leader agent collects all information and learns.
- Distributed RL (DRL) [Schneider99][Ng05]: each distributed agent shares partial observations and learns.
- Independent DRL (IDRL): each agent learns independently.

         Optimality   Privacy
  CRL    optimal      disclosed
  DRL    medium       partly disclosed
  IDRL   bad          preserved
  PPRL   optimal      preserved

Target: achieve privacy preservation without sacrificing optimality.

Privacy-preserving Reinforcement Learning: Algorithm
Tabular SARSA learning with epsilon-greedy action selection. Overview:
(Step 1) Initialization of Q-values (building block 1: homomorphic cryptosystem)
(Step 2) Observation from the environment
(Step 3) Private action selection (building block 2: random shares; building block 3: private comparison by secure function evaluation)
(Step 4) Private update of Q-values; go to step 2.
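For reference, the learner underneath the protocol is ordinary tabular SARSA with epsilon-greedy action selection. The Python sketch below shows that non-private loop, with comments marking which protocol step each part corresponds to; the env object and its reset/step/actions interface are hypothetical placeholders, not something defined on the slides.

import random
from collections import defaultdict

def sarsa(env, alpha=0.1, gamma=0.95, epsilon=0.1, episodes=500):
    Q = defaultdict(float)                       # Step 1: the Q-table (kept encrypted in PPRL)

    def select_action(s):                        # Step 3: done by private comparison in PPRL
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()                          # Step 2: observation from the environment
        a = select_action(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)        # Step 2: observation from the environment
            a_next = select_action(s_next)
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # Step 4: homomorphic update in PPRL
            s, a = s_next, a_next
    return Q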

Building block: Homomorphic public-key cryptosystem
- A key pair (pk, sk) of public and secret keys.
- Encryption: c = e_pk(m; r), where m is an integer plaintext and r is a random integer.
- Decryption: m = d_sk(c).
A public-key cryptosystem is additively homomorphic when:
- Addition of ciphertexts: e_pk(m1 + m2; r1 + r2) = e_pk(m1; r1) * e_pk(m2; r2)
- Multiplication by a known constant: e_pk(k*m; k*r) = e_pk(m; r)^k
The Paillier cryptosystem [Pai99] is homomorphic.
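The two identities above are exactly what the protocol exploits. Below is a minimal, insecure toy Paillier sketch in Python (tiny hard-coded primes, purely for illustration; a real deployment would use a vetted library and large keys) that checks both identities. Later sketches in this transcript reuse its encrypt and decrypt.

import math, random

p, q = 1009, 3643                    # toy primes; a real key uses primes of ~1024 bits
n, n2 = p * q, (p * q) ** 2
g = n + 1                            # standard Paillier generator choice
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
mu = pow(lam, -1, n)                 # valid because g = n + 1

def encrypt(m, r=None):
    while r is None or math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m % n, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

c1, c2 = encrypt(7), encrypt(35)
assert decrypt((c1 * c2) % n2) == 42      # e(m1) * e(m2) decrypts to m1 + m2
assert decrypt(pow(c1, 6, n2)) == 42      # e(m)^k decrypts to k * m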

Building block: Random shares
Secret: x; public modulus: N. Alice holds random share a and Bob holds random share b. (a, b) are random shares of x when a and b are distributed uniformly at random subject to a + b = x mod N.
Example: x = 6, N = 23, a = 15, b = 14; indeed 15 + 14 = 29 = 6 mod 23.
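A minimal Python sketch of this splitting, with illustrative function names (split, reconstruct); it reproduces the x = 6, N = 23 example on the slide.

import random

def split(x, N):
    a = random.randrange(N)          # Alice's share: uniform over Z_N
    b = (x - a) % N                  # Bob's share: whatever makes a + b = x (mod N)
    return a, b

def reconstruct(a, b, N):
    return (a + b) % N

a, b = split(6, 23)                  # e.g. a = 15, b = 14 reproduces the slide's example
assert reconstruct(a, b, 23) == 6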

Building block: Private comparison
Secure function evaluation (SFE) [Yao86] lets two parties evaluate a specified function f on their private inputs; after the SFE, nothing about the inputs is revealed beyond the output. Private comparison is the special case where Alice's private input is x, Bob's is y, and the output is 0 if x > y and 1 otherwise.
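For intuition, here is the ideal functionality that the secure comparison realizes, written as a trusted-party stand-in in Python. This is not a secure implementation; in the protocol it is evaluated with Yao-style garbled circuits so that only the output bit (or, for action selection, only the winning index) is revealed.

def ideal_private_comparison(x, y):
    # Only this single bit is revealed; neither party learns the other's input.
    return 0 if x > y else 1

def ideal_greedy_from_shares(alice_shares, bob_shares, N):
    # In the protocol each compared value is first reconstructed from its two
    # random shares inside the evaluated circuit: value = (a + b) mod N.
    values = [(a + b) % N for a, b in zip(alice_shares, bob_shares)]
    return max(range(len(values)), key=lambda i: values[i])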

Privacy-preserving Reinforcement Learning: Protocol for the partitioned-by-time model
(Step 1) Initialization of Q-values (building block 1: homomorphic cryptosystem)
(Step 2) Observation from the environment
(Step 3) Private action selection (building block 2: random shares; building block 3: private comparison by SFE)
(Step 4) Private update of Q-values; go to step 2.

Step 1: Initialization of Q-values
- Alice learns Q-values Q(s, a) by interacting with the environment from t = 0 to T1.
- Alice generates a key pair (pk, sk).
- Alice computes c(s, a) = enc_pk(Q(s, a)) for every entry and sends the encrypted Q-values to Bob.
(Figure: Alice's Q-table over states s1, s2 and actions a1, a2 is encrypted entrywise; Bob receives the table of ciphertexts c(s, a).)
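A sketch of Step 1 using the toy Paillier functions from the earlier sketch. The SCALE constant is an illustrative fixed-point encoding assumption (Paillier plaintexts are integers mod n), not a value taken from the slides.

SCALE = 100                                    # illustrative fixed-point precision

def encrypt_q_table(Q):
    """Q maps (state, action) -> float; returns a map (state, action) -> ciphertext."""
    return {sa: encrypt(int(round(q * SCALE))) for sa, q in Q.items()}

Q = {("s1", "a1"): 1.25, ("s1", "a2"): 0.50,
     ("s2", "a1"): 0.75, ("s2", "a2"): 2.00}
c = encrypt_q_table(Q)                         # Bob stores c; Alice keeps the secret key sk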

Step 2-3: Private action selection (greedy)
- Bob: observe state s_t and reward r_t.
- Bob: for all a, compute random shares of Q(s_t, a) and send Alice her shares rA(s_t, a), keeping the shares rB(s_t, a).
- Bob and Alice: run the private comparison on the random shares to learn the greedy action a_t.
(Figure: the encrypted row c(s_t, .) of the Q-table is split into Alice's share table rA and Bob's share table rB; the private comparison picks the action with the larger Q-value.)

Step 4: Private update of Q-values
After the greedy action selection, Bob observes the reward r_t and the next state s_{t+1}. How can Bob apply the regular SARSA update to the encrypted Q-value c(s_t, a_t) from (s_t, a_t, r_t, s_{t+1})?
(Slide equations: the Q-values and the update coefficients are scaled by public integer constants K and L and then encrypted, so the SARSA update becomes an integer-linear combination of ciphertexts.)
Since that combination involves only quantities Bob knows in the clear (r_t and the public constants) or holds as ciphertexts, the homomorphic properties above let Bob update c(s, a) without any knowledge of Q(s, a).
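Building on the summary above, the sketch below reconstructs one plausible form of the homomorphic SARSA update, reusing the toy Paillier functions and the fixed-point idea; K, L, ALPHA, GAMMA and the exact scaling are illustrative assumptions, not the paper's precise encoding.

K, L = 100, 100                                    # illustrative fixed-point scales
ALPHA, GAMMA = 0.1, 0.9                            # learning rate and discount factor

def homomorphic_sarsa_update(c, s, a, r, s_next, a_next):
    """Bob updates the ciphertext c[(s, a)] in place; all plaintext Q-values stay hidden."""
    w_keep   = round((1 - ALPHA) * L)              # weight on the old c(s, a)
    w_reward = round(ALPHA * L)                    # weight on Bob's known reward r
    w_next   = round(ALPHA * GAMMA * L)            # weight on c(s_next, a_next)
    c_new = (pow(c[(s, a)], w_keep, n2)
             * encrypt(w_reward * int(round(r * K)))
             * pow(c[(s_next, a_next)], w_next, n2)) % n2
    # The plaintext of c_new is L * K * [(1-alpha)*Q(s,a) + alpha*(r + gamma*Q(s',a'))];
    # the actual protocol keeps track of (and rescales) this growing factor.
    c[(s, a)] = c_new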

Privacy-preserving Reinforcement Learning: Protocol overview (recap)
(Step 1) Initialization of Q-values
(Step 2) Observation from the environment
(Step 3) Private action selection
(Step 4) Private update of Q-values; go to step 2.
Not covered in this talk, but handled in a similar manner: the partitioned-by-observation model, epsilon-greedy action selection, and Q-learning.

Experiment: Load balancing among factories
(Figure: jobs arrive at each factory with probability p_in and are processed with probability p_out; in the illustrated state Alice's backlog is s^A = 5 and she chooses a^A = "redirect", while Bob's backlog is s^B = 2 and he chooses a^B = "no redirect".)
Setting:
- State space: s^A, s^B ∈ {0, 1, ..., 5} (the backlog).
- Action space: a^A, a^B ∈ {redirect, no redirect}.
- Reward: cost for backlog, r^A = 50 - (s^A)^2; cost for redirection, r^A is reduced by 2; cost for overflow, r^A = 0. The reward r^B is set similarly.
- System reward: r_t = r^A_t + r^B_t.
Compared methods: regular RL / PPRL, DRL (rewards are shared), and IDRL (no sharing).
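The reward model on this slide, written out in Python for one factory; the function names and the overflow flag are illustrative, but the arithmetic follows the slide.

def factory_reward(backlog, redirected, overflow):
    if overflow:
        return 0                    # an overflow wipes out the reward
    r = 50 - backlog ** 2           # larger backlogs earn less
    if redirected:
        r -= 2                      # small cost for redirecting a job
    return r

def system_reward(r_a, r_b):
    return r_a + r_b                # the learner optimizes the joint reward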

Experiment: Load balancing among factories (comparison)
Computation time measured with Java 1.5.0 + Fairplay on a 1.2 GHz Core Solo.

  Method   Detail                   Comp. (sec)    Profit   Privacy
  CRL      All information shared   5.11           90.0     Disclosed all
  DRL      Rewards are shared       5.24           87.4     Partially disclosed
  IDRL     No sharing               5.81           84.2     Perfect
  PPRL     Security protocol        8.85 * 10^5
  RL/SFE   SFE [Yao86]              > 7.00 * 10^7

Summary
Reinforcement learning from private observations:
- Achieves the same optimality as regular RL.
- Privacy preservation is guaranteed theoretically.
- The computational load is higher than regular RL, but the protocol runs efficiently on a 36-state / 4-action problem.
Future work: scalability, treatment of agents with competing reward functions, and game-theoretic analysis.

Thank you!

Step 2-3: Private action selection (greedy), in detail
- Bob: observe state s_t and reward r_t.
- Bob: for all a, pick a random share rB(s_t, a), compute c'(s_t, a) = c(s_t, a) * enc_pk(-rB(s_t, a)), and send the c'(s_t, a) to Alice.
- Alice: for all a, compute her share rA(s_t, a) = dec_sk(c'(s_t, a)).
- Bob and Alice: run the private comparison on the random shares to learn the greedy action a_t.
(Figure: the encrypted Q-table row is masked by Bob and decrypted by Alice into her share table rA, while Bob keeps his share table rB; the two share tables feed the private comparison.)
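A sketch of this masking trick using the toy Paillier functions from earlier; bob_mask and alice_unmask are illustrative names. Bob subtracts his random share inside the ciphertext, so decryption gives Alice only her share rA(s, a), never the Q-value itself.

import random

def bob_mask(c, s, actions):
    """Bob: mask each encrypted Q-value with a fresh random share rB(s, a)."""
    rB, masked = {}, {}
    for a in actions:
        rB[a] = random.randrange(n)                        # Bob's share
        masked[a] = (c[(s, a)] * encrypt(-rB[a])) % n2     # c'(s, a), sent to Alice
    return rB, masked

def alice_unmask(masked):
    """Alice: decrypt c'(s, a) to obtain her shares rA(s, a)."""
    return {a: decrypt(ca) for a, ca in masked.items()}

# Alice and Bob then run the private comparison on (rA[a] + rB[a]) mod n,
# which equals the (fixed-point) Q-value, to learn only the greedy action.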

Distributed Reinforcement Learning
(Figure: Alice and Bob each observe their own state and reward, issue their own actions, and keep their own perception (s, r, a).)
- Distributed value functions [Schneider99]: manage a huge state-action space and suppress memory consumption.
- Policy gradient approaches [Peshkin00][Moallemi03][Bagnell05]: limit the communication.
DRL learns good, but sub-optimal, policies with minimal or limited sharing of the agents' perceptions.