Background on security
Definition of security
- Attacker’s knowledge/capability
  - Ciphertext-only attack (COA): the attacker observes only a set of encrypted values
    - Suitable for most real-life applications
  - Chosen-plaintext attack (CPA): the attacker can generate the encrypted value of any plaintext of his choice
    - Baseline for public-key cryptosystems: the attacker can use the public key to generate as many ciphertexts as he wants
- Attacker’s goal
  - To derive information about the plaintext; any information counts (semantic security)
    - E.g., knowing that one’s salary is > 50k/month, without the exact value, may already be a security concern
    - (A malicious data mining service provider)
  - To return a wrong answer to the user (integrity)
Some facts
- There isn’t really a formal method to prove security against COA
  - People prefer provable security
- There is always a brute-force attack w.r.t. CPA
  - Try all the keys and find the one that matches all plaintext-ciphertext pairs
- Security under CPA means that the attack is a (proven) hard problem
Views from crypto
- We do not know what the attacker knows
- Better to prepare for the worst
- Require provable semantic security under a strong attack model (at least CPA)
Semantic security
- Definition: no information about the plaintext (except its size) is leaked to the attacker
- A proven equivalent definition: indistinguishability (IND)
  - Given the encryption of one of two known messages, the attacker cannot tell which one was encrypted
- Remark: semantic security under CPA is often written as IND-CPA
Security game
- IND-XXX can be modeled as a game
  1. The attacker generates two messages m0 and m1 and sends them to the key owner
  2. The key owner randomly chooses one message mi and encrypts it: c = E(mi)
  3. Given c, the attacker guesses which plaintext message c corresponds to
- Secure if Pr(guess correct) <= 0.5 + ε
  - ε is a negligible value, often of the form 1/x^k
  - Note: x is a constant, k is the key length
Security vs performance
- In general (but not proven), a more secure scheme is more expensive
- Fact 1: non-deterministic encryption is required for semantic security
- Deterministic encryption
  - E(x1) = E(x2) iff x1 = x2
  - One-to-one mapping
  - Onto function most of the time
- Simple attack
  - The attacker generates g0 = E(m0), g1 = E(m1)
  - If gi = c, answer i
  - Pr(guess correct) = 100%
[Figure: one-to-one mapping from plaintexts 1, 2, 3 to ciphertexts a, b, c, d]
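The simple attack can be played out in code. The sketch below is an illustration under stated assumptions, not a scheme from these slides: the "encryption" is a toy byte-substitution cipher (deterministic and one-to-one), the attacker gets CPA access through the `encrypt` callable, and all function names and messages are hypothetical.

```python
import random
import secrets


def make_substitution_cipher(seed):
    """Build a toy deterministic cipher: a fixed random permutation of byte values."""
    rng = random.Random(seed)
    table = list(range(256))
    rng.shuffle(table)

    def encrypt(message: bytes) -> bytes:
        # Same plaintext always maps to the same ciphertext (one-to-one mapping).
        return bytes(table[b] for b in message)

    return encrypt


def attacker(encrypt, m0: bytes, m1: bytes, c: bytes) -> int:
    """CPA attacker: re-encrypt m0 and compare with the challenge c."""
    g0 = encrypt(m0)
    return 0 if c == g0 else 1


def play_round(encrypt) -> bool:
    """One round of the IND game; returns True if the attacker guesses correctly."""
    m0, m1 = b"salary=40k", b"salary=60k"
    i = secrets.randbelow(2)          # key owner's secret coin
    c = encrypt((m0, m1)[i])          # challenge ciphertext
    return attacker(encrypt, m0, m1, c) == i


def win_rate(rounds: int = 1000) -> float:
    encrypt = make_substitution_cipher(secrets.randbits(64))
    wins = sum(play_round(encrypt) for _ in range(rounds))
    return wins / rounds
```

Calling `win_rate()` returns 1.0: because the cipher is deterministic, the attacker never needs to guess.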
Security vs performance
- Non-deterministic encryption
  - One-to-many mapping
- Problem: the ciphertext is longer
  - Storage cost and processing cost are thus higher
[Figure: one-to-many mapping from plaintexts 1, 2, 3 to ciphertexts a through h]
Example
- RSA is a deterministic function
- Public key: <e, n>, private key: <d, n>
  - E(x) = x^e mod n
  - D(y) = y^d mod n
- RSA is not semantically secure
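A small numeric check makes the determinism visible. This is a sketch with toy parameters chosen for illustration (p = 61, q = 53, e = 17 are assumptions, far too small for real use); the function names are hypothetical.

```python
# Toy textbook RSA; parameters are illustrative assumptions, not from the slides.
p, q = 61, 53
n = p * q                              # modulus, 3233
e = 17                                 # public exponent
d = pow(e, -1, (p - 1) * (q - 1))      # private exponent (Python 3.8+), 2753


def rsa_encrypt(x: int) -> int:
    """E(x) = x^e mod n"""
    return pow(x, e, n)


def rsa_decrypt(y: int) -> int:
    """D(y) = y^d mod n"""
    return pow(y, d, n)


c1 = rsa_encrypt(65)
c2 = rsa_encrypt(65)
# c1 == c2: encrypting the same plaintext twice gives the same ciphertext,
# so anyone holding the public key can test a guessed plaintext.
```

Since the public key is public, the attacker in the IND game simply re-encrypts m0 and m1 and compares against the challenge.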
RSA with padding
- When the industry refers to RSA, it is actually RSA with padding
- The padding scheme is optimal asymmetric encryption padding (OAEP)
  - Proven IND-CCA2 (a high security definition)
- Example of a simpler padding
  - Encryption
    - Input: m
    - Generate random r
    - Let c = r xor m
    - Ciphertext: c||E(r)
  - Decryption
    - Input: y = c||E(r)
    - Recover r from D(E(r))
    - Decrypted message: m = c xor r
- This padding doubles the size of an encrypted value
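The simpler padding can be sketched on top of toy textbook RSA. Everything here is an illustrative assumption: the tiny parameters (p = 61, q = 53, e = 17), the 11-bit message width chosen so that r stays below n, and the function names. The ciphertext is returned as a pair rather than a concatenated string.

```python
import secrets

# Toy RSA parameters (assumptions for illustration only).
p, q = 61, 53
n = p * q
e = 17
d = pow(e, -1, (p - 1) * (q - 1))

BITS = 11  # 2**11 = 2048 < n, so every r fits under the modulus


def encrypt(m: int):
    """Return (c, E(r)) with c = r xor m and a fresh random r per call."""
    if not 0 <= m < 2 ** BITS:
        raise ValueError("message out of range for this toy example")
    r = secrets.randbelow(2 ** BITS)
    c = r ^ m
    return c, pow(r, e, n)


def decrypt(y) -> int:
    """Recover r = D(E(r)), then m = c xor r."""
    c, encrypted_r = y
    r = pow(encrypted_r, d, n)
    return c ^ r
```

Two calls to `encrypt(65)` will almost always return different pairs, which is the one-to-many behaviour the slide asks for, at the cost of a ciphertext with two components instead of one.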
Secure database (SDB) problem
- Parties: Data Owner (DO) and Service Provider (SP)
- The database should be encrypted
- The query is computed on encrypted data
- An encrypted answer is returned
[Figure: DO and SP, each labeled with DB, Query, and Answer]
(In)feasibility of IND in the SDB problem
- Security game
  1. The attacker generates two queries q0 and q1 and sends them to the DO
  2. The DO randomly chooses one query and executes it with the SP
  3. The (encrypted) result r is observed by the attacker
  4. Given r, the attacker guesses which query r corresponds to
Attacker’s strategy
- Pick q0 = “SELECT count(*)”
- Pick q1 = “SELECT *”
- If r is just a single encrypted value, it is q0
- If r is a table, it is q1
- To prevent this attack, query results must at least be indistinguishable by size
  - Each query result then has size at least Ω(n), where n is the number of tuples
  - The decryption cost for the DO is then Ω(n): no better than computing the query with a linear scan
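The size-based strategy needs no cryptanalysis at all, which a short simulation makes concrete. This is a sketch under assumptions: `run_query` is a hypothetical stand-in for the SP that returns opaque random blobs in place of real ciphertexts, and `pad_result` is one hypothetical way to make result sizes equal.

```python
import secrets


def run_query(query: str, table: list) -> list:
    """Toy SP: each result cell is an opaque 16-byte blob standing in for a ciphertext."""
    if query == "SELECT count(*)":
        return [secrets.token_bytes(16)]
    if query == "SELECT *":
        return [secrets.token_bytes(16) for _ in table]
    raise ValueError("unsupported query in this sketch")


def pad_result(result: list, n: int) -> list:
    """Pad every result to n cells so that size no longer reveals the query."""
    return result + [secrets.token_bytes(16) for _ in range(n - len(result))]


def attacker_guess(result: list) -> int:
    """Guess q0 for a single value, q1 for a table."""
    return 0 if len(result) == 1 else 1


def play(table: list, padded: bool) -> bool:
    queries = ["SELECT count(*)", "SELECT *"]
    i = secrets.randbelow(2)                 # DO's secret choice
    result = run_query(queries[i], table)
    if padded:
        result = pad_result(result, len(table))
    return attacker_guess(result) == i


table = list(range(10))
unpadded_wins = sum(play(table, padded=False) for _ in range(200))
```

Without padding the attacker wins every round. With padding every result has n cells, so the guess carries no information, but the DO now has to decrypt n cells per query.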
Remark: fully homomorphic encryption with IND-CPA in SDB
- Selection processing requires the SP to observe whether an encrypted tuple satisfies the query condition or not
- All operations expressible as circuits can be supported (AND, OR, NOT)
- All inputs and outputs are encrypted
- Cannot jump to an encrypted address
- Discussion paper: Shiyuan Wang, Divyakant Agrawal, and Amr El Abbadi. Is homomorphic encryption the holy grail for database queries on encrypted data? Technical report, Department of Computer Science, UCSB, 2012.
Implication of knowing the result of a branch operation
- Knowledge of plaintext from CPA
[Figure: an unknown process with two branches, “Jump to b” and “Jump to a”; plain data 10, 20, 21, 22, 23 on one side and 24, 27, 28, 29, 40 on the other]
Implication of knowing the result of a branch operation
- Attack: pick a = 50, b = 7
- Attacker answer: c = a
[Figure: an unknown process with two branches, “Jump to b” and “Jump to a”, and the label E(c); plain data 10, 20, 21, 22, 23 on one side and 24, 27, 28, 29, 40 on the other]
Re-writing the query may help
- Original, with a branch:
    if (x > 10) { y = 20; } else { y = 100; }
- Rewritten, without a branch:
    r = cmp_grt(x, 10)   // returns 1 if x > 10, 0 otherwise
    y = 100 - 80 * r     // r = 1 gives 20, r = 0 gives 100
- Cannot solve all problems!
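The branch-free rewrite can be checked mechanically. A minimal sketch, assuming `cmp_grt` is modeled as a plain Python comparison (in the encrypted setting it would be a primitive evaluated over ciphertexts) and using y = 100 - 80 * r, so that r = 1 gives 20 and r = 0 gives 100:

```python
def cmp_grt(x: int, threshold: int) -> int:
    """Return 1 if x > threshold, 0 otherwise (plaintext stand-in for the encrypted primitive)."""
    return int(x > threshold)


def with_branch(x: int) -> int:
    if x > 10:
        return 20
    return 100


def without_branch(x: int) -> int:
    r = cmp_grt(x, 10)
    # Arithmetic selection: no control flow depends on the comparison result.
    return 100 - 80 * r


# Both versions agree on every input in a test range around the threshold.
agree = all(with_branch(x) == without_branch(x) for x in range(-50, 50))
```

The result of the comparison only feeds arithmetic, so an SP evaluating it over encrypted values would not need to observe which side was taken.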
Leakage of knowing branch result in practice
- Assume we now allow the SP to observe the branch (i.e., comparison) results. What kind of information is leaked?
  - Locality of data
- Result of cmp(Y, E(q1)): E(t1), E(t3), E(t5), E(t7), E(t10)
- Result of cmp(Y, E(q2)): E(t1), E(t3), E(t9), E(t13)
- Derived knowledge (COA):
  1. q2 q1
  2. q2 t1[Y], t3[Y] q1
  3. t5[Y] t1[Y], t3[Y] t9[Y]
- So we protect only the exact values in our scheme, and the use of an index may make sense
Another way to prove IND (in SMC)
- Proof by simulation
- Background
  - Each party receives several messages from the other party
  - Can they use this information to learn anything about the other party?
[Figure: Alice (secret x = 3) and Bob (secret y = 7) run a secure sum; result: x + y = 10]
Simulation
- Say Bob is the attacker now
- Is there any difference in the messages Bob receives if Alice provides a different input?
  - If not, Alice’s inputs are indistinguishable to Bob
[Figure: Alice (secret x = 3) and Bob (secret y = 7) run a secure sum; result: x + y = 10]
Secure Sum
- Public parameter: n = 100
- Alice: secret x = 3; Bob: secret y = 7
- Alice
  - Generate r1 = 70
  - Send m1 = r1 + x mod n = 73
- Bob
  - Generate r2 = 50
  - Send m2 = r2 + y + m1 mod n = 30
  - Keep r2 as share: b = 50
- Alice
  - Keep m2 - r1 mod n as share: a = 60
- Result: x + y = a - b mod n = 10
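The message flow can be replayed step by step. This is a sketch of the protocol as read from the slide, with hypothetical function and key names; the optional `r1` and `r2` arguments exist only so the slide's numbers can be reproduced, and the final combination a - b mod n follows from a = r2 + x + y mod n and b = r2.

```python
import secrets

N = 100  # public parameter n


def secure_sum(x: int, y: int, r1=None, r2=None) -> dict:
    """Replay the two-message secure sum and return every intermediate value."""
    if r1 is None:
        r1 = secrets.randbelow(N)       # Alice's random mask
    if r2 is None:
        r2 = secrets.randbelow(N)       # Bob's random mask

    m1 = (r1 + x) % N                   # Alice -> Bob
    m2 = (r2 + y + m1) % N              # Bob -> Alice
    a = (m2 - r1) % N                   # Alice's share
    b = r2                              # Bob's share
    return {"m1": m1, "m2": m2, "a": a, "b": b, "result": (a - b) % N}


trace = secure_sum(3, 7, r1=70, r2=50)
# trace reproduces the slide: m1 = 73, m2 = 30, a = 60, b = 50, result = 10
```

Neither share alone reveals x + y: a is masked by r2, which only Bob knows, and b is just Bob's own random value.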
Bob’s view
- Public parameter: n = 100
- Bob: secret y = 7
- Message received by Bob: m1 = r1 + x mod n = 73
- Simulation: for any value of x
  - Generate r1’ = m1 - x mod n
  - The message m1 can be generated
- Simulation succeeds. This protocol is secure w.r.t. IND.
- Result: x + y = 10
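The simulation argument can be checked exhaustively for n = 100. In this sketch (function and variable names are hypothetical), every candidate input x for Alice is paired with the mask r1’ that would have produced the message m1 = 73 Bob actually saw.

```python
N = 100        # public parameter n
M1 = 73        # the message Bob observed


def simulate_r1(m1: int, x_candidate: int) -> int:
    """Mask r1' that makes x_candidate consistent with the observed m1."""
    return (m1 - x_candidate) % N


# One (x, r1') pair for every possible input of Alice.
consistent = [(x, simulate_r1(M1, x)) for x in range(N)]
all_explain_m1 = all((r1 + x) % N == M1 for x, r1 in consistent)
```

Every one of the 100 possible values of x explains m1 equally well, including the real pair (3, 70). Since r1 is uniform, Bob's view of m1 gives him no way to prefer one x over another.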
An example that is not secure
- Key agreement protocol
  - Public parameters: p, g
- Bob observes: YA, XB
- How to derive XA?
- Simulation fails, since there must be one specific XA such that YA = g^XA
- Note: this protocol is not meant to protect a party’s input from the other party
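The contrast with secure sum shows up in a toy group. The sketch assumes the key agreement is standard Diffie-Hellman with YA = g^XA mod p, and uses illustrative parameters p = 23, g = 5 (5 generates the multiplicative group mod 23); the brute-force search is only feasible because the group is tiny.

```python
P, G = 23, 5   # toy public parameters (assumptions for illustration)


def consistent_exponents(y_a: int) -> list:
    """All XA in 1..p-1 with g^XA mod p equal to the observed YA."""
    return [x for x in range(1, P) if pow(G, x, P) == y_a]


y_a = pow(G, 6, P)                      # Alice's public value for XA = 6
candidates = consistent_exponents(y_a)  # exactly one exponent fits
```

Here the observed YA pins down a single XA, so a simulator cannot pick an arbitrary input for Alice and still reproduce Bob's view. In a real-sized group finding that XA is computationally hard, but it is still uniquely determined, which is why the simulation argument does not go through.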
Relaxed security definition
- Also the approach of our paper
- Bounded leakage of protocols
  - Can be proven by simulation
- Used a lot by Chris Clifton from Purdue University
  - Jaideep Vaidya and Chris Clifton, Secure Set Intersection Cardinality with Application to Association Rule Mining, JCS 13(4), 2005.
  - Jaideep Vaidya and Chris Clifton, Privacy-Preserving K-Means Clustering over Vertically Partitioned Data, SIGKDD, 2003.
  - Murat Kantarcioglu and Chris Clifton, Privacy Preserving Data Mining of Association Rules on Horizontally Partitioned Data, TKDE 16(9), 2004.
Proof of relaxed definition
- Attacker’s knowledge
  - Its own input
  - Messages in the protocol
  - Leaked knowledge
- If the above is enough to simulate the execution of the protocol, no other information is leaked
- Then, argue that the leaked knowledge is not very harmful