1
P4P: A Practical Framework for Privacy-Preserving Distributed Computation
Yitao Duan (Advisor: Prof. John Canny)
Berkeley Institute of Design, Computer Science Division
University of California, Berkeley
11/27/2007
2
Research Goal To provide practical solutions with provable privacy and adequate efficiency in a realistic adversary model at reasonably large scale
3
Research Goal To provide practical solutions with provable privacy and adequate efficiency in a realistic adversary model at reasonably large scale Existing solutions don’t have all these at the same time.
4
Challenge: standard cryptographic tools not feasible at large scale
[Figure: users u_1, …, u_n each hold private data d_1, …, d_n; the model f computed from them must be obfuscated.]
5
A Practical Solution
Provable privacy: cryptography and statistical database security
Efficiency: minimize the number of expensive primitives and rely on probabilistic guarantees
Realistic adversary model: must handle malicious users who may try to bias the computation by submitting invalid data
Existing solutions don't have all of these at the same time.
6
Basic Approach
f = Σ g(d_i), where d_i is in D and each component g_j: D → [0, 1], j = 1, 2, …, m.
Cryptographic privacy: no leakage beyond the final result, or differential privacy under a certain condition.
[Figure: users u_1, …, u_n hold private data d_1, …, d_n and contribute g(d_1), …, g(d_n) to the sum.]
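As a minimal illustration of this template (my example, not on the slide), an m-candidate vote fits it with g mapping each ballot to an indicator vector:

```latex
% Illustration: m-candidate voting in the addition-only form f = sum_i g(d_i).
% User i's ballot is d_i in {1,...,m}; g maps it to an indicator vector.
\[
  g(d_i) = e_{d_i} \in \{0,1\}^m , \qquad
  f = \sum_{i=1}^{n} g(d_i) = \text{the vector of vote counts for the } m \text{ candidates.}
\]
```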
7
The Power of Addition
A large number of popular algorithms can be run with addition-only steps:
Linear algorithms: voting and summation
Nonlinear algorithms: regression, classification, SVD, PCA, k-means, ID3, EM, etc.
All algorithms in the statistical query model [Kearns 93]
Many other gradient-based numerical algorithms
The addition-only framework has a very efficient private cryptographic implementation and admits efficient ZKPs (a sketch of a nonlinear example follows this slide).
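To make the nonlinear case concrete, here is a minimal sketch (my example in plain Java; a plaintext loop stands in for the P4P aggregation, and all class and method names are illustrative) of one k-means iteration in which each user contributes only a vector of per-cluster coordinate sums and counts, so the only cross-user step is an addition:

```java
import java.util.Arrays;

/** Sketch: one k-means iteration expressed as an addition-only step.
 *  Each user reports g(d_i) = (per-cluster coordinate sums, per-cluster counts);
 *  the aggregator only ever adds these vectors (the P4P sum would replace
 *  the plain loop below). Names and structure are illustrative. */
public class KMeansAdditionSketch {

    /** User-side: assign own point to the nearest centroid, emit sums + counts. */
    static double[] userContribution(double[] point, double[][] centroids) {
        int k = centroids.length, dim = point.length;
        double[] g = new double[k * dim + k];          // k*dim sums, then k counts
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < k; c++) {
            double dist = 0;
            for (int j = 0; j < dim; j++) {
                double diff = point[j] - centroids[c][j];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        for (int j = 0; j < dim; j++) g[best * dim + j] = point[j];
        g[k * dim + best] = 1.0;                        // count for the chosen cluster
        return g;
    }

    /** Aggregator-side: addition only, then recompute centroids from the sums. */
    static double[][] updateCentroids(double[][] contributions, int k, int dim) {
        double[] total = new double[k * dim + k];
        for (double[] g : contributions)                // the only cross-user step is this sum
            for (int j = 0; j < total.length; j++) total[j] += g[j];
        double[][] centroids = new double[k][dim];
        for (int c = 0; c < k; c++) {
            double count = Math.max(total[k * dim + c], 1.0);   // avoid divide-by-zero for empty clusters
            for (int j = 0; j < dim; j++) centroids[c][j] = total[c * dim + j] / count;
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0, 0}, {10, 10}};
        double[][] points = {{1, 1}, {0, 2}, {9, 11}, {12, 10}};
        double[][] gs = new double[points.length][];
        for (int i = 0; i < points.length; i++) gs[i] = userContribution(points[i], centroids);
        System.out.println(Arrays.deepToString(updateCentroids(gs, 2, 2)));
    }
}
```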
8
Peers for Privacy: The Nomenclature
P4P has two meanings:
Privacy is a right that one must fight for. Some agents must act on behalf of users' privacy in the computation; we call them privacy peers.
Our method aggregates across many users' data. We can prove that the aggregation itself provides privacy: the peers' data protect each other.
9
Model
f = Σ g(d_i), computed with cryptographic privacy: no leakage beyond the final result, or differential privacy under a certain condition.
[Figure: users u_1, …, u_n hold private data d_1, …, d_n and contribute g(d_1), …, g(d_n) to the sum.]
Moving on to the private computation of the sums, there are two things to solve: (1) an efficient scheme for performing the computation, and (2) a mechanism to deal with malicious users.
10
Private Addition – P4P Style
The computation: secret sharing over a small field
Malicious users: an efficient zero-knowledge proof to bound the L2 norm of the user's vector
Lack of incentive: changing the paradigm would mean asking providers to give up their existing infrastructure.
11
Big Integers vs. Small Ones
Most applications work with "regular-sized" integers (e.g., 32- or 64-bit). Arithmetic operations are very fast when each operand fits into a single memory cell (~10^-9 sec).
Public-key operations (e.g., those used in encryption and verification) must use keys of sufficient length (typically over a thousand bits) for security. Existing private computation solutions must work with such large integers extensively (~10^-3 sec per operation).
A difference of 6 orders of magnitude!
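A rough way to see this gap for yourself (a sketch, not a rigorous benchmark; absolute numbers depend on the JVM and hardware) is to time 64-bit additions against 1024-bit modular exponentiations, the core operation of public-key schemes:

```java
import java.math.BigInteger;
import java.security.SecureRandom;

/** Rough illustration of the cost gap: 64-bit additions vs. 1024-bit modular
 *  exponentiations. Only the orders-of-magnitude contrast matters. */
public class CostGapSketch {
    public static void main(String[] args) {
        SecureRandom rnd = new SecureRandom();

        // Small-field arithmetic: many 64-bit additions.
        long acc = 0, t0 = System.nanoTime();
        for (int i = 0; i < 100_000_000; i++) acc += i;
        double nsPerAdd = (System.nanoTime() - t0) / 1e8;

        // Public-key-sized arithmetic: a few 1024-bit modular exponentiations.
        BigInteger p = BigInteger.probablePrime(1024, rnd);
        BigInteger g = new BigInteger(1023, rnd);
        long t1 = System.nanoTime();
        int reps = 100;
        for (int i = 0; i < reps; i++) g = g.modPow(p.subtract(BigInteger.ONE), p);
        double nsPerExp = (System.nanoTime() - t1) / (double) reps;

        System.out.printf("~%.2f ns per 64-bit add, ~%.0f ns per 1024-bit modexp (acc=%d)%n",
                nsPerAdd, nsPerExp, acc);
    }
}
```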
12
Private Arithmetic: Two Paradigms
Homomorphism: user data is encrypted with a public-key cryptosystem. Arithmetic on the ciphertexts mirrors arithmetic on the original data, but the server cannot decrypt partial results.
Secret sharing: the user sends shares of their data to several servers, so that no small group of servers gains any information about it.
Homomorphic encryption can support a server-based scheme.
13
Arithmetic: Homomorphism vs VSS
Homomorphism
+ Can tolerate t < n corrupted players as far as privacy is concerned
- Uses public-key crypto and works with large fields (typically over a thousand bits); roughly 10,000x more expensive than normal arithmetic (even for addition)
Secret sharing
+ Addition is essentially free; can use a field of any size
- Can't do two-party multiplication
- Most schemes also use public-key crypto for verification
- Doesn't fit well into existing service architectures
We want to use secret sharing for the computation because of its potential efficiency, but we must overcome its problems.
14
P4P: Peers for Privacy
Some parties, called privacy peers, actively participate in the computation, working for users' privacy.
Privacy peers provide privacy when they are available, but can't access the data themselves.
[Figure: users (U), privacy peer group (P), and server (S).]
15
P4P
The server provides data archival and synchronizes the protocol.
The server communicates with the privacy peers only occasionally (e.g., at 2 AM).
[Figure: users (U), privacy peer group (P), and server (S).]
16
Privacy Peers
Roles of privacy peers:
Anonymizing communication
Sharing information
Participating in the computation
Other infrastructure support
They work on behalf of users' privacy, but we need a higher level of trust in them. What can we achieve by introducing privacy peers?
17
Candidates for Privacy Peers
Some players are more trustworthy than others:
In a workplace, a union representative
In a community, a few members with good reputation
Or a third-party commercial provider
This is a very important source of security and efficiency. The key is that the privacy peers should have incentives different from the server's, with mutual distrust between them.
Online communities typically have their own ways of ranking members; studies suggest that highly ranked members are usually altruistic, care about the community, and take their online reputation seriously, so they are unlikely to betray that trust. The goal is to create a barrier between the server and the privacy peers so that they are unlikely to collude. Such barriers need to be identified and evaluated in each situation, and we believe they exist in many applications.
18
Security from Heterogeneity
The server is secure against outside attacks and won't actively cheat: companies spend $$$ to protect their servers, the server often holds much more valuable information than what the protocol reveals, and the server benefits from an accurate computation.
Privacy peers won't collude with the server: conflicting interests, mutual distrust, and the law. The server can't trust clients to keep a conspiracy secret, and if a reputable company or institution is caught colluding it will be punished by law and cannot afford the public scandal.
Users can actively cheat.
We therefore rely on the server, which has the strongest protection, to defend against outside attacks, and on the privacy peers to defend against a curious server.
This heterogeneity is also why the protocol can run with few participants and still be secure: a homogeneous system must use a high threshold or a large number of participants to ensure enough honest nodes (pick only 2 of them at random and the chance that both are bad is pretty high), whereas here the two talliers are chosen precisely because their owners have different interests and incentives.
19
Private Addition
The basic P4P computation paradigm is as follows. There are two talliers: the server and one of the privacy peers. d_i is user i's private vector; it is split into shares u_i and v_i with u_i + v_i = d_i, where u_i, v_i, and d_i all lie in a small integer field.
20
Private Addition
The server accumulates μ = Σ u_i and the privacy peer accumulates ν = Σ v_i, with u_i + v_i = d_i.
21
Private Addition
Each tallier announces its partial sum: μ = Σ u_i from the server and ν = Σ v_i from the privacy peer (u_i + v_i = d_i).
22
Private Addition
The final result is μ + ν = Σ d_i. Communication is independent of n.
The computation on the privacy peer is about the same as in a fully distributed version.
23
P4P’s Private Addition Provable privacy
Computation on both the server and the privacy peer is over small field: same cost as non-private implementation Fits existing server-based schemes Server is always online. Users and privacy peers can be on and off. Only two parties performing the computation, users just submit their data (and provide a ZK proof, see later) Extra communication for the server is only with the privacy peer, independent of n
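A minimal plaintext sketch of this addition scheme, with the commitments and the zero-knowledge validity proof omitted and the field size and names chosen only for illustration:

```java
import java.security.SecureRandom;

/** Sketch of P4P-style private addition: each user additively splits its
 *  vector between the server and one privacy peer over a small field.
 *  Commitments and the zero-knowledge validity proof are omitted;
 *  the field size and names below are illustrative choices. */
public class P4PAdditionSketch {
    static final long PHI = (1L << 32) - 5;       // illustrative small prime field
    static final SecureRandom RND = new SecureRandom();

    /** User i: split d_i into (u_i, v_i) with u_i + v_i = d_i (mod PHI). */
    static long[][] share(long[] d) {
        long[] u = new long[d.length], v = new long[d.length];
        for (int j = 0; j < d.length; j++) {
            u[j] = Math.floorMod(RND.nextLong(), PHI);
            v[j] = Math.floorMod(d[j] - u[j], PHI);
        }
        return new long[][]{u, v};
    }

    /** Tallier (server or privacy peer): accumulate its shares, mod PHI. */
    static void accumulate(long[] partial, long[] share) {
        for (int j = 0; j < partial.length; j++)
            partial[j] = Math.floorMod(partial[j] + share[j], PHI);
    }

    public static void main(String[] args) {
        long[][] data = {{3, 7}, {10, 0}, {5, 2}};    // three users' 2-dim vectors
        long[] mu = new long[2], nu = new long[2];    // server's and peer's tallies
        for (long[] d : data) {
            long[][] uv = share(d);
            accumulate(mu, uv[0]);                    // share sent to the server
            accumulate(nu, uv[1]);                    // share sent to the privacy peer
        }
        for (int j = 0; j < 2; j++)
            System.out.println("sum[" + j + "] = " + Math.floorMod(mu[j] + nu[j], PHI));
        // Expected output: 18 and 9
    }
}
```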
24
The Need for Verification
This scheme has a glaring weakness: users can submit any number in the small field as their data. Think of a voting scheme ("please place your vote, 0 or 1, in the envelope"): a single cheater could report Bush +100,000, Gore -100,000.
25
Zero-Knowledge Proofs
I can prove that I know X without disclosing what X is.
I can prove that a given encrypted number is a 0, or that an encrypted number is a 1.
I can prove that an encrypted number is a ZERO OR ONE, i.e. a bit (6 extra numbers needed).
I can prove that an encrypted number is a k-bit integer; I need 6k extra numbers to do this (!!!).
These are the building blocks for a very efficient proof discussed later. Note that they all use large integers, so we want to minimize the number of times we use them.
26
An Efficient ZKP of Boundedness
Luckily, we don't need to prove that every number in a user's vector is small, only that the vector as a whole is small. The server asks for some random projections of the user's vector and expects the user to prove that the square sum of the projections is small; the projections are handled through commitments and zero-knowledge proofs, so nothing beyond the bound is revealed.
O(log m) public-key crypto operations (instead of O(m)) to prove that the L2 norm of an m-dimensional vector is smaller than L. Running time is reduced from hours to seconds.
(A simplified plaintext analogue of the check follows this slide.)
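The following is a plaintext analogue of that check (my simplification: the challenge distribution and constants differ in the real protocol, and the prover would reveal only commitments to the projections together with zero-knowledge proofs, not the projections themselves):

```java
import java.security.SecureRandom;

/** Plaintext analogue of the bounded-norm check: the verifier picks random
 *  +/-1 challenge vectors and tests that the squared projections are small
 *  on average (E[(c.d)^2] = |d|^2 for +/-1 entries). Constants, the challenge
 *  distribution, and all names here are simplified for illustration. */
public class NormBoundCheckSketch {
    public static void main(String[] args) {
        SecureRandom rnd = new SecureRandom();
        long L = 1000;                       // claimed bound on the L2 norm
        long[] d = new long[10_000];         // the user's vector (honest here)
        for (int j = 0; j < d.length; j++) d[j] = rnd.nextInt(11) - 5;

        int N = 50;                          // number of random projections
        double sumSq = 0;
        for (int k = 0; k < N; k++) {
            long proj = 0;
            for (int j = 0; j < d.length; j++)
                proj += (rnd.nextBoolean() ? 1 : -1) * d[j];
            sumSq += (double) proj * proj;
        }
        double estNormSq = sumSq / N;        // unbiased estimate of |d|^2
        System.out.println("estimated |d|^2 = " + estNormSq
                + ", accept = " + (estNormSq <= (double) L * L));
    }
}
```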
27
Performance Evaluation
(a) Verifier and (b) prover times in seconds for the validation protocol where (from top to bottom) L (the required bound) has 40, 20, or 10 bits. The x-axis is the vector length.
28
SVD: P4P Style
Singular value decomposition is an extremely useful tool for many IR and data mining tasks (CF, clustering, …).
The SVD of a matrix A is a factorization A = UDV^T. If A encodes users x items, then V^T gives the best least-squares approximations to the rows of A in a user-independent way.
Since A^T A V = V D^2, SVD reduces to an eigenproblem (see the sketch after this slide).
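The reduction to P4P's addition step rests on a standard identity (my framing of the per-user contribution; the slide itself only states the eigenproblem):

```latex
% If a_i denotes user i's row of A, then
\[
  A^{T}A \;=\; \sum_{i=1}^{n} a_i^{\,T} a_i ,
\]
% so each user can contribute g(d_i) = a_i^T a_i (an m x m outer product,
% flattened to a vector), the P4P sum yields A^T A, and the eigenproblem
% A^T A V = V D^2 is then solved on the aggregate alone.
```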
29
SVD: P4P Style
30
Distributed Association Rule Mining
n users, m items. User i has dataset D_i.
Horizontally partitioned: every D_i contains the same attributes.
[Figure: the datasets D_1, …, D_n side by side.]
31
Step k of apriori-gen in P4P
User i constructs an m_k-dimensional vector over the small field (m_k: the number of candidate itemsets at step k)
Use P4P to compute the aggregate (with verification)
The result encodes the supports of all candidate itemsets
32
Step k of apriori-gen in P4P
[Figure: for the j-th candidate itemset c_j, each user i derives d_i[j] from its dataset D_i; P4P sums the d_i[j] to give the support of c_j.]
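A minimal sketch of step k as just described (plaintext aggregation stands in for the verified P4P sum; itemsets are represented as sets of item ids, and all names are illustrative):

```java
import java.util.*;

/** Sketch of step k of apriori-gen over P4P: user i turns its transactions
 *  into an m_k-dimensional count vector over the candidate itemsets, and only
 *  the sum of these vectors (here a plain loop, in P4P the verified private
 *  addition) is revealed as the candidates' supports. */
public class AprioriStepSketch {

    /** d_i[j] = number of user i's transactions containing candidate c_j. */
    static long[] userVector(List<Set<Integer>> transactions, List<Set<Integer>> candidates) {
        long[] d = new long[candidates.size()];
        for (Set<Integer> t : transactions)
            for (int j = 0; j < candidates.size(); j++)
                if (t.containsAll(candidates.get(j))) d[j]++;
        return d;
    }

    public static void main(String[] args) {
        List<Set<Integer>> candidates = List.of(Set.of(1, 2), Set.of(2, 3));   // step-k candidates
        List<List<Set<Integer>>> users = List.of(
                List.of(Set.of(1, 2, 3), Set.of(2, 3)),
                List.of(Set.of(1, 2), Set.of(1, 3)));
        long[] support = new long[candidates.size()];
        for (List<Set<Integer>> Di : users) {                    // P4P replaces this plain sum
            long[] di = userVector(Di, candidates);
            for (int j = 0; j < support.length; j++) support[j] += di[j];
        }
        System.out.println(Arrays.toString(support));            // supports of {1,2} and {2,3}
    }
}
```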
33
Analysis
Privacy guaranteed by P4P
Near-optimal efficiency: cost comparable to that of a direct implementation of the algorithms
Main aggregation in the small field; only a small number of large-field operations
Cheating users are dealt with by P4P's built-in ZK user data verification
34
Infrastructure Support
Multicast encryption [RSA 06] Scalable secure bidirectional communication [Infocom 07] Data protection scheme [PET 04]
35
Model
f = Σ g(d_i), computed with cryptographic privacy: no leakage beyond the final result, or differential privacy under a certain condition.
[Figure: users u_1, …, u_n hold private data d_1, …, d_n and contribute g(d_1), …, g(d_n) to the sum.]
We describe the condition in the context of database privacy, since we use and extend results from that area; the same condition also applies to distributed computation.
36
SVD and Association Rule Mining
SVD: the intermediate sums are implied by the final results, since A^T A = V D^2 V^T
ARM: the intermediate sums are treated as public by the applications
Guaranteed privacy regardless of data distribution or size
37
Differential Privacy [Dwork TCC06]
An algorithm A gives (δ, ε)-differential privacy if, for all sets S of possible outputs and for all datasets d, d' that differ in only one data record,
Pr[A(d) in S] ≤ exp(ε)·Pr[A(d') in S] + δ
(and symmetrically with d and d' exchanged).
38
Achieving Differential Privacy
All existing schemes achieve differential privacy via noise [Dwork TCC 06, EUROCRYPT 06, Blum PODS 05, etc.]
They all add the perturbation to the query answer, not to the individual data:
Answer = f(d) + Noise
39
Differential Privacy w/o Noise
In the definition of differential privacy, where does the randomness come from? Does it have to be external?
Treating the unknown data records as random variables, we can utilize this inherent uncertainty for privacy protection.
40
Differential Privacy – an Equivalent Definition for Sum Queries
An algorithm A gives (δ, ε)-differential privacy if, for any i in [n], for all d_i, d'_i in D, and for any possible value t of g(d_i),
Pr[g(d_i) = t] ≤ exp(ε)·Pr[g(d'_i) = t] + δ
41
Achieving Differential Privacy
All existing schemes: Answer = g(d_i) + Y. Possible choices of Y:
Gaussian: N(µ, σI_m), σ ≥ 2m² log(2m/δ)/ε²
Laplace: Lap(µ, λ), λ ≥ 1/ε, independent for each element of g(d_i)
Binomial: B(n, ½), n ≥ 64m² log(2m/δ)/ε², independent for each element of g(d_i)
42
Sum Queries
For any k in [n], s = g(d_k) + Δ, where Δ = Σ_{i≠k} g(d_i).
Central Limit Theorem: for large n, Δ converges in distribution to the m-dimensional Gaussian N((n-1)µ, (n-1)Σ).
Question: can Δ provide enough perturbation?
43
Independent vs Non-independent Perturbation
44
Privacy via Non-independent Perturbation
Write Δ = Δ_1 + Δ_2 with Δ_1 and Δ_2 independent, where
Δ_1: Gaussian N(0, σI_m), σ = 2m² log(2m/δ)/ε²
Δ_2: Gaussian N((n-1)µ, (n-1)Σ − σI_m)
Δ_1 already provides differential privacy, and additional independent perturbation won't reduce it.
45
Privacy Condition
(n-1)Σ − σI_m must be a proper covariance matrix for a multivariate Gaussian: symmetric and positive definite.
Equivalently, λ_min((n-1)Σ) > σ, i.e. λ_min(Σ) > 2m² log(2m/δ)/((n-1)ε²).
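A tiny numeric sketch of checking this condition (the smallest eigenvalue of Σ is taken as a given input, and all numbers are made up for illustration):

```java
/** Sketch: check the noiseless-perturbation privacy condition
 *  lambda_min(Sigma) > 2 m^2 log(2m/delta) / ((n-1) eps^2).
 *  lambdaMin would come from an eigendecomposition of the data
 *  covariance; the numbers here are made up for illustration. */
public class PrivacyConditionSketch {
    public static void main(String[] args) {
        int n = 100_000, m = 20;
        double eps = 0.1, delta = 1e-6;
        double lambdaMin = 20.0;                      // assumed smallest eigenvalue of Sigma

        double threshold = 2.0 * m * m * Math.log(2.0 * m / delta) / ((n - 1) * eps * eps);
        System.out.printf("threshold = %.4f, lambda_min = %.4f, private = %b%n",
                threshold, lambdaMin, lambdaMin > threshold);
    }
}
```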
46
Application in Query Auditing
Online: restrict queries if they are unsafe
Offline: determine whether a set of answered queries caused a privacy breach
Existing solutions achieve only weak notions of privacy (full disclosure, or probabilities over intervals)
Our condition supports both online and offline auditing while achieving differential privacy
It is also simulatable: the decision is based on the statistics of the distribution, not on individual data.
47
Query Matrix
X is the m x n matrix with rows indexed by queries x_1, …, x_m and columns by users, with X_ji = g_j(d_i).
X is made zero-mean by subtracting the mean, and Σ = XX^T/(n-1). The singular values of X are the square roots of the eigenvalues of XX^T.
48
Weyl's Theorem
Let σ_i(A) be the i-th singular value of a matrix A, and let E := A − B. Then
|σ_i(A) − σ_i(B)| ≤ ||E||_2,
where ||E||_2 is the spectral norm, ||E||_2 = sqrt(λ_max(E^T E)).
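One way to read the role of Weyl's theorem in the auditing setting (my paraphrase, under the assumption that answering the (k+1)-th query appends a row to X):

```latex
% If X_{k+1} is X_k with the new query's row g_{k+1} appended (and X_k is
% zero-padded to the same shape), then E = X_{k+1} - [X_k; 0] has spectral
% norm ||E||_2 = ||g_{k+1}||_2, so
\[
  |\sigma_i(X_{k+1}) - \sigma_i(X_k)| \;\le\; \|g_{k+1}\|_2 ,
\]
% i.e. one extra query can move each singular value (and hence the smallest
% eigenvalue of the covariance used in the privacy condition) by at most
% the new row's norm.
```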
49
A Necessary Privacy Condition
Suppose the first k queries are all safe; the (k+1)-th query is unsafe if [condition given as a formula on the slide, not captured in this transcript].
50
P4P: Current Status
P4P has been implemented:
In Java, using native code for big-integer arithmetic
Runs on the Linux platform
Will be released as an open-source toolkit for building privacy-preserving real-world applications.
51
Conclusion
We can provide strong privacy protection with little or no cost to a service provider for a broad class of problems in e-commerce and knowledge work.
Responsibility for privacy protection shifts to the privacy peers.
Within the P4P framework, private computation and many zero-knowledge verifications can be done with great efficiency.
52
More info duan@cs.berkeley.edu http://www.cs.berkeley.edu/~duan
Thank You!