Computational Complexity & Differential Privacy. Salil Vadhan, Harvard University. Joint works with Cynthia Dwork, Kunal Talwar, Andrew McGregor, Ilya Mironov, Moni Naor, Omkant Pandey, Toni Pitassi, Omer Reingold, Guy Rothblum, Jon Ullman.

Data Privacy: The Problem Given a dataset with sensitive information, such as: Health records Census data Search engine logs How can we: Compute and release useful functions of the dataset (utility) Without compromising info about individuals (privacy)?

Data Privacy: The Challenge Traditional approach (e.g. in HIPAA): “anonymize” by removing “personally identifying information (PII)” Many supposedly anonymized datasets have been subject to reidentification: – Gov. Weld’s medical record re-identified by linkage with voter records [Swe97]. – Netflix Challenge database reidentified by linkage with IMDb [NS08] – AOL search users reidentified by contents of their queries [BZ06] – …

Differential Privacy [DN03,DN04,BDMN05,DMNS06] A strong, new notion of privacy that: Is robust to auxiliary information possessed by an adversary Degrades gracefully under repetition/composition Allows for many useful computations

Differential Privacy: Definition. Def: A randomized algorithm C : X^n → Y is (ε,δ)-differentially private iff for all databases D1, D2 that differ on one row and every set T ⊆ Y: Pr[C(D1) ∈ T] ≤ e^ε · Pr[C(D2) ∈ T] + δ. Think of ε as a small constant, e.g. ε = .01, and of δ as cryptographically small. [Figure: a curator C holds the database D ∈ X^n and releases C(D).]

Differential Privacy: Interpretations. Def (recalled): C : X^n → Y is (ε,δ)-differentially private iff for all databases D1, D2 that differ on one row and every T ⊆ Y, Pr[C(D1) ∈ T] ≤ e^ε · Pr[C(D2) ∈ T] + δ. My data affects what an adversary sees by at most ε. Whatever an adversary learns about me, it could have learned from everyone else’s data. The above interpretations hold regardless of the adversary’s auxiliary information. Composes gracefully (k repetitions ⇒ (kε, kδ)-differentially private). But there is no protection for information that is not localized to a few rows.

Differential Privacy: Example. D = (x1,…,xn) ∈ X^n. Goal: given π : X → {0,1}, estimate the counting query π(D) := Σ_i π(x_i) to within ±α. Example: X = {0,1}^d, π = a conjunction on ≤ k variables; the counting query is then a k-way marginal, e.g. how many people in D smoke and have cancer? [Table: rows with attributes >35?, Smoker?, Cancer?]

Differential Privacy: Example. D = (x1,…,xn) ∈ X^n. Goal: estimate the counting query π(D) := Σ_i π(x_i) to within ±α. Solution: C(D) = π(D) + Lap(1/ε), where Lap(σ) has density ∝ exp(−|x|/σ). Can answer k queries with error ~k/ε, or even ~k^{1/2}/ε.
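The Laplace mechanism above is easy to state in code. A minimal sketch in Python, assuming an unnormalized counting query of sensitivity 1 (changing one row shifts π(D) by at most 1); the name laplace_count and the toy dataset are illustrative:

```python
import numpy as np

def laplace_count(rows, pi, eps, rng=None):
    """Release sum_i pi(x_i) + Lap(1/eps); this is (eps, 0)-differentially
    private because the count has sensitivity 1."""
    rng = rng or np.random.default_rng()
    true_count = sum(pi(x) for x in rows)
    return true_count + rng.laplace(loc=0.0, scale=1.0 / eps)

# Example: how many rows smoke and have cancer (a 2-way marginal)?
D = [{"smoker": 1, "cancer": 1}, {"smoker": 1, "cancer": 0},
     {"smoker": 0, "cancer": 1}, {"smoker": 1, "cancer": 1}]
pi = lambda x: x["smoker"] & x["cancer"]
print(laplace_count(D, pi, eps=0.1))  # true count is 2; noise ~ Lap(10)
```

Answering k queries by running this mechanism k times consumes a privacy budget of kε, matching the composition bound above.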

Other Differentially Private Algorithms: histograms [DMNS06], contingency tables [BCDKMT07, GHRU11], machine learning [BDMN05, KLNRS08], logistic regression [CM08], clustering [BDMN05], social network analysis [HLMJ09], approximation algorithms [GLMRT10], …

Computational Complexity. When do computational resource constraints change what is possible? Examples: Computational learning theory [Valiant ‘84]: small VC dimension does not imply learnability by efficient algorithms (bad news). Cryptography [Diffie & Hellman ‘76]: we don’t need long shared secrets against a computationally bounded adversary (good news).

This talk: Computational Complexity in Differential Privacy Q: Do computational resource constraints change what is possible? Computationally bounded curator – Makes differential privacy harder – Differentially private & accurate synthetic data infeasible to construct, even preserving 2-way marginals to within o(n) error (proof uses PCPs & Digital Signatures) – Open: release other types of summaries/models? Computationally bounded adversary – Makes differential privacy easier – Provable gain in accuracy for multi-party protocols – Connections to randomness extractors, communication complexity, secure multiparty computation

Computational Differential Privacy: Definition. Def [MPRV09]: A randomized algorithm C_k : X^n → Y is ε-computationally differentially private iff for all databases D1, D2 that differ on one row and all nonuniform poly(k)-time algorithms T, Pr[T(C_k(D1)) = 1] ≤ e^ε · Pr[T(C_k(D2)) = 1] + negl(k). – k = security parameter. – Allow C_k running time poly(n, log|X|, k). – negl(k) = a function vanishing faster than 1/poly(k). – Implicit in works on distributed implementations of differential privacy [DKMMN06, BKO08].

Computational Differential Privacy: Example. Def [MPRV09] (recalled): C_k : X^n → Y is ε-computationally differentially private iff for all databases D1, D2 that differ on one row and all nonuniform poly(k)-time algorithms T, Pr[T(C_k(D1)) = 1] ≤ e^ε · Pr[T(C_k(D2)) = 1] + negl(k). Example: C_k = [a differentially private C + a PRG to generate its randomness]. C_k(D) is computationally indistinguishable from C(D) ⇒ C_k is computationally differentially private.
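A minimal sketch of this construction, assuming SHA-256 in counter mode as a stand-in pseudorandom generator (a real instantiation would use a provably secure PRG keyed on the security parameter k); the helpers prg_floats, laplace_from_uniform, and cdp_count are all illustrative names:

```python
import hashlib, math

def prg_floats(seed: bytes, n: int):
    """Expand a short seed into n pseudorandom floats in (0, 1)."""
    out = []
    for ctr in range(n):
        digest = hashlib.sha256(seed + ctr.to_bytes(8, "big")).digest()
        u = int.from_bytes(digest[:8], "big") / 2**64
        out.append(min(max(u, 1e-12), 1 - 1e-12))  # stay strictly inside (0, 1)
    return out

def laplace_from_uniform(u: float, scale: float) -> float:
    """Inverse-CDF sampling: maps Uniform(0,1) to Lap(scale)."""
    return -scale * math.copysign(1.0, u - 0.5) * math.log(1 - 2 * abs(u - 0.5))

def cdp_count(bits, eps: float, seed: bytes) -> float:
    # The same Laplace mechanism as before, but its random tape is
    # pseudorandom; if the PRG is secure, the output is computationally
    # indistinguishable from the information-theoretically DP output.
    u = prg_floats(seed, 1)[0]
    return sum(bits) + laplace_from_uniform(u, 1.0 / eps)
```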

Computational Differential Privacy: Definitional Questions. Stronger def: C_k is ε-computationally differentially private′ if there exists an (ε, negl(k))-differentially private C : X^n → Y s.t. for all databases D, C(D) and C_k(D) are computationally indistinguishable. This implies accuracy of C_k(D) ≈ accuracy of C(D) ⇒ not much gain over information-theoretic d.p. Q [MPRV09]: are the two definitions equivalent? – Related to “dense model theorems” in additive combinatorics [GT04] and pseudorandomness [RTTV08]. – Still open!

Computational Differential Privacy: When Can it Help? [GKY11]: in many cases, a computationally differentially private C_k can be converted into an information-theoretically differentially private C with negligible loss in accuracy. – Does not cover all functionalities & utility measures of interest. – Restricted to the case of a single curator computing on the entire dataset. Today [BNO08, MPRV09, MMPRTV10]: computational differential privacy can help when the database is distributed among two or more parties.

2-Party Computational Differential Privacy. Each party has a sensitive dataset, and they want to perform a joint computation f(D_A, D_B). [Figure: A holds D_A = (x1,…,xn) and B holds D_B = (y1,…,ym); they exchange messages m1, m2,…, mk, and out(m1,…,mk) ≈ f(D_A, D_B).]

2-Party Computational Differential Privacy. ε-computational differential privacy (for B): for all nonuniform poly(k)-time adversaries A* and all databases D_B, D′_B that differ on one row, Pr[out_{A*}(A*, B(D_B)) = 1] ≤ e^ε · Pr[out_{A*}(A*, B(D′_B)) = 1] + negl(k), and similarly for A. [Figure: A* exchanges messages m1,…,mk with B(D_B) and outputs a bit.]

Multiparty Computational Differential Privacy. Require: for all i, all nonuniform poly(k)-time adversaries P*_{-i}, and all databases D_i, D′_i that differ on one row, Pr[out_{P*_{-i}}(P*_{-i}, P_i(D_i)) = 1] ≤ e^ε · Pr[out_{P*_{-i}}(P*_{-i}, P_i(D′_i)) = 1] + negl(k). [Figure: parties P1(D1),…,P5(D5); the coalition P*_{-5} of everyone except P5 outputs a bit.]

Constructing CDP Protocols. Given a function f(D1,…,Dn) we wish to compute – example: each D_i ∈ {0,1} and f(D1,…,Dn) = Σ_i D_i. Step 1: design a centralized mechanism C – example: C(D1,…,Dn) = Σ_i D_i + Lap(1/ε). Step 2: use secure multiparty computation [Yao86, GMW86] to implement C with a distributed protocol (P1,…,Pn). – The adversary’s view can be simulated (up to computational indistinguishability) given only access to the “ideal functionality” C ⇒ computational differential privacy. – Can be done more efficiently in specific cases [DKMMN06, BKO08]. Q: Did we really need computational differential privacy?

Differentially Private Protocol for SUM: “Randomized Response” [W65]. P1,…,Pn have bits D1,…,Dn ∈ {0,1} and want to estimate Σ_i D_i. 1. Each P_i broadcasts N_i = D_i w.p. (1+ε)/2 and N_i = 1−D_i w.p. (1−ε)/2. 2. Everyone computes Z = (1/ε) · Σ_i (N_i − (1−ε)/2). Differential privacy: ((1+ε)/2)/((1−ε)/2) = 1 + O(ε). Accuracy: E[N_i] = (1−ε)/2 + ε·D_i, so E[Z] = Σ_i D_i and whp the error is O(n^{1/2}/ε) – nontrivial, but worse than the O(1/ε) achievable with computational d.p.
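A minimal sketch of this protocol, with each party’s broadcast simulated locally; the names broadcast and estimate_sum are illustrative:

```python
import random

def broadcast(d_i: int, eps: float, rng=random) -> int:
    """Party i announces its true bit w.p. (1+eps)/2, the flipped bit otherwise."""
    return d_i if rng.random() < (1 + eps) / 2 else 1 - d_i

def estimate_sum(announcements, eps: float) -> float:
    """Z = (1/eps) * sum_i (N_i - (1-eps)/2); unbiased since
    E[N_i] = (1-eps)/2 + eps*D_i."""
    return sum(n - (1 - eps) / 2 for n in announcements) / eps

eps = 0.1
bits = [random.randint(0, 1) for _ in range(10000)]
Z = estimate_sum([broadcast(b, eps) for b in bits], eps)
print(sum(bits), round(Z))  # typical error Theta(sqrt(n)/eps), as claimed above
```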

Lower Bound for Computing SUM. Thm: Every n-party differentially private protocol for SUM incurs error Ω(n^{1/2}) whp – assuming ε = O(1), δ = o(1/n) – improves the Ω(n^{1/2}/(#rounds)) lower bound of [BNO08]. Proof: Assume δ = 0 for simplicity. Let (D1,…,Dn) be uniform, independent bits, and let T = transcript(P1(D1),…,Pn(Dn)). Claim: conditioned on T = t, the bits D1,…,Dn are still independent, each with bias O(ε).

Claim: the sum of n independent bits, each with constant bias, falls outside any fixed interval of length o(n^{1/2}) whp. ⇒ whp Σ_i D_i ∉ [output(T) − o(n^{1/2}), output(T) + o(n^{1/2})], i.e. the protocol’s error is Ω(n^{1/2}).
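The anticoncentration claim is easy to check empirically. A quick illustrative simulation (the parameters and the choice of biases are assumptions of mine, not from the slides):

```python
import numpy as np

n, trials = 10000, 500
rng = np.random.default_rng(0)
p = 0.5 + 0.05 * (2 * rng.random(n) - 1)        # n bits, each with bias <= 0.05
sums = (rng.random((trials, n)) < p).sum(axis=1)
width = 0.1 * np.sqrt(n)                        # an interval of length o(sqrt(n))
inside = np.mean(np.abs(sums - sums.mean()) < width / 2)
print(f"fraction of trials inside a width-{width:.0f} interval: {inside:.3f}")  # small
```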

Separation for Two-Party Protocols. A’s input: x = (x1,…,xn) ∈ {0,1}^n; B’s input: y = (y1,…,yn) ∈ {0,1}^n. Goal [MPRV09]: estimate the inner product ⟨x,y⟩ = Σ_i x_i·y_i (set intersection). It can be computed by a computationally differentially private protocol with error O(1/ε), and by an (information-theoretically) differentially private protocol with error O(n^{1/2}/ε). Thm [MMPRTV10]: Every 2-party differentially private protocol for ⟨x,y⟩ incurs error Ω(n^{1/2}/log n) whp.

Lower Bound for Inner Product

Thm: Every 2-party differentially private protocol for ⟨x,y⟩ has error Ω(n^{1/2}/log n) whp. Proof: Let X, Y be uniformly random and T = trans(A(X), B(Y)). Claim: conditioned on T = t, X and Y are independent unpredictable (Santha–Vazirani) sources. Claim: if X, Y are independent unpredictable sources on {0,1}^n, then ⟨X,Y⟩ mod m is almost uniform in Z_m for some m = Ω(n^{1/2}/log n) – a randomness extractor! – generalizes the [Vaz86] result for m = 2 – Fourier analysis over Z_m. ⇒ whp ⟨X,Y⟩ ∉ [output(T) − o(m), output(T) + o(m)].

Connections with Communication Complexity [MMPRTV10]. Let (A(x), B(y)) be a 2-party protocol, x ∈ X^n, y ∈ Y^n, and let CC(A,B) = the maximum number of bits communicated. Let (X,Y) be distributed in X^n × Y^n and T = transcript(A(X), B(Y)). Thm: (A,B) ε-differentially private ⇒ information cost I(XY;T) = O(εn) ⇒ (A,B) can be simulated with CC(A′,B′) = O(εn · polylog(CC(A,B))) [BBCR10]. – Also holds for (ε, o(1/n))-differential privacy if the symbols of X are independent given Y and vice versa. – Can preserve differential privacy if #rounds is bounded. Thm: (A,B) a deterministic protocol approximating f(x,y) to within α ⇒ there is an ε-differentially private protocol approximating f(x,y) to within α + O(CC(A,B) · Sensitivity(f) · (#rounds)/ε).

Conclusions Computational complexity is relevant to differential privacy. Bad news: producing synthetic data is intractable Good news: better protocols against bounded adversaries Interaction with differential privacy likely to benefit complexity theory too.

A More Ambitious Goal: Noninteractive Data Release. [Figure: original database D → curator C → sanitization C(D).] Goal: from C(D), one can answer many questions about D, e.g. all counting queries associated with a large family of predicates P = {π : X → {0,1}}.

Noninteractive Data Release: Desiderata. (ε,δ)-differential privacy: for every D1, D2 that differ in one row and every set T, Pr[C(D1) ∈ T] ≤ e^ε · Pr[C(D2) ∈ T] + δ, with δ negligible. Utility: C(D) allows answering many questions about D. Computational efficiency: C is polynomial-time computable.

Utility: Counting Queries. D = (x1,…,xn) ∈ X^n, P = {π : X → {0,1}}. For any π ∈ P, we want to estimate (from C(D)) the counting query π(D) := (Σ_i π(x_i))/n to within accuracy error ±α. Example: X = {0,1}^d, P = {conjunctions on ≤ k variables}; the counting query is then a k-way marginal, e.g. what fraction of people in D smoke and have cancer? [Table: rows with attributes >35?, Smoker?, Cancer?]
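A minimal sketch of such a query and its standard DP release, assuming binary data as above; the normalized query has sensitivity 1/n, and the function names are illustrative:

```python
import numpy as np

def marginal(D: np.ndarray, coords, values) -> float:
    """Fraction of rows x with x[coords] == values (a |coords|-way marginal)."""
    mask = np.all(D[:, coords] == np.asarray(values), axis=1)
    return float(mask.mean())

def dp_marginal(D, coords, values, eps, rng=None) -> float:
    """Release marginal + Lap(1/(eps*n)): the normalized query has sensitivity 1/n."""
    rng = rng or np.random.default_rng()
    return marginal(D, coords, values) + rng.laplace(scale=1.0 / (eps * len(D)))

# d = 3 attributes: (>35?, smoker?, cancer?); query: smoker AND cancer
D = np.array([[1, 1, 1], [0, 1, 0], [1, 0, 1], [1, 1, 1]])
print(marginal(D, coords=[1, 2], values=[1, 1]))        # exact answer: 0.5
print(dp_marginal(D, coords=[1, 2], values=[1, 1], eps=1.0))
```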

Form of Output. Ideal: C(D) is a synthetic dataset – ∀π ∈ P, |π(C(D)) − π(D)| ≤ α – values are consistent – can use existing software. Alternatives? – An explicit list of |P| answers (e.g. a contingency table). – The median of several synthetic datasets [RR10]. – A program M s.t. ∀π ∈ P, |M(π) − π(D)| ≤ α.

Positive Results. Setting: D = (x1,…,xn) ∈ ({0,1}^d)^n, P = {π : {0,1}^d → {0,1}}, π(D) := (1/n)·Σ_i π(x_i); α = accuracy error, ε = privacy.

reference             | min. database size, general P | k-way marginals | synthetic? | complexity, general P | k-way marginals
[DN03,DN04,BDMN05]    | O(|P|^{1/2}/ε)                | O(d^{k/2}/ε)    | N          | poly(n,|P|)           | poly(n,d^k)
[BDCKMT07]            |                               | Õ((2d)^k/ε)     | Y          |                       | poly(n,2^d)
[BLR08]               | O(d·log|P|/(α³ε))             | Õ(dk/(α³ε))     | Y          | qpoly(n,|P|,2^d)      | qpoly(n,2^d)
[DNRRV09,DRV10,HR10]  | O(d·log²|P|/(α²ε))            | Õ(dk²/(α²ε))    | Y          | poly(n,|P|,2^d)       |

Summary: one can construct synthetic databases accurate on huge families of counting queries, but the computational complexity may be exponential in the dimensions of the data and of the query set P. Question: is this complexity inherent?

Our Result: Intractability of Synthetic Data Informally: Producing accurate & differentially private synthetic data is as hard as breaking cryptography (e.g. factoring large integers). Inherently exponential in dimensionality of data (and in dimensionality of queries).

Our Result: Intractability of Synthetic Data. Thm [UV11, building on DNRRV10]: Under standard crypto assumptions (the existence of one-way functions), there is no n = poly(d) and curator that: – produces synthetic databases, – is differentially private, – runs in time poly(n,d), and – achieves error α = .01 on 2-way marginals. Proof overview: 1. Use digital signatures to show hardness for a complicated (but efficiently computable) family of counting queries [DNRRV10]. 2. Use PCPs to reduce complicated queries to simple ones [UV11].

Tool 1: Digital Signature Schemes. A digital signature scheme consists of 3 poly-time algorithms (Gen, Sign, Ver): on security parameter d, Gen(d) = (SK, PK) ∈ {0,1}^d × {0,1}^d; on m ∈ {0,1}^d, one can compute σ = Sign_SK(m) ∈ {0,1}^d s.t. Ver_PK(m, σ) = 1; and given many (m, σ) pairs, it is infeasible to generate a new pair (m′, σ′) satisfying Ver_PK. Thm [NY89, Rom90]: digital signature schemes exist iff one-way functions exist.

Hard-to-Sanitize Databases I [DNRRV10]. Generate random (PK, SK) ← Gen(d) and m1, m2,…, mn ← {0,1}^d, and let the database D consist of the rows (m_i, Sign_SK(m_i)), so Ver_PK = 1 on every row of D. The curator releases C(D) with rows (m′_i, σ_i). If C has error ≤ α w.r.t. the counting query Ver_PK, then Ver_PK(C(D)) ≥ 1 − α > 0, so there exists i with Ver_PK(m′_i, σ_i) = 1. Case 1: m′_i ∉ D ⇒ forgery! Case 2: m′_i ∈ D ⇒ reidentification!
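A minimal sketch of this construction, using Ed25519 from the Python cryptography package as a stand-in signature scheme (any existentially unforgeable scheme works; the helper ver and the parameters are illustrative):

```python
import os
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

sk = Ed25519PrivateKey.generate()      # Gen(d) -> (SK, PK)
pk = sk.public_key()

# Each row is a random message together with its signature, so Ver_PK(row) = 1.
D = [(m, sk.sign(m)) for m in (os.urandom(32) for _ in range(1000))]

def ver(row) -> bool:
    m, sig = row
    try:
        pk.verify(sig, m)              # raises InvalidSignature if invalid
        return True
    except InvalidSignature:
        return False

assert all(ver(row) for row in D)
# Any released dataset whose rows mostly pass `ver` must either repeat rows
# of D (reidentification) or contain a fresh valid pair (a signature forgery).
```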

Tool 2: Probabilistically Checkable Proofs. The PCP Theorem: there exist efficient algorithms (Red, Enc, Dec) s.t.: Red maps a circuit V of size poly(d) to a set of 3-clauses Φ_V on d′ = poly(d) variables, e.g. Φ_V = {x1 ∨ x5 ∨ ¬x7, ¬x1 ∨ x5 ∨ x_{d′}, …}; Enc maps any w s.t. V(w) = 1 to a z ∈ {0,1}^{d′} satisfying all of Φ_V; and Dec maps any z′ ∈ {0,1}^{d′} satisfying a .99 fraction of Φ_V to a w′ s.t. V(w′) = 1.

Hard-to-Sanitize Databases II [UV11]. Let Φ_PK = Red(Ver_PK). Build D from the rows (m_i, Sign_SK(m_i)) as before, and let z_i = Enc(m_i, Sign_SK(m_i)), so each clause in Φ_PK is satisfied by all the z_i. The curator releases C(D) with rows z′_1,…,z′_k. If C has error ≤ .01 w.r.t. 3-way marginals, then each clause in Φ_PK is satisfied by a ≥ .99 fraction of the z′_i, so by averaging there exists i s.t. z′_i satisfies a ≥ .99 fraction of the clauses, and Dec(z′_i) = a valid pair (m′_i, σ_i). Case 1: m′_i ∉ D ⇒ forgery! Case 2: m′_i ∈ D ⇒ reidentification!

Conclusions. Producing private synthetic databases that preserve even simple statistics requires computation exponential in the dimension of the data. How to bypass this? Average-case accuracy: heuristics that don’t give good accuracy on all databases, only on those from some class of models. Non-synthetic data: – The hardness extends to some generalizations of synthetic data (e.g. medians of synthetic datasets). – Thm [DNRRV09]: for unnatural but efficiently computable P, efficient curators exist “iff” efficient “traitor-tracing” schemes exist. – But for natural P (e.g. P = {all marginals}), the question is wide open!