Foundations of Privacy, Lecture 5. Lecturer: Moni Naor
Recap of last week's lecture
– The Exponential Mechanism
  – Differential privacy
  – May yield utility/approximation
  – Is defined and evaluated by considering all possible answers
– Counting Queries
  – The BLR Algorithm
  – Efficient Algorithm
Synthetic DB: Output is a DB
[Figure: users send query 1, query 2, ... to a sanitizer that sits in front of the database and receive answer 1, answer 2, answer 3]
– Synthetic DB: the output is also a DB (of entries from the same universe X); the user reconstructs answers by evaluating the query on the output DB
– Compatible with software and people
– Consistent answers
Counting Queries
– Queries with low sensitivity
– Counting queries: C is a set of predicates c: U → {0,1}
– Query: how many participants in D satisfy c?
– Relaxed accuracy: answer each query within additive error α w.h.p.
  – Not so bad: error is inherent in statistical analysis anyway
– Non-interactive: assume all queries are given in advance
[Figure: database D of size n over universe U; query c]
The BLR Algorithm [Blum Ligett Roth 08]
– For DBs F and D: dist(F,D) = max_{q ∈ C} |q(F) - q(D)|
– Intuition: far-away DBs get smaller probability
– Algorithm on input DB D: sample from a distribution on DBs of size m (m < n), where DB F gets picked w.p. ∝ e^{-ε·dist(F,D)}
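The sampling step can be sketched in Python for a toy universe. This is a minimal sketch under assumptions that are not on the slide: counts are normalized to fractions so that F (size m) and D (size n) are comparable, the score is rescaled by n to put it back in count units, and the names `count`, `dist`, and `blr_sample` are hypothetical. It enumerates every size-m multiset, so it only runs for tiny |U| and m.

```python
import itertools
import math
import random

def count(pred, db):
    """Counting query: number of rows in db that satisfy pred."""
    return sum(1 for row in db if pred(row))

def dist(queries, f, d):
    """max over q in C of |q(F) - q(D)|, on fractional counts (assumption)."""
    return max(abs(count(q, f) / len(f) - count(q, d) / len(d)) for q in queries)

def blr_sample(universe, queries, d, m, eps, rng):
    """Pick a size-m database F with probability proportional to
    exp(-eps * n * dist(F, D)), by brute-force enumeration."""
    candidates = list(itertools.combinations_with_replacement(universe, m))
    weights = [math.exp(-eps * len(d) * dist(queries, f, d)) for f in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Toy instance: universe {0,1,2,3}, two predicates, n = 12, m = 2.
universe = range(4)
queries = [lambda x: x < 2, lambda x: x % 2 == 0]
database = [0] * 6 + [3] * 6
sample = blr_sample(universe, queries, database, m=2, eps=10.0,
                    rng=random.Random(0))
sample_dist = dist(queries, sample, database)
```

With a large ε almost all of the probability mass sits on the candidates that match D exactly on both predicates; a smaller ε spreads the mass out, which is where the privacy comes from.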
Counting Queries
– Queries with low sensitivity
– Counting queries: C is a set of predicates c: U → {0,1}
– Query: how many participants in D satisfy c?
– Relaxed accuracy: answer each query within additive error α w.h.p.
  – Not so bad: error is inherent in statistical analysis anyway
– Sample F of size m that approximates D on all given predicates c
[Figure: database D of size n over universe U; query c]
The BLR Algorithm: Error Õ(n^{2/3} log|C|)
– There exists F_good of size m = Õ((n/α)^2 · log|C|) s.t. dist(F_good, D) ≤ α
– Pr[F_good] ∝ e^{-εα} (at least)
– For any F_bad with dist(F_bad, D) ≥ 2α: Pr[F_bad] ∝ e^{-2εα} (at most)
– Union bound: Σ_{bad DBs F_bad} Pr[F_bad] ∝ |U|^m · e^{-2εα} (at most)
– For α = Õ(n^{2/3} log|C|): Pr[F_good] >> Σ Pr[F_bad]
– Algorithm on input DB D: sample from a distribution on DBs of size m (m < n), where DB F gets picked w.p. ∝ e^{-ε·dist(F,D)}
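The n^{2/3} scaling can be recovered from the two probability bounds by a short calculation. The following is a sketch, not the lecture's own derivation: Z (the normalizing constant of the distribution) and k (the constant and lower-order factors hidden in m) are symbols introduced here for illustration.

```latex
\begin{align*}
\Pr[F_{\mathrm{good}}] &\ge \frac{e^{-\varepsilon\alpha}}{Z},
\qquad
\sum_{F_{\mathrm{bad}}} \Pr[F_{\mathrm{bad}}] \le \frac{|U|^{m}\, e^{-2\varepsilon\alpha}}{Z},\\[4pt]
\frac{\Pr[F_{\mathrm{good}}]}{\sum_{F_{\mathrm{bad}}} \Pr[F_{\mathrm{bad}}]}
&\ge \frac{e^{\varepsilon\alpha}}{|U|^{m}} \gg 1
\quad\text{when}\quad \varepsilon\alpha \gg m \ln|U|,\\[4pt]
m = k\left(\frac{n}{\alpha}\right)^{2}\log|C|
&\;\Longrightarrow\;
\text{it suffices that }\;
\alpha^{3} \gg \frac{k\, n^{2}\,\log|C|\,\ln|U|}{\varepsilon},
\;\text{ i.e. }\;
\alpha \approx n^{2/3}\left(\frac{k\,\log|C|\,\ln|U|}{\varepsilon}\right)^{1/3}.
\end{align*}
```

This reproduces the n^{2/3} dependence; the exact logarithmic factors depend on constants that the slide suppresses inside Õ.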
The BLR Algorithm: Running Time
– Generating the distribution by enumeration: need to enumerate every size-m database, where m = Õ((n/α)^2 · log|C|)
– Running time ≈ |U|^{Õ((n/α)^2 · log|C|)}
– Algorithm on input DB D: sample from a distribution on DBs of size m (m < n), where DB F gets picked w.p. ∝ e^{-ε·dist(F,D)}
Conclusion
– Offline algorithm, 2ε-differential privacy for any set C of counting queries
– Error α is Õ(n^{2/3} log|C| / ε)
– Super-poly running time: |U|^{Õ((n/α)^2 · log|C|)}
Can we Efficiently Sanitize?
– The good news: if the universe is small, can sanitize EFFICIENTLY, in time poly(|C|,|U|)
– The bad news: cannot do much better, namely cannot sanitize in time sub-poly(|C|) AND sub-poly(|U|)
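How Efficiently Can We Sanitize?
[Table: runtime as a function of |C| (subpoly / poly) and |U| (subpoly / poly). Poly in both |C| and |U|: Good news! The remaining cells are marked ? and ??]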
The Good News: Can Sanitize When Universe is Small
– Efficient sanitizer for query set C
– DB size n ≥ Õ(|C|^{o(1)} log|U|)
– Error is ~ n^{2/3}
– Runtime poly(|C|,|U|)
– Output is a synthetic database
– Compare to [Blum Ligett Roth]: n ≥ Õ(log|C| log|U|), runtime super-poly(|C|,|U|)
Recursive Algorithm
[Figure: nested query sets C_0 = C ⊇ C_1 ⊇ C_2 ⊇ ... ⊇ C_b]
– Start with DB D and large query set C
– Repeatedly choose a random subset C_{i+1} of C_i: shrink the query set by a (small) factor
Recursive Algorithm
[Figure: nested query sets C_0 = C ⊇ C_1 ⊇ C_2 ⊇ ... ⊇ C_b]
– Start with DB D and large query set C
– Repeatedly choose a random subset C_{i+1} of C_i: shrink the query set by a (small) factor
– End recursion: sanitize D w.r.t. the small query set C_b
– Output is good for all queries in the small set C_{i+1}
– Extract utility on almost all queries in the large set C_i
– Fix the remaining "underprivileged" queries in the large set C_i
Recursive Algorithm Overview
– Want to sanitize DB D for query set C
– Say we have a small sanitizer A' for smaller subsets C' ⊂ C, and A' outputs a small synthetic database
– Choose a random C' ⊂ C, sanitize D for C' using A'
– "Magic": the sanitization gives accurate answers on all but a small subset B ⊂ C
– Fix the "underprivileged" queries in B "manually"
[Figure: query set C containing C' (A' sanitizes) and B (fix manually); annotations: Why? How? Where?]
Sanitize for few queries, get utility for almost all
– Consider an m-bit synthetic DB output y of A' vs. DB D
– If y is "bad" for a query set B_y of fractional size ≥ m/s: Pr_{C'}[C' ∩ B_y = ∅] ≤ (1 - m/s)^{|C'|} ≈ e^{-m}
– W.h.p., simultaneously for all y's with a large set B_y of bad queries, C' intersects B_y (Occam's Razor)
– y* = A'(D) is good for all of C', hence y* is good for almost all of C
[Figure: C' inside C; y: a potential m-bit output DB with bad set B_y; B_{y*}]
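The Occam's Razor step is a plain union bound, which a few lines of Python can check numerically. This is a sketch under assumptions not spelled out on the slide: C' consists of s independent uniform draws from C, and there are at most 2^m candidate m-bit outputs y. The function names are hypothetical.

```python
import math

def miss_probability(m, s):
    """Bound on Pr[C' misses B_y] when |C'| = s independent draws are made
    and B_y has fractional size at least m/s."""
    return (1 - m / s) ** s

def union_bound(m, s):
    """Bound on the probability that C' misses B_y for SOME of the
    (at most) 2^m candidate m-bit outputs y with a large bad set."""
    return (2 ** m) * miss_probability(m, s)

m, s = 20, 1000
single = miss_probability(m, s)   # at most e^{-m}
overall = union_bound(m, s)       # at most (2/e)^m
```

Since (2/e)^m decays exponentially in m, every output with a large bad set is caught by C' with high probability; in particular y* = A'(D), which is good on all of C', cannot have a large bad set.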
How to get Synthetic DB? Syntheticizer
– Problem: need a small synthetic DB, but have a large output of some other form
– Lemma ["Syntheticizer"]: given a sanitizer A with α-accuracy and arbitrary output, produce a sanitizer A' with 2α-accuracy and synthetic DB output of size Õ(log|C|/α^2)
– Runtime is poly(|U|,|C|)
– Transform the output to a synthetic DB using linear programming: a variable per item in U, a constraint per query in C
The Linear Program
– Run the sanitizer A and use its output to get differentially private counts v_c on all the concepts in C
  – The database is never used again, which preserves privacy
– Come up with a low-weight fractional database that approximates these counts
– Transform this fractional database into a standard synthetic database by rounding the fractional counts
– For all i ∈ U: variable x_i
– For all c ∈ C: constraint v_c - α ≤ Σ_{i s.t. c(i)=1} x_i ≤ v_c + α
The Linear Program
– Why is there a fractional solution?
  – The real database, an integer solution, is one example!
– Rounding:
  – Scale the fractional database so that its total weight is 1
  – Round down each fractional point to the closest multiple of α/|U|
  – Treat the rounded fractional database as an integer synthetic database of size at most |U|/α
  – If too large, sub-sample
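The rounding step can be sketched in Python. This is a minimal sketch that assumes the fractional solution is already available as a mapping from universe items to nonnegative weights (solving the linear program itself is not shown), and that the missing step size on the slide is α/|U|; the name `round_fractional_db` is hypothetical. Exact rational arithmetic avoids floating-point surprises in the floor operation.

```python
from fractions import Fraction

def round_fractional_db(x, alpha):
    """x: dict mapping each universe item to a nonnegative weight.
    Returns an integer synthetic DB (list of items) of size <= |U| / alpha."""
    total = sum(x.values())
    step = Fraction(alpha) / len(x)        # alpha / |U|
    synthetic = []
    for item, weight in x.items():
        p = Fraction(weight) / total       # scale to total weight 1
        copies = p // step                 # round down to a multiple of step
        synthetic.extend([item] * copies)
    return synthetic

# Toy fractional solution over a 3-item universe.
x = {"a": 5, "b": 3, "c": 1}
alpha = Fraction(1, 4)
synthetic_db = round_fractional_db(x, alpha)
```

Each item loses less than α/|U| of its weight, so the total weight lost is below α and every normalized count moves by less than α.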
How Do We Use Synthetic DB? Why Synthetic DB?
1. Easy to "shrink" DBs by sub-sampling Õ(log|C|/α^2) DB items
2. Gives counts for every query: the output is well-defined even for queries that were not around when sanitizing
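The "shrink by sub-sampling" point can be illustrated in Python. This is a sketch under assumptions: α is treated as an error on fractional counts, the sample size comes from a standard Hoeffding bound plus a union bound over the queries (one way to obtain a log|C|/α^2 size), and the helper names and toy data are hypothetical.

```python
import math
import random

def subsample_size(num_queries, alpha, delta):
    """Rows needed so that all num_queries fractional counts stay within
    alpha, except with probability delta (Hoeffding + union bound)."""
    return math.ceil(math.log(2 * num_queries / delta) / (2 * alpha ** 2))

def shrink(db, k, rng):
    """Sub-sample k rows uniformly, with replacement."""
    return [rng.choice(db) for _ in range(k)]

def fraction(pred, db):
    """Fraction of rows in db that satisfy pred."""
    return sum(1 for row in db if pred(row)) / len(db)

queries = [lambda x: x < 50, lambda x: x % 2 == 0, lambda x: x % 10 == 7]
big_db = [i % 100 for i in range(10_000)]
alpha, delta = 0.1, 1e-6
k = subsample_size(len(queries), alpha, delta)
small_db = shrink(big_db, k, random.Random(0))
worst_error = max(abs(fraction(q, big_db) - fraction(q, small_db))
                  for q in queries)
```

The shrunken DB is still a list of universe items, so any counting query, including one that was not in `queries`, can be evaluated on it directly.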
Utility for all queries: First Attempt
– Sanitizing a small C' is easy ("brute force"); can "shrink" using the syntheticizer
– Sub-sample a small C'; works for all but a few queries
– Repeat many times, take the majority
– Doesn't work: underprivileged queries
[Figure: query set C with sub-samples C', C'' and bad set B]
Utility for all queries: fix "underprivileged"
– Lemma: given a query set C and a diff. private sanitizer A that:
  1. Works for every C' ⊂ C, |C'| = s
  2. Outputs a synthetic DB of size ≤ m
– Get a sanitizer for C with utility on all queries
– Need DB size n ≥ Õ(|C|m/s)
Proof Outline
– Subsample a small C', get a synthetic DB that works for all but a few (~|C|m/s) "underprivileged" queries
– Now "manually" correct those few by "brute force": release noisy counts v_c (noise ~|C|m/s)
– Also need to say which queries are underprivileged ... this depends on DB D. What about privacy?
– Key point: regardless of D, almost all queries are strongly privileged. Release a noisy indicator vector.
– For the privacy analysis, need only consider the ~|C|m/s potentially underprivileged queries
Recursive Algorithm: Recap
[Figure: nested query sets C_0 = C ⊇ C_1 ⊇ C_2 ⊇ ... ⊇ C_b]
– Start with DB D and large query set C
– Repeatedly choose a random subset C_{i+1} of C_i: shrink by a factor f
Recursive Algorithm: Recap
[Figure: nested query sets C_0 = C ⊇ C_1 ⊇ C_2 ⊇ ... ⊇ C_b]
– Start with DB D and large query set C
– Repeatedly choose a random subset C_{i+1} of C_i: shrink by a factor f
– Sanitize D w.r.t. the small C_b (use the "brute force" sanitizer)
– Syntheticizer transforms the output to a small synthetic DB
– Fix "underprivileged" (need n ≥ Õ(f))
– Lose 2^b accuracy; "brute force" needs n ≥ 2^b |C_b|
– n ≥ |C|^{o(1)} by trading off b, f
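One way to see how trading off b against f can give n ≥ |C|^{o(1)} is a back-of-the-envelope calculation in Python. The accounting below is an assumption made for illustration and is not taken from the lecture: it ignores logarithmic and accuracy factors, picks f = |C|^{1/b} so that |C_b| = |C|/f^b = 1, and requires n ≥ max(f, 2^b · |C_b|). The function names are hypothetical.

```python
def log2_required_n(log2_C, b):
    """log2 of the required DB size at recursion depth b, under the
    assumed accounting n >= max(f, 2^b * |C_b|) with f = |C|^(1/b)."""
    log2_f = log2_C / b          # cost of fixing underprivileged queries
    log2_brute_force = b         # log2(2^b * |C_b|) with |C_b| = 1
    return max(log2_f, log2_brute_force)

def best_depth(log2_C):
    """Depth b that minimizes the required DB size."""
    return min(range(1, log2_C + 1),
               key=lambda b: log2_required_n(log2_C, b))

log2_C = 64                                       # |C| = 2^64 queries
b = best_depth(log2_C)                            # about sqrt(log2 |C|)
exponent = log2_required_n(log2_C, b) / log2_C    # n is about |C|^exponent
```

Under this accounting the best depth is about sqrt(log|C|) and the exponent is about 1/sqrt(log|C|), which tends to 0 as |C| grows.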
And Now... Bad News
– Runtime cannot be subpoly in |C| or |U|
  – Output is a synthetic DB (as in the positive result)
  – General output
– Exponential Mechanism cannot be implemented
– Want hardness... Got Crypto?
The Bad News
– For large C and U, can't get efficient sanitizers!
  – Output is a synthetic DB (as in the positive result)
  – General output
– Exponential Mechanism cannot be implemented
– Want hardness... Got Crypto?
Digital Signatures
– Digital signatures with key pair (sk, vk)
– Can build from one-way functions [NaYu,Ro]
– Given valid signatures (m_1, sig(m_1)), (m_2, sig(m_2)), ..., (m_n, sig(m_n)) under vk, it is hard to forge a new signature (m', sig(m'))
Signatures ⇒ No Synthetic DB
– Universe: (m, s), a msg,sig pair
– Queries: c_vk(m, s) outputs 1 iff s is a valid sig of m under vk
– Input DB: (m_1, sig(m_1)), (m_2, sig(m_2)), ..., (m_n, sig(m_n)), valid signatures under the same vk
– Sanitizer output: (m'_1, s_1), ..., (m'_k, s_k); most are valid signatures under vk
– Inputs appear in the output: no privacy!
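The dilemma can be made concrete with a toy Python demo. This is only an illustration under loud assumptions: HMAC-SHA256 from the standard library stands in for the signature scheme (it is a symmetric MAC, so the single hypothetical `KEY` plays the role of both sk and vk, unlike a real public-key signature), and the two "sanitizer" outputs are hand-built rather than produced by any real algorithm.

```python
import hashlib
import hmac

KEY = b"demo signing key"   # hypothetical; stands in for the pair (sk, vk)

def sign(msg):
    """Toy 'signature': an HMAC tag (assumption, not a real signature)."""
    return hmac.new(KEY, msg, hashlib.sha256).digest()

def c_vk(row):
    """Counting-query predicate: 1 iff s is a valid tag for m."""
    m, s = row
    return 1 if hmac.compare_digest(sign(m), s) else 0

def count(db):
    """How many rows of db satisfy c_vk."""
    return sum(c_vk(row) for row in db)

messages = [f"record-{i}".encode() for i in range(8)]
database = [(m, sign(m)) for m in messages]

# An output built without the key: fresh messages with made-up tags.
fresh_rows = [(f"fake-{i}".encode(), bytes(32)) for i in range(8)]
# An output that keeps utility on c_vk by copying input rows.
copied_rows = database[:6]
leaked = [row for row in copied_rows if row in database]
```

The fresh rows score 0 on the query, so the output has no utility; the copied rows score well, but every one of them is an input record in the clear.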
Can We Output Synthetic DB Efficiently?
[Table: runtime as a function of |C| (subpoly / poly) and |U| (subpoly / poly); cells are marked ?? and ?]
Where is Hardness Coming From?
– Signature example:
  – Hard to satisfy a given query
  – Easy to maintain utility for all queries but one
– More natural:
  – Easy to satisfy each individual query
  – Hard to maintain utility for most queries
Hardness on Average
– Universe: (vk, m, s), a key,msg,sig triple
– Queries:
  – c_i(vk, m, s): the i-th bit of ECC(vk)
  – c_v(vk, m, s): 1 iff s is a valid sig of m under vk
– Input DB: (vk, m_1, sig(m_1)), (vk, m_2, sig(m_2)), ..., (vk, m_n, sig(m_n)), valid signatures under vk
– Sanitizer output: (vk'_1, m'_1, s_1), ..., (vk'_k, m'_k, s_k)
– Are these keys related to vk? Yes! At least one is vk!
Hardness on Average
– Samples: (vk, m, s), a key,msg,sig triple
– Queries:
  – c_i(vk, m, s): the i-th bit of ECC(vk)
  – c_v(vk, m, s): 1 iff s is a valid sig of m under vk
– Sanitizer output: (vk'_1, m'_1, s_1), ..., (vk'_k, m'_k, s_k). Are these keys related to vk? Yes! At least one is vk!
  – ∀ i, 3/4 of the vk'_j agree with ECC(vk)[i]
  – ⇒ ∃ vk'_j s.t. ECC(vk'_j) and ECC(vk) are 3/4-close
  – ⇒ vk'_j = vk (error-correcting code)
– m'_j appears in the input. No privacy!
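The averaging-plus-decoding argument can be checked on a toy code in Python. The slides only say "ECC"; the Hadamard code used here is an assumption chosen because any two distinct codewords agree on exactly half of the positions (its exponential length would not suit the real construction, which needs few queries c_i). All names are hypothetical.

```python
import itertools

def ecc(key_bits):
    """Hadamard encoding: inner product mod 2 with every vector in {0,1}^k."""
    k = len(key_bits)
    return [sum(x & a for x, a in zip(key_bits, vec)) % 2
            for vec in itertools.product((0, 1), repeat=k)]

def closeness(u, v):
    """Fraction of positions on which two codewords agree."""
    return sum(1 for a, b in zip(u, v) if a == b) / len(u)

def min_per_bit_agreement(output_keys, vk):
    """Worst case over positions i of the fraction of output keys whose
    codeword matches ECC(vk)[i]; utility on the c_i queries needs >= 3/4."""
    target = ecc(vk)
    codewords = [ecc(key) for key in output_keys]
    return min(sum(1 for cw in codewords if cw[i] == target[i]) / len(codewords)
               for i in range(len(target)))

vk = (1, 0, 1, 1)
other = (0, 1, 1, 0)
output_keys = [vk, other, vk, vk]
agreement = min_per_bit_agreement(output_keys, vk)
close_keys = [key for key in output_keys
              if closeness(ecc(key), ecc(vk)) >= 0.75]
without_vk = min_per_bit_agreement([other, (1, 1, 1, 1), (0, 0, 0, 1)], vk)
```

An output that contains vk can reach the 3/4 per-bit agreement, and every 3/4-close key in it equals vk; an output made only of other keys averages 1/2 agreement, so some bit query fails.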
Can We Output Synthetic DB Efficiently?
[Table: runtime as a function of |C| (subpoly / poly) and |U| (subpoly / poly); cells are marked ?? and ?, with annotations: Signatures; Hard on Avg.; Using PRFs]
General output sanitizers
– Theorem: traitor tracing schemes exist if and only if sanitizing is hard
– Tight connection between the |U|, |C| that are hard to sanitize and the key, ciphertext sizes in traitor tracing
– Separation between efficient/non-efficient sanitizers uses the [BoSaWa] scheme
Traitor Tracing: The Problem
– Center transmits a message to a large group
– Some users leak their keys to pirates
– Pirates construct a clone: an unauthorized decryption device
– Given a pirate box, want to find who leaked the keys
[Figure: E(Content) enters a pirate box built from keys K_1, K_3, K_8 and comes out as Content; the traitors' "privacy" is violated!]
Equivalence of TT and Hardness of Sanitizing
Traitor Tracing ↔ Sanitizing hard
– Key ↔ Database entry
– Ciphertext ↔ Query
– TT Pirate ↔ Sanitizer for distribution of DBs (collection of)
Traitor Tracing ⇒ Hard Sanitizing
– Theorem: if there exists a TT scheme with
  – cipher length c(n),
  – key length k(n),
  then can construct:
  1. Query set C of size ≈ 2^{c(n)}
  2. Data universe U of size ≈ 2^{k(n)}
  3. Distribution D on n-user databases with entries from U
– D is "hard to sanitize": there exists a tracer that can extract an entry in D from any sanitizer's output. Violate its privacy!
– Separation between efficient/non-efficient sanitizers uses the [BoSaWa06] scheme