Foundations of Privacy Lecture 4 Lecturer: Moni Naor
Recap of last week’s lecture
Differential Privacy
Sensitivity:
– Global sensitivity of query $q: U^n \to \mathbb{R}^d$: $GS_q = \max_{D,D'} \|q(D) - q(D')\|_1$, with the maximum over adjacent $D, D'$
– Local sensitivity of query $q$ at point $D$: $LS_q(D) = \max_{D'} |q(D) - q(D')|$, with the maximum over $D'$ adjacent to $D$
– Smooth sensitivity: $S_f^*(X) = \max_Y \{ LS_f(Y) \, e^{-\varepsilon \cdot dist(X,Y)} \}$
Histograms
Differential privacy of median
Exponential Mechanism
Histograms
– Inputs $x_1, x_2, \ldots, x_n$ in domain $U$
– Domain $U$ partitioned into $d$ disjoint bins $S_1, \ldots, S_d$
– $q(x_1, x_2, \ldots, x_n) = (n_1, n_2, \ldots, n_d)$ where $n_j = \#\{i : x_i \in S_j\}$
– Can view as $d$ queries: $q_i$ counts the number of points in set $S_i$
– For adjacent $D, D'$, only one answer can change, and it can change by 1
– Global sensitivity of the answer vector is 1
– Sufficient to add $Lap(1/\varepsilon)$ noise to each query and still get $\varepsilon$-privacy
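A minimal Python sketch of the noisy histogram described above. The names `laplace` and `noisy_histogram` are illustrative and not from the lecture; bins are assumed to be given as disjoint Python sets, and Laplace noise is sampled as the difference of two exponentials.

```python
import random

def laplace(scale, rng):
    # Lap(scale): difference of two i.i.d. exponentials with mean `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_histogram(data, bins, epsilon, rng=None):
    rng = rng if rng is not None else random.Random()
    # n_j = number of inputs that fall in the j-th bin.
    counts = [sum(1 for x in data if x in s) for s in bins]
    # Independent Lap(1/epsilon) noise on every bin count.
    return [n_j + laplace(1.0 / epsilon, rng) for n_j in counts]

ages = [23, 35, 41, 29, 62, 57, 33]
bins = [set(range(0, 30)), set(range(30, 50)), set(range(50, 120))]
print(noisy_histogram(ages, bins, epsilon=0.5))
```

Because the bins are disjoint, the noise scale does not grow with the number of bins $d$.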
The Exponential Mechanism [McSherry Talwar]
– A general mechanism that yields differential privacy
– May yield utility/approximation
– Is defined and evaluated by considering all possible answers
– The definition does not yield an efficient way of evaluating it
Application/original motivation: approximate truthfulness of auctions
– Collusion resistance
– Compatibility
Side bar: Digital Goods Auction
– Some product with 0 cost of production
– $n$ individuals with valuations $v_1, v_2, \ldots, v_n$
– Auctioneer wants to maximize profit
Example of the Exponential Mechanism
– Data: $x_i$ = website visited by student $i$ today
– Range: $Y$ = {website names}
– For each name $y$, let $q(y, X) = \#\{i : x_i = y\}$ (the size of a subset of the students)
– Goal: output the most frequently visited site
– Procedure: given $X$, output website $y$ with probability proportional to $e^{\varepsilon q(y,X)}$
– Popular sites are exponentially more likely than rare ones
– Website scores don’t change too quickly
Setting
– For input $D \in U^n$, want to find $r \in R$
– Base measure $\mu$ on $R$, usually uniform
– Score function $q': U^n \times R \to \mathbb{R}$ assigns any pair $(D, r)$ a real value
  – Want to maximize it (approximately)
The exponential mechanism
– Assign output $r \in R$ with probability proportional to $e^{\varepsilon q'(D,r)} \mu(r)$
– Normalizing factor: $\sum_{r \in R} e^{\varepsilon q'(D,r)} \mu(r)$ (an integral when $R$ is continuous)
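The exponential mechanism is private
– Let $\Delta = \max_{D,D',r} |q'(D,r) - q'(D',r)|$, with the maximum over adjacent $D, D'$
– Claim: the exponential mechanism yields a $2 \cdot \varepsilon \cdot \Delta$ differentially private solution
– $\Pr[\text{output} = r \text{ on input } D] = e^{\varepsilon q'(D,r)} \mu(r) / \sum_{r'} e^{\varepsilon q'(D,r')} \mu(r')$
– $\Pr[\text{output} = r \text{ on input } D'] = e^{\varepsilon q'(D',r)} \mu(r) / \sum_{r'} e^{\varepsilon q'(D',r')} \mu(r')$
– For adjacent $D, D'$, the ratio of the numerators is bounded by $e^{\varepsilon \Delta}$, and so is the ratio of the denominators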
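The mechanism and its privacy bound can be tried on the website example. This is a sketch under stated assumptions: a finite range, a uniform base measure by default, and hypothetical helper names (`exp_mech_probs`, `exp_mech_sample`, `visit_counts`). For the visit-count score, changing one student's entry changes each score by at most 1, so $\Delta = 1$.

```python
import math
import random

def exp_mech_probs(scores, epsilon, base=None):
    # Pr[r] is proportional to exp(epsilon * q'(D, r)) * mu(r); `scores` maps r -> q'(D, r).
    base = base if base is not None else {r: 1.0 for r in scores}
    top = max(scores.values())  # shifting all scores by a constant leaves the distribution unchanged
    weights = {r: math.exp(epsilon * (s - top)) * base[r] for r, s in scores.items()}
    total = sum(weights.values())  # normalizing factor
    return {r: w / total for r, w in weights.items()}

def exp_mech_sample(scores, epsilon, rng=None):
    rng = rng if rng is not None else random.Random()
    probs = exp_mech_probs(scores, epsilon)
    outputs = list(probs)
    return rng.choices(outputs, weights=[probs[r] for r in outputs], k=1)[0]

def visit_counts(visits, sites):
    # q(y, X) = #{i : x_i = y}
    return {y: sum(1 for x in visits if x == y) for y in sites}

sites = ["news", "mail", "wiki"]
D = ["news", "news", "mail", "wiki", "news"]
D_adj = ["news", "news", "mail", "wiki", "mail"]  # adjacent: one student's entry changed
eps = 0.5
p = exp_mech_probs(visit_counts(D, sites), eps)
p_adj = exp_mech_probs(visit_counts(D_adj, sites), eps)
worst = max(max(p[y] / p_adj[y], p_adj[y] / p[y]) for y in sites)
print(exp_mech_sample(visit_counts(D, sites), eps), worst <= math.exp(2 * eps))
```

Enumerating every output is what makes the definition inefficient when the range is large.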
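Laplace Noise as Exponential Mechanism
– On query $q: U^n \to \mathbb{R}$, let $q'(D, r) = -|q(D) - r|$
– $\Pr[\text{noise} = y] = e^{-\varepsilon |y|} / \left( 2 \int_{y \ge 0} e^{-\varepsilon y} \, dy \right) = \frac{\varepsilon}{2} e^{-\varepsilon |y|}$
– Laplace distribution $Y = Lap(b)$ has density function $\Pr[Y = y] = \frac{1}{2b} e^{-|y|/b}$, so the noise above is distributed as $Lap(1/\varepsilon)$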
Any Differentially Private Mechanism is an Instance of the Exponential Mechanism
– Let $M$ be a differentially private mechanism
– Take $q'(D, r)$ to be $\log \Pr[M(D) = r]$
– Remaining issue: accuracy
Private Ranking
– Each element $i \in \{1, \ldots, n\}$ has a real-valued score $S_D(i)$ based on a data set $D$
– Goal: output $k$ elements with highest scores
Privacy
– Data set $D$ consists of $n$ entries in domain $\mathcal{D}$
  – Differential privacy: protects privacy of entries in $D$
– Condition: insensitive scores
  – For any element $i$ and any data sets $D, D'$ that differ in one entry: $|S_D(i) - S_{D'}(i)| \le 1$
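Approximate ranking
– Let $S_k$ be the $k$-th highest score based on data set $D$
– An output list is $\alpha$-useful if:
  – Soundness: no element in the output has score less than $S_k - \alpha$
  – Completeness: every element with score greater than $S_k + \alpha$ is in the output
– Three score ranges: Score $\le S_k - \alpha$ (must be excluded); $S_k - \alpha \le$ Score $\le S_k + \alpha$ (may go either way); $S_k + \alpha \le$ Score (must be included)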
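The two conditions can be checked mechanically. A small sketch, assuming scores are given as a dict from element to score; the function name `is_alpha_useful` is hypothetical.

```python
def is_alpha_useful(scores, output, k, alpha):
    # S_k: the k-th highest score in the data set.
    s_k = sorted(scores.values(), reverse=True)[k - 1]
    chosen = set(output)
    # Soundness: nothing in the output scores below S_k - alpha.
    sound = all(scores[i] >= s_k - alpha for i in chosen)
    # Completeness: everything scoring above S_k + alpha is in the output.
    complete = all(i in chosen for i, s in scores.items() if s > s_k + alpha)
    return sound and complete

scores = {"a": 10, "b": 8, "c": 7.5, "d": 3}
print(is_alpha_useful(scores, ["a", "c"], k=2, alpha=1))  # "c" is within alpha of S_k = 8
print(is_alpha_useful(scores, ["a", "d"], k=2, alpha=1))  # "d" violates soundness
```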
Two Approaches
Score perturbation
– Perturb the scores of the elements with noise
– Pick the top $k$ elements in terms of noisy scores
– Fast and simple implementation
– Question: what sort of noise should be added? What sort of guarantees?
Exponential sampling
– Run the exponential mechanism $k$ times
– More complicated and slower implementation
– What sort of guarantees?
Note: each input affects all scores
Homework
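Both approaches are easy to prototype. The sketch below is only illustrative: the slide leaves the choice of noise and the resulting guarantees open, so the Laplace noise, the `noise_scale` and `epsilon` parameters, and the choice to sample without replacement are all assumptions, and no privacy guarantee is claimed for these settings.

```python
import math
import random

def laplace(scale, rng):
    # Lap(scale): difference of two i.i.d. exponentials with mean `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_top_k(scores, k, noise_scale, rng=None):
    # Score perturbation: add noise to every score, then take the k largest.
    rng = rng if rng is not None else random.Random()
    noisy = {i: s + laplace(noise_scale, rng) for i, s in scores.items()}
    return sorted(noisy, key=noisy.get, reverse=True)[:k]

def exponential_top_k(scores, k, epsilon, rng=None):
    # Exponential sampling: run the exponential mechanism k times,
    # removing each chosen element before the next round (an assumption).
    rng = rng if rng is not None else random.Random()
    remaining = dict(scores)
    chosen = []
    for _ in range(k):
        top = max(remaining.values())
        items = list(remaining)
        weights = [math.exp(epsilon * (remaining[i] - top)) for i in items]
        pick = rng.choices(items, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]
    return chosen

scores = {"a": 9, "b": 7, "c": 4, "d": 1}
print(noisy_top_k(scores, 2, noise_scale=1.0))
print(exponential_top_k(scores, 2, epsilon=1.0))
```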
Exponential Mechanism: Simple Example
(almost free) private lunch
– Database of $n$ individuals, lunch options $\{1, \ldots, k\}$; each individual likes or dislikes each option (1 or 0)
– Goal: output a lunch option that many like
– For each lunch option $j \in [k]$, $\ell(j)$ is the number of individuals who like $j$
– Exponential mechanism: output $j$ with probability proportional to $e^{\varepsilon \ell(j)}$
– Actual probability: $e^{\varepsilon \ell(j)} / \sum_{i} e^{\varepsilon \ell(i)}$, where the denominator is the normalizer
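A worked instance of the lunch example, as a sketch. The 0/1 matrix `likes` and the function name `lunch_probabilities` are made up for illustration.

```python
import math

def lunch_probabilities(likes, epsilon):
    # likes[i][j] is 1 if individual i likes option j, else 0.
    k = len(likes[0])
    ell = [sum(row[j] for row in likes) for j in range(k)]  # l(j) for each option
    weights = [math.exp(epsilon * l) for l in ell]
    normalizer = sum(weights)
    return ell, [w / normalizer for w in weights]

likes = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
ell, probs = lunch_probabilities(likes, epsilon=1.0)
print(ell, probs)
```

Each extra person who likes an option multiplies its weight by $e^{\varepsilon}$.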
Synthetic DB: Output is a DB
[Figure: a sanitizer sits between the database and the user, who sends query 1, query 2, ... and receives answer 1, answer 2, answer 3]
– Synthetic DB: the output is also a DB (of entries from the same universe $X$); the user reconstructs answers by evaluating the query on the output DB
– Software and people compatible
– Consistent answers
Answering More Queries
Using the exponential mechanism:
– Differential privacy for every set $C$ of counting queries
– Error is $\tilde{O}(n^{2/3} \log|C|)$
– Remarkable: hope for rich private analysis of small DBs!
  – Quantitative: #queries >> DB size
  – Qualitative: the output of the sanitizer is a synthetic DB, i.e. the output is a DB itself
Counting Queries
– Queries with low sensitivity
– Counting queries: $C$ is a set of predicates $c: U \to \{0,1\}$
– Query: how many participants in $D$ satisfy $c$?
– Relaxed accuracy: answer each query within additive error $\alpha$ w.h.p.
  – Not so bad: error is inherent in statistical analysis anyway
– Non-interactive: assume all queries are given in advance
[Figure: universe $U$, database $D$ of size $n$, query $c$]
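A counting query is just a predicate evaluated over the rows. The records and predicates below are made-up illustrations; the last line checks that replacing one row moves each count by at most 1.

```python
def counting_query(predicate, db):
    # Number of participants in the database that satisfy the predicate c.
    return sum(1 for row in db if predicate(row))

db = [
    {"age": 34, "smoker": True},
    {"age": 51, "smoker": False},
    {"age": 29, "smoker": True},
]
queries = [
    lambda row: row["smoker"],
    lambda row: row["age"] >= 30,
    lambda row: row["smoker"] and row["age"] < 30,
]
answers = [counting_query(c, db) for c in queries]
print(answers)

# Adjacent database: one row replaced.
db_adj = db[:2] + [{"age": 70, "smoker": False}]
print(all(abs(counting_query(c, db) - counting_query(c, db_adj)) <= 1 for c in queries))
```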
Utility and Privacy Can’t Always Be Achieved Simultaneously
– Impossibility results for counting queries: a DB with $n$ participants can’t have $o(\sqrt{n})$ error on $O(n)$ queries [DiNi, DwMcTa07, DwYe08]
– In all these cases, strong privacy violation: almost the entire DB is compromised
– What can we do?
Huge DBs [Dwork Nissim]
– DB of size $n$ >> #queries $|C|$: add independent noise to the answer on every query
– Noise per query ~ #queries
– For accuracy, need #queries ≤ $n$ (with $n$ huge)
– May be reasonable for huge internet-scale DBs: privacy “for free”
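One standard way to realize "noise per query ~ #queries" is to split the privacy budget evenly and add $Lap(|C|/\varepsilon)$ noise to each count. That specific scale is an assumption here, since the slide only gives the order of magnitude; the function names are illustrative.

```python
import random

def laplace(scale, rng):
    # Lap(scale): difference of two i.i.d. exponentials with mean `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def answer_all(db, queries, epsilon, rng=None):
    rng = rng if rng is not None else random.Random()
    scale = len(queries) / epsilon  # assumed scale: grows linearly with the number of queries
    return [sum(1 for x in db if c(x)) + laplace(scale, rng) for c in queries]

db = list(range(100000))  # a "huge" DB relative to the two queries below
queries = [lambda x: x % 2 == 0, lambda x: x < 1000]
print(answer_all(db, queries, epsilon=0.5))
```

With $n$ much larger than $|C|/\varepsilon$, the noise is small relative to the counts themselves.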
What about smaller DBs?
– DB of size $n$ < #queries $|C|$
– Impossibility results: can’t have $o(\sqrt{n})$ error
– Error must be $\Omega(\sqrt{n})$
The BLR Algorithm [Blum Ligett Roth08]
– For DBs $F$ and $D$: $dist(F, D) = \max_{q \in C} |q(F) - q(D)|$
– Intuition: far-away DBs get smaller probability
– Algorithm on input DB $D$: sample from a distribution on DBs of size $m$ ($m < n$), where DB $F$ gets picked w.p. $\propto e^{-\varepsilon \cdot dist(F,D)}$
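For a tiny universe the sampling distribution can be built by brute force. This is a sketch, not the authors' implementation. Two assumptions are made: candidate databases are treated as multisets of size $m$, and counts on a size-$m$ database are rescaled by $n/m$ so they are comparable with counts on $D$. All function names are hypothetical.

```python
import itertools
import math
import random

def count(query, db):
    return sum(1 for x in db if query(x))

def blr_distribution(db, universe, queries, m, epsilon):
    n = len(db)
    answers = [count(q, db) for q in queries]
    # Enumerate every size-m multiset over the universe.
    candidates = list(itertools.combinations_with_replacement(universe, m))
    # Assumption: rescale counts on the small DB by n/m before comparing with D.
    dists = [max(abs(count(q, f) * n / m - a) for q, a in zip(queries, answers))
             for f in candidates]
    weights = [math.exp(-epsilon * d) for d in dists]
    total = sum(weights)
    return candidates, dists, [w / total for w in weights]

def blr_sample(db, universe, queries, m, epsilon, rng=None):
    rng = rng if rng is not None else random.Random()
    candidates, _, probs = blr_distribution(db, universe, queries, m, epsilon)
    return rng.choices(candidates, weights=probs, k=1)[0]

universe = range(4)
queries = [lambda x: x >= 2, lambda x: x % 2 == 0]
D = [0, 0, 1, 3, 3, 3]
print(blr_sample(D, universe, queries, m=2, epsilon=1.0))
```

The enumeration step is exponential in $m$, so this only runs for toy sizes.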
The BLR Algorithm
Idea:
– In general: do not use a large DB
  – Sample and answer accordingly
– A DB of size $m$ guarantees hitting each query with sufficient accuracy
The BLR Algorithm: $2\varepsilon$-Privacy
– For adjacent $D, D'$ and for every $F$: $|dist(F, D) - dist(F, D')| \le 1$
– Probability of $F$ by $D$: $e^{-\varepsilon \cdot dist(F,D)} / \sum_{G \text{ of size } m} e^{-\varepsilon \cdot dist(G,D)}$
– Probability of $F$ by $D'$: numerator and denominator can each change by an $e^{\varepsilon}$ factor
– Hence $2\varepsilon$-privacy
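Spelled out, the privacy argument is the following chain of inequalities. This is a worked version of the slide's reasoning, using only $|dist(F,D) - dist(F,D')| \le 1$ for every $F$.

```latex
\begin{align*}
\frac{\Pr[F \mid D]}{\Pr[F \mid D']}
  &= \frac{e^{-\varepsilon \cdot dist(F,D)}}{e^{-\varepsilon \cdot dist(F,D')}}
     \cdot
     \frac{\sum_{G} e^{-\varepsilon \cdot dist(G,D')}}{\sum_{G} e^{-\varepsilon \cdot dist(G,D)}} \\
  &\le e^{\varepsilon \, |dist(F,D') - dist(F,D)|}
     \cdot
     \frac{\sum_{G} e^{\varepsilon} \, e^{-\varepsilon \cdot dist(G,D)}}{\sum_{G} e^{-\varepsilon \cdot dist(G,D)}} \\
  &\le e^{\varepsilon} \cdot e^{\varepsilon} = e^{2\varepsilon}
\end{align*}
```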
The BLR Algorithm: Error $\tilde{O}(n^{2/3} \log|C|)$
– There exists $F_{good}$ of size $m = \tilde{O}((n/\alpha)^2 \cdot \log|C|)$ s.t. $dist(F_{good}, D) \le \alpha$
– $\Pr[F_{good}] \sim e^{-\varepsilon\alpha}$
– For any $F_{bad}$ with $dist \ge 2\alpha$: $\Pr[F_{bad}] \sim e^{-2\varepsilon\alpha}$
– Union bound: $\sum_{\text{bad DB } F_{bad}} \Pr[F_{bad}] \sim |U|^m e^{-2\varepsilon\alpha}$
– For $\alpha = \tilde{O}(n^{2/3} \log|C|)$: $\Pr[F_{good}] \gg \sum \Pr[F_{bad}]$
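A back-of-the-envelope calculation shows where the $n^{2/3}$ comes from. It works with the unnormalized weights and ignores the factors hidden by $\tilde{O}$, so the exact logarithmic terms below are an estimate, not a statement from the slides.

```latex
\begin{align*}
e^{-\varepsilon\alpha} \gg |U|^{m} e^{-2\varepsilon\alpha}
  &\iff e^{\varepsilon\alpha} \gg |U|^{m}
   \iff \varepsilon\alpha \gg m \ln|U| \\
\text{with } m \approx (n/\alpha)^{2} \log|C| :\quad
  \varepsilon\alpha \gg \frac{n^{2} \log|C| \, \ln|U|}{\alpha^{2}}
  &\iff \alpha^{3} \gg \frac{n^{2} \log|C| \, \ln|U|}{\varepsilon} \\
\text{so it suffices to take}\quad
  \alpha &\approx n^{2/3} \left( \frac{\log|C| \, \ln|U|}{\varepsilon} \right)^{1/3}
\end{align*}
```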
The BLR Algorithm: Running Time
– Generating the distribution by enumeration: need to enumerate every size-$m$ database, where $m = \tilde{O}((n/\alpha)^2 \cdot \log|C|)$
– Running time ≈ $|U|^{\tilde{O}((n/\alpha)^2 \cdot \log|C|)}$
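To get a feel for the enumeration cost, the number of candidate databases can be computed directly. The sizes below are arbitrary illustrations; databases are counted both as ordered tuples ($|U|^m$) and as multisets.

```python
import math

def num_ordered_dbs(universe_size, m):
    # Size-m databases as ordered tuples: |U|^m.
    return universe_size ** m

def num_multiset_dbs(universe_size, m):
    # Size-m databases as multisets: C(|U| + m - 1, m).
    return math.comb(universe_size + m - 1, m)

print(num_ordered_dbs(4, 2), num_multiset_dbs(4, 2))  # 16 10
print(len(str(num_ordered_dbs(1000, 50))))            # 151 decimal digits
```

Even a universe of 1000 items and $m = 50$ is far beyond what enumeration can handle.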
Conclusion
– Offline algorithm, $2\varepsilon$-differential privacy for any set $C$ of counting queries
– Error $\alpha$ is $\tilde{O}(n^{2/3} \log|C| / \varepsilon)$
– Super-poly running time: $|U|^{\tilde{O}((n/\alpha)^2 \cdot \log|C|)}$
Can We Efficiently Sanitize?
– The good news: if the universe is small, can sanitize EFFICIENTLY, in time poly(|C|, |U|)
– The bad news: cannot do much better, namely sanitize in time sub-poly(|C|) AND sub-poly(|U|)
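How Efficiently Can We Sanitize?
[Table: running time as a function of |C| (sub-poly or poly) and |U| (sub-poly or poly). One cell is marked "Good news!"; the remaining cells are marked "?"]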
The Good News: Can Sanitize When Universe is Small
Efficient sanitizer for query set $C$:
– DB size $n \ge \tilde{O}(|C|^{o(1)} \log|U|)$
– Error is ~ $n^{2/3}$
– Runtime poly(|C|, |U|)
– Output is a synthetic database
Compare to [Blum Ligett Roth]: $n \ge \tilde{O}(\log|C| \log|U|)$, runtime super-poly(|C|, |U|)
Recursive Algorithm
– Start with DB $D$ and large query set $C$
– Repeatedly choose a random subset $C_{i+1}$ of $C_i$: shrink the query set by a (small) factor
– End recursion: sanitize $D$ w.r.t. the small query set $C_b$
– Output is good for all queries in the small set $C_{i+1}$
– Extract utility on almost all queries in the large set $C_i$
– Fix the remaining “underprivileged” queries in the large set $C_i$
[Figure: nested query sets $C_0 = C \supseteq C_1 \supseteq C_2 \supseteq \cdots \supseteq C_b$]