Helping Kinsey Compute
Cynthia Dwork, Microsoft Research
The Problem
Exploit data, e.g., a medical insurance database:
- Does smoking contribute to heart disease?
- Was there a rise in asthma emergency room cases this month?
- What fraction of the admissions during 2004 were men aged 25-35?
...while preserving the privacy of individuals
Holistic Statistics
- Is the dataset well clustered?
- What is the single best predictor for risk of stroke?
- How are attributes X and Y correlated; what is cov(X,Y)?
- Are the data inherently low-dimensional?
Statistical Database
Query (f, S), where $f: \text{row} \to [0,1]$ and $S \subseteq [n]$
Exact answer: $\sum_{r \in S} f(\text{row}_r)$
Database: $(D_1, \ldots, D_n)$
Response: $\sum f$ + noise
Statistical Database
Response: $\sum f$ + noise
Under control of interlocutor:
- Noise generation
- Number of queries T permitted
Why Bother With Noise?
Limiting the interface to queries about large sets is insufficient:
$A = \{1, \ldots, n\}$ and $B = \{2, \ldots, n\}$
$\sum_{a \in A} f(\text{row}_a) - \sum_{b \in B} f(\text{row}_b) = f(\text{row}_1)$
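The differencing attack on this slide can be sketched in a few lines of Python. The `exact_sum` helper and the toy database are hypothetical, chosen only to illustrate the point; rows are 0-indexed here, so the slide's row 1 is `db[0]`.

```python
def exact_sum(db, f, rows):
    """Answer a statistical query exactly: the sum of f over the selected rows."""
    return sum(f(db[r]) for r in rows)

# Toy database with one binary attribute per row (illustrative values).
db = [1, 0, 1, 1, 0, 1, 0, 0]
n = len(db)

def identity(row):
    return row

A = range(0, n)   # every row
B = range(1, n)   # every row except the first

# Both queries cover large sets, yet their difference isolates a single row.
leaked = exact_sum(db, identity, A) - exact_sum(db, identity, B)
print(leaked == db[0])
```

With exact answers the restriction to large sets gives no protection, which is why the interface adds noise to each response.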
Previous (Modern) Work in this Model
Dinur, Nissim [2003]
Single binary attribute (query function f = identity)
Non-privacy: whp adversary guesses a $1-\varepsilon$ fraction of the rows
- Theorem: Polytime non-privacy if whp |noise| is o(√n)
- Theorem: Privacy with o(√n) noise if #queries is << n
Privacy "for free"!
Rows ~ samples from underlying distribution: Pr[row_i = 1] = p
E[#1's] = pn, Var = Θ(n)
Actual #1's ≈ pn ± Θ(√n)
|Privacy-preserving noise| is o(sampling error)
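A small numeric illustration of "privacy for free", under assumed values: the database size, the probability p, and the noise exponent 0.4 are all arbitrary choices for the sketch, not figures from the talk.

```python
import math

n = 1_000_000   # database size (illustrative)
p = 0.3         # Pr[row_i = 1] (illustrative)

# Standard deviation of the number of 1's when rows are independent samples.
sampling_std = math.sqrt(n * p * (1 - p))

# A perturbation of magnitude n^0.4 is o(sqrt(n)); the exponent is an arbitrary example.
noise_magnitude = n ** 0.4

print(round(sampling_std, 1), round(noise_magnitude, 1))
print("noise / sampling error:", noise_magnitude / sampling_std)
```

The ratio shrinks as n grows, so the privacy-preserving noise is dominated by the sampling error the analyst already has to live with.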
Real Power in this Model
Dwork, Nissim [2004]
Multiple binary attributes: q = (S, f), $f: \{0,1\}^d \to \{0,1\}$
- Definition of privacy appropriate to enriched query set
- Theorem: Privacy with o(√n) noise if #queries is << n
- Coined term SuLQ
Vertically Partitioned Databases
- Learn joint statistics from independently operated SuLQ databases: given SuLQ A and SuLQ B, learn if A implies B in probability
  E.g., heart disease risk increases with smoking
- Enables learning statistics for all Boolean functions of attributes
Still More Power
[Blum, Dwork, McSherry, Nissim 05]
Extend Privacy Proofs
- Real-valued functions $f: [0,1]^d \to [0,1]$
- Per-row analysis: drop dependence on n! How many queries has THIS row participated in?
  Our Data, Ourselves
Holistic Statistics: A Calculus of Noisy Computation
- Beyond statistics: (not too) noisy versions of k-means, perceptron, ID3 algorithms
- (Not too) noisy optimal projections: SVD, PCA
- All of STAT learning
Towards Defining Privacy: "Facts of Life" vs Privacy Breach
Diabetes is more likely in obese persons
- Does not imply THIS obese person has or will have diabetes
Sneaker color preference is correlated with political party
- Does not imply THIS person in red sneakers is a Republican
Half of all marriages result in divorce
- Does not imply Pr[THIS marriage will fail] = ½
$(\varepsilon, T)$-Privacy
Power of adversary:
- Phase 0: Specify a goal function $g: \text{row} \to \{0,1\}$ (actually, a polynomial number of functions). The adversary will try to learn this information about someone.
- Phase 1: Adaptively make T queries
- Phase 2: Choose a row i to attack; get the entire database except for row i
Privacy breach: occurs if the adversary's "confidence" in $g(\text{row}_i)$ changes by $\varepsilon$
Notes:
- Adversary chooses goal
- My privacy is preserved even if everybody else tells their secrets to the adversary
Flavor of Privacy Proofs
Define confidence in the value of $g(\text{row}_i)$:
- $c_0 = \log[p_0/(1-p_0)]$
- 0 when p = ½, skyrockets as p moves toward 0 or 1
Model evolution of confidence as a martingale:
- Argue expected difference at each step is small
- Compute absolute upper bound on difference
- Plug these two parameters into Azuma's inequality
Obtain a probabilistic statement regarding the change in confidence or, equivalently, the change from prior to posterior probabilities about the value of $g(\text{row}_i)$
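The confidence measure is the log-odds of the adversary's belief. A minimal sketch follows; the function names `confidence` and `posterior` are illustrative, not from the talk.

```python
import math

def confidence(p):
    """Log-odds log[p / (1 - p)] of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def posterior(prior, gain):
    """Probability reached when the confidence (log-odds) of `prior` moves by `gain`."""
    c = confidence(prior) + gain
    return 1.0 / (1.0 + math.exp(-c))

# Confidence is 0 at p = 1/2 and grows quickly as p approaches 1.
for p in (0.5, 0.9, 0.99, 0.999):
    print(p, round(confidence(p), 3))

# A bounded change in confidence translates into a bounded change in probability.
print(posterior(0.5, 0.1))
```

Bounding the change in log-odds is therefore the same as bounding how far the posterior can drift from the prior.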
Remainder of This Talk
Description of SuLQ Algorithm + Statement of Main Theorem
Examples
- k means
- SVD, PCA
- Perceptron
- STAT learning
Vertically Partitioned Data
- Determining if $\alpha \Rightarrow \beta$ in probability: $\Pr[\beta \mid \alpha] \ge \Pr[\beta] + \Delta$, when $\alpha$ and $\beta$ are in different SuLQ databases
Summary
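Azuma's Inequality
Let $s_1, \ldots, s_T$ be i.i.d. such that $E[s_j] \le \alpha$ and $|s_j| \le \beta$. Then
$\Pr\left[\left|\sum_i s_i\right| > \lambda(\alpha + \beta)T^{1/2} + T\alpha\right] \le 2e^{-\lambda^2/2}$
We will take $\alpha = 1/(2R)$ and $\beta = (2\log(T/\delta)/R)^{1/2} + 1/(2R)$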
The SuLQ Algorithm
Algorithm:
- Input: query $(S \subseteq [n],\ f: [0,1]^d \to [0,1])$
- Output: $\sum_{i \in S} f(\text{row}_i) + N(0, R)$
Theorem: $\forall \delta$, with probability at least $1-\delta$, choosing $R > 32 \log(2/\delta) \log(T/\delta)\, T/\varepsilon^2$ ensures that for each (target, predicate) pair, after T queries the probability that the confidence has increased by more than $\varepsilon$ is at most $\delta$.
R is independent of n. Bigger n means better stats.
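The SuLQ primitive can be sketched as a small Python class. This is an illustration under assumptions, not the authors' implementation: the class name, the query-budget check, and the parameter names `eps` and `delta` are hypothetical, and `required_variance` assumes the bound reads 32 log(2/δ) log(T/δ) T/ε².

```python
import math
import random

def required_variance(T, eps, delta):
    """Lower bound on R, assuming it reads 32 log(2/delta) log(T/delta) T / eps^2."""
    return 32 * math.log(2 / delta) * math.log(T / delta) * T / eps ** 2

class SuLQ:
    """Noisy-sum interface: each answer is an exact sum plus N(0, R) noise."""

    def __init__(self, rows, R, T, seed=None):
        self.rows = rows
        self.sigma = math.sqrt(R)    # N(0, R) has variance R
        self.remaining = T           # number of queries still permitted
        self.rng = random.Random(seed)

    def query(self, S, f):
        """Return sum_{i in S} f(row_i) + N(0, R); f should map a row into [0, 1]."""
        if self.remaining <= 0:
            raise RuntimeError("query budget T exhausted")
        self.remaining -= 1
        exact = sum(f(self.rows[i]) for i in S)
        return exact + self.rng.gauss(0.0, self.sigma)

# Usage: the bound grows roughly linearly in T and does not involve n.
T = 100
R = 1.01 * required_variance(T, eps=1.0, delta=0.05)
db = SuLQ([0.2, 0.4, 0.9, 0.5], R=R, T=T, seed=0)
print(db.query(range(4), lambda row: row))
```

With only four rows the answer is swamped by noise, which illustrates why a bigger n gives better statistics for the same R.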
k Means Clustering physics, OR, machine learning, data mining, etc.
SuLQ k Means
Estimate the size of each cluster
Estimate the average of the points in each cluster:
- Estimate their sum; and
- Divide the estimated sum by the estimated size
Side by Side: k Means and SuLQ k-Means
k Means, basic step:
- Input: data points $p_1, \ldots, p_n$ and k 'centers' $c_1, \ldots, c_k$ in $[0,1]^d$
- $S_j$ = points for which $c_j$ is the closest center
- Output: $c'_j$ = average of points in $S_j$, $j = 1, \ldots, k$
SuLQ k-Means, basic step:
- Input: data points $p_1, \ldots, p_n$ and k 'centers' $c_1, \ldots, c_k$ in $[0,1]^d$
- $s_j$ = SuLQ($f(d_i) := 1$ if $j = \arg\min_{\ell} \|c_\ell - d_i\|$, 0 otherwise)
- $\mu'_j$ = SuLQ($f(d_i) := d_i$ if $j = \arg\min_{\ell} \|c_\ell - d_i\|$, 0 otherwise) / $s_j$
- k(1+d) queries total
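One SuLQ k-means step can be sketched as follows. This is a minimal illustration under assumptions: `noisy_sum` stands in for a SuLQ query, the query budget is not tracked, and the guard that keeps the old center when the noisy size is small is an added safety choice, not something stated on the slide.

```python
import math
import random

def noisy_sum(values, R, rng):
    """Stand-in for a SuLQ query: exact sum plus N(0, R) noise (variance R)."""
    return sum(values) + rng.gauss(0.0, math.sqrt(R))

def nearest(point, centers):
    """Index of the center closest to `point` in Euclidean distance."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))

def sulq_kmeans_step(points, centers, R, rng):
    """Recompute all k centers using k(1 + d) noisy queries."""
    d = len(points[0])
    assign = [nearest(p, centers) for p in points]   # evaluated inside each query's f
    new_centers = []
    for j in range(len(centers)):
        # 1 query: noisy size s_j of cluster j.
        s_j = noisy_sum([1.0 if a == j else 0.0 for a in assign], R, rng)
        # d queries: noisy coordinate-wise sum of the points in cluster j.
        sums = [noisy_sum([p[c] if a == j else 0.0 for p, a in zip(points, assign)], R, rng)
                for c in range(d)]
        if s_j <= math.sqrt(R):
            # Assumed guard: the size estimate is within the noise, keep the old center.
            new_centers.append(list(centers[j]))
        else:
            new_centers.append([x / s_j for x in sums])
    return new_centers

rng = random.Random(0)
points = [(0.1, 0.1), (0.2, 0.2), (0.8, 0.8), (0.9, 0.9)]
print(sulq_kmeans_step(points, [(0.0, 0.0), (1.0, 1.0)], R=0.01, rng=rng))
```

Setting R to 0 recovers the ordinary k-means step, which makes the sketch easy to check against the left-hand column of the slide.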
Small Error!
For each $1 \le j \le k$, if $|S_j| \gg R^{1/2}$ then with high probability $\|\mu'_j - c'_j\|$ is $O\big((\|\mu_j\| + d^{1/2})\, R^{1/2}/|S_j|\big)$.
Inaccuracies:
- Estimating $|S_j|$
- Summing points in $S_j$
Even with just the first:
$(1/s_j - 1/|S_j|) \sum_{i \in S_j} d_i = (1/s_j - 1/|S_j|)(\mu_j |S_j|) = ((|S_j| - s_j)/s_j)\, \mu_j \approx (\text{noise}/\text{size})\, \mu_j$
Reducing Dimensionality
Reduce dimensionality in a dataset while retaining those characteristics that contribute most to its variance
Find Optimal Linear Projections
- Latent semantic indexing, spectral clustering, etc., employ best rank-k approximations to A
- Singular Value Decomposition uses top k eigenvectors of $A^T A$
- Principal Component Analysis uses top k eigenvectors of cov(A)
Approach
- Approximate $A^T A$ and cov(A) using SuLQ, then compute eigenvectors
Optimal Projections
$A^T A = \sum_i d_i^T d_i$
$\mu = (\sum_i d_i)/n$
$\text{cov}(A) = \sum_i (d_i - \mu)^T (d_i - \mu)$
SuLQ($f(i) = d_i^T d_i$) $= A^T A + N(0,R)^{d \times d}$
$\mu'$ = SuLQ($f(i) = d_i$)/n
SuLQ($f(i) = (d_i - \mu')^T (d_i - \mu')$)
$d^2$ and $d^2 + d$ queries, respectively
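The noisy covariance computation can be sketched in Python. Assumptions to note: `noisy_sum` stands in for a SuLQ query, the symmetrization step and the power iteration used to extract the top eigenvector are choices made for this sketch, and the centered products can be negative, so a faithful version would first rescale the query function into [0, 1].

```python
import math
import random

def noisy_sum(values, R, rng):
    """Stand-in for a SuLQ query: exact sum plus N(0, R) noise (variance R)."""
    return sum(values) + rng.gauss(0.0, math.sqrt(R))

def sulq_covariance(rows, R, rng):
    """Noisy sum_i (d_i - mu')^T (d_i - mu') using d + d^2 queries."""
    n, d = len(rows), len(rows[0])
    # d queries: noisy mean mu'.
    mu = [noisy_sum([r[a] for r in rows], R, rng) / n for a in range(d)]
    # d^2 queries: one per matrix entry.
    cov = [[noisy_sum([(r[a] - mu[a]) * (r[b] - mu[b]) for r in rows], R, rng)
            for b in range(d)] for a in range(d)]
    # Assumed post-processing: symmetrize, since independent noise breaks symmetry.
    return [[(cov[a][b] + cov[b][a]) / 2 for b in range(d)] for a in range(d)]

def top_eigenvector(M, iters=200):
    """Power iteration from the all-ones direction (assumed not orthogonal to the answer)."""
    d = len(M)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(M[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v

rng = random.Random(0)
rows = [(t / 10, t / 20) for t in range(11)]   # points on a line through the origin
print(top_eigenvector(sulq_covariance(rows, R=0.001, rng=rng)))
```

The eigenvectors are computed outside the database from the noisy matrix, so they cost no further queries.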
Perceptron [Rosenblatt 57]
Input: n points $p_1, \ldots, p_n$ in $[-1,1]^d$, and labels $b_1, \ldots, b_n$ in $\{-1,1\}$
- Assumed linearly separable, with a plane through the origin
Initialize w randomly
$\langle w, p \rangle\, b > 0$ iff label b agrees with the sign of $\langle w, p \rangle$
While there exists a labeled point $(p_i, b_i)$ s.t. $\langle w, p_i \rangle\, b_i \le 0$, set $w = w + p_i \cdot b_i$
Output: w
SuLQ Perceptron
Initialize $w_0 = 0^d$ and $s_0 = n$.
For j = 0, 1, 2, …, repeating so long as $s_j \gg R^{1/2}$:
- Count the misclassified rows (1 query): $s_j$ = SuLQ($f(d_i) := 1$ if $\langle d_i, w_j \rangle\, b_i \le 0$ and 0 otherwise)
- Synthesize a misclassified vector (d queries): $v_j$ = SuLQ($f(d_i) := b_i d_i$ if $\langle d_i, w_j \rangle \cdot b_i \le 0$ and 0 otherwise) / $s_j$
- Update w: set $w_{j+1} = w_j + v_j$
Return the final value of w.
SuLQ Perceptron
Initialize $w = 0^d$ and $s = n$.
Repeat while $s \gg R^{1/2}$:
- Count the misclassified rows (1 query): s = SuLQ($f(d_i) := 1$ if $\langle d_i, w \rangle\, b_i \le 0$ and 0 otherwise)
- Synthesize a misclassified vector (d queries): v = SuLQ($f(d_i) := b_i d_i$ if $\langle d_i, w \rangle \cdot b_i \le 0$ and 0 otherwise) / s
- Update w: set w = w + v
Return the final value of w.
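The loop above can be sketched in Python. Assumptions: `noisy_sum` stands in for a SuLQ query, the factor 4 is an arbitrary stand-in for ">>", the extra `s < 0.5` check only exists so the loop also stops in the noise-free case, and `max_rounds` is a safety cap that is not part of the slide.

```python
import math
import random

def noisy_sum(values, R, rng):
    """Stand-in for a SuLQ query: exact sum plus N(0, R) noise (variance R)."""
    return sum(values) + rng.gauss(0.0, math.sqrt(R))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sulq_perceptron(points, labels, R, rng, max_rounds=1000):
    d = len(points[0])
    w = [0.0] * d
    for _ in range(max_rounds):
        # The misclassification test is the predicate inside the query function f.
        wrong = [dot(p, w) * b <= 0 for p, b in zip(points, labels)]
        # 1 query: noisy count of misclassified rows.
        s = noisy_sum([1.0 if m else 0.0 for m in wrong], R, rng)
        # Stop once the count is no longer well above the noise level (assumed rule).
        if s <= 4 * math.sqrt(R) or s < 0.5:
            break
        # d queries: noisy sum of b_i * d_i over misclassified rows, divided by s.
        v = [noisy_sum([b * p[c] if m else 0.0
                        for p, b, m in zip(points, labels, wrong)], R, rng) / s
             for c in range(d)]
        w = [wi + vi for wi, vi in zip(w, v)]
    return w

points = [(1.0, 0.2), (0.8, -0.1), (-0.9, 0.3), (-0.7, -0.2)]
labels = [1, 1, -1, -1]
print(sulq_perceptron(points, labels, R=0.0, rng=random.Random(0)))
```

Each round costs 1 + d queries, so the number of rounds, not n, determines how much of the query budget T is consumed.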
How Many Rounds?
Theorem: If there exists a unit vector $w'$ and scalar $\delta$ such that for all i, $\langle w', d_i \rangle\, b_i \ge \delta$, and for all j, $\delta \gg (dR)^{1/2}/|S_j|$, then with high probability the algorithm terminates in at most $32 \max_i |d_i|^2/\delta^2$ rounds.
$|S_j|$ = number of misclassified vectors at iteration j
In each round j, $\langle w', w \rangle$ increases by more than $|w|$ does. Since $\langle w', w \rangle \le |w'| \cdot |w| = |w|$, this must stop; otherwise $\langle w', w \rangle$ would overtake $|w|$.
The Statistical Queries Learning Model [Kearns93]
Concept $c: \{0,1\}^d \to \{0,1\}$
Distribution D on $\{0,1\}^d$
STAT(c, D) Oracle
- Query: $(p, \tau)$ where $p: \{0,1\}^{d+1} \to \{0,1\}$ and $\tau = 1/\text{poly}(d)$
- Answer: $\Pr_{x \sim D}[p(x, c(x))] + \xi$ for $|\xi| \le \tau$
Capturing STAT
Each row contains a labeled example (x, c(x))
Input: predicate p and accuracy $\tau$
Initialize tally = 0.
Reduce variance: Repeat $t \ge R/(\tau n)^2$ times
  tally = tally + SuLQ($f(d_i) := p(d_i)$)
Output: tally / tn
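The simulation of a STAT query can be sketched as follows. Assumptions: `noisy_sum` stands in for a SuLQ query, the tolerance is called `tau`, and the repetition count is taken as t ≥ R/(τn)², which is the value that brings the standard deviation of the averaged answer down to at most τ.

```python
import math
import random

def noisy_sum(values, R, rng):
    """Stand-in for a SuLQ query: exact sum plus N(0, R) noise (variance R)."""
    return sum(values) + rng.gauss(0.0, math.sqrt(R))

def stat_via_sulq(rows, p, tau, R, rng):
    """Estimate Pr[p(x, c(x))] from rows of the form (x, c(x)).

    Returns the estimate and the number t of SuLQ queries spent on it.
    """
    n = len(rows)
    # tally has noise variance t*R, so tally/(t*n) has standard deviation
    # sqrt(R/t)/n, which is at most tau once t >= R/(tau*n)^2.
    t = max(1, math.ceil(R / (tau * n) ** 2))
    tally = 0.0
    for _ in range(t):
        tally += noisy_sum([p(x, label) for x, label in rows], R, rng)
    return tally / (t * n), t

rows = [((0,), 0), ((1,), 1), ((1,), 1), ((0,), 1)]

def label_is_one(x, label):
    return 1 if label == 1 else 0

print(stat_via_sulq(rows, label_is_one, tau=0.5, R=100.0, rng=random.Random(0)))
```

Each statistical query of accuracy τ costs t SuLQ queries, which is why the theorem that follows ties n, R, T, and the accuracies together.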
Capturing STAT
Theorem: For any algorithm that $\varepsilon$-learns a class C using at most q statistical queries of accuracy $\{\tau_1, \ldots, \tau_q\}$, the adapted algorithm can $\varepsilon$-learn C on a SuLQ database of n elements, provided that
$n^2 \ge \frac{R \log(q/\delta)}{T - q} \times \sum_{j \le q} 1/\tau_j^2$
Probabilistic Implication: Two SuLQ Databases
$\alpha$ implies $\beta$ in probability: $\Pr[\beta \mid \alpha] \ge \Pr[\beta] + \Delta$
Construct a tester for distinguishing $\Delta < \Delta_1$ from $\Delta > \Delta_2$ (for constants $\Delta_1 < \Delta_2$)
- Estimate $\Delta$ by binary search
In the analysis we consider deviations from an expected value, of magnitude Θ(√n)
- As perturbation << √n, it does not mask out these deviations
Results generalize to functions of attributes in two distinct SuLQ databases
Key Insight: Test for $\Pr[\beta \mid \alpha] \ge \Pr[\beta] + \Delta$
Assume T chosen so that noise = o(√n).
1. Find a "heavy" set S for $\alpha$: a subset of rows that have more than $|S|\,a + [a(1-a)|S|]^{1/2}$ ones in the $\alpha$ database. Here, $a = \Pr[\alpha]$ and $|S| = \Theta(n)$.
   Find S s.t. $a_{S,\alpha} > |S|\,a + \sqrt{|S|\,a(1-a)}$.
   Let excess $= a_{S,\alpha} - |S|\,a$. Note that excess is $\Omega(n^{1/2})$.
2. Query the SuLQ database for $\beta$ on S.
   If $a_{S,\beta} \ge |S| \Pr[\beta] + \text{excess} \cdot (\Delta/(1-a))$ then return 1, else return 0.
If $\Delta$ is constant then noise is too small to hide the correlation.
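The two-step test can be sketched in Python. Several things here are assumptions made for illustration: `noisy_count` stands in for a SuLQ query against each database, candidate sets S are drawn by including each row with probability 1/2, Pr[α] and Pr[β] are passed in as known values, and every failed candidate silently spends a query that a real run would have to charge against T.

```python
import math
import random

def noisy_count(bits, S, R, rng):
    """Stand-in for a SuLQ query: number of ones among rows in S plus N(0, R) noise."""
    return sum(bits[i] for i in S) + rng.gauss(0.0, math.sqrt(R))

def implication_test(alpha, beta, p_alpha, p_beta, delta, R, rng, max_tries=10000):
    """Return 1 if the data look like Pr[beta | alpha] >= Pr[beta] + delta, else 0."""
    n = len(alpha)
    for _ in range(max_tries):
        # Candidate S of size Theta(n): include each row independently with probability 1/2.
        S = [i for i in range(n) if rng.random() < 0.5]
        # Step 1: query the alpha database and check whether S is heavy.
        a_alpha = noisy_count(alpha, S, R, rng)
        heavy = len(S) * p_alpha + math.sqrt(len(S) * p_alpha * (1 - p_alpha))
        if a_alpha > heavy:
            excess = a_alpha - len(S) * p_alpha
            # Step 2: query the beta database on the same S.
            a_beta = noisy_count(beta, S, R, rng)
            threshold = len(S) * p_beta + excess * delta / (1 - p_alpha)
            return 1 if a_beta >= threshold else 0
    raise RuntimeError("no heavy set found")

n = 2000
alpha = [i % 2 for i in range(n)]     # Pr[alpha] = 1/2
beta_same = list(alpha)               # beta always agrees with alpha
print(implication_test(alpha, beta_same, 0.5, 0.5, 0.25, R=1.0, rng=random.Random(0)))
```

Only the index set S is shared between the two databases; neither side sees the other's attribute values.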
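Probabilistic Implication – The Tester
$\Pr[\beta \mid \alpha] \ge \Pr[\beta] + \Delta$
Distinguishing $\Delta < \Delta_1$ from $\Delta > \Delta_2$:
- Find a 'heavy' query $(S, \alpha)$ s.t. $a_{S,\alpha} > |S|\,p_\alpha + \sqrt{n\,p_\alpha(1-p_\alpha)}$
  Let bias $= a_{S,\alpha} - |S|\,p_\alpha$
- Issue query $(S, \beta)$
  If $a_{S,\beta} > \text{threshold}(\text{bias}, p_\alpha, \Delta_1)$, output 1
[Figure: distribution $\Pr[a_{S,\beta}]$ of the answer $a_{S,\beta}$ for a random S, for $\Delta < \Delta_1$, and for $\Delta > \Delta_2$; output regions 1 and 0.]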
Summary
SuLQ framework for privacy-preserving statistical databases
- Real-valued query functions
- Variance of the noise depends (roughly linearly) on the number of queries, not the size of the database
Examples of power of SuLQ calculus
Vertically Partitioned Databases
Sources
C. Dwork and K. Nissim, Privacy-Preserving Datamining on Vertically Partitioned Databases
A. Blum, C. Dwork, F. McSherry, and K. Nissim, Practical Privacy: The SuLQ Framework
See