Helping Kinsey Compute
Cynthia Dwork, Microsoft Research


The Problem
Exploit data, e.g., a medical insurance database:
- Does smoking contribute to heart disease?
- Was there a rise in asthma emergency-room cases this month?
- What fraction of the admissions during 2004 were men aged 25-35?
...while preserving the privacy of individuals.

Holistic Statistics
- Is the dataset well clustered?
- What is the single best predictor for risk of stroke?
- How are attributes X and Y correlated; what is cov(X,Y)?
- Are the data inherently low-dimensional?

Statistical Database
- Database (D_1, ..., D_n)
- Query (f, S): f: row → [0,1], S ⊆ [n]
- Exact answer: Σ_{r ∈ S} f(row r)
- Returned answer: Σ_{r ∈ S} f(row r) + noise

Statistical Database
Under the control of the interlocutor:
- Noise generation
- Number of queries T permitted

Why Bother With Noise?
Limiting the interface to queries about large sets is insufficient:
A = {1, ..., n} and B = {2, ..., n}
Σ_{a ∈ A} f(row a) - Σ_{b ∈ B} f(row b) = f(row 1)
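
The differencing attack on this slide is easy to demonstrate. Below is a minimal sketch (not from the talk; the data, the helper names query_exact and query_noisy, and the noise level R are illustrative assumptions): with exact answers the two large-set queries isolate row 1, while Gaussian noise of the kind discussed later masks that difference.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
rows = rng.integers(0, 2, size=n)      # a single binary attribute per row
f = lambda r: r                        # query function f = identity

def query_exact(subset):
    """Exact answer: sum of f(row) over the subset."""
    return sum(f(rows[i]) for i in subset)

A = range(n)        # all rows
B = range(1, n)     # all rows except the first

# With exact answers, two "large set" queries reveal row 1 exactly.
print(query_exact(A) - query_exact(B), "==", rows[0])

# With Gaussian noise of variance R, the difference no longer pinpoints row 1.
R = 100.0
def query_noisy(subset):
    return query_exact(subset) + rng.normal(0.0, np.sqrt(R))

print(query_noisy(A) - query_noisy(B))   # dominated by noise, not by row 1
```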

Previous (Modern) Work in this Model
Dinur, Nissim [2003]: single binary attribute (query function f = identity)
- Non-privacy: w.h.p. the adversary guesses a 1 - ε fraction of the rows
- Theorem: polytime non-privacy if w.h.p. |noise| is o(√n)
- Theorem: privacy with o(√n) noise if #queries << n
Privacy “for free”! Rows ~ samples from an underlying distribution: Pr[row i = 1] = p
- E[# of 1’s] = pn, Var = Θ(n)
- Actual # of 1’s ~ pn ± Θ(√n)
- |Privacy-preserving noise| is o(sampling error)

Real Power in this Model
Dwork, Nissim [2004]: multiple binary attributes; q = (S, f), f: {0,1}^d → {0,1}
- Definition of privacy appropriate to the enriched query set
- Theorem: privacy with o(√n) noise if #queries << n
- Coined the term SuLQ
Vertically partitioned databases
- Learn joint statistics from independently operated SuLQ databases: given SuLQ A and SuLQ B, learn whether A implies B in probability (e.g., heart disease risk increases with smoking)
- Enables learning statistics for all Boolean functions of attributes

Still More Power [Blum, Dwork, McSherry, Nissim 05]
Extend the privacy proofs
- Real-valued functions f: [0,1]^d → [0,1]
- Per-row analysis: drop the dependence on n! How many queries has THIS row participated in? Our data, ourselves.
Holistic statistics: a calculus of noisy computation
- Beyond statistics: (not too) noisy versions of the k-means, perceptron, and ID3 algorithms
- (Not too) noisy optimal projections: SVD, PCA
- All of STAT learning

Towards Defining Privacy: “Facts of Life” vs Privacy Breach
- Diabetes is more likely in obese persons. This does not imply THIS obese person has or will have diabetes.
- Sneaker color preference is correlated with political party. This does not imply THIS person in red sneakers is a Republican.
- Half of all marriages result in divorce. This does not imply Pr[THIS marriage will fail] = 1/2.

(ε, T)-Privacy
Power of the adversary:
- Phase 0: Specify a goal function g: row → {0,1} (actually, a polynomial number of functions); the adversary will try to learn this information about someone.
- Phase 1: Adaptively make T queries.
- Phase 2: Choose a row i to attack; receive the entire database except for row i.
Privacy breach: occurs if the adversary’s “confidence” in g(row i) changes by ε.
Notes:
- The adversary chooses the goal.
- My privacy is preserved even if everybody else tells their secrets to the adversary.

Flavor of Privacy Proofs
Define the confidence in the value of g(row i):
- c_0 = log[p_0 / (1 - p_0)]
- 0 when p = 1/2; skyrockets as p moves toward 0 or 1
Model the evolution of confidence as a martingale:
- Argue that the expected difference at each step is small
- Compute an absolute upper bound on the difference
- Plug these two parameters into Azuma’s inequality
Obtain a probabilistic statement about the change in confidence, equivalently the change from prior to posterior probabilities about the value of g(row i).

Remainder of This Talk
- Description of the SuLQ algorithm and statement of the main theorem
- Examples: k-means; SVD, PCA; perceptron; STAT learning
- Vertically partitioned data: determining whether α ⇒ β in probability, i.e., Pr[β | α] ≥ Pr[β] + Δ, when α and β are in different SuLQ databases
- Summary

Azuma’s Inequality
Let s_1, ..., s_T be i.i.d. such that E[s_j] ≤ α and |s_j| ≤ β. Then
Pr[ |Σ_i s_i| > λ(α + β)T^{1/2} + Tα ] ≤ 2e^{-λ²/2}
We will take α = 1/2R and β = (2 log(T/δ)/R)^{1/2} + 1/2R².

The SuLQ Algorithm
Algorithm:
- Input: query (S ⊆ [n], f: [0,1]^d → [0,1])
- Output: Σ_{i ∈ S} f(row i) + N(0, R)
Theorem: For all δ, with probability at least 1 - δ, choosing R > 32 log(2/δ) log(T/δ) T/ε² ensures that for each (target, predicate) pair, after T queries the probability that the confidence has increased by more than ε is at most δ.
R is independent of n. Bigger n means better statistics.
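
To make the interface concrete, here is a minimal sketch of a SuLQ-style noisy-sum query in Python with NumPy. The function name sulq_query and the example parameters are illustrative assumptions, not from the paper; in practice R would be chosen according to the theorem above.

```python
import numpy as np

rng = np.random.default_rng(1)

def sulq_query(db, S, f, R):
    """Return sum_{i in S} f(db[i]) plus Gaussian noise N(0, R).

    db : array of shape (n, d), rows in [0,1]^d
    S  : iterable of row indices (a subset of [n])
    f  : map from a row in [0,1]^d to a value in [0,1]
    R  : noise variance
    """
    exact = sum(f(db[i]) for i in S)
    return exact + rng.normal(0.0, np.sqrt(R))

# Example: how many of the first 500 rows have attribute 0 above 0.5?
n, d = 1000, 5
db = rng.random((n, d))
R = 50.0
print(sulq_query(db, range(500), lambda row: float(row[0] > 0.5), R))
```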

k-Means Clustering
Used in physics, operations research, machine learning, data mining, etc.

SuLQ k-Means
- Estimate the size of each cluster.
- Estimate the average of the points in each cluster: estimate their sum, and divide the estimated sum by the estimated size.

Side by Side: k-Means and SuLQ k-Means
k-Means basic step:
- Input: data points p_1, ..., p_n and k ‘centers’ c_1, ..., c_k in [0,1]^d
- S_j = points for which c_j is the closest center
- Output: c'_j = average of the points in S_j, for j = 1, ..., k
SuLQ k-Means basic step:
- Input: data points p_1, ..., p_n and k ‘centers’ c_1, ..., c_k in [0,1]^d
- s_j = SuLQ( f(d_i) := 1 if j = argmin_j ||c_j - d_i||, 0 otherwise )
- μ'_j = SuLQ( f(d_i) := d_i if j = argmin_j ||c_j - d_i||, 0 otherwise ) / s_j
- k(1 + d) queries total
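
A minimal sketch of the SuLQ k-means step from the right-hand column, written with NumPy. The noisy_sum helper, the constants, and the synthetic data are illustrative assumptions; each cluster consumes 1 + d noisy queries per step, matching the k(1 + d) total above.

```python
import numpy as np

rng = np.random.default_rng(2)

def noisy_sum(values, R):
    """Sum per-row query values, adding N(0, R) noise to each output coordinate."""
    total = np.sum(values, axis=0)
    return total + rng.normal(0.0, np.sqrt(R), size=np.shape(total))

def sulq_kmeans_step(points, centers, R):
    """One SuLQ k-means update: noisy cluster sizes and noisy cluster sums."""
    k = len(centers)
    # For each point, index of the closest current center.
    assignments = np.argmin(
        np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2), axis=1)
    new_centers = []
    for j in range(k):
        indicator = (assignments == j)
        s_j = noisy_sum(indicator.astype(float), R)          # 1 query: cluster size
        sum_j = noisy_sum(points * indicator[:, None], R)    # d queries: cluster sum
        new_centers.append(sum_j / s_j)                      # estimated mean of cluster j
    return np.array(new_centers)

# Toy run; the analysis assumes each |S_j| >> sqrt(R), so s_j stays well above the noise.
n, d, k = 2000, 2, 3
points = rng.random((n, d))
centers = rng.random((k, d))
for _ in range(5):
    centers = sulq_kmeans_step(points, centers, R=100.0)
print(centers)
```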

Small Error!
For each 1 ≤ j ≤ k, if |S_j| >> R^{1/2} then with high probability ||μ'_j - c'_j|| is O( (||μ_j|| + d^{1/2}) R^{1/2} / |S_j| ).
Inaccuracies:
- Estimating |S_j|
- Summing the points in S_j
Even with just the first:
(1/s_j - 1/|S_j|) Σ_{i ∈ S_j} d_i = (1/s_j - 1/|S_j|)(μ_j |S_j|) = ((|S_j| - s_j)/s_j) μ_j ≈ (noise/size)·μ_j

Reducing Dimensionality
Reduce the dimensionality of a dataset while retaining those characteristics that contribute most to its variance.
Find optimal linear projections:
- Latent semantic indexing, spectral clustering, etc., employ best rank-k approximations to A
- Singular value decomposition uses the top k eigenvectors of A^T A
- Principal component analysis uses the top k eigenvectors of cov(A)
Approach:
- Approximate A^T A and cov(A) using SuLQ, then compute eigenvectors

Optimal Projections
A^T A = Σ_i d_i^T d_i
μ = (Σ_i d_i)/n
cov(A) = Σ_i (d_i - μ)^T (d_i - μ)
SuLQ( f(i) = d_i^T d_i ) = A^T A + N(0, R)^{d×d}
μ' = SuLQ( f(i) = d_i )/n
SuLQ( f(i) = (d_i - μ')^T (d_i - μ') )
d² and d² + d queries, respectively
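
A minimal NumPy sketch of this approach: perturb A^T A (and cov(A)) entrywise with Gaussian noise, then take the top-k eigenvectors. The helper names and parameter values are illustrative assumptions; each matrix entry corresponds to one noisy query.

```python
import numpy as np

rng = np.random.default_rng(3)

def noisy_gram(A, R):
    """Approximate A^T A with independent N(0, R) noise on each entry (d^2 queries)."""
    d = A.shape[1]
    return A.T @ A + rng.normal(0.0, np.sqrt(R), size=(d, d))

def noisy_cov(A, R):
    """Approximate cov(A): a noisy mean (d queries), then a noisy d x d sum (d^2 queries)."""
    n, d = A.shape
    mu = (A.sum(axis=0) + rng.normal(0.0, np.sqrt(R), size=d)) / n
    centered = A - mu
    return centered.T @ centered + rng.normal(0.0, np.sqrt(R), size=(d, d))

def top_k_eigenvectors(M, k):
    """Top-k eigenvectors of a (symmetrized) matrix, largest eigenvalues first."""
    sym = (M + M.T) / 2
    vals, vecs = np.linalg.eigh(sym)
    return vecs[:, np.argsort(vals)[::-1][:k]]

n, d, k = 5000, 10, 2
A = rng.random((n, d))
print(top_k_eigenvectors(noisy_gram(A, R=100.0), k))   # SVD-style projection
print(top_k_eigenvectors(noisy_cov(A, R=100.0), k))    # PCA-style projection
```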

Perceptron [Rosenblatt 57]
Input: n points p_1, ..., p_n in [-1,1]^d and labels b_1, ..., b_n in {-1,1}
- Assumed linearly separable, with a separating plane through the origin
Initialize w randomly. ⟨w, p⟩·b > 0 iff label b agrees with the sign of ⟨w, p⟩.
While ∃ a labeled point (p_i, b_i) s.t. ⟨w, p_i⟩·b_i ≤ 0, set w = w + p_i·b_i
Output: w

SuLQ Perceptron
Initialize w_0 = 0^d and s_0 = n. For j = 0, 1, 2, ..., repeating so long as s_j >> R^{1/2}:
- Count the misclassified rows (1 query): s_j = SuLQ( f(d_i) := 1 if ⟨d_i, w_j⟩·b_i ≤ 0, 0 otherwise )
- Synthesize a misclassified vector (d queries): v_j = SuLQ( f(d_i) := b_i·d_i if ⟨d_i, w_j⟩·b_i ≤ 0, 0 otherwise ) / s_j
- Update w: set w_{j+1} = w_j + v_j
Return the final value of w.

SuLQ Perceptron
Initialize w = 0^d and s = n. Repeat while s >> R^{1/2}:
- Count the misclassified rows (1 query): s = SuLQ( f(d_i) := 1 if ⟨d_i, w⟩·b_i ≤ 0, 0 otherwise )
- Synthesize a misclassified vector (d queries): v = SuLQ( f(d_i) := b_i·d_i if ⟨d_i, w⟩·b_i ≤ 0, 0 otherwise ) / s
- Update w: set w = w + v
Return the final value of w.
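
A minimal NumPy sketch of the SuLQ perceptron loop above. The noisy_sum helper, the stopping rule (halt once s is no longer comfortably above R^{1/2}), the round cap, and the synthetic data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def noisy_sum(values, R):
    """Sum per-row query values, adding N(0, R) noise to each output coordinate."""
    total = np.sum(values, axis=0)
    return total + rng.normal(0.0, np.sqrt(R), size=np.shape(total))

def sulq_perceptron(points, labels, R, max_rounds=200):
    n, d = points.shape
    w = np.zeros(d)
    for _ in range(max_rounds):
        misclassified = (points @ w) * labels <= 0
        s = noisy_sum(misclassified.astype(float), R)              # 1 query: count
        if s <= np.sqrt(R):                                        # stop once s is not >> R^{1/2}
            break
        v = noisy_sum((labels * misclassified)[:, None] * points, R) / s   # d queries
        w = w + v
    return w

# Linearly separable data with a plane through the origin.
n, d = 5000, 3
w_true = np.array([1.0, -2.0, 0.5])
points = rng.uniform(-1, 1, size=(n, d))
labels = np.sign(points @ w_true)
labels = np.where(labels == 0, 1.0, labels)   # avoid zero labels on the boundary
w = sulq_perceptron(points, labels, R=100.0)
print(w, "vs true direction", w_true / np.linalg.norm(w_true))
```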

How Many Rounds?
Theorem: If there exists a unit vector w' and a scalar γ such that for all i, ⟨w', d_i⟩·b_i ≥ γ, and for all j, γ >> (dR)^{1/2}/|S_j|, then with high probability the algorithm terminates in at most 32 max_i |d_i|²/γ² rounds.
|S_j| = number of misclassified vectors at iteration j.
In each round j, ⟨w', w⟩ increases by more than |w| does. Since ⟨w', w⟩ ≤ |w'|·|w| = |w|, this must stop; otherwise ⟨w', w⟩ would overtake |w|.

The Statistical Queries Learning Model [Kearns 93]
Concept c: {0,1}^d → {0,1}; distribution D on {0,1}^d
STAT(c, D) oracle:
- Query: (p, τ) where p: {0,1}^{d+1} → {0,1} and τ = 1/poly(d)
- Answer: Pr_{x ~ D}[p(x, c(x))] + η for |η| ≤ τ

Capturing STAT
Each row contains a labeled example (x, c(x)).
Input: predicate p and accuracy τ
Initialize tally = 0.
Reduce variance: repeat t ≥ R/(τn)² times:
- tally = tally + SuLQ( f(d_i) := p(d_i) )
Output: tally / (t·n)
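
A minimal sketch of this variance-reduction loop in NumPy. The noisy_sum helper and the toy concept are illustrative assumptions, and the repetition count follows the t ≥ R/(τn)² rule as reconstructed above.

```python
import numpy as np

rng = np.random.default_rng(5)

def noisy_sum(values, R):
    """Sum of per-row query values plus N(0, R) noise."""
    return float(np.sum(values) + rng.normal(0.0, np.sqrt(R)))

def stat_query(db, predicate, tau, R):
    """Emulate a STAT oracle query: estimate Pr_x[ p(x, c(x)) ] to within roughly tau."""
    n = len(db)
    t = max(1, int(np.ceil(R / (tau * n) ** 2)))   # enough repetitions to push noise below tau
    tally = 0.0
    for _ in range(t):
        tally += noisy_sum([float(predicate(row)) for row in db], R)
    return tally / (t * n)

# Rows are labeled examples (x, c(x)); the predicate asks whether the label is 1.
n, d = 10000, 4
xs = rng.integers(0, 2, size=(n, d))
cs = xs[:, 0] ^ xs[:, 1]                 # a toy concept: XOR of the first two bits
db = list(zip(xs, cs))
print(stat_query(db, lambda row: row[1] == 1, tau=0.01, R=100.0))
```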

Capturing STAT
Theorem: For any algorithm that ε-learns a class C using at most q statistical queries of accuracies {τ_1, ..., τ_q}, the adapted algorithm can ε-learn C on a SuLQ database of n elements, provided that n² ≥ R log(q/δ)/(T - q) × Σ_{j ≤ q} 1/τ_j².

Probabilistic Implication: Two SuLQ Databases
α implies β in probability: Pr[β | α] ≥ Pr[β] + Δ
Construct a tester for distinguishing Δ ≤ Δ₁ from Δ ≥ Δ₂ (for constants Δ₁ < Δ₂)
- Estimate Δ by binary search
In the analysis we consider deviations from an expected value of magnitude Θ(√n)
- Since the perturbation is << √n, it does not mask these deviations
Results generalize to functions α and β of attributes in two distinct SuLQ databases

Key Insight: Test for Pr[β | α] ≥ Pr[β] + Δ
Assume T is chosen so that the noise is o(√n).
1. Find a “heavy” set S for α: a subset of rows that have more than |S|·a + [a(1-a)|S|]^{1/2} ones in the α database. Here a = Pr[α] and |S| = Θ(n). That is, find S s.t. a_{S,α} > |S|·a + √(|S|·a(1-a)). Let excess_α = a_{S,α} - |S|·a; note that excess_α is Ω(√n).
2. Query the SuLQ database for β on S. If a_{S,β} ≥ |S|·Pr[β] + excess_α·(Δ/(1-a)) then return 1, else return 0.
If Δ is constant then the noise is too small to hide the correlation.
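
A rough NumPy sketch of the two-step tester, under the reconstruction above: search random large subsets for one that is heavy for α, then compare the noisy β-count on that subset against the shifted threshold. The random search, the subset size, the names, and the data are illustrative assumptions; a single heavy set gives only a noisy, constant-advantage signal, which would be amplified by repetition.

```python
import numpy as np

rng = np.random.default_rng(6)

def noisy_count(bits, S, R):
    """Noisy number of 1s among the rows of `bits` indexed by S."""
    return float(np.sum(bits[S]) + rng.normal(0.0, np.sqrt(R)))

def implication_tester(alpha, beta, delta1, R, trials=200):
    """Return 1 if the noisy counts look consistent with Pr[beta|alpha] >= Pr[beta] + delta."""
    n = len(alpha)
    a, b = alpha.mean(), beta.mean()          # stand-ins for Pr[alpha], Pr[beta]
    size = n // 2                             # |S| = Theta(n)
    for _ in range(trials):
        S = rng.choice(n, size=size, replace=False)
        count_alpha = noisy_count(alpha, S, R)
        excess = count_alpha - size * a
        if excess > np.sqrt(size * a * (1 - a)):          # S is "heavy" for alpha
            count_beta = noisy_count(beta, S, R)
            threshold = size * b + excess * (delta1 / (1 - a))
            return 1 if count_beta >= threshold else 0
    return 0   # no heavy set found in the allotted trials

# Correlated attributes: beta tends to be 1 when alpha is 1.
n = 20000
alpha = rng.integers(0, 2, size=n)
beta = np.where(alpha == 1, rng.random(n) < 0.7, rng.random(n) < 0.4).astype(int)
print(implication_tester(alpha, beta, delta1=0.1, R=100.0))
```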

Probabilistic Implication – The Tester
Test for Pr[β | α] ≥ Pr[β] + Δ. Distinguishing Δ ≤ Δ₁ from Δ ≥ Δ₂:
- Find a ‘heavy’ query (S, α) s.t. a_{S,α} > |S|·p_α + √(n·a(1-a)); let bias_α = a_{S,α} - |S|·p_α
- Issue query (S, β); if a_{S,β} > threshold(bias_α, p_β, Δ₁) output 1
[Figure: distributions of a_{S,β} over a random S under Δ < Δ₁ vs Δ > Δ₂, separated by roughly (Δ₂ - Δ₁)·Θ(√n).]

Summary
SuLQ framework for privacy-preserving statistical databases
- Real-valued query functions
- Noise variance depends (roughly linearly) on the number of queries, not the size of the database
Examples of the power of the SuLQ calculus
Vertically partitioned databases

Sources
- C. Dwork and K. Nissim, Privacy-Preserving Datamining on Vertically Partitioned Databases
- A. Blum, C. Dwork, F. McSherry, and K. Nissim, Practical Privacy: The SuLQ Framework