CS639: Data Management for Data Science

CS639: Data Management for Data Science Lecture 26: Privacy [slides from Vitaly Shmatikov] Theodoros Rekatsinas

Reading: Dwork, "Differential Privacy" (invited talk at ICALP 2006).

Basic Setting
[figure: interactive query model] A database DB = (x1, x2, …, xn) sits behind a sanitizer San, which uses random coins. Users (government, researchers, marketers, …) submit query 1, …, query T and receive answer 1, …, answer T from San.

Examples of Sanitization Methods
- Input perturbation: add random noise to the database, then release it
- Summary statistics: means, variances, marginal totals, regression coefficients
- Output perturbation: summary statistics with noise (a small sketch follows this slide)
- Interactive versions of the above methods: an auditor decides which queries are OK and what type of noise to add
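
As a rough illustration (my own sketch, not code from the lecture): output perturbation answers a query exactly and adds noise before releasing the answer, while input perturbation noises the records themselves. The function names and the noise scale are illustrative assumptions, not calibrated to any formal privacy definition.

import random

def true_mean(db):
    """Exact (non-private) answer to a mean query."""
    return sum(db) / len(db)

def perturbed_mean(db, noise_scale=1.0, rng=random):
    """Output perturbation: compute the exact answer, then add random
    noise before releasing it."""
    return true_mean(db) + rng.gauss(0.0, noise_scale)

def perturbed_database(db, noise_scale=1.0, rng=random):
    """Input perturbation: add noise to each record and release the
    noisy database itself."""
    return [x + rng.gauss(0.0, noise_scale) for x in db]

if __name__ == "__main__":
    db = [62, 70, 68, 74, 65, 71]   # toy records (e.g., heights)
    print("exact mean:    ", true_mean(db))
    print("perturbed mean:", perturbed_mean(db))
    print("noisy database:", perturbed_database(db))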

Strawman Definition
- Assume x1, …, xn are drawn i.i.d. from an unknown distribution
- Candidate definition: sanitization is safe if it only reveals the distribution
- Implied approach (see the sketch below): learn the distribution, then release a description of the distribution or re-sample points from it
- This definition is tautological! The estimate of the distribution depends on the data… why is it safe?
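
To make the "learn the distribution, then re-sample" strawman concrete, here is a minimal sketch (my own illustration, assuming a one-dimensional numeric attribute and a Gaussian model). The slide's objection still applies: the fitted parameters are themselves computed from the private data.

import random
import statistics

def fit_gaussian(db):
    """'Learn the distribution': estimate mean and std from the private data."""
    return statistics.mean(db), statistics.stdev(db)

def resample(db, n_release, rng=random):
    """Release re-sampled points from the fitted model instead of raw data."""
    mu, sigma = fit_gaussian(db)
    return [rng.gauss(mu, sigma) for _ in range(n_release)]

if __name__ == "__main__":
    private_db = [61.0, 70.2, 68.5, 74.1, 65.3, 71.8]
    print("fitted (mu, sigma):", fit_gaussian(private_db))   # already data-dependent
    print("released synthetic points:", resample(private_db, 5))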

Blending into a Crowd
- Intuition: "I am safe in a group of k or more"
  - Frequency in the DB, or frequency in the underlying population?
  - k varies (3… 6… 100… 10,000?)
- Many variations on this theme
- Adversary wants a predicate g such that 0 < #{i | g(xi) = true} < k (see the sketch after this list)
  - Why? Privacy is "protection from being brought to the attention of others" [Gavison]
  - A rare property helps re-identify someone
- Implicit assumption: information about a large group is public
  - E.g., liver problems are more prevalent among diabetics
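
A minimal sketch of this isolation check (my own illustration, not from the slides): given a candidate predicate g, count how many records satisfy it and flag it as isolating if the count is strictly between 0 and k.

def isolates(db, g, k):
    """Return True if predicate g picks out a non-empty group of fewer
    than k records, i.e. 0 < #{i : g(db[i]) is True} < k."""
    matches = sum(1 for record in db if g(record))
    return 0 < matches < k

if __name__ == "__main__":
    # toy records: (age, zip_code, has_rare_condition)
    db = [(34, "53703", False), (34, "53703", False), (71, "53711", True)]
    g = lambda r: r[2]            # "has the rare condition"
    print(isolates(db, g, k=3))   # True: one matching record, below the threshold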

Clustering-Based Definitions
- Given sanitization S, look at all databases consistent with S
  - Safe if no predicate is true for all consistent databases
- k-anonymity
  - Partition D into bins
  - Safe if each bin is either empty or contains at least k elements
- Cell bound methods: release marginal sums (a sketch of the bound computation follows this slide)

  Original table (hair color × eye color):

              brown   blue     Σ
    blond        2      10    12
    brown       12       6    18
    Σ           14      16

  What the released marginals alone allow an attacker to infer (each cell bounded by the smaller of its row and column totals):

              brown    blue     Σ
    blond   [0,12]   [0,12]    12
    brown   [0,14]   [0,16]    18
    Σ           14       16
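
As a rough sketch of the cell-bound idea (my own illustration; the slide itself only shows the tables): given released row and column totals, bound each interior cell. The simple bound used below is [0, min(row total, column total)], which matches the intervals on the slide; tighter Fréchet lower bounds would also use the grand total.

def cell_bounds(row_totals, col_totals):
    """For a contingency table where only the marginal sums are released,
    return, for every interior cell, the interval of values consistent
    with those marginals: [0, min(row_total, col_total)]."""
    return [[(0, min(r, c)) for c in col_totals] for r in row_totals]

if __name__ == "__main__":
    row_totals = [12, 18]      # blond, brown (hair)
    col_totals = [14, 16]      # brown, blue (eyes)
    for label, row in zip(["blond", "brown"], cell_bounds(row_totals, col_totals)):
        print(label, row)
    # blond [(0, 12), (0, 12)]
    # brown [(0, 14), (0, 16)]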

Issues with Clustering
- Purely syntactic definition of privacy
- What adversary does this apply to?
  - Does not consider adversaries with side information
  - Does not consider probability
  - Does not consider the adversarial algorithm for making decisions (inference)

“Bayesian” Adversaries
- Adversary outputs a point z ∈ D
- Score = 1/fz if fz > 0, and 0 otherwise, where fz is the number of matching points in D
- Sanitization is safe if E(score) ≤ ε
- Procedure (a small sketch follows this slide):
  - Assume you know the adversary’s prior distribution over databases
  - Given a candidate output, update the prior conditioned on the output (via Bayes’ rule)
  - If max over z of E(score | output) < ε, then it is safe to release
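
A minimal sketch of the adversary’s expected score under a posterior over databases (my own illustration; the posterior is a hypothetical list of (database, probability) pairs): for each candidate point z, average 1/fz over databases, then compare the worst case to the threshold ε.

def score(db, z):
    """Adversary's score for guessing point z: 1/fz if fz > 0, else 0,
    where fz is the number of records in db equal to z."""
    fz = db.count(z)
    return 1.0 / fz if fz > 0 else 0.0

def worst_case_expected_score(posterior, candidates):
    """max over candidate points z of E[score | output], where the
    posterior is given as (database, probability) pairs."""
    return max(sum(p * score(db, z) for db, p in posterior) for z in candidates)

if __name__ == "__main__":
    # hypothetical posterior after seeing some sanitized output
    posterior = [(["a", "a", "b"], 0.6), (["a", "b", "b"], 0.4)]
    eps = 0.75
    m = worst_case_expected_score(posterior, candidates=["a", "b", "c"])
    print(m, "safe to release" if m < eps else "not safe")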

Issues with “Bayesian” Privacy
- Restricts the type of predicates the adversary can choose
- Must know the prior distribution
  - Can one scheme work for many distributions?
  - The sanitizer works harder than the adversary
- Conditional probabilities don’t consider previous iterations
  - Remember simulatable auditing?

Classical Intuition for Privacy
- “If the release of statistics S makes it possible to determine the value [of private information] more accurately than is possible without access to S, a disclosure has taken place.” [Dalenius 1977]
  - Privacy means that anything that can be learned about a respondent from the statistical database can be learned without access to the database
- Similar to semantic security of encryption
  - Anything about the plaintext that can be learned from a ciphertext can be learned without the ciphertext

Problems with Classic Intuition
- Popular interpretation: prior and posterior views about an individual shouldn’t change “too much”
  - What if my (incorrect) prior is that every UTCS graduate student has three arms?
- How much is “too much”?
  - Can’t achieve cryptographically small levels of disclosure and keep the data useful
  - The adversarial user is supposed to learn unpredictable things about the database

Impossibility Result [Dwork]
- Privacy: for some definition of “privacy breach,”
  ∀ distribution on databases, ∀ adversaries A, ∃ A’ such that
  Pr(A(San(DB)) = breach) − Pr(A’() = breach) ≤ ε
- For any reasonable “breach”, if San(DB) contains information about DB, then some adversary breaks this definition
- Example
  - Paris knows that Theo is 2 inches taller than the average Greek
  - The DB allows computing the average height of a Greek
  - This DB breaks Theo’s privacy according to this definition… even if his record is not in the database!

(Very Informal) Proof Sketch
- Suppose DB is uniformly random
  - Entropy: I( DB ; San(DB) ) > 0
- “Breach” is predicting a predicate g(DB)
- Adversary knows r and H(r ; San(DB)) ⊕ g(DB)
  - H is a suitable hash function, r = H(DB)
  - By itself, this does not leak anything about DB (why?)
  - Together with San(DB), it reveals g(DB) (why?)

Differential Privacy (1)
[figure: the interactive query model again, with an explicit adversary A querying San over DB = (x1, …, xn) using random coins]
- Example with the Greeks and Theo
  - The adversary learns Theo’s height even if he is not in the database
- Intuition: “Whatever is learned would be learned regardless of whether or not Theo participates”
  - Dual: whatever is already known, the situation won’t get worse

Differential Privacy (2)
[figure: adversary A interacting with San over DB = (x1, …, xn)]
- Define n+1 games
  - Game 0: the adversary interacts with San(DB)
  - Game i: the adversary interacts with San(DB-i), where DB-i = (x1, …, xi-1, 0, xi+1, …, xn)
- Given a transcript S and a prior p(·) on DB, define n+1 posterior distributions p0(·|S), …, pn(·|S)

Differential Privacy (3)
[figure: adversary A interacting with San over DB = (x1, …, xn)]
Definition: San is safe if
  ∀ prior distributions p(·) on DB, ∀ transcripts S, ∀ i = 1, …, n:
  StatDiff( p0(·|S), pi(·|S) ) ≤ ε

Indistinguishability
[figure: two runs of the interactive protocol, one on DB = (x1, x2, x3, …, xn) producing transcript S and one on DB’ = (x1, x2, y3, …, xn) producing transcript S’; the two databases differ in one row, and the distance between the transcript distributions is at most ε]

Which Distance to Use?
- Problem: ε must be large
  - Any two databases induce transcripts at distance ≤ nε
  - To get utility, need ε > 1/n
- Statistical difference 1/n is not meaningful!
- Example: release a random point in the database (see the simulation sketch below)
  - San(x1, …, xn) = ( j, xj ) for a random j
  - For every i, changing xi induces statistical difference 1/n
  - But some xi is revealed with probability 1
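
A small simulation of the slide’s example (my own sketch): the “release a random record” mechanism has statistical difference only 1/n between neighboring databases, yet every run publishes one record exactly.

import random

def release_random_point(db, rng=random):
    """San(x1, ..., xn) = (j, xj) for a uniformly random index j."""
    j = rng.randrange(len(db))
    return j, db[j]

def statistical_difference(db1, db2):
    """Total variation distance between the output distributions of
    release_random_point on db1 and db2 (computed exactly: outcome
    (j, x) has probability 1/n when x == db[j], and 0 otherwise)."""
    n = len(db1)
    diff = 0.0
    for j in range(n):
        p1 = {(j, db1[j]): 1.0 / n}
        p2 = {(j, db2[j]): 1.0 / n}
        for outcome in set(p1) | set(p2):
            diff += abs(p1.get(outcome, 0.0) - p2.get(outcome, 0.0))
    return diff / 2.0

if __name__ == "__main__":
    db  = [10, 20, 30, 40]
    db2 = [10, 20, 99, 40]                   # neighboring database: row 3 changed
    print(statistical_difference(db, db2))   # 1/n = 0.25
    print(release_random_point(db))          # but one record is fully revealed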

Formalizing Indistinguishability
[figure: adversary A cannot tell whether a transcript came from DB or from DB’]
Definition: San is ε-indistinguishable if
  ∀ A, ∀ DB, DB’ which differ in 1 row, ∀ sets of transcripts S:
  p( San(DB) ∈ S ) ∈ (1 ± ε) · p( San(DB’) ∈ S )
Equivalently, ∀ transcripts S:
  p( San(DB) = S ) / p( San(DB’) = S ) ∈ 1 ± ε
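
As a concrete check of this definition (my own sketch, using randomized response rather than anything from the slides): for a one-row, one-bit database and a mechanism that reports the bit truthfully with probability p, the ratio p(San(DB)=S)/p(San(DB’)=S) can be computed exactly for the two neighboring databases.

def randomized_response_dist(bit, p_truth):
    """Output distribution of one-bit randomized response: report the
    true bit with probability p_truth, the flipped bit otherwise."""
    return {bit: p_truth, 1 - bit: 1.0 - p_truth}

def worst_case_ratio(p_truth):
    """max over transcripts S of p(San(DB)=S) / p(San(DB')=S) for the
    two neighboring one-row databases DB=[0] and DB'=[1]."""
    d0 = randomized_response_dist(0, p_truth)
    d1 = randomized_response_dist(1, p_truth)
    return max(d0[s] / d1[s] for s in (0, 1))

if __name__ == "__main__":
    p = 0.55
    r = worst_case_ratio(p)           # = p / (1 - p)
    eps = r - 1.0                     # smallest eps with ratio <= 1 + eps
    print(f"worst-case ratio {r:.3f}; indistinguishable for eps >= {eps:.3f}")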

Indistinguishability  Diff. Privacy Definition: San is safe if  prior distributions p(¢) on DB,  transcripts S,  i =1,…,n StatDiff( p0(¢|S) , pi(¢|S) ) ≤  For every S and DB, indistinguishability implies This implies StatDiff( p0(¢|S) , pi(¢| S) ) ≤ 

Sensitivity with Laplace Noise
- Global sensitivity of a query f: Δf = max over databases DB, DB’ differing in one row of |f(DB) − f(DB’)|
- Laplace mechanism: answer f(DB) + Lap(Δf / ε), i.e., add noise drawn from the Laplace distribution with scale Δf / ε
- This gives ε-differential privacy: changing one row moves f by at most Δf, which changes the density of any output by a factor of at most e^ε
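
A minimal sketch of the Laplace mechanism for a counting query (my own illustration, not code from the lecture): a count has sensitivity 1, so adding Laplace noise with scale 1/ε suffices.

import random

def laplace_noise(scale, rng=random):
    """Sample from Laplace(0, scale): the difference of two independent
    exponentials with mean `scale` is Laplace-distributed."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(db, predicate, eps):
    """Laplace mechanism for a counting query: a count changes by at most
    1 when one row changes (sensitivity 1), so Lap(1/eps) noise gives
    eps-differential privacy."""
    true_answer = sum(1 for x in db if predicate(x))
    return true_answer + laplace_noise(1.0 / eps)

if __name__ == "__main__":
    db = [52, 61, 70, 35, 44, 68, 73]           # toy ages
    print(private_count(db, lambda age: age >= 65, eps=0.5))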

Differential Privacy: Summary
San gives ε-differential privacy if for all values of DB and Me and all transcripts t:
  Pr[ San(DB − Me) = t ] / Pr[ San(DB + Me) = t ] ≤ e^ε ≈ 1 + ε