Differential Privacy (1)


Outline: Background, Definition

Background
A classical research problem for statistical databases, studied for decades: preventing query inferences, where malicious users submit multiple queries to infer private information about some person.
Two settings:
- Interactive: database queries
- Non-interactive: publish statistics then destroy the data; micro-data publishing

Background: Database Privacy
Alice, Bob, ..., You → collection and "sanitization" → users (government, researchers, marketers, ...)
The "census problem": two conflicting goals
- Utility: users can extract "global" statistics
- Privacy: individual information stays hidden
How can these be formalized?
Speaker notes: This talk is about database privacy. The term can mean many things, but for this talk the example to keep in mind is a government census: individuals provide information to a trusted government agency, which processes the information and makes some sanitized version of it available for public use. Privacy is required by law, it is ethical, and it is pragmatic (people won't answer unless they trust you). There are two goals: we want users to be able to extract global statistics about the population being studied, but for legal, ethical, and pragmatic reasons we also want to protect the privacy of the individuals who participate. So there is a fundamental tradeoff between privacy on one hand and utility on the other. The extremes are easy: publishing nothing at all provides complete privacy but no utility, while publishing the raw data provides the most utility but no privacy. The first-order goal is therefore to plot a middle course between the extremes, that is, to find a compromise that allows users to obtain useful information while also providing a meaningful guarantee of privacy. This problem is not new; it is often called the "statistical database" problem. A second-order goal is to change the way the problem is approached and treated in the literature. Utility is easy to understand and to explain to a user: to show that a scheme provides a particular utility, just give an algorithm and an analysis. Privacy is much harder to get a handle on.

Database Privacy
Variations on this model have been studied in statistics, data mining, theoretical CS, and cryptography, with different traditions for what "privacy" means.

Two types of privacy protection methods:
- Data sanitization
- Anonymization

Sanitization approaches
- Input perturbation: add noise to the data, or generalize the data
- Output perturbation: add noise to summary statistics (count, sum, max, min; means, variances; marginal totals; model parameters)

Blending/hiding in a crowd
- k-anonymity, l-diversity, and related approaches
- An adversary may have various kinds of background knowledge with which to breach privacy
- These privacy models often assume that "the adversary's background knowledge is given", which is impractical

Classic intuition for privacy
Privacy means that anything that can be learned about a respondent from the statistical database can be learned without access to the database.
- A very strong definition, formulated by T. Dalenius (1977)
- Analogous to the semantic security of encryption: anything about the plaintext that can be learned from a ciphertext can be learned without the ciphertext.

Impossibility result
The Dalenius definition cannot be achieved when the adversary has auxiliary information.
Example: if I know that Alice's height is 2 inches above the average American's height, then by looking up the average in the census database I can compute Alice's exact height, and her privacy is breached even though her record need not be in the database.
We therefore need to revise the privacy definition.

Differential Privacy
The risk to my privacy should not substantially increase as a result of participating in a statistical database: with or without my record in the database, my privacy risk should not change much.
(In contrast, the Dalenius definition requires that using the database not increase my privacy risk at all, even when the database does not include my record.)

Definition
Mechanism: K(x) = f(x) + D, where f is the query function and D is random noise. This is an output perturbation method.
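The formal condition itself is not reproduced in the transcript; for reference, the standard statement of ε-differential privacy that the later proof sketch verifies is:

```latex
% A randomized mechanism K gives \epsilon-differential privacy if, for all
% datasets x and x' differing in at most one record and all sets S of outputs,
\Pr[K(x) \in S] \;\le\; e^{\epsilon} \cdot \Pr[K(x') \in S]
```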

Sensitivity function
How should the noise D be designed? It is linked back to the function f(x): the sensitivity of f captures how great a difference the additive noise must hide.
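The formula on the slide is not in the transcript; the standard definition of (global, L1) sensitivity it refers to is:

```latex
\Delta f \;=\; \max_{x,\,x'\ \text{differing in one record}} \; \big\| f(x) - f(x') \big\|_{1}
```

For example, a counting query ("how many records satisfy predicate P?") has Δf = 1, since adding or removing one record changes the count by at most 1.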

Laplace distribution noise
Use the Laplace distribution to generate the noise D.
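For reference (the density formula is not in the transcript), the Laplace distribution with scale b, and the scale the mechanism uses:

```latex
\mathrm{Lap}(b):\quad p(z) \;=\; \frac{1}{2b}\,\exp\!\left(-\frac{|z|}{b}\right),
\qquad\text{with } b = \frac{\Delta f}{\epsilon}
```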

Similar to Gaussian noise

Adding Laplace noise: why does this work?

Proof sketch
Let K(x) = f(x) + D = r, so r - f(x) follows a Laplace distribution with scale Δf/ε. Similarly, K(x') = f(x') + D = r, and r - f(x') has the same distribution. Then
P(K(x) = r)  ∝ exp(-|f(x) - r| · ε/Δf)
P(K(x') = r) ∝ exp(-|f(x') - r| · ε/Δf)
P(K(x) = r) / P(K(x') = r) = exp((|f(x') - r| - |f(x) - r|) · ε/Δf)
  ≤ exp(|f(x') - f(x)| · ε/Δf)   (triangle inequality)
  ≤ exp(ε)   (since |f(x') - f(x)| ≤ Δf by definition of sensitivity)
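As a concrete illustration, here is a minimal Python sketch of the Laplace mechanism described above; the function name and the toy database are my own, not from the slides:

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Return f(x) plus Laplace noise with scale sensitivity / epsilon."""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon
    return true_answer + rng.laplace(loc=0.0, scale=scale)

# Example: a counting query ("how many patients have cancer?") has sensitivity 1.
records = np.array([0, 1, 1, 0, 1, 0, 0, 1])   # toy binary database
true_count = int(records.sum())
noisy_count = laplace_mechanism(true_count, sensitivity=1, epsilon=0.5)
print(true_count, noisy_count)
```

Each repeated query consumes additional privacy budget; the composition slides at the end of this deck quantify how budgets combine.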

[Figure slides: plots of Laplace noise samples. First for Δf = 1 with ε = 0.01, 0.1, 1, 2, and 10, then for Δf = 2, 3, and 10000 with ε varying. Since the noise scale is Δf/ε, larger ε or smaller Δf concentrates the samples near zero, while smaller ε or larger Δf spreads them out.]
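A short sketch (my own, not from the slides) that regenerates the kind of noise samples shown in those figures, for a few (Δf, ε) settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# The Laplace scale is delta_f / epsilon: noise grows with the sensitivity
# and shrinks as epsilon (the privacy budget) increases.
for delta_f in (1, 2, 3, 10000):
    for epsilon in (0.01, 0.1, 1, 2, 10):
        samples = rng.laplace(scale=delta_f / epsilon, size=1000)
        print(f"delta_f={delta_f:<6} eps={epsilon:<5} "
              f"empirical std of noise = {samples.std():.2f}")
```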

Extended definition
Let A △ B denote the non-shared part of datasets A and B, i.e., the records on which they differ. The previous definition is the special case where A and B differ in only one record. The extended definition covers a group of persons being included in, or excluded from, a dataset.
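In the standard form (supplied here, since the slide's formula is not in the transcript), the extended condition reads:

```latex
% For all output sets S, with A \triangle B the non-shared records:
\Pr[K(A) \in S] \;\le\; e^{\epsilon\,|A \triangle B|} \cdot \Pr[K(B) \in S]
```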

Differential privacy under transformations

Composition (in the PINQ paper): sequential composition
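The statement itself is not in the transcript; the sequential composition result from the PINQ paper is:

```latex
% Sequential composition: running \epsilon_1-, \ldots, \epsilon_k-differentially
% private mechanisms (possibly adaptively) on the same dataset is
\left(\textstyle\sum_{i=1}^{k} \epsilon_i\right)\text{-differentially private.}
```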

Parallel composition
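Likewise supplied from the standard PINQ statement: when the mechanisms are applied to disjoint subsets of the data, the budgets do not add up, and the combined release is bounded by the largest one:

```latex
% Parallel composition: \epsilon_i-differentially private mechanisms applied to
% disjoint subsets of the dataset together are
\left(\max_{i} \epsilon_i\right)\text{-differentially private.}
```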