University of Texas at El Paso

Presentation transcript:

University of Texas at El Paso Privacy in Statistical Databases Dr. Luc Longpré Computer Science Department DCFS 2006 UTEP Computer Science Dept.

Database with Confidential Information. Examples: census data, medical information. Privacy: protect the confidentiality of individuals. Usefulness: we want to derive meaningful statistics.

The Need for Privacy Safeguards. Available disk space per person: 0.02 MB in 1983; 28 MB in 1996; 472 MB in 2000; 2006: ?

The Need for Privacy Safeguards. Misuse of personal health information; misuse of financial information; misuse of identification information.

Approaches. Access control, encryption. Problem: these only fix who has access to what; they do not protect against disclosures based on inference. Problem: sometimes it is possible to derive confidential information from voluntarily released information.

Examples. Salary database. Query: what is the average salary of white male professors with 2 children who have lived in El Paso, Texas since 1994 and lived in Boston from 1987 to 1994?

Examples. 87% of the US population is unique under an ID made of: 5-digit ZIP code, gender, and date of birth.

Linking to Re-Identify Data. Medical database: ethnicity, visit date, diagnosis, procedure, medication, ZIP, birth date, gender. Voter list: name, address, date registered, ZIP, birth date, gender. The fields the two share (ZIP, birth date, gender) allow records to be linked.
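
The linking idea on this slide can be illustrated with a minimal sketch. Every record below is invented for illustration, and the names `QUASI_ID` and `link` are hypothetical; the sketch only assumes that both tables carry ZIP, birth date, and gender, as the slide lists.

```python
# Hypothetical records: every value below is invented for illustration.
medical = [
    {"zip": "79902", "birth_date": "1961-03-14", "gender": "F", "diagnosis": "asthma"},
    {"zip": "79968", "birth_date": "1975-11-02", "gender": "M", "diagnosis": "diabetes"},
    {"zip": "79968", "birth_date": "1980-07-21", "gender": "M", "diagnosis": "fracture"},
]
voters = [
    {"name": "A. Example", "zip": "79902", "birth_date": "1961-03-14", "gender": "F"},
    {"name": "B. Sample", "zip": "79968", "birth_date": "1975-11-02", "gender": "M"},
    {"name": "C. Placeholder", "zip": "79912", "birth_date": "1990-01-30", "gender": "F"},
]

# Fields present in both tables (taken from the slide).
QUASI_ID = ("zip", "birth_date", "gender")


def link(medical, voters):
    """Return (name, diagnosis) pairs whose quasi-identifier matches exactly one voter."""
    names_by_key = {}
    for v in voters:
        key = tuple(v[field] for field in QUASI_ID)
        names_by_key.setdefault(key, []).append(v["name"])

    matches = []
    for m in medical:
        key = tuple(m[field] for field in QUASI_ID)
        names = names_by_key.get(key, [])
        if len(names) == 1:  # a unique match re-identifies the medical record
            matches.append((names[0], m["diagnosis"]))
    return matches


reidentified = link(medical, voters)
```

In this toy data two of the three medical records are re-identified even though the medical table contains no names.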

Approaches to solutions. Sample size; k-anonymity; noise; query restrictions, static or dynamic (the general problem is NP-hard); cell suppression.

Previous Work (defining privacy). Denning (1982); medical data (Sweeney 1996); privacy-preserving data mining (since about 2000); privacy based on estimations (AS2000); interval computations (KL2003); game-theoretical setting (DN2003); blending in a crowd (CDMSW2005).

A simple case. Assume a 1-dimensional database (salary). Only allow queries of the type "number of records where salary < x?", where x is selected from a finite set. Asking all possible queries provides an interval for each salary.

Example. Salaries: $64,000, $80,000, $80,000, $90,000, $96,000, $122,000, $124,000, $144,000, $150,000, $150,000. Allow queries of type "# < $x" for x a multiple of $10,000.
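
As a rough sketch of how asking every allowed query yields an interval for each salary, the code below runs the counting query over the slide's ten salaries. The function names and the choice of thresholds from $10,000 to $160,000 are assumptions made for illustration; the slide only says that x is a multiple of $10,000.

```python
salaries = [64_000, 80_000, 80_000, 90_000, 96_000,
            122_000, 124_000, 144_000, 150_000, 150_000]

# Assumption: the finite set of allowed x values is 10,000, 20,000, ..., 160,000.
thresholds = range(10_000, 170_000, 10_000)


def count_below(x):
    """The only allowed query: number of records with salary < x."""
    return sum(1 for s in salaries if s < x)


def intervals_from_queries(thresholds):
    """Ask every allowed query and return (low, high, count) triples,
    meaning `count` salaries lie in the half-open range [low, high)."""
    result = []
    prev_x, prev_count = 0, 0  # assumption: salaries are non-negative
    for x in thresholds:
        c = count_below(x)
        if c > prev_count:
            result.append((prev_x, x, c - prev_count))
        prev_x, prev_count = x, c
    return result


intervals = intervals_from_queries(thresholds)
```

Differences between consecutive answers reveal how many salaries fall between two consecutive thresholds, which is all an attacker can learn from this query type alone.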

Perfect privacy. Perfect privacy is maintained if the answers to the queries do not allow anyone to narrow the interval for any given salary.

Example. Salaries: $64,000, $80,000, $80,000, $90,000, $96,000, $122,000, $124,000, $144,000, $150,000, $150,000. Queries: # < $100,000 is ...; # < $120,000 is ... Known: Leung's salary is < $120,000.

Proposition. For perfect privacy, there must be at least one salary between any 2 consecutive allowed queries. Result: asking all the possible queries, we get an interval for each salary.
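
One way to read the proposition is as a check on the set of allowed thresholds: if two consecutive thresholds have no salary between them, the two answers coincide, and anyone who knows a salary is below the higher threshold learns it is also below the lower one. The sketch below implements that check; the function name `empty_gaps` and the half-open reading [low, high) are assumptions, not part of the slides.

```python
salaries = [64_000, 80_000, 80_000, 90_000, 96_000,
            122_000, 124_000, 144_000, 150_000, 150_000]


def empty_gaps(salaries, thresholds):
    """Return pairs of consecutive thresholds (low, high) with no salary in [low, high).

    Under the reading assumed here, a non-empty result means the set of
    allowed queries violates the condition in the proposition.
    """
    ts = sorted(thresholds)
    return [(low, high) for low, high in zip(ts, ts[1:])
            if not any(low <= s < high for s in salaries)]


# The two queries from the preceding example: no salary lies in between.
violations = empty_gaps(salaries, [100_000, 120_000])

# Moving the second threshold to 130,000 puts two salaries in between.
no_violations = empty_gaps(salaries, [100_000, 130_000])
```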

Interval computations. How can we derive statistics from intervals instead of values? Problem: given n intervals for the values x1, …, xn, compute the intervals a and s of possible values for the average and the variance.

Interval computations. Average: computing the interval of possible values for the average is easy. Variance: computing the lower bound of the interval of possible values takes O(n^2) time; computing the upper bound is NP-hard.
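
A small sketch can make the contrast concrete. The average interval follows directly from the endpoints, while the variance upper bound below is found by exhaustive search over all 2^n endpoint combinations, which is exponential and only meant for tiny inputs. The sketch assumes population variance and relies on variance being a convex function, so its maximum over a box is attained at a corner; the function names are hypothetical.

```python
from itertools import product
from statistics import pvariance


def average_interval(intervals):
    """Interval of possible averages: average the lower ends and the upper ends."""
    n = len(intervals)
    low = sum(lo for lo, _ in intervals) / n
    high = sum(hi for _, hi in intervals) / n
    return (low, high)


def variance_upper_bound_bruteforce(intervals):
    """Largest population variance over all endpoint choices (2^n candidates).

    This is NOT an efficient algorithm; it simply tries every corner of the box.
    """
    return max(pvariance(point) for point in product(*intervals))


example = [(60_000, 70_000), (80_000, 90_000), (80_000, 90_000)]
avg_low, avg_high = average_interval(example)
var_high = variance_upper_bound_bruteforce(example)
```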

Upper bound (UB) for variance is NP-hard. Reduction from subset sum: given x1, …, xn, can we split them into two sets with the same sum? Take all intervals [-xi, xi]. The maximum variance occurs at interval extremities. The variance is (1/n) Σ xi^2 − E^2, where E is the average; the first term is the same at every choice of extremities, so we need to minimize |E|.
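
The reduction can be checked on small inputs with the following sketch. It enumerates every sign choice for the intervals [-xi, xi], so it is exponential and purely illustrative; the function names are hypothetical, the inputs are assumed to be integers, and population variance is assumed. The maximum variance equals (1/n) Σ xi^2 exactly when some sign choice gives a zero sum, that is, when the numbers can be split into two sets with the same sum.

```python
from fractions import Fraction
from itertools import product


def max_variance_symmetric(xs):
    """Max population variance over points with each coordinate in {-x_i, +x_i}."""
    n = len(xs)
    mean_of_squares = Fraction(sum(x * x for x in xs), n)  # same for every sign choice
    smallest_abs_sum = min(
        abs(sum(sign * x for sign, x in zip(signs, xs)))
        for signs in product((-1, 1), repeat=n)
    )
    smallest_abs_mean = Fraction(smallest_abs_sum, n)
    return mean_of_squares - smallest_abs_mean ** 2


def has_equal_sum_split(xs):
    """True iff xs can be split into two sets with the same sum."""
    return max_variance_symmetric(xs) == Fraction(sum(x * x for x in xs), len(xs))
```

Exact fractions are used so that the equality test is not affected by floating-point rounding.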

Upper bound (UB) for variance in a database. Restriction: all intervals are either disjoint or coincide. In this case, the upper bound can be computed in O(n^2) time.

Quantifying Information. Many definitions only describe whether or not privacy loss occurred. We need a formal model to measure loss of privacy. It could be measured in bits or as a percentage.

Kolmogorov Complexity. K(x): the size of the smallest program that can generate x. K(x/y): the complexity of x relative to y (the size of the smallest program that generates x given y). A way to measure quantity of information.

Kolmogorov Complexity to measure privacy loss? K(r): quantity of information in a record. K(r/s): quantity of information relative to the statistical release. Privacy loss for the record: K(r) – K(r/s). Maximize over records.

Problem with this definition. Suppose the released average salary happens to coincide with a record. Cannot measure fractions of bits. Subject to additive constants. Does provide an asymptotic upper bound.

Shannon entropy. Set of events E = {e1, e2, …, en}. Source S, which outputs event ei with probability pi. Entropy of S: H(S) = Σi pi log2(1/pi). A measure of the amount of information, in bits, contained in each output symbol generated by S.
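
The entropy formula can be written as a few lines of code. This is a minimal sketch; the function name `entropy` is an assumption, and events with probability zero are skipped, following the usual convention that they contribute nothing.

```python
from math import log2


def entropy(probs):
    """Shannon entropy H(S) = sum_i p_i * log2(1 / p_i), in bits per symbol."""
    return sum(p * log2(1 / p) for p in probs if p > 0)


fair_coin_bits = entropy([0.5, 0.5])
four_way_bits = entropy([0.25, 0.25, 0.25, 0.25])
certain_bits = entropy([1.0])
```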

Shannon's first theorem. Suppose one wants to encode n consecutive symbols output by S. Let Ln be the minimum expected number of bits of the encoding. Then n H(S) ≤ Ln ≤ n H(S) + 1.
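
The bound can be checked numerically under two assumptions that the slide does not state: the source is memoryless (symbols are independent), and Ln is taken to be the expected length of a binary Huffman code built over blocks of n symbols. The function names and the example distribution are hypothetical.

```python
import heapq
from itertools import product
from math import log2, prod


def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return sum(p * log2(1 / p) for p in probs if p > 0)


def huffman_expected_length(probs):
    """Expected codeword length, in bits, of a binary Huffman code."""
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged  # each merge adds one bit to every symbol under it
        heapq.heappush(heap, merged)
    return total


def block_distribution(probs, n):
    """Probabilities of all length-n blocks from a memoryless source."""
    return [prod(block) for block in product(probs, repeat=n)]


source = [0.7, 0.2, 0.1]  # invented example distribution
h = entropy(source)
bounds_hold = all(
    n * h - 1e-9
    <= huffman_expected_length(block_distribution(source, n))
    <= n * h + 1 + 1e-9
    for n in (1, 2, 3)
)
```

Dividing by n shows why longer blocks help: the expected cost per symbol is squeezed between H(S) and H(S) + 1/n.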

Defining privacy loss with entropy. Assume a database is generated according to some known probability distribution D. This induces a probability distribution on each record. A statistical release modifies that probability distribution. Privacy loss is H(r) – H'(r), the entropy of record r before the release minus its entropy after the release, maximized over all records.

Example. A 100-record database with a membership field (0: non-member, 1: member). If the average is 0, total loss (1 bit). If the average is 0.5, no loss. If the average is 0.25, a loss of about 0.2 bit. Expected loss is 0.008 bits.
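
The figures in this example can be approximated with a short sketch. It assumes a prior in which each record is independently 0 or 1 with probability 1/2 (so H(r) is 1 bit), and that after the average is released each record is 1 with probability equal to that average. These assumptions and the function names are not stated on the slide. Under them, the loss for an average of 0.25 is close to 0.19 bit, and the expected loss over all possible averages comes out at a few thousandths of a bit, the same order as the 0.008 bits quoted on the slide.

```python
from math import comb, log2


def binary_entropy(p):
    """Entropy in bits of a 0/1 value that is 1 with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))


def loss_given_average(avg):
    """Privacy loss H(r) - H'(r) for one record, assuming a fair-coin prior (1 bit)."""
    return 1.0 - binary_entropy(avg)


def expected_loss(n):
    """Average the loss over the number k of members, with k ~ Binomial(n, 1/2)."""
    return sum(comb(n, k) / 2 ** n * loss_given_average(k / n)
               for k in range(n + 1))


loss_at_zero = loss_given_average(0.0)
loss_at_half = loss_given_average(0.5)
loss_at_quarter = loss_given_average(0.25)
expected = expected_loss(100)
```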

Considerations. Some data is more sensitive than others. Example: bits in salary. Common knowledge, information from other databases. Could define entropy conditional on available information. Very impractical in applications. Some people know some of the records.

Properties of the definition. Privacy loss is non-additive. Depends on the prior distribution. Can model partial knowledge. Makes this less practical. A statistical release may actually cause a gain in privacy! Does not incorporate restrictions on computational resources.

Future work. Incorporate a data sensitivity measure. In a value, differentiate lower-order and higher-order bits. Some fields may have one-sided sensitivity.

Future work. Gauge the privacy loss of existing privacy-preserving algorithms. Use effective entropy (Yao 2002) to deal with computational resources. Incorporate privacy robustness.

Summary. The need for studying privacy in databases. Methods for preserving privacy. Interval computations. Definition of a measure of privacy loss based on entropy. Analysis of the definition and of notions not yet captured. Suggestions on how to improve this definition.