University of Texas at El Paso

Presentation on theme: "University of Texas at El Paso"— Presentation transcript:

1 University of Texas at El Paso
Privacy in Statistical Databases. Dr. Luc Longpré, Computer Science Department. DCFS 2006. UTEP Computer Science Dept.

2 Database with Confidential Information
Examples: census data, medical information. Privacy: protect the confidentiality of individuals. Usefulness: want to derive meaningful statistics.

3 The Need for Privacy Safeguards
Available disk space per person: 1983: 0.02 MB; 1996: 28 MB; 2000: 472 MB; 2006: ?

4 The Need for Privacy Safeguards
Misuse of personal health information. Misuse of financial information. Misuse of identification information.

5 Approaches
Access control, encryption. Problem: this only fixes who has access to what; it does not protect against disclosures based on inference. Problem: sometimes it may be possible to derive confidential information from voluntarily released information.

6 Examples
Salary database. Query: what is the average salary of white male professors with 2 children who have lived in El Paso, Texas since 1994 and lived in Boston from 1987 to 1994?
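
The risk in such a query can be sketched in a few lines of Python. The records, field names, and the `average_salary` helper below are all hypothetical, invented for illustration. The point is only that a sufficiently specific "average" query may match a single person, in which case the average is that person's salary.

```python
from statistics import mean

# Hypothetical records; every name and value is invented for illustration.
records = [
    {"name": "A", "gender": "M", "children": 2, "city": "El Paso", "salary": 90_000},
    {"name": "B", "gender": "M", "children": 0, "city": "El Paso", "salary": 80_000},
    {"name": "C", "gender": "F", "children": 2, "city": "El Paso", "salary": 96_000},
    {"name": "D", "gender": "M", "children": 2, "city": "Boston", "salary": 122_000},
]

def average_salary(rows, **conditions):
    """Return (average salary, number of matching rows) for an equality query."""
    matching = [r for r in rows if all(r[k] == v for k, v in conditions.items())]
    if not matching:
        return None, 0
    return mean(r["salary"] for r in matching), len(matching)

# Only record "A" satisfies all three conditions, so the "average"
# returned here is exactly A's salary.
avg, size = average_salary(records, gender="M", children=2, city="El Paso")
```

A query system that refuses to answer when the matching set is very small would block this particular query, though not inference from combinations of larger queries.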

7 Examples
87% of the US population is uniquely identified by the combination of 5-digit ZIP code, gender, and date of birth.

8 Linking to Re-Identify Data
Medical database: ethnicity, visit date, diagnosis, procedure, medication, ZIP, birth date, gender. Voter list: name, address, date registered, ZIP, birth date, gender. The fields common to both (ZIP, birth date, gender) allow the two to be linked.
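
A minimal sketch of such a linking step, assuming both data sets are available as lists of dictionaries. All rows, names, and the `link` helper are hypothetical. A medical row is re-identified only when exactly one voter shares its ZIP, birth date, and gender.

```python
# Fields present in both data sets.
QUASI_ID = ("zip", "birth_date", "gender")

# Hypothetical rows, invented for illustration.
medical = [
    {"zip": "79968", "birth_date": "1960-01-15", "gender": "F", "diagnosis": "asthma"},
    {"zip": "79968", "birth_date": "1971-07-04", "gender": "M", "diagnosis": "diabetes"},
]
voters = [
    {"name": "Jane Doe", "zip": "79968", "birth_date": "1960-01-15", "gender": "F"},
    {"name": "John Roe", "zip": "79968", "birth_date": "1971-07-04", "gender": "M"},
    {"name": "Jim Poe", "zip": "79968", "birth_date": "1971-07-04", "gender": "M"},
]

def link(medical_rows, voter_rows):
    """Return (name, diagnosis) pairs for medical rows matching exactly one voter."""
    by_key = {}
    for v in voter_rows:
        by_key.setdefault(tuple(v[k] for k in QUASI_ID), []).append(v)
    matches = []
    for m in medical_rows:
        candidates = by_key.get(tuple(m[k] for k in QUASI_ID), [])
        if len(candidates) == 1:  # unique match: the record is re-identified
            matches.append((candidates[0]["name"], m["diagnosis"]))
    return matches
```

In this toy data the second medical row stays ambiguous because two voters share its key, which is the intuition behind k-anonymity.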

9 Approaches to solutions
Sample size; k-anonymity; noise; query restrictions, static or dynamic (general problem NP-hard); cell suppression.

10 Previous Work (defining privacy)
Denning (1982); medical data (Sweeney 1996); privacy-preserving data mining (since about 2000); privacy based on estimations (AS2000); interval computations (KL2003); game-theoretical setting (DN2003); blending in a crowd (CDMSW2005).

11 A simple case
Assume a 1-dimensional database (salary). Only allow queries of the type "number of records where salary < x?", where x is selected from a finite set. Asking all possible queries provides an interval for each salary.

12 Example
Salaries: $64,000, $80,000, $80,000, $90,000, $96,000, $122,000, $124,000, $144,000, $150,000, $150,000. Allow queries of type "# < $x" for x a multiple of $10,000.
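
The following sketch shows how the answers to all allowed queries yield one interval per salary for the ten salaries of this example. The function names are hypothetical, and each interval `(lo, hi)` is read as `lo <= salary < hi`. Only the counts returned by `count_below` are used to build the intervals.

```python
salaries = [64_000, 80_000, 80_000, 90_000, 96_000,
            122_000, 124_000, 144_000, 150_000, 150_000]

def count_below(values, x):
    """Answer the allowed query: number of records with salary < x."""
    return sum(1 for s in values if s < x)

def intervals_from_queries(values, step, max_x):
    """Ask every query x = step, 2*step, ..., max_x and derive one interval per record."""
    intervals = []
    previous = 0
    for x in range(step, max_x + 1, step):
        current = count_below(values, x)
        # Records counted at x but not at x - step lie in [x - step, x).
        intervals.extend([(x - step, x)] * (current - previous))
        previous = current
    return intervals

intervals = intervals_from_queries(salaries, 10_000, 160_000)
```

Here the first interval is `(60_000, 70_000)`, so the lowest salary is known only to within $10,000.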

13 Perfect privacy
Perfect privacy is maintained if the answers to the queries do not allow anyone to narrow the interval for any given salary.

14 Example
Salaries: $64,000, $80,000, $80,000, $90,000, $96,000, $122,000, $124,000, $144,000, $150,000, $150,000. Queries: "# < $100,000" is ...; "# < $120,000" is ...; we then know: Leung's salary is < $120,000.

15 Proposition
For perfect privacy, between 2 allowed queries, there must be at least one salary. Result: asking all the possible queries, we get an interval for each salary.
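
The condition in the proposition can be checked mechanically. The sketch below is one reading of it, with a hypothetical helper `empty_gaps`: a salary lies "between" thresholds `lo` and `hi` when `lo <= salary < hi`, which is exactly the case where the queries "# < lo" and "# < hi" return different counts.

```python
salaries = [64_000, 80_000, 80_000, 90_000, 96_000,
            122_000, 124_000, 144_000, 150_000, 150_000]

def empty_gaps(thresholds, values):
    """Return consecutive threshold pairs (lo, hi) with no salary in [lo, hi)."""
    ordered = sorted(thresholds)
    gaps = []
    for lo, hi in zip(ordered, ordered[1:]):
        if not any(lo <= s < hi for s in values):
            gaps.append((lo, hi))
    return gaps

# No salary falls between $100,000 and $120,000, so allowing both
# of these queries violates the condition of the proposition.
violations = empty_gaps([100_000, 120_000], salaries)
```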

16 Interval computations
How can we derive statistics from intervals instead of values? Problem: given n intervals for values $x_1, \dots, x_n$, compute the intervals a and s of possible values for the average and variance.

17 Interval computations
Average: computing the interval of possible values for the average is easy. Variance: computing the interval of possible values for the variance is harder. Computing the lower bound takes $O(n^2)$ time; computing the upper bound is NP-hard.
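
A minimal sketch of both computations, with hypothetical function names. The interval for the average follows directly from the interval endpoints. For the variance upper bound the sketch uses plain brute force over all endpoint combinations, which takes time exponential in n; this is not the algorithm from the talk. It relies on the fact that variance is a convex function of the values, so its maximum over a box is reached at a corner.

```python
from itertools import product

def mean_interval(intervals):
    """Interval of possible averages: average the lower ends and the upper ends."""
    n = len(intervals)
    return (sum(lo for lo, _ in intervals) / n,
            sum(hi for _, hi in intervals) / n)

def population_variance(values):
    """Population variance (divide by n, not n - 1)."""
    n = len(values)
    m = sum(values) / n
    return sum((v - m) ** 2 for v in values) / n

def max_variance_bruteforce(intervals):
    """Try all 2**n endpoint choices; feasible only for small n."""
    return max(population_variance(choice) for choice in product(*intervals))

example = [(0, 2), (1, 3), (2, 4)]
avg_lo, avg_hi = mean_interval(example)
var_hi = max_variance_bruteforce(example)
```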

18 UB for variance is NP-hard
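Reduction from subset sum (partition form): given $x_1, \dots, x_n$, can we split them into two sets with the same sum? Take all intervals $[-x_i, x_i]$. The maximum variance occurs at the interval extremities, so each value is $\pm x_i$. The variance is $\frac{1}{n}\sum_i x_i^2 - E^2$, where $E$ is the average. The first term is then fixed, so maximizing the variance means minimizing $E^2$, which reaches 0 exactly when such a split exists.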

19 UB for variance in db
Restriction: all intervals are either disjoint or coincide. In this case, the upper bound can be computed in $O(n^2)$.

20 Quantifying Information
Many definitions only describe whether or not privacy loss occurred. Need a formal model to measure loss of privacy. Could measure in bits or in percentage.

21 Kolmogorov Complexity
K(x): the size of the smallest program that can generate x. K(x/y): complexity of x relative to y. A way to measure quantity of information.

22 Kolmogorov Complexity to measure privacy loss?
K(r): quantity of information in a record. K(r/s): quantity of information in the record relative to the statistical release s. Privacy loss for the record: K(r) - K(r/s). Maximize over records.

23 Problem with this definition
Suppose the released average salary happens to coincide with a record. Cannot measure fractions of bits. Subject to additive constants. Does provide an asymptotic upper bound.

24 Shannon entropy
Set of events $E = \{e_1, e_2, \dots, e_n\}$. Source $S$ emits event $e_i$ with probability $p_i$. Entropy of S: $H(S) = \sum_i p_i \log_2(1/p_i)$. A measure of the amount of information, in bits, contained in each output symbol generated by S.
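
The entropy formula translates directly into code. This is a small sketch with a hypothetical function name. It takes the probabilities of the events as a list and skips zero-probability events, which contribute nothing to the sum.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits: sum of p * log2(1/p) over events with p > 0."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])     # one bit per symbol
four_way = entropy([0.25] * 4)      # two bits per symbol
certain = entropy([1.0])            # no information
```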

25 Shannon's first theorem
Suppose one wants to encode n consecutive symbols output by S. Let $L_n$ be the minimum expected number of bits of the encoding. Then $nH(S) \le L_n \le nH(S) + 1$.
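
The bound can be illustrated for n = 1 with a Huffman code, which is a standard optimal prefix code; the talk itself does not name a coding scheme, so Huffman coding is an assumption made here for illustration. The helper names are hypothetical. The sketch computes the code length assigned to each symbol and compares the expected length with the entropy.

```python
import heapq
from math import log2

def entropy(probs):
    """Shannon entropy in bits."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

def huffman_lengths(probs):
    """Return the Huffman code length of each symbol (needs at least 2 symbols)."""
    # Heap items: (probability, tie breaker, indices of symbols in the subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1  # merged symbols move one level deeper in the tree
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

def expected_length(probs, lengths):
    """Expected number of bits per symbol under the given code lengths."""
    return sum(p * l for p, l in zip(probs, lengths))

probs = [0.5, 0.25, 0.125, 0.125]
lengths = huffman_lengths(probs)
```

For this dyadic distribution the expected length equals the entropy; for a skewed source such as `[0.9, 0.1]` the code still needs 1 bit per symbol, which stays within the "+ 1" slack of the theorem.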

26 Defining privacy loss with entropy
Assume a database is generated according to some known probability distribution D. This induces a probability distribution on each record. A statistical release modifies that probability distribution. Privacy loss is H(r) - H'(r), where H' is the entropy after the release, maximized over all records.

27 Example
Database of 100 records with a membership field (0: non-member, 1: member). If the average is 0, total loss (1 bit). If the average is 0.5, no loss. If the average is 0.25, loss of about 0.2 bit. Expected loss is bits.
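
The per-record losses quoted in this example can be reproduced under one assumption that the slide does not state explicitly: each record is a member independently with prior probability 1/2, so H(r) = 1 bit. Once the average is released, each record is a member with probability equal to that average, and the loss is 1 minus the binary entropy of the average. The function names below are hypothetical.

```python
from math import comb, log2

def binary_entropy(p):
    """Entropy in bits of a 0/1 variable that equals 1 with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def privacy_loss(average):
    """Loss H(r) - H'(r), assuming a uniform prior with H(r) = 1 bit."""
    return 1.0 - binary_entropy(average)

def expected_loss(n):
    """Average the loss over the released average k/n, with k ~ Binomial(n, 1/2)."""
    return sum(comb(n, k) / 2 ** n * privacy_loss(k / n) for k in range(n + 1))

loss_quarter = privacy_loss(0.25)   # about 0.19 bit
loss_100_records = expected_loss(100)
```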

28 Considerations
Some data is more sensitive than other data (example: bits in a salary). Common knowledge and information from other databases: could define entropy conditional on the available information, but this is very impractical in applications. Some people know some of the records.

29 Properties of definition
Privacy loss is non-additive. It depends on the prior distribution: this can model partial knowledge, but makes the definition less practical. A statistical release may actually cause a gain in privacy! Does not incorporate restrictions on computational resources.

30 Future work
Incorporate a data sensitivity measure: within a value, differentiate lower- and higher-order bits; some fields may have one-sided sensitivity.

31 Future work
Gauge the privacy loss of existing privacy-preserving algorithms. Use effective entropy (Yao 2002) to deal with computational resources. Incorporate privacy robustness.

32 Summary
Needs for studying privacy in databases. Methods for preserving privacy. Interval computations. Definition of a measure of privacy loss based on entropy. Analysis of the definition and of notions not yet captured. Suggestions on how to improve this definition.

