UTEP Computer Science Dept.
University of Texas at El Paso
Privacy in Statistical Databases
Dr. Luc Longpré
Computer Science Department
Spring 2006
Database with Confidential Information
Examples:
– census data
– medical information
Privacy: protect the confidentiality of individuals
Usefulness: want to derive meaningful statistics
The Need for Privacy Safeguards
Per-person available disk space:
– 1983: 0.02 MB
– 1996: 28 MB
– 2000: 472 MB
Equivalent of one page per 3 minutes of life
The Need for Privacy Safeguards
Misuse of personal health information:
– banker cross-referencing cancer patients with outstanding loans
– using medical records to make decisions about employees
– snooping in hospital computer network
– 40% of insurers disclose personal health information to lenders, employers, marketers, without customer permission
Approaches
Access control, encryption:
– Only fixes who has access to what
– Does not protect against disclosures based on inference
Problem
– Sometimes it may be possible to derive confidential information from released information
Examples
Salary database
Query: what’s the average salary of white male professors with 2 children who have lived in El Paso, Texas, since 1994 and lived in Boston from 1987 to 1994?
Examples
87% of the US population is unique under an identifier made of:
– 5-digit ZIP,
– gender,
– date of birth
Linking to Re-Identify Data
Medical database:
– Ethnicity, visit date, diagnosis, procedure, medication, ZIP, birth date, sex
Voter list:
– Name, address, date registered, ZIP, birth date, sex
Both contain ZIP, birth date, and sex, so the two can be joined on those fields to attach names to medical records.
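The linking step can be sketched as a join on the three shared fields. This is only an illustration: every record, name, and field name below is invented, and the `link` helper is hypothetical rather than taken from the slides.

```python
# Hypothetical toy records; all values are invented for illustration.
medical = [
    {"zip": "79912", "birth_date": "1950-03-14", "sex": "F", "diagnosis": "asthma"},
    {"zip": "79902", "birth_date": "1971-11-02", "sex": "M", "diagnosis": "diabetes"},
]
voters = [
    {"name": "Alice Example", "zip": "79912", "birth_date": "1950-03-14", "sex": "F"},
    {"name": "Bob Example", "zip": "79902", "birth_date": "1971-11-02", "sex": "M"},
    {"name": "Carol Example", "zip": "79912", "birth_date": "1982-07-30", "sex": "F"},
]

# Fields that appear in both tables (assumed field names).
QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def link(medical_rows, voter_rows, keys=QUASI_IDENTIFIERS):
    """Join the two tables on the shared fields; keep only unambiguous matches."""
    names_by_key = {}
    for voter in voter_rows:
        key = tuple(voter[k] for k in keys)
        names_by_key.setdefault(key, []).append(voter["name"])
    linked = []
    for row in medical_rows:
        names = names_by_key.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the patient
            linked.append((names[0], row["diagnosis"]))
    return linked


reidentified = link(medical, voters)
print(reidentified)
```

Neither table holds a name next to a diagnosis on its own; the pairing only appears after the join.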
Statistical Database
Data collected with the purpose of releasing statistical information.
Important for research, policy
Facing tremendous demand for person-specific data
– data mining, fraud detection, homeland security
Sample Size
Possible solution: do not release any statistics on any set of fewer than, say, 10 records
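A minimal sketch of such a size restriction, assuming a hypothetical `average_salary` query interface and invented records; the threshold of 10 is the one suggested above.

```python
from statistics import mean

MIN_QUERY_SET_SIZE = 10  # the "say, 10 records" threshold


def average_salary(records, predicate, min_size=MIN_QUERY_SET_SIZE):
    """Return the mean salary of the matching records, or None if too few match."""
    matching = [r["salary"] for r in records if predicate(r)]
    if len(matching) < min_size:
        return None  # refuse to answer: the query set is too small
    return mean(matching)


# Invented data: 12 people aged 30..41 with salaries 50,000..61,000.
records = [{"age": 30 + i, "salary": 50_000 + 1_000 * i} for i in range(12)]

answer_all = average_salary(records, lambda r: True)          # 12 records: answered
answer_few = average_salary(records, lambda r: r["age"] > 38)  # 3 records: refused
print(answer_all, answer_few)
```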
Problem Remains
Query 1: What’s the average salary of all males aged 89 in ZIP code 79912?
Query 2: What’s the average salary of all people aged 89 in ZIP code 79912?
If exactly one person in the second set is not in the first, comparing the two answers can reveal that person’s salary.
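The two queries can be combined as follows. This sketch assumes the database also reports the size of each query set (the slide does not say so), and all data and the `query` helper are invented for illustration.

```python
from statistics import mean

MIN_QUERY_SET_SIZE = 10  # hypothetical threshold, as in "say, 10 records"

# Invented data: eleven people aged 89 in ZIP 79912, exactly one of them female.
people = [
    {"sex": "M", "age": 89, "zip": "79912", "salary": 20_000 + 1_000 * i}
    for i in range(10)
]
people.append({"sex": "F", "age": 89, "zip": "79912", "salary": 31_000})


def query(predicate):
    """Return (count, average salary) for the matching set, or None if too small."""
    salaries = [p["salary"] for p in people if predicate(p)]
    if len(salaries) < MIN_QUERY_SET_SIZE:
        return None
    return len(salaries), mean(salaries)


# Asking about the one woman directly is refused.
direct = query(lambda p: p["sex"] == "F")

# Both of these pass the size check.
n_male, avg_male = query(
    lambda p: p["sex"] == "M" and p["age"] == 89 and p["zip"] == "79912"
)
n_all, avg_all = query(lambda p: p["age"] == 89 and p["zip"] == "79912")

# The difference of the two totals isolates the single remaining record.
recovered_salary = round(avg_all * n_all - avg_male * n_male)
print(direct, recovered_salary)
```

Each query on its own respects the size limit; the disclosure comes only from combining them.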
K-anonymity
Release only information where at least k records are identical (work by Sweeney)
Attacks are still possible:
– Unsorted matching: exploits the order of records
  solution: randomize order
K-anonymity
– Complementary release: combining k-anonymous releases may not be k-anonymous
  solution: consider all releases together
– Temporal attack: data is dynamic, adding and removing data affects k-anonymous properties
  solution: analyze k-anonymous properties of dynamic data
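A minimal check of the k-anonymity condition for a single release, assuming the records are dictionaries and the caller names the identifying fields. The field names, the generalized values, and the `is_k_anonymous` helper are all hypothetical; the sketch does not address the attacks listed above.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())


# Invented release in which ZIP and age have already been generalized.
release = [
    {"zip": "799**", "age": "80-89", "diagnosis": "flu"},
    {"zip": "799**", "age": "80-89", "diagnosis": "asthma"},
    {"zip": "799**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "799**", "age": "30-39", "diagnosis": "diabetes"},
]

two_anonymous = is_k_anonymous(release, ("zip", "age"), k=2)
three_anonymous = is_k_anonymous(release, ("zip", "age"), k=3)
print(two_anonymous, three_anonymous)
```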
Other Solutions
Add noise in the answers
Add noise in the data
Limit the kinds of queries allowed to the statistical database
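As a sketch of the first option, the answer to an average query can be perturbed with zero-mean noise. The Gaussian distribution, the noise scale, and the data are assumptions made here for illustration; the slides do not specify a noise model.

```python
import random


def noisy_average(values, noise_scale, rng):
    """Return the true average plus zero-mean Gaussian noise (assumed noise model)."""
    true_average = sum(values) / len(values)
    return true_average + rng.gauss(0.0, noise_scale)


salaries = [42_000, 55_000, 61_000, 48_000, 73_000]  # invented data
true_average = sum(salaries) / len(salaries)

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
one_answer = noisy_average(salaries, noise_scale=1_000.0, rng=rng)

# If the same query may be repeated with fresh noise each time,
# averaging the answers drifts back toward the true value.
many_answers = [noisy_average(salaries, 1_000.0, rng) for _ in range(10_000)]
averaged = sum(many_answers) / len(many_answers)
print(true_average, one_answer, averaged)
```

The second half illustrates why independent noise per answer is not enough by itself when queries can be repeated.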
Quantifying Information
Need a formal model, possibly based on information theory
Measure entropy in database records before and after a statistical release
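One way to read "entropy before and after" is with Shannon entropy over an observer's beliefs about a confidential value. The distributions below are invented: they assume eight equally likely salary bands before the release and two remaining bands after it.

```python
from math import log2


def entropy(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


# Before the release: eight salary bands, all equally likely (assumed prior).
before = [1 / 8] * 8
# After the release: six bands are ruled out, two remain equally likely.
after = [1 / 2, 1 / 2, 0, 0, 0, 0, 0, 0]

bits_before = entropy(before)
bits_after = entropy(after)
bits_disclosed = bits_before - bits_after
print(bits_before, bits_after, bits_disclosed)
```

Under these assumptions the release discloses two of the three bits of uncertainty about the record.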
Further Complications
Some data is more sensitive than others
– Example: bits in salary
Common knowledge, information from other databases
– Could define entropy conditional on available information
– Very impractical in applications
Some people know some of the records
Non-Additivity
Data sensitivity is non-additive
– Ex: releasing any single digit of an SSN is acceptable, but releasing all the digits is not
Privacy loss is non-additive
– Ex: there could be 2 sets of information, each of which gives no information if released alone, but which reveal all the information if released together
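A standard construction with this property is XOR sharing, used here only as an illustration and not taken from the slides: a secret byte is split into two pieces, each of which is uniformly random when seen alone, while the two together reconstruct the secret exactly. The `split` helper and the secret value are hypothetical.

```python
import secrets


def split(secret_byte):
    """Split a byte into two shares whose XOR equals the secret."""
    share_a = secrets.randbelow(256)  # uniformly random, independent of the secret
    share_b = secret_byte ^ share_a   # also uniformly random when seen alone
    return share_a, share_b


secret = 0b10110010  # invented confidential value
share_a, share_b = split(secret)

# Either share alone could correspond to any secret; together they reveal it.
recovered = share_a ^ share_b
print(recovered == secret)
```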
Past Research
Denning: “Cryptography and Data Security”, 1982
Sweeney: Ph.D. thesis, applications to medical data, 1996
A few more scattered results; the topic is becoming popular again as “privacy-preserving data mining”.