- A hospital has a database of patient records, each record containing a binary value indicating whether or not the patient has cancer:
  patient      has cancer
  Amy          0
  Tom          1
  Jack         1
- Suppose an adversary is only allowed to use a particular form of query S(i) that returns the sum of the first i values in the "has cancer" column.
- Differential privacy addresses the question: given the total number of patients with cancer, can an adversary learn whether a particular individual has cancer?
- Suppose the adversary also knows that Jack is in the last row of the database.
- Does Jack have cancer? S(3) - S(2) reveals his value.
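A minimal sketch of this differencing attack in Python; the table contents and the query S(i) come from the example above, while the function and variable names are purely illustrative.

```python
# Toy database from the example: 1 = has cancer, 0 = does not.
records = [("Amy", 0), ("Tom", 1), ("Jack", 1)]

def S(i):
    """Permitted query: sum of the first i values in the 'has cancer' column."""
    return sum(value for _, value in records[:i])

# The adversary knows Jack occupies the last (third) row, so the two
# allowed queries S(3) and S(2) differ by exactly Jack's value.
jack_has_cancer = S(3) - S(2)
print(jack_has_cancer)  # 1 -> the sum-query interface alone leaks Jack's record
```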
The differential privacy model is derived from a very simple observation: suppose the dataset D contains an individual, for example Alice, and an arbitrary query f (for example count, sum, average, median, or other queries) returns the result f(D). If, after deleting Alice from D, the query still returns essentially the same result, then Alice's information is not leaked. Differential privacy aims to maximize the accuracy of queries over a dataset while minimizing the chances of identifying its records.
Differential Privacy
Figure: two neighboring databases D1 and D2 that differ in a single record (xi in D1 replaced by xi' in D2).
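For reference, the standard formal statement built on these neighboring databases (the formula itself does not appear on the slide):

```latex
% A randomized mechanism M satisfies \varepsilon-differential privacy if, for every
% pair of neighboring databases D_1, D_2 (differing in one record) and every
% set of outputs S,
\Pr[M(D_1) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D_2) \in S].
```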
k-anonymity and its extensions (l-diversity, t-closeness, ...) cannot provide sufficient security. Differential privacy, in contrast, does not rely on any assumption about the background knowledge an attacker may have.
- Laplace Mechanism
- Gaussian Mechanism (probabilistic)
- Exponential Mechanism
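A minimal sketch of the Laplace mechanism, assuming a numeric query with L1-sensitivity Δf; the function and parameter names are illustrative, not from the slides.

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Return the query answer plus Laplace noise with scale sensitivity/epsilon."""
    scale = sensitivity / epsilon
    return true_answer + np.random.laplace(loc=0.0, scale=scale)

# Example: a counting query has sensitivity 1 (adding or removing one person
# changes the count by at most 1), so noise Lap(1/epsilon) suffices.
noisy_count = laplace_mechanism(true_answer=2, sensitivity=1.0, epsilon=0.5)
```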
A sports event is going to be held. The event is to be selected from the set {football, volleyball, basketball, tennis}, and participants voted for their preferred item. We now want to choose an item while ensuring that the entire decision-making process satisfies ε-differential privacy. Take the number of votes as the utility function u; obviously Δu = 1. According to the exponential mechanism, given a privacy budget ε we can calculate the output probability of each item, as shown in the table:
  item         u     ε=0    ε=0.1    ε=1
  Football     …     …      …        …
  Volleyball   …     …      …        …
  Basketball   …     …      …        …e-05
  Tennis       …     …      …        …e-07
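A minimal sketch of the exponential mechanism for this voting example. The vote counts below are assumed placeholders (the actual counts are not in the slide), and the function names are illustrative.

```python
import math, random

def exponential_mechanism(utilities, epsilon, sensitivity=1.0):
    """Pick an item with probability proportional to exp(epsilon * u / (2 * sensitivity))."""
    weights = {item: math.exp(epsilon * u / (2.0 * sensitivity))
               for item, u in utilities.items()}
    total = sum(weights.values())
    probs = {item: w / total for item, w in weights.items()}
    items, ps = zip(*probs.items())
    return random.choices(items, weights=ps)[0], probs

# Placeholder vote counts (utility u = number of votes, so Delta u = 1).
votes = {"football": 30, "volleyball": 25, "basketball": 8, "tennis": 2}
winner, probabilities = exponential_mechanism(votes, epsilon=1.0)
```

With ε = 0 every item is equally likely regardless of the votes; as ε grows, low-vote items are selected with exponentially small probability.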
Privacy Preserving Data Release (PPDR)
- Interactive data release: the analyst submits Query 1, ..., Query i; a DP layer in front of the raw data returns all query results.
- Non-interactive data release: purified (sanitized) datasets are published directly.
Gergely Acs (INRIA), Claude Castelluccia (INRIA)
- Record linkage
- Attribute linkage
- Table linkage
- Probabilistic attack
CDR dataset -- from the French telecom company Orange:
- 1,992,846 users
- 1,303 towers
- 989 IRIS cells
- 10/09/2007 – __/09/2007 (one week)
Aim: release the time series of the IRIS cells without leaking privacy, where the (t+1)-th entry of the series of a cell L is the number of individuals at L in the (t+1)-th hour of the week.
Method: publish a sanitized version of the time series of all IRIS cells that satisfies differential privacy.
For a given privacy level, the magnitude of noise can be substantially reduced by using several optimizations and by customizing the anonymization mechanisms to the public characteristics of the datasets and applications.
Pre-sampling (parameter ℓ): compute the largest covering cells and select at most ℓ visits per user, which bounds each user's contribution to the released counts.
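A minimal sketch of the pre-sampling step. Keeping at most ℓ visits per user is from the slide; the choice of keeping each user's most-visited cells and all names are illustrative assumptions.

```python
from collections import Counter

def pre_sample(user_visits, l):
    """Keep at most l visits per user, taken from the user's most-visited cells.

    user_visits: dict mapping user id -> list of visited cell ids (one entry per visit).
    Bounding each user's contribution to l visits bounds the sensitivity of the
    per-cell counts released later.
    """
    sampled = {}
    for user, visits in user_visits.items():
        kept = []
        # Walk cells from most- to least-visited and keep visits until l is reached.
        for cell, count in Counter(visits).most_common():
            take = min(count, l - len(kept))
            kept.extend([cell] * take)
            if len(kept) >= l:
                break
        sampled[user] = kept
    return sampled

# Example: each user contributes at most l = 2 visits to the released counts.
sampled = pre_sample({"u1": ["A", "A", "B", "C"], "u2": ["B"]}, l=2)
```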
Perturbation: exploit the similarity of geographically close time series → cluster cells; exploit the periodic nature of the series → add Gaussian noise to the low-frequency DCT components.
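A minimal sketch of the DCT-based perturbation, assuming a weekly (168-hour) count series for one cluster; the number of retained low-frequency coefficients and the noise scale are illustrative placeholders, not values from the slides or the paper.

```python
import numpy as np
from scipy.fft import dct, idct

def perturb_series(series, k, sigma):
    """Keep the k low-frequency DCT coefficients, add Gaussian noise, and invert.

    series: hourly counts of one cluster over the week (length 168).
    k:      number of low-frequency coefficients to retain (placeholder choice).
    sigma:  scale of the Gaussian noise added to each retained coefficient.
    """
    coeffs = dct(np.asarray(series, dtype=float), norm="ortho")
    noisy = np.zeros_like(coeffs)
    noisy[:k] = coeffs[:k] + np.random.normal(0.0, sigma, size=k)
    return idct(noisy, norm="ortho")

# Example on a synthetic periodic series standing in for a cluster's counts.
hours = np.arange(168)
counts = 100 + 30 * np.sin(2 * np.pi * hours / 24)
sanitized = perturb_series(counts, k=20, sigma=5.0)
```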