1. Privacy vs. Utility. Xintao Wu, University of North Carolina at Charlotte. Nov 10, 2008.

2. Privacy
Legal interpretation: privacy viewed in terms of the access that others have to us and our information. A general definition of privacy must be one that is measurable, of value, and actionable.
Measuring privacy:
- Secrecy: concerns information that others may gather about us. Measured by the probability of a data item being accessed, or by the change in an adversary's knowledge upon seeing the data.
- Anonymity: addresses how much we are in the public gaze. Privacy leakage is measured in terms of the size of the blurring accompanying the release of data.
- Solitude: measures the degree to which others have physical access to us.

3. Privacy vs. Utility
Encryption does not work in the data-publishing scenario.
Utility: the goal of privacy-preservation measures is to restrict access to confidential information while at the same time releasing aggregate information to the public.

4. Data anonymization methods
- Random perturbation: input perturbation and output perturbation.
- Generalization: the data domain has a natural hierarchical structure; the degree of perturbation can be measured in terms of the height of the resulting generalization above the leaf values.
- Suppression.
- Permutation: destroys the link between identifying and sensitive attributes that could lead to a privacy leakage.
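The generalization height described above can be sketched with a toy domain hierarchy. This is a minimal illustration, assuming 5-digit ZIP codes generalized by truncating trailing digits; the attribute and values are invented, not from the slides.

```python
# Minimal sketch of hierarchy-based generalization: the perturbation degree
# is the height of the generalized value above the leaf (exact) value.
# Assumed example domain: 5-digit ZIP codes, one hierarchy level per digit.

def generalize(zip_code: str, height: int) -> str:
    """Replace the last `height` digits with '*'; height 0 returns the leaf value."""
    if height == 0:
        return zip_code
    return zip_code[:-height] + "*" * height

print(generalize("28223", 0))  # 28223  (no generalization, height 0)
print(generalize("28223", 2))  # 282**  (height-2 generalization)
print(generalize("28223", 5))  # *****  (root of the hierarchy: full suppression)
```

Higher generalization means more privacy but coarser data, which is the trade-off the later utility metrics quantify.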

5. Statistical measures of anonymity
- Query restriction: for a database of size N and a fixed parameter k, all queries that return either fewer than k or more than N-k records are rejected. This can be subverted by requesting a specific sequence of queries.
- Anonymity via variance: lower-bound the variance of estimators of sensitive attributes. Utility is measured (by combining the perturbation scheme with a query-restriction method) as the fraction of queries that are permitted after perturbation.
- Confidence interval: how hard it is to reconstruct the original data distribution.
- Anonymity via multiplicity: k-anonymity.
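The k / N-k query-restriction rule above is easy to state in code. This is an illustrative sketch; the `records` table and the query predicate are made-up examples.

```python
# Sketch of query restriction: for a database of size N and parameter k,
# reject any query whose answer covers fewer than k or more than N-k records.

def allow_query(result_size: int, n_total: int, k: int) -> bool:
    """Permit the query only if k <= |answer| <= N - k."""
    return k <= result_size <= n_total - k

records = list(range(100))                  # toy database, N = 100
k = 5
matching = [r for r in records if r < 3]    # a query matching only 3 records
print(allow_query(len(matching), len(records), k))  # False: too specific, rejected
print(allow_query(50, len(records), k))             # True: broad aggregate, allowed
```

The slide's caveat still applies: a sequence of individually permitted queries can be combined to isolate a single record, which is why query restriction alone is insufficient.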

6. Probabilistic measures of anonymity
Assume the adversary knows aggregate information about the data as well as the method of perturbation:
- Perturbing X with a random value drawn from [-1, 1] achieves a privacy interval of width 2.
- If the distribution of X is revealed to be [0, 1] with probability 0.5 and [4, 5] with probability 0.5, the privacy achieved is reduced to an interval of width 1.
Mutual information:
- H(A) encodes the amount of uncertainty (the degree of privacy) in a random variable A.
- H(A|B) is the amount of privacy left in A after B is released.
- I(A;B) = H(A) - H(A|B) is the mutual information between A and B.
- Privacy loss: P(A|B) = 1 - 2^H(A|B) / 2^H(A) = 1 - 2^(-I(A;B)).
Utility: the statistical distance between the source distribution of the data and the perturbed distribution.
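The privacy-loss formula P(A|B) = 1 - 2^(-I(A;B)) can be evaluated directly for a small discrete distribution. The joint table below is an invented example used only to exercise the formula.

```python
# Sketch of the entropy-based privacy loss: compute I(A;B) from a toy
# joint distribution, then P(A|B) = 1 - 2^(-I(A;B)).
# (The joint probabilities are illustrative, not from the lecture.)
import math

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution P(A, B) over two correlated binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
pa = [0.5, 0.5]                         # marginal of A
pb = [0.5, 0.5]                         # marginal of B

h_ab = entropy(joint.values())          # joint entropy H(A, B)
i_ab = entropy(pa) + entropy(pb) - h_ab # I(A;B) = H(A) + H(B) - H(A,B)
privacy_loss = 1 - 2 ** (-i_ab)
print(round(i_ab, 4), round(privacy_loss, 4))
```

When A and B are independent, I(A;B) = 0 and the privacy loss is 0; the stronger the correlation with the released data B, the closer the loss gets to 1.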

7. On the design and quantification of privacy preserving data mining algorithms (Agrawal & Aggarwal, PODS 2001).

8. (figure-only slide; no transcript text)

9. Market basket data
- A privacy breach occurs when the probability of some property of the input data, conditioned on the output perturbed data having certain properties, is high (Evfimievski et al.).
- Privacy is measured in terms of the probability of correctly reconstructing the original bit, given a perturbed bit (Rizvi and Haritsa).
- Utility is the problem of reconstructing itemset frequencies accurately.
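The bit-level perturbation measure above can be sketched with randomized response: each bit is kept with probability p and flipped otherwise, and itemset frequencies are reconstructed by inverting the expected distortion. The flip probability and the toy data are illustrative choices, not values from the slides.

```python
# Sketch of bit perturbation and frequency reconstruction in the spirit of
# the Rizvi-Haritsa measure: keep each bit with probability p, flip it
# otherwise, then estimate the true frequency of 1s from the noisy bits.
import random

def perturb(bits, p, rng):
    """Randomized response: retain each bit with probability p, else flip it."""
    return [b if rng.random() < p else 1 - b for b in bits]

def reconstruct_freq(perturbed, p):
    """Invert E[observed] = p*f + (1-p)*(1-f) to estimate the true frequency f."""
    observed = sum(perturbed) / len(perturbed)
    return (observed - (1 - p)) / (2 * p - 1)

rng = random.Random(0)                     # fixed seed for reproducibility
true_bits = [1] * 3000 + [0] * 7000        # true frequency of 1s: 0.30
noisy = perturb(true_bits, p=0.9, rng=rng)
print(round(reconstruct_freq(noisy, 0.9), 2))   # close to 0.30
```

The tension on this slide is visible here: a p near 0.5 makes individual bits nearly unguessable (high privacy) but also makes the aggregate estimate much noisier (low utility).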

10. Measuring information transfer
Limiting privacy breaches in privacy preserving data mining, PODS 2003.
Looking back from an output y, there is no easy way of telling whether the source was x1 or x2.

11. Measures based on generalization
k-anonymity, l-diversity, p-sensitive k-anonymity, t-closeness.
l-diversity may be difficult and unnecessary to achieve:
- Suppose the sensitive attribute is the test result for a virus, with 99% of records being negative. The positive and negative values have different degrees of sensitivity.
l-diversity is insufficient to prevent attribute disclosure:
- Skewness attack: e.g., one equivalence class has an equal number of positive and negative records.
- Similarity attack: the sensitive-attribute values in an equivalence class are distinct but semantically similar.
t-closeness: requires that the distance between the distribution of a sensitive attribute in each equivalence class and the distribution of the attribute in the whole table be no more than a threshold t.
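A minimal check of distinct l-diversity makes the skewness attack concrete. The toy equivalence class below is invented to mirror the virus-test example above.

```python
# Sketch of a distinct-l-diversity check for one equivalence class:
# the class must contain at least l distinct sensitive values.

def is_l_diverse(sensitive_values, l):
    """Distinct l-diversity: at least l different sensitive values in the class."""
    return len(set(sensitive_values)) >= l

# An equivalence class with equal positive/negative test results satisfies
# 2-diversity, yet still leaks: any member is positive with probability 0.5,
# far above the 1% background rate. This is the skewness attack.
cls = ["positive"] * 5 + ["negative"] * 5
print(is_l_diverse(cls, 2))   # True, despite the disclosure risk
```

This is exactly why the slide moves on to t-closeness: diversity counts values but ignores how far the class distribution drifts from the table-wide distribution.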

12. Measuring distribution difference

13. Earth mover's distance

14. EMD for numerical attributes
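The slide itself carries no transcript text, but the standard t-closeness formulation of EMD for an ordered (numerical) attribute is the sum of absolute cumulative differences, normalized by m - 1 for m ordered values. The two toy distributions below are invented for illustration.

```python
# Sketch of earth mover's distance for a numerical (ordered) attribute,
# following the t-closeness formulation: ordered-distance EMD is the
# normalized sum of absolute cumulative differences between the two
# distributions. (Toy salary-bucket distributions below are illustrative.)

def emd_ordered(p, q):
    """EMD between two distributions over the same m ordered values."""
    m = len(p)
    cum, total = 0.0, 0.0
    for i in range(m - 1):
        cum += p[i] - q[i]      # running mass imbalance up to value i
        total += abs(cum)       # cost of moving that imbalance one step
    return total / (m - 1)

# Distribution within one equivalence class vs. the whole table,
# over 5 ordered salary buckets.
cls_dist   = [0.5, 0.5, 0.0, 0.0, 0.0]
table_dist = [0.2, 0.2, 0.2, 0.2, 0.2]
print(round(emd_ordered(cls_dist, table_dist), 3))   # 0.375
```

The class concentrated on the two lowest buckets is far from the uniform table distribution, so it would violate t-closeness for any t below 0.375.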

15. EMD for categorical attributes

16. EMD for categorical attributes (cont.)

17. Permutation
- The goal is to form k-anonymous blocks in which the diameter of the range of sensitive-attribute values is larger than a parameter e.
- Permutation-based anonymization can answer aggregate queries more accurately than generalization-based anonymization.

18. Anonymizing inferences
- Goal: protect against the possible inferences that can be made from the data.
- A privacy template is an inference on the data coupled with a confidence bound; the requirement is that, in the anonymized data, the inference does not hold with confidence larger than the provided bound.
- Wang et al. Handicapping attacker's confidence: an alternative to k-anonymization.

19. Measuring utility in generalization-based anonymity
- The precision of a generalization scheme is 1 minus the average height of generalization, measured over all cells.
- Bayardo and Agrawal, ICDE 2005.
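The precision metric above can be sketched directly. One common variant normalizes each cell's generalization height by the depth of that attribute's hierarchy before averaging; that normalization, along with the toy table and hierarchy depths, is an assumption of this sketch rather than something stated on the slide.

```python
# Sketch of the precision utility metric: 1 minus the average generalization
# height per cell, each height normalized by its attribute's hierarchy depth.
# (Normalization choice, table, and depths are illustrative assumptions.)

def precision(gen_heights, hierarchy_depths):
    """gen_heights[i][j] = generalization height of cell (record i, attribute j)."""
    total, cells = 0.0, 0
    for row in gen_heights:
        for j, h in enumerate(row):
            total += h / hierarchy_depths[j]
            cells += 1
    return 1 - total / cells

# Two attributes: ZIP (hierarchy depth 5) and age (depth 3); three records.
heights = [[2, 1], [2, 0], [0, 1]]
depths = [5, 3]
print(round(precision(heights, depths), 3))
```

Precision is 1 for unmodified data (all heights 0) and falls toward 0 as cells are generalized to the roots of their hierarchies.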

20. Utility vs. privacy
- Most schemes for ensuring data anonymity focus on defining measures of anonymity while using ad hoc measures of utility.
- Kifer & Gehrke. Injecting utility into anonymized datasets. SIGMOD 2006: after performing a standard anonymization, publish carefully chosen marginals of the source data; from these marginals, construct a consistent maximum-entropy distribution, and measure utility as the KL-distance between this distribution and the source.
- Rastogi et al. The boundary between privacy and utility in data publishing.
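The KL-distance utility measure mentioned above is straightforward to compute for discrete distributions. This is a minimal sketch; both toy distributions are invented, standing in for the source distribution and the one reconstructed from published marginals.

```python
# Sketch of the KL-divergence utility measure: small divergence between the
# source distribution and the reconstructed one means high utility.
# (Both distributions below are illustrative.)
import math

def kl_divergence(p, q):
    """D(P || Q) = sum_x p(x) * log2(p(x)/q(x)); assumes q(x) > 0 where p(x) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

source        = [0.5, 0.25, 0.25]   # true distribution of the data
reconstructed = [0.4, 0.30, 0.30]   # distribution recovered from marginals
print(round(kl_divergence(source, reconstructed), 4))   # small: high utility
```

KL divergence is zero only when the two distributions match, so it directly rewards anonymizations whose published marginals pin down the source distribution well.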

21. Computational measures of anonymity
- Privacy statements are phrased in terms of the computational power of an adversary, rather than the amount of background knowledge they possess.
- Dinur & Nissim. Revealing information while preserving privacy. PODS 2003.
- Measuring anonymity via information transfer.
- Indistinguishability: a database is private if anything learnable from it can be learned in the absence of the database.

22. Anonymity via isolation
- A record is private if it cannot be singled out from its neighbors.
- An adversary is defined as an algorithm that takes an anonymized database and some auxiliary information, and outputs a single point q.
- An anonymization is successful if the adversary, combining the anonymization with auxiliary information, can do no better at isolation than a weaker adversary with no access to the anonymized data.

23. Metrics for quantifying data quality
- Quality of the data resulting from the PPDM process: accuracy, completeness, consistency.
- Quality of the data mining results (Chapter 8.4).

24. Measures
Oliveira & Zaiane. Privacy preserving frequent itemset mining, 2002.

25. Generalization-based metrics
- The data quality metric is based on the height of generalization hierarchies: data should be generalized in as few steps as possible to preserve maximum utility.
- Not every generalization step is equal in terms of information loss.
- General loss metric.
- Classification metric (Iyengar, KDD 2002).
- Discernibility metric (Bayardo & Agrawal, ICDE 2005).
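The discernibility metric named above penalizes each record by the size of the equivalence class it becomes indistinguishable within. This is a hedged sketch of that idea; the toy partition and the suppressed-record penalty term follow the usual formulation but are assumptions here, not content from the slide.

```python
# Sketch of the discernibility metric: sum of |E|^2 over equivalence classes E,
# plus a penalty of |D| (the full table size) for every suppressed record.
# (Partition sizes below are an invented example.)

def discernibility(class_sizes, n_total, n_suppressed=0):
    """Lower cost = finer partition = more discernible, higher-utility data."""
    return sum(s * s for s in class_sizes) + n_suppressed * n_total

# A 10-record table partitioned into equivalence classes of sizes 4, 3, 3.
print(discernibility([4, 3, 3], n_total=10))   # 16 + 9 + 9 = 34
```

The metric captures the slide's point that generalization steps are not all equal: merging two small classes raises the cost far less than collapsing many records into one large class.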

26. Statistical-based perturbation

