Beyond k-Anonymity: A Decision Theoretic Framework for Assessing Privacy Risk
M. Scannapieco, G. Lebanon, M. R. Fouad and E. Bertino

Introduction
Release of data
– Private organizations can benefit from sharing data with others
– Public organizations see data as valuable to society
Privacy preservation
– Data disclosure can lead to economic damage, threats to national security, etc.
– Regulated by law in both the private and public sectors

Two Facets of Data Privacy
Identity disclosure
– Uncontrolled data release: identifiers may even be present
– Anonymous data release: identifiers are suppressed, but there is no control over possible linking with other sources

Linkage of Anonymous Data

T1
PrivateID  SSN  DOB     ZIP  Health_Problem
a               11/20/       Shortness of breath
b               02/07/       Headache
c               02/07/       Obesity
d               08/07/       Shortness of breath

T2
PrivateID  SSN  DOB     ZIP  Employment        Marital Status
1          A    11/20/       Researcher        Married
5          E    08/07/       Private Employee  Married
3          C    02/07/       Public Employee   Widow

QUASI-IDENTIFIER
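
The linkage between T1 and T2 can be sketched in a few lines of Python. This is only an illustration: the full DOB and ZIP values are invented (the slide's values are not recoverable), and treating (DOB, ZIP) as the quasi-identifier is an assumption, as are the names `link` and `QUASI_IDENTIFIER`.

```python
# Hypothetical records: the DOB and ZIP values below are invented for
# illustration and do not come from the slide.
t1 = [  # anonymized health table (SSN suppressed)
    {"id": "a", "dob": "11/20/1967", "zip": "10001", "health": "Shortness of breath"},
    {"id": "b", "dob": "02/07/1981", "zip": "10002", "health": "Headache"},
    {"id": "c", "dob": "02/07/1981", "zip": "10003", "health": "Obesity"},
    {"id": "d", "dob": "08/07/1976", "zip": "10004", "health": "Shortness of breath"},
]
t2 = [  # external table that still carries an identifier (SSN)
    {"id": 1, "ssn": "A", "dob": "11/20/1967", "zip": "10001", "job": "Researcher"},
    {"id": 5, "ssn": "E", "dob": "08/07/1976", "zip": "10004", "job": "Private Employee"},
    {"id": 3, "ssn": "C", "dob": "02/07/1981", "zip": "10003", "job": "Public Employee"},
]

# Assumption: the quasi-identifier is the pair (DOB, ZIP).
QUASI_IDENTIFIER = ("dob", "zip")


def link(anonymous, external, keys=QUASI_IDENTIFIER):
    """Return (anonymous_id, ssn) pairs whose quasi-identifier values agree."""
    index = {}
    for row in external:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    pairs = []
    for row in anonymous:
        for match in index.get(tuple(row[k] for k in keys), []):
            pairs.append((row["id"], match["ssn"]))
    return pairs


linked = link(t1, t2)
```

With these invented values, three of the four anonymous health records are tied back to an SSN even though T1 itself contains no identifier.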

Two Facets of Data Privacy (cont.)
Sensitive information disclosure
– Once identity disclosure occurs, the loss due to such disclosure depends on how sensitive the related data are
– Data sensitivity is subjective
  E.g., age is in general more sensitive for women than for men

Our proposal
A framework for assessing privacy risk that takes into account both facets of privacy
– based on statistical decision theory
Definition and analysis of:
– disclosure policies modelled by disclosure rules
– several privacy risk functions
Estimated risk as an upper bound of the true risk, and related complexity analysis
Algorithm for finding the disclosure rule that minimizes the privacy risk

Disclosure rules
A disclosure rule $\delta$ is a function that maps a record $x$ to a new record $z = \delta(x)$ in which some attributes may have been suppressed:
$$z_j = \begin{cases} \bot & \text{if the $j$-th attribute is suppressed} \\ x_j & \text{otherwise} \end{cases}$$
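
A minimal Python sketch of a disclosure rule, under two assumptions that are not in the slides: `None` stands in for the suppression symbol, and the rule suppresses the same attribute positions for every record (the definition above also allows rules that decide per record). The names `make_rule` and `SUPPRESSED` are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Record = Tuple[Optional[str], ...]
SUPPRESSED = None  # assumption: None plays the role of the suppression symbol


def make_rule(suppress: Sequence[int]):
    """Build a rule that suppresses the attribute positions listed in `suppress`."""
    hidden = set(suppress)

    def rule(x: Record) -> Record:
        # Keep x_j unless position j is marked for suppression.
        return tuple(SUPPRESSED if j in hidden else v for j, v in enumerate(x))

    return rule


hide_phone = make_rule([2])
z = hide_phone(("Ada", "Smith", "555-0101"))
```

Here `z` keeps the name and surname and carries `None` in the phone position.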

Loss function
Let $\theta$ be the side information used by the attacker in the identification attempt
The loss function $\ell(z, \theta)$ measures the loss incurred by disclosing the data $z = \delta(x)$ due to possible identification based on $\theta$
Empirical distribution $p$ associated with the records $x_1, \ldots, x_n$
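
Risk Definition
The risk of the disclosure rule $\delta$ in the presence of the side information $\theta$ is the average loss of disclosing $x_1, \ldots, x_n$:
$$R(\delta, \theta) = \mathbb{E}_{p(x)}\left[\ell(\delta(x), \theta)\right] = \frac{1}{n} \sum_{i=1}^{n} \ell(\delta(x_i), \theta)$$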

Putting the pieces together so far…
A hypothetical attacker performs an identification attempt on a disclosed record $y = \delta(x)$ on the basis of side information $\theta$, which can be a dictionary
The dictionary is used to link $y$ with some entry present in the dictionary
Example:
– $y$ has the form (name, surname, phone#) and $\theta$ is a phone book
– If all attributes are revealed, $y$ is likely to be linked with a single entry
– If phone# is suppressed (or missing), $y$ may or may not be linked to a single entity, depending on the popularity of (name, surname)
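
The phone book example can be made concrete with a small sketch. The entries are invented, `None` marks a suppressed attribute, and the helper names `consistent` and `candidates` are hypothetical.

```python
def consistent(y, entry):
    """True if every disclosed (non-None) attribute of y equals the entry's value."""
    return all(a is None or a == b for a, b in zip(y, entry))


def candidates(y, dictionary):
    """Dictionary entries that the disclosed record y could be linked to."""
    return [e for e in dictionary if consistent(y, e)]


# Invented phone book: two people share a popular (name, surname) pair.
phone_book = [
    ("Ada", "Smith", "555-0101"),
    ("Ada", "Smith", "555-0102"),
    ("Bob", "Jones", "555-0103"),
]

full = candidates(("Ada", "Smith", "555-0101"), phone_book)  # all attributes revealed
popular = candidates(("Ada", "Smith", None), phone_book)     # phone# suppressed, common name
rare = candidates(("Bob", "Jones", None), phone_book)        # phone# suppressed, rare name
```

With everything revealed there is a single candidate; with the phone number suppressed the popular name leaves two candidates, while the rare name is still linked to one entry.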
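
Risk formulation
Let's decompose the loss function into an identification part and a sensitivity part
Identification part: formalized by the random variable $Z$, with $Z = 1$ if identification occurs, where $\rho(y, \theta)$ denotes the set of entries of $\theta$ that $y$ can be linked to:
$$p_Z(1) = \begin{cases} |\rho(y, \theta)|^{-1} & \text{if } \rho(y, \theta) \neq \emptyset \\ 0 & \text{otherwise} \end{cases}$$

Risk formulation (cont.)
Sensitivity part: $\Phi(y) \ge 0$, where higher values indicate higher sensitivity
Therefore the loss is:
$$\ell(y, \theta) = \mathbb{E}_{p_Z}\left[\Phi(y)\, Z\right] = \begin{cases} \Phi(y) \,/\, |\rho(y, \theta)| & \text{if } \rho(y, \theta) \neq \emptyset \\ 0 & \text{otherwise} \end{cases}$$

Risk formulation (cont.)
Risk:
$$R(\delta, \theta) = \frac{1}{n} \sum_{i:\, \rho(\delta(x_i), \theta) \neq \emptyset} \frac{\Phi(\delta(x_i))}{|\rho(\delta(x_i), \theta)|}$$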
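
A sketch of the risk computation, assuming the loss of a disclosed record is its sensitivity divided by the number of dictionary entries it can be linked to, and zero when no entry matches. The data, the constant sensitivity and all function names are hypothetical.

```python
def consistent(y, entry):
    """True if every disclosed (non-None) attribute of y equals the entry's value."""
    return all(a is None or a == b for a, b in zip(y, entry))


def risk(records, rule, dictionary, sensitivity):
    """Average loss over the records, with loss = sensitivity / number of matches."""
    total = 0.0
    for x in records:
        y = rule(x)
        matches = sum(consistent(y, e) for e in dictionary)
        if matches:  # no matching entry means no identification, hence zero loss
            total += sensitivity(y) / matches
    return total / len(records)


def reveal_all(x):
    return x


def hide_phone(x):
    return (x[0], x[1], None)


def constant_sensitivity(y):
    return 1.0  # assumption: every record is equally sensitive


records = [
    ("Ada", "Smith", "555-0101"),
    ("Ada", "Smith", "555-0102"),
    ("Bob", "Jones", "555-0103"),
]

# The attacker's dictionary is taken to be the database itself.
r_all = risk(records, reveal_all, records, constant_sensitivity)
r_hidden = risk(records, hide_phone, records, constant_sensitivity)
```

Revealing everything makes each record uniquely linkable, so `r_all` is 1. Hiding the phone number leaves two candidates for each "Ada Smith" record, so `r_hidden` is (1/2 + 1/2 + 1) / 3 = 2/3.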
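
Disclosure Rule vs. Privacy Risk
Suppose that $\theta_{\text{true}}$ is the attacker's true dictionary, which is publicly available, and that $\theta^*$ is the actual database from which the data will be published
Under the following assumptions:
– $\theta_{\text{true}}$ contains more records than $\theta^*$ ($\theta^* \le \theta_{\text{true}}$)
– The non-$\bot$ (i.e., non-missing) values in $\theta_{\text{true}}$ will be more limited than the non-$\bot$ values in $\theta^*$
Theorem: If $\theta^*$ contains records that correspond to $x_1, \ldots, x_n$ and $\theta^* \le \theta_{\text{true}}$, then:
$$R(\delta, \theta_{\text{true}}) \le R(\delta, \theta^*)$$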
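
A small numeric illustration of the bound, not a proof. It assumes a loss of the form sensitivity divided by the number of matching dictionary entries, a constant sensitivity of 1, and it exercises only the first assumption (the true dictionary contains extra records). All data and names are invented.

```python
def n_matches(y, dictionary):
    """Number of dictionary entries consistent with the disclosed record y."""
    return sum(all(a is None or a == b for a, b in zip(y, e)) for e in dictionary)


def risk(disclosed, dictionary, phi=1.0):
    """Average of phi / (number of matches); zero loss when nothing matches."""
    losses = []
    for y in disclosed:
        k = n_matches(y, dictionary)
        losses.append(phi / k if k else 0.0)
    return sum(losses) / len(disclosed)


# theta_star: the database to be published, used as a stand-in dictionary.
theta_star = [("Ada", "Smith", "Rome"), ("Bob", "Jones", "Pisa")]
# theta_true: the attacker's dictionary, a superset with additional people.
theta_true = theta_star + [("Ada", "Smith", "Bari"), ("Eve", "Brown", "Rome")]

# Disclosed records: the city attribute is suppressed.
disclosed = [(name, surname, None) for name, surname, _ in theta_star]

r_star = risk(disclosed, theta_star)
r_true = risk(disclosed, theta_true)
```

In this example `r_star` is 1.0 and `r_true` is 0.75: the extra "Ada Smith" in the larger dictionary adds ambiguity, so the risk computed on the database overestimates the risk against the true dictionary.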

Disclosure Rule vs. Privacy Risk (cont.)
The theorem proves that the true risk is bounded from above by $R(\delta, \theta^*)$
Under the hypothesis that the distribution underlying $\theta$ factorizes into a product form:
Theorem: The rule that minimizes the risk, $\delta^* = \arg\min_{\delta} R(\delta, \theta)$, can be found in $O(nNm)$ computation
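
For intuition only, the minimization can be written as an exhaustive search. This is a brute-force sketch and not the $O(nNm)$ algorithm of the theorem: its cost grows exponentially with the number of attributes. It assumes a constant sensitivity, a rule that suppresses the same attribute positions in every record, and a hypothetical `budget` on how many attributes may be suppressed (without some such constraint, suppressing everything is trivially optimal).

```python
from itertools import combinations


def n_matches(y, dictionary):
    """Number of dictionary entries consistent with the disclosed record y."""
    return sum(all(a is None or a == b for a, b in zip(y, e)) for e in dictionary)


def apply_mask(x, mask):
    """Suppress (set to None) the attribute positions listed in mask."""
    return tuple(None if j in mask else v for j, v in enumerate(x))


def risk(records, mask, dictionary, phi=1.0):
    """Average of phi / (number of matches) over the masked records."""
    total = 0.0
    for x in records:
        k = n_matches(apply_mask(x, mask), dictionary)
        if k:
            total += phi / k
    return total / len(records)


def best_mask(records, dictionary, budget):
    """Try every mask with at most `budget` suppressed attributes; keep the lowest risk."""
    m = len(records[0])
    masks = [c for r in range(budget + 1) for c in combinations(range(m), r)]
    return min(masks, key=lambda mask: risk(records, mask, dictionary))


records = [
    ("Ada", "Smith", "555-0101"),
    ("Ada", "Smith", "555-0102"),
    ("Bob", "Jones", "555-0103"),
]
best = best_mask(records, records, budget=1)
```

On this invented data the search selects the phone number (position 2): suppressing the name or the surname alone leaves every record uniquely linkable through its phone number, while suppressing the phone number lowers the risk from 1 to 2/3.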

k-Anonymity
k-anonymity is simply a special case of our framework in which:
– $\theta_{\text{true}} = T$
– the sensitivity $\Phi$ is a constant
– the disclosure rule $\delta$ is underspecified
Our framework highlights some questionable hypotheses of k-anonymity!
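
A short worked bound makes the special case tangible. It rests on assumptions not spelled out in the slides: the loss has the form $\Phi(y) / |\rho(y, \theta)|$, where $\rho(y, \theta)$ is the set of dictionary entries consistent with $y$; the sensitivity is a constant $c$; $T$ is read as the table being released; and k-anonymity is read as every disclosed record being consistent with at least $k$ records of $T$.

```latex
R(\delta, T)
  = \frac{1}{n} \sum_{i=1}^{n} \frac{c}{|\rho(\delta(x_i), T)|}
  \le \frac{1}{n} \sum_{i=1}^{n} \frac{c}{k}
  = \frac{c}{k}
```

Under these assumptions, requiring k-anonymity amounts to capping the risk at $c/k$, without saying how the rule should be chosen or how sensitive the individual records are.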

Conclusions
New framework for privacy risk that takes sensitivity into account
Risk estimation as an upper bound for the true privacy risk
Efficient algorithm for risk computation
Generalization of k-anonymity