1
Preserving Privacy in Published Data
Dimitris Sacharidis
2
outline
introduction
  motivation
  definitions
k-anonymity
  incognito
  mondrian
  topdown
l-diversity
  anatomy
other issues
3
motivation
several agencies, institutions, bureaus, and organizations make (sensitive) data involving people publicly available
  termed microdata (vs. aggregated macrodata), used for analysis
  often required and imposed by law
to protect privacy, microdata are sanitized
  explicit identifiers (SSN, name, phone #) are removed
is this sufficient for preserving privacy?
  no! susceptible to link attacks
  publicly available databases (voter lists, city directories) can reveal the "hidden" identity
4
link attack example
Sweeney [S01a] managed to re-identify the medical record of the governor of Massachusetts
  MA collects and publishes sanitized medical data for state employees (microdata): left circle
  voter registration list of MA (publicly available data): right circle
looking for the governor's record, join the tables:
  6 people had his birth date
  3 were men
  1 in his zipcode
according to the 1990 US census data, 87% of the population are unique based on (zipcode, gender, dob)
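The join behind a link attack can be sketched in a few lines of Python. All records, names, and field names below are invented for illustration; they are not the Massachusetts data.

```python
# Hypothetical sanitized medical records: explicit identifiers are removed,
# but the quasi-identifiers (zipcode, sex, dob) are still published.
medical = [
    {"zipcode": "02138", "sex": "male", "dob": "1950-05-05", "disease": "flu"},
    {"zipcode": "02139", "sex": "female", "dob": "1962-03-12", "disease": "asthma"},
]

# Hypothetical public voter list: names plus the same quasi-identifiers.
voters = [
    {"name": "A. Smith", "zipcode": "02138", "sex": "male", "dob": "1950-05-05"},
    {"name": "B. Jones", "zipcode": "02139", "sex": "female", "dob": "1962-03-12"},
    {"name": "C. Brown", "zipcode": "02139", "sex": "male", "dob": "1970-01-01"},
]

QI = ("zipcode", "sex", "dob")


def link(medical, voters, qi=QI):
    """Join the two tables on the quasi-identifiers and return the
    (name, disease) pairs where exactly one voter matches a record."""
    index = {}
    for v in voters:
        index.setdefault(tuple(v[a] for a in qi), []).append(v["name"])
    reidentified = []
    for m in medical:
        names = index.get(tuple(m[a] for a in qi), [])
        if len(names) == 1:  # a unique match reveals the "hidden" identity
            reidentified.append((names[0], m["disease"]))
    return reidentified


print(link(medical, voters))
```

Both hypothetical patients are re-identified because their (zipcode, sex, dob) combination is unique in the voter list.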
5
privacy in microdata
the role of attributes in microdata
  explicit identifiers are removed
  quasi identifiers can be used to re-identify individuals
  sensitive attributes (may not exist!) carry sensitive information

identifier  quasi identifiers            sensitive
Name        Birthdate  Sex     Zipcode   Disease
Andre       21/1/79    male    53715     Flu
Beth        10/1/81    female  55410     Hepatitis
Carol       1/10/44            90210     Bronchitis
Dan         21/2/84            02174     Sprained Ankle
Ellen       19/4/72            02237     AIDS

goal of privacy preservation (rough definition)
  de-associate individuals from sensitive information
7
k-anonymity
preserve privacy via k-anonymity, proposed by Sweeney and Samarati [S01, S02a, S02b]
k-anonymity: intuitively, hide each individual among k-1 others
  each QI set of values should appear at least k times in the released microdata
  linking cannot be performed with confidence > 1/k
  sensitive attributes are not considered (going to revisit this...)
how to achieve this?
  generalization and suppression
  value perturbation is not considered (we should remain truthful to original values)
privacy vs utility tradeoff
  do not anonymize more than necessary
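The counting condition above (each QI combination appears at least k times) can be checked directly. This is a minimal sketch; the function name `is_k_anonymous` and the sample rows are assumptions made for illustration.

```python
from collections import Counter


def is_k_anonymous(rows, qi, k):
    """True if every combination of QI values appears at least k times."""
    counts = Counter(tuple(r[a] for a in qi) for r in rows)
    return all(c >= k for c in counts.values())


# Hypothetical released microdata with already generalized QI values.
rows = [
    {"sex": "male", "zipcode": "022**", "disease": "flu"},
    {"sex": "male", "zipcode": "022**", "disease": "hepatitis"},
    {"sex": "person", "zipcode": "5****", "disease": "flu"},
    {"sex": "person", "zipcode": "5****", "disease": "bronchitis"},
]

print(is_k_anonymous(rows, ("sex", "zipcode"), 2))  # True
print(is_k_anonymous(rows, ("sex", "zipcode"), 3))  # False
```

Note that the sensitive attribute (disease) plays no role in the check, which is exactly the limitation revisited later.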
8
k-anonymity example
tools for anonymization
  generalization: publish more general values, i.e., given a domain hierarchy, roll-up
  suppression: remove tuples, i.e., do not publish outliers
    often the number of suppressed tuples is bounded

original microdata
Birthdate  Sex     Zipcode
21/1/79    male    53715
10/1/79    female  55410
1/10/44    female  90210
21/2/83    male    02274
19/4/82    male    02237

2-anonymous data
            Birthdate  Sex     Zipcode
group 1     */1/79     person  5****
suppressed  1/10/44    female  90210
group 2     */*/8*     male    022**
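A rough sketch of the two tools, under simplifying assumptions: only sex and zipcode are used, the zipcode hierarchy masks trailing digits, the sex hierarchy has a single roll-up to "person", and every tuple is generalized to the same level (the slide's example instead picks a different generalization per group). Function names are hypothetical.

```python
from collections import Counter


def generalize_zip(zipcode, level):
    """Roll up a zipcode by masking its last `level` digits with '*'."""
    keep = len(zipcode) - level
    return zipcode[:keep] + "*" * level


def generalize_sex(sex, level):
    """Assumed two-level hierarchy: male/female -> person."""
    return sex if level == 0 else "person"


def anonymize(rows, k, sex_level, zip_level):
    """Generalize every tuple to the given levels, then suppress the
    tuples whose generalized QI values occur fewer than k times."""
    gen = [(generalize_sex(s, sex_level), generalize_zip(z, zip_level))
           for s, z in rows]
    counts = Counter(gen)
    published = [g for g in gen if counts[g] >= k]
    return published, len(gen) - len(published)


# (sex, zipcode) pairs, illustrative values.
rows = [("male", "53715"), ("female", "55410"), ("female", "90210"),
        ("male", "02274"), ("male", "02237")]

published, suppressed = anonymize(rows, k=2, sex_level=1, zip_level=4)
print(published)   # two tuples in 5****, two in 0****
print(suppressed)  # 1 (the 90210 outlier)
```

Bounding the number of suppressed tuples would amount to rejecting a generalization whose `suppressed` count exceeds a chosen threshold.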
9
generalization lattice
assume domain hierarchies exist for all QI attributes (sex, zipcode, birthdate)
construct the generalization lattice for the entire QI set
objective: find the minimum generalization that satisfies k-anonymity
  i.e., maximize utility by finding the minimum distance vector with k-anonymity
[figure: generalization lattice, ordered from less generalization to more generalization]
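One way to picture the search for a minimum generalization is a brute-force scan of the lattice by increasing height. The sketch below assumes two QI attributes, a sex hierarchy with levels S0 and S1, a zipcode hierarchy with levels Z0 to Z2 (masking trailing digits), and takes the sum of levels as the distance; all of these are illustrative choices.

```python
from collections import Counter
from itertools import product

SEX_LEVELS = 2  # S0 (exact), S1 (person): assumed hierarchy
ZIP_LEVELS = 3  # Z0 (exact), Z1, Z2: assumed hierarchy


def generalize(row, node):
    """Apply the generalization vector node = (sex level, zip level)."""
    sex, zipcode = row
    s, z = node
    return ("person" if s else sex, zipcode[:len(zipcode) - z] + "*" * z)


def is_k_anonymous(rows, node, k):
    counts = Counter(generalize(r, node) for r in rows)
    return all(c >= k for c in counts.values())


def minimal_nodes(rows, k):
    """Scan lattice nodes by increasing height (sum of levels) and return
    the k-anonymous nodes of minimum height."""
    nodes = sorted(product(range(SEX_LEVELS), range(ZIP_LEVELS)), key=sum)
    best, result = None, []
    for node in nodes:
        if best is not None and sum(node) > best:
            break  # every remaining node is more general than needed
        if is_k_anonymous(rows, node, k):
            best = sum(node)
            result.append(node)
    return result


rows = [("male", "53715"), ("male", "53710"),
        ("female", "53712"), ("female", "53706")]
print(minimal_nodes(rows, k=2))  # [(0, 2)], i.e. <S0, Z2>
```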
10
incognito [LDR05]
[figure: generalization lattice]
exploit monotonicity properties regarding the frequency of tuples in the lattice
  reminiscent of OLAP hierarchies and frequent itemset mining
(I) generalization property (~rollup)
  if k-anonymity holds at some node, then it also holds for any ancestor node
  e.g., <S1, Z0> is k-anonymous and, thus, so are <S1, Z1> and <S1, Z2>
(II) subset property (~apriori)
  if k-anonymity doesn't hold for a set of QI attributes, then it doesn't hold for any of its supersets
  e.g., <S0, Z0> is not k-anonymous and, thus, <S0, Z0, B0> and <S0, Z0, B1> cannot be k-anonymous
note: the entire lattice, which includes three dimensions <S,Z,B>, is too complex to show
incognito [LDR05] considers sets of QI attributes of increasing cardinality (~apriori) and prunes nodes in the lattice using the two properties above
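The saving from property (I) can be illustrated with a bottom-up traversal that skips the table scan whenever a direct specialization of a node is already known to be k-anonymous. This is only a sketch of that one pruning rule, not of Incognito itself (the iteration over attribute subsets from property (II) is left out); the hierarchies and names are assumptions.

```python
from collections import Counter
from itertools import product


def generalize(row, node):
    sex, zipcode = row
    s, z = node
    return ("person" if s else sex, zipcode[:len(zipcode) - z] + "*" * z)


def check(rows, node, k):
    """One scan of the table: is it k-anonymous at this node?"""
    counts = Counter(generalize(r, node) for r in rows)
    return all(c >= k for c in counts.values())


def search(rows, k, max_levels=(1, 2)):
    """Return (set of k-anonymous nodes, number of table scans)."""
    nodes = sorted(product(*(range(m + 1) for m in max_levels)), key=sum)
    anonymous, scans = set(), 0
    for node in nodes:
        # Direct specializations: one attribute, one level less general.
        specializations = [node[:i] + (node[i] - 1,) + node[i + 1:]
                           for i in range(len(node)) if node[i] > 0]
        if any(s in anonymous for s in specializations):
            anonymous.add(node)  # implied by the generalization property
            continue
        scans += 1
        if check(rows, node, k):
            anonymous.add(node)
    return anonymous, scans


rows = [("male", "53715"), ("male", "53710"),
        ("female", "53712"), ("female", "53706")]
anonymous, scans = search(rows, k=2)
print(sorted(anonymous), scans)  # 6 lattice nodes, but only 5 scans
```

Here <S1, Z2> is accepted without a scan because <S0, Z2> is already k-anonymous.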
11
seen in the domain space
consider the multi-dimensional domain space
  QI attributes are the dimensions
  tuples are points in this space
  attribute hierarchies partition dimensions
[figure: domain space partitioned by the zipcode hierarchy and the sex hierarchy]
12
seen in the domain space
incognito example
  2 QI attributes, 7 tuples, hierarchies shown with bold lines
[figure: domain-space panels labeled "not 2-anonymous", "zipcode rollup", "sex rollup", and "2-anonymous"; axes: sex, zipcode]
13
seen in the domain space
taxonomy [LDR05, LDR06]
generalization taxonomy according to the groupings allowed, in order of generalization strength
  single dimensional global recoding: incognito [LDR05]
  multi dimensional global recoding: mondrian [LDR06]
  multi dimensional local recoding: topdown [XWP+06]
14
mondrian [LDR06]
define utility measure: discernibility metric (DM)
  penalizes each tuple with the size of the group it belongs to
  intuitively, the ideal grouping is the one in which all groups have size k
mondrian tries to construct groups of roughly equal size k
what else (besides Mondrian) does this painting remind you of?
  it's reminiscent of the kd-tree: cycle among dimensions, median splits
[figure: 2-anonymous partitioning of the domain space]
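A kd-tree style sketch of the idea, assuming numeric QI values and a strictly cyclic choice of dimension; it stops as soon as a median split on the current dimension would leave a half with fewer than k tuples, which is a simplification. The DM function follows the description above: each tuple is charged the size of its group.

```python
def mondrian(points, k, depth=0):
    """Recursively split at the median, cycling through the dimensions,
    as long as both halves keep at least k points."""
    dim = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2
    left, right = ordered[:mid], ordered[mid:]
    if len(left) < k or len(right) < k:
        return [points]  # no allowable split: this is a final group
    return mondrian(left, k, depth + 1) + mondrian(right, k, depth + 1)


def discernibility(groups):
    """DM: a group of size s contributes s * s (each tuple is charged s)."""
    return sum(len(g) ** 2 for g in groups)


# Hypothetical 2-dimensional QI points.
points = [(1, 1), (2, 5), (3, 2), (4, 6), (6, 3), (7, 7), (8, 4), (9, 8)]
groups = mondrian(points, k=2)
print(groups)
print(discernibility(groups))  # 16: four groups of size 2
```

For comparison, publishing all eight points as a single group would give a DM of 64.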
15
measuring group quality
DM depends only on the cardinality of the group
  no measure of how tight the group is
a good group is one that contains tuples with similar QI values
define a new metric [XWP+06]: normalized certainty penalty (NCP)
  measures the perimeter of the group
bad generalization: long boxes
good generalization: square-like boxes
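A perimeter-style penalty can be sketched as the extent of the group's bounding box in each dimension, normalized by the domain extent and summed. This assumes numeric attributes with equal weights, so it is a simplified reading of NCP rather than the full definition in [XWP+06].

```python
def ncp(group, domain):
    """Sum over dimensions of (group extent) / (domain extent).
    `domain` lists the (min, max) of each attribute over the whole table."""
    total = 0.0
    for d, (lo, hi) in enumerate(domain):
        values = [p[d] for p in group]
        total += (max(values) - min(values)) / (hi - lo)
    return total


domain = [(0, 10), (0, 10)]
long_box = [(0, 4), (8, 5)]    # 8 x 1 bounding box
square_box = [(0, 0), (3, 3)]  # 3 x 3 bounding box

print(ncp(long_box, domain))
print(ncp(square_box, domain))
```

The square-like box gets the smaller penalty even though it covers a slightly larger area, which is the behavior DM cannot capture.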
16
topdown [XWP+06]
start with the entire data set
iteratively split in two
  reminiscent of the R-tree quadratic split
continue until left with groups which contain <2k-1 tuples
split algorithm
  find seeds, 2 points that are furthest away
    heuristic, not a complete quadratic search
  the seeds will become the 2 split groups
  examine points randomly (unlike quadratic split)
  assign each point to the group whose NCP will increase the least
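A single split step might look as follows. The seed heuristic used here (start from a random point and repeatedly jump to the farthest point, for a fixed number of rounds) is an assumption about what "heuristic, not a complete quadratic search" means; the simplified `ncp`, the number of rounds, and the sample points are likewise illustrative. The sketch assumes at least two distinct points.

```python
import random


def ncp(group, domain):
    """Simplified NCP: normalized bounding-box extent summed over dimensions."""
    return sum((max(p[d] for p in group) - min(p[d] for p in group)) / (hi - lo)
               for d, (lo, hi) in enumerate(domain))


def dist2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def split(points, domain, seed=0, rounds=3):
    """Split `points` into two groups around two far-apart seeds."""
    rng = random.Random(seed)
    idx = list(range(len(points)))

    def farthest(i):
        return max(idx, key=lambda j: dist2(points[i], points[j]))

    # Seed heuristic (assumed): random start, then jump to the farthest point.
    u = rng.choice(idx)
    v = farthest(u)
    for _ in range(rounds):
        u, v = v, farthest(v)

    groups = ([points[u]], [points[v]])
    rest = [i for i in idx if i not in (u, v)]
    rng.shuffle(rest)  # examine the remaining points in random order
    for i in rest:
        # Assign the point to the group whose NCP increases the least.
        costs = [ncp(g + [points[i]], domain) - ncp(g, domain) for g in groups]
        groups[costs.index(min(costs))].append(points[i])
    return groups


domain = [(0, 10), (0, 10)]
points = [(0, 0), (1, 1), (0, 1), (9, 9), (10, 10), (9, 10)]
left, right = split(points, domain)
print(sorted(left), sorted(right))
```

The full algorithm would apply `split` recursively and then deal with any group that ends up with fewer than k tuples; that part is not shown.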
17
boosting privacy with external data
external databases (e.g., voter list) are used by attackers
can we use them to our benefit?
  try to improve the utility of anonymized data
  join k-anonymity (JKA) [SMP]
[figure: microdata made 3-anonymous with k-anonymity, compared with microdata joined with public data and made 3-anonymous with JKA]
19
k-anonymity problems
k-anonymity example
  homogeneity attack: in the last group everyone has cancer
  background knowledge: in the first group, Japanese have low chance of heart disease
we need to consider the sensitive values

microdata
id  Zipcode  Age  National.  Disease
1   13053    28   Russian    Heart Disease
2   13068    29   American
3            21   Japanese   Viral Infection
4            23
5   14853    50   Indian     Cancer
6            55
7   14850    47
8            49
9            31
10           37
11           36
12           35

4-anonymous data
id  Zipcode  Age  National.  Disease
1   130**    <30  ∗          Heart Disease
2
3                            Viral Infection
4
5   1485*    ≥40             Cancer
6
7
8
9            3∗
10
11
12
20
l-diversity [MGK+06]
make sure each group contains well represented sensitive values
  protect from homogeneity attacks
  protect from background knowledge
l-diversity (simplified definition)
  a group is l-diverse if the most frequent sensitive value accounts for at most a 1/l fraction of the tuples in the group
the l-diversity definition has monotonicity properties similar to k-anonymity
  simply adapt existing algorithms to count frequencies of sensitive values
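The simplified definition translates into a frequency count per group. A minimal sketch, with a hypothetical function name and sample groups:

```python
from collections import Counter


def is_l_diverse(sensitive_values, l):
    """Simplified l-diversity: the most frequent sensitive value makes up
    at most a 1/l fraction of the (non-empty) group."""
    counts = Counter(sensitive_values)
    return max(counts.values()) * l <= len(sensitive_values)


homogeneous = ["cancer", "cancer", "cancer", "cancer"]
mixed = ["heart disease", "heart disease", "viral infection", "viral infection"]

print(is_l_diverse(homogeneous, 2))  # False: open to the homogeneity attack
print(is_l_diverse(mixed, 2))        # True
print(is_l_diverse(mixed, 3))        # False
```

An existing k-anonymity algorithm could call such a check on every group in addition to the group-size test.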
21
anatomy [XT06]
fast l-diversity algorithm
anatomy is not generalization
  separates sensitive values from tuples
  shuffles sensitive values among groups
algorithm
  assign sensitive values to buckets
  create groups by drawing from the l largest buckets

id  Age  Sex  Zipcode  Group ID
1   23   M    11000
2   27        13000
3   35        59000
4   59        12000
5   61   F    54000
6   65        25000
7
8   70        30000

Group-ID  Disease     Count
1         dyspepsia   2
          pneumonia
          bronchitis
          flu
          gastritis
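The two algorithm steps can be sketched as follows. The input rows are hypothetical, the output is split into a QI table and a sensitive table with one row per drawn tuple (count 1), and tuples left over when fewer than l buckets remain non-empty are simply returned rather than assigned to a group, which is a simplification made here.

```python
from collections import defaultdict


def anatomize(rows, l):
    """rows: list of (id, qi_values, sensitive_value).
    Returns (qi_table, sensitive_table, leftover)."""
    # Step 1: assign tuples to buckets by sensitive value.
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[2]].append(row)

    qi_table, sensitive_table, gid = [], [], 0
    # Step 2: while l non-empty buckets exist, draw one tuple from each
    # of the l largest buckets to form a new group.
    while sum(1 for b in buckets.values() if b) >= l:
        gid += 1
        largest = sorted((v for v in buckets if buckets[v]),
                         key=lambda v: -len(buckets[v]))[:l]
        for value in largest:
            rid, qi, _ = buckets[value].pop()
            qi_table.append((rid, qi, gid))          # QI values + group id
            sensitive_table.append((gid, value, 1))  # group id + value + count
    leftover = [row for b in buckets.values() for row in b]
    return qi_table, sensitive_table, leftover


# Hypothetical microdata: (id, (age, sex, zipcode), disease).
rows = [
    (1, (23, "M", "11000"), "flu"),
    (2, (27, "M", "13000"), "flu"),
    (3, (35, "M", "59000"), "flu"),
    (4, (59, "M", "12000"), "pneumonia"),
    (5, (61, "F", "54000"), "pneumonia"),
    (6, (65, "F", "25000"), "dyspepsia"),
    (7, (66, "F", "26000"), "dyspepsia"),
    (8, (70, "F", "30000"), "bronchitis"),
]
qi_table, sensitive_table, leftover = anatomize(rows, l=2)
print(qi_table)
print(sensitive_table)
print(leftover)
```

Because each group draws from l different buckets, every group holds l distinct sensitive values, and the exact QI values are published without generalization.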
23
other issues
how to preserve privacy in dynamic data [XT07]
protection against minimality attack [WFWP07]
anonymity in location-based services
  spatial cloaking
24
references
[LDR05] LeFevre, K., DeWitt, D.J., Ramakrishnan, R. Incognito: Efficient Full-domain k-Anonymity. SIGMOD,
[LDR06] LeFevre, K., DeWitt, D.J., Ramakrishnan, R. Mondrian Multidimensional k-Anonymity. ICDE,
[S01] Samarati, P. Protecting Respondents' Identities in Microdata Release. IEEE TKDE, 13(6): ,
[S02a] Sweeney, L. k-Anonymity: A Model for Protecting Privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems,
[S02b] Sweeney, L. Achieving k-Anonymity Privacy Protection using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems,
[SMP] Sacharidis, D., Mouratidis, K., Papadias, D. k-Anonymity in the Presence of External Databases (submitted)
[WFWP07] Wong, R. C., Fu, A. W., Wang, K., Pei, J. Minimality Attack in Privacy Preserving Data Publishing. VLDB,
[XT06] Xiao, X., Tao, Y. Anatomy: Simple and Effective Privacy Preservation. VLDB,
[XT07] Xiao, X., Tao, Y. m-Invariance: Towards Privacy Preserving Re-publication of Dynamic Datasets. SIGMOD,
[XWP+06] Xu, J., Wang, W., Pei, J., Wang, X., Shi, B., Fu, A. Utility-Based Anonymization Using Local Recoding. SIGKDD, 2006.