Versatile Publishing For Privacy Preservation Xin Jin, Mingyang Zhang, Nan Zhang George Washington University Gautam Das University of Texas at Arlington
Outline Introduction Inference For Multiple Privacy Rules Guardian Normal Form GD and UAD Algorithms Experimental Results Conclusion
Privacy Preserving Data Publishing QI → SA, i.e., an adversary knowing the QI cannot infer the SA of a tuple (beyond a privacy guarantee). A privacy guarantee example: l-diversity. Quasi-identifier (QI) Sensitive Attribute (SA) Age Gender Disease Allen [30-80] * HIV Bob diabetes Calvin [35-55] F David flu Eve [20-40] M drug Grace Give an example of 2-diversity in this particular example. Generally, protect privacy for any individual. 2-diversity Published Table
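The 2-diversity guarantee on this slide can be checked mechanically. The following is a minimal sketch, assuming the simplest "distinct" variant of l-diversity (every QI group must contain at least l different SA values). The function name and the row layout are illustrative, and the last row is a hypothetical value, since the slide's table does not show every cell.

```python
from collections import defaultdict

def is_distinct_l_diverse(rows, qi_cols, sa_col, l):
    """Return True if every QI group holds at least l distinct SA values."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[c] for c in qi_cols)  # tuples sharing QI values form one group
        groups[key].add(row[sa_col])
    return all(len(values) >= l for values in groups.values())

published = [
    {"age": "[30-80]", "gender": "*", "disease": "HIV"},
    {"age": "[30-80]", "gender": "*", "disease": "diabetes"},
    {"age": "[20-40]", "gender": "M", "disease": "drug"},
    {"age": "[20-40]", "gender": "M", "disease": "flu"},  # hypothetical value
]

print(is_distinct_l_diverse(published, ["age", "gender"], "disease", 2))
```

Other variants of l-diversity (entropy, recursive) would need a different test on each group; only the grouping step would stay the same.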
A Sneak Peek at Real Application The Texas Department of State Health Services publishes every year a table of all patients discharged from more than 450 state-licensed hospitals. www. shtm Defines 9 privacy requirements. Example: If a hospital has fewer than five discharges of a particular gender, then suppress the zipcode of its patients of that gender. Race is changed to ‘Other’ and ethnicity is suppressed if a hospital has fewer than ten discharges of a race. The entire zipcode and gender code are suppressed if the ICD code indicates alcohol or drug use or an HIV diagnosis. … Introduce utility first. All the published table wants to optimize the utility. Focus on Eve’s group.
Texas Inpatient Discharge Data Example: If a hospital has fewer than five discharges of a particular gender, then suppress the zipcode of its patients of that gender. As a privacy rule: hospital, gender → zipcode
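The suppression requirement quoted above could be applied roughly as follows. This is a sketch under assumptions: the record fields (`hospital`, `gender`, `zipcode`), the use of `None` as the suppressed value, and the function name are all hypothetical and are not taken from the Texas data file format.

```python
from collections import Counter

def suppress_zipcode(records, threshold=5):
    """Suppress zipcode when a (hospital, gender) pair has fewer than `threshold` discharges."""
    counts = Counter((r["hospital"], r["gender"]) for r in records)
    result = []
    for r in records:
        copy = dict(r)  # leave the input records untouched
        if counts[(r["hospital"], r["gender"])] < threshold:
            copy["zipcode"] = None  # assumed marker for a suppressed value
        result.append(copy)
    return result

# Hypothetical discharges: five female and two male patients at hospital A.
records = (
    [{"hospital": "A", "gender": "F", "zipcode": "75001"} for _ in range(5)]
    + [{"hospital": "A", "gender": "M", "zipcode": "75002"} for _ in range(2)]
)
cleaned = suppress_zipcode(records)
```

Here only the two male records lose their zipcode, because that group falls below the threshold of five.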
Multiple SA Publishing [MKGV06] defines privacy for multiple SA attributes: each Si is treated as the sole SA attribute, and {Q1, Q2, …, Qm, S1, …, Si-1, Si+1, …, Sn} is treated as the QI. Lack of flexibility: this provides a stronger privacy definition than necessary. Example with SA: race and state. age, ICD, state, gender → race; age, ICD, hospital, race → state
A Novel Problem: Versatile Publishing Allows the privacy requirement for publishing a table to be defined as an arbitrary set of privacy rules. Each rule: {Q1, Q2, …, Qp} → {S1, S2, …, Sr}, with the LHS attributes on the left and the RHS attributes on the right. A rule assures that an adversary learning the LHS attributes cannot learn the RHS attributes beyond a pre-defined privacy guarantee such as l-diversity or t-closeness.
A Running Example hospital age gender ICD state race A 37 F HIV TX asian 71 M diabetes MN white B 55 CA black flu VA C 23 drug Rule #1: age, ICD → race Rule #2: gender, ICD → state Rule #3: hospital, race → state Privacy guarantee: 2-diversity
Simple Solution #1: Straight Decomposition age, ICD → race gender, ICD → state hospital, race → state age ICD 37 HIV 23 drug flu 55 diabetes 71 race asian white black gender ICD F HIV M drug flu diabetes state TX MN VA CA hospital race A asian B black white C state TX CA MN VA join Asian is linked with TX or MN. Asian is linked with TX or CA. Intersection Attack [GKS08]: asian → TX, violating hospital, race → state
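The intersection attack on this slide amounts to intersecting the candidate sets that each published table leaves open. A minimal sketch, using the two candidate sets stated on the slide; the function and variable names are illustrative:

```python
def intersect_candidates(*views):
    """Intersect the candidate SA values that each published view allows."""
    result = None
    for view in views:
        result = set(view) if result is None else result & set(view)
    return result if result is not None else set()

# Candidate states for an asian patient, as stated on the slide.
from_first_table = {"TX", "MN"}
from_second_table = {"TX", "CA"}

remaining = intersect_candidates(from_first_table, from_second_table)
violates_2_diversity = len(remaining) < 2
```

Each table alone leaves two candidate states, which satisfies 2-diversity, but the adversary who combines them is left with a single candidate.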
Multiple SA Publishing Method Defines as SA all attributes that appear on the RHS of at least one privacy rule, and as QI the set of all other attributes. Rule #1: age, ICD → race Rule #2: gender, ICD → state Rule #3: hospital, race → state 2 SA: race, state Rule #4: hospital, age → ICD Rule #5: gender, race → hospital 4 SA: ICD, state, race, hospital Curse of dimensionality
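The SA/QI split described on this slide can be computed directly from the rule set. A short sketch, assuming each rule is stored as a pair of attribute sets (LHS, RHS); the representation and names are illustrative:

```python
def split_sa_qi(attributes, rules):
    """SA = every attribute on some rule's RHS; QI = all remaining attributes."""
    sa = set()
    for _lhs, rhs in rules:
        sa |= set(rhs)
    qi = set(attributes) - sa
    return sa, qi

ATTRIBUTES = ["hospital", "age", "gender", "ICD", "state", "race"]

RULES_1_TO_3 = [
    ({"age", "ICD"}, {"race"}),
    ({"gender", "ICD"}, {"state"}),
    ({"hospital", "race"}, {"state"}),
]
RULES_1_TO_5 = RULES_1_TO_3 + [
    ({"hospital", "age"}, {"ICD"}),
    ({"gender", "race"}, {"hospital"}),
]

sa3, qi3 = split_sa_qi(ATTRIBUTES, RULES_1_TO_3)
sa5, qi5 = split_sa_qi(ATTRIBUTES, RULES_1_TO_5)
```

With rules #1 to #3 the SA set is {race, state}; adding rules #4 and #5 grows it to four attributes and leaves only age and gender as QI, which is the dimensionality problem the slide points at.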
Traditional Data Normalization Step 1: Obtain the irreducible functional dependencies (FDs). Step 2: Test whether any FD violates the normal form over the large table. Step 3: If there is a violation, decompose the table to remove it.
Inference For Multiple Rules Inference on multiple privacy rules. Example: AB → C implies that A → C and B → C. Completeness of Inference Rules
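The example inference (AB → C implies A → C and B → C) can be enumerated programmatically. The sketch below assumes the example generalizes to every non-empty subset of the LHS; it illustrates only this one inference and is not the complete axiom set mentioned on the slide. The function name is illustrative.

```python
from itertools import combinations

def lhs_subset_rules(lhs, rhs):
    """List the rules obtained by shrinking the LHS to each non-empty subset."""
    ordered = sorted(lhs)
    derived = []
    for size in range(1, len(ordered) + 1):
        for subset in combinations(ordered, size):
            derived.append((frozenset(subset), frozenset(rhs)))
    return derived

implied = lhs_subset_rules({"A", "B"}, {"C"})
for lhs, rhs in implied:
    print(", ".join(sorted(lhs)), "->", ", ".join(sorted(rhs)))
```

The intuition is that an adversary who knows fewer attributes cannot learn more, so protecting against knowledge of AB also protects against knowledge of A alone or B alone.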
Guardian Normal Form (GNF) Non-triviality: a privacy rule satisfied by each of two anonymized tables might be broken by the combination of the two, due to an intersection attack. Guardian Normal Form (GNF): a normal form for the schema of published tables which guarantees that all privacy rules hold over the collection of published tables. GNF is defined at the schema level of the published tables rather than at the tuple level.
An Example [Figure: published sub-tables annotated with the rules they enforce: "ICD, gender → hospital", "hospital → state", "age, hospital, gender → race", and "no privacy rule enforced"; attribute nodes: age, race, gender, ICD.] Rule #1: age, ICD → race
An Example, Rule #1: age, ICD → race [Figure: the same published sub-tables as on the previous slide.] race is unreachable from age or ICD.
An Example, Rule #2: gender, ICD → state [Figure: the same published sub-tables as on the previous slide.] state is reachable from either gender or ICD.
Guardian Decomposition Algorithm Similar in spirit to the database normalization algorithm [EN03] (decomposition into BCNF). Find a privacy rule which violates GNF, decompose the existing sub-tables to address that rule, and continue until no offending privacy rule remains. Greedily add attributes as long as GNF still holds. End: no further decomposition; publish T11 and T12.
Utility Aware Decomposition Algorithm Leverages the link between utility optimization and the MIN-VERTEX-COLORING problem.
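For readers unfamiliar with MIN-VERTEX-COLORING, the sketch below shows a standard greedy coloring heuristic. It is not the UAD algorithm itself, whose details are not given on the slide. The conflict graph is hypothetical: it assumes, purely for illustration, that vertices are attributes, that an edge joins two attributes which should not share a sub-table, and that each color stands for one sub-table.

```python
def greedy_coloring(graph):
    """Give each vertex the smallest color not used by an already-colored neighbor."""
    colors = {}
    # Visiting high-degree vertices first is a common heuristic, not a guarantee of optimality.
    for vertex in sorted(graph, key=lambda v: len(graph[v]), reverse=True):
        used = {colors[n] for n in graph[vertex] if n in colors}
        color = 0
        while color in used:
            color += 1
        colors[vertex] = color
    return colors

# Hypothetical conflict graph over the running example's attributes.
conflicts = {
    "age": {"race"},
    "ICD": {"race", "state"},
    "gender": {"state"},
    "race": {"age", "ICD"},
    "state": {"ICD", "gender"},
    "hospital": set(),
}
coloring = greedy_coloring(conflicts)
```

Fewer colors means fewer sub-tables, which is presumably where the connection to utility lies; greedy coloring only approximates the minimum in general.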
Experimental Results
Conclusion Defined the novel problem of versatile publishing, which captures the real-world requirement of multiple privacy rules. Derived a sound and complete set of inference axioms for privacy rules. Defined the guardian normal form (GNF). Developed two decomposition algorithms, GD and UAD, and conducted comprehensive experiments.
References [1] Texas Department of State Health Services. User manual of Texas hospital inpatient discharge public use data file, 2008. [2] [MKGV06] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. In ICDE, 2006. [3] [GKS08] S. R. Ganta, S. P. Kasiviswanathan, and A. Smith. Composition attacks and auxiliary information in data privacy. In KDD, 2008. [4] [EN03] R. Elmasri and S. B. Navathe. Fundamentals of Database Systems (4th Edition). Addison Wesley, 2003.
Thank You