1
Microdata masking as permutation
Krish Muralidhar, Price College of Business, University of Oklahoma
Josep Domingo-Ferrer, UNESCO Chair in Data Privacy, Dept. of Computer Engineering and Mathematics, Universitat Rovira i Virgili
2
Diversity of microdata masking mechanisms
A wide variety of microdata masking mechanisms are available:
- Rounding
- Microaggregation
- Noise infusion
  - Additive
  - Multiplicative
  - Model based
- Data swapping
- Data shuffling
- And many others
3
Diversity is good, but … Diversity in mechanisms also means that comparing across mechanisms can be very difficult
4
Traditional approaches for comparison
- Based on parameters
- Based on performance
5
Comparison based on parameters
A syntactic approach. It has been criticized because it does not reflect the true security offered by the mechanism:
- Difficulty of comparing across different mechanisms: how do you compare microaggregation with aggregation parameter = 5, noise addition with 10% of the variance, and multiplicative perturbation with parameter = 10%?
- Difficulty of comparing across data sets, even for the same mechanism: two data sets with different characteristics may yield completely different levels of protection for the same parameter selection.
6
Impact of data characteristics
[Table: two data sets masked with the same masking values; columns for ID, data set 1, data set 2, masking value, masked data set 1, masked data set 2, and the ranks of the two masked data sets. The numeric entries are illegible in this copy, but the point of the example is that identical masking values produce different rank permutations on the two data sets.]
7
Comparison based on performance
Analyze the masked data for disclosure risk:
- Identity disclosure
- Value disclosure
Comparison is based on results. There are many different approaches for assessing identity and value disclosure. One alternative is to aggregate the different measures, but this raises further questions about how to aggregate.
8
Example of comparison based on performance
An empirical evaluation: Score = 0.5(IL) + 0.125(DLD) + 0.125(PLD) + 0.25(ID)
While this is a reasonable approach, it can be argued that the weights should be different:
- If we remove (or modify the weight of) a criterion, the results may be different
- If we use only disclosure risk measures, the results would be different
- What about alternative measures of information loss and disclosure risk?
These are typical problems with any empirically based evaluation.
Source: “A Quantitative Comparison of Disclosure Control Methods for Microdata” in Confidentiality, Disclosure, and Data Access
9
Desiderata
A common basis of comparison for microdata masking mechanisms that is:
- Applicable to all mechanisms
- Meaningful
- Independent of the parameters of the mechanism, the risk assessment measure, and the characteristics of the data
10
Our proposal: The permutation model
All microdata masking mechanisms can be viewed as permutations of the original data. The permutation model is:
- Meaningful
- Independent of the parameters of the masking mechanism, the risk assessment measures, and the characteristics of the data
11
Traditional view of microdata masking
Original Data → Masking Mechanism → Masked Data

ID   X    Masked
1    44   24.76
2    14   21.51
3    42   53.97
4    24   25.93
5    93   94.36
6    41   36.66
7    94   84.38
8    54   58.22
9    16   34.35
10   26   22.80
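The deck does not say which mechanism produced the masked column above, so as a hedged illustration only, here is one of the many mechanisms from the earlier list (additive Gaussian noise), applied to the X column. The function name, the noise standard deviation, and the seed are all assumptions for the sketch, not the authors' method:

```python
import random

def additive_noise_mask(data, sd=5.0, seed=1):
    """Illustrative masking mechanism: add zero-mean Gaussian noise.
    sd and seed are arbitrary choices for this sketch."""
    rng = random.Random(seed)
    return [round(x + rng.gauss(0.0, sd), 2) for x in data]

# The X column from the slide's example
original = [44, 14, 42, 24, 93, 41, 94, 54, 16, 26]
masked = additive_noise_mask(original)
print(masked)  # ten noisy values near the originals
```

Any of the other mechanisms (rounding, microaggregation, swapping, …) could be substituted here; the point of the talk is that they can all be analyzed through the same permutation lens.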
12
Reverse mapping
(Reverse) map the masked data back to the original data:
- Compute the rank of each masked value
- Replace the masked value with the value of the original data that has the same rank
  - The rank of the first masked observation (24.76) is 3
  - Replace this value with the value of X with rank 3 (i.e., 24)
- Repeat for all masked records
The reverse-mapped values represent the permuted version of the original data.

ID   X    Rank of X   Masked   Rank of Masked   Permuted (reverse mapped)
1    44   7           24.76    3                24
2    14   1           21.51    1                14
3    42   6           53.97    7                44
4    24   3           25.93    4                26
5    93   9           94.36    10               94
6    41   5           36.66    6                42
7    94   10          84.38    9                93
8    54   8           58.22    8                54
9    16   2           34.35    5                41
10   26   4           22.80    2                16
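The reverse-mapping procedure can be sketched in a few lines of Python, using the numbers from the slide's example (the quadratic rank lookup is fine for illustration; a real implementation would sort once):

```python
def reverse_map(original, masked):
    """Replace each masked value with the original value of the same rank."""
    sorted_original = sorted(original)
    sorted_masked = sorted(masked)
    # 0-based rank of each masked value within the masked column
    ranks = [sorted_masked.index(m) for m in masked]
    return [sorted_original[r] for r in ranks]

original = [44, 14, 42, 24, 93, 41, 94, 54, 16, 26]
masked = [24.76, 21.51, 53.97, 25.93, 94.36, 36.66, 84.38, 58.22, 34.35, 22.80]
print(reverse_map(original, masked))
# → [24, 14, 44, 26, 94, 42, 93, 54, 41, 16]  (the permuted column)
```

Note that the output contains exactly the original X values, only reordered: the masked data set determines a permutation of the original one.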
13
Permuted + Residual Noise = Masked

ID   X    Permuted   Noise    Masked
1    44   24         0.76     24.76
2    14   14         7.51     21.51
3    42   44         9.97     53.97
4    24   26         -0.07    25.93
5    93   94         0.36     94.36
6    41   42         -5.34    36.66
7    94   93         -8.62    84.38
8    54   54         4.22     58.22
9    16   41         -6.65    34.35
10   26   16         6.80     22.80
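The noise column is simply the element-wise difference between the masked and the permuted (reverse-mapped) values, so the decomposition on this slide can be checked directly:

```python
permuted = [24, 14, 44, 26, 94, 42, 93, 54, 41, 16]
masked = [24.76, 21.51, 53.97, 25.93, 94.36, 36.66, 84.38, 58.22, 34.35, 22.80]

# Residual noise is whatever remains after the permutation is accounted for
noise = [round(m - p, 2) for m, p in zip(masked, permuted)]
print(noise)  # → [0.76, 7.51, 9.97, -0.07, 0.36, -5.34, -8.62, 4.22, -6.65, 6.8]

# Decomposition check: permuted + noise reproduces the masked data
assert all(abs((p + n) - m) < 1e-9 for p, n, m in zip(permuted, noise, masked))
```
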
14
The permutation model
Any masking mechanism can be represented by the permutation model: the masked output from any microdata masking mechanism is conceptually viewed as (or functionally equivalent to) permutation plus residual noise. We are not suggesting a new masking mechanism.

ID   X    Permuted X   Noise    Masked
1    44   24           0.76     24.76
2    14   14           7.51     21.51
3    42   44           9.97     53.97
4    24   26           -0.07    25.93
5    93   94           0.36     94.36
6    41   42           -5.34    36.66
7    94   93           -8.62    84.38
8    54   54           4.22     58.22
9    16   41           -6.65    34.35
10   26   16           6.80     22.80
15
Magnitude of the residual noise

ID   X    Permuted X   Noise    Masked
1    44   24           0.76     24.76
2    14   14           7.51     21.51
3    42   44           9.97     53.97
4    24   26           -0.07    25.93
5    93   94           0.36     94.36
6    41   42           -5.34    36.66
7    94   93           -8.62    84.38
8    54   54           4.22     58.22
9    16   41           -6.65    34.35
10   26   16           6.80     22.80
16
For large data sets …
- Disclosure prevention is achieved primarily through permutation
- The residual noise provides additional (but small) masking to prevent the original values from being released
- With procedures such as swapping and shuffling, there is no residual noise, since the original values are released unmodified
17
The permutation model
Original Data → (Permute) → Permuted Data → (+ Residual Noise) → Masked Data
18
Protection level
A meaningful interpretation of protection:
- No permutation = no protection
- Randomly sorted data = maximum protection
This gives a simple, meaningful explanation of the protection actually achieved: the level of permutation resulting from the masking mechanism.
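The formal measure of permutation level is listed later as ongoing work, so as a purely hypothetical illustration of the idea "no permutation = no protection, random order = maximum protection", one could normalize the Spearman footrule (total absolute rank displacement) by its maximum; the function name and the choice of footrule are assumptions of this sketch, not the authors' measure:

```python
def permutation_level(original, masked):
    """Illustrative (not the authors') permutation measure: total absolute
    rank displacement (Spearman footrule), normalized so that
    0 = no permutation and 1 = maximal displacement.
    Assumes all values within each column are distinct."""
    n = len(original)
    rank_orig = {v: r for r, v in enumerate(sorted(original))}
    rank_mask = {v: r for r, v in enumerate(sorted(masked))}
    footrule = sum(abs(rank_orig[o] - rank_mask[m])
                   for o, m in zip(original, masked))
    return footrule / (n * n // 2)  # floor(n^2 / 2) is the footrule maximum

original = [44, 14, 42, 24, 93, 41, 94, 54, 16, 26]
masked = [24.76, 21.51, 53.97, 25.93, 94.36, 36.66, 84.38, 58.22, 34.35, 22.80]
print(permutation_level(original, masked))  # → 0.28 for the running example
```

An unpermuted release scores 0; a release whose ranks are reversed scores 1. Any rank-based distance (Kendall tau, Spearman rho, …) could play the same role.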
19
The adversary model
The permutation model also leads to a natural maximum-knowledge adversary:
- We assume that the adversary has the ability to perform reverse mapping on the masked data
- Reverse mapping can be performed if the adversary has access to the entire original data set
- This assumption is the same as that used in record linkage: the adversary has access to both the original and the masked data set (but not the individual record linkages)
- Consistent with Kerckhoffs's principle: the adversary knows everything but the “key”
20
Cryptographic equivalent
- Ciphertext-only: the adversary has access only to the ciphertext (i.e., the masked records)
- Known-plaintext: the adversary has access to plaintext/ciphertext pairs (i.e., pairs of original and masked records). This is our adversary model.
- Chosen-plaintext: the adversary can choose a plaintext (original records) and obtain the corresponding ciphertext (masked records)
- Chosen-ciphertext: the adversary can choose a ciphertext (masked records) and obtain the corresponding plaintext (original records)
In a non-interactive setting (microdata release), known-plaintext is the strongest possible attack.
21
Adversary with malicious intent
- One of the difficulties with microdata release has been the inability to distinguish between the user and the adversary
- A practical way of thinking about this adversary is that their intent is purely malicious: since the adversary has access to the entire original data set, they cannot learn anything new from the released data
- Our adversary model thus differentiates the malicious adversary (who does not learn anything from the released data) from the user (who learns something from the data)
22
Adversary model
The adversary is able to eliminate the residual noise through reverse mapping; the only protection against this adversary is permutation.

Data custodian: Original Data → (Permute, + Residual Noise) → Masked Data
Adversary: Masked Data + adversary’s copy of the original data → (Rank, Reverse Map) → Permuted Data
23
Adversary objective
The objective of the adversary is to break the “key”, i.e., to recreate the linkage between the original and permuted data. The adversary wishes to show provable linkage, since provable linkage eliminates plausible deniability.
24
Auxiliary information
One of the advantages of our adversary model is that it eliminates the need to consider auxiliary information. Our adversary already has maximum knowledge (access to the entire original data), so no auxiliary information (other than the random number seed) will help the adversary improve the linkage.
25
Important clarification
We are suggesting the adversary model as a benchmark for comparison: for risk assessment, NOT necessarily for risk mitigation.
26
Ongoing work
- Formalizing a measure of the permutation level
- Formalizing a measure of disclosure
- The multivariate scenario
27
Conclusion
- The permutation model offers a new approach for evaluating the efficacy and effectiveness of masking mechanisms
- It allows the data administrator to compare different masking mechanisms using the same benchmark
- More work remains
28
Questions, comments, or suggestions? Thank you