1
Microdata masking as permutation
Krish Muralidhar, Price College of Business, University of Oklahoma
Josep Domingo-Ferrer, UNESCO Chair in Data Privacy, Dept. of Computer Engineering and Mathematics, Universitat Rovira i Virgili
2
Diversity of microdata masking mechanisms
A wide variety of microdata masking mechanisms are available:
- Rounding
- Microaggregation
- Noise infusion
  - Additive
  - Multiplicative
  - Model based
- Data swapping
- Data shuffling
- And many others
3
Diversity is good, but … Diversity in mechanisms also means that comparing across mechanisms can be very difficult
4
Traditional approaches for comparison
- Based on parameters
- Based on performance
5
Comparison based on parameters
A syntactic approach. It has been criticized because it does not reflect the true security offered by the mechanism:
- Difficulty of comparing across different mechanisms: how do you compare microaggregation with aggregation parameter = 5, noise addition with 10% of the variance, and multiplicative perturbation with parameter = 10%?
- Difficulty of comparing across data sets, even for the same mechanism: two data sets with different characteristics may yield completely different levels of protection for the same parameter selection.
6
Impact of data characteristics
[Table: two data sets masked with the same masking values; columns for ID, data set 1, data set 2, masking value, masked data set 1, masked data set 2, and the ranks of the two masked data sets. The numeric entries are illegible in this copy, but the point of the example is that identical masking values produce different rank permutations on the two data sets.]
7
Comparison based on performance
Analyze the masked data for disclosure risk:
- Identity disclosure
- Value disclosure
Comparison is based on results. There are many different approaches for assessing identity and value disclosure. One alternative is to aggregate the different measures, but this raises further questions about how to aggregate.
8
Example of comparison based on performance
An empirical evaluation: Score = 0.5(IL) + 0.125(DLD) + 0.125(PLD) + 0.25(ID)
While this is a reasonable approach, it can be argued that the weights should be different:
- If we remove (or modify the weight of) a criterion, the results may be different
- If we use only disclosure risk measures, the results would be different
- What about alternative measures of information loss and disclosure risk?
These are typical problems with any empirically based evaluation.
Source: “A Quantitative Comparison of Disclosure Control Methods for Microdata” in Confidentiality, Disclosure, and Data Access
9
Desiderata
A common basis of comparison for microdata masking mechanisms that is:
- Applicable to all mechanisms
- Meaningful
- Independent of the parameters of the mechanism, the risk assessment measure, and the characteristics of the data
10
Our proposal: The permutation model
All microdata masking mechanisms can be viewed as permutations of the original data. The permutation model is:
- Meaningful
- Independent of the parameters of the masking mechanism, the risk assessment measures, and the characteristics of the data
11
Traditional view of microdata masking
Original Data → Masking Mechanism → Masked Data

ID   X    Masked
1    44   24.76
2    14   21.51
3    42   53.97
4    24   25.93
5    93   94.36
6    41   36.66
7    94   84.38
8    54   58.22
9    16   34.35
10   26   22.80
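The deck does not say which mechanism produced the masked column above, so as a hedged illustration only, here is one of the many mechanisms from the earlier list (additive Gaussian noise), applied to the X column. The function name, the noise standard deviation, and the seed are all assumptions for the sketch, not the authors' method:

```python
import random

def additive_noise_mask(data, sd=5.0, seed=1):
    """Illustrative masking mechanism: add zero-mean Gaussian noise.
    sd and seed are arbitrary choices for this sketch."""
    rng = random.Random(seed)
    return [round(x + rng.gauss(0.0, sd), 2) for x in data]

# The X column from the slide's example
original = [44, 14, 42, 24, 93, 41, 94, 54, 16, 26]
masked = additive_noise_mask(original)
print(masked)  # ten noisy values near the originals
```

Any of the other mechanisms (rounding, microaggregation, swapping, …) could be substituted here; the point of the talk is that they can all be analyzed through the same permutation lens.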
12
Reverse mapping
(Reverse) map the masked data back to the original data:
- Compute the rank of each masked value
- Replace the masked value with the value of the original data that has the same rank
  - The rank of the first masked observation (24.76) is 3
  - Replace this value with the value of X with rank 3 (i.e., 24)
- Repeat for all masked records
The reverse-mapped values represent the permuted version of the original data.

ID   X    Rank of X   Masked   Rank of Masked   Permuted (reverse mapped)
1    44   7           24.76    3                24
2    14   1           21.51    1                14
3    42   6           53.97    7                44
4    24   3           25.93    4                26
5    93   9           94.36    10               94
6    41   5           36.66    6                42
7    94   10          84.38    9                93
8    54   8           58.22    8                54
9    16   2           34.35    5                41
10   26   4           22.80    2                16
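The reverse-mapping procedure can be sketched in a few lines of Python, using the numbers from the slide's example (the quadratic rank lookup is fine for illustration; a real implementation would sort once):

```python
def reverse_map(original, masked):
    """Replace each masked value with the original value of the same rank."""
    sorted_original = sorted(original)
    sorted_masked = sorted(masked)
    # 0-based rank of each masked value within the masked column
    ranks = [sorted_masked.index(m) for m in masked]
    return [sorted_original[r] for r in ranks]

original = [44, 14, 42, 24, 93, 41, 94, 54, 16, 26]
masked = [24.76, 21.51, 53.97, 25.93, 94.36, 36.66, 84.38, 58.22, 34.35, 22.80]
print(reverse_map(original, masked))
# → [24, 14, 44, 26, 94, 42, 93, 54, 41, 16]  (the permuted column)
```

Note that the output contains exactly the original X values, only reordered: the masked data set determines a permutation of the original one.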
13
Permuted + Residual Noise = Masked

ID   X    Permuted   Noise    Masked
1    44   24         0.76     24.76
2    14   14         7.51     21.51
3    42   44         9.97     53.97
4    24   26         -0.07    25.93
5    93   94         0.36     94.36
6    41   42         -5.34    36.66
7    94   93         -8.62    84.38
8    54   54         4.22     58.22
9    16   41         -6.65    34.35
10   26   16         6.80     22.80
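The noise column is simply the element-wise difference between the masked and the permuted (reverse-mapped) values, so the decomposition on this slide can be checked directly:

```python
permuted = [24, 14, 44, 26, 94, 42, 93, 54, 41, 16]
masked = [24.76, 21.51, 53.97, 25.93, 94.36, 36.66, 84.38, 58.22, 34.35, 22.80]

# Residual noise is whatever remains after the permutation is accounted for
noise = [round(m - p, 2) for m, p in zip(masked, permuted)]
print(noise)  # → [0.76, 7.51, 9.97, -0.07, 0.36, -5.34, -8.62, 4.22, -6.65, 6.8]

# Decomposition check: permuted + noise reproduces the masked data
assert all(abs((p + n) - m) < 1e-9 for p, n, m in zip(permuted, noise, masked))
```
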
14
The permutation model
Any masking mechanism can be represented by the permutation model: the masked output from any microdata masking mechanism is conceptually viewed as (or functionally equivalent to) permutation plus residual noise. We are not suggesting a new masking mechanism.

ID   X    Permuted X   Noise    Masked
1    44   24           0.76     24.76
2    14   14           7.51     21.51
3    42   44           9.97     53.97
4    24   26           -0.07    25.93
5    93   94           0.36     94.36
6    41   42           -5.34    36.66
7    94   93           -8.62    84.38
8    54   54           4.22     58.22
9    16   41           -6.65    34.35
10   26   16           6.80     22.80
15
Magnitude of the residual noise

ID   X    Permuted X   Noise    Masked
1    44   24           0.76     24.76
2    14   14           7.51     21.51
3    42   44           9.97     53.97
4    24   26           -0.07    25.93
5    93   94           0.36     94.36
6    41   42           -5.34    36.66
7    94   93           -8.62    84.38
8    54   54           4.22     58.22
9    16   41           -6.65    34.35
10   26   16           6.80     22.80
16
For large data sets …
- Disclosure prevention is achieved primarily through permutation
- The residual noise provides additional (but small) masking to prevent the original values from being released
- With procedures such as swapping and shuffling, there is no residual noise, since the original values are released unmodified
17
The permutation model
Original Data → (Permute) → Permuted Data → (+ Residual Noise) → Masked Data
18
Protection level
A meaningful interpretation of protection:
- No permutation = no protection
- Randomly sorted data = maximum protection
This gives a simple, meaningful explanation of the protection actually achieved: the level of permutation resulting from the masking mechanism.
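The formal measure of permutation level is listed later as ongoing work, so as a purely hypothetical illustration of the idea "no permutation = no protection, random order = maximum protection", one could normalize the Spearman footrule (total absolute rank displacement) by its maximum; the function name and the choice of footrule are assumptions of this sketch, not the authors' measure:

```python
def permutation_level(original, masked):
    """Illustrative (not the authors') permutation measure: total absolute
    rank displacement (Spearman footrule), normalized so that
    0 = no permutation and 1 = maximal displacement.
    Assumes all values within each column are distinct."""
    n = len(original)
    rank_orig = {v: r for r, v in enumerate(sorted(original))}
    rank_mask = {v: r for r, v in enumerate(sorted(masked))}
    footrule = sum(abs(rank_orig[o] - rank_mask[m])
                   for o, m in zip(original, masked))
    return footrule / (n * n // 2)  # floor(n^2 / 2) is the footrule maximum

original = [44, 14, 42, 24, 93, 41, 94, 54, 16, 26]
masked = [24.76, 21.51, 53.97, 25.93, 94.36, 36.66, 84.38, 58.22, 34.35, 22.80]
print(permutation_level(original, masked))  # → 0.28 for the running example
```

An unpermuted release scores 0; a release whose ranks are reversed scores 1. Any rank-based distance (Kendall tau, Spearman rho, …) could play the same role.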
19
The adversary model
The permutation model also leads to a natural maximum-knowledge adversary:
- We assume that the adversary has the ability to perform reverse mapping on the masked data
- Reverse mapping can be performed if the adversary has access to the entire original data set
- This assumption is the same as that used in record linkage: the adversary has access to both the original and the masked data set (but not the individual record linkages)
- Consistent with Kerckhoffs's principle: the adversary knows everything but the “key”
20
Cryptographic equivalent
- Ciphertext-only: the adversary has access only to the ciphertext (i.e., the masked records)
- Known-plaintext: the adversary has access to plaintext/ciphertext pairs (i.e., pairs of original and masked records). This is our adversary model.
- Chosen-plaintext: the adversary can choose a plaintext (original records) and obtain the corresponding ciphertext (masked records)
- Chosen-ciphertext: the adversary can choose a ciphertext (masked records) and obtain the corresponding plaintext (original records)
In a non-interactive setting (microdata release), known-plaintext is the strongest possible attack.
21
Adversary with malicious intent
- One of the difficulties with microdata release has been the inability to distinguish between the user and the adversary
- A practical way of thinking about this adversary is that their intent is purely malicious: since the adversary has access to the entire original data set, they cannot learn anything new from the released data
- Our adversary model thus differentiates the malicious adversary (who does not learn anything from the released data) from the user (who learns something from the data)
22
Adversary model
The adversary is able to eliminate the residual noise through reverse mapping; the only protection against this adversary is permutation.

Data custodian: Original Data → (Permute, + Residual Noise) → Masked Data
Adversary: Masked Data + adversary’s copy of the original data → (Rank, Reverse Map) → Permuted Data
23
Adversary objective
The objective of the adversary is to break the “key”, i.e., to recreate the linkage between the original and permuted data. The adversary wishes to show provable linkage, since provable linkage eliminates plausible deniability.
24
Auxiliary information
One of the advantages of our adversary model is that it eliminates the need to consider auxiliary information. Our adversary already has maximum knowledge (access to the entire original data), so no auxiliary information (other than the random number seed) will help the adversary improve the linkage.
25
Important clarification
We are suggesting the adversary model as a benchmark for comparison: for risk assessment, NOT necessarily for risk mitigation.
26
Ongoing work
- Formalizing a measure of the permutation level
- Formalizing a measure of disclosure
- The multivariate scenario
27
Conclusion
- The permutation model offers a new approach for evaluating the efficacy and effectiveness of masking mechanisms
- It allows the data administrator to compare different masking mechanisms using the same benchmark
- More work remains
28
Questions, comments, or suggestions? Thank you