Slide 1: Privacy-MaxEnt: Integrating Background Knowledge in Privacy Quantification
Wenliang (Kevin) Du, Zhouxuan Teng, and Zutao Zhu
Department of Electrical Engineering & Computer Science, Syracuse University, Syracuse, New York
Slide 2: Introduction
- Privacy-preserving data publishing.
- The impact of background knowledge: how does it affect privacy, and how do we measure that impact?
- Integrating background knowledge into privacy quantification.
- Privacy-MaxEnt: a systematic approach based on well-established theories.
- Evaluation.
Slide 3: Privacy-Preserving Data Publishing
- Data disguise methods: randomization, generalization (e.g., Mondrian), and bucketization (e.g., Anatomy).
- Privacy-MaxEnt can be applied to both generalization and bucketization; this presentation uses bucketization.
Slide 4: Data Sets
Each record has three parts: an Identifier, Quasi-Identifier (QI) attributes, and a Sensitive Attribute (SA). (The example table appeared as a figure.)
Slide 5: Bucketized Data
Within a bucket, the linkage between QI values and SA values is hidden, so an attacker can only infer conditional probabilities, for example:
- P( Breast cancer | {female, college}, bucket = 1 ) = 1/4
- P( Breast cancer | {female, junior}, bucket = 2 ) = 1/3
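The conditional probabilities on this slide can be reproduced by counting SA values within a bucket. The following sketch uses hypothetical records (the buckets, QI tuples, and disease names are illustrative, not the paper's data set), chosen so the counts match the two probabilities above:

```python
from collections import Counter

# Hypothetical bucketized release: (bucket, QI tuple, SA value) per record.
# Inside a bucket the QI-to-SA linkage is hidden, so the attacker can only
# count how often each SA value occurs in the bucket containing a given QI.
records = [
    (1, ("female", "college"), "Breast cancer"),
    (1, ("female", "college"), "Flu"),
    (1, ("male", "college"), "Flu"),
    (1, ("male", "college"), "Diabetes"),
    (2, ("female", "junior"), "Breast cancer"),
    (2, ("female", "junior"), "Flu"),
    (2, ("male", "junior"), "Diabetes"),
]

def p_sa_given_qi(records, qi, bucket, sa):
    """Estimate P(SA = sa | QI = qi, bucket) with no background knowledge.

    The qi argument only mirrors the notation: since the linkage is hidden,
    every QI in the bucket gets the same SA distribution.
    """
    in_bucket = [r for r in records if r[0] == bucket]
    sa_counts = Counter(r[2] for r in in_bucket)
    return sa_counts[sa] / len(in_bucket)

p1 = p_sa_given_qi(records, ("female", "college"), 1, "Breast cancer")  # 1/4
p2 = p_sa_given_qi(records, ("female", "junior"), 2, "Breast cancer")   # 1/3
```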
Slide 6: Impact of Background Knowledge
- Background knowledge: it is rare for a male to have breast cancer. An attacker can use this fact to sharpen the estimates within each bucket.
- Carrying out this analysis by hand is hard for large data sets.
Slide 7: Previous Studies
- Martin et al., ICDE '07: the first formal study on background knowledge.
- Chen, LeFevre, and Ramakrishnan, VLDB '07: improves on the previous work.
- Both deal with rule-based, deterministic knowledge; real background knowledge can be much more complicated, including uncertain knowledge.
Slide 8: Complicated Background Knowledge
- Rule-based knowledge: P(s | q) = 1, or P(s | q) = 0.
- Probability-based knowledge: P(s | q) = 0.2, or P(s | Alice) = 0.2.
- Vague background knowledge: 0.3 ≤ P(s | q) ≤ 0.5.
- Miscellaneous types: P(s | q1) + P(s | q2) = 0.7, or "one of Alice and Bob has Lung Cancer."
Slide 9: Challenges
- How can we analyze privacy systematically for large data sets and complicated background knowledge? Directly computing P(S | Q) is hard.
- What do we want to compute? P(S | Q), given the background knowledge and the published data set; P(S | Q) is the primitive underlying most privacy metrics.
Slide 10: Our Approach
- Treat P(S | Q) as a vector of variables x.
- Derive constraints on x from both sources of public information: the background knowledge and the published data.
- Solve for x, choosing the most unbiased solution that satisfies all constraints.
Slide 11: Maximum Entropy Principle
"Information theory provides a constructive criterion for setting up probability distributions on the basis of partial knowledge, and leads to a type of statistical inference which is called the maximum entropy estimate. It is the least biased estimate possible on the given information." — E. T. Jaynes, 1957.
Slide 12: The MaxEnt Approach
- Convert the background knowledge and the published data into constraints on P(S | Q).
- Take the maximum entropy estimate of P(S | Q) subject to those constraints.
Slide 13: Entropy
Because H(S | Q, B) = H(Q, S, B) − H(Q, B), the constraints should use P(Q, S, B) as the variables, where B denotes the bucket.
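The identity on this slide is the standard chain rule for conditional entropy; written out term by term:

```latex
H(S \mid Q, B)
  = -\sum_{q,s,b} P(q,s,b)\,\log P(s \mid q,b)
  = -\sum_{q,s,b} P(q,s,b)\,\log \frac{P(q,s,b)}{P(q,b)}
  = H(Q,S,B) - H(Q,B),
```

since the second term collapses to \(\sum_{q,b} P(q,b)\log P(q,b) = -H(Q,B)\). Maximizing H(Q, S, B) over variables P(Q, S, B) therefore also controls the conditional entropy of interest.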
Slide 14: Maximum Entropy Estimate
- Let the vector x = P(Q, S, B).
- Find the value of x that maximizes the entropy H(Q, S, B) while satisfying:
  - equality constraints h1(x) = c1, …, hu(x) = cu;
  - inequality constraints g1(x) ≤ d1, …, gv(x) ≤ dv.
- This is a special case of non-linear programming.
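In symbols, the estimate on this slide is the constrained program:

```latex
\max_{x}\; H(x) = -\sum_{q,s,b} x_{q,s,b}\,\log x_{q,s,b}
\quad \text{s.t.} \quad
h_i(x) = c_i \;(i=1,\dots,u), \qquad
g_j(x) \le d_j \;(j=1,\dots,v),
```

together with the implicit probability constraints \(\sum_{q,s,b} x_{q,s,b} = 1\) and \(x \ge 0\).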
Slide 15: Constraints from Knowledge
- The linear model is quite generic.
- Conditional probability: P(s | q) = P(q, s) / P(q), so a knowledge item P(s | q) = c becomes the linear constraint P(q, s) − c · P(q) = 0.
- Background knowledge has nothing to do with the bucket B, so marginalize: P(Q, S) = P(Q, S, B=1) + … + P(Q, S, B=m).
- Result: background knowledge becomes linear constraints on P(Q, S, B).
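The two steps above (clearing the denominator of the conditional probability, then marginalizing over buckets) can be sketched as a function that turns one knowledge item into one row of a linear constraint system. The index space and values here are hypothetical, purely for illustration:

```python
from itertools import product

# Hypothetical variable space: x = P(q, s, b) flattened to a vector.
qs, sas, buckets = ["q1", "q2"], ["s1", "s2"], [1, 2]
var_index = {v: i for i, v in enumerate(product(qs, sas, buckets))}

def knowledge_row(q0, s0, c):
    """Encode P(s0 | q0) = c as a linear row a . x = 0 over x = P(Q, S, B).

    P(s0 | q0) = P(q0, s0) / P(q0); clearing the denominator and summing
    out the bucket B gives:
        sum_b x[q0, s0, b]  -  c * sum_{s, b} x[q0, s, b]  =  0
    """
    row = [0.0] * len(var_index)
    for b in buckets:
        row[var_index[(q0, s0, b)]] += 1.0      # P(q0, s0) term
    for s, b in product(sas, buckets):
        row[var_index[(q0, s, b)]] -= c         # -c * P(q0) term
    return row

row = knowledge_row("q1", "s1", 0.2)  # one equality constraint, RHS = 0
```

Every knowledge type on Slide 8 that is linear in the probabilities (equalities, bounds, sums) can be encoded the same way, with inequality rows for the vague knowledge.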
Slide 16: Constraints from Published Data
- From the published data set D', derive constraints on P(Q, S, B).
- These constraints must be the truth and only the truth: absolutely correct for the original data set, with no inference mixed in.
Slide 17: Assignment and Constraints
- Observation: the original data set is one of the possible assignments of SA values to QI values within each bucket.
- Therefore a valid constraint must hold for all possible assignments.
Slide 18: QI Constraint
- For each QI value q in bucket b, summing over all sensitive values must match q's observed frequency: Σs P(q, s, b) equals the fraction of records in bucket b that carry q.
- Example: if q appears k times among the N published records, all in bucket b, then Σs P(q, s, b) = k / N.
Slide 19: SA Constraint
- Symmetrically, for each sensitive value s in bucket b, summing over all QI values must match s's observed frequency: Σq P(q, s, b) equals the fraction of records in bucket b that carry s.
- Example: if s appears k times among the N published records, all in bucket b, then Σq P(q, s, b) = k / N.
Slide 20: Zero Constraint
- P(q, s, b) = 0 if q or s does not appear in bucket b.
- These zero constraints let us drop variables and shrink the optimization problem.
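The zero constraint doubles as a pruning rule: instead of adding an equation per zero, simply never create those variables. A minimal sketch with hypothetical bucket contents:

```python
from itertools import product

# Hypothetical bucket contents: QI and SA values that actually occur per bucket.
bucket_qis = {1: {"q1", "q2"}, 2: {"q3"}}
bucket_sas = {1: {"s1", "s2"}, 2: {"s2", "s3"}}

def live_variables(bucket_qis, bucket_sas):
    """Keep only P(q, s, b) with q and s both present in bucket b; every other
    variable is fixed to zero by the zero constraint and never enters the
    optimization."""
    return [
        (q, s, b)
        for b in bucket_qis
        for q, s in product(bucket_qis[b], bucket_sas[b])
    ]

live = live_variables(bucket_qis, bucket_sas)
full = 3 * 3 * 2  # all (q, s, b) combinations without pruning: 18 variables
```

Here pruning cuts 18 candidate variables down to 6, and the savings grow with the number of buckets, since each bucket contains only a small subset of all QI and SA values.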
Slide 21: Theoretic Properties
- Soundness: are the constraints correct? Easy to prove.
- Completeness: have we missed any constraint? See our theorems and proofs.
- Conciseness: are there redundant constraints? Only one redundant constraint per bucket.
- Consistency: does our approach agree with existing methods when the background knowledge is empty (Ø)?
Slide 22: Completeness w.r.t. Equations
- Have we missed any equality constraint? Yes, but only redundant ones: if F1 = C1 and F2 = C2 are constraints, then F1 + F2 = C1 + C2 is also a constraint, yet it is redundant.
- Completeness theorem: let U be our constraint set; every valid linear equality constraint can be written as a linear combination of the constraints in U.
Slide 23: Completeness w.r.t. Inequalities
- Have we missed any inequality constraint? Yes, but again only redundant ones: if F = C holds, then F ≤ C + 0.2 is also valid, yet redundant.
- Completeness theorem: our constraint set is also complete in the inequality sense.
Slide 24: Putting Them Together
- Convert the background knowledge and the published data into constraints, then compute the maximum entropy estimate of P(S | Q).
- Off-the-shelf solvers: L-BFGS, TOMLAB, KNITRO, etc.
Slide 25: Inevitable Questions
- Where do we get the background knowledge? Do we have to be extremely knowledgeable?
- For P(s | q)-type knowledge, all useful knowledge is already in the original data set, in the form of association rules:
  - positive: Q → S
  - negative: Q → ¬S, ¬Q → S, ¬Q → ¬S
- In our study we bound the knowledge by the top-K strongest association rules.
Slide 26: Knowledge about Individuals
Suppose Alice, Bob, and Charlie correspond to records (i1, q1), (i4, q2), and (i9, q5).
- Knowledge 1: Alice has either s1 or s4. Constraint: P(s1 | Alice) + P(s4 | Alice) = 1.
- Knowledge 2: two people among Alice, Bob, and Charlie have s4. Constraint: P(s4 | Alice) + P(s4 | Bob) + P(s4 | Charlie) = 2.
Slide 27: Evaluation
- Implementation: Lagrange multipliers turn the constrained optimization into an unconstrained one, which we solve with L-BFGS.
- Platform: Pentium 3 GHz CPU with 4 GB of memory.
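To make the Lagrange-multiplier step concrete, here is a toy MaxEnt solver, not the paper's implementation: one mean constraint plus normalization, so the multiplier gives the familiar Gibbs form x[i] ∝ exp(λ·f[i]), and a bisection on λ replaces L-BFGS. All names and values are illustrative:

```python
import math

def maxent_with_mean(f, c, lo=-50.0, hi=50.0, iters=100):
    """Maximize -sum(x log x) subject to sum(x) = 1 and sum(f[i] * x[i]) = c.

    By Lagrange multipliers the optimum is x[i] proportional to
    exp(lam * f[i]); since the constrained mean increases monotonically
    in lam, bisection finds the multiplier.
    """
    def mean(lam):
        w = [math.exp(lam * fi) for fi in f]
        z = sum(w)
        return sum(fi * wi for fi, wi in zip(f, w)) / z

    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if mean(mid) < c:
            lo = mid          # need a larger multiplier
        else:
            hi = mid
    lam = (lo + hi) / 2.0
    w = [math.exp(lam * fi) for fi in f]
    z = sum(w)
    return [wi / z for wi in w]

# With c equal to the midpoint of f, MaxEnt returns the uniform distribution,
# i.e., lam = 0: the least biased estimate consistent with the constraint.
x = maxent_with_mean([0.0, 1.0, 2.0], 1.0)
```

The real problem has many linear constraints, one multiplier each, which is why a quasi-Newton method such as L-BFGS is used instead of bisection.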
Slide 28: Privacy versus Knowledge
Estimation accuracy: the KL distance between P_MaxEnt(S | Q) and P_Original(S | Q).
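The accuracy metric is a straightforward computation; a minimal sketch (the eps guard against log 0 is my addition, not specified on the slide):

```python
import math

def kl_distance(p, q, eps=1e-12):
    """KL divergence D(p || q) = sum_i p[i] * log(p[i] / q[i]).

    Here p would be the original P(S | Q) and q the MaxEnt estimate;
    0 means the estimate recovers the original distribution exactly.
    eps guards against log(0) for zero-probability entries.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

d_exact = kl_distance([0.25, 0.75], [0.25, 0.75])  # identical distributions
d_off = kl_distance([0.9, 0.1], [0.5, 0.5])        # estimate misses badly
```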
Slide 29: Privacy versus # of QI Attributes (results figure)
Slide 30: Performance vs. Knowledge (results figure)
Slide 31: Running Time vs. Data Size (results figure)
Slide 32: Iterations vs. Data Size (results figure)
Slide 33: Conclusion
- Privacy-MaxEnt is a systematic method: it models various types of knowledge and the information in the published data, and it rests on well-established theory.
- Future work: reducing the number of constraints, handling vague background knowledge, and handling background knowledge about individuals.