Output Perturbation with Query Relaxation By: XIAO Xiaokui and TAO Yufei Presenter: CUI Yingjie
Outlines Introduction & Motivation The Histogram Approach Query Relaxation Experiments Conclusion
Problem Settings Microdata: sensitive personal data held by an organization, e.g. medical records, transaction history. Often open to public access for reasons such as research.
Risk to Privacy An attacker knows Alice's age (20) and zipcode (15k). To infer Alice's income, s/he issues two queries:
q0: SELECT COUNT(*) FROM T WHERE Age ∈ [20, 20] AND Zipcode ∈ [15k, 15k] AND Income ∈ [80k, +∞)
q0': SELECT COUNT(*) FROM T WHERE Age ∈ [20, 20] AND Zipcode ∈ [15k, 15k] AND Income ∈ (-∞, 80k)
If exact answers are returned and Alice is the only tuple matching the age and zipcode, the two counts reveal whether her income is at least 80k. Table 1:
Solutions Output Perturbation: injecting small random noise into each query result. ε-Differential Privacy: let Q be the set of previously answered queries. Given a new query q, the database determines whether answering {q} ∪ Q would violate ε-differential privacy.
Output Perturbation Count Queries: SELECT COUNT(*) FROM T WHERE pred(A_1) AND ... AND pred(A_d), where each pred(A_i) has the form A_i = * or A_i ∈ [x_i, y_i]. Perturbed Answer: given a query q, the statistical database D returns the answer q(T) + δ, where δ is a random variable that follows the Laplace distribution with density $f(\delta) = \frac{1}{2\lambda} e^{-|\delta|/\lambda}$, and λ is the noise magnitude.
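The perturbation step can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the true count of 100, and λ = 2 are hypothetical, and the Laplace sample is drawn as the difference of two exponential variables, which is one standard way to sample it.

```python
import random


def laplace_noise(lam, rng):
    """Sample from the Laplace density f(d) = 1/(2*lam) * exp(-|d|/lam).

    The difference of two independent exponentials with mean lam
    follows exactly this distribution.
    """
    return rng.expovariate(1.0 / lam) - rng.expovariate(1.0 / lam)


def perturbed_count(true_count, lam, rng):
    """Return q(T) + delta for a count query with exact answer true_count."""
    return true_count + laplace_noise(lam, rng)


# Hypothetical values: exact count 100, noise magnitude lam = 2.
rng = random.Random(0)
samples = [perturbed_count(100, 2.0, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(round(mean, 2), round(var, 2))
```

Over many draws the sample mean should stay close to the exact count and the sample variance close to 2λ², which is the variance of a Laplace variable with magnitude λ.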
ε-Differential Privacy Sibling Tables: two microdata tables T_1 and T_2 that have the same schema and cardinality and differ in only one tuple, e.g. T_2 is obtained from T_1 by changing Alice's income from 85k to 30k. ε-Differential Privacy: Let Q = {q_1, ..., q_m} be any subset of the queries that have been answered by D, and R = {r_1, ..., r_m} be a set of arbitrary real numbers. D ensures ε-differential privacy if the following inequality holds for any R and any pair of sibling tables T_1 and T_2: $\Pr[\forall i,\ q_i(D) = r_i \mid \Delta_1] \le e^{\varepsilon} \cdot \Pr[\forall i,\ q_i(D) = r_i \mid \Delta_2]$, where Δ_i denotes the event that T_i is the table on which D is constructed.
ε-Differential Privacy: An Example A statistical database D is built on T_1. Q is the set of queries issued by an attacker, and S_rst is the set of results returned by D. Suppose D were instead constructed on another table T_2 in which Alice's income is arbitrarily modified; D may still return S_rst. ε-differential privacy requires $\Pr[D \text{ returns } S_{rst} \mid \text{Alice's income is NOT modified}] \le e^{\varepsilon} \cdot \Pr[D \text{ returns } S_{rst} \mid \text{Alice's income is modified}]$. For small ε, $e^{\varepsilon} \approx 1 + \varepsilon$, which is close to 1, so the results look almost equally likely in both cases. A smaller ε leads to better privacy.
Computation of ε-Differential Privacy L1 Sensitivity: given a set Q of queries, its L1 sensitivity is $S_{L1}(Q) = \max_{T_1, T_2} \sum_{q \in Q} |q(T_1) - q(T_2)|$, where T_1 and T_2 are any two sibling tables. An example: Q = {q_0, q_0'}, T_1 is Table 1, and T_2 changes Alice's income to 30k. We show that S_L1(Q) = 2. Upper bound: |q_0(T_1) - q_0(T_2)| <= 1 and |q_0'(T_1) - q_0'(T_2)| <= 1, so S_L1(Q) <= 2. Lower bound: q_0(T_1) = 1, q_0(T_2) = 0, q_0'(T_1) = 0, q_0'(T_2) = 1, so S_L1(Q) >= |1 - 0| + |0 - 1| = 2. Hence S_L1(Q) = 2.
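The lower-bound half of this example can be replayed on a toy table. The sketch below is illustrative only: Table 1 is not reproduced in these slides, so apart from Alice's row (age 20, zipcode 15k, income 85k or 30k) every tuple is a hypothetical placeholder, and the helper names are assumptions.

```python
INF = float("inf")


def count(table, age, zipcode, income_lo, income_hi):
    """COUNT(*) WHERE Age = age AND Zipcode = zipcode
    AND income_lo <= Income < income_hi."""
    return sum(1 for (a, z, inc) in table
               if a == age and z == zipcode and income_lo <= inc < income_hi)


# Tuples are (age, zipcode, income). Only the first row (Alice) follows
# the slides; the other rows are made up for illustration.
T1 = [(20, 15000, 85000), (30, 15000, 40000), (45, 20000, 60000)]
T2 = [(20, 15000, 30000)] + T1[1:]   # sibling table: Alice's income changed


def q0(table):
    return count(table, 20, 15000, 80000, INF)      # Income in [80k, +inf)


def q0_prime(table):
    return count(table, 20, 15000, -INF, 80000)     # Income in (-inf, 80k)


# L1 distance between the answer vectors on this one pair of siblings;
# it is a lower bound on S_L1(Q), which maximizes over all sibling pairs.
l1_distance = abs(q0(T1) - q0(T2)) + abs(q0_prime(T1) - q0_prime(T2))
print(l1_distance)  # 2
```

Because each count can change by at most 1 between siblings, no other pair of sibling tables can exceed this value for Q = {q_0, q_0'}.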
Computation of ε-Differential Privacy Theorem 1: A statistical database D ensures ε-differential privacy if and only if $S_{L1}(Q) \le \varepsilon\lambda$. Lemma 1: Deciding whether S_L1(Q) is larger than a given threshold is NP-hard. Proof: by a reduction from the maximum 2-satisfiability (MAX-2-SAT) problem. So the verification of ε-differential privacy is NP-hard.
Some Definitions Data Space/Query Region: We regard the data space Ω of a table T as a d-dimensional space, where the i-th dimension is A_i. The region of a query q is a rectangle r in Ω such that if q has a predicate "A_i ∈ [x_i, y_i]", the projection of r on A_i equals [x_i, y_i]. Popularity/Convergence: For any point p in the data space Ω, its popularity p(Q) is the number of query regions in Q that cover p. The convergence of Q, denoted C(Q), is the largest popularity p(Q) over all points p ∈ Ω.
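Popularity and convergence can be computed directly from these definitions. The following is a minimal sketch, assuming each query region is given as a tuple of closed (lo, hi) intervals, one per dimension; the three example regions are hypothetical. It relies on the fact that, for closed axis-aligned rectangles, some point of maximum popularity has every coordinate equal to the lower bound of one of the regions, so only those candidate points need to be checked.

```python
from itertools import product


def popularity(point, regions):
    """Number of query regions that cover the point."""
    return sum(1 for r in regions
               if all(lo <= x <= hi for x, (lo, hi) in zip(point, r)))


def convergence(regions):
    """Largest popularity of any point in the data space.

    Candidate points combine, per dimension, the lower bounds of the
    regions. The number of candidates grows as m^d for m regions in d
    dimensions, so this brute-force version suits small inputs only.
    """
    if not regions:
        return 0
    d = len(regions[0])
    lows = [sorted({r[i][0] for r in regions}) for i in range(d)]
    return max(popularity(p, regions) for p in product(*lows))


# Hypothetical 2-d regions over (Age, Income in thousands).
regions = [
    ((20, 50), (40, 70)),
    ((30, 60), (50, 90)),
    ((55, 80), (10, 45)),
]
print(popularity((30, 50), regions))  # 2
print(convergence(regions))           # 2
```

Here the first two regions overlap, while the third overlaps neither of them at a common point, so no point is covered three times.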
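The Upper Bound of S_L1(Q) Lemma 2: For any set Q of queries, $S_{L1}(Q) \le 2C(Q)$. Proof (sketch): two sibling tables differ in one tuple, which corresponds to a point p_1 in T_1 and a point p_2 in T_2. Only queries whose regions cover p_1 or p_2 can change their answers, each by at most 1, and at most C(Q) regions cover each of the two points. This bound motivates a simple approach to ensure ε-differential privacy: keep the popularity of every point at most λε/2, so that S_L1(Q) <= 2C(Q) <= λε.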
A Histogram Approach The above approach requires keeping a popularity value for every point, which is not practical. Instead, we maintain a histogram H that partitions the data space Ω into rectangular buckets. Each bucket B has a counter B.c that records the number of answered queries whose regions intersect it. As long as B.c <= λε/2 holds for every bucket, ε-differential privacy is preserved. If a new query intersects a bucket whose counter is greater than or equal to λε/2, it is rejected.
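The admission test on bucket counters can be sketched as follows. This is a simplified illustration, not the authors' implementation: the buckets are fixed up front (no splitting), the class and method names are hypothetical, and the data space is a one-dimensional integer Age domain chosen only to keep the example small.

```python
def intersects(a, b):
    """True if two closed rectangles, given as tuples of (lo, hi), overlap."""
    return all(alo <= bhi and blo <= ahi
               for (alo, ahi), (blo, bhi) in zip(a, b))


class Histogram:
    def __init__(self, buckets, limit):
        self.buckets = list(buckets)          # rectangles partitioning the space
        self.counters = [0] * len(self.buckets)
        self.limit = limit                    # lambda * epsilon / 2

    def try_admit(self, region):
        """Accept the query only if every intersected bucket stays within
        the limit; for an integer limit this matches rejecting when a
        counter is already >= limit."""
        hit = [i for i, b in enumerate(self.buckets) if intersects(b, region)]
        if any(self.counters[i] + 1 > self.limit for i in hit):
            return False
        for i in hit:
            self.counters[i] += 1
        return True


# Hypothetical setting: Age in [0, 99] split into two buckets,
# lambda = 4 and epsilon = 1, so the limit lambda * epsilon / 2 is 2.
hist = Histogram(buckets=[((0, 49),), ((50, 99),)], limit=2)
queries = [((10, 20),), ((30, 60),), ((40, 45),), ((70, 80),)]
results = [hist.try_admit(q) for q in queries]
print(results)        # [True, True, False, True]
print(hist.counters)  # [2, 2]
```

The third query is rejected even though no point is covered by more than two of the regions, because a bucket counter only over-estimates popularity. This conservativeness is what bucket splitting is meant to reduce.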
A Histogram Approach: Simple Split The histogram initially consists of a single bucket. When needed, a bucket B can be split into B' and B'' in a way that minimizes B'.c + B''.c. The maximum number of buckets θ is a system parameter. An example where the maximum permissible popularity λε/2 is 3:
A Histogram Approach: the Split Algorithm
Algorithm Split(B)   /* B is a bucket to be decomposed */
  U = the set of regions of the queries in Q that partially intersect B
  if U ≠ ∅
    remove B from H
    r∩ = the intersection of all the regions in U
    if r∩ = ∅
      split B into buckets B' and B'' with the minimum B'.c + B''.c, using the cutting lines that pass through the boundaries of the regions in U
    else
      repeatedly split B by the cutting lines that pass through the boundaries of r∩ until a bucket has extent r∩
    insert the new buckets into H with counters set to B.c
A Histogram Approach: A Complex Split Query q4: SELECT COUNT(*) FROM T WHERE Age = * AND Income ∈ [40000, 99999]
Limitation of Output Perturbation Volume of a query: the percentage of points in Ω that satisfy the query. Consider a solution that 1) ensures ε-differential privacy and 2) perturbs each answer with Laplace noise of magnitude λ, and let θ be the maximum number of queries that such a solution can process. If each query has a volume of at least s' and at most 1 - s' (0 < s' <= 1/2), then θ < λε / s'. For queries with volume in (0, 1), the solution can process at most n · λε queries.
Query Relaxation Once the maximum number of supported queries is reached, all new queries are rejected. Instead of simply refusing a query, we may return a useful synthetic answer that is derived from previously answered queries, so privacy is not violated. This process is called relaxation. An example:
q1': SELECT COUNT(*) FROM T WHERE Age ∈ [20, 51] AND Income ∈ [40K, 70K]
q1: SELECT COUNT(*) FROM T WHERE Age ∈ [20, 50] AND Income ∈ [40K, 70K]
Query Relaxation: Compound Two disjoint sets P+ and P- of queries constitute a compound P if 1) for each point p in Ω, p(P+) - p(P-) equals 0 or 1, and 2) all points p satisfying p(P+) - p(P-) = 1 form a rectangle r_diff, which is the difference region of P. A synthetic answer of P is calculated as $\sum_{q \in P^+} q(D) - \sum_{q \in P^-} q(D)$.
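The two compound conditions can be checked by brute force on a small integer grid. The sketch below is an illustration under assumptions: the function names, the Age/Income domain, the two query regions, and the perturbed answers 130.5 and 28.0 are all hypothetical, and enumerating every grid point is only feasible for tiny domains.

```python
from itertools import product


def popularity(point, regions):
    """Number of regions (tuples of closed (lo, hi) intervals) covering point."""
    return sum(1 for r in regions
               if all(lo <= x <= hi for x, (lo, hi) in zip(point, r)))


def difference_region(p_plus, p_minus, domain):
    """Return r_diff if (P+, P-) is a compound over the integer grid
    `domain`, otherwise None."""
    ones = []
    for p in product(*[range(lo, hi + 1) for lo, hi in domain]):
        diff = popularity(p, p_plus) - popularity(p, p_minus)
        if diff not in (0, 1):
            return None                 # condition 1 violated
        if diff == 1:
            ones.append(p)
    if not ones:
        return None
    box = tuple((min(p[i] for p in ones), max(p[i] for p in ones))
                for i in range(len(domain)))
    size = 1
    for lo, hi in box:
        size *= hi - lo + 1
    # Condition 2: the points with difference 1 must fill their bounding box.
    return box if size == len(ones) else None


def synthetic_answer(answers_plus, answers_minus):
    """Sum of perturbed answers in P+ minus sum of perturbed answers in P-."""
    return sum(answers_plus) - sum(answers_minus)


# Hypothetical domain: Age 20..60, Income 40..70 (in thousands).
domain = [(20, 60), (40, 70)]
p_plus = [((20, 60), (40, 70))]     # previously answered query
p_minus = [((51, 60), (40, 70))]    # previously answered query
r_diff = difference_region(p_plus, p_minus, domain)
print(r_diff)                                  # ((20, 50), (40, 70))
print(synthetic_answer([130.5], [28.0]))       # 102.5
```

In this toy case the difference region is Age ∈ [20, 50] and Income ∈ [40K, 70K], so the synthetic answer estimates the count over that rectangle without issuing a new query against the data.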
Relaxation Error Relaxation Error E(P,q) can be calculated using the formula below: Let Q be a set of accepted queries and P a compound. A query q ∈ Q but not in P is a positive (negative) patch if after including it in P + (P - ), 1) P remains a compound and 2) E(P, q*) decreases.
Artificial Patches We can dynamically generate a query, force the database to process it normally, and use its perturbed answer to obtain a better synthetic answer for the denied query. 2d artificial queries are generated, each of which aligns with a boundary of r_diff. Each of them is then checked to see whether it is a patch and whether answering it would violate ε-differential privacy.
Probabilistic Accuracy We return a synthetic answer $\sum_{q \in P^+} q(D) - \sum_{q \in P^-} q(D)$ as well as a relaxed query q*'. The synthetic answer has the expected value q*'(T), and its variance is $2\lambda^2 \cdot |P^+ \cup P^-|$, where λ is the noise magnitude. A tradeoff: more queries in P lower the relaxation error but increase the noise in the synthetic answer. So the user may specify an upper bound ξ on the size of a compound.
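The noise side of this tradeoff is simple arithmetic: each perturbed answer carries independent Laplace noise with variance 2λ², so the variances add up over the compound. The sketch below assumes a hypothetical noise magnitude λ = 10 and a hypothetical function name.

```python
import math


def synthetic_noise_std(lam, compound_size):
    """Standard deviation of a synthetic answer built from
    compound_size perturbed answers, i.e. sqrt(2 * lam^2 * |P+ u P-|)."""
    return math.sqrt(2.0 * lam ** 2 * compound_size)


# Hypothetical noise magnitude lam = 10; the compound size doubles each step.
for k in (1, 2, 4, 8):
    print(k, round(synthetic_noise_std(10.0, k), 2))
```

The standard deviation grows with the square root of the compound size, which is why bounding the size by ξ caps the noise a user has to tolerate.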
An Illustration of Relaxation
Experiment Settings Dataset: CENSUS. Computer: 3 GHz Pentium IV, 1 GB RAM. Parameters: Queries: SELECT COUNT(*) FROM CENSUS WHERE A_1 ∈ [x_1, y_1] AND A_2 ∈ [x_2, y_2]. The center z_i of the range [x_i, y_i] is chosen in two different ways: 1) Data: z_i = t[A_i], where t is a random tuple. 2) Uniform: z_i is a random value in the domain of A_i. The workload consists of 20K queries.
Experiment: Processing Capability Without Relaxation Two approaches: Disjoint: reject a query if its region intersects the region of any previously answered query. Histogram.
Experiment: Processing Capability Without Relaxation Effects of ε and s: The upper bound of capacity: n * λε. Queries with larger regions cause faster growth of C(Q).
Experiment: Quality of Relaxation Effects of compound size: A larger bound ξ on the compound size raises the chance of finding a good compound. The actual compound size can be well below ξ because of early termination.
Experiment: Quality of Relaxation Effects of ε: A greater ε allows more queries, thus a larger query set Q for relaxation, which enhances the relaxation quality.
Experiment: Quality of Relaxation Effects of s: Queries with larger regions cause faster growth of C(Q), which results in a smaller query set Q and a higher relaxation error.
Experiment: Computation Overhead A greater ε (s) results in a higher (lower) query processing capacity and a larger (smaller) query set Q. A greater ξ (θ) results in larger compounds (more buckets).
Conclusion & Future Work We propose a practical solution (the histogram) to ensure ε-differential privacy, and use query relaxation to overcome the limit on query processing capacity. Future work: apply the approach to other kinds of queries (SUM, MIN, MAX, etc.); consider updates to the database; handle other types of microdata besides relational tables.
THANKS