OptRR: Optimizing Randomized Response Schemes For Privacy-Preserving Data Mining. Zhengli Huang and Wenliang (Kevin) Du, Department of EECS, Syracuse University
Data Mining/Analysis: Data cannot be published directly because of privacy concerns.
Background: Randomized Response. Question: Do you smoke? The true answer is “Yes”. The respondent flips a biased coin: on heads, the reported answer is Yes; on tails, it is No.
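The coin-flip scheme on this slide can be simulated in a few lines. The sketch below is a minimal illustration, assuming the coin lands heads (truthful report) with probability p and that the analyst knows p. The function names, the value p = 0.7, and the 30% base rate are hypothetical choices made for the example, not values from the slides. The estimator inverts the relation lambda = p*pi + (1-p)*(1-pi) between the observed "Yes" rate lambda and the true "Yes" rate pi.

```python
import random

def randomized_response(truth: bool, p: float, rng: random.Random) -> bool:
    """Report the true answer with probability p, otherwise the opposite."""
    return truth if rng.random() < p else not truth

def estimate_true_fraction(reports, p):
    """Recover the true 'Yes' rate pi from the observed 'Yes' rate lambda."""
    if abs(2 * p - 1) < 1e-12:
        raise ValueError("p = 0.5 carries no information about the truth")
    lam = sum(reports) / len(reports)
    return (lam - (1 - p)) / (2 * p - 1)

rng = random.Random(0)          # fixed seed so the run is repeatable
p = 0.7                         # hypothetical coin bias
true_answers = [rng.random() < 0.3 for _ in range(100_000)]
reports = [randomized_response(t, p, rng) for t in true_answers]
estimate = estimate_true_fraction(reports, p)
print("true rate:", sum(true_answers) / len(true_answers))
print("estimated rate:", estimate)
```

No single report reveals the respondent's true answer with certainty, yet the aggregate rate is still recoverable. A bias closer to 0.5 gives more privacy and a noisier estimate.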
RR for Categorical Data. [Figure: a true value Si is reported as Si, Si+1, Si+2, or Si+3 with probabilities q1, q2, q3, q4; these probabilities make up the RR matrix M.]
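For categorical data, the coin is replaced by a matrix of reporting probabilities. The sketch below assumes that column c of M holds the distribution of the reported value when the true value is c, so each column sums to 1. The specific 4x4 matrix, the distribution P, and the helper names are hypothetical. The estimate uses plain matrix inversion (solving M x = P*), which is one common estimator and not necessarily the one used in the talk.

```python
import random

def perturb(value, M, rng):
    """Report category r with probability M[r][value]."""
    k = len(M)
    weights = [M[r][value] for r in range(k)]
    return rng.choices(range(k), weights=weights)[0]

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

# Hypothetical RR matrix: keep the true value with probability 0.7,
# otherwise report one of the other three categories uniformly.
M = [[0.7 if r == c else 0.1 for c in range(4)] for r in range(4)]
P = [0.4, 0.3, 0.2, 0.1]        # hypothetical true distribution

rng = random.Random(0)
n = 50_000
true_values = rng.choices(range(4), weights=P, k=n)
reported = [perturb(v, M, rng) for v in true_values]

# Empirical distribution of the reported values, then invert M.
P_star = [reported.count(r) / n for r in range(4)]
P_hat = solve(M, P_star)
print([round(x, 3) for x in P_hat])
```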
A Generalization. Several RR matrices have been proposed [Warner 65], [R. Agrawal et al. 05], [S. Agrawal et al. 05]. An RR matrix can be arbitrary. Can we find optimal RR matrices?
What is an optimal matrix? Which of the following is better? Privacy: M2 is better. Utility: M1 is better. So, what is an optimal matrix?
Optimal RR Matrix. An RR matrix M is optimal if no other RR matrix’s privacy and utility are both better than M’s (i.e., no other matrix dominates M). Privacy and Utility Quantification: a number of privacy and utility metrics have been proposed. We use the following. Privacy: how accurately one can estimate individual information. Utility: how accurately we can estimate aggregate information.
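The dominance relation behind this definition is easy to state in code. The sketch below assumes each matrix has already been scored as a pair (privacy_cost, utility_error), where lower is better for both, and it uses the standard Pareto convention (no worse in either objective, strictly better in at least one). The scores for M1 and M2 are made-up numbers that only reproduce the situation from the previous slide.

```python
def dominates(a, b):
    """True if score a is no worse than b in both objectives and better in one.

    Scores are (privacy_cost, utility_error) pairs; lower is better for both.
    """
    no_worse = a[0] <= b[0] and a[1] <= b[1]
    strictly_better = a[0] < b[0] or a[1] < b[1]
    return no_worse and strictly_better

# Hypothetical scores: M2 has better privacy, M1 has better utility.
m1 = (0.8, 0.1)
m2 = (0.3, 0.4)
m3 = (0.9, 0.5)   # worse than M1 on both objectives

print(dominates(m1, m2), dominates(m2, m1))  # neither dominates the other
print(dominates(m1, m3))                     # M1 dominates M3
```

Under this convention M1 and M2 are both candidates for the optimal set, which is why a single "best" matrix does not exist in general.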
Optimization Methods. Approach 1: Weighted sum, w1 Privacy + w2 Utility. Approach 2: Fix Privacy and find M with the optimal Utility, or fix Utility and find M with the optimal Privacy. Challenge: it is difficult to generate M with a fixed privacy or utility. Our Approach: Multi-Objective Optimization.
Evolutionary Multi-Objective Optimization (EMOO). Genetic algorithms have difficulty dealing with multiple objectives, so we use an EMOO algorithm: SPEA2.
Our SPEA2-based algorithm
EMOO Evolution: Fitness Assignment (SPEA2), Crossover, Mutation. Strength value S(M): the number of matrices dominated by M. Raw fitness F’(M): the sum of the strengths of the RR matrices that dominate M; the lower, the better. Density d(M): discriminates between matrices with the same raw fitness.
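The strength and raw-fitness definitions above can be computed directly from pairwise dominance. The sketch below is a simplified illustration: it treats all candidates as one pool (SPEA2 itself counts over the population together with its archive), it omits the density term d(M), and it assumes each matrix is represented only by a hypothetical (privacy_cost, utility_error) score where lower is better.

```python
def dominates(a, b):
    """Standard Pareto dominance for scores where lower is better."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def strengths(scores):
    """S(M): how many candidates each candidate dominates."""
    return [sum(dominates(a, b) for b in scores) for a in scores]

def raw_fitness(scores):
    """F'(M): sum of the strengths of all candidates that dominate M."""
    s = strengths(scores)
    return [
        sum(s[j] for j, b in enumerate(scores) if dominates(b, a))
        for a in scores
    ]

# Hypothetical scores for five candidate matrices.
scores = [(1, 4), (2, 2), (4, 1), (3, 3), (5, 5)]
print(strengths(scores))
print(raw_fitness(scores))
```

Candidates with raw fitness 0 are the non-dominated ones. Among those, a density measure such as d(M) is what separates candidates with equal fitness.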
Diversity. [Figure: matrices M1 through M5 plotted in the Privacy-Utility objective space, with the axis ends labeled Better and Worse.]
The Output of Optimization: Pareto Fronts. The optimal set is often plotted in the objective space, and the plot is called the Pareto front. [Figure: Pareto front, with Privacy and Utility (error) as the axes.]
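Given the scores of a final population, the points that make up a Pareto front can be extracted with a sort and a single sweep. The sketch below is a minimal illustration for exactly two objectives, assuming both are expressed as costs (lower is better). The function name and sample points are hypothetical and are not taken from the experiments.

```python
def pareto_front(points):
    """Return the non-dominated (privacy_cost, utility_error) points.

    After sorting by privacy_cost, a point is on the front only if its
    utility_error is strictly lower than that of every point before it.
    """
    front = []
    best_utility = float("inf")
    for privacy_cost, utility_error in sorted(set(points)):
        if utility_error < best_utility:
            front.append((privacy_cost, utility_error))
            best_utility = utility_error
    return front

points = [(1, 4), (2, 2), (4, 1), (3, 3), (5, 5)]
print(pareto_front(points))
```

The result is ordered by privacy cost, so it can be passed directly to a plotting routine to draw the front.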
Experiments: for normal distributions with different δ
For the first attribute of the Adult data
For normal distribution (δ=0.75)
Summary: We use an evolutionary multi-objective optimization technique to search for optimal RR matrices. The evaluation shows that our scheme achieves better performance than the existing RR schemes.