1
Towards Value Disclosure Analysis in Modeling General Databases
SAC’06, April 23-27, 2006, Dijon, France
Xintao Wu, UNC Charlotte
Songtao Guo, UNC Charlotte
Yingjiu Li, Singapore Management Univ.
2
Outline
  Motivation
  General Location Model
  Value Disclosure Analysis
    Basic disclosure scenario
    Conditional disclosure scenario
    Combinatorial disclosure scenario
  Conclusion and Future Work
3
Motivation
  Information disclosure in general databases
    Identity disclosure
    Value disclosure

       SSN  Name  Zip    Race   Age  Sex  Dividends  Wages  Interests
  1               28223  Asian  20   M    10k        85k    2k
  2               28223  Asian  30   F    15k        70k    18k
  3               28262  Black  20   M    50k        120k   35k
  ...
  n               28223  Asian  20   M    80k        110k   15k
4
Motivation
  Previous work
    Additive randomization approach
      Agrawal & Srikant, SIGMOD00; Agrawal & Aggarwal, PODS01
      Kargupta et al., ICDM03; Du et al., SIGMOD05
    Various methods from statistical databases
    Multiplicative rotation approach
      Chen et al., ICDM05; Kargupta et al., TKDE06
    Limitation
      Disclosure analysis is conducted on the data space
      Prone to potential attacks
  Our modeling-based approach
    First build an approximate statistical model
    Analyze disclosure on the parameter space
    Apply the model to generate data for future mining
5
Application
  Database application testing
    Testing on local development databases
      Only a small number of data samples
      Cannot conduct performance testing
    Testing against live production databases
      Privacy disclosure
      May incorrectly update the underlying databases
  Generate mock databases for application software testing such that the generated data is
    Valid
    Similar to the original data in terms of statistical distribution
    Privacy preserving
6
[Figure: system architecture. Recoverable labels: ER, Data, DDL, Catalog, Schema & Domain Filter, Schema’, Domain’, Disclosure Assessment, Performance Assessment, General Location Model, Data Generator, Synthetic database; edge labels R, NR, S.]
7
General Location Model

       SSN  Name  Zip    Race   Age  Sex  Dividends  Wages  Interests
  1               28223  Asian  20   M    10k        85k    2k
  2               28223  Asian  30   F    15k        70k    18k
  3               28262  Black  20   M    50k        120k   35k
  ...
  n               28223  Asian  20   M    80k        110k   15k

  Categorical attributes (multinomial distribution)
  Numerical attributes (multivariate Gaussian distributions)
8
General Location Model
  Given a dataset of n tuples with both categorical and numerical attributes:
    The categorical part can be summarized by a contingency table of cells.
    The number of tuples in each cell has a multinomial distribution.
    For each cell d, the numerical attributes follow a conditional multivariate normal distribution.
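One common textbook way to write the general location model is sketched below. All symbols (W, Z, p, q, d_j, D, x_d, π_d, μ_d, Σ) are assumptions, because the slide's own formulas did not survive extraction. In particular, the textbook form shares a single covariance matrix Σ across cells; the slides may instead use a cell-specific one.

```latex
\begin{align*}
W &= (W_1, \dots, W_p) \ \text{categorical}, \qquad Z = (Z_1, \dots, Z_q) \ \text{numerical}, \\
D &= \prod_{j=1}^{p} d_j \quad \text{cells, where } W_j \text{ takes } d_j \text{ distinct values}, \\
(x_1, \dots, x_D) &\sim \mathrm{Multinomial}(n;\ \pi_1, \dots, \pi_D), \qquad \sum_{d=1}^{D} \pi_d = 1, \\
Z \mid \text{cell } d &\sim N(\mu_d, \Sigma).
\end{align*}
```

Here x_d is the number of tuples that fall in cell d, so the x_d sum to n.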
9
Parameter Fitting
  The maximum likelihood estimates (MLE) of the parameters are computed from the sets of tuples belonging to each cell d.
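The fitting step can be sketched in a few lines of Python. This is a minimal illustration, assuming the textbook general location model with one covariance matrix pooled over all cells; the function name `fit_glm`, the row layout, and the choice of a pooled covariance are assumptions, not the authors' code. The sample rows reuse the example table (values in thousands).

```python
from collections import defaultdict

def fit_glm(rows):
    """rows: list of (cell_key, [numeric values]). Returns (pi, mu, sigma)."""
    n = len(rows)
    q = len(rows[0][1])
    cells = defaultdict(list)
    for key, z in rows:
        cells[key].append(z)
    # Cell probabilities: fraction of tuples that fall in each cell.
    pi = {d: len(zs) / n for d, zs in cells.items()}
    # Cell means: per-attribute average over the tuples of the cell.
    mu = {d: [sum(z[j] for z in zs) / len(zs) for j in range(q)]
          for d, zs in cells.items()}
    # Pooled covariance (assumption: shared by all cells); the MLE divides by n.
    sigma = [[0.0] * q for _ in range(q)]
    for d, zs in cells.items():
        for z in zs:
            r = [z[j] - mu[d][j] for j in range(q)]
            for a in range(q):
                for b in range(q):
                    sigma[a][b] += r[a] * r[b] / n
    return pi, mu, sigma

# Cell key = (Zip, Race, Age, Sex); numeric part = [Dividends, Wages, Interests].
rows = [
    (("28223", "Asian", 20, "M"), [10.0, 85.0, 2.0]),
    (("28223", "Asian", 30, "F"), [15.0, 70.0, 18.0]),
    (("28262", "Black", 20, "M"), [50.0, 120.0, 35.0]),
    (("28223", "Asian", 20, "M"), [80.0, 110.0, 15.0]),
]
pi, mu, sigma = fit_glm(rows)
```

Cells with a single tuple contribute zero residuals, so with this little data the covariance estimate is driven entirely by the one cell that holds two tuples.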
10
Value Disclosure
  Attackers may be able to estimate or infer the value of a certain confidential numerical attribute of an entity, or of a group of entities, with a level of accuracy higher than a threshold.
  All numerical attribute values are generated from a multivariate normal distribution, specifically from the distribution of the cell that the tuple belongs to.

  SSN  Name  Zip    Race   Age  Sex  Dividends  Wages  Interests
             28262  Asian  30   M
             ...
             28262  Asian  30   M
             28223  White  50   F
             28223  White  50   F
             ...
             28223  White  50   F
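Generating the numerical part of a synthetic tuple amounts to drawing from the multivariate normal of its cell. The sketch below does this with a hand-written Cholesky factorization so that it needs only the standard library; the helper names `cholesky` and `sample_cell` and all parameter values are hypothetical.

```python
import math
import random

def cholesky(a):
    """Lower-triangular L with L * L^T = a, for a symmetric positive definite a."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def sample_cell(mu, sigma, rng):
    """Draw one vector from N(mu, sigma) as mu + L * e, with e standard normal."""
    L = cholesky(sigma)
    e = [rng.gauss(0.0, 1.0) for _ in mu]
    return [mu[i] + sum(L[i][k] * e[k] for k in range(i + 1))
            for i in range(len(mu))]

rng = random.Random(0)  # fixed seed so the draw is repeatable
L = cholesky([[4.0, 2.0], [2.0, 3.0]])
sample = sample_cell([45.0, 97.5], [[4.0, 2.0], [2.0, 3.0]], rng)
```

Repeating `sample_cell` once per tuple of a cell fills in the blank numerical columns for that cell.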
11
Value Disclosure Analysis
  Basic disclosure scenario
    All numerical attributes are confidential
    The analysis is based on the probability density contour
    Disclosure is measured in terms of a confidence interval or confidence region
  Conditional scenario
    Non-confidential + confidential attributes
  Combinatorial scenario
    Linear combinations exist among both confidential and non-confidential attributes
12
Privacy Measure
  Confidence interval (Agrawal & Srikant, SIGMOD00)
    If the original value can be estimated with c% confidence to lie in the interval [a, b], then the interval width (b - a) defines the amount of privacy at the c% confidence level.
  Confidence region
    In the p-dimensional case, a c% confidence region is determined by the probability density contour of the data.
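For a single normally distributed attribute, the interval-width measure can be computed directly. The following is a small sketch using Python's standard `statistics.NormalDist`; the function name `privacy_interval` and the mean and standard deviation are made-up illustration values.

```python
from statistics import NormalDist

def privacy_interval(mu, sigma, c):
    """Central interval [a, b] that contains a N(mu, sigma^2) value with confidence c."""
    z = NormalDist().inv_cdf(0.5 + c / 2.0)  # two-sided standard normal quantile
    return mu - z * sigma, mu + z * sigma

# Hypothetical attribute: mean 85 (thousand), standard deviation 10 (thousand).
a, b = privacy_interval(mu=85.0, sigma=10.0, c=0.95)
width = b - a  # amount of privacy at the 95% confidence level
```

A wider interval means an attacker can pin the value down less precisely, so larger `width` corresponds to more privacy.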
13
Basic Disclosure Scenario
  Confidential attributes X ~ N(μ, Σ)
  The confidence region is a multidimensional ellipsoid; its projection on each axis z_i gives lower and upper bounds for z_i.
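For an ellipsoid of the form (z - μ)ᵀ Σ⁻¹ (z - μ) ≤ k², the projection on axis z_i is the interval μ_i ± k·sqrt(σ_ii), where σ_ii is the i-th diagonal entry of Σ. The sketch below applies this in the two-dimensional case, where the chi-square quantile k² has the closed form -2·ln(1 - c). The function name and the numbers are hypothetical, and the slide's own (lost) formula may be written differently.

```python
import math

def projection_bounds(mu, sigma, c):
    """Per-axis bounds of the c-confidence ellipse of a bivariate N(mu, sigma)."""
    if len(mu) != 2:
        raise ValueError("the closed-form quantile used here is only valid for p = 2")
    k2 = -2.0 * math.log(1.0 - c)  # chi-square quantile with 2 degrees of freedom
    bounds = []
    for i in range(2):
        half = math.sqrt(k2 * sigma[i][i])  # half-width depends only on sigma_ii
        bounds.append((mu[i] - half, mu[i] + half))
    return bounds

bounds = projection_bounds([0.0, 0.0], [[4.0, 1.0], [1.0, 9.0]], c=0.95)
```

The off-diagonal covariance changes the orientation of the ellipse but not its projections, which is why only the diagonal entries appear in the bounds.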
14
Basic Disclosure Scenario
  Measuring privacy: heuristic method
    Use a hyper-rectangle to approximate the ellipsoid
    Measure privacy for one dimension
    Adjust parameters
  [Figure: original interval, dissimilarity constraint (d), new interval]
15
Conditional Scenario
  Confidential attributes (X) and non-confidential attributes (S)
  E.g., the non-confidential values of Dividends and Wages can help predict the confidential values of Interests
  Same method, applied with the conditional parameters of X given S
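Under joint normality, the conditional parameters follow the standard formulas μ_X|S = μ_X + Σ_XS Σ_SS⁻¹ (s - μ_S) and Σ_X|S = Σ_XX - Σ_XS Σ_SS⁻¹ Σ_SX. The sketch below evaluates them for one confidential attribute and two non-confidential ones, mirroring the Interests given Dividends and Wages example. The function name and every number are hypothetical, and the notation is an assumption since the slide's formula was lost.

```python
def conditional_params(mu_x, mu_s, var_x, cov_xs, cov_ss, s):
    """Mean and variance of scalar X given the observed 2-vector S = s."""
    (a, b), (c, d) = cov_ss
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # inverse of the 2x2 Sigma_SS
    # w = Sigma_SS^{-1} Sigma_SX: regression weights of X on S
    w = [inv[0][0] * cov_xs[0] + inv[0][1] * cov_xs[1],
         inv[1][0] * cov_xs[0] + inv[1][1] * cov_xs[1]]
    mean = mu_x + sum(w[i] * (s[i] - mu_s[i]) for i in range(2))
    var = var_x - sum(cov_xs[i] * w[i] for i in range(2))
    return mean, var

mean, var = conditional_params(
    mu_x=10.0, mu_s=[1.0, 2.0], var_x=5.0,
    cov_xs=[2.0, 3.0], cov_ss=[[4.0, 0.0], [0.0, 9.0]], s=[3.0, 5.0],
)
```

The conditional variance is never larger than the unconditional one, so knowing S can only narrow the confidence interval for X.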
16
Combinatorial Scenario

  Race   Age  Sex  Dividends  Wages  Interests  Total Income
  Asian  20   M    10k        85k    2k         97k
  Asian  30   F    15k        70k    18k        103k
  Black  20   M    50k        120k   35k        205k

  Many potential combinations exist, e.g. Dividends + Wages + Interests = Total Income
  Even if the level of security provided for a single confidential attribute is adequate, the level of security provided for linear combinations of confidential attributes could be very low.
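One way to see why a combination can be less protected than its parts: the variance of a linear combination aᵀX is aᵀΣa, so negatively correlated attributes can sum to something far more predictable than either attribute alone. The sketch below uses a made-up covariance matrix; `combo_variance` is a hypothetical helper, not part of the presented method.

```python
def combo_variance(a, sigma):
    """Variance of the linear combination a^T X when X has covariance sigma."""
    n = len(a)
    return sum(a[i] * sigma[i][j] * a[j] for i in range(n) for j in range(n))

# Hypothetical covariance: each attribute has variance 100, correlation -0.9.
sigma = [[100.0, -90.0], [-90.0, 100.0]]
var_single = combo_variance([1.0, 0.0], sigma)  # first attribute on its own
var_sum = combo_variance([1.0, 1.0], sigma)     # sum of the two attributes
```

With these numbers the sum has one fifth of the variance of a single attribute, so a confidence interval for the sum is much narrower than one for either component.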
17
Combinatorial Scenario
  Canonical Correlation Analysis (CCA)
    A statistical procedure used to identify and quantify the relationship between two sets of variables, S and X.
    CCA can identify the linear combination of variables in one set, X, that has the highest correlation with a linear combination of variables in the other set, S.
    It can be used to evaluate the level of security when linear combinations of the confidential attributes, X, are estimated from the non-confidential attributes, S.
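In the standard formulation, the squared canonical correlations are the eigenvalues of Σ_XX⁻¹ Σ_XS Σ_SS⁻¹ Σ_SX. When S is a single attribute, that matrix has rank one and its only nonzero eigenvalue is Σ_SX Σ_XX⁻¹ Σ_XS / σ_SS. The sketch below computes this special case for two confidential attributes; the function name and the covariance values are hypothetical.

```python
def largest_canonical_eigenvalue(cov_xx, cov_xs, var_s):
    """Largest squared canonical correlation between a 2-vector X and a scalar S."""
    (a, b), (c, d) = cov_xx
    det = a * d - b * c
    # w = Sigma_XX^{-1} Sigma_XS: weights of the combination of X most correlated with S
    w = [(d * cov_xs[0] - b * cov_xs[1]) / det,
         (-c * cov_xs[0] + a * cov_xs[1]) / det]
    lam1 = (cov_xs[0] * w[0] + cov_xs[1] * w[1]) / var_s
    return lam1, w

lam1, w = largest_canonical_eigenvalue(
    cov_xx=[[4.0, 0.0], [0.0, 9.0]], cov_xs=[2.0, 3.0], var_s=5.0
)
```

The returned weights `w` describe the linear combination of the confidential attributes that S predicts best, and `lam1` is the share of that combination's variance that S explains.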
18
Combinatorial Scenario
  Canonical Correlation Analysis (CCA)
    λ_1: the most general measure of inferential value disclosure for any combination
    1 − λ_1: the worst-case security
    λ_1 ≤ λ, for a threshold λ: no combinatorial disclosure exists
  Adjust parameters
    If λ_i > λ, then set λ_i = λ, keeping the other eigenvalues and the eigenvectors unchanged
    Obtain new parameters from the adjusted eigenvalues; the adjustment is posed as an optimization problem
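The eigenvalue rule on this slide can be sketched as a simple clipping step. The code below covers only the thresholding and the worst-case security 1 − λ_1; rebuilding the covariance parameters from the clipped eigenvalues (the optimization problem) is not shown. The function names and the eigenvalues are hypothetical.

```python
def clip_eigenvalues(eigs, threshold):
    """Set every eigenvalue above the threshold to the threshold; leave the rest unchanged."""
    return [min(lam, threshold) for lam in eigs]

def worst_case_security(eigs):
    """1 - lambda_1, where lambda_1 is the largest eigenvalue."""
    return 1.0 - max(eigs)

eigs = [0.9, 0.4, 0.1]  # hypothetical eigenvalues, largest first
clipped = clip_eigenvalues(eigs, threshold=0.5)
security_before = worst_case_security(eigs)
security_after = worst_case_security(clipped)
```

After clipping, no eigenvalue exceeds the threshold, so the worst-case security is at least one minus the threshold.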
19
Conclusion
  Propose a model-based privacy-preserving approach
  Investigate value disclosure in three scenarios
20
Future Work
  How to conduct individual value disclosure analysis when individual privacy intervals are specified
  How the information loss due to modeling affects the utility of generated data
21
Acknowledgement
  NSF Grants CCR-0310974, IIS-0546027
  Personnel
    Xintao Wu, Songtao Guo, UNC Charlotte
    Yingjiu Li, Singapore Management Univ.
  More info
    http://www.cs.uncc.edu/~xwu/
    xwu@uncc.edu
22
Questions? Thank you!