Privacy Preserving Data Mining: An Overview and Examination of Euclidean Distance Preserving Data Transformation Chris Giannella cgiannel AT acm DOT org.

1 Privacy Preserving Data Mining: An Overview and Examination of Euclidean Distance Preserving Data Transformation Chris Giannella cgiannel AT acm DOT org

2 Talk Outline ● Introduction – Privacy preserving data mining – what problem is it aimed to address? – Focus of this talk: data transformation ● Some data transformation approaches ● My current research: Euclidean distance preserving data transformation ● Wrap-up summary

3 An Example Problem ● The U.S. Census Bureau collects lots of data ● If released in raw form, this data would provide – a wealth of valuable information regarding broad population patterns – access to private information regarding individuals → How to allow analysts to extract population patterns without learning private information?

4 Privacy-Preserving Data Mining “The study of how to produce valid mining models and patterns without disclosing private information.” - F. Giannotti and F. Bonchi, “Privacy Preserving Data Mining,” KDUbiq Summer School, 2006. Several broad approaches … this talk → data transformation (the “census model”)

5 Data Transformation (the “Census Model”) [Figure labels: Private Data, Transformed Data, Data Miner, Researcher]

6 DT Objectives ● Minimize risk of disclosing private information ● Maximize the analytical utility of the transformed data ● DT is also studied in the field of Statistical Disclosure Control.

7 Some things DT does not address… ● Preventing unauthorized access to the private data (e.g. hacking). ● Securely communicating private data. → DT and cryptography are quite different. (Moreover, standard encryption does not solve the DT problem.)

8 Assessing Transformed Data Utility How accurately does a transformation preserve certain kinds of patterns, e.g.: ● data mean, covariance ● Euclidean distance between data records ● Underlying generating distribution? How useful are the patterns at drawing conclusions/inferences?

9 Assessing Privacy Disclosure Risk ● Some efforts in the literature to develop rigorous definitions of disclosure risk – no widely accepted agreement ● This talk will take an ad-hoc approach: – for a specific attack, how closely can any private data record be estimated?

10 Talk Outline ● Introduction – Privacy preserving data mining – what problem is it aimed to address? – Focus of this talk: data transformation ● Some data transformation approaches ● My current research: Euclidean distance preserving data transformation ● Wrap-up summary

11 Some DT approaches ● Discussed in this talk: – Additive independent noise – Euclidean distance preserving transformation (my current research) ● Others: – Data swapping/shuffling, multiplicative noise, micro-aggregation, K-anonymization, replacement with synthetic data, etc.

12 Additive Independent Noise For each private data record, $(x_1, \ldots, x_n)$, add independent random noise to each entry: – $(y_1, \ldots, y_n) = (x_1 + e_1, \ldots, x_n + e_n)$ – $e_i$ is generated independently as $N(0, d \cdot \mathrm{Var}(i))$, with $\mathrm{Var}(i)$ the variance of attribute $i$ – Increasing $d$ reduces privacy disclosure risk
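The noise scheme on this slide can be sketched in a few lines. This is only an illustrative sketch, not the speaker's code: the function name `add_independent_noise`, the synthetic data, and the choice to store one record per row are all assumptions made for the example.

```python
import numpy as np

def add_independent_noise(X, d, rng):
    """Return Y = X + E, where column i of E is drawn from N(0, d * Var(column i)).

    Assumed layout: one private record per row, one attribute per column.
    """
    X = np.asarray(X, dtype=float)
    attr_var = X.var(axis=0)              # Var(i) for each attribute i
    noise_std = np.sqrt(d * attr_var)     # standard deviation of e_i
    E = rng.normal(loc=0.0, scale=noise_std, size=X.shape)
    return X + E

rng = np.random.default_rng(0)            # fixed seed, for illustration only
# Hypothetical private data: 1,000 records with 3 attributes.
X = rng.normal(loc=[50.0, 10.0, 5.0], scale=[20.0, 3.0, 1.0], size=(1000, 3))
Y = add_independent_noise(X, d=0.5, rng=rng)

# The empirical noise variance per attribute should be close to d * Var(i).
ratio = (Y - X).var(axis=0) / X.var(axis=0)
print(ratio)
```

Scaling the noise by each attribute's own variance means a single parameter d controls the relative amount of perturbation across attributes with very different ranges.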

13 Additive Independent Noise [Figure: example with d = 0.5]

14 Additive Independent Noise ● Difficult to set d so that it produces both low privacy disclosure risk and high data utility ● Some enhancements on the basic idea exist, e.g. Muralidhar et al.

15 Talk Outline ● Introduction – Privacy preserving data mining – what problem is it aimed to address? – Focus of this talk: data transformation ● Some data transformation approaches ● My current research: Euclidean distance preserving data transformation (EDPDT) ● Wrap-up summary

16 EDPDT – High Data Utility! ● Many data clustering algorithms use Euclidean distance to group records, e.g. – K-means clustering, hierarchical agglomerative clustering ● If Euclidean distance is accurately preserved, these algorithms will produce the same clusters on the transformed data as on the original data.
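A small numerical check of the distance-preservation claim, assuming (as in the later Y = MX slides) that the transformation is multiplication by an orthogonal matrix M and that each column is one record. The helper names `random_orthogonal` and `pairwise_distances` and the random data are hypothetical, introduced only for this sketch.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Draw an n x n orthogonal matrix via QR of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.normal(size=(n, n)))
    return Q * np.sign(np.diag(R))   # flip column signs; Q stays orthogonal

def pairwise_distances(X):
    """Euclidean distances between all pairs of columns of X."""
    diff = X[:, :, None] - X[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 6))   # hypothetical data: 6 records (columns), 4 attributes
M = random_orthogonal(4, rng)
Y = M @ X                     # transformed data

# All pairwise distances agree up to floating-point error.
print(np.allclose(pairwise_distances(X), pairwise_distances(Y)))
```

Because the full distance matrix is unchanged, any algorithm whose output depends only on pairwise Euclidean distances sees the same input before and after the transformation.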

17 EDPDT – High Data Utility! [Figure panels: Original data, Transformed data]

18 EDPDT – Unclear Privacy Disclosure Risk ● Focus of the research. Approach: – Develop attacks combining the transformed data with plausible prior knowledge. – How well can these attacks estimate private data records?

19 Two Different Prior Knowledge Assumptions ● Known input: The attacker knows a small subset of the private data records. – Focus of this talk. ● Known sample: The attacker knows a set of data records drawn independently from the same underlying distribution as the private data records. – happy to discuss “off-line”.

20 Known Input Prior Knowledge Underlying assumption: individuals know a) whether there is a record for them among the private data records, and b) the attributes of the private data records. → Each individual knows one private record. → A small group of malicious individuals could cooperate to produce a small subset of the private data records.

21 Known Input Attack Given: $\{Y_1, \ldots, Y_m\}$ (transformed data records) and $\{X_1, \ldots, X_k\}$ (known private data records). 1) Determine the transformation constraints, i.e. which transformed records came from which known private records. 2) Choose $T$ randomly from the set of all distance preserving transformations that satisfy the constraints. 3) Apply $T^{-1}$ to the transformed data.

22 Known Input Attack – 2D data, 1 known private data record

23 Known Input Attack – General Case $Y = MX$ ● Each column of $X$ ($Y$) is a private (transformed) data record. ● $M$ is an orthogonal matrix. ● $[Y_{\text{known}} \; Y_{\text{unknown}}] = M [X_{\text{known}} \; X_{\text{unknown}}]$ ● Attack: choose $T$ randomly from $\{T \text{ an orthogonal matrix} : T X_{\text{known}} = Y_{\text{known}}\}$. Produce $T^{-1} Y_{\text{unknown}}$.
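One way the general-case attack could be realized is sketched below. It is an assumption-laden illustration, not the speaker's implementation: it assumes the known records are linearly independent and already matched to their transformed counterparts, and the names `random_orthogonal` and `known_input_attack` are hypothetical. The idea is that T is forced on the span of the known records and is free (here, a random orthogonal map) on the orthogonal complement.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Draw an n x n orthogonal matrix via QR of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.normal(size=(n, n)))
    return Q * np.sign(np.diag(R))

def known_input_attack(Y_known, Y_unknown, X_known, rng):
    """Pick a random orthogonal T with T @ X_known = Y_known.

    Returns (T, T^{-1} @ Y_unknown). Columns are records; the columns of
    X_known are assumed to be linearly independent.
    """
    n, k = X_known.shape
    # On span(X_known) the constraint fixes T completely.
    T = Y_known @ np.linalg.pinv(X_known)
    if k < n:
        # Orthonormal bases of the complements of span(X_known) and span(Y_known).
        Vx = np.linalg.qr(X_known, mode="complete")[0][:, k:]
        Vy = np.linalg.qr(Y_known, mode="complete")[0][:, k:]
        # T is unconstrained there: map one complement onto the other at random.
        T = T + Vy @ random_orthogonal(n - k, rng) @ Vx.T
    return T, T.T @ Y_unknown     # T^{-1} = T^T for an orthogonal T

rng = np.random.default_rng(2)
n, m, k = 4, 10, 2                # hypothetical sizes: 4 attributes, 10 records
X = rng.normal(size=(n, m))
M = random_orthogonal(n, rng)
Y = M @ X

T, X_est = known_input_attack(Y[:, :k], Y[:, k:], X[:, :k], rng)
print(np.allclose(T @ T.T, np.eye(n)), np.allclose(T @ X[:, :k], Y[:, :k]))

# With n linearly independent known records the constraints pin T down to M.
T_full, X_full = known_input_attack(Y[:, :n], Y[:, n:], X[:, :n], rng)
print(np.allclose(X_full, X[:, n:]))
```

With fewer than n known records the estimate is exact only for the component of each unknown record lying in the span of the known ones, which is consistent with the attack's success depending on k.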

24 Known Input Attack -- Experiments 18,000-record, 16-attribute real data set. Given $k$ known private data records, computed $P_k$, the probability that the attack estimates one unknown private record with > 85% accuracy. $P_2 = 0.16$, $P_4 = 1$, …, $P_{16} = 1$

25 Wrap-Up Summary ● Introduction – Privacy preserving data mining – what problem is it aimed to address? – Focus of this talk: data transformation ● Some data transformation approaches ● My current research: Euclidean distance preserving data transformation

26 Thanks to … ● You: – for your attention ● Kun Liu: – joint research & some material used in this presentation ● Krish Muralidhar: – some material used in this presentation ● Hillol Kargupta: – joint research

27 Distance Preserving Perturbation [Figure labels: Attributes, Records]

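28 Distance Preserving Perturbation: $Y = MX$ (columns are the records with ID 1001 and 1002; rows are Wages, Rent, Tax)

$$
\underbrace{\begin{bmatrix} -26{,}326 & -22{,}613 \\ -94{,}502 & -80{,}324 \\ 10{,}151 & 8{,}432 \end{bmatrix}}_{Y}
=
\underbrace{\begin{bmatrix} -0.2507 & 0.4556 & -0.8542 \\ -0.9653 & -0.0514 & 0.2559 \\ 0.0726 & 0.8887 & 0.4527 \end{bmatrix}}_{M}
\times
\underbrace{\begin{bmatrix} 98{,}563 & 83{,}821 \\ 1{,}889 & 1{,}324 \\ 2{,}899 & 2{,}578 \end{bmatrix}}_{X}
$$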
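The numbers on this slide can be checked numerically. The sketch below assumes a particular reading of the scrambled slide layout: columns are records 1001 and 1002, rows are Wages, Rent, Tax, and the nine matrix entries form M row by row as written in the code. Because M is shown to four decimals, the checks use loose tolerances.

```python
import numpy as np

# Assumed reading of the slide: M as a 3 x 3 matrix, rounded to four decimals.
M = np.array([[-0.2507,  0.4556, -0.8542],
              [-0.9653, -0.0514,  0.2559],
              [ 0.0726,  0.8887,  0.4527]])
# Columns are records 1001 and 1002; rows are Wages, Rent, Tax.
X = np.array([[98563.0, 83821.0],
              [ 1889.0,  1324.0],
              [ 2899.0,  2578.0]])
Y = np.array([[-26326.0, -22613.0],
              [-94502.0, -80324.0],
              [ 10151.0,   8432.0]])

# M is orthogonal up to rounding of its entries.
print(np.allclose(M @ M.T, np.eye(3), atol=1e-3))
# M @ X reproduces Y up to a few units of rounding error.
print(np.allclose(M @ X, Y, atol=10.0))

# The distance between the two records is (nearly) the same before and after.
dist_x = np.linalg.norm(X[:, 0] - X[:, 1])
dist_y = np.linalg.norm(Y[:, 0] - Y[:, 1])
print(dist_x, dist_y)
```

The individual attribute values change beyond recognition while the distance between the two records stays essentially the same.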

29 Known Sample Attack

30 Known Sample Attack Experiments (backup) Fig.: Known sample attack for the Adult data with 32,561 private tuples. The attacker has a 2% sample from the same distribution. The average relative error of the recovered data is 0.1081 (10.81%).
